Tokenizers

character_tokenizer

class deep_qa.data.tokenizers.character_tokenizer.CharacterTokenizer(params: deep_qa.common.params.Params)[source]

Bases: deep_qa.data.tokenizers.tokenizer.Tokenizer

A CharacterTokenizer splits strings into character tokens.

Notes

In the code, we’re still using the “words” namespace and the “num_sentence_words” padding key, instead of a separate “characters” namespace. This is so that the rest of the code doesn’t have to change just to use this different tokenizer. For example, this matters when adding start and stop tokens - how is an Instance class supposed to know whether it should use the “words” or the “characters” namespace when getting a start token id? If we just always use the “words” namespace for the top-level token namespace, it’s not an issue.

But confusingly, we’ll still use the “characters” embedding key... At least the user-facing parts all use “characters”; it’s only when writing tokenizer code that you need to be careful about namespaces. TODO(matt): it probably makes sense to change the default namespace to “tokens”, and to use that for both the words in WordTokenizer and the characters in CharacterTokenizer, so the naming isn’t so confusing.
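As a toy illustration of the behavior described above, here is a standalone sketch of what character tokenization produces. This is a hypothetical helper, not the actual class, and it assumes whitespace characters are kept as tokens like any other:

```python
# A minimal standalone sketch of character tokenization (hypothetical
# helper; the real CharacterTokenizer also manages namespaces and padding).
# We assume whitespace characters become tokens like any other character.
def character_tokenize(text: str) -> list:
    return list(text)

print(character_tokenize("the cat"))
# ['t', 'h', 'e', ' ', 'c', 'a', 't']
```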

embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')[source]

Applies embedding layers to the input_layer. See TextTrainer._embed_input for a more detailed comment on what this method does.

Parameters:

input_layer: Keras ``Input()`` layer

The layer to embed.

embed_function: Callable[[‘Layer’, str, str], ‘Tensor’]

This should be the __get_embedded_input method from your instantiated TextTrainer. This function actually applies an Embedding layer (and maybe also a projection and dropout) to the input layer.

text_trainer: TextTrainer

Simple Tokenizers will just need to use the embed_function that gets passed as a parameter here, but complex Tokenizers might need more than just an embedding function. So that you can get an encoder or other things from the TextTrainer here if you need them, we take this object as a parameter.

embedding_suffix: str, optional (default="")

A suffix to add to embedding keys that we use, so that, e.g., you could specify several different word embedding matrices, for whatever reason.

get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int][source]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.

get_sentence_shape(sentence_length: int, word_length: int) → typing.Tuple[int][source]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).
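The two cases described above can be sketched as a tiny standalone helper (hypothetical, for illustration only - not the library's method):

```python
# A toy sketch of the two shape cases: words-only or characters-only
# sequences are one-dimensional; a word-plus-character encoding adds a
# word_length axis. This is a hypothetical helper, not deep_qa's method.
def sentence_shape(sentence_length, word_length=None):
    if word_length is None:
        return (sentence_length,)
    return (sentence_length, word_length)

print(sentence_shape(10))     # (10,)
print(sentence_shape(10, 5))  # (10, 5)
```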

get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]][source]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either ‘words’ or ‘characters’. An example for indexing the string ‘the’ might be {‘words’: [‘the’], ‘characters’: [‘t’, ‘h’, ‘e’]}, if you are indexing both words and characters.

index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[source]

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.
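For the list-of-integers case, the idea can be sketched with a toy vocabulary. The id values and the OOV fallback here are hypothetical; the real DataIndexer also handles frequency filtering and namespaces:

```python
# A toy sketch of indexing characters (hypothetical vocabulary and ids;
# the real DataIndexer also does frequency filtering and namespaces).
vocab = {'t': 2, 'h': 3, 'e': 4}
OOV = 1  # assumed out-of-vocabulary id

def index_text(text):
    return [vocab.get(ch, OOV) for ch in text]

print(index_text("the"))  # [2, 3, 4]
print(index_text("thx"))  # [2, 3, 1] -- 'x' falls back to the OOV id
```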

tokenize(text: str) → typing.List[str][source]

Actually splits the string into a sequence of tokens. Note that this will only give you top-level tokenization! If you’re using a word-and-character tokenizer, for instance, this will only return the word tokenization.

tokenizer

class deep_qa.data.tokenizers.tokenizer.Tokenizer(params: deep_qa.common.params.Params)[source]

Bases: object

A Tokenizer splits strings into sequences of tokens that can be used in a model. The “tokens” here could be words, characters, or words and characters. The Tokenizer object handles various things involved with this conversion, including getting a list of tokens for pre-computing a vocabulary, getting the shape of a word sequence in a model, etc. The Tokenizer needs to handle these things because the tokenization you do could affect the shape of word sequence tensors in the model (e.g., a sentence could have shape (num_words,), (num_characters,), or (num_words, num_characters)).

static _spans_match(sentence_tokens: typing.List[str], span_tokens: typing.List[str], index: int) → bool[source]
char_span_to_token_span(sentence: str, span: typing.Tuple[int, int], slack: int = 3) → typing.Tuple[int, int][source]

Converts a character span from a sentence into the corresponding token span in the tokenized version of the sentence. If you pass in a character span that does not correspond to complete tokens in the tokenized version, we’ll do our best, but the behavior is officially undefined.

The basic outline of this method is to find the token that starts the same number of characters into the sentence as the given character span. We try to handle a bit of error in the tokenization by checking slack tokens in either direction from that initial estimate.

The returned (begin, end) indices are inclusive for begin, and exclusive for end. So, for example, (2, 2) is an empty span, (2, 3) is the one-word span beginning at token index 2, and so on.
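The core idea (without the slack-based error handling) can be sketched as follows. This is a hypothetical standalone version that takes the token list explicitly, whereas the real method tokenizes the sentence itself:

```python
# A simplified sketch of the span conversion: compute each token's start
# offset, then find the tokens covering the character span. Hypothetical
# standalone version with no slack handling, unlike the real method.
def char_to_token_span(sentence, tokens, char_span):
    """Map a (begin, end) character span onto (begin, end) token indices,
    inclusive for begin and exclusive for end."""
    begin_char, end_char = char_span
    starts, offset = [], 0
    for token in tokens:
        start = sentence.index(token, offset)
        starts.append(start)
        offset = start + len(token)
    token_begin = max(i for i, s in enumerate(starts) if s <= begin_char)
    token_end = sum(1 for s in starts if s < end_char)
    return token_begin, token_end

sentence = "the quick brown fox"
print(char_to_token_span(sentence, sentence.split(), (4, 15)))
# characters 4..15 cover "quick brown" -> token span (1, 3)
```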

embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')[source]

Applies embedding layers to the input_layer. See TextTrainer._embed_input for a more detailed comment on what this method does.

Parameters:

input_layer: Keras ``Input()`` layer

The layer to embed.

embed_function: Callable[[‘Layer’, str, str], ‘Tensor’]

This should be the __get_embedded_input method from your instantiated TextTrainer. This function actually applies an Embedding layer (and maybe also a projection and dropout) to the input layer.

text_trainer: TextTrainer

Simple Tokenizers will just need to use the embed_function that gets passed as a parameter here, but complex Tokenizers might need more than just an embedding function. So that you can get an encoder or other things from the TextTrainer here if you need them, we take this object as a parameter.

embedding_suffix: str, optional (default="")

A suffix to add to embedding keys that we use, so that, e.g., you could specify several different word embedding matrices, for whatever reason.

get_custom_objects() → typing.Dict[str, typing.Any][source]

If you use any custom Layers in your embed_input method, you need to return them here, so that the TextTrainer can correctly load models.

get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int][source]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.

get_sentence_shape(sentence_length: int, word_length: int) → typing.Tuple[int][source]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).

get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]][source]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either ‘words’ or ‘characters’. An example for indexing the string ‘the’ might be {‘words’: [‘the’], ‘characters’: [‘t’, ‘h’, ‘e’]}, if you are indexing both words and characters.

index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[source]

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.

tokenize(text: str) → typing.List[str][source]

Actually splits the string into a sequence of tokens. Note that this will only give you top-level tokenization! If you’re using a word-and-character tokenizer, for instance, this will only return the word tokenization.

word_and_character_tokenizer

class deep_qa.data.tokenizers.word_and_character_tokenizer.WordAndCharacterTokenizer(params: deep_qa.common.params.Params)[source]

Bases: deep_qa.data.tokenizers.tokenizer.Tokenizer

A WordAndCharacterTokenizer first splits strings into words, then splits those words into characters, and returns a representation that contains both a word index and a sequence of character indices for each word. See the documentation for WordTokenizer for a note about naming and the typical notion of “tokenization” in NLP.

Notes

In embed_input, this Tokenizer uses an encoder to get a character-level word embedding, which then gets concatenated with a standard word embedding from an embedding matrix. To specify the encoder to use for this character-level word embedding, use the "word" key in the encoder parameter to your model (which should be a TextTrainer subclass - see the documentation there for some more info). If you do not give a "word" key in the encoder dict, we’ll create a new encoder using the "default" parameters.

embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')[source]

A combined word-and-characters representation requires some fancy footwork to do the embedding properly.

This method assumes the input shape is (..., sentence_length, word_length + 1), where the first integer for each word in the tensor is the word index, and the remaining word_length entries are the character sequence. We’ll first split this into two tensors, one of shape (..., sentence_length), and one of shape (..., sentence_length, word_length), where the first is the word sequence and the second is the character sequence for each word. We’ll pass the word sequence through an embedding layer, as normal, and pass the character sequence through a separate embedding layer, then an encoder, to get a word vector out. We’ll then concatenate the two word vectors, returning a tensor of shape (..., sentence_length, embedding_dim * 2).
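The split can be shown at the shape level with plain lists in place of Keras tensors, using hypothetical toy indices:

```python
# A shape-level sketch of the split described above, with plain Python
# lists standing in for Keras tensors and hypothetical toy indices.
# Each row has word_length + 1 entries: one word index, then the
# character sequence for that word.
word_length = 3
sentence = [[7, 3, 8, 4],   # word id 7, characters [3, 8, 4]
            [2, 5, 0, 0]]   # word id 2, characters [5, 0, 0] (padded)

words = [row[0] for row in sentence]        # shape (sentence_length,)
characters = [row[1:] for row in sentence]  # shape (sentence_length, word_length)

print(words)       # [7, 2]
print(characters)  # [[3, 8, 4], [5, 0, 0]]
```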

get_custom_objects() → typing.Dict[str, typing.Any][source]

If you use any custom Layers in your embed_input method, you need to return them here, so that the TextTrainer can correctly load models.

get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int][source]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.

get_sentence_shape(sentence_length: int, word_length: int = None) → typing.Tuple[int][source]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).

get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]][source]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either ‘words’ or ‘characters’. An example for indexing the string ‘the’ might be {‘words’: [‘the’], ‘characters’: [‘t’, ‘h’, ‘e’]}, if you are indexing both words and characters.

index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[source]

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.

tokenize(text: str) → typing.List[str][source]

Actually splits the string into a sequence of tokens. Note that this will only give you top-level tokenization! If you’re using a word-and-character tokenizer, for instance, this will only return the word tokenization.

word_splitter

class deep_qa.data.tokenizers.word_splitter.NltkWordSplitter[source]

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

A tokenizer that uses nltk’s word_tokenize method.

I found that nltk is very slow, so I switched to using my own simple one, which is a good deal faster. But I’m adding this one back so that there’s consistency with older versions of the code, if you really want it.

split_words(sentence: str) → typing.List[str][source]
class deep_qa.data.tokenizers.word_splitter.NoOpWordSplitter[source]

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

This is a word splitter that does nothing. We’re playing a little loose with python’s dynamic typing here, breaking the typical WordSplitter API a bit and assuming that you’ve already split the sentence into a list somehow, so there’s nothing left to do. For example, the PreTokenizedTaggingInstance requires this word splitter, because it reads in pre-tokenized data from a file.

split_words(sentence: str) → typing.List[str][source]
class deep_qa.data.tokenizers.word_splitter.SimpleWordSplitter[source]

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

Does really simple tokenization. NLTK was too slow, so we wrote our own simple tokenizer instead. This just does an initial split(), followed by some heuristic filtering of each whitespace-delimited token, separating contractions and punctuation. We assume lower-cased, reasonably well-formed English sentences as input.

_can_split(token: str)[source]
split_words(sentence: str) → typing.List[str][source]

Splits a sentence into word tokens. We handle three kinds of things: words with punctuation that should be ignored as a special case (Mr., Mrs., etc.), contractions/genitives (isn’t, don’t, Matt’s), and beginning and ending punctuation (“antennagate”, (parentheticals), and such).

The basic outline is to split on whitespace, then check each of these cases. First, we strip off beginning punctuation, then ending punctuation, then contractions. When we strip something off the beginning of a word, we can add it to the list of tokens immediately. When we strip it off the end, we have to save it to be added after the word itself has been added. Before stripping off any part of a token, we first check to be sure the token isn’t in our list of special cases.
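The outline above can be sketched in a simplified standalone form. This hypothetical version omits the special-case list and the contraction handling that the real SimpleWordSplitter has, keeping only the punctuation stripping:

```python
import string

# A simplified sketch of the outline above (hypothetical standalone
# version: no special-case list and no contraction handling, unlike the
# real SimpleWordSplitter). Leading punctuation is emitted immediately;
# trailing punctuation is saved and emitted after the word itself.
def simple_split(sentence):
    tokens = []
    for field in sentence.split():
        trailing = []
        while field and field[0] in string.punctuation:
            tokens.append(field[0])        # emit leading punctuation now
            field = field[1:]
        while field and field[-1] in string.punctuation:
            trailing.insert(0, field[-1])  # save trailing punctuation
            field = field[:-1]
        if field:
            tokens.append(field)
        tokens.extend(trailing)
    return tokens

print(simple_split('he said, "hello (world)."'))
# ['he', 'said', ',', '"', 'hello', '(', 'world', ')', '.', '"']
```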

class deep_qa.data.tokenizers.word_splitter.SpacyWordSplitter[source]

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

A tokenizer that uses spaCy’s Tokenizer, which is much faster than the others.

split_words(sentence: str) → typing.List[str][source]
class deep_qa.data.tokenizers.word_splitter.WordSplitter[source]

Bases: object

A WordSplitter splits strings into words. This is typically called a “tokenizer” in NLP, but we need Tokenizer to refer to something else, so we’re using WordSplitter here instead.

split_words(sentence: str) → typing.List[str][source]
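The contract is small enough to sketch with standalone stand-in classes (hypothetical, not importing deep_qa): a subclass only has to implement split_words.

```python
# A minimal sketch of the WordSplitter contract (hypothetical standalone
# classes, not deep_qa's): subclasses only implement split_words.
class WordSplitter:
    def split_words(self, sentence):
        raise NotImplementedError

class WhitespaceWordSplitter(WordSplitter):
    def split_words(self, sentence):
        return sentence.split()

print(WhitespaceWordSplitter().split_words("a simple example"))
# ['a', 'simple', 'example']
```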

word_tokenizer

class deep_qa.data.tokenizers.word_tokenizer.WordTokenizer(params: deep_qa.common.params.Params)[source]

Bases: deep_qa.data.tokenizers.tokenizer.Tokenizer

A WordTokenizer splits strings into word tokens.

There are several ways that you can split a string into words, so we rely on a WordProcessor to do that work for us. Note that we’re using the word “tokenizer” here for something different from what is typical in NLP - we’re referring to how strings are represented as numpy arrays, not the linguistic notion of splitting sentences into tokens. That splitting is handled in the WordProcessor, which is a common dependency in several Tokenizers.

Parameters:

processor: Dict[str, Any], default={}

Contains parameters for processing text strings into word tokens, including, e.g., splitting, stemming, and filtering words. See WordProcessor for a complete description of available parameters.
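A parameter dict might look like the following. Note that the "word_splitter" key and the "simple" value are assumptions based on the WordProcessor description and the splitters documented below, not verified against the library:

```python
# A hypothetical parameter dict for WordTokenizer. The "word_splitter"
# key name and "simple" value are assumptions, not a verified part of
# the library's API - check WordProcessor for the real parameter names.
params = {
    "processor": {
        "word_splitter": "simple",  # assumed: which WordSplitter to use
    }
}
print(params["processor"]["word_splitter"])  # simple
```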

embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')[source]

Applies embedding layers to the input_layer. See TextTrainer._embed_input for a more detailed comment on what this method does.

Parameters:

input_layer: Keras ``Input()`` layer

The layer to embed.

embed_function: Callable[[‘Layer’, str, str], ‘Tensor’]

This should be the __get_embedded_input method from your instantiated TextTrainer. This function actually applies an Embedding layer (and maybe also a projection and dropout) to the input layer.

text_trainer: TextTrainer

Simple Tokenizers will just need to use the embed_function that gets passed as a parameter here, but complex Tokenizers might need more than just an embedding function. So that you can get an encoder or other things from the TextTrainer here if you need them, we take this object as a parameter.

embedding_suffix: str, optional (default="")

A suffix to add to embedding keys that we use, so that, e.g., you could specify several different word embedding matrices, for whatever reason.

get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int][source]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.

get_sentence_shape(sentence_length: int, word_length: int) → typing.Tuple[int][source]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).

get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]][source]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either ‘words’ or ‘characters’. An example for indexing the string ‘the’ might be {‘words’: [‘the’], ‘characters’: [‘t’, ‘h’, ‘e’]}, if you are indexing both words and characters.

index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[source]

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.

tokenize(text: str) → typing.List[str][source]

Actually splits the string into a sequence of tokens. Note that this will only give you top-level tokenization! If you’re using a word-and-character tokenizer, for instance, this will only return the word tokenization.