TextTrainer

class deep_qa.training.text_trainer.TextTrainer(params: deep_qa.common.params.Params)[source]

This is a Trainer that deals with word sequences as its fundamental data type (any TextDataset or TextInstance subtype is fine). That means we have to deal with padding, with converting words (or characters) to indices, and encoding word sequences. This class adds methods on top of Trainer to deal with all of that stuff.

This class has five kinds of methods:

  1. protected methods that are overridden from Trainer, and which you shouldn’t need to worry about
  2. utility methods for building models, intended for use by subclasses
  3. abstract methods that determine a few key points of behavior in concrete subclasses (e.g., what your input data type is)
  4. model-specific methods that you might have to override, depending on what your model looks like - similar to (3), but simple models don’t need to override these
  5. private methods that you shouldn’t need to worry about

There are two main ways you’re intended to interact with this class, then: by calling the utility methods when building your model, and by customizing the behavior of your concrete model by using the parameters to this class.

Parameters:

embeddings : Dict[str, Any], optional (default=50 dim word embeddings, 8 dim character embeddings, 0.5 dropout on both)

These parameters specify the kind of embeddings to use for words, characters, tags, or whatever you want to embed. This dictionary behaves similarly to the encoder and seq2seq_encoder parameter dictionaries. Valid keys are dimension, dropout, pretrained_file, fine_tune, and project. The value for dimension is an int specifying the dimensionality of the embedding (default 50 for words, 8 for characters); dropout is a float, specifying the amount of dropout to use on the embedding layer (default 0.5); pretrained_file is a (string) path to a glove-formatted file containing pre-trained embeddings; fine_tune is a boolean specifying whether the pretrained embeddings should be trainable (default False); and project is a boolean specifying whether to add a projection layer after the embedding layer (only really useful in conjunction with pre-trained embeddings, to get them into a lower-dimensional space; default False).
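
As a sketch, the configuration that might go under the embeddings key of a parameter file could look like the following (the embedding names, values, and the pretrained file path are all illustrative placeholders):

    # Illustrative sketch of the "embeddings" parameter dictionary.
    embeddings_params = {
        "words": {
            "dimension": 100,
            "dropout": 0.2,
            "pretrained_file": "/path/to/glove/vectors.txt.gz",  # placeholder path
            "fine_tune": False,
            "project": True,
        },
        "characters": {"dimension": 8},
    }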

data_generator: Dict[str, Any], optional (default=None)

If not None, we will pass these parameters to a DataGenerator object to create data batches, instead of creating one big array for all of our training data. See DataGenerator for the available options here. Note that in order to take full advantage of the capabilities of a DataGenerator, you should make sure your model correctly implements _set_padding_lengths(), get_padding_lengths(), get_padding_memory_scaling(), and get_instance_sorting_keys(). Also note that some of the things DataGenerator does can change the behavior of your learning algorithm, so you should think carefully about how exactly you want batches to be structured before you choose these parameters.

num_sentence_words: int, optional (default=None)

Upper limit on length of word sequences in the training data. Ignored during testing (we use the value set at training time, either from this parameter or from a loaded model). If this is not set, we’ll calculate a max length from the data.

num_word_characters: int, optional (default=None)

Upper limit on length of words in the training data. Only applicable for “words and characters” text encoding.

tokenizer: Dict[str, Any], optional (default={})

Which tokenizer to use for TextInstances. See deep_qa.data.tokenizers.tokenizer for more information.

encoder: Dict[str, Dict[str, Any]], optional (default={‘default’: {}})

These parameters specify the kind of encoder used to encode any word sequence input. An encoder takes a sequence of vectors and returns a single vector.

If given, this must be a dict, where each key is a name that can be used for encoders in the model, and the value corresponding to the key is a set of parameters that will be passed on to the constructor of the encoder. We will use the “type” key in this dict (which must match one of the keys in encoders) to determine the type of the encoder, then pass the remaining args to the encoder constructor.

Hint: Use "lstm" or "cnn" for sentences, "treelstm" for logical forms, and "bow" for either.
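
As a sketch, an encoder configuration might look like the following (the encoder names and constructor arguments are illustrative; keys other than "type" are passed on to the chosen encoder's constructor, e.g. Keras's LSTM):

    # Illustrative sketch of the "encoder" parameter dictionary.
    encoder_params = {
        "default": {"type": "lstm", "units": 200},  # args besides "type" go to the LSTM constructor
        "question": {"type": "bow"},                # a separately-named encoder, here a bag of words
    }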

encoder_fallback_behavior: string, optional (default=”crash”)

Determines the behavior when an encoder is asked for by name, but you have not given parameters for an encoder with that name. See _get_encoder for more information.

seq2seq_encoder: Dict[str, Dict[str, Any]], optional (default={‘default’: {‘encoder_params’: {}, ‘wrapper_params’: {}}})

Like encoder, except seq2seq encoders return a sequence of vectors instead of a single vector (the difference between our “encoders” and “seq2seq encoders” is the difference in Keras between LSTM() and LSTM(return_sequences=True)).

seq2seq_encoder_fallback_behavior: string, optional (default=”crash”)

Determines the behavior when a seq2seq encoder is asked for by name, but you have not given parameters for an encoder with that name. See _get_seq2seq_encoder for more information.

Utility methods

These methods are intended for use by subclasses, mostly in your _build_model implementation.

TextTrainer._get_sentence_shape(sentence_length: int = None) → typing.Tuple[int][source]

Returns a tuple specifying the shape of a tensor representing a sentence. This is not necessarily just (self.num_sentence_words,), because different text_encodings lead to different tensor shapes. If you have an input that is a sequence of words, you need to call this to get the shape to pass to an Input layer. If you don’t, your model won’t work correctly for all tokenizers.
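
For example, a _build_model implementation might use this when declaring its inputs (a minimal sketch):

    from keras.layers import Input

    # Shape is (num_sentence_words,) for word-only tokenizers, or
    # (num_sentence_words, num_word_characters) for "words and characters".
    sentence_input = Input(shape=self._get_sentence_shape(), dtype='int32', name='sentence_input')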

TextTrainer._embed_input(input_layer: keras.engine.topology.Layer, embedding_suffix: str = '')[source]

This function embeds a word sequence input, using an embedding defined by embedding_suffix. You should call this function in your _build_model method any time you want to convert word indices into word embeddings. Note that if this is used in conjunction with _get_sentence_shape, we will do the correct thing for whatever Tokenizer you use. The actual input to this might be words and characters, and we might actually do a concatenation of a word embedding and a character-level encoder. All of this is handled transparently to your concrete model subclass, if you use the API correctly, calling _get_sentence_shape() to get the shape for your Input layer, and passing that input layer into this _embed_input() method.

We need to take the input Layer here, instead of just returning a Layer that you can use as you wish, because we might have to apply several layers to the input, depending on the parameters you specified for embedding things. So we return, essentially, embedding(input_layer).

The input layer can have arbitrary shape, as long as it ends with a word sequence. For example, you could pass in a single sentence, a set of sentences, or a set of sets of sentences, and we will handle them correctly.

Internally, we will create a dictionary mapping embedding names to embedding layers, so if you have several things you want to embed with the same embedding layer, be sure you use the same name each time (or just don’t pass a name, which accomplishes the same thing). If for some reason you want to have different embeddings for different inputs, use a different name for the embedding.

In this function, we pass the work off to self.tokenizer, which might need to do some additional processing to actually give you a word embedding (e.g., if your text encoder uses both words and characters, we need to run the character encoder and concatenate the result with a word embedding).

Note that the embedding_suffix parameter is a suffix to whatever name the tokenizer will give to the embeddings it creates. Typically, the tokenizer will use the name words, though it could also use characters, or something else. So if you pass _A for embedding_suffix, you will end up with actual embedding names like words_A and characters_A. These are the keys you need to specify in your parameter file, for embedding sizes etc. When constructing actual Embedding layers, we will further append the string _embedding, so the layer would be named words_A_embedding.
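
Putting the pieces together, a _build_model sketch might look like this (question_input and other_input are illustrative names for inputs created as shown above):

    from keras.layers import Input

    question_input = Input(shape=self._get_sentence_shape(), dtype='int32', name='question_input')
    # Works for any tokenizer: this may be a plain word embedding, or a word
    # embedding concatenated with a character-level encoding.
    embedded_question = self._embed_input(question_input)
    # A second input embedded with a *separate* embedding, configured in the
    # parameter file under names like "words_B" / "characters_B":
    embedded_other = self._embed_input(other_input, embedding_suffix="_B")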

TextTrainer._get_encoder(name='default', fallback_behavior: str = None)[source]

This method is intended to be used in your _build_model implementation, any time you want to convert a sequence of vectors into a single vector. The encoder name corresponds to entries in the encoder parameter passed to the constructor of this object, allowing you to customize the kind and behavior of the encoder just through parameters.

A sentence encoder takes as input a sequence of word embeddings, and returns as output a single vector encoding the sentence. This is typically either a simple RNN or an LSTM, but could be more complex, if the “sentence” is actually a logical form.

Parameters:

name : str, optional (default=”default”)

The name of the encoder. Multiple calls to _get_encoder using the same name will return the same encoder. To get parameters for creating the encoder, we look in self.encoder_params, which is specified by the encoder parameter in self.__init__. If name is not a key in self.encoder_params, the behavior is defined by the fallback_behavior parameter.

fallback_behavior : str, optional (default=None)

Determines what to do when name is not a key in self.encoder_params. If you pass None (the default), we will use self.encoder_fallback_behavior, specified by the encoder_fallback_behavior parameter to self.__init__. There are three options:

  • "crash": raise an error. This is the default for self.encoder_fallback_behavior. The intention is to help you find bugs - if you specify a particular encoder name in self._build_model without giving a fallback behavior, you probably wanted to use a particular set of parameters, so we crash if they are not provided.
  • "use default params": In this case, we return a new encoder created with self.encoder_params["default"].
  • "use default encoder": In this case, we reuse the encoder created with self.encoder_params["default"]. This effectively changes the name parameter to "default" when the given name is not in self.encoder_params.
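
A sketch of typical use inside _build_model ("question" is an illustrative encoder name, and embedded_question is assumed to be the output of _embed_input):

    question_encoder = self._get_encoder(name="question",
                                         fallback_behavior="use default params")
    # The encoder turns a sequence of word vectors into a single vector.
    encoded_question = question_encoder(embedded_question)
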
TextTrainer._get_seq2seq_encoder(name='default', fallback_behavior: str = None)[source]

This method is intended to be used in your _build_model implementation, any time you want to convert a sequence of vectors into another sequence of vectors. The encoder name corresponds to entries in the seq2seq_encoder parameter passed to the constructor of this object, allowing you to customize the kind and behavior of the encoder just through parameters.

A seq2seq encoder takes as input a sequence of vectors, and returns as output a sequence of vectors. This method is essentially identical to _get_encoder, except that it gives an encoder that returns a sequence of vectors instead of a single vector.

Parameters:

name : str, optional (default=”default”)

The name of the encoder. Multiple calls to _get_seq2seq_encoder using the same name will return the same encoder. To get parameters for creating the encoder, we look in self.seq2seq_encoder_params, which is specified by the seq2seq_encoder parameter in self.__init__. If name is not a key in self.seq2seq_encoder_params, the behavior is defined by the fallback_behavior parameter.

fallback_behavior : str, optional (default=None)

Determines what to do when name is not a key in self.seq2seq_encoder_params. If you pass None (the default), we will use self.seq2seq_encoder_fallback_behavior, specified by the seq2seq_encoder_fallback_behavior parameter to self.__init__. There are three options:

  • "crash": raise an error. This is the default for self.seq2seq_encoder_fallback_behavior. The intention is to help you find bugs - if you specify a particular encoder name in self._build_model without giving a fallback behavior, you probably wanted to use a particular set of parameters, so we crash if they are not provided.
  • "use default params": In this case, we return a new encoder created with self.seq2seq_encoder_params["default"].
  • "use default encoder": In this case, we reuse the encoder created with self.seq2seq_encoder_params["default"]. This effectively changes the name parameter to "default" when the given name is not in self.seq2seq_encoder_params.
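
A sketch of typical use inside _build_model ("passage" is an illustrative name; embedded_passage is assumed to be the output of _embed_input):

    passage_modeler = self._get_seq2seq_encoder(name="passage")  # crashes if no "passage" params were given
    # Unlike _get_encoder, the result is still a sequence of vectors.
    modeled_passage = passage_modeler(embedded_passage)
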
TextTrainer._set_text_lengths_from_model_input(input_slice)[source]

Given an input slice (a tuple) from a model representing the max length of the sentences and the max length of each word, set the padding max lengths. This gets called when loading a model, and is necessary to get padding correct when using loaded models. Subclasses need to call this in their _set_padding_lengths_from_model method.

Parameters:

input_slice : tuple

A slice from a concrete model class that represents an input word sequence. The tuple must be of length one or two, and the first dimension should correspond to the length of the sentences while the second dimension (if provided) should correspond to the max length of the words in each sentence.
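
For a model with a single sentence input, the corresponding _set_padding_lengths_from_model implementation might be a one-liner along these lines (a sketch; it assumes the sentence tensor is the model's first input and that self.model holds the loaded Keras model):

    def _set_padding_lengths_from_model(self):
        # Drop the batch dimension; what remains is (num_sentence_words,) or
        # (num_sentence_words, num_word_characters).
        self._set_text_lengths_from_model_input(self.model.get_input_shape_at(0)[1:])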

Abstract methods

You must implement these methods in your model (along with _build_model()). The simplest concrete TextTrainer implementations only have four methods: __init__, _instance_type (typically one line), _set_padding_lengths_from_model (also typically one line, for simple models), and _build_model. See TrueFalseModel and SimpleTagger for examples.

TextTrainer._instance_type() → deep_qa.data.instances.instance.Instance[source]

When reading datasets, what Instance type should we create? The Instance class contains code that creates actual numpy arrays, so this instance type determines the inputs that you will get to your model, and the outputs that are used for training.
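
Implementations are typically one line, e.g. (MyTaggingInstance is a hypothetical Instance subclass):

    def _instance_type(self):
        return MyTaggingInstance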

TextTrainer._set_padding_lengths_from_model()[source]

This gets called when loading a saved model. It is analogous to _set_padding_lengths, but needs to set all of the values set in that method just by inspecting the loaded model. If we didn’t have this, we would not be able to correctly pad data after loading a model.

Semi-abstract methods

You’ll likely need to override these methods, if you have anything more complex than a single sentence as input.

TextTrainer.get_padding_lengths() → typing.Dict[str, int][source]

This is about padding. Any solver will have some number of things that need padding in order to make consistently-sized data arrays, like the length of a sentence. This method returns a dictionary of all of those things, mapping a length key to an int.

If any of the entries in this dictionary is None, the padding code will calculate a padding length from the data itself. This could either be a good idea or a bad idea - if you have outliers in your data, you could be wasting a whole lot of memory and computation time if you pad the whole dataset to the size of the outlier. On the other hand, if you do batch-specific padding, this can save you a whole lot of time, if you group batches by similar lengths.

Here we return the lengths that are applicable to encoding words and sentences. If you have additional padding dimensions, call super().get_padding_lengths() and then update the dictionary.
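
A sketch of such an override, for a hypothetical model that also pads a passage dimension (the key name and attribute are illustrative):

    def get_padding_lengths(self):
        padding_lengths = super().get_padding_lengths()
        padding_lengths['num_passage_words'] = self.num_passage_words
        return padding_lengths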

TextTrainer.get_instance_sorting_keys() → typing.List[str][source]

If we’re using dynamic padding, we want to group the instances by padding length, so that we minimize the amount of padding necessary per batch. This method specifies exactly what we sort by. We’ll call get_padding_lengths() on each instance, pull out these keys, and sort by them in the order specified. You’ll want to override this in your model class if you have a more complex model.

The default implementation is to sort first by num_sentence_words, then by num_word_characters (if applicable).
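
A model with a long passage input might override it like this (key names other than num_sentence_words are hypothetical):

    def get_instance_sorting_keys(self):
        # Sort by the dominant (longest) dimension first to minimize padding.
        return ['num_passage_words', 'num_sentence_words']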

TextTrainer.get_padding_memory_scaling(padding_lengths: typing.Dict[str, int]) → int[source]

This method is for computing adaptive batch sizes. We assume that memory usage is a function that looks like this: M = b * O(p) * c, where M is the memory usage, b is the batch size, c is some constant that depends on how much GPU memory you have and various model hyperparameters, and O(p) is a function outlining how memory usage asymptotically varies with the padding lengths. Our approach will be to let the user effectively set M / c using the adaptive_memory_usage_constant parameter in DataGenerator. The model (this method) specifies O(p), so we can solve for the batch size b. The more specific you get in specifying O(p) in this function, the better a job we can do in optimizing memory usage.

Parameters:

padding_lengths: Dict[str, int]

Dictionary containing padding lengths, mapping keys like num_sentence_words to ints. This method computes a function of these ints.

Returns:

O(p): int

The big-O complexity of the model, evaluated with the specific ints given in padding_lengths dictionary.
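
For example, a hypothetical reading-comprehension model whose memory is dominated by a passage-by-question similarity matrix might scale as the product of those two lengths (the keys here are illustrative):

    def get_padding_memory_scaling(self, padding_lengths):
        return padding_lengths['num_passage_words'] * padding_lengths['num_question_words']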

TextTrainer._set_padding_lengths(dataset_padding_lengths: typing.Dict[str, int])[source]

This is about padding. Any model will have some number of things that need padding in order to make a consistent set of input arrays, like the length of a sentence. This method sets those variables given a dictionary of lengths from a dataset.

Note that you might choose not to update some of these lengths, either because you want to keep the model flexible to allow for dynamic (batch-specific) padding, or because you’ve set a hard limit in the class parameters and don’t want to change it.
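
A sketch of a typical override, for a hypothetical model with an extra passage length that should only be taken from the data when no hard limit was given in the parameters:

    def _set_padding_lengths(self, dataset_padding_lengths):
        super()._set_padding_lengths(dataset_padding_lengths)
        if self.num_passage_words is None:
            self.num_passage_words = dataset_padding_lengths['num_passage_words']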

Overridden Trainer methods

You probably don’t need to override these, except for probably _get_custom_objects. The rest of them you shouldn’t need to worry about at all (except to call them, if they are part of the external Trainer API), but we document them here for completeness.

TextTrainer.create_data_arrays(dataset: deep_qa.data.datasets.dataset.IndexedDataset, batch_size: int = None)[source]

Takes a raw dataset and converts it into training inputs and labels that can be used to either train a model or make predictions. Depending on parameters passed to the constructor of this Trainer, this could either return two actual array objects, or a single generator that generates batches of two array objects.

Parameters:

dataset: Dataset

A Dataset of the same format as read by load_dataset_from_files() (we will call this directly with the output from that method, in fact)

batch_size: int, optional (default = None)

The batch size with which the dataset should be created. If this is None, the default self.batch_size will be used.

Returns:

input_arrays: numpy.array or Tuple[numpy.array]

label_arrays: numpy.array, Tuple[numpy.array], or None

generator: a Python generator returning Tuple[input_arrays, label_arrays]

If this is returned, it is the only return value. We either return a Tuple[input_arrays, label_arrays], or this generator.

TextTrainer.load_dataset_from_files(files: typing.List[str])[source]

This method assumes you have a TextDataset that can be read from a single file. If you have something more complicated, you’ll need to override this method (though, a solver that has background information could call this method, then do additional processing on the rest of the list, for instance).

TextTrainer.score_dataset(dataset: deep_qa.data.datasets.dataset.TextDataset)[source]

See the superclass docs (Trainer.score_dataset()) for usage info. Just a note here that we do not use data generators for this method, even if you’ve said elsewhere that you want to use them, so that we can easily return the labels for the data. This means that we’ll do whole-dataset padding, and this could be slow. We could probably fix this, but it’s good enough for now.

TextTrainer.set_model_state_from_dataset(dataset: deep_qa.data.datasets.dataset.TextDataset)[source]

Given a raw Dataset object, set whatever model state is necessary. The most obvious use case for this is for computing a vocabulary in TextTrainer. Note that this is not an IndexedDataset, and you should not make it one. Use set_model_state_from_indexed_dataset() for setting state that depends on the data having already been indexed; otherwise you’ll duplicate the work of doing the indexing.

TextTrainer.set_model_state_from_indexed_dataset(dataset: deep_qa.data.datasets.dataset.IndexedDataset)[source]

Given an IndexedDataset, set whatever model state is necessary. This is typically stuff around padding.

classmethod TextTrainer._get_custom_objects()[source]
TextTrainer._dataset_indexing_kwargs() → typing.Dict[str, typing.Any][source]
TextTrainer._load_auxiliary_files()[source]

Called during model loading. If you have some auxiliary pickled object, such as an object storing the vocabulary of your model, you can load it here.

TextTrainer._overall_debug_output(output_dict: typing.Dict[str, numpy.array]) → str[source]

We’ll do something different here: if “embedding” is in output_dict, we’ll output the embedding matrix at the top of the debug file. Note that this could be _huge_ - you should only do this for debugging on very simple datasets.

TextTrainer._save_auxiliary_files()[source]

Called after training. If you have some auxiliary object, such as an object storing the vocabulary of your model, you can save it here. The model config is saved by default.

TextTrainer._set_params_from_model()[source]

Called after a model is loaded, this lets you update member variables that contain model parameters, like max sentence length, that are not stored as weights in the model object. This is necessary if you want to process a new data instance to be compatible with the model for prediction, for instance.

TextTrainer._uses_data_generators()[source]

Training models with Keras requires a different API depending on whether you produce data in batches using a generator or provide one big numpy array with all of your data, which Keras has to split into batches. This method tells us which Keras API we should use. If your model class produces data using a generator, return True here; otherwise, return False. The default implementation just returns False.