Home

DeepQA is a library built on top of Keras to make NLP easier. There are four main benefits to this library:

  1. It is hard to get NLP right in Keras. There are a lot of issues around padding sequences and masking that are not handled well in the main Keras code, and we have well-tested code that does the right thing for, e.g., computing attentions over padded sequences, or distributing text encoders across several sentences or words.
  2. We have implemented a base class, TextTrainer, that provides a nice, consistent API around building NLP models in Keras. This API has functionality around processing data instances, embedding words and/or characters, easily getting various kinds of sentence encoders, and so on.
  3. We provide a nice interface to training, validating, and debugging Keras models. It is very easy to experiment with variants of a model family, just by changing some parameters in a JSON file. For example, you can go from using fixed GloVe vectors to represent words, to fine-tuning those embeddings, to using a concatenation of word vectors and a character-level CNN to represent words, just by changing parameters in a JSON experiment file (a sketch of such a specification follows this list). If your model is built using the TextTrainer API, all of this works transparently to the model class - the model just knows that it’s getting some kind of word vector.
  4. We have implemented a number of state-of-the-art models, particularly focused on question answering systems (though we’ve dabbled in models for other tasks, as well). The actual model code for these systems is typically 50 lines or less.
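
As a rough illustration, here is a sketch of such an experiment specification, expressed as the parameter dictionary you could pass to deep_qa.run.run_model (documented below). The file paths are hypothetical, the "ClassificationModel" name is just an example, and the exact set of keys your model accepts depends on the model class; the keys shown are documented under Trainer and TextTrainer.

    from deep_qa.run import run_model

    params = {
        "model_class": "ClassificationModel",  # example model name
        "embeddings": {
            "words": {
                "dimension": 100,
                "pretrained_file": "/path/to/glove.100d.txt",  # hypothetical path
                "fine_tune": False,
            },
            "characters": {"dimension": 8},
        },
        "train_files": ["/path/to/train.tsv"],      # hypothetical path
        "validation_files": ["/path/to/dev.tsv"],   # hypothetical path
        "num_epochs": 20,
        "patience": 3,
        "model_serialization_prefix": "/tmp/classification_model",  # where to save the model
    }
    run_model(params)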

This library has several main components:

  • A training module, which has a bunch of helper code for training Keras models of various kinds.
  • A models module, containing implementations of actual Keras models grouped around various prediction tasks.
  • A layers module, which contains code for custom Keras Layers that we have written.
  • A data module, containing code for reading in data from files and converting it into numpy arrays suitable for use with Keras.
  • A common module, which contains utilities for reading parameters and a few other shared helpers.

Running Models

deep_qa.run.compute_accuracy(predictions: numpy.array, labels: numpy.array)[source]

Computes a simple categorical accuracy metric, useful if you used score_dataset to get predictions.

deep_qa.run.evaluate_model(param_path: str, dataset_files: typing.List[str] = None, model_class=None)[source]

Loads a model and evaluates it on some test set.

Parameters:

param_path: str, required

A json file specifying a DeepQaModel.

dataset_files: List[str], optional, (default=None)

A list of dataset files to evaluate on. If this is None, we’ll evaluate on the test_files specified in the parameter file. If that’s also None, we’ll crash.

model_class: DeepQaModel, optional (default=None)

This option is useful if you have implemented a new model class which is not one of the ones implemented in this library.

Returns:

Numpy arrays of model predictions in the format of model.outputs.

deep_qa.run.load_model(param_path: str, model_class=None)[source]

Loads and returns a model.

Parameters:

param_path: str, required

A json file specifying a DeepQaModel.

model_class: DeepQaModel, optional (default=None)

This option is useful if you have implemented a new model class which is not one of the ones implemented in this library.

Returns:

A DeepQaModel instance.

deep_qa.run.prepare_environment(params: typing.Union[deep_qa.common.params.Params, dict])[source]

Sets random seeds for reproducible experiments. This may not work as expected if you use this from within a python project in which you have already imported Keras. If you use the scripts/run_model.py entry point to training models with this library, your experiments should be reproducible. If you are using this from your own project, you will want to call this function before importing Keras.

Parameters:

params: Params object or dict, required.

A Params object or dict holding the json parameters.
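
A minimal sketch of using this function from your own project; here params is simply your experiment parameter dict, and Keras is imported only afterwards:

    from deep_qa.run import prepare_environment

    params = {}  # the same parameter dict you would pass to run_model
    prepare_environment(params)

    # Import Keras only after the seeds have been set.
    import keras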

deep_qa.run.run_model(param_dict: typing.Dict[str, typing.Any], model_class=None)[source]

This function is the normal entry point to DeepQA. Use this to run a DeepQA model in your project. Note that if you care about exactly reproducible experiments, you should avoid importing Keras before you import and use this function, as Keras relies on random seeds which can be set in this function via a JSON specification file.

Note that this function performs training and will also evaluate the trained model on development and test sets if provided in the parameter json.

Parameters:

param_dict: Dict[str, any], required.

A parameter dictionary specifying a DeepQaModel.

model_class: DeepQaModel, optional (default=None).

This option is useful if you have implemented a new model class which is not one of the ones implemented in this library.

deep_qa.run.run_model_from_file(param_path: str)[source]

A wrapper around the run_model function which loads json from a file.

Parameters:

param_path: str, required.

A json parameter file specifying a DeepQA model.
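
A minimal sketch of this file-based entry point (the path is hypothetical):

    from deep_qa.run import run_model_from_file

    # Reads the JSON parameter file and calls run_model with its contents.
    run_model_from_file("experiments/my_model.json")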

deep_qa.run.score_dataset(param_path: str, dataset_files: typing.List[str], model_class=None)[source]

Loads a model from a saved parameter path and scores a dataset with it, returning the predictions.

Parameters:

param_path: str, required

A json file specifying a DeepQaModel.

dataset_files: List[str]

A list of dataset files to score, the same as you would have specified as train_files or test_files in your parameter file.

model_class: DeepQaModel, optional (default=None)

This option is useful if you have implemented a new model class which is not one of the ones implemented in this library.

Returns:

predictions: numpy.array

Numpy array of model predictions in the format of model.outputs (typically one array, but could be List[numpy.array] if your model has multiple outputs).

labels: numpy.array

The labels on the dataset, as read by the model. We return this so you can compute whatever metrics you want, if the data was labeled.
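
A sketch combining score_dataset with compute_accuracy, assuming a previously trained model and a labeled test file (both paths are hypothetical):

    from deep_qa.run import compute_accuracy, score_dataset

    predictions, labels = score_dataset("experiments/my_model.json",
                                        ["/path/to/test.tsv"])
    print(compute_accuracy(predictions, labels))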

deep_qa.run.score_dataset_with_ensemble(param_paths: typing.List[str], dataset_files: typing.List[str], model_class=None) → typing.Tuple[numpy.array, numpy.array][source]

Loads all of the models specified in param_paths, uses each of them to score the dataset specified by dataset_files, and averages their scores, returning an array of ensembled model predictions.

Parameters:

param_paths: List[str]

A list of parameter files that were used to train models. You must have already trained the corresponding model, as we’ll load it and use it in an ensemble here.

dataset_files: List[str]

A list of dataset files to score, the same as you would have specified as test_files in any one of the model parameter files.

model_class: DeepQaModel, optional (default=None)

This option is useful if you have implemented a new model class which is not one of the ones implemented in this library.

Returns:

predictions: numpy.array

Numpy array of model predictions in the format of model.outputs (typically one array, but could be List[numpy.array] if your model has multiple outputs).

labels: numpy.array

The labels on the dataset, as read by the first model. We return this so you can compute whatever metrics you want, if the data was labeled. Note that if your models all represent that data differently, this will only give the first one. Hopefully the representation of the labels is consistent across the models, though; if not, the whole idea of ensembling them this way is moot, anyway.

About Trainers

A Trainer is the core interface to the DeepQA code. Trainers specify data, a model, and a way to train the model with the data. This module groups all of the common code related to these things, making only minimal assumptions about what kind of data you’re using or what the structure of your model is. Really, a Trainer is just a nicer interface to a Keras Model; we call it something else to avoid naming confusion, and because the Trainer class provides a lot of functionality around training the model that a Keras Model doesn’t.

On top of Trainer, which is a nicer interface to a Keras Model, this module provides a TextTrainer, which adds a lot of functionality for building Keras Models that work with text. We provide APIs around word embeddings, sentence encoding, reading and padding datasets, and similar things. All of the concrete models that we have so far in DeepQA inherit from TextTrainer, so understanding how to use this class is pretty important to understanding DeepQA.

We also deal with the notion of pre-training in this module. A Pretrainer is a Trainer that depends on another Trainer, building its model using pieces of the enclosed Trainer, so that training the Pretrainer updates the weights in the enclosed Trainer object.

Trainer

class deep_qa.training.trainer.Trainer(params: deep_qa.common.params.Params)[source]

A Trainer object specifies data, a model, and a way to train the model with the data. Here we group all of the common code related to these things, making only minimal assumptions about what kind of data you’re using or what the structure of your model is.

The main benefits of this class are having a common place for setting parameters related to training, actually running the training with those parameters, and code for saving and loading models.

The intended use of this class is that you construct a subclass that defines a model, overriding the abstract methods and (optionally) some of the protected methods in this class. Thus there are four kinds of methods in this class: (1) public methods, that are typically only used by deep_qa/run.py (or some other driver that you create), (2) abstract methods (beginning with _), which must be overridden by any concrete subclass, (3) protected methods (beginning with _) that you are meant to override in concrete subclasses, and (4) private methods (beginning with __) that you should not need to mess with. We only include the first three in the public docs.

Parameters:

train_files: List[str], optional (default=None)

The files containing the data that should be used for training. See load_dataset_from_files() for more information.

validation_files: List[str], optional (default=None)

The files containing the data that should be used for validation, if you do not want to use a split of the training data for validation. The default of None means to just use the validation_split parameter to split the training data for validation.

test_files: List[str], optional (default=None)

The files containing the data that should be used for evaluation. The default of None means to just not perform test set evaluation.

max_training_instances: int, optional (default=None)

Upper limit on the number of training instances. If this is set, and we get more than this, we will truncate the data. Mostly useful for testing things out on small datasets before running them on large datasets.

max_validation_instances: int, optional (default=None)

Upper limit on the number of validation instances, analogous to max_training_instances.

max_test_instances: int, optional (default=None)

Upper limit on the number of test instances, analogous to max_training_instances.

train_steps_per_epoch: int, optional (default=None)

If create_data_arrays() returns a generator instead of actual arrays, how many steps should we run from this generator before declaring an “epoch” finished? The default here is reasonable - if this is None, we will set it from the data.

validation_steps: int, optional (default=None)

Like train_steps_per_epoch, but for validation data.

test_steps: int, optional (default=None)

Like train_steps_per_epoch, but for test data.

save_models: bool, optional (default=True)

Should we save the models that we train? If this is True, you are required to also set the model_serialization_prefix parameter, or the code will crash.

model_serialization_prefix: str, optional (default=None)

Prefix for saving and loading model files. Must be set if save_models is True.

num_gpus: int, optional (default=1)

Number of GPUs to use. In DeepQA we use data parallelism, meaning that we create copies of the full model for each GPU, allowing the batch size of your model to be scaled depending on the number of GPUs. Note that using multiple GPUs effectively increases your batch size by the number of GPUs you have, meaning that other code which depends on the batch size will be affected - for example, if you are using dynamic padding, the batches will be larger and hence more padded, as the dataset is chunked into fewer overall batches.

batch_size: int, optional (default=32)

Batch size to use when training.

num_epochs: int, optional (default=20)

Number of training epochs.

validation_split: float, optional (default=0.1)

Amount of training data to use for validation. If validation_files is not set, we will split the training data into train/dev, using this proportion as dev. If validation_files is set, this parameter gets ignored.

optimizer: str or Dict[str, Any], optional (default=’adam’)

If this is a str, it must correspond to an optimizer available in Keras (see the list in deep_qa.training.optimizers). If it is a dictionary, it must contain a “type” key, with a value that is one of the optimizers in that list. The remaining parameters in the dict are passed as kwargs to the optimizer’s constructor.

loss: str, optional (default=’categorical_crossentropy’)

The loss function to pass to model.fit(). This is currently limited to only loss functions that are available as strings in Keras. If you want to use a custom loss function, simply override self.loss in the constructor of your model, after the call to super().__init__.

metrics: List[str], optional (default=[‘accuracy’])

The metrics to evaluate and print after each epoch of training. This is currently limited to only loss functions that are available as strings in Keras. If you want to use a custom metric, simply override self.metrics in the constructor of your model, after the call to super().__init__.

validation_metric: str, optional (default=’val_acc’)

Metric to monitor on the validation data for things like early stopping and saving the best model.

patience: int, optional (default=1)

Number of epochs to be patient before early stopping. I.e., if the validation_metric does not improve for this many epochs, we will stop training.

fit_kwargs: Dict[str, Any], optional (default={})

A dict of additional arguments to Keras’ model.fit() method, in case you want to set something that we don’t already have options for. These get added to the options already captured by other arguments.

tensorboard_log: str, optional (default=None)

If set, we will output tensorboard log information here.

tensorboard_histogram_freq: int, optional (default=0)

Tensorboard histogram frequency: note that activating the tensorboard histogram (frequency > 0) can drastically increase model training time. Please set the frequency with consideration for the desired runtime.

debug: Dict[str, Any], optional (default={})

This should be a dict, containing the following keys (a minimal sketch follows this list):

  • “layer_names”, which has as a value a list of names that must match layer names in the model built by this Trainer.
  • “data”, which has as a value either “training”, “validation”, or a list of file names. If you give “training” or “validation”, we’ll use those datasets, otherwise we’ll load data from the provided files. Note that currently “validation” only works if you provide validation files, not if you’re just using Keras to split the training data.
  • “masks”, an optional key that functions identically to “layer_names”, except we output the mask at each layer given here.
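
A minimal sketch of this dict, using hypothetical layer names (they must match layers actually built by your model):

    debug_params = {
        "layer_names": ["sentence_encoder", "entailment_softmax"],  # hypothetical names
        "data": "validation",
        "masks": ["sentence_encoder"],
    }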

show_summary_with_masking_info: bool, optional (default=False)

This is a debugging setting, mostly - we have written a custom model.summary() method that supports showing masking info, to help understand what’s going on with the masks.

Public methods

Trainer.can_train()[source]
Trainer.evaluate_model(data_files: typing.List[str], max_instances: int = None)[source]
Trainer.load_data_arrays(data_files: typing.List[str], batch_size: int = None, max_instances: int = None) → typing.Tuple[deep_qa.data.datasets.dataset.Dataset, numpy.array, numpy.array][source]

Loads a Dataset from a list of files, then converts it into numpy arrays for both inputs and outputs, returning all three of these to you. This simply calls self.load_dataset_from_files and then self.create_data_arrays; it is a convenience method for doing both at once, and also lets you truncate the dataset if you want.

Note that if you have any kind of state in your model that depends on a training dataset (e.g., a vocabulary, or padding dimensions) those must be set prior to calling this method.

Parameters:

data_files: List[str]

The files to load. These will get passed to self.load_dataset_from_files(), which subclasses must implement.

batch_size: int, optional (default = None)

Optionally pass a specific batch size to load the data arrays with. If this is not specified, we use the default self.batch_size attribute. This is a parameter so you can specify different batch sizes for training vs validation, for instance, which is useful if you are doing multi-gpu training.

max_instances: int, optional (default=None)

If not None, we will restrict the dataset to only this many instances. This is mostly useful for testing models out on subsets of your data.

Returns:

dataset: Dataset

A Dataset object containing the instances read from the data files

input_arrays: numpy.array

An array or tuple of arrays suitable to be passed as inputs x to Keras’ model.fit(x, y), model.evaluate(x, y) or model.predict(x) methods

label_arrays: numpy.array

An array or tuple of arrays suitable to be passed as outputs y to Keras’ model.fit(x, y) or model.evaluate(x, y) methods

Trainer.load_model(epoch: int = None)[source]

Loads a serialized model, using the model_serialization_prefix that was passed to the constructor. If epoch is not None, we try to load the model from that epoch. If epoch is not given, we load the best saved model.

Trainer.train()[source]

Trains the model.

All training parameters have already been passed to the constructor, so we need no arguments to this method.

Abstract methods

If you’re doing NLP, TextTrainer implements most of these, so you shouldn’t have to worry about them. The only one it doesn’t implement is _build_model (though it adds some other abstract methods that you might have to worry about).

Trainer.create_data_arrays(dataset: deep_qa.data.datasets.dataset.IndexedDataset, batch_size: int = None) → typing.Tuple[numpy.array, numpy.array][source]

Takes a raw dataset and converts it into training inputs and labels that can be used to either train a model or make predictions. Depending on parameters passed to the constructor of this Trainer, this could either return two actual array objects, or a single generator that generates batches of two array objects.

Parameters:

dataset: Dataset

A Dataset of the same format as read by load_dataset_from_files() (we will call this directly with the output from that method, in fact)

batch_size: int, optional (default = None)

The batch size with which the dataset should be created. If this is None, the default self.batch_size will be used.

Returns:

input_arrays: numpy.array or Tuple[numpy.array]

label_arrays: numpy.array, Tuple[numpy.array], or None

generator: a Python generator returning Tuple[input_arrays, label_arrays]

If this is returned, it is the only return value. We either return a Tuple[input_arrays, label_arrays], or this generator.

Trainer.load_dataset_from_files(files: typing.List[str]) → deep_qa.data.datasets.dataset.Dataset[source]

Given a list of file inputs, load a raw dataset from the files. This is a list because some datasets are specified in more than one file (e.g., a file containing the instances, and a file containing background information about those instances).

Trainer.score_dataset(dataset: deep_qa.data.datasets.dataset.Dataset) → typing.Tuple[numpy.array, numpy.array][source]

Takes a Dataset, indexes it, and returns the output of evaluating the model on all instances, and labels for the instances from the data, if they were given. The specifics of the numpy array that are returned depend on the model and the instance type in the dataset.

Parameters:

dataset: Dataset

A Dataset read by Trainer.load_dataset_from_files().

Returns:

predictions: numpy.array

Predictions for each Instance in the Dataset. This could actually be a tuple/list of arrays, if your model has multiple outputs

labels: numpy.array

The labels for each Instance in the Dataset, if there were any (this will be None if there were no labels). We return this so you can easily compute metrics over these predictions if you wish. It’s hard to get numpy arrays with the labels from a non-indexed-and-padded Dataset, so we return it here so you don’t have to do any funny business to get the label array.

Trainer.set_model_state_from_dataset(dataset: deep_qa.data.datasets.dataset.Dataset)[source]

Given a raw Dataset object, set whatever model state is necessary. The most obvious use case for this is for computing a vocabulary in TextTrainer. Note that this is not an IndexedDataset, and you should not make it one. Use set_model_state_from_indexed_dataset() for setting state that depends on the data having already been indexed; otherwise you’ll duplicate the work of doing the indexing.

Trainer.set_model_state_from_indexed_dataset(dataset: deep_qa.data.datasets.dataset.IndexedDataset)[source]

Given an IndexedDataset, set whatever model state is necessary. This is typically stuff around padding.

Trainer._build_model() → deep_qa.training.models.DeepQaModel[source]

Constructs and returns a DeepQaModel (which is a wrapper around a Keras Model) that will take the output of self._get_training_data as input, and produce as output a true/false decision for each input. Note that in the multiple gpu case, this function will be called multiple times for the different GPUs. As such, you should be wary of this function having side effects unrelated to building a computation graph.

The returned model will be used to call model.fit(train_input, train_labels).

Trainer._set_params_from_model()[source]

Called after a model is loaded, this lets you update member variables that contain model parameters, like max sentence length, that are not stored as weights in the model object. This is necessary if you want to process a new data instance to be compatible with the model for prediction, for instance.

Trainer._dataset_indexing_kwargs() → typing.Dict[str, typing.Any][source]

In order to index a dataset, we may need some parameters (e.g., an object that stores the vocabulary of your model, in order to convert words into indices). You can pass those here, or return an empty dictionary if there’s nothing. These will get passed to Dataset.to_indexed_dataset().

Protected methods

Trainer._get_callbacks()[source]

Returns a set of Callbacks which are used to perform various functions within Keras’ .fit method. Here, we use an early stopping callback to add patience with respect to the validation metric and a Lambda callback which performs the model specific callbacks which you might want to build into a model, such as re-encoding some background knowledge.

Additionally, there is functionality to create Tensorboard log files. These can be visualised using tensorboard --logdir /path/to/log/files after training.

classmethod Trainer._get_custom_objects()[source]

If you’ve used any Layers that Keras doesn’t know about, you need to specify them in this dictionary, so we can load them correctly.

Trainer._instance_debug_output(instance: deep_qa.data.instances.instance.Instance, outputs: typing.Dict[str, numpy.array]) → str[source]

This method takes an Instance and all of the debug outputs for that Instance, puts them into some human-readable format, and returns that as a string. outputs will have one key corresponding to each item in the debug.layer_names parameter given to the constructor of this object.

The default here is pass instead of raise NotImplementedError, because you’re not required to implement debugging for your model.

Trainer._load_auxiliary_files()[source]

Called during model loading. If you have some auxiliary pickled object, such as an object storing the vocabulary of your model, you can load it here.

Trainer._output_debug_info(output_dict: typing.Dict[str, numpy.array], epoch: int)[source]
Trainer._overall_debug_output(output_dict: typing.Dict[str, numpy.array]) → str[source]
Trainer._post_epoch_hook(epoch: int)[source]

This method gets called directly after model.fit(), before making any early stopping decisions. If you want to modify anything after each iteration (e.g., computing a different kind of validation loss to use for early stopping, or just computing and printing accuracy on some other held out data), you can do that here. If you require extra parameters, use calls to local methods rather than passing new parameters, as this hook is run via a Keras Callback, which is fairly strict in its interface.

Trainer._pre_epoch_hook(epoch: int)[source]

This method gets called before each epoch of training. If you want to do any kind of processing in between epochs (e.g., updating the training data for whatever reason), here is your chance to do so.

Trainer._save_auxiliary_files()[source]

Called after training. If you have some auxiliary object, such as an object storing the vocabulary of your model, you can save it here. The model config is saved by default.

Trainer._uses_data_generators()[source]

Training models with Keras requires a different API depending on whether you produce data in batches using a generator or just provide one big numpy array with all of your data, which Keras has to split into batches. This method tells us which Keras API we should use. If your model class produces data using a generator, return True here; otherwise, return False. The default implementation just returns False.

TextTrainer

class deep_qa.training.text_trainer.TextTrainer(params: deep_qa.common.params.Params)[source]

This is a Trainer that deals with word sequences as its fundamental data type (any TextDataset or TextInstance subtype is fine). That means we have to deal with padding, with converting words (or characters) to indices, and encoding word sequences. This class adds methods on top of Trainer to deal with all of that stuff.

This class has five kinds of methods:

  1. protected methods that are overridden from Trainer, and which you shouldn’t need to worry about
  2. utility methods for building models, intended for use by subclasses
  3. abstract methods that determine a few key points of behavior in concrete subclasses (e.g., what your input data type is)
  4. model-specific methods that you might have to override, depending on what your model looks like - similar to (3), but simple models don’t need to override these
  5. private methods that you shouldn’t need to worry about

There are two main ways you’re intended to interact with this class, then: by calling the utility methods when building your model, and by customizing the behavior of your concrete model by using the parameters to this class.

Parameters:

embeddings: Dict[str, Any], optional (default=50-dim word embeddings, 8-dim character embeddings, 0.5 dropout on both)

These parameters specify the kind of embeddings to use for words, characters, tags, or whatever you want to embed. This dictionary behaves similarly to the encoder and seq2seq_encoder parameter dictionaries. Valid keys are dimension, dropout, pretrained_file, fine_tune, and project. The value for dimension is an int specifying the dimensionality of the embedding (default 50 for words, 8 for characters); dropout is a float, specifying the amount of dropout to use on the embedding layer (default 0.5); pretrained_file is a (string) path to a glove-formatted file containing pre-trained embeddings; fine_tune is a boolean specifying whether the pretrained embeddings should be trainable (default False); and project is a boolean specifying whether to add a projection layer after the embedding layer (only really useful in conjunction with pre-trained embeddings, to get them into a lower-dimensional space; default False).
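
For instance, a words-plus-characters configuration might look like the following sketch (the GloVe path is hypothetical, and the "words" and "characters" names follow the tokenizer's default embedding names):

    embeddings_params = {
        "words": {
            "dimension": 100,
            "pretrained_file": "/path/to/glove.100d.txt",  # hypothetical path
            "fine_tune": False,
            "project": True,
        },
        "characters": {"dimension": 8, "dropout": 0.2},
    }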

data_generator: Dict[str, Any], optional (default=None)

If not None, we will pass these parameters to a DataGenerator object to create data batches, instead of creating one big array for all of our training data. See DataGenerator for the available options here. Note that in order to take full advantage of the capabilities of a DataGenerator, you should make sure your model correctly implements _set_padding_lengths(), get_padding_lengths(), get_padding_memory_scaling(), and get_instance_sorting_keys(). Also note that some of the things DataGenerator does can change the behavior of your learning algorithm, so you should think carefully about how exactly you want batches to be structured before you choose these parameters.

num_sentence_words: int, optional (default=None)

Upper limit on length of word sequences in the training data. Ignored during testing (we use the value set at training time, either from this parameter or from a loaded model). If this is not set, we’ll calculate a max length from the data.

num_word_characters: int, optional (default=None)

Upper limit on length of words in the training data. Only applicable for “words and characters” text encoding.

tokenizer: Dict[str, Any], optional (default={})

Which tokenizer to use for TextInstances. See :mod:deep_qa.data.tokenizers.tokenizer for more information.

encoder: Dict[str, Dict[str, Any]], optional (default={‘default’: {}})

These parameters specify the kind of encoder used to encode any word sequence input. An encoder takes a sequence of vectors and returns a single vector.

If given, this must be a dict, where each key is a name that can be used for encoders in the model, and the value corresponding to the key is a set of parameters that will be passed on to the constructor of the encoder. We will use the “type” key in this dict (which must match one of the keys in encoders) to determine the type of the encoder, then pass the remaining args to the encoder constructor.

Hint: Use "lstm" or "cnn" for sentences, "treelstm" for logical forms, and "bow" for either.

encoder_fallback_behavior: string, optional (default=”crash”)

Determines the behavior when an encoder is asked for by name, but you have not given parameters for an encoder with that name. See _get_encoder for more information.

seq2seq_encoder: Dict[str, Dict[str, Any]], optional (default={‘default’: {‘encoder_params’: {}, ‘wrapper_params’: {}}})

Like encoder, except seq2seq encoders return a sequence of vectors instead of a single vector (the difference between our “encoders” and “seq2seq encoders” is the difference in Keras between LSTM() and LSTM(return_sequences=True)).

seq2seq_encoder_fallback_behavior: string, optional (default=”crash”)

Determines the behavior when a seq2seq encoder is asked for by name, but you have not given parameters for an encoder with that name. See _get_seq2seq_encoder for more information.
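
As a sketch, the encoder and seq2seq_encoder dictionaries might look like this; the "question" name is hypothetical, and the "units" key is an assumption about what the underlying LSTM constructor accepts (any remaining keys are simply passed to that constructor):

    encoder_params = {
        "default": {"type": "lstm", "units": 200},  # "units" assumed to be a valid LSTM kwarg
        "question": {"type": "cnn"},                # a second, separately-named encoder
    }
    seq2seq_encoder_params = {
        "default": {
            "encoder_params": {"type": "lstm", "units": 200},
            "wrapper_params": {},
        },
    }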

Utility methods

These methods are intended for use by subclasses, mostly in your _build_model implementation.

TextTrainer._get_sentence_shape(sentence_length: int = None) → typing.Tuple[int][source]

Returns a tuple specifying the shape of a tensor representing a sentence. This is not necessarily just (self.num_sentence_words,), because different text_encodings lead to different tensor shapes. If you have an input that is a sequence of words, you need to call this to get the shape to pass to an Input layer. If you don’t, your model won’t work correctly for all tokenizers.

TextTrainer._embed_input(input_layer: keras.engine.topology.Layer, embedding_suffix: str = '')[source]

This function embeds a word sequence input, using an embedding defined by embedding_suffix. You should call this function in your _build_model method any time you want to convert word indices into word embeddings. Note that if this is used in conjunction with _get_sentence_shape, we will do the correct thing for whatever Tokenizer you use. The actual input to this might be words and characters, and we might actually do a concatenation of a word embedding and a character-level encoder. All of this is handled transparently to your concrete model subclass, if you use the API correctly, calling _get_sentence_shape() to get the shape for your Input layer, and passing that input layer into this _embed_input() method.

We need to take the input Layer here, instead of just returning a Layer that you can use as you wish, because we might have to apply several layers to the input, depending on the parameters you specified for embedding things. So we return, essentially, embedding(input_layer).

The input layer can have arbitrary shape, as long as it ends with a word sequence. For example, you could pass in a single sentence, a set of sentences, or a set of sets of sentences, and we will handle them correctly.

Internally, we will create a dictionary mapping embedding names to embedding layers, so if you have several things you want to embed with the same embedding layer, be sure you use the same name each time (or just don’t pass a name, which accomplishes the same thing). If for some reason you want to have different embeddings for different inputs, use a different name for the embedding.

In this function, we pass the work off to self.tokenizer, which might need to do some additional processing to actually give you a word embedding (e.g., if your text encoder uses both words and characters, we need to run the character encoder and concatenate the result with a word embedding).

Note that the embedding_suffix parameter is a suffix to whatever name the tokenizer will give to the embeddings it creates. Typically, the tokenizer will use the name words, though it could also use characters, or something else. So if you pass _A for embedding_suffix, you will end up with actual embedding names like words_A and characters_A. These are the keys you need to specify in your parameter file, for embedding sizes etc. When constructing actual Embedding layers, we will further append the string _embedding, so the layer would be named words_A_embedding.

TextTrainer._get_encoder(name='default', fallback_behavior: str = None)[source]

This method is intended to be used in your _build_model implementation, any time you want to convert a sequence of vectors into a single vector. The encoder name corresponds to entries in the encoder parameter passed to the constructor of this object, allowing you to customize the kind and behavior of the encoder just through parameters.

A sentence encoder takes as input a sequence of word embeddings, and returns as output a single vector encoding the sentence. This is typically either a simple RNN or an LSTM, but could be more complex, if the “sentence” is actually a logical form.

Parameters:

name : str, optional (default=”default”)

The name of the encoder. Multiple calls to _get_encoder using the same name will return the same encoder. To get parameters for creating the encoder, we look in self.encoder_params, which is specified by the encoder parameter in self.__init__. If name is not a key in self.encoder_params, the behavior is defined by the fallback_behavior parameter.

fallback_behavior : str, optional (default=None)

Determines what to do when name is not a key in self.encoder_params. If you pass None (the default), we will use self.encoder_fallback_behavior, specified by the encoder_fallback_behavior parameter to self.__init__. There are three options:

  • "crash": raise an error. This is the default for self.encoder_fallback_behavior. The intention is to help you find bugs - if you specify a particular encoder name in self._build_model without giving a fallback behavior, you probably wanted to use a particular set of parameters, so we crash if they are not provided.
  • "use default params": In this case, we return a new encoder created with self.encoder_params["default"].
  • "use default encoder": In this case, we reuse the encoder created with self.encoder_params["default"]. This effectively changes the name parameter to "default" when the given name is not in self.encoder_params.
TextTrainer._get_seq2seq_encoder(name='default', fallback_behavior: str = None)[source]

This method is intended to be used in your _build_model implementation, any time you want to convert a sequence of vectors into another sequence of vectors. The encoder name corresponds to entries in the seq2seq_encoder parameter passed to the constructor of this object, allowing you to customize the kind and behavior of the encoder just through parameters.

A seq2seq encoder takes as input a sequence of vectors, and returns as output a sequence of vectors. This method is essentially identical to _get_encoder, except that it gives an encoder that returns a sequence of vectors instead of a single vector.

Parameters:

name : str, optional (default=”default”)

The name of the encoder. Multiple calls to _get_seq2seq_encoder using the same name will return the same encoder. To get parameters for creating the encoder, we look in self.seq2seq_encoder_params, which is specified by the seq2seq_encoder parameter in self.__init__. If name is not a key in self.seq2seq_encoder_params, the behavior is defined by the fallback_behavior parameter.

fallback_behavior : str, optional (default=None)

Determines what to do when name is not a key in self.seq2seq_encoder_params. If you pass None (the default), we will use self.seq2seq_encoder_fallback_behavior, specified by the seq2seq_encoder_fallback_behavior parameter to self.__init__. There are three options:

  • "crash": raise an error. This is the default for self.seq2seq_encoder_fallback_behavior. The intention is to help you find bugs - if you specify a particular encoder name in self._build_model without giving a fallback behavior, you probably wanted to use a particular set of parameters, so we crash if they are not provided.
  • "use default params": In this case, we return a new encoder created with self.seq2seq_encoder_params["default"].
  • "use default encoder": In this case, we reuse the encoder created with self.seq2seq_encoder_params["default"]. This effectively changes the name parameter to "default" when the given name is not in self.seq2seq_encoder_params.
TextTrainer._set_text_lengths_from_model_input(input_slice)[source]

Given an input slice (a tuple) from a model representing the max length of the sentences and the max length of each word, set the padding max lengths. This gets called when loading a model, and is necessary to get padding correct when using loaded models. Subclasses need to call this in their _set_padding_lengths_from_model method.

Parameters:

input_slice : tuple

A slice from a concrete model class that represents an input word sequence. The tuple must be of length one or two, and the first dimension should correspond to the length of the sentences while the second dimension (if provided) should correspond to the max length of the words in each sentence.

Abstract methods

You must implement these methods in your model (along with _build_model()). The simplest concrete TextTrainer implementations only have four methods: __init__, _instance_type (typically one line), _set_padding_lengths_from_model (also typically one line, for simple models), and _build_model. See TrueFalseModel and SimpleTagger for examples.

TextTrainer._instance_type() → deep_qa.data.instances.instance.Instance[source]

When reading datasets, what Instance type should we create? The Instance class contains code that creates actual numpy arrays, so this instance type determines the inputs that you will get to your model, and the outputs that are used for training.

TextTrainer._set_padding_lengths_from_model()[source]

This gets called when loading a saved model. It is analogous to _set_padding_lengths, but needs to set all of the values set in that method just by inspecting the loaded model. If we didn’t have this, we would not be able to correctly pad data after loading a model.

Semi-abstract methods

You’ll likely need to override these methods, if you have anything more complex than a single sentence as input.

TextTrainer.get_padding_lengths() → typing.Dict[str, int][source]

This is about padding. Any solver will have some number of things that need padding in order to make consistently-sized data arrays, like the length of a sentence. This method returns a dictionary of all of those things, mapping a length key to an int.

If any of the entries in this dictionary is None, the padding code will calculate a padding length from the data itself. This could either be a good idea or a bad idea - if you have outliers in your data, you could be wasting a whole lot of memory and computation time if you pad the whole dataset to the size of the outlier. On the other hand, if you do batch-specific padding, this can save you a whole lot of time, if you group batches by similar lengths.

Here we return the lengths that are applicable to encoding words and sentences. If you have additional padding dimensions, call super().get_padding_lengths() and then update the dictionary.
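
For example, a subclass with one extra padding dimension might sketch this as follows (the extra key and attribute are hypothetical):

    def get_padding_lengths(self):
        padding_lengths = super().get_padding_lengths()
        # Hypothetical extra dimension on top of the word/sentence lengths.
        padding_lengths['num_background_sentences'] = self.num_background_sentences
        return padding_lengths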

TextTrainer.get_instance_sorting_keys() → typing.List[str][source]

If we’re using dynamic padding, we want to group the instances by padding length, so that we minimize the amount of padding necessary per batch. This method specifies which keys get sorted by. We’ll call get_padding_lengths() on each instance, pull out these keys, and sort by them in the order specified. You’ll want to override this in your model class if you have more complex models.

The default implementation is to sort first by num_sentence_words, then by num_word_characters (if applicable).

TextTrainer.get_padding_memory_scaling(padding_lengths: typing.Dict[str, int]) → int[source]

This method is for computing adaptive batch sizes. We assume that memory usage is a function that looks like this: M = b * O(p) * c, where M is the memory usage, b is the batch size, c is some constant that depends on how much GPU memory you have and various model hyperparameters, and O(p) is a function outlining how memory usage asymptotically varies with the padding lengths. Our approach will be to let the user effectively set M / c using the adaptive_memory_usage_constant parameter in DataGenerator. The model (this method) specifies O(p), so we can solve for the batch size b. The more specific you get in specifying O(p) in this function, the better a job we can do in optimizing memory usage.

Parameters:

padding_lengths: Dict[str, int]

Dictionary containing padding lengths, mapping keys like num_sentence_words to ints. This method computes a function of these ints.

Returns:

O(p): int

The big-O complexity of the model, evaluated with the specific ints given in padding_lengths dictionary.
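
For example, a hypothetical model whose memory usage is dominated by a word-by-word attention matrix (quadratic in sentence length) might sketch this as:

    def get_padding_memory_scaling(self, padding_lengths):
        # Memory assumed to scale with the square of the sentence length.
        num_sentence_words = padding_lengths['num_sentence_words']
        return num_sentence_words * num_sentence_words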

TextTrainer._set_padding_lengths(dataset_padding_lengths: typing.Dict[str, int])[source]

This is about padding. Any model will have some number of things that need padding in order to make a consistent set of input arrays, like the length of a sentence. This method sets those variables given a dictionary of lengths from a dataset.

Note that you might choose not to update some of these lengths, either because you want to keep the model flexible to allow for dynamic (batch-specific) padding, or because you’ve set a hard limit in the class parameters and don’t want to change it.

Overridden Trainer methods

You probably don’t need to override these, except perhaps _get_custom_objects. The rest of them you shouldn’t need to worry about at all (except to call them, if they are part of the external Trainer API), but we document them here for completeness.

TextTrainer.create_data_arrays(dataset: deep_qa.data.datasets.dataset.IndexedDataset, batch_size: int = None)[source]

Takes a raw dataset and converts it into training inputs and labels that can be used to either train a model or make predictions. Depending on parameters passed to the constructor of this Trainer, this could either return two actual array objects, or a single generator that generates batches of two array objects.

Parameters:

dataset: Dataset

A Dataset of the same format as read by load_dataset_from_files() (we will call this directly with the output from that method, in fact)

batch_size: int, optional (default = None)

The batch size with which the dataset should be created. If this is None, the default self.batch_size will be used.

Returns:

input_arrays: numpy.array or Tuple[numpy.array]

label_arrays: numpy.array, Tuple[numpy.array], or None

generator: a Python generator returning Tuple[input_arrays, label_arrays]

If this is returned, it is the only return value. We either return a Tuple[input_arrays, label_arrays], or this generator.

TextTrainer.load_dataset_from_files(files: typing.List[str])[source]

This method assumes you have a TextDataset that can be read from a single file. If you have something more complicated, you’ll need to override this method (though, a solver that has background information could call this method, then do additional processing on the rest of the list, for instance).

TextTrainer.score_dataset(dataset: deep_qa.data.datasets.dataset.TextDataset)[source]

See the superclass docs (Trainer.score_dataset()) for usage info. Just a note here that we do not use data generators for this method, even if you’ve said elsewhere that you want to use them, so that we can easily return the labels for the data. This means that we’ll do whole-dataset padding, and this could be slow. We could probably fix this, but it’s good enough for now.

TextTrainer.set_model_state_from_dataset(dataset: deep_qa.data.datasets.dataset.TextDataset)[source]

Given a raw Dataset object, set whatever model state is necessary. The most obvious use case for this is for computing a vocabulary in TextTrainer. Note that this is not an IndexedDataset, and you should not make it one. Use set_model_state_from_indexed_dataset() for setting state that depends on the data having already been indexed; otherwise you’ll duplicate the work of doing the indexing.

TextTrainer.set_model_state_from_indexed_dataset(dataset: deep_qa.data.datasets.dataset.IndexedDataset)[source]

Given an IndexedDataset, set whatever model state is necessary. This is typically stuff around padding.

classmethod TextTrainer._get_custom_objects()[source]
TextTrainer._dataset_indexing_kwargs() → typing.Dict[str, typing.Any][source]
TextTrainer._load_auxiliary_files()[source]

Called during model loading. If you have some auxiliary pickled object, such as an object storing the vocabulary of your model, you can load it here.

TextTrainer._overall_debug_output(output_dict: typing.Dict[str, numpy.array]) → str[source]

We’ll do something different here: if “embedding” is in output_dict, we’ll output the embedding matrix at the top of the debug file. Note that this could be _huge_ - you should only do this for debugging on very simple datasets.

TextTrainer._save_auxiliary_files()[source]

Called after training. If you have some auxiliary object, such as an object storing the vocabulary of your model, you can save it here. The model config is saved by default.

TextTrainer._set_params_from_model()[source]

Called after a model is loaded, this lets you update member variables that contain model parameters, like max sentence length, that are not stored as weights in the model object. This is necessary if you want to process a new data instance to be compatible with the model for prediction, for instance.

TextTrainer._uses_data_generators()[source]

Training models with Keras requires a different API depending on whether you produce data in batches using a generator or just provide one big numpy array with all of your data, which Keras has to split into batches. This method tells us which Keras API we should use. If your model class produces data using a generator, return True here; otherwise, return False. The default implementation just returns False.

Multi GPU Training

deep_qa.training.multi_gpu.compile_parallel_model(model_builder: typing.Callable[[], deep_qa.training.models.DeepQaModel], compile_arguments: deep_qa.common.params.Params) → deep_qa.training.models.DeepQaModel[source]

This function compiles a multi-gpu version of your model. This is done using data parallelism, by making N copies of the model on the different GPUs, all of which share parameters. Gradients are updated synchronously, using the average gradient from all of the outputs of the various models. This effectively allows you to scale a model up to batch_sizes which cannot fit on a single GPU.

This method returns a “primary” copy of the model, whose training function (the one that Keras runs) has been overridden with one that trains all of the towers of the model. The other towers never have their training functions initialised or used and are completely hidden from the user. The returned model can be serialised in the same way as any other model and has no dependency on multiple gpus being available when it is loaded.

Note that by calling this function, the model_builder function will be called multiple times for the different GPUs. As such, you should be wary of this function having side effects unrelated to building a computation graph.

Parameters:

model_builder: Callable[[], DeepQaModel], required.

A function which returns an uncompiled DeepQaModel.

compile_arguments: Params, required

Model parameters which are passed to compile. These should be the same as if you were building a single GPU model, with the exception of the num_gpus field.

Returns:

The “primary” copy of the DeepQaModel, which holds the training function that trains all of the copies of the model.
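
A rough usage sketch; the builder function is hypothetical, the compile arguments shown are assumptions about what your model needs, and in normal use the Trainer drives this for you via its num_gpus parameter:

    from deep_qa.common.params import Params
    from deep_qa.training.multi_gpu import compile_parallel_model

    def build_uncompiled_model():
        # Hypothetical: returns an uncompiled DeepQaModel, e.g. by calling a
        # concrete model's _build_model().
        ...

    parallel_model = compile_parallel_model(
        build_uncompiled_model,
        Params({"num_gpus": 2,
                "optimizer": "adam",
                "loss": "categorical_crossentropy"}))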

Misc

Models

class deep_qa.training.models.DeepQaModel(*args, **kwargs)[source]

Bases: keras.engine.training.Model

This is a Model that adds functionality to Keras’ Model class. In particular, we use tensorflow optimisers directly in order to make use of sparse gradient updates, which Keras does not handle. Additionally, we provide some nicer summary functions which include mask information. We are overriding key components of Keras here and you should probably have a pretty good grip on the internals of Keras before you change stuff below, as there could be unexpected consequences.

_fit_loop(f: callable, ins: typing.List[numpy.array], out_labels: typing.List[str] = None, batch_size: int = 32, epochs: int = 100, verbose: int = 1, callbacks: typing.List[keras.callbacks.Callback] = None, val_f: callable = None, val_ins: typing.List[numpy.array] = None, shuffle: bool = True, callback_metrics: typing.List[str] = None, initial_epoch: int = 0)[source]

Abstract fit function which preprocesses and batches data before training a model. We override this keras backend function to support multi-gpu training via splitting a large batch size across multiple gpus. This function is broadly the same as the Keras backend version aside from this - changed elements have corresponding comments attached.

Note that this should not be called directly - it is used by calling model.fit().

Assume that step_function returns a list, labeled by out_labels.

Parameters:

f: A callable ``Step`` or a Keras ``Function``, required.

A DeepQA Step or Keras Function returning a list of tensors.

ins: List[numpy.array], required.

The list of tensors to be fed to step_function.

out_labels: List[str], optional (default = None).

The display names of the outputs of step_function.

batch_size: int, optional (default = 32).

The integer batch size.

epochs: int, optional (default = 100).

Number of times to iterate over the data.

verbose: int, optional, (default = 1)

Verbosity mode, 0, 1 or 2.

callbacks: List[Callback], optional (default = None).

A list of Keras callbacks to be called during training.

val_f: A callable ``Step`` or a Keras ``Function``, optional (default = None).

The Keras function to call for validation.

val_ins: List[numpy.array], optional (default = None)

A list of tensors to be fed to val_f.

shuffle: bool, optional (default = True).

whether to shuffle the data at the beginning of each epoch

callback_metrics: List[str], optional, (default = None).

A list of strings giving the display names of the validation metrics, passed to the callbacks. They should be the concatenation of the display names of the outputs of f and the display names of the outputs of val_f.

initial_epoch: int, optional (default = 0).

The epoch at which to start training (useful for resuming a previous training run).

Returns:

A Keras History object.

_keras_summary()[source]
_make_predict_function()[source]
_make_test_function()[source]
_make_train_function()[source]

We override this method so that we can use tensorflow optimisers directly. This is desirable as tensorflow handles gradients of sparse tensors efficiently.

_multi_gpu_batch(variable_list)[source]
_prepare_callbacks(callbacks: typing.List[keras.callbacks.Callback], val_ins: typing.List[numpy.array], epochs: int, batch_size: int, num_train_samples: int, callback_metrics: typing.List[str], do_validation: bool, verbose: int)[source]

Sets up Keras callbacks to perform various monitoring functions during training.

_summary_with_mask_info()[source]
compile(params: deep_qa.common.params.Params)[source]

The only reason we are overriding this method is because keras automatically wraps our tensorflow optimiser in a keras wrapper, which we don’t want. We override the only method in Model which uses this attribute, _make_train_function, which raises an error if compile is not called first. As we move towards using a Tensorflow first optimisation loop, more things will be added here which add functionality to the way Keras runs tensorflow Session calls.

summary(show_masks=False, **kwargs)[source]
train_on_batch(x: typing.List[numpy.array], y: typing.List[numpy.array], sample_weight: typing.List[numpy.array] = None, class_weight: typing.Dict[int, numpy.array] = None)[source]

Runs a single gradient update on a single batch of data. We override this method in order to provide multi-gpu training capability.

Parameters:

x: List[numpy.array], required

Numpy array of training data, or list of Numpy arrays if the model has multiple inputs. If all inputs in the model are named, you can also pass a dictionary mapping input names to Numpy arrays.

y: List[numpy.array], required

A Numpy array of labels, or list of Numpy arrays if the model has multiple outputs. If all outputs in the model are named, you can also pass a dictionary mapping output names to Numpy arrays.

sample_weight: List[numpy.array], optional (default = None)

optional array of the same length as x, containing weights to apply to the model’s loss for each sample. In the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. In this case you should make sure to specify sample_weight_mode=”temporal” in compile().

class_weight: Dict[int, float], optional (default = None)

A dictionary mapping class indices (integers) to a weight (float) to apply to the model’s loss for the samples from this class during training. This can be useful to tell the model to “pay more attention” to samples from an under-represented class.

Returns:

Scalar training loss (if the model has a single output and no metrics) or a list of scalars (if the model has multiple outputs and/or metrics). The attribute model.metrics_names will give you the display labels for the scalar outputs.

deep_qa.training.models.count_total_params(layers, layer_set=None)[source]
deep_qa.training.models.print_layer_summary(layer, relevant_nodes, positions)[source]
deep_qa.training.models.print_row(fields, positions)[source]
deep_qa.training.models.print_summary_with_masking(layers, relevant_nodes=None)[source]

Optimizers

It turns out that Keras’ design is somewhat crazy*, and there is no list of optimizers that you can just import from Keras. So, this module specifies a list, and a helper function or two for dealing with optimizer parameters. Unfortunately, this means that we have a list that must be kept in sync with Keras. Oh well.

* Have you seen their get_from_module() method? See here: https://github.com/fchollet/keras/blob/6e42b0e4a77fb171295b541a6ae9a3a4a79f9c87/keras/utils/generic_utils.py#L10. That method means I could pass in ‘clip_norm’ as an optimizer, and it would try to use that function as an optimizer. It also means there is no simple list of implemented optimizers I can grab.

* I should also note that Keras is an incredibly useful library that does a lot of things really well. It just has a few quirks...

deep_qa.training.optimizers.optimizer_from_params(params: typing.Union[deep_qa.common.params.Params, str])[source]

This method converts from a parameter object like we use in our Trainer code into an optimizer object suitable for use with Keras. The simplest case for both of these is a string that shows up in optimizers above - if params is just one of those strings, we return it, and everyone is happy. If not, we assume params is a Dict[str, Any], with a “type” key, where the value for “type” must be one of those strings above. We take the rest of the parameters and pass them to the optimizer’s constructor.
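As a minimal sketch, both call styles described above look roughly like this (the “adam” name and the learning-rate key are illustrative; the exact constructor arguments depend on the underlying TensorFlow/Keras optimizer):

    from deep_qa.common.params import Params
    from deep_qa.training.optimizers import optimizer_from_params

    # A bare string from the optimizers list is returned as-is and resolved by name later.
    optimizer = optimizer_from_params("adam")

    # A parameter object with a "type" key; the remaining keys (here an illustrative
    # learning rate) are passed to the optimizer's constructor.
    optimizer = optimizer_from_params(Params({"type": "adam", "lr": 0.001}))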

About Data

This module contains code for processing data. There’s a DataIndexer, whose job it is to convert from strings to word (or character) indices suitable for use with an embedding matrix. There’s code to load pre-trained embeddings from a file, to tokenize sentences, and, most importantly, to convert training and testing examples into numpy arrays that can be used with Keras.

The most important thing to understand about the data processing code is the Dataset object. A Dataset is a collection of Instances, which are the individual examples used for training and testing. Dataset has two subclasses: TextDataset, which contains Instances with raw strings and can be read directly from a file, and IndexedDataset, which contains Instances whose raw strings have been converted to word (or character) indices. The IndexedDataset has methods for padding sequences to a consistent length, so that models can be compiled, and for converting the Instances to numpy arrays. The file formats read by TextDataset, and the format of the numpy arrays produced by IndexedDataset, are determined by the underlying Instance type used by the Dataset. See the instances module for more detail on this.

Base Instances

An Instance is a single training or testing example for a Keras model. The base classes for working with Instances are found in instance.py. There are two subclasses: (1) TextInstance, which is a raw instance that contains actual strings, and can be used to determine a vocabulary for a model, or read directly from a file; and (2) IndexedInstance, which has had its raw strings converted to word (or character) indices, and can be padded to a consistent length and converted to numpy arrays for use with Keras.

Concrete Instance classes are organized in the code by the task they are designed for (e.g., text classification, reading comprehension, sequence tagging, etc.).

A lot of the magic of how the DeepQA library works happens here, in the concrete Instance classes in this module. Most of the code can be totally agnostic to how exactly the input is structured, because the conversion to numpy arrays happens here, not in the Trainer or TextTrainer classes, with only the specific _build_model() methods needing to know about the format of their input and output (and even some of the details there are transparent to the model class).

This module contains the base Instance classes that concrete classes inherit from. Specifically, there are three classes:

  1. Instance, which just exists as a base type with no functionality.
  2. TextInstance, which adds a words() method and a method to convert strings to indices using a DataIndexer.
  3. IndexedInstance, which is a TextInstance that has had all of its strings converted into indices.

The IndexedInstance class has methods to deal with padding (so that sequences all have the same length) and converting an Instance into a set of Numpy arrays suitable for use with Keras.

As this codebase is dealing mostly with textual question answering, pretty much all of the concrete Instance types will have both a TextInstance and a corresponding IndexedInstance, which you can see in the individual files for each Instance type.
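To make the flow concrete, here is a minimal sketch of the TextInstance to IndexedInstance to numpy pipeline, using only methods documented in this module. TextClassificationInstance is one concrete TextInstance (documented later in this section); the example line is illustrative, and data_indexer is assumed to be a DataIndexer that has already been fit on your training vocabulary:

    from deep_qa.data.instances.text_classification.text_classification_instance import (
        TextClassificationInstance)

    # `data_indexer` is assumed to be a DataIndexer already fit on your training data (not shown).
    instance = TextClassificationInstance.read_from_line("this movie was great\t1")
    indexed = instance.to_indexed_instance(data_indexer)  # strings -> word indices
    lengths = indexed.get_padding_lengths()               # e.g. {'num_sentence_words': 4}
    indexed.pad(lengths)                                  # zero-pad in place
    inputs, label = indexed.as_training_data()            # numpy arrays for Keras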

class deep_qa.data.instances.instance.IndexedInstance(label, index: int = None)[source]

Bases: deep_qa.data.instances.instance.Instance

An indexed data instance has all word tokens replaced with word indices, along with some kind of label, suitable for input to a Keras model. An IndexedInstance is created from an Instance using a DataIndexer, and the indices here have no recoverable meaning without the DataIndexer.

For example, we might have the following Instance: - TrueFalseInstance('Jamie is nice, Holly is mean', True, 25)

After being converted into an IndexedInstance, we might have the following: - IndexedTrueFalseInstance([1, 6, 7, 1, 6, 8], True, 25)

This would mean that "Jamie" and "Holly" were OOV to the DataIndexer, and the other words were given indices.

static _get_word_sequence_lengths(word_indices: typing.List) → typing.Dict[str, int][source]

Because TextEncoders can return complex data structures, we might actually have several things to pad for a single word sequence. We check for that and handle it in a single spot here. We return a dictionary containing ‘num_sentence_words’, which is the number of words in word_indices. If the word representations also contain characters, the dictionary additionally contains a ‘num_word_characters’ key, with a value corresponding to the longest word in the sequence.
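As a made-up illustration of the dictionary described above, for a word-and-character encoding:

    # Illustrative only: each word here is represented by a list (e.g., its character indices).
    word_indices = [[2, 5, 3], [4, 1], [7, 2, 9, 8]]  # 3 words; the longest word has 4 characters
    # _get_word_sequence_lengths(word_indices) would then contain:
    # {'num_sentence_words': 3, 'num_word_characters': 4}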

as_training_data()[source]

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns:

train_data : (inputs, label)

The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.

classmethod empty_instance()[source]

Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.

get_padding_lengths() → typing.Dict[str, int][source]

Returns the length of this instance in all dimensions that require padding.

Different kinds of instances have different fields that are padded, such as sentence length, number of background sentences, number of options, etc.

Returns:

padding_lengths: Dict[str, int]

A dictionary mapping padding keys (like “num_sentence_words”) to lengths.

pad(padding_lengths: typing.Dict[str, int])[source]

Add zero-padding to make each data example of equal length for use in the neural network.

This modifies the current object.

Parameters:

padding_lengths: Dict[str, int]

In this dictionary, each str refers to a type of token (e.g. num_sentence_words), and the corresponding int is the value. This dictionary must have the same keys as were returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.

static pad_sequence_to_length(sequence: typing.List, desired_length: int, default_value: typing.Callable[[], typing.Any] = <function IndexedInstance.<lambda>>, truncate_from_right: bool = True) → typing.List[source]

Takes a list of indices and pads it to the desired length.

Parameters:

sequence : List of int

A list of word indices.

desired_length : int

Maximum length of each sequence. Longer sequences are truncated to this length, and shorter ones are padded to it.

default_value: Callable, default=lambda: 0

Callable that outputs a default value (of any type) to use as padding values.

truncate_from_right : bool, default=True

If truncating the indices is necessary, this parameter dictates whether we do so on the left or right.

Returns:

padded_word_sequence : List of int

A padded or truncated list of word indices.

Notes

The reason we truncate from the right by default is for cases that are questions with long set-ups: we at least want to get the question encoded, which is always at the end, even if we’ve lost much of the set-up. If you want to truncate from the other direction, you can.

static pad_word_sequence(word_sequence: typing.List[int], padding_lengths: typing.Dict[str, int], truncate_from_right: bool = True) → typing.List[source]

Takes a list of indices and pads them.

Parameters:

word_sequence : List of int

A list of word indices.

padding_lengths : Dict[str, int]

In this dictionary, each str refers to a type of token (e.g. num_sentence_words), and the corresponding int is the value. This dictionary must have the same keys as were returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.

truncate_from_right : bool, default=True

If truncating the indices is necessary, this parameter dictates whether we do so on the left or right.

Returns:

padded_word_sequence : List of int

A padded list of word indices.

Notes

The reason we truncate from the right by default is for cases that are questions with long set-ups: we at least want to get the question encoded, which is always at the end, even if we’ve lost much of the set-up. If you want to truncate from the other direction, you can.

TODO(matt): we should probably switch the default to truncate from the left, and clear up the naming here - it’s easy to get confused about what “truncate from right” means.

class deep_qa.data.instances.instance.Instance(label, index: int = None)[source]

Bases: object

A data instance, used either for training a neural network or for testing one.

Parameters:

label : Any

Any kind of label that you might want to predict in a model. Could be a class label, a tag sequence, a character span in a passage, etc.

index : int, optional

Used for matching instances with other data, such as background sentences.

class deep_qa.data.instances.instance.TextInstance(label, index: int = None)[source]

Bases: deep_qa.data.instances.instance.Instance

An Instance that has some attached text, typically either a sentence or a logical form. This is called a TextInstance because the individual tokens here are encoded as strings, and we can get a list of strings out when we ask what words show up in the instance.

We use these kinds of instances to fit a DataIndexer (i.e., deciding which words should be mapped to an unknown token); to use them in training or testing, we need to first convert them into IndexedInstances.

In order to actually convert text into some kind of indexed sequence, we rely on a Tokenizer. There are several Tokenizer subclasses that will let you use word token sequences, character sequences, and other options. By default we use word tokens. You can override this by setting the tokenizer class variable.

_index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[int][source]
_words_from_text(text: str) → typing.Dict[str, typing.List[str]][source]
classmethod read_from_line(line: str)[source]

Reads an instance of this type from a line.

Parameters:

line : str

A line from a data file.

Returns:

instance : TextInstance

An instance of this class, read from the line.

Notes

We throw a RuntimeError here instead of a NotImplementedError, because it’s not expected that all subclasses will implement this.

to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer) → deep_qa.data.instances.instance.IndexedInstance[source]

Converts the words in this Instance into indices using the DataIndexer.

Parameters:

data_indexer : DataIndexer

DataIndexer to use in converting the Instance to an IndexedInstance.

Returns:

indexed_instance : IndexedInstance

A TextInstance that has had all of its strings converted into indices.

tokenizer = <deep_qa.data.tokenizers.word_tokenizer.WordTokenizer object>
words() → typing.Dict[str, typing.List[str]][source]

Returns a list of all of the words in this instance, contained in a namespace dictionary.

This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.

Returns:

namespace : Dictionary of {str: List[str]}

The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key ‘words’ to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.

Entailment Instances

These Instances are designed for an entailment task, where the input is a pair of sentences (or larger text sequences) and the output is a classification decision.

SentencePairInstances

class deep_qa.data.instances.entailment.sentence_pair_instance.IndexedSentencePairInstance(first_sentence_indices: typing.List[int], second_sentence_indices: typing.List[int], label: typing.List[int], index: int = None)[source]

Bases: deep_qa.data.instances.instance.IndexedInstance

This is an indexed instance that is commonly used for labeled sentence pairs. Examples of this are SnliInstances where we have a labeled pair of text and hypothesis, and a sentence2vec instance where the objective is to train an encoder to predict whether the sentences are in context or not.

as_training_data()[source]

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns:

train_data : (inputs, label)

The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.

classmethod empty_instance()[source]

Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.

get_padding_lengths() → typing.Dict[str, int][source]

Returns the length of this instance in all dimensions that require padding.

Different kinds of instances have different fields that are padded, such as sentence length, number of background sentences, number of options, etc.

Returns:

padding_lengths: Dict[str, int]

A dictionary mapping padding keys (like “num_sentence_words”) to lengths.

pad(padding_lengths: typing.Dict[str, int])[source]

Add zero-padding to make each data example of equal length for use in the neural network.

This modifies the current object.

Parameters:

padding_lengths: Dict[str, int]

In this dictionary, each str refers to a type of token (e.g. num_sentence_words), and the corresponding int is the value. This dictionary must have the same keys as were returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.

class deep_qa.data.instances.entailment.sentence_pair_instance.SentencePairInstance(first_sentence: str, second_sentence: str, label: typing.List[int], index: int = None)[source]

Bases: deep_qa.data.instances.instance.TextInstance

A SentencePairInstance contains a pair of sentences accompanied by a binary label. You could have the label represent whatever you want, such as entailment, or occurring in the same context, or whatever.

classmethod read_from_line(line: str)[source]

Expected format: [sentence1][tab][sentence2][tab][label]
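A hypothetical data line matching this format (illustrative values only) might be read like this:

    from deep_qa.data.instances.entailment.sentence_pair_instance import SentencePairInstance

    # Tab-separated sentence pair with a binary label; the content is made up.
    line = "a dog is running in the park\tan animal is moving\t1"
    instance = SentencePairInstance.read_from_line(line)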

to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]

Converts the words in this Instance into indices using the DataIndexer.

Parameters:

data_indexer : DataIndexer

DataIndexer to use in converting the Instance to an IndexedInstance.

Returns:

indexed_instance : IndexedInstance

A TextInstance that has had all of its strings converted into indices.

words() → typing.Dict[str, typing.List[str]][source]

Returns a list of all of the words in this instance, contained in a namespace dictionary.

This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.

Returns:

namespace : Dictionary of {str: List[str]}

The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key ‘words’ to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.

SnliInstances

class deep_qa.data.instances.entailment.snli_instance.SnliInstance(text: str, hypothesis: str, label: str, index: int = None)[source]

Bases: deep_qa.data.instances.entailment.sentence_pair_instance.SentencePairInstance

An SnliInstance is a SentencePairInstance that represents a pair of (text, hypothesis) from the Stanford Natural Language Inference (SNLI) dataset, with an associated label. The main thing we need to add here is handling of the label, because there are a few different ways we can use this Instance.

The label can either be a three-way decision (one of “entails”, “contradicts”, or “neutral”), or a binary decision (grouping either “entails” and “contradicts”, for relevance decisions, or “contradicts” and “neutral”, for entails/not-entails decisions).

The input label must be one of the strings in the label_mapping field below. The difference between the *_softmax and *_sigmoid labels is just for implementation reasons. A softmax over two dimensions is exactly equivalent to a sigmoid, but to make our lives easier in building models, sometimes we use a sigmoid and sometimes we use a softmax over two dimensions. Having separate labels for these cases makes it easier to use this data in whatever kind of model you want.

It might make sense to push this difference more generally into some common place, so that we can separate the label itself from how it’s encoded for training. But that might also be complicated to implement, and it’s not needed right now. TODO(matt): if we find ourselves doing this kind of thing in several places, we should think about making that change.

label_mapping = {'entails_softmax': [0, 1], 'not_entails_softmax': [1, 0], 'attention_false': [0], 'entails': [1, 0, 0], 'contradicts': [0, 1, 0], 'not_entails_sigmoid': [0], 'neutral': [0, 0, 1], 'attention_true': [1], 'entails_sigmoid': [1]}
classmethod read_from_line(line: str)[source]

Reads an SnliInstance object from a line. The format has one of two options:

  1. [example index][tab][text][tab][hypothesis][tab][label]
  2. [text][tab][hypothesis][tab][label]

[label] is assumed to be one of “entails”, “contradicts”, or “neutral”.

to_attention_instance()[source]

This returns a new SnliInstance with the label converted to a binary attention label (see the attention_true and attention_false entries in label_mapping above).

to_entails_instance(activation: str)[source]

This returns a new SnliInstance with a different label. The new label will be binary (entails / not entails), but we need to distinguish between two different label types. Sometimes we need the label to be encoded in a single dimension (i.e., either 0 or 1), and sometimes we need it to be encoded in two dimensions (i.e., either [0, 1] or [1, 0]). This depends on the activation function of the final layer in our network - a sigmoid activation will need the former, while a softmax activation will need the latter. So, we encode these differently, as strings, which will be converted to the right array later, in IndexedSnliInstance.
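As a minimal sketch (the data line is illustrative, and the “sigmoid” activation string is an assumption based on the *_sigmoid keys in label_mapping above):

    from deep_qa.data.instances.entailment.snli_instance import SnliInstance

    # Format 2 from read_from_line: [text][tab][hypothesis][tab][label].
    instance = SnliInstance.read_from_line(
        "A man is playing a guitar.\tA person is making music.\tentails")
    # Convert the three-way label into a binary entails / not-entails encoding.
    binary_instance = instance.to_entails_instance("sigmoid")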

Reading Comprehension Instances

These Instances are designed for the set of tasks known today as “reading comprehension”, where the input is a natural language question, a passage, and (optionally) some number of answer options, and the output is either a (span begin index, span end index) decision over the passage, or a classification decision over the answer options (if provided).

QuestionPassageInstances

class deep_qa.data.instances.reading_comprehension.question_passage_instance.IndexedQuestionPassageInstance(question_indices: typing.List[int], passage_indices: typing.List[int], label: typing.List[int], index: int = None)[source]

Bases: deep_qa.data.instances.instance.IndexedInstance

This is an indexed instance that is used for (question, passage) pairs.

as_training_data()[source]

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns:

train_data : (inputs, label)

The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.

classmethod empty_instance()[source]

Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.

get_padding_lengths() → typing.Dict[str, int][source]

We need to pad at least the question length, the passage length, and the word length across all the questions and passages. Subclasses that add more arguments should also override this method to enable padding on said arguments.

pad(padding_lengths: typing.Dict[str, int])[source]

In this function, we pad the questions and passages (in terms of number of words in each), as well as the individual words in the questions and passages themselves.

class deep_qa.data.instances.reading_comprehension.question_passage_instance.QuestionPassageInstance(question_text: str, passage_text: str, label: typing.Any, index: int = None)[source]

Bases: deep_qa.data.instances.instance.TextInstance

A QuestionPassageInstance is a base class for datasets that consist primarily of a question text and a passage, where the passage contains the answer to the question. This class should not be used directly, because it does not implement _index_label; use a subclass instead.

_index_label(label: typing.Any) → typing.List[int][source]

Index the labels. Since we don’t know what form the label takes, we leave it to subclasses to implement this method.

to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]

Converts the words in this Instance into indices using the DataIndexer.

Parameters:

data_indexer : DataIndexer

DataIndexer to use in converting the Instance to an IndexedInstance.

Returns:

indexed_instance : IndexedInstance

A TextInstance that has had all of its strings converted into indices.

words() → typing.Dict[str, typing.List[str]][source]

Returns a list of all of the words in this instance, contained in a namespace dictionary.

This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.

Returns:

namespace : Dictionary of {str: List[str]}

The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key ‘words’ to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.

McQuestionPassageInstances

class deep_qa.data.instances.reading_comprehension.mc_question_passage_instance.IndexedMcQuestionPassageInstance(question_indices: typing.List[int], passage_indices: typing.List[int], option_indices: typing.List[typing.List[int]], label: typing.List[int], index: int = None)[source]

Bases: deep_qa.data.instances.reading_comprehension.question_passage_instance.IndexedQuestionPassageInstance

as_training_data()[source]

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns:

train_data : (inputs, label)

The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.

classmethod empty_instance()[source]

Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.

get_padding_lengths() → typing.Dict[str, int][source]

We need to pad the answer option length (in words), the number of answer options, the question length (in words), the passage length (in words), and the word length (in characters) among all the questions, passages, and answer options.

pad(padding_lengths: typing.Dict[str, int])[source]

In this function, we pad the questions and passages (in terms of number of words in each), as well as the individual words in the questions and passages themselves. We also pad the number of answer options, the answer options (in terms of number of words in each), as well as the individual words in the answer options.

class deep_qa.data.instances.reading_comprehension.mc_question_passage_instance.McQuestionPassageInstance(question: str, passage: str, answer_options: typing.List[str], label: int, index: int = None)[source]

Bases: deep_qa.data.instances.reading_comprehension.question_passage_instance.QuestionPassageInstance

A McQuestionPassageInstance is a QuestionPassageInstance that represents a (question, passage, answer_options) tuple, with an associated label indicating the index of the correct answer choice.

_index_label(label: typing.Tuple[int, int]) → typing.List[int][source]

Specify how to index self.label, which is needed to convert the McQuestionPassageInstance into an IndexedInstance (conversion handled in superclass).

classmethod read_from_line(line: str)[source]

Reads a McQuestionPassageInstance object from a line. The format has one of two options:

  1. [example index][tab][passage][tab][question][tab][options][tab][label]
  2. [passage][tab][question][tab][options][tab][label]

The answer_options column is assumed formatted as: [option]###[option]###[option]... That is, we split on three hashes ("###").
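A hypothetical data line matching format 2 above (illustrative values only; whether the label index is zero-based is determined by the class) might be read like this:

    from deep_qa.data.instances.reading_comprehension.mc_question_passage_instance import (
        McQuestionPassageInstance)

    # [passage][tab][question][tab][options separated by ###][tab][label]
    line = ("The dog chased the cat up a tree.\tWhere did the cat go?\t"
            "under the bed###up a tree###outside\t1")
    instance = McQuestionPassageInstance.read_from_line(line)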

to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]

Converts the words in this Instance into indices using the DataIndexer.

Parameters:

data_indexer : DataIndexer

DataIndexer to use in converting the Instance to an IndexedInstance.

Returns:

indexed_instance : IndexedInstance

A TextInstance that has had all of its strings converted into indices.

words() → typing.Dict[str, typing.List[str]][source]

Returns a list of all of the words in this instance, contained in a namespace dictionary.

This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.

Returns:

namespace : Dictionary of {str: List[str]}

The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key ‘words’ to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.

CharacterSpanInstances

class deep_qa.data.instances.reading_comprehension.character_span_instance.CharacterSpanInstance(question: str, passage: str, label: typing.Tuple[int, int], index: int = None)[source]

Bases: deep_qa.data.instances.reading_comprehension.question_passage_instance.QuestionPassageInstance

A CharacterSpanInstance is a QuestionPassageInstance that represents a (question, passage) pair with an associated label, which is the data given for the span prediction task. The label is a span of characters in the passage that indicates where the answer to the question begins and where the answer to the question ends.

The main thing this class handles over QuestionPassageInstance is specifying the form of the label and how to index it; the label is given as a span of _characters_ in the passage. The label we are going to use in the rest of the code is a span of _tokens_ in the passage, so the mapping from character labels to token labels depends on the tokenization we did, and the logic to handle this is, unfortunately, a little complicated. The label conversion happens when converting a CharacterSpanInstance to an IndexedInstance (where character indices are generally lost, anyway).

This class should be used to represent training instances for the SQuAD (Stanford Question Answering) and NewsQA datasets, to name a few.

_index_label(label: typing.Tuple[int, int]) → typing.List[int][source]

Specify how to index self.label, which is needed to convert the CharacterSpanInstance into an IndexedInstance (handled in superclass).

classmethod read_from_line(line: str)[source]

Reads a CharacterSpanInstance object from a line. The format has one of two options:

  1. [example index][tab][question][tab][passage][tab][label]
  2. [question][tab][passage][tab][label]

[label] is assumed to be a comma-separated pair of integers.
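A hypothetical data line matching format 2 above (illustrative values; the label “23,32” is meant as the character span of “up a tree” in the passage, and the exact begin/end convention is determined by the class) might be read like this:

    from deep_qa.data.instances.reading_comprehension.character_span_instance import (
        CharacterSpanInstance)

    # [question][tab][passage][tab][character span label]
    line = "Where did the cat go?\tThe dog chased the cat up a tree.\t23,32"
    instance = CharacterSpanInstance.read_from_line(line)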

stop_token = '@@STOP@@'
to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]

Converts the words in this Instance into indices using the DataIndexer.

Parameters:

data_indexer : DataIndexer

DataIndexer to use in converting the Instance to an IndexedInstance.

Returns:

indexed_instance : IndexedInstance

A TextInstance that has had all of its strings converted into indices.

class deep_qa.data.instances.reading_comprehension.character_span_instance.IndexedCharacterSpanInstance(question_indices: typing.List[int], passage_indices: typing.List[int], label: typing.List[int], index: int = None)[source]

Bases: deep_qa.data.instances.reading_comprehension.question_passage_instance.IndexedQuestionPassageInstance

as_training_data()[source]

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns:

train_data : (inputs, label)

The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.

Sequence Tagging Instances

These Instances are designed for a sequence tagging task, where the input is a passage of natural language (e.g., a sentence), and the output is some classification decision for each token in that passage (e.g., part-of-speech tags, any kind of BIO tagging like NER or chunking, etc.).

TaggingInstances

class deep_qa.data.instances.sequence_tagging.tagging_instance.IndexedTaggingInstance(text_indices: typing.List[int], label: typing.List[int], index: int = None)[source]

Bases: deep_qa.data.instances.instance.IndexedInstance

as_training_data()[source]

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns:

train_data : (inputs, label)

The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.

classmethod empty_instance()[source]

Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.

get_padding_lengths() → typing.Dict[str, int][source]

Returns the length of this instance in all dimensions that require padding.

Different kinds of instances have different fields that are padded, such as sentence length, number of background sentences, number of options, etc.

Returns:

padding_lengths: Dict[str, int]

A dictionary mapping padding keys (like “num_sentence_words”) to lengths.

pad(padding_lengths: typing.Dict[str, int])[source]

Add zero-padding to make each data example of equal length for use in the neural network.

This modifies the current object.

Parameters:

padding_lengths: Dict[str, int]

In this dictionary, each str refers to a type of token (e.g. num_sentence_words), and the corresponding int is the value. This dictionary must have the same keys as were returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.

class deep_qa.data.instances.sequence_tagging.tagging_instance.TaggingInstance(text: str, label: typing.Any, index: int = None)[source]

Bases: deep_qa.data.instances.instance.TextInstance

A TaggingInstance represents a passage of text and a tag sequence over that text.

There are some sticky issues with tokenization and how exactly the label is specified. For example, if your label is a sequence of tags, that assumes a particular tokenization, which interacts in a funny way with our tokenization code. This is a general superclass containing common functionality for most simple sequence tagging tasks. The specifics of reading in data from a file and converting that data into properly-indexed tag sequences is left to subclasses.

_index_label(label: typing.Any, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[int][source]

Index the labels. Since we don’t know what form the label takes, we leave it to subclasses to implement this method. If you need to convert tag names into indices, use the namespace ‘tags’ in the DataIndexer.

tags_in_label()[source]

Returns all of the tag words in this instance, so that we can convert them into indices. This is called in self.words(). Not necessary if you have some pre-indexed labeling scheme.

to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]
words() → typing.Dict[str, typing.List[str]][source]

Returns a list of all of the words in this instance, contained in a namespace dictionary.

This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.

Returns:

namespace : Dictionary of {str: List[str]}

The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key ‘words’ to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.

PretokenizedTaggingInstances

class deep_qa.data.instances.sequence_tagging.pretokenized_tagging_instance.PreTokenizedTaggingInstance(text: typing.List[str], label: typing.List[str], index: int = None)[source]

Bases: deep_qa.data.instances.sequence_tagging.tagging_instance.TaggingInstance

This is a TaggingInstance where the text has been pre-tokenized. Thus the text member variable here is actually a List[str], instead of a str.

When using this Instance, you must use the NoOpWordSplitter as well, or things will break. You probably also do not want any kind of filtering (though stemming is ok), because only the words will get filtered, not the labels.

_index_label(label: typing.List[str], data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[int][source]

Index the labels. Since we don’t know what form the label takes, we leave it to subclasses to implement this method. If you need to convert tag names into indices, use the namespace ‘tags’ in the DataIndexer.

classmethod read_from_line(line: str)[source]

Reads a PreTokenizedTaggingInstance from a line. The format has one of two options:

  1. [example index][tab][token1]###[tag1][tab][token2]###[tag2][tab]...
  2. [token1]###[tag1][tab][token2]###[tag2][tab]...
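A hypothetical data line matching format 2 above (illustrative tokens and tags) might be read like this:

    from deep_qa.data.instances.sequence_tagging.pretokenized_tagging_instance import (
        PreTokenizedTaggingInstance)

    # Each tab-separated field is a token and its tag, joined by "###".
    line = "The###DT\tdog###NN\tbarks###VBZ"
    instance = PreTokenizedTaggingInstance.read_from_line(line)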
tags_in_label()[source]

Returns all of the tag words in this instance, so that we can convert them into indices. This is called in self.words(). Not necessary if you have some pre-indexed labeling scheme.

Text Classification Instances

These Instances are designed for any classification task over a single passage of text. The input is the passage (e.g., a sentence, a document, etc.), and the output is a single label (e.g., positive / negative sentiment, spam / not spam, essay grade, etc.).

TextClassificationInstances

class deep_qa.data.instances.text_classification.text_classification_instance.IndexedTextClassificationInstance(word_indices: typing.List[int], label, index: int = None)[source]

Bases: deep_qa.data.instances.instance.IndexedInstance

as_training_data()[source]

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns:

train_data : (inputs, label)

The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.

classmethod empty_instance()[source]

Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.

get_padding_lengths() → typing.Dict[str, int][source]

Returns the length of this instance in all dimensions that require padding.

Different kinds of instances have different fields that are padded, such as sentence length, number of background sentences, number of options, etc.

Returns:

padding_lengths: Dict[str, int]

A dictionary mapping padding keys (like “num_sentence_words”) to lengths.

pad(padding_lengths: typing.Dict[str, int])[source]

Add zero-padding to make each data example of equal length for use in the neural network.

This modifies the current object.

Parameters:

padding_lengths: Dict[str, int]

In this dictionary, each str refers to a type of token (e.g. num_sentence_words), and the corresponding int is the value. This dictionary must have the same keys as were returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.

class deep_qa.data.instances.text_classification.text_classification_instance.TextClassificationInstance(text: str, label: bool, index: int = None)[source]

Bases: deep_qa.data.instances.instance.TextInstance

A TextClassificationInstance is a TextInstance that is a single passage of text, where that passage has some associated (categorical, or possibly real-valued) label.

classmethod read_from_line(line: str)[source]

Reads a TextClassificationInstance object from a line. The format has one of four options:

  1. [sentence]
  2. [sentence index][tab][sentence]
  3. [sentence][tab][label]
  4. [sentence index][tab][sentence][tab][label]

If no label is given, we use None as the label.

to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]

Converts the words in this Instance into indices using the DataIndexer.

Parameters:

data_indexer : DataIndexer

DataIndexer to use in converting the Instance to an IndexedInstance.

Returns:

indexed_instance : IndexedInstance

A TextInstance that has had all of its strings converted into indices.

words() → typing.Dict[str, typing.List[str]][source]

Returns a list of all of the words in this instance, contained in a namespace dictionary.

This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.

Returns:

namespace : Dictionary of {str: List[str]}

The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key ‘words’ to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.

Tokenizers

character_tokenizer

class deep_qa.data.tokenizers.character_tokenizer.CharacterTokenizer(params: deep_qa.common.params.Params)[source]

Bases: deep_qa.data.tokenizers.tokenizer.Tokenizer

A CharacterTokenizer splits strings into character tokens.

Notes

Note that in the code, we’re still using the “words” namespace, and the “num_sentence_words” padding key, instead of using a different “characters” namespace. This is so that the rest of the code doesn’t have to change as much to just use this different tokenizer. For example, this is an issue when adding start and stop tokens - how is an Instance class supposed to know if it should use the “words” or the “characters” namespace when getting a start token id? If we just always use the “words” namespace for the top-level token namespace, it’s not an issue.

But confusingly, we’ll still use the “characters” embedding key... At least the user-facing parts all use characters; it’s only in writing tokenizer code that you need to be careful about namespaces. TODO(matt): it probably makes sense to change the default namespace to “tokens”, and use that for both the words in WordTokenizer and the characters in CharacterTokenizer, so the naming isn’t so confusing.

embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')[source]

Applies embedding layers to the input_layer. See TextTrainer._embed_input for a more detailed comment on what this method does.

Parameters:

input_layer: Keras ``Input()`` layer

The layer to embed.

embed_function: Callable[[‘Layer’, str, str], ‘Tensor’]

This should be the __get_embedded_input method from your instantiated TextTrainer. This function actually applies an Embedding layer (and maybe also a projection and dropout) to the input layer.

text_trainer: TextTrainer

Simple Tokenizers will just need to use the embed_function that gets passed as a parameter here, but complex Tokenizers might need more than just an embedding function. So that you can get an encoder or other things from the TextTrainer here if you need them, we take this object as a parameter.

embedding_suffix: str, optional (default=””)

A suffix to add to embedding keys that we use, so that, e.g., you could specify several different word embedding matrices, for whatever reason.

get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int][source]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.

get_sentence_shape(sentence_length: int, word_length: int) → typing.Tuple[int][source]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).

get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]][source]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either ‘words’ or ‘characters’. An example for indexing the string ‘the’ might be {‘words’: [‘the’], ‘characters’: [‘t’, ‘h’, ‘e’]}, if you are indexing both words and characters.

index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[source]

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.

tokenize(text: str) → typing.List[str][source]

Actually splits the string into a sequence of tokens. Note that this will only give you top-level tokenization! If you’re using a word-and-character tokenizer, for instance, this will only return the word tokenization.

tokenizer

class deep_qa.data.tokenizers.tokenizer.Tokenizer(params: deep_qa.common.params.Params)[source]

Bases: object

A Tokenizer splits strings into sequences of tokens that can be used in a model. The “tokens” here could be words, characters, or words and characters. The Tokenizer object handles various things involved with this conversion, including getting a list of tokens for pre-computing a vocabulary, getting the shape of a word sequence in a model, etc. The Tokenizer needs to handle these things because the tokenization you do could affect the shape of word sequence tensors in the model (e.g., a sentence could have shape (num_words,), (num_characters,), or (num_words, num_characters)).

static _spans_match(sentence_tokens: typing.List[str], span_tokens: typing.List[str], index: int) → bool[source]
char_span_to_token_span(sentence: str, span: typing.Tuple[int, int], slack: int = 3) → typing.Tuple[int, int][source]

Converts a character span from a sentence into the corresponding token span in the tokenized version of the sentence. If you pass in a character span that does not correspond to complete tokens in the tokenized version, we’ll do our best, but the behavior is officially undefined.

The basic outline of this method is to find the token that starts the same number of characters into the sentence as the given character span. We try to handle a bit of error in the tokenization by checking slack tokens in either direction from that initial estimate.

The returned (begin, end) indices are inclusive for begin, and exclusive for end. So, for example, (2, 2) is an empty span, (2, 3) is the one-word span beginning at token index 2, and so on.
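A minimal sketch of the conversion described above (illustrative values; WordTokenizer, documented below, inherits this method, and the input character span is assumed to be end-exclusive):

    from deep_qa.common.params import Params
    from deep_qa.data.tokenizers.word_tokenizer import WordTokenizer

    tokenizer = WordTokenizer(Params({}))
    sentence = "the cat sat on the mat"
    # "cat" occupies characters 4-6 of the sentence and is token 1 of the word
    # tokenization, so we would expect the token span (1, 2).
    begin, end = tokenizer.char_span_to_token_span(sentence, (4, 7))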

embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')[source]

Applies embedding layers to the input_layer. See TextTrainer._embed_input for a more detailed comment on what this method does.

Parameters:

input_layer: Keras ``Input()`` layer

The layer to embed.

embed_function: Callable[[‘Layer’, str, str], ‘Tensor’]

This should be the __get_embedded_input method from your instantiated TextTrainer. This function actually applies an Embedding layer (and maybe also a projection and dropout) to the input layer.

text_trainer: TextTrainer

Simple Tokenizers will just need to use the embed_function that gets passed as a parameter here, but complex Tokenizers might need more than just an embedding function. So that you can get an encoder or other things from the TextTrainer here if you need them, we take this object as a parameter.

embedding_suffix: str, optional (default=””)

A suffix to add to embedding keys that we use, so that, e.g., you could specify several different word embedding matrices, for whatever reason.

get_custom_objects() → typing.Dict[str, typing.Layer][source]

If you use any custom Layers in your embed_input method, you need to return them here, so that the TextTrainer can correctly load models.

get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int][source]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.

get_sentence_shape(sentence_length: int, word_length: int) → typing.Tuple[int][source]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).

get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]][source]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either ‘words’ or ‘characters’. An example for indexing the string ‘the’ might be {‘words’: [‘the’], ‘characters’: [‘t’, ‘h’, ‘e’]}, if you are indexing both words and characters.

index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[source]

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.

tokenize(text: str) → typing.List[str][source]

Actually splits the string into a sequence of tokens. Note that this will only give you top-level tokenization! If you’re using a word-and-character tokenizer, for instance, this will only return the word tokenization.

word_and_character_tokenizer

class deep_qa.data.tokenizers.word_and_character_tokenizer.WordAndCharacterTokenizer(params: deep_qa.common.params.Params)[source]

Bases: deep_qa.data.tokenizers.tokenizer.Tokenizer

A WordAndCharacterTokenizer first splits strings into words, then splits those words into characters, and returns a representation that contains both a word index and a sequence of character indices for each word. See the documentation for WordTokenizer for a note about naming, and the typical notion of “tokenization” in NLP.

Notes

In embed_input, this Tokenizer uses an encoder to get a character-level word embedding, which then gets concatenated with a standard word embedding from an embedding matrix. To specify the encoder to use for this character-level word embedding, use the "word" key in the encoder parameter to your model (which should be a TextTrainer subclass - see the documentation there for some more info). If you do not give a "word" key in the encoder dict, we’ll create a new encoder using the "default" parameters.

embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')[source]

A combined word-and-characters representation requires some fancy footwork to do the embedding properly.

This method assumes the input shape is (..., sentence_length, word_length + 1), where the first integer for each word in the tensor is the word index, and the remaining word_length entries is the character sequence. We’ll first split this into two tensors, one of shape (..., sentence_length), and one of shape (..., sentence_length, word_length), where the first is the word sequence, and the second is the character sequence for each word. We’ll pass the word sequence through an embedding layer, as normal, and pass the character sequence through a _separate_ embedding layer, then an encoder, to get a word vector out. We’ll then concatenate the two word vectors, returning a tensor of shape (..., sentence_length, embedding_dim * 2).
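A rough illustration (not the library’s code) of the tensor split described above, assuming a 3D input of shape (batch, sentence_length, word_length + 1):

    from keras.layers import Input, Lambda

    sentence_length, word_length = 30, 10  # illustrative sizes
    combined_input = Input(shape=(sentence_length, word_length + 1), dtype='int32')
    # The first entry for each word is the word index...
    words = Lambda(lambda x: x[:, :, 0])(combined_input)   # (batch, sentence_length)
    # ...and the remaining entries are that word's character indices.
    chars = Lambda(lambda x: x[:, :, 1:])(combined_input)  # (batch, sentence_length, word_length)
    # The word indices then go through a normal embedding layer, while the character
    # indices go through a separate embedding plus an encoder; the two resulting
    # word vectors are concatenated.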

get_custom_objects() → typing.Dict[str, typing.Any][source]

If you use any custom Layers in your embed_input method, you need to return them here, so that the TextTrainer can correctly load models.

get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int][source]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.

get_sentence_shape(sentence_length: int, word_length: int = None) → typing.Tuple[int][source]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).

get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]][source]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either ‘words’ or ‘characters’. An example for indexing the string ‘the’ might be {‘words’: [‘the’], ‘characters’: [‘t’, ‘h’, ‘e’]}, if you are indexing both words and characters.

index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[source]

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.

tokenize(text: str) → typing.List[str][source]

Actually splits the string into a sequence of tokens. Note that this will only give you top-level tokenization! If you’re using a word-and-character tokenizer, for instance, this will only return the word tokenization.

word_splitter

class deep_qa.data.tokenizers.word_splitter.NltkWordSplitter[source]

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

A tokenizer that uses nltk’s word_tokenize method.

I found that nltk is very slow, so I switched to using my own simple one, which is a good deal faster. But I’m adding this one back so that there’s consistency with older versions of the code, if you really want it.

split_words(sentence: str) → typing.List[str][source]
class deep_qa.data.tokenizers.word_splitter.NoOpWordSplitter[source]

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

This is a word splitter that does nothing. We’re playing a little loose with python’s dynamic typing, breaking the typical WordSplitter API a bit and assuming that you’ve already split sentence into a list somehow, so you don’t need to do anything else here. For example, the PreTokenizedTaggingInstance requires this word splitter, because it reads in pre-tokenized data from a file.

split_words(sentence: str) → typing.List[str][source]
class deep_qa.data.tokenizers.word_splitter.SimpleWordSplitter[source]

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

Does really simple tokenization. NLTK was too slow, so we wrote our own simple tokenizer instead. This just does an initial split(), followed by some heuristic filtering of each whitespace-delimited token, separating contractions and punctuation. We assume lower-cased, reasonably well-formed English sentences as input.

_can_split(token: str)[source]
split_words(sentence: str) → typing.List[str][source]

Splits a sentence into word tokens. We handle four kinds of things: words with punctuation that should be ignored as a special case (Mr., Mrs., etc.), contractions/genitives (isn’t, don’t, Matt’s), and beginning and ending punctuation (“antennagate”, (parentheticals), and such).

The basic outline is to split on whitespace, then check each of these cases. First, we strip off beginning punctuation, then strip off ending punctuation, then strip off contractions. When we strip something off the beginning of a word, we can add it to the list of tokens immediately. When we strip it off the end, we have to save it to be added to after the word itself has been added. Before stripping off any part of a token, we first check to be sure the token isn’t in our list of special cases.

class deep_qa.data.tokenizers.word_splitter.SpacyWordSplitter[source]

Bases: deep_qa.data.tokenizers.word_splitter.WordSplitter

A tokenizer that uses spaCy’s Tokenizer, which is much faster than the others.

split_words(sentence: str) → typing.List[str][source]
class deep_qa.data.tokenizers.word_splitter.WordSplitter[source]

Bases: object

A WordSplitter splits strings into words. This is typically called a “tokenizer” in NLP, but we need Tokenizer to refer to something else, so we’re using WordSplitter here instead.

split_words(sentence: str) → typing.List[str][source]

tokenizers.word_tokenizer

class deep_qa.data.tokenizers.word_tokenizer.WordTokenizer(params: deep_qa.common.params.Params)[source]

Bases: deep_qa.data.tokenizers.tokenizer.Tokenizer

A WordTokenizer splits strings into word tokens.

There are several ways that you can split a string into words, so we rely on a WordProcessor to do that work for us. Note that we’re using the word “tokenizer” here for something different than is typical in NLP - we’re referring here to how strings are represented as numpy arrays, not the linguistic notion of splitting sentences into tokens. Those things are handled in the WordProcessor, which is a common dependency in several Tokenizers.

Parameters:

processor: Dict[str, Any], default={}

Contains parameters for processing text strings into word tokens, including, e.g., splitting, stemming, and filtering words. See WordProcessor for a complete description of available parameters.

embed_input(input_layer: keras.engine.topology.Layer, embed_function: typing.Callable[[keras.engine.topology.Layer, str, str], keras.engine.topology.Layer], text_trainer, embedding_suffix: str = '')[source]

Applies embedding layers to the input_layer. See TextTrainer._embed_input for a more detailed comment on what this method does.

Parameters:

input_layer: Keras ``Input()`` layer

The layer to embed.

embed_function: Callable[[‘Layer’, str, str], ‘Tensor’]

This should be the __get_embedded_input method from your instantiated TextTrainer. This function actually applies an Embedding layer (and maybe also a projection and dropout) to the input layer.

text_trainer: TextTrainer

Simple Tokenizers will just need to use the embed_function that gets passed as a parameter here, but complex Tokenizers might need more than just an embedding function. So that you can get an encoder or other things from the TextTrainer here if you need them, we take this object as a parameter.

embedding_suffix: str, optional (default=””)

A suffix to add to embedding keys that we use, so that, e.g., you could specify several different word embedding matrices, for whatever reason.

get_padding_lengths(sentence_length: int, word_length: int) → typing.Dict[str, int][source]

When dealing with padding in TextTrainer, TextInstances need to know what to pad and how much. This function takes a potential max sentence length and word length, and returns a lengths dictionary containing keys for the padding that is applicable to this encoding.

get_sentence_shape(sentence_length: int, word_length: int) → typing.Tuple[int][source]

If we have a text sequence of length sentence_length, what shape would that correspond to with this encoding? For words or characters only, this would just be (sentence_length,). For an encoding that contains both words and characters, it might be (sentence_length, word_length).
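
For example (a sketch with illustrative key names and values; the exact keys depend on the tokenizer you’re using):

    # Illustrative values only.
    sentence_length, word_length = 30, 10

    # A words-only encoding pads (and shapes) along one dimension:
    words_only_lengths = {'num_sentence_words': sentence_length}
    words_only_shape = (sentence_length,)                        # (30,)

    # A word-and-character encoding adds a per-word character dimension:
    word_and_char_lengths = {'num_sentence_words': sentence_length,
                             'num_word_characters': word_length}
    word_and_char_shape = (sentence_length, word_length)         # (30, 10)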

get_words_for_indexer(text: str) → typing.Dict[str, typing.List[str]][source]

The DataIndexer needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes some text and returns whatever the DataIndexer would be asked to index from that text. Note that this returns a dictionary of token lists keyed by namespace. Typically, the key would be either ‘words’ or ‘characters’. An example for indexing the string ‘the’ might be {‘words’: [‘the’], ‘characters’: [‘t’, ‘h’, ‘e’]}, if you are indexing both words and characters.

index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[source]

This method actually converts some text into an indexed list. This could be a list of integers (for either word tokens or characters), or it could be a list of arrays (for word tokens combined with characters), or something else.

tokenize(text: str) → typing.List[str][source]

Actually splits the string into a sequence of tokens. Note that this will only give you top-level tokenization! If you’re using a word-and-character tokenizer, for instance, this will only return the word tokenization.

Data Generators

class deep_qa.data.data_generator.DataGenerator(text_trainer, params: deep_qa.common.params.Params)[source]

Bases: object

A DataGenerator takes an IndexedDataset and converts it into a generator, yielding batches suitable for training. You might want to do this instead of just creating one large set of numpy arrays for a few reasons:

  1. Creating large arrays for your whole data could take a whole lot of memory, maybe more than is available on your machine.
  2. Creating one large array means padding all of your instances to the same length. This typically means you waste a whole lot of computation on padding tokens. Using a DataGenerator instead allows you to only pad each batch to the same length, instead of all of your instances across your whole dataset. We’ve typically seen a 4-5x speed up just from doing this (partially because Keras is pretty bad at doing variable-length computation; the speed-up isn’t quite as large with plain tensorflow, I think).
  3. If we’re varying the padding lengths in each batch, we can also vary the batch size, to optimize GPU memory usage. This means we’ll use smaller batch sizes for big instances, and larger batch sizes for small instances. We’ve seen speedups up to 10-12x (on top of the 4-5x speed up above) from doing this.
Parameters:

text_trainer: TextTrainer

We need access to the TextTrainer object so we can call some methods on it, such as get_instance_sorting_keys().

dynamic_padding: bool, optional (default=False)

If True, we will set padding lengths based on the data per batch, instead of on the whole dataset. This only works if your model is structured to allow variable-length sequences (typically using None for specific dimensions when you build your model), and if you don’t set padding values in _set_padding_lengths(). This flag specifically is read in _set_padding_lengths() to know if we should set certain padding values or not. It’s handled correctly for num_sentence_words and num_word_characters in TextTrainer, but you need to be sure to implement it correctly in subclasses for this to work.

padding_noise: double, optional (default=.1)

When sorting by padding length, we add a bit of noise to the lengths, so that the sorting isn’t deterministic. This parameter determines how much noise we add, as a percentage of the actual padding value for each instance.

sort_every_epoch: bool, optional (default=True)

If True, we will re-sort the data after every epoch, then re-group the instances into batches. If padding_noise is zero, this does nothing, but if it’s non-zero, this will give you a slightly different ordering, so you don’t have exactly the same batches at every epoch. If you’re doing adaptive batch sizes, this will lead to re-computing the adaptive batches each epoch, which could give a different number of batches for the whole dataset, which means each “epoch” might no longer correspond to exactly one pass over the data. This is probably a pretty minor issue, though.

adaptive_batch_sizes: bool, optional (default=False)

Only relevant if dynamic_padding is True. If adaptive_batch_sizes is True, we will vary the batch size to try to optimize GPU memory usage. Because padding lengths are done dynamically, we can have larger batches when padding lengths are smaller, maximizing our usage of the GPU. In order for this to work, you need to do two things: (1) override _get_padding_memory_scaling() to give a big-O bound on memory usage given padding lengths, and (2) tune the adaptive_memory_usage_constant parameter for your particular model and GPU. See the documentation for _get_padding_memory_scaling() for more information.

adaptive_memory_usage_constant: int, optional (default=None)

Only relevant if adaptive_batch_sizes is True. This is a manually-tuned parameter, specific to a particular model architecture and amount of GPU memory (e.g., if you change the number of hidden layers in your model, this number will need to change). See _get_padding_memory_scaling() for more detail. The recommended way to tune this parameter is to (1) use a fixed batch size, with biggest_batch_first set to True, and find out the maximum batch size you can handle on your biggest instances without running out of memory. Then (2) turn on adaptive_batch_sizes, and set this parameter so that you get the right batch size for your biggest instances. If you set the log level to DEBUG in scripts/run_model.py, you can see the batch sizes that are computed.

maximum_batch_size: int, optional (default=1000000)

If we’re using adaptive batch sizes, you can use this to be sure you do not create batches larger than this, even if you have enough memory to handle it on your GPU. You might choose to do this to keep smaller batches because you like the noisier gradient estimates that come from smaller batches, for instance.

biggest_batch_first: bool, optional (default=False)

This is largely for testing, to see how large of a batch you can safely use with your GPU. It’s only meaningful if you’re using dynamic padding - this will let you try out the largest batch that you have in the data first, so that if you’re going to run out of memory, you know it early, instead of waiting through the whole batch to find out at the end that you’re going to crash.

create_generator(dataset: deep_qa.data.datasets.dataset.IndexedDataset, batch_size: int = None)[source]

Main external API call: converts an IndexedDataset into a data generator suitable for use with Keras’ fit_generator and related methods.

last_num_batches = None

This field can be read after calling create_generator to get the number of steps you should take per epoch in model.fit_generator or model.evaluate_generator for this data.
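
Putting this together, here is a hedged sketch of configuring and using a DataGenerator. The text_trainer, indexed_dataset, and model objects are assumed to exist, the parameter values are illustrative, and we assume Params simply wraps a plain dictionary:

    from deep_qa.common.params import Params
    from deep_qa.data.data_generator import DataGenerator

    generator_params = Params({
        'dynamic_padding': True,
        'padding_noise': 0.1,
        'adaptive_batch_sizes': True,
        'adaptive_memory_usage_constant': 400000,   # must be tuned per model and GPU
        'maximum_batch_size': 128,
    })
    data_generator = DataGenerator(text_trainer, generator_params)   # text_trainer assumed

    batches = data_generator.create_generator(indexed_dataset)       # indexed_dataset assumed
    # The argument name below depends on your Keras version
    # (steps_per_epoch vs. samples_per_epoch); model is assumed to exist.
    model.fit_generator(batches, steps_per_epoch=data_generator.last_num_batches)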

Datasets

deep_qa.data.dataset

class deep_qa.data.datasets.dataset.Dataset(instances: typing.List[deep_qa.data.instances.instance.Instance])[source]

Bases: object

A collection of Instances.

This base class has general methods that apply to all collections of Instances. That basically is just methods that operate on sets, like merging and truncating.

merge(other: deep_qa.data.datasets.dataset.Dataset) → deep_qa.data.datasets.dataset.Dataset[source]

Combine two datasets. If you try to merge two Datasets of the same subtype, you will end up with a Dataset of the same type (i.e., calling IndexedDataset.merge() with another IndexedDataset will return an IndexedDataset). If the types differ, this method currently raises an error, because the underlying Instance objects are not currently type compatible.

truncate(max_instances: int)[source]

If there are more instances than max_instances in this dataset, returns a new dataset with a random subset of size max_instances. If there are fewer than max_instances already, we just return self.

class deep_qa.data.datasets.dataset.IndexedDataset(instances: typing.List[deep_qa.data.instances.instance.IndexedInstance])[source]

Bases: deep_qa.data.datasets.dataset.Dataset

A Dataset of IndexedInstances, with some helper methods.

IndexedInstances have text sequences replaced with lists of word indices, and are thus able to be padded to consistent lengths and converted to training inputs.

as_training_data()[source]

Takes each IndexedInstance and converts it into (inputs, labels), according to the Instance’s as_training_data() method. Both the inputs and the labels are numpy arrays. Note that if the Instances return tuples for their inputs, we convert the list of tuples into a tuple of lists, before converting everything to numpy arrays.

pad_instances(padding_lengths: typing.Dict[str, int] = None, verbose: bool = True)[source]

Makes all of the IndexedInstances in the dataset have the same length by padding them. This Dataset object doesn’t know what things there are in the Instance to pad, but the Instances do, and so does the model that called us, passing in a padding_lengths dictionary. The keys in that dictionary must match the lengths that the Instance knows about.

Given that, this method does two things: (1) it asks each of the Instances what their padding lengths are and takes the max for each key (using padding_lengths()), reconciling those values with the padding_lengths we were passed as an argument to this method, and (2) it pads the instances with IndexedInstance.pad(). If padding_lengths has a particular key specified with a value, that value takes precedence over whatever we computed from our data. TODO(matt): with dynamic padding, we should probably have this be a max padding length, not a hard setting, but that requires some API changes.

This method modifies the current object, it does not return a new IndexedDataset.

Parameters:

padding_lengths: Dict[str, int]

If a key is present in this dictionary with a non-None value, we will pad to that length instead of the length calculated from the data. This lets you, e.g., set a maximum value for sentence length, or word length, if you want to throw out long sequences.

verbose: bool, optional (default=True)

Should we output logging information when we’re doing this padding? If the dataset is large, this is nice to have, because padding a large dataset could take a long time. But if you’re doing this inside of a data generator, having all of this output per batch is a bit obnoxious.
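
A short usage sketch (dataset is assumed to be an IndexedDataset, and the key name and values are illustrative):

    lengths = dataset.padding_lengths()      # e.g. {'num_sentence_words': 187, ...}
    lengths['num_sentence_words'] = 50       # a hard cap; this overrides the computed value
    dataset.pad_instances(padding_lengths=lengths, verbose=False)
    inputs, labels = dataset.as_training_data()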

padding_lengths()[source]
sort_by_padding(sorting_keys: typing.List[str], padding_noise: float = 0.0)[source]

Sorts the Instances in this Dataset by their padding lengths, using the keys in sorting_keys (in the order in which they are provided).

class deep_qa.data.datasets.dataset.TextDataset(instances: typing.List[deep_qa.data.instances.instance.TextInstance], params: deep_qa.common.params.Params = None)[source]

Bases: deep_qa.data.datasets.dataset.Dataset

A Dataset of TextInstances, with a few helper methods.

TextInstances aren’t useful for much with Keras until they’ve been indexed. So this class just has methods to read in data from a file and convert it into other kinds of Datasets.

static read_from_file(filename: str, instance_class, params: deep_qa.common.params.Params = None)[source]
static read_from_lines(lines: typing.List[str], instance_class, params: deep_qa.common.params.Params = None)[source]
to_indexed_dataset(data_indexer: deep_qa.data.data_indexer.DataIndexer) → deep_qa.data.datasets.dataset.IndexedDataset[source]

Converts the Dataset into an IndexedDataset, given a DataIndexer.

deep_qa.data.datasets.dataset.log_label_counts(instances: typing.List[deep_qa.data.instances.instance.TextInstance])[source]

Entailment

class deep_qa.data.datasets.entailment.snli_dataset.SnliDataset(instances: typing.List[deep_qa.data.instances.instance.TextInstance], params: deep_qa.common.params.Params = None)[source]

Bases: deep_qa.data.datasets.dataset.TextDataset

static read_from_file(filename: str, instance_class, params: deep_qa.common.params.Params = None)[source]

Language Modeling

class deep_qa.data.datasets.language_modeling.language_modeling_dataset.LanguageModelingDataset(instances: typing.List[deep_qa.data.instances.instance.TextInstance], params: deep_qa.common.params.Params = None)[source]

Bases: deep_qa.data.datasets.dataset.TextDataset

static read_from_file(filename: str, instance_class, params: deep_qa.common.params.Params = None)[source]

General Data Utils

deep_qa.data.data_indexer

class deep_qa.data.data_indexer.DataIndexer[source]

Bases: object

A DataIndexer maps strings to integers, allowing for strings to be mapped to an out-of-vocabulary token.

DataIndexers are fit to a particular dataset, which we use to decide which words are in-vocabulary.

DataIndexers also allow for several different namespaces, so you can have separate word indices for ‘a’ as a word, and ‘a’ as a character, for instance. Most of the methods on this class allow you to pass in a namespace; by default we use the ‘words’ namespace, and you can omit the namespace argument everywhere and just use the default.

add_word_to_index(word: str, namespace: str = 'words') → int[source]

Adds word to the index, if it is not already present. Either way, we return the index of the word.

finalize()[source]
fit_word_dictionary(dataset, min_count: int = 1)[source]

Given a Dataset, this method decides which words are given an index, and which ones are mapped to an OOV token (in this case “UNK”). This method must be called before any dataset is indexed with this DataIndexer. If you don’t first fit the word dictionary, you’ll basically map every token onto “UNK”.

We call instance.words() for each instance in the dataset, and then keep all words that appear at least min_count times.

Parameters:

dataset: ``TextDataset``

The dataset to index.

min_count: int, optional (default=1)

The minimum number of occurrences a word must have in the dataset in order to be assigned an index.

get_vocab_size(namespace: str = 'words')[source]
get_word_from_index(index: int, namespace: str = 'words')[source]
get_word_index(word: str, namespace: str = 'words')[source]
set_from_file(filename: str, oov_token: str = '@@UNKNOWN@@', namespace: str = 'words')[source]
words_in_index(namespace: str = 'words')[source]
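
A brief sketch of typical DataIndexer usage (text_dataset is assumed to be a TextDataset built elsewhere):

    from deep_qa.data.data_indexer import DataIndexer

    data_indexer = DataIndexer()
    data_indexer.fit_word_dictionary(text_dataset, min_count=2)   # text_dataset assumed

    data_indexer.get_word_index('the')                       # index in the 'words' namespace
    data_indexer.get_word_index('t', namespace='characters') # useful if your tokenizer also indexes characters
    data_indexer.get_vocab_size()                             # vocabulary size for 'words'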

deep_qa.data.embeddings

class deep_qa.data.embeddings.PretrainedEmbeddings[source]

Bases: object

static get_embedding_layer(embeddings_filename: str, data_indexer: deep_qa.data.data_indexer.DataIndexer, trainable=False, log_misses=False, name='pretrained_embedding')[source]

Reads a pre-trained embedding file and generates a Keras Embedding layer that has weights initialized to the pre-trained embeddings. The Embedding layer can either be trainable or not.

We use the DataIndexer to map from the word strings in the embeddings file to the indices that we need, and to know which words from the embeddings file we can safely ignore. If we come across a word in the DataIndexer that does not show up in the embeddings file, we give it a zero vector.

The embeddings file is assumed to be gzipped, formatted as [word] [dim 1] [dim 2] ...
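
To make the expected input concrete, here is a hedged sketch; the filename is made up, and data_indexer is assumed to be a DataIndexer that has already been fit:

    # Each line of the (gzipped) embeddings file is a word followed by its vector:
    #     the 0.418 0.24968 -0.41242 ...
    #     cat 0.23682 -0.16899 0.40951 ...
    from deep_qa.data.embeddings import PretrainedEmbeddings

    embedding_layer = PretrainedEmbeddings.get_embedding_layer(
            'glove.6B.100d.txt.gz',    # assumed path to a gzipped embedding file
            data_indexer,              # a DataIndexer that has already been fit
            trainable=False,
            log_misses=True)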

static initialize_random_matrix(shape, seed=1337)[source]

About Models

In this module we define a number of concrete models. The models are grouped by task, where each task has a roughly coherent input/output specification. See the README in each submodule for a description of the task that the models in that submodule are designed to solve.

You should think of these models as more of “model families” than actual models, though, as there are typically options left unspecified in the models themselves. For example, models in this module might have a layer that encodes word sequences into vectors; they just call a method on TextTrainer to get an encoder, and the decision for which actual encoder is used (an LSTM, a CNN, or something else) happens in the parameters passed to TextTrainer. If you really want to, you can hard-code specific decisions for these things, but most models we have here use the TextTrainer API to abstract away these decisions, giving implementations of a class of similar models, instead of a single model.

We also define a few general Pretrainers in a submodule here. The Pretrainers in this top-level submodule are suitable to pre-train a large class of models (e.g., any model that encodes sentences), while more task-specific Pretrainers are found in that task’s submodule.

Below, we describe a few popular models that we’ve implemented and include our output when training.

Attention Sum Reader

The Attention Sum Reader Network is implemented in attention_sum_reader.

Press to show/hide train logs

Train Logs:

Using Theano backend.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5105)
/home/nelsonl/miniconda3/envs/deep_qa/lib/python3.5/site-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
  warnings.warn(warn)
2017-01-26 23:52:54,082 - INFO - deep_qa.common.checks - Keras version: 1.2.0
2017-01-26 23:52:54,082 - INFO - deep_qa.common.checks - Theano version: 0.8.2
2017-01-26 23:52:54,269 - INFO - __main__ - Training model
2017-01-26 23:52:54,270 - INFO - deep_qa.training.trainer - Running training (TextTrainer)
2017-01-26 23:52:54,270 - INFO - deep_qa.training.trainer - Getting training data
2017-01-26 23:52:58,914 - INFO - deep_qa.data.dataset - Finished reading dataset; label counts: [(0, 42399), (1, 44896), (2, 23832), (3, 11274), (4, 585)]
2017-01-26 23:58:07,539 - INFO - deep_qa.training.text_trainer - Indexing dataset
2017-01-27 00:03:28,722 - INFO - deep_qa.training.text_trainer - Padding dataset to lengths {'num_option_words': None, 'num_question_words': None, 'word_sequence_length': None, 'num_options': None, 'num_passage_words': None}
2017-01-27 00:03:28,722 - INFO - deep_qa.data.dataset - Getting max lengths from instances
2017-01-27 00:03:29,714 - INFO - deep_qa.data.dataset - Instance max lengths: {'num_option_words': 68, 'num_question_words': 121, 'num_options': 5, 'num_passage_words': 3090}
2017-01-27 00:03:29,714 - INFO - deep_qa.data.dataset - Now actually padding instances to length: {'num_option_words': 68, 'num_question_words': 121, 'num_options': 5, 'num_passage_words': 3090}
2017-01-27 00:05:40,054 - INFO - deep_qa.training.trainer - Getting validation data
2017-01-27 00:05:40,347 - INFO - deep_qa.data.dataset - Finished reading dataset; label counts: [(0, 3522), (1, 3429), (2, 1835), (3, 784), (4, 430)]
2017-01-27 00:05:40,348 - INFO - deep_qa.training.text_trainer - Indexing dataset
2017-01-27 00:06:02,773 - INFO - deep_qa.training.text_trainer - Padding dataset to lengths {'num_option_words': 68, 'num_question_words': 121, 'word_sequence_length': None, 'num_options': 5, 'num_passage_words': 3090}
2017-01-27 00:06:02,774 - INFO - deep_qa.data.dataset - Getting max lengths from instances
2017-01-27 00:06:02,851 - INFO - deep_qa.data.dataset - Instance max lengths: {'num_option_words': 8, 'num_question_words': 95, 'num_options': 5, 'num_passage_words': 2186}
2017-01-27 00:06:02,851 - INFO - deep_qa.data.dataset - Now actually padding instances to length: {'num_option_words': 68, 'num_question_words': 121, 'num_options': 5, 'num_passage_words': 3090}
2017-01-27 00:06:13,387 - INFO - deep_qa.training.trainer - Building the model
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
document_input (InputLayer)      (None, 3090)          0
____________________________________________________________________________________________________
question_input (InputLayer)      (None, 121)           0
____________________________________________________________________________________________________
word_embedding (TimeDistributedE multiple              80112384    question_input[0][0]
                                                                   document_input[0][0]
____________________________________________________________________________________________________
bidirectional_1 (Bidirectional)  (None, 768)           1476864     word_embedding[0][0]
____________________________________________________________________________________________________
bidirectional_2 (Bidirectional)  (None, 3090, 768)     1476864     word_embedding[1][0]
____________________________________________________________________________________________________
question_document_softmax (Atten (None, 3090)          0           bidirectional_1[0][0]
                                                                   bidirectional_2[0][0]
____________________________________________________________________________________________________
options_input (InputLayer)       (None, 5, 68)         0
____________________________________________________________________________________________________
options_probability_sum (OptionA (None, 5)             0           document_input[0][0]
                                                                   question_document_softmax[0][0]
                                                                   options_input[0][0]
____________________________________________________________________________________________________
l1normalize_1 (L1Normalize)      (None, 5)             0           options_probability_sum[0][0]
====================================================================================================
Total params: 83,066,112
Trainable params: 83,066,112
Non-trainable params: 0
____________________________________________________________________________________________________
Train on 127786 samples, validate on 10000 samples
Epoch 1/5
127786/127786 [==============================] - 34850s - loss: 1.0131 - acc: 0.5290 - val_loss: 0.9776 - val_acc: 0.5624
Epoch 2/5
127786/127786 [==============================] - 34828s - loss: 0.6713 - acc: 0.7267 - val_loss: 1.0838 - val_acc: 0.5514
Epoch 3/5
127786/127786 [==============================] - 34835s - loss: 0.2720 - acc: 0.8996 - val_loss: 1.4446 - val_acc: 0.5335

Entailment Models

Entailment models take two sequences of text as input and make a classification decision on the pair. Typically that decision represents whether one sentence entails the other, but we’ll use this family of models to represent any kind of classification decision over pairs of text.

Inputs: Two text sequences

Output: Some classification decision (typically “entails/not entails”, “entails/neutral/contradicts”, or similar)

DecomposableAttention

class deep_qa.models.entailment.decomposable_attention.DecomposableAttention(params: deep_qa.common.params.Params)[source]

Bases: deep_qa.training.text_trainer.TextTrainer

This TextTrainer implements the Decomposable Attention model described in “A Decomposable Attention Model for Natural Language Inference”, by Parikh et al., 2016, with some optional enhancements before the decomposable attention actually happens. Specifically, Parikh’s original model took plain word embeddings as input to the decomposable attention; we allow other operations that transform these word embeddings, such as running a biLSTM on them, before running the decomposable attention layer.

Inputs:

  • A “text” sentence, with shape (batch_size, sentence_length)
  • A “hypothesis” sentence, with shape (batch_size, sentence_length)

Outputs:

  • An entailment decision per input text/hypothesis pair, in {entails, contradicts, neutral}.
Parameters:

num_seq2seq_layers : int, optional (default=0)

After getting a word embedding, how many stacked seq2seq encoders should we use before doing the decomposable attention? The default of 0 recreates the original decomposable attention model.

share_encoders : bool, optional (default=True)

Should we use the same seq2seq encoder for the text and hypothesis, or different ones?

decomposable_attention_params : Dict[str, Any], optional (default={})

These parameters get passed to the DecomposableAttentionEntailment layer object, and control things like the number of output labels, number of hidden layers in the entailment MLPs, etc. See that class for a complete description of options here.
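
These options would normally live in the JSON experiment file for the model; here is a hedged sketch of the relevant fragment, written as a Python dictionary (the values are illustrative, and a real experiment file also needs data files, encoder parameters, and so on):

    # An illustrative fragment of the model parameters, not a complete experiment file.
    decomposable_attention_model_params = {
        'num_seq2seq_layers': 1,       # add one seq2seq encoder before the attention
        'share_encoders': True,
        # Passed straight through to the DecomposableAttentionEntailment layer;
        # see that class for the options you can set here.
        'decomposable_attention_params': {},
    }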

_build_model()[source]

Constructs and returns a DeepQaModel (which is a wrapper around a Keras Model) that will take the output of self._get_training_data as input, and produce as output a true/false decision for each input. Note that in the multiple gpu case, this function will be called multiple times for the different GPUs. As such, you should be wary of this function having side effects unrelated to building a computation graph.

The returned model will be used to call model.fit(train_input, train_labels).

classmethod _get_custom_objects()[source]
_instance_type()[source]

When reading datasets, what Instance type should we create? The Instance class contains code that creates actual numpy arrays, so this instance type determines the inputs that you will get to your model, and the outputs that are used for training.

_set_padding_lengths_from_model()[source]

This gets called when loading a saved model. It is analogous to _set_padding_lengths, but needs to set all of the values set in that method just by inspecting the loaded model. If we didn’t have this, we would not be able to correctly pad data after loading a model.

get_padding_memory_scaling(padding_lengths: typing.Dict[str, int]) → int[source]

This method is for computing adaptive batch sizes. We assume that memory usage is a function that looks like this: M = b * O(p) * c, where M is the memory usage, b is the batch size, c is some constant that depends on how much GPU memory you have and various model hyperparameters, and O(p) is a function outlining how memory usage asymptotically varies with the padding lengths. Our approach will be to let the user effectively set M / c using the adaptive_memory_usage_constant parameter in DataGenerator. The model (this method) specifies O(p), so we can solve for the batch size b. The more specific you get in specifying O(p) in this function, the better a job we can do in optimizing memory usage.

Parameters:

padding_lengths: Dict[str, int]

Dictionary containing padding lengths, mapping keys like num_sentence_words to ints. This method computes a function of these ints.

Returns:

O(p): int

The big-O complexity of the model, evaluated with the specific ints given in padding_lengths dictionary.
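
As a hedged illustration of how this fits together (the scaling function below is made up, not necessarily what this model actually returns):

    # A made-up scaling function, not necessarily the one this model actually uses.
    def padding_memory_scaling(padding_lengths: dict) -> int:
        # Suppose memory is dominated by the word-by-word attention matrix between
        # the two sentences, so it scales with the square of the sentence length.
        return padding_lengths['num_sentence_words'] ** 2

    # DataGenerator solves M = b * O(p) * c for the batch size b, where the user
    # sets M / c via adaptive_memory_usage_constant:
    #     b = adaptive_memory_usage_constant / O(p)
    # e.g. 1000000 / padding_memory_scaling({'num_sentence_words': 100}) == 100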

Reading Comprehension

AttentionSumReader

class deep_qa.models.reading_comprehension.attention_sum_reader.AttentionSumReader(params: deep_qa.common.params.Params)[source]

Bases: deep_qa.training.text_trainer.TextTrainer

This TextTrainer implements the Attention Sum Reader model described by Kadlec et al. (2016). It takes a question and document as input, encodes the document and question words with two separate Bidirectional GRUs, and then takes the dot product of the question embedding with the document embedding of each word in the document. This creates an attention over words in the document, and it then selects the option with the highest summed or mean weight as the answer.

_build_model()[source]

The basic outline here is that we’ll pass the questions and the document / passage (think of this as a collection of possible answer choices) into a word embedding layer.

Then, we run the word embeddings from the document (a sequence) through a bidirectional GRU and output a sequence that is the same length as the input sequence size. For each time step, the output item (“contextual embedding”) is the concatenation of the forward and backward hidden states in the bidirectional GRU encoder at that time step.

To get the encoded question, we pass the words of the question into another bidirectional GRU. This time, the output encoding is a vector containing the concatenation of the last hidden state in the forward network with the last hidden state of the backward network.

We then take the dot product of the question embedding with each of the contextual embeddings for the words in the documents. We sum up all the occurrences of a word (“total attention”), and pick the word with the highest total attention in the document as the answer.

classmethod _get_custom_objects()[source]
_instance_type()[source]

Return the instance type that the model trains on.

_set_padding_lengths(padding_lengths: typing.Dict[str, int])[source]

Set the padding lengths of the model.

_set_padding_lengths_from_model()[source]

This gets called when loading a saved model. It is analogous to _set_padding_lengths, but needs to set all of the values set in that method just by inspecting the loaded model. If we didn’t have this, we would not be able to correctly pad data after loading a model.

get_padding_lengths() → typing.Dict[str, int][source]

Return a dictionary with the appropriate padding lengths.

BidirectionalAttentionFlow

class deep_qa.models.reading_comprehension.bidirectional_attention.BidirectionalAttentionFlow(params: deep_qa.common.params.Params)[source]

Bases: deep_qa.training.text_trainer.TextTrainer

This class implements Minjoon Seo’s Bidirectional Attention Flow model for answering reading comprehension questions (ICLR 2017).

The basic layout is pretty simple: encode words as a combination of word embeddings and a character-level encoder, pass the word representations through a bi-LSTM/GRU, use a matrix of attentions to put question information into the passage word representations (this is the only part that is at all non-standard), pass this through another few layers of bi-LSTMs/GRUs, and do a softmax over span start and span end.

Parameters:

num_hidden_seq2seq_layers : int, optional (default: 2)

At the end of the model, we add a few stacked biLSTMs (or similar), to give the model some depth. This parameter controls how many deep layers we should use.

num_passage_words : int, optional (default: None)

If set, we will truncate (or pad) all passages to this length. If not set, we will pad all passages to be the same length as the longest passage in the data.

num_question_words : int, optional (default: None)

Same as num_passage_words, but for the number of words in the question.

num_highway_layers : int, optional (default: 2)

After constructing a word embedding, but before the first biLSTM layer, Min has some Highway layers operating on the word embedding layer. This parameter specifies how many of those to do.

highway_activation : string, optional (default: 'relu')

Specifies the activation function to use for the Highway layers mentioned above. Any Keras activation function is acceptable here.

similarity_function : Dict[str, Any], optional (default: {'type': 'linear', 'combination': 'x,y,x*y'})

Specifies the similarity function to use when computing a similarity matrix between question words and passage words. By default we use the function Min used in his paper.

Notes

Min’s code uses tensors of shape (batch_size, num_sentences, sentence_length) to represent the passage, splitting it up into sentences, where here we just have one long passage sequence. I was originally afraid this might mean he applied the biLSTM on each sentence independently, but it looks like he flattens it to our shape before he does any actual operations on it. So, I think this is implementing pretty much exactly what he did, but I’m not totally certain.

_build_model()[source]

Constructs and returns a DeepQaModel (which is a wrapper around a Keras Model) that will take the output of self._get_training_data as input, and produce as output a true/false decision for each input. Note that in the multiple gpu case, this function will be called multiple times for the different GPUs. As such, you should be wary of this function having side effects unrelated to building a computation graph.

The returned model will be used to call model.fit(train_input, train_labels).

classmethod _get_custom_objects()[source]
_instance_type()[source]

When reading datasets, what Instance type should we create? The Instance class contains code that creates actual numpy arrays, so this instance type determines the inputs that you will get to your model, and the outputs that are used for training.

_set_padding_lengths(padding_lengths: typing.Dict[str, int])[source]

This is about padding. Any model will have some number of things that need padding in order to make a consistent set of input arrays, like the length of a sentence. This method sets those variables given a dictionary of lengths from a dataset.

Note that you might choose not to update some of these lengths, either because you want to keep the model flexible to allow for dynamic (batch-specific) padding, or because you’ve set a hard limit in the class parameters and don’t want to change it.

_set_padding_lengths_from_model()[source]

This gets called when loading a saved model. It is analogous to _set_padding_lengths, but needs to set all of the values set in that method just by inspecting the loaded model. If we didn’t have this, we would not be able to correctly pad data after loading a model.

static get_best_span(span_begin_probs, span_end_probs)[source]
get_instance_sorting_keys() → typing.List[str][source]

If we’re using dynamic padding, we want to group the instances by padding length, so that we minimize the amount of padding necessary per batch. This variable sets what exactly gets sorted by. We’ll call get_padding_lengths() on each instance, pull out these keys, and sort by them in the order specified. You’ll want to override this in your model class if you have more complex models.

The default implementation is to sort first by num_sentence_words, then by num_word_characters (if applicable).

get_padding_lengths() → typing.Dict[str, int][source]

This is about padding. Any solver will have some number of things that need padding in order to make consistently-sized data arrays, like the length of a sentence. This method returns a dictionary of all of those things, mapping a length key to an int.

If any of the entries in this dictionary is None, the padding code will calculate a padding length from the data itself. This could either be a good idea or a bad idea - if you have outliers in your data, you could be wasting a whole lot of memory and computation time if you pad the whole dataset to the size of the outlier. On the other hand, if you do batch-specific padding, this can save you a whole lot of time, if you group batches by similar lengths.

Here we return the lengths that are applicable to encoding words and sentences. If you have additional padding dimensions, call super().get_padding_lengths() and then update the dictionary.

get_padding_memory_scaling(padding_lengths: typing.Dict[str, int]) → int[source]

This method is for computing adaptive batch sizes. We assume that memory usage is a function that looks like this: M = b * O(p) * c, where M is the memory usage, b is the batch size, c is some constant that depends on how much GPU memory you have and various model hyperparameters, and O(p) is a function outlining how memory usage asymptotically varies with the padding lengths. Our approach will be to let the user effectively set M / c using the adaptive_memory_usage_constant parameter in DataGenerator. The model (this method) specifies O(p), so we can solve for the batch size b. The more specific you get in specifying O(p) in this function, the better a job we can do in optimizing memory usage.

Parameters:

padding_lengths: Dict[str, int]

Dictionary containing padding lengths, mapping keys like num_sentence_words to ints. This method computes a function of these ints.

Returns:

O(p): int

The big-O complexity of the model, evaluated with the specific ints given in padding_lengths dictionary.
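
For the get_best_span method listed above, here is a hedged sketch of one standard way to pick the best (start, end) pair from the two probability distributions; it is not guaranteed to be the exact code used here:

    import numpy as np

    def best_span(span_begin_probs, span_end_probs):
        """Return (start, end) maximizing begin[start] * end[end] with start <= end."""
        best_score, best_pair = -1.0, (0, 0)
        best_begin = 0   # index of the best span start seen so far
        for end in range(len(span_end_probs)):
            if span_begin_probs[end] > span_begin_probs[best_begin]:
                best_begin = end
            score = span_begin_probs[best_begin] * span_end_probs[end]
            if score > best_score:
                best_score, best_pair = score, (best_begin, end)
        return best_pair

    # best_span(np.array([0.1, 0.7, 0.2]), np.array([0.2, 0.3, 0.5])) == (1, 2)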

GatedAttentionReader

class deep_qa.models.reading_comprehension.gated_attention_reader.GatedAttentionReader(params: deep_qa.common.params.Params)[source]

Bases: deep_qa.training.text_trainer.TextTrainer

This TextTrainer implements the Gated Attention Reader model described in “Gated-Attention Readers for Text Comprehension” by Dhingra et al. (2016). It encodes the document with a variable number of gated attention layers, and then encodes the query. It takes the dot product of these two final encodings to generate an attention over the words in the document, and it then selects the option with the highest summed or mean weight as the answer.

Parameters:

multiword_option_mode: str, optional (default=”mean”)

Describes how to calculate the probability of options that contain multiple words. If “mean”, the probability of the option is taken to be the mean of the probabilities of its constituent words. If “sum”, the probability of the option is taken to be the sum of the probabilities of its constituent words.

num_gated_attention_layers: int, optional (default=3)

The number of gated attention layers to pass the document embedding through. Must be at least 1.

cloze_token: str, optional (default=None)

If not None, the string that represents the cloze token in a cloze question. Used to calculate the attention over the document, as the model does it differently for cloze vs non-cloze datasets.

gating_function: str, optional (default=”*”)

The gating function to use in the Gated Attention layer. "*" is for elementwise multiplication, "+" is for elementwise addition, and "|" is for concatenation.

gated_attention_dropout: float, optional (default=0.3)

The proportion of units to drop out after each gated attention layer.

qd_common_feature: boolean, optional (default=True)

Whether to use the question-document common word feature. This feature simply indicates, for each word in the document, whether it appears in the query; it has been shown to improve reading comprehension performance.

_build_model()[source]

The basic outline here is that we’ll pass the questions and the document / passage (think of this as a collection of possible answer choices) into a word embedding layer.

classmethod _get_custom_objects()[source]
_instance_type()[source]

Return the instance type that the model trains on.

_set_padding_lengths(padding_lengths: typing.Dict[str, int])[source]

Set the padding lengths of the model.

_set_padding_lengths_from_model()[source]

This gets called when loading a saved model. It is analogous to _set_padding_lengths, but needs to set all of the values set in that method just by inspecting the loaded model. If we didn’t have this, we would not be able to correctly pad data after loading a model.

get_padding_lengths() → typing.Dict[str, int][source]

Return a dictionary with the appropriate padding lengths.

Text Classification

Text classification models take a sequence of text as input and classify it into one of several classes.

Input: Text sequence

Output: Class label

ClassificationModel

class deep_qa.models.text_classification.classification_model.ClassificationModel(params: deep_qa.common.params.Params)[source]

Bases: deep_qa.training.text_trainer.TextTrainer

A TextTrainer that simply takes word sequences as input (could be either sentences or logical forms), encodes the sequence using a sentence encoder, then uses a few dense layers to decide on some classification label for the text sequence (currently hard-coded for a binary classification decision, but that’s easy to fix if we need to).

We don’t really expect this model to work for question answering - it’s just a sentence classification model. The best it can do is basically to learn word cooccurrence information, similar to how the Salience solver works, and I’m not at all confident that this does that job better than Salience. We’ve implemented this mostly as a simple baseline.

Note that this also can’t actually answer questions at this point. You have to do some post-processing to get from true/false decisions to question answers, and I removed that from TextTrainer to make the code simpler.

_build_model()[source]
train_input: numpy array: int32 (samples, num_words). Left padded arrays of word indices
from sentences in training data
_instance_type()[source]
_set_padding_lengths_from_model()[source]

This gets called when loading a saved model. It is analogous to _set_padding_lengths, but needs to set all of the values set in that method just by inspecting the loaded model. If we didn’t have this, we would not be able to correctly pad data after loading a model.

About Layers

Custom layers that we have implemented belong here. These include things like knowledge encoders (which encode the memory component of a memory network), knowledge selectors (which perform an attention over the memory), and entailment models. There’s also an encoders submodule, containing sentence encoders that convert an embedded word (or character) sequence into a vector.

Core Layers

Additive

class deep_qa.layers.additive.Additive(initializer='glorot_uniform', **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer adds a parameter value to each cell in the input tensor, similar to a bias vector in a Dense layer, except that this layer only adds (there is no multiplication), with one learned value per cell.

Parameters:

initializer: str, optional (default=’glorot_uniform’)

Keras initializer for the additive weight.

build(input_shape)[source]

Creates the layer weights.

Must be implemented on all layers that have weights.

# Arguments
input_shape: Keras tensor (future input to layer)
or list/tuple of Keras tensors to reference for weight shape computations.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

BiGRUIndexSelector

class deep_qa.layers.bigru_index_selector.BiGRUIndexSelector(target_index, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer takes 3 inputs: a tensor of document indices, the seq2seq GRU output over the document when it is fed in forward, and the seq2seq GRU output over the document when it is fed in backward. It also takes one parameter: the word index whose biGRU outputs we want to extract.

Inputs:
  • document indices: shape (batch_size, document_length)
  • forward GRU output: shape (batch_size, document_length, GRU hidden dim)
  • backward GRU output: shape (batch_size, document_length, GRU hidden dim)
Output:
  • GRU outputs at index: shape (batch_size, GRU hidden dim * 2)
Parameters:

target_index : int

The word index to extract the forward and backward GRU output from.

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shapes)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

ComplexConcat

class deep_qa.layers.complex_concat.ComplexConcat(combination: str, axis: int = -1, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer does K.concatenate() on a collection of tensors, but allows for more complex operations than Merge(mode='concat'). Specifically, you can perform an arbitrary number of elementwise linear combinations of the vectors, and concatenate all of the results. If you do not need to do this, you should use the regular Merge layer instead of this ComplexConcat.

Because the inputs all have the same shape, we assume that the masks are also the same, and just return the first mask.

Input:
  • A list of tensors. The tensors that you combine must have the same shape, so that we can do elementwise operations on them, and all tensors must have the same number of dimensions, and match on all dimensions except the concatenation axis.
Output:
  • A tensor with some combination of the input tensors concatenated along a specific dimension.
Parameters:

axis : int

The axis to use for K.concatenate.

combination: List of str

A comma-separated list of combinations to perform on the input tensors. These are either tensor indices (1-indexed), or an arithmetic operation between two tensor indices (valid operations: *, +, -, /). For example, these are all valid combination parameters: "1,2", "1,2*3", "1-2,2-1", "1,1*1", and "1,2,1*2".
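
For instance, combination="1,2,1*2" over two input tensors concatenates the first tensor, the second tensor, and their elementwise product. A small numpy sketch of the semantics (the actual layer operates on Keras tensors and handles masking):

    import numpy as np

    a = np.array([[1.0, 2.0]])     # tensor index 1
    b = np.array([[3.0, 4.0]])     # tensor index 2

    # combination="1,2,1*2" with axis=-1:
    result = np.concatenate([a, b, a * b], axis=-1)
    # result == [[1., 2., 3., 4., 3., 8.]]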

_get_combination(combination: str, tensors: typing.List[typing.Tensor])[source]
_get_combination_length(combination: str, input_shapes: typing.List[typing.Tuple[int]])[source]
compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

Highway

class deep_qa.layers.highway.Highway(**kwargs)[source]

Bases: keras.legacy.layers.Highway

Keras’ Highway layer does not support masking, but it easily could, just by returning the mask. This Layer makes this possible.

L1Normalize

class deep_qa.layers.l1_normalize.L1Normalize(**kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer normalizes a tensor by its L1 norm. This could just be a Lambda layer that calls our tensors.l1_normalize function, except that Lambda layers do not properly handle masked input.

The expected input to this layer is a tensor of shape (batch_size, x), with an optional mask of the same shape. We also accept as input a tensor of shape (batch_size, x, 1), which will be squeezed to shape (batch_size, x) (though the mask must still be of shape (batch_size, x)).

We give no output mask, as we expect this to only be used at the end of the model, to get a final probability distribution over class labels. If you need this to propagate the mask for your model, it would be pretty easy to change it to optionally do so - submit a PR.

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

NoisyOr

class deep_qa.layers.noisy_or.BetweenZeroAndOne[source]

Bases: keras.constraints.Constraint

Constrains the weights to be between zero and one.

class deep_qa.layers.noisy_or.NoisyOr(axis=-1, name='noisy_or', param_init='uniform', noise_param_constraint=None, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This layer takes as input a tensor of probabilities and calculates the noisy-or probability across a given axis based on the noisy-or equation:

  • p(x) = 1 - \prod_{n=1}^{N}(1 - q * p(x|y_n))

where q is the noise parameter.
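
A small numeric sketch of the combination itself (the values, including q, are made up):

    import numpy as np

    q = 0.8                                    # the (learned) noise parameter
    p_given_y = np.array([0.9, 0.5, 0.2])      # p(x | y_n) along the combined axis

    p_x = 1.0 - np.prod(1.0 - q * p_given_y)
    # 1 - (1 - 0.72) * (1 - 0.4) * (1 - 0.16) ≈ 0.859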

Inputs:
  • probabilities: shape (batch, ..., N, ...) Optionally takes a mask of the same shape, where N is the number of y’s in the above equation (i.e. the number of probabilities being combined in the product), in the dimension corresponding to the specified axis.
Output:
  • X: shape (batch, ..., ...) The output has one less dimension than the input, and has an optional mask of the same shape. The lost dimension corresponds to the specified axis. The output mask is the result of K.any() on the input mask, along the specified axis.
Parameters:

axis : int, default=-1

The axis over which to combine probabilities.

name : string, default=’noisy_or’

Name of the layer, used to debug both the layer and its parameter.

param_init : string, default=’uniform’

The initialization of the noise parameter.

noise_param_constraint : Keras Constraint, default=None

Optional, a constraint which would be applied to the noise parameter.

build(input_shape)[source]
compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

OptionAttentionSum

class deep_qa.layers.option_attention_sum.OptionAttentionSum(multiword_option_mode='mean', **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer takes three inputs: a tensor of document indices, a tensor of document probabilities, and a tensor of answer options. In addition, it takes a parameter: a string describing how to calculate the probability of options that consist of multiple words. We compute the probability of each of the answer options in the fashion described in the paper “Text Comprehension with the Attention Sum Reader Network” (Kadlec et al., 2016).

Inputs:
  • document indices: shape (batch_size, document_length)
  • document probabilities: shape (batch_size, document_length)
  • options: shape (batch size, num_options, option_length)
Output:
  • option_probabilities (batch_size, num_options)
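
A hedged numpy sketch of the computation for a single instance (the real layer works on batched Keras tensors and handles padding and masking):

    import numpy as np

    document = np.array([3, 7, 3, 9])                # word indices in the document
    probabilities = np.array([0.1, 0.4, 0.3, 0.2])   # attention over the document
    option = np.array([3, 9])                        # a (possibly multi-word) answer option

    # Each option word's probability is the sum of attention over its occurrences.
    word_probs = [probabilities[document == word].sum() for word in option]
    # word 3 -> 0.1 + 0.3 = 0.4, word 9 -> 0.2

    option_prob_mean = np.mean(word_probs)   # multiword_option_mode='mean' -> 0.3
    option_prob_sum = np.sum(word_probs)     # multiword_option_mode='sum'  -> 0.6
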
compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shapes)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

Overlap

class deep_qa.layers.overlap.Overlap(**kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer takes 2 inputs: a tensor_a (e.g. a document) and a tensor_b (e.g. a question). For each position in tensor_a, it outputs a one-hot vector, suitable as a feature representation, indicating whether the element at that position also appears in tensor_b. Note that the output is not quite the same shape as tensor_a: it has one extra dimension of size 2, as shown below.

Inputs:
  • tensor_a: shape (batch_size, length_a)
  • tensor_b shape (batch_size, length_b)
Output:
  • Collection of one-hot vectors indicating overlap: shape (batch_size, length_a, 2)

Notes

This layer is used to implement the “Question Evidence Common Word Feature” discussed in section 3.2.4 of Dhingra et al. (2016).
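
A hedged numpy sketch of the feature for a single instance (the real layer works on batched Keras tensors, and the order of the two one-hot slots here is just illustrative):

    import numpy as np

    tensor_a = np.array([4, 8, 15, 8])     # e.g. document word indices
    tensor_b = np.array([8, 42])           # e.g. question word indices

    in_b = np.isin(tensor_a, tensor_b).astype(int)           # [0, 1, 0, 1]
    overlap_feature = np.stack([1 - in_b, in_b], axis=-1)    # shape (length_a, 2)
    # one-hot per position: [not-in-tensor_b, in-tensor_b]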

compute_output_shape(input_shapes)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

SubtractMinimum

class deep_qa.layers.subtract_minimum.SubtractMinimum(axis: int, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This layer is used to normalize across a tensor axis. Normalization is done by finding the minimum value across the specified axis, and then subtracting that value from all values (again, across the specified axis). Note that this also works just fine if you want to find the minimum across more than one axis.

Inputs:
  • A tensor with arbitrary dimension, and a mask of the same shape (currently doesn’t support masks with other shapes).
Output:
  • The same tensor, with the minimum across one (or more) of the dimensions subtracted.
Parameters:

axis: int

The axis (or axes) across which to find the minimum. Can be a single int, a list of ints, or None. We just call K.min with this parameter, so anything that’s valid there works here too.
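
A minimal numpy illustration of the normalization described above (mask handling omitted); the layer does the equivalent with K.min on symbolic tensors.

import numpy as np

scores = np.array([[3.0, 5.0, 2.0],
                   [10.0, 4.0, 6.0]])   # (batch_size=2, 3)

normalized = scores - scores.min(axis=1, keepdims=True)
print(normalized)
# [[1. 3. 0.]
#  [6. 0. 2.]]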

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

VectorMatrixMerge

class deep_qa.layers.vector_matrix_merge.VectorMatrixMerge(concat_axis: int, mask_concat_axis: int = None, propagate_mask: bool = True, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer takes a tensor with K modes and a collection of other tensors with K - 1 modes, and concatenates the lower-order tensors at the beginning of the higher-order tensor along a given mode. We call this a vector-matrix merge to evoke the notion of appending vectors onto a matrix, but this will also work with higher-order tensors.

For example, if you have a memory tensor of shape (batch_size, knowledge_length, encoding_dim), containing knowledge_length encoded sentences, you could use this layer to concatenate N individual encoded sentences with it, resulting in a tensor of shape (batch_size, N + knowledge_length, encoding_dim).

This layer supports masking - we will pass through whatever mask you have on the matrix, and concatenate ones to it, similar to how we concatenate the inputs. We need to know what axis to do that concatenation on, though - we’ll default to the input concatenation axis, but you can specify a different one if you need to. We just ignore masks on the vectors, because doing the right thing with masked vectors here is complicated. If you want to handle that later, submit a PR.

This Layer is essentially the opposite of a VectorMatrixSplit.

Parameters:

concat_axis: int

The axis to concatenate the vectors and matrix on.

mask_concat_axis: int, optional (default=None)

The axis to concatenate the masks on (defaults to concat_axis if None)
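
Here is a shape-level numpy sketch of the memory example above, with a single encoded sentence prepended to a memory of encoded sentences along the knowledge_length axis; the actual layer does this on symbolic tensors and also handles the masks.

import numpy as np

batch_size, knowledge_length, encoding_dim = 2, 3, 4
memory = np.random.rand(batch_size, knowledge_length, encoding_dim)
sentence = np.random.rand(batch_size, encoding_dim)

# Give the vector the concatenation axis, then concatenate it at the front.
merged = np.concatenate([sentence[:, np.newaxis, :], memory], axis=1)
print(merged.shape)   # (2, 4, 4) == (batch_size, 1 + knowledge_length, encoding_dim)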

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shapes)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

VectorMatrixSplit

class deep_qa.layers.vector_matrix_split.VectorMatrixSplit(split_axis: int, mask_split_axis: int = None, propagate_mask: bool = True, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer takes a tensor with K modes and splits it into a tensor with K - 1 modes and a tensor with K modes, but one less row in one of the dimensions. We call this a vector-matrix split to evoke the notion of taking a row- (or column-) vector off of a matrix and returning both the vector and the remaining matrix, but this will also work with higher-order tensors.

For example, if you have a sentence that has a combined (word + characters) representation of the tokens in the sentence, you’d have a tensor of shape (batch_size, sentence_length, word_length + 1). You could split that using this Layer into a tensor of shape (batch_size, sentence_length) for the word tokens in the sentence, and a tensor of shape (batch_size, sentence_length, word_length) for the characters of each word token.

This layer supports masking - we will split the mask the same way that we split the inputs.

This Layer is essentially the opposite of a VectorMatrixMerge.
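
A shape-level numpy sketch of the (word + characters) example above. The assumption that the word id sits at position 0 of the last axis is made only for this illustration; the layer splits along whatever split_axis you give it.

import numpy as np

batch_size, sentence_length, word_length = 2, 5, 6
combined = np.random.randint(1, 100, (batch_size, sentence_length, word_length + 1))

words = combined[:, :, 0]         # (batch_size, sentence_length)
characters = combined[:, :, 1:]   # (batch_size, sentence_length, word_length)
print(words.shape, characters.shape)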

static _split_tensor(tensor, split_axis: int)[source]
compute_mask(inputs, input_mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

Attention

Attention

class deep_qa.layers.attention.attention.Attention(similarity_function: typing.Dict[str, typing.Any] = None, normalize: bool = True, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer takes two inputs: a vector and a matrix. We compute the similarity between the vector and each row in the matrix, and then (optionally) perform a softmax over rows using those computed similarities. We handle masking properly for masked rows in the matrix, though we ignore any masking on the vector.

By default similarity is computed with a dot product, but you can alternatively use a parameterized similarity function if you wish.

Inputs:

  • vector: shape (batch_size, embedding_dim), mask is ignored if provided
  • matrix: shape (batch_size, num_rows, embedding_dim), with mask (batch_size, num_rows)

Output:

  • attention: shape (batch_size, num_rows). If normalize is True, we return no mask, as we’ve already applied it (masked input rows have value 0 in the output). If normalize is False, we return the matrix mask, if there was one.
Parameters:

similarity_function_params : Dict[str, Any], optional (default: {})

These parameters get passed to a similarity function (see deep_qa.tensors.similarity_functions for more info on what’s acceptable). The default similarity function with no parameters is a simple dot product.

normalize : bool, optional (default: True)

If true, we normalize the computed similarities with a softmax, to return a probability distribution for your attention. If false, this is just computing a similarity score.
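
To make the computation concrete, here is a toy numpy sketch of dot-product attention between a vector and a masked matrix. The masked_softmax helper below is a stand-in for the library’s masked softmax, not the actual implementation.

import numpy as np

def masked_softmax(scores, mask):
    # Softmax that assigns zero probability to masked rows.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True)) * mask
    return exp / exp.sum(axis=-1, keepdims=True)

batch_size, num_rows, embedding_dim = 1, 3, 4
vector = np.random.rand(batch_size, embedding_dim)
matrix = np.random.rand(batch_size, num_rows, embedding_dim)
matrix_mask = np.array([[1.0, 1.0, 0.0]])   # last row is padding

# Default similarity: dot product between the vector and each row of the matrix.
similarities = np.einsum('bd,brd->br', vector, matrix)   # (batch_size, num_rows)
attention = masked_softmax(similarities, matrix_mask)
print(attention)   # the masked row gets probability 0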

build(input_shape)[source]

Creates the layer weights.

Must be implemented on all layers that have weights.

# Arguments
input_shape: Keras tensor (future input to layer)
or list/tuple of Keras tensors to reference for weight shape computations.
compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shapes)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

GatedAttention

class deep_qa.layers.attention.gated_attention.GatedAttention(gating_function='*', **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This layer implements the majority of the Gated Attention module described in “Gated-Attention Readers for Text Comprehension” by Dhingra et al., 2016.

The module is described in section 3.2.2. For each token d_i in D, the GA module forms a “token-specific representation” of the query q_i using soft attention, and then multiplies the query representation element-wise with the document token representation.

    1. \alpha_i = softmax(Q^T d_i)
    2. q_i = Q \alpha_i
    3. x_i = d_i \odot q_i (\odot is element-wise multiplication)

This layer implements equations 2 and 3 above, but in a batched manner, to get X, a tensor with all x_i. Thus, the inputs to the layer are \alpha (normalized_qd_attention), a tensor with all \alpha_i, as well as Q (question_matrix) and D (document_matrix), a tensor with all d_i. The element-wise multiplication in equation 3 models the interactions between d_i and q_i; the paper also reports results with other gating functions, such as sum and concatenation.

Inputs:
  • document_matrix, a matrix of shape (batch, document length, biGRU hidden length). Represents the document as encoded by the biGRU.
  • question_matrix, a matrix of shape (batch, question length, biGRU hidden length). Represents the question as encoded by the biGRU.
  • normalized_qd_attention, the soft attention over the document and question. Matrix of shape (batch, document length, question length).
Output:
  • X, a tensor of shape (batch, document length, biGRU hidden length) if the gating function is * or +, or (batch, document length, biGRU hidden length * 2) if the gating function is || This serves as a representation of each token in the document.
Parameters:

gating_function : string, default=”*”

The gating function to use for modeling the interactions between the document and query token. Supported gating functions are "*" for elementwise multiplication, "+" for elementwise addition, and "||" for concatenation.

Notes

To find out how we calculated equation 1, see the GatedAttentionReader model (roughly, a masked_batch_dot and a masked_softmax).
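
The following numpy sketch shows equations 2 and 3 in batched form with the default '*' gating function, assuming the normalized attention \alpha has already been computed (e.g. as described in the note above). It is an illustration of the math, not the layer’s code.

import numpy as np

batch, doc_len, q_len, hidden = 2, 5, 3, 4
document_matrix = np.random.rand(batch, doc_len, hidden)             # D
question_matrix = np.random.rand(batch, q_len, hidden)               # Q
alpha = np.random.dirichlet(np.ones(q_len), size=(batch, doc_len))   # (batch, doc_len, q_len), rows sum to 1

# Equation 2: token-specific query representation q_i = Q \alpha_i, batched over all tokens.
query_per_token = np.einsum('bdq,bqh->bdh', alpha, question_matrix)
# Equation 3 with the '*' gating function: element-wise multiplication with the document tokens.
x = document_matrix * query_per_token
print(x.shape)   # (2, 5, 4) == (batch, document length, biGRU hidden length)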

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shapes)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

MaskedSoftmax

class deep_qa.layers.attention.masked_softmax.MaskedSoftmax(**kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer performs a masked softmax. This could just be a Lambda layer that calls our tensors.masked_softmax function, except that Lambda layers do not properly handle masked input.

The expected input to this layer is a tensor of shape (batch_size, num_options), with a mask of the same shape. We also accept an input tensor of shape (batch_size, num_options, 1), which we will squeeze to be (batch_size, num_options) (though the mask must still be (batch_size, num_options)).

While we give the expected input as having two modes, we also accept higher-order tensors. In those cases, we’ll first perform a last_dim_flatten on both the input and the mask, so that we always do the softmax over a single dimension (the last one).

We give no output mask, as we expect this to only be used at the end of the model, to get a final probability distribution over class labels (and it’s a softmax, so you’ll have zeros in the tensor itself; do you really still need a mask?). If you need this to propagate the mask for whatever reason, it would be pretty easy to change it to optionally do so - submit a PR.
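
For intuition, here is a toy numpy sketch of a masked softmax over the last dimension, including the squeeze of a trailing dimension of size 1; it is a stand-in for the library’s tensors.masked_softmax, not the actual implementation.

import numpy as np

logits = np.array([[[2.0], [1.0], [3.0], [0.5]]])   # (batch_size=1, num_options=4, 1)
logits = np.squeeze(logits, axis=-1)                # squeezed to (batch_size, num_options)
mask = np.array([[1.0, 1.0, 1.0, 0.0]])             # last option is padding

exp = np.exp(logits - logits.max(axis=-1, keepdims=True)) * mask
probabilities = exp / exp.sum(axis=-1, keepdims=True)
print(probabilities)   # the padded option gets probability 0; the rest sum to 1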

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

MatrixAttention

class deep_qa.layers.attention.matrix_attention.MatrixAttention(similarity_function: typing.Dict[str, typing.Any] = None, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer takes two matrices as input and returns a matrix of attentions.

We compute the similarity between each row in each matrix and return unnormalized similarity scores. We don’t worry about zeroing out any masked values, because we propagate a correct mask.

By default similarity is computed with a dot product, but you can alternatively use a parameterized similarity function if you wish.

This is largely similar to using TimeDistributed(Attention), except the result is unnormalized, and we return a mask, so you can do a masked normalization with the result. You should use this instead of TimeDistributed(Attention) if you want to compute multiple normalizations of the attention matrix.

Input:
  • matrix_1: (batch_size, num_rows_1, embedding_dim), with mask (batch_size, num_rows_1)
  • matrix_2: (batch_size, num_rows_2, embedding_dim), with mask (batch_size, num_rows_2)
Output:
  • (batch_size, num_rows_1, num_rows_2), with mask of same shape
Parameters:

similarity_function_params: Dict[str, Any], default={}

These parameters get passed to a similarity function (see deep_qa.tensors.similarity_functions for more info on what’s acceptable). The default similarity function with no parameters is a simple dot product.
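
A toy numpy sketch of the unnormalized pairwise similarities (default dot product). Combining the two row masks with an outer product, as below, is one natural reading of “mask of same shape”; this is an illustration, not the layer’s code.

import numpy as np

matrix_1 = np.random.rand(1, 3, 4)        # (batch_size, num_rows_1, embedding_dim)
matrix_2 = np.random.rand(1, 5, 4)        # (batch_size, num_rows_2, embedding_dim)
mask_1 = np.array([[1, 1, 0]])            # (batch_size, num_rows_1)
mask_2 = np.array([[1, 1, 1, 0, 0]])      # (batch_size, num_rows_2)

similarities = np.einsum('bie,bje->bij', matrix_1, matrix_2)   # (1, 3, 5), unnormalized
output_mask = mask_1[:, :, None] * mask_2[:, None, :]          # (1, 3, 5)
print(similarities.shape, output_mask.shape)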

build(input_shape)[source]

Creates the layer weights.

Must be implemented on all layers that have weights.

# Arguments
input_shape: Keras tensor (future input to layer)
or list/tuple of Keras tensors to reference for weight shape computations.
compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

MaxSimilaritySoftmax

class deep_qa.layers.attention.max_similarity_softmax.MaxSimilaritySoftmax(knowledge_axis, max_knowledge_length, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This layer takes encoded questions and knowledge in a multiple choice setting and computes the similarity between each of the question embeddings and the background knowledge, and returns a softmax over the options.

Inputs:

  • encoded_questions (batch_size, num_options, encoding_dim)
  • encoded_knowledge (batch_size, num_options, knowledge_length, encoding_dim)

Output:

  • option_probabilities (batch_size, num_options)

This is a pretty niche layer that does a very specific computation. We only made it its own class instead of a Lambda layer so that we could handle masking correctly, which Lambda does not.

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shapes)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

WeightedSum

class deep_qa.layers.attention.weighted_sum.WeightedSum(use_masking: bool = True, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer takes a matrix of vectors and a vector of row weights, and returns a weighted sum of the vectors. You might use this to get some aggregate sentence representation after computing an attention over the sentence, for example.

Inputs:

  • matrix: (batch_size, num_rows, embedding_dim), with mask (batch_size, num_rows)
  • vector: (batch_size, num_rows), mask is ignored

Outputs:

  • A weighted sum of the rows in the matrix, with shape (batch_size, embedding_dim), with mask=None.
Parameters:

use_masking: bool, default=True

If true, we will apply the input mask to the matrix before doing the weighted sum. If you’ve computed your vector weights with masking, so that masked entries are 0, this is unnecessary, and you can set this parameter to False to avoid an expensive computation.

Notes

You probably should have used a mask when you computed your attention weights, so any row that’s masked in the matrix should already be 0 in the attention vector. But just in case you didn’t, we’ll handle a mask on the matrix here too. If you know that you did masking right on the attention, you can optionally remove the mask computation here, which will save you a bit of time and memory.

While the above spec shows inputs with 3 and 2 modes, we also allow inputs of any order; we always sum over the second-to-last dimension of the “matrix”, weighted by the last dimension of the “vector”. Higher-order tensors get complicated for matching things, though, so there is a hard constraint: all dimensions in the “matrix” before the final embedding must be matched in the “vector”.

For example, say I have a “matrix” with dimensions (batch_size, num_queries, num_words, embedding_dim), representing some kind of embedding or encoding of several multi-word queries. My attention “vector” must then have at least those dimensions, and could have more. So I could have an attention over words per query, with shape (batch_size, num_queries, num_words), or I could have an attention over query words for every document in some list, with shape (batch_size, num_documents, num_queries, num_words). Both of these cases are fine. In the first case, the returned tensor will have shape (batch_size, num_queries, embedding_dim), and in the second case, it will have shape (batch_size, num_documents, num_queries, embedding_dim). But you can’t have an attention “vector” that does not include all of the queries, so shape (batch_size, num_words) is not allowed - you haven’t specified how to handle that dimension in the “matrix”, so we can’t do anything with this input.
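
A minimal numpy sketch of the basic 3-mode / 2-mode case described above; masking is omitted, and the attention weights are just random distributions for illustration.

import numpy as np

batch_size, num_rows, embedding_dim = 2, 4, 3
matrix = np.random.rand(batch_size, num_rows, embedding_dim)
attention = np.random.dirichlet(np.ones(num_rows), size=batch_size)   # (batch_size, num_rows)

# Sum over the rows of the matrix, weighted by the attention vector.
aggregated = np.einsum('br,bre->be', attention, matrix)
print(aggregated.shape)   # (2, 3) == (batch_size, embedding_dim)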

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shapes)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

Backend Layers

Layers in this module generally just implement some simple operation from the Keras backend as a Layer. The reason we have these as Layers is largely so that we can properly handle masking.

AddMask

class deep_qa.layers.backend.add_mask.AddMask(mask_value: float = 0.0, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer adds a mask to a tensor. It is intended solely for testing, though if you have a use case for this outside of testing, feel free to use it. The call() method just returns the inputs, and the compute_mask method calls K.not_equal(inputs, mask_value), and that’s it. This is different from Keras’ Masking layer, which assumes higher-order input and does a K.any() call in compute_mask.

Input:
  • tensor: a tensor of arbitrary shape
Output:
  • the same tensor, now with a mask attached of the same shape
Parameters:

mask_value: float, optional (default=0.0)

This is the value that we will compare to in compute_mask.

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

BatchDot

class deep_qa.layers.backend.batch_dot.BatchDot(**kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer calls K.batch_dot() on two inputs tensor_a and tensor_b. This function will work for tensors of arbitrary size as long as abs(K.ndim(tensor_a) - K.ndim(tensor_b)) <= 1, due to limitations in K.batch_dot(). When the input tensors have more than three dimensions, they must have the same shape, except for the last two dimensions. See the examples for more explanation of what this means.

We always assume the dimension to perform the dot is the last one, and that the masks have one fewer dimension than the tensors. Note that this layer does not return zeroes in places that are masked, but it does pass a correct mask forward. If this then gets fed into masked_softmax, for instance, your tensor will be correctly normalized.

Inputs:
  • tensor_a: tensor with ndim >= 2.
  • tensor_b: tensor with ndim >= 2.
Output:
  • a_dot_b

Examples

The following examples will try to give some insight on how this layer works in relation to K.batch_dot(). Note that the Keras documentation (as of 2/13/17) on K.batch_dot is incorrect, and that this layer behaves differently from the documented behavior.

As a first example, let’s suppose that tensor_a and tensor_b have the same number of dimensions. Let the shape of tensor_a be (2, 3, 2), and let the shape of tensor_b be (2, 4, 2). The mask accompanying these inputs always has one less dimension, so the tensor_a_mask has shape (2, 3) and tensor_b_mask has shape (2, 4). The shape of the batch_dot output would thus be (2, 3, 4). This is because we are taking the batch dot of the last dimension, so the output shape is (2, 3) (from tensor_a) with (4) (from tensor_b) appended on (to get (2, 3, 4) in total). The output mask has the same shape as the output, and is thus (2, 3, 4) as well.

>>> import keras.backend as K
>>> tensor_a = K.ones(shape=(2, 3, 2))
>>> tensor_b = K.ones(shape=(2, 4, 2))
>>> K.eval(K.batch_dot(tensor_a, tensor_b, axes=(2,2))).shape
(2, 3, 4)

Next, let’s look at an example where tensor_a and tensor_b are “uneven” (different number of dimensions). Let the shape of tensor_a be (2, 4, 2), and let the shape of tensor_b be (2, 4, 3, 2). The mask accompanying these inputs always has one less dimension, so the tensor_a_mask has shape (2, 4) and tensor_b_mask has shape (2, 4, 3). The shape of the batch_dot output would thus be (2, 4, 3). In the case of uneven tensors, we always expand the last dimension of the smaller tensor to make them even. Thus in this case, we expand tensor_a to get a new shape of (2, 4, 2, 1). Now we are taking the batch_dot of a tensor with shape (2, 4, 2, 1) and (2, 4, 3, 2). Note that the first two dimensions of this tensor are the same (2, 4) – this is a requirement imposed by K.batch_dot. Following the methodology of calculating the output shape above, we get that the output is (2, 4, 1, 3) since we get (2, 4, 1) from tensor_a and (3) from tensor_b. We then squeeze the tensor to remove the 1-dimension to get a final shape of (2, 4, 3). Note that the mask has the same shape.

>>> import keras.backend as K
>>> tensor_a = K.ones(shape=(2, 4, 2))
>>> tensor_b = K.ones(shape=(2, 4, 3, 2))
>>> tensor_a_expanded = K.expand_dims(tensor_a, axis=-1)
>>> unsqueezed_bd = K.batch_dot(tensor_a_expanded, tensor_b, axes=(2,3))
>>> final_bd = K.squeeze(unsqueezed_bd, axis=K.ndim(tensor_a)-1)
>>> K.eval(final_bd).shape
(2, 4, 3)

Lastly, let’s look at the uneven case where tensor_a has more dimensions than tensor_b. Let the shape of tensor_a be (2, 3, 4, 2), and let the shape of tensor_b be (2, 3, 2). Since the mask accompanying these inputs always has one less dimension, tensor_a_mask has shape (2, 3, 4) and tensor_b_mask has shape (2, 3). The shape of the batch_dot output would thus be (2, 3, 4). Since these tensors are uneven, expand the smaller tensor, tensor_b, to get a new shape of (2, 3, 2, 1). Now we are taking the batch_dot of a tensor with shape (2, 3, 4, 2) and (2, 3, 2, 1). Note again that the first two dimensions of this tensor are the same (2, 3). We can see that the output shape is (2, 3, 4, 1) since we get (2, 3, 4) from tensor_a and (1) from tensor_b. We then squeeze the tensor to remove the 1-dimension to get a final shape of (2, 3, 4). Note that the mask has the same shape.

>>> import keras.backend as K
>>> tensor_a = K.ones(shape=(2, 3, 4, 2))
>>> tensor_b = K.ones(shape=(2, 3, 2))
>>> tensor_b_expanded = K.expand_dims(tensor_b, axis=-1)
>>> unsqueezed_bd = K.batch_dot(tensor_a, tensor_b_expanded, axes=(3, 2))
>>> final_bd = K.squeeze(unsqueezed_bd, axis=K.ndim(tensor_a)-1)
>>> K.eval(final_bd).shape
(2, 3, 4)
compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

CollapseToBatch

class deep_qa.layers.backend.collapse_to_batch.CollapseToBatch(num_to_collapse: int, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

Reshapes a higher order tensor, taking the first num_to_collapse dimensions after the batch dimension and folding them into the batch dimension. For example, a tensor of shape (2, 4, 5, 3), collapsed with num_to_collapse = 2, would become a tensor of shape (40, 3). We perform identical computation on the input mask, if there is one.

This is essentially what Keras’ TimeDistributed layer does (and then undoes) to apply a layer to a higher-order tensor, and that’s the intended use for this layer. However, TimeDistributed cannot handle distributing across dimensions with unknown lengths at graph compilation time. This layer works even in that case. So, if your actual tensor shape at graph compilation time looks like (None, None, None, 3), or (None, 4, None, 3), you can still use this layer (and ExpandFromBatch) to get the same result as TimeDistributed. If your shapes are fully known at graph compilation time, just use TimeDistributed, as it’s a nicer API for the same functionality.

Inputs:
  • tensor with ndim >= 3
Output:
  • tensor with ndim = input_ndim - num_to_collapse, with the removed dimensions folded into the first (batch-size) dimension
Parameters:

num_to_collapse: int

The number of dimensions to fold into the batch size.

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

ExpandFromBatch

class deep_qa.layers.backend.expand_from_batch.ExpandFromBatch(num_to_expand: int, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

Reshapes a collapsed tensor, taking the batch size and separating it into num_to_expand dimensions, following the shape of a second input tensor. This is meant to be used in conjunction with CollapseToBatch, to achieve the same effect as Keras’ TimeDistributed layer, but for shapes that are not fully specified at graph compilation time.

For example, say you had an original tensor of shape (None (2), 4, None (5), 3), then collapsed it with CollapseToBatch(2)(tensor) to get a tensor with shape (None (40), 3) (here I’m using None (x) to denote a dimension with unknown length at graph compilation time, where x is the actual runtime length). You can then call ExpandFromBatch(2)(collapsed, tensor) with the result to expand the first two dimensions out of the batch again (presumably after you’ve done some computation when it was collapsed).

Inputs:
  • a tensor that has been collapsed with CollapseToBatch(num_to_expand).
  • the original tensor that was used as input to CollapseToBatch (or one with identical shape in the collapsed dimensions). We will use this input only to get its shape.
Output:
  • tensor with ndim = input_ndim + num_to_expand, with the additional dimensions coming immediately after the first (batch-size) dimension.
Parameters:

num_to_expand: int

The number of dimensions to expand from the batch size.
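
To make the shape bookkeeping concrete, here is a numpy sketch of what collapsing two dimensions into the batch and then expanding them back out does. The layers perform the equivalent reshapes on symbolic tensors (and their masks), which is why they also work when the lengths are only known at runtime.

import numpy as np

tensor = np.random.rand(2, 4, 5, 3)                      # (batch, dim_1, dim_2, encoding_dim)
collapsed = tensor.reshape((-1,) + tensor.shape[3:])     # (40, 3): the two dims after the batch are folded in
# ... some per-row computation would happen here on the collapsed tensor ...
expanded = collapsed.reshape(tensor.shape[:3] + collapsed.shape[1:])   # back to (2, 4, 5, 3)
print(collapsed.shape, expanded.shape)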

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

Envelope

class deep_qa.layers.backend.envelope.Envelope(**kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

Given a probability distribution over a begin index and an end index of some sequence, this Layer computes an envelope over the sequence, a probability that each element lies within “begin” and “end”.

Specifically, the computation done here is the following:

after_span_begin = K.cumsum(span_begin, axis=-1)
after_span_end = K.cumsum(span_end, axis=-1)
before_span_end = 1 - after_span_end
envelope = after_span_begin * before_span_end
Inputs:
  • span_begin: tensor with shape (batch_size, sequence_length), representing a probability distribution over a start index in the sequence
  • span_end: tensor with shape (batch_size, sequence_length), representing a probability distribution over an end index in the sequence
Outputs:
  • envelope: tensor with shape (batch_size, sequence_length), representing a probability for each index of the sequence belonging in the span

If there is a mask associated with either of the inputs, we ignore it, assuming that you used the mask correctly when you computed your probability distributions. But we support masking in this layer, so that you have an output mask if you really need it. We just return the first mask that is not None (or None, if both are None).

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

Max

class deep_qa.layers.backend.max.Max(axis: int = -1, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer performs a max over some dimension. Keras has a similar layer called GlobalMaxPooling1D, but it is not as configurable as this one, and it does not support masking.

If the mask is not None, it must be the same shape as the input.

Input:
  • A tensor of arbitrary shape (having at least 3 dimensions).
Output:
  • A tensor with one less dimension, where we have taken a max over one of the dimensions.
compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

Permute

class deep_qa.layers.backend.permute.Permute(pattern: typing.Tuple[int], **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer calls K.permute_dimensions on both the input and the mask.

If the mask is not None, it must have the same shape as the input.

Input:
  • A tensor of arbitrary shape.
Output:
  • A tensor with permuted dimensions.
compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

Repeat

class deep_qa.layers.backend.repeat.Repeat(axis: int, repetitions: int, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer calls K.repeat_elements on both the input and the mask, after calling K.expand_dims.

If the mask is not None, we must be able to call K.expand_dims using the same axis parameter as we do for the input.

Input:
  • A tensor of arbitrary shape.
Output:
  • The input tensor repeated along one of the dimensions.
Parameters:

axis: int

We will add a dimension to the input tensor at this axis.

repetitions: int

The new dimension will have this size to it, with each slice being identical to the original input tensor.

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

RepeatLike

class deep_qa.layers.backend.repeat_like.RepeatLike(axis: int, copy_from_axis: int, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer is like Repeat, but gets the number of repetitions to use from a second input tensor. This allows doing a number of repetitions that is unknown at graph compilation time, and is necessary when the repetitions argument to Repeat would be None.

If the mask is not None, we must be able to call K.expand_dims using the same axis parameter as we do for the input.

Input:
  • A tensor of arbitrary shape, which we will expand and tile.
  • A second tensor whose shape along one dimension we will copy
Output:
  • The input tensor repeated along one of the dimensions.
Parameters:

axis: int

We will add a dimension to the input tensor at this axis.

copy_from_axis: int

We will copy the dimension from the second tensor at this axis.

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

Encoders

BagOfWords

class deep_qa.layers.encoders.bag_of_words.BOWEncoder(**kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

The Bag of Words Encoder takes a matrix of shape (num_words, word_dim) and returns a vector of size (word_dim), which is an average of the (unmasked) rows in the input matrix. This could have been done using a Lambda layer, except that the Lambda layer does not support masking (as of Keras 1.0.7).
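
A toy numpy sketch of the masked average, with an explicit batch dimension; the encoder does the equivalent on symbolic tensors.

import numpy as np

word_vectors = np.random.rand(1, 4, 5)      # (batch_size, num_words, word_dim)
mask = np.array([[1.0, 1.0, 1.0, 0.0]])     # last word is padding

masked = word_vectors * mask[:, :, None]
encoding = masked.sum(axis=1) / mask.sum(axis=1, keepdims=True)
print(encoding.shape)   # (1, 5): an average of the unmasked rows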

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

ConvolutionalEncoder

class deep_qa.layers.encoders.convolutional_encoder.CNNEncoder(units: int, num_filters: int, ngram_filter_sizes: typing.Tuple[int] = (2, 3, 4, 5), conv_layer_activation: str = 'relu', l1_regularization: float = None, l2_regularization: float = None, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

CNNEncoder is a combination of multiple convolution layers and max pooling layers. This is defined as a single layer to be consistent with the other encoders in terms of input and output specifications. The input to this “layer” is of shape (batch_size, num_words, embedding_dim) and the output is of size (batch_size, output_dim).

The CNN has one convolution layer for each ngram filter size. Each convolution operation gives out a vector of size num_filters. The number of times a convolution layer will be used depends on the ngram size: input_length - ngram_size + 1. The corresponding maxpooling layer aggregates all these outputs from the convolution layer and outputs the max.

This operation is repeated for every ngram size passed, and consequently the dimensionality of the output after maxpooling is len(ngram_filter_sizes) * num_filters.

We then use a fully connected layer to project it back to the desired output_dim. For more details, refer to “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”, Zhang and Wallace 2016, particularly Figure 1.

Parameters:

units: int

After doing convolutions, we’ll project the collected features into a vector of this size. This used to be output_dim, but Keras changed it to units. I prefer the name output_dim, so we’ll leave the code using output_dim, and just use the name units in the external API.

num_filters: int

This is the output dim for each convolutional layer, which is the same as the number of “filters” learned by that layer.

ngram_filter_sizes: Tuple[int], optional (default=(2, 3, 4, 5))

This specifies both the number of convolutional layers we will create and their sizes. The default of (2, 3, 4, 5) will have four convolutional layers, corresponding to encoding ngrams of size 2 to 5 with some number of filters.

conv_layer_activation: str, optional (default=’relu’)

l1_regularization: float, optional (default=None)

l2_regularization: float, optional (default=None)
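
The architecture described above roughly corresponds to the following sketch written with plain Keras layers (hyperparameter values are illustrative, and masking and regularization are omitted); CNNEncoder packages this up as a single layer.

from keras.layers import Input, Conv1D, GlobalMaxPooling1D, Concatenate, Dense

num_filters, units, ngram_filter_sizes = 100, 50, (2, 3, 4, 5)

word_vectors = Input(shape=(30, 300))   # (num_words, embedding_dim); batch dimension is implicit
pooled_outputs = []
for ngram_size in ngram_filter_sizes:
    conv = Conv1D(filters=num_filters, kernel_size=ngram_size, activation='relu')(word_vectors)
    pooled_outputs.append(GlobalMaxPooling1D()(conv))   # max over the num_words dimension

# The concatenation has size len(ngram_filter_sizes) * num_filters; project it down to `units`.
encoding = Dense(units)(Concatenate()(pooled_outputs))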

build(input_shape)[source]

Creates the layer weights.

Must be implemented on all layers that have weights.

# Arguments
input_shape: Keras tensor (future input to layer)
or list/tuple of Keras tensors to reference for weight shape computations.
compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

PositionalEncoder

class deep_qa.layers.encoders.positional_encoder.PositionalEncoder(**kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

A PositionalEncoder is very similar to a kind of weighted bag of words encoder, where the weighting is done by an index-dependent vector, not a scalar. If you think this is an odd thing to do, it is. The original authors provide no real reasoning behind the exact method other than it takes into account word order. This is here mainly to reproduce results for comparison.

It takes a matrix of shape (num_words, word_dim) and returns a vector of size (word_dim), which implements the following linear combination of the rows:

representation = sum_{j=1}^{m} l_j * w_j

where w_j is the j-th word representation in the sentence and l_j is a vector defined as follows:

l_{kj} = (1 - j/m) - (k/d) * (1 - 2j/m)

where:
  • j is the word’s position in the sentence (1-based).
  • m is the sentence length.
  • k is the vector index (i.e. the k-th element of a vector).
  • d is the dimension of the embedding.
  • * represents element-wise multiplication.

This method was originally introduced in End-To-End Memory Networks (pp. 4-5): https://arxiv.org/pdf/1503.08895v5.pdf
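
A small numpy sketch of the weighting above, using 1-based positions as in the paper; this is for intuition only and ignores batching and masking.

import numpy as np

num_words, word_dim = 4, 3   # m and d
word_vectors = np.random.rand(num_words, word_dim)

j = np.arange(1, num_words + 1)[:, None]   # word positions, shape (m, 1)
k = np.arange(1, word_dim + 1)[None, :]    # embedding indices, shape (1, d)
l = (1.0 - j / num_words) - (k / word_dim) * (1.0 - 2.0 * j / num_words)   # (m, d)

representation = (l * word_vectors).sum(axis=0)   # element-wise weight, then sum over words
print(representation.shape)   # (word_dim,)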

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors. mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors,
one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers)
or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

AttentiveGRU

class deep_qa.layers.encoders.attentive_gru.AttentiveGru(output_dim, input_length, **kwargs)[source]

Bases: keras.layers.recurrent.GRU

GRUs typically operate over sequences of words. The motivation behind this encoding is that a weighted average loses ordering information over its inputs - for instance, this is important in the bAbI tasks.

See Dynamic Memory Networks for more information: https://arxiv.org/pdf/1603.01417v1.pdf. This class extends the Keras Gated Recurrent Unit by implementing a method which substitutes the GRU update gate (normally a vector, z - it is noted below where it is normally computed) for a scalar attention weight (one per input, such as from the output of a softmax over the input vectors), which is pre-computed. As mentioned above, instead of using word embedding sequences as input to the GRU, we are using sentence encoding sequences.

The implementation of this class is subtle - it is only very slightly different from a standard GRU. When it is initialised, Keras will call the build method. It uses this to check that inputs being passed to this function are the correct size, so we allow this to be the actual input size as normal. However, for the internal implementation, everywhere this global shape is used we override it to be one less, because we are passing in a tensor of shape (batch, knowledge_length, 1 + encoding_dim), which includes the attention mask. Therefore, we need all of the weights to have shape (, encoding_dim), NOT (, 1 + encoding_dim). All of the below methods which are overridden use some form of this dimension, so we correct them.

build(input_shape)[source]

This is used by Keras to verify things, but also to build the weights. The only differences from the Keras GRU (which we copied exactly other than the below) are: We generate weights with dimension input_dim[2] - 1, rather than dimension input_dim[2]. There are a few variables which are created in non-‘gpu’ modes which are not required. These are commented out but left in for clarity below.

preprocess_input(inputs, training=None)[source]

We have to override this preprocessing step, because if we are using the CPU, we do the weight-input multiplications in the internals of the GRU as separate, smaller matrix multiplications and concatenate them after. Therefore, before this happens, we split off the attention and then add it back afterwards.

step(inputs, states)[source]

The input to step is a tensor of shape (batch, 1 + encoding_dim), i.e. a timeslice of the input to this AttentiveGRU, where the time axis is the knowledge_length. Before we start, we strip off the attention from the beginning. Then we do the equations for a normal GRU, except we don’t calculate the update gate z, substituting the attention weight for it instead. Note that there is some redundancy here - for instance, in the GPU mode, we do a larger matrix multiplication than required, as we don’t use one part of it. However, for readability and similarity to the original GRU code in Keras, it has not been changed. In each section, there are commented out lines which contain code. If you were to uncomment these, remove the differences in the input size and replace the attention with the z gate at the output, you would have a standard GRU back again. We literally copied the Keras GRU code here, making some small modifications.

Entailment Model Layers

DecomposableAttention

class deep_qa.layers.entailment_models.decomposable_attention.DecomposableAttentionEntailment(num_hidden_layers: int = 1, hidden_layer_width: int = 50, hidden_layer_activation: str = 'relu', final_activation: str = 'softmax', output_dim: int = 3, initializer: str = 'uniform', **kwargs)[source]

Bases: deep_qa.layers.entailment_models.word_alignment.WordAlignmentEntailment

This layer is a reimplementation of the entailment algorithm described in “A Decomposable Attention Model for Natural Language Inference”, Parikh et al., 2016. The algorithm has three main steps:

  1. Attend: Compute dot products between all pairs of projections of words in the hypothesis and the premise, normalize those dot products to use them to align each word in premise to a phrase in the hypothesis and vice-versa. These alignments are then used to summarize the aligned phrase in the other sentence as a weighted sum. The initial word projections are computed using a feed forward NN, F.
  2. Compare: Pass a concatenation of each word in the premise and the summary of its aligned phrase in the hypothesis through a feed forward NN, G, to get a projected comparison. Do the same with the hypothesis and the aligned phrase from the premise.
  3. Aggregate: Sum over the comparisons to get a single vector each for premise-hypothesis comparison, and hypothesis-premise comparison. Pass them through a third feed forward NN (H), to get the entailment decision.

This layer can take either a tuple (premise, hypothesis) or a concatenation of them as input.

Input:

  • Tuple input: a premise sentence and a hypothesis sentence, both with shape (batch_size, sentence_length, embed_dim) and masks of shape (batch_size, sentence_length)
  • Single input: a single tensor of shape (batch_size, sentence_length * 2, embed_dim), with a mask of shape (batch_size, sentence_length * 2), which we will split in half to get the premise and hypothesis sentences.

Output:

  • Entailment decisions with the given output_dim.
Parameters:

num_hidden_layers: int, optional (default=1)

Number of hidden layers in each of the feed forward neural nets described above.

hidden_layer_width: int, optional (default=50)

Width of each hidden layer in each of the feed forward neural nets described above.

hidden_layer_activation: str, optional (default=’relu’)

Activation for each hidden layer in each of the feed forward neural nets described above.

final_activation: str, optional (default=’softmax’)

Activation to use for the final output. Should almost certainly be ‘softmax’.

output_dim: int, optional (default=3)

Dimensionality of the final output. If this is the last layer in your model, this needs to be the same as the number of labels you have.

initializer: str, optional (default=’uniform’)

Will be passed to self.add_weight() for each of the weight matrices in the feed forward neural nets described above.

Notes

premise_length = hypothesis_length = sentence_length below.
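
For intuition, here is a toy numpy sketch of the “attend” step (step 1 above), using raw embeddings in place of the projected embeddings produced by F; the softmax helper and toy values are illustrative only.

import numpy as np

def softmax(x, axis=-1):
    exp = np.exp(x - x.max(axis=axis, keepdims=True))
    return exp / exp.sum(axis=axis, keepdims=True)

premise = np.random.rand(1, 4, 5)      # (batch_size, sentence_length, embed_dim)
hypothesis = np.random.rand(1, 4, 5)   # (batch_size, sentence_length, embed_dim)

similarities = np.einsum('bpe,bhe->bph', premise, hypothesis)   # dot products for all word pairs
p2h_alignment = softmax(similarities, axis=-1)                  # align each premise word to the hypothesis
# Summarize, for each premise word, the hypothesis phrase aligned to it as a weighted sum.
aligned_hypothesis = np.einsum('bph,bhe->bpe', p2h_alignment, hypothesis)
print(aligned_hypothesis.shape)   # (1, 4, 5)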

static _attend(target_embedding, s2t_alignment)[source]

Takes a target embedding and a source-to-target alignment attention, and produces a weighted average of the target embedding for each source word.

target_embedding: (batch_size, target_length, embed_dim)
s2t_alignment: (batch_size, source_length, target_length)

_compare(source_embedding, s2t_attention)[source]

Takes word embeddings from a sentence, and aggregated representations of words aligned to each of those words from another sentence, and returns a projection of their concatenation.

source_embedding: (batch_size, source_length, embed_dim)
s2t_attention: (batch_size, source_length, embed_dim)

build(input_shape)[source]

This model has three feed forward NNs (F, G and H in the paper). We assume that all three NNs have the same hyper-parameters: num_hidden_layers, hidden_layer_width and hidden_layer_activation. That is, F, G and H have the same structure and activations. Their actual weights are different, though. H has a separate softmax layer at the end.

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors.
mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors, one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match the input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers) or list of shape tuples (one per input tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

MultipleChoiceTupleEntailment

class deep_qa.layers.entailment_models.multiple_choice_tuple_entailment.MultipleChoiceTupleEntailment(**kwargs)[source]

Bases: deep_qa.layers.entailment_models.word_alignment.WordAlignmentEntailment

A kind of decomposable attention where the premise (or background) is in the form of SVO triples, and entailment is computed by finding the answer in a multiple choice setting that aligns best with the tuples that align with the question. This happens in two steps:

  1. We use the _align function from WordAlignmentEntailment to find the premise tuples whose SV or VO pairs align best with the question.
  2. We then use the _align function again to find the answer that aligns best with the unaligned part of the tuples, weighted by how much they partially align with the question in step 1.

TODO(pradeep): Also match S with question, VO with answer, O with question and SV with answer.

build(input_shape)[source]
compute_mask(x, mask=None)[source]
compute_output_shape(input_shape)[source]

WordAlignment

Word alignment entailment models operate on word level representations, and define alignment as a function of how well the words in the premise align with those in the hypothesis. These are different from the encoded sentence entailment models where both the premise and hypothesis are encoded as single vectors and entailment functions are defined on top of them.

At this point this doesn’t quite fit into the memory network setup because the model doesn’t operate on the encoded sentence representations, but instead consumes the word level representations. TODO(pradeep): Make this work with the memory network eventually.

class deep_qa.layers.entailment_models.word_alignment.WordAlignmentEntailment(**kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This is an abstract class for word alignment entailment. It defines an _align function.

static _align(source_embedding, target_embedding, source_mask, target_mask, normalize_alignment=True)[source]

Takes source and target sequence embeddings and returns source-to-target alignment weights. That is, for each word in the source sentence, returns a probability distribution over target_sequence that shows how well each target word aligns (i.e. is similar) to it.

source_embedding: (batch_size, source_length, embed_dim)
target_embedding: (batch_size, target_length, embed_dim)
source_mask: None or (batch_size, source_length, 1)
target_mask: None or (batch_size, target_length, 1)
normalize_alignment (bool): Will apply a (masked) softmax over alignments if True.

Returns: s2t_attention: (batch_size, source_length, target_length)

Wrappers

EncoderWrapper

class deep_qa.layers.wrappers.encoder_wrapper.EncoderWrapper(layer, keep_dims=False, **kwargs)[source]

Bases: deep_qa.layers.wrappers.time_distributed.TimeDistributed

This class TimeDistributes a sentence encoder, applying the encoder to several word sequences. The only difference between this and the regular TimeDistributed is in how we handle the mask. Typically, an encoder will handle masked embedded input, and return None as its mask, as it just returns a vector and no more masking is necessary. However, if the encoder is TimeDistributed, we might run into a situation where _all_ of the words in a given sequence are masked (because we padded the number of sentences, for instance). In this case, we just want to mask the entire sequence. EncoderWrapper returns a mask with the same dimension as the input sequences, where sequences are masked if _all_ of their words were masked.

Notes

For seq2seq encoders, one should use either TimeDistributed or TimeDistributedWithMask since EncoderWrapper reduces the dimensionality of the input mask.

compute_mask(x, input_mask=None)[source]
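
A hedged sketch of applying a sentence encoder to several sentences at once with EncoderWrapper (the encoder choice and shapes below are illustrative, not taken from the library's docs):

    from keras.layers import Input, LSTM
    from deep_qa.layers.wrappers.encoder_wrapper import EncoderWrapper

    # Embedded input: (batch_size, num_sentences, sentence_length, embed_dim).
    embedded_sentences = Input(shape=(4, 30, 100))
    sentence_encoder = LSTM(128)  # encodes one sentence into a single vector
    # EncoderWrapper applies the encoder to each of the 4 sentences,
    # giving encoded sentences of shape (batch_size, 4, 128).
    encoded_sentences = EncoderWrapper(sentence_encoder)(embedded_sentences)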

OutputMask

class deep_qa.layers.wrappers.output_mask.OutputMask(**kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

This Layer is purely for debugging. You can wrap this on a layer’s output to get the mask output by that layer as a model output, for easier visualization of what the model is actually doing.

Don’t try to use this in an actual model.

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors.
mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors, one per output tensor of the layer).

TimeDistributed

class deep_qa.layers.wrappers.time_distributed.TimeDistributed(layer, keep_dims=False, **kwargs)[source]

Bases: keras.layers.wrappers.TimeDistributed

This class fixes two bugs in Keras: (1) the input mask is not passed to the wrapped layer, and (2) Keras’ TimeDistributed currently only allows a single input, not a list. We currently don’t handle the case where the _output_ of the wrapped layer is a list, however. (Not that that’s particularly hard, we just haven’t needed it yet, so haven’t implemented it.)

Notes

If the output shape for TimeDistributed has a final dimension of 1, we essentially squeeze it, reshaping to have one fewer dimension. That change takes place in the actual call method as well as the compute_output_shape method.

build(input_shape)[source]
compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors.
mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors, one per output tensor of the layer).
compute_output_shape(input_shape)[source]
get_config()[source]
get_output_mask_shape_for(input_shape)[source]
static reshape_inputs_and_masks(inputs, masks)[source]

Tensor Utils

Here are some general tensor manipulation utilities that we’ve written to help in other parts of the code base.

Core Tensor Utils

backend

These are utility functions that are similar to calls to Keras’ backend. Some of these are here because a current function in keras.backend is broken, some are things that just haven’t been implemented.

deep_qa.tensors.backend.apply_feed_forward(input_tensor, weights, activation)[source]

Takes an input tensor, a sequence of weights, and an activation, and builds an MLP. This can also be achieved by defining a sequence of Dense layers in Keras, but doing it this way is useful if the operation needs to happen within the call method of a more complex layer. Note that we are not applying biases here. The input tensor can have any number of dimensions, but its last dimension and the sequence of weights are expected to be compatible.
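
A rough sketch of the behavior described above, assuming weights is an ordered sequence of weight matrices and activation is a callable (an approximation for illustration, not the library's implementation):

    from keras import backend as K

    def apply_feed_forward_sketch(input_tensor, weights, activation):
        # Repeatedly project the last dimension and apply the activation; no biases.
        output = input_tensor
        for weight in weights:
            output = activation(K.dot(output, weight))
        return output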

deep_qa.tensors.backend.hardmax(unnormalized_attention, knowledge_length)[source]

A similar operation to softmax, except all of the weight is placed on the mode of the distribution. So, e.g., this function transforms [.34, .2, -1.4] -> [1, 0, 0].

TODO(matt): we really should have this take an optional mask...

deep_qa.tensors.backend.l1_normalize(tensor_to_normalize, mask=None)[source]

Normalize a tensor by its L1 norm. Takes an optional mask.

When the vector to be normalized is all 0’s we return the uniform distribution (taking masking into account, so masked values are still 0.0). When the vector to be normalized is completely masked, we return the uniform distribution over the max padding length of the tensor.

See the tests for concrete examples of the aforementioned behaviors.

Parameters:

tensor_to_normalize : Tensor

Tensor of shape (batch size, x) to be normalized, where x is arbitrary.

mask: Tensor, optional

Tensor of shape (batch size, x) indicating which elements of tensor_to_normalize are padding and should not be considered when normalizing.

Returns:

normalized_tensor : Tensor

Normalized tensor with shape (batch size, x).
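
A small example of the documented behavior (the values below are made up for illustration):

    from keras import backend as K
    from deep_qa.tensors.backend import l1_normalize

    vector = K.variable([[1.0, 3.0, 0.0, 0.0]])
    mask = K.variable([[1.0, 1.0, 1.0, 0.0]])  # last element is padding
    # Expected result per the docstring: [[0.25, 0.75, 0.0, 0.0]]
    print(K.eval(l1_normalize(vector, mask)))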

deep_qa.tensors.backend.last_dim_flatten(input_tensor)[source]

Takes a tensor and returns a matrix while preserving only the last dimension from the input.

deep_qa.tensors.backend.switch(cond, then_tensor, else_tensor)[source]

Keras’ implementation of K.switch currently uses tensorflow’s switch function, which only accepts scalar-valued conditions rather than boolean tensors that are applied elementwise. This doesn’t match Theano’s implementation of switch, but using tensorflow’s where, we can recover exactly that functionality.

deep_qa.tensors.backend.tile_scalar(scalar, vector)[source]

NOTE: If your vector has known shape (i.e., the relevant dimension from K.int_shape(vector) is not None), you should just use K.repeat_elements(scalar) instead of this. This method works, however, when the number of entries in your vector is unknown at graph compilation time.

This method takes a (collection of) scalar(s) (shape: (batch_size, 1)), and tiles that scalar a number of times, giving a vector of shape (batch_size, tile_length). (I say “scalar” and “vector” here because I’m ignoring the batch_size). We need the vector as input so we know what the tile_length is - the vector is otherwise ignored.

This is not done as a Keras Layer, however; if you want to use this function, you’ll need to do it _inside_ of a Layer somehow, either in a Lambda or in the call() method of a Layer you’re writing.

TODO(matt): we could probably make a more general tile_tensor method, which can do this for any dimensionality. There is another place in the code where we do this with a matrix and a tensor; all three of these can probably be one function.

deep_qa.tensors.backend.tile_vector(vector, matrix)[source]

NOTE: If your matrix has known shape (i.e., the relevant dimension from K.int_shape(matrix) is not None), you should just use K.repeat_elements(vector) instead of this. This method works, however, when the number of rows in your matrix is unknown at graph compilation time.

This method takes a (collection of) vector(s) (shape: (batch_size, vector_dim)), and tiles that vector a number of times, giving a matrix of shape (batch_size, tile_length, vector_dim). (I say “vector” and “matrix” here because I’m ignoring the batch_size). We need the matrix as input so we know what the tile_length is - the matrix is otherwise ignored.

This is necessary in a number of places in the code. For instance, if you want to do a dot product of a vector with all of the vectors in a matrix, the most efficient way to do that is to tile the vector first, then do an element-wise product with the matrix, then sum out the last mode. So, we capture this functionality here.

This is not done as a Keras Layer, however; if you want to use this function, you’ll need to do it _inside_ of a Layer somehow, either in a Lambda or in the call() method of a Layer you’re writing.
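
A hypothetical helper (vector_matrix_dot is not part of the library) showing the vector-matrix dot product pattern described above, as it might be used inside a Layer's call() method:

    from keras import backend as K
    from deep_qa.tensors.backend import tile_vector

    def vector_matrix_dot(vector, matrix):
        # vector: (batch_size, embed_dim), matrix: (batch_size, num_rows, embed_dim)
        tiled_vector = tile_vector(vector, matrix)    # (batch_size, num_rows, embed_dim)
        return K.sum(tiled_vector * matrix, axis=-1)  # (batch_size, num_rows)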

deep_qa.tensors.backend.very_negative_like(tensor)[source]

masked_operations

deep_qa.tensors.masked_operations.masked_batch_dot(tensor_a, tensor_b, mask_a, mask_b)[source]

The simplest case where this function is applicable is the following:

tensor_a: (batch_size, a_length, embed_dim)
tensor_b: (batch_size, b_length, embed_dim)
mask_a: None or (batch_size, a_length)
mask_b: None or (batch_size, b_length)

Returns: a_dot_b: (batch_size, a_length, b_length), with zeros for masked elements.

This function will also work for larger tensors, as long as abs(K.ndim(tensor_a) - K.ndim(tensor_b)) < 1 (this is due to the limitations of K.batch_dot). We always assume the dimension to perform the dot is the last one, and that the masks have one fewer dimension than the tensors.
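
An illustrative sketch of the simplest case above (the all-ones tensors are placeholders; only the shapes matter here):

    from keras import backend as K
    from deep_qa.tensors.masked_operations import masked_batch_dot

    tensor_a = K.ones((2, 5, 10))  # (batch_size, a_length, embed_dim)
    tensor_b = K.ones((2, 7, 10))  # (batch_size, b_length, embed_dim)
    mask_a = K.ones((2, 5))
    mask_b = K.ones((2, 7))
    a_dot_b = masked_batch_dot(tensor_a, tensor_b, mask_a, mask_b)  # (2, 5, 7)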

deep_qa.tensors.masked_operations.masked_softmax(vector, mask)[source]

K.softmax(vector) does not work if some elements of vector should be masked. This performs a softmax on just the non-masked portions of vector (passing None in for the mask is also acceptable; you’ll just get a regular softmax).

We assume that both vector and mask (if given) have shape (batch_size, vector_dim).

In the case that the input vector is completely masked, this function returns an array of 0.0. This behavior may cause NaN if this is used as the last layer of a model that uses categorical cross-entropy loss.
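
A small usage sketch (the values are illustrative):

    from keras import backend as K
    from deep_qa.tensors.masked_operations import masked_softmax

    vector = K.variable([[1.0, 2.0, 3.0]])
    mask = K.variable([[1.0, 1.0, 0.0]])  # last element is padding
    # Softmax over the first two entries only; the masked entry comes back as 0.0.
    print(K.eval(masked_softmax(vector, mask)))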

Similarity Functions

bilinear

class deep_qa.tensors.similarity_functions.bilinear.Bilinear(**kwargs)[source]

Bases: deep_qa.tensors.similarity_functions.similarity_function.SimilarityFunction

This similarity function performs a bilinear transformation of the two input vectors. This function has a matrix of weights W and a bias b, and the similarity between two vectors x and y is computed as x^T W y + b.

compute_similarity(tensor_1, tensor_2)[source]

Takes two tensors of the same shape, such as (batch_size, length_1, length_2, embedding_dim). Computes a (possibly parameterized) similarity on the final dimension and returns a tensor with one less dimension, such as (batch_size, length_1, length_2).

initialize_weights(tensor_1_dim: int, tensor_2_dim: int) → typing.List[K.variable][source]

Called in a Layer.build() method that uses this SimilarityFunction, here we both initialize whatever weights are necessary for this similarity function, and return them so they can be included in Layer.trainable_weights.

Parameters:

tensor_1_dim : int

The last dimension (typically embedding_dim) of the first input tensor. We need this so we can initialize weights appropriately.

tensor_2_dim : int

The last dimension (typically embedding_dim) of the second input tensor. We need this so we can initialize weights appropriately.

cosine_similarity

class deep_qa.tensors.similarity_functions.cosine_similarity.CosineSimilarity(**kwargs)[source]

Bases: deep_qa.tensors.similarity_functions.similarity_function.SimilarityFunction

This similarity function simply computes the cosine similarity between each pair of vectors. It has no parameters.

compute_similarity(tensor_1, tensor_2)[source]

Takes two tensors of the same shape, such as (batch_size, length_1, length_2, embedding_dim). Computes a (possibly parameterized) similarity on the final dimension and returns a tensor with one less dimension, such as (batch_size, length_1, length_2).

initialize_weights(tensor_1_dim: int, tensor_2_dim: int) → typing.List[K.variable][source]

Called in a Layer.build() method that uses this SimilarityFunction, here we both initialize whatever weights are necessary for this similarity function, and return them so they can be included in Layer.trainable_weights.

Parameters:

tensor_1_dim : int

The last dimension (typically embedding_dim) of the first input tensor. We need this so we can initialize weights appropriately.

tensor_2_dim : int

The last dimension (typically embedding_dim) of the second input tensor. We need this so we can initialize weights appropriately.

dot_product

class deep_qa.tensors.similarity_functions.dot_product.DotProduct(**kwargs)[source]

Bases: deep_qa.tensors.similarity_functions.similarity_function.SimilarityFunction

This similarity function simply computes the dot product between each pair of vectors. It has no parameters.

compute_similarity(tensor_1, tensor_2)[source]

Takes two tensors of the same shape, such as (batch_size, length_1, length_2, embedding_dim). Computes a (possibly parameterized) similarity on the final dimension and returns a tensor with one less dimension, such as (batch_size, length_1, length_2).

initialize_weights(tensor_1_dim: int, tensor_2_dim: int) → typing.List[K.variable][source]

Called in a Layer.build() method that uses this SimilarityFunction, here we both initialize whatever weights are necessary for this similarity function, and return them so they can be included in Layer.trainable_weights.

Parameters:

tensor_1_dim : int

The last dimension (typically embedding_dim) of the first input tensor. We need this so we can initialize weights appropriately.

tensor_2_dim : int

The last dimension (typically embedding_dim) of the second input tensor. We need this so we can initialize weights appropriately.

linear

class deep_qa.tensors.similarity_functions.linear.Linear(combination: str = 'x, y', **kwargs)[source]

Bases: deep_qa.tensors.similarity_functions.similarity_function.SimilarityFunction

This similarity function performs a dot product between a vector of weights and some combination of the two input vectors. The combination that is used is configurable.

If the two vectors are x and y, we allow the following kinds of combinations: x, y, x*y, x+y, x-y, x/y, where each of those binary operations is performed elementwise. You can list as many combinations as you want, comma separated. For example, you might give “x,y,x*y” as the combination parameter to this class. The computed similarity function would then be w^T [x; y; x*y] + b, where w is a vector of weights, b is a bias parameter, and [;] is vector concatenation.

Note that if you want a bilinear similarity function with a diagonal weight matrix W, where the similarity function is computed as x * w * y + b (with w the diagonal of W), you can accomplish that with this class by using “x*y” for combination.
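
A minimal sketch of constructing this similarity function directly (normally a Layer such as Attention or MatrixAttention would do this for you; the name and dimensions below are illustrative):

    from deep_qa.tensors.similarity_functions.linear import Linear

    similarity = Linear(combination='x,y,x*y', name='linear_similarity')
    # The enclosing Layer would call this from its build() method and add the
    # returned weights to its trainable_weights.
    weights = similarity.initialize_weights(tensor_1_dim=100, tensor_2_dim=100)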

_combine_tensors(tensor_1, tensor_2)[source]
_get_combination(combination: str, tensor_1, tensor_2)[source]
_get_combination_dim(combination: str, tensor_1_dim: int, tensor_2_dim: int) → int[source]
_get_combined_dim(tensor_1_dim: int, tensor_2_dim: int) → int[source]
compute_similarity(tensor_1, tensor_2)[source]

Takes two tensors of the same shape, such as (batch_size, length_1, length_2, embedding_dim). Computes a (possibly parameterized) similarity on the final dimension and returns a tensor with one less dimension, such as (batch_size, length_1, length_2).

initialize_weights(tensor_1_dim: int, tensor_2_dim: int) → typing.List[K.variable][source]

Called in a Layer.build() method that uses this SimilarityFunction, here we both initialize whatever weights are necessary for this similarity function, and return them so they can be included in Layer.trainable_weights.

Parameters:

tensor_1_dim : int

The last dimension (typically embedding_dim) of the first input tensor. We need this so we can initialize weights appropriately.

tensor_2_dim : int

The last dimension (typically embedding_dim) of the second input tensor. We need this so we can initialize weights appropriately.

similarity_function

Similarity functions take a pair of tensors with the same shape, and compute a similarity function on the vectors in the last dimension. For example, the tensors might both have shape (batch_size, sentence_length, embedding_dim), and we will compute some function of the two vectors of length embedding_dim for each position (batch_size, sentence_length), returning a tensor of shape (batch_size, sentence_length).

The similarity function could be as simple as a dot product, or it could be a more complex, parameterized function. The SimilarityFunction class exposes an API for a Layer that wants to allow for multiple similarity functions, such as for initializing and returning weights.

If you want to compute a similarity between tensors of different sizes, you need to first tile them in the appropriate dimensions to make them the same before you can use these functions. The Attention and MatrixAttention layers do this.

class deep_qa.tensors.similarity_functions.similarity_function.SimilarityFunction(name: str, initialization: str = 'glorot_uniform', activation: str = 'linear')[source]

Bases: object

compute_similarity(tensor_1, tensor_2)[source]

Takes two tensors of the same shape, such as (batch_size, length_1, length_2, embedding_dim). Computes a (possibly parameterized) similarity on the final dimension and returns a tensor with one less dimension, such as (batch_size, length_1, length_2).

initialize_weights(tensor_1_dim: int, tensor_2_dim: int) → typing.List[K.variable][source]

Called in a Layer.build() method that uses this SimilarityFunction, here we both initialize whatever weights are necessary for this similarity function, and return them so they can be included in Layer.trainable_weights.

Parameters:

tensor_1_dim : int

The last dimension (typically embedding_dim) of the first input tensor. We need this so we can initialize weights appropriately.

tensor_2_dim : int

The last dimension (typically embedding_dim) of the second input tensor. We need this so we can initialize weights appropriately.

Common Utils

Here are some general utilities that we’ve written to help in other parts of the code base.

Checks

exception deep_qa.common.checks.ConfigurationError(message)[source]

Bases: Exception

deep_qa.common.checks.ensure_pythonhashseed_set()[source]
deep_qa.common.checks.log_keras_version_info()[source]

Parameter Utils

class deep_qa.common.params.Params(params: typing.Dict[str, typing.Any], history: str = '')[source]

Bases: collections.abc.MutableMapping

Represents a parameter dictionary with a history, and contains other functionality around parameter passing and validation for DeepQA.

There are currently two benefits of a Params object over a plain dictionary for parameter passing:

  1. We handle a few kinds of parameter validation, including making sure that parameters representing discrete choices actually have acceptable values, and making sure no extra parameters are passed.
  2. We log all parameter reads, including default values. This gives a more complete specification of the actual parameters used than is given in a JSON / HOCON file, because those may not specify what default values were used, whereas this will log them.

The convention for using a Params object in DeepQA is that you will consume the parameters as you read them, so that there are none left when you’ve read everything you expect. This lets us easily validate that you didn’t pass in any extra parameters, just by making sure that the parameter dictionary is empty. You should do this when you’re done handling parameters, by calling Params.assert_empty().
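
A brief sketch of this convention (the parameter names below are made up):

    from deep_qa.common.params import Params

    params = Params({"num_hidden_layers": 2, "encoder_type": "bow"})
    num_hidden_layers = params.pop("num_hidden_layers", 1)
    encoder_type = params.pop_choice("encoder_type", ["bow", "lstm", "cnn"])
    params.assert_empty("MyModel")  # raises ConfigurationError if anything is left over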

DEFAULT = <object object>
as_dict(quiet=False)[source]

Sometimes we need to just represent the parameters as a dict, for instance when we pass them to a Keras layer (so that they can be serialised).

Parameters:

quiet: bool, optional (default = False)

Whether to log the parameters before returning them as a dict.

assert_empty(class_name: str)[source]

Raises a ConfigurationError if self.params is not empty. We take class_name as an argument so that the error message gives some idea of where an error happened, if there was one. class_name should be the name of the calling class, the one that got extra parameters (if there are any).

get(key: str, default: typing.Any = <object object>)[source]

Performs the functionality associated with dict.get(key) but also checks for returned dicts and returns a Params object in their place with an updated history.

pop(key: str, default: typing.Any = <object object>)[source]

Performs the functionality associated with dict.pop(key), along with checking for returned dictionaries, replacing them with Params objects with an updated history.

If key is not present in the dictionary, and no default was specified, we raise a ConfigurationError, instead of the typical KeyError.

pop_choice(key: str, choices: typing.List[typing.Any], default_to_first_choice: bool = False)[source]

Gets the value of key in the params dictionary, ensuring that the value is one of the given choices. Note that this pops the key from params, modifying the dictionary, consistent with how parameters are processed in this codebase.

Parameters:

key: str

Key to get the value from in the param dictionary

choices: List[Any]

A list of valid options for values corresponding to key. For example, if you’re specifying the type of encoder to use for some part of your model, the choices might be the list of encoder classes we know about and can instantiate. If the value we find in the param dictionary is not in choices, we raise a ConfigurationError, because the user specified an invalid value in their parameter file.

default_to_first_choice: bool, optional (default=False)

If this is True, we allow the key to not be present in the parameter dictionary. If the key is not present, we will return the first choice in the choices list as the value. If this is False, we raise a ConfigurationError, because specifying the key is required (e.g., you have to specify your model class when running an experiment, but you can feel free to use default settings for encoders if you want).

deep_qa.common.params.pop_choice(params: typing.Dict[str, typing.Any], key: str, choices: typing.List[typing.Any], default_to_first_choice: bool = False, history: str = '?.') → typing.Any[source]

Performs the same function as Params.pop_choice(), but is required in order to deal with places that the Params object is not welcome, such as inside Keras layers. See the docstring of that method for more detail on how this function works.

This method adds a history parameter, in the off-chance that you know it, so that we can reproduce Params.pop_choice() exactly. We default to using “?.” if you don’t know the history, so you’ll have to fix that in the log if you want to actually recover the logged parameters.

deep_qa.common.params.replace_none(dictionary: typing.Dict[str, typing.Any]) → typing.Dict[str, typing.Any][source]