Base Instances

An Instance is a single training or testing example for a Keras model. The base classes for working with Instances are found in instance.py. There are two subclasses: (1) TextInstance, which is a raw instance that contains actual strings, and can be used to determine a vocabulary for a model, or read directly from a file; and (2) IndexedInstance, which has had its raw strings converted to word (or character) indices, and can be padded to a consistent length and converted to numpy arrays for use with Keras.

Concrete Instance classes are organized in the code by the task they are designed for (e.g., text classification, reading comprehension, sequence tagging, etc.).

Much of the DeepQA library's flexibility comes from the concrete Instance classes in this module. Most of the code can be agnostic to how the input is structured, because the conversion to numpy arrays happens here rather than in the Trainer or TextTrainer classes. Only the specific _build_model() methods need to know the format of their input and output (and even some of those details are transparent to the model class).

This module contains the base Instance classes that concrete classes inherit from. Specifically, there are three classes:

  1. Instance, which exists only as a base type with no functionality.
  2. TextInstance, which adds a words() method and a method to convert strings to indices using a DataIndexer.
  3. IndexedInstance, which is a TextInstance that has had all of its strings converted into indices. This class has methods to deal with padding (so that sequences all have the same length) and to convert an Instance into a set of NumPy arrays suitable for use with Keras.

As this codebase is dealing mostly with textual question answering, pretty much all of the concrete Instance types will have both a TextInstance and a corresponding IndexedInstance, which you can see in the individual files for each Instance type.

class deep_qa.data.instances.instance.IndexedInstance(label, index: int = None)[source]

Bases: deep_qa.data.instances.instance.Instance

An indexed data instance has all word tokens replaced with word indices, along with some kind of label, suitable for input to a Keras model. An IndexedInstance is created from an Instance using a DataIndexer, and the indices here have no recoverable meaning without the DataIndexer.

For example, we might have the following Instance:

  - TrueFalseInstance('Jamie is nice, Holly is mean', True, 25)

After being converted into an IndexedInstance, we might have the following:

  - IndexedTrueFalseInstance([1, 6, 7, 1, 6, 8], True, 25)

This would mean that "Jamie" and "Holly" were OOV (out of vocabulary) to the DataIndexer, so both were mapped to the same index (1 here), and the other words were given their own indices.
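The conversion in this example can be sketched with a toy vocabulary. This is only an illustration: ToyIndexer and its index_text method are hypothetical stand-ins rather than the real DataIndexer API, and the tokenization, the index values, and the use of index 1 for OOV words are assumptions chosen to reproduce the numbers above.

```python
import re
from typing import Dict, List


class ToyIndexer:
    """Hypothetical stand-in for a DataIndexer (not the real API)."""

    def __init__(self, vocabulary: Dict[str, int], oov_index: int = 1):
        self.vocabulary = vocabulary
        self.oov_index = oov_index

    def index_text(self, text: str) -> List[int]:
        # Assumed tokenization: lowercase the text and keep alphabetic runs.
        tokens = re.findall(r"[a-z]+", text.lower())
        # Words missing from the vocabulary all share the OOV index.
        return [self.vocabulary.get(token, self.oov_index) for token in tokens]


indexer = ToyIndexer({"is": 6, "nice": 7, "mean": 8})
indices = indexer.index_text("Jamie is nice, Holly is mean")
print(indices)
```

Because both names collapse to the same index, the original strings cannot be recovered from the indices alone, which is why the indices have no meaning without the indexer that produced them.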

static _get_word_sequence_lengths(word_indices: typing.List) → typing.Dict[str, int][source]

Because TextEncoders can return complex data structures, we might actually have several things to pad for a single word sequence. We check for that and handle it in a single spot here. We return a dictionary containing ‘num_sentence_words’, which is the number of words in word_indices. If the word representations also contain characters, the dictionary additionally contains a ‘num_word_characters’ key, with a value corresponding to the longest word in the sequence.
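A minimal sketch of the behaviour described here, assuming that a word-level encoding is a flat list of ints and that a character-level encoding represents each word as a list of character indices. The function name word_sequence_lengths is hypothetical, and the library's actual checks may differ.

```python
from typing import Dict, List


def word_sequence_lengths(word_indices: List) -> Dict[str, int]:
    # The number of words is always reported.
    lengths = {"num_sentence_words": len(word_indices)}
    # Assumption: nested lists mean each word carries character indices.
    if word_indices and isinstance(word_indices[0], list):
        lengths["num_word_characters"] = max(len(word) for word in word_indices)
    return lengths


word_level = word_sequence_lengths([4, 9, 2])
char_level = word_sequence_lengths([[3, 5], [7, 1, 1, 8], [2]])
print(word_level)
print(char_level)
```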

as_training_data()[source]

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns:

train_data : (inputs, label)

The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.

classmethod empty_instance()[source]

Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.

get_padding_lengths() → typing.Dict[str, int][source]

Returns the length of this instance in all dimensions that require padding.

Different kinds of instances have different fields that are padded, such as sentence length, number of background sentences, number of options, etc.

Returns:

padding_lengths: Dict[str, int]

A dictionary mapping padding keys (like “num_sentence_words”) to lengths.
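One plausible use of these dictionaries is to find, for each padding key, the largest length in a dataset, so that every instance can then be padded to it. The helper max_padding_lengths below is hypothetical and is not part of the documented API; it only illustrates how the per-instance dictionaries could be combined.

```python
from typing import Dict, List


def max_padding_lengths(all_lengths: List[Dict[str, int]]) -> Dict[str, int]:
    """Take the per-key maximum over many padding-length dictionaries."""
    maxima: Dict[str, int] = {}
    for lengths in all_lengths:
        for key, value in lengths.items():
            maxima[key] = max(maxima.get(key, 0), value)
    return maxima


# Example values as they might be returned by get_padding_lengths().
per_instance = [
    {"num_sentence_words": 6, "num_word_characters": 5},
    {"num_sentence_words": 9, "num_word_characters": 4},
]
dataset_lengths = max_padding_lengths(per_instance)
print(dataset_lengths)
```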

pad(padding_lengths: typing.Dict[str, int])[source]

Add zero-padding to make each data example of equal length for use in the neural network.

This modifies the current object.

Parameters:

padding_lengths: Dict[str, int]

In this dictionary, each str key names a padding dimension (e.g. num_sentence_words), and the corresponding int is the length to pad that dimension to. This dictionary must have the same keys as were returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.
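The in-place contract of pad() can be illustrated with a toy class. ToyIndexedInstance is hypothetical and has a single padded dimension; keeping the last indices and placing zeros at the front is an assumption made for this sketch, and real instance types may pad several dimensions.

```python
from typing import Dict, List


class ToyIndexedInstance:
    """Hypothetical minimal instance with one padded dimension."""

    def __init__(self, word_indices: List[int], label):
        self.word_indices = word_indices
        self.label = label

    def get_padding_lengths(self) -> Dict[str, int]:
        return {"num_sentence_words": len(self.word_indices)}

    def pad(self, padding_lengths: Dict[str, int]) -> None:
        target = padding_lengths["num_sentence_words"]
        # Assumption: keep the last `target` indices, zero-pad at the front.
        kept = self.word_indices[-target:] if target > 0 else []
        # This modifies the current object rather than returning a copy.
        self.word_indices = [0] * (target - len(kept)) + kept


short = ToyIndexedInstance([1, 6, 7], True)
short.pad({"num_sentence_words": 5})
long = ToyIndexedInstance([1, 2, 3, 4], False)
long.pad({"num_sentence_words": 2})
print(short.word_indices, long.word_indices)
```

Note that the keys passed to pad() match those returned by get_padding_lengths(), as the parameter description requires.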

static pad_sequence_to_length(sequence: typing.List, desired_length: int, default_value: typing.Callable[[], typing.Any] = <function IndexedInstance.<lambda>>, truncate_from_right: bool = True) → typing.List[source]

Takes a list of indices and pads or truncates it to the desired length.

Parameters:

sequence : List of int

A list of word indices.

desired_length : int

Maximum length of each sequence. Longer sequences are truncated to this length, and shorter ones are padded to it.

default_value: Callable, default=lambda: 0

Callable that outputs a default value (of any type) to use as padding values.

truncate_from_right : bool, default=True

If truncating the indices is necessary, this parameter dictates whether we do so on the left or right.

Returns:

padded_word_sequence : List of int

A padded or truncated list of word indices.

Notes

We truncate from the right by default to handle questions that have long setups. We at least want to encode the question itself, which is always at the end, even if we lose much of the setup. If you want to truncate from the other direction, you can.
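The following sketch shows one reading of these parameters. It assumes that truncate_from_right=True keeps the rightmost elements (so the end of a question survives) and places padding at the front, which is what the note above implies; this should be checked against the source before relying on it. The name pad_to_length_sketch is hypothetical.

```python
from typing import Any, Callable, List


def pad_to_length_sketch(sequence: List,
                         desired_length: int,
                         default_value: Callable[[], Any] = lambda: 0,
                         truncate_from_right: bool = True) -> List:
    if desired_length <= 0:
        return []
    if truncate_from_right:
        # Assumed meaning: keep the right-hand end of the sequence.
        truncated = sequence[-desired_length:]
    else:
        truncated = sequence[:desired_length]
    # Call default_value once per slot so mutable defaults are not shared.
    padding = [default_value() for _ in range(desired_length - len(truncated))]
    if truncate_from_right:
        return padding + truncated
    return truncated + padding


kept_end = pad_to_length_sketch([1, 2, 3, 4, 5], 3)
kept_start = pad_to_length_sketch([1, 2, 3, 4, 5], 3, truncate_from_right=False)
padded_front = pad_to_length_sketch([1, 2], 4)
padded_back = pad_to_length_sketch([1, 2], 4, truncate_from_right=False)
print(kept_end, kept_start, padded_front, padded_back)
```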

static pad_word_sequence(word_sequence: typing.List[int], padding_lengths: typing.Dict[str, int], truncate_from_right: bool = True) → typing.List[source]

Takes a list of word indices and pads it.

Parameters:

word_sequence : List of int

A list of word indices.

padding_lengths : Dict[str, int]

In this dictionary, each str key names a padding dimension (e.g. num_sentence_words), and the corresponding int is the length to pad that dimension to. This dictionary must have the same dimension as was returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.

truncate_from_right : bool, default=True

If truncating the indices is necessary, this parameter dictates whether we do so on the left or right.

Returns:

padded_word_sequence : List of int

A padded list of word indices.

Notes

We truncate from the right by default to handle questions that have long setups. We at least want to encode the question itself, which is always at the end, even if we lose much of the setup. If you want to truncate from the other direction, you can.

TODO(matt): we should probably switch the default to truncate from the left, and clear up the naming here - it’s easy to get confused about what “truncate from right” means.
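When word representations also contain characters, padding has two levels: the number of words, then the number of characters in each word. The sketch below is a hypothetical illustration of that idea and not the library's implementation; the helper names, the front-padding of words, and the end-padding of characters are all assumptions.

```python
from typing import Callable, Dict, List


def pad_left(sequence: List, length: int, make_default: Callable) -> List:
    """Keep the last `length` items and pad at the front (assumed default)."""
    kept = sequence[-length:] if length > 0 else []
    return [make_default() for _ in range(length - len(kept))] + kept


def pad_word_sequence_sketch(word_sequence: List,
                             padding_lengths: Dict[str, int]) -> List:
    num_words = padding_lengths["num_sentence_words"]
    if "num_word_characters" not in padding_lengths:
        # Word-level encoding: a flat list of ints, padded with zeros.
        return pad_left(word_sequence, num_words, lambda: 0)
    num_chars = padding_lengths["num_word_characters"]
    # Character-level encoding: pad the word dimension with empty words.
    words = pad_left(word_sequence, num_words, lambda: [])
    # Assumption: each word is truncated and zero-padded at its end.
    return [word[:num_chars] + [0] * (num_chars - len(word[:num_chars]))
            for word in words]


words_only = pad_word_sequence_sketch([4, 9], {"num_sentence_words": 3})
with_chars = pad_word_sequence_sketch(
    [[3, 5], [7, 1, 1, 8]],
    {"num_sentence_words": 3, "num_word_characters": 3},
)
print(words_only)
print(with_chars)
```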

class deep_qa.data.instances.instance.Instance(label, index: int = None)[source]

Bases: object

A data instance, used either for training a neural network or for testing one.

Parameters:

label : Any

Any kind of label that you might want to predict in a model. Could be a class label, a tag sequence, a character span in a passage, etc.

index : int, optional

Used for matching instances with other data, such as background sentences.

class deep_qa.data.instances.instance.TextInstance(label, index: int = None)[source]

Bases: deep_qa.data.instances.instance.Instance

An Instance that has some attached text, typically either a sentence or a logical form. This is called a TextInstance because the individual tokens here are encoded as strings, and we can get a list of strings out when we ask what words show up in the instance.

We use these kinds of instances to fit a DataIndexer (i.e., deciding which words should be mapped to an unknown token); to use them in training or testing, we need to first convert them into IndexedInstances.

In order to actually convert text into some kind of indexed sequence, we rely on a TextEncoder. There are several TextEncoder subclasses, that will let you use word token sequences, character sequences, and other options. By default we use word tokens. You can override this by setting the encoder class variable.

_index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[int][source]

_words_from_text(text: str) → typing.Dict[str, typing.List[str]][source]

classmethod read_from_line(line: str)[source]

Reads an instance of this type from a line.

Parameters:

line : str

A line from a data file.

Returns:

instance : TextInstance

An instance of this class, read from the given line.

Notes

We raise a RuntimeError here instead of a NotImplementedError, because it’s not expected that all subclasses will implement this.
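The pattern described in this note might look like the following. Both classes are hypothetical stand-ins, not the library's code, and the tab-separated "text<TAB>label" line format is an assumption made only for the example.

```python
class TextInstanceSketch:
    """Hypothetical stand-in for the TextInstance base class."""

    def __init__(self, label, index: int = None):
        self.label = label
        self.index = index

    @classmethod
    def read_from_line(cls, line: str):
        # Not every subclass can be read from a file, hence RuntimeError
        # rather than NotImplementedError.
        raise RuntimeError("%s instances can't be read from a line!" % cls.__name__)


class ToySentenceInstance(TextInstanceSketch):
    """Hypothetical subclass that does support reading from a line."""

    def __init__(self, text: str, label: bool, index: int = None):
        super().__init__(label, index)
        self.text = text

    @classmethod
    def read_from_line(cls, line: str):
        # Assumed format: sentence, a tab, then "1" or "0" as the label.
        text, label = line.rstrip("\n").split("\t")
        return cls(text, label == "1")


instance = ToySentenceInstance.read_from_line("Jamie is nice\t1\n")
try:
    TextInstanceSketch.read_from_line("anything")
    base_error = None
except RuntimeError as error:
    base_error = str(error)
print(instance.text, instance.label, base_error)
```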

to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer) → deep_qa.data.instances.instance.IndexedInstance[source]

Converts the words in this Instance into indices using the DataIndexer.

Parameters:

data_indexer : DataIndexer

DataIndexer to use in converting the Instance to an IndexedInstance.

Returns:

indexed_instance : IndexedInstance

A TextInstance that has had all of its strings converted into indices.

tokenizer = <deep_qa.data.tokenizers.word_tokenizer.WordTokenizer object>

words() → typing.Dict[str, typing.List[str]][source]

Returns a list of all of the words in this instance, contained in a namespace dictionary.

This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.

Returns:

namespace : Dictionary of {str: List[str]}

The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key ‘words’ to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.
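To make the namespace dictionary concrete, here is a small sketch of how such dictionaries could feed per-namespace token counts when fitting a vocabulary. The functions words_sketch and count_namespaces are hypothetical, the whitespace tokenization is an assumption, and the 'characters' namespace is only one possible second namespace.

```python
from collections import Counter
from typing import Dict, List


def words_sketch(text: str) -> Dict[str, List[str]]:
    # Assumption: lowercase whitespace tokenization.
    tokens = text.lower().split()
    return {
        "words": tokens,
        "characters": [char for token in tokens for char in token],
    }


def count_namespaces(namespace_dicts: List[Dict[str, List[str]]]) -> Dict[str, Counter]:
    """Accumulate token counts separately for each namespace."""
    counts: Dict[str, Counter] = {}
    for namespaces in namespace_dicts:
        for namespace, tokens in namespaces.items():
            counts.setdefault(namespace, Counter()).update(tokens)
    return counts


counts = count_namespaces([words_sketch("the cat"), words_sketch("the dog")])
print(counts["words"].most_common(1))
```

Keeping the counts separate per namespace is what would allow words and characters to have vocabularies, and hence embedding matrices, of different sizes.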