Base Instances
An Instance is a single training or testing example for a Keras model. The base classes for working with Instances are found in instance.py. There are two subclasses: (1) TextInstance, which is a raw instance that contains actual strings, and can be used to determine a vocabulary for a model, or read directly from a file; and (2) IndexedInstance, which has had its raw strings converted to word (or character) indices, and can be padded to a consistent length and converted to numpy arrays for use with Keras.
Concrete Instance classes are organized in the code by the task they are designed for (e.g., text classification, reading comprehension, sequence tagging, etc.).

A lot of the magic of how the DeepQA library works happens here, in the concrete Instance classes in this module. Most of the code can be totally agnostic to how exactly the input is structured, because the conversion to numpy arrays happens here, not in the Trainer or TextTrainer classes, with only the specific _build_model() methods needing to know about the format of their input and output (and even some of the details there are transparent to the model class).
This module contains the base Instance classes that concrete classes inherit from. Specifically, there are three classes:

- Instance, which just exists as a base type with no functionality
- TextInstance, which adds a words() method and a method to convert strings to indices using a DataIndexer
- IndexedInstance, which is a TextInstance that has had all of its strings converted into indices. This class has methods to deal with padding (so that sequences all have the same length) and converting an Instance into a set of numpy arrays suitable for use with Keras.
As this codebase deals mostly with textual question answering, pretty much all of the concrete Instance types have both a TextInstance and a corresponding IndexedInstance, which you can see in the individual files for each Instance type.
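The TextInstance-to-IndexedInstance conversion described above can be sketched with a toy reimplementation. This is illustrative code only, not the library's actual classes; the TinyIndexer class, the whitespace tokenization, and the index conventions (0 for padding, 1 for OOV) are all assumptions made for the sketch:

```python
# Illustrative sketch of the TextInstance -> IndexedInstance flow.
# NOT the library's code: the indexer, tokenization, and index
# conventions here are simplifying assumptions.

class TinyIndexer:
    """Maps words to indices; unseen words get the OOV index 1."""
    def __init__(self, vocab):
        # Index 0 is reserved for padding, 1 for out-of-vocabulary words.
        self.word_to_index = {word: i + 2 for i, word in enumerate(vocab)}

    def get_word_index(self, word):
        return self.word_to_index.get(word, 1)


class ToyTextInstance:
    """Holds raw text; knows its words and how to become indexed."""
    def __init__(self, text, label):
        self.text = text
        self.label = label

    def words(self):
        # Namespace dictionary, as described in the module docs.
        return {'words': self.text.lower().split()}

    def to_indexed_instance(self, indexer):
        indices = [indexer.get_word_index(w) for w in self.words()['words']]
        return ToyIndexedInstance(indices, self.label)


class ToyIndexedInstance:
    """Holds word indices; ready for padding and numpy conversion."""
    def __init__(self, word_indices, label):
        self.word_indices = word_indices
        self.label = label


indexer = TinyIndexer(['is', 'nice', 'mean'])
instance = ToyTextInstance('jamie is nice', True)
indexed = instance.to_indexed_instance(indexer)
# 'jamie' is OOV (index 1); 'is' and 'nice' are in the vocabulary.
print(indexed.word_indices)  # [1, 2, 3]
```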
class deep_qa.data.instances.instance.IndexedInstance(label, index: int = None)

Bases: deep_qa.data.instances.instance.Instance
An indexed data instance has all word tokens replaced with word indices, along with some kind of label, suitable for input to a Keras model. An IndexedInstance is created from an Instance using a DataIndexer, and the indices here have no recoverable meaning without the DataIndexer.

For example, we might have the following Instance:

    TrueFalseInstance('Jamie is nice, Holly is mean', True, 25)

After being converted into an IndexedInstance, we might have the following:

    IndexedTrueFalseInstance([1, 6, 7, 1, 6, 8], True, 25)

This would mean that "Jamie" and "Holly" were OOV to the DataIndexer, and the other words were given indices.
static _get_word_sequence_lengths(word_indices: typing.List) → typing.Dict[str, int]

Because TextEncoders can return complex data structures, we might actually have several things to pad for a single word sequence. We check for that and handle it in a single spot here.

We return a dictionary containing 'num_sentence_words', which is the number of words in word_indices. If the word representations also contain characters, the dictionary additionally contains a 'num_word_characters' key, with a value corresponding to the longest word in the sequence.
as_training_data()

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns: train_data : (inputs, label)
    The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.
classmethod empty_instance()

Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.
get_padding_lengths() → typing.Dict[str, int]

Returns the length of this instance in all dimensions that require padding.

Different kinds of instances have different fields that are padded, such as sentence length, number of background sentences, number of options, etc.

Returns: padding_lengths : Dict[str, int]
    A dictionary mapping padding keys (like "num_sentence_words") to lengths.
pad(padding_lengths: typing.Dict[str, int])

Add zero-padding to make each data example of equal length for use in the neural network. This modifies the current object.

Parameters: padding_lengths : Dict[str, int]
    In this dictionary, each str refers to a type of token (e.g. num_sentence_words), and the corresponding int is the value. This dictionary must have the same keys as the one returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.
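The contract between get_padding_lengths() and pad() is that a dataset typically computes the maximum lengths over all of its instances and then pads every instance to those lengths. A minimal sketch of that contract follows; this is not the library's code, and padding on the left with zeros is an assumption made for the sketch:

```python
# Illustrative sketch of the get_padding_lengths() / pad() contract.
# NOT the library's code; left-side zero padding is an assumption.

class ToyIndexedInstance:
    def __init__(self, word_indices):
        self.word_indices = word_indices

    def get_padding_lengths(self):
        # One padded dimension: the number of words in the sentence.
        return {'num_sentence_words': len(self.word_indices)}

    def pad(self, padding_lengths):
        desired = padding_lengths['num_sentence_words']
        # Keep the rightmost `desired` indices, then pad on the left.
        truncated = self.word_indices[-desired:] if desired > 0 else []
        self.word_indices = [0] * (desired - len(truncated)) + truncated


# A dataset computes max lengths over all instances, then pads each one:
instances = [ToyIndexedInstance([4, 2]), ToyIndexedInstance([7, 1, 9])]
max_lengths = {'num_sentence_words':
               max(i.get_padding_lengths()['num_sentence_words']
                   for i in instances)}
for inst in instances:
    inst.pad(max_lengths)
print([i.word_indices for i in instances])  # [[0, 4, 2], [7, 1, 9]]
```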
static pad_sequence_to_length(sequence: typing.List, desired_length: int, default_value: typing.Callable[[], typing.Any] = lambda: 0, truncate_from_right: bool = True) → typing.List

Takes a list of indices and pads them to the desired length.

Parameters: sequence : List of int
    A list of word indices.
desired_length : int
    Maximum length of each sequence. Longer sequences are truncated to this length, and shorter ones are padded to it.
default_value : Callable, default=lambda: 0
    Callable that outputs a default value (of any type) to use as padding values.
truncate_from_right : bool, default=True
    If truncating the indices is necessary, this parameter dictates whether we do so on the left or right.

Returns: padded_word_sequence : List of int
    A padded or truncated list of word indices.

Notes

The reason we truncate from the right by default is for cases that are questions, with long set-ups. We at least want to get the question encoded, which is always at the end, even if we've lost much of the question set-up. If you want to truncate from the other direction, you can.
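A standalone sketch of the documented semantics may make this concrete. This is an illustrative reimplementation, not the library's function; keeping the rightmost elements when truncate_from_right=True (so the question at the end survives) follows the Notes above, while padding on the left is an assumption:

```python
from typing import Any, Callable, List

def pad_sequence_to_length(sequence: List,
                           desired_length: int,
                           default_value: Callable[[], Any] = lambda: 0,
                           truncate_from_right: bool = True) -> List:
    # Illustrative sketch of the documented semantics, NOT the
    # library's code. Padding on the left is an assumption.
    if truncate_from_right:
        # Keep the rightmost `desired_length` items, so the end of the
        # sequence (e.g. the question) survives truncation.
        truncated = sequence[-desired_length:] if desired_length > 0 else []
    else:
        truncated = sequence[:desired_length]
    padding = [default_value() for _ in range(desired_length - len(truncated))]
    return padding + truncated

print(pad_sequence_to_length([1, 2, 3, 4, 5], 3))  # [3, 4, 5]
print(pad_sequence_to_length([1, 2], 4))           # [0, 0, 1, 2]
```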
static pad_word_sequence(word_sequence: typing.List[int], padding_lengths: typing.Dict[str, int], truncate_from_right: bool = True) → typing.List

Takes a list of indices and pads them.

Parameters: word_sequence : List of int
    A list of word indices.
padding_lengths : Dict[str, int]
    In this dictionary, each str refers to a type of token (e.g. num_sentence_words), and the corresponding int is the value. This dictionary must have the same keys as the one returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.
truncate_from_right : bool, default=True
    If truncating the indices is necessary, this parameter dictates whether we do so on the left or right.

Returns: padded_word_sequence : List of int
    A padded list of word indices.

Notes

The reason we truncate from the right by default is for cases that are questions, with long set-ups. We at least want to get the question encoded, which is always at the end, even if we've lost much of the question set-up. If you want to truncate from the other direction, you can.

TODO(matt): we should probably switch the default to truncate from the left, and clear up the naming here - it's easy to get confused about what "truncate from right" means.
class deep_qa.data.instances.instance.Instance(label, index: int = None)

Bases: object

A data instance, used either for training a neural network or for testing one.

Parameters: label : Any
    Any kind of label that you might want to predict in a model. Could be a class label, a tag sequence, a character span in a passage, etc.
index : int, optional
    Used for matching instances with other data, such as background sentences.
class deep_qa.data.instances.instance.TextInstance(label, index: int = None)

Bases: deep_qa.data.instances.instance.Instance

An Instance that has some attached text, typically either a sentence or a logical form. This is called a TextInstance because the individual tokens here are encoded as strings, and we can get a list of strings out when we ask what words show up in the instance.

We use these kinds of instances to fit a DataIndexer (i.e., deciding which words should be mapped to an unknown token); to use them in training or testing, we need to first convert them into IndexedInstances.

In order to actually convert text into some kind of indexed sequence, we rely on a TextEncoder. There are several TextEncoder subclasses, which will let you use word token sequences, character sequences, and other options. By default we use word tokens. You can override this by setting the encoder class variable.

_index_text(text: str, data_indexer: deep_qa.data.data_indexer.DataIndexer) → typing.List[int]
classmethod read_from_line(line: str)

Reads an instance of this type from a line.

Parameters: line : str
    A line from a data file.

Returns: instance : TextInstance
    An instance of this class, read from the line.

Notes

We throw a RuntimeError here instead of a NotImplementedError, because it's not expected that all subclasses will implement this.
to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer) → deep_qa.data.instances.instance.IndexedInstance

Converts the words in this Instance into indices using the DataIndexer.

Parameters: data_indexer : DataIndexer
    DataIndexer to use in converting the Instance to an IndexedInstance.

Returns: indexed_instance : IndexedInstance
    A TextInstance that has had all of its strings converted into indices.
tokenizer = <deep_qa.data.tokenizers.word_tokenizer.WordTokenizer object>
words() → typing.Dict[str, typing.List[str]]

Returns a list of all of the words in this instance, contained in a namespace dictionary.

This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key 'words' to represent word tokens.

Returns: namespace : Dictionary of {str: List[str]}
    The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key 'words' to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.