Text Classification Instances
These Instances are designed for any classification task over a single passage of text. The
input is the passage (e.g., a sentence, a document, etc.), and the output is a single label (e.g.,
positive / negative sentiment, spam / not spam, essay grade, etc.).
TextClassificationInstances
-
class
deep_qa.data.instances.text_classification.text_classification_instance.IndexedTextClassificationInstance(word_indices: typing.List[int], label, index: int = None)
Bases: deep_qa.data.instances.instance.IndexedInstance
-
as_training_data()
Convert this IndexedInstance to NumPy arrays suitable for use as training data for Keras models.
Returns: train_data : (inputs, label)
The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.
-
classmethod
empty_instance()
Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.
-
get_padding_lengths() → typing.Dict[str, int]
Returns the length of this instance in all dimensions that require padding.
Different kinds of instances have different fields that are padded, such as sentence length, number of background sentences, number of options, etc.
Returns: padding_lengths: Dict[str, int]
A dictionary mapping padding keys (like “num_sentence_words”) to lengths.
-
pad(padding_lengths: typing.Dict[str, int])
Add zero-padding to make each data example of equal length for use in the neural network.
This modifies the current object.
Parameters: padding_lengths: Dict[str, int]
In this dictionary, each str is a padding key (e.g. num_sentence_words), and the corresponding int is the length to pad to. This dictionary must have the same keys as were returned by get_padding_lengths(). We will use these lengths to pad the instance in all of the necessary dimensions to the given lengths.
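The interplay between get_padding_lengths() and pad() can be illustrated with a small standalone sketch. It does not use deep_qa itself: the helper names below are hypothetical, the sketch returns new lists rather than modifying an object in place, and padding or truncating at the end of the sequence is an assumption, since this page does not say which side is used.

```python
from typing import Dict, List


def get_padding_lengths(word_indices: List[int]) -> Dict[str, int]:
    # A single padded dimension: the number of word indices in the passage.
    return {"num_sentence_words": len(word_indices)}


def pad_word_indices(word_indices: List[int],
                     padding_lengths: Dict[str, int]) -> List[int]:
    # Truncate or zero-pad to exactly the requested length.
    # Doing this at the end of the sequence is an assumption of this sketch.
    target = padding_lengths["num_sentence_words"]
    truncated = word_indices[:target]
    return truncated + [0] * (target - len(truncated))


batch = [[4, 7, 2], [9, 3], [5, 1, 8, 6]]

# For each padding key, take the maximum length over the batch.
max_lengths: Dict[str, int] = {}
for indices in batch:
    for key, length in get_padding_lengths(indices).items():
        max_lengths[key] = max(max_lengths.get(key, 0), length)

# Pad every example to the shared lengths so they can be stacked.
padded = [pad_word_indices(indices, max_lengths) for indices in batch]
```

Because the dictionary passed to the padding step is built from the keys that the length step reported, the two always agree on key names, which mirrors the requirement stated above.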
-
-
class
deep_qa.data.instances.text_classification.text_classification_instance.TextClassificationInstance(text: str, label: bool, index: int = None)
Bases: deep_qa.data.instances.instance.TextInstance
A TextClassificationInstance is a TextInstance that is a single passage of text, where that passage has some associated (categorical, or possibly real-valued) label.
-
classmethod
read_from_line(line: str)
Reads a TextClassificationInstance object from a line. The line must be in one of four formats:
- [sentence]
- [sentence index][tab][sentence]
- [sentence][tab][label]
- [sentence index][tab][sentence][tab][label]
If no label is given, we use None as the label.
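A minimal standalone sketch of parsing these four formats follows. It is not the library's implementation: parse_line is a hypothetical helper, and two details are assumptions, namely that a two-field line is treated as [sentence index][tab][sentence] when its first field is a decimal number, and that the label string "1" means True.

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Tuple[str, Optional[bool], Optional[int]]:
    """Return (text, label, index) for one of the four line formats."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 1:
        # [sentence]
        return fields[0], None, None
    if len(fields) == 2:
        if fields[0].isdecimal():
            # [sentence index][tab][sentence] (assumed disambiguation rule)
            return fields[1], None, int(fields[0])
        # [sentence][tab][label]; "1" meaning True is an assumption
        return fields[0], fields[1] == "1", None
    if len(fields) == 3:
        # [sentence index][tab][sentence][tab][label]
        return fields[1], fields[2] == "1", int(fields[0])
    raise ValueError("Unrecognized line format: " + line)


text, label, index = parse_line("3\ta good movie\t1")
```

Note that the two-field case is inherently ambiguous when a sentence consists only of digits, so any rule for telling the formats apart is a design choice rather than something the format itself guarantees.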
-
to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer)
Converts the words in this Instance into indices using the DataIndexer.
Parameters: data_indexer : DataIndexer
DataIndexer to use in converting the Instance to an IndexedInstance.
Returns: indexed_instance : IndexedInstance
A TextInstance that has had all of its strings converted into indices.
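The idea of converting strings into indices can be sketched without deep_qa. The plain dictionary below stands in for a DataIndexer and is purely hypothetical, as are the whitespace tokenization and the convention of reserving index 0 for padding and index 1 for unknown words.

```python
from typing import Dict, List

# Hypothetical vocabulary. Index 0 is left free for padding and index 1 is
# used for out-of-vocabulary words; both conventions are assumptions.
UNKNOWN_INDEX = 1
vocabulary: Dict[str, int] = {"the": 2, "movie": 3, "was": 4, "great": 5}


def index_words(words: List[str], vocab: Dict[str, int]) -> List[int]:
    # Look each word up, falling back to the unknown-word index.
    return [vocab.get(word, UNKNOWN_INDEX) for word in words]


tokens = "the movie was surprisingly great".split()
word_indices = index_words(tokens, vocabulary)
```

The resulting list of integers plays the role of the word_indices argument of IndexedTextClassificationInstance, with the label and index carried over unchanged.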
-
words() → typing.Dict[str, typing.List[str]]
Returns a list of all of the words in this instance, contained in a namespace dictionary.
This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.
Returns: namespace : Dictionary of {str: List[str]}
The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key words to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.
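As a rough illustration of how a namespace dictionary feeds word counting, here is a standalone sketch. The words function below is a hypothetical stand-in for the method, and its lowercasing and whitespace tokenization are assumptions; only the 'words' key and the {str: List[str]} shape come from the description above.

```python
from collections import Counter
from typing import Dict, List


def words(text: str) -> Dict[str, List[str]]:
    # Lowercased whitespace tokenization is an assumption of this sketch.
    return {"words": text.lower().split()}


passages = ["The movie was great", "The plot was thin"]

# Keep one counter per namespace, so that each namespace could later get
# its own vocabulary (and its own embedding matrix).
counts: Dict[str, Counter] = {}
for passage in passages:
    for namespace, tokens in words(passage).items():
        counts.setdefault(namespace, Counter()).update(tokens)
```

Adding a second namespace, for example a 'characters' key holding the individual characters, would need no change to the counting loop.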
-
classmethod