Reading Comprehension Instances

These Instances are designed for the set of tasks known today as “reading comprehension”, where the input is a natural language question, a passage, and (optionally) some number of answer options, and the output is either a (span begin index, span end index) decision over the passage, or a classification decision over the answer options (if provided).

QuestionPassageInstances

class deep_qa.data.instances.reading_comprehension.question_passage_instance.IndexedQuestionPassageInstance(question_indices: typing.List[int], passage_indices: typing.List[int], label: typing.List[int], index: int = None)[source]

Bases: deep_qa.data.instances.instance.IndexedInstance

This is an indexed instance that is used for (question, passage) pairs.

as_training_data()[source]

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns:

train_data : (inputs, label)

The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.

classmethod empty_instance()[source]

Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.

get_padding_lengths() → typing.Dict[str, int][source]

We need to pad at least the question length, the passage length, and the word length across all the questions and passages. Subclasses that add more arguments should also override this method to enable padding on said arguments.

pad(padding_lengths: typing.Dict[str, int])[source]

In this function, we pad the questions and passages (in terms of number of words in each), as well as the individual words in the questions and passages themselves.
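The word-level padding described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper names (pad_sequence, pad_question_passage), the padding-length keys, the choice of 0 as the padding value, and the decision to pad and truncate on the right are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def pad_sequence(indices: List[int], desired_length: int, pad_value: int = 0) -> List[int]:
    """Truncate or right-pad `indices` so it has exactly `desired_length` items."""
    truncated = indices[:desired_length]
    return truncated + [pad_value] * (desired_length - len(truncated))


def pad_question_passage(question_indices: List[int],
                         passage_indices: List[int],
                         padding_lengths: Dict[str, int]) -> Tuple[List[int], List[int]]:
    """Pad a (question, passage) pair to the lengths given in `padding_lengths`.

    The key names used here are hypothetical, chosen only for this sketch.
    """
    question = pad_sequence(question_indices, padding_lengths["num_question_words"])
    passage = pad_sequence(passage_indices, padding_lengths["num_passage_words"])
    return question, passage


lengths = {"num_question_words": 4, "num_passage_words": 3}
question, passage = pad_question_passage([5, 6], [7, 8, 9, 10], lengths)
print(question)  # the short question is padded up to 4 items
print(passage)   # the long passage is truncated down to 3 items
```

Padding every instance to the same lengths is what allows a batch of instances to be stacked into fixed-shape arrays.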

class deep_qa.data.instances.reading_comprehension.question_passage_instance.QuestionPassageInstance(question_text: str, passage_text: str, label: typing.Any, index: int = None)[source]

Bases: deep_qa.data.instances.instance.TextInstance

A QuestionPassageInstance is a base class for datasets that consist primarily of a question text and a passage, where the passage contains the answer to the question. This class does not implement _index_label, so it should not be used directly; use a subclass instead.

_index_label(label: typing.Any) → typing.List[int][source]

Index the labels. Since we don’t know what form the label takes, we leave it to subclasses to implement this method.

to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]

Converts the words in this Instance into indices using the DataIndexer.

Parameters:

data_indexer : DataIndexer

DataIndexer to use in converting the Instance to an IndexedInstance.

Returns:

indexed_instance : IndexedInstance

A TextInstance that has had all of its strings converted into indices.

words() → typing.Dict[str, typing.List[str]][source]

Returns a list of all of the words in this instance, contained in a namespace dictionary.

This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other, crazier things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.

Returns:

namespace : Dictionary of {str: List[str]}

The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key words to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.
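As a rough illustration of the namespace dictionary, the sketch below builds one for a (question, passage) pair and counts tokens per namespace. The function names, the whitespace tokenization, and the 'characters' key are assumptions made for this example; only the 'words' key comes from the description above.

```python
from collections import Counter
from typing import Dict, Iterable, List


def collect_words(question_text: str, passage_text: str) -> Dict[str, List[str]]:
    """Return a namespace dictionary for one (question, passage) pair.

    Whitespace splitting stands in for a real tokenizer here.
    """
    tokens = question_text.split() + passage_text.split()
    characters = [char for token in tokens for char in token]
    return {"words": tokens, "characters": characters}


def count_tokens(namespace_dicts: Iterable[Dict[str, List[str]]]) -> Dict[str, Counter]:
    """Accumulate one token counter per namespace across many instances."""
    counts: Dict[str, Counter] = {}
    for namespace_dict in namespace_dicts:
        for namespace, tokens in namespace_dict.items():
            counts.setdefault(namespace, Counter()).update(tokens)
    return counts


namespaces = collect_words("Who sat ?", "The cat sat .")
counts = count_tokens([namespaces])
print(counts["words"].most_common(1))
```

Keeping one counter per namespace is what lets the word vocabulary and the character vocabulary end up with different sizes.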

McQuestionPassageInstances

class deep_qa.data.instances.reading_comprehension.mc_question_passage_instance.IndexedMcQuestionPassageInstance(question_indices: typing.List[int], passage_indices: typing.List[int], option_indices: typing.List[typing.List[int]], label: typing.List[int], index: int = None)[source]

Bases: deep_qa.data.instances.reading_comprehension.question_passage_instance.IndexedQuestionPassageInstance

as_training_data()[source]

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns:

train_data : (inputs, label)

The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.

classmethod empty_instance()[source]

Returns an empty, unpadded instance of this class. Necessary for option padding in multiple choice instances.

get_padding_lengths() → typing.Dict[str, int][source]

We need to pad the answer option length (in words), the number of answer options, the question length (in words), the passage length (in words), and the word length (in characters) among all the questions, passages, and answer options.

pad(padding_lengths: typing.Dict[str, int])[source]

In this function, we pad the questions and passages (in terms of number of words in each), as well as the individual words in the questions and passages themselves. We also pad the number of answer options, the answer options (in terms of number of words in each), as well as the individual words in the answer options.
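Padding the answer options involves two dimensions: the number of options and the number of words in each option. The sketch below shows one way this could work, assuming right-padding with 0 and all-padding rows for missing options; pad_options and its parameters are hypothetical names, not part of the documented API.

```python
from typing import List


def pad_options(option_indices: List[List[int]],
                num_options: int,
                num_option_words: int,
                pad_value: int = 0) -> List[List[int]]:
    """Pad or truncate both the option count and each option's word count."""
    padded = []
    for option in option_indices[:num_options]:
        words = option[:num_option_words]
        padded.append(words + [pad_value] * (num_option_words - len(words)))
    # Missing options are filled with rows that contain only padding.
    while len(padded) < num_options:
        padded.append([pad_value] * num_option_words)
    return padded


print(pad_options([[1, 2, 3], [4]], num_options=3, num_option_words=2))
```

The all-padding rows play the role that an empty instance plays when the set of options is padded out to a fixed size.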

class deep_qa.data.instances.reading_comprehension.mc_question_passage_instance.McQuestionPassageInstance(question: str, passage: str, answer_options: typing.List[str], label: int, index: int = None)[source]

Bases: deep_qa.data.instances.reading_comprehension.question_passage_instance.QuestionPassageInstance

A McQuestionPassageInstance is a QuestionPassageInstance that represents a (question, passage, answer_options) tuple from a multiple-choice dataset, with an associated label indicating the index of the correct answer choice.

_index_label(label: typing.Tuple[int, int]) → typing.List[int][source]

Specify how to index self.label, which is needed to convert the McQuestionPassageInstance into an IndexedInstance (conversion handled in superclass).

classmethod read_from_line(line: str)[source]

Reads a McQuestionPassageInstance object from a line. The format has one of two options:

  1. [example index][tab][passage][tab][question][tab][options][tab][label]
  2. [passage][tab][question][tab][options][tab][label]

The answer_options column is assumed formatted as: [option]###[option]###[option]... That is, we split on three hashes ("###").
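The two line formats above can be parsed along these lines. This is a standalone sketch rather than the library's read_from_line: the function name parse_mc_line and the returned dictionary layout are assumptions, and the label is treated as an integer to match the label: int argument in the class signature.

```python
from typing import Any, Dict


def parse_mc_line(line: str) -> Dict[str, Any]:
    """Parse a tab-separated multiple-choice line, with or without an index."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 5:
        # Format 1: the first column is the example index.
        index = int(fields[0])
        fields = fields[1:]
    elif len(fields) == 4:
        # Format 2: no example index.
        index = None
    else:
        raise ValueError("Expected 4 or 5 tab-separated fields, got %d" % len(fields))
    passage, question, options, label = fields
    return {
        "index": index,
        "passage": passage,
        "question": question,
        "answer_options": options.split("###"),
        "label": int(label),
    }


parsed = parse_mc_line("7\tThe cat sat.\tWho sat?\tthe dog###the cat\t1")
print(parsed["answer_options"])
```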

to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]

Converts the words in this Instance into indices using the DataIndexer.

Parameters:

data_indexer : DataIndexer

DataIndexer to use in converting the Instance to an IndexedInstance.

Returns:

indexed_instance : IndexedInstance

A TextInstance that has had all of its strings converted into indices.

words() → typing.Dict[str, typing.List[str]][source]

Returns a list of all of the words in this instance, contained in a namespace dictionary.

This is mainly used for computing word counts when fitting a word vocabulary on a dataset. The namespace dictionary allows you to have several embedding matrices with different vocab sizes, e.g., for words and for characters (in fact, words and characters are the only use cases I can think of for now, but this allows you to do other more crazy things if you want). You can call the namespaces whatever you want, but if you want the DataIndexer to work correctly without namespace arguments, you should use the key ‘words’ to represent word tokens.

Returns:

namespace : Dictionary of {str: List[str]}

The str key refers to vocabularies, and the List[str] should contain the tokens in that vocabulary. For example, you should use the key words to represent word tokens, and the corresponding value in the dictionary would be a list of all the words in the instance.

CharacterSpanInstances

class deep_qa.data.instances.reading_comprehension.character_span_instance.CharacterSpanInstance(question: str, passage: str, label: typing.Tuple[int, int], index: int = None)[source]

Bases: deep_qa.data.instances.reading_comprehension.question_passage_instance.QuestionPassageInstance

A CharacterSpanInstance is a QuestionPassageInstance that represents a (question, passage) pair with an associated label, which is the data given for the span prediction task. The label is a span of characters in the passage that indicates where the answer to the question begins and where the answer to the question ends.

The main thing this class handles over QuestionPassageInstance is in specifying the form of and how to index the label, which is given as a span of _characters_ in the passage. The label we are going to use in the rest of the code is a span of _tokens_ in the passage, so the mapping from character labels to token labels depends on the tokenization we did, and the logic to handle this is, unfortunately, a little complicated. The label conversion happens when converting a CharacterSpanInstance to an IndexedInstance (where character indices are generally lost, anyway).

This class should be used to represent training instances for the SQuAD (Stanford Question Answering) and NewsQA datasets, to name a few.
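To make the character-to-token mapping concrete, here is a simplified sketch. It is not the logic this class uses: it assumes whitespace tokenization, a character span whose end index is exclusive, and a token span whose end index is inclusive. The helper names are hypothetical.

```python
import re
from typing import List, Tuple


def token_offsets(passage: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for each whitespace-delimited token."""
    return [(match.start(), match.end()) for match in re.finditer(r"\S+", passage)]


def char_span_to_token_span(passage: str, char_span: Tuple[int, int]) -> Tuple[int, int]:
    """Map a character span to the indices of the first and last overlapping tokens."""
    char_begin, char_end = char_span
    overlapping = [
        i for i, (start, end) in enumerate(token_offsets(passage))
        if start < char_end and end > char_begin
    ]
    if not overlapping:
        raise ValueError("Character span does not overlap any token")
    return overlapping[0], overlapping[-1]


passage = "The cat sat on the mat"
print(char_span_to_token_span(passage, (4, 11)))  # characters 4..10 spell "cat sat"
```

With a different tokenizer the token offsets change, which is why the resulting token label depends on the tokenization applied to the passage.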

_index_label(label: typing.Tuple[int, int]) → typing.List[int][source]

Specify how to index self.label, which is needed to convert the CharacterSpanInstance into an IndexedInstance (handled in superclass).

classmethod read_from_line(line: str)[source]

Reads a CharacterSpanInstance object from a line. The format has one of two options:

  1. [example index][tab][question][tab][passage][tab][label]
  2. [question][tab][passage][tab][label]

[label] is assumed to be a comma-separated pair of integers.
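A line in either format could be parsed roughly as follows. The function name parse_span_line and the returned tuple layout are assumptions for this sketch; the column order and the comma-separated label follow the description above.

```python
from typing import Optional, Tuple


def parse_span_line(line: str) -> Tuple[Optional[int], str, str, Tuple[int, int]]:
    """Parse a tab-separated span line into (index, question, passage, label)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 4:
        # Format 1: the first column is the example index.
        index = int(fields[0])
        fields = fields[1:]
    elif len(fields) == 3:
        # Format 2: no example index.
        index = None
    else:
        raise ValueError("Expected 3 or 4 tab-separated fields, got %d" % len(fields))
    question, passage, label = fields
    span_begin, span_end = (int(value) for value in label.split(","))
    return index, question, passage, (span_begin, span_end)


print(parse_span_line("3\tWho sat?\tThe cat sat.\t4,7"))
```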

stop_token = '@@STOP@@'

to_indexed_instance(data_indexer: deep_qa.data.data_indexer.DataIndexer)[source]

Converts the words in this Instance into indices using the DataIndexer.

Parameters:

data_indexer : DataIndexer

DataIndexer to use in converting the Instance to an IndexedInstance.

Returns:

indexed_instance : IndexedInstance

A TextInstance that has had all of its strings converted into indices.

class deep_qa.data.instances.reading_comprehension.character_span_instance.IndexedCharacterSpanInstance(question_indices: typing.List[int], passage_indices: typing.List[int], label: typing.List[int], index: int = None)[source]

Bases: deep_qa.data.instances.reading_comprehension.question_passage_instance.IndexedQuestionPassageInstance

as_training_data()[source]

Convert this IndexedInstance to NumPy arrays suitable for use as training data to Keras models.

Returns:

train_data : (inputs, label)

The IndexedInstance as NumPy arrays to be used in Keras. Note that inputs might itself be a complex tuple, depending on the Instance type.