Datasets

deep_qa.data.datasets.dataset

class deep_qa.data.datasets.dataset.Dataset(instances: typing.List[deep_qa.data.instances.instance.Instance])[source]

Bases: object

A collection of Instances.

This base class provides general methods that apply to any collection of Instances: essentially, set-like operations such as merging and truncating.

merge(other: deep_qa.data.datasets.dataset.Dataset) → deep_qa.data.datasets.dataset.Dataset[source]

Combine two datasets. If you try to merge two Datasets of the same subtype, you will end up with a Dataset of that same type (i.e., calling IndexedDataset.merge() with another IndexedDataset will return an IndexedDataset). If the types differ, this method currently raises an error, because the underlying Instance objects are not currently type compatible.
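A minimal usage sketch (the variable names are placeholders; both datasets are assumed to be TextDatasets built from the same Instance type):

    # Both datasets must be of the same Dataset subtype.
    combined = train_dataset.merge(extra_dataset)
    # combined is a TextDataset containing the instances of both; merging,
    # e.g., a TextDataset with an IndexedDataset would raise an error.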

truncate(max_instances: int)[source]

If there are more instances than max_instances in this dataset, returns a new dataset with a random subset of size max_instances. If there are fewer than max_instances already, we just return self.
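For example (a sketch; dataset is assumed to hold 10,000 instances):

    smaller = dataset.truncate(max_instances=1000)  # random subset of 1,000 instances
    same = smaller.truncate(max_instances=5000)     # already fewer than 5,000, so returns self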

class deep_qa.data.datasets.dataset.IndexedDataset(instances: typing.List[deep_qa.data.instances.instance.IndexedInstance])[source]

Bases: deep_qa.data.datasets.dataset.Dataset

A Dataset of IndexedInstances, with some helper methods.

IndexedInstances have text sequences replaced with lists of word indices, and are thus able to be padded to consistent lengths and converted to training inputs.

as_training_data()[source]

Takes each IndexedInstance and converts it into (inputs, labels), according to the Instance’s as_training_data() method. Both the inputs and the labels are numpy arrays. Note that if the Instances return tuples for their inputs, we convert the list of tuples into a tuple of lists, before converting everything to numpy arrays.
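A minimal sketch, assuming indexed_dataset has already been padded with pad_instances():

    inputs, labels = indexed_dataset.as_training_data()
    # inputs and labels are numpy arrays (or tuples of arrays, if the
    # Instances return tuples of inputs), ready for Keras model.fit().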

pad_instances(padding_lengths: typing.Dict[str, int] = None, verbose: bool = True)[source]

Makes all of the IndexedInstances in the dataset have the same length by padding them. This Dataset object doesn’t itself know which parts of an Instance need padding, but the Instances do, and so does the model that called us, which passes in a padding_lengths dictionary. The keys in that dictionary must match the lengths that the Instance knows about.

Given that, this method does two things: (1) it asks each of the Instances for its padding lengths (using padding_lengths()) and takes a max over the dataset, reconciling those maxima with the padding_lengths argument passed to this method; and (2) it pads every instance to the resulting lengths with IndexedInstance.pad(). If padding_lengths specifies a particular key with a non-None value, that value takes precedence over whatever we computed from the data. TODO(matt): with dynamic padding, this should probably be a maximum padding length, not a hard setting, but that requires some API changes.

This method modifies the current object; it does not return a new IndexedDataset (see the sketch after the parameter descriptions below).

Parameters:

padding_lengths: Dict[str, int]

If a key is present in this dictionary with a non-None value, we will pad to that length instead of the length calculated from the data. This lets you, e.g., set a maximum value for sentence length or word length, if you want to throw out long sequences.

verbose: bool, optional (default=True)

Should we output logging information when we’re doing this padding? If the dataset is large, this is nice to have, because padding a large dataset could take a long time. But if you’re doing this inside of a data generator, having all of this output per batch is a bit obnoxious.
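A sketch of the precedence rule described above; the key name "num_sentence_words" is an assumption and must match a length the Instances actually report:

    indexed_dataset.pad_instances(padding_lengths={"num_sentence_words": 50},
                                  verbose=False)
    # This dimension is now exactly 50 for every instance, regardless of the
    # maximum observed in the data; unspecified (or None-valued) keys fall
    # back to the data-derived maxima.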

padding_lengths()[source]
sort_by_padding(sorting_keys: typing.List[str], padding_noise: float = 0.0)[source]

Sorts the Instances in this Dataset by their padding lengths, using the keys in sorting_keys (in the order in which they are provided).
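A sketch, with an assumed key name; sorting by padding length before batching keeps similarly-sized instances together, which reduces wasted padding:

    # padding_noise is assumed to add random jitter to the lengths used for
    # sorting, so batch composition varies between epochs.
    indexed_dataset.sort_by_padding(sorting_keys=["num_sentence_words"],
                                    padding_noise=0.1)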

class deep_qa.data.datasets.dataset.TextDataset(instances: typing.List[deep_qa.data.instances.instance.TextInstance], params: deep_qa.common.params.Params = None)[source]

Bases: deep_qa.data.datasets.dataset.Dataset

A Dataset of TextInstances, with a few helper methods.

TextInstances aren’t of much use with Keras until they’ve been indexed, so this class mainly provides methods to read data in from a file and convert it into other kinds of Datasets.

static read_from_file(filename: str, instance_class, params: deep_qa.common.params.Params = None)[source]
static read_from_lines(lines: typing.List[str], instance_class, params: deep_qa.common.params.Params = None)[source]
to_indexed_dataset(data_indexer: deep_qa.data.data_indexer.DataIndexer) → deep_qa.data.datasets.dataset.IndexedDataset[source]

Converts the Dataset into an IndexedDataset, given a DataIndexer.
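Putting the pieces together, a typical end-to-end flow might look like the following sketch. MyInstance and the file path are placeholders, and fit_word_dictionary() is an assumed DataIndexer method for building the vocabulary:

    from deep_qa.data.data_indexer import DataIndexer
    from deep_qa.data.datasets.dataset import TextDataset

    text_dataset = TextDataset.read_from_file("/path/to/train.tsv", MyInstance)
    data_indexer = DataIndexer()
    data_indexer.fit_word_dictionary(text_dataset)  # assumed vocabulary-building API
    indexed_dataset = text_dataset.to_indexed_dataset(data_indexer)
    indexed_dataset.pad_instances()
    inputs, labels = indexed_dataset.as_training_data()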

deep_qa.data.datasets.dataset.log_label_counts(instances: typing.List[deep_qa.data.instances.instance.TextInstance])[source]

Entailment

class deep_qa.data.datasets.entailment.snli_dataset.SnliDataset(instances: typing.List[deep_qa.data.instances.instance.TextInstance], params: deep_qa.common.params.Params = None)[source]

Bases: deep_qa.data.datasets.dataset.TextDataset

static read_from_file(filename: str, instance_class, params: deep_qa.common.params.Params = None)[source]
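A sketch; the SnliInstance import path is an assumption, and the filename follows the standard SNLI distribution:

    from deep_qa.data.datasets.entailment.snli_dataset import SnliDataset
    from deep_qa.data.instances.entailment.snli_instance import SnliInstance  # assumed path

    snli_dataset = SnliDataset.read_from_file("snli_1.0_train.jsonl", SnliInstance)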

Language Modeling

class deep_qa.data.datasets.language_modeling.language_modeling_dataset.LanguageModelingDataset(instances: typing.List[deep_qa.data.instances.instance.TextInstance], params: deep_qa.common.params.Params = None)[source]

Bases: deep_qa.data.datasets.dataset.TextDataset

static read_from_file(filename: str, instance_class, params: deep_qa.common.params.Params = None)[source]
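A sketch; SentenceInstance, its import path, and the "sequence_length" parameter are all assumptions about how this dataset is configured:

    from deep_qa.common.params import Params
    from deep_qa.data.datasets.language_modeling.language_modeling_dataset import (
        LanguageModelingDataset)
    from deep_qa.data.instances.language_modeling.sentence_instance import (  # assumed path
        SentenceInstance)

    lm_dataset = LanguageModelingDataset.read_from_file(
        "corpus.txt", SentenceInstance, Params({"sequence_length": 20}))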