A collection of Instances.
This base class has general methods that apply to all collections of Instances. In practice these are set-like operations, such as merging and truncating.
merge(other: deep_qa.data.datasets.dataset.Dataset) → deep_qa.data.datasets.dataset.Dataset
Combine two datasets. If you merge two Datasets of the same subtype, you get a Dataset of that same type (e.g., calling IndexedDataset.merge() with another IndexedDataset returns an IndexedDataset). If the types differ, this method currently raises an error, because the underlying Instance objects are not type compatible.
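The type rule for merge() can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins, not the deep_qa implementation, and the use of ValueError is an assumption: the text above only says that an error is raised.

```python
from typing import List


class ToyDataset:
    """Hypothetical stand-in for a Dataset: just a list of instances."""

    def __init__(self, instances: List[object]):
        self.instances = instances

    def merge(self, other: "ToyDataset") -> "ToyDataset":
        # Only datasets of exactly the same subtype can be merged.
        # ValueError is an assumption; the real error type is not documented here.
        if type(self) is not type(other):
            raise ValueError("Cannot merge datasets of different types")
        # Build the result with the caller's own class, so the subtype is preserved.
        return type(self)(self.instances + other.instances)


class ToyIndexedDataset(ToyDataset):
    """Hypothetical subtype, used to show that merge() keeps the subtype."""


merged = ToyIndexedDataset([1, 2]).merge(ToyIndexedDataset([3]))
```

Here merged is a ToyIndexedDataset holding all three instances, while merging a ToyDataset with a ToyIndexedDataset raises the error.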
A Dataset of IndexedInstances, with some helper methods.
IndexedInstances have text sequences replaced with lists of word indices, and are thus able to be padded to consistent lengths and converted to training inputs.
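To make "text sequences replaced with lists of word indices" concrete, here is a small sketch of indexing. The function names, the whitespace tokenization, and the "<pad>" and "<unk>" tokens are all assumptions made for illustration; they are not taken from deep_qa.

```python
from typing import Dict, List


def build_word_index(sentences: List[str]) -> Dict[str, int]:
    """Assign each distinct word an integer index, in order of first appearance."""
    # Index 0 is reserved for padding and 1 for unknown words (an assumption).
    word_index = {"<pad>": 0, "<unk>": 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            word_index.setdefault(word, len(word_index))
    return word_index


def index_sentence(sentence: str, word_index: Dict[str, int]) -> List[int]:
    """Replace each word with its index, falling back to the unknown-word index."""
    unknown = word_index["<unk>"]
    return [word_index.get(word, unknown) for word in sentence.lower().split()]


word_index = build_word_index(["the cat sat", "the dog ran"])
indexed = index_sentence("the bird sat", word_index)
```

Once every sentence is a list of integers like this, sequences can be padded with the padding index to a common length and stacked into arrays.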
Takes each IndexedInstance and converts it into (inputs, labels), according to the Instance’s as_training_data() method. Both the inputs and the labels are numpy arrays. Note that if the Instances return tuples for their inputs, we convert the list of tuples into a tuple of lists before converting everything to numpy arrays.
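The tuple handling described above can be sketched in plain Python. The helper name and the example data are hypothetical, and the final conversion of each list to a numpy array is left out so that the sketch has no dependencies.

```python
from typing import List, Sequence, Tuple


def transpose_inputs(per_instance_inputs: Sequence[Tuple]) -> Tuple[List, ...]:
    """Turn [(a1, b1), (a2, b2), ...] into ([a1, a2, ...], [b1, b2, ...])."""
    # zip(*rows) regroups the values by position within the tuple.
    return tuple(list(column) for column in zip(*per_instance_inputs))


# Hypothetical data: each instance yields a (question_indices, passage_indices) pair.
per_instance = [
    ([1, 2, 0], [4, 5, 6, 0]),
    ([3, 0, 0], [7, 8, 0, 0]),
]
questions, passages = transpose_inputs(per_instance)
```

After this step each element of the tuple holds one kind of input for the whole dataset, so each can be converted to its own array.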
pad_instances(padding_lengths: typing.Dict[str, int] = None, verbose: bool = True)
Makes all of the IndexedInstances in the dataset have the same length by padding them. This Dataset object doesn’t know what things there are in the Instance to pad, but the Instances do, and so does the model that called us, passing in a padding_lengths dictionary. The keys in that dictionary must match the lengths that the Instances report.
Given that, this method does two things: (1) it asks each of the Instances what their padding lengths are, and takes a max (using padding_lengths()); (2) it reconciles those values with the padding_lengths we were passed as an argument to this method, and pads the instances with the result. If padding_lengths has a particular key specified with a value, that value takes precedence over whatever we computed in our data. TODO(matt): with dynamic padding, we should probably have this be a max padding length, not a hard setting, but that requires some API changes.
This method modifies the current object; it does not return a new Dataset.
padding_lengths: Dict[str, int]
If a key is present in this dictionary with a non-None value, we will pad to that length instead of the length calculated from the data. This lets you, e.g., set a maximum value for sentence length, or word length, if you want to throw out long sequences.
verbose: bool, optional (default=True)
Should we output logging information when we’re doing this padding? If the dataset is large, this is nice to have, because padding a large dataset could take a long time. But if you’re doing this inside of a data generator, having all of this output per batch is a bit obnoxious.
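The reconciliation between lengths computed from the data and the padding_lengths argument can be sketched as follows. This is not the deep_qa code: the function names and the "num_words" key are hypothetical, and padding on the right while truncating from the end is an assumption.

```python
from typing import Dict, List, Optional, Sequence


def reconcile_padding_lengths(
    instance_lengths: Sequence[Dict[str, int]],
    overrides: Optional[Dict[str, Optional[int]]] = None,
) -> Dict[str, int]:
    """Take the max of each length over all instances, then apply non-None overrides."""
    computed: Dict[str, int] = {}
    for lengths in instance_lengths:
        for key, value in lengths.items():
            computed[key] = max(computed.get(key, 0), value)
    for key, value in (overrides or {}).items():
        # A value passed in by the caller takes precedence over the data.
        if key in computed and value is not None:
            computed[key] = value
    return computed


def pad_to_length(indices: Sequence[int], length: int, pad_index: int = 0) -> List[int]:
    """Truncate or right-pad a list of word indices to exactly `length` items."""
    truncated = list(indices[:length])
    return truncated + [pad_index] * (length - len(truncated))


instance_lengths = [{"num_words": 3}, {"num_words": 5}]
from_data = reconcile_padding_lengths(instance_lengths)
capped = reconcile_padding_lengths(instance_lengths, {"num_words": 4})
padded = [pad_to_length(seq, capped["num_words"]) for seq in ([7, 8, 9], [1, 2, 3, 4, 5])]
```

With no override the target length is the longest sequence in the data (5 here); passing 4 caps it, so the shorter sequence is padded and the longer one is cut.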
TextDataset(instances: typing.List[deep_qa.data.instances.instance.TextInstance], params: deep_qa.common.params.Params = None)
A Dataset of TextInstances, with a few helper methods.
TextInstances aren’t useful for much with Keras until they’ve been indexed. So this class just has methods to read in data from a file and convert it into other kinds of Datasets.
read_from_file(filename: str, instance_class, params: deep_qa.common.params.Params = None)
read_from_lines(lines: typing.List[str], instance_class, params: deep_qa.common.params.Params = None)
SnliDataset(instances: typing.List[deep_qa.data.instances.instance.TextInstance], params: deep_qa.common.params.Params = None)
LanguageModelingDataset(instances: typing.List[deep_qa.data.instances.instance.TextInstance], params: deep_qa.common.params.Params = None)