DataGenerator(text_trainer, params: deep_qa.common.params.Params)
A DataGenerator takes an IndexedDataset and converts it into a generator, yielding batches suitable for training. You might want to do this instead of just creating one large set of numpy arrays for a few reasons:
- Creating large arrays for your whole data could take a whole lot of memory, maybe more than is available on your machine.
- Creating one large array means padding all of your instances to the same length. This
typically means you waste a whole lot of computation on padding tokens. Using a
DataGenerator instead allows you to only pad each batch to the same length, instead of all of your instances across your whole dataset. We've typically seen a 4-5x speed up just from doing this (partially because Keras is pretty bad at doing variable-length computation; the speed-up isn't quite as large with plain tensorflow, I think). See the sketch after this list for a concrete illustration of the padding savings.
- If we’re varying the padding lengths in each batch, we can also vary the batch size, to optimize GPU memory usage. This means we’ll use smaller batch sizes for big instances, and larger batch sizes for small instances. We’ve seen speedups up to 10-12x (on top of the 4-5x speed up above) from doing this.
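To make the padding-savings argument above concrete, here is a small, self-contained sketch (independent of deep_qa itself, with made-up instance lengths and batch size) comparing the number of padding tokens you get from padding the whole dataset to one length versus sorting by length and padding each batch only to its own maximum:

    import random

    random.seed(0)
    lengths = [random.randint(5, 200) for _ in range(1000)]  # made-up instance lengths
    batch_size = 32

    # Padding the whole dataset to the single longest instance.
    global_padding = sum(max(lengths) - length for length in lengths)

    # Sorting by length and padding each batch only to its own maximum, which is
    # roughly what happens when dynamic_padding is True.
    sorted_lengths = sorted(lengths)
    per_batch_padding = 0
    for start in range(0, len(sorted_lengths), batch_size):
        batch = sorted_lengths[start:start + batch_size]
        per_batch_padding += sum(max(batch) - length for length in batch)

    print("padding tokens, one global length:", global_padding)
    print("padding tokens, per-batch lengths:", per_batch_padding)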
text_trainer: TextTrainer
We need access to the TextTrainer object so we can call some methods on it, such as _set_padding_lengths() and _get_padding_memory_scaling().
dynamic_padding: bool, optional (default=False)
If True, we will set padding lengths based on the data per batch, instead of on the whole dataset. This only works if your model is structured to allow variable-length sequences (typically using None for specific dimensions when you build your model), and if you don't set padding values in _set_padding_lengths(). This flag specifically is read in _set_padding_lengths() to know if we should set certain padding values or not. It's handled correctly for TextTrainer, but you need to be sure to implement it correctly in subclasses for this to work. A minimal example of such a variable-length model is sketched after this parameter list.
padding_noise: double, optional (default=.1)
When sorting by padding length, we add a bit of noise to the lengths, so that the sorting isn’t deterministic. This parameter determines how much noise we add, as a percentage of the actual padding value for each instance.
sort_every_epoch: bool, optional (default=True)
If True, we will re-sort the data after every epoch, then re-group the instances into batches. If padding_noise is zero, this does nothing, but if it's non-zero, this will give you a slightly different ordering, so you don't have exactly the same batches at every epoch. If you're doing adaptive batch sizes, this will lead to re-computing the adaptive batches each epoch, which could give a different number of batches for the whole dataset, which means each "epoch" might no longer correspond to exactly one pass over the data. This is probably a pretty minor issue, though.
adaptive_batch_sizes: bool, optional (default=False)
Only relevant if dynamic_padding is True. If True, we will vary the batch size to try to optimize GPU memory usage. Because padding lengths are done dynamically, we can have larger batches when padding lengths are smaller, maximizing our usage of the GPU. In order for this to work, you need to do two things: (1) override _get_padding_memory_scaling() to give a big-O bound on memory usage given padding lengths, and (2) tune the adaptive_memory_usage_constant parameter for your particular model and GPU. See the documentation for _get_padding_memory_scaling() for more information, and the sketch of such an override after this parameter list.
adaptive_memory_usage_constant: int, optional (default=None)
Only relevant if adaptive_batch_sizes is True. This is a manually-tuned parameter, specific to a particular model architecture and amount of GPU memory (e.g., if you change the number of hidden layers in your model, this number will need to change). See _get_padding_memory_scaling() for more detail. The recommended way to tune this parameter is to (1) use a fixed batch size, with biggest_batch_first set to True, and find out the maximum batch size you can handle on your biggest instances without running out of memory. Then (2) turn on adaptive_batch_sizes, and set this parameter so that you get the right batch size for your biggest instances. If you set the log level to DEBUG in scripts/run_model.py, you can see the batch sizes that are computed. A rough sketch of how these quantities relate is given after this parameter list.
maximum_batch_size: int, optional (default=1000000)
If we’re using adaptive batch sizes, you can use this to be sure you do not create batches larger than this, even if you have enough memory to handle it on your GPU. You might choose to do this to keep smaller batches because you like the noisier gradient estimates that come from smaller batches, for instance.
biggest_batch_first: bool, optional (default=False)
This is largely for testing, to see how large of a batch you can safely use with your GPU. It’s only meaningful if you’re using dynamic padding - this will let you try out the largest batch that you have in the data first, so that if you’re going to run out of memory, you know it early, instead of waiting through the whole batch to find out at the end that you’re going to crash.
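As referenced under dynamic_padding above, here is a minimal Keras sketch of a model built with None for the sequence dimension, so that batches padded to different lengths can all be fed through the same graph. The layer sizes and vocabulary size are arbitrary, and this is not any particular deep_qa model:

    from keras.layers import Dense, Embedding, Input, LSTM
    from keras.models import Model

    # The sequence dimension is left as None, so each batch may be padded to a
    # different length; only the instances within a batch share a shape.
    word_ids = Input(shape=(None,), dtype='int32')
    embedded = Embedding(input_dim=10000, output_dim=100)(word_ids)
    encoded = LSTM(128)(embedded)
    prediction = Dense(1, activation='sigmoid')(encoded)
    model = Model(inputs=word_ids, outputs=prediction)
    model.compile(optimizer='adam', loss='binary_crossentropy')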
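The adaptive_batch_sizes parameter above asks you to override _get_padding_memory_scaling(); a hedged sketch of what such an override might look like follows. The exact signature should be checked against TextTrainer, and the padding-length keys used here are hypothetical (they depend on what your model actually pads):

    from typing import Dict

    class MyModelMemoryScaling:
        # In a real model this method would live on your TextTrainer subclass.
        def _get_padding_memory_scaling(self, padding_lengths: Dict[str, int]) -> int:
            # Big-O style bound: suppose memory is dominated by an activation of
            # shape [num_sentences, num_sentence_words] per instance, so memory
            # per instance scales with the product of those two padding lengths.
            return padding_lengths['num_sentences'] * padding_lengths['num_sentence_words']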
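For the adaptive_memory_usage_constant tuning procedure above, the rough relationship between the constant, the memory scaling of a batch's padding lengths, and the resulting batch size is sketched below. The actual computation lives inside DataGenerator; this function and its numbers are only illustrative:

    def adaptive_batch_size(adaptive_memory_usage_constant: int,
                            padding_memory_scaling: int,
                            maximum_batch_size: int = 1000000) -> int:
        # Pick the largest batch size whose total memory scaling stays under the
        # tuned constant, capped by maximum_batch_size.
        return min(maximum_batch_size,
                   max(1, adaptive_memory_usage_constant // padding_memory_scaling))

    # Hypothetical tuning: with biggest_batch_first you find that batch size 30
    # just fits on your biggest instances, whose memory scaling is 24000, so a
    # constant of about 30 * 24000 = 720000 reproduces that batch size.
    print(adaptive_batch_size(720000, 24000))  # 30 on the biggest instances
    print(adaptive_batch_size(720000, 6000))   # 120 on smaller instances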
create_generator(dataset: deep_qa.data.datasets.dataset.IndexedDataset, batch_size: int = None)
Main external API call: converts an IndexedDataset into a data generator suitable for use with Keras' fit_generator and related methods.
This field can be read after calling create_generator to get the number of steps you should take per epoch in model.evaluate_generator for this data.
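Finally, a hedged end-to-end usage sketch. It assumes you already have a TextTrainer subclass instance, an IndexedDataset, and a compiled Keras model (none of which are constructed here); the DataGenerator import path and the attribute name for the step count are assumptions rather than things confirmed by this page:

    from deep_qa.common.params import Params
    from deep_qa.data.data_generator import DataGenerator  # import path assumed

    trainer = ...          # your TextTrainer subclass instance (not shown)
    indexed_dataset = ...  # an IndexedDataset of your training data (not shown)
    model = ...            # a compiled Keras model (not shown)

    params = Params({
        "dynamic_padding": True,
        "adaptive_batch_sizes": True,
        "adaptive_memory_usage_constant": 720000,  # tuned per model/GPU, see above
        "maximum_batch_size": 240,
    })
    generator = DataGenerator(trainer, params)
    train_generator = generator.create_generator(indexed_dataset)

    # The per-epoch step count described above; the attribute name here is an
    # assumption, so check the DataGenerator source for the real field.
    num_steps = generator.last_num_batches
    model.fit_generator(train_generator, steps_per_epoch=num_steps, epochs=10)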