General Data Utils


Bases: object

A DataIndexer maps strings to integers. Strings that are not in the vocabulary are mapped to an out-of-vocabulary (OOV) token.

DataIndexers are fit to a particular dataset, which we use to decide which words are in-vocabulary.

DataIndexers also allow for several different namespaces, so you can have separate word indices for ‘a’ as a word, and ‘a’ as a character, for instance. Most of the methods on this class allow you to pass in a namespace; by default we use the ‘words’ namespace, and you can omit the namespace argument everywhere and just use the default.
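As a rough illustration of the namespace behaviour, here is a minimal sketch. TinyIndexer is a hypothetical stand-in written for this example, not the library's implementation, and reserving index 0 for the OOV token in each namespace is an assumption.

```python
from collections import defaultdict


class TinyIndexer:
    """Hypothetical stand-in for illustration; not the library's DataIndexer."""

    def __init__(self, oov_token="UNK"):
        self.oov_token = oov_token
        # One independent word -> index mapping per namespace.
        # Assumption: index 0 is reserved for the OOV token in each namespace.
        self._word_to_index = defaultdict(lambda: {oov_token: 0})

    def add_word_to_index(self, word, namespace="words"):
        """Add the word if it is missing; return its index either way."""
        mapping = self._word_to_index[namespace]
        if word not in mapping:
            mapping[word] = len(mapping)
        return mapping[word]

    def get_word_index(self, word, namespace="words"):
        """Look up a word, falling back to the OOV index when it is unknown."""
        mapping = self._word_to_index[namespace]
        return mapping.get(word, mapping[self.oov_token])


indexer = TinyIndexer()
word_index = indexer.add_word_to_index("a")  # default 'words' namespace
indexer.add_word_to_index("b", namespace="characters")
char_index = indexer.add_word_to_index("a", namespace="characters")
unknown_index = indexer.get_word_index("zebra")
print(word_index, char_index, unknown_index)
```

In this sketch the string 'a' receives a different index in each namespace, and a word that was never added falls back to the OOV index.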

add_word_to_index(word: str, namespace: str = 'words') → int

Adds word to the index, if it is not already present. Either way, we return the index of the word.

fit_word_dictionary(dataset, min_count: int = 1)

Given a Dataset, this method decides which words are given an index, and which ones are mapped to an OOV token (in this case “UNK”). This method must be called before any dataset is indexed with this DataIndexer. If you don’t first fit the word dictionary, every token will be mapped to “UNK”.

We call instance.words() for each instance in the dataset, and then keep all words that appear at least min_count times.


dataset: TextDataset

The dataset to index.

min_count: int, optional (default=1)

The minimum number of occurrences a word must have in the dataset in order to be assigned an index.
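The effect of min_count can be sketched as follows. The helpers fit_vocabulary and index_words are hypothetical names used only for this example, instances are represented as plain lists of words, and reserving index 0 for the OOV token is an assumption rather than documented behaviour.

```python
from collections import Counter


def fit_vocabulary(tokenized_instances, min_count=1, oov_token="UNK"):
    """Hypothetical helper: keep words seen at least `min_count` times."""
    counts = Counter(word for words in tokenized_instances for word in words)
    # Assumption: index 0 is reserved for the OOV token.
    word_to_index = {oov_token: 0}
    for word, count in counts.items():
        if count >= min_count:
            word_to_index[word] = len(word_to_index)
    return word_to_index


def index_words(words, word_to_index, oov_token="UNK"):
    """Map each word to its index, or to the OOV index if it was not kept."""
    return [word_to_index.get(word, word_to_index[oov_token]) for word in words]


instances = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = fit_vocabulary(instances, min_count=2)
indexed = index_words(["the", "cat", "sat"], vocab)
# Without fitting, the vocabulary holds only the OOV token.
unfitted = index_words(["the", "cat", "sat"], {"UNK": 0})
print(vocab)
print(indexed)
print(unfitted)
```

With min_count=2, only 'the' and 'sat' are kept, so 'cat' is indexed as the OOV token. Indexing against an unfitted vocabulary maps every word to the OOV index.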

get_vocab_size(namespace: str = 'words')
get_word_from_index(index: int, namespace: str = 'words')
get_word_index(word: str, namespace: str = 'words')
set_from_file(filename: str, oov_token: str = '@@UNKNOWN@@', namespace: str = 'words')
words_in_index(namespace: str = 'words')


Bases: object

static get_embedding_layer(embeddings_filename: str, data_indexer:, trainable=False, log_misses=False, name='pretrained_embedding')

Reads a pre-trained embedding file and generates a Keras Embedding layer that has weights initialized to the pre-trained embeddings. The Embedding layer can either be trainable or not.

We use the DataIndexer to map from the word strings in the embeddings file to the indices that we need, and to know which words from the embeddings file we can safely ignore. If a word in the DataIndexer does not appear in the embeddings file, we give it a zero vector.

The embeddings file is assumed to be gzipped, formatted as [word] [dim 1] [dim 2] ...
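A minimal sketch of reading such a file into a weight matrix follows, using only the standard library. load_embedding_matrix is a hypothetical helper, not the library's code. It assumes fields are separated by single spaces, one word per line, and that the indices in word_to_index are contiguous from 0.

```python
import gzip
import os
import tempfile


def load_embedding_matrix(embeddings_filename, word_to_index):
    """Hypothetical helper: build one row per indexed word from a gzipped file."""
    rows = {}
    embedding_dim = None
    with gzip.open(embeddings_filename, "rt", encoding="utf-8") as embeddings_file:
        for line in embeddings_file:
            fields = line.rstrip().split(" ")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            word, values = fields[0], [float(value) for value in fields[1:]]
            embedding_dim = embedding_dim or len(values)
            # Words that are not in the index can be safely ignored.
            if word in word_to_index:
                rows[word_to_index[word]] = values
    if embedding_dim is None:
        raise ValueError("no embeddings found in " + embeddings_filename)
    # Indexed words missing from the file keep a zero vector.
    # Assumption: indices are contiguous from 0, one matrix row per index.
    matrix = [[0.0] * embedding_dim for _ in range(len(word_to_index))]
    for index, values in rows.items():
        matrix[index] = values
    return matrix


# Usage: write a tiny gzipped file, then load it.
path = os.path.join(tempfile.mkdtemp(), "embeddings.gz")
with gzip.open(path, "wt", encoding="utf-8") as out:
    out.write("cat 0.1 0.2\ndog 0.3 0.4\nfish 0.5 0.6\n")

matrix = load_embedding_matrix(path, {"UNK": 0, "cat": 1, "bird": 2})
print(matrix)
```

Here 'dog' and 'fish' are skipped because they are not indexed, while 'UNK' and 'bird' keep zero vectors because they are absent from the file. A matrix of this shape is the kind of value that would then initialize the Embedding layer's weights.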

static initialize_random_matrix(shape, seed=1337)