Encoders

BagOfWords

class deep_qa.layers.encoders.bag_of_words.BOWEncoder(**kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

Bag of Words Encoder takes a matrix of shape (num_words, word_dim) and returns a vector of size (word_dim), which is an average of the (unmasked) rows in the input matrix. This could have been done using a Lambda layer, except that the Lambda layer does not support masking (as of Keras 1.0.7).
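To make the computation concrete, here is a minimal numpy sketch of the masked averaging this layer performs for a single example (illustrative only, not the layer's actual Keras-backend implementation):

    import numpy as np

    # A (num_words, word_dim) matrix; the third row is padding.
    word_matrix = np.array([[1.0, 2.0],
                            [3.0, 4.0],
                            [0.0, 0.0]])
    mask = np.array([1.0, 1.0, 0.0])       # 1 = real word, 0 = masked

    masked_sum = (word_matrix * mask[:, None]).sum(axis=0)
    num_unmasked = max(mask.sum(), 1.0)    # avoid dividing by zero if everything is masked
    encoding = masked_sum / num_unmasked   # shape (word_dim,)
    print(encoding)                        # [2. 3.]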

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors.
mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors, one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match the input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers) or list of shape tuples (one per input tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

ConvolutionalEncoder

class deep_qa.layers.encoders.convolutional_encoder.CNNEncoder(units: int, num_filters: int, ngram_filter_sizes: typing.Tuple[int] = (2, 3, 4, 5), conv_layer_activation: str = 'relu', l1_regularization: float = None, l2_regularization: float = None, **kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

CNNEncoder is a combination of multiple convolution layers and max pooling layers. This is defined as a single layer to be consistent with the other encoders in terms of input and output specifications. The input to this “layer” is of shape (batch_size, num_words, embedding_dim) and the output is of shape (batch_size, output_dim).

The CNN has one convolution layer for each ngram filter size. Each convolution operation produces a vector of size num_filters. The number of times a convolution layer is applied depends on the ngram size: input_length - ngram_size + 1. The corresponding max pooling layer aggregates all these outputs from the convolution layer and outputs the maximum.

This operation is repeated for every ngram size passed, and consequently the dimensionality of the output after maxpooling is len(ngram_filter_sizes) * num_filters.

We then use a fully connected layer to project them back to the desired output_dim. For more details, refer to “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”, Zhang and Wallace 2016, particularly Figure 1.
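As a rough illustration of the pipeline described above, a functional-API sketch with tf.keras layers might look like the following. This is an assumption-laden sketch of the same idea, not the actual CNNEncoder code, which builds these pieces inside a single masked layer:

    from tensorflow.keras import Input, Model, layers

    num_words, embedding_dim = 40, 100      # illustrative sizes
    units, num_filters = 50, 64
    ngram_filter_sizes = (2, 3, 4, 5)

    word_vectors = Input(shape=(num_words, embedding_dim))
    pooled_outputs = []
    for ngram_size in ngram_filter_sizes:
        # (batch, num_words - ngram_size + 1, num_filters) after the convolution.
        conv = layers.Conv1D(filters=num_filters, kernel_size=ngram_size,
                             activation='relu')(word_vectors)
        # Max pooling over the time dimension gives (batch, num_filters).
        pooled_outputs.append(layers.GlobalMaxPooling1D()(conv))

    # Concatenation gives len(ngram_filter_sizes) * num_filters features,
    # which a fully connected layer projects down to `units`.
    concatenated = layers.Concatenate()(pooled_outputs)
    encoding = layers.Dense(units)(concatenated)
    model = Model(inputs=word_vectors, outputs=encoding)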

Parameters:

units: int

After doing convolutions, we’ll project the collected features into a vector of this size. This used to be output_dim, but Keras changed it to units. I prefer the name output_dim, so we’ll leave the code using output_dim, and just use the name units in the external API.

num_filters: int

This is the output dim for each convolutional layer, which is the same as the number of “filters” learned by that layer.

ngram_filter_sizes: Tuple[int], optional (default=(2, 3, 4, 5))

This specifies both the number of convolutional layers we will create and their sizes. The default of (2, 3, 4, 5) will have four convolutional layers, corresponding to encoding ngrams of size 2 to 5 with some number of filters.

conv_layer_activation: str, optional (default=’relu’)

l1_regularization: float, optional (default=None)

l2_regularization: float, optional (default=None)
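For reference, constructing the encoder with the parameters documented above looks roughly like this (the argument values are illustrative):

    from deep_qa.layers.encoders.convolutional_encoder import CNNEncoder

    # Applying this layer to a (batch_size, num_words, embedding_dim) tensor
    # produces a (batch_size, units) encoding.
    encoder = CNNEncoder(units=50,
                         num_filters=64,
                         ngram_filter_sizes=(2, 3, 4, 5),
                         conv_layer_activation='relu',
                         l2_regularization=0.001)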

build(input_shape)[source]

Creates the layer weights.

Must be implemented on all layers that have weights.

# Arguments
input_shape: Keras tensor (future input to layer) or list/tuple of Keras tensors to reference for weight shape computations.
compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors.
mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors, one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match the input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers) or list of shape tuples (one per input tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
get_config()[source]

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.

PositionalEncoder

class deep_qa.layers.encoders.positional_encoder.PositionalEncoder(**kwargs)[source]

Bases: deep_qa.layers.masked_layer.MaskedLayer

A PositionalEncoder is very similar to a kind of weighted bag of words encoder, where the weighting is done by an index-dependent vector, not a scalar. If you think this is an odd thing to do, it is. The original authors provide no real reasoning behind the exact method other than it takes into account word order. This is here mainly to reproduce results for comparison.

It takes a matrix of shape (num_words, word_dim) and returns a vector of size (word_dim), which implements the following linear combination of the rows:

representation = sum_(j=1)^(m) { l_j * w_j }

where w_j is the j-th word representation in the sentence and l_j is a vector defined as follows:

l_kj = (1 - j/m) - (k/d)(1 - 2j/m)

where:
  • j is the index of the word in the sentence.
  • m is the sentence length.
  • k is the vector index (i.e. the k-th element of a vector).
  • d is the dimension of the embedding.
  • * represents element-wise multiplication.

This method was originally introduced in End-To-End Memory Networks (pp. 4-5): https://arxiv.org/pdf/1503.08895v5.pdf
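A numpy sketch of this weighting for a single, unmasked sentence (illustrative only, not the layer's Keras-backend code, which also handles masking and batching) is:

    import numpy as np

    def positional_encode(word_matrix):
        # word_matrix has shape (num_words, word_dim).
        m, d = word_matrix.shape
        j = np.arange(1, m + 1)[:, None]    # word index in the sentence, shape (m, 1)
        k = np.arange(1, d + 1)[None, :]    # vector index, shape (1, d)
        # l_kj = (1 - j/m) - (k/d)(1 - 2j/m), as defined above.
        l_weights = (1.0 - j / m) - (k / d) * (1.0 - 2.0 * j / m)
        return (l_weights * word_matrix).sum(axis=0)   # shape (d,)

    sentence = np.random.rand(5, 10)          # (num_words, word_dim)
    print(positional_encode(sentence).shape)  # (10,)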

compute_mask(inputs, mask=None)[source]

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors.
mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors, one per output tensor of the layer).
compute_output_shape(input_shape)[source]

Computes the output shape of the layer.

Assumes that the layer will be built to match the input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers) or list of shape tuples (one per input tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

AttentiveGRU

class deep_qa.layers.encoders.attentive_gru.AttentiveGru(output_dim, input_length, **kwargs)[source]

Bases: keras.layers.recurrent.GRU

GRUs typically operate over sequences of words. The motivation behind this encoding is that a weighted average loses ordering information over its inputs - for instance, this is important in the bAbI tasks.

See Dynamic Memory Networks for more information: https://arxiv.org/pdf/1603.01417v1.pdf. This class extends the Keras Gated Recurrent Unit by implementing a method which replaces the GRU update gate (normally a vector, z - it is noted below where it is normally computed) with a scalar attention weight (one per input, such as the output of a softmax over the input vectors), which is pre-computed. As mentioned above, instead of using word embedding sequences as input to the GRU, we are using sentence encoding sequences.

The implementation of this class is subtle - it is only very slightly different from a standard GRU. When it is initialised, the Keras backend will call the build method, which it uses to check that the inputs being passed to this layer are the correct size, so we allow this to be the actual input size as normal. However, for the internal implementation, everywhere this global shape is used we override it to be one less, because we are passing in a tensor of shape (batch, knowledge_length, 1 + encoding_dim), which includes the attention mask. Therefore, we need all of the weights to have shape (, encoding_dim), NOT (, 1 + encoding_dim). All of the methods below which are overridden use some form of this dimension, so we correct them.
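To make the expected input concrete, here is a shapes-only numpy sketch of how such a tensor could be assembled; the attention weights and sentence encodings are random placeholders standing in for the outputs of earlier layers:

    import numpy as np

    batch_size, knowledge_length, encoding_dim = 2, 7, 50
    # e.g. the output of a softmax over the knowledge, one scalar per sentence.
    attention = np.random.rand(batch_size, knowledge_length, 1)
    sentence_encodings = np.random.rand(batch_size, knowledge_length, encoding_dim)

    # The attention is prepended along the last axis, giving the
    # (batch, knowledge_length, 1 + encoding_dim) input this layer expects.
    gru_input = np.concatenate([attention, sentence_encodings], axis=-1)
    print(gru_input.shape)                  # (2, 7, 51)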

build(input_shape)[source]

This is used by Keras to verify things, but also to build the weights. The only differences from the Keras GRU (which we copied exactly other than the below) are:
  • We generate weights with dimension input_dim[2] - 1, rather than dimension input_dim[2].
  • There are a few variables which are created in non-‘gpu’ modes which are not required. These are commented out but left in for clarity below.

preprocess_input(inputs, training=None)[source]

We have to override this preprocessing step because, if we are using the cpu, we do the weight-input multiplications in the internals of the GRU as separate, smaller matrix multiplications and concatenate them afterwards. Therefore, before this happens, we split off the attention and then add it back afterwards.

step(inputs, states)[source]

The input to step is a tensor of shape (batch, 1 + encoding_dim), i.e. a timeslice of the input to this AttentiveGRU, where the time axis is the knowledge_length. Before we start, we strip off the attention from the beginning. Then we do the equations for a normal GRU, except that we don't calculate the update gate z, substituting the pre-computed attention weight for it instead. Note that there is some redundancy here - for instance, in the GPU mode we do a larger matrix multiplication than required, as we don't use one part of it. However, for readability and similarity to the original GRU code in Keras, it has not been changed. In each section there are commented-out lines which contain code. If you were to uncomment these, remove the differences in the input size and replace the attention with the z gate at the output, you would have a standard GRU back again. We literally copied the Keras GRU code here, making some small modifications.
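The following simplified numpy sketch of a single step shows the substitution described above; the weight names and the plain-GRU equations are illustrative (following the Dynamic Memory Networks formulation), not the Keras variable names used in the actual code:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def attentive_gru_step(time_slice, h_prev, W_r, U_r, W_h, U_h):
        attention = time_slice[0]           # scalar attention, stripped off the front
        x = time_slice[1:]                  # the (encoding_dim,) sentence encoding
        r = sigmoid(W_r @ x + U_r @ h_prev)              # reset gate, as in a normal GRU
        h_candidate = np.tanh(W_h @ x + U_h @ (r * h_prev))
        # A standard GRU would compute an update gate z here; instead we use the
        # pre-computed attention weight to interpolate between the old hidden
        # state and the candidate.
        return attention * h_candidate + (1.0 - attention) * h_prev

    encoding_dim = hidden_dim = 4
    rng = np.random.RandomState(0)
    h = attentive_gru_step(time_slice=rng.rand(1 + encoding_dim),
                           h_prev=np.zeros(hidden_dim),
                           W_r=rng.rand(hidden_dim, encoding_dim),
                           U_r=rng.rand(hidden_dim, hidden_dim),
                           W_h=rng.rand(hidden_dim, encoding_dim),
                           U_h=rng.rand(hidden_dim, hidden_dim))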