Entailment Model Layers


class deep_qa.layers.entailment_models.decomposable_attention.DecomposableAttentionEntailment(num_hidden_layers: int = 1, hidden_layer_width: int = 50, hidden_layer_activation: str = 'relu', final_activation: str = 'softmax', output_dim: int = 3, initializer: str = 'uniform', **kwargs)

Bases: deep_qa.layers.entailment_models.word_alignment.WordAlignmentEntailment

This layer is a reimplementation of the entailment algorithm described in “A Decomposable Attention Model for Natural Language Inference”, Parikh et al., 2016. The algorithm has three main steps:

  1. Attend: Compute dot products between all pairs of projections of words in the hypothesis and the premise, then normalize those dot products and use them to align each word in the premise to a phrase in the hypothesis, and vice versa. These alignments are then used to summarize the aligned phrase in the other sentence as a weighted sum. The initial word projections are computed using a feed forward NN, F.
  2. Compare: Pass a concatenation of each word in the premise and the summary of its aligned phrase in the hypothesis through a feed forward NN, G, to get a projected comparison. Do the same with the hypothesis and the aligned phrase from the premise.
  3. Aggregate: Sum over the comparisons to get a single vector each for the premise-hypothesis comparison and the hypothesis-premise comparison. Pass them through a third feed forward NN, H, to get the entailment decision.
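The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not the layer's actual implementation: the helper names (`softmax`, `feed_forward`, `decomposable_attention`), the random weights, the omission of biases and masks, and the choice to concatenate the two summed comparison vectors before H are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feed_forward(x, weights):
    # Hypothetical helper: a stack of bias-free dense layers with ReLU.
    for w in weights:
        x = np.maximum(x @ w, 0.0)
    return x

def decomposable_attention(premise, hypothesis, f_weights, g_weights,
                           h_weights, out_weights):
    # 1. Attend: project with F, take all-pairs dot products, normalize.
    p_proj = feed_forward(premise, f_weights)        # (batch, p_len, hidden)
    h_proj = feed_forward(hypothesis, f_weights)     # (batch, h_len, hidden)
    scores = p_proj @ h_proj.transpose(0, 2, 1)      # (batch, p_len, h_len)
    p2h = softmax(scores, axis=2)                    # premise -> hypothesis
    h2p = softmax(scores.transpose(0, 2, 1), axis=2) # hypothesis -> premise
    aligned_h = p2h @ hypothesis                     # (batch, p_len, embed)
    aligned_p = h2p @ premise                        # (batch, h_len, embed)
    # 2. Compare: concatenate each word with its aligned summary, apply G.
    p_cmp = feed_forward(np.concatenate([premise, aligned_h], axis=-1), g_weights)
    h_cmp = feed_forward(np.concatenate([hypothesis, aligned_p], axis=-1), g_weights)
    # 3. Aggregate: sum over words, apply H, then a final softmax layer.
    agg = np.concatenate([p_cmp.sum(axis=1), h_cmp.sum(axis=1)], axis=-1)
    return softmax(feed_forward(agg, h_weights) @ out_weights)

rng = np.random.default_rng(0)
batch, length, embed, hidden, labels = 2, 4, 5, 6, 3
premise = rng.normal(size=(batch, length, embed))
hypothesis = rng.normal(size=(batch, length, embed))
f_weights = [rng.normal(size=(embed, hidden))]
g_weights = [rng.normal(size=(2 * embed, hidden))]
h_weights = [rng.normal(size=(2 * hidden, hidden))]
out_weights = rng.normal(size=(hidden, labels))

probs = decomposable_attention(premise, hypothesis, f_weights, g_weights,
                               h_weights, out_weights)
print(probs.shape)  # one distribution over labels per batch element
```

Each row of `probs` is a distribution over the `labels` classes, which corresponds to the entailment decision described in step 3.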

This layer can take either a tuple (premise, hypothesis) or a concatenation of them as input.


Inputs:

  • Tuple input: a premise sentence and a hypothesis sentence, both with shape (batch_size, sentence_length, embed_dim) and masks of shape (batch_size, sentence_length)
  • Single input: a single tensor of shape (batch_size, sentence_length * 2, embed_dim), with a mask of shape (batch_size, sentence_length * 2), which we will split in half to get the premise and hypothesis sentences.


Output:

  • Entailment decisions with the given output_dim.
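For the single-tensor input form, the split into premise and hypothesis might look like the sketch below. The helper name `split_in_half` is hypothetical, and the assumption that the premise occupies the first half of the sequence axis is mine; the text above only says the tensor is split in half.

```python
import numpy as np

batch_size, sentence_length, embed_dim = 2, 3, 4

# A concatenated input of shape (batch_size, sentence_length * 2, embed_dim)
# and its mask of shape (batch_size, sentence_length * 2).
combined = np.arange(batch_size * 2 * sentence_length * embed_dim, dtype=float)
combined = combined.reshape(batch_size, 2 * sentence_length, embed_dim)
combined_mask = np.ones((batch_size, 2 * sentence_length))

def split_in_half(tensor):
    # Hypothetical helper: split along the sequence axis (axis 1).
    half = tensor.shape[1] // 2
    return tensor[:, :half], tensor[:, half:]

# Assumption: the premise comes first, the hypothesis second.
premise, hypothesis = split_in_half(combined)
premise_mask, hypothesis_mask = split_in_half(combined_mask)

print(premise.shape, hypothesis.shape)
print(premise_mask.shape, hypothesis_mask.shape)
```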

num_hidden_layers: int, optional (default=1)

Number of hidden layers in each of the feed forward neural nets described above.

hidden_layer_width: int, optional (default=50)

Width of each hidden layer in each of the feed forward neural nets described above.

hidden_layer_activation: str, optional (default='relu')

Activation for each hidden layer in each of the feed forward neural nets described above.

final_activation: str, optional (default='softmax')

Activation to use for the final output. Should almost certainly be 'softmax'.

output_dim: int, optional (default=3)

Dimensionality of the final output. If this is the last layer in your model, this needs to be the same as the number of labels you have.

initializer: str, optional (default='uniform')

Will be passed to self.add_weight() for each of the weight matrices in the feed forward neural nets described above.


premise_length = hypothesis_length = sentence_length below.

static _attend(target_embedding, s2t_alignment)

Takes the target embedding and the source-to-target alignment attention, and produces a weighted average of the target embedding for each source word.

target_embedding: (batch_size, target_length, embed_dim)
s2t_alignment: (batch_size, source_length, target_length)
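Given the shapes above, the weighted average amounts to a batched matrix product. The following is a minimal NumPy sketch of that idea, not the layer's own code; the function name `attend` and the toy values are illustrative.

```python
import numpy as np

def attend(target_embedding, s2t_alignment):
    # (batch, source_len, target_len) @ (batch, target_len, embed_dim)
    #   -> (batch, source_len, embed_dim)
    return np.matmul(s2t_alignment, target_embedding)

# One batch element, three target words, embed_dim = 2.
target = np.array([[[1.0, 0.0],
                    [0.0, 1.0],
                    [2.0, 2.0]]])
# Two source words; each row is a distribution over the three target words.
alignment = np.array([[[1.0, 0.0, 0.0],
                       [0.5, 0.5, 0.0]]])

summary = attend(target, alignment)
# First source word -> [1.0, 0.0]; second source word -> [0.5, 0.5].
print(summary)
```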

_compare(source_embedding, s2t_attention)

Takes word embeddings from a sentence, and aggregated representations of words aligned to each of those words from another sentence, and returns a projection of their concatenation.

source_embedding: (batch_size, source_length, embed_dim)
s2t_attention: (batch_size, source_length, embed_dim)
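As a rough illustration of the comparison step, the sketch below concatenates the two inputs along the embedding axis and projects the result. The function name `compare`, the explicit `weights` argument, the bias-free ReLU layers, and the random inputs are assumptions for the example; the real method uses the layer's own weights for G.

```python
import numpy as np

def compare(source_embedding, s2t_attention, weights):
    # Concatenate each word with its aligned summary:
    # (batch, source_len, 2 * embed_dim)
    x = np.concatenate([source_embedding, s2t_attention], axis=-1)
    # Project with a stack of bias-free ReLU layers (stand-in for G).
    for w in weights:
        x = np.maximum(x @ w, 0.0)
    return x

rng = np.random.default_rng(0)
batch_size, source_length, embed_dim, hidden_layer_width = 2, 4, 5, 50
source_embedding = rng.normal(size=(batch_size, source_length, embed_dim))
s2t_attention = rng.normal(size=(batch_size, source_length, embed_dim))
g_weights = [rng.normal(size=(2 * embed_dim, hidden_layer_width))]

comparison = compare(source_embedding, s2t_attention, g_weights)
print(comparison.shape)  # (batch_size, source_length, hidden_layer_width)
```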


This model has three feed forward NNs (F, G and H in the paper). We assume that all three NNs have the same hyper-parameters: num_hidden_layers, hidden_layer_width and hidden_layer_activation. That is, F, G and H have the same structure and activations. Their actual weights are different, though. H has a separate softmax layer at the end.
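The weight layout this implies can be sketched as below: one helper builds every network from the same hyperparameters, while each call draws its own weights. The helper name `make_feed_forward_weights`, the input dimensions chosen for F, G and H, and the uniform range are assumptions for illustration only, not values taken from the layer.

```python
import numpy as np

def make_feed_forward_weights(rng, input_dim, num_hidden_layers=1,
                              hidden_layer_width=50):
    # Hypothetical helper: one weight matrix per hidden layer.
    dims = [input_dim] + [hidden_layer_width] * num_hidden_layers
    return [rng.uniform(-0.05, 0.05, size=(d_in, d_out))  # illustrative range
            for d_in, d_out in zip(dims[:-1], dims[1:])]

rng = np.random.default_rng(0)
embed_dim, hidden_layer_width, output_dim = 8, 50, 3

# Same structure, separate weights. Input sizes are assumptions:
# F sees one word, G sees a word concatenated with its aligned summary,
# H sees the two summed comparison vectors concatenated.
attend_weights = make_feed_forward_weights(rng, embed_dim)                 # F
compare_weights = make_feed_forward_weights(rng, 2 * embed_dim)            # G
aggregate_weights = make_feed_forward_weights(rng, 2 * hidden_layer_width) # H
# H additionally gets a final layer that maps to output_dim before softmax.
scorer = rng.uniform(-0.05, 0.05, size=(hidden_layer_width, output_dim))

for name, weights in [("F", attend_weights), ("G", compare_weights),
                      ("H", aggregate_weights)]:
    print(name, [w.shape for w in weights])
```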

compute_mask(inputs, mask=None)

Computes an output mask tensor.

# Arguments
inputs: Tensor or list of tensors.
mask: Tensor or list of tensors.
# Returns
None or a tensor (or list of tensors, one per output tensor of the layer).

Computes the output shape of the layer.

Assumes that the layer will be built to match the input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers) or list of shape tuples (one per input tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Container (one layer of abstraction above).

# Returns
Python dictionary.


class deep_qa.layers.entailment_models.multiple_choice_tuple_entailment.MultipleChoiceTupleEntailment(**kwargs)

Bases: deep_qa.layers.entailment_models.word_alignment.WordAlignmentEntailment

A kind of decomposable attention where the premise (or background) is in the form of SVO triples, and entailment is computed by finding the answer in a multiple choice setting that aligns best with the tuples that align with the question. This happens in two steps:

  1. We use the _align function from WordAlignmentEntailment to find the premise tuples whose SV or VO pairs align best with the question.
  2. We then use the _align function again to find the answer that aligns best with the unaligned part of the tuples, weighted by how much they partially align with the question in step 1.

TODO(pradeep): Also match S with question, VO with answer, O with question and SV with answer.

compute_mask(x, mask=None)


Word alignment entailment models operate on word level representations, and define alignment as a function of how well the words in the premise align with those in the hypothesis. These are different from the encoded sentence entailment models where both the premise and hypothesis are encoded as single vectors and entailment functions are defined on top of them.

At this point this doesn’t quite fit into the memory network setup because the model doesn’t operate on the encoded sentence representations, but instead consumes the word level representations. TODO(pradeep): Make this work with the memory network eventually.

class deep_qa.layers.entailment_models.word_alignment.WordAlignmentEntailment(**kwargs)

Bases: deep_qa.layers.masked_layer.MaskedLayer

This is an abstract class for word alignment entailment. It defines an _align function.

static _align(source_embedding, target_embedding, source_mask, target_mask, normalize_alignment=True)

Takes source and target sequence embeddings and returns source-to-target alignment weights. That is, for each word in the source sentence, returns a probability distribution over target_sequence that shows how well each target word aligns with (i.e., is similar to) it.

source_embedding: (batch_size, source_length, embed_dim)
target_embedding: (batch_size, target_length, embed_dim)
source_mask: None or (batch_size, source_length, 1)
target_mask: None or (batch_size, target_length, 1)
normalize_alignment (bool): Will apply a (masked) softmax over alignments if True.

Returns: s2t_attention: (batch_size, source_length, target_length)
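One way to picture this computation is the NumPy sketch below: all-pairs dot products followed by a masked softmax over the target axis. It is an illustration under assumptions, not the layer's code. In particular, returning raw dot products when normalize_alignment is False, zeroing the rows of padded source words, and the function name `align` are all guesses made for the example.

```python
import numpy as np

def align(source_embedding, target_embedding, source_mask=None,
          target_mask=None, normalize_alignment=True):
    # All-pairs dot products: (batch, source_len, target_len).
    scores = source_embedding @ target_embedding.transpose(0, 2, 1)
    if not normalize_alignment:
        return scores  # assumption: unnormalized scores are returned as-is
    if target_mask is None:
        valid = np.ones_like(scores)
    else:
        # (batch, target_len, 1) -> (batch, 1, target_len), broadcast over sources.
        valid = np.broadcast_to(target_mask.transpose(0, 2, 1), scores.shape)
    # Masked softmax: padded target words get zero probability.
    masked_scores = np.where(valid > 0, scores, -1e30)
    shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted) * valid
    weights = exp_scores / np.maximum(exp_scores.sum(axis=-1, keepdims=True), 1e-13)
    if source_mask is not None:
        # Assumption: rows for padded source words are zeroed out.
        weights = weights * source_mask
    return weights

source = np.array([[[1.0, 0.0], [0.0, 1.0]]])              # (1, 2, 2)
target = np.array([[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]])  # (1, 3, 2)
target_mask = np.array([[[1.0], [1.0], [0.0]]])            # third word is padding

s2t_attention = align(source, target, target_mask=target_mask)
print(s2t_attention.shape)  # (batch_size, source_length, target_length)
```

In this toy input the third target word has the largest dot product with both source words, but because it is masked, all probability mass goes to the first two target words.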