Natural language inference using spaCy and Keras

Introduction

This notebook details an implementation of the natural language inference model presented in (Parikh et al, 2016). The model is notable for the small number of parameters and hyperparameters it requires, while still yielding good performance.

Constructing the dataset


In [1]:
import spacy
import numpy as np

We only need the GloVe vectors from spaCy, not a full NLP pipeline.


In [2]:
nlp = spacy.load('en_vectors_web_lg')

A function to load the SNLI dataset. The categories are converted to a one-hot representation. The function comes from an example in spaCy.


In [3]:
import json
from keras.utils import to_categorical

LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
def read_snli(path):
    texts1 = []
    texts2 = []
    labels = []
    with open(path, 'r') as file_:
        for line in file_:
            eg = json.loads(line)
            label = eg['gold_label']
            if label == '-':  # per Parikh et al, skip SNLI entries with no gold label
                continue
            texts1.append(eg['sentence1'])
            texts2.append(eg['sentence2'])
            labels.append(LABELS[label])
    return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))


/home/jds/tensorflow-gpu/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.

We load the SNLI training triples from the training file; the SNLI test set is loaded separately below and used as validation data during training.


In [8]:
texts, hypotheses, labels = read_snli('snli/snli_1.0_train.jsonl')

In [9]:
def create_dataset(nlp, texts, hypotheses, num_oov, max_length, norm_vectors=True):
    sents = texts + hypotheses
    
    # +2: ranks are zero-based (so we need max rank + 1 rows), plus one extra
    # row for the zero vector at index 0 representing NULL, used for padding
    num_vectors = max(lex.rank for lex in nlp.vocab) + 2
    
    # create random vectors for OOV tokens
    oov = np.random.normal(size=(num_oov, nlp.vocab.vectors_length))
    oov = oov / oov.sum(axis=1, keepdims=True)
    
    vectors = np.zeros((num_vectors + num_oov, nlp.vocab.vectors_length), dtype='float32')
    vectors[num_vectors:, ] = oov
    for lex in nlp.vocab:
        if lex.has_vector and lex.vector_norm > 0:
            vectors[lex.rank + 1] = lex.vector / lex.vector_norm if norm_vectors else lex.vector
            
    sents_as_ids = []
    for sent in sents:
        doc = nlp(sent)
        word_ids = []
        
        for i, token in enumerate(doc):
            # skip odd spaces from tokenizer
            if token.has_vector and token.vector_norm == 0:
                continue
                
            if i > max_length:
                break
                
            if token.has_vector:
                word_ids.append(token.rank + 1)
            else:
                # if we don't have a vector, pick an OOV entry
                word_ids.append(token.rank % num_oov + num_vectors) 
                
        # there must be a simpler way of generating padded arrays from lists...
        word_id_vec = np.zeros((max_length), dtype='int')
        clipped_len = min(max_length, len(word_ids))
        word_id_vec[:clipped_len] = word_ids[:clipped_len]
        sents_as_ids.append(word_id_vec)
        
        
    return vectors, np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])

In [10]:
sem_vectors, text_vectors, hypothesis_vectors = create_dataset(nlp, texts, hypotheses, 100, 50, True)

In [11]:
texts_test, hypotheses_test, labels_test = read_snli('snli/snli_1.0_test.jsonl')

In [12]:
_, text_vectors_test, hypothesis_vectors_test = create_dataset(nlp, texts_test, hypotheses_test, 100, 50, True)

We use spaCy to tokenize the sentences and return, when available, a semantic vector for each token.

OOV terms (tokens for which no semantic vector is available) are assigned to one of a set of randomly-generated OOV vectors, per (Parikh et al, 2016).

Note that we will clip sentences to 50 words maximum.
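
As a quick sanity check of what create_dataset returns (a minimal sketch, assuming the cells above have run; the shapes in the comments follow from num_oov=100 and max_length=50):


In [ ]:
print(sem_vectors.shape)         # expected: (vocabulary rows + 100 OOV rows, 300)
print(text_vectors.shape)        # expected: (number of premise texts, 50), one row of word ids per text
print(hypothesis_vectors.shape)  # expected: (number of hypotheses, 50)
print(text_vectors[0][:10])      # first ten word ids of the first text; 0 is the NULL/padding id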


In [13]:
from keras import layers, Model, models
from keras import backend as K

Building the model

The embedding layer copies the 300-dimensional GloVe vectors into GPU memory. Per (Parikh et al, 2016), the vectors, which are not adapted during training, are projected down to lower-dimensional vectors using a trained projection matrix.


In [14]:
def create_embedding(vectors, max_length, projected_dim):
    return models.Sequential([
        layers.Embedding(
            vectors.shape[0],
            vectors.shape[1],
            input_length=max_length,
            weights=[vectors],
            trainable=False),
        
        layers.TimeDistributed(
            layers.Dense(projected_dim,
                         activation=None,
                         use_bias=False))
    ])
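
The projection produces, for each sentence, a max_length x projected_dim matrix of word representations. As a quick shape check (a minimal sketch; toy_vectors is a made-up stand-in so that we don't duplicate the full GloVe matrix in memory):


In [ ]:
toy_vectors = np.random.normal(size=(1000, 300)).astype('float32')
toy_embed = create_embedding(toy_vectors, 50, 200)
# (batch, 50) integer word ids -> Embedding: (batch, 50, 300) frozen vectors
# -> TimeDistributed(Dense): (batch, 50, 200) trained linear projection, no bias
print(toy_embed.output_shape)    # expected: (None, 50, 200)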

The Parikh model makes use of three feedforward blocks that construct nonlinear combinations of their input. Each block contains two ReLU layers and two dropout layers.


In [15]:
def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):
    return models.Sequential([
        layers.Dense(num_units, activation=activation),
        layers.Dropout(dropout_rate),
        layers.Dense(num_units, activation=activation),
        layers.Dropout(dropout_rate)
    ])

The basic idea of the (Parikh et al, 2016) model is to:

  1. Align: Construct a soft alignment of subphrases in the text and hypothesis using an attention-like mechanism, called "decomposable" because it is applied to each of the two sentences individually rather than to their product. The matrix of dot products between the nonlinear transformations of the inputs is normalized vertically and horizontally to yield a pair of "soft" alignment structures, from text->hypothesis and from hypothesis->text. Concretely, for each word in one sentence, a multinomial distribution over the words of the other sentence is computed via a softmax over the attention scores (see the numpy sketch after this list).
  2. Compare: Each word is now compared to its aligned phrase using a function modeled as a two-layer feedforward ReLU network. The output is a high-dimensional representation of the strength of association between word and aligned phrase.
  3. Aggregate: The comparison vectors are summed, separately, for the text and the hypothesis. The result is two vectors: one that describes the degree of association of the text to the hypothesis, and the second, of the hypothesis to the text.
  4. Finally, these two vectors are processed by a dense layer followed by a softmax classifier, as usual.
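
To make the align step concrete, here is a toy numpy sketch (illustration only, not part of the model code): softmax-normalizing a matrix of attention scores along each of its two axes yields the two soft alignments.


In [ ]:
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# toy sizes: 3 text words, 4 hypothesis words
scores = np.random.normal(size=(3, 4))    # scores[i, j] plays the role of F(a_i) . F(b_j)
text_to_hyp = softmax(scores, axis=1)     # each text word's distribution over hypothesis words
hyp_to_text = softmax(scores, axis=0)     # each hypothesis word's distribution over text words
print(text_to_hyp.sum(axis=1))            # rows sum to 1
print(hyp_to_text.sum(axis=0))            # columns sum to 1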

Note that entailment is not symmetric: the hypothesis (the consequent) must not add information beyond what is in the text (the antecedent). It is therefore not obvious that we need both vectors in step (3); it may be enough to use just the hypothesis->text vector. We will explore this possibility later.

We need a couple of little functions for Lambda layers to normalize and aggregate weights:


In [16]:
def normalizer(axis):
    # Softmax along the given axis of the attention-score matrix.
    def _normalize(att_weights):
        exp_weights = K.exp(att_weights)
        sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
        return exp_weights / sum_weights
    return _normalize

def sum_word(x):
    # Sum the comparison vectors over the word (time) axis.
    return K.sum(x, axis=1)

In [17]:
def build_model(vectors, max_length, num_hidden, num_classes, projected_dim, entail_dir='both'):
    input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')
    input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')
    
    # embeddings (projected)
    embed = create_embedding(vectors, max_length, projected_dim)
   
    a = embed(input1)
    b = embed(input2)
    
    # step 1: attend
    F = create_feedforward(num_hidden)
    att_weights = layers.dot([F(a), F(b)], axes=-1)
    
    G = create_feedforward(num_hidden)
    
    if entail_dir == 'both':
        norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
        norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
        alpha = layers.dot([norm_weights_a, a], axes=1)
        beta  = layers.dot([norm_weights_b, b], axes=1)

        # step 2: compare
        comp1 = layers.concatenate([a, beta])
        comp2 = layers.concatenate([b, alpha])
        v1 = layers.TimeDistributed(G)(comp1)
        v2 = layers.TimeDistributed(G)(comp2)

        # step 3: aggregate
        v1_sum = layers.Lambda(sum_word)(v1)
        v2_sum = layers.Lambda(sum_word)(v2)
        concat = layers.concatenate([v1_sum, v2_sum])
    elif entail_dir == 'left':
        norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
        alpha = layers.dot([norm_weights_a, a], axes=1)
        comp2 = layers.concatenate([b, alpha])
        v2 = layers.TimeDistributed(G)(comp2)
        v2_sum = layers.Lambda(sum_word)(v2)
        concat = v2_sum
    else:
        norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
        beta  = layers.dot([norm_weights_b, b], axes=1)
        comp1 = layers.concatenate([a, beta])
        v1 = layers.TimeDistributed(G)(comp1)
        v1_sum = layers.Lambda(sum_word)(v1)
        concat = v1_sum
    
    H = create_feedforward(num_hidden)
    out = H(concat)
    out = layers.Dense(num_classes, activation='softmax')(out)
    
    model = Model([input1, input2], out)
    
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

In [18]:
K.clear_session()
m = build_model(sem_vectors, 50, 200, 3, 200)
m.summary()


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
words1 (InputLayer)             (None, 50)           0                                            
__________________________________________________________________________________________________
words2 (InputLayer)             (None, 50)           0                                            
__________________________________________________________________________________________________
sequential_1 (Sequential)       (None, 50, 200)      321381600   words1[0][0]                     
                                                                 words2[0][0]                     
__________________________________________________________________________________________________
sequential_2 (Sequential)       (None, 50, 200)      80400       sequential_1[1][0]               
                                                                 sequential_1[2][0]               
__________________________________________________________________________________________________
dot_1 (Dot)                     (None, 50, 50)       0           sequential_2[1][0]               
                                                                 sequential_2[2][0]               
__________________________________________________________________________________________________
lambda_2 (Lambda)               (None, 50, 50)       0           dot_1[0][0]                      
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 50, 50)       0           dot_1[0][0]                      
__________________________________________________________________________________________________
dot_3 (Dot)                     (None, 50, 200)      0           lambda_2[0][0]                   
                                                                 sequential_1[2][0]               
__________________________________________________________________________________________________
dot_2 (Dot)                     (None, 50, 200)      0           lambda_1[0][0]                   
                                                                 sequential_1[1][0]               
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 50, 400)      0           sequential_1[1][0]               
                                                                 dot_3[0][0]                      
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 50, 400)      0           sequential_1[2][0]               
                                                                 dot_2[0][0]                      
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, 50, 200)      120400      concatenate_1[0][0]              
__________________________________________________________________________________________________
time_distributed_3 (TimeDistrib (None, 50, 200)      120400      concatenate_2[0][0]              
__________________________________________________________________________________________________
lambda_3 (Lambda)               (None, 200)          0           time_distributed_2[0][0]         
__________________________________________________________________________________________________
lambda_4 (Lambda)               (None, 200)          0           time_distributed_3[0][0]         
__________________________________________________________________________________________________
concatenate_3 (Concatenate)     (None, 400)          0           lambda_3[0][0]                   
                                                                 lambda_4[0][0]                   
__________________________________________________________________________________________________
sequential_4 (Sequential)       (None, 200)          120400      concatenate_3[0][0]              
__________________________________________________________________________________________________
dense_8 (Dense)                 (None, 3)            603         sequential_4[1][0]               
==================================================================================================
Total params: 321,703,403
Trainable params: 381,803
Non-trainable params: 321,321,600
__________________________________________________________________________________________________

The number of trainable parameters, roughly 382K, matches the count reported by Parikh et al, so we're on the right track.
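
As a back-of-the-envelope check (a sketch based on the layer sizes above; G is shared between the two compare branches, so its weights are counted once even though it appears twice in the summary):


In [ ]:
proj_params = 300 * 200                              # projection matrix, no bias
F_params = 2 * (200 * 200 + 200)                     # attend feedforward F: two 200-unit dense layers
G_params = (400 * 200 + 200) + (200 * 200 + 200)     # compare feedforward G: 400-d concatenated input
H_params = (400 * 200 + 200) + (200 * 200 + 200)     # aggregate feedforward H: 400-d concatenated input
clf_params = 200 * 3 + 3                             # final softmax classifier
print(proj_params + F_params + G_params + H_params + clf_params)  # 381803, matching the summary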

Training the model

Parikh et al use tiny batches of 4, training for 50 million batches, which amounts to several hundred epochs. Here we'll use large batches to make better use of the GPU, and train for fewer epochs, for the purposes of this experiment.


In [19]:
m.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50, validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))


Train on 549367 samples, validate on 9824 samples
Epoch 1/50
549367/549367 [==============================] - 34s 62us/step - loss: 0.7599 - acc: 0.6617 - val_loss: 0.5396 - val_acc: 0.7861
Epoch 2/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.5611 - acc: 0.7763 - val_loss: 0.4892 - val_acc: 0.8085
Epoch 3/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.5212 - acc: 0.7948 - val_loss: 0.4574 - val_acc: 0.8261
Epoch 4/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4986 - acc: 0.8045 - val_loss: 0.4410 - val_acc: 0.8274
Epoch 5/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4819 - acc: 0.8114 - val_loss: 0.4224 - val_acc: 0.8383
Epoch 6/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4714 - acc: 0.8166 - val_loss: 0.4200 - val_acc: 0.8379
Epoch 7/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4633 - acc: 0.8203 - val_loss: 0.4098 - val_acc: 0.8457
Epoch 8/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4558 - acc: 0.8232 - val_loss: 0.4114 - val_acc: 0.8415
Epoch 9/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4508 - acc: 0.8250 - val_loss: 0.4062 - val_acc: 0.8477
Epoch 10/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4433 - acc: 0.8286 - val_loss: 0.3982 - val_acc: 0.8486
Epoch 11/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4388 - acc: 0.8307 - val_loss: 0.3953 - val_acc: 0.8497
Epoch 12/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4351 - acc: 0.8321 - val_loss: 0.3973 - val_acc: 0.8522
Epoch 13/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4309 - acc: 0.8342 - val_loss: 0.3939 - val_acc: 0.8539
Epoch 14/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4269 - acc: 0.8355 - val_loss: 0.3932 - val_acc: 0.8517
Epoch 15/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4247 - acc: 0.8369 - val_loss: 0.3938 - val_acc: 0.8515
Epoch 16/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4208 - acc: 0.8379 - val_loss: 0.3936 - val_acc: 0.8504
Epoch 17/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4194 - acc: 0.8390 - val_loss: 0.3885 - val_acc: 0.8560
Epoch 18/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4162 - acc: 0.8402 - val_loss: 0.3874 - val_acc: 0.8561
Epoch 19/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4140 - acc: 0.8409 - val_loss: 0.3889 - val_acc: 0.8545
Epoch 20/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4114 - acc: 0.8426 - val_loss: 0.3864 - val_acc: 0.8583
Epoch 21/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4092 - acc: 0.8430 - val_loss: 0.3870 - val_acc: 0.8561
Epoch 22/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4062 - acc: 0.8442 - val_loss: 0.3852 - val_acc: 0.8577
Epoch 23/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4050 - acc: 0.8450 - val_loss: 0.3850 - val_acc: 0.8578
Epoch 24/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4035 - acc: 0.8455 - val_loss: 0.3825 - val_acc: 0.8555
Epoch 25/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.4018 - acc: 0.8460 - val_loss: 0.3837 - val_acc: 0.8573
Epoch 26/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3989 - acc: 0.8476 - val_loss: 0.3843 - val_acc: 0.8599
Epoch 27/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3979 - acc: 0.8481 - val_loss: 0.3841 - val_acc: 0.8589
Epoch 28/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3967 - acc: 0.8484 - val_loss: 0.3811 - val_acc: 0.8575
Epoch 29/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3956 - acc: 0.8492 - val_loss: 0.3829 - val_acc: 0.8589
Epoch 30/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3938 - acc: 0.8499 - val_loss: 0.3859 - val_acc: 0.8562
Epoch 31/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3925 - acc: 0.8500 - val_loss: 0.3798 - val_acc: 0.8587
Epoch 32/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3906 - acc: 0.8509 - val_loss: 0.3834 - val_acc: 0.8569
Epoch 33/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3893 - acc: 0.8511 - val_loss: 0.3806 - val_acc: 0.8588
Epoch 34/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3885 - acc: 0.8515 - val_loss: 0.3828 - val_acc: 0.8603
Epoch 35/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3879 - acc: 0.8520 - val_loss: 0.3800 - val_acc: 0.8594
Epoch 36/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3860 - acc: 0.8530 - val_loss: 0.3796 - val_acc: 0.8577
Epoch 37/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3856 - acc: 0.8532 - val_loss: 0.3857 - val_acc: 0.8591
Epoch 38/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3838 - acc: 0.8535 - val_loss: 0.3835 - val_acc: 0.8603
Epoch 39/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3830 - acc: 0.8543 - val_loss: 0.3830 - val_acc: 0.8599
Epoch 40/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3818 - acc: 0.8548 - val_loss: 0.3832 - val_acc: 0.8559
Epoch 41/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3806 - acc: 0.8551 - val_loss: 0.3845 - val_acc: 0.8553
Epoch 42/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3803 - acc: 0.8550 - val_loss: 0.3789 - val_acc: 0.8617
Epoch 43/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3791 - acc: 0.8556 - val_loss: 0.3835 - val_acc: 0.8580
Epoch 44/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3778 - acc: 0.8565 - val_loss: 0.3799 - val_acc: 0.8580
Epoch 45/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3766 - acc: 0.8571 - val_loss: 0.3790 - val_acc: 0.8625
Epoch 46/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3770 - acc: 0.8569 - val_loss: 0.3820 - val_acc: 0.8590
Epoch 47/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3761 - acc: 0.8573 - val_loss: 0.3831 - val_acc: 0.8581
Epoch 48/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3739 - acc: 0.8579 - val_loss: 0.3828 - val_acc: 0.8599
Epoch 49/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3738 - acc: 0.8577 - val_loss: 0.3785 - val_acc: 0.8590
Epoch 50/50
549367/549367 [==============================] - 33s 60us/step - loss: 0.3726 - acc: 0.8580 - val_loss: 0.3820 - val_acc: 0.8585
Out[19]:
<keras.callbacks.History at 0x7f5c9f49c438>

The result is broadly in the region reported by Parikh et al: ~86% here vs their 86.3%. The small difference might be accounted for by differences in max_length (here set to 50) and in the training regime (much larger batches and far fewer weight updates).
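
For an explicit test-set figure (a minimal sketch; since the SNLI test set is already used as validation data above, this should simply reproduce the final val_acc):


In [ ]:
loss, acc = m.evaluate([text_vectors_test, hypothesis_vectors_test], labels_test, batch_size=1024)
print(acc)    # expected to be close to the final val_acc in the log, around 0.86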

Experiment: the asymmetric model

It was suggested earlier that, given the semantics of entailment, the vector representing the strength of association of the hypothesis to the text may be all that is needed to classify the entailment.

The following model removes the complementary vector (text to hypothesis) from the computation. This decreases the parameter count slightly, because the final dense layers are smaller, and speeds up the forward pass when predicting, because fewer calculations are needed.


In [20]:
m1 = build_model(sem_vectors, 50, 200, 3, 200, 'left')
m1.summary()


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
words2 (InputLayer)             (None, 50)           0                                            
__________________________________________________________________________________________________
words1 (InputLayer)             (None, 50)           0                                            
__________________________________________________________________________________________________
sequential_5 (Sequential)       (None, 50, 200)      321381600   words1[0][0]                     
                                                                 words2[0][0]                     
__________________________________________________________________________________________________
sequential_6 (Sequential)       (None, 50, 200)      80400       sequential_5[1][0]               
                                                                 sequential_5[2][0]               
__________________________________________________________________________________________________
dot_4 (Dot)                     (None, 50, 50)       0           sequential_6[1][0]               
                                                                 sequential_6[2][0]               
__________________________________________________________________________________________________
lambda_5 (Lambda)               (None, 50, 50)       0           dot_4[0][0]                      
__________________________________________________________________________________________________
dot_5 (Dot)                     (None, 50, 200)      0           lambda_5[0][0]                   
                                                                 sequential_5[1][0]               
__________________________________________________________________________________________________
concatenate_4 (Concatenate)     (None, 50, 400)      0           sequential_5[2][0]               
                                                                 dot_5[0][0]                      
__________________________________________________________________________________________________
time_distributed_5 (TimeDistrib (None, 50, 200)      120400      concatenate_4[0][0]              
__________________________________________________________________________________________________
lambda_6 (Lambda)               (None, 200)          0           time_distributed_5[0][0]         
__________________________________________________________________________________________________
sequential_8 (Sequential)       (None, 200)          80400       lambda_6[0][0]                   
__________________________________________________________________________________________________
dense_16 (Dense)                (None, 3)            603         sequential_8[1][0]               
==================================================================================================
Total params: 321,663,403
Trainable params: 341,803
Non-trainable params: 321,321,600
__________________________________________________________________________________________________

The parameter count has indeed decreased by 40,000, corresponding to the first layer of H now having a 200x200 rather than a 400x200 weight matrix.
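
The arithmetic (a quick sketch): in the two-direction model, the first layer of H takes the 400-dimensional concatenation of the two summed vectors, whereas here it takes a single 200-dimensional vector.


In [ ]:
H_both = (400 * 200 + 200) + (200 * 200 + 200)   # 120400 trainable parameters
H_left = (200 * 200 + 200) + (200 * 200 + 200)   #  80400 trainable parameters
print(H_both - H_left)                           # 40000 == 200 * 200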


In [21]:
m1.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50, validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))


Train on 549367 samples, validate on 9824 samples
Epoch 1/50
549367/549367 [==============================] - 25s 46us/step - loss: 0.7331 - acc: 0.6770 - val_loss: 0.5257 - val_acc: 0.7936
Epoch 2/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.5518 - acc: 0.7799 - val_loss: 0.4717 - val_acc: 0.8159
Epoch 3/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.5147 - acc: 0.7967 - val_loss: 0.4449 - val_acc: 0.8278
Epoch 4/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4948 - acc: 0.8060 - val_loss: 0.4326 - val_acc: 0.8344
Epoch 5/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4814 - acc: 0.8122 - val_loss: 0.4247 - val_acc: 0.8359
Epoch 6/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4712 - acc: 0.8162 - val_loss: 0.4143 - val_acc: 0.8430
Epoch 7/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4635 - acc: 0.8205 - val_loss: 0.4172 - val_acc: 0.8401
Epoch 8/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4570 - acc: 0.8223 - val_loss: 0.4106 - val_acc: 0.8422
Epoch 9/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4505 - acc: 0.8259 - val_loss: 0.4043 - val_acc: 0.8451
Epoch 10/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4459 - acc: 0.8280 - val_loss: 0.4050 - val_acc: 0.8467
Epoch 11/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4405 - acc: 0.8300 - val_loss: 0.3975 - val_acc: 0.8481
Epoch 12/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4360 - acc: 0.8324 - val_loss: 0.4026 - val_acc: 0.8496
Epoch 13/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4327 - acc: 0.8334 - val_loss: 0.4024 - val_acc: 0.8471
Epoch 14/50
549367/549367 [==============================] - 24s 45us/step - loss: 0.4293 - acc: 0.8350 - val_loss: 0.3955 - val_acc: 0.8496
Epoch 15/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4263 - acc: 0.8369 - val_loss: 0.3980 - val_acc: 0.8490
Epoch 16/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4236 - acc: 0.8377 - val_loss: 0.3958 - val_acc: 0.8496
Epoch 17/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4213 - acc: 0.8384 - val_loss: 0.3954 - val_acc: 0.8496
Epoch 18/50
549367/549367 [==============================] - 24s 45us/step - loss: 0.4187 - acc: 0.8394 - val_loss: 0.3929 - val_acc: 0.8514
Epoch 19/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4157 - acc: 0.8409 - val_loss: 0.3939 - val_acc: 0.8507
Epoch 20/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4135 - acc: 0.8417 - val_loss: 0.3953 - val_acc: 0.8522
Epoch 21/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4122 - acc: 0.8424 - val_loss: 0.3974 - val_acc: 0.8506
Epoch 22/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4099 - acc: 0.8435 - val_loss: 0.3918 - val_acc: 0.8522
Epoch 23/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4075 - acc: 0.8443 - val_loss: 0.3901 - val_acc: 0.8513
Epoch 24/50
549367/549367 [==============================] - 24s 44us/step - loss: 0.4067 - acc: 0.8447 - val_loss: 0.3885 - val_acc: 0.8543
Epoch 25/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4047 - acc: 0.8454 - val_loss: 0.3846 - val_acc: 0.8531
Epoch 26/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.4031 - acc: 0.8461 - val_loss: 0.3864 - val_acc: 0.8562
Epoch 27/50
549367/549367 [==============================] - 24s 45us/step - loss: 0.4020 - acc: 0.8467 - val_loss: 0.3874 - val_acc: 0.8546
Epoch 28/50
549367/549367 [==============================] - 24s 45us/step - loss: 0.4001 - acc: 0.8473 - val_loss: 0.3848 - val_acc: 0.8534
Epoch 29/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3991 - acc: 0.8479 - val_loss: 0.3865 - val_acc: 0.8562
Epoch 30/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3976 - acc: 0.8484 - val_loss: 0.3833 - val_acc: 0.8574
Epoch 31/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3961 - acc: 0.8487 - val_loss: 0.3846 - val_acc: 0.8585
Epoch 32/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3942 - acc: 0.8498 - val_loss: 0.3805 - val_acc: 0.8573
Epoch 33/50
549367/549367 [==============================] - 24s 44us/step - loss: 0.3935 - acc: 0.8503 - val_loss: 0.3856 - val_acc: 0.8579
Epoch 34/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3923 - acc: 0.8507 - val_loss: 0.3829 - val_acc: 0.8560
Epoch 35/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3920 - acc: 0.8508 - val_loss: 0.3864 - val_acc: 0.8575
Epoch 36/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3907 - acc: 0.8516 - val_loss: 0.3873 - val_acc: 0.8563
Epoch 37/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3891 - acc: 0.8519 - val_loss: 0.3850 - val_acc: 0.8570
Epoch 38/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3872 - acc: 0.8522 - val_loss: 0.3815 - val_acc: 0.8591
Epoch 39/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3887 - acc: 0.8520 - val_loss: 0.3829 - val_acc: 0.8590
Epoch 40/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3868 - acc: 0.8531 - val_loss: 0.3807 - val_acc: 0.8600
Epoch 41/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3859 - acc: 0.8537 - val_loss: 0.3832 - val_acc: 0.8574
Epoch 42/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3849 - acc: 0.8537 - val_loss: 0.3850 - val_acc: 0.8576
Epoch 43/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3834 - acc: 0.8541 - val_loss: 0.3825 - val_acc: 0.8563
Epoch 44/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3829 - acc: 0.8548 - val_loss: 0.3844 - val_acc: 0.8540
Epoch 45/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8552 - val_loss: 0.3841 - val_acc: 0.8559
Epoch 46/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8549 - val_loss: 0.3880 - val_acc: 0.8567
Epoch 47/50
549367/549367 [==============================] - 24s 45us/step - loss: 0.3799 - acc: 0.8559 - val_loss: 0.3767 - val_acc: 0.8635
Epoch 48/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3800 - acc: 0.8560 - val_loss: 0.3786 - val_acc: 0.8563
Epoch 49/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3781 - acc: 0.8563 - val_loss: 0.3812 - val_acc: 0.8596
Epoch 50/50
549367/549367 [==============================] - 25s 45us/step - loss: 0.3788 - acc: 0.8560 - val_loss: 0.3782 - val_acc: 0.8601
Out[21]:
<keras.callbacks.History at 0x7f5ca1bf3e48>

This model performs essentially the same as the slightly more complex model that evaluates alignments in both directions. Note also that processing time improves, from roughly 60 down to 45 microseconds per step.

Let's now look at the opposite asymmetric model, which evaluates only text to hypothesis comparisons. The prediction is that such a model will correctly classify a decent proportion of the examples, but not as accurately as the previous two.

We'll just use 10 epochs for expediency.


In [96]:
m2 = build_model(sem_vectors, 50, 200, 3, 200, 'right')
m2.summary()


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
words1 (InputLayer)             (None, 50)           0                                            
__________________________________________________________________________________________________
words2 (InputLayer)             (None, 50)           0                                            
__________________________________________________________________________________________________
sequential_13 (Sequential)      (None, 50, 200)      321381600   words1[0][0]                     
                                                                 words2[0][0]                     
__________________________________________________________________________________________________
sequential_14 (Sequential)      (None, 50, 200)      80400       sequential_13[1][0]              
                                                                 sequential_13[2][0]              
__________________________________________________________________________________________________
dot_8 (Dot)                     (None, 50, 50)       0           sequential_14[1][0]              
                                                                 sequential_14[2][0]              
__________________________________________________________________________________________________
lambda_9 (Lambda)               (None, 50, 50)       0           dot_8[0][0]                      
__________________________________________________________________________________________________
dot_9 (Dot)                     (None, 50, 200)      0           lambda_9[0][0]                   
                                                                 sequential_13[2][0]              
__________________________________________________________________________________________________
concatenate_6 (Concatenate)     (None, 50, 400)      0           sequential_13[1][0]              
                                                                 dot_9[0][0]                      
__________________________________________________________________________________________________
time_distributed_9 (TimeDistrib (None, 50, 200)      120400      concatenate_6[0][0]              
__________________________________________________________________________________________________
lambda_10 (Lambda)              (None, 200)          0           time_distributed_9[0][0]         
__________________________________________________________________________________________________
sequential_16 (Sequential)      (None, 200)          80400       lambda_10[0][0]                  
__________________________________________________________________________________________________
dense_32 (Dense)                (None, 3)            603         sequential_16[1][0]              
==================================================================================================
Total params: 321,663,403
Trainable params: 341,803
Non-trainable params: 321,321,600
__________________________________________________________________________________________________

In [97]:
m2.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=10, validation_split=0.2)


Train on 455226 samples, validate on 113807 samples
Epoch 1/10
455226/455226 [==============================] - 22s 49us/step - loss: 0.8920 - acc: 0.5771 - val_loss: 0.8001 - val_acc: 0.6435
Epoch 2/10
455226/455226 [==============================] - 22s 47us/step - loss: 0.7808 - acc: 0.6553 - val_loss: 0.7267 - val_acc: 0.6855
Epoch 3/10
455226/455226 [==============================] - 22s 47us/step - loss: 0.7329 - acc: 0.6825 - val_loss: 0.6966 - val_acc: 0.7006
Epoch 4/10
455226/455226 [==============================] - 22s 47us/step - loss: 0.7055 - acc: 0.6978 - val_loss: 0.6713 - val_acc: 0.7150
Epoch 5/10
455226/455226 [==============================] - 22s 47us/step - loss: 0.6862 - acc: 0.7081 - val_loss: 0.6533 - val_acc: 0.7253
Epoch 6/10
455226/455226 [==============================] - 21s 47us/step - loss: 0.6694 - acc: 0.7179 - val_loss: 0.6472 - val_acc: 0.7277
Epoch 7/10
455226/455226 [==============================] - 22s 47us/step - loss: 0.6555 - acc: 0.7252 - val_loss: 0.6338 - val_acc: 0.7347
Epoch 8/10
455226/455226 [==============================] - 22s 48us/step - loss: 0.6434 - acc: 0.7310 - val_loss: 0.6246 - val_acc: 0.7385
Epoch 9/10
455226/455226 [==============================] - 22s 47us/step - loss: 0.6325 - acc: 0.7367 - val_loss: 0.6164 - val_acc: 0.7424
Epoch 10/10
455226/455226 [==============================] - 22s 47us/step - loss: 0.6216 - acc: 0.7426 - val_loss: 0.6082 - val_acc: 0.7478
Out[97]:
<keras.callbacks.History at 0x7fa6850cf080>

Comparing this fit to the validation accuracy of the previous two models after 10 epochs, we observe that its accuracy is roughly 10 percentage points lower (about 75% vs about 85%).

It is reassuring that the neural modeling here reproduces what we know from the semantics of natural language!