In [1]:
from __future__ import print_function
try:
    import cPickle as pickle
except ImportError:
    import pickle

import keras.backend as K

from keras.preprocessing import sequence
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Bidirectional, Lambda
import numpy as np


Using TensorFlow backend.

In [2]:
# load the indexed data
indexed_question_1s = pickle.load(open("./data/processed/02.indexed_question_1s_train.pkl", "rb"))
indexed_question_2s = pickle.load(open("./data/processed/02.indexed_question_2s_train.pkl", "rb"))
labels_list = pickle.load(open("./data/processed/02.labels_train.pkl", "rb"))

In [3]:
# load the word to index dictionary
word_indices = pickle.load(open("./data/processed/02.word_indices.pkl", "rb"))

Padding

We're almost ready to train our model. There's just one hitch though: neural networks take as input fixed-length vectors. What are we to do, since our questions are sequences of ints with variable length?

The answer is to pad the shorter instances to the length of the longest instance, thus making them all the same length! We'll pad with the 0 character -- this is why we set the padding character to have a 0 index in the word to index dictionary. Keras can then treat these 0's as padding and leave them out of the model computations (this is called masking; we'll enable it with the mask_zero=True argument of the Embedding layer).

It's common to also truncate sequences. For example, say that the average length of our questions is 10 words, but there's one outlier with 900 words. Padding all of the other questions to 900 words would be a huge waste of space when we could simply truncate that one outlier instead. Thus, we'll set a max length of 100 words; questions shorter than 100 words will be padded up, and longer ones will be truncated.

Note that since the two questions are actually separate inputs to the model, as you'll see later, their max length could be set to different values if you wanted. This is useful if you're comparing, say, a question and a document -- you'd expect the question to be much shorter than the document, and adjust your lengths accordingly.
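To make the padding and truncation behavior concrete, here's a small standalone sketch of what sequence.pad_sequences does with its default "pre" padding and truncation; the toy index sequences below are made up purely for illustration.

from keras.preprocessing import sequence

toy_sequences = [[3, 7, 2],           # shorter than maxlen: padded with 0's at the front
                 [5, 1, 4, 9, 8, 6]]  # longer than maxlen: truncated from the front
print(sequence.pad_sequences(toy_sequences, maxlen=5))
# [[0 0 3 7 2]
#  [1 4 9 8 6]]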


In [4]:
maxlen = 100
max_training_instances = 10000

In [5]:
# It takes a long time to train on all 400,000 samples on CPU (5 hours/epoch) --- let's cut it down to 
# max_training_instances size. The dataset itself is a bit unbalanced, around 67% non-duplicate 
# / 33% duplicate. We can use this opportunity to make it more balanced as well.
indices_with_0 = [index for index,value in enumerate(labels_list) if value==0]
indices_with_1 = [index for index,value in enumerate(labels_list) if value==1]

reduced_indexed_question_1s = []
reduced_indexed_question_2s = []
reduced_labels_list = []

for i in range(max_training_instances):
    # If i is even (~50% of the time), pull an index from indices_with_0 and add
    # that instance to the truncated dataset. Else, pull an index from indices_with_1
    # and add that instance to the truncated dataset. If either list of indices is
    # empty, use the other one.
    # TODO: this if/else could probably be refactored into something tidier.
    if i % 2 == 0:
        if indices_with_0:
            index = indices_with_0.pop()
        else:
            index = indices_with_1.pop()
    else:
        if indices_with_1:
            index = indices_with_1.pop()
        else:
            index = indices_with_0.pop()
    reduced_indexed_question_1s.append(indexed_question_1s[index])
    reduced_indexed_question_2s.append(indexed_question_2s[index])
    reduced_labels_list.append(labels_list[index])

In [6]:
print(len(reduced_indexed_question_1s))
print(len(reduced_indexed_question_2s))
print(len(reduced_labels_list))


10000
10000
10000

In [7]:
# Now we want to pad / truncate our instances to a max length.
# Keras has a handy function to do this, but it isn't hard to implement yourself as well.
padded_question_1s = sequence.pad_sequences(reduced_indexed_question_1s, maxlen=maxlen)
padded_question_2s = sequence.pad_sequences(reduced_indexed_question_2s, maxlen=maxlen)

padded_question_1s_shape = padded_question_1s.shape
padded_question_2s_shape = padded_question_2s.shape

# We also want to convert our list of labels to a numpy array for use in the model.
labels = np.array(reduced_labels_list)

Let's inspect the shapes of our padded questions.


In [8]:
print("padded_question_1s_shape: {}".format(padded_question_1s_shape))
print("padded_question_2s_shape: {}".format(padded_question_1s_shape))
print("labels shape: {}".format(labels.shape))


padded_question_1s_shape: (10000, 100)
padded_question_2s_shape: (10000, 100)
labels shape: (10000,)

Great, each of our questions is now of length maxlen.

At this point, we need to get the "vocabulary" of the training data. This is the number of unique indices in the data, so in this case it's easy to calculate by taking the length of the word_indices dictionary.


In [9]:
vocabulary_size = len(word_indices)
print("Vocabulary size: {}".format(vocabulary_size))


Vocabulary size: 104472

The batch_size controls how many training instances we process (do a gradient update on) at once, since it's generally infeasible to compute a gradient update over all of the data at once. 32 is a fairly standard number.


In [10]:
batch_size = 32

Building the model

Now that we have our data sorted out, we can finally build our Keras model. As a computation graph framework, Keras has a nice "functional API"; the notion is that you "construct" layers, and then "apply" these layers to tensors by calling them. This probably sounds quite abstract, but hopefully the code below illustrates.

Setting up input layers

At the beginning of every model in Keras, you need "input" layers, which indicate what the shape of the incoming data arrays will be and provide a means for the other layers to interface with the data.


In [11]:
# The model takes two matrices as input, one of shape (batch_size, question_1_length)
# and one of shape (batch_size, question_2_length). In this case, both are (32, 100) by default.
# Note that the Input layer's shape argument does not include the batch size; it is
# a tuple with the value (maxlen,).
question_1_input = Input(shape=padded_question_1s_shape[-1:])
question_2_input = Input(shape=padded_question_2s_shape[-1:])

print("question_1_input {}".format(question_1_input))
print("question_2_input {}".format(question_2_input))


question_1_input Tensor("input_1:0", shape=(?, 100), dtype=float32)
question_2_input Tensor("input_2:0", shape=(?, 100), dtype=float32)

In the above printed representation of our tensor, you'll notice that the shape is a weird (?, maxlen). In this case, the ? refers to a dimension that can be of any size. Since that is our batch_size, we can vary the batch size to be whatever we want and the model will still work.

The Embedding Layer

Now that our questions are in the graph, we want to project each int index (which represents one word) into a higher-dimensional space. The way we do this is with an Embedding layer. This layer replaces each index with a vector, and the vector should ideally capture the semantic meaning of the word that the index stands for. In this way, the model gets some notion of the "meaning" behind the indices.

In this model, the Embedding layer is randomly initialized --- every index is assigned a random vector at first. As the model trains, it will tweak the vector assigned to each word in order to minimize the loss. However, this naturally leads to a lot more parameters to tune, which makes the model harder to train.

It's thus common practice in the field to use pre-trained embeddings. Pre-trained embeddings are exactly what they sound like: embedding vectors that have already been trained on a large corpus and thus already provide a pretty good representation for each word. By using these pretrained embeddings and not updating them (keeping them fixed so the model can't change them), you drastically lower the number of parameters the model has to fit and also help prevent overfitting (fine-tuning the embeddings can make them overly domain-specific, while the pretrained embeddings are quite general).
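For reference, here is a rough sketch of how pretrained vectors could be plugged in: build a (vocabulary_size, embedding_dim) matrix and pass it to the Embedding layer via the weights argument with trainable=False. The glove_path and the GloVe text-file format assumed below are illustrative, not something used in this notebook.

import numpy as np
from keras.layers import Embedding

embedding_dim = 100
glove_path = "./data/external/glove.6B.100d.txt"  # hypothetical path to pretrained vectors

# Start from small random vectors so words missing from the pretrained file still get an embedding.
embedding_matrix = np.random.uniform(-0.05, 0.05, (vocabulary_size, embedding_dim))
embedding_matrix[0] = 0.0  # index 0 is the padding token

with open(glove_path) as glove_file:
    for line in glove_file:
        values = line.rstrip().split(" ")
        word, vector = values[0], np.asarray(values[1:], dtype="float32")
        if word in word_indices:
            embedding_matrix[word_indices[word]] = vector

pretrained_embedding_layer = Embedding(input_dim=vocabulary_size, output_dim=embedding_dim,
                                       weights=[embedding_matrix], trainable=False,
                                       mask_zero=True, input_length=maxlen)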


In [12]:
# Embedding layer for question 1. For each word in the question, it'll
# transform it into a fixed-length vector of size 128.
embedding_layer_1 = Embedding(input_dim=vocabulary_size, output_dim=128, 
                              mask_zero=True, input_length=maxlen)

# Embedding layer for question 2. For each word in the question, it'll 
# transform it into a fixed-length vector of size 128.
embedding_layer_2 = Embedding(vocabulary_size, 128, 
                              mask_zero=True, input_length=maxlen)

# Now, we apply the embedding layers that we constructed to the input
# shape: (batch_size, question_1_length, embedding_output_dim) or (32, 100, 128) by default
question_1_embedded = embedding_layer_1(question_1_input)
print("question_1_embedded {}".format(question_1_embedded))

# shape: (batch_size, question_2_length, embedding_output_dim) or (32, 100, 128) by default
question_2_embedded = embedding_layer_2(question_2_input)
print("question_2_embedded {}".format(question_2_embedded))


question_1_embedded Tensor("embedding_1/Gather:0", shape=(?, 100, 128), dtype=float32)
question_2_embedded Tensor("embedding_2/Gather:0", shape=(?, 100, 128), dtype=float32)

Encoding the words

Now, our data consists of matrices of shape (batch_size, maxlen, embedding_dimension). It might be hard to picture what this means, but we've essentially replaced each int index in the sentence with a vector of size embedding_dimension (going from (batch_size, maxlen) to (batch_size, maxlen, embedding_dimension)).

Now that we have embedded our questions, it's time to encode them. A popular choice in modern NLP is to use a recurrent neural network, especially the Bidirectional LSTM (biLSTM). An LSTM essentially takes a single question as input (something of shape (maxlen, embedding_dimension) in this case) and squeezes it into a fixed-length vector of size (LSTM_output_units). In this manner, you can think of the LSTM as "encoding" the question into a single vector.

The "Bidirectional" part comes from the idea that you should run the question (a sequence of vectors) through the LSTM, and then reverse the question and run it through another LSTM. Then, you take the vector that was outputted from both and concatenate it. This intuitively lets the LSTM "read from both directions".


In [13]:
# Now we take the embedded questions, and we encode them with a bidirectional LSTM.
# Think of a LSTM as converting/encoding a sequence of vectors into a fixed length vector.
# In this case, it takes in a single question of size (100, 128) and returns something of 
# size (2*LSTM_output_units). Since it is batched, we go from (32, 100, 128) to (32, 2*LSTM_output_units)

# Bidirectional LSTM encoder for question_1_embedded
question_1_encoder = Bidirectional(LSTM(units=64))
 
# Bidirectional LSTM encoder for question_2_embedded
question_2_encoder = Bidirectional(LSTM(units=64))

# Now, we apply the Bidirectional LSTM encoders to our embedded questions.
# shape: (batch_size, 2*LSTM_output_units), or (32, 128) by default
question_1_encoded = question_1_encoder(question_1_embedded)
print("question_1_encoded: {}".format(question_1_encoded))

# shape: (batch_size, 2*LSTM_output_units), or (32, 128) by default
question_2_encoded = question_2_encoder(question_2_embedded)
print("question_2_encoded: {}".format(question_2_encoded))


question_1_encoded: Tensor("bidirectional_1/concat_2:0", shape=(?, 128), dtype=float32)
question_2_encoded: Tensor("bidirectional_2/concat_2:0", shape=(?, 128), dtype=float32)

Getting an output probability

Lastly, we compute a similarity metric between the two encoded questions, over the batch. Our similarity metric will be exp(-||question_1_encoded - question_2_encoded||), or in words, e to the power of the negative L1 norm (a.k.a. Manhattan distance) of the difference. With this metric, each question pair (two encoded vectors of size LSTM_units*2) gets a value between 0 and 1, with pairs that are far apart in L1 distance being closer to 0 and pairs that are close together being closer to 1. We can interpret this as the probability that the two questions are semantically the same, under the assumption that two questions with the same semantic meaning are probably duplicates.
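As a quick sanity check on that range claim, here's a tiny numpy illustration with made-up 4-dimensional "encodings" (the real encodings are 128-dimensional):

import numpy as np

v1 = np.array([0.2, -0.1, 0.4, 0.0])
v2 = np.array([0.1, -0.1, 0.3, 0.2])
v3 = np.array([2.0, -1.5, 0.0, 3.0])

# Similar vectors: small L1 distance, similarity close to 1.
print(np.exp(-np.sum(np.abs(v1 - v2))))  # exp(-0.4) ~= 0.67

# Dissimilar vectors: large L1 distance, similarity close to 0.
print(np.exp(-np.sum(np.abs(v1 - v3))))  # exp(-6.6) ~= 0.001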


In [14]:
# The L1 Norm/Manhattan distance formula is simple: subtract vector 1 from vector 2, and add up the 
# absolute value of the resulting vector.

# We'll first write a function to calculate our similarity metric given two tensors.
def l1_similarity(vectors):
    vector_1, vector_2 = vectors
    # Note that vector_1 and vector_2 are of shape (batch_size, LSTM_units*2)
    # First, take the absolute value of the difference. shape(batch_size, LSTM_units*2)
    abs_diff = K.abs(vector_1-vector_2)
    
    # Now, sum across the "first" axis and negate it (which thus negates every element of it). 
    # This is roughly analogous to summing the rows.
    # keepdims=True does not reduce the dimensionality, and just leaves it as 1.
    # shape: (batch_size, 1)
    negative_l1_distance = -K.sum(abs_diff, axis=1, keepdims=True)
    
    # Finally, apply the exponential function and return the output.
    # shape: (batch_size, 1), where the "1" is a value in [0, 1] that
    # describes the probability of the two vectors being semantically similar.
    return K.exp(negative_l1_distance)

In [15]:
# We now want to pass our two encoded questions to our similarity function.
# To do so, we'll use a keras Lambda layer, which lets us wrap an arbitrary
# function in a Lambda object. Note that _ALL_ operations on keras tensors
# in the Model class _must_ go through a layer; we thus cannot call the function directly.

# Here, we're creating a layer and using it in one line.
# output shape: (batch_size, 1)
duplicate_probabilities = Lambda(l1_similarity)([question_1_encoded, question_2_encoded])
print("duplicate_probabilities: {}".format(duplicate_probabilities))


duplicate_probabilities: Tensor("lambda_1/Exp:0", shape=(?, 1), dtype=float32)

Wrapping up the model

Now that we've successfully strung together a bunch of layers and inputs to get a final probability, we can create a keras Model that takes the input numpy arrays, runs them through the computation graph we built, and produces an output probability, which is then compared to the label to compute the loss.

To do all of this, we just need to create an instance of the Model class and specify which Input layers are our inputs, and what value from the graph is our final output. Note that since we have multiple inputs, we need to pass a list of Input tensors.


In [16]:
# These duplicate probabilties are what we want to output from our model, so we'll create
# the model now.
duplicate_questions_model = Model(inputs=[question_1_input, question_2_input], outputs=duplicate_probabilities)

Compiling the model

Now, we compile our model into a Tensorflow/Theano graph. Keras handles this for us, but we need to specify an optimization algorithm to use, as well as a loss function. adam is generally a good choice of optimizer, and binary_crossentropy is appropriate for a binary classification task like the one we have.

We can also specify a list of metrics to be printed during training and testing, so we'll print the accuracy.


In [17]:
duplicate_questions_model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

In [18]:
# Print a summary of the layers of our model and their inputs and outputs
duplicate_questions_model.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_1 (InputLayer)             (None, 100)           0                                            
____________________________________________________________________________________________________
input_2 (InputLayer)             (None, 100)           0                                            
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 100, 128)      13372416                                     
____________________________________________________________________________________________________
embedding_2 (Embedding)          (None, 100, 128)      13372416                                     
____________________________________________________________________________________________________
bidirectional_1 (Bidirectional)  (None, 128)           98816                                        
____________________________________________________________________________________________________
bidirectional_2 (Bidirectional)  (None, 128)           98816                                        
____________________________________________________________________________________________________
lambda_1 (Lambda)                (None, 1)             0                                            
====================================================================================================
Total params: 26,942,464.0
Trainable params: 26,942,464.0
Non-trainable params: 0.0
____________________________________________________________________________________________________

Training our model

Now, we can finally pass in our input arrays and output labels and watch the model train!


In [19]:
# Now, we can finally fit our model on training data!
# Note that the order of the input x matters.
duplicate_questions_model.fit(x=[padded_question_1s, padded_question_2s], y=labels, 
                              batch_size=batch_size, epochs=4, validation_split=0.1)


Train on 9000 samples, validate on 1000 samples
Epoch 1/4
9000/9000 [==============================] - 427s - loss: 0.6605 - acc: 0.5999 - val_loss: 0.6197 - val_acc: 0.6680
Epoch 2/4
9000/9000 [==============================] - 446s - loss: 0.4942 - acc: 0.8378 - val_loss: 0.6072 - val_acc: 0.6740
Epoch 3/4
9000/9000 [==============================] - 439s - loss: 0.3912 - acc: 0.9071 - val_loss: 0.6339 - val_acc: 0.6620
Epoch 4/4
9000/9000 [==============================] - 443s - loss: 0.3257 - acc: 0.9388 - val_loss: 0.7801 - val_acc: 0.6670
Out[19]:
<keras.callbacks.History at 0x13704c0b8>

Some quick analysis

Great, you can see that the model is definitely learning the task, since the training accuracy (denoted by acc) goes up each epoch. However, the validation accuracy peaks at epoch 2 and goes down afterwards, probably because we are overfitting on the data (remember that this is a relatively small slice of the actual dataset).

Next Steps

This is a quick model to perform the task, and there are a lot of different ways to make it better. For example, you could:

  • Try using pretrained embeddings (e.g. word2vec or GloVe), as sketched earlier in the notebook. There's a good blog post on how to do so here. I'd expect this to give sizeable gains.
  • Try adding early stopping, so training automatically stops once the validation metrics stop improving (a rough sketch follows this list).
  • Try visualizing your model with TensorBoard (Tensorflow backend only).
  • Try tweaking the output dimensionality of some of the layers. 64*2 dimensions is a bit small for a BiLSTM, but we used it here to keep training times manageable. 128*2 is more common (but will also drastically increase train time).
  • Gated Recurrent Units (GRUs) are a popular alternative to LSTMs --- try using GRUs in the model.
  • Right now, we're encoding question_1 and question_2 with different biLSTM encoders. This doesn't make a lot of intuitive sense --- there's nothing intrinsically different about question_1 vs question_2, yet if we were to swap question_1 and question_2, we'd probably get different results. Try using the same encoder to encode both of the questions (this is also known as sharing weights between the encoders); a sketch of this also follows the list.
  • Manhattan distance might not be optimal; try playing around with Euclidean distance or cosine similarity. Make sure that you always output a number between 0 and 1, though!
  • Try training on all the data (be prepared to wait if you don't have access to a GPU).
  • Build your own model, and try a completely different architecture! The world is your oyster at this point :)
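To make the early-stopping and shared-encoder suggestions a bit more concrete, here is a rough sketch of how they could be wired into the model above. This is an assumption about how you might modify the notebook, not code that was run here.

from keras.callbacks import EarlyStopping
from keras.layers import LSTM, Bidirectional

# Shared encoder: construct a single Bidirectional LSTM and apply it to both
# embedded questions, so both are encoded with the same weights.
# (You would then rebuild the Lambda layer and Model on top of these tensors.)
shared_encoder = Bidirectional(LSTM(units=64))
question_1_encoded = shared_encoder(question_1_embedded)
question_2_encoded = shared_encoder(question_2_embedded)

# Early stopping: halt training once the validation loss stops improving
# for `patience` consecutive epochs.
early_stopping = EarlyStopping(monitor='val_loss', patience=2)
duplicate_questions_model.fit(x=[padded_question_1s, padded_question_2s], y=labels,
                              batch_size=batch_size, epochs=20, validation_split=0.1,
                              callbacks=[early_stopping])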