In [1]:
from __future__ import print_function
try:
    import cPickle as pickle
except ImportError:
    import pickle
import keras.backend as K
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Bidirectional, Lambda
import numpy as np
In [2]:
# load the indexed data
indexed_question_1s = pickle.load(open("./data/processed/02.indexed_question_1s_train.pkl", "rb"))
indexed_question_2s = pickle.load(open("./data/processed/02.indexed_question_2s_train.pkl", "rb"))
labels_list = pickle.load(open("./data/processed/02.labels_train.pkl", "rb"))
In [3]:
# load the word to index dictionary
word_indices = pickle.load(open("./data/processed/02.word_indices.pkl", "rb"))
We're almost ready to train our model. There's just one hitch: neural networks take fixed-length vectors as input. What are we to do, since our questions are variable-length sequences of ints?
The answer is to pad the shorter instances to the length of the longest instance, thus making them all the same length. We'll pad with the index 0 -- this is why we reserved index 0 for the padding character in the word to index dictionary. Keras will recognize that these 0's are padding (via the mask_zero argument you'll see below) and not take them into account when doing model computations (this is called masking).
It's common to also truncate sequences. For example, say the average length of our questions is 10 words, but there's one outlier with 900 words. Padding every other question to 900 words would be a huge waste of space, when we could simply truncate that one outlier instead. We'll set a max length of 100 words; if a question is shorter than 100 words it'll be padded up, and if it's longer it'll be truncated.
Note that since the two questions are separate inputs to the model, as you'll see later, their max lengths could be set to different values if you wanted. This is useful if you're comparing, say, a question and a document -- you'd expect the question to be much shorter than the document, and you'd adjust the lengths accordingly.
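To make the padding and truncation behavior concrete, here's a tiny standalone illustration with made-up sequences, using the same Keras helper we'll apply to the real data below. Note that, by default, pad_sequences both pads and truncates at the front of each sequence:
from keras.preprocessing import sequence

toy_sequences = [[3, 7, 2], [5, 1, 4, 9, 8, 6, 2]]
print(sequence.pad_sequences(toy_sequences, maxlen=5))
# [[0 0 3 7 2]   <- padded with 0's at the front
#  [4 9 8 6 2]]  <- truncated from the front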
In [4]:
maxlen = 100
max_training_instances = 10000
In [5]:
# It takes a long time to train on all 400,000 samples on CPU (5 hours/epoch) --- let's cut it down to
# max_training_instances size. The dataset itself is a bit unbalanced, around 67% non-duplicate
# / 33% duplicate. We can use this opportunity to make it more balanced as well.
indices_with_0 = [index for index,value in enumerate(labels_list) if value==0]
indices_with_1 = [index for index,value in enumerate(labels_list) if value==1]
reduced_indexed_question_1s = []
reduced_indexed_question_2s = []
reduced_labels_list = []
for i in range(max_training_instances):
    # If i is even (~50% of the time), pull an index from indices_with_0 and add
    # that instance to the reduced dataset; otherwise, pull one from indices_with_1.
    # If either list of indices is empty, fall back to the other one.
    if i % 2 == 0:
        if indices_with_0:
            index = indices_with_0.pop()
        else:
            index = indices_with_1.pop()
    else:
        if indices_with_1:
            index = indices_with_1.pop()
        else:
            index = indices_with_0.pop()
    reduced_indexed_question_1s.append(indexed_question_1s[index])
    reduced_indexed_question_2s.append(indexed_question_2s[index])
    reduced_labels_list.append(labels_list[index])
In [6]:
print(len(reduced_indexed_question_1s))
print(len(reduced_indexed_question_2s))
print(len(reduced_labels_list))
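As a quick sanity check (this line isn't in the original notebook), you can also confirm that the reduced dataset is now roughly class-balanced:
# Fraction of duplicate (label 1) pairs; should be close to 0.5 after the interleaving above.
print(sum(reduced_labels_list) / float(len(reduced_labels_list)))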
In [7]:
# Now we want to pad / truncate our instances to a max length.
# Keras has a handy function to do this, but it isn't hard to implement yourself as well.
padded_question_1s = sequence.pad_sequences(reduced_indexed_question_1s, maxlen=maxlen)
padded_question_2s = sequence.pad_sequences(reduced_indexed_question_2s, maxlen=maxlen)
padded_question_1s_shape = padded_question_1s.shape
padded_question_2s_shape = padded_question_2s.shape
# We also want to convert our list of labels to a numpy array for use in the model.
labels = np.array(reduced_labels_list)
Let's inspect the shapes of our padded questions.
In [8]:
print("padded_question_1s_shape: {}".format(padded_question_1s_shape))
print("padded_question_2s_shape: {}".format(padded_question_1s_shape))
print("labels shape: {}".format(labels.shape))
Great, each of our questions is now of length maxlen.
At this point, we need to get the "vocabulary" of the training data. This is the number of unique indices in the data, so in this case it's easy to calculate by taking the length of the word_indices dictionary.
In [9]:
vocabulary_size = len(word_indices)
print("Vocabulary size: {}".format(vocabulary_size))
The batch_size controls how many training instances we process (i.e., do a gradient update on) at once, since it's usually impractical and memory-hungry to train on all of the data in a single step. 32 is a fairly standard choice; with our 10,000 training instances and a batch size of 32, each epoch is a few hundred gradient updates.
In [10]:
batch_size = 32
Now that we have our data sorted out, we can finally build our Keras model. As a computation graph framework, Keras has a nice "functional API": the idea is that you "construct" layers and then "apply" those layers to tensors by calling them. This probably sounds quite abstract, but hopefully the code below illustrates it.
In [11]:
# We are passed two matrices, one of shape (batch_size, question_1_length) and
# one of shape (batch_size, question_2_length). In this case, these are both (32, 100) by default.
# Note that the Input layer's shape argument does not include the batch size; it is a
# tuple with a value of (maxlen,).
question_1_input = Input(shape=padded_question_1s_shape[-1:])
question_2_input = Input(shape=padded_question_2s_shape[-1:])
print("question_1_input {}".format(question_1_input))
print("question_2_input {}".format(question_2_input))
In the printed representation of our tensors above, you'll notice that the shape is a somewhat odd (?, maxlen). The ? refers to a dimension that can be of any size. Since that dimension is our batch size, we can vary the batch size to be whatever we want and the model will still work.
Now that our questions are in the graph, we want to use an embedding layer to project each int index (which actually represents one word) into a higher-dimensional space. We do this with an Embedding layer, which replaces each index with a vector that should ideally capture the semantic meaning of the corresponding word. In this way, the model gets some notion of the "meaning" behind the indices.
In this model, the Embedding layer is randomly initialized --- every index is assigned a random vector at first. As the model trains, it tweaks the vector assigned to each word in order to minimize the loss. However, this naturally adds a lot more parameters to tune, which makes the model harder to train.
It's thus common practice in the field to use pre-trained embeddings: word vectors that have already been trained on a large corpus to be reasonably good representations. By using pretrained embeddings and keeping them fixed (not letting the model change them), you drastically lower the number of parameters the model has to fit, and you also help prevent overfitting -- freshly learned embeddings can become overly domain-specific, while pretrained embeddings are quite general.
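As a rough sketch of what that would look like (not something this notebook actually does): suppose you had built a numpy array pretrained_embedding_matrix of shape (vocabulary_size, 300), e.g. by looking up each word in word_indices in a set of GloVe vectors, with row i holding the vector for index i. You could then construct a frozen embedding layer like this:
# Hypothetical: pretrained_embedding_matrix is a (vocabulary_size, 300) numpy array built elsewhere.
pretrained_embedding_layer = Embedding(input_dim=vocabulary_size, output_dim=300,
                                       weights=[pretrained_embedding_matrix],
                                       trainable=False, mask_zero=True,
                                       input_length=maxlen)
Setting trainable=False keeps the vectors fixed during training; in this notebook we instead learn 128-dimensional embeddings from scratch.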
In [12]:
# Embedding layer for question 1. For each word in the question, it'll
# transform it into a fixed-length vector of size 128.
embedding_layer_1 = Embedding(input_dim=vocabulary_size, output_dim=128,
                              mask_zero=True, input_length=maxlen)
# Embedding layer for question 2. For each word in the question, it'll
# transform it into a fixed-length vector of size 128.
embedding_layer_2 = Embedding(vocabulary_size, 128,
                              mask_zero=True, input_length=maxlen)
# Now, we apply the embedding layers that we constructed to the input
# shape: (batch_size, question_1_length, embedding_output_dim) or (32, 100, 128) by default
question_1_embedded = embedding_layer_1(question_1_input)
print("question_1_embedded {}".format(question_1_embedded))
# shape: (batch_size, question_2_length, embedding_output_dim) or (32, 100, 128) by default
question_2_embedded = embedding_layer_2(question_2_input)
print("question_2_embedded {}".format(question_2_embedded))
Our data now consists of matrices of shape (batch_size, maxlen, embedding_dimension). It might be hard to picture what this means, but essentially we've replaced each int index in the sentence with a vector of size embedding_dimension (so we went from (batch_size, maxlen) to (batch_size, maxlen, embedding_dimension)).
Now that we have embedded our questions, it's time to encode them. A popular choice in modern NLP is to use a recurrent neural network, especially the bidirectional LSTM (biLSTM). An LSTM takes a single question as input (something of shape (maxlen, embedding_dimension) in this case) and squeezes it into a fixed-length vector of size (LSTM_output_units). In this manner, you can think of the LSTM as "encoding" the question into a single vector.
The "bidirectional" part comes from the idea that you run the question (a sequence of vectors) through one LSTM, then reverse the question and run it through another LSTM. You then concatenate the vectors output by the two LSTMs. This intuitively lets the model "read" the question from both directions.
In [13]:
# Now we take the embedded questions, and we encode them with a bidirectional LSTM.
# Think of a LSTM as converting/encoding a sequence of vectors into a fixed length vector.
# In this case, it takes in a single question of size (100, 128) and returns something of
# size (2*LSTM_output_units). Since it is batched, we go from (32, 100, 128) to (32, 2*LSTM_output_units)
# Bidirectional LSTM encoder for question_1_embedded
question_1_encoder = Bidirectional(LSTM(units=64))
# Bidirectional LSTM encoder for question_2_embedded
question_2_encoder = Bidirectional(LSTM(units=64))
# Now, we apply the Bidirectional LSTM encoders to our embedded questions.
# shape: (batch_size, 2*LSTM_output_units), or (32, 128) by default
question_1_encoded = question_1_encoder(question_1_embedded)
print("question_1_encoded: {}".format(question_1_encoded))
# shape: (batch_size, 2*LSTM_output_units), or (32, 128) by default
question_2_encoded = question_2_encoder(question_2_embedded)
print("question_2_encoded: {}".format(question_2_encoded))
Lastly, we compute a similarity metric between the two encoded vectors, over the batch. Our similarity metric will be exp(-||question_1_encoded - question_2_encoded||), or in words, e to the power of the negative L1 norm (a.k.a. Manhattan distance) of the difference. With this metric, each question pair (two vectors of size LSTM_units*2) gets a value between 0 and 1: pairs whose encodings are far apart (large L1 distance) are mapped close to 0, and pairs whose encodings are close together (small L1 distance) are mapped close to 1. For example, identical encodings have distance 0 and similarity exp(0) = 1. We can interpret this value as the probability that the two sentences are semantically the same, assuming that two questions with the same semantic meaning are probably duplicates.
In [14]:
# The L1 norm/Manhattan distance formula is simple: subtract vector 1 from vector 2, and sum the
# absolute values of the elements of the resulting vector.
# We'll first write a function to calculate our similarity metric given two tensors.
def l1_similarity(vectors):
    vector_1, vector_2 = vectors
    # Note that vector_1 and vector_2 are of shape (batch_size, LSTM_units*2)
    # First, take the absolute value of the difference. shape: (batch_size, LSTM_units*2)
    abs_diff = K.abs(vector_1 - vector_2)
    # Now, sum across the "first" axis (roughly analogous to summing the rows) and negate the result.
    # keepdims=True keeps that axis around with size 1 instead of dropping it.
    # shape: (batch_size, 1)
    negative_l1_distance = -K.sum(abs_diff, axis=1, keepdims=True)
    # Finally, apply the exponential function and return the output.
    # shape: (batch_size, 1), where each value is in [0, 1] and
    # describes the probability of the two vectors being semantically similar.
    return K.exp(negative_l1_distance)
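To sanity-check the math, here's the same computation on a couple of toy numpy vectors (a plain-numpy analogue of the Keras backend function above, not part of the model):
v1 = np.array([0.2, -0.5, 0.1])
v2 = np.array([0.2, -0.5, 0.1])
v3 = np.array([3.0, 2.0, -1.0])
# Identical vectors: L1 distance is 0, so the similarity is exp(0) = 1.0
print(np.exp(-np.sum(np.abs(v1 - v2))))
# Distant vectors: L1 distance is 6.4, so the similarity is exp(-6.4), close to 0
print(np.exp(-np.sum(np.abs(v1 - v3))))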
In [15]:
# We now want to pass our two encoded questions to our similarity function.
# To do so, we'll use a keras Lambda layer, which lets us wrap an arbitrary
# function in a Lambda object. Note that _ALL_ operations on keras tensors
# in the Model class _must_ be a layer; we thus cannot call the function directly.
# Here, we're creating a layer and using it in one line.
# output shape: (batch_size, 1)
duplicate_probabilities = Lambda(l1_similarity)([question_1_encoded, question_2_encoded])
print("duplicate_probabilities: {}".format(duplicate_probabilities))
Now that we've strung our layers and inputs together to produce a final probability, we can create a Keras Model. The Model takes the input numpy arrays, runs them through the computation graph we built, and produces output probabilities, which it automatically compares to the labels in order to compute the loss.
To do this, we just create an instance of the Model class and specify which Input layers are our inputs and which value in the graph is our final output. Note that since we have multiple inputs, we need to pass a list of Input tensors.
In [16]:
# These duplicate probabilities are what we want to output from our model, so we'll create
# the model now.
duplicate_questions_model = Model(inputs=[question_1_input, question_2_input], outputs=duplicate_probabilities)
Now, we compile our model into a TensorFlow/Theano graph. Keras handles this for us, but we need to specify an optimization algorithm to use, as well as a loss function. adam is generally a good choice of optimizer, and binary_crossentropy is appropriate for a binary classification task like this one.
We can also specify a list of metrics to be printed during training and testing, so we'll print the accuracy.
In [17]:
duplicate_questions_model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
In [18]:
# Print a summary of the layers of our model and their inputs and outputs
duplicate_questions_model.summary()
In [19]:
# Now, we can finally fit our model on training data!
# Note that the order of the input x matters.
duplicate_questions_model.fit(x=[padded_question_1s, padded_question_2s], y=labels,
                              batch_size=batch_size, epochs=4, validation_split=0.1)
Out[19]:
Great, you can see that the model is definitely learning the task, since the training accuracy (denoted by acc) goes up each epoch. However, the validation accuracy seems to peak around epoch 2 and drops afterwards, likely because we are overfitting on the data (remember that this is a relatively small slice of the full dataset).
This is a quick model to perform the task, and there are a lot of different ways to make it better. For example, you could: