In [190]:
import cPickle as p
import numpy as np
import tensorflow as tf
import word2vec
import random
Welcome to the first lab of 6.S191!
Things to install: numpy, tensorflow, and word2vec (everything else we import ships with Python 2).
And we'll be doing all of this in the context of Twitter sentiment analysis. Given a tweet like:
omg 6.S196 is so cool #deeplearning #mit
We want an algorithm to label this tweet as positive or negative. It's intractable to solve this task with hand-written lexical rules, so instead we're going to use deep learning to embed these tweets into some deep latent space where distinguishing between the two classes is relatively simple.
Given some dataset with tweets $X$, and sentiments $Y$, we want to learn a function $f$, such that $Y = f(X)$. In our context, $f$ is a deep neural network parameterized by some network weights $\Theta$, and we're going to do our learning via gradient descent.
To start, we need some way to measure how good our $f$ is, so that we can take a gradient with respect to that performance and move in the right direction. We call this performance measure our loss function, $L$, and it is something we want to minimize.
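In update-rule form, each step of gradient descent moves the parameters a small step against the gradient of that loss:
$$\Theta \leftarrow \Theta - \eta \, \frac{\partial L}{\partial \Theta}$$
where $\eta$ is the learning rate. (The optimizer we use later, Adam, adapts this step size per parameter, but the idea is the same.)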
Since we are doing classification (pos vs neg), a common loss function to use is cross entropy.
$$L(\Theta) = - \sum_i \Big( y_i \log f(x_i) + (1 - y_i) \log\big(1 - f(x_i)\big) \Big)$$ where $f(x)$ is the probability a tweet $x$ is positive, which we want to be 1 given that it's positive and 0 given that it's negative, and $y$ is the correct answer. We can access this function with tf.nn.sigmoid_cross_entropy_with_logits, which will come in handy in code. Given that $f$ is parameterized by $\Theta$, we can take the gradient $\frac{dL}{d\Theta}$, and we learn by updating our parameters to minimize the loss.
Note that this loss approaches 0 as the prediction becomes confidently correct, and it becomes very large if we predict something has probability 0 of being positive when it is in fact positive.
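To build intuition, here is a minimal numpy sketch (an aside, not part of the lab code) that evaluates this loss for a single example:
In [ ]:
import numpy as np

def cross_entropy(f_x, y):
    # f_x: predicted probability the tweet is positive; y: true label (0 or 1)
    return -(y * np.log(f_x) + (1 - y) * np.log(1 - f_x))

print(cross_entropy(0.99, 1))  # confident and correct: ~0.01
print(cross_entropy(0.01, 1))  # confident and wrong: ~4.6, and it blows up as f_x -> 0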
To measure how well we're doing, we can't just look at how well our model performs on its training data: it could simply be memorizing the training data and perform terribly on data it hasn't seen before. To really measure how $f$ performs in the wild, we need to present it with unseen data, and we can tune our hyper-parameters (like learning rate, number of layers, etc.) over this first unseen set, which we call our development (or validation) set. However, given that we optimized our hyper-parameters to the development set, to get a truly fair assessment of the model we test it against a held-out test set at the end, and generally report those numbers.
In summary: we train on one set (the training set), evaluate and tune our hyper-parameters with respect to our performance on the dev set, and report final results on a completely held-out test set.
Let's load these now; this ratio of sizes is fairly standard.
In [202]:
trainSet = p.load( open('data/train.p','rb'))
devSet = p.load( open('data/dev.p','rb'))
testSet = p.load( open('data/test.p','rb'))
## Let's look at the size of what we have here. Note, we could use a much larger train set,
## but we keep it mid-size so you can run this whole thing off your laptop
len(trainSet), len(devSet), len(testSet)
Out[202]:
The first question we need to address is: how do we represent a tweet? How do we represent a word? One way to do this is with one-hot vectors for each word, where a given word $w_i = [0,0,...,1,...,0]$. However, in this representation, words like "love" and "adore" are as similar as "love" and "hate", because the cosine similarity between any two distinct one-hot vectors is 0. Another issue is that these vectors are huge, since they must span the whole vocabulary. To get around these issues, the NLP community developed a technique called word embeddings.
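A quick numpy illustration of that similarity problem, using a hypothetical 4-word vocabulary:
In [ ]:
import numpy as np
## Hypothetical vocab [love, adore, hate, mit]; each word is a one-hot row
love, adore, hate = np.eye(4)[0], np.eye(4)[1], np.eye(4)[2]
cos = lambda u, v: np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos(love, adore))  # 0.0
print(cos(love, hate))   # 0.0 -- one-hot vectors carry no notion of word similarity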
The basic idea is that we represent a word with a vector $\phi$ determined by the context the word appears in. By training a neural network to predict the context of words across a large training set, we can use the weights of that neural network as a dense, useful representation that captures context. Word embeddings capture all kinds of useful semantic relationships; for example, one cool emergent property is $\phi(king) - \phi(queen) \approx \phi(man) - \phi(woman)$. To learn more about the magic behind word embeddings, we recommend Chris Colah's "blog post". A common tool for generating word embeddings is word2vec, which is what we'll be using today.
In [206]:
## Note: these tweets were preprocessed to remove non-alphanumeric chars, replace infrequent words, and pad all tweets to the same length.
## Note: we're going to train our embeddings on only our train set in order to keep our dev/test sets fair.
trainSentences = [" ".join(tweetPair[0]) for tweetPair in trainSet]
print trainSentences[0]
p.dump(trainSentences, open('data/trainSentences.p','wb'))
## The word2vec module expects a path to a file of sentences, a target path to store the model,
## and the size of the embedding vectors
word2vec.word2vec('data/trainSentences.p', 'data/embeddings.bin', 100, verbose=True)
w2vModel = word2vec.load('data/embeddings.bin')
print w2vModel.vocab
In [207]:
## Each word is represented by a 100-dimensional vector, like this
print "embedding for the word fun", w2vModel['fun']
Now let's look at the words most similar to the word "fun".
In [208]:
indices, cosineSim = w2vModel.cosine('fun')
print w2vModel.vocab[indices]
word_embeddings = w2vModel.vectors
vocab_size = len(w2vModel.vocab)
Feel free to play around here and test the properties of your embeddings, how they cluster, etc. In the interest of time, we're going to move straight on to models.
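For example, here is one way to probe the analogy property from earlier with plain numpy. The specific words are assumptions and must actually be in the (small, tweet-trained) vocabulary, so don't expect textbook-quality analogies:
In [ ]:
## Nearest words to the vector great - good + bad (rough analogy arithmetic)
query = w2vModel['great'] - w2vModel['good'] + w2vModel['bad']
sims = np.dot(w2vModel.vectors, query) / (
    np.linalg.norm(w2vModel.vectors, axis=1) * np.linalg.norm(query))
print(w2vModel.vocab[np.argsort(-sims)[:5]])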
Now in order to use these embeddings, we have to represent each tweet as a list of indices into the embedding matrix. This preprocessing code is available in processing.py if you are interested.
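Roughly, that preprocessing amounts to something like the following sketch (an illustration; the actual code in processing.py may differ):
In [ ]:
## Map each token to its row index in the embedding matrix
word_to_index = {w: i for i, w in enumerate(w2vModel.vocab)}
example = "omg this lab is so cool".split()
tweet_ids = [word_to_index[w] for w in example if w in word_to_index]
print(tweet_ids)  # a list of indices into word_embeddings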
Tensorflow is a hugely popular library for building neural nets. The general workflow in building models in tensorflow is as follows:
1. Build the graph: set up placeholders for the data you'll feed in, variables for the weights, and the ops that connect them (including your loss and its minimizer).
2. Run the graph: initialize your variables, then feed values into the placeholders and fetch the tensors you want evaluated.
Here is a toy example putting 2 and 2 together, and initializing some random weight matrix.
In [209]:
session = tf.Session()
# 1.BUILD GRAPH
# Set placeholders with a type for data you'll eventually feed in (like tweets and sentiments)
a = tf.placeholder(tf.int32)
b = tf.placeholder(tf.int32)
# Set up variables, like weight matrices.
# Using tf.get_variable, specify the name, shape, type and initializer of the variable.
W = tf.get_variable("ExampleMatrix", [2, 2], tf.float32, tf.random_normal_initializer(stddev=1.0 / 2))
# Set up the operations you need, like matrix multiplications, non-linear layers, and your loss function minimizer
c = a*b
# 2.RUN GRAPH
# Initialize any variables you have (just W in this case)
tf.global_variables_initializer().run(session=session)
# Specify the values tensor you want returned, and ops you want run
fetch = {'c':c, 'W':W}
# Fill in the place holders
feed_dict = {
    a: 2,
    b: 2,
}
# Run and get back fetch filled in
results = session.run( fetch, feed_dict = feed_dict)
print( results['c'])
print( results['W'])
# Close session
session.close()
# Reset the graph so it doesn't get in the way later
tf.reset_default_graph()
An MLP, or multi-layer perceptron, is a basic architecture where at each layer we multiply our representation by some matrix W, add some bias b, and then apply some nonlinearity like tanh. Each layer is fully connected to the next. As the network gets deeper, its expressive power grows exponentially, so it can draw some pretty fancy decision boundaries. In this exercise, you'll build your own MLP with one hidden layer (a layer that is neither input nor output) of 100 dimensions.
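Concretely, with an averaged tweet embedding $x$ as input, the forward pass you'll build below is:
$$h = \tanh(x W_{hidden} + b_{hidden}), \qquad logits = h \, W_{output} + b_{output}$$
with $W_{hidden} \in \mathbb{R}^{100 \times 100}$ and $W_{output} \in \mathbb{R}^{100 \times 1}$ (these shapes match the solution at the end of the lab).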
To make training more stable and efficient, we'll actually evaluate 20 tweets at a time and take gradients with respect to the loss on those 20. We call this idea training with mini-batches.
Step 1: Placeholders and variables. Each tweet is a sentence-length (20) vector of word_ids, and since we are packing mini-batch size (20) tweets into the graph at a time, we need a placeholder of shape minibatch_size * sentence_length. Feel free to check out the placeholder API here. We also need a mini-batch-size-length vector of sentiments, and a placeholder for the embedding matrix of shape vocab_size * embedding_size.
In [213]:
"TODO"
Out[213]:
Step 2: Putting the graph together. Use tf.nn.embedding_lookup to look up the word embeddings, giving a batch_size * sentence_length * embedding_size tensor, then average the word vectors to get a single embedding_size vector for each tweet. You should end up with a Tensor of dimensions batch_size * embedding_size. Feed this through a hidden layer with a tf.tanh nonlinearity, then project down to a single logit per tweet.
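As an aside, tf.nn.embedding_lookup(M, ids) behaves like numpy fancy indexing M[ids]; here is a tiny numpy analogy (an illustration, not the TF op itself):
In [ ]:
M = np.arange(12).reshape(4, 3)    # hypothetical 4-word vocab, 3-dim embeddings
ids = np.array([[0, 2], [3, 3]])   # a batch of 2 "tweets", 2 word ids each
print(M[ids].shape)                # (2, 2, 3): batch x sentence x embedding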
In [214]:
"TODO"
Out[214]:
In [215]:
## Make sure to name your output tensor logits, and your sentiments placeholder sentiments, in Python
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=sentiments)
loss = tf.reduce_sum(loss)
optimizer = tf.train.AdamOptimizer(1e-2).minimize(loss)
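Note that this op takes raw logits (pre-sigmoid scores), not probabilities. As a standalone numpy sanity check (an aside, not lab code), the numerically stable form TF computes, max(x, 0) - x*z + log(1 + exp(-|x|)), matches the textbook cross entropy:
In [ ]:
x, z = 2.0, 1.0  # logit and label
p = 1. / (1. + np.exp(-x))                                # sigmoid(x)
naive = -(z * np.log(p) + (1 - z) * np.log(1 - p))        # textbook cross entropy
stable = max(x, 0) - x * z + np.log(1 + np.exp(-abs(x)))  # TF's stable form
print(naive, stable)  # both ~0.127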
In [184]:
trainSet = p.load( open('data/trainTweets_preprocessed.p','rb'))
random.shuffle(trainSet)
" TODO Init vars"
losses = []
for i in range(5000):
    trainTweet = np.array([t[0] for t in trainSet[i: i + minibatch_size]])
    trainLabels = np.array([int(t[1]) for t in trainSet[i: i + minibatch_size]])
    results = "TODO, run graph with data"
    losses.append(results['loss'])
    if i % 500 == 0:
        print("Iteration", i, "Loss", sum(losses[-500:]) / 500. if i > 0 else losses[-1])
In [186]:
validationSet = p.load( open('data/devTweets_preprocessed.p','rb'))
random.shuffle(validationSet)
losses = []
for i in range(20000 // 20):
    valTweet = np.array([t[0] for t in validationSet[i * minibatch_size: (i + 1) * minibatch_size]])
    valLabels = np.array([int(t[1]) for t in validationSet[i * minibatch_size: (i + 1) * minibatch_size]])
    results = "TODO"
    losses.append(results['loss'])
print("Dev Loss", sum(losses) * 1. / len(losses))
Things to try on your own:
In [212]:
# Step 1:
tf.reset_default_graph()
session = tf.Session()
minibatch_size = 20
tweet_length = 20
embedding_size = 100
hidden_dim_size = 100
output_size = 1
init_bias = 0
tweets = tf.placeholder(tf.int32, shape=[minibatch_size,tweet_length])
sentiments = tf.placeholder(tf.float32, shape=[minibatch_size])
embeddingMatrix = tf.placeholder(tf.float32, shape =[vocab_size, embedding_size] )
W_hidden = tf.get_variable("W_hidden", [embedding_size, hidden_dim_size], tf.float32, tf.random_normal_initializer(stddev=1.0 / hidden_dim_size))
b_hidden = tf.get_variable("b_hidden", [hidden_dim_size], initializer=tf.constant_initializer(init_bias))
W_output = tf.get_variable("W_output", [hidden_dim_size, output_size], tf.float32, tf.random_normal_initializer(stddev=1.0 / hidden_dim_size))
b_output = tf.get_variable("b_output", [output_size], initializer=tf.constant_initializer(init_bias))
# Step 2:
tweet_embedded = tf.nn.embedding_lookup(embeddingMatrix, tweets)
averagedTweets = tf.reduce_mean(tweet_embedded, axis=1)
hidden_proj = tf.matmul( averagedTweets, W_hidden) + b_hidden
non_linearity = tf.nn.tanh(hidden_proj)
logits = tf.matmul( non_linearity, W_output)+ b_output
logits = tf.reshape(logits, shape=[minibatch_size])
## Make sure to name your output tensor logits, and your sentiments placeholder sentiments, in Python
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=sentiments)
loss = tf.reduce_sum(loss)
optimizer = tf.train.AdamOptimizer().minimize(loss)
# Step 3:
trainSet = p.load( open('data/trainTweets_preprocessed.p','rb'))
random.shuffle(trainSet)
tf.global_variables_initializer().run(session=session)
losses = []
for i in range(5000):
    trainTweet = np.array([t[0] for t in trainSet[i: i + minibatch_size]])
    trainLabels = np.array([int(t[1]) for t in trainSet[i: i + minibatch_size]])
    feed_dict = {
        embeddingMatrix: word_embeddings,
        tweets: trainTweet,
        sentiments: trainLabels
    }
    fetch = {
        'loss': loss,
        'trainOp': optimizer
    }
    results = session.run(fetch, feed_dict=feed_dict)
    losses.append(results['loss'])
    if i % 500 == 0:
        print("Iteration", i, "Loss", sum(losses[-500:]) / 500. if i > 0 else losses[-1])
# Step 4:
validationSet = p.load( open('data/devTweets_preprocessed.p','rb'))
random.shuffle(validationSet)
losses = []
for i in range(20000 // 20):
    valTweet = np.array([t[0] for t in validationSet[i * minibatch_size: (i + 1) * minibatch_size]])
    valLabels = np.array([int(t[1]) for t in validationSet[i * minibatch_size: (i + 1) * minibatch_size]])
    feed_dict = {
        embeddingMatrix: word_embeddings,
        tweets: valTweet,
        sentiments: valLabels
    }
    fetch = {
        'loss': loss,
    }
    results = session.run(fetch, feed_dict=feed_dict)
    losses.append(results['loss'])
print("Dev Loss", sum(losses) * 1. / len(losses))
In [ ]: