In [190]:
import cPickle as p
import numpy as np
import tensorflow as tf
import word2vec
import random

NLP Lab, Part I

Welcome to the first lab of 6.S191!

Administrivia

Things to install:

Lab Objectives:

  • Learn Machine Learning methodology basics (train/dev/test sets)
  • Learn some Natural Language Processing basics (word embeddings with word2vec)
  • Learn the basics of tensorflow, build your first deep neural nets (MLP, RNN) and get results!

And we'll be doing all of this in the context of Twitter sentiment analysis. Given a tweet like:

omg 6.S196 is so cool #deeplearning #mit

We want an algorithm to label this tweet as positive or negative. It's intractable to solve this task via hand-written lexical rules, so instead we're going to use deep learning to embed these tweets into some deep latent space where distinguishing between the two is relatively simple.

Machine Learning Basics

Given some dataset with tweets $X$, and sentiments $Y$, we want to learn a function $f$ such that $Y = f(X)$. In our context, $f$ is a deep neural network parameterized by some network weights $\Theta$, and we're going to do our learning via gradient descent.
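
Concretely, gradient descent means repeatedly nudging the parameters downhill on a loss $L$ (defined in the next section), with some learning rate $\eta$:

$$\Theta \leftarrow \Theta - \eta \frac{dL}{d\Theta}$$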

Objective Function

To start, we need some way to measure how good our $f$ is, so we can take a gradient with respect to that performance measure and move in the right direction. We call this performance evaluation our Loss function, $L$, and this is something we want to minimize.

Since we are doing classification (pos vs neg), a common loss function to use is cross entropy. $$L( \Theta ) = - \sum_i \left( y_i \log f(x_i) + (1-y_i) \log (1-f(x_i)) \right) $$ where $f(x_i)$ is the probability that tweet $x_i$ is positive, which we want to be 1 given that it's positive and 0 given that it's negative, and $y_i$ is the correct label. We can access this function with tf.nn.sigmoid_cross_entropy_with_logits, which will come in handy in the code. Given that $f$ is parameterized by $\Theta$, we can take the gradient $\frac{dL}{d\Theta}$, and we learn by updating our parameters to minimize the loss.

Note that this loss is 0 if the prediction is correct with full confidence, and very large if we predict something has 0 probability of being positive when it is positive.
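
To make the formula concrete, here is a minimal numpy sketch (ours, not part of the lab code) that evaluates the loss on a few toy logits and labels; tf.nn.sigmoid_cross_entropy_with_logits computes the same per-example quantity inside the graph.

In [ ]:
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

logits = np.array([2.0, -1.0, 0.5])   # hypothetical raw network outputs
labels = np.array([1.0,  0.0, 1.0])   # correct sentiments y

probs = sigmoid(logits)               # f(x), the predicted probability of "positive"
loss = -np.sum(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print(loss)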

Methodology

To measure how well we're doing, we can't just look at how well our model performs on its training data. It could just be memorizing the training data, and perform terribly on data it hasn't seen before. To really measure how $f$ performs in the wild, we need to present it with unseen data, and we can tune our hyper-parameters (like learning rate, number of layers, etc.) over this first unseen set, which we call our development (or validation) set. However, given that we optimized our hyper-parameters to the development set, to get a truly fair assessment of the model we test it on a held-out test set at the end, and generally report those numbers.

In summary: we train on one set (the training set), evaluate and tune our hyper-parameters based on our performance on the dev set, and report final results on a completely held-out test set.

Let's load these now; this ratio of sizes is fairly standard.
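
For intuition, here is a minimal sketch (ours; the data/*.p files below were prepared for you, so you don't need to run this) of producing such a split from a hypothetical list allExamples of (tweet, sentiment) pairs:

In [ ]:
import random

## hypothetical full dataset of (tweet, sentiment) pairs
allExamples = [(['some', 'tokens'], 1)] * 100000

random.shuffle(allExamples)
n = len(allExamples)
trainSet = allExamples[: int(0.6 * n)]             # 60% train
devSet   = allExamples[int(0.6 * n): int(0.8 * n)] # 20% dev
testSet  = allExamples[int(0.8 * n):]              # 20% test
print(len(trainSet), len(devSet), len(testSet))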


In [202]:
trainSet = p.load( open('data/train.p','rb'))
devSet = p.load( open('data/dev.p','rb'))
testSet = p.load( open('data/test.p','rb'))

## Let's look at the size of what we have here. Note, we could use a much larger train set, 
## but we keep it mid-size so you can run this whole thing off your laptop
len(trainSet), len(devSet), len(testSet)


Out[202]:
(60000, 20000, 20000)

NLP Basics

The first question we need to address is how we represent a tweet, and how we represent a word. One way to do this is with a one-hot vector for each word, where a given word $w_i= [0,0,...,1,...,0]$ has a 1 in its own position and 0s everywhere else. However, in this representation, words like "love" and "adore" are as similar as "love" and "hate", because the cosine similarity between any two distinct one-hot vectors is 0. Another issue is that these vectors are huge, since they need one dimension per word in the vocab. To get around these issues, the NLP community developed a technique called Word Embeddings.
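
To see the problem concretely, here is a tiny sketch (ours) with a 4-word vocabulary:

In [ ]:
import numpy as np

# one-hot vectors for a toy 4-word vocab
love  = np.array([1., 0., 0., 0.])
adore = np.array([0., 1., 0., 0.])
hate  = np.array([0., 0., 1., 0.])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# both pairs have cosine similarity 0, so "adore" looks no closer to "love" than "hate" does
print(cosine(love, adore), cosine(love, hate))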

Word2Vec

The basic idea is that we represent a word with a vector $\phi$ based on the contexts the word appears in. By training a neural network to predict the context of words across a large training set, we can use the weights of that neural network to get a dense and useful representation that captures context. Word Embeddings capture all kinds of useful semantic relationships. For example, one cool emergent property is $ \phi(king) - \phi(queen) = \phi(man) - \phi(woman)$. To learn more about the magic behind word embeddings, we recommend Chris Olah's blog post. A common tool for generating Word Embeddings is word2vec, which is what we'll be using today.
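
Once w2vModel is trained and loaded below, you can probe this property yourself. Here is a minimal sketch (ours) using only attributes this lab already uses (w2vModel['word'], w2vModel.vectors, w2vModel.vocab); note that the analogy words must actually appear in our small tweet vocabulary:

In [ ]:
import numpy as np

# build the analogy query vector and normalize it
query = w2vModel['king'] - w2vModel['man'] + w2vModel['woman']
query = query / np.linalg.norm(query)

# normalize every row in case the loaded vectors aren't unit length already
vecs = w2vModel.vectors / np.linalg.norm(w2vModel.vectors, axis=1, keepdims=True)

# cosine similarity to every vocab word; print the 5 nearest neighbors
sims = np.dot(vecs, query)
print(w2vModel.vocab[np.argsort(sims)[-5:]])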


In [206]:
## Note: these tweets were preprocessed to remove non-alphanumeric chars, replace infrequent words, and pad to the same length.
## Note: we're going to train our embeddings on only our train set in order to keep our dev/test sets fair 
trainSentences = [" ".join(tweetPair[0]) for tweetPair in trainSet]
print trainSentences[0]
p.dump(trainSentences, open('data/trainSentences.p','wb'))
## Word2vec module expects a file containing a list of strings, a target to store the model, and then the size of the
## embedding vector
word2vec.word2vec('data/trainSentences.p','data/emeddings.bin', 100, verbose=True)

w2vModel = word2vec.load('data/emeddings.bin')
print w2vModel.vocab


finally back twitterberry messed up my phone lets try this one out padtoken padtoken padtoken padtoken padtoken padtoken padtoken padtoken
Starting training using file data/trainSentences.p
Vocab size: 7597
Words in train file: 1326930
Alpha: 0.000042  Progress: 99.99%  Words/thread/sec: 503.89k  [u'</s>' u'padtoken' u'unktoken' ..., u'fart' u'nonstop' u'gue']

In [207]:
## Each word is represented by a 100-dimensional vector, like this
print "embedding for the word fun", w2vModel['fun']


embedding for the word fun [ 0.02739932 -0.06195362 -0.02174884 -0.16449896  0.0113832   0.05549333
  0.15455414 -0.15823498  0.00534459  0.14625058  0.19411038  0.06777722
 -0.12640333 -0.01663971  0.00228494 -0.04999322  0.21230859  0.11675727
 -0.02723708  0.090425   -0.07684573 -0.03013001  0.15054527  0.19291012
  0.15725572  0.03772355 -0.03657226 -0.0596232   0.02194676  0.00029824
 -0.00298259 -0.0167528  -0.00211832  0.09068366 -0.06469334  0.01311877
 -0.00739915 -0.07736088  0.10405199  0.0827125   0.07532453 -0.05642802
  0.01675165 -0.1692307   0.00315937  0.04331524 -0.03228528  0.00192614
  0.15636554 -0.10258818 -0.03600844 -0.17567421  0.04040479  0.00313568
  0.13499634 -0.0862968   0.11669343 -0.09716116  0.11241419  0.00813499
 -0.18653914 -0.10887115 -0.07146084  0.05805456 -0.07779012  0.01769644
  0.09540395  0.05453155  0.13957299  0.11317214 -0.07434633 -0.00324123
  0.15098293 -0.07858422  0.08103402  0.08550779 -0.06233279 -0.05078358
  0.00469909  0.14089508  0.05289229 -0.13117778  0.06547305  0.17320769
  0.14908808  0.0870463   0.09451518  0.08013666  0.18811567 -0.05976165
  0.11511379  0.11410256 -0.10913139 -0.04317797 -0.05894752 -0.04362189
 -0.03782897  0.15169522 -0.18289736  0.19401944]

Now let's look at the words most similar to the word "fun".


In [208]:
indices, cosineSim = w2vModel.cosine('fun')
print w2vModel.vocab[indices]

word_embeddings = w2vModel.vectors
vocab_size = len(w2vModel.vocab)


[u'nice' u'great' u'lovely' u'busy' u"fun'" u'wonderful' u'enjoyable'
 u'boring' u'storming' u'gr8']

Feel free to play around here and test the properties of your embeddings, how they cluster, etc. In the interest of time, we're going to move straight on to models.

Now in order to use these embeddings, we have to represent each tweet as a list of indices into the embedding matrix. This preprocessing code is available in processing.py if you are interested.
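
For intuition, here is a minimal sketch (ours; the lab's real version lives in processing.py) of that conversion, mapping unknown words to the 'unktoken' placeholder seen in the vocabulary above:

In [ ]:
## build a word -> index lookup from the embedding model's vocabulary
wordToIndex = {w: i for i, w in enumerate(w2vModel.vocab)}

def tweetToIndices(tokens):
    # fall back to the 'unktoken' id for words outside the vocabulary
    unk = wordToIndex['unktoken']
    return [wordToIndex.get(w, unk) for w in tokens]

print(tweetToIndices(['omg', 'this', 'is', 'fun']))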

Tensorflow Basics

Tensorflow is a hugely popular library for building neural nets. The general workflow for building models in tensorflow is as follows:

  • Specify a computation graph (the structure and computations of your neural net)
  • Use your session to feed data into the graph and fetch things from the graph (like the loss, and the train operation)

Inside the graph we put our neural net, our loss function, and our optimizer; once this is constructed, we can feed in the data, and fetch the loss and the train op to train it.

Here is a toy example that multiplies 2 by 2 and initializes a random weight matrix.


In [209]:
session = tf.Session()
# 1.BUILD GRAPH
# Set placeholders with a type for data you'll eventually feed in (like tweets and sentiments)
a = tf.placeholder(tf.int32)
b = tf.placeholder(tf.int32)
# Set up variables, like weight matrices. 
# Using tf.get_variable, specify the name, shape, type and initializer of the variable.
W = tf.get_variable("ExampleMatrix", [2, 2], tf.float32, tf.random_normal_initializer(stddev=1.0 / 2))
# Set up the operations you need, like matrix multiplications, non-linear layers, and your loss function minimizer
c = a*b
# 2.RUN GRAPH
# Initialize any variables you have (just W in this case)
tf.global_variables_initializer().run(session=session)
# Specify the values tensor you want returned, and ops you want run
fetch = {'c':c, 'W':W}
# Fill in the place holders
feed_dict = {
    a: 2,
    b: 2,
}
# Run and get back fetch filled in
results = session.run( fetch, feed_dict = feed_dict)

print( results['c'])
print( results['W'])
# Close session
session.close()
# Reset the graph so it doesn't get in the way later
tf.reset_default_graph()


4
[[-0.17049156 -0.24742639]
 [-0.84466755 -0.52191824]]

Building an MLP

MLP, or Multi-layer Perceptron, is a basic architecture where we multiply our representation by some matrix W, add some bias b, and then apply a nonlinearity like tanh at each layer, with each layer fully connected to the next. As the network gets deeper, its expressive power grows exponentially, so it can draw some pretty fancy decision boundaries. In this exercise, you'll build your own MLP with 1 hidden layer (a layer that is neither input nor output) of 100 dimensions.
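
Concretely, using the names we'll use in the solution code, the forward pass for one tweet representation $x$ is: $$h = \tanh(x W_{hidden} + b_{hidden}), \qquad z = h W_{output} + b_{output}$$ where the logit $z$ is what we'll feed into the sigmoid cross-entropy loss.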

To make training more stable and efficient, we'll actually evaluate 20 tweets at a time and take gradients with respect to the loss on those 20. We call this idea training with mini-batches.

Defining the Graph

Step 1: Placeholders, Variables with specified shapes

  • Let's start off with a placeholder for our tweets, and let's use a mini-batch size of 20. Remember, each tweet will be represented as a vector of sentence_length (20) word ids, and since we pack mini-batch-size tweets into the graph per iteration, we need a matrix of shape mini-batch size * sentence length. Feel free to check out the placeholder api here
  • Set up a placeholder for your labels, namely a mini-batch-size length vector of sentiments.
  • Set up a placeholder for our pretrained word embeddings. This will have shape vocab_size * embedding_size
  • Set up variables for your weight matrices and biases. Check out the variable api here. Let's use a hidden dimension size of 100 (so 100 neurons in the hidden layer). For the weight matrices, use tf.random_normal_initializer(stddev=1.0 / hidden_dim_size), as this does something called symmetry breaking and keeps the neural network from getting stuck at the start. For the bias vectors, use tf.constant_initializer(0)

In [213]:
"TODO"


Out[213]:
'TODO'

Step 2: Putting in the Operations

  • Load the word embeddings for the word ids. You can do this using tf.nn.embedding_lookup. Remember to use your embeddings placeholder. You should end up with a Tensor of dimensions batch_size * sentence_length * embedding_size.
  • To represent a whole tweet, let's use a neural bag of words. This means we represent each tweet by the words that occur in it; it's a basic representation but gets us pretty far. To do this in a neural way, we can just average the embeddings in the tweet, leaving a single vector of embedding_size for each tweet. You should end up with a Tensor of dimensions batch_size * embedding_size.
  • Apply a projection to the hidden layer of size 100 (i.e. multiply the input by a weight matrix and add a bias).
  • Apply a nonlinearity like tf.tanh.
  • Project this to the output layer of size 1 (i.e. multiply the input by a weight matrix and add a bias). Put the result in a python variable called logits.

In [214]:
"TODO"


Out[214]:
'TODO'
  • Set up the loss function, and an optimizer to minimize it. We'll be using Adam as our optimizer

In [215]:
## Make sure to name your output logits tensor logits, and your sentiments placeholder sentiments, in python
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits, sentiments)
loss = tf.reduce_sum(loss)
optimizer = tf.train.AdamOptimizer(1e-2).minimize(loss)

Run the Graph

Step 3: Set up training, and fetch optimizer at each iteration to train the model

  • First, initialize all variables as in the toy example
  • Sample 20 random (tweet, sentiment) pairs for our feed_dict dictionary. Remember to feed in the embedding matrix.
  • Build the fetch dictionary with the ops we want to run and the tensors we want back
  • Execute this many times to train

In [184]:
trainSet = p.load( open('data/trainTweets_preprocessed.p','rb'))
random.shuffle(trainSet)

" TODO Init vars"

losses = []
for i in range(5000):
    trainTweet = np.array(  [ t[0] for t in trainSet[i: i+ minibatch_size]])
    trainLabels = np.array( [int(t[1]) for t in trainSet[i: i+ minibatch_size] ])
    
    results = "TODO, run graph with data"
    losses.append(results['loss'])
    if i % 500 == 0:
        print("Iteration",i,"Loss", sum(losses[-500:-1])/500. if i > 0 else losses[-1])


('Iteration', 0, 'Loss', 13.86216)
('Iteration', 500, 'Loss', 11.728893522262574)
('Iteration', 1000, 'Loss', 11.927853281974793)
('Iteration', 1500, 'Loss', 11.440789708137512)
('Iteration', 2000, 'Loss', 11.733890412330627)
('Iteration', 2500, 'Loss', 11.876432689666748)
('Iteration', 3000, 'Loss', 11.315906629562377)
('Iteration', 3500, 'Loss', 11.482173993110656)
('Iteration', 4000, 'Loss', 11.288390522003175)
('Iteration', 4500, 'Loss', 12.245812825202941)

Step 4: Check validation results, and tune

  • Try running the graph on validation data, without fetching the train op.
  • See how the results compare. If the train loss is much lower than the development loss, we may be overfitting. If the train loss is still high, try experimenting with the model architecture to increase its capacity.

In [186]:
validationSet = p.load( open('data/devTweets_preprocessed.p','rb'))
random.shuffle(validationSet)

losses = []
for i in range(20000/20):
    valTweet = np.array(  [ t[0] for t in validationSet[i: i+ minibatch_size]])
    valLabels = np.array( [int(t[1]) for t in validationSet[i: i+ minibatch_size] ])

    results = "TODO" 
    losses.append(results['loss'])
print("Dev Loss", sum(losses)*1./len(losses))


('Dev Loss', 13.220003290176392)

Future Steps:

Things to try on your own:

  • Add a tensor for accuracy, and log it at each step (see the sketch below).
  • Iterate over the whole validation dataset to get a more stable validation score.
  • Try tensorboard and graph accuracy over time for both sets.
  • Experiment with different architectures to maximize the validation score. Maybe bag of words, which doesn't distinguish between "bad not good" and "good not bad", isn't a good enough representation.
  • Test it on the test data.
  • Do the RNN tutorial!

Solutions!

Do not look unless you really have to. Ask TAs for help first. Fight for the intuition, you'll get more out of it.
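
As a head start on the first item in the list above, here is a minimal sketch (ours) of an accuracy tensor; it assumes the logits and sentiments tensors defined in the solution below.

In [ ]:
# threshold the predicted probabilities at 0.5 and compare to the labels
predictions = tf.cast(tf.sigmoid(logits) > 0.5, tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predictions, sentiments), tf.float32))
# then add 'accuracy': accuracy to the fetch dictionary and log it each iteration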

In [212]:
# Step 1:
tf.reset_default_graph()
session = tf.Session()


minibatch_size = 20
tweet_length = 20
embedding_size = 100
hidden_dim_size = 100
output_size = 1
init_bias = 0

tweets          = tf.placeholder(tf.int32, shape=[minibatch_size,tweet_length])
sentiments      = tf.placeholder(tf.float32, shape=[minibatch_size])
embeddingMatrix = tf.placeholder(tf.float32, shape =[vocab_size, embedding_size] )
W_hidden = tf.get_variable("W_hidden", [embedding_size, hidden_dim_size], tf.float32, tf.random_normal_initializer(stddev=1.0 / hidden_dim_size))
b_hidden = tf.get_variable("b_hidden", [hidden_dim_size], initializer=tf.constant_initializer(init_bias))
W_output = tf.get_variable("W_output", [hidden_dim_size, output_size], tf.float32, tf.random_normal_initializer(stddev=1.0 / hidden_dim_size))
b_output = tf.get_variable("b_output", [output_size], initializer=tf.constant_initializer(init_bias))

# Step 2:
tweet_embedded =  tf.nn.embedding_lookup(embeddingMatrix, tweets)
averagedTweets = tf.reduce_mean(tweet_embedded, axis=1)
hidden_proj = tf.matmul( averagedTweets, W_hidden) + b_hidden
non_linearity = tf.nn.tanh(hidden_proj)
logits = tf.matmul( non_linearity,  W_output)+ b_output
logits = tf.reshape(logits, shape=[minibatch_size])

## Make sure to name your output logits tensor logits, and your sentiments placeholder sentiments, in python
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits, sentiments)
loss = tf.reduce_sum(loss)
optimizer = tf.train.AdamOptimizer().minimize(loss)

# Step 3:
trainSet = p.load( open('data/trainTweets_preprocessed.p','rb'))
random.shuffle(trainSet)

tf.global_variables_initializer().run(session=session)

losses = []
for i in range(5000):
    trainTweet = np.array(  [ t[0] for t in trainSet[i: i+ minibatch_size]])
    trainLabels = np.array( [int(t[1]) for t in trainSet[i: i+ minibatch_size] ])
    
    feed_dict = {
        embeddingMatrix: word_embeddings,
        tweets: trainTweet,
        sentiments: trainLabels
    }
    fetch = {
        'loss': loss,
        'trainOp': optimizer
    }
    results = session.run(fetch, feed_dict=feed_dict)
    losses.append(results['loss'])
    if i % 500 == 0:
        print("Iteration",i,"Loss", sum(losses[-500:-1])/500. if i > 0 else losses[-1])
    

# Step 4:
validationSet = p.load( open('data/devTweets_preprocessed.p','rb'))
random.shuffle(validationSet)

losses = []
for i in range(20000/20):
    valTweet = np.array(  [ t[0] for t in validationSet[i: i+ minibatch_size]])
    valLabels = np.array( [int(t[1]) for t in validationSet[i: i+ minibatch_size] ])
    feed_dict = {
        embeddingMatrix: word_embeddings,
        tweets: valTweet,
        sentiments: valLabels
    }
    fetch = {
        'loss': loss,
    }
    results = session.run(fetch, feed_dict=feed_dict)
    losses.append(results['loss'])
print("Dev Loss", sum(losses)*1./len(losses))


('Iteration', 0, 'Loss', 13.865759)
('Iteration', 500, 'Loss', 13.850158506393432)
('Iteration', 1000, 'Loss', 13.583849094390869)
('Iteration', 1500, 'Loss', 12.831425987243652)
('Iteration', 2000, 'Loss', 12.317071950912476)
('Iteration', 2500, 'Loss', 11.794303876876832)
('Iteration', 3000, 'Loss', 12.680396286010742)
('Iteration', 3500, 'Loss', 12.530222269058228)
('Iteration', 4000, 'Loss', 12.272984921455384)
('Iteration', 4500, 'Loss', 12.515208539962769)
('Dev Loss', 12.61152536869049)

In [ ]: