LSTM for Sentiment Analysis

We attempt to do some sentiment analysis with a dataset provided by the University of Michigan for the Kaggle UMICH SI650 - Sentiment Classification competition. We will only use the training data, since the test data is unlabeled and we cannot run an evaluation locally.

Setup Imports


In [1]:
from __future__ import division, print_function
from keras.layers.core import Dense, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.preprocessing import sequence
from sklearn.cross_validation import train_test_split
import collections
import nltk
import numpy as np


Using Theano backend.

Generate vocabulary from training data


In [2]:
maxlen = 0
word_freqs = collections.Counter()
num_recs = 0
ftrain = open("../data/umich-sentiment-train.txt", "rb")
for line in ftrain:
    label, sentence = line.strip().split("\t")
    words = nltk.word_tokenize(sentence.decode("ascii", "ignore").lower())
    if len(words) > maxlen:
        maxlen = len(words)
    for word in words:
        word_freqs[word] += 1
    num_recs += 1
ftrain.close()

# print some statistics about our data, that will drive our parameters
print("maxlen: %d, vocab size: %d" % (maxlen, len(word_freqs)))


maxlen: 42, vocab size: 2313

In [3]:
MAX_FEATURES = 2000
MAX_SENTENCE_LENGTH = 40

In [4]:
# special words: UNK = -1, PAD = 0
vocab = {"UNK": -1, "PAD": 0}
reverse_vocab = {v:k for k, v in vocab.items()}
for idx, word in enumerate([w[0] for w in word_freqs.most_common(MAX_FEATURES - 1)]):
    vocab[word] = idx + 1
    reverse_vocab[idx + 1] = word

Convert sentences to token sequences


In [5]:
X = np.empty((num_recs, ), dtype=list)
y = np.zeros((num_recs, ))
i = 0
ftrain = open("../data/umich-sentiment-train.txt", "rb")
for line in ftrain:
    label, sentence = line.strip().split("\t")
    words = nltk.word_tokenize(sentence.decode("ascii", "ignore").lower())
    seqs = []
    for word in words:
        seqs.append(vocab.get(word, -1))
    X[i] = seqs
    y[i] = int(label)
    i += 1
ftrain.close()

X = sequence.pad_sequences(X, maxlen=MAX_SENTENCE_LENGTH)

Split input into training and test


In [6]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=0)
print(Xtrain.shape, Xtest.shape, ytrain.shape, ytest.shape)


(4960, 40) (2126, 40) (4960,) (2126,)

Build Model

Note that our last layer is a single Dense node, since we want a score for our sentiment.


In [7]:
model = Sequential()
model.add(Embedding(MAX_FEATURES, 128, input_length=MAX_SENTENCE_LENGTH, dropout=0.2))
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))
model.add(Dense(1))
model.add(Activation("sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [8]:
model.fit(Xtrain, ytrain, batch_size=32, nb_epoch=10, validation_data=(Xtest, ytest))


Train on 4960 samples, validate on 2126 samples
Epoch 1/10
4960/4960 [==============================] - 13s - loss: 0.3371 - acc: 0.8506 - val_loss: 0.0905 - val_acc: 0.9704
Epoch 2/10
4960/4960 [==============================] - 12s - loss: 0.0991 - acc: 0.9591 - val_loss: 0.0409 - val_acc: 0.9854
Epoch 3/10
4960/4960 [==============================] - 13s - loss: 0.0851 - acc: 0.9643 - val_loss: 0.0542 - val_acc: 0.9779
Epoch 4/10
4960/4960 [==============================] - 13s - loss: 0.0666 - acc: 0.9734 - val_loss: 0.0428 - val_acc: 0.9849
Epoch 5/10
4960/4960 [==============================] - 13s - loss: 0.0646 - acc: 0.9738 - val_loss: 0.0348 - val_acc: 0.9864
Epoch 6/10
4960/4960 [==============================] - 13s - loss: 0.0505 - acc: 0.9817 - val_loss: 0.0404 - val_acc: 0.9854
Epoch 7/10
4960/4960 [==============================] - 13s - loss: 0.0549 - acc: 0.9756 - val_loss: 0.0352 - val_acc: 0.9854
Epoch 8/10
4960/4960 [==============================] - 13s - loss: 0.0483 - acc: 0.9821 - val_loss: 0.0378 - val_acc: 0.9854
Epoch 9/10
4960/4960 [==============================] - 13s - loss: 0.0499 - acc: 0.9821 - val_loss: 0.0449 - val_acc: 0.9835
Epoch 10/10
4960/4960 [==============================] - 13s - loss: 0.0475 - acc: 0.9798 - val_loss: 0.0389 - val_acc: 0.9882
Out[8]:
<keras.callbacks.History at 0x115499390>

Evaluate Model


In [9]:
loss, accuracy = model.evaluate(Xtest, ytest, batch_size=32)
print("loss on test set: %.3f, accuracy: %.3f" % (loss, accuracy))


2126/2126 [==============================] - 1s     
loss on test set: 0.039, accuracy: 0.988

Predict sentiment on some random sentences


In [10]:
random_idxs = np.random.randint(0, Xtest.shape[0], 10)
for i in range(random_idxs.shape[0]):
    xtest = Xtest[random_idxs[i]].reshape(1, MAX_SENTENCE_LENGTH)
    ylabel = ytest[i]
    ypred = model.predict(xtest)[0][0]
    sent_pred = " ".join([reverse_vocab[x] for x in xtest[0].tolist() if x != 0])
    print("%.3f\t%d\t%s" % (ypred, ylabel, sent_pred))


1.000	0	da vinci code is awesome ! !
0.000	0	harry potter sucks ! !
0.000	1	the da vinci code sucked big time .
1.000	1	the da vinci code is awesome ! !
0.001	1	brokeback mountain is fucking horrible..
1.000	0	i love brokeback mountain !
1.000	0	because i would like to make friends who like the same things i like , and i really like harry potter , so i thought that joining a community like this would be a good start .
1.000	1	the da vinci code was awesome , i ca n't wait to read it ...
0.000	0	oh , and brokeback mountain was a terrible movie .
1.000	1	da vinci code is awesome ! !

In [ ]: