Predict Sentiment from Movie Reviews

IMDB Movie Review Sentiment Problem Description

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly polar movie reviews (good or bad) for training and another 25,000 for testing. The problem is to determine whether a given movie review has a positive or negative sentiment.

The data was collected by Stanford researchers and was used in a 2011 paper in which a 50/50 train/test split of the data was used, achieving an accuracy of 88.89%. The dataset was also used in a Kaggle competition titled “Bag of Words Meets Bags of Popcorn” from late 2014 to early 2015, where accuracies above 97% were achieved, with the winners reaching 99%.


In [1]:
import numpy
from keras.datasets import imdb
from matplotlib import pyplot
# load the dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data()
X = numpy.concatenate((X_train, X_test), axis=0)
y = numpy.concatenate((y_train, y_test), axis=0)


Using TensorFlow backend.

In [2]:
# summarize size
print("Training data: ")
print(X.shape)
print(y.shape)

# Summarize number of classes
print("Classes: ")
print(numpy.unique(y))

# Summarize number of words
print("Number of words: ")
print(len(numpy.unique(numpy.hstack(X))))


Data: 
(50000,)
(50000,)
Classes: 
[0 1]
Number of words: 
88585
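
Each review is stored as a sequence of integer word indices rather than text. To see what a review actually says, the word index can be inverted. The following is a minimal sketch, assuming the default start_char=1, oov_char=2 and index_from=3 offsets applied by imdb.load_data:

# invert the word index and decode the first review
# (assumes the default start_char=1, oov_char=2, index_from=3 of imdb.load_data)
word_index = imdb.get_word_index()
index_to_word = dict((index + 3, word) for word, index in word_index.items())
index_to_word[0] = "<PAD>"
index_to_word[1] = "<START>"
index_to_word[2] = "<UNK>"
print(" ".join(index_to_word.get(i, "<UNK>") for i in X[0]))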

In [3]:
# Summarize review length
print("Review length: ")
result = [len(x) for x in X]
print("Mean %.2f words (%f)" % (numpy.mean(result), numpy.std(result)))
# plot review length
pyplot.boxplot(result)
pyplot.show()


Review length: 
Mean 234.76 words (172.911495)

Word Embeddings

A recent breakthrough in the field of natural language processing is called word embedding.

This is a technique where words are encoded as real-valued vectors in a high dimensional space, where the similarity between words in terms of meaning translates to closeness in the vector space.

Discrete words are mapped to vectors of continuous numbers. This is useful when working on natural language problems with neural networks and deep learning models, as we require numbers as input.

Keras provides a convenient way to convert positive integer representations of words into a word embedding via an Embedding layer.

The layer takes arguments that define the mapping, including the maximum number of expected words, also called the vocabulary size (i.e. the largest word index that will be seen). The layer also lets you specify the dimensionality of each word vector, called the output dimension.
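
As a minimal sketch, an Embedding layer with a vocabulary of 5,000 words, 32-dimensional word vectors and 500-word input sequences would look as follows (the numbers are illustrative):

# standalone Embedding layer: vocabulary size 5000, output dimension 32,
# input sequences of 500 word indices
from keras.models import Sequential
from keras.layers.embeddings import Embedding

embedding_model = Sequential()
embedding_model.add(Embedding(5000, 32, input_length=500))
# each 500-integer input is mapped to a 500 x 32 matrix of word vectors
print(embedding_model.output_shape)  # (None, 500, 32)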

We would like to use a word embedding representation for the IMDB dataset.

Let’s say that we are only interested in the first 5,000 most used words in the dataset. Therefore our vocabulary size will be 5,000. We can choose to use a 32-dimension vector to represent each word. Finally, we may choose to cap the maximum review length at 500 words, truncating reviews longer than that and padding reviews shorter than that with 0 values.
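
To see how this truncation and padding behaves, here is a toy sketch of pad_sequences with illustrative numbers (not the IMDB data); by default shorter sequences are pre-padded with zeros and longer ones are truncated from the front:

from keras.preprocessing import sequence
# shorter than maxlen: padded with leading zeros
print(sequence.pad_sequences([[1, 2, 3]], maxlen=5))           # [[0 0 1 2 3]]
# longer than maxlen: truncated from the front
print(sequence.pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5))  # [[2 3 4 5 6]]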

We would load the IMDB dataset as follows:

imdb.load_data(nb_words=5000)

Simple Multi-Layer Perceptron Model for the IMDB Dataset


In [4]:
# MLP for the IMDB problem
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

In [5]:
# load the dataset but only keep the top n words
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)

In [6]:
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

In [7]:
X_train[4,:]


Out[7]:
array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    1,  249, 1323,    7,   61,  113,   10,   10,   13, 1637,
         14,   20,   56,   33, 2401,   18,  457,   88,   13, 2626, 1400,
         45, 3171,   13,   70,   79,   49,  706,  919,   13,   16,  355,
        340,  355, 1696,   96,  143,    4,   22,   32,  289,    7,   61,
        369,   71, 2359,    5,   13,   16,  131, 2073,  249,  114,  249,
        229,  249,   20,   13,   28,  126,  110,   13,  473,    8,  569,
         61,  419,   56,  429,    6, 1513,   18,   35,  534,   95,  474,
        570,    5,   25,  124,  138,   88,   12,  421, 1543,   52,  725,
          2,   61,  419,   11,   13, 1571,   15, 1543,   20,   11,    4,
          2,    5,  296,   12, 3524,    5,   15,  421,  128,   74,  233,
        334,  207,  126,  224,   12,  562,  298, 2167, 1272,    7, 2601,
          5,  516,  988,   43,    8,   79,  120,   15,  595,   13,  784,
         25, 3171,   18,  165,  170,  143,   19,   14,    5,    2,    6,
        226,  251,    7,   61,  113], dtype=int32)

In [8]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Flatten())
model.add(Dense(30, activation='relu'))
model.add(Dense(30, activation='relu'))
model.add(Dense(30, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_1 (Embedding)          (None, 500, 32)       160000      embedding_input_1[0][0]          
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 16000)         0           embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 30)            480030      flatten_1[0][0]                  
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 30)            930         dense_1[0][0]                    
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 30)            930         dense_2[0][0]                    
____________________________________________________________________________________________________
dense_4 (Dense)                  (None, 1)             31          dense_3[0][0]                    
====================================================================================================
Total params: 641921
____________________________________________________________________________________________________
None

In [9]:
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=4, batch_size=128, verbose=1)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))


Train on 25000 samples, validate on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 17s - loss: 0.4989 - acc: 0.7193 - val_loss: 0.3300 - val_acc: 0.8574
Epoch 2/4
25000/25000 [==============================] - 17s - loss: 0.1990 - acc: 0.9240 - val_loss: 0.3204 - val_acc: 0.8706
Epoch 3/4
25000/25000 [==============================] - 13s - loss: 0.0796 - acc: 0.9752 - val_loss: 0.3967 - val_acc: 0.8659
Epoch 4/4
25000/25000 [==============================] - 12s - loss: 0.0208 - acc: 0.9954 - val_loss: 0.5222 - val_acc: 0.8627
Accuracy: 86.27%
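
To score a new review with the trained MLP, the text must be encoded with the same word index, offsets and padding used for training. The following is a rough sketch; the whitespace tokenization and the example review are purely illustrative, and the default start_char=1, oov_char=2 and index_from=3 offsets of imdb.load_data are assumed:

# encode raw text the same way imdb.load_data encodes reviews, then predict
word_index = imdb.get_word_index()

def encode_review(text, top_words=5000, max_words=500):
    # shift raw indices by 3 to match load_data; unknown or out-of-vocabulary
    # words map to the oov index 2
    indices = [word_index.get(w, -1) + 3 for w in text.lower().split()]
    indices = [1] + [i if 2 < i < top_words else 2 for i in indices]
    return sequence.pad_sequences([indices], maxlen=max_words)

# output close to 1.0 means positive sentiment, close to 0.0 means negative
print(model.predict(encode_review("a wonderful film with a great cast")))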

One-Dimensional Convolutional Neural Network Model for the IMDB Dataset


In [10]:
# CNN for the IMDB problem
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Convolution1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

In [11]:
# load the dataset but only keep the top n words
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)
# pad dataset to a maximum review length in words
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

In [12]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_2 (Embedding)          (None, 500, 32)       160000      embedding_input_2[0][0]          
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 500, 32)       3104        embedding_2[0][0]                
____________________________________________________________________________________________________
maxpooling1d_1 (MaxPooling1D)    (None, 250, 32)       0           convolution1d_1[0][0]            
____________________________________________________________________________________________________
flatten_2 (Flatten)              (None, 8000)          0           maxpooling1d_1[0][0]             
____________________________________________________________________________________________________
dense_5 (Dense)                  (None, 250)           2000250     flatten_2[0][0]                  
____________________________________________________________________________________________________
dense_6 (Dense)                  (None, 1)             251         dense_5[0][0]                    
====================================================================================================
Total params: 2163605
____________________________________________________________________________________________________
None

In [13]:
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=2, batch_size=128, verbose=1)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))


Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 226s - loss: 0.4457 - acc: 0.7574 - val_loss: 0.3042 - val_acc: 0.8727
Epoch 2/2
25000/25000 [==============================] - 251s - loss: 0.2287 - acc: 0.9108 - val_loss: 0.2828 - val_acc: 0.8828
Accuracy: 88.28%
