Five-Line Sentiment Analysis Classifier

In this notebook, I will explain how to develop sentiment analysis classifiers that are based on a bag-of-words model. Then, I will demonstrate how these classifiers can be utilized to solve Kaggle's "When Bag of Words Meets Bags of Popcorn" challenge.

Code Recipe: Creating Sentiment Classifier

With GraphLab, creating a sentiment classifier based on a bag-of-words model is straightforward. Given a dataset stored as a delimited text file (tab-separated in this example), you can construct a sentiment classifier using the following code:



In [ ]:

    
import graphlab as gl
train_data = gl.SFrame.read_csv(traindata_path,header=True, delimiter='\t',quote_char='"', column_type_hints = {'id':str, 'sentiment' : int, 'review':str } )
train_data['1grams features'] = gl.text_analytics.count_ngrams(train_data['review'],1)
train_data['2grams features'] = gl.text_analytics.count_ngrams(train_data['review'],2)
cls = gl.classifier.create(train_data, target='sentiment', features=['1grams features','2grams features'])

In the rest of this notebook, we will explain this code recipe in detail by demonstrating how it can be used to create an IMDB movie review sentiment classifier.

Set up

Before we begin constructing the classifiers, we need to import some Python libraries: graphlab (gl) and the IPython display utilities. We also set GraphLab Canvas to render its plots directly in this notebook.



In [2]:

    
import graphlab as gl
from IPython.display import display
from IPython.display import Image

gl.canvas.set_target('ipynb')

Dataset

Throughout this notebook, I will use Kaggle's IMDB movie reviews dataset, which is available for download at the following link: https://www.kaggle.com/c/word2vec-nlp-tutorial/data. I downloaded the labeledTrainData.tsv and testData.tsv files and unzipped them to the following local paths.



In [3]:

    
traindata_path = "/home/graphlab/data/sentiment/labeledTrainData.tsv"
testdata_path = "/home/graphlab/data/sentiment/testData.tsv"

Loading Data

We will load the IMDB movie reviews into an SFrame using the SFrame.read_csv function.



In [4]:

    
movies_reviews_data = gl.SFrame.read_csv(traindata_path,header=True, delimiter='\t',quote_char='"', column_type_hints = {'id':str, 'sentiment' : str, 'review':str } )

[INFO] Start server at: ipc:///tmp/graphlab_server-3660 - Server binary: /usr/local/lib/python2.7/dist-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1423667443.log
[INFO] GraphLab Server Version: 1.2.1
PROGRESS: Finished parsing file /home/graphlab/data/sentiment/labeledTrainData.tsv
PROGRESS: Parsing completed. Parsed 25000 lines in 0.809814 secs.

Using the SFrame show function, we can visualize the data and see that the training dataset consists of 12,500 positive and 12,500 negative reviews, of which 24,932 are unique.



In [5]:

    
movies_reviews_data.show()









    Out[5]:

Constructing Bag-of-Words Classifier

One common technique for document classification (and review classification) is the bag-of-words model, in which the frequency of each word in a document is used as a feature for training a classifier. GraphLab's text analytics toolkit makes it easy to calculate the frequency of each word in each review: we call the count_ngrams function with n=1, as in the following command:



In [6]:

    
movies_reviews_data['1grams features'] = gl.text_analytics.count_ngrams(movies_reviews_data['review'],1)

The last command created a new column in the movies_reviews_data SFrame. Each value in this column is a dictionary whose keys are the distinct words that appear in the corresponding review and whose values are the frequencies of those words. We can view the values of this new column using the following command.



In [7]:

    
movies_reviews_data.show(['review','1grams features'])









    Out[7]:
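
To make the dictionary-per-review structure concrete, here is a minimal plain-Python sketch that builds the same kind of word-frequency dictionary. It does not use GraphLab: count_unigrams is a hypothetical helper, and its tokenization (lowercasing, then splitting on anything that is not a letter, digit, or apostrophe) is my own assumption and may differ from what count_ngrams actually does.

```python
import re
from collections import Counter

def count_unigrams(text):
    """Return a dict mapping each word in `text` to its number of occurrences."""
    # Assumption: lowercase the text and treat runs of letters, digits and
    # apostrophes as words. GraphLab's own tokenizer may behave differently.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return dict(Counter(words))

# Two made-up reviews, used only for illustration
reviews = [
    "A great movie, great cast.",
    "Not a great movie.",
]
features = [count_unigrams(r) for r in reviews]
print(features[0])
```

Each review becomes one dictionary, so reviews of different lengths and vocabularies can all be stored in a single column.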

We are now ready to construct and evaluate the movie review sentiment classifier using the features calculated above. But first, to be able to evaluate the classifier quickly, we need labeled train and test datasets. We will create them by randomly splitting the labeled data into two parts: about 80% of the reviews will be used for training, and the remaining 20% or so will be used for testing. We will create these two datasets using the following command:



In [8]:

    
train_set, test_set = movies_reviews_data.random_split(0.8, seed=5)
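
For readers who want to see what a seeded random split amounts to, here is a small plain-Python sketch. It is not GraphLab's implementation: I am assuming each row is sent to the first part independently with the given probability, which would explain why the training log below reports 19,953 examples rather than exactly 20,000. The function and the integer stand-in rows are hypothetical, and seed=5 here will not reproduce GraphLab's split.

```python
import random

def random_split(rows, fraction, seed):
    """Split `rows` into two lists, sending each row to the first list
    with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(25000))  # stand-ins for the 25,000 labeled reviews
train, test = random_split(rows, 0.8, seed=5)
print(len(train) + len(test))  # every row lands in exactly one part
```

Fixing the seed matters because the same split is reused later to compare two classifiers on identical data.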

We are now ready to create a classifier using the following command:



In [9]:

    
model_1 = gl.classifier.create(train_set, target='sentiment', features=['1grams features'])

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 19953
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 69281
PROGRESS: Number of coefficients    : 69282
PROGRESS: Starting L-BFGS 
--------------------------------------------------------
PROGRESS:   Iter   Grad-Norm        Loss   Step size Elapsed time
PROGRESS:      0   2.360e+03   1.383e+04   1.000e-06        0.10s
PROGRESS:      1   1.640e+03   1.036e+04   1.000e+00        0.36s
PROGRESS:      2   2.035e+03   6.526e+03   1.000e+00        0.52s
PROGRESS:      3   3.788e+02   2.943e+03   1.000e+00        0.62s
PROGRESS:      4   2.096e+02   2.200e+03   1.000e+00        0.73s
PROGRESS:      5   3.327e+02   1.319e+03   1.000e+00        0.87s
PROGRESS:      6   7.930e+02   1.401e+03   1.000e+00        1.02s
PROGRESS:      7   3.368e+02   6.833e+02   1.000e+00        1.31s
PROGRESS:      8   1.774e+02   5.223e+02   1.000e+00        1.46s
PROGRESS:      9   4.718e+01   4.271e+02   1.000e+00        1.65s
PROGRESS:     10   3.447e+01   3.804e+02   1.000e+00        1.83s
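
The log shows that GraphLab chose a logistic regression model with 69,281 unpacked features and 69,282 coefficients. I read the extra coefficient as the intercept, though that is my assumption. The sketch below shows how such a model could turn one word-frequency dictionary into a probability. The weights are invented by hand for illustration and are not taken from the trained model.

```python
import math

def predict_probability(counts, weights, intercept):
    """Probability of the positive class for one review.

    `counts` maps word -> frequency; `weights` maps word -> coefficient.
    """
    # Words without a learned coefficient contribute nothing to the score.
    score = intercept + sum(weights.get(w, 0.0) * c for w, c in counts.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients, chosen by hand for illustration only
weights = {"great": 1.2, "boring": -1.5}
intercept = -0.1

p_pos = predict_probability({"a": 1, "great": 2, "movie": 1}, weights, intercept)
p_neg = predict_probability({"a": 1, "boring": 1, "movie": 1}, weights, intercept)
print(p_pos > 0.5, p_neg < 0.5)
```

Training (the L-BFGS iterations in the log) is the search for the coefficient values that minimize the loss on the training reviews.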

We can measure the performance of the classifier by evaluating it on the test dataset.



In [10]:

    
result1 = model_1.evaluate(test_set)

To get an easy view of the classifier's prediction results, we define and use the following function:



In [11]:

    
def print_statistics(result):
    # Print the accuracy and confusion matrix returned by model.evaluate()
    print "*" * 30
    print "Accuracy        : ", result["accuracy"]
    print "Confusion Matrix: \n", result["confusion_matrix"]

print_statistics(result1)

******************************
Accuracy        :  0.888448583317
Confusion Matrix: 
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      0       |        0        |  2167 |
|      0       |        1        |  343  |
|      1       |        0        |  220  |
|      1       |        1        |  2317 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
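
As a sanity check, the reported accuracy can be recomputed by hand from the confusion matrix: it is the number of correct predictions (the rows where target and predicted labels agree) divided by the total. The short sketch below does this with the counts copied from the table above. The precision and recall figures for the positive class are my own addition, not part of the GraphLab output.

```python
# (target_label, predicted_label) -> count, copied from the table above
confusion = {
    (0, 0): 2167,
    (0, 1): 343,
    (1, 0): 220,
    (1, 1): 2317,
}

total = sum(confusion.values())
correct = sum(n for (target, predicted), n in confusion.items()
              if target == predicted)
accuracy = correct / float(total)

# Precision and recall for the positive class (label 1)
precision = confusion[(1, 1)] / float(confusion[(1, 1)] + confusion[(0, 1)])
recall = confusion[(1, 1)] / float(confusion[(1, 1)] + confusion[(1, 0)])

print(total, round(accuracy, 4), round(precision, 4), round(recall, 4))
```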

As the results above show, with just a few relatively straightforward lines of code we have developed a sentiment classifier with an accuracy of about 0.89 on the held-out test set. Next, we demonstrate how to improve the classifier's accuracy further.

Improving The Classifier

One way to improve the movie review sentiment classifier is to extract more meaningful features from the reviews. One such set of features is the frequency of every pair of consecutive words (2-grams) in each review. To calculate these frequencies we again use GraphLab's count_ngrams function, this time with n=2, and store the result in a new column named '2grams features'.



In [12]:

    
movies_reviews_data['2grams features'] = gl.text_analytics.count_ngrams(movies_reviews_data['review'],2)
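
To illustrate why pairs of consecutive words can carry information that single words miss, the plain-Python sketch below compares two made-up phrases that contain exactly the same words in a different order. The helper count_word_ngrams is hypothetical and uses a simple regex tokenizer of my own choosing, so it is only a stand-in for GraphLab's count_ngrams, not a description of it.

```python
import re
from collections import Counter

def count_word_ngrams(text, n):
    """Count every run of `n` consecutive words in `text`."""
    # Assumption: lowercase and split on non-word characters; GraphLab's
    # tokenizer may differ.
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return dict(Counter(grams))

a = "good, not bad"
b = "bad, not good"

# With single words the two phrases are indistinguishable...
print(count_word_ngrams(a, 1) == count_word_ngrams(b, 1))
# ...but their 2-grams differ ("not bad" versus "not good").
print(count_word_ngrams(a, 2))
print(count_word_ngrams(b, 2))
```

The price of the extra information is a much larger feature space, since far more distinct word pairs than distinct words occur in a collection of reviews.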

As before, we will construct and evaluate a movie review sentiment classifier. This time, however, we will use both the '1grams features' and the '2grams features' columns.



In [13]:

    
train_set, test_set = movies_reviews_data.random_split(0.8, seed=5)
model_2 = gl.classifier.create(train_set, target='sentiment', features=['1grams features','2grams features'])
result2 = model_2.evaluate(test_set)
print_statistics(result2)

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 19953
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 2
PROGRESS: Number of unpacked features : 1244894
PROGRESS: Number of coefficients    : 1244895
PROGRESS: Starting L-BFGS 
--------------------------------------------------------
PROGRESS:   Iter   Grad-Norm        Loss   Step size Elapsed time
PROGRESS:      0   2.360e+03   1.383e+04   1.000e-06        1.28s
PROGRESS:      1   1.357e+03   8.186e+03   1.000e+00        2.31s
PROGRESS:      2   6.327e+02   3.856e+03   1.000e+00        2.86s
PROGRESS:      3   4.666e+02   1.795e+03   1.000e+00        3.16s
PROGRESS:      4   3.725e+02   7.322e+02   1.000e+00        3.51s
PROGRESS:      5   7.518e+01   2.669e+02   1.000e+00        3.83s
PROGRESS:      6   4.297e+01   1.736e+02   1.000e+00        4.13s
PROGRESS:      7   2.675e+01   7.689e+01   1.000e+00        4.45s
PROGRESS:      8   1.356e+01   3.933e+01   1.000e+00        4.76s
PROGRESS:      9   1.021e+01   2.165e+01   1.000e+00        5.07s
PROGRESS:     10   5.665e+00   1.071e+01   1.000e+00        5.41s

******************************
Accuracy        :  0.900336833763
Confusion Matrix: 
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      0       |        0        |  2222 |
|      0       |        1        |  288  |
|      1       |        0        |  215  |
|      1       |        1        |  2322 |
+--------------+-----------------+-------+
[4 rows x 3 columns]

Indeed, the newly constructed classifier is more accurate on the same test set, with an accuracy of about 0.90.

Unlabeled Test File

To test how well the presented method works, we will use all 25,000 labeled IMDB movie reviews in the train dataset to construct a classifier. Afterwards, we will use the constructed classifier to predict the sentiment of each review in the unlabeled dataset. Lastly, we will create a submission file according to Kaggle's guidelines and submit it.



In [14]:

    
# Create a classifier using all 25,000 labeled reviews
traindata_path = "/home/graphlab/data/sentiment/labeledTrainData.tsv"
train_data = gl.SFrame.read_csv(traindata_path,header=True, delimiter='\t',quote_char='"', column_type_hints = {'id':str, 'sentiment' : int, 'review':str } )
train_data['1grams features'] = gl.text_analytics.count_ngrams(train_data['review'],1)
train_data['2grams features'] = gl.text_analytics.count_ngrams(train_data['review'],2)

cls = gl.classifier.create(train_data, target='sentiment', features=['1grams features','2grams features'])

# Load the unlabeled test dataset and compute the same features
test_data = gl.SFrame.read_csv(testdata_path,header=True, delimiter='\t',quote_char='"', column_type_hints = {'id':str, 'review':str } )
test_data['1grams features'] = gl.text_analytics.count_ngrams(test_data['review'],1)
test_data['2grams features'] = gl.text_analytics.count_ngrams(test_data['review'],2)

# Predict the sentiment of each review in the test dataset
test_data['sentiment'] = cls.classify(test_data)['class'].astype(int)

# Save the predictions to a CSV file for submission
test_data[['id','sentiment']].save("/home/graphlab/data/sentiment/predictions.csv", format="csv")

PROGRESS: Finished parsing file /home/graphlab/data/sentiment/labeledTrainData.tsv
PROGRESS: Parsing completed. Parsed 25000 lines in 0.35 secs.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 25000
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 2
PROGRESS: Number of unpacked features : 1458862
PROGRESS: Number of coefficients    : 1458863
PROGRESS: Starting L-BFGS 
--------------------------------------------------------
PROGRESS:   Iter   Grad-Norm        Loss   Step size Elapsed time
PROGRESS:      0   2.952e+03   1.733e+04   1.000e-06        1.41s
PROGRESS:      1   1.563e+03   9.367e+03   1.000e+00        2.55s
PROGRESS:      2   7.436e+02   4.592e+03   1.000e+00        3.19s
PROGRESS:      3   5.845e+02   2.128e+03   1.000e+00        3.52s
PROGRESS:      4   3.755e+02   7.845e+02   1.000e+00        3.94s
PROGRESS:      5   6.853e+01   3.269e+02   1.000e+00        4.42s
PROGRESS:      6   3.852e+01   1.884e+02   1.000e+00        4.95s
PROGRESS:      7   1.118e+02   1.076e+02   1.000e+00        5.54s
PROGRESS:      8   1.581e+02   5.782e+01   1.000e+00        6.36s
PROGRESS:      9   1.368e+01   2.544e+01   1.000e+00        7.08s
PROGRESS:     10   1.345e+01   2.449e+01   1.000e+00        7.87s
PROGRESS: Finished parsing file /home/graphlab/data/sentiment/testData.tsv
PROGRESS: Parsing completed. Parsed 25000 lines in 0.40 secs.
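
Before uploading, it can be worth checking that the predictions file has the expected shape. The sketch below is a hypothetical checker written with the standard csv module. It assumes the submission should have an id,sentiment header followed by one 0/1 prediction per test review (25,000 rows, going by the parsing log above); please confirm the exact format against Kaggle's guidelines. The sample ids are invented.

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission; empty means none found."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != ["id", "sentiment"]:
        problems.append("header should be id,sentiment")
    body = rows[1:]
    if len(body) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(body)))
    if any(len(r) != 2 or r[1] not in ("0", "1") for r in body):
        problems.append("sentiment must be 0 or 1")
    return problems

# A tiny in-memory sample with made-up ids, standing in for predictions.csv
sample = u'id,sentiment\n"a_1",1\n"b_2",0\n'
print(check_submission(sample, expected_rows=2))
```

For the real file, the text could be read from predictions.csv and checked with expected_rows=25000.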

We then submitted the predictions.csv file to the Kaggle challenge website and scored an AUC of about 0.88.
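
For readers unfamiliar with the metric, AUC (area under the ROC curve) can be read as the fraction of (positive, negative) review pairs in which the positive review receives the higher score, with ties counted as half. The sketch below computes it directly from that definition on a handful of made-up labels and scores. It is not Kaggle's scoring code, and the quadratic loop is only suitable for small examples.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # made-up scores
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

print(auc(labels, probabilities), auc(labels, hard_labels))
```

Because AUC depends only on how the reviews are ranked, hard 0/1 predictions such as the ones saved above produce many ties, so they can score differently from the underlying probabilities.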