by Lucas Hure and Ian Smith
This project is an exploration of different machine learning techniques applied to Sentiment Analysis in the context of movie reviews. Sentiment Analysis is the process of building a model that analyzes and predicts the sentiment expressed in text using natural language processing and text-analysis techniques.
The dataset we used was provided by the website Kaggle and is a collection of movie reviews from Rotten Tomatoes. Typically, sentiment classification is done on a binary scale of "positive" or "negative". One reason we chose this specific dataset is that its reviews are rated on a five-point scale:
0 - negative, 1 - somewhat negative, 2 - neutral, 3 - somewhat positive, 4 - positive
This makes the classification task much more granular and therefore much more challenging, with an accuracy expectation of 20% for a fully random model, as opposed to 50% for a dataset with a binarized sentiment scale. Relatively little work has been done on classification of such datasets, and this was a major motivation for us. We were also very interested in Natural Language Processing and in seeing how certain applications of NLP would affect the accuracy of our models.
The first part of this project comprises Data Exploration, where we look at the distribution of sentiment classes, the correlation between negation and sentiment, the relationship between phrase length and sentiment, the ways in which sentiments are mispredicted (and which class they are predicted as in that case), and so on.
Throughout the rest of our project, we tried to achieve a balance between optimizing the performance of a certain model and exploring new ones. Specifically, we started by engineering our own naive model "from the ground up" as a baseline for exploration and improvement. We then moved on to using a Naive Bayes model, a Logistic Regression model and Support Vector Machines.
All of the project's data and code can be found in the following public GitHub repository: https://github.com/DryingPole/sentiment
The repository has three main directories:
notebooks: This folder contains all of the IPython notebooks used for our project. All notebooks except Final Project Process Book.ipynb should be ignored, as they only contain works in progress that were later migrated to the final notebook. Final Project Process Book.ipynb is the notebook that is uploaded to iSites.
resources: This directory contains all of the data used for the project. This includes primarily train.tsv and test.tsv, which were obtained from the Kaggle site.
sentiment: This directory contains a few Python files. The primary Python files of interest are core.py and bow.py. core.py centralizes a number of helper functions used throughout the notebook. bow.py includes some of our initial models, including the word-dictionary based model and an ensemble model.
Kaggle competition Website: http://www.kaggle.com/c/sentiment-analysis-on-movie-reviews
In [1]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline
%load_ext autoreload
%autoreload 2
# load required modules
import requests
import StringIO
import zipfile
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
import datetime as dt
import random
import collections
import re
import seaborn as sns
# load custom modules
import imp
source_path = '../sentiment/'
core = imp.load_source('core', source_path + 'core.py')
bow = imp.load_source('bow', source_path + 'bow.py')
bayes = imp.load_source('bayes', source_path + 'bayes.py')
None
In [2]:
# use 'load_reviews' to load the data set into a data frame. This will "normalize" the
# headings and make all phrases lower case.
phrases = core.load_reviews('../resources/train.tsv')
phrases.head()
Out[2]:
We can see from the below analysis that the sentiment classes follow a roughly normal distribution, with the majority of samples classified as 2 (or "neutral"). More extreme positive and negative sentiments are rarer, with extremely negative sentiments making up the smallest part of the data set at only 4.5%.
In [3]:
# Get the total counts for each sentiment class within the data set
sent_counts = phrases[['sentiment']].groupby(by='sentiment').size().to_frame(name='Occurrences')
sent_dist = sent_counts.divide(phrases.shape[0]).rename(columns={'Occurrences': 'Distribution'})
sent_counts.merge(sent_dist, left_index=True, right_index=True)
Out[3]:
In [4]:
import math
import matplotlib.mlab as mlab
# Distribution of each class within the data set
mew, std = phrases.sentiment.mean(), phrases.sentiment.std()
xs = np.linspace(-1, 5, 200)
sent_dist.plot(kind='bar', title='Distribution of Sentiment Class vs. Normal Distribution', figsize=(14,8), legend=None)
plt.plot(xs, mlab.normpdf(xs, mew, std), c='r')
plt.xlabel('Sentiment Class')
plt.ylabel('Overall Percent of Data Set')
None
Next we wanted to get a sense of how strongly correlated negative reviews were with certain "fundamental" negative or negation words. For this purpose, we defined a tiny dictionary of "negation" words. Our theory is that identifying such a correlation might help us identify negative reviews more readily, since it is a well-known problem that naive sentiment detection methods often fail to handle these constructs.
In [5]:
# Data Exploration -- capture some features of each phrase
import core
def load_negation_dict(path='../resources/neg_words.csv'):
    ndf = pd.read_csv(path, names=['neg_word'], index_col=0)
    ndf['Value'] = True
    return ndf.to_dict()['Value']
n_words = load_negation_dict()
pe = phrases.copy()
pe['word_list'] = pe.phrase.str.split()
pe['neg_count'] = map(lambda ws: core.lreduce(lambda acc, w: acc + (1 if w in n_words else 0), ws, 0), pe.word_list)
pe['contains_neg'] = map(lambda nc: True if nc > 0 else False, pe.neg_count)
pe['word_count'] = map(len, pe.word_list)
pe.head()
Out[5]:
In [12]:
# Analyze phrase sentiment as related to the appearance of "negation" words
g = pe[pe['contains_neg'] == True].groupby('sentiment').count()
prop_of_neg = sent_counts.merge(g, left_index=True, right_index=True)
prop_of_neg['prop'] = prop_of_neg.neg_count / prop_of_neg.Occurrences
prop_of_neg.plot(kind='bar', y='prop', figsize=(12,8))
plt.title('Proportion of Samples Containing "Negation" Words by Sentiment Class')
plt.xlabel('Sentiment Class')
plt.ylabel('Percentage of Samples Containing "Negation" Words')
None
We found that, even using a very small dictionary of "negation" words, these words occurred almost twice as frequently in the "negative" sentiment classes as in the neutral and positive sentiment classes. However, we realized that our method for tokenizing the phrases into words might not be entirely correct, since it splits phrases only on spaces. To see the kind of issue one might encounter, consider the following example.
Consider the example phrase "We considered parsing, but it was too difficult." What we find is that the appearance of punctuation in phrases can have the undesired effect of obscuring the words with which it co-occurs. This means that our models may not recognize potentially sentiment-significant words if they coincide with punctuation.
In [13]:
sample_phrase = "We considered parsing, but it was too difficult."
naive_split = sample_phrase.split()
better_split = re.split("\W+", sample_phrase)
print naive_split
print better_split
In [ ]:
Next, we wanted to determine whether a meaningful heuristic could be extracted from the data to establish a correlation between the length of a phrase and its sentiment class.
In [14]:
# Determine the max word count to establish an appropriate range for bucketing phrases by number of words
print "Longest Phrase: ", max(pe.word_count)
def filter_by_wc(wc):
    """
    Return the sentiments of phrases whose word count falls in the range (wc-10, wc]
    """
    return pe[['sentiment']][(pe.word_count <= wc) & (pe.word_count > (wc-10))].sentiment.tolist()
bins = [filter_by_wc(x) for x in np.arange(10,60,10)]
plt.figure(figsize=(14,14))
plt.title("Distribution of Sentiment Ratings within Phrases")
plt.xlabel("Words per Phrase")
plt.ylabel("Distribution of Sentiment Labels")
sns.violinplot(bins, names = [str(x)+" to "+str(y) for x, y in zip(np.arange(0,50,10), np.arange(10,60,10))])
None
We can see that as phrase length increases, the distribution of classes becomes less normal. Fewer phrases are rated 2s and more phrases receive either positive or negative ratings. We suspect that this is because longer phrases tend to provide more overall context for a review. If you look at many of the shorter phrases in the data set, they are in fact sub-phrases of the longer phrases. Many of these shorter phrases on their own lack the context, or the "strong sentiment" words, needed to give the phrase a definite polarity.
While we are not immediately sure how to exploit this correlation in terms of detecting a phrase's sentiment, we think it may help us make more effective choices of the phrases we use for training our models later on. Alternatively, we may be able to augment some of our models with evidence-based heuristics to provide some improvement.
In [155]:
We found it difficult to extract meaningful features from the data set using elementary analysis. Some of our initial assumptions about the data proved not to hold: the number of words in a review appears not to have any bearing on its overall sentiment; the appearance of "negating" terms (such as "not", "neither", etc.) is more strongly correlated with neutral and negative ratings than with positive ratings, but the overall proportion of phrases in the data set containing negating terms is quite low, suggesting that this would be a difficult feature to exploit.
For our first model, we wanted to try something extremely naive in order to gauge the difficulty of the problem. This first approach is quite similar to a "Bag of Words" model typically seen with Naive Bayes. That is, this initial model does not take phrase structure or part of speech into account at all. The approach is simple: glean from the training data set every one-word phrase and its associated sentiment score; once the dictionary has been built, parse each phrase into individual words and use a rudimentary algorithm to aggregate the rankings of a phrase's individual words into a holistic sentiment score. The idea behind this approach is to establish a baseline for how indicative individual word polarities are of a phrase's overall sentiment. If we are lucky, the data set will have a sufficiently rich set of sentiment-laden words to allow this algorithm to perform moderately well.
With a balanced data set -- that is, one in which each class appears equally frequently -- we would expect randomly guessing each phrase's sentiment class to be roughly 20% accurate. Thus, if our naive algorithm can double or triple that figure, it will already represent a good deal of progress and give us an indication of how effective we can expect such a naive "bag of words" approach to be. Note that this fine-grained sentiment classification represents a significantly more difficult task than a typical "binary" classification scheme. As such, predicting classes with more than 60% accuracy is actually quite good; in fact the highest score currently on Kaggle is close to 75%.
As a first attempt, we define a model called 'BagOfWordsModel' in the module 'bow.py'. This model provides 'fit' and 'predict' methods; the 'fit' method expects a list of single-word phrases and a list of corresponding sentiments for those phrases. It builds a dictionary that is used by the 'predict' method to predict sentiments.
For our first attempt, we will split our training set into one-word phrases and multi-word phrases. Our one-word phrases will serve as our 'training' data set. After we've trained the model, we'll predict sentiments for the remaining phrases, then determine the model's accuracy by comparing against the known sentiment labels.
The exact method used for label prediction is quite simple: we take a weighted average of the sentiments of each word in a phrase and round the result to the nearest integer to arrive at the predicted class. Our model also optionally takes a dictionary of class weights to allow certain classes to contribute more to the overall prediction for a phrase.
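Since the implementation in bow.py is not reproduced in this notebook, the cell below is only a rough sketch of the idea just described, assuming a plain word-to-sentiment dictionary plus per-class weights; the class name NaiveWordScorer and its details are illustrative and are not the actual BagOfWordsModel code.
In [ ]:
# Illustrative sketch only -- not the actual BagOfWordsModel implementation in bow.py.
class NaiveWordScorer(object):
    def __init__(self, default_weights={0: 5, 1: 3, 2: 1, 3: 3, 4: 5}):
        self.word_sents = {}            # maps a single word to its labelled sentiment
        self.weights = default_weights  # per-class contribution weights

    def fit(self, words, sentiments):
        # 'words' are the one-word phrases gleaned from the training set
        for w, s in zip(words, sentiments):
            self.word_sents[w.lower()] = s
        return self

    def predict_one(self, phrase, weight_map=None):
        wm = weight_map or self.weights
        known = [self.word_sents[w] for w in phrase.lower().split() if w in self.word_sents]
        if not known:
            return 2  # default to 'neutral' when no word in the phrase is recognized
        # weighted average of the word sentiments, rounded to the nearest class
        weighted = sum(wm[s] * s for s in known)
        norm = float(sum(wm[s] for s in known))
        return int(round(weighted / norm))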
In [15]:
# use the one-word phrases as a training set for the BoW model. Split the data set into training data - one-word phrases -
# and test data - all other phrases
# Define the regex for single words and use this to split the data set
single_word = r'\A[\w-]+\Z'
# Extract the training data
train = phrases[phrases.phrase.str.match(single_word)][['phrase', 'sentiment']]
X_train, Y_train = train.phrase.tolist(), train.sentiment.tolist()
# Extract the test data
test = phrases[map(lambda x: not x, phrases.phrase.str.match(single_word))][['phrase', 'sentiment']]
X_test, Y_test = test.phrase, test.sentiment
# Show the head of the training and test data sets
print "Training Set:"
print train.head(5)
print "\nTest Set:"
print test.head(5)
In [16]:
# Train the model, then predict and score sentiments for the test data
bow_model = bow.BagOfWordsModel()
bow_model.fit(X_train, Y_train)
accuracy = bow_model.score(X_test, Y_test)
print accuracy
print bow_model._scoring
Our first attempt with a naive algorithm gives us a 51% classification accuracy, which is more than two and a half times better than randomly guessing the classes.
This isn't bad for a first pass, but we want to get a better sense for which classes prove most difficult to predict. We can use our sent_counts data frame to determine our overall accuracy for each sentiment class.
In [17]:
acc_counts = bow_model._scoring
sent_count_dict = sent_counts.to_dict()['Occurrences']
bow_acc_df = pd.DataFrame(data=np.transpose([acc_counts.keys(),
                                             acc_counts.values(),
                                             sent_count_dict.values()]),
                          columns=['sentiment', 'correct', 'total'])
bow_acc_df['accuracy'] = np.round(bow_acc_df.correct / bow_acc_df.total, 3)
print bow_acc_df[['sentiment', 'accuracy']].head()
bow_acc_df.plot(kind='bar', x='sentiment', y='accuracy', title='Predictive Accuracy by Class', figsize=(8, 6))
plt.ylabel('Accuracy')
plt.xlabel('Sentiment Class')
None
As we can see from the above, our model performs exceptionally poorly on the most extreme sentiment classes of 0 and 4. Since our model drives off of the sentiments of individual words and allows us to adjust the weights assigned to each class, let's see if adjusting the default weights can improve our prediction accuracy.
In [19]:
# Define a function that searches a number of weight combinations and returns the weight combination and model that receives the
# best accuracy score.
def search_best_weights():
    best_acc = 0.0
    best_weights = None
    best_model = None
    zeros = [5, 25, 50, 100]
    ones = [2, 10, 20]
    threes = [3, 5]
    fours = [5, 25, 50]
    for z in zeros:
        for o in ones:
            for t in threes:
                for f in fours:
                    wm = {0: z, 1: o, 2: 1, 3: t, 4: f}
                    wbm = bow.BagOfWordsModel()
                    wbm.fit(X_train, Y_train)
                    wbm_acc = wbm.score(X_test, Y_test, weight_map=wm)
                    if wbm_acc > best_acc:
                        best_acc = wbm_acc
                        best_weights = wm
                        best_model = wbm
    return best_acc, best_model, best_weights
ba, bm, bw = search_best_weights()
print ba, bw
Based on this search, it appears that our original default weights -- {0: 5, 1: 3, 2: 1, 3: 3, 4: 5} -- were the best. There doesn't seem to be much more to gain from this model's approach, given that the other weight combinations we tried only degraded performance.
Multinomial Naive Bayes is often used to perform basic sentiment analysis in what is called a "bag of words" approach. To use this model, we'll use the CountVectorizer class of scikit-learn to convert our training data into a vector where each word represents a feature. The CountVectorizer will return a sparse matrix indicating the features -- or words -- that appear in each phrase.
Our research of sentiment-analysis methodologies indicates that word frequency tends not to provide a good indication of sentiment, so we'll use a binarized Multinomial Naive Bayes model in which each word-feature is counted only once per phrase. We'll also use a default stop-words list to try to pare down the feature list and improve accuracy.
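As a quick illustration of what this produces, the toy cell below (two made-up phrases of ours, not drawn from the Kaggle data) shows the binarized, stop-word-filtered feature matrix:
In [ ]:
# Illustrative only: binarized CountVectorizer on two made-up phrases.
from sklearn.feature_extraction.text import CountVectorizer
demo_phrases = ["the movie was good , really good", "a truly awful movie"]
demo_vec = CountVectorizer(binary=True, stop_words='english')
demo_X = demo_vec.fit_transform(demo_phrases)
print demo_vec.get_feature_names()
# with binary=True, 'good' is recorded as 1 for the first phrase even though it appears twice
print demo_X.toarray()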
In [20]:
# Prepare the data -- use the vectorizer to return a feature set that corresponds to all of the vocabulary found in the data set.
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
swords = 'english' # eliminate stop words
rarray = phrases.phrase.tolist()
vectorizer = CountVectorizer(binary=True, stop_words=swords)
vectorizer.fit(rarray)
X = vectorizer.transform(rarray).tocsc()
Y = phrases.sentiment.tolist()
xtrain, xtest, ytrain, ytest = train_test_split(X, Y)
We'll define some helper functions to transform data and run our models. This will reduce the amount of boilerplate code we need to write to test our models.
In [21]:
'''
Function
--------
Vectorize a data frame of reviews into a sparse matrix of word features and a list of
sentiment labels.
'''
def transform_data(reviews, **kwargs):
    if "binary" not in kwargs:
        kwargs["binary"] = True
    if "min_df" not in kwargs:
        kwargs["min_df"] = 0.0
    rarray = reviews.phrase.tolist()
    vectorizer = kwargs['vectorizer'] if 'vectorizer' in kwargs else CountVectorizer(**kwargs)
    vectorizer.fit(rarray)
    X = vectorizer.transform(rarray).tocsc()
    return X, reviews.sentiment.tolist()
'''
Function
--------
Split the vectorized data into train and test sets (a thin wrapper around train_test_split).
'''
def split(X, Y, **kwargs):
    return train_test_split(X, Y, **kwargs)
'''
Function
--------
Fit a model on the training portion of an already vectorized and split data set, then print
its accuracy score on the test portion.
Parameters
----------
model: an instance of any model that provides 'fit', 'predict', and 'score' methods
data: a (xtrain, xtest, ytrain, ytest) tuple, as returned by 'split'
Returns
-------
the fitted model (the accuracy score is printed as a side effect)
'''
def test_model(model, data=None):
    xtrain, xtest, ytrain, ytest = data
    mf = model.fit(xtrain, ytrain)
    print "Model[%s] Accuracy: %0.2f%%" % (str(mf), 100 * mf.score(xtest, ytest))
    return mf
In [22]:
from sklearn.naive_bayes import MultinomialNB
X_trans, Y_trans = transform_data(phrases)
XY_all = split(X_trans, Y_trans)
mnb = test_model(MultinomialNB(), data=XY_all)
Our first Multinomial Naive Bayes model, using almost no data pre-processing, achieves an accuracy of 61% on the data set. Below we visualize a scatter plot of the expected vs. predicted sentiments to get a sense of which classes are most accurately predicted.
In [23]:
def scatter_exp_vs_pred(df=None, expected=None, predicted=None):
    """
    A function that takes a data frame with 'expected' and 'predicted' columns and
    creates a scatter plot for those columns
    :param df: A dataframe with 'expected' and 'predicted' columns, optionally None
    :param expected: A list of expected values (used when df is None)
    :param predicted: A list of predicted values (used when df is None)
    :return: A data frame of counts for each (predicted, expected) pair
    """
    if df is None:
        df = pd.DataFrame()
        df['expected'] = expected
        df['predicted'] = predicted
    # res_df = pd.DataFrame(data=np.transpose([pred, XY_all[3]]), columns=["predicted", "expected"])
    rstats = df.groupby(["predicted", "expected"]).size().reset_index().rename(columns={0: "counts"})
    plt.figure(figsize=(12, 8))
    plt.title("Predicted vs. Expected Sentiment Values")
    plt.ylabel("Predicted Sentiment")
    plt.xlabel("Expected Sentiment")
    plt.scatter(rstats.expected, rstats.predicted, s=[0.1*c for c in rstats.counts], alpha=0.6)
    plt.show()
    return rstats
In [24]:
# Create a data frame of the expected vs. predicted sentiments. Count each predicted,expected pair and plot
pred = mnb.predict(XY_all[1])
rstats = scatter_exp_vs_pred(expected=XY_all[3], predicted=pred)
We can see from the above scatter plot that most of the miscategorizations show up in neighboring sentiment classes. 2s are generally miscategorized as 1s or 3s, 0s are mostly miscategorized as 1s, and most miscategorizations of 3s are either 2s or 4s. This seems somewhat encouraging if we view the classification problem as a linear regression problem; however, MNB does not use this kind of model for classification.
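As a rough way to quantify this observation (a quick check we added here, not part of the Kaggle metric), we can ask how often the MNB prediction lands within one class of the true label:
In [ ]:
# How often is the MNB prediction exactly right vs. within one class of the true label?
pred_arr = np.array(pred)
exp_arr = np.array(XY_all[3])
print "Exact accuracy:      %0.2f" % np.mean(pred_arr == exp_arr)
print "Within-one accuracy: %0.2f" % np.mean(np.abs(pred_arr - exp_arr) <= 1)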
In [25]:
# Show exact figures for expected vs. predicted sentiment class.
rstats.head(25)
Out[25]:
The disadvantage of the above approach is that it doesn't provide any indication of whether the model is being overfit. To address this, we'll look at the respective accuracy scores of the model on both the training data and the test data.
In [26]:
xtrain, xtest, ytrain, ytest = XY_all
training_accuracy = mnb.score(xtrain, ytrain)
test_accuracy = mnb.score(xtest, ytest)
print "Accuracy on training data: %0.2f" % (training_accuracy)
print "Accuracy on test data: %0.2f" % (test_accuracy)
It looks like there may be some slight overfitting. To get a better sense of the model's overall performance, we'll use cross validation and see how the mean cross-validation score compares to the above score on the test data.
In [27]:
from sklearn.cross_validation import cross_val_score
r = cross_val_score(MultinomialNB(), xtrain, ytrain, cv=10)
print
print "10-fold Cross Validation Scores: ", core.lreduce(lambda s, f: s + ("%0.2f, " % f), r, "")
print "Average Score: %0.2f" % r.mean()
Our model consistently achieves accuracy scores between 60% and 62%. We wanted to see if we could achieve any improvement in accuracy by performing a grid search over different parameters. The results of the grid search below show only the most marginal improvement with tuned parameters; it doesn't look like it will deliver the sort of improvements we are after.
In [28]:
alphas = [0.0, 0.5, 1.0, 5, 10, 20, 50]
min_df = [0.00001, 0.0001, 0.001, 0.005, 0.01]
best_sc = 0.0
best_mnb = None
best_params = None
for md in min_df:
    x, y = transform_data(phrases, min_df=md)
    xtrain, xtest, ytrain, ytest = split(x, y)
    for a in alphas:
        clf = MultinomialNB(alpha=a).fit(xtrain, ytrain)
        sc = clf.score(xtest, ytest)
        if sc > best_sc:
            best_params = (a, md)
            best_sc = sc
            best_mnb = clf
In [29]:
print "Best Parameters: alpha:\t%0.2f\tmin_df:\t%0.5f" % (best_params[0], best_params[1])
print "Best Accuracy: ", best_sc
In [33]:
from sklearn.linear_model import LogisticRegression
xtrain, xtest, ytrain, ytest = XY_all
clf = LogisticRegression().fit(xtrain, ytrain)
lreg_trnscore = clf.score(xtrain, ytrain)
lreg_tscore = clf.score(xtest, ytest)
print "Training accuracy %0.2f" % lreg_trnscore
print "Test accuracy %0.2f" % lreg_tscore
A cursory test using a logistic regression model shows a 3% improvement in performance. As before, we plot the expected vs. predicted sentiments below.
In [34]:
lreg_pred = clf.predict(xtest)
lreg_stats = scatter_exp_vs_pred(expected=ytest, predicted=lreg_pred)
To put these numbers into perspective, we join the lreg_stats data frame to the data frame that was returned for the MultinomialNB classifier. The 'comparison' column shows how the logistic regression classifications compare to those of the MNB classifier.
In [35]:
comp = lreg_stats.rename(columns={'counts':'LR_counts'}).merge(rstats, left_on=['predicted', 'expected'], right_on=['predicted', 'expected'])
comp['comparison'] = map(lambda x, y: float(x) / y, comp.LR_counts, comp.counts)
comp.head(25)
Out[35]:
In [36]:
outperform = comp[(comp.comparison > 1) & (comp.expected == comp.predicted)]
outperform.head()
Out[36]:
We see that logistic regression is doing better primarily as a result of guessing 2s much more accurately than MNB. Next, we want to see if we can tune our logistic regression model by testing different values of the regularization parameter C. For this we use a grid search over values between 2 and 5.
In [37]:
# use cross validation to find the optimal value for k
from sklearn.grid_search import GridSearchCV
c = np.arange(2, 5, 0.4)
param = {'C': c}
lreg = LogisticRegression()
clf2 = GridSearchCV(lreg, param, cv=3)
clf2.fit(xtrain, ytrain)
Out[37]:
In [398]:
# visualize scores
a = clf2.grid_scores_
scores = [b.cv_validation_scores for b in a]
def plot_gs_results(scores):
    fig = plt.figure(figsize=(12, 12))
    # Add a boxplot of scores per C value
    ax1 = fig.add_subplot(211)
    sns.boxplot(scores, ax=ax1)
    plt.title('Distribution of Accuracy Scores as a Function of $C$')
    plt.ylabel('Prediction Accuracy')
    plt.xticks(np.arange(1,9,1), [str(x) for x in param['C']])
    # Add a plot of the mean scores
    ax2 = fig.add_subplot(212)
    plt.title('Mean Accuracy Score as a Function of $C$')
    plt.ylabel('Prediction Accuracy')
    plt.scatter(xrange(1,9), np.mean(scores, axis=1), c='C', marker='o')
    plt.xticks(np.arange(1,9,1), [str(x) for x in param['C']])
    plt.xlabel('Choice of C')
    plt.show()
None
plot_gs_results(scores)
Based on the plot above, we can see that $2.8$ provides the best value for $C$, although the difference between the values of $C$ is nearly negligible.
In [377]:
opti_clf = LogisticRegression(C = 2.8).fit(xtrain, ytrain)
print "Accuracy: %0.5f%%" % opti_clf.score(xtest,ytest)
In [399]:
from sklearn.svm import LinearSVC
svc = LinearSVC().fit(xtrain, ytrain)
svc_train_scr = svc.score(xtrain, ytrain)
svc_test_scr = svc.score(xtest, ytest)
print "SVC Accuracy (train) %0.4f" % svc_train_scr
print "SVC Accuracy (test) %0.4f" % svc_test_scr
In [407]:
svc_preds = svc.predict(xtest)
svc_stats = scatter_exp_vs_pred(expected=ytest, predicted=svc_preds)
In [409]:
comp = comp.merge(svc_stats.rename(columns={'counts': 'svc_counts'}),
left_on=['predicted', 'expected'], right_on=['predicted', 'expected'])
comp['svc_comparison'] = map(lambda x, y: float(x) / y, comp.svc_counts, comp.counts)
comp.head(25)
Out[409]:
We can see here that SVC compares quite favorably to logistic regression and multinomial Bayes, with improvements in correctly guessing almost every category. Unfortunately, this method is very computationally heavy, which makes additional analysis with it somewhat cumbersome.
While the predictive accuracy of these models is actually not bad, it can be somewhat misleading. The issue is that the distribution of classes within the data set is far from uniform; the vast majority of samples are 2s. This means that any model with a "high recall" of 2s will actually do fairly well against this data set. What we do not get is an accurate sense of how well our model would perform against an arbitrary sample. To assess each model's predictive abilities more accurately, then, we should try to test our models on a data set of balanced classes.
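To make this concrete, the accuracy of a degenerate model that always predicts the majority class (2) is simply the proportion of 2s in the data set; the quick check below (ours, added for illustration) computes that baseline:
In [ ]:
# Baseline: accuracy of a model that always predicts the majority class, 2 ('neutral')
majority_baseline = (phrases.sentiment == 2).mean()
print "Always-predict-2 accuracy: %0.2f" % majority_baseline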
To begin, we'll create balanced test and training data sets and run our models against these. We should expect to see the predictive accuracy drop significantly, especially given earlier analysis that showed most of the models' accuracy coming from correctly predicting 2s.
We now build a balanced data set by down-sampling each of the larger classes to the size of the smallest class, then train our models and analyze the results as before.
In [ ]:
In [384]:
# Create a data set with balanced classes. The class with the least number of samples is '0', with only 7072.
smallest = phrases.groupby('sentiment').count().min()[0]
print "Smallest Class Size: ", smallest
In [401]:
zeroes = phrases[phrases.sentiment == 0]
ones = phrases[phrases.sentiment == 1]
twos = phrases[phrases.sentiment == 2]
threes = phrases[phrases.sentiment == 3]
fours = phrases[phrases.sentiment == 4]
ones_samples = ones.loc[np.random.choice(ones.index, smallest, replace=False)]
twos_samples = twos.loc[np.random.choice(twos.index, smallest, replace=False)]
threes_samples = threes.loc[np.random.choice(threes.index, smallest, replace=False)]
fours_samples = fours.loc[np.random.choice(fours.index, smallest, replace=False)]
bal_reviews = pd.concat([zeroes, ones_samples, twos_samples, threes_samples, fours_samples])
In [402]:
# Transform and Split the balanced data set.
XY_bal = transform_data(bal_reviews)
XY_bal_all = split(XY_bal[0], XY_bal[1])
In [411]:
bal_models = [test_model(m, XY_bal_all) for m in [MultinomialNB(alpha=1.0), LogisticRegression(C=2.8), LinearSVC()]]
In [412]:
ypreds = [m.predict(XY_bal_all[1]) for m in bal_models]
bal_stats = [scatter_exp_vs_pred(expected=XY_bal_all[3], predicted=p) for p in ypreds]
As expected, the predictive accuracy of all of the previously explored models declined by 10% or more. Balancing the data set has yielded another interesting insight: comparatively, each model's accuracy in predicting the more polarized sentiment classes now looks even better than its accuracy at detecting neutral phrases.
Such fine-grained sentiment analysis is actually quite difficult. Many sentiment analysis tasks that appear in popular publications focus on more coarse-grained classification: a phrase is categorized as either positive or negative.
We wanted to get a sense for how accurately we could predict sentiments when they fall into one of these two classifications. To do this, we created a "new" data set from the Kaggle data set by removing all neutral reviews (the 2s) and mapping the classes 0 and 1 to "negative" reviews, and 3 and 4 to "positive."
In [413]:
#create dataset of binarized sentiments to positive and negative.
bin_reviews = phrases[(phrases.sentiment != 2)]
sents = bin_reviews.sentiment.tolist()
bin_sents = []
for i in sents:
    if (i == 1) | (i == 0):
        bin_sents.append(0)
    else:
        bin_sents.append(1)
#create DF of binary sentiments
bin_reviews = bin_reviews.drop(['sentiment'], axis = 1)
bin_reviews['sentiment'] = bin_sents
In [414]:
bin_reviews.head()
Out[414]:
In [415]:
X_bin, Y_bin = transform_data(bin_reviews)
XY_bin_all = split(X_bin, Y_bin)
xbtrain, xbtest, ybtrain, ybtest = XY_bin_all
[test_model(m, XY_bin_all) for m in [MultinomialNB(alpha=1.0), LogisticRegression(C=2.8), LinearSVC()]]
Out[415]:
As we can see, using this coarse-grained sentiment detection we obtain really impressive accuracy.
One of the persistent problems with sentiment analysis is being able to deal effectively with modifying words that negate or change the meaning of other words.
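To see why this is hard for a bag-of-words representation, note that word order is discarded entirely, so a negator is decoupled from the word it modifies. The toy example below (made-up phrases, added here for illustration) shows two phrases with opposite sentiment that map to identical feature vectors:
In [ ]:
# Illustrative only: same bag of words, opposite sentiment
from sklearn.feature_extraction.text import CountVectorizer
neg_demo = ["not bad , quite good", "not good , quite bad"]
neg_vec = CountVectorizer(binary=True)
print neg_vec.fit_transform(neg_demo).toarray()  # identical rows
print neg_vec.get_feature_names()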
In [430]:
'''
Function
--------
Prepare a specific dataset and run a specific model on it.
Vectorizes and splits the data into train and test sets, then fits the model and prints out
its accuracy score.
Parameters
----------
revs: reviews dataframe. The original "reviews" set, the "stop_reviews" set, the "bal_reviews" dataset,
or the "bin_reviews" dataset
model: an instance of any model that provides 'fit', 'predict', and 'score' methods
Returns
-------
prints out the accuracy score
'''
def reviews_and_model(revs, model):
    # vectorize the dataset
    rarray = revs.phrase.tolist()
    vectorizer = CountVectorizer(min_df=0.00, binary=True)
    vectorizer.fit(rarray)
    X = vectorizer.transform(rarray).tocsc()
    # split the data into train and test
    Y = revs.sentiment.tolist()
    xtrain, xtest, ytrain, ytest = train_test_split(X, Y)
    # run the model
    clf = model.fit(xtrain, ytrain)
    print "Accuracy: %0.2f%%" % (100 * clf.score(xtest, ytest))
In [421]:
pe = bin_reviews.copy()
pe['word_list'] = pe.phrase.str.split()
pe['contain_neg'] = map(lambda l:"n't" in l or "not" in l, pe.word_list)
pe['word_count'] = map(len, pe.word_list)
# Analyze phrase sentiment as related to the appearance of the word "not"
neg_counts = pe.groupby(["sentiment","contain_neg"]).size().to_frame("neg_count").reset_index()
print neg_counts
In [425]:
def neg_preprocess(revs_list):
    """Prefix every word that follows a negation (not / n't) in a phrase with NOT_."""
    phrases = revs_list
    new_phrases = []
    for phrase in phrases:
        parts = re.split("[\s,\.\!\?\:]+", phrase)
        if "n't" in parts or "not" in parts:
            if "n't" in parts:
                ind = parts.index("n't")
            if "not" in parts:
                ind = parts.index("not")
            temp = []
            for p in parts:
                if parts.index(p) <= ind:
                    temp.append(p)
                elif parts.index(p) > ind and p != '':
                    temp.append("NOT_" + p)
            new_phrases.append(' '.join(temp))
        else:
            new_phrases.append(phrase)
    return new_phrases
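A quick illustration of what this preprocessing does (the input phrases here are ours, not from the data set):
In [ ]:
# Every word after the first negation gets a NOT_ prefix; phrases without negation pass through unchanged
print neg_preprocess(["this movie is not good", "a great movie"])
# -> ['this movie is not NOT_good', 'a great movie']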
In [426]:
neg_reviews = bin_reviews.copy()
lst = neg_reviews['phrase'].tolist()
neg_reviews = neg_reviews.drop(['phrase'], axis = 1)
In [427]:
def ultimate_neg(rev_list, sent_list):
    """Relabel any phrase containing a negation (not / n't) as negative (0); leave other phrases unchanged."""
    phrs = rev_list
    sents = sent_list
    new_phrs = []
    new_sents = []
    for phr, sen in zip(phrs, sents):
        prts = re.split("[\s,\.\!\?\:]+", phr)
        if "n't" in prts or "not" in prts:
            new_phrs.append(phr)
            new_sents.append(0)
        else:
            new_phrs.append(phr)
            new_sents.append(sen)
    return new_phrs, new_sents
In [428]:
neg_neg = bin_reviews.copy()
neg_l = neg_neg.phrase.tolist()
neg_s = neg_neg.sentiment.tolist()
neg_neg = neg_neg.drop(['phrase'], axis = 1)
neg_neg = neg_neg.drop(['sentiment'], axis = 1)
In [431]:
neglist, negsent = ultimate_neg(neg_l,neg_s)
neg_neg['phrase'] = neglist
neg_neg['sentiment'] = negsent
reviews_and_model(neg_neg, LogisticRegression(C = 10))
From this exploration of Sentiment Analysis techniques, we learned about some of the challenges of Natural Language Processing. Words are features, but those features don't by themselves capture the subtleties of sentence structure, part of speech, and context. For example, one of the shortcomings of our models is that they use a Bag of Words approach, which doesn't take some of those subtleties into account; each word is a distinct feature.
We tried to combat those limitations by manipulating the data in different ways, namely:
Because the dataset was very imbalanced in terms of class distribution, we created a subset of our original set in which each class contains a number of randomly chosen phrases equal to the number of phrases in the minority class, thereby artificially creating a perfectly balanced dataset. In addition to our initial dataset, we experimented with three other datasets, each optimized in a different way.
Using cross-validation to find optimal parameter settings, heuristic functions and optimized data, we managed to push our accuracy to 65% on the 5-class dataset, which is over three times higher than the random expectation, and to a surprising 91% on the binarized-sentiment dataset. The model that performed best on both the 5-class and binarized datasets was Logistic Regression, which slightly outperformed the Naive Bayes classifier. Looking for reasons to explain that discrepancy, we realized that a Bayesian approach relies on the assumption that features are conditionally independent. This is clearly not fully the case with the datasets we used, which probably affected the NB classifier's performance, even though it still yielded very good results.
Further improvements could be made in the form of:
In [ ]: