Sentiment Analysis Project

For this project, we'll perform the same type of NLTK VADER sentiment analysis as before, this time on our movie review dataset.

The 2,000-record IMDb movie review corpus is accessible directly through NLTK with

from nltk.corpus import movie_reviews

However, since we already have it in a tab-delimited file, we'll use that instead.
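As an aside, here's a minimal sketch of how the same corpus could be pulled straight from NLTK into a DataFrame (assuming the corpus has been fetched with nltk.download('movie_reviews'); the df_nltk name is just for illustration):

from nltk.corpus import movie_reviews
import pandas as pd

# Each fileid looks like 'neg/cv000_29416.txt', so the category is the path prefix
rows = [(fid.split('/')[0], movie_reviews.raw(fid)) for fid in movie_reviews.fileids()]
df_nltk = pd.DataFrame(rows, columns=['label', 'review'])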

Load the Data


In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\t')
df.head()


Out[1]:
label review
0 neg how do films like mouse hunt get into theatres...
1 neg some talented actresses are blessed with a dem...
2 pos this has been an extraordinary year for austra...
3 pos according to hollywood movies made in last few...
4 neg my first press screening of 1998 and already i...

Remove Blank Records (optional)


In [2]:
# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if type(rv)==str:            # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list

df.drop(blanks, inplace=True)
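As a side note, the same cleanup can be done without an explicit loop; a sketch, assuming every review remaining after dropna() is a string:

# Keep only rows whose review contains more than whitespace
df = df[~df['review'].str.isspace()]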

In [3]:
df['label'].value_counts()


Out[3]:
pos    969
neg    969
Name: label, dtype: int64

Import SentimentIntensityAnalyzer and create an sid object

This assumes that the VADER lexicon has been downloaded.
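If it hasn't been, a one-time download takes care of it:

import nltk
nltk.download('vader_lexicon')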


In [5]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

Use sid to append a comp_score column to the dataset


In [6]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])

df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

df.head()


Out[6]:
label review scores compound comp_score
0 neg how do films like mouse hunt get into theatres... {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co... -0.9125 neg
1 neg some talented actresses are blessed with a dem... {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com... -0.8618 neg
2 pos this has been an extraordinary year for austra... {'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com... 0.9953 pos
3 pos according to hollywood movies made in last few... {'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co... 0.9972 pos
4 neg my first press screening of 1998 and already i... {'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com... -0.7264 neg
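Note that the mapping above treats every non-negative compound score as 'pos'. VADER's own convention reserves a neutral band between -0.05 and 0.05; here's a sketch of that three-way mapping, should you want a 'neu' class (the comp_score3 column name is just for illustration):

def label_compound(c, threshold=0.05):
    # Conventional VADER cutoffs: >= 0.05 positive, <= -0.05 negative, else neutral
    if c >= threshold:
        return 'pos'
    elif c <= -threshold:
        return 'neg'
    return 'neu'

# df['comp_score3'] = df['compound'].apply(label_compound)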

Perform a comparison analysis between the original label and comp_score


In [7]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [8]:
accuracy_score(df['label'], df['comp_score'])


Out[8]:
0.6367389060887513

In [9]:
print(classification_report(df['label'], df['comp_score']))


              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

   micro avg       0.64      0.64      0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938


In [10]:
print(confusion_matrix(df['label'], df['comp_score']))


[[427 542]
 [162 807]]
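For readability, the raw matrix can be wrapped in a labeled DataFrame (a sketch; sklearn orders labels alphabetically, so the rows and columns run neg, pos):

import pandas as pd
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(df['label'], df['comp_score'])
print(pd.DataFrame(cm, index=['true neg', 'true pos'], columns=['pred neg', 'pred pos']))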

So it looks like VADER couldn't judge the movie reviews very accurately. This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics. Many of the reviews had positive things to say about a movie, reserving the final (often negative) judgment for the last sentence, so the positive body of the review dominated the compound score.

Great Job!