Sentiment Analysis Project

For this project, we'll perform the same type of NLTK VADER sentiment analysis as before, this time on our movie review dataset.

The 2,000-record IMDb movie review corpus is accessible directly through NLTK with

from nltk.corpus import movie_reviews

However, since we already have it in a tab-delimited file, we'll use that instead.
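As an aside, here's a minimal sketch of how the same corpus could be pulled straight from NLTK into a DataFrame (assuming the corpus has been fetched with nltk.download('movie_reviews'); the df_nltk name is just for illustration):

from nltk.corpus import movie_reviews
import pandas as pd

# Each fileid looks like 'neg/cv000_29416.txt', so the category is the path prefix
rows = [(fid.split('/')[0], movie_reviews.raw(fid)) for fid in movie_reviews.fileids()]
df_nltk = pd.DataFrame(rows, columns=['label', 'review'])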

Load the Data


In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\t')
df.head()


Out[1]:
label review
0 neg how do films like mouse hunt get into theatres...
1 neg some talented actresses are blessed with a dem...
2 pos this has been an extraordinary year for austra...
3 pos according to hollywood movies made in last few...
4 neg my first press screening of 1998 and already i...

Remove Blank Records (optional)


In [2]:
# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if type(rv)==str:            # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list

df.drop(blanks, inplace=True)
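As a side note, the same cleanup can be done without an explicit loop; a sketch, assuming every review remaining after dropna() is a string:

# Keep only rows whose review contains more than whitespace
df = df[~df['review'].str.isspace()]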

In [3]:
df['label'].value_counts()


Out[3]:
pos    969
neg    969
Name: label, dtype: int64

Import SentimentIntensityAnalyzer and create an sid object

This assumes that the VADER lexicon has been downloaded.
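If it hasn't been, a one-time download takes care of it:

import nltk
nltk.download('vader_lexicon')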


In [5]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

Use sid to append a comp_score column to the dataset


In [6]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])

df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

df.head()


Out[6]:
label review scores compound comp_score
0 neg how do films like mouse hunt get into theatres... {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co... -0.9125 neg
1 neg some talented actresses are blessed with a dem... {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com... -0.8618 neg
2 pos this has been an extraordinary year for austra... {'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com... 0.9953 pos
3 pos according to hollywood movies made in last few... {'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co... 0.9972 pos
4 neg my first press screening of 1998 and already i... {'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com... -0.7264 neg
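Note that the mapping above treats every non-negative compound score as 'pos'. VADER's own convention reserves a neutral band between -0.05 and 0.05; here's a sketch of that three-way mapping, should you want a 'neu' class (the comp_score3 column name is just for illustration):

def label_compound(c, threshold=0.05):
    # Conventional VADER cutoffs: >= 0.05 positive, <= -0.05 negative, else neutral
    if c >= threshold:
        return 'pos'
    elif c <= -threshold:
        return 'neg'
    return 'neu'

# df['comp_score3'] = df['compound'].apply(label_compound)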

Perform a comparison analysis between the original label and comp_score


In [7]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [8]:
accuracy_score(df['label'], df['comp_score'])


Out[8]:
0.6367389060887513

In [9]:
print(classification_report(df['label'], df['comp_score']))


              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

   micro avg       0.64      0.64      0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938


In [10]:
print(confusion_matrix(df['label'], df['comp_score']))


[[427 542]
 [162 807]]
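For readability, the raw matrix can be wrapped in a labeled DataFrame (a sketch; sklearn orders labels alphabetically, so the rows and columns run neg, pos):

import pandas as pd
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(df['label'], df['comp_score'])
print(pd.DataFrame(cm, index=['true neg', 'true pos'], columns=['pred neg', 'pred pos']))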

So it looks like VADER couldn't judge the movie reviews very accurately. This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics. Many of the reviews had positive things to say about a movie, reserving the final (often negative) judgment for the last sentence, so the positive body of the review dominated the compound score.

Great Job!