Now that we've seen word vectors, we can start to investigate sentiment analysis. The goal is to find commonalities between documents, with the understanding that similarly combined vectors should correspond to similar sentiments.
While the scope of sentiment analysis is very broad, we will focus our work in two ways.
First, we won't try to determine whether a sentence is objective or subjective, fact or opinion. We care only whether the text expresses a positive, negative or neutral opinion, and we'll aggregate all of the sentences in a document or paragraph to arrive at an overall opinion.
Second, we won't try to perform a fine-grained analysis that would determine the degree of positivity or negativity. That is, we're not trying to guess how many stars a reviewer awarded, just whether the review was positive or negative.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is an NLTK module that provides sentiment scores based on the words used ("completely" boosts a score, while "slightly" reduces it), on capitalization and punctuation ("GREAT!!!" scores more strongly than "great."), and on negations (words like "isn't" and "doesn't" flip the polarity of nearby words). We'll test a few of these effects below.
To view the source code, visit https://www.nltk.org/_modules/nltk/sentiment/vader.html
Download the VADER lexicon. You only need to do this once.
In [1]:
import nltk
nltk.download('vader_lexicon')
Out[1]:
In [2]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
VADER's SentimentIntensityAnalyzer takes in a string (via its polarity_scores() method) and returns a dictionary of scores in four categories: neg, neu and pos (proportions that sum to 1), plus compound, a normalized score ranging from -1 (most negative) to +1 (most positive):
In [3]:
a = 'This was a good movie.'
sid.polarity_scores(a)
Out[3]:
In [4]:
a = 'This was the best, most awesome movie EVER MADE!!!'
sid.polarity_scores(a)
Out[4]:
In [5]:
a = 'This was the worst film to ever disgrace the screen.'
sid.polarity_scores(a)
Out[5]:
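As a quick check of the boosting, capitalization and negation behavior described earlier, we can score a plain sentence against a few variants. A minimal sketch (the sentences are invented for illustration), printing only the compound score:

for text in ['This movie was great.',
             'This movie was GREAT!!!',
             'This movie was slightly great.',
             "This movie wasn't great."]:
    print(f"{text:40} compound: {sid.polarity_scores(text)['compound']}")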
Now let's apply VADER to a larger dataset of Amazon reviews, each labeled 'pos' or 'neg'.
In [6]:
import numpy as np
import pandas as pd
df = pd.read_csv('../TextFiles/amazonreviews.tsv', sep='\t')
df.head()
Out[6]:
In [7]:
df['label'].value_counts()
Out[7]:
In [8]:
# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i, lb, rv in df.itertuples():  # iterate over (index, label, review) tuples
    if type(rv) == str:            # avoid NaN values
        if rv.isspace():           # test 'review' for whitespace
            blanks.append(i)       # add matching index numbers to the list

df.drop(blanks, inplace=True)
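As an aside, the same cleanup can be written with vectorized string methods; a sketch that assumes dropna() has already run, so every remaining review is a string:

# Sketch: keep only rows whose 'review' contains non-whitespace text
df = df[df['review'].str.strip().astype(bool)]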
In [9]:
df['label'].value_counts()
Out[9]:
In this case there were no empty records. Good!
In [10]:
sid.polarity_scores(df.loc[0]['review'])
Out[10]:
In [11]:
df.loc[0]['label']
Out[11]:
Great! Our first review was labeled "positive", and earned a positive compound score.
In [12]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df.head()
Out[12]:
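As an aside, rather than extracting the compound score by hand (as we do in the next cell), all four scores could be expanded into columns at once. A sketch, with df_expanded as a throwaway name so the pipeline below is unaffected:

# Sketch: expand each scores dict into neg/neu/pos/compound columns
scores_df = pd.DataFrame(df['scores'].tolist(), index=df.index)
df_expanded = df.join(scores_df)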
In [13]:
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df.head()
Out[13]:
Finally we'll map compound scores to 'pos'/'neg' labels so they can be compared to the dataset's labels. Note that a compound score of exactly 0 (fully neutral) is counted as 'pos' here, since the dataset contains only positive and negative reviews.
In [14]:
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')
df.head()
Out[14]:
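Since numpy is already imported, the same mapping can also be written in vectorized form; a sketch equivalent to the apply() above:

# Sketch: vectorized version of the comp_score mapping
df['comp_score'] = np.where(df['compound'] >= 0, 'pos', 'neg')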
Now let's see how well VADER's predicted labels agree with the labels that came with the dataset.
In [15]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
In [16]:
accuracy_score(df['label'], df['comp_score'])
Out[16]:
In [17]:
print(classification_report(df['label'], df['comp_score']))
In [18]:
print(confusion_matrix(df['label'], df['comp_score']))
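The raw confusion matrix prints without labels. Wrapping it in a DataFrame makes the rows (true labels) and columns (predictions) explicit; a sketch that passes the label order explicitly:

# Sketch: label the confusion matrix for readability
labels = ['neg', 'pos']
cm = confusion_matrix(df['label'], df['comp_score'], labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))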