The data come from the UCI Sentiment Labelled Sentences dataset (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#). We begin the analysis with the Amazon reviews.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# Grab and process the raw data.
data_path = "/Users/jacquelynzuker/Desktop/sentiment labelled sentences/amazon_cells_labelled.txt"
amazon_raw = pd.read_csv(data_path, delimiter='\t', header=None)
amazon_raw.columns = ['message', 'satisfaction']
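A quick peek at the first few rows (a minimal check, using the column names assigned above) confirms the file parsed as expected:
In [ ]:
# One review per row, with a 0/1 satisfaction label.
amazon_raw.head()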
Keywords are chosen as inputs to the Naive Bayes model. They were selected by browsing the raw Amazon dataset for "emotionally charged" words that looked like good predictors of sentiment; the model was then run, and any keyword that provided no benefit was discarded from the list.
In [3]:
keywords = ['must have', 'excellent', 'awesome', 'recommend', 'good',
            'great', 'happy', 'love', 'satisfied', 'best', 'works',
            'liked', 'easy', 'quick', 'incredible', 'perfectly',
            'right', 'cool', 'joy', 'easier', 'fast', 'nice', 'family',
            'sweetest', 'poor', 'broke', 'doesn\'t work',
            'not work', 'died', 'don\'t buy', 'problem', 'useless',
            'awful', 'failed', 'terrible', 'horrible', 'the', '10']

for key in keywords:
    # Add one boolean column per keyword. This is a case-insensitive
    # substring match, so a keyword can also match inside a longer word.
    amazon_raw[str(key)] = amazon_raw.message.str.contains(
        str(key), case=False
    )
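The manual discard step described above can also be checked programmatically. The sketch below is an illustrative addition rather than the procedure used in this notebook: it drops each keyword in turn, refits a Bernoulli Naive Bayes model, and flags keywords whose removal leaves training accuracy unchanged or better. It assumes the keyword columns and the satisfaction label created above.
In [ ]:
from sklearn.naive_bayes import BernoulliNB

def find_unhelpful_keywords(df, keys, target_col='satisfaction'):
    # Baseline training accuracy with every keyword feature included.
    y = df[target_col]
    baseline = BernoulliNB().fit(df[keys], y).score(df[keys], y)
    unhelpful = []
    for key in keys:
        reduced = [k for k in keys if k != key]
        score = BernoulliNB().fit(df[reduced], y).score(df[reduced], y)
        # Dropping the keyword did not hurt accuracy, so it adds no benefit.
        if score >= baseline:
            unhelpful.append(key)
    return unhelpful

find_unhelpful_keywords(amazon_raw, keywords)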
Checking the class balance of the dataset.
In [4]:
amazon_raw.satisfaction.value_counts()
Out[4]:
In [5]:
# Count how many keyword features fire for each review.
myList = [0] * len(amazon_raw)
for column in amazon_raw.columns[2:]:
    for index in amazon_raw.index:
        if amazon_raw[column][index]:
            myList[index] += 1
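The loops above tally how many keyword features fire for each review. As an aside, a vectorized sketch of the same count, relying only on the boolean keyword columns created earlier, is:
In [ ]:
# Equivalent vectorized count: sum the boolean keyword columns across each row.
keyword_hits = amazon_raw[amazon_raw.columns[2:]].sum(axis=1)
keyword_hits.sum()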
In [6]:
np.sum(myList)
Out[6]:
In [7]:
amazon_raw.sum()
Out[7]:
One disadvantage inherent in the Naive Bayes model is its assumption that features are independent: it does not work well when keywords are strongly correlated with one another in the reviews.
In [10]:
plt.figure(figsize=(15,10))
sns.heatmap(amazon_raw.corr())
plt.title("Correlation between keywords in the model")
plt.show()
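To make the heatmap easier to read off, the sketch below (an added illustration, not part of the original analysis) lists the most strongly correlated pairs from the same correlation matrix; note that the satisfaction column is included alongside the keywords.
In [ ]:
# Pull the upper triangle of the correlation matrix and rank the pairs.
corr = amazon_raw.corr()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
upper.stack().sort_values(ascending=False).head(10)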
In [11]:
data = amazon_raw[keywords]
target = amazon_raw['satisfaction']
In [12]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB
# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()
# Fit our model to the data.
bnb.fit(data, target)
# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)
# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
data.shape[0],
(target != y_pred).sum()
))
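Because the model is scored on the same rows it was fit on, the mislabeled-point count above is an optimistic estimate. A minimal sketch of a more honest check, using scikit-learn's cross_val_score on the same data and target, would be:
In [ ]:
# Estimate out-of-sample accuracy with 10-fold cross-validation.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(BernoulliNB(), data, target, cv=10)
print("Mean cross-validated accuracy: {:.3f}".format(scores.mean()))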
In [13]:
data_path = "/Users/jacquelynzuker/Desktop/sentiment labelled sentences/imdb_labelled.txt"
imdb_raw = pd.read_csv(data_path, delimiter='\t', header=None)
imdb_raw.columns = ['message', 'satisfaction']
# Point amazon_raw at the IMDB data so the keyword and model cells above
# can be re-run unchanged on this dataset.
amazon_raw = imdb_raw
In [14]:
#imdb_raw
In [13]:
data_path = "/Users/jacquelynzuker/Desktop/sentiment labelled sentences/yelp_labelled.txt"
yelp_raw = pd.read_csv(data_path, delimiter='\t', header=None)
yelp_raw.columns = ['message', 'satisfaction']
# Point amazon_raw at the Yelp data so the same cells can be re-run here too.
amazon_raw = yelp_raw
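Pointing amazon_raw at the IMDB and Yelp data lets the keyword and model cells above be re-run unchanged on each new dataset. The sketch below shows the same idea more explicitly for the Yelp reviews: it rebuilds the keyword features on yelp_raw and scores the Amazon-trained model against them, assuming bnb and keywords still hold the objects created above.
In [ ]:
# Build the same keyword features on the Yelp reviews and score the model
# that was trained on the Amazon data.
for key in keywords:
    yelp_raw[str(key)] = yelp_raw.message.str.contains(str(key), case=False)

yelp_data = yelp_raw[keywords]
yelp_target = yelp_raw['satisfaction']
print("Yelp accuracy of the Amazon-trained model: {:.3f}".format(
    bnb.score(yelp_data, yelp_target)
))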
In [ ]: