Jackie Zuker
It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:
Write up your iterations and answers to the above questions in a few pages. Submit a link below and go over it with your mentor to see if they have any other ideas on how you could improve your classifier's performance.
Five versions of my classifier:
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import cross_val_score
In [2]:
# Grab and process the raw data.
data_path = ("/Users/jacquelynzuker/Desktop/sentiment labelled sentences/amazon_cells_labelled.txt"
)
amazon_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
amazon_raw.columns = ['message', 'satisfaction']
In [3]:
keywords = ['must have', 'excellent', 'awesome', 'recommend', 'good',
'great', 'happy', 'love', 'satisfied', 'best', 'works',
'liked', 'easy', 'quick', 'incredible', 'perfectly',
'right', 'cool', 'joy', 'easier', 'fast', 'nice', 'family',
'incredible', 'sweetest', 'poor', 'broke', 'doesn\'t work',
'not work', 'died', 'don\'t buy', 'problem', 'useless',
'awful', 'failed', 'terrible', 'horrible', '10',
'cool']
for key in keywords:
# Note that we add spaces around the key so that we're getting the word,
# not just pattern matching.
amazon_raw[str(key)] = amazon_raw.message.str.contains(
'' + str(key) + '',
case = False
)
In [4]:
data = amazon_raw[keywords]
target = amazon_raw['satisfaction']
In [5]:
# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()
scores = cross_val_score(bnb, data, target, cv = 10, scoring="accuracy")
print("Percent accuracy within each fold:\n")
print(scores)
print("\nMean accuracy:\n")
print(scores.mean())
In [6]:
# Create a dataFrame to compare performance of Classifier Models
classifiers_compare = pd.DataFrame()
classifiers_compare = classifiers_compare.append([["BernoulliNB", (scores.mean())]])
In [7]:
gnb = GaussianNB()
scores = cross_val_score(gnb, data, target, cv = 10, scoring="accuracy")
print("Percent accuracy within each fold:\n")
print(scores)
print("\nMean accuracy:\n")
print(scores.mean())
classifiers_compare = classifiers_compare.append([["GaussianNB", (scores.mean())]])
In [8]:
mnb = MultinomialNB()
scores = cross_val_score(mnb, data, target, cv = 10, scoring="accuracy")
print("Percent accuracy within each fold:\n")
print(scores)
print("\nMean accuracy:\n")
print(scores.mean())
classifiers_compare = classifiers_compare.append([["MultinomialNB", (scores.mean())]])
In [9]:
dtc = DecisionTreeClassifier(max_depth=5)
scores = cross_val_score(dtc, data, target, cv = 10, scoring="accuracy")
scores
print("Percent accuracy within each fold:\n")
print(scores)
print("\nMean accuracy:\n")
print(scores.mean())
classifiers_compare = classifiers_compare.append([["DecisionTree", (scores.mean())]])
In [10]:
# search for an optimal value of K for KNN
k_range = range(1,30)
k_scores = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors = k)
scores = cross_val_score(knn, data, target, cv = 10,
scoring="accuracy")
k_scores.append(scores.mean())
print(k_scores)
In [11]:
import matplotlib.pyplot as plt
%matplotlib inline
# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel("Value of K for KNN")
plt.ylabel("Cross-Validated Accuracy")
Out[11]:
Looks like k=5 is the optimal k-value for this dataset.
In [12]:
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, data, target, cv = 10, scoring = "accuracy")
print("Percent accuracy within each fold:\n")
print(scores)
print("\nMean accuracy:\n")
print(scores.mean())
classifiers_compare = classifiers_compare.append([["KNN", (scores.mean())]])
In [13]:
classifiers_compare
Out[13]:
In [14]:
plt.figure(figsize=(15,5))
sns.set(style="whitegrid")
sns.barplot(x=0, y=1, data=classifiers_compare, palette="BuGn_d")
plt.xlabel("Model and Distribution")
plt.ylim(0,1)
plt.ylabel("Percent Accuracy")
plt.show()
None of these classifiers seem to overfit. 10-fold cross validation was used in all cases, and the highest rate of accuracy was 77.3% on this run.
The Bernoulli Naive Bayes model returned the best accuracy when cross-validation was used. This is because the Bernoulli distribution is well suited to cases with two possible outcomes.
The model used had a large impact on performance. Specific keywords are also likely to be impactful to performance. Let's look at the difference in accuracy when we remove the five most-common keywords from the model.
In [15]:
# Create a new Data Frame with the value counts for each key
mydf = pd.DataFrame(data.sum())
mydf.reset_index(inplace=True)
mydf.columns = ["key", "valueCounts"]
# Return the 5 most common keys found in the review set
myTopVals = sorted(list(mydf["valueCounts"]))[-5]
myTopVals
# Remove keys with more than threshold value of valuecounts.
newKeys = mydf[mydf.valueCounts < myTopVals].key
newKeys = list(newKeys)
In [16]:
data = amazon_raw[newKeys]
target = amazon_raw['satisfaction']
scores = cross_val_score(bnb, data, target, cv = 10, scoring="accuracy")
print("Percent accuracy within each fold:\n")
print(scores)
print("\nMean accuracy:\n")
print(scores.mean())
By removing the top 5 keys from the Bernoulli Naive Bayes mode, there was a change in overall accuracy from 77.3% to 62.2%. The choice of which keys to include in the model have a large impact on the overall accuracy of the model.