It's time to revisit your classifier from the previous assignment. Using the evaluation techniques we've covered here, look at your classifier's performance in more detail. Then go back and iterate. Repeat this process until you have five different versions of your classifier. Once you've iterated, answer these questions to compare the performance of each:
Write up your iterations and answers to the above questions in a few pages. Submit a link below and go over it with your mentor to see if they have any other ideas on how you could improve your classifier's performance.
Five versions of my classifier:
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
In [2]:
# Collect a (name, accuracy) row for each classifier variant as we go.
classifiers_compare = pd.DataFrame(columns=["Name", "Accuracy"])
In [3]:
# Grab and process the raw data.
data_path = ("/Users/jacquelynzuker/Desktop/sentiment labelled sentences/amazon_cells_labelled.txt"
)
amazon_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
amazon_raw.columns = ['message', 'satisfaction']
In [4]:
# Cue words and phrases associated with positive or negative sentiment.
keywords = ['must have', 'excellent', 'awesome', 'recommend', 'good',
            'great', 'happy', 'love', 'satisfied', 'best', 'works',
            'liked', 'easy', 'quick', 'incredible', 'perfectly',
            'right', 'cool', 'joy', 'easier', 'fast', 'nice', 'family',
            'sweetest', 'poor', 'broke', 'doesn\'t work',
            'not work', 'died', 'don\'t buy', 'problem', 'useless',
            'awful', 'failed', 'terrible', 'horrible', '10']
for key in keywords:
    # Case-insensitive substring match: this flags the keyword anywhere in
    # the message, not only as a standalone word.
    amazon_raw[str(key)] = amazon_raw.message.str.contains(
        str(key), case=False, regex=False
    )
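As a quick sanity check (picking a few of the keywords above as column names), we can peek at the new boolean feature columns alongside the raw text:
In [ ]:
# Spot-check a few engineered features against the original messages.
amazon_raw[['message', 'good', 'poor', 'works']].head()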
In [5]:
data = amazon_raw[keywords]
target = amazon_raw['satisfaction']
In [6]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB
# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()
# Fit our model to the data.
bnb.fit(data, target)
# Classify, storing the result in a new variable.
y_pred = bnb.predict(data)
# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
data.shape[0],
(target != y_pred).sum()
))
In [7]:
classifiers_compare = classifiers_compare.append([["straightUp", (target == y_pred).mean()]])
print("Without a training dataset, the Bernoulli Naive Bayes model estimated accuracy: {}%".format(
(target == y_pred).mean()*100))
In [8]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    data, target, test_size=0.3, random_state=0)
In [9]:
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(x_train, y_train).predict(x_test)
The model was fit on the training data. Now let's see how well it predicts the held-out test data.
In [10]:
test_accuracy = (y_test == y_pred).mean()
print(test_accuracy)
classifiers_compare.loc[len(classifiers_compare)] = ["bernoulli70", test_accuracy]
In [11]:
x_test.shape[0]
In [12]:
confusion_matrix(y_test, y_pred)
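As a sketch of how to read that matrix (assuming scikit-learn's convention that rows are true labels and columns are predictions, with 1 = satisfied), sensitivity and specificity fall straight out of its four cells:
In [ ]:
# Unpack the 2x2 matrix: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Sensitivity (true positive rate): {:.3f}".format(tp / (tp + fn)))
print("Specificity (true negative rate): {:.3f}".format(tn / (tn + fp)))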
In [13]:
clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
clf.score(x_test, y_test)
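The SVM score above comes from a single 70/30 split. For a fairer comparison with the cross-validated models below, the same estimator can be cross-validated as well (a sketch, reusing the linear kernel with C=1):
In [ ]:
from sklearn.model_selection import cross_val_score
# Score the linear SVM with 10-fold cross-validation on the full feature matrix.
svm_scores = cross_val_score(svm.SVC(kernel='linear', C=1), data, target,
                             cv=10, scoring="accuracy")
print(svm_scores.mean())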
In [14]:
from sklearn.model_selection import cross_val_score
In [15]:
# Our data is binary / boolean, so we're importing the Bernoulli classifier.
from sklearn.naive_bayes import BernoulliNB
# Instantiate our model and store it in a new variable.
bnb = BernoulliNB()
scores = cross_val_score(bnb, data, target, cv=3, scoring="accuracy")
print(scores)
In [16]:
print(scores.mean())
In [17]:
classifiers_compare = classifiers_compare.append([["bernoulli_CV3", (scores.mean())]])
In [18]:
scores = cross_val_score(bnb, data, target, cv=10, scoring="accuracy")
print(scores)
In [19]:
print(scores.mean())
In [20]:
classifiers_compare = classifiers_compare.append([["bernoulli_CV10", (scores.mean())]])
In [21]:
from sklearn.model_selection import cross_val_score
In [22]:
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, data, target, cv = 3, scoring = "accuracy")
print(scores)
In [23]:
print(scores.mean())
In [24]:
classifiers_compare = classifiers_compare.append([["KNN_CV3", (scores.mean())]])
In [25]:
# Search for an optimal value of K for KNN.
k_range = range(1, 30)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, data, target, cv=10,
                             scoring="accuracy")
    k_scores.append(scores.mean())
print(k_scores)
In [26]:
import matplotlib.pyplot as plt
%matplotlib inline
# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel("Value of K for KNN")
plt.ylabel("Cross-Validated Accuracy")
Out[26]: [line plot of cross-validated accuracy versus K]
Looks like k=5 is the optimal k-value for this dataset.
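Rather than reading the optimum off the plot, the best k can also be pulled out programmatically (a small sketch; k_range starts at 1, so list index i corresponds to k = i + 1):
In [ ]:
# Map the index of the best cross-validated score back to its k value.
best_k = k_range[int(np.argmax(k_scores))]
print("Best k: {} (accuracy: {:.3f})".format(best_k, max(k_scores)))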
In [27]:
classifiers_compare = classifiers_compare.append([["KNN_CV10K5", (k_scores[4])]])
The straight-up classifier (fit with no train/test split) is the most likely to be overfit: the same data used to fit the model is also used to estimate its accuracy, so the estimate is optimistically biased.
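One way to see this concretely (a sketch reusing the Bernoulli model from above) is to compare the in-sample accuracy against the 10-fold cross-validated accuracy; a noticeably higher in-sample score is the signature of overfitting:
In [ ]:
# In-sample accuracy: fit and score on the same data (optimistic).
in_sample = BernoulliNB().fit(data, target).score(data, target)
# Cross-validated accuracy: every point is scored by a model that never saw it.
cv_accuracy = cross_val_score(BernoulliNB(), data, target, cv=10, scoring="accuracy").mean()
print("In-sample: {:.3f}   Cross-validated: {:.3f}".format(in_sample, cv_accuracy))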
The Bernoulli Naive Bayes model evaluated with 10-fold cross-validation is the least susceptible to overfitting and gives the most reliable accuracy estimate.
The choice of model also had a large impact on performance: in this case, Bernoulli Naive Bayes outperformed K-Nearest Neighbors.
In [28]:
classifiers_compare
Out[28]: [table of each classifier variant's name and estimated accuracy]
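For easier reading, the comparison table can also be sorted by estimated accuracy (a small convenience, assuming the Name/Accuracy columns defined when the DataFrame was created):
In [ ]:
# Rank the classifier variants from best to worst estimated accuracy.
classifiers_compare.sort_values("Accuracy", ascending=False)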