We build an analytics model using text as our data, specifically trying to understand the sentiment of tweets about the company Apple. This is a special classification problem, often called Sentiment Analysis.
The challenge is to see if we can correctly classify tweets as being negative, positive, or neutral about Apple.
In [21]:
import pandas as pd # Start by importing the tweets data
In [22]:
X = pd.read_csv('../datasets/tweets.csv')
In [23]:
X.shape
Out[23]:
In [24]:
X.columns
Out[24]:
In [25]:
X.info()
In [26]:
X.head(5)
Out[26]:
It contains 1181 tweets (as text), each with a manually labelled sentiment score (the Avg column).
In [27]:
min(X.Avg)
Out[27]:
In [28]:
max(X.Avg)
Out[28]:
2 means very positive, 0 is neutral and -2 is very negative
In [29]:
X.Avg.hist();
In [30]:
corpusTweets = X.Tweet.tolist() # get a list of all tweets; it is then easier to apply the preprocessing to each item
In [31]:
# Convert to lower-case
corpusLowered = [s.lower() for s in corpusTweets]
In [32]:
corpusLowered[0:5]
Out[32]:
In [33]:
# Remove punctuation
import re
corpusNoPunct = [re.sub(r'([^\s\w_]|_)+', ' ', s.strip()) for s in corpusLowered]
In [34]:
corpusNoPunct[0:5]
Out[34]:
Now we remove the stopwords. First we define which are the common words (stopwords) to be removed:
In [35]:
import os
def readStopwords():
    '''
    Returns the stopwords as a list of strings.
    Assumes that a file called "stopwords.txt"
    exists in the current folder.
    '''
    filename = "stopwords.txt"
    path = os.path.join("", filename)
    with open(path, 'r') as file:        # the context manager closes the file
        return file.read().splitlines()  # splitlines removes the newlines
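The function above assumes a local stopwords.txt file. If you do not have one at hand, a possible alternative (only a sketch, assuming NLTK's stopword corpus has been downloaded) is to use NLTK's English list instead; note that it is a different list, so results may differ slightly:

from nltk.corpus import stopwords
# import nltk; nltk.download('stopwords')  # run once to fetch the corpus
def readStopwordsNltk():
    # returns NLTK's English stopword list instead of reading stopwords.txt
    return stopwords.words('english')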
In [36]:
stopWords = set(readStopwords())
In [37]:
"the" in stopWords # quick test
Out[37]:
In [38]:
stopWords.add("apple")
stopWords.add("appl")
stopWords.add("iphone")
stopWords.add("ipad")
stopWords.add("ipod")
stopWords.add("itunes")
stopWords.add("ios")
stopWords.add("http")
print ("apple" in stopWords)
print ("google" in stopWords)
To remove a word from the corpus if that word is contained in our stopwords set, we need first to tokenise the corpus (i.e., split it into words or tokens):
In [39]:
# tokenise
corpusTokens = [s.split() for s in corpusNoPunct]
In [40]:
corpusTokens[0:3]
Out[40]:
In [41]:
# Stem document
from nltk import PorterStemmer
porter = PorterStemmer()
In [42]:
corpus = []
for tweet in corpusTokens:
cleanTokens = [token for token in tweet if token not in stopWords] # a list of tokens
stemmedTokens = [porter.stem(token) for token in cleanTokens]
cleanTweet = ' '.join(stemmedTokens)
corpus.append(cleanTweet)
In [43]:
corpus[0:5]
Out[43]:
In [44]:
from sklearn.feature_extraction.text import CountVectorizer
In [45]:
cv = CountVectorizer(lowercase=False, max_features=500)
cv.fit(corpus)
Out[45]:
In [46]:
'apple' in cv.vocabulary_ # a quick test
Out[46]:
In [47]:
cv.get_feature_names()[0:20] # in alphabetical order (in scikit-learn >= 1.0 use get_feature_names_out() instead)
Out[47]:
Now we use the vectoriser to transform the corpus into a sparse matrix where each row is a tweet and each column one of the 500 features; each entry counts how many times that feature appears in the tweet.
In [48]:
bagOfWords = cv.transform(corpus)
In [49]:
bagOfWords
Out[49]:
In [50]:
sum_words = bagOfWords.toarray().sum(axis=0)
In [51]:
words_freq = [(word, sum_words[idx]) for word, idx in cv.vocabulary_.items()]
In [52]:
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
In [53]:
words_freq[:10]
Out[53]:
We put it into a data frame to use it with the classifier:
In [54]:
df = pd.DataFrame(bagOfWords.toarray())
In [55]:
df.shape
Out[55]:
In [56]:
df.info()
In [57]:
df.head(1)
Out[57]:
We start by splitting the tweets into training and test sets, as usual
In [58]:
import numpy.random
numpy.random.seed(100) # just for reproducibility
In [59]:
from sklearn.model_selection import train_test_split
In [60]:
X.Avg = [int(round(a)) for a in X.Avg] # cluster target into 5 classes
In [61]:
X_train, X_test, y_train, y_test = train_test_split(df, X.Avg, test_size=0.25)
In [62]:
X_test.shape
Out[62]:
In [63]:
from sklearn.naive_bayes import MultinomialNB
In [64]:
classifier = MultinomialNB()
In [65]:
classifier.fit(X_train, y_train)
Out[65]:
In [66]:
predictions = classifier.predict(X_test)
In [67]:
predictions[0:100]
Out[67]:
In [68]:
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy: {:.2}".format(metrics.accuracy_score(y_test, predictions)))
The classifier predicted the exact sentiment class 64% of the time (it had to distinguish not only whether a tweet was negative, but also whether it was strongly or moderately negative).
A very useful metric is the confusion matrix that displays the predictions and the actual values in a matrix:
In [69]:
mat = metrics.confusion_matrix(y_test, predictions)
In [70]:
mat
Out[70]:
It's clearer if we visualise it as a heat map:
In [71]:
import matplotlib.pyplot as plt
In [72]:
labels = ['strongly neg.', 'negative', 'neutral', 'positive', 'strongly pos.']
fig = plt.figure()
ax = fig.add_subplot(111)
cm = ax.matshow(mat)
# plot the title, use y to leave some space before the labels
plt.title("Confusion matrix - Tweets arranged by sentiment", y=1.2)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.setp(ax.get_xticklabels(), rotation=-30, ha="right",
         rotation_mode="anchor")
plt.xlabel("Predicted")
plt.ylabel("Actual")
# Loop over data dimensions and create text annotations.
for i in range(len(mat)):
    for j in range(len(mat)):
        text = ax.text(j, i, mat[i, j],
                       ha="center", va="center", color="w")
# Create colorbar
fig.colorbar(cm);
The numbers in the diagonal are all the times when the predicted sentiment for a tweet was the same as the actual sentiment.
Now we can define accuracy as the sum of all the values in the diagonal divided by the total of the values.
The best accuracy would be 1.0 when all values are on the diagonal (no errors!), whereas the worst is 0.0 (nothing correct)!
In [73]:
correctPredictions = sum(mat[i][i] for i in range(len(mat)))
correctPredictions
Out[73]:
In [74]:
print("Accuracy: {:.2}".format(correctPredictions / len(y_test)))
In [75]:
neutralTweets = sum(1 for sentiment in y_test if sentiment == 0) # neutral tweets in Test dataset
neutralTweets
Out[75]:
In [76]:
len(y_test) - neutralTweets
Out[76]:
This tells us that in our test dataset we have 178 observations with neutral sentiment and 118 tweets that are positive or negative.
So the accuracy of a baseline model that always predicts the most frequent class (neutral) would be:
In [77]:
print("Accuracy baseline: {:.2}".format(neutralTweets / len(y_test)))
So our Naive Bayesian model does better than the simple baseline.
The classifier can be applied to new tweets, of course, to predict their sentiment:
In [78]:
# For simplicity, it re-uses the vectoriser and the classifier without passing them
# as arguments. Industrialising it would mean creating a pipeline:
# vectoriser > classifier > label string (see the sketch after the examples below)
def predictSentiment(t):
    bow = cv.transform([t])
    prediction = classifier.predict(bow)[0]  # take the single predicted class
    if prediction == 0:
        return "Neutral"
    elif prediction > 0:
        return "Positive"
    else:
        return "Negative"
In [79]:
predictSentiment("I don't know what to think about apple!")
Out[79]:
OK. Let's try with two new tweets, one positive and one negative, and see what we get:
In [80]:
predictSentiment("I love apple, its products are always the best, really!")
Out[80]:
In [81]:
predictSentiment("Apple lost its mojo, I will never buy again an iphone better an Android")
Out[81]:
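As mentioned in the comment above, an industrialised version would wrap the vectoriser and the classifier into a single scikit-learn Pipeline. Below is only a minimal sketch of that idea, re-fitting both steps on the whole pre-processed corpus and the five-class labels just for illustration (the names sentimentPipeline and predictSentimentPipeline are made up here; a complete version would also include the cleaning and stemming as pipeline steps):

from sklearn.pipeline import Pipeline

sentimentPipeline = Pipeline([
    ('vectoriser', CountVectorizer(lowercase=False, max_features=500)),
    ('classifier', MultinomialNB()),
])
sentimentPipeline.fit(corpus, X.Avg)   # pre-processed tweets in, labels out

def predictSentimentPipeline(t):
    # the pipeline applies the vectoriser before the classifier
    prediction = sentimentPipeline.predict([t])[0]
    if prediction == 0:
        return "Neutral"
    return "Positive" if prediction > 0 else "Negative"

predictSentimentPipeline("I love apple, its products are always the best, really!")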
Now, this is a more generic case, where we have 5 classes as target.
The case you see more often is the binary one, with only two classes, which has some special characteristics and metrics.
Let's convert our target into a binary one: a tweet can be either negative or not negative (i.e., positive or neutral).
First of all, we need to transform our original dataset to reduce the sentiment classes to only two classes.
In [82]:
X.loc[X.Avg < 0, 'Avg'] = -1  # negative sentiment (select only the Avg column, otherwise the whole row would be overwritten)
X.loc[X.Avg >= 0, 'Avg'] = 1  # NON-negative sentiment
We need to re-split the data and re-train the classifier:
In [83]:
X_train, X_test, y_train, y_test = train_test_split(df, X.Avg, test_size=0.25)
In [86]:
classifier = MultinomialNB() # 0.77
In [89]:
classifier.fit(X_train, y_train)
Out[89]:
In [90]:
predictionsTwo = classifier.predict(X_test)
In [91]:
predictionsTwo[0:100]
Out[91]:
As you can see, there are no classes 2, 0 or -2 any more: only -1 and 1 remain.
In [92]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy: {:.2}".format(metrics.accuracy_score(y_test, predictionsTwo)))
Of course it is better: with fewer classes to predict, there are fewer errors to make.
Let's see what the confusion matrix looks like:
In [93]:
matBinary = metrics.confusion_matrix(y_test, predictionsTwo)
matBinary
Out[93]:
In [94]:
labels = ['negative', 'NOT negative']
fig = plt.figure()
ax = fig.add_subplot(111)
cm = ax.matshow(matBinary)
# plot the title, use y to leave some space before the labels
plt.title("Confusion matrix - Tweets arranged by sentiment", y=1.2)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.setp(ax.get_xticklabels(), rotation=-30, ha="right",
         rotation_mode="anchor")
plt.xlabel("Predicted")
plt.ylabel("Actual")
# Loop over data dimensions and create text annotations.
for i in range(len(matBinary)):
    for j in range(len(matBinary)):
        text = ax.text(j, i, matBinary[i, j],
                       ha="center", va="center", color="w")
# Create colorbar
fig.colorbar(cm);
In a two-class problem, we are often looking to discriminate the observations with a specific outcome from the normal observations: for example, disease versus no disease, or spam versus not spam. One class is the positive event and the other is the no-event, or negative event.
In our case, let's say the negative event is the negative tweet and the positive event is the NON-negative tweet.
These are basic terms used in binary classification:
“true positive” for correctly predicted event values (in our scenario the non-negative tweets: positive or neutral).
“true negative” for correctly predicted no-event values (in our scenario the negative tweets).
“false positive” for incorrectly predicted event values. In hypothesis testing it is also known as a Type 1 error: the incorrect rejection of the null hypothesis.
“false negative” for incorrectly predicted no-event values. It is also known as a Type 2 error: the failure to reject a null hypothesis that is actually false.
In [95]:
tn, fp, fn, tp = matBinary.ravel()
In [96]:
print("True Negatives: ",tn)
print("False Positives: ",fp)
print("False Negatives: ",fn)
print("True Positives: ",tp)
Accuracy can be re-formulated as the ratio between the correct predictions (true positives plus true negatives) and the total number of predictions:
In [97]:
Accuracy = (tn+tp)/(tp+tn+fp+fn)
print("Accuracy: {:.2f}".format(Accuracy))
Accuracy is not always a reliable metric for the real performance of a classifier, because it can yield misleading results when the dataset is unbalanced (that is, when the numbers of observations in the different classes vary greatly).
In that case you may consider additional metrics such as Precision, Recall and the F score (a combined metric).
Recall is the 'completeness': the ability of the model to identify all the relevant instances. It is also known as the True Positive Rate, or Sensitivity.
Imagine a scenario where your focus is to have as few false negatives as possible, for example when classifying emails as spam or not: if the positive class is the legitimate email, you don't want authentic messages to be wrongly classified as spam (i.e. to end up as false negatives). Then Sensitivity comes to the rescue:
In [98]:
Sensitivity = tp/(tp+fn)
print("Sensitivity {:0.2f}".format(Sensitivity))
Sensitivity is a real number between 0 and 1. A sensitivity of 1 means that ALL the positive cases have been correctly classified (there are no false negatives).
In [99]:
# Specificity: the True Negative Rate, i.e. the ability of the model to correctly identify the negative cases
Specificity = tn/(tn+fp)
print("Specificity {:0.2f}".format(Specificity))
Until now, we have seen classification problems where we predict the target class directly.
Sometimes it can be more insightful or flexible to predict the probabilities for each class instead. On one side you get an idea of how confident the classifier is for each class; on the other you can use the probabilities to calibrate the threshold at which they are turned into predicted classes.
For example, in a binary classifier the default is to use a threshold of 0.5, meaning that a probability below 0.5 is a negative outcome and a probability equal to or above 0.5 is a positive outcome. But this threshold can be adjusted to tune the behaviour of the model for the specific problem, e.g. to reduce one type of error more than the other, as we have seen above. Think about a classifier that predicts whether an event is a nuclear attack or not: clearly you want as few false alarms as possible!
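As a quick sketch (assuming the binary classifier and X_test from the cells above), thresholding the predicted probability of the positive class at 0.5 by hand should reproduce the default predict() behaviour; the variable names are just for illustration:

import numpy as np
probaPositive = classifier.predict_proba(X_test)[:, 1]     # probability of class 1 (non-negative)
manualPredictions = np.where(probaPositive >= 0.5, 1, -1)  # apply the default 0.5 threshold
print((manualPredictions == classifier.predict(X_test)).all())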
A diagnostic tool that helps in choosing the right threshold is the ROC curve.
This is the plot of the ‘True Positive Rate’ (Sensitivity) on the y-axis against the ‘False Positive Rate’ (1 minus Specificity) on the x-axis, at different classification thresholds between 0 and 1.
It captures all the thresholds simultaneously, and the area under the ROC curve measures how well a parameter can distinguish between two groups. A threshold of 1 sits at the axis origin (0,0), while a threshold of 0 sits at the top right end of the curve (1,1).
Put another way, it plots the false alarm rate versus the hit rate.
Let's see an example using our binary classification above.
First, you need probabilities to create the ROC curve.
In [100]:
probs = classifier.predict_proba(X_test) # get the probabilities
In [101]:
preds = probs[:,1] ## keep probabilities for the positive outcome only
fpr, tpr, threshold = metrics.roc_curve(y_test, preds) # calculate roc
roc_auc = metrics.auc(fpr, tpr) # calculate AUC
In [102]:
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.plot([0, 1], [0, 1],'r--') # plot random guessing
plt.legend(loc = 'lower right')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
The ROC curve is a useful tool for a few reasons:
A random guessing classifier (the red dashed line above) has an Area Under the Curve (often referred to as AUC) of 0.5, while the AUC of a perfect classifier is equal to 1. In general, an AUC above 0.8 is considered "good".
Looking at the ROC curve you can choose a threshold that gives a desirable balance between the false positive rate (false alarms) and the true positive rate (hits).
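As a sketch of how such a threshold could be picked programmatically (assuming the fpr, tpr, threshold and preds arrays computed above; Youden's J statistic, used here, is just one common heuristic and not part of the original analysis):

import numpy as np
best_idx = np.argmax(tpr - fpr)          # Youden's J: maximise TPR minus FPR
best_threshold = threshold[best_idx]
print("Chosen threshold: {:.2f}".format(best_threshold))
# apply it: probability >= threshold -> non-negative (1), otherwise negative (-1)
customPredictions = np.where(preds >= best_threshold, 1, -1)
print("Accuracy at this threshold: {:.2}".format(metrics.accuracy_score(y_test, customPredictions)))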
In [103]:
# Precision
Precision = tp/(tp+fp)
print("Precision or Positive Predictive Power: {:0.2f}".format(Precision))
Similarly, you can calculate the Negative Predictive Power:
In [104]:
# Negative Predictive Value
print("Negative predictive Power: {:0.2f}".format(tn / (tn+fn)))
The F1 score is the harmonic mean of the Precision & Sensitivity, and is used to indicate a balance between them. It ranges from 0 to 1; F1 Score reaches its best value at 1 (perfect precision & sensitivity) and worst at 0.
In [105]:
# F1 Score
f1 = (2 * Precision * Sensitivity) / (Precision + Sensitivity)
print("F1 Score {:0.2f}".format(f1))
In [106]:
classifierTuned = MultinomialNB(class_prior=[.4, 0.6]) # manually set the class priors (for classes -1 and 1) to try to maximise specificity
In [107]:
classifierTuned.fit(X_train, y_train)
predictionsTuned = classifierTuned.predict(X_test)
In [108]:
matTuned = metrics.confusion_matrix(y_test, predictionsTuned)
matTuned
Out[108]:
In [109]:
tn, fp, fn, tp = matTuned.ravel()
In [110]:
Accuracy = (tn+tp)/(tp+tn+fp+fn)
print("Accuracy: {:.2f}".format(Accuracy)) # it was 0.79
In [111]:
Sensitivity = tp/(tp+fn)
print("Sensitivity {:0.2f}".format(Sensitivity)) #it was 0.9
In [112]:
Specificity = tn/(tn+fp)
print("Specificity {:0.2f}".format(Specificity)) # it was 0.53
We have greatly improved the specificity at the cost of a smaller decrease of the sensitivity and accuracy.
In a 2x2 problem, once you have picked one category as positive, the other is automatically negative. With 5 categories you basically have 5 different sensitivities, depending on which of the five categories you pick as "positive". You can still calculate the metrics by collapsing the problem to a 2x2 one, i.e. Class1 versus not-Class1, then Class2 versus not-Class2, and so on, as we did above.
So you can have sensitivity and specificity regardless of the number of classes; the only difference is that you get one sensitivity, specificity, accuracy and F1-score for each of the classes. If you want a single number to report, you can report the average of these values.
We have to do these calculations for each class separately and then average the measures, to get the average precision and the average recall. I leave this as an exercise for you; a sketch follows below.
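Purely as a sketch of that exercise (not a full solution): scikit-learn's helpers compute the per-class and macro-averaged metrics directly. Note that y_test and predictions below refer to the earlier five-class split, which was later overwritten by the binary split, so you would need to re-run those cells (or keep the five-class variables under different names) first:

from sklearn.metrics import classification_report, precision_recall_fscore_support

# per-class precision, recall and F1, plus their macro and weighted averages
print(classification_report(y_test, predictions))

# the macro average alone: each metric is computed per class and then averaged
p, r, f, _ = precision_recall_fscore_support(y_test, predictions, average='macro')
print("Macro-averaged precision: {:.2f}  recall: {:.2f}  F1: {:.2f}".format(p, r, f))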