Load the data from Mandrill.xlsx, included in the chapter 3 download at http://media.wiley.com/product_ancillary/6X/11186614/DOWNLOAD/ch03.zip
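If you prefer to fetch the archive programmatically, here is a minimal sketch using the third-party requests library together with the standard zipfile module (the exact layout inside the archive is an assumption; adjust the extraction path to match where you keep the workbook):
import io
import zipfile
import requests

# Download the chapter 3 archive and unpack it into ./ch03/
url = 'http://media.wiley.com/product_ancillary/6X/11186614/DOWNLOAD/ch03.zip'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    archive.extractall('ch03')  # Mandrill.xlsx should now be somewhere under ./ch03/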
In [1]:
# code written in py_3.0
import pandas as pd
import numpy as np
Load HAM training data - i.e., tweets about the product
In [2]:
# find path to your Mandrill.xlsx
df_ham = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=0)
df_ham = df_ham.iloc[0:, 0:1]
df_ham.head() # use .head() to just show top 5 results
Out[2]:
Load SPAM training data - i.e., tweets NOT about the product
In [3]:
df_spam = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=1)
df_spam = df_spam.iloc[0:, 0:1]
df_spam.head()
Out[3]:
Install the Natural Language Toolkit: http://www.nltk.org/install.html. You may also need to download NLTK's data packages (the 'punkt' tokenizer models and the 'stopwords' corpus used below)
In [4]:
# python -m nltk.downloader punkt
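The same data can also be fetched from inside Python; a minimal sketch covering the 'punkt' tokenizer models and the 'stopwords' corpus used later on:
import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stop-word lists used below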
In [5]:
from nltk.tokenize import word_tokenize
test = df_ham.Tweet[0]
print(word_tokenize(test))
Following Marco Bonzanini's example https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/, I set up a pre-processing chain that recognises @-mentions, emoticons, URLs and #hash-tags as tokens
In [6]:
import re
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
In [7]:
print(preprocess(test))
In [8]:
tweet = preprocess(test)
tweet
Out[8]:
Remove common stop-words plus some non-default stop-words: 'rt' (i.e., re-tweet), 'via' (used in mentions), the ellipsis '…', and a few stray punctuation characters
In [9]:
from nltk.corpus import stopwords
import string
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via', '…', '¿', '“', '”']
In [10]:
tweet_stop = [term for term in preprocess(test) if term not in stop]
In [11]:
tweet_stop
Out[11]:
In [12]:
from collections import Counter
count_all = Counter()
for tweet in df_ham.Tweet:
    # Create a list with all the terms
    terms_all = [term for term in preprocess(tweet) if term not in stop]
    # Update the counter
    count_all.update(terms_all)
# Print the 10 most frequent words
print(count_all.most_common(10))
With the pre-processing chain in place, clean the full training sets and add a classification label to each tweet
In [13]:
df_ham["Tweet"] = df_ham["Tweet"].str.lower()
clean = []
for row in df_ham["Tweet"]:
    tweet = [term for term in preprocess(row) if term not in stop]
    clean.append(' '.join(tweet))
df_ham["Tweet"] = clean # we now have clean tweets
df_ham["Class"] = 'ham' # add classification
df_ham.head()
Out[13]:
In [14]:
df_spam["Tweet"] = df_spam["Tweet"].str.lower()
clean = []
for row in df_spam["Tweet"]:
    tweet = [term for term in preprocess(row) if term not in stop]
    clean.append(' '.join(tweet))
df_spam["Tweet"] = clean # we now have clean tweets
df_spam["Class"] = 'spam' # add classification
df_spam.head()
Out[14]:
In [15]:
df_data = pd.concat([df_ham,df_spam])
df_data = df_data.reset_index(drop=True)
df_data = df_data.reindex(np.random.permutation(df_data.index))
df_data.head()
Out[15]:
In [16]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(df_data["Tweet"].values)
counts
Out[16]:
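counts is a sparse document-term matrix: one row per tweet, one column per distinct term in the training vocabulary. A minimal sketch of inspecting it (in newer scikit-learn versions get_feature_names() has been replaced by get_feature_names_out()):
print(counts.shape)  # (number of tweets, size of the vocabulary)
print(count_vectorizer.get_feature_names()[:10])  # a few of the learned terms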
In [17]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
targets = df_data['Class'].values
classifier.fit(counts, targets)
Out[17]:
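As a quick sanity check, the fitted classifier can score a couple of made-up tweets (both examples below are invented for illustration; they must be transformed with the vectorizer that was fit above, not a new one):
examples = ['mandrill api makes sending transactional email easy',  # hypothetical on-topic tweet
            'saw a mandrill at the zoo today #monkeys']             # hypothetical off-topic tweet
example_counts = count_vectorizer.transform(examples)
print(classifier.predict(example_counts))  # one 'ham'/'spam' label per example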
Load the testing data
In [18]:
df_test = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=6)
df_test = df_test.iloc[0:, 2:3]
df_test.head()
Out[18]:
In [19]:
df_test["Tweet"] = df_test["Tweet"].str.lower()
clean = []
for row in df_test["Tweet"]:
    tweet = [term for term in preprocess(row) if term not in stop]
    clean.append(' '.join(tweet))
df_test["Tweet"] = clean # we now have clean tweets
df_test.head()
Out[19]:
In [20]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())])
pipeline.fit(df_data['Tweet'].values, df_data['Class'].values)
df_test["Prediction Class"] = pipeline.predict(df_test['Tweet'].values) # add classification ['spam', 'ham']
df_test
Out[20]:
In [21]:
true_class = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=6)
df_test["True Class"] = true_class.iloc[0:, 1:2]
df_test
Out[21]:
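With the true labels alongside the predictions, a minimal sketch of checking the hit rate on this sheet (this assumes the 'True Class' column uses the same 'ham'/'spam' coding as the training data; if the workbook uses different labels, map them first):
accuracy = (df_test['Prediction Class'] == df_test['True Class']).mean()
print('Fraction of held-out tweets classified correctly:', accuracy)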
Naturally, in a business application we will generally not have an independent set of test data available. To get around this we can use cross-validation: the training data is split into k folds, the model is trained on all but one fold and tested on the remaining one, and the process is repeated so that every fold serves as the test set once. Here we use scikit-learn's 'KFold' function with 6 folds, so each pass trains on roughly 83% of the tweets and tests on the remaining 17%, and we average the scores across the folds
In [36]:
from sklearn.cross_validation import KFold
from sklearn.metrics import confusion_matrix, f1_score
k_fold = KFold(n=len(df_data), n_folds=6)
scores = []
confusion = np.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold:
    train_text = df_data.iloc[train_indices]['Tweet'].values
    train_y = df_data.iloc[train_indices]['Class'].values
    test_text = df_data.iloc[test_indices]['Tweet'].values
    test_y = df_data.iloc[test_indices]['Class'].values
    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label='spam')
    scores.append(score)
print('Total tweets classified:', len(df_data))
print('Score:', sum(scores)/len(scores))
The F1 score is a measure of a test's accuracy that combines precision and recall (it is their harmonic mean). It reaches its best value at 1 and worst at 0, so the model's score of 0.836 is not bad for a first pass
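To see where the score comes from, a minimal sketch (reusing test_y and predictions from the final fold of the loop above) recovering F1 from precision and recall:
from sklearn.metrics import precision_score, recall_score, f1_score

p = precision_score(test_y, predictions, pos_label='spam')
r = recall_score(test_y, predictions, pos_label='spam')
print(2 * p * r / (p + r))  # harmonic mean of precision and recall
print(f1_score(test_y, predictions, pos_label='spam'))  # same value via scikit-learn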
In [35]:
print('Confusion matrix:')
print(confusion)
A confusion matrix helps us understand how the model performed for each class. Out of the 300 tweets, the model incorrectly classified about 39 tweets that are about the product, and 6 tweets that are not
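For reference, scikit-learn orders the matrix by sorted class label ('ham' before 'spam'), with rows giving the true class and columns the predicted class. A minimal sketch of unpacking the aggregated counts:
# order after ravel(): ham->ham, ham->spam, spam->ham, spam->spam
tn, fp, fn, tp = confusion.ravel()
print('ham tweets wrongly flagged as spam:', fp)
print('spam tweets that slipped through as ham:', fn)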
In order to improve the results there are two approaches we can take: