In [1]:
# code written in Python 3

import pandas as pd
import numpy as np

Load HAM training data - i.e., tweets about the product


In [2]:
# update this path to point at your local copy of Mandrill.xlsx
df_ham = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=0)
df_ham = df_ham.iloc[0:, 0:1]
df_ham.head() # use .head() to just show top 5 results


Out[2]:
Tweet
0 [blog] Using Nullmailer and Mandrill for your ...
1 [blog] Using Postfix and free Mandrill email s...
2 @aalbertson There are several reasons emails g...
3 @adrienneleigh I just switched it over to Mand...
4 @ankeshk +1 to @mailchimp We use MailChimp for...

Load SPAM training data - i.e., tweets NOT about the product


In [3]:
df_spam = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=1)
df_spam = df_spam.iloc[0:, 0:1]
df_spam.head()


Out[3]:
Tweet
0 ¿En donde esta su remontada Mandrill?
1 .@Katie_PhD Alternate, 'reproachful mandrill' ...
2 .@theophani can i get "drill" in there? it wou...
3 “@ChrisJBoyland: Baby Mandrill Paignton Zoo 29...
4 “@MISSMYA #NameAnAmazingBand MANDRILL!” Mint C...

Install the Natural Language Toolkit: http://www.nltk.org/install.html. You may also need to download NLTK's tokeniser and stop-word data


In [4]:
# download the tokeniser and stop-word data used below:
# python -m nltk.downloader punkt stopwords

In [5]:
from nltk.tokenize import word_tokenize
 
test = df_ham.Tweet[0]
print(word_tokenize(test))


['[', 'blog', ']', 'Using', 'Nullmailer', 'and', 'Mandrill', 'for', 'your', 'Ubuntu', 'Linux', 'server', 'outboud', 'mail', ':', 'http', ':', '//bit.ly/ZjHOk7', '#', 'plone']

Following Marco Bonzanini's example https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/, I set up a pre-processing chain that recognises @-mentions, emoticons, URLs, and #hash-tags as single tokens


In [6]:
import re
 
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)
 
def tokenize(s):
    return tokens_re.findall(s)
 
def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

In [7]:
print(preprocess(test))


['[', 'blog', ']', 'Using', 'Nullmailer', 'and', 'Mandrill', 'for', 'your', 'Ubuntu', 'Linux', 'server', 'outboud', 'mail', ':', 'http://bit.ly/ZjHOk7', '#plone']

In [8]:
tweet = preprocess(test)
tweet


Out[8]:
['[',
 'blog',
 ']',
 'Using',
 'Nullmailer',
 'and',
 'Mandrill',
 'for',
 'your',
 'Ubuntu',
 'Linux',
 'server',
 'outboud',
 'mail',
 ':',
 'http://bit.ly/ZjHOk7',
 '#plone']

Remove common stop-words, plus punctuation and the non-default stop-words: 'RT' (i.e., re-tweet), 'via' (used in mentions), and the ellipsis character '…'


In [9]:
from nltk.corpus import stopwords
import string
 
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via', '…', '¿', '“', '”']

In [10]:
tweet_stop = [term for term in preprocess(test) if term not in stop]

In [11]:
tweet_stop


Out[11]:
['blog',
 'Using',
 'Nullmailer',
 'Mandrill',
 'Ubuntu',
 'Linux',
 'server',
 'outboud',
 'mail',
 'http://bit.ly/ZjHOk7',
 '#plone']

In [12]:
from collections import Counter

count_all = Counter()
for tweet in df_ham.Tweet:
    # Create a list with all the terms
    terms_all = [term for term in preprocess(tweet) if term not in stop]
    # Update the counter
    count_all.update(terms_all)
# Print the 10 most frequent words
print(count_all.most_common(10))


[('Mandrill', 86), ('http://help.mandrill.com', 22), ('email', 21), ('I', 18), ('request', 16), ('@mandrillapp', 14), ('details', 13), ('emails', 13), ('de', 12), ('mandrill', 12)]

Clean the training data: lower-case each tweet, tokenise it, and strip stop-words


In [13]:
df_ham["Tweet"] = df_ham["Tweet"].str.lower()

clean = []
for row in df_ham["Tweet"]:
    tweet = [term for term in preprocess(row) if term not in stop]
    clean.append(' '.join(tweet))
    
df_ham["Tweet"] = clean  # we now have clean tweets
df_ham["Class"] = 'ham' # add classification 
df_ham.head()


Out[13]:
Tweet Class
0 blog using nullmailer mandrill ubuntu linux se... ham
1 blog using postfix free mandrill email service... ham
2 @aalbertson several reasons emails go spam min... ham
3 @adrienneleigh switched mandrill let's see imp... ham
4 @ankeshk 1 @mailchimp use mailchimp marketing ... ham

In [14]:
df_spam["Tweet"] = df_spam["Tweet"].str.lower()

clean = []
for row in df_spam["Tweet"]:
    tweet = [term for term in preprocess(row) if term not in stop]
    clean.append(' '.join(tweet))
    
df_spam["Tweet"] = clean  # we now have clean tweets
df_spam["Class"] = 'spam' # add classification 
df_spam.head()


Out[14]:
Tweet Class
0 en donde esta su remontada mandrill spam
1 @katie_phd alternate reproachful mandrill cove... spam
2 @theophani get drill would picture mandrill ho... spam
3 @chrisjboyland baby mandrill paignton zoo 29 t... spam
4 @missmya #nameanamazingband mandrill mint cond... spam

In [15]:
df_data = pd.concat([df_ham,df_spam])
df_data = df_data.reset_index(drop=True)
df_data = df_data.reindex(np.random.permutation(df_data.index))
df_data.head()


Out[15]:
Tweet Class
281 spark mandrill theme #nerdatwork spam
188 @mandrill n k n k n 5 k 4 correction spam
112 mandrill webhooks interspire bounce processing... ham
43 @matt_pickett u want reach mailchimp mandrill ... ham
242 gostei de um vídeo @youtube de @franciscodanrl... spam

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(df_data["Tweet"].values)
counts


Out[16]:
<300x1662 sparse matrix of type '<class 'numpy.int64'>'
	with 3493 stored elements in Compressed Sparse Row format>
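To see what this sparse matrix holds, here is a minimal sketch on a hypothetical two-document corpus (illustrative data, not from the workbook): CountVectorizer assigns one column to each vocabulary term, and each cell counts that term's occurrences in a document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, just to illustrate the bag-of-words encoding
docs = ["mandrill email service", "spark mandrill theme"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # one column per unique term
print(X.toarray())              # rows = documents, cells = term counts
```

The 300x1662 matrix above is the same structure: 300 tweet rows and 1,662 vocabulary columns, stored sparsely because most cells are zero.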

In [17]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
targets = df_data['Class'].values
classifier.fit(counts, targets)


Out[17]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
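The fitted classifier can now score unseen text, provided the new text goes through the same fitted vectorizer. A self-contained sketch on toy data (illustrative tweets, not from the workbook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the cleaned tweets (illustrative data only)
texts = ["mandrill email api", "mandrill api request",
         "zoo mandrill photo", "baby mandrill zoo"]
labels = ["ham", "ham", "spam", "spam"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

# New, unseen tweets must be transformed with the SAME fitted vectorizer
new = vec.transform(["email api request", "mandrill zoo"])
print(clf.predict(new))  # ['ham' 'spam']
```

Forgetting the shared vectorizer (i.e., calling fit_transform again on the new text) is a common mistake; the pipeline introduced below removes that foot-gun.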

Load the testing data - i.e., unseen tweets to classify


In [18]:
df_test = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=6)
df_test = df_test.iloc[0:, 2:3]
df_test.head()


Out[18]:
Tweet
0 Just love @mandrillapp transactional email ser...
1 @rossdeane Mind submitting a request at http:/...
2 @veroapp Any chance you'll be adding Mandrill ...
3 @Elie__ @camj59 jparle de relai SMTP!1 million...
4 would like to send emails for welcome, passwor...

In [19]:
df_test["Tweet"] = df_test["Tweet"].str.lower()

clean = []
for row in df_test["Tweet"]:
    tweet = [term for term in preprocess(row) if term not in stop]
    clean.append(' '.join(tweet))
    
df_test["Tweet"] = clean  # we now have clean tweets
df_test.head()


Out[19]:
Tweet
0 love @mandrillapp transactional email service ...
1 @rossdeane mind submitting request http://help...
2 @veroapp chance you'll adding mandrill support...
3 @elie__ @camj59 jparle de relai smtp 1 million...
4 would like send emails welcome password resets...

Following Zac Stewart's example, we use sklearn's Pipeline feature to merge the feature extraction and classification steps into one operation


In [20]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer',  CountVectorizer()),
    ('classifier',  MultinomialNB()) ])

pipeline.fit(df_data['Tweet'].values, df_data['Class'].values)

df_test["Prediction Class"] = pipeline.predict(df_test['Tweet'].values) # add classification ['spam', 'ham']
df_test


Out[20]:
Tweet Prediction Class
0 love @mandrillapp transactional email service ... ham
1 @rossdeane mind submitting request http://help... ham
2 @veroapp chance you'll adding mandrill support... ham
3 @elie__ @camj59 jparle de relai smtp 1 million... ham
4 would like send emails welcome password resets... ham
5 coworker using mandrill would entrust email ha... ham
6 @mandrill realised 5 seconds hitting send ham
7 holy shit ’ http://www.mandrill.com/ ham
8 new subscriber profile page activity timeline ... ham
9 @mandrillapp increases scalability http://bit.... ham
10 beets @missmya #nameanamazingband mandrill spam
11 @luissand0val fernando vargas mandrill mexican... spam
12 photo oculi-ds mandrill natalie manuel http://... spam
13 @mandrill neither sadpanda together :( spam
14 @mandrill n k n k n 5 k 4, long time think spam
15 megaman x spark mandrill acapella http://youtu... spam
16 @angeluserrare1 storm eagle ftw nom ás dejes q... spam
17 gostei de um vídeo @youtube http://youtu.be/xz... spam
18 2 year-old mandrill jj thinking pic http://ow.... ham
19 120 years moscow zoo mandrill поста ссср #post... spam

In [21]:
true_class = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=6)
df_test["True Class"] = true_class.iloc[0:, 1:2]
df_test


Out[21]:
Tweet Prediction Class True Class
0 love @mandrillapp transactional email service ... ham APP
1 @rossdeane mind submitting request http://help... ham APP
2 @veroapp chance you'll adding mandrill support... ham APP
3 @elie__ @camj59 jparle de relai smtp 1 million... ham APP
4 would like send emails welcome password resets... ham APP
5 coworker using mandrill would entrust email ha... ham APP
6 @mandrill realised 5 seconds hitting send ham APP
7 holy shit ’ http://www.mandrill.com/ ham APP
8 new subscriber profile page activity timeline ... ham APP
9 @mandrillapp increases scalability http://bit.... ham APP
10 beets @missmya #nameanamazingband mandrill spam OTHER
11 @luissand0val fernando vargas mandrill mexican... spam OTHER
12 photo oculi-ds mandrill natalie manuel http://... spam OTHER
13 @mandrill neither sadpanda together :( spam OTHER
14 @mandrill n k n k n 5 k 4, long time think spam OTHER
15 megaman x spark mandrill acapella http://youtu... spam OTHER
16 @angeluserrare1 storm eagle ftw nom ás dejes q... spam OTHER
17 gostei de um vídeo @youtube http://youtu.be/xz... spam OTHER
18 2 year-old mandrill jj thinking pic http://ow.... ham OTHER
19 120 years moscow zoo mandrill поста ссср #post... spam OTHER

Naturally, in a business application we will generally not have a set of independent test data available. To get around this, we can use k-fold cross-validation: the data is split into k folds, the model is trained on k-1 folds (~83% of the data here) and tested on the remaining fold, and the process is repeated so each fold serves as the test set exactly once. In this example we use 6 folds and average the results, using scikit-learn's 'KFold' function


In [36]:
from sklearn.cross_validation import KFold
from sklearn.metrics import confusion_matrix, f1_score

k_fold = KFold(n=len(df_data), n_folds=6)
scores = []
confusion = np.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold:
    train_text = df_data.iloc[train_indices]['Tweet'].values
    train_y = df_data.iloc[train_indices]['Class'].values

    test_text = df_data.iloc[test_indices]['Tweet'].values
    test_y = df_data.iloc[test_indices]['Class'].values

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)

    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label='spam')
    scores.append(score)

print('Total emails classified:', len(df_data))
print('Score:', sum(scores)/len(scores))


Total emails classified: 300
Score: 0.836360280546
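Note that the sklearn.cross_validation module used above was later folded into sklearn.model_selection in newer scikit-learn releases, where cross_val_score also collapses the whole loop into one call. A minimal sketch under that assumption, on an illustrative toy corpus (not the workbook data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Illustrative stand-in for df_data['Tweet'] / df_data['Class']
texts = ["mandrill email api request"] * 6 + ["zoo mandrill photo baby"] * 6
labels = ["ham"] * 6 + ["spam"] * 6

pipe = Pipeline([("vectorizer", CountVectorizer()),
                 ("classifier", MultinomialNB())])
cv = KFold(n_splits=6, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
print(scores.mean())
```

In the notebook itself one would pass df_data['Tweet'].values and df_data['Class'].values; an F1 scorer with pos_label='spam', matching the loop above, can be wired in via sklearn.metrics.make_scorer.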

The F1 score is a measure of a test's accuracy that combines precision and recall (it is their harmonic mean). It reaches its best value at 1 and its worst at 0, so the model's score of 0.836 is not bad for a first pass


In [35]:
print('Confusion matrix:')
print(confusion)


Confusion matrix:
[[144   6]
 [ 39 111]]

A confusion matrix helps us understand how the model performed for each class: rows give the true class (ham, then spam, sorted alphabetically) and columns the predicted class. Out of the 300 tweets, the model incorrectly classified 39 tweets that are not about the product (predicted as ham) and 6 tweets that are (predicted as spam)
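Treating spam as the positive class, precision, recall, and F1 can be recomputed by hand from the pooled matrix above. (The result differs slightly from the 0.836 reported earlier, which averaged F1 fold-by-fold rather than pooling the counts first.)

```python
# Counts read off the aggregated confusion matrix, spam as the positive class
tp, fp, fn = 111, 6, 39

precision = tp / (tp + fp)  # 111/117: predicted spam that really is spam
recall = tp / (tp + fn)     # 111/150: real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.949 0.74 0.831
```

The low recall on spam (0.74) is where most of the error sits, which points at the preprocessing and tuning options below.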

In order to improve the results, there are two approaches we could take:

  • Improve the data preprocessing by cleaning the tweets with more filters
  • Tune the parameters of the naïve Bayes classifier
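For the second option, scikit-learn's GridSearchCV can search over the pipeline's parameters, e.g. the classifier's smoothing parameter alpha and the vectorizer's n-gram range. A sketch on illustrative toy data, assuming a recent scikit-learn where GridSearchCV lives in sklearn.model_selection:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Illustrative stand-in for df_data['Tweet'] / df_data['Class']
texts = ["mandrill email api request"] * 10 + ["zoo mandrill photo baby"] * 10
labels = ["ham"] * 10 + ["spam"] * 10

pipe = Pipeline([("vectorizer", CountVectorizer()),
                 ("classifier", MultinomialNB())])

# Pipeline-step parameters are addressed as '<step name>__<parameter>'
param_grid = {"classifier__alpha": [0.1, 0.5, 1.0],
              "vectorizer__ngram_range": [(1, 1), (1, 2)]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Cross-validation inside the grid search guards against picking parameters that merely overfit one particular train/test split.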