Load the data from Mandrill.xlsx, included in the chapter 3 download at http://media.wiley.com/product_ancillary/6X/11186614/DOWNLOAD/ch03.zip
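If you prefer to fetch the archive programmatically, here is a minimal sketch using the third-party requests library together with the standard zipfile module (the exact layout inside the archive is an assumption; adjust the extraction path to match where you keep the workbook):
import io
import zipfile
import requests

# Download the chapter 3 archive and unpack it into ./ch03/
url = 'http://media.wiley.com/product_ancillary/6X/11186614/DOWNLOAD/ch03.zip'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    archive.extractall('ch03')  # Mandrill.xlsx should now be somewhere under ./ch03/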
In [1]:
# code written in py_3.0
import pandas as pd
import numpy as np
Load HAM training data - i.e., tweets about the product
In [2]:
# find path to your Mandrill.xlsx
df_ham = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=0)
df_ham = df_ham.iloc[0:, 0:1]
df_ham.head() # use .head() to just show top 5 results
Out[2]:
Load SPAM training data - i.e., tweets NOT about the product
In [3]:
df_spam = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=1)
df_spam = df_spam.iloc[0:, 0:1]
df_spam.head()
Out[3]:
Install the Natural Language Toolkit: http://www.nltk.org/install.html. You may also need to download NLTK's data packages (the 'punkt' tokenizer models and the 'stopwords' corpus used below)
In [4]:
# python -m nltk.downloader punkt
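The same data can also be fetched from inside Python; a minimal sketch covering the 'punkt' tokenizer models and the 'stopwords' corpus used later on:
import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stop-word lists used below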
In [5]:
from nltk.tokenize import word_tokenize
test = df_ham.Tweet[0]
print(word_tokenize(test))
Following Marco Bonzanini's example https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/, I set up a pre-processing chain that recognises @-mentions, emoticons, URLs and #hash-tags as tokens
In [6]:
import re
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
In [7]:
print(preprocess(test))
In [8]:
tweet = preprocess(test)
tweet
Out[8]:
Remove common stop-words plus some non-default stop-words: 'rt' (i.e., re-tweet), 'via' (used in mentions), the ellipsis '…', and a few stray punctuation characters
In [9]:
from nltk.corpus import stopwords
import string
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via', '…', '¿', '“', '”']
In [10]:
tweet_stop = [term for term in preprocess(test) if term not in stop]
In [11]:
tweet_stop
Out[11]:
In [12]:
from collections import Counter
count_all = Counter()
for tweet in df_ham.Tweet:
    # Create a list with all the terms
    terms_all = [term for term in preprocess(tweet) if term not in stop]
    # Update the counter
    count_all.update(terms_all)
# Print the 10 most frequent words
print(count_all.most_common(10))
With the pre-processing chain in place, clean the full training sets and add a classification label to each tweet
In [13]:
df_ham["Tweet"] = df_ham["Tweet"].str.lower()
clean = []
for row in df_ham["Tweet"]:
    tweet = [term for term in preprocess(row) if term not in stop]
    clean.append(' '.join(tweet))
df_ham["Tweet"] = clean # we now have clean tweets
df_ham["Class"] = 'ham' # add classification
df_ham.head()
Out[13]:
In [14]:
df_spam["Tweet"] = df_spam["Tweet"].str.lower()
clean = []
for row in df_spam["Tweet"]:
    tweet = [term for term in preprocess(row) if term not in stop]
    clean.append(' '.join(tweet))
df_spam["Tweet"] = clean # we now have clean tweets
df_spam["Class"] = 'spam' # add classification
df_spam.head()
Out[14]:
In [15]:
df_data = pd.concat([df_ham,df_spam])
df_data = df_data.reset_index(drop=True)
df_data = df_data.reindex(np.random.permutation(df_data.index))
df_data.head()
Out[15]:
In [16]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(df_data["Tweet"].values)
counts
Out[16]:
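counts is a sparse document-term matrix: one row per tweet, one column per distinct term in the training vocabulary. A minimal sketch of inspecting it (in newer scikit-learn versions get_feature_names() has been replaced by get_feature_names_out()):
print(counts.shape)  # (number of tweets, size of the vocabulary)
print(count_vectorizer.get_feature_names()[:10])  # a few of the learned terms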
In [17]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
targets = df_data['Class'].values
classifier.fit(counts, targets)
Out[17]:
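As a quick sanity check, the fitted classifier can score a couple of made-up tweets (both examples below are invented for illustration; they must be transformed with the vectorizer that was fit above, not a new one):
examples = ['mandrill api makes sending transactional email easy',  # hypothetical on-topic tweet
            'saw a mandrill at the zoo today #monkeys']             # hypothetical off-topic tweet
example_counts = count_vectorizer.transform(examples)
print(classifier.predict(example_counts))  # one 'ham'/'spam' label per example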
Load the testing data
In [18]:
df_test = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=6)
df_test = df_test.iloc[0:, 2:3]
df_test.head()
Out[18]:
In [19]:
df_test["Tweet"] = df_test["Tweet"].str.lower()
clean = []
for row in df_test["Tweet"]:
    tweet = [term for term in preprocess(row) if term not in stop]
    clean.append(' '.join(tweet))
df_test["Tweet"] = clean # we now have clean tweets
df_test.head()
Out[19]:
In [20]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())])
pipeline.fit(df_data['Tweet'].values, df_data['Class'].values)
df_test["Prediction Class"] = pipeline.predict(df_test['Tweet'].values) # add classification ['spam', 'ham']
df_test
Out[20]:
In [21]:
true_class = pd.read_excel(open('C:/Users/craigrshenton/Desktop/Dropbox/excel_data_sci/ch03/Mandrill.xlsx','rb'), sheetname=6)
df_test["True Class"] = true_class.iloc[0:, 1:2]
df_test
Out[21]:
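With the true labels alongside the predictions, a minimal sketch of checking the hit rate on this sheet (this assumes the 'True Class' column uses the same 'ham'/'spam' coding as the training data; if the workbook uses different labels, map them first):
accuracy = (df_test['Prediction Class'] == df_test['True Class']).mean()
print('Fraction of held-out tweets classified correctly:', accuracy)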
Naturally, in a business application we will generally not have an independent set of test data available. To get around this we can use cross-validation: the training data is split into k folds, the model is trained on all but one fold and tested on the remaining one, and the process is repeated so that every fold serves as the test set once. Here we use scikit-learn's 'KFold' function with 6 folds, so each pass trains on roughly 83% of the tweets and tests on the remaining 17%, and we average the scores across the folds
In [36]:
from sklearn.cross_validation import KFold
from sklearn.metrics import confusion_matrix, f1_score
k_fold = KFold(n=len(df_data), n_folds=6)
scores = []
confusion = np.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold:
    train_text = df_data.iloc[train_indices]['Tweet'].values
    train_y = df_data.iloc[train_indices]['Class'].values
    test_text = df_data.iloc[test_indices]['Tweet'].values
    test_y = df_data.iloc[test_indices]['Class'].values
    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label='spam')
    scores.append(score)
print('Total tweets classified:', len(df_data))
print('Score:', sum(scores)/len(scores))
The F1 score is a measure of a test's accuracy that combines precision and recall (it is their harmonic mean). It reaches its best value at 1 and worst at 0, so the model's score of 0.836 is not bad for a first pass
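To see where the score comes from, a minimal sketch (reusing test_y and predictions from the final fold of the loop above) recovering F1 from precision and recall:
from sklearn.metrics import precision_score, recall_score, f1_score

p = precision_score(test_y, predictions, pos_label='spam')
r = recall_score(test_y, predictions, pos_label='spam')
print(2 * p * r / (p + r))  # harmonic mean of precision and recall
print(f1_score(test_y, predictions, pos_label='spam'))  # same value via scikit-learn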
In [35]:
print('Confusion matrix:')
print(confusion)
A confusion matrix helps us understand how the model performed for each class. Out of the 300 tweets, the model incorrectly classified about 39 tweets that are about the product, and 6 tweets that are not
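For reference, scikit-learn orders the matrix by sorted class label ('ham' before 'spam'), with rows giving the true class and columns the predicted class. A minimal sketch of unpacking the aggregated counts:
# order after ravel(): ham->ham, ham->spam, spam->ham, spam->spam
tn, fp, fn, tp = confusion.ravel()
print('ham tweets wrongly flagged as spam:', fp)
print('spam tweets that slipped through as ham:', fn)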
In order to improve the results there are two approaches we can take: