In [82]:
from __future__ import division # ensure that all division is float division
from __future__ import print_function # print works as a function when called with parentheses
%matplotlib inline
import matplotlib.pyplot as plt
import os, sys, re, math
import numpy as np
import pandas as pd
import seaborn as sns
pd.set_option("display.max_colwidth", 255)
Read in SMS Data.
The SMS Spam Collection v.1 is a public set of labeled SMS messages collected for mobile phone spam research. It contains 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.
A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site, a UK forum in which cell phone users make public claims about SMS spam, most without quoting the actual spam message received. Identifying the text of the spam messages within the claims is a hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/.
A subset of 3,375 randomly chosen ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the university, who volunteered them knowing that their contributions would be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.
In [83]:
df = pd.read_csv("../data/sms.tsv", sep="\t", names=['label', 'message'])
print(df.shape)
df.head()
Out[83]:
Stratified sampling means the proportions of spam/ham in the train/test sets reflect those of the original dataset. You can see the percentages are about the same below.
In [84]:
df.shape
Out[84]:
In [58]:
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
train, test = train_test_split(df, test_size=0.05, stratify=df.label)
print(train.shape, test.shape)
train.label.value_counts()['ham'] / len(train), test.label.value_counts()['ham'] / len(test)
Out[58]:
Extract two sample messages that we will use to test the functions below.
In [85]:
sample_df = train.sample(2)
sample_row1 = sample_df.iloc[0] # first row of sample_df
sample_row2 = sample_df.iloc[1] # second row of sample_df
sample_message1 = sample_row1.message
sample_message2 = sample_row2.message
print(sample_row1.label, "|", sample_message1)
print(sample_row2.label, "|", sample_message2)
Use http://regex101.com to come up with regular expressions.
In [87]:
def tokenize(msg):
    """
    input: "Change again... It's e one next to escalator..."
    output: ["change", "again", "it's", "one", "next", "to", "escalator"]
    (order may vary, since duplicates are removed with a set)
    """
    msg_lowered = msg.lower()
    # keep tokens that are at least two characters long and do not start with a number
    all_tokens = re.findall(r"\b[a-z][a-z0-9']+\b", msg_lowered)
    return list(set(all_tokens))
tokens1 = tokenize(sample_message1)
tokens2 = tokenize(sample_message2)
print(sample_message1)
print(sample_message2)
print(tokens1)
print(tokens2)
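As a quick sanity check on those rules (the message below is made up for illustration), single-character tokens and tokens starting with a digit are dropped, while digits and apostrophes after the first letter are kept:
In [ ]:
# made-up message, not from the dataset: "i" is dropped (one character),
# "4u" is dropped (starts with a digit), "no1" and "it's" are kept
print(sorted(tokenize("I won a2 prizes & it's 4u at no1 site, txt WIN2 now")))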
Walk through the steps of vectorizing a message outside of a function.
In [88]:
token_dict1 = {} # this is a dictionary that looks like {word1: 1, word2: 1, word3: 1}
for token in tokens1:
    token_dict1[token] = 1
series1 = pd.Series(token_dict1) # convert the dictionary into a series where the row labels are words
# rewrite the same as above using a dict comprehension
series1 = pd.Series({token: 1 for token in tokens1})

token_dict2 = {} # this is a dictionary that looks like {word1: 1, word2: 1, word3: 1}
for token in tokens2:
    token_dict2[token] = 1
series2 = pd.Series(token_dict2) # convert the dictionary into a series where the row labels are words
# rewrite the same as above using a dict comprehension
series2 = pd.Series({token: 1 for token in tokens2})
print("Sample Message 1:", sample_message1)
print("Tokens 1:", tokens1)
print("Series 1:")
print(series1)
print()
print("Sample Message 2:", sample_message2)
print("Tokens 2:", tokens2)
print("Series 2:")
print(series2)
print()
print("Combine Series 1 and Series 2:")
df2 = pd.DataFrame([series1, series2]) # combine the two series; words missing from one message become NaN
df2.fillna(0, inplace=True)
df2
Out[88]:
Repeat the same process of tokenizing and then vectorizing, this time inside a function.
In [89]:
def vectorize_row(row):
    """
    input: a row of a data frame with a ".message" attribute
    output: vectorized row where the row labels are words and every value is 1
    """
    message = row.message
    tokens = tokenize(message)
    vectorized_row = pd.Series({token: 1 for token in tokens})
    return vectorized_row
In [90]:
vectorize_row(sample_row1)
Out[90]:
In [91]:
vectorize_row(sample_row2)
Out[91]:
This feature matrix is the input to our Naive Bayes model.
In [65]:
def get_feature_matrix(df):
    feature_matrix = df.apply(vectorize_row, axis=1) # one row per message, one column per word
    feature_matrix.fillna(0, inplace=True)           # words absent from a message get 0
    return feature_matrix
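As an aside: a dense DataFrame with one column per vocabulary word gets memory-hungry as the corpus grows. For reference only (the rest of this notebook keeps the dense pandas version), scikit-learn can build the equivalent binary matrix in sparse form; the sketch below assumes our tokenize function is a suitable tokenizer.
In [ ]:
# sparse alternative sketch (not used below): binary=True gives the same 0/1
# indicators as get_feature_matrix, but as a scipy sparse matrix
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None, binary=True)
X_sparse = vectorizer.fit_transform(train.message) # shape: (n_messages, n_vocabulary_words)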
In [92]:
get_feature_matrix(sample_df)
Out[92]:
In [93]:
feature_matrix = get_feature_matrix(train)
feature_matrix.shape
Out[93]:
In [94]:
feature_matrix.columns[:50]
Out[94]:
In [95]:
feature_matrix.columns[-50:]
Out[95]:
In [96]:
feature_matrix.head()
Out[96]:
The conditional probability of each word is estimated below, with additive (Laplace) smoothing so that unseen words never get probability zero.
In [99]:
def get_conditional_probability_for_word(col, k=0.5):
    # smoothed estimate: (# messages in the class containing the word + k) / (# messages in the class + 2k)
    return (col.sum() + k) / (len(col) + 2*k)
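To make the smoothing concrete, here is a worked example with made-up counts: a word seen in 40 of 700 spam messages gets (40 + 0.5) / (700 + 1) ≈ 0.0578, and a word seen in 0 of 700 gets 0.5 / 701 ≈ 0.0007 instead of an exact zero, which would otherwise wipe out the whole product in Naive Bayes.
In [ ]:
# made-up counts: 40 of 700 messages in a class contain the word
print(get_conditional_probability_for_word(pd.Series([1]*40 + [0]*660))) # (40 + 0.5) / 701 ≈ 0.0578
print(get_conditional_probability_for_word(pd.Series([0]*700)))          # 0.5 / 701 ≈ 0.0007, not 0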
In [100]:
def get_feature_prob(feature_matrix):
    # the feature matrix was built from `train`, so the masks must come from the
    # training labels (using df.label here would include test rows and misalign)
    spam_boolean_mask = (train.label == "spam")
    ham_boolean_mask = (train.label == "ham")
    # Explanation for "confusing" syntax:
    # http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
    feature_matrix_spam = feature_matrix.loc[spam_boolean_mask, :] # get all rows for spam boolean mask
    feature_matrix_ham = feature_matrix.loc[ham_boolean_mask, :]   # get all rows for ham boolean mask
    # mymatrix.loc[:, col] gets one column
    # mymatrix.loc[row, :] gets one row
    # mymatrix.loc[boolean_mask, :] gets the rows where boolean_mask is True
    feature_prob_spam = feature_matrix_spam.apply(get_conditional_probability_for_word, axis=0)
    feature_prob_ham = feature_matrix_ham.apply(get_conditional_probability_for_word, axis=0)
    feature_prob = pd.concat([feature_prob_spam, feature_prob_ham], axis=1)
    feature_prob.columns = ['spam', 'ham']
    return feature_prob
In [101]:
feature_prob = get_feature_prob(feature_matrix)
feature_prob.shape
Out[101]:
In [102]:
feature_prob.head()
Out[102]:
Words with the largest conditional probability given spam:
P(w_i | y = "spam")
In [75]:
feature_prob.sort_values(by='spam', ascending=False).head(10)
Out[75]:
Words with the smallest conditional probability given ham:
P(w_i | y = "ham")
In [76]:
feature_prob.sort_values(by='ham', ascending=True).head(10)
Out[76]:
Key Takeaway: These models are trained looking at only one class at a time, so the largest conditional probabilities may end up being common stop words. However, this happens in both classes, so the stop words "cancel out": they don't push the prediction one way or the other. Instead, looking at the least probable words of the opposite class (here, the words least likely to appear in "ham") surfaces highly predictive spam words.
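To see the cancellation numerically (with made-up probabilities): a stop word with P(w | spam) ≈ P(w | ham) ≈ 0.3 contributes log(0.3) to both running log-probabilities in the classifier below, shifting the spam/ham log-odds by zero; a word with P(w | spam) = 0.1 and P(w | ham) = 0.001 shifts the log-odds by log(0.1 / 0.001) = log(100) ≈ 4.6 toward spam.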
In [77]:
df[df.message.str.contains("a21", case=False)]
Out[77]:
In [78]:
df[df.message.str.contains("landmark", case=False)]
Out[78]:
In [79]:
df[df.message.str.contains("landlines", case=False)]
Out[79]:
In [104]:
test.iloc[0]
Out[104]:
In [80]:
def get_spam_prob(row):
    new_msg = row.message
    tokens = set(tokenize(new_msg)) # a set makes the membership test below O(1)
    log_prob_if_spam = 0.0
    log_prob_if_not_spam = 0.0
    for word, prob in feature_prob.iterrows():
        prob_if_spam = prob.spam
        prob_if_not_spam = prob.ham
        if word in tokens:
            log_prob_if_spam += math.log(prob_if_spam)
            log_prob_if_not_spam += math.log(prob_if_not_spam)
        else:
            log_prob_if_spam += math.log(1.0 - prob_if_spam)
            log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)
    prob_if_spam = math.exp(log_prob_if_spam)
    prob_if_not_spam = math.exp(log_prob_if_not_spam)
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)
    # return pd.Series({
    #     "spam_prob": prob_if_spam,   # / (prob_if_spam + prob_if_not_spam),
    #     "ham_prob": prob_if_not_spam # / (prob_if_spam + prob_if_not_spam)
    # })
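One caveat with the version above: across thousands of vocabulary words the two log-sums are large negative numbers, and math.exp can underflow both to 0.0, turning the final division into 0/0. Here is a sketch of a numerically stable variant, using the algebraic identity p_s / (p_s + p_h) = 1 / (1 + exp(log p_h - log p_s)):
In [ ]:
def get_spam_prob_stable(row):
    """Same computation as get_spam_prob, but never exponentiates a huge negative sum."""
    tokens = set(tokenize(row.message))
    log_prob_if_spam = 0.0
    log_prob_if_not_spam = 0.0
    for word, prob in feature_prob.iterrows():
        if word in tokens:
            log_prob_if_spam += math.log(prob.spam)
            log_prob_if_not_spam += math.log(prob.ham)
        else:
            log_prob_if_spam += math.log(1.0 - prob.spam)
            log_prob_if_not_spam += math.log(1.0 - prob.ham)
    diff = log_prob_if_not_spam - log_prob_if_spam # small even when both sums are very negative
    if diff > 700: # math.exp would overflow; spam probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))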
In [81]:
# test_probs = test.apply(get_spam_prob, axis=1)
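The cell above is left commented out, presumably because looping over feature_prob.iterrows() for every test message is slow. If you do run it, a minimal evaluation sketch (thresholding at 0.5, which implicitly assumes equal class priors since get_spam_prob ignores P(spam)) could look like:
In [ ]:
test_probs = test.apply(get_spam_prob, axis=1)
predictions = np.where(test_probs > 0.5, "spam", "ham") # 0.5 threshold assumes equal priors
print("accuracy:", (predictions == test.label).mean())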