Naive Bayes Implementation


In [82]:
from __future__ import division # ensure that all division is float division
from __future__ import print_function # make print a function that is called with parentheses

%matplotlib inline
import matplotlib.pyplot as plt

import os, sys, re
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option("display.max_colwidth", 255)

Read in SMS Data.

The SMS Spam Collection v.1 is a public set of labeled SMS messages collected for mobile phone spam research. It is a single collection of 5,574 real, non-encoded English messages, each tagged as legitimate (ham) or spam.

A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site, a UK forum in which cell phone users make public claims about SMS spam, most of them without reporting the actual spam message received. Identifying the text of spam messages in the claims is a hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/.

A subset of 3,375 randomly chosen ham messages comes from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the university, and were collected from volunteers who were made aware that their contributions would be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.

Read in Data


In [83]:
df = pd.read_csv("../data/sms.tsv", sep="\t", names=['label', 'message'])
print(df.shape)
df.head()


(5572, 2)
Out[83]:
label message
0 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives around here though

Stratified Train Test Split

Stratified means the proportions of spam/ham in the train and test sets reflect those of the original dataset. You can see below that the percentages are nearly identical.


In [84]:
df.shape


Out[84]:
(5572, 2)

In [58]:
from sklearn.model_selection import train_test_split # "sklearn.cross_validation" is deprecated
train, test = train_test_split(df, test_size=0.05, stratify=df.label)
print(train.shape, test.shape)
train.label.value_counts()['ham'] / len(train), test.label.value_counts()['ham'] / len(test)


(5293, 2) (279, 2)
Out[58]:
(0.86586057056489707, 0.86738351254480284)

Create sample data frame and sample rows.

Extract two sample messages that we will use for testing in the functions below.


In [85]:
sample_df = train.sample(2)

sample_row1 = sample_df.iloc[0] # first row of sample_df
sample_row2 = sample_df.iloc[1] # second row of sample_df

sample_message1 = sample_row1.message
sample_message2 = sample_row2.message

print(sample_row1.label, "|", sample_message1)
print(sample_row2.label, "|", sample_message2)


ham | No it's waiting in e car dat's bored wat. Cos wait outside got nothing 2 do. At home can do my stuff or watch tv wat.
ham | Stupid.its not possible

Tokenize Message

Use http://regex101.com to come up with regular expressions.


In [87]:
def tokenize(msg):
    """
    input: "Change again... It's e one next to escalator..."
    output: ["change", "again", "it's", "one", "next", "to", "escalator"]
    """
    msg_lowered = msg.lower()
    # tokens are at least two characters long and cannot start with a number
    all_tokens = re.findall(r"\b[a-z][a-z0-9']+\b", msg_lowered)
    return list(set(all_tokens)) # dedupe: we only track word presence, not counts

tokens1 = tokenize(sample_message1)
tokens2 = tokenize(sample_message2)

print(sample_message1)
print(sample_message2)

print(tokens1)
print(tokens2)


No it's waiting in e car dat's bored wat. Cos wait outside got nothing 2 do. At home can do my stuff or watch tv wat.
Stupid.its not possible
['at', 'in', 'home', 'no', 'tv', "it's", "dat's", 'waiting', 'outside', 'got', 'wat', 'do', 'watch', 'nothing', 'wait', 'cos', 'bored', 'car', 'stuff', 'can', 'my', 'or']
['not', 'stupid', 'its', 'possible']

Vectorize Message

Walk through the steps of vectorizing a message outside of a function.


In [88]:
token_dict1 = {} # this is a dictionary that looks like {word1: 1, word2: 1, word3: 1}
for token in tokens1:
    token_dict1[token] = 1 
series1 = pd.Series(token_dict1) # convert the dictionary into a series where the row labels are words

# rewrite the same as above using a dict comprehension
series1 = pd.Series({token: 1 for token in tokens1})

token_dict2 = {} # this is a dictionary that looks like {word1: 1, word2: 1, word3: 1}
for token in tokens2:
    token_dict2[token] = 1 
series2 = pd.Series(token_dict2) # convert the dictionary into a series where the row labels are words

# rewrite the same as above using a dict comprehension
series2 = pd.Series({token: 1 for token in tokens2})

print("Sample Message 1:", sample_message1)
print("Tokens 1:", tokens1)
print("Series 1:")
print(series1)
print()
print("Sample Message 2:", sample_message2)
print("Tokens 2:", tokens2)
print("Series 2:")
print(series2)
print()

print("Combine Series 1 and Series 2:")
df2 = pd.DataFrame([series1, series2]) # combine the two series into one data frame
df2.fillna(0, inplace=True)
df2


Sample Message 1: No it's waiting in e car dat's bored wat. Cos wait outside got nothing 2 do. At home can do my stuff or watch tv wat.
Tokens 1: ['at', 'in', 'home', 'no', 'tv', "it's", "dat's", 'waiting', 'outside', 'got', 'wat', 'do', 'watch', 'nothing', 'wait', 'cos', 'bored', 'car', 'stuff', 'can', 'my', 'or']
Series 1:
at         1
bored      1
can        1
car        1
cos        1
dat's      1
do         1
got        1
home       1
in         1
it's       1
my         1
no         1
nothing    1
or         1
outside    1
stuff      1
tv         1
wait       1
waiting    1
wat        1
watch      1
dtype: int64

Sample Message 2: Stupid.its not possible
Tokens 2: ['not', 'stupid', 'its', 'possible']
Series 2:
its         1
not         1
possible    1
stupid      1
dtype: int64

Combine Series 1 and Series 2:
Out[88]:
at bored can car cos dat's do got home in ... or outside possible stuff stupid tv wait waiting wat watch
0 1 1 1 1 1 1 1 1 1 1 ... 1 1 0 1 0 1 1 1 1 1
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 1 0 0 0 0 0

2 rows × 26 columns

Repeat the same process of tokenizing and then vectorizing, this time inside a function.


In [89]:
def vectorize_row(row):
    """
    input: row in data frame with a ".message" attribute
    output: vectorized row where the row labels are words and the values are 1 for each row
    """
    message = row.message
    tokens = tokenize(message)
    vectorized_row = pd.Series({token: 1 for token in tokens})
    return vectorized_row

In [90]:
vectorize_row(sample_row1)


Out[90]:
at         1
bored      1
can        1
car        1
cos        1
dat's      1
do         1
got        1
home       1
in         1
it's       1
my         1
no         1
nothing    1
or         1
outside    1
stuff      1
tv         1
wait       1
waiting    1
wat        1
watch      1
dtype: int64

In [91]:
vectorize_row(sample_row2)


Out[91]:
its         1
not         1
possible    1
stupid      1
dtype: int64

Create Feature Matrix

This is input to our Naive Bayes model.


In [65]:
def get_feature_matrix(df):
    feature_matrix = df.apply(vectorize_row, axis=1)
    feature_matrix.fillna(0, inplace=True)
    return feature_matrix

In [92]:
get_feature_matrix(sample_df)


Out[92]:
at bored can car cos dat's do got home in ... or outside possible stuff stupid tv wait waiting wat watch
506 1 1 1 1 1 1 1 1 1 1 ... 1 1 0 1 0 1 1 1 1 1
1454 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 1 0 0 0 0 0

2 rows × 26 columns


In [93]:
feature_matrix = get_feature_matrix(train)
feature_matrix.shape


Out[93]:
(5293, 7795)

In [94]:
feature_matrix.columns[:50]


Out[94]:
Index([u'a21', u'a30', u'aa', u'aah', u'aaniye', u'aaooooright', u'aathi',
       u'ab', u'abbey', u'abdomen', u'abeg', u'abel', u'aberdeen', u'abi',
       u'ability', u'abiola', u'abj', u'able', u'abnormally', u'about',
       u'aboutas', u'above', u'abroad', u'absence', u'absolutely',
       u'absolutly', u'abstract', u'abt', u'abta', u'aburo', u'abuse',
       u'abusers', u'ac', u'acc', u'accent', u'accenture', u'accept',
       u'access', u'accessible', u'accidant', u'accident', u'accidentally',
       u'accommodation', u'accommodationvouchers', u'accomodate',
       u'accomodations', u'accordin', u'accordingly', u'account',
       u'account's'],
      dtype='object')

In [95]:
feature_matrix.columns[-50:]


Out[95]:
Index([u'ym', u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'you'd',
       u'you'll', u'you're', u'you've', u'youdoing', u'youi', u'young',
       u'younger', u'youphone', u'your', u'your's', u'youre', u'yourinclusive',
       u'yourjob', u'yours', u'yourself', u'youuuuu', u'youwanna', u'yoville',
       u'yowifes', u'yoyyooo', u'yr', u'yrs', u'ystrday', u'yummmm', u'yummy',
       u'yun', u'yunny', u'yuo', u'yuou', u'yup', u'yupz', u'zac', u'zebra',
       u'zed', u'zeros', u'zhong', u'zindgi', u'zoe', u'zogtorius', u'zoom',
       u'zouk', u'zyada'],
      dtype='object')

In [96]:
feature_matrix.head()


Out[96]:
a21 a30 aa aah aaniye aaooooright aathi ab abbey abdomen ... zebra zed zeros zhong zindgi zoe zogtorius zoom zouk zyada
3938 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2976 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
891 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4088 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2136 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 7795 columns

Calculate Feature Probabilities (Train/Fit Model)

Each word's conditional probability is estimated with additive smoothing: with smoothing parameter k, P(w_i | y = c) = (number of class-c messages containing w_i + k) / (number of class-c messages + 2k). This keeps every estimate strictly between 0 and 1.


In [99]:
def get_conditional_probability_for_word(col, k=0.5):
    # col is the 0/1 presence column for one word, restricted to one class;
    # additive smoothing keeps the estimate strictly between 0 and 1
    return (col.sum() + k) / (len(col) + 2*k)
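
A quick sanity check on the smoothing, using a hypothetical toy column rather than dataset rows: a word present in 3 of 10 messages of a class gets (3 + 0.5) / (10 + 1) ≈ 0.318 instead of the raw 0.3, and an unseen word gets 0.5 / 11 ≈ 0.045 instead of 0, so no single word can zero out the product of probabilities later.


In [ ]:
toy_col = pd.Series([1, 1, 1, 0, 0, 0, 0, 0, 0, 0]) # toy column: word present in 3 of 10 messages
print(get_conditional_probability_for_word(toy_col))     # (3 + 0.5) / (10 + 1) ~ 0.318
print(get_conditional_probability_for_word(toy_col * 0)) # (0 + 0.5) / (10 + 1) ~ 0.045, never 0
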

In [100]:
def get_feature_prob(feature_matrix):
    
    # the masks must come from the frame the feature matrix was built from
    # (train, not df), so the boolean index aligns with feature_matrix's rows
    spam_boolean_mask = (train.label == "spam")
    ham_boolean_mask = (train.label == "ham")
    
    # Explanation for "confusing" syntax:
    # http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
    
    feature_matrix_spam = feature_matrix.loc[spam_boolean_mask, :] # get all rows for spam boolean mask
    feature_matrix_ham = feature_matrix.loc[ham_boolean_mask, :] # get all rows for ham boolean mask
    
    # mymatrix[:, 0] is to get the first column
    # mymatrix[:, 1] is to get the second column
    
    # mymatrix[0, :] is to get the first row
    # mymatrix[1, :] is to get the second row
    
    # mymatrix[boolean_mask, :] is to get the rows where boolean_mask is True
    
    feature_prob_spam = feature_matrix_spam.apply(get_conditional_probability_for_word, axis=0)
    feature_prob_ham = feature_matrix_ham.apply(get_conditional_probability_for_word, axis=0)
    
    feature_prob = pd.concat([feature_prob_spam, feature_prob_ham], axis=1)
    feature_prob.columns = ['spam', 'ham']
    
    return feature_prob

In [101]:
feature_prob = get_feature_prob(feature_matrix)
feature_prob.shape


Out[101]:
(7795, 2)

In [102]:
feature_prob.head()


Out[102]:
spam ham
a21 0.002110 0.000109
a30 0.000703 0.000327
aa 0.000703 0.000327
aah 0.000703 0.000764
aaniye 0.000703 0.000327

Analyze Feature Probabilities in Classifier

Words with the largest conditional probability for predicting spam.

P(w_i | y = "spam")


In [75]:
feature_prob.sort_values(by='spam', ascending=False).head(10)


Out[75]:
spam ham
to 0.622363 0.252072
call 0.436709 0.045921
you 0.315752 0.269306
your 0.307314 0.074498
now 0.259494 0.060100
or 0.239803 0.045266
for 0.238397 0.091950
free 0.232771 0.012544
the 0.228551 0.180083
txt 0.200422 0.002945

Words with the smallest conditional probability for predicting ham.

P(w_i | y = "ham")


In [76]:
feature_prob.sort_values(by='ham', ascending=True).head(10)


Out[76]:
spam ham
a21 0.002110 0.000109
ree 0.002110 0.000109
daytime 0.002110 0.000109
ref 0.006329 0.000109
dating 0.023207 0.000109
datebox1282essexcm61xn 0.003516 0.000109
refused 0.004923 0.000109
regalportfolio 0.002110 0.000109
dartboard 0.002110 0.000109
regard 0.002110 0.000109

Key Takeaway: The conditional probabilities are estimated one class at a time, so the largest values within each class tend to be common stop words. Because stop words are frequent in both classes, their contributions roughly cancel out and they don't predict one way or the other. Instead, look at the words that are least probable in the opposite class: the words least predictive of "ham" turn out to be highly predictive spam words.
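
One way to make this concrete is to rank words by the likelihood ratio P(w | spam) / P(w | ham), which rewards words that are common in spam and rare in ham. A quick sketch using the feature_prob table above:


In [ ]:
# likelihood ratio: large values mean the word is common in spam and rare in ham
spam_ham_ratio = feature_prob.spam / feature_prob.ham
spam_ham_ratio.sort_values(ascending=False).head(10)
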


In [77]:
df[df.message.str.contains("a21", case=False)]


Out[77]:
label message
1673 spam URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050001295 from land line. Claim A21. Valid 12hrs only

In [78]:
df[df.message.str.contains("landmark", case=False)]


Out[78]:
label message
4373 spam Ur balance is now £600. Next question: Complete the landmark, Big, A. Bob, B. Barry or C. Ben ?. Text A, B or C to 83738. Good luck!

In [79]:
df[df.message.str.contains("landlines", case=False)]


Out[79]:
label message
3998 spam Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines!
4864 spam Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines!

Predict Test Data


In [104]:
test.iloc[0]


Out[104]:
label                                                                                 ham
message    Just checking in on you. Really do miss seeing Jeremiah. Do have a great month
Name: 350, dtype: object

In [80]:
import math

def get_spam_prob(row):
    
    new_msg = row.message
    
    tokens = set(tokenize(new_msg)) # set for fast membership tests
    
    log_prob_if_spam = 0.0
    log_prob_if_not_spam = 0.0
    
    # Bernoulli Naive Bayes: every vocabulary word contributes,
    # whether it is present in the message or absent from it
    for word, prob in feature_prob.iterrows():
        
        prob_if_spam = prob.spam
        prob_if_not_spam = prob.ham
        
        if word in tokens:
            log_prob_if_spam += math.log(prob_if_spam)
            log_prob_if_not_spam += math.log(prob_if_not_spam)
        else:
            log_prob_if_spam += math.log(1.0 - prob_if_spam)
            log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)
        
    # note: exponentiating a very negative sum can underflow to 0.0
    # for a large enough vocabulary, making the ratio below undefined
    prob_if_spam = math.exp(log_prob_if_spam)
    prob_if_not_spam = math.exp(log_prob_if_not_spam)
        
    # posterior under (implicitly) equal class priors:
    # P(spam | msg) = P(msg | spam) / (P(msg | spam) + P(msg | ham))
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)
    
#     return pd.Series({
#         "spam_prob": prob_if_spam, #/ (prob_if_spam + prob_if_not_spam), 
#         "ham_prob": prob_if_not_spam #/ (prob_if_spam + prob_if_not_spam)
#     })
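
As a quick usage check (a sketch; the exact value depends on the random train/test split): the first test row shown above is ham, so its spam probability should ideally be near 0.


In [ ]:
get_spam_prob(test.iloc[0]) # a ham message; ideally close to 0
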

In [81]:
# test_probs = test.apply(get_spam_prob, axis=1)
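
Uncommenting the line above scores every test row. A minimal evaluation sketch, assuming a 0.5 decision threshold on the returned spam probability:


In [ ]:
test_probs = test.apply(get_spam_prob, axis=1)          # P(spam | message) per row
predictions = np.where(test_probs > 0.5, "spam", "ham") # threshold the posterior at 0.5
print("Accuracy:", (predictions == test.label.values).mean())
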