Naive Bayes (the easy way)

We'll cheat by using sklearn.naive_bayes to train a spam classifier! Most of the code is just loading our training data into a pandas DataFrame that we can play with:



In [1]:

    
import os
import io
from pandas import DataFrame, concat
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    # Walk the directory tree and yield (file path, message body) for every file.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)

            inBody = False
            lines = []
            # The first blank line separates the email headers from the body.
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            # Each line keeps its trailing newline, so this join leaves a blank line between lines.
            message = '\n'.join(lines)
            yield filepath, message


def dataFrameFromDirectory(path, classification):
    # Build a DataFrame with one row per email, indexed by file path.
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0, so combine the two sets with concat.
data = concat([
    dataFrameFromDirectory('e:/sundog-consult/Udemy/DataScience/emails/spam', 'spam'),
    dataFrameFromDirectory('e:/sundog-consult/Udemy/DataScience/emails/ham', 'ham'),
])

Let's have a look at that DataFrame:



In [2]:

    
data.head()









    Out[2]:






  
    
      
                                                                                        class  message
e:/sundog-consult/Udemy/DataScience/emails/spam\00001.7848dde101aa985090474a91ec93fcf0  spam   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...
e:/sundog-consult/Udemy/DataScience/emails/spam\00002.d94f1b97e48ed3b553b3508d116e6a09  spam   1) Fight The Risk of Cancer!\n\nhttp://www.adc...
e:/sundog-consult/Udemy/DataScience/emails/spam\00003.2ee33bc6eacdb11f38d052c44819ba6c  spam   1) Fight The Risk of Cancer!\n\nhttp://www.adc...
e:/sundog-consult/Udemy/DataScience/emails/spam\00004.eac8de8d759b7e74154f142194282724  spam   ##############################################...
e:/sundog-consult/Udemy/DataScience/emails/spam\00005.57696a39d7d84318ce497886896bf90d  spam   I thought you might like these:\n\n1) Slim Dow...
Now we will use a CountVectorizer to turn each message into a vector of word counts, and feed those counts into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy.



In [3]:

    
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)









    Out[3]:





MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
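To see what the CountVectorizer step actually hands to the classifier, here is a small sketch on three made-up messages. The `toy_` names and the messages themselves are assumptions for illustration only; they are not part of the email data set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from the email corpus.
toy_messages = ["free money now", "free lunch today", "money money money"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse matrix
# with one row per message and one column per distinct word.
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each word to its column index; sort the words by that index.
toy_vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocabulary)        # ['free', 'lunch', 'money', 'now', 'today']
print(toy_counts.toarray())  # row 3 is [0, 0, 3, 0, 0]: "money" appears three times
```

Each row is what MultinomialNB sees for one message: how often each known word occurs, with word order thrown away.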

Let's try it out:



In [4]:

    
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions









    Out[4]:





array(['spam', 'ham'], 
      dtype='<U4')
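If you want more than a bare label, MultinomialNB can also report how confident it is via predict_proba(). The sketch below trains on four made-up messages so that it runs on its own; the messages and the `toy_` names are assumptions, and with the real notebook you would call the same methods on `classifier` and `example_counts`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training messages, standing in for the real email corpus.
toy_train_messages = [
    "win free cash now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch with the team monday",
]
toy_train_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_train_messages), toy_train_labels)

# New messages must go through transform(), not fit_transform(),
# so they are counted against the vocabulary learned during training.
toy_new_messages = ["free cash prize", "team meeting monday"]
toy_new_counts = toy_vectorizer.transform(toy_new_messages)

toy_predictions = toy_classifier.predict(toy_new_counts)
toy_probabilities = toy_classifier.predict_proba(toy_new_counts)

# The columns of predict_proba follow the order of classes_.
print(toy_classifier.classes_)
for message, label, probs in zip(toy_new_messages, toy_predictions, toy_probabilities):
    print(message, label, probs)
```

Words the vectorizer never saw during training are simply ignored by transform(), which is one reason a small training set gives shaky predictions.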

Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect.

If you really want to challenge yourself, try applying a train/test split to this spam classifier: hold out some subset of the ham and spam emails, train on the rest, and see how well it predicts the held-out messages.
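One possible starting point for the train/test idea is sketched here, using scikit-learn's train_test_split and accuracy_score. The helper name `evaluate_split`, its parameters, and the toy corpus are all assumptions made up for illustration; this is just one way to set it up, not the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def evaluate_split(messages, labels, test_size=0.25, seed=0):
    """Train on part of the data; return (accuracy, number of held-out messages)."""
    # stratify keeps the spam/ham ratio the same in both parts.
    train_msgs, test_msgs, train_labels, test_labels = train_test_split(
        messages, labels, test_size=test_size, random_state=seed, stratify=labels)

    # Learn the vocabulary from the training messages only, so the
    # held-out messages stay unseen until prediction time.
    vectorizer = CountVectorizer()
    train_counts = vectorizer.fit_transform(train_msgs)
    test_counts = vectorizer.transform(test_msgs)

    classifier = MultinomialNB()
    classifier.fit(train_counts, train_labels)
    predictions = classifier.predict(test_counts)
    return accuracy_score(test_labels, predictions), len(test_msgs)


# Hypothetical toy corpus so the sketch runs without the email files.
toy_spam = [
    "win free cash now", "free prize claim today", "cheap pills free shipping",
    "claim your free reward", "cash bonus click now", "free offer act now",
    "win a prize today", "click for cheap cash",
]
toy_ham = [
    "meeting agenda for monday", "lunch with the team", "project status report",
    "see you at golf tomorrow", "notes from the meeting", "team lunch on friday",
    "report due next monday", "golf with bob on friday",
]
toy_messages = toy_spam + toy_ham
toy_labels = ["spam"] * len(toy_spam) + ["ham"] * len(toy_ham)

accuracy, n_test = evaluate_split(toy_messages, toy_labels)
print("accuracy on", n_test, "held-out messages:", accuracy)
```

With the DataFrame from this notebook, the equivalent call would be `evaluate_split(data['message'].values, data['class'].values)`.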



In [ ]:
