Homework 2 Problem 3


In [1]:
import numpy as np
import pandas as pd
import read_matrix as rm
import nb_train
import nb_test
import svm_train
import svm_test

Part 3.a

The first machine learning algorithm for classifying spam emails is the Naive Bayes model. We begin by training the model on the MATRIX.TRAIN data file.
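For reference, here is a minimal sketch of what the training step presumably does, assuming nb_train.train fits a multinomial Naive Bayes model with Laplace smoothing and that column 0 of the DataFrame holds the label (1 = spam, 0 = ham) while the remaining columns hold per-token counts. The function name and return layout are illustrative, not the actual nb_train API.

import numpy as np

def nb_train_sketch(df):
    # Multinomial Naive Bayes with Laplace (add-one) smoothing.
    y = df.iloc[:, 0].to_numpy()
    X = df.iloc[:, 1:].to_numpy()
    spam_counts = X[y == 1].sum(axis=0)   # total count of each token in spam emails
    ham_counts = X[y == 0].sum(axis=0)    # total count of each token in ham emails
    V = X.shape[1]                        # vocabulary size
    # log P(token_k | class), smoothed so unseen tokens get nonzero probability
    log_phi_spam = np.log((spam_counts + 1) / (spam_counts.sum() + V))
    log_phi_ham = np.log((ham_counts + 1) / (ham_counts.sum() + V))
    # log class priors
    log_prior_spam = np.log(np.mean(y == 1))
    log_prior_ham = np.log(np.mean(y == 0))
    return log_phi_spam, log_phi_ham, log_prior_spam, log_prior_ham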


In [2]:
df_train = rm.read_data('spam_data/MATRIX.TRAIN')

In [3]:
nb_model = nb_train.train(df_train)

Next, we run the trained model against the testing data.
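A minimal sketch of the prediction step, assuming the classifier compares per-document log-posteriors built from the quantities returned by a training step like the one sketched above (nb_predict_sketch is illustrative, not the actual nb_test API):

import numpy as np

def nb_predict_sketch(model, df):
    # Pick the class with the larger log-posterior for each document.
    log_phi_spam, log_phi_ham, log_prior_spam, log_prior_ham = model
    X = df.iloc[:, 1:].to_numpy()
    log_p_spam = X @ log_phi_spam + log_prior_spam
    log_p_ham = X @ log_phi_ham + log_prior_ham
    return (log_p_spam > log_p_ham).astype(int)   # 1 = spam, 0 = ham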


In [4]:
df_test = rm.read_data('spam_data/MATRIX.TEST')

In [5]:
nb_predictions = nb_test.test(nb_model, df_test)

The testing error, i.e. the fraction of misclassified test emails, is computed next.
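A one-line sketch of that computation (compute_error_sketch is illustrative; the actual nb_test.compute_error may differ in details):

import numpy as np

def compute_error_sketch(y_true, y_pred):
    # Fraction of documents whose predicted label differs from the true label.
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))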


In [6]:
y = df_test.iloc[:,0]
nb_error = nb_test.compute_error(y, nb_predictions)

In [7]:
print('NB Test error: {}'.format(nb_error))


NB Test error: 0.01625

Part 3.b.

The five words most indicative of a spam message are the following.
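A plausible ranking criterion, sketched under the assumption that indicativeness is measured by the log-likelihood ratio of a token under the two classes; the function name and the vocab argument (a list mapping column index to token string) are hypothetical:

import numpy as np

def k_most_indicative_sketch(k, log_phi_spam, log_phi_ham, vocab):
    # Score each token by log( P(token | spam) / P(token | ham) )
    # and return the k highest-scoring tokens.
    scores = log_phi_spam - log_phi_ham
    top = np.argsort(scores)[::-1][:k]
    return [vocab[i] for i in top]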


In [8]:
words = nb_test.k_most_indicative_words(5, nb_model.to_dataframe().iloc[:,1:])

In [9]:
print('The {} most spam-worthy words are: {}'.format(len(words), words))


The 5 most spam-worthy words are: ['httpaddr', 'spam', 'unsubscrib', 'ebai', 'valet']

Part 3.c.

To test the convergence properties of the Naive Bayes classifier on the email data set, we run it on training sets of different sizes. Here we use six training set sizes and observe how the test error changes.


In [10]:
training_set_files = {
        50   : 'spam_data/MATRIX.TRAIN.50', 
        100  : 'spam_data/MATRIX.TRAIN.100', 
        200  : 'spam_data/MATRIX.TRAIN.200', 
        400  : 'spam_data/MATRIX.TRAIN.400', 
        800  : 'spam_data/MATRIX.TRAIN.800', 
        1400 : 'spam_data/MATRIX.TRAIN.1400'
    }

Estimate the models and compute the errors.


In [11]:
nb_models = {}
for size, filename in training_set_files.items():
    df_next = rm.read_data(filename)
    m = nb_train.train(df_next)
    nb_models[size] = m

In [12]:
nb_errors = {}
for size, model in nb_models.items():
    guessed_y = nb_test.test(model, df_test)
    err = nb_test.compute_error(y, guessed_y)
    nb_errors[size] = err

The resulting errors are


In [13]:
print('Naive Bayes')
for size, error in nb_errors.items():
    print('size: {}; error: {}'.format(size, error))


Naive Bayes
size: 50; error: 0.13125
size: 100; error: 0.04
size: 200; error: 0.02625
size: 400; error: 0.02
size: 800; error: 0.01625
size: 1400; error: 0.01625

As the training set size increases, the test error of the Naive Bayes classifier decreases. It levels off at 0.01625 once the training set reaches about 800 emails.

Part 3.d.

The second model used to classify the emails is a support vector machine. As in part (a), we train the SVM model using the MATRIX.TRAIN data.


In [14]:
tau = 8
max_iters = 40
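For context, here is a minimal sketch of one way such an SVM can be trained, assuming svm_train fits a kernelized SVM with an RBF (Gaussian) kernel of bandwidth tau via stochastic subgradient descent on the regularized hinge loss, run for max_iters passes over the data. The function names, the regularization constant lam, and the model layout below are illustrative, not the actual svm_train/svm_test API.

import numpy as np

def rbf_kernel(X, Z, tau):
    # Gaussian kernel matrix: K[i, j] = exp(-||x_i - z_j||^2 / (2 * tau^2)).
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2 * X @ Z.T)
    return np.exp(-sq / (2 * tau**2))

def svm_train_sketch(X, y, tau, max_iters, lam=1e-4, seed=0):
    # Kernelized Pegasos-style SGD on the hinge loss; y must be in {-1, +1}.
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    K = rbf_kernel(X, X, tau)
    alpha = np.zeros(m)                      # coefficients of the kernel expansion
    t = 0
    for _ in range(max_iters):
        for i in rng.permutation(m):
            t += 1
            eta = 1.0 / (lam * t)            # decaying step size
            margin = y[i] * (K[i] @ alpha)   # y_i * f(x_i) before the update
            alpha *= (1 - eta * lam)         # shrink: regularization term
            if margin < 1:                   # hinge-loss subgradient step
                alpha[i] += eta * y[i]
    return {'alpha': alpha, 'X': X, 'tau': tau}

def svm_predict_sketch(model, X_new):
    # Sign of the kernelized decision function sum_j alpha_j K(x_j, x).
    K = rbf_kernel(X_new, model['X'], model['tau'])
    return np.sign(K @ model['alpha'])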

In [15]:
svm_model = svm_train.train(df_train, tau, max_iters)

Next, we run the trained SVM model against the testing data.


In [16]:
svm_predictions = svm_test.test(svm_model, df_test)

In [17]:
print(svm_predictions.shape)


(800,)

The testing error is computed below. Since the SVM predicts labels in {-1, +1}, the {0, 1} test labels are first mapped to -1/+1.


In [18]:
ytest = 2 * df_test.iloc[:, 0].to_numpy() - 1  # map {0, 1} labels to {-1, +1}
svm_error = svm_test.compute_error(ytest, svm_predictions)

In [19]:
print('SVM Test Error: {}'.format(svm_error))


SVM Test Error: 0.0

For each of the different training set sizes, we also estimate an SVM model.


In [20]:
svm_models = {}
for size, filename in training_set_files.items():
    df_next = rm.read_data(filename)
    m = svm_train.train(df_next, tau, max_iters)
    svm_models[size] = m

And we compute the errors for each model.


In [23]:
svm_errors = {}
for size, model in svm_models.items():
    guessed_y = svm_test.test(model, df_test)
    err = svm_test.compute_error(ytest, guessed_y)
    svm_errors[size] = err

The resulting errors are


In [24]:
print('Support Vector Machine')
for size, error in svm_errors.items():
    print('size: {}; error: {}'.format(size, error))


Support Vector Machine
size: 50; error: 0.01875
size: 100; error: 0.02
size: 200; error: 0.0025
size: 400; error: 0.00375
size: 800; error: 0.0
size: 1400; error: 0.0

Part 3.e.

For this data set, the SVM is a much better classifier than Naive Bayes: its test error drops to zero with only a few hundred training emails, whereas Naive Bayes never falls below 0.01625 even with 1400 training emails.
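To visualize this comparison, the two learning curves can be plotted from the nb_errors and svm_errors dictionaries computed above (matplotlib is assumed to be available; it is not imported at the top of the notebook):

import matplotlib.pyplot as plt

sizes = sorted(nb_errors)
plt.plot(sizes, [nb_errors[s] for s in sizes], marker='o', label='Naive Bayes')
plt.plot(sizes, [svm_errors[s] for s in sizes], marker='s', label='SVM')
plt.xlabel('Training set size')
plt.ylabel('Test error')
plt.title('Learning curves on MATRIX.TEST')
plt.legend()
plt.show()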

