In [1]:
import numpy as np
import pandas as pd
import read_matrix as rm
import nb_train
import nb_test
import svm_train
import svm_test
The first machine learning algorithm we use to classify spam emails is a Naive Bayes model. We begin by training the model on the MATRIX.TRAIN data file.
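The implementation of nb_train.train lives in the accompanying module; as a rough sketch of what a multinomial Naive Bayes trainer for this kind of token-count matrix might look like (assuming column 0 of the DataFrame holds the 0/1 spam label, the remaining columns hold per-token counts, and Laplace smoothing is used; the names below are illustrative, not the module's):
In [ ]:
def train_naive_bayes(df):
    """Sketch of multinomial Naive Bayes training with Laplace smoothing.

    Assumes column 0 is the 0/1 spam label and the remaining columns are
    per-token counts; the real nb_train.train may differ in detail."""
    y = df.iloc[:, 0].to_numpy()
    counts = df.iloc[:, 1:].to_numpy()
    spam, ham = counts[y == 1], counts[y == 0]
    vocab_size = counts.shape[1]
    # Per-token probabilities with add-one (Laplace) smoothing.
    log_phi_spam = np.log((spam.sum(axis=0) + 1) / (spam.sum() + vocab_size))
    log_phi_ham = np.log((ham.sum(axis=0) + 1) / (ham.sum() + vocab_size))
    log_prior_spam = np.log(y.mean())
    log_prior_ham = np.log(1 - y.mean())
    return log_phi_spam, log_phi_ham, log_prior_spam, log_prior_ham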
In [2]:
df_train = rm.read_data('spam_data/MATRIX.TRAIN')
In [3]:
nb_model = nb_train.train(df_train)
Next we run the model against the testing data.
In [4]:
df_test = rm.read_data('spam_data/MATRIX.TEST')
In [5]:
nb_predictions = nb_test.test(nb_model, df_test)
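nb_test.test presumably classifies each email by comparing its log posterior under the spam and ham word distributions; a sketch of that decision rule, written against the hypothetical parameters returned by the training sketch above:
In [ ]:
def predict_naive_bayes(df, log_phi_spam, log_phi_ham, log_prior_spam, log_prior_ham):
    """Label an email spam (1) when its log posterior under the spam
    distribution exceeds the log posterior under the ham distribution."""
    counts = df.iloc[:, 1:].to_numpy()
    score_spam = counts @ log_phi_spam + log_prior_spam
    score_ham = counts @ log_phi_ham + log_prior_ham
    return (score_spam > score_ham).astype(int)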
The test error is computed below.
In [6]:
y = df_test.iloc[:,0]
nb_error = nb_test.compute_error(y, nb_predictions)
In [7]:
print('NB Test error: {}'.format(nb_error))
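compute_error is presumably just the fraction of misclassified test emails; a one-line equivalent under that assumption would be:
In [ ]:
# Assumed semantics of nb_test.compute_error: fraction of mismatched labels.
error = float(np.mean(np.asarray(nb_predictions) != np.asarray(y)))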
The five words most indicative of a spam message are the following.
In [8]:
words = nb_test.k_most_indicative_words(5, nb_model.to_dataframe().iloc[:,1:])
In [9]:
print('The {} most spam-worthy words are: {}'.format(len(words), words))
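A common way to score how indicative a token is of spam is the log-likelihood ratio log p(token | spam) - log p(token | ham); a sketch of such a ranking, assuming per-token log probabilities and a matching token list are available (k_most_indicative_words may compute this differently):
In [ ]:
def k_most_indicative(k, log_phi_spam, log_phi_ham, tokens):
    """Return the k tokens with the largest log p(token|spam) - log p(token|ham).

    tokens is a list of token strings aligned with the probability vectors
    (a hypothetical argument; the real helper reads them from the model)."""
    ratio = np.asarray(log_phi_spam) - np.asarray(log_phi_ham)
    top = np.argsort(ratio)[::-1][:k]
    return [tokens[i] for i in top]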
To test the convergence properties of the Naive Bayes classifier on the email data set, it needs to be run on different training set sizes. Here we use six different sized training sets to see how the error rate progresses.
In [10]:
training_set_files = {
50 : 'spam_data/MATRIX.TRAIN.50',
100 : 'spam_data/MATRIX.TRAIN.100',
200 : 'spam_data/MATRIX.TRAIN.200',
400 : 'spam_data/MATRIX.TRAIN.400',
800 : 'spam_data/MATRIX.TRAIN.800',
1400 : 'spam_data/MATRIX.TRAIN.1400'
}
Estimate the models and compute the errors.
In [11]:
nb_models = {}
for size, filename in training_set_files.items():
    df_next = rm.read_data(filename)
    m = nb_train.train(df_next)
    nb_models[size] = m
In [12]:
nb_errors = {}
for size, model in nb_models.items():
    guessed_y = nb_test.test(model, df_test)
    err = nb_test.compute_error(y, guessed_y)
    nb_errors[size] = err
The resulting errors are:
In [13]:
print('Naive Bayes')
for size, error in nb_errors.items():
    print('size: {}; error: {}'.format(size, error))
As the training set size increases, the error rate of the Naive Bayes classifier decreases, leveling off once the training set exceeds roughly 1000 emails.
The second model used to classify the emails is a support vector machine. As in part (a), we train the SVM model using the MATRIX.TRAIN
data.
In [14]:
tau = 8
max_iters = 40
In [15]:
svm_model = svm_train.train(df_train, tau, max_iters)
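svm_train.train is defined in the accompanying module; the parameters suggest a kernelized SVM with a Gaussian kernel of bandwidth tau, trained for max_iters passes of stochastic subgradient descent on the regularized hinge loss. A sketch under those assumptions (the regularization constant, step size, and averaging are illustrative choices, not necessarily the module's):
In [ ]:
def train_svm(df, tau, max_iters, seed=0):
    """Sketch of a kernelized SVM trained by stochastic subgradient descent.

    Assumes a Gaussian kernel K(x, z) = exp(-||x - z||**2 / (2 * tau**2)),
    0/1 labels in column 0 mapped to {-1, +1}, and token counts elsewhere."""
    rng = np.random.default_rng(seed)
    X = df.iloc[:, 1:].to_numpy(dtype=float)
    y = 2 * df.iloc[:, 0].to_numpy(dtype=float) - 1   # {0, 1} -> {-1, +1}
    m = X.shape[0]
    lam = 1.0 / (64 * m)                              # illustrative regularization
    sq = (X ** 2).sum(axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * tau ** 2))
    alpha = np.zeros(m)
    alpha_avg = np.zeros(m)
    for t in range(1, max_iters * m + 1):
        i = rng.integers(m)
        # Stochastic subgradient of (1/m) * sum_i hinge_i + (lam/2) * alpha' K alpha.
        grad = m * lam * K[:, i] * alpha[i]
        if y[i] * (K[i] @ alpha) < 1:
            grad -= y[i] * K[:, i]
        alpha -= grad / np.sqrt(t)
        alpha_avg += alpha
    # Return the training points and the averaged coefficients.
    return X, alpha_avg / (max_iters * m)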
Next, we run the trained SVM model against the testing data.
In [16]:
svm_predictions = svm_test.test(svm_model, df_test)
In [17]:
print(svm_predictions.shape)
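svm_test.test presumably evaluates the kernel between each test email and the training emails and takes the sign of the weighted sum; a sketch paired with the training sketch above, under the same Gaussian-kernel assumption:
In [ ]:
def predict_svm(df, X_train, alpha, tau):
    """Predict {-1, +1} labels as sign(K(X_test, X_train) @ alpha)."""
    X_test = df.iloc[:, 1:].to_numpy(dtype=float)
    sq_te = (X_test ** 2).sum(axis=1)
    sq_tr = (X_train ** 2).sum(axis=1)
    d2 = sq_te[:, None] + sq_tr[None, :] - 2 * X_test @ X_train.T
    K = np.exp(-d2 / (2 * tau ** 2))
    return np.sign(K @ alpha)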
The testing error is:
In [18]:
# Map the 0/1 labels to the {-1, +1} labels expected by the SVM.
ytest = 2 * df_test.iloc[:, 0].to_numpy() - 1
svm_error = svm_test.compute_error(ytest, svm_predictions)
In [19]:
print('SVM Test Error: {}'.format(svm_error))
For each of the different training set sizes, we estimate an SVM model.
In [20]:
svm_models = {}
for size, filename in training_set_files.items():
    df_next = rm.read_data(filename)
    m = svm_train.train(df_next, tau, max_iters)
    svm_models[size] = m
And we compute the errors for each model.
In [23]:
svm_errors = {}
for size, model in svm_models.items():
    guessed_y = svm_test.test(model, df_test)
    err = svm_test.compute_error(ytest, guessed_y)
    svm_errors[size] = err
The resulting errors are:
In [24]:
print('Support Vector Machine')
for size, error in svm_errors.items():
    print('size: {}; error: {}'.format(size, error))
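The two learning curves are easier to compare in a plot; a short cell using matplotlib (which is not imported at the top of this notebook, so this is an optional extra):
In [ ]:
import matplotlib.pyplot as plt

sizes = sorted(nb_errors)
plt.plot(sizes, [nb_errors[s] for s in sizes], marker='o', label='Naive Bayes')
plt.plot(sizes, [svm_errors[s] for s in sizes], marker='o', label='SVM')
plt.xlabel('Training set size')
plt.ylabel('Test error')
plt.legend()
plt.show()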
For this data set, the SVM is a much better classifier than Naive Bayes: in these simulations its test error falls toward zero far more rapidly as the training set grows.
In [ ]: