In [1]:
%autosave 0
from IPython.display import HTML, display
display(HTML('<style>.container { width:100%; } </style>'))
The process of creating a spam detector using a Support Vector Machine is split up into five steps.
In [2]:
import os
import re
import numpy as np
import math
In [3]:
from collections import Counter
The directory https://github.com/karlstroetmann/Artificial-Intelligence/tree/master/Python/EmailData contains 960 emails that are divided into four subdirectories:
spam-train contains 350 spam emails for training,
ham-train contains 350 non-spam emails for training,
spam-test contains 130 spam emails for testing,
ham-test contains 130 non-spam emails for testing.
I have found this data on the page http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html provided by Andrew Ng.
We declare some variables so that this notebook can be adapted to other data sets.
In [4]:
spam_dir_train = 'EmailData/spam-train/'
ham__dir_train = 'EmailData/ham-train/'
spam_dir_test = 'EmailData/spam-test/'
ham__dir_test = 'EmailData/ham-test/'
Directories = [spam_dir_train, ham__dir_train, spam_dir_test, ham__dir_test]
The function $\texttt{get_words_set}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument. It reads the file and returns the set
of all words that are found in this file. The words are transformed to lower case.
In [5]:
def get_words_set(fn):
    with open(fn) as file:
        text = file.read()
    text = text.lower()
    return set(re.findall(r"[\w']+", text))
The function read_all_files reads all files contained in the directories that are stored in the list Directories.
It returns a Counter. For every word $w$, this counter contains the number of files that contain $w$.
In [6]:
def read_all_files():
    Words = Counter()
    for directory in Directories:
        for file_name in os.listdir(directory):
            Words.update(get_words_set(directory + file_name))
    return Words
Common_Words is a numpy array of the 2500 most common words found in all of our emails.
In [11]:
M = 2500 # number of the most common words to use
Word_Counter = read_all_files()
Common_Words = np.array(list({ w for w, _ in Word_Counter.most_common(M) }))
In [12]:
Common_Words
Out[12]:
Index_Dict is a dictionary that maps the most common words to their index in the array Common_Words.
In [13]:
Index_Dict = { w: i for i, w in enumerate(Common_Words) }
Index_Dict
Out[13]:
The function $\texttt{transform_to_vector}(L)$ takes a list of words $L$ and transforms this list into a vector $\mathbf{v}$. If $\texttt{Common_Words}[i] = w$, then $\mathbf{v}[i]$ specifies the number of times that $w$ occurs in $L$.
In [14]:
def transform_to_vector(L):
    Result = np.zeros(len(Common_Words))
    for w in L:
        if w in Index_Dict:
            Result[Index_Dict[w]] += 1
    return Result
The function $\texttt{get_word_vector}(\texttt{fn})$ takes a filename $\texttt{fn}$, reads the specified file, and transforms it into a feature vector.
In [15]:
def get_word_vector(fn):
    with open(fn) as file:
        text = file.read()
    text = text.lower()
    return transform_to_vector(re.findall(r"[\w']+", text))
In natural language processing, the notion term is used as a synonym for word. Given a term $t$ and a document $d$, the term frequency $\texttt{tf}(t, d)$ is defined as $$ \texttt{tf}(t, d) = \frac{d.\texttt{count}(t)}{\texttt{len}(d)}, $$ where $d.\texttt{count}(t)$ counts the number of times $t$ appears in $d$ and $\texttt{len}(d)$ is the length of the list representing $d$.
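For illustration, this definition can be written down directly in Python; the function tf and the sample document below are only made up for this example and are not used anywhere else in the notebook.

def tf(t, d):
    """Return the term frequency of the term t in the document d (a list of words)."""
    return d.count(t) / len(d)

tf('free', ['free', 'money', 'now', 'free'])   # yields 2 / 4 = 0.5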
A corpus is a set of documents. Given a term $t$ and a corpus $\mathcal{C}$, the inverse document frequency
$\texttt{idf}(t,\mathcal{C})$ is defined as
$$ \texttt{idf}(t,\mathcal{C}) = \ln\left(\frac{\texttt{card}(\mathcal{C}) + 1}{\texttt{card}\bigl(\{ d \in \mathcal{C} \mid t \in d \}\bigr) + 1}\right). $$
The addition of $1$ in both numerator and denominator is called Laplace smoothing. It is necessary to prevent a division by zero for terms $t$ that do not occur in any document of the corpus.
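Likewise, a small sketch of the smoothed inverse document frequency; the function idf and the toy corpus are made up for demonstration only.

def idf(t, Corpus):
    """Return the smoothed inverse document frequency of the term t,
       where Corpus is a list of documents, each represented as a set of words.
    """
    return math.log((len(Corpus) + 1) / (sum(t in d for d in Corpus) + 1))

Corpus = [{'free', 'money'}, {'meeting', 'tomorrow'}, {'free', 'offer'}]
idf('free',   Corpus)   # ln(4 / 3) ≈ 0.288
idf('unseen', Corpus)   # ln(4 / 1) ≈ 1.386, no division by zero thanks to the smoothing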
The function $\texttt{feature_matrix}(\texttt{spam_dir}, \texttt{ham_dir})$ takes two directories that contain spam and ham, respectively. It computes a matrix $X$ and a vector $Y$: $X$ is the feature matrix, and for every row $r$ of $X$, $Y[r]$ is $1$ if the corresponding mail is ham and $0$ if it is spam.
The way $X$ is computed is quite inefficient; it would have been better to initialize $X$ as a matrix of shape $(N, M)$, where $N$ is the number of mails and $M$ is the number of common words. A sketch of this more efficient variant is shown after the cell below.
In [16]:
def feature_matrix(spam_dir, ham_dir):
    X = []
    Y = []
    for fn in os.listdir(spam_dir):
        X.append(get_word_vector(spam_dir + fn))
        Y.append(0)
    for fn in os.listdir(ham_dir):
        X.append(get_word_vector(ham_dir + fn))
        Y.append(1)
    X = np.array(X)
    Y = np.array(Y)
    return X, Y
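As announced above, a sketch of the more efficient variant that preallocates $X$ could look as follows; the name feature_matrix_fast is made up, and this function is not used in the remainder of the notebook.

def feature_matrix_fast(spam_dir, ham_dir):
    """Preallocate the feature matrix instead of appending row by row."""
    spam_files = [spam_dir + fn for fn in os.listdir(spam_dir)]
    ham_files  = [ham_dir  + fn for fn in os.listdir(ham_dir)]
    N = len(spam_files) + len(ham_files)
    X = np.zeros((N, len(Common_Words)))
    Y = np.zeros(N)
    for row, fn in enumerate(spam_files + ham_files):
        X[row, :] = get_word_vector(fn)
    Y[len(spam_files):] = 1   # ham mails are labelled 1, spam mails 0
    return X, Y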
We convert the training set into a feature matrix.
In [17]:
%%time
X_train, Y_train = feature_matrix(spam_dir_train, ham__dir_train)
Up to now, the feature matrix contains only the term frequencies. Next, we multiply them by the inverse document frequencies.
In [18]:
N, _ = X_train.shape
IDF = {}
for w, i in Index_Dict.items():
    IDF[w] = np.log((N + 1) / (Word_Counter[w] + 1))
    X_train[:, i] = X_train[:, i] * IDF[w]
We build the feature matrix for the test set.
In [19]:
X_test, Y_test = feature_matrix(spam_dir_test, ham__dir_test)
In [20]:
for w, i in Index_Dict.items():
    X_test[:, i] = X_test[:, i] * IDF[w]
In [21]:
import sklearn.svm as svm
Train an SVM and compute the accuracy on the training data.
In [22]:
M = svm.SVC(kernel='linear', C=100000)
M.fit(X_train, Y_train)
M.score(X_train, Y_train)
Out[22]:
Compute the accuracy for the test data.
In [23]:
M.score(X_test, Y_test)
Out[23]:
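Finally, to classify a single new email, the trained model needs the same preprocessing that was applied to the training data. The following is a minimal sketch; the name classify is made up, and the call below just picks the first file of the spam test directory.

def classify(fn):
    """Classify the email stored in the file fn as 'ham' or 'spam'."""
    v = get_word_vector(fn)          # term counts for the 2500 common words
    for w, i in Index_Dict.items():
        v[i] *= IDF[w]               # apply the same tf-idf weighting as for training
    return 'ham' if M.predict([v])[0] == 1 else 'spam'

classify(spam_dir_test + os.listdir(spam_dir_test)[0])   # classify the first spam mail of the test set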