The final task of this chapter will be to apply our newly gained skills to a real spam filter! Naive Bayes classifiers are actually a very popular model for email filtering. Their naivety lends itself nicely to the analysis of text data, where each feature is a word (or a bag of words), and it would not be feasible to model the dependence of every word on every other word.
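As a brief reminder of what that naivety means (the formula below is the standard statement of the model, spelled out here rather than taken from this excerpt), the class posterior factors into one term per word:

$$P(\text{spam} \mid w_1, \ldots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})$$

Each $P(w_i \mid \text{spam})$ can be estimated from word counts alone, which is why the model scales so well to text.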
A number of other interesting email datasets are mentioned in the book.
In this section, we will be using the Enron-Spam dataset, which can be downloaded for free from the website given in the book. However, if you followed the installation instructions at the beginning of this book and downloaded the latest code from GitHub, you are already good to go!
If you downloaded the latest code from GitHub, you will find a number of .tar.gz files in the notebooks/data/chapter7 directory. These files contain raw email data (with fields for To:, Cc:, and text body) that are classified either as spam (with the SPAM = 1 class label) or as not spam, also known as ham (the HAM = 0 class label).
We build a variable called sources, which contains all the raw data files:
In [1]:
HAM = 0
SPAM = 1
datadir = 'data/chapter7'
sources = [
    ('beck-s.tar.gz', HAM),
    ('farmer-d.tar.gz', HAM),
    ('kaminski-v.tar.gz', HAM),
    ('kitchen-l.tar.gz', HAM),
    ('lokay-m.tar.gz', HAM),
    ('williams-w3.tar.gz', HAM),
    ('BG.tar.gz', SPAM),
    ('GP.tar.gz', SPAM),
    ('SH.tar.gz', SPAM)
]
The first step is to extract these files into subdirectories. For this, we can use the extract_tar function we wrote in the previous chapter:
In [2]:
def extract_tar(datafile, extractdir):
    try:
        import tarfile
    except ImportError:
        raise ImportError("You do not have tarfile installed. "
                          "Try unzipping the file outside of Python.")
    tar = tarfile.open(datafile)
    tar.extractall(path=extractdir)
    tar.close()
    print("%s successfully extracted to %s" % (datafile, extractdir))
In order to apply the function to all data files in sources, we need to run a loop. The extract_tar function expects a path to the .tar.gz file (which we build from datadir and an entry in sources) and a directory to extract the files to (datadir). This will extract all emails in, for example, data/chapter7/beck-s.tar.gz to the data/chapter7/beck-s/ subdirectory:
In [3]:
for source, _ in sources:
    datafile = '%s/%s' % (datadir, source)
    extract_tar(datafile, datadir)
Now here's the tricky bit. Every one of these subdirectories contains a number of other directories, wherein the text files reside. So we need to write two functions:
- read_single_file(filename): This function extracts the relevant content from a single file called filename
- read_files(path): This function extracts the relevant content from all files in a particular directory called path
In [4]:
import os

def read_single_file(filename):
    past_header, lines = False, []
    if os.path.isfile(filename):
        f = open(filename, encoding="latin-1")
        for line in f:
            if past_header:
                # everything after the header belongs to the message body
                lines.append(line)
            elif line == '\n':
                # the first blank line marks the end of the email header
                past_header = True
        f.close()
    content = '\n'.join(lines)
    return filename, content
In [5]:
def read_files(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            yield read_single_file(filepath)
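As a quick sanity check (not part of the book's code; the directory name below is just the beck-s example mentioned earlier), you could peek at the first email that read_files yields:

# Sketch: inspect the first (filename, content) pair produced by read_files.
fname, content = next(read_files('data/chapter7/beck-s'))
print(fname)             # path of the first email file found
print(content[:200])     # first 200 characters of the message body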
Now it's time to introduce another essential data science tool that comes preinstalled with Python Anaconda: Pandas. Pandas is built on NumPy and provides a number of useful tools and methods to deal with data structures in Python. Just as we generally import NumPy under the alias np, it is common to import Pandas under the pd alias:
In [6]:
import pandas as pd
Pandas provides a useful data structure called DataFrame, which can be understood as a generalization of a 2D NumPy array, as shown here:
In [7]:
pd.DataFrame({
    'model': ['Normal Bayes', 'Multinomial Bayes', 'Bernoulli Bayes'],
    'class': [
        'cv2.ml.NormalBayesClassifier_create()',
        'sklearn.naive_bayes.MultinomialNB()',
        'sklearn.naive_bayes.BernoulliNB()'
    ]
})
Out[7]:
We can combine the preceding functions to build a Pandas DataFrame from the extracted data:
In [8]:
def build_data_frame(extractdir, classification):
    rows = []
    index = []
    for file_name, text in read_files(extractdir):
        rows.append({'text': text, 'class': classification})
        index.append(file_name)
    data_frame = pd.DataFrame(rows, index=index)
    return data_frame
We then call it with the following command:
In [9]:
data = pd.DataFrame({'text': [], 'class': []})
for source, classification in sources:
    extractdir = '%s/%s' % (datadir, source[:-7])
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat there instead
    data = data.append(build_data_frame(extractdir, classification))
Scikit-learn offers a number of options when it comes to encoding text features, which we discussed in Chapter 4, Representing Data and Engineering Features. One of the simplest methods of encoding text data, we recall, is by word count: For each phrase, you count the number of occurrences of each word within it. In scikit-learn, this is easily done using CountVectorizer:
In [10]:
from sklearn import feature_extraction
counts = feature_extraction.text.CountVectorizer()
X = counts.fit_transform(data['text'].values)
X.shape
Out[10]:
The result is a giant matrix, which tells us that we harvested a total of 52,076 emails that collectively contain 643,270 different words. However, scikit-learn is smart and saved the data in a sparse matrix:
In [11]:
X
Out[11]:
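If you are curious how much space this saves (a quick aside, not from the book), you can inspect the sparse matrix directly; only the nonzero entries are actually stored:

# Aside: check how sparse the count matrix really is.
print(type(X))                             # a SciPy sparse matrix in CSR format
print(X.nnz)                               # number of stored (nonzero) entries
print(X.nnz / (X.shape[0] * X.shape[1]))   # fraction of entries that are nonzero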
In order to build the vector of target labels (y), we need to access the data in the Pandas DataFrame. This can be done by treating the DataFrame like a dictionary, where the values attribute will give us access to the underlying NumPy array:
In [12]:
y = data['class'].values
In [13]:
from sklearn import model_selection as ms
X_train, X_test, y_train, y_test = ms.train_test_split(
    X, y, test_size=0.2, random_state=42
)
We can instantiate a new normal Bayes classifier with OpenCV:
In [14]:
import cv2
model_norm = cv2.ml.NormalBayesClassifier_create()
However, OpenCV does not know about sparse matrices (at least, its Python interface does not). If we were to pass X_train and y_train to the train function like we did earlier, OpenCV would complain that the data matrix is not a NumPy array. But converting the sparse matrix into a regular NumPy array will likely make you run out of memory.
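To see why (a back-of-the-envelope sketch, not from the book), consider what a dense version of the full matrix would cost at 32-bit precision:

# Rough, illustrative estimate of the memory a dense float32 version of X would need.
n_samples, n_features = 52076, 643270      # the shape reported above
print(n_samples * n_features * 4 / 1e9)    # roughly 134 GB at 4 bytes per entry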
Thus, a possible workaround is to train the OpenCV classifier only on a subset of data points (say 1,000) and features (say 300):
In [15]:
import numpy as np
X_train_small = X_train[:1000, :300].toarray().astype(np.float32)
y_train_small = y_train[:1000]
Then it becomes possible to train the OpenCV classifier (although this might take a while):
It appears that NormalBayesClassifier is broken in OpenCV 3.1 (it causes a segmentation fault), so running it would kill the kernel. This is why the call in the following cell is commented out:
In [16]:
# model_norm.train(X_train_small, cv2.ml.ROW_SAMPLE, y_train_small)
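As a stand-in (this is not part of the book's pipeline, just a sanity check if you really want a normal Bayes model), you could train scikit-learn's GaussianNB, its Gaussian naive Bayes classifier, on the same small dense subset:

# Sketch: scikit-learn's Gaussian naive Bayes as a stand-in for OpenCV's
# NormalBayesClassifier, trained on the small dense subset from above.
from sklearn import naive_bayes
model_gauss = naive_bayes.GaussianNB()
model_gauss.fit(X_train_small, y_train_small)
model_gauss.score(X_train_small, y_train_small)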
However, if we want to classify the full dataset, we need a more sophisticated approach. We turn to scikit-learn's naive Bayes classifier, as it knows how to handle sparse matrices. In fact, if you didn't pay attention and treated X_train like every NumPy array before, you might not even notice that anything is different:
In [17]:
from sklearn import model_selection as ms
X_train, X_test, y_train, y_test = ms.train_test_split(
    X, y, test_size=0.2, random_state=42
)
Here we use MultinomialNB from the naive_bayes module, which is the version of the naive Bayes classifier that is best suited to handle categorical data, such as word counts.
In [18]:
from sklearn import naive_bayes
model_naive = naive_bayes.MultinomialNB()
model_naive.fit(X_train, y_train)
Out[18]:
The classifier is trained almost instantly and returns the scores for both the training and the test set:
In [19]:
model_naive.score(X_train, y_train)
Out[19]:
In [20]:
model_naive.score(X_test, y_test)
Out[20]:
And there we have it: 94.4% accuracy on the test set! Pretty good for not doing much other than using the default values, isn't it?
However, what if we were super critical of our own work and wanted to improve the result even further? There are a couple of things we could do.
One thing to do is to use $n$-gram counts instead of plain word counts. So far, we have relied on what is known as a bag of words: We simply threw every word of an email into a bag and counted the number of its occurrences. However, in real emails, the order in which words appear can carry a great deal of information!
This is exactly what $n$-gram counts are trying to convey. You can think of an $n$-gram as a phrase that is $n$ words long. For example, the phrase Statistics has its moments contains the following 1-grams: Statistics, has, its, and moments. It also has the following 2-grams: Statistics has, has its, and its moments. It also has two 3-grams (Statistics has its and has its moments), and only a single 4-gram.
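To see this in code (a small illustration, not part of the book's pipeline), we can fit CountVectorizer with an $n$-gram range on that example phrase and inspect the resulting vocabulary:

# Illustrative only: extract the 1-grams and 2-grams of the example phrase.
from sklearn import feature_extraction
vec = feature_extraction.text.CountVectorizer(ngram_range=(1, 2))
vec.fit(['Statistics has its moments'])
sorted(vec.vocabulary_)
# ['has', 'has its', 'its', 'its moments', 'moments', 'statistics', 'statistics has']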
We can tell CountVectorizer to include any order of $n$-grams into the feature matrix by specifying a range for $n$:
In [21]:
counts = feature_extraction.text.CountVectorizer(
    ngram_range=(1, 2)
)
X = counts.fit_transform(data['text'].values)
We then repeat the entire procedure of splitting the data and training the classifier:
In [22]:
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=42
)
In [23]:
model_naive = naive_bayes.MultinomialNB()
model_naive.fit(X_train, y_train)
Out[23]:
You might have noticed that the training is taking much longer this time. To our delight, we find that the performance has significantly increased:
In [24]:
model_naive.score(X_test, y_test)
Out[24]:
However, $n$-gram counts are not perfect. They have the disadvantage of unfairly weighting longer documents (because there are more possible combinations of forming $n$-grams).
To avoid this problem, we can use relative frequencies instead of a simple number of occurrences. We have already encountered one way to do so, and it had a horribly complicated name.
Do you remember what it was called?
It was called the term frequency–inverse document frequency (tf–idf), and we encountered it in Chapter 4, Representing Data and Engineering Features. If you recall, what tf–idf does is basically weigh the word counts by a measure of how often the words appear in the entire dataset. A useful side effect of this method is the idf part, the inverse document frequency: it makes sure that frequent words, such as and, the, and but, carry only a small weight in the classification.
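As a quick illustration (a toy example, not from the book), we can fit a TfidfVectorizer on a few short documents and look at the learned idf_ weights; a word that appears in every document, such as the, receives the smallest weight:

# Toy example: common words get low idf weights, rare words get high ones.
from sklearn import feature_extraction
tiny = ['the cat sat', 'the dog sat', 'the cat ran']
tfidf_demo = feature_extraction.text.TfidfVectorizer()
tfidf_demo.fit(tiny)
dict(zip(sorted(tfidf_demo.vocabulary_), tfidf_demo.idf_))
# 'the' gets idf 1.0; 'dog' and 'ran', which appear only once, get the largest idf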
We apply tf–idf to the feature matrix by calling fit_transform on our existing feature matrix X:
In [25]:
tfidf = feature_extraction.text.TfidfTransformer()
In [26]:
X_new = tfidf.fit_transform(X)
Don't forget to split the data:
In [27]:
X_train, X_test, y_train, y_test = ms.train_test_split(
    X_new, y, test_size=0.2, random_state=42
)
Then, when we train and score the classifier again, we suddenly find a remarkable score of 99% accuracy!
In [28]:
model_naive = naive_bayes.MultinomialNB()
model_naive.fit(X_train, y_train)
model_naive.score(X_test, y_test)
Out[28]:
To convince ourselves of the classifier's awesomeness, we can inspect the confusion matrix. This is a matrix that shows, for every class, how many data samples were misclassified as belonging to a different class.
The diagonal elements in the matrix tell us how many samples of the class $i$ were correctly classified as belonging to the class $i$. The off-diagonal elements represent misclassifications:
In [29]:
from sklearn import metrics
In [30]:
metrics.confusion_matrix(y_test, model_naive.predict(X_test))
Out[30]:
This tells us we got 3,746 class 0 classifications correct, and 6,575 class 1 classifications correct. We confused 84 samples of class 0 as belonging to class 1 and 11 samples of class 1 as belonging to class 0.
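As a final sanity check (a short sketch reusing the objects from the cells above), the roughly 99% accuracy reported earlier should equal the sum of the diagonal of the confusion matrix divided by the total number of test samples:

# Sketch: recover the accuracy score from the confusion matrix.
import numpy as np
from sklearn import metrics
conf_mat = metrics.confusion_matrix(y_test, model_naive.predict(X_test))
print(np.trace(conf_mat) / conf_mat.sum())   # fraction of correctly classified samples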