This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Classifying Emails Using the Naive Bayes Classifier

The final task of this chapter will be to apply our newly gained skills to a real spam filter! Naive Bayes classifiers are actually a very popular model for email filtering. Their naivety lends itself nicely to the analysis of text data, where each feature is a word (or a bag of words), and it would not be feasible to model the dependence of every word on every other word.
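To see where this naivety pays off, recall that the conditional independence assumption lets the posterior probability factorize into per-word terms:

$$P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}),$$

so the classifier only ever has to estimate how likely each individual word is under spam versus ham, never the joint behavior of whole word combinations.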

A bunch of interesting email datasets are mentioned in the book.

In this section, we will be using the Enron-Spam dataset, which can be downloaded for free from the website referenced in the book. However, if you followed the installation instructions at the beginning of this book and have downloaded the latest code from GitHub, you are already good to go!

Loading the dataset

If you downloaded the latest code from GitHub, you will find a number of .zip files in the notebooks/data/chapter7 directory. These files contain the raw email data (with fields for To:, Cc:, and the text body), where each email is classified either as spam (the SPAM = 1 class label) or as legitimate mail, also known as ham (the HAM = 0 class label).

We build a variable called sources, which contains all the raw data files:


In [1]:
HAM = 0
SPAM = 1
datadir = 'data/chapter7'
sources = [
    ('beck-s.tar.gz', HAM),
    ('farmer-d.tar.gz', HAM),
    ('kaminski-v.tar.gz', HAM),
    ('kitchen-l.tar.gz', HAM),
    ('lokay-m.tar.gz', HAM),
    ('williams-w3.tar.gz', HAM),
    ('BG.tar.gz', SPAM),
    ('GP.tar.gz', SPAM),
    ('SH.tar.gz', SPAM)
]

The first step is to extract these files into subdirectories. For this, we can use the extract_tar function we wrote in the previous chapter:


In [2]:
def extract_tar(datafile, extractdir):
    try:
        import tarfile
    except ImportError:
        raise ImportError("You do not have tarfile installed. "
                          "Try unzipping the file outside of Python.")

    tar = tarfile.open(datafile)
    tar.extractall(path=extractdir)
    tar.close()
    print("%s successfully extracted to %s" % (datafile, extractdir))

In order to apply the function to all data files in the sources, we need to run a loop. The extract_tar function expects a path to the .tar.gz file—which we build from datadir and an entry in sources—and a directory to extract the files to (datadir). This will extract all emails in, for example, data/chapter7/beck-s.tar.gz to the data/chapter7/beck-s/ subdirectory:


In [3]:
for source, _ in sources:
    datafile = '%s/%s' % (datadir, source)
    extract_tar(datafile, datadir)


data/chapter7/beck-s.tar.gz successfully extracted to data/chapter7
data/chapter7/farmer-d.tar.gz successfully extracted to data/chapter7
data/chapter7/kaminski-v.tar.gz successfully extracted to data/chapter7
data/chapter7/kitchen-l.tar.gz successfully extracted to data/chapter7
data/chapter7/lokay-m.tar.gz successfully extracted to data/chapter7
data/chapter7/williams-w3.tar.gz successfully extracted to data/chapter7
data/chapter7/BG.tar.gz successfully extracted to data/chapter7
data/chapter7/GP.tar.gz successfully extracted to data/chapter7
data/chapter7/SH.tar.gz successfully extracted to data/chapter7

Now here's the tricky bit. Every one of these subdirectories contains a number of other directories, wherein the text files reside. So we need to write two functions:

  • read_single_file(filename): This is a function that extracts the relevant content from a single file called filename
  • read_files(path): This is a function that extracts the relevant content from all files in a particular directory called path

In [4]:
import os
def read_single_file(filename):
    past_header, lines = False, []
    if os.path.isfile(filename):
        f = open(filename, encoding="latin-1")
        for line in f:
            if past_header:
                lines.append(line)
            elif line == '\n':
                # the first blank line separates the email header from the body
                past_header = True
        f.close()
    # keep only the body text; the header fields are not used as features
    content = '\n'.join(lines)
    return filename, content

In [5]:
def read_files(path):
    # walk the directory tree under path and yield (filename, content) pairs
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            yield read_single_file(filepath)

Building a data matrix using Pandas

Now it's time to introduce another essential data science tool that comes preinstalled with Python Anaconda: Pandas. Pandas is built on NumPy and provides a number of useful tools and methods to deal with data structures in Python. Just as we generally import NumPy under the alias np, it is common to import Pandas under the pd alias:


In [6]:
import pandas as pd

Pandas provides a useful data structure called DataFrame, which can be understood as a generalization of a 2D NumPy array, as shown here:


In [7]:
pd.DataFrame({
    'model': ['Normal Bayes', 'Multinomial Bayes', 'Bernoulli Bayes'],
    'class': [
        'cv2.ml.NormalBayesClassifier_create()',
        'sklearn.naive_bayes.MultinomialNB()',
        'sklearn.naive_bayes.BernoulliNB()'
    ]
})


Out[7]:
                                   class              model
0  cv2.ml.NormalBayesClassifier_create()       Normal Bayes
1    sklearn.naive_bayes.MultinomialNB()  Multinomial Bayes
2      sklearn.naive_bayes.BernoulliNB()    Bernoulli Bayes

We can combine the preceding functions to build a Pandas DataFrame from the extracted data:


In [8]:
def build_data_frame(extractdir, classification):
    rows = []
    index = []
    for file_name, text in read_files(extractdir):
        rows.append({'text': text, 'class': classification})
        index.append(file_name)

    data_frame = pd.DataFrame(rows, index=index)
    return data_frame

We then call it with the following command:


In [9]:
data = pd.DataFrame({'text': [], 'class': []})
for source, classification in sources:
    extractdir = '%s/%s' % (datadir, source[:-7])
    data = data.append(build_data_frame(extractdir, classification))
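A side note on newer library versions: DataFrame.append was removed in Pandas 2.0, so if the preceding cell raises an AttributeError, an equivalent way to assemble the same DataFrame is to collect the per-source frames and concatenate them in one go:

# Equivalent construction for Pandas 2.0+, where DataFrame.append no longer exists
frames = [build_data_frame('%s/%s' % (datadir, source[:-7]), classification)
          for source, classification in sources]
data = pd.concat(frames)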

Preprocessing the data

Scikit-learn offers a number of options when it comes to encoding text features, which we discussed in Chapter 4, Representing Data and Engineering Features. One of the simplest methods of encoding text data, we recall, is by word count: For each document (here, each email), you count how many times every word occurs in it. In scikit-learn, this is easily done using CountVectorizer:


In [10]:
from sklearn import feature_extraction
counts = feature_extraction.text.CountVectorizer()
X = counts.fit_transform(data['text'].values)
X.shape


Out[10]:
(52076, 643270)

The result is a giant matrix, which tells us that we harvested a total of 52,076 emails that collectively contain 643,270 different words. However, scikit-learn is smart and saved the data in a sparse matrix:


In [11]:
X


Out[11]:
<52076x643270 sparse matrix of type '<class 'numpy.int64'>'
	with 8607632 stored elements in Compressed Sparse Row format>
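As a quick sanity check (not part of the original notebook), we can work out how sparse this matrix really is from the numbers in the repr above, using the standard nnz and shape attributes of a SciPy sparse matrix:

# Fraction of nonzero entries in the feature matrix
density = X.nnz / float(X.shape[0] * X.shape[1])
print('%.4f%% of the entries are nonzero' % (100 * density))

With roughly 8.6 million stored elements out of more than 33 billion cells, only about 0.03 percent of the matrix is nonzero, which is why a dense representation would be hopeless.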

In order to build the vector of target labels (y), we need to access data in the Pandas DataFrame. This can be done by treating the DataFrame like a dictionary, where the values attribute will give us access to the underlying NumPy array:


In [12]:
y = data['class'].values

Training a normal Bayes classifier

From here on out, things are (almost) like they always were. We can use scikit-learn to split the data into training and test sets, reserving 20 percent of all data points for testing:


In [13]:
from sklearn import model_selection as ms
X_train, X_test, y_train, y_test = ms.train_test_split(
    X, y, test_size=0.2, random_state=42
)

We can instantiate a new normal Bayes classifier with OpenCV:


In [14]:
import cv2
model_norm = cv2.ml.NormalBayesClassifier_create()

However, OpenCV does not know about sparse matrices (at least its Python interface does not). If we were to pass X_train and y_train to the train function like we did earlier, OpenCV would complain that the data matrix is not a NumPy array. But converting the sparse matrix into a regular NumPy array will likely make you run out of memory.

Thus, a possible workaround is to train the OpenCV classifier only on a subset of data points (say 1,000) and features (say 300):


In [15]:
import numpy as np

# OpenCV wants a dense float32 data matrix, so convert only a small slice
X_train_small = X_train[:1000, :300].toarray().astype(np.float32)
y_train_small = y_train[:1000]

Then it becomes possible to train the OpenCV classifier (although this might take a while):

It appears that NormalBayesClassifier is broken in OpenCV 3.1 (segmentation fault), which makes the kernel die. That is why the training call in the following cell is commented out.


In [16]:
# model_norm.train(X_train_small, cv2.ml.ROW_SAMPLE, y_train_small)
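For reference, if NormalBayesClassifier does work in your OpenCV build, the usual cv2.ml call pattern is sketched below. It is left commented out for the same reason as the cell above, and the conversion of the labels to int32 is an assumption about what the OpenCV API expects rather than something verified in this notebook:

# Sketch only: train on the small dense subset, then predict on it.
# (OpenCV classifiers generally expect float32 samples and int32 labels.)
# model_norm.train(X_train_small, cv2.ml.ROW_SAMPLE, y_train_small.astype(np.int32))
# _, y_hat = model_norm.predict(X_train_small)
# np.mean(y_hat.ravel() == y_train_small)  # training accuracy on the subset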

Training on the full dataset

However, if we want to classify the full dataset, we need a more sophisticated approach. We turn to scikit-learn's naive Bayes classifier, as it knows how to handle sparse matrices. In fact, if you didn't pay attention and treated X_train like every NumPy array before, you might not even notice that anything is different:


In [17]:
from sklearn import model_selection as ms
X_train, X_test, y_train, y_test = ms.train_test_split(
    X, y, test_size=0.2, random_state=42
)

Here we use MultinomialNB from the naive_bayes module, which is the version of the naive Bayes classifier best suited to count data, such as word counts.


In [18]:
from sklearn import naive_bayes
model_naive = naive_bayes.MultinomialNB()
model_naive.fit(X_train, y_train)


Out[18]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

The classifier is trained almost instantly, and we can then compute scores on both the training and the test set:


In [19]:
model_naive.score(X_train, y_train)


Out[19]:
0.95026404224675953

In [20]:
model_naive.score(X_test, y_test)


Out[20]:
0.948252688172043

And there we have it: roughly 94.8% accuracy on the test set! Pretty good for not doing much other than using the default values, isn't it?

However, what if we were super critical of our own work and wanted to improve the result even further? There are a couple of things we could do.

Using n-grams to improve the result

One option is to use $n$-gram counts instead of plain word counts. So far, we have relied on what is known as a bag of words: We simply threw every word of an email into a bag and counted the number of its occurrences. However, in real emails, the order in which words appear can carry a great deal of information!

This is exactly what $n$-gram counts are trying to convey. You can think of an $n$-gram as a phrase that is $n$ words long. For example, the phrase Statistics has its moments contains the following 1-grams: Statistics, has, its, and moments. It also has the following 2-grams: Statistics has, has its, and its moments. It also has two 3-grams (Statistics has its and has its moments), and only a single 4-gram.
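To make this concrete, here is a small sketch (not from the original notebook) that runs CountVectorizer with unigrams and bigrams over that example phrase; the extracted vocabulary contains exactly the 1-grams and 2-grams listed above, lowercased and sorted alphabetically:

# feature_extraction was imported earlier (In [10])
ngrams = feature_extraction.text.CountVectorizer(ngram_range=(1, 2))
ngrams.fit(['Statistics has its moments'])
print(ngrams.get_feature_names())  # use get_feature_names_out() in scikit-learn >= 1.0
# ['has', 'has its', 'its', 'its moments', 'moments', 'statistics', 'statistics has']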

We can tell CountVectorizer to include any order of $n$-grams into the feature matrix by specifying a range for $n$:


In [21]:
counts = feature_extraction.text.CountVectorizer(
    ngram_range=(1, 2)
)
X = counts.fit_transform(data['text'].values)

We then repeat the entire procedure of splitting the data and training the classifier:


In [22]:
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [23]:
model_naive = naive_bayes.MultinomialNB()
model_naive.fit(X_train, y_train)


Out[23]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

You might have noticed that the training is taking much longer this time. To our delight, we find that the performance has significantly increased:


In [24]:
model_naive.score(X_test, y_test)


Out[24]:
0.97081413210445466

However, $n$-gram counts are not perfect. They have the disadvantage of unfairly weighting longer documents (because there are more possible combinations of forming $n$-grams).

To avoid this problem, we can use relative frequencies instead of a simple number of occurrences. We have already encountered one way to do so, and it had a horribly complicated name.

Do you remember what it was called?

Using tf-idf to improve the result

It was called the term frequency–inverse document frequency (tf–idf), and we encountered it in Chapter 4, Representing Data and Engineering Features. If you recall, what tf–idf does is weigh the word counts by a measure of how often the words appear across the entire dataset. A useful side effect of this method is the idf part, the inverse document frequency, which makes sure that frequent words, such as and, the, and but, carry only a small weight in the classification.
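With scikit-learn's default settings (used below), the weight assigned to a term $t$ in a document $d$ is approximately

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \left( \ln\frac{1 + n}{1 + \text{df}(t)} + 1 \right),$$

where $\text{tf}(t, d)$ is the raw count of $t$ in $d$, $n$ is the total number of documents, and $\text{df}(t)$ is the number of documents that contain $t$; each document vector is then rescaled to unit length. The rarer a word is across the corpus, the larger its idf factor, and the more weight its occurrences carry.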

We apply tf–idf by calling fit_transform on our existing feature matrix X:


In [25]:
tfidf = feature_extraction.text.TfidfTransformer()

In [26]:
X_new = tfidf.fit_transform(X)

Don't forget to split the data:


In [27]:
X_train, X_test, y_train, y_test = ms.train_test_split(
    X_new, y, test_size=0.2, random_state=42
)

Then, when we train and score the classifier again, we suddenly find a remarkable score of 99% accuracy!


In [28]:
model_naive = naive_bayes.MultinomialNB()
model_naive.fit(X_train, y_train)
model_naive.score(X_test, y_test)


Out[28]:
0.99039938556067586

To convince ourselves of the classifier's awesomeness, we can inspect the confusion matrix. This is a matrix that shows, for every true class, how many of its samples were assigned to each predicted class.

The diagonal elements of the matrix tell us how many samples of class $i$ were correctly classified as belonging to class $i$. The off-diagonal elements represent misclassifications:


In [29]:
from sklearn import metrics

In [30]:
metrics.confusion_matrix(y_test, model_naive.predict(X_test))


Out[30]:
array([[3737,   93],
       [   7, 6579]])

This tells us we got 3,737 class 0 (ham) classifications correct and 6,579 class 1 (spam) classifications correct. We confused 93 samples of class 0 as belonging to class 1, and 7 samples of class 1 as belonging to class 0.
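If we want this breakdown as per-class precision and recall rather than raw counts, scikit-learn's classification_report (a small addition on top of the notebook's code) summarizes it in one call:

print(metrics.classification_report(y_test, model_naive.predict(X_test),
                                    target_names=['ham', 'spam']))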