Naive Bayes Classifiers

In this lecture we will learn how to use a Naive Bayes Classifier to perform multi-class classification on a data set we are already familiar with: the Iris Data Set.

This lecture will consist of 7 main parts:

Part 1: Note on Notation and Math Terms
Part 2: Bayes' Theorem
Part 3: Introduction to Naive Bayes
Part 4: Naive Bayes Classifier Mathematics Overview
Part 5: Constructing a classifier from the probability model
Part 6: Gaussian Naive Bayes
Part 7: Gaussian Naive Bayes with SciKit Learn

Part 1: Note on Notation and Math Terms

There are a few more advanced notations and mathematical terms used during the explanation of naive Bayes Classification. You should be familiar with the following:

Product of Sequence

The product of a sequence of terms can be written with the product symbol, which derives from the capital letter Π (Pi) in the Greek alphabet. The meaning of this notation is given by: $$\prod_{i=1}^4 i = 1\cdot 2\cdot 3\cdot 4, $$ that is $$\prod_{i=1}^4 i = 24. $$
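As a quick illustration in Python (using only the standard library; the variable names are just for this sketch), the same product can be computed with a loop or with math.prod:

import math

# Product of the sequence 1, 2, 3, 4 -- the same quantity as the Pi notation above
terms = [1, 2, 3, 4]

product = 1
for t in terms:
    product *= t          # multiply the terms one at a time

print(product)            # 24
print(math.prod(terms))   # 24, using the standard-library helper (Python 3.8+)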

Arg Max

In mathematics, the argument of the maximum (abbreviated arg max or argmax) is the set of points of the given argument for which the given function attains its maximum value. In contrast to global maxima, which refer to a function's largest outputs, the arg max refers to the inputs which create those maximum outputs.

The arg max is defined by

$$\operatorname*{arg\,max}_x f(x) := \{x \mid \forall y : f(y) \le f(x)\}$$

In other words, it is the set of points x for which f(x) attains its largest value. This set may be empty, have one element, or have multiple elements. For example, if f(x) is 1−|x|, then it attains its maximum value of 1 at x = 0 and only there, so

$$\operatorname*{arg\,max}_x (1-|x|) = \{0\}$$
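As a small sketch in Python (assuming NumPy is available), np.argmax returns the index of the largest value in an array, which is the discrete analogue of the arg max above:

import numpy as np

# Evaluate f(x) = 1 - |x| on a grid of candidate inputs
x = np.linspace(-2, 2, 401)
f = 1 - np.abs(x)

best_index = np.argmax(f)      # index of the largest output
print(x[best_index])           # 0.0, where f attains its maximum value of 1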

Part 2: Bayes' Theorem

First, for a quick introduction to Bayes' Theorem, check out the Bayes' Theorem Lecture in the statistics appendix portion of this course. In order to fully understand Naive Bayes, you'll need a complete understanding of Bayes' Theorem.

Part 3: Introduction to Naive Bayes

Naive Bayes is probably one of the most practical machine learning algorithms. Despite its name, it actually performs very well as a classifier. It proves to be quite robust to irrelevant features, which it effectively ignores. It learns and predicts very quickly and it does not require much storage. So, why is it then called naive?

The "naive" was added to account for one assumption that is required for Bayes' Theorem to work optimally: all features must be independent of each other. In reality, this is usually not the case; however, the classifier still returns very good accuracy in practice even when the independence assumption does not hold.

Naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. We will be working with the Iris Flower data set in this lecture.

Part 4: Naive Bayes Classifier Mathematics Overview

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features. Given a class variable y and a dependent feature vector x1 through xn, Bayes’ theorem states the following relationship:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots x_n \mid y)} {P(x_1, \dots, x_n)}$$

Using the naive independence assumption that $$P(x_i | y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i | y)$$

for all i, this relationship is simplified to:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)} {P(x_1, \dots, x_n)}$$

We now have a relationship between the target and the features using Bayes Theorem along with a Naive Assumption that all features are independent.
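For instance, with just two features the naive assumption collapses the full conditional $P(x_1, x_2 \mid y)$ into a simple product, so the relationship above becomes:

$$P(y \mid x_1, x_2) = \frac{P(y)\, P(x_1 \mid y)\, P(x_2 \mid y)}{P(x_1, x_2)}$$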

Part 5: Constructing a classifier from the probability model

So far we have derived the independent feature model, the Naive Bayes probability model. The Naive Bayes classifier combines this model with a decision rule that decides which hypothesis is most probable; in our example, this is which class of flower is most probable.

Picking the hypothesis that is most probable is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label $\hat{y}$ as follows:

Since P(x1, ..., xn) is constant given the input, we can use the following classification rule: $$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

$$\Downarrow$$$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),$$

and we can use Maximum A Posteriori (MAP) estimation to estimate P(y) and P(xi | y); the former is then the relative frequency of class y in the training set.

There are different naive Bayes classifiers that differ mainly by the assumptions they make regarding the distribution of P(xi | y).
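As a minimal sketch of the MAP decision rule in Python (the priors and per-feature likelihood functions are hypothetical placeholders you would estimate from training data, not a scikit-learn API):

# A toy sketch of the MAP decision rule:
#   priors      -- dict mapping each class y to its estimated prior P(y)
#   likelihoods -- dict mapping each class y to a list of functions, one per feature,
#                  where likelihoods[y][i](x_i) returns the estimated P(x_i | y)
def map_predict(x, priors, likelihoods):
    def score(y):
        s = priors[y]
        for i, x_i in enumerate(x):
            s *= likelihoods[y][i](x_i)   # multiply in P(x_i | y) for each feature
        return s
    return max(priors, key=score)         # arg max over the classes

In practice, implementations usually sum log-probabilities instead of multiplying raw probabilities, to avoid numerical underflow when there are many features.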

Part 6: Gaussian Naive Bayes

When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a Gaussian distribution. Go back to the normal distribution lecture to review the formulas for the Gaussian/Normal Distribution.

As an example of using the Gaussian Distribution, suppose the training data contain a continuous attribute, x. We first segment the data by class, and then compute the mean and variance of x in each class. Let μc be the mean of the values of x associated with class c, and let σ²c be the variance of the values of x associated with class c. Then, the probability distribution of some value v given a class, p(x=v|c), can be computed by plugging v into the equation for a Normal distribution parameterized by μc and σ²c. That is:

$$p(x=v|c)=\frac{1}{\sqrt{2\pi\sigma^2_c}}\,e^{ -\frac{(v-\mu_c)^2}{2\sigma^2_c} }$$
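A minimal sketch of this class-conditional Gaussian in Python (assuming NumPy; values_for_class is a hypothetical array holding the training values of x observed for class c):

import numpy as np

def gaussian_likelihood(v, values_for_class):
    # Estimate the class mean and variance from the training values for class c,
    # then evaluate the Normal density at v -- this is p(x = v | c) above.
    mu = values_for_class.mean()
    var = values_for_class.var()
    return np.exp(-((v - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

For example, values_for_class could hold the petal lengths of all Setosa training flowers; evaluating the function at a new petal length gives the per-feature term that enters the product in the MAP rule above.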

The key to Naive Bayes is making the (rather large) assumption that the presence (or absence) of each data feature is independent of the others, conditional on the data point having a certain label.

Part 7: Gaussian Naive Bayes with SciKit Learn

Quick note: we will actually only use the SciKit Learn library in this lecture:


In [3]:
import pandas as pd
from pandas import Series,DataFrame

import matplotlib.pyplot as plt
import seaborn as sns

# Gaussian Naive Bayes
from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

Now that we have our module imports, let's go ahead and import the Iris Data Set. We have previously worked with this dataset, so go ahead and look at the Lectures on MultiClass Classification for a complete breakdown of this data set!


In [10]:
# load the iris datasets
iris = datasets.load_iris()

# Grab features (X) and the Target (Y)
X = iris.data

Y = iris.target

# Show the Built-in Data Description
print(iris.DESCR)


Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:
    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================
    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Since we have already done a general analysis of this data in earlier lectures, let's go ahead and move on to using the Naive Bayes Method to separate this data set into multiple classes.

First we create and fit the model


In [14]:
# Fit a Naive Bayes model to the data
model = GaussianNB()

Now that we have our model, we will continue by separating the data into training and testing sets:


In [16]:
from sklearn.model_selection import train_test_split

# Split the data into Training and Testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
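By default train_test_split holds out 25% of the rows for testing and shuffles them randomly, so the exact split will vary from run to run; if you want a reproducible split you could optionally pass explicit arguments, for example (the seed value here is an arbitrary choice):

# Optional: a reproducible 75/25 split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=3)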

Now we fit our model using the training data set:


In [19]:
# Fit the training model
model.fit(X_train,Y_train)


Out[19]:
GaussianNB()

Now we predict the outcomes from the Testing Set:


In [21]:
# Predicted outcomes
predicted = model.predict(X_test)

# Actual Expected Outcomes
expected = Y_test

Finally we can see the metrics for performance:


In [24]:
print(metrics.accuracy_score(expected, predicted))


0.947368421053

It looks like we have about 94.7% accuracy using the Naive Bayes method!
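Accuracy is just a single number; since we already imported metrics, we could optionally also look at the confusion matrix to see which classes get confused with each other (a quick sketch, and the exact counts will vary with the random split):

# Rows are the true classes, columns are the predicted classes
print(metrics.confusion_matrix(expected, predicted))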
