Naive Bayes Classifiers

Naive Bayes classifiers are a family of classifiers that are quite similar to the linear models discussed previously. However, they tend to be even faster in training. The price paid for this efficiency is that naive Bayes models often provide generalization performance that is slightly worse than that of linear classifiers like LogisticRegression and LinearSVC.

The reason that naive Bayes models are so efficient is that they learn parameters by looking at each feature individually and collecting simple per-class statistics from each feature. There are three kinds of naive Bayes classifiers implemented in scikit-learn: GaussianNB, BernoulliNB, and MultinomialNB. GaussianNB can be applied to any continuous data, while BernoulliNB assumes binary data and MultinomialNB assumes count data (that is, each feature represents an integer count of something, like how often a word appears in a sentence). BernoulliNB and MultinomialNB are mostly used for text classification.
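
As a quick illustration of the count-data case, here is a minimal sketch that classifies a handful of made-up sentences with MultinomialNB. The tiny corpus and labels are invented purely for illustration; CountVectorizer turns each document into the kind of integer word-count features MultinomialNB expects.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels (1 = spam, 0 = not spam)
docs = ["free offer win money", "meeting schedule for monday",
        "win a free prize now", "project meeting notes attached"]
labels = [1, 0, 1, 0]

vect = CountVectorizer()
X_counts = vect.fit_transform(docs)  # sparse matrix of word counts

mnb = MultinomialNB()
mnb.fit(X_counts, labels)
print(mnb.predict(vect.transform(["free money now"])))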

Advantages of Naive Bayes

  • Very fast to train and to predict
  • Training procedure is easy to understand
  • The models work very well with high-dimensional sparse data and are relatively robust to the settings of their parameters

Disadvantages of Naive Bayes

  • Relatively poor generalization performance

Disclaimer: Much of the code in this notebook was borrowed from the excellent book Introduction to Machine Learning with Python by Andreas Müller and Sarah Guido.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

The BernoulliNB classifier counts how often each feature is nonzero for each class. This is most easily understood with an example:


In [2]:
X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

Here, we have four data points, with four binary features each. There are two classes, 0 and 1. For class 0 (the first and third data points), the first feature is zero two times and nonzero zero times, the second feature is zero one time and nonzero one time, and so on. These same counts are then calculated for the data points in the second class. Counting the nonzero entries per class in essence looks like this:


In [3]:
counts = {}
for label in np.unique(y):
    # iterate over each class
    # count (sum) entries of 1 per feature
    counts[label] = X[y == label].sum(axis=0)
print("Feature counts:\n{}".format(counts))


Feature counts:
{0: array([0, 1, 0, 2]), 1: array([2, 0, 2, 1])}

The other two naive Bayes models, MultinomialNB and GaussianNB, are slightly different in what kinds of statistics they compute. MultinomialNB takes into account the average value of each feature for each class, while GaussianNB stores the average value as well as the standard deviation of each feature for each class.
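
To make this concrete, here is a small sketch (reusing the toy X and y from above) that computes these per-class statistics directly with NumPy: the per-class feature means that MultinomialNB is based on, and the per-class means and standard deviations that GaussianNB stores.

for label in np.unique(y):
    X_label = X[y == label]
    # per-class average of each feature (the statistic behind MultinomialNB)
    print("class", label, "feature means:", X_label.mean(axis=0))
    # per-class standard deviation of each feature (stored, together with
    # the mean, by GaussianNB)
    print("class", label, "feature stds: ", X_label.std(axis=0))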

To make a prediction, a data point is compared to the statistics for each of the classes, and the best matching class is predicted. Interestingly, for both MultinomialNB and BernoulliNB, this leads to a prediction formula that is of the same form as in the linear models. Unfortunately, coef_ for the naive Bayes models has a somewhat different meaning than in the linear models, in that coef_ is not the same as w.
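
As a rough sketch of what that linear form looks like for BernoulliNB, we can rebuild the class scores from the fitted attributes feature_log_prob_ and class_log_prior_ (these attribute names are scikit-learn's; w, b, and scores below are just local names used for illustration) and check that the highest-scoring class matches predict on the toy X and y from above.

from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()
bnb.fit(X, y)

# Per-class log probabilities of each feature being 1, and class log priors
log_p = bnb.feature_log_prob_         # shape (n_classes, n_features)
log_prior = bnb.class_log_prior_      # shape (n_classes,)
log_1m_p = np.log(1 - np.exp(log_p))  # log probability of each feature being 0

# The class score is linear in x: weights and intercept per class
w = log_p - log_1m_p
b = log_prior + log_1m_p.sum(axis=1)
scores = X @ w.T + b

print("highest-scoring class:", bnb.classes_[scores.argmax(axis=1)])
print("BernoulliNB.predict:  ", bnb.predict(X))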

GaussianNB

Let's apply a GaussianNB model to the Iris dataset:


In [4]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=5)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

print("Accuracy on training set: {:.2f}".format(gnb.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(gnb.score(X_test, y_test)))


Accuracy on training set: 0.97
Accuracy on test set: 0.92
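
To see the statistics GaussianNB actually stored, we can inspect its fitted attributes: theta_ holds the per-class means of each feature, and, depending on the scikit-learn version, the per-class variances are in var_ (newer releases) or sigma_ (older releases). A small sketch:

print("per-class means (theta_):\n{}".format(gnb.theta_))
# Per-class variances: attribute name depends on the scikit-learn version
variances = getattr(gnb, "var_", None)
if variances is None:
    variances = gnb.sigma_  # older scikit-learn releases
print("per-class variances:\n{}".format(variances))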

Summary

Naive Bayes models are great baseline models and are often used on very large datasets, where training even a linear model might take too long. However, if accuracy is important, there are generally many other models that will provide better generalization performance.