This notebook contains an excerpt from the Python Data Science Handbook by Jake VanderPlas; the content is available on GitHub.
The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
Naive Bayes models are a group of extremely fast and simple classification algorithms that are often suitable for very high-dimensional datasets. Because they are so fast and have so few tunable parameters, they end up being very useful as a quick-and-dirty baseline for a classification problem. This section will focus on an intuitive explanation of how naive Bayes classifiers work, followed by a few examples of them in action.
Naive Bayes classifiers are built on Bayesian classification methods. These rely on Bayes's theorem, an equation describing the relationship between the conditional probabilities of statistical quantities. In Bayesian classification, we are interested in the probability of a label $L$ given some observed features, $P(L~|~{\rm features})$. Bayes's theorem tells us how to express this in terms of quantities we can compute more directly:

$$
P(L~|~{\rm features}) = \frac{P({\rm features}~|~L)\,P(L)}{P({\rm features})}
$$

In words,

$$
\mbox{posterior} = \frac{\mbox{likelihood}\times \mbox{prior}}{\mbox{evidence}}
$$

(see https://en.wikipedia.org/wiki/Naive_Bayes_classifier).
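As a quick sanity check of the formula, here is a toy numeric example; all the probabilities are made up purely for illustration:

# Toy numeric check of Bayes's theorem, using made-up values
prior = 0.3          # P(L)
likelihood = 0.8     # P(features | L)
evidence = 0.4       # P(features)
posterior = likelihood * prior / evidence
print(posterior)     # 0.6, i.e. P(L | features)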
In practice, we are interested only in the numerator of that fraction, because the denominator does not depend on the label and the feature values are given, so it is effectively constant. The numerator is equivalent to the joint probability model $p(L_k, x_1, \dots, x_n)$ for a label $L_k$ and features $x_1, \dots, x_n$.
Now the "naive" conditional independenLe assumptions come into play: assume that each feature $x_i$ is conditionally statistical independence of every other feature $x_j$ for $j\neq i$, given the category $L_k$. This means that
$$
p(x_i \mid x_{i+1}, \dots, x_{n}, L_k) = p(x_i \mid L_k)
$$

Thus, the joint model can be expressed as
\begin{align}
p(L_k \mid x_1, \dots, x_n) & \varpropto p(L_k, x_1, \dots, x_n) \\
& = p(L_k) \ p(x_1 \mid L_k) \ p(x_2 \mid L_k) \ p(x_3 \mid L_k) \cdots \\
& = p(L_k) \prod_{i=1}^n p(x_i \mid L_k)\,.
\end{align}

The maximum a posteriori (MAP) decision rule then assigns the label that maximizes this quantity:
$$
\hat{y} = \underset{k \in \{1, \dots, K\}}{\operatorname{argmax}} \ p(L_k) \prod_{i=1}^n p(x_i \mid L_k)
$$
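To make the decision rule concrete, here is a minimal NumPy sketch. The priors and per-feature likelihoods are made up, and each feature is treated as binary for simplicity; none of these numbers come from the text:

import numpy as np

# Hypothetical setup: two classes, two binary features
priors = np.array([0.6, 0.4])            # p(L_0), p(L_1)
feature_probs = np.array([[0.8, 0.3],    # p(x_i = 1 | L_0) for i = 0, 1
                          [0.2, 0.7]])   # p(x_i = 1 | L_1) for i = 0, 1

def naive_bayes_predict(x):
    """MAP rule: argmax_k p(L_k) * prod_i p(x_i | L_k)."""
    likelihoods = np.where(x, feature_probs, 1 - feature_probs)
    return np.argmax(priors * likelihoods.prod(axis=1))

print(naive_bayes_predict(np.array([1, 0])))   # most probable label for x = (1, 0)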
If we are trying to decide between two labels $L_1$ and $L_2$, all we need is some model by which we can compute $P({\rm features}~|~L_i)$ for each label.
Such a model is called a generative model because it specifies the hypothetical random process that generates the data; specifying this generative model for each label is the main piece of the training of such a Bayesian classifier.
This is where the "naive" in "naive Bayes" comes in: if we make very naive assumptions about the generative model for each label, we can find a rough approximation of it and then proceed with the Bayesian classification. Different types of naive Bayes classifiers rest on different naive assumptions about the data, and we will examine a few of these in the following sections.
When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a normal (Gaussian) distribution.
The probability density of the normal distribution is

$$
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$

For example, suppose the training data contains a continuous attribute, $x$.
We first segment the data by class, and then compute the mean $\mu_k$ and variance $\sigma^2_k$ of $x$ in each class $C_k$. Now suppose we have collected some observation value $v$. The probability density of $v$ given the class $C_k$, $p(x=v \mid C_k)$, can be computed by plugging $v$ into the equation for a normal distribution parameterized by $\mu_k$ and $\sigma^2_k$:
\begin{align} p(x=v \mid C_k)=\frac{1}{\sqrt{2\pi\sigma^2_k}}\,e^{ -\frac{(v-\mu_k)^2}{2\sigma^2_k} } \end{align}
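As an illustration, here is a minimal NumPy sketch of this computation for a single continuous feature; the data values and labels are made up:

import numpy as np

x = np.array([1.0, 1.2, 0.9, 3.1, 2.8, 3.3])   # hypothetical feature values
y = np.array([0, 0, 0, 1, 1, 1])               # hypothetical class labels

def gaussian_likelihood(v, k):
    """Evaluate p(x = v | C_k) under a Gaussian fit to class k."""
    xk = x[y == k]
    mu, var = xk.mean(), xk.var()
    return np.exp(-(v - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(gaussian_likelihood(1.1, 0), gaussian_likelihood(1.1, 1))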
In [14]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
In [15]:
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu');
One extremely fast way to create a simple model is to assume that the data is described by a Gaussian distribution with no covariance between dimensions. This model can be fit by simply finding the mean and standard deviation of the points within each label, which is all you need to define such a distribution. The result of this naive Gaussian assumption is shown in the following figure:
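Here is a rough sketch (not the original figure code) of how such a picture could be drawn, reusing X, y, and plt from the cells above and drawing axis-aligned Gaussian contours from each class's mean and standard deviation:

from matplotlib.patches import Ellipse

# Sketch: axis-aligned Gaussian contours from each class's mean and std
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
for label, color in zip([0, 1], ['red', 'blue']):
    Xk = X[y == label]
    mu, std = Xk.mean(axis=0), Xk.std(axis=0)
    for nsig in [1, 2, 3]:
        ax.add_patch(Ellipse(mu, 2 * nsig * std[0], 2 * nsig * std[1],
                             alpha=0.1, color=color))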
The ellipses here represent the Gaussian generative model for each label, with larger probability toward the center of the ellipses. With this generative model in place for each class, we have a simple recipe to compute the likelihood $P({\rm features}~|~L_1)$ for any data point, and thus we can quickly compute the posterior and determine which label is the most probable for a given point.
This procedure is implemented in Scikit-Learn's sklearn.naive_bayes.GaussianNB
estimator:
In [16]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y);
Now let's generate some new data and predict the label:
In [17]:
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model.predict(Xnew)
Now we can plot this new data to get an idea of where the decision boundary is:
In [18]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);
We see a slightly curved boundary in the classifications—in general, the boundary in Gaussian naive Bayes is quadratic.
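The boundary itself can be visualized directly. Here is a rough sketch (reusing model, X, y, np, and plt from above) that contours the predicted probability of the second class at the 0.5 level over a grid covering the same region as Xnew:

# Sketch: contour the predicted probability of class 1 at the 0.5 level
xx, yy = np.meshgrid(np.linspace(-6, 8, 200), np.linspace(-14, 4, 200))
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
plt.contour(xx, yy, Z, levels=[0.5], colors='black');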
A nice piece of this Bayesian formalism is that it naturally allows for probabilistic classification, which we can compute using the predict_proba
method:
In [6]:
yprob = model.predict_proba(Xnew)
yprob[-8:].round(2)
# The columns give the posterior probabilities of the first and second label, respectively.
Out[6]:
If you are looking for estimates of uncertainty in your classification, Bayesian approaches like this can be useful. Of course, the final classification will only be as good as the model assumptions that lead to it, which is why Gaussian naive Bayes often does not produce very good results.
The multinomial distribution describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent counts or count rates. One place where multinomial naive Bayes is often used is in text classification, where the features are related to word counts or frequencies within the documents to be classified. Here we will use the sparse word count features from the 20 Newsgroups corpus to show how we might classify these documents into categories.
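To see what such count features look like, here is a small sketch with two made-up documents (the strings and resulting vocabulary are purely illustrative); the real pipeline below works on the 20 Newsgroups data instead:

from sklearn.feature_extraction.text import CountVectorizer

# Turn made-up documents into sparse word-count vectors
docs = ["the rocket launched into orbit",
        "the screen resolution of the monitor"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)
print(vec.vocabulary_)     # term -> column index
print(counts.toarray())    # per-document word counts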
In [2]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
data.target_names
Out[2]:
For simplicity here, we will select just a few of these categories, and download the training and testing set:
In [3]:
categories = ['talk.religion.misc', 'soc.religion.christian',
'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
Here is a representative entry from the data:
In [4]:
print(train.data[5])
In order to use this data for machine learning, we need to be able to convert the content of each string into a vector of numbers. For this we will use the TF-IDF vectorizer, and create a pipeline that attaches it to a multinomial naive Bayes classifier:
In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
With this pipeline, we can apply the model to the training data, and predict labels for the test data:
In [7]:
model.fit(train.data, train.target)
labels = model.predict(test.data)
Now that we have predicted the labels for the test data, we can evaluate the performance of the estimator. For example, here is the confusion matrix between the true and predicted labels:
In [9]:
from sklearn.metrics import confusion_matrix
sns.set_context("notebook", font_scale=1.7)
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
Evidently, even this very simple classifier can successfully separate space talk from computer talk, but it gets confused between talk about religion and talk about Christianity. This is perhaps an expected area of confusion! The very cool thing here is that we now have the tools to determine the category for any string, using the predict() method of this pipeline. Here's a quick utility function that will return the prediction for a single string:
In [10]:
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]
Let's try it out:
In [11]:
predict_category('sending a payload to the ISS')
Out[11]:
In [12]:
predict_category('discussing islam vs atheism')
Out[12]:
In [13]:
predict_category('determining the screen resolution')
Out[13]:
Remember that this is nothing more sophisticated than a simple probability model for the (weighted) frequency of each word in the string; nevertheless, the result is striking. Even a very naive algorithm, when used carefully and trained on a large set of high-dimensional data, can be surprisingly effective.
Because naive Bayesian classifiers make such stringent assumptions about data, they will generally not perform as well as a more complicated model. That said, they have several advantages: they are extremely fast for both training and prediction; they provide straightforward probabilistic prediction; they are often very easily interpretable; and they have very few (if any) tunable parameters. These advantages mean a naive Bayes classifier is often a good choice as an initial baseline classification. Naive Bayes classifiers tend to perform especially well in one of the following situations: when the naive assumptions actually match the data (very rare in practice); for very well-separated categories, when model complexity is less important; and for very high-dimensional data, when model complexity is again less important.
The last two points seem distinct, but they actually are related: as the dimensionality of a dataset grows, it is much less likely for any two points to be found close together (after all, they must be close in every single dimension to be close overall). This means that clusters in high dimensions tend to be more separated, on average, than clusters in low dimensions, assuming the new dimensions actually add information. For this reason, simplistic classifiers like naive Bayes tend to work as well as or better than more complicated classifiers as the dimensionality grows: once you have enough data, even a simple model can be very powerful.