The function I'm referring to resides within scikit-learn's datasets module. Let's create 100 data points, each belonging to one of two possible classes, and group them into two Gaussian blobs. To make the experiment reproducible, we specify an integer seed for random_state. You can again pick whatever number you prefer; here I went with Thomas Bayes' year of birth (just for kicks):
In [1]:
from sklearn import datasets
X, y = datasets.make_blobs(100, 2, centers=2, random_state=1701, cluster_std=2)
Let's have a look at the dataset we just created using our trusty friend, Matplotlib:
In [2]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
I'm sure this is getting easier every time. We use scatter to create a scatter plot of all $x$ values (X[:, 0]) and $y$ values (X[:, 1]), which will result in the following output:
In [3]:
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, s=50);
In agreement with our specifications, we see two different point clusters. They hardly overlap, so it should be relatively easy to classify them. What do you think—could a linear classifier do the job?
Yes, it could. Recall that a linear classifier would try to draw a straight line through the figure, trying to put all blue dots on one side and all red dots on the other. A diagonal line going from the top-left corner to the bottom-right corner could clearly do the job. So we would expect the classification task to be relatively easy, even for a naive Bayes classifier.
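If you want to convince yourself, here is a quick sanity check (an aside, not part of the book's code): fit a standard linear classifier such as scikit-learn's LogisticRegression on the full dataset and confirm that it separates the blobs almost perfectly.
# Aside: verify that a plain linear classifier handles these blobs
from sklearn.linear_model import LogisticRegression
lin_clf = LogisticRegression()
lin_clf.fit(X, y)
lin_clf.score(X, y)  # should be at, or very close to, 1.0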
But first, don't forget to split the dataset into training and test sets! Here, I reserve 10% of the data points for testing:
In [4]:
import numpy as np
from sklearn import model_selection as ms
X_train, X_test, y_train, y_test = ms.train_test_split(
    X.astype(np.float32), y, test_size=0.1
)
We will then use the same procedure as in earlier chapters to train a normal Bayes classifier. Wait, why not a naive Bayes classifier? Well, it turns out OpenCV doesn't really provide a true naive Bayes classifier... Instead, it comes with a Bayesian classifier that doesn't necessarily expect features to be independent, but rather expects the data to be clustered into Gaussian blobs. This is exactly the kind of dataset we created earlier!
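To make the idea more concrete, here is a minimal hand-rolled sketch of the principle (an illustration only, assuming SciPy is available; it is not OpenCV's actual implementation): fit one Gaussian per class and assign new points to the class with the largest posterior, that is, likelihood times prior.
# Illustration only: a bare-bones Gaussian Bayes classifier
import numpy as np
from scipy.stats import multivariate_normal

means, covs, priors = [], [], []
for label in np.unique(y_train):
    X_c = X_train[y_train == label]
    means.append(X_c.mean(axis=0))
    covs.append(np.cov(X_c, rowvar=False))
    priors.append(len(X_c) / float(len(X_train)))

def predict_gaussian_bayes(X_new):
    # posterior is proportional to likelihood * prior for each class
    scores = np.column_stack([
        multivariate_normal.pdf(X_new, mean=m, cov=c) * p
        for m, c, p in zip(means, covs, priors)
    ])
    return scores.argmax(axis=1)

np.mean(predict_gaussian_bayes(X_test) == y_test)  # accuracy of this sketch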
We can create a new classifier using the following function:
In [5]:
import cv2
model_norm = cv2.ml.NormalBayesClassifier_create()
Then, training is done via the train method:
In [6]:
model_norm.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
Out[6]: True
Once the classifier has been trained successfully, it will return True. We go through the motions of predicting and scoring the classifier, just like we have done a million times before:
In [7]:
_, y_pred = model_norm.predict(X_test)
In [8]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred)
Out[8]: 1.0
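Note that OpenCV returns the predictions as an N x 1 array of floating-point numbers, so if you ever want to compute the accuracy by hand, flatten the array first (another aside, not part of the book's code):
# Aside: manual accuracy check, equivalent to metrics.accuracy_score
np.mean(y_pred.ravel() == y_test)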
Even better—we can reuse the plotting function from the last chapter to inspect the decision boundary! If you recall, the idea was to create a mesh grid that would encompass all data points and then classify every point on the grid. The mesh grid is created via the NumPy function of the same name:
In [9]:
def plot_decision_boundary(model, X_test, y_test):
    # create a mesh to plot in
    h = 0.02  # step size in mesh
    x_min, x_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1
    y_min, y_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    X_hypo = np.column_stack((xx.ravel().astype(np.float32),
                              yy.ravel().astype(np.float32)))
    ret = model.predict(X_hypo)
    if isinstance(ret, tuple):
        zz = ret[1]   # OpenCV returns (retval, results)
    else:
        zz = ret      # scikit-learn returns the labels directly
    zz = zz.reshape(xx.shape)

    plt.contourf(xx, yy, zz, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=200)
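If np.meshgrid is new to you, here is a tiny standalone illustration (not part of the book's code) of what it returns:
# Illustration: meshgrid turns two coordinate vectors into coordinate matrices
gx, gy = np.meshgrid(np.arange(3), np.arange(2))
gx  # array([[0, 1, 2],
    #        [0, 1, 2]])
gy  # array([[0, 0, 0],
    #        [1, 1, 1]])
# pairing gx and gy element-wise enumerates every point of the 3 x 2 grid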
In [10]:
plt.figure(figsize=(10, 6))
plot_decision_boundary(model_norm, X, y)
So far, so good. The interesting part is that a Bayesian classifier also returns the probability with which each data point has been classified:
In [11]:
ret, y_pred, y_proba = model_norm.predictProb(X_test)
The function returns a Boolean flag (True for success, False for failure), the predicted target labels (y_pred), and the conditional probabilities (y_proba). Here, y_proba is an $N \times 2$ matrix that indicates, for every one of the $N$ data points, the probability with which it was classified as either class 0 or class 1:
In [12]:
y_proba.round(2)
Out[12]:
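As we will see in a moment when we compare with scikit-learn, the values OpenCV reports here do not necessarily add up to 1 across a row. If you want row-normalized values, one simple (purely illustrative) post-processing step is to divide each row by its sum:
# Illustration: rescale each row of y_proba so that its two entries sum to 1
# (guard against all-zero rows to avoid division by zero)
row_sums = y_proba.sum(axis=1, keepdims=True)
y_proba_norm = y_proba / np.where(row_sums == 0, 1, row_sums)
y_proba_norm.round(2)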
For comparison, let's see how scikit-learn's own naive Bayes implementation handles the same data. We use GaussianNB, which assumes that, within each class, every feature follows a Gaussian distribution (and, being naive, treats the features as independent):
In [13]:
from sklearn import naive_bayes
model_naive = naive_bayes.GaussianNB()
As usual, training the classifier is done via the fit method:
In [14]:
model_naive.fit(X_train, y_train)
Out[14]:
Scoring the classifier is built in:
In [15]:
model_naive.score(X_test, y_test)
Out[15]: 1.0
Again a perfect score! However, in contrast to OpenCV, this classifier's predict_proba method returns true probability values, because all values are between 0 and 1, and because all rows add up to 1:
In [16]:
yprob = model_naive.predict_proba(X_test)
yprob.round(2)
Out[16]:
You might have noticed something else: This classifier has absolutely no doubt about the target label of each and every data point. It's all or nothing.
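We can verify both claims directly (a quick aside, not part of the book's code):
# Aside: rows of yprob sum to 1, and (after rounding) each row puts all
# of its probability mass on a single class
np.allclose(yprob.sum(axis=1), 1.0)         # True
np.all(yprob.round(2).max(axis=1) == 1.0)   # True for this dataset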
The decision boundary returned by the naive Bayes classifier looks slightly different but, for the purpose of this exercise, can be considered identical to the one produced by the normal Bayes classifier:
In [17]:
plt.figure(figsize=(10, 6))
plot_decision_boundary(model_naive, X, y)
We can also inspect the probabilities themselves. For that, we slightly modify the plotting function so that it colors the mesh grid according to the difference between the two class probabilities rather than the predicted label:
In [18]:
def plot_proba(model, X_test, y_test):
    # create a mesh to plot in
    h = 0.02  # step size in mesh
    x_min, x_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1
    y_min, y_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    X_hypo = np.column_stack((xx.ravel().astype(np.float32),
                              yy.ravel().astype(np.float32)))
    if hasattr(model, 'predictProb'):
        # OpenCV's normal Bayes classifier
        _, _, y_proba = model.predictProb(X_hypo)
    else:
        # scikit-learn classifier
        y_proba = model.predict_proba(X_hypo)

    zz = y_proba[:, 1] - y_proba[:, 0]
    zz = zz.reshape(xx.shape)

    plt.contourf(xx, yy, zz, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=200)
In [19]:
plt.figure(figsize=(10, 6))
plot_proba(model_naive, X, y)
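Since plot_proba also knows how to call OpenCV's predictProb (that is what the hasattr check above is for), you can pass model_norm to it as well and compare the two probability maps side by side. This last call is a suggestion, not part of the excerpt's code:
# Suggestion: probability map of the OpenCV normal Bayes classifier
plt.figure(figsize=(10, 6))
plot_proba(model_norm, X, y)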