Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!
The most essential clustering algorithm that OpenCV provides is $k$-means clustering, which searches for a predetermined number of $k$ clusters (or groups) within an unlabeled multidimensional dataset.
It does so by using two simple assumptions about what an optimal clustering should look like:
- The center of each cluster is simply the arithmetic mean of all the points belonging to that cluster.
- Each point in a cluster is closer to its own cluster center than to the centers of the other clusters.
The algorithm is easiest to understand by looking at a concrete example.
In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
Following the same recipe from earlier chapters, we will create a total of 300 blobs (n_samples=300) belonging to four distinct clusters (centers=4):
In [2]:
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=1.0, random_state=10)
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], s=100);
Even without assigning target labels to the data, it is straightforward to pick out the four clusters by eye. The $k$-means algorithm can do this, too, without having any information about target labels or underlying data distributions.
Although $k$-means is, of course, a statistical model, in OpenCV, it does not come via the ml module and the common train and predict API calls. Instead, it is directly available as cv2.kmeans. In order to use the model, we have to specify some arguments, such as the termination criteria and some initialization flags. Here, we will tell the algorithm to terminate whenever the error is smaller than 1.0 (cv2.TERM_CRITERIA_EPS) or when ten iterations have been executed (cv2.TERM_CRITERIA_MAX_ITER):
In [3]:
import cv2
# Stop when the error drops below 1.0 (EPS) or after ten iterations (MAX_ITER)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER,
            10, 1.0)
# Start from randomly chosen initial cluster centers
flags = cv2.KMEANS_RANDOM_CENTERS
Then we can pass the preceding data matrix (X) to cv2.kmeans. We also specify the number of clusters (4) and the number of attempts the algorithm should make with different random initial guesses (10), as shown in the following snippet:
In [4]:
import numpy as np
compactness, labels, centers = cv2.kmeans(X.astype(np.float32), 4,
                                          None, criteria, 10, flags)
Three different variables are returned. The first one, compactness, returns the sum of squared distances from each point to its corresponding cluster center. A low compactness score indicates that all points are close to their cluster centers, whereas a high compactness score indicates that the clusters might not be well separated:
In [5]:
compactness
Out[5]:
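As a quick sanity check, we can recompute this value ourselves. The following is a minimal sketch, assuming that X, labels, and centers from the previous cells are still in scope:
# Recompute the sum of squared distances from each point to its assigned center
dists = X.astype(np.float32) - centers[labels.ravel()]
print(np.sum(dists ** 2))  # should closely match the compactness value above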
Of course, this number strongly depends on the actual values in X. If the distances between points were large to begin with, we cannot expect an arbitrarily small compactness score. Hence, it is more informative to plot the data points, colored according to their assigned cluster labels:
In [6]:
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=100)
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
The good news here is that the $k$-means algorithm (at least in this simple case) assigns the points to clusters very similarly to how we might have, had we done the job by eye. But how did the algorithm find these different clusters so quickly? After all, the number of possible combinations of cluster assignments is exponential in the number of data points! By hand, trying all possible combinations would have certainly taken forever.
Fortunately, an exhaustive search is not necessary. Instead, the typical approach that $k$-means takes is to use an iterative algorithm, also known as expectation-maximization.
$k$-means clustering is but one concrete application of a more general algorithm known as expectation-maximization. In short, the algorithm works as follows:
1. Start with some random cluster centers (the initial guess).
2. Repeat until convergence:
   a. Expectation step: Assign each data point to its nearest cluster center.
   b. Maximization step: Update the cluster centers by taking the mean of all the points assigned to them.
Here, the expectation step is so named because it involves updating our expectation of which cluster each point in the dataset belongs to. The maximization step is so named because it involves maximizing a fitness function that defines the location of the cluster centers. In the case of $k$-means, maximization is performed by taking the arithmetic mean of all the data points in a cluster.
Refer to the book for step-by-step illustrations on how expectation-maximization works.
The expectation-maximization algorithm is simple enough for us to code it up ourselves. To do so, we will define a function find_clusters(X, n_clusters, rseed=5) that takes as input a data matrix (X), the number of clusters we want to discover (n_clusters), and a random seed (optional, rseed). As will become clear in a second, scikit-learn's pairwise_distances_argmin function will come in handy:
In [7]:
from sklearn.metrics import pairwise_distances_argmin
def find_clusters(X, n_clusters, rseed=5):
    # 1. Randomly choose clusters
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    while True:
        # 2a. Assign labels based on closest center
        labels = pairwise_distances_argmin(X, centers)
        # 2b. Find new centers from means of points
        new_centers = np.array([X[labels == i].mean(0)
                                for i in range(n_clusters)])
        # 2c. Check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
    return centers, labels
We can apply our function to the preceding data matrix X we created. Since we know what the data looks like, we know that we are looking for four clusters:
In [8]:
centers, labels = find_clusters(X, 4)
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=100);
As we can see, our home-made algorithm got the job done! Granted, this particular clustering example was fairly easy, and most real-life implementations of $k$-means clustering will do a bit more under the hood. But for now, we are happy.
For all its simplicity, expectation-maximization performs incredibly well in a range of scenarios. That being said, there are a number of potential limitations that we need to be aware of:
- There is no guarantee that the algorithm will find the globally best solution.
- The number of clusters must be specified beforehand.
- The cluster boundaries are always linear.
- The algorithm is slow for large datasets.
Let's quickly discuss these potential caveats in a little more detail.
Although mathematicians have proved that each expectation-maximization step improves the result, there is still no guarantee that, in the end, we will find the globally best solution. For example, if we use a different random seed in our simple example (such as seed 10 instead of 5), we suddenly get very poor results:
In [9]:
centers, labels = find_clusters(X, 4, rseed=10)
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=100);
What happened?
The short answer is that the random initialization of the cluster centers was unfortunate. It led to the center of the yellow cluster migrating in between the two top blobs, essentially combining them into one. As a result, the other clusters got confused because they suddenly had to split two visually distinct blobs into three clusters.
For this reason, it is common for the algorithm to be run for multiple initial states. Indeed, OpenCV does this by default (set by the optional attempts parameter).
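As a rough sketch of the effect (our own illustration, reusing the criteria and flags defined earlier), we can compare a single random initialization with ten attempts; cv2.kmeans returns the result of the attempt with the lowest compactness:
# One attempt versus ten attempts with different random initializations
c1, _, _ = cv2.kmeans(X.astype(np.float32), 4, None, criteria, 1, flags)
c10, _, _ = cv2.kmeans(X.astype(np.float32), 4, None, criteria, 10, flags)
print(c1, c10)  # c10 is typically no larger than c1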
Another potential limitation is that $k$-means cannot learn the number of clusters from the data. Instead, we must tell it how many clusters we expect beforehand. You can see how this could be problematic for complicated real-world data that you don't fully understand yet.
From the viewpoint of $k$-means, there is no wrong or nonsensical number of clusters. For example, if we ask the algorithm to identify six clusters in the dataset generated in the preceding section, it will happily proceed and find the best six clusters:
In [10]:
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
flags = cv2.KMEANS_RANDOM_CENTERS
compactness, labels, centers = cv2.kmeans(X.astype(np.float32), 6, None, criteria, 10, flags)
Here, we used the exact same preceding code and only changed the number of clusters from four to six. We can plot the data points again and color them according to the target labels:
In [11]:
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=100);
Since we generated the data ourselves, we know that each data point comes from a total of four distinct clusters. For more complicated data, where we are uncertain about the right number of clusters, there are a few things we could try.
First and foremost, there is the elbow method, which asks us to repeat the clustering for a whole range of $k$ values and record the compactness value:
In [12]:
kvals = np.arange(2, 10)
compactness = []
for k in kvals:
    c, _, _ = cv2.kmeans(X.astype(np.float32), k, None, criteria, 10, flags)
    compactness.append(c)
This will result in a plot that looks like an arm. The point that lies at the elbow is the number of clusters to pick:
In [13]:
plt.figure(figsize=(10, 6))
plt.plot(kvals, compactness, 'o-', linewidth=4, markersize=12);
plt.xlabel('number of clusters')
plt.ylabel('compactness')
Out[13]:
In our case, we would pick the number four. The reasoning behind this is that moving from left to right, four is the smallest number of clusters that gives a very compact representation. Three clusters give us a representation that is only half as compact as four clusters. Choosing five clusters or more does not further improve the compactness. Thus, four should be our best guess as to the number of clusters.
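If we would rather not judge the elbow by eye, one rough heuristic (our own addition, not from the book) is to look at the relative improvement in compactness from one value of k to the next and pick the point where the improvement levels off:
# Relative drop in compactness when moving from k-1 to k clusters
comp = np.array(compactness)
rel_improvement = 1 - comp[1:] / comp[:-1]
print(dict(zip(kvals[1:], np.round(rel_improvement, 2))))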
A list of more sophisticated methods is given in the book.
The $k$-means algorithm is based on a simple assumption, which is that points will be closer to their own cluster center than to others. Consequently, $k$-means always assumes linear boundaries between clusters, meaning that it will fail whenever the geometry of the clusters is more complicated than that.
We can see this limitation for ourselves by generating a slightly more complicated dataset. Instead of generating data points from Gaussian blobs, we want to organize the data into two overlapping half circles. We can do this using scikit-learn's make_moons. Here, we choose 200 data points belonging to two half circles, in combination with some Gaussian noise:
In [14]:
from sklearn.datasets import make_moons
X, y = make_moons(200, noise=.05, random_state=12)
This time, we tell $k$-means to look for two clusters:
In [15]:
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
flags = cv2.KMEANS_RANDOM_CENTERS
compactness, labels, centers = cv2.kmeans(X.astype(np.float32), 2, None, criteria, 10, flags)
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=100);
As is evident from the plot, $k$-means failed to identify the two half circles and instead split the data with what looks like a diagonal straight line (from bottom left to top right).
This scenario should ring a bell. We had the exact same problem when we talked about linear SVMs in Chapter 6, Detecting Pedestrians with Support Vector Machines. The idea there was to use the kernel trick in order to transform the data into a higher-dimensional feature space. Can we do the same here?
We most certainly can. There is a kernelized form of $k$-means that works akin to the kernel trick for SVMs; it is called spectral clustering. Unfortunately, OpenCV does not provide an implementation of spectral clustering. Fortunately, scikit-learn does:
In [16]:
from sklearn.cluster import SpectralClustering
The algorithm uses the same API as all other statistical models: we set optional arguments in the constructor and then call fit_predict on the data. Here, we want to use the graph of the nearest neighbors to compute a higher-dimensional representation of the data and then assign labels using $k$-means:
In [17]:
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           assign_labels='kmeans')
labels = model.fit_predict(X)
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=100);
We see that spectral clustering gets the job done. Alternatively, we could have transformed the data into a more suitable representation ourselves and then applied OpenCV's linear $k$-means to it. The lesson of all this is that perhaps, again, feature engineering saves the day.
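As a minimal sketch of that do-it-yourself route (our choice of tools, not something the book prescribes), we could use scikit-learn's SpectralEmbedding to compute a representation from the graph of nearest neighbors, and then hand that representation to OpenCV's linear $k$-means; this mirrors what SpectralClustering does internally:
from sklearn.manifold import SpectralEmbedding

# Embed the two half circles using the nearest-neighbor graph, then cluster
# the embedded points with OpenCV's linear k-means
embedding = SpectralEmbedding(n_components=2,
                              affinity='nearest_neighbors').fit_transform(X)
_, labels, _ = cv2.kmeans(embedding.astype(np.float32), 2, None,
                          criteria, 10, flags)
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels.ravel(), s=100);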
The final limitation of $k$-means is that it is relatively slow for large datasets. You can imagine that quite a lot of algorithms might suffer from this problem. However, $k$-means is affected especially badly: each iteration of $k$-means must access every single data point in the dataset and compare it to all the cluster centers.
You might wonder whether the requirement to access all data points during each iteration is really necessary. For example, you might just use a subset of the data to update the cluster centers at each step. Indeed, this is the exact idea that underlies a variation of the algorithm called batch-based $k$-means. Unfortunately, this algorithm is not implemented in OpenCV.
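If such a variant is needed, scikit-learn provides one as MiniBatchKMeans. A minimal sketch on the half-circle data from above (the batch size here is our own arbitrary choice):
from sklearn.cluster import MiniBatchKMeans

# Update the cluster centers from small random batches instead of the full dataset
mbk = MiniBatchKMeans(n_clusters=2, batch_size=50, random_state=10)
labels = mbk.fit_predict(X)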
Despite the limitations discussed earlier, $k$-means has a number of interesting applications, especially in computer vision.