In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
Clustering is the task of gathering samples into groups of similar samples according to some predefined similarity or dissimilarity measure (such as the Euclidean distance). In this section we will explore a basic clustering task on some synthetic and real datasets.
Here are some common applications of clustering algorithms:

- Compression for data reduction
- Summarizing data as a preprocessing step for recommender systems
- Grouping related web news (e.g. Google News) and web search results
- Grouping related stock quotes for investment portfolio management
- Building customer profiles for market analysis
- Building a code book of prototype samples for unsupervised feature extraction
Let's start off with a very simple and obvious example:
In [ ]:
from sklearn.datasets import make_blobs
X, y = make_blobs(random_state=42)
X.shape
In [ ]:
plt.scatter(X[:, 0], X[:, 1])
There are clearly three separate groups of points in the data, and we would like to recover them using clustering. Even though the groups are obvious here, it is hard to find them when the data lives in a high-dimensional space that we cannot easily visualize.
Now we will use one of the simplest clustering algorithms, K-means. This is an iterative algorithm which searches for three cluster centers such that the distance from each point to its cluster center is minimized. Question: what would you expect the output to look like?
In [ ]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
We can get the cluster labels either by calling fit and then accessing the labels_ attribute of the KMeans estimator, or by calling fit_predict. Either way, the result contains the ID of the cluster that each point is assigned to.
In [ ]:
labels = kmeans.fit_predict(X)
In [ ]:
all(labels == kmeans.labels_)
Let's visualize the assignments that have been found:
In [ ]:
plt.scatter(X[:, 0], X[:, 1], c=labels)
Here, we are probably satisfied with the clustering. But in general we might want to have a more quantitative evaluation. How about we compare our cluster labels with the ground truth we got when generating the blobs?
In [ ]:
from sklearn.metrics import confusion_matrix, accuracy_score
print(accuracy_score(y, labels))
print(confusion_matrix(y, labels))
Even though we recovered the partitioning of the data into clusters perfectly, the cluster IDs we assigned were arbitrary, and we cannot hope to recover them. Therefore, we must use a different scoring metric, such as adjusted_rand_score, which is invariant to permutations of the labels:
In [ ]:
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y, labels)
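To see this invariance concretely, here is a small check on made-up toy labels (the arrays below are illustrative, not taken from the data above): a prediction that is a pure relabeling of the ground truth scores 0.0 accuracy but a perfect adjusted Rand index.
In [ ]:
from sklearn.metrics import adjusted_rand_score, accuracy_score

# hypothetical toy labels: y_perm assigns the same groups as y_true,
# but with every cluster ID permuted (0 -> 2, 1 -> 0, 2 -> 1)
y_true = [0, 0, 1, 1, 2, 2]
y_perm = [2, 2, 0, 0, 1, 1]

print(accuracy_score(y_true, y_perm))       # 0.0 -- misleadingly bad
print(adjusted_rand_score(y_true, y_perm))  # 1.0 -- perfect grouping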
Clustering comes with assumptions: a clustering algorithm finds clusters by making assumptions about which samples should be grouped together. Each algorithm makes different assumptions, and the quality and interpretability of your results will depend on whether those assumptions are satisfied for your goal. For K-means clustering, the model assumes that all clusters have equal, spherical variance.
In general, there is no guarantee that structure found by a clustering algorithm has anything to do with what you were interested in.
We can easily create a dataset that has non-isotropic clusters, on which K-means will fail:
In [ ]:
from sklearn.datasets import make_blobs
X, y = make_blobs(random_state=170, n_samples=600)
rng = np.random.RandomState(74)
transformation = rng.normal(size=(2, 2))
X = np.dot(X, transformation)
y_pred = KMeans(n_clusters=3, random_state=42).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
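One way to cope with such data is to pick a model whose assumptions match it. As a sketch (this estimator is not used elsewhere in this notebook), a Gaussian mixture with full covariance matrices can adapt to stretched clusters:
In [ ]:
from sklearn.mixture import GaussianMixture

# full covariances let each component stretch along arbitrary directions,
# unlike the spherical clusters K-means implicitly assumes
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
y_gmm = gmm.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_gmm)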
The following are several well-known clustering algorithms available in scikit-learn:

- sklearn.cluster.KMeans
- sklearn.cluster.MeanShift
- sklearn.cluster.DBSCAN
- sklearn.cluster.AffinityPropagation
- sklearn.cluster.SpectralClustering
- sklearn.cluster.Ward

Of these, Ward, SpectralClustering, DBSCAN and AffinityPropagation can also work with precomputed similarity matrices; a sketch of this usage follows.
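As a minimal sketch of the precomputed route, reusing the X from the cell above (the RBF kernel and its default bandwidth are arbitrary choices here; in practice the kernel and its parameters need tuning):
In [ ]:
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

# build a pairwise similarity (affinity) matrix ourselves ...
similarity = rbf_kernel(X)

# ... and pass it to the estimator instead of the raw features
sc = SpectralClustering(n_clusters=3, affinity='precomputed', random_state=42)
labels_sc = sc.fit_predict(similarity)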
Perform K-means clustering on the digits data, searching for ten clusters. Visualize the cluster centers as images (i.e. reshape each to 8x8 and use plt.imshow). Do the clusters seem to be correlated with particular digits? What is the adjusted_rand_score?

Visualize the projected digits as in the last notebook, but this time use the cluster labels as the color. What do you notice?
In [ ]:
from sklearn.datasets import load_digits
digits = load_digits()
# ...
In [ ]:
# %load solutions/08B_digits_clustering.py
In [ ]: