Clustering is an unsupervised method that tries to find structure in a set of objects. Specifically, we want to find clusters: groups within our object set where objects in a cluster are more similar to each other than they are to objects outside of the cluster. Note that this definition is quite vague - there are many different ways to conceptualize how clusters may occur in data, and today we will look at K-Means and hierarchical clustering in particular.
In [ ]:
from IPython.display import Image
from IPython.core.display import HTML
Given a set of objects, we specify that we want $k$ clusters. Each cluster is represented by a cluster mean, and we assign each point to a cluster based on which cluster mean it is closest to. For each point, the reconstruction error is defined to be its squared distance to its cluster mean. Summing the individual reconstruction errors gives the total reconstruction error, the quantity we want to minimize.
Formally, we have a set of object vectors $\{x_n\}_{n = 1}^N$ and a set of $K$ cluster means $\{\mu_k\}_{k = 1}^K$, where $x_n, \mu_k \in \mathbb{R}^D$. We represent each object's cluster assignment with the $K$ dimensional vector $r_n$, where $r_{nk} = 1$ if $x_n$ is in cluster $k$, and 0 otherwise. This gives us the individual reconstruction error
$$J_n(r_n, \{\mu_k\}) = \sum_{k = 1}^K r_{nk} \cdot \|x_n - \mu_k\|^2$$
and the total reconstruction error
$$J(\{r_n\}, \{\mu_k\}) = \sum_{n = 1}^N \sum_{k = 1}^K r_{nk} \cdot \|x_n - \mu_k\|^2$$
As you can see, the reconstruction error on a set of objects is a function of both the assignments and the means. How would you go about choosing the assignments $\{r_n\}$ and means $\{\mu_k\}$ that minimize it? Lloyd's Algorithm proposes a two-step minimization.
Step 1: minimize $J$ by updating the assignments $\{r_n\}$, assigning each $x_n$ to its closest cluster mean.
Step 2: minimize $J$ by recalculating each $\mu_k$ to be the average of all vectors assigned to cluster $k$.
Repeating these two steps until the assignments no longer change, the algorithm converges to a local minimum of $J$ (not necessarily the global one). It can be improved by initializing the cluster means more carefully than picking them uniformly at random (K-Means++), or by adding the constraint that each cluster center must be one of the data points (K-Medoids).
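To make the two steps concrete, here is a minimal NumPy sketch of Lloyd's Algorithm. It is not the SKLearn implementation used below; the function name, the random initialization, and the convergence check are our own choices.
In [ ]:
import numpy as np

def lloyds_kmeans(X, k, n_iters=100, seed=0):
    """Minimal sketch of Lloyd's Algorithm: alternate assignments and mean updates."""
    rng = np.random.RandomState(seed)
    # Initialize the means by picking k distinct data points at random
    # (K-Means++ would spread these initial centers out more carefully).
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its closest cluster mean.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, k) squared distances
        labels = dists.argmin(axis=1)
        # Step 2: recompute each mean as the average of its assigned points
        # (keep the old mean if a cluster ends up empty).
        new_mu = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else mu[j]
                           for j in range(k)])
        converged = np.allclose(new_mu, mu)
        mu = new_mu
        if converged:
            break
    return labels, mu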
SKLearn has a K-Means implementation that is documented here. In this example, we use K-Means to cluster the iris flowers into 3 groups based on their measurements. Note that choosing the right $k$ is crucial. The true number of species in the dataset is 3, so with this knowledge we can let $k$ be 3 to get the most logical split. However, if we didn't know that the dataset consisted of three types of flowers, choosing a value like $k = 7$ might result in less logical clusters.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn import datasets
centers = [[1, 1], [-1, -1], [1, -1]]

# Load the iris measurements (X) and the true species labels (y).
iris = datasets.load_iris()
X = iris.data
y = iris.target
In [ ]:
estimators = {'K-Means 3': KMeans(n_clusters=3),
              'K-Means 7': KMeans(n_clusters=7)}
fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    plt.cla()
    # Fit the estimator and color each point by its cluster label.
    est.fit(X)
    labels = est.labels_
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))
    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title(name)
    fignum = fignum + 1
plt.show()
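If the true number of groups is unknown, a common heuristic (not part of the original notebook) is the "elbow" method: fit K-Means for a range of $k$, plot the total reconstruction error, which SKLearn exposes on a fitted model as the inertia_ attribute, and look for the point where the curve stops dropping sharply.
In [ ]:
# Sketch: total reconstruction error (inertia) as a function of k.
ks = range(1, 10)
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in ks]
plt.figure(figsize=(4, 3))
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('Total reconstruction error')
plt.show()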
K-Means is one of the most common and simple clustering methods, but it has a couple of key limitations. First, it is nondeterministic, since the result depends on the initial choice of cluster means, and Lloyd's Algorithm only arrives at a local minimum rather than the global minimum. Furthermore, the algorithm requires you to decide what $k$ is, as we saw earlier. Finally, K-Means can be inflexible: each cluster center is simply the mean of the points assigned to that cluster, so K-Means doesn't perform well on clusters that are connected but not compact.
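As a quick illustration of that last limitation, the sketch below uses the make_moons toy dataset (our own addition, not part of the iris example): the two half-moons are connected, non-compact shapes, and K-Means tends to cut straight across them rather than separating them.
In [ ]:
from sklearn.datasets import make_moons

# Two interleaved half-moons: connected but not compact clusters.
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)
moon_labels = KMeans(n_clusters=2).fit_predict(X_moons)
plt.figure(figsize=(4, 3))
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=moon_labels)
plt.title('K-Means on two half-moons')
plt.show()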
Hierarchical clustering addresses many of these issues. The motivation behind hierarchical clustering is to build up clusters as a hierarchy: every object starts in its own group, and at each level the groups that are closest together are merged to form larger groups. A variety of distance metrics and linkage criteria can be used when merging, resulting in different types of hierarchical clusterings.
A dendrogram is a way of visualizing the groups as they are merged together up the hierarchy. As a dendrogram makes clear, hierarchical clustering not only resolves some of the problems discussed above concerning K-Means - it also provides a very nice way of representing the structure of the clusters, identifying subclusters within clusters.
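A dendrogram for the iris data can be drawn directly with SciPy's cluster.hierarchy module; the short sketch below (our own addition) uses Ward linkage, which is also the default linkage of SKLearn's AgglomerativeClustering.
In [ ]:
from scipy.cluster.hierarchy import dendrogram, linkage

# Build the full merge hierarchy with Ward linkage and draw the dendrogram.
Z = linkage(X, method='ward')
plt.figure(figsize=(8, 3))
dendrogram(Z, no_labels=True)
plt.ylabel('Merge distance')
plt.show()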
Once again, we can use the SKLearn hierarchical clustering implementation, AgglomerativeClustering, in much the same way as we used KMeans. Its documentation page describes the different linkage criteria and distance metrics you can use.
In [ ]:
estimators = {'Hierarchical 3': AgglomerativeClustering(n_clusters=3)}
fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    plt.cla()
    # Fit the agglomerative model and color each point by its cluster label.
    est.fit(X)
    labels = est.labels_
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))
    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title(name)
    fignum = fignum + 1
plt.show()
Look at the other clustering methods provided by SKLearn and consider their use cases. Pick three to run on the iris dataset that you think will produce the most accurate clusters. Tune parameters and explore the different options to try to get your clusters as close to the ground truth as possible. This challenge is meant to familiarize you with the documentation that SKLearn provides and to help you figure out the best clustering method for a given problem. We only covered two clustering models in depth to give you a taste of what clustering can do, but there are many more clustering models out there, each with its own optimal use cases.
In [ ]:
np.random.seed(5)
centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target
# TODO: choose three additional estimators here that will give the best results.
estimators = {'Hierarchical 3': AgglomerativeClustering(n_clusters=3),
              'K-Means 3': KMeans(n_clusters=3),
              'K-Means 7': KMeans(n_clusters=7)}
fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    plt.cla()
    # Fit the estimator and color each point by its cluster label.
    est.fit(X)
    labels = est.labels_
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))
    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title(name)
    fignum = fignum + 1

# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()
# Label each species at the mean position of its points.
for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the label values so the ground-truth colors roughly match the cluster plots.
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
plt.show()
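Rather than only eyeballing the 3D plots, you can score each clustering against the ground truth with a label-permutation-invariant metric such as the adjusted Rand index. This evaluation snippet is a suggestion of our own, not part of the original exercise.
In [ ]:
from sklearn.metrics import adjusted_rand_score

# Compare each fitted estimator's labels to the true species labels.
# (The score is invariant to how the cluster numbers are permuted.)
for name, est in estimators.items():
    print(name, adjusted_rand_score(iris.target, est.labels_))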