```
In [ ]:
```from IPython.display import Image
from IPython.core.display import HTML

Given a set of objects, we specify that we want $k$ clusters. Each cluster has a mean representing it, and we assign each point to the clusters based on which cluster mean its closest to. For each point, the reconstruction error is defined to be its distance to its cluster mean. This gives us the total reconstruction error, the sum of all the individual reconstruction errors, as an error value that we want to minimize.

Formally, we have a set of object vectors $\{x_n\}_{n = 1}^N$ and a set of $K$ cluster means $\{\mu_k\}_{k = 1}^K$, where $x_n, \mu_k \in \mathbb{R}^D$. We represent each object's cluster assignment with the $K$ dimensional vector $r_n$, where $r_{nk} = 1$ if $x_n$ is in cluster $k$, and 0 otherwise. This gives us the individual reconstruction error

$$J_n(r_n, \{\mu_k\}) = \sum_{k = 1}^K r_{nk} \cdot |x_n - \mu_k|^2$$and the total reconstruction error

$$J(\{r_n\}, \{\mu_k\}) = \sum_{n = 1}^N \sum_{k = 1}^K r_{nk} \cdot |x_n - \mu_k|^2$$As you can see, the reconstruction error on a set of objects is a function of assignments and means. How would you go about choosing the assignments $\{r_n\}$ and means $\{\mu_k\}$ that minimize the reconstruction error? Lloyd's Algorithm proposes a two step error minimization.

**Step 1**: minimize $J$ by updating the $r_n$, assigning each $x_n$ to its closest cluster mean.

**Step 2**: minimize $J$ by recalculting the $\mu_k$ to be the average over all vectors assigned to cluster $k$.

Repeating this process until the assignments do not change, the algorithm will converge upon a local minima. The algorithm can be optimized by starting with reasonable distributions for cluster centers rather than choosing randomly (K-Means ++) or adding the condition that the cluster mean must be a point (K-Medoids).

```
In [ ]:
```import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn import datasets
centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target

```
In [ ]:
```estimators = {'K-Means 3': KMeans(n_clusters=3),
'K-Means 8': KMeans(n_clusters=7)}
fignum = 1
for name, est in estimators.items():
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()
est.fit(X)
labels = est.labels_
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(np.float))
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title(name)
fignum = fignum + 1
plt.show()

K-Means is one of the most common and simple clustering methods, but it has a couple of key limitations. First off, it is nondeterministic as it depends on the initial choice of cluster means, and Lloyd's Algorithm only arrives upon local minima rather than the global minimum. Furthermore, the algorithm requires you to decide what $k$ is, as we saw earlier. Finally, K-Means can be inflexible, as the only way cluster centers are derived is through the mean distance from all of the points assigned to the cluster. Because of this construction, K-Means doesn't perform well on clusters that are connected but not compact.

Hierarchical clustering solves many of these issues. The motivation between hierarchical clustering is building up clusters as a hierarchy. All objects start out in their individual groups, and at each level, the groups that are a certain distance apart are joined to form a larger group. A variety of different distance metrics can be used in building up these groups to result in different types of hierarchical clusters.

```
In [ ]:
```estimators = {'Hierarchical 3': AgglomerativeClustering(n_clusters=3)}
fignum = 1
for name, est in estimators.items():
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()
est.fit(X)
labels = est.labels_
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(np.float))
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title(name)
fignum = fignum + 1
plt.show()

```
In [ ]:
```np.random.seed(5)
centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target
# TODO: choose three additional estimators here that will give the best results.
estimators = {'Hierarchical 3': AgglomerativeClustering(n_clusters=3),
'K-Means 3': KMeans(n_clusters=3),
'K-Means 7': KMeans(n_clusters=7)}
fignum = 1
for name, est in estimators.items():
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()
est.fit(X)
labels = est.labels_
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(np.float))
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title(name)
fignum = fignum + 1
# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()
for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
ax.text3D(X[y == label, 3].mean(),
X[y == label, 0].mean() + 1.5,
X[y == label, 2].mean(), name,
horizontalalignment='center',
bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
y = np.choose(y, [1, 2, 0]).astype(np.float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
plt.show()