Hierarchical Clustering

In this notebook we give a basic example of how agglomerative hierarchical clustering works, using the scipy and sklearn libraries.


In [0]:
from sklearn.metrics import normalized_mutual_info_score
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_blobs
import numpy as np

Generating Sample Data

make_blobs is used to generate sample data where:

n_samples: the total number of points, equally divided among the clusters.

centers: the number of centers to generate, or the fixed center locations.

n_features: the number of features for each sample.

random_state: determines the random number generation used for dataset creation.

This function returns two outputs:

X: the generated samples.

y: the integer labels for cluster membership of each sample.

Then we use plt.scatter to plot the first two features of the data points in the figure below.


In [0]:
X, y = make_blobs(n_samples=90, centers=4, n_features=3, random_state=4)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
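
As a quick check of the two outputs, we can inspect the shapes and the per-cluster sample counts (a small sketch using the X and y generated above):

In [0]:
# X holds 90 samples with 3 features each; y holds one integer label per sample.
print(X.shape, y.shape)
# Show which labels occur and how many samples fall into each cluster.
print(np.unique(y, return_counts=True))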


Performing Hierarchical Clustering

In this part we perform agglomerative hierarchical clustering using the linkage function from the scipy library:

method: the linkage method; 'single' means single linkage, where the distance between two clusters is the minimum distance between any pair of their points.

metric: the distance metric; 'euclidean' means Euclidean distance.

"A (n-1) by 4 matrix Z is returned. At the -th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster with index (n+i) . A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

The following linkage methods are used to compute the distance d(s,t)between two clusters sand t. The algorithm begins with a forest of clusters that have yet to be used in the hierarchy being formed. When two clusters s and tfrom this forest are combined into a single cluster u, sand t are removed from the forest, and u is added to the forest. When only one cluster remains in the forest, the algorithm stops, and this cluster becomes the root.

A distance matrix is maintained at each iteration. The d[i,j]`` entry corresponds to the distance between clusteriiandj` in the original forest.

At each iteration, the algorithm must update the distance matrix to reflect the distance of the newly formed cluster u with the remaining clusters in the forest."

For more details check the documentation of linkage: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
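
Besides 'single', the linkage function also supports other methods such as 'complete', 'average', and 'ward' (Ward linkage assumes the Euclidean metric). A short sketch of how the alternatives would be computed on the same X; in this notebook we stick with single linkage, which is computed in the next cell:

In [0]:
# Each call returns an (n-1) by 4 linkage matrix; only the merge order and distances differ.
Z_complete = linkage(X, method="complete", metric="euclidean")
Z_average = linkage(X, method="average", metric="euclidean")
Z_ward = linkage(X, method="ward", metric="euclidean")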


In [0]:
Z = linkage(X, method="single", metric="euclidean")
print(Z.shape)
Z


(89, 4)
Out[0]:
array([[1.90000000e+01, 8.60000000e+01, 1.30442991e-01, 2.00000000e+00],
       [1.00000000e+00, 2.90000000e+01, 1.82464805e-01, 2.00000000e+00],
       [2.60000000e+01, 4.10000000e+01, 2.24296317e-01, 2.00000000e+00],
       [3.40000000e+01, 9.10000000e+01, 2.58407755e-01, 3.00000000e+00],
       [1.70000000e+01, 8.80000000e+01, 4.46261329e-01, 2.00000000e+00],
       [3.00000000e+00, 2.80000000e+01, 4.59526997e-01, 2.00000000e+00],
       [8.00000000e+00, 9.30000000e+01, 4.70577188e-01, 4.00000000e+00],
       [6.40000000e+01, 7.00000000e+01, 4.85796818e-01, 2.00000000e+00],
       [6.80000000e+01, 9.70000000e+01, 5.52673653e-01, 3.00000000e+00],
       [2.20000000e+01, 7.10000000e+01, 5.67637608e-01, 2.00000000e+00],
       [4.80000000e+01, 5.20000000e+01, 5.72218127e-01, 2.00000000e+00],
       [2.70000000e+01, 9.20000000e+01, 5.83346355e-01, 3.00000000e+00],
       [7.70000000e+01, 9.00000000e+01, 5.88123667e-01, 3.00000000e+00],
       [4.50000000e+01, 9.50000000e+01, 6.06112007e-01, 3.00000000e+00],
       [4.90000000e+01, 9.80000000e+01, 6.06134706e-01, 4.00000000e+00],
       [3.80000000e+01, 8.50000000e+01, 6.54175526e-01, 2.00000000e+00],
       [2.00000000e+01, 1.05000000e+02, 6.64553292e-01, 3.00000000e+00],
       [9.00000000e+00, 4.60000000e+01, 6.70931992e-01, 2.00000000e+00],
       [9.90000000e+01, 1.02000000e+02, 6.82628319e-01, 5.00000000e+00],
       [7.00000000e+00, 1.20000000e+01, 6.95405759e-01, 2.00000000e+00],
       [2.10000000e+01, 1.00000000e+02, 7.00137901e-01, 3.00000000e+00],
       [7.50000000e+01, 8.40000000e+01, 7.18486705e-01, 2.00000000e+00],
       [6.70000000e+01, 1.01000000e+02, 7.20982261e-01, 4.00000000e+00],
       [6.50000000e+01, 1.04000000e+02, 7.30097372e-01, 5.00000000e+00],
       [2.30000000e+01, 3.60000000e+01, 7.30257122e-01, 2.00000000e+00],
       [5.60000000e+01, 8.30000000e+01, 7.31435561e-01, 2.00000000e+00],
       [6.90000000e+01, 1.15000000e+02, 7.39908983e-01, 3.00000000e+00],
       [3.20000000e+01, 7.90000000e+01, 7.52488630e-01, 2.00000000e+00],
       [4.20000000e+01, 7.40000000e+01, 7.72963318e-01, 2.00000000e+00],
       [9.40000000e+01, 1.06000000e+02, 7.75139691e-01, 5.00000000e+00],
       [1.30000000e+01, 5.70000000e+01, 8.09976391e-01, 2.00000000e+00],
       [4.70000000e+01, 1.10000000e+02, 8.17317663e-01, 4.00000000e+00],
       [8.70000000e+01, 1.21000000e+02, 8.37778371e-01, 5.00000000e+00],
       [7.20000000e+01, 1.18000000e+02, 8.40307889e-01, 3.00000000e+00],
       [1.09000000e+02, 1.13000000e+02, 8.40561182e-01, 7.00000000e+00],
       [1.10000000e+01, 4.40000000e+01, 8.47017348e-01, 2.00000000e+00],
       [5.40000000e+01, 1.16000000e+02, 8.53348573e-01, 4.00000000e+00],
       [6.10000000e+01, 1.24000000e+02, 8.60237379e-01, 8.00000000e+00],
       [8.10000000e+01, 1.14000000e+02, 8.69649593e-01, 3.00000000e+00],
       [1.60000000e+01, 3.50000000e+01, 8.71267240e-01, 2.00000000e+00],
       [8.90000000e+01, 1.23000000e+02, 8.87225579e-01, 4.00000000e+00],
       [1.40000000e+01, 5.10000000e+01, 8.97306320e-01, 2.00000000e+00],
       [1.27000000e+02, 1.31000000e+02, 9.03391600e-01, 1.00000000e+01],
       [7.60000000e+01, 1.30000000e+02, 9.42132535e-01, 5.00000000e+00],
       [2.00000000e+00, 4.00000000e+00, 9.43591329e-01, 2.00000000e+00],
       [6.00000000e+01, 1.32000000e+02, 9.52055045e-01, 1.10000000e+01],
       [1.50000000e+01, 1.03000000e+02, 9.55858009e-01, 4.00000000e+00],
       [6.20000000e+01, 1.36000000e+02, 9.55990815e-01, 5.00000000e+00],
       [7.80000000e+01, 1.12000000e+02, 9.62094518e-01, 5.00000000e+00],
       [1.26000000e+02, 1.33000000e+02, 9.70616432e-01, 9.00000000e+00],
       [3.10000000e+01, 1.37000000e+02, 9.73453020e-01, 6.00000000e+00],
       [3.00000000e+01, 1.38000000e+02, 9.76974714e-01, 6.00000000e+00],
       [1.17000000e+02, 1.34000000e+02, 9.87111827e-01, 4.00000000e+00],
       [6.00000000e+00, 4.00000000e+01, 9.88659025e-01, 2.00000000e+00],
       [6.30000000e+01, 1.41000000e+02, 9.90928355e-01, 7.00000000e+00],
       [1.80000000e+01, 1.35000000e+02, 9.94563563e-01, 1.20000000e+01],
       [4.30000000e+01, 1.44000000e+02, 1.01672289e+00, 8.00000000e+00],
       [1.19000000e+02, 1.40000000e+02, 1.02423556e+00, 1.10000000e+01],
       [5.30000000e+01, 1.47000000e+02, 1.02433354e+00, 1.20000000e+01],
       [0.00000000e+00, 5.80000000e+01, 1.04742170e+00, 2.00000000e+00],
       [9.60000000e+01, 1.48000000e+02, 1.05140979e+00, 1.60000000e+01],
       [1.11000000e+02, 1.29000000e+02, 1.05383364e+00, 4.00000000e+00],
       [1.49000000e+02, 1.50000000e+02, 1.06811164e+00, 1.80000000e+01],
       [1.43000000e+02, 1.52000000e+02, 1.13528499e+00, 2.00000000e+01],
       [1.42000000e+02, 1.45000000e+02, 1.13534989e+00, 1.60000000e+01],
       [1.00000000e+01, 1.54000000e+02, 1.13899133e+00, 1.70000000e+01],
       [8.00000000e+01, 1.55000000e+02, 1.14020634e+00, 1.80000000e+01],
       [1.08000000e+02, 1.39000000e+02, 1.16825903e+00, 1.40000000e+01],
       [1.28000000e+02, 1.57000000e+02, 1.17114497e+00, 1.70000000e+01],
       [3.30000000e+01, 1.46000000e+02, 1.18258513e+00, 9.00000000e+00],
       [8.20000000e+01, 1.53000000e+02, 1.20489168e+00, 2.10000000e+01],
       [5.00000000e+01, 7.30000000e+01, 1.26106706e+00, 2.00000000e+00],
       [1.25000000e+02, 1.56000000e+02, 1.26262400e+00, 2.00000000e+01],
       [1.51000000e+02, 1.59000000e+02, 1.26658643e+00, 1.30000000e+01],
       [1.07000000e+02, 1.62000000e+02, 1.27731453e+00, 2.20000000e+01],
       [1.58000000e+02, 1.61000000e+02, 1.30683670e+00, 1.90000000e+01],
       [5.50000000e+01, 1.65000000e+02, 1.31086896e+00, 2.00000000e+01],
       [1.20000000e+02, 1.63000000e+02, 1.32666243e+00, 1.50000000e+01],
       [2.40000000e+01, 1.66000000e+02, 1.34505705e+00, 2.10000000e+01],
       [2.50000000e+01, 1.67000000e+02, 1.35109125e+00, 1.60000000e+01],
       [5.90000000e+01, 1.68000000e+02, 1.43253745e+00, 2.20000000e+01],
       [3.70000000e+01, 1.70000000e+02, 1.45594208e+00, 2.30000000e+01],
       [1.22000000e+02, 1.69000000e+02, 1.47692245e+00, 2.10000000e+01],
       [6.60000000e+01, 1.60000000e+02, 1.55016373e+00, 2.20000000e+01],
       [3.90000000e+01, 1.72000000e+02, 1.57569084e+00, 2.20000000e+01],
       [5.00000000e+00, 1.73000000e+02, 1.70993648e+00, 2.30000000e+01],
       [1.64000000e+02, 1.75000000e+02, 3.30049427e+00, 4.50000000e+01],
       [1.74000000e+02, 1.76000000e+02, 1.07623457e+01, 6.70000000e+01],
       [1.71000000e+02, 1.77000000e+02, 1.31670775e+01, 9.00000000e+01]])
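
To make the structure of Z concrete, the first few rows can be printed as plain-language merge steps (a small sketch based on the Z computed above; here n = 90, so newly formed clusters receive indices 90, 91, ...):

In [0]:
n = X.shape[0]
for i, (a, b, dist, size) in enumerate(Z[:5]):
    print(f"step {i}: merge clusters {int(a)} and {int(b)} at distance {dist:.3f} "
          f"-> new cluster {n + i} with {int(size)} original observations")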

Plotting the Dendrogram

The dendrogram function from scipy is used to plot the dendrogram:

  • On the x axis we see the indices of our samples.
  • On the y axis we see the (Euclidean) distances at which clusters are merged.

In [0]:
plt.figure(figsize=(25, 10))
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Samples indexes")
plt.ylabel("distance")
dendrogram(Z, leaf_rotation=90., leaf_font_size=8. ) 
plt.show()
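
For larger datasets the full dendrogram quickly becomes hard to read. The dendrogram function can also truncate the plot to the last merges (a sketch using the same Z; p controls how many merged clusters are shown):

In [0]:
plt.figure(figsize=(12, 6))
plt.title("Truncated Hierarchical Clustering Dendrogram")
plt.xlabel("Cluster size (or sample index)")
plt.ylabel("distance")
# Show only the last 12 merged clusters; the heights of collapsed branches are marked along the links.
dendrogram(Z, truncate_mode="lastp", p=12, show_contracted=True)
plt.show()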


Retrieve the Clusters

fcluster is used to retrieve flat clusters by cutting the dendrogram at a chosen distance.

The value 2 is the distance threshold at which we cut the dendrogram: a horizontal cut at this height crosses a number of vertical lines equal to the number of clusters.


In [0]:
cluster = fcluster(Z, 2, criterion="distance")
cluster


Out[0]:
array([4, 4, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 3, 2, 3, 4, 2, 4, 3, 1, 4, 2,
       1, 1, 1, 2, 2, 2, 4, 4, 2, 4, 3, 2, 4, 2, 1, 1, 4, 2, 4, 2, 1, 2,
       3, 4, 3, 2, 2, 3, 1, 3, 2, 4, 1, 1, 1, 2, 4, 1, 3, 3, 4, 2, 3, 3,
       4, 2, 3, 1, 3, 1, 1, 1, 1, 2, 1, 1, 2, 3, 3, 1, 4, 1, 2, 4, 1, 2,
       4, 1], dtype=int32)
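
If we know how many clusters we want instead of a distance threshold, fcluster also accepts the 'maxclust' criterion (a sketch, again using the Z computed above):

In [0]:
# Request a flat clustering with at most 4 clusters instead of cutting at a fixed distance.
cluster_4 = fcluster(Z, t=4, criterion="maxclust")
np.unique(cluster_4, return_counts=True)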

Plotting Clusters

We plot the final result, again using only the first two features; each color represents a different cluster (four clusters in total).


In [0]:
plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], c=cluster, cmap="Accent")
plt.savefig("clusters.png")
plt.show()
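
Since the data was generated with three features but only the first two are plotted above, a 3D scatter can be used to check the separation along the third feature as well (a sketch, assuming the X and cluster variables from the cells above):

In [0]:
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3d projection)

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=cluster, cmap="Accent")
ax.set_xlabel("feature 0")
ax.set_ylabel("feature 1")
ax.set_zlabel("feature 2")
plt.show()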


Evaluating the Clusters

Finally, we use the Normalized Mutual Information (NMI) score to evaluate our clusters. Mutual information is a symmetric measure of the degree of dependency between the clustering and the ground-truth labels. An NMI value close to one indicates high agreement between the clusters and the actual labels, while a value close to zero indicates that they are largely independent.


In [0]:
normalized_mutual_info_score(y, cluster)


/usr/local/lib/python3.6/dist-packages/sklearn/metrics/cluster/supervised.py:844: FutureWarning: The behavior of NMI will change in version 0.22. To match the behavior of 'v_measure_score', NMI will use average_method='arithmetic' by default.
  FutureWarning)
Out[0]:
1.0
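
The FutureWarning above notes that the default averaging method of NMI changed in scikit-learn 0.22. The behaviour can be pinned explicitly (a sketch, assuming scikit-learn >= 0.20, where the average_method parameter is available):

In [0]:
# Use the arithmetic mean of the label entropies for normalization, matching v_measure_score.
normalized_mutual_info_score(y, cluster, average_method="arithmetic")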
