In [1]:
%load_ext watermark

In [2]:
%watermark -a "Sebastian Raschka" -d -v


Sebastian Raschka 2016-08-31 

CPython 3.5.2
IPython 5.1.0

Hierarchical Agglomerative Clustering - Complete Linkage Clustering

- A quick Python tutorial



This is a rather technically oriented tutorial that was born out of the need to plot a heatmap together with the dendrograms from a complete linkage clustering. Since I couldn't find a good resource online, I thought it might be worthwhile to share it.

The theoretical aspects are therefore kept a little brief, but nonetheless, here is a short introduction to what it is all about.

Complete linkage is one implementation of hierarchical agglomerative clustering. The principle of hierarchical agglomerative clustering is to start with each data point as a singleton cluster and to iteratively merge clusters until only one single cluster remains. This results in a "cluster tree," which is also called a dendrogram. The opposite approach -- starting with one cluster and dividing it until only singleton clusters remain -- is called divisive hierarchical clustering.

The algorithm can be summarized via the following pseudocode (a minimal Python sketch follows below):


1: Compute a distance or similarity matrix.
2: Each data point is represented as a singleton cluster.
3: Repeat
4:      Merge two closest clusters (e.g., based on distance between most similar or dissimilar members).
5:      Update the distance (or similarity) matrix.
6: Until one single cluster remains.
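
To make the pseudocode a bit more concrete, here is a minimal, purely illustrative sketch of this loop in plain Python/NumPy, using the complete linkage criterion defined just below. The five 2D points are made up for demonstration; in practice, we will use scipy's linkage function, as shown later in this tutorial.

import numpy as np
from scipy.spatial.distance import cdist

# five made-up 2D points for demonstration
points = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [9., 9.]])

# steps 1-2: distance matrix; every point starts as a singleton cluster
dist = cdist(points, points)
clusters = [[i] for i in range(len(points))]

# steps 3-6: repeatedly merge the two closest clusters until one remains
while len(clusters) > 1:
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            # complete linkage: distance between the most dissimilar members
            d = dist[np.ix_(clusters[a], clusters[b])].max()
            if best is None or d < best[0]:
                best = (d, a, b)
    d, a, b = best
    print('merge %s and %s at distance %.2f' % (clusters[a], clusters[b], d))
    clusters[a] = clusters[a] + clusters[b]
    del clusters[b]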


In each iteration, complete linkage compares the most dissimilar members between clusters: the two clusters whose most dissimilar members are closest to each other are merged into a new cluster.
\begin{equation} d(C, D) = \max_{i,j} \, dist(C_i, D_j) \end{equation} for all points $i$ in cluster $C$ and points $j$ in cluster $D$.

In contrast, the single linkage algorithm compares the two most similar members instead of the most dissimilar ones. \begin{equation} d(C, D) = \min_{i,j} \, dist(C_i, D_j) \end{equation} for all points $i$ in cluster $C$ and points $j$ in cluster $D$.

For other linkage methods, e.g., centroid, average, etc., please see the scipy.cluster.hierarchy.linkage documentation.
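
As a tiny illustration of the two formulas above (again with made-up points, not the data used later in this tutorial), we can compute all pairwise distances between two small clusters and take the maximum and the minimum:

import numpy as np
from scipy.spatial.distance import cdist

C = np.array([[0., 0.], [1., 0.]])   # points in cluster C
D = np.array([[4., 0.], [6., 0.]])   # points in cluster D

pairwise = cdist(C, D)   # all dist(C_i, D_j)
print(pairwise.max())    # complete linkage distance d(C, D) = 6.0
print(pairwise.min())    # single linkage distance d(C, D) = 3.0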

(A more detailed article about prototype-based, hierarchical, and density-based clustering is in the planning stage)



Generating some Sample Data

First, we generate some random sample data to work with. As in a typical application, the rows represent the different observations (ID_0 to ID_10), and the columns are the different features (A, B, C, X, Y, Z).


In [4]:
import pandas as pd
import numpy as np
import matplotlib.ticker as ticker

np.random.seed(123)

variables = ['A','B','C','X','Y','Z']
labels = ['ID_0','ID_1','ID_2','ID_3','ID_4','ID_5','ID_6',
          'ID_7','ID_8','ID_9','ID_10']

X = np.random.random_sample([len(labels),len(variables)])*10
df = pd.DataFrame(X, columns=variables, index=labels)
df


Out[4]:
A B C X Y Z
ID_0 6.964692 2.861393 2.268515 5.513148 7.194690 4.231065
ID_1 9.807642 6.848297 4.809319 3.921175 3.431780 7.290497
ID_2 4.385722 0.596779 3.980443 7.379954 1.824917 1.754518
ID_3 5.315514 5.318276 6.344010 8.494318 7.244553 6.110235
ID_4 7.224434 3.229589 3.617887 2.282632 2.937140 6.309761
ID_5 0.921049 4.337012 4.308628 4.936851 4.258303 3.122612
ID_6 4.263513 8.933892 9.441600 5.018367 6.239530 1.156184
ID_7 3.172855 4.148262 8.663092 2.504554 4.830343 9.855598
ID_8 5.194851 6.128945 1.206287 8.263408 6.030601 5.450680
ID_9 3.427638 3.041208 4.170222 6.813008 8.754568 5.104223
ID_10 6.693138 5.859366 6.249035 6.746891 8.423424 0.831950



Pair-wise Distance Matrix, Rows

First, we calculate the pair-wise distances for every row (i.e., between every sample across the different variables). We will use the default Euclidean distance measure; the other available distance measures are listed in the scipy.spatial.distance.pdist documentation, each with a nice, short explanatory paragraph. In addition, we use the squareform function to return a symmetric matrix of the pair-wise distances.


In [5]:
from scipy.spatial.distance import pdist,squareform

row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')), columns=labels, index=labels)
row_dist


Out[5]:
ID_0 ID_1 ID_2 ID_3 ID_4 ID_5 ID_6 ID_7 ID_8 ID_9 ID_10
ID_0 0.000000 7.515813 7.291294 6.147102 5.908280 7.283425 10.307123 10.158830 5.034288 4.587465 6.284267
ID_1 7.515813 0.000000 10.693828 7.856166 5.007880 10.228137 10.161800 8.758121 7.985670 9.852989 9.346920
ID_2 7.291294 10.693828 0.000000 8.850425 7.943009 6.313700 11.170073 11.594865 8.431215 8.154213 9.108111
ID_3 6.147102 7.856166 8.850425 0.000000 8.521560 7.410542 7.836414 8.189719 5.387945 4.426452 5.873941
ID_4 5.908280 5.007880 7.943009 8.521560 0.000000 7.770742 10.962260 7.675921 8.026113 8.400971 9.703415
ID_5 7.283425 10.228137 6.313700 7.410542 7.770742 0.000000 8.151657 8.697356 7.122349 5.970293 8.079069
ID_6 10.307123 10.161800 11.170073 7.836414 10.962260 8.151657 0.000000 10.425101 10.274831 9.399096 5.780188
ID_7 10.158830 8.758121 11.594865 8.189719 7.675921 8.697356 10.425101 0.000000 10.845452 8.832567 11.553433
ID_8 5.034288 7.985670 8.431215 5.387945 8.026113 7.122349 10.274831 10.845452 0.000000 5.575461 7.556781
ID_9 4.587465 9.852989 8.154213 4.426452 8.400971 5.970293 9.399096 8.832567 5.575461 0.000000 6.425987
ID_10 6.284267 9.346920 9.108111 5.873941 9.703415 8.079069 5.780188 11.553433 7.556781 6.425987 0.000000
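
As a side note, pdist itself returns the distances in "condensed" form, i.e., a flat vector containing the n(n-1)/2 unique pairwise distances, and squareform expands this vector into the symmetric n x n matrix shown above. A quick check, using the functions we just imported:

condensed = pdist(df, metric='euclidean')
print(condensed.shape)              # (55,): 11*10/2 unique pairs
print(squareform(condensed).shape)  # (11, 11)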



Apply Clustering - Complete Linkage

When we apply the complete linkage agglomeration to our samples, the linkage function returns a so-called linkage matrix. This linkage matrix consists of several rows where each row represents one merge. The first and second columns denote the indices of the two clusters that were merged (an index greater than or equal to the number of original samples refers to a cluster formed in an earlier row), the third column reports the distance between those two clusters (under complete linkage, this is the distance between their most dissimilar members), and the last column contains the number of members in the newly formed cluster.

However, before we call the linkage function, let us take a careful look at its documentation.

Parameters:
y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array.

method : str, optional
The linkage algorithm to use. See the Linkage Methods section below for full descriptions.

metric : str, optional
The distance metric to use. See the distance.pdist function for a list of valid distance metrics.

Returns:
Z : ndarray
The hierarchical clustering encoded as a linkage matrix.

Thus, we can either pass a condensed distance matrix (upper triangular) from the pdist function, or we can pass the "original" data array and define the 'euclidean' metric as a function argument in linkage. However, we should not pass the squareform distance matrix, which would yield incorrect distance values even though the overall clustering may turn out to be the same.

a) Squareform distance matrix (wrong)


In [6]:
from scipy.cluster.hierarchy import linkage

row_clusters = linkage(row_dist, method='complete', metric='euclidean')
pd.DataFrame(row_clusters, 
             columns=['row label 1', 'row label 2', 'distance', 'no. of items in clust.'],
             index=['cluster %d' %(i+1) for i in range(row_clusters.shape[0])])


Out[6]:
row label 1 row label 2 distance no. of items in clust.
cluster 1 3.0 9.0 7.167666 2.0
cluster 2 0.0 8.0 7.769408 2.0
cluster 3 1.0 4.0 8.416614 2.0
cluster 4 11.0 12.0 9.760483 4.0
cluster 5 2.0 5.0 10.348596 2.0
cluster 6 6.0 10.0 10.529093 2.0
cluster 7 7.0 13.0 13.599170 3.0
cluster 8 14.0 15.0 14.962794 6.0
cluster 9 16.0 18.0 17.737253 8.0
cluster 10 17.0 19.0 18.345150 11.0

b) Condensed distance matrix (correct)


In [7]:
row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
pd.DataFrame(row_clusters, 
             columns=['row label 1', 'row label 2', 'distance', 'no. of items in clust.'],
             index=['cluster %d' %(i+1) for i in range(row_clusters.shape[0])])


Out[7]:
row label 1 row label 2 distance no. of items in clust.
cluster 1 3.0 9.0 4.426452 2.0
cluster 2 1.0 4.0 5.007880 2.0
cluster 3 0.0 8.0 5.034288 2.0
cluster 4 6.0 10.0 5.780188 2.0
cluster 5 11.0 13.0 6.147102 4.0
cluster 6 2.0 5.0 6.313700 2.0
cluster 7 7.0 12.0 8.758121 3.0
cluster 8 15.0 16.0 8.850425 6.0
cluster 9 14.0 18.0 11.170073 8.0
cluster 10 17.0 19.0 11.594865 11.0

c) Input sample matrix (correct)


In [8]:
row_clusters = linkage(df.values, method='complete', metric='euclidean')
pd.DataFrame(row_clusters, 
             columns=['row label 1', 'row label 2', 'distance', 'no. of items in clust.'],
             index=['cluster %d' %(i+1) for i in range(row_clusters.shape[0])])


Out[8]:
row label 1 row label 2 distance no. of items in clust.
cluster 1 3.0 9.0 4.426452 2.0
cluster 2 1.0 4.0 5.007880 2.0
cluster 3 0.0 8.0 5.034288 2.0
cluster 4 6.0 10.0 5.780188 2.0
cluster 5 11.0 13.0 6.147102 4.0
cluster 6 2.0 5.0 6.313700 2.0
cluster 7 7.0 12.0 8.758121 3.0
cluster 8 15.0 16.0 8.850425 6.0
cluster 9 14.0 18.0 11.170073 8.0
cluster 10 17.0 19.0 11.594865 11.0
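
As a quick sanity check, we can also confirm programmatically that the condensed distance matrix (b) and the input sample matrix (c) yield identical linkage matrices, whereas the squareform input (a) does not, using the objects defined above:

Z_condensed = linkage(pdist(df, metric='euclidean'), method='complete')
Z_raw = linkage(df.values, method='complete', metric='euclidean')
Z_square = linkage(row_dist, method='complete', metric='euclidean')

print(np.allclose(Z_condensed, Z_raw))     # True
print(np.allclose(Z_condensed, Z_square))  # False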

In [9]:
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.cluster.hierarchy import dendrogram

row_dendr = dendrogram(row_clusters, labels=labels)




Heatmaps

This section is about the visualization of our hierarchical clustering. Here, we will plot simple heatmaps for each scenario: the original data, the data after row clustering, and finally the data after applying both row and column clustering.



Heatmap of the Original Data


In [10]:
fig = plt.figure()

ax = fig.add_subplot(111)

cax = ax.matshow(df, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)

tick_spacing = 1
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
ax.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))

ax.set_xticklabels([''] + list(df.columns))
ax.set_yticklabels([''] + list(df.index))

plt.show()




Heatmap after Row-Clustering

The dendrogram function returns a dictionary with various items, which are explained in detail in the scipy documentation. The dendrogram leaf order can then be accessed via the 'leaves' key, e.g.,


In [11]:
row_dendr['leaves']


Out[11]:
[7, 1, 4, 6, 10, 3, 9, 0, 8, 2, 5]

Thus, in order to sort the DataFrame according to the clustering, we can simply use the 'leaves' as positional indices like so:

df.iloc[row_dendr['leaves']]
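
Since the 'leaves' entries are integer positions, we can also translate them back into our sample labels; note that the dictionary returned by dendrogram stores these label strings in leaf order under the 'ivl' key as well:

print([labels[i] for i in row_dendr['leaves']])  # e.g., ['ID_7', 'ID_1', ...]
print(row_dendr['ivl'])                          # the same labels, taken from the dendrogram dict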

In [13]:
# reorder rows with respect to the clustering
row_dendr = dendrogram(row_clusters, labels=labels, no_plot=True)
df_rowclust = df.iloc[row_dendr['leaves']]

# plot
fig = plt.figure()
ax = fig.add_subplot(111)

cax = ax.matshow(df_rowclust, 
                 interpolation='nearest', 
                 cmap='hot_r')
fig.colorbar(cax)

tick_spacing = 1
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
ax.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))

ax.set_xticklabels([''] + list(df_rowclust.columns))
ax.set_yticklabels([''] + list(df_rowclust.index))

plt.show()




Heatmap plus Row-Dendrogram

Now, we can rotate the dendrogram by setting the orientation parameter to 'left', but note that we then have to sort the clustered data in reverse order to match the row labels in the heatmap (the dendrogram lists its leaves from bottom to top, whereas matshow draws the first DataFrame row at the top) via

df.iloc[row_dendr['leaves'][::-1]]

In [19]:
from scipy.cluster import hierarchy
# makes dendrogram black (1)
hierarchy.set_link_color_palette(['black'])

# plot row dendrogram
fig = plt.figure(figsize=(8,8))
axd = fig.add_axes([0.09,0.1,0.2,0.6])
row_dendr = dendrogram(row_clusters, orientation='left',
                       color_threshold=np.inf)  # makes dendrogram black (2)

# reorder data with respect to clustering
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]

axd.set_xticks([])
axd.set_yticks([])


# remove axes spines from dendrogram
for i in axd.spines.values():
    i.set_visible(False)

# plot heatmap
axm = fig.add_axes([0.20, 0.1, 0.6, 0.6]) # x-pos, y-pos, width, height
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)
axm.set_xticklabels([''] + list(df_rowclust.columns))
axm.set_yticklabels([''] + list(df_rowclust.index))

tick_spacing = 1
axm.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
axm.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))

plt.show()




Adding a Column Dendrogram

Also, we can add an additional dendrogram for the column clustering at the top. Here, we first sort the rows after clustering, as before,

df_rowclust = df.iloc[row_dendr['leaves'][::-1]]

And then we can use the indices from the column clustering to reorder the columns:

df_rowclust = df_rowclust.iloc[:, col_dendr['leaves']]

In [21]:
# Compute pairwise distances for columns
col_dists = pdist(df.T, metric='euclidean')
col_clusters = linkage(col_dists, method='complete')

# plot column dendrogram
fig = plt.figure(figsize=(8,8))

axd2 = fig.add_axes([0.38,0.74,0.36,0.10]) 
col_dendr = dendrogram(col_clusters, orientation='top',
                       color_threshold=np.inf)  # makes dendrogram black
axd2.set_xticks([])
axd2.set_yticks([])

# plot row dendrogram
axd1 = fig.add_axes([0.09,0.1,0.2,0.6])
row_dendr = dendrogram(row_clusters, orientation='left',  
                       count_sort='ascending',
                       color_threshold=np.inf) # makes dendrogram black
axd1.set_xticks([])
axd1.set_yticks([])

# remove axes spines from dendrogram
for i,j in zip(axd1.spines.values(), axd2.spines.values()):
    i.set_visible(False)
    j.set_visible(False)
        

# reorder columns and rows with respect to the clustering
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]
df_rowclust = df_rowclust.iloc[:, col_dendr['leaves']]

# plot heatmap
axm = fig.add_axes([0.20,0.1,0.6,0.6])
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)
axm.set_xticklabels([''] + list(df_rowclust.columns))
axm.set_yticklabels([''] + list(df_rowclust.index))

tick_spacing = 1
axm.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
axm.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))

plt.show()




Important Warning About Adding Dendrograms

It is very important that you always check that the labels on the heatmap and the dendrogram match when you create the first plot! For example, we can pass the labels to the labels parameter of the dendrogram and leave out the axd.set_yticks([]) call, so that the labels are shown next to the dendrogram and can be compared to the heatmap labels.


In [23]:
from scipy.cluster import hierarchy
# makes dendrogram black (1)
hierarchy.set_link_color_palette(['black'])

# plot row dendrogram
fig = plt.figure(figsize=(8,8))
axd = fig.add_axes([0.09,0.1,0.2,0.6])
row_dendr = dendrogram(row_clusters, orientation='left',
                       labels=labels,
                       color_threshold=np.inf)  # makes dendrogram black (2)
axd.set_xticks([])

# uncomment to hide dendrogram labels
#axd.set_yticks([])

# remove axes spines from dendrogram
for i in axd.spines.values():
    i.set_visible(False)

# reorder columns and rows with respect to the clustering
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]
        
# plot heatmap
axm = fig.add_axes([0.20,0.1,0.6,0.6]) # x-pos, y-pos, width, height
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)
axm.set_xticklabels([''] + list(df_rowclust.columns))
axm.set_yticklabels([])

tick_spacing = 1
axm.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
axm.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))

plt.show()