Title: Agglomerative Clustering
Slug: agglomerative_clustering
Summary: How to conduct agglomerative clustering in scikit-learn.
Date: 2017-09-22 12:00
Category: Machine Learning
Tags: Clustering
Authors: Chris Albon
In [1]:
    
# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
    
In [2]:
    
# Load data
iris = datasets.load_iris()
X = iris.data
    
In [3]:
    
# Standardize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
    
In scikit-learn, AgglomerativeClustering uses the linkage parameter to determine the merging strategy, which minimizes the 1) variance of merged clusters (ward), 2) average distance between observations from pairs of clusters (average), or 3) maximum distance between observations from pairs of clusters (complete).
Two other parameters are useful to know. First, the affinity parameter determines the distance metric used for linkage (minkowski, euclidean, etc.). Second, n_clusters sets the number of clusters the clustering algorithm will attempt to find. That is, clusters are successively merged until only n_clusters remain.
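To see how the choice of merging strategy affects the result, one possible sketch is to fit the model once per linkage option and compare the first few labels. This rebuilds X_std from the cells above so it runs on its own; note that ward only supports the (default) euclidean metric:

```python
# Compare the three linkage strategies on standardized iris features
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Load and standardize the data, as in the cells above
X_std = StandardScaler().fit_transform(datasets.load_iris().data)

for linkage in ['ward', 'average', 'complete']:
    # Fit and return cluster labels in one step
    labels = AgglomerativeClustering(linkage=linkage,
                                     n_clusters=3).fit_predict(X_std)
    print(linkage, labels[:10])
```

Cluster labels are arbitrary integers, so the same partition can appear under different numberings across linkage strategies; compare groupings, not raw label values.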
In [4]:
    
# Create agglomerative clustering object
clt = AgglomerativeClustering(linkage='complete', 
                              affinity='euclidean', 
                              n_clusters=3)
# Train model
model = clt.fit(X_std)
    
In [5]:
    
# Show cluster membership
model.labels_
    
    Out[5]:
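The labels_ attribute holds one integer cluster assignment per observation. A quick way to summarize the result is to count how many observations landed in each cluster; this sketch refits the model from the cells above so it is self-contained (euclidean is the default metric, so affinity is omitted here):

```python
# Count observations per cluster
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Load and standardize the data, as in the cells above
X_std = StandardScaler().fit_transform(datasets.load_iris().data)

# Fit the model with complete linkage
model = AgglomerativeClustering(linkage='complete',
                                n_clusters=3).fit(X_std)

# bincount tallies how many observations received each label
print(np.bincount(model.labels_))
```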