Title: Agglomerative Clustering
Slug: agglomerative_clustering
Summary: How to conduct agglomerative clustering in scikit-learn.
Date: 2017-09-22 12:00
Category: Machine Learning
Tags: Clustering
Authors: Chris Albon
In [1]:
# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
In [2]:
# Load data
iris = datasets.load_iris()
X = iris.data
In [3]:
# Standardize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
In scikit-learn, AgglomerativeClustering uses the linkage parameter to determine the merging strategy, which minimizes either 1) the variance of merged clusters (ward), 2) the average distance between observations from pairs of clusters (average), or 3) the maximum distance between observations from pairs of clusters (complete).
Two other parameters are useful to know. First, the affinity parameter determines the distance metric used for linkage (minkowski, euclidean, etc.). Second, n_clusters sets the number of clusters the clustering algorithm will attempt to find. That is, clusters are successively merged until only n_clusters remain.
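To see how the merging strategy affects the result, here is a small sketch (not part of the original recipe) that fits all three linkage options on the standardized iris features and prints the first few cluster assignments from each. With ward linkage the distance metric must be euclidean, which is also the default, so affinity is left unset here.

```python
# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Load and standardize the iris features
X_std = StandardScaler().fit_transform(datasets.load_iris().data)

# Fit agglomerative clustering with each linkage strategy
for linkage in ['ward', 'average', 'complete']:
    clt = AgglomerativeClustering(linkage=linkage, n_clusters=3)
    labels = clt.fit_predict(X_std)
    print(linkage, labels[:10])
```

The cluster label integers themselves are arbitrary; what differs between linkage strategies is how observations are grouped.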
In [4]:
# Create agglomerative clustering object
clt = AgglomerativeClustering(linkage='complete',
affinity='euclidean',
n_clusters=3)
# Train model
model = clt.fit(X_std)
In [5]:
# Show cluster membership
model.labels_
Out[5]:
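Because the iris dataset also ships the true species labels, one optional sanity check (not part of the original recipe) is to compare the cluster assignments in labels_ against iris.target using the adjusted Rand index, which measures agreement between two labelings while ignoring the arbitrary numbering of clusters.

```python
# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Load data and standardize features
iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)

# Train model (euclidean is the default distance metric)
model = AgglomerativeClustering(linkage='complete',
                                n_clusters=3).fit(X_std)

# Agreement between cluster assignments and true species labels
score = adjusted_rand_score(iris.target, model.labels_)
print(score)
```

A score of 1.0 would mean the clusters match the species perfectly; a score near 0 would mean agreement no better than chance.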