Title: Evaluating Clustering
Slug: evaluating_clustering
Summary: How to evaluate clustering models for machine learning in Python.
Date: 2017-09-14 12:00
Category: Machine Learning
Tags: Model Evaluation
Authors: Chris Albon
In [2]:
# Load libraries
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
In [3]:
# Generate feature matrix
X, _ = make_blobs(n_samples = 1000,
                  n_features = 10,
                  centers = 2,
                  cluster_std = 0.5,
                  shuffle = True,
                  random_state = 1)
In [4]:
# Cluster data using k-means to predict classes
model = KMeans(n_clusters=2, random_state=1).fit(X)
# Get predicted classes
y_hat = model.labels_
Formally, the $i$th observation's silhouette coefficient is:
$$s_{i} = \frac{b_{i} - a_{i}}{\max(a_{i}, b_{i})}$$
where $s_{i}$ is the silhouette coefficient for observation $i$, $a_{i}$ is the mean distance between $i$ and all other observations in the same cluster, and $b_{i}$ is the mean distance between $i$ and all observations in the nearest cluster that $i$ does not belong to. The value returned by silhouette_score is the mean silhouette coefficient across all observations. Silhouette coefficients range from -1 to 1, with values near 1 indicating dense, well-separated clusters.
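To make the formula concrete, the following is a minimal sketch that computes each coefficient by hand and compares the result with scikit-learn. It assumes Euclidean distance (the default metric of silhouette_score) and that every cluster has more than one member. The helper name silhouette_coefficient is hypothetical, and n_init=10 is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Recreate the data and model from the cells above
X, _ = make_blobs(n_samples=1000, n_features=10, centers=2,
                  cluster_std=0.5, shuffle=True, random_state=1)
# n_init=10 is an assumption made here, not part of the original cell
y_hat = KMeans(n_clusters=2, random_state=1, n_init=10).fit(X).labels_


def silhouette_coefficient(X, labels, i):
    """Hypothetical helper: compute s_i for observation i from the formula."""
    # Euclidean distance from observation i to every observation (0 to itself)
    distances = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    # a_i: mean distance to the other members of i's own cluster
    # (the zero self-distance adds nothing to the sum, so divide by n - 1)
    a_i = distances[own].sum() / (own.sum() - 1)
    # b_i: smallest mean distance to the members of any other cluster
    b_i = min(distances[labels == k].mean()
              for k in np.unique(labels) if k != labels[i])
    return (b_i - a_i) / max(a_i, b_i)


# Per-observation coefficients computed by hand
manual = np.array([silhouette_coefficient(X, y_hat, i) for i in range(len(X))])

# Compare against scikit-learn's per-observation and mean values
print(np.allclose(manual, silhouette_samples(X, y_hat)))
print(np.isclose(manual.mean(), silhouette_score(X, y_hat)))
```

If the hand computation matches the definition, both comparisons should agree with scikit-learn up to floating-point tolerance.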
In [5]:
# Evaluate model
silhouette_score(X, y_hat)
Out[5]:
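Because scores closer to 1 indicate denser, better-separated clusters, the score can also be used to compare candidate values of n_clusters. The sketch below is one possible way to do that on the same synthetic data. The candidate range of 2 to 5 and the explicit n_init=10 are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as above: two tight blobs in ten dimensions
X, _ = make_blobs(n_samples=1000, n_features=10, centers=2,
                  cluster_std=0.5, shuffle=True, random_state=1)

# Fit k-means for each candidate number of clusters and record the mean silhouette
scores = {}
for k in range(2, 6):  # candidate range is an assumption for this sketch
    labels = KMeans(n_clusters=k, random_state=1, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the candidate with the highest mean silhouette coefficient
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print("best n_clusters:", best_k)
```

The silhouette score needs at least two clusters, so the candidate range starts at 2.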