Comparing clustering algorithms

2017-09-08, Josh Montague

What

We often have some curated data of interest for which we'd like a low-dimensional representation. In the case of Twitter data, the corpus in question is usually based on filtering of the text content of the Tweet itself or the description of the user (profile bio). Since Twitter is made of people, we're often in a position where we want to discover and describe a small (low-dimensional) number of "groups" or "communities" of Tweets or users within a corpus.

Unsupervised learning (clustering) is the tool commonly used for this. There are 10 clustering algorithms built in to the sklearn API as of the time of this writing, and others that exist outside of that one specific library. The take-away from this session is a proposal to move from

"KMEANS ALL THE THINGS"

"HDBSCAN ALL THE THINGS (while also taking the time to think a bit more about your algorithm assumptions)".

It really just rolls right off the tongue.

Why

Most of my clustering experience has been with the KMeans algorithm, and so I can't speak from a position of much experience on all the rest. However, I do know that KMeans is often used in clustering tasks - often when it probably shouldn't be - because it is incredibly fast. I've always been uncomfortable with applying this hammer to all tasks, so the goal of this session is to highlight some alternatives and where they differ.

Other people who care more deeply about the comparison of these algorithms have written about them, and so here I'll link to (and copy from) some of those references.

How

First, we'll look at a couple of examples that already exist in the wilds of the internet with some generated data. Then, we'll dig into one particular clustering algorithm that looks promising. Finally, we'll look at that algorithm applied to some of our own data.

To make this visualizable, we're going to work mostly in 2 dimensions.

What aren't we going to do?

a.k.a. "great opportunities for future RST sessions"

have a deep dive into the "proper" dimensionality in which to do this work
have a deep dive into comparing algorithms on the basis of their scaling (we'll touch on it, and there is a nice write up linked later)
have a deep dive into dimensionality reduction techniques

Ok, let's go!



In [ ]:

    
import time

import hdbscan
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(context='poster', style='darkgrid')
sns.set_color_codes()

from sklearn import cluster
from sklearn import datasets

Demo from `sklearn`

Run the next cell, then come back.

Let's start by pointing our eyeballs at the picture that pops out from running the script below. This is a lightly modified version of the sklearn docs clustering demo, and highlights the results of crossing a set of algorithms (columns) with a set of generated data sets (rows).

In most cases, there are parameters that are necessary inputs and for these fits. These are magic numbers pulled from experience and thin air. For the most part, the parameters are even chosen to be "correct" for the data set e.g. k=2 for the annulus and crescent data sets, to illustrate a sort of "best case" scenario. In the lower righthand corner of each plot is the runtime of the fit step.

The code is pretty involved but not particularly informative, so let's just look at the output graphic!



In [ ]:

    
# this takes ~a minute to run
%run sklearn-cluster-demo.py

Quick take-aways:

speed?
- KMeans is fast
- AffinityPropagation, SpectralClustering, Ward, Agglomerative are slow
non-linear separation?
- SpectralClustering, Ward, Agglomerative, and DBSCAN seem to be capable
- the rest aren't
clusters not blob-like (esp. #4, the long skinny clusters)?
- even with "correct" k value, most algos fit a poor model!
"null" / no clusters?
- all bets are off; aka "GIGO" (garbage in garbage out)
best?
- if I had to choose one algorithm based on this data alone, it looks like DBSCAN is the winner

There are some interesting patterns in there. The sklearn user's guide (and associated links to algorithm details) has a nice table that briefly discusses the assumptions and use cases for the algorithms. I don't want to get into the derivations, though, so let's look at anothere data set and see what else we can surmise from empirical comparisons.

Demo from `HDBSCAN`

The DBSCAN algorithm is included in the current release of sklearn, but the algorithm's authors subsequently published an improved version to that algorithm, called HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) which addresses a handful of the shortcomings of their original DBSCAN.

The paper's abstract:

We propose a theoretically and practically improved density-based, hierarchical clustering method, providing a clustering hierarchy from which a simplified tree of significant clusters can be constructed. For obtaining a “flat” partition consisting of only the most significant clusters (possibly corresponding to different density thresholds), we propose a novel cluster stability measure, formalize the problem of maximizing the overall stability of selected clusters, and formulate an algorithm that computes an optimal solution to this problem. We demonstrate that our approach outperforms the current, state-of-the-art, density-based clustering methods on a wide variety of real world data.

The HDBSCAN docs have a great writeup (and notebook) that compares available Python clustering algorithms. This section is a modified version of that writeup.

First, go get the data file...



In [ ]:

    
! wget https://github.com/scikit-learn-contrib/hdbscan/blob/master/notebooks/clusterable_data.npy?raw=true \
    -O clusterable_data.npy



In [ ]:

    
data = np.load('clusterable_data.npy')



In [ ]:

    
plt.figure(figsize=(15,10))
# nb: because of the format of the .npy files, you'll see lots of transposing of data (data.T)
plt.scatter(data.T[0], data.T[1], c='b', alpha=0.25, s=80)

frame = plt.gca()
frame.axes.get_xaxis().set_visible(False)
frame.axes.get_yaxis().set_visible(False)

Some important observations on this data:

there are a bunch of visually recognizable clusters
the shapes of the clusters varies
the core density of the clusters varies
there's a bunch of data scattered around the edges

Let's line up some algorithms to compare. To make it simpler, this is a subset of the sklearn demo ones, plus HDBSCAN (which is not in sklearn). We're giving the algorithms some parameters (because they're required), and we'll come back to how we feel about them later.

Below, we'll fit each of the algorithms to the dataset we've just looked at, and label the results according to the trained model.



In [ ]:

    
# tuples are:
# (algorithm, algo args, algo kwargs)
algo_combos = [
    (cluster.KMeans,                  (),       {'n_clusters':6}),
    (cluster.AffinityPropagation,     (),       {'preference':-5.0, 'damping':0.95}),
    (cluster.MeanShift,               (0.175,), {'cluster_all':False}),
    (cluster.SpectralClustering,      (),       {'n_clusters':6}),
    (cluster.AgglomerativeClustering, (),       {'n_clusters':6, 'linkage':'ward'}),   
    (cluster.DBSCAN,                  (),       {'eps':0.025}),
    (hdbscan.HDBSCAN,                 (),       {'min_cluster_size':15})
]



In [ ]:

    
fig = plt.figure(figsize=(20,20))
plot_kwds = {'alpha' : 0.25, 's' : 60, 'linewidths':0}

# loop over algos and fit each one
for i, (algo, args, algo_kwargs) in enumerate(algo_combos, start=1):
    # catch the runtime for model fitting
    start_time = time.time()
    labels = algo(*args, **algo_kwargs).fit_predict(data)
    end_time = time.time()
    # plot the results
    palette = sns.color_palette('deep', np.unique(labels).max() + 1)
    colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in labels]
    ax = fig.add_subplot(4,2,i)    
    ax.scatter(data.T[0], data.T[1], c=colors, **plot_kwds)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)    
    ax.text(-0.5, 
            0.67,
            '{}: {:.2f} s'.format(str(algo.__name__), (end_time - start_time)),
            fontsize=20
           )

fig.tight_layout()

As before, take a minute and look through these outputs. The format is similar: the algorithm name and the run time for the fit. So, what are our observations?

This time, let's be guided by the HDBScan docs which do a really nice job of summarizing some interpretable comparison criteria for these algorithms. The "Rules" are below.

"Rules for EDA clustering"

this is a paraphrased version of the full doc

"don't be wrong" ("dbw")
- ...conservative in it’s clustering; it should be willing to not assign points to clusters; it should not group points together unless they really are in a cluster
intuitive parameters
- ...parameters need to be intuitive enough that you can hopefully set them without having to know a lot about your data
stable clusters
- ...run the algorithm twice with a different random initialization, you should expect to get roughly the same clusters back
- ...taking a different random sample shouldn’t radically change the resulting cluster structure (unless your sampling has problems)
- ...when you vary the clustering algorithm parameters you want the clustering to change in a somewhat stable predictable fashion
performance
- ...you need a clustering algorithm that can scale to large data sizes

Algorithms evaluated w.r.t. "Rules"

Both the sklearn docs and the HDBScan docs have commentary on the assumptions that go into these algorithms, and where the pros and cons are. To learn more about the specific implementation (and assumptions) about each algorithm e.g. KMeans the docs are incredibly detailed and worth a read.

But for this session, we don't have time to become an expert in all of them. Instead, we'll aim to combine and simplify the comments from the two sources. Below is an informal, emoji-qualified verdict about each of the algorithms.

KMeans
- dbw: partitions every point (❌ ); assumes clusters are convex/globular (❌ )
- intuitive: have to pick k, or survey and select (❌ )
- stable: data-dependent, but initialization is random (⚠️ )
- performance: very fast (✅ )

AffinityPropagation
- dbw: partitions (❌ ); convex/globular clusters (❌ )
- intuitive: trade "k" for two other, difficult-to-choose parameters (❌ )
- stable: deterministic over runs (✅ )
- performance: very slow, nearly intractable on >medium datasets (❌ )

MeanShift
- dbw: clusters instead of partitions using KDE (✅ ); globular clusters (❌ )
- intuitive: KDE bandwidth slightly more intuitive than cluster count (⚠️ )
- stable: random intialization, heavily dependent on bandwidth choice (❌ )
- performance: good in principle, bad in sklearn implementation (❌ )

SpectralClustering
- dbw: partitions (❌ ); doesn't assume convex/globular clusters (✅ )
- intuitive: have to pick k (space xform + kmeans) (❌ )
- stable: slightly better than kmeans (⚠️ )
- performance: space transform (manifold learning) before kmeans (❌ )

AgglomerativeClustering
- dbw: partitions (❌ ); doesn't assume convex/globular clusters (✅ )
- intuitive: have to pick k, or survey and select (❌ )
- stable: consistent over runs (✅ )
- performance: fast evaluation (✅ )

DBSCAN
- dbw: density-based, doesn't partition / allows "noise" (✅ ); doesn't assume convex/globular clusters (✅ )
- intuitive: distance metric (⚠️ ), minimum density threshold (✅ )
- stable: stable across runs, but not hyperparameters (❌ )
- performance: varying densities are challenging (⚠️ ); fast evaluation (✅ )

HDBSCAN
- dbw: density-based, doesn't partition / allows "noise" (✅ ); doesn't assume convex/globular clusters (✅ )
- intuitive: replace one unintuitive param for "minimum cluster population" and another questionable one (⚠️ )
- stable: over sparse subsamples (✅ ) and hyperparam choices (✅ )
- performance: (✅ )

HDBSCAN has a couple of excellent points for it, but I think the ones that I find most compelling are:

no partitioning (allows noise)
density-based (no globular assumptions)
shifts a parameter selection from abstract notion (k) to a problem-specific one (how many points do you care about?)

Let's take a closer look.

HDBSCAN

The algo does have some kwargs, but only two seem to be important. The main one is "what's the minimum cluster population that you would care about?"



In [ ]:

    
# modify this main, default param to discover larger clusters
hdbs = hdbscan.HDBSCAN(min_cluster_size=20)



In [ ]:

    
# sklearn-ish api
labels = hdbs.fit_predict(data)

# the labels are arbitrary integers, -1 is "noise"
labels



In [ ]:

    
# set up colors for plotting
color_palette = sns.color_palette('deep', len(np.unique(labels)))

# color assigned points, leave noise points as gray
cluster_colors = [color_palette[x] if x >= 0 
                  else (0.5, 0.5, 0.5) 
                  for x in labels]

# draw the figure
plt.figure(figsize=(15,10))
plt.scatter(*data.T, s=50, c=cluster_colors, alpha=0.3)

How it works

This would probably be a good stand-alone RST. Here's a pretty detailed answer, and here's an even more detailed answer. Here's a very brief summary based on my read:

Transform the space according to the density/sparsity.
- The core of the clustering algorithm is single linkage clustering, and it can be quite sensitive to noise.
- roughly: focus on the dense cluster by exaggerating the spacing around lower-density points, calculate a particular point-wise distance metric

Build the minimum spanning tree of the distance weighted graph.
- we build the tree one edge at a time, always adding the lowest weight edge that connects the current tree to a vertex not yet in the tree.
- roughly: construct a graph of all points with the minimum number of edges (weights are the above distance metric)

Construct a cluster hierarchy of connected components.
- sort the edges of the tree by distance (in increasing order) and then iterate through, creating a new merged cluster for each edge.
- roughly: calculate hierarchy (combination) of points into clusters, starting with smallest edges

Condense the cluster hierarchy based on minimum cluster size.
- walk through the hierarchy and at each split ask if one of the new clusters created by the split has fewer points than the minimum cluster size.
- roughly: test the effect of splitting clusters (compare to chosen min size)

Extract the stable clusters from the condensed tree.
- roughly: start with all points as individual clusters, calculate stability as function of lambda (inverse distance)

Let's say you understood all of the points above and you wanted to do some deeper inspection into the algorithm calculations. Fortunately, this implementation provides you with a couple tools for that!

linkage trees (#2 in above "Summary")
- methods for plotting, exporting data
condensed trees (#4/5 in above "Summary")

Since I don't deeply understand the algorithm at this point, I'm not going to dwell on these. But to illustrate some of the perspective they convey, take the "condensed tree" below.



In [ ]:

    
plt.figure(figsize=(15,10))

hdbs.condensed_tree_.plot(
    # highlight the chosen clusters (by color)
    select_clusters=True, selection_palette=color_palette
)

Lambda is an inverse distance metric threshold - it represents that pairwise distance matrix referred to in the steps above. As lambda goes from 0 (pairwise distance ~infinity) to larger numbers (small pairwise distance), the splitting lines represent how "clusters" are determined to split off.

Uncomment the select_clusters kwarg and you can see how this chart maps onto the actual colored clusters plotted in the data space. Some observations:

dark blue = cluster is very unique, stays separate the whole time
red and purple separate late in the process
purple has lots of interesting structure that didn't "make it" into a cluster

If you wanted to do some additional analysis on the interim steps, there are also some helper methods for getting data conveniently into other formats.



In [ ]:

    
# also: to_networkx(), to_numpy()
df = hdbs.condensed_tree_.to_pandas()

df.head()

Choosing parameters

One of the values here was not having to "choose k." Instead, the primary parameter is min_cluster_size. The docs summarize this as:

set it to the smallest size grouping that you wish to consider a cluster.

which I think is a nicer framing of the parameter selection problem than choosing k in kmeans!

There is one other parameter that has a strong effect. The min_samples parameter is

a measure of how conservative you want you clustering to be.

Roughly, larger min_samples is more conservative clustering. The default value is min_samples = min_cluster_size (maximally conservative), and you can adjust it down to taste.

Let's look at the effect.

mcs = 50 ms = 50 clusterer = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms ).fit(data) color_palette = sns.color_palette('deep', 12) cluster_colors = [color_palette[x] if x >= 0 else (0.5, 0.5, 0.5) for x in clusterer.labels_] # saturate by probabilities #cluster_colors = [sns.desaturate(x, p) for x, p in zip(cluster_colors, clusterer.probabilities_)] plt.figure(figsize=(15,10)) plt.scatter(*data.T, s=50, linewidth=0, c=cluster_colors, alpha=0.25) plt.title('min_cluster_size: {}; min_samples: {}'.format(mcs, ms))



In [ ]:

    
# smaller = more "aggressive" clustering (i.e. "label fewer points as noise")
min_sample_vals = [1,10,20,50]

fig = plt.figure(figsize=(20,15))
plot_kwds = {'alpha' : 0.25, 's' : 60, 'linewidths':0}

for i, minsamp in enumerate(min_sample_vals, start=1):
    clusterer = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=minsamp).fit(data)
    color_palette = sns.color_palette('deep', len(clusterer.labels_))    
    cluster_colors = [color_palette[x] if x >= 0
                      else (0.5, 0.5, 0.5)
                      for x in clusterer.labels_]
    # saturate by probabilities
    #cluster_colors = [sns.desaturate(x, p) for x, p in zip(cluster_colors, clusterer.probabilities_)]

    ax = fig.add_subplot(2,2,i)    
    ax.scatter(data.T[0], data.T[1], c=cluster_colors, **plot_kwds)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)    
    ax.text(-0.5, 0.67, 'min_samples: {}'.format(minsamp), fontsize=20)

fig.tight_layout()

Assigning (predicting) new points

What if we wanted to train and pin a model, and then use it to predict cluster assignment for some new points?

Recall that training the HDBSCAN model is dependent on the underlying density, so it only makes sense to do this in a limited quantity before retraining based on all the currently-available data (that is, recalculate the trees and clusters).

There are some extra calculations (and caching) done to speed this process up, so you can either trigger them with a kwarg at instantiation, or call one of the model's functions to set it up.



In [ ]:

    
hdbs = hdbscan.HDBSCAN(min_cluster_size=20, prediction_data=True).fit(data)

# or on an existing, trained model
#hdbs.generate_prediction_data()



In [ ]:

    
# generate some uniformly random data
test_points = np.random.random(size=(50, 2)) - 0.5

color_palette = sns.color_palette('deep', 12)
cluster_colors = [sns.desaturate(color_palette[col], sat) 
                      for col, sat in zip(hdbs.labels_,hdbs.probabilities_)]
# nb: we haven't talked much about them but each point is also assigned a probability score 
#  which is ~ "how sure are we that this point belongs in this cluster"

fig = plt.figure(figsize=(15,10))

plt.scatter(data.T[0], data.T[1], c=cluster_colors, **plot_kwds);
plt.scatter(*test_points.T, c='k', s=60, label='new points')
plt.legend();



In [ ]:

    
# the approximate_predict() method is module-level, and takes the model object as an arg
test_labels, strengths = hdbscan.approximate_predict(hdbs, test_points)

# show the predicted assignments
test_labels



In [ ]:

    
# color points that are assigned clusters, leave noise points black
test_colors = [color_palette[col] if col >= 0 else (0.1, 0.1, 0.1) for col in test_labels]

fig = plt.figure(figsize=(15,10))
# training data
plt.scatter(data.T[0], data.T[1], c=cluster_colors, **plot_kwds);
# new data points
plt.scatter(*test_points.T, c=test_colors, s=60, linewidths=1, edgecolors='k')

Real data

womp

Sadly, we're out of time for now. But, a good follow-up RST to the content above would be the following:

collect some real data, relevant to a particular problem
apply some amount of sensible text preprocessing (stopwords, normalization, etc.) and vectorization
use that data to choose and train both a KMeans model and an HDBSCAN model
come up with some good ways of comparing the summary outputs from each model
see if one is obviously better
celebrate!

Perhaps my next RST...



In [ ]: