2017-09-08, Josh Montague
We often have some curated data of interest for which we'd like a low-dimensional representation. In the case of Twitter data, the corpus in question is usually based on filtering of the text content of the Tweet itself or the description of the user (profile bio). Since Twitter is made of people, we're often in a position where we want to discover and describe a small (low-dimensional) number of "groups" or "communities" of Tweets or users within a corpus.
Unsupervised learning (clustering) is the tool commonly used for this. There are 10 clustering algorithms built in to the sklearn API as of the time of this writing, and others that exist outside of that one specific library. The take-away from this session is a proposal to move from
"KMEANS ALL THE THINGS"
to
"HDBSCAN ALL THE THINGS (while also taking the time to think a bit more about your algorithm assumptions)".
It really just rolls right off the tongue.
Most of my clustering experience has been with the KMeans algorithm, and so I can't speak from a position of much experience on all the rest. However, I do know that KMeans is often used in clustering tasks - often when it probably shouldn't be - because it is incredibly fast. I've always been uncomfortable with applying this hammer to all tasks, so the goal of this session is to highlight some alternatives and where they differ.
Other people who care more deeply about the comparison of these algorithms have written about them, and so here I'll link to (and copy from) some of those references.
First, we'll look at a couple of examples that already exist in the wilds of the internet with some generated data. Then, we'll dig into one particular clustering algorithm that looks promising. Finally, we'll look at that algorithm applied to some of our own data.
To make this visualizable, we're going to work mostly in 2 dimensions.
a.k.a. "great opportunities for future RST sessions"
Ok, let's go!
In [ ]:
import time
import hdbscan
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(context='poster', style='darkgrid')
sns.set_color_codes()
from sklearn import cluster
from sklearn import datasets
sklearn
Let's start by pointing our eyeballs at the picture that pops out from running the script below. This is a lightly modified version of the sklearn
docs clustering demo, and highlights the results of crossing a set of algorithms (columns) with a set of generated data sets (rows).
In most cases, there are parameters that are necessary inputs and for these fits. These are magic numbers pulled from experience and thin air. For the most part, the parameters are even chosen to be "correct" for the data set e.g. k=2 for the annulus and crescent data sets, to illustrate a sort of "best case" scenario. In the lower righthand corner of each plot is the runtime of the fit step.
The code is pretty involved but not particularly informative, so let's just look at the output graphic!
In [ ]:
# this takes ~a minute to run
%run sklearn-cluster-demo.py
Quick take-aways:
There are some interesting patterns in there. The sklearn user's guide (and associated links to algorithm details) has a nice table that briefly discusses the assumptions and use cases for the algorithms. I don't want to get into the derivations, though, so let's look at anothere data set and see what else we can surmise from empirical comparisons.
HDBSCAN
The DBSCAN algorithm is included in the current release of sklearn, but the algorithm's authors subsequently published an improved version to that algorithm, called HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) which addresses a handful of the shortcomings of their original DBSCAN.
The paper's abstract:
We propose a theoretically and practically improved density-based, hierarchical clustering method, providing a clustering hierarchy from which a simplified tree of significant clusters can be constructed. For obtaining a “flat” partition consisting of only the most significant clusters (possibly corresponding to different density thresholds), we propose a novel cluster stability measure, formalize the problem of maximizing the overall stability of selected clusters, and formulate an algorithm that computes an optimal solution to this problem. We demonstrate that our approach outperforms the current, state-of-the-art, density-based clustering methods on a wide variety of real world data.
The HDBSCAN docs have a great writeup (and notebook) that compares available Python clustering algorithms. This section is a modified version of that writeup.
First, go get the data file...
In [ ]:
! wget https://github.com/scikit-learn-contrib/hdbscan/blob/master/notebooks/clusterable_data.npy?raw=true \
-O clusterable_data.npy
In [ ]:
data = np.load('clusterable_data.npy')
In [ ]:
plt.figure(figsize=(15,10))
# nb: because of the format of the .npy files, you'll see lots of transposing of data (data.T)
plt.scatter(data.T[0], data.T[1], c='b', alpha=0.25, s=80)
frame = plt.gca()
frame.axes.get_xaxis().set_visible(False)
frame.axes.get_yaxis().set_visible(False)
Some important observations on this data:
Let's line up some algorithms to compare. To make it simpler, this is a subset of the sklearn demo ones, plus HDBSCAN (which is not in sklearn). We're giving the algorithms some parameters (because they're required), and we'll come back to how we feel about them later.
Below, we'll fit each of the algorithms to the dataset we've just looked at, and label the results according to the trained model.
In [ ]:
# tuples are:
# (algorithm, algo args, algo kwargs)
algo_combos = [
(cluster.KMeans, (), {'n_clusters':6}),
(cluster.AffinityPropagation, (), {'preference':-5.0, 'damping':0.95}),
(cluster.MeanShift, (0.175,), {'cluster_all':False}),
(cluster.SpectralClustering, (), {'n_clusters':6}),
(cluster.AgglomerativeClustering, (), {'n_clusters':6, 'linkage':'ward'}),
(cluster.DBSCAN, (), {'eps':0.025}),
(hdbscan.HDBSCAN, (), {'min_cluster_size':15})
]
In [ ]:
fig = plt.figure(figsize=(20,20))
plot_kwds = {'alpha' : 0.25, 's' : 60, 'linewidths':0}
# loop over algos and fit each one
for i, (algo, args, algo_kwargs) in enumerate(algo_combos, start=1):
# catch the runtime for model fitting
start_time = time.time()
labels = algo(*args, **algo_kwargs).fit_predict(data)
end_time = time.time()
# plot the results
palette = sns.color_palette('deep', np.unique(labels).max() + 1)
colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in labels]
ax = fig.add_subplot(4,2,i)
ax.scatter(data.T[0], data.T[1], c=colors, **plot_kwds)
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
ax.text(-0.5,
0.67,
'{}: {:.2f} s'.format(str(algo.__name__), (end_time - start_time)),
fontsize=20
)
fig.tight_layout()
As before, take a minute and look through these outputs. The format is similar: the algorithm name and the run time for the fit. So, what are our observations?
This time, let's be guided by the HDBScan docs which do a really nice job of summarizing some interpretable comparison criteria for these algorithms. The "Rules" are below.
this is a paraphrased version of the full doc
Both the sklearn docs and the HDBScan docs have commentary on the assumptions that go into these algorithms, and where the pros and cons are. To learn more about the specific implementation (and assumptions) about each algorithm e.g. KMeans the docs are incredibly detailed and worth a read.
But for this session, we don't have time to become an expert in all of them. Instead, we'll aim to combine and simplify the comments from the two sources. Below is an informal, emoji-qualified verdict about each of the algorithms.
HDBSCAN has a couple of excellent points for it, but I think the ones that I find most compelling are:
Let's take a closer look.
The algo does have some kwargs, but only two seem to be important. The main one is "what's the minimum cluster population that you would care about?"
In [ ]:
# modify this main, default param to discover larger clusters
hdbs = hdbscan.HDBSCAN(min_cluster_size=20)
In [ ]:
# sklearn-ish api
labels = hdbs.fit_predict(data)
# the labels are arbitrary integers, -1 is "noise"
labels
In [ ]:
# set up colors for plotting
color_palette = sns.color_palette('deep', len(np.unique(labels)))
# color assigned points, leave noise points as gray
cluster_colors = [color_palette[x] if x >= 0
else (0.5, 0.5, 0.5)
for x in labels]
# draw the figure
plt.figure(figsize=(15,10))
plt.scatter(*data.T, s=50, c=cluster_colors, alpha=0.3)
This would probably be a good stand-alone RST. Here's a pretty detailed answer, and here's an even more detailed answer. Here's a very brief summary based on my read:
Let's say you understood all of the points above and you wanted to do some deeper inspection into the algorithm calculations. Fortunately, this implementation provides you with a couple tools for that!
Since I don't deeply understand the algorithm at this point, I'm not going to dwell on these. But to illustrate some of the perspective they convey, take the "condensed tree" below.
In [ ]:
plt.figure(figsize=(15,10))
hdbs.condensed_tree_.plot(
# highlight the chosen clusters (by color)
select_clusters=True, selection_palette=color_palette
)
Lambda is an inverse distance metric threshold - it represents that pairwise distance matrix referred to in the steps above. As lambda goes from 0 (pairwise distance ~infinity) to larger numbers (small pairwise distance), the splitting lines represent how "clusters" are determined to split off.
Uncomment the select_clusters
kwarg and you can see how this chart maps onto the actual colored clusters plotted in the data space. Some observations:
If you wanted to do some additional analysis on the interim steps, there are also some helper methods for getting data conveniently into other formats.
In [ ]:
# also: to_networkx(), to_numpy()
df = hdbs.condensed_tree_.to_pandas()
df.head()
One of the values here was not having to "choose k." Instead, the primary parameter is min_cluster_size
. The docs summarize this as:
set it to the smallest size grouping that you wish to consider a cluster.
which I think is a nicer framing of the parameter selection problem than choosing k in kmeans!
There is one other parameter that has a strong effect. The min_samples
parameter is
a measure of how conservative you want you clustering to be.
Roughly, larger min_samples
is more conservative clustering. The default value is min_samples = min_cluster_size
(maximally conservative), and you can adjust it down to taste.
Let's look at the effect.
In [ ]:
# smaller = more "aggressive" clustering (i.e. "label fewer points as noise")
min_sample_vals = [1,10,20,50]
fig = plt.figure(figsize=(20,15))
plot_kwds = {'alpha' : 0.25, 's' : 60, 'linewidths':0}
for i, minsamp in enumerate(min_sample_vals, start=1):
clusterer = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=minsamp).fit(data)
color_palette = sns.color_palette('deep', len(clusterer.labels_))
cluster_colors = [color_palette[x] if x >= 0
else (0.5, 0.5, 0.5)
for x in clusterer.labels_]
# saturate by probabilities
#cluster_colors = [sns.desaturate(x, p) for x, p in zip(cluster_colors, clusterer.probabilities_)]
ax = fig.add_subplot(2,2,i)
ax.scatter(data.T[0], data.T[1], c=cluster_colors, **plot_kwds)
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
ax.text(-0.5, 0.67, 'min_samples: {}'.format(minsamp), fontsize=20)
fig.tight_layout()
What if we wanted to train and pin a model, and then use it to predict cluster assignment for some new points?
Recall that training the HDBSCAN model is dependent on the underlying density, so it only makes sense to do this in a limited quantity before retraining based on all the currently-available data (that is, recalculate the trees and clusters).
There are some extra calculations (and caching) done to speed this process up, so you can either trigger them with a kwarg at instantiation, or call one of the model's functions to set it up.
In [ ]:
hdbs = hdbscan.HDBSCAN(min_cluster_size=20, prediction_data=True).fit(data)
# or on an existing, trained model
#hdbs.generate_prediction_data()
In [ ]:
# generate some uniformly random data
test_points = np.random.random(size=(50, 2)) - 0.5
color_palette = sns.color_palette('deep', 12)
cluster_colors = [sns.desaturate(color_palette[col], sat)
for col, sat in zip(hdbs.labels_,hdbs.probabilities_)]
# nb: we haven't talked much about them but each point is also assigned a probability score
# which is ~ "how sure are we that this point belongs in this cluster"
fig = plt.figure(figsize=(15,10))
plt.scatter(data.T[0], data.T[1], c=cluster_colors, **plot_kwds);
plt.scatter(*test_points.T, c='k', s=60, label='new points')
plt.legend();
In [ ]:
# the approximate_predict() method is module-level, and takes the model object as an arg
test_labels, strengths = hdbscan.approximate_predict(hdbs, test_points)
# show the predicted assignments
test_labels
In [ ]:
# color points that are assigned clusters, leave noise points black
test_colors = [color_palette[col] if col >= 0 else (0.1, 0.1, 0.1) for col in test_labels]
fig = plt.figure(figsize=(15,10))
# training data
plt.scatter(data.T[0], data.T[1], c=cluster_colors, **plot_kwds);
# new data points
plt.scatter(*test_points.T, c=test_colors, s=60, linewidths=1, edgecolors='k')
womp
Sadly, we're out of time for now. But, a good follow-up RST to the content above would be the following:
Perhaps my next RST...
In [ ]: