k (or: How many clusters are in your data?)
2015-07, Josh Montague
MIT License
There are often times when we'd like a representation of a dataset which reduces the dimensionality of the set of measurements. Having initially made $N$ measurements (observations) of $m$ features, the dimensionality of our problem space is $N \times m$. We can reduce the dimensionality by making either $N$ or $m$ smaller.
Techniques like principal component analysis and singular value decomposition reduce the dimensionality by making $m$ smaller; these algorithms identify the combinations of features (components) that retain most of the variance in the data, and discard the rest. That is, these techniques represent the original data using all of the original measurements ($N$), but a smaller number of features ($m$).
Similarly, techniques like k-means clustering reduce the problem dimensionality by effectively making $N$ smaller; these algorithms collapse many "nearby" measurements (in the appropriate feature space) into a single representation, e.g. a single point (like a polygon centroid) that stands in for a bunch of other points. That is, these techniques represent the original data using a smaller number of measurements and all of the original features; it's a form of compression.
And then, of course, you can also combine the two. But here, we're interested in the latter - reducing the data dimensionality by representing the data through clusters using the kmeans algorithm.
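A quick, toy illustration of the two kinds of reduction (just a sketch; the array shapes and parameter choices here are arbitrary and not part of this session's data):

In [ ]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# toy data: N measurements of m features
N, m = 1000, 10
data = np.random.rand(N, m)

# PCA keeps all N rows, but projects them onto fewer (combined) features: (N, m) -> (N, 2)
print(PCA(n_components=2).fit_transform(data).shape)

# k-means keeps all m features, but represents the rows with k centroids: (N, m) -> (k, m)
print(KMeans(n_clusters=5).fit(data).cluster_centers_.shape)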
K-means is a commonly used technique for this kind of compression; in machine learning, it's used for the unsupervised learning of structure within a set of observations. For details on the implementation of the kmeans algorithm, see both the intro to kmeans in this repository and the documentation for the scikit-learn implementation.
In general, the parameter space in which we're trying to cluster our data can be arbitrarily large. In text analysis, a tfidf matrix on short documents can easily be 10,000-dimensional. Here, for the sake of simplicity (and plotting), we'll just use two orthogonal spatial dimensions.
Let's get started by setting up and making some data in clusters (that is, we know the ground truth about cluster membership). Then, we'll look at it to get a sense for how these groupings are related to one another.
In [ ]:
import math
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import distance
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
%matplotlib inline
In [ ]:
# generate some data
clusters = 3
samples = 100
X, y = make_blobs(n_samples=samples,
                  centers=clusters,
                  n_features=2,
                  cluster_std=1.0,
                  # fix random_state if you believe in determinism
                  random_state=42
                  )
In [ ]:
# set the plot style
sns.set(style='darkgrid', palette='deep')
# 1x3 subplots
f, (ax1, ax2, ax3) = plt.subplots(1,3, sharey=True, sharex=True, figsize=(14,4))
# plot all the data (left)
ax1.plot(X[:,0], X[:,1], 'ok', alpha=0.5, label='all data')
ax1.set_title('raw data')
# show the clusters labeled as generated (middle)
for i in range(clusters):
    ax2.plot(X[y==i,0], X[y==i,1], 'o', alpha=0.5, label='cluster {}'.format(i))
ax2.set_title('labeled clusters (ground truth)')
ax2.legend(loc='best')
# fit a default-settings KMeans model -- for now, assume we know k
km = KMeans(n_clusters=clusters, n_jobs=-1).fit(X)
# grab the model attributes once fit - we'll use these
labels = km.labels_
centers = km.cluster_centers_
# show the clusters as fit by the KMeans model (right)
for i in range(clusters):
    ax3.plot(X[labels==i,0], X[labels==i,1], 'o', alpha=0.5, label='cluster {}'.format(i))

# overlay the centroid points
ax3.plot(centers[:,0], centers[:,1],
         '^',
         markerfacecolor='k',
         markersize=10,
         label='cluster centroid'
         )
ax3.set_title('labeled clusters (kmeans model)')
ax3.legend(loc='best')
Ok, so what do we have in the plots above?
The raw data (left), the clusters labeled as generated - the ground truth (middle), and the clusters as labeled by a KMeans model fit to the data (right). For the model, we used most of the default settings, and we specifically told it that $k=3$.

Of course we told it $k=3$ - we can just look at the data and see three clusters. Ok, yes, that's true. But let's think about why we "know" the answer in this case.
My hypothesis (at least, how I think about it) is that in 2-D (and possibly 3-D), my intuition for spotting clusters works a lot like the separating planes we discussed in the soft-margin, linear support vector machine session (that RST session lives in this repository, as well): by maximizing the gap between a separating plane and the data, we identify clusters that "feel right."
That's fine, but what if the data is more complicated? Further, what if we can't even look at the data because it exists in 100-dimensional space?
In [ ]:
# even in 3D this gets tough
#
# use ctrl-return to run this a few times - it's fun to see the variations
%run 3d-example.py --samples 500 --clusters 7
After a few runs of the figure above, we see that sometimes we can clearly make out the clusters. While we might be able to manually label these at times, the truth is we don't want to have to look at the data and use our intuition - even if it is "experienced." We want to use one or more quantitative metrics to inform a decision about a "good" choice for $k$. This will certainly scale better than our judgement.
*more accurately: "what techniques have other people come up with to determine methods for calculating $k$?"
This is not a new problem, though there appear to still be new techniques to solve it. Here are the ones we'll go over in this session:

- a rule of thumb
- the "elbow method" (using the KMeans inertia, or the percentage of variance explained)
- the gap statistic
There are more, and I've added some references at the end for additional places you could go from here.
The first of these - the rule of thumb - is kind of awful. I'm only including it because it's always good to be skeptical of easy answers. According to Wikipedia, this comes from: Kanti Mardia et al. (1979). Multivariate Analysis. Academic Press.
I didn't look up the original source, but the formula given is:
$k \approx \sqrt{\frac{N}{2}} $
If we return to the data generation choices from above...
In [ ]:
print("""With the selection of {} samples (which we know belong to {} clusters),
the rule of thumb would suggest there are:
{} clusters
¯\_(ツ)_/¯
""".format(samples, clusters, int(math.sqrt(samples / 2))))
Ok, so that's a simple heuristic, but it's not well motivated, and not very satisfying. Don't use that one.
The "elbow method" is an approach that makes use of some single-valued metric that is calculated once the model has converged. A plot of this metric as a function of $k$ ideally shows an "elbow", which marks the point of diminishing returns in the tradeoff between model complexity and added explanatory value. For a quick result, this is probably the most intuitive approach.
So, what kind of metric do we use?
KMeans().inertia_
One such metric is the single-valued inertia_ attribute of a fit KMeans instance. Once the KMeans model has converged, we can reach into the KMeans object and grab this value. It is calculated by summing (over each cluster $k$) the squared distance* of each sample $x_i$ from its assigned cluster centroid $c_k$:

$\text{inertia} = \sum_{k} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2,$

where $C_k$ is the set of samples assigned to cluster $k$.
In this way, the inertia is a metric of intra-cluster variance (which you would then work to optimally minimize).
There are a couple of things to keep in mind: it doesn't include any notion of cluster separation, and in the extreme case of $k=N$ clusters, the inertia goes to zero. So we can already see that we can't just optimize for an extremum in this metric.
This metric (or a variation on it) also seems to often be referred to as the SSQ (sum of squared deviations).
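As a quick sanity check, we can recompute this quantity by hand and compare it to the model attribute (a sketch, using the X data and the km model we fit above):

In [ ]:
# squared distance from each point to its nearest centroid, summed over all points
d = distance.cdist(X, km.cluster_centers_, 'euclidean')
manual_inertia = np.sum(np.min(d, axis=1)**2)

# these two numbers should agree (up to floating-point error)
print(manual_inertia, km.inertia_)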
Below, we'll define a couple of functions to look at this quantity. The first function will take a set of data and range of $k$s to fit over, and return an array of the resulting inertia metrics. For now, only worry about the else: clause in the for loop.
The second function will provide us a plot of this data so we can look for the elbow.
*the scikit docs say "sum of distances", but if you source dive far enough, I think you find it is actually the sum of squared distances.
In [ ]:
# Modified from (c) 2014 Reid Johnson (BSD License)
# http://www3.nd.edu/~rjohns15/cse40647.sp14/www/home.php

def calculate_ssq(data, ks=range(1,11), variance_norm=True):
    """
    Calculate the sum of squared deviations (SSQ) for a range of k with a given set of data.

    Args:
        data ((n,m) NumPy array): The dataset on which to compute the SSQs.
        ks (list, optional): The list of values k for which to compute the SSQs.
            Defaults to range(1,11), which creates a list of values from 1 to 10.
        variance_norm (boolean, optional): normalize the sum of squared deviations to
            a percentage of variance explained. Defaults to True (normalized).

    Returns:
        ssqs: an array of SSQs, one computed for each k.
    """
    # array for SSQs
    ssqs = np.zeros(len(ks))
    # iterate over the range of k values
    for (i,k) in enumerate(ks):
        # fit a KMeans model on the data
        km = KMeans(n_clusters=k, n_init=2, n_jobs=-1).fit(data)
        if variance_norm:
            # normalize to percentage of variance
            #
            # get the distance from each point to the closest centroid
            dist = np.min(
                # calc the euclidean dist b/w each point and all centroids
                distance.cdist(data, km.cluster_centers_, 'euclidean')
                , axis=1)
            # total within-cluster sum of squared distances
            tot_withinss = sum(dist**2)
            # the total sum of squares
            totss = sum(
                # pairwise euclidean distances b/w all points
                distance.pdist(data, metric='euclidean')**2
            ) / data.shape[0]
            # the between-cluster sum of squares
            betweenss = totss - tot_withinss
            ssqs[i] = betweenss / totss * 100
        else:
            # use the built-in inertia metric
            # (the sum of squared distances between data and assigned centroids)
            ssqs[i] = km.inertia_
    return ssqs
def plot_ssq_statistics(ssqs, variance_norm=True):
    """
    Generates and shows a plot of the sum of squares (SSQ).

    A figure with one plot is generated: the SSQ computed for each value of k.

    Args:
        ssqs (NumPy array): An array of SSQs, one computed for each k.
        variance_norm (boolean, optional): a hack to get the y-axis label
            to reflect the normalization. Defaults to True (normalized).
    """
    # size this one a bit more narrow just for aesthetics below
    fig = plt.figure(figsize=(7.5,4))

    ind = range(1, len(ssqs)+1)  # the x values for the ssqs

    plt.plot(ind, ssqs)
    plt.ylim(0, max(ssqs)*1.2)
    plt.xlim(0, len(ssqs)+1.0)
    #plt.title('Clustering SSQ', fontsize=14)
    plt.xlabel('Number of clusters k', fontsize=14)
    # hack to get the right labels depending on how we calculated the SSQ
    if variance_norm:
        plt.ylabel('Percentage of variance explained', fontsize=14)
    else:
        plt.ylabel('SSQ (KMeans inertia)', fontsize=14)
    plt.xticks(ind)

    # switch back to original theme
    sns.set(style='darkgrid', palette='deep')
Let's choose some example parameters and use these functions on the data.
In [ ]:
# set some parameters
clusters = 4
samples = 100
# generate new data
X, y = make_blobs(n_samples=samples,
                  centers=clusters,
                  n_features=2,
                  cluster_std=1.0,
                  # fix random_state if you believe in determinism
                  #random_state=42
                  )

# show the clusters as created
f, ax = plt.subplots(1,1, figsize=(8,4))

for i in range(clusters):
    ax.plot(X[y==i,0], X[y==i,1], 'o', label='cluster {}'.format(i))
ax.set_title('labeled clusters (ground truth)')
ax.legend(loc='best')
# calculate ssq for each value of k
ssqs = calculate_ssq(X, ks=range(1,11), variance_norm=False)
# plot them
plot_ssq_statistics(ssqs, variance_norm=False)
Depending on how your make_blobs() came out, you might want to use control-enter a few times to get an interesting set of points.
Note that, based on the formula, as the number of samples (data points) goes up, so does the inertia metric. This can make it hard to compare e.g. different sampling choices from a large dataset. There are at least two obvious ways to make measurements of inertia easier to compare across data sets: we can normalize the total inertia by the number of data points, or we can normalize the same distances used in that calculation to represent a percentage of variance explained.
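The first option, per-point normalization, is a one-liner (a sketch using the ssqs array and samples count from the cells above):

In [ ]:
# divide the total inertia by the number of points so that different-sized
# data sets can be compared on a common scale
ssqs_per_point = ssqs / float(samples)
print(ssqs_per_point)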
Percentage of variance explained
This approach uses the locations of the fit centroids and the input data (not, explicitly, the inertia value). The value is calculated as the ratio of the between-group sum of squares to the total sum of squares in the data (as I understand it, this is closely related to the F-test, which compares between-group to within-group variance).
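In symbols, matching the quantities tot_withinss, totss, and betweenss computed in calculate_ssq above:

$\text{percentage of variance explained} = \frac{SS_{\text{between}}}{SS_{\text{total}}} \times 100 = \frac{SS_{\text{total}} - SS_{\text{within}}}{SS_{\text{total}}} \times 100$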
In our function defined above, we can switch to this percentage-of-variance-explained normalization with the variance_norm=True kwarg. This is the code in the if: block of the for loop.
Here's the same data shown above, but using this version of the metric.
In [ ]:
# calculate ssq for each value of k
ssqs = calculate_ssq(X, ks=range(1,11), variance_norm=True)
# plot them
plot_ssq_statistics(ssqs, variance_norm=True)
Now we have a representation of all our original data in terms of how the KMeans algorithm converged across a portion of the $k$ parameter space. It's certainly better than eye-balling the $N$ data points in order to select $k$. But - as you've probably realized by now - we still have to apply our own judgement at this point to identify the "elbow". How much explained variance is "enough"?
There seems to be a bit of literature on how to use this output as input to another algorithm, but ultimately someone somewhere has to draw a line in the sand and declare a parameter. Say, "80% of variance explained," or "a threshold of at least 10% additional variance explained per step."
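For example, one such line in the sand might look like this (just a sketch; the 10-percentage-point cutoff is an arbitrary choice, and ssqs here is the variance-normalized array computed above):

In [ ]:
# choose the smallest k after which adding another cluster explains
# fewer than 10 additional percentage points of variance
ks = range(1, 11)
gains = np.diff(ssqs)   # added variance explained at each step from k to k+1
elbow_k = next(k for k, gain in zip(ks, gains) if gain < 10)
print('elbow at k = {}'.format(elbow_k))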
So, this method can work. It reduces a massive parameter space and a total lack of quantitative metrics into a small (possibly one-dimensional) space, where you can at least make an informed (and data-driven) decision about your choice of $k$. For example, since the average complexity of the kmeans algorithm is $O \left( k \cdot N \cdot T \right)$ - where $T$ is the number of model iterations - you may want to choose $k$ with the computational cost in mind.
We can also take this concept one step further with the "gap statistic".
The gap statistic is an attempt to intelligently choose something like the elbow point above. It is defined in "Estimating the number of clusters in a data set via the gap statistic" (Tibshirani, Walther, Hastie, J. R. Statist. Soc. B (2001) 63, Part 2, pp 411-423). From the paper summary:
[the technique compares] the change in within-cluster dispersion with that expected under an appropriate reference null distribution... a simulation shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
What this means is that for each value of $k$, we'll fit a KMeans model with $k$ clusters to the real data, and also fit an identical KMeans model to a number of null-distribution data sets, e.g. randomly-distributed data in the same parameter space. For each value of $k$, we calculate the within-cluster dispersion (similar to how we did this above) for both the real data and the reference distributions. The gap statistic $\text{gap}_k$ is a function of these two metrics (approximately the mean of the point-by-point differences in the logarithms; see ~line 117 in gap.py). From fitting the reference (null) distributions, we also get a distribution of this dispersion, from which we calculate a standard deviation (or standard error), $\text{err}_k$.
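In symbols, writing $W_k$ for the within-cluster dispersion of the real data at a given $k$, and $W_{kb}^*$ for the same quantity on the $b$-th of $B$ reference data sets, the paper defines

$\text{gap}_k = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb}^* - \log W_k$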
The optimal number of clusters is then the smallest $k$ for which $\text{gap}_k \ge \text{gap}_{k+1} - \text{err}_{k+1}$.
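To make the procedure concrete, here's a minimal sketch of the computation (this is not the gap_stats module we import below, just an illustration; the uniform bounding-box reference distribution and n_refs=10 are choices made for this sketch):

In [ ]:
# a rough, self-contained sketch of the gap statistic (NOT the gap_stats module
# imported below); reference sets are drawn uniformly over the data's bounding box
def gap_statistic_sketch(data, ks=range(1, 11), n_refs=10):
    """Return (gaps, errs) arrays, one entry per k in ks."""
    mins, maxs = data.min(axis=0), data.max(axis=0)
    gaps, errs = np.zeros(len(ks)), np.zeros(len(ks))
    for i, k in enumerate(ks):
        # within-cluster dispersion of the real data (the KMeans inertia)
        log_wk = np.log(KMeans(n_clusters=k).fit(data).inertia_)
        # the same quantity for each uniform reference data set
        ref_log_wks = np.zeros(n_refs)
        for b in range(n_refs):
            ref = np.random.uniform(mins, maxs, size=data.shape)
            ref_log_wks[b] = np.log(KMeans(n_clusters=k).fit(ref).inertia_)
        gaps[i] = ref_log_wks.mean() - log_wk
        # spread of the reference dispersions, corrected for simulation error
        errs[i] = ref_log_wks.std() * np.sqrt(1 + 1.0 / n_refs)
    return gaps, errs

# the smallest k for which gap_k >= gap_{k+1} - err_{k+1}
# (k = 10 has no k+1 to compare against, so we only scan k = 1..9)
gaps, errs = gap_statistic_sketch(X)
best_k = next(k for k, g, g_next, e_next in zip(range(1, 10), gaps, gaps[1:], errs[1:])
              if g >= g_next - e_next)
print('best k = {}'.format(best_k))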
Let's apply this to the same data we had before. It looked like this:
In [ ]:
# show the clusters as created
f, ax = plt.subplots(1,1, figsize=(8,4))
for i in range(clusters):
    ax.plot(X[y==i,0], X[y==i,1], 'o', label='cluster {}'.format(i))
ax.set_title('labeled clusters (ground truth)')
ax.legend(loc='best')
Now, to see this in action we define two functions - similar to the SSQ-related functions above: one to calculate the new metric, and one to plot the output.
Since the code is a bit long, it is contained in the separate module gap.py.
In [ ]:
from gap_stats import gap_statistics
from gap_stats import plot_gap_statistics
In [ ]:
# for the range of given ks, calculate the gaps, errors, and differences
# (this one can take a minute)
gaps, errs, difs = gap_statistics(X, ks=range(1,11))
In [ ]:
# then show them
plot_gap_statistics(gaps, errs, difs)
Note: If you get a result of $k=1$ and that looks - by eye - to be very incorrect, run it again with new data... more on this below.
To see the gap statistic in action on a bigger scale, we can return to the 3D example from earlier and (hopefully) observe that this is a better choice than "eyeballing it."
In [ ]:
# this one may take a minute. it has to fit (clusters * (10+1) * 2) kmeans models
# each with O( k * n * 2 ) complexity
%run 3d-example.py --samples 200 --clusters 7 --gap True
In experimenting with this formulation, one place I've seen the gap statistic come up short in its prediction of the "best $k$" is when the gap difference is slightly $\gt 0$ at $k=1$, but then negative until a large (and potentially "correct", from the ground truth) value of $k$. I haven't come across a way to work with this yet... perhaps a follow-up RST topic.
As with many algorithmic approaches to data, there seem to be many ways to tackle this problem, and more are constantly being developed. Some of the techniques that I came across while preparing this RST but didn't have time to pursue include:
... and there are likely many more, as well.