A Problem, and a Potential Solution

Suppose you have the output of a clustering algorithm with two clusters, $x_1$ and $x_2$, whose membership counts are $N_{x_1}$ and $N_{x_2}$, respectively. It would be very nice to have a single number (call it $M$) which captures the distance between the cluster centers, the confidence that each covariance estimate is accurate (based on cluster membership), and the "spread" of each cluster along the axis between the centers. In essence:

$M = \frac{\frac{N_{x_1}}{N_{x_1} + N_{x_2}}\,\operatorname{dvec}(x_1, x_2) \cdot \operatorname{cov}(x_1) + \frac{N_{x_2}}{N_{x_1} + N_{x_2}}\,\operatorname{dvec}(x_2, x_1) \cdot \operatorname{cov}(x_2)}{\operatorname{dist}(x_1, x_2)}$

To do this, the product of cov() and dvec() would need to reflect the variance along the axis that runs between the two cluster means. Two possible approaches emerge:

  1. Project the major axis of the cluster (from eigendecomposition) onto dvec()
  2. Project a vector from each sample to the other cluster center onto the dvec() between centers. This should be equivalent to approach 1; a small sketch of the projected-variance idea follows below.
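
Before picking one, it helps to see the quantity both approaches are circling around: the spread of a cluster along the unit vector joining the two centers. The minimal sketch below (all data and names here are made up for illustration, and nothing in it is reused later) computes that projected variance two equivalent ways: as the quadratic form $u^T \Sigma u$ with the sample covariance $\Sigma$, and as the variance of the samples projected onto $u$.

In [ ]:
import numpy as np

#Toy elongated cluster and a hypothetical "other" cluster center
rng = np.random.RandomState(0)
samples = rng.randn(500, 2) * np.array([4.0, 1.0])  #stretched along x
other_center = np.array([10.0, 10.0])

center = samples.mean(axis=0)
d = other_center - center    #dvec from this center to the other
u = d / np.linalg.norm(d)    #unit direction between centers

#Option A: quadratic form with the covariance matrix, u^T Sigma u
proj_var_a = u.dot(np.cov(samples, rowvar=False)).dot(u)
#Option B: project each sample onto u, then take the variance of the scalars
proj_var_b = np.var(samples.dot(u), ddof=1)
print "Projected variance (two ways):", proj_var_a, proj_var_b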

I prefer approach 1, though I think either approach would be valid. First, let's import some libraries and set up code to generate examples. We could use sklearn's dataset generators, but I want to retain fine control over the "shape" of these examples, and 2D matrix transformations are not too complicated (and cool!)


In [28]:
import numpy as np
import matplotlib.pyplot as plt

In [29]:
def make_example(n_samples, rotation_deg=None, x_stretch=1, y_stretch=1, x_shear=None,
                 y_shear=None, x_trans=0, y_trans=0, random_seed=0):
    rands = np.random.RandomState(random_seed)
    #Start from unit-variance Gaussian samples, then apply affine transforms
    v = rands.randn(n_samples, 2)
    #Stretch matrix has form [[sx, 0],[0, sy]]
    stretch = np.diag([x_stretch, y_stretch])
    v = np.dot(v, stretch)
    if rotation_deg is not None:
        #You probably don't want to mix shear and rotation...
        #Convert degrees to radians before using np.cos/np.sin
        r = np.deg2rad(rotation_deg)
        #Transposed rotation matrix, since v holds row vectors (v . R^T)
        rot = np.array([[np.cos(r), np.sin(r)],
                        [-np.sin(r), np.cos(r)]])
        v = np.dot(v, rot)
    if x_shear is not None:
        #Shear matrix has form [[1, k],[0, 1]] for x, [[1, 0],[k, 1]] for y
        #Apply x shear
        shear = np.eye(2)
        shear[0, 1] = x_shear
        v = np.dot(v, shear)
    if y_shear is not None:
        #Apply y shear
        shear = np.eye(2)
        shear[1, 0] = y_shear
        v = np.dot(v, shear)
    #Translation is added, and has form [px, py]
    trans = np.array([x_trans, y_trans])
    v += trans
    return v
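
As a quick check that the generator does what its arguments suggest (the cell below is only a sketch and nothing in it is reused), the sample covariance of a cluster built with y_stretch=4 and rotation_deg=45 should have eigenvalues near 1 and 16, with the major eigenvector sitting around 135 degrees (the y axis rotated 45 degrees counter-clockwise).

In [ ]:
#Sanity-check sketch: covariance eigenstructure should match the requested
#stretch and rotation (eigenvector sign is arbitrary, so -45 is fine too)
check = make_example(5000, rotation_deg=45, y_stretch=4)
evals, evecs = np.linalg.eigh(np.cov(check, rowvar=False))
print "Eigenvalues (expect roughly 1 and 16):", evals
major = evecs[:, np.argmax(evals)]
print "Major axis angle (degrees):", np.degrees(np.arctan2(major[1], major[0]))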

Example

Let's create two example datasets, each made of two clusters. Though both can be trivially clustered, we would really like the first image to return a higher score than the second, because the covariance projected along the separation vector is much smaller in the first image than in the second. In simpler terms, the clusters are flattened out in both images, but in the first they are flattened in a direction we don't care about, whereas in the second they are flattened in a way that could make it harder to split the clusters. The green dots represent the cluster centers computed by K-Means.


In [30]:
v1 = make_example(1000, rotation_deg=45, y_stretch=4, x_trans=20, y_trans=15)
v2 = make_example(1000, rotation_deg=45, y_stretch=4)
v3 = make_example(1000, rotation_deg=-45, y_stretch=4, x_trans=20, y_trans=15)
v4 = make_example(1000, rotation_deg=-45, y_stretch=4)
from sklearn.cluster import KMeans 
v1v2 = np.vstack((v1, v2))
v3v4 = np.vstack((v3, v4))
km = KMeans(n_clusters=2)

In [34]:
def dvec(x1, x2): return x2 - x1
def mag_dvec(x1, x2): return np.linalg.norm(dvec(x1, x2))

km.fit(v1v2)
plt.scatter(v1[:, 0], v1[:, 1], color="steelblue")
plt.scatter(v2[:, 0], v2[:, 1], color="darkred")
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], color="darkgreen")
x1 = km.cluster_centers_[1, :]
x2 = km.cluster_centers_[0, :]
print "0 -> 1:", dvec(x1, x2)
print "1 -> 0:", dvec(x2, x1)
print "Distance:", mag_dvec(x1, x2)


0 -> 1: [-20. -15.]
1 -> 0: [ 20.  15.]
Distance: 25.0

In [35]:
km.fit(v3v4)
plt.scatter(v3[:, 0], v3[:, 1], color="steelblue")
plt.scatter(v4[:, 0], v4[:, 1], color="darkred")
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], color="darkgreen")
x1 = km.cluster_centers_[1, :]
x2 = km.cluster_centers_[0, :]
print "0 -> 1:", dvec(x1, x2)
print "1 -> 0:", dvec(x2, x1)
print "Distance:", mag_dvec(x1, x2)


0 -> 1: [ 20.  15.]
1 -> 0: [-20. -15.]
Distance: 25.0
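
The center-to-center distances are identical, but the quantity the metric should be sensitive to, each cluster's variance projected onto the unit vector between the centers, is not. The sketch below computes that projection for both pairs (using the per-cluster means directly, to sidestep K-Means label ordering; the helper name proj_var is just for this illustration). The v1/v2 pair should come out much smaller than the v3/v4 pair.

In [ ]:
#Sketch: variance of each cluster along the unit vector joining the two
#cluster means (u^T Sigma u). proj_var is a throwaway helper.
def proj_var(samples, u):
    return u.dot(np.cov(samples, rowvar=False)).dot(u)

for name, (a, b) in [("v1/v2", (v1, v2)), ("v3/v4", (v3, v4))]:
    c_a, c_b = a.mean(axis=0), b.mean(axis=0)
    u = dvec(c_a, c_b) / mag_dvec(c_a, c_b)
    print name, "projected variances:", proj_var(a, u), proj_var(b, u)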

From the above plots, we can see that a metric based purely on cluster-center spread/spacing would rate the two clusterings as equivalent. However, it would be preferable for the clusters to leave as much empty space as possible along the axis between the centers, though how much is achievable also depends on the underlying data. The goal in defining this metric is to make it easier to tell when there are "too many" clusters, via a drop in the metric score, much like the AIC or BIC criteria. It should also allow comparison of different clustering algorithms, since an algorithm that achieves greater separation relative to the projected covariance is likely choosing stronger clusters. It can also be seen as an extension of the gap statistic of Tibshirani, Walther, and Hastie (2001). Let's define this metric and use it on a few simple problems.

Open question: has this (or something very similar) already been published somewhere?