Getting started with DeBaCl

1. Create some data

Our first step is to create some data using the scikit-learn make_blobs and make_circles utility. To make this a hard (but not impossible) clustering problem, we set the random state of the blob so that it's always outside the two concentric circles.


In [1]:
import numpy as np
from sklearn.datasets import make_circles, make_blobs
import matplotlib.pyplot as plt
%matplotlib inline

circles = make_circles(500, factor=0.5, noise=0.06, random_state=23)
blob = make_blobs(100, centers=1, center_box=(-1.7, 1.7), cluster_std=0.1,
                  random_state=19)

X = np.vstack((circles[0], blob[0]))
print "Dataset shape:", X.shape


Dataset shape: (600, 2)

In [2]:
with plt.style.context('ggplot'):
    fig, ax = plt.subplots(figsize=(6, 4.5))
    ax.scatter(X[:, 0], X[:, 1], c='black', s=50, alpha=0.5)
    fig.show()


2. Estimate the level set tree

The Level Set Tree (LST) can be constructed directly from our tabular dataset. The most important choice is the bandwidth parameter k, which controls the complexity of the level set tree.

Small values of k allow for complex trees with many leaves, which can good for discovering small clusters, with the caveat that these clusters may not be true features of the underlying data-generating process. Large values of k lead to very simple trees with few branches, but these trees are likely to be very stable across repeated samples from the same probability distribution.

Choosing the bandwidth parameter in a principled way remains an open area of research, and a future tutorial will illustrate some helpful heuristics for finding a good value. For now, we use the value k=20, which was chosen through trial and error.


In [3]:
import debacl as dcl
tree = dcl.construct_tree(X, k=20)

print tree


+----+-------------+-----------+------------+----------+------+--------+----------+
| id | start_level | end_level | start_mass | end_mass | size | parent | children |
+----+-------------+-----------+------------+----------+------+--------+----------+
| 0  |    0.000    |   0.160   |   0.000    |  0.182   | 500  |  None  |  [2, 3]  |
| 1  |    0.000    |   3.677   |   0.000    |  1.000   | 100  |  None  |    []    |
| 2  |    0.160    |   0.462   |   0.182    |  0.653   | 250  |   0    | [20, 21] |
| 3  |    0.160    |   0.164   |   0.182    |  0.207   | 143  |   0    |  [4, 5]  |
| 4  |    0.164    |   0.165   |   0.207    |  0.222   |  84  |   3    |  [6, 7]  |
| 5  |    0.164    |   0.176   |   0.207    |  0.287   |  44  |   3    | [14, 15] |
| 6  |    0.165    |   0.166   |   0.222    |  0.233   |  66  |   4    |  [8, 9]  |
| 7  |    0.165    |   0.229   |   0.222    |  0.420   |  14  |   4    |    []    |
| 8  |    0.166    |   0.167   |   0.233    |  0.248   |  38  |   6    | [10, 11] |
| 9  |    0.166    |   0.185   |   0.233    |  0.328   |  22  |   6    | [16, 17] |
| 10 |    0.167    |   0.169   |   0.248    |  0.252   |  27  |   8    | [12, 13] |
| 11 |    0.167    |   0.182   |   0.248    |  0.322   |  6   |   8    |    []    |
| 12 |    0.169    |   0.246   |   0.252    |  0.432   |  21  |   10   |    []    |
| 13 |    0.169    |   0.177   |   0.252    |  0.288   |  5   |   10   |    []    |
| 14 |    0.176    |   0.241   |   0.287    |  0.430   |  18  |   5    |    []    |
| 15 |    0.176    |   0.192   |   0.287    |  0.352   |  12  |   5    | [18, 19] |
| 16 |    0.185    |   0.208   |   0.328    |  0.393   |  3   |   9    |    []    |
| 17 |    0.185    |   0.232   |   0.328    |  0.423   |  6   |   9    |    []    |
| 18 |    0.192    |   0.251   |   0.352    |  0.433   |  1   |   15   |    []    |
| 19 |    0.192    |   0.196   |   0.352    |  0.368   |  1   |   15   |    []    |
| 20 |    0.462    |   0.467   |   0.653    |  0.660   |  68  |   2    | [22, 23] |
| 21 |    0.462    |   0.507   |   0.653    |  0.707   |  50  |   2    | [24, 25] |
| 22 |    0.467    |   0.618   |   0.660    |  0.818   |  23  |   20   | [34, 35] |
| 23 |    0.467    |   0.520   |   0.660    |  0.727   |  43  |   20   | [26, 27] |
| 24 |    0.507    |   0.536   |   0.707    |  0.753   |  23  |   21   | [28, 29] |
| 25 |    0.507    |   0.703   |   0.707    |  0.862   |  12  |   21   |    []    |
| 26 |    0.520    |   0.543   |   0.727    |  0.757   |  20  |   23   | [30, 31] |
| 27 |    0.520    |   0.794   |   0.727    |  0.872   |  11  |   23   |    []    |
| 28 |    0.536    |   0.680   |   0.753    |  0.855   |  8   |   24   |    []    |
| 29 |    0.536    |   0.581   |   0.753    |  0.793   |  5   |   24   | [32, 33] |
| 30 |    0.543    |   0.624   |   0.757    |  0.825   |  4   |   26   |    []    |
| 31 |    0.543    |   0.667   |   0.757    |  0.850   |  10  |   26   |    []    |
| 32 |    0.581    |   0.589   |   0.793    |  0.805   |  2   |   29   |    []    |
| 33 |    0.581    |   0.623   |   0.793    |  0.823   |  1   |   29   |    []    |
| 34 |    0.618    |   0.690   |   0.818    |  0.858   |  5   |   22   |    []    |
| 35 |    0.618    |   0.625   |   0.818    |  0.828   |  1   |   22   |    []    |
+----+-------------+-----------+------------+----------+------+--------+----------+

The summary of an LST is a table where each row corresponds to a cluster in the tree. Each cluster has an ID number, start and end density levels, start and end mass levels, a parent cluster, child clusters, and a list of the data points that belong to the cluster, represented in the table by the size of this list.

The LST is constructed by finding connected components at successively higher levels in the estimated probability density function. Think of the density function as the water level rising around a set of islands representing the data points. When there is no water, all of the land masses are connected; as the water rises, land masses split into islands and vanish when the water gets high enough.

The start density level is the density value where a cluster first appears by splitting from its parent cluster. The end density level is the density value where the cluster disappears, either by splitting into child clusters or by vanishing when all of its points have insufficiently high density.

At each of the start and end density levels, a certain fraction of the points have been removed to create the upper level sets of the density function; these fractions are the start and end mass levels, respectively.

3. Plot the tree to see the cluster hierarchy


In [4]:
plot = tree.plot()
plot[0].show()


Clusters are represented by the vertical line segments in the dendrogram. The default is to plot the dendrogram on the mass scale, so that the lower endpoint of a cluster's branch is at its starting mass level and the upper endpoint is at its end mass level. The mass from of the LST dendrogram is typically more useful than plotting on the density scale because the mass scale always starts at 0 and ends at 1, while estimated density values are often extremely large or small.

In this example, the dendrogram shows that there are two very well-separated clusters in our dataset, indicated by two distinct trunk clusters. One of the trunks has no splits at all, indicating a very simple uni-modal distribution (this is not surprising, given that one of our synthetic clusters is a Gaussian ball). The other trunk splits repeatedly, indicating a complex hierarchical and multi-modal cluster structure. Within this structure there are two clear primary sub-clusters.

The width of a cluster's branch indicates how many points belong to the cluster when it first appears. The dendrogram for this example shows (correctly) that the Gaussian blob has far less mass than the concentric circles.

4. Prune the tree


In [5]:
pruned_tree = tree.prune(60)
pruned_tree.plot()[0].show()


Our LST is not as useful as it should be because there are many very small leaves; some have only a single point. Pruning the tree recursively merges clusters from the leaves downward, until every cluster has at least some minimum size. The prune method is computationally cheap and returns a new tree, so this can be done repeatedly until the tree is most useful for you. In this example we set the minimum cluster size to be 60 points, which results in a much simpler tree.

We could have also pruned immediately by setting the prune_threshold parameter in our original construct_tree call. A good strategy is to set the prune_threshold to be the same as k in that constructor, then increase the prune threshold later if there are too many high-density clusters for your task.

5. Get clusters from the tree


In [6]:
cluster_labels = pruned_tree.get_clusters()

print "Cluster labels shape:", cluster_labels.shape


Cluster labels shape: (493, 2)

There are many ways to use the LST to assign data points to clusters. The default method of the get_clusters method is to return the points in the leaves of the tree, which are the highest-density modal regions of the dataset. The big advantage of this method is that the user doesn't need to know the right number of clusters beforehand, or even to choose a single density level.

All clustering methods return a numpy matrix with two columns. The first column contains indices of data points and the second contains the integer label of a given point's cluster.

In general, the clusters obtained from the LST exclude low-density data points (the specific pattern of which points to exclude depends on the clustering strategy). In our example above, the cluster labels include only 493 of the original 600 data points.

6. Plot the clusters in feature space


In [7]:
upper_level_idx = cluster_labels[:, 0]
upper_level_set = X[upper_level_idx, :]

with plt.style.context('ggplot'):
    fig, ax = plt.subplots(figsize=(6, 4.5))
    ax.scatter(upper_level_set[:, 0], upper_level_set[:, 1],
               c=cluster_labels[:, 1], s=70, alpha=0.9)
    fig.show()