Example and timings

This notebook gives a short introduction to using pydpc for a simple clustering problem.


In [ ]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pydpc import Cluster
from pydpc._reference import Cluster as RefCluster

We start by preparing the data points for clustering. The data is two-dimensional and created by drawing random numbers from a superposition of four Gaussian distributions centered at the corners of a square (indicated by the red dashed lines).


In [ ]:
# generate the data points
npoints = 2000
mux = 1.6
muy = 1.6
points = np.zeros(shape=(npoints, 2), dtype=np.float64)
points[:, 0] = np.random.randn(npoints) + mux * (-1)**np.random.randint(0, high=2, size=npoints)
points[:, 1] = np.random.randn(npoints) + muy * (-1)**np.random.randint(0, high=2, size=npoints)
# draw the data points
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(points[:, 0], points[:, 1], s=40)
ax.plot([-mux, -mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color="red")
ax.plot([mux, mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color="red")
ax.plot([-1.5 * mux,  1.5 * mux], [-muy, -muy], '--', linewidth=2, color="red")
ax.plot([-1.5 * mux,  1.5 * mux], [muy, muy], '--', linewidth=2, color="red")
ax.set_xlabel(r"x / a.u.", fontsize=20)
ax.set_ylabel(r"y / a.u.", fontsize=20)
ax.tick_params(labelsize=15)
ax.set_xlim([-7, 7])
ax.set_ylim([-7, 7])
ax.set_aspect('equal')
fig.tight_layout()

Now comes the interesting part.

We pass the numpy ndarray with the data points to the Cluster class, which prepares the data set for clustering. In this stage, it computes the Euclidean distances between all data points and, from those, the two properties used to identify clusters within the data: each data point's density and its minimal distance delta to a point of higher density.

Once these properties are computed, a decision graph is drawn, in which each outlier in the upper right corner represents a separate cluster. In our example, we should find four outliers. At this stage, however, no clustering has been performed yet.
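The two decision-graph quantities can be sketched with plain NumPy. Note this is only a conceptual stand-in: the hard-cutoff neighbour count below is a simplification of the density estimate pydpc actually uses, and the cutoff value `d_c` is chosen arbitrarily here.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal((50, 2))

# all-pairs Euclidean distances
dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

# density: a simple hard-cutoff neighbour count (illustrative only;
# pydpc derives its cutoff from a neighbour fraction and uses a
# smoother kernel estimate)
d_c = 1.0
density = (dist < d_c).sum(axis=1) - 1  # exclude the point itself

# delta: minimal distance to any point of higher density; the point
# with the highest density gets its maximal distance by convention
order = np.argsort(density)[::-1]
delta = np.empty(len(x))
delta[order[0]] = dist[order[0]].max()
for i in range(1, len(order)):
    delta[order[i]] = dist[order[i], order[:i]].min()
```

Cluster centers then stand out as points that combine a high density with a large delta, which is exactly what the decision graph visualizes.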


In [ ]:
clu = Cluster(points)

Now that we have the decision graph, we can select the outliers via the assign method by setting lower bounds for delta and density. The assign method does the actual clustering; it also shows the decision graph again with the given selection.


In [ ]:
clu.assign(20, 1.5)
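Conceptually, the selection is a simple threshold on the two decision-graph quantities. The sketch below uses hypothetical density/delta values for eight points; the threshold values merely mirror the bounds passed to `clu.assign(20, 1.5)` above.

```python
import numpy as np

# hypothetical decision-graph values for eight points (illustrative)
density = np.array([25.0, 3.0, 30.0, 2.0, 4.0, 28.0, 1.0, 27.0])
delta = np.array([4.0, 0.3, 5.2, 0.2, 0.4, 3.9, 0.1, 4.5])

# centers: points for which BOTH quantities exceed their lower bounds
density_bound, delta_bound = 20.0, 1.5
centers = np.where((density > density_bound) & (delta > delta_bound))[0]
print(centers)  # -> [0 2 5 7], i.e. four outliers = four cluster centers
```

All remaining points are then assigned, in order of decreasing density, to the cluster of their nearest neighbor of higher density.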

Let us have a look at the result.

We again plot the data with red dashed lines indicating the centers of the Gaussian distributions. The left panel marks the four outliers from the decision graph with red dots; these are our four cluster centers. The center panel shows the points' densities, and the right panel shows the membership to the four clusters by different coloring.


In [ ]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
ax[0].scatter(points[:, 0], points[:, 1], s=40)
ax[0].scatter(points[clu.clusters, 0], points[clu.clusters, 1], s=50, c="red")
ax[1].scatter(points[:, 0], points[:, 1], s=40, c=clu.density)
ax[2].scatter(points[:, 0], points[:, 1], s=40, c=clu.membership, cmap=mpl.cm.cool)
for _ax in ax:
    _ax.plot([-mux, -mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color="red")
    _ax.plot([mux, mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color="red")
    _ax.plot([-1.5 * mux,  1.5 * mux], [-muy, -muy], '--', linewidth=2, color="red")
    _ax.plot([-1.5 * mux,  1.5 * mux], [muy, muy], '--', linewidth=2, color="red")
    _ax.set_xlabel(r"x / a.u.", fontsize=20)
    _ax.set_ylabel(r"y / a.u.", fontsize=20)
    _ax.tick_params(labelsize=15)
    _ax.set_xlim([-7, 7])
    _ax.set_ylim([-7, 7])
    _ax.set_aspect('equal')
fig.tight_layout()

Density peak clustering can further resolve whether a data point's membership in a certain cluster is strong or rather weak, separating the data points into core and halo regions.

The left panel depicts the border members in grey. The center panel separates core and halo using the criterion of the original authors; the right panel shows a less strict criterion which assumes a halo only between different clusters. In both cases, the halo members are depicted in grey.
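The original core/halo criterion can be sketched as follows: border members are points close to a point of a different cluster, and a cluster's halo contains all members whose density falls below the highest density found in that cluster's border region. This is a self-contained NumPy sketch with an arbitrary cutoff `d_c`, not pydpc's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated blobs
a = rng.standard_normal((40, 2)) + [-3.0, 0.0]
b = rng.standard_normal((40, 2)) + [3.0, 0.0]
x = np.vstack([a, b])
membership = np.repeat([0, 1], 40)

dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
d_c = 1.0  # illustrative cutoff
density = (dist < d_c).sum(axis=1) - 1.0

# border members: within d_c of a point belonging to another cluster
other = membership[:, None] != membership[None, :]
border = ((dist < d_c) & other).any(axis=1)

# original criterion: halo = members with density below the highest
# density occurring in their cluster's border region
halo = np.zeros(len(x), dtype=bool)
for c in (0, 1):
    in_c = membership == c
    if (in_c & border).any():
        rho_b = density[in_c & border].max()
        halo |= in_c & (density < rho_b)
core = ~halo
```

The less strict variant shown in the right panel restricts the halo to the regions between clusters, so fewer low-density points on the outside of a cluster are demoted to the halo.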


In [ ]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
ax[0].scatter(
    points[:, 0], points[:, 1],
    s=40, c=clu.membership, cmap=mpl.cm.cool)
ax[0].scatter(points[clu.border_member, 0], points[clu.border_member, 1], s=40, c="grey")
ax[1].scatter(
    points[clu.core_idx, 0], points[clu.core_idx, 1],
    s=40, c=clu.membership[clu.core_idx], cmap=mpl.cm.cool)
ax[1].scatter(points[clu.halo_idx, 0], points[clu.halo_idx, 1], s=40, c="grey")
clu.autoplot = False
clu.assign(20, 1.5, border_only=True)
ax[2].scatter(
    points[clu.core_idx, 0], points[clu.core_idx, 1],
    s=40, c=clu.membership[clu.core_idx], cmap=mpl.cm.cool)
ax[2].scatter(points[clu.halo_idx, 0], points[clu.halo_idx, 1], s=40, c="grey")
ax[2].tick_params(labelsize=15)
for _ax in ax:
    _ax.plot([-mux, -mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color="red")
    _ax.plot([mux, mux], [-1.5 * muy, 1.5 * muy], '--', linewidth=2, color="red")
    _ax.plot([-1.5 * mux,  1.5 * mux], [-muy, -muy], '--', linewidth=2, color="red")
    _ax.plot([-1.5 * mux,  1.5 * mux], [muy, muy], '--', linewidth=2, color="red")
    _ax.set_xlabel(r"x / a.u.", fontsize=20)
    _ax.set_ylabel(r"y / a.u.", fontsize=20)
    _ax.tick_params(labelsize=15)
    _ax.set_xlim([-7, 7])
    _ax.set_ylim([-7, 7])
    _ax.set_aspect('equal')
fig.tight_layout()

This concludes the example.

In the remaining part, we compare the performance of the pydpc implementation (numpy + Cython-wrapped C code) against an older development version (pure numpy). In particular, we look at the numerically most demanding part: computing the Euclidean distances between the data points and estimating density and delta.
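The dominant cost in this preparation stage is the all-pairs distance matrix, which scales as O(N²) in both time and memory. A minimal plain-numpy sketch of that computation, timed with the standard library (the function name is ours, and absolute timings are machine-dependent):

```python
import time
import numpy as np

def pairwise_distances(points):
    # broadcasted all-pairs differences: O(N^2) time and memory
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff * diff).sum(axis=-1))

for n in (250, 500, 1000):
    pts = np.random.randn(n, 2)
    t0 = time.perf_counter()
    d = pairwise_distances(pts)
    print(n, time.perf_counter() - t0)
```

Doubling N roughly quadruples the work, which is why moving this part into compiled C code pays off.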


In [ ]:
npoints = 1000
points = np.zeros(shape=(npoints, 2), dtype=np.float64)
points[:, 0] = np.random.randn(npoints) + 1.8 * (-1)**np.random.randint(0, high=2, size=npoints)
points[:, 1] = np.random.randn(npoints) + 1.8 * (-1)**np.random.randint(0, high=2, size=npoints)

%timeit Cluster(points, fraction=0.02, autoplot=False)
%timeit RefCluster(fraction=0.02, autoplot=False).load(points)

The next two cells measure the full clustering.


In [ ]:
%%timeit
Cluster(points, fraction=0.02, autoplot=False).assign(20, 1.5)

In [ ]:
%%timeit
tmp = RefCluster(fraction=0.02, autoplot=False)
tmp.load(points)
tmp.assign(20, 1.5)