1. Introduction

Clustering algorithms have a wide range of applications. Many methods have been proposed so far, and each has its drawbacks; for example, k-means requires the number of clusters to be known in advance, yet it still fails to handle datasets whose clusters have different sizes or densities robustly.

In this article, an implementation of a novel clustering algorithm is provided and applied to a synthetic dataset that contains multiple topologies. See the reference publication by Alex Rodriguez and Alessandro Laio, "Clustering by fast search and find of density peaks", Science (2014).

The most important feature of this algorithm is that it finds the number of clusters automatically.

In the following sections, we first present a clustered dataset, then illustrate how the algorithm works. At the end, we conclude with a comparison of this algorithm with other clustering algorithms such as k-means and DBSCAN.

Input dataset with ground-truth clusters

A synthetic dataset is provided that can be downloaded here. The dataset contains 2000 points with 2 features and a ground-truth cluster label. It was generated from random normal and uniform numbers, and several noise points were added. In total, there are 6 clusters.
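For illustration, data with this structure could be produced along the following lines (a sketch only, not the author's actual generation script; all centers, spreads, and counts here are made-up values):

import numpy as np

rng = np.random.RandomState(0)

# one Gaussian blob per cluster (repeat with different centers/spreads)
blob = rng.normal(loc=(3.0, 7.0), scale=0.4, size=(300, 2))

# uniform background noise over the data range, labeled 0
noise = rng.uniform(low=0.0, high=10.0, size=(50, 2))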


In [1]:
%load_ext watermark

%watermark -a 'Vahid Mirjalili' -d -p scikit-learn,numpy,matplotlib,pyprind,prettytable -v


Vahid Mirjalili 18/10/2014 

CPython 2.7.3
IPython 2.0.0

scikit-learn 0.15.2
numpy 1.9.0
matplotlib 1.4.0
pyprind 2.6.1
prettytable 0.7.2

In [2]:
## Implementation of Density Peak Clustering
### Vahid Mirjalili

import numpy as np

import urllib
import csv

# download the data
#url = 'https://raw.githubusercontent.com/mirjalil/DataScience/master/data/synthetic.6clust.csv'
#content = urllib.urlopen(url)
#content = content.read()

# save the datafile locally
#with open('data/synthetic_data.csv', 'wb') as out:
#    out.write(content)

# read data into numpy array
d = np.loadtxt("../data/synthetic_data.csv", delimiter=',')

y = d[:,2]
print(np.unique(y))


[ 0.  1.  2.  3.  4.  5.  6.]

In this dataset, label 0 represents noise points, and the other labels are ground-truth cluster IDs. Next, we visualize the dataset:


In [3]:
from matplotlib import pyplot as plt

%matplotlib inline

plt.figure(figsize=(10,8))

for clustId,color in zip(range(0,7),('grey', 'blue', 'red', 'green', 'purple', 'orange', 'brown')):
    plt.scatter(x=d[:,0][d[:,2] == clustId],
                y=d[:,1][d[:,2] == clustId],
                color=color,
                marker='o',
                alpha=0.6
                )
plt.title('Clustered Dataset', size=20)
plt.xlabel('$x_1$', size=25)
plt.ylabel('$x_2$', size=25)

plt.tick_params(axis='both', which='major', labelsize=18)
plt.xlim(xmin=np.amin(d[:,0])-0.5, xmax=np.amax(d[:,0])+0.5)
plt.ylim(ymin=np.amin(d[:,1])-0.5, ymax=np.amax(d[:,1])+0.5)

plt.show()


2. Methodology Description

Step 1: Calculating density at each point

Next, we compute the local density of each point.
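Following the reference paper, the local density $\rho_i$ of point $i$ is the number of points that lie within a cutoff distance $d_c$ of it:

$\rho_i = \sum_j \chi(d_{ij} - d_c)$, where $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise.

The implementation below counts these neighbors with a double loop over all pairs, using the cutoff argument dc.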


In [4]:
import pyprind

def euclidean_dist(p, q):
    """Euclidean distance between two 2-d points."""
    return (np.sqrt((p[0]-q[0])**2 + (p[1]-q[1])**2))

def cal_density(d, dc=0.1):
    """For every point, count how many other points lie within distance dc."""
    n = d.shape[0]
    den_arr = np.zeros(n, dtype=int)

    for i in range(n):
        for j in range(i+1, n):
            if euclidean_dist(d[i,:], d[j,:]) < dc:
                den_arr[i] += 1
                den_arr[j] += 1

    return (den_arr)

den_arr = cal_density(d[:,0:2], 0.5)
print(max(den_arr))


579
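The double loop above runs in O(n^2) Python-level iterations, which becomes slow for larger datasets. A vectorized sketch of the same count, assuming scipy is available (the name cal_density_fast is ours, not from the reference paper):

from scipy.spatial.distance import pdist, squareform

def cal_density_fast(d, dc=0.1):
    # full n-by-n matrix of pairwise Euclidean distances
    dist_mat = squareform(pdist(d, metric='euclidean'))
    # count neighbors strictly within dc; subtract 1 to exclude the point itself
    return (dist_mat < dc).sum(axis=1) - 1

# den_arr = cal_density_fast(d[:,0:2], 0.5)   # should match cal_density above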

Step 2: Calculating the minimum distance to high density points
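
The quantity computed here is $\delta_i$, the minimum distance from point $i$ to any point of higher density: $\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}$. Following the reference paper, the single point with the highest density has no higher-density neighbor, so its $\delta_i$ is set to the maximum pairwise distance in the dataset; this is what the 999 placeholder in the code below gets replaced with.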


In [5]:
def cal_minDist2Peaks(d, den):
    """For every point, find the minimum distance to any point of higher density."""
    n = d.shape[0]
    mdist2peaks = np.repeat(999.9, n)  # float placeholder, fixed up below
    max_pdist = 0 # to store the maximum pairwise distance
    for i in range(n):
        mdist_i = mdist2peaks[i]
        for j in range(i+1, n):
            dist_ij = euclidean_dist(d[i,0:2], d[j,0:2])
            max_pdist = max(max_pdist, dist_ij)
            if den[i] < den[j]:
                mdist_i = min(mdist_i, dist_ij)
            else: # den[j] <= den[i], so j's distance-to-peak may shrink
                mdist2peaks[j] = min(mdist2peaks[j], dist_ij)
        mdist2peaks[i] = mdist_i

    # the globally densest point has no higher-density neighbor;
    # assign it the maximum pairwise distance instead
    max_den_points = np.argwhere(mdist2peaks == 999.9)
    print(max_den_points)
    mdist2peaks[max_den_points] = max_pdist
    return (mdist2peaks)

Step 3: Finding the number of clusters

Plotting the decision graph: cluster centroids stand out as points with anomalously large $\rho$ and $\delta$.


In [6]:
mdist2peaks = cal_minDist2Peaks(d, den_arr)

def plot_decisionGraph(den_arr, mdist2peaks, thresh=None):

    plt.figure(figsize=(10,8))

    if thresh is not None:
        centroids = np.argwhere((mdist2peaks > thresh) & (den_arr>1)).flatten()
        noncenter_points = np.argwhere(mdist2peaks < thresh).flatten()
    else:
        centroids = None
        noncenter_points = np.arange(den_arr.shape[0])

    plt.scatter(x=den_arr[noncenter_points],
            y=mdist2peaks[noncenter_points],
            color='red',
            marker='o',
            alpha=0.5, 
            s=50
            )
    if thresh is not None:
        plt.scatter(x=den_arr[centroids],
                y=mdist2peaks[centroids],
                color='blue',
                marker='o',
                alpha=0.6, 
                s=140
                )

    plt.title('Decision Graph', size=20)
    plt.xlabel(r'$\rho$', size=25)
    plt.ylabel(r'$\delta$', size=25)
    plt.ylim(ymin=min(mdist2peaks-0.5), ymax=max(mdist2peaks+0.5))

    plt.tick_params(axis='both', which='major', labelsize=18)

    plt.show()
    
plot_decisionGraph(den_arr, mdist2peaks, 1.7)


[[954]]

Cluster Centroids

As shown in the previous plot, cluster centroids stand out because of their large distances to higher-density points. These points are shown in blue and can be picked easily:


In [7]:
## points with highest density and distance to other high-density points

thresh = 1.0
centroids = np.argwhere((mdist2peaks > thresh) & (den_arr>1)).flatten()

print(centroids.reshape(1, centroids.shape[0]))


[[ 266  336  482  734  954 1785]]

Further, we show the cluster centroids and non-centroid points together in the figure below. The centroids are drawn as purple stars. Notably, the algorithm picks exactly one centroid from each high-density region: the centroids combine high density with a large distance to other density peaks.


In [8]:
%matplotlib inline

plt.figure(figsize=(10,8))

d_centers = d[centroids,:]

for clustId in range(0, 7):  # all non-centroid points are drawn in grey
    plt.scatter(x=d[:,0][d[:,2] == clustId],
                y=d[:,1][d[:,2] == clustId],
                color='grey',
                marker='o',
                alpha=0.4
                )
    # plot the cluster centroids
    plt.scatter(x=d_centers[:,0][d_centers[:,2] == clustId],
                y=d_centers[:,1][d_centers[:,2] == clustId],
                color='purple',
                marker='*',
                s=400,
                alpha=1.0
                )
plt.title('Cluster Centroids', size=20)
plt.xlabel('$x_1$', size=25)
plt.ylabel('$x_2$', size=25)

plt.tick_params(axis='both', which='major', labelsize=18)
plt.xlim(xmin=0, xmax=10)
plt.ylim(ymin=0, ymax=10)

plt.show()


In this figure, cluster centers are shown as purple stars, and the remaining points are shown in grey, since we do not know their cluster IDs yet.

Cluster Membership Assignment

After the cluster centroids are found, the remaining points must be assigned to their corresponding centroids. A naive way is to assign each point to the nearest centroid; however, this is prone to large errors when clusters are non-convex, since a point at the far end of an elongated cluster may be closer to a neighboring cluster's centroid than to its own.

A more robust method is to assign points, in order of decreasing density, to the cluster of their nearest already-assigned neighbour, as given in the algorithm below:

Algorithm:

1. Assign each centroid to its own unique cluster, and remove the centroids from the set of unassigned points.
2. Repeat until all points are assigned to a cluster:
   a. Find the point with the maximum density among the unassigned points, and remove it from the set.
   b. Assign it to the same cluster as its nearest neighbor among the already-assigned points.


In [12]:
def assign_cluster(df, den_arr, centroids):
    """ Assign the remaining points, in order of decreasing density,
        to the cluster of their nearest already-assigned neighbor.
    """
    nsize = den_arr.shape[0]
    cmemb = np.ndarray(shape=(nsize,2), dtype='int')
    cmemb[:,:] = -1
    ncm = 0
    # each centroid seeds its own cluster
    for i,cix in enumerate(centroids):
        cmemb[i,0] = cix # point index
        cmemb[i,1] = i   # cluster index
        ncm += 1
    # original indices of the non-centroid points, sorted by density
    noncent = np.delete(np.arange(nsize), centroids)
    inxsort = noncent[np.argsort(den_arr[noncent])]
    for i in range(inxsort.shape[0]-1, -1, -1): # decreasing density
        ix = inxsort[i]
        # distance from this point to every point assigned so far
        dist = np.repeat(999.9, ncm)
        for j in range(ncm):
            dist[j] = euclidean_dist(df[ix], df[cmemb[j,0]])
        nearest_neighb = np.argmin(dist)
        cmemb[ncm,0] = ix
        cmemb[ncm,1] = cmemb[nearest_neighb, 1]
        ncm += 1
    return(cmemb)

In [10]:
clust_membership = assign_cluster(d, den_arr, centroids)

In [11]:
plt.figure(figsize=(10,8))

for clustId,color in zip(range(0,6),('blue', 'red', 'green', 'purple', 'orange', 'brown')):
    cset = clust_membership[clust_membership[:,1] == clustId,0]
    plt.scatter(x=d[cset,0],
                y=d[cset,1],
                color=color,
                marker='o',
                alpha=0.6
                )
plt.title('Clustered Dataset by Density-Peak Algorithm', size=20)
plt.xlabel('$x_1$', size=25)
plt.ylabel('$x_2$', size=25)

plt.tick_params(axis='both', which='major', labelsize=18)
plt.xlim(xmin=np.amin(d[:,0])-0.5, xmax=np.amax(d[:,0])+0.5)
plt.ylim(ymin=np.amin(d[:,1])-0.5, ymax=np.amax(d[:,1])+0.5)

plt.show()


3. Conclusion

The clustering algorithm presented here is capable of clustering datasets that contain

  • noise points
  • groups of different shape/topology/density

It is important to note that this method needs a minimal number of parameters: only the cutoff distance dc used to estimate the density of each point. The number of clusters is found by analyzing the density peaks and the minimum distances to higher-density points. Moreover, the application of this clustering algorithm is not limited to numeric data types; it can also be used for text/string data, provided that a proper distance function exists.
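For comparison, both k-means and DBSCAN can be run on the same data with scikit-learn. The sketch below is illustrative only (the parameter values are assumptions, not tuned): k-means must be told the number of clusters up front, and DBSCAN needs eps and min_samples, whereas the density-peak method reads the number of clusters off the decision graph.

from sklearn.cluster import KMeans, DBSCAN

X = d[:, 0:2]

# k-means: the number of clusters must be supplied in advance
km_labels = KMeans(n_clusters=6).fit_predict(X)

# DBSCAN: eps and min_samples set the density threshold;
# the number of clusters falls out of the fit, and noise is labeled -1
db_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)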