Vahid Mirjalili, Data Mining Researcher
Clustering algorithms have a wide range of applications. Many methods have been proposed so far, and each has its drawbacks: in k-means clustering, for example, the number of clusters must be known in advance, and the method does not handle datasets with clusters of different sizes or densities robustly.
In this article, an implementation of a novel clustering algorithm, density-peak clustering, is provided and applied to a synthetic dataset that contains multiple topologies. See the reference publication: Alex Rodriguez and Alessandro Laio, "Clustering by fast search and find of density peaks", Science 344(6191), 2014.
The most important feature of this algorithm is that it determines the number of clusters automatically.
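Briefly, following Rodriguez and Laio, for each data point $i$ the algorithm computes two quantities from the pairwise distances $d_{ij}$ and a cutoff distance $d_c$:

$\rho_i = \sum_j \chi(d_{ij} - d_c)$, where $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise,

$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}$, with $\delta_i = \max_j d_{ij}$ by convention for the point of highest density.

That is, $\rho_i$ is the local density (the number of neighbors closer than $d_c$) and $\delta_i$ is the distance to the nearest point of higher density. Cluster centers stand out as the points for which both $\rho$ and $\delta$ are anomalously large; this is what the decision graph below visualizes.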
In the following sections, we first present a clustered dataset, then illustrate how the algorithm works. At the end, we conclude with a comparison of this algorithm with other clustering algorithms such as k-means and DBSCAN.
A synthetic dataset is provided that can be downloaded here. This dataset contains 2000 points with 2 features and a ground-truth cluster label. The dataset was generated from random normal and uniform numbers, and several noise points were added. In total, 6 clusters exist.
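As an aside, here is a minimal sketch of how a dataset of this flavor could be generated with NumPy. The centers, spreads, and counts below are hypothetical, chosen only for illustration; the actual generation script is not part of this article.

import numpy as np

rng = np.random.RandomState(0)
# hypothetical Gaussian blobs; ground-truth label 0 is reserved for noise
blobs = [rng.normal(loc=c, scale=s, size=(300, 2))
         for c, s in [((2, 2), 0.4), ((8, 3), 0.5), ((5, 8), 0.6)]]
noise = rng.uniform(low=0, high=10, size=(100, 2))
X = np.vstack(blobs + [noise])
labels = np.concatenate([np.full(300, k + 1) for k in range(len(blobs))]
                        + [np.zeros(100)])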
In [1]:
%load_ext watermark
%watermark -a 'Vahid Mirjalili' -d -p scikit-learn,numpy,matplotlib,pyprind,prettytable -v
In [2]:
## Implementation of Density Peak Clustering
### Vahid Mirjalili
import numpy as np
import urllib
import csv
# download the data
#url = 'https://raw.githubusercontent.com/mirjalil/DataScience/master/data/synthetic.6clust.csv'
#content = urllib.urlopen(url)
#content = content.read()
# save the datafile locally
#with open('data/synthetic_data.csv', 'wb') as out:
# out.write(content)
# read data into numpy array
d = np.loadtxt("../data/synthetic_data.csv", delimiter=',')
y = d[:,2]
print(np.unique(y))
In this dataset, 0 represents noise points, and the other values are ground-truth cluster IDs. Next, we visualize the dataset as below:
In [3]:
from matplotlib import pyplot as plt
%matplotlib inline
plt.figure(figsize=(10,8))
for clustId,color in zip(range(0,7),('grey', 'blue', 'red', 'green', 'purple', 'orange', 'brown')):
    plt.scatter(x=d[:,0][d[:,2] == clustId],
                y=d[:,1][d[:,2] == clustId],
                color=color,
                marker='o',
                alpha=0.6)
plt.title('Clustered Dataset', size=20)
plt.xlabel('$x_1$', size=25)
plt.ylabel('$x_2$', size=25)
plt.tick_params(axis='both', which='major', labelsize=18)
plt.xlim(xmin=np.amin(d[:,0])-0.5, xmax=np.amax(d[:,0])+0.5)
plt.ylim(ymin=np.amin(d[:,1])-0.5, ymax=np.amax(d[:,1])+0.5)
plt.show()
In [4]:
import pyprind
def euclidean_dist(p, q):
    """ Euclidean distance between two 2-d points """
    return np.sqrt((p[0]-q[0])**2 + (p[1]-q[1])**2)

def cal_density(d, dc=0.1):
    """ For each point, count the neighbors closer than the cutoff dc """
    n = d.shape[0]
    den_arr = np.zeros(n, dtype=int)
    for i in range(n):
        for j in range(i+1, n):
            if euclidean_dist(d[i,:], d[j,:]) < dc:
                den_arr[i] += 1
                den_arr[j] += 1
    return den_arr
den_arr = cal_density(d[:,0:2], 0.5)
print(max(den_arr))
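The double loop above scales quadratically in time. Assuming SciPy is available (it is not among the packages listed in the watermark above), the same neighbor counts can be computed from a full distance matrix, trading memory for speed. A sketch:

from scipy.spatial.distance import pdist, squareform

def cal_density_vec(d, dc=0.1):
    # full pairwise distance matrix, O(n^2) memory
    dmat = squareform(pdist(d))
    # count neighbors closer than dc; subtract 1 to exclude the point itself
    return (dmat < dc).sum(axis=1) - 1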
In [5]:
def cal_minDist2Peaks(d, den):
    """ For each point, the minimum distance to any point of higher density """
    n = d.shape[0]
    mdist2peaks = np.repeat(999.9, n)  # float sentinel, so distances are not truncated
    max_pdist = 0  # to store the maximum pairwise distance
    for i in range(n):
        mdist_i = mdist2peaks[i]
        for j in range(i+1, n):
            dist_ij = euclidean_dist(d[i,0:2], d[j,0:2])
            max_pdist = max(max_pdist, dist_ij)
            if den[i] < den[j]:
                mdist_i = min(mdist_i, dist_ij)
            else:
                mdist2peaks[j] = min(mdist2peaks[j], dist_ij)
        mdist2peaks[i] = mdist_i
    # the highest-density point has no denser neighbor;
    # by convention it gets the maximum pairwise distance
    max_den_points = np.argwhere(mdist2peaks == 999.9)
    mdist2peaks[max_den_points] = max_pdist
    return mdist2peaks
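As with the density, $\delta$ can be sketched in vectorized form from the same distance matrix (again assuming SciPy; note that points of exactly equal density are tie-broken slightly differently than in the loop above):

from scipy.spatial.distance import pdist, squareform

def cal_minDist2Peaks_vec(d, den):
    dmat = squareform(pdist(d[:, 0:2]))
    delta = np.zeros(den.shape[0])
    for i in range(den.shape[0]):
        higher = np.where(den > den[i])[0]  # strictly denser points
        # the densest point gets the maximum pairwise distance by convention
        delta[i] = dmat[i, higher].min() if higher.size else dmat[i].max()
    return delta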
In [6]:
mdist2peaks = cal_minDist2Peaks(d, den_arr)
def plot_decisionGraph(den_arr, mdist2peaks, thresh=None):
    plt.figure(figsize=(10,8))

    if thresh is not None:
        centroids = np.argwhere((mdist2peaks > thresh) & (den_arr > 1)).flatten()
        noncenter_points = np.argwhere(mdist2peaks < thresh)
    else:
        centroids = None
        noncenter_points = np.arange(den_arr.shape[0])

    plt.scatter(x=den_arr[noncenter_points],
                y=mdist2peaks[noncenter_points],
                color='red',
                marker='o',
                alpha=0.5,
                s=50)

    if thresh is not None:
        plt.scatter(x=den_arr[centroids],
                    y=mdist2peaks[centroids],
                    color='blue',
                    marker='o',
                    alpha=0.6,
                    s=140)

    plt.title('Decision Graph', size=20)
    plt.xlabel(r'$\rho$', size=25)
    plt.ylabel(r'$\delta$', size=25)
    plt.ylim(ymin=min(mdist2peaks-0.5), ymax=max(mdist2peaks+0.5))
    plt.tick_params(axis='both', which='major', labelsize=18)
    plt.show()
plot_decisionGraph(den_arr, mdist2peaks, 1.7)
In [7]:
## points with highest density and distance to other high-density points
thresh = 1.0
centroids = np.argwhere((mdist2peaks > thresh) & (den_arr>1)).flatten()
print(centroids.reshape(1, centroids.shape[0]))
Next, we show the cluster centroids and non-centroid points together in the figure below, with the centroids drawn as purple stars. Notably, the algorithm picks exactly one point as centroid from each high-density region: the centroids combine a large density with a large distance to other density peaks.
In [8]:
%matplotlib inline
plt.figure(figsize=(10,8))
d_centers = d[centroids,:]
for clustId,color in zip(range(0,7),('grey', 'blue', 'red', 'green', 'purple', 'orange', 'brown')):
    plt.scatter(x=d[:,0][d[:,2] == clustId],
                y=d[:,1][d[:,2] == clustId],
                color='grey',
                marker='o',
                alpha=0.4)

    # plot the cluster centroids
    plt.scatter(x=d_centers[:,0][d_centers[:,2] == clustId],
                y=d_centers[:,1][d_centers[:,2] == clustId],
                color='purple',
                marker='*',
                s=400,
                alpha=1.0)
plt.title('Cluster Centroids', size=20)
plt.xlabel('$x_1$', size=25)
plt.ylabel('$x_2$', size=25)
plt.tick_params(axis='both', which='major', labelsize=18)
plt.xlim(xmin=0, xmax=10)
plt.ylim(ymin=0, ymax=10)
plt.show()
In this figure, cluster centers are shown as purple stars, and the remaining points are shown in grey, since we don't know their cluster IDs yet.
After the cluster centroids are found, the remaining points should be assigned to their corresponding clusters. A naive way is to assign each point simply to the nearest centroid; however, this method is prone to large errors if the data has a complex underlying topology, as the sketch below illustrates.
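For contrast, a minimal sketch of that naive rule (the name naive_labels is introduced here purely for illustration):

# assign every point to its nearest centroid, ignoring the data topology
naive_labels = np.array([np.argmin([euclidean_dist(p, d[c]) for c in centroids])
                         for p in d[:, 0:2]])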
A less error-prone method is instead to assign each point to the same cluster as its nearest neighbor among the already-assigned points, processing points in order of decreasing density, as given in the algorithm below:
Algorithm:
1. Assign each centroid to its own unique cluster, and remove the centroids from the set of unassigned points.
2. Repeat until every point is assigned to a cluster:
    - Find the point with maximum density among the remaining points, and remove it from the set.
    - Assign it to the same cluster as its nearest neighbor in the set of already-assigned points.
In [12]:
def assign_cluster(df, den_arr, centroids):
    """ Assign the remaining (non-centroid) points to clusters """
    nsize = den_arr.shape[0]
    cmemb = np.ndarray(shape=(nsize,2), dtype='int')
    cmemb[:,:] = -1
    ncm = 0
    for i,cix in enumerate(centroids):
        cmemb[i,0] = cix   # point index of the centroid
        cmemb[i,1] = i     # its cluster index
        ncm += 1
    # indices of the non-centroid points, sorted by increasing density
    # (sorting indices, not densities, keeps the mapping back into df intact)
    rest = np.delete(np.arange(nsize), centroids)
    inxsort = rest[np.argsort(den_arr[rest])]
    for i in range(inxsort.shape[0]-1, -1, -1):
        ix = inxsort[i]  # densest point still unassigned
        dist = np.repeat(999.9, ncm)
        for j in range(ncm):
            dist[j] = euclidean_dist(df[ix], df[cmemb[j,0]])
        nearest_neighb = np.argmin(dist)
        # same cluster as its nearest already-assigned neighbor
        cmemb[ncm,0] = ix
        cmemb[ncm,1] = cmemb[nearest_neighb,1]
        ncm += 1
    return cmemb
In [10]:
clust_membership = assign_cluster(d, den_arr, centroids)
In [11]:
plt.figure(figsize=(10,8))
for clustId,color in zip(range(0,6),('blue', 'red', 'green', 'purple', 'orange', 'brown')):
    cset = clust_membership[clust_membership[:,1] == clustId,0]
    plt.scatter(x=d[cset,0],
                y=d[cset,1],
                color=color,
                marker='o',
                alpha=0.6)
plt.title('Clustered Dataset by Density-Peak Algorithm', size=20)
plt.xlabel('$x_1$', size=25)
plt.ylabel('$x_2$', size=25)
plt.tick_params(axis='both', which='major', labelsize=18)
plt.xlim(xmin=np.amin(d[:,0])-0.5, xmax=np.amax(d[:,0])+0.5)
plt.ylim(ymin=np.amin(d[:,1])-0.5, ymax=np.amax(d[:,1])+0.5)
plt.show()
The clustering algorithm presented here is capable of clustering datasets that contain clusters of different shapes, sizes, and densities, as well as noise points.
It is important to note that this method needs only a minimal number of parameters: the cutoff distance dc used for finding the density of each point. The number of clusters is then found by analyzing the density peaks and the minimum distances to higher-density points. The application of this clustering algorithm is not limited to numeric data types; it can also be used for text/string data types, provided that a proper distance function exists.
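To make the comparison with k-means and DBSCAN concrete, here is a minimal scikit-learn sketch (the parameter values are illustrative guesses, not tuned):

from sklearn.cluster import KMeans, DBSCAN

X = d[:, 0:2]
# k-means must be told the number of clusters up front
km_labels = KMeans(n_clusters=6, random_state=0).fit_predict(X)
# DBSCAN instead needs a neighborhood radius and a minimum neighbor count
db_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

Unlike density-peak clustering, k-means requires the number of clusters as input and has no notion of noise, while DBSCAN struggles when clusters have very different densities.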