DBSCAN Clustering

Authors

Ndèye Gagnessiry Ndiaye and Christin Seifert

License

This work is licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/

This notebook:

  • introduces DBSCAN clustering using features from the Iris flower dataset

In [78]:
import sklearn.metrics as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import DBSCAN

In [79]:
iris = datasets.load_iris()
x = pd.DataFrame(iris.data)
x.columns = ['SepalLength','SepalWidth','PetalLength','PetalWidth'] 

y = pd.DataFrame(iris.target)
y.columns = ['Targets']

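# Keep only the two features used for clustering.
# Note: this rebinds the name `iris` from the dataset object to a DataFrame.
iris = x[['SepalLength', 'PetalLength']]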

We fit the model with ɛ=0.5 and min_samples=5 (the scikit-learn defaults), using two attributes of the Iris dataset: SepalLength and PetalLength.

  • ɛ (eps): the radius of the neighborhood around a data point p.
  • min_samples: the minimum number of data points, including p itself, that must lie within the ɛ-neighborhood of p for p to count as a core point of a cluster.
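The two parameters can be made concrete with a small core-point check. The following is a minimal sketch, not scikit-learn's implementation: the helper `core_point_mask` and the data in `toy_points` are hypothetical, and it uses only the standard library so that it runs independently of the notebook.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def core_point_mask(points, eps=0.5, min_samples=5):
    """Return one boolean per point: True if the point is a core point."""
    mask = []
    for p in points:
        # Count the points within distance eps of p; p itself is included.
        n_neighbors = sum(1 for q in points if dist(p, q) <= eps)
        mask.append(n_neighbors >= min_samples)
    return mask


# Hypothetical toy data: five points packed closely together, one far away.
toy_points = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
              (1.2, 1.1), (1.1, 1.2), (5.0, 5.0)]

print(core_point_mask(toy_points))
```

The five nearby points each have five neighbors within ɛ (counting themselves), so they are core points; the isolated point has only itself in its neighborhood and would end up as noise.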

In [80]:
# Spell out the parameters; these values are also the defaults
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan


Out[80]:
DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean',
    min_samples=5, n_jobs=1, p=None)

In [83]:
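# Fit on the two selected features. labels_ holds one cluster id per
# sample, with -1 marking noise points.
dbscan.fit(iris)
dbscan.labels_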


Out[83]:
array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])
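The label array is easier to read once it is summarized. The sketch below rebuilds the labels from the output above (noise at indices 57, 93 and 98) instead of refitting the model, so it runs on its own; the variable names are illustrative only.

```python
from collections import Counter

# Labels reconstructed from the output above: 50 points in cluster 0,
# 100 points in cluster 1, with indices 57, 93 and 98 marked as noise (-1).
labels = [0] * 50 + [1] * 100
for noise_index in (57, 93, 98):
    labels[noise_index] = -1

counts = Counter(labels)
n_noise = counts.get(-1, 0)
# The noise label is not a cluster, so leave it out of the cluster count.
n_clusters = len(counts) - (1 if -1 in counts else 0)

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("cluster sizes:", {k: v for k, v in sorted(counts.items()) if k != -1})
```

In the notebook itself the same summary could be computed directly from `dbscan.labels_`.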

The following figure plots the original data next to the clustering result. DBSCAN finds two dense clusters (red and lime green) and labels three points as noise (black).


In [84]:
# Set the size of the plot
plt.figure(figsize=(24, 10))

# Colors for cluster 0 and cluster 1; the noise label -1 indexes the
# last entry ('black') because negative indices wrap around.
colormap = np.array(['red', 'lime', 'black'])

# Left panel: the original, unlabeled data
plt.subplot(1, 2, 1)
plt.scatter(x.SepalLength, x.PetalLength, c='k', s=40)
plt.title('Original dataset')

# Right panel: points colored by DBSCAN label
plt.subplot(1, 2, 2)
plt.scatter(x.SepalLength, x.PetalLength, c=colormap[dbscan.labels_], s=40)
plt.title('DBSCAN Clustering')

plt.show()
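Indexing `colormap` with `dbscan.labels_` works here only because -1 wraps around to the last entry; with a third cluster, label 2 would also be drawn in black. One possible alternative is an explicit mapping. The helper `labels_to_colors` below is a hypothetical sketch, not part of matplotlib or scikit-learn.

```python
def labels_to_colors(labels, cluster_colors=("red", "lime"), noise_color="black"):
    """Map DBSCAN labels to color names; -1 (noise) gets noise_color."""
    colors = []
    for label in labels:
        if label == -1:
            colors.append(noise_color)
        else:
            # Cycle through the palette if there are more clusters than colors.
            colors.append(cluster_colors[label % len(cluster_colors)])
    return colors


print(labels_to_colors([0, 1, -1, 1]))
```

The resulting list of color names could be passed as the `c` argument of `plt.scatter` in place of `colormap[dbscan.labels_]`.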


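We compute the confusion matrix between the true species and the cluster labels, and derive a purity score from it. Because DBSCAN marks noise as -1, the matrix has four rows and columns, one for each of the labels -1, 0, 1 and 2. Rows are true classes and columns are cluster labels.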


In [85]:
def confusion(y, labels):
    # Rows are the true classes in y, columns are the cluster labels
    cm = sm.confusion_matrix(y, labels)
    return cm

In [89]:
confusion(y, dbscan.labels_)


Out[89]:
array([[ 0,  0,  0,  0],
       [ 0, 50,  0,  0],
       [ 3,  0, 47,  0],
       [ 0,  0, 50,  0]])
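To see where the 4×4 shape and the empty first row come from, the matrix can be rebuilt by hand. This sketch assumes the usual Iris ordering (50 samples per class, in order) and reuses the cluster labels shown earlier; it is plain Python and does not call scikit-learn.

```python
# True classes: 50 samples each of class 0, 1 and 2, in dataset order.
y_true = [0] * 50 + [1] * 50 + [2] * 50

# Cluster labels reconstructed from the DBSCAN output shown earlier.
labels = [0] * 50 + [1] * 100
for noise_index in (57, 93, 98):
    labels[noise_index] = -1

# Rows and columns share one axis: the sorted union of both label sets.
axis = sorted(set(y_true) | set(labels))
position = {value: i for i, value in enumerate(axis)}

cm = [[0] * len(axis) for _ in axis]
for t, c in zip(y_true, labels):
    cm[position[t]][position[c]] += 1

# Print the matrix with row (true class) and column (cluster) headers.
print("      " + "".join(f"c={c:>3} " for c in axis))
for value, row in zip(axis, cm):
    print(f"t={value:>3} " + "".join(f"{n:>5} " for n in row))
```

The row for -1 is empty because no sample has a true class of -1, and the column for 2 is empty because DBSCAN found only clusters 0 and 1. Cluster 1 absorbs 47 samples of class 1 and all 50 samples of class 2.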

In [90]:
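# Calculate purity
def Purity(cm):
    # Sum the largest entry of each row (true class) and divide by the
    # total number of samples. Note that the textbook definition takes
    # the maximum per cluster (column) instead.
    matched = sum(max(row) for row in cm)
    return matched / cm.sum()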

In [88]:
Purity(confusion(y, dbscan.labels_))


Out[88]:
0.97999999999999998
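The score of 0.98 takes the best-matching cluster for each true class. Purity is more commonly defined the other way round, as the share of samples that belong to the majority class of their cluster. The sketch below compares both readings on the confusion matrix printed above; treating the noise column as a cluster of its own is an assumption made here for simplicity.

```python
# Confusion matrix from the output above: rows are true classes,
# columns are cluster labels, both in the order -1, 0, 1, 2.
cm = [
    [0, 0, 0, 0],
    [0, 50, 0, 0],
    [3, 0, 47, 0],
    [0, 0, 50, 0],
]

total = sum(sum(row) for row in cm)

# Row-wise: best-matching cluster for each true class (as in the notebook).
row_wise = sum(max(row) for row in cm) / total

# Column-wise: majority class within each cluster (textbook purity).
# Assumption: the noise column (-1) is counted like any other cluster.
column_wise = sum(max(col) for col in zip(*cm)) / total

print(f"row-wise:    {row_wise:.3f}")     # 0.980
print(f"column-wise: {column_wise:.3f}")  # 0.687
```

The gap between the two values reflects that cluster 1 merges two species: the row-wise score does not penalize this, while the column-wise score does.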