DBSCAN Clustering


Ndèye Gagnessiry Ndiaye and Christin Seifert


This work is licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/

This notebook:

  • introduces DBSCAN clustering using features from the Iris flower dataset

import sklearn.metrics as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import DBSCAN

iris = datasets.load_iris()
x = pd.DataFrame(iris.data)
x.columns = ['SepalLength','SepalWidth','PetalLength','PetalWidth'] 

y = pd.DataFrame(iris.target)
y.columns = ['Targets']

iris = x[['SepalLength', 'PetalLength']]

We fit the model with ɛ=0.5 and min_samples=5 on two attributes of the Iris dataset (SepalLength and PetalLength).

  • ɛ: the radius of the neighborhood around a data point p.
  • min_samples: the minimum number of data points (p itself included) that must lie within that neighborhood for p to count as a core point of a cluster.
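
To make the two parameters concrete, here is a minimal sketch of the core-point test they define. It is a brute-force illustration in NumPy, not scikit-learn's implementation; the helper name `core_point_mask` and the toy points are assumptions made up for this example.

```python
import numpy as np

def core_point_mask(points, eps=0.5, min_samples=5):
    """Return a boolean mask that is True for every core point (hypothetical helper)."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances between all points, shape (n, n)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Size of each eps-neighborhood; a point counts as its own neighbor
    neighbor_counts = (dists <= eps).sum(axis=1)
    return neighbor_counts >= min_samples

# Toy data: five tightly packed points plus one isolated point
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [0.1, 0.1], [0.05, 0.05], [3.0, 3.0]])
toy_core = core_point_mask(toy, eps=0.5, min_samples=5)
print(toy_core)
```

Each of the five packed points has all five of them inside its neighborhood and is therefore a core point; the isolated point only has itself.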

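# eps=0.5 and min_samples=5 are also the scikit-learn defaults
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(iris)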

DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean',
    min_samples=5, n_jobs=1, p=None)
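
A fitted DBSCAN model stores one cluster label per sample in `labels_`, with -1 marking noise. The sketch below shows one way to summarize such a label array. The helper `summarize_labels` and the short `demo_labels` array are made up for illustration and are not the labels of the Iris fit.

```python
import numpy as np

def summarize_labels(labels):
    """Count clusters, noise points and cluster sizes in a DBSCAN label array."""
    labels = np.asarray(labels)
    # Every distinct label other than -1 is one cluster
    cluster_ids = np.unique(labels[labels != -1])
    n_noise = int((labels == -1).sum())
    sizes = {int(c): int((labels == c).sum()) for c in cluster_ids}
    return len(cluster_ids), n_noise, sizes

# Made-up labels: two clusters (0 and 1) and two noise points (-1)
demo_labels = np.array([0, 0, 1, -1, 1, 1, 0, -1])
n_clusters, n_noise, sizes = summarize_labels(demo_labels)
print(n_clusters, n_noise, sizes)
```

With the fitted model, the same helper could be called as `summarize_labels(dbscan.labels_)`.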

The following figure plots the original data next to the fitted model. DBSCAN finds two "dense" clusters of points (red and green) and marks the remaining points as outliers (black).

# Set the size of the plot
plt.figure(figsize=(14, 7))

# Create a colormap
colormap = np.array(['red', 'lime', 'black'])

# Plot Original
plt.subplot(1, 2, 1)
plt.scatter(x.SepalLength, x.PetalLength, c='black', s=40)
plt.title('Original dataset')

# Plot the Model

plt.subplot(1, 2, 2)
plt.scatter(x.SepalLength, x.PetalLength, c=colormap[dbscan.labels_], s=40)
plt.title('DBSCAN Clustering')
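
The expression `colormap[dbscan.labels_]` relies on NumPy's negative indexing: the noise label -1 selects the last entry of the color array, which is 'black'. The sketch below demonstrates this on made-up labels and adds a guarded variant. The helper `colors_for_labels` is hypothetical and only illustrates how to avoid a silent clash when there are more clusters than cluster colors.

```python
import numpy as np

colormap = np.array(['red', 'lime', 'black'])
demo_labels = np.array([0, 1, -1, 1, 0])   # made-up labels, -1 marks noise
point_colors = colormap[demo_labels]       # index -1 wraps around to 'black'
print(point_colors)

def colors_for_labels(labels, cluster_colors=('red', 'lime'), noise_color='black'):
    """Map cluster labels to colors, keeping the noise color in the last slot."""
    labels = np.asarray(labels, dtype=int)
    n_clusters = int(labels.max()) + 1 if labels.size else 0
    if n_clusters > len(cluster_colors):
        # A third cluster would otherwise be drawn in the noise color
        raise ValueError("more clusters than cluster colors")
    palette = np.array(list(cluster_colors) + [noise_color])
    return palette[labels]
```

With only two cluster colors, a label of 2 would index 'black' in the original array and be indistinguishable from noise, which is why the guarded variant raises an error instead.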


We compute the confusion matrix and calculate the purity metric.

def confusion(y,labels):
    cm = sm.confusion_matrix(y, labels)
    return cm

confusion(y, dbscan.labels_)

array([[ 0,  0,  0,  0],
       [ 0, 50,  0,  0],
       [ 3,  0, 47,  0],
       [ 0,  0, 50,  0]])
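
The matrix has four rows and columns because `confusion_matrix` indexes both axes by the sorted union of all values in its two arguments, here -1, 0, 1 and 2. Rows are the true classes, so the first row (class -1) is empty; columns are the DBSCAN labels, so the last column (label 2) is empty. As a sketch of a more compact view, the hypothetical helper below builds a table with only the classes and labels that actually occur. The demo arrays are rebuilt from the counts in the matrix above and do not reflect the real sample order.

```python
import numpy as np

def contingency_table(true_labels, cluster_labels):
    """Cross-tabulate true classes (rows) against cluster labels (columns)."""
    true_labels = np.asarray(true_labels).ravel()
    cluster_labels = np.asarray(cluster_labels).ravel()
    classes, row_idx = np.unique(true_labels, return_inverse=True)
    clusters, col_idx = np.unique(cluster_labels, return_inverse=True)
    table = np.zeros((classes.size, clusters.size), dtype=int)
    # Add 1 to the cell of every (class, cluster) pair
    np.add.at(table, (row_idx, col_idx), 1)
    return classes, clusters, table

# Arrays reconstructed from the counts in the confusion matrix above
y_demo = np.repeat([0, 1, 2], 50)
labels_demo = np.concatenate([
    np.zeros(50, dtype=int),   # class 0: all 50 in cluster 0
    np.full(3, -1),            # class 1: 3 noise points
    np.ones(47, dtype=int),    # class 1: 47 in cluster 1
    np.ones(50, dtype=int),    # class 2: all 50 in cluster 1
])
classes, clusters, table = contingency_table(y_demo, labels_demo)
print(clusters)
print(table)
```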

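# Calculate purity: the share of points that belong to the majority class of their cluster
def Purity(cm):
    cm = np.asarray(cm)
    # Rows of cm are true classes and columns are cluster labels,
    # so the majority count of each cluster is its column maximum
    majority = cm.max(axis=0)
    return majority.sum() / cm.sum()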

Purity(confusion(y, dbscan.labels_))
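
For reference, the usual definition of purity takes the count of the most frequent true class in each cluster, sums these counts, and divides by the number of samples N. In the worked line below, n_{jk} is the number of samples of true class j that received cluster label k, read from the confusion matrix shown above (columns are the labels -1, 0, 1, 2). It assumes that the noise label -1 is counted as a cluster of its own; other conventions leave noise points out.

```latex
\[
\text{purity} = \frac{1}{N} \sum_{k} \max_{j} \, n_{jk}
\]
\[
\text{purity} = \frac{3 + 50 + 50 + 0}{150} = \frac{103}{150} \approx 0.687
\]
```

The value is well below 1 because the second cluster merges two of the three Iris classes, so only the larger class counts toward the sum.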
