Ndèye Gagnessiry Ndiaye and Christin Seifert
This work is licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/
This notebook:
In [78]:
    
import sklearn.metrics as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import DBSCAN
    
In [79]:
    
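# Load the iris data and wrap the features and targets in DataFrames
iris = datasets.load_iris()
x = pd.DataFrame(iris.data)
x.columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
y = pd.DataFrame(iris.target)
y.columns = ['Targets']
# Keep only the two attributes used for clustering (this rebinds the name iris)
iris = x[['SepalLength', 'PetalLength']]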
    
We fit the model with ɛ = 0.5 and min_samples = 5 (the scikit-learn defaults), using two attributes of the iris dataset: SepalLength and PetalLength.
In [80]:
    
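# eps=0.5 and min_samples=5 are the scikit-learn defaults, written out for clarity
dbscan = DBSCAN(eps=0.5, min_samples=5)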
dbscan
    
    Out[80]:
In [83]:
    
dbscan.fit(iris)
dbscan.labels_
    
    Out[83]:
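
DBSCAN assigns the label -1 to points it treats as noise and non-negative integers to the clusters it finds. As a minimal sketch of how the fitted labels could be summarised, the cell below refits the model on the same two attributes so that it runs on its own. The names `X`, `model`, `cluster_sizes`, `n_noise` and `core_mask` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN

# Same two attributes as above: sepal length (column 0) and petal length (column 2)
X = datasets.load_iris().data[:, [0, 2]]

model = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = model.labels_

# Noise points carry the label -1; every other label is a cluster id
n_noise = int(np.sum(labels == -1))
cluster_ids, counts = np.unique(labels[labels != -1], return_counts=True)
cluster_sizes = dict(zip(cluster_ids.tolist(), counts.tolist()))

# Core samples are the points with at least min_samples neighbours within eps
core_mask = np.zeros(len(labels), dtype=bool)
core_mask[model.core_sample_indices_] = True

print("clusters and sizes:", cluster_sizes)
print("noise points:", n_noise)
print("core samples:", int(core_mask.sum()))
```

Core samples always belong to a cluster, so none of them should carry the label -1.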
The following figure plots the original data (left) next to the DBSCAN result (right). The clustering shows two "dense" clusters of points (red and green) and outliers (black).
In [84]:
    
# Set the size of the plot
plt.figure(figsize=(24,10))
# Colours for cluster 0, cluster 1 and noise (the noise label -1 indexes the last entry)
colormap = np.array(['red', 'lime', 'black'])
# Plot Original
plt.subplot(1, 2, 1)
plt.scatter(x.SepalLength, x.PetalLength, c="k", s=40)
plt.title('Original dataset')
# Plot the Model
plt.subplot(1, 2, 2)
plt.scatter(x.SepalLength, x.PetalLength, c=colormap[dbscan.labels_], s=40)
plt.title('DBSCAN Clustering')
plt.show()
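
The three-entry colour array used for the plot relies on the noise label -1 indexing its last element, so a third cluster (label 2) would be drawn in black like the outliers. A minimal sketch of a more general mapping is given below. The helper `label_colors` and its palette are hypothetical and not part of the original notebook.

```python
import numpy as np

def label_colors(labels,
                 palette=("red", "lime", "blue", "orange", "purple", "cyan"),
                 noise_color="black"):
    """Map DBSCAN labels to colour names; noise (-1) always gets noise_color."""
    labels = np.asarray(labels)
    colors = np.empty(labels.shape, dtype=object)
    colors[labels == -1] = noise_color
    for cluster_id in np.unique(labels[labels != -1]):
        # Cycle through the palette if there are more clusters than colours
        colors[labels == cluster_id] = palette[cluster_id % len(palette)]
    return colors

example = label_colors([0, 0, 1, -1, 2])
print(example.tolist())  # ['red', 'red', 'lime', 'black', 'blue']
```

With this helper, the second scatter plot could be called with `c=label_colors(dbscan.labels_).tolist()`.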
    
    
We compute the confusion matrix between the true class labels and the cluster labels, and from it the purity metric.
In [85]:
    
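def confusion(y, labels):
    # Rows index the true classes and columns the DBSCAN labels; both axes
    # cover the sorted union of the values, including the noise label -1
    return sm.confusion_matrix(y, labels)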
    
In [89]:
    
confusion(y, dbscan.labels_)
    
    Out[89]:
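
Because `confusion_matrix` builds both axes from the sorted union of the values it sees, the noise label -1 adds an extra first row and column, and the row for -1 is all zeros since no true class has that value. The sketch below shows this on made-up toy data and, as one possible alternative, builds a labelled contingency table with `pd.crosstab`. The arrays `y_true` and `clusters` are illustrative only.

```python
import numpy as np
import pandas as pd
import sklearn.metrics as sm

# Toy data (made up for illustration): three true classes, two clusters plus noise
y_true = np.array([0, 0, 1, 1, 2, 2])
clusters = np.array([0, 0, 1, 1, 1, -1])

# Rows and columns both cover the sorted union of values: -1, 0, 1, 2
cm = sm.confusion_matrix(y_true, clusters)
print(cm)

# Contingency table: rows are the true classes, columns are the cluster labels
table = pd.crosstab(pd.Series(y_true, name="true class"),
                    pd.Series(clusters, name="cluster"))
print(table)
```

The contingency table keeps only the classes and cluster labels that actually occur, which makes the rows and columns easier to read.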
In [90]:
    
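# Calculate purity
def Purity(cm):
    # Rows of cm are the true classes and columns are the DBSCAN labels
    # (noise, labelled -1, is counted as its own column).
    # Purity sums the majority-class count of every cluster (column)
    # and divides by the total number of samples.
    cm = np.asarray(cm)
    return cm.max(axis=0).sum() / cm.sum()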
    
In [88]:
    
Purity(confusion(y, dbscan.labels_))
    
    Out[88]:
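
Purity is commonly defined as follows: assign each cluster to its most frequent true class, count the samples that match that class, and divide by the total number of samples. The sketch below checks this definition on made-up toy data. The helper `purity_score` is hypothetical, and treating noise (-1) as one more cluster is an assumption; excluding noise points would change the value.

```python
import numpy as np
import pandas as pd

def purity_score(y_true, clusters):
    """Fraction of samples that belong to the majority true class of their cluster."""
    table = pd.crosstab(np.asarray(y_true).ravel(), np.asarray(clusters).ravel())
    # Each column is one cluster; its largest entry is the majority-class count
    return table.values.max(axis=0).sum() / table.values.sum()

# Toy data (made up for illustration)
y_true = [0, 0, 1, 1, 2, 2]
clusters = [0, 0, 1, 1, 1, -1]
print(purity_score(y_true, clusters))  # 5/6, about 0.833
```

In the toy data the majority counts are 2 for cluster 0, 2 for cluster 1 and 1 for the noise group, which gives 5 out of 6 samples.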