Ndèye Gagnessiry Ndiaye and Christin Seifert
This work is licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/
This notebook:
In [39]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import sklearn.metrics as sm
We load the Iris flower data set. Of its four measured features ('SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'), two are selected for k-means clustering: 'SepalLength' and 'PetalLength'.
In [75]:
from sklearn import datasets
iris = datasets.load_iris()
#iris.data
#iris.feature_names
iris.target
#iris.target_names
Out[75]:
In [41]:
x = pd.DataFrame(iris.data)
x.columns = ['SepalLength','SepalWidth','PetalLength','PetalWidth']
y = pd.DataFrame(iris.target)
y.columns = ['Targets']
# Keep only the two features used for clustering (this rebinds the name `iris`)
iris = x[['SepalLength', 'PetalLength']]
In [42]:
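# Hand-picked initial centroids, given as (SepalLength, PetalLength) pairs
X = np.array([[6.0, 5.0],
              [6.2, 5.2],
              [5.8, 4.8]])
# At most one iteration, starting from the fixed centroids in X
model_1 = KMeans(n_clusters=3, random_state=42, max_iter=1, n_init=1, init=X).fit(iris)
centroids_1 = model_1.cluster_centers_
labels_1 = model_1.labels_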
print(centroids_1)
print(labels_1)
In [43]:
model_10 = KMeans(n_clusters=3, random_state=42, max_iter=10, n_init=1, init=X).fit(iris)
centroids_10 = model_10.cluster_centers_
labels_10 = model_10.labels_
print(centroids_10)
print(labels_10)
In [44]:
model_11 = KMeans(n_clusters=3, random_state=42, max_iter=11, n_init=1, init=X).fit(iris)
centroids_max = model_11.cluster_centers_
labels_max = model_11.labels_
print(centroids_max)
print(labels_max)
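The three runs above differ only in max_iter. As a rough illustration of what a single k-means iteration does, here is a minimal NumPy sketch of one assignment-and-update step (Lloyd's algorithm). The function name lloyd_step and the toy points are hypothetical and are not part of the original notebook; scikit-learn's KMeans has additional details (convergence checks, empty-cluster handling) that this sketch does not try to reproduce.

```python
import numpy as np

def lloyd_step(points, centroids):
    """One assignment + update step of Lloyd's k-means algorithm."""
    # Squared Euclidean distance from every point to every centroid: shape (n, k)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Assignment step: each point goes to its nearest centroid
    labels = d2.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = points[labels == j]
        if len(members) > 0:  # keep the old centroid if a cluster is empty
            new_centroids[j] = members.mean(axis=0)
    return labels, new_centroids

# Toy data (hypothetical values, not the Iris measurements)
points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
start = np.array([[0.0, 0.0], [6.0, 6.0]])

labels, centroids = lloyd_step(points, start)
print(labels)
print(centroids)
```

Calling lloyd_step repeatedly, feeding the returned centroids back in, mimics raising max_iter: the centroids stop moving once the assignments no longer change.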
In [45]:
'''model_999= KMeans(n_clusters=3, random_state=42,max_iter=999).fit(iris)
centroids_max = model_999.cluster_centers_
labels_max=(model_999.labels_)
print(centroids_max)
print(labels_max)'''
Out[45]:
The following plots show the initial centroids and, for each run (max_iter = 1, max_iter = 10, and max_iter = 11, the last labelled MAX), the cluster centroids (blue) and the data points. Each cluster is drawn in a different color.
In [53]:
# Set the size of the plot
plt.figure(figsize=(24,10))
# Create a colormap
colormap = np.array(['red', 'lime', 'black'])
#colormap = {0: 'r', 1: 'g', 2: 'b'}
# Plot Original
plt.subplot(1, 4, 1)
plt.scatter(x.SepalLength, x.PetalLength, c="k", s=40)
plt.scatter(X[:,0],X[:,1], c="b")
plt.title('Initial centroids')
# Plot the Models Classifications
plt.subplot(1, 4, 2)
plt.scatter(iris.SepalLength, iris.PetalLength, c=colormap[labels_1], s=40)
plt.scatter(centroids_1[:,0],centroids_1[:,1], c="b")
plt.title('K-Means Clustering (iter=1)')
plt.subplot(1, 4, 3)
plt.scatter(iris.SepalLength, iris.PetalLength, c=colormap[labels_10], s=40)
plt.scatter(centroids_10[:,0],centroids_10[:,1], c="b")
plt.title('K-Means Clustering (iter=10)')
plt.subplot(1, 4, 4)
plt.scatter(iris.SepalLength, iris.PetalLength, c=colormap[labels_max], s=40)
plt.scatter(centroids_max[:,0],centroids_max[:,1], c="b")
plt.title('K-Means Clustering (iter=MAX)')
plt.show()
We compute the confusion matrices for each iteration and calculate the purity metric.
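For reference, purity under its usual definition is the sum, over clusters, of the size of the largest true class inside each cluster, divided by the total number of samples. A minimal sketch on a made-up matrix follows; the helper name purity_from_confusion and the toy counts are hypothetical, and the matrix is assumed to have true classes as rows and clusters as columns, which is how sklearn.metrics.confusion_matrix(y_true, y_pred) lays out its result.

```python
import numpy as np

def purity_from_confusion(cm):
    """Purity of a clustering, given a matrix with true classes as rows
    and clusters as columns."""
    cm = np.asarray(cm)
    # Largest class count within each cluster (column), summed over clusters,
    # divided by the total number of samples
    return cm.max(axis=0).sum() / cm.sum()

# Toy matrix (hypothetical counts): 15 samples, 3 classes, 3 clusters
toy_cm = np.array([[5, 0, 0],
                   [0, 4, 1],
                   [0, 1, 4]])

print(purity_from_confusion(toy_cm))
```

Here the column maxima are 5, 4 and 4, so the purity is 13/15.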
In [68]:
def confusion(y, labels):
    # Rows are the true classes in y, columns are the cluster labels
    cm = sm.confusion_matrix(y, labels)
    return cm
In [69]:
# Confusion Matrix (iter=1)
set_list = ["setosa","versicolor","virginica"]
cluster_list = ["c1", "c2", "c3"]
data = confusion(y, labels_1)
# confusion_matrix puts true classes on the rows and cluster labels on the columns
pd.DataFrame(data, index=set_list, columns=cluster_list)
Out[69]:
In [70]:
# Confusion Matrix (iter=10)
set_list = ["setosa","versicolor","virginica"]
cluster_list = ["c1", "c2", "c3"]
data = confusion(y, labels_10)
# confusion_matrix puts true classes on the rows and cluster labels on the columns
pd.DataFrame(data, index=set_list, columns=cluster_list)
Out[70]:
In [71]:
# Confusion Matrix (iter=max)
set_list = ["setosa","versicolor","virginica"]
cluster_list = ["c1", "c2", "c3"]
data = confusion(y, labels_max)
# confusion_matrix puts true classes on the rows and cluster labels on the columns
pd.DataFrame(data, index=set_list, columns=cluster_list)
Out[71]:
In [72]:
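# Calculate purity of each confusion matrix
def Purity(cm):
    # Rows of cm are true species, columns are clusters: take the largest
    # species count within each cluster (column), sum these maxima, and
    # divide by the total number of samples (150 for Iris).
    cm = np.asarray(cm)
    return cm.max(axis=0).sum() / cm.sum()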
metric_list = ["iter= 1", "iter= 10", "iter= MAX"]
set_list = ["Purity metric"]
data = np.array([Purity(confusion(y, labels_1)),Purity(confusion(y, labels_10)),Purity(confusion(y, labels_max))])
pd.DataFrame(data,metric_list, set_list)
Out[72]:
Next, we cluster on all four measured features ('SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth') for several values of k (k=2, k=3, k=4, k=6), without fixing the random state. For each k we compute the confusion matrix and the purity.
In [87]:
# k=2, no fixed random_state (results can vary between runs)
model = KMeans(n_clusters=2).fit(x)
centroids = model.cluster_centers_
labels=(model.labels_)
print(centroids)
print(labels)
# Confusion matrix (3x3 because y has three classes; column c3 stays empty for k=2)
set_list = ["setosa","versicolor","virginica"]
cluster_list = ["c1", "c2", "c3"]
data = confusion(y, labels)
pd.DataFrame(data,set_list, cluster_list)
Out[87]:
In [88]:
print ("Purity(k=2)= %f " % Purity(confusion(y, labels)))
In [89]:
# k=3, no fixed random_state (results can vary between runs)
model = KMeans(n_clusters=3).fit(x)
centroids = model.cluster_centers_
labels=(model.labels_)
print(centroids)
print(labels)
#Confusion matrix
set_list = ["setosa","versicolor","virginica"]
cluster_list = ["c1", "c2", "c3"]
data = confusion(y, labels)
pd.DataFrame(data,set_list, cluster_list)
Out[89]:
In [90]:
print ("Purity(k=3)= %f " % Purity(confusion(y, labels)))
In [77]:
# k=4, no fixed random_state (results can vary between runs)
model = KMeans(n_clusters=4).fit(x)
centroids = model.cluster_centers_
labels=(model.labels_)
print(centroids)
print(labels)
# Confusion Matrix
set_list = ["setosa","versicolor","virginica","undefined"]
cluster_list = ["c1", "c2", "c3","c4"]
data = confusion(y, labels)
pd.DataFrame(data,set_list, cluster_list)
Out[77]:
In [414]:
print ("Purity(k=4)= %f " % Purity(confusion(y, labels)))
In [86]:
# k=6, no fixed random_state (results can vary between runs)
model = KMeans(n_clusters=6).fit(x)
centroids = model.cluster_centers_
labels=(model.labels_)
print(centroids)
print(labels)
# Confusion Matrix
set_list = ["setosa","versicolor","virginica","undefined_1","undefined_2","undefined_3"]
cluster_list = ["c1", "c2", "c3","c4","c5","c6"]
data = confusion(y, labels)
pd.DataFrame(data,set_list, cluster_list)
Out[86]:
In [416]:
print ("Purity(k=6)= %f " % Purity(confusion(y, labels)))
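The four experiments above repeat the same steps for each k. As a compact sketch, they could be written as one loop. Note the assumptions: random_state=0 and n_init=10 are fixed here only to make the sketch repeatable (the cells above leave the random state unset), purity is computed per cluster (column maximum), and the names features, targets and purities are illustrative rather than taken from the notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix

# Reload the data so the sketch does not depend on earlier cells
data = datasets.load_iris()
features, targets = data.data, data.target

purities = {}
for k in (2, 3, 4, 6):
    # Assumption: random_state and n_init are fixed only for repeatability
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    # Rows: true species, columns: cluster labels
    cm = confusion_matrix(targets, labels)
    # Largest species count per cluster, summed, over the number of samples
    purities[k] = cm.max(axis=0).sum() / cm.sum()

for k, p in purities.items():
    print("Purity(k=%d)= %f" % (k, p))
```

Keeping the results in a dictionary keyed by k makes it easy to compare the values side by side or to pass them to pd.DataFrame for display.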