In unsupervised learning problems, we only have input data x(i) with no labels, and we want the algorithm to find some structure in the data. A clustering algorithm such as the k-means algorithm attempts to group the data into k "clusters" of similar points.
Examples: social network analysis, organizing computer clusters, and astronomical data analysis.
The k-means clustering algorithm uses iterative refinement to produce a final result. The inputs are the number of clusters k and the data set, which is a collection of features for each data point. The algorithm starts with initial estimates for the k centroids, which can either be randomly generated or randomly selected from the data set. It then iterates between two steps:
1. Data assignment step:
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if $c_i$ is one of the centroids in the set $C$, then each data point $x$ is assigned to a cluster based on
$$\underset{c_i \in C}{\arg\min} \; dist(c_i,x)^2$$
2. Centroid update step:
In this step, the centroids are recomputed by taking the mean of all data points assigned to that centroid's cluster. If $S_i$ is the set of data points assigned to the $i$-th centroid, then
$$c_i=\frac{1}{|S_i|}\sum_{x_i \in S_i} x_i$$
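To make the two steps above concrete, here is a minimal NumPy sketch of the loop. The function name kmeans_naive, the initialization by sampling k points from the data set, and the convergence tolerance are illustrative choices, not part of any library API.
In [ ]:
import numpy as np

def kmeans_naive(X, k, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to the centroid with the smallest
        # squared Euclidean distance (the arg min above)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids, atol=tol):
            break
        centroids = new_centroids
    return centroids, labels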
In [1]:
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

plt.figure(figsize=(12, 12))

# Generate a toy data set of 1500 points drawn from 3 Gaussian blobs
n_samples = 1500
random_state = 1760
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

# Fit k-means with 3 clusters and colour each point by its assigned cluster
kmeans = KMeans(n_clusters=3)
y_pred = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()
In [63]:
kmeans.score(X)  # negative within-cluster sum of squared distances (higher is better)
Out[63]:
In [64]:
plt.figure(figsize=(12, 12))
# Repeat the experiment with only 2 clusters
kmeans = KMeans(n_clusters=2)
y_pred = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()
In [65]:
kmeans.score(X)
Out[65]:
In [3]:
plt.figure(figsize=(12, 12))
# And again with 5 clusters
kmeans = KMeans(n_clusters=5)
y_pred = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()
There is a problem: how do we find the best number of clusters?
One common heuristic is the elbow method: fit k-means for a range of values of k, plot the score (the negative within-cluster sum of squares) against k, and look for the "elbow" where adding more clusters stops improving the score noticeably.
In [67]:
# Elbow method: fit k-means for k = 1..9 and record the score for each k
scores = []
for i in range(1, 10):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(X)
    scores.append([i, kmeans.score(X)])
scores
Out[67]:
In [72]:
plt.figure(figsize=(12, 6))
# Plot the score against the number of clusters and look for the "elbow"
scores_arr = np.array(scores)
plt.plot(scores_arr[:, 0], scores_arr[:, 1])
plt.show()
In [37]:
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
In [61]:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris

# Scatter plot of the first two iris features, coloured by the true species
X, y = load_iris(return_X_y=True)
plt.title('The iris dataset')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.scatter(X[:, 0], X[:, 1], s=10, c=y, cmap='rainbow')

# Build the hierarchical clustering with Ward linkage
Z = linkage(X, 'ward')

# Full dendrogram: one leaf per sample
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z, leaf_rotation=90., leaf_font_size=8.)
plt.show()

# Truncated dendrogram: only the last merges are shown
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z, leaf_rotation=90., leaf_font_size=8., truncate_mode='lastp')
In [35]:
from sklearn.datasets import make_moons, make_circles
from sklearn.cluster import AgglomerativeClustering

f, ax = plt.subplots(1, 2, figsize=(13, 5))

# Agglomerative clustering on the two-moons data set
X, y = make_moons(1000, noise=.05)
cl = AgglomerativeClustering(n_clusters=2).fit(X)
ax[0].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow')

# Agglomerative clustering on the concentric-circles data set
X, y = make_circles(1000, factor=.5, noise=.05)
cl = AgglomerativeClustering(n_clusters=2).fit(X)
ax[1].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow')
In [34]:
from sklearn.datasets import make_moons, make_circles
from sklearn.cluster import SpectralClustering

f, ax = plt.subplots(1, 2, figsize=(13, 5))

# Spectral clustering with a nearest-neighbours affinity can recover
# non-convex cluster shapes such as moons and circles
X, y = make_moons(1000, noise=.05)
cl = SpectralClustering(n_clusters=2, affinity='nearest_neighbors').fit(X)
ax[0].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow')

X, y = make_circles(1000, factor=.5, noise=.05)
cl = SpectralClustering(n_clusters=2, affinity='nearest_neighbors').fit(X)
ax[1].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow')
In [33]:
from sklearn.datasets import make_moons, make_circles
from sklearn.cluster import DBSCAN

f, ax = plt.subplots(1, 2, figsize=(13, 5))

# DBSCAN groups points by density: no number of clusters is specified,
# only the neighbourhood radius eps
X, y = make_moons(1000, noise=.05)
cl = DBSCAN(eps=0.1).fit(X)
ax[0].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow')

X, y = make_circles(1000, factor=.5, noise=.05)
cl = DBSCAN(eps=0.1).fit(X)
ax[1].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow')
Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or computing the precision and recall of a supervised classification algorithm.
Some metrics take two label assignments as input, the predicted clusters and a ground-truth labelling, so they can only be used when true labels are available. Examples include the Adjusted Rand Index, mutual-information-based scores, and homogeneity, completeness, and V-measure.
Other metrics take just one label assignment together with the data, and can be used to score a clustering when no ground truth exists, e.g. the Silhouette Coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index.
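As a small illustration, here is a minimal sketch using scikit-learn's metrics module; the iris data set and the choice of 3 clusters are just for demonstration. The Adjusted Rand Index compares the predicted clusters against the true labels, while the Silhouette Coefficient needs only the data and the predicted labels.
In [ ]:
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3).fit_predict(X)

# Adjusted Rand Index: needs the ground-truth labels and the predicted clusters
print(adjusted_rand_score(y, labels))

# Silhouette Coefficient: needs only the data and the predicted clusters
print(silhouette_score(X, labels))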