Unsupervised learning

In unsupervised learning problems, we only have input data x(i) with no labels, and we want the algorithm to find some structure in the data. A clustering algorithm such as k-means attempts to group the data into k “clusters” whose members are similar to each other.

Examples: Social Network Analysis, Organizing Computer Clusters, and Astronomical Data Analysis.

Clustering

Given a set of data points, group them into some clusters so that:

  • points within each cluster are similar to each other
  • points from different clusters are dissimilar

Similarity is defined using a distance measure:

  • Euclidean
  • Cosine
  • Edit distance
  • ...
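
A quick illustration of how the choice of measure changes what “similar” means (a minimal sketch using SciPy's distance functions; the two vectors are made up for this example):

In [ ]:
from scipy.spatial.distance import euclidean, cosine

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

# Euclidean: straight-line distance between the two points
print(euclidean(a, b))   # ~3.74

# Cosine distance = 1 - cosine similarity; ~0 because a and b point in the
# same direction, even though they are far apart in Euclidean terms
print(cosine(a, b))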

Different Types of Clustering

  • Connectivity based (Hierarchical) clustering:
    • Agglomerative
  • Centroid-based clustering
    • K-means
  • Distribution-based clustering
    • Gaussian Mixture
  • Density-based clustering
    • DBSCAN

K-Means

The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm's inputs are the number of clusters K and the data set, a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the data set. The algorithm then iterates between two steps:

1. Data assignment step:

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if $c_i$ is a centroid in the collection of centroids $C$, then each data point $x$ is assigned to the cluster whose centroid is

$$\underset{c_i \in C}{\arg\min} \; dist(c_i,x)^2$$

2. Centroid update step:

In this step, the centroids are recomputed: each centroid $c_i$ is updated to the mean of the set $S_i$ of data points assigned to its cluster.

$$c_i=\frac{1}{|S_i|}\sum_{x_i \in S_i} x_i$$
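
A rough illustration of these two steps (a minimal NumPy sketch for intuition only, not the scikit-learn implementation used below; it does not handle empty clusters):

In [ ]:
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.RandomState(seed)
    # initial estimates: randomly select k points from the data set as centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # 1. assignment step: squared Euclidean distance to every centroid,
        #    each point is assigned to the nearest one
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 2. update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return labels, centroids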



In [1]:
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans


plt.figure(figsize=(12, 12))

n_samples = 1500
random_state = 1760
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

kmeans = KMeans(n_clusters=3)
y_pred = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()



In [63]:
kmeans.score(X)


Out[63]:
-3117.5261895753729

In [64]:
plt.figure(figsize=(12, 12))
kmeans = KMeans(n_clusters=2)
y_pred = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()



In [65]:
kmeans.score(X)


Out[65]:
-12838.655661844503

In [3]:
plt.figure(figsize=(12, 12))
kmeans = KMeans(n_clusters=5)
y_pred = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()


There is a problem: how do we choose the best number of clusters?

Elbow method

Fit K-means for a range of values of k and plot the score against k (kmeans.score returns the negative within-cluster sum of squared distances, so higher is better). The best k lies at the “elbow” of the curve: the point after which adding more clusters gives only a small improvement.


In [67]:
scores = []
for i in range(1,10):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(X)
    scores.append([i,kmeans.score(X)])
scores


Out[67]:
[[1, -37601.123361442238],
 [2, -12838.655661844503],
 [3, -3117.5261895753729],
 [4, -2757.2398221452713],
 [5, -2406.1311268017935],
 [6, -2076.4998169922192],
 [7, -1826.6696786768371],
 [8, -1582.8296430226019],
 [9, -1393.0509549543185]]

In [72]:
scores = np.array(scores)
plt.figure(figsize=(12, 6))
plt.plot(scores[:, 0], scores[:, 1])
plt.show()



In [37]:
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline

Hierarchical (Agglomerative) Clustering

  1. Initially, each point is a cluster by itself.
  2. Repeatedly combine the two nearest clusters into one.
  3. Stop when there is just one cluster left.

In [61]:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
plt.title('The iris dataset')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.scatter(X[:, 0], X[:, 1], s=10, c=y, cmap='rainbow')

Z = linkage(X, 'ward')

plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')

dendrogram(Z, leaf_rotation=90., leaf_font_size=8.);
plt.show();

plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(Z, leaf_rotation=90., leaf_font_size=8., truncate_mode='lastp');
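
The dendrogram shows the full merge history; to obtain an actual flat clustering we can cut the tree at a chosen number of clusters. A small follow-up example using SciPy's fcluster on the same linkage matrix Z (cutting into 3 clusters here is just for illustration):

In [ ]:
from scipy.cluster.hierarchy import fcluster

# stop the merging at 3 clusters instead of going all the way down to one
labels = fcluster(Z, t=3, criterion='maxclust')

plt.figure()
plt.title('Iris cut into 3 clusters')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.scatter(X[:, 0], X[:, 1], s=10, c=labels, cmap='rainbow')
plt.show()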



In [35]:
from sklearn.datasets import make_moons, make_circles
from sklearn.cluster import AgglomerativeClustering

f, ax = plt.subplots(1, 2, figsize=(13, 5))

X, y = make_moons(1000, noise=.05)
cl = AgglomerativeClustering(n_clusters=2).fit(X)
ax[0].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow');

X, y = make_circles(1000, factor=.5, noise=.05)
cl = AgglomerativeClustering(n_clusters=2).fit(X)
ax[1].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow');


Spectral Clustering

Spectral clustering builds an affinity (similarity) graph over the data points, embeds them using the eigenvectors of the graph Laplacian, and then runs a standard algorithm such as k-means in that embedding. Because it works on graph connectivity rather than on distance to a centroid, it can recover non-convex clusters such as the moons and circles below.


In [34]:
from sklearn.datasets import make_moons, make_circles
from sklearn.cluster import SpectralClustering

f, ax = plt.subplots(1, 2, figsize=(13, 5))

X, y = make_moons(1000, noise=.05)
cl = SpectralClustering(n_clusters=2, affinity='nearest_neighbors').fit(X)
ax[0].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow');

X, y = make_circles(1000, factor=.5, noise=.05)
cl = SpectralClustering(n_clusters=2, affinity='nearest_neighbors').fit(X)
ax[1].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow');


/usr/local/lib/python3.5/dist-packages/sklearn/manifold/spectral_embedding_.py:234: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
  warnings.warn("Graph is not fully connected, spectral embedding"

DBSCAN

DBSCAN is a density-based method: points with at least min_samples neighbours within distance eps are core points, core points within eps of each other are linked into the same cluster, and points that end up in no cluster are labelled as noise (-1). Unlike K-means, the number of clusters does not have to be specified in advance.


In [33]:
from sklearn.datasets import make_moons, make_circles
from sklearn.cluster import DBSCAN

f, ax = plt.subplots(1, 2, figsize=(13, 5))

X, y = make_moons(1000, noise=.05)
cl = DBSCAN(eps=0.1).fit(X)
ax[0].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow');

X, y = make_circles(1000, factor=.5, noise=.05)
cl = DBSCAN(eps=0.1).fit(X)
ax[1].scatter(X[:, 0], X[:, 1], c=cl.labels_, s=10, cmap='rainbow');


Evaluation

Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or the precision and recall of a supervised classification algorithm.

Some metrics take two label assignments as input. They can be used for:

  • Comparison with ground truth (almost never available in practice)
  • Agreement of two assignments

Some of them are:

  • Adjusted Rand Index
  • Mutual Information based scores
  • Homogeneity, completeness and V-measure
  • Fowlkes-Mallows scores

Other metrics take just a single label assignment (together with the data) and can be used to score a clustering without any ground truth, e.g. (see the short example after this list):

  • Silhouette Coefficient
  • Calinski-Harabasz Index
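
A short example of both kinds of metric (a minimal sketch; it refits K-means on blob data like the data generated at the top of this notebook, so that a ground truth y is available):

In [ ]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y = make_blobs(n_samples=1500, random_state=1760)
labels = KMeans(n_clusters=3).fit_predict(X)

# one label assignment + the data: no ground truth needed
print(silhouette_score(X, labels))

# two label assignments: here the ground truth y vs. the clustering
print(adjusted_rand_score(y, labels))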