In this notebook we'll cover clustering, which is a form of unsupervised learning. We'll use the simple k-means algorithm that is part of the mltools package.
The idea of k-means is simple: we try to find K centers that describe all the points in the space. We do this iteratively, by computing the "distances" from every point to the centers, assigning each point to its closest centroid, and then moving each centroid so that it fits its assigned points better.
I know this is pretty much the worst explanation ever, so go to the Wikipedia page. It's really good and there are animations too!!! :)
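Before we use the library, here's a minimal NumPy sketch of that assign-then-move loop. This is my own illustration of the idea, not mltools' actual implementation:

import numpy as np

def kmeans_sketch(X, K, max_iter=100):
    # Start from K random data points as the initial centers.
    c = X[np.random.choice(len(X), K, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every point to every center: shape (N, K).
        d = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        z = d.argmin(axis=1)            # assign each point to its closest center
        # Move each center to the mean of its assigned points
        # (empty clusters are left in place, for brevity).
        new_c = np.array([X[z == k].mean(axis=0) if np.any(z == k) else c[k]
                          for k in range(K)])
        if np.allclose(new_c, c):       # stop once the centers stop moving
            return z, new_c
        c = new_c
    return z, c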
In [1]:
# Import all required libraries
from __future__ import division # For python 2.*
import numpy as np
import matplotlib.pyplot as plt
import mltools as ml
np.random.seed(0)
%matplotlib inline
"""
Perform K-means clustering on data X.
Parameters
----------
X : numpy array
N x M array containing data to be clustered.
K : int
Number of clusters.
init : str or array (optional)
Either a K x M numpy array containing initial clusters, or
one of the following strings that specifies a cluster init
method: 'random' (K random data points (uniformly) as clusters),
'farthest' (choose cluster 1 uniformly, then the point farthest
from all clusters so far, etc.), or 'k++' (choose cluster 1
uniformly, then points randomly proportional to distance from
current clusters).
max_iter : int (optional)
Maximum number of optimization iterations.
Returns (as tuple)
-------
z : N x 1 array containing cluster numbers of data at indices in X.
c : K x M array of cluster centers.
sumd : (scalar) sum of squared Euclidean distances.
"""
In [2]:
np.random.seed(1)
X,Y = ml.datagen.data_GMM(500, 3, get_Z=True) # Random data distribution
plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.show()
In [3]:
n_clusters = 3
Z, mu, ssd = ml.cluster.kmeans(X, K=n_clusters, init='k++', max_iter=100)
Once the clusters are found, each point in the data is assigned the label of its nearest centroid; in our case there are 3 of them. And obviously the clustering can make mistakes.
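For instance, we can reproduce that assignment step by hand from the returned centers (a sketch using the X and mu defined above):

# Recompute each point's nearest centroid from mu; this should agree
# with the Z returned by kmeans, since Z is the final assignment
# to the returned centers.
dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
Z_check = dists.argmin(axis=1)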
In [5]:
mu
Out[5]:
In [6]:
fig, ax = plt.subplots(1, 2, figsize=(20, 8))
# Plotting the original data
ax[0].scatter(X[:, 0], X[:, 1], c=Y)
# Plotting the clustered data
ax[1].scatter(X[:, 0], X[:, 1], c=Z)
ax[1].scatter(mu[:, 0], mu[:, 1], s=500, marker='x', facecolor='black', lw=8) # Plotting the centroids
ax[1].scatter(mu[:, 0], mu[:, 1], s=30000, alpha=.45, c=np.unique(Z)) # Lazy way of plotting the cluster areas :)
plt.show()
This is an unsupervised learning algorithm. It did not know the identity of the classes beforehand, so the fact that some of you may get the same class arrangement but with different colors is not important. In this case it means that the yellow cluster could just as easily have been the one in the top left, and it would still mean the same thing.
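If you do want the colors to line up with the true classes for plotting, one common trick (a sketch, assuming Y holds integer-valued class labels) is to relabel each cluster by the majority true class among its points:

# Cosmetic relabeling only; the clustering itself is unchanged.
Zf, Yf = np.asarray(Z).ravel(), np.asarray(Y).ravel()  # flatten in case of N x 1 shapes
mapping = {k: np.bincount(Yf[Zf == k].astype(int)).argmax() for k in np.unique(Zf)}
Z_aligned = np.array([mapping[z] for z in Zf])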
Using the centroids we can also classify new points: a 1-nearest-neighbor classifier trained on the centroids assigns any point to the cluster of its closest centroid.
In [7]:
cluster_KNN = ml.knn.knnClassify(mu, np.arange(n_clusters), 1)
c = cluster_KNN.predict(X)
In [8]:
fig, ax = plt.subplots(1, 2, figsize=(20, 8))
# Plotting the clustered data
ax[0].scatter(X[:, 0], X[:, 1], c=Z)
ax[0].scatter(mu[:, 0], mu[:, 1], s=500, marker='x', facecolor='black', lw=8) # Plotting the centroids
ax[0].scatter(mu[:, 0], mu[:, 1], s=30000, alpha=.45, c=np.unique(Z)) # Lazy way of plotting the cluster areas :)
ax[1].scatter(X[:, 0], X[:, 1], c=c) # Plotting the 1-NN predictions
plt.show()
Next up is an interesting data set where each row is a grayscale image of a face: 24 x 24 pixels, with values from 0 (black) to 255 (white).
We are going to use this data set to show that k-means clustering can work on any vectors; the data doesn't have to be points in a low-dimensional continuous space like the example above.
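Concretely, each face is a flattened 24 x 24 image, i.e. a vector of 576 pixel values, and k-means compares two faces by plain Euclidean distance between those vectors (a tiny illustration with made-up pixel data):

# Two hypothetical "faces": 24x24 images flattened to length-576 vectors.
a = np.random.rand(24, 24).ravel()
b = np.random.rand(24, 24).ravel()
dist = np.sqrt(((a - b) ** 2).sum())  # Euclidean distance in pixel space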
In [9]:
X = np.genfromtxt("data/faces.txt", delimiter=None) # load face dataset
Let's see what these faces look like:
In [14]:
X[0]
Out[14]:
In [11]:
X.shape
Out[11]:
In [10]:
f, ax = plt.subplots(3, 5, figsize=(17, 13))
ax = ax.flatten()
# Plotting 15 random faces
for j in range(15):
    i = np.random.randint(X.shape[0])
    img = np.reshape(X[i, :], (24, 24))  # Reshape the flattened row back into a 24x24 patch
    # We've seen the imshow method in the previous discussion :)
    ax[j].imshow(img.T, cmap="gray")
plt.show()
In [12]:
n_clusters = 10
Zi, mui, ssdi = ml.cluster.kmeans(X, K=n_clusters, init='k++')
In [13]:
f, ax = plt.subplots(2, 5, figsize=(17, 8))
ax = ax.flatten()
for i in range(min(len(ax), n_clusters)):
    img = np.reshape(mui[i, :], (24, 24))
    ax[i].imshow(img.T, cmap="gray")
plt.show()
Each centroid is a kind of blurry face. That is the average of its group, and we can see that on average we all look like the same genderless, colorless face. Unless you have a mustache; then it depends on whether you're looking to your left or to your right :)
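You can sanity-check the "mean face" reading: at convergence, each centroid should approximately equal the average of the faces assigned to it (a quick check using the Zi and mui from above):

# Compare one centroid against the mean of its assigned faces;
# expect True if the run converged before hitting max_iter.
k = 0
cluster_mean = X[np.asarray(Zi).ravel() == k].mean(axis=0)
print(np.allclose(cluster_mean, mui[k]))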