Clustering is the task of grouping samples so that the members of each group are similar to one another according to some predefined similarity or dissimilarity measure (such as the Euclidean distance).
Let's re-use the results of the 2D PCA of the iris dataset in order to explore clustering. First we need to repeat some of the code from the previous notebook:
In [ ]:
    
# make sure ipython inline mode is activated
%pylab inline
    
In [ ]:
    
# all of this is taken from the notebook '03_iris_dimensionality.ipynb' 
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import pylab as pl
from itertools import cycle
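iris = load_iris()
X = iris.data
y = iris.target
# project the four iris measurements onto the first two principal components
pca = PCA(n_components=2, whiten=True).fit(X)
X_pca = pca.transform(X)
def plot_2D(data, target, target_names):
    """Scatter-plot 2D data with one color and legend entry per target value."""
    colors = cycle('rgbcmykw')
    target_ids = range(len(target_names))
    pl.figure()
    for i, c, label in zip(target_ids, colors, target_names):
        # select the rows whose target equals i and plot their two coordinates
        pl.scatter(data[target == i, 0], data[target == i, 1],
                   c=c, label=label)
    pl.legend()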
    
Now we will use one of the simplest clustering algorithms, K-means. This is an iterative algorithm that searches for a fixed number of cluster centers (three in our case) such that the sum of squared distances from each point to its assigned cluster center is minimized.
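To make the iteration concrete, here is a minimal NumPy sketch of the classic two-step loop (assign each point to its nearest center, then move each center to the mean of its points). It is an illustration only and not how scikit-learn implements KMeans; the function name kmeans_sketch, its parameters, and the toy data are all made up for this example.

```python
import numpy as np


def kmeans_sketch(points, n_clusters, n_iter=100, seed=0):
    """Minimal Lloyd-style K-means; illustrative only, not scikit-learn's code."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct samples chosen at random.
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every center.
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (a center that lost all of its points is left where it is).
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Recompute the labels so that they match the returned centers.
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    return centers, labels


# Toy data: three blobs of 30 points each (made up for this sketch).
toy_rng = np.random.default_rng(42)
blob_means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy = np.vstack([m + 0.3 * toy_rng.standard_normal((30, 2)) for m in blob_means])
centers, labels = kmeans_sketch(toy, n_clusters=3)
print(np.round(centers, 2))
```

Because the result depends on the random starting centers, a single run can get stuck in a poor local minimum. That is one reason to prefer the library implementation used in the next cell.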
In [ ]:
    
from sklearn.cluster import KMeans
from numpy.random import RandomState
rng = RandomState(42)
kmeans = KMeans(n_clusters=3, random_state=rng).fit(X_pca)
    
In [ ]:
    
import numpy as np
np.round(kmeans.cluster_centers_, decimals=2)
    
In [ ]:
    
kmeans.labels_[:10]
    
In [ ]:
    
kmeans.labels_[-10:]
    
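The K-means algorithm has been used to infer cluster labels for the
points.  Let's call the plot_2D function again, but color the points
based on the cluster labels rather than the iris species.  Keep in mind
that the cluster numbering is arbitrary, so the same group of points may
be drawn in a different color in each plot.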
In [ ]:
    
# first figure: clusters found by K-means
plot_2D(X_pca, kmeans.labels_, ["c0", "c1", "c2"])
# second figure: true iris species, for comparison
plot_2D(X_pca, iris.target, iris.target_names)
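Comparing the two figures by eye can be complemented with a simple count. The sketch below builds a table of true classes against cluster ids, which does not depend on how the clusters happen to be numbered. The helper contingency_table and the demo label arrays are hypothetical and use made-up values, not the iris results.

```python
import numpy as np


def contingency_table(true_labels, cluster_labels):
    """Count how many samples of each true class land in each cluster."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n_classes = true_labels.max() + 1
    n_found = cluster_labels.max() + 1
    table = np.zeros((n_classes, n_found), dtype=int)
    # Rows index the true class, columns index the cluster id.
    np.add.at(table, (true_labels, cluster_labels), 1)
    return table


# Made-up labels: the clusters match the classes up to a renaming,
# except for one sample of class 1 that is grouped with class 2.
true_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_demo = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])
table = contingency_table(true_demo, cluster_demo)
print(table)

# Fraction of samples that belong to the majority class of their cluster.
purity = table.max(axis=0).sum() / table.sum()
print(purity)
```

In this notebook the same helper could be called as contingency_table(iris.target, kmeans.labels_). A table with one dominant entry per column indicates clusters that line up well with the species.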
    
Exercise: perform the K-means cluster search again, but this time learn
the clusters using the full data matrix X, rather than the projected
matrix X_pca.  Does this change the results?  Do these labels
look closer to the true labels?
In [ ]: