K-Means Tutorial

K-means is an example of unsupervised learning through clustering. It tries to separate unlabeled data into groups of roughly equal variance by minimizing the within-cluster sum of squared distances (the "inertia"). In two dimensions, you can picture this as grouping the data into roughly circular regions around each cluster center.

There are three steps to training a K-means classifier:

  1. Pick how many groups you want it to use and (randomly) assign a starting centroid (center point) to each cluster.
  2. Assign each data point to the group with the closest centroid.
  3. Find the mean value of each feature over all the points assigned to each cluster. This mean becomes the new centroid for that cluster.

Steps 2 and 3 repeat until the cluster centroids do not move significantly.
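
This loop can be written directly in a few lines of NumPy. The sketch below only illustrates the three steps above (random starting centroids, Euclidean distances, a fixed tolerance); it is not the implementation scikit-learn uses, and the function name simple_kmeans and its arguments are made up for this example.

import numpy as np

def simple_kmeans(X, n_clusters, max_iter=100, tol=1e-4, seed=0):
    """Illustrative K-means loop following the three steps above."""
    rng = np.random.RandomState(seed)
    # Step 1: pick random data points as the starting centroids
    centroids = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == k].mean(axis=0)
                                  for k in range(n_clusters)])
        # Stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids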

Scikit-learn provides more information on the K-means function KMeans. They also have an example of using K-means to cluster handwritten digits.

Setup

Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), pylab (for plotting figures), ListedColormap and Rectangle (for the plot colors and legend), datasets (to load the iris dataset from scikit-learn), and KMeans (the scikit-learn K-means implementation).


In [1]:
# Print figures in the notebook
%matplotlib inline 

print(__doc__)

import numpy as np
import pylab as pl
from matplotlib.colors import ListedColormap
from matplotlib.patches import Rectangle # Used to draw the legend color swatches
from sklearn import datasets # Import the dataset from scikit-learn
from sklearn.cluster import KMeans # Import the KMeans classifier


Automatically created module for IPython interactive environment

Import your data

Import the iris dataset through scikit-learn. Scikit-learn's documentation also includes its own description of the dataset.

The dataset consists of measurements made on 50 examples from each of three different species of iris flowers (Setosa, Versicolour, and Virginica). Each example has four features (or measurements): sepal length, sepal width, petal length, and petal width. All measurements are in cm.

Below, we import the first two features from the dataset (sepal length and width) and store them in X. Normally we would try to use all useful features, but sticking with two allows us to visualize the data more easily. The labels (which species of iris) are stored in y.

Then we plot the data, to get a look at what we're dealing with. The colormap is used to determine what colors are used for each class when plotting.


In [2]:
# Import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features,
                      # sepal length and sepal width, and store them in X
y = iris.target       # Labels are stored in y as numbers
labelNames = ['Setosa', 'Versicolour', 'Virginica'] # Species names corresponding to labels 0, 1, and 2

# Plot the data

# Create color maps
cmap_bg = ListedColormap(['#EEEEEE', '#CCCCCC', '#AAAAAA']) # Use shades of grey instead of colors
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# Get the minimum and maximum values with an additional 0.5 border
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

pl.figure(figsize=(8, 6))
pl.clf()

# Plot the training points
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
pl.xlabel('Sepal length (cm)')
pl.ylabel('Sepal width (cm)')

# Set the plot limits
pl.xlim(x_min, x_max)
pl.ylim(y_min, y_max)

# Create a legend for the colors, using rectangles for the corresponding colormap colors
r = Rectangle((0, 0), 1, 1, fc='#FF0000')
g = Rectangle((0, 0), 1, 1, fc='#00FF00')
b = Rectangle((0, 0), 1, 1, fc='#0000FF')
pl.legend([r, g, b], labelNames)

pl.show()


K-means: training

Next, we train a K-means model on our data.

The first section chooses the number of clusters to use, and stores it in the variable n_clusters. We choose 3 because we know there are 3 species of iris, but we don't always know this when approaching a machine learning problem.

The last two lines create and train the classifier.

The first line creates a classifier (kmeans) using the KMeans() function, and tells it to use the number of clusters stored in n_clusters. The second line uses the fit() method to train the classifier on the features in X. Notice that because this is an unsupervised method, it does not use the labels stored in y.


In [3]:
# Choose your number of clusters
n_clusters = 3

# we create an instance of KMeans Classifier and fit the data.
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X)


Out[3]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
    verbose=0)

Plot the classification boundaries

Now that we have our classifier, let's visualize what it's doing.

First we plot the decision boundaries, or the lines dividing areas assigned to the different clusters. The background shows the areas that are considered to belong to a certain cluster, and each cluster can then be assigned to a species of iris. They are plotted in grey, because the classifier does not assign labels to the clusters. The center of each cluster is plotted as a black x. Then we plot our examples onto the space, showing where each point lies in relation to the decision boundaries.

If we took sepal measurements from a new flower, we could plot it in this space and use the background shade to determine which cluster our classifier would assign it to.


In [4]:
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point
                                                  # in the mesh in order to find the
                                                  # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
pl.figure(figsize=(8, 6))
pl.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
pl.xlim(xx.min(), xx.max())
pl.ylim(yy.min(), yy.max())
pl.title("KMeans (k = %i)"
         % (n_clusters))
pl.xlabel('Sepal length (cm)')
pl.ylabel('Sepal width (cm)')

# Plot the centroids as a black X
centroids = kmeans.cluster_centers_
pl.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='k', zorder=10)

# Create a legend for the colors, using previously defined values
pl.legend([r, g, b], labelNames)

pl.show()


Analyzing the clusters

As you can see in the previous plot, K-means does a good job of separating the Setosa species (red) into its own cluster. It also does a reasonable job separating Versicolour (green) and Virginica (blue), although there is a considerable amount of overlap that it can't pick up.

This is an example where it is important to understand your data (and visualize it whenever possible), as well as understand your machine learning model. In this example, you may want to use a different machine learning model that can separate the data more accurately. Alternatively, we could use all four features to see if that improves accuracy (remember, we aren't using petal length or width here for easier data visualization).
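
As a rough check on that last idea, we can refit K-means on all four measurements and compare each clustering against the true species labels. The snippet below is a sketch that assumes iris, X, y, and KMeans from the earlier cells are still in scope; it uses scikit-learn's adjusted_rand_score, which measures how well two labelings agree while ignoring the arbitrary cluster numbering.

from sklearn.metrics import adjusted_rand_score

# K-means on the two sepal features used above
labels_2d = KMeans(n_clusters=3).fit_predict(X)

# K-means on all four measurements
labels_4d = KMeans(n_clusters=3).fit_predict(iris.data)

# Compare each clustering with the true species labels (1.0 = perfect agreement)
print('Two features:  %.2f' % adjusted_rand_score(y, labels_2d))
print('Four features: %.2f' % adjusted_rand_score(y, labels_4d))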

Making Predictions

Now, let's say we go out and measure the sepals of two iris plants and want to know what species they are. We're going to use our classifier to predict the cluster for two flowers with the following measurements:

Plant   Sepal length (cm)   Sepal width (cm)
A       4.3                 2.5
B       6.3                 2.1

We can use our classifier's predict() function to predict the label for our input features. We pass the variable examples to the predict() function; examples is a list in which each element is another list containing the features (measurements) for one particular example. The output is a list of cluster labels corresponding to the input examples.

We'll also plot them on the boundary plot, to show why they were predicted that way.


In [5]:
# Add our new data examples
examples = [[4.3, 2.5], # Plant A
            [6.3, 2.1]] # Plant B

# Choose your number of clusters
n_clusters = 3

# we create an instance of KMeans Classifier and fit the data.
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X)

# Predict the labels for our new examples
labels = kmeans.predict(examples)

# Print the predicted species names
print('A: Cluster ' + str(labels[0]))
print('B: Cluster ' + str(labels[1]))

# Now plot the results
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
pl.figure(figsize=(8, 6))
pl.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
pl.xlim(xx.min(), xx.max())
pl.ylim(yy.min(), yy.max())
pl.title("KMeans (k = %i)"
         % (n_clusters))
pl.xlabel('Sepal length (cm)')
pl.ylabel('Sepal width (cm)')

# Plot the centroids as a black X
centroids = kmeans.cluster_centers_
pl.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='k', zorder=10)

# Display the new examples as labeled text on the graph
pl.text(examples[0][0], examples[0][1],'A', fontsize=14)
pl.text(examples[1][0], examples[1][1],'B', fontsize=14)

# Create a legend for the colors, using previously defined values
pl.legend([r, g, b], labelNames)

pl.show()


A: Cluster 0
B: Cluster 2

As you can see, example A is grouped into Cluster 0, which covers the region containing mostly Setosa plants, and example B is grouped into Cluster 2, which covers the region containing mostly Versicolour plants. (The numeric cluster labels themselves are arbitrary and can change between runs, because K-means starts from randomly chosen centroids.) Remember, K-means does not use labels. It only clusters the data by feature similarity, and it's up to us to decide what the clusters mean (or whether they mean anything at all).
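
If we do want a human-readable name for each cluster, one simple approach is to look at which true species is most common among the points assigned to it and use that species as the cluster's name. This only works because we happen to have the true labels on hand; the sketch below assumes kmeans, X, y, n_clusters, and labelNames from the earlier cells.

# Find the most common true species within each cluster
cluster_assignments = kmeans.predict(X)
for cluster in range(n_clusters):
    species_counts = np.bincount(y[cluster_assignments == cluster], minlength=3)
    print('Cluster %d is mostly %s' % (cluster, labelNames[species_counts.argmax()]))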

Final Notes

Some final things to keep in mind when using K-means to cluster your data:

  • K-means is unsupervised, meaning it clusters data by similarity of features and does not require (or even use) labels.
  • How well it works depends partly on choosing the right number of clusters for the dataset. You can do this using your knowledge of the data (as we did, knowing we are looking at 3 species of plant). Alternatively, there are ways to estimate a good number of clusters experimentally (see the sketch after this list).
  • The output does not provide a meaningful label, only a cluster assignment for the data. It is up to you to determine the meaning of each cluster.
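
One common heuristic for choosing the number of clusters experimentally is the "elbow" method: fit K-means for a range of values of k, plot the inertia (the within-cluster sum of squared distances, stored in the inertia_ attribute after fitting), and look for the value of k beyond which adding more clusters stops helping much. A minimal sketch, assuming X, KMeans, and pl from the earlier cells:

# Fit K-means for k = 1 .. 10 and record the inertia of each fit
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k)
    km.fit(X)
    inertias.append(km.inertia_)

# Plot inertia against k and look for the "elbow" in the curve
pl.figure(figsize=(8, 6))
pl.plot(list(k_values), inertias, 'o-')
pl.xlabel('Number of clusters (k)')
pl.ylabel('Inertia (within-cluster sum of squares)')
pl.title('Choosing k with the elbow method')
pl.show()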
