K-Means Tutorial

K-means is an example of unsupervised learning through clustering. It tries to separate unlabeled data into clusters of roughly equal variance by assigning each point to its nearest cluster center. In two dimensions, this can be loosely visualized as grouping data using circular areas of similar size.

There are three steps to training a K-means classifier:

  1. Pick how many groups you want it to use and (randomly) assign a starting centroid (center point) to each cluster.
  2. Assign each data point to the group with the closest centroid.
  3. Find the mean value of each feature (the middle point of the cluster) for all the points assigned to each cluster. This is the new centroid for that cluster.

Steps 2 and 3 repeat until the cluster centroids do not move significantly.
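
To make these steps concrete, here is a minimal NumPy sketch of the algorithm (just an illustration of steps 1-3, not scikit-learn's implementation; it assumes every cluster keeps at least one point, and the function name simple_kmeans is only for illustration):


In [ ]:
import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the group with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids no longer move significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels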

Scikit-learn provides more information on the KMeans clustering function in its documentation. It also has an example of using K-means to cluster handwritten digits.

Setup

Tell matplotlib to display figures in the notebook. Then import numpy (for numerical data), pyplot (for plotting figures), ListedColormap (for plotting colors), KMeans (for the scikit-learn K-means algorithm), and datasets (to load the iris dataset from scikit-learn).

Also create the color maps used to color the plotted data, and "labelList", a list of colored rectangles to use in the plot legends.


In [ ]:
# Print figures in the notebook
%matplotlib inline 

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets # Import the dataset from scikit-learn
from sklearn.cluster import KMeans # Import the KMeans classifier

# Import patch for drawing rectangles in the legend
from matplotlib.patches import Rectangle

# Create color maps
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
cmap_bg = ListedColormap(['#333333', '#666666', '#999999'])

# Create a legend for the colors, using rectangles for the corresponding colormap colors
labelList = []
for color in cmap_bold.colors:
    labelList.append(Rectangle((0, 0), 1, 1, fc=color))

Visualizing K-Means

Let's start by visualizing the steps involved in K-means clustering. We'll create a simple toy dataset and cluster it into two groups. Don't worry if you aren't sure what the code is doing yet; the important part is the visualizations. We'll cover how to use K-means afterwards.

Below, we create the toy dataset, randomly select 2 starting locations, and plot them together. First we set the seed to 42 so everyone should see the same results.

Then we create half of the x,y co-ordinates using one random distribution, and the other half from a second distribution. These are joined into a single dataset.

Afterwards, we randomly select the x,y co-ordinates for the 2 starting locations and plot these along with the data points.


In [ ]:
# Set the random seed so everyone sees the same results
np.random.seed(42)

# Two clouds of 40 points each, centered around (1, 3) and (5, 2)
data1 = np.random.randn(40,2) + np.array([1,3])
data2 = np.random.randn(40,2) + np.array([5,2])
data = np.concatenate([data1, data2], axis=0)

# Randomly choose 2 starting centroids near (-1, 3)
centroids = np.random.randn(2,2) + np.array([-1,3])

# Plot the data
plt.scatter(data[:, 0], data[:, 1])
plt.title("Initial Configuration")

# Plot the centroids as a red X
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='r', zorder=10)

plt.show()

Currently the centroids aren't in great locations; they aren't even really within the data. But in our next step, we assign each data point to the closest centroid. You can see how they're assigned in the plot below.


In [ ]:
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = np.zeros(xx.shape)

# Assign each mesh point to whichever of the two centroids is closer
for row in range(Z.shape[0]):
    for col in range(Z.shape[1]):
        d1 = np.linalg.norm(np.array([xx[row,col], yy[row,col]]) - centroids[0,:])
        d2 = np.linalg.norm(np.array([xx[row,col], yy[row,col]]) - centroids[1,:])
        
        if d2 < d1:
            Z[row,col] = 1


# Put the result into a color plot
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
plt.scatter(data[:, 0], data[:, 1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Assigned to Clusters")

# Plot the centroids as a red X
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='r', zorder=10)

plt.show()

Once all the points are assigned to a cluster, the centroids move to the average x and y value of their assigned points. Below, the old locations are shown as a black X, and the new locations are red Xs.


In [ ]:
kmeans = KMeans(n_clusters=2, init=centroids, max_iter=1, n_init=1)
kmeans.fit(data)

# Put the result into a color plot
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
plt.scatter(data[:, 0], data[:, 1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Centroids Relocated")

# Plot old centroids as a black X
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='k', zorder=10)

# Plot the centroids as a red X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='r', zorder=10)

plt.show()

Next, the data points are re-assigned to their new closest centroids.


In [ ]:
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point 
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
plt.scatter(data[:, 0], data[:, 1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Re-assigned to New Clusters")

# Plot the centroids as a red X
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='r', zorder=10)

plt.show()

We repeat these two steps until we've reached a maximum number of iterations or the centroids no longer move significantly. We'll show a few more iterations so you can see what happens.
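
(As an aside, you normally don't iterate by hand like this: scikit-learn's KMeans repeats the assign/update steps internally until it reaches max_iter iterations or the centroids move less than tol between iterations. A quick sketch, using a new variable name kmeans_full so it doesn't interfere with the step-by-step demo that continues below:)


In [ ]:
# Let KMeans run the assign/update loop to convergence on its own.
# max_iter caps the number of iterations; tol is the smallest centroid
# movement that still counts as "significant".
kmeans_full = KMeans(n_clusters=2, init=centroids, n_init=1,
                     max_iter=300, tol=1e-4)
kmeans_full.fit(data)

print("Iterations until convergence:", kmeans_full.n_iter_)
print("Final centroids:\n", kmeans_full.cluster_centers_)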


In [ ]:
kmeans = KMeans(n_clusters=2, init=centroids, max_iter=1, n_init=1)
kmeans.fit(data)

# Put the result into a color plot
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
plt.scatter(data[:, 0], data[:, 1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Centroids Relocated")

# Plot the old centroids as a black X
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='k', zorder=10)

# Plot the centroids as a red X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='r', zorder=10)

plt.show()

h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point 
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
plt.scatter(data[:, 0], data[:, 1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Re-assigned to New Clusters")

# Plot the centroids as a red X
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='r', zorder=10)

plt.show()

kmeans = KMeans(n_clusters=2, init=centroids, max_iter=1, n_init=1)
kmeans.fit(data)

# Put the result into a color plot
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
plt.scatter(data[:, 0], data[:, 1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Centroids Relocated")

# Plot the old centroids as a black X
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='k', zorder=10)

# Plot the centroids as a red X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='r', zorder=10)

plt.show()

h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point 
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
plt.scatter(data[:, 0], data[:, 1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Re-assigned to New Clusters")

# Plot the centroids as a red X
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='r', zorder=10)

plt.show()

Import the dataset

Import the dataset and store it in a variable called iris. Scikit-learn provides an explanation of the dataset in its documentation. This dataset is similar to a Python dictionary, with the keys: ['DESCR', 'target_names', 'target', 'data', 'feature_names']

The data features are stored in iris.data, where each row is an example from a single flower, and each column is a single feature. The feature names are stored in iris.feature_names. Labels are stored as the numbers 0, 1, or 2 in iris.target, and the names of these labels are in iris.target_names.

The dataset consists of measurements made on 50 examples from each of three different species of iris flowers (Setosa, Versicolour, and Virginica). Each example has four features (or measurements): sepal length, sepal width, petal length, and petal width. All measurements are in cm.

Below, we load the labels into y, the corresponding label names into labelNames, the data into X, and the names of the features into featureNames.


In [ ]:
# Import some data to play with
iris = datasets.load_iris()

# Store the labels (y), label names, features (X), and feature names
y = iris.target       # Labels are stored in y as numbers
labelNames = iris.target_names # Species names corresponding to labels 0, 1, and 2
X = iris.data
featureNames = iris.feature_names

Below, we plot the first two features from the dataset (sepal length and width). Normally we would try to use all useful features, but sticking with two allows us to visualize the data more easily.

Then we plot the data to get a look at what we're dealing with. The colormap is used to determine what colors are used for each class when plotting.


In [ ]:
# Plot the data

# Sepal length and width
X_small = X[:,:2]
# Get the minimum and maximum values with an additional 0.5 border
x_min, x_max = X_small[:, 0].min() - .5, X_small[:, 0].max() + .5
y_min, y_max = X_small[:, 1].min() - .5, X_small[:, 1].max() + .5

plt.figure(figsize=(8, 6))

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Sepal width vs length')

# Set the plot limits
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

# Plot the legend
plt.legend(labelList, labelNames)

plt.show()

Unlabeled data

K-means is an unsupervised learning method, which means it doesn't make use of data labels. This is useful when we're exploring a new dataset. We may not have labels for this dataset, but we want to see how it is grouped together and what examples are most similar to each other. Below we plot the data again, but this time without any labels. This is what k-means "sees" when we use it.


In [ ]:
# Plot the data

# Sepal length and width
X_small = X[:,:2]
# Get the minimum and maximum values with an additional 0.5 border
x_min, x_max = X_small[:, 0].min() - .5, X_small[:, 0].max() + .5
y_min, y_max = X_small[:, 1].min() - .5, X_small[:, 1].max() + .5

plt.figure(figsize=(8, 6))

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1])
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Sepal width vs length')

# Set the plot limits
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

plt.show()

K-means: training

Next, we train a K-means classifier on our data.

The first section chooses the number of clusters to use, and stores it in the variable n_clusters. We choose 3 because we know there are 3 species of iris, but we don't always know this when approaching a machine learning problem.

The last two lines create and train the classifier.

The first line creates a classifier (kmeans) using the KMeans() function, and tells it to use the number of clusters stored in n_clusters. The second line uses the fit() method to train the classifier on the features in X_small. Notice that because this is an unsupervised method, it does not use the labels stored in y.


In [ ]:
# Choose your number of clusters
n_clusters = 3

# we create an instance of KMeans Classifier and fit the data.
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X_small)

Plot the classification boundaries

Now that we have our classifier, let's visualize what it's doing.

First we plot the decision boundaries, or the lines dividing areas assigned to the different clusters. The background shows the areas that are considered to belong to a certain cluster, and each cluster can then be assigned to a species of iris. They are plotted in grey, because the classifier does not assign labels to the clusters. The center of each cluster is plotted as a black x. Then we plot our examples onto the space, showing where each point lies in relation to the decision boundaries.

If we took sepal measurements from a new flower, we could plot it in this space and use the background shade to determine which cluster of data points our classifier would assign to it.


In [ ]:
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point 
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("KMeans (k = %i)"
         % (n_clusters))
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Plot the centroids as a black X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='k', zorder=10)

plt.show()

Cheating with labels

Because we do have labels for this dataset, let's see how well k-means did at separating the three species.


In [ ]:
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point 
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("KMeans (k = %i)"
         % (n_clusters))
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Plot the centroids as a black X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='k', zorder=10)

# Plot the legend
plt.legend(labelList, labelNames)

plt.show()

Analyzing the clusters

As you can see in the previous plots, K-means does a good job of separating the Setosa species (red) into its own cluster. It also does a reasonable job of separating Versicolour (green) and Virginica (blue), although there is a considerable amount of overlap between those two that it cannot separate properly.

This is an example where it is important to understand your data (and visualize it whenever possible), as well as understand your machine learning model. In this example, you may want to use a different machine learning model that can separate the data more accurately. Alternatively, we could use all four features to see if that improves accuracy (remember, we aren't using petal length or width here for easier data visualization).
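
Since we do have the labels here, one way to put a number on this overlap is to compare the cluster assignments against y using the adjusted Rand index from scikit-learn (1.0 means the clusters match the species perfectly, values near 0 mean the grouping is essentially random). A short sketch, using the kmeans object fit on X_small above:


In [ ]:
from sklearn.metrics import adjusted_rand_score

# Compare the cluster assignments to the true species labels.
# The adjusted Rand index ignores the arbitrary numbering of the clusters.
cluster_assignments = kmeans.predict(X_small)
print("Adjusted Rand index:", adjusted_rand_score(y, cluster_assignments))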

Changing the number of clusters

What would happen if you changed the number of clusters? What would the plot look like with 2 clusters, or 5? Based on the unlabeled data, how would you try to determine the number of classes to use?

In the next block of code, try changing the number of clusters and seeing what happens. You may need to change the number of colors represented in cmap_bg to match the number of classes you are using. (One common heuristic for picking the number of clusters, the elbow method, is sketched after this block.)


In [ ]:
# Your code goes here!
cmap_bg = ListedColormap(['#111111','#333333', '#555555', '#777777', '#999999'])
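
One common heuristic for choosing the number of clusters is the elbow method: fit K-means for several values of k and look for the point where the total within-cluster sum of squared distances (stored in the inertia_ attribute) stops dropping quickly. A rough sketch:


In [ ]:
# Elbow method: fit K-means for several values of k and plot the
# total within-cluster sum of squared distances (inertia_)
ks = range(1, 8)
inertias = [KMeans(n_clusters=k).fit(X_small).inertia_ for k in ks]

plt.figure(figsize=(8, 6))
plt.plot(ks, inertias, 'o-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()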

Making Predictions

Now, let's say we go out and measure the sepals of two iris plants, and want to know which group they belong to. We're going to use our classifier to predict the clusters for flowers with the following measurements:

Plant    Sepal length (cm)    Sepal width (cm)
A        4.3                  2.5
B        6.3                  2.1

We can use our classifier's predict() function to predict the label for our input features. We pass the variable examples to the predict() function; examples is a list in which each element is another list containing the features (measurements) for a particular example. The output is a list of cluster labels corresponding to the input examples.

We'll also plot them on the boundary plot, to show why they were predicted that way.


In [ ]:
# Add our new data examples
examples = [[4.3, 2.5], # Plant A
            [6.3, 2.1]] # Plant B

# Choose your number of clusters
n_clusters = 3

# we create an instance of KMeans Classifier and fit the data.
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X_small)

# Predict the labels for our new examples
labels = kmeans.predict(examples)

# Print the predicted species names
print('A: Cluster ' + str(labels[0]))
print('B: Cluster ' + str(labels[1]))

# Now plot the results
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_bg)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("KMeans (k = %i)"
         % (n_clusters))
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Plot the centroids as a black X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
           marker='x', s=169, linewidths=3,
           color='k', zorder=10)

# Display the new examples as labeled text on the graph
plt.text(examples[0][0], examples[0][1],'A', fontsize=14)
plt.text(examples[1][0], examples[1][1],'B', fontsize=14)

plt.show()

As you can see, examples A and B are assigned to different clusters (the exact cluster numbers can vary from run to run, since the numbering depends on the random initialization). Remember, K-means does not use labels. It only clusters the data by feature similarity, and it's up to us to decide what the clusters mean (or whether they mean anything at all).
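
Because we happen to know the species for the training data, one way to give the clusters human-readable names is to label each cluster with the most common true species among its points. This is purely for our interpretation (K-means itself never sees y), and it assumes each cluster contains at least one training point:


In [ ]:
# Map each cluster to the most common true species among its points,
# then report the predictions for plants A and B as species names
cluster_to_species = {}
for c in range(n_clusters):
    members = y[kmeans.labels_ == c]   # true labels of the points in cluster c
    cluster_to_species[c] = labelNames[np.bincount(members).argmax()]

print('A:', cluster_to_species[labels[0]])
print('B:', cluster_to_species[labels[1]])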

Using different features

Try using different combinations of the four features and see what results you get (a starting sketch using the petal measurements appears after the next code block). Does it make it any easier to determine how many clusters should be used?


In [ ]:
# Your code goes here!
#cmap_bg = ListedColormap(['#111111','#333333', '#555555', '#777777', '#999999'])
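
As a starting point, here is a sketch that clusters on the last two features instead (petal length and width, columns 2 and 3 of X); you can swap in any other pair of column indices, and the variable names here are just for illustration:


In [ ]:
# Cluster on petal length and petal width (columns 2 and 3) instead
X_petal = X[:, 2:4]

kmeans_petal = KMeans(n_clusters=3)
kmeans_petal.fit(X_petal)

# Color each point by its assigned cluster
plt.figure(figsize=(8, 6))
plt.scatter(X_petal[:, 0], X_petal[:, 1], c=kmeans_petal.labels_, cmap=cmap_bold)
plt.xlabel('Petal length (cm)')
plt.ylabel('Petal width (cm)')
plt.title('K-means clusters on petal measurements')
plt.show()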

Final Notes

Some final things to keep in mind when using K-means to cluster your data:

  • K-means is unsupervised, meaning it clusters data by similarity of features and does not require (or even use) labels.
  • How well it works is partially dependent on choosing the right number of clusters for the dataset. You can do this using your knowledge of the data (like we did, knowing we are looking at 3 species of plant). Alternatively, there are ways to try to experimentally find the best number of clusters.
  • The output does not provide a meaningful label, only a cluster assignment for the data. It is up to you to determine the meaning of each cluster.

In [ ]: