Copyright 2020 Allen B. Downey

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

This notebook introduces cluster analysis and one of the most common algorithms for it, k-means.

It also introduces

Jupyter, which is a tool for creating notebooks like this one;

NumPy, which we'll use to perform array operations;

Pandas, which we'll use to read and clean the data; and

scikit-learn, which provides an implementation of k-means.

We'll proceed "top-down"; that is, we'll use scikit-learn first, then we'll open the hood and see how it works.

If you want to follow along:

Use this link to run this notebook and do the exercises: tinyurl.com/DowPen20

Use this link to run the same notebook with solutions: tinyurl.com/DowPenSoln20

I am a professor at Olin College, which is a small engineering school near Boston, Massachusetts, USA.

Olin was created in 1999 with the mission to transform engineering education.

```
In [1]:
```%%html
<iframe src="https://www.google.com/maps/embed?pb=!1m14!1m8!1m3!1d1512.1750667940496!2d-71.26457056946273!3d42.29270982134376!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0xa038229eeed8c35b!2sOlin%20College%20of%20Engineering!5e1!3m2!1sen!2sus!4v1594232142090!5m2!1sen!2sus" width="600" height="450" frameborder="0" style="border:0;" allowfullscreen="" aria-hidden="false" tabindex="0"></iframe>

```
```

I have been at Olin since 2003. I teach classes related to software, data science, Bayesian statistics, and physical modeling.

I have written several books on these topics, including *Think Python* and *Think Stats*. Most are published by O'Reilly Media, which is famous for putting animals on their covers:

But all of them are freely available from Green Tea Press.

Finally, I write a blog about Data Science and related topics, called Probably Overthinking It.

Jupyter is a tool for writing notebooks that contain text, code, and results.

You can install Jupyter on your own computer, but can also use services like Colab that run the notebook for you. In that case, you don't have to install anything; you just need a browser.

A notebook contains:

Text cells, which contain text in Markdown or HTML, and

Code cells, which contain code in Python or one of about 100 other languages.

This is a text cell; the one below is a code cell.

```
In [2]:
```print('Hello, Jupyter')

```
```

On Colab, code cells have a triangle "Play" icon on the left side. You can press it to run the code in the cell.

Or if you select a cell by clicking on it, you can run it by pressing Shift-Enter.

As an exercise:

- Run the
`print`

statement in the previous cell. - Modify the code in that cell and run it again.
- Run the next cell, which imports the Python modules we'll use later.

```
In [3]:
```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Also run the following cell, which defines a function we'll use.

```
In [4]:
```def decorate(**options):
"""Decorate the current axes.
Call decorate with keyword arguments like
decorate(title='Title',
xlabel='x',
ylabel='y')
The keyword arguments can be any of the axis properties
https://matplotlib.org/api/axes_api.html
"""
ax = plt.gca()
ax.set(**options)
handles, labels = ax.get_legend_handles_labels()
if handles:
ax.legend(handles, labels)
plt.tight_layout()

Cluster analysis is a set of tools for looking at data and

Discovering groups, species, or categories,

Defining boundaries between groups.

It is a form of "unsupervised" learning, which means that the only input is the dataset itself; the algorithm is not given any correct examples to learn from.

As an example, I'll used data collected and made available by Dr. Kristen Gorman at the Palmer Long-Term Ecological Research Station in Antarctica.

This dataset was published to support this article: Gorman, Williams, and Fraser, "Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus *Pygoscelis*)", March 2014.

The following cell downloads the raw data.

```
In [5]:
```# Load the data files from https://github.com/allisonhorst/palmerpenguins
# With gratitude to Allison Horst (@allison_horst)
import os
if not os.path.exists('penguins_raw.csv'):
!wget https://github.com/allisonhorst/palmerpenguins/raw/master/inst/extdata/penguins_raw.csv

The dataset is stored in a CSV file, which contains one row for each penguin and one column for each variable.

I'll use Pandas to read the CSV file and put the results in a `DataFrame`

.

```
In [6]:
```df = pd.read_csv('penguins_raw.csv')
df.shape

```
Out[6]:
```

A `DataFrame`

is like a 2-D array, but it also contains names for the columns and labels for the rows.

The `shape`

of the `DataFrame`

is the number of rows and columns.

The `head`

method displays the first few rows.

```
In [7]:
```df.head()

```
Out[7]:
```

Three species of penguins are represented in the dataset: Adelie, Chinstrap and Gentoo, as shown in this illustration (by Allison Horst, available under the CC-BY license):

In this dataset we are told that there are three species, and we are told which species each penguin belongs to. But for purposes of clustering, we'll pretend we don't have this information and we'll see whether the algorithm "discovers" the different species.

The measurements we'll use are:

Body Mass in grams (g).

Flipper Length in millimeters (mm).

Culmen Length in millimeters.

Culmen Depth in millimeters.

If you are not familiar with the word "culmen", it refers to the top margin of the beak, as shown in the following illustration (also by Allison Horst):

This might seem like an artificial exercise. If we already know that there are three species, why are we trying to discover them?

For now, I'll just say that it's a learning example. But let's come back to this question: what is unsupervised clustering good for?

`Species2`

, that contains a shorter version of the species names.

```
In [8]:
```def shorten(species):
"""Select the first word from a string."""
return species.split()[0]
df['Species2'] = df['Species'].apply(shorten)

I'll use the `groupby`

method to divide the dataset by species.

```
In [9]:
```grouped = df.groupby('Species2')
type(grouped)

```
Out[9]:
```

`GroupBy`

object that contains the three groups and their names. The following loop prints the group names and the number of penguins in each group.

```
In [10]:
```for name, group in grouped:
print(name, len(group))

```
```

`GroupBy`

object to extract a column, like flipper length, from each group and compute its mean.

```
In [11]:
```varname = 'Flipper Length (mm)'
for name, group in grouped:
print(name, group[varname].mean())

```
```

We can also use it to display the distribution of values in each group.

```
In [12]:
```for name, group in grouped:
sns.kdeplot(group[varname], label=name)
decorate(xlabel=varname,
ylabel='PDF',
title='Distributions of features')

```
```

`kdeplot`

uses kernel density estimation to make a smooth histogram of the values.

It looks like we can use flipper length to identify Gentoo penguins, but not to distinguish the other two species.

To make these steps easier to reuse, I'll wrap them a function.

```
In [13]:
```def make_kdeplots(df, varname):
"""Make a KDE plot for each species.
df: DataFrame
varname: string column name
by: string column name
returns: dictionary from species name to Cdf
"""
grouped = df.groupby('Species2')
for name, group in grouped:
sns.kdeplot(group[varname], label=name)
decorate(xlabel=varname,
ylabel='PDF',
title='Distributions of features')

Now we can use it to explore other features, like culmen length.

```
In [14]:
```make_kdeplots(df, 'Culmen Length (mm)')

```
```

It looks like we can use culmen length to identify Adelie penguins.

**Exercise:** Use `make_kdeplots`

to display the distributions of one of the other two features:

`'Body Mass (g)'`

`'Culmen Depth (mm)'`

```
In [15]:
```# Solution
make_kdeplots(df, 'Body Mass (g)')

```
```

```
In [16]:
```var1 = 'Flipper Length (mm)'
var2 = 'Culmen Length (mm)'
var3 = 'Culmen Depth (mm)'
var4 = 'Body Mass (g)'

```
In [17]:
```for name, group in grouped:
plt.plot(group[var1], group[var2],
'o', alpha=0.4, label=name)
decorate(xlabel=var1, ylabel=var2)

```
```

Using those two features, we can divide the penguins into clusters with not much overlap.

We're going to make lots of scatter plots, so let's wrap that code in a function.

And we'll generalize it to take `by`

as a parameter, so we can group by any column, not just `Species2`

.

```
In [18]:
```def scatterplot(df, var1, var2, by):
"""Make a scatter plot.
df: DataFrame
var1: string column name, x-axis
var2: string column name, y-axis
by: string column name, groupby
"""
grouped = df.groupby(by)
for name, group in grouped:
plt.plot(group[var1], group[var2],
'o', alpha=0.4, label=name)
decorate(xlabel=var1, ylabel=var2)

Here's a scatter plot of flipper and culmen length for the three species.

```
In [19]:
```scatterplot(df, var1, var2, 'Species2')

```
```

**Exercise:** Make a scatter plot using any other pair of variables.

```
In [20]:
```# Solution
scatterplot(df, var3, var4, 'Species2')

```
```

We can think of these scatter plots as 2-D views of a 4-D feature space.

```
In [21]:
```df['labels'] = 0

Now if we group by label, there's only one big cluster.

```
In [22]:
```scatterplot(df, var1, var2, 'labels')

```
```

Let's see what happens if we run k-means clustering on this data.

```
In [23]:
```df.shape

```
Out[23]:
```

But some of the variables are NaN, which indicates missing data.

So I'll use `dropna`

to drop any rows that have missing data for the two features we're going to use, flipper length and culmen length.

```
In [24]:
```features = [var1, var2]
data = df.dropna(subset=features).copy()
data.shape

```
Out[24]:
```

I'll extract just those two columns as a NumPy array.

```
In [25]:
```M = data[features].to_numpy()

Now we can use `KMeans`

to identify the clusters.

`n_clusters`

indicates how many cluster we want; this parameter is the $k$ the algorithm is named for.

```
In [26]:
```from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3).fit(M)
type(kmeans)

```
Out[26]:
```

The result is an object that contains

Labels that indicates which cluster each penguin is assigned to, and

The centers of the clusters.

I'll store the labels as a columns in `data`

.

```
In [27]:
```data['labels'] = kmeans.labels_
data['labels']

```
Out[27]:
```

That way we can use `scatterplot`

to show the clusters.

```
In [28]:
```scatterplot(data, var1, var2, 'labels')

```
```

The `KMeans`

object also contains the centers of the clusters as coordinate pairs in a NumPy array.

```
In [29]:
```kmeans.cluster_centers_

```
Out[29]:
```

To plot the centers, I'll transpose the array and assign the columns to `x`

and `y`

:

```
In [30]:
```xs, ys = np.transpose(kmeans.cluster_centers_)

I'll plot the centers with x's and o's.

```
In [31]:
```options = dict(color='C3', ls='none', mfc='none')
plt.plot(xs, ys, marker='o', ms=15, **options)
plt.plot(xs, ys, marker='x', ms=10, **options);

```
```

As usual, let wrap that up in a function.

```
In [32]:
```def plot_centers(centers, color='C3'):
"""Plot cluster centers.
centers: array with x and y columns
color: string color specification
"""
xs, ys = np.transpose(centers)
options = dict(color=color, ls='none', mfc='none')
plt.plot(xs, ys, marker='o', ms=15, **options)
plt.plot(xs, ys, marker='x', ms=10, **options)

Now let's pull it all together.

```
In [33]:
```scatterplot(data, var1, var2, 'labels')
plot_centers(kmeans.cluster_centers_)

```
```

This figure shows the data, color-coded by assigned label, and the centers of the clusters.

It looks like k-means does a reasonable job of rediscovering the species, but with some confusion between Adelie (lower left) and Chinstrap (top center).

As a reminder, here are the right answers:

```
In [34]:
```scatterplot(data, var1, var2, 'Species2')

```
```

**Exercise:** Here's the code from this section all in one place. Modify it to use any two features and see what the results look like.

```
In [35]:
```features2 = [var1, var2]
data2 = df.dropna(subset=features2).copy()
M2 = data2[features2].to_numpy()
kmeans2 = KMeans(n_clusters=3).fit(M2)
data2['labels'] = kmeans2.labels_
scatterplot(data2, var1, var2, 'labels')

```
```

Now let's see how the algorithm works. At a high level, there are three steps:

- Choose $k$ random points in the dataset as initial centers.
- Assign each point in the dataset to the closest center.
- Compute new centers by calculating the "center of mass" in each cluster.

Then you repeat steps 2 and 3 until the centers stop moving.

To select random points from the dataset, I'll use `np.random.choice`

to select three indices.

```
In [36]:
```index = np.random.choice(len(M), size=3)
index

```
Out[36]:
```

And then use the indices to select rows from the dataset.

```
In [37]:
```centers = M[index]
centers

```
Out[37]:
```

I'll wrap that in a function:

```
In [38]:
```def choose_random_start(M, k):
"""Choose k random elements of M.
M: NumPy array with rows of coordinates
k: number of centers to choose
returns: NumPy array
"""
index = np.random.choice(len(M), size=k)
centers = M[index]
return centers

Here's how we use it.

```
In [39]:
```centers = choose_random_start(M, 3)
centers

```
Out[39]:
```

And here's what the centers look like on the scatterplot.

```
In [40]:
```data['labels'] = 0
scatterplot(data, var1, var2, 'labels')
plot_centers(centers)

```
```

```
In [41]:
```center_x, center_y = centers[0]
center_x, center_y

```
Out[41]:
```

`x`

and `y`

coordinates in separate arrays. I can do that with `np.transpose`

, which turns the columns into rows; then I can assign the rows to `x`

and `y`

.

```
In [42]:
```x, y = np.transpose(M)
x.shape

```
Out[42]:
```

Along the x-axis, the distance from each point to this center is `x-center_x`

.

Along the y-axis, the distance is `y-center_y`

.

The distance from each point to the center is the hypotenuse of the triangle, which I can compute with `np.hypot`

:

```
In [43]:
```distances = np.hypot(x-center_x, y-center_y)
distances.shape

```
Out[43]:
```

The result is an array that contains the distance from each point to the chosen center.

To see if we got it right, I'll plot the center and the points, with the size of the points proportional to distance.

```
In [44]:
```plt.plot(center_x, center_y, 'rx', markersize=10)
plt.scatter(x, y, s=distances)
decorate(xlabel=var1, ylabel=var2)

```
```

At least visually, it seems like the size of the points is proportional to their distance from the center.

So let's put those steps into a function:

```
In [45]:
```def compute_distances(M, center):
"""Compute distances to the given center.
M: NumPy array of coordinates
center: x, y coordinates of the center
returns: NumPy array of float distances
"""
x, y = np.transpose(M)
center_x, center_y = center
distances = np.hypot(x-center_x, y-center_y)
return distances

We can use the function to make a list of distance arrays, one for each center.

```
In [46]:
```distance_arrays = [compute_distances(M, center)
for center in centers]
len(distance_arrays)

```
Out[46]:
```

```
In [47]:
```A = np.array(distance_arrays)
A.shape

```
Out[47]:
```

`A`

has one row for each center and one column for each point.

Now we can use `np.argmin`

to find the shortest distance in each column and return its index.

```
In [48]:
```data['labels'] = np.argmin(A, axis=0)
data['labels']

```
Out[48]:
```

The result is an array of indices in the range `0..2`

, which we assign to a column in `data`

.

Let's put these steps in a function.

```
In [49]:
```def compute_labels(M, centers):
"""Label each point with the index of the closest center.
M: NumPy array of coordinates
centers: array of coordinates for the centers
returns: array of labels, 0..k-1
"""
distance_arrays = [compute_distances(M, center)
for center in centers]
A = np.array(distance_arrays)
labels = np.argmin(A, axis=0)
return labels

We can call it like this:

```
In [50]:
```data['labels'] = compute_labels(M, centers)

And here are the results.

```
In [51]:
```scatterplot(data, var1, var2, 'labels')
plot_centers(centers)

```
```

```
In [52]:
```grouped = data.groupby('labels')
for name, group in grouped:
print(name, len(group))

```
```

We can use the `GroupBy`

object to select the columns we're using and compute their means.

```
In [53]:
```data.groupby('labels')[features].mean()

```
Out[53]:
```

The result is a `DataFrame`

that contains the central coordinates of each cluster.

I'll put these steps in a function.

```
In [54]:
```def compute_new_centers(data, features):
"""Compute the center of each cluster.
data: DataFrame
features: list of string column names
"""
means = data.groupby('labels')[features].mean()
return means.to_numpy()

The return value is a NumPy array that contains the new centers.

```
In [55]:
```new_centers = compute_new_centers(data, features)
new_centers

```
Out[55]:
```

Here's what it looks like with the old centers in gray and the new centers in red.

```
In [56]:
```scatterplot(data, var1, var2, 'labels')
plot_centers(centers, color='gray')
plot_centers(new_centers, color='C3')

```
```

```
In [57]:
```def k_means(data, features, k):
"""Cluster by k means.
data: DataFrame
features: list of string column names
k: number of clusters
returns: array of centers
"""
M = data[features].to_numpy()
centers = choose_random_start(M, k)
for i in range(15):
data['labels'] = compute_labels(M, centers)
centers = compute_new_centers(data, features)
return centers

And here's what the results look like after 15 iterations.

```
In [58]:
```centers = k_means(data, features, 3)
scatterplot(data, var1, var2, 'labels')
plot_centers(centers, color='C3')

```
```

The results are (as far as I can see) identical to what we got from the scikit-learn implementation.

```
In [59]:
```kmeans = KMeans(n_clusters=3).fit(M)
data['labels'] = kmeans.labels_
scatterplot(data, var1, var2, 'labels')
plot_centers(kmeans.cluster_centers_)

```
```

```
In [60]:
```from time import sleep
from IPython.display import clear_output
interval = 1
centers = choose_random_start(M, k=3)
plt.figure()
for i in range(10):
# label and scatter plot
data['labels'] = compute_labels(M, centers)
scatterplot(data, var1, var2, 'labels')
plot_centers(centers, color='gray')
# compute new centers and plot them
new_centers = compute_new_centers(data, features)
plot_centers(new_centers)
centers = new_centers
# show the plot, wait, and clear
plt.show()
sleep(interval)
clear_output(wait=True)

```
```

**Exercise:** Run the previous cell a few times. Do you always get the same clusters?

All of this is based on the assumption that you know how many clusters you are looking for, which is true for some applications, but not always.

Let's see what goes wrong if you ask for too many clusters, or too few.

**Exercise:** Run the following code with different values of `n_clusters`

and see what the results look like.

```
In [61]:
```kmeans = KMeans(n_clusters=3).fit(M)
data['labels'] = kmeans.labels_
scatterplot(data, var1, var2, 'labels')
plot_centers(kmeans.cluster_centers_)

```
```

```
In [62]:
```M.max(axis=0) - M.min(axis=0)

```
Out[62]:
```

```
In [63]:
```M.std(axis=0)

```
Out[63]:
```

When we compute the distance from each point to each center, the distances in the $x$ direction tend to dominate.

This is a common problem with algorithms that are based on distance in multidimensional space.

It is such a common problem that there is a common solution: feature scaling.

The goal of feature scaling is to transform the features so the distances along each axis are comparable.

One version of feature scaling is "standardization", which consists of

- Subtracting the mean from each feature, and
- Dividing through by the standard deviation.

Here's how we can do it with the features in `M`

:

```
In [64]:
```means = M.mean(axis=0)
means

```
Out[64]:
```

```
In [65]:
```stds = M.std(axis=0)
stds

```
Out[65]:
```

```
In [66]:
```M_std = (M - means) / stds

Let's see what happens if we run the algorithm again with standardized features.

Notice that I have to transform the centers back before plotting them.

```
In [67]:
```kmeans = KMeans(n_clusters=3).fit(M_std)
data['labels'] = kmeans.labels_
scatterplot(data, var1, var2, 'labels')
centers = kmeans.cluster_centers_ * stds + means
plot_centers(centers)

```
```

That looks a lot better! Again, here are the actual species for comparison.

```
In [68]:
```scatterplot(data, var1, var2, 'Species2')

```
```

scikit-learn provides `StandardScaler`

, which does the same thing.

```
In [69]:
```from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(M)
M_std = scaler.transform(M)

And `scaler`

provides `inverse_transform`

, which we can use to transform the centers back.

```
In [70]:
```kmeans = KMeans(n_clusters=3).fit(M_std)
data['labels'] = kmeans.labels_
scatterplot(data, var1, var2, 'labels')
centers = scaler.inverse_transform(kmeans.cluster_centers_)
plot_centers(centers)

```
```

The k-means algorithm does unsupervised clustering, which means that we don't tell it where the clusters are; we just provide the data and ask it to find a given number of clusters.

In this notebook, we asked it to find clusters in a group of penguins based on two features, flipper length and culmen length. The clusters it finds reflect the species in the dataset, especially if we standardize the data.

In this example we used only two features, because that makes it easy to visualize the results. But k-means extends easily to any number of dimensions (see the exercise below).

So, what is this good for?

Well, Wikipedia provides this list of applications. Applying clustering analysis to these applications, I see a few general ideas:

From an engineering point of view, clustering can be used to automate some kinds of analysis people do, which might be faster, more accurate, or less expensive. And it can work with large datasets and high numbers of dimensions that people can't handle.

From a scientific point of view, clustering provides a way to test whether the patterns we see are in the data or in our minds.

This second point is related to old philosophical questions about the nature of categories. Putting things into categories seems to be a natural part of how humans think, but we have to wonder whether the categories we find truly "carve nature at its joints", as Plato put it.

If a clustering algorithm finds the same "joints" we do, we might have more confidence they are not entirely in our minds.

**Exercise:** Use the scikit-learn implementation of k-means to find clusters using all four features (flipper length, culmen length and depth, body mass). How do the results compare to what we got with just two features?

```
In [71]:
```# Solution
features4 = [var1, var2, var3, var4]
data4 = df.dropna(subset=features4).copy()
data4.shape

```
Out[71]:
```

```
In [72]:
```# Solution
M4 = data[features4].to_numpy()

```
In [73]:
```# Solution
scaler = StandardScaler().fit(M4)
M4_std = scaler.transform(M4)

```
In [74]:
```# Solution
kmeans4 = KMeans(n_clusters=3).fit(M4_std)

```
In [75]:
```# Solution
data4['labels'] = kmeans4.labels_

```
In [76]:
```# Solution
scatterplot(data4, var1, var2, 'labels')

```
```

```
In [ ]:
```