The K-means clustering algorithm assigns each point $x_i$ in a set of input points $\{x_1, x_2, \ldots, x_m\}$ to one of $K$ clusters. Each cluster is specified by a centroid, the average location of all the points in the cluster. The algorithm proceeds iteratively from arbitrary centroid locations, updating the membership of each point according to minimum distance, then updating each centroid's location based on the new cluster membership.
The algorithm will have converged when the assignment of points to centroids does not change with each iteration.
The K-means algorithm is simply a Gaussian mixture model with two restrictions:
1. the covariance matrix is spherical:
$$\Sigma_k = \sigma I_D$$
2. the mixture weights are fixed:
$$\pi_k = \frac{1}{K}$$
Hence, we are only interested in locating the appropriate centroids of the clusters; this restriction serves to speed computation.
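To make the connection concrete, here is a small, self-contained sketch on synthetic data (not one of this notebook's datasets) comparing KMeans with a spherical-covariance GaussianMixture from scikit-learn. Note that scikit-learn does not let us pin the mixture weights to $1/K$, so the correspondence is only approximate.
In [ ]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic, well-separated blobs (assumed for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ((0, 0), (3, 0), (0, 3))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# A spherical-covariance mixture is the closest built-in analogue; the weights
# are still estimated rather than fixed at 1/K, so this only approximates the
# restricted model described above.
gm = GaussianMixture(n_components=3, covariance_type='spherical', random_state=0).fit(X)

print(np.round(km.cluster_centers_, 2))
print(np.round(gm.means_, 2))  # the two sets of centers should agree up to ordering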
We can define the distortion function:
$$J(c,\mu) = \sum_{i=1}^m \|x_i - \mu_{c_i}\|^2$$
which gets smaller at every iteration. So, k-means is coordinate descent on $J(c,\mu)$.
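A quick sketch on synthetic data (assumed here for illustration; the iris data used below would work just as well) shows $J$ shrinking with each pair of assignment and update steps:
In [ ]:
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ((0, 0), (4, 0), (0, 4))])

# Initialize centroids at randomly chosen data points
centroids = X[rng.choice(len(X), size=3, replace=False)]
for step in range(5):
    labels = cdist(centroids, X).argmin(axis=0)            # assignment step: minimize J over c
    J = np.sum(np.linalg.norm(X - centroids[labels], axis=1) ** 2)
    print(step, round(J, 2))                               # J is non-increasing
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])  # update step: minimize J over mu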
To check whether a chosen $k$ is reasonable, one approach is to compare the distances between the centroids to the mean distance between each data point and its assigned centroid. A good fit involves relatively large inter-centroid distances.
The appropriate value for k (the number of clusters) may depend on the goals of the analysis, or it may be chosen algorithmically, using an optimization procedure.
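As a hedged sketch of both ideas on synthetic, well-separated data (three obvious blobs, scikit-learn's KMeans assumed), we can compute the within/between distance comparison for a single $k$ and then sweep $k$ looking for the "elbow" where the distortion stops dropping sharply:
In [ ]:
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in ((0, 0), (3, 0), (0, 3))])

# Check a single k: mean distance to the assigned centroid vs. mean inter-centroid distance
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
d = cdist(X, km.cluster_centers_)
within = d[np.arange(len(X)), km.labels_].mean()
pairwise = cdist(km.cluster_centers_, km.cluster_centers_)
between = pairwise[np.triu_indices_from(pairwise, k=1)].mean()
print(within, between)      # a reasonable k should give between >> within

# "Elbow" sweep: inertia_ is the within-cluster sum of squares (the distortion J)
for k in range(1, 7):
    print(k, round(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_, 1))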
In [ ]:
%matplotlib inline
import seaborn as sns; sns.set_context('notebook')
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
iris = datasets.load_iris()
features, target = iris.data, iris.target
sepal_length, sepal_width, petal_length, petal_width = features.T
In [ ]:
x, y = sepal_length, petal_length
plt.scatter(x, y)
In [ ]:
plt.scatter(x, y, c=np.array(list('rgbc'))[target])
Let's start with $k=3$, arbitrarily assigned:
In [ ]:
centroids = (5, 4), (6, 1), (7, 6)
In [ ]:
plt.scatter(x, y)
plt.scatter(*np.transpose(centroids), c='r', marker='+', s=100)
We can use the cdist function from SciPy to calculate the distances from each point to each centroid.
In [ ]:
from scipy.spatial.distance import cdist
distances = cdist(centroids, list(zip(x,y)))
distances.shape
We can make the initial assignment to centroids by picking the minimum distance.
In [ ]:
labels = distances.argmin(axis=0)
labels
In [ ]:
plt.scatter(x, y, c=np.array(list('rgbc'))[labels])
plt.scatter(*np.transpose(centroids), c='r', marker='+', s=100)
Now we can re-assign the centroid locations based on the means of the current members' locations.
In [ ]:
new_centroids = [(x[labels==i].mean(), y[labels==i].mean())
                 for i in range(len(centroids))]
In [ ]:
plt.scatter(x, y, c=np.array(list('rgbc'))[labels])
plt.scatter(*np.transpose(new_centroids), c='r', marker='+', s=100)
So, we simply iterate these steps until convergence.
In [ ]:
centroids = new_centroids
iterations = 200
for _ in range(iterations):
    distances = cdist(centroids, list(zip(x, y)))
    labels = distances.argmin(axis=0)
    centroids = [(x[labels==i].mean(), y[labels==i].mean())
                 for i in range(len(centroids))]
In [ ]:
plt.scatter(x, y, c=np.array(list('rgbc'))[labels])
plt.scatter(*np.transpose(centroids), c='r', marker='+', s=100)
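The loop above runs a fixed number of iterations; a variant sketch (reusing x, y, and cdist from the cells above) stops as soon as the assignments no longer change, which is exactly the convergence criterion described at the start:
In [ ]:
labels = None
centroids = [(5, 4), (6, 1), (7, 6)]
while True:
    distances = cdist(centroids, list(zip(x, y)))
    new_labels = distances.argmin(axis=0)
    if labels is not None and np.array_equal(new_labels, labels):
        break                                   # converged: no point changed cluster
    labels = new_labels
    centroids = [(x[labels == i].mean(), y[labels == i].mean())
                 for i in range(len(centroids))]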
In [ ]:
from sklearn.cluster import KMeans
from numpy.random import RandomState
rng = RandomState(1)
# Instantiate model
kmeans = KMeans(n_clusters=3, random_state=rng)
# Fit model
kmeans.fit(np.transpose((x,y)))
After fitting, we can retrieve the labels and cluster centers.
In [ ]:
kmeans.labels_
In [ ]:
kmeans.cluster_centers_
The resulting plot should look very similar to the one we fit by hand.
In [ ]:
plt.scatter(x, y, c=np.array(list('rgbc'))[kmeans.labels_])
plt.scatter(*kmeans.cluster_centers_.T, c='r', marker='+', s=100)
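Because cluster labels are only defined up to permutation, a permutation-invariant score such as the adjusted Rand index is one way to confirm quantitatively that the two partitions agree (a value near 1.0 means near-identical clusterings):
In [ ]:
from sklearn.metrics import adjusted_rand_score

# Compare the hand-rolled labels to scikit-learn's labels_
adjusted_rand_score(labels, kmeans.labels_)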
The microbiome.csv dataset contains counts of various microbe taxa extracted from either tissue or stool samples of NICU infants. We might be interested in seeing whether samples cluster into groups approximately corresponding to location (tissue or stool) based on the counts of each bacterial taxon.
In [ ]:
import pandas as pd
microbiome = pd.read_csv("../data/microbiome.csv")
First, we need to reshape the data so that it can be used with scikit-learn's interface. Fortunately, Pandas makes this relatively painless. The data are stored in long format:
In [ ]:
microbiome.head()
For this analysis, we need the features (i.e. taxa) in columns, with a row for each sample. First we drop the Group column, then pivot the Taxon column into a column index.
In [ ]:
microbiome_pivoted = (microbiome.drop('Group', axis=1)
                      .pivot(index='Patient', columns='Taxon')
                      .stack(level=0)
                      .reset_index())
microbiome_pivoted.columns.name = None
microbiome_pivoted.head()
Then we drop the unused column and change the location variable from str type to int.
In [ ]:
microbiome_data = (microbiome_pivoted.drop('Patient', axis=1)
                   .rename(columns={'level_1': 'Location'})
                   .replace({'Tissue': 0, 'Stool': 1}))
y = microbiome_data.values[:, 0]
X = microbiome_data.values[:, 1:]
In [ ]:
microbiome_data.head()
To simplify the analysis, and aid visualization, we will again perform a PCA to isolate the majority of the variation into two principal components.
In [ ]:
from sklearn.decomposition import PCA
from itertools import cycle
pca = PCA(n_components=2, whiten=True).fit(X)
X_pca = pca.transform(X)
def plot_2D(data, target, target_names, pca):
    colors = cycle('rgbcmykw')
    target_ids = range(len(target_names))
    plt.figure()
    for i, c, label in zip(target_ids, colors, target_names):
        plt.scatter(data[target == i, 0], data[target == i, 1],
                    c=c, label=label)
    var_explained = pca.explained_variance_ratio_ * 100
    plt.xlabel('First Component: {0:.1f}%'.format(var_explained[0]))
    plt.ylabel('Second Component: {0:.1f}%'.format(var_explained[1]))
    plt.legend()
In [ ]:
plot_2D(X_pca, y, ['Tissue', 'Stool'], pca)
We can now create a KMeans object with k=2, and fit the data with it.
In [ ]:
km_microbiome = KMeans(n_clusters=2, random_state=rng)
km_microbiome.fit(X_pca)
From this, we can extract the cluster centroids (in the cluster_centers_ attribute) and the group labels (in labels_) in order to generate a plot of the classification result.
In [ ]:
np.round(km_microbiome.cluster_centers_, decimals=2)
In [ ]:
km_microbiome.labels_
In [ ]:
plot_2D(X_pca, km_microbiome.labels_, ["c1", "c2"], pca)
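To see how well the two unsupervised clusters line up with the known sample locations, one simple check is to cross-tabulate the cluster assignments against the Tissue/Stool labels:
In [ ]:
# Rows: true location (0 = Tissue, 1 = Stool); columns: cluster assignment
pd.crosstab(y, km_microbiome.labels_, rownames=['Location'], colnames=['Cluster'])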
scikit-learn includes a suite of well-known clustering algorithms. Like most unsupervised learning models in scikit-learn, they expect the data to be clustered to have the shape (n_samples, n_features):

- sklearn.cluster.KMeans: the simplest, yet effective, clustering algorithm. It needs to be provided with the number of clusters in advance and assumes that the input data are normalized (a PCA model can be used as a preprocessor).
- sklearn.cluster.MeanShift: can find better-looking clusters than KMeans, but does not scale to a large number of samples.
- sklearn.cluster.DBSCAN: can detect irregularly shaped clusters based on density, i.e. sparse regions in the input space are likely to become inter-cluster boundaries. It can also detect outliers (samples that are not part of any cluster).
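For illustration only, a DBSCAN fit on the PCA-projected microbiome data from above might look like the sketch below; the eps value is an assumption chosen by eye, and the algorithm marks outliers with the label -1:
In [ ]:
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5).fit(X_pca)
np.unique(db.labels_, return_counts=True)   # -1 denotes points DBSCAN treats as outliers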
In [ ]:
## Write answer here
We can use clustering to try to find interesting groupings among sets of baseball statistics. Load the baseball dataset and run a clustering algorithm on the set of three statistics:
You should probably set a minimum number of at-bats to qualify for the analysis, since there are pitchers who get only a handful of at-bats each year.
Since we are clustering in 3 dimensions, you can visualize the output as a series of pairwise plots.
In [ ]:
import pandas as pd
baseball = pd.read_csv("../data/baseball.csv", index_col=0)
baseball.head()
In [ ]:
## Write answer here