In the unsupervised setting, one of the most straightforward tasks we can perform is to find groups of data instances that are similar to each other. We call such groups of data points clusters.
We position ourselves in the setting where we have access to a dataset $D$ that consists of instances $x \in \mathbb{R}^n$. For example, if our instances have two features $x_1$ and $x_2$ we are in the $\mathbb{R}^2$ space. For simplicity and visualization purposes in this session, we assume our data to be 2-dimensional. That said, the method (as well as the implementation) generalizes to more dimensions in a straightforward way.
$k$-Means is one of the most popular and representative clustering algorithms. $k$-means stores $k$ centroids, that is, points in the $n$-dimensional space which are then used to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid.
The most common algorithm uses an iterative refinement technique. $k$-means can be seen as a special case of the Expectation Maximization algorithm for clustering; it is also referred to as Lloyd's algorithm.
Given an initial set of $k$ centroids $m_1^{(1)}, \ldots, m_k^{(1)}$, the algorithm proceeds by alternating between two steps:
Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean.
Update step: Calculate the new means to be the centroids of the observations in the new clusters. Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster sum of squares (WCSS) objective.
The algorithm has converged when the assignments no longer change. Since neither step can increase the WCSS objective, and only a finite number of such partitionings exists, the algorithm must converge to a (local) optimum. There is no guarantee that the global optimum is found using this algorithm.
The algorithm is often presented as assigning objects to the nearest cluster by distance. The standard algorithm aims at minimizing the WCSS objective, and thus assigns by "least sum of squares", which is exactly equivalent to assigning by the smallest Euclidean distance. Using a distance function other than the (squared) Euclidean distance may prevent the algorithm from converging.
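To make the two steps concrete, here is a minimal NumPy sketch of Lloyd's algorithm (the function name lloyd_kmeans and the choice of initialising with randomly chosen data points are ours, and empty clusters are not handled); in the rest of the session we rely on scikit-learn's implementation instead.
In [ ]:
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Initialisation: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids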
The figure below illustrates the process step by step.
The figure depicts the k-means algorithm (images courtesy of Michael Jordan, adapted from http://stanford.edu/~cpiech/cs221/handouts/kmeans.html). The training examples are shown as dots, and the cluster centroids are shown as crosses. (a) the dataset, (b) random initial cluster centroids -- the algorithm may also be initialized using data points as centroids, (c-f) illustration of running two iterations of k-means. In each iteration, we assign each training example to the closest cluster centroid (shown by "painting" the training examples the same color as the cluster centroid to which they are assigned); then we move each cluster centroid to the mean of the points assigned to it.
Our goal today is to run k-means on a real dataset. This dataset was originally created to study genetic diversity across America and consists of 494 individuals coming from 27 different tribes (across 10 countries). These individuals are described by their genetic profile in terms of micro-satellites. In addition, we have information about the precise location of the tribes, given by the latitude and longitude features.
TO DO :
Import the data
Pre-processing
In [1]:
import pandas as pd
import numpy as np
In [2]:
df = pd.read_csv('NAm2.txt', sep=" ")
print(df.head())
print(df.shape)
In [3]:
# List of populations/tribes
tribes = df.Pop.unique()
country = df.Country.unique()
print(tribes)
print(country)
In [4]:
# The features that we need for clustering start from the 9th column
# Subset of the dataframe
df_micro = df.iloc[0:494,8:5717]
df_micro.shape
Out[4]:
From Wikipedia:
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of distinct principal components is equal to the smaller of the number of original variables or the number of observations minus one. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set.
Basically, we will only use PCA for visualisation purposes: our goal is to get a 2D view of a dataset with more than 5,700 features. Just keep in mind that PCA is a linear transformation, so it can only capture linear structure in the data, which is not always the case!
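To make the idea of an orthogonal transformation concrete, here is a small NumPy sketch (on a made-up 5x3 matrix, not on our genetic data) showing how the principal components arise from the eigendecomposition of the covariance matrix; the actual analysis below simply uses scikit-learn's PCA.
In [ ]:
import numpy as np

# Tiny made-up matrix: 5 observations, 3 features (for illustration only)
X_toy = np.array([[2.5, 2.4, 0.5],
                  [0.5, 0.7, 1.9],
                  [2.2, 2.9, 0.4],
                  [1.9, 2.2, 0.8],
                  [3.1, 3.0, 0.2]])

Xc = X_toy - X_toy.mean(axis=0)           # centre each feature
cov = np.cov(Xc, rowvar=False)            # feature covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvectors form an orthogonal basis
order = np.argsort(eigvals)[::-1]         # sort directions by decreasing variance
components = eigvecs[:, order[:2]]        # keep the first two principal components
projected_toy = Xc @ components           # 2D coordinates of each observation
print(projected_toy)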
TO DO : execute the following code!
In [5]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
projected = pca.fit_transform(df_micro)
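Before plotting, it can be instructive to check how much of the total variance the two retained components actually capture; pca here is the object fitted in the cell above.
In [ ]:
# Fraction of the total variance explained by each of the two components
print(pca.explained_variance_ratio_)
print('total explained:', pca.explained_variance_ratio_.sum())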
In [6]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)
plt.scatter(projected[:, 0], projected[:, 1],
c=df.Pop.astype('category').cat.codes, edgecolor='none', alpha=0.5,
cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();
We are now ready to run k-means! To this end, we will use the implementation provided by scikit-learn. As in the previous session, we first initialise the algorithm and then fit it to our data.
TO DO :
In [7]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10)
res = kmeans.fit(df_micro)
labels = res.labels_
res.cluster_centers_
Out[7]:
In [8]:
plt.scatter(projected[:, 0], projected[:, 1],
c=labels, edgecolor='none', alpha=0.5,
cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();
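As an optional sanity check (our own addition, not part of the original lab), the adjusted Rand index can also be used to compare the 10 k-means clusters with the 27 known tribes stored in df.Pop:
In [ ]:
from sklearn import metrics

# Optional check: agreement between the k-means partition and the known tribe labels
true_tribes = df.Pop.astype('category').cat.codes
print(metrics.adjusted_rand_score(true_tribes, labels))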
The initialisation step requires setting the number of clusters K. To this end, one can use a priori information and set it manually, but there also exist several approaches to determine it, including, for instance, the Elbow method and the Gap Statistic.
In addition, one needs to initialise the centroids. Commonly used initialization methods are Forgy and Random Partition. The Forgy method randomly chooses $k$ observations from the data set and uses them as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. For expectation maximization and the standard k-means algorithm, the Forgy method of initialization is preferable. Both schemes are sketched below.
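Both initialisation schemes can be written in a few lines of NumPy (the function names below are ours, given as a sketch only); in scikit-learn, init='random' picks k observations at random, in the spirit of Forgy, while the default init='k-means++' is the improved strategy used later in this session.
In [ ]:
import numpy as np

# Sketches of the two classical initialisation schemes (illustration only)
def forgy_init(X, k, seed=0):
    """Forgy: use k randomly chosen observations as the initial centroids."""
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k, seed=0):
    """Random Partition: randomly assign every point to a cluster, then average each cluster."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])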
TO DO :
In [9]:
from sklearn import metrics
# 1 random initialisation
kmeans = KMeans(n_clusters=10, init='random', n_init=1)
res = kmeans.fit(df_micro)
label_1 = res.labels_
centroids_1 = res.cluster_centers_
kmeans = KMeans(n_clusters=10, init='random', n_init=1)
res = kmeans.fit(df_micro)
label_2 = res.labels_
centroids_2 = res.cluster_centers_
metrics.adjusted_rand_score(label_1, label_2)
Out[9]:
In [10]:
# 50 random initialisations
kmeans = KMeans(n_clusters=10, init='random', n_init=50)
res = kmeans.fit(df_micro)
label_1 = res.labels_
centroids_1 = res.cluster_centers_
kmeans = KMeans(n_clusters=10, init='random',n_init=50)
res = kmeans.fit(df_micro)
label_2 = res.labels_
centroids_2 = res.cluster_centers_
metrics.adjusted_rand_score(label_1, label_2)
Out[10]:
In [75]:
# 50 initialisations with the improved k-means++ strategy
kmeans = KMeans(n_clusters=10, init='k-means++', n_init=50)
res = kmeans.fit(df_micro)
label_1 = res.labels_
centroids_1 = res.cluster_centers_
kmeans = KMeans(n_clusters=10, init='k-means++',n_init=50)
res = kmeans.fit(df_micro)
label_2 = res.labels_
centroids_2 = res.cluster_centers_
metrics.adjusted_rand_score(label_1, label_2)
Out[75]:
The Elbow method is a method of interpretation and validation of consistency within cluster analysis, designed to help find the appropriate number of clusters in a dataset. This method looks at the percentage of variance explained as a function of the number of clusters. One should choose a number of clusters so that adding another cluster does not give a much better modelling of the data.
If one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain in explained variance will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion".
In [48]:
cluster_range = range(1,50)
cluster_errors = []
for num_clusters in cluster_range:
    clust = KMeans(n_clusters=num_clusters, random_state=0, n_init=10)
    clust.fit(df_micro)
    cluster_errors.append(clust.inertia_)
In [49]:
clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors": cluster_errors } )
clusters_df[0:10]
plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )
Out[49]:
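If we want a reproducible (if crude) way of reading the elbow off the curve, one simple heuristic, given here only as a sketch, is to pick the number of clusters where the curvature (discrete second difference) of the inertia curve is largest; visually inspecting the plot above remains the reference.
In [ ]:
# Heuristic elbow: k with the largest curvature of the inertia curve
errors = np.array(cluster_errors)
second_diff = errors[:-2] - 2 * errors[1:-1] + errors[2:]
elbow_k = list(cluster_range)[second_diff.argmax() + 1]   # +1: second_diff[i] corresponds to cluster_range[i+1]
print('Elbow candidate:', elbow_k)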
The Gap Statistic was developed by researchers from Stanford (Tibshirani, Walther and Hastie) and compares the within-cluster dispersion observed on the data with the dispersion expected under a random reference distribution. For more details, see their original paper.
To compute it, we will use the implementation from https://github.com/milesgranger/gap_statistic
TO DO
In [17]:
def optimalK(data, nrefs=3, maxClusters=15):
    """
    Calculates the optimal K for KMeans using the Gap Statistic from Tibshirani, Walther, Hastie
    Params:
        data: ndarray of shape (n_samples, n_features)
        nrefs: number of random reference datasets to create
        maxClusters: maximum number of clusters to test for
    Returns: (optimalK, resultsdf)
    """
    gaps = np.zeros((len(range(1, maxClusters)),))
    resultsdf = pd.DataFrame({'clusterCount': [], 'gap': []})
    for gap_index, k in enumerate(range(1, maxClusters)):
        # Holder for the reference dispersion results
        refDisps = np.zeros(nrefs)
        # For nrefs references, generate a random sample, run k-means and record the resulting dispersion
        for i in range(nrefs):
            # Create a new random reference set
            randomReference = np.random.random_sample(size=data.shape)
            # Fit to it
            km = KMeans(n_clusters=k)
            km.fit(randomReference)
            refDisps[i] = km.inertia_
        # Fit k-means to the original data and record its dispersion
        km = KMeans(n_clusters=k)
        km.fit(data)
        origDisp = km.inertia_
        # Calculate the gap statistic for this k
        gap = np.log(np.mean(refDisps)) - np.log(origDisp)
        # Assign this loop's gap statistic to gaps
        gaps[gap_index] = gap
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        resultsdf = pd.concat([resultsdf, pd.DataFrame({'clusterCount': [k], 'gap': [gap]})],
                              ignore_index=True)
    return (gaps.argmax() + 1, resultsdf)  # Plus 1 because index 0 means 1 cluster is optimal, index 2 means 3 clusters are optimal
In [22]:
k, gapdf = optimalK(df_micro, nrefs=5, maxClusters=30)
print ('Optimal k is: ', k)
In [27]:
plt.plot(gapdf.clusterCount, gapdf.gap, linewidth=3)
plt.scatter(gapdf[gapdf.clusterCount == k].clusterCount, gapdf[gapdf.clusterCount == k].gap, s=250, c='r')
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()
In the probabilistic framework of mixture models, we assume that the data are generated according to a mixture of probability density functions, with cluster-specific parameters:
$$ p(x,\theta) = \sum_k \pi_k f(x,\theta_k) $$
where $\pi_k$ can be interpreted as the proportion of cluster $k$ and $\theta_k$ is its set of parameters. For instance, in the case of a Gaussian mixture we have $\theta_k = (\mu_k,\sigma_k)$. The goal is then to estimate the parameters $\pi_k$ and $\theta_k$ and to compute the partition of the objects, which is assumed to be a hidden variable of the model.
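The parameters are typically estimated with the EM algorithm. In the E step, the (soft) membership of an instance $x_i$ in cluster $k$ is given by the posterior probability
$$ t_{ik} = \frac{\pi_k f(x_i,\theta_k)}{\sum_{j=1}^{K} \pi_j f(x_i,\theta_j)}, $$
and in the M step the proportions $\pi_k$ and the parameters $\theta_k$ are re-estimated from these soft assignments. k-means can be seen as the limiting case in which each point is assigned entirely to its closest cluster.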
In [29]:
# 1 run of the Gaussian mixture models
from sklearn import mixture
gmm = mixture.GaussianMixture(n_components=10).fit(df_micro)
labels_gmm = gmm.predict(df_micro)
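As a quick follow-up (our own addition), the fitted GaussianMixture exposes the estimated proportions and means, and the adjusted Rand index can again be used to compare this partition with the k-means labels obtained earlier:
In [ ]:
# Optional: estimated mixture parameters and agreement with the k-means partition
print(gmm.weights_)        # estimated proportions pi_k
print(gmm.means_.shape)    # one mean vector mu_k per component
print(metrics.adjusted_rand_score(labels, labels_gmm))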