This post is based on .
Data clustering is the process of partitioning a set of records into subsets, called clusters, such that records in the same cluster are similar to one another and records in different clusters are quite distinct.
A typical clustering process involves the following five steps:
In this interactive session, we will be reviewing the k-means algorithm.
The k-means algorithm is the most popular and the simplest partitional clustering algorithm, and it has many variations. This exercise will review the standard algorithm and several implementations (possibly of different variations).
Let $X = \left\{ \mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_{n - 1} \right\}$ be a numeric dataset containing $n$ records and $k$ be an integer in $\{1, 2, \cdots, n\}$. The k-means algorithm tries to divide the dataset into $k$ clusters $C_0$, $C_1$, $\cdots$, and $C_{k - 1}$ by minimizing the following objective function: $$E = \sum_{i = 0}^{k - 1}\sum_{\mathbf{x} \in C_i} D(\mathbf{x}, \boldsymbol{\mu}_i),$$ where $D(\cdot, \cdot)$ is a distance measure and $\boldsymbol{\mu}_i$ is the mean of cluster $C_i$, i.e., $$\boldsymbol{\mu}_i = \frac{1}{\lvert C_i \rvert} \sum_{\mathbf{x} \in C_i}\mathbf{x}$$
Let $\gamma_i$ be the cluster membership of record $\mathbf{x}_i$ for $i = 0$, $1$, $\cdots$, $n - 1$. That is, $\gamma_i = j$ if $\mathbf{x}_i$ belongs to cluster $C_j$. Then the objective function can be rewritten as: $$E = \sum_{i = 0}^{n - 1}D(\mathbf{x}_i, \boldsymbol{\mu}_{\gamma_i})$$
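The distance measure $D(\cdot, \cdot)$ is left abstract in the formulas above. A common concrete choice, and the one assumed in the small sketches that follow, is the squared Euclidean distance (the function name D below is just illustrative):
import numpy as np

def D(x, mu):
    # Squared Euclidean distance between a record x and a center mu.
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return np.dot(diff, diff)

D(np.array([1, 2]), np.array([2, 5]))  # (1 - 2)**2 + (2 - 5)**2 = 10.0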
To minimize the objective function, the k-means algorithm employs an iterative process. At the beginning, the k-means algorithm selects $k$ random records from the dataset $X$ as initial cluster centers.
Suppose $\boldsymbol{\mu}_0^{(0)}$, $\boldsymbol{\mu}_1^{(0)}$, $\cdots$, and $\boldsymbol{\mu}_{k - 1}^{(0)}$ are the initial cluster centers. Based on these cluster centers, the k-means algorithm updates the cluster memberships $\gamma_0^{(0)}$, $\gamma_1^{(0)}$, $\cdots$, $\gamma_{n - 1}^{(0)}$ as follows: \begin{equation} \label{membership-update} \gamma_i^{(0)} = \underset{0 \leq j \leq k - 1}{\operatorname{argmin}} D(\mathbf{x}_i, \boldsymbol{\mu}_j^{(0)}) \end{equation} where $\operatorname{argmin}$ is the argument that minimizes the distance. That is, $\gamma_i^{(0)}$ is set to the index of the cluster to which $\mathbf{x}_i$ has the smallest distance.
Based on the cluster memberships $\gamma_0^{(0)}$, $\gamma_1^{(0)}$, $\cdots$, $\gamma_{n - 1}^{(0)}$, the k-means algorithm updates the cluster centers as follows: \begin{equation} \label{center-update} \boldsymbol{\mu}_j^{(1)} = \frac{1}{\left| \left\{ i \middle| \gamma_i^{(0)} = j\right\} \right|} \sum_{i = 0, \gamma_i^{(0)} = j}^{n - 1}\mathbf{x}_i, \qquad\text{$j = 0$, $1$, $\cdots$, $k - 1$} \end{equation}
Then the k-means algorithm keeps alternating the membership update (equation \ref{membership-update}) and the center update (equation \ref{center-update}) until the cluster memberships no longer change or a maximum number of iterations is reached.
The above can be understood completely if we understand the little pieces and make sense of the ideas behind the math. So don't be scared by the symbols; let's review some notation first.
$\lvert C_{i} \rvert$, in this context, means the cardinality of the set $C_{i}$, that is, the size of the set. In mathematics, the cardinality of a set is a measure of the "number of elements of the set".
In [1]:
S_1 = set([1, 2, 3, 4, 5]) # or { 1, 2, 3, 4, 5}
len(S_1)
Out[1]:
5
$\mathbf{x} \in C_1$ means that the vector (notice the bold typeface) $\mathbf{x}$ belongs to (or is in) the set $C_1$. For example,
$$\begin{bmatrix} 1 \\ 2 \end{bmatrix} \in C_1$$ means that the vector with two components, $\left[1, 2\right]^T$, belongs to the cluster $C_1$.
In [2]:
# Numpy ndarrays are often used in python to represent vectors.
import numpy as np
x = np.array([1, 2])
# Notice we are informally using a 2-dimensional numpy array
# to represent all the elements of the Set C_1
C_1 = np.array([[1, 2],
                [2, 5],
                [4, 7]])
x in C_1
Out[2]:
True
$$\sum_{\mathbf{x} \in C_1}\mathbf{x}$$ means: the sum (component-wise) of all the vectors that belong to the first cluster $C_1$.
In [3]:
# Use help(np.sum) or np.sum? (in IPython/Jupyter) to get more information about np.sum
np.sum(C_1, 0) # or C_1.sum(0)
Out[3]:
array([ 7, 14])
The centroid (or cluster center) of the first cluster $C_1$ is computed with a scalar multiplication. For our example:
In [4]:
1/len(C_1)
Out[4]:
0.3333333333333333
In [5]:
1 / len(C_1) * C_1.sum(0)
Out[5]:
array([2.33333333, 4.66666667])
In other words, the centroid formula simply computes the arithmetic mean (component-wise) of all the vectors of a given cluster, as the short check below shows:
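Numerically, np.mean over axis 0 of our toy array C_1 gives the same centroid as the scalar multiplication above:
# Component-wise arithmetic mean of the rows of C_1;
# identical to 1 / len(C_1) * C_1.sum(0) from the previous cell.
C_1.mean(axis=0)  # array([2.33333333, 4.66666667])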
Let's review: $$E = \sum_{i = 0}^{k - 1}\sum_{\mathbf{x} \in C_i} D(\mathbf{x}, \boldsymbol{\mu}_i)$$
The inner summation, $$\sum_{\mathbf{x} \in C_i} D(\mathbf{x}, \boldsymbol{\mu}_i)$$ says something like: compute the sum of all the distances for the $i$-th cluster. Notice the requirement in the summation index: $\mathbf{x} \in C_i$ (only consider those vectors that belong to cluster $C_i$).
The outer summation, $$E = \sum_{i = 0}^{k - 1} \text{sum of distances for the $i$-th cluster}$$ simply says that you have to accumulate the inner computation over all the clusters.
The second form of the objective function uses more compact notation:
$$E = \sum_{i = 0}^{n - 1}D(\mathbf{x}_i, \boldsymbol{\mu}_{\gamma_i})$$ It says: compute the sum of the distances between each vector in the dataset and its corresponding cluster center. Notice that $D(\mathbf{x}_i, \boldsymbol{\mu}_{\gamma_i})$ implies that we have a way to compute the cluster membership $\gamma_i$ of each record.
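As a quick numerical check that the two forms agree, here is a small sketch. It reuses the squared Euclidean distance D sketched earlier; the toy names X, centers and gamma are illustrative only:
X = np.array([[1.0, 2.0], [2.0, 5.0], [4.0, 7.0], [8.0, 1.0]])  # toy dataset
centers = np.array([[2.0, 4.0], [7.0, 2.0]])                    # two cluster centers
gamma = np.array([0, 0, 0, 1])                                  # cluster memberships

# First form: sum over clusters, then over the records in each cluster.
E1 = sum(D(x, centers[j]) for j in range(len(centers)) for x in X[gamma == j])

# Second (compact) form: one sum over all records.
E2 = sum(D(X[i], centers[gamma[i]]) for i in range(len(X)))

np.isclose(E1, E2)  # True; both evaluate to 21.0 here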
Let's review \ref{membership-update}:
$$\gamma_i^{(0)} = \underset{0 \leq j \leq k - 1}{\operatorname{argmin}} D(\mathbf{x}_i, \boldsymbol{\mu}_j^{(0)})$$ The text above says:
$\gamma_{i}^{(0)}$ is set to the index of the cluster to which $\mathbf{x}_{i}$ has the smallest distance.
Notice that we are assuming an iterative process. The superscript $^{(0)}$ introduced here refers to the iteration number: this is cluster membership update $0$.
The notation $$\underset{0 \leq j \leq k - 1}{\operatorname{argmin}} \text{distance to $j$-th cluster}$$ means that we have to choose the cluster index $j$ (which could be any of the integer values from $0$ to $k - 1$) that minimizes the distance to the current record $\mathbf{x}_i$.
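A minimal sketch of this membership update with np.argmin, reusing the toy names X and centers introduced above:
# Distance from every record to every center; shape (n, k).
dists = np.array([[D(x, mu) for mu in centers] for x in X])

# For every record, pick the index of the closest center.
gamma_new = dists.argmin(axis=1)
gamma_new  # array([0, 0, 0, 1]) for the toy data above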
Now let's review the notation of the cluster center update formula \ref{center-update}:
$$\boldsymbol{\mu}_j^{(1)} = \frac{1}{\left| \left\{ i \middle| \gamma_i^{(0)} = j\right\} \right|} \sum_{i = 0, \gamma_i^{(0)} = j}^{n - 1}\mathbf{x}_i, \qquad\text{$j = 0$, $1$, $\cdots$, $k - 1$}$$ We are again using the membership function $\gamma_i$ to simplify notation. The outer index $j$ says that we have to recompute the cluster center for each one of the $k$ clusters.
For a specific value of $j$ (we are re-computing the center for the $j$-th cluster), we simply have to use the cluster center formula, but we have to consider only those records that belong to this cluster:
$$\sum_{i = 0, \gamma_i^{(0)} = j}^{n - 1}\mathbf{x}_i$$
Notice the requirement in the summation index: sum every record, as long as $\gamma_{i} = j$ (as long as the record belongs to the $j$-th cluster).
$\left| \left\{ i \middle| \gamma_i^{(0)} = j\right\} \right|$ is simply the number of elements of the $j$-th cluster. It says: the cardinality of the set of all indexes $i$ such that the $i$-th record belongs to cluster $j$.
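A matching sketch of the center update (again with the toy names from above): for each cluster index j, average only the records whose current membership equals j. Real code should also guard against empty clusters:
k = len(centers)

# Recompute each center as the component-wise mean of the records assigned to it.
new_centers = np.array([X[gamma_new == j].mean(axis=0) for j in range(k)])
new_centers
Alternating this center update with the membership update above, until the memberships stop changing or a maximum number of iterations is reached, is the whole k-means algorithm.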
In [6]:
%matplotlib inline
In [7]:
import pandas as pd
input_file = "data/600points.csv"
# Input file has labeled records. 'label' is one of 1, 2 or 3.
labeled_df = pd.read_csv(input_file,
                         header=None,
                         names=['id', 'x', 'y', 'label'])
labeled_df.dtypes
Out[7]:
In [8]:
# To visualize "named" colors see:
# https://matplotlib.org/1.4.3/examples/color/named_colors.html
label_to_color = {1: 'red', 2: 'deepskyblue', 3: 'green'}
colors = labeled_df.loc[:, 'label'].map(label_to_color)
labeled_df.plot.scatter('x', 'y', grid=True, c=colors);
In [9]:
# Unlabeled data (prior to "clustering").
df = labeled_df.loc[:,['x', 'y']]
df.plot.scatter('x', 'y', grid=True);
Install scikit-learn using pip ($ pip install scikit-learn) or anaconda ($ conda install scikit-learn). Use Python's docopt or argparse (available in the standard library) to parse the following command line options:
Allowed options:
  --help                produce help message
  --datafile arg        the data file
  --k arg (=3)          number of clusters
  --maxiter arg (=100)  maximum number of iterations
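A minimal argparse sketch for these options (the option names and defaults come from the listing above; the description and help strings are only an example, and argparse adds --help automatically):
import argparse

parser = argparse.ArgumentParser(description="Run k-means on a data file.")
parser.add_argument("--datafile", required=True, help="the data file")
parser.add_argument("--k", type=int, default=3, help="number of clusters")
parser.add_argument("--maxiter", type=int, default=100, help="maximum number of iterations")

args = parser.parse_args()
print(args.datafile, args.k, args.maxiter)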
Implement the algorithm by yourself and compare with the results obtained with scikit-learn.
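One possible structure for that comparison is sketched below. lloyd_kmeans is a straightforward implementation of the two updates reviewed earlier (random initialization, squared Euclidean distance); sklearn.cluster.KMeans is used as the reference. The function name, the seed, and the reuse of the DataFrame df from the cells above are assumptions of this sketch:
import numpy as np
from sklearn.cluster import KMeans

def lloyd_kmeans(X, k, maxiter=100, seed=0):
    # Plain k-means: random initial centers, then alternate the two updates.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    gamma = None
    for _ in range(maxiter):
        # Membership update: index of the closest center for every record.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_gamma = dists.argmin(axis=1)
        if gamma is not None and np.array_equal(new_gamma, gamma):
            break  # memberships stopped changing
        gamma = new_gamma
        # Center update: mean of the records assigned to each cluster.
        for j in range(k):
            members = X[gamma == j]
            if len(members) > 0:  # keep the old center for an empty cluster
                centers[j] = members.mean(axis=0)
    return gamma, centers

X = df.to_numpy()  # the unlabeled DataFrame built above
gamma, centers = lloyd_kmeans(X, k=3)

reference = KMeans(n_clusters=3, n_init=10).fit(X)
print(centers)
print(reference.cluster_centers_)  # note: cluster indices may be permuted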
For plotting in Haskell, you can use the matplotlib bindings for Haskell. Try to use the existing Scala implementations: