Clustering Techniques Writeup

September 26, 2016

K-Means Clustering

K-Means clustering is a regression technique which involves 'fitting' a number $n$ of given values from a dataset around a pre-defined number of $k$ clusters. The K-means clustering process seeks to minimize the Euclidean distance from each point $n$ from its associated $k$ cluster. Note that although the $k$ clusters inhabit the same coordinate space as the $n$ data points, the $k$ cluster locations are not necessarily derived from the set of $n$ points.

The process of K-means involves first defining the number of $k$ clusters. Defining the number of $k$ means is a non-trivial problem that is dependent on the specific experiment at hand. Analysis with an elbow plot showing explained variance as a function of number of clusers allows the designer to designate the $k$ number by choosing the point where adding additional $k$ clusters has diminishing returns for explained variance.

The actual algorithm behind K-means clustering is relatively simple. Given $n$ data points and $k$ clusters, we're looking for µ_k (cluster means that minimize the Euclidean distance between the $n$ data points and said cluster). Effectively, the objective function we're trying to minimize can be written as (courtesy SciKit):

$$ \begin{align} \sum\limits_{i=0}^n \min_{µ_j \subseteq C} ||(x_j - µ_i)||^2 \end{align} $$

For the analysis, we must first choose $k$ starting mean µ's. This is often achieved by randomly sampling $k$ points from the dataset arbitrarily. Below, we have an image of the data set at the start, with arbitrarily placed means. (Image sequence courtesy David Runyan, from Project Rhea)

After choosing the initial set of starting means, we can apply a three step iterative process to eventually 'converge' on finalized means:

1) Fit n points to µ means: Calculate the Euclidean distances and assign each $n$ to its closest µ. In this case, all points are closest to the blue node, so all $n$ are assigned to the blue µ.

2) Define new µ's by finding average: Find the center of each cluster and assign those as new µ's. Only the blue µ moves, as the other nodes have no values assigned to them.

3) Repeat until convergence: Repeat until there is no change in the µ positions. I've shown the final converged image below.

PROS:

Oft-implemented, existing documentation.
Has SciKit implementation in Python.

CONS:

Defining number of means is a non-trivial process.
Does not necessarily converge to global solution, as randomized starting positions will alter final result for different results.
Does not set the means to positions to some value from the dataset (if we're interested in brain clusters, we want our clusters to be centered around some set of pre-defined nodes, not some weighted average determined by k-means.)
Assumes similarly sized 'clusters', which is a very large assumption to be making for brain areas.

K-Medoid Clustering

K-Medoids clustering is similar to K-means in that both start with a known, pre-defined number of 'k' clusters. However, K-medoids sets the 'k' clusters to points that occur naturally in the dataset of 'n' points. Thus, while K-means finds new 'means' and fits each 'n' point to these new means by reducing some summed error, K-medoids instead seeks to maximize some measure of similarity (eg: the similarity matrix) between each points and its medoid (which itself is one of the pre-existing points).

The process for K-Medoids is similar to K-Means; again, the problem of defining the number of 'k' means is a difficult process. From there, an additional difficulty lies in how to define the 'similarity' between two points. A way to do so would be to make a similarity matrix that is has each corresponding value as the inverse of the Euclidean distance between the points. Aside from this, a similar process would be applied iteratively to converge to medoids that maximize the similarity between their respective nodes.

The specific name of the algorithmic approach is Partitioning Around Medoids, or PAM. The steps are, specifically: 1) Arbitrarily select 'k' of the 'n' nodes as the medoids. 2) Calculate the total 'similarity' (eg, by finding the inverse of the total of the distances between all 'n' and their closest 'k', or by using some other measure). 3) Swap one 'n' with one 'k', and recalculate the 'similarity' measure. If the 'similarity' increased, keep the new configuration and continue. Otherwise, return to the previous configuration.

PROS:

Simple, like K-means.

CONS:

Has similiar faults as the K-means methodology (defining 'k', not necessarily global, assumes equal sized regions)
Recalculating similarity at each step relative to all other points is very computationally intensive

Spectral Clustering

Up until now, we've focused on clustering techniques centered around compactness. Spectral clustering is a different dataset fitting technique that focuses on connectivity instead.

The actual algorithm behind K-means clustering is relatively simple. Given 'n' data points and 'k' clusters, we're looking for µk (cluster means that minimize the Euclidean distance between the 'n' data points and said cluster).

Pros:

Cons:

Louvain Method of Community Detection

The Louvain method of community detection is a clustering technique that relies on monitoring the relative connectiveness of communities.

From Blondel, Guillaume, Lambiotte and Lefebvre's seminal paper that defined the Louvain method, they defined first a modularity value. From the paper, "the modularity of a partition is a scalar value between -1 and 1 that measures the density of links inside communities as compared to links between communities." In the case of weighted nodes (where for instance they used the number of calls between between two phone users), the modularity is defined as:

$$ \begin{align} Q = \frac{1}{2m} \sum\limits_{i, \ j} \big[ A_{i, \ j} - \frac{k_i k_j}{2m} \big] \ \delta \big( c_i , c_j \big) \end{align} $$

In this case, the $A_{i, \ j}$ refers to the weight of the edge between $i$ and $j$. From there, they defined $k_i$ as the $\sum_{j} A_{i, \ j}$ as the "weights of the edges attached to vertex $i$". $c_i$ is the 'community' to which node $i$ is assigned, while the $\delta$-function $\delta \big( u, v \big)$ is 1 if $u$ is $v$, and 0 otherwise. $m$ is defined to be $\frac{1}{2}\sum_{i, \ j} A_{i, \ j}$ In lay speech, then, what this calculates is 'how connected' some node $i$ is relative some node $j$ (since we're subtracting from the edge weight of $ij$ the edge weights of all connections from $i$ and $j$).

From there, the paper progresses to explain how this traditional definition of modularity takes a significant amount of storage power to process. Data analytics on the original data using continuous modularity calls take more processing power than is truly necessary. What Blondel, Guillaume, Lambiotte, and Lefebvre did was define an algorithmically (and computationally) more efficient method to measure the change in modularity. It does so by calculating a $\Delta Q$ value that is the change in modularity by moving (inserting) some node $i$ from an 'isolated neighborhood' into $C$:

$$ \begin{align} \Delta Q = \bigg[ \frac{\sum_{in} + \ k_{i, \ in}}{2m} - \big( \frac{\sum_{tot} + \ k_i}{2m} \big)^2 \bigg] - \bigg[ \frac{\sum_{in}}{2m} - \big( \frac{\sum_{tot}}{2m} \big)^2 - \big( \frac{k_i}{2m} \big)^2 \bigg] \end{align} $$

A similar equation is provided for the removal of a node $i$ from the neighborhood $C$.

Application-wise, the process for Louvain's method is similar to the approach for all the other regression techniques (K-means, K-medoids). The approach first assigns and puts each node $N$ into a unique community (so it's in its own community). They then calculate the $\Delta Q$ by adding the node $i$ into community $j$; if after all similar additions into other neighborhoods, the $\Delta Q$ is maximized by adding i to j, that process is fulfilled. Otherwise, the system is reverted to before the addition.

By applying this until a local minimum occurs (when no changes to $i$ increase or decrease $\Delta Q$, the Louvain method can create a very solid mapping.

Pros:

Computationally efficient, their calculations in the paper show a significant reduction in computation time (running on 6.3M nodes, their algorithm only took 197 seconds. The only comparable algorithm that even converged was the Wakita and Tsumuri's CNN implementation method, which took almost three times as long).
Accuracy seems to be high, according to their own discussion and results.

Cons:

Requires weighted edges (we haven't generated these yet). If our weights are defined by the number of connections within the epsilon ball, what's the point of using weighted edges over K-means or K-medoids? Isn't that just an extra layer of complexity?
Requires a C-based workaround that allows us to run the methods in Python (so I won't be able to use these lovely Markdown/Python notebooks). Not too big a problem, but it's nice when packets like SciKit already has implementations of the other methods. Link: http://perso.crans.org/aynaud/communities/

Link to Louvain's Method paper: http://arxiv.org/abs/0803.0476



In [22]:

    
%%latex
\begin{align}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
\nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0
\end{align}









    





\begin{align}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
\nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0
\end{align}



In [ ]: