Unsupervised Learning

  • K-means
  • Affinity propagation
  • K-nearest neighbor classifier
  • K-Medoids
  • GMM
  • Spectral clustering
  • Ncut

K-means

Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k$ ($\le n$) sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS), i.e. the sum of squared distances of each point in a cluster to that cluster's center. In other words, its objective is to find $$\arg \min \limits_{S} {\sum_{i=1}^{k} \sum_{\textbf{x}\in \textbf{S}_i} \Arrowvert \textbf{x}- \boldsymbol{\mu}_i \Arrowvert^2}$$ where $\boldsymbol{\mu}_i$ is the mean of the points in $S_i$.
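As a concrete reference, the objective can be written as a short NumPy function. This is only a sketch: `X` is assumed to be an $n \times d$ data matrix, `labels` holds each point's cluster index, and `centers` stacks the $k$ means row by row.

```python
import numpy as np

def wcss(X, labels, centers):
    """Within-cluster sum of squares: sum of squared distances
    from each point to the mean of its assigned cluster."""
    return sum(np.sum((X[labels == i] - mu) ** 2)
               for i, mu in enumerate(centers))
```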

Standard Algorithm

Given an initial set of k means $m_1^{(1)},...,m_k^{(1)}$, the algorithm proceeds by alternating between two steps:

Assignment Step

Assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.) $$ S_i^{(t)} = \{x_p: \Arrowvert x_p - m_i^{(t)} \Arrowvert^2 \le \Arrowvert x_p - m_j^{(t)} \Arrowvert^2 \ \forall j, 1 \le j \le k \} $$ where each $x_p$ is assigned to exactly one $S_i^{(t)}$, even if it could be assigned to two or more of them.
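A minimal NumPy sketch of this step, assuming `X` is $n \times d$ and `centers` is $k \times d$; `argmin` picks the first minimum on ties, which enforces assignment to exactly one cluster:

```python
import numpy as np

def assign_step(X, centers):
    """Assignment step: map each point to the index of its nearest mean
    (squared Euclidean distance)."""
    # distances[p, i] = ||x_p - m_i||^2
    distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)
```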

Update Step

Calculate the new means to be the centroids of the observations in the new clusters: $$m_i^{(t+1)} = \frac {1}{| S_i^{(t)} |} \sum_{x_j \in S_i^{(t)}} x_j$$ Since the arithmetic mean is a least-squares estimator, this step also minimizes the within-cluster sum of squares (WCSS) objective.
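The corresponding sketch of the update step (empty clusters would produce NaN means here; handling them, e.g. by re-seeding a center, is left out):

```python
import numpy as np

def update_step(X, labels, k):
    """Update step: recompute each mean as the centroid of its cluster."""
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```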

The algorithm has converged when the assignments no longer change. Since both steps optimize the WCSS objective and only a finite number of such partitionings exist, the algorithm must converge to a (local) optimum. There is no guarantee that the global optimum is found by this algorithm.
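Putting the two steps together gives the usual loop. This is a sketch reusing the `assign_step` and `update_step` helpers above; the initialization here is a simple random choice of $k$ data points, which is only one of several common schemes.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Alternate the assignment and update steps until the assignments
    stop changing (a local optimum of the WCSS)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        new_labels = assign_step(X, centers)
        if np.array_equal(new_labels, labels):  # assignments unchanged: converged
            break
        labels = new_labels
        centers = update_step(X, labels, k)
    return labels, centers
```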

Affinity propagation

Unlike clustering algorithms such as k-means or k-medoids, affinity propagation does not require the number of clusters to be determined or estimated before running the algorithm. Similar to k-medoids, affinity propagation finds "exemplars", members of the input set that are representative of clusters.

Algorithm

Let $x_1$ through $x_n$ be a set of data points, with no assumptions made about their internal structure, and let $s$ be a function that quantifies the similarity between any two points, such that $s(x_i,x_j) > s(x_i,x_k)$ if $x_i$ is more similar to $x_j$ than to $x_k$.
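A common choice of similarity is the negative squared Euclidean distance, with the diagonal entries (the "preferences") set to the median similarity so that no point is favored as an exemplar a priori. A sketch of building such a similarity matrix:

```python
import numpy as np

def similarity_matrix(X):
    """Negative squared Euclidean distances as similarities; the diagonal
    ("preferences") is set to the median off-diagonal similarity."""
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    off_diag = S[~np.eye(len(X), dtype=bool)]
    np.fill_diagonal(S, np.median(off_diag))
    return S
```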

The algorithm proceeds by alternating two message passing steps, to update two matrices:

  • The "responsibility" matrix $R$ has values $r(i, k)$ that quantify how well-suited $x_k$ is to serve as the exemplar for $x_i$, relative to other candidate exemplars for $x_i$.
  • The "availability" matrix $A$ contains values $a(i, k)$ that represent how "appropriate" it would be for $x_i$ to pick $x_k$ as its exemplar, taking into account other points' preference for $x_k$ as an exemplar.

Both matrices are initialized to all zeroes, and can be viewed as log-probability tables. The algorithm then performs the following updates iteratively:

  • First, responsibility updates are sent around: $$r(i,k) \gets s(i,k) - \max \limits_{k' \ne k} \{ a(i,k')+ s(i,k') \}$$
  • Then, availability is updated per $$ a(i,k) \gets \min \left(0, r(k,k) + \sum_{i' \notin \{i,k\}} \max(0,r(i',k))\right)$$ for $i \ne k$, and $$ a(k,k) \gets \sum_{i'\ne k} \max(0,r(i',k))$$ (a code sketch of both updates follows after this list).
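A direct, loop-based NumPy sketch of one round of both updates. The `damping` factor, which mixes old and new messages to help avoid oscillations, is not discussed above but is standard in practice; vectorized implementations are normally used instead of these explicit loops.

```python
import numpy as np

def update_messages(S, R, A, damping=0.5):
    """One round of the responsibility and availability updates."""
    n = len(S)
    R_new = np.zeros_like(R)
    A_new = np.zeros_like(A)
    for i in range(n):
        for k in range(n):
            # r(i,k) <- s(i,k) - max_{k' != k} { a(i,k') + s(i,k') }
            mask = np.arange(n) != k
            R_new[i, k] = S[i, k] - np.max(A[i, mask] + S[i, mask])
    for i in range(n):
        for k in range(n):
            if i == k:
                # a(k,k) <- sum_{i' != k} max(0, r(i',k))
                mask = np.arange(n) != k
                A_new[k, k] = np.sum(np.maximum(0, R_new[mask, k]))
            else:
                # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
                mask = (np.arange(n) != i) & (np.arange(n) != k)
                A_new[i, k] = min(0.0, R_new[k, k]
                                  + np.sum(np.maximum(0, R_new[mask, k])))
    # damped update of both message matrices
    R = damping * R + (1 - damping) * R_new
    A = damping * A + (1 - damping) * A_new
    return R, A
```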

The iterations are performed until either the cluster boundaries remain unchanged over a number of iterations, or after some predetermined number of iterations. The exemplars are extracted from the final matrices as those points whose combined responsibility and availability for themselves is positive (i.e. $r(i,i) + a(i,i) > 0$).
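In practice a ready-made implementation such as scikit-learn's `AffinityPropagation` can be used. A usage sketch with purely illustrative data:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# two illustrative blobs of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
print(ap.cluster_centers_indices_)  # indices of the exemplar points
print(ap.labels_)                   # cluster assignment of each point
```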
