Given a set of observations $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$, where each observation is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k$ ($\le n$) sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS), i.e. the sum of squared distances from each point in a cluster to that cluster's center. In other words, its objective is to find $$\arg \min \limits_{S} {\sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2}$$ where $\boldsymbol{\mu}_i$ is the mean of the points in $S_i$.
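For concreteness, here is a minimal NumPy sketch of the WCSS objective (the function name `wcss` and the array interface are illustrative choices, not part of the definition above):

```python
import numpy as np

def wcss(X, labels, centers):
    """Within-cluster sum of squares: squared distances from each
    point to the center of the cluster it is assigned to."""
    return sum(np.sum((X[labels == i] - centers[i]) ** 2)
               for i in range(len(centers)))
```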
Given an initial set of $k$ means $m_1^{(1)}, \ldots, m_k^{(1)}$, the algorithm proceeds by alternating between two steps:
Assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.) $$S_i^{(t)} = \{x_p : \lVert x_p - m_i^{(t)} \rVert^2 \le \lVert x_p - m_j^{(t)} \rVert^2 \ \forall j,\ 1 \le j \le k\}$$ where each $x_p$ is assigned to exactly one $S_i^{(t)}$, even if it could be assigned to two or more of them.
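This step can be sketched in NumPy as follows (assuming `X` holds the points as an `(n, d)` array and `means` the current means as a `(k, d)` array; both names are illustrative):

```python
import numpy as np

def assign(X, means):
    """Assign each point to the nearest mean under squared Euclidean
    distance. argmin breaks ties toward the lowest cluster index, so
    each point lands in exactly one cluster."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    return d2.argmin(axis=1)
```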
Calculate the new means to be the centroids of the observations in the new clusters: $$m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$$ Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster sum of squares (WCSS) objective.
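The corresponding update step, again as a hedged sketch (it assumes every cluster is non-empty; practical implementations typically reseed empty clusters):

```python
import numpy as np

def update_means(X, labels, k):
    """Recompute each mean as the centroid of its assigned points.
    An empty cluster would yield NaNs here and needs special handling."""
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```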
The algorithm has converged when the assignments no longer change. Since neither step increases the WCSS objective and only finitely many such partitionings exist, the assignments must eventually stop changing, so the algorithm converges to a (local) optimum. There is no guarantee that the global optimum is found using this algorithm.
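Putting the two steps together gives a Lloyd-style loop; this sketch reuses the `assign` and `update_means` functions above and leaves the choice of initial means to the caller:

```python
import numpy as np

def k_means(X, init_means, max_iter=100):
    """Alternate assignment and update until the labels stop changing,
    i.e. until the algorithm has converged to a local optimum."""
    means = init_means.copy()
    labels = assign(X, means)
    for _ in range(max_iter):
        means = update_means(X, labels, len(means))
        new_labels = assign(X, means)
        if np.array_equal(new_labels, labels):  # assignments unchanged
            break
        labels = new_labels
    return means, labels
```

For example, `k_means(X, X[np.random.choice(len(X), 3, replace=False)])` starts from three means drawn at random from the data (the Forgy initialization).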
Unlike clustering algorithms such as k-means or k-medoids, affinity propagation does not require the number of clusters to be determined or estimated before running the algorithm. Similar to k-medoids, affinity propagation finds "exemplars", members of the input set that are representative of clusters.
Let $x_1$ through $x_n$ be a set of data points, with no assumptions made about their internal structure, and let $s$ be a function that quantifies the similarity between any two points, such that $s(x_i,x_j) > s(x_i,x_k)$ if $x_i$ is more similar to $x_j$ than to $x_k$.
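A common concrete choice for $s$ (an assumption here, not required by the algorithm) is the negative squared Euclidean distance, under which closer points get larger similarities. The diagonal entries $s(i,i)$, the "preferences", influence how many exemplars emerge; the median of the input similarities is a common default:

```python
import numpy as np

def similarity_matrix(X):
    """s(i, j) = -||x_i - x_j||^2; larger (less negative) is more similar.
    The diagonal 'preference' is set to the median off-diagonal
    similarity, a common default giving a moderate number of clusters."""
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(S, np.median(S[~np.eye(len(X), dtype=bool)]))
    return S
```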
The algorithm proceeds by alternating two message-passing steps, which update two matrices:

- The "responsibility" matrix $R$ has values $r(i,k)$ that quantify how well-suited $x_k$ is to serve as the exemplar for $x_i$, relative to other candidate exemplars for $x_i$.
- The "availability" matrix $A$ contains values $a(i,k)$ that represent how "appropriate" it would be for $x_i$ to pick $x_k$ as its exemplar, taking into account other points' preference for $x_k$ as an exemplar.
Both matrices are initialized to all zeroes, and can be viewed as log-probability tables. The algorithm then performs the following updates iteratively:

- First, responsibility updates are sent around: $$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \left\{ a(i,k') + s(i,k') \right\}$$
- Then, availability is updated per $$a(i,k) \leftarrow \min\left(0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\left(0, r(i',k)\right)\right)$$ for $i \neq k$, and $$a(k,k) \leftarrow \sum_{i' \neq k} \max\left(0, r(i',k)\right)$$
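A vectorized NumPy sketch of one round of these updates; the damping factor (mixing old and new messages) is a standard practical addition to avoid oscillations, and the function name is illustrative:

```python
import numpy as np

def update_messages(S, R, A, damping=0.5):
    """One round of responsibility and availability updates, damped as
    new = damping * old + (1 - damping) * update."""
    n = S.shape[0]
    # Responsibility: r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
    M = A + S
    idx = M.argmax(axis=1)
    first = M[np.arange(n), idx]
    M[np.arange(n), idx] = -np.inf            # mask each row's maximum
    second = M.max(axis=1)                    # per-row runner-up
    row_max = np.repeat(first[:, None], n, axis=1)
    row_max[np.arange(n), idx] = second       # exclude column k itself
    R = damping * R + (1 - damping) * (S - row_max)
    # Availability: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
    Rp = np.maximum(R, 0)
    np.fill_diagonal(Rp, R.diagonal())        # the diagonal contributes r(k,k) itself
    col_sum = Rp.sum(axis=0)
    A_new = np.minimum(0, col_sum[None, :] - Rp)      # drop i's own term
    np.fill_diagonal(A_new, col_sum - R.diagonal())   # a(k,k) = sum_{i' != k} max(0, r(i',k))
    A = damping * A + (1 - damping) * A_new
    return R, A
```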
The iterations are performed until either the cluster boundaries remain unchanged over a number of iterations, or a predetermined number of iterations is reached. The exemplars are extracted from the final matrices as those points whose responsibility plus availability for themselves is positive (i.e. $r(i,i) + a(i,i) > 0$).
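Tying the pieces together, a sketch of the outer loop (the fixed iteration budget and the rule for labeling non-exemplar points are simplifying assumptions; `similarity_matrix` and `update_messages` are the sketches above):

```python
import numpy as np

def affinity_propagation(X, max_iter=200):
    """Run message passing for a fixed budget, then read off exemplars
    as the points i with r(i,i) + a(i,i) > 0. Assumes at least one
    exemplar emerges."""
    S = similarity_matrix(X)
    n = len(X)
    R, A = np.zeros((n, n)), np.zeros((n, n))
    for _ in range(max_iter):
        R, A = update_messages(S, R, A)
    exemplars = np.flatnonzero((R + A).diagonal() > 0)
    # Each point joins its most similar exemplar; exemplars label themselves.
    labels = exemplars[S[:, exemplars].argmax(axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels
```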