Data clustering is one of the main applications of mathematics, and a variety of algorithms have been developed to tackle the problem. K-means is one of the basic algorithms for data clustering.
The reader should be familiar with basic linear algebra.
The reader should be able to recognise applications where the k-means algorithm can be used efficiently, and to apply it.
Credits. The notebook is based on I. Mirošević, Spectral Graph Partitioning and Application to Knowledge Extraction.
The data clustering problem is the following: partition a given set of $m$ objects of the same type into $k$ subsets according to some criterion. An additional request may be to find the optimal $k$.
The k-means clustering problem is the following: partition the set $X=\{x_{1},x_{2},\ldots ,x_{m}\}$, where $x_{i}\in\mathbb{R}^{n}$, into $k$ clusters $\pi=\{C_{1},C_{2},\ldots,C_{k}\}$ such that $$ J(\pi)=\sum_{i=1}^{k}\sum_{x\in C_{i}}\| x-c_{i}\|_{2}^{2} \to \min $$ over all possible partitions. Here $c_{i}=\displaystyle\frac{1}{|C_{i}|}\sum_{x\in C_{i}} x$ is the mean of points in $C_i$ and $|C_i|$ is the cardinality of $C_i$.
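As a quick numerical check, $J(\pi)$ can be computed directly from its definition (a minimal sketch; the helper name `objectiveJ` is ours):

```julia
using LinearAlgebra, Statistics

# J(π) computed directly from the definition: X is an array of points,
# C[i] is the cluster index of X[i].  (The helper name objectiveJ is ours.)
function objectiveJ(X, C, k)
    J = 0.0
    for j = 1:k
        members = X[C .== j]
        cj = mean(members)                 # cluster mean c_j
        J += sum(norm(x - cj)^2 for x in members)
    end
    J
end

X = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
objectiveJ(X, [1, 1, 2, 2], 2)   # 4.0: each point is at distance 1 from its mean
```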
The k-means clustering algorithm is the following:

1. Initialization: choose $k$ initial means $c_1,\ldots,c_k$ (for example, $k$ randomly chosen points of $X$).
2. Assignment: assign each point $x_i$ to the cluster $C_j$ whose mean $c_j$ is nearest.
3. Update: recompute each mean $c_j$ as the mean of the points currently assigned to $C_j$.
4. Repeat steps 2 and 3 until the means stop changing.
A first variation of a partition $\pi=\{C_1,\ldots,C_k\}$ is a partition $\pi^{\prime}=\{C_{1}^{\prime},\cdots ,C_{k}^{\prime }\}$ obtained by moving a single point $x$ from a cluster $C_{i}$ to a cluster $C_{j}$. Notice that $\pi$ is a first variation of itself.
A next partition of the partition $\pi$ is a partition $\mathop{\mathrm{next}}(\pi)=\mathop{\mathrm{arg\,min}}\limits_{\pi^{\prime}} J(\pi^{\prime})$, where the minimum is taken over all first variations $\pi^{\prime}$ of $\pi$.
The first variation clustering algorithm is the following:

1. Start with a partition $\pi$.
2. Compute $\mathop{\mathrm{next}}(\pi)$.
3. If $J(\mathop{\mathrm{next}}(\pi)) < J(\pi)$, set $\pi=\mathop{\mathrm{next}}(\pi)$ and go to step 2; otherwise stop.
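The definition of $\mathop{\mathrm{next}}(\pi)$ can be sketched by brute force: try every single-point move and keep the best. (The function name `nextpartition` is ours; a practical implementation would update $J$ incrementally instead of recomputing it from scratch.)

```julia
using LinearAlgebra, Statistics

# Brute-force computation of next(π): try moving every point to every
# cluster and return the assignment with the smallest J.
function nextpartition(X, C, k)
    J(C) = sum(sum(norm(x - mean(X[C .== j]))^2 for x in X[C .== j]) for j = 1:k)
    bestC, bestJ = copy(C), J(C)
    for i = 1:length(X), j = 1:k
        C2 = copy(C)
        C2[i] = j
        # skip moves that would leave a cluster empty
        all(count(==(l), C2) > 0 for l = 1:k) || continue
        if J(C2) < bestJ
            bestC, bestJ = C2, J(C2)
        end
    end
    bestC, bestJ
end

# A deliberately bad partition of two well-separated pairs:
X = [[0.0], [1.0], [10.0], [11.0]]
nextpartition(X, [1, 2, 1, 2], 2)
```

A single first-variation step already lowers $J$ here, since any move that groups nearby points together beats the interleaved assignment.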
The k-means clustering problem is NP-hard.
In the k-means algorithm, $J(\pi)$ decreases in every iteration.
K-means algorithm can converge to a local minimum.
Each iteration of the k-means algorithm requires $O(mnk)$ operations.
The k-means algorithm is implemented in the function kmeans() in the package Clustering.jl.
$J(\pi)=\mathop{\mathrm{trace}}(S_W)$, where $$ S_{W}=\sum\limits_{i=1}^k\sum\limits_{x\in C_{i}} (x-c_i)(x-c_{i})^{T} =\sum_{i=1}^k\frac{1}{2|C_{i}|}\sum_{x\in C_{i}}\sum_{y \in C_{i}} (x-y)(x-y)^{T}. $$ Let $c$ denote the mean of $X$. Then $S_W=S_T-S_B$, where \begin{align*} S_{T}&=\sum_{x\in X}(x-c)(x-c)^{T} = \frac{1}{2m}\sum_{i=1}^m\sum_{j=1}^m (x_{i}-x_{j})(x_{i}-x_{j})^{T}, \\ S_{B}&=\sum_{i=1}^k|C_{i}|(c_{i}-c)(c_{i}-c)^{T} = \frac{1}{2m}\sum_{i=1}^k\sum_{j=1}^k|C_{i}||C_{j}| (c_{i}-c_{j})(c_{i}-c_{j})^{T}. \end{align*}
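These identities are easy to verify numerically; the following sketch checks both $J(\pi)=\mathop{\mathrm{trace}}(S_W)$ and $S_W=S_T-S_B$ on small random data:

```julia
using LinearAlgebra, Statistics

# Numerically verify J(π) = trace(S_W) and S_W = S_T - S_B.
X = [randn(2) for _ = 1:10]
C = [ones(Int, 5); fill(2, 5)]    # two clusters of five points each
k = 2
c  = mean(X)                      # mean of all points
cs = [mean(X[C .== j]) for j = 1:k]   # cluster means
SW = sum(sum((x - cs[j]) * (x - cs[j])' for x in X[C .== j]) for j = 1:k)
ST = sum((x - c) * (x - c)' for x in X)
SB = sum(count(==(j), C) * (cs[j] - c) * (cs[j] - c)' for j = 1:k)
J  = sum(sum(norm(x - cs[j])^2 for x in X[C .== j]) for j = 1:k)
@assert SW ≈ ST - SB
@assert J ≈ tr(SW)
```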
In order to try to avoid convergence to local minima, the k-means algorithm can be enhanced with first variation by adding the following steps: once the k-means iteration has converged to a partition $\pi$, compute $\mathop{\mathrm{next}}(\pi)$; if $J(\mathop{\mathrm{next}}(\pi))<J(\pi)$, restart the k-means iteration from $\mathop{\mathrm{next}}(\pi)$, otherwise stop.
In [1]:
using LinearAlgebra
using Random
using Statistics
In [2]:
function myKmeans(X::Array{T}, k::Int64) where T
    # X is an array of arrays (each element is a point)
    m,n=length(X),length(X[1])
    C=Array{Int64}(undef,m)
    # Choose k random points of X as initial means
    c=X[randperm(m)[1:k]]
    # Make cnew differ from c so the while loop starts
    cnew=copy(c)
    cnew[1]=zeros(size(c[1])) # cnew[1].+[1.0;1.0]
    # Loop until the means stop changing
    iterations=0
    while cnew!=c
        iterations+=1
        cnew=copy(c)
        # Assignment step: each point goes to the nearest mean
        for i=1:m
            C[i]=findmin([norm(X[i]-c[j]) for j=1:k])[2]
        end
        # Update step: recompute each mean
        for j=1:k
            c[j]=mean(X[C.==j])
        end
    end
    C,c,iterations
end
Out[2]:
In [3]:
using Plots
In [5]:
# Generate points around k random centers
import Random
Random.seed!(362)
k=5
centers=rand(-5:5,k,2)
# Number of points in each cluster
sizes=rand(10:50,k)
# X is an array of arrays
X=Array{Array{Any}}(undef,sum(sizes))
first=0
last=0
for j=1:k
    first=last+1
    last=last+sizes[j]
    for i=first:last
        X[i]=map(Float64,vec(centers[j,:])+(rand(2) .-0.5)/2)
    end
end
centers
Out[5]:
In [6]:
sizes
Out[6]:
In [7]:
# Prepare for plotting: extract x and y coordinates
function extractxy(X::Array)
    x=map(Float64,[X[i][1] for i=1:length(X)])
    y=map(Float64,[X[i][2] for i=1:length(X)])
    x,y
end
x,y=extractxy(X)
Out[7]:
In [8]:
scatter(x,y)
scatter!(centers[:,1],centers[:,2],markersize=6, aspect_ratio=:equal)
Out[8]:
In [9]:
# Cluster the data, repeat several times
C,c,iterations=myKmeans(X,k)
Out[9]:
In [10]:
print(C)
In [11]:
# Plot the solution
function plotKmeansresult(C::Array,c::Array,X::Array)
    x,y=extractxy(X)
    cx,cy=extractxy(c)
    # Clusters
    scatter(x[findall(C.==1)],y[findall(C.==1)])
    for j=2:k
        scatter!(x[findall(C.==j)],y[findall(C.==j)])
    end
    # Centers
    scatter!(cx,cy,markercolor=:black)
end
Out[11]:
In [12]:
plotKmeansresult(C,c,X)
Out[12]:
What happens?
We see that the algorithm, although simple, may converge to a local minimum for this example.
Let us try the function kmeans() from the package Clustering.jl. The input of this function is a matrix whose columns are the points, the number of clusters we are looking for, and, optionally, the method used to compute the initial means.
If we choose init=:rand, the results are similar. If we choose init=:kmpp, which is the default, the results are better, but convergence to a local minimum is still possible.
Run the clustering several times! The available seeding methods are dispatched as follows (from the package source):
seeding_algorithm(s::Symbol) =
    s == :rand ? RandSeedAlg() :
    s == :kmpp ? KmppAlg() :
    s == :kmcen ? KmCentralityAlg() :
    error("Unknown seeding algorithm $s")
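For intuition, k-means++ seeding (the idea behind init=:kmpp) can be sketched as follows. The function name `kmpp_seeds` is ours; Clustering.jl's KmppAlg is the real implementation.

```julia
using LinearAlgebra, Random

# A sketch of k-means++ seeding: choose the first center uniformly at
# random, then choose each subsequent center with probability
# proportional to the squared distance from the nearest center so far.
function kmpp_seeds(X, k)
    c = [X[rand(1:length(X))]]
    for _ = 2:k
        d2 = [minimum(norm(x - cj)^2 for cj in c) for x in X]
        p = cumsum(d2 ./ sum(d2))
        # sample an index from the distribution d2/sum(d2)
        push!(c, X[something(findfirst(>=(rand()), p), length(X))])
    end
    c
end

X = [randn(2) for _ = 1:100]
c = kmpp_seeds(X, 5)
```

Spreading the initial means out in this way makes convergence to a bad local minimum less likely, though not impossible.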
In [13]:
using Clustering
In [14]:
?kmeans
Out[14]:
In [15]:
methods(kmeans)
Out[15]:
In [16]:
X1=Matrix(transpose([x y]))
out=kmeans(X1,k,init=:kmpp)
Out[16]:
In [17]:
X1
Out[17]:
In [18]:
fieldnames(KmeansResult)
Out[18]:
In [19]:
out.centers
Out[19]:
In [20]:
println(out.assignments)
In [21]:
# We need to modify the plotting function
function plotKmeansresult(out::KmeansResult,X::Array)
    k=size(out.centers,2)
    # Clusters
    scatter(X[1,findall(out.assignments.==1)],X[2,findall(out.assignments.==1)])
    for j=2:k
        scatter!(X[1,findall(out.assignments.==j)],X[2,findall(out.assignments.==j)])
    end
    # Centers
    scatter!(out.centers[1,:],out.centers[2,:],markercolor=:black,aspect_ratio=:equal)
end
Out[21]:
In [22]:
plotKmeansresult(out,X1)
Out[22]:
In [23]:
out=kmeans(X1,k,init=:kmpp)
plotKmeansresult(out,X1)
Out[23]:
In [24]:
# Number of rings, try also k=3
k=2
# Center
center=[rand(-5:5);rand(-5:5)]
# Radii
radii=randperm(10)[1:k]
# Number of points in circles
sizes=rand(1000:2000,k)
center,radii,sizes
Out[24]:
In [25]:
# Points on the rings, with small radial noise
X=Array{Float64}(undef,2,sum(sizes))
first=0
last=0
for j=1:k
    first=last+1
    last=last+sizes[j]
    # Random angles
    ϕ=2*π*rand(sizes[j])
    for i=first:last
        l=i-first+1
        X[:,i]=center+radii[j]*[cos(ϕ[l]);sin(ϕ[l])]+(rand(2).-0.5)/50
    end
end
scatter(X[1,:],X[2,:],markersize=2,aspect_ratio=:equal)
Out[25]:
In [26]:
# K-means groups points by distance to the centers, so it cannot separate concentric rings
out=kmeans(X,k)
plotKmeansresult(out,X)
Out[26]: