KMeans++ Initialization

Arthur and Vassilvitskii, "k-means++: The Advantages of Careful Seeding" (2006)
KMeans is a clustering algorithm.
"K" refers to the number of cluster centers.
"Means" refers to the means (centroids) of data in a cluster.

In KMeans, there are many ways to choose initial cluster centers.
The "++" of KMeans++ refers to an initialization procedure that distributes initial cluster centers across the data. It takes longer than many initialization procedures, but it greatly reduces the number of KMeans iterations needed to converge and tends to give better clusterings.

Algorithm:

$ \begin{array}{lll} \hline \text{In} & k & \text{number of cluster centers} \\ & \textit{data} & \text{rows of data points to be clustered} \\ \hline \text{Out} & \textit{centroids} & \text{locations of initial cluster centers} \\ \end{array} $

  1. Pick a random data point (row vector) as the first cluster center. Store this row vector as the first row of a k-row matrix of centers.
  2. Until k cluster centers have been chosen:
    1. For every data point: calculate the minimum SSD to any of the stored cluster centers. Store these minimum distances, one per data point, in a vector. (SSD is the sum of squared differences, i.e., squared Euclidean distance.)
    2. Convert the vector of SSD values to a discrete probability distribution (divide by the sum of the vector). Use this probability distribution to weight the selection of another data point as the next cluster center.*

*Note: a data point that has already been selected has probability 0 of being selected again, since its SSD to the nearest center is 0.
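Steps 2.1 and 2.2 can be illustrated with a tiny made-up 1-D dataset (the values below are for illustration only, not from the text). With one center already chosen at x = 0, the SSD vector is normalized into selection probabilities:

In [ ]:
```python
import numpy as np

# hypothetical 1-D data points; first center already chosen at x = 0.0
data = np.array([[0.0], [1.0], [2.0], [10.0]])
center = data[0]

# step 2.1: minimum SSD from each point to the chosen center
D2 = ((data - center) ** 2).sum(axis=1)  # [0, 1, 4, 100]

# step 2.2: normalize to a discrete probability distribution
p = D2 / D2.sum()  # [0, 1/105, 4/105, 100/105]
print(p)
```

The already-chosen point (SSD = 0) gets probability 0, and the far-away point at x = 10 is by far the most likely next center, which is exactly the "spread the centers out" behavior the note describes.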


In [ ]:
import numpy as np

# Implements kmeans++ initialization in Python
# Assumes row vectors are data points that will be clustered

# In: k (int; number of centroids), data (array; data to be clustered)
# Out: centroids (matrix; initial centroid locations)
def centroidInit(k, data):
    nRows = data.shape[0]
    if k > nRows:
        print("Warning: k exceeds number of data points")
        print("Setting k to number of data points: " + str(nRows))
        k = nRows
    centroids = np.empty((k, data.shape[1]))  # matrix (row = a centroid)

    # for each data row, holds the *minimum* sum of squared differences
    # (squared Euclidean distance) to any already-chosen centroid
    D2 = np.full(nRows, np.inf)

    # pick a random data row to be the first centroid
    iRow = np.random.randint(nRows)
    centroids[0] = data[iRow]

    # select the remaining k-1 centroids
    for i_k in range(k - 1):
        # update D2 with distances to the most recently added centroid
        D2 = np.minimum(((data - centroids[i_k]) ** 2).sum(axis=1), D2)
        # convert D2 to a discrete probability distribution and sample
        iRow = np.random.choice(nRows, p=D2 / D2.sum())
        centroids[i_k + 1] = data[iRow]
    return centroids