
Using K-Means

This notebook demonstrates the use of the KMeans class from the scikit-learn library on a simple mixture of Gaussians.

Generate data

First, we import NumPy and Matplotlib and generate some data to cluster. Let's consider a mixture of three 2-dimensional Gaussian distributions, centered at (-5, 3), (2, -2) and (0, 2), with per-axis standard deviations of (0.25, 2), (0.25, 0.25) and (2, 0.25), respectively.


In [38]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

gMix = []
gMix.append(np.random.normal((-5, 3), [0.25, 2], (200, 2)))
gMix.append(np.random.normal((2, -2), [0.25, 0.25], (100, 2)))
gMix.append(np.random.normal((0, 2), [2, 0.25], (100, 2)))

mixture = np.concatenate(gMix)

# Plot each Gaussian component in its own color.
for g in gMix:
    plt.plot(g[:, 0], g[:, 1], '^')
plt.title('Mixture of Gaussians')


Out[38]:
<matplotlib.text.Text at 0x7feea5f7d9d0>

K-Means class

To use K-Means we first need to import the KMeans class from the sklearn module (scikit-learn). We set only 2 of the many available parameters (see the full list in the KMeans documentation).

Here we specify only the number of clusters to use and the type of initialization to use. The 'k-means++' type is a smart initialization that spaces the original centroids and has been shown to produce better clustering in fewer iterations.

Other parameters are:

  • max_iter: maximum number of iterations to be performed - default=300
  • tol: relative tolerance with regard to inertia to declare convergence - default=1e-4
  • init: 'random', 'k-means++' or a numpy.array of centroids - default='k-means++'
  • n_init: number of initializations to run (the result with the lowest inertia is kept) - default=10
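The idea behind 'k-means++' seeding can be sketched in a few lines. This is a simplified illustration, not scikit-learn's implementation; the function name kmeans_pp_init is ours. The first centroid is drawn uniformly at random, and each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # Pick the first centroid uniformly at random from the data.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid.
        d2 = np.min(((X[:, None] - np.array(centroids)) ** 2).sum(-1), axis=1)
        # Sample the next centroid with probability proportional to d2,
        # favoring points far away from the current centroids.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((-5, 3), 0.5, (50, 2)),
               rng.normal((2, -2), 0.5, (50, 2))])
print(kmeans_pp_init(X, 2, rng).shape)  # (2, 2)
```

This spreading-out of the initial centroids is what makes k-means++ tend to converge in fewer iterations than purely random initialization.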

In [43]:
from sklearn.cluster import KMeans

numClusters=3
initType='k-means++'

estimator = KMeans(n_clusters=numClusters,init=initType)

Clustering

We now proceed to cluster the data with the parameters specified before. Here, the mixture of Gaussians is provided to the fit_predict method, which returns a numpy.array with the assignment of each data point to its respective centroid. We also use the cluster_centers_ attribute of the KMeans class to get the centroids.

Other methods and attributes are fully described in the scikit-learn documentation.

After the clustering we separate the data according to its respective cluster and plot the results.


In [45]:
assignment=estimator.fit_predict(mixture)
centroids=estimator.cluster_centers_

# Group the points by their assigned cluster.
assignedData = [None] * numClusters
for i, j in enumerate(assignment):
    if assignedData[j] is not None:
        assignedData[j] = np.vstack((assignedData[j], mixture[i]))
    else:
        # Keep the first point 2-dimensional so later [:, 0] indexing works.
        assignedData[j] = mixture[i].reshape(1, -1)

for i in range(numClusters):
    plt.plot(assignedData[i][:, 0], assignedData[i][:, 1], '^')
    plt.plot(centroids[i][0], centroids[i][1], 'ko')
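Beyond fit_predict, a fitted estimator can also label unseen points with predict, and the final within-cluster sum of squared distances is exposed as the inertia_ attribute. A brief self-contained sketch (the two query points below are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal((-5, 3), 0.5, (50, 2)),
               rng.normal((2, -2), 0.5, (50, 2))])

est = KMeans(n_clusters=2, init='k-means++', n_init=10).fit(X)

# predict assigns each unseen point to its nearest fitted centroid.
labels = est.predict(np.array([[-5.0, 3.0], [2.0, -2.0]]))

# inertia_ is the final within-cluster sum of squared distances.
print(labels, est.inertia_)
```

Since the two query points sit near different cluster centers, they receive different labels.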


