First we need to import several modules
This notebook demonstrates the use of K-Means from the scikit-learn library on a simple mixture of Gaussians.
First, let's generate some data to cluster. The code below draws from a mixture of three 2-dimensional Gaussian distributions, with centers on (-5,3), (2,-2) and (0,2) and per-axis standard deviations of (0.25,2), (0.25,0.25) and (2,0.25), respectively. The first component contributes 200 points and the other two contribute 100 points each.
In [38]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
gMix=list()
gMix.append(np.random.normal((-5,3),[0.25,2],(200,2)))
gMix.append(np.random.normal((2,-2),[0.25,0.25],(100,2)))
gMix.append(np.random.normal((0,2),[2,0.25],(100,2)))
mixture=np.concatenate(tuple(gMix))
#plt.rc('axes', color_cycle=['r', 'g', 'b', 'y'])
for i in range(0,len(gMix)):
    plt.plot(gMix[i][:,0],gMix[i][:,1],'^')
plt.title('Mixture of Gaussians')
Out[38]:
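The calls to np.random.normal above rely on broadcasting: the 2-element center and standard deviation are applied column by column across the requested (200,2) output. A minimal sketch of that behaviour for the first component is shown here. It uses a local RandomState (an assumption made for this example, so the notebook's global seed is left untouched), which means the numbers will not match the cell above exactly.

```python
import numpy as np

# Local generator, so the global seed set in the notebook is not disturbed
rng = np.random.RandomState(42)

# loc and scale each have 2 entries and broadcast across the 2 columns:
# column 0 is drawn with mean -5 and std 0.25, column 1 with mean 3 and std 2
sample = rng.normal(loc=(-5, 3), scale=(0.25, 2), size=(200, 2))

print(sample.shape)         # one row per point, one column per dimension
print(sample.mean(axis=0))  # should be close to (-5, 3)
print(sample.std(axis=0))   # should be close to (0.25, 2)
```

Each row of the result is one 2-dimensional point, which is why the plotting code indexes columns with [:,0] and [:,1].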
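To use K-Means we first need to import the KMeans class from the sklearn.cluster module (scikit-learn). We set only 2 of the several parameters it accepts (the full list is in the scikit-learn KMeans documentation).
Here we specify only the number of clusters and the type of initialization. The 'k-means++' type is a smart initialization that spreads out the initial centroids and has been shown to produce better clustering in fewer iterations.
Other parameters include n_init (the number of runs with different initial centroids), max_iter (the maximum number of iterations per run), tol (the convergence tolerance) and random_state (the seed for the centroid initialization).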
In [43]:
from sklearn.cluster import KMeans
numClusters=3
initType='k-means++'
estimator = KMeans(n_clusters=numClusters,init=initType)
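To make the idea behind 'k-means++' more concrete, here is a minimal NumPy sketch of that style of seeding: the first centroid is a random data point, and each following centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. The function name kmeans_pp_init and the sample data are hypothetical and written only for this illustration. This is not scikit-learn's implementation, which may differ in its details.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k rows of X as initial centroids with k-means++ style seeding.

    Hypothetical helper for illustration only, not scikit-learn's code.
    """
    n = X.shape[0]
    # First centroid: a data point chosen uniformly at random
    centers = [X[rng.randint(n)]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points are more likely to become the next centroid
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
    return np.asarray(centers)

# Data shaped like the mixture used in this notebook (separate seed)
rng = np.random.RandomState(0)
X = np.concatenate([
    rng.normal((-5, 3), (0.25, 2), (200, 2)),
    rng.normal((2, -2), (0.25, 0.25), (100, 2)),
    rng.normal((0, 2), (2, 0.25), (100, 2)),
])
init_centers = kmeans_pp_init(X, 3, rng)
print(init_centers)
```

Because points that are already chosen have zero distance to themselves, they cannot be picked twice, and the starting centroids tend to land in different regions of the data.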
We now proceed to cluster the data with the parameters specified before. Here, the mixture of Gaussians is provided to the fit_predict method, which returns a numpy.array holding, for each data point, the index of the cluster (centroid) it is assigned to. We also use the cluster_centers_ attribute of the KMeans class to get the centroids.
Other methods and attributes are described in the scikit-learn KMeans documentation.
After the clustering we separate the data according to its respective cluster and plot the results.
In [45]:
assignment=estimator.fit_predict(mixture)
centroids=estimator.cluster_centers_
# Group the points by cluster label with a boolean mask
# (comparing a NumPy array to None inside an if statement raises an error)
assignedData=[mixture[assignment==k] for k in range(numClusters)]
for i in range(0,numClusters):
    plt.plot(assignedData[i][:,0],assignedData[i][:,1],'^')
    plt.plot(centroids[i][0],centroids[i][1],'ko')
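As a quick check of what fit_predict and cluster_centers_ return, the following self-contained sketch rebuilds a similar mixture and inspects the shapes of the results. The random_state and n_init values are assumptions added here for reproducibility and are not part of the cells above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rebuild a mixture like the one in this notebook (separate local seed)
rng = np.random.RandomState(42)
X = np.concatenate([
    rng.normal((-5, 3), (0.25, 2), (200, 2)),
    rng.normal((2, -2), (0.25, 0.25), (100, 2)),
    rng.normal((0, 2), (2, 0.25), (100, 2)),
])

# n_init and random_state are set explicitly only to make the run repeatable
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)      # one cluster index per data point
centers = km.cluster_centers_   # one row per centroid

# Split the points by label with a boolean mask
groups = [X[labels == k] for k in range(3)]

print(labels.shape, centers.shape)
print([len(g) for g in groups])
```

The label array has one entry per data point and the centroid array has one row per cluster, so labels can be used directly as a mask to select the points of each cluster.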