Clustering with KMeans in Shogun Machine Learning Toolbox

Notebook by Parijat Mazumdar (GitHub ID: mazumdarparijat)

This notebook demonstrates clustering with KMeans in Shogun, covering both initialization and training. Initialization of cluster centers is shown by hand, at random and using the KMeans++ algorithm. Training is done via the classical Lloyd's method and the mini-batch KMeans method. KMeans is then applied to a real-world dataset. Finally, the effect of dimensionality reduction using PCA on KMeans clustering is analysed.

KMeans - An Overview

The KMeans clustering algorithm partitions n observations into k clusters. Each cluster is represented by the mean of the observation vectors belonging to it, and a unique label is attached to all observations in that cluster. Thus, in general, the algorithm takes a parameter k and an observation matrix (along with a notion of distance between points, i.e. a distance metric) as input, and returns the mean of each of the k clusters together with labels indicating which cluster each observation belongs to. Let us construct a simple example to understand how this is done in Shogun using the KMeans class.
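
For the Euclidean distance used throughout this notebook, KMeans can be summarized as finding cluster means $\mu_1,\dots,\mu_k$ that minimize the within-cluster sum of squared distances

$$\sum_{i=1}^{n} \min_{j \in \{1,\dots,k\}} \lVert x_i - \mu_j \rVert^2$$

where $x_1,\dots,x_n$ are the observation vectors.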

Let us start by creating a toy dataset.


In [ ]:
from numpy import concatenate, array
from numpy.random import randn
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')

num = 200
d1 = concatenate((randn(1,num),10.*randn(1,num)),0)
d2 = concatenate((randn(1,num),10.*randn(1,num)),0)+array([[10.],[0.]])
d3 = concatenate((randn(1,num),10.*randn(1,num)),0)+array([[0.],[100.]])
d4 = concatenate((randn(1,num),10.*randn(1,num)),0)+array([[10.],[100.]])

rectangle = concatenate((d1,d2,d3,d4),1)
totalPoints = 800

The toy data created above consists of 4 Gaussian blobs, with 200 points each, centered around the vertices of a rectangle. Let's plot it for convenience.


In [ ]:
import matplotlib.pyplot as pyplot
%matplotlib inline

figure,axis = pyplot.subplots(1,1)
axis.plot(rectangle[0], rectangle[1], 'o', color='r', markersize=5)
axis.set_xlim(-5,15)
axis.set_ylim(-50,150)
axis.set_title('Toy data : Rectangle')
pyplot.show()

With data at our disposal, it is time to apply KMeans to it using the KMeans class in Shogun. First we construct Shogun features from our data:


In [ ]:
from shogun import *

train_features = RealFeatures(rectangle)

Next we specify the number of clusters we want and create a distance object specifying the distance metric to be used over our data for our KMeans training:


In [ ]:
# number of clusters
k = 2

# distance metric over feature matrix - Euclidean distance
distance = EuclideanDistance(train_features, train_features)

Next, we create a KMeans object with our desired inputs/parameters and train:


In [ ]:
# KMeans object created
kmeans = KMeans(k, distance)

# KMeans training 
kmeans.train()

Now that training has been done, let's get the cluster centers and label for each data point


In [ ]:
# cluster centers
centers = kmeans.get_cluster_centers()

# Labels for data points
result = kmeans.apply()

Finally let us plot the centers and the data points (in different colours for different clusters):


In [ ]:
def plotResult(title = 'KMeans Plot'):
    figure,axis = pyplot.subplots(1,1)
    for i in xrange(totalPoints):
        if result[i]==0.0:
            axis.plot(rectangle[0,i], rectangle[1,i], 'o', color='g', markersize=3)
        else:
            axis.plot(rectangle[0,i], rectangle[1,i], 'o', color='y', markersize=3)
    axis.plot(centers[0,0], centers[1,0], 'ko', color='g', markersize=10)
    axis.plot(centers[0,1], centers[1,1], 'ko', color='y', markersize=10)
    axis.set_xlim(-5,15)
    axis.set_ylim(-50,150)
    axis.set_title(title)
    pyplot.show()
    
plotResult('KMeans Results')

Note: You might not always get the perfect result. That is an inherent limitation of the KMeans algorithm. In subsequent sections, we will discuss techniques which allow us to counter this.
Now that we have worked through a simple KMeans example, it's time to understand certain specifics of the KMeans implementation and the options provided by Shogun to its users.

Initialization of cluster centers

The KMeans algorithm requires the cluster centers to be initialized with some values. Shogun offers 3 ways to initialize the cluster centers.

  • Random initialization (default)
  • Initialization by hand
  • Initialization using KMeans++ algorithm
Unless the user supplies initial centers or tells Shogun to use KMeans++, random initialization is the default method for cluster center initialization. This was precisely the case in the example discussed above.

Initialization by hand

There are 2 ways to initialize centers by hand. One way is to pass the centers during KMeans object creation, as follows:


In [ ]:
from numpy import array
initial_centers = array([[0.,10.],[50.,50.]])

# initial centers passed
kmeans = KMeans(k, distance, initial_centers)

Now, let's first get results by repeating the rest of the steps:


In [ ]:
# KMeans training 
kmeans.train(train_features)

# cluster centers
centers = kmeans.get_cluster_centers()

# Labels for data points
result = kmeans.apply()

# plot the results
plotResult('Hand initialized KMeans Results 1')

The other way to initialize centers by hand is as follows:


In [ ]:
new_initial_centers = array([[5.,5.],[0.,100.]])

# set new initial centers
kmeans.set_initial_centers(new_initial_centers)

Let's complete the rest of the code to get results.


In [ ]:
# KMeans training 
kmeans.train(train_features)

# cluster centers
centers = kmeans.get_cluster_centers()

# Labels for data points
result = kmeans.apply()

# plot the results
plotResult('Hand initialized KMeans Results 2')

Note the difference that the initial cluster centers can have on the final result.

Initialization using KMeans++ algorithm

In Shogun, a user can also use the KMeans++ algorithm for center initialization. Using KMeans++ is beneficial because it typically reduces the number of iterations KMeans needs and the final centers are more likely to be close to the global optimum, which is often not the case with plain random initialization. One way to use KMeans++ is to set the corresponding flag to true during KMeans object creation, as follows:


In [ ]:
# set flag for using KMeans++
kmeans = KMeans(k, distance, True)

The other way to initialize using KMeans++ is as follows:


In [ ]:
# set KMeans++ flag
kmeans.set_use_kmeanspp(True)

Completing the rest of the steps to get the result:


In [ ]:
# KMeans training 
kmeans.train(train_features)

# cluster centers
centers = kmeans.get_cluster_centers()

# Labels for data points
result = kmeans.apply()

# plot the results
plotResult('KMeans with KMeans++ Results')

To switch back to random initialization, you may use:


In [ ]:
#unset KMeans++ flag
kmeans.set_use_kmeanspp(False)
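
For intuition, here is a minimal NumPy sketch of the seeding idea behind KMeans++ (an illustration only, not Shogun's internal implementation): the first center is chosen uniformly at random, and every subsequent center is sampled with probability proportional to the squared distance between a point and its nearest already-chosen center.


In [ ]:
from numpy import array
from numpy.random import choice

def kmeanspp_seeds(data, k):
    # data : d x n matrix of observations (same layout as 'rectangle' above)
    n = data.shape[1]
    # first center: chosen uniformly at random
    seeds = [data[:, choice(n)]]
    while len(seeds) < k:
        # squared distance of every point to its nearest already-chosen center
        d2 = array([min(((data[:,i] - c)**2).sum() for c in seeds) for i in range(n)])
        # next center: sampled with probability proportional to d2
        seeds.append(data[:, choice(n, p=d2/d2.sum())])
    return array(seeds).T    # d x k matrix of initial centers

Such seeds could also be passed to KMeans as hand-picked initial centers, exactly as in the "Initialization by hand" section above, e.g. KMeans(k, distance, kmeanspp_seeds(rectangle, k)).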

Training Methods

Shogun offers 2 training methods for KMeans clustering:

  • the classical Lloyd's method
  • the mini-batch KMeans method [1]
Lloyd's training method is used by Shogun by default unless the user switches to the mini-batch training method.
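
As a reference point, one iteration of Lloyd's method alternates between assigning every point to its nearest center and recomputing each center as the mean of the points assigned to it. The bare-bones NumPy sketch below is an illustration only (it assumes the data layout used in this notebook, features in rows), not Shogun's implementation.


In [ ]:
from numpy import array, argmin

def lloyd_iteration(data, centers):
    # data : d x n matrix of observations, centers : d x k matrix of current means
    n = data.shape[1]
    k = centers.shape[1]
    # assignment step: index of the nearest center for every point
    labels = array([argmin(((centers - data[:,i:i+1])**2).sum(axis=0)) for i in range(n)])
    # update step: each center becomes the mean of the points assigned to it
    new_centers = centers.copy()
    for j in range(k):
        if (labels==j).any():
            new_centers[:,j] = data[:,labels==j].mean(axis=1)
    return new_centers, labels

# Lloyd's method repeats this update until the assignments stop changing,
# e.g. new_centers, labels = lloyd_iteration(rectangle, initial_centers)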

Mini-Batch KMeans

Mini-batch KMeans is very useful for extremely large datasets and/or very high-dimensional data, which is often the case in text mining. One can switch to mini-batch KMeans training while creating the KMeans object as follows:


In [ ]:
# set training method to mini-batch
kmeans = KMeansMiniBatch(k, distance)

In mini-batch KMeans it is compulsory to set the batch size and the number of iterations. These parameters can be set together or one after the other.


In [ ]:
# set both parameters together: batch size = 2 and no. of iterations = 100
kmeans.set_mb_params(2,100)

# OR

# set batch size = 2
kmeans.set_batch_size(2)

# set no. of iterations = 100
kmeans.set_mb_iter(100)

Completing the code to get results:


In [ ]:
# KMeans training 
kmeans.train(train_features)

# cluster centers
centers = kmeans.get_cluster_centers()

# Labels for data points
result = kmeans.apply()

# plot the results
plotResult('Mini-batch KMeans Results')
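
For reference, the center update used by mini-batch KMeans [1] works as follows: in every iteration a small random batch of points is drawn, each batch point is assigned to its nearest center, and that center is moved towards the point with a per-center learning rate of 1/(number of points the center has seen so far). The NumPy sketch below (assuming Euclidean distance and the d x n data layout used in this notebook) only illustrates this update rule; it is not Shogun's implementation.


In [ ]:
from numpy import zeros, argmin
from numpy.random import choice

def minibatch_kmeans_sketch(data, centers, batch_size, iterations):
    # data : d x n matrix, centers : d x k matrix of (float) initial means
    counts = zeros(centers.shape[1])
    for _ in range(iterations):
        # draw a random mini-batch of point indices
        batch = choice(data.shape[1], batch_size)
        # nearest current center for every point in the batch
        nearest = [argmin(((centers - data[:,i:i+1])**2).sum(axis=0)) for i in batch]
        for i,j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0/counts[j]    # per-center learning rate
            centers[:,j] = (1.0-eta)*centers[:,j] + eta*data[:,i]
    return centers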

Applying KMeans on Real Data

In this section we see how useful KMeans can be in classifying the different varieties of the Iris plant. For this purpose, we make use of Fisher's Iris dataset, borrowed from the UCI Machine Learning Repository [3]. There are 3 varieties of Iris plants:

  • Iris Setosa
  • Iris Versicolour
  • Iris Virginica
The Iris dataset lists 4 features that can be used to distinguish these varieties, namely

  • sepal length
  • sepal width
  • petal length
  • petal width
It is generally acknowledged that petal length and petal width are the 2 most important features, i.e. the features with the highest class correlations (refer to the dataset's summary statistics). Since the full 4-dimensional feature vectors cannot be plotted directly, we only plot these two most important features in order to understand the dataset (at least partially). Note that we could also have extracted the 2 most important features by applying PCA (or any one of the many dimensionality reduction methods available in Shogun). A quick numeric check of this claim is sketched below.
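
As a rough sanity check, one can compute the absolute correlation of each feature with the class label; the petal measurements should come out on top. The sketch below re-reads the same data file that is loaded in the following cell and assumes the usual ordering of 50 samples per class.


In [ ]:
from numpy import array, corrcoef, repeat

# quick sanity check: correlation of each feature with the class label
with open(os.path.join(SHOGUN_DATA_DIR, 'uci/iris/iris.data')) as datafile:
    rows = [line.rstrip().split(',') for line in datafile if line.strip()]
X = array([[float(v) for v in r[0:4]] for r in rows]).T    # 4 x 150 feature matrix
y = repeat([0., 1., 2.], 50)                               # class label of each sample
for name, feature in zip(['sepal length','sepal width','petal length','petal width'], X):
    print(name + ' vs class : %.2f' % abs(corrcoef(feature, y)[0,1]))

Now let us load the data into an observation matrix and plot the two petal features: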


In [ ]:
f = open(os.path.join(SHOGUN_DATA_DIR, 'uci/iris/iris.data'))
features = []
# read data from file
for line in f:
	if not line.strip():    # skip any blank lines at the end of the file
		continue
	words = line.rstrip().split(',')
	features.append([float(i) for i in words[0:4]])

f.close()

# create observation matrix
obsmatrix = array(features).T

# plot the data
figure,axis = pyplot.subplots(1,1)
# First 50 data belong to Iris Setosa, plotted in green
axis.plot(obsmatrix[2,0:50], obsmatrix[3,0:50], 'o', color='green', markersize=5)
# Next 50 data belong to Iris Versicolour, plotted in red
axis.plot(obsmatrix[2,50:100], obsmatrix[3,50:100], 'o', color='red', markersize=5)
# Last 50 data belong to Iris Virginica, plotted in blue
axis.plot(obsmatrix[2,100:150], obsmatrix[3,100:150], 'o', color='blue', markersize=5)
axis.set_xlim(-1,8)
axis.set_ylim(-1,3)
axis.set_title('3 varieties of Iris plants')
pyplot.show()

In the above plot we see that the data points labelled Iris Setosa form a nice separate cluster of their own. In the case of the other 2 varieties, while the data points with the same label do form clusters of their own, there is some mixing between the clusters at the boundary. Now let us apply the KMeans algorithm and see how well we can extract these clusters.


In [ ]:
def apply_kmeans_iris(data):
    # wrap to Shogun features
    train_features = RealFeatures(data)

    # number of cluster centers = 3
    k = 3

    # distance function features - euclidean
    distance = EuclideanDistance(train_features, train_features)

    # initialize KMeans object
    kmeans = KMeans(k, distance)

    # use kmeans++ to initialize centers [play around: change it to False and compare results]
    kmeans.set_use_kmeanspp(True)

    # training method is Lloyd by default [play around: switch to mini-batch by uncommenting the following lines]
    #kmeans = KMeansMiniBatch(k, distance)
    #kmeans.set_mb_params(20,30)

    # training kmeans
    kmeans.train(train_features)

    # labels for data points
    result = kmeans.apply()
    return result

result = apply_kmeans_iris(obsmatrix)

Now let us create a 2-D plot of the clusters formed making use of the two most important features (petal length and petal width) and compare it with the earlier plot depicting the actual labels of data points.


In [ ]:
# plot the clusters over the original points in 2 dimensions
figure,axis = pyplot.subplots(1,1)
for i in xrange(150):
    if result[i]==0.0:
        axis.plot(obsmatrix[2,i],obsmatrix[3,i],'ko',color='r', markersize=5)
    elif result[i]==1.0:
        axis.plot(obsmatrix[2,i],obsmatrix[3,i],'ko',color='g', markersize=5)
    else:
        axis.plot(obsmatrix[2,i],obsmatrix[3,i],'ko',color='b', markersize=5)

axis.set_xlim(-1,8)
axis.set_ylim(-1,3)
axis.set_title('Iris plants clustered based on attributes')
pyplot.show()

From the above plot, it can be inferred that the accuracy of the KMeans algorithm is very high on the Iris dataset. Don't believe me? Alright, then let us make use of one of Shogun's clustering evaluation techniques to formally validate the claim. But before that, we have to label each sample in the dataset with the class to which it belongs.


In [ ]:
from numpy import ones, zeros

# first 50 are Iris Setosa labelled 0, next 50 are Iris Versicolour labelled 1 and so on
labels = concatenate((zeros(50), ones(50), 2.*ones(50)))

# bind labels assigned to Shogun multiclass labels
ground_truth = MulticlassLabels(array(labels,dtype='float64'))

Now we can compute the clustering accuracy making use of the ClusteringAccuracy class in Shogun:


In [ ]:
from numpy import nonzero

def analyzeResult(result): 
    # shogun object for clustering accuracy
    AccuracyEval = ClusteringAccuracy()

    # changes the labels of result (keeping clusters intact) to produce a best match with ground truth
    AccuracyEval.best_map(result, ground_truth)

    # evaluates clustering accuracy
    accuracy = AccuracyEval.evaluate(result, ground_truth)

    # find out which sample points differ from actual labels (or ground truth)
    compare = result.get_labels()-labels
    diff = nonzero(compare)
    return (diff,accuracy)

(diff,accuracy_4d) = analyzeResult(result)
print 'Accuracy : ' + str(accuracy_4d)

# plot the difference between ground truth and predicted clusters
figure,axis = pyplot.subplots(1,1)
axis.plot(obsmatrix[2,:],obsmatrix[3,:],'x',color='black', markersize=5)
axis.plot(obsmatrix[2,diff],obsmatrix[3,diff],'x',color='r', markersize=7)
axis.set_xlim(-1,8)
axis.set_ylim(-1,3)
axis.set_title('Difference')
pyplot.show()

In the above plot, wrongly clustered data points are marked in red. We see that the Iris Setosa plants are clustered perfectly, without error. The Iris Versicolour and Iris Virginica plants are also clustered with high accuracy, but some samples of these two classes have been assigned to the wrong cluster. This happens near the boundary of the 2 classes in the plot and was to be expected. Having covered the basics of KMeans, it's time to move on to the next interesting topic.

PCA as a preprocessor to KMeans

KMeans is highly affected by the curse of dimensionality, so dimensionality reduction becomes an important preprocessing step. Shogun offers a variety of dimensionality reduction techniques to choose from. Since our data is not very high-dimensional, PCA is a good choice. We have already seen the accuracy of KMeans when all four dimensions are used. In the following exercise we shall see how the accuracy varies as lower-dimensional representations of the data are used.
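
Conceptually, PCA centers the data and projects it onto the leading eigenvectors of its covariance matrix. The minimal NumPy sketch below illustrates what the preprocessor computes (up to sign and ordering of components); the Shogun-based helper apply_pca_to_data defined below is what the rest of the notebook actually uses.


In [ ]:
from numpy import cov, dot
from numpy.linalg import eigh

def pca_project(data, target_dims):
    # data : d x n matrix; returns a target_dims x n matrix of projected data
    centered = data - data.mean(axis=1, keepdims=True)
    # eigendecomposition of the covariance matrix (eigh returns ascending eigenvalues)
    eigenvalues, eigenvectors = eigh(cov(centered))
    top = eigenvectors[:, -target_dims:]    # directions of largest variance
    return dot(top.T, centered)

# e.g. pca_project(obsmatrix, 1) should roughly match apply_pca_to_data(1) up to sign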

1-Dimensional Representation

Let us first apply PCA to reduce the training features to 1 dimension.


In [ ]:
from numpy import dot

def apply_pca_to_data(target_dims):
    train_features = RealFeatures(obsmatrix)
    submean = PruneVarSubMean(False)
    submean.init(train_features)
    submean.apply_to_feature_matrix(train_features)
    preprocessor = PCA()
    preprocessor.set_target_dim(target_dims)
    preprocessor.init(train_features)
    pca_transform = preprocessor.get_transformation_matrix()
    new_features = dot(pca_transform.T, train_features.get_feature_matrix())
    return new_features

oneD_matrix = apply_pca_to_data(1)

Next, let us get an idea of the data in 1-D by plotting it.


In [ ]:
figure,axis = pyplot.subplots(1,1)
# First 50 data belong to Iris Setosa, plotted in green
axis.plot(oneD_matrix[0,0:50], zeros(50), 'o', color='green', markersize=5)
# Next 50 data belong to Iris Versicolour, plotted in red
axis.plot(oneD_matrix[0,50:100], zeros(50), 'o', color='red', markersize=5)
# Last 50 data belong to Iris Virginica, plotted in blue
axis.plot(oneD_matrix[0,100:150], zeros(50), 'o', color='blue', markersize=5)
axis.set_xlim(-5,5)
axis.set_ylim(-1,1)
axis.set_title('3 varieties of Iris plants')
pyplot.show()

Let us now apply KMeans to the 1-D data to get clusters.


In [ ]:
result = apply_kmeans_iris(oneD_matrix)

Now that we have the results, the inevitable step is to check how good these results are.


In [ ]:
(diff,accuracy_1d) = analyzeResult(result)
print 'Accuracy : ' + str(accuracy_1d)

# plot the difference between ground truth and predicted clusters
figure,axis = pyplot.subplots(1,1)
axis.plot(oneD_matrix[0,:],zeros(150),'x',color='black', markersize=5)
axis.plot(oneD_matrix[0,diff],zeros(len(diff)),'x',color='r', markersize=7)
axis.set_xlim(-5,5)
axis.set_ylim(-1,1)
axis.set_title('Difference')
pyplot.show()

2-Dimensional Representation

We follow the same steps as above and get the clustering accuracy.

STEP 1 : Apply PCA and plot the data (plotting is optional)


In [ ]:
twoD_matrix = apply_pca_to_data(2)

figure,axis = pyplot.subplots(1,1)
# First 50 data belong to Iris Setosa, plotted in green
axis.plot(twoD_matrix[0,0:50], twoD_matrix[1,0:50], 'o', color='green', markersize=5)
# Next 50 data belong to Iris Versicolour, plotted in red
axis.plot(twoD_matrix[0,50:100], twoD_matrix[1,50:100], 'o', color='red', markersize=5)
# Last 50 data belong to Iris Virginica, plotted in blue
axis.plot(twoD_matrix[0,100:150], twoD_matrix[1,100:150], 'o', color='blue', markersize=5)
axis.set_title('3 varieties of Iris plants')
pyplot.show()

STEP 2 : Apply KMeans to obtain clusters


In [ ]:
result = apply_kmeans_iris(twoD_matrix)

STEP 3: Get the accuracy of the results


In [ ]:
(diff,accuracy_2d) = analyzeResult(result)
print 'Accuracy : ' + str(accuracy_2d)

# plot the difference between ground truth and predicted clusters
figure,axis = pyplot.subplots(1,1)
axis.plot(twoD_matrix[0,:],twoD_matrix[1,:],'x',color='black', markersize=5)
axis.plot(twoD_matrix[0,diff],twoD_matrix[1,diff],'x',color='r', markersize=7)
axis.set_title('Difference')
pyplot.show()

3-Dimensional Representation

Again, we follow the same steps, but skip plotting data.

STEP 1: Apply PCA to data


In [ ]:
threeD_matrix = apply_pca_to_data(3)

STEP 2: Apply KMeans to 3-D representation of data


In [ ]:
result = apply_kmeans_iris(threeD_matrix)

STEP 3: Get the accuracy of the results. In this step, the 'difference' plot positions data points based on petal length and petal width from the original data. This will enable us to visually compare these results with those of KMeans applied to the 4-dimensional data (i.e. our first result on the Iris dataset).


In [ ]:
(diff,accuracy_3d) = analyzeResult(result)
print 'Accuracy : ' + str(accuracy_3d)

# plot the difference between ground truth and predicted clusters
figure,axis = pyplot.subplots(1,1)
axis.plot(obsmatrix[2,:],obsmatrix[3,:],'x',color='black', markersize=5)
axis.plot(obsmatrix[2,diff],obsmatrix[3,diff],'x',color='r', markersize=7)
axis.set_title('Difference')
axis.set_xlim(-1,8)
axis.set_ylim(-1,3)
pyplot.show()

Finally, let us plot clustering accuracy vs. number of dimensions to consolidate our results.


In [ ]:
from scipy.interpolate import interp1d
from numpy import linspace

x = array([1, 2, 3, 4])
y = array([accuracy_1d, accuracy_2d, accuracy_3d, accuracy_4d])
f = interp1d(x, y)
xnew = linspace(1,4,10)
pyplot.plot(x,y,'o',xnew,f(xnew),'-')
pyplot.xlim([0,5])
pyplot.xlabel('no. of dims')
pyplot.ylabel('Clustering Accuracy')
pyplot.title('PCA Results')
pyplot.show()

The above plot is somewhat counter-intuitive: the accuracy obtained using just one latent dimension is higher than that obtained using all four features. A plausible explanation is that the mixing of data points from Iris Versicolour and Iris Virginica is least along the single principal direction chosen by PCA. Additional dimensions only aggravate this inter-mixing, thus resulting in poorer clustering accuracy. While there could be other explanations for the observed results, our small experiment has highlighted the usefulness of PCA: not only does it reduce the computational cost of running KMeans, it can also improve the results at times.

References

[1] D. Sculley. Web-scale k-means clustering. In Proceedings of the 19th international conference on World wide web, pages 1177–1178. ACM, 2010

[2] Bishop, C. M., & others. (2006). Pattern recognition and machine learning. Springer New York.

[3] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science