k-Means Clustering

in R


In [1]:
%pylab inline
%load_ext rmagic


Populating the interactive namespace from numpy and matplotlib

In [2]:
%%R
library(stats)
library(ggplot2)
set.seed(1)

For our first example, let's create some synthetic easy-to-cluster data:


In [3]:
%%R
d <- data.frame()
d <- rbind(d, data.frame(x = 1 + rnorm(20, 0, 0.1), y = 1 + rnorm(20, 0, 0.1), 
                         label = as.factor(rep(1, each=20))))
d <- rbind(d, data.frame(x = 1 + rnorm(20, 0, 0.1), y = 3 + rnorm(20, 0, 0.1), 
                         label = as.factor(rep(2, each=20))))
d <- rbind(d, data.frame(x = 3 + rnorm(20, 0, 0.1), y = 1 + rnorm(20, 0, 0.1), 
                         label = as.factor(rep(3, each=20))))
d <- rbind(d, data.frame(x = 3 + rnorm(20, 0, 0.1), y = 3 + rnorm(20, 0, 0.1), 
                         label = as.factor(rep(4, each=20))))

Have a look...this looks easy enough:


In [4]:
%%R 
ggplot(d, aes(x=x, y=y)) + geom_point(aes(colour=label)) + ggtitle('d -- easy clusters')


Perform Clustering


In [5]:
%%R 
result1 <- kmeans(d[,1:2], 4)

Here are the results. Note that the algorithm found clusters whose means are close to the true means of our synthetic clusters:


In [6]:
%R print(result1)


K-means clustering with 4 clusters of sizes 20, 20, 20, 20

Cluster means:
         x         y
1 3.011985 1.0113828
2 1.019052 0.9993528
3 1.013880 3.0101737
4 2.978273 2.9596490

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
 3  3  3  3  3  3  3  3  3  3  3  3  3  3  1  1  1  1  1  1  1  1  1  1  1  1 
53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 
 1  1  1  1  1  1  1  1  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4 
79 80 
 4  4 

Within cluster sum of squares by cluster:
[1] 0.2835909 0.3027288 0.3341553 0.3510448
 (between_SS / total_SS =  99.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

Plot the results...we are looking good:


In [7]:
%%R
d$cluster1 <- as.factor(result1$cluster)
ggplot(d, aes(x=x, y=y)) + geom_point(aes(colour=cluster1)) + ggtitle('kmeans result1 -- success!\n(k=4)')


Suppose we repeat these steps...what do you expect to happen?


In [8]:
%%R
result2 <- kmeans(d[,1:2], 4)

In [9]:
%R print(result2)


K-means clustering with 4 clusters of sizes 15, 40, 5, 20

Cluster means:
         x        y
1 2.964552 2.902310
2 1.016466 2.004763
3 3.019435 3.131665
4 3.011985 1.011383

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  4  4  4  4  4  4  4  4  4  4  4  4 
53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 
 4  4  4  4  4  4  4  4  1  3  1  1  1  1  3  1  1  1  3  1  1  1  1  1  3  1 
79 80 
 1  3 

Within cluster sum of squares by cluster:
[1]  0.10761076 41.07115633  0.03487477  0.28359085
 (between_SS / total_SS =  73.7 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

Notice that the fit got worse! The between_SS / total_SS ratio dropped sharply (from 99.2% to 73.7%), and the cluster means are no longer close to the true blob centers.

And this scatterplot shows that something is obviously not right...what happened?


In [10]:
%%R
d$cluster2 <- as.factor(result2$cluster)
ggplot(d, aes(x=x, y=y)) + geom_point(aes(colour=cluster2)) + ggtitle('kmeans result2 -- trouble\n(k=4)')


This instability is a result of the random initial centers ("seeds") that the clustering algorithm uses. If two initial centers land in the same true cluster, the algorithm will have difficulty finding all the clusters; in particular, the cluster that doesn't receive an initial center will be hard to identify. (Note that in any case, the algorithm will still return exactly as many clusters as you asked it to!)

How can we create a more stable clustering fit? By repeating the fit several times from different random starts and keeping the best run, i.e. the one with the smallest total within-cluster sum of squares.

(this is exactly what the nstart argument to kmeans() does)
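
To make the mechanism concrete, here is an illustrative sketch of what nstart does behind the scenes; in practice you would just pass nstart to kmeans(), as in the next cell:

%%R
# run kmeans from several random initializations and keep the best fit,
# i.e. the one with the smallest total within-cluster sum of squares
best <- NULL
for (i in 1:10) {
    fit <- kmeans(d[,1:2], 4)
    if (is.null(best) || fit$tot.withinss < best$tot.withinss) best <- fit
}
print(best$tot.withinss)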


In [11]:
%%R
result3 <- kmeans(d[,1:2], 4, nstart=10)
d$cluster3 <- as.factor(result3$cluster)
ggplot(d, aes(x=x, y=y)) + geom_point(aes(colour=cluster3)) + 
    ggtitle('kmeans result3 -- stable convergence\n(k=4, nstart=10)')


What happens if we introduce a new length scale into the problem? How many clusters are in the dataset now?


In [12]:
%%R
d2 <- rbind(d[,1:3], data.frame(x=1000+rnorm(20,0,50), y=1000+rnorm(20,0,50), label=as.factor(rep(5, each=20))))
ggplot(d2, aes(x=x, y=y)) + geom_point(aes(colour=label)) + ggtitle('d2 -- multiple length scales')


As you can see below, things go haywire. Recall that k-means is a distance-based heuristic; in particular, its result is not invariant to a change in units, and a much larger length scale can swamp the structure we found before.


In [13]:
%%R
result4 <- kmeans(d2[,1:2], 5, nstart=10)
d2$cluster4 <- as.factor(result4$cluster)
ggplot(d2, aes(x=x, y=y)) + geom_point(aes(colour=cluster4)) + ggtitle('kmeans result4 -- trouble\n(k=5, nstart=10)')

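
To see more precisely what went wrong, one option is to cross-tabulate the true labels against the fitted clusters, roughly like this:

%%R
# rows are the true labels, columns the fitted clusters;
# typically the four original blobs end up sharing a cluster while the
# distant blob gets carved into several pieces
table(d2$label, result4$cluster)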

in Python

Again, we start by generating some artificial data:


In [14]:
import matplotlib.pyplot as plt
plt.jet() # set the color map. When your colors are lost, re-run this.
import sklearn.datasets as datasets
X, Y = datasets.make_blobs(centers=4, cluster_std=0.5, random_state=0)


<matplotlib.figure.Figure at 0x1059fef50>

As always, we first plot the data to get a feeling of what we're dealing with:


In [15]:
plt.scatter(X[:,0], X[:,1]);


The data looks like it may contain four different "types" of data point.

In fact, this is how it was created above.

We can plot this information as well, using color:


In [16]:
plt.scatter(X[:,0], X[:,1], c=Y);


Normally, you do not know the information in Y, however.

You could try to recover it from the data alone.

This is what the k-means algorithm does.


In [17]:
from sklearn.cluster import KMeans
kmeans = KMeans(4, random_state=8)
Y_hat = kmeans.fit(X).labels_

Now the label assignments should be quite similar to Y, up to a different ordering of the colors:


In [18]:
plt.scatter(X[:,0], X[:,1], c=Y_hat);

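
If you want a number rather than a visual check, one option is the adjusted Rand index from scikit-learn, which measures the agreement between two labelings and is insensitive to how the cluster labels are permuted. A minimal sketch:

from sklearn.metrics import adjusted_rand_score
# 1.0 means the clustering reproduces Y exactly (up to relabeling);
# values near 0.0 mean agreement no better than chance
print(adjusted_rand_score(Y, Y_hat))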

Often, you're not so much interested in the individual cluster assignments.

Instead, you'll want a closer look at the means $\mu$.

Each mean in $\mu$ can be seen as a representative of its cluster.


In [19]:
plt.scatter(X[:,0], X[:,1], c=Y_hat, alpha=0.4)
mu = kmeans.cluster_centers_
plt.scatter(mu[:,0], mu[:,1], s=100, c=np.unique(Y_hat))
print(mu)


[[-1.47935679  3.11716896]
 [-1.26811733  7.76378266]
 [ 1.99186903  0.96561071]
 [ 0.92578447  4.32475792]]

On your own

Perform k-Means in R or Python (student's choice)


In [20]:
%%R
summary(iris)


  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

In [21]:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
pd.DataFrame(iris.data, columns=iris.feature_names).describe()


Out[21]:
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.054000           3.758667          1.198667
std             0.828066          0.433594           1.764420          0.763161
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000

8 rows × 4 columns
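
If you choose Python, a minimal sketch of one possible starting point (clustering the four measurements with k = 3 and comparing against the known species) might look like this:

from sklearn.cluster import KMeans
import pandas as pd

# fit k-means with k=3, since we happen to know there are three species
iris_kmeans = KMeans(3, random_state=0).fit(iris.data)

# cross-tabulate recovered clusters against the true species labels;
# cluster numbers are arbitrary, so look for one dominant species per cluster
print(pd.crosstab(iris.target, iris_kmeans.labels_))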