In [1]:
%pylab inline
%load_ext rmagic
In [2]:
%%R
library(stats)
library(ggplot2)
set.seed(1)
In [3]:
%%R
d <- data.frame()
d <- rbind(d, data.frame(x = 1 + rnorm(20, 0, 0.1), y = 1 + rnorm(20, 0, 0.1),
label = as.factor(rep(1, each=20))))
d <- rbind(d, data.frame(x = 1 + rnorm(20, 0, 0.1), y = 3 + rnorm(20, 0, 0.1),
label = as.factor(rep(2, each=20))))
d <- rbind(d, data.frame(x = 3 + rnorm(20, 0, 0.1), y = 1 + rnorm(20, 0, 0.1),
label = as.factor(rep(3, each=20))))
d <- rbind(d, data.frame(x = 3 + rnorm(20, 0, 0.1), y = 3 + rnorm(20, 0, 0.1),
label = as.factor(rep(4, each=20))))
In [4]:
%%R
ggplot(d, aes(x=x, y=y)) + geom_point(aes(colour=label)) + ggtitle('d -- easy clusters')
In [5]:
%%R
result1 <- kmeans(d[,1:2], 4)
In [6]:
%R print(result1)
In [7]:
%%R
d$cluster1 <- as.factor(result1$cluster)
ggplot(d, aes(x=x, y=y)) + geom_point(aes(colour=cluster1)) + ggtitle('kmeans result1 -- success!\n(k=4)')
In [8]:
%%R
result2 <- kmeans(d[,1:2], 4)
In [9]:
%R print(result2)
In [10]:
%%R
d$cluster2 <- as.factor(result2$cluster)
ggplot(d, aes(x=x, y=y)) + geom_point(aes(colour=cluster2)) + ggtitle('kmeans result2 -- trouble\n(k=4)')
This instability is a result of the random initial seeds that the clustering algorithm uses. If two initial seeds begin in the same true cluster, the algorithm will have difficulty finding all of the clusters; in particular, the cluster that does not contain an initial seed will be difficult to identify. Note that in any case, the algorithm will still return exactly as many clusters as you asked it to!
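The same effect can be seen numerically: with a single random initialization per run, the total within-cluster sum of squares varies from run to run, and a noticeably larger value signals that the algorithm got stuck in a bad local optimum. A minimal sketch in Python using scikit-learn's KMeans (d_xy is a hypothetical NumPy array holding the two coordinate columns of d, which lives in R above):

from sklearn.cluster import KMeans

# run k-means several times with a single random initialization each
for seed in range(5):
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(d_xy)
    print(seed, km.inertia_)  # inertia_ = total within-cluster sum of squares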
In [11]:
%%R
result3 <- kmeans(d[,1:2], 4, nstart=10)
d$cluster3 <- as.factor(result3$cluster)
ggplot(d, aes(x=x, y=y)) + geom_point(aes(colour=cluster3)) +
ggtitle('kmeans result3 -- stable convergence\n(k=4, nstart=10)')
In [12]:
%%R
d2 <- rbind(d[,1:3], data.frame(x=1000+rnorm(20,0,50), y=1000+rnorm(20,0,50), label=as.factor(rep(5, each=20))))
ggplot(d2, aes(x=x, y=y)) + geom_point(aes(colour=label)) + ggtitle('d2 -- multiple length scales')
In [13]:
%%R
result4 <- kmeans(d2[,1:2], 5, nstart=10)
d2$cluster4 <- as.factor(result4$cluster)
ggplot(d2, aes(x=x, y=y)) + geom_point(aes(colour=cluster4)) + ggtitle('kmeans result4 -- trouble\n(k=5, nstart=10)')
Again, we start by generating some artificial data:
In [14]:
import matplotlib.pyplot as plt
plt.jet() # set the color map; if the colors get reset later, re-run this line
import sklearn.datasets as datasets
X, Y = datasets.make_blobs(centers=4, cluster_std=0.5, random_state=0)
As always, we first plot the data to get a feeling of what we're dealing with:
In [15]:
plt.scatter(X[:,0], X[:,1]);
The data looks like it may contain four different "types" of data point.
In fact, this is how it was created above.
We can plot this information as well, using color:
In [16]:
plt.scatter(X[:,0], X[:,1], c=Y);
Normally, however, you do not know the information in Y.
You could try to recover it from the data alone; this is what the k-means algorithm does.
In [17]:
from sklearn.cluster import KMeans
kmeans = KMeans(4, random_state=8)
Y_hat = kmeans.fit(X).labels_
Now the label assignments should be quite similar to Y, up to a different ordering of the colors:
In [18]:
plt.scatter(X[:,0], X[:,1], c=Y_hat);
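One way to check this quantitatively, without worrying about how the labels are permuted, is the adjusted Rand index from sklearn.metrics (a small aside; it equals 1.0 for identical partitions regardless of how the clusters are numbered):

from sklearn.metrics import adjusted_rand_score

# compare the recovered labels to the true ones, ignoring label permutation
print(adjusted_rand_score(Y, Y_hat))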
Often, though, you're not so much interested in the cluster assignments as in the means $\mu$ themselves.
The means in $\mu$ can be seen as representatives of their respective clusters.
In [19]:
plt.scatter(X[:,0], X[:,1], c=Y_hat, alpha=0.4)
mu = kmeans.cluster_centers_
plt.scatter(mu[:,0], mu[:,1], s=100, c=np.unique(Y_hat))
print(mu)
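Because each mean acts as the representative of its cluster, a new observation can be assigned to a cluster simply by finding its nearest mean; scikit-learn exposes this via KMeans.predict (the points below are made-up coordinates, purely for illustration):

import numpy as np

new_points = np.array([[0.0, 4.0], [2.0, 1.0]])  # two hypothetical new observations
print(kmeans.predict(new_points))  # index of the nearest cluster center for each point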
Perform k-means in R or Python (student's choice)
In [20]:
%%R
summary(iris)
In [21]:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
pd.DataFrame(iris.data, columns=iris.feature_names).describe()
Out[21]: (summary statistics for the four iris measurement columns)
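One possible starting point for the exercise, sketched in Python (assuming k=3, the number of iris species, and reusing the iris object loaded above; the exact choices are up to you):

from sklearn.cluster import KMeans
import pandas as pd

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)
# cross-tabulate the recovered clusters against the true species labels
print(pd.crosstab(iris.target, km.labels_))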