This is the fifth installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning course. In each notebook, I apply one method taught in the course to an open Kaggle competition.
In this notebook, I demonstrate k-means clustering using the Digit Recognizer competition.
In [17]:
import pandas as pd
import numpy as np
import sklearn.cluster as skc
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
In [5]:
def ij2index(ii, jj):
    """
    Converts pixel indices ii (row) and jj (column)
    to a single value in the grid below:
           jj=0 jj=1 jj=2 jj=3     jj=26 jj=27
    ii=0    000  001  002  003 ...  026   027
    ii=1    028  029  030  031 ...  054   055
    ii=2    056  057  058  059 ...  082   083
      |      |    |    |    | ...    |     |
    ii=26   728  729  730  731 ...  754   755
    ii=27   756  757  758  759 ...  782   783
    """
    # Number of rows (nI) and columns (nJ) in the 28x28 image grid
    nI = 28
    nJ = 28
    return ii*nJ + jj
def index2ij(index):
    """
    Converts 1D index to 2D pixel indices
    ii (row) and jj (column) from the grid below:
           jj=0 jj=1 jj=2 jj=3     jj=26 jj=27
    ii=0    000  001  002  003 ...  026   027
    ii=1    028  029  030  031 ...  054   055
    ii=2    056  057  058  059 ...  082   083
      |      |    |    |    | ...    |     |
    ii=26   728  729  730  731 ...  754   755
    ii=27   756  757  758  759 ...  782   783
    """
    # Number of rows (nI) and columns (nJ) in the 28x28 image grid
    nI = 28
    nJ = 28
    jj = index % nJ
    # Integer-divide by the number of columns (nJ) to recover the row;
    # dividing by nI only happened to work here because nI == nJ
    ii = (index - jj) // nJ
    return (ii, jj)
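A quick round-trip check confirms that the two helpers invert each other; the expected values come straight from the grid in the docstrings:

# Round-trip check for the index helpers on the 28x28 grid
assert ij2index(2, 3) == 59
assert index2ij(59) == (2, 3)
assert index2ij(ij2index(27, 27)) == (27, 27)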
In [10]:
data = pd.read_csv("./data/digits/train.csv")
data.head()
Out[10]:
In [11]:
# Split up digit images and labels
targets = data['label']
digits = data.drop('label',axis=1)
In [16]:
# Plot one of the digits
plt.imshow(digits.loc[1000].values.reshape(28, 28), cmap=cm.Greys, interpolation='none')
Out[16]:
In [60]:
# Plot frequency of digits in the dataset
targets.hist()
Out[60]:
We use k-means to cluster the available greyscale images into 10 categories. I do not anticipate a clean one-to-one relationship between the resulting clusters and the 10 digits (0-9) because our method is neither scale-, translation-, nor rotation-invariant: k-means compares raw pixel vectors with Euclidean distance, so two images of the same digit drawn at slightly different positions can land far apart.
Nevertheless, let's try and see how it goes!
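To make the translation point concrete, here is a small illustration using the same sample image plotted above: shifting a digit sideways by a single pixel moves it a long way in 784-dimensional pixel space, even though a human would call the two images identical.

# Shift one digit image right by a single pixel and measure how far
# it moves in raw pixel space; k-means sees these as very different points
img = digits.loc[1000].values.reshape(28, 28)
shifted = np.roll(img, 1, axis=1)
print(np.linalg.norm(img - shifted))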
In [55]:
# 10 clusters to match the 10 digits; n_init=1 runs a single
# initialization, and random_state=1 makes the run reproducible
model = skc.KMeans(n_clusters=10, n_init=1, random_state=1)
In [56]:
model.fit(digits)
Out[56]:
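One quick diagnostic worth printing after the fit is the model's inertia, scikit-learn's name for the within-cluster sum of squared distances; lower values mean tighter clusters:

# Within-cluster sum of squared distances to the nearest centroid
print(model.inertia_)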
In [68]:
output = model.predict(digits)
In [69]:
# Plot the center of each cluster returned from the k-means algorithm
for ii in range(10):
    plt.subplot(2, 5, ii+1)
    plt.imshow(model.cluster_centers_[ii, :].reshape(28, 28), cmap=cm.Greys, interpolation='none')
    plt.title('ii = {}'.format(ii))
This is better than I expected! Several of the centroids clearly correspond to recognizable digits. Some deficiencies: there is no centroid for the digit "5", two centroids resemble "0", and the centroids for "4", "7", and "9" look strongly similar.
In [70]:
# Plot number of predicted values in each cluster
output = model.predict(digits)
# Use one bin per cluster label (0-9) so bars align with clusters
height, left = np.histogram(output, bins=np.arange(11))
plt.bar(left[:-1], height)
Out[70]:
This histogram of model predictions reveals more shortcomings. Bins 5 and 7, which both correspond to the digit "0", contain relatively few images. Bin 1, on the other hand, seems to correspond to the digit "1", yet it holds far more images than the label histogram above shows for that digit, so many other digits must have been placed into it.
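To check this directly, one option (a minimal sketch using pandas' crosstab) is to tabulate how the true labels distribute across the clusters; each row shows the composition of one cluster:

# Rows: cluster ids from k-means; columns: true digit labels
composition = pd.crosstab(pd.Series(output, name='cluster'), targets)
print(composition)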
In [83]:
# Visually associate the cluster centers with a digit from 0-9
def real2centroid(number):
    """Return the model centroid index associated with the real digit value."""
    realvalues = [7, 1, 0, 6, 2, 5, 8, 4, 9, 3]
    return realvalues[number]

def centroid2real(number):
    """Return the real digit associated with a given centroid index."""
    centroids = [2, 1, 4, 9, 7, 0, 3, 0, 6, 8]
    return centroids[number]
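Because the digit "5" has no dedicated centroid, these two mappings cannot be exact inverses; a quick check confirms they round-trip for every digit except 5, whose real2centroid entry is an arbitrary placeholder:

# The round trip digit -> centroid -> digit recovers every digit but 5
for dd in range(10):
    if dd != 5:
        assert centroid2real(real2centroid(dd)) == dd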
In [81]:
# Convert the model output to predicted digit labels
output = model.predict(digits)
# Map each cluster id to its digit; np.array keeps vector arithmetic easy
output = np.array([centroid2real(cc) for cc in output])
In [82]:
# Calculate fraction correct
print 1 - float((output != targets).sum()) / len(output)
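As a follow-up, a per-digit breakdown (a small sketch using pandas groupby) shows where the clustering succeeds; the digit 5, which lacks a centroid, should score near zero:

# Mean accuracy for each true digit, grouped by label
per_digit = pd.Series(output == targets.values).groupby(targets).mean()
print(per_digit)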
A very simple k-means run placed handwritten digits into 10 clusters without using any labels. In many cases these clusters clearly corresponded to actual digits. After visualizing the cluster centers and assigning each one a digit value, we find that the k-means algorithm correctly categorizes about 60% of the digits. That's rather impressive for a straightforward application of an unsupervised learning algorithm!