Introduction

This is the fifth installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning Course. In each notebook, I apply one method taught in the course to an open Kaggle competition.

In this notebook, I demonstrate principal component analysis (PCA) using the Digit Recognizer competition.

Outline

Functions to process the data
Import and examine the data
Perform PCA on handwritten digits
Evaluate results
Summary



In [85]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

1. Read Digit Data



In [86]:

    
data = pd.read_csv("../data/digits/train.csv")
data.head()









    Out[86]:






  
    
      
	label	pixel0	pixel1	pixel2	pixel3	pixel4	pixel5	pixel6	pixel7	pixel8	...	pixel774	pixel775	pixel776	pixel777	pixel778	pixel779	pixel780	pixel781	pixel782	pixel783
0	1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	4	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 785 columns



In [87]:

    
targets = data['label']
digits  = data.drop('label',axis=1)



In [88]:

    
# Calculate mean image for each label
labels = targets.unique()
meandigits = pd.DataFrame(columns=digits.columns,index=sorted(labels))
for ll in labels:
    inds = (targets==ll)
    meandigits.loc[ll,:] = digits[inds].mean(axis=0)
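
The loop above fills one row of meandigits per label. As a hedged alternative, the same per-label average can be sketched with a pandas groupby, which also returns float columns directly. The tiny toy_digits and toy_targets tables below are hypothetical stand-ins for the Kaggle data, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle data: 6 tiny "images" of 4 pixels each
toy_digits = pd.DataFrame(
    [[0, 2, 4, 6], [2, 4, 6, 8],
     [1, 1, 1, 1], [3, 3, 3, 3],
     [0, 0, 0, 0], [10, 10, 10, 10]],
    columns=["pixel0", "pixel1", "pixel2", "pixel3"],
)
toy_targets = pd.Series([0, 0, 1, 1, 2, 2], name="label")

# Group the rows by label and average every pixel column within each group
toy_means = toy_digits.groupby(toy_targets).mean()
```

With the real data, the corresponding call would be digits.groupby(targets).mean(), assuming digits and targets share the same row index as they do after the split above.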



In [103]:

    
aa = meandigits.loc[0].values.reshape(28,28).astype(float)
type(aa[0][0])









    Out[103]:





float



In [92]:

    
# Plot the mean image of each label
# Cast to float first: imshow raises "Image data can not convert to float" otherwise
for ll in sorted(labels):
    plt.subplot(2,5,ll+1)
    plt.imshow(meandigits.loc[ll].values.astype(float).reshape(28,28),cmap=cm.Greys,interpolation='none')
    plt.title(str(ll))

2. Perform PCA on one of the digits



In [ ]:

    
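# Perform SVD on 2D mean representation of 0
# Cast to float so that np.linalg.svd receives a numeric array
M = meandigits.loc[0].values.astype(float).reshape(28,28)
# np.linalg.svd returns V transposed: the rows of v are the right singular vectors
u,s,v = np.linalg.svd(M)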



In [ ]:

    
np.cumsum(s[0:6])/sum(s)



In [ ]:

    
# Plot magnitude of singular values
plt.subplot(1,2,1)
plt.plot(range(len(s)),s,'-b',lw=3)
plt.plot(range(len(s)),s,'ob',ms=5)
plt.title('Singular Values')
# Plot cumulative share of the sum of singular values
plt.subplot(1,2,2)
plt.plot(range(len(s)),np.cumsum(s)/sum(s),'-b',lw=3)
plt.title('Fraction of Variability Captured')

The first 6 singular values account for 99.2% of the sum of all singular values of the mean "0" image. Let's see what the mean "0" looks like when reconstructed from the singular vectors.
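
The cumulative ratio plotted above uses the singular values directly. If the quantity of interest is the share of the image's squared magnitude (its squared Frobenius norm), the squared singular values are the relevant weights. The sketch below compares both ratios on a hypothetical synthetic ring image that only stands in for the mean "0"; the names ring, s_ring, frac_sum and frac_energy are illustrative and do not appear in the original notebook.

```python
import numpy as np

# Hypothetical 28x28 test image: a blurry ring, standing in for the mean "0"
yy, xx = np.mgrid[0:28, 0:28]
radius = np.sqrt((yy - 13.5) ** 2 + (xx - 13.5) ** 2)
ring = np.exp(-((radius - 8.0) ** 2) / 8.0)

# Singular values only, sorted from largest to smallest
s_ring = np.linalg.svd(ring, compute_uv=False)

# Ratio used in the notebook: cumulative share of the sum of singular values
frac_sum = np.cumsum(s_ring) / np.sum(s_ring)

# Cumulative share of the squared Frobenius norm (sum of squared pixel values)
frac_energy = np.cumsum(s_ring ** 2) / np.sum(s_ring ** 2)
```

Because the singular values are sorted in decreasing order, frac_energy is never smaller than frac_sum at any rank, so the plain ratio is the more conservative of the two.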



In [ ]:

    
# Plot actual image to be reconstructed
plt.subplot(2,5,1)
plt.imshow(meandigits.loc[0].values.astype(float).reshape(28,28),cmap=cm.Greys,interpolation='none')
plt.title('Actual image')
reconstruction = pd.DataFrame(columns=digits.columns, dtype=float)

# Reconstruct the image from the first 1 to 9 singular vectors
# (rows of v are the right singular vectors, as returned by np.linalg.svd)
for k in range(9):
    # Rank-(k+1) approximation: U_k * diag(s_k) * V_k^T
    value = u[:,0:k+1].dot(np.diag(s[0:k+1])).dot(v[0:k+1,:])
    reconstruction.loc[k+1] = value.ravel()
    plt.subplot(2,5,k+2)
    plt.imshow(value,cmap=cm.Greys,interpolation='none')
    plt.title('{} SVs'.format(k+1))

The demonstration above shows that the first 5 singular vectors represent the 28x28 pixel image very accurately. Projecting each of the 28 rows onto those 5 vectors reduces the representation from 28x28=784 values to 28x5=140, an 82% reduction.
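
The 784-to-140 arithmetic can be checked with a short sketch. It assumes a hypothetical random image of exact rank 5, so that 5 singular vectors reproduce it without error; a real digit image would only be approximated. The names image, basis, coeffs and restored are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical 28x28 image of exact rank 5 (product of 28x5 and 5x28 factors)
image = rng.rand(28, 5).dot(rng.rand(5, 28))

# vt holds the right singular vectors as rows
u, s, vt = np.linalg.svd(image)

k = 5
basis = vt[:k, :]             # 5 x 28: the retained singular vectors
coeffs = image.dot(basis.T)   # 28 x 5: each row projected onto the basis
restored = coeffs.dot(basis)  # 28 x 28: mapped back to pixel space

# Fraction of values saved by keeping coeffs instead of the full image
reduction = 1.0 - coeffs.size / float(image.size)
```

Note that the count of 140 covers only the coefficients. It applies when the 5 basis vectors are shared across many images; storing the basis for a single image adds another 5x28=140 values.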

We may want to use PCA to reduce the dimensionality of the data in order to:

Save space when storing or transferring images
Prepare data for machine learning algorithms so that they run faster
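
For the second use, PCA is typically applied across the whole data set rather than to a single image: every 784-pixel row is projected onto a few shared components. The following is a minimal sketch under stated assumptions, not the notebook's own method. The matrix X is hypothetical synthetic data standing in for the digits table, and pca_reduce is an illustrative helper name.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical stand-in for the digits table: 200 samples x 784 pixels,
# generated from 10 latent factors plus a little noise
latent = rng.randn(200, 10)
mixing = rng.randn(10, 784)
X = latent.dot(mixing) + 0.01 * rng.randn(200, 784)

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                               # mean normalization
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:n_components]              # principal directions as rows
    # Share of total variance retained, from squared singular values
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return Xc.dot(components.T), components, mean, explained

Z, components, mean, explained = pca_reduce(X, 10)

# Approximate reconstruction in the original 784-dimensional pixel space
X_approx = Z.dot(components) + mean
```

Here Z has 10 columns instead of 784 and could be passed to a classifier in place of the raw pixels. On real digit images the number of components needed for a given explained fraction would have to be determined from the data.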
