Introduction

This is the fifth installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning Course. In each notebook, I apply one method taught in the course to an open Kaggle competition.

In this notebook, I demonstrate principal component analysis (PCA) using the Digit Recognizer competition.

Outline

  1. Functions to process the data
  2. Import and examine the data
  3. Perform PCA on handwritten digits
  4. Evaluate results
  5. Summary

In [85]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

1. Read Digit Data


In [86]:
data = pd.read_csv("../data/digits/train.csv")
data.head()


Out[86]:
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns


In [87]:
targets = data['label']
digits  = data.drop('label',axis=1)

In [88]:
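# Calculate mean image for each label
labels = targets.unique()
# meandigits is created empty, so its columns have object dtype;
# cast rows to float before any numeric use
meandigits = pd.DataFrame(columns=digits.columns, index=sorted(labels))
for ll in labels:
    inds = (targets == ll)
    meandigits.loc[ll, :] = digits[inds].mean(axis=0)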
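
The same per-label means can also be computed with a groupby, which returns a float-typed frame directly. This is a minimal sketch on a tiny made-up table; the names `toy`, `toy_targets`, `toy_digits` and `toy_means` are illustrative and not part of the notebook.

```python
import pandas as pd

# Tiny stand-in for train.csv: a label column plus three "pixel" columns (made-up values)
toy = pd.DataFrame({
    'label':  [0, 1, 0, 1],
    'pixel0': [0, 10, 2, 30],
    'pixel1': [4, 0, 6, 0],
    'pixel2': [1, 1, 1, 1],
})
toy_targets = toy['label']
toy_digits = toy.drop('label', axis=1)

# Group the pixel rows by label and average each column;
# the result is indexed by the sorted labels and has float dtype
toy_means = toy_digits.groupby(toy_targets).mean()
print(toy_means)
```

Applied to the real data, the equivalent call would be `digits.groupby(targets).mean()`.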

In [103]:
aa = meandigits.loc[0].reshape(28,28).astype(float)
type(aa[0][0])


Out[103]:
float

In [92]:
# Plot the mean value of each label
for ll in labels:
    print(ll)
    plt.subplot(2, 5, ll+1)
    # meandigits has object dtype, so cast to a float array before reshaping
    img = meandigits.loc[ll].values.astype(float).reshape(28, 28)
    plt.imshow(img, cmap=cm.Greys, interpolation='none')
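
imshow needs numeric image data, and a frame that is created empty and then filled row by row keeps dtype object. The sketch below reproduces that situation on a small made-up frame (the name `frame` is illustrative) and shows one way to cast a row to a float array before reshaping it.

```python
import numpy as np
import pandas as pd

# An empty frame filled row by row, as done for meandigits (toy 2x2 "images")
frame = pd.DataFrame(columns=['p0', 'p1', 'p2', 'p3'], index=[0, 1])
frame.loc[0, :] = [0.0, 0.5, 0.5, 1.0]
frame.loc[1, :] = [1.0, 0.0, 0.0, 1.0]
print(frame.dtypes.unique())      # typically object, not float64

# Cast one row to a float array, then reshape it into an image
img = frame.loc[0].values.astype(float).reshape(2, 2)
print(img.dtype, img.shape)
```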

2. Perform PCA on one of the digits


In [ ]:
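# Perform SVD on 2D mean representation of 0
# (cast to float first: meandigits has object dtype)
M = meandigits.loc[0].values.astype(float).reshape(28, 28)
# s holds the singular values in descending order; v is returned already transposed
u, s, v = np.linalg.svd(M)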

In [ ]:
np.cumsum(s[0:6])/sum(s)

In [ ]:
# Plot magnitude of singular values
plt.subplot(1,2,1)
plt.plot(range(len(s)), s, '-b', lw=3)
plt.plot(range(len(s)), s, 'ob', ms=5)
plt.title('Singular Values')
plt.subplot(1,2,2)
plt.plot(range(len(s)), np.cumsum(s)/sum(s), '-b', lw=3)
plt.title('Fraction of Variability Captured')
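
The fraction in the right-hand panel is computed from the raw singular values. A common alternative measures captured variability with the squared singular values, because the squared Frobenius norm of a matrix equals the sum of its squared singular values. A minimal sketch comparing the two on a random stand-in matrix (`A` is made up and is not the mean digit):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(28, 28)   # stand-in for a 28x28 image

sv = np.linalg.svd(A, compute_uv=False)   # singular values, largest first

frac_raw = np.cumsum(sv) / sv.sum()             # fraction of the sum of singular values
frac_energy = np.cumsum(sv**2) / (sv**2).sum()  # fraction of the squared Frobenius norm

print(frac_raw[:6])
print(frac_energy[:6])
```

Which measure to report is a choice; the squared version is the one that corresponds to explained variance in PCA.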

The first 6 singular values account for 99.2% of the sum of all singular values of the mean "0" image, which is used here as a measure of the variability captured. Let's see what the mean "0" looks like when reconstructed from the singular vectors.


In [ ]:
# Plot the actual image to be reconstructed
M = meandigits.loc[0].values.astype(float).reshape(28, 28)
plt.subplot(2, 5, 1)
plt.imshow(M, cmap=cm.Greys, interpolation='none')
plt.title('Actual image')
reconstruction = pd.DataFrame(columns=digits.columns, dtype=float)

# Diagonal matrix of singular values (v from np.linalg.svd is already transposed)
S = np.diag(s)

# Reconstruct the image from the first k singular vectors, k = 1..9
for k in range(1, 10):
    value = u[:, :k].dot(S[:k, :k]).dot(v[:k, :])
    reconstruction.loc[k] = value.ravel()
    plt.subplot(2, 5, k+1)
    plt.imshow(value, cmap=cm.Greys, interpolation='none')
    plt.title('{} SVs'.format(k))

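The demonstration above shows that the first 5 singular vectors represent the 28x28 pixel image very accurately. Keeping only the projected coordinates reduces the dimensionality of the inputs from 28x28=784 to 28x5=140, an 82% reduction. Reconstructing the image also requires the 5 corresponding right singular vectors, so a self-contained rank-5 representation of a single image takes 5x(28+28+1)=285 values, a reduction of about 64%.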
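
A truncated SVD reconstruction and its storage cost can be sketched as follows. The helper `rank_k_approx` is hypothetical (it is not from the notebook), and the matrix is a random stand-in for the mean digit, so the error printed here says nothing about the real image.

```python
import numpy as np

def rank_k_approx(A, k):
    """Return the rank-k approximation of A built from its top k singular triplets."""
    u, s, vt = np.linalg.svd(A, full_matrices=False)
    return u[:, :k].dot(np.diag(s[:k])).dot(vt[:k, :])

rng = np.random.RandomState(0)
A = rng.rand(28, 28)          # stand-in for a 28x28 image
k = 5
A_k = rank_k_approx(A, k)

# Frobenius error equals the root of the sum of squared discarded singular values
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k)
print(err, np.sqrt((s[k:]**2).sum()))

# Values needed to rebuild A_k: k left vectors, k right vectors, k singular values
stored = k * (A.shape[0] + A.shape[1] + 1)
print(stored, A.size)         # 285 784
```

The count of 285 includes the right singular vectors and the singular values; the 28x5=140 figure counts only the coordinates kept per image row.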

We may want to perform PCA to reduce the dimensionality of the data in order to:

  1. Save space when storing or transferring images
  2. Prepare data for machine learning algorithms so that they run faster
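
For the second use, PCA is usually applied across many images rather than to a single one: each flattened image is one row, and the rows are projected onto the top principal components. A minimal sketch via SVD of the centered data matrix, using random stand-in data (`X`, `n_components` and `Z` are illustrative names, and 50 components is an arbitrary choice):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(200, 784)        # stand-in for 200 flattened 28x28 images
n_components = 50             # arbitrary choice for illustration

# Center each pixel column, then take the SVD of the centered data
mean = X.mean(axis=0)
Xc = X - mean
u, s, vt = np.linalg.svd(Xc, full_matrices=False)

components = vt[:n_components]        # principal directions, one per row
Z = Xc.dot(components.T)              # reduced features, shape (200, 50)

# Approximate reconstruction back in pixel space
X_hat = Z.dot(components) + mean
print(Z.shape, X_hat.shape)
```

A learning algorithm would then be trained on `Z` instead of `X`, with the same `mean` and `components` applied to any new images.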