This is the fifth installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning Course. In each notebook, I apply one method taught in the course to an open Kaggle competition.
In this notebook, I demonstrate k-means clustering using the Digit Recognizer competition.
In [85]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
In [86]:
data = pd.read_csv("../data/digits/train.csv")
data.head()
Out[86]:
In [87]:
targets = data['label']
digits = data.drop('label',axis=1)
In [88]:
# Calculate the mean image for each label
labels = targets.unique()
meandigits = pd.DataFrame(columns=digits.columns, index=sorted(labels))
for ll in labels:
    inds = (targets == ll)
    meandigits.loc[ll, :] = digits[inds].mean(axis=0)
In [103]:
# Go through .values: Series.reshape is not available in recent pandas versions
aa = meandigits.loc[0].values.reshape(28,28).astype(float)
type(aa[0][0])
Out[103]:
In [92]:
# Plot the mean image of each label
for ll in labels:
    print(ll)
    plt.subplot(2,5,ll+1)
    plt.imshow(meandigits.loc[ll].values.astype(float).reshape(28,28),cmap=cm.Greys,interpolation='none')
In [ ]:
# Perform SVD on the 2D (28x28) mean image of the digit 0
M = meandigits.loc[0].values.astype(float).reshape(28,28)
u,s,v = np.linalg.svd(M)  # rows of v are the right singular vectors (V transposed)
In [ ]:
np.cumsum(s[0:6])/sum(s)
In [ ]:
# Plot the magnitude of the singular values
plt.subplot(1,2,1)
# Successive plot calls draw on the same axes; plt.hold is not needed (and was removed in matplotlib 3.0)
plt.plot(range(len(s)),s,'-b',lw=3)
plt.plot(range(len(s)),s,'ob',ms=5)
plt.title('Singular Values')
plt.subplot(1,2,2)
plt.plot(range(len(s)),np.cumsum(s)/sum(s),'-b',lw=3)
plt.title('Fraction of Variability Captured')
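The first 6 singular values account for 99.2% of the sum of all singular values of the mean image of 0, which is the cumulative fraction plotted above. Strictly speaking, variance is proportional to the squared singular values, so this fraction is a proxy for the variance captured rather than the variance itself. Let's see what the mean "0" looks like when reconstructed from the singular vectors.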
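Before that, a short aside on how the captured fraction is measured. The sketch below compares the cumulative sum of singular values (the quantity plotted in this notebook) with the cumulative sum of their squares, which is the share of the squared Frobenius norm retained by a rank-k approximation. It is only an illustration: the matrix `M` here is a synthetic stand-in, not the Kaggle data, and the names `frac_sum` and `frac_energy` are introduced for this example.

```python
import numpy as np

# Hypothetical stand-in for the 28x28 mean image (NOT the Kaggle data):
# a rank-5 matrix plus a small amount of noise.
rng = np.random.RandomState(0)
M = rng.rand(28, 5).dot(rng.rand(5, 28)) + 0.01 * rng.rand(28, 28)

# Singular values only, returned in descending order
s = np.linalg.svd(M, compute_uv=False)

# Fraction of the sum of singular values (what the notebook plots)
frac_sum = np.cumsum(s) / s.sum()

# Fraction of the sum of squared singular values
# (share of the squared Frobenius norm kept by a rank-k approximation)
frac_energy = np.cumsum(s ** 2) / np.sum(s ** 2)

for k in (1, 3, 6):
    print("k=%d  sum fraction=%.4f  squared fraction=%.4f"
          % (k, frac_sum[k - 1], frac_energy[k - 1]))
```

Because the singular values are sorted in descending order, the squared fraction is never smaller than the plain-sum fraction, so the plain sum gives a conservative picture of how much a truncated reconstruction retains.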
In [ ]:
# Plot the actual image to be reconstructed
plt.subplot(2,5,1)
plt.imshow(meandigits.loc[0].values.astype(float).reshape(28,28),cmap=cm.Greys,interpolation='none')
plt.title('Actual image')
reconstruction = pd.DataFrame(columns=digits.columns)
# Make matrices for reconstruction
U = np.matrix(u)
S = np.matrix(np.diag(s))
V = np.matrix(v)
# Reconstruct the image from the first 1, 2, ..., 9 singular vectors
for k in range(9):
    value = U[:,0:k+1]*S[0:k+1,0:k+1]*V[0:k+1,:]
    # Flatten the 28x28 product into a 784-element row
    reconstruction.loc[k+1] = np.asarray(value).reshape(28*28)
    plt.subplot(2,5,k+2)
    plt.imshow(reconstruction.loc[k+1].values.astype(float).reshape(28,28),cmap=cm.Greys,interpolation='none')
    plt.title('{} SVs'.format(k+1))
The demonstration above shows that the first 5 singular vectors are enough to represent the 28x28 pixel mean image very accurately. Doing so reduces the dimensionality of the input from 28x28 = 784 to 28x5 = 140, a reduction of (784 - 140)/784, or about 82%.
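The same idea can be sketched without `np.matrix`, using plain arrays. This is a minimal illustration under stated assumptions, not the notebook's own code: `rank_k_approx` is a hypothetical helper name, and the test matrix is synthetic with exact rank 5, so the error at k = 5 drops to rounding level, which the real mean image will not do.

```python
import numpy as np

def rank_k_approx(M, k):
    """Rank-k approximation of M built from its first k singular triplets."""
    u, s, vt = np.linalg.svd(M, full_matrices=False)
    # Scale the first k columns of u by the singular values, then project back
    return (u[:, :k] * s[:k]).dot(vt[:k, :])

# Hypothetical 28x28 matrix of exact rank 5 (NOT the Kaggle data)
rng = np.random.RandomState(0)
M = rng.rand(28, 5).dot(rng.rand(5, 28))

for k in (1, 3, 5):
    Mk = rank_k_approx(M, k)
    # Relative reconstruction error in the Frobenius norm
    rel_err = np.linalg.norm(M - Mk) / np.linalg.norm(M)
    print("k=%d  relative error=%.2e" % (k, rel_err))

# Arithmetic behind the reduction quoted in the text: 784 -> 28*5 = 140
k = 5
reduction = 1.0 - (28 * k) / 784.0
print("reduction: %.1f%%" % (100 * reduction))
```

One caveat on the count: 28x5 = 140 covers the per-image coefficients only, which presumes the 5 basis vectors are shared across images and stored once. Storing the truncated factors for a single image on its own takes 5 x (28 + 28 + 1) = 285 numbers.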
We may want to perform PCA to reduce dimensionality in the data in order to: