Initial classification approach adapted from the scikit-learn handwritten digit recognition example, which uses a support vector machine classifier (http://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html), applied here to the MNIST dataset.
In [1]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib notebook
In [2]:
from sklearn import datasets, svm, metrics, utils
Fetching the MNIST dataset
In [3]:
from sklearn.datasets import fetch_openml
# fetch_mldata('MNIST original') no longer works (mldata.org is down, and
# the function was removed from scikit-learn); fetch_openml is the
# current replacement.
mnist = fetch_openml('mnist_784', version=1, data_home='./data', as_frame=False)
The dataset contains 70,000 examples of handwritten digits of size 28x28 pixels, labeled from 0 to 9.
The data is ordered by label and needs to be shuffled.
In [4]:
mnist.data, mnist.target = utils.shuffle(mnist.data, mnist.target)
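As an optional refinement (not used above), `utils.shuffle` accepts a `random_state` seed, which makes the permutation reproducible across runs. A minimal sketch on toy arrays:

```python
import numpy as np
from sklearn import utils

X = np.arange(10).reshape(5, 2)
y = np.arange(5)

# Passing random_state makes the shuffle deterministic:
# the same seed always yields the same permutation.
X_a, y_a = utils.shuffle(X, y, random_state=42)
X_b, y_b = utils.shuffle(X, y, random_state=42)

assert (X_a == X_b).all() and (y_a == y_b).all()
# Rows and labels stay aligned after shuffling.
assert all(X_a[i, 0] // 2 == y_a[i] for i in range(5))
```

Shuffling both arrays in one call keeps each image paired with its label, which a separate `np.random.shuffle` on each array would not.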
Pick the first 15 images for visualization
In [5]:
n_samples = len(mnist.data)
In [6]:
fig = plt.figure()
for i in range(15):
    img = mnist.data[i].reshape(28, 28)
    ax = fig.add_subplot(3, 5, i + 1)
    ax.axis('off')
    ax.imshow(img, cmap=plt.cm.gray, interpolation='nearest')
    ax.set_title('# {}'.format(i))
Creating a support vector classifier
In [25]:
# gamma=0.001, used in the original example for 8x8 images with pixel
# values 0-16, heavily overfits on 28x28 MNIST data with N=10000 samples
# (pixel values 0-255): squared distances between images become so large
# that the RBF kernel collapses to zero and the model predicts only a
# single class.
#classifier = svm.SVC(gamma=0.001)
classifier = svm.SVC()
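A plausible reason the parameterless `SVC()` behaves better: since scikit-learn 0.22 the default is `gamma='scale'`, i.e. `1 / (n_features * X.var())`, which adapts the kernel width to the scale of the input features automatically. A quick sanity check on the small 8x8 digits dataset (a stand-in here, not MNIST):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# Default gamma='scale' = 1 / (n_features * X.var()), so the kernel
# width tracks the data's scale without manual tuning.
clf = svm.SVC()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # typically well above 0.9 on digits
```

Equivalently, rescaling the raw pixel values (e.g. dividing by 255) before fitting should make a hand-picked gamma behave more predictably.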
Apply learning on a subset of the digits
In [21]:
N = n_samples // 2
N = 10000  # override: the full half (35,000 samples) is too slow to train
classifier.fit(mnist.data[:N], mnist.target[:N])
Out[21]:
Predictions
In [22]:
expected = mnist.target[N:2*N]
predicted = classifier.predict(mnist.data[N:2*N])
Scikit-learn's kernel `SVC` is slow here: per the scikit-learn docs its fit time scales more than quadratically with the number of samples, so it handles at most a few tens of thousands of examples comfortably. Worth checking the parameters or a linear variant.
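One alternative worth trying for speed (a sketch, not a drop-in for the cells above): `LinearSVC`, which uses liblinear and scales much better with sample count, at the cost of a linear decision boundary. Demonstrated on the small digits dataset:

```python
from sklearn import datasets
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# A linear SVM trained with liblinear scales roughly linearly in the
# number of samples, unlike kernel SVC. Standardizing the features
# first helps the solver converge.
clf = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

On MNIST-sized data the training-time difference versus kernel `SVC` is substantial, though a linear model usually gives somewhat lower accuracy than a well-tuned RBF kernel.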
In [11]:
print("Classification report for classifier %s:\n%s\n"
% (classifier, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))
In [23]:
expected, predicted
Out[23]:
In [24]:
np.unique(predicted)
Out[24]:
In [14]:
expected = mnist.target[:N]
predicted = classifier.predict(mnist.data[:N])
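Note that the cell above predicts on the same samples the classifier was trained on, so it measures training accuracy, not generalization. The gap between the two can be made explicit; a minimal sketch on the digits dataset:

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

clf = svm.SVC().fit(X_train, y_train)

# Accuracy on the data the model was fit on (optimistic)...
train_acc = accuracy_score(y_train, clf.predict(X_train))
# ...versus accuracy on held-out data (the honest estimate).
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(train_acc, test_acc)
```

A large gap between the two numbers is the usual symptom of overfitting, such as the `gamma=0.001` collapse observed earlier.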
In [19]:
mnist.data[N:2*N].shape
Out[19]: