We will use the digits dataset to train a k-Nearest Neighbors (kNN) classifier to recognize handwritten digits.
In [6]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
We first begin by loading the digits dataset and setting the feature matrix (X) and the response vector (y):
In [7]:
digits = load_digits()
X = digits.data
y = digits.target
You can read the description of the dataset by using the 'DESCR' key:
In [8]:
print(digits['DESCR'])
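As a quick sanity check (a small addition, not part of the original analysis), we can confirm the shapes of X and y:
In [ ]:
# X has shape (1797, 64): 1797 samples, 64 pixel features; y has shape (1797,)
print('X shape:', X.shape)
print('y shape:', y.shape)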
In [10]:
knn = KNeighborsClassifier(n_neighbors=4)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)
print(scores.mean())
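For comparison, here is a minimal sketch of the same evaluation done with a single train/test split instead of 10-fold cross-validation (train_test_split and accuracy_score are not used anywhere else in this notebook):
In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 25% of the samples as a test set and score once
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))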
In [39]:
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
print(k_scores)
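As an aside, the same sweep over K can be written more compactly with scikit-learn's GridSearchCV; this is just an alternative sketch of the loop above:
In [ ]:
from sklearn.model_selection import GridSearchCV

# Search n_neighbors = 1..30 with 10-fold cross-validation
param_grid = {'n_neighbors': list(range(1, 31))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)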
In [40]:
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()
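From k_scores we can also read off the best-performing value of K directly (a small addition to the plot above):
In [ ]:
# Index of the highest cross-validated accuracy
best_k = k_range[np.argmax(k_scores)]
print('Best K:', best_k, 'with mean accuracy', max(k_scores))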
In [14]:
# Displaying different keys/attributes
# of the dataset
print('Keys:', digits.keys())
# Loading data
# This includes the pixel values for each of the samples
digits_data = digits['data']
print('Data for 1st element:', digits_data[0])
# Target names
# The set of distinct labels that appear in the dataset
digits_targetnames = digits['target_names']
print('Target names:', digits_targetnames)
# Targets
# This is the actual number for each sample, i.e. the 'truth'
digits_target = digits['target']
print('Targets:', digits_target)
This means that you have 1797 samples, each of which is characterized by 64 different features (pixel values).
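Each row of 64 features is simply an 8x8 image that has been flattened, so the 'data' and 'images' keys hold the same information; a quick check (added here for illustration):
In [ ]:
# The first sample's 64 features, reshaped to 8x8, match the first entry of 'images'
print(np.array_equal(digits_data[0].reshape(8, 8), digits['images'][0]))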
We can also visualize some of the data, using the 'images' key:
In [15]:
# Choosing a colormap
color_map_used = plt.get_cmap('autumn')
In [16]:
# Visualizing some of the targets
fig, axes = plt.subplots(2,5, sharex=True, sharey=True, figsize=(20,12))
axes_f = axes.flatten()
for ii in range(len(axes_f)):
    axes_f[ii].imshow(digits['images'][ii], cmap=color_map_used)
    axes_f[ii].text(1, -1, 'Target: {0}'.format(digits_target[ii]), fontsize=30)
plt.show()
The algorithm will be able to use the pixel values to determine that the first number is a '0', the second is a '1', and so on.
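As a quick illustration (a minimal sketch, separate from the cross-validation above), we can fit the classifier on all but the last ten samples and predict those held-out digits:
In [ ]:
# Fit on all but the last 10 samples, then predict the 10 held-out digits
knn_demo = KNeighborsClassifier(n_neighbors=4)
knn_demo.fit(X[:-10], y[:-10])
print('Predicted:', knn_demo.predict(X[-10:]))
print('Actual:   ', y[-10:])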
Let's see some examples of the number 2:
In [17]:
IDX2 = np.where(digits_target == 2)[0]
print('There are {0} samples of the number 2 in the dataset'.format(IDX2.size))
fig, axes = plt.subplots(2, 5, sharex=True, sharey=True, figsize=(20, 12))
axes_f = axes.flatten()
for ii in range(len(axes_f)):
    axes_f[ii].imshow(digits['images'][IDX2][ii], cmap=color_map_used)
    axes_f[ii].text(1, -1, 'Target: {0}'.format(digits_target[IDX2][ii]), fontsize=30)
plt.show()
In [18]:
print('And now the number 4\n')
IDX4 = np.where(digits_target == 4)[0]
fig, axes = plt.subplots(2, 5, sharex=True, sharey=True, figsize=(20, 12))
axes_f = axes.flatten()
for ii in range(len(axes_f)):
    axes_f[ii].imshow(digits['images'][IDX4][ii], cmap=color_map_used)
    axes_f[ii].text(1, -1, 'Target: {0}'.format(digits_target[IDX4][ii]), fontsize=30)
plt.show()
You can see how different each input is by subtracting one sample from another. Here, I'm subtracting two images that both represent the number '4':
In [19]:
# Difference between two samples of the number 4
plt.imshow(digits['images'][IDX4][1] - digits['images'][IDX4][8], cmap=color_map_used)
plt.show()
This figure shows how different two samples can be from each other.
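One way to quantify that difference is the Euclidean distance between the two flattened images, which is also the default distance metric used by KNeighborsClassifier (this computation is an addition for illustration):
In [ ]:
# Euclidean distance between the same two samples of the number 4
diff = digits['images'][IDX4][1] - digits['images'][IDX4][8]
print('Euclidean distance:', np.sqrt(np.sum(diff**2)))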