In [ ]:
%%HTML
<style>
.container { width:100% }
</style>
This notebook uses the $k$-nearest neighbours algorithm to recognize handwritten digits. The digits we want to recognize
are stored as images of size $28 \times 28$ pixels. Each pixel $p$ is stored as a number that satisfies $0 \leq p \leq 1$. The pixel values are
interpreted as grey values: If $p = 1.0$, the pixel is completely black, while $p = 0.0$ if the pixel is white. The images are stored in the file mnist.pkl.gz
. This file is compressed using gzip
and the images have been pickled using the module pickle
. The module pickle
supports the
reading and writing of Python data structures.
In order to read the images of the handwritten digits, we therefore have to import the modules gzip
and pickle
. The module numpy
is needed to store the images as arrays.
In [ ]:
import gzip
import pickle
import numpy as np
The function load_data
returns a tuple of the form
$$ (\texttt{X_train}, \texttt{X_test}, \texttt{Y_train}, \texttt{Y_test}) $$
where
In [ ]:
def load_data():
with gzip.open('mnist.pkl.gz', 'rb') as f:
train, _, test = pickle.load(f, encoding="latin1")
return (train[0], test[0], train[1], test[1])
In [ ]:
X_train, X_test, Y_train, Y_test = load_data()
Let us check what we have read:
In [ ]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
Let us inspect the first hand written image of a digit.
In [ ]:
X_train[0, :]
This is an array with 784 entries. Let us draw the corresponding picture.
In [ ]:
import matplotlib.pyplot as plt
The function $\texttt{show_digit}(\texttt{row}, \texttt{columns}, \texttt{offset})$ shows $\texttt{row} \cdot \texttt{columns}$ images of the training data. The first image shown is the image at index $\texttt{offset}$.
In [ ]:
def show_digits(rows, columns, offset=0):
f, axarr = plt.subplots(rows, columns)
for r in range(rows):
for c in range(columns):
i = r * columns + c + offset
image = 1 - X_train[i, :]
image = np.reshape(image, (28, 28))
axarr[r, c].imshow(image, cmap="gray")
axarr[r, c].axis('off')
plt.savefig("digits.pdf")
plt.show()
We take a look at the first 24 images.
In [ ]:
show_digits(4, 6)
Given two arrays $\mathbf{x}$ and $\mathbf{y}$ of the same dimension $n$, the function $\texttt{distance}(\mathbf{x}, \mathbf{y})$ computes the Euclidean distance between $\mathbf{x}$ and $\mathbf{y}$. This distance is defined as follows: $$ \sqrt{\sum\limits_{i=1}^n (x_i - y_i)^2} $$
In [ ]:
def distance(x, y):
return np.sqrt(np.sum((x - y)**2))
For example, the distance between the first two images of the training set is computed as follows:
In [ ]:
distance(X_train[0,:], X_train[1,:])
The distance between the 9th and the 15th image should be smaller, because both of these images show the digit $1$ and hence these images are quite similar. This similarity results in a smaller distance between these images.
In [ ]:
distance(X_train[8,:], X_train[14,:])
Given a list $L$ of digits, the function $\texttt{maxCounts}(L)$ returns a pair $(d, p)$ where $d$ is the digit that occurs most frequently in $L$ and $p$ is the percentage of occurrences of $d$ in $L$. For example, we have $$ \texttt{maxCounts}([5,2,3,5,2,5,6,5,7,8]) = (5, 0.4) $$ because the digit $5$ is the most frequent digit in the list $[5,2,3,5,2,5,6,5,7,8]$ and $40$% of the digits in this list are fives.
In [ ]:
def maxCount(L):
Frequencies = {} # number of occurrences for each digit
most_frequent = L[0] # most frequent digit so far
most_frequent_count = 1 # number of occurrences of most frequent digit
for d in L:
if d in Frequencies:
Frequencies[d] += 1
else:
Frequencies[d] = 1
if Frequencies[d] > most_frequent_count:
most_frequent = d
most_frequent_count = Frequencies[d]
return most_frequent, most_frequent_count / len(L)
In [ ]:
maxCount([3, 3, 4, 2, 1, 2, 3, 2, 5, 3])
Given an image of a digit stored in the vector $\mathbf{x}$ and a number of neighbours $k$, the function $\texttt{digit}(\mathbf{x}, k)$ computes those
$k$ images in the training set X_train
that are closest to the image $\mathbf{x}$. Here
closeness of images is defined in terms of the Euclidean distance of the vectors that store the
images. From these $k$ images of the training set the function chooses the digit that occurs most frequently. It returns a pair $(d, p)$ where $d$ is the digit that is most frequently occurring in the list of $k$ neighbours and $p$ is the percentage of images in the $k$ neighbours of $\mathbf{x}$ that show
the digit $d$.
In [ ]:
def digit(x, k):
n = X_train.shape[0] # number of all training images
Distances = [ (distance(X_train[i, :], x), i) for i in range(n)]
Neighbours = [ Y_train[i] for _, i in sorted(Distances)]
return maxCount(Neighbours[:k])
This function performs $k$-nearest neighbour classification for the $n$-th image of the test set. It also prints the image.
In [ ]:
def test(n, k):
print(f'Testing image {n}:')
image = 1 - X_test[n, :]
image = np.reshape(image, (28, 28))
plt.imshow(image, cmap="gray")
plt.show()
d, p = digit(X_test[n, :], k)
print(f'I believe with a certainty of {p * 100}% that the image shows the digit {d}.')
In [ ]:
test(2, 13)
Let us classify the first 20 images from the test set.
In [ ]:
for n in range(20):
test(n, 13)
In [ ]: