In [1]:
%%HTML
<style>
.container { width:100% }
</style>
In [2]:
import gzip
import pickle
As our data is stored as tuples of numpy
arrays, we have to import numpy.
In [3]:
import numpy as np
In order to be able to show the images of the handwritten digits, we use matplotlib.
In [4]:
import matplotlib.pyplot as plt
We need to import the module random
as we are using stochastic gradient descent to compute the weights of our neural network.
In [5]:
import random
The function $\texttt{vectorized_result}(d)$ converts a digit $d \in \{0,\cdots,9\}$ into a numpy
array $\mathbf{x}$ of shape $(10, 1)$ such that we have
$$
\mathbf{x}[i] =
\left\{
\begin{array}{ll}
1 & \mbox{if $i = d$;} \\
0 & \mbox{otherwise}
\end{array}
\right.
$$
for all $i \in \{0,\cdots,9\}$.
This function is used to convert a digit $d$ into the expected output of a neural network that has an output unit for every digit.
In [6]:
def vectorized_result(d):
    e = np.zeros((10, 1), dtype=np.float32)
    e[d] = 1.0
    return e
In [7]:
vectorized_result(2)
Out[7]:
The file mnist.pkl.gz contains a triple of the form (train, validate, test). Here train is a pair of the form (X, y) where X is a numpy array of shape (50000, 784) and y is a numpy array of shape (50000, ). For every $i \in \{0,\cdots, 49{,}999\}$ we have that $\textbf{X}[i]$ is an image of a handwritten digit and $\textbf{y}[i]$ is a digit, i.e. an element of the set $\{0,\cdots,9\}$. The structure of validate and test is similar, but these contain only $10,000$ images each.
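The following cell is an addition for a quick inspection of the raw file; it assumes that mnist.pkl.gz lies in the current working directory (the same assumption that $\texttt{load_data}$ makes below) and merely prints the shapes of the three splits.
In [ ]:
# Inspect the raw pickle: each of the three splits is a pair (X, y) of numpy arrays.
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train, validate, test = pickle.load(f, encoding="latin1")
for name, (X, y) in [('train', train), ('validate', validate), ('test', test)]:
    print(f'{name:8s}: X has shape {X.shape}, y has shape {y.shape}')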
The function $\texttt{load_data}()$ returns a pair of the form $$ (\texttt{training_data}, \texttt{test_data}) $$ where
- training_data is a list of pairs $(\textbf{x}, \textbf{y})$ such that $\textbf{x}$ is a numpy.ndarray of shape $(784, 1)$ containing the input image and $\textbf{y}$ is a numpy.ndarray of shape $(10, 1)$ corresponding to the correct digit for $\textbf{x}$;
- test_data is a list of pairs $(\textbf{x}, y)$ such that $\textbf{x}$ is a numpy.ndarray of shape $(784, 1)$ containing the input image and $y \in \{0,\cdots,9\}$ is the corresponding digit value.

Note that the formats for training data and test data are different: for the training data $\textbf{y}$ is a vector, but for the test data $y$ is a number.
In [9]:
def load_data():
    with gzip.open('mnist.pkl.gz', 'rb') as f:
        train, validate, test = pickle.load(f, encoding="latin1")
    print(f'shape of training data: {(train[0].shape, train[1].shape)}')
    training_inputs  = [np.reshape(x, (784, 1)) for x in train[0]]
    training_results = [vectorized_result(y) for y in train[1]]
    training_data    = list(zip(training_inputs, training_results))
    test_inputs      = [np.reshape(x, (784, 1)) for x in test[0]]
    test_data        = list(zip(test_inputs, test[1]))
    return training_data, test_data
In [8]:
list(zip([1, 2], ['a', 'b']))
Out[8]:
We store the data in two variables: training_data and test_data.
- training_data is a list of pairs of the form $(\textbf{x}, \textbf{y})$ where $\textbf{x}$ is a numpy array of shape $(784, 1)$ representing the image of a digit, while $\textbf{y}$ is a numpy array of shape $(10, 1)$ that is a one-hot encoding of the digit shown in $\textbf{x}$.
- test_data is a list of pairs of the form $(\textbf{x}, y)$ where $\textbf{x}$ is a numpy array of shape $(784, 1)$ representing the image of a digit, while $y$ is an element of the set $\{0,\cdots,9\}$.
In [10]:
%%time
training_data, test_data = load_data()
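As a quick sanity check (this cell is an addition to the notebook), we can confirm the formats described above:
In [ ]:
# For the first training example, x is a (784, 1) column vector and y is a
# (10, 1) one-hot vector; for the test data, y is a plain digit.
x, y = training_data[0]
print('training:', x.shape, y.shape)
x, y = test_data[0]
print('test    :', x.shape, y)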
The function $\texttt{show_digits}(\texttt{rows}, \texttt{columns}, \texttt{offset})$ shows $\texttt{rows} \cdot \texttt{columns}$ images of the training data. The first image shown is the image at index $\texttt{offset}$.
In [11]:
def show_digits(rows, columns, offset=0):
    f, axarr = plt.subplots(rows, columns)
    for r in range(rows):
        for c in range(columns):
            i     = r * columns + c + offset
            image = 1 - training_data[i][0]
            image = np.reshape(image, (28, 28))
            axarr[r, c].imshow(image, cmap="gray")
            axarr[r, c].axis('off')
    # plt.savefig("digits.pdf")
    plt.show()
In [13]:
show_digits(5, 5, 0)
Our goal is to find the weight matrices and biases for a neural net that is able to recognize the digits shown in these images. We initialize these weight matrices randomly. The function $\texttt{rndMatrix}(\texttt{rows}, \texttt{cols})$ returns a matrix of shape $(\texttt{rows}, \texttt{cols})$ that is filled with random numbers that have a Gaussian distribution with mean $0$ and variance $\displaystyle\frac{1}{\texttt{cols}}$, i.e. the entries are scaled by $1/\sqrt{\texttt{cols}}$, the square root of the number of inputs of the corresponding layer.
In [14]:
def rndMatrix(rows, cols):
    return np.random.randn(rows, cols) / np.sqrt(cols)
In [15]:
rndMatrix(2, 2)
Out[15]:
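As a quick empirical check (this cell is an addition to the notebook), the standard deviation of the entries of a large random matrix generated by $\texttt{rndMatrix}$ should be close to $1/\sqrt{\texttt{cols}}$:
In [ ]:
# For cols = 784 the standard deviation should be close to 1/sqrt(784) ≈ 0.0357.
M = rndMatrix(40, 784)
print(np.std(M), 1 / np.sqrt(784))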
The function $\texttt{sigmoid}(x)$ computes the sigmoid of $x$, which is defined as $$ \texttt{sigmoid}(x) = S(x) := \frac{1}{1 + \texttt{exp}(-x)}. $$ Since we are using NumPy to compute the exponential function, this function also works when $x$ is a vector.
In [16]:
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
In [17]:
sigmoid(np.array([-1, 0, 1]))
Out[17]:
The function $\texttt{sigmoid_prime}(x)$ computes the derivative of the sigmoid function for $x$. The implementation is based on the equation: $$ S'(x) = S(x) \cdot \bigl(1 - S(x)\bigr) $$ where $x$ can either be a number or a vector.
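For completeness, here is a short derivation of this identity (added for clarity):
$$ S'(x) = \frac{d}{dx}\bigl(1 + e^{-x}\bigr)^{-1}
         = \frac{e^{-x}}{\bigl(1 + e^{-x}\bigr)^2}
         = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
         = S(x) \cdot \bigl(1 - S(x)\bigr). $$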
In [18]:
def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)
In [19]:
sigmoid_prime(np.array([-5, 0, 5]))
Out[19]:
The class Network is used to represent a feedforward neural network with one hidden layer. The constructor is called with the argument hiddenSize. This parameter specifies the number of neurons in the hidden layer. The network has $28 \cdot 28 = 784$ input nodes. Each of the input nodes corresponds to the grayscale value of a single pixel in a $28 \times 28$ grayscale image of the digit that is to be recognized. The number of output neurons is $10$. For $i \in \{0,\cdots,9\}$, the $i$-th output neuron tries to recognize the digit $i$.
In [21]:
class Network(object):
    def __init__(self, hiddenSize):
        self.mInputSize  = 28 * 28
        self.mHiddenSize = hiddenSize
        self.mOutputSize = 10
        self.mBiasesH    = np.zeros((self.mHiddenSize, 1))                # biases hidden layer
        self.mBiasesO    = np.zeros((self.mOutputSize, 1))                # biases output layer
        self.mWeightsH   = rndMatrix(self.mHiddenSize, self.mInputSize)   # weights hidden layer
        self.mWeightsO   = rndMatrix(self.mOutputSize, self.mHiddenSize)  # weights output layer
Given a neural network $n$ and an input vector $x$ for this neural network, the function $n.\texttt{feedforward}(x)$ computes the output of the neural network. The code is a straightforward implementation of the feedforward equations. These equations are repeated here for convenience:
$$ \mathbf{a}^{(1)}(\mathbf{x}) = \mathbf{x}, \qquad
   \mathbf{a}^{(2)}(\mathbf{x}) = S\bigl(W^{(2)} \cdot \mathbf{a}^{(1)}(\mathbf{x}) + \mathbf{b}^{(2)}\bigr), \qquad
   \mathbf{a}^{(3)}(\mathbf{x}) = S\bigl(W^{(3)} \cdot \mathbf{a}^{(2)}(\mathbf{x}) + \mathbf{b}^{(3)}\bigr). $$
The input x is the activation of the input layer and is therefore equal to $\mathbf{a}^{(1)}(\mathbf{x})$. AH is the activation of the hidden layer and hence equal to $\mathbf{a}^{(2)}(\mathbf{x})$, while AO is the activation of the output layer and therefore equal to $\mathbf{a}^{(3)}(\mathbf{x})$.
In [22]:
def feedforward(self, x):
    AH = sigmoid(self.mWeightsH @ x  + self.mBiasesH)  # hidden layer
    AO = sigmoid(self.mWeightsO @ AH + self.mBiasesO)  # output layer
    return AO

Network.feedforward = feedforward
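As a quick check (this cell is an addition to the notebook), we can feed one training image through an untrained network and confirm that the output is a column vector of shape $(10, 1)$:
In [ ]:
# The output of feedforward is a (10, 1) column vector whose entries lie
# strictly between 0 and 1, since they are values of the sigmoid function.
tmpNet = Network(30)
x, y   = training_data[0]
print(tmpNet.feedforward(x).shape)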
Given a neural network $n$, the method $\texttt{sgd}(\texttt{training_data}, \texttt{epochs}, \texttt{mbs}, \texttt{eta}, \texttt{test_data})$ uses stochastic gradient descent to train the network. The parameters are as follows:
- training_data is the list of training examples as returned by load_data.
- epochs is the number of epochs, i.e. the number of times the complete training data set is processed.
- mbs is the size of the mini-batches.
- eta is the learning rate.
- test_data is used to evaluate the accuracy of the network after every epoch.
In [23]:
def sgd(self, training_data, epochs, mbs, eta, test_data):
    n_test = len(test_data)
    n      = len(training_data)
    for j in range(epochs):
        random.shuffle(training_data)
        mini_batches = [training_data[k : k+mbs] for k in range(0, n, mbs)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        print('Epoch %2d: %d / %d' % (j, self.evaluate(test_data), n_test))

Network.sgd = sgd
The method update_mini_batch performs one step of gradient descent for the data from one mini-batch. It receives two arguments:
- mini_batch is the list of training examples that constitute one mini-batch.
- eta is the learning rate.

The implementation of update_mini_batch works as follows:
- First, it initializes the vectors nabla_BH, nabla_BO and the matrices nabla_WH, nabla_WO to contain only zeros.
  - nabla_BH will store the gradient of the bias vector of the hidden layer.
  - nabla_BO will store the gradient of the bias vector of the output layer.
  - nabla_WH will store the gradient of the weight matrix of the hidden layer.
  - nabla_WO will store the gradient of the weight matrix of the output layer.
- Next, for every training example $(\mathbf{x}, \mathbf{y})$ of the mini-batch we compute the contribution of this training example to the gradients of the cost function $C$, i.e. we compute
  $$ \nabla_{\mathbf{b}^{(l)}} C_{\mathbf{x}, \mathbf{y}} \quad \mbox{and} \quad
     \nabla_{W^{(l)}} C_{\mathbf{x}, \mathbf{y}}
  $$
  for the hidden layer and the output layer. These gradients are computed by the method backprop and added to the accumulated gradients.
- Finally, the accumulated gradients are divided by the number of training examples in the mini-batch, scaled by the learning rate eta, and subtracted from the biases and weights (see the update rule shown below).
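For reference, this is the update performed for a mini-batch $B$ with learning rate $\eta$ (this display is an addition that summarizes the code below):
$$ \mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \frac{\eta}{|B|} \sum\limits_{(\mathbf{x},\mathbf{y}) \in B} \nabla_{\mathbf{b}^{(l)}} C_{\mathbf{x},\mathbf{y}},
\qquad
W^{(l)} \leftarrow W^{(l)} - \frac{\eta}{|B|} \sum\limits_{(\mathbf{x},\mathbf{y}) \in B} \nabla_{W^{(l)}} C_{\mathbf{x},\mathbf{y}}. $$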
In [24]:
def update_mini_batch(self, mini_batch, eta):
    nabla_BH = np.zeros((self.mHiddenSize, 1))                 # gradient of biases  of hidden layer
    nabla_BO = np.zeros((self.mOutputSize, 1))                 # gradient of biases  of output layer
    nabla_WH = np.zeros((self.mHiddenSize, self.mInputSize))   # gradient of weights of hidden layer
    nabla_WO = np.zeros((self.mOutputSize, self.mHiddenSize))  # gradient of weights of output layer
    for x, y in mini_batch:
        dltNbl_BH, dltNbl_BO, dltNbl_WH, dltNbl_WO = self.backprop(x, y)
        nabla_BH += dltNbl_BH
        nabla_BO += dltNbl_BO
        nabla_WH += dltNbl_WH
        nabla_WO += dltNbl_WO
    alpha = eta / len(mini_batch)
    self.mBiasesH  -= alpha * nabla_BH
    self.mBiasesO  -= alpha * nabla_BO
    self.mWeightsH -= alpha * nabla_WH
    self.mWeightsO -= alpha * nabla_WO

Network.update_mini_batch = update_mini_batch
Given a neural network $n$, the method $n.\texttt{backprop}(x, y)$ takes a training example $(\mathbf{x},\mathbf{y})$ and calculates the gradients of the cost function $C_{\mathbf{x},\mathbf{y}}$ for this training example with respect to the biases and weights. This is done by implementing the backpropagation equations shown below:
$$ \begin{array}[h]{llr} \boldsymbol{\varepsilon}^{(L)} = (\mathbf{a}^{(L)} - \mathbf{y}) \odot S'\bigl(\mathbf{z}^{(L)}\bigr) & & \mbox{(BP1v)} \\ \boldsymbol{\varepsilon}^{(l)} = \Bigl(\bigl(W^{(l+1)}\bigr)^\top \cdot \boldsymbol{\varepsilon}^{(l+1)}\Bigr) \odot S'\bigl(\mathbf{z}^{(l)}\bigr) & \mbox{for all $l \in \{2, \cdots, L-1\}$} & \mbox{(BP2v)} \\ \nabla_{\mathbf{b}^{(l)}} C_{\mathbf{x}, \mathbf{y}} = \boldsymbol{\varepsilon}^{(l)} & \mbox{for all $l \in \{2, \cdots,L\}$} & \mbox{(BP3v)} \\ \nabla_{W^{(l)}} C_{\mathbf{x}, \mathbf{y}} = \boldsymbol{\varepsilon}^{(l)} \cdot \bigl(\mathbf{a}^{(l-1)}\bigr)^\top & \mbox{for all $l \in \{2, \cdots,L\}$} & \mbox{(BP4v)} \end{array} $$
In [25]:
def backprop(self, x, y):
    # feedforward pass
    ZH = self.mWeightsH @ x  + self.mBiasesH
    AH = sigmoid(ZH)
    ZO = self.mWeightsO @ AH + self.mBiasesO
    AO = sigmoid(ZO)
    # backwards pass, output layer
    epsilonO = (AO - y) * sigmoid_prime(ZO)
    nabla_BO = epsilonO
    nabla_WO = epsilonO @ AH.transpose()
    # backwards pass, hidden layer
    epsilonH = (self.mWeightsO.transpose() @ epsilonO) * sigmoid_prime(ZH)
    nabla_BH = epsilonH
    nabla_WH = epsilonH @ x.transpose()
    return nabla_BH, nabla_BO, nabla_WH, nabla_WO

Network.backprop = backprop
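The following cell is an addition, not part of the original notebook. It is a minimal numerical gradient check: assuming the quadratic cost $C_{\mathbf{x},\mathbf{y}} = \frac{1}{2}\,\bigl\|\mathbf{a}^{(3)}(\mathbf{x}) - \mathbf{y}\bigr\|^2$ that underlies equation (BP1v), it compares the backpropagated partial derivative for the first bias of the output layer with a central finite difference.
In [ ]:
# Numerical gradient check for backprop (sketch under the assumption of a
# quadratic cost).  The two printed numbers should agree to several digits.
def gradient_check(net, x, y, eps=1e-6):
    def cost(x, y):
        return 0.5 * np.sum((net.feedforward(x) - y) ** 2)
    _, nabla_BO, _, _ = net.backprop(x, y)
    net.mBiasesO[0, 0] += eps              # perturb the first output bias
    c_plus  = cost(x, y)
    net.mBiasesO[0, 0] -= 2 * eps
    c_minus = cost(x, y)
    net.mBiasesO[0, 0] += eps              # restore the original value
    numeric = (c_plus - c_minus) / (2 * eps)
    print(f'backprop: {nabla_BO[0, 0]:.9f}, numeric: {numeric:.9f}')

gradient_check(Network(30), *training_data[0])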
Given a neural network $n$, the method $n.\texttt{evaluate}(\texttt{test_data})$ uses the test data to compute the number of examples that are predicted correctly by the neural network $n$.
In [26]:
def evaluate(self, test_data):
    test_results = [(np.argmax(self.feedforward(x)), y) for x, y in test_data]
    return sum(y1 == y2 for y1, y2 in test_results)

Network.evaluate = evaluate
In [27]:
%%time
np.random.seed(1)
net = Network(40)
net.sgd(training_data, 30, 10, 0.1, test_data)
What is the number of parameters of our network? Each of the $40$ neurons of the hidden layer has $784$ weights and one bias, and each of the $10$ output neurons has $40$ weights and one bias. Therefore the network has $$ 40 \cdot (1 + 784) + 10 \cdot (1 + 40) = 31,810 $$ parameters. As we have $50,000$ training examples and every training example gives rise to $10$ equations (one for each output neuron), we shouldn't be too worried about over-fitting.
In [ ]:
40 * (1 + 784) + 10 * (1 + 40)
In [ ]: