Training neural networks involves quite a few tricky bits. We try to make everything clear and easy to understand, to get you training your neural networks as quickly as possible.
Theano allows us to write relatively concise code that follows the structure of the underlying maths. To run the code yourself, download the notebook at https://github.com/ASIDataScience/training-neural-networks-notebook
We will train a network to classify digits. More precisely, we want a network that, when presented with an image of a hand-written digit, will tell us which digit it is ('0', '1', ..., '9'). The data we will use for this task is known as the MNIST dataset. It has a long tradition in neural networks research, as the dataset is quite small but still very tricky to classify correctly.
You will need to download the MNIST dataset (the pickled mnist.pkl.gz file), which we load below.
You can install Theano with the Python package manager pip. At the command line, type
pip install theano
or check out the Theano documentation if you run into trouble.
The theano.tensor module contains useful functionality for manipulating vectors and matrices, like we will be doing here, so let's import it along with the full package.
In [1]:
import theano
import theano.tensor as T
import numpy as np
With the data loaded into a variable mnist_dataset, we split it into three parts: a training set, used to teach the network to recognize digits; a validation set, which could be used to tune and compare models; and finally a test set, used to see how well the network has learned.
We then prepare the datasets, splitting each set into the images and their labels, and store them in Theano shared variables, a bit of theano magic that's explained later. Understanding this massaging of the data isn't crucial.
In [4]:
import os
import gzip
import pickle
myPath = '' # mnist.pkl.gz is in this directory
f = gzip.open(os.path.join(myPath, 'mnist.pkl.gz'), 'rb')
try: # for cross-platform, cross-version reasons, try two different pickle.load calls
    mnist_dataset = pickle.load(f, encoding='latin1') # Python 3
except TypeError: # Python 2's pickle.load has no encoding argument
    mnist_dataset = pickle.load(f)
f.close()
In [5]:
train_set, valid_set, test_set = mnist_dataset
prepare_data = lambda x: (theano.shared(x[0].astype('float64')),
theano.shared(x[1].astype('int32')))
(training_x, training_y), (valid_x, valid_y), (test_x, test_y) = map(prepare_data, (train_set, valid_set, test_set))
In [6]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
def plot_mnist_digit(image):
    '''Plot a single digit from the MNIST dataset'''
    image = np.reshape(image, [28, 28])
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(image, cmap=matplotlib.cm.binary)
    plt.xticks(np.array([]))
    plt.yticks(np.array([]))
    plt.show()
Let's look at the first three images in the training set, then set about building a machine learning model that has the ability to recognise digits itself!
We've defined a function plot_mnist_digit just for displaying the training images - it's purely for prettiness.
In [7]:
for i in range(3):
    plot_mnist_digit(training_x.get_value()[i])              # training_x is a theano object containing *images*
    print('Image class: ' + str(training_y.get_value()[i]))  # training_y is a theano object containing *labels*
The neural network model that we will build defines a probability distribution
$$P(Y = y \ |\ X = \boldsymbol{\textrm{x}} ; \theta),$$where $Y$ represents the image class, a random variable that can take the values 0-9; $X$ represents the image pixels and is a vector-valued random variable (we collapse the image matrix into a vector); and $\theta$ is the set of model parameters that we are going to learn.
In this tutorial we build two models: first implementing a logistic regression model, then extending it to a neural network model.
The equation for our first model is given by
$$P(Y = y \ |\ \boldsymbol{\textrm{x}} ; \theta) \propto \left[\sigma \left( \boldsymbol{\textrm{x}}^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}}\right)\right]_y,$$where $[\boldsymbol{\textrm{x}}]_i$ is the $i$th entry of vector $\boldsymbol{\textrm{x}}$, and the use of the proportionality symbol $\propto$ means that the probability is equal to the expression on the right hand side times a constant chosen such that $\sum_y{P(Y = y \ |\ \boldsymbol{\textrm{x}} ; \theta)} = 1$
The parameter set $\theta$ for this model is $\theta = \{\boldsymbol{\textrm{W}}, \boldsymbol{\textrm{b}}\}$, where $\boldsymbol{\textrm{W}}$ is a matrix and $\boldsymbol{\textrm{b}}$ is a vector. We also use the non-linearity $\sigma$ given by
$$\sigma(t) = \frac{1}{1+e^{-t}}$$When applied to a vector or matrix, the sigmoid function $\sigma(t)$ is applied entrywise.
One way to think about the logistic regression model is that it takes the input ($\boldsymbol{\textrm{x}}$), puts it through a linear combination ($\boldsymbol{\textrm{x}}^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}}$) and then finally through a non-linearity: $\sigma(\boldsymbol{\textrm{x}}^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}})$.
The result of this operation is a vector representing the entire discrete distribution over all the possible classes - in our case the ten possible digits 0-9. To get the probability of a particular class, say $y=6$, we extract the 6th entry of the probability vector: $\left[\sigma \left( \boldsymbol{\textrm{x}}^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}}\right)\right]_y$.
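To make the shapes concrete, here is a minimal NumPy-only sketch of this forward pass for a single image; the sigmoid helper, the random image x, and the zero-initialized parameters are placeholders for illustration, not the model code we build below.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))   # applied entrywise to an array

D, C = 28 * 28, 10                    # input dimensionality and number of classes
x = np.random.rand(D)                 # a flattened image (placeholder pixel values)
W = np.zeros((D, C))                  # placeholder (untrained) parameters
b = np.zeros(C)

scores = sigmoid(x.dot(W) + b)        # the vector sigma(x^T W + b^T), one entry per class
probs = scores / scores.sum()         # normalize so that the entries sum to 1
print(probs[6])                       # P(Y = 6 | x) under these placeholder parameters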
Graphically the model looks like this:
Each individual entry of the vectors x and $P(Y)$ is shown as a circle -- these are known as the units (or artificial neurons) of the network. We have $D$ input units (the dimensionality of the input vector, which is the flattened matrix of pixels) and $C$ output units (the number of classes, i.e. the digits 0-9). The model parameters W and b are represented as arrows. We also show the application of the sigmoid functions, but we do not represent the normalization that makes the probabilities sum up to $1$.
Another way to write the model above is using the $\textrm{SoftMax}$ function. A good exercise is deriving the $\textrm{SoftMax}$ function based on the fact that we can also write the same model using $\textrm{SoftMax}$ and the equal sign instead of the proportionality sign:
$$P(Y = y \ |\ \boldsymbol{\textrm{x}} ; \theta) = \left[\textrm{SoftMax} \left( \boldsymbol{\textrm{W}} \boldsymbol{\textrm{x}} + \boldsymbol{\textrm{b}}\right)\right]_y,$$Our second model is given by
$$P(Y = y \ |\ \boldsymbol{\textrm{x}} ; \theta) = \left[ \textrm{SoftMax} \left( \boldsymbol{\textrm{h}}^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{hy}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}}_\boldsymbol{\textrm{hy}}\right)\right]_y, \\ \boldsymbol{\textrm{h}} = \tanh \left( \boldsymbol{\textrm{x}}^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{xh}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}}_\boldsymbol{\textrm{xh}}\right)$$This is very similar to the logistic regression model. Here we have introduced a new vector-valued random variable $\boldsymbol{\textrm{h}}$. We call this a 'hidden' variable or 'latent' variable, as we do not have any data observations for it. This variable may not even correspond to any quantity in the real world, but we use it to increase the power of our statistical model.
We now also have more parameters: $\theta = \{\boldsymbol{\textrm{W}}_\boldsymbol{\textrm{xh}}, \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{hy}}, \boldsymbol{\textrm{b}}_\boldsymbol{\textrm{xh}}, \boldsymbol{\textrm{b}}_\boldsymbol{\textrm{hy}}\}$.
$\tanh$ is the hyperbolic tangent function given by $$\tanh(t) = \frac{e^t-e^{-t}}{e^t+e^{-t}}$$
As with the sigmoid, when applied to a vector or matrix, the $\tanh$ function is applied entrywise.
Graphically, this model looks like this:
Now our depiction shows a hidden layer with $M$ units (this number can be different from the number of input neurons and the number of output neurons), and we have two different nonlinearities in the graph: $\tanh$ and sigmoids (but again we are not graphically representing the SoftMax normalization).
The way we're going to make our network learn is by trying to find some values for our parameters $\theta$ so that the network is as likely as possible to guess the correct class. That is, given a data set of training images $\boldsymbol{\textrm{x}}_1, \boldsymbol{\textrm{x}}_2, \dots, \boldsymbol{\textrm{x}}_N$ and correct labels $y_1, y_2, \dots, y_N$, we want to find the parameters that maximize probability of the correct labels given the images.
This method of choosing parameters is called maximum likelihood (ML), and we can express it mathematically as finding the parameters $\theta$ which maximize the likelihood function:
$$\theta^* = {\arg\max}_\theta \ P( Y_1 = y_1, Y_2 = y_2, \dots, Y_N = y_N \ | \ \boldsymbol{\textrm{X}}_1 = \boldsymbol{\textrm{x}}_1,\boldsymbol{\textrm{X}}_2 = \boldsymbol{\textrm{x}}_2, \dots, \boldsymbol{\textrm{X}}_N = \boldsymbol{\textrm{x}}_N ; \theta)$$And since our data points are independent, we can write this joint probability as a product of probabilities:
\begin{align} P( Y_1 = y_1,\dots, Y_N = y_N \ | \ \boldsymbol{\textrm{X}}_1 = \boldsymbol{\textrm{x}}_1, \dots, \boldsymbol{\textrm{X}}_N = \boldsymbol{\textrm{x}}_N ; \theta) &= P( Y_1 = y_1 \ | \ \boldsymbol{\textrm{X}}_1 = \boldsymbol{\textrm{x}}_1 ; \theta) \times \dots \times P( Y_N = y_N \ | \ \boldsymbol{\textrm{X}}_N = \boldsymbol{\textrm{x}}_N ; \theta) \\ &= \prod_{i=1}^N P( Y_i = y_i \ | \ \boldsymbol{\textrm{X}}_i = \boldsymbol{\textrm{x}}_i) \end{align}In our likelihood function above, each random variable pair $(X_1, Y_1), (X_2, Y_2), \dots, (X_N, Y_N)$ refers to a single data point. But since virtually all of our computations need to deal with multiple data points, we will find it both useful and computationally efficient to express both the mathematics and our Python code in terms of datasets. Thus, we will express the scalar random variables $Y_1, Y_2, \dots, Y_N$ as the vector-valued random variable $Y$, and the vector-valued random variables $X_1, X_2, \dots, X_N$ as the matrix-valued random variable $X$, where the matrix $X$ has as many rows as there are data points.
Using this notation, we rewrite the maximum likelihood equation above:
$$\theta^* = {\arg\max}_\theta \ P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)$$Similarly, we can specify the logistic regression model in terms of multiple datapoints:
$$P(Y \ |\ X = \boldsymbol{\textrm{X}} ; \theta) = \textrm{SoftMax} \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)$$Here the result is a matrix of probabilities with as many rows as there are data points, and as many columns as there are classes. We also consider the SoftMax to normalize the result of the linear combination $\left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)$ in such a way that each row of the result is a proper probability distribution summing to $1$. Note that we have had to multiply the bias vector $\boldsymbol{\textrm{b}}$ by the vertical vector of all ones $\boldsymbol{\textrm{1}}$ in order to add the bias term for every single data point.
The neural network model equations follow a similar pattern, which it would be a good exercise to write out for yourself.
In most machine learning applications, it is better to maximize the log-likelihood rather than the likelihood. This is done because the log-likelihood tends to be simpler to compute and more numerically stable than the likelihood. In terms of the math, this doesn't make things much more complicated, as all we need to add is a $\log$ in front of the likelihood:
$$\theta^* = {\arg\max}_\theta \ \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)$$Since the logarithm of a product is the sum of the logarithms, we can write:
$$\log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta) = \sum_{i=1}^N \log P( Y_i = \boldsymbol{\textrm{y}}_i\ | \ X_i = \boldsymbol{\textrm{X}}_i ; \theta)$$that is, the log joint probability is the sum of the log marginal probabilities.
Now let's plug in the probability of a dataset from above to obtain:
$$\log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta) = \sum \left[ \log\left( \textrm{SoftMax} \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)\right)\right]_{\cdot,\boldsymbol{\textrm{y}}}$$where we use the notation $[\boldsymbol{\textrm{M}}]_{\cdot,\boldsymbol{\textrm{y}}}$ to mean that from the matrix $\boldsymbol{\textrm{M}}$ we construct a new vector $(a_1, a_2, \dots, a_n)$ such that $a_i = M_{i, y_i} \forall i$. We use a slight abuse of notation and we use the sum symbol to indicate the summation of the entries in the vector; we need a summation because the log joint probability is the sum of the log marginal probabilities.
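As a toy illustration of this quantity (using made-up probabilities rather than MNIST, and reusing the numpy import from the start of the notebook):
probs = np.array([[0.7, 0.2, 0.1],             # one row of class probabilities per data point
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])                   # the correct class of each data point
per_point = np.log(probs[np.arange(3), labels])   # log P(Y_i = y_i | X_i = x_i) for each i
log_likelihood = per_point.sum()               # the log joint probability of the whole dataset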
Once we have a set of parameters $\theta$ that we are happy with, we can use the model to classify new data.
We have built a model that gives us a distribution over classes given the data. How do we assign a class to a new data point? The simplest way is to choose the class with the highest probability under the model to be the class we assign. We can write this mathematically, again using vector notation:
$$\hat{\boldsymbol{\textrm{y}}} = {\arg\max} \ P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) $$Computing the maximum likelihood parameters exactly is computationally infeasible, so we're going to use a method called gradient ascent to find a set of parameters that are really good but perhaps not the absolute best.
The idea of gradient ascent is very simple. Given a function $f:\mathbb{R}^n \rightarrow \mathbb{R}$, we want to iteratively find points $\boldsymbol{\textrm{x}}_n$ such that the value of the function gets progressively higher, that is, $f(\boldsymbol{\textrm{x}}_{n+1}) > f(\boldsymbol{\textrm{x}}_n)\ \forall n$. One way to do this is to take the direction of steepest ascent, which is just the gradient of the function $\frac{\partial f}{\partial \boldsymbol{\textrm{x}}}$, and take a step in that direction scaled by a constant $\lambda$, known as the learning rate, that describes how big the step should be. We express this mathematically as:
$$ \boldsymbol{\textrm{x}}_{n+1} \leftarrow \boldsymbol{\textrm{x}}_n + \lambda \frac{\partial f}{\partial \boldsymbol{\textrm{x}}} $$The last detail is choosing the starting point, $\boldsymbol{\textrm{x}}_0$, which we can arbitrarily choose by setting to zero or to some random value. Graphically, the algorithm looks like this, with each color representing the path from a different starting point:
Note that this method tends to find the top of the nearest hill ('local' maximum), and not the overall best point ('global' maximum).
It is also not guaranteed to increase the value of the function at each step; if the learning rate is too large, the algorithm could potentially jump across the top of the hill to a lower point on the other side of the hill. Much research goes into optimization methods, and many neural networks models are trained with methods that are more complicated than gradient ascent as it's presented here, but this same idea is at the base of all of those methods.
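To make the procedure concrete, here is a minimal sketch of gradient ascent on a made-up one-dimensional function (nothing to do with our model yet); the function, starting point, and learning rate are arbitrary choices for illustration.
f = lambda x: -(x - 3.0) ** 2           # a toy function whose maximum is at x = 3
grad_f = lambda x: -2.0 * (x - 3.0)     # its derivative
x = 0.0                                 # an arbitrary starting point
learning_rate = 0.1
for _ in range(50):
    x = x + learning_rate * grad_f(x)   # step in the direction of steepest ascent
print(x, f(x))                          # x ends up very close to 3, the maximum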
Finally, most people talk about and use gradient descent on the negative log-likelihood rather than gradient ascent on the log-likelihood; this is because gradient descent is the standard algorithm in the field of optimization. Gradient ascent is used in this tutorial to keep things a bit simpler.
This is extremely easy: we just apply the equation above to our parameters taking the gradient of the log-likelihood:
\begin{align} \boldsymbol{\textrm{W}} &\leftarrow \boldsymbol{\textrm{W}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{W}}} \\ \boldsymbol{\textrm{b}} &\leftarrow \boldsymbol{\textrm{b}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{b}}} \\ \end{align}Coding with Theano is extremely simple: once we have the equations behind the model, we pretty much type them directly. Since we will be training on multiple data points, we are going to encode the model as we wrote it for datasets, using vectors and matrices.
We'll start with the logistic regression. Our model is:
$$P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) = \textrm{SoftMax} \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)$$The first thing we do is declare all of the variables ($\boldsymbol{\textrm{X}}, \boldsymbol{\textrm{y}}, \boldsymbol{\textrm{W}}, \boldsymbol{\textrm{b}}$) that we will be using and their types like you would in Java or C:
In [8]:
n_classes = 10 # each digit is one of 0-9
dims = 28 * 28 # our input data is flattened 28x28 matrices of image pixels
X = T.dmatrix() # Theano double matrix
y = T.ivector() # Theano integer vector
W = theano.shared(np.zeros([dims,n_classes])) # Theano shared double matrix
b = theano.shared(np.zeros(n_classes)) # Theano shared double vector
As you can see above, we defined $W$ and $b$ to be shared variables. This means that the values of these variables are persistent -- their values live on after we have run Theano operations. This is opposed to regular Theano variables, which only take values when Theano runs, and otherwise only exist in the abstract.
The reason for making $W$ and $b$ shared variables is that we want to run multiple iterations of gradient descent, and to do that, we need their values to persist. Furthermore, we want to find good parameters through training, but we will then want to use the same parameters for prediction, so we need them to be persistent.
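As a small illustration of this persistence (a toy counter, separate from our model, reusing the theano and numpy imports from above), a shared variable keeps its value between calls:
counter = theano.shared(np.float64(0.0))      # persistent state
add_one = theano.function(inputs=[], outputs=counter,
                          updates=[(counter, counter + 1.0)])
add_one()                     # returns 0.0, then stores 1.0 in counter
add_one()                     # returns 1.0, then stores 2.0
print(counter.get_value())    # 2.0 -- the value persisted between calls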
Let's now write our statistical model in Theano. We basically copy the following equation into code:
$$P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) = \textrm{SoftMax} \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)$$
In [9]:
P = T.nnet.softmax(T.dot(X,W) + b) # the matrix of probabilities of all classes for all data points
Theano provides us with a T.nnet.softmax function to compute SoftMax, correctly normalizing the probabilities so that each row of the matrix P is a proper probability distribution that sums to $1$.
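If you are curious what this normalization amounts to, here is a rough NumPy equivalent of a row-wise SoftMax (ignoring the numerical-stability tricks a real implementation uses):
def softmax_rows(A):
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)   # each row now sums to 1

scores = np.array([[1.0, 2.0, 3.0],
                   [0.0, 0.0, 0.0]])
print(softmax_rows(scores))               # each row is a probability distribution
print(softmax_rows(scores).sum(axis=1))   # [ 1.  1.]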
Note that we didn't need to multiply b by the 1 vector: Theano broadcasts the vector across the rows and does the correct addition for us automatically, just like numpy would. Here is a simple numpy example illustrating this:
In [10]:
amatrix = np.zeros((3,2)) # 3x2 matrix of all zeros
avector = np.array((1,2)) # the vector [1,2]
amatrix + avector
Out[10]:
array([[ 1.,  2.],
       [ 1.,  2.],
       [ 1.,  2.]])
Our next equation is the log-likelihood (LL) of a dataset:
$$\log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta) = \sum \left[ \log\left( \textrm{SoftMax} \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)\right)\right]_{\cdot,\boldsymbol{\textrm{y}}}$$
In [11]:
LL = T.mean(T.log(P)[T.arange(P.shape[0]), y]) # the log-likelihood (LL)
OK... there's a lot going on here.
We have used two important tricks: taking the mean with T.mean rather than just summing, and the indexing expression [T.arange(P.shape[0]), y]. Let's go over them one by one.
Imagine that we were to construct a new dataset that contains each data point in the original dataset twice. Then the log-likelihood of the new dataset will be double that of the original. More importantly, the gradient of the log-likelihood of the new dataset will also be double the gradient of the log-likelihood of the original dataset. But we would like the size of the gradient to not depend on the amount of duplication in our dataset, and the easiest way to accomplish that is to divide the gradient by the size of the dataset.
Since taking the mean is equivalent to taking the sum and then dividing by the number of data points, what we are computing here is a type of "normalized" log-likelihood that will cause our gradient descent algorithm to be robust to change in dataset size. The quantity we are computing in the code can be more precisely described as the average log-likelihood for a single datapoint.
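A quick NumPy check of this point, using some made-up per-point probabilities:
log_probs = np.log(np.array([0.7, 0.8, 0.4]))      # pretend per-point log-probabilities
doubled = np.concatenate([log_probs, log_probs])   # the same dataset duplicated
print(log_probs.sum(), doubled.sum())              # the sum doubles
print(log_probs.mean(), doubled.mean())            # the mean stays the same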
The second thing we do is use [T.arange(P.shape[0]), y] to apply the mathematical operation denoted in the equation by $\left[ \cdot\right]_{\cdot,\boldsymbol{\textrm{y}}}$, that is, constructing a new vector $(a_1, a_2, \dots, a_n)$ from a matrix $\boldsymbol{\textrm{M}}$ such that $a_i = M_{i, y_i}\ \forall i$.
Cryptic as it may look, this is a peculiar but standard numpy way to index. For example, given the matrix
M = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])
If we wanted to extract one element from each row, say the first element from the first row and the last element from the other two rows, we would write
M[0,0], M[1,3], M[2,3]
We could write that as a single index expression by combining the indexes
M[(0,1,2), (0,3,3)]
But now the first index is just $(0 \dots \#\textrm{rows}-1)$, or, in code, np.arange(M.shape[0]).
So we can write:
In [12]:
M = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
M[np.arange(M.shape[0]), (0,3,3)]
Out[12]:
array([ 1,  8, 12])
We're done with the model. There's one more thing we need to do, and that is to specify the gradient updates:
\begin{align} \boldsymbol{\textrm{W}} &\leftarrow \boldsymbol{\textrm{W}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{W}}} \\ \boldsymbol{\textrm{b}} &\leftarrow \boldsymbol{\textrm{b}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{b}}} \\ \end{align}
In [13]:
learning_rate = 0.5 # we tuned this parameter by hand
updates = [
[W, W + learning_rate * T.grad(LL, W)],
[b, b + learning_rate * T.grad(LL, b)]
]
OK, we're done coding the model. What do we do next?
When working with Theano, the next step is to create a Theano function. A Theano function is the basic unit of Theano code that we call to do something. In our case, this something will be performing a single gradient ascent iteration.
We create a Theano function by calling the function theano.function (yes, we create a function by calling a function). theano.function has four important parameters that we provide in order to get a Theano function:
inputs -- the list of input variables of the Theano function, similar to the inputs of a Python function
outputs -- the list of output variables of the Theano function, similar to the return values of a Python function
updates -- a list of updates for shared variables, in the format we used above when we defined the variable updates
givens -- a dictionary that allows substituting some variables from the model with other variables
In our case, we want the input to be the training dataset, the updates to be the gradient ascent updates, and while we don't really need an output, it will be helpful to get the log-likelihood as an output to see that we are doing the right thing.
However, we will use givens instead of inputs to provide the data to the function. Doing it this way is more efficient, as we have already loaded the training dataset into memory as a Theano shared variable when we first loaded the data.
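To see the difference between the two ways of feeding in values, here is a toy example with a hypothetical scalar variable (unrelated to our model, reusing the imports from above):
a = T.dscalar()
square_with_inputs = theano.function(inputs=[a], outputs=a ** 2)
print(square_with_inputs(3.0))     # 9.0 -- the value is passed in at call time

a_stored = theano.shared(np.float64(3.0))
square_with_givens = theano.function(inputs=[], outputs=a ** 2,
                                     givens={a: a_stored})
print(square_with_givens())        # 9.0 -- a is substituted by the shared variable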
Our Theano function will look like this:
In [14]:
training_function = theano.function(
inputs = [], # use givens instead of the inputs as it's more efficient
outputs = LL, # output log-likelihood just to check that it is improving
updates = updates, # these are the gradient updates, the one part that's really important
givens = {X: training_x, # we indicate that the model variables X and y defined in the abstract
y: training_y} # should take the values in the shared variables training_x and training_y
)
OK, let's run ten iterations of our code and see what the log-likelihood does
In [15]:
for i in range(10):
    current_LL = training_function()
    print("Log-likelihood = " + str(current_LL) + "\t\t" +
          "Average probability of the correct class = " + str(np.exp(current_LL)))
Great, it appears we're improving the log-likelihood of the model on the training data. Now to use it to classify. Recall our equation that expresses how to use the model to get the class:
$$\hat{\boldsymbol{\textrm{y}}} = {\arg\max} \ P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) $$Let's put that in code:
In [16]:
y_hat = T.argmax(P, axis=1)
Note that we had to specify axis=1, that is, we want to get the argmax for each row, as each row represents the distribution for one datapoint.
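For example, with NumPy's argmax on a small matrix of made-up probabilities:
probs = np.array([[0.1, 0.7, 0.2],       # distribution for data point 0
                  [0.6, 0.3, 0.1]])      # distribution for data point 1
print(np.argmax(probs, axis=1))          # [1 0] -- the most probable class for each row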
Similarly to the training function, the classification function will use givens to pass in the test dataset, and will output y_hat, which we just defined.
In [15]:
classification_function = theano.function(
inputs = [],
outputs = y_hat,
givens = {X:test_x} # don't need the true labels test_y here
)
Now let's run the classification once, and print the first ten images, the true labels, and the labels assigned by the model
In [16]:
test_y_hat = classification_function()
print ("Classification error: "+ str(100 * (1 - np.mean(test_y_hat == test_y.get_value()))) + "%")
for i in range(10):
    plot_mnist_digit(test_x.get_value()[i])                     # test_x is a theano object of *images*
    print('Image class: \t\t' + str(test_y.get_value()[i]))     # test_y is a theano object of *true labels*
    print('Model-assigned class: \t' + str(test_y_hat[i]))      # test_y_hat is the array of *predicted labels*
So far we have trained a logistic regression model. The neural network model is so similar that we can implement it with just a few changes to the code.
First, we need to declare the new variables and parameters:
In [17]:
H = T.dmatrix() # Theano double matrix
hidden_layer_size = 20
W_xh = theano.shared(0.01 * np.random.randn(dims, hidden_layer_size)) # small random values, rather than zeros, to break the symmetry between hidden units
W_hy = theano.shared(np.zeros([hidden_layer_size, n_classes]))
b_xh = theano.shared(np.zeros(hidden_layer_size))
b_hy = theano.shared(np.zeros(n_classes))
Remember our model? Let's write it using matrices and vectors for an entire dataset:
$$\boldsymbol{\textrm{H}} = \tanh \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{xh}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}_\boldsymbol{\textrm{xh}}\right), \\ P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) = \textrm{SoftMax} \left( \boldsymbol{\textrm{H}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{hy}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}_\boldsymbol{\textrm{hy}}\right)$$Let's code it up:
In [18]:
H = T.tanh( T.dot(X, W_xh) + b_xh )
P = T.nnet.softmax( T.dot(H, W_hy) + b_hy )
Now let's add the gradient updates:
In [19]:
LL = T.mean(T.log(P)[T.arange(P.shape[0]), y]) # the log-likelihood (LL)
updates = [
[W_xh, W_xh + learning_rate * T.grad(LL, W_xh)],
[W_hy, W_hy + learning_rate * T.grad(LL, W_hy)],
[b_xh, b_xh + learning_rate * T.grad(LL, b_xh)],
[b_hy, b_hy + learning_rate * T.grad(LL, b_hy)],
]
The one extremely important thing we did here was redefining LL. This is a crucial point about how Theano works: whenever we define a Theano variable, like we did with P, we create a new object. When we define a new Theano variable in terms of another Theano variable, like we did with LL using
`LL = T.mean(T.log(P)[T.arange(P.shape[0]), y])`
we implicitly create a new object for LL that holds a reference to the variable P we just defined. Now the crucial part: say we redefine P. Then our variable LL still holds a reference to the old variable P, and we need to re-run the definition of LL for everything to work correctly.
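Here is a toy illustration of the pitfall, with hypothetical variables that are not part of the model:
p = T.dscalar()
log_p = T.log(p)     # log_p holds a reference to *this* p object
p = T.dscalar()      # rebinding the name p creates a brand-new Theano variable
# log_p still refers to the old p; any theano.function built from log_p will keep
# using the old variable until we re-run the line `log_p = T.log(p)`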
Bugs in Theano are very commonly produced by exactly this. It is a good reason to always use Theano in scripts rather than in a notebook like we are here.
Phew, we are now ready to train!
In [20]:
training_function = theano.function(
inputs = [],
outputs = LL,
updates = updates,
givens = {X: training_x[:5000], # use only 10% of the training data so that training
y: training_y[:5000]} # is not too slow on a personal computer
)
for i in range(60): # train for more iterations than logistic regression as this model is more complex
    current_LL = training_function()
    print(
        "Log-likelihood = " + str(current_LL) + "\t\t" +
        "Average probability of the correct class = " + str(np.exp(current_LL))
    )
In [21]:
y_hat = T.argmax(P, axis=1)
# P and y_hat have been redefined, so we must also rebuild the classification function
# so that it references the new graph (this is exactly the pitfall discussed above)
classification_function = theano.function(
    inputs = [],
    outputs = y_hat,
    givens = {X: test_x}
)
test_y_hat = classification_function()
print("Classification error: " + str(100 * (1 - np.mean(test_y_hat == test_y.get_value()))) + "%")
for i in range(10):
    plot_mnist_digit(test_x.get_value()[i])                     # test_x is a theano object of *images*
    print('Image class: \t\t' + str(test_y.get_value()[i]))     # test_y is a theano object of *true labels*
    print('Model-assigned class: \t' + str(test_y_hat[i]))      # test_y_hat is the array of *predicted labels*
Along the way we used a good chunk of mathematics and a handful of tricks, but required very few lines of code.
Hopefully this leads nicely to using Theano more; we would be excited to see what you set Theano to work on next.
'Til next time!