Training Neural Networks with Theano

Training neural networks involves quite a few tricky bits. We try to make everything clear and easy to understand, to get you training your neural networks as quickly as possible.

Theano allows us to write relatively concise code that follows the structure of the underlying maths. To run the code yourself, download the notebook at https://github.com/ASIDataScience/training-neural-networks-notebook

Recognising hand-written digits

We will train a network to classify digits. More precisely, we want a network that when presented with an image of a hand-written digit will tell us what digit it is ('0', '1', ..., '9'). The data we will use for this task is known as the MNIST dataset. It has a long tradition in neural networks research, as the dataset is quite small but still very tricky to classify correctly.

You can download the MNIST dataset which we are using.

Import Theano

You can install theano with the Python package manager pip. At the command line type

pip install theano

or check out the theano documentation if you run into trouble

The theano.tensor module contains useful functionality for manipulating vectors and matrices, like we will be doing here, so let's import it along with the full package.


In [1]:
import theano
import theano.tensor as T 
import numpy as np


WARNING (theano.configdefaults): g++ not detected ! Theano will be unable to execute optimized C-implementations (for both CPU and GPU) and will default to Python implementations. Performance will be severely degraded. To remove this warning, set Theano flags cxx to an empty string.

With the data loaded into a variable mnist_dataset, we split it into three part: a training set, used to teach the network to recognize digits, a validation set that could be used to tune and compare models, and finally a test set, used to see how well it has learned.

We then prepare the datasets, splitting each set into the images and their labels, and store them in Theano shared variables, a bit of theano magic that's explained later. Understanding this massaging of the data isn't crucial.


In [4]:
import os
import gzip
import pickle

myPath = '' # mnist.pkl.gz is in this directory
f = gzip.open(os.path.join(myPath, 'mnist.pkl.gz'), 'rb')

try:        # for cross-platform, cross-version reasons, try two different pickle.load statements
    mnist_dataset = pickle.load(f, encoding='latin1')
except:
    mnist_dataset = pickle.load(f)
f.close()

In [5]:
train_set, valid_set, test_set = mnist_dataset
prepare_data = lambda x: (theano.shared(x[0].astype('float64')), 
                          theano.shared(x[1].astype('int32')))
(training_x, training_y), (valid_x, valid_y), (test_x, test_y) = map(prepare_data, (train_set, valid_set, test_set))

In [6]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

def plot_mnist_digit(image):
    '''Plot a single digit from the mnsist dataset'''
    image = np.reshape(image, [28,28])    
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(image, cmap=matplotlib.cm.binary)
    plt.xticks(np.array([]))
    plt.yticks(np.array([]))
    plt.show()

Let's look at the first three images in the training set, then set about building a machine learning model that has the ability to recognise digits itself!

We've defined a function plot_mnist_digit exactly for printing out the training images - it's just for prettiness.


In [7]:
for i in range(3):
    plot_mnist_digit(training_x.get_value()[i])             # training_x is a theano object containing *images*
    print ('Image class: '+ str(training_y.get_value()[i])) # training_y is a theano object containing *labels*


Image class: 5
Image class: 0
Image class: 4

The neural network model

The neural network model that we will build defines a probability distribution

$$P(Y = y \ |\ X = \boldsymbol{\textrm{x}} ; \theta),$$

where $Y$ represents the image class, which means it is a random variable that can take the values 0-9, $X$ represents the image pixels and is a vector-valued random variable (we collapse the image matrix into a vector), and $\theta$ is the set of model parameters that we are going to learn.

In this tutorial we build two models: first implementing a logistic regression model, then extending it to a neural network model.

Multi-class logistic regression

The equation for our first model is give by

$$P(Y = y \ |\ \boldsymbol{\textrm{x}} ; \theta) \propto \left[\sigma \left( \boldsymbol{\textrm{x}}^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}}\right)\right]_y,$$

where $[\boldsymbol{\textrm{x}}]_i$ is the $i$th entry of vector $\boldsymbol{\textrm{x}}$, and the use of the proportionality symbol $\propto$ means that the probability is equal to the expression on the right hand side times a constant chosen such that $\sum_y{P(Y = y \ |\ \boldsymbol{\textrm{x}} ; \theta)} = 1$

The parameter set $\theta$ for this model is $\theta = \{\boldsymbol{\textrm{W}}, \boldsymbol{\textrm{b}}\}$, where $\boldsymbol{\textrm{W}}$ is a matrix and $\boldsymbol{\textrm{b}}$ is a vector. We also use the non-linearity $\sigma$ given by

$$\sigma(t) = \frac{1}{1+e^{-t}}$$

When applied to a vector or matrix, the sigmoid function $\sigma(t)$ is applied entrywise.

Understanding the logistic regression model

One way to think about the logistic regression model is that it takes the input ($\boldsymbol{\textrm{x}}$), puts it through a linear combination ($\boldsymbol{\textrm{x}}^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}}$) and then finally through a non-linearity: $\sigma(\boldsymbol{\textrm{x}}^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}})$.

The result of this operation is a vector representing the entire discrete distribution over all the possible classes - in our case the ten possible digits 0-9. To get the probability of a particular class $y=6$ we extract the $6th$ entry of the probability vector: $\left[\sigma \left( \boldsymbol{\textrm{x}}^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}}\right)\right]_y$

Graphically the model looks like this:

Each indiviual entry of the vectors x and $P(Y)$ is shown as a circle -- known as the units (or artificial neurons) of the network. We have $D$ input units (the dimensionality of the input vector, which is the flattened matrix of pixels) and $C$ output units (the number of classes, which is the digits 0-9). The model parameters W and b are represented as arrows. We also show the application of the sigmoid functions, but we do not represent the normalization that makes the probabilities sum up to $1$.

Another way to write the model above is using the $\textrm{SoftMax}$ function. A good exercise is deriving the $\textrm{SoftMax}$ function based on the fact that we can also write the same model using $\textrm{SoftMax}$ and the equal sign instead of the proportionality sign:

$$P(Y = y \ |\ \boldsymbol{\textrm{x}} ; \theta) = \left[\textrm{SoftMax} \left( \boldsymbol{\textrm{W}} \boldsymbol{\textrm{x}} + \boldsymbol{\textrm{b}}\right)\right]_y,$$

Neural network with a single hidden layer

Our second model is given by

$$P(Y = y \ |\ \boldsymbol{\textrm{x}} ; \theta) = \left[ \textrm{SoftMax} \left( \boldsymbol{\textrm{h}} ^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{hy}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}} _\boldsymbol{\textrm{hy}}\right)\right]_y, \\ \boldsymbol{\textrm{h}} = \tanh \left( \boldsymbol{\textrm{x}} ^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{xh}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}}_\boldsymbol{\textrm{xh}}\right)$$

This is very similar to the logisitc regression model. Here we have introduced a new vector-valued random variable $h$. We call this a 'hidden' variable or 'latent' variable, as we do not have any data observations for it. This variable may not even correspond to any quantity in the real world, but we use it to increase the power of our statistical model.

We now also have more parameters: $\theta = \{\boldsymbol{\textrm{W}}_\boldsymbol{\textrm{xh}}, \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{hy}}, \boldsymbol{\textrm{b}}_\boldsymbol{\textrm{xh}}, \boldsymbol{\textrm{b}}_\boldsymbol{\textrm{hy}}\}$.

$\tanh$ is the hyperbolic tangent function given by $$\tanh(t) = \frac{e^t-e^{-t}}{e^t+e^{-t}}$$

Like with the sigmoid, when applied to a vector or matrix, $\tanh$ function is applied entrywise.

Graphically, this model looks like this:

Now our depiction shows a hidden layer with $M$ units (this number can be different from the number of input neurons and number of output neurons), and we have two different nonlinearities in the graph: $tanh$ and sigmoids (but again we are not graphically representing the SoftMax normalization).

Teaching the network

The way we're going to make our network learn is by trying to find some values for our parameters $\theta$ so that the network is as likely as possible to guess the correct class. That is, given a data set of training images $\boldsymbol{\textrm{x}}_1, \boldsymbol{\textrm{x}}_2, \dots, \boldsymbol{\textrm{x}}_N$ and correct labels $y_1, y_2, \dots, y_N$, we want to find the parameters that maximize probability of the correct labels given the images.

This method of choosing parameters is called maximum likelihood (ML), and we can express it mathematically as finding the parameters $\theta$ which maximize the likelihood function:

$$\theta^* = {\arg\max}_\theta \ P( Y_1 = y_1, Y_2 = y_2, \dots, Y_N = y_N \ | \ \boldsymbol{\textrm{X}}_1 = \boldsymbol{\textrm{x}}_1,\boldsymbol{\textrm{X}}_2 = \boldsymbol{\textrm{x}}_2, \dots, \boldsymbol{\textrm{X}}_N = \boldsymbol{\textrm{x}}_N ; \theta)$$

And since our data points are independent, we can write this joint probability as a product of probabilities:

\begin{align} P( Y_1 = y_1,\dots, Y_N = y_N \ | \ \boldsymbol{\textrm{X}}_1 = \boldsymbol{\textrm{x}}_1, \dots, \boldsymbol{\textrm{X}}_N = \boldsymbol{\textrm{x}}_N ; \theta) &= P( Y_1 = y_1 \ | \ \boldsymbol{\textrm{X}}_1 = \boldsymbol{\textrm{x}}_1 ; \theta) \times \dots \times P( Y_N = y_N \ | \ \boldsymbol{\textrm{X}}_N = \boldsymbol{\textrm{x}}_N ; \theta) \\ &= \prod_{i=1}^N P( Y_i = y_i \ | \ \boldsymbol{\textrm{X}}_i = \boldsymbol{\textrm{x}}_i) \end{align}

Writing the likelihood function for an entire dataset

In our likelihood function above, each random variable pair $(X_1, Y_1), (X_2, Y_2), \dots, (X_N, Y_N)$ refers to a single data point. But since virtually all of our computations need to deal with multiple data points, we will find it both useful and computationally efficient to express both the mathematics and our Python code in terms of datasets. Thus, we will express the scalar random variables $Y_1, Y_2, \dots, Y_N$ as the vector-valued random variable $Y$, and the vector-valued random variables $X_1, X_2, \dots, X_N$ as the matrix-valued random variable $X$, where the matrix $X$ has as many rows as there are data points.

Using this notation, we rewrite the maximum likelihood equation above:

$$\theta^* = {\arg\max}_\theta \ P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)$$

Similarly, we can specify the logistic regression model it terms of multiple datapoints:

$$P(Y \ |\ X = \boldsymbol{\textrm{X}} ; \theta) = \textrm{SoftMax} \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)$$

Here the result is a matrix of probabilities with as many rows as there are data points, and as many columns as there are classes. We also consider the SoftMax to normalize the result of the linear combination $\left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)$ in such a way that each row of the result is a proper probability distribution summing to $1$. Note that we have had to multiply the bias vector $\boldsymbol{\textrm{b}}$ by the vertical vector of all ones $\boldsymbol{\textrm{1}}$ in order to add the bias term for every single data point.

The neural network model equations follow a similar pattern, which it would be a good exercise to write out for yourself.

Computing the log-likelihood of a dataset

In most machine learning applications, it is better to maximize the log-likelihood rather than the likelihood. This is done because the log-likelihood tends to be simpler to compute and more numerically stable than the likelihood. In terms of the math, this doesn't make things much more complicated, as all we need to add is a $\log$ in front of the likelihood:

$$\theta^* = {\arg\max}_\theta \ \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)$$

Since the logarithm of a product is the sum of the logarithms, we can write:

$$\log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta) = \sum_{i=1}^N \log P( Y_i = \boldsymbol{\textrm{y}}_i\ | \ X_i = \boldsymbol{\textrm{X}}_i ; \theta)$$

that is, the log joint probability is the sum of the log marginal probabilities.

Now let's plug in the probability of a dataset from above to obtain:

$$\log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta) = \sum \left[ \log\left( \textrm{SoftMax} \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)\right)\right]_{\cdot,\boldsymbol{\textrm{y}}}$$

where we use the notation $[\boldsymbol{\textrm{M}}]_{\cdot,\boldsymbol{\textrm{y}}}$ to mean that from the matrix $\boldsymbol{\textrm{M}}$ we construct a new vector $(a_1, a_2, \dots, a_n)$ such that $a_i = M_{i, y_i} \forall i$. We use a slight abuse of notation and we use the sum symbol to indicate the summation of the entries in the vector; we need a summation because the log joint probability is the sum of the log marginal probabilities.

Classifying a dataset

Once we have a set of parameters $\theta$ that we are happy with, we can use the model to classify new data.

We have built a model that gives us a distribution over classes given the data. How do we assign a class to a new data point? The simplest way is to choose the class with the highest probability under the model to be the class we assign. We can write this mathematically, again using vector notation:

$$\hat{\boldsymbol{\textrm{y}}} = {\arg\max} \ P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) $$

Gradient ascent

Computing the maximum likelihood parameters is computationally unfeasible, so we're going to use a method called gradient ascent to find a set of parameters that are really good but perhaps not the absolute best.

The idea of gradient ascent is very simple. Given a function $f:\mathbb{R}^n \rightarrow \mathbb{R}$, we want to iteratively find points $f(\boldsymbol{\textrm{x}}_n)$ such that the value of the function gets progressively higher, that is: $f(\boldsymbol{\textrm{x}}_{n+1}) > f(\boldsymbol{\textrm{x}}_n) \forall n$. One way to do this is taking the direction of steepest ascent, which is just the gradient of the function $\frac{\partial f}{\partial \boldsymbol{\textrm{x}}}$, and taking a step in that direction times a constant $\lambda$ known as the learning rate that describes how big the step should be. We express this mathematically as:

$$ \boldsymbol{\textrm{x}}_{n+1} \leftarrow \boldsymbol{\textrm{x}}_n + \lambda \frac{\partial f}{\partial \boldsymbol{\textrm{x}}} $$

The last detail is choosing the starting point, $\boldsymbol{\textrm{x}}_0$, which we can arbitrarily choose by setting to zero or to some random value. Graphically, the algorithm looks like this, with each color representing the path from a different starting point:

Note that this method tends to find the top of the nearest hill ('local' maximum), and not the overall best point ('global' maximum).

It is also not guaranteed to increase the value of the function at each step; if the learning rate is too large, the algorithm could potentially jump across the top of the hill to a lower point on the other side of the hill. Much research goes into optimization methods, and many neural networks models are trained with methods that are more complicated than gradient ascent as it's presented here, but this same idea is at the base of all of those methods.

Finally, most people talk about and use gradient descent on the negative log-likelihood rather than gradient ascent on the log-likelihood; this is because gradient descent is the standard algorithm in the field of optimization. Gradient ascent in used in this tutorial to keep things a bit simpler.

Optimizing our model with gradient ascent

This is extremely easy: we just apply the equation above to our parameters taking the gradient of the log-likelihood:

\begin{align} \boldsymbol{\textrm{W}} &\leftarrow \boldsymbol{\textrm{W}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{W}}} \\ \boldsymbol{\textrm{b}} &\leftarrow \boldsymbol{\textrm{b}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{b}}} \\ \end{align}

Coding with Theano

Coding with Theano is extremely simple: once we have the equations behind the model, we pretty much type them directly. Since we will be training on multiple data points, we are going to encode the model as we wrote it for datasets, using vectors and matrices.

We'll start with the logistic regression. Our model is:

$$P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) = \textrm{SoftMax} \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)$$

The first thing we do is declare all of the variables ($\boldsymbol{\textrm{X}}, \boldsymbol{\textrm{y}}, \boldsymbol{\textrm{W}}, \boldsymbol{\textrm{b}}$) that we will be using and their types like you would in Java or C:


In [8]:
n_classes = 10  # each digit is one of 0-9 
dims = 28 * 28  # our input data is flattened 28x28 matrices of image pixels

X = T.dmatrix() # Theano double matrix
y = T.ivector() # Theano integer vector
W = theano.shared(np.zeros([dims,n_classes])) # Theano shared double matrix
b = theano.shared(np.zeros(n_classes))        # Theano shared double vector

As you can see above, we defined $W$ and $b$ to be shared variables. This means that the values of these variables are persistent -- their values live on after we have run Theano operations. This is opposed to regular Theano variables, which only take values when Theano runs, and otherwise only exist in the abstract.

The reason for making $W$ and $b$ shared variables is that we want to run multiple iterations of gradient descent, and to do that, we need their values to persist. Furthermore, we want to find good parameters through training, but we will then want to use the same parameters for prediction, so we need them to be persistent

Let's now write our statistical model in Theano. We basically copy the following equation into code:

$$P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) = \textrm{SoftMax} \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)$$

In [9]:
P = T.nnet.softmax(T.dot(X,W) + b) # the matrix of probabilities of all classes for all data points

Theano provides us with a T.nnet.softmax function to compute SoftMax, correctly normalizing the probabilities so that each row of the matrix P is a proper probability distribution that sums to $1$.

Note that we didn't need to multiply b by the 1 vector, Theano will do the correct addition for us automatically, just like numpy would do it. Here is a simple numpy example illustrating this:


In [10]:
amatrix = np.zeros((3,2)) # 3x2 matrix of all zeros
avector = np.array((1,2)) # the vector [1,2]
amatrix + avector


Out[10]:
array([[ 1.,  2.],
       [ 1.,  2.],
       [ 1.,  2.]])

Our next equation is the log-likelihood (LL) of a dataset:

$$\log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta) = \sum \left[ \log\left( \textrm{SoftMax} \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}\right)\right)\right]_{\cdot,\boldsymbol{\textrm{y}}}$$

In [11]:
LL = T.mean(T.log(P)[T.arange(P.shape[0]), y]) # the log-likelihood (LL)

OK... there's a lot going on here.

We have used two important tricks:

  • using mean instead of sum,
  • using the strange indexing [T.arange(P.shape[0]), y]

Let's go over them one by one.

Use the mean instead of the sum

Imagine that we were to construct a new dataset that contains each data point in the original dataset twice. Then the log-likelihood of the new dataset will be double that of the original. More importantly, the gradient of the log-likelihood of the new dataset will also be double the gradient of the log-likelihood of the original dataset. But we would like the size of the gradient to not depend on the amount of duplication in our dataset, and the easiest way to accomplish that is to divide the gradient by the size of the dataset.

Since taking the mean is equivalent to taking the sum and then dividing by the number of data points, what we are computing here is a type of "normalized" log-likelihood that will cause our gradient descent algorithm to be robust to change in dataset size. The quantity we are computing in the code can be more precisely described as the average log-likelihood for a single datapoint.

Use indexing to create a vector from a matrix

The second thing we do is use [T.arange(P.shape[0]), y] to apply the mathematical operation denoted in the equation by $\left[ \cdot\right]_{\cdot,\boldsymbol{\textrm{y}}}$, that is, constructing a new vector $(a_1, a_2, \dots, a_n)$ from a matrix $\boldsymbol{\textrm{M}}$ such that $a_i = M_{i, y_i} \forall i$.

As cryptic as it may be, this is a peculiar, but standard numpy way to index. For example, given a matrix

M = np.array([[1,2,3,4],
              [5,6,7,8],
              [9,10,11,12]]

If we wanted to extract one element from each row, say the 1st element from the first row, and the last from the others

M[0,0], M[1,3], M[2,3]

We could write that as a single index expression by combining the indexes

M[(0,1,2), (0,3,3)]

But now the first index is just $(0 \dots \#\textrm{rows}-1)$, or, in code, np.arange(M.shape[0]).

So we can write:


In [12]:
M = np.array([[1,  2,  3,  4],
              [5,  6,  7,  8],
              [9, 10, 11, 12]])

M[np.arange(M.shape[0]), (0,3,3)]


Out[12]:
array([ 1,  8, 12])

We're done with the model. There's one more thing we need to do, and that is specify the gradient updates

\begin{align} \boldsymbol{\textrm{W}} &\leftarrow \boldsymbol{\textrm{W}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{W}}} \\ \boldsymbol{\textrm{b}} &\leftarrow \boldsymbol{\textrm{b}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{b}}} \\ \end{align}

In [13]:
learning_rate = 0.5 # we tuned this parameter by hand
updates = [
        [W, W + learning_rate * T.grad(LL, W)], 
        [b, b + learning_rate * T.grad(LL, b)]
        ]

Theano functions

OK, we're done coding the model. What do we do next?

When working with Theano, the next step is to create a Theano function. A Theano function is the basic unit of Theano code that we call to do something. In our case, this something will be performing a single gradient ascent iteration.

We create a Theano function by calling the function theano.function (yes, we create a function by calling a function). theano.function has four important parameters that we provide in order to get a Theano function:

  • inputs -- the list of input variables of the Theano function, similar to the inputs of a Python function
  • outputs -- the list of output variables of the Theano function, similar to the inputs of a Python function
  • updates -- a list of updates for shared variables, in the format we used above when we defined the variable updates
  • givens -- a dictionary that allows substituting some variables from the model with other variables

In our case, we want the input to be the training dataset, the updates to be the gradient ascent updates, and while we don't really need an output, it will be helpful to get the log-likelihood as an output to see that we are doing the right things.

However, we will use the givens instead of the input to provide the data to the function.

Doing it this way is more efficient, as we've already loaded up the training dataset into memory as a shared Theano function, when we first loaded the data.

Our Theano function will look like this:


In [14]:
training_function = theano.function(
    inputs = [],             # use givens instead of the inputs as it's more efficient
    outputs = LL,            # output log-likelihood just to check that it is improving 
    updates = updates,       # these are the gradient updates, the one part that's really important
    givens = {X: training_x, # we indicate that the model variables X and y defined in the abstract
              y: training_y} # should take the values in the shared variables training_x and training_y                
    )

OK, let's run ten iterations of our code and see what the log-likelihood does


In [15]:
for i in range(10):
    current_LL = training_function()
    print("Log-likelihood = " + str(current_LL) + "\t\t" +
          "Average probability of the correct class = " + str(np.exp(current_LL))
         )


Log-likelihood = -2.3025850929940463		Average probability of the correct class = 0.1
Log-likelihood = -1.8391209025064599		Average probability of the correct class = 0.158957103494
Log-likelihood = -1.5250543679632202		Average probability of the correct class = 0.217609225573
Log-likelihood = -1.3142440592188174		Average probability of the correct class = 0.268677350658
Log-likelihood = -1.1682410159224519		Average probability of the correct class = 0.310913352197
Log-likelihood = -1.0617673924666753		Average probability of the correct class = 0.34584402773
Log-likelihood = -0.9820455376704271		Average probability of the correct class = 0.374544170518
Log-likelihood = -0.9194230935951666		Average probability of the correct class = 0.398749015602
Log-likelihood = -0.869394886438446		Average probability of the correct class = 0.419205139229
Log-likelihood = -0.8276885884406652		Average probability of the correct class = 0.437058341404

Using the model for classification

Great, it appears we're improving the log-likelihood of the model on the training data. Now to use it to classify. Recall our equation that expresses how to use the model to get the class:

$$\hat{\boldsymbol{\textrm{y}}} = {\arg\max} \ P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) $$

Let's put that in code:


In [16]:
y_hat = T.argmax(P, axis=1)

Note that we had to specify axis=1, that is, we want to get the argmax for each row, as each row represents the distribution for one datapoint.

Create a Theano function that classifies a dataset

Similarly to the training function, the classification function will use givens to pass in the test dataset, and output y_hat which we just defined


In [15]:
classification_function = theano.function(
    inputs = [],          
    outputs = y_hat,      
    givens = {X:test_x} # don't need the true labels test_y here
    )

Now let's run the classification once, and print the first ten images, the true labels, and the labels assigned by the model


In [16]:
test_y_hat = classification_function()
print ("Classification error: "+ str(100 * (1 - np.mean(test_y_hat == test_y.get_value()))) + "%")
for i in range(10):
    plot_mnist_digit(
        test_x.get_value()[i]                                # test_x is a theano object of images
    ) 
    print ('Image class: \t\t' + str(test_y.get_value()[i])) # test_y is a theano object of *true labels*
    print ('Model-assigned class: \t' + str(test_y_hat[i]))  # test_y_hat is a theano object of *predicted labels*


Classification error: 15.12%
Image class: 		7
Model-assigned class: 	7
Image class: 		2
Model-assigned class: 	2
Image class: 		1
Model-assigned class: 	1
Image class: 		0
Model-assigned class: 	0
Image class: 		4
Model-assigned class: 	4
Image class: 		1
Model-assigned class: 	1
Image class: 		4
Model-assigned class: 	4
Image class: 		9
Model-assigned class: 	9
Image class: 		5
Model-assigned class: 	2
Image class: 		9
Model-assigned class: 	9

Training a neural network

So far we have trained a logistic regression model. The neural network model is so similar that we can imlepement it with just a few changes to the code.

We need

  • to declare the hidden layer variable
  • to decide on the size of the hidden layer (we'll keep it small so it will run on your personal computer)
  • new parameters

In [17]:
H = T.dmatrix()  # Theano double matrix
hidden_layer_size = 20
W_xh = theano.shared(0.01 * np.random.randn(dims, hidden_layer_size))
W_hy = theano.shared(np.zeros([hidden_layer_size, n_classes])) 
b_xh = theano.shared(np.zeros(hidden_layer_size))
b_hy = theano.shared(np.zeros(n_classes))

Remember our model? Let's write it using matrices and vector for an entire dataset:

$$\boldsymbol{\textrm{H}} = \tanh \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{xh}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}_\boldsymbol{\textrm{xh}}\right), \\ P(Y \ |\ \boldsymbol{\textrm{x}} ; \theta) = \textrm{SoftMax} \left( \boldsymbol{\textrm{H}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{hy}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}} _\boldsymbol{\textrm{hy}}\right)$$

Let's code it up:


In [18]:
H =         T.tanh( T.dot(X, W_xh) + b_xh )
P = T.nnet.softmax( T.dot(H, W_hy) + b_hy )

Now let's add the gradient updates:


In [19]:
LL = T.mean(T.log(P)[T.arange(P.shape[0]), y])  # the log-likelihood (LL)

updates = [
        [W_xh, W_xh + learning_rate * T.grad(LL, W_xh)], 
        [W_hy, W_hy + learning_rate * T.grad(LL, W_hy)], 
        [b_xh, b_xh + learning_rate * T.grad(LL, b_xh)],
        [b_hy, b_hy + learning_rate * T.grad(LL, b_hy)],
    ]

Redefining Log-loss

The one extremely important thing we did here was redefined LL.

This is a crucial point about how Theano works:

Whenever we define a theano variable, like we did with P, we create a new object. When we define a new theano variable in terms of another Theano variable, like we did with LL, using

`LL = T.mean(T.log(P)[T.arange(P.shape[0]), y])`
we implicitly create a new object for LL that has a reference to our variable P we just defined.

Now the crucial part: say we redefine P. Then our variable LL still has a reference to the old variable P, and we need to update the reference to LL by re-running the definition for LL for everything to work correctly.

Bugs in Theano are very commonly produced by exactly this. It is a good reason to always use Theano in scripts rather than in a notebook like we are here.

Phew, we are now ready to train!


In [20]:
training_function = theano.function(
    inputs = [], 
    outputs = LL,
    updates = updates,
    givens = {X: training_x[:5000],  # use only 10% of the data so model not too complicated
              y: training_y[:5000]}  # to train on a personal computer         
    )
for i in range(60):                  # train more than for logistic regression as this model is more complex
    current_LL = training_function()
    print(
        "Log-likelihood = " + str(current_LL) + "\t\t" +
        "Average probability of the correct class = " + str(np.exp(current_LL))
    )


Log-likelihood = -2.30258509299		Average probability of the correct class = 0.1
Log-likelihood = -2.301260143		Average probability of the correct class = 0.100132582813
Log-likelihood = -2.29985742336		Average probability of the correct class = 0.100273139311
Log-likelihood = -2.29814933548		Average probability of the correct class = 0.100444561005
Log-likelihood = -2.29585974449		Average probability of the correct class = 0.100674801444
Log-likelihood = -2.2926097774		Average probability of the correct class = 0.101002523491
Log-likelihood = -2.28785314718		Average probability of the correct class = 0.101484099578
Log-likelihood = -2.28079743549		Average probability of the correct class = 0.102202674172
Log-likelihood = -2.27031674023		Average probability of the correct class = 0.103279462141
Log-likelihood = -2.25487812942		Average probability of the correct class = 0.104886325515
Log-likelihood = -2.23253100489		Average probability of the correct class = 0.107256619276
Log-likelihood = -2.20104023375		Average probability of the correct class = 0.110687957106
Log-likelihood = -2.1582595724		Average probability of the correct class = 0.115526010828
Log-likelihood = -2.10278232534		Average probability of the correct class = 0.122116188179
Log-likelihood = -2.03470964692		Average probability of the correct class = 0.130718431493
Log-likelihood = -1.95612616578		Average probability of the correct class = 0.141405141361
Log-likelihood = -1.87084468946		Average probability of the correct class = 0.153993530151
Log-likelihood = -1.78339741578		Average probability of the correct class = 0.168066185511
Log-likelihood = -1.69781819817		Average probability of the correct class = 0.183082538427
Log-likelihood = -1.61686684319		Average probability of the correct class = 0.19851971911
Log-likelihood = -1.54193010867		Average probability of the correct class = 0.213967721666
Log-likelihood = -1.47336522593		Average probability of the correct class = 0.229153034446
Log-likelihood = -1.41092281327		Average probability of the correct class = 0.243918088414
Log-likelihood = -1.35405500976		Average probability of the correct class = 0.258191167337
Log-likelihood = -1.30209894369		Average probability of the correct class = 0.271960364056
Log-likelihood = -1.25438406129		Average probability of the correct class = 0.285251491561
Log-likelihood = -1.21029452586		Average probability of the correct class = 0.298109465553
Log-likelihood = -1.16929950344		Average probability of the correct class = 0.310584428406
Log-likelihood = -1.13096059734		Average probability of the correct class = 0.322723100526
Log-likelihood = -1.09492559978		Average probability of the correct class = 0.334564497696
Log-likelihood = -1.06091566234		Average probability of the correct class = 0.346138718988
Log-likelihood = -1.02871055443		Average probability of the correct class = 0.357467598533
Log-likelihood = -0.998135176742		Average probability of the correct class = 0.36856611137
Log-likelihood = -0.969049151498		Average probability of the correct class = 0.379443660064
Log-likelihood = -0.941339587098		Average probability of the correct class = 0.390104905683
Log-likelihood = -0.914915833679		Average probability of the correct class = 0.40055033752
Log-likelihood = -0.889704949206		Average probability of the correct class = 0.410776934935
Log-likelihood = -0.86564729478		Average probability of the correct class = 0.420779096345
Log-likelihood = -0.842692353352		Average probability of the correct class = 0.430549769435
Log-likelihood = -0.820795127401		Average probability of the correct class = 0.440081594419
Log-likelihood = -0.799913392634		Average probability of the correct class = 0.449367881001
Log-likelihood = -0.780005880711		Average probability of the correct class = 0.45840331556
Log-likelihood = -0.761031287056		Average probability of the correct class = 0.467184377286
Log-likelihood = -0.742947906704		Average probability of the correct class = 0.475709499272
Log-likelihood = -0.725713685377		Average probability of the correct class = 0.483979036908
Log-likelihood = -0.709286504208		Average probability of the correct class = 0.491995108699
Log-likelihood = -0.693624565104		Average probability of the correct class = 0.499761364693
Log-likelihood = -0.678686790458		Average probability of the correct class = 0.507282723661
Log-likelihood = -0.664433187553		Average probability of the correct class = 0.514565106971
Log-likelihood = -0.650825153157		Average probability of the correct class = 0.521615186716
Log-likelihood = -0.637825709613		Average probability of the correct class = 0.528440158193
Log-likelihood = -0.625399672875		Average probability of the correct class = 0.53504754179
Log-likelihood = -0.613513757875		Average probability of the correct class = 0.541445015994
Log-likelihood = -0.602136629008		Average probability of the correct class = 0.547640281054
Log-likelihood = -0.59123890443		Average probability of the correct class = 0.553640951444
Log-likelihood = -0.580793122897		Average probability of the correct class = 0.559454474405
Log-likelihood = -0.570773681465		Average probability of the correct class = 0.565088071363
Log-likelihood = -0.561156751609		Average probability of the correct class = 0.570548698858
Log-likelihood = -0.551920180443		Average probability of the correct class = 0.575843025592
Log-likelihood = -0.543043382773		Average probability of the correct class = 0.580977422405

Test the model


In [21]:
y_hat = T.argmax(P, axis=1)
test_y_hat = classification_function()

print ("Classification error: " + str(100 * (1 - np.mean(test_y_hat == test_y.get_value()))) + "%")

for i in range(10):
    plot_mnist_digit(
        test_x.get_value()[i]                                # test_y is a theano object of *images*
    ) 
    print ('Image class: \t\t' + str(test_y.get_value()[i])) # test_y is a theano object of *true labels*
    print ('Model-assigned class: \t' + str(test_y_hat[i]))  # test_y_hat is a theano object of *predicted labels*


Classification error: 15.12%
Image class: 		7
Model-assigned class: 	7
Image class: 		2
Model-assigned class: 	2
Image class: 		1
Model-assigned class: 	1
Image class: 		0
Model-assigned class: 	0
Image class: 		4
Model-assigned class: 	4
Image class: 		1
Model-assigned class: 	1
Image class: 		4
Model-assigned class: 	4
Image class: 		9
Model-assigned class: 	9
Image class: 		5
Model-assigned class: 	2
Image class: 		9
Model-assigned class: 	9

We've trained a neural network with Theano!

Along the way we used a good chunk of mathematics, a hand-full of tricks but required very few lines of code.

Hopefully this leads nicely to using Theano more, in particular we would be excited to see you set theano to work with:

  • smarter training algorithms,
  • parallelized training, and
  • training more popular network architectures

'Til next time!