In [1]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
matplotlib.pyplot.gray()

import numpy as np
import theano
import theano.tensor as T
from theano import shared, function
theano.config.floatX = 'float32'
rng = np.random.RandomState(42)

import sys
import time


Couldn't import dot_parser, loading of dot files will not be possible.
<matplotlib.figure.Figure at 0x10846fb50>

We will make heavy use of the resources in the Theano deep learning tutorial. We have it integrated into our git repository as a submodule. You can clone the git repository by doing the following steps: (be sure to include the --recursive or you won't get the Theano deep learning tutorial)

git clone --recursive https://github.com/graphific/DL-Meetup-intro.git

If you already cloned the repository, but the DeepLearningTutorial folder is empty, you need to checkout the submodule. Make sure you are in the folder DL-Meetup-intro and then execute the following command:

git submodule update --init --recursive

Now we have to add this directory to the PythonPath. Depending on the location of your git repository you might have to change this path.


In [2]:
sys.path.insert(1,'DeepLearningTutorials/code')
#sys.path

MNIST: Handwritten Digit Recognition

MNIST consists of 70 000 small image patches, each showing a handwritten digit character in white on a black background. There are 10 different classes (the digits from 0-9).

Let's load the data and have a look.


In [3]:
from logistic_sgd import load_data
dataset='mnist.pkl.gz'

## If not already cached this function actually downloads the data
datasets = load_data(dataset)

train_set_x, train_set_y = datasets[0]
valid_set_x, valid_set_y = datasets[1]
test_set_x, test_set_y = datasets[2]


... loading data

In [4]:
print "Size of the training data matrix: ", train_set_x.get_value().shape


Size of the training data matrix:  (50000, 784)

The data is stored with one training sample per row. The original image patches are $28 \times 28$ pixels, hence we have $28 \cdot 28 = 784$ feature columns in our data matrix. We can reshape each row of our data matrix back to the original $28 \times 28$ image patch to visualize what the data looks like.


In [5]:
from utils import tile_raster_images

samples = tile_raster_images(train_set_x.get_value(), img_shape=(28,28), tile_shape=(5,10), tile_spacing=(0, 0),
                       scale_rows_to_unit_interval=True,
                       output_pixel_vals=True)

plt.imshow(samples)
plt.show()


Convolutional Network

We'll create a convolutional network to classify digits. We use code from the Deep Network Tutorials to make this a little easier, particularly the Convolutional Layer class LeNetConvPoolLayer, defined in convolutional_mlp.py.

We'll build up a network using the following layers:

  • input (1 channel)
  • 10 1x9x9 filters — (first dimension is 1 because the input image is grayscale)
  • 10 10x5x5 filters — (first dimenions is 10 because the previous layer had 10 filters)
  • a fully connected sigmoidal layer with 20 outputs
  • a fully connected Logistic Regression layer with 10 outputs

We'll train with batches, with a batch size of 10. (feel free to change it to a higher number, but for demo purposes 100 will be ok enough and faster)

Setup for the Convolutional Network


In [6]:
# Setup 1: parameters of the network and training
from convolutional_mlp import LeNetConvPoolLayer
from mlp import HiddenLayer
from logistic_sgd import LogisticRegression

# network parameters
num_kernels = [10, 10]
kernel_sizes = [(9, 9), (5, 5)]
sigmoidal_output_size = 20

# training parameters
learning_rate = 0.1
batch_size = 100

For efficiency, we break the data into batches. Our network will take batch_size images at a time as input.


In [7]:
# Setup 2: compute batch sizes for train/test/validation

# borrow=True gets us the value of the variable without making a copy.
n_train_batches = train_set_x.get_value(borrow=True).shape[0]
n_test_batches = test_set_x.get_value(borrow=True).shape[0]
n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
n_train_batches /= batch_size
n_test_batches /= batch_size
n_valid_batches /= batch_size

fast hack to limit datasize, feel free to skip this step:


In [8]:
#fast hack to limit datasize, feel free to skip this step:
n_train_batches /= 10
n_test_batches /= 10
n_valid_batches /= 10
n_train_batches /= 10
n_test_batches /= 10
n_valid_batches /= 10

/end hack


In [9]:
# Setup 3.

# Declare inputs to network - x and y are placeholders
# that will be used in the training/testing/validation functions below.
x = T.matrix('x')  # input image data
y = T.ivector('y') # input label data

Layer 0 - First convolutional Layer

The first layer takes (batch_size, 1, 28, 28) as input, convolves it with 10 different 9x9 filters, and then downsamples (via maxpooling) in a 2x2 region. Each filter/maxpool combination produces an output of size (28-9+1)/2 = 10 on a side.

The size of the first layer's output is therefore (batch_size, 10, 10, 10).


In [10]:
layer0_input_size = (batch_size, 1, 28, 28)  # fixed size from the data
edge0 = (28 - kernel_sizes[0][0] + 1) / 2
layer0_output_size = (batch_size, num_kernels[0], edge0, edge0)
# check that we have an even multiple of 2 before pooling
assert ((28 - kernel_sizes[0][0] + 1) % 2) == 0

# The actual input is the placeholder x reshaped to the input size of the network
layer0_input = x.reshape(layer0_input_size)

layer0 = LeNetConvPoolLayer(rng,
                            input=layer0_input,
                            image_shape=layer0_input_size,
                            filter_shape=(num_kernels[0], 1) + kernel_sizes[0],
                            poolsize=(2, 2))

Layer 1 - Second convolutional Layer

The second layer takes (batch_size, 10, 10, 10) as input, convolves it with 10 different 10x5x5 filters, and then downsamples (via maxpooling) in a 2x2 region. Each filter/maxpool combination produces an output of size (10-5+1)/2 = 3 on a side.

The size of the second layer's output is therefore (batch_size, 10, 3, 3).


In [11]:
layer1_input_size = layer0_output_size
edge1 = (edge0 - kernel_sizes[1][0] + 1) / 2
layer1_output_size = (batch_size, num_kernels[1], edge1, edge1)
# check that we have an even multiple of 2 before pooling
assert ((edge0 - kernel_sizes[1][0] + 1) % 2) == 0

layer1 = LeNetConvPoolLayer(rng,
                            input=layer0.output,
                            image_shape=layer1_input_size,
                            filter_shape=(num_kernels[1], num_kernels[0]) + kernel_sizes[1],
                            poolsize=(2, 2))

Layer 2 - Fully connected sigmoidal layer

The sigmoidal layer takes a vector as input.

We flatten all but the first two dimensions, to get an input of size (batch_size, 30 * 4 * 4).


In [12]:
layer2_input = layer1.output.flatten(2)

layer2 = HiddenLayer(rng,
                     input=layer2_input,
                     n_in=num_kernels[1] * edge1 * edge1,
                     n_out=sigmoidal_output_size,
                     activation=T.tanh)

Layer 3 - Logistic regression output layer

A fully connected logistic regression layer converts the sigmoid's layer output to a class label.


In [13]:
layer3 = LogisticRegression(input=layer2.output,
                            n_in=sigmoidal_output_size,
                            n_out=10)

Training the network

To train the network, we have to define a cost function. We'll use the Negative Log Likelihood of the model, relative to the true labels y.


In [14]:
# The cost we minimize during training is the NLL of the model.
# Recall: y is a placeholder we defined above
cost = layer3.negative_log_likelihood(y)

Gradient descent

We will train with Stochastic Gradient Descent. To do so, we need the gradient of the cost relative to the parameters of the model. We can get the parameters for each label via the .params attribute.


In [15]:
# create a list of all model parameters to be fit by gradient descent
params = layer3.params + layer2.params + layer1.params + layer0.params

# create a list of gradients for all model parameters
grads = T.grad(cost, params)

Training function

The actual training function takes a batch of inputs (images x and labels y), and updates the model parameters with a small step in the steepest direction that reduces the gradient.

This is the (param_i, param_i - learning_rate * grad_i) line in the code below.

We use the updates keyword to theano.function() to update these variables in-place. See the documentation for function().

We also make use of the givens keyword, to alias x and y to batches (aka, slices) of train_set_x and train_set_y.


In [16]:
# train_model is a function that updates the model parameters by SGD, and returns the current cost
#
# We create the updates list by automatically looping over all
# (params[i], grads[i]) pairs.
updates = [
    (param_i, param_i - learning_rate * grad_i)  # <=== SGD update step
    for param_i, grad_i in zip(params, grads)
]

index = T.lscalar()  # index to a batch of training/validation/testing examples

train_model = theano.function(
    [index],
    cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],  # <=== batching
        y: train_set_y[index * batch_size: (index + 1) * batch_size]   # <=== batching
    }
)

Validation function

To track progress on a held-out set, we count the number of misclassified examples in the validation set.


In [17]:
validate_model = theano.function(
        [index],
        layer3.errors(y),
        givens={
            x: valid_set_x[index * batch_size: (index + 1) * batch_size],
            y: valid_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

Test function

After training, we check the number of misclassified examples in the test set.


In [18]:
test_model = theano.function(
    [index],
    layer3.errors(y),
    givens={
        x: test_set_x[index * batch_size: (index + 1) * batch_size],
        y: test_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

Training loop

We use SGD for a fixed number of iterations over the full training set (an "epoch"). Usually, we'd use a more complicated rule, such as iterating until a certain number of epochs fail to produce improvement in the validation set.
Were doing CPU so will take a while here, lets do only 2 epochs, feel free to try more yourself, while I'm talking :)


In [19]:
# if you've limited the datasize above then you can easily do 200 epochs, 
# but in any case accuracy will be bad as NN only works with enough data:
# keeping the dataset intact with batch_size of 100, and 20 training epochs,
# validation error around x and test should be at least near x
for epoch in xrange(1,40):
    costs = [train_model(i) for i in xrange(n_train_batches)]
    validation_losses = [validate_model(i) for i in xrange(n_valid_batches)]
    if (epoch+1) %4 == 0:
        print "Epoch {}    NLL {:.2}    %err in validation set {:.1%}".format(epoch + 1, np.mean(costs), np.mean(validation_losses))


Epoch 4    NLL 2.3    %err in validation set 70.0%
Epoch 8    NLL 2.0    %err in validation set 67.0%
Epoch 12    NLL 1.7    %err in validation set 58.0%
Epoch 16    NLL 1.3    %err in validation set 46.0%
Epoch 20    NLL 1.1    %err in validation set 37.0%
Epoch 24    NLL 0.86    %err in validation set 38.0%
Epoch 28    NLL 0.7    %err in validation set 31.0%
Epoch 32    NLL 0.59    %err in validation set 28.0%
Epoch 36    NLL 0.5    %err in validation set 23.0%
Epoch 40    NLL 0.43    %err in validation set 23.0%

Learned features

We can visualize the filters learned by the first layer. In most vision tasks, these will be simple features, such as lines, points, or edges (such as in Figure 2 of this paper).

The MNIST data doesn't contain as much variation as natural images, so we don't see as obvious edge features, but we still see some stroke-like features appearing in the lowest level. With some standard tricks (adding noise to the training set, shifting examples around, using Dropout), we'd probably see smoother features (as in the paper linked above).


In [20]:
#first layer, layer 0
filters = tile_raster_images(layer0.W.get_value(borrow=True), img_shape=(9, 9), tile_shape=(2,5), tile_spacing=(3, 3),
                       scale_rows_to_unit_interval=True,
                       output_pixel_vals=True)

plt.imshow(filters)
plt.show()



In [21]:
# layer 1
filters = tile_raster_images(layer1.W.get_value(borrow=True), img_shape=(50, 5), tile_shape=(1,10), tile_spacing=(3, 3),
                       scale_rows_to_unit_interval=False,
                       output_pixel_vals=False)

plt.imshow(filters)
plt.show()



In [22]:
#sigmoid layer
filters = tile_raster_images(layer2.W.get_value(borrow=True), img_shape=(1, 20), tile_shape=(10,10), tile_spacing=(3, 3),
                       scale_rows_to_unit_interval=False,
                       output_pixel_vals=False)

plt.imshow(filters)
plt.show()



In [23]:
#final layer: fully connected logistic regression layer converts the sigmoid's layer output to a class label.
filters = tile_raster_images(layer3.W.get_value(borrow=True), img_shape=(1, 10), tile_shape=(4,5), tile_spacing=(3, 3),
                       scale_rows_to_unit_interval=False,
                       output_pixel_vals=False)

plt.imshow(filters)
plt.show()


Check performance on the test set


In [24]:
test_errors = [test_model(i) for i in range(n_test_batches)]
print "test errors: {:.1%}".format(np.mean(test_errors))


test errors: 17.0%

In [ ]: