Convolutional Neural Network (LeNet)

An update of the CNN example given on deeplearning.net. This notebook tries to explain all the code as if the reader had no knowledge of Theano whatsoever.

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:

  • tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
  • transparent use of a GPU – Perform data-intensive calculations up to 140x faster than on a CPU (float32 only).
  • efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
  • speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
  • dynamic C code generation – Evaluate expressions faster.
  • extensive unit-testing and self-verification – Detect and diagnose many types of mistakes.

Theano has been powering large-scale computationally intensive scientific investigations since 2007.

Outline of this document:

A. The tools to implement CNNs

  1. The Convolution Operator
  2. Testing ConvOp on an image
  3. MaxPooling
  4. Convolution + MaxPooling layer

B. Full LeNet model

  1. HiddenLayer class
  2. LogisticRegression class
  3. Loading dataset
  4. Putting it all together

C. Implementation of Learning Rate Decay

D. Implementation of dropout

  1. Creating dropout function
  2. Creating dropout classes
  3. Rewriting evaluate_lenet5

E. Visualization of the convolutional filters

  1. Visualization function
  2. Testing the function on a single untrained LeNetConvPoolLayer
  3. Displaying the learned filters after training

F. Automated creation of a CNN + MLP

A. The tools to implement CNNs

1. The Convolution Operator

ConvOp is the main workhorse for implementing a convolutional layer in Theano. ConvOp is used by theano.tensor.nnet.conv2d, which takes two symbolic inputs:

  • a 4D tensor corresponding to a mini-batch of input images. The shape of the tensor is as follows: [mini-batch size, number of input feature maps, image height, image width].
  • a 4D tensor corresponding to the weight matrix W. The shape of the tensor is: [number of feature maps at layer m, number of feature maps at layer m-1, filter height, filter width]

Below is the Theano code for implementing a convolutional layer similar to the one in Figure 1. The input consists of 3 feature maps (an RGB color image). We use two convolutional filters with 9x9 receptive fields.


In [29]:
import cPickle
import gzip
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T
from theano.tensor.signal import downsample
from theano.tensor.nnet import conv

In [9]:
rng = numpy.random.RandomState(23455)

In [10]:
# instantiate 4D tensor for input
input = T.tensor4(name='input')

In [11]:
w_shp = (2, 3, 9, 9)

The shape of the 4D tensor corresponding to the weight matrix W is:

  • number of feature maps at layer 2: as we chose to have only 2 convolutional filters, we will have 2 resulting feature maps.
  • number of feature maps at layer 1: the original image being RGB, it has 3 layers on top of each other, so 3 feature maps.
  • filter height: the convolutional filters have 9x9 receptive fields, so height = 9 pixels
  • filter width: similarly, width = 9 pixels

In [12]:
w_bound = numpy.sqrt(3 * 9 * 9)

Note that we use the same weight initialization formula as with the MLP. Weights are sampled randomly from a uniform distribution in the range [-1/sqrt(fan-in), 1/sqrt(fan-in)], where fan-in is the number of inputs to a hidden unit (hence w_bound = sqrt(3 * 9 * 9) above). For MLPs, this was the number of units in the layer below. For CNNs however, we have to take into account the number of input feature maps and the size of the receptive fields.


In [13]:
W = theano.shared( numpy.asarray(
            rng.uniform(
                low=-1.0 / w_bound,
                high=1.0 / w_bound,
                size=w_shp),
            dtype=input.dtype), name ='W')
  • RandomState.uniform(low=0.0, high=1.0, size=None): draw samples from a uniform distribution: samples are uniformly distributed over the half-open interval [low, high) (includes low, but excludes high). In other words, any value within the given interval is equally likely to be drawn by uniform. source

  • theano.shared: the main benefits of using shared constructors are you can use them to initialise important variables with predefined numerical values (weight matrices in a neural network, for example). source

The distinction between Theano-managed memory and user-managed memory can be broken down by some Theano functions (e.g. shared, get_value and the constructors for In and Out) by using a borrow=True flag. This can make those methods faster (by avoiding copy operations) at the expense of risking subtle bugs in the overall program (by aliasing memory).

Take home message:

It is a safe practice (and a good idea) to use borrow=True in a shared variable constructor when the shared variable stands for a large object (in terms of memory footprint) and you do not want to create copies of it in memory.

It is not a reliable technique to use borrow=True to modify shared variables through side-effect, because with some devices (e.g. GPU devices) this technique will not work.
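
As a quick illustration of both points, here is a minimal standalone sketch (the variable names are chosen for this example only):


In [ ]:
big_array = numpy.zeros((1000, 1000), dtype=theano.config.floatX)

# borrow=True may alias big_array instead of copying it: faster for large
# objects, but mutating big_array afterwards can silently change the shared value
s = theano.shared(big_array, borrow=True)

v_copy = s.get_value()             # a copy: always safe to modify
v_view = s.get_value(borrow=True)  # possibly a view: modifying it through
                                   # side-effect is unreliable (e.g. on GPU)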


In [14]:
# initialize shared variable for bias (1D tensor) with random values
# IMPORTANT: biases are usually initialized to zero. However in this
# particular application, we simply apply the convolutional layer to
# an image without learning the parameters. We therefore initialize
# them to random values to "simulate" learning.
b_shp = (2,)
b = theano.shared(numpy.asarray(
            rng.uniform(low=-.5, high=.5, size=b_shp),
            dtype=input.dtype), name ='b')

We chose to have only 2 filters, so 2 bias terms need to be initialized.


In [15]:
# build symbolic expression that computes the convolution of input with filters in w
conv_out = conv.conv2d(input, W)

nnet.conv2d: This is the standard operator for convolutional neural networks working with batches of multi-channel 2D images, available for CPU and GPU. source


In [16]:
# build symbolic expression to add bias and apply activation function, i.e. produce neural net layer output
output = T.nnet.sigmoid(conv_out + b.dimshuffle('x', 0, 'x', 'x'))

tensor.nnet.sigmoid(x): returns the standard sigmoid nonlinearity applied to x.

  • Parameters: x - symbolic Tensor (or compatible)
  • Return type: same as x
  • Returns: element-wise sigmoid: $$sigmoid(x) = \frac{1}{1 + \exp(-x)}$$.

Note: in numpy and in Theano, the transpose of a vector is exactly the same vector! Use reshape or dimshuffle to turn your vector into a row or column matrix. source
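
To make the dimshuffle call above concrete, here is a small standalone check (the names bv, row, col and bias4 are just for this illustration):


In [ ]:
bv = T.vector('bv')
row = bv.dimshuffle('x', 0)               # row matrix, shape (1, n)
col = bv.dimshuffle(0, 'x')               # column matrix, shape (n, 1)
bias4 = bv.dimshuffle('x', 0, 'x', 'x')   # shape (1, n, 1, 1): broadcastable
                                          # against (batch, n, height, width)
shp = theano.function([bv], bias4.shape)
print shp(numpy.arange(2, dtype=theano.config.floatX))   # prints [1 2 1 1]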


In [17]:
# create theano function to compute filtered images
f = theano.function([input], output)

2. Testing ConvOp on an image


In [18]:
import pylab
from PIL import Image

In [19]:
# open random image of dimensions 1936×2592
img = Image.open(open('images/profilepic4.jpg'))

In [20]:
img = numpy.asarray(img, dtype='float64') / 256. # divide by 256 to have RGB 0-1 scale and not 0 - 256

In [23]:
#put image in 4D tensor of shape (1, 3, height, width)
img_ = img.transpose(2, 0, 1).reshape(1, 3, 2592, 1936)
filtered_img = f(img_)

# plot original image and first and second components of output
pylab.subplot(1, 3, 1); pylab.axis('off'); pylab.imshow(img)
pylab.gray();
# recall that the convOp output (filtered image) is actually a "minibatch",
# of size 1 here, so we take index 0 in the first dimension:
pylab.subplot(1, 3, 2); pylab.axis('off'); pylab.imshow(filtered_img[0, 0, :, :])
pylab.subplot(1, 3, 3); pylab.axis('off'); pylab.imshow(filtered_img[0, 1, :, :])
pylab.show()

3. MaxPooling

Another important concept of CNNs is max-pooling, which is a form of non-linear down-sampling. Max-pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value.

Max-pooling is useful in vision for two reasons:

  1. By eliminating non-maximal values, it reduces computation for upper layers.

  2. It provides a form of translation invariance. Imagine cascading a max-pooling layer with a convolutional layer. There are 8 directions in which one can translate the input image by a single pixel. If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this jumps to 5/8.

    Since it provides additional robustness to position, max-pooling is a “smart” way of reducing the dimensionality of intermediate representations.

Max-pooling is done in Theano by way of theano.tensor.signal.downsample.max_pool_2d. This function takes as input an N dimensional tensor (where N >= 2) and a downscaling factor and performs max-pooling over the 2 trailing dimensions of the tensor.


In [42]:
from theano.tensor.signal import downsample

input = T.dtensor4('input')
maxpool_shape = (2, 2)
pool_out = downsample.max_pool_2d(input, maxpool_shape, ignore_border=True)
g = theano.function([input],pool_out)

In [43]:
invals = numpy.random.RandomState(1).rand(3, 2, 5, 5)
print 'With ignore_border set to True:'
print 'invals[0, 0, :, :] =\n', invals[0, 0, :, :]
print 'output[0, 0, :, :] =\n', g(invals)[0, 0, :, :]


With ignore_border set to True:
invals[0, 0, :, :] =
[[  4.17022005e-01   7.20324493e-01   1.14374817e-04   3.02332573e-01
    1.46755891e-01]
 [  9.23385948e-02   1.86260211e-01   3.45560727e-01   3.96767474e-01
    5.38816734e-01]
 [  4.19194514e-01   6.85219500e-01   2.04452250e-01   8.78117436e-01
    2.73875932e-02]
 [  6.70467510e-01   4.17304802e-01   5.58689828e-01   1.40386939e-01
    1.98101489e-01]
 [  8.00744569e-01   9.68261576e-01   3.13424178e-01   6.92322616e-01
    8.76389152e-01]]
output[0, 0, :, :] =
[[ 0.72032449  0.39676747]
 [ 0.6852195   0.87811744]]

In [44]:
pool_out = downsample.max_pool_2d(input, maxpool_shape, ignore_border=False)
g = theano.function([input],pool_out)
print 'With ignore_border set to False:'
print 'invals[0, 0, :, :] =\n', invals[0, 0, :, :]
print 'output[0, 0, :, :] =\n', g(invals)[0, 0, :, :]


With ignore_border set to False:
invals[0, 0, :, :] =
[[  4.17022005e-01   7.20324493e-01   1.14374817e-04   3.02332573e-01
    1.46755891e-01]
 [  9.23385948e-02   1.86260211e-01   3.45560727e-01   3.96767474e-01
    5.38816734e-01]
 [  4.19194514e-01   6.85219500e-01   2.04452250e-01   8.78117436e-01
    2.73875932e-02]
 [  6.70467510e-01   4.17304802e-01   5.58689828e-01   1.40386939e-01
    1.98101489e-01]
 [  8.00744569e-01   9.68261576e-01   3.13424178e-01   6.92322616e-01
    8.76389152e-01]]
output[0, 0, :, :] =
[[ 0.72032449  0.39676747  0.53881673]
 [ 0.6852195   0.87811744  0.19810149]
 [ 0.96826158  0.69232262  0.87638915]]

theano.tensor.signal.downsample.max_pool_2d(input, ds, ignore_border=None, st=None, padding=(0, 0), mode='max'): takes as input an N-D tensor, where N >= 2. It downscales the input image by the specified factor, by keeping only the maximum value of non-overlapping patches of size (ds[0], ds[1]).

Parameters:

  • input (N-D theano tensor of input images) – Input images. Max pooling will be done over the 2 last dimensions.
  • ds (tuple of length 2) – Factor by which to downscale (vertical ds, horizontal ds). (2,2) will halve the image in each dimension.
  • ignore_border (bool (default None, will print a warning and set to False)) – When True, (5,5) input with ds=(2,2) will generate a (2,2) output. (3,3) otherwise.
  • st (tuple of length 2) – Stride size, which is the number of shifts over rows/cols to get the next pool region. If st is None, it is considered equal to ds (no overlap on pooling regions).
  • padding (tuple of two ints) – (pad_h, pad_w), pad zeros to extend beyond four borders of the images, pad_h is the size of the top and bottom margins, and pad_w is the size of the left and right margins.
  • mode ({‘max’, ‘sum’, ‘average_inc_pad’, ‘average_exc_pad’}) – Operation executed on each window. max and sum always exclude the padding in the computation. average gives you the choice to include or exclude it.

source
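
For instance, the st argument gives overlapping pooling. A short check reusing the input and invals defined above (an illustrative cell, not part of the original tutorial):


In [ ]:
# 2x2 windows moved with stride 1 instead of 2, so pooling regions overlap
pool_out = downsample.max_pool_2d(input, (2, 2), ignore_border=True, st=(1, 1))
h = theano.function([input], pool_out)
print h(invals)[0, 0, :, :].shape   # (4, 4) instead of (2, 2) for a 5x5 input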

4. Convolution + MaxPooling layer

We now have all we need to implement a LeNet model in Theano. We start with the LeNetConvPoolLayer class, which implements a {convolution + max-pooling} layer.


In [30]:
class LeNetConvPoolLayer(object):
    """Pool Layer of a convolutional network """

    def __init__(self, rng, input, filter_shape, image_shape, poolsize=(2, 2),
                 W=None, b=None):
        """
        Allocate a LeNetConvPoolLayer with shared variable internal parameters.

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dtensor4
        :param input: symbolic image tensor, of shape image_shape

        :type filter_shape: tuple or list of length 4
        :param filter_shape: (number of filters, num input feature maps,
                              filter height, filter width)

        :type image_shape: tuple or list of length 4
        :param image_shape: (batch size, num input feature maps,
                             image height, image width)

        :type poolsize: tuple or list of length 2
        :param poolsize: the downsampling (pooling) factor (#rows, #cols)

        :type W, b: theano shared variables or symbolic expressions (optional)
        :param W, b: if given, reuse existing parameters instead of allocating
                     new ones (needed by the dropout layers of section D)
        """

        assert image_shape[1] == filter_shape[1]
        
        # assert just checks if the number of feature maps is consistent between filter shape and image_shape
        
        self.input = input

        # there are "num input feature maps * filter height * filter width"
        # inputs to each hidden unit
        # reminder: Weights are sampled randomly from a uniform distribution 
        # in the range [-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a hidden unit
        fan_in = numpy.prod(filter_shape[1:])
        # each unit in the lower layer receives a gradient from:
        # "num output feature maps * filter height * filter width" /
        #   pooling size
        fan_out = (filter_shape[0] * numpy.prod(filter_shape[2:]) /
                   numpy.prod(poolsize))
        # initialize weights with random values, unless an existing W is
        # passed in (used in section D, where two layers share weights)
        W_bound = numpy.sqrt(6. / (fan_in + fan_out))
        if W is None:
            W = theano.shared(
                numpy.asarray(
                    rng.uniform(low=-W_bound, high=W_bound, size=filter_shape),
                    dtype=theano.config.floatX
                ),
                borrow=True # see above the def of theano.shared for explanation of borrow
            )
        self.W = W

        # the bias is a 1D tensor -- one bias per output feature map
        if b is None:
            b_values = numpy.zeros((filter_shape[0],), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, borrow=True)
        self.b = b

        # convolve input feature maps with filters
        conv_out = conv.conv2d(
            input=input,
            filters=self.W,
            filter_shape=filter_shape,
            image_shape=image_shape
        )

        # downsample each feature map individually, using maxpooling
        pooled_out = downsample.max_pool_2d(
            input=conv_out,
            ds=poolsize,
            ignore_border=True
        )

        # add the bias term. Since the bias is a vector (1D array), we first
        # reshape it to a tensor of shape (1, n_filters, 1, 1). Each bias will
        # thus be broadcasted across mini-batches and feature map
        # width & height
        self.output = T.tanh(pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))

        # store parameters of this layer
        self.params = [self.W, self.b]

        # keep track of model input
        self.input = input

Notice that when initializing the weight values, the fan-in is determined by the size of the receptive fields and the number of input feature maps.
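
For example, with the first layer instantiated later in this notebook (filter_shape=(20, 1, 5, 5) and poolsize=(2, 2)), the initialization bound works out as follows:


In [ ]:
filter_shape = (20, 1, 5, 5)
poolsize = (2, 2)
fan_in = numpy.prod(filter_shape[1:])          # 1 * 5 * 5 = 25 inputs per unit
fan_out = (filter_shape[0] * numpy.prod(filter_shape[2:]) /
           numpy.prod(poolsize))               # 20 * 25 / 4 = 125
W_bound = numpy.sqrt(6. / (fan_in + fan_out))  # sqrt(6 / 150) = 0.2
print fan_in, fan_out, W_bound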

B. Full LeNet model

Sparse, convolutional layers and max-pooling are at the heart of the LeNet family of models. While the exact details of the model will vary greatly, the figure below shows a graphical depiction of a LeNet model.

The lower layers are composed of alternating convolution and max-pooling layers. The upper layers, however, are fully connected and correspond to a traditional MLP (hidden layer + logistic regression). The input to the first fully-connected layer is the set of all feature maps at the layer below.

From an implementation point of view, this means lower-layers operate on 4D tensors. These are then flattened to a 2D matrix of rasterized feature maps, to be compatible with our previous MLP implementation.
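
For instance, flatten(2) keeps the leading (batch) dimension and collapses all the others into one. A quick standalone check, using the shapes that appear later in the model:


In [ ]:
t = T.tensor4('t')
m = t.flatten(2)   # (batch, maps, h, w) -> (batch, maps * h * w)
g_flat = theano.function([t], m.shape)
print g_flat(numpy.zeros((500, 50, 4, 4), dtype=theano.config.floatX))  # [500 800]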

Using the LogisticRegression class defined in Classifying MNIST digits using Logistic Regression and the HiddenLayer class defined in Multilayer Perceptron, we can instantiate the network as follows.

1. HiddenLayer class

The original code for this class can be found here: source


In [3]:
class HiddenLayer(object):
    def __init__(self, rng, input, n_in, n_out, W=None, b=None,
                 activation=T.tanh):
        """
        Typical hidden layer of a MLP: units are fully-connected and have
        sigmoidal activation function. Weight matrix W is of shape (n_in,n_out)
        and the bias vector b is of shape (n_out,).

        NOTE : The nonlinearity used here is tanh

        Hidden unit activation is given by: tanh(dot(input,W) + b)

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dmatrix
        :param input: a symbolic tensor of shape (n_examples, n_in)

        :type n_in: int
        :param n_in: dimensionality of input

        :type n_out: int
        :param n_out: number of hidden units

        :type activation: theano.Op or function
        :param activation: Non linearity to be applied in the hidden
                           layer
        """
        self.input = input

        # `W` is initialized with `W_values` which is uniformly sampled
        # from -sqrt(6./(n_in+n_out)) to sqrt(6./(n_in+n_out))
        # for the tanh activation function
        # the output of uniform is converted using asarray to dtype
        # theano.config.floatX so that the code is runnable on GPU
        # Note : optimal initialization of weights is dependent on the
        #        activation function used (among other things).
        #        For example, results presented in [Xavier10] suggest that you
        #        should use 4 times larger initial weights for sigmoid
        #        compared to tanh
        #        We have no info for other functions, so we use the same as
        #        tanh.
        if W is None:
            W_values = numpy.asarray(
                rng.uniform(
                    low=-numpy.sqrt(6. / (n_in + n_out)),
                    high=numpy.sqrt(6. / (n_in + n_out)),
                    size=(n_in, n_out)
                ),
                dtype=theano.config.floatX
            )
            if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4

            W = theano.shared(value=W_values, name='W', borrow=True)

        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)

        self.W = W
        self.b = b

        lin_output = T.dot(input, self.W) + self.b
        self.output = (
            lin_output if activation is None
            else activation(lin_output)
        )
        # parameters of the model
        self.params = [self.W, self.b]

The class uses tanh as the default activation function. This choice is supported by the results presented in the paper Performance Analysis of Various Activation Functions in Generalized MLP Architectures of Neural Networks by Ahmet V. Olgac and Bekir Karlik:

In this study, we have used five conventional differentiable and monotonic activation functions for the evolution of MLP architecture along with Generalized Delta rule learning. These proposed well-known and effective activation functions are Bi-polar sigmoid, Uni-polar sigmoid, Tanh, Conic Section, and Radial Bases Function (RBF). Having compared their performances, simulation results show that Tanh (hyperbolic tangent) function performs better recognition accuracy than those of the other functions. In other words, the neural network computed good results when “Tanh-Tanh” combination of activation functions was used for both neurons (or nodes) of hidden and output layers.

The paper referred to as [Xavier10] is Understanding the difficulty of training deep feedforward neural networks by Xavier Glorot and Yoshua Bengio (AISTATS 2010).

2. LogisticRegression class

The original code for this class can be found here: source


In [4]:
class LogisticRegression(object):
    """Multi-class Logistic Regression Class

    The logistic regression is fully described by a weight matrix :math:`W`
    and bias vector :math:`b`. Classification is done by projecting data
    points onto a set of hyperplanes, the distance to which is used to
    determine a class membership probability.
    """

    def __init__(self, input, n_in, n_out):
        """ Initialize the parameters of the logistic regression

        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
                      architecture (one minibatch)

        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
                     which the datapoints lie

        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
                      which the labels lie

        """
        # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
        self.W = theano.shared(
            value=numpy.zeros(
                (n_in, n_out),
                dtype=theano.config.floatX
            ),
            name='W',
            borrow=True
        )
        # initialize the biases b as a vector of n_out 0s
        self.b = theano.shared(
            value=numpy.zeros(
                (n_out,),
                dtype=theano.config.floatX
            ),
            name='b',
            borrow=True
        )

        # symbolic expression for computing the matrix of class-membership
        # probabilities
        # Where:
        # W is a matrix where column-k represent the separation hyperplane for
        # class-k
        # x is a matrix where row-j  represents input training sample-j
        # b is a vector where element-k represent the free parameter of
        # hyperplane-k
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

        # symbolic description of how to compute prediction as class whose
        # probability is maximal
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)

        # parameters of the model
        self.params = [self.W, self.b]

        # keep track of model input
        self.input = input

    def negative_log_likelihood(self, y):
        """Return the mean of the negative log-likelihood of the prediction
        of this model under a given target distribution.

        .. math::

            \ell(\theta=\{W,b\}, \mathcal{D}) =
                -\frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|}
                \log P(Y=y^{(i)} | x^{(i)}, W, b)

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label

        Note: we use the mean instead of the sum so that
              the learning rate is less dependent on the batch size
        """
        # y.shape[0] is (symbolically) the number of rows in y, i.e.,
        # number of examples (call it n) in the minibatch
        # T.arange(y.shape[0]) is a symbolic vector which will contain
        # [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of
        # Log-Probabilities (call it LP) with one row per example and
        # one column per class LP[T.arange(y.shape[0]),y] is a vector
        # v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ...,
        # LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is
        # the mean (across minibatch examples) of the elements in v,
        # i.e., the mean log-likelihood across the minibatch.
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])

    def errors(self, y):
        """Return a float representing the number of errors in the minibatch
        over the total number of examples of the minibatch ; zero one
        loss over the size of the minibatch

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label
        """

        # check if y has same dimension of y_pred
        if y.ndim != self.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type)
            )
        # check if y is of the correct datatype
        if y.dtype.startswith('int'):
            # the T.neq operator returns a vector of 0s and 1s, where 1
            # represents a mistake in prediction
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()

.negative_log_likelihood(y): this method returns the mean of the negative log-likelihood of the prediction of this model under a given target distribution: $$\ell(\theta=\{W,b\}, \mathcal{D}) = -\frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)}|x^{(i)}, W, b)$$

type y: theano.tensor.TensorType

param y: corresponds to a vector that gives for each example the correct label

Note: we use the mean instead of the sum so that the learning rate is less dependent on the batch size.
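
The fancy indexing in the return line can be checked with a small pure-NumPy example (illustrative values):


In [ ]:
# log-probabilities for 2 examples over 3 classes
LP = numpy.log(numpy.array([[0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1]]))
y_check = numpy.array([0, 1])   # correct labels of the two examples
# picks LP[0, 0] and LP[1, 1], the log-probabilities of the correct classes
print -numpy.mean(LP[numpy.arange(2), y_check])   # ~0.29, the mean NLL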

3. Loading dataset

Original code can be found here. This piece of code loads the dataset and partitions it into a train set, a validation set and a test set. A minimal sketch of the function is shown below.
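
For completeness, here is a minimal sketch of what load_data does, based on the deeplearning.net tutorial and reusing the imports from the first cell (the exact code, including dataset downloading, is in the linked source):


In [ ]:
def load_data(dataset):
    # the pickled file holds a (train, valid, test) tuple of (x, y) pairs
    f = gzip.open(dataset, 'rb')
    train_set, valid_set, test_set = cPickle.load(f)
    f.close()

    def shared_dataset(data_xy):
        # store the data in shared variables so that minibatches can be
        # sliced without repeated CPU-GPU transfers
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX),
                                 borrow=True)
        shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX),
                                 borrow=True)
        # the labels are used as indices, so cast the float storage to int
        return shared_x, T.cast(shared_y, 'int32')

    return [shared_dataset(train_set), shared_dataset(valid_set),
            shared_dataset(test_set)]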

4. Putting it all together


In [6]:
def evaluate_lenet5(learning_rate=0.1, n_epochs=200,
                    dataset='mnist.pkl.gz',
                    nkerns=[20, 50], batch_size=500):
    """ Demonstrates lenet on MNIST dataset

    :type learning_rate: float
    :param learning_rate: learning rate used (factor for the stochastic
                          gradient)

    :type n_epochs: int
    :param n_epochs: maximal number of epochs to run the optimizer

    :type dataset: string
    :param dataset: path to the dataset used for training /testing (MNIST here)

    :type nkerns: list of ints
    :param nkerns: number of kernels in each convolutional layer (here 20 filters in the first layer, then 50 in the second)
    """

    rng = numpy.random.RandomState(23455)

    datasets = load_data(dataset)

    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]

    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0]
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
    n_test_batches = test_set_x.get_value(borrow=True).shape[0]
    n_train_batches /= batch_size
    n_valid_batches /= batch_size
    n_test_batches /= batch_size

    # allocate symbolic variables for the data
    index = T.lscalar()  # index to a [mini]batch

    # start-snippet-1
    x = T.matrix('x')   # the data is presented as rasterized images
    y = T.ivector('y')  # the labels are presented as 1D vector of
                        # [int] labels

    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print '... building the model'

    # Reshape matrix of rasterized images of shape (batch_size, 28 * 28)
    # to a 4D tensor, compatible with our LeNetConvPoolLayer
    # (28, 28) is the size of MNIST images.
    layer0_input = x.reshape((batch_size, 1, 28, 28))

    # Construct the first convolutional pooling layer:
    # filtering reduces the image size to (28-5+1 , 28-5+1) = (24, 24)
    # maxpooling reduces this further to (24/2, 24/2) = (12, 12)
    # 4D output tensor is thus of shape (batch_size, nkerns[0], 12, 12)
    layer0 = LeNetConvPoolLayer(
        rng,
        input=layer0_input,
        image_shape=(batch_size, 1, 28, 28),
        filter_shape=(nkerns[0], 1, 5, 5),
        poolsize=(2, 2)
    )
    '''
    Reminder of LeNetConvPoolLayer input parameters and types
    :type rng: numpy.random.RandomState
    :param rng: a random number generator used to initialize weights

    :type input: theano.tensor.dtensor4
    :param input: symbolic image tensor, of shape image_shape

    :type filter_shape: tuple or list of length 4
    :param filter_shape: (number of filters, num input feature maps,
                              filter height, filter width)

    :type image_shape: tuple or list of length 4
    :param image_shape: (batch size, num input feature maps,
                             image height, image width)

    :type poolsize: tuple or list of length 2
    :param poolsize: the downsampling (pooling) factor (#rows, #cols)
    '''
    # Construct the second convolutional pooling layer
    # filtering reduces the image size to (12-5+1, 12-5+1) = (8, 8)
    # maxpooling reduces this further to (8/2, 8/2) = (4, 4)
    # 4D output tensor is thus of shape (batch_size, nkerns[1], 4, 4)
    layer1 = LeNetConvPoolLayer(
        rng,
        input=layer0.output,
        image_shape=(batch_size, nkerns[0], 12, 12),
        filter_shape=(nkerns[1], nkerns[0], 5, 5),
        poolsize=(2, 2)
    )

    # the HiddenLayer being fully-connected, it operates on 2D matrices of
    # shape (batch_size, num_pixels) (i.e matrix of rasterized images).
    # This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4),
    # or (500, 50 * 4 * 4) = (500, 800) with the default values.
    layer2_input = layer1.output.flatten(2)

    # construct a fully-connected sigmoidal layer
    layer2 = HiddenLayer(
        rng,
        input=layer2_input,
        n_in=nkerns[1] * 4 * 4,
        n_out=500,
        activation=T.tanh
    )

    # classify the values of the fully-connected sigmoidal layer
    layer3 = LogisticRegression(input=layer2.output, n_in=500, n_out=10)

    # the cost we minimize during training is the NLL of the model
    cost = layer3.negative_log_likelihood(y)

    # create a function to compute the mistakes that are made by the model
    test_model = theano.function(
        [index],
        layer3.errors(y),
        givens={
            x: test_set_x[index * batch_size: (index + 1) * batch_size],
            y: test_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    validate_model = theano.function(
        [index],
        layer3.errors(y),
        givens={
            x: valid_set_x[index * batch_size: (index + 1) * batch_size],
            y: valid_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    # create a list of all model parameters to be fit by gradient descent
    params = layer3.params + layer2.params + layer1.params + layer0.params

    # create a list of gradients for all model parameters
    grads = T.grad(cost, params)

    # train_model is a function that updates the model parameters by
    # SGD Since this model has many parameters, it would be tedious to
    # manually create an update rule for each model parameter. We thus
    # create the updates list by automatically looping over all
    # (params[i], grads[i]) pairs.
    updates = [
        (param_i, param_i - learning_rate * grad_i)
        for param_i, grad_i in zip(params, grads)
    ]

    train_model = theano.function(
        [index],
        cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    # end-snippet-1

    ###############
    # TRAIN MODEL #
    ###############
    print '... training'
    # early-stopping parameters
    patience = 10000  # look as this many examples regardless
    patience_increase = 2  # wait this much longer when a new best is
                           # found
    improvement_threshold = 0.995  # a relative improvement of this much is
                                   # considered significant
    validation_frequency = min(n_train_batches, patience / 2)
                                  # go through this many
                                  # minibatche before checking the network
                                  # on the validation set; in this case we
                                  # check every epoch

    best_validation_loss = numpy.inf
    best_iter = 0
    test_score = 0.
    start_time = timeit.default_timer()

    epoch = 0
    done_looping = False

    while (epoch < n_epochs) and (not done_looping):
        epoch = epoch + 1
        for minibatch_index in xrange(n_train_batches):
            
            # xrange() behaves like range() but yields its values lazily instead
            # of building the whole list in memory, which matters for the very
            # large ranges used here.

            iter = (epoch - 1) * n_train_batches + minibatch_index
            
            # for epoch = 1 (the first value when entering the "while" loop): iter = 0 * n_train_batches + minibatch_index,
            # so iter = 0. This calls train_model over train_set_x[0:500] and train_set_y[0:500].
            # The (epoch - 1) * n_train_batches term keeps track of the iteration number
            # while looping over and over on the train set.

            if iter % 100 == 0:
                print 'training @ iter = ', iter
            cost_ij = train_model(minibatch_index)
            
            # Only at this moment are the symbolic expressions built during
            # "BUILD ACTUAL MODEL" actually evaluated, with real values replacing
            # the symbolic tensors. That is how Theano works.

            if (iter + 1) % validation_frequency == 0:

                # compute zero-one loss on validation set
                validation_losses = [validate_model(i) for i
                                     in xrange(n_valid_batches)]
                this_validation_loss = numpy.mean(validation_losses)
                print('epoch %i, minibatch %i/%i, validation error %f %%' %
                      (epoch, minibatch_index + 1, n_train_batches,
                       this_validation_loss * 100.))

                # if we got the best validation score until now
                if this_validation_loss < best_validation_loss:

                    #improve patience if loss improvement is good enough
                    if this_validation_loss < best_validation_loss *  \
                       improvement_threshold:
                        patience = max(patience, iter * patience_increase)

                    # save best validation score and iteration number
                    best_validation_loss = this_validation_loss
                    best_iter = iter

                    # test it on the test set
                    test_losses = [
                        test_model(i)
                        for i in xrange(n_test_batches)
                    ]
                    test_score = numpy.mean(test_losses)
                    print(('     epoch %i, minibatch %i/%i, test error of '
                           'best model %f %%') %
                          (epoch, minibatch_index + 1, n_train_batches,
                           test_score * 100.))

            if patience <= iter:
                done_looping = True
                break

    end_time = timeit.default_timer()
    print('Optimization complete.')
    print('Best validation score of %f %% obtained at iteration %i, '
          'with test performance %f %%' %
          (best_validation_loss * 100., best_iter + 1, test_score * 100.))
    print >> sys.stderr, ('The code for file ' +
                          os.path.split(__file__)[1] +
                          ' ran for %.2fm' % ((end_time - start_time) / 60.))

C. Implementation of Learning Rate Decay

Let's modify the code of the evaluate_lenet5 function so that it implements learning rate decay.

Definition: the learning rate is the step size of the parameter update during gradient descent. It is typically between 0.1 and 0.01. If it is too large, gradient descent can overshoot the minimum and diverge; if it is too small, the optimization is very slow and may get stuck in a poor local minimum. Learning rate decay allows the learning rate to be large at the beginning and then to decrease progressively as the optimization approaches a minimum (a short numeric sketch follows the list below):

  • initial learning rate: $$\alpha = \alpha_0$$
  • learning rate decay: $$\alpha_d$$
  • after each epoch, update: $$\alpha = \alpha_d*\alpha$$
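
A short numeric sketch of this geometric schedule with initial rate 0.1 and decay 0.98 (a standalone illustration, not part of the tutorial code):


In [ ]:
alpha = 0.1
for epoch in range(1, 6):
    alpha *= 0.98
    print 'epoch %i: learning rate = %.5f' % (epoch, alpha)
# epoch 1: 0.09800 ... epoch 5: 0.09039 -- the step size shrinks geometrically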

The full code can be found at code/convolutional_mlp_ldr.py


In [7]:
def evaluate_lenet5_ldr(learning_rate=0.1, learning_rate_decay = 0.98, n_epochs=200,
                    dataset='mnist.pkl.gz',
                    nkerns=[20, 50], batch_size=500):
    """ 
    :type learning_rate_decay: float
    :param learning_rate_decay: learning rate decay used 
    """

    rng = numpy.random.RandomState(23455)

   """
   ...
   """
    
    #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!#
    # Theano function to decay the learning rate; this is separate from the
    # training function because we only want to do this once per epoch instead
    # of after each minibatch. For this update to be valid, learning_rate must
    # be a Theano shared variable (it is created in the elided code above).
    decay_learning_rate = theano.function(inputs=[], outputs=learning_rate,
            updates={learning_rate: learning_rate * learning_rate_decay})
    #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!#

    ###############
    # TRAIN MODEL #
    ###############
    
    """
    ...
    """

    while (epoch < n_epochs) and (not done_looping):
        epoch = epoch + 1
        for minibatch_index in xrange(n_train_batches):

            iter = (epoch - 1) * n_train_batches + minibatch_index
            
            if iter % 100 == 0:
                print 'training @ iter = ', iter
            cost_ij = train_model(minibatch_index)

            if (iter + 1) % validation_frequency == 0:

                # compute zero-one loss on validation set
                validation_losses = [validate_model(i) for i
                                     in xrange(n_valid_batches)]
                this_validation_loss = numpy.mean(validation_losses)
                print('epoch %i, minibatch %i/%i, validation error %f %%' %
                      (epoch, minibatch_index + 1, n_train_batches,
                       this_validation_loss * 100.))

                # if we got the best validation score until now
                if this_validation_loss < best_validation_loss:

                    #improve patience if loss improvement is good enough
                    if this_validation_loss < best_validation_loss *  \
                       improvement_threshold:
                        patience = max(patience, iter * patience_increase)

                    # save best validation score and iteration number
                    best_validation_loss = this_validation_loss
                    best_iter = iter

                    # test it on the test set
                    test_losses = [
                        test_model(i)
                        for i in xrange(n_test_batches)
                    ]
                    test_score = numpy.mean(test_losses)
                    print(('     epoch %i, minibatch %i/%i, test error of '
                           'best model %f %%') %
                          (epoch, minibatch_index + 1, n_train_batches,
                           test_score * 100.))
                
                #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                new_learning_rate = decay_learning_rate()
                #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

            if patience <= iter:
                done_looping = True
                break

"""
...
"""

D. Implementation of dropout

Dropout is a technique presented in the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" by N. Srivastava, G. Hinton and co-authors. As can be read in the abstract:

Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different “thinned” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights

Dropping out 20% of the input units and 50% of the hidden units was often found to be optimal.

The implementation solution presented here is greatly inspired by GitHub user mdenil's work on dropout, which implements Hinton's dropout on a multi-layer perceptron (base code from deeplearning.net).

1. Creating dropout function

This function takes a layer (either a layer of units in an MLP or a layer of feature maps in a CNN) and drops units from it with probability p (in the CNN case, pixels of the feature maps are dropped with probability p).


In [1]:
def _dropout_from_layer(rng, layer, p):
    """p is the probablity of dropping a unit
    """
    srng = theano.tensor.shared_randomstreams.RandomStreams(
            rng.randint(999999))
    # p=1-p because 1's indicate keep and p is probability of dropping
    mask = srng.binomial(n=1, p=1-p, size=layer.shape)
    # The cast is important because
    # int * float32 = float64 which pulls things off the gpu
    output = layer * T.cast(mask, theano.config.floatX)
    return output
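
A quick sanity check of the mask's behaviour (an illustrative cell; the names are chosen for this example):


In [ ]:
rng_check = numpy.random.RandomState(0)
x_check = T.matrix('x_check')
f_drop = theano.function([x_check], _dropout_from_layer(rng_check, x_check, p=0.5))
sample = numpy.ones((1, 10000), dtype=theano.config.floatX)
print f_drop(sample).mean()   # ~0.5: roughly half of the units were zeroed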

2. Creating dropout classes

We create child classes from HiddenLayer and LeNetConvPoolLayer so that they take into account dropout.


In [ ]:
class DropoutHiddenLayer(HiddenLayer):
    def __init__(self, rng, input, n_in, n_out,
                 activation, dropout_rate, W=None, b=None):
        super(DropoutHiddenLayer, self).__init__(
                rng=rng, input=input, n_in=n_in, n_out=n_out, W=W, b=b,
                activation=activation)

        self.output = _dropout_from_layer(rng, self.output, p=dropout_rate)

In [ ]:
class DropoutLeNetConvPoolLayer(LeNetConvPoolLayer):
    def __init__(self, rng, input, filter_shape, image_shape, poolsize,
                 dropout_rate, W=None, b=None):
        super(DropoutLeNetConvPoolLayer, self).__init__(
            rng=rng, input=input, filter_shape=filter_shape, image_shape=image_shape,
            poolsize=poolsize, W=W, b=b)

        self.output = _dropout_from_layer(rng, self.output, p=dropout_rate)

Note: we dropout pixels after pooling.

3. Rewriting evaluate_lenet5

Each time a layer is instantiated, two layers actually need to be created in parallel: the dropout layer, which drops out some of its units with probability p, and an associated deterministic layer sharing the same parameters W and b, except that W is scaled by (1 - p), where p is the dropout rate applied to that layer's input (see the numeric check below).

Again, full code can be found at code/convolutional_mlp_dropout.py
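
As a quick numeric check of why the shared W is rescaled (illustrative values):


In [ ]:
# with dropout probability p, a unit's output is present only (1 - p) of the
# time during training; scaling W by (1 - p) in the deterministic twin layer
# preserves the expected pre-activation of the next layer
p = 0.5
w, x_val = 2.0, 3.0
expected_train = (1 - p) * (w * x_val)   # average contribution under dropout
test_time = (w * (1 - p)) * x_val        # contribution with rescaled weights
print expected_train == test_time        # True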


In [ ]:
def evaluate_lenet5(initial_learning_rate=0.1, learning_rate_decay = 1, 
                    dropout_rates = [0.2, 0.2, 0.2, 0.5], n_epochs=200,
                    dataset='mnist.pkl.gz',
                    nkerns=[20, 50], batch_size=500):
    """
    :type dropout_rates: list of float
    :param dropout_rates: dropout rate used for each layer (input layer, 
    1st filtered layer, 2nd filtered layer, fully connected layer)

    """
    
    """
    ...
    """

    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print '... building the model'

    # Reshape matrix of rasterized images of shape (batch_size, 28 * 28)
    # to a 4D tensor, compatible with our LeNetConvPoolLayer
    # (28, 28) is the size of MNIST images.
    layer0_input = x.reshape((batch_size, 1, 28, 28))
    # Dropping out pixels from the original image randomly, with a probability of
    # dropping low enough not to drop too much information (20% was found to be ideal)
    layer0_input_dropout = _dropout_from_layer(rng, layer0_input, dropout_rates[0])


    # Construct the first convolutional pooling layer:
    # filtering reduces the image size to (28-5+1 , 28-5+1) = (24, 24)
    # maxpooling reduces this further to (24/2, 24/2) = (12, 12)
    # 4D output tensor is thus of shape (batch_size, nkerns[0], 12, 12)
    layer0_dropout = DropoutLeNetConvPoolLayer(
        rng,
        input=layer0_input_dropout,
        image_shape=(batch_size, 1, 28, 28),
        filter_shape=(nkerns[0], 1, 5, 5),
        poolsize=(2, 2),
        dropout_rate= dropout_rates[1]
    )
    
    # Creating in parallel a normal LeNetConvPoolLayer that shares the same
    # W and b as the dropout layer, with W scaled by (1 - p).
    layer0 = LeNetConvPoolLayer(
      rng,
      input=layer0_input,
      image_shape=(batch_size, 1, 28, 28),
      filter_shape=(nkerns[0], 1, 5, 5),
      poolsize=(2, 2),
      W=layer0_dropout.W * (1 - dropout_rates[0]),
      b=layer0_dropout.b
    )

    # Construct the second convolutional pooling layer
    # filtering reduces the image size to (12-5+1, 12-5+1) = (8, 8)
    # maxpooling reduces this further to (8/2, 8/2) = (4, 4)
    # 4D output tensor is thus of shape (batch_size, nkerns[1], 4, 4)
    layer1_dropout = DropoutLeNetConvPoolLayer(
        rng,
        input=layer0_dropout.output,
        image_shape=(batch_size, nkerns[0], 12, 12),
        filter_shape=(nkerns[1], nkerns[0], 5, 5),
        poolsize=(2, 2),
        dropout_rate = dropout_rates[2]
    )

    layer1 = LeNetConvPoolLayer(
      rng,
      input=layer0.output,
      image_shape=(batch_size, nkerns[0], 12, 12),
      filter_shape=(nkerns[1], nkerns[0], 5, 5),
      poolsize=(2, 2),
      W=layer1_dropout.W * (1 - dropout_rates[1]),
      b=layer1_dropout.b
    )

    # the HiddenLayer being fully-connected, it operates on 2D matrices of
    # shape (batch_size, num_pixels) (i.e matrix of rasterized images).
    # This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4),
    # or (500, 50 * 4 * 4) = (500, 800) with the default values.
    layer2_dropout_input = layer1_dropout.output.flatten(2)
    layer2_input = layer1.output.flatten(2)

    # construct a fully-connected sigmoidal layer
    layer2_dropout = DropoutHiddenLayer(
        rng,
        input=layer2_dropout_input,
        n_in=nkerns[1] * 4 * 4,
        n_out=500,
        activation=T.tanh,
        dropout_rate = dropout_rates[3]
    )

    layer2 = HiddenLayer(
      rng,
      input=layer2_input,
      n_in=nkerns[1] * 4 * 4,
      n_out=500,
      activation=T.tanh,
      W=layer2_dropout.W * (1 - dropout_rates[2]),
      b=layer2_dropout.b
    )


    # classify the values of the fully-connected sigmoidal layer
    layer3_dropout = LogisticRegression(
      input = layer2_dropout.output,
      n_in = 500, n_out = 10)

    layer3 = LogisticRegression(
      input=layer2.output,
      n_in=500, n_out=10,
      W=layer3_dropout.W * (1 - dropout_rates[-1]),
      b=layer3_dropout.b
    )


    # the cost we minimize during training is the NLL of the model
    cost = layer3.negative_log_likelihood(y)
    dropout_cost = layer3_dropout.negative_log_likelihood(y)

    # create a function to compute the mistakes that are made by the model
    test_model = theano.function(
        [index],
        layer3.errors(y),
        givens={
            x: test_set_x[index * batch_size: (index + 1) * batch_size],
            y: test_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    validate_model = theano.function(
        [index],
        layer3.errors(y),
        givens={
            x: valid_set_x[index * batch_size: (index + 1) * batch_size],
            y: valid_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    # create a list of all model parameters to be fit by gradient descent
    params = layer3_dropout.params + layer2_dropout.params + layer1_dropout.params + layer0_dropout.params

    # create a list of gradients for all model parameters
    grads = T.grad(dropout_cost, params)

    # train_model is a function that updates the model parameters by SGD
    updates = [
        (param_i, param_i - learning_rate * grad_i)
        for param_i, grad_i in zip(params, grads)
    ]

    train_model = theano.function(
        [index],
        dropout_cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )

    """
    ...
    """

After running the code for 50 epochs (237 minutes of computation) we get:

Best validation score of 1.560000 % obtained at iteration 5000, with test performance 1.450000 %

Full result file at results/dropout_good_percent.txt

E. Visualization of the convolutional filters

Read this article on understanding convolutional neural networks. Many methods of visualizing what convolutional networks have learned are described there. We will retain the first one, as it is the most straightforward to implement:

Visualizing the activations and first-layer weights:

  • Layer Activations: the most straight-forward visualization technique is to show the activations of the network during the forward pass. For ReLU networks, the activations usually start out looking relatively blobby and dense, but as the training progresses the activations usually become more sparse and localized. One dangerous pitfall that can be easily noticed with this visualization is that some activation maps may be all zero for many different inputs, which can indicate dead filters, and can be a symptom of high learning rates.

  • Conv/FC Filters: The second common strategy is to visualize the weights. These are usually most interpretable on the first CONV layer which is looking directly at the raw pixel data, but it is possible to also show the filter weights deeper in the network. The weights are useful to visualize because well-trained networks usually display nice and smooth filters without any noisy patterns. Noisy patterns can be an indicator of a network that hasn't been trained for long enough, or possibly a very low regularization strength that may have led to overfitting.

I would like to visualize the filters, so I implement the second strategy to see the first 20 filters.

M. D. Zeiler wrote an interesting paper about Deconvolutional Networks (DeConvNet) for visualizing and understanding convolutional filters. The only code I found on this subject can be found here.

1. Visualization function

Let's create a function that displays the filters when fed the weight parameter W.


In [25]:
import pylab
from PIL import Image

In [35]:
def display_filter(W, n_cols = 5):
    
    """
    :type W: numpy ndarray
    :param W: parameter W of a convolutional + max pooling layer
    
    :type n_cols: int
    :param n_cols: number of columns of the grid on which the filters are displayed
    """
    
    W_shape = W.shape
    n_filters = W_shape[0]
    
    #param filter_shape: (number of filters, num input feature maps, filter height, filter width)
    filter_height = W_shape[2]
    filter_width = W_shape[3]
    
    # float division before ceil, otherwise Python 2 integer division
    # would round the number of lines down
    n_lines = int(numpy.ceil(n_filters / float(n_cols)))
    
    for n in range(n_filters):
        Wn = W[n, 0, :, :]
        Wn = Wn / Wn.max() # scaling W to get a 0-1 gray scale
        pylab.subplot(n_lines, n_cols, n + 1); pylab.axis('off'); pylab.imshow(Wn, cmap=pylab.gray())
    pylab.show()

2. Testing the function on a single untrained LeNetConvPoolLayer

To test the function let's take back the example of the image of me when I was 5 years old. I will feed it to a LeNetConvPoolLayer, retrieve the weights, and display them.


In [36]:
rng = numpy.random.RandomState(1234)

img = Image.open(open('images/profilepic4.jpg'))
img = numpy.asarray(img, dtype='float64') / 256. # divide by 256 to have RGB 0-1 scale and not 0 - 256
img_ = img.transpose(2, 0, 1).reshape(1, 3, 2592, 1936)

input = img_

filter_shape = [20,3,12,12]
image_shape = [1,3,2592,1936]

poolsize = (2, 2)

layer_test = LeNetConvPoolLayer(rng, input, filter_shape, image_shape, poolsize)

# layer_test.params is a list of shared variables; compile a function returning
# their current values and call it to get the weights as numpy arrays
f = theano.function([], layer_test.params)

W = f()[0]

In [37]:
display_filter(W)

As the weights are randomly initialized, we of course see random patterns in each filter.

3. Displaying the learned filters after training

Let's now modify the code of evaluate_lenet5 so that it displays the filters after training. Full code can be found at code/filter_visualization.py.


In [ ]:
def evaluate_lenet5(initial_learning_rate=0.1, learning_rate_decay = 1, 
                    dropout_rates = [0.2, 0.2, 0.2, 0.5], n_epochs=200,
                    dataset='mnist.pkl.gz', display_filters = True,
                    nkerns=[20, 50], batch_size=500):
    """
    :type display_filters: Bool
    :param display_filters: True if we want to display the learned filters after training
    
    
    we skip to the very end of the code, after training is done
    """
    
    if display_filters:
        # Retrieving the filters from first and second layer
        # (the params are shared variables; compile functions returning their
        # current values, then call them)
        first_convlayer_params = theano.function([], layer0_dropout.params)
        second_convlayer_params = theano.function([], layer1_dropout.params)
        
        W0 = first_convlayer_params()[0]
        W1 = second_convlayer_params()[0]
        
        # Display filters from first layer (20 filters)
        display_filter(W0)
        
        # Display filters from second layer (50 filters)
        display_filter(W1)

The filters from the first layer after 50 epochs (159.21 minutes of computation):

The filters from the second layer after 50 epochs:

We can see less randomness, with patterns appearing in some of the filters.

F. Automated creation of a CNN + MLP

The idea is to create a cnn_mlp class that enables the direct instantiation of a CNN followed by an MLP, in order to have cleaner, more flexible code. The full code will not be presented here but can be found at code/cnn_mlp.py. It is greatly inspired by mlp.py created by mdenil.