Tutorial: Training convolutional neural networks with nolearn

Author: Benjamin Bossan

Note: This notebook was updated on April 4, 2016, to reflect recent changes in nolearn.

This tutorial's goal is to teach you how to use nolearn to train convolutional neural networks (CNNs). The nolearn documentation can be found here. We assume that you have some general knowledge about machine learning, or about neural nets specifically, but want to learn more about convolutional neural networks and nolearn.

We will cover several points in this notebook.

  1. How to load image data such that we can use it for our purpose. For this tutorial, we will use the MNIST data set, which consists of images of the handwritten digits 0 to 9.
  2. How to properly define the layers of the net. A good choice of layers, i.e. a good network architecture, is the most important factor in getting good results out of a neural net.
  3. The definition of the neural network itself. Here we define important hyper-parameters.
  4. Next we will see how visualizations may help us to further refine the network.
  5. Finally, we will show you how nolearn can help us find better architectures for our neural network.

Imports


In [1]:
import os

In [2]:
import matplotlib.pyplot as plt
%pylab inline
import numpy as np


Populating the interactive namespace from numpy and matplotlib

In [3]:
from lasagne.layers import DenseLayer
from lasagne.layers import InputLayer
from lasagne.layers import DropoutLayer
from lasagne.layers import Conv2DLayer
from lasagne.layers import MaxPool2DLayer
from lasagne.nonlinearities import softmax
from lasagne.updates import adam
from lasagne.layers import get_all_params

In [4]:
from nolearn.lasagne import NeuralNet
from nolearn.lasagne import TrainSplit
from nolearn.lasagne import objective

Loading MNIST data

This little helper function loads the MNIST data available here.


In [5]:
def load_mnist(path):
    X = []
    y = []
    with open(path, 'rb') as f:
        next(f)  # skip header
        for line in f:
            yi, xi = line.split(',', 1)
            y.append(yi)
            X.append(xi.split(','))

    # Theano works with fp32 precision
    X = np.array(X).astype(np.float32)
    y = np.array(y).astype(np.int32)

    # apply some very simple normalization to the data
    X -= X.mean()
    X /= X.std()

    # For convolutional layers, the default shape of data is bc01,
    # i.e. batch size x color channels x image dimension 1 x image dimension 2.
    # Therefore, we reshape the X data to -1, 1, 28, 28.
    X = X.reshape(
        -1,  # number of samples, -1 makes it so that this number is determined automatically
        1,   # 1 color channel, since images are only black and white
        28,  # first image dimension (vertical)
        28,  # second image dimension (horizontal)
    )

    return X, y

In [6]:
# here you should enter the path to your MNIST data
path = os.path.join(os.path.expanduser('~'), 'data/mnist/train.csv')

In [7]:
X, y = load_mnist(path)

In [8]:
figs, axes = plt.subplots(4, 4, figsize=(6, 6))
for i in range(4):
    for j in range(4):
        axes[i, j].imshow(-X[i + 4 * j].reshape(28, 28), cmap='gray', interpolation='none')
        axes[i, j].set_xticks([])
        axes[i, j].set_yticks([])
        axes[i, j].set_title("Label: {}".format(y[i + 4 * j]))
        axes[i, j].axis('off')


Definition of the layers

So let us define the layers for the convolutional net. In general, layers are assembled in a list. Each element of the list is a tuple -- first a Lasagne layer, next a dictionary containing the arguments of the layer. We will explain the layer definitions in a moment, but in general, you should look them up in the Lasagne documentation.

Nolearn allows you to skip Lasagne's incoming keyword, which specifies how the layers are connected. Instead, nolearn will automatically assume that layers are connected in the order they appear in the list.

Note: Of course you can manually set the incoming parameter if your neural net's layers are connected differently. To do so, you have to give the corresponding layer a name (e.g. 'name': 'my layer') and use that name as a reference ('incoming': 'my layer').
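For illustration, here is a minimal sketch of a manually wired pair of layers (the names are hypothetical; this pattern is not needed for the sequential net below):

layers_named = [
    (InputLayer, {'name': 'input', 'shape': (None, 1, 28, 28)}),
    (Conv2DLayer, {'name': 'conv', 'incoming': 'input',
                   'num_filters': 32, 'filter_size': 3}),
]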

The layers we use are the following:

  • InputLayer: We have to specify the shape of the data. For image data, it is batch size x color channels x image dimension 1 x image dimension 2 (aka bc01). Here you should generally just leave the batch size as None, so that it is taken care of automatically. The other dimensions are given by X.
  • Conv2DLayer: The most important keywords are num_filters and filter_size. The former indicates the number of channels -- the more you choose, the more different filters can be learned by the CNN. Generally, the first convolutional layers will learn simple features, such as edges, while deeper layers can learn more abstract features. Therefore, you should increase the number of filters the deeper you go. The filter_size is the size of the filter/kernel. The current consensus is to always use 3x3 filters, as stacking them covers the same number of image pixels with fewer parameters than larger filters would require.
  • MaxPool2DLayer: This layer performs max pooling and hopefully provides translation invariance. We need to indicate the region over which it pools, 2x2 being the most common choice.
  • DenseLayer: This is your vanilla fully-connected layer; you should indicate the number of 'neurons' with the num_units argument. The very last layer is assumed to be the output layer. We thus set the number of units to be the number of classes, 10, and choose softmax as the output nonlinearity, as we are dealing with a classification task.
  • DropoutLayer: Dropout is a common technique to regularize neural networks. It is almost always a good idea to include dropout between your dense layers.

Apart from these arguments, the Lasagne layers have very reasonable defaults concerning weight initialization, nonlinearities (rectified linear units), etc.
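For instance, the following layer tuple spells out two of those defaults explicitly and is equivalent to (Conv2DLayer, {'num_filters': 32, 'filter_size': 3}); GlorotUniform and rectify are Lasagne's documented defaults:

from lasagne.init import GlorotUniform
from lasagne.nonlinearities import rectify

(Conv2DLayer, {'num_filters': 32, 'filter_size': 3,
               'W': GlorotUniform(), 'nonlinearity': rectify})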


In [9]:
layers0 = [
    # layer dealing with the input data
    (InputLayer, {'shape': (None, X.shape[1], X.shape[2], X.shape[3])}),

    # first stage of our convolutional layers
    (Conv2DLayer, {'num_filters': 96, 'filter_size': 5}),
    (Conv2DLayer, {'num_filters': 96, 'filter_size': 3}),
    (Conv2DLayer, {'num_filters': 96, 'filter_size': 3}),
    (Conv2DLayer, {'num_filters': 96, 'filter_size': 3}),
    (Conv2DLayer, {'num_filters': 96, 'filter_size': 3}),
    (MaxPool2DLayer, {'pool_size': 2}),

    # second stage of our convolutional layers
    (Conv2DLayer, {'num_filters': 128, 'filter_size': 3}),
    (Conv2DLayer, {'num_filters': 128, 'filter_size': 3}),
    (Conv2DLayer, {'num_filters': 128, 'filter_size': 3}),
    (MaxPool2DLayer, {'pool_size': 2}),

    # two dense layers with dropout
    (DenseLayer, {'num_units': 64}),
    (DropoutLayer, {}),
    (DenseLayer, {'num_units': 64}),

    # the output layer
    (DenseLayer, {'num_units': 10, 'nonlinearity': softmax}),
]

Definition of the neural network

Now we initialize nolearn's neural net itself. We will explain each argument shortly:

  • The most important argument is the layers argument, which should be the list of layers defined above.
  2. max_epochs is simply the number of epochs the net learns with each call to fit (an 'epoch' is a full training cycle using all training data).
  3. As update, we choose adam, which is a good first choice as the update rule for many problems.
  4. The objective of our net is nolearn's default objective, which we imported above; it applies the categorical cross-entropy loss plus an optional regularization term.
  5. To change the magnitude of L2 regularization (see here), we set the objective_l2 parameter. The NeuralNet class will then automatically pass this value on when calling the objective. Usually, moderate L2 regularization is applied, whereas L1 regularization is less frequently used.
  6. For 'adam', a small learning rate is best, so we set it with the update_learning_rate argument (nolearn will automatically interpret this argument to mean the learning_rate argument of the update parameter, i.e. adam in our case).
  7. The NeuralNet will hold out some of the training data for validation if we set the eval_size of the TrainSplit to a number greater than 0. This will allow us to monitor how well the net generalizes to yet unseen data. By setting this argument to 1/4, we tell the net to hold out 25% of the samples for validation.
  • Finally, we set verbose to 1, which will result in the net giving us some useful information.

In [10]:
net0 = NeuralNet(
    layers=layers0,
    max_epochs=10,

    update=adam,
    update_learning_rate=0.0002,

    objective_l2=0.0025,

    train_split=TrainSplit(eval_size=0.25),
    verbose=1,
)

Training the neural network

To train the net, we call its fit method with our X and y data, as we would with any scikit-learn classifier.


In [11]:
net0.fit(X, y)


# Neural Network with 753610 learnable parameters

## Layer information

  #  name         size
---  -----------  --------
  0  input0       1x28x28
  1  conv2d1      96x24x24
  2  conv2d2      96x22x22
  3  conv2d3      96x20x20
  4  conv2d4      96x18x18
  5  conv2d5      96x16x16
  6  maxpool2d6   96x8x8
  7  conv2d7      128x6x6
  8  conv2d8      128x4x4
  9  conv2d9      128x2x2
 10  maxpool2d10  128x1x1
 11  dense11      64
 12  dropout12    64
 13  dense13      64
 14  dense14      10

  epoch    train loss    valid loss    train/val    valid acc  dur
-------  ------------  ------------  -----------  -----------  ------
      1       2.49039       1.47864      1.68424      0.93246  21.84s
      2       1.46182       1.20667      1.21145      0.96527  21.43s
      3       1.21271       1.08095      1.12190      0.96913  21.34s
      4       1.06744       0.96083      1.11096      0.97421  21.34s
      5       0.94933       0.87513      1.08479      0.97675  21.37s
      6       0.86063       0.79308      1.08517      0.97967  21.36s
      7       0.78233       0.72516      1.07885      0.97986  21.48s
      8       0.71531       0.66827      1.07039      0.97939  21.59s
      9       0.65571       0.61533      1.06562      0.98136  21.37s
     10       0.60635       0.57576      1.05313      0.98033  21.35s
Out[11]:
NeuralNet(X_tensor_type=None,
     batch_iterator_test=<nolearn.lasagne.base.BatchIterator object at 0x7f3de3115910>,
     batch_iterator_train=<nolearn.lasagne.base.BatchIterator object at 0x7f3de3115790>,
     check_input=True, custom_scores=None,
     layers=[(<class 'lasagne.layers.input.InputLayer'>, {'shape': (None, 1, 28, 28)}), (<class 'lasagne.layers.conv.Conv2DLayer'>, {'filter_size': 5, 'num_filters': 96}), (<class 'lasagne.layers.conv.Conv2DLayer'>, {'filter_size': 3, 'num_filters': 96}), (<class 'lasagne.layers.conv.Conv2DLayer'>, {'fil...layers.dense.DenseLayer'>, {'num_units': 10, 'nonlinearity': <function softmax at 0x7f3dec0f9848>})],
     loss=None, max_epochs=10, more_params={},
     objective=<function objective at 0x7f3de311ade8>, objective_l2=0.0025,
     objective_loss_function=<function categorical_crossentropy at 0x7f3dec0d15f0>,
     on_batch_finished=[],
     on_epoch_finished=[<nolearn.lasagne.handlers.PrintLog instance at 0x7f3dde0997a0>],
     on_training_finished=[],
     on_training_started=[<nolearn.lasagne.handlers.PrintLayerInfo instance at 0x7f3dde0998c0>],
     regression=False,
     train_split=<nolearn.lasagne.base.TrainSplit object at 0x7f3dec5aae10>,
     update=<function adam at 0x7f3dec0d7140>, update_learning_rate=0.0002,
     use_label_encoder=False, verbose=1,
     y_tensor_type=TensorType(int32, vector))

As we set the verbosity to 1, nolearn will print some useful information for us:

  • First of all, some general information about the net and its layers is printed. Then, during training, the progress will be printed after each epoch.
  • The train loss is the loss/cost that the net tries to minimize. For this example, this is the log loss (cross entropy).
  • The valid loss is the loss for the hold out validation set. You should expect this value to indicate how well your model generalizes to yet unseen data.
  • train/val is simply the ratio of train loss to valid loss. If this value is very low, i.e. if the train loss is much better than your valid loss, it means that the net has probably overfitted the train data.
  • When we are dealing with a classification task, the accuracy score of the validation set, valid acc, is also printed.
  • dur is simply the duration it took to process the given epoch.

In addition to this, nolearn highlights the best train and valid losses so far in color, so that it is easy to spot whether the net is making progress.
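Since NeuralNet follows the scikit-learn estimator API, the fitted net can now be used like any other classifier. A quick sketch (the printed values depend on the training run):

y_pred = net0.predict(X[:5])         # predicted labels for the first five images
y_proba = net0.predict_proba(X[:5])  # class probabilities from the softmax output
print(y_pred)   # e.g. an array of five predicted digits
print(y[:5])    # the true labels, for comparison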

Visualizations

Diagnosing what's wrong with your neural network if the results are unsatisfying can sometimes be difficult, something closer to an art than a science. But with nolearn's visualization tools, we should be able to get some insights that help us diagnose if something is wrong.


In [12]:
from nolearn.lasagne.visualize import draw_to_notebook
from nolearn.lasagne.visualize import plot_loss
from nolearn.lasagne.visualize import plot_conv_weights
from nolearn.lasagne.visualize import plot_conv_activity
from nolearn.lasagne.visualize import plot_occlusion
from nolearn.lasagne.visualize import plot_saliency

Visualizing the network architecture

First we may be interested in simply visualizing the architecture. When using an IPython/Jupyter notebook, this is best achieved by calling the draw_to_notebook function, passing the net as the first argument.


In [13]:
draw_to_notebook(net0)


Out[13]:

If you have accidentally made an error during the construction of the architecture, you should be able to spot it easily now.

Train and validation loss progress

With nolearn's visualization tools, it is possible to get some further insights into the working of the CNN. Below, we will simply plot the log loss of the training and validation data over each epoch:


In [14]:
plot_loss(net0)


Out[14]:
<module 'matplotlib.pyplot' from '/home/vinh/anaconda/envs/nolearn/lib/python2.7/site-packages/matplotlib/pyplot.pyc'>

This kind of visualization can be helpful in determining whether we want to continue training or not. For instance, here we see that both losses are still decreasing and that more training will pay off. This graph can also help determine if we are overfitting: If the train loss is much lower than the validation loss, we should probably do something to regularize the net.

Visualizing layer weights

We can further have a look at the weights learned by the net. The first argument of the function should be the layer we want to visualize. The layers can be accessed through the layers_ attribute and then by name (e.g. 'conv2d1') or by index, as below. (Obviously, visualizing the weights only makes sense for convolutional layers.)
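Both kinds of access refer to the same layer object; a minimal sketch, with the layer name taken from the table printed during training:

assert net0.layers_['conv2d1'] is net0.layers_[1]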


In [15]:
plot_conv_weights(net0.layers_[1], figsize=(4, 4))


Out[15]:
<module 'matplotlib.pyplot' from '/home/vinh/anaconda/envs/nolearn/lib/python2.7/site-packages/matplotlib/pyplot.pyc'>

As can be seen above, in our case, the results are not too interesting. If the weights just look like noise, we might have to do something (e.g. use more filters so that each can specialize better).

Visualizing the layers' activities

To see through the "eyes" of the net, we can plot the activities produced by different layers. The plot_conv_activity function is made for that. The first argument, again, is a layer, the second argument an image in the bc01 format (which is why we use X[0:1] instead of just X[0]).


In [16]:
x = X[0:1]

In [17]:
plot_conv_activity(net0.layers_[1], x)


Out[17]:
<module 'matplotlib.pyplot' from '/home/vinh/anaconda/envs/nolearn/lib/python2.7/site-packages/matplotlib/pyplot.pyc'>

Here we can see that depending on the learned filters, the neural net represents the image in different ways, which is what we should expect. If, e.g., some of the activity maps were completely black, that could indicate that the corresponding filters have not learned anything useful. When you find yourself in such a situation, training longer or initializing the weights differently might do the trick.

Plot occlusion images

One way to check whether the net overfits or actually learns important features is to occlude part of the image. Then we can check whether the net still makes correct predictions. The idea behind this is the following: If the most critical part of an image is something like the head of a person, that is probably right. If it is instead a random part of the background, the net probably overfits (see here for more).

With the plot_occlusion function, we can check this. The approach is to occlude parts of the image and check how strongly this affects the power of our net to predict the correct label. The first argument to the function is the neural net, the second the X data, the third the y data. Be warned that this function can be quite slow for larger images.


In [18]:
plot_occlusion(net0, X[:5], y[:5])


Out[18]:
<module 'matplotlib.pyplot' from '/home/vinh/anaconda/envs/nolearn/lib/python2.7/site-packages/matplotlib/pyplot.pyc'>

Here we see which parts of the number are most important for correct classification. We see that the critical parts are all directly above the numbers, so this seems to work out. For more complex images with different objects in the scene, this function should be more useful, though.

Saliency plot

Similarly to plotting the occlusion images, we may also backpropagate the gradient onto the image to see which parts matter to the net. The idea here is similar but the outcome differs, as a quick comparison shows. The advantage of using the gradient is that the computation is much quicker, but the critical parts are more spread out across the image, making interpretation more difficult.


In [19]:
plot_saliency(net0, X[:5]);


Finding a good architecture

This section tries to help you go deep with your convolutional neural net.

There is more than one way to go deep with CNNs. One possibility is to try a residual net architecture, which won several tasks of the 2015 ImageNet competition. Here we will instead try a more "traditional" approach using blocks of convolutional layers separated by pooling layers. If we want to increase the number of convolutional layers, we cannot simply do so at will. It is important that the layers have a sufficiently high learning capacity while covering approximately 100% of the incoming image (Xudong Cao, 2015).

The usual approach is to try to go deep with convolutional layers. If you chain too many convolutional layers, though, their learning capacity falls too low. At this point, you have to add a max pooling layer. Use too many max pooling layers, however, and the coverage grows beyond 100% of the image, which is clearly pointless. Striking the right balance while maximizing the depth of your network is the final goal.

It is generally a good idea to use small filter sizes for your convolutional layers, generally 3x3. The reason is that stacked small filters cover the same receptive field of the image while using fewer parameters than a single larger filter would require. Moreover, deeper stacks of convolutional layers are more expressive (see here for more).
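A quick back-of-the-envelope calculation illustrates the saving. Two stacked 3x3 convolutions have the same 5x5 receptive field as a single 5x5 convolution but need fewer weights (a sketch that ignores biases; 96 channels is just an example):

channels = 96
one_5x5 = 5 * 5 * channels * channels      # weights of a single 5x5 conv layer
two_3x3 = 2 * 3 * 3 * channels * channels  # weights of two stacked 3x3 conv layers
print(one_5x5, two_3x3)                    # 230400 vs. 165888, i.e. 28% fewer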


In [20]:
from nolearn.lasagne import PrintLayerInfo

A shallow net

Let us try out a simple architecture and see how we fare.


In [21]:
layers1 = [
    (InputLayer, {'shape': (None, X.shape[1], X.shape[2], X.shape[3])}),

    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3)}),
    (MaxPool2DLayer, {'pool_size': (2, 2)}),

    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3)}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3)}),
    (MaxPool2DLayer, {'pool_size': (2, 2)}),

    (Conv2DLayer, {'num_filters': 96, 'filter_size': (3, 3)}),
    (MaxPool2DLayer, {'pool_size': (2, 2)}),

    (DenseLayer, {'num_units': 64}),
    (DropoutLayer, {}),
    (DenseLayer, {'num_units': 64}),

    (DenseLayer, {'num_units': 10, 'nonlinearity': softmax}),
]

In [22]:
net1 = NeuralNet(
    layers=layers1,
    update_learning_rate=0.01,
    verbose=2,
)

To see information about the capacity and coverage of each layer, we need to set the verbosity of the net to a value of 2 and then initialize the net. We next pass the initialized net to PrintLayerInfo to see some useful information. By the way, we could also just call the fit method of the net to get the same outcome, but since we don't want to fit just now, we proceed as shown below.


In [23]:
net1.initialize()

In [24]:
layer_info = PrintLayerInfo()

In [25]:
layer_info(net1)


# Neural Network with 122154 learnable parameters

## Layer information

name        size        total    cap.Y    cap.X    cov.Y    cov.X
----------  --------  -------  -------  -------  -------  -------
input0      1x28x28       784   100.00   100.00   100.00   100.00
conv2d1     32x26x26    21632   100.00   100.00    10.71    10.71
maxpool2d2  32x13x13     5408   100.00   100.00    10.71    10.71
conv2d3     64x11x11     7744    85.71    85.71    25.00    25.00
conv2d4     64x9x9       5184    54.55    54.55    39.29    39.29
maxpool2d5  64x4x4       1024    54.55    54.55    39.29    39.29
conv2d6     96x2x2        384    63.16    63.16    67.86    67.86
maxpool2d7  96x1x1         96    63.16    63.16    67.86    67.86
dense8      64             64   100.00   100.00   100.00   100.00
dropout9    64             64   100.00   100.00   100.00   100.00
dense10     64             64   100.00   100.00   100.00   100.00
dense11     10             10   100.00   100.00   100.00   100.00

Explanation
    X, Y:    image dimensions
    cap.:    learning capacity
    cov.:    coverage of image
    magenta: capacity too low (<1/6)
    cyan:    image coverage too high (>100%)
    red:     capacity too low and coverage too high


This net is fine. The capacity never falls below 1/6, which would be 16.7%, and the coverage of the image never exceeds 100%. However, with only 4 convolutional layers, this net is not very deep and will probably not achieve the best possible results.

What we also see is the role of max pooling. If we look at 'maxpool2d5', the capacity of the convolutional layer that follows it rises again (from 54.55 to 63.16). Max pooling thus helps to increase capacity should it dip too low. However, max pooling also significantly increases the coverage of the image. So if we use max pooling too often, the coverage will quickly exceed 100% and we cannot go sufficiently deep.

Too little maxpooling

Now let us try an architecture that uses a lot of convolutional layers but only one maxpooling layer.


In [26]:
layers2 = [
    (InputLayer, {'shape': (None, X.shape[1], X.shape[2], X.shape[3])}),

    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3)}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3)}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3)}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3)}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3)}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3)}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3)}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3)}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3)}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3)}),
    (MaxPool2DLayer, {'pool_size': (2, 2)}),

    (DenseLayer, {'num_units': 64}),
    (DropoutLayer, {}),
    (DenseLayer, {'num_units': 64}),

    (DenseLayer, {'num_units': 10, 'nonlinearity': softmax}),
]

In [27]:
net2 = NeuralNet(
    layers=layers2,
    update_learning_rate=0.01,
    verbose=2,
)

In [28]:
net2.initialize()

In [29]:
layer_info(net2)


# Neural Network with 273930 learnable parameters

## Layer information

name         size        total    cap.Y    cap.X    cov.Y    cov.X
-----------  --------  -------  -------  -------  -------  -------
input0       1x28x28       784   100.00   100.00   100.00   100.00
conv2d1      32x26x26    21632   100.00   100.00    10.71    10.71
conv2d2      32x24x24    18432    60.00    60.00    17.86    17.86
conv2d3      32x22x22    15488    42.86    42.86    25.00    25.00
conv2d4      32x20x20    12800    33.33    33.33    32.14    32.14
conv2d5      32x18x18    10368    27.27    27.27    39.29    39.29
conv2d6      64x16x16    16384    23.08    23.08    46.43    46.43
conv2d7      64x14x14    12544    20.00    20.00    53.57    53.57
conv2d8      64x12x12     9216    17.65    17.65    60.71    60.71
conv2d9      64x10x10     6400    15.79    15.79    67.86    67.86
conv2d10     64x8x8       4096    14.29    14.29    75.00    75.00
maxpool2d11  64x4x4       1024    14.29    14.29    75.00    75.00
dense12      64             64   100.00   100.00   100.00   100.00
dropout13    64             64   100.00   100.00   100.00   100.00
dense14      64             64   100.00   100.00   100.00   100.00
dense15      10             10   100.00   100.00   100.00   100.00

Explanation
    X, Y:    image dimensions
    cap.:    learning capacity
    cov.:    coverage of image
    magenta: capacity too low (<1/6)
    cyan:    image coverage too high (>100%)
    red:     capacity too low and coverage too high


Here we have a very deep net but we have a problem: The lack of max pooling layers means that the capacity of the net dips below 16.7%. The corresponding layers are shown in magenta. We need to find a better solution.

Too much maxpooling

Here is an architecture with too much maxpooling. For illustrative purposes, we set the pad parameter to 1; without padding, the image size would shrink to zero, at which point the code would raise an error.


In [30]:
layers3 = [
    (InputLayer, {'shape': (None, X.shape[1], X.shape[2], X.shape[3])}),

    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
    (MaxPool2DLayer, {'pool_size': (2, 2)}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
    (MaxPool2DLayer, {'pool_size': (2, 2)}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3), 'pad': 1}),
    (MaxPool2DLayer, {'pool_size': (2, 2)}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3), 'pad': 1}),
    (MaxPool2DLayer, {'pool_size': (2, 2)}),

    (DenseLayer, {'num_units': 64}),
    (DropoutLayer, {}),
    (DenseLayer, {'num_units': 64}),

    (DenseLayer, {'num_units': 10, 'nonlinearity': softmax}),
]

In [31]:
net3 = NeuralNet(
    layers=layers3,
    update_learning_rate=0.01,
    verbose=2,
)

In [32]:
net3.initialize()

In [33]:
layer_info(net3)


# Neural Network with 166314 learnable parameters

## Layer information

name         size        total    cap.Y    cap.X    cov.Y    cov.X
-----------  --------  -------  -------  -------  -------  -------
input0       1x28x28       784   100.00   100.00   100.00   100.00
conv2d1      32x28x28    25088   100.00   100.00    10.71    10.71
conv2d2      32x28x28    25088    60.00    60.00    17.86    17.86
maxpool2d3   32x14x14     6272    60.00    60.00    17.86    17.86
conv2d4      32x14x14     6272    66.67    66.67    32.14    32.14
conv2d5      32x14x14     6272    46.15    46.15    46.43    46.43
maxpool2d6   32x7x7       1568    46.15    46.15    46.43    46.43
conv2d7      64x7x7       3136    57.14    57.14    75.00    75.00
conv2d8      64x7x7       3136    41.38    41.38   103.57   103.57
maxpool2d9   64x3x3        576    41.38    41.38   103.57   103.57
conv2d10     64x3x3        576    53.33    53.33   160.71   160.71
conv2d11     64x3x3        576    39.34    39.34   217.86   217.86
maxpool2d12  64x1x1         64    39.34    39.34   217.86   217.86
dense13      64             64   100.00   100.00   100.00   100.00
dropout14    64             64   100.00   100.00   100.00   100.00
dense15      64             64   100.00   100.00   100.00   100.00
dense16      10             10   100.00   100.00   100.00   100.00

Explanation
    X, Y:    image dimensions
    cap.:    learning capacity
    cov.:    coverage of image
    magenta: capacity too low (<1/6)
    cyan:    image coverage too high (>100%)
    red:     capacity too low and coverage too high


This net uses too much maxpooling for too small an image. The later layers, colored in cyan, would cover more than 100% of the image. So this network is clearly also suboptimal.

A good compromise

Now let us have a look at a reasonably deep architecture that satisfies the criteria we set out to meet:


In [34]:
layers4 = [
    (InputLayer, {'shape': (None, X.shape[1], X.shape[2], X.shape[3])}),

    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 32, 'filter_size': (3, 3), 'pad': 1}),
    (MaxPool2DLayer, {'pool_size': (2, 2)}),

    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3), 'pad': 1}),
    (Conv2DLayer, {'num_filters': 64, 'filter_size': (3, 3), 'pad': 1}),
    (MaxPool2DLayer, {'pool_size': (2, 2)}),

    (DenseLayer, {'num_units': 64}),
    (DropoutLayer, {}),
    (DenseLayer, {'num_units': 64}),

    (DenseLayer, {'num_units': 10, 'nonlinearity': softmax}),
]

In [35]:
net4 = NeuralNet(
    layers=layers4,
    update_learning_rate=0.01,
    verbose=2,
)

In [36]:
net4.initialize()

In [37]:
layer_info(net4)


# Neural Network with 353738 learnable parameters

## Layer information

name         size        total    cap.Y    cap.X    cov.Y    cov.X
-----------  --------  -------  -------  -------  -------  -------
input0       1x28x28       784   100.00   100.00   100.00   100.00
conv2d1      32x28x28    25088   100.00   100.00    10.71    10.71
conv2d2      32x28x28    25088    60.00    60.00    17.86    17.86
conv2d3      32x28x28    25088    42.86    42.86    25.00    25.00
conv2d4      32x28x28    25088    33.33    33.33    32.14    32.14
conv2d5      32x28x28    25088    27.27    27.27    39.29    39.29
conv2d6      32x28x28    25088    23.08    23.08    46.43    46.43
conv2d7      32x28x28    25088    20.00    20.00    53.57    53.57
maxpool2d8   32x14x14     6272    20.00    20.00    53.57    53.57
conv2d9      64x14x14    12544    31.58    31.58    67.86    67.86
conv2d10     64x14x14    12544    26.09    26.09    82.14    82.14
conv2d11     64x14x14    12544    22.22    22.22    96.43    96.43
maxpool2d12  64x7x7       3136    22.22    22.22    96.43    96.43
dense13      64             64   100.00   100.00   100.00   100.00
dropout14    64             64   100.00   100.00   100.00   100.00
dense15      64             64   100.00   100.00   100.00   100.00
dense16      10             10   100.00   100.00   100.00   100.00

Explanation
    X, Y:    image dimensions
    cap.:    learning capacity
    cov.:    coverage of image
    magenta: capacity too low (<1/6)
    cyan:    image coverage too high (>100%)
    red:     capacity too low and coverage too high


With 10 convolutional layers, this network is rather deep, given the small image size. Yet the learning capacity is always sufficiently large and the coverage never exceeds 100% of the image. This could well be a good solution. Maybe you would like to give this architecture a spin?

Note 1: The MNIST digits typically don't fill the whole 28x28 image. Therefore, an image coverage of less than 100% is probably quite acceptable here. For other image data sets such as CIFAR or ImageNet, it is recommended to cover the whole image.

Note 2: This analysis does not tell us how many feature maps (i.e. number of filters per convolutional layer) to use. Here we have to experiment with different values. Larger values mean that the network should learn more types of features but also increase the risk of overfitting (and may exceed the available memory). In general though, deeper layers (those farther down) are supposed to learn more complex features and should thus have more feature maps.

Even more information

It is possible to get more information by increasing the verbosity level beyond 2.


In [38]:
net4.verbose = 3

In [39]:
layer_info(net4)


# Neural Network with 353738 learnable parameters

## Layer information

name         size        total    cap.Y    cap.X    cov.Y    cov.X    filter Y    filter X    field Y    field X
-----------  --------  -------  -------  -------  -------  -------  ----------  ----------  ---------  ---------
input0       1x28x28       784   100.00   100.00   100.00   100.00          28          28         28         28
conv2d1      32x28x28    25088   100.00   100.00    10.71    10.71           3           3          3          3
conv2d2      32x28x28    25088    60.00    60.00    17.86    17.86           3           3          5          5
conv2d3      32x28x28    25088    42.86    42.86    25.00    25.00           3           3          7          7
conv2d4      32x28x28    25088    33.33    33.33    32.14    32.14           3           3          9          9
conv2d5      32x28x28    25088    27.27    27.27    39.29    39.29           3           3         11         11
conv2d6      32x28x28    25088    23.08    23.08    46.43    46.43           3           3         13         13
conv2d7      32x28x28    25088    20.00    20.00    53.57    53.57           3           3         15         15
maxpool2d8   32x14x14     6272    20.00    20.00    53.57    53.57           3           3         15         15
conv2d9      64x14x14    12544    31.58    31.58    67.86    67.86           6           6         19         19
conv2d10     64x14x14    12544    26.09    26.09    82.14    82.14           6           6         23         23
conv2d11     64x14x14    12544    22.22    22.22    96.43    96.43           6           6         27         27
maxpool2d12  64x7x7       3136    22.22    22.22    96.43    96.43           6           6         27         27
dense13      64             64   100.00   100.00   100.00   100.00          28          28         28         28
dropout14    64             64   100.00   100.00   100.00   100.00          28          28         28         28
dense15      64             64   100.00   100.00   100.00   100.00          28          28         28         28
dense16      10             10   100.00   100.00   100.00   100.00          28          28         28         28

Explanation
    X, Y:    image dimensions
    cap.:    learning capacity
    cov.:    coverage of image
    magenta: capacity too low (<1/6)
    cyan:    image coverage too high (>100%)
    red:     capacity too low and coverage too high


Here we get additional information about the real filter size of the convolutional layers, as well as their receptive field sizes. If the receptive field size grows too large compared to the real filter size, capacity dips too low. As receptive field size grows larger, more and more of the image is covered.
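To make this concrete, here is a small sketch that reproduces the filter, field, and capacity/coverage columns for net4's convolutional stack. The formulas are inferred from the tables above (an assumption, since we do not rely on nolearn's internal bookkeeping here): each convolution enlarges the receptive field by (filter_size - 1) times the current sampling jump, each pooling layer multiplies the jump by its pool size, capacity is the ratio of real filter size to field size, and coverage is the ratio of field size to image size.

def conv_stats(stack, image_size=28):
    field, jump = 1, 1
    rows = []
    for kind, size in stack:
        if kind == 'conv':
            real_filter = size * jump               # 'filter' column
            field += (size - 1) * jump              # 'field' column
            rows.append((real_filter, field,
                         100.0 * real_filter / field,   # 'cap.' column
                         100.0 * field / image_size))   # 'cov.' column
        else:  # 'pool': the sampling grid becomes coarser
            jump *= size
    return rows

# net4: seven 3x3 convs, a 2x2 pool, three 3x3 convs, a 2x2 pool
stack = [('conv', 3)] * 7 + [('pool', 2)] + [('conv', 3)] * 3 + [('pool', 2)]
for row in conv_stats(stack):
    print('filter %2d  field %2d  cap. %6.2f  cov. %6.2f' % row)

The last row printed corresponds to conv2d11 above: filter 6, field 27, cap. 22.22, cov. 96.43.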

Caveat

A caveat to the findings presented here is that capacity and coverage may not be calculated correctly if you use padding or a stride other than 1 for the convolutional layers. Accounting for these would make the calculation much more complicated. However, even if you use these parameters, the calculations shown here should not deviate too much, and the results may still serve as a rough guideline.

Furthermore, to our knowledge, there is no publicly available paper on this topic, so all results should be taken with care.