This ipython notebook will teach you the basics of how convolutional networks work, and show you how to train convolutional networks in pylearn2.
To do this, we will go over several concepts:
Part 1: What pylearn2 is doing for you in this example
- Review of multilayer perceptrons, and how convolutional networks are similar
- Convolution and the equivariance property
- Pooling and the invariance property
- A note on using convolution in research papers
Part 2: How to use pylearn2 to train a convolutional network
- pylearn2 Spaces
- MNIST classification example
Note that this won't explain in detail how the individual classes are implemented. The classes follow reasonable naming conventions and have fairly thorough docstrings, but if you have trouble understanding them, write to me and I might add a part 3 explaining how some of the parts work under the hood.
Please write to pylearn-dev@googlegroups.com if you encounter any problem with this tutorial.
Before running this notebook, you must have installed pylearn2. Follow the download and installation instructions if you have not yet done so.
This tutorial also assumes you already know about multilayer perceptrons, and know how to train and evaluate a multilayer perceptron in pylearn2. If not, work through multilayer_perceptron.ipynb before starting this tutorial.
It's also strongly recommended that you run this notebook with THEANO_FLAGS="device=gpu". This is a processing-intensive example, and a GPU will make it run much faster if you have one available. Execute the next cell to verify that you are using the GPU.
In [1]:
import theano
print theano.config.device  # should print 'gpu' if Theano is using the GPU
In this part, we won't get into any specifics of pylearn2 yet. We'll just discuss what a convolutional network is. If you already know about convolutional networks, feel free to skip to part 2.
In multilayer_perceptron.ipynb, we saw how the multilayer perceptron (MLP) is a versatile model that can do many things. In this series of tutorials, we think of it as a classification model that learns to map an input vector $x$ to a probability distribution $p(y \mid x)$, where $y$ is a categorical variable with $k$ possible values. Given a dataset $\mathcal{D}$ of $(x, y)$ pairs, we can train any such probabilistic model by maximizing the log likelihood,
$$ \sum_{(x,y) \in \mathcal{D}} \log p(y \mid x). $$

The multilayer perceptron defines $p(y \mid x)$ to be the composition of several simpler functions. Each function being composed can be thought of as another "layer" or "stage" of processing.
A convolutional network is nothing but a multilayer perceptron where some layers take a very special form, which we will call "convolutional layers". These layers are specially designed for processing inputs where the indices of the elements have some topological significance.
For example, if we represent a grayscale image as an array $I$ with the array indices corresponding to physical locations in the image, then we know that the element $I_{i,j}$ represents something that is spatially close to the element $I_{i+1,j}$. This is in contrast to a vector representation of an image. If $I$ is a vector, then $I_i$ might not be very close at all to $I_{i+1}$, depending on whether the image was converted to vector form in row-major or column-major order, and on whether $i$ is close to the end of a row or column.
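To make this concrete, here is a tiny numpy sketch (illustrative only; not part of the tutorial's code) showing how row-major flattening separates spatial neighbours:
In [ ]:
import numpy as np

# Flatten a 28x28 image in row-major order. Vertically adjacent pixels
# end up 28 positions apart in the vector, and the vector neighbour of
# the last pixel in a row is the first pixel of the next row.
I = np.arange(28 * 28).reshape(28, 28)  # I[r, c] stores its own flat index
v = I.ravel()                           # row-major vector form
print v[27], v[28]  # adjacent in the vector, but at opposite ends of their rows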
Other kinds of data with topological structure in the indices include time series data, where a series $S$ can be indexed by a time variable $t$. We know that $S_t$ and $S_{t+1}$ were measured close together in time. We can also think of the (row, column, time) indices of video data as providing topological information.
Suppose $T$ is a function that can translate (move) an input in the space defined by its indices by some amount $x$. In other words, $T(S,x)_i = S_{i-x}$. Convolutional layers are an example of a function $f$ designed to have the property $f(T(S,x)) \approx f(S)$ for small $x$.
This means if a neural network can recognize a handwritten digit in one position, it can recognize it when it is slightly shifted to a nearby position. Being able to recognize shifted versions of previously seen inputs greatly improves the generalization performance of convolutional networks.
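Here is a small, self-contained numpy sketch illustrating the property (not from the tutorial; the kernel and pooling width are made up): a feature built from convolution followed by max pooling barely changes when the input is shifted by one position.
In [ ]:
import numpy as np

def T(S, x):
    # translate the signal S by x positions
    return np.roll(S, x)

def f(S):
    # toy "convolutional layer": smooth with a small kernel,
    # then "max pooling" over non-overlapping windows of width 4
    w = np.array([0.25, 0.5, 0.25])
    h = np.convolve(S, w, mode='same')
    return h.reshape(-1, 4).max(axis=1)

S = np.zeros(16)
S[5] = 1.0                             # a lone "penstroke" at position 5
print np.abs(f(S) - f(T(S, 1))).max()  # near zero: features barely change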
Convolution and the equivariance property: TODO
Pooling and the invariance property: TODO
A note on using convolution in research papers: TODO
Now that we've described the theory of what we're going to do, it's time to do it! This part describes how to use pylearn2 to run the algorithms described above.
As in the MLP tutorial, we will use the convolutional net to do optical character recognition on the MNIST dataset.
In many places in pylearn2, we would like to be able to process several different kinds of data. In previous tutorials, we've just talked about data that could be preprocessed into a vector representation. Our algorithms all worked on vector spaces. However, it's often useful to format data in other ways. The pylearn2 Space object is used to specify the format for data. The VectorSpace class represents the typical vector formatted data we've used so far. The only thing it needs to encode about the data is its dimensionality, i.e., how many elements the vector has. In this tutorial we will start to explicitly represent images as having 2D structure, so we need to use the Conv2DSpace. The Conv2DSpace object describes how to represent a collection of images as a 4-tensor.
One thing the Conv2DSpace object needs to describe is the shape of the space--how big is the image in terms of rows and columns of pixels? Also, the image may have multiple channels. In this example, we use a grayscale input image, so the input only has one channel. Color images require three channels to store the red, green, and blue pixels at each location. We can also think of the output of each convolution layer as living in a Conv2DSpace, where each kernel outputs a different channel. Finally, the Conv2DSpace specifies what each axis of the 4-tensor means. The default is for the first axis to index over different examples, the second axis to index over channels, and the last two to index over rows and columns, respectively. This is the format that theano's 2D convolution code uses, but other libraries exist that use other formats and we often need to convert between them.
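As a concrete illustration, here is a minimal sketch (not part of the tutorial's YAML; the variable name is ours) of how an MNIST-sized input space could be constructed directly in Python:
In [ ]:
from pylearn2.space import Conv2DSpace

# Batches of 28x28 grayscale images. axes=('b', 'c', 0, 1) means the
# 4-tensor is indexed as (example, channel, row, column), which is the
# layout theano's 2D convolution code uses.
mnist_space = Conv2DSpace(shape=[28, 28], num_channels=1,
                          axes=('b', 'c', 0, 1))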
Setting up a convolutional network in pylearn2 is essentially the same as setting up any other MLP. In the YAML experiment description below, there are really just two things to take note of.
First, rather than using "nvis" to specify the input that the MLP will take, we use a parameter called "input_space". "nvis" is actually shorthand; if you pass an integer n to nvis, it will set input_space to VectorSpace(n). Now that we are using a convolutional network, we need the input to be formatted as a collection of images so that the convolution operator will have a 2D space to work on.
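To make the shorthand explicit, here is a minimal sketch (the layer settings are placeholders, not the tutorial's values) of the two equivalent ways of declaring a flat input:
In [ ]:
from pylearn2.models.mlp import MLP, Softmax
from pylearn2.space import VectorSpace

# These two constructions build the same model: nvis=784 is expanded
# internally into input_space=VectorSpace(dim=784).
mlp_a = MLP(nvis=784,
            layers=[Softmax(n_classes=10, layer_name='y', irange=.05)])
mlp_b = MLP(input_space=VectorSpace(dim=784),
            layers=[Softmax(n_classes=10, layer_name='y', irange=.05)])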
Second, we make a few layers of the network be "ConvRectifiedLinear" layers. Putting some convolutional layers in the network makes those layers invariant to small translations, so the job of the remaining layers is much easier.
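For reference, a single convolutional layer can be instantiated like this; the hyperparameter values below are illustrative guesses, not necessarily those used in conv.yaml:
In [ ]:
from pylearn2.models.mlp import ConvRectifiedLinear

# One rectified-linear convolutional layer: 64 feature maps produced by
# 5x5 kernels, followed by 4x4 max pooling applied with a stride of 2.
h2 = ConvRectifiedLinear(layer_name='h2',
                         output_channels=64,
                         irange=.05,
                         kernel_shape=[5, 5],
                         pool_shape=[4, 4],
                         pool_stride=[2, 2])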
We don't need to do anything special to make the Softmax layer on top work with these convolutional layers. The MLP class will tell the Softmax class that its input is now coming from a Conv2DSpace. The Softmax layer will then use the Conv2DSpace's convert method to convert the 2D output from the convolutional layer into a batch of vector-valued examples.
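Here is a rough sketch of what that conversion amounts to, using the numpy-side formatting API (this snippet is illustrative; the shapes are made up):
In [ ]:
import numpy as np
from pylearn2.space import Conv2DSpace, VectorSpace

# Flatten a batch of 2D feature maps into a batch of vectors, the same
# kind of reformatting the MLP performs between a convolutional layer
# and the Softmax layer.
conv_space = Conv2DSpace(shape=[4, 4], num_channels=2, axes=('b', 'c', 0, 1))
vec_space = VectorSpace(dim=4 * 4 * 2)
batch = np.zeros((10, 2, 4, 4), dtype='float32')  # 10 examples
flat = conv_space.np_format_as(batch, vec_space)
print flat.shape  # (10, 32)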
The model and training algorithm are defined in the conv.yaml file. Here we load it and set some of its hyperparameters.
In [1]:
train = open('conv.yaml', 'r').read()
train_params = {'train_stop': 50000,
                'valid_stop': 60000,
                'test_stop': 10000,
                'batch_size': 100,
                'output_channels_h2': 64,
                'output_channels_h3': 64,
                'max_epochs': 500,
                'save_path': '.'}
train = train % train_params
print train
Now, we use pylearn2's yaml_parse.load to construct the Train object, and run its main loop. The same thing could be accomplished by running pylearn2's train.py script on a file containing the yaml string.
Execute the next cell to train the model. This will take several minutes, and possibly as much as a few hours, depending on how fast your computer is.
In [2]:
from pylearn2.config import yaml_parse
train = yaml_parse.load(train)
train.main_loop()
Compiling the theano functions used to run the network will take a long time for this example. This is because the number of theano variables and ops used to specify the computation is relatively large. There is no single theano op for doing max pooling with overlapping pooling windows, so pylearn2 builds a large expression graph using indexing operations to accomplish the max pooling.
After the model is trained, we can use the print_monitor script to print the last monitoring entry of a saved model. By running it on "convolutional_network_best.pkl", we can see the performance of the model at the point where it did the best on the validation set.
In [ ]:
!print_monitor.py convolutional_network_best.pkl | grep test_y_misclass
The test set error has dropped to 0.74%! This is a big improvement over the standard MLP.
We can also look at the convolution kernels learned by the first layer, to see that the network is looking for shifted versions of small pieces of penstrokes.
In [ ]:
!show_weights.py convolutional_network_best.pkl
You can find more information on convolutional networks from the following sources:
LISA lab's Deep Learning Tutorials: Convolutional Neural Networks (LeNet)
This is by no means a complete list.