Our Multi-Layer Perceptron (MLP) works quite well on the MNIST dataset. What need is there for a new architecture?
The problem with the MLP is that it does not take the structure of the image into account. Indeed, the input data, an array with 784 entries, could very well be shuffled (entry 0 becomes entry 26, entry 26 becomes entry 54, and so on) and it would not matter to the MLP, as long as the same permutation is applied to every image (try it!).
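If you want to try it, here is a minimal NumPy sketch (an illustration, not part of the tutorial code) of why a fixed shuffling of the pixels is invisible to a fully connected layer: permuting the columns of the weight matrix in the same way recovers exactly the same pre-activations.

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(784)            # a flattened 28x28 image
W = rng.randn(30, 784)       # weights of a fully connected hidden layer
perm = rng.permutation(784)  # a fixed shuffling of the pixel order

# Shuffling the pixels and shuffling the weight columns accordingly
# gives exactly the same pre-activations: pixel order carries no
# information that the MLP could not absorb into its weights.
print(np.allclose(W.dot(x), W[:, perm].dot(x[perm])))  # True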
This means that the two images below are essentially the same for our neural network!
(Going from left to right, we applied a random permutation of the grayscale values of this MNIST image).
Clearly, our architecture wastes information. In addition, it is computationally expensive because the network is fully connected: for the MNIST dataset this is still feasible, but for higher-resolution images, learning all the weights and biases quickly becomes prohibitive.
We will now see how a cleverer architecture can be used to efficiently tackle image classification.
A simple solution to the computational cost is to restrict the connections between units: we now connect each unit in the hidden layer to a small patch of contiguous pixels from the input image.
This is in close analogy to the way visual systems work: neurons in the visual cortex have localized receptive fields and will only respond to stimuli from a given location.
We will also make use of shared weights. By shared weights, we mean that we apply the same transformation to every patch of the input image. This ensures that every unit in the hidden layer detects exactly the same pattern, albeit at a different location of the input image.
This design choice is desirable because natural images have the property of being "stationary". For instance, if we find an operation that allows us to detect a stick in one part of the image, we should be able to detect a stick elsewhere by applying the same operation to other patches of the image.
Shared weights also immediately decrease the number of parameters to be learnt and thus the computational cost.
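To see the saving concretely (a rough count, using the 5x5 filters and 28x28 MNIST images of this tutorial): fully connecting the 784 input pixels to a 24x24 hidden layer would require 784 x 576 ≈ 450,000 weights, whereas a single shared 5x5 kernel has only 25 weights plus one bias, reused at every location.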
The way Convolutional Networks operate is best explained visually:
In the GIF above, the green array is a 5x5 schematic input image, with cells indicating pixels and cell values indicating color intensity. The pink array is the 3x3 hidden layer.
Each unit in the hidden layer is connected to a patch of the input image (the sliding 3x3 orange window). The value of each cell of the hidden layer is computed with the same operation (i.e. the weights of the sliding orange window do not change).
For instance, the first unit of the hidden layer is connected to the following patch of the input image:
$$P= \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}$$
To compute the value of the first unit of the hidden layer, we do:
\begin{eqnarray} \text{value} = \text{Sum}(P \odot W) &=& 1\times 1 + 1\times 0 + 1\times 1 \\ && +\; 0\times 0 + 1\times 1 + 1\times 0 \\ && +\; 0\times 1 + 0\times 0 + 1\times 1 \\ &=& 4 \tag{1} \end{eqnarray}
where $W$, the weight matrix, is defined as:
$$W= \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}$$
Of course, we won't restrict ourselves to a single weight matrix (also called a kernel or filter). We are not interested in sticks only: we may also want to detect stars or ellipses. This is why we use several weight matrices at this stage, as shown in the image below:
In this example, we apply 3 different 5x5 weight matrices to the input MNIST data, obtaining convolved features of size 3 x (28-5+1) x (28-5+1) = 3 x 24 x 24.
As an aside, note that the operation in Eq. (1) is a form of convolution, hence the name of the architecture.
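For readers who prefer to see the operation in code, here is a minimal NumPy sketch (an illustration, not part of the Theano program below) that applies the sliding-window operation of Eq. (1) at every location of an image:

import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and, at each
    location, sum the element-wise products, as in Eq. (1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # Sum(P . W)
    return out

# The patch P and the weight matrix W from the text:
P = np.array([[1, 1, 1],
              [0, 1, 1],
              [0, 0, 1]])
W = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])
print(np.sum(P * W))           # 4, the value of the first hidden unit
print(convolve2d_valid(P, W))  # [[ 4.]]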
It is possible to use all the features in the hidden layer directly. However, this can quickly lead to computational issues: with 20 filters of size 5x5 applied to a 28x28 MNIST image, the hidden layer already contains 20 x 24 x 24 = 11,520 units per image, and a classifier built on top of all of them would be expensive to train and prone to overfitting.
To address this, we condense the information with a so-called pooling layer. For instance, we could take the mean of neighboring cells in the hidden layer and obtain a feature vector of much lower dimension.
This operation greatly reduces the computational burden and also helps with overfitting. The figure below shows how this works in practice:
The hidden layer units are divided into 4 groups (highlighted by the sliding red window). We then take the mean (or the max, or whatever summary statistic we choose) of each group. This yields a vector of much lower dimension.
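Here is a minimal NumPy sketch of this pooling step (again just an illustration), which condenses a feature map by taking the max or the mean over non-overlapping 2x2 blocks:

import numpy as np

def pool2d(feature_map, size=2, op=np.max):
    """Downsample `feature_map` by applying `op` (e.g. np.max or np.mean)
    over non-overlapping `size` x `size` blocks."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop the border if it does not fit
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return op(blocks, axis=(1, 3))

fm = np.array([[1., 3., 2., 0.],
               [4., 2., 1., 1.],
               [0., 1., 5., 2.],
               [2., 2., 1., 3.]])
print(pool2d(fm, 2, np.max))   # [[ 4.  2.]  [ 2.  5.]]
print(pool2d(fm, 2, np.mean))  # [[ 2.5   1.  ]  [ 1.25  2.75]]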
Combining all that we've seen above, we get the following architecture:
N.B. The size of the convolution filter (5x5 in this example) is a hyperparameter, as is the stride length (which specifies by how many pixels the window slides). Both can and should be tuned.
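As a quick worked example of how these hyperparameters determine the size of the hidden layer (using the values of this tutorial): a filter of side length F applied with stride S to an N x N image produces a feature map of side length
$$\frac{N - F}{S} + 1 = \frac{28 - 5}{1} + 1 = 24,$$
and the subsequent 2x2 pooling reduces this to 12x12, which is where the n_in=20*12*12 in the code further down comes from.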
We will now use Theano to build our neural network.
From Wikipedia:
Theano is a numerical computation library for Python. In Theano, computations are expressed using a NumPy-like syntax and compiled to run efficiently on either CPU or GPU architectures.
If you have a GPU on your machine, Theano can greatly speed up the computations of deep neural networks. If you don't, you can run a GPU instance on Amazon Web Services.
An easy way to install Theano and its dependencies is to use the Anaconda distribution from Continuum Analytics.
To do so, download the latest package from the Anaconda website. For Linux, as of this writing (January 2016), the command is:
wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda2-2.4.1-Linux-x86_64.sh
Then, in the directory where it was downloaded:
bash Anaconda2-2.4.1-Linux-x86_64.sh
Alternatively, follow the installation instructions at http://docs.continuum.io/anaconda/install
Then update conda and install Theano:
conda update conda
conda update ipython ipython-notebook ipython-qtconsole
pip install Theano
You should now be able to use Theano with Python. All the code below can be run without a GPU; however, activating the GPU may greatly speed up the examples.
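To quickly check that the installation works, you can, for instance, compile and evaluate a trivial symbolic expression (a minimal sanity check, independent of the code below):

import theano
import theano.tensor as T

x = T.dscalar('x')                 # a symbolic scalar
f = theano.function([x], x ** 2)   # compile the expression x^2
print(f(3.0))                      # prints 9.0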
In [1]:
"""
This program uses the Theano library to speed up computations.
Besides running on a GPU when one is available, Theano
automatically computes the gradients required by the
backpropagation algorithm.

The program also implements dropout (a regularisation method in
which some units are randomly deactivated during training).
It helps performance, but you can safely ignore it in this tutorial.
"""
#### Libraries
# Standard library
import cPickle
import gzip
# Third-party libraries
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv
from theano.tensor.nnet import softmax
from theano.tensor import shared_randomstreams
from theano.tensor.signal import downsample
# Activation functions for neurons
from theano.tensor.nnet import sigmoid
from theano.tensor import tanh
#### Constants
GPU = False
if GPU:
print "Running under a GPU"
try: theano.config.device = 'gpu'
except: pass # it's already set
theano.config.floatX = 'float32'
else:
print "Running with a CPU"
#### Load the MNIST data
def load_data_shared(filename="../data/mnist.pkl.gz"):
f = gzip.open(filename, 'rb')
training_data, validation_data, test_data = cPickle.load(f)
f.close()
# We store the data in so-called shared-variables
#The reason we store our dataset in shared variables is to allow
#Theano to copy it into the GPU memory (when code is run on GPU).
#Since copying data into the GPU is slow, copying a minibatch everytime
#is needed (the default behaviour if the data is not in a shared
#variable) would lead to a large decrease in performance.
def shared(data):
"""Place the data into shared variables. This allows Theano to copy
the data to the GPU, if one is available.
"""
shared_x = theano.shared(
np.asarray(data[0], dtype=theano.config.floatX), borrow=True)
shared_y = theano.shared(
np.asarray(data[1], dtype=theano.config.floatX), borrow=True)
return shared_x, T.cast(shared_y, "int32")
return [shared(training_data), shared(validation_data), shared(test_data)]
#### Main class used to construct and train networks
class Network(object):
def __init__(self, layers, mini_batch_size):
"""Takes a list of `layers`, describing the network architecture, and
a value for the `mini_batch_size` to be used during training
by stochastic gradient descent.
"""
self.layers = layers
self.mini_batch_size = mini_batch_size
self.params = [param for layer in self.layers for param in layer.params]
# Theano symbolic variables
self.x = T.matrix("x")
self.y = T.ivector("y")
init_layer = self.layers[0]
init_layer.set_inpt(self.x, self.x, self.mini_batch_size)
for j in xrange(1, len(self.layers)):
prev_layer, layer = self.layers[j-1], self.layers[j]
layer.set_inpt(
prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)
self.output = self.layers[-1].output
self.output_dropout = self.layers[-1].output_dropout
def SGD(self, training_data, epochs, mini_batch_size, eta,
validation_data, test_data, lmbda=0.0):
"""Train the network using mini-batch stochastic gradient descent."""
training_x, training_y = training_data
validation_x, validation_y = validation_data
test_x, test_y = test_data
# compute number of minibatches for training, validation and testing
num_training_batches = size(training_data)/mini_batch_size
num_validation_batches = size(validation_data)/mini_batch_size
num_test_batches = size(test_data)/mini_batch_size
# define the (regularized) cost function, symbolic gradients, and updates
l2_norm_squared = sum([(layer.w**2).sum() for layer in self.layers])
# set up the (regularised) log likelihood function
cost = self.layers[-1].cost(self)+\
0.5*lmbda*l2_norm_squared/num_training_batches
# compute the gradient of our cost function
grads = T.grad(cost, self.params)
updates = [(param, param-eta*grad)
for param, grad in zip(self.params, grads)]
# define functions to train a mini-batch, and to compute the
# accuracy in validation and test mini-batches.
i = T.lscalar() # mini-batch index
#train_mb is a symbolic theano function
train_mb = theano.function(
[i], cost, updates=updates,
givens={
self.x:
training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
validate_mb_accuracy = theano.function(
[i], self.layers[-1].accuracy(self.y),
givens={
self.x:
validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
test_mb_accuracy = theano.function(
[i], self.layers[-1].accuracy(self.y),
givens={
self.x:
test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
self.test_mb_predictions = theano.function(
[i], self.layers[-1].y_out,
givens={
self.x:
test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
# Do the actual training
best_validation_accuracy = 0.0
for epoch in xrange(epochs):
for minibatch_index in xrange(num_training_batches):
iteration = num_training_batches*epoch+minibatch_index
if iteration % 1000 == 0:
print("Training mini-batch number {0}".format(iteration))
cost_ij = train_mb(minibatch_index)
if (iteration+1) % num_training_batches == 0:
validation_accuracy = np.mean(
[validate_mb_accuracy(j) for j in xrange(num_validation_batches)])
print("Epoch {0}: validation accuracy {1:.2%}".format(
epoch, validation_accuracy))
if validation_accuracy >= best_validation_accuracy:
print("This is the best validation accuracy to date.")
best_validation_accuracy = validation_accuracy
best_iteration = iteration
if test_data:
test_accuracy = np.mean(
[test_mb_accuracy(j) for j in xrange(num_test_batches)])
print('The corresponding test accuracy is {0:.2%}'.format(
test_accuracy))
print("Finished training network.")
print("Best validation accuracy of {0:.2%} obtained at iteration {1}".format(
best_validation_accuracy, best_iteration))
print("Corresponding test accuracy of {0:.2%}".format(test_accuracy))
#### Define layer types
class ConvPoolLayer(object):
"""Used to create a combination of a convolutional and a max-pooling
layer. A more sophisticated implementation would separate the
two, but for our purposes we'll always use them together, and it
simplifies the code, so it makes sense to combine them.
"""
def __init__(self, filter_shape, image_shape, poolsize=(2, 2),
activation_fn=sigmoid):
"""`filter_shape` is a tuple of length 4, whose entries are the number
of filters, the number of input feature maps, the filter height, and the
filter width.
`image_shape` is a tuple of length 4, whose entries are the
mini-batch size, the number of input feature maps, the image
height, and the image width.
`poolsize` is a tuple of length 2, whose entries are the y and
x pooling sizes.
"""
self.filter_shape = filter_shape
self.image_shape = image_shape
self.poolsize = poolsize
self.activation_fn=activation_fn
# initialize weights and biases
n_out = (filter_shape[0]*np.prod(filter_shape[2:])/np.prod(poolsize))
self.w = theano.shared(
np.asarray(
np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),
dtype=theano.config.floatX),
borrow=True)
self.b = theano.shared(
np.asarray(
np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),
dtype=theano.config.floatX),
borrow=True)
self.params = [self.w, self.b]
def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
self.inpt = inpt.reshape(self.image_shape)
conv_out = conv.conv2d(
input=self.inpt, filters=self.w, filter_shape=self.filter_shape,
image_shape=self.image_shape)
pooled_out = downsample.max_pool_2d(
input=conv_out, ds=self.poolsize, ignore_border=True)
self.output = self.activation_fn(
pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))
self.output_dropout = self.output # no dropout in the convolutional layers
class FullyConnectedLayer(object):
def __init__(self, n_in, n_out, activation_fn=sigmoid, p_dropout=0.0):
self.n_in = n_in
self.n_out = n_out
self.activation_fn = activation_fn
self.p_dropout = p_dropout
# Initialize weights and biases
        # Initialise with a Gaussian of suitably small width.
        # Shared variables let Theano keep the parameters on the GPU
        # (when one is used), avoiding repeated CPU-GPU transfers.
self.w = theano.shared(
np.asarray(
np.random.normal(
loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
dtype=theano.config.floatX),
name='w', borrow=True)
self.b = theano.shared(
np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)),
dtype=theano.config.floatX),
name='b', borrow=True)
self.params = [self.w, self.b]
def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
self.inpt = inpt.reshape((mini_batch_size, self.n_in))
self.output = self.activation_fn(
(1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
self.y_out = T.argmax(self.output, axis=1)
self.inpt_dropout = dropout_layer(
inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
self.output_dropout = self.activation_fn(
T.dot(self.inpt_dropout, self.w) + self.b)
def accuracy(self, y):
"Return the accuracy for the mini-batch."
return T.mean(T.eq(y, self.y_out))
class SoftmaxLayer(object):
def __init__(self, n_in, n_out, p_dropout=0.0):
self.n_in = n_in
self.n_out = n_out
self.p_dropout = p_dropout
# Initialize weights and biases
self.w = theano.shared(
np.zeros((n_in, n_out), dtype=theano.config.floatX),
name='w', borrow=True)
self.b = theano.shared(
np.zeros((n_out,), dtype=theano.config.floatX),
name='b', borrow=True)
self.params = [self.w, self.b]
def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
self.inpt = inpt.reshape((mini_batch_size, self.n_in))
self.output = softmax((1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
self.y_out = T.argmax(self.output, axis=1)
self.inpt_dropout = dropout_layer(
inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
self.output_dropout = softmax(T.dot(self.inpt_dropout, self.w) + self.b)
def cost(self, net):
"Return the log-likelihood cost."
return -T.mean(T.log(self.output_dropout)[T.arange(net.y.shape[0]), net.y])
def accuracy(self, y):
"Return the accuracy for the mini-batch."
return T.mean(T.eq(y, self.y_out))
#### Miscellanea
def size(data):
"Return the size of the dataset `data`."
return data[0].get_value(borrow=True).shape[0]
def dropout_layer(layer, p_dropout):
srng = shared_randomstreams.RandomStreams(
np.random.RandomState(0).randint(999999))
mask = srng.binomial(n=1, p=1-p_dropout, size=layer.shape)
return layer*T.cast(mask, theano.config.floatX)
The accuracy should reach approximately 97.8% at best.
N.B. The full training takes some time; feel free to stop it at any point. If you have a GPU, feel free to activate it (set the GPU flag to True in the code above): it should speed up the computations!
In [ ]:
training_data, validation_data, test_data = load_data_shared()
mini_batch_size = 10
net = Network([
FullyConnectedLayer(n_in=784, n_out=100),
SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.1,
validation_data, test_data)
The accuracy should reach approximately 98.8% at best. While that may not seem like a big improvement, it actually corresponds to a 45% decrease in the error rate (from roughly 2.2% to 1.2%) compared to the previous network!
In [47]:
net = Network([
ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
filter_shape=(20, 1, 5, 5),
poolsize=(2, 2)),
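        # The 28x28 image gives 20 feature maps of 24x24 after the 5x5
        # convolution, then 12x12 after 2x2 pooling, hence n_in=20*12*12 below.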
FullyConnectedLayer(n_in=20*12*12, n_out=100),
SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.1,
validation_data, test_data)
In [37]:
# Let's start by importing relevant packages :
import cPickle as pickle
import subprocess
import os
import struct
import numpy as np
from matplotlib.gridspec import GridSpec
import matplotlib.pylab as plt
import matplotlib as mpl
# Define a function to unzip the mnist data and load it
def get_mnist_data():
""" Unzip mnist data if needed and load it"""
if not os.path.isfile("./Data/mnist.pkl"):
subprocess.call("gunzip -k ./Data/mnist.pkl.gz".split(" "))
with open("./Data/mnist.pkl", "r") as f:
return pickle.load(f)
p = get_mnist_data()
print p[0][0][0].shape, p[0][1].shape
im, num = p[0][0][0], p[0][1][0]
im = np.reshape(im, (28,28))
imr = im.copy().ravel()
# Shuffle all the pixel values of the copy (a random permutation
# of the grayscale values, as described above)
np.random.shuffle(imr)
imr = imr.reshape((28, 28))
fig = plt.figure()
gs = GridSpec(1, 2, bottom=0.18, left=0.18, right=0.88)
ax = fig.add_subplot(gs[0])
axshuff = fig.add_subplot(gs[1])
#Plot the first MNIST image in the data and its
# shuffled version side by side.
image = ax.imshow(im, cmap=mpl.cm.Greys)
imageshuff = axshuff.imshow(imr, cmap=mpl.cm.Greys)
imageshuff.set_interpolation('nearest')
image.set_interpolation('nearest')
plt.savefig("Figures/rand_shuffle.png", bbox_inches='tight')
Figures courtesy of Michael Nielsen and Stanford University