In [1]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.gray()
import numpy as np
import theano
import theano.tensor as T
from theano import shared, function
theano.config.floatX = 'float32'
rng = np.random.RandomState(42)
import sys
import time
We will make heavy use of the resources in the Theano deep learning tutorial, which is integrated into our git repository as a submodule. You can clone the repository with the following command (be sure to include --recursive, or you won't get the Theano deep learning tutorial):
git clone --recursive https://github.com/graphific/DL-Meetup-intro.git
If you already cloned the repository but the DeepLearningTutorials folder is empty, you need to check out the submodule. Make sure you are in the DL-Meetup-intro folder and then execute the following command:
git submodule update --init --recursive
Now we have to add this directory to the Python path. Depending on the location of your git repository, you might have to change this path.
In [2]:
sys.path.insert(1,'DeepLearningTutorials/code')
#sys.path
In [3]:
from logistic_sgd import load_data
dataset='mnist.pkl.gz'
# If not already cached, this function downloads the data
datasets = load_data(dataset)
train_set_x, train_set_y = datasets[0]
valid_set_x, valid_set_y = datasets[1]
test_set_x, test_set_y = datasets[2]
In [4]:
print "Size of the training data matrix: ", train_set_x.get_value().shape
The data is stored with one training sample per row. The original image patches are $28 \times 28$ pixels, hence we have $28 \cdot 28 = 784$ feature columns in our data matrix. We can reshape each row of our data matrix back to the original $28 \times 28$ image patch to visualize what the data looks like.
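The reshape step can be illustrated with plain NumPy (using a random stand-in array here rather than the real MNIST matrix, purely for the sketch):

```python
import numpy as np

# Stand-in for the (n_samples, 784) MNIST data matrix: 5 flattened patches.
fake_data = np.random.rand(5, 784).astype('float32')

# Each row is one flattened 28x28 patch; reshaping recovers the image.
first_image = fake_data[0].reshape(28, 28)
print(first_image.shape)  # (28, 28)
```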
In [5]:
from utils import tile_raster_images
samples = tile_raster_images(train_set_x.get_value(), img_shape=(28,28), tile_shape=(5,10), tile_spacing=(0, 0),
scale_rows_to_unit_interval=True,
output_pixel_vals=True)
plt.imshow(samples)
plt.show()
We'll create a convolutional network to classify digits. We use code from the Deep Network Tutorials to make this a little easier, particularly the Convolutional Layer class LeNetConvPoolLayer, defined in convolutional_mlp.py.
We'll build up the network layer by layer in the cells below.
We'll train with mini-batches of size 100. (Feel free to change this, but for demo purposes 100 is good enough and fast.)
In [6]:
# Setup 1: parameters of the network and training
from convolutional_mlp import LeNetConvPoolLayer
from mlp import HiddenLayer
from logistic_sgd import LogisticRegression
# network parameters
num_kernels = [10, 10]
kernel_sizes = [(9, 9), (5, 5)]
sigmoidal_output_size = 20
# training parameters
learning_rate = 0.1
batch_size = 100
For efficiency, we break the data into batches. Our network will take batch_size images at a time as input.
In [7]:
# Setup 2: compute batch sizes for train/test/validation
# borrow=True gets us the value of the variable without making a copy.
n_train_batches = train_set_x.get_value(borrow=True).shape[0]
n_test_batches = test_set_x.get_value(borrow=True).shape[0]
n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
n_train_batches /= batch_size
n_test_batches /= batch_size
n_valid_batches /= batch_size
In [8]:
#fast hack to limit datasize, feel free to skip this step:
n_train_batches /= 10
n_test_batches /= 10
n_valid_batches /= 10
In [9]:
# Setup 3.
# Declare inputs to network - x and y are placeholders
# that will be used in the training/testing/validation functions below.
x = T.matrix('x') # input image data
y = T.ivector('y') # input label data
The first layer takes (batch_size, 1, 28, 28) as input, convolves it with 10 different 9x9 filters, and then downsamples (via maxpooling) in a 2x2 region. Each filter/maxpool combination produces an output of size (28-9+1)/2 = 10 on a side.
The size of the first layer's output is therefore (batch_size, 10, 10, 10).
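The size arithmetic for a "valid" convolution followed by 2x2 max-pooling can be checked directly with a small helper (not part of the tutorial code, just a sanity check):

```python
def conv_pool_edge(input_edge, kernel_edge, pool=2):
    """Edge length after a valid convolution followed by pool x pool max-pooling."""
    conv_edge = input_edge - kernel_edge + 1
    assert conv_edge % pool == 0, "pooling region must divide the conv output evenly"
    return conv_edge // pool

print(conv_pool_edge(28, 9))  # 10: first layer, 9x9 kernels on 28x28 input
print(conv_pool_edge(10, 5))  # 3:  second layer, 5x5 kernels on 10x10 input
```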
In [10]:
layer0_input_size = (batch_size, 1, 28, 28) # fixed size from the data
edge0 = (28 - kernel_sizes[0][0] + 1) / 2
layer0_output_size = (batch_size, num_kernels[0], edge0, edge0)
# check that we have an even multiple of 2 before pooling
assert ((28 - kernel_sizes[0][0] + 1) % 2) == 0
# The actual input is the placeholder x reshaped to the input size of the network
layer0_input = x.reshape(layer0_input_size)
layer0 = LeNetConvPoolLayer(rng,
input=layer0_input,
image_shape=layer0_input_size,
filter_shape=(num_kernels[0], 1) + kernel_sizes[0],
poolsize=(2, 2))
The second layer takes (batch_size, 10, 10, 10) as input, convolves it with 10 different 10x5x5 filters, and then downsamples (via maxpooling) in a 2x2 region. Each filter/maxpool combination produces an output of size (10-5+1)/2 = 3 on a side.
The size of the second layer's output is therefore (batch_size, 10, 3, 3).
In [11]:
layer1_input_size = layer0_output_size
edge1 = (edge0 - kernel_sizes[1][0] + 1) / 2
layer1_output_size = (batch_size, num_kernels[1], edge1, edge1)
# check that we have an even multiple of 2 before pooling
assert ((edge0 - kernel_sizes[1][0] + 1) % 2) == 0
layer1 = LeNetConvPoolLayer(rng,
input=layer0.output,
image_shape=layer1_input_size,
filter_shape=(num_kernels[1], num_kernels[0]) + kernel_sizes[1],
poolsize=(2, 2))
In [12]:
layer2_input = layer1.output.flatten(2)
layer2 = HiddenLayer(rng,
input=layer2_input,
n_in=num_kernels[1] * edge1 * edge1,
n_out=sigmoidal_output_size,
activation=T.tanh)
In [13]:
layer3 = LogisticRegression(input=layer2.output,
n_in=sigmoidal_output_size,
n_out=10)
In [14]:
# The cost we minimize during training is the NLL of the model.
# Recall: y is a placeholder we defined above
cost = layer3.negative_log_likelihood(y)
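For reference, the negative log-likelihood is the mean of -log p(correct class) over the batch. A plain NumPy sketch with made-up class probabilities (not the model's actual outputs):

```python
import numpy as np

# Hypothetical predicted probabilities for 3 samples over 4 classes (rows sum to 1).
p = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.20, 0.60, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
y = np.array([0, 1, 3])  # correct class label for each sample

# Pick out the probability assigned to each correct class, then average -log.
nll = -np.mean(np.log(p[np.arange(len(y)), y]))
print(round(nll, 3))
```

A confident correct prediction (0.70) contributes little to the cost; a uniform guess (0.25) contributes a lot, which is exactly what gradient descent then pushes down.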
In [15]:
# create a list of all model parameters to be fit by gradient descent
params = layer3.params + layer2.params + layer1.params + layer0.params
# create a list of gradients for all model parameters
grads = T.grad(cost, params)
The actual training function takes a batch of inputs (images x and labels y) and updates the model parameters with a small step in the direction of steepest descent of the cost.
This is the
(param_i, param_i - learning_rate * grad_i)
line in the code below.
We use the updates keyword to theano.function() to update these variables in-place. See the documentation for function().
We also make use of the givens keyword, to alias x and y to batches (aka, slices) of train_set_x and train_set_y.
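Stripped of Theano machinery, one SGD step amounts to the following (illustrative sketch; `params` and `grads` here are toy NumPy arrays, not the shared variables above):

```python
import numpy as np

learning_rate = 0.1
params = [np.array([1.0, -2.0]), np.array([0.5])]   # toy parameter arrays
grads  = [np.array([0.2,  0.4]), np.array([-1.0])]  # toy gradients

# One SGD step: param <- param - learning_rate * grad, for every parameter
params = [p - learning_rate * g for p, g in zip(params, grads)]
print(params[0])  # approximately [0.98, -2.04]
```

theano.function() with updates= does exactly this assignment in-place on the shared variables each time train_model is called.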
In [16]:
# train_model is a function that updates the model parameters by SGD, and returns the current cost
#
# We create the updates list by automatically looping over all
# (params[i], grads[i]) pairs.
updates = [
(param_i, param_i - learning_rate * grad_i) # <=== SGD update step
for param_i, grad_i in zip(params, grads)
]
index = T.lscalar() # index to a batch of training/validation/testing examples
train_model = theano.function(
[index],
cost,
updates=updates,
givens={
x: train_set_x[index * batch_size: (index + 1) * batch_size], # <=== batching
y: train_set_y[index * batch_size: (index + 1) * batch_size] # <=== batching
}
)
In [17]:
validate_model = theano.function(
[index],
layer3.errors(y),
givens={
x: valid_set_x[index * batch_size: (index + 1) * batch_size],
y: valid_set_y[index * batch_size: (index + 1) * batch_size]
}
)
In [18]:
test_model = theano.function(
[index],
layer3.errors(y),
givens={
x: test_set_x[index * batch_size: (index + 1) * batch_size],
y: test_set_y[index * batch_size: (index + 1) * batch_size]
}
)
We use SGD for a fixed number of iterations over the full training set (an "epoch"). Usually, we'd use a more complicated rule, such as iterating until a certain number of epochs fail to produce improvement in the validation set.
We're running on the CPU, so this will take a while; we train for only a small number of epochs here. Feel free to try more yourself while I'm talking :)
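The more usual stopping rule mentioned above (stop once validation stops improving) can be sketched like this; `train_epoch` and `validate` are stand-ins for the compiled Theano functions:

```python
def early_stopping_train(train_epoch, validate, patience=5, max_epochs=200):
    """Stop when `patience` consecutive epochs fail to improve validation error."""
    best_err, epochs_without_improvement = float('inf'), 0
    for epoch in range(max_epochs):
        train_epoch()
        err = validate()
        if err < best_err:
            best_err, epochs_without_improvement = err, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation plateaued: stop training
    return best_err

# Toy validation curve: improves for three epochs, then plateaus.
errs = iter([0.5, 0.3, 0.2, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25])
print(early_stopping_train(lambda: None, lambda: next(errs), patience=3))  # 0.2
```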
In [19]:
# if you've limited the data size above, you can easily do 200 epochs,
# but accuracy will be poor either way: neural nets need enough data.
# With the full dataset, a batch_size of 100, and 20 training epochs,
# validation error should be around x and test error at least near x
for epoch in xrange(1,40):
costs = [train_model(i) for i in xrange(n_train_batches)]
validation_losses = [validate_model(i) for i in xrange(n_valid_batches)]
if (epoch + 1) % 4 == 0:
print "Epoch {} NLL {:.2} %err in validation set {:.1%}".format(epoch + 1, np.mean(costs), np.mean(validation_losses))
We can visualize the filters learned by the first layer. In most vision tasks, these will be simple features, such as lines, points, or edges (such as in Figure 2 of this paper).
The MNIST data doesn't contain as much variation as natural images, so we don't see as obvious edge features, but we still see some stroke-like features appearing in the lowest level. With some standard tricks (adding noise to the training set, shifting examples around, using Dropout), we'd probably see smoother features (as in the paper linked above).
In [20]:
#first layer, layer 0
filters = tile_raster_images(layer0.W.get_value(borrow=True), img_shape=(9, 9), tile_shape=(2,5), tile_spacing=(3, 3),
scale_rows_to_unit_interval=True,
output_pixel_vals=True)
plt.imshow(filters)
plt.show()
In [21]:
# layer 1
filters = tile_raster_images(layer1.W.get_value(borrow=True), img_shape=(50, 5), tile_shape=(1,10), tile_spacing=(3, 3),
scale_rows_to_unit_interval=False,
output_pixel_vals=False)
plt.imshow(filters)
plt.show()
In [22]:
#sigmoid layer
filters = tile_raster_images(layer2.W.get_value(borrow=True), img_shape=(1, 20), tile_shape=(10,10), tile_spacing=(3, 3),
scale_rows_to_unit_interval=False,
output_pixel_vals=False)
plt.imshow(filters)
plt.show()
In [23]:
# final layer: a fully connected logistic regression layer converts the sigmoid layer's output to a class label.
filters = tile_raster_images(layer3.W.get_value(borrow=True), img_shape=(1, 10), tile_shape=(4,5), tile_spacing=(3, 3),
scale_rows_to_unit_interval=False,
output_pixel_vals=False)
plt.imshow(filters)
plt.show()
In [24]:
test_errors = [test_model(i) for i in range(n_test_batches)]
print "test errors: {:.1%}".format(np.mean(test_errors))
In [ ]: