In this notebook, we'll walk through the process of building a convolutional neural network with optimus, step by step. To make the demonstration more applicable to a real research problem, we'll use the MNIST dataset.
Conventions:
In [20]:
# Global imports
import random
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn
import mpld3
import numpy as np
seaborn.set()
np.set_printoptions(precision=4, suppress=True)
mpld3.enable_notebook()
import optimus
One of the first successful applications of neural networks to "real world" problems came from Yann LeCun et al.'s work on hand-written digit recognition in the 1990s. Buzz around this problem was further stoked by the curation and dissemination of the data used, referred to simply as "MNIST". Interested readers can refer to the MNIST website, http://yann.lecun.com/exdb/mnist/, for a selection of relevant papers.
For those who would like to follow along, go ahead and download the pickled dataset from our friends at UMontreal, who have conveniently converted the data to a python-friendly format:
http://deeplearning.net/data/mnist/mnist.pkl.gz
No need to unzip the dataset; we'll build a data parser to unpack it directly.
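If you'd rather fetch the file programmatically, something like the following will do (a sketch using the Python 2 standard library):

import urllib
# Download the gzipped pickle to the working directory.
urllib.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz",
                   "mnist.pkl.gz")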
In [2]:
import gzip
import cPickle
def load_mnist(mnist_file):
    """Load the MNIST dataset into memory.

    Parameters
    ----------
    mnist_file : str
        Path to gzipped MNIST file.

    Returns
    -------
    train, valid, test : tuples of np.ndarrays
        Each consists of (data, labels), where data.shape=(N, 1, 28, 28) and
        labels.shape=(N,).
    """
    dsets = []
    with gzip.open(mnist_file, 'rb') as fp:
        for split in cPickle.load(fp):
            n_samples = len(split[1])
            data = np.zeros([n_samples, 1, 28, 28])
            labels = np.zeros([n_samples], dtype=int)
            for n, (x, y) in enumerate(zip(*split)):
                data[n, ...] = x.reshape(1, 28, 28)
                labels[n] = y
            dsets.append((data, labels))
    return dsets
A few comments before moving on: the loader returns the three splits in order, each as a (data, labels) pair, i.e.

data = [(X_train, y_train), (X_valid, y_valid), (X_test, y_test)]
In [3]:
# Let's load the data and take a look at some digits.
train, valid, test = load_mnist("/Users/ejhumphrey/mnist/mnist.pkl.gz")

num_imgs = 5
fig = plt.figure(figsize=(num_imgs * 2, 2))
for n, idx in enumerate(np.random.permutation(len(train[1]))[:num_imgs]):
    # Equivalent to subplot(1, num_imgs, n + 1).
    ax = fig.add_subplot(101 + 10 * num_imgs + n)
    ax.imshow(
        train[0][idx, 0], interpolation='nearest',
        aspect='equal', cmap=plt.cm.hot)
    ax.set_xlabel("{0}".format(train[1][idx]))
    ax.set_xticks([])
    ax.set_yticks([])
plt.tight_layout()
Now that everything looks right, we're ready to move on to creating a network for training. We'll do this in several consecutive steps.
For our digit classifier, we'll need three inputs:
- data, as a 4D tensor of floats (the default)
- labels, as a vector of int32s
- learning_rate, a scalar to control the magnitude of each update
In [4]:
# shape = (num_samples, 1, x_dim, y_dim)
data = optimus.Input(
    name='data',
    shape=(None, 1, 28, 28))

# shape = (num_samples,)
class_labels = optimus.Input(
    name='labels',
    shape=(None,),
    dtype='int32')

# scalar -> None
learning_rate = optimus.Input(
    name='learning_rate',
    shape=None)
Some important aspects to note:
- None can be passed as the first dimension of an Input to indicate that it is variable; only the first dimension may ever be None. Doing so allows the graph to accept a different number of inputs at a time, but will be slightly less efficient, so keep this trade-off in mind.
- None as the entire shape specifies that the Input is a scalar.
- Python variables may be given the same names as their Inputs, as is done with data, although it is not strictly necessary, as with labels. For clarity, the graph will expect an input under labels, not class_labels.

Most convolutional neural networks are comprised of one or more convolutional layers, with or without pooling, followed by one or more affine, or fully-connected, layers. For classifiers, a softmax operation is often applied to the output of the final layer so that it behaves like a probability mass function over the known classes, i.e. bounded on [0, 1] and summing to 1.
Here, we'll create one convolutional node, two affine nodes, and a softmax operator, which will do the main work of our graph.
In [5]:
conv = optimus.Conv3D(
    name='conv',
    input_shape=data.shape,
    weight_shape=(15, 1, 9, 9),  # (num_kernels, num_input_maps, x_dim, y_dim)
    pool_shape=(2, 2),
    act_type='relu')

affine = optimus.Affine(
    name='affine',
    input_shape=conv.output.shape,
    output_shape=(None, 512),  # (num_samples, num_outputs)
    act_type='relu')

classifier = optimus.Affine(
    name='classifier',
    input_shape=affine.output.shape,
    output_shape=(None, 10),  # (num_samples, num_outputs)
    act_type='linear')

softmax = optimus.Softmax(name='softmax')
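Before digging into these definitions, a quick sanity check never hurts. The following sketch assumes each node exposes a name attribute and the same output.shape attribute used in the cell above:

# Print the inferred output shape of each node (a sketch; attribute
# usage follows the cell above).
for node in (conv, affine, classifier):
    print node.name, node.output.shape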
Observations:
- Like Inputs, nodes must also be uniquely named.
- The shape attributes of Inputs and node Outputs make it easy to specify the dimensions a node expects, as with data.shape for conv and conv.output.shape for affine. Check the class definition of a given node to determine its named outputs (see below).
- These nodes use relu, or Rectified Linear Unit, activation functions; other common activation functions include sigmoid, tanh, and linear, as in the classifier node.
- Like Inputs, both Affine nodes have output shapes that begin with None, indicating that the first dimension, num_samples, is variable. Affine nodes are defined by input and output shapes, whereas Conv3D nodes are defined by input and weight shapes, sparing you the need to determine, in advance, what the output shape will be.

In order to train a network via backpropagation, it is necessary to compute some scalar loss such that all parameters in the network can be differentiated with respect to it. The choice or design of a loss function is crucial, hardly trivial, and an ongoing research topic; interested readers are referred to the following:
http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf
Using optimus, a great many custom loss functions can be stitched together from some combination of nodes. More often than not, though, it is sufficient to use one of the provided losses, which are use-case-specific nodes. Here, we'll use a negative log-likelihood loss, which typically works well for 1-of-K classification problems.
In [6]:
nll = optimus.NegativeLogLikelihoodLoss(name='nll')
As we will see shortly, the NegativeLogLikelihoodLoss node has two inputs, likelihoods and index, and produces a scalar output, which is the mean negative log-likelihood of the "correct" answers.
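To make the computation concrete, here is roughly what the node does, written in plain numpy for illustration (this is a sketch, not optimus code):

def mean_nll(likelihoods, index):
    # Mean negative log-likelihood of the "correct" class per datapoint,
    # where likelihoods.shape=(N, K) and index.shape=(N,).
    return -np.mean(np.log(likelihoods[np.arange(len(index)), index]))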
Finally, to get information out of our network, it is necessary to create Outputs, which act as signal taps on the graph. Any node that computes some result can be tapped as an output. To demonstrate, we'll specify outputs for likelihoods, from the softmax node, and the overall loss, from the nll loss node.
In [7]:
likelihoods = optimus.Output(name='likelihoods')
loss = optimus.Output(name='loss')
Note that any one graph can produce a multitude of outputs, so long as they are named and specified as connections. This can be particularly useful when debugging a network, or when exploring what the learned representations look like.
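For example, one could also tap the convolutional features themselves to visualize what the first layer learns. A sketch, where conv_out is a name introduced here for illustration:

# Hypothetical extra tap; to use it, connect (conv.output, conv_out) and
# include conv_out in the graph's outputs when building the trainer below.
conv_out = optimus.Output(name='conv_out')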
While much neural network research makes use of a cascade of singular layers, it is possible to specify arbitrarily connected acyclic (no loops) graphs. This can be crucial when learning on multimodal data, i.e. synchronized audio and video, or combining multiple sources of ground truth data, i.e. joint speech and face recognition.
For convenience, a ConnectionManager is used to specify edges as a list of "(from, to)" tuples, as if one were connecting objects with lines:
In [8]:
trainer_edges = optimus.ConnectionManager([
    # Input
    (data, conv.input),
    # Nodes
    (conv.output, affine.input),
    (affine.output, classifier.input),
    (classifier.output, softmax.input),
    (softmax.output, nll.likelihoods),
    (class_labels, nll.index),
    # Outputs
    (softmax.output, likelihoods),
    (nll.output, loss)])

update_edges = optimus.ConnectionManager([
    (learning_rate, conv.weights),
    (learning_rate, conv.bias),
    (learning_rate, affine.weights),
    (learning_rate, affine.bias),
    (learning_rate, classifier.weights),
    (learning_rate, classifier.bias)])
Note that two ConnectionManagers are used here: one connecting inputs to outputs, and a second connecting the learning_rate to the parameters of each node. Specifying the connectivity of learning rates makes it easy to use different values for different parameters, or to simply not update certain parameters at all. Such needs arise when training a network with block coordinate descent, or when fine-tuning pre-trained parameters in a supervised manner.
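For example, to hold the first affine layer fixed while giving the convolutional layer its own rate, one could declare a second scalar Input and route it accordingly. A sketch, where conv_rate and update_edges_alt are names introduced here for illustration:

conv_rate = optimus.Input(name='conv_rate', shape=None)

update_edges_alt = optimus.ConnectionManager([
    # The convolutional layer follows its own rate...
    (conv_rate, conv.weights),
    (conv_rate, conv.bias),
    # ...'affine' is omitted entirely, freezing its parameters...
    # ...and the classifier follows the global learning rate.
    (learning_rate, classifier.weights),
    (learning_rate, classifier.bias)])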
A word of caution: ports, e.g. inputs and outputs, are single-input, multi-output (SIMO). In other words, one output can feed many inputs, but each input can receive only one connection.
Strictly speaking, ConnectionManagers are not a necessity, but the "(from, to)" connection paradigm is easier to reason about than the form actually required, "{from: [to, ...]}":
In [9]:
# Show the dictionary of {from -> [to, ...]}
trainer_edges.connections
Out[9]:
Having specified the various parts of our training graph, we can now assemble the full network and randomly initialize the parameters (which default to zero).
In [10]:
trainer = optimus.Graph(
    name='mnist_trainer',
    inputs=[data, class_labels, learning_rate],
    nodes=[conv, affine, classifier, softmax, nll],
    connections=trainer_edges.connections,
    outputs=[loss, likelihoods],
    loss=loss,
    updates=update_edges.connections)

for node in conv, affine, classifier:
    optimus.random_init(node.weights, mean=0.0, std=0.1)
Now the training graph can be called directly with MNIST data. Here, we'll push just a few datapoints through the graph.
In [11]:
# Push the first three datapoints through the graph.
X, y = [_[:3] for _ in train]
print trainer(data=X, labels=y, learning_rate=0.02)
Here we see that, given the appropriate inputs, a graph returns a dictionary of values, keyed by the named outputs requested.
Once the trainer network is built, pushing data through the graph with a positive scalar for learning_rate will cause the parameters to move in the direction of a smaller loss value. If the learning rate is too small, this may take a considerable amount of time; conversely, too large a learning rate will cause the network to diverge, often quickly, and die a glorious death.
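Concretely, a bare-bones training loop needs nothing more than the call demonstrated above. A sketch, where the subset size and iteration count are arbitrary choices:

# Repeatedly push a small, fixed subset through the graph and watch the loss.
X, y = [_[:100] for _ in train]
for step in range(10):
    outputs = trainer(data=X, labels=y, learning_rate=0.02)
    print step, outputs['loss']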
This iterative training process can be simplified with two pieces of code: a minibatch data generator and a training driver.
While research on stochastic learning methods, and in particular curriculum learning, is ongoing, here we'll define a generator that returns batches of randomly sampled datapoints.
Note: If generators make you weak in the knees, the following post is a highly recommended introduction (or refresher):
http://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/
In [12]:
def minibatch(data, labels, batch_size, max_iter=np.inf):
    """Random mini-batch generator.

    Parameters
    ----------
    data : array_like, len=N
        Observation data.
    labels : array_like, len=N
        Labels corresponding to the given data.
    batch_size : int
        Number of datapoints to return at each iteration.
    max_iter : int, default=inf
        Number of iterations before raising a StopIteration.

    Yields
    ------
    batch : dict
        Random batch of datapoints, under the keys `data` and `labels`.
    """
    num_points = len(labels)
    order = np.random.permutation(num_points)
    idx, count = 0, 0
    while count < max_iter:
        x, y = [], []
        while len(y) < batch_size:
            x.append(data[order[idx]])
            y.append(labels[order[idx]])
            idx += 1
            if idx >= num_points:
                idx = 0
                np.random.shuffle(order)
        # Look here! I'm important!
        yield dict(data=np.asarray(x), labels=np.asarray(y))
        count += 1
Generator and logic nuances are left as an exercise, but there is one crucial component that allows this generator to work seamlessly with an optimus Driver: at the yield statement (flagged above as important), the generator emits a dictionary whose keys correspond to the names of the graph's inputs (defined above in 2.1).
Note, however, that we don't trouble ourselves with constants for the time being. These will be provided as a separate dictionary of static values.
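As a quick spot-check, the generator can also be exercised on its own; the shapes below assume the MNIST arrays loaded earlier:

stream = minibatch(train[0], train[1], batch_size=25, max_iter=2)
for batch in stream:
    print batch['data'].shape, batch['labels'].shape  # (25, 1, 28, 28) (25,)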
When training neural networks, there is much repeated infrastructure common to all graphs, such as parameter checkpointing, loss statistics, and progress reporting. The Driver class condenses this functionality (and more) into a single object, thus simplifying the process of learning parameters.
The following passes the training graph to a driver, and calling the fit method begins the training process. The right-most value reports the loss as a function of iteration.
In [13]:
driver = optimus.Driver(graph=trainer)
driver.fit(
    source=minibatch(train[0], train[1], batch_size=25),
    hyperparams={'learning_rate': 0.02},
    max_iter=500, print_freq=25)
Now that we've trained the ConvNet for several hundred iterations, we can transform the validation data and see how we're doing. Note that an image's class can be estimated by taking the argmax over each datapoint's likelihoods.
In [14]:
outputs = trainer(data=valid[0], labels=valid[1], learning_rate=0.0)
labels_pred = outputs['likelihoods'].argmax(axis=1)
print "Classification Error: {0:0.4}".format(1.0 - np.equal(valid[1], labels_pred).mean())
print " Loss: {0:0.4}".format(float(outputs['loss']))
Certainly this network isn't done learning; for the purposes of demonstration, though, we're almost done here, with only one big topic left to address. In practice, training would proceed until some stopping criterion is reached.
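For completeness, one common criterion is early stopping on the validation loss. The following sketch uses only calls demonstrated above; the evaluation interval and patience are arbitrary choices:

# Train with the minibatch stream, evaluating the validation loss periodically;
# stop once it has failed to improve several evaluations in a row.
stream = minibatch(train[0], train[1], batch_size=25)
best_loss, strikes, patience = np.inf, 0, 5
for it in range(2000):
    trainer(learning_rate=0.02, **stream.next())
    if (it + 1) % 100 == 0:
        v_loss = float(trainer(data=valid[0], labels=valid[1],
                               learning_rate=0.0)['loss'])
        print it, v_loss
        if v_loss < best_loss:
            best_loss, strikes = v_loss, 0
        else:
            strikes += 1
        if strikes >= patience:
            break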
One of the main goals of optimus is to provide safe and seamless serialization between the learning and application stages in a development pipeline. This is achieved by tearing down a graph into two distinct parts: a JSON definition, and a numpy archive of parameters.
What does this look like? Let's save the trainer to disk.
In [15]:
optimus.save(trainer, "mnist_trainer.json", "mnist_params.npz")
In [16]:
# Print the JSON definition.
for line in open("mnist_trainer.json"):
    print line.strip('\n')
In [17]:
param_values = np.load("mnist_params.npz")
for k in param_values.keys():
    print k, param_values[k].shape
This graph can be recreated by calling optimus.load(def_file, param_file), but for the sake of demonstration, let's load only the definition and predict the validation dataset.
In [18]:
# Load the definition, without parameters
new_trainer = optimus.load("mnist_trainer.json")
outputs = new_trainer(data=valid[0], labels=valid[1], learning_rate=0.0)
labels_pred = outputs['likelihoods'].argmax(axis=1)
print "Classification Error: {0:0.4}".format(1.0 - np.equal(valid[1], labels_pred).mean())
print " Loss: {0:0.4}".format(float(outputs['loss']))
As expected, the network is effectively blank, and predicts a constant class index (each of which occurs roughly 10% of the time).
Let's manually set the parameters by assigning the loaded numpy archive to the graph's param_values property. All keys in the archive that match parameter names known to the graph will be set; the rest are discarded noisily.
In [19]:
# Set the parameters with those previously loaded from disk.
new_trainer.param_values = param_values

# Unknown keys are discarded noisily; this assignment has no effect.
new_trainer.param_values = {'conv.another_param': np.ones([3, 4])}
outputs = new_trainer(data=valid[0], labels=valid[1], learning_rate=0.0)
labels_pred = outputs['likelihoods'].argmax(axis=1)
print "Classification Error: {0:0.4}".format(1.0 - np.equal(valid[1], labels_pred).mean())
print " Loss: {0:0.4}".format(float(outputs['loss']))
This way, the parameters of any node can be set directly by providing a dictionary of params, keyed by "{node_name}.{parameter_name}". This makes it easy to load parameters into a graph that came from a previous training session, some new unsupervised learning approach, or hand-crafted features for fine-tuning.
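For instance, overwriting a single parameter might look like the following sketch, where the weight shape is an assumption based on the affine definitions above (512 inputs, 10 classes):

# Overwrite one parameter; the key follows '{node_name}.{parameter_name}'.
# The (512, 10) shape is assumed from the classifier node's definition.
new_trainer.param_values = {
    'classifier.weights': np.random.normal(0, 0.1, size=(512, 10))}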