Building an MNIST ConvNet in Optimus

In this notebook, we'll walk through the process of building a convolutional neural network in optimus, step by step. To make the demonstration more applicable to a real research problem, we'll use the MNIST dataset.

Conventions:

  • Each block is structured locally in three parts: (1) context, (2) code, and (3) a brief explanation.
  • Code blocks necessary to the pipeline are shown in full; others are purely for demonstration.

In [20]:
# Global imports
import random
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn
import mpld3
import numpy as np

seaborn.set()
np.set_printoptions(precision=4, suppress=True)
mpld3.enable_notebook()

import optimus

1. Get the MNIST data.

One of the first successful applications of neural networks to "real world problems" came from Yann LeCun et al.'s work on hand-written digit recognition in the 1990s. Buzz around this problem was further stoked by the curation and dissemination of the data used, referred to simply as "MNIST"; interested readers are encouraged to seek out the original publications.

For those who would like to follow along, go ahead and download the pickled dataset from our friends at UMontreal, who have conveniently converted the data to a python-friendly format:

http://deeplearning.net/data/mnist/mnist.pkl.gz

No need to unzip the dataset; we'll build a data parser to unpack it directly.


In [2]:
import gzip
import cPickle

def load_mnist(mnist_file):
    """Load the MNIST dataset into memory.

    Parameters
    ----------
    mnist_file : str
        Path to gzipped MNIST file.

    Returns
    -------
    train, valid, test: tuples of np.ndarrays
        Each consists of (data, labels), where data.shape=(N, 1, 28, 28) and
        labels.shape=(N,).
    """
    dsets = []
    with gzip.open(mnist_file, 'rb') as fp:
        for split in cPickle.load(fp):
            n_samples = len(split[1])
            data = np.zeros([n_samples, 1, 28, 28])
            labels = np.zeros([n_samples], dtype=int)
            for n, (x, y) in enumerate(zip(*split)):
                data[n, ...] = x.reshape(1, 28, 28)
                labels[n] = y
            dsets.append((data, labels))

    return dsets

A few comments before moving on...

  1. Note that the data is stored as three pairs of tuples: data = [(X_train, y_train), (X_valid, y_valid), (X_test, y_test)]
  2. Data is provided as a flat vector of coefficients, so each datapoint is reshaped on-load, inside the inner loop of load_mnist.
  3. An additional (singleton) dimension is included during this reshape operation for convenience later. Try not to worry about it just yet.
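To make the reshape concrete, here is the same operation on a synthetic datapoint (plain numpy, independent of the loader above):

```python
import numpy as np

# Synthetic stand-in for one MNIST datapoint: a flat vector of 784 coefficients.
x_flat = np.arange(28 * 28, dtype=float)

# Reshape into (channels, rows, cols); the singleton channel dimension is
# the extra dimension that the convolutional node will expect later.
x_img = x_flat.reshape(1, 28, 28)

print(x_img.shape)      # (1, 28, 28)
print(x_img[0, 0, :5])  # first five coefficients land in the top row
```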

In [3]:
# let's load the data and take a look at some digits.
train, valid, test = load_mnist("/Users/ejhumphrey/mnist/mnist.pkl.gz")
num_imgs = 5
fig = plt.figure(figsize=(num_imgs*2, 2))
for n, idx in enumerate(np.random.permutation(len(train[1]))[:num_imgs]):
    ax = fig.add_subplot(101 + 10*num_imgs + n)
    ax.imshow(
        train[0][idx, 0], interpolation='nearest', 
        aspect='equal', cmap=plt.cm.hot)
    ax.set_xlabel("{0}".format(train[1][idx]))
    ax.set_xticks([])    
    ax.set_yticks([]);
plt.tight_layout()


2. Building a ConvNet

Now that everything looks right, we're ready to move on to creating a network for training. We'll do this in several consecutive steps.

  1. Define inputs
  2. Define processing nodes
  3. Define a scalar loss
  4. Define outputs
  5. Specify connections between nodes, i.e. edges
  6. Create the network, i.e. graph

2.1 Defining Inputs

For our digit classifier, we'll need three inputs:

  • input image data, as a 4D Tensor of floats (default)
  • ground truth class labels, as a vector of int32's
  • a scalar learning_rate to control the magnitude of each update

In [4]:
# shape = (num_samples, 1, x_dim, y_dim)
data = optimus.Input(
    name='data',
    shape=(None, 1, 28, 28))

# shape = (num_samples, )
class_labels = optimus.Input(
    name='labels',
    shape=(None,),
    dtype='int32')

# scalar -> None
learning_rate = optimus.Input(
    name='learning_rate',
    shape=None)

Some important aspects to note:

  • As a convention, the first dimension corresponds to unique samples or datapoints.
  • None can be passed as the first dimension of an Input to indicate that it is variable. Only the first dimension may ever be None. Doing so allows the graph to accept a different number of datapoints at a time, but will be slightly less efficient; keep this trade-off in mind.
  • Passing None as the entire shape specifies the Input is a scalar.
  • All inputs must be uniquely named; these are the keyword argument names under which a network will accept values. It is good practice to keep the instance names consistent with Input names, as is done with data, although it is not strictly necessary, as with labels. For clarity, the graph will expect an input under labels, not class_labels.

2.2 Defining Nodes

Most convolutional neural networks are composed of one or more convolutional layers, with or without pooling, followed by one or more affine, or fully-connected, layers. For classifiers, a softmax operation is often applied to the output of the final layer so that it behaves like a probability mass function over the known classes, i.e. bounded on [0, 1] and sums to 1.
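To make the softmax behavior concrete, here is a quick numpy sketch (not the optimus node itself) showing the bounded, sum-to-one property:

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    z = scores - scores.max(axis=-1, keepdims=True)  # guard against overflow
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # each value in [0, 1], larger scores get larger mass
print(probs.sum())  # 1.0
```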

Here, we'll create one convolutional node, two affine nodes, and a softmax operator, which will do the main work of our graph.


In [5]:
conv = optimus.Conv3D(
    name='conv',
    input_shape=data.shape,
    weight_shape=(15, 1, 9, 9),  # (num_kernels, num_input_maps, x_dim, y_dim)
    pool_shape=(2, 2),
    act_type='relu')

affine = optimus.Affine(
    name='affine',
    input_shape=conv.output.shape,
    output_shape=(None, 512,),  # (num_samples, num_outputs)
    act_type='relu')

classifier = optimus.Affine(
    name='classifier',
    input_shape=affine.output.shape,
    output_shape=(None, 10),  # (num_samples, num_outputs)
    act_type='linear')

softmax = optimus.Softmax(name='softmax')

Observations:

  • Like Inputs, nodes must also be uniquely named.
  • When creating nodes, the shape attributes of Inputs or node Outputs can be used to easily specify the dimensions a node might expect, as with data.shape for conv and conv.output.shape for affine. Check the class definitions of a given node to determine its named outputs (see below).
  • Here, we'll use 3D convolutions since they're generally faster than other convolution operations. The Conv3D operation expects a 4D input, which is the motivation for the singleton dimension added earlier. As the inline comment indicates, this Conv3D node will have 15 kernels, each shaped (1, 9, 9).
  • The first two nodes use relu, or Rectified Linear Unit, activation functions; other common activation functions include sigmoid, tanh, and linear, as in the classifier node.
  • Similar to the shape arguments of Inputs, both Affine nodes have output shapes that start with None; this indicates that the first dimension, num_samples, is variable.
  • Note that Affine nodes are defined by input and output shapes, whereas Conv3D nodes are defined by input and weight shapes, sparing you the need to determine, in advance, what the output shape will be.
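As a sanity check on the shapes above, the output of a "valid" convolution followed by pooling can be computed by hand: 28 - 9 + 1 = 20 per side, halved by (2, 2) pooling to 10. The helper below is ours, purely for illustration; optimus performs this bookkeeping internally:

```python
def conv_output_shape(input_shape, weight_shape, pool_shape):
    """Output shape of a 'valid' convolution followed by pooling.

    Illustrative helper only; optimus computes this for you.
    """
    n, _, in_x, in_y = input_shape
    num_kernels, _, k_x, k_y = weight_shape
    out_x = (in_x - k_x + 1) // pool_shape[0]  # valid conv, then pool
    out_y = (in_y - k_y + 1) // pool_shape[1]
    return (n, num_kernels, out_x, out_y)

print(conv_output_shape((None, 1, 28, 28), (15, 1, 9, 9), (2, 2)))
# (None, 15, 10, 10)
```

Flattened, that output has 15 * 10 * 10 = 1500 coefficients, which is exactly the input dimensionality the affine node infers from conv.output.shape.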

2.3 Defining Losses

In order to train a network via backpropagation, it is necessary to compute some scalar loss such that all parameters in the network can be differentiated with respect to it. The choice or design of a loss function is crucial, hardly trivial, and an ongoing research topic in its own right; for those interested, the following is suggested:

http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf

Using optimus, a great many custom loss functions can be stitched together from some combination of nodes. More often than not, however, it's sufficient to use one of the provided losses, which are use-case specific nodes. Here, we'll use a negative log-likelihood loss, which typically works well for 1-of-K classification problems.


In [6]:
nll = optimus.NegativeLogLikelihoodLoss(name='nll')

As we will see shortly, the NegativeLogLikelihoodLoss node has two inputs, likelihoods and index, and produces a scalar output, which is the mean negative log-likelihood of the "correct" answers.
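In plain numpy, the computation this loss performs looks roughly like the following sketch (illustrative, not the optimus implementation):

```python
import numpy as np

def negative_log_likelihood(likelihoods, index):
    """Mean negative log-likelihood of the true classes.

    likelihoods : np.ndarray, shape=(N, K); rows sum to 1.
    index : np.ndarray, shape=(N,); integer class labels.
    """
    # Pick out the likelihood assigned to each correct class...
    correct = likelihoods[np.arange(len(index)), index]
    # ...and average the negative logs.
    return -np.log(correct).mean()

likelihoods = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1]])
print(negative_log_likelihood(likelihoods, np.array([0, 1])))  # ~0.29
```

Note that a perfect classifier (probability 1 on every correct class) achieves a loss of exactly zero, and confidently wrong answers are penalized heavily.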

2.4 Define Outputs

Finally, to get information out of our network, it is necessary to create Outputs, which act as signal taps on the graph. Any node that computes some result can be tapped as an output. To demonstrate, we'll specify outputs for likelihoods, from the softmax node, and the overall loss, from the nll loss node.


In [7]:
likelihoods = optimus.Output(name='likelihoods')
loss = optimus.Output(name='loss')

Note that any one graph can produce a multitude of outputs, so long as they are named and specified as connections. This can be particularly useful when debugging a network, or to explore what the learned representations happen to look like.

2.5 Specifying Connections

While much neural network research makes use of a cascade of singular layers, it is possible to specify arbitrarily connected acyclic (no loops) graphs. This can be crucial when learning on multimodal data, i.e. synchronized audio and video, or combining multiple sources of ground truth data, i.e. joint speech and face recognition.

For convenience, a ConnectionManager is used to specify edges as a list of "(from, to)" tuples, as if one were connecting objects with lines:


In [8]:
trainer_edges = optimus.ConnectionManager([
    # Input
    (data, conv.input),
    # Nodes
    (conv.output, affine.input),
    (affine.output, classifier.input),
    (classifier.output, softmax.input),
    (softmax.output, nll.likelihoods),
    (class_labels, nll.index),
    # Outputs
    (softmax.output, likelihoods),
    (nll.output, loss)])

update_edges = optimus.ConnectionManager([
    (learning_rate, conv.weights),
    (learning_rate, conv.bias),
    (learning_rate, affine.weights),
    (learning_rate, affine.bias),
    (learning_rate, classifier.weights),
    (learning_rate, classifier.bias)])

Note that two ConnectionManagers are used here: one connecting inputs to outputs, and a second connecting the learning_rate to the parameters of each node. Specifying the connectivity of learning rates makes it easy to use different values for different parameters, or simply not update certain parameters. One may come across these needs when training a network with block coordinate descent, or with supervised fine-tuning of pre-trained parameters.

A word of caution: ports, e.g. inputs or outputs, are single-input, multi-output (SIMO). In other words, one output can be mapped to many inputs, but only one connection can be mapped to one input.

Strictly speaking, ConnectionManagers are not a necessity, but the "(from, to)" connection paradigm is easier to reason about than the form the graph actually requires, a "{from: [to, ...]}" dictionary:


In [9]:
# Show the dictionary of {from -> [to, ...]}
trainer_edges.connections


Out[9]:
{'affine.output': ['classifier.input'],
 'classifier.output': ['softmax.input'],
 'conv.output': ['affine.input'],
 'data': ['conv.input'],
 'labels': ['nll.index'],
 'nll.output': ['loss'],
 'softmax.output': ['nll.likelihoods', 'likelihoods']}
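Under the hood, a manager is doing little more than folding the tuples into fan-out lists, keyed by port name. A minimal sketch of that bookkeeping (ours, not the optimus source), using strings in place of port objects:

```python
def connections_from_edges(edges):
    """Fold (from, to) name pairs into a {from: [to, ...]} dict."""
    connections = {}
    for source, sink in edges:
        # One output may fan out to many inputs, hence the list per source.
        connections.setdefault(source, []).append(sink)
    return connections

edges = [('softmax.output', 'nll.likelihoods'),
         ('softmax.output', 'likelihoods'),
         ('data', 'conv.input')]
print(connections_from_edges(edges))
# {'softmax.output': ['nll.likelihoods', 'likelihoods'], 'data': ['conv.input']}
```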

2.6 Create the Full Graph

Having specified the various parts of our training graph, we can now assemble the full network and randomly initialize the parameters (which default to zero).


In [10]:
trainer = optimus.Graph(
    name='mnist_trainer',
    inputs=[data, class_labels, learning_rate],
    nodes=[conv, affine, classifier, softmax, nll],
    connections=trainer_edges.connections,
    outputs=[loss, likelihoods],
    loss=loss,
    updates=update_edges.connections)

for node in conv, affine, classifier:
    optimus.random_init(node.weights, mean=0.0, std=0.1)

Now the training graph can be called directly with MNIST data. Here, we'll push just a few datapoints through the graph.


In [11]:
X, y = [_[:3] for _ in train]
print trainer(data=X, labels=y, learning_rate=0.02)


OrderedDict([('loss', array(2.378587438364377)), ('likelihoods', array([[ 0.3637,  0.    ,  0.0009,  0.0043,  0.0082,  0.0056,  0.6079,
         0.0017,  0.0001,  0.0075],
       [ 0.9842,  0.0002,  0.0007,  0.002 ,  0.0017,  0.0014,  0.0054,
         0.0034,  0.0004,  0.0006],
       [ 0.6026,  0.0315,  0.0006,  0.0521,  0.1438,  0.0242,  0.0364,
         0.0494,  0.0324,  0.0269]]))])

Here we see that, given the appropriate inputs, a graph returns a dictionary of values, keyed by the named outputs requested.

3. Training the Network

Once the trainer network is built, pushing data through the graph with a positive scalar for learning_rate will cause the parameters to move in the direction of a smaller loss value. If the learning rate is too small this may take a considerable amount of time; conversely, too large a learning rate will cause the network to diverge, often quickly, and die a glorious death.
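The update itself is ordinary gradient descent, and the effect of the learning rate is easy to see on a toy one-parameter loss, L(w) = w^2 (nothing optimus-specific here):

```python
def descend(w, learning_rate, num_steps=10):
    """Gradient descent on L(w) = w**2, whose gradient is 2*w."""
    for _ in range(num_steps):
        w -= learning_rate * 2 * w
    return w

print(descend(1.0, 0.01))  # too small: slow crawl toward the minimum at 0
print(descend(1.0, 0.4))   # reasonable: rapid convergence
print(descend(1.0, 1.1))   # too large: the iterates oscillate and diverge
```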

This iterative training process can be simplified through two pieces of code: a minibatch data generator and a training driver.

3.1 Minibatch generator

While research on stochastic learning methods, and curriculum learning in particular, is ongoing, here we'll define a generator that yields batches of randomly sampled datapoints.

Note: If generators make you weak in the knees, the following post is a highly recommended introduction (or refresher):

http://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/


In [12]:
def minibatch(data, labels, batch_size, max_iter=np.inf):
    """Random mini-batch generator.

    Parameters
    ----------
    data : array_like, len=N
        Observation data.
    labels : array_like, len=N
        Labels corresponding to the given data.
    batch_size : int
        Number of datapoints to return at each iteration.
    max_iter : int, default=inf
        Number of iterations before raising a StopIteration.

    Yields
    ------
    batch : dict
        Random batch of datapoints, under the keys `data` and `labels`.
    """
    num_points = len(labels)
    order = np.random.permutation(num_points)
    idx, count = 0, 0
    while count < max_iter:
        x, y = [], []
        while len(y) < batch_size:
            x.append(data[order[idx]])
            y.append(labels[order[idx]])
            idx += 1
            if idx >= num_points:
                idx = 0
                np.random.shuffle(order)
        # Look here! I'm important!
        yield dict(data=np.asarray(x), labels=np.asarray(y))
        count += 1

The nuances of the generator logic are left as an exercise, but there is one crucial component that allows it to work seamlessly with an optimus Driver: at the line marked "Look here!", the generator yields a dictionary with keys corresponding to the names of the graph's inputs (defined above in 2.1).

Note, however, that we don't trouble ourselves with constants for the time being. These will be provided as a separate dictionary of static values.

3.2 Create a Driver

When training neural networks, there is much repeated infrastructure common to all graphs, such as parameter checkpointing, loss statistics, and progress reporting. The Driver class condenses this functionality (and more) into a single object, simplifying the process of learning parameters.

The following passes the training graph to a driver, and calling the fit method begins the training process. The right-most value reports the loss as a function of iteration.


In [13]:
driver = optimus.Driver(graph=trainer)
driver.fit(
    source=minibatch(train[0], train[1], batch_size=25), 
    hyperparams={'learning_rate': 0.02}, 
    max_iter=500, print_freq=25)


[Sun Jan  4 22:13:16 2015] 0 / 500: 5.6353
[Sun Jan  4 22:13:17 2015] 25 / 500: 1.6048
[Sun Jan  4 22:13:19 2015] 50 / 500: 0.5624
[Sun Jan  4 22:13:20 2015] 75 / 500: 0.6243
[Sun Jan  4 22:13:21 2015] 100 / 500: 0.5457
[Sun Jan  4 22:13:22 2015] 125 / 500: 0.7153
[Sun Jan  4 22:13:24 2015] 150 / 500: 0.3057
[Sun Jan  4 22:13:25 2015] 175 / 500: 0.1457
[Sun Jan  4 22:13:26 2015] 200 / 500: 0.1947
[Sun Jan  4 22:13:27 2015] 225 / 500: 0.2481
[Sun Jan  4 22:13:29 2015] 250 / 500: 0.2332
[Sun Jan  4 22:13:30 2015] 275 / 500: 0.2170
[Sun Jan  4 22:13:31 2015] 300 / 500: 0.2805
[Sun Jan  4 22:13:32 2015] 325 / 500: 0.2990
[Sun Jan  4 22:13:34 2015] 350 / 500: 0.4211
[Sun Jan  4 22:13:35 2015] 375 / 500: 0.1738
[Sun Jan  4 22:13:36 2015] 400 / 500: 0.3504
[Sun Jan  4 22:13:37 2015] 425 / 500: 0.2855
[Sun Jan  4 22:13:39 2015] 450 / 500: 0.2888
[Sun Jan  4 22:13:40 2015] 475 / 500: 0.1993
[Sun Jan  4 22:13:41 2015] 500 / 500: 0.2995

Now that we've trained the ConvNet for several hundred iterations, we can transform the validation data and see how we're doing. Note that an image's class can be estimated by taking the argmax over each datapoint's likelihoods.


In [14]:
outputs = trainer(data=valid[0], labels=valid[1], learning_rate=0.0)
labels_pred = outputs['likelihoods'].argmax(axis=1)
print "Classification Error: {0:0.4}".format(1.0 - np.equal(valid[1], labels_pred).mean())
print "                Loss: {0:0.4}".format(float(outputs['loss']))


Classification Error: 0.0585
                Loss: 0.2003

Certainly this network isn't done learning; in practice, training would proceed until some stopping criterion is reached. However, for the purposes of demonstration, we're almost done here. There's only one big topic left to address.

4. Serialization

One of the main goals of optimus is to provide safe and seamless serialization between the learning and application stages in a development pipeline. This is achieved by tearing a graph down into two distinct parts: a JSON definition, and a numpy archive of parameters.

What does this look like? Let's save the trainer to disk.


In [15]:
optimus.save(trainer, "mnist_trainer.json", "mnist_params.npz")

In [16]:
# Print the JSON
for line in open("mnist_trainer.json"):
    print line.strip('\n')


{
  "loss": {
    "type": "Output", 
    "name": "loss"
  }, 
  "name": "mnist_trainer", 
  "inputs": [
    {
      "dtype": "float64", 
      "shape": [
        null, 
        1, 
        28, 
        28
      ], 
      "type": "Input", 
      "name": "data"
    }, 
    {
      "dtype": "int32", 
      "shape": [
        null
      ], 
      "type": "Input", 
      "name": "labels"
    }, 
    {
      "dtype": "float64", 
      "shape": null, 
      "type": "Input", 
      "name": "learning_rate"
    }
  ], 
  "outputs": [
    {
      "type": "Output", 
      "name": "loss"
    }, 
    {
      "type": "Output", 
      "name": "likelihoods"
    }
  ], 
  "connections": {
    "nll.output": [
      "loss"
    ], 
    "conv.output": [
      "affine.input"
    ], 
    "classifier.output": [
      "softmax.input"
    ], 
    "softmax.output": [
      "nll.likelihoods", 
      "likelihoods"
    ], 
    "labels": [
      "nll.index"
    ], 
    "data": [
      "conv.input"
    ], 
    "affine.output": [
      "classifier.input"
    ]
  }, 
  "updates": {
    "learning_rate": [
      "conv.weights", 
      "conv.bias", 
      "affine.weights", 
      "affine.bias", 
      "classifier.weights", 
      "classifier.bias"
    ]
  }, 
  "nodes": [
    {
      "act_type": "relu", 
      "name": "conv", 
      "weight_shape": [
        15, 
        1, 
        9, 
        9
      ], 
      "type": "Conv3D", 
      "downsample_shape": [
        1, 
        1
      ], 
      "input_shape": [
        null, 
        1, 
        28, 
        28
      ], 
      "pool_shape": [
        2, 
        2
      ], 
      "border_mode": "valid"
    }, 
    {
      "act_type": "relu", 
      "type": "Affine", 
      "output_shape": [
        null, 
        512
      ], 
      "name": "affine", 
      "input_shape": [
        null, 
        15, 
        10, 
        10
      ]
    }, 
    {
      "act_type": "linear", 
      "type": "Affine", 
      "output_shape": [
        null, 
        10
      ], 
      "name": "classifier", 
      "input_shape": [
        null, 
        512
      ]
    }, 
    {
      "type": "Softmax", 
      "name": "softmax"
    }, 
    {
      "type": "NegativeLogLikelihoodLoss", 
      "name": "nll"
    }
  ], 
  "type": "Graph"
}

In [17]:
param_values = np.load("mnist_params.npz")
for k in param_values.keys():
    print k, param_values[k].shape


conv.bias (15,)
conv.weights (15, 1, 9, 9)
affine.weights (1500, 512)
affine.bias (512,)
classifier.weights (512, 10)
classifier.bias (10,)

This graph can be recreated by calling optimus.load(def_file, param_file), but for the sake of demonstration, let's load only the definition and predict on the validation dataset.


In [18]:
# Load the definition, without parameters
new_trainer = optimus.load("mnist_trainer.json")

outputs = new_trainer(data=valid[0], labels=valid[1], learning_rate=0.0)
labels_pred = outputs['likelihoods'].argmax(axis=1)
print "Classification Error: {0:0.4}".format(1.0 - np.equal(valid[1], labels_pred).mean())
print "                Loss: {0:0.4}".format(float(outputs['loss']))


Classification Error: 0.9009
                Loss: 2.303

As is to be expected, the network is effectively blank, and predicts a constant class index (each class occurs roughly 10% of the time, hence the ~90% error).
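This checks out numerically: with effectively uniform likelihoods over the 10 classes, the negative log-likelihood is -log(1/10) ≈ 2.3026, which is precisely the loss reported above:

```python
import numpy as np

num_classes = 10
uniform = np.full(num_classes, 1.0 / num_classes)  # a "blank" classifier
print(-np.log(uniform[0]))  # the chance-level loss for K = 10 classes
```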

Let's manually set the parameters by passing the loaded numpy archive to the graph's param_values property. All keys in the archive that match parameter names known to the graph will be set; the rest are discarded with a printed warning.


In [19]:
# Set the parameters with those previously loaded.
new_trainer.param_values = param_values
new_trainer.param_values = {'conv.another_param': np.ones([3, 4])}

outputs = new_trainer(data=valid[0], labels=valid[1], learning_rate=0.0)
labels_pred = outputs['likelihoods'].argmax(axis=1)
print "Classification Error: {0:0.4}".format(1.0 - np.equal(valid[1], labels_pred).mean())
print "                Loss: {0:0.4}".format(float(outputs['loss']))


Received erroneous parameter: conv.another_param
Classification Error: 0.0585
                Loss: 0.2003

This way, the parameters of any node can be set directly by providing a dictionary of params, keyed by "{node_name}.{parameter_name}". This makes it easy to load parameters into a graph that came from a previous training session, some new unsupervised learning approach, or hand-crafted features for fine-tuning.
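For illustration, that matching behavior can be sketched as follows (a hypothetical stand-in for the real setter, using a plain dict of parameters):

```python
import numpy as np

def set_param_values(params, new_values):
    """Update `params` in-place from `new_values`, skipping unknown keys."""
    for key, value in new_values.items():
        if key in params:
            params[key] = value
        else:
            # Mirror the noisy-discard behavior described above.
            print("Received erroneous parameter: {0}".format(key))

params = {'conv.weights': np.zeros([15, 1, 9, 9]),
          'conv.bias': np.zeros(15)}
set_param_values(params, {'conv.bias': np.ones(15),
                          'conv.another_param': np.ones([3, 4])})
print(params['conv.bias'][:3])  # bias updated; the bogus key was skipped
```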