DL Indaba Practical 2

Feedforward Neural Networks on Real Data & Best Practices

Developed by Stephan Gouws, Avishkar Bhoopchand & Ulrich Paquet.


In this practical we will move on and discuss best practices for building and training models on real world data (the famous MNIST dataset of images of hand-written digits). We will develop a deep, fully-connected ("feed-forward") neural network model that can classify these images with around 98% accuracy (close to state-of-the-art for feedforward models on this dataset).

Learning objectives:

Understanding the issues involved in implementing and applying deep neural networks in practice (w/ a focus on TensorFlow). In particular:

  • Implementation sanity checking details & controlled overfitting on a small training set.

  • Training deep neural networks. Understand:

    • Conceptually how the backprop algorithm efficiently computes model gradients by applying the chain rule and applying dynamic programming (saving intermediate computations).
    • Overfitting and how to use regularization to avoid it (Weight decay/L2 and dropout).
    • Knowing when to stop training (early stopping).
  • Understand the need for hyperparameter tuning in finding good architectures and training settings.

What is expected of you:

  • We have included rough time estimates of how long you should aim to spend on the important sections below.
  • Step through the notebook answering the questions by discussing them with your lab partner.
  • Execute each cell in turn & fill in the missing code by pair-programming with your lab partner.
  • 5 min before the end, pair up with someone else and make sure you both understand the concepts listed in "Learning Objectives" above. If not, please speak to the tutors!

Setups and Imports

For this practical, we will work with the famous MNIST dataset of handwritten digits. The task is to classify which digit a particular image represents. Luckily, since MNIST is a very popular dataset, TensorFlow has some built-in functions to download it.

The MNIST dataset consists of pairs of images (28x28 matrices) and labels. Each label is represented as a (sparse) binary vector of length 10 with a 1 in position i iff the image represents digit i, and 0 elsewhere. This is what the "one_hot" parameter you see below does.
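To make the one-hot encoding described above concrete, here is a minimal NumPy sketch that converts integer digit labels into length-10 binary vectors and back again. The helper name `to_one_hot` is hypothetical (it is not part of TensorFlow or the MNIST loader) and is only used for illustration.

```python
import numpy as np

def to_one_hot(labels, num_classes=10):
    """Return a [len(labels), num_classes] array with a single 1 per row."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes)[labels]

digits = np.array([3, 0, 9])
one_hot = to_one_hot(digits)
# argmax recovers the original digit from each one-hot row.
recovered = np.argmax(one_hot, axis=1)
```

The same indexing trick (`np.eye(10)[...]`) is used later in this practical to generate random one-hot labels.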

In [1]:
# Import TensorFlow and some other libraries we'll be using.
import datetime
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Download the MNIST dataset onto the local machine.
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz

Visualizing the MNIST data

Let's visualize a few of the digits (from the training set):

In [2]:
from matplotlib import pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Helper plotting routine.
def display_images(gens, title=""):
    fig, axs = plt.subplots(1, 10, figsize=(25, 3))
    fig.suptitle(title, fontsize=14, fontweight='bold')
    for i in range(10):
        reshaped_img = (gens[i].reshape(28, 28) * 255).astype(np.uint8)
        # Draw the i-th image on the i-th subplot.
        axs.flat[i].imshow(reshaped_img)
    return fig, axs

batch_xs, batch_ys = mnist.train.next_batch(10)
list_of_images = np.split(batch_xs, 10)
_ = display_images(list_of_images, "Some Examples from the Training Set.")

Building a Feed-Forward Neural Network

In this section, we will build a neural network that takes the raw MNIST pixels as inputs and outputs 10 values, which we will interpret as the probability that the input image belongs to one of the 10 digit classes (0 through 9). Along the way, the data will pass through the hidden layers and activation functions we encountered in the first practical.

NOTE: Standard feedforward neural network architectures can be summarised by chaining together the number of neurons in each layer, e.g. "784-500-300-10" would be a net with 784 input neurons, followed by a layer with 500 neurons, then 300, and finally 10 output classes.
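As a rough illustration of what such an architecture string implies, the sketch below counts the trainable parameters of a fully-connected network. It assumes every layer has one weight matrix and one bias vector; the function name `count_parameters` is hypothetical.

```python
def count_parameters(architecture):
    """Count weights and biases of a fully-connected net such as "784-500-300-10"."""
    sizes = [int(s) for s in architecture.split("-")]
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out  # weight matrix between consecutive layers
        total += n_out         # one bias per neuron in the receiving layer
    return total

num_params = count_parameters("784-500-300-10")  # 392500 + 150300 + 3010 = 545810
```

Under the same assumption, the linear "784-10" model has 7850 parameters, so adding hidden layers increases the parameter count substantially.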

Build the model (30min)

We want to explore the different choices and some of the best practices of training feedforward neural networks on the MNIST dataset. In this practical we will only be using "fully connected" (also referred to as "FC", "affine", or "dense") layers (i.e. no convolutional layers). So let's first write a little helper function to construct these:

In [4]:
def _dense_linear_layer(inputs, layer_name, input_size, output_size):
    """
    Builds a layer that takes a batch of inputs of size `input_size` and returns 
    a batch of outputs of size `output_size`.

    Args:
        inputs: A `Tensor` of shape [batch_size, input_size].
        layer_name: A string representing the name of the layer.
        input_size: The size of the inputs
        output_size: The size of the outputs

    Returns:
        out, weights: tuple of layer outputs and weights.
    """
    # Name scopes allow us to logically group together related variables.
    # Setting reuse=False avoids accidental reuse of variables between different runs.
    with tf.variable_scope(layer_name, reuse=False):
        # Create the weights for the layer
        # (no initializer is given here, so TensorFlow's default is used).
        layer_weights = tf.get_variable("weights",
                                        shape=[input_size, output_size])
        # Create the biases for the layer
        layer_bias = tf.get_variable("biases",
                                     shape=[output_size])
        ## IMPLEMENT-ME: (1) 
        outputs = ...
    return (outputs, layer_weights)

Now let's use this to construct a linear softmax classifier as before, which we will expand into a near state-of-the-art feed-forward model for MNIST. We first create an abstract BaseSoftmaxClassifier base class that houses common functionality between the models. Each specific model will then provide a build_model method that represents the logic of that specific model.

In [5]:
class BaseSoftmaxClassifier(object):
    def __init__(self, input_size, output_size, l2_lambda):        
        # Define the input placeholders. The "None" dimension means that the 
        # placeholder can take any number of images as the batch size. 
        self.x = tf.placeholder(tf.float32, [None, input_size])
        self.y = tf.placeholder(tf.float32, [None, output_size])    
        self.input_size = input_size
        self.output_size = output_size
        self.l2_lambda = l2_lambda

        self._all_weights = [] # Used to compute L2 regularization in compute_loss().
        # You should override these in your build_model() function.
        self.logits = None
        self.predictions = None
        self.loss = None
        # Build the graph for this model (subclasses provide build_model()).
        self.build_model()

    def get_logits(self):
        return self.logits
    def build_model(self):
        raise NotImplementedError("Subclasses should implement this function!")
    def compute_loss(self):
        """All models share the same softmax cross-entropy loss."""
        assert self.logits is not None    # Ensure that logits has been created! 
        # IMPLEMENT-ME: (2)
        # HINT: This time, use the TensorFlow function tf.nn.softmax_cross_entropy_with_logits  rather than 
        # implementing it manually like we did in Prac1
        data_loss = ...

        reg_loss = 0.
        for w in self._all_weights:
            # IMPLEMENT-ME: (3)
            # HINT: TensorFlow has a built-in function for this too! tf.nn.l2_loss
            reg_loss += ...
        return data_loss + self.l2_lambda * reg_loss
    def accuracy(self):
        # Calculate accuracy.
        assert self.predictions is not None    # Ensure that pred has been created!
        # IMPLEMENT-ME: (4)
        # HINT: Look up the tf.equal and tf.argmax functions
        correct_prediction = ...
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
        return accuracy

If we wanted to reimplement the linear softmax classifier from before, we just need to override build_model() to perform one projection from the input to the output logits, like this:

In [6]:
class LinearSoftmaxClassifier(BaseSoftmaxClassifier):
    def __init__(self, input_size, output_size, l2_lambda):
        super(LinearSoftmaxClassifier, self).__init__(input_size, output_size, l2_lambda)
    def build_model(self):
        # The model takes x as input and produces output_size outputs.
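        self.logits, weights = _dense_linear_layer(
                self.x, "linear_layer", self.input_size, self.output_size)
        # Keep track of the weights so compute_loss() can regularize them.
        self._all_weights.append(weights)
        self.predictions = tf.nn.softmax(self.logits)
        self.loss = self.compute_loss()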

In order to build a deeper model, let's add several layers with multiple transformations and rectified linear units (ReLUs):

def build_model(self):
    # The first layer takes x as input and has num_hidden_1 outputs.
    layer1, weights1 = _dense_linear_layer(self.x, "layer1", self.input_size, self.num_hidden_1)
    layer1 = tf.nn.relu(layer1)

    # The second layer takes layer1's output as its input and has num_hidden_2 outputs.
    layer2, weights2 = _dense_linear_layer(layer1, "layer2", self.num_hidden_1, self.num_hidden_2)
    layer2 = tf.nn.relu(layer2)

    # The final layer is our predictions and goes from num_hidden_2 inputs to 
    # output_size outputs. The outputs are "logits" (un-normalised scores). 
    self.logits, weights3 = _dense_linear_layer(layer2, "output", self.num_hidden_2, self.output_size)

    # Keep track of the weights so compute_loss() can regularize them.
    self._all_weights.extend([weights1, weights2, weights3])
    self.predictions = tf.nn.softmax(self.logits)
    self.loss = self.compute_loss()

Note: Instead of writing special classes for linear nets (0 hidden layers), 1 hidden layer nets, 2 hidden layer nets, etc., we can generalize this as follows:

In [7]:
class DNNClassifier(BaseSoftmaxClassifier):
    """DNN = Deep Neural Network - now we're doing Deep Learning! :)"""
    def __init__(self, 
                 input_size=784,    # There are 28x28 = 784 pixels in MNIST images
                 hidden_sizes=[],    # List of hidden layer dimensions, empty for linear model.
                 output_size=10,    # There are 10 possible digit classes
                 act_fn=tf.nn.relu,    # The activation function to use in the hidden layers
                 l2_lambda=0.):    # The strength of regularisation, off by default.
        self.hidden_sizes = hidden_sizes
        self.act_fn = act_fn
        super(DNNClassifier, self).__init__(input_size, output_size, l2_lambda)
    def build_model(self):
        prev_layer = self.x
        prev_size = self.input_size
        for layer_num, size in enumerate(self.hidden_sizes):        
            layer_name = "layer_" + str(layer_num)
            ## IMPLEMENT-ME: (5)
            # HINT: Use the linear function we defined earlier!
            layer, weights = ...
            # Keep track of the weights so compute_loss() can regularize them.
            self._all_weights.append(weights)

            ## IMPLEMENT-ME: (6)
            # HINT: What do we still need to do after doing the "linear" part? 
            layer = ...
            prev_layer, prev_size = layer, size

        # The final layer is our predictions and goes from prev_size inputs to 
        # output_size outputs. The outputs are "logits", un-normalised scores. 
        self.logits, out_weights = _dense_linear_layer(prev_layer, "output", prev_size, self.output_size)
        self._all_weights.append(out_weights)
        self.predictions = tf.nn.softmax(self.logits)
        self.loss = self.compute_loss()

We can now create a linear model, i.e. a 784-10 architecture (note that there is only one possible linear model going from 784 inputs to 10 outputs), as follows:

tf_linear_model = DNNClassifier(input_size=784, hidden_sizes=[], output_size=10)

We can create a deep neural network (DNN, also called a "multi-layer perceptron") model, e.g. a 784-512-10 architecture (there can be many others...), as follows:

tf_784_512_10_model = DNNClassifier(input_size=784, hidden_sizes=[512], output_size=10)

and so forth.

NOTE: Make sure you understand how this works before you move on.

Sanity Checks (5-10 min)

Dealing with randomness

Pseudo-random number generators start from some value (the 'seed') and generate numbers which appear to be random. It is good practice to seed RNGs with a fixed value to encourage reproducibility of results. We do this in NumPy by using np.random.seed(1234), where 1234 is your chosen seed value.

In TensorFlow we can set a graph-level seed (global, using tf.set_random_seed(1234)) or an op-level seed (passed in to each op via the seed argument, to override the graph-level seed.)
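A quick way to convince yourself that seeding works is to draw the same random numbers twice. The sketch below uses only NumPy; the variable names are illustrative.

```python
import numpy as np

np.random.seed(1234)
first_run = np.random.randn(3)

# Re-seeding with the same value restarts the generator from the same state.
np.random.seed(1234)
second_run = np.random.randn(3)

same = np.array_equal(first_run, second_run)  # True: both draws are identical
```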

Next we need to initialize the parameters of our model. We do not want to initialize all weights to be the same.

QUESTION: Can you think of why this might be a bad idea? (Think about a simple MLP with one input and 2 hidden units and one output. Look at the contributions via each weight connection to the activations on the hidden layer. Now let those weights be equal. How does that affect the contributions? How does that affect the update that backprop would propose for each weight? Do you see any problems?)

The simplest approach is to initialize weights (W's) to small random values. For ReLUs specifically, it is recommended to initialize weights to np.random.randn(n) * sqrt(2.0/n), where n is the number of inputs to that layer (i.e. the number of neurons in the previous layer) (see He et al., 2015 for more information). This initialization encourages the distributions over ReLU activations to have roughly the same variance at each layer, which in turn ensures that useful gradient information gets sent back during backpropagation.
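As a small sketch of the initialization rule above, the following NumPy snippet samples a weight matrix for a layer with 784 inputs and 512 outputs and compares its empirical standard deviation to the target sqrt(2/n). The helper name `he_init` and the layer sizes are assumptions chosen for illustration.

```python
import numpy as np

def he_init(n_in, n_out, rng):
    """Sample an [n_in, n_out] weight matrix with standard deviation sqrt(2 / n_in)."""
    return rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

rng = np.random.RandomState(0)
W = he_init(784, 512, rng)

empirical_std = W.std()
target_std = np.sqrt(2.0 / 784)   # roughly 0.05 for 784 inputs
```

Note that the scale depends only on the number of inputs n, so wider input layers get smaller initial weights.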

Biases are not that sensitive to initialization and can be initialized to 0s or small random numbers.

Check the loss of a random model

When we have a new model, it's always a good idea to do a quick sanity check. A random model that predicts C classes on random data should have no reason to prefer any one class, i.e., on average its loss (negative log-likelihood) should be $-\log(1/C) = -\log(C^{-1}) = \log(C)$.
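The log(C) value can be checked numerically. The sketch below models "no preference" with all-zero logits (an assumption made purely for illustration), so that the softmax assigns probability 1/C to every class regardless of the label.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

C = 10
rng = np.random.RandomState(0)
labels = rng.randint(0, C, size=100)

# All-zero logits give every class probability 1/C.
probs = softmax(np.zeros((100, C)))
# Mean negative log-likelihood of the correct class.
nll = -np.mean(np.log(probs[np.arange(100), labels]))
```

Here `nll` agrees with `np.log(10)`, i.e. about 2.3026, which is the value the sanity check looks for.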

NOTE: TF actually computes the cross-entropy on the logits for numerical stability reasons (logs of small numbers blow up quickly..). This means we'll get a different value from tf.nn.softmax_cross_entropy_with_logits. So for this check we will just manually compute the cross-entropy (negative log-likelihood).
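To see why working on the logits helps, compare a naive cross-entropy (explicit softmax, then log) with one that uses the log-sum-exp trick. This is a sketch of the general idea in NumPy, not TensorFlow's actual implementation; the function names are hypothetical.

```python
import numpy as np

def naive_xent(logits, label):
    """Cross-entropy via an explicit softmax; can overflow for large logits."""
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return -np.log(probs[label])

def stable_xent(logits, label):
    """Cross-entropy computed directly on the logits with the log-sum-exp trick."""
    shifted = logits - np.max(logits)
    log_sum_exp = np.log(np.sum(np.exp(shifted)))
    return log_sum_exp - shifted[label]

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 2000.0, 3000.0])

agree_on_small = np.isclose(naive_xent(small, 0), stable_xent(small, 0))
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = naive_xent(large, 0)   # exp overflows to inf, result is nan
stable_large = stable_xent(large, 0)     # stays finite
```

For moderate logits both versions agree; for very large logits only the version that works directly on the logits returns a usable number.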

The second thing to check is that adding the L2 loss (a strictly positive value), should increase total loss (here we'll just use the TF cross-entropy as provided).

In [8]:

# Seed NumPy's RNG so that the random data below is reproducible.
np.random.seed(1234)
# Generate a batch of 100 "images" of 784 pixels consisting of Gaussian noise.
x_rnd = np.random.randn(100, 784)
print "Sample of random data:\n", x_rnd[:5,:]    # Print the first 5 "images"
print "Shape: ", x_rnd.shape
# Generate some random one-hot labels.
y_rnd = np.eye(10)[np.random.choice(10, 100)]
print "Sample of random labels:\n", y_rnd[:5,:]
print "Shape: ", y_rnd.shape

# Model without regularization.
tf.reset_default_graph()    # Start from an empty graph so variable names don't clash.
tf.set_random_seed(1234)
tf_linear_model = DNNClassifier(l2_lambda=0.0)
x, y = tf_linear_model.x, tf_linear_model.y

with tf.Session() as sess:
    # Initialize variables.
    init = tf.global_variables_initializer()
    sess.run(init)
    avg_cross_entropy = -tf.log(tf.reduce_mean(tf_linear_model.predictions))
    loss_no_reg = tf_linear_model.loss
    manual_avg_xent, loss_no_reg = sess.run([avg_cross_entropy, loss_no_reg],
                                            feed_dict={x : x_rnd, y: y_rnd})
# Sanity check: Loss should be about log(10) = 2.3026
print '\nSanity check manual avg cross entropy: ', manual_avg_xent
print 'Model loss (no reg): ', loss_no_reg

# Model with regularization.
tf.reset_default_graph()    # Start from an empty graph so variable names don't clash.
tf.set_random_seed(1234)
tf_linear_model = DNNClassifier(l2_lambda=1.0)
x, y = tf_linear_model.x, tf_linear_model.y

with tf.Session() as sess:
    # Initialize variables.
    init = tf.global_variables_initializer()
    sess.run(init)
    loss_w_reg = tf_linear_model.loss.eval(feed_dict={x : x_rnd, y: y_rnd})

# Sanity check: Loss should go up when you add regularization
print 'Sanity check loss (with regularization, should be higher): ', loss_w_reg

Sample of random data:
[[ 0.47143516 -1.19097569  1.43270697 ..., -0.2453605  -1.26943186
 [ 2.33759848 -0.78171744  0.08009975 ..., -0.64055353  1.76256841
 [ 1.63617833 -0.54410827 -1.04999868 ..., -0.90640906  0.31915076
 [-0.66039926  0.0774697   0.38755182 ...,  0.31019053  1.87791254
 [-3.23350453  0.20024296 -0.13933709 ..., -0.72622006  0.50774695
Shape:  (100, 784)
Sample of random labels:
[[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.  0.]]
Shape:  (100, 10)

Sanity check manual avg cross entropy:  2.30259
Model loss (no reg):  46.3142
Sanity check loss (with regularization, should be higher):  4025.31

Compute the gradients: Backpropagation Review

Once one has implemented the model and checked that the loss produces sensible outputs it is time to create the training loop. For this, we will need to obtain the gradients of the loss wrt each model parameter. We've already looked at this in detail in practical 1, where we performed this manually. One of the great benefits of using a library like TensorFlow, is that it can automatically derive the gradients wrt any parameter in the graph (using tf.gradients()).

In the previous practical we saw that there is a common pattern to deriving the gradients in neural networks:

  1. propagate activations forward through the network ("make a prediction" $\Rightarrow$ fprop),
  2. compute an error delta ("see how far we're off") , and
  3. propagate errors backwards to update the weights ("update the weights to do better next time" $\Rightarrow$ backprop).

It turns out that for deeper networks, this pattern repeats for every new layer. Take a deep breath, and let's dive in, starting with fprop:


Conceptually, forward propagation is very simple: Starting with the input x, we repeatedly apply an affine function and a non-linearity to arrive at the output $a^L = \sigma^L(W^{L-1}\sigma^{L-1}( \ldots \sigma(W^1x + b^1) \dots ) + b^{L-1})$.

Mathematically, forward propagation comes down to a composition of functions. So for two layers, the output activation $a^2$ is the composition of the function $a^1 = f^1(x)$ and $a^2 = f^2(a^1)$ $\Rightarrow$ $a^2 = f^2(f^1(x))$.

NOTE: We save both the pre-activations (before the non-linearity) and the (post-)activations (after the nonlinearity). We'll need these for the backprop phase!

In code it looks like this:

In [9]:

def fprop(x, weights, biases, per_layer_nonlinearities):

    # Initialise the input. We pretend inputs are the first pre- and post-activations.
    z = a = x
    cache = [(z, a)] # We'll save z's and a's for the backprop phase.
    for W, b, act_fn in zip(weights, biases, per_layer_nonlinearities): 
        z = np.dot(W, a) + b        # "pre-activations" / logits
        a = act_fn(z)               # "outputs" / (post-)activations of the current layer
        # NOTE: We save both pre-activations and (post-)activations for the backwards phase!
        cache.append((z, a))
    return cache # Per-layer pre- and post-activations.


Given a loss or error function $E$ at the output (e.g. cross-entropy), we then need the derivative of the loss wrt each of the model parameters in order to train the network (decrease the loss). Mathematically, this comes down to the derivative of a composition of functions, and from calculus we know we have the chain rule for that: If $a = f(g(x))$, then $\frac{\partial a}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}$.
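The chain rule can be verified numerically on a toy composition. In the sketch below, f = sin and g(x) = x^2 are arbitrary choices made for illustration; we compare the analytic derivative from the chain rule to a central finite difference.

```python
import numpy as np

def g(x):
    return x ** 2

def f(u):
    return np.sin(u)

def analytic_derivative(x):
    # Chain rule: d/dx f(g(x)) = f'(g(x)) * g'(x) = cos(x^2) * 2x
    return np.cos(g(x)) * 2.0 * x

def numeric_derivative(x, eps=1e-6):
    # Central finite difference of the composed function.
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

x0 = 0.7
analytic = analytic_derivative(x0)
numeric = numeric_derivative(x0)
```

The two values agree to within finite-difference error. The same kind of comparison ("gradient checking") is a useful sanity check for hand-written backprop code.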

In order to get the gradients on some intermediate parameter $\theta^i = \{W^i, b^i\}$ in layer $i$, we just apply the chain rule over all $L$ 'layers' of this composition of functions (the neural network) to derive the intermediate gradients:

\begin{aligned} \frac{\partial E}{\partial \theta^{(i)}} &= \underbrace{ \frac{\partial E}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial z^{(L-1)}} \ldots \frac{\partial z^{(i+2)}}{\partial z^{(i+1)}}}_{\triangleq \delta^{(i+1)}} \frac{\partial z^{(i+1)}}{\partial \theta^{(i)}} \\ &= \delta^{(i+1)} \frac{\partial z^{(i+1)}}{\partial \theta^{(i)}} \end{aligned}

We have glossed a little over the fact that we are dealing with matrices and vectors, for the sake of brevity. But please see these two great resources:

  • For a step-by-step walk-through of the mechanics of backpropagation, see this great resource.
  • For a great brush-up on the mechanics of vector calculus, see this fantastic note (for the same course).


  • $E$ is the error/loss function at the output.
  • $W^{(i)}$ is an intermediate weight matrix mapping activations from layer $i$ to pre-activations $z^{(i+1)}$ at layer $i+1$.
  • The derivatives of the error with respect to the pre-activations of layer $i$ are called deltas and are defined as $\delta^{(i)} \triangleq \frac{\partial E}{\partial z^{(i)}}$.
    • $\delta^i$ is a vector (the size of the layer $i$), and is the result of the deltas at the output $\delta^L$ (dE/dlogits) multiplied by a product of Jacobian matrices. If you don't understand this statement, just think of $\delta^i_j$ as the contribution that unit $j$ in layer $i$ made to the total error $E$. Also, read up on this in the resources linked above.
    • NB: $z^{(i)}$ are the pre-activations ("logits" at the output layer)!
  • $\frac{\partial z^{(i+1)}}{\partial W^{(i)}} = \frac{\partial (W^{(i)}a^i + b^i)}{\partial W^{(i)}} = a^i$.
  • Likewise, $\frac{\partial z^{(i+1)}}{\partial b^{(i)}} = 1$ (check this for yourself).


  1. What is the shape of $\frac{\partial E}{\partial \mathbb{W}^{(i)}}$?

    • Convince yourself (now or later) that $\frac{\partial E}{\partial W^{(i)}_{jk}} = \delta^{(i+1)}_k {a^i_j}$ (a scalar), and therefore $\frac{\partial E}{\partial \mathbb{W}^{(i)}} = \mathbb{\delta}^{(i+1)} {\mathbb{a}^i}^T$ (outer product of two vectors, therefore a matrix of the same shape as $W^i$).
    • In words: The gradient on weights $W^i$ at layer $i$ is the outer product of the deltas-vector from the layer above $\delta^{(i+1)}$ and the activations vector from the layer below $a^i$.
  2. What is the shape of $\frac{\partial E}{\partial \mathbb{b}^{(i)}}$?

    • Convince yourself (now or later) that $\frac{\partial E}{\partial \mathbb{b}^{(i)}} = \mathbb{\delta}^{(i+1)}$ (a vector of the same length as $b^i$).
    • In words: The gradient on biases $b^i$ at layer $i$ is equal to the deltas at the layer above.
  3. Make sure you understand why we save the (z,a)'s during fprop, and how we use them during backprop.
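The shapes in the questions above can be checked on a tiny example. The sketch below uses a single softmax layer with cross-entropy loss, treats vectors as 1-D arrays, and stores W with shape [outputs, inputs] so that z = Wa + b; these sizes and conventions are assumptions made for illustration. It also compares one entry of the outer-product gradient against a finite difference.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(W, b, a, y):
    """Cross-entropy of softmax(W a + b) against a one-hot target y."""
    return -np.sum(y * np.log(softmax(W.dot(a) + b)))

rng = np.random.RandomState(0)
a = rng.randn(4)            # activations from the layer below
W = rng.randn(3, 4) * 0.1   # maps 4 inputs to 3 pre-activations
b = np.zeros(3)
y = np.array([0.0, 1.0, 0.0])

delta = softmax(W.dot(a) + b) - y   # dE/dz for softmax + cross-entropy
W_grad = np.outer(delta, a)         # outer product, same shape as W
b_grad = delta                      # same shape as b

# Finite-difference check of one weight entry.
eps = 1e-6
j, k = 2, 1
W_plus, W_minus = W.copy(), W.copy()
W_plus[j, k] += eps
W_minus[j, k] -= eps
numeric = (loss(W_plus, b, a, y) - loss(W_minus, b, a, y)) / (2 * eps)
```

`W_grad[j, k]` matches `numeric` up to finite-difference error, and the gradient shapes equal the parameter shapes.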

Note that we need to compute the deltas at every layer, and $\delta^i$ shares all the terms of $\delta^{(i+1)}$ (which we've already computed), except one. Backprop has one more trick up its sleeve, and that is to reuse previously computed values to save computation (this is called dynamic programming). For the deltas, this comes down to computing $\delta^{(i)}$ given $\delta^{(i+1)}$. We won't show the full derivation (but we again use the chain rule, see the notes linked above) to arrive at:

\begin{align} \delta^{(i)} &= \frac{\partial E}{\partial z^{(i)}} \\ &= \ldots \\ &= \underbrace{ \left[ {\mathbb{W}^{(i)}}^T \delta^{(i+1)} \right] }_\textrm{Map the global delta 'backwards' through W up to layer $i$} \circ \underbrace{ \sigma'(z^{(i)}) }_\textrm{Correct for the local errors at this layer}. \end{align}

In words: The delta on the current layer $i$ is the transposed weight matrix between the two layers $W^i$ multiplied by the delta on the layer above $\delta^{(i+1)}$ (which yields a vector the size of layer $i$), multiplied element-wise by $\sigma'(z^{(i)})$ (the derivative of the non-linearity applied to the original pre-activations). The order of the product follows from treating the deltas as column vectors, consistent with $z^{(i+1)} = W^{(i)}a^i + b^i$ above.
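The delta recursion can be checked end-to-end on a small two-layer network. The sketch below assumes a sigmoid hidden layer, a softmax output with cross-entropy loss, 1-D arrays for vectors, and weight matrices of shape [outputs, inputs] (so that z = Wa + b). The layer sizes and names are arbitrary choices for illustration, and the layers are simply numbered 1 and 2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    z1 = W1.dot(x) + b1
    a1 = sigmoid(z1)
    z2 = W2.dot(a1) + b2
    return z1, a1, softmax(z2)

def loss(x, y, W1, b1, W2, b2):
    return -np.sum(y * np.log(forward(x, W1, b1, W2, b2)[2]))

rng = np.random.RandomState(1)
x = rng.randn(3)
y = np.array([1.0, 0.0])
W1, b1 = rng.randn(4, 3), rng.randn(4)
W2, b2 = rng.randn(2, 4), rng.randn(2)

z1, a1, pred = forward(x, W1, b1, W2, b2)
delta2 = pred - y                           # delta at the output layer
dsig = sigmoid(z1) * (1.0 - sigmoid(z1))    # sigma'(z1)
delta1 = W2.T.dot(delta2) * dsig            # map delta back through W2, then scale
W1_grad = np.outer(delta1, x)               # gradient on the first weight matrix

# Finite-difference check of one entry of W1.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 2] += eps
W1_minus[0, 2] -= eps
numeric = (loss(x, y, W1_plus, b1, W2, b2) - loss(x, y, W1_minus, b1, W2, b2)) / (2 * eps)
```

Note how `delta1` reuses `delta2` rather than recomputing the whole chain of derivatives from the output; this reuse is the dynamic-programming part of backprop.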

QUESTION: Look at the code below and make sure you understand (at least conceptually) how this works.

In [10]:

def backprop(target, fprop_cache, weights):
    # Pop/remove the model prediction (last activation `a` we computed above) 
    # off the cache we created during the fprop phase.
    (_, pred) = fprop_cache.pop() 
    # Initialise delta^{L} (at the output layer) as dE/dz (softmax + cross-entropy).
    delta_above = (pred - target)
    grads = []
    # Unroll backwards from the output:
    for (z_below, a_below), W_between in reversed(list(zip(fprop_cache, weights))):
        # Compute dE/dW (vectors are treated as column vectors here):
        Wgrad = np.dot(delta_above, a_below.T) # Outer product
        # Compute dE/db:
        bgrad = delta_above     
        # Save these:
        grads.append((Wgrad, bgrad))
        # Update for the *next* iteration/layer. Note the elem-wise multiplication.
        # Note the use of z_below, the preactivations in the layer below!
        # delta^i = (W^i)^T.delta^{(i+1)} .* sigma'(z_i), where `dsigmoid` stands for
        # the derivative of the non-linearity in the layer below (assumed defined elsewhere).
        delta_above = np.dot(W_between.T, delta_above) * dsigmoid(z_below)
    return grads

Ppphhhhhheeeeewwwww, ok, come up for a breather. We know. This takes a while to wrap your head around. For now, just try to get the high-level picture! Depending on your background, this may well be the toughest part of the week.

BACKPROP SUMMARY NOTE: It helps to have a good idea of what backprop does, and for this it helps to work through one example by hand (see the linked notes above). But you don't need to understand every detail above for the rest of this and the following lectures. The good news is that in practice you won't compute gradients by hand, and all of the above is done for us by tf.gradients() (or the similar function in other packages)!

BACKPROP FINAL QUESTION: Take a few minutes to explain the high-level details of the backprop algorithm to your neighbour. Try to understand how fprop is essentially just a composition of functions applied to the input, and backprop 'peels off' those compositions one by one by following the chain rule. In the process it avoids recomputation by saving activations during fprop, and computing deltas during backprop. Be sure to ask the tutors if you're stuck!

In [11]:

# PSEUDOCODE:

biases, weights = [b1, b2], [W1, W2]
x, y = ..., ...
non_linearities = [relu, softmax]

fprop_cache = fprop(x, weights, biases, non_linearities)
grads = backprop(y, fprop_cache, weights)

Training deep neural networks (10min)

Now that we have a model with a loss and gradients, let's write a function to train it! This will be largely similar to the train_tf_model() function (same name, see below) from the previous practical. However, we are introducing several new concepts in this practical:

  • how to update parameters using different optimizers,
  • model complexity and how to match this to your data:
    • recognizing overfitting
    • adding regularization (L2, dropout)
  • knowing when to stop training: early stopping,

As we go through these concepts, we will show how to use them in our training function.


Training neural networks involves solving an optimization problem. Stochastic gradient-based methods are by far the most popular family of techniques for this. These methods evaluate the gradient of the loss on a small part of the data (called a mini-batch) and then propose a small change to the weights that reduces the loss on that sample (some methods also maintain running averages over the previous steps), before moving on to another sample of the data:

step = optimizer(grad(cost), learning_rate, ...) 
new_weights = old_weights + step
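To make the update rule above concrete, here is a minimal NumPy sketch of two optimizers applied to a toy cost, 0.5 * ||w||^2, whose gradient is simply w. The function names, the toy cost, and the hyperparameter values are assumptions chosen for illustration; real optimizers such as those in TensorFlow have more options.

```python
import numpy as np

def grad(w):
    """Gradient of the toy cost 0.5 * ||w||^2, whose minimum is at w = 0."""
    return w

def sgd_step(g, learning_rate):
    """Plain gradient descent: step against the gradient."""
    return -learning_rate * g

def momentum_step(g, velocity, learning_rate, beta=0.9):
    """Return the new velocity (a decaying sum of past steps), which is also the step."""
    return beta * velocity - learning_rate * g

w_sgd = np.array([1.0, -2.0])
w_mom = np.array([1.0, -2.0])
velocity = np.zeros(2)
for _ in range(200):
    w_sgd = w_sgd + sgd_step(grad(w_sgd), learning_rate=0.1)
    velocity = momentum_step(grad(w_mom), velocity, learning_rate=0.1)
    w_mom = w_mom + velocity
```

Both variants follow the pattern `new_weights = old_weights + step` and drive the weights towards the minimum at zero; they differ only in how the step is computed from the gradient.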

The oldest algorithm is stochastic gradient descent, but there are many others (Adagrad, AdaDelta, RMSProp, ADAM, etc.). As a general rule of thumb, ADAM or SGD with Momentum tend to work quite well out of the box, but this depends on your model and your data! See these two great blog posts for more on this:

In our code, we can select different optimization functions by passing in a different optimizer to train_tf_model (see the full list here: https://www.tensorflow.org/api_guides/python/train#Optimizers) as follows:

optimizer = tf.train.RMSPropOptimizer(...)
results_tuple = train_tf_model(optimizer_fn=optimizer, ...)

Training / validation / test splits

When training supervised machine learning models, the goal is to build a model that will perform well on some test task with data that we'll obtain some time in the future. Unfortunately, until we solve time-travel, we don't yet have access to that data. So how do we train a model on data we have now, to perform well on some (unseen) data from the future? This, in a nutshell, is the statistical learning problem: we want to train a model on available data to generalize to unseen test data.

The way we approach this is to take a dataset that we do have, and split it into training, validation, and test sets. A typical choice of ratios is, for example, 80/10/10. We then train our model on the training set only, and use the validation set to make all kinds of decisions about architectural selection, hyperparameters, etc. When we're done, we evaluate our model and report its accuracy on the test set only.
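A simple way to produce such a split is to shuffle the example indices once and slice them. The sketch below is one possible approach; the function name `split_dataset`, its default ratios, and the fixed seed are assumptions for illustration. (The MNIST loader used in this practical already provides train, validation and test sets.)

```python
import numpy as np

def split_dataset(num_examples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return shuffled, non-overlapping index arrays for train/validation/test."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(num_examples)
    n_train = int(fractions[0] * num_examples)
    n_valid = int(fractions[1] * num_examples)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]   # whatever remains
    return train, valid, test

train_idx, valid_idx, test_idx = split_dataset(1000)
```

Because the three index sets never overlap, no test example can leak into training or model selection.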

NOTE: We (typically) do not train on the validation set, and we never train on the test set. Think about why training on the test set might be a bad thing?

Model complexity, overfitting & regularization

Overfitting occurs when improving the model's training loss (its performance on training data) comes at the expense of its generalisation ability (its performance on unseen test data). Generally it is a symptom of the model complexity increasing to fit the peculiarities (outliers) of the training data too accurately, causing it not to generalize well to new unseen (test) data. Overfitting is usually indicated when:

  • training & validation loss start decreasing at different rates,
  • validation error starts increasing while training error still goes down,
  • training error reaches 0.

Underfitting is the opposite: when a model cannot fit the training data well enough (usually a sign to train for longer or add more parameters to the model).

You can think of complexity as how "wiggly" or "wrinkly" the decision boundary that the model can represent is. We can increase model complexity by adding more layers (i.e. more parameters). We can control or reduce the model complexity of an architecture using a family of techniques called regularisers. We've already encountered L2-regularisation (also called weight-decay), where we penalise the model for having very large weights. Another option is L1 regularization (which encourages sparsity of weights). Another very popular current technique is called dropout (we'll look at this in more detail in Practical 3). There are many others, but these are among the most popular.

We will only be using L2 regularization in this practical. It can be set by passing a non-zero value to l2_lambda when constructing a DNNClassifier instance.
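The effect of l2_lambda on the loss can be sketched in NumPy as follows. The factor 0.5 mirrors the convention of tf.nn.l2_loss (sum of squares divided by 2); the helper names, the example weights, and the data-loss value are made up for illustration.

```python
import numpy as np

def l2_penalty(weights):
    """Sum over all weight matrices of 0.5 * sum(w ** 2)."""
    return sum(0.5 * np.sum(w ** 2) for w in weights)

def regularised_loss(data_loss, weights, l2_lambda):
    """Total loss = data loss + l2_lambda * L2 penalty."""
    return data_loss + l2_lambda * l2_penalty(weights)

weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0], [-1.0]])]
data_loss = 2.3

loss_no_reg = regularised_loss(data_loss, weights, l2_lambda=0.0)
loss_with_reg = regularised_loss(data_loss, weights, l2_lambda=0.1)  # 2.3 + 0.1 * 7.625
```

With l2_lambda=0 the penalty vanishes; any positive value adds a cost that grows with the squared size of the weights, which pushes the optimizer towards smaller weights.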

Early stopping

Neural networks are nonlinear models and can have very complicated optimization landscapes. Stochastic gradient-based methods for optimizing these loss functions do not proceed monotonically (i.e. the loss does not just keep going down). Sometimes the loss can go up for a while before it goes down again to reach a better part of parameter space later. How do we know when to stop training?

Early stopping is one technique that helps with this. It is added to the training routine and means that we periodically evaluate the model's performance on the validation set (Crucially! not the test set. Why?). If the performance on the validation set starts becoming worse we know we have reached the point of overfitting (usually), so it usually makes sense to stop training and not waste any more computations. The train_tf_model function we build has an early-stopping feature that you can enable by passing the stop_early=True parameter. Have a look at the code to see how this is done.

For this practical, we just implemented the most basic idea of early stopping: stop training as soon as the model starts doing worse on validation data. However, there are different ways of implementing this idea. Two of the most popular are

  • "early stopping with patience": don't stop training immediately once validation accuracy degrades, but wait for P more epochs, and reset P if the model starts improving again within this timeframe,
  • training for T epochs, and simply selecting the best model based on validation score over the entire T epochs.

QUESTION: What are the pros and cons of these different methods?
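
To help compare the variants, the sketch below applies each stopping rule to the same list of validation losses. The function names and the loss values are made up for illustration, the rules are phrased in terms of validation loss rather than accuracy, and the patience rule shown (reset the counter whenever the best loss so far is beaten) is one common reading of the idea, not necessarily the only one.

```python
def stop_on_first_increase(val_losses):
    """Basic rule: stop at the first evaluation that is no better than the
    previous one. Returns the index of the evaluation where training stops."""
    for i in range(1, len(val_losses)):
        if val_losses[i] >= val_losses[i - 1]:
            return i
    return len(val_losses) - 1

def stop_with_patience(val_losses, patience):
    """Stop once `patience` evaluations in a row fail to beat the best loss
    seen so far. Returns (stop_index, best_index)."""
    best_index = 0
    waited = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] < val_losses[best_index]:
            best_index = i
            waited = 0    # Improvement: reset the patience counter.
        else:
            waited += 1
            if waited >= patience:
                return i, best_index
    return len(val_losses) - 1, best_index

def best_of_all(val_losses):
    """Train for all T evaluations, then pick the best one afterwards."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

# Made-up validation losses with a temporary bump at index 2.
val_losses = [0.90, 0.70, 0.75, 0.60, 0.50, 0.55, 0.58, 0.61, 0.45]

print(stop_on_first_increase(val_losses))          # Stops at the bump.
print(stop_with_patience(val_losses, patience=2))  # Survives the bump.
print(best_of_all(val_losses))                     # Needs the full run.
```

On this toy curve the basic rule gives up at the first bump, patience recovers from it but still misses the late improvement, and selecting the best of all T evaluations finds it at the cost of always paying for the full run.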

Wrapping these ideas into the training function (10-15min)

The training function below implements all the ideas we discussed above.

In [17]:
class MNISTFraction(object):
    """A helper class to extract only a fixed fraction of MNIST data."""
    def __init__(self, mnist, fraction):
        self.mnist = mnist
        self.num_images = int(mnist.num_examples * fraction)
        self.image_data, self.label_data = mnist.images[:self.num_images], mnist.labels[:self.num_images]
        self.start = 0
    def next_batch(self, batch_size):
        start = self.start
        end = min(start + batch_size, self.num_images)
        self.start = 0 if end == self.num_images else end
        return self.image_data[start:end], self.label_data[start:end]

In [18]:
def train_tf_model(tf_model,                                     
                   session,    # The active session.
                   num_epochs,    # Max epochs/iterations to train for.
                   batch_size=50,    # Number of examples per batch.
                   keep_prob=1.0,    # (1. - dropout) probability, none by default.
                   train_only_on_fraction=1.,    # Fraction of training data to use.
                   optimizer_fn=None,    # The optimizer we want to use
                   report_every=1, # Report training results every nr of epochs.
                   eval_every=1,    # Evaluate on validation data every nr of epochs.
                   stop_early=True,    # Use early stopping or not.
                   verbose=True):    # Print progress or not.

    # Get the (symbolic) model input, output, loss and accuracy.
    x, y = tf_model.x, tf_model.y
    loss = tf_model.loss
    accuracy = tf_model.accuracy()

    # Compute the gradient of the loss with respect to the model parameters 
    # and create an op that will perform one parameter update using the specific
    # optimizer's update rule in the direction of the gradients.
    if optimizer_fn is None:
        optimizer_fn = tf.train.AdamOptimizer()
    optimizer_step = optimizer_fn.minimize(loss)

    # Get the op which, when executed, will initialize the variables.
    init = tf.global_variables_initializer()
    # Actually initialize the variables (run the op).
    session.run(init)

    # Save the training loss and accuracies on training and validation data.
    train_costs = []
    train_accs = []
    val_costs = []
    val_accs = []

    if train_only_on_fraction < 1:
        mnist_train_data = MNISTFraction(mnist.train, train_only_on_fraction)
    else:
        mnist_train_data = mnist.train
    prev_c_eval = 1000000
    # Main training cycle.
    for epoch in range(num_epochs):

        avg_cost = 0.
        avg_acc = 0.
        total_batch = int(train_only_on_fraction * mnist.train.num_examples / batch_size)

        # Loop over all batches.
        for i in range(total_batch):
            batch_x, batch_y = mnist_train_data.next_batch(batch_size)
            # Run optimization op (backprop) and cost op (to get loss value),
            # and compute the accuracy of the model.
            feed_dict = {x: batch_x, y: batch_y}
            if keep_prob < 1.:
                feed_dict["keep_prob:0"] = keep_prob
            _, c, a = session.run(
                    [optimizer_step, loss, accuracy], feed_dict=feed_dict)
            # Compute average loss/accuracy
            avg_cost += c / total_batch
            avg_acc += a / total_batch            
        train_costs.append((epoch, avg_cost))
        train_accs.append((epoch, avg_acc))

        # Display logs per epoch step
        if epoch % report_every == 0 and verbose:
            print "Epoch:", '%04d' % (epoch+1), "Training cost=", \
                "{:.9f}".format(avg_cost)

        if epoch % eval_every == 0:
            val_x, val_y = mnist.validation.images, mnist.validation.labels            
            feed_dict = {x : val_x, y : val_y}
            if keep_prob < 1.:
                feed_dict['keep_prob:0'] = 1.0
            c_eval, a_eval = session.run([loss, accuracy], feed_dict=feed_dict)
            if verbose:
                print "Epoch:", '%04d' % (epoch+1), "Validation acc=", \
                    "{:.9f}".format(a_eval)
            # Basic early stopping: quit as soon as the validation loss fails to improve.
            if c_eval >= prev_c_eval and stop_early:
                print "Validation loss stopped improving, stopping training early after %d epochs!" % (epoch + 1)
                break
            prev_c_eval = c_eval
            val_costs.append((epoch, c_eval))
            val_accs.append((epoch, a_eval))
    print "Optimization Finished!"
    return train_costs, train_accs, val_costs, val_accs

In [19]:
# Helper functions to plot training progress.

def my_plot(list_of_tuples):
    """Take a list of (epoch, value) and split these into lists of 
    epoch-only and value-only. Pass these to plot to make sure we
    line up the values at the correct time-steps.
    """
    plt.plot(*zip(*list_of_tuples))

def plot_multi(values_lst, labels_lst, y_label, x_label='epoch'):
    # Plot multiple curves.
    assert len(values_lst) == len(labels_lst)
    plt.subplot(2, 1, 2)
    for v in values_lst:
        my_plot(v)
    plt.legend(labels_lst, loc='upper left')
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.show()
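
Results are stored as (epoch, value) pairs because validation only runs every eval_every epochs, so the validation curve has fewer points than the training curve. A small sketch, with made-up numbers, of how such a list can be split into x and y values for plotting:

```python
# Made-up (epoch, value) pairs, as if validation ran every second epoch.
val_costs = [(0, 2.30), (2, 1.10), (4, 0.70)]

# zip(*pairs) transposes the list: one tuple of epochs, one tuple of values.
epochs, values = zip(*val_costs)

print(epochs)
print(values)
```

Passing both sequences to the plotting call places each value at the epoch where it was measured, so curves with different evaluation intervals still line up on the same axis.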

Wrapping everything together and verifying that it works (10min)

Once we have a training function, it is usually a good idea to train on a small amount of your data first to verify that everything is indeed working. We can put all the pieces together to achieve this as follows:

In [15]:
##### BUILD MODEL #####
tf.reset_default_graph()    # Clear the graph.
model = DNNClassifier()     # Choose model hyperparameters.

with tf.Session() as sess:

    ##### TRAIN MODEL #####

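    train_losses, train_accs, val_losses, val_accs = train_tf_model(
        model,
        session=sess,
        num_epochs=10,    # The log below runs for 10 epochs.
        eval_every=2)     # The log below validates every 2nd epoch.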


    # Get the op which calculates model accuracy.
    accuracy_op = model.accuracy()    # Get the symbolic accuracy operation
    # Connect the MNIST test images and labels to the model input/output
    # placeholders, and compute the accuracy given the trained parameters.
    accuracy = accuracy_op.eval(feed_dict = {model.x: mnist.test.images, 
                                             model.y: mnist.test.labels})
    print "Accuracy on test set:", accuracy

Epoch: 0001 Training cost= 22.728044510
Epoch: 0001 Validation acc= 0.075199999
Epoch: 0002 Training cost= 21.128668196
Epoch: 0003 Training cost= 19.666281579
Epoch: 0003 Validation acc= 0.081200004
Epoch: 0004 Training cost= 18.349199122
Epoch: 0005 Training cost= 17.178852853
Epoch: 0005 Validation acc= 0.090599999
Epoch: 0006 Training cost= 16.147987530
Epoch: 0007 Training cost= 15.243225618
Epoch: 0007 Validation acc= 0.094599999
Epoch: 0008 Training cost= 14.447481615
Epoch: 0009 Training cost= 13.744003599
Epoch: 0009 Validation acc= 0.101599999
Epoch: 0010 Training cost= 13.119138050
Optimization Finished!
Accuracy on test set: 0.1226

Instead of just training and checking that the loss goes down, it is usually a good idea to try to overfit a small subset of your training data. We will do this below by training a 1 hidden layer network on a subset of the MNIST training data, by setting the train_only_on_fraction training hyperparameter to 0.05 (i.e. 5%). We turn off early stopping for this. The following diagram illustrates the difference between under-fitting and over-fitting. Note that the diagram is idealised and it's not always this clear in practice!

QUESTION: Why do we turn off early-stopping?
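
The idea of deliberately overfitting a small subset can be tried in isolation. The sketch below is a stand-in built on several assumptions, not the practical's code: it uses plain Python instead of TensorFlow, a logistic-regression model instead of DNNClassifier, and random inputs with random labels instead of MNIST. Because the labels are random there is nothing to generalise from, so a correct training loop should still memorise the 20 training points while staying near chance on fresh data.

```python
import math
import random

random.seed(0)

def make_data(n, dim):
    """Random inputs with random binary labels (a stand-in for a tiny dataset)."""
    xs = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    ys = [random.randint(0, 1) for _ in range(n)]
    return xs, ys

def predict_prob(w, x):
    """Logistic-regression probability of class 1."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    z = max(-30.0, min(30.0, z))    # Clip to avoid overflow in exp.
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(w, xs, ys):
    correct = sum(int((predict_prob(w, x) > 0.5) == (y == 1))
                  for x, y in zip(xs, ys))
    return correct / float(len(ys))

def train(xs, ys, num_steps=500, learning_rate=0.1):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(num_steps):
        grad = [0.0] * dim
        for x, y in zip(xs, ys):
            err = predict_prob(w, x) - y
            for j in range(dim):
                grad[j] += err * x[j] / len(xs)
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad)]
    return w

# Far more parameters (100) than training points (20): easy to memorise.
train_x, train_y = make_data(n=20, dim=100)
val_x, val_y = make_data(n=200, dim=100)

w = train(train_x, train_y)
train_acc = accuracy(w, train_x, train_y)
val_acc = accuracy(w, val_x, val_y)
print("train accuracy: %.2f, held-out accuracy: %.2f" % (train_acc, val_acc))
```

If the training accuracy in a check like this refuses to approach 100%, the bug is more likely in the data pipeline, loss or update step than in the choice of hyperparameters.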

In the rest of this practical, we will explore the effects of different model hyperparameters and different training choices, so let's wrap everything together to emphasize these different choices, and then train a simple model for a few epochs on 5% of the data to verify that everything works and that we can overfit a small portion of the data.

In [20]:

# Helper to wrap building, training, evaluating and plotting model accuracy.

def build_train_eval_and_plot(build_params, train_params, verbose=True):
    tf.reset_default_graph()    # Clear the graph so repeated calls start fresh.
    m = DNNClassifier(**build_params)

    with tf.Session() as sess:
        # Train model on the MNIST dataset.
        train_losses, train_accs, val_losses, val_accs = train_tf_model(
            m, sess, verbose=verbose, **train_params)

        # Now evaluate it on the test set:
        accuracy_op = m.accuracy()    # Get the symbolic accuracy operation
        # Calculate the accuracy using the test images and labels.
        accuracy = accuracy_op.eval({m.x: mnist.test.images,
                                     m.y: mnist.test.labels})
        if verbose: 
            print "Accuracy on test set:", accuracy
            # Plot losses and accuracies.
            plot_multi([train_losses, val_losses], ['train', 'val'], 'loss', 'epoch')
            plot_multi([train_accs, val_accs], ['train', 'val'], 'accuracy', 'epoch')
        ret = {'train_losses': train_losses, 'train_accs' : train_accs,
               'val_losses' : val_losses, 'val_accs' : val_accs,
               'test_acc' : accuracy}
        return m, ret

#################################CODE TEMPLATE##################################
# Specify the model hyperparameters (NOTE: All the defaults can be omitted):
model_params = {
        #'input_size' : 784,    # There are 28x28 = 784 pixels in MNIST images
        'hidden_sizes' : [512], # List of hidden layer dimensions, empty for linear model.
        #'output_size' : 10,    # There are 10 possible digit classes
        #'act_fn' : tf.nn.relu,    # The activation function to use in the hidden layers
        'l2_lambda' : 0.            # Strength of L2 regularization.
}

# Specify the training hyperparameters:
training_params = {'num_epochs' : 100,     # Max epochs/iterations to train for.
                        #'batch_size' : 50,    # Number of examples per batch, 50 default.
                        #'keep_prob' : 1.0,    # (1. - dropout) probability, none by default.
                        'train_only_on_fraction' : 5e-2,    # Fraction of training data to use, 1. for everything.
                        'optimizer_fn' : None,    # Optimizer, None for Adam.
                        'report_every' : 1, # Report training results every nr of epochs.
                        'eval_every' : 2,     # Evaluate on validation data every nr of epochs.
                        'stop_early' : False,    # Use early stopping or not.
}

# Build, train, evaluate and plot the results!
trained_model, training_results = build_train_eval_and_plot(
        model_params,
        training_params,
        verbose=True    # Modify as desired.
)

###############################END CODE TEMPLATE################################

Epoch: 0001 Training cost= 86.671144555
Epoch: 0001 Validation acc= 0.470400006
Epoch: 0002 Training cost= 32.378755257
Epoch: 0003 Training cost= 19.619923999
Epoch: 0003 Validation acc= 0.705600023
Epoch: 0004 Training cost= 14.065403336
Epoch: 0005 Training cost= 10.729673871
Epoch: 0005 Validation acc= 0.766799986
Epoch: 0006 Training cost= 8.362369494
Epoch: 0007 Training cost= 6.607117094
Epoch: 0007 Validation acc= 0.790600002
Epoch: 0008 Training cost= 5.190952872
Epoch: 0009 Training cost= 4.023866670
Epoch: 0009 Validation acc= 0.798600018
Epoch: 0010 Training cost= 3.152098826
Epoch: 0011 Training cost= 2.488907944
Epoch: 0011 Validation acc= 0.804199994
Epoch: 0012 Training cost= 1.918182273
Epoch: 0013 Training cost= 1.529727884
Epoch: 0013 Validation acc= 0.807200015
Epoch: 0014 Training cost= 1.141758306
Epoch: 0015 Training cost= 0.821120327
Epoch: 0015 Validation acc= 0.809800029
Epoch: 0016 Training cost= 0.596917208
Epoch: 0017 Training cost= 0.448648420
Epoch: 0017 Validation acc= 0.811800003
Epoch: 0018 Training cost= 0.362648918
Epoch: 0019 Training cost= 0.294025914
Epoch: 0019 Validation acc= 0.815400004
Epoch: 0020 Training cost= 0.211786659
Epoch: 0021 Training cost= 0.199456693
Epoch: 0021 Validation acc= 0.816600025
Epoch: 0022 Training cost= 0.134907617
Epoch: 0023 Training cost= 0.100976078
Epoch: 0023 Validation acc= 0.816200018
Epoch: 0024 Training cost= 0.082484495
Epoch: 0025 Training cost= 0.110259895
Epoch: 0025 Validation acc= 0.819199979
Epoch: 0026 Training cost= 0.031071639
Epoch: 0027 Training cost= 0.038540008
Epoch: 0027 Validation acc= 0.820999980
Epoch: 0028 Training cost= 0.009517248
Epoch: 0029 Training cost= 0.010902141
Epoch: 0029 Validation acc= 0.826399982
Epoch: 0030 Training cost= 0.015957228
Epoch: 0031 Training cost= 0.012406773
Epoch: 0031 Validation acc= 0.827199996
Epoch: 0032 Training cost= 0.002218569
Epoch: 0033 Training cost= 0.001438031
Epoch: 0033 Validation acc= 0.824999988
Epoch: 0034 Training cost= 0.000026247
Epoch: 0035 Training cost= 0.000015770
Epoch: 0035 Validation acc= 0.824999988
Epoch: 0036 Training cost= 0.000013471
Epoch: 0037 Training cost= 0.000012097
Epoch: 0037 Validation acc= 0.824800014
Epoch: 0038 Training cost= 0.000011124
Epoch: 0039 Training cost= 0.000010370
Epoch: 0039 Validation acc= 0.824800014
Epoch: 0040 Training cost= 0.000009756
Epoch: 0041 Training cost= 0.000009236
Epoch: 0041 Validation acc= 0.824999988
Epoch: 0042 Training cost= 0.000008785
Epoch: 0043 Training cost= 0.000008387
Epoch: 0043 Validation acc= 0.824999988
Epoch: 0044 Training cost= 0.000008030
Epoch: 0045 Training cost= 0.000007706
Epoch: 0045 Validation acc= 0.824999988
Epoch: 0046 Training cost= 0.000007411
Epoch: 0047 Training cost= 0.000007139
Epoch: 0047 Validation acc= 0.824999988
Epoch: 0048 Training cost= 0.000006887
Epoch: 0049 Training cost= 0.000006652
Epoch: 0049 Validation acc= 0.824999988
Epoch: 0050 Training cost= 0.000006433
Epoch: 0051 Training cost= 0.000006227
Epoch: 0051 Validation acc= 0.824999988
Epoch: 0052 Training cost= 0.000006033
Epoch: 0053 Training cost= 0.000005850
Epoch: 0053 Validation acc= 0.825200021
Epoch: 0054 Training cost= 0.000005677
Epoch: 0055 Training cost= 0.000005513
Epoch: 0055 Validation acc= 0.825200021
Epoch: 0056 Training cost= 0.000005356
Epoch: 0057 Training cost= 0.000005207
Epoch: 0057 Validation acc= 0.825200021
Epoch: 0058 Training cost= 0.000005065
Epoch: 0059 Training cost= 0.000004929
Epoch: 0059 Validation acc= 0.825200021
Epoch: 0060 Training cost= 0.000004799
Epoch: 0061 Training cost= 0.000004675
Epoch: 0061 Validation acc= 0.825200021
Epoch: 0062 Training cost= 0.000004555
Epoch: 0063 Training cost= 0.000004440
Epoch: 0063 Validation acc= 0.825200021
Epoch: 0064 Training cost= 0.000004330
Epoch: 0065 Training cost= 0.000004224
Epoch: 0065 Validation acc= 0.825200021
Epoch: 0066 Training cost= 0.000004122
Epoch: 0067 Training cost= 0.000004023
Epoch: 0067 Validation acc= 0.825200021
Epoch: 0068 Training cost= 0.000003928
Epoch: 0069 Training cost= 0.000003836
Epoch: 0069 Validation acc= 0.824999988
Epoch: 0070 Training cost= 0.000003748
Epoch: 0071 Training cost= 0.000003662
Epoch: 0071 Validation acc= 0.824999988
Epoch: 0072 Training cost= 0.000003579
Epoch: 0073 Training cost= 0.000003499
Epoch: 0073 Validation acc= 0.824999988
Epoch: 0074 Training cost= 0.000003421
Epoch: 0075 Training cost= 0.000003345
Epoch: 0075 Validation acc= 0.825200021
Epoch: 0076 Training cost= 0.000003272
Epoch: 0077 Training cost= 0.000003202
Epoch: 0077 Validation acc= 0.825200021
Epoch: 0078 Training cost= 0.000003133
Epoch: 0079 Training cost= 0.000003066
Epoch: 0079 Validation acc= 0.825200021
Epoch: 0080 Training cost= 0.000003001
Epoch: 0081 Training cost= 0.000002938
Epoch: 0081 Validation acc= 0.825399995
Epoch: 0082 Training cost= 0.000002877
Epoch: 0083 Training cost= 0.000002817
Epoch: 0083 Validation acc= 0.825399995
Epoch: 0084 Training cost= 0.000002759
Epoch: 0085 Training cost= 0.000002702
Epoch: 0085 Validation acc= 0.825399995
Epoch: 0086 Training cost= 0.000002648
Epoch: 0087 Training cost= 0.000002594
Epoch: 0087 Validation acc= 0.824999988
Epoch: 0088 Training cost= 0.000002542
Epoch: 0089 Training cost= 0.000002491
Epoch: 0089 Validation acc= 0.824999988
Epoch: 0090 Training cost= 0.000002442
Epoch: 0091 Training cost= 0.000002393
Epoch: 0091 Validation acc= 0.824999988
Epoch: 0092 Training cost= 0.000002346
Epoch: 0093 Training cost= 0.000002299
Epoch: 0093 Validation acc= 0.824800014
Epoch: 0094 Training cost= 0.000002255
Epoch: 0095 Training cost= 0.000002211
Epoch: 0095 Validation acc= 0.824800014
Epoch: 0096 Training cost= 0.000002168
Epoch: 0097 Training cost= 0.000002126
Epoch: 0097 Validation acc= 0.824800014
Epoch: 0098 Training cost= 0.000002085
Epoch: 0099 Training cost= 0.000002045
Epoch: 0099 Validation acc= 0.824800014
Epoch: 0100 Training cost= 0.000002006
Optimization Finished!
Accuracy on test set: 0.8289

Above we plot the training loss vs the validation loss and the training accuracy vs the validation accuracy on only 5% (train_only_on_fraction=5e-2) of the training data (so that it doesn't take too long, and also so that our model can overfit more easily). We see that the loss is coming down and the accuracies are going up, as expected! By training on a small subset of the training data, we established that

  • the data that the model is being trained on is hopefully not corrupt (this can happen during preprocessing, loading, etc),
  • our loss and gradients are likely correct,
  • our optimizer seems to do the right thing,
  • and generally, that our code probably works!

NOTE: Notice the point where training loss/accuracy continues to improve, but validation accuracy starts to plateau? That is the point where the model starts to overfit the training data.

Architectural Choices (15min)


A Linear 784-10 Model

Let's evaluate the simple linear model trained on the full dataset and then add layers to see what effect this will have on the accuracy.

NOTE: If you're unsure what the "784-10" notation means, scroll up and re-read the section on "Building a Feed-forward Neural Network" where we explain that.
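
As a quick check on the notation, the sketch below counts the trainable parameters implied by an architecture string such as 784-10. It assumes every layer is fully connected with one weight matrix and one bias vector, which is the usual convention but is not confirmed here for DNNClassifier; count_parameters is a hypothetical helper.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully-connected net with these layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# The three architectures compared in this section.
for sizes in ([784, 10], [784, 512, 10], [784, 512, 512, 10]):
    name = "-".join(str(s) for s in sizes)
    print("%-16s %7d parameters" % (name, count_parameters(sizes)))
```

Under that assumption, the linear model has 7,850 parameters, while adding a single 512-unit hidden layer raises this to 407,050.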

In [21]:
# Train the linear model on the full dataset.

# Specify the model hyperparameters.
model_params = {'l2_lambda' : 0.}

# Specify the training hyperparameters:
training_params = {'num_epochs' : 50,     # Max epochs/iterations to train for.
                   'optimizer_fn' : None,            # Now we're using Adam.
                   'report_every' : 1, # Report training results every nr of epochs.
                   'eval_every' : 1,     # Evaluate on validation data every nr of epochs.
                   'stop_early' : True    # Use early stopping or not.
}

# Build, train, evaluate and plot the results!
trained_model, training_results = build_train_eval_and_plot(
        model_params,
        training_params,
        verbose=True    # Modify as desired.
)


Epoch: 0001 Training cost= 4.470642067
Epoch: 0001 Validation acc= 0.694800019

~91% is quite a bad score on MNIST! Now let's build a deeper model, to improve on that score.

1 hidden layer: 784-512-10 Architecture w/ L2

Notice that we add a bit of L2-regularization to our model.

QUESTION: What does that do? What is the effect of removing it? Try it!

In [220]:
# Specify the model hyperparameters (NOTE: All the defaults can be omitted):
model_params = {
        'hidden_sizes' : [512],    # List of hidden layer dimensions, empty for linear model.
        'l2_lambda' : 1e-3            # Strength of L2 regularization.
}

# Specify the training hyperparameters:
training_params = {
        'num_epochs' : 50,        # Max epochs/iterations to train for.
        'report_every' : 1,     # Report training results every nr of epochs.
        'eval_every' : 1,         # Evaluate on validation data every nr of epochs.
        'stop_early' : True    # Use early stopping or not.
}

# Build, train, evaluate and plot the results!
trained_model, training_results = build_train_eval_and_plot(
        model_params,
        training_params,
        verbose=True    # Modify as desired.
)

Epoch: 0001 Training cost= 199.931141524
Epoch: 0001 Validation acc= 0.870599747
Epoch: 0002 Training cost= 141.988723103
Epoch: 0002 Validation acc= 0.898999751
Epoch: 0003 Training cost= 113.183370292
Epoch: 0003 Validation acc= 0.912399769
Epoch: 0004 Training cost= 93.298404125
Epoch: 0004 Validation acc= 0.922599733
Epoch: 0005 Training cost= 78.503151592
Epoch: 0005 Validation acc= 0.921799719
Epoch: 0006 Training cost= 66.877790590
Epoch: 0006 Validation acc= 0.935599744
Epoch: 0007 Training cost= 57.344941926
Epoch: 0007 Validation acc= 0.935999691
Epoch: 0008 Training cost= 49.325550454
Epoch: 0008 Validation acc= 0.943399668
Epoch: 0009 Training cost= 42.386559400
Epoch: 0009 Validation acc= 0.940399766
Epoch: 0010 Training cost= 36.450480402
Epoch: 0010 Validation acc= 0.946999669
Epoch: 0011 Training cost= 31.281487076
Epoch: 0011 Validation acc= 0.944399714
Epoch: 0012 Training cost= 26.775760117
Epoch: 0012 Validation acc= 0.944599748
Epoch: 0013 Training cost= 22.888392136
Epoch: 0013 Validation acc= 0.950199783
Epoch: 0014 Training cost= 19.539065340
Epoch: 0014 Validation acc= 0.950399756
Epoch: 0015 Training cost= 16.640647588
Epoch: 0015 Validation acc= 0.948999763
Epoch: 0016 Training cost= 14.154421536
Epoch: 0016 Validation acc= 0.955999732
Epoch: 0017 Training cost= 12.027777351
Epoch: 0017 Validation acc= 0.957199812
Epoch: 0018 Training cost= 10.202699273
Epoch: 0018 Validation acc= 0.954399705
Epoch: 0019 Training cost= 8.657644658
Epoch: 0019 Validation acc= 0.953599632
Epoch: 0020 Training cost= 7.334119753
Epoch: 0020 Validation acc= 0.954599738
Epoch: 0021 Training cost= 6.196740341
Epoch: 0021 Validation acc= 0.961399794
Epoch: 0022 Training cost= 5.227180183
Epoch: 0022 Validation acc= 0.959199727
Epoch: 0023 Training cost= 4.402065789
Epoch: 0023 Validation acc= 0.956199765
Epoch: 0024 Training cost= 3.707452070
Epoch: 0024 Validation acc= 0.955999732
Epoch: 0025 Training cost= 3.129813027
Epoch: 0025 Validation acc= 0.964799762
Epoch: 0026 Training cost= 2.621126144
Epoch: 0026 Validation acc= 0.949599743
Epoch: 0027 Training cost= 2.205886074
Epoch: 0027 Validation acc= 0.958999693
Epoch: 0028 Training cost= 1.853469856
Epoch: 0028 Validation acc= 0.963399649
Epoch: 0029 Training cost= 1.541123295
Epoch: 0029 Validation acc= 0.964799762
Epoch: 0030 Training cost= 1.291453608
Epoch: 0030 Validation acc= 0.962999761
Epoch: 0031 Training cost= 1.086839738
Epoch: 0031 Validation acc= 0.963599741
Epoch: 0032 Training cost= 0.914146203
Epoch: 0032 Validation acc= 0.963399649
Epoch: 0033 Training cost= 0.769378621
Epoch: 0033 Validation acc= 0.965999722
Epoch: 0034 Training cost= 0.650265682
Epoch: 0034 Validation acc= 0.967199743
Epoch: 0035 Training cost= 0.558896172
Epoch: 0035 Validation acc= 0.969799757
Epoch: 0036 Training cost= 0.482242939
Epoch: 0036 Validation acc= 0.972199798
Epoch: 0037 Training cost= 0.417804223
Epoch: 0037 Validation acc= 0.966399789
Epoch: 0038 Training cost= 0.370378067
Epoch: 0038 Validation acc= 0.973199725
Epoch: 0039 Training cost= 0.331554002
Epoch: 0039 Validation acc= 0.973999739
Epoch: 0040 Training cost= 0.299039026
Epoch: 0040 Validation acc= 0.973399699
Epoch: 0041 Training cost= 0.271972703
Epoch: 0041 Validation acc= 0.975399673
Epoch: 0042 Training cost= 0.252507508
Epoch: 0042 Validation acc= 0.972199798
Epoch: 0043 Training cost= 0.237628794
Epoch: 0043 Validation acc= 0.975399673
Epoch: 0044 Training cost= 0.222427252
Epoch: 0044 Validation acc= 0.970399678
Epoch: 0045 Training cost= 0.213050932
Epoch: 0045 Validation acc= 0.973399699
Epoch: 0046 Training cost= 0.206244858
Epoch: 0046 Validation acc= 0.970799685
Epoch: 0047 Training cost= 0.200895666
Epoch: 0047 Validation acc= 0.975999713
Epoch: 0048 Training cost= 0.193038337
Epoch: 0048 Validation acc= 0.976599693
Epoch: 0049 Training cost= 0.189712861
Epoch: 0049 Validation acc= 0.973599732
Epoch: 0050 Training cost= 0.186406397
Epoch: 0050 Validation acc= 0.976599693
Optimization Finished!
Accuracy on test set: 0.977
CPU times: user 1min 27s, sys: 12.2 s, total: 1min 39s
Wall time: 1min 4s

97.7% is a much more decent score! The 1 hidden layer model gives a much better score than the linear model, so let's see if we can do better by adding another layer!

Going deeper! A 784-512-512-10 architecture w/ L2

In [75]:

# Specify the model hyperparameters (NOTE: All the defaults can be omitted):
model_params = {
        'hidden_sizes' : [512, 512], # List of hidden layer dimensions, empty for linear model.
        'l2_lambda' : 1e-3                     # Strength of L2 regularization.
}

# Specify the training hyperparameters:
training_params = {
        'num_epochs' : 200,     # Max epochs/iterations to train for.
        'report_every' : 1,     # Report training results every nr of epochs.
        'eval_every' : 1,         # Evaluate on validation data every nr of epochs.
        'stop_early' : True,    # Use early stopping or not.
}

# Build, train, evaluate and plot the results!
trained_model, training_results = build_train_eval_and_plot(
        model_params,
        training_params,
        verbose=True    # Modify as desired.
)

Epoch: 0001 Training cost= 576.249123591
Epoch: 0001 Validation acc= 0.901599765
Epoch: 0002 Training cost= 356.557729048
Epoch: 0002 Validation acc= 0.921399713
Epoch: 0003 Training cost= 317.592652199
Epoch: 0003 Validation acc= 0.931799769
Epoch: 0004 Training cost= 293.989943626
Epoch: 0004 Validation acc= 0.937399745
Epoch: 0005 Training cost= 276.594365179
Epoch: 0005 Validation acc= 0.944599748
Epoch: 0006 Training cost= 263.482251143
Epoch: 0006 Validation acc= 0.946799755
Epoch: 0007 Training cost= 251.240891252
Epoch: 0007 Validation acc= 0.945599735
Epoch: 0008 Training cost= 241.220871305
Epoch: 0008 Validation acc= 0.949199736
Epoch: 0009 Training cost= 231.527150657
Epoch: 0009 Validation acc= 0.950399697
Epoch: 0010 Training cost= 222.198544922
Epoch: 0010 Validation acc= 0.953399718
Epoch: 0011 Training cost= 213.551616044
Epoch: 0011 Validation acc= 0.955799699
Epoch: 0012 Training cost= 204.914471547
Epoch: 0012 Validation acc= 0.957999706
Epoch: 0013 Training cost= 196.558960072
Epoch: 0013 Validation acc= 0.954799712
Epoch: 0014 Training cost= 188.747769498
Epoch: 0014 Validation acc= 0.959999740
Epoch: 0015 Training cost= 180.734522983
Epoch: 0015 Validation acc= 0.958799720
Epoch: 0016 Training cost= 173.227038269
Epoch: 0016 Validation acc= 0.963799715
Epoch: 0017 Training cost= 166.165209351
Epoch: 0017 Validation acc= 0.958999813
Epoch: 0018 Training cost= 158.934809099
Epoch: 0018 Validation acc= 0.962199748
Epoch: 0019 Training cost= 152.528965426
Epoch: 0019 Validation acc= 0.963199794
Epoch: 0020 Training cost= 146.041184803
Epoch: 0020 Validation acc= 0.960999727
Epoch: 0021 Training cost= 139.855197560
Epoch: 0021 Validation acc= 0.964799702
Epoch: 0022 Training cost= 133.776780784
Epoch: 0022 Validation acc= 0.967599690
Epoch: 0023 Training cost= 128.551627003
Epoch: 0023 Validation acc= 0.966199756
Epoch: 0024 Training cost= 122.610672066
Epoch: 0024 Validation acc= 0.968599677
Epoch: 0025 Training cost= 117.313750860
Epoch: 0025 Validation acc= 0.967199683
Epoch: 0026 Training cost= 112.409426963
Epoch: 0026 Validation acc= 0.966599703
Epoch: 0027 Training cost= 107.463527194
Epoch: 0027 Validation acc= 0.965999722
Epoch: 0028 Training cost= 102.690966672
Epoch: 0028 Validation acc= 0.966199636
Epoch: 0029 Training cost= 97.922743253
Epoch: 0029 Validation acc= 0.970599711
Epoch: 0030 Training cost= 93.660167001
Epoch: 0030 Validation acc= 0.968199790
Epoch: 0031 Training cost= 89.266302893
Epoch: 0031 Validation acc= 0.966199696
Epoch: 0032 Training cost= 85.337237507
Epoch: 0032 Validation acc= 0.972999752
Epoch: 0033 Training cost= 81.284292270
Epoch: 0033 Validation acc= 0.969199717
Epoch: 0034 Training cost= 77.531017983
Epoch: 0034 Validation acc= 0.966799676
Epoch: 0035 Training cost= 73.750824987
Epoch: 0035 Validation acc= 0.962999761
Epoch: 0036 Training cost= 70.103137623
Epoch: 0036 Validation acc= 0.969999731
Epoch: 0037 Training cost= 66.711315141
Epoch: 0037 Validation acc= 0.971399724
Epoch: 0038 Training cost= 63.361611460
Epoch: 0038 Validation acc= 0.968599737
Epoch: 0039 Training cost= 60.011656321
Epoch: 0039 Validation acc= 0.970199764
Epoch: 0040 Training cost= 57.025321149
Epoch: 0040 Validation acc= 0.969599724
Epoch: 0041 Training cost= 53.959342832
Epoch: 0041 Validation acc= 0.970399678
Epoch: 0042 Training cost= 51.148343631
Epoch: 0042 Validation acc= 0.970399737
Epoch: 0043 Training cost= 48.408878347
Epoch: 0043 Validation acc= 0.969199717
Epoch: 0044 Training cost= 45.793727729
Epoch: 0044 Validation acc= 0.969799697
Epoch: 0045 Training cost= 43.255685619
Epoch: 0045 Validation acc= 0.969599724
Epoch: 0046 Training cost= 40.852977191
Epoch: 0046 Validation acc= 0.971399784
Epoch: 0047 Training cost= 38.543384718
Epoch: 0047 Validation acc= 0.967999697
Epoch: 0048 Training cost= 36.367765503
Epoch: 0048 Validation acc= 0.972199738
Epoch: 0049 Training cost= 34.265690106
Epoch: 0049 Validation acc= 0.971599758
Epoch: 0050 Training cost= 32.170752844
Epoch: 0050 Validation acc= 0.966999650
Epoch: 0051 Training cost= 30.278565372
Epoch: 0051 Validation acc= 0.966999769
Epoch: 0052 Training cost= 28.405725545
Epoch: 0052 Validation acc= 0.969399750
Epoch: 0053 Training cost= 26.525246187
Epoch: 0053 Validation acc= 0.967999816
Epoch: 0054 Training cost= 24.837529481
Epoch: 0054 Validation acc= 0.970999718
Epoch: 0055 Training cost= 23.152070309
Epoch: 0055 Validation acc= 0.964599729
Epoch: 0056 Training cost= 21.591280441
Epoch: 0056 Validation acc= 0.971399724
Epoch: 0057 Training cost= 20.122739747
Epoch: 0057 Validation acc= 0.968599677
Epoch: 0058 Training cost= 18.699150758
Epoch: 0058 Validation acc= 0.969799638
Epoch: 0059 Training cost= 17.338278760
Epoch: 0059 Validation acc= 0.971799731
Epoch: 0060 Training cost= 16.024516387
Epoch: 0060 Validation acc= 0.962999761
Epoch: 0061 Training cost= 14.899629550
Epoch: 0061 Validation acc= 0.975199699
Epoch: 0062 Training cost= 13.723355025
Epoch: 0062 Validation acc= 0.968799710
Epoch: 0063 Training cost= 12.608933298
Epoch: 0063 Validation acc= 0.968399704
Epoch: 0064 Training cost= 11.589327968
Epoch: 0064 Validation acc= 0.967199743
Epoch: 0065 Training cost= 10.676569299
Epoch: 0065 Validation acc= 0.968399644
Epoch: 0066 Training cost= 9.733573787
Epoch: 0066 Validation acc= 0.969999731
Epoch: 0067 Training cost= 8.906992539
Epoch: 0067 Validation acc= 0.965599716
Epoch: 0068 Training cost= 8.061578745
Epoch: 0068 Validation acc= 0.970599711
Epoch: 0069 Training cost= 7.316279169
Epoch: 0069 Validation acc= 0.969999611
Epoch: 0070 Training cost= 6.604641718
Epoch: 0070 Validation acc= 0.970599711
Epoch: 0071 Training cost= 5.932473201
Epoch: 0071 Validation acc= 0.967399716
Epoch: 0072 Training cost= 5.323358593
Epoch: 0072 Validation acc= 0.968999684
Epoch: 0073 Training cost= 4.746478335
Epoch: 0073 Validation acc= 0.970599771
Epoch: 0074 Training cost= 4.206226299
Epoch: 0074 Validation acc= 0.967599809
Epoch: 0075 Training cost= 3.727632008
Epoch: 0075 Validation acc= 0.972599745
Epoch: 0076 Training cost= 3.261459116
Epoch: 0076 Validation acc= 0.969599724
Epoch: 0077 Training cost= 2.828105771
Epoch: 0077 Validation acc= 0.972399712
Epoch: 0078 Training cost= 2.451501520
Epoch: 0078 Validation acc= 0.966999710
Epoch: 0079 Training cost= 2.098190145
Epoch: 0079 Validation acc= 0.966599762
Epoch: 0080 Training cost= 1.771351617
Epoch: 0080 Validation acc= 0.970199704
Epoch: 0081 Training cost= 1.476746987
Epoch: 0081 Validation acc= 0.968999743
Epoch: 0082 Training cost= 1.229572264
Epoch: 0082 Validation acc= 0.968599737
Epoch: 0083 Training cost= 1.003394099
Epoch: 0083 Validation acc= 0.969799757
Epoch: 0084 Training cost= 0.814870764
Epoch: 0084 Validation acc= 0.972399652
Epoch: 0085 Training cost= 0.665674133
Epoch: 0085 Validation acc= 0.973599732
Epoch: 0086 Training cost= 0.538791539
Epoch: 0086 Validation acc= 0.975999713
Epoch: 0087 Training cost= 0.444349860
Epoch: 0087 Validation acc= 0.974999785
Epoch: 0088 Training cost= 0.373609773
Epoch: 0088 Validation acc= 0.978399694
Epoch: 0089 Training cost= 0.316718706
Epoch: 0089 Validation acc= 0.972999752
Epoch: 0090 Training cost= 0.278378398
Epoch: 0090 Validation acc= 0.978799701
Epoch: 0091 Training cost= 0.247428952
Epoch: 0091 Validation acc= 0.976799726
Epoch: 0092 Training cost= 0.224561186
Epoch: 0092 Validation acc= 0.975799739
Epoch: 0093 Training cost= 0.206074331
Epoch: 0093 Validation acc= 0.977399707
Epoch: 0094 Training cost= 0.195836283
Epoch: 0094 Validation acc= 0.978199720
Epoch: 0095 Training cost= 0.184348083
Epoch: 0095 Validation acc= 0.976799667
Epoch: 0096 Training cost= 0.177935783
Epoch: 0096 Validation acc= 0.974799752
Epoch: 0097 Training cost= 0.170420442
Epoch: 0097 Validation acc= 0.977799714
Epoch: 0098 Training cost= 0.166264389
Epoch: 0098 Validation acc= 0.979799628
Epoch: 0099 Training cost= 0.162785649
Epoch: 0099 Validation acc= 0.976399720
Epoch: 0100 Training cost= 0.159062386
Epoch: 0100 Validation acc= 0.978199780
Epoch: 0101 Training cost= 0.156110233
Epoch: 0101 Validation acc= 0.975999773
Validation loss stopped improving, stopping training early after 101 epochs!
Optimization Finished!
Accuracy on test set: 0.9742
CPU times: user 3min 43s, sys: 22 s, total: 4min 5s
Wall time: 2min 46s

You should get around 97.4%. Shouldn't deeper do better?! Why is it that the 2-hidden layer model

  • took much longer to train (2min 46s versus 1min 3s on our system), and
  • got roughly the same accuracy as the 1 hidden layer model (sometimes worse)?

This illustrates the fundamental difficulty of training deep networks, which has plagued deep learning research for decades (and to some extent still does):

Although deeper networks can give you more powerful models, training those models to find the right parameters is not always easy & takes a lot of computing power (time)!

For a long time people simply believed deep networks wouldn't work, because a) few people tried to make them work, and b) those who did lacked the data or computing power to really explore the vast space of hyperparameters. These days we have more data and more compute power, although one can never have 'enough' resources :) This is where the art of proper hyperparameter selection comes in. In fact, we should be able to squeeze out another percentage point or so by choosing better:

  • optimizer and its hyperparameters (learning rate, momentum rate, etc.),
  • batch size,
  • choice of regularization.

For a new problem, we will typically spend most of our time exploring these choices, usually just by launching many different training runs (hopefully in parallel and using GPUs!) and keeping the best ones.

EXTRA: How to Choose Architectures?

How does one design new architectures? Should you use:

  • a tapered architecture (large-medium-small),
  • a "regular" (like the Levi's :) architecture (large-med-med-...-small),
  • a "bottlenecked" architecture (large-small-large), or
  • an "over-complete" architecture (small-large-small)?

Unfortunately it mostly comes down to just developing your own intuition over time, and trying different approaches in hyperparameter search.
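To get a feel for how these shapes differ in capacity, it can help to count parameters. The sketch below is illustrative only: `count_params` and the hidden-layer sizes are made-up examples (not taken from this practical), assuming MNIST's 784 inputs and 10 outputs and one bias per unit.

```python
def count_params(layer_sizes):
    """Count weights and biases of a fully-connected net.

    layer_sizes lists every layer, e.g. [784, 500, 300, 10].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total

# Hypothetical hidden-layer shapes for 784 inputs and 10 outputs.
candidates = [
    ('tapered',       [784, 512, 256, 128, 10]),
    ('regular',       [784, 512, 256, 256, 128, 10]),
    ('bottlenecked',  [784, 512, 64, 512, 10]),
    ('over-complete', [784, 128, 512, 128, 10]),
]

for name, sizes in candidates:
    print("%-13s %s -> %d parameters" % (name, sizes, count_params(sizes)))
```

Because the first weight matrix is 784 wide, the size of the first hidden layer tends to dominate the total, which is worth keeping in mind when comparing shapes.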

However, a good pattern to follow in general is something like the following:

  1. Start with a basic architecture and overfit a portion of your training data.
  2. As you train on more data, add capacity while guarding against overfitting by
    • adding more units to the hidden layer (wide often still beats deep),
    • adding one or two more hidden layers, trying some of the architectural choices mentioned above,
    • adding dropout.

The goal is to match your architecture to your data: just enough capacity to fit the data, but not so much that the model easily overfits. The easiest way to do this is to build up the capacity of your model gradually, as described above.

EXTRA: Hyperparameter Selection (USE A GPU INSTANCE FOR THIS)

Most of a deep learning practitioner's time is spent training and evaluating new model architectures with different hyperparameter combinations. General rules of thumb exist, but they often change for each new architecture or dataset. Several approaches exist for automatic hyperparameter selection (grid search, random search, Bayesian methods). In practice, most people use a randomized grid-search approach: one defines sensible values for each hyperparameter, and then randomly samples from the cross-product space of all possible combinations. This space grows exponentially with the number of hyperparameters, making exhaustive search infeasible for all but the simplest models. Randomized search has also been found to find good values more quickly (see Bergstra and Bengio 2012 if you're interested in the details).
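As a rough illustration of how quickly the cross-product space grows, and how random sampling sidesteps it, here is a minimal sketch. The grid values and the names `grid`, `all_combos` and `configs` are assumptions chosen for illustration, not settings used in this practical.

```python
import itertools
import random

# Hypothetical candidate values for each hyperparameter.
grid = {
    'learning_rate': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    'momentum':      [0.5, 0.9, 0.95, 0.99],
    'l2_lambda':     [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    'hidden_sizes':  [[128], [256], [128, 128], [128, 256], [256, 128], [256, 256]],
}

names = sorted(grid)
# Exhaustive grid search would have to train one model per combination.
all_combos = list(itertools.product(*[grid[n] for n in names]))
print("Grid size: %d combinations" % len(all_combos))  # 6 * 4 * 6 * 6 = 864

# Randomized search: train only a small random subset of the grid.
rng = random.Random(0)  # fixed seed so the sample is repeatable
configs = [dict(zip(names, combo)) for combo in rng.sample(all_combos, 10)]
for config in configs:
    print(config)
```

Adding a fifth hyperparameter with six values would multiply the grid to 5184 combinations, while the cost of the random search stays at however many runs we choose to launch.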

Below, we will illustrate the basic idea with a skeleton randomized grid search implementation. In practice, one would launch these different runs in parallel on a computing cluster, making it easier to explore multiple options at the same time. Here, we will just try out a few different options to get a sense of how that would work.

NOTE: We will keep the models small here to keep training times reasonable, so we won't get SOTA results. The point is to illustrate how dependent the results are on these choices.

Architecture Selection: Let's consider 1 and 2-hidden layer models. For each layer, we'll try out [128, 256] neurons per layer. We'll stick to ReLUs for this.

Hyperparameters: Let's pick SGD with Momentum as our optimizer (a good, stable workhorse), with a fixed computational budget of 20 training epochs. We'll need to pick ranges for the following:

  • learning rate: we'll use a log-scale from 1e-5 to 1.
  • momentum coefficient: we'll use a log-scale from 5e-1 (0.5) to 1.
  • L2 regularization: we'll use a log-scale from 1e-5 to 1.
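Why a log scale? When a range spans several orders of magnitude, sampling uniformly on a linear scale almost never tries small values. The following sketch (standard library only; the helper names and the sample count are illustrative assumptions) compares the two for the learning-rate range above.

```python
import math
import random

rng = random.Random(0)  # fixed seed for repeatability
v_min, v_max = 1e-5, 1.0
n = 100000

def sample_linear():
    # Uniform on the raw (linear) scale.
    return rng.uniform(v_min, v_max)

def sample_log():
    # Uniform in log-space, then mapped back with exp.
    return math.exp(rng.uniform(math.log(v_min), math.log(v_max)))

linear = [sample_linear() for _ in range(n)]
log_scale = [sample_log() for _ in range(n)]

# Fraction of samples that fall below 1e-2.
frac_linear = sum(v < 1e-2 for v in linear) / float(n)
frac_log = sum(v < 1e-2 for v in log_scale) / float(n)
print("linear scale: %.3f of samples below 1e-2" % frac_linear)
print("log scale:    %.3f of samples below 1e-2" % frac_log)
```

On a linear scale only about 1% of the samples fall below 1e-2, whereas on a log scale every decade is equally likely, so about 60% do (3 of the 5 decades between 1e-5 and 1).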

In [7]:
def sample_log_scale(v_min=1e-6, v_max=1.):
    '''Sample uniformly on a log-scale from v_min to v_max.'''
    return np.exp(np.random.uniform(np.log(v_min), np.log(v_max)))

def sample_model_architecture_and_hyperparams(max_num_layers=2,
                                              lr_min=1e-5, lr_max=1.,
                                              mom_min=0.5, mom_max=1.,
                                              l2_min=1e-5, l2_max=1.):
    '''Generate a random model architecture & hyperparameters.

    The default ranges are assumed from the ranges listed in the text above.
    '''
    # Sample the architecture: a layer count, then a size for each layer.
    num_layers = np.random.choice(range(1, max_num_layers+1))
    hidden_sizes = []
    layer_ranges = [128, 256]
    for l in range(num_layers):
        hidden_sizes.append(np.random.choice(layer_ranges))

    # Sample the training parameters.
    l2_lambda = sample_log_scale(l2_min, l2_max)
    lr = sample_log_scale(lr_min, lr_max)
    mom_coeff = sample_log_scale(mom_min, mom_max)

    # Build base model definitions:
    model_params = {
        'hidden_sizes' : hidden_sizes,
        'l2_lambda' : l2_lambda}

    # Specify the training hyperparameters:
    training_params = {
        'num_epochs' : 20,
        'optimizer_fn' : tf.train.MomentumOptimizer(
            learning_rate=lr, momentum=mom_coeff),
        'report_every' : 1,
        'eval_every' : 1,
        'stop_early' : True}
    return model_params, training_params

# TEST THIS: Run this cell a few times and look at the different outputs. 
# Each of these will be a different model trained with different hyperparameters.
m, t = sample_model_architecture_and_hyperparams()
print m
print t

{'hidden_sizes': [256, 128], 'l2_lambda': 0.0054235896697678509}
{'eval_every': 1, 'report_every': 1, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x55725cae8a90>, 'num_epochs': 20, 'stop_early': True}

We will now run the random search: each run samples an architecture along with the L2 strength, learning rate and momentum coefficient, then trains with SGD+Momentum for a fixed budget of 20 epochs:

In [56]:
results = []

# Perform a random search over hyper-parameter space this many times.
NUM_EXPERIMENTS = 10

for i in range(NUM_EXPERIMENTS):
    # Sample the model and hyperparams we are using.
    model_params, training_params = sample_model_architecture_and_hyperparams()
    print "RUN: %d out of %d:" % (i, NUM_EXPERIMENTS)
    print "Sampled Architecture: \n", model_params
    print "Hyper-parameters:\n", training_params
    # Build, train, evaluate
    model, performance = build_train_eval_and_plot(
            model_params, training_params, verbose=False)
    # Save results
    results.append((performance['test_acc'], model_params, training_params))
# Display (best?) results/variance/etc:
results.sort(key=lambda x : x[0], reverse=True)

RUN: 0 out of 10:
Sampled Architecture: 
{'hidden_sizes': [256], 'l2_lambda': 0.017639122106750327}
{'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556eca7c8c90>}
Validation loss stopped improving, stopping training early after 3 epochs!
Optimization Finished!
RUN: 1 out of 10:
Sampled Architecture: 
{'hidden_sizes': [256], 'l2_lambda': 0.00071915651503389648}
{'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556eca247f10>}
Optimization Finished!
RUN: 2 out of 10:
Sampled Architecture: 
{'hidden_sizes': [256], 'l2_lambda': 0.034917746813186976}
{'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556eca7c8a10>}
Validation loss stopped improving, stopping training early after 5 epochs!
Optimization Finished!
RUN: 3 out of 10:
Sampled Architecture: 
{'hidden_sizes': [128, 128], 'l2_lambda': 0.0011477153620692811}
{'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556ecba07d10>}
Optimization Finished!
RUN: 4 out of 10:
Sampled Architecture: 
{'hidden_sizes': [128, 128], 'l2_lambda': 0.0066110195159220803}
{'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556e59bd68d0>}
Optimization Finished!
RUN: 5 out of 10:
Sampled Architecture: 
{'hidden_sizes': [256], 'l2_lambda': 0.00064575036694946794}
{'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556ecabea5d0>}
Optimization Finished!
RUN: 6 out of 10:
Sampled Architecture: 
{'hidden_sizes': [128, 128], 'l2_lambda': 0.013143246333050255}
{'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556eb9135650>}
Optimization Finished!
RUN: 7 out of 10:
Sampled Architecture: 
{'hidden_sizes': [128, 128], 'l2_lambda': 0.00017646189551092479}
{'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556ec9811090>}
Optimization Finished!
RUN: 8 out of 10:
Sampled Architecture: 
{'hidden_sizes': [256], 'l2_lambda': 0.00012706891665017044}
{'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556ecb78fc10>}
Validation loss stopped improving, stopping training early after 9 epochs!
Optimization Finished!
RUN: 9 out of 10:
Sampled Architecture: 
{'hidden_sizes': [256], 'l2_lambda': 0.0029062557341502037}
{'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556eca24d890>}
Optimization Finished!

In [57]:
for r in results:
    print r    # Tuples of (test_accuracy, model_hyperparameters, training_hyperparameters)

(0.95090008, {'hidden_sizes': [128, 128], 'l2_lambda': 0.0066110195159220803}, {'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556e59bd68d0>})
(0.94180018, {'hidden_sizes': [256], 'l2_lambda': 0.00064575036694946794}, {'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556ecabea5d0>})
(0.92350012, {'hidden_sizes': [128, 128], 'l2_lambda': 0.013143246333050255}, {'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556eb9135650>})
(0.8925001, {'hidden_sizes': [256], 'l2_lambda': 0.034917746813186976}, {'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556eca7c8a10>})
(0.86520004, {'hidden_sizes': [128, 128], 'l2_lambda': 0.0011477153620692811}, {'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556ecba07d10>})
(0.71679997, {'hidden_sizes': [256], 'l2_lambda': 0.0029062557341502037}, {'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556eca24d890>})
(0.69389999, {'hidden_sizes': [256], 'l2_lambda': 0.017639122106750327}, {'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556eca7c8c90>})
(0.28809997, {'hidden_sizes': [256], 'l2_lambda': 0.00071915651503389648}, {'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556eca247f10>})
(0.20929998, {'hidden_sizes': [256], 'l2_lambda': 0.00012706891665017044}, {'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556ecb78fc10>})
(0.1135, {'hidden_sizes': [128, 128], 'l2_lambda': 0.00017646189551092479}, {'stop_early': True, 'report_every': 1, 'eval_every': 1, 'num_epochs': 20, 'optimizer_fn': <google3.third_party.tensorflow.python.training.momentum.MomentumOptimizer object at 0x556ec9811090>})

Notice the huuuuge variance in the test accuracies! This is just a toy run, but hopefully it illustrates how important choosing the right architectures and training hyperparameters is to getting good results in deep learning.
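To put a number on that spread, we can summarize the ten test accuracies printed above (rounded to four decimals). This is a small sketch using only the standard library; the variable names are illustrative.

```python
# Test accuracies from the ten runs above, rounded to four decimals.
test_accs = [0.9509, 0.9418, 0.9235, 0.8925, 0.8652,
             0.7168, 0.6939, 0.2881, 0.2093, 0.1135]

n = len(test_accs)
mean_acc = sum(test_accs) / n
# Population standard deviation of the accuracies.
std_acc = (sum((a - mean_acc) ** 2 for a in test_accs) / n) ** 0.5

print("best:  %.4f" % max(test_accs))
print("worst: %.4f" % min(test_accs))
print("mean:  %.4f" % mean_acc)
print("std:   %.4f" % std_acc)
```

The worst run, at about 11%, is barely better than the 10% expected from guessing among ten digits, while the best run reaches about 95% with the same code and the same 20-epoch budget.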

Below, we've included some hyperparameter settings which achieve near state-of-the-art results for feedforward models on MNIST.

EXTRA: Known Good Models

784-500-300-10 w/ L2 + SGD + Momentum

Best so far: 98.02% test accuracy when we ran this (well, in one of our training runs :).

In [13]:

# Specify the model hyperparameters (NOTE: All the defaults can be omitted):
model_params = {
    'hidden_sizes' : [500, 300],  # List of hidden layer dimensions, empty for linear model.
    'l2_lambda' : 1e-3            # Strength of L2 regularization.
}

# Specify the training hyperparameters:
training_params = {
    'num_epochs' : 100,    # Max epochs/iterations to train for.
    'optimizer_fn' : tf.train.MomentumOptimizer(learning_rate=2e-3, momentum=0.98),
    'report_every' : 1,    # Report training results every nr of epochs.
    'eval_every' : 1,      # Evaluate on validation data every nr of epochs.
    'stop_early' : True,   # Use early stopping or not.
}

# Build, train, evaluate and plot the results!
trained_model, training_results = build_train_eval_and_plot(
    model_params,
    training_params,
    verbose=True    # Modify as desired.
)

Epoch: 0001 Training cost= 327.262532238
Epoch: 0001 Validation acc= 0.435399979
Epoch: 0002 Training cost= 251.312214827
Epoch: 0002 Validation acc= 0.542799890
Epoch: 0003 Training cost= 224.936292614
Epoch: 0003 Validation acc= 0.605999947
Epoch: 0004 Training cost= 201.409107805
Epoch: 0004 Validation acc= 0.626999915
Epoch: 0005 Training cost= 180.312732350
Epoch: 0005 Validation acc= 0.656599879
Epoch: 0006 Training cost= 161.427582259
Epoch: 0006 Validation acc= 0.692799866
Epoch: 0007 Training cost= 144.570377253
Epoch: 0007 Validation acc= 0.730599880
Epoch: 0008 Training cost= 129.454500316
Epoch: 0008 Validation acc= 0.729999840
Epoch: 0009 Training cost= 115.930559138
Epoch: 0009 Validation acc= 0.764599860
Epoch: 0010 Training cost= 103.779861076
Epoch: 0010 Validation acc= 0.817999840
Epoch: 0011 Training cost= 92.894099121
Epoch: 0011 Validation acc= 0.848599792
Epoch: 0012 Training cost= 83.177957764
Epoch: 0012 Validation acc= 0.874799788
Epoch: 0013 Training cost= 74.441788441
Epoch: 0013 Validation acc= 0.866999805
Epoch: 0014 Training cost= 66.607071221
Epoch: 0014 Validation acc= 0.908799767
Epoch: 0015 Training cost= 59.628462719
Epoch: 0015 Validation acc= 0.920999765
Epoch: 0016 Training cost= 53.376457901
Epoch: 0016 Validation acc= 0.927999735
Epoch: 0017 Training cost= 47.795069421
Epoch: 0017 Validation acc= 0.921199799
Epoch: 0018 Training cost= 42.801629826
Epoch: 0018 Validation acc= 0.938799739
Epoch: 0019 Training cost= 38.321246601
Epoch: 0019 Validation acc= 0.939399719
Epoch: 0020 Training cost= 34.317467527
Epoch: 0020 Validation acc= 0.943799734
Epoch: 0021 Training cost= 30.729907542
Epoch: 0021 Validation acc= 0.950399756
Epoch: 0022 Training cost= 27.518849869
Epoch: 0022 Validation acc= 0.948799729
Epoch: 0023 Training cost= 24.648165835
Epoch: 0023 Validation acc= 0.949599802
Epoch: 0024 Training cost= 22.077698746
Epoch: 0024 Validation acc= 0.954999745
Epoch: 0025 Training cost= 19.774227052
Epoch: 0025 Validation acc= 0.954399645
Epoch: 0026 Training cost= 17.714090944
Epoch: 0026 Validation acc= 0.957399666
Epoch: 0027 Training cost= 15.867918540
Epoch: 0027 Validation acc= 0.960799754
Epoch: 0028 Training cost= 14.219327608
Epoch: 0028 Validation acc= 0.961799681
Epoch: 0029 Training cost= 12.738986726
Epoch: 0029 Validation acc= 0.963399768
Epoch: 0030 Training cost= 11.412021542
Epoch: 0030 Validation acc= 0.968799710
Epoch: 0031 Training cost= 10.229510890
Epoch: 0031 Validation acc= 0.965799749
Epoch: 0032 Training cost= 9.169845919
Epoch: 0032 Validation acc= 0.967199683
Epoch: 0033 Training cost= 8.216691914
Epoch: 0033 Validation acc= 0.969199717
Epoch: 0034 Training cost= 7.367491474
Epoch: 0034 Validation acc= 0.970399737
Epoch: 0035 Training cost= 6.608459154
Epoch: 0035 Validation acc= 0.972999692
Epoch: 0036 Training cost= 5.925226277
Epoch: 0036 Validation acc= 0.969599724
Epoch: 0037 Training cost= 5.316913984
Epoch: 0037 Validation acc= 0.973999679
Epoch: 0038 Training cost= 4.772619098
Epoch: 0038 Validation acc= 0.973199725
Epoch: 0039 Training cost= 4.282892211
Epoch: 0039 Validation acc= 0.973599672
Epoch: 0040 Training cost= 3.846757963
Epoch: 0040 Validation acc= 0.973599792
Epoch: 0041 Training cost= 3.455926691
Epoch: 0041 Validation acc= 0.976799786
Epoch: 0042 Training cost= 3.105613444
Epoch: 0042 Validation acc= 0.975199699
Epoch: 0043 Training cost= 2.793994174
Epoch: 0043 Validation acc= 0.976199746
Epoch: 0044 Training cost= 2.512941882
Epoch: 0044 Validation acc= 0.975399733
Epoch: 0045 Training cost= 2.262076859
Epoch: 0045 Validation acc= 0.976999760
Epoch: 0046 Training cost= 2.037694932
Epoch: 0046 Validation acc= 0.975999653
Epoch: 0047 Training cost= 1.837552240
Epoch: 0047 Validation acc= 0.976399660
Epoch: 0048 Training cost= 1.657969395
Epoch: 0048 Validation acc= 0.975599706
Epoch: 0049 Training cost= 1.496847012
Epoch: 0049 Validation acc= 0.976999760
Epoch: 0050 Training cost= 1.352709585
Epoch: 0050 Validation acc= 0.978399754
Epoch: 0051 Training cost= 1.223720786
Epoch: 0051 Validation acc= 0.978799701
Epoch: 0052 Training cost= 1.108979440
Epoch: 0052 Validation acc= 0.978399634
Epoch: 0053 Training cost= 1.005705943
Epoch: 0053 Validation acc= 0.977999747
Epoch: 0054 Training cost= 0.913236865
Epoch: 0054 Validation acc= 0.978799701
Epoch: 0055 Training cost= 0.830339956
Epoch: 0055 Validation acc= 0.979199767
Epoch: 0056 Training cost= 0.755614541
Epoch: 0056 Validation acc= 0.978599668
Epoch: 0057 Training cost= 0.689879866
Epoch: 0057 Validation acc= 0.979399741
Epoch: 0058 Training cost= 0.631272063
Epoch: 0058 Validation acc= 0.978599787
Epoch: 0059 Training cost= 0.578013867
Epoch: 0059 Validation acc= 0.978799701
Epoch: 0060 Training cost= 0.530160095
Epoch: 0060 Validation acc= 0.979799747
Epoch: 0061 Training cost= 0.487254938
Epoch: 0061 Validation acc= 0.979799747
Epoch: 0062 Training cost= 0.449552791
Epoch: 0062 Validation acc= 0.980399728
Epoch: 0063 Training cost= 0.415439274
Epoch: 0063 Validation acc= 0.978799701
Epoch: 0064 Training cost= 0.384874637
Epoch: 0064 Validation acc= 0.979999721
Epoch: 0065 Training cost= 0.356752845
Epoch: 0065 Validation acc= 0.978199661
Epoch: 0066 Training cost= 0.332616655
Epoch: 0066 Validation acc= 0.979799747
Epoch: 0067 Training cost= 0.309994660
Epoch: 0067 Validation acc= 0.979199767
Epoch: 0068 Training cost= 0.290989620
Epoch: 0068 Validation acc= 0.979599655
Epoch: 0069 Training cost= 0.272564379
Epoch: 0069 Validation acc= 0.980599701
Epoch: 0070 Training cost= 0.256773828
Epoch: 0070 Validation acc= 0.980799794
Epoch: 0071 Training cost= 0.242967232
Epoch: 0071 Validation acc= 0.979999661
Epoch: 0072 Training cost= 0.230392076
Epoch: 0072 Validation acc= 0.979199708
Epoch: 0073 Training cost= 0.218723012
Epoch: 0073 Validation acc= 0.980599761
Epoch: 0074 Training cost= 0.208804401
Epoch: 0074 Validation acc= 0.980199695
Epoch: 0075 Training cost= 0.198739098
Epoch: 0075 Validation acc= 0.981199682
Epoch: 0076 Training cost= 0.190688753
Epoch: 0076 Validation acc= 0.979799688
Epoch: 0077 Training cost= 0.183180254
Epoch: 0077 Validation acc= 0.980599761
Epoch: 0078 Training cost= 0.176371708
Epoch: 0078 Validation acc= 0.980599761
Epoch: 0079 Training cost= 0.170845800
Epoch: 0079 Validation acc= 0.979599714
Epoch: 0080 Training cost= 0.165155359
Epoch: 0080 Validation acc= 0.980199695
Epoch: 0081 Training cost= 0.160126905
Epoch: 0081 Validation acc= 0.979999721
Epoch: 0082 Training cost= 0.155851784
Epoch: 0082 Validation acc= 0.982599735
Epoch: 0083 Training cost= 0.152026351
Epoch: 0083 Validation acc= 0.980199695
Epoch: 0084 Training cost= 0.147996796
Epoch: 0084 Validation acc= 0.981599748
Epoch: 0085 Training cost= 0.146238381
Epoch: 0085 Validation acc= 0.980599761
Epoch: 0086 Training cost= 0.142040186
Epoch: 0086 Validation acc= 0.981599689
Epoch: 0087 Training cost= 0.138474341
Epoch: 0087 Validation acc= 0.981799722
Epoch: 0088 Training cost= 0.137078403
Epoch: 0088 Validation acc= 0.980999768
Validation loss stopped improving, stopping training early after 88 epochs!
Optimization Finished!
Accuracy on test set: 0.9798
CPU times: user 2min 47s, sys: 32.3 s, total: 3min 19s
Wall time: 2min 8s

NB: Before you go (5min)

Pair up with someone else and go through the questions in "Learning Objectives" at the top. Take turns explaining each of these to each other, and be sure to ask the tutors if you're both unsure!

Additional Resources


Please send any bugs and comments to dli-practicals@googlegroups.com.