Deep Learning

Assignment 3

Previously in 2_fullyconnected.ipynb, we trained a logistic regression and a simple neural network model.

The goal of this assignment is to explore regularization techniques.

Overview

  • Problem 1: Introduce and tune L2 regularisation.
  • Problem 2: Explore overfitting.
  • Problem 3: Introduce dropout.
  • Problem 4: Explore multi-layer models and the techniques from the previous problems.

In [1]:
# These are all the modules we'll be using later. 
# Make sure you can import them before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle
import os

First reload the data we generated in 1_notmnist.ipynb.


In [2]:
# Create data directory path
dpath = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
dpath = os.path.join(dpath, 'data')
# create pickle data file path
pickle_file = os.path.join(dpath,'notMNIST.pickle')

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)


Training set (500000, 28, 28) (500000,)
Validation set (29000, 28, 28) (29000,)
Test set (18000, 28, 28) (18000,)

Reformat into a shape that's more adapted to the models we're going to train:

  • data as a flat matrix,
  • labels as float 1-hot encodings.

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)


Training set (500000, 784) (500000, 10)
Validation set (29000, 784) (29000, 10)
Test set (18000, 784) (18000, 10)
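
The one-hot step in `reformat` relies on NumPy broadcasting. A small sketch with made-up labels (the `toy_` names and values are illustrative, not taken from the dataset) shows what the comparison produces:

```python
import numpy as np

num_labels = 10
# Hypothetical integer labels for three examples.
toy_labels = np.array([1, 0, 9])

# toy_labels[:, None] has shape (3, 1); np.arange(num_labels) has shape (10,).
# Broadcasting the equality test gives a (3, 10) boolean matrix with exactly
# one True per row, in the column equal to the label.
toy_one_hot = (np.arange(num_labels) == toy_labels[:, None]).astype(np.float32)

print(toy_one_hot.shape)
print(toy_one_hot[0])
```

Taking `np.argmax` along axis 1 recovers the original integer labels, which is what the `accuracy` helper relies on.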

In [4]:
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
              / predictions.shape[0])
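
As a quick sanity check, the `accuracy` helper can be exercised on a tiny hand-made example. The arrays below are hypothetical values chosen so that three of four predictions are correct:

```python
import numpy as np

def accuracy(predictions, labels):
    # Percentage of rows where the predicted class matches the one-hot label.
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])

# Hypothetical softmax outputs for four examples and three classes.
toy_predictions = np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.3, 0.3, 0.4],
                            [0.6, 0.3, 0.1]])
# One-hot labels for classes 0, 1, 2, 1; the last prediction (class 0) is wrong.
toy_labels = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1],
                       [0, 1, 0]], dtype=np.float32)

toy_acc = accuracy(toy_predictions, toy_labels)
print("Toy accuracy: %.1f%%" % toy_acc)
```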

Run every cell up to this point before any of the computations below. After this point, evaluate only the graph you want to recompute, then run the corresponding training cell.


Problem 1

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor t using nn.l2_loss(t). The right amount of regularization should improve your validation / test accuracy.
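
To make the penalty concrete, here is a minimal NumPy sketch of the quantity that `tf.nn.l2_loss` computes, namely half the sum of squares of the tensor, and of how it is added to the data loss. The toy weights and the `data_loss` value are made up for illustration:

```python
import numpy as np

def l2_loss(t):
    # Same quantity as tf.nn.l2_loss: sum(t ** 2) / 2 (no square root).
    return np.sum(np.square(t)) / 2.0

# Hypothetical toy values standing in for the real weights and data loss.
toy_weights = np.array([[1.0, -2.0], [0.5, 0.0]])
data_loss = 0.8
gamma = 0.01  # regularisation constant

penalty = l2_loss(toy_weights)  # (1 + 4 + 0.25 + 0) / 2
total_loss = data_loss + gamma * penalty

print(penalty, total_loss)
```

Because of the factor 1/2, the gradient of the penalty term with respect to the weights is simply `gamma * weights`, so each gradient step shrinks every weight towards zero.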


First we work on logistic regression

  • We use the minibatch implementation from assignment 2.

In [5]:
# Create TensorFlow graph

batch_size = 128
# regularisation constant
gamma = 0.01

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
  
    # Variables.
    weights = tf.Variable(
         tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))
  
    # Training computation.
    logits = tf.matmul(tf_train_dataset, weights) + biases
    
    # tf.reduce_mean because we take the average cross entropy over the batch.
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    # Add L2 regularisation to the loss.
    # Note: both weights and biases are regularised here.
    loss = loss + gamma * (
        tf.nn.l2_loss(weights) + tf.nn.l2_loss(biases))
        
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

In [6]:
# run tensorFlow graph.

num_steps = 3001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
    
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 45.040466
Minibatch accuracy: 11.7%
Validation accuracy: 17.8%
Minibatch loss at step 500: 0.821025
Minibatch accuracy: 82.8%
Validation accuracy: 81.2%
Minibatch loss at step 1000: 0.762393
Minibatch accuracy: 78.1%
Validation accuracy: 81.1%
Minibatch loss at step 1500: 0.754469
Minibatch accuracy: 83.6%
Validation accuracy: 80.3%
Minibatch loss at step 2000: 0.772569
Minibatch accuracy: 82.0%
Validation accuracy: 81.0%
Minibatch loss at step 2500: 0.840660
Minibatch accuracy: 78.9%
Validation accuracy: 79.2%
Minibatch loss at step 3000: 0.795423
Minibatch accuracy: 82.0%
Validation accuracy: 80.6%
Test accuracy: 87.3%
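
The offset arithmetic in the training loop can be checked in isolation. The sketch below wraps the same expression in a hypothetical helper, `minibatch_offset`, and uses the training-set size reported earlier:

```python
batch_size = 128
num_train = 500000  # number of training examples reported above

def minibatch_offset(step):
    # Hypothetical helper wrapping the offset expression from the loop.
    return (step * batch_size) % (num_train - batch_size)

# The offset advances by one batch per step and wraps before running off the
# end of the data, so every slice [offset, offset + batch_size) is full.
offsets = [minibatch_offset(s) for s in (0, 1, 2, 3000, 3905, 3906)]
print(offsets)
```

With 3001 steps of 128 examples each, the loop touches 384,128 examples, which is less than one full pass over the 500,000 training examples, so the wrap-around never triggers in this run.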

Now let's work on a neural network with a hidden layer

  • We use the example from assignment 2:

In [7]:
batch_size = 128
hidden_nodes = 1024
# regularisation constant
gamma = 0.01

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes]))
    # We construct the variables representing the output layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))

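    # Training computation.
    # Note: no activation (e.g. tf.nn.relu) is applied to the hidden layer,
    # so the two layers together still compute an affine function of the input.
    hidden_layer = tf.matmul(tf_train_dataset, weights1) + biases1
    logits = tf.matmul(hidden_layer, weights2) + biases2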
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    # Add L2 regularisation for all weights (biases are not regularised).
    loss = loss + gamma * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer_val = tf.matmul(tf_valid_dataset, weights1) + biases1
    logits_val = tf.matmul(hidden_layer_val, weights2) + biases2
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer_test = tf.matmul(tf_test_dataset, weights1) + biases1
    logits_test = tf.matmul(hidden_layer_test, weights2) + biases2
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [8]:
num_steps = 3001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 3683.220703
Minibatch accuracy: 8.6%
Validation accuracy: 35.7%
Minibatch loss at step 500: 20.774141
Minibatch accuracy: 82.0%
Validation accuracy: 79.9%
Minibatch loss at step 1000: 0.906779
Minibatch accuracy: 79.7%
Validation accuracy: 80.5%
Minibatch loss at step 1500: 0.829210
Minibatch accuracy: 81.2%
Validation accuracy: 79.4%
Minibatch loss at step 2000: 0.878375
Minibatch accuracy: 79.7%
Validation accuracy: 81.5%
Minibatch loss at step 2500: 0.885852
Minibatch accuracy: 77.3%
Validation accuracy: 78.8%
Minibatch loss at step 3000: 0.864545
Minibatch accuracy: 80.5%
Validation accuracy: 80.4%
Test accuracy: 87.2%
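
The hidden layer in the cell above is a matrix multiplication plus bias with no activation function. A minimal NumPy sketch (toy shapes and random values, all hypothetical) shows that two such layers collapse into a single affine map, which may be why the results closely track those of the logistic regression:

```python
import numpy as np

rng = np.random.RandomState(0)
# Toy shapes (hypothetical): 5 examples, 4 inputs, 6 hidden units, 3 classes.
x = rng.randn(5, 4)
w1, b1 = rng.randn(4, 6), rng.randn(6)
w2, b2 = rng.randn(6, 3), rng.randn(3)

# Two stacked affine layers with no activation in between.
two_layer = np.dot(np.dot(x, w1) + b1, w2) + b2

# The same outputs from a single affine layer with merged parameters.
w_merged = np.dot(w1, w2)
b_merged = np.dot(b1, w2) + b2
one_layer = np.dot(x, w_merged) + b_merged

# With a ReLU on the hidden layer, the merged layer no longer reproduces it.
with_relu = np.dot(np.maximum(np.dot(x, w1) + b1, 0.0), w2) + b2

print(np.allclose(two_layer, one_layer))
print(np.allclose(with_relu, one_layer))
```

In the TensorFlow graph, the corresponding change would be to wrap the hidden layer in `tf.nn.relu`; the outputs recorded in this notebook were produced without it.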

Problem 2

Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?
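
One way to restrict training to a few batches is to cycle over a fixed handful of offsets. This is a sketch of one option, not the approach used in the cells below, which draw a random offset from a small range; `num_batches` and `restricted_offset` are illustrative names:

```python
batch_size = 128
num_batches = 3  # hypothetical: how many distinct minibatches to keep

def restricted_offset(step):
    # Cycle through the first `num_batches` non-overlapping minibatches only.
    return (step % num_batches) * batch_size

offsets = [restricted_offset(s) for s in range(7)]
examples_seen = num_batches * batch_size

print(offsets)
print("examples ever seen:", examples_seen)
```

Either way, the model sees only a few hundred examples, so it can memorise them while generalising poorly.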


First we work on logistic regression


In [9]:
# Rebuild the graph for logistic regression.

batch_size = 128
# regularisation constant
gamma = 0.01

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
  
    # Variables.
    weights = tf.Variable(
         tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))
  
    # Training computation.
    logits = tf.matmul(tf_train_dataset, weights) + biases
    
    # tf.reduce_mean because we take the average cross entropy over the batch.
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    # Add L2 regularisation to the loss.
    # Note: both weights and biases are regularised here.
    loss = loss + gamma * (
        tf.nn.l2_loss(weights) + tf.nn.l2_loss(biases))
        
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

In [10]:
# run tensorFlow graph for logistic regression.

num_steps = 3001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    
    for step in range(num_steps):
        # Restrict training to a small subset of the data: draw the offset
        # uniformly from [1, 500], so only examples 1 to 627 are ever used.
        offset = np.random.randint(1, 501)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
    
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 46.589722
Minibatch accuracy: 7.8%
Validation accuracy: 9.2%
Minibatch loss at step 500: 0.489492
Minibatch accuracy: 100.0%
Validation accuracy: 77.5%
Minibatch loss at step 1000: 0.356012
Minibatch accuracy: 100.0%
Validation accuracy: 77.9%
Minibatch loss at step 1500: 0.302419
Minibatch accuracy: 100.0%
Validation accuracy: 77.9%
Minibatch loss at step 2000: 0.343183
Minibatch accuracy: 99.2%
Validation accuracy: 77.7%
Minibatch loss at step 2500: 0.350681
Minibatch accuracy: 99.2%
Validation accuracy: 78.1%
Minibatch loss at step 3000: 0.321371
Minibatch accuracy: 99.2%
Validation accuracy: 77.6%
Test accuracy: 84.6%

Now let's work on a neural network with a hidden layer


In [11]:
# Rebuild the graph for the network with one hidden layer.
batch_size = 128
hidden_nodes = 1024
# regularisation constant
gamma = 0.01

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes]))
    # We construct the variables representing the output layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    hidden_layer = tf.matmul(tf_train_dataset, weights1) + biases1
    logits = tf.matmul(hidden_layer, weights2) + biases2
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    # Add L2 regularisation for all weights (biases are not regularised).
    loss = loss + gamma * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer_val = tf.matmul(tf_valid_dataset, weights1) + biases1
    logits_val = tf.matmul(hidden_layer_val, weights2) + biases2
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer_test = tf.matmul(tf_test_dataset, weights1) + biases1
    logits_test = tf.matmul(hidden_layer_test, weights2) + biases2
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [12]:
# NOTE: re-run the graph-building cell above (In [11]) before running this cell.
num_steps = 3001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Restrict training to a small subset of the data: draw the offset
        # uniformly from [1, 500], so only examples 1 to 627 are ever used.
        offset = np.random.randint(1, 501)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 3582.423584
Minibatch accuracy: 13.3%
Validation accuracy: 35.3%
Minibatch loss at step 500: 20.918388
Minibatch accuracy: 100.0%
Validation accuracy: 75.0%
Minibatch loss at step 1000: 0.513852
Minibatch accuracy: 98.4%
Validation accuracy: 76.7%
Minibatch loss at step 1500: 0.315037
Minibatch accuracy: 100.0%
Validation accuracy: 77.5%
Minibatch loss at step 2000: 0.308981
Minibatch accuracy: 100.0%
Validation accuracy: 77.3%
Minibatch loss at step 2500: 0.296738
Minibatch accuracy: 100.0%
Validation accuracy: 77.4%
Minibatch loss at step 3000: 0.334466
Minibatch accuracy: 99.2%
Validation accuracy: 76.9%
Test accuracy: 83.9%

Problem 3

Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides nn.dropout() for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?
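
For intuition, here is a minimal NumPy sketch of what dropout does to a layer's activations during training. It assumes the inverted-dropout convention that `tf.nn.dropout` follows (kept units are scaled by `1 / keep_prob`); the `dropout` helper and the toy activations are illustrative:

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    # Inverted dropout: zero each unit with probability 1 - keep_prob and
    # scale the survivors by 1 / keep_prob so the expected value is unchanged.
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
hidden = np.ones((4, 1000))  # hypothetical hidden-layer activations

train_hidden = dropout(hidden, keep_prob=0.5, rng=rng)  # training: stochastic
eval_hidden = hidden                                    # evaluation: no dropout

print(np.unique(train_hidden))  # 0.0 for dropped units, 2.0 for kept units
print(train_hidden.mean())      # close to 1.0, the mean without dropout
```

Because of the rescaling, the evaluation path can use the unmodified activations with the same weights, which is why dropout is applied to the training logits only.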


Introducing dropout for the hidden layer


In [14]:
batch_size = 128
hidden_nodes = 1024

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes]))
    # We construct the variables representing the output layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    hidden_layer = tf.matmul(tf_train_dataset, weights1) + biases1
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    hidden_layer_d = tf.nn.dropout(hidden_layer, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer, weights2) + biases2
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer_d, weights2) + biases2
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer_val = tf.matmul(tf_valid_dataset, weights1) + biases1
    logits_val = tf.matmul(hidden_layer_val, weights2) + biases2
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer_test = tf.matmul(tf_test_dataset, weights1) + biases1
    logits_test = tf.matmul(hidden_layer_test, weights2) + biases2
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [15]:
num_steps = 3001
# dropout layer keep probability
keep_probl = 0.5  # must not reuse the name keep_prob (the graph placeholder)
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 3889.392578
Minibatch accuracy: 6.2%
Validation accuracy: 40.8%
Minibatch loss at step 500: 23.329178
Minibatch accuracy: 69.5%
Validation accuracy: 66.9%
Minibatch loss at step 1000: 1.040158
Minibatch accuracy: 78.9%
Validation accuracy: 78.5%
Minibatch loss at step 1500: 1.029830
Minibatch accuracy: 74.2%
Validation accuracy: 76.9%
Minibatch loss at step 2000: 0.963729
Minibatch accuracy: 77.3%
Validation accuracy: 80.3%
Minibatch loss at step 2500: 0.957147
Minibatch accuracy: 76.6%
Validation accuracy: 75.8%
Minibatch loss at step 3000: 0.996744
Minibatch accuracy: 74.2%
Validation accuracy: 77.3%
Test accuracy: 83.9%

Restricting training data


In [16]:
num_steps = 3001
# dropout layer keep probability
keep_probl = 0.5  # must not reuse the name keep_prob (the graph placeholder)
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Restrict training to a small subset of the data: draw the offset
        # uniformly from [1, 500], so only examples 1 to 627 are ever used.
        offset = np.random.randint(1, 501)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 3783.325195
Minibatch accuracy: 6.2%
Validation accuracy: 33.9%
Minibatch loss at step 500: 22.153778
Minibatch accuracy: 99.2%
Validation accuracy: 76.2%
Minibatch loss at step 1000: 0.491627
Minibatch accuracy: 100.0%
Validation accuracy: 77.3%
Minibatch loss at step 1500: 0.399057
Minibatch accuracy: 99.2%
Validation accuracy: 76.5%
Minibatch loss at step 2000: 0.341882
Minibatch accuracy: 99.2%
Validation accuracy: 77.5%
Minibatch loss at step 2500: 0.332075
Minibatch accuracy: 100.0%
Validation accuracy: 78.0%
Minibatch loss at step 3000: 0.365101
Minibatch accuracy: 97.7%
Validation accuracy: 76.1%
Test accuracy: 83.0%

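Dropout did little against overfitting in this specific case: after step 500, minibatch accuracy still sits between 97.7% and 100% while validation accuracy stays between 76% and 78%.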


Problem 4

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is 97.1%.

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

global_step = tf.Variable(0)  # count the number of steps taken.
learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
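
The schedule that `tf.train.exponential_decay` produces can be sketched in plain Python. The snippet above leaves the remaining arguments unspecified, so the `decay_steps=1000` and `decay_rate=0.5` used here are illustrative values only, not a recommendation:

```python
def exponential_decay(base_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    # base_rate * decay_rate ** (global_step / decay_steps); with staircase
    # the exponent is truncated to an integer, so the rate drops in steps.
    exponent = global_step / float(decay_steps)
    if staircase:
        exponent = global_step // decay_steps
    return base_rate * decay_rate ** exponent

# Illustrative schedule: start at 0.5 and halve the rate every 1000 steps.
for step in (0, 500, 1000, 2000):
    smooth = exponential_decay(0.5, step, 1000, 0.5)
    stepped = exponential_decay(0.5, step, 1000, 0.5, staircase=True)
    print(step, smooth, stepped)
```

A decaying rate lets training take large steps early and smaller ones later, which can reduce the oscillation in validation accuracy seen in the fixed-rate runs.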

1. We start by increasing the number of training steps for the regularised one-hidden-layer network with dropout.


In [17]:
num_steps = 8001
# dropout layer keep probability
keep_probl = 0.5  # must not reuse the name keep_prob (the graph placeholder)
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 3880.716553
Minibatch accuracy: 5.5%
Validation accuracy: 35.4%
Minibatch loss at step 500: 23.848827
Minibatch accuracy: 71.9%
Validation accuracy: 72.8%
Minibatch loss at step 1000: 1.084231
Minibatch accuracy: 76.6%
Validation accuracy: 77.7%
Minibatch loss at step 1500: 1.053075
Minibatch accuracy: 72.7%
Validation accuracy: 76.6%
Minibatch loss at step 2000: 0.961123
Minibatch accuracy: 79.7%
Validation accuracy: 79.9%
Minibatch loss at step 2500: 0.972796
Minibatch accuracy: 75.8%
Validation accuracy: 75.7%
Minibatch loss at step 3000: 0.941161
Minibatch accuracy: 76.6%
Validation accuracy: 78.4%
Minibatch loss at step 3500: 0.994951
Minibatch accuracy: 78.9%
Validation accuracy: 79.2%
Minibatch loss at step 4000: 0.654308
Minibatch accuracy: 86.7%
Validation accuracy: 80.3%
Minibatch loss at step 4500: 0.851535
Minibatch accuracy: 82.0%
Validation accuracy: 80.5%
Minibatch loss at step 5000: 1.031823
Minibatch accuracy: 78.1%
Validation accuracy: 78.9%
Minibatch loss at step 5500: 0.975298
Minibatch accuracy: 81.2%
Validation accuracy: 78.7%
Minibatch loss at step 6000: 0.786854
Minibatch accuracy: 82.8%
Validation accuracy: 80.4%
Minibatch loss at step 6500: 0.869442
Minibatch accuracy: 80.5%
Validation accuracy: 79.0%
Minibatch loss at step 7000: 0.930099
Minibatch accuracy: 77.3%
Validation accuracy: 79.6%
Minibatch loss at step 7500: 1.021005
Minibatch accuracy: 75.8%
Validation accuracy: 80.4%
Minibatch loss at step 8000: 0.935995
Minibatch accuracy: 81.2%
Validation accuracy: 78.4%
Test accuracy: 85.1%

Increasing the number of steps from 3001 to 8001 improved performance only slightly: test accuracy rose from 83.9% to 85.1%.

2. Let's increase the regularisation constant (gamma from 0.01 to 0.03).


In [19]:
num_steps = 8001
# dropout layer keep probability
keep_probl = 0.5  # must not reuse the name keep_prob (the graph placeholder)
# regularisation constant
gamma = 0.03

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 10097.652344
Minibatch accuracy: 18.0%
Validation accuracy: 30.2%
Minibatch loss at step 500: 1.091034
Minibatch accuracy: 75.0%
Validation accuracy: 75.7%
Minibatch loss at step 1000: 1.010935
Minibatch accuracy: 76.6%
Validation accuracy: 79.4%
Minibatch loss at step 1500: 1.124861
Minibatch accuracy: 75.0%
Validation accuracy: 77.2%
Minibatch loss at step 2000: 1.096387
Minibatch accuracy: 78.9%
Validation accuracy: 79.8%
Minibatch loss at step 2500: 1.132236
Minibatch accuracy: 75.0%
Validation accuracy: 75.8%
Minibatch loss at step 3000: 1.049138
Minibatch accuracy: 80.5%
Validation accuracy: 77.9%
Minibatch loss at step 3500: 1.130792
Minibatch accuracy: 77.3%
Validation accuracy: 79.5%
Minibatch loss at step 4000: 0.810080
Minibatch accuracy: 88.3%
Validation accuracy: 79.4%
Minibatch loss at step 4500: 1.021264
Minibatch accuracy: 81.2%
Validation accuracy: 80.2%
Minibatch loss at step 5000: 1.228076
Minibatch accuracy: 71.9%
Validation accuracy: 75.6%
Minibatch loss at step 5500: 1.069519
Minibatch accuracy: 80.5%
Validation accuracy: 79.4%
Minibatch loss at step 6000: 0.931228
Minibatch accuracy: 83.6%
Validation accuracy: 79.5%
Minibatch loss at step 6500: 1.021470
Minibatch accuracy: 82.0%
Validation accuracy: 77.9%
Minibatch loss at step 7000: 1.082067
Minibatch accuracy: 76.6%
Validation accuracy: 79.3%
Minibatch loss at step 7500: 1.152230
Minibatch accuracy: 75.8%
Validation accuracy: 80.3%
Minibatch loss at step 8000: 1.111657
Minibatch accuracy: 79.7%
Validation accuracy: 77.1%
Test accuracy: 84.0%

Results:

Increasing the regularisation constant from 0.01 to 0.03 didn't improve performance: test accuracy fell from 85.1% to 84.0%.

3. Let's double the width of the hidden layer:


In [20]:
batch_size = 128
hidden_nodes = 2*1024

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes]))
    # We construct the variables representing the output layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    hidden_layer = tf.matmul(tf_train_dataset, weights1) + biases1
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    hidden_layer_d = tf.nn.dropout(hidden_layer, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer, weights2) + biases2
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer_d, weights2) + biases2
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer_val = tf.matmul(tf_valid_dataset, weights1) + biases1
    logits_val = tf.matmul(hidden_layer_val, weights2) + biases2
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer_test = tf.matmul(tf_test_dataset, weights1) + biases1
    logits_test = tf.matmul(hidden_layer_test, weights2) + biases2
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [21]:
num_steps = 8001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 7183.234863
Minibatch accuracy: 18.8%
Validation accuracy: 35.8%
Minibatch loss at step 500: 49.708282
Minibatch accuracy: 71.1%
Validation accuracy: 67.8%
Minibatch loss at step 1000: 1.318965
Minibatch accuracy: 77.3%
Validation accuracy: 77.3%
Minibatch loss at step 1500: 1.033121
Minibatch accuracy: 75.0%
Validation accuracy: 76.4%
Minibatch loss at step 2000: 0.998889
Minibatch accuracy: 76.6%
Validation accuracy: 79.8%
Minibatch loss at step 2500: 0.997531
Minibatch accuracy: 73.4%
Validation accuracy: 74.0%
Minibatch loss at step 3000: 0.942747
Minibatch accuracy: 75.0%
Validation accuracy: 77.1%
Minibatch loss at step 3500: 1.000108
Minibatch accuracy: 77.3%
Validation accuracy: 79.6%
Minibatch loss at step 4000: 0.641362
Minibatch accuracy: 86.7%
Validation accuracy: 80.2%
Minibatch loss at step 4500: 0.866591
Minibatch accuracy: 79.7%
Validation accuracy: 80.7%
Minibatch loss at step 5000: 1.037657
Minibatch accuracy: 78.9%
Validation accuracy: 79.4%
Minibatch loss at step 5500: 0.995731
Minibatch accuracy: 83.6%
Validation accuracy: 79.6%
Minibatch loss at step 6000: 0.802124
Minibatch accuracy: 84.4%
Validation accuracy: 80.6%
Minibatch loss at step 6500: 0.843823
Minibatch accuracy: 80.5%
Validation accuracy: 78.6%
Minibatch loss at step 7000: 0.953323
Minibatch accuracy: 75.8%
Validation accuracy: 79.7%
Minibatch loss at step 7500: 0.975157
Minibatch accuracy: 75.8%
Validation accuracy: 80.5%
Minibatch loss at step 8000: 0.961649
Minibatch accuracy: 80.5%
Validation accuracy: 78.5%
Test accuracy: 85.2%

Doubling the number of hidden nodes did not significantly improve accuracy (test accuracy 85.2% versus 85.1%).
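
To put the width change in perspective, the sketch below counts the trainable parameters of the one-hidden-layer network for both widths. The helper `count_params` is a hypothetical name introduced here for illustration; the layer sizes are the ones used in the cells above.

```python
def count_params(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix
        total += fan_out           # bias vector
    return total

image_size = 28
num_labels = 10

narrow = count_params([image_size * image_size, 1024, num_labels])
wide = count_params([image_size * image_size, 2 * 1024, num_labels])
print("1024 hidden nodes: %d parameters" % narrow)
print("2048 hidden nodes: %d parameters" % wide)
```

The parameter count roughly doubles while the recorded test accuracy barely moves, which suggests that capacity was probably not the limiting factor in this setup.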

4. Let's try 2 hidden layers


In [22]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.random_normal([hidden_nodes1, hidden_nodes2]))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.matmul(tf_train_dataset, weights1) + biases1
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    hidden_layer2 = tf.matmul(hidden_layer1_d, weights2) + biases2
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
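    # logits for prediction
    # (hidden_layer2 is built from hidden_layer1_d, so first-layer dropout
    # still affects train_prediction)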
    logits = tf.matmul(hidden_layer2, weights3) + biases3
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) +
        tf.nn.l2_loss(weights3))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.matmul(tf_valid_dataset, weights1) + biases1
    hidden_layer2_val = tf.matmul(hidden_layer1_val, weights2) + biases2
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.matmul(tf_test_dataset, weights1) + biases1
    hidden_layer2_test = tf.matmul(hidden_layer1_test, weights2) + biases2
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
    
    ## Notes:
    # 2 hidden layers cause instability in gradient backpropagation
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the learning rate was reduced.
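
The note at the end of the cell above names initialisation as one half of the fix. As a rough illustration, the NumPy sketch below compares the spread of the pre-activations `x @ W` for unit-stddev weights, as used in this notebook, against weights scaled by `1/sqrt(fan_in)`. The helper `preactivation_std` is hypothetical, the uniform inputs in [-0.5, 0.5] are only an assumed stand-in for the pixel data, and `1/sqrt(fan_in)` is one common scaling rather than the setting used in this assignment.

```python
import numpy as np

rng = np.random.RandomState(0)

def preactivation_std(fan_in, fan_out, weight_std, batch=64):
    """Standard deviation of x @ W for random inputs and Gaussian weights."""
    # Assumption: inputs roughly uniform in [-0.5, 0.5], standing in for pixels.
    x = rng.uniform(-0.5, 0.5, size=(batch, fan_in))
    w = rng.randn(fan_in, fan_out) * weight_std
    return float(np.std(x.dot(w)))

fan_in, fan_out = 784, 256
unit_std = preactivation_std(fan_in, fan_out, weight_std=1.0)
scaled_std = preactivation_std(fan_in, fan_out, weight_std=1.0 / np.sqrt(fan_in))
print("unit-stddev weights:   %.2f" % unit_std)
print("1/sqrt(fan_in) scaled: %.2f" % scaled_std)
```

With unit-stddev weights the pre-activations are already several units wide after a single layer and widen again at every further layer. That is one plausible reason the learning rate had to be cut so sharply here.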

In [23]:
num_steps = 36001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 2000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 19600.097656
Minibatch accuracy: 12.5%
Validation accuracy: 19.2%
Minibatch loss at step 2000: 4415.139648
Minibatch accuracy: 71.9%
Validation accuracy: 78.4%
Minibatch loss at step 4000: 4200.625488
Minibatch accuracy: 80.5%
Validation accuracy: 78.4%
Minibatch loss at step 6000: 4015.094727
Minibatch accuracy: 80.5%
Validation accuracy: 79.1%
Minibatch loss at step 8000: 3892.113281
Minibatch accuracy: 71.1%
Validation accuracy: 78.2%
Minibatch loss at step 10000: 3688.991455
Minibatch accuracy: 75.8%
Validation accuracy: 79.2%
Minibatch loss at step 12000: 3517.541748
Minibatch accuracy: 78.1%
Validation accuracy: 79.5%
Minibatch loss at step 14000: 3411.374268
Minibatch accuracy: 70.3%
Validation accuracy: 79.6%
Minibatch loss at step 16000: 3234.648682
Minibatch accuracy: 69.5%
Validation accuracy: 79.9%
Minibatch loss at step 18000: 3091.167236
Minibatch accuracy: 77.3%
Validation accuracy: 78.9%
Minibatch loss at step 20000: 3010.418457
Minibatch accuracy: 71.9%
Validation accuracy: 79.7%
Minibatch loss at step 22000: 2869.405518
Minibatch accuracy: 69.5%
Validation accuracy: 79.6%
Minibatch loss at step 24000: 2758.060791
Minibatch accuracy: 72.7%
Validation accuracy: 78.5%
Minibatch loss at step 26000: 2633.211426
Minibatch accuracy: 78.1%
Validation accuracy: 79.4%
Minibatch loss at step 28000: 2524.239014
Minibatch accuracy: 73.4%
Validation accuracy: 79.6%
Minibatch loss at step 30000: 2417.297363
Minibatch accuracy: 74.2%
Validation accuracy: 78.3%
Minibatch loss at step 32000: 2315.153320
Minibatch accuracy: 76.6%
Validation accuracy: 79.1%
Minibatch loss at step 34000: 2224.187988
Minibatch accuracy: 75.0%
Validation accuracy: 79.3%
Minibatch loss at step 36000: 2127.373535
Minibatch accuracy: 78.1%
Validation accuracy: 80.0%
Test accuracy: 87.0%

With two hidden layers the test accuracy increased slightly, from 85.2% to 87.0%.
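
The minibatch loss above stays in the thousands even once validation accuracy is near 80%. A plausible explanation is that the L2 penalty on the unit-stddev initial weights dwarfs the cross-entropy term. The back-of-the-envelope sketch below estimates that penalty at initialisation. The helpers `truncated_normal_variance` and `expected_l2_penalty` are hypothetical. The estimate relies on `tf.nn.l2_loss(w)` being `sum(w**2) / 2` and on `tf.truncated_normal` re-drawing values more than two standard deviations from the mean.

```python
import math

def truncated_normal_variance(cutoff=2.0):
    """Variance of a standard normal truncated to [-cutoff, cutoff]."""
    pdf = math.exp(-0.5 * cutoff ** 2) / math.sqrt(2.0 * math.pi)
    mass = math.erf(cutoff / math.sqrt(2.0))  # P(|z| <= cutoff)
    return 1.0 - 2.0 * cutoff * pdf / mass

def expected_l2_penalty(shape, variance, regconst):
    """Expected regconst * sum(w**2) / 2 for zero-mean weights."""
    rows, cols = shape
    return regconst * 0.5 * rows * cols * variance

gamma = 0.01  # regularisation constant used in the training cell
trunc_var = truncated_normal_variance()
penalty = (
    expected_l2_penalty((28 * 28, 1024), trunc_var, gamma)  # weights1
    + expected_l2_penalty((1024, 256), 1.0, gamma)          # weights2 (random_normal)
    + expected_l2_penalty((256, 10), trunc_var, gamma)      # weights3
)
print("Estimated initial L2 penalty: %.0f" % penalty)
```

Under these assumptions the estimate comes out at roughly 4,400, the same order as the minibatch losses recorded after step 2000. This suggests the reported loss is mostly weight penalty that shrinks slowly at this learning rate, rather than cross-entropy.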

5. Let's add ReLU activation functions


In [24]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.random_normal([hidden_nodes1, hidden_nodes2]))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
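    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    # Note: hidden_layer2 is built from hidden_layer1, not hidden_layer1_d,
    # so dropout is effectively applied only after the 2nd hidden layer.
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)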
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer2, weights3) + biases3
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) +
        tf.nn.l2_loss(weights3))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.003).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
    
    ## Notes:
    # 2 hidden layers cause instability in gradient backpropagation
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the learning rate was reduced.

In [25]:
num_steps = 36001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 2000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 9459.060547
Minibatch accuracy: 5.5%
Validation accuracy: 18.8%
Minibatch loss at step 2000: 3930.830811
Minibatch accuracy: 61.7%
Validation accuracy: 62.4%
Minibatch loss at step 4000: 3476.717773
Minibatch accuracy: 55.5%
Validation accuracy: 59.0%
Minibatch loss at step 6000: 3081.757568
Minibatch accuracy: 60.9%
Validation accuracy: 59.5%
Minibatch loss at step 8000: 2734.829102
Minibatch accuracy: 64.8%
Validation accuracy: 60.9%
Minibatch loss at step 10000: 2424.271973
Minibatch accuracy: 62.5%
Validation accuracy: 64.9%
Minibatch loss at step 12000: 2149.825195
Minibatch accuracy: 75.8%
Validation accuracy: 67.2%
Minibatch loss at step 14000: 1906.975220
Minibatch accuracy: 66.4%
Validation accuracy: 67.7%
Minibatch loss at step 16000: 1691.013916
Minibatch accuracy: 71.9%
Validation accuracy: 70.5%
Minibatch loss at step 18000: 1499.751831
Minibatch accuracy: 75.0%
Validation accuracy: 71.8%
Minibatch loss at step 20000: 1330.522095
Minibatch accuracy: 68.0%
Validation accuracy: 73.6%
Minibatch loss at step 22000: 1180.085083
Minibatch accuracy: 78.1%
Validation accuracy: 74.7%
Minibatch loss at step 24000: 1046.699829
Minibatch accuracy: 70.3%
Validation accuracy: 75.4%
Minibatch loss at step 26000: 928.454407
Minibatch accuracy: 75.8%
Validation accuracy: 76.4%
Minibatch loss at step 28000: 823.300842
Minibatch accuracy: 81.2%
Validation accuracy: 77.1%
Minibatch loss at step 30000: 730.298950
Minibatch accuracy: 75.8%
Validation accuracy: 77.6%
Minibatch loss at step 32000: 647.876221
Minibatch accuracy: 76.6%
Validation accuracy: 78.2%
Minibatch loss at step 34000: 574.721436
Minibatch accuracy: 77.3%
Validation accuracy: 78.7%
Minibatch loss at step 36000: 509.698761
Minibatch accuracy: 79.7%
Validation accuracy: 79.2%
Test accuracy: 86.6%

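ReLU activation functions didn't noticeably change performance (test accuracy 86.6% versus 87.0%). Note that the learning rate was also raised from 0.001 to 0.003 in this run.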

6. Let's try again without dropout.
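
As a reminder of what is switched off in the cells below, here is a minimal NumPy sketch of inverted dropout. It mirrors the behaviour of `tf.nn.dropout(x, keep_prob)` in TensorFlow 1.x, where kept elements are scaled by `1 / keep_prob` and the rest are set to zero. The function name `dropout` and the all-ones test input are illustrative assumptions, not code from this notebook.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: zero units at random and rescale the survivors."""
    mask = rng.uniform(size=x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.RandomState(0)
activations = np.ones((1000, 100))
dropped = dropout(activations, keep_prob=0.5, rng=rng)
print("fraction kept:   %.3f" % np.mean(dropped != 0))
print("mean activation: %.3f" % dropped.mean())
```

Because the survivors are rescaled, the expected activation is unchanged, which is why the validation and test graphs can reuse the same weights without any dropout layer.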


In [26]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.random_normal([hidden_nodes1, hidden_nodes2]))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # dropout placeholder (unused in this graph, kept so the training cell can still feed it)
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    #hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)
    #hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer2, weights3) + biases3
    
    # logits_d for training with dropout
    #logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) +
        tf.nn.l2_loss(weights3))
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.003).minimize(loss)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
    
    ## Notes:
    # 2 hidden layers cause instability in gradient backpropagation
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the learning rate was reduced.

In [27]:
num_steps = 24001
# dropout layer keep probability (has no effect here: dropout is disabled in this graph)
keep_probl = 0.05 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 2000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 8702.677734
Minibatch accuracy: 10.2%
Validation accuracy: 11.9%
Minibatch loss at step 2000: 3943.468994
Minibatch accuracy: 74.2%
Validation accuracy: 75.1%
Minibatch loss at step 4000: 3473.043701
Minibatch accuracy: 84.4%
Validation accuracy: 74.0%
Minibatch loss at step 6000: 3080.477539
Minibatch accuracy: 75.0%
Validation accuracy: 72.2%
Minibatch loss at step 8000: 2729.142578
Minibatch accuracy: 71.1%
Validation accuracy: 70.5%
Minibatch loss at step 10000: 2420.376221
Minibatch accuracy: 71.9%
Validation accuracy: 74.3%
Minibatch loss at step 12000: 2146.191895
Minibatch accuracy: 80.5%
Validation accuracy: 75.4%
Minibatch loss at step 14000: 1903.903076
Minibatch accuracy: 71.1%
Validation accuracy: 74.9%
Minibatch loss at step 16000: 1688.259155
Minibatch accuracy: 81.2%
Validation accuracy: 76.2%
Minibatch loss at step 18000: 1497.265015
Minibatch accuracy: 81.2%
Validation accuracy: 77.2%
Minibatch loss at step 20000: 1328.249268
Minibatch accuracy: 73.4%
Validation accuracy: 78.3%
Minibatch loss at step 22000: 1177.956787
Minibatch accuracy: 77.3%
Validation accuracy: 78.8%
Minibatch loss at step 24000: 1044.823608
Minibatch accuracy: 78.1%
Validation accuracy: 79.1%
Test accuracy: 86.6%

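Performance was essentially unchanged: test accuracy stayed at 86.6% and the final validation accuracy was 79.1% versus 79.2%, even though this run used 24001 steps instead of 36001.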

7. Let's try using a variable learning rate!
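
The graph below uses `tf.train.exponential_decay(ilrate, gstep, 2000, 0.9)`. With its default non-staircase behaviour this computes `ilrate * 0.9 ** (gstep / 2000)`. The plain-Python sketch below reproduces that formula to show how quickly the rate shrinks over the run. The function name `exponential_decay` is a stand-in written for this illustration, and the initial rate of 0.02 is the `learning_rate_i` value from the training cell.

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate):
    """Continuous (non-staircase) exponential decay of the learning rate."""
    return initial_rate * decay_rate ** (step / float(decay_steps))

initial_rate = 0.02  # learning_rate_i in the training cell
for step in range(0, 64001, 16000):
    rate = exponential_decay(initial_rate, step, decay_steps=2000, decay_rate=0.9)
    print("step %5d: learning rate %.5f" % (step, rate))
```

Over 64000 steps the rate is multiplied by `0.9 ** 32`, so training ends with a learning rate more than an order of magnitude below the initial one.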


In [28]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.random_normal([hidden_nodes1, hidden_nodes2]))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # dropout placeholder (unused in this graph, kept so the training cell can still feed it)
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    #hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)
    #hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer2, weights3) + biases3
    
    # logits_d for training with dropout
    #logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) +
        tf.nn.l2_loss(weights3))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0, trainable=False)  # steps taken; a counter, not a trainable weight
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.9)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
    
    ## Notes:
    # 2 hidden layers cause instability in gradient backpropagation
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the learning rate was reduced.

In [29]:
num_steps = 64001
# dropout layer keep probability
keep_probl = 0.05 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 8437.728516
Minibatch accuracy: 13.3%
Validation accuracy: 27.8%
Current learning rate: 0.0199989452958107
Minibatch loss at step 4000: 1043.648682
Minibatch accuracy: 82.8%
Validation accuracy: 73.2%
Current learning rate: 0.01619914546608925
Minibatch loss at step 8000: 324.763031
Minibatch accuracy: 78.1%
Validation accuracy: 80.1%
Current learning rate: 0.013121306896209717
Minibatch loss at step 12000: 126.240578
Minibatch accuracy: 84.4%
Validation accuracy: 82.6%
Current learning rate: 0.010628257878124714
Minibatch loss at step 16000: 59.007954
Minibatch accuracy: 82.0%
Validation accuracy: 84.0%
Current learning rate: 0.00860888883471489
Minibatch loss at step 20000: 32.142159
Minibatch accuracy: 79.7%
Validation accuracy: 84.8%
Current learning rate: 0.006973199546337128
Minibatch loss at step 24000: 19.709427
Minibatch accuracy: 79.7%
Validation accuracy: 85.1%
Current learning rate: 0.005648291204124689
Minibatch loss at step 28000: 13.271347
Minibatch accuracy: 85.2%
Validation accuracy: 85.2%
Current learning rate: 0.004575116094201803
Minibatch loss at step 32000: 9.816479
Minibatch accuracy: 81.2%
Validation accuracy: 85.1%
Current learning rate: 0.003705843584612012
Minibatch loss at step 36000: 7.559084
Minibatch accuracy: 85.2%
Validation accuracy: 85.1%
Current learning rate: 0.003001732984557748
Minibatch loss at step 40000: 6.181899
Minibatch accuracy: 89.1%
Validation accuracy: 85.1%
Current learning rate: 0.002431403612717986
Minibatch loss at step 44000: 5.257970
Minibatch accuracy: 89.1%
Validation accuracy: 85.1%
Current learning rate: 0.0019694368820637465
Minibatch loss at step 48000: 4.856206
Minibatch accuracy: 82.8%
Validation accuracy: 85.1%
Current learning rate: 0.0015952438116073608
Minibatch loss at step 52000: 4.476823
Minibatch accuracy: 82.8%
Validation accuracy: 85.0%
Current learning rate: 0.0012921473244205117
Minibatch loss at step 56000: 4.106928
Minibatch accuracy: 81.2%
Validation accuracy: 85.0%
Current learning rate: 0.0010466392850503325
Minibatch loss at step 60000: 3.719689
Minibatch accuracy: 85.2%
Validation accuracy: 85.1%
Current learning rate: 0.0008477778756059706
Minibatch loss at step 64000: 3.570590
Minibatch accuracy: 85.2%
Validation accuracy: 85.0%
Current learning rate: 0.0006867000483907759
Test accuracy: 91.4%

As we can see, an exponentially decaying learning rate significantly improved our results!
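
To make the schedule concrete, here is a minimal sketch of the formula that `tf.train.exponential_decay` applies when it is called as above (decay steps 2000, decay rate 0.9, no staircase). The function name `exponential_decay` below is a plain-Python stand-in written for illustration, not the TensorFlow op itself.

```python
def exponential_decay(initial_rate, global_step, decay_steps=2000, decay_rate=0.9):
    """Smooth (non-staircase) decay: initial_rate * decay_rate ** (step / decay_steps)."""
    return initial_rate * decay_rate ** (global_step / float(decay_steps))

# The optimizer increments the step counter before the rate is printed,
# so the value logged at loop step s corresponds to global_step = s + 1.
for loop_step in (0, 4000, 64000):
    print(loop_step, exponential_decay(0.02, loop_step + 1))
```

The printed values should agree with the "Current learning rate" lines in the log up to float32 rounding.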

8. Let's introduce dropout


In [30]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1]))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.random_normal([hidden_nodes1, hidden_nodes2]))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
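    # logits for prediction
    # Note: hidden_layer2 is computed from hidden_layer1_d, so dropout is still
    # active in the first hidden layer for these training-minibatch predictions.
    logits = tf.matmul(hidden_layer2, weights3) + biases3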
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.9)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
    
    ## Notes:
    # 2 hidden layers cause instability in gradient backpropagation
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the learning rate was reduced.

In [31]:
num_steps = 64001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 12598.875000
Minibatch accuracy: 6.2%
Validation accuracy: 16.9%
Current learning rate: 0.0199989452958107
Minibatch loss at step 4000: 1039.270020
Minibatch accuracy: 38.3%
Validation accuracy: 22.4%
Current learning rate: 0.01619914546608925
Minibatch loss at step 8000: 323.689301
Minibatch accuracy: 62.5%
Validation accuracy: 68.7%
Current learning rate: 0.013121306896209717
Minibatch loss at step 12000: 125.959976
Minibatch accuracy: 78.1%
Validation accuracy: 78.3%
Current learning rate: 0.010628257878124714
Minibatch loss at step 16000: 58.830399
Minibatch accuracy: 81.2%
Validation accuracy: 81.1%
Current learning rate: 0.00860888883471489
Minibatch loss at step 20000: 32.211987
Minibatch accuracy: 78.1%
Validation accuracy: 82.5%
Current learning rate: 0.006973199546337128
Minibatch loss at step 24000: 19.724861
Minibatch accuracy: 76.6%
Validation accuracy: 83.0%
Current learning rate: 0.005648291204124689
Minibatch loss at step 28000: 13.309738
Minibatch accuracy: 79.7%
Validation accuracy: 83.4%
Current learning rate: 0.004575116094201803
Minibatch loss at step 32000: 9.933225
Minibatch accuracy: 79.7%
Validation accuracy: 83.6%
Current learning rate: 0.003705843584612012
Minibatch loss at step 36000: 7.643488
Minibatch accuracy: 82.8%
Validation accuracy: 83.7%
Current learning rate: 0.003001732984557748
Minibatch loss at step 40000: 6.316326
Minibatch accuracy: 89.1%
Validation accuracy: 83.8%
Current learning rate: 0.002431403612717986
Minibatch loss at step 44000: 5.338342
Minibatch accuracy: 86.7%
Validation accuracy: 83.9%
Current learning rate: 0.0019694368820637465
Minibatch loss at step 48000: 4.847468
Minibatch accuracy: 84.4%
Validation accuracy: 83.9%
Current learning rate: 0.0015952438116073608
Minibatch loss at step 52000: 4.511444
Minibatch accuracy: 76.6%
Validation accuracy: 83.9%
Current learning rate: 0.0012921473244205117
Minibatch loss at step 56000: 4.054007
Minibatch accuracy: 82.0%
Validation accuracy: 83.9%
Current learning rate: 0.0010466392850503325
Minibatch loss at step 60000: 3.781799
Minibatch accuracy: 83.6%
Validation accuracy: 83.9%
Current learning rate: 0.0008477778756059706
Minibatch loss at step 64000: 3.625834
Minibatch accuracy: 83.6%
Validation accuracy: 84.0%
Current learning rate: 0.0006867000483907759
Test accuracy: 90.4%

Slightly worse performance than without dropout: test accuracy is 90.4%, compared with 91.4% for the same network without dropout.
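
For reference, the following NumPy sketch shows what a dropout layer does to a batch of activations: each unit is zeroed with probability `1 - keep_prob` and the survivors are scaled by `1 / keep_prob`, which is also how `tf.nn.dropout` behaves. The helper name `dropout` and the all-ones input are assumptions made for illustration only.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the survivors."""
    mask = rng.random_sample(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
hidden = np.ones((128, 1024), dtype=np.float32)  # stand-in for a hidden layer
dropped = dropout(hidden, keep_prob=0.5, rng=rng)

print("fraction of units kept:", (dropped > 0).mean())
print("mean activation after dropout:", dropped.mean())
```

Because of the rescaling, the expected activation is unchanged, which is why the validation and test graphs can simply omit the dropout layers.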

9. Let's add a 3rd hidden layer (relu without dropout)


In [32]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256
hidden_nodes3 = 64

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=0.1))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=0.1))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=0.1))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the output layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, num_labels], stddev=0.1))
    biases4 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    # keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    #hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)
    #hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2, weights3) + biases3)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer3, weights4) + biases4
    
    # logits_d for training with dropout
    #logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.9)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    logits_val = tf.matmul(hidden_layer3_val, weights4) + biases4
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    logits_test = tf.matmul(hidden_layer3_test, weights4) + biases4
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [33]:
num_steps = 64001
# dropout layer keep probability
# keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     #keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.


Initialized
Minibatch loss at step 0: 44.443871
Minibatch accuracy: 17.2%
Validation accuracy: 14.2%
Current learning rate: 0.0199989452958107
Minibatch loss at step 4000: 10.390105
Minibatch accuracy: 90.6%
Validation accuracy: 84.9%
Current learning rate: 0.01619914546608925
Minibatch loss at step 8000: 3.814507
Minibatch accuracy: 86.7%
Validation accuracy: 85.2%
Current learning rate: 0.013121306896209717
Minibatch loss at step 12000: 1.833274
Minibatch accuracy: 89.1%
Validation accuracy: 85.2%
Current learning rate: 0.010628257878124714
Minibatch loss at step 16000: 1.305994
Minibatch accuracy: 83.6%
Validation accuracy: 85.3%
Current learning rate: 0.00860888883471489
Minibatch loss at step 20000: 1.153313
Minibatch accuracy: 84.4%
Validation accuracy: 85.4%
Current learning rate: 0.006973199546337128
Minibatch loss at step 24000: 1.037861
Minibatch accuracy: 82.8%
Validation accuracy: 85.5%
Current learning rate: 0.005648291204124689
Minibatch loss at step 28000: 0.884387
Minibatch accuracy: 84.4%
Validation accuracy: 85.5%
Current learning rate: 0.004575116094201803
Minibatch loss at step 32000: 0.902294
Minibatch accuracy: 83.6%
Validation accuracy: 85.6%
Current learning rate: 0.003705843584612012
Minibatch loss at step 36000: 0.775964
Minibatch accuracy: 85.2%
Validation accuracy: 85.6%
Current learning rate: 0.003001732984557748
Minibatch loss at step 40000: 0.740765
Minibatch accuracy: 90.6%
Validation accuracy: 85.6%
Current learning rate: 0.002431403612717986
Minibatch loss at step 44000: 0.679652
Minibatch accuracy: 89.8%
Validation accuracy: 85.7%
Current learning rate: 0.0019694368820637465
Minibatch loss at step 48000: 0.892612
Minibatch accuracy: 80.5%
Validation accuracy: 85.7%
Current learning rate: 0.0015952438116073608
Minibatch loss at step 52000: 0.942879
Minibatch accuracy: 82.0%
Validation accuracy: 85.7%
Current learning rate: 0.0012921473244205117
Minibatch loss at step 56000: 0.888775
Minibatch accuracy: 82.0%
Validation accuracy: 85.7%
Current learning rate: 0.0010466392850503325
Minibatch loss at step 60000: 0.729433
Minibatch accuracy: 84.4%
Validation accuracy: 85.7%
Current learning rate: 0.0008477778756059706
Minibatch loss at step 64000: 0.745552
Minibatch accuracy: 85.9%
Validation accuracy: 85.7%
Current learning rate: 0.0006867000483907759
Test accuracy: 92.0%

We observe a small improvement: test accuracy rises to 92.0%, from 91.4% with two hidden layers and no dropout.
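
The notes in the training cell attribute the stable training to the smaller initial weights (`stddev=0.1`). The sketch below illustrates why the scale matters, under stated assumptions: it uses plain normal weights instead of truncated normal ones, random inputs in [-0.5, 0.5] instead of real images, and a hypothetical helper `initial_logit_scale`. It pushes one batch through an untrained ReLU network of the same layer sizes and reports the spread of the resulting logits.

```python
import numpy as np

def initial_logit_scale(stddev, layer_sizes=(784, 1024, 256, 64, 10), seed=0):
    """Standard deviation of the logits of an untrained ReLU network."""
    rng = np.random.RandomState(seed)
    # Assumption: random inputs stand in for a minibatch of flattened images.
    activations = rng.uniform(-0.5, 0.5, size=(128, layer_sizes[0]))
    n_weights = len(layer_sizes) - 1
    for i in range(n_weights):
        weights = rng.normal(0.0, stddev, size=(layer_sizes[i], layer_sizes[i + 1]))
        activations = activations.dot(weights)  # biases start at zero
        if i < n_weights - 1:
            activations = np.maximum(activations, 0.0)  # ReLU on hidden layers only
    return activations.std()

for stddev in (1.0, 0.1):
    print("stddev=%.1f -> logit scale %.3g" % (stddev, initial_logit_scale(stddev)))
```

With zero biases and ReLU activations, multiplying every weight matrix by a constant multiplies the logits by that constant once per matrix, so the default `stddev=1.0` produces far larger initial logits (and losses) than `stddev=0.1`.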

10. Let's use 4 hidden layers (relu without dropout).


In [35]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 512
hidden_nodes3 = 256
hidden_nodes4 = 64

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=0.1))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=0.1))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=0.1))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=0.1))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=0.1))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    # keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    #hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)
    #hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2, weights3) + biases3)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3, weights4) + biases4)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    #logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.9)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [36]:
num_steps = 48001
# dropout layer keep probability
#keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     #keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.


Initialized
Minibatch loss at step 0: 60.442921
Minibatch accuracy: 8.6%
Validation accuracy: 18.1%
Current learning rate: 0.0199989452958107
Minibatch loss at step 4000: 13.989778
Minibatch accuracy: 90.6%
Validation accuracy: 85.3%
Current learning rate: 0.01619914546608925
Minibatch loss at step 8000: 4.943952
Minibatch accuracy: 85.2%
Validation accuracy: 85.4%
Current learning rate: 0.013121306896209717
Minibatch loss at step 12000: 2.297653
Minibatch accuracy: 88.3%
Validation accuracy: 85.7%
Current learning rate: 0.010628257878124714
Minibatch loss at step 16000: 1.529235
Minibatch accuracy: 84.4%
Validation accuracy: 85.8%
Current learning rate: 0.00860888883471489
Minibatch loss at step 20000: 1.283257
Minibatch accuracy: 85.9%
Validation accuracy: 85.8%
Current learning rate: 0.006973199546337128
Minibatch loss at step 24000: 1.106153
Minibatch accuracy: 85.2%
Validation accuracy: 86.0%
Current learning rate: 0.005648291204124689
Minibatch loss at step 28000: 0.958892
Minibatch accuracy: 84.4%
Validation accuracy: 86.1%
Current learning rate: 0.004575116094201803
Minibatch loss at step 32000: 0.955353
Minibatch accuracy: 82.8%
Validation accuracy: 86.2%
Current learning rate: 0.003705843584612012
Minibatch loss at step 36000: 0.833034
Minibatch accuracy: 88.3%
Validation accuracy: 86.2%
Current learning rate: 0.003001732984557748
Minibatch loss at step 40000: 0.795019
Minibatch accuracy: 92.2%
Validation accuracy: 86.3%
Current learning rate: 0.002431403612717986
Minibatch loss at step 44000: 0.720634
Minibatch accuracy: 92.2%
Validation accuracy: 86.2%
Current learning rate: 0.0019694368820637465
Minibatch loss at step 48000: 0.925160
Minibatch accuracy: 81.2%
Validation accuracy: 86.3%
Current learning rate: 0.0015952438116073608
Test accuracy: 92.7%

Our best result so far: 92.7% test accuracy!
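
The training loops in this notebook pick each minibatch from a fixed sliding offset, and their comments note that better randomization across epochs is possible. One way to do that, sketched here with a hypothetical generator `shuffled_minibatches` that is not part of the original notebook, is to reshuffle the example indices at the start of every epoch.

```python
import numpy as np

def shuffled_minibatches(num_examples, batch_size, num_steps, seed=0):
    """Yield index arrays, reshuffling once the current epoch is exhausted."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(num_examples)
    position = 0
    for _ in range(num_steps):
        if position + batch_size > num_examples:
            # Not enough examples left for a full batch: start a new epoch.
            order = rng.permutation(num_examples)
            position = 0
        yield order[position:position + batch_size]
        position += batch_size

# Small demonstration: 10 examples, batches of 2, 7 steps.
batches = list(shuffled_minibatches(10, 2, 7))
print(batches)
```

In the training loop, each yielded index array `idx` would replace the offset slice, for example `batch_data = train_dataset[idx, :]` and `batch_labels = train_labels[idx, :]`. Every batch keeps the fixed size that the placeholders require; a leftover tail smaller than `batch_size` is skipped for that epoch.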


In [37]:
# let's use some different parameters

num_steps = 48001
# dropout layer keep probability
# keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.02
# learning rate (initial)
learning_rate_i = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     #keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.


Initialized
Minibatch loss at step 0: 117.590088
Minibatch accuracy: 11.7%
Validation accuracy: 11.3%
Current learning rate: 0.00999947264790535
Minibatch loss at step 4000: 27.615152
Minibatch accuracy: 89.8%
Validation accuracy: 84.0%
Current learning rate: 0.008099572733044624
Minibatch loss at step 8000: 9.370770
Minibatch accuracy: 85.9%
Validation accuracy: 84.0%
Current learning rate: 0.006560653448104858
Minibatch loss at step 12000: 4.192151
Minibatch accuracy: 86.7%
Validation accuracy: 83.8%
Current learning rate: 0.005314128939062357
Minibatch loss at step 16000: 2.545827
Minibatch accuracy: 82.0%
Validation accuracy: 83.8%
Current learning rate: 0.004304444417357445
Minibatch loss at step 20000: 1.998170
Minibatch accuracy: 82.0%
Validation accuracy: 83.8%
Current learning rate: 0.003486599773168564
Minibatch loss at step 24000: 1.671694
Minibatch accuracy: 79.7%
Validation accuracy: 83.8%
Current learning rate: 0.0028241456020623446
Minibatch loss at step 28000: 1.380485
Minibatch accuracy: 81.2%
Validation accuracy: 83.8%
Current learning rate: 0.0022875580471009016
Minibatch loss at step 32000: 1.363601
Minibatch accuracy: 80.5%
Validation accuracy: 83.7%
Current learning rate: 0.001852921792306006
Minibatch loss at step 36000: 1.179382
Minibatch accuracy: 85.2%
Validation accuracy: 83.8%
Current learning rate: 0.001500866492278874
Minibatch loss at step 40000: 1.117060
Minibatch accuracy: 89.8%
Validation accuracy: 83.7%
Current learning rate: 0.001215701806358993
Minibatch loss at step 44000: 1.071316
Minibatch accuracy: 88.3%
Validation accuracy: 83.7%
Current learning rate: 0.0009847184410318732
Minibatch loss at step 48000: 1.226735
Minibatch accuracy: 82.8%
Validation accuracy: 83.7%
Current learning rate: 0.0007976219058036804
Test accuracy: 90.4%

Higher regularisation and a reduced learning rate didn't help: validation accuracy plateaued around 83.8% and test accuracy ended at 90.4%.

11. Let's try 4 hidden layers with dropout
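
Before building the graph, here is a minimal NumPy sketch of what a dropout layer does during training. It assumes the "inverted dropout" behaviour of `tf.nn.dropout`: each unit is kept with probability `keep_prob` and the survivors are scaled by `1 / keep_prob`, so the expected activation is unchanged. The helper name `inverted_dropout` and the toy data are ours, for illustration only.

```python
import numpy as np

def inverted_dropout(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the survivors."""
    # Boolean mask: True where the unit is kept.
    mask = rng.random_sample(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected value of each unit unchanged.
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
hidden = np.ones((4, 1000), dtype=np.float32)  # toy "hidden layer" output
dropped = inverted_dropout(hidden, keep_prob=0.5, rng=rng)

# Every entry is either 0 (dropped) or 1 / keep_prob = 2 (kept and rescaled).
print(np.unique(dropped))
# The mean stays close to the original mean of 1.0.
print(dropped.mean())
```

With `keep_prob = 1.0` the mask keeps every unit and the layer becomes the identity, which is why evaluation graphs simply leave dropout out.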


In [38]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 512
hidden_nodes3 = 256
hidden_nodes4 = 64

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=0.1))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=0.1))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=0.1))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=0.1))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=0.1))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
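    # logits for prediction: only the final dropout is skipped here. hidden_layer4
    # is still built on hidden_layer3_d, so dropout in layers 1-3 stays active
    # whenever keep_prob < 1 is fed.
    logits = tf.matmul(hidden_layer4, weights5) + biases5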
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.9)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
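
The graph above adds `regconst` times the sum of `tf.nn.l2_loss` over all five weight matrices. The sketch below mirrors `tf.nn.l2_loss` in NumPy (`sum(w ** 2) / 2`) and estimates how large that penalty is at initialisation. The estimate assumes each weight has variance `stddev ** 2`; since `tf.truncated_normal` re-draws values beyond two standard deviations, the true variance is smaller, so treat the number as an upper estimate. The value `gamma = 0.01` is the one fed in the training cells.

```python
import numpy as np

def l2_loss(w):
    """NumPy equivalent of tf.nn.l2_loss: half the sum of squared entries."""
    return 0.5 * np.sum(np.square(w))

# Layer widths of the 4-hidden-layer network: input, 4 hidden layers, output.
layer_sizes = [28 * 28, 1024, 512, 256, 64, 10]
num_weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

gamma = 0.01   # regularisation constant fed through regconst
stddev = 0.1   # standard deviation used for weight initialisation

# E[sum(w ** 2) / 2] = num_weights * variance / 2 (upper estimate, see above).
penalty_upper = gamma * 0.5 * num_weights * stddev ** 2

print("number of weights:", num_weights)
print("estimated initial L2 penalty (upper estimate):", penalty_upper)
```

If this estimate is in the right range, most of the loss printed at step 0 comes from the L2 term rather than from the cross-entropy, which would explain why the reported loss falls so sharply as the weights shrink.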

In [39]:
num_steps = 64001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.


Initialized
Minibatch loss at step 0: 66.674530
Minibatch accuracy: 11.7%
Validation accuracy: 9.2%
Current learning rate: 0.0199989452958107
Minibatch loss at step 4000: 14.132818
Minibatch accuracy: 85.9%
Validation accuracy: 81.5%
Current learning rate: 0.01619914546608925
Minibatch loss at step 8000: 5.028029
Minibatch accuracy: 81.2%
Validation accuracy: 83.5%
Current learning rate: 0.013121306896209717
Minibatch loss at step 12000: 2.437177
Minibatch accuracy: 85.2%
Validation accuracy: 84.2%
Current learning rate: 0.010628257878124714
Minibatch loss at step 16000: 1.649951
Minibatch accuracy: 86.7%
Validation accuracy: 84.5%
Current learning rate: 0.00860888883471489
Minibatch loss at step 20000: 1.464722
Minibatch accuracy: 79.7%
Validation accuracy: 84.7%
Current learning rate: 0.006973199546337128
Minibatch loss at step 24000: 1.289010
Minibatch accuracy: 80.5%
Validation accuracy: 85.0%
Current learning rate: 0.005648291204124689
Minibatch loss at step 28000: 1.065984
Minibatch accuracy: 82.8%
Validation accuracy: 85.0%
Current learning rate: 0.004575116094201803
Minibatch loss at step 32000: 1.153860
Minibatch accuracy: 78.9%
Validation accuracy: 85.0%
Current learning rate: 0.003705843584612012
Minibatch loss at step 36000: 0.909501
Minibatch accuracy: 85.9%
Validation accuracy: 85.1%
Current learning rate: 0.003001732984557748
Minibatch loss at step 40000: 0.935680
Minibatch accuracy: 89.8%
Validation accuracy: 85.1%
Current learning rate: 0.002431403612717986
Minibatch loss at step 44000: 0.854572
Minibatch accuracy: 89.1%
Validation accuracy: 85.1%
Current learning rate: 0.0019694368820637465
Minibatch loss at step 48000: 1.036662
Minibatch accuracy: 83.6%
Validation accuracy: 85.2%
Current learning rate: 0.0015952438116073608
Minibatch loss at step 52000: 1.079605
Minibatch accuracy: 78.9%
Validation accuracy: 85.2%
Current learning rate: 0.0012921473244205117
Minibatch loss at step 56000: 1.115830
Minibatch accuracy: 82.0%
Validation accuracy: 85.2%
Current learning rate: 0.0010466392850503325
Minibatch loss at step 60000: 0.869583
Minibatch accuracy: 85.2%
Validation accuracy: 85.2%
Current learning rate: 0.0008477778756059706
Minibatch loss at step 64000: 0.939081
Minibatch accuracy: 82.0%
Validation accuracy: 85.2%
Current learning rate: 0.0006867000483907759
Test accuracy: 91.7%

In [40]:
num_steps = 64001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.05

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.


Initialized
Minibatch loss at step 0: 66.597153
Minibatch accuracy: 7.8%
Validation accuracy: 12.2%
Current learning rate: 0.04999736696481705
Minibatch loss at step 4000: 2.286793
Minibatch accuracy: 88.3%
Validation accuracy: 84.0%
Current learning rate: 0.04049786552786827
Minibatch loss at step 8000: 0.964728
Minibatch accuracy: 85.2%
Validation accuracy: 84.9%
Current learning rate: 0.03280326724052429
Minibatch loss at step 12000: 0.828323
Minibatch accuracy: 87.5%
Validation accuracy: 85.2%
Current learning rate: 0.026570646092295647
Minibatch loss at step 16000: 0.935294
Minibatch accuracy: 84.4%
Validation accuracy: 85.4%
Current learning rate: 0.021522222086787224
Minibatch loss at step 20000: 1.009734
Minibatch accuracy: 82.8%
Validation accuracy: 85.4%
Current learning rate: 0.01743300072848797
Minibatch loss at step 24000: 0.950870
Minibatch accuracy: 80.5%
Validation accuracy: 85.6%
Current learning rate: 0.014120727777481079
Minibatch loss at step 28000: 0.924990
Minibatch accuracy: 82.0%
Validation accuracy: 85.7%
Current learning rate: 0.011437790468335152
Minibatch loss at step 32000: 0.925228
Minibatch accuracy: 84.4%
Validation accuracy: 85.7%
Current learning rate: 0.009264609776437283
Minibatch loss at step 36000: 0.894122
Minibatch accuracy: 84.4%
Validation accuracy: 85.8%
Current learning rate: 0.007504332810640335
Minibatch loss at step 40000: 0.793890
Minibatch accuracy: 91.4%
Validation accuracy: 85.7%
Current learning rate: 0.006078509148210287
Minibatch loss at step 44000: 0.784852
Minibatch accuracy: 88.3%
Validation accuracy: 85.8%
Current learning rate: 0.004923592321574688
Minibatch loss at step 48000: 0.974941
Minibatch accuracy: 83.6%
Validation accuracy: 85.8%
Current learning rate: 0.003988109529018402
Minibatch loss at step 52000: 1.006029
Minibatch accuracy: 79.7%
Validation accuracy: 85.8%
Current learning rate: 0.003230368485674262
Minibatch loss at step 56000: 1.058559
Minibatch accuracy: 81.2%
Validation accuracy: 85.8%
Current learning rate: 0.0026165982708334923
Minibatch loss at step 60000: 0.806690
Minibatch accuracy: 86.7%
Validation accuracy: 85.9%
Current learning rate: 0.002119444776326418
Minibatch loss at step 64000: 0.869959
Minibatch accuracy: 83.6%
Validation accuracy: 85.9%
Current learning rate: 0.0017167500918731093
Test accuracy: 92.3%

Let's raise the exponential decay rate from 0.9 to 0.95, so that the learning rate decays more slowly, and train for longer.
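
To see what the decay rate does, here is a small sketch of the formula that `tf.train.exponential_decay` applies in its default (non-staircase) mode: `initial_rate * decay_rate ** (global_step / decay_steps)`. The function name and the comparison loop are ours. We assume the logged rate at loop index 4000 corresponds to `global_step = 4001`, because the optimizer has already incremented the counter when the rate is printed.

```python
def exponential_decay(initial_rate, global_step, decay_steps, decay_rate):
    """Continuous exponential decay of the learning rate."""
    return initial_rate * decay_rate ** (float(global_step) / decay_steps)

# Compare the two decay rates at a few points of the training run.
for decay_rate in (0.9, 0.95):
    for global_step in (4001, 64001):
        rate = exponential_decay(0.05, global_step, 2000, decay_rate)
        print("decay_rate=%.2f step=%d learning rate=%.6f"
              % (decay_rate, global_step, rate))
```

A decay rate closer to 1 keeps the learning rate higher late in training, so the later steps can still move the weights noticeably.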


In [41]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 512
hidden_nodes3 = 256
hidden_nodes4 = 64

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=0.1))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=0.1))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=0.1))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=0.1))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=0.1))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
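    # logits for prediction: only the final dropout is skipped here. hidden_layer4
    # is still built on hidden_layer3_d, so dropout in layers 1-3 stays active
    # whenever keep_prob < 1 is fed.
    logits = tf.matmul(hidden_layer4, weights5) + biases5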
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.95)
    
    optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [42]:
num_steps = 80001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial)
learning_rate_i = 0.05

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.


Initialized
Minibatch loss at step 0: 67.313019
Minibatch accuracy: 10.9%
Validation accuracy: 10.9%
Current learning rate: 0.04999871551990509
Minibatch loss at step 4000: 1.952044
Minibatch accuracy: 89.8%
Validation accuracy: 84.2%
Current learning rate: 0.04512384161353111
Minibatch loss at step 8000: 0.993130
Minibatch accuracy: 85.9%
Validation accuracy: 84.8%
Current learning rate: 0.040724266320466995
Minibatch loss at step 12000: 0.793567
Minibatch accuracy: 88.3%
Validation accuracy: 85.3%
Current learning rate: 0.03675365075469017
Minibatch loss at step 16000: 0.946355
Minibatch accuracy: 85.2%
Validation accuracy: 85.5%
Current learning rate: 0.03317016735672951
Minibatch loss at step 20000: 1.065387
Minibatch accuracy: 82.0%
Validation accuracy: 85.5%
Current learning rate: 0.02993607521057129
Minibatch loss at step 24000: 0.906525
Minibatch accuracy: 84.4%
Validation accuracy: 85.6%
Current learning rate: 0.02701730839908123
Minibatch loss at step 28000: 0.952807
Minibatch accuracy: 81.2%
Validation accuracy: 85.6%
Current learning rate: 0.024383118376135826
Minibatch loss at step 32000: 0.955357
Minibatch accuracy: 80.5%
Validation accuracy: 85.8%
Current learning rate: 0.02200576476752758
Minibatch loss at step 36000: 0.833998
Minibatch accuracy: 86.7%
Validation accuracy: 85.7%
Current learning rate: 0.019860202446579933
Minibatch loss at step 40000: 0.876558
Minibatch accuracy: 89.1%
Validation accuracy: 85.9%
Current learning rate: 0.017923833802342415
Minibatch loss at step 44000: 0.740768
Minibatch accuracy: 88.3%
Validation accuracy: 86.0%
Current learning rate: 0.016176259145140648
Minibatch loss at step 48000: 0.963341
Minibatch accuracy: 83.6%
Validation accuracy: 86.0%
Current learning rate: 0.014599071815609932
Minibatch loss at step 52000: 0.974737
Minibatch accuracy: 84.4%
Validation accuracy: 86.0%
Current learning rate: 0.013175661675632
Minibatch loss at step 56000: 1.017659
Minibatch accuracy: 78.9%
Validation accuracy: 86.1%
Current learning rate: 0.011891034431755543
Minibatch loss at step 60000: 0.830248
Minibatch accuracy: 84.4%
Validation accuracy: 86.1%
Current learning rate: 0.010731658898293972
Minibatch loss at step 64000: 0.865245
Minibatch accuracy: 85.9%
Validation accuracy: 86.1%
Current learning rate: 0.00968532171100378
Minibatch loss at step 68000: 0.696237
Minibatch accuracy: 91.4%
Validation accuracy: 86.2%
Current learning rate: 0.008741003461182117
Minibatch loss at step 72000: 0.843322
Minibatch accuracy: 85.9%
Validation accuracy: 86.1%
Current learning rate: 0.007888754829764366
Minibatch loss at step 76000: 0.917027
Minibatch accuracy: 87.5%
Validation accuracy: 86.2%
Current learning rate: 0.007119601126760244
Minibatch loss at step 80000: 0.878364
Minibatch accuracy: 85.2%
Validation accuracy: 86.2%
Current learning rate: 0.006425440311431885
Test accuracy: 92.5%

Further training barely improved accuracy (test accuracy 92.5% versus 92.3% before), but it is on par with our best results!

12. Let's try momentum optimiser
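
As a reminder of what momentum changes, here is a plain-Python sketch of a single Nesterov momentum update. It follows our understanding of the rule behind `tf.train.MomentumOptimizer(..., use_nesterov=True)`: the accumulator is updated as `accum = momentum * accum + grad`, and the variable then moves by `lr * grad + lr * momentum * accum`. Treat the exact form as an assumption. The function name and the toy objective `f(x) = x ** 2` are ours.

```python
def nesterov_momentum_step(var, accum, grad, lr, momentum):
    """One Nesterov momentum update; returns the new variable and accumulator."""
    accum = momentum * accum + grad
    var = var - lr * grad - lr * momentum * accum
    return var, accum

# Toy problem: minimise f(x) = x ** 2, whose gradient is 2 * x.
x, accum = 1.0, 0.0
for _ in range(200):
    grad = 2.0 * x
    x, accum = nesterov_momentum_step(x, accum, grad, lr=0.1, momentum=0.9)

print("x after 200 steps:", x)
```

Because the accumulator sums past gradients, the effective step can be much larger than `lr * grad`, which is one reason to start with a small learning rate when momentum is 0.9.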


In [43]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 512
hidden_nodes3 = 256
hidden_nodes4 = 64

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=0.1))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=0.1))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=0.1))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=0.1))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=0.1))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
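    # logits for prediction: only the final dropout is skipped here. hidden_layer4
    # is still built on hidden_layer3_d, so dropout in layers 1-3 stays active
    # whenever keep_prob < 1 is fed.
    logits = tf.matmul(hidden_layer4, weights5) + biases5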
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    #flrate = tf.train.exponential_decay(ilrate, gstep, 2000, 0.95)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(ilrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
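
The training cell for this graph does not use `tf.train.exponential_decay`. Instead it feeds a hand-shaped learning rate, `0.01 * (step / 2000) ** 2 * exp(-step / 7000)`: a quadratic warm-up from zero followed by an exponential decay. The sketch below evaluates that schedule in NumPy to show where it peaks. The function name and its keyword arguments are ours, for illustration.

```python
import numpy as np

def warmup_decay_rate(step, scale=0.01, warmup=2000.0, decay=7000.0):
    """Quadratic warm-up multiplied by an exponential decay."""
    return scale * (step / warmup) ** 2 * np.exp(-step / decay)

steps = np.arange(0, 80001)
rates = warmup_decay_rate(steps)

# Setting the derivative to zero gives the peak at step = 2 * decay.
peak_step = int(steps[np.argmax(rates)])
print("peak step:", peak_step)
print("peak learning rate:", rates.max())
print("learning rate at step 4000:", warmup_decay_rate(4000))
```

The schedule starts at exactly zero, so the very first update leaves the weights untouched, and the rate keeps growing until step 14000 before it starts to decay.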

In [44]:
num_steps = 80001
# dropout layer keep probability
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.01
# learning rate (initial) - calculate within loop
#learning_rate_i = 0.5

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
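        # Learning-rate schedule: quadratic warm-up followed by exponential decay.
        # Shape pre-plotted on Wolfram Alpha (the plot uses a 0.1 prefactor, the code uses 0.01):
        # https://www.wolframalpha.com/input/?i=plot+(0.1)*(x%2F2000)%5E2*e%5E(-x%2F7000)+%7Bx,0,80000%7D
        # Float literals keep true division even without `from __future__ import division`.
        learning_rate_i = 0.01 * ((step / 2000.0) ** 2) * np.exp(-step / 7000.0)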
        #print(learning_rate_i)
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(ilrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.


Initialized
Minibatch loss at step 0: 68.113480
Minibatch accuracy: 3.9%
Validation accuracy: 9.3%
Current learning rate: 0.0
Minibatch loss at step 4000: 0.826690
Minibatch accuracy: 87.5%
Validation accuracy: 83.5%
Current learning rate: 0.02258872427046299
Minibatch loss at step 8000: 1.024994
Minibatch accuracy: 85.2%
Validation accuracy: 82.2%
Current learning rate: 0.05102504789829254
Minibatch loss at step 12000: 1.040810
Minibatch accuracy: 82.8%
Validation accuracy: 82.8%
Current learning rate: 0.06483323127031326
Minibatch loss at step 16000: 1.059431
Minibatch accuracy: 83.6%
Validation accuracy: 83.2%
Current learning rate: 0.06508889049291611
Minibatch loss at step 20000: 1.293144
Minibatch accuracy: 78.1%
Validation accuracy: 82.8%
Current learning rate: 0.057432617992162704
Minibatch loss at step 24000: 1.186287
Minibatch accuracy: 78.9%
Validation accuracy: 82.8%
Current learning rate: 0.04670386761426926
Minibatch loss at step 28000: 0.968586
Minibatch accuracy: 81.2%
Validation accuracy: 83.8%
Current learning rate: 0.03589865192770958
Minibatch loss at step 32000: 1.052121
Minibatch accuracy: 82.0%
Validation accuracy: 84.1%
Current learning rate: 0.026478523388504982
Minibatch loss at step 36000: 0.910155
Minibatch accuracy: 83.6%
Validation accuracy: 84.4%
Current learning rate: 0.018924767151474953
Minibatch loss at step 40000: 0.841862
Minibatch accuracy: 87.5%
Validation accuracy: 84.6%
Current learning rate: 0.013194022700190544
Minibatch loss at step 44000: 0.786193
Minibatch accuracy: 86.7%
Validation accuracy: 85.0%
Current learning rate: 0.00901559367775917
Minibatch loss at step 48000: 1.030138
Minibatch accuracy: 82.8%
Validation accuracy: 85.3%
Current learning rate: 0.006059031002223492
Minibatch loss at step 52000: 1.029774
Minibatch accuracy: 81.2%
Validation accuracy: 85.3%
Current learning rate: 0.004015680402517319
Minibatch loss at step 56000: 1.064219
Minibatch accuracy: 76.6%
Validation accuracy: 85.6%
Current learning rate: 0.002630027011036873
Minibatch loss at step 60000: 0.802562
Minibatch accuracy: 83.6%
Validation accuracy: 85.7%
Current learning rate: 0.0017049764283001423
Minibatch loss at step 64000: 0.830660
Minibatch accuracy: 83.6%
Validation accuracy: 85.8%
Current learning rate: 0.0010954878525808454
Minibatch loss at step 68000: 0.631111
Minibatch accuracy: 94.5%
Validation accuracy: 85.8%
Current learning rate: 0.0006983886123634875
Minibatch loss at step 72000: 0.839468
Minibatch accuracy: 87.5%
Validation accuracy: 85.8%
Current learning rate: 0.00044215653906576335
Minibatch loss at step 76000: 0.882892
Minibatch accuracy: 86.7%
Validation accuracy: 85.8%
Current learning rate: 0.00027820823015645146
Minibatch loss at step 80000: 0.877599
Minibatch accuracy: 83.6%
Validation accuracy: 85.8%
Current learning rate: 0.00017408224812243134
Test accuracy: 92.3%

Good performance, close to our best results!
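
The schedule fed through `ilrate` in the run above is 0.01 * (step/2000)^2 * exp(-step/7000): it starts at zero, ramps up, and then decays. Setting its derivative to zero puts the peak at step = 2 * 7000 = 14000, which agrees with the logged rates being highest between steps 12000 and 16000. A minimal NumPy sketch of this (the helper name `warmup_decay_lr` and its parameter names are made up for illustration):

```python
import numpy as np

def warmup_decay_lr(step, scale=0.01, ramp=2000.0, tau=7000.0):
    """Schedule from the loop above: scale * (step / ramp)^2 * exp(-step / tau)."""
    step = np.asarray(step, dtype=np.float64)
    return scale * (step / ramp) ** 2 * np.exp(-step / tau)

steps = np.arange(0, 80001)
rates = warmup_decay_lr(steps)
peak_step = int(steps[np.argmax(rates)])

print(warmup_decay_lr(4000))    # compare with the rate logged at step 4000
print(peak_step, rates.max())   # location and height of the peak
```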

13 Let's try an external example:


In [45]:
# taken from
# https://github.com/rndbrtrnd/udacity-deep-learning/blob/master/3_regularization.ipynb

batch_size = 128
num_hidden_nodes1 = 1024
num_hidden_nodes2 = 100
beta_regul = 1e-3

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    global_step = tf.Variable(0)

    # Variables.
    weights1 = tf.Variable(
        tf.truncated_normal(
            [image_size * image_size, num_hidden_nodes1],
            stddev=np.sqrt(2.0 / (image_size * image_size)))
    )
    biases1 = tf.Variable(tf.zeros([num_hidden_nodes1]))
    weights2 = tf.Variable(
        tf.truncated_normal([num_hidden_nodes1, num_hidden_nodes2], stddev=np.sqrt(2.0 / num_hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([num_hidden_nodes2]))
    weights3 = tf.Variable(
        tf.truncated_normal([num_hidden_nodes2, num_labels], stddev=np.sqrt(2.0 / num_hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([num_labels]))
  
    # Training computation.
    lay1_train = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    lay2_train = tf.nn.relu(tf.matmul(lay1_train, weights2) + biases2)
    logits = tf.matmul(lay2_train, weights3) + biases3
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels)) + \
            beta_regul * (tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3))
  
    # Optimizer.
    learning_rate = tf.train.exponential_decay(0.5, global_step, 1000, 0.65, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    lay1_valid = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    lay2_valid = tf.nn.relu(tf.matmul(lay1_valid, weights2) + biases2)
    valid_prediction = tf.nn.softmax(tf.matmul(lay2_valid, weights3) + biases3)
    lay1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    lay2_test = tf.nn.relu(tf.matmul(lay1_test, weights2) + biases2)
    test_prediction = tf.nn.softmax(tf.matmul(lay2_test, weights3) + biases3)

In [46]:
num_steps = 9001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 3.394331
Minibatch accuracy: 10.9%
Validation accuracy: 27.9%
Minibatch loss at step 500: 0.964337
Minibatch accuracy: 86.7%
Validation accuracy: 86.0%
Minibatch loss at step 1000: 0.785412
Minibatch accuracy: 89.1%
Validation accuracy: 86.8%
Minibatch loss at step 1500: 0.629551
Minibatch accuracy: 92.2%
Validation accuracy: 87.3%
Minibatch loss at step 2000: 0.723625
Minibatch accuracy: 87.5%
Validation accuracy: 88.0%
Minibatch loss at step 2500: 0.590059
Minibatch accuracy: 89.1%
Validation accuracy: 88.5%
Minibatch loss at step 3000: 0.549629
Minibatch accuracy: 88.3%
Validation accuracy: 88.9%
Minibatch loss at step 3500: 0.565709
Minibatch accuracy: 88.3%
Validation accuracy: 89.3%
Minibatch loss at step 4000: 0.365797
Minibatch accuracy: 94.5%
Validation accuracy: 89.5%
Minibatch loss at step 4500: 0.428408
Minibatch accuracy: 92.2%
Validation accuracy: 89.7%
Minibatch loss at step 5000: 0.529648
Minibatch accuracy: 87.5%
Validation accuracy: 89.9%
Minibatch loss at step 5500: 0.472851
Minibatch accuracy: 90.6%
Validation accuracy: 89.8%
Minibatch loss at step 6000: 0.419778
Minibatch accuracy: 93.0%
Validation accuracy: 89.9%
Minibatch loss at step 6500: 0.423195
Minibatch accuracy: 89.8%
Validation accuracy: 90.1%
Minibatch loss at step 7000: 0.486998
Minibatch accuracy: 89.1%
Validation accuracy: 90.0%
Minibatch loss at step 7500: 0.518828
Minibatch accuracy: 89.8%
Validation accuracy: 90.2%
Minibatch loss at step 8000: 0.426836
Minibatch accuracy: 93.0%
Validation accuracy: 90.1%
Minibatch loss at step 8500: 0.308639
Minibatch accuracy: 95.3%
Validation accuracy: 90.2%
Minibatch loss at step 9000: 0.499433
Minibatch accuracy: 87.5%
Validation accuracy: 90.3%
Test accuracy: 95.5%

This accuracy beats anything we have done up to now, with a lot less complexity!

  • Let's try to replicate it!
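
One thing the external example does is initialise every weight matrix with `stddev=sqrt(2 / fan_in)`, which is commonly called He initialisation. The sketch below is only an illustration in NumPy, not the notebook's TensorFlow code: it uses an untruncated normal instead of `tf.truncated_normal`, random inputs instead of notMNIST images, and a made-up helper name `he_normal`. It shows that with this scaling the mean squared activation stays on the same order from one ReLU layer to the next.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    """Gaussian weights with stddev sqrt(2 / fan_in) (untruncated, for simplicity)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

x = rng.normal(size=(256, 784))                   # stand-in for a batch of flat images
h1 = np.maximum(x @ he_normal(784, 1024), 0.0)    # first ReLU layer
h2 = np.maximum(h1 @ he_normal(1024, 100), 0.0)   # second ReLU layer

# Mean squared activation per layer; the values should be of similar size.
print(np.mean(x ** 2), np.mean(h1 ** 2), np.mean(h2 ** 2))
```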

14 Let's rebuild a network with 2 hidden layers.


In [47]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 128

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1], stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.random_normal([hidden_nodes1, hidden_nodes2], stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the output layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, num_labels]))
    biases3 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    #hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, weights2) + biases2)
    #hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    # logits for prediction
    logits = tf.matmul(hidden_layer2, weights3) + biases3
    
    # logits_d for training with dropout
    #logits_d = tf.matmul(hidden_layer2_d, weights3) + biases3
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2) + tf.nn.l2_loss(weights3))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 7000, 0.65)
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)
    
    #optimizer = tf.train.GradientDescentOptimizer(flrate).minimize(
        #loss, global_step=gstep)
    

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    logits_val = tf.matmul(hidden_layer2_val, weights3) + biases3
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    logits_test = tf.matmul(hidden_layer2_test, weights3) + biases3
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [48]:
num_steps = 20001
# dropout layer keep probability - not used in this computation
keep_probl = 0.5 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.001
# learning rate (initial)
learning_rate_i = 0.02

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 2000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.


Initialized
Minibatch loss at step 0: 29.565529
Minibatch accuracy: 6.2%
Validation accuracy: 36.8%
Current learning rate: 0.019998770207166672
Minibatch loss at step 2000: 1.604423
Minibatch accuracy: 82.8%
Validation accuracy: 86.4%
Current learning rate: 0.0176827535033226
Minibatch loss at step 4000: 0.811861
Minibatch accuracy: 93.0%
Validation accuracy: 88.0%
Current learning rate: 0.01563495211303234
Minibatch loss at step 6000: 0.670658
Minibatch accuracy: 93.8%
Validation accuracy: 88.8%
Current learning rate: 0.013824302703142166
Minibatch loss at step 8000: 0.567102
Minibatch accuracy: 92.2%
Validation accuracy: 89.6%
Current learning rate: 0.012223341502249241
Minibatch loss at step 10000: 0.489442
Minibatch accuracy: 92.2%
Validation accuracy: 89.7%
Current learning rate: 0.010807783342897892
Minibatch loss at step 12000: 0.352940
Minibatch accuracy: 94.5%
Validation accuracy: 89.9%
Current learning rate: 0.00955615658313036
Minibatch loss at step 14000: 0.478857
Minibatch accuracy: 89.1%
Validation accuracy: 90.2%
Current learning rate: 0.008449479937553406
Minibatch loss at step 16000: 0.422401
Minibatch accuracy: 92.2%
Validation accuracy: 90.5%
Current learning rate: 0.007470963057130575
Minibatch loss at step 18000: 0.406087
Minibatch accuracy: 90.6%
Validation accuracy: 90.6%
Current learning rate: 0.006605768110603094
Minibatch loss at step 20000: 0.441311
Minibatch accuracy: 89.8%
Validation accuracy: 90.8%
Current learning rate: 0.00584076764062047
Test accuracy: 95.9%

It appears our problem was a regularisation constant that was too large!
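
To get a feel for the size of the term that `regconst` multiplies, recall that `tf.nn.l2_loss(w)` is `sum(w ** 2) / 2`. The sketch below is an assumption-laden illustration: it uses freshly initialised stand-in weights with stddev `sqrt(2 / fan_in)` rather than the trained variables, and the constant 1e-2 is a hypothetical value added for comparison (1e-3 and 1e-5 are the values fed as `gamma` in this notebook).

```python
import numpy as np

def l2_loss(w):
    """Same quantity as tf.nn.l2_loss: sum(w ** 2) / 2."""
    return np.sum(np.square(w)) / 2.0

rng = np.random.default_rng(0)
# Stand-ins for weights1..weights3 of the 784-1024-128-10 network above.
shapes = [(784, 1024), (1024, 128), (128, 10)]
weights = [rng.normal(0.0, np.sqrt(2.0 / s[0]), size=s) for s in shapes]

total = sum(l2_loss(w) for w in weights)
for gamma in (1e-2, 1e-3, 1e-5):
    # Size of the penalty that would be added to the cross-entropy term.
    print(gamma, gamma * total)
```

At initialisation the cross-entropy for 10 classes is on the order of ln(10), about 2.3, so the printed penalties give a rough sense of when regularisation starts to dominate the loss.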

The momentum optimiser also helps!
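
Why momentum can help is easiest to see on a toy problem. The sketch below is not the TensorFlow kernel behind `tf.train.MomentumOptimizer`; it is one common way of writing a Nesterov-style update, applied to a badly scaled quadratic. The function name `nesterov_step`, the curvature values and the learning rate are all made up for illustration.

```python
import numpy as np

def nesterov_step(w, velocity, grad, lr, momentum=0.9):
    """One Nesterov-style momentum update (a common formulation, assumed here)."""
    velocity = momentum * velocity + grad
    w = w - lr * (grad + momentum * velocity)
    return w, velocity

# Toy problem: minimise f(w) = 0.5 * sum(c * w ** 2); the gradient is c * w.
c = np.array([1.0, 25.0])          # very different curvature per coordinate
w_sgd = np.array([1.0, 1.0])
w_mom = w_sgd.copy()
v = np.zeros_like(w_mom)
lr = 0.02

for _ in range(100):
    w_sgd = w_sgd - lr * (c * w_sgd)                    # plain gradient descent
    w_mom, v = nesterov_step(w_mom, v, c * w_mom, lr)   # with momentum

# Distance from the optimum (w = 0) after the same number of steps.
print(np.abs(w_sgd).max(), np.abs(w_mom).max())
```

With the same learning rate, the momentum run ends closer to the optimum because the accumulated velocity speeds up progress along the flat direction.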

15 Let's go back to 4 hidden layers


In [49]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256
hidden_nodes3 = 64
hidden_nodes4 = 16

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
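    # logits for the minibatch prediction: no dropout on layer 4,
    # but dropout on layers 1-3 is still applied through hidden_layer3_d.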
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [50]:
num_steps = 40001
# dropout layer keep probability
keep_probl = 0.8 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.00001
# learning rate (initial); the decay is applied inside the graph by exponential_decay
learning_rate_i = 0.02

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
    ## Notes:
    # hidden layers cause instability in gradient backpropagation!
    # The solution is a combination of appropriate initialisation and learning rate.
    # Here the standard deviation of weights initialisation was reduced.


Initialized
Minibatch loss at step 0: 2.419878
Minibatch accuracy: 4.7%
Validation accuracy: 11.1%
Current learning rate: 0.019999278709292412
Minibatch loss at step 4000: 0.288888
Minibatch accuracy: 93.0%
Validation accuracy: 88.3%
Current learning rate: 0.01731988601386547
Minibatch loss at step 8000: 0.410929
Minibatch accuracy: 89.8%
Validation accuracy: 89.5%
Current learning rate: 0.014999459497630596
Minibatch loss at step 12000: 0.259949
Minibatch accuracy: 93.8%
Validation accuracy: 90.2%
Current learning rate: 0.012989913113415241
Minibatch loss at step 16000: 0.369395
Minibatch accuracy: 89.8%
Validation accuracy: 90.6%
Current learning rate: 0.011249594390392303
Minibatch loss at step 20000: 0.356018
Minibatch accuracy: 92.2%
Validation accuracy: 90.9%
Current learning rate: 0.009742435067892075
Minibatch loss at step 24000: 0.318570
Minibatch accuracy: 93.8%
Validation accuracy: 91.3%
Current learning rate: 0.008437196724116802
Minibatch loss at step 28000: 0.325883
Minibatch accuracy: 93.0%
Validation accuracy: 91.4%
Current learning rate: 0.007306825835257769
Minibatch loss at step 32000: 0.338675
Minibatch accuracy: 92.2%
Validation accuracy: 91.5%
Current learning rate: 0.006327897775918245
Minibatch loss at step 36000: 0.247116
Minibatch accuracy: 93.0%
Validation accuracy: 91.7%
Current learning rate: 0.005480119958519936
Minibatch loss at step 40000: 0.326437
Minibatch accuracy: 93.0%
Validation accuracy: 91.7%
Current learning rate: 0.0047459229826927185
Test accuracy: 96.7%

Our best result so far!
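
The network above applies `tf.nn.dropout` after every hidden layer during training, while the validation and test graphs skip it. `tf.nn.dropout` zeroes each unit with probability `1 - keep_prob` and scales the survivors by `1 / keep_prob`, so the expected activation is unchanged and no rescaling is needed at evaluation time. A minimal NumPy sketch of that behaviour (the function and the all-ones activation are stand-ins, not the notebook's code):

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: drop units with probability 1 - keep_prob and
    scale the kept units by 1 / keep_prob."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
h = np.ones((1000, 64))                    # stand-in for a hidden activation
h_d = dropout(h, keep_prob=0.8, rng=rng)

print(np.unique(h_d))   # dropped units are 0, kept units are scaled up
print(h_d.mean())       # stays close to the mean of h
```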

Let's experiment some more!


In [51]:
batch_size = 128
hidden_nodes1 = 1024
hidden_nodes2 = 256
hidden_nodes3 = 64
hidden_nodes4 = 16

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3], stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, hidden_nodes4], stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
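    # logits for the minibatch prediction: no dropout on layer 4,
    # but dropout on layers 1-3 is still applied through hidden_layer3_d.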
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)

In [52]:
num_steps = 40001
# dropout layer keep probability
keep_probl = 0.9 # cannot have the same name as graph variable!
# regularisation constant
gamma = 0.000001
# learning rate (initial); the decay is applied inside the graph by exponential_decay
learning_rate_i = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 2.325851
Minibatch accuracy: 11.7%
Validation accuracy: 10.1%
Current learning rate: 0.009999639354646206
Minibatch loss at step 4000: 0.263083
Minibatch accuracy: 92.2%
Validation accuracy: 88.5%
Current learning rate: 0.008659943006932735
Minibatch loss at step 8000: 0.302504
Minibatch accuracy: 90.6%
Validation accuracy: 89.9%
Current learning rate: 0.007499729748815298
Minibatch loss at step 12000: 0.249020
Minibatch accuracy: 93.0%
Validation accuracy: 90.4%
Current learning rate: 0.006494956556707621
Minibatch loss at step 16000: 0.321804
Minibatch accuracy: 92.2%
Validation accuracy: 90.8%
Current learning rate: 0.005624797195196152
Minibatch loss at step 20000: 0.217675
Minibatch accuracy: 93.0%
Validation accuracy: 91.3%
Current learning rate: 0.004871217533946037
Minibatch loss at step 24000: 0.263802
Minibatch accuracy: 93.0%
Validation accuracy: 91.4%
Current learning rate: 0.004218598362058401
Minibatch loss at step 28000: 0.232229
Minibatch accuracy: 94.5%
Validation accuracy: 91.7%
Current learning rate: 0.0036534129176288843
Minibatch loss at step 32000: 0.290526
Minibatch accuracy: 91.4%
Validation accuracy: 91.8%
Current learning rate: 0.0031639488879591227
Minibatch loss at step 36000: 0.177739
Minibatch accuracy: 93.8%
Validation accuracy: 92.0%
Current learning rate: 0.002740059979259968
Minibatch loss at step 40000: 0.245655
Minibatch accuracy: 94.5%
Validation accuracy: 92.1%
Current learning rate: 0.0023729614913463593
Test accuracy: 96.7%

A new record in our accuracy scores!


In [53]:
batch_size = 128
hidden_nodes1 = 4096
hidden_nodes2 = 1024
hidden_nodes3 = 256
hidden_nodes4 = 64

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2],
                            stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3],
                            stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, hidden_nodes4],
                            stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for the minibatch prediction: the last hidden layer is not dropped
    # out here, but layers 1-3 still are, since hidden_layer4 builds on hidden_layer3_d.
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training, with dropout applied after every hidden layer
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: no activation function (relu) is applied to the logits;
    # softmax is applied by the loss and by the prediction ops.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add L2 regularisation for all weight matrices (biases are not penalised).
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
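    # Optimizer - with an exponentially decaying learning rate.
    gstep = tf.Variable(0, trainable=False)  # steps taken; incremented by minimize()
    ilrate = tf.placeholder(tf.float32)  # initial learning rate, fed at each training step
    # The rate is multiplied by 0.75 for every 8000 steps (smooth decay, not staircase).
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)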
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
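
The `stddev=np.sqrt(2.0 / fan_in)` used for every weight matrix above is the He-style scaling for ReLU layers. As a rough illustration only (a NumPy sketch outside the notebook's graph; it draws from an untruncated normal rather than `tf.truncated_normal`, and the layer sizes, the fixed `0.1` comparison and all names are made up for the demo), the mean squared activation stays close to its input value under this scaling, whereas a fixed standard deviation lets it grow layer by layer:

```python
import numpy as np

# Hypothetical demo sizes, not the notebook's architecture.
layer_sizes = [784, 1024, 1024, 1024]

def mean_square_per_layer(stddev_for):
    """Push random inputs through ReLU layers and record mean(activation ** 2)."""
    rng = np.random.default_rng(0)
    a = rng.normal(size=(128, layer_sizes[0]))
    out = [float(np.mean(a ** 2))]
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Assumption: plain normal draw instead of a truncated normal.
        w = rng.normal(0.0, stddev_for(fan_in), size=(fan_in, fan_out))
        a = np.maximum(a @ w, 0.0)  # ReLU, biases left at zero
        out.append(float(np.mean(a ** 2)))
    return out

he = mean_square_per_layer(lambda fan_in: np.sqrt(2.0 / fan_in))
fixed = mean_square_per_layer(lambda fan_in: 0.1)

print("sqrt(2 / fan_in):", he)
print("fixed 0.1       :", fixed)
```

With the fan-in scaling the signal neither vanishes nor blows up as layers are stacked, which is presumably why the same rule is applied to all five weight matrices.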

In [54]:
num_steps = 40001
# dropout keep probability; named keep_probl so that it does not
# overwrite the keep_prob placeholder defined in the graph
keep_probl = 0.9
# L2 regularisation constant
gamma = 0.000001
# initial learning rate; the decayed rate is computed in the graph from gstep
learning_rate_i = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 2.349177
Minibatch accuracy: 14.1%
Validation accuracy: 18.9%
Current learning rate: 0.009999639354646206
Minibatch loss at step 4000: 0.206964
Minibatch accuracy: 93.8%
Validation accuracy: 89.5%
Current learning rate: 0.008659943006932735
Minibatch loss at step 8000: 0.255833
Minibatch accuracy: 91.4%
Validation accuracy: 90.8%
Current learning rate: 0.007499729748815298
Minibatch loss at step 12000: 0.161104
Minibatch accuracy: 96.1%
Validation accuracy: 91.5%
Current learning rate: 0.006494956556707621
Minibatch loss at step 16000: 0.244641
Minibatch accuracy: 94.5%
Validation accuracy: 92.0%
Current learning rate: 0.005624797195196152
Minibatch loss at step 20000: 0.176510
Minibatch accuracy: 93.8%
Validation accuracy: 92.2%
Current learning rate: 0.004871217533946037
Minibatch loss at step 24000: 0.187588
Minibatch accuracy: 93.8%
Validation accuracy: 92.4%
Current learning rate: 0.004218598362058401
Minibatch loss at step 28000: 0.158503
Minibatch accuracy: 95.3%
Validation accuracy: 92.5%
Current learning rate: 0.0036534129176288843
Minibatch loss at step 32000: 0.180381
Minibatch accuracy: 96.1%
Validation accuracy: 92.6%
Current learning rate: 0.0031639488879591227
Minibatch loss at step 36000: 0.136327
Minibatch accuracy: 96.1%
Validation accuracy: 92.7%
Current learning rate: 0.002740059979259968
Minibatch loss at step 40000: 0.115021
Minibatch accuracy: 96.9%
Validation accuracy: 92.8%
Current learning rate: 0.0023729614913463593
Test accuracy: 97.4%

Our best result yet: 97.4% test accuracy, up from 96.7% in the previous run.
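
The "Current learning rate" values in the log can be reproduced by hand. `tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)` without the staircase option computes `ilrate * 0.75 ** (gstep / 8000)`. A minimal pure-Python sketch follows; the function name `decayed_rate` is ours, and the `step + 1` reflects that the rate is printed after the optimizer has already incremented `gstep`:

```python
def decayed_rate(initial_rate, global_step, decay_steps=8000, decay_rate=0.75):
    # Smooth (non-staircase) exponential decay.
    return initial_rate * decay_rate ** (global_step / float(decay_steps))

# The rate is evaluated after the training op has run, so the step counter
# is one ahead of the loop index at the time of printing.
for step in (0, 4000, 8000, 40000):
    print("step %d: %.10f" % (step, decayed_rate(0.01, step + 1)))
```

The small differences in the last digits against the log come from TensorFlow evaluating the expression in float32.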

Let's experiment some more ...


In [55]:
batch_size = 128
hidden_nodes1 = 784
hidden_nodes2 = 1568
hidden_nodes3 = 500
hidden_nodes4 = 50

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2],
                            stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3],
                            stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, hidden_nodes4],
                            stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for the minibatch prediction: the last hidden layer is not dropped
    # out here, but layers 1-3 still are, since hidden_layer4 builds on hidden_layer3_d.
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training, with dropout applied after every hidden layer
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: no activation function (relu) is applied to the logits;
    # softmax is applied by the loss and by the prediction ops.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add L2 regularisation for all weight matrices (biases are not penalised).
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
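    # Optimizer - with an exponentially decaying learning rate.
    gstep = tf.Variable(0, trainable=False)  # steps taken; incremented by minimize()
    ilrate = tf.placeholder(tf.float32)  # initial learning rate, fed at each training step
    # The rate is multiplied by 0.75 for every 8000 steps (smooth decay, not staircase).
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)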
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
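
The validation and test graphs above reuse the trained weights with no dropout op and no extra rescaling. That works because `tf.nn.dropout(x, keep_prob)` keeps each element with probability `keep_prob` and scales the survivors by `1 / keep_prob`, so the expected activation is the same with and without dropout. A small NumPy sketch of this "inverted dropout" idea (the function name and the toy activations are hypothetical, not taken from the notebook):

```python
import numpy as np

def inverted_dropout(activations, keep_prob, rng):
    # Keep each unit with probability keep_prob and scale the survivors by
    # 1 / keep_prob, so the expected value of every unit is unchanged.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 100))  # hypothetical activations, all equal to 1
dropped = inverted_dropout(hidden, keep_prob=0.9, rng=rng)

print("fraction kept:", np.mean(dropped > 0))
print("mean before / after:", hidden.mean(), dropped.mean())
```

With `keep_prob = 0.9`, as used in the training cells, roughly one unit in ten is zeroed in each layer and the rest are multiplied by about 1.11.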

In [56]:
num_steps = 40001
# dropout keep probability; named keep_probl so that it does not
# overwrite the keep_prob placeholder defined in the graph
keep_probl = 0.9
# L2 regularisation constant
gamma = 0.000001
# initial learning rate; the decayed rate is computed in the graph from gstep
learning_rate_i = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 2.440985
Minibatch accuracy: 7.8%
Validation accuracy: 8.9%
Current learning rate: 0.009999639354646206
Minibatch loss at step 4000: 0.243824
Minibatch accuracy: 94.5%
Validation accuracy: 89.1%
Current learning rate: 0.008659943006932735
Minibatch loss at step 8000: 0.264471
Minibatch accuracy: 93.8%
Validation accuracy: 90.3%
Current learning rate: 0.007499729748815298
Minibatch loss at step 12000: 0.233488
Minibatch accuracy: 94.5%
Validation accuracy: 91.0%
Current learning rate: 0.006494956556707621
Minibatch loss at step 16000: 0.262845
Minibatch accuracy: 91.4%
Validation accuracy: 91.4%
Current learning rate: 0.005624797195196152
Minibatch loss at step 20000: 0.255824
Minibatch accuracy: 91.4%
Validation accuracy: 91.8%
Current learning rate: 0.004871217533946037
Minibatch loss at step 24000: 0.209804
Minibatch accuracy: 93.0%
Validation accuracy: 91.9%
Current learning rate: 0.004218598362058401
Minibatch loss at step 28000: 0.242691
Minibatch accuracy: 93.8%
Validation accuracy: 92.1%
Current learning rate: 0.0036534129176288843
Minibatch loss at step 32000: 0.263058
Minibatch accuracy: 93.0%
Validation accuracy: 92.1%
Current learning rate: 0.0031639488879591227
Minibatch loss at step 36000: 0.226826
Minibatch accuracy: 92.2%
Validation accuracy: 92.3%
Current learning rate: 0.002740059979259968
Minibatch loss at step 40000: 0.226705
Minibatch accuracy: 94.5%
Validation accuracy: 92.5%
Current learning rate: 0.0023729614913463593
Test accuracy: 97.1%

Close to our best performance: 97.1% test accuracy, against 97.4% for the previous architecture.
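
A side note on the minibatch selection used in every training loop: the offset `(step * batch_size) % (train_labels.shape[0] - batch_size)` walks through the training set in order and wraps around once it reaches the end. The sketch below replays that arithmetic in pure Python for the 500000 training examples and batch size of 128 used here (the helper name `minibatch_offset` is ours):

```python
def minibatch_offset(step, batch_size=128, num_examples=500000):
    # Same expression as in the training loop.
    return (step * batch_size) % (num_examples - batch_size)

num_steps = 40001

# First step at which the offset jumps back towards the start of the data.
first_wrap = next(s for s in range(1, num_steps)
                  if minibatch_offset(s) < minibatch_offset(s - 1))

print("first wrap at step", first_wrap, "-> offset", minibatch_offset(first_wrap))
print("offset at the last step:", minibatch_offset(num_steps - 1))
print("approximate passes over the data:", num_steps * 128 / 500000.0)
```

Because the modulus is not a multiple of the batch size, each pass starts at a slightly shifted offset, so the batches differ a little between passes. The order of the examples is still fixed, which is what the "better randomization across epochs" comment in the loop refers to.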

Let's try some more:


In [57]:
batch_size = 128
hidden_nodes1 = 1568
hidden_nodes2 = 3136
hidden_nodes3 = 500
hidden_nodes4 = 50

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2],
                            stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3],
                            stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, hidden_nodes4],
                            stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for the minibatch prediction: the last hidden layer is not dropped
    # out here, but layers 1-3 still are, since hidden_layer4 builds on hidden_layer3_d.
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training, with dropout applied after every hidden layer
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: no activation function (relu) is applied to the logits;
    # softmax is applied by the loss and by the prediction ops.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add L2 regularisation for all weight matrices (biases are not penalised).
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
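    # Optimizer - with an exponentially decaying learning rate.
    gstep = tf.Variable(0, trainable=False)  # steps taken; incremented by minimize()
    ilrate = tf.placeholder(tf.float32)  # initial learning rate, fed at each training step
    # The rate is multiplied by 0.75 for every 8000 steps (smooth decay, not staircase).
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)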
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
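
For reference, the regularisation term in the loss above is `regconst` times the sum of `tf.nn.l2_loss` over the five weight matrices, and `tf.nn.l2_loss(t)` is defined as `sum(t ** 2) / 2`. The NumPy sketch below mirrors that computation on two toy matrices (the helper names and the toy values are illustrative only):

```python
import numpy as np

def l2_loss(weights):
    # Half the sum of squared entries, matching the definition of tf.nn.l2_loss.
    return float(np.sum(np.square(weights)) / 2.0)

def l2_penalty(weight_matrices, gamma):
    # gamma plays the role of the value fed to the regconst placeholder.
    return gamma * sum(l2_loss(w) for w in weight_matrices)

# Toy weight matrices, far smaller than the notebook's layers.
toy_weights = [np.array([[1.0, -2.0], [3.0, 0.0]]),
               np.array([[2.0], [2.0]])]

print("l2_loss of first matrix:", l2_loss(toy_weights[0]))
print("penalty with gamma=1e-6:", l2_penalty(toy_weights, gamma=0.000001))
```

The factor of one half makes the gradient of the penalty with respect to each weight simply `gamma * w`, so every update shrinks the weights slightly towards zero.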

In [58]:
num_steps = 40001
# dropout keep probability; named keep_probl so that it does not
# overwrite the keep_prob placeholder defined in the graph
keep_probl = 0.9
# L2 regularisation constant
gamma = 0.000001
# initial learning rate; the decayed rate is computed in the graph from gstep
learning_rate_i = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 2.379453
Minibatch accuracy: 10.2%
Validation accuracy: 16.0%
Current learning rate: 0.009999639354646206
Minibatch loss at step 4000: 0.254569
Minibatch accuracy: 94.5%
Validation accuracy: 89.3%
Current learning rate: 0.008659943006932735
Minibatch loss at step 8000: 0.248845
Minibatch accuracy: 91.4%
Validation accuracy: 90.7%
Current learning rate: 0.007499729748815298
Minibatch loss at step 12000: 0.196139
Minibatch accuracy: 94.5%
Validation accuracy: 91.3%
Current learning rate: 0.006494956556707621
Minibatch loss at step 16000: 0.232484
Minibatch accuracy: 93.8%
Validation accuracy: 91.8%
Current learning rate: 0.005624797195196152
Minibatch loss at step 20000: 0.239970
Minibatch accuracy: 92.2%
Validation accuracy: 92.0%
Current learning rate: 0.004871217533946037
Minibatch loss at step 24000: 0.180846
Minibatch accuracy: 95.3%
Validation accuracy: 92.4%
Current learning rate: 0.004218598362058401
Minibatch loss at step 28000: 0.153889
Minibatch accuracy: 96.1%
Validation accuracy: 92.4%
Current learning rate: 0.0036534129176288843
Minibatch loss at step 32000: 0.157343
Minibatch accuracy: 93.0%
Validation accuracy: 92.6%
Current learning rate: 0.0031639488879591227
Minibatch loss at step 36000: 0.154015
Minibatch accuracy: 94.5%
Validation accuracy: 92.7%
Current learning rate: 0.002740059979259968
Minibatch loss at step 40000: 0.126590
Minibatch accuracy: 95.3%
Validation accuracy: 92.8%
Current learning rate: 0.0023729614913463593
Test accuracy: 97.3%

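This all but matches our top performance: 97.3% test accuracy against the best run's 97.4%, with the same 92.8% validation accuracy.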

Let's try some more:


In [59]:
batch_size = 128
hidden_nodes1 = 1568
hidden_nodes2 = 3136
hidden_nodes3 = 1000
hidden_nodes4 = 100

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2],
                            stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3],
                            stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, hidden_nodes4],
                            stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
    # logits for the minibatch prediction: the last hidden layer is not dropped
    # out here, but layers 1-3 still are, since hidden_layer4 builds on hidden_layer3_d.
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training, with dropout applied after every hidden layer
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: no activation function (relu) is applied to the logits;
    # softmax is applied by the loss and by the prediction ops.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add L2 regularisation for all weight matrices (biases are not penalised).
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        tf.nn.l2_loss(weights1) + tf.nn.l2_loss(weights2)
        + tf.nn.l2_loss(weights3) + tf.nn.l2_loss(weights4)
        + tf.nn.l2_loss(weights5))
    
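    # Optimizer - with an exponentially decaying learning rate.
    gstep = tf.Variable(0, trainable=False)  # steps taken; incremented by minimize()
    ilrate = tf.placeholder(tf.float32)  # initial learning rate, fed at each training step
    # The rate is multiplied by 0.75 for every 8000 steps (smooth decay, not staircase).
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)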
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
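
To put the four architectures tried in cells In [53] to In [59] side by side, it can help to count their trainable parameters. The sketch below does only that arithmetic (weights plus biases of each fully connected layer); the helper `count_parameters` and the dictionary layout are ours, and the layer sizes are copied from the `hidden_nodes*` settings of each cell:

```python
def count_parameters(layer_sizes):
    # One weight matrix (n_in x n_out) and one bias vector (n_out) per layer.
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

# Input size 28 * 28 = 784, four hidden layers, 10 output classes.
architectures = {
    "In [53]": [784, 4096, 1024, 256, 64, 10],
    "In [55]": [784, 784, 1568, 500, 50, 10],
    "In [57]": [784, 1568, 3136, 500, 50, 10],
    "In [59]": [784, 1568, 3136, 1000, 100, 10],
}

for cell, sizes in architectures.items():
    print("%s: %d parameters" % (cell, count_parameters(sizes)))
```

In the logs above, the smallest of these networks (In [55]) reached 97.1% test accuracy, while the two larger ones that finished (In [53] and In [57]) reached 97.4% and 97.3%, so the extra capacity bought only a few tenths of a percent here.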

In [60]:
num_steps = 40001
# dropout keep probability; named keep_probl so that it does not
# overwrite the keep_prob placeholder defined in the graph
keep_probl = 0.9
# L2 regularisation constant
gamma = 0.000001
# initial learning rate; the decayed rate is computed in the graph from gstep
learning_rate_i = 0.01

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 2.364189
Minibatch accuracy: 10.2%
Validation accuracy: 12.8%
Current learning rate: 0.009999639354646206
Minibatch loss at step 4000: 0.198735
Minibatch accuracy: 95.3%
Validation accuracy: 89.5%
Current learning rate: 0.008659943006932735
Minibatch loss at step 8000: 0.250931
Minibatch accuracy: 93.8%
Validation accuracy: 90.8%
Current learning rate: 0.007499729748815298
Minibatch loss at step 12000: 0.176855
Minibatch accuracy: 93.0%
Validation accuracy: 91.4%
Current learning rate: 0.006494956556707621
Minibatch loss at step 16000: 0.246995
Minibatch accuracy: 94.5%
Validation accuracy: 91.9%
Current learning rate: 0.005624797195196152
Minibatch loss at step 20000: 0.229791
Minibatch accuracy: 93.0%
Validation accuracy: 92.2%
Current learning rate: 0.004871217533946037
Minibatch loss at step 24000: 0.200849
Minibatch accuracy: 94.5%
Validation accuracy: 92.3%
Current learning rate: 0.004218598362058401
Minibatch loss at step 28000: 0.162767
Minibatch accuracy: 94.5%
Validation accuracy: 92.5%
Current learning rate: 0.0036534129176288843
Minibatch loss at step 32000: 0.221019
Minibatch accuracy: 94.5%
Validation accuracy: 92.7%
Current learning rate: 0.0031639488879591227
Minibatch loss at step 36000: 0.110184
Minibatch accuracy: 97.7%
Validation accuracy: 92.7%
Current learning rate: 0.002740059979259968
Minibatch loss at step 40000: 0.142781
Minibatch accuracy: 96.1%
Validation accuracy: 92.9%
Current learning rate: 0.0023729614913463593
Test accuracy: 97.4%

Matching our score!!
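
The comment in the training loop notes that better randomization across epochs is possible. A minimal sketch of one option, reshuffling the sample order at the start of every epoch, is shown below using NumPy only. The helper name `shuffled_minibatches` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def shuffled_minibatches(data, labels, batch_size, num_steps, seed=0):
    """Yield (batch_data, batch_labels), reshuffling the order every epoch.

    Hypothetical helper, not from the notebook. Incomplete trailing batches
    are dropped so that every batch has exactly batch_size rows, which the
    fixed-shape placeholders require.
    """
    rng = np.random.RandomState(seed)
    n = data.shape[0]
    batches_per_epoch = n // batch_size
    order = None
    for step in range(num_steps):
        pos = step % batches_per_epoch
        if pos == 0:
            order = rng.permutation(n)  # draw a new order for each epoch
        idx = order[pos * batch_size:(pos + 1) * batch_size]
        yield data[idx], labels[idx]

# Toy stand-ins for train_dataset / train_labels.
toy_data = np.arange(20, dtype=np.float32).reshape(10, 2)
toy_labels = np.arange(10)
batches = list(shuffled_minibatches(toy_data, toy_labels,
                                    batch_size=3, num_steps=6))
```

In the loop above, `batch_data` and `batch_labels` could be taken from such a generator instead of the `offset` slicing; whether this changes the final accuracy has not been tested here.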

Let's try to change the regularisation weights:


In [61]:
batch_size = 128
hidden_nodes1 = 1568
hidden_nodes2 = 3136
hidden_nodes3 = 500
hidden_nodes4 = 50

graph = tf.Graph()
with graph.as_default():

    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables - Network Construction!
    # Matrix Dimensions:
    # 1st argument has dimensions coming from previous layer
    # 2nd argument has dimensions going to the next layer == dim(bias)
    # We construct the variables representing the 1st hidden layer:
    weights1 = tf.Variable(
        tf.truncated_normal([image_size * image_size, hidden_nodes1],
                            stddev=np.sqrt(2.0 / (image_size * image_size))))
    biases1 = tf.Variable(tf.zeros([hidden_nodes1]))
    # We construct the variables representing the 2nd hidden layer:
    weights2 = tf.Variable(
        tf.truncated_normal([hidden_nodes1, hidden_nodes2],
                            stddev=np.sqrt(2.0 / hidden_nodes1)))
    biases2 = tf.Variable(tf.zeros([hidden_nodes2]))
    # We construct the variables representing the 3rd hidden layer:
    weights3 = tf.Variable(
        tf.truncated_normal([hidden_nodes2, hidden_nodes3],
                            stddev=np.sqrt(2.0 / hidden_nodes2)))
    biases3 = tf.Variable(tf.zeros([hidden_nodes3]))
    # We construct the variables representing the 4th hidden layer:
    weights4 = tf.Variable(
        tf.truncated_normal([hidden_nodes3, hidden_nodes4],
                            stddev=np.sqrt(2.0 / hidden_nodes3)))
    biases4 = tf.Variable(tf.zeros([hidden_nodes4]))
    # We construct the variables representing the output layer:
    weights5 = tf.Variable(
        tf.truncated_normal([hidden_nodes4, num_labels], stddev=np.sqrt(2.0 / hidden_nodes4)))
    biases5 = tf.Variable(tf.zeros([num_labels]))
    
    # introduce dropout
    keep_prob = tf.placeholder(tf.float32)
    
    # Training computation.
    hidden_layer1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)
    hidden_layer1_d = tf.nn.dropout(hidden_layer1, keep_prob)
    
    hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1_d, weights2) + biases2)
    hidden_layer2_d = tf.nn.dropout(hidden_layer2, keep_prob)
    
    hidden_layer3 = tf.nn.relu(tf.matmul(hidden_layer2_d, weights3) + biases3)
    hidden_layer3_d = tf.nn.dropout(hidden_layer3, keep_prob)
    hidden_layer4 = tf.nn.relu(tf.matmul(hidden_layer3_d, weights4) + biases4)
    hidden_layer4_d = tf.nn.dropout(hidden_layer4, keep_prob)
    
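    # logits for training-set prediction: no dropout on the last hidden layer,
    # but layers 1-3 still pass through dropout whenever keep_prob < 1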
    logits = tf.matmul(hidden_layer4, weights5) + biases5
    
    # logits_d for training with dropout
    logits_d = tf.matmul(hidden_layer4_d, weights5) + biases5
    # Note: we didn't use activation function (relu) for logits calculation.
    
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_d))
    
    # add regularisation for all weights.
    regconst = tf.placeholder(tf.float32)
    loss = loss + regconst * (
        (tf.nn.l2_loss(weights1)/(image_size * image_size * hidden_nodes1)) 
        + (tf.nn.l2_loss(weights2)/(hidden_nodes1 * hidden_nodes2))
        + (tf.nn.l2_loss(weights3)/(hidden_nodes2 * hidden_nodes3))
        + (tf.nn.l2_loss(weights4)/(hidden_nodes3 * hidden_nodes4))
        + (tf.nn.l2_loss(weights5)/(hidden_nodes4 * num_labels)))
    
    # Optimizer - with variable learning rate.
    gstep = tf.Variable(0)  # steps taken
    ilrate = tf.placeholder(tf.float32)
    flrate = tf.train.exponential_decay(ilrate, gstep, 8000, 0.75)
    # Feed learning rate during training step!
    
    optimizer = tf.train.MomentumOptimizer(flrate, momentum=0.9, use_nesterov=True).minimize(
        loss, global_step=gstep)

    # Predictions for the training, validation, and test data.
    # Predict for training:
    train_prediction = tf.nn.softmax(logits)
    
    # Create Validation graph
    hidden_layer1_val = tf.nn.relu(tf.matmul(tf_valid_dataset, weights1) + biases1)
    hidden_layer2_val = tf.nn.relu(tf.matmul(hidden_layer1_val, weights2) + biases2)
    hidden_layer3_val = tf.nn.relu(tf.matmul(hidden_layer2_val, weights3) + biases3)
    hidden_layer4_val = tf.nn.relu(tf.matmul(hidden_layer3_val, weights4) + biases4)
    logits_val = tf.matmul(hidden_layer4_val, weights5) + biases5
    # Predict for validation
    valid_prediction = tf.nn.softmax(logits_val)
    
    # Create Test graph
    hidden_layer1_test = tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1)
    hidden_layer2_test = tf.nn.relu(tf.matmul(hidden_layer1_test, weights2) + biases2)
    hidden_layer3_test = tf.nn.relu(tf.matmul(hidden_layer2_test, weights3) + biases3)
    hidden_layer4_test = tf.nn.relu(tf.matmul(hidden_layer3_test, weights4) + biases4)
    logits_test = tf.matmul(hidden_layer4_test, weights5) + biases5
    # Predict for test
    test_prediction = tf.nn.softmax(logits_test)
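
The validation and test graphs above reuse the trained weights without any dropout layers. This works because `tf.nn.dropout(x, keep_prob)` rescales the surviving activations by `1 / keep_prob`, so their expected value is unchanged. The NumPy sketch below imitates that behaviour; `inverted_dropout` is a hypothetical stand-in written for illustration, not the TensorFlow implementation.

```python
import numpy as np

def inverted_dropout(x, keep_prob, rng):
    """Zero each element with probability 1 - keep_prob and scale the
    survivors by 1 / keep_prob (illustrative stand-in for tf.nn.dropout)."""
    mask = rng.uniform(size=x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.RandomState(0)
activations = np.ones((1000, 100))
dropped = inverted_dropout(activations, keep_prob=0.8, rng=rng)

kept_fraction = np.mean(dropped != 0)  # close to keep_prob
mean_after = dropped.mean()            # close to the original mean of 1.0
```

With `keep_prob = 1.0` the function returns its input unchanged, which is what the dropout-free validation and test graphs compute.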

In [62]:
num_steps = 40001
# dropout keep probability
keep_probl = 0.8  # named differently from the graph placeholder keep_prob so it is not shadowed
# regularisation constant
gamma = 0.00001
# initial learning rate; the decayed rate is computed in the graph at every step
learning_rate_i = 0.008

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data,
                     tf_train_labels : batch_labels,
                     regconst : gamma,
                     keep_prob : keep_probl,
                     ilrate : learning_rate_i
                    }
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 4000 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
            print("Current learning rate: {}".format(flrate.eval(feed_dict=feed_dict)))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))


Initialized
Minibatch loss at step 0: 2.468988
Minibatch accuracy: 4.7%
Validation accuracy: 12.4%
Current learning rate: 0.007999712601304054
Minibatch loss at step 4000: 0.276256
Minibatch accuracy: 90.6%
Validation accuracy: 88.3%
Current learning rate: 0.006927954498678446
Minibatch loss at step 8000: 0.305281
Minibatch accuracy: 90.6%
Validation accuracy: 90.0%
Current learning rate: 0.005999784450978041
Minibatch loss at step 12000: 0.232435
Minibatch accuracy: 92.2%
Validation accuracy: 90.6%
Current learning rate: 0.005195965524762869
Minibatch loss at step 16000: 0.297620
Minibatch accuracy: 90.6%
Validation accuracy: 91.1%
Current learning rate: 0.004499838221818209
Minibatch loss at step 20000: 0.295901
Minibatch accuracy: 89.1%
Validation accuracy: 91.4%
Current learning rate: 0.0038969742599874735
Minibatch loss at step 24000: 0.274724
Minibatch accuracy: 89.8%
Validation accuracy: 91.7%
Current learning rate: 0.0033748787827789783
Minibatch loss at step 28000: 0.240110
Minibatch accuracy: 91.4%
Validation accuracy: 91.9%
Current learning rate: 0.0029227305203676224
Minibatch loss at step 32000: 0.252527
Minibatch accuracy: 89.8%
Validation accuracy: 91.9%
Current learning rate: 0.0025311592034995556
Minibatch loss at step 36000: 0.205972
Minibatch accuracy: 93.8%
Validation accuracy: 92.0%
Current learning rate: 0.0021920481231063604
Minibatch loss at step 40000: 0.245355
Minibatch accuracy: 93.8%
Validation accuracy: 92.1%
Current learning rate: 0.0018983692862093449
Test accuracy: 96.9%

Close to top-notch performance!!
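
The "Current learning rate" values printed above can be reproduced by hand. With its default `staircase=False`, `tf.train.exponential_decay` computes `initial_rate * decay_rate ** (global_step / decay_steps)`. The sketch below applies that formula with the settings from the graph (8000 decay steps, rate 0.75); the function name `decayed_learning_rate` is made up for this example.

```python
def decayed_learning_rate(initial_rate, global_step,
                          decay_steps=8000, decay_rate=0.75):
    """Non-staircase exponential decay, as used for flrate in the graph."""
    return initial_rate * decay_rate ** (global_step / float(decay_steps))

# The optimizer increments gstep before the rate is printed, so the value
# logged at loop step s corresponds to global_step = s + 1.
for step in (0, 4000, 40000):
    print(step, decayed_learning_rate(0.008, step + 1))
```

The results agree with the logged rates up to float32 rounding, which also explains why the value at step 0 is slightly below the initial rate of 0.008.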

Note: we still need to verify, from a theoretical standpoint, whether this approach has an impact, and if so, what kind of impact.
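
As a starting point for that check, assuming "this approach" refers to dividing each layer's L2 term by its number of weights, the arithmetic can be sketched in NumPy. `tf.nn.l2_loss(w)` is `sum(w ** 2) / 2`, so the normalised term is half the mean squared weight, and the penalty is equivalent to a plain L2 penalty with coefficient `gamma / (fan_in * fan_out)`. The names `l2_loss`, `layer_sizes` and `effective_coeffs` below are introduced only for this example.

```python
import numpy as np

def l2_loss(w):
    """NumPy equivalent of tf.nn.l2_loss: sum(w ** 2) / 2."""
    return np.sum(w ** 2) / 2.0

image_size, num_labels = 28, 10
layer_sizes = [image_size * image_size, 1568, 3136, 500, 50, num_labels]
gamma = 0.00001  # value fed to regconst in the run above

effective_coeffs = []
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    # Dividing l2_loss by the weight count is the same as an unnormalised
    # L2 penalty with coefficient gamma / (fan_in * fan_out).
    effective_coeffs.append(gamma / (fan_in * fan_out))

# Small check on a random matrix: the normalised term is half the mean square.
rng = np.random.RandomState(0)
w = rng.normal(scale=np.sqrt(2.0 / 50), size=(50, 10))
normalised_penalty = l2_loss(w) / w.size
```

Under this assumption the largest effective coefficient (output layer, 500 weights) is 2e-8, and the largest layers get coefficients several orders of magnitude smaller, so the penalty may contribute very little to the gradient. The two runs above also differ in keep probability, `gamma` and initial learning rate, so the accuracy difference between them cannot be attributed to the regularisation alone.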