Deep Learning

Lab Session 2 - 3 Hours

Convolutional Neural Network (CNN) for Handwritten Digits Recognition

The aim of this session is to practice with Convolutional Neural Networks. Answers and experiments should be produced by groups of one or two students. Each group should fill in and run the appropriate notebook cells.

In the last Lab Session, you built a Multilayer Perceptron for recognizing handwritten digits from the MNIST dataset. The best accuracy achieved on the test data was about 97%. Can you do better with a deep CNN? In this Lab Session, you will build, train and optimize in TensorFlow one of the early Convolutional Neural Networks, LeNet-5, to reach more than 99% accuracy.

Load MNIST Data in TensorFlow

Run the cell below to load the MNIST data that comes with TensorFlow. You will use this data in Section 1 and Section 2.

In [1]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
X_train, y_train           = mnist.train.images, mnist.train.labels
X_validation, y_validation = mnist.validation.images, mnist.validation.labels
X_test, y_test             = mnist.test.images, mnist.test.labels
print("Image Shape: {}".format(X_train[0].shape))
print("Training Set:   {} samples".format(len(X_train)))
print("Validation Set: {} samples".format(len(X_validation)))
print("Test Set:       {} samples".format(len(X_test)))

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Image Shape: (784,)
Training Set:   55000 samples
Validation Set: 5000 samples
Test Set:       10000 samples

Section 1 : My First Model in TensorFlow

Before starting with CNNs, let's train and test in TensorFlow the example y = softmax(Wx + b) seen in the Deep Learning course last week.

This model reaches an accuracy of about 92%. You will also learn how to launch TensorBoard to visualize the computation graph, statistics and learning curves.
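
As a minimal sketch of what y = softmax(Wx + b) and the cross-entropy loss compute, independent of TensorFlow, the following pure-Python example uses a toy model with 2 features and 3 classes. The helper names (softmax, cross_entropy, linear) and the toy values are illustrative assumptions, not part of the lab code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot) if t > 0)

def linear(x, W, b):
    """Compute xW + b for one sample x (W is indexed as W[feature][class])."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

# Toy example: 2 features, 3 classes, zero-initialized weights as in the lab cell.
x = [0.5, -1.0]
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0]

probs = softmax(linear(x, W, b))
loss = cross_entropy(probs, [0, 1, 0])
print(probs, loss)
```

With zero-initialized weights every class receives the same probability, so the initial loss equals the logarithm of the number of classes; for the 10 MNIST classes that is ln(10), about 2.30.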

Part 1 : Read carefully the code in the cell below. Run it to perform training.

In [2]:
from __future__ import print_function
import tensorflow as tf


# Parameters
learning_rate = 0.01
training_epochs = 100
batch_size = 128
display_step = 1
logs_path = 'log_files/'  # useful for tensorboard

# tf Graph Input:  mnist data image of shape 28*28=784
x = tf.placeholder(tf.float32, [None, 784], name='InputData')
# 0-9 digits recognition,  10 classes
y = tf.placeholder(tf.float32, [None, 10], name='LabelData')

# Set model weights
W = tf.Variable(tf.zeros([784, 10]), name='Weights')
b = tf.Variable(tf.zeros([10]), name='Bias')

# Construct model and encapsulating all ops into scopes, making Tensorboard's Graph visualization more convenient
with tf.name_scope('Model'):
    # Model
    pred = tf.nn.softmax(tf.matmul(x, W) + b) # Softmax
with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))
with tf.name_scope('SGD'):
    # Gradient Descent
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
with tf.name_scope('Accuracy'):
    # Accuracy
    acc = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()
# Create a summary to monitor cost tensor
tf.summary.scalar("Loss", cost)
# Create a summary to monitor accuracy tensor
tf.summary.scalar("Accuracy", acc)
# Merge all summaries into a single op
merged_summary_op = tf.summary.merge_all()

#STEP 2 

# Launch the graph for training
with tf.Session() as sess:
    # Run the variable initializer before training
    sess.run(init)
    # op to write logs to Tensorboard
    summary_writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())
    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c, summary = sess.run([optimizer, cost, merged_summary_op],
                                     feed_dict={x: batch_xs, y: batch_ys})
            # Write logs at every iteration
            summary_writer.add_summary(summary, epoch * total_batch + i)
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if (epoch+1) % display_step == 0:
            print("Epoch: ", '%02d' % (epoch+1), "  =====> Loss=", "{:.9f}".format(avg_cost))

    print("Optimization Finished!")

    # Test model
    # Calculate accuracy
    print("Accuracy:", acc.eval({x: mnist.test.images, y: mnist.test.labels}))

Optimization Finished!
Accuracy: 0.9203

Part 2 : Using Tensorboard, we can now visualize the created graph, giving you an overview of your architecture and how all of the major components are connected. You can also see and analyse the learning curves.

To launch TensorBoard:

  • Go to the TP2 folder,
  • Open a Terminal and run the command "tensorboard --logdir=log_files/"; it will print an HTTP link (e.g. http://666.6.6.6:6006),
  • Copy this link into your web browser

Enjoy It !!

Section 2 : The 99% MNIST Challenge !

Part 1 : LeNet5 implementation

Now that you are familiar with TensorFlow and TensorBoard, in this section you will build, train and test the baseline LeNet-5 model for the MNIST digit recognition problem.

In a more advanced step, you will make some optimizations to get more than 99% accuracy. The best model can reach over 99.7% accuracy!

For more information, have a look at this list of results :

<img src="lenet.png" width="800" height="600" align="center">

Figure 1: LeNet-5

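The LeNet architecture accepts a 32x32xC image as input, where C is the number of color channels. Since MNIST images are grayscale, C is 1 in this case. MNIST images are 28x28, so they must either be padded to 32x32 or passed through a first convolution that uses 'SAME' padding in order to obtain the 28x28x6 output of Layer 1.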

Layer 1: Convolutional. The output shape should be 28x28x6. Activation: sigmoid. Pooling: the output shape should be 14x14x6.

Layer 2: Convolutional. The output shape should be 10x10x16. Activation: sigmoid. Pooling: the output shape should be 5x5x16.

Flatten. Flatten the output of the final pooling layer so that it is 1D instead of 3D. You may need to use flatten (from tensorflow.contrib.layers import flatten).

Layer 3: Fully Connected. This should have 120 outputs. Activation: sigmoid.

Layer 4: Fully Connected. This should have 84 outputs. Activation: sigmoid.

Layer 5: Fully Connected. This should have 10 outputs. Activation: softmax.
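
The output shapes listed above follow from the usual convolution and pooling size rules. The sketch below checks them for a 28x28 MNIST input, assuming 5x5 kernels with stride 1, 2x2 pooling with stride 2, and 'SAME' padding only in the first convolution. The helper name conv_output_size is hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding="VALID"):
    """Spatial output size of a convolution or pooling window along one axis."""
    if padding == "SAME":
        # SAME pads the input so that output = ceil(size / stride)
        return -(-size // stride)
    # VALID uses no padding: output = floor((size - kernel) / stride) + 1
    return (size - kernel) // stride + 1

size = 28                                    # side of an MNIST image
c1 = conv_output_size(size, 5, 1, "SAME")    # Layer 1 convolution
p1 = conv_output_size(c1, 2, 2, "VALID")     # Layer 1 pooling
c2 = conv_output_size(p1, 5, 1, "VALID")     # Layer 2 convolution
p2 = conv_output_size(c2, 2, 2, "VALID")     # Layer 2 pooling
flat = p2 * p2 * 16                          # flattened size fed to Layer 3

print(c1, p1, c2, p2, flat)
```

The same helper shows that a 32x32 input with 'VALID' padding also gives a 28x28 map after the first convolution, which is the original LeNet-5 setting.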

Question 2.1.1 Implement the Neural Network architecture described above. For that, you will use classes and functions from

We give you some helper functions for weight and bias initialization. You can also refer to Section 1.

In [2]:
# Helper functions for weight and bias initialization
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

In [3]:

def LeNet5_Model(data,activation_function=tf.nn.sigmoid):

    # layer 1 param
    conv1_weights = weight_variable([5,5,1,6])
    conv1_bias = bias_variable([6])
    # layer 2 param
    conv2_weights = weight_variable([5,5,6,16])
    conv2_bias = bias_variable([16])
    # layer 3 param
    layer3_weights = weight_variable([400, 120])
    layer3_bias = bias_variable([120])
    # layer 4 param
    layer4_weights = weight_variable([120, 84])
    layer4_bias = bias_variable([84])
    # layer 5 param
    layer5_weights = weight_variable([84, 10])
    layer5_bias = bias_variable([10])
    with tf.name_scope('Model'):
        with tf.name_scope('Layer1'):
            conv1 = tf.nn.conv2d(input=data,filter=conv1_weights,strides=[1,1,1,1],padding='SAME')
            sigmoid1 = activation_function(conv1 + conv1_bias)
            pool1 = tf.nn.max_pool(sigmoid1,ksize=[1, 2, 2, 1],strides=[1, 2, 2, 1],padding='VALID')
        with tf.name_scope('Layer2'):
            conv2 = tf.nn.conv2d(input=pool1,filter=conv2_weights,strides=[1,1,1,1],padding='VALID')
            sigmoid2 = activation_function(conv2 + conv2_bias)
            pool2 = tf.nn.max_pool(sigmoid2,ksize=[1, 2, 2, 1],strides=[1, 2, 2, 1],padding='VALID')   
        with tf.name_scope('Flatten'):
            flat_inputs = tf.contrib.layers.flatten(pool2)
        with tf.name_scope('Layer3'):
            out3 = activation_function(tf.matmul(flat_inputs, layer3_weights) + layer3_bias)
        with tf.name_scope('Layer4'):
            out4 = activation_function(tf.matmul(out3, layer4_weights) + layer4_bias)
        with tf.name_scope('Layer5'):
            pred = tf.nn.softmax(tf.matmul(out4, layer5_weights) + layer5_bias) # Softmax
        return pred

Question 2.1.2. Calculate the number of parameters of this model

In [13]:
total_parameters = 0
for variable in tf.trainable_variables():
    # shape is an array of tf.Dimension
    shape = variable.get_shape()
    variable_parameters = 1
    for dim in shape:
        variable_parameters *= dim.value
    total_parameters += variable_parameters

(5, 5, 1, 6)
(5, 5, 6, 16)
(400, 120)
(120, 84)
(84, 10)

In [15]:
layer1 = 5*5*1*6 + 6
layer2 = 5*5*6*16 + 16
layer3 = 400*120 + 120
layer4 = 120*84 + 84
layer5 = 84*10 + 10
tot = layer1 + layer2 + layer3 + layer4 + layer5
print('total number of parameters: %d' % tot)

total number of parameters: 61706

Your answer goes here in details

Question 2.1.3. Start the training with the parameters cited below:

 Learning rate = 0.1
 Loss function: Cross entropy
 Optimizer: SGD
 Number of training iterations = 100
 Batch size = 128

In [18]:
from __future__ import print_function
import tensorflow as tf
from numpy import array
import numpy as np

# Parameters
learning_rate = 0.1
training_epochs = 100
batch_size = 128
display_step = 1
logs_path = 'log_files/'  # useful for tensorboard

# tf Graph Input:  mnist data image of shape 28*28=784
x = tf.placeholder(tf.float32, [batch_size,28, 28,1], name='InputData')
# 0-9 digits recognition,  10 classes
y = tf.placeholder(tf.float32, [batch_size, 10], name='LabelData')

# Construct model and encapsulating all ops into scopes, making Tensorboard's Graph visualization more convenient
with tf.name_scope('Model'):
    # Model
    pred = LeNet5_Model(data=x)
with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))
with tf.name_scope('SGD'):
    # Gradient Descent
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
with tf.name_scope('Accuracy'):
    # Accuracy
    acc = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()
# Create a summary to monitor cost tensor
tf.summary.scalar("Loss", cost)
# Create a summary to monitor accuracy tensor
tf.summary.scalar("Accuracy", acc)
# Merge all summaries into a single op
merged_summary_op = tf.summary.merge_all()

#STEP 2 

# Launch the graph for training
with tf.Session() as sess:
    # Run the variable initializer before training
    sess.run(init)
    # op to write logs to Tensorboard
    summary_writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())
    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)

            batch_xs = array(batch_xs).reshape(batch_size, 28,28,1)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c, summary = sess.run([optimizer, cost, merged_summary_op],
                                     feed_dict={x: batch_xs, y: batch_ys})
            # Write logs at every iteration
            summary_writer.add_summary(summary, epoch * total_batch + i)
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if (epoch+1) % display_step == 0:
            print("Epoch: ", '%02d' % (epoch+1), "  =====> Loss=", "{:.9f}".format(avg_cost))

    print("Optimization Finished!")

(128, 28, 28, 6)
(128, 14, 14, 6)
(128, 10, 10, 16)
(128, 5, 5, 16)
(128, 400)
Optimization Finished!

Question 2.1.4. Implement the evaluation function for accuracy computation

In [4]:
def evaluate(model, y):
    #your implementation goes here
    correct_prediction = tf.equal(tf.argmax(model,1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    #print(accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

    return accuracy

Question 2.1.5. Implement training pipeline and run the training data through it to train the model.

  • Before each epoch, shuffle the training set.
  • Print the loss per mini batch and the training/validation accuracy per epoch. (Display results every 100 epochs)
  • Save the model after training
  • Print after training the final testing accuracy
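
The "shuffle before each epoch, then iterate over mini-batches" step can be sketched without TensorFlow as follows. This is an illustration under assumptions: the helper shuffled_minibatches and the toy data are hypothetical stand-ins for the MNIST arrays.

```python
import random

def shuffled_minibatches(images, labels, batch_size, rng):
    """Yield (image_batch, label_batch) pairs after shuffling the sample order."""
    order = list(range(len(images)))
    rng.shuffle(order)                  # a new permutation on every call (every epoch)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield [images[i] for i in idx], [labels[i] for i in idx]

# Toy data standing in for the MNIST arrays.
images = [[float(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]
rng = random.Random(0)

for epoch in range(2):
    batches = list(shuffled_minibatches(images, labels, batch_size=4, rng=rng))
    print("epoch", epoch, "batch sizes:", [len(b[0]) for b in batches])
```

Unlike this sketch, a loop that uses int(num_examples / batch_size) batches per epoch simply skips the final, smaller batch.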

In [5]:
import numpy as np
from time import time  # needed for the timing calls inside train()
# Initializing the variables
def train(learning_rate, training_epochs, batch_size, display_step, optimizer_method=tf.train.GradientDescentOptimizer,activation_function=tf.nn.sigmoid):
    # Initializing the session 
    logs_path = 'log_files/'  # useful for tensorboard

    # tf Graph Input:  mnist data image of shape 28*28=784
    x = tf.placeholder(tf.float32, [None,28, 28,1], name='InputData')
    # 0-9 digits recognition,  10 classes
    y = tf.placeholder(tf.float32, [None, 10], name='LabelData')

    # Construct model and encapsulating all ops into scopes, making Tensorboard's Graph visualization more convenient
    with tf.name_scope('Model'):
        # Model
        pred = LeNet5_Model(data=x,activation_function=activation_function)
    with tf.name_scope('Loss'):
        # Minimize error using cross entropy
        if activation_function == tf.nn.sigmoid:
            cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))
        else:
            # Clip the probabilities away from 0 so that tf.log never returns -inf
            # (the lower bound must be positive; 1e-10 is an assumed value)
            cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(tf.clip_by_value(pred,1e-10,1.0)), reduction_indices=1))
            #cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    with tf.name_scope('SGD'):
        # Gradient Descent
        optimizer = optimizer_method(learning_rate).minimize(cost)
    with tf.name_scope('Accuracy'):
        # Accuracy
        acc = evaluate(pred, y)

    # Initializing the variables
    init = tf.global_variables_initializer()
    # Create a summary to monitor cost tensor
    tf.summary.scalar("Loss", cost)
    # Create a summary to monitor accuracy tensor
    tf.summary.scalar("Accuracy", acc)
    # Merge all summaries into a single op
    merged_summary_op = tf.summary.merge_all()

    saver = tf.train.Saver()
    print ("Start Training!")
    t0 = time()
    X_train,Y_train = mnist.train.images.reshape((-1,28,28,1)), mnist.train.labels
    X_val,Y_val = mnist.validation.images.reshape((-1,28,28,1)), mnist.validation.labels
    # Launch the graph for training
    with tf.Session() as sess:
        # Run the variable initializer before training
        sess.run(init)
        # op to write logs to Tensorboard
        summary_writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())

        # Training cycle
        for epoch in range(training_epochs):
            avg_cost = 0.
            total_batch = int(mnist.train.num_examples/batch_size)

            # Loop over all batches
            for i in range(total_batch):
                # train_next_batch shuffle the images by default
                batch_xs, batch_ys = mnist.train.next_batch(batch_size)
                batch_xs = batch_xs.reshape((-1,28,28,1))

                # Run optimization op (backprop), cost op (to get loss value)
                # and summary nodes
                _, c, summary = sess.run([optimizer, cost, merged_summary_op],
                                         feed_dict={x: batch_xs, 
                                                    y: batch_ys})
                # Write logs at every iteration
                summary_writer.add_summary(summary, epoch * total_batch + i)
                # Compute average loss
                avg_cost += c / total_batch

            # Display logs per epoch step
            if (epoch+1) % display_step == 0:
                print("Epoch: ", '%02d' % (epoch+1), "=====> Loss=", "{:.9f}".format(avg_cost))
                acc_train = acc.eval({x: X_train, y: Y_train})
                print("Epoch: ", '%02d' % (epoch+1), "=====> Accuracy Train=", "{:.9f}".format(acc_train))
                acc_val = acc.eval({x: X_val, y: Y_val})
                print("Epoch: ", '%02d' % (epoch+1), "=====> Accuracy Validation=", "{:.9f}".format(acc_val))
        print ("Training Finished!")
        t1 = time()
        # Save the variables to disk.
        save_path = saver.save(sess, "model.ckpt")
        print("Model saved in file: %s" % save_path)

        #Your implementation for testing accuracy after training goes here
        X_test,Y_test = mnist.test.images.reshape((-1,28,28,1)),mnist.test.labels

        acc_test = acc.eval({x: X_test, y: Y_test})
        print("Accuracy Test=", "{:.9f}".format(acc_test))
        return acc_train,acc_val,acc_test,t1-t0

In [87]:
%time train (0.1,100,128,10,optimizer_method=tf.train.GradientDescentOptimizer)

(?, 28, 28, 6)
(?, 14, 14, 6)
(?, 10, 10, 16)
(?, 5, 5, 16)
(?, 400)
Start Training!
CPU times: user 1h 26min 32s, sys: 14min 28s, total: 1h 41min
Wall time: 35min 38s
(0.99220002, 0.98680001, 0.9867, 2137.679314851761)

Question 2.1.6 : Use tensorBoard to visualise and save the LeNet5 Graph and all learning curves. Save all obtained figures in the folder "TP2/MNIST_99_Challenge_Figures"


Here we see that the accuracy increases rapidly during the first few epochs and then at an ever slower rate. Regarding the loss, we see that it gets closer to zero epoch after epoch. The graph view confirms our network architecture.

Part 2 : LeNET 5 Optimization

Question 2.2.1 Replace the sigmoid function with a ReLU:

  • Retrain your network with SGD and AdamOptimizer and then fill in the table below:

| Optimizer           | Gradient Descent | AdamOptimizer |
|---------------------|------------------|---------------|
| Validation Accuracy | 0.99180001       | 0.097599998   |
| Testing Accuracy    | 0.99089998       | 0.1032        |
| Training Time       | 36 min           | 36 min        |

  • Try different learning rates for each optimizer (0.0001 and 0.001) and different batch sizes (50 and 128) for 20000 epochs.

  • For each optimizer, plot (on the same figure) the test accuracy as a function of (learning rate, batch size)

  • Did you reach 99% accuracy? What are the optimal parameters that gave you the best results?


  • ReLU: when we use the ReLU we need to change the cost function. The predicted probabilities must be clipped before taking the logarithm, otherwise training breaks down.
  • The Adam optimizer gives very bad results when used with a high learning rate (about 10% accuracy with 0.1); it works better with a low learning rate, such as 0.001.
  • When we use stochastic gradient descent with the ReLU we obtain really good results: more than 99% test accuracy. So we could stop here.
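
The clipping mentioned above can be illustrated with a small pure-Python sketch: if a predicted probability is exactly 0 for the true class, log(0) is undefined, so the probabilities are clipped to a small positive lower bound first. The function name safe_cross_entropy and the bound eps = 1e-10 are assumptions made for this example.

```python
import math

def safe_cross_entropy(probs, one_hot, eps=1e-10):
    """Cross-entropy with probabilities clipped to [eps, 1] before the log."""
    total = 0.0
    for p, t in zip(probs, one_hot):
        clipped = min(max(p, eps), 1.0)   # keep the argument of log strictly positive
        total -= t * math.log(clipped)
    return total

# A saturated prediction: the true class (index 1) gets probability exactly 0.
probs = [1.0, 0.0, 0.0]
label = [0, 1, 0]

loss = safe_cross_entropy(probs, label)
print(loss)   # finite, equal to -log(eps)
```

In TensorFlow 1.x, tf.nn.softmax_cross_entropy_with_logits is an alternative that avoids the explicit log, but it expects the pre-softmax logits rather than the softmax output.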

In [6]:
from time import time

In [14]:
%time train (0.1,100,128,10,optimizer_method=tf.train.GradientDescentOptimizer,activation_function=tf.nn.relu)

(?, 28, 28, 6)
(?, 14, 14, 6)
(?, 10, 10, 16)
(?, 5, 5, 16)
(?, 400)
Start Training!
CPU times: user 1h 24min 38s, sys: 13min 29s, total: 1h 38min 7s
Wall time: 36min 13s
(0.99998182, 0.99180001, 0.99089998, 2172.412855863571)

In [15]:
%time train (0.1,100,128,10,optimizer_method=tf.train.AdamOptimizer,activation_function=tf.nn.relu)

(?, 28, 28, 6)
(?, 14, 14, 6)
(?, 10, 10, 16)
(?, 5, 5, 16)
(?, 400)
Start Training!
CPU times: user 1h 24min 53s, sys: 13min 50s, total: 1h 38min 43s
Wall time: 36min 27s
(0.099454544, 0.097599998, 0.1032, 2185.8693709373474)
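
One plausible reason for the failure of Adam with a learning rate of 0.1, offered here as an assumption rather than a diagnosis of this run, is that Adam normalizes each update by the gradient magnitude, so every parameter moves by roughly the learning rate per step. The sketch below implements the standard Adam update for a scalar parameter with the usual default constants; the function names are hypothetical.

```python
import math

def adam_steps(grads, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the updates Adam applies for a sequence of scalar gradients."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g         # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g     # second-moment estimate
        m_hat = m / (1 - beta1 ** t)            # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

def sgd_steps(grads, lr):
    """Plain gradient descent updates for the same gradients."""
    return [-lr * g for g in grads]

small_grad = [1e-3]
print(adam_steps(small_grad, lr=0.1))   # magnitude close to lr
print(sgd_steps(small_grad, lr=0.1))    # magnitude lr * gradient
```

For the same small gradient, the first Adam step is about a thousand times larger than the SGD step, which is consistent with Adam needing a much smaller learning rate than SGD.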

In [7]:
# your answer goes here

columns = ['optimizer','learning_rate','activation_function','batch_size','training_accuracy','validation_accuracy','test_accuracy','elapsed_time']
optimizer_options = {'gradient_descent':tf.train.GradientDescentOptimizer,'adam':tf.train.AdamOptimizer}
learning_options = [0.001,0.0001]
activation_options = {'sigmoid':tf.nn.sigmoid,'relu':tf.nn.relu}
batch_options = [50,128]

final_results = []
for optimizer_label in optimizer_options:
    optimizer = optimizer_options[optimizer_label]
    for learning_rate in learning_options:
        for activation_label in activation_options:
            activation_function = activation_options[activation_label]
            for batch_size in batch_options:
                # Train and test one configuration
                training_accuracy,validation_accuracy,test_accuracy,elapsed_time = train(
                    learning_rate = learning_rate,
                    training_epochs = 100,  # assumed: same value as in the runs above
                    batch_size = batch_size,
                    display_step = 10,
                    optimizer_method = optimizer,
                    activation_function = activation_function)
                # Collect the settings and results of this run (keys follow `columns`)
                obj_test = {'optimizer':optimizer_label,
                        'learning_rate': learning_rate,
                        'activation_function': activation_label,
                        'batch_size': batch_size,
                        'training_accuracy': training_accuracy,
                        'validation_accuracy': validation_accuracy,
                        'test_accuracy': test_accuracy,
                        'elapsed_time': elapsed_time}
                final_results.append(obj_test)


(?, 28, 28, 6)
(?, 14, 14, 6)
(?, 10, 10, 16)
(?, 5, 5, 16)
(?, 400)
Start Training!
In [9]:

[{'activation_function': 'relu',
  'batch_size': 50,
  'elapsed_time': 2719.1444079875946,
  'learning_rate': 0.001,
  'optimizer': 'adam',
  'test_accuracy': 0.98979998,
  'training_accuracy': 0.99956363,
  'validation_accuracy': 0.99956363},
 {'activation_function': 'relu',
  'batch_size': 128,
  'elapsed_time': 2140.7165479660034,
  'learning_rate': 0.001,
  'optimizer': 'adam',
  'test_accuracy': 0.99260002,
  'training_accuracy': 1.0,
  'validation_accuracy': 1.0},
 {'activation_function': 'sigmoid',
  'batch_size': 50,
  'elapsed_time': 2590.7476279735565,
  'learning_rate': 0.001,
  'optimizer': 'adam',
  'test_accuracy': 0.9903,
  'training_accuracy': 1.0,
  'validation_accuracy': 1.0},
 {'activation_function': 'sigmoid',
  'batch_size': 128,
  'elapsed_time': 2174.403846025467,
  'learning_rate': 0.001,
  'optimizer': 'adam',
  'test_accuracy': 0.98989999,
  'training_accuracy': 1.0,
  'validation_accuracy': 1.0},
 {'activation_function': 'relu',
  'batch_size': 50,
  'elapsed_time': 2707.127268075943,
  'learning_rate': 0.0001,
  'optimizer': 'adam',
  'test_accuracy': 0.98860002,
  'training_accuracy': 0.9999091,
  'validation_accuracy': 0.9999091},
 {'activation_function': 'relu',
  'batch_size': 128,
  'elapsed_time': 2144.91255903244,
  'learning_rate': 0.0001,
  'optimizer': 'adam',
  'test_accuracy': 0.98869997,
  'training_accuracy': 0.99923635,
  'validation_accuracy': 0.99923635},
 {'activation_function': 'sigmoid',
  'batch_size': 50,
  'elapsed_time': 2667.0797259807587,
  'learning_rate': 0.0001,
  'optimizer': 'adam',
  'test_accuracy': 0.9874,
  'training_accuracy': 0.99463636,
  'validation_accuracy': 0.99463636},
 {'activation_function': 'sigmoid',
  'batch_size': 128,
  'elapsed_time': 2166.4854328632355,
  'learning_rate': 0.0001,
  'optimizer': 'adam',
  'test_accuracy': 0.9849,
  'training_accuracy': 0.98799998,
  'validation_accuracy': 0.98799998},
 {'activation_function': 'relu',
  'batch_size': 50,
  'elapsed_time': 2705.517254114151,
  'learning_rate': 0.001,
  'optimizer': 'gradient_descent',
  'test_accuracy': 0.98269999,
  'training_accuracy': 0.9860909,
  'validation_accuracy': 0.9860909},
 {'activation_function': 'relu',
  'batch_size': 128,
  'elapsed_time': 2159.0516889095306,
  'learning_rate': 0.001,
  'optimizer': 'gradient_descent',
  'test_accuracy': 0.97509998,
  'training_accuracy': 0.97543639,
  'validation_accuracy': 0.97543639},
 {'activation_function': 'sigmoid',
  'batch_size': 50,
  'elapsed_time': 2701.9426329135895,
  'learning_rate': 0.001,
  'optimizer': 'gradient_descent',
  'test_accuracy': 0.1135,
  'training_accuracy': 0.11234546,
  'validation_accuracy': 0.11234546},
 {'activation_function': 'sigmoid',
  'batch_size': 128,
  'elapsed_time': 2153.0241179466248,
  'learning_rate': 0.001,
  'optimizer': 'gradient_descent',
  'test_accuracy': 0.1135,
  'training_accuracy': 0.11234546,
  'validation_accuracy': 0.11234546},
 {'activation_function': 'relu',
  'batch_size': 50,
  'elapsed_time': 2709.723870038986,
  'learning_rate': 0.0001,
  'optimizer': 'gradient_descent',
  'test_accuracy': 0.93660003,
  'training_accuracy': 0.93341815,
  'validation_accuracy': 0.93341815},
 {'activation_function': 'relu',
  'batch_size': 128,
  'elapsed_time': 2159.801533937454,
  'learning_rate': 0.0001,
  'optimizer': 'gradient_descent',
  'test_accuracy': 0.90009999,
  'training_accuracy': 0.89254546,
  'validation_accuracy': 0.89254546},
 {'activation_function': 'sigmoid',
  'batch_size': 50,
  'elapsed_time': 2715.5786321163177,
  'learning_rate': 0.0001,
  'optimizer': 'gradient_descent',
  'test_accuracy': 0.1135,
  'training_accuracy': 0.11234546,
  'validation_accuracy': 0.11234546},
 {'activation_function': 'sigmoid',
  'batch_size': 128,
  'elapsed_time': 2195.4368810653687,
  'learning_rate': 0.0001,
  'optimizer': 'gradient_descent',
  'test_accuracy': 0.1135,
  'training_accuracy': 0.11234546,
  'validation_accuracy': 0.11234546}]


Here we have seen that the ReLU performs best overall. We have also seen that the sigmoid, combined with SGD and a low learning rate (0.001 or 0.0001), does not learn within the allotted training time: its test accuracy stays at 0.1135. The best run was:

{'activation_function': 'relu',
'batch_size': 128,
'elapsed_time': 2140.7165479660034,
'learning_rate': 0.001,
'optimizer': 'adam',
'test_accuracy': 0.99260002,
'training_accuracy': 1.0,
'validation_accuracy': 1.0},

Regarding the batch size, the best configuration (learning rate = 0.001, Adam optimizer) achieved a better result with the larger batch size. In general, we have not seen great differences. The most important point is that a smaller batch size needs more training time.

Regarding the learning rate, we achieved the best result with a learning rate of 0.001, but here too the answer is not clear-cut. The final accuracy depends on a combination of factors, and we cannot identify a simple proportional relationship.

Special cases: when we use stochastic gradient descent with a low learning rate and the sigmoid activation function, the accuracy is always very low.

As a general rule, the Adam optimizer can achieve better accuracy in less time, but the learning rate must be chosen carefully.
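
Selecting the best run from a list of result dictionaries like the one above can be sketched as follows. The list sample_runs contains three entries copied from the results (with only some of their keys), and the helper names are hypothetical. Ranking by test accuracy mirrors what this report does; ranking by validation accuracy would be the more usual protocol.

```python
# Three entries copied from the results list above (subset of keys).
sample_runs = [
    {'optimizer': 'adam', 'activation_function': 'relu', 'batch_size': 128,
     'learning_rate': 0.001, 'test_accuracy': 0.99260002},
    {'optimizer': 'adam', 'activation_function': 'sigmoid', 'batch_size': 50,
     'learning_rate': 0.001, 'test_accuracy': 0.9903},
    {'optimizer': 'gradient_descent', 'activation_function': 'relu', 'batch_size': 50,
     'learning_rate': 0.001, 'test_accuracy': 0.98269999},
]

def best_run(runs, metric='test_accuracy'):
    """Return the run with the highest value of the given metric."""
    return max(runs, key=lambda r: r[metric])

def best_per_optimizer(runs, metric='test_accuracy'):
    """Return a dict mapping each optimizer name to its best run."""
    best = {}
    for r in runs:
        name = r['optimizer']
        if name not in best or r[metric] > best[name][metric]:
            best[name] = r
    return best

print(best_run(sample_runs))
print(best_per_optimizer(sample_runs))
```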

In [23]:
import json
with open('json.json','r') as input_fp:
    results = json.load(input_fp)

In [17]:
import matplotlib.pyplot as plt
sigmoid = [x for x in results if x['activation_function']=='sigmoid']
relu =[x for x in results if x['activation_function'] !='sigmoid']
plt.plot(range(len(sigmoid)),[x['test_accuracy'] for x in sigmoid])
plt.plot(range(len(sigmoid)),[x['test_accuracy'] for x in relu])
plt.ylabel('test accuracy')
plt.xlabel('index run')

The ReLU is better than the sigmoid in every configuration except one (Adam, learning rate 0.001, batch size 50: 0.9898 versus 0.9903).

In [16]:
import matplotlib.pyplot as plt
a = [x for x in results if x['batch_size']== 128]
b =[x for x in results if x['batch_size'] !=128]
plt.plot(range(len(a)),[x['test_accuracy'] for x in a])
plt.plot(range(len(a)),[x['test_accuracy'] for x in b])
plt.ylabel('test accuracy')
plt.xlabel('index run')

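Test accuracy changes little with the batch size when Adam is used (differences below 0.3 percentage points). With SGD and ReLU, the smaller batch size does better (0.9827 versus 0.9751 at learning rate 0.001, and 0.9366 versus 0.9001 at learning rate 0.0001).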

In [18]:
import matplotlib.pyplot as plt
a = [x for x in results if x['learning_rate']== 0.0001]
b =[x for x in results if x['learning_rate'] !=0.0001]
plt.plot(range(len(a)),[x['test_accuracy'] for x in a])
plt.plot(range(len(a)),[x['test_accuracy'] for x in b])
plt.legend(['learning_rate = 0.0001','learning_rate = 0.001'])
plt.ylabel('test accuracy')
plt.xlabel('index run')

A higher learning rate is better in this specific case. Later we will see that this is not always the case.

In [19]:
import matplotlib.pyplot as plt
a = [x for x in results if x['optimizer']== 'adam']
b =[x for x in results if x['optimizer'] != 'adam']
plt.plot(range(len(a)),[x['test_accuracy'] for x in a])
plt.plot(range(len(a)),[x['test_accuracy'] for x in b])
plt.ylabel('test accuracy')
plt.xlabel('index run')

For these two learning rates, Adam is always better than stochastic gradient descent.

In [21]:
import matplotlib.pyplot as plt
a = [x for x in results if x['optimizer']== 'adam']
b =[x for x in results if x['optimizer'] != 'adam']
plt.plot(range(len(a)),[x['elapsed_time'] for x in a])
plt.plot(range(len(a)),[x['elapsed_time'] for x in b])
plt.xlabel('index run')

Sometimes Adam is faster than SGD, but the training times are very close.

In [22]:
import matplotlib.pyplot as plt
a = [x for x in results if x['batch_size']== 128]
b =[x for x in results if x['batch_size'] !=128]
plt.plot(range(len(a)),[x['elapsed_time'] for x in a])
plt.plot(range(len(a)),[x['elapsed_time'] for x in b])
plt.xlabel('index run')

Bigger batches mean less training time. This is good news, given that the batch size has little influence on the final accuracy of the best configurations.

Question 2.2.2 What about applying a dropout layer on the fully connected layer and then retraining the model with the best optimizer and parameters (learning rate and batch size) obtained in Question 2.2.1? (probability to keep units = 0.75). At evaluation time, ensure that keep_prob is set to 1.0 so that the performance of the network is measured with all nodes active.
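
The role of the keep probability can be sketched in pure Python. The example assumes "inverted" dropout, where surviving units are scaled by 1/keep_prob (this is also how tf.nn.dropout behaves), so that no rescaling is needed when keep_prob is 1.0 at evaluation time. The function name dropout and the toy activations are illustrative assumptions.

```python
import random

def dropout(values, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob and
    scale the survivors by 1/keep_prob so the expected value is unchanged."""
    if keep_prob >= 1.0:
        return list(values)             # evaluation mode: nothing is dropped
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 10000

train_out = dropout(activations, keep_prob=0.75, rng=rng)   # training
eval_out = dropout(activations, keep_prob=1.0, rng=rng)     # evaluation

kept = sum(1 for v in train_out if v != 0.0) / len(train_out)
print("fraction kept:", kept)
print("mean (train):", sum(train_out) / len(train_out))
print("mean (eval):", sum(eval_out) / len(eval_out))
```

In the TensorFlow graph, the same switch is obtained by feeding keep_prob as a placeholder: 0.75 in the training feed_dict and 1.0 when evaluating accuracy.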

In [16]:

def LeNet5_Model(data,keep_prob,activation_function=tf.nn.sigmoid):

    # layer 1 param
    conv1_weights = weight_variable([5,5,1,6])
    conv1_bias = bias_variable([6])
    # layer 2 param
    conv2_weights = weight_variable([5,5,6,16])
    conv2_bias = bias_variable([16])
    # layer 3 param
    layer3_weights = weight_variable([400, 120])
    layer3_bias = bias_variable([120])
    # layer 4 param
    layer4_weights = weight_variable([120, 84])
    layer4_bias = bias_variable([84]) 
    # layer 5 param
    layer5_weights = weight_variable([84, 10])
    layer5_bias = bias_variable([10])
    with tf.name_scope('Model'):
        with tf.name_scope('Layer1'):
            conv1 = tf.nn.conv2d(input=data,filter=conv1_weights,strides=[1,1,1,1],padding='SAME')
            sigmoid1 = activation_function(conv1 + conv1_bias)
            pool1 = tf.nn.max_pool(sigmoid1,ksize=[1, 2, 2, 1],strides=[1, 2, 2, 1],padding='VALID')
        with tf.name_scope('Layer2'):
            conv2 = tf.nn.conv2d(input=pool1,filter=conv2_weights,strides=[1,1,1,1],padding='VALID')
            sigmoid2 = activation_function(conv2 + conv2_bias)
            pool2 = tf.nn.max_pool(sigmoid2,ksize=[1, 2, 2, 1],strides=[1, 2, 2, 1],padding='VALID')   
        with tf.name_scope('Flatten'):
            flat_inputs = tf.contrib.layers.flatten(pool2)
        with tf.name_scope('Layer3'):
            out3 = activation_function(tf.matmul(flat_inputs, layer3_weights) + layer3_bias)
        with tf.name_scope('Layer4'):
            out4 = activation_function(tf.matmul(out3, layer4_weights) + layer4_bias)
        with tf.name_scope('Layer5'):
            out_drop = tf.nn.dropout(out4, keep_prob)
            pred = tf.nn.softmax(tf.matmul(out_drop, layer5_weights) + layer5_bias) # Softmax
        return pred
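
The 400 inputs of Layer3 follow from the padding and stride choices above. As a sanity check, here is a small sketch of the shape arithmetic; the helper `conv_out` is a hypothetical name introduced for illustration and only reproduces the usual 'SAME' and 'VALID' size rules.

```python
def conv_out(size, kernel, stride=1, padding='VALID'):
    """Spatial output size of a convolution or pooling layer."""
    if padding == 'SAME':
        return -(-size // stride)          # ceil(size / stride)
    return (size - kernel) // stride + 1   # 'VALID': no padding is added

size = 28                                  # MNIST images are 28x28
size = conv_out(size, 5, padding='SAME')   # conv1, 5x5 kernel, 'SAME'
size = conv_out(size, 2, stride=2)         # pool1, 2x2 window, stride 2
size = conv_out(size, 5)                   # conv2, 5x5 kernel, 'VALID'
size = conv_out(size, 2, stride=2)         # pool2, 2x2 window, stride 2
flat_size = size * size * 16               # 16 feature maps after conv2
print(size, flat_size)
```

The resulting flattened size is the first dimension of layer3_weights.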

In [19]:
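import numpy as np
from time import time  # time() is used below to measure the training duration

# Build the graph, train the model with dropout and evaluate it
def train(learning_rate, training_epochs, batch_size, display_step, optimizer_method=tf.train.GradientDescentOptimizer,activation_function=tf.nn.sigmoid):
    # Directory where the TensorBoard logs are written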
    logs_path = 'log_files/'  # useful for tensorboard

    # tf Graph Input:  mnist data image of shape 28*28=784
    x = tf.placeholder(tf.float32, [None,28, 28,1], name='InputData')
    # 0-9 digits recognition,  10 classes
    y = tf.placeholder(tf.float32, [None, 10], name='LabelData')
    keep_prob = tf.placeholder(tf.float32)

    # Construct model and encapsulating all ops into scopes, making Tensorboard's Graph visualization more convenient
    with tf.name_scope('Model'):
        # Model
        pred = LeNet5_Model(x,keep_prob,activation_function=activation_function)
    with tf.name_scope('Loss'):
        # Minimize error using cross entropy
        if activation_function == tf.nn.sigmoid:
            cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))
        else:
            # Clip the probabilities away from 0 so that tf.log never returns -inf
            cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(tf.clip_by_value(pred,1e-10,1.0)), reduction_indices=1))
            #cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    with tf.name_scope('SGD'):
        # Gradient Descent
        optimizer = optimizer_method(learning_rate).minimize(cost)
    with tf.name_scope('Accuracy'):
        # Accuracy
        acc = evaluate(pred, y)

    # Initializing the variables
    init = tf.global_variables_initializer()
    # Create a summary to monitor cost tensor
    tf.summary.scalar("Loss", cost)
    # Create a summary to monitor accuracy tensor
    tf.summary.scalar("Accuracy", acc)
    # Merge all summaries into a single op
    merged_summary_op = tf.summary.merge_all()

    saver = tf.train.Saver()
    print ("Start Training!")
    t0 = time()
    X_train,Y_train = mnist.train.images.reshape((-1,28,28,1)), mnist.train.labels
    X_val,Y_val = mnist.validation.images.reshape((-1,28,28,1)), mnist.validation.labels
    # Launch the graph for training
    with tf.Session() as sess:
        # Run the initializer so that all variables have values before training
        sess.run(init)
        # op to write logs to Tensorboard
        summary_writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())

        # Training cycle
        for epoch in range(training_epochs):
            avg_cost = 0.
            total_batch = int(mnist.train.num_examples/batch_size)

            # Loop over all batches
            for i in range(total_batch):
                # mnist.train.next_batch shuffles the images by default
                batch_xs, batch_ys = mnist.train.next_batch(batch_size)
                batch_xs = batch_xs.reshape((-1,28,28,1))

                # Run optimization op (backprop), cost op (to get loss value)
                # and summary nodes; dropout is active (keep_prob=0.75) while training
                _, c, summary = sess.run([optimizer, cost, merged_summary_op],
                                         feed_dict={x: batch_xs,
                                                    y: batch_ys, keep_prob: 0.75})
                # Write logs at every iteration
                summary_writer.add_summary(summary, epoch * total_batch + i)
                # Compute average loss
                avg_cost += c / total_batch

            # Display logs per epoch step
            if (epoch+1) % display_step == 0:
                print("Epoch: ", '%02d' % (epoch+1), "=====> Loss=", "{:.9f}".format(avg_cost))
                acc_train = acc.eval({x: X_train, y: Y_train,keep_prob:1.0})
                print("Epoch: ", '%02d' % (epoch+1), "=====> Accuracy Train=", "{:.9f}".format(acc_train))
                acc_val = acc.eval({x: X_val, y: Y_val,keep_prob:1.0})
                print("Epoch: ", '%02d' % (epoch+1), "=====> Accuracy Validation=", "{:.9f}".format(acc_val))
        print ("Training Finished!")
        t1 = time()
        # Save the variables to disk.
        save_path = saver.save(sess, "model.ckpt")
        print("Model saved in file: %s" % save_path)

        #Your implementation for testing accuracy after training goes here
        X_test,Y_test = mnist.test.images.reshape((-1,28,28,1)),mnist.test.labels

        acc_test = acc.eval({x: X_test, y: Y_test,keep_prob:1.0})
        print("Accuracy Test=", "{:.9f}".format(acc_test))
        return acc_train,acc_val,acc_test,t1-t0


Here we fed keep_prob through a placeholder, so that we can change it dynamically during a run: 0.75 while training and 1.0 while evaluating. We are also fairly confident that good performance can be reached with a limited number of epochs.
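
To illustrate why the two keep_prob values behave differently, here is a minimal NumPy sketch of inverted dropout, the scheme followed by tf.nn.dropout (kept units are scaled by 1/keep_prob). The function name `dropout` and the all-ones input are assumptions made for illustration, not code from the lab.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob and
    rescale the survivors by 1/keep_prob so the expected value is unchanged."""
    mask = rng.random_sample(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
out4 = np.ones((1000, 84))             # stand-in for the 84 outputs of Layer4

train_out = dropout(out4, 0.75, rng)   # training: roughly a quarter of the units are zeroed
eval_out = dropout(out4, 1.0, rng)     # evaluation: every unit is kept

print("fraction dropped:", (train_out == 0).mean())
print("mean activation :", train_out.mean())
print("eval unchanged  :", np.array_equal(eval_out, out4))
```

With keep_prob=1.0 the mask keeps every unit and the scaling factor is 1, so the evaluation pass sees the full network.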


We have seen that a learning rate of 0.001 is unstable, and this behavior is even more visible once dropout is added, so we used a lower learning rate. The Adam optimizer relies on Adaptive Moment Estimation; with a high learning rate combined with a large batch size, it can actually bring the network into a worse state.
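
For reference, here is a minimal sketch of a single Adam update on one scalar parameter, written in the standard textbook form with the default values of tf.train.AdamOptimizer (beta1=0.9, beta2=0.999, epsilon=1e-8). The function name `adam_step` and the toy gradient are illustrative assumptions; TensorFlow's internal implementation may arrange the computation differently.

```python
import math

def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at time step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

for lr in (0.001, 0.0001):
    w, m, v = 0.0, 0.0, 0.0
    w, m, v = adam_step(w, grad=5.0, m=m, v=v, t=1, lr=lr)
    print(lr, w)   # the first step has a magnitude close to lr
```

Because the update is normalized by the second-moment estimate, the step size is governed mostly by the learning rate rather than by the raw gradient magnitude, which is one way to see why lowering the learning rate calms the training.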

In [21]:
train (0.0001,50,128,10,optimizer_method=tf.train.AdamOptimizer,activation_function=tf.nn.relu)

(?, 28, 28, 6)
(?, 14, 14, 6)
(?, 10, 10, 16)
(?, 5, 5, 16)
(?, 400)
Start Training!
(0.99321818, 0.98879999, 0.98790002, 1117.6969571113586)


After 50 epochs the result is not what we expected: the test accuracy is 98.79%, below the 99% target. We think that with dropout we might need different parameters, so we retrain with a smaller batch size.

In [22]:
train (0.0001,50,50,10,optimizer_method=tf.train.AdamOptimizer,activation_function=tf.nn.relu)

(?, 28, 28, 6)
(?, 14, 14, 6)
(?, 10, 10, 16)
(?, 5, 5, 16)
(?, 400)
Start Training!
(0.99893639, 0.99699998, 0.99070002, 1378.664743900299)


We managed to obtain more than 99% accuracy (99.07%) on the test set in 50 epochs, which is a really good result. This illustrates the idea behind the Adam optimizer: by using moment estimates of the gradient, it can improve the model in a short amount of time even with a low learning rate. Combined with the ReLU activation function, this allows us to achieve high accuracy.