Problem 1: Basics of Neural Networks

  • Learning Objective: In the entrance exam, we asked you to implement a K-NN classifier to classify tiny images extracted from the CIFAR-10 dataset. Many of you probably noticed that its performance was quite poor. In this problem, you are going to implement a basic multi-layer fully connected neural network to perform the same classification task.
  • Provided Code: We provide the skeletons of the classes you need to complete. Forward checks and gradient checks are provided for verifying your implementation as well.
  • TODOs: You are asked to implement the forward and backward passes for standard layers and loss functions, several widely used optimizers, and part of the training procedure. Finally, we want you to train a network from scratch on your own.

In [2]:
from lib.fully_conn import *
from lib.layer_utils import *
from lib.grad_check import *
from lib.datasets import *
from lib.optim import *
from lib.train import *
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

Loading the data (CIFAR-10)

Run the following code block to load in the properly split CIFAR-10 data.


In [3]:
data = CIFAR10_data()
for k, v in data.iteritems():
    print "Name: {} Shape: {}".format(k, v.shape)


Name: data_train Shape: (49000, 3, 32, 32)
Name: data_val Shape: (1000, 3, 32, 32)
Name: data_test Shape: (1000, 3, 32, 32)
Name: labels_train Shape: (49000,)
Name: labels_val Shape: (1000,)
Name: labels_test Shape: (1000,)

Implement Standard Layers

You will now implement the following standard layers commonly seen in a fully connected neural network. Please refer to the file layer_utils.py under the lib directory. Take a look at each class skeleton, and we will walk you through the network layer by layer. We provide pre-computed example results for checking the forward passes, as well as gradient checks for the backward passes.

FC Forward

In the class skeleton "fc", please complete the forward pass in function "forward", the input to the fc layer may not be of dimension (batch size, features size), it could be an image or any higher dimensional data. Make sure that you handle this dimensionality issue.


In [4]:
# Test the fc forward function
input_bz = 3
input_dim = (6, 5, 4)
output_dim = 4

input_size = input_bz * np.prod(input_dim)
weight_size = output_dim * np.prod(input_dim)

single_fc = fc(np.prod(input_dim), output_dim, init_scale=0.02, name="fc_test")

x = np.linspace(-0.1, 0.5, num=input_size).reshape(input_bz, *input_dim)
w = np.linspace(-0.2, 0.3, num=weight_size).reshape(np.prod(input_dim), output_dim)
b = np.linspace(-0.3, 0.1, num=output_dim)

single_fc.params[single_fc.w_name] = w
single_fc.params[single_fc.b_name] = b

out = single_fc.forward(x)

correct_out = np.array([[0.70157129, 0.83483484, 0.96809839, 1.10136194],
                        [1.86723094, 2.02561647, 2.18400199, 2.34238752],
                        [3.0328906,  3.2163981,  3.3999056,  3.5834131]])

# Compare your output with the above pre-computed ones. 
# The difference should not be larger than 1e-8
print "Difference: ", rel_error(out, correct_out)


Difference:  2.48539291792e-09

FC Backward

Please complete the function "backward" as the backward pass of the fc layer. Follow the instructions in the comments to store gradients into the predefined dictionaries in the attributes of the class. Parameters of the layer are also stored in the predefined dictionary.


In [5]:
# Test the fc backward function
x = np.random.randn(10, 2, 2, 3)
w = np.random.randn(12, 10)
b = np.random.randn(10)
dout = np.random.randn(10, 10)

single_fc = fc(np.prod(x.shape[1:]), 10, init_scale=5e-2, name="fc_test")
single_fc.params[single_fc.w_name] = w
single_fc.params[single_fc.b_name] = b

dx_num = eval_numerical_gradient_array(lambda x: single_fc.forward(x), x, dout)
dw_num = eval_numerical_gradient_array(lambda w: single_fc.forward(x), w, dout)
db_num = eval_numerical_gradient_array(lambda b: single_fc.forward(x), b, dout)

out = single_fc.forward(x)
dx = single_fc.backward(dout)
dw = single_fc.grads[single_fc.w_name]
db = single_fc.grads[single_fc.b_name]

# The error should be around 1e-10
print "dx Error: ", rel_error(dx_num, dx)
print "dw Error: ", rel_error(dw_num, dw)
print "db Error: ", rel_error(db_num, db)


dx Error:  1.51731269817e-09
dw Error:  3.15831244664e-10
db Error:  3.47686254918e-11

ReLU Forward

In the class skeleton "relu", please complete the forward pass.


In [6]:
# Test the relu forward function
x = np.linspace(-1.0, 1.0, num=12).reshape(3, 4)
relu_f = relu(name="relu_f")

out = relu_f.forward(x)
correct_out = np.array([[0.,          0.,        0.,         0.        ],
                        [0.,          0.,        0.09090909, 0.27272727],
                        [0.45454545, 0.63636364, 0.81818182, 1.        ]])

# Compare your output with the above pre-computed ones. 
# The difference should not be larger than 1e-8
print "Difference: ", rel_error(out, correct_out)


Difference:  5.00000005012e-09

ReLU Backward

Please complete the backward pass of the class relu.
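
The backward pass only routes gradients through the positions where the cached forward input was positive; a standalone sketch:

import numpy as np

def relu_backward_sketch(dout, x):
    # Gradient flows only where the forward input was positive.
    return dout * (x > 0)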


In [7]:
# Test the relu backward function
x = np.random.randn(10, 10)
dout = np.random.randn(*x.shape)
relu_b = relu(name="relu_b")

dx_num = eval_numerical_gradient_array(lambda x: relu_b.forward(x), x, dout)

out = relu_b.forward(x)
dx = relu_b.backward(dout)

# The error should not be larger than 1e-10
print "dx Error: ", rel_error(dx_num, dx)


dx Error:  3.27562616389e-12

Dropout Forward

In the class "dropout", please complete the forward pass. Remember that the dropout is only applied during training phase, you should pay attention to this while implementing the function.


In [8]:
x = np.random.randn(100, 100) + 5.0

print "----------------------------------------------------------------"
for p in [0.25, 0.50, 0.75]:
    dropout_f = dropout(p)
    out = dropout_f.forward(x, True)
    out_test = dropout_f.forward(x, False)

    print "Dropout p = ", p
    print "Mean of input: ", x.mean()
    print "Mean of output during training time: ", out.mean()
    print "Mean of output during testing time: ", out_test.mean()
    print "Fraction of output set to zero during training time: ", (out == 0).mean()
    print "Fraction of output set to zero during testing time: ", (out_test == 0).mean()
    print "----------------------------------------------------------------"


----------------------------------------------------------------
Dropout p =  0.25
Mean of input:  4.9987432398
Mean of output during training time:  4.95736273107
Mean of output during testing time:  4.9987432398
Fraction of output set to zero during training time:  0.7496
Fraction of output set to zero during testing time:  0.0
----------------------------------------------------------------
Dropout p =  0.5
Mean of input:  4.9987432398
Mean of output during training time:  4.97714637955
Mean of output during testing time:  4.9987432398
Fraction of output set to zero during training time:  0.5027
Fraction of output set to zero during testing time:  0.0
----------------------------------------------------------------
Dropout p =  0.75
Mean of input:  4.9987432398
Mean of output during training time:  4.97382899296
Mean of output during testing time:  4.9987432398
Fraction of output set to zero during training time:  0.2535
Fraction of output set to zero during testing time:  0.0
----------------------------------------------------------------

Dropout Backward

Please complete the backward pass. Again, remember that dropout is only applied during the training phase; handle this in the backward pass as well.
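
The backward pass reuses the cached mask at training time and is the identity at test time; a standalone sketch consistent with the forward sketch above:

def dropout_backward_sketch(dout, mask, is_training):
    # Apply the same (already 1/p-scaled) mask used in the forward pass.
    if not is_training or mask is None:
        return dout
    return dout * mask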


In [9]:
x = np.random.randn(5, 5) + 5
dout = np.random.randn(*x.shape)

p = 0.75
dropout_b = dropout(p, seed=100)
out = dropout_b.forward(x, True)
dx = dropout_b.backward(dout)
dx_num = eval_numerical_gradient_array(lambda xx: dropout_b.forward(xx, True), x, dout)

# The error should not be larger than 1e-9
print 'dx relative error: ', rel_error(dx, dx_num)


dx relative error:  3.00311576483e-11

Testing cascaded layers: FC + ReLU

Please find TestFCReLU in fully_conn.py under the lib directory.
You only need to complete a few lines of code in the TODO block.
Design an FC --> ReLU two-layer mini-network whose parameters match the given x, w, and b.
Insert the names you defined for each layer into param_name_w and param_name_b respectively.
You only modify the param_name part; the _w and _b suffixes are assigned automatically during network setup.
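
One possible layer stack is sketched below. It assumes the wildcard imports from the first cell and that sequential accepts several layers as positional arguments (only the single-layer form is confirmed later in this notebook); naming the fc layer "fc" is what makes its parameter keys "fc_w" and "fc_b", matching the assignments in the test cell below.

# A sketch only, not necessarily how your fully_conn.py organizes the TODO block.
tiny_stack = sequential(
    fc(12, 10, init_scale=5e-2, name="fc"),  # 12 = 3*4 input features, 10 outputs
    relu(name="relu")
)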


In [10]:
x = np.random.randn(2, 3, 4)  # the input features
w = np.random.randn(12, 10)   # the weight of fc layer
b = np.random.randn(10)       # the bias of fc layer
dout = np.random.randn(2, 10) # the gradients to the output, notice the shape

tiny_net = TestFCReLU()

tiny_net.net.assign("fc_w", w)
tiny_net.net.assign("fc_b", b)

out = tiny_net.forward(x)
dx = tiny_net.backward(dout)

dw = tiny_net.net.get_grads("fc_w")
db = tiny_net.net.get_grads("fc_b")

dx_num = eval_numerical_gradient_array(lambda x: tiny_net.forward(x), x, dout)
dw_num = eval_numerical_gradient_array(lambda w: tiny_net.forward(x), w, dout)
db_num = eval_numerical_gradient_array(lambda b: tiny_net.forward(x), b, dout)

# The errors should not be larger than 1e-7
print "dx error: ", rel_error(dx_num, dx)
print "dw error: ", rel_error(dw_num, dw)
print "db error: ", rel_error(db_num, db)


dx error:  2.70712836868e-10
dw error:  4.92451202602e-10
db error:  3.27562531714e-12

SoftMax Function and Loss Layer

In layer_utils.py, please first complete the function softmax, which will be used in the function cross_entropy. Please refer to the lecture slides for the mathematical expression of the cross entropy loss function, and complete its forward pass and backward pass.
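
Here is a standalone sketch of the two pieces; the max-subtraction keeps the exponentials numerically stable, and the backward pass is the usual (softmax - one_hot) / N expression (written as plain functions rather than the cross_entropy class):

import numpy as np

def softmax_sketch(x):
    # Row-wise softmax with max-subtraction for numerical stability.
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def cross_entropy_sketch(x, y):
    # Forward: average negative log-probability of the correct classes.
    # Backward: gradient of that loss with respect to the scores x.
    n = x.shape[0]
    probs = softmax_sketch(x)
    loss = -np.mean(np.log(probs[np.arange(n), y]))
    dx = probs.copy()
    dx[np.arange(n), y] -= 1.0
    dx /= n
    return loss, dx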


In [11]:
num_classes, num_inputs = 5, 50
x = 0.001 * np.random.randn(num_inputs, num_classes)
y = np.random.randint(num_classes, size=num_inputs)

test_loss = cross_entropy()

dx_num = eval_numerical_gradient(lambda x: test_loss.forward(x, y), x, verbose=False)

loss = test_loss.forward(x, y)
dx = test_loss.backward()

# Test softmax_loss function. Loss should be around 1.609
# and dx error should be at the scale of 1e-8 (or smaller)
print "Cross Entropy Loss: ", loss
print "dx error: ", rel_error(dx_num, dx)


Cross Entropy Loss:  1.60945846468
dx error:  2.77559987163e-09

Test a Small Fully Connected Network

Please find SmallFullyConnectedNetwork in fully_conn.py under the lib directory.
Again, you only need to complete a few lines of code in the TODO block.
Design an FC --> ReLU --> FC --> ReLU network where the shapes of the parameters match the given shapes.
Insert the names you defined for each layer into param_name_w and param_name_b respectively.
As before, you only modify the param_name part; the _w and _b suffixes are assigned automatically during network setup.


In [12]:
model = SmallFullyConnectedNetwork()
loss_func = cross_entropy()

N, D, = 4, 4  # N: batch size, D: input dimension
H, C  = 30, 7 # H: hidden dimension, C: output dimension
std = 0.02
x = np.random.randn(N, D)

y = np.random.randint(C, size=N)

print "Testing initialization ... "
w1_std = abs(model.net.get_params("fc1_w").std() - std)
b1 = model.net.get_params("fc1_b").std()
w2_std = abs(model.net.get_params("fc2_w").std() - std)
b2 = model.net.get_params("fc2_b").std()

assert w1_std < std / 10, "First layer weights do not seem right"
assert np.all(b1 == 0), "First layer biases do not seem right"
assert w2_std < std / 10, "Second layer weights do not seem right"
assert np.all(b2 == 0), "Second layer biases do not seem right"
print "Passed!"

print "Testing test-time forward pass ... "
w1 = np.linspace(-0.7, 0.3, num=D*H).reshape(D, H)
w2 = np.linspace(-0.3, 0.4, num=H*C).reshape(H, C)
b1 = np.linspace(-0.1, 0.9, num=H)
b2 = np.linspace(-0.9, 0.1, num=C)

model.net.assign("fc1_w", w1)
model.net.assign("fc1_b", b1)
model.net.assign("fc2_w", w2)
model.net.assign("fc2_b", b2)

feats = np.linspace(-5.5, 4.5, num=N*D).reshape(D, N).T
scores = model.forward(feats)
correct_scores = np.asarray([[4.20670862, 4.87188359, 5.53705856, 6.20223352, 6.86740849, 7.53258346, 8.19775843],
                             [4.74826036, 5.35984681, 5.97143326, 6.58301972, 7.19460617, 7.80619262, 8.41777907],
                             [5.2898121,  5.84781003, 6.40580797, 6.96380591, 7.52180384, 8.07980178, 8.63779971],
                             [5.83136384, 6.33577326, 6.84018268, 7.3445921,  7.84900151, 8.35341093, 8.85782035]])
scores_diff = np.sum(np.abs(scores - correct_scores))
assert scores_diff < 1e-6, "Your implementation might have gone wrong!"
print "Passed!"

print "Testing the loss ...",
y = np.asarray([0, 5, 1, 4])
loss = loss_func.forward(scores, y)
dLoss = loss_func.backward()
correct_loss = 2.90181552716
assert abs(loss - correct_loss) < 1e-10, "Your implementation might have gone wrong!"
print "Passed!"

print "Testing the gradients (error should be no larger than 1e-7) ..."
din = model.backward(dLoss)
for layer in model.net.layers:
    if not layer.params:
        continue
    for name in sorted(layer.grads):
        f = lambda _: loss_func.forward(model.forward(feats), y)
        grad_num = eval_numerical_gradient(f, layer.params[name], verbose=False)
        print '%s relative error: %.2e' % (name, rel_error(grad_num, layer.grads[name]))


Testing initialization ... 
Passed!
Testing test-time forward pass ... 
Passed!
Testing the loss ... Passed!
Testing the gradients (error should be no larger than 1e-7) ...
fc1_b relative error: 2.85e-09
fc1_w relative error: 5.01e-09
fc2_b relative error: 4.33e-07
fc2_w relative error: 2.59e-09

Test a Fully Connected Network regularized with Dropout

Please find DropoutNet in fully_conn.py under the lib directory.
For this part you don't need to design a new network; simply run the following test code.
If something goes wrong, you might want to double-check your dropout implementation.


In [13]:
N, D, C = 3, 15, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))
seed = 123

for dropout_p in [0., 0.25, 0.5]:
    print "Dropout p =", dropout_p
    model = DropoutNet(dropout_p=dropout_p, seed=seed)
    loss_func = cross_entropy()
    output = model.forward(X, True)
    loss = loss_func.forward(output, y)
    dLoss = loss_func.backward()
    dX = model.backward(dLoss)
    grads = model.net.grads
    print "Loss (should be ~2.30) : ", loss

    print "Error of gradients should be no larger than 1e-5"
    for name in sorted(model.net.params):
        f = lambda _: loss_func.forward(model.forward(X, True), y)
        grad_num = eval_numerical_gradient(f, model.net.params[name], verbose=False, h=1e-5)
        print "{} relative error: {}".format(name, rel_error(grad_num, grads[name]))
    print


Dropout p = 0.0
Loss (should be ~2.30) :  2.30285163514
Error of gradients should be no larger than 1e-5
fc1_b relative error: 2.19071174494e-08
fc1_w relative error: 1.68428086253e-06
fc2_b relative error: 3.66612391182e-09
fc2_w relative error: 4.23158548789e-06
fc3_b relative error: 1.44153927065e-10
fc3_w relative error: 1.22245170674e-07

Dropout p = 0.25
Loss (should be ~2.30) :  2.30423200279
Error of gradients should be no larger than 1e-5
fc1_b relative error: 1.03459579487e-07
fc1_w relative error: 9.65553469588e-06
fc2_b relative error: 8.23253678335e-08
fc2_w relative error: 1.46154121621e-06
fc3_b relative error: 1.47721326322e-10
fc3_w relative error: 1.87136171565e-08

Dropout p = 0.5
Loss (should be ~2.30) :  2.29994690876
Error of gradients should be no larger than 1e-5
fc1_b relative error: 1.00992962439e-07
fc1_w relative error: 5.24658134624e-06
fc2_b relative error: 1.67052460093e-08
fc2_w relative error: 1.41351178584e-05
fc3_b relative error: 2.84814660514e-10
fc3_w relative error: 1.26080763129e-07

Training a Network

In this section, we define a TinyNet class for you; fill in its TODO block in fully_conn.py.

  • Please design a two-layer fully connected network for this part.
  • Please read train.py under the lib directory carefully and complete the TODO blocks in the train_net function first (a rough sketch of one training iteration is shown right after this list).
  • In addition, read how SGD is implemented in optim.py; you will be asked to complete three other optimization methods in the later sections.
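
Below is a rough sketch of what a single training iteration inside train_net might look like, assuming the model / loss / optimizer interfaces used elsewhere in this notebook; the real function also has to handle epochs, accuracy bookkeeping, and learning-rate decay.

import numpy as np

def train_iteration_sketch(model, loss_func, optimizer, data_train, labels_train, batch_size):
    # Sample a random minibatch, run forward and backward, then let the optimizer update.
    idx = np.random.choice(data_train.shape[0], batch_size, replace=False)
    x_batch, y_batch = data_train[idx], labels_train[idx]
    scores = model.forward(x_batch)
    loss = loss_func.forward(scores, y_batch)
    dscores = loss_func.backward()
    model.backward(dscores)   # fills each layer's grads dictionary
    optimizer.step()          # applies the update rule to every parameter
    return loss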

In [14]:
# Arrange the data
data_dict = {
    "data_train": (data["data_train"], data["labels_train"]),
    "data_val": (data["data_val"], data["labels_val"]),
    "data_test": (data["data_test"], data["labels_test"])
}

In [25]:
model = TinyNet()
loss_f = cross_entropy()
optimizer = SGD(model.net, 1e-4)

Now train the network to achieve at least 50% validation accuracy


In [26]:
results = None
#############################################################################
# TODO: Use the train_net function you completed to train a network         #
#############################################################################
results = train_net(data_dict, model, loss_f, optimizer, batch_size=16, 
                        max_epochs=50, show_every=1000, verbose=True, lr_decay=0.5, lr_decay_every=8)
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
opt_params, loss_hist, train_acc_hist, val_acc_hist = results


(Iteration 1 / 153100) loss: 22.8441509676
(Iteration 1001 / 153100) loss: 3.30073984619
(Iteration 2001 / 153100) loss: 1.60817619153
(Iteration 3001 / 153100) loss: 2.22223511671
(Epoch 1 / 50) Training Accuracy: 0.367775510204, Validation Accuracy: 0.387
(Iteration 4001 / 153100) loss: 1.98538378627
(Iteration 5001 / 153100) loss: 1.74472623109
(Iteration 6001 / 153100) loss: 2.68107714369
(Epoch 2 / 50) Training Accuracy: 0.373857142857, Validation Accuracy: 0.364
(Iteration 7001 / 153100) loss: 2.3226085906
(Iteration 8001 / 153100) loss: 2.61800326614
(Iteration 9001 / 153100) loss: 1.24355899473
(Epoch 3 / 50) Training Accuracy: 0.420734693878, Validation Accuracy: 0.393
(Iteration 10001 / 153100) loss: 2.15581714905
(Iteration 11001 / 153100) loss: 2.46185524392
(Iteration 12001 / 153100) loss: 1.4023878393
(Epoch 4 / 50) Training Accuracy: 0.397673469388, Validation Accuracy: 0.374
(Iteration 13001 / 153100) loss: 2.26579605478
(Iteration 14001 / 153100) loss: 1.12424368603
(Iteration 15001 / 153100) loss: 1.59905225049
(Epoch 5 / 50) Training Accuracy: 0.456959183673, Validation Accuracy: 0.434
(Iteration 16001 / 153100) loss: 1.83704617588
(Iteration 17001 / 153100) loss: 1.27736505903
(Iteration 18001 / 153100) loss: 1.30614524213
(Epoch 6 / 50) Training Accuracy: 0.460142857143, Validation Accuracy: 0.452
(Iteration 19001 / 153100) loss: 1.99192933683
(Iteration 20001 / 153100) loss: 1.51874659992
(Iteration 21001 / 153100) loss: 1.86293611544
(Epoch 7 / 50) Training Accuracy: 0.470469387755, Validation Accuracy: 0.416
(Iteration 22001 / 153100) loss: 2.11758482231
(Iteration 23001 / 153100) loss: 1.35604903872
(Iteration 24001 / 153100) loss: 1.47198748142
(Epoch 8 / 50) Training Accuracy: 0.486183673469, Validation Accuracy: 0.428
Decaying learning rate of the optimizer to 5e-05
(Iteration 25001 / 153100) loss: 1.74682628444
(Iteration 26001 / 153100) loss: 0.74627990418
(Iteration 27001 / 153100) loss: 1.93814566184
(Epoch 9 / 50) Training Accuracy: 0.512, Validation Accuracy: 0.463
(Iteration 28001 / 153100) loss: 1.58591946287
(Iteration 29001 / 153100) loss: 1.34040372426
(Iteration 30001 / 153100) loss: 1.19354453876
(Epoch 10 / 50) Training Accuracy: 0.530530612245, Validation Accuracy: 0.478
(Iteration 31001 / 153100) loss: 1.38698416414
(Iteration 32001 / 153100) loss: 1.41702135394
(Iteration 33001 / 153100) loss: 1.56109746114
(Epoch 11 / 50) Training Accuracy: 0.534367346939, Validation Accuracy: 0.468
(Iteration 34001 / 153100) loss: 1.55234759996
(Iteration 35001 / 153100) loss: 1.86948425501
(Iteration 36001 / 153100) loss: 1.35686685495
(Epoch 12 / 50) Training Accuracy: 0.526265306122, Validation Accuracy: 0.461
(Iteration 37001 / 153100) loss: 1.99123271256
(Iteration 38001 / 153100) loss: 1.6725241687
(Iteration 39001 / 153100) loss: 2.04182158936
(Epoch 13 / 50) Training Accuracy: 0.543530612245, Validation Accuracy: 0.466
(Iteration 40001 / 153100) loss: 0.876656700603
(Iteration 41001 / 153100) loss: 1.73521999091
(Iteration 42001 / 153100) loss: 1.17208523131
(Epoch 14 / 50) Training Accuracy: 0.53887755102, Validation Accuracy: 0.479
(Iteration 43001 / 153100) loss: 1.24753967442
(Iteration 44001 / 153100) loss: 1.23301693946
(Iteration 45001 / 153100) loss: 1.2437118343
(Epoch 15 / 50) Training Accuracy: 0.525612244898, Validation Accuracy: 0.467
(Iteration 46001 / 153100) loss: 1.0942041391
(Iteration 47001 / 153100) loss: 1.02816975553
(Iteration 48001 / 153100) loss: 1.14790407232
(Epoch 16 / 50) Training Accuracy: 0.538020408163, Validation Accuracy: 0.456
Decaying learning rate of the optimizer to 2.5e-05
(Iteration 49001 / 153100) loss: 1.08027394485
(Iteration 50001 / 153100) loss: 1.69340875714
(Iteration 51001 / 153100) loss: 1.39914833493
(Iteration 52001 / 153100) loss: 1.07147156253
(Epoch 17 / 50) Training Accuracy: 0.575, Validation Accuracy: 0.502
(Iteration 53001 / 153100) loss: 1.68710517263
(Iteration 54001 / 153100) loss: 0.854522062
(Iteration 55001 / 153100) loss: 0.985444654947
(Epoch 18 / 50) Training Accuracy: 0.571897959184, Validation Accuracy: 0.483
(Iteration 56001 / 153100) loss: 1.6255403968
(Iteration 57001 / 153100) loss: 1.54995651456
(Iteration 58001 / 153100) loss: 1.6087415827
(Epoch 19 / 50) Training Accuracy: 0.583775510204, Validation Accuracy: 0.492
(Iteration 59001 / 153100) loss: 0.77159527359
(Iteration 60001 / 153100) loss: 1.12082477351
(Iteration 61001 / 153100) loss: 1.46274028551
(Epoch 20 / 50) Training Accuracy: 0.582142857143, Validation Accuracy: 0.492
(Iteration 62001 / 153100) loss: 0.840749256016
(Iteration 63001 / 153100) loss: 1.21183400307
(Iteration 64001 / 153100) loss: 0.782648430883
(Epoch 21 / 50) Training Accuracy: 0.585204081633, Validation Accuracy: 0.495
(Iteration 65001 / 153100) loss: 1.37154695102
(Iteration 66001 / 153100) loss: 1.69035914557
(Iteration 67001 / 153100) loss: 1.00887774669
(Epoch 22 / 50) Training Accuracy: 0.587857142857, Validation Accuracy: 0.481
(Iteration 68001 / 153100) loss: 1.11336440274
(Iteration 69001 / 153100) loss: 0.969121761065
(Iteration 70001 / 153100) loss: 1.59290369013
(Epoch 23 / 50) Training Accuracy: 0.586040816327, Validation Accuracy: 0.491
(Iteration 71001 / 153100) loss: 1.48588778729
(Iteration 72001 / 153100) loss: 1.16045495039
(Iteration 73001 / 153100) loss: 1.30070404628
(Epoch 24 / 50) Training Accuracy: 0.591959183673, Validation Accuracy: 0.496
Decaying learning rate of the optimizer to 1.25e-05
(Iteration 74001 / 153100) loss: 1.14398332456
(Iteration 75001 / 153100) loss: 0.959848872552
(Iteration 76001 / 153100) loss: 0.974254604799
(Epoch 25 / 50) Training Accuracy: 0.599530612245, Validation Accuracy: 0.496
(Iteration 77001 / 153100) loss: 1.38228498903
(Iteration 78001 / 153100) loss: 1.0772959738
(Iteration 79001 / 153100) loss: 1.62090054015
(Epoch 26 / 50) Training Accuracy: 0.599448979592, Validation Accuracy: 0.498
(Iteration 80001 / 153100) loss: 1.72259480549
(Iteration 81001 / 153100) loss: 1.01204384928
(Iteration 82001 / 153100) loss: 1.01587440981
(Epoch 27 / 50) Training Accuracy: 0.599224489796, Validation Accuracy: 0.501
(Iteration 83001 / 153100) loss: 1.12017304812
(Iteration 84001 / 153100) loss: 0.639997825758
(Iteration 85001 / 153100) loss: 1.18385802514
(Epoch 28 / 50) Training Accuracy: 0.603734693878, Validation Accuracy: 0.501
(Iteration 86001 / 153100) loss: 1.06913910287
(Iteration 87001 / 153100) loss: 1.35555670953
(Iteration 88001 / 153100) loss: 1.25593512054
(Epoch 29 / 50) Training Accuracy: 0.603428571429, Validation Accuracy: 0.49
(Iteration 89001 / 153100) loss: 1.09771914737
(Iteration 90001 / 153100) loss: 0.920128267352
(Iteration 91001 / 153100) loss: 1.40744935596
(Epoch 30 / 50) Training Accuracy: 0.603816326531, Validation Accuracy: 0.498
(Iteration 92001 / 153100) loss: 0.632733999172
(Iteration 93001 / 153100) loss: 1.58875441147
(Iteration 94001 / 153100) loss: 1.28420077083
(Epoch 31 / 50) Training Accuracy: 0.601612244898, Validation Accuracy: 0.497
(Iteration 95001 / 153100) loss: 0.885135589679
(Iteration 96001 / 153100) loss: 0.952678117263
(Iteration 97001 / 153100) loss: 0.899451033905
(Epoch 32 / 50) Training Accuracy: 0.602510204082, Validation Accuracy: 0.5
Decaying learning rate of the optimizer to 6.25e-06
(Iteration 98001 / 153100) loss: 1.51170021817
(Iteration 99001 / 153100) loss: 0.99842263885
(Iteration 100001 / 153100) loss: 1.16622477518
(Iteration 101001 / 153100) loss: 1.07462564556
(Epoch 33 / 50) Training Accuracy: 0.610408163265, Validation Accuracy: 0.49
(Iteration 102001 / 153100) loss: 0.811777801168
(Iteration 103001 / 153100) loss: 0.954714708783
(Iteration 104001 / 153100) loss: 0.859259111869
(Epoch 34 / 50) Training Accuracy: 0.610346938776, Validation Accuracy: 0.502
(Iteration 105001 / 153100) loss: 0.545366633501
(Iteration 106001 / 153100) loss: 0.920538368189
(Iteration 107001 / 153100) loss: 1.08317714149
(Epoch 35 / 50) Training Accuracy: 0.613387755102, Validation Accuracy: 0.516
(Iteration 108001 / 153100) loss: 1.12632609768
(Iteration 109001 / 153100) loss: 1.51888545893
(Iteration 110001 / 153100) loss: 1.03852019196
(Epoch 36 / 50) Training Accuracy: 0.611734693878, Validation Accuracy: 0.511
(Iteration 111001 / 153100) loss: 1.72066769103
(Iteration 112001 / 153100) loss: 0.799832060279
(Iteration 113001 / 153100) loss: 1.10607928026
(Epoch 37 / 50) Training Accuracy: 0.613387755102, Validation Accuracy: 0.517
(Iteration 114001 / 153100) loss: 1.09634841227
(Iteration 115001 / 153100) loss: 0.789572544272
(Iteration 116001 / 153100) loss: 1.12421650857
(Epoch 38 / 50) Training Accuracy: 0.613102040816, Validation Accuracy: 0.5
(Iteration 117001 / 153100) loss: 1.27171812372
(Iteration 118001 / 153100) loss: 0.821170476035
(Iteration 119001 / 153100) loss: 1.36715685207
(Epoch 39 / 50) Training Accuracy: 0.615081632653, Validation Accuracy: 0.513
(Iteration 120001 / 153100) loss: 1.51941419848
(Iteration 121001 / 153100) loss: 1.04189679706
(Iteration 122001 / 153100) loss: 1.0775106238
(Epoch 40 / 50) Training Accuracy: 0.615163265306, Validation Accuracy: 0.508
Decaying learning rate of the optimizer to 3.125e-06
(Iteration 123001 / 153100) loss: 1.01355468393
(Iteration 124001 / 153100) loss: 0.953953992563
(Iteration 125001 / 153100) loss: 1.176885487
(Epoch 41 / 50) Training Accuracy: 0.61706122449, Validation Accuracy: 0.512
(Iteration 126001 / 153100) loss: 1.25524352641
(Iteration 127001 / 153100) loss: 0.777832674004
(Iteration 128001 / 153100) loss: 1.05172535368
(Epoch 42 / 50) Training Accuracy: 0.615816326531, Validation Accuracy: 0.506
(Iteration 129001 / 153100) loss: 1.34407216817
(Iteration 130001 / 153100) loss: 1.04350206796
(Iteration 131001 / 153100) loss: 0.777110843238
(Epoch 43 / 50) Training Accuracy: 0.616387755102, Validation Accuracy: 0.506
(Iteration 132001 / 153100) loss: 0.925736486041
(Iteration 133001 / 153100) loss: 1.13747859148
(Iteration 134001 / 153100) loss: 0.9661658639
(Epoch 44 / 50) Training Accuracy: 0.617346938776, Validation Accuracy: 0.509
(Iteration 135001 / 153100) loss: 1.22901345823
(Iteration 136001 / 153100) loss: 1.0297890694
(Iteration 137001 / 153100) loss: 0.870378739903
(Epoch 45 / 50) Training Accuracy: 0.615816326531, Validation Accuracy: 0.51
(Iteration 138001 / 153100) loss: 1.17755646306
(Iteration 139001 / 153100) loss: 1.41603349973
(Iteration 140001 / 153100) loss: 0.8024104578
(Epoch 46 / 50) Training Accuracy: 0.618510204082, Validation Accuracy: 0.511
(Iteration 141001 / 153100) loss: 0.81845134543
(Iteration 142001 / 153100) loss: 0.630285208734
(Iteration 143001 / 153100) loss: 1.30389327052
(Epoch 47 / 50) Training Accuracy: 0.617448979592, Validation Accuracy: 0.509
(Iteration 144001 / 153100) loss: 1.07059457238
(Iteration 145001 / 153100) loss: 0.933370928657
(Iteration 146001 / 153100) loss: 1.48829022933
(Epoch 48 / 50) Training Accuracy: 0.617224489796, Validation Accuracy: 0.505
Decaying learning rate of the optimizer to 1.5625e-06
(Iteration 147001 / 153100) loss: 0.992827536652
(Iteration 148001 / 153100) loss: 0.727935177979
(Iteration 149001 / 153100) loss: 1.49129569757
(Iteration 150001 / 153100) loss: 1.24125069008
(Epoch 49 / 50) Training Accuracy: 0.617816326531, Validation Accuracy: 0.508
(Iteration 151001 / 153100) loss: 1.55125267897
(Iteration 152001 / 153100) loss: 0.73132023685
(Iteration 153001 / 153100) loss: 1.08929269003
(Epoch 50 / 50) Training Accuracy: 0.619571428571, Validation Accuracy: 0.509

In [27]:
# Take a look at what names of params were stored
print opt_params.keys()


['fc1_w', 'fc2_b', 'fc1_b', 'fc2_w']

In [28]:
# Demo: How to load the parameters to a newly defined network
model = TinyNet()
model.net.load(opt_params)
val_acc = compute_acc(model, data["data_val"], data["labels_val"])
print "Validation Accuracy: {}%".format(val_acc*100)
test_acc = compute_acc(model, data["data_test"], data["labels_test"])
print "Testing Accuracy: {}%".format(test_acc*100)


Loading Params: fc1_w Shape: (3072, 250)
Loading Params: fc1_b Shape: (250,)
Loading Params: fc2_b Shape: (10,)
Loading Params: fc2_w Shape: (250, 10)
Validation Accuracy: 51.7%
Testing Accuracy: 48.4%

In [29]:
# Plot the learning curves
plt.subplot(2, 1, 1)
plt.title('Training loss')
loss_hist_ = loss_hist[1::100] # subsample the curve a bit
plt.plot(loss_hist_, '-o')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(train_acc_hist, '-o', label='Training')
plt.plot(val_acc_hist, '-o', label='Validation')
plt.plot([0.5] * len(val_acc_hist), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)
plt.show()


Different Optimizers

There are several optimizers more advanced than vanilla SGD; you will implement three sophisticated and widely used methods in this section. Please complete the TODOs in optim.py under the lib directory.

SGD + Momentum

The update rule of SGD with momentum is shown below, where $v_t$ is the velocity, $\gamma$ the momentum coefficient, and $\eta$ the learning rate:
\begin{equation}
v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta), \qquad \theta = \theta - v_t
\end{equation}
Complete the SGDM() function in optim.py.
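
A standalone sketch of one SGDM parameter update is shown below. Note that the expected values in the test cell that follows correspond to the equivalent convention $v \leftarrow \gamma v - \eta \nabla_{\theta}J(\theta)$, $\theta \leftarrow \theta + v$ (the stored velocity simply has its sign flipped relative to the formula above), so the sketch uses that convention:

def sgdm_update_sketch(w, dw, velocity, lr, momentum):
    # v <- momentum * v - lr * dw;  w <- w + v
    velocity = momentum * velocity - lr * dw
    w = w + velocity
    return w, velocity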


In [30]:
# Re-initialize the model, the loss, and a vanilla SGD optimizer
model = TinyNet()
loss_f = cross_entropy()
optimizer = SGD(model.net, 1e-4)

In [31]:
# Test the implementation of SGD with Momentum
N, D = 4, 5
test_sgd = sequential(fc(N, D, name="sgd_fc"))

w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
v = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

test_sgd.layers[0].params = {"sgd_fc_w": w}
test_sgd.layers[0].grads = {"sgd_fc_w": dw}

test_sgd_momentum = SGDM(test_sgd, 1e-3, 0.9)
test_sgd_momentum.velocity = {"sgd_fc_w": v}
test_sgd_momentum.step()

updated_w = test_sgd.layers[0].params["sgd_fc_w"]
velocity = test_sgd_momentum.velocity["sgd_fc_w"]

expected_updated_w = np.asarray([
  [ 0.1406,      0.20738947,  0.27417895,  0.34096842,  0.40775789],
  [ 0.47454737,  0.54133684,  0.60812632,  0.67491579,  0.74170526],
  [ 0.80849474,  0.87528421,  0.94207368,  1.00886316,  1.07565263],
  [ 1.14244211,  1.20923158,  1.27602105,  1.34281053,  1.4096    ]])
expected_velocity = np.asarray([
  [ 0.5406,      0.55475789,  0.56891579, 0.58307368,  0.59723158],
  [ 0.61138947,  0.62554737,  0.63970526,  0.65386316,  0.66802105],
  [ 0.68217895,  0.69633684,  0.71049474,  0.72465263,  0.73881053],
  [ 0.75296842,  0.76712632,  0.78128421,  0.79544211,  0.8096    ]])

print 'updated_w error: ', rel_error(updated_w, expected_updated_w)
print 'velocity error: ', rel_error(expected_velocity, velocity)


updated_w error:  8.88234703351e-09
velocity error:  4.26928774328e-09

Run the following code block to train a multi-layer fully connected network with both SGD and SGD with Momentum. The network trained with the SGDM optimizer should converge faster.


In [33]:
# Arrange a small data
num_train = 4000
small_data_dict = {
    "data_train": (data["data_train"][:num_train], data["labels_train"][:num_train]),
    "data_val": (data["data_val"], data["labels_val"]),
    "data_test": (data["data_test"], data["labels_test"])
}

model_sgd      = FullyConnectedNetwork()
model_sgdm     = FullyConnectedNetwork()
loss_f_sgd     = cross_entropy()
loss_f_sgdm    = cross_entropy()
optimizer_sgd  = SGD(model_sgd.net, 1e-2)
optimizer_sgdm = SGDM(model_sgdm.net, 1e-2, 0.9)

print "Training with Vanilla SGD..."
results_sgd = train_net(small_data_dict, model_sgd, loss_f_sgd, optimizer_sgd, batch_size=100, 
                        max_epochs=5, show_every=100, verbose=True)

print "\nTraining with SGD plus Momentum..."
results_sgdm = train_net(small_data_dict, model_sgdm, loss_f_sgdm, optimizer_sgdm, batch_size=100, 
                         max_epochs=5, show_every=100, verbose=True)

opt_params_sgd,  loss_hist_sgd,  train_acc_hist_sgd,  val_acc_hist_sgd  = results_sgd
opt_params_sgdm, loss_hist_sgdm, train_acc_hist_sgdm, val_acc_hist_sgdm = results_sgdm

plt.subplot(3, 1, 1)
plt.title('Training loss')
plt.xlabel('Iteration')

plt.subplot(3, 1, 2)
plt.title('Training accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 3)
plt.title('Validation accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 1)
plt.plot(loss_hist_sgd, 'o', label="Vanilla SGD")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_sgd, '-o', label="Vanilla SGD")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_sgd, '-o', label="Vanilla SGD")
         
plt.subplot(3, 1, 1)
plt.plot(loss_hist_sgdm, 'o', label="SGD with Momentum")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_sgdm, '-o', label="SGD with Momentum")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_sgdm, '-o', label="SGD with Momentum")
  
for i in [1, 2, 3]:
  plt.subplot(3, 1, i)
  plt.legend(loc='upper center', ncol=4)
plt.gcf().set_size_inches(15, 15)
plt.show()


Training with Vanilla SGD...
(Iteration 1 / 200) loss: 2.70958764403
(Epoch 1 / 5) Training Accuracy: 0.27725, Validation Accuracy: 0.248
(Epoch 2 / 5) Training Accuracy: 0.3275, Validation Accuracy: 0.287
(Iteration 101 / 200) loss: 1.66009228487
(Epoch 3 / 5) Training Accuracy: 0.36625, Validation Accuracy: 0.312
(Epoch 4 / 5) Training Accuracy: 0.382, Validation Accuracy: 0.32
(Epoch 5 / 5) Training Accuracy: 0.42175, Validation Accuracy: 0.333

Training with SGD plus Momentum...
(Iteration 1 / 200) loss: 2.75849278438
(Epoch 1 / 5) Training Accuracy: 0.3015, Validation Accuracy: 0.263
(Epoch 2 / 5) Training Accuracy: 0.3915, Validation Accuracy: 0.314
(Iteration 101 / 200) loss: 1.66893134699
(Epoch 3 / 5) Training Accuracy: 0.44275, Validation Accuracy: 0.346
(Epoch 4 / 5) Training Accuracy: 0.47875, Validation Accuracy: 0.355
(Epoch 5 / 5) Training Accuracy: 0.5385, Validation Accuracy: 0.395

RMSProp

The update rule of RMSProp is shown below, where $\gamma$ is the decay rate, $\epsilon$ a small constant, $g_t$ the gradient at update step $t$, $E[g^2]_t$ the decaying average of past squared gradients at step $t$, and $\eta$ the learning rate:
\begin{equation}
E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}\, g_t
\end{equation}
Complete the RMSProp() function in optim.py.
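
A standalone sketch of one RMSProp update, with an assumed default of eps = 1e-8:

import numpy as np

def rmsprop_update_sketch(w, dw, cache, lr, decay_rate, eps=1e-8):
    # Keep a decaying average of squared gradients, then take a per-parameter scaled step.
    cache = decay_rate * cache + (1 - decay_rate) * dw ** 2
    w = w - lr * dw / np.sqrt(cache + eps)
    return w, cache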


In [34]:
# Test RMSProp implementation; you should see errors less than 1e-7
N, D = 4, 5
test_rms = sequential(fc(N, D, name="rms_fc"))

w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
cache = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

test_rms.layers[0].params = {"rms_fc_w": w}
test_rms.layers[0].grads = {"rms_fc_w": dw}

opt_rms = RMSProp(test_rms, 1e-2, 0.99)
opt_rms.cache = {"rms_fc_w": cache}
opt_rms.step()

updated_w = test_rms.layers[0].params["rms_fc_w"]
cache = opt_rms.cache["rms_fc_w"]

expected_updated_w = np.asarray([
  [-0.39223849, -0.34037513, -0.28849239, -0.23659121, -0.18467247],
  [-0.132737,   -0.08078555, -0.02881884,  0.02316247,  0.07515774],
  [ 0.12716641,  0.17918792,  0.23122175,  0.28326742,  0.33532447],
  [ 0.38739248,  0.43947102,  0.49155973,  0.54365823,  0.59576619]])
expected_cache = np.asarray([
  [ 0.5976,      0.6126277,   0.6277108,   0.64284931,  0.65804321],
  [ 0.67329252,  0.68859723,  0.70395734,  0.71937285,  0.73484377],
  [ 0.75037008,  0.7659518,   0.78158892,  0.79728144,  0.81302936],
  [ 0.82883269,  0.84469141,  0.86060554,  0.87657507,  0.8926    ]])

print 'updated_w error: ', rel_error(expected_updated_w, updated_w)
print 'cache error: ', rel_error(expected_cache, opt_rms.cache["rms_fc_w"])


updated_w error:  9.50264522989e-08
cache error:  2.64779558072e-09

Adam

The update rule of Adam is shown below, where $g_t$ is the gradient at update step $t$, and $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates:
\begin{equation}
m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2
\end{equation}
\begin{equation}
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\, \hat{m}_t
\end{equation}
Complete the Adam() function in optim.py.
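
A standalone sketch of one Adam update, with assumed defaults beta1 = 0.9, beta2 = 0.999, and eps = 1e-8; the step counter t is assumed to be incremented before the update (the test below, constructed with t=5, appears consistent with using t=6 inside the step):

import numpy as np

def adam_update_sketch(w, dw, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment estimates, bias correction, then the parameter update.
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v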


In [35]:
# Test Adam implementation; you should see errors around 1e-7 or less
N, D = 4, 5
test_adam = sequential(fc(N, D, name="adam_fc"))

w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
m = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)
v = np.linspace(0.7, 0.5, num=N*D).reshape(N, D)

test_adam.layers[0].params = {"adam_fc_w": w}
test_adam.layers[0].grads = {"adam_fc_w": dw}

opt_adam = Adam(test_adam, 1e-2, 0.9, 0.999, t=5)
opt_adam.mt = {"adam_fc_w": m}
opt_adam.vt = {"adam_fc_w": v}
opt_adam.step()

updated_w = test_adam.layers[0].params["adam_fc_w"]
mt = opt_adam.mt["adam_fc_w"]
vt = opt_adam.vt["adam_fc_w"]

expected_updated_w = np.asarray([
  [-0.40094747, -0.34836187, -0.29577703, -0.24319299, -0.19060977],
  [-0.1380274,  -0.08544591, -0.03286534,  0.01971428,  0.0722929],
  [ 0.1248705,   0.17744702,  0.23002243,  0.28259667,  0.33516969],
  [ 0.38774145,  0.44031188,  0.49288093,  0.54544852,  0.59801459]])
expected_v = np.asarray([
  [ 0.69966,     0.68908382,  0.67851319,  0.66794809,  0.65738853,],
  [ 0.64683452,  0.63628604,  0.6257431,   0.61520571,  0.60467385,],
  [ 0.59414753,  0.58362676,  0.57311152,  0.56260183,  0.55209767,],
  [ 0.54159906,  0.53110598,  0.52061845,  0.51013645,  0.49966,   ]])
expected_m = np.asarray([
  [ 0.48,        0.49947368,  0.51894737,  0.53842105,  0.55789474],
  [ 0.57736842,  0.59684211,  0.61631579,  0.63578947,  0.65526316],
  [ 0.67473684,  0.69421053,  0.71368421,  0.73315789,  0.75263158],
  [ 0.77210526,  0.79157895,  0.81105263,  0.83052632,  0.85      ]])

print 'updated_w error: ', rel_error(expected_updated_w, updated_w)
print 'mt error: ', rel_error(expected_m, mt)
print 'vt error: ', rel_error(expected_v, vt)


updated_w error:  1.13956917985e-07
mt error:  4.21496319311e-09
vt error:  4.20831403811e-09

Comparing the optimizers

Run the following code block to compare the plotted results of all the optimizers above.


In [36]:
model_rms      = FullyConnectedNetwork()
model_adam     = FullyConnectedNetwork()
loss_f_rms     = cross_entropy()
loss_f_adam    = cross_entropy()
optimizer_rms  = RMSProp(model_rms.net, 5e-4)
optimizer_adam = Adam(model_adam.net, 5e-4)

print "Training with RMSProp..."
results_rms = train_net(small_data_dict, model_rms, loss_f_rms, optimizer_rms, batch_size=100, 
                        max_epochs=5, show_every=100, verbose=True)

print "\nTraining with Adam..."
results_adam = train_net(small_data_dict, model_adam, loss_f_adam, optimizer_adam, batch_size=100, 
                         max_epochs=5, show_every=100, verbose=True)

opt_params_rms,  loss_hist_rms,  train_acc_hist_rms,  val_acc_hist_rms  = results_rms
opt_params_adam, loss_hist_adam, train_acc_hist_adam, val_acc_hist_adam = results_adam

plt.subplot(3, 1, 1)
plt.title('Training loss')
plt.xlabel('Iteration')

plt.subplot(3, 1, 2)
plt.title('Training accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 3)
plt.title('Validation accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 1)
plt.plot(loss_hist_sgd, 'o', label="Vanilla SGD")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_sgd, '-o', label="Vanilla SGD")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_sgd, '-o', label="Vanilla SGD")
         
plt.subplot(3, 1, 1)
plt.plot(loss_hist_sgdm, 'o', label="SGD with Momentum")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_sgdm, '-o', label="SGD with Momentum")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_sgdm, '-o', label="SGD with Momentum")

plt.subplot(3, 1, 1)
plt.plot(loss_hist_rms, 'o', label="RMSProp")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_rms, '-o', label="RMSProp")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_rms, '-o', label="RMSProp")
         
plt.subplot(3, 1, 1)
plt.plot(loss_hist_adam, 'o', label="Adam")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_adam, '-o', label="Adam")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_adam, '-o', label="Adam")
  
for i in [1, 2, 3]:
  plt.subplot(3, 1, i)
  plt.legend(loc='upper center', ncol=4)
plt.gcf().set_size_inches(15, 15)
plt.show()


Training with RMSProp...
(Iteration 1 / 200) loss: 2.72317962586
(Epoch 1 / 5) Training Accuracy: 0.3725, Validation Accuracy: 0.339
(Epoch 2 / 5) Training Accuracy: 0.42075, Validation Accuracy: 0.338
(Iteration 101 / 200) loss: 1.49617506325
(Epoch 3 / 5) Training Accuracy: 0.47875, Validation Accuracy: 0.369
(Epoch 4 / 5) Training Accuracy: 0.54925, Validation Accuracy: 0.38
(Epoch 5 / 5) Training Accuracy: 0.60125, Validation Accuracy: 0.38

Training with Adam...
(Iteration 1 / 200) loss: 2.91650321033
(Epoch 1 / 5) Training Accuracy: 0.34175, Validation Accuracy: 0.286
(Epoch 2 / 5) Training Accuracy: 0.4295, Validation Accuracy: 0.353
(Iteration 101 / 200) loss: 1.63409537858
(Epoch 3 / 5) Training Accuracy: 0.50125, Validation Accuracy: 0.372
(Epoch 4 / 5) Training Accuracy: 0.5715, Validation Accuracy: 0.383
(Epoch 5 / 5) Training Accuracy: 0.62775, Validation Accuracy: 0.374

Training a Network with Dropout

Run the following code blocks to compare the results across different dropout settings.


In [49]:
# Train several identical nets, one for each dropout setting
num_train = 500
data_dict_500 = {
    "data_train": (data["data_train"][:num_train], data["labels_train"][:num_train]),
    "data_val": (data["data_val"], data["labels_val"]),
    "data_test": (data["data_test"], data["labels_test"])
}

solvers = {}
dropout_ps = [0, 0.25, 0.5, 0.75]  # you can try other dropout probabilities yourself

results_dict = {}
for dropout_p in dropout_ps:
    results_dict[dropout_p] = {}

for dropout_p in dropout_ps:
    print "Dropout =", dropout_p
    model = DropoutNetTest(dropout_p=dropout_p)
    loss_f = cross_entropy()
    optimizer = SGDM(model.net, 1e-4)
    results = train_net(data_dict_500, model, loss_f, optimizer, batch_size=100, 
                        max_epochs=20, show_every=100, verbose=True)
    opt_params, loss_hist, train_acc_hist, val_acc_hist = results
    results_dict[dropout_p] = {
        "opt_params": opt_params, 
        "loss_hist": loss_hist, 
        "train_acc_hist": train_acc_hist, 
        "val_acc_hist": val_acc_hist
    }


Dropout = 0
(Iteration 1 / 100) loss: 2.37784083924
(Epoch 1 / 20) Training Accuracy: 0.102, Validation Accuracy: 0.108
(Epoch 2 / 20) Training Accuracy: 0.124, Validation Accuracy: 0.113
(Epoch 3 / 20) Training Accuracy: 0.15, Validation Accuracy: 0.116
(Epoch 4 / 20) Training Accuracy: 0.168, Validation Accuracy: 0.127
(Epoch 5 / 20) Training Accuracy: 0.188, Validation Accuracy: 0.13
(Epoch 6 / 20) Training Accuracy: 0.212, Validation Accuracy: 0.14
(Epoch 7 / 20) Training Accuracy: 0.22, Validation Accuracy: 0.145
(Epoch 8 / 20) Training Accuracy: 0.244, Validation Accuracy: 0.158
(Epoch 9 / 20) Training Accuracy: 0.254, Validation Accuracy: 0.161
(Epoch 10 / 20) Training Accuracy: 0.262, Validation Accuracy: 0.167
(Epoch 11 / 20) Training Accuracy: 0.276, Validation Accuracy: 0.167
(Epoch 12 / 20) Training Accuracy: 0.286, Validation Accuracy: 0.173
(Epoch 13 / 20) Training Accuracy: 0.294, Validation Accuracy: 0.178
(Epoch 14 / 20) Training Accuracy: 0.302, Validation Accuracy: 0.182
(Epoch 15 / 20) Training Accuracy: 0.318, Validation Accuracy: 0.189
(Epoch 16 / 20) Training Accuracy: 0.328, Validation Accuracy: 0.192
(Epoch 17 / 20) Training Accuracy: 0.33, Validation Accuracy: 0.196
(Epoch 18 / 20) Training Accuracy: 0.332, Validation Accuracy: 0.197
(Epoch 19 / 20) Training Accuracy: 0.336, Validation Accuracy: 0.204
(Epoch 20 / 20) Training Accuracy: 0.346, Validation Accuracy: 0.213
Dropout = 0.25
(Iteration 1 / 100) loss: 3.33902418465
(Epoch 1 / 20) Training Accuracy: 0.12, Validation Accuracy: 0.09
(Epoch 2 / 20) Training Accuracy: 0.122, Validation Accuracy: 0.097
(Epoch 3 / 20) Training Accuracy: 0.14, Validation Accuracy: 0.109
(Epoch 4 / 20) Training Accuracy: 0.154, Validation Accuracy: 0.126
(Epoch 5 / 20) Training Accuracy: 0.168, Validation Accuracy: 0.131
(Epoch 6 / 20) Training Accuracy: 0.184, Validation Accuracy: 0.137
(Epoch 7 / 20) Training Accuracy: 0.194, Validation Accuracy: 0.138
(Epoch 8 / 20) Training Accuracy: 0.214, Validation Accuracy: 0.138
(Epoch 9 / 20) Training Accuracy: 0.23, Validation Accuracy: 0.149
(Epoch 10 / 20) Training Accuracy: 0.218, Validation Accuracy: 0.146
(Epoch 11 / 20) Training Accuracy: 0.216, Validation Accuracy: 0.152
(Epoch 12 / 20) Training Accuracy: 0.242, Validation Accuracy: 0.154
(Epoch 13 / 20) Training Accuracy: 0.246, Validation Accuracy: 0.16
(Epoch 14 / 20) Training Accuracy: 0.276, Validation Accuracy: 0.164
(Epoch 15 / 20) Training Accuracy: 0.278, Validation Accuracy: 0.172
(Epoch 16 / 20) Training Accuracy: 0.294, Validation Accuracy: 0.17
(Epoch 17 / 20) Training Accuracy: 0.304, Validation Accuracy: 0.166
(Epoch 18 / 20) Training Accuracy: 0.322, Validation Accuracy: 0.174
(Epoch 19 / 20) Training Accuracy: 0.318, Validation Accuracy: 0.17
(Epoch 20 / 20) Training Accuracy: 0.338, Validation Accuracy: 0.174
Dropout = 0.5
(Iteration 1 / 100) loss: 2.86609250853
(Epoch 1 / 20) Training Accuracy: 0.102, Validation Accuracy: 0.109
(Epoch 2 / 20) Training Accuracy: 0.108, Validation Accuracy: 0.118
(Epoch 3 / 20) Training Accuracy: 0.13, Validation Accuracy: 0.137
(Epoch 4 / 20) Training Accuracy: 0.132, Validation Accuracy: 0.141
(Epoch 5 / 20) Training Accuracy: 0.154, Validation Accuracy: 0.154
(Epoch 6 / 20) Training Accuracy: 0.17, Validation Accuracy: 0.16
(Epoch 7 / 20) Training Accuracy: 0.186, Validation Accuracy: 0.166
(Epoch 8 / 20) Training Accuracy: 0.198, Validation Accuracy: 0.171
(Epoch 9 / 20) Training Accuracy: 0.208, Validation Accuracy: 0.172
(Epoch 10 / 20) Training Accuracy: 0.212, Validation Accuracy: 0.177
(Epoch 11 / 20) Training Accuracy: 0.224, Validation Accuracy: 0.184
(Epoch 12 / 20) Training Accuracy: 0.232, Validation Accuracy: 0.196
(Epoch 13 / 20) Training Accuracy: 0.246, Validation Accuracy: 0.198
(Epoch 14 / 20) Training Accuracy: 0.26, Validation Accuracy: 0.2
(Epoch 15 / 20) Training Accuracy: 0.264, Validation Accuracy: 0.202
(Epoch 16 / 20) Training Accuracy: 0.268, Validation Accuracy: 0.207
(Epoch 17 / 20) Training Accuracy: 0.278, Validation Accuracy: 0.207
(Epoch 18 / 20) Training Accuracy: 0.274, Validation Accuracy: 0.207
(Epoch 19 / 20) Training Accuracy: 0.286, Validation Accuracy: 0.21
(Epoch 20 / 20) Training Accuracy: 0.292, Validation Accuracy: 0.21
Dropout = 0.75
(Iteration 1 / 100) loss: 2.46717341664
(Epoch 1 / 20) Training Accuracy: 0.108, Validation Accuracy: 0.102
(Epoch 2 / 20) Training Accuracy: 0.12, Validation Accuracy: 0.105
(Epoch 3 / 20) Training Accuracy: 0.14, Validation Accuracy: 0.117
(Epoch 4 / 20) Training Accuracy: 0.154, Validation Accuracy: 0.121
(Epoch 5 / 20) Training Accuracy: 0.166, Validation Accuracy: 0.128
(Epoch 6 / 20) Training Accuracy: 0.18, Validation Accuracy: 0.136
(Epoch 7 / 20) Training Accuracy: 0.194, Validation Accuracy: 0.143
(Epoch 8 / 20) Training Accuracy: 0.2, Validation Accuracy: 0.153
(Epoch 9 / 20) Training Accuracy: 0.216, Validation Accuracy: 0.161
(Epoch 10 / 20) Training Accuracy: 0.218, Validation Accuracy: 0.169
(Epoch 11 / 20) Training Accuracy: 0.238, Validation Accuracy: 0.18
(Epoch 12 / 20) Training Accuracy: 0.244, Validation Accuracy: 0.187
(Epoch 13 / 20) Training Accuracy: 0.264, Validation Accuracy: 0.194
(Epoch 14 / 20) Training Accuracy: 0.268, Validation Accuracy: 0.198
(Epoch 15 / 20) Training Accuracy: 0.284, Validation Accuracy: 0.205
(Epoch 16 / 20) Training Accuracy: 0.3, Validation Accuracy: 0.213
(Epoch 17 / 20) Training Accuracy: 0.312, Validation Accuracy: 0.211
(Epoch 18 / 20) Training Accuracy: 0.316, Validation Accuracy: 0.216
(Epoch 19 / 20) Training Accuracy: 0.326, Validation Accuracy: 0.225
(Epoch 20 / 20) Training Accuracy: 0.332, Validation Accuracy: 0.222

In [48]:
# Plot train and validation accuracies for each dropout setting
train_accs = []
val_accs = []
for dropout_p in dropout_ps:
    curr_dict = results_dict[dropout_p]
    train_accs.append(curr_dict["train_acc_hist"][-1])
    val_accs.append(curr_dict["val_acc_hist"][-1])

plt.subplot(3, 1, 1)
for dropout_p in dropout_ps:
    curr_dict = results_dict[dropout_p]
    plt.plot(curr_dict["train_acc_hist"], 'o', label='%.2f dropout' % dropout_p)
plt.title('Train accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')
  
plt.subplot(3, 1, 2)
for dropout_p in dropout_ps:
    curr_dict = results_dict[dropout_p]
    plt.plot(curr_dict["val_acc_hist"], 'o', label='%.2f dropout' % dropout_p)
plt.title('Val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')

plt.gcf().set_size_inches(15, 15)
plt.show()


Inline Question: Describe what you observe from the above results and graphs

Ans: It can be observed that as more units are dropped during training, the training accuracy (performance) decreases. This is expected, since we randomly drop neurons according to the keep probability. However, this leads to better performance on the validation data, since the model is able to generalize better to unseen examples. This is the classic case of using regularization to reduce overfitting.

Plot the Activation Functions

For each activation function, use the given lambda function template to plot its corresponding curve.


In [44]:
left, right = -10, 10
X  = np.linspace(left, right, 100)
XS = np.linspace(-5, 5, 10)
lw = 4
alpha = 0.1
elu_alpha = 0.5
selu_alpha = 1.6732
selu_scale = 1.0507

#########################
####### YOUR CODE #######
#########################
sigmoid = lambda x: 1 / (1 + np.exp(-x))
leaky_relu = np.vectorize(lambda x: x if x > 0 else alpha*x, otypes=[np.float128])
relu = np.vectorize(lambda x: x if x > 0 else 0.0,  otypes=[np.float128])
elu = np.vectorize(lambda x: x if x >=0 else elu_alpha*(np.exp((x)) - 1),  otypes=[np.float128])
selu = np.vectorize(lambda x: selu_scale*x if x > 0 else selu_scale*(selu_alpha*np.exp(x) - selu_alpha),  otypes=[np.float128])
tanh = lambda x: 2 / (1 + np.exp((-2*x))) - 1

#########################
### END OF YOUR CODE ####
#########################

activations = {
    "Sigmoid": sigmoid,
    "LeakyReLU": leaky_relu,
    "ReLU": relu,
    "ELU": elu,
    "SeLU": selu,
    "Tanh": tanh
}

# Ground Truth activations
GT_Act = {
    "Sigmoid": [0.00669285092428, 0.0200575365379, 0.0585369028744, 0.158869104881, 0.364576440742, 
                0.635423559258, 0.841130895119, 0.941463097126, 0.979942463462, 0.993307149076],
    "LeakyReLU": [-0.5, -0.388888888889, -0.277777777778, -0.166666666667, -0.0555555555556, 
                  0.555555555556, 1.66666666667, 2.77777777778, 3.88888888889, 5.0],
    "ReLU": [-0.0, -0.0, -0.0, -0.0, -0.0, 0.555555555556, 1.66666666667, 2.77777777778, 3.88888888889, 5.0],
    "ELU": [-0.4966310265, -0.489765962143, -0.468911737989, -0.405562198581, -0.213123289631, 
            0.555555555556, 1.66666666667, 2.77777777778, 3.88888888889, 5.0],
    "SeLU": [-1.74618571868, -1.72204772347, -1.64872296837, -1.42598202974, -0.749354802287, 
             0.583722222222, 1.75116666667, 2.91861111111, 4.08605555556, 5.2535],
    "Tanh": [-0.999909204263, -0.999162466631, -0.992297935288, -0.931109608668, -0.504672397722, 
             0.504672397722, 0.931109608668, 0.992297935288, 0.999162466631, 0.999909204263]
} 

for label in activations:
    fig = plt.figure(figsize=(4,4))
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(X, activations[label](X), color='darkorchid', lw=lw, label=label)
    assert rel_error(activations[label](XS), GT_Act[label]) < 1e-9, \
           "Your implementation of {} might be wrong".format(label)
    ax.legend(loc="lower right")
    ax.axhline(0, color='black')
    ax.axvline(0, color='black')
    ax.set_title('{}'.format(label), fontsize=14)
    plt.xlabel(r"X")
    plt.ylabel(r"Y")
    plt.show()


Phew! You're done with Problem 1 now, but 3 more to go... LOL