Fully-Connected Neural Nets

In the previous homework you implemented a fully-connected two-layer neural network on CIFAR-10. The implementation was simple but not very modular since the loss and gradient were computed in a single monolithic function. This is manageable for a simple two-layer network, but would become impractical as we move to bigger models. Ideally we want to build networks using a more modular design so that we can implement different layer types in isolation and then snap them together into models with different architectures.

In this exercise we will implement fully-connected networks using a more modular approach. For each layer we will implement a forward and a backward function. The forward function will receive inputs, weights, and other parameters and will return both an output and a cache object storing data needed for the backward pass, like this:

def layer_forward(x, w):
  """ Receive inputs x and weights w """
  # Do some computations ...
  z = None    # ... some intermediate value
  # Do some more computations ...
  out = None  # the output

  cache = (x, w, z, out) # Values we need to compute gradients

  return out, cache

The backward pass will receive upstream derivatives and the cache object, and will return gradients with respect to the inputs and weights, like this:

def layer_backward(dout, cache):
  """
  Receive derivative of loss with respect to outputs and cache,
  and compute derivative with respect to inputs.
  """
  # Unpack cache values
  x, w, z, out = cache

  # Use values in cache to compute derivatives
  dx = None  # Derivative of loss with respect to x
  dw = None  # Derivative of loss with respect to w

  return dx, dw

After implementing a bunch of layers this way, we will be able to easily combine them to build classifiers with different architectures.
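As a concrete illustration of this pattern, here is a tiny (hypothetical) elementwise-scaling layer written in the same style; it is only a sketch of the convention, not one of the layers you will implement in this assignment:

import numpy as np

def scale_forward(x, alpha):
  """ Compute out = alpha * x for a scalar alpha """
  out = alpha * x
  cache = (x, alpha)           # save everything the backward pass will need
  return out, cache

def scale_backward(dout, cache):
  """ Chain rule through out = alpha * x """
  x, alpha = cache
  dx = alpha * dout            # d(out)/dx = alpha
  dalpha = np.sum(dout * x)    # scalar gradient: sum over all elements
  return dx, dalpha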

In addition to implementing fully-connected networks of arbitrary depth, we will also explore different update rules for optimization, and introduce Dropout as a regularizer and Batch Normalization as a tool to more efficiently optimize deep networks.


In [2]:
# As usual, a bit of setup

import time
import numpy as np
import matplotlib.pyplot as plt
from cs231n.classifiers.fc_net import *
from cs231n.data_utils import get_CIFAR10_data
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs231n.solver import Solver

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

def rel_error(x, y):
  """ returns relative error """
  return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))


run the following from the cs231n directory and try again:
python setup.py build_ext --inplace
You may also need to restart your iPython kernel

In [3]:
# Load the (preprocessed) CIFAR10 data.

data = get_CIFAR10_data()
for k, v in data.iteritems():
  print '%s: ' % k, v.shape


X_val:  (1000L, 3L, 32L, 32L)
X_train:  (49000L, 3L, 32L, 32L)
X_test:  (1000L, 3L, 32L, 32L)
y_val:  (1000L,)
y_train:  (49000L,)
y_test:  (1000L,)

In [4]:
X_val = data['X_val']
print X_val.shape
plt.imshow(X_val[50,1,:,:])
plt.show()


(1000L, 3L, 32L, 32L)

Affine layer: forward

Open the file cs231n/layers.py and implement the affine_forward function.
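For reference, an affine (fully-connected) layer flattens each input to a row vector and computes out = x_flat.dot(w) + b. A minimal sketch of the idea is below; the name affine_forward_sketch is only for illustration, and your implementation in layers.py may differ in details:

def affine_forward_sketch(x, w, b):
  """ x: (N, d_1, ..., d_k), w: (D, M), b: (M,) where D = d_1 * ... * d_k """
  N = x.shape[0]
  x_flat = x.reshape(N, -1)   # flatten each example to a row of length D
  out = x_flat.dot(w) + b     # (N, M)
  cache = (x, w, b)
  return out, cache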

Once you are done you can test your implementation by running the following:


In [5]:
# Test the affine_forward function

num_inputs = 2
input_shape = (4, 5, 6)
output_dim = 3

input_size = num_inputs * np.prod(input_shape)
weight_size = output_dim * np.prod(input_shape)

x = np.linspace(-0.1, 0.5, num=input_size).reshape(num_inputs, *input_shape)
w = np.linspace(-0.2, 0.3, num=weight_size).reshape(np.prod(input_shape), output_dim)
b = np.linspace(-0.3, 0.1, num=output_dim)

out, _ = affine_forward(x, w, b)
correct_out = np.array([[ 1.49834967,  1.70660132,  1.91485297],
                        [ 3.25553199,  3.5141327,   3.77273342]])

# Compare your output with ours. The error should be around 1e-9.
print 'Testing affine_forward function:'
print 'difference: ', rel_error(out, correct_out)


Testing affine_forward function:
difference:  9.76985004799e-10

Affine layer: backward

Now implement the affine_backward function and test your implementation using numeric gradient checking.
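For reference, if out = x_flat.dot(w) + b then the standard gradients are dx = dout.dot(w.T) reshaped back to x.shape, dw = x_flat.T.dot(dout), and db = dout.sum(axis=0). A minimal sketch, assuming the cache holds (x, w, b) as in the forward sketch above; the name is hypothetical and your implementation may differ:

def affine_backward_sketch(dout, cache):
  """ dout: (N, M); returns gradients with the same shapes as x, w, and b """
  x, w, b = cache
  N = x.shape[0]
  x_flat = x.reshape(N, -1)             # (N, D)
  dx = dout.dot(w.T).reshape(x.shape)   # (N, d_1, ..., d_k)
  dw = x_flat.T.dot(dout)               # (D, M)
  db = dout.sum(axis=0)                 # (M,)
  return dx, dw, db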


In [6]:
# Test the affine_backward function

x = np.random.randn(10, 2, 3)
w = np.random.randn(6, 5)
b = np.random.randn(5)
dout = np.random.randn(10, 5)

dx_num = eval_numerical_gradient_array(lambda x: affine_forward(x, w, b)[0], x, dout)
dw_num = eval_numerical_gradient_array(lambda w: affine_forward(x, w, b)[0], w, dout)
db_num = eval_numerical_gradient_array(lambda b: affine_forward(x, w, b)[0], b, dout)

_, cache = affine_forward(x, w, b)

dx, dw, db = affine_backward(dout, cache)

# The error should be around 1e-10
print 'Testing affine_backward function:'
print 'dx error: ', rel_error(dx_num, dx)
print 'dw error: ', rel_error(dw_num, dw)
print 'db error: ', rel_error(db_num, db)


Testing affine_backward function:
dx error:  2.40760434196e-10
dw error:  5.79182574577e-10
db error:  3.69237678623e-11

ReLU layer: forward

Implement the forward pass for the ReLU activation function in the relu_forward function and test your implementation using the following:
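The ReLU simply thresholds its input at zero, out = max(0, x) applied elementwise. A short sketch (hypothetical name; your version in layers.py may differ):

def relu_forward_sketch(x):
  out = np.maximum(0, x)   # elementwise max with zero
  cache = x                # remember which entries were clipped
  return out, cache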


In [7]:
# Test the relu_forward function

x = np.linspace(-0.5, 0.5, num=12).reshape(3, 4)

out, _ = relu_forward(x)
correct_out = np.array([[ 0.,          0.,          0.,          0.,        ],
                        [ 0.,          0.,          0.04545455,  0.13636364,],
                        [ 0.22727273,  0.31818182,  0.40909091,  0.5,       ]])

# Compare your output with ours. The error should be around 1e-8
print 'Testing relu_forward function:'
print 'difference: ', rel_error(out, correct_out)


Testing relu_forward function:
difference:  4.99999979802e-08

ReLU layer: backward

Now implement the backward pass for the ReLU activation function in the relu_backward function and test your implementation using numeric gradient checking:
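Because ReLU passes gradient only where its input was positive, the backward pass is just an elementwise mask. A minimal sketch, assuming the cache stores the original input x:

def relu_backward_sketch(dout, cache):
  x = cache
  dx = dout * (x > 0)   # zero gradient wherever the input was clipped
  return dx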


In [8]:
x = np.random.randn(10, 10)
dout = np.random.randn(*x.shape)

dx_num = eval_numerical_gradient_array(lambda x: relu_forward(x)[0], x, dout)

_, cache = relu_forward(x)
dx = relu_backward(dout, cache)

# The error should be around 1e-12
print 'Testing relu_backward function:'
print 'dx error: ', rel_error(dx_num, dx)


Testing relu_backward function:
dx error:  3.27561242878e-12

"Sandwich" layers

There are some common patterns of layers that are frequently used in neural nets. For example, affine layers are frequently followed by a ReLU nonlinearity. To make these common patterns easy, we define several convenience layers in the file cs231n/layer_utils.py.
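For example, the affine-ReLU convenience layer is just a composition of the two primitives, with a cache that bundles both sub-caches. A sketch of the pattern (assuming affine_forward, relu_forward, and their backward counterparts from cs231n.layers are in scope; the _sketch names are only for illustration):

def affine_relu_forward_sketch(x, w, b):
  a, fc_cache = affine_forward(x, w, b)
  out, relu_cache = relu_forward(a)
  return out, (fc_cache, relu_cache)

def affine_relu_backward_sketch(dout, cache):
  fc_cache, relu_cache = cache
  da = relu_backward(dout, relu_cache)
  return affine_backward(da, fc_cache)   # returns dx, dw, db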

For now take a look at the affine_relu_forward and affine_relu_backward functions, and run the following to numerically gradient check the backward pass:


In [9]:
from cs231n.layer_utils import affine_relu_forward, affine_relu_backward

x = np.random.randn(2, 3, 4)
w = np.random.randn(12, 10)
b = np.random.randn(10)
dout = np.random.randn(2, 10)

out, cache = affine_relu_forward(x, w, b)
dx, dw, db = affine_relu_backward(dout, cache)

dx_num = eval_numerical_gradient_array(lambda x: affine_relu_forward(x, w, b)[0], x, dout)
dw_num = eval_numerical_gradient_array(lambda w: affine_relu_forward(x, w, b)[0], w, dout)
db_num = eval_numerical_gradient_array(lambda b: affine_relu_forward(x, w, b)[0], b, dout)

print 'Testing affine_relu_forward:'
print 'dx error: ', rel_error(dx_num, dx)
print 'dw error: ', rel_error(dw_num, dw)
print 'db error: ', rel_error(db_num, db)


Testing affine_relu_forward:
dx error:  5.87247670397e-11
dw error:  4.02334559152e-10
db error:  1.89288907671e-11

Loss layers: Softmax and SVM

You implemented these loss functions in the last assignment, so we'll give them to you for free here. You should still make sure you understand how they work by looking at the implementations in cs231n/layers.py.
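As a reminder, the softmax loss averages -log of the probability assigned to each correct class, and its gradient is (probs - one_hot(y)) / N. A numerically stable sketch of the idea (the provided implementation in cs231n/layers.py may differ in details):

def softmax_loss_sketch(x, y):
  """ x: (N, C) class scores, y: (N,) integer labels in [0, C) """
  shifted = x - np.max(x, axis=1, keepdims=True)   # subtract row max for stability
  log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
  probs = np.exp(log_probs)
  N = x.shape[0]
  loss = -np.sum(log_probs[np.arange(N), y]) / N
  dx = probs.copy()
  dx[np.arange(N), y] -= 1
  dx /= N
  return loss, dx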

You can make sure that the implementations are correct by running the following:


In [10]:
num_classes, num_inputs = 10, 50
x = 0.001 * np.random.randn(num_inputs, num_classes)
y = np.random.randint(num_classes, size=num_inputs)

dx_num = eval_numerical_gradient(lambda x: svm_loss(x, y)[0], x, verbose=False)
loss, dx = svm_loss(x, y)

# Test svm_loss function. Loss should be around 9 and dx error should be 1e-9
print 'Testing svm_loss:'
print 'loss: ', loss
print 'dx error: ', rel_error(dx_num, dx)

dx_num = eval_numerical_gradient(lambda x: softmax_loss(x, y)[0], x, verbose=False)
loss, dx = softmax_loss(x, y)

# Test softmax_loss function. Loss should be 2.3 and dx error should be 1e-8
print '\nTesting softmax_loss:'
print 'loss: ', loss
print 'dx error: ', rel_error(dx_num, dx)


Testing svm_loss:
loss:  9.00049248798
dx error:  1.40215660067e-09

Testing softmax_loss:
loss:  2.30263485593
dx error:  9.34226786902e-09

Two-layer network

In the previous assignment you implemented a two-layer neural network in a single monolithic class. Now that you have implemented modular versions of the necessary layers, you will reimplement the two-layer network using these modular implementations.

Open the file cs231n/classifiers/fc_net.py and complete the implementation of the TwoLayerNet class. This class will serve as a model for the other networks you will implement in this assignment, so read through it to make sure you understand the API. You can run the cell below to test your implementation.
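Concretely the architecture is affine - relu - affine - softmax, and the backward pass chains the layer backward functions in reverse order. A rough sketch of the body of TwoLayerNet.loss, assuming W1, b1, W2, b2 have been pulled out of self.params and grads starts as an empty dict (your implementation may be organized differently):

# Forward pass: hidden layer with ReLU, then the output affine layer
hidden, cache1 = affine_relu_forward(X, W1, b1)
scores, cache2 = affine_forward(hidden, W2, b2)

# Loss: softmax data loss plus L2 regularization on the weights
loss, dscores = softmax_loss(scores, y)
loss += 0.5 * self.reg * (np.sum(W1 * W1) + np.sum(W2 * W2))

# Backward pass: chain gradients back through both layers
dhidden, grads['W2'], grads['b2'] = affine_backward(dscores, cache2)
dX, grads['W1'], grads['b1'] = affine_relu_backward(dhidden, cache1)
grads['W1'] += self.reg * W1
grads['W2'] += self.reg * W2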


In [11]:
N, D, H, C = 3, 5, 50, 7
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

std = 1e-2
model = TwoLayerNet(input_dim=D, hidden_dim=H, num_classes=C, weight_scale=std)

print 'Testing initialization ... '
W1_std = abs(model.params['W1'].std() - std)
b1 = model.params['b1']
W2_std = abs(model.params['W2'].std() - std)
b2 = model.params['b2']
assert W1_std < std / 10, 'First layer weights do not seem right'
assert np.all(b1 == 0), 'First layer biases do not seem right'
assert W2_std < std / 10, 'Second layer weights do not seem right'
assert np.all(b2 == 0), 'Second layer biases do not seem right'

print 'Testing test-time forward pass ... '
model.params['W1'] = np.linspace(-0.7, 0.3, num=D*H).reshape(D, H)
model.params['b1'] = np.linspace(-0.1, 0.9, num=H)
model.params['W2'] = np.linspace(-0.3, 0.4, num=H*C).reshape(H, C)
model.params['b2'] = np.linspace(-0.9, 0.1, num=C)
X = np.linspace(-5.5, 4.5, num=N*D).reshape(D, N).T
scores = model.loss(X)
correct_scores = np.asarray(
  [[11.53165108,  12.2917344,   13.05181771,  13.81190102,  14.57198434, 15.33206765,  16.09215096],
   [12.05769098,  12.74614105,  13.43459113,  14.1230412,   14.81149128, 15.49994135,  16.18839143],
   [12.58373087,  13.20054771,  13.81736455,  14.43418138,  15.05099822, 15.66781506,  16.2846319 ]])
scores_diff = np.abs(scores - correct_scores).sum()
assert scores_diff < 1e-6, 'Problem with test-time forward pass'

print 'Testing training loss (no regularization)'
y = np.asarray([0, 5, 1])
loss, grads = model.loss(X, y)
correct_loss = 3.4702243556
assert abs(loss - correct_loss) < 1e-10, 'Problem with training-time loss'

model.reg = 1.0
loss, grads = model.loss(X, y)
correct_loss = 26.5948426952
assert abs(loss - correct_loss) < 1e-10, 'Problem with regularization loss'

for reg in [0.0, 0.5, 0.7]:
  print 'Running numeric gradient check with reg = ', reg
  model.reg = reg
  loss, grads = model.loss(X, y)

  for name in sorted(grads):
    f = lambda _: model.loss(X, y)[0]
    grad_num = eval_numerical_gradient(f, model.params[name], verbose=False)
    print '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name]))


Testing initialization ... 
Testing test-time forward pass ... 
Testing training loss (no regularization)
Running numeric gradient check with reg =  0.0
W1 relative error: 1.52e-08
W2 relative error: 3.30e-10
b1 relative error: 8.37e-09
b2 relative error: 2.14e-10
Running numeric gradient check with reg =  0.5
W1 relative error: 3.88e-08
W2 relative error: 9.57e-09
b1 relative error: 8.37e-09
b2 relative error: 7.76e-10
Running numeric gradient check with reg =  0.7
W1 relative error: 2.53e-07
W2 relative error: 2.85e-08
b1 relative error: 1.56e-08
b2 relative error: 9.09e-10

Solver

In the previous assignment, the logic for training models was coupled to the models themselves. Following a more modular design, for this assignment we have split the logic for training models into a separate class.

Open the file cs231n/solver.py and read through it to familiarize yourself with the API. After doing so, use a Solver instance to train a TwoLayerNet that achieves at least 50% accuracy on the validation set.


In [12]:
##############################################################################
# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least  #
# 50% accuracy on the validation set.                                        #
##############################################################################
model = TwoLayerNet(reg=0.15)
solver = Solver(model, data,
                update_rule='sgd',
                optim_config={'learning_rate': 9e-4})
solver.train()
##############################################################################
#                             END OF YOUR CODE                               #
##############################################################################


(Iteration 1 / 4900) loss: 6.616707
(Epoch 0 / 10) train acc: 0.135000; val_acc: 0.139000
(Iteration 11 / 4900) loss: 4.788472
(Iteration 21 / 4900) loss: 4.781479
(Iteration 31 / 4900) loss: 4.389951
(Iteration 41 / 4900) loss: 4.486065
(Iteration 51 / 4900) loss: 4.233175
(Iteration 61 / 4900) loss: 4.262578
(Iteration 71 / 4900) loss: 4.341264
(Iteration 81 / 4900) loss: 4.015177
(Iteration 91 / 4900) loss: 4.105352
(Iteration 101 / 4900) loss: 3.986159
(Iteration 111 / 4900) loss: 4.025810
(Iteration 121 / 4900) loss: 4.101243
(Iteration 131 / 4900) loss: 3.988496
(Iteration 141 / 4900) loss: 4.125624
(Iteration 151 / 4900) loss: 3.939759
(Iteration 161 / 4900) loss: 3.956261
(Iteration 171 / 4900) loss: 4.144093
(Iteration 181 / 4900) loss: 4.029767
(Iteration 191 / 4900) loss: 4.034787
(Iteration 201 / 4900) loss: 4.156631
(Iteration 211 / 4900) loss: 4.170272
(Iteration 221 / 4900) loss: 3.961455
(Iteration 231 / 4900) loss: 4.053099
(Iteration 241 / 4900) loss: 3.937463
(Iteration 251 / 4900) loss: 3.938441
(Iteration 261 / 4900) loss: 3.774730
(Iteration 271 / 4900) loss: 3.958524
(Iteration 281 / 4900) loss: 3.933408
(Iteration 291 / 4900) loss: 3.967101
(Iteration 301 / 4900) loss: 3.919744
(Iteration 311 / 4900) loss: 3.696977
(Iteration 321 / 4900) loss: 3.866769
(Iteration 331 / 4900) loss: 3.707732
(Iteration 341 / 4900) loss: 3.965681
(Iteration 351 / 4900) loss: 3.779792
(Iteration 361 / 4900) loss: 3.786718
(Iteration 371 / 4900) loss: 3.860225
(Iteration 381 / 4900) loss: 3.658862
(Iteration 391 / 4900) loss: 3.665527
(Iteration 401 / 4900) loss: 3.635238
(Iteration 411 / 4900) loss: 3.850285
(Iteration 421 / 4900) loss: 3.577995
(Iteration 431 / 4900) loss: 3.698510
(Iteration 441 / 4900) loss: 3.660561
(Iteration 451 / 4900) loss: 3.649726
(Iteration 461 / 4900) loss: 3.705486
(Iteration 471 / 4900) loss: 3.654193
(Iteration 481 / 4900) loss: 3.560192
(Epoch 1 / 10) train acc: 0.427000; val_acc: 0.412000
(Iteration 491 / 4900) loss: 3.697882
(Iteration 501 / 4900) loss: 3.549816
(Iteration 511 / 4900) loss: 3.656376
(Iteration 521 / 4900) loss: 3.565146
(Iteration 531 / 4900) loss: 3.612036
(Iteration 541 / 4900) loss: 3.542752
(Iteration 551 / 4900) loss: 3.672141
(Iteration 561 / 4900) loss: 3.670430
(Iteration 571 / 4900) loss: 3.620872
(Iteration 581 / 4900) loss: 3.852350
(Iteration 591 / 4900) loss: 3.532018
(Iteration 601 / 4900) loss: 3.631047
(Iteration 611 / 4900) loss: 3.679617
(Iteration 621 / 4900) loss: 3.523492
(Iteration 631 / 4900) loss: 3.367978
(Iteration 641 / 4900) loss: 3.639100
(Iteration 651 / 4900) loss: 3.630095
(Iteration 661 / 4900) loss: 3.517328
(Iteration 671 / 4900) loss: 3.632787
(Iteration 681 / 4900) loss: 3.425984
(Iteration 691 / 4900) loss: 3.271335
(Iteration 701 / 4900) loss: 3.406516
(Iteration 711 / 4900) loss: 3.251056
(Iteration 721 / 4900) loss: 3.336979
(Iteration 731 / 4900) loss: 3.568523
(Iteration 741 / 4900) loss: 3.542573
(Iteration 751 / 4900) loss: 3.424344
(Iteration 761 / 4900) loss: 3.655872
(Iteration 771 / 4900) loss: 3.568457
(Iteration 781 / 4900) loss: 3.507113
(Iteration 791 / 4900) loss: 3.595109
(Iteration 801 / 4900) loss: 3.400143
(Iteration 811 / 4900) loss: 3.310017
(Iteration 821 / 4900) loss: 3.414901
(Iteration 831 / 4900) loss: 3.522101
(Iteration 841 / 4900) loss: 3.293777
(Iteration 851 / 4900) loss: 3.433847
(Iteration 861 / 4900) loss: 3.378536
(Iteration 871 / 4900) loss: 3.425495
(Iteration 881 / 4900) loss: 3.358823
(Iteration 891 / 4900) loss: 3.445685
(Iteration 901 / 4900) loss: 3.282565
(Iteration 911 / 4900) loss: 3.594066
(Iteration 921 / 4900) loss: 3.223946
(Iteration 931 / 4900) loss: 3.277773
(Iteration 941 / 4900) loss: 3.298436
(Iteration 951 / 4900) loss: 3.343167
(Iteration 961 / 4900) loss: 3.482501
(Iteration 971 / 4900) loss: 3.218529
(Epoch 2 / 10) train acc: 0.416000; val_acc: 0.444000
(Iteration 981 / 4900) loss: 3.171139
(Iteration 991 / 4900) loss: 3.274842
(Iteration 1001 / 4900) loss: 3.446343
(Iteration 1011 / 4900) loss: 3.482038
(Iteration 1021 / 4900) loss: 3.152553
(Iteration 1031 / 4900) loss: 3.180753
(Iteration 1041 / 4900) loss: 3.328628
(Iteration 1051 / 4900) loss: 3.147473
(Iteration 1061 / 4900) loss: 3.235547
(Iteration 1071 / 4900) loss: 3.188010
(Iteration 1081 / 4900) loss: 3.424427
(Iteration 1091 / 4900) loss: 3.011787
(Iteration 1101 / 4900) loss: 3.266621
(Iteration 1111 / 4900) loss: 3.380010
(Iteration 1121 / 4900) loss: 3.093736
(Iteration 1131 / 4900) loss: 3.322978
(Iteration 1141 / 4900) loss: 3.272077
(Iteration 1151 / 4900) loss: 3.406946
(Iteration 1161 / 4900) loss: 3.239551
(Iteration 1171 / 4900) loss: 3.233191
(Iteration 1181 / 4900) loss: 3.148904
(Iteration 1191 / 4900) loss: 3.242546
(Iteration 1201 / 4900) loss: 3.374245
(Iteration 1211 / 4900) loss: 3.087429
(Iteration 1221 / 4900) loss: 3.195273
(Iteration 1231 / 4900) loss: 3.051453
(Iteration 1241 / 4900) loss: 3.132105
(Iteration 1251 / 4900) loss: 3.265869
(Iteration 1261 / 4900) loss: 3.208775
(Iteration 1271 / 4900) loss: 2.886637
(Iteration 1281 / 4900) loss: 3.073029
(Iteration 1291 / 4900) loss: 2.872007
(Iteration 1301 / 4900) loss: 3.073539
(Iteration 1311 / 4900) loss: 3.095435
(Iteration 1321 / 4900) loss: 3.106193
(Iteration 1331 / 4900) loss: 3.106711
(Iteration 1341 / 4900) loss: 2.856428
(Iteration 1351 / 4900) loss: 3.376602
(Iteration 1361 / 4900) loss: 3.119359
(Iteration 1371 / 4900) loss: 3.086373
(Iteration 1381 / 4900) loss: 3.092613
(Iteration 1391 / 4900) loss: 3.002511
(Iteration 1401 / 4900) loss: 2.981758
(Iteration 1411 / 4900) loss: 3.074219
(Iteration 1421 / 4900) loss: 3.042947
(Iteration 1431 / 4900) loss: 3.135873
(Iteration 1441 / 4900) loss: 3.075377
(Iteration 1451 / 4900) loss: 2.931159
(Iteration 1461 / 4900) loss: 2.918133
(Epoch 3 / 10) train acc: 0.469000; val_acc: 0.474000
(Iteration 1471 / 4900) loss: 2.978438
(Iteration 1481 / 4900) loss: 2.964961
(Iteration 1491 / 4900) loss: 2.939756
(Iteration 1501 / 4900) loss: 3.020388
(Iteration 1511 / 4900) loss: 2.959593
(Iteration 1521 / 4900) loss: 3.105295
(Iteration 1531 / 4900) loss: 3.105787
(Iteration 1541 / 4900) loss: 2.959095
(Iteration 1551 / 4900) loss: 2.999844
(Iteration 1561 / 4900) loss: 2.902397
(Iteration 1571 / 4900) loss: 2.919478
(Iteration 1581 / 4900) loss: 2.899367
(Iteration 1591 / 4900) loss: 2.978089
(Iteration 1601 / 4900) loss: 3.002139
(Iteration 1611 / 4900) loss: 2.947822
(Iteration 1621 / 4900) loss: 2.765433
(Iteration 1631 / 4900) loss: 2.828059
(Iteration 1641 / 4900) loss: 2.949965
(Iteration 1651 / 4900) loss: 2.766935
(Iteration 1661 / 4900) loss: 2.961203
(Iteration 1671 / 4900) loss: 3.207482
(Iteration 1681 / 4900) loss: 2.867769
(Iteration 1691 / 4900) loss: 2.801145
(Iteration 1701 / 4900) loss: 2.824272
(Iteration 1711 / 4900) loss: 2.916757
(Iteration 1721 / 4900) loss: 3.147204
(Iteration 1731 / 4900) loss: 2.792431
(Iteration 1741 / 4900) loss: 3.118947
(Iteration 1751 / 4900) loss: 2.961418
(Iteration 1761 / 4900) loss: 2.767975
(Iteration 1771 / 4900) loss: 3.182174
(Iteration 1781 / 4900) loss: 2.901250
(Iteration 1791 / 4900) loss: 2.733943
(Iteration 1801 / 4900) loss: 2.973330
(Iteration 1811 / 4900) loss: 2.887614
(Iteration 1821 / 4900) loss: 3.052556
(Iteration 1831 / 4900) loss: 2.672293
(Iteration 1841 / 4900) loss: 2.935966
(Iteration 1851 / 4900) loss: 2.890619
(Iteration 1861 / 4900) loss: 2.961545
(Iteration 1871 / 4900) loss: 2.501025
(Iteration 1881 / 4900) loss: 2.783066
(Iteration 1891 / 4900) loss: 3.011517
(Iteration 1901 / 4900) loss: 2.915861
(Iteration 1911 / 4900) loss: 2.701639
(Iteration 1921 / 4900) loss: 2.872412
(Iteration 1931 / 4900) loss: 2.850129
(Iteration 1941 / 4900) loss: 2.826105
(Iteration 1951 / 4900) loss: 2.706315
(Epoch 4 / 10) train acc: 0.497000; val_acc: 0.472000
(Iteration 1961 / 4900) loss: 2.676679
(Iteration 1971 / 4900) loss: 2.831395
(Iteration 1981 / 4900) loss: 2.725037
(Iteration 1991 / 4900) loss: 2.751346
(Iteration 2001 / 4900) loss: 2.615298
(Iteration 2011 / 4900) loss: 2.782068
(Iteration 2021 / 4900) loss: 3.004813
(Iteration 2031 / 4900) loss: 2.952429
(Iteration 2041 / 4900) loss: 2.808218
(Iteration 2051 / 4900) loss: 2.864826
(Iteration 2061 / 4900) loss: 2.897780
(Iteration 2071 / 4900) loss: 2.979026
(Iteration 2081 / 4900) loss: 2.508575
(Iteration 2091 / 4900) loss: 2.696476
(Iteration 2101 / 4900) loss: 2.628321
(Iteration 2111 / 4900) loss: 2.704699
(Iteration 2121 / 4900) loss: 2.939499
(Iteration 2131 / 4900) loss: 2.623186
(Iteration 2141 / 4900) loss: 2.691435
(Iteration 2151 / 4900) loss: 2.748063
(Iteration 2161 / 4900) loss: 2.772544
(Iteration 2171 / 4900) loss: 2.711727
(Iteration 2181 / 4900) loss: 2.603549
(Iteration 2191 / 4900) loss: 2.624234
(Iteration 2201 / 4900) loss: 2.773999
(Iteration 2211 / 4900) loss: 2.810088
(Iteration 2221 / 4900) loss: 2.715239
(Iteration 2231 / 4900) loss: 2.915970
(Iteration 2241 / 4900) loss: 2.900197
(Iteration 2251 / 4900) loss: 2.752457
(Iteration 2261 / 4900) loss: 2.884994
(Iteration 2271 / 4900) loss: 2.779655
(Iteration 2281 / 4900) loss: 2.628101
(Iteration 2291 / 4900) loss: 2.868106
(Iteration 2301 / 4900) loss: 2.673909
(Iteration 2311 / 4900) loss: 2.513746
(Iteration 2321 / 4900) loss: 2.769193
(Iteration 2331 / 4900) loss: 2.563717
(Iteration 2341 / 4900) loss: 2.448047
(Iteration 2351 / 4900) loss: 2.590312
(Iteration 2361 / 4900) loss: 2.504313
(Iteration 2371 / 4900) loss: 2.791704
(Iteration 2381 / 4900) loss: 2.755883
(Iteration 2391 / 4900) loss: 2.563767
(Iteration 2401 / 4900) loss: 2.576028
(Iteration 2411 / 4900) loss: 2.740063
(Iteration 2421 / 4900) loss: 2.563391
(Iteration 2431 / 4900) loss: 2.705479
(Iteration 2441 / 4900) loss: 2.551123
(Epoch 5 / 10) train acc: 0.516000; val_acc: 0.453000
(Iteration 2451 / 4900) loss: 2.501045
(Iteration 2461 / 4900) loss: 2.496031
(Iteration 2471 / 4900) loss: 2.442420
(Iteration 2481 / 4900) loss: 2.466637
(Iteration 2491 / 4900) loss: 2.681805
(Iteration 2501 / 4900) loss: 2.402107
(Iteration 2511 / 4900) loss: 2.520799
(Iteration 2521 / 4900) loss: 2.707888
(Iteration 2531 / 4900) loss: 2.747841
(Iteration 2541 / 4900) loss: 2.377689
(Iteration 2551 / 4900) loss: 2.593209
(Iteration 2561 / 4900) loss: 2.621179
(Iteration 2571 / 4900) loss: 2.476694
(Iteration 2581 / 4900) loss: 2.523891
(Iteration 2591 / 4900) loss: 2.538244
(Iteration 2601 / 4900) loss: 2.377195
(Iteration 2611 / 4900) loss: 2.861970
(Iteration 2621 / 4900) loss: 2.528676
(Iteration 2631 / 4900) loss: 2.603410
(Iteration 2641 / 4900) loss: 2.413427
(Iteration 2651 / 4900) loss: 2.628374
(Iteration 2661 / 4900) loss: 2.430814
(Iteration 2671 / 4900) loss: 2.465717
(Iteration 2681 / 4900) loss: 2.258486
(Iteration 2691 / 4900) loss: 2.583307
(Iteration 2701 / 4900) loss: 2.698229
(Iteration 2711 / 4900) loss: 2.652304
(Iteration 2721 / 4900) loss: 2.695840
(Iteration 2731 / 4900) loss: 2.420669
(Iteration 2741 / 4900) loss: 2.462999
(Iteration 2751 / 4900) loss: 2.432568
(Iteration 2761 / 4900) loss: 2.410852
(Iteration 2771 / 4900) loss: 2.345827
(Iteration 2781 / 4900) loss: 2.475545
(Iteration 2791 / 4900) loss: 2.454104
(Iteration 2801 / 4900) loss: 2.518131
(Iteration 2811 / 4900) loss: 2.426285
(Iteration 2821 / 4900) loss: 2.519186
(Iteration 2831 / 4900) loss: 2.359393
(Iteration 2841 / 4900) loss: 2.349056
(Iteration 2851 / 4900) loss: 2.266981
(Iteration 2861 / 4900) loss: 2.364452
(Iteration 2871 / 4900) loss: 2.449414
(Iteration 2881 / 4900) loss: 2.496066
(Iteration 2891 / 4900) loss: 2.426041
(Iteration 2901 / 4900) loss: 2.426873
(Iteration 2911 / 4900) loss: 2.519587
(Iteration 2921 / 4900) loss: 2.238154
(Iteration 2931 / 4900) loss: 2.504357
(Epoch 6 / 10) train acc: 0.564000; val_acc: 0.494000
(Iteration 2941 / 4900) loss: 2.250942
(Iteration 2951 / 4900) loss: 2.470473
(Iteration 2961 / 4900) loss: 2.590192
(Iteration 2971 / 4900) loss: 2.599676
(Iteration 2981 / 4900) loss: 2.652662
(Iteration 2991 / 4900) loss: 2.287612
(Iteration 3001 / 4900) loss: 2.227973
(Iteration 3011 / 4900) loss: 2.347633
(Iteration 3021 / 4900) loss: 2.433449
(Iteration 3031 / 4900) loss: 2.274836
(Iteration 3041 / 4900) loss: 2.303733
(Iteration 3051 / 4900) loss: 2.351275
(Iteration 3061 / 4900) loss: 2.425333
(Iteration 3071 / 4900) loss: 2.403539
(Iteration 3081 / 4900) loss: 2.497764
(Iteration 3091 / 4900) loss: 2.067489
(Iteration 3101 / 4900) loss: 2.399916
(Iteration 3111 / 4900) loss: 2.338286
(Iteration 3121 / 4900) loss: 2.612456
(Iteration 3131 / 4900) loss: 2.317291
(Iteration 3141 / 4900) loss: 2.388410
(Iteration 3151 / 4900) loss: 2.140339
(Iteration 3161 / 4900) loss: 2.304739
(Iteration 3171 / 4900) loss: 2.473614
(Iteration 3181 / 4900) loss: 2.148771
(Iteration 3191 / 4900) loss: 2.297129
(Iteration 3201 / 4900) loss: 2.306349
(Iteration 3211 / 4900) loss: 2.460083
(Iteration 3221 / 4900) loss: 2.334408
(Iteration 3231 / 4900) loss: 2.549655
(Iteration 3241 / 4900) loss: 2.400962
(Iteration 3251 / 4900) loss: 2.292571
(Iteration 3261 / 4900) loss: 2.435803
(Iteration 3271 / 4900) loss: 2.222900
(Iteration 3281 / 4900) loss: 2.402165
(Iteration 3291 / 4900) loss: 2.172296
(Iteration 3301 / 4900) loss: 2.256848
(Iteration 3311 / 4900) loss: 2.235333
(Iteration 3321 / 4900) loss: 2.331785
(Iteration 3331 / 4900) loss: 2.215123
(Iteration 3341 / 4900) loss: 2.042225
(Iteration 3351 / 4900) loss: 2.253328
(Iteration 3361 / 4900) loss: 2.280738
(Iteration 3371 / 4900) loss: 2.312860
(Iteration 3381 / 4900) loss: 2.148552
(Iteration 3391 / 4900) loss: 2.095078
(Iteration 3401 / 4900) loss: 2.424531
(Iteration 3411 / 4900) loss: 2.162704
(Iteration 3421 / 4900) loss: 2.247909
(Epoch 7 / 10) train acc: 0.559000; val_acc: 0.522000
(Iteration 3431 / 4900) loss: 2.172213
(Iteration 3441 / 4900) loss: 2.128398
(Iteration 3451 / 4900) loss: 2.173296
(Iteration 3461 / 4900) loss: 2.272669
(Iteration 3471 / 4900) loss: 1.949725
(Iteration 3481 / 4900) loss: 2.266566
(Iteration 3491 / 4900) loss: 2.295547
(Iteration 3501 / 4900) loss: 2.271691
(Iteration 3511 / 4900) loss: 2.027798
(Iteration 3521 / 4900) loss: 2.094748
(Iteration 3531 / 4900) loss: 2.243244
(Iteration 3541 / 4900) loss: 2.217404
(Iteration 3551 / 4900) loss: 2.161693
(Iteration 3561 / 4900) loss: 2.361524
(Iteration 3571 / 4900) loss: 2.238823
(Iteration 3581 / 4900) loss: 2.115541
(Iteration 3591 / 4900) loss: 2.384000
(Iteration 3601 / 4900) loss: 2.172922
(Iteration 3611 / 4900) loss: 2.246539
(Iteration 3621 / 4900) loss: 2.165018
(Iteration 3631 / 4900) loss: 2.185228
(Iteration 3641 / 4900) loss: 2.289838
(Iteration 3651 / 4900) loss: 2.207021
(Iteration 3661 / 4900) loss: 2.006744
(Iteration 3671 / 4900) loss: 2.445197
(Iteration 3681 / 4900) loss: 2.206788
(Iteration 3691 / 4900) loss: 2.134774
(Iteration 3701 / 4900) loss: 2.128338
(Iteration 3711 / 4900) loss: 2.226403
(Iteration 3721 / 4900) loss: 2.449933
(Iteration 3731 / 4900) loss: 2.009135
(Iteration 3741 / 4900) loss: 2.002438
(Iteration 3751 / 4900) loss: 2.133228
(Iteration 3761 / 4900) loss: 1.990792
(Iteration 3771 / 4900) loss: 2.359861
(Iteration 3781 / 4900) loss: 2.194453
(Iteration 3791 / 4900) loss: 2.140489
(Iteration 3801 / 4900) loss: 2.060096
(Iteration 3811 / 4900) loss: 2.083873
(Iteration 3821 / 4900) loss: 2.170456
(Iteration 3831 / 4900) loss: 2.171175
(Iteration 3841 / 4900) loss: 2.189833
(Iteration 3851 / 4900) loss: 2.218581
(Iteration 3861 / 4900) loss: 2.000258
(Iteration 3871 / 4900) loss: 2.060527
(Iteration 3881 / 4900) loss: 2.082641
(Iteration 3891 / 4900) loss: 2.312590
(Iteration 3901 / 4900) loss: 2.101571
(Iteration 3911 / 4900) loss: 2.158224
(Epoch 8 / 10) train acc: 0.546000; val_acc: 0.505000
(Iteration 3921 / 4900) loss: 2.156832
(Iteration 3931 / 4900) loss: 2.256525
(Iteration 3941 / 4900) loss: 2.318530
(Iteration 3951 / 4900) loss: 2.105277
(Iteration 3961 / 4900) loss: 2.387728
(Iteration 3971 / 4900) loss: 2.032969
(Iteration 3981 / 4900) loss: 2.275472
(Iteration 3991 / 4900) loss: 2.135331
(Iteration 4001 / 4900) loss: 2.196359
(Iteration 4011 / 4900) loss: 2.078001
(Iteration 4021 / 4900) loss: 2.107999
(Iteration 4031 / 4900) loss: 1.942828
(Iteration 4041 / 4900) loss: 2.106565
(Iteration 4051 / 4900) loss: 2.005430
(Iteration 4061 / 4900) loss: 2.073322
(Iteration 4071 / 4900) loss: 1.965463
(Iteration 4081 / 4900) loss: 2.065967
(Iteration 4091 / 4900) loss: 2.191239
(Iteration 4101 / 4900) loss: 2.189761
(Iteration 4111 / 4900) loss: 2.014198
(Iteration 4121 / 4900) loss: 2.181683
(Iteration 4131 / 4900) loss: 2.144387
(Iteration 4141 / 4900) loss: 2.080640
(Iteration 4151 / 4900) loss: 2.056085
(Iteration 4161 / 4900) loss: 2.005442
(Iteration 4171 / 4900) loss: 2.284981
(Iteration 4181 / 4900) loss: 2.068543
(Iteration 4191 / 4900) loss: 1.806275
(Iteration 4201 / 4900) loss: 2.152528
(Iteration 4211 / 4900) loss: 1.913833
(Iteration 4221 / 4900) loss: 2.157355
(Iteration 4231 / 4900) loss: 2.036085
(Iteration 4241 / 4900) loss: 2.071480
(Iteration 4251 / 4900) loss: 2.024507
(Iteration 4261 / 4900) loss: 2.211132
(Iteration 4271 / 4900) loss: 1.931425
(Iteration 4281 / 4900) loss: 2.129388
(Iteration 4291 / 4900) loss: 1.932811
(Iteration 4301 / 4900) loss: 2.005638
(Iteration 4311 / 4900) loss: 2.044426
(Iteration 4321 / 4900) loss: 1.945824
(Iteration 4331 / 4900) loss: 2.095641
(Iteration 4341 / 4900) loss: 2.273607
(Iteration 4351 / 4900) loss: 2.115442
(Iteration 4361 / 4900) loss: 2.118374
(Iteration 4371 / 4900) loss: 1.813804
(Iteration 4381 / 4900) loss: 2.229470
(Iteration 4391 / 4900) loss: 2.056012
(Iteration 4401 / 4900) loss: 1.994028
(Epoch 9 / 10) train acc: 0.562000; val_acc: 0.497000
(Iteration 4411 / 4900) loss: 2.005658
(Iteration 4421 / 4900) loss: 1.928359
(Iteration 4431 / 4900) loss: 1.933751
(Iteration 4441 / 4900) loss: 2.053987
(Iteration 4451 / 4900) loss: 2.039483
(Iteration 4461 / 4900) loss: 2.118195
(Iteration 4471 / 4900) loss: 2.203885
(Iteration 4481 / 4900) loss: 1.799171
(Iteration 4491 / 4900) loss: 2.022430
(Iteration 4501 / 4900) loss: 2.085583
(Iteration 4511 / 4900) loss: 2.109631
(Iteration 4521 / 4900) loss: 1.938987
(Iteration 4531 / 4900) loss: 1.994186
(Iteration 4541 / 4900) loss: 1.852397
(Iteration 4551 / 4900) loss: 2.449893
(Iteration 4561 / 4900) loss: 1.891605
(Iteration 4571 / 4900) loss: 1.899480
(Iteration 4581 / 4900) loss: 2.066591
(Iteration 4591 / 4900) loss: 2.090356
(Iteration 4601 / 4900) loss: 1.937531
(Iteration 4611 / 4900) loss: 2.229956
(Iteration 4621 / 4900) loss: 1.961544
(Iteration 4631 / 4900) loss: 1.906778
(Iteration 4641 / 4900) loss: 1.862160
(Iteration 4651 / 4900) loss: 2.091532
(Iteration 4661 / 4900) loss: 1.980394
(Iteration 4671 / 4900) loss: 2.217303
(Iteration 4681 / 4900) loss: 1.945830
(Iteration 4691 / 4900) loss: 1.799137
(Iteration 4701 / 4900) loss: 1.996650
(Iteration 4711 / 4900) loss: 2.004414
(Iteration 4721 / 4900) loss: 1.988429
(Iteration 4731 / 4900) loss: 1.861591
(Iteration 4741 / 4900) loss: 1.710288
(Iteration 4751 / 4900) loss: 2.043997
(Iteration 4761 / 4900) loss: 2.038329
(Iteration 4771 / 4900) loss: 2.020392
(Iteration 4781 / 4900) loss: 1.905755
(Iteration 4791 / 4900) loss: 1.876548
(Iteration 4801 / 4900) loss: 2.180648
(Iteration 4811 / 4900) loss: 1.948905
(Iteration 4821 / 4900) loss: 1.795086
(Iteration 4831 / 4900) loss: 2.020769
(Iteration 4841 / 4900) loss: 2.083966
(Iteration 4851 / 4900) loss: 2.112480
(Iteration 4861 / 4900) loss: 1.915327
(Iteration 4871 / 4900) loss: 1.870269
(Iteration 4881 / 4900) loss: 2.008067
(Iteration 4891 / 4900) loss: 1.956441
(Epoch 10 / 10) train acc: 0.540000; val_acc: 0.511000

In [13]:
# Run this cell to visualize training loss and train / val accuracy

plt.subplot(2, 1, 1)
plt.title('Training loss')
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.plot([0.5] * len(solver.val_acc_history), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)
plt.show()

solver.check_accuracy(data['X_val'], data['y_val'])


Out[13]:
0.52200000000000002

Multilayer network

Next you will implement a fully-connected network with an arbitrary number of hidden layers.

Read through the FullyConnectedNet class in the file cs231n/classifiers/fc_net.py.

Implement the initialization, the forward pass, and the backward pass. For the moment don't worry about implementing dropout or batch normalization; we will add those features soon.

Initial loss and gradient check

As a sanity check, run the following to check the initial loss and to gradient check the network both with and without regularization. Do the initial losses seem reasonable?
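(Hint: with C = 10 classes and small random weights the classifier is essentially guessing uniformly, so the unregularized softmax loss should start near -log(1/10) = log(10) ≈ 2.303, and adding L2 regularization should make the loss strictly larger.)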

For gradient checking, you should expect to see errors around 1e-6 or less.


In [14]:
N, D, H1, H2, H3, C = 2, 15, 20, 30, 50, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

for reg in [0, 3.14]:
  print 'Running check with reg = ', reg
  model = FullyConnectedNet([H1, H2, H3], input_dim=D, num_classes=C,
                            reg=reg, weight_scale=5e-2, dtype=np.float64)

  loss, grads = model.loss(X, y)
  print 'Initial loss: ', loss

  for name in sorted(grads):
    f = lambda _: model.loss(X, y)[0]
    grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)
    print '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name]))


Running check with reg =  0
Initial loss:  2.30291783893
W1 relative error: 6.71e-05
W2 relative error: 2.22e-06
W3 relative error: 1.00e-04
W4 relative error: 9.46e-07
b1 relative error: 1.35e-06
b2 relative error: 3.14e-09
b3 relative error: 1.08e-08
b4 relative error: 8.82e-11
Running check with reg =  3.14
Initial loss:  12.9801772471
W1 relative error: 4.57e-08
W2 relative error: 5.41e-08
W3 relative error: 7.81e-07
W4 relative error: 3.68e-08
b1 relative error: 3.87e-07
b2 relative error: 9.57e-08
b3 relative error: 9.55e-09
b4 relative error: 4.11e-10

As another sanity check, make sure you can overfit a small dataset of 50 images. First we will try a three-layer network with 100 units in each hidden layer. You will need to tweak the learning rate and initialization scale, but you should be able to overfit and achieve 100% training accuracy within 20 epochs.


In [15]:
# TODO: Use a three-layer Net to overfit 50 training examples.

num_train = 50
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

weight_scale = 8e-3
learning_rate = 7.0e-3
model = FullyConnectedNet([100, 100],
              weight_scale=weight_scale, dtype=np.float64)
solver = Solver(model, small_data,
                print_every=10, num_epochs=50, batch_size=25,
                update_rule='sgd',
                optim_config={
                  'learning_rate': learning_rate,
                }
         )
solver.train()

plt.plot(solver.loss_history, 'o')
plt.title('Training loss history')
plt.xlabel('Iteration')
plt.ylabel('Training loss')
plt.show()


(Iteration 1 / 100) loss: 2.314217
(Epoch 0 / 50) train acc: 0.180000; val_acc: 0.100000
(Epoch 1 / 50) train acc: 0.340000; val_acc: 0.097000
(Epoch 2 / 50) train acc: 0.340000; val_acc: 0.126000
(Epoch 3 / 50) train acc: 0.320000; val_acc: 0.124000
(Epoch 4 / 50) train acc: 0.500000; val_acc: 0.114000
(Epoch 5 / 50) train acc: 0.560000; val_acc: 0.124000
(Iteration 11 / 100) loss: 1.751305
(Epoch 6 / 50) train acc: 0.520000; val_acc: 0.138000
(Epoch 7 / 50) train acc: 0.540000; val_acc: 0.143000
(Epoch 8 / 50) train acc: 0.600000; val_acc: 0.174000
(Epoch 9 / 50) train acc: 0.760000; val_acc: 0.170000
(Epoch 10 / 50) train acc: 0.740000; val_acc: 0.163000
(Iteration 21 / 100) loss: 1.086859
(Epoch 11 / 50) train acc: 0.820000; val_acc: 0.170000
(Epoch 12 / 50) train acc: 0.900000; val_acc: 0.172000
(Epoch 13 / 50) train acc: 0.920000; val_acc: 0.162000
(Epoch 14 / 50) train acc: 0.940000; val_acc: 0.173000
(Epoch 15 / 50) train acc: 0.980000; val_acc: 0.176000
(Iteration 31 / 100) loss: 0.246904
(Epoch 16 / 50) train acc: 0.980000; val_acc: 0.173000
(Epoch 17 / 50) train acc: 0.980000; val_acc: 0.173000
(Epoch 18 / 50) train acc: 0.960000; val_acc: 0.179000
(Epoch 19 / 50) train acc: 0.980000; val_acc: 0.169000
(Epoch 20 / 50) train acc: 0.980000; val_acc: 0.178000
(Iteration 41 / 100) loss: 0.133100
(Epoch 21 / 50) train acc: 0.980000; val_acc: 0.175000
(Epoch 22 / 50) train acc: 0.980000; val_acc: 0.172000
(Epoch 23 / 50) train acc: 0.980000; val_acc: 0.181000
(Epoch 24 / 50) train acc: 0.960000; val_acc: 0.183000
(Epoch 25 / 50) train acc: 0.980000; val_acc: 0.169000
(Iteration 51 / 100) loss: 0.057150
(Epoch 26 / 50) train acc: 0.980000; val_acc: 0.177000
(Epoch 27 / 50) train acc: 1.000000; val_acc: 0.181000
(Epoch 28 / 50) train acc: 1.000000; val_acc: 0.169000
(Epoch 29 / 50) train acc: 1.000000; val_acc: 0.174000
(Epoch 30 / 50) train acc: 0.980000; val_acc: 0.178000
(Iteration 61 / 100) loss: 0.029218
(Epoch 31 / 50) train acc: 1.000000; val_acc: 0.180000
(Epoch 32 / 50) train acc: 1.000000; val_acc: 0.179000
(Epoch 33 / 50) train acc: 1.000000; val_acc: 0.178000
(Epoch 34 / 50) train acc: 1.000000; val_acc: 0.173000
(Epoch 35 / 50) train acc: 1.000000; val_acc: 0.176000
(Iteration 71 / 100) loss: 0.020662
(Epoch 36 / 50) train acc: 1.000000; val_acc: 0.181000
(Epoch 37 / 50) train acc: 1.000000; val_acc: 0.177000
(Epoch 38 / 50) train acc: 1.000000; val_acc: 0.175000
(Epoch 39 / 50) train acc: 1.000000; val_acc: 0.179000
(Epoch 40 / 50) train acc: 1.000000; val_acc: 0.181000
(Iteration 81 / 100) loss: 0.006881
(Epoch 41 / 50) train acc: 1.000000; val_acc: 0.183000
(Epoch 42 / 50) train acc: 1.000000; val_acc: 0.185000
(Epoch 43 / 50) train acc: 1.000000; val_acc: 0.183000
(Epoch 44 / 50) train acc: 1.000000; val_acc: 0.182000
(Epoch 45 / 50) train acc: 1.000000; val_acc: 0.180000
(Iteration 91 / 100) loss: 0.011870
(Epoch 46 / 50) train acc: 1.000000; val_acc: 0.180000
(Epoch 47 / 50) train acc: 1.000000; val_acc: 0.182000
(Epoch 48 / 50) train acc: 1.000000; val_acc: 0.181000
(Epoch 49 / 50) train acc: 1.000000; val_acc: 0.181000
(Epoch 50 / 50) train acc: 1.000000; val_acc: 0.179000

Now try to use a five-layer network with 100 units on each layer to overfit 50 training examples. Again you will have to adjust the learning rate and weight initialization, but you should be able to achieve 100% training accuracy within 20 epochs.


In [16]:
# TODO: Use a five-layer Net to overfit 50 training examples.

num_train = 50
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

learning_rate = 4e-3
weight_scale = 50e-3
model = FullyConnectedNet([100, 100, 100, 100],
                weight_scale=weight_scale, dtype=np.float64)
solver = Solver(model, small_data,
                print_every=10, num_epochs=20, batch_size=25,
                update_rule='sgd',
                optim_config={
                  'learning_rate': learning_rate,
                }
         )
solver.train()

plt.plot(solver.loss_history, 'o')
plt.title('Training loss history')
plt.xlabel('Iteration')
plt.ylabel('Training loss')
plt.show()


(Iteration 1 / 40) loss: 4.093225
(Epoch 0 / 20) train acc: 0.200000; val_acc: 0.098000
(Epoch 1 / 20) train acc: 0.320000; val_acc: 0.128000
(Epoch 2 / 20) train acc: 0.520000; val_acc: 0.129000
(Epoch 3 / 20) train acc: 0.700000; val_acc: 0.117000
(Epoch 4 / 20) train acc: 0.820000; val_acc: 0.107000
(Epoch 5 / 20) train acc: 0.840000; val_acc: 0.128000
(Iteration 11 / 40) loss: 0.852951
(Epoch 6 / 20) train acc: 0.880000; val_acc: 0.133000
(Epoch 7 / 20) train acc: 0.940000; val_acc: 0.109000
(Epoch 8 / 20) train acc: 0.960000; val_acc: 0.137000
(Epoch 9 / 20) train acc: 0.980000; val_acc: 0.134000
(Epoch 10 / 20) train acc: 0.980000; val_acc: 0.124000
(Iteration 21 / 40) loss: 0.382571
(Epoch 11 / 20) train acc: 0.980000; val_acc: 0.127000
(Epoch 12 / 20) train acc: 1.000000; val_acc: 0.142000
(Epoch 13 / 20) train acc: 0.980000; val_acc: 0.134000
(Epoch 14 / 20) train acc: 0.980000; val_acc: 0.134000
(Epoch 15 / 20) train acc: 0.980000; val_acc: 0.134000
(Iteration 31 / 40) loss: 0.142612
(Epoch 16 / 20) train acc: 0.980000; val_acc: 0.143000
(Epoch 17 / 20) train acc: 1.000000; val_acc: 0.137000
(Epoch 18 / 20) train acc: 1.000000; val_acc: 0.137000
(Epoch 19 / 20) train acc: 1.000000; val_acc: 0.138000
(Epoch 20 / 20) train acc: 1.000000; val_acc: 0.138000

Inline question:

Did you notice anything about the comparative difficulty of training the three-layer net vs training the five layer net?

Answer:

It's actually much easier for the five-layer network to overfit. With a larger weight scale it can easily reach 100% training accuracy in 5 epochs.

Update rules

So far we have used vanilla stochastic gradient descent (SGD) as our update rule. More sophisticated update rules can make it easier to train deep networks. We will implement a few of the most commonly used update rules and compare them to vanilla SGD.

SGD+Momentum

Stochastic gradient descent with momentum is a widely used update rule that tends to make deep networks converge faster than vanilla stochastic gradient descent.

Open the file cs231n/optim.py and read the documentation at the top of the file to make sure you understand the API. Implement the SGD+momentum update rule in the function sgd_momentum and run the following to check your implementation. You should see errors less than 1e-8.
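For reference, classical momentum keeps a velocity vector that accumulates a decaying sum of past gradients and steps along it. A minimal sketch of one update (the real sgd_momentum stores its hyperparameters and the velocity in the config dict, so its signature will differ):

def sgd_momentum_sketch(w, dw, v, learning_rate=1e-2, momentum=0.9):
  """ One SGD+momentum step; returns updated weights and velocity """
  v = momentum * v - learning_rate * dw   # fold the gradient into the velocity
  next_w = w + v                          # step along the velocity
  return next_w, v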


In [17]:
from cs231n.optim import sgd_momentum

N, D = 4, 5
w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
v = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

config = {'learning_rate': 1e-3, 'velocity': v}
next_w, _ = sgd_momentum(w, dw, config=config)

expected_next_w = np.asarray([
  [ 0.1406,      0.20738947,  0.27417895,  0.34096842,  0.40775789],
  [ 0.47454737,  0.54133684,  0.60812632,  0.67491579,  0.74170526],
  [ 0.80849474,  0.87528421,  0.94207368,  1.00886316,  1.07565263],
  [ 1.14244211,  1.20923158,  1.27602105,  1.34281053,  1.4096    ]])
expected_velocity = np.asarray([
  [ 0.5406,      0.55475789,  0.56891579, 0.58307368,  0.59723158],
  [ 0.61138947,  0.62554737,  0.63970526,  0.65386316,  0.66802105],
  [ 0.68217895,  0.69633684,  0.71049474,  0.72465263,  0.73881053],
  [ 0.75296842,  0.76712632,  0.78128421,  0.79544211,  0.8096    ]])

print 'next_w error: ', rel_error(next_w, expected_next_w)
print 'velocity error: ', rel_error(expected_velocity, config['velocity'])


next_w error:  8.88234703351e-09
velocity error:  4.26928774328e-09

Once you have done so, run the following to train a six-layer network with both SGD and SGD+momentum. You should see the SGD+momentum update rule converge faster.


In [18]:
num_train = 4000
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

solvers = {}

for update_rule in ['sgd', 'sgd_momentum']:
  print 'running with ', update_rule
  model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)

  solver = Solver(model, small_data,
                  num_epochs=5, batch_size=100,
                  update_rule=update_rule,
                  optim_config={
                    'learning_rate': 1e-2,
                  },
                  verbose=True)
  solvers[update_rule] = solver
  solver.train()
  print

plt.subplot(3, 1, 1)
plt.title('Training loss')
plt.xlabel('Iteration')

plt.subplot(3, 1, 2)
plt.title('Training accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 3)
plt.title('Validation accuracy')
plt.xlabel('Epoch')

for update_rule, solver in solvers.iteritems():
  plt.subplot(3, 1, 1)
  plt.plot(solver.loss_history, 'o', label=update_rule)
  
  plt.subplot(3, 1, 2)
  plt.plot(solver.train_acc_history, '-o', label=update_rule)

  plt.subplot(3, 1, 3)
  plt.plot(solver.val_acc_history, '-o', label=update_rule)
  
for i in [1, 2, 3]:
  plt.subplot(3, 1, i)
  plt.legend(loc='upper center', ncol=4)
plt.gcf().set_size_inches(15, 15)
plt.show()


running with  sgd
(Iteration 1 / 200) loss: 2.770480
(Epoch 0 / 5) train acc: 0.088000; val_acc: 0.117000
(Iteration 11 / 200) loss: 2.254171
(Iteration 21 / 200) loss: 2.154485
(Iteration 31 / 200) loss: 1.976192
(Epoch 1 / 5) train acc: 0.262000; val_acc: 0.247000
(Iteration 41 / 200) loss: 2.130188
(Iteration 51 / 200) loss: 1.894977
(Iteration 61 / 200) loss: 1.928483
(Iteration 71 / 200) loss: 1.924018
(Epoch 2 / 5) train acc: 0.343000; val_acc: 0.291000
(Iteration 81 / 200) loss: 1.754585
(Iteration 91 / 200) loss: 1.853601
(Iteration 101 / 200) loss: 1.762569
(Iteration 111 / 200) loss: 1.863274
(Epoch 3 / 5) train acc: 0.379000; val_acc: 0.313000
(Iteration 121 / 200) loss: 1.657947
(Iteration 131 / 200) loss: 1.756390
(Iteration 141 / 200) loss: 1.633797
(Iteration 151 / 200) loss: 1.761497
(Epoch 4 / 5) train acc: 0.441000; val_acc: 0.333000
(Iteration 161 / 200) loss: 1.654843
(Iteration 171 / 200) loss: 1.646113
(Iteration 181 / 200) loss: 1.553213
(Iteration 191 / 200) loss: 1.550278
(Epoch 5 / 5) train acc: 0.451000; val_acc: 0.331000

running with  sgd_momentum
(Iteration 1 / 200) loss: 2.889261
(Epoch 0 / 5) train acc: 0.134000; val_acc: 0.140000
(Iteration 11 / 200) loss: 2.143275
(Iteration 21 / 200) loss: 1.927722
(Iteration 31 / 200) loss: 1.794172
(Epoch 1 / 5) train acc: 0.314000; val_acc: 0.267000
(Iteration 41 / 200) loss: 1.804012
(Iteration 51 / 200) loss: 1.866279
(Iteration 61 / 200) loss: 1.742535
(Iteration 71 / 200) loss: 1.812978
(Epoch 2 / 5) train acc: 0.366000; val_acc: 0.340000
(Iteration 81 / 200) loss: 1.691168
(Iteration 91 / 200) loss: 1.764580
(Iteration 101 / 200) loss: 1.716754
(Iteration 111 / 200) loss: 1.622431
(Epoch 3 / 5) train acc: 0.445000; val_acc: 0.347000
(Iteration 121 / 200) loss: 1.497614
(Iteration 131 / 200) loss: 1.503044
(Iteration 141 / 200) loss: 1.492878
(Iteration 151 / 200) loss: 1.655548
(Epoch 4 / 5) train acc: 0.465000; val_acc: 0.334000
(Iteration 161 / 200) loss: 1.594414
(Iteration 171 / 200) loss: 1.459273
(Iteration 181 / 200) loss: 1.486554
(Iteration 191 / 200) loss: 1.458644
(Epoch 5 / 5) train acc: 0.501000; val_acc: 0.306000

RMSProp and Adam

RMSProp [1] and Adam [2] are update rules that set per-parameter learning rates by using a running average of the second moments of gradients.

In the file cs231n/optim.py, implement the RMSProp update rule in the rmsprop function and implement the Adam update rule in the adam function, and check your implementations using the tests below.

[1] Tijmen Tieleman and Geoffrey Hinton. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning 4 (2012).

[2] Diederik Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization", ICLR 2015.
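For reference, minimal sketches of the two update equations are below; the actual rmsprop and adam functions in optim.py read and store these running values through the config dict, and their default hyperparameters may differ from the ones assumed here:

def rmsprop_sketch(w, dw, cache, learning_rate=1e-2, decay_rate=0.99, eps=1e-8):
  """ Scale the step per parameter by a moving average of squared gradients """
  cache = decay_rate * cache + (1 - decay_rate) * dw**2
  next_w = w - learning_rate * dw / (np.sqrt(cache) + eps)
  return next_w, cache

def adam_sketch(w, dw, m, v, t, learning_rate=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-8):
  """ Adam: momentum-like first moment plus RMSProp-like second moment,
      with bias correction for the first few steps """
  t += 1
  m = beta1 * m + (1 - beta1) * dw          # first moment estimate
  v = beta2 * v + (1 - beta2) * dw**2       # second moment estimate
  m_hat = m / (1 - beta1**t)                # bias-corrected first moment
  v_hat = v / (1 - beta2**t)                # bias-corrected second moment
  next_w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
  return next_w, m, v, t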


In [20]:
# Test RMSProp implementation; you should see errors less than 1e-7
from cs231n.optim import rmsprop

N, D = 4, 5
w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
cache = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

config = {'learning_rate': 1e-2, 'cache': cache}
next_w, _ = rmsprop(w, dw, config=config)

expected_next_w = np.asarray([
  [-0.39223849, -0.34037513, -0.28849239, -0.23659121, -0.18467247],
  [-0.132737,   -0.08078555, -0.02881884,  0.02316247,  0.07515774],
  [ 0.12716641,  0.17918792,  0.23122175,  0.28326742,  0.33532447],
  [ 0.38739248,  0.43947102,  0.49155973,  0.54365823,  0.59576619]])
expected_cache = np.asarray([
  [ 0.5976,      0.6126277,   0.6277108,   0.64284931,  0.65804321],
  [ 0.67329252,  0.68859723,  0.70395734,  0.71937285,  0.73484377],
  [ 0.75037008,  0.7659518,   0.78158892,  0.79728144,  0.81302936],
  [ 0.82883269,  0.84469141,  0.86060554,  0.87657507,  0.8926    ]])

print 'next_w error: ', rel_error(expected_next_w, next_w)
print 'cache error: ', rel_error(expected_cache, config['cache'])


next_w error:  9.52468751104e-08
cache error:  2.64779558072e-09

In [30]:
# Test Adam implementation; you should see errors around 1e-7 or less
from cs231n.optim import adam

N, D = 4, 5
w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
m = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)
v = np.linspace(0.7, 0.5, num=N*D).reshape(N, D)

config = {'learning_rate': 1e-2, 'm': m, 'v': v, 't': 5}
next_w, _ = adam(w, dw, config=config)

expected_next_w = np.asarray([
  [-0.40094747, -0.34836187, -0.29577703, -0.24319299, -0.19060977],
  [-0.1380274,  -0.08544591, -0.03286534,  0.01971428,  0.0722929],
  [ 0.1248705,   0.17744702,  0.23002243,  0.28259667,  0.33516969],
  [ 0.38774145,  0.44031188,  0.49288093,  0.54544852,  0.59801459]])
expected_v = np.asarray([
  [ 0.69966,     0.68908382,  0.67851319,  0.66794809,  0.65738853,],
  [ 0.64683452,  0.63628604,  0.6257431,   0.61520571,  0.60467385,],
  [ 0.59414753,  0.58362676,  0.57311152,  0.56260183,  0.55209767,],
  [ 0.54159906,  0.53110598,  0.52061845,  0.51013645,  0.49966,   ]])
expected_m = np.asarray([
  [ 0.48,        0.49947368,  0.51894737,  0.53842105,  0.55789474],
  [ 0.57736842,  0.59684211,  0.61631579,  0.63578947,  0.65526316],
  [ 0.67473684,  0.69421053,  0.71368421,  0.73315789,  0.75263158],
  [ 0.77210526,  0.79157895,  0.81105263,  0.83052632,  0.85      ]])

print 'next_w error: ', rel_error(expected_next_w, next_w)
print 'v error: ', rel_error(expected_v, config['v'])
print 'm error: ', rel_error(expected_m, config['m'])


next_w error:  0.207207036686
v error:  4.20831403811e-09
m error:  4.21496319311e-09

Once you have debugged your RMSProp and Adam implementations, run the following to train a pair of deep networks using these new update rules:


In [31]:
learning_rates = {'rmsprop': 1e-4, 'adam': 1e-3}
for update_rule in ['adam', 'rmsprop']:
  print 'running with ', update_rule
  model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)

  solver = Solver(model, small_data,
                  num_epochs=5, batch_size=100,
                  update_rule=update_rule,
                  optim_config={
                    'learning_rate': learning_rates[update_rule]
                  },
                  verbose=True)
  solvers[update_rule] = solver
  solver.train()
  print

plt.subplot(3, 1, 1)
plt.title('Training loss')
plt.xlabel('Iteration')

plt.subplot(3, 1, 2)
plt.title('Training accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 3)
plt.title('Validation accuracy')
plt.xlabel('Epoch')

for update_rule, solver in solvers.iteritems():
  plt.subplot(3, 1, 1)
  plt.plot(solver.loss_history, 'o', label=update_rule)
  
  plt.subplot(3, 1, 2)
  plt.plot(solver.train_acc_history, '-o', label=update_rule)

  plt.subplot(3, 1, 3)
  plt.plot(solver.val_acc_history, '-o', label=update_rule)
  
for i in [1, 2, 3]:
  plt.subplot(3, 1, i)
  plt.legend(loc='upper center', ncol=4)
plt.gcf().set_size_inches(15, 15)
plt.show()


running with  adam
(Iteration 1 / 200) loss: 2.566493
(Epoch 0 / 5) train acc: 0.115000; val_acc: 0.107000
(Iteration 11 / 200) loss: 2.325604
(Iteration 21 / 200) loss: 2.117216
(Iteration 31 / 200) loss: 1.980261
(Epoch 1 / 5) train acc: 0.292000; val_acc: 0.256000
(Iteration 41 / 200) loss: 1.864532
(Iteration 51 / 200) loss: 1.924519
(Iteration 61 / 200) loss: 1.855249
(Iteration 71 / 200) loss: 1.829453
(Epoch 2 / 5) train acc: 0.330000; val_acc: 0.306000
(Iteration 81 / 200) loss: 1.744097
(Iteration 91 / 200) loss: 1.718135
(Iteration 101 / 200) loss: 1.624939
(Iteration 111 / 200) loss: 1.646293
(Epoch 3 / 5) train acc: 0.408000; val_acc: 0.343000
(Iteration 121 / 200) loss: 1.636377
(Iteration 131 / 200) loss: 1.685118
(Iteration 141 / 200) loss: 1.584826
(Iteration 151 / 200) loss: 1.422236
(Epoch 4 / 5) train acc: 0.480000; val_acc: 0.358000
(Iteration 161 / 200) loss: 1.437361
(Iteration 171 / 200) loss: 1.610384
(Iteration 181 / 200) loss: 1.506537
(Iteration 191 / 200) loss: 1.580346
(Epoch 5 / 5) train acc: 0.429000; val_acc: 0.348000

running with  rmsprop
(Iteration 1 / 200) loss: 2.705651
(Epoch 0 / 5) train acc: 0.127000; val_acc: 0.106000
(Iteration 11 / 200) loss: 2.000446
(Iteration 21 / 200) loss: 1.892830
(Iteration 31 / 200) loss: 2.008644
(Epoch 1 / 5) train acc: 0.376000; val_acc: 0.323000
(Iteration 41 / 200) loss: 1.800785
(Iteration 51 / 200) loss: 1.803668
(Iteration 61 / 200) loss: 1.799848
(Iteration 71 / 200) loss: 1.702923
(Epoch 2 / 5) train acc: 0.466000; val_acc: 0.307000
(Iteration 81 / 200) loss: 1.569922
(Iteration 91 / 200) loss: 1.468731
(Iteration 101 / 200) loss: 1.335259
(Iteration 111 / 200) loss: 1.518212
(Epoch 3 / 5) train acc: 0.501000; val_acc: 0.356000
(Iteration 121 / 200) loss: 1.660264
(Iteration 131 / 200) loss: 1.543441
(Iteration 141 / 200) loss: 1.595290
(Iteration 151 / 200) loss: 1.433084
(Epoch 4 / 5) train acc: 0.510000; val_acc: 0.352000
(Iteration 161 / 200) loss: 1.174871
(Iteration 171 / 200) loss: 1.452137
(Iteration 181 / 200) loss: 1.434223
(Iteration 191 / 200) loss: 1.457331
(Epoch 5 / 5) train acc: 0.534000; val_acc: 0.374000

Train a good model!

Train the best fully-connected model that you can on CIFAR-10, storing your best model in the best_model variable. We require you to get at least 50% accuracy on the validation set using a fully-connected net.

If you are careful it should be possible to get accuracies above 55%, but we don't require it for this part and won't assign extra credit for doing so. Later in the assignment we will ask you to train the best convolutional network that you can on CIFAR-10, and we would prefer that you spend your effort working on convolutional nets rather than fully-connected nets.

You might find it useful to complete the BatchNormalization.ipynb and Dropout.ipynb notebooks before completing this part, since those techniques can help you train powerful models.


In [38]:
best_model = None
################################################################################
# TODO: Train the best FullyConnectedNet that you can on CIFAR-10. You might   #
# find batch normalization and dropout useful. Store your best model in the    #
# best_model variable.                                                         #
################################################################################
weight_scale = 8e-3
learning_rate = 1.0e-3
model = FullyConnectedNet([50, 50],
              weight_scale=weight_scale, dtype=np.float64)
solver = Solver(model, data,
                print_every=100, num_epochs=5, batch_size=250,
                update_rule='rmsprop',
                optim_config = {'learning_rate': learning_rate}
         )
solver.train()

################################################################################
#                              END OF YOUR CODE                                #
################################################################################


(Iteration 1 / 980) loss: 2.301886
(Epoch 0 / 5) train acc: 0.130000; val_acc: 0.111000
(Iteration 101 / 980) loss: 1.780897
(Epoch 1 / 5) train acc: 0.444000; val_acc: 0.432000
(Iteration 201 / 980) loss: 1.581982
(Iteration 301 / 980) loss: 1.593262
(Epoch 2 / 5) train acc: 0.461000; val_acc: 0.453000
(Iteration 401 / 980) loss: 1.526138
(Iteration 501 / 980) loss: 1.515401
(Epoch 3 / 5) train acc: 0.495000; val_acc: 0.463000
(Iteration 601 / 980) loss: 1.406601
(Iteration 701 / 980) loss: 1.512447
(Epoch 4 / 5) train acc: 0.471000; val_acc: 0.470000
(Iteration 801 / 980) loss: 1.475562
(Iteration 901 / 980) loss: 1.496496
(Epoch 5 / 5) train acc: 0.511000; val_acc: 0.462000

In [47]:
plt.subplot(2, 1, 1)
plt.plot(solver.loss_history, 'o')
plt.title('Training loss history')
plt.xlabel('Iteration')
plt.ylabel('Training loss')

plt.subplot(2, 1, 2)
plt.plot(solver.train_acc_history, '-o', label='training')
plt.plot(solver.val_acc_history, '-*', label='validation')
plt.title('Training and validation accuracy history')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')

for i in [1, 2]:
    plt.subplot(2, 1, i)
    plt.legend(loc='upper center', ncol=4)
plt.gcf().set_size_inches(15, 15)

plt.show()


Test your model

Run your best model on the validation and test sets. You should achieve above 50% accuracy on the validation set.


In [42]:
best_model = model
y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)
y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)
print 'Validation set accuracy: ', (y_val_pred == data['y_val']).mean()
print 'Test set accuracy: ', (y_test_pred == data['y_test']).mean()


Validation set accuracy:  0.47
Test set accuracy:  0.454