Multiclass Support Vector Machine exercise

Complete and hand in this worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the assignments page on the course website.

In this exercise you will:

  • implement a fully-vectorized loss function for the SVM
  • implement the fully-vectorized expression for its analytic gradient
  • check your implementation using numerical gradient
  • use a validation set to tune the learning rate and regularization strength
  • optimize the loss function with SGD
  • visualize the final learned weights

In [1]:
# Run some setup code for this notebook.

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

# This is a bit of magic to make matplotlib figures appear inline in the
# notebook rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

CIFAR-10 Data Loading and Preprocessing


In [2]:
# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print 'Training data shape: ', X_train.shape
print 'Training labels shape: ', y_train.shape
print 'Test data shape: ', X_test.shape
print 'Test labels shape: ', y_test.shape


Training data shape:  (50000, 32, 32, 3)
Training labels shape:  (50000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)

In [3]:
# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()



In [4]:
# Split the data into train, val, and test sets. In addition we will
# create a small development set as a subset of the training data;
# we can use this for development so our code runs faster.
num_training = 49000
num_validation = 1000
num_test = 1000
num_dev = 500

# Our validation set will be num_validation points from the original
# training set.
mask = range(num_training, num_training + num_validation)
X_val = X_train[mask]
y_val = y_train[mask]

# Our training set will be the first num_train points from the original
# training set.
mask = range(num_training)
X_train = X_train[mask]
y_train = y_train[mask]

# We will also make a development set, which is a small subset of
# the training set.
mask = np.random.choice(num_training, num_dev, replace=False)
X_dev = X_train[mask]
y_dev = y_train[mask]

# We use the first num_test points of the original test set as our
# test set.
mask = range(num_test)
X_test = X_test[mask]
y_test = y_test[mask]

print 'Train data shape: ', X_train.shape
print 'Train labels shape: ', y_train.shape
print 'Validation data shape: ', X_val.shape
print 'Validation labels shape: ', y_val.shape
print 'Test data shape: ', X_test.shape
print 'Test labels shape: ', y_test.shape


Train data shape:  (49000, 32, 32, 3)
Train labels shape:  (49000,)
Validation data shape:  (1000, 32, 32, 3)
Validation labels shape:  (1000,)
Test data shape:  (1000, 32, 32, 3)
Test labels shape:  (1000,)

In [5]:
# Preprocessing: reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_val = np.reshape(X_val, (X_val.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))

# As a sanity check, print out the shapes of the data
print 'Training data shape: ', X_train.shape
print 'Validation data shape: ', X_val.shape
print 'Test data shape: ', X_test.shape
print 'dev data shape: ', X_dev.shape


Training data shape:  (49000, 3072)
Validation data shape:  (1000, 3072)
Test data shape:  (1000, 3072)
dev data shape:  (500, 3072)

In [6]:
# Preprocessing: subtract the mean image
# first: compute the image mean based on the training data
mean_image = np.mean(X_train, axis=0)
print mean_image[:10] # print a few of the elements
plt.figure(figsize=(4,4))
plt.imshow(mean_image.reshape((32,32,3)).astype('uint8')) # visualize the mean image
plt.show()


[ 130.64189796  135.98173469  132.47391837  130.05569388  135.34804082
  131.75402041  130.96055102  136.14328571  132.47636735  131.48467347]

In [7]:
# second: subtract the mean image from train and test data
X_train -= mean_image
X_val -= mean_image
X_test -= mean_image
X_dev -= mean_image

In [8]:
# third: append the bias dimension of ones (i.e., the bias trick) so that our SVM
# only has to worry about optimizing a single weight matrix W.
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])

print X_train.shape, X_val.shape, X_test.shape, X_dev.shape


(49000, 3073) (1000, 3073) (1000, 3073) (500, 3073)

SVM Classifier

Your code for this section will all be written inside cs231n/classifiers/linear_svm.py.

As you can see, we have prefilled the function svm_loss_naive, which uses for loops to evaluate the multiclass SVM loss function.

The grad returned from the function above is right now all zero. Derive the gradient of the SVM cost function and implement it inline inside the function svm_loss_naive. You will find it helpful to interleave your new code with the existing function body.
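For orientation, here is a minimal sketch of what a completed svm_loss_naive could look like, with the gradient interleaved with the loss computation. The shapes (W is (D, C), X is (N, D)) match how the function is called below; the regularization convention (0.5 * reg * sum(W*W), hence a reg * W gradient term) is an assumption and may differ from your scaffold:

import numpy as np

def svm_loss_naive_sketch(W, X, y, reg):
    # W: (D, C) weights; X: (N, D) data; y: (N,) labels; reg: regularization strength
    dW = np.zeros(W.shape)
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in xrange(num_train):
        scores = X[i].dot(W)
        correct_class_score = scores[y[i]]
        for j in xrange(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1  # delta = 1
            if margin > 0:
                loss += margin
                # a positive margin adds +x_i to column j of the gradient
                # and -x_i to the correct class's column
                dW[:, j] += X[i]
                dW[:, y[i]] -= X[i]
    loss = loss / num_train + 0.5 * reg * np.sum(W * W)
    dW = dW / num_train + reg * W
    return loss, dW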

To check that you have implemented the gradient correctly, you can numerically estimate the gradient of the loss function and compare the numeric estimate to the gradient that you computed. We have provided code that does this for you:


In [9]:
# Evaluate the naive implementation of the loss we provided for you:
from cs231n.classifiers.linear_svm import svm_loss_naive
import time

# generate a random SVM weight matrix of small numbers
W = np.random.randn(3073, 10) * 0.0001 

loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.00001)
print 'loss: %f' % (loss, )


loss: 9.220215

In [10]:
# Once you've implemented the gradient, recompute it with the code below
# and gradient check it with the function we provided for you

# Compute the loss and its gradient at W.
loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.0)

# Numerically compute the gradient along several randomly chosen dimensions, and
# compare them with your analytically computed gradient. The numbers should match
# almost exactly along all dimensions.
from cs231n.gradient_check import grad_check_sparse
f = lambda w: svm_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_numerical = grad_check_sparse(f, W, grad)

# do the gradient check once again with regularization turned on
# you didn't forget the regularization gradient, did you?
loss, grad = svm_loss_naive(W, X_dev, y_dev, 1e2)
f = lambda w: svm_loss_naive(w, X_dev, y_dev, 1e2)[0]
grad_numerical = grad_check_sparse(f, W, grad)


numerical: -15.075133 analytic: -15.075133, relative error: 1.801964e-11
numerical: -34.924259 analytic: -34.924259, relative error: 1.337497e-12
numerical: 2.801715 analytic: 2.801715, relative error: 2.724777e-11
numerical: 0.658640 analytic: 0.658640, relative error: 8.476795e-10
numerical: -17.166538 analytic: -17.166538, relative error: 9.861441e-14
numerical: 15.668297 analytic: 15.668297, relative error: 1.970258e-11
numerical: 16.268971 analytic: 16.268971, relative error: 1.217476e-11
numerical: 27.729075 analytic: 27.729075, relative error: 5.209582e-12
numerical: 1.264709 analytic: 1.264709, relative error: 1.352046e-10
numerical: 13.845269 analytic: 13.845269, relative error: 9.061874e-12
numerical: 12.761375 analytic: 12.761375, relative error: 2.344719e-12
numerical: 4.497833 analytic: 4.497833, relative error: 5.343635e-11
numerical: 5.191822 analytic: 5.191822, relative error: 4.437418e-11
numerical: 5.157840 analytic: 5.157840, relative error: 4.232066e-12
numerical: 5.208222 analytic: 5.208222, relative error: 4.870695e-11
numerical: 39.989715 analytic: 39.989715, relative error: 7.661976e-12
numerical: 15.035931 analytic: 15.035931, relative error: 3.042893e-11
numerical: -33.575759 analytic: -33.575759, relative error: 3.021140e-12
numerical: 20.354748 analytic: 20.354748, relative error: 1.498529e-11
numerical: -14.691460 analytic: -14.691460, relative error: 2.824550e-11
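
(For intuition, grad_check_sparse works roughly like the sketch below: it perturbs a few random coordinates of W and compares a centered finite difference against the analytic value. This is a paraphrase of the idea, not the exact provided implementation.)

import numpy as np

def grad_check_sparse_sketch(f, x, analytic_grad, num_checks=10, h=1e-5):
    # Compare the analytic gradient to a centered numerical estimate
    # at a few randomly chosen coordinates of x.
    for _ in xrange(num_checks):
        ix = tuple([np.random.randint(m) for m in x.shape])
        oldval = x[ix]
        x[ix] = oldval + h
        fxph = f(x)      # f(x + h)
        x[ix] = oldval - h
        fxmh = f(x)      # f(x - h)
        x[ix] = oldval   # restore the original value
        grad_numerical = (fxph - fxmh) / (2 * h)
        grad_analytic = analytic_grad[ix]
        rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic))
        print 'numerical: %f analytic: %f, relative error: %e' % (
            grad_numerical, grad_analytic, rel_error)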

Inline Question 1:

It is possible that once in a while a dimension in the gradcheck will not match exactly. What could such a discrepancy be caused by? Is it a reason for concern? What is a simple example in one dimension where a gradient check could fail? Hint: the SVM loss function is not, strictly speaking, differentiable.

Your Answer: This can occur when the gradient is evaluated at a point where the loss function is non-differentiable (a kink). The numerical gradient yields an average-slope-like quantity across the kink, while the analytic gradient reports the (sub)gradient from one side, so the two can disagree even when both implementations are correct. It is not a reason for concern as long as it happens rarely. A simple one-dimensional example is f(x) = max(0, x) at x = 0: the analytic subgradient is commonly taken to be 0, but a centered numerical estimate gives 0.5.
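
A tiny numerical illustration of the kink case described above (the function max(0, x) and the step size h are just illustrative choices):

f = lambda x: max(0.0, x)  # a 1-D version of the hinge
h = 1e-5
numeric = (f(0.0 + h) - f(0.0 - h)) / (2 * h)  # = 0.5 for any h > 0
analytic = 0.0  # a common subgradient choice at the kink
print 'numeric: %f, analytic: %f' % (numeric, analytic)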


In [11]:
# Next implement the function svm_loss_vectorized; for now only compute the loss;
# we will implement the gradient in a moment.
tic = time.time()
loss_naive, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'Naive loss: %e computed in %fs' % (loss_naive, toc - tic)

from cs231n.classifiers.linear_svm import svm_loss_vectorized
tic = time.time()
loss_vectorized, _ = svm_loss_vectorized(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'Vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)

# The losses should match but your vectorized implementation should be much faster.
print 'difference: %f' % (loss_naive - loss_vectorized)


Naive loss: 9.220215e+00 computed in 0.145116s
Vectorized loss: 9.220215e+00 computed in 0.013835s
difference: -0.000000
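
One way to vectorize the loss, as a sketch under the same shape and regularization assumptions as the naive sketch above:

import numpy as np

def svm_loss_vectorized_loss_sketch(W, X, y, reg):
    # compute all N*C scores in one matrix multiply
    num_train = X.shape[0]
    scores = X.dot(W)                               # (N, C)
    correct = scores[np.arange(num_train), y]       # (N,)
    margins = np.maximum(0, scores - correct[:, np.newaxis] + 1)
    margins[np.arange(num_train), y] = 0            # the correct class contributes no loss
    loss = np.sum(margins) / num_train + 0.5 * reg * np.sum(W * W)
    return loss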

In [12]:
# Complete the implementation of svm_loss_vectorized, and compute the gradient
# of the loss function in a vectorized way.

# The naive implementation and the vectorized implementation should match, but
# the vectorized version should still be much faster.
tic = time.time()
_, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'Naive loss and gradient: computed in %fs' % (toc - tic)

tic = time.time()
_, grad_vectorized = svm_loss_vectorized(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'Vectorized loss and gradient: computed in %fs' % (toc - tic)

# The loss is a single number, so it is easy to compare the values computed
# by the two implementations. The gradient on the other hand is a matrix, so
# we use the Frobenius norm to compare them.
difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print 'difference: %f' % difference


Naive loss and gradient: computed in 0.146886s
Vectorized loss and gradient: computed in 0.013559s
difference: 0.000000
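
A sketch of one way to vectorize the gradient: build an (N, C) coefficient matrix in which each positive margin contributes +1 to its own column and -1 (accumulated) to the correct class's column, then form dW with a single matrix multiply:

import numpy as np

def svm_grad_vectorized_sketch(W, X, y, reg):
    num_train = X.shape[0]
    scores = X.dot(W)
    correct = scores[np.arange(num_train), y]
    margins = np.maximum(0, scores - correct[:, np.newaxis] + 1)
    margins[np.arange(num_train), y] = 0
    coeff = (margins > 0).astype(float)                       # +1 for each violated margin
    coeff[np.arange(num_train), y] = -np.sum(coeff, axis=1)   # -count for the correct class
    dW = X.T.dot(coeff) / num_train + reg * W                 # (D, C)
    return dW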

Stochastic Gradient Descent

We now have vectorized, efficient expressions for the loss and the gradient, and our analytic gradient matches the numerical gradient. We are therefore ready to use SGD to minimize the loss.
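
A minimal sketch of the SGD loop that LinearClassifier.train needs, written here as a free function; the batch size, sampling with replacement, and print cadence are illustrative choices:

import numpy as np

def sgd_train_sketch(loss_fn, W, X, y, learning_rate=1e-7, reg=5e4,
                     num_iters=1500, batch_size=200, verbose=False):
    # loss_fn(W, X_batch, y_batch, reg) -> (loss, grad), e.g. svm_loss_vectorized
    num_train = X.shape[0]
    loss_history = []
    for it in xrange(num_iters):
        # sample a minibatch with replacement (faster than without, and fine for SGD)
        batch_idx = np.random.choice(num_train, batch_size)
        loss, grad = loss_fn(W, X[batch_idx], y[batch_idx], reg)
        loss_history.append(loss)
        W -= learning_rate * grad  # vanilla parameter update
        if verbose and it % 100 == 0:
            print 'iteration %d / %d: loss %f' % (it, num_iters, loss)
    return loss_history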


In [13]:
# In the file linear_classifier.py, implement SGD in the function
# LinearClassifier.train() and then run it with the code below.
from cs231n.classifiers import LinearSVM
svm = LinearSVM()
tic = time.time()
loss_hist = svm.train(X_train, y_train, learning_rate=1e-7, reg=5e4,
                      num_iters=1500, verbose=True)
toc = time.time()
print 'That took %fs' % (toc - tic)


iteration 0 / 1500: loss 786.481862
iteration 100 / 1500: loss 288.048892
iteration 200 / 1500: loss 108.584548
iteration 300 / 1500: loss 43.139334
iteration 400 / 1500: loss 18.888727
iteration 500 / 1500: loss 10.066233
iteration 600 / 1500: loss 6.664986
iteration 700 / 1500: loss 5.531027
iteration 800 / 1500: loss 5.515485
iteration 900 / 1500: loss 4.929219
iteration 1000 / 1500: loss 5.283793
iteration 1100 / 1500: loss 5.679458
iteration 1200 / 1500: loss 5.086316
iteration 1300 / 1500: loss 5.793161
iteration 1400 / 1500: loss 5.697323
That took 7.356027s

In [14]:
# A useful debugging strategy is to plot the loss as a function of
# iteration number:
plt.plot(loss_hist)
plt.xlabel('Iteration number')
plt.ylabel('Loss value')
plt.show()



In [15]:
# Write the LinearSVM.predict function and evaluate the performance on both the
# training and validation set
y_train_pred = svm.predict(X_train)
print 'training accuracy: %f' % (np.mean(y_train == y_train_pred), )
y_val_pred = svm.predict(X_val)
print 'validation accuracy: %f' % (np.mean(y_val == y_val_pred), )


training accuracy: 0.367571
validation accuracy: 0.394000
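
For reference, predict can be a one-liner over the score matrix; a sketch, with W assumed to have shape (D, C):

import numpy as np

def predict_sketch(W, X):
    scores = X.dot(W)                 # (N, C) class scores
    return np.argmax(scores, axis=1)  # highest-scoring class per row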

In [16]:
# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of about 0.4 on the validation set.
learning_rates = [1e-9, 5e-6]
regularization_strengths = [1e2, 1e4]

# results is a dictionary mapping tuples of the form
# (learning_rate, regularization_strength) to tuples of the form
# (training_accuracy, validation_accuracy). The accuracy is simply the fraction
# of data points that are correctly classified.
results = {}
best_val = -1   # The highest validation accuracy that we have seen so far.
best_svm = None # The LinearSVM object that achieved the highest validation rate.

for _ in xrange(50):
    # sample the learning rate and regularization strength log-uniformly
    lr = 10 ** np.random.uniform(low=np.log10(learning_rates[0]), high=np.log10(learning_rates[1]))
    reg = 10 ** np.random.uniform(low=np.log10(regularization_strengths[0]), high=np.log10(regularization_strengths[1]))

    svm = LinearSVM()
    loss_hist = svm.train(X_train, y_train, learning_rate=lr, reg=reg,
                          num_iters=500, verbose=False)
    y_train_pred = svm.predict(X_train)
    y_val_pred = svm.predict(X_val)
    accuracy = (np.mean(y_train == y_train_pred), np.mean(y_val == y_val_pred))

    results[(lr, reg)] = accuracy

    if accuracy[1] > best_val:
        best_val = accuracy[1]
        best_svm = svm  # keep the model with the best validation accuracy

# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (
                lr, reg, train_accuracy, val_accuracy)
    
print 'best validation accuracy achieved during cross-validation: %f' % best_val


lr 1.302405e-09 reg 1.721646e+03 train accuracy: 0.117184 val accuracy: 0.131000
lr 1.444376e-09 reg 2.013006e+02 train accuracy: 0.129286 val accuracy: 0.113000
lr 1.828280e-09 reg 1.131647e+02 train accuracy: 0.140673 val accuracy: 0.146000
lr 1.852493e-09 reg 3.872099e+03 train accuracy: 0.130020 val accuracy: 0.116000
lr 1.946407e-09 reg 2.281849e+02 train accuracy: 0.100673 val accuracy: 0.105000
lr 2.699682e-09 reg 1.159344e+02 train accuracy: 0.127878 val accuracy: 0.131000
lr 3.863270e-09 reg 8.420995e+03 train accuracy: 0.140061 val accuracy: 0.131000
lr 4.018844e-09 reg 5.362393e+02 train accuracy: 0.150939 val accuracy: 0.164000
lr 4.197034e-09 reg 7.937880e+03 train accuracy: 0.156143 val accuracy: 0.172000
lr 4.909419e-09 reg 2.483638e+02 train accuracy: 0.148143 val accuracy: 0.141000
lr 5.546436e-09 reg 1.167849e+03 train accuracy: 0.150204 val accuracy: 0.138000
lr 5.663728e-09 reg 6.404317e+02 train accuracy: 0.147163 val accuracy: 0.143000
lr 6.024065e-09 reg 5.362896e+02 train accuracy: 0.164918 val accuracy: 0.167000
lr 9.375767e-09 reg 2.069430e+03 train accuracy: 0.186878 val accuracy: 0.161000
lr 1.297326e-08 reg 2.975701e+02 train accuracy: 0.198633 val accuracy: 0.210000
lr 2.630716e-08 reg 4.946381e+02 train accuracy: 0.212204 val accuracy: 0.203000
lr 2.744733e-08 reg 4.777409e+03 train accuracy: 0.234510 val accuracy: 0.219000
lr 2.840945e-08 reg 1.222413e+02 train accuracy: 0.214653 val accuracy: 0.202000
lr 2.943564e-08 reg 2.251683e+02 train accuracy: 0.216918 val accuracy: 0.216000
lr 3.099136e-08 reg 6.624127e+02 train accuracy: 0.225571 val accuracy: 0.237000
lr 4.188907e-08 reg 2.081694e+03 train accuracy: 0.233204 val accuracy: 0.235000
lr 4.726345e-08 reg 1.453636e+02 train accuracy: 0.235327 val accuracy: 0.241000
lr 5.099787e-08 reg 4.851741e+02 train accuracy: 0.236082 val accuracy: 0.224000
lr 9.775338e-08 reg 1.242870e+02 train accuracy: 0.263735 val accuracy: 0.277000
lr 1.171567e-07 reg 2.164398e+03 train accuracy: 0.284224 val accuracy: 0.285000
lr 1.302023e-07 reg 5.050791e+02 train accuracy: 0.277673 val accuracy: 0.299000
lr 1.944819e-07 reg 1.236910e+03 train accuracy: 0.292245 val accuracy: 0.300000
lr 1.965821e-07 reg 5.843519e+02 train accuracy: 0.298429 val accuracy: 0.292000
lr 2.179956e-07 reg 1.270829e+03 train accuracy: 0.292776 val accuracy: 0.299000
lr 3.576999e-07 reg 3.697240e+02 train accuracy: 0.315612 val accuracy: 0.328000
lr 3.662994e-07 reg 2.000866e+02 train accuracy: 0.318286 val accuracy: 0.305000
lr 4.013385e-07 reg 3.765995e+02 train accuracy: 0.315980 val accuracy: 0.312000
lr 4.106174e-07 reg 9.277875e+03 train accuracy: 0.355816 val accuracy: 0.368000
lr 4.592794e-07 reg 2.325922e+02 train accuracy: 0.324224 val accuracy: 0.322000
lr 4.901308e-07 reg 5.015977e+03 train accuracy: 0.361490 val accuracy: 0.347000
lr 5.093925e-07 reg 4.430814e+02 train accuracy: 0.325857 val accuracy: 0.304000
lr 5.446493e-07 reg 5.666251e+03 train accuracy: 0.344327 val accuracy: 0.371000
lr 5.731062e-07 reg 1.270305e+03 train accuracy: 0.343490 val accuracy: 0.348000
lr 6.495145e-07 reg 6.505492e+02 train accuracy: 0.332163 val accuracy: 0.307000
lr 6.908356e-07 reg 3.587090e+02 train accuracy: 0.328469 val accuracy: 0.328000
lr 7.798060e-07 reg 2.727286e+02 train accuracy: 0.342531 val accuracy: 0.346000
lr 9.765465e-07 reg 6.010247e+02 train accuracy: 0.344837 val accuracy: 0.330000
lr 2.047145e-06 reg 2.239257e+03 train accuracy: 0.310245 val accuracy: 0.310000
lr 2.135709e-06 reg 2.499123e+03 train accuracy: 0.314837 val accuracy: 0.330000
lr 2.576906e-06 reg 2.381580e+03 train accuracy: 0.302122 val accuracy: 0.303000
lr 2.802804e-06 reg 8.992264e+02 train accuracy: 0.264020 val accuracy: 0.265000
lr 2.870434e-06 reg 9.868929e+03 train accuracy: 0.277122 val accuracy: 0.293000
lr 3.557946e-06 reg 2.387583e+02 train accuracy: 0.324959 val accuracy: 0.336000
lr 3.852597e-06 reg 4.514150e+03 train accuracy: 0.295633 val accuracy: 0.300000
lr 4.896848e-06 reg 2.693174e+02 train accuracy: 0.294633 val accuracy: 0.275000
best validation accuracy achieved during cross-validation: 0.371000

In [18]:
# Find the best learning rate and regularization strength
best_lr = 0.
best_reg = 0.

for lr, reg in sorted(results):
    if results[(lr, reg)][1] == best_val:
        best_lr = lr
        best_reg = reg
        break


# Train the best_svm with more iterations
best_svm = LinearSVM()
best_svm.train(X_train, y_train, 
               learning_rate=best_lr, 
               reg=best_reg, 
               num_iters=2000, verbose=True)

y_train_pred = best_svm.predict(X_train)
y_val_pred = best_svm.predict(X_val)
accuracy = (np.mean(y_train == y_train_pred), np.mean(y_val == y_val_pred))

print 'Best validation accuracy now: %f' % accuracy[1]


iteration 0 / 2000: loss 106.725367
iteration 100 / 2000: loss 53.407930
iteration 200 / 2000: loss 30.142728
iteration 300 / 2000: loss 17.930831
iteration 400 / 2000: loss 12.474820
iteration 500 / 2000: loss 7.946178
iteration 600 / 2000: loss 7.054409
iteration 700 / 2000: loss 6.468421
iteration 800 / 2000: loss 5.103795
iteration 900 / 2000: loss 5.268660
iteration 1000 / 2000: loss 5.160260
iteration 1100 / 2000: loss 5.531798
iteration 1200 / 2000: loss 5.321307
iteration 1300 / 2000: loss 4.653675
iteration 1400 / 2000: loss 5.272408
iteration 1500 / 2000: loss 5.160252
iteration 1600 / 2000: loss 4.846376
iteration 1700 / 2000: loss 4.647264
iteration 1800 / 2000: loss 4.635100
iteration 1900 / 2000: loss 4.793430
Best validation accuracy now: 0.368000

In [19]:
# Visualize the cross-validation results
import math
x_scatter = [math.log10(x[0]) for x in results]
y_scatter = [math.log10(x[1]) for x in results]

# plot training accuracy
marker_size = 100
colors = [results[x][0] for x in results]
plt.subplot(2, 1, 1)
plt.scatter(x_scatter, y_scatter, marker_size, c=colors)
plt.colorbar()
plt.xlabel('log learning rate')
plt.ylabel('log regularization strength')
plt.title('CIFAR-10 training accuracy')

# plot validation accuracy
colors = [results[x][1] for x in results] # default size of markers is 20
plt.subplot(2, 1, 2)
plt.scatter(x_scatter, y_scatter, marker_size, c=colors)
plt.colorbar()
plt.xlabel('log learning rate')
plt.ylabel('log regularization strength')
plt.title('CIFAR-10 validation accuracy')
plt.show()



In [20]:
# Evaluate the best svm on test set
y_test_pred = best_svm.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print 'linear SVM on raw pixels final test set accuracy: %f' % test_accuracy


linear SVM on raw pixels final test set accuracy: 0.349000

In [21]:
# Visualize the learned weights for each class.
# Depending on your choice of learning rate and regularization strength, these may
# or may not be nice to look at.
w = best_svm.W[:-1,:] # strip out the bias
w = w.reshape(32, 32, 3, 10)
w_min, w_max = np.min(w), np.max(w)
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in xrange(10):
    plt.subplot(2, 5, i + 1)

    # Rescale the weights to be between 0 and 255
    wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title(classes[i])


Inline question 2:

Describe what your visualized SVM weights look like, and offer a brief explanation for why they look the way that they do.

Your answer: The learned weights look like blurry templates of each class, roughly an average of that class's training images. This is because each class score is an inner product between a column of W and the image, so the score is maximized by inputs that resemble the template; SGD therefore pulls each template toward the images of its class.