Softmax exercise

Complete and hand in this worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details, see the assignments page on the course website.

This exercise is analogous to the SVM exercise. You will:

  • implement a fully-vectorized loss function for the Softmax classifier
  • implement the fully-vectorized expression for its analytic gradient
  • check your implementation with numerical gradient
  • use a validation set to tune the learning rate and regularization strength
  • optimize the loss function with SGD
  • visualize the final learned weights
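For reference, the per-example softmax (cross-entropy) loss referred to in the first two bullets, for class scores f = x_i W, is

    L_i = -\log\left( \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \right), \qquad
    L = \frac{1}{N} \sum_i L_i + \lambda R(W)

where R(W) is the L2 regularization penalty on the weights.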

In [1]:
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [2]:
def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, num_dev=500):
  """
  Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
  it for the linear classifier. These are the same steps as we used for the
  SVM, but condensed to a single function.  
  """
  # Load the raw CIFAR-10 data
  cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
  X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
  
  # subsample the data
  mask = range(num_training, num_training + num_validation)
  X_val = X_train[mask]
  y_val = y_train[mask]
  mask = range(num_training)
  X_train = X_train[mask]
  y_train = y_train[mask]
  mask = range(num_test)
  X_test = X_test[mask]
  y_test = y_test[mask]
  mask = np.random.choice(num_training, num_dev, replace=False)
  X_dev = X_train[mask]
  y_dev = y_train[mask]
  
  # Preprocessing: reshape the image data into rows
  X_train = np.reshape(X_train, (X_train.shape[0], -1))
  X_val = np.reshape(X_val, (X_val.shape[0], -1))
  X_test = np.reshape(X_test, (X_test.shape[0], -1))
  X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))
  
  # Normalize the data: subtract the mean image
  mean_image = np.mean(X_train, axis = 0)
  X_train -= mean_image
  X_val -= mean_image
  X_test -= mean_image
  X_dev -= mean_image
  
  # add a bias dimension of ones so the bias can be folded into the weight matrix
  X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
  X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
  X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
  X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])
  
  return X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev


# Invoke the above function to get our data.
X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev = get_CIFAR10_data()
print 'Train data shape: ', X_train.shape
print 'Train labels shape: ', y_train.shape
print 'Validation data shape: ', X_val.shape
print 'Validation labels shape: ', y_val.shape
print 'Test data shape: ', X_test.shape
print 'Test labels shape: ', y_test.shape
print 'dev data shape: ', X_dev.shape
print 'dev labels shape: ', y_dev.shape


Train data shape:  (49000, 3073)
Train labels shape:  (49000,)
Validation data shape:  (1000, 3073)
Validation labels shape:  (1000,)
Test data shape:  (1000, 3073)
Test labels shape:  (1000,)
dev data shape:  (500, 3073)
dev labels shape:  (500,)

Softmax Classifier

Your code for this section will all be written inside cs231n/classifiers/softmax.py.


In [5]:
# First implement the naive softmax loss function with nested loops.
# Open the file cs231n/classifiers/softmax.py and implement the
# softmax_loss_naive function.

from cs231n.classifiers.softmax import softmax_loss_naive
import time

# Generate a random softmax weight matrix and use it to compute the loss.
W = np.random.randn(3073, 10) * 0.0001
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As a rough sanity check, our loss should be something close to -log(0.1).
print 'loss: %f' % loss
print 'sanity check: %f' % (-np.log(0.1))


loss: 2.397563
sanity check: 2.302585
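For orientation, here is a minimal sketch of what a loop-based softmax_loss_naive could look like. This is one possible implementation, not the assignment's reference solution; in particular, the regularization convention (with or without the 0.5 factor) should match whatever the provided SVM code uses.

def softmax_loss_naive_sketch(W, X, y, reg):
  # W: (D, C) weights, X: (N, D) data, y: (N,) integer labels, reg: regularization strength
  num_train, num_classes = X.shape[0], W.shape[1]
  loss = 0.0
  dW = np.zeros_like(W)
  for i in xrange(num_train):
    scores = X[i].dot(W)            # (C,) unnormalized log-probabilities
    scores -= np.max(scores)        # shift by the max for numerical stability
    probs = np.exp(scores) / np.sum(np.exp(scores))
    loss += -np.log(probs[y[i]])
    for c in xrange(num_classes):
      # d(cross-entropy)/d(score_c) = p_c - 1{c == y_i}; chain rule through scores = x.W
      dW[:, c] += (probs[c] - (c == y[i])) * X[i]
  loss = loss / num_train + 0.5 * reg * np.sum(W * W)
  dW = dW / num_train + reg * W
  return loss, dW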

Inline Question 1:

Why do we expect our loss to be close to -log(0.1)? Explain briefly.

Your answer: A randomly initialized W has no discriminative power, so each of the 10 classes is predicted with probability roughly 0.1; the expected loss is therefore about -log(0.1).
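A quick numeric check of this intuition: with 10 classes and uniform predicted probabilities, the cross-entropy loss per example is exactly -log(1/10).

num_classes = 10
uniform_prob = 1.0 / num_classes   # probability a random classifier assigns to the correct class, on average
print 'expected loss: %f' % (-np.log(uniform_prob))   # ~2.302585, the sanity-check value above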


In [6]:
# Complete the implementation of softmax_loss_naive and implement a (naive)
# version of the gradient that uses nested loops.
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As we did for the SVM, use numeric gradient checking as a debugging tool.
# The numeric gradient should be close to the analytic gradient.
from cs231n.gradient_check import grad_check_sparse
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)

# similar to SVM case, do another gradient check with regularization
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 1e2)
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 1e2)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)


numerical: -1.204135 analytic: -1.204135, relative error: 8.750527e-09
numerical: 1.906080 analytic: 1.906080, relative error: 3.977318e-08
numerical: 0.528011 analytic: 0.528011, relative error: 3.343905e-08
numerical: -0.610268 analytic: -0.610268, relative error: 2.828683e-08
numerical: 0.321206 analytic: 0.321206, relative error: 9.657798e-08
numerical: 0.763097 analytic: 0.763098, relative error: 3.178224e-08
numerical: -1.883311 analytic: -1.883311, relative error: 5.084884e-09
numerical: -0.899918 analytic: -0.899918, relative error: 1.259345e-08
numerical: 6.514251 analytic: 6.514251, relative error: 8.344245e-09
numerical: 1.385145 analytic: 1.385145, relative error: 4.784940e-08
numerical: 2.471611 analytic: 2.471879, relative error: 5.415519e-05
numerical: 0.379779 analytic: 0.368153, relative error: 1.554376e-02
numerical: -6.167855 analytic: -6.206047, relative error: 3.086431e-03
numerical: -0.613962 analytic: -0.644210, relative error: 2.404121e-02
numerical: 1.982912 analytic: 1.997106, relative error: 3.566353e-03
numerical: 0.229274 analytic: 0.209570, relative error: 4.490127e-02
numerical: 1.504908 analytic: 1.490827, relative error: 4.700338e-03
numerical: 1.763680 analytic: 1.766540, relative error: 8.101290e-04
numerical: -0.009006 analytic: -0.006130, relative error: 1.900678e-01
numerical: 1.527465 analytic: 1.508669, relative error: 6.190811e-03
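grad_check_sparse is provided with the assignment; conceptually it samples a handful of coordinates of W and compares a centered finite difference of the loss against the analytic gradient, roughly as in this sketch (not the actual cs231n implementation):

def grad_check_sparse_sketch(f, x, analytic_grad, num_checks=10, h=1e-5):
  # f: scalar loss as a function of x; x is modified in place and then restored
  for _ in xrange(num_checks):
    ix = tuple(np.random.randint(d) for d in x.shape)   # random coordinate of x
    oldval = x[ix]
    x[ix] = oldval + h
    fxph = f(x)                      # f(x + h)
    x[ix] = oldval - h
    fxmh = f(x)                      # f(x - h)
    x[ix] = oldval                   # restore
    grad_numerical = (fxph - fxmh) / (2 * h)
    grad_analytic = analytic_grad[ix]
    rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic))
    print 'numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error)

Note that the relative errors are larger in the regularized check; when a coordinate of the gradient is close to zero (e.g. the -0.009006 entry above), the relative error is inflated even if the absolute error is tiny.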

In [7]:
# Now that we have a naive implementation of the softmax loss function and its gradient,
# implement a vectorized version in softmax_loss_vectorized.
# The two versions should compute the same results, but the vectorized version should be
# much faster.
tic = time.time()
loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'naive loss: %e computed in %fs' % (loss_naive, toc - tic)

from cs231n.classifiers.softmax import softmax_loss_vectorized
tic = time.time()
loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)

# As we did for the SVM, we use the Frobenius norm to compare the two versions
# of the gradient.
grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)
print 'Gradient difference: %f' % grad_difference


naive loss: 2.397563e+00 computed in 0.087422s
vectorized loss: 2.397563e+00 computed in 0.020355s
Loss difference: 0.000000
Gradient difference: 0.000000
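For comparison, one way the vectorized version might be structured, under the same stability and regularization assumptions as the naive sketch above (again a sketch, not the reference solution):

def softmax_loss_vectorized_sketch(W, X, y, reg):
  num_train = X.shape[0]
  scores = X.dot(W)                                  # (N, C)
  scores -= np.max(scores, axis=1, keepdims=True)    # per-row shift for numerical stability
  exp_scores = np.exp(scores)
  probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
  loss = -np.sum(np.log(probs[np.arange(num_train), y])) / num_train
  loss += 0.5 * reg * np.sum(W * W)
  dscores = probs.copy()
  dscores[np.arange(num_train), y] -= 1.0            # p - 1 on the correct class
  dW = X.T.dot(dscores) / num_train + reg * W
  return loss, dW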

In [9]:
# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of over 0.35 on the validation set.
from cs231n.classifiers import Softmax
results = {}
best_val = -1
best_softmax = None
learning_rates = [1e-7, 5e-7]
regularization_strengths = [5e4, 1e8]

################################################################################
# TODO:                                                                        #
# Use the validation set to set the learning rate and regularization strength. #
# This should be identical to the validation that you did for the SVM; save    #
# the best trained softmax classifier in best_softmax.                         #
################################################################################

# refine the given endpoints into a finer grid of candidate values
learning_rates = np.linspace(learning_rates[0], learning_rates[1], num=5)
regularization_strengths = np.linspace(regularization_strengths[0], regularization_strengths[1], num=5)

# note: a learning rate or regularization strength that is too large drives the loss to inf
for learning_rate in learning_rates:
    for regularization_strength in regularization_strengths:
        softmax = Softmax()
        loss_hist = softmax.train(X_train, y_train, learning_rate=learning_rate, reg=regularization_strength,
                      num_iters=400, verbose=True)
        y_train_pred = softmax.predict(X_train)
        y_val_pred = softmax.predict(X_val)
        current_val = np.mean(y_val == y_val_pred)
        if current_val > best_val:
            best_val = current_val
            best_softmax = softmax
        results[(learning_rate, regularization_strength)] = (np.mean(y_train == y_train_pred),  current_val)
        

        
################################################################################
#                              END OF YOUR CODE                                #
################################################################################
    
# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (
                lr, reg, train_accuracy, val_accuracy)
    
print 'best validation accuracy achieved during cross-validation: %f' % best_val


iteration 0 / 400: loss 5.772058
iteration 100 / 400: loss 2.446937
iteration 200 / 400: loss 2.115400
iteration 300 / 400: loss 2.106650
iteration 0 / 400: loss 4.905951
iteration 100 / 400: loss inf
iteration 200 / 400: loss inf
iteration 300 / 400: loss inf
iteration 0 / 400: loss 5.509102
iteration 100 / 400: loss inf
iteration 200 / 400: loss inf
iteration 300 / 400: loss inf
iteration 0 / 400: loss 4.907421
iteration 100 / 400: loss inf
iteration 200 / 400: loss inf
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.690875
iteration 100 / 400: loss inf
iteration 200 / 400: loss inf
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.221801
iteration 100 / 400: loss 2.138481
iteration 200 / 400: loss 2.065346
iteration 300 / 400: loss 2.053401
iteration 0 / 400: loss 5.316043
iteration 100 / 400: loss inf
iteration 200 / 400: loss inf
iteration 300 / 400: loss inf
iteration 0 / 400: loss 5.136066
iteration 100 / 400: loss inf
iteration 200 / 400: loss inf
iteration 300 / 400: loss nan
iteration 0 / 400: loss 4.888397
iteration 100 / 400: loss inf
iteration 200 / 400: loss inf
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.540207
iteration 100 / 400: loss inf
cs231n/classifiers/softmax.py:86: RuntimeWarning: invalid value encountered in subtract
iteration 200 / 400: loss nan
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.022580
iteration 100 / 400: loss 2.073193
iteration 200 / 400: loss 2.080531
iteration 300 / 400: loss 2.099312
iteration 0 / 400: loss 5.366702
iteration 100 / 400: loss inf
iteration 200 / 400: loss inf
iteration 300 / 400: loss nan
iteration 0 / 400: loss 6.277229
iteration 100 / 400: loss inf
iteration 200 / 400: loss inf
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.735230
iteration 100 / 400: loss inf
iteration 200 / 400: loss nan
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.346516
iteration 100 / 400: loss inf
iteration 200 / 400: loss nan
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.653258
iteration 100 / 400: loss 2.025459
iteration 200 / 400: loss 2.078061
iteration 300 / 400: loss 2.095888
iteration 0 / 400: loss 5.715989
iteration 100 / 400: loss inf
iteration 200 / 400: loss inf
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.042449
iteration 100 / 400: loss inf
iteration 200 / 400: loss nan
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.500166
iteration 100 / 400: loss inf
iteration 200 / 400: loss nan
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.628261
iteration 100 / 400: loss inf
iteration 200 / 400: loss nan
iteration 300 / 400: loss nan
iteration 0 / 400: loss 6.349744
iteration 100 / 400: loss 2.042371
iteration 200 / 400: loss 2.087944
iteration 300 / 400: loss 2.083138
iteration 0 / 400: loss 6.009105
iteration 100 / 400: loss inf
iteration 200 / 400: loss inf
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.941990
iteration 100 / 400: loss inf
iteration 200 / 400: loss nan
iteration 300 / 400: loss nan
iteration 0 / 400: loss 5.539706
iteration 100 / 400: loss inf
iteration 200 / 400: loss nan
iteration 300 / 400: loss nan
iteration 0 / 400: loss 6.148245
iteration 100 / 400: loss inf
iteration 200 / 400: loss nan
iteration 300 / 400: loss nan
lr 1.000000e-07 reg 5.000000e+04 train accuracy: 0.308837 val accuracy: 0.316000
lr 1.000000e-07 reg 2.503750e+07 train accuracy: 0.092898 val accuracy: 0.095000
lr 1.000000e-07 reg 5.002500e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 1.000000e-07 reg 7.501250e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 1.000000e-07 reg 1.000000e+08 train accuracy: 0.100265 val accuracy: 0.087000
lr 2.000000e-07 reg 5.000000e+04 train accuracy: 0.301878 val accuracy: 0.320000
lr 2.000000e-07 reg 2.503750e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 2.000000e-07 reg 5.002500e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 2.000000e-07 reg 7.501250e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 2.000000e-07 reg 1.000000e+08 train accuracy: 0.100265 val accuracy: 0.087000
lr 3.000000e-07 reg 5.000000e+04 train accuracy: 0.311857 val accuracy: 0.319000
lr 3.000000e-07 reg 2.503750e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 3.000000e-07 reg 5.002500e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 3.000000e-07 reg 7.501250e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 3.000000e-07 reg 1.000000e+08 train accuracy: 0.100265 val accuracy: 0.087000
lr 4.000000e-07 reg 5.000000e+04 train accuracy: 0.301224 val accuracy: 0.320000
lr 4.000000e-07 reg 2.503750e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 4.000000e-07 reg 5.002500e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 4.000000e-07 reg 7.501250e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 4.000000e-07 reg 1.000000e+08 train accuracy: 0.100265 val accuracy: 0.087000
lr 5.000000e-07 reg 5.000000e+04 train accuracy: 0.292816 val accuracy: 0.303000
lr 5.000000e-07 reg 2.503750e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 5.000000e-07 reg 5.002500e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 5.000000e-07 reg 7.501250e+07 train accuracy: 0.100265 val accuracy: 0.087000
lr 5.000000e-07 reg 1.000000e+08 train accuracy: 0.100265 val accuracy: 0.087000
best validation accuracy achieved during cross-validation: 0.320000
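The inf/nan losses above come from the larger regularization strengths: a linearly spaced grid between 5e4 and 1e8 puts nearly every candidate in a regime where the weights blow up and the exponentials overflow. To spend the same budget more usefully, a log-spaced grid concentrated near the values that actually converged is one option (the exact ranges below are a judgment call, not part of the assignment):

# log-spaced candidates instead of a linear grid (hypothetical ranges, biased
# toward the region where the runs above converged)
learning_rates = np.logspace(-7, np.log10(5e-7), num=5)    # 1e-7 ... 5e-7
regularization_strengths = np.logspace(4, 6, num=5)        # 1e4 ... 1e6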

In [10]:
# Evaluate the best softmax classifier on the test set
y_test_pred = best_softmax.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print 'softmax on raw pixels final test set accuracy: %f' % (test_accuracy, )


softmax on raw pixels final test set accuracy: 0.319000

In [11]:
# Visualize the learned weights for each class
w = best_softmax.W[:-1,:] # strip out the bias
w = w.reshape(32, 32, 3, 10)

w_min, w_max = np.min(w), np.max(w)

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in xrange(10):
  plt.subplot(2, 5, i + 1)
  
  # Rescale the weights to be between 0 and 255
  wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
  plt.imshow(wimg.astype('uint8'))
  plt.axis('off')
  plt.title(classes[i])



In [ ]: