Softmax exercise

Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the assignments page on the course website.

This exercise is analogous to the SVM exercise. You will:

  • implement a fully-vectorized loss function for the Softmax classifier
  • implement the fully-vectorized expression for its analytic gradient
  • check your implementation with numerical gradient
  • use a validation set to tune the learning rate and regularization strength
  • optimize the loss function with SGD
  • visualize the final learned weights

In [1]:
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading extenrnal modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [2]:
def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, num_dev=500):
  """
  Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
  it for the linear classifier. These are the same steps as we used for the
  SVM, but condensed to a single function.  
  """
  # Load the raw CIFAR-10 data
  cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
  X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
  
  # subsample the data
  mask = range(num_training, num_training + num_validation)
  X_val = X_train[mask]
  y_val = y_train[mask]
  mask = range(num_training)
  X_train = X_train[mask]
  y_train = y_train[mask]
  mask = range(num_test)
  X_test = X_test[mask]
  y_test = y_test[mask]
  mask = np.random.choice(num_training, num_dev, replace=False)
  X_dev = X_train[mask]
  y_dev = y_train[mask]
  
  # Preprocessing: reshape the image data into rows
  X_train = np.reshape(X_train, (X_train.shape[0], -1))
  X_val = np.reshape(X_val, (X_val.shape[0], -1))
  X_test = np.reshape(X_test, (X_test.shape[0], -1))
  X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))
  
  # Normalize the data: subtract the mean image
  mean_image = np.mean(X_train, axis = 0)
  X_train -= mean_image
  X_val -= mean_image
  X_test -= mean_image
  X_dev -= mean_image
  
  # add bias dimension and transform into columns
  X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
  X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
  X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
  X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])
  
  return X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev


# Invoke the above function to get our data.
X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev = get_CIFAR10_data()
print 'Train data shape: ', X_train.shape
print 'Train labels shape: ', y_train.shape
print 'Validation data shape: ', X_val.shape
print 'Validation labels shape: ', y_val.shape
print 'Test data shape: ', X_test.shape
print 'Test labels shape: ', y_test.shape
print 'dev data shape: ', X_dev.shape
print 'dev labels shape: ', y_dev.shape


Train data shape:  (49000L, 3073L)
Train labels shape:  (49000L,)
Validation data shape:  (1000L, 3073L)
Validation labels shape:  (1000L,)
Test data shape:  (1000L, 3073L)
Test labels shape:  (1000L,)
dev data shape:  (500L, 3073L)
dev labels shape:  (500L,)

Softmax Classifier

Your code for this section will all be written inside cs231n/classifiers/softmax.py.


In [3]:
# First implement the naive softmax loss function with nested loops.
# Open the file cs231n/classifiers/softmax.py and implement the
# softmax_loss_naive function.

from cs231n.classifiers.softmax import softmax_loss_naive
import time

# Generate a random softmax weight matrix and use it to compute the loss.
W = np.random.randn(3073, 10) * 0.0001
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As a rough sanity check, our loss should be something close to -log(0.1).
print 'loss: %f' % loss
print 'sanity check: %f' % (-np.log(0.1))


loss: 2.322765
sanity check: 2.302585

Inline Question 1:

Why do we expect our loss to be close to -log(0.1)? Explain briefly.**

Your answer: Fill this in


In [4]:
# Complete the implementation of softmax_loss_naive and implement a (naive)
# version of the gradient that uses nested loops.
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As we did for the SVM, use numeric gradient checking as a debugging tool.
# The numeric gradient should be close to the analytic gradient.
from cs231n.gradient_check import grad_check_sparse
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)

# similar to SVM case, do another gradient check with regularization
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 1e2)
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 1e2)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)


numerical: 1.451427 analytic: 1.451427, relative error: 3.165712e-08
numerical: -1.080426 analytic: -1.080426, relative error: 3.437196e-08
numerical: -3.949267 analytic: -3.949267, relative error: 1.135935e-08
numerical: 0.123257 analytic: 0.123257, relative error: 1.243456e-08
numerical: -2.861904 analytic: -2.861904, relative error: 1.221247e-08
numerical: 1.274729 analytic: 1.274729, relative error: 1.961729e-08
numerical: 0.968225 analytic: 0.968225, relative error: 6.991440e-08
numerical: 0.679894 analytic: 0.679894, relative error: 1.035185e-07
numerical: 0.133312 analytic: 0.133312, relative error: 4.396982e-07
numerical: 2.036132 analytic: 2.036132, relative error: 1.603428e-08
numerical: 2.678608 analytic: 2.678608, relative error: 5.396274e-09
numerical: 2.889973 analytic: 2.889973, relative error: 4.422368e-09
numerical: 2.965088 analytic: 2.965088, relative error: 3.099539e-08
numerical: -1.826219 analytic: -1.826219, relative error: 7.334068e-09
numerical: -1.535934 analytic: -1.535934, relative error: 1.135855e-08
numerical: 1.846096 analytic: 1.846096, relative error: 1.829303e-08
numerical: -0.349858 analytic: -0.349858, relative error: 5.299985e-08
numerical: -0.200257 analytic: -0.200257, relative error: 1.911086e-07
numerical: 0.317783 analytic: 0.317783, relative error: 4.947264e-08
numerical: 4.027211 analytic: 4.027211, relative error: 8.642725e-09

In [5]:
# Now that we have a naive implementation of the softmax loss function and its gradient,
# implement a vectorized version in softmax_loss_vectorized.
# The two versions should compute the same results, but the vectorized version should be
# much faster.
tic = time.time()
loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'naive loss: %e computed in %fs' % (loss_naive, toc - tic)

from cs231n.classifiers.softmax import softmax_loss_vectorized
tic = time.time()
loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)

# As we did for the SVM, we use the Frobenius norm to compare the two versions
# of the gradient.
grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)
print 'Gradient difference: %f' % grad_difference


naive loss: 2.322765e+00 computed in 0.084000s
vectorized loss: 2.322765e+00 computed in 0.006000s
Loss difference: 0.000000
Gradient difference: 0.000000

In [6]:
# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of over 0.35 on the validation set.
from cs231n.classifiers import Softmax
results = {}
best_val = -1
best_softmax = None
learning_rates = [1e-7, 5e-7]
regularization_strengths = [5e4, 1e8]

################################################################################
# TODO:                                                                        #
# Use the validation set to set the learning rate and regularization strength. #
# This should be identical to the validation that you did for the SVM; save    #
# the best trained softmax classifer in best_softmax.                          #
################################################################################
for lr in learning_rates:
    for reg in regularization_strengths:
        sfm = Softmax()
        loss_history = sfm.train(X_train, y_train, learning_rate=1e-7, reg=5e4,
                              num_iters=1500, verbose=True)
        y_train_pred = sfm.predict(X_train)
        train_accuracy = np.mean(y_train == y_train_pred)
        y_val_pred = sfm.predict(X_val)
        val_accuracy = np.mean(y_val == y_val_pred)
        results[(lr,reg)] = (train_accuracy, val_accuracy)
        if val_accuracy > best_val:
            best_val = val_accuracy
            best_softmax = sfm
################################################################################
#                              END OF YOUR CODE                                #
################################################################################
    
# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (
                lr, reg, train_accuracy, val_accuracy)
    
print 'best validation accuracy achieved during cross-validation: %f' % best_val


iteration 0 / 1500: loss 772.747166
iteration 100 / 1500: loss 283.461331
iteration 200 / 1500: loss 105.062341
iteration 300 / 1500: loss 39.774281
iteration 400 / 1500: loss 15.775679
iteration 500 / 1500: loss 7.143364
iteration 600 / 1500: loss 3.966115
iteration 700 / 1500: loss 2.811199
iteration 800 / 1500: loss 2.346392
iteration 900 / 1500: loss 2.165398
iteration 1000 / 1500: loss 2.138737
iteration 1100 / 1500: loss 2.109692
iteration 1200 / 1500: loss 2.068124
iteration 1300 / 1500: loss 2.097528
iteration 1400 / 1500: loss 2.017230
iteration 0 / 1500: loss 768.308826
iteration 100 / 1500: loss 281.905075
iteration 200 / 1500: loss 104.567686
iteration 300 / 1500: loss 39.482059
iteration 400 / 1500: loss 15.797437
iteration 500 / 1500: loss 7.107607
iteration 600 / 1500: loss 3.915210
iteration 700 / 1500: loss 2.764193
iteration 800 / 1500: loss 2.363761
iteration 900 / 1500: loss 2.231859
iteration 1000 / 1500: loss 2.078744
iteration 1100 / 1500: loss 2.068757
iteration 1200 / 1500: loss 2.102469
iteration 1300 / 1500: loss 2.096126
iteration 1400 / 1500: loss 2.039079
iteration 0 / 1500: loss 768.293632
iteration 100 / 1500: loss 281.814754
iteration 200 / 1500: loss 104.489026
iteration 300 / 1500: loss 39.598558
iteration 400 / 1500: loss 15.856701
iteration 500 / 1500: loss 7.093284
iteration 600 / 1500: loss 3.866680
iteration 700 / 1500: loss 2.726511
iteration 800 / 1500: loss 2.308808
iteration 900 / 1500: loss 2.182247
iteration 1000 / 1500: loss 2.099903
iteration 1100 / 1500: loss 2.084347
iteration 1200 / 1500: loss 2.091173
iteration 1300 / 1500: loss 2.014386
iteration 1400 / 1500: loss 2.102048
iteration 0 / 1500: loss 781.000258
iteration 100 / 1500: loss 286.539274
iteration 200 / 1500: loss 106.227451
iteration 300 / 1500: loss 40.212583
iteration 400 / 1500: loss 16.061913
iteration 500 / 1500: loss 7.177310
iteration 600 / 1500: loss 3.923516
iteration 700 / 1500: loss 2.760485
iteration 800 / 1500: loss 2.306952
iteration 900 / 1500: loss 2.224776
iteration 1000 / 1500: loss 2.129513
iteration 1100 / 1500: loss 2.111508
iteration 1200 / 1500: loss 2.068389
iteration 1300 / 1500: loss 2.094957
iteration 1400 / 1500: loss 2.131040
lr 1.000000e-07 reg 5.000000e+04 train accuracy: 0.333592 val accuracy: 0.346000
lr 1.000000e-07 reg 1.000000e+08 train accuracy: 0.329837 val accuracy: 0.344000
lr 5.000000e-07 reg 5.000000e+04 train accuracy: 0.322510 val accuracy: 0.336000
lr 5.000000e-07 reg 1.000000e+08 train accuracy: 0.330429 val accuracy: 0.344000
best validation accuracy achieved during cross-validation: 0.346000

In [7]:
# evaluate on test set
# Evaluate the best softmax on test set
y_test_pred = best_softmax.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print 'softmax on raw pixels final test set accuracy: %f' % (test_accuracy, )


softmax on raw pixels final test set accuracy: 0.350000

In [8]:
# Visualize the learned weights for each class
w = best_softmax.W[:-1,:] # strip out the bias
w = w.reshape(32, 32, 3, 10)

w_min, w_max = np.min(w), np.max(w)

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in xrange(10):
  plt.subplot(2, 5, i + 1)
  
  # Rescale the weights to be between 0 and 255
  wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
  plt.imshow(wimg.astype('uint8'))
  plt.axis('off')
  plt.title(classes[i])