Softmax exercise

Complete this worksheet and hand it in, including its outputs and any supporting code outside of the worksheet, with your assignment submission. For more details see the assignments page on the course website.

This exercise is analogous to the SVM exercise. You will:

  • implement a fully-vectorized loss function for the Softmax classifier
  • implement the fully-vectorized expression for its analytic gradient
  • check your implementation with numerical gradient
  • use a validation set to tune the learning rate and regularization strength
  • optimize the loss function with SGD
  • visualize the final learned weights

In [1]:
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2



In [3]:
def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, num_dev=500):
  """
  Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
  it for the linear classifier. These are the same steps as we used for the
  SVM, but condensed to a single function.  
  """
  # Load the raw CIFAR-10 data
  cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
  X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
  
  # subsample the data
  mask = range(num_training, num_training + num_validation)
  X_val = X_train[mask]
  y_val = y_train[mask]
  mask = range(num_training)
  X_train = X_train[mask]
  y_train = y_train[mask]
  mask = range(num_test)
  X_test = X_test[mask]
  y_test = y_test[mask]
  mask = np.random.choice(num_training, num_dev, replace=False)
  X_dev = X_train[mask]
  y_dev = y_train[mask]
  
  # Preprocessing: reshape the image data into rows
  X_train = np.reshape(X_train, (X_train.shape[0], -1))
  X_val = np.reshape(X_val, (X_val.shape[0], -1))
  X_test = np.reshape(X_test, (X_test.shape[0], -1))
  X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))
  
  # Normalize the data: subtract the mean image
  mean_image = np.mean(X_train, axis = 0)
  X_train -= mean_image
  X_val -= mean_image
  X_test -= mean_image
  X_dev -= mean_image
  
  # add bias dimension and transform into columns
  X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
  X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
  X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
  X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])
  
  return X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev


# Invoke the above function to get our data.
X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev = get_CIFAR10_data()
print 'Train data shape: ', X_train.shape
print 'Train labels shape: ', y_train.shape
print 'Validation data shape: ', X_val.shape
print 'Validation labels shape: ', y_val.shape
print 'Test data shape: ', X_test.shape
print 'Test labels shape: ', y_test.shape
print 'dev data shape: ', X_dev.shape
print 'dev labels shape: ', y_dev.shape


Train data shape:  (49000, 3073)
Train labels shape:  (49000,)
Validation data shape:  (1000, 3073)
Validation labels shape:  (1000,)
Test data shape:  (1000, 3073)
Test labels shape:  (1000,)
dev data shape:  (500, 3073)
dev labels shape:  (500,)

Softmax Classifier

Your code for this section will all be written inside cs231n/classifiers/softmax.py.


In [4]:
# First implement the naive softmax loss function with nested loops.
# Open the file cs231n/classifiers/softmax.py and implement the
# softmax_loss_naive function.

from cs231n.classifiers.softmax import softmax_loss_naive
import time

# Generate a random softmax weight matrix and use it to compute the loss.
W = np.random.randn(3073, 10) * 0.0001
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As a rough sanity check, our loss should be something close to -log(0.1).
print 'loss: %f' % loss
print 'sanity check: %f' % (-np.log(0.1))


loss: 2.387537
sanity check: 2.302585
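
For reference, here is a minimal sketch of what softmax_loss_naive might look like, assuming the (W, X, y, reg) signature used in the call above, with X of shape (N, D) and W of shape (D, C). This is one possible implementation, not necessarily the assignment's reference solution, and the regularization convention (0.5 * reg * sum(W*W) vs. reg * sum(W*W)) may differ from the course code. Note the max subtraction inside the loop for numerical stability.

def softmax_loss_naive(W, X, y, reg):
  """
  Softmax loss and gradient with explicit loops (sketch).
  W: (D, C) weights, X: (N, D) data, y: (N,) labels, reg: regularization strength.
  """
  loss = 0.0
  dW = np.zeros_like(W)
  num_train = X.shape[0]
  num_classes = W.shape[1]

  for i in xrange(num_train):
    scores = X[i].dot(W)                 # (C,) class scores
    scores -= np.max(scores)             # shift scores for numerical stability
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores)
    loss += -np.log(probs[y[i]])         # cross-entropy for the correct class
    for j in xrange(num_classes):
      # dL/dW_j = (p_j - 1{j == y_i}) * x_i
      dW[:, j] += (probs[j] - (j == y[i])) * X[i]

  # regularization convention here is an assumption; the course code may use reg * sum(W*W)
  loss = loss / num_train + 0.5 * reg * np.sum(W * W)
  dW = dW / num_train + reg * W
  return loss, dW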

Inline Question 1:

Why do we expect our loss to be close to -log(0.1)? Explain briefly.

Your answer:
Expecting the loss to be close to -log(0.1) is equivalent to expecting the classifier to assign a roughly uniform distribution over the 10 classes, i.e. a probability of about 0.1 to each class, including the correct one. This is exactly what we expect here, since the weights are initialized with small random values and are not biased toward any class.
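
Concretely, with C = 10 classes and a (near-)uniform predicted distribution, the cross-entropy loss on each example is

  L_i = -log(p_{y_i}) ≈ -log(1/10) = log(10) ≈ 2.3026,

which matches the sanity-check value printed above.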


In [5]:
# Complete the implementation of softmax_loss_naive and implement a (naive)
# version of the gradient that uses nested loops.
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As we did for the SVM, use numeric gradient checking as a debugging tool.
# The numeric gradient should be close to the analytic gradient.
from cs231n.gradient_check import grad_check_sparse
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)

# similar to SVM case, do another gradient check with regularization
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 1e2)
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 1e2)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)


numerical: 1.924204 analytic: 1.924203, relative error: 2.377507e-08
numerical: 3.273051 analytic: 3.273051, relative error: 2.293258e-08
numerical: 1.416040 analytic: 1.416040, relative error: 3.771324e-08
numerical: -0.899851 analytic: -0.899851, relative error: 3.505760e-08
numerical: 0.336886 analytic: 0.336886, relative error: 1.243691e-07
numerical: -0.432425 analytic: -0.432425, relative error: 1.070022e-07
numerical: -2.534122 analytic: -2.534122, relative error: 1.069666e-08
numerical: 1.690181 analytic: 1.690181, relative error: 2.207165e-08
numerical: -0.448847 analytic: -0.448847, relative error: 1.270582e-08
numerical: 2.916113 analytic: 2.916113, relative error: 2.549746e-08
numerical: -1.646460 analytic: -1.646460, relative error: 1.046878e-08
numerical: 0.721956 analytic: 0.721956, relative error: 2.661721e-08
numerical: -0.583664 analytic: -0.583664, relative error: 1.965129e-08
numerical: 1.563782 analytic: 1.563782, relative error: 2.139016e-08
numerical: -1.022960 analytic: -1.022960, relative error: 9.729376e-09
numerical: -0.719193 analytic: -0.719193, relative error: 6.275960e-09
numerical: -7.273979 analytic: -7.273979, relative error: 4.406025e-09
numerical: 1.560906 analytic: 1.560906, relative error: 2.206220e-08
numerical: 1.455565 analytic: 1.455565, relative error: 2.637870e-08
numerical: 2.884417 analytic: 2.884417, relative error: 2.016106e-08
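
grad_check_sparse is provided by the course utilities; purely to illustrate the idea, a centered finite-difference check at a few randomly chosen coordinates might look like the hypothetical helper below (the name and the step size h are assumptions, not part of the course code).

def sparse_numeric_grad_check(f, W, analytic_grad, num_checks=10, h=1e-5):
  # Compare the analytic gradient to a centered numerical estimate
  # at num_checks randomly chosen entries of W.
  for _ in xrange(num_checks):
    ix = tuple([np.random.randint(m) for m in W.shape])
    oldval = W[ix]
    W[ix] = oldval + h
    fxph = f(W)                # f evaluated with W[ix] nudged up by h
    W[ix] = oldval - h
    fxmh = f(W)                # f evaluated with W[ix] nudged down by h
    W[ix] = oldval             # restore the original value
    grad_numerical = (fxph - fxmh) / (2 * h)
    grad_analytic = analytic_grad[ix]
    rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic))
    print 'numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error)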

In [6]:
# Now that we have a naive implementation of the softmax loss function and its gradient,
# implement a vectorized version in softmax_loss_vectorized.
# The two versions should compute the same results, but the vectorized version should be
# much faster.
tic = time.time()
loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'naive loss: %e computed in %fs' % (loss_naive, toc - tic)

from cs231n.classifiers.softmax import softmax_loss_vectorized
tic = time.time()
loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)

# As we did for the SVM, we use the Frobenius norm to compare the two versions
# of the gradient.
grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)
print 'Gradient difference: %f' % grad_difference


naive loss: 2.387537e+00 computed in 0.261431s
vectorized loss: 2.387537e+00 computed in 0.022554s
Loss difference: 0.000000
Gradient difference: 0.000000
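
As a sketch of one possible vectorization (same signature and regularization caveats as the naive sketch above), the per-example loop can be replaced by matrix operations:

def softmax_loss_vectorized(W, X, y, reg):
  # Fully vectorized softmax loss and gradient; no explicit loop over examples.
  num_train = X.shape[0]
  scores = X.dot(W)                                   # (N, C)
  scores -= np.max(scores, axis=1, keepdims=True)     # numerical stability
  exp_scores = np.exp(scores)
  probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # (N, C)

  loss = -np.sum(np.log(probs[np.arange(num_train), y])) / num_train
  loss += 0.5 * reg * np.sum(W * W)   # regularization convention may differ in the course code

  dscores = probs.copy()
  dscores[np.arange(num_train), y] -= 1               # dL/dscores = p - 1{j == y_i}
  dW = X.T.dot(dscores) / num_train + reg * W
  return loss, dW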

In [9]:
# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of over 0.35 on the validation set.
from cs231n.classifiers import Softmax
results = {}
best_val = -1
best_softmax = None
learning_rates = [1e-7, 5e-8]
regularization_strengths = [5e4, 1e3]

################################################################################
# TODO:                                                                        #
# Use the validation set to set the learning rate and regularization strength. #
# This should be identical to the validation that you did for the SVM; save    #
# the best trained softmax classifer in best_softmax.                          #
################################################################################
num_iters = 1500
for learning_rate in learning_rates:
    for regularization_strength in regularization_strengths:
        print "learning_rage {}, regularization_strength {}".format(learning_rate, regularization_strength)
        #train it
        model = Softmax()
        model.train(X_train, y_train, learning_rate=learning_rate, reg=regularization_strength,
                      num_iters=num_iters, verbose=True)
        #predict
        y_train_pred = model.predict(X_train)
        training_accuracy = np.mean(y_train == y_train_pred)
        y_val_pred = model.predict(X_val)
        validation_accuracy = np.mean(y_val == y_val_pred)
        results[(learning_rate,regularization_strength)] = training_accuracy, validation_accuracy
        if validation_accuracy > best_val:
            best_val = validation_accuracy
            best_softmax = model
################################################################################
#                              END OF YOUR CODE                                #
################################################################################
    
# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (
                lr, reg, train_accuracy, val_accuracy)
    
print 'best validation accuracy achieved during cross-validation: %f' % best_val


learning_rate 1e-07, regularization_strength 50000.0
iteration 0 / 1500: loss 5.401598
iteration 100 / 1500: loss 2.734902
iteration 200 / 1500: loss 2.404785
iteration 300 / 1500: loss 2.066126
iteration 400 / 1500: loss 2.030593
iteration 500 / 1500: loss 1.999755
iteration 600 / 1500: loss 1.991619
iteration 700 / 1500: loss 1.975749
iteration 800 / 1500: loss 2.038639
iteration 900 / 1500: loss 2.044697
iteration 1000 / 1500: loss 2.028992
iteration 1100 / 1500: loss 1.957985
iteration 1200 / 1500: loss 2.037768
iteration 1300 / 1500: loss 1.962232
iteration 1400 / 1500: loss 2.023484
learning_rate 1e-07, regularization_strength 1000.0
iteration 0 / 1500: loss 6.079699
iteration 100 / 1500: loss 3.936488
iteration 200 / 1500: loss 3.350787
iteration 300 / 1500: loss 3.327111
iteration 400 / 1500: loss 3.008811
iteration 500 / 1500: loss 2.847320
iteration 600 / 1500: loss 2.866963
iteration 700 / 1500: loss 2.907743
iteration 800 / 1500: loss 2.810694
iteration 900 / 1500: loss 2.779843
iteration 1000 / 1500: loss 2.637984
iteration 1100 / 1500: loss 2.439051
iteration 1200 / 1500: loss 2.632869
iteration 1300 / 1500: loss 2.487482
iteration 1400 / 1500: loss 2.592269
learning_rate 5e-08, regularization_strength 50000.0
iteration 0 / 1500: loss 6.651233
iteration 100 / 1500: loss 3.794906
iteration 200 / 1500: loss 2.914429
iteration 300 / 1500: loss 2.559281
iteration 400 / 1500: loss 2.313131
iteration 500 / 1500: loss 2.110121
iteration 600 / 1500: loss 2.094315
iteration 700 / 1500: loss 2.046862
iteration 800 / 1500: loss 2.017530
iteration 900 / 1500: loss 2.000550
iteration 1000 / 1500: loss 2.028373
iteration 1100 / 1500: loss 1.998935
iteration 1200 / 1500: loss 2.023894
iteration 1300 / 1500: loss 2.006492
iteration 1400 / 1500: loss 1.969010
learning_rate 5e-08, regularization_strength 1000.0
iteration 0 / 1500: loss 5.683377
iteration 100 / 1500: loss 4.156387
iteration 200 / 1500: loss 3.669945
iteration 300 / 1500: loss 3.721072
iteration 400 / 1500: loss 3.695043
iteration 500 / 1500: loss 3.335078
iteration 600 / 1500: loss 3.479732
iteration 700 / 1500: loss 3.146838
iteration 800 / 1500: loss 3.090696
iteration 900 / 1500: loss 2.871892
iteration 1000 / 1500: loss 2.655937
iteration 1100 / 1500: loss 3.079671
iteration 1200 / 1500: loss 2.831032
iteration 1300 / 1500: loss 2.831564
iteration 1400 / 1500: loss 2.771489
lr 5.000000e-08 reg 1.000000e+03 train accuracy: 0.219000 val accuracy: 0.225000
lr 5.000000e-08 reg 5.000000e+04 train accuracy: 0.326061 val accuracy: 0.338000
lr 1.000000e-07 reg 1.000000e+03 train accuracy: 0.252776 val accuracy: 0.248000
lr 1.000000e-07 reg 5.000000e+04 train accuracy: 0.328898 val accuracy: 0.343000
best validation accuracy achieved during cross-validation: 0.343000
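
To push past 0.35 on the validation set, one option (a suggestion, not a guaranteed recipe) is to search a finer grid around the best setting found above, for example:

# Hypothetical finer grid centered near the best result above (lr ~ 1e-7, reg ~ 5e4).
learning_rates = np.logspace(-7, -6, 4)           # 1e-07 to 1e-06
regularization_strengths = np.logspace(4, 5, 4)   # 1e+04 to 1e+05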

In [10]:
# evaluate on test set
# Evaluate the best softmax on test set
y_test_pred = best_softmax.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print 'softmax on raw pixels final test set accuracy: %f' % (test_accuracy, )


softmax on raw pixels final test set accuracy: 0.341000

In [11]:
# Visualize the learned weights for each class
w = best_softmax.W[:-1,:] # strip out the bias
w = w.reshape(32, 32, 3, 10)

w_min, w_max = np.min(w), np.max(w)

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in xrange(10):
  plt.subplot(2, 5, i + 1)
  
  # Rescale the weights to be between 0 and 255
  wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
  plt.imshow(wimg.astype('uint8'))
  plt.axis('off')
  plt.title(classes[i])



In [ ]: