Softmax exercise

(Adapted from Stanford University's CS231n Open Courseware)

Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the HW page on the course website.

This exercise is analogous to the SVM exercise. You will:

- implement a fully-vectorized loss function for the Softmax classifier
- implement the fully-vectorized expression for its analytic gradient
- check your implementation with a numerical gradient
- use a validation set to tune the learning rate and regularization strength
- optimize the loss function with SGD
- visualize the final learned weights
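
For reference, the loss and gradient named in the list above can be written out as follows. This is the standard softmax (cross-entropy) formulation, using this worksheet's layout in which each example x_i is a column and W has one row per class. The regularizer R(W) and its strength \lambda are left generic, because the exact form used in softmax.py is not shown in this worksheet; e_{y_i} denotes the one-hot vector for the correct class.

```latex
\begin{aligned}
f^{(i)} &= W x_i, \qquad
p^{(i)}_k = \frac{e^{f^{(i)}_k}}{\sum_{j} e^{f^{(i)}_j}} \\
L &= -\frac{1}{N} \sum_{i=1}^{N} \log p^{(i)}_{y_i} + \lambda R(W) \\
\frac{\partial L}{\partial W} &= \frac{1}{N} \sum_{i=1}^{N}
    \left( p^{(i)} - e_{y_i} \right) x_i^{\top} + \lambda \frac{\partial R}{\partial W}
\end{aligned}
```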



In [1]:

    
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2



In [2]:

    
def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):
  """
  Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
  it for the linear classifier. These are the same steps as we used for the
  SVM, but condensed to a single function.  
  """
  # Load the raw CIFAR-10 data
  cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
  # Reduce num_of_batches below 6 if you do not have sufficient memory
  X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir,num_of_batches=6)
  
  # subsample the data
  mask = range(num_training, num_training + num_validation)
  X_val = X_train[mask]
  y_val = y_train[mask]
  mask = range(num_training)
  X_train = X_train[mask]
  y_train = y_train[mask]
  mask = range(num_test)
  X_test = X_test[mask]
  y_test = y_test[mask]
  
  # Preprocessing: reshape the image data into rows
  X_train = np.reshape(X_train, (X_train.shape[0], -1))
  X_val = np.reshape(X_val, (X_val.shape[0], -1))
  X_test = np.reshape(X_test, (X_test.shape[0], -1))
  
  # Normalize the data: subtract the mean image
  mean_image = np.mean(X_train, axis = 0)
  X_train -= mean_image
  X_val -= mean_image
  X_test -= mean_image
  
  # add bias dimension and transform into columns
  X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))]).T
  X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))]).T
  X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))]).T
  
  return X_train, y_train, X_val, y_val, X_test, y_test


# Invoke the above function to get our data.
X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()
print 'Train data shape: ', X_train.shape
print 'Train labels shape: ', y_train.shape
print 'Validation data shape: ', X_val.shape
print 'Validation labels shape: ', y_val.shape
print 'Test data shape: ', X_test.shape
print 'Test labels shape: ', y_test.shape









    



Train data shape:  (3073, 49000)
Train labels shape:  (49000,)
Validation data shape:  (3073, 1000)
Validation labels shape:  (1000,)
Test data shape:  (3073, 1000)
Test labels shape:  (1000,)

Softmax Classifier

Your code for this section will all be written inside cs231n/classifiers/softmax.py.



In [3]:

    
# First implement the naive softmax loss function with nested loops.
# Open the file cs231n/classifiers/softmax.py and implement the
# softmax_loss_naive function.

from cs231n.classifiers.softmax import softmax_loss_naive
import time

# Generate a random softmax weight matrix and use it to compute the loss.
W = np.random.randn(10, 3073) * 0.0001
loss, grad = softmax_loss_naive(W, X_train, y_train, 0.0)

# As a rough sanity check, our loss should be something close to -log(0.1).
print 'loss: %f' % loss
print 'sanity check: %f' % (-np.log(0.1))









    



loss: 2.358218
sanity check: 2.302585

Inline Question 1:

Why do we expect our loss to be close to -log(0.1)? Explain briefly.

Your answer: There are 10 classes and the initial weights are small random values, so every class score is close to zero and the softmax assigns each class a probability of roughly 1/10. The expected loss per example is therefore about -log(1/10) = -log(0.1).
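
The reasoning in this answer can be checked numerically. The snippet below is a minimal sketch on synthetic, unit-scale data (not CIFAR-10); the array names and sizes are illustrative assumptions, and only the shapes follow this worksheet's convention of W as (classes, dimensions) and X as (dimensions, examples).

```python
import numpy as np

rng = np.random.RandomState(0)
num_classes, dim, num_examples = 10, 3073, 200

# Synthetic stand-ins for the worksheet's data (columns are examples).
X_demo = rng.randn(dim, num_examples)
y_demo = rng.randint(num_classes, size=num_examples)
W_demo = rng.randn(num_classes, dim) * 0.0001   # same scale as the cell above

scores = W_demo.dot(X_demo)                     # shape (10, 200), all near zero
scores -= scores.max(axis=0)                    # shift for numerical stability
probs = np.exp(scores) / np.exp(scores).sum(axis=0)

# Mean negative log-probability of the correct class.
mean_loss = -np.log(probs[y_demo, np.arange(num_examples)]).mean()
print('mean loss: %f' % mean_loss)
print('-log(0.1): %f' % -np.log(0.1))
```

Because the scores are nearly identical across classes, every column of probs is close to uniform, and the mean loss lands close to -log(0.1).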



In [4]:

    
# Complete the implementation of softmax_loss_naive and implement a (naive)
# version of the gradient that uses nested loops.
loss, grad = softmax_loss_naive(W, X_train, y_train, 0.0)

# As we did for the SVM, use numeric gradient checking as a debugging tool.
# The numeric gradient should be close to the analytic gradient.
from cs231n.gradient_check import grad_check_sparse
f = lambda w: softmax_loss_naive(w, X_train, y_train, 0.0)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)









    



numerical: -0.213940 analytic: -0.213940, relative error: 1.789436e-07
numerical: 0.128688 analytic: 0.128688, relative error: 5.467276e-07
numerical: 1.940427 analytic: 1.940427, relative error: 3.030184e-08
numerical: 1.396616 analytic: 1.396616, relative error: 2.457026e-09
numerical: -1.504184 analytic: -1.504183, relative error: 3.814074e-08
numerical: -0.009148 analytic: -0.009149, relative error: 7.770238e-06
numerical: -2.361141 analytic: -2.361141, relative error: 3.199240e-08
numerical: 0.558828 analytic: 0.558829, relative error: 2.835155e-08
numerical: 3.619593 analytic: 3.619593, relative error: 1.330366e-08
numerical: -0.625066 analytic: -0.625066, relative error: 5.080904e-08
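
The check above compares a numerical estimate of the gradient with the analytic one at a few randomly chosen coordinates. The following is a minimal sketch of that idea using centered differences; sparse_grad_check is a hypothetical helper written for illustration, not the actual cs231n.gradient_check.grad_check_sparse, and the quadratic test function is an assumption chosen because its gradient is known exactly.

```python
import numpy as np

def sparse_grad_check(f, x, analytic_grad, num_checks=10, h=1e-5, seed=0):
    """Compare numeric and analytic gradients at random coordinates of x."""
    rng = np.random.RandomState(seed)
    rel_errors = []
    for _ in range(num_checks):
        # Pick one random coordinate of x.
        ix = tuple(rng.randint(m) for m in x.shape)
        old = x[ix]
        x[ix] = old + h
        f_plus = f(x)
        x[ix] = old - h
        f_minus = f(x)
        x[ix] = old                               # restore the original value
        numeric = (f_plus - f_minus) / (2 * h)    # centered difference
        analytic = analytic_grad[ix]
        rel_errors.append(abs(numeric - analytic) /
                          (abs(numeric) + abs(analytic)))
    return rel_errors

# Toy function f(w) = 0.5 * sum(w^2), whose gradient is w itself.
W_toy = np.random.RandomState(1).randn(4, 5)
f_toy = lambda w: 0.5 * np.sum(w * w)
errors = sparse_grad_check(f_toy, W_toy, W_toy.copy())
print('max relative error: %e' % max(errors))
```

Checking only a handful of coordinates keeps the cost low, since each numeric estimate needs two full evaluations of the loss.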



In [5]:

    
# Now that we have a naive implementation of the softmax loss function and its gradient,
# implement a vectorized version in softmax_loss_vectorized.
# The two versions should compute the same results, but the vectorized version should be
# much faster.
tic = time.time()
loss_naive, grad_naive = softmax_loss_naive(W, X_train, y_train, 0.00001)
toc = time.time()
print 'naive loss: %e computed in %fs' % (loss_naive, toc - tic)

from cs231n.classifiers.softmax import softmax_loss_vectorized
tic = time.time()
loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_train, y_train, 0.00001)
toc = time.time()
print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)

# As we did for the SVM, we use the Frobenius norm to compare the two versions
# of the gradient.
grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)
print 'Gradient difference: %f' % grad_difference









    



naive loss: 2.358218e+00 computed in 7.250779s
vectorized loss: 2.358218e+00 computed in 0.783531s
Loss difference: 0.000000
Gradient difference: 0.000000
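
For readers who want to see what a loop-free version can look like, here is a minimal sketch of a vectorized softmax loss and gradient. It is not the code in cs231n/classifiers/softmax.py: the function name softmax_loss_sketch is hypothetical, the data is synthetic, and the regularization term 0.5 * reg * sum(W^2) is an assumption about the convention used.

```python
import numpy as np

def softmax_loss_sketch(W, X, y, reg):
    """W: (C, D) weights, X: (D, N) data in columns, y: (N,) labels."""
    num_train = X.shape[1]
    scores = W.dot(X)                                    # (C, N)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp_scores = np.exp(scores)
    probs = exp_scores / exp_scores.sum(axis=0, keepdims=True)

    # Mean cross-entropy plus (assumed) L2 regularization.
    loss = -np.log(probs[y, np.arange(num_train)]).mean()
    loss += 0.5 * reg * np.sum(W * W)

    # Gradient of the loss w.r.t. the scores is probs minus one-hot labels.
    dscores = probs.copy()
    dscores[y, np.arange(num_train)] -= 1.0
    dW = dscores.dot(X.T) / num_train + reg * W
    return loss, dW

rng = np.random.RandomState(0)
W_small = rng.randn(10, 20) * 0.01
X_small = rng.randn(20, 50)
y_small = rng.randint(10, size=50)
loss_small, grad_small = softmax_loss_sketch(W_small, X_small, y_small, 1e-5)
print('loss: %f' % loss_small)
```

Subtracting the per-column maximum before exponentiating leaves the probabilities unchanged but avoids overflow when scores are large.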



In [6]:

    
# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of over 0.35 on the validation set.
from cs231n.classifiers import Softmax
results = {}
best_val = -1
best_softmax = None
learning_rates = np.arange(5)*1e-7+5e-7
regularization_strengths = np.arange(5)*2e3+7e3

################################################################################
# TODO:                                                                        #
# Use the validation set to set the learning rate and regularization strength. #
# This should be identical to the validation that you did for the SVM; save    #
# the best trained softmax classifier in best_softmax.                         #
################################################################################
for lr in learning_rates:
    for rs in regularization_strengths:
        smc = Softmax()
        smc.train(X_train, y_train, lr, rs, 500)
        results[(lr, rs)] = (np.mean(smc.predict(X_train)==y_train),
                             np.mean(smc.predict(X_val)==y_val))
        if results[(lr,rs)][1]>best_val:
            best_val = results[(lr,rs)][1]
            best_softmax = smc
            print best_val
################################################################################
#                              END OF YOUR CODE                                #
################################################################################
    
# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (
                lr, reg, train_accuracy, val_accuracy)
    
print 'best validation accuracy achieved during cross-validation: %f' % best_val









    



0.356
0.375
0.384
0.39
0.391
lr 5.000000e-07 reg 7.000000e+03 train accuracy: 0.357857 val accuracy: 0.356000
lr 5.000000e-07 reg 9.000000e+03 train accuracy: 0.366000 val accuracy: 0.375000
lr 5.000000e-07 reg 1.100000e+04 train accuracy: 0.358571 val accuracy: 0.384000
lr 5.000000e-07 reg 1.300000e+04 train accuracy: 0.368204 val accuracy: 0.382000
lr 5.000000e-07 reg 1.500000e+04 train accuracy: 0.357980 val accuracy: 0.372000
lr 6.000000e-07 reg 7.000000e+03 train accuracy: 0.365286 val accuracy: 0.381000
lr 6.000000e-07 reg 9.000000e+03 train accuracy: 0.372469 val accuracy: 0.375000
lr 6.000000e-07 reg 1.100000e+04 train accuracy: 0.366571 val accuracy: 0.382000
lr 6.000000e-07 reg 1.300000e+04 train accuracy: 0.363490 val accuracy: 0.367000
lr 6.000000e-07 reg 1.500000e+04 train accuracy: 0.352347 val accuracy: 0.363000
lr 7.000000e-07 reg 7.000000e+03 train accuracy: 0.369122 val accuracy: 0.369000
lr 7.000000e-07 reg 9.000000e+03 train accuracy: 0.367837 val accuracy: 0.381000
lr 7.000000e-07 reg 1.100000e+04 train accuracy: 0.368857 val accuracy: 0.380000
lr 7.000000e-07 reg 1.300000e+04 train accuracy: 0.363449 val accuracy: 0.372000
lr 7.000000e-07 reg 1.500000e+04 train accuracy: 0.361755 val accuracy: 0.378000
lr 8.000000e-07 reg 7.000000e+03 train accuracy: 0.378184 val accuracy: 0.367000
lr 8.000000e-07 reg 9.000000e+03 train accuracy: 0.371000 val accuracy: 0.390000
lr 8.000000e-07 reg 1.100000e+04 train accuracy: 0.364898 val accuracy: 0.375000
lr 8.000000e-07 reg 1.300000e+04 train accuracy: 0.363878 val accuracy: 0.372000
lr 8.000000e-07 reg 1.500000e+04 train accuracy: 0.353612 val accuracy: 0.359000
lr 9.000000e-07 reg 7.000000e+03 train accuracy: 0.374939 val accuracy: 0.391000
lr 9.000000e-07 reg 9.000000e+03 train accuracy: 0.358714 val accuracy: 0.381000
lr 9.000000e-07 reg 1.100000e+04 train accuracy: 0.365041 val accuracy: 0.383000
lr 9.000000e-07 reg 1.300000e+04 train accuracy: 0.355898 val accuracy: 0.372000
lr 9.000000e-07 reg 1.500000e+04 train accuracy: 0.360918 val accuracy: 0.374000
best validation accuracy achieved during cross-validation: 0.391000
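
Once the results dictionary is filled, the best setting can also be read back from it directly. The sketch below assumes the same structure as in the cell above, with (lr, reg) keys mapping to (train accuracy, validation accuracy); it contains only three rows copied from the printed output, so it is an illustration rather than the full grid.

```python
# Three entries copied from the printed results above.
results_demo = {
    (5e-7, 7e3): (0.357857, 0.356),
    (8e-7, 9e3): (0.371000, 0.390),
    (9e-7, 7e3): (0.374939, 0.391),
}

# Select the key whose validation accuracy (second tuple element) is highest.
best_lr, best_reg = max(results_demo, key=lambda k: results_demo[k][1])
best_val_demo = results_demo[(best_lr, best_reg)][1]
print('best lr %e reg %e val accuracy: %f' % (best_lr, best_reg, best_val_demo))
```

Selecting on validation accuracy rather than training accuracy matters here: the row with the highest training accuracy in the full table (lr 8e-07, reg 7e+03) is not the one with the highest validation accuracy.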



In [7]:

    
# Evaluate the best softmax classifier on the test set
y_test_pred = best_softmax.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print 'softmax on raw pixels final test set accuracy: %f' % (test_accuracy, )









    



softmax on raw pixels final test set accuracy: 0.385000



In [8]:

    
# Visualize the learned weights for each class
w = best_softmax.W[:,:-1] # strip out the bias
w = w.reshape(10, 32, 32, 3)

w_min, w_max = np.min(w), np.max(w)

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in xrange(10):
  plt.subplot(2, 5, i + 1)
  
  # Rescale the weights to be between 0 and 255
  wimg = 255.0 * (w[i].squeeze() - w_min) / (w_max - w_min)
  plt.imshow(wimg.astype('uint8'))
  plt.axis('off')
  plt.title(classes[i])
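
The rescaling step in the cell above is a min-max normalization: the global minimum weight maps to 0 and the global maximum to about 255, so every class template shares the same color scale. A minimal sketch on random stand-in weights (an assumption, since the trained best_softmax.W is not reproduced here):

```python
import numpy as np

rng = np.random.RandomState(0)
w_rand = rng.randn(10, 32, 32, 3) * 0.001      # stand-in for learned weights

w_lo, w_hi = w_rand.min(), w_rand.max()
# Map [w_lo, w_hi] linearly onto [0, 255], then truncate to 8-bit integers.
wimg_all = (255.0 * (w_rand - w_lo) / (w_hi - w_lo)).astype('uint8')
print(wimg_all.shape)
```

Using one global minimum and maximum, rather than per-class values, keeps the relative magnitudes of the templates comparable across the ten subplots.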


