Softmax exercise

Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the assignments page on the course website.

This exercise is analogous to the SVM exercise. You will:

  • implement a fully-vectorized loss function for the Softmax classifier (the loss is written out below)
  • implement the fully-vectorized expression for its analytic gradient
  • check your implementation against the numerical gradient
  • use a validation set to tune the learning rate and regularization strength
  • optimize the loss function with SGD
  • visualize the final learned weights
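
For reference, the loss implemented in this exercise is the averaged cross-entropy of the softmax probabilities plus an L2 penalty on the weights; with scores $f = x_i W$ for an example $x_i$ with label $y_i$,

$$ L_i = -\log\!\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right), \qquad L = \frac{1}{N}\sum_{i=1}^{N} L_i \;+\; \lambda \sum_{k,l} W_{k,l}^2, $$

where the exact regularization convention (e.g. an extra factor of 1/2) should follow whatever cs231n/classifiers/softmax.py expects.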

In [1]:
from __future__ import print_function

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [2]:
def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, num_dev=500):
    """
    Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
    it for the linear classifier. These are the same steps as we used for the
    SVM, but condensed to a single function.  
    """
    # Load the raw CIFAR-10 data
    cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
    
    # subsample the data
    mask = list(range(num_training, num_training + num_validation))
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = list(range(num_training))
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = list(range(num_test))
    X_test = X_test[mask]
    y_test = y_test[mask]
    mask = np.random.choice(num_training, num_dev, replace=False)
    X_dev = X_train[mask]
    y_dev = y_train[mask]
    
    # Preprocessing: reshape the image data into rows
    X_train = np.reshape(X_train, (X_train.shape[0], -1))
    X_val = np.reshape(X_val, (X_val.shape[0], -1))
    X_test = np.reshape(X_test, (X_test.shape[0], -1))
    X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))
    
    # Normalize the data: subtract the mean image
    mean_image = np.mean(X_train, axis = 0)
    X_train -= mean_image
    X_val -= mean_image
    X_test -= mean_image
    X_dev -= mean_image
    
    # add bias dimension and transform into columns
    X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
    X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])
    
    return X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev


# Invoke the above function to get our data.
X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev = get_CIFAR10_data()
print('Train data shape: ', X_train.shape)
print('Train labels shape: ', y_train.shape)
print('Validation data shape: ', X_val.shape)
print('Validation labels shape: ', y_val.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
print('dev data shape: ', X_dev.shape)
print('dev labels shape: ', y_dev.shape)


Train data shape:  (49000, 3073)
Train labels shape:  (49000,)
Validation data shape:  (1000, 3073)
Validation labels shape:  (1000,)
Test data shape:  (1000, 3073)
Test labels shape:  (1000,)
dev data shape:  (500, 3073)
dev labels shape:  (500,)
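
The 3073 columns above are the 3072 = 32 × 32 × 3 pixel values plus one constant bias feature. Folding the bias into the weight matrix this way means the class scores come from a single matrix product,

$$ f \;=\; \begin{bmatrix} x & 1 \end{bmatrix} \begin{bmatrix} W' \\ b \end{bmatrix} \;=\; x W' + b, $$

where $W'$ is the 3072 × 10 block of pixel weights and $b$ (the last row of the full 3073 × 10 matrix) plays the role of the bias.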

Softmax Classifier

Your code for this section will all be written inside cs231n/classifiers/softmax.py.


In [17]:
# First implement the naive softmax loss function with nested loops.
# Open the file cs231n/classifiers/softmax.py and implement the
# softmax_loss_naive function.

from cs231n.classifiers.softmax import softmax_loss_naive
import time

# Generate a random softmax weight matrix and use it to compute the loss.
W = np.random.randn(3073, 10) * 0.0001
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As a rough sanity check, our loss should be something close to -log(0.1).
print('loss: %f' % loss)
print('sanity check: %f' % (-np.log(0.1)))


loss: 2.333027
sanity check: 2.302585
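
If you are starting softmax_loss_naive from scratch, a minimal nested-loop sketch is shown below. It is only a sketch: it assumes the (W, X, y, reg) call signature used above, an L2 penalty of the form reg * sum(W**2), and the standard softmax gradient dL_i/df_j = p_j - 1{j == y_i}; the conventions in your softmax.py may differ slightly.

import numpy as np

def softmax_loss_naive_sketch(W, X, y, reg):
    """Softmax loss and gradient with explicit loops (illustrative sketch only).

    W: (D, C) weights, X: (N, D) data, y: (N,) integer labels, reg: L2 strength.
    Returns (loss, dW).
    """
    dW = np.zeros_like(W)
    loss = 0.0
    num_train, num_classes = X.shape[0], W.shape[1]

    for i in range(num_train):
        scores = X[i].dot(W)              # class scores, shape (C,)
        scores -= np.max(scores)          # shift for numerical stability
        probs = np.exp(scores) / np.sum(np.exp(scores))
        loss += -np.log(probs[y[i]])
        for j in range(num_classes):
            # dL_i/df_j = p_j - 1{j == y_i}, then chain rule through f = x_i W
            dW[:, j] += (probs[j] - (j == y[i])) * X[i]

    loss = loss / num_train + reg * np.sum(W * W)
    dW = dW / num_train + 2 * reg * W
    return loss, dW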

Inline Question 1:

Why do we expect our loss to be close to -log(0.1)? Explain briefly.

Your answer:

  • With weights this small, all class scores are nearly equal, so the softmax assigns a probability of roughly 1/10 to each of the 10 classes. The loss is the negative log-probability of the correct class, which is therefore close to -log(0.1).
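
In equation form, with $W$ scaled by $10^{-4}$ every score $f_j$ is close to zero, so for each example

$$ L_i \;=\; -\log\frac{e^{f_{y_i}}}{\sum_{j=1}^{10} e^{f_j}} \;\approx\; -\log\frac{1}{10} \;\approx\; 2.30, $$

which is close to the printed loss of about 2.33; the small gap comes from the random, non-zero weights.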

In [39]:
# Complete the implementation of softmax_loss_naive and implement a (naive)
# version of the gradient that uses nested loops.
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As we did for the SVM, use numeric gradient checking as a debugging tool.
# The numeric gradient should be close to the analytic gradient.
from cs231n.gradient_check import grad_check_sparse
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)

# Similar to the SVM case, do another gradient check with regularization turned on.
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 5e1)
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 5e1)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)


numerical: -1.479844 analytic: -1.479844, relative error: 1.782147e-09
numerical: 1.100589 analytic: 1.100589, relative error: 3.442954e-08
numerical: 1.441948 analytic: 1.441948, relative error: 4.729376e-08
numerical: 0.882819 analytic: 0.882819, relative error: 1.348912e-07
numerical: -0.883276 analytic: -0.883276, relative error: 1.263262e-07
numerical: -0.733017 analytic: -0.733017, relative error: 5.324849e-08
numerical: 1.943821 analytic: 1.943821, relative error: 2.261808e-08
numerical: 2.068780 analytic: 2.068780, relative error: 8.268541e-09
numerical: -2.981932 analytic: -2.981932, relative error: 3.301819e-08
numerical: 0.876112 analytic: 0.876112, relative error: 5.323699e-09
numerical: 0.538905 analytic: 0.538905, relative error: 3.651186e-09
numerical: -1.704399 analytic: -1.704399, relative error: 1.612388e-08
numerical: 0.021751 analytic: 0.021751, relative error: 7.649066e-07
numerical: -0.723039 analytic: -0.723039, relative error: 1.361390e-08
numerical: -0.548528 analytic: -0.548528, relative error: 2.824811e-09
numerical: 0.852736 analytic: 0.852736, relative error: 5.601555e-08
numerical: 1.714801 analytic: 1.714801, relative error: 6.015610e-08
numerical: 2.056742 analytic: 2.056742, relative error: 1.506802e-08
numerical: -0.478372 analytic: -0.478372, relative error: 1.922909e-07
numerical: -0.758759 analytic: -0.758759, relative error: 5.663333e-08
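
The check above compares the analytic gradient with a centered-difference estimate at a handful of randomly chosen entries of W. For intuition, a hypothetical helper illustrating that estimate might look like the following (grad_check_sparse in cs231n/gradient_check.py is what the notebook actually uses):

def numeric_grad_at(f, W, ix, h=1e-5):
    """Centered-difference estimate of dL/dW[ix] (illustration only).

    f:  a function of W returning the scalar loss
    W:  the weight matrix (temporarily modified in place, then restored)
    ix: an index tuple into W, e.g. (row, col)
    """
    old = W[ix]
    W[ix] = old + h
    loss_plus = f(W)
    W[ix] = old - h
    loss_minus = f(W)
    W[ix] = old  # restore the original value
    return (loss_plus - loss_minus) / (2 * h)

The relative error printed above is roughly |numeric - analytic| / (|numeric| + |analytic|) at each sampled coordinate; values around 1e-7 or smaller are a good sign.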

In [60]:
# Now that we have a naive implementation of the softmax loss function and its gradient,
# implement a vectorized version in softmax_loss_vectorized.
# The two versions should compute the same results, but the vectorized version should be
# much faster.
tic = time.time()
loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('naive loss: %e computed in %fs' % (loss_naive, toc - tic))

from cs231n.classifiers.softmax import softmax_loss_vectorized
tic = time.time()
loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic))

# As we did for the SVM, we use the Frobenius norm to compare the two versions
# of the gradient.
grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print('Loss difference: %f' % np.abs(loss_naive - loss_vectorized))
print('Gradient difference: %f' % grad_difference)


naive loss: 2.333027e+00 computed in 0.067715s
vectorized loss: 2.333027e+00 computed in 0.018437s
Loss difference: 0.000000
Gradient difference: 0.000000
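
A fully vectorized version avoids Python loops entirely; one possible sketch (again assuming the same signature and the reg * sum(W**2) convention, not necessarily identical to your softmax.py):

import numpy as np

def softmax_loss_vectorized_sketch(W, X, y, reg):
    """Vectorized softmax loss and gradient (illustrative sketch only)."""
    num_train = X.shape[0]

    scores = X.dot(W)                                # (N, C)
    scores -= np.max(scores, axis=1, keepdims=True)  # numerical stability
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    # average cross-entropy loss plus L2 regularization
    loss = -np.sum(np.log(probs[np.arange(num_train), y])) / num_train
    loss += reg * np.sum(W * W)

    # dL/dscores = probs, with 1 subtracted at each correct class
    dscores = probs.copy()
    dscores[np.arange(num_train), y] -= 1
    dW = X.T.dot(dscores) / num_train + 2 * reg * W
    return loss, dW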

In [62]:
# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of over 0.35 on the validation set.
from cs231n.classifiers import Softmax
results = {}
best_val = -1
best_softmax = None
learning_rates = [1e-7, 5e-7]
regularization_strengths = [2.5e4, 5e4]

################################################################################
# TODO:                                                                        #
# Use the validation set to set the learning rate and regularization strength. #
# This should be identical to the validation that you did for the SVM; save    #
# the best trained softmax classifer in best_softmax.                          #
################################################################################
# Rather than a single train/val split, run num_folds-fold cross-validation
# over the training data and average the accuracy across folds.
num_lr = len(learning_rates)
num_reg = len(regularization_strengths)
num_folds = 5

X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
for i in range(num_lr):
    for j in range(num_reg):
        train_accuracies = []
        val_accuracies = []

        lr = learning_rates[i]
        reg = regularization_strengths[j]

        for k in range(num_folds):
            X_sub_train = np.concatenate(np.delete(X_train_folds, k, axis=0))
            y_sub_train = np.concatenate(np.delete(y_train_folds, k, axis=0))

            X_sub_test = X_train_folds[k]
            y_sub_test = y_train_folds[k]

            softmax = Softmax()
            softmax.train(X_sub_train, y_sub_train, learning_rate=lr, reg=reg, num_iters=500, verbose=True)

            y_sub_train_pred = softmax.predict(X_sub_train)
            train_accuracies.append(np.mean(y_sub_train == y_sub_train_pred))

            y_sub_test_pred = softmax.predict(X_sub_test)
            val_accuracies.append(np.mean(y_sub_test == y_sub_test_pred))

        mean_train_accuracy = np.mean(train_accuracies)
        mean_val_accuracy = np.mean(val_accuracies)
        results[(lr, reg)] = (mean_train_accuracy, mean_val_accuracy)
        print(lr, reg, mean_train_accuracy, mean_val_accuracy)
        if mean_val_accuracy > best_val:
            best_val = mean_val_accuracy
            best_softmax = softmax
################################################################################
#                              END OF YOUR CODE                                #
################################################################################
    
# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
                lr, reg, train_accuracy, val_accuracy))
    
print('best validation accuracy achieved during cross-validation: %f' % best_val)


iteration 0 / 500: loss 775.094055
iteration 100 / 500: loss 284.774091
iteration 200 / 500: loss 105.470507
iteration 300 / 500: loss 39.939331
iteration 400 / 500: loss 15.960799
iteration 0 / 500: loss 779.906812
iteration 100 / 500: loss 286.321006
iteration 200 / 500: loss 106.019371
iteration 300 / 500: loss 40.159316
iteration 400 / 500: loss 15.998181
iteration 0 / 500: loss 775.595801
iteration 100 / 500: loss 284.541590
iteration 200 / 500: loss 105.437643
iteration 300 / 500: loss 39.929086
iteration 400 / 500: loss 15.921370
iteration 0 / 500: loss 765.469917
iteration 100 / 500: loss 280.851754
iteration 200 / 500: loss 104.018112
iteration 300 / 500: loss 39.467800
iteration 400 / 500: loss 15.797074
iteration 0 / 500: loss 770.176479
iteration 100 / 500: loss 282.679073
iteration 200 / 500: loss 104.752795
iteration 300 / 500: loss 39.655019
iteration 400 / 500: loss 15.881786
1e-07 25000.0 0.310469387755 0.307183673469
iteration 0 / 500: loss 1531.399728
iteration 100 / 500: loss 206.262223
iteration 200 / 500: loss 29.425960
iteration 300 / 500: loss 5.795257
iteration 400 / 500: loss 2.641642
iteration 0 / 500: loss 1526.976821
iteration 100 / 500: loss 205.791782
iteration 200 / 500: loss 29.315956
iteration 300 / 500: loss 5.760030
iteration 400 / 500: loss 2.638813
iteration 0 / 500: loss 1538.715384
iteration 100 / 500: loss 207.027036
iteration 200 / 500: loss 29.448862
iteration 300 / 500: loss 5.788410
iteration 400 / 500: loss 2.593163
iteration 0 / 500: loss 1532.785431
iteration 100 / 500: loss 206.360071
iteration 200 / 500: loss 29.370649
iteration 300 / 500: loss 5.808101
iteration 400 / 500: loss 2.660422
iteration 0 / 500: loss 1550.551644
iteration 100 / 500: loss 208.964827
iteration 200 / 500: loss 29.761157
iteration 300 / 500: loss 5.862206
iteration 400 / 500: loss 2.676022
1e-07 50000.0 0.308959183673 0.30793877551
iteration 0 / 500: loss 767.270816
iteration 100 / 500: loss 6.894618
iteration 200 / 500: loss 2.121830
iteration 300 / 500: loss 2.063515
iteration 400 / 500: loss 2.040518
iteration 0 / 500: loss 776.813585
iteration 100 / 500: loss 6.900791
iteration 200 / 500: loss 2.125244
iteration 300 / 500: loss 2.076850
iteration 400 / 500: loss 2.136209
iteration 0 / 500: loss 772.134717
iteration 100 / 500: loss 6.882611
iteration 200 / 500: loss 2.168984
iteration 300 / 500: loss 2.067731
iteration 400 / 500: loss 2.029872
iteration 0 / 500: loss 766.213224
iteration 100 / 500: loss 6.880060
iteration 200 / 500: loss 2.111911
iteration 300 / 500: loss 2.085864
iteration 400 / 500: loss 2.058431
iteration 0 / 500: loss 773.125505
iteration 100 / 500: loss 6.948351
iteration 200 / 500: loss 2.131334
iteration 300 / 500: loss 2.071108
iteration 400 / 500: loss 2.059105
5e-07 25000.0 0.324658163265 0.32206122449
iteration 0 / 500: loss 1538.596060
iteration 100 / 500: loss 2.209344
iteration 200 / 500: loss 2.183698
iteration 300 / 500: loss 2.155738
iteration 400 / 500: loss 2.191131
iteration 0 / 500: loss 1528.268078
iteration 100 / 500: loss 2.233152
iteration 200 / 500: loss 2.129984
iteration 300 / 500: loss 2.150635
iteration 400 / 500: loss 2.139640
iteration 0 / 500: loss 1526.075284
iteration 100 / 500: loss 2.190217
iteration 200 / 500: loss 2.219256
iteration 300 / 500: loss 2.132000
iteration 400 / 500: loss 2.166279
iteration 0 / 500: loss 1536.213835
iteration 100 / 500: loss 2.171952
iteration 200 / 500: loss 2.107740
iteration 300 / 500: loss 2.143999
iteration 400 / 500: loss 2.134183
iteration 0 / 500: loss 1528.913415
iteration 100 / 500: loss 2.185296
iteration 200 / 500: loss 2.138703
iteration 300 / 500: loss 2.114430
iteration 400 / 500: loss 2.116449
5e-07 50000.0 0.300642857143 0.29912244898
lr 1.000000e-07 reg 2.500000e+04 train accuracy: 0.310469 val accuracy: 0.307184
lr 1.000000e-07 reg 5.000000e+04 train accuracy: 0.308959 val accuracy: 0.307939
lr 5.000000e-07 reg 2.500000e+04 train accuracy: 0.324658 val accuracy: 0.322061
lr 5.000000e-07 reg 5.000000e+04 train accuracy: 0.300643 val accuracy: 0.299122
best validation accuracy achieved during cross-validation: 0.322061
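
For comparison, the cell above tunes with 5-fold cross-validation over the training data. The simpler single-split tuning described in the TODO comment (train on X_train, score each candidate on X_val, keep the best) would look roughly like the sketch below, assuming the same Softmax API; num_iters=1500 is an arbitrary illustrative choice.

results = {}
best_val = -1
best_softmax = None

for lr in learning_rates:
    for reg in regularization_strengths:
        softmax = Softmax()
        softmax.train(X_train, y_train, learning_rate=lr, reg=reg, num_iters=1500)

        train_acc = np.mean(y_train == softmax.predict(X_train))
        val_acc = np.mean(y_val == softmax.predict(X_val))
        results[(lr, reg)] = (train_acc, val_acc)

        if val_acc > best_val:
            best_val = val_acc
            best_softmax = softmax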

In [63]:
# Evaluate the best softmax classifier on the test set
y_test_pred = best_softmax.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print('softmax on raw pixels final test set accuracy: %f' % (test_accuracy, ))


softmax on raw pixels final test set accuracy: 0.349000

In [64]:
# Visualize the learned weights for each class
w = best_softmax.W[:-1,:] # strip out the bias
w = w.reshape(32, 32, 3, 10)

w_min, w_max = np.min(w), np.max(w)

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in range(10):
    plt.subplot(2, 5, i + 1)
    
    # Rescale the weights to be between 0 and 255
    wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title(classes[i])