SVM classification

Building an SVM classifier for multi-class classification on the CIFAR-10 dataset

  • implement a fully-vectorized loss function for the SVM classifier (the hinge loss, given below)
  • implement the fully-vectorized expression for its analytic gradient
  • check the implementation against a numerical gradient
  • use a validation set to tune the learning rate and regularization strength
  • optimize the loss function with Batch Gradient Descent and Stochastic Gradient Descent
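
For reference, the loss referred to above is the multiclass SVM (hinge) loss from the CS231n notes: for a sample x_i with label y_i and class scores s = W x_i, the per-sample loss is L_i = sum over j != y_i of max(0, s_j - s_{y_i} + 1), and the full objective averages L_i over the training set and adds an L2 regularization term scaled by the regularization strength reg (the exact scaling, e.g. 0.5 * reg * sum(W^2), depends on the implementation).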

Download CIFAR-10 Data

I use the data-loading code from the Stanford CS231n course

Run get_datasets.sh in a terminal to download the dataset, or download it directly from Alex Krizhevsky's page.

get_datasets.sh

# Get CIFAR10
wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xzvf cifar-10-python.tar.gz
rm cifar-10-python.tar.gz

The downloaded and extracted files are shown in the following figure.


In [1]:
# Setup code for this notebook
import random 
import numpy as np
import matplotlib.pyplot as plt

# This is a bit of magic to make matplotlib figures appear inline
# in the notebook rather than in a new window
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [2]:
### Functions to load the CIFAR-10 data
# The original code is from http://cs231n.github.io/assignment1/
# The functions are kept in the data_utils.py file so they can be reused.
import cPickle as pickle
import numpy as np
import os

def load_CIFAR_batch(filename):
  """ load single batch of cifar """
  with open(filename, 'rb') as f:
    datadict = pickle.load(f)
    X = datadict['data']
    Y = datadict['labels']
    X = X.reshape(10000, 3, 32, 32).transpose(0,2,3,1).astype("float")
    Y = np.array(Y)
    return X, Y

def load_CIFAR10(ROOT):
  """ load all of cifar """
  xs = []
  ys = []
  for b in range(1,6):
    f = os.path.join(ROOT, 'data_batch_%d' % (b, ))
    X, Y = load_CIFAR_batch(f)
    xs.append(X)
    ys.append(Y)    
  Xtr = np.concatenate(xs)
  Ytr = np.concatenate(ys)
  del X, Y
  Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
  return Xtr, Ytr, Xte, Yte
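
As a quick check of what these functions return (assuming the standard CIFAR-10 layout of five 10,000-image training batches plus one test batch), the raw arrays have the following shapes:

Xtr, Ytr, Xte, Yte = load_CIFAR10('datasets/datasets-cifar-10/cifar-10-batches-py/')
print Xtr.shape, Ytr.shape   # (50000, 32, 32, 3) (50000,)
print Xte.shape, Yte.shape   # (10000, 32, 32, 3) (10000,)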

Load data and visualize samples


In [3]:
from algorithms.data_utils import load_CIFAR10
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

def get_CIFAR10_data(num_training=49000, num_val=1000, num_test=10000, show_sample=True):
    """
    Load the CIFAR-10 dataset and split it into training, validation and test sets
    """

    cifar10_dir = 'datasets/datasets-cifar-10/cifar-10-batches-py/'
    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
        
    # subsample the data for the validation set
    mask = range(num_training, num_training + num_val)
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = range(num_training)
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = range(num_test)
    X_test = X_test[mask]
    y_test = y_test[mask]
    return X_train, y_train, X_val, y_val, X_test, y_test

def visualize_sample(X_train, y_train, classes, samples_per_class=7):
    """visualize some samples in the training datasets """
    num_classes = len(classes)
    for y, cls in enumerate(classes):
        idxs = np.flatnonzero(y_train == y) # get all the indexes of cls
        idxs = np.random.choice(idxs, samples_per_class, replace=False)
        for i, idx in enumerate(idxs): # plot the image one by one
            plt_idx = i * num_classes + y + 1 # i*num_classes and y+1 determine the row and column respectively
            plt.subplot(samples_per_class, num_classes, plt_idx)
            plt.imshow(X_train[idx].astype('uint8'))
            plt.axis('off')
            if i == 0:
                plt.title(cls)
    plt.show()
    
def preprocessing_CIFAR10_data(X_train, y_train, X_val, y_val, X_test, y_test):
    
    # Preprocessing: reshape the image data into rows
    X_train = np.reshape(X_train, (X_train.shape[0], -1)) # [49000, 3072]
    X_val = np.reshape(X_val, (X_val.shape[0], -1)) # [1000, 3072]
    X_test = np.reshape(X_test, (X_test.shape[0], -1)) # [10000, 3072]
    
    # Normalize the data: subtract the mean image
    mean_image = np.mean(X_train, axis = 0)
    X_train -= mean_image
    X_val -= mean_image
    X_test -= mean_image
    
    # Add bias dimension and transform into columns
    X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))]).T
    X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))]).T
    X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))]).T
    return X_train, y_train, X_val, y_val, X_test, y_test


# Invoke the above functions to get our data
X_train_raw, y_train_raw, X_val_raw, y_val_raw, X_test_raw, y_test_raw = get_CIFAR10_data()
visualize_sample(X_train_raw, y_train_raw, classes)
X_train, y_train, X_val, y_val, X_test, y_test = preprocessing_CIFAR10_data(X_train_raw, y_train_raw, X_val_raw, y_val_raw, X_test_raw, y_test_raw)

# As a sanity check, we print out the dimensions of the training, validation and test data
print 'Train data shape: ', X_train.shape
print 'Train labels shape: ', y_train.shape
print 'Validation data shape: ', X_val.shape
print 'Validation labels shape: ', y_val.shape
print 'Test data shape: ', X_test.shape
print 'Test labels shape: ', y_test.shape


Train data shape:  (3073, 49000)
Train labels shape:  (49000,)
Validation data shape:  (3073, 1000)
Validation labels shape:  (1000,)
Test data shape:  (3073, 10000)
Test labels shape:  (10000,)
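
Each sample is now a 3073-dimensional column: 3072 pixel values plus a constant 1 appended as a bias feature. With the data stored as columns, the class scores of an image x are simply s = W x, where W has shape 10 x 3073 and its last column plays the role of the per-class bias.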

SVM Classifier

The classifier code runs in the backend; you can find it here, or on github
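
The vectorized loss/gradient function itself is not reproduced in this notebook. As a rough guide, a minimal sketch of a fully-vectorized hinge loss and gradient under the same conventions (W of shape [num_classes, D], X of shape [D, num_samples], labels y, L2 regularization strength reg) might look like the following; the actual backend implementation may differ in details such as the regularization scaling.

def loss_grad_svm_vectorized_sketch(W, X, y, reg):
  """Sketch of a vectorized multiclass SVM loss and gradient.
  W: [K, D] weights, X: [D, N] data columns, y: [N] integer labels."""
  num_train = X.shape[1]
  scores = W.dot(X)                                 # [K, N] class scores
  correct = scores[y, np.arange(num_train)]         # [N] scores of the true classes
  margins = np.maximum(0, scores - correct + 1.0)   # hinge margins with delta = 1
  margins[y, np.arange(num_train)] = 0              # the true class contributes no loss
  loss = np.sum(margins) / num_train + 0.5 * reg * np.sum(W * W)

  # Gradient: every violated margin adds x to row j and subtracts x from row y
  binary = (margins > 0).astype(float)              # [K, N] indicator of violated margins
  binary[y, np.arange(num_train)] = -np.sum(binary, axis=0)
  grad = binary.dot(X.T) / num_train + reg * W      # [K, D]
  return loss, grad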


In [4]:
# Test the loss and gradient
from algorithms.classifiers import loss_grad_svm_vectorized
import time

# generate a random weight matrix W
W = np.random.randn(10, X_train.shape[0]) * 0.001

tic = time.time()
loss_vec, grad_vect = loss_grad_svm_vectorized(W, X_train, y_train, 0)
toc = time.time()
print 'Vectorized loss: %f, and gradient: computed in %fs' % (loss_vec, toc - tic)


Vectorized loss: 23.268075, and gradient: computed in 0.371483s

Write a function that computes the gradient numerically to check the analytic gradient

The function is kept in the algorithms/gradient_check.py file for future use


In [5]:
# file: algorithms/gradient_check.py
from random import randrange

def grad_check_sparse(f, x, analytic_grad, num_checks):
  """
  Sample a few random elements and only compute the numerical
  gradient in these dimensions.
  """
  h = 1e-5

  print x.shape

  for i in xrange(num_checks):
    ix = tuple([randrange(m) for m in x.shape])
    print ix
    x[ix] += h # increment by h
    fxph = f(x) # evaluate f(x + h)
    x[ix] -= 2 * h # decrement by 2h (now at x - h)
    fxmh = f(x) # evaluate f(x - h)
    x[ix] += h # reset to the original value

    grad_numerical = (fxph - fxmh) / (2 * h) # central difference
    grad_analytic = analytic_grad[ix]
    rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic))
    print 'numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error)

In [6]:
# Check the gradient numerically along several randomly chosen dimensions
from algorithms.classifiers import loss_grad_svm_vectorized
from algorithms.gradient_check import grad_check_sparse

f = lambda w: loss_grad_svm_vectorized(w, X_train, y_train, 0)[0]
grad_numerical = grad_check_sparse(f, W, grad_vect, 10)


(10, 3073)
(8, 2480)
numerical: -9.699973 analytic: -9.701368, relative error: 7.191412e-05
(8, 186)
numerical: 1.983721 analytic: 1.982665, relative error: 2.661989e-04
(4, 917)
numerical: 11.228352 analytic: 11.228552, relative error: 8.917422e-06
(3, 262)
numerical: 24.938945 analytic: 24.938564, relative error: 7.629843e-06
(2, 2527)
numerical: -1.444160 analytic: -1.445250, relative error: 3.769463e-04
(0, 2262)
numerical: -5.966210 analytic: -5.965701, relative error: 4.268464e-05
(0, 604)
numerical: -39.100243 analytic: -39.100302, relative error: 7.444130e-07
(0, 1709)
numerical: -31.457527 analytic: -31.457868, relative error: 5.410615e-06
(4, 921)
numerical: 7.726344 analytic: 7.726770, relative error: 2.754343e-05
(3, 1037)
numerical: 16.876939 analytic: 16.879089, relative error: 6.367964e-05

In [10]:
# Training the SVM classifier using SGD
from algorithms.classifiers import SVM

# using the SGD algorithm
SVM_sgd = SVM()
tic = time.time()
losses_sgd = SVM_sgd.train(X_train, y_train, method='sgd', batch_size=200, learning_rate=1e-6,
              reg = 1e5, num_iters=1000, verbose=True, vectorized=True)
toc = time.time()
print 'Training time for SGD with vectorized version is %f \n' % (toc - tic)

y_train_pred_sgd = SVM_sgd.predict(X_train)[0]
print 'Training accuracy: %f' % (np.mean(y_train == y_train_pred_sgd))
y_val_pred_sgd = SVM_sgd.predict(X_val)[0]
print 'Validation accuracy: %f' % (np.mean(y_val == y_val_pred_sgd))


iteration 0/1000: loss 1550.093675
iteration 100/1000: loss 7.164944
iteration 200/1000: loss 8.013753
iteration 300/1000: loss 7.506033
iteration 400/1000: loss 7.133593
iteration 500/1000: loss 8.129725
iteration 600/1000: loss 6.453695
iteration 700/1000: loss 7.891130
iteration 800/1000: loss 7.396763
iteration 900/1000: loss 7.338838
Training time for SGD with vectorized version is 27.089724 

Training accuracy: 0.270898
Validation accuracy: 0.284000
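
The SVM class wraps the loss/gradient in a training loop; its train method lives in the backend code. As a rough, hypothetical sketch (not the author's actual implementation), a mini-batch SGD loop over loss_grad_svm_vectorized with the same hyperparameters could look like this:

def sgd_train_sketch(X, y, learning_rate=1e-6, reg=1e5, num_iters=1000,
                     batch_size=200, verbose=False):
  """Mini-batch SGD on the vectorized SVM loss. X: [D, N], y: [N]."""
  dim, num_train = X.shape
  num_classes = np.max(y) + 1
  W = np.random.randn(num_classes, dim) * 0.001     # small random initialization
  losses = []
  for it in xrange(num_iters):
    # sample a mini-batch of columns (with replacement, for speed)
    idx = np.random.choice(num_train, batch_size)
    loss, grad = loss_grad_svm_vectorized(W, X[:, idx], y[idx], reg)
    losses.append(loss)
    W -= learning_rate * grad                       # one gradient descent step
    if verbose and it % 100 == 0:
      print 'iteration %d/%d: loss %f' % (it, num_iters, loss)
  return W, losses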

In [11]:
from ggplot import *
qplot(xrange(len(losses_sgd)), losses_sgd) + labs(x='Iteration number', y='SGD Loss value')


Out[11]:
<ggplot: (471566857)>

In [12]:
# Using the validation set to tune hyperparameters, i.e., learning rate and regularization strength
learning_rates = [1e-5, 1e-8]
regularization_strengths = [10e2, 10e4]

# Result is a dictionary mapping tuples of the form (learning_rate, regularization_strength) 
# to tuples of the form (training_accuracy, validation_accuracy). The accuracy is simply the fraction
# of data points that are correctly classified.
results = {}
best_val = -1
best_svm = None
# Choose the best hyperparameters by tuning on the validation set
i = 0
interval = 5
for learning_rate in np.linspace(learning_rates[0], learning_rates[1], num=interval):
    i += 1
    print 'The current iteration is %d/%d' % (i, interval)
    for reg in np.linspace(regularization_strengths[0], regularization_strengths[1], num=interval):
        svm = SVM()
        svm.train(X_train, y_train, method='sgd', batch_size=200, learning_rate=learning_rate,
              reg = reg, num_iters=1000, verbose=False, vectorized=True)
        y_train_pred = svm.predict(X_train)[0]
        y_val_pred = svm.predict(X_val)[0]
        train_accuracy = np.mean(y_train == y_train_pred)
        val_accuracy = np.mean(y_val == y_val_pred)
        results[(learning_rate, reg)] = (train_accuracy, val_accuracy)
        if val_accuracy > best_val:
            best_val = val_accuracy
            best_svm = svm

# Print out the results
for learning_rate, reg in sorted(results):
    train_accuracy,val_accuracy = results[(learning_rate, reg)]
    print 'learning rate %e and regularization %e, \n \
    the training accuracy is: %f and validation accuracy is: %f.\n' % (learning_rate, reg, train_accuracy, val_accuracy)


The current iteration is 1/5
The current iteration is 2/5
The current iteration is 3/5
The current iteration is 4/5
The current iteration is 5/5
learning rate 1.000000e-08 and regularization 1.000000e+03, 
     the training accuracy is: 0.207878 and validation accuracy is: 0.227000.

learning rate 1.000000e-08 and regularization 2.575000e+04, 
     the training accuracy is: 0.205592 and validation accuracy is: 0.218000.

learning rate 1.000000e-08 and regularization 5.050000e+04, 
     the training accuracy is: 0.222633 and validation accuracy is: 0.245000.

learning rate 1.000000e-08 and regularization 7.525000e+04, 
     the training accuracy is: 0.243000 and validation accuracy is: 0.259000.

learning rate 1.000000e-08 and regularization 1.000000e+05, 
     the training accuracy is: 0.240224 and validation accuracy is: 0.282000.

learning rate 2.507500e-06 and regularization 1.000000e+03, 
     the training accuracy is: 0.347082 and validation accuracy is: 0.369000.

learning rate 2.507500e-06 and regularization 2.575000e+04, 
     the training accuracy is: 0.287959 and validation accuracy is: 0.304000.

learning rate 2.507500e-06 and regularization 5.050000e+04, 
     the training accuracy is: 0.252449 and validation accuracy is: 0.259000.

learning rate 2.507500e-06 and regularization 7.525000e+04, 
     the training accuracy is: 0.235816 and validation accuracy is: 0.253000.

learning rate 2.507500e-06 and regularization 1.000000e+05, 
     the training accuracy is: 0.212694 and validation accuracy is: 0.237000.

learning rate 5.005000e-06 and regularization 1.000000e+03, 
     the training accuracy is: 0.306143 and validation accuracy is: 0.291000.

learning rate 5.005000e-06 and regularization 2.575000e+04, 
     the training accuracy is: 0.220878 and validation accuracy is: 0.227000.

learning rate 5.005000e-06 and regularization 5.050000e+04, 
     the training accuracy is: 0.165898 and validation accuracy is: 0.184000.

learning rate 5.005000e-06 and regularization 7.525000e+04, 
     the training accuracy is: 0.175143 and validation accuracy is: 0.166000.

learning rate 5.005000e-06 and regularization 1.000000e+05, 
     the training accuracy is: 0.163939 and validation accuracy is: 0.163000.

learning rate 7.502500e-06 and regularization 1.000000e+03, 
     the training accuracy is: 0.265918 and validation accuracy is: 0.248000.

learning rate 7.502500e-06 and regularization 2.575000e+04, 
     the training accuracy is: 0.181327 and validation accuracy is: 0.196000.

learning rate 7.502500e-06 and regularization 5.050000e+04, 
     the training accuracy is: 0.191816 and validation accuracy is: 0.196000.

learning rate 7.502500e-06 and regularization 7.525000e+04, 
     the training accuracy is: 0.189592 and validation accuracy is: 0.190000.

learning rate 7.502500e-06 and regularization 1.000000e+05, 
     the training accuracy is: 0.158020 and validation accuracy is: 0.149000.

learning rate 1.000000e-05 and regularization 1.000000e+03, 
     the training accuracy is: 0.250327 and validation accuracy is: 0.266000.

learning rate 1.000000e-05 and regularization 2.575000e+04, 
     the training accuracy is: 0.201673 and validation accuracy is: 0.201000.

learning rate 1.000000e-05 and regularization 5.050000e+04, 
     the training accuracy is: 0.165449 and validation accuracy is: 0.160000.

learning rate 1.000000e-05 and regularization 7.525000e+04, 
     the training accuracy is: 0.170959 and validation accuracy is: 0.153000.

learning rate 1.000000e-05 and regularization 1.000000e+05, 
     the training accuracy is: 0.155000 and validation accuracy is: 0.164000.
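
Rather than reading the grid off the printed list, the results dictionary can also be visualized; a minimal matplotlib sketch (assuming the results dict built in the cell above) is:

import math
keys = sorted(results.keys())
lrs = [math.log10(lr) for lr, reg in keys]
regs = [math.log10(reg) for lr, reg in keys]
val_accs = [results[k][1] for k in keys]
plt.scatter(lrs, regs, c=val_accs, s=100)
cbar = plt.colorbar()
cbar.set_label('validation accuracy')
plt.xlabel('log10(learning rate)')
plt.ylabel('log10(regularization strength)')
plt.title('Validation accuracy over the hyperparameter grid')
plt.show()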

Test the best SVM classifier on the test set

The best model from the validation search (learning rate 2.5075e-06, regularization 1e3, 36.9% validation accuracy) reaches 33.54% accuracy on the test set.


In [13]:
y_test_predict_result = best_svm.predict(X_test)
y_test_predict = y_test_predict_result[0]
test_accuracy = np.mean(y_test == y_test_predict)
print 'The test accuracy is: %f' % test_accuracy


The test accuracy is: 0.335400