k-Nearest Neighbor (kNN) exercise

Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the assignments page on the course website.

The kNN classifier consists of two stages:

  • During training, the classifier takes the training data and simply remembers it
  • During testing, kNN classifies every test image by comparing to all training images and transfering the labels of the k most similar training examples
  • The value of k is cross-validated

In this exercise you will implement these steps and understand the basic Image Classification pipeline, cross-validation, and gain proficiency in writing efficient, vectorized code.


In [1]:
# Run some setup code for this notebook.

import random
import numpy as np
from data_utils import load_CIFAR10
import matplotlib.pyplot as plt

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [2]:
# Load the raw CIFAR-10 data.
cifar10_dir = 'cifar-10-batches-py'
X_train_im, y_train, X_test_im, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print 'Training data shape: ', X_train_im.shape
print 'Training labels shape: ', y_train.shape
print 'Test data shape: ', X_test_im.shape
print 'Test labels shape: ', y_test.shape


Training data shape:  (50000, 32, 32, 3)
Training labels shape:  (50000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)

In [4]:
# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train_im[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()

# plt.subplot(1,2,1)
# plt.imshow(X_train_im[index_max10])
# plt.subplot(1,2,2)
# plt.imshow(X_test_im[10])



In [5]:
# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = range(num_training)
X_train = X_train_im[mask]
y_train = y_train[mask]

num_test = 500
mask = range(num_test)
X_test = X_test_im[mask]
y_test = y_test[mask]

In [6]:
# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print X_train.shape, X_test.shape


(5000, 3072) (500, 3072)

In [7]:
from classifiers import KNearestNeighbor

# Create a kNN classifier instance. 
# Remember that training a kNN classifier is a noop: 
# the Classifier simply remembers the data and does no further processing 
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)

We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps:

  1. First we must compute the distances between all test examples and all train examples.
  2. Given these distances, for each test example we find the k nearest examples and have them vote for the label

Lets begin with computing the distance matrix between all training and test examples. For example, if there are Ntr training examples and Nte test examples, this stage should result in a Nte x Ntr matrix where each element (i,j) is the distance between the i-th test and j-th train example.

First, open cs231n/classifiers/k_nearest_neighbor.py and implement the function compute_distances_two_loops that uses a (very inefficient) double loop over all pairs of (test, train) examples and computes the distance matrix one element at a time.


In [40]:
# Open cs231n/classifiers/k_nearest_neighbor.py and implement
# compute_distances_two_loops.

# Test your implementation:
print X_train.shape, X_test.shape
dists = classifier.compute_distances_two_loops(X_test)
print dists.shape


(5000, 3072) (500, 3072)
(500, 5000)

In [9]:
# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
plt.imshow(dists, interpolation='none')
plt.show()
index_min10 = np.argmin(dists, axis=1)[10]
index_max10 = np.argmax(dists, axis=1)[10]
print index_min10  # index of min dist for 10th test example
print index_max10
test = np.argsort(dists, axis=1)[10,:3]
print test.shape


303
3286
(3,)

Inline Question #1: Notice the structured patterns in the distance matrix, where some rows or columns are visible brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)

  • What in the data is the cause behind the distinctly bright rows?
  • What causes the columns?

Your Answer: opposite background or foreground colors


In [10]:
# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)


Got 137 / 500 correct => accuracy: 0.274000

You should expect to see approximately 27% accuracy. Now lets try out a larger k, say k = 5:


In [11]:
y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)


Got 142 / 500 correct => accuracy: 0.284000

You should expect to see a slightly better performance than with k = 1.

Test one_loop code

In [19]:
print np.square(X_train - X_test[0,:]).shape
print np.sqrt(np.sum(np.square(X_train - X_test[0,:]), axis = 1)).shape


(5000, 3072)
(5000,)

In [47]:
# Now lets speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)
print dists_one

# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
difference = np.linalg.norm(dists - dists_one, ord='fro')
print 'Difference was: %f' % (difference, )
if difference < 0.001:
  print 'Good! The distance matrices are the same'
else:
  print 'Uh-oh! The distance matrices are different'


[[ 3803.92350081  4210.59603857  5504.0544147  ...,  4007.64756434
   4203.28086142  4354.20256764]
 [ 6336.83367306  5270.28006846  4040.63608854 ...,  4829.15334194
   4694.09767687  7768.33347636]
 [ 5224.83913628  4250.64289255  3773.94581307 ...,  3766.81549853
   4464.99921613  6353.57190878]
 ..., 
 [ 5366.93534524  5062.8772452   6361.85774755 ...,  5126.56824786
   4537.30613911  5920.94156364]
 [ 3671.92919322  3858.60765044  4846.88157479 ...,  3521.04515734
   3182.3673578   4448.65305458]
 [ 6960.92443573  6083.71366848  6338.13442584 ...,  6083.55504619
   4128.24744898  8041.05223214]]
Difference was: 0.000000
Good! The distance matrices are the same
Test no_loop code

In [48]:
# np.sum(X*X, axis=1).reshape((num_test,1)).dot(np.ones((1,num_test))) + np.ones((num_train, 1)).dot(np.transpose(np.sum(self.X_train*self.X_train, axis=1).reshape((num_train,1)))) - 2*X*np.transpose(X_train)
num_features = X_train.shape[1]
num_train = X_train.shape[0]
num_test = X_test.shape[0]
print num_features
print np.sum(X_test*X_test, axis=1).reshape((num_test,1)).dot(np.ones((1,num_train))).shape
print np.ones((num_test, 1)).dot(np.transpose(np.sum(X_train*X_train, axis=1).reshape((num_train,1)))).shape
print 2*X_test.dot(np.transpose(X_train)).shape
print np.sqrt(np.sum(X_test*X_test, axis=1).reshape((num_test,1)).dot(np.ones((1,num_train))) + np.ones((num_test, 1)).dot(np.transpose(np.sum(X_train*X_train, axis=1).reshape((num_train,1)))) - 2*X_test.dot(np.transpose(X_train)))
print dists


3072
(500, 5000)
(500, 5000)
(500, 5000, 500, 5000)
[[ 3803.92350081  4210.59603857  5504.0544147  ...,  4007.64756434
   4203.28086142  4354.20256764]
 [ 6336.83367306  5270.28006846  4040.63608854 ...,  4829.15334194
   4694.09767687  7768.33347636]
 [ 5224.83913628  4250.64289255  3773.94581307 ...,  3766.81549853
   4464.99921613  6353.57190878]
 ..., 
 [ 5366.93534524  5062.8772452   6361.85774755 ...,  5126.56824786
   4537.30613911  5920.94156364]
 [ 3671.92919322  3858.60765044  4846.88157479 ...,  3521.04515734
   3182.3673578   4448.65305458]
 [ 6960.92443573  6083.71366848  6338.13442584 ...,  6083.55504619
   4128.24744898  8041.05223214]]
[[ 3803.92350081  4210.59603857  5504.0544147  ...,  4007.64756434
   4203.28086142  4354.20256764]
 [ 6336.83367306  5270.28006846  4040.63608854 ...,  4829.15334194
   4694.09767687  7768.33347636]
 [ 5224.83913628  4250.64289255  3773.94581307 ...,  3766.81549853
   4464.99921613  6353.57190878]
 ..., 
 [ 5366.93534524  5062.8772452   6361.85774755 ...,  5126.56824786
   4537.30613911  5920.94156364]
 [ 3671.92919322  3858.60765044  4846.88157479 ...,  3521.04515734
   3182.3673578   4448.65305458]
 [ 6960.92443573  6083.71366848  6338.13442584 ...,  6083.55504619
   4128.24744898  8041.05223214]]

In [49]:
# Now implement the fully vectorized version inside compute_distances_no_loops
# and run the code
dists_two = classifier.compute_distances_no_loops(X_test)

# check that the distance matrix agrees with the one we computed before:
difference = np.linalg.norm(dists - dists_two, ord='fro')
print 'Difference was: %f' % (difference, )
if difference < 0.001:
  print 'Good! The distance matrices are the same'
else:
  print 'Uh-oh! The distance matrices are different'


Difference was: 0.000000
Good! The distance matrices are the same

In [50]:
# Let's compare how fast the implementations are
def time_function(f, *args):
  """
  Call a function f with args and return the time (in seconds) that it took to execute.
  """
  import time
  tic = time.time()
  f(*args)
  toc = time.time()
  return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print 'Two loop version took %f seconds' % two_loop_time

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print 'One loop version took %f seconds' % one_loop_time

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print 'No loop version took %f seconds' % no_loop_time

# you should see significantly faster performance with the fully vectorized implementation


Two loop version took 44.701227 seconds
One loop version took 33.225114 seconds
No loop version took 0.211719 seconds

Cross-validation

We have implemented the k-Nearest Neighbor classifier but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation.

test python code for creating fold lists

In [56]:
num_folds = 5
folds = np.arange(num_folds)
print folds
for count in folds:
    # select fold
    a = folds[np.arange(len(folds))!=count] 
    print a
    print folds[count]


[0 1 2 3 4]
[1 2 3 4]
0
[0 2 3 4]
1
[0 1 3 4]
2
[0 1 2 4]
3
[0 1 2 3]
4

In [60]:
num_folds = 5
folds = np.arange(num_folds)
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
# print len(X_train_folds), X_train_folds[0].shape
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################
classifier1 = KNearestNeighbor() #re-init classifier to make sure it's there
for k in k_choices:
    print 'fold %i' % k
    # list of accuracies across all folds for this k
    accuracies = []
    for fold in folds:
        # select fold
        a = folds[np.arange(len(folds))!=fold] 
        for item in a:
            classifier1.train(X_train_folds[item], y_train_folds[item])
        dists_two = classifier1.compute_distances_no_loops(X_train_folds[fold])   
        y_test_pred = classifier1.predict_labels(dists_two, k=k)
        num_correct = np.sum(y_test_pred == y_train_folds[fold])
        accuracies.append(float(num_correct) / len(y_train_folds[fold]))
        print 'Got %d / %d correct => accuracy: %f for k: %i, fold: %i' % (num_correct, num_test, accuracies[fold], k, fold)
    k_to_accuracies[k] = accuracies
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print 'k = %d, accuracy = %f' % (k, accuracy)


fold 1
Got 235 / 500 correct => accuracy: 0.235000 for k: 1, fold: 0
Got 237 / 500 correct => accuracy: 0.237000 for k: 1, fold: 1
Got 258 / 500 correct => accuracy: 0.258000 for k: 1, fold: 2
Got 242 / 500 correct => accuracy: 0.242000 for k: 1, fold: 3
Got 227 / 500 correct => accuracy: 0.227000 for k: 1, fold: 4
fold 3
Got 232 / 500 correct => accuracy: 0.232000 for k: 3, fold: 0
Got 235 / 500 correct => accuracy: 0.235000 for k: 3, fold: 1
Got 259 / 500 correct => accuracy: 0.259000 for k: 3, fold: 2
Got 219 / 500 correct => accuracy: 0.219000 for k: 3, fold: 3
Got 222 / 500 correct => accuracy: 0.222000 for k: 3, fold: 4
fold 5
Got 246 / 500 correct => accuracy: 0.246000 for k: 5, fold: 0
Got 263 / 500 correct => accuracy: 0.263000 for k: 5, fold: 1
Got 262 / 500 correct => accuracy: 0.262000 for k: 5, fold: 2
Got 251 / 500 correct => accuracy: 0.251000 for k: 5, fold: 3
Got 240 / 500 correct => accuracy: 0.240000 for k: 5, fold: 4
fold 8
Got 252 / 500 correct => accuracy: 0.252000 for k: 8, fold: 0
Got 273 / 500 correct => accuracy: 0.273000 for k: 8, fold: 1
Got 264 / 500 correct => accuracy: 0.264000 for k: 8, fold: 2
Got 242 / 500 correct => accuracy: 0.242000 for k: 8, fold: 3
Got 235 / 500 correct => accuracy: 0.235000 for k: 8, fold: 4
fold 10
Got 264 / 500 correct => accuracy: 0.264000 for k: 10, fold: 0
Got 272 / 500 correct => accuracy: 0.272000 for k: 10, fold: 1
Got 267 / 500 correct => accuracy: 0.267000 for k: 10, fold: 2
Got 255 / 500 correct => accuracy: 0.255000 for k: 10, fold: 3
Got 230 / 500 correct => accuracy: 0.230000 for k: 10, fold: 4
fold 12
Got 259 / 500 correct => accuracy: 0.259000 for k: 12, fold: 0
Got 269 / 500 correct => accuracy: 0.269000 for k: 12, fold: 1
Got 264 / 500 correct => accuracy: 0.264000 for k: 12, fold: 2
Got 254 / 500 correct => accuracy: 0.254000 for k: 12, fold: 3
Got 247 / 500 correct => accuracy: 0.247000 for k: 12, fold: 4
fold 15
Got 262 / 500 correct => accuracy: 0.262000 for k: 15, fold: 0
Got 268 / 500 correct => accuracy: 0.268000 for k: 15, fold: 1
Got 253 / 500 correct => accuracy: 0.253000 for k: 15, fold: 2
Got 252 / 500 correct => accuracy: 0.252000 for k: 15, fold: 3
Got 235 / 500 correct => accuracy: 0.235000 for k: 15, fold: 4
fold 20
Got 260 / 500 correct => accuracy: 0.260000 for k: 20, fold: 0
Got 253 / 500 correct => accuracy: 0.253000 for k: 20, fold: 1
Got 252 / 500 correct => accuracy: 0.252000 for k: 20, fold: 2
Got 242 / 500 correct => accuracy: 0.242000 for k: 20, fold: 3
Got 246 / 500 correct => accuracy: 0.246000 for k: 20, fold: 4
fold 50
Got 247 / 500 correct => accuracy: 0.247000 for k: 50, fold: 0
Got 255 / 500 correct => accuracy: 0.255000 for k: 50, fold: 1
Got 240 / 500 correct => accuracy: 0.240000 for k: 50, fold: 2
Got 249 / 500 correct => accuracy: 0.249000 for k: 50, fold: 3
Got 240 / 500 correct => accuracy: 0.240000 for k: 50, fold: 4
fold 100
Got 243 / 500 correct => accuracy: 0.243000 for k: 100, fold: 0
Got 239 / 500 correct => accuracy: 0.239000 for k: 100, fold: 1
Got 240 / 500 correct => accuracy: 0.240000 for k: 100, fold: 2
Got 230 / 500 correct => accuracy: 0.230000 for k: 100, fold: 3
Got 233 / 500 correct => accuracy: 0.233000 for k: 100, fold: 4
k = 1, accuracy = 0.235000
k = 1, accuracy = 0.237000
k = 1, accuracy = 0.258000
k = 1, accuracy = 0.242000
k = 1, accuracy = 0.227000
k = 3, accuracy = 0.232000
k = 3, accuracy = 0.235000
k = 3, accuracy = 0.259000
k = 3, accuracy = 0.219000
k = 3, accuracy = 0.222000
k = 5, accuracy = 0.246000
k = 5, accuracy = 0.263000
k = 5, accuracy = 0.262000
k = 5, accuracy = 0.251000
k = 5, accuracy = 0.240000
k = 8, accuracy = 0.252000
k = 8, accuracy = 0.273000
k = 8, accuracy = 0.264000
k = 8, accuracy = 0.242000
k = 8, accuracy = 0.235000
k = 10, accuracy = 0.264000
k = 10, accuracy = 0.272000
k = 10, accuracy = 0.267000
k = 10, accuracy = 0.255000
k = 10, accuracy = 0.230000
k = 12, accuracy = 0.259000
k = 12, accuracy = 0.269000
k = 12, accuracy = 0.264000
k = 12, accuracy = 0.254000
k = 12, accuracy = 0.247000
k = 15, accuracy = 0.262000
k = 15, accuracy = 0.268000
k = 15, accuracy = 0.253000
k = 15, accuracy = 0.252000
k = 15, accuracy = 0.235000
k = 20, accuracy = 0.260000
k = 20, accuracy = 0.253000
k = 20, accuracy = 0.252000
k = 20, accuracy = 0.242000
k = 20, accuracy = 0.246000
k = 50, accuracy = 0.247000
k = 50, accuracy = 0.255000
k = 50, accuracy = 0.240000
k = 50, accuracy = 0.249000
k = 50, accuracy = 0.240000
k = 100, accuracy = 0.243000
k = 100, accuracy = 0.239000
k = 100, accuracy = 0.240000
k = 100, accuracy = 0.230000
k = 100, accuracy = 0.233000

In [65]:
# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    print 'average accuracy for k %i: %f'% (k,np.mean(k_to_accuracies[k]))
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()


average accuracy for k 1: 0.239800
average accuracy for k 3: 0.233400
average accuracy for k 5: 0.252400
average accuracy for k 8: 0.253200
average accuracy for k 10: 0.257600
average accuracy for k 12: 0.258600
average accuracy for k 15: 0.254000
average accuracy for k 20: 0.250600
average accuracy for k 50: 0.246200
average accuracy for k 100: 0.237000

In [69]:
# Based on the cross-validation results above, choose the best value for k,   
# retrain the classifier using all the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.
best_k = 7

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)


Got 141 / 500 correct => accuracy: 0.282000

In [ ]: