k-Nearest Neighbor (kNN) exercise

Complete and hand in this worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details, see the assignments page on the course website.

The kNN classifier consists of two stages:

  • During training, the classifier takes the training data and simply remembers it
  • During testing, kNN classifies every test image by comparing it to all training images and transferring the labels of the k most similar training examples
  • The value of k is cross-validated

In this exercise you will implement these steps, understand the basic Image Classification pipeline and cross-validation, and gain proficiency in writing efficient, vectorized code.
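As a mental model, the entire classifier fits in a few lines. Below is a minimal sketch (an illustration assuming plain Euclidean distance, not the graded implementation in classifiers.KNearestNeighbor); it is memory-hungry, and much of this exercise is about computing the distances more efficiently:

import numpy as np

class KNearestNeighborSketch:
    def train(self, X, y):
        # "training" is just memorizing the data
        self.X_train, self.y_train = X, y

    def predict(self, X, k=1):
        # L2 distances between every test and every training example;
        # note the (num_test, num_train, D) temporary -- fine for tiny
        # inputs, far too large for the full CIFAR-10 arrays
        dists = np.sqrt(((X[:, None, :] - self.X_train[None, :, :]) ** 2).sum(axis=2))
        # labels of the k nearest neighbors, then a majority vote per row
        neighbors = self.y_train[np.argsort(dists, axis=1)[:, :k]]
        return np.array([np.argmax(np.bincount(row)) for row in neighbors])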


In [1]:
import os
# Move up one directory so that `utils`, `classifiers`, and `datasets` resolve
os.chdir(os.getcwd() + '/..')

# Run some setup code for this notebook
import random
import numpy as np
import matplotlib.pyplot as plt

from utils.data_utils import load_CIFAR10

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [2]:
# Load the raw CIFAR-10 data.
cifar10_dir = 'datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)


Training data shape:  (50000, 32, 32, 3)
Training labels shape:  (50000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)

In [3]:
# Visualize some examples from the dataset.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
sample_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, sample_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(sample_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()



In [4]:
# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]

In [5]:
# Reshape the image data into rows
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)
print(X_train.shape, X_test.shape)


(5000, 3072) (500, 3072)

In [6]:
from classifiers import KNearestNeighbor

# Create a kNN classifier instance.
# Remember that training a kNN classifier is a no-op:
# the classifier simply remembers the data and does no further processing
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)

We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps:

  1. First we must compute the distances between all test examples and all train examples.
  2. Given these distances, for each test example we find the k nearest examples and have them vote for the label

Let's begin by computing the distance matrix between all training and test examples. For example, if there are Ntr training examples and Nte test examples, this stage should result in an Nte x Ntr matrix where each element (i, j) is the distance between the i-th test example and the j-th training example.
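One possible shape for the two-loop version, shown here only as a hedged sketch (it assumes Euclidean (L2) distance; the version you hand in lives in classifiers.KNearestNeighbor):

def compute_distances_two_loops_sketch(X_test, X_train):
    # Naive O(Nte * Ntr) loop: one L2 distance per (test, train) pair
    num_test, num_train = X_test.shape[0], X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            dists[i, j] = np.sqrt(np.sum((X_test[i] - X_train[j]) ** 2))
    return dists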


In [7]:
# Test the implementation of compute_distances_two_loops.

dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)


(500, 5000)

In [8]:
# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
plt.imshow(dists, interpolation='none')
plt.show()


Inline Question #1: Notice the structured patterns in the distance matrix, where some rows or columns are visibly brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)

  • What in the data is the cause behind the distinctly bright rows?
  • What causes the columns?

Your Answer: A distinctly bright row corresponds to a test image that is far from all training images (for example, an outlier with unusual brightness, color, or background); a bright column corresponds to a training image that is similarly far from most test images.
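The next two cells rely on predict_labels, which selects the k smallest distances in each row and takes a majority vote over the corresponding training labels. A hypothetical sketch of that logic (the names and tie-breaking here are assumptions, not the required implementation):

def predict_labels_sketch(dists, y_train, k=1):
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test, dtype=y_train.dtype)
    for i in range(num_test):
        # labels of the k nearest training examples for test example i
        closest_y = y_train[np.argsort(dists[i])[:k]]
        # majority vote; np.argmax breaks ties toward the smaller label
        y_pred[i] = np.argmax(np.bincount(closest_y))
    return y_pred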


In [9]:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test == y_test_pred)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))


Got 137 / 500 correct => accuracy: 0.274000

You should expect to see approximately 27% accuracy. Now let's try out a larger k, say k = 5:


In [10]:
y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))


Got 139 / 500 correct => accuracy: 0.278000

You should expect to see slightly better performance than with k = 1.


In [11]:
# Now let's speed up the distance matrix computation by using partial vectorization
# with one loop.
dists_one = classifier.compute_distances_one_loop(X_test)

# To check that the two implementations agree, compute the Frobenius norm of
# the difference of the two distance matrices: the square root of the sum of
# the squared differences of all elements
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')


Difference was: 0.000000
Good! The distance matrices are the same
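For reference, a hedged sketch of what the one-loop version might look like: broadcasting one test row against the whole training matrix removes the inner loop (again assuming L2 distance):

def compute_distances_one_loop_sketch(X_test, X_train):
    num_test = X_test.shape[0]
    dists = np.zeros((num_test, X_train.shape[0]))
    for i in range(num_test):
        # (num_train, D) - (D,) broadcasts across all training rows at once
        dists[i] = np.sqrt(np.sum((X_train - X_test[i]) ** 2, axis=1))
    return dists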

In [12]:
# fully vectorized version
dists_two = classifier.compute_distances_no_loops(X_test)

difference = np.linalg.norm(dists_two - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')


Difference was: 0.000000
Good! The distance matrices are the same
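The fully vectorized version typically relies on expanding the squared distance, ||x - y||^2 = ||x||^2 - 2 x·y + ||y||^2, so that the cross term becomes a single matrix multiply. A sketch under that assumption (not necessarily the implementation in classifiers.KNearestNeighbor):

def compute_distances_no_loops_sketch(X_test, X_train):
    test_sq = np.sum(X_test ** 2, axis=1, keepdims=True)  # (num_test, 1)
    train_sq = np.sum(X_train ** 2, axis=1)               # (num_train,)
    cross = X_test.dot(X_train.T)                         # (num_test, num_train)
    # clip tiny negative values caused by floating-point error before sqrt
    return np.sqrt(np.maximum(test_sq - 2.0 * cross + train_sq, 0.0))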

In [ ]:
# Let's compare how fast the implementations are
def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

# You should see significantly faster performance with the fully vectorized
# implementation. Note that the one-loop version can even be slower than the
# two-loop version: its row-wise broadcasting allocates large temporary
# arrays, so it tends to be memory-bound, as the timings below show.


Two loop version took 34.365219 seconds
One loop version took 54.740267 seconds
No loop version took 0.307001 seconds

Cross-validation

We have implemented the k-Nearest Neighbor classifier but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation.


In [7]:
num_folds = 5
k_choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

X_train_folds = []
y_train_folds = []

fold_ids = np.array_split(np.arange(num_training), num_folds)
for i in range(num_folds):
    X_train_folds.append(X_train[fold_ids[i], :])
    y_train_folds.append(y_train[fold_ids[i]])
    
# a dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}

for k in k_choices:
    k_to_accuracies[k] = []
    for i in range(num_folds):
        Xs = []
        ys = []
        X_val = X_train_folds[i]
        y_val = y_train_folds[i]
        
        for j in range(num_folds):
            if i == j:
                continue
                
            Xs.append(X_train_folds[j])
            ys.append(y_train_folds[j])
        
        Xs = np.concatenate(Xs)
        ys = np.concatenate(ys)
        
        classifier.train(Xs, ys)
        y_val_pred = classifier.predict(X_val, k=k)
            
        accuracy = np.mean(y_val == y_val_pred)
        k_to_accuracies[k].append(accuracy)
        print('k = %d, accuracy = %f' % (k, accuracy))
    print()


k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000

k = 2, accuracy = 0.235000
k = 2, accuracy = 0.219000
k = 2, accuracy = 0.234000
k = 2, accuracy = 0.247000
k = 2, accuracy = 0.252000

k = 3, accuracy = 0.239000
k = 3, accuracy = 0.249000
k = 3, accuracy = 0.240000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.254000

k = 4, accuracy = 0.259000
k = 4, accuracy = 0.270000
k = 4, accuracy = 0.269000
k = 4, accuracy = 0.294000
k = 4, accuracy = 0.272000

k = 5, accuracy = 0.248000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.280000
k = 5, accuracy = 0.292000
k = 5, accuracy = 0.280000

k = 6, accuracy = 0.253000
k = 6, accuracy = 0.277000
k = 6, accuracy = 0.274000
k = 6, accuracy = 0.273000
k = 6, accuracy = 0.282000

k = 7, accuracy = 0.261000
k = 7, accuracy = 0.279000
k = 7, accuracy = 0.268000
k = 7, accuracy = 0.288000
k = 7, accuracy = 0.276000

k = 8, accuracy = 0.262000
k = 8, accuracy = 0.282000
k = 8, accuracy = 0.273000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.273000

k = 9, accuracy = 0.259000
k = 9, accuracy = 0.283000
k = 9, accuracy = 0.270000
k = 9, accuracy = 0.285000
k = 9, accuracy = 0.285000

k = 10, accuracy = 0.265000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.284000
k = 10, accuracy = 0.280000

k = 11, accuracy = 0.265000
k = 11, accuracy = 0.296000
k = 11, accuracy = 0.277000
k = 11, accuracy = 0.279000
k = 11, accuracy = 0.270000

k = 12, accuracy = 0.260000
k = 12, accuracy = 0.295000
k = 12, accuracy = 0.279000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.280000

k = 13, accuracy = 0.262000
k = 13, accuracy = 0.291000
k = 13, accuracy = 0.273000
k = 13, accuracy = 0.274000
k = 13, accuracy = 0.273000

k = 14, accuracy = 0.263000
k = 14, accuracy = 0.289000
k = 14, accuracy = 0.287000
k = 14, accuracy = 0.292000
k = 14, accuracy = 0.279000

k = 15, accuracy = 0.252000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.278000
k = 15, accuracy = 0.282000
k = 15, accuracy = 0.274000

k = 16, accuracy = 0.265000
k = 16, accuracy = 0.276000
k = 16, accuracy = 0.290000
k = 16, accuracy = 0.281000
k = 16, accuracy = 0.266000

k = 17, accuracy = 0.264000
k = 17, accuracy = 0.276000
k = 17, accuracy = 0.286000
k = 17, accuracy = 0.281000
k = 17, accuracy = 0.269000

k = 18, accuracy = 0.266000
k = 18, accuracy = 0.275000
k = 18, accuracy = 0.281000
k = 18, accuracy = 0.284000
k = 18, accuracy = 0.282000

k = 19, accuracy = 0.269000
k = 19, accuracy = 0.283000
k = 19, accuracy = 0.280000
k = 19, accuracy = 0.278000
k = 19, accuracy = 0.279000

k = 20, accuracy = 0.270000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.285000


In [8]:
# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')

plt.show()



In [11]:
# Based on the cross-validation results above, choose the best value for k,
# retrain the classifier using all the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.
best_k = 10

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))


Got 131 / 500 correct => accuracy: 0.262000
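Rather than hard-coding best_k, it can also be read off the cross-validation results directly; a small sketch, assuming k_to_accuracies and k_choices from the cells above:

# pick the k with the highest mean cross-validation accuracy
best_k = max(k_choices, key=lambda k: np.mean(k_to_accuracies[k]))
print('best_k = %d' % best_k)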

In [ ]: