Introduction to Classification. The $k$ nearest neighbors method

Notebook version: 1.0 (Oct 11, 2015)

Author: Jesús Cid Sueiro (jcid@tsc.uc3m.es)

Changes: v.1.0 - First version

In [1]:
# To visualize plots in the notebook
%matplotlib inline 

# Import some libraries that will be necessary for working with data and displaying plots
import csv     # To read csv files
import random
import matplotlib.pyplot as plt
import numpy as np
from scipy import spatial
from sklearn import neighbors, datasets

1. Introduction to Classification.

1.1. Introduction to the classification problem

In a generic classification problem, we are given an observation vector ${\bf x}\in \mathbb{R}^N$ which is known to belong to one and only one category or class, $y$, in the set ${\mathcal Y} = \{0, 1, \ldots, M-1\}$. The goal of a classifier system is to predict the value of $y$ based on ${\bf x}$.

To design the classifier, we are given a collection of labelled observations ${\mathcal S} = \{({\bf x}^{(k)}, y^{(k)}), k=1,\ldots,K\}$ where, for each observation ${\bf x}^{(k)}$, the value of its true category, $y^{(k)}$, is known.

We will focus on binary classification problems, where the label set is binary, ${\mathcal Y} = \{0, 1\}$. Despite its simplicity, this is the most frequent case: many multi-class classification problems are usually solved by decomposing them into a collection of binary problems.

Classification algorithms, like many other machine learning algorithms, rely on two major underlying hypotheses:

  • All samples in dataset ${\mathcal S}$ are i.i.d. (independent and identically distributed), i.e., all samples are independent outcomes of an unknown distribution $p({\bf x}, y)$.
  • The tuple formed by the input sample and its unknown class, $({\bf x}, y)$, is an independent outcome of the same distribution.

These two assumptions are essential to have some guarantees that a classifier designed from ${\mathcal S}$ performs well when applied to new input samples. Note that, although we assume that an underlying distribution exists, it is unknown: otherwise, we could ignore ${\mathcal S}$ and apply classical decision theory to find the optimal predictor based on $p({\bf x}, y)$.
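
For reference, if $p({\bf x}, y)$ were known, classical decision theory tells us that the classifier minimizing the probability of error would be the maximum a posteriori (MAP) rule
$$
\hat{y} = \arg\max_{y \in {\mathcal Y}} P(y \mid {\bf x}) = \arg\max_{y \in {\mathcal Y}} p({\bf x}, y).
$$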

1.2. The Iris dataset

(This dataset presentation is based on this tutorial by Jason Brownlee.)

As an illustration, consider the Iris dataset, taken from the UCI Machine Learning Repository. Quoting from the dataset description:

This is perhaps the best known database to be found in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. [...] One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

The class is the species, which is one of setosa, versicolor or virginica. Each instance contains four measurements of a given flower: sepal length, sepal width, petal length and petal width, all in centimeters.


In [2]:
# Taken from Jason Brownlee notebook.
with open('iris.data', 'rb') as csvfile:
	lines = csv.reader(csvfile)
	for row in lines:
		print ', '.join(row)


5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
4.6, 3.1, 1.5, 0.2, Iris-setosa
5.0, 3.6, 1.4, 0.2, Iris-setosa
5.4, 3.9, 1.7, 0.4, Iris-setosa
4.6, 3.4, 1.4, 0.3, Iris-setosa
5.0, 3.4, 1.5, 0.2, Iris-setosa
4.4, 2.9, 1.4, 0.2, Iris-setosa
4.9, 3.1, 1.5, 0.1, Iris-setosa
5.4, 3.7, 1.5, 0.2, Iris-setosa
4.8, 3.4, 1.6, 0.2, Iris-setosa
4.8, 3.0, 1.4, 0.1, Iris-setosa
4.3, 3.0, 1.1, 0.1, Iris-setosa
5.8, 4.0, 1.2, 0.2, Iris-setosa
5.7, 4.4, 1.5, 0.4, Iris-setosa
5.4, 3.9, 1.3, 0.4, Iris-setosa
5.1, 3.5, 1.4, 0.3, Iris-setosa
5.7, 3.8, 1.7, 0.3, Iris-setosa
5.1, 3.8, 1.5, 0.3, Iris-setosa
5.4, 3.4, 1.7, 0.2, Iris-setosa
5.1, 3.7, 1.5, 0.4, Iris-setosa
4.6, 3.6, 1.0, 0.2, Iris-setosa
5.1, 3.3, 1.7, 0.5, Iris-setosa
4.8, 3.4, 1.9, 0.2, Iris-setosa
5.0, 3.0, 1.6, 0.2, Iris-setosa
5.0, 3.4, 1.6, 0.4, Iris-setosa
5.2, 3.5, 1.5, 0.2, Iris-setosa
5.2, 3.4, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.6, 0.2, Iris-setosa
4.8, 3.1, 1.6, 0.2, Iris-setosa
5.4, 3.4, 1.5, 0.4, Iris-setosa
5.2, 4.1, 1.5, 0.1, Iris-setosa
5.5, 4.2, 1.4, 0.2, Iris-setosa
4.9, 3.1, 1.5, 0.1, Iris-setosa
5.0, 3.2, 1.2, 0.2, Iris-setosa
5.5, 3.5, 1.3, 0.2, Iris-setosa
4.9, 3.1, 1.5, 0.1, Iris-setosa
4.4, 3.0, 1.3, 0.2, Iris-setosa
5.1, 3.4, 1.5, 0.2, Iris-setosa
5.0, 3.5, 1.3, 0.3, Iris-setosa
4.5, 2.3, 1.3, 0.3, Iris-setosa
4.4, 3.2, 1.3, 0.2, Iris-setosa
5.0, 3.5, 1.6, 0.6, Iris-setosa
5.1, 3.8, 1.9, 0.4, Iris-setosa
4.8, 3.0, 1.4, 0.3, Iris-setosa
5.1, 3.8, 1.6, 0.2, Iris-setosa
4.6, 3.2, 1.4, 0.2, Iris-setosa
5.3, 3.7, 1.5, 0.2, Iris-setosa
5.0, 3.3, 1.4, 0.2, Iris-setosa
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.9, 3.1, 4.9, 1.5, Iris-versicolor
5.5, 2.3, 4.0, 1.3, Iris-versicolor
6.5, 2.8, 4.6, 1.5, Iris-versicolor
5.7, 2.8, 4.5, 1.3, Iris-versicolor
6.3, 3.3, 4.7, 1.6, Iris-versicolor
4.9, 2.4, 3.3, 1.0, Iris-versicolor
6.6, 2.9, 4.6, 1.3, Iris-versicolor
5.2, 2.7, 3.9, 1.4, Iris-versicolor
5.0, 2.0, 3.5, 1.0, Iris-versicolor
5.9, 3.0, 4.2, 1.5, Iris-versicolor
6.0, 2.2, 4.0, 1.0, Iris-versicolor
6.1, 2.9, 4.7, 1.4, Iris-versicolor
5.6, 2.9, 3.6, 1.3, Iris-versicolor
6.7, 3.1, 4.4, 1.4, Iris-versicolor
5.6, 3.0, 4.5, 1.5, Iris-versicolor
5.8, 2.7, 4.1, 1.0, Iris-versicolor
6.2, 2.2, 4.5, 1.5, Iris-versicolor
5.6, 2.5, 3.9, 1.1, Iris-versicolor
5.9, 3.2, 4.8, 1.8, Iris-versicolor
6.1, 2.8, 4.0, 1.3, Iris-versicolor
6.3, 2.5, 4.9, 1.5, Iris-versicolor
6.1, 2.8, 4.7, 1.2, Iris-versicolor
6.4, 2.9, 4.3, 1.3, Iris-versicolor
6.6, 3.0, 4.4, 1.4, Iris-versicolor
6.8, 2.8, 4.8, 1.4, Iris-versicolor
6.7, 3.0, 5.0, 1.7, Iris-versicolor
6.0, 2.9, 4.5, 1.5, Iris-versicolor
5.7, 2.6, 3.5, 1.0, Iris-versicolor
5.5, 2.4, 3.8, 1.1, Iris-versicolor
5.5, 2.4, 3.7, 1.0, Iris-versicolor
5.8, 2.7, 3.9, 1.2, Iris-versicolor
6.0, 2.7, 5.1, 1.6, Iris-versicolor
5.4, 3.0, 4.5, 1.5, Iris-versicolor
6.0, 3.4, 4.5, 1.6, Iris-versicolor
6.7, 3.1, 4.7, 1.5, Iris-versicolor
6.3, 2.3, 4.4, 1.3, Iris-versicolor
5.6, 3.0, 4.1, 1.3, Iris-versicolor
5.5, 2.5, 4.0, 1.3, Iris-versicolor
5.5, 2.6, 4.4, 1.2, Iris-versicolor
6.1, 3.0, 4.6, 1.4, Iris-versicolor
5.8, 2.6, 4.0, 1.2, Iris-versicolor
5.0, 2.3, 3.3, 1.0, Iris-versicolor
5.6, 2.7, 4.2, 1.3, Iris-versicolor
5.7, 3.0, 4.2, 1.2, Iris-versicolor
5.7, 2.9, 4.2, 1.3, Iris-versicolor
6.2, 2.9, 4.3, 1.3, Iris-versicolor
5.1, 2.5, 3.0, 1.1, Iris-versicolor
5.7, 2.8, 4.1, 1.3, Iris-versicolor
6.3, 3.3, 6.0, 2.5, Iris-virginica
5.8, 2.7, 5.1, 1.9, Iris-virginica
7.1, 3.0, 5.9, 2.1, Iris-virginica
6.3, 2.9, 5.6, 1.8, Iris-virginica
6.5, 3.0, 5.8, 2.2, Iris-virginica
7.6, 3.0, 6.6, 2.1, Iris-virginica
4.9, 2.5, 4.5, 1.7, Iris-virginica
7.3, 2.9, 6.3, 1.8, Iris-virginica
6.7, 2.5, 5.8, 1.8, Iris-virginica
7.2, 3.6, 6.1, 2.5, Iris-virginica
6.5, 3.2, 5.1, 2.0, Iris-virginica
6.4, 2.7, 5.3, 1.9, Iris-virginica
6.8, 3.0, 5.5, 2.1, Iris-virginica
5.7, 2.5, 5.0, 2.0, Iris-virginica
5.8, 2.8, 5.1, 2.4, Iris-virginica
6.4, 3.2, 5.3, 2.3, Iris-virginica
6.5, 3.0, 5.5, 1.8, Iris-virginica
7.7, 3.8, 6.7, 2.2, Iris-virginica
7.7, 2.6, 6.9, 2.3, Iris-virginica
6.0, 2.2, 5.0, 1.5, Iris-virginica
6.9, 3.2, 5.7, 2.3, Iris-virginica
5.6, 2.8, 4.9, 2.0, Iris-virginica
7.7, 2.8, 6.7, 2.0, Iris-virginica
6.3, 2.7, 4.9, 1.8, Iris-virginica
6.7, 3.3, 5.7, 2.1, Iris-virginica
7.2, 3.2, 6.0, 1.8, Iris-virginica
6.2, 2.8, 4.8, 1.8, Iris-virginica
6.1, 3.0, 4.9, 1.8, Iris-virginica
6.4, 2.8, 5.6, 2.1, Iris-virginica
7.2, 3.0, 5.8, 1.6, Iris-virginica
7.4, 2.8, 6.1, 1.9, Iris-virginica
7.9, 3.8, 6.4, 2.0, Iris-virginica
6.4, 2.8, 5.6, 2.2, Iris-virginica
6.3, 2.8, 5.1, 1.5, Iris-virginica
6.1, 2.6, 5.6, 1.4, Iris-virginica
7.7, 3.0, 6.1, 2.3, Iris-virginica
6.3, 3.4, 5.6, 2.4, Iris-virginica
6.4, 3.1, 5.5, 1.8, Iris-virginica
6.0, 3.0, 4.8, 1.8, Iris-virginica
6.9, 3.1, 5.4, 2.1, Iris-virginica
6.7, 3.1, 5.6, 2.4, Iris-virginica
6.9, 3.1, 5.1, 2.3, Iris-virginica
5.8, 2.7, 5.1, 1.9, Iris-virginica
6.8, 3.2, 5.9, 2.3, Iris-virginica
6.7, 3.3, 5.7, 2.5, Iris-virginica
6.7, 3.0, 5.2, 2.3, Iris-virginica
6.3, 2.5, 5.0, 1.9, Iris-virginica
6.5, 3.0, 5.2, 2.0, Iris-virginica
6.2, 3.4, 5.4, 2.3, Iris-virginica
5.9, 3.0, 5.1, 1.8, Iris-virginica

We need to split the data into a training dataset that kNN can use to make predictions and a test dataset that we can use to evaluate the accuracy of the model.

We first need to convert the flower measures that were loaded as strings into numbers that we can work with. Next, we need to split the dataset randomly into training and test sets. A 67/33 train/test split is a commonly used ratio.

Pulling it all together, we can define a function called loadDataset that loads a CSV with the provided filename and splits it randomly into train and test datasets using the provided split ratio.


In [3]:
# Adapted from a notebook by Jason Brownlee
def loadDataset(filename, split):
    xTrain = []
    cTrain = []
    xTest = []
    cTest = []

    with open(filename, 'rb') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
    # The last row of the csv file is empty, so the loop stops at len(dataset)-1
    for i in range(len(dataset)-1):
        for y in range(4):
            dataset[i][y] = float(dataset[i][y])
        item = dataset[i]
        if random.random() < split:
            xTrain.append(item[0:4])
            cTrain.append(item[4])
        else:
            xTest.append(item[0:4])
            cTest.append(item[4])
    return xTrain, cTrain, xTest, cTest

We can use this function to get a data split.


In [4]:
xTrain_all, cTrain_all, xTest_all, cTest_all = loadDataset('iris.data', 0.66)
nTrain_all = len(xTrain_all)
nTest_all = len(xTest_all)
print 'Train: ' + str(nTrain_all)
print 'Test: ' + str(nTest_all)


Train: 97
Test: 53

To get some intuition about this four-dimensional dataset, we can plot two-dimensional projections, taking only two variables at a time.


In [5]:
i = 0 # Try 0,1,2,3
j = 1 # Try 0,1,2,3 with j!=i

# Take coordinates for each class separately
xiSe = [xTrain_all[n][i] for n in range(nTrain_all) if cTrain_all[n]=='Iris-setosa']
xjSe = [xTrain_all[n][j] for n in range(nTrain_all) if cTrain_all[n]=='Iris-setosa']
xiVe = [xTrain_all[n][i] for n in range(nTrain_all) if cTrain_all[n]=='Iris-versicolor']
xjVe = [xTrain_all[n][j] for n in range(nTrain_all) if cTrain_all[n]=='Iris-versicolor']
xiVi = [xTrain_all[n][i] for n in range(nTrain_all) if cTrain_all[n]=='Iris-virginica']
xjVi = [xTrain_all[n][j] for n in range(nTrain_all) if cTrain_all[n]=='Iris-virginica']

plt.plot(xiSe, xjSe,'bx', label='Setosa')
plt.plot(xiVe, xjVe,'r.', label='Versicolor')
plt.plot(xiVi, xjVi,'g+', label='Virginica')
plt.xlabel('$x_' + str(i) + '$')
plt.ylabel('$x_' + str(j) + '$')
plt.legend(loc='best')


Out[5]:
<matplotlib.legend.Legend at 0x1078116d0>

In the following, we will design a classifier to separate classes "Versicolor" and "Virginica" using $x_0$ and $x_1$ only. To do so, we build a training set with the samples from these two categories, assigning the binary label $y^{(k)} = 1$ to samples in class "Virginica" and $y^{(k)} = 0$ to "Versicolor" samples.


In [6]:
# Select two classes
c0 = 'Iris-versicolor' 
c1 = 'Iris-virginica'

# Select two coordinates
ind = [0, 1]

# Take training set
X_tr = np.array([[xTrain_all[n][i] for i in ind] for n in range(nTrain_all) 
                  if cTrain_all[n]==c0 or cTrain_all[n]==c1])
C_tr = [cTrain_all[n] for n in range(nTrain_all) 
          if cTrain_all[n]==c0 or cTrain_all[n]==c1]
Y_tr = np.array([int(c==c1) for c in C_tr])
n_tr = len(X_tr)

# Take test set
X_tst = np.array([[xTest_all[n][i] for i in ind] for n in range(nTest_all) 
                 if cTest_all[n]==c0 or cTest_all[n]==c1])
C_tst = [cTest_all[n] for n in range(nTest_all) 
         if cTest_all[n]==c0 or cTest_all[n]==c1]
Y_tst = np.array([int(c==c1) for c in C_tst])
n_tst = len(X_tst)

# Separate components of x into different arrays (just for the plots)
x0c0 = [X_tr[n][0] for n in range(n_tr) if Y_tr[n]==0]
x1c0 = [X_tr[n][1] for n in range(n_tr) if Y_tr[n]==0]
x0c1 = [X_tr[n][0] for n in range(n_tr) if Y_tr[n]==1]
x1c1 = [X_tr[n][1] for n in range(n_tr) if Y_tr[n]==1]

# Scatterplot.
labels = {'Iris-setosa': 'Setosa', 
          'Iris-versicolor': 'Versicolor',
          'Iris-virginica': 'Virginica'}
plt.plot(x0c0, x1c0,'r.', label=labels[c0])
plt.plot(x0c1, x1c1,'g+', label=labels[c1])
plt.xlabel('$x_' + str(ind[0]) + '$')
plt.ylabel('$x_' + str(ind[1]) + '$')
plt.legend(loc='best')


Out[6]:
<matplotlib.legend.Legend at 0x107d62d50>

1.3. Baseline Classifier: Maximum A Priori.

For the selected dataset, we have two classes with the following proportions:


In [7]:
print 'Class 0 (' + c0 + '): ' + str(n_tr - sum(Y_tr)) + ' samples'
print 'Class 1 (' + c1 + '): ' + str(sum(Y_tr)) + ' samples'


Class 0 (Iris-versicolor): 34 samples
Class 1 (Iris-virginica): 30 samples

The maximum a priori classifier assigns any sample ${\bf x}$ to the most frequent class in the training set. Therefore, the output $y$ for any sample ${\bf x}$ is


In [8]:
y = int(2*sum(Y_tr) > n_tr)
print 'y = ' + str(y) + ' (' + (c1 if y==1 else c0) + ')'


y = 0 (Iris-versicolor)

The error rate for this baseline classifier is:


In [9]:
# Training and test error arrays
E_tr = (Y_tr != y)
E_tst = (Y_tst != y)

# Error rates
pe_tr = float(sum(E_tr)) / n_tr
pe_tst = float(sum(E_tst)) / n_tst
print 'Pe(train) = ' + str(pe_tr)
print 'Pe(test) = ' + str(pe_tst)


Pe(train) = 0.46875
Pe(test) = 0.555555555556

The error rate of the baseline classifier is a simple benchmark for classification. Since the maximum a priori decision is independent of the observation, ${\bf x}$, any classifier based on ${\bf x}$ should perform better (or, at least, not worse) than the baseline classifier.

2. Non-parametric classification.

2.1. The Nearest-Neighbour Classifier (1-NN).

The 1-NN classifier assigns any instance ${\bf x}$ to the category of its nearest neighbour in the training set: $$ \hat{y} = f({\bf x}) = y^{(n)}, \qquad n = \arg \min_k \|{\bf x}-{\bf x}^{(k)}\| $$ In case of ties (i.e., if there is more than one instance at minimum distance), the class of one of them, taken arbitrarily, is assigned to ${\bf x}$.


In [10]:
def nn_classifier(X1,Y1,X2):
    """ Compute the 1-NN classification for the observations contained in
        the rows of X2, for the training set given by the rows in X1 and the
        components of Y1.
    """
    if X1.ndim == 1:
        X1 = np.asmatrix(X1).T
    if X2.ndim == 1:
        X2 = np.asmatrix(X2).T
    distances = spatial.distance.cdist(X1,X2,'euclidean')
    neighbors = np.argsort(distances, axis=0, kind='quicksort', order=None)
    closest = neighbors[0,:]
    y_values = np.zeros([X2.shape[0],1])
    for idx in range(X2.shape[0]):
        y_values[idx] = Y1[closest[idx]]
        
    return y_values

Let us apply the 1-NN classifier to the given dataset. First, we will show the decision regions of the classifier. To do so, we compute the classifier output for all points in a rectangular grid covering the sample space.


In [11]:
# Create a rectangular grid.
x_min, x_max = X_tr[:, 0].min(), X_tr[:, 0].max() 
y_min, y_max = X_tr[:, 1].min(), X_tr[:, 1].max()
dx = x_max - x_min
dy = y_max - y_min
h = dy / 200
xx, yy = np.meshgrid(np.arange(x_min - 0.1 * dx, x_max + 0.1 * dx, h),
                     np.arange(y_min - 0.1 * dy, y_max + 0.1 * dy, h))
X_grid = np.array([xx.ravel(), yy.ravel()]).T

# Compute the classifier output for all samples in the grid.
Z = nn_classifier(X_tr, Y_tr, X_grid)

# Put the result into a color plot
plt.plot(x0c0, x1c0,'r.', label=labels[c0])
plt.plot(x0c1, x1c1,'g+', label=labels[c1])
plt.xlabel('$x_' + str(ind[0]) + '$')
plt.ylabel('$x_' + str(ind[1]) + '$')
plt.legend(loc='best')

Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)


Out[11]:
<matplotlib.contour.QuadContourSet instance at 0x10ec3d7a0>

We can observe that the decision boundary of the 1-NN classifier is rather intricate, and it may contain small islands covering one or a few samples from one class. Actually, the extent of these small regions usually shrinks as more training samples become available, though their number may increase.

Now we compute the error rates over the training and test sets.


In [12]:
# Training errors
Z_tr = nn_classifier(X_tr, Y_tr, X_tr)
E_tr = Z_tr.flatten()!=Y_tr

# Test errors
Z_tst = nn_classifier(X_tr, Y_tr, X_tst)
E_tst = Z_tst.flatten()!=Y_tst

# Error rates
pe_tr = float(sum(E_tr)) / n_tr
pe_tst = float(sum(E_tst)) / n_tst
print 'Pe(train) = ' + str(pe_tr)
print 'Pe(test) = ' + str(pe_tst)


Pe(train) = 0.09375
Pe(test) = 0.333333333333

The training and test error rates of the 1-NN classifier may be significantly different. In fact, the training error may go down to zero if samples do not overlap. In the Iris dataset this is not the case: some samples from different classes coincide at the same point, causing some training errors.

Although the 1-NN classifier usually reduces the error rate with respect to the baseline classifier, the number of errors may still be too large. Errors may be attributed to different causes:

  1. The class distributions overlap, because the selected features do not carry complete information for discriminating between the classes: in that case, even the best possible classifier would make errors.
  2. The training sample is too small to obtain a good estimate of the optimal classifier.
  3. The classifier has intrinsic limitations: even with an infinite number of samples, its performance would not approach that of the optimal classifier.

In general, a classifier is said to be consistent if it makes nearly optimal decisions as the number of training samples increases. It can be shown that this is the case for the 1-NN classifier if the classification problem is separable, i.e., if there exists a decision boundary with zero error probability. Unfortunately, in the non-separable case the 1-NN classifier is not consistent: its error rate can converge to up to twice the minimum attainable error rate as the number of training samples goes to infinity.
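
More precisely, for the binary case the classical result by Cover and Hart states that, if $P^*$ denotes the minimum attainable (Bayes) error probability, the asymptotic error rate of the 1-NN classifier satisfies $$ P^* \leq P_{1{\rm -NN}} \leq 2 P^* (1 - P^*) \leq 2 P^*. $$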

Exercise 1: In this exercise we test the inconsistency of the 1-NN classifier for overlapping class distributions. Generate an artificial dataset for classification as follows (a data-generation sketch is provided after the exercise statement):

  • Generate $N$ binary labels at random with values '0' and '1'. Store them in vector ${\bf y}$
  • For every label $y_k$ in ${\bf y}$:
    • If the label is 0, take sample $x_k$ at random from a uniform distribution $U(0,2)$.
    • If the label is 1, take sample $x_k$ at random from a uniform distribution $U(1,5)$.

Take $N=1000$ for the test set. This is a large sample to get accurate error rate estimates. Also, take $N=10$, $20$, $40$, $80$,... for the training set. Compute the 1-NN classifier, and observe the test error rate as a function of $N$.

Now, compute the test error rate of the classifier making decision $1$ if $x_k>1.5$, and $0$ otherwise.
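
A minimal data-generation sketch for this exercise, assuming equiprobable labels (the exercise does not fix the label prior) and using a helper generate_data defined here only for illustration:

# Sketch for Exercise 1: generate the artificial dataset described above.
def generate_data(N):
    # Binary labels drawn with equal probability (an assumption of this sketch)
    y = np.array([random.randint(0, 1) for n in range(N)])
    # x ~ U(0,2) for class 0 and x ~ U(1,5) for class 1
    x = np.array([random.uniform(0, 2) if yn == 0 else random.uniform(1, 5) for yn in y])
    return x, y

x_tst1, y_tst1 = generate_data(1000)   # large test set for accurate error estimates
for N in [10, 20, 40, 80, 160]:
    x_tr1, y_tr1 = generate_data(N)
    z_tst1 = nn_classifier(x_tr1, y_tr1, x_tst1).flatten()
    print 'N = ' + str(N) + ', Pe(test) = ' + str(np.mean(z_tst1 != y_tst1))

Comparing these error rates with that of the fixed threshold rule of the last paragraph completes the exercise.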

2.2. $k$-NN classifier

A simple extension of the 1-NN classifier is the $k$-NN classifier, which, for any input sample ${\bf x}$, computes the $k$ closest neighbors in the training set, and takes the majority class in the subset. To avoid ties, $k$ is usually taken as an odd number.


In [13]:
def knn_classifier(X1,Y1,X2,k):
    """ Compute the k-NN classification for the observations contained in
        the rows of X2, for the training set given by the rows in X1 and the
        components of Y1. k is the number of neighbours.
    """
    if X1.ndim == 1:
        X1 = np.asmatrix(X1).T
    if X2.ndim == 1:
        X2 = np.asmatrix(X2).T
    distances = spatial.distance.cdist(X1,X2,'euclidean')
    neighbors = np.argsort(distances, axis=0, kind='quicksort', order=None)
    closest = neighbors[range(k),:]
    
    y_values = np.zeros([X2.shape[0],1])
    for idx in range(X2.shape[0]):
        # For binary labels and odd k, the median of the neighbour labels
        # coincides with the majority vote.
        y_values[idx] = np.median(Y1[closest[:,idx]])
        
    return y_values

In [14]:
k = 5

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
Z = knn_classifier(X_tr, Y_tr, X_grid, k)

# Put the result into a color plot
plt.plot(x0c0, x1c0,'r.', label=labels[c0])
plt.plot(x0c1, x1c1,'g+', label=labels[c1])
plt.xlabel('$x_' + str(ind[0]) + '$')
plt.ylabel('$x_' + str(ind[1]) + '$')
plt.legend(loc='best')

Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)


Out[14]:
<matplotlib.contour.QuadContourSet instance at 0x1078df488>

We can analyze the influence of parameter $k$ by observing both training and test errors.


In [15]:
# Plot training and test error as a function of parameter k.
pe_tr = []
pe_tst = []
k_list = [2*n+1 for n in range(n_tr/2)]

for k in k_list:

    # Training errors
    Z_tr = knn_classifier(X_tr, Y_tr, X_tr, k)
    E_tr = Z_tr.flatten()!=Y_tr

    # Test errors
    Z_tst = knn_classifier(X_tr, Y_tr, X_tst, k)
    E_tst = Z_tst.flatten()!=Y_tst

    # Error rates
    pe_tr.append(float(sum(E_tr)) / n_tr)
    pe_tst.append(float(sum(E_tst)) / n_tst)

# Stem plots of the training and test error rates
markerline, stemlines, baseline = plt.stem(k_list, pe_tr,'r', markerfmt='s', label='Training')
plt.plot(k_list, pe_tr,'r:')
plt.setp(markerline, 'markerfacecolor', 'r', )
plt.setp(baseline, 'color','r', 'linewidth', 2)
markerline, stemlines, baseline = plt.stem(k_list, pe_tst, label='Test')
plt.plot(k_list, pe_tst,'b:')
plt.xlabel('$k$')
plt.ylabel('Error rate')
plt.legend(loc='best')


Out[15]:
<matplotlib.legend.Legend at 0x10f5fcb50>

Exercise 2: Observe the training and test errors for large $k$. Could you relate the error rate of the baseline classifier to that of the $k$-NN classifier for some value of $k$?

The figure above suggests that the optimal value of $k$ is


In [16]:
i = np.argmin(pe_tst)
k_opt = k_list[i]
print 'k_opt = ' + str(k_opt)


k_opt = 33

However, using the test set to select the optimal value of the hyperparameter $k$ is not advisable, because then no samples are left to evaluate the performance of the final selection. A more appropriate procedure is cross-validation.

2.3. Hyperparameter selection via cross-validation

A drawback of the $k$-NN method is that the selection of $k$ influences the final error of the algorithm. In the previous experiments, we noticed that the location of the minimum is not necessarily the same for the training and test data. Ideally, we would like the designed classification model to work as well as possible on future unlabeled patterns that are not available during the training phase. This property is known as generalization. Fitting the training data is only pursued in the hope that we are also indirectly obtaining a model that generalizes well. In order to achieve this goal, there are strategies that try to guarantee a correct generalization of the model. One such approach is known as cross-validation.

Since using the test labels during the training phase is not allowed (they should be kept aside to simulate the future application of the model on unseen patterns), we need some way to estimate the hyperparameter using only training data. Cross-validation allows us to do so through the following steps:

  • Split the training data into several (generally non-overlapping) subsets. If we use $M$ subsets, the method is referred to as $M$-fold cross-validation. If we consider each pattern a different subset, the method is usually referred to as leave-one-out (LOO) cross-validation.
  • Carry out the training of the system $M$ times. For each run, use a different partition as a validation set, and use the remaining partitions as the training set. Evaluate the performance for different choices of the hyperparameter (i.e., for different values of $k$ for the $k$-NN method).
  • Average the validation error over all partitions, and pick the hyperparameter that provided the minimum validation error.
  • Rerun the algorithm using all the training data, keeping the value of the hyperparameter that came out of the cross-validation process.

In [17]:
## This fragment of code runs k-nn with M-fold cross validation

# Obtain the indices for the different folds
n_tr = X_tr.shape[0]
M = n_tr    # With M = n_tr, this is leave-one-out cross-validation
permutation = np.random.permutation(n_tr)

# Initialize sets of indices
set_indices = {n: [] for n in range(M)}

# Distribute data among M partitions
n = 0
for pos in range(n_tr):
    set_indices[n].append(permutation[pos])
    n = (n+1) % M

# Now, we run the cross-validation process using the k-nn method
k_max = 40
k_list = [2*j+1 for j in range(k_max/2)]

# Obtain the validation errors
pe_val = 0
for n in range(M):
    i_val = set_indices[n]
    i_tr = []
    for kk in range(M):
        if not n==kk:
            i_tr += set_indices[kk]
    
    pe_val_iter = []
    for k in k_list:
        y_tr_iter = knn_classifier(X_tr[i_tr], Y_tr[i_tr], X_tr[i_val], k)
        pe_val_iter.append(np.mean(Y_tr[i_val] != y_tr_iter))

    pe_val = pe_val + np.asarray(pe_val_iter).T

pe_val = pe_val / M

# We compute now the train and test errors curves
pe_tr = [np.mean(Y_tr != knn_classifier(X_tr, Y_tr, X_tr, k).T) for k in k_list]

k_opt = k_list[np.argmin(pe_val)]
pe_tst = np.mean(Y_tst != knn_classifier(X_tr, Y_tr, X_tst, k_opt).T)

plt.plot(k_list, pe_tr,'b--o',label='Training error')
plt.plot(k_list, pe_val.T,'g--o',label='Validation error')
plt.stem([k_opt], [pe_tst],'r-o',label='Test error')
plt.legend(loc='best')
plt.title('The optimal value of $k$ is ' + str(k_opt))
plt.xlabel('$k$')
plt.ylabel('Error rate')


Out[17]:
<matplotlib.text.Text at 0x10f6a2410>

2.4. Scikit-learn implementation

In practice, most well-known machine learning methods have implementations available for Python. Probably the most complete library of machine learning tools is Scikit-learn. The following piece of code uses the method

KNeighborsClassifier

available in Scikit-learn to compute the $k$-NN classifier using the four components of the observations in the original dataset. This routine allows us to classify a particular point using a weighted vote of its neighbours' labels:

To classify point ${\bf x}$:

  • Find $k$ closest points to ${\bf x}$ in the training set
  • Take a weighted vote of the corresponding labels, weighting each vote according to the distance of each neighbour to ${\bf x}$, so that closer points have a larger influence on the decision (see the formula below).
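
With the 'distance' option, each neighbour's vote is weighted by the inverse of its distance to ${\bf x}$ (this is the weighting scheme used by scikit-learn), so the decision becomes
$$
\hat{y}({\bf x}) = \arg\max_{c \in {\mathcal Y}} \sum_{n \in {\mathcal N}({\bf x}):\, y^{(n)} = c} \frac{1}{\|{\bf x}-{\bf x}^{(n)}\|}
$$
while the 'uniform' option reduces to the plain majority vote of Section 2.2.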

In [18]:
k = 5

# import some data to play with
iris = datasets.load_iris()

# Take training set
X_tr = np.array([xTrain_all[n] for n in range(nTrain_all) 
                 if cTrain_all[n]==c0 or cTrain_all[n]==c1])
C_tr = [cTrain_all[n] for n in range(nTrain_all) 
          if cTrain_all[n]==c0 or cTrain_all[n]==c1]
Y_tr = np.array([int(c==c1) for c in C_tr])
n_tr = len(X_tr)

# Take test set
X_tst = np.array([xTest_all[n] for n in range(nTest_all) 
                 if cTest_all[n]==c0 or cTest_all[n]==c1])
C_tst = [cTest_all[n] for n in range(nTest_all) 
         if cTest_all[n]==c0 or cTest_all[n]==c1]
Y_tst = np.array([int(c==c1) for c in C_tst])
n_tst = len(X_tst)

for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(k, weights=weights)
    clf.fit(X_tr, Y_tr)
    Z = clf.predict(X_tst)
    pe_tst = np.mean(Y_tst != Z)
    print 'Test error rate with ' + weights + ' weights = ' + str(pe_tst)


Test error rate with uniform weights = 0.0555555555556
Test error rate with distance weights = 0.0555555555556

Here you can find an example of the application of KNeighborsClassifier to the complete 3-class Iris flower classification problem.

2.5. $k$-NN Classification and Probability Estimation.

If a sample ${\bf x}$ has $m$ neighbours from class $1$ and $k-m$ neighbours from class $0$, we can estimate the posterior probability that the observation ${\bf x}$ belongs to class $1$ as $$ \hat p = \frac{m}{k} $$ Therefore, besides computing a decision about the class of the data, we can modify the $k$-NN algorithm to obtain posterior probability estimates.

Note that the above equation is equivalent to $$ \hat p = \frac{1}{k} \sum_{n \in {\mathcal N}({\bf x})} y^{(n)} $$ That is, $\hat p$ is the average of the neighbour labels. Averages can be computed using KNeighborsRegressor, which is useful for regression applications, as sketched below.
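
As a sketch (not part of the original notebook), these probability estimates can be obtained directly with KNeighborsRegressor, whose prediction is the average, optionally distance-weighted, of the neighbours' targets; for the two-class training and test sets built above:

# Sketch: k-NN posterior probability estimates via KNeighborsRegressor.
k = 15
knn_reg = neighbors.KNeighborsRegressor(n_neighbors=k, weights='uniform')
knn_reg.fit(X_tr, Y_tr)
p_hat = knn_reg.predict(X_tst)     # estimates of P(y=1|x) for the test samples
Z = (p_hat > 0.5).astype(int)      # thresholding at 0.5 recovers the k-NN decision
print 'Test error rate (k = ' + str(k) + ') = ' + str(np.mean(Y_tst != Z))

For Exercise 3, the same estimator, fitted on the two-feature training set used in Section 2.1, can be evaluated over a grid such as X_grid and displayed with plt.contourf.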

Exercise 3: Plot a $k$-NN posterior probability map for the Iris flower data, for $k=15$.