Nearest Neighbor Tutorial

K-nearest neighbors, or K-NN, is a simple form of supervised learning. It assigns a label to a new input x based on the labels of the K training points closest to it, where K is the number of neighbors to consider. For K=1, x is assigned the label of its single closest neighbor. For K>1, x is assigned the majority label among its K nearest neighbors.
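
To make the decision rule concrete, here is a minimal sketch of K-NN in plain Python with NumPy. The toy data and the knn_predict function are our own illustration, not part of the scikit-learn example used below:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Labels of the k closest training points
    nearest_labels = y_train[np.argsort(dists)[:k]]
    # Majority vote among those k labels
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two points of class 0, two points of class 1
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.9, 0.9]), k=3))  # prints 1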

The code in this tutorial is slightly modified from the scikit-learn K-NN classification example. The scikit-learn documentation for the KNeighborsClassifier class has more information on the classifier's options.

Setup

Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), pylab (for plotting figures), ListedColormap and Rectangle (for plot colors and legend patches), neighbors (for the scikit-learn nearest-neighbor algorithm), and datasets (to load the iris dataset from scikit-learn).


In [1]:
# Print figures in the notebook
%matplotlib inline 

import numpy as np
import pylab as pl
from matplotlib.colors import ListedColormap
from matplotlib.patches import Rectangle # Rectangle patches for the legend
from sklearn import neighbors, datasets # Import the nearest-neighbor algorithm and datasets from scikit-learn


Import your data

Import the iris dataset through scikit-learn. Scikit-learn's documentation includes its own explanation of the dataset.

The dataset consists of measurements made on 50 examples from each of three different species of iris flowers (Setosa, Versicolour, and Virginica). Each example has four features (or measurements): sepal length, sepal width, petal length, and petal width. All measurements are in cm.
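
If you want to verify these properties yourself, the loaded dataset object exposes them directly. A quick sanity check (reusing the imports from the setup cell):

# Inspect the dataset's shape, feature names, and class balance
iris = datasets.load_iris()
print(iris.data.shape)           # (150, 4): 150 examples, 4 features each
print(iris.feature_names)        # sepal/petal length and width, in cm
print(np.bincount(iris.target))  # [50 50 50]: 50 examples per species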

Below, we import the first two features from the dataset (sepal length and width) and store them in X. Normally we would try to use all useful features, but sticking with two allows us to visualize the data more easily. The labels (which species of iris) are stored in y.

Then we plot the data, to get a look at what we're dealing with. The colormap is used to determine what colors are used for each class when plotting.


In [2]:
# Import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # We only take the first two features,
                      # sepal length and sepal width, and store them in X
y = iris.target       # Labels are stored in y as numbers
labelNames = ['Setosa', 'Versicolour', 'Virginica'] # Species names corresponding to labels 0, 1, and 2

# Plot the data

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# Get the minimum and maximum values with an additional 0.5 border
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

pl.figure(figsize=(8, 6))
pl.clf()

# Plot the training points
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
pl.xlabel('Sepal length (cm)')
pl.ylabel('Sepal width (cm)')

# Set the plot limits
pl.xlim(x_min, x_max)
pl.ylim(y_min, y_max)

# Create a legend for the colors, using rectangles for the corresponding colormap colors
r = Rectangle((0, 0), 1, 1, fc='#FF0000')
g = Rectangle((0, 0), 1, 1, fc='#00FF00')
b = Rectangle((0, 0), 1, 1, fc='#0000FF')
pl.legend([r, g, b], labelNames)

pl.show()


Nearest neighbors: training

Next, we train a nearest neighbor classifier on our data.

The first section chooses the number of neighbors to use, and stores it in the variable n_neighbors. The last two lines create and train the classifier.

The first of these lines creates a classifier (clf) from the KNeighborsClassifier class, telling it to use the number of neighbors stored in n_neighbors. The second line uses the fit() method to train the classifier on the features in X, using the labels in y.


In [3]:
# Choose your number of neighbors
n_neighbors = 15

# Create an instance of KNeighborsClassifier and fit the data
clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors)
clf.fit(X, y)


Out[3]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           n_neighbors=15, p=2, weights='uniform')

Plot the classification boundaries

Now that we have our classifier, let's visualize what it's doing.

First we plot the decision boundaries, or the lines dividing areas assigned to the different labels (species of iris). Then we plot our examples onto the space, showing where each point lies and the corresponding decision boundary.

The colored background shows the areas that are considered to belong to a certain label. If we took sepal measurements from a new flower, we could plot it in this space and use the color to determine which type of iris our classifier believes it to be.
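
In code, that check is a single call to predict(). A minimal sketch, where the 5.0 cm by 3.5 cm measurements are made up for illustration:

# Hypothetical new flower: sepal length 5.0 cm, sepal width 3.5 cm
new_flower = [[5.0, 3.5]]  # made-up measurements for illustration
print(labelNames[clf.predict(new_flower)[0]])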


In [4]:
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

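# np.c_[xx.ravel(), yy.ravel()] flattens the grid into a single list of
# (x, y) pairs, so the classifier can label every mesh point in one call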
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
pl.figure(figsize=(8, 6))
pl.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
pl.xlim(xx.min(), xx.max())
pl.ylim(yy.min(), yy.max())
pl.title("3-Class classification (k = %i)"
         % (n_neighbors))
pl.xlabel('Sepal length (cm)')
pl.ylabel('Sepal width (cm)')

# Create a legend for the colors, using previously defined values
pl.legend([r, g, b], labelNames)

pl.show()


Changing the number of neighbors

Go back and see how the plot changes when you change the number of neighbors. Try as many or as few as you'd like, but remember there are only 150 examples in the dataset, so selecting all 150 wouldn't be very useful!
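
If you'd like to compare several values side by side, a short loop like the following (our own addition, reusing X, y, and the neighbors module from above) prints the training-set accuracy for a few choices of K:

# Compare a few values of K by accuracy on the training data.
# Training accuracy is optimistic, especially for small K,
# but it shows how the fit changes as K grows.
for k in [1, 5, 15, 50]:
    clf_k = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf_k.fit(X, y)
    print('K = %2d: training accuracy = %.2f' % (k, clf_k.score(X, y)))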

Making predictions

Now, let's say we go out and measure the sepals of two iris plants, and want to know what species they are. We're going to use our classifier to predict the flowers with the following measurements:

Plant   Sepal length (cm)   Sepal width (cm)
A       4.3                 2.5
B       6.3                 2.1

We can use our classifier's predict() function to predict the labels for our input features. We pass it the variable examples, a list in which each element is itself a list of the features (measurements) for one example. The output is a list of labels corresponding to the input examples.

We'll also plot them on the boundary plot, to show why they were predicted that way.


In [5]:
# Add our new data examples
examples = [[4.3, 2.5], # Plant A
            [6.3, 2.1]] # Plant B

# Reset our number of neighbors
n_neighbors = 15

# Create an instance of KNeighborsClassifier and fit the original data
clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors)
clf.fit(X, y)

# Predict the labels for our new examples
labels = clf.predict(examples)

# Print the predicted species names
print('A: ' + labelNames[labels[0]])
print('B: ' + labelNames[labels[1]])

# Now plot the results
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
pl.figure(figsize=(8, 6))
pl.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
pl.xlim(xx.min(), xx.max())
pl.ylim(yy.min(), yy.max())
pl.title("3-Class classification (k = %i)"
         % (n_neighbors))
pl.xlabel('Sepal length (cm)')
pl.ylabel('Sepal width (cm)')

# Display the new examples as labeled text on the graph
pl.text(examples[0][0], examples[0][1],'A', fontsize=14)
pl.text(examples[1][0], examples[1][1],'B', fontsize=14)

# Create a legend for the colors, using previously defined values
pl.legend([r, g, b], labelNames)

pl.show()


A: Setosa
B: Virginica

Changing neighbor weights

Looking at our previous plot, we can see that plant A is classified as Setosa (red) and B is classified as Virginica (blue). While A seems to be clearly correct, B is much closer to two examples from Versicolour (green). Maybe we should give more importance to labels that are closer to our example?

In the previous example, all the neighbors were considered equally important when deciding what label to give our input. But what if we want to give more importance (a higher weight) to neighbors that are closer to our new example? The K-NN algorithm allows us to change from uniform weights, where all neighbors have the same importance, to distance-based weights, where closer neighbors are given more consideration.
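
Concretely, with distance-based weights each neighbor's vote is proportional to the inverse of its distance, so one very close neighbor can outvote several farther ones. A toy tally, with labels and distances made up for illustration:

# Toy inverse-distance vote with K=3: one very close Versicolour neighbor
# against two farther Virginica neighbors (labels and distances made up)
neighbor_labels = [1, 2, 2]        # Versicolour, Virginica, Virginica
neighbor_dists  = [0.1, 0.5, 0.6]  # the Versicolour neighbor is much closer

votes = {}
for label, d in zip(neighbor_labels, neighbor_dists):
    votes[label] = votes.get(label, 0.0) + 1.0 / d  # closer => larger vote
print(votes)  # {1: 10.0, 2: 3.66...}: distance weighting flips the uniform 2-1 vote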

Below, we create a new classifier using distance-based weights and plot the results. The only difference in the code is that we call KNeighborsClassifier() with the argument weights='distance'.

Look at how the decision boundaries differ from the previous plot, and see how the classification of plant B changes. Try it with different numbers of neighbors and compare the results.


In [6]:
# Choose your number of neighbors
n_neighbors_distance = 15

# Create an instance of KNeighborsClassifier with distance weighting and fit the data
clf_distance = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors_distance, weights='distance')
clf_distance.fit(X, y)

# Predict the labels of the new examples
labels = clf_distance.predict(examples)

# Print the predicted species names
print('A: ' + labelNames[labels[0]])
print('B: ' + labelNames[labels[1]])

# Plot the results
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf_distance.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
pl.figure(figsize=(8, 6))
pl.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
pl.xlim(xx.min(), xx.max())
pl.ylim(yy.min(), yy.max())
pl.title("3-Class classification (k = %i, weights='distance')"
         % (n_neighbors))
pl.xlabel('Sepal length (cm)')
pl.ylabel('Sepal width (cm)')

# Display the new examples as labeled text on the graph
pl.text(examples[0][0], examples[0][1],'A', fontsize=14)
pl.text(examples[1][0], examples[1][1],'B', fontsize=14)

# Create a legend for the colors, using previously defined values
pl.legend([r, g, b], labelNames)

pl.show()


A: Setosa
B: Versicolour
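
To see where that flip comes from, you can inspect plant B's actual neighbors with the classifier's kneighbors() method, which returns the distances and indices of the K nearest training points. A quick diagnostic sketch:

# List plant B's 15 nearest neighbors: their distances and species
dists, indices = clf_distance.kneighbors([examples[1]])
for d, i in zip(dists[0], indices[0]):
    print('distance %.2f -> %s' % (d, labelNames[y[i]]))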