Nearest Neighbor Tutorial

K-nearest neighbors, or K-NN, is a simple form of supervised learning. It assigns an output label to a new input example x based on its closest neighboring data points. The number K specifies how many neighbors to consider. For K=1, x is assigned the label of its single closest neighbor. For K>1, x is assigned the label held by the majority of its K nearest neighbors.

The code in this tutorial is lightly adapted from the scikit-learn K-NN example. The scikit-learn documentation for the KNeighborsClassifier class is also a useful reference.
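To make the idea concrete, here is a minimal from-scratch sketch of the majority vote described above (an illustration only, using made-up toy data and names invented for this sketch; everything that follows uses scikit-learn's implementation instead).


In [ ]:
# A from-scratch sketch of the K-NN idea (illustration only)
import numpy as np

def knn_predict(X_pts, y_pts, x_new, k=3):
    # Euclidean distance from x_new to every stored example
    dists = np.sqrt(((X_pts - x_new) ** 2).sum(axis=1))
    # Indices of the k closest examples
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of those neighbors
    return np.bincount(y_pts[nearest]).argmax()

# Tiny toy dataset: two clusters with labels 0 and 1
X_demo = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_demo = np.array([0, 0, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([1.1, 0.9]), k=3)) # prints 0: 2 of the 3 nearest neighbors have label 0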

Setup

Tell matplotlib to display figures in the notebook. Then import numpy (for numerical data), matplotlib.pyplot (for plotting figures), ListedColormap (for plotting colors), neighbors (for scikit-learn's nearest-neighbor algorithm), and datasets (to load the iris dataset that ships with scikit-learn). We also import train_test_split and KFold, which we will use later to evaluate the classifier.

Also create the color maps used to color the plotted data, and labelList, a list of colored rectangles used in the plot legends.


In [ ]:
# Print figures in the notebook
%matplotlib inline 

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets # Import the nearest-neighbor algorithm and datasets from scikit-learn
from sklearn.model_selection import train_test_split, KFold

# Import patch for drawing rectangles in the legend
from matplotlib.patches import Rectangle

# Create color maps
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])

# Create a legend for the colors, using rectangles for the corresponding colormap colors
labelList = []
for color in cmap_bold.colors:
    labelList.append(Rectangle((0, 0), 1, 1, fc=color))

Import the dataset

Import the dataset and store it in a variable called iris. Scikit-learn's documentation describes this dataset in more detail. The dataset object behaves like a Python dictionary, with the keys: ['DESCR', 'target_names', 'target', 'data', 'feature_names']

The data features are stored in iris.data, where each row is an example from a single flower, and each column is a single feature. The feature names are stored in iris.feature_names. Labels are stored as the numbers 0, 1, or 2 in iris.target, and the names of these labels are in iris.target_names.

The dataset consists of measurements made on 50 examples from each of three different species of iris flowers (Setosa, Versicolour, and Virginica). Each example has four features (or measurements): sepal length, sepal width, petal length, and petal width. All measurements are in cm.

Below, we load the labels into y, the corresponding label names into labelNames, the data into X, and the names of the features into featureNames.


In [ ]:
# Import some data to play with
iris = datasets.load_iris()

# Store the labels (y), label names, features (X), and feature names
y = iris.target       # Labels are stored in y as numbers
labelNames = iris.target_names # Species names corresponding to labels 0, 1, and 2
X = iris.data
featureNames = iris.feature_names
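As a quick check (an optional sketch using the variables we just defined), you can print the shapes and names to confirm everything loaded as described above.


In [ ]:
# A quick look at what we just loaded
print(X.shape)      # (150, 4): 150 examples, 4 features each
print(featureNames) # names of the four measurements
print(labelNames)   # the three iris species
print(np.unique(y)) # labels are stored as 0, 1, and 2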

Below, we take only the first two features from the dataset (sepal length and width). Normally we would try to use every useful feature, but sticking with two lets us visualize the data more easily.

We then plot the data to get a look at what we're dealing with. The colormap determines what colors are used for each class when plotting.


In [ ]:
# Plot the data

# Sepal length and width
X_small = X[:,:2]
# Get the minimum and maximum values with an additional 0.5 border
x_min, x_max = X_small[:, 0].min() - .5, X_small[:, 0].max() + .5
y_min, y_max = X_small[:, 1].min() - .5, X_small[:, 1].max() + .5

plt.figure(figsize=(8, 6))

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Sepal width vs length')

# Set the plot limits
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

# Plot the legend
plt.legend(labelList, labelNames)

plt.show()

Nearest neighbors: training

Next, we train a nearest neighbor classifier on our data.

The first section chooses the number of neighbors to use, and stores it in the variable n_neighbors (another, more intuitive, name for the K variable mentioned previously).

The last two lines create and train the classifier. The first of these creates a classifier (clf) using KNeighborsClassifier(), telling it to use the number of neighbors stored in n_neighbors. The second uses the fit() method to train the classifier on the features in X_small, using the labels in y.


In [ ]:
# Choose your number of neighbors
n_neighbors = 15

# we create an instance of Neighbours Classifier and fit the data.
clf = neighbors.KNeighborsClassifier(n_neighbors)
clf.fit(X_small, y)

Plot the classification boundaries

Now that we have our classifier, let's visualize what it's doing.

First we plot the decision boundaries, that is, the lines dividing the areas assigned to the different labels (species of iris). Then we plot our examples in the same space, showing where each point lies relative to those boundaries.

The colored background shows the areas that are considered to belong to a certain label. If we took sepal measurements from a new flower, we could plot it in this space and use the color to determine which type of iris our classifier believes it to be.


In [ ]:
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = %i)"
         % (n_neighbors))
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Plot the legend
plt.legend(labelList, labelNames)

plt.show()

Changing the number of neighbors

Change the number of neighbors (n_neighbors) below, and see how the plot boundaries change. Try as many or as few as you'd like, but remember there are only 150 examples in the dataset, so selecting all 150 wouldn't be very useful!


In [ ]:
# Choose your number of neighbors
n_neighbors = 5 # Choose a new number of neighbors (replace 5 with your own value)

# we create an instance of Neighbours Classifier and fit the data.
clf = neighbors.KNeighborsClassifier(n_neighbors)
clf.fit(X_small, y)

h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = %i)"
         % (n_neighbors))
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Plot the legend
plt.legend(labelList, labelNames)

plt.show()

Making predictions

Now, let's say we go out and measure the sepals of two new iris plants, and want to know what species they are. We're going to use our classifier to predict the flowers with the following measurements:

Plant   Sepal length (cm)   Sepal width (cm)
A       4.3                 2.5
B       6.3                 2.1

We can use our classifier's predict() function to predict the labels for our input features. We pass the variable examples to the predict() function; examples is a list in which each element is another list containing the features (measurements) for one example. The output is a list of labels corresponding to the input examples.


In [ ]:
# Add our new data examples
examples = [[4.3, 2.5], # Plant A
            [6.3, 2.1]] # Plant B

# Reset our number of neighbors
n_neighbors = 15

# Create an instance of Neighbours Classifier and fit the original data.
clf = neighbors.KNeighborsClassifier(n_neighbors)
clf.fit(X_small, y)

# Predict the labels for our new examples
labels = clf.predict(examples)

# Print the predicted species names
print('A: ' + labelNames[labels[0]])
print('B: ' + labelNames[labels[1]])

Plotting our predictions

Now let's plot our predictions to see why they were classified that way.


In [ ]:
# Now plot the results
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point
                                               # in the mesh in order to find the 
                                               # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = %i)"
         % (n_neighbors))
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Display the new examples as labeled text on the graph
plt.text(examples[0][0], examples[0][1],'A', fontsize=14)
plt.text(examples[1][0], examples[1][1],'B', fontsize=14)

# Plot the legend
plt.legend(labelList, labelNames)

plt.show()

What about our other features?

You may remember that our original dataset contains two additional features, the length and width of the petals.

What does the plot look like when you train on the petal length and width? How does it change when you change the number of neighbors?

How would you plot our two new plants, A and B, on these new plots? Assume we have all four measurements for each plant, as shown below.

Plant   Sepal length (cm)   Sepal width (cm)   Petal length (cm)   Petal width (cm)
A       4.3                 2.5                 1.5                 0.5
B       6.3                 2.1                 4.8                 1.5

In [ ]:
# Put your code here! 

# Feel free to add as many code cells as you need
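If you get stuck, here is one possible sketch (just one way to approach it, not the only solution): petal length and width are the last two columns of X, so you can select them with X[:, 2:] and reuse the plotting code from the earlier cells.


In [ ]:
# One possible approach (sketch): train on petal length and width
X_petal = X[:, 2:] # petal length and petal width columns

n_neighbors = 15
clf_petal = neighbors.KNeighborsClassifier(n_neighbors)
clf_petal.fit(X_petal, y)

# Plants A and B, described only by their petal measurements
petal_examples = [[1.5, 0.5], # Plant A
                  [4.8, 1.5]] # Plant B
print(clf_petal.predict(petal_examples))

# To plot the boundaries, reuse the mesh/pcolormesh code above with X_petal
# in place of X_small, and label the axes 'Petal length (cm)' and 'Petal width (cm)'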

Changing neighbor weights

Looking at our original plot of sepal dimensions, we can see that plant A is classified as Setosa (red) and B is classified as Virginica (blue). While A seems to be clearly correct, B is much closer to two examples from Versicolour (green). Maybe we should give more importance to neighbors that are closer to our example?

In the previous examples, all the neighbors were considered equally important when deciding what label to give our input. But what if we want to give more importance (a higher weight) to neighbors that are closer to our new example? The K-NN algorithm lets you change from uniform weights, where all included neighbors have the same importance, to distance-based weights, where closer neighbors are given more consideration.

Below, we create a new classifier using distance-based weights and plot the results. The only difference in the code is that we call KNeighborsClassifier() with the argument weights='distance'.

Look at how it differs from the original plot, and see how the classification of plant B changes. Try it with different numbers of neighbors, and compare the results to the previous plots.


In [ ]:
# Choose your number of neighbors
n_neighbors_distance = 15

# we create an instance of Neighbours Classifier and fit the data.
clf_distance = neighbors.KNeighborsClassifier(n_neighbors_distance, 
                                              weights='distance')
clf_distance.fit(X_small, y)

# Predict the labels of the new examples
labels = clf_distance.predict(examples)

# Print the predicted species names
print('A: ' + labelNames[labels[0]])
print('B: ' + labelNames[labels[1]])

# Plot the results
h = .02  # step size in the mesh

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X_small[:, 0].min() - 1, X_small[:, 0].max() + 1
y_min, y_max = X_small[:, 1].min() - 1, X_small[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = clf_distance.predict(np.c_[xx.ravel(), yy.ravel()]) # Make a prediction at every point
                                                         # in the mesh in order to find the
                                                         # classification areas for each label

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot the training points
plt.scatter(X_small[:, 0], X_small[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = %i)"
         % (n_neighbors))
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')

# Display the new examples as labeled text on the graph
plt.text(examples[0][0], examples[0][1],'A', fontsize=14)
plt.text(examples[1][0], examples[1][1],'B', fontsize=14)

# Plot the legend
plt.legend(labelList, labelNames)

plt.show()

Now, see how this affects your plots using other features.


In [ ]:
# Put your code here!

Using more than two features

Using two features is great for visualization, but it often isn't enough to train a good classifier. Below, we train a classifier using the 'distance' weighting method and all four features, then use it to predict the species of plants A and B.

How do these predictions compare to the predictions we made using only two features?


In [ ]:
# Add our new data examples
examples = [[4.3, 2.5, 1.5, 0.5], # Plant A
            [6.3, 2.1, 4.8, 1.5]] # Plant B

# Choose your number of neighbors
n_neighbors_distance = 15

# we create an instance of Neighbours Classifier and fit the data.
clf_distance = neighbors.KNeighborsClassifier(n_neighbors_distance, weights='distance')
clf_distance.fit(X, y)

# Predict the labels of the new examples
labels = clf_distance.predict(examples)

# Print the predicted species names
print('A: ' + labelNames[labels[0]])
print('B: ' + labelNames[labels[1]])

Evaluating The Classifier

In order to evaluate a classifier, we need to split our dataset into training data, which we'll show to the classifier so it can learn, and testing data, which we will hold back from training and use to test its predictions.

Below, we create the training and testing datasets, using all four features. We then train our classifier on the training data, and get the predictions for the test data.


In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Choose your number of neighbors
n_neighbors_distance = 15

# we create an instance of Neighbours Classifier and fit the data.
clf_distance = neighbors.KNeighborsClassifier(n_neighbors_distance, weights='distance')
clf_distance.fit(X_train, y_train)

# Predict the labels of the test data
predictions = clf_distance.predict(X_test)

Next, we evaluate how well the classifier did. The easiest way to do this is to compute the fraction of predictions that are correct, usually referred to as the accuracy.


In [ ]:
accuracy = np.mean(predictions == y_test) * 100

print('The accuracy is ' + '%.2f' % accuracy + '%')
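As a side note, scikit-learn provides a helper that computes the same quantity; a brief sketch using accuracy_score:


In [ ]:
# The same accuracy, computed with scikit-learn's helper function
from sklearn.metrics import accuracy_score

print('The accuracy is ' + '%.2f' % (accuracy_score(y_test, predictions)*100) + '%')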

Comparing Models with Cross-Validation

To select the best number of neighbors to use in our model, we use cross-validation. We can then get our final result using our held-out test data.

First we choose the values of k (numbers of neighbors) we want to investigate, then divide our training data into folds. We loop over the folds; each time, we train a classifier for each value of k on the training portion and evaluate it on the held-out validation portion. We store the accuracy of each classifier on each fold so we can compare them later.


In [ ]:
# Choose our k values
kvals = [1,3,5,10,20,40]

# Create a dictionary of arrays to store accuracies
accuracies = {}
for k in kvals:
    accuracies[k] = []

# Loop through 5 folds
kf = KFold(n_splits=5)
for trainInd, valInd in kf.split(X_train):
    X_tr = X_train[trainInd,:]
    y_tr = y_train[trainInd]
    X_val = X_train[valInd,:]
    y_val = y_train[valInd]
    
    # Loop through each value of k
    for k in kvals:
        
        # Create the classifier
        clf = neighbors.KNeighborsClassifier(k, weights='distance')
        
        # Train the classifier
        clf.fit(X_tr, y_tr) 
        
        # Make our predictions
        pred = clf.predict(X_val)
    
        # Calculate the accuracy
        accuracies[k].append(np.mean(pred == y_val))
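As a side note, scikit-learn can run this train-and-evaluate loop for you with cross_val_score; a brief sketch for comparison (the rest of the tutorial keeps using the accuracies dictionary computed above):


In [ ]:
# The same idea using scikit-learn's built-in cross-validation helper
from sklearn.model_selection import cross_val_score

for k in kvals:
    clf = neighbors.KNeighborsClassifier(k, weights='distance')
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print('k=%i: mean accuracy %.2f' % (k, scores.mean()))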

Select a Model

To select a model, we look at the average accuracy across all folds.


In [ ]:
for k in kvals:
    print('k=%i: %.2f' % (k, np.mean(accuracies[k])))
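You can also pick the best k programmatically instead of reading it off the printout; a small sketch using the accuracies dictionary from above:


In [ ]:
# Select the k with the highest mean validation accuracy
mean_accuracies = {k: np.mean(accuracies[k]) for k in kvals}
best_k = max(mean_accuracies, key=mean_accuracies.get)
print('Best k: %i (mean accuracy %.2f)' % (best_k, mean_accuracies[best_k]))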

Final Evaluation

In this run, K=3 gives us the highest average validation accuracy, so we select it as our best model. Now we can train it on the full training set and evaluate it on our test set to get our final accuracy.


In [ ]:
clf = neighbors.KNeighborsClassifier(3, weights='distance')
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

accuracy = np.mean(predictions == y_test)*100

print('The final accuracy is %.2f' % accuracy + '%')

Sidenote: Randomness and Results

Every time you run this notebook, you will get slightly different results. Why? Because the data is randomly divided into the training, validation, and testing sets. Running the code again creates a different split of the data, which makes the results slightly different. However, the overall outcome should remain consistent, with approximately the same values. If you get drastically different results when running an analysis multiple times, it suggests a problem with your model, or that you need more data.

If it's important that you get exactly the same results every time you run the code, you can specify a random state via the random_state argument of train_test_split() and KFold().


In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
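KFold can be made reproducible in the same way; a brief sketch (note that random_state only takes effect when shuffle=True):


In [ ]:
# Reproducible cross-validation folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)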