Title: K-Nearest Neighbors Classification
Slug: k-nearest_neighbors_using_scikit_pandas
Summary: A quick guide to k-nearest neighbors classification using pandas, numpy, and scikit-learn.
Date: 2016-08-31 12:00
Category: Machine Learning
Tags: Nearest Neighbors
Authors: Chris Albon
K-nearest neighbors classifier (KNN) is a simple and powerful classification learner.
KNN has three basic parts:
Imagine we have an observation where we know its independent variables $x_{test}$ but do not know its class $y_{test}$. The KNN learner finds the K other observations that are closest to $x_{test}$ and uses their known classes to assign a class to $x_{test}$.
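The idea above can be sketched in a few lines of numpy. This is a minimal illustration, not how scikit-learn implements it: the function name knn_predict and the tiny four-point dataset are made up for this example, and Euclidean distance with a simple majority vote is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the classes of those neighbors and return the most common one
    votes = Counter(str(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two points near the origin, two near (1, 1)
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(['a', 'a', 'b', 'b'])

prediction = knn_predict(X_train, y_train, np.array([0.2, 0.0]), k=3)
print(prediction)
```

Two of the three nearest neighbors of the new point belong to class a, so the vote goes to a.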
In [1]:
import pandas as pd
from sklearn import neighbors
import numpy as np
%matplotlib inline
import seaborn
Here we create three variables: test_1 and test_2 are our independent variables, and outcome is our dependent variable. We will use this data to train our learner.
In [2]:
training_data = pd.DataFrame()
training_data['test_1'] = [0.3051,0.4949,0.6974,0.3769,0.2231,0.341,0.4436,0.5897,0.6308,0.5]
training_data['test_2'] = [0.5846,0.2654,0.2615,0.4538,0.4615,0.8308,0.4962,0.3269,0.5346,0.6731]
training_data['outcome'] = ['win','win','win','win','win','loss','loss','loss','loss','loss']
This is not necessary, but because we only have three variables, we can plot the training dataset. The X and Y axes are the independent variables, while the colors of the points are their classes.
In [3]:
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
seaborn.lmplot(x='test_1', y='test_2', data=training_data, fit_reg=False, hue="outcome", scatter_kws={"marker": "D", "s": 100})
The scikit-learn library works with data formatted as numpy arrays. Here we do that conversion explicitly.
In [4]:
# as_matrix() was removed from pandas; select the columns and convert with to_numpy()
X = training_data[['test_1', 'test_2']].to_numpy()
y = np.array(training_data['outcome'])
This is our big moment. We train a KNN learner with the parameter that an observation's neighborhood is its three closest neighbors. weights = 'uniform' can be thought of as the voting system used. For example, uniform means that all neighbors get an equally weighted "vote" about an observation's class, while weights = 'distance' would tell the learner to weight each neighbor's "vote" by the inverse of its distance from the observation we are classifying, so closer neighbors count for more.
In [5]:
clf = neighbors.KNeighborsClassifier(3, weights = 'uniform')
trained_model = clf.fit(X, y)
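To see what the two voting systems do in practice, the sketch below fits one learner per weights setting on the same training data and compares the class shares for a single point. The point (0.4, 0.6) and the names query and results are chosen just for this illustration.

```python
import numpy as np
from sklearn import neighbors

# Same training data as above, rebuilt here so the example is self-contained
test_1 = [0.3051, 0.4949, 0.6974, 0.3769, 0.2231, 0.341, 0.4436, 0.5897, 0.6308, 0.5]
test_2 = [0.5846, 0.2654, 0.2615, 0.4538, 0.4615, 0.8308, 0.4962, 0.3269, 0.5346, 0.6731]
X = np.column_stack([test_1, test_2])
y = np.array(['win'] * 5 + ['loss'] * 5)

# An arbitrary point to classify (illustrative only)
query = np.array([[0.4, 0.6]])

results = {}
for weights in ('uniform', 'distance'):
    model = neighbors.KNeighborsClassifier(n_neighbors=3, weights=weights).fit(X, y)
    label = str(model.predict(query)[0])
    # Map each class name to its share of the (possibly weighted) vote
    shares = {str(c): float(p) for c, p in zip(model.classes_, model.predict_proba(query)[0])}
    results[weights] = (label, shares)
    print(weights, label, shares)
```

For this point the single closest neighbor is a win and the next two are losses, so distance weighting moves some of the vote toward win, although loss still comes out ahead under both settings.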
How good is our trained model compared to our training data?
In [6]:
trained_model.score(X, y)
Our model is 80% accurate!
Note that in any real-world example we'd want to compare the trained model to some holdout test data. But since this is a toy example, I used the training data.
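For reference, a holdout evaluation might look like the sketch below. It assumes scikit-learn's train_test_split with a 30% test share, a fixed random_state, and stratification by class, none of which come from the original notebook. With only ten observations the resulting score is very noisy, so treat this as a pattern rather than a meaningful measurement.

```python
import numpy as np
from sklearn import neighbors
from sklearn.model_selection import train_test_split

# Same training data as above, rebuilt here so the example is self-contained
test_1 = [0.3051, 0.4949, 0.6974, 0.3769, 0.2231, 0.341, 0.4436, 0.5897, 0.6308, 0.5]
test_2 = [0.5846, 0.2654, 0.2615, 0.4538, 0.4615, 0.8308, 0.4962, 0.3269, 0.5346, 0.6731]
X = np.column_stack([test_1, test_2])
y = np.array(['win'] * 5 + ['loss'] * 5)

# Hold out 3 of the 10 observations; stratify keeps both classes in each split
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit only on the training split, then score on data the model has not seen
holdout_model = neighbors.KNeighborsClassifier(n_neighbors=3, weights='uniform')
holdout_model.fit(X_train, y_train)
holdout_accuracy = holdout_model.score(X_holdout, y_holdout)
print(holdout_accuracy)
```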
Now that we have trained our model, we can predict the class of any new observation, $y_{test}$. Let us do that now!
In [7]:
# Create a new observation with the value of the first independent variable, 'test_1', as .4
# and the second independent variable, 'test_2', as .6
x_test = np.array([[.4,.6]])
In [8]:
# Apply the learner to the new, unclassified observation.
trained_model.predict(x_test)
Huzzah! We can see that the learner has predicted that the new observation's class is loss.
We can even look at the probabilities the learner assigned to each class:
In [9]:
# View the probability the learner assigned to each class
trained_model.predict_proba(x_test)
According to this result, the model predicted that the observation was loss with a ~67% probability and win with a ~33% probability. Because the observation had a greater probability of being loss, it predicted that class for the observation.
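Those probabilities are simply the share of the three nearest neighbors that belong to each class. The sketch below recomputes them by hand using the learner's kneighbors method; the names neighbor_classes and manual_proba are introduced only for this illustration, and uniform weights are assumed.

```python
import numpy as np
from sklearn import neighbors

# Same training data as above, rebuilt here so the example is self-contained
test_1 = [0.3051, 0.4949, 0.6974, 0.3769, 0.2231, 0.341, 0.4436, 0.5897, 0.6308, 0.5]
test_2 = [0.5846, 0.2654, 0.2615, 0.4538, 0.4615, 0.8308, 0.4962, 0.3269, 0.5346, 0.6731]
X = np.column_stack([test_1, test_2])
y = np.array(['win'] * 5 + ['loss'] * 5)
x_test = np.array([[.4, .6]])

model = neighbors.KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)

# Distances to, and row indices of, the three nearest training observations
distances, indices = model.kneighbors(x_test)
neighbor_classes = y[indices[0]]

# With uniform weights, each class's probability is its share of the neighbors
manual_proba = {str(c): float(np.mean(neighbor_classes == c)) for c in model.classes_}
print(neighbor_classes, manual_proba)
```

Two of the three neighbors are loss and one is win, which gives the 2/3 and 1/3 split reported by predict_proba.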