K Nearest Neighbor (KNN) is a popular non-parametric method. The prediction (for regression/classification) is obtained by looking into the K closest memorized examples.

The algorithm itself can be summarized into three steps:

  • Select a positive integer K along with a new example

  • Select K entries in the training databse which are closest to the new example

  • For regression problem, we perform an average or weighted average of the response of these closest training examples to make the prediction. For classification scenarios, we do a majority vote within the traning entries to assign the label to the new example

The following class KNearestNeighbors() implements this idea.


In [1]:
import numpy as np
import operator

class KNearestNeighbors():

    def __init__(self, k, model_type='regression', weights='uniform'):

        # model_type can be either 'classification' or 'regression'
        # weights = 'uniform', the K nearest neighbors are equally weighted
        # weights = 'distance', the K nearest entries are weighted by inverse of the distance
        self.model_type = model_type
        self.k = k
        self.weights = weights
        self.X_train = None
        self.y_train = None

    def _dist(self, example1, example2):

        # calculate euclidean distance between two examples
        if len(example1) != len(example2):
            print "Inconsistent Dimension!"
            return

        return np.sqrt(sum(np.power(np.array(example1) - np.array(example2), 2)))

    def _find_neighbors(self, test_instance):

        # find K nearest neighbors for a test instance
        # this function return a list of K nearest neighbors for this test instance,
        # each element of the list is another list of distance and target
        m, n = self.X_train.shape
        neighbors = [[self._dist(self.X_train[i, :], test_instance), self.y_train[i]]
                     for i in range(m)]
        neighbors.sort(key=lambda x: x[0])
        return neighbors[:self.k]

    def fit(self, X, y):

        # no parameters learning in model fitting process for KNN
        # just to store all the training instances
        self.X_train = X
        self.y_train = y

        return self

    def predict(self, X):

        # predict using KNN algorithm
        X = np.array(X)

        # if only have one test example to predict
        if len(X.shape) == 1:
            X = X[np.newaxis, :]

        m = X.shape[0]
        y_predict = np.zeros((m, 1))

        # for regression problems, depending on the weights ('uniform' or 'distance'),
        # it will perform average or weighted average based on inverse of distance
        if self.model_type == 'regression':
            for i in range(m):
                distance_mat = np.array(self._find_neighbors(X[i, :]))
                if self.weights == 'distance':
                    y_predict[i] = np.average(distance_mat[:, 1], weights=1.0/distance_mat[:, 0])
                else:
                    y_predict[i] = np.average(distance_mat[:, 1])

        # for classification, we will apply majority vote for prediction
        # it still offer two options in terms of the weights
        else:
            for i in range(m):
                votes = {}
                distance_mat = np.array(self._find_neighbors(X[i, :]))
                for j in range(self.k):
                    if self.weights == 'distance':
                        votes[distance_mat[j, 1]] = votes.get(distance_mat[j, 1], 0) \
                                                    + 1.0 / distance_mat[j, 0]
                    else:
                        votes[distance_mat[j, 1]] = votes.get(distance_mat[j, 1], 0) + 1.0
                sorted_votes = sorted(votes.iteritems(), key=operator.itemgetter(1), reverse=True)
                y_predict[i] = sorted_votes[0][0]

            y_predict = y_predict.astype(int)

        return y_predict.ravel()

Let's first look at a demo for classification problem.

The Iris data set with three differe classes will be loaded here:


In [2]:
from sklearn.datasets import load_iris
iris = load_iris()

X = iris['data']
y = iris['target']

print X.shape
print y.shape
print "Number of Classes: {}".format(len(np.unique(y)))


(150L, 4L)
(150L,)
Number of Classes: 3

We can split the original data for test purposes


In [3]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=26)

We will define an object of the KNearestNeighbors() class to train the model and predict for test examples:


In [4]:
knn = KNearestNeighbors(k=3, model_type='classification', weights='uniform')
knn = knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)
print "True Values:      {}".format(y_test)
print "Predicted Values: {}".format(y_predict)
print "Prediction Accuracy: {:.2%}".format(np.mean((y_predict == y_test).astype(float)))


True Values:      [1 1 0 0 2 1 0 2 2 2 0 1 1 0 1 0 2 1 2 0 0 2 2 2 2 2 1 0 0 2]
Predicted Values: [1 1 0 0 2 1 0 2 2 2 0 1 1 0 1 0 2 1 2 0 0 2 2 2 2 1 1 0 0 2]
Prediction Accuracy: 96.67%

Let's then look at a demo for regression problem.

The Boston housing data will be loaded here:


In [5]:
from sklearn.datasets import load_boston

boston = load_boston()

X = boston['data']
y = boston['target']

print X.shape
print y.shape


(506L, 13L)
(506L,)

Again, let's split the entire data into training and test sets:


In [6]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.025, random_state=26)

Now we will utilize the KNearestNeighbors() class again to train the model and predict for test examples:


In [7]:
knn = KNearestNeighbors(k=20, model_type='regression', weights='uniform')
knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)
print "True Values:\n{}".format([round(elem, 1) for elem in y_test])
print "Predicted Values:\n{}".format([round(elem, 1) for elem in y_predict])
print "RMSE is {:.4}".format(np.sqrt(np.mean((y_test.reshape((len(y_test), 1)) - y_predict) ** 2)))


True Values:
[19.2, 32.0, 14.3, 13.2, 20.0, 30.7, 23.1, 22.6, 20.9, 13.4, 14.6, 17.8, 16.7]
Predicted Values:
[20.0, 25.2, 18.3, 19.0, 23.5, 30.3, 12.9, 25.7, 24.8, 12.9, 19.9, 18.0, 18.1]
RMSE is 7.644

To summarize, the good side about KNN is that we don't have to tune complex parameters to build the model. However, when the data size is large, the computation cost of finding the nearest neighbors for all test examples is high. Also, the performance of the model relies on the dimension of features. There is always the thing called the "Curse of Dimensionality". With high dimension, the nearest neighbors based on distance measures can be actually very far away. Thus, sometimes it is helpful to run PCA to reduce dimension before running KNN.