We are going to learn how to label some unlabeled data based on characteristics it shares with labeled data. In terms of vocabulary, we are going to create a classifier for a specific target variable using feature comparisons.
The starting point is the labeled dataset with various features. In this n-dimensional space, we then calculate the distance between our unknown datum and its k closest neighbors.
The process has three main parts: choosing k, computing distances, and applying the labeling logic to the unknown.
This tutorial uses two main resources: the book Machine Learning In Action and the scikit-learn package.
A common question: Since k is an input, how do we choose the size of k?
We can optimize k before we use kNN. This technique usually involves something like n-fold cross-validation: testing various values of k lets us compare the mean error rates and choose "the best" k-value.
An alternative approach: optimize k while we use kNN. What if we let k vary (for each unknown) depending on the distribution of distances? A simple application would be to calculate all the distances and set k to be the number of neighbors within one standard deviation of the mean distance.
In summary, optimizing k is a great idea. I don't have a great answer for you, but many people seem to like n-fold cross-validation [reference paper].
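As a rough sketch of the first approach (not part of the original tutorial): scikit-learn's cross_val_score can estimate accuracy for several candidate values of k. The built-in iris dataset is used here purely for illustration.
In [ ]:
# Sketch: choose k by comparing cross-validated accuracy for several candidates.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# try several candidate values of k and record the mean accuracy from 5-fold cross-validation
candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidate_ks:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()

# pick the k with the best cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print("mean accuracy per k: {}".format(mean_scores))
print("best k: {}".format(best_k))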
We can use Euclidean distance:
$ \text{distance} = \sqrt{\sum \limits_{i=1}^{n} \left(a_i - b_i \right)^2} $
Alternatively, we could measure distance with other metrics, for example ones suited to curved spaces. The scikit-learn package has various options.
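As a small, illustrative sketch (the points a and b below are made up): Euclidean distance can be computed directly with NumPy, and KNeighborsClassifier accepts a metric keyword if we want something other than the default.
In [ ]:
# Sketch: Euclidean distance by hand, plus an alternative metric in scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

a = np.array([1.0, 1.1, 0.0])
b = np.array([0.0, 0.1, 3.0])

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))
print("euclidean distance: {}".format(euclidean))

# scikit-learn exposes other metrics through the `metric` keyword,
# e.g. Manhattan (city-block) distance:
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric='manhattan')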
The method for computing distance is an important choice, since it directly quantifies the "nearness" of the neighbors used in the labeling logic.
This computation can also be the source of major memory issues. If we can recognize the points "closer" to the unknown without computing and storing distances for the entire dataset, then we can drastically improve the performance of kNN.
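One concrete way to avoid computing and storing every pairwise distance is a spatial index such as scikit-learn's BallTree. The sketch below uses made-up points for illustration.
In [ ]:
# Sketch: query a BallTree for the k nearest neighbors without
# manually computing and storing every distance ourselves.
import numpy as np
from sklearn.neighbors import BallTree

points = np.array([[1.0, 1.1, 0.0],
                   [1.0, 1.0, 1.0],
                   [0.0, 0.0, 20.0],
                   [0.0, 0.1, 3.0]])

tree = BallTree(points)
# distances and indices of the 3 nearest neighbors of the unknown point
dist, ind = tree.query(np.array([[0.0, 0.0, 2.0]]), k=3)
print("distances: {}".format(dist))
print("indices: {}".format(ind))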
Several questions can be considered while building the logic for labeling the unknown using kNN.
A typical technique is to treat the labels of the k nearest neighbors as votes for a particular label. The candidate label with the most votes could win.
However, the decision making behind the returned label is entirely up to the author of the technique.
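As a minimal sketch of the voting idea (the neighbor labels below are made up), Python's collections.Counter can do the tallying; note that ties remain an open design choice.
In [ ]:
# Sketch: tally the labels of the k nearest neighbors as votes.
from collections import Counter

# suppose these are the labels of the k = 3 nearest neighbors
neighbor_labels = ['DOG', 'DOG', 'CAT']

tally = Counter(neighbor_labels)
# most_common(1) returns a list with the (label, count) pair that has the most votes
predicted_label = tally.most_common(1)[0][0]
print("tally: {}".format(tally))
print("predicted label: {}".format(predicted_label))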
In [ ]:
import numpy as np
import operator

def createDataSet():
    # features: [boolean for "poops in a litter box", average number of "morning meows", average number of "kisses per greeting"]
    dataSet = np.array([ [1.0,1.1,0],[1.0,1.0,1],[0,0,20],[0,0.1,3] ])
    #labels = [0,0,1,1]
    labels = np.array(['CAT','CAT','DOG','DOG'])
    return dataSet, labels

dataSet, labels = createDataSet()

print("dataSet: \n{}".format(dataSet))
print()
print("labels: \n{}".format(labels))
print()
print("dataSet.shape: \n{}".format(dataSet.shape))
We can build our own custom kNN classifier based on code similar to that in Machine Learning In Action.
In [ ]:
def kNN(unknown, dataSet, labels, k):
    """
    Apply the kNN algorithm to classify an unknown point.
    """
    # get the number of labeled points in the dataset
    dataSetSize = dataSet.shape[0]
    # tile creates a matrix the same size as dataSet with repeated rows of the unknown
    diffMat = np.tile(unknown, (dataSetSize, 1)) - dataSet
    # DISTANCE CALCULATION (Euclidean distance from the unknown to the labeled points)
    sqDiff = diffMat ** 2
    sqDistances = sqDiff.sum(axis=1)
    distances = sqDistances ** 0.5
    # sort the indices of the distances (closest first)
    sortedDistIndex = distances.argsort()
    # LABEL LOGIC (here the labels of the k nearest neighbors are summed up as votes to define the label)
    classCount = {}
    for i in range(k):
        # grab the label of the i-th nearest neighbor
        voteLabel = labels[sortedDistIndex[i]]
        # tally the labels like votes
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # sort the tally (note: if it's a tie, then what?)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    print("tally: \n{}\n".format(sortedClassCount))
    # return the label with the most votes
    return sortedClassCount[0][0]
In [ ]:
# use the tally to label the unknown
print('CUSTOM classification: \n{}'.format(kNN([0, 0, 2], dataSet, labels, 3)))
We can use the KNeighborsClassifier from scikit-learn to accomplish this task as well, and it gives us several algorithms from which we can choose. The following choices control how the nearest neighbors are found.
BallTree - a decision-making technique for determining which distances to calculate.
KDTree - a decision-making technique for determining which distances to calculate.
Brute - calculates all distances between the unknown and the known labeled points.
Auto - attempts to choose the best algorithm.
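For illustration, this choice is exposed through the algorithm keyword of KNeighborsClassifier ('auto' is the default); a brief sketch:
In [ ]:
# Sketch: explicitly selecting the neighbor-search algorithm in scikit-learn.
from sklearn.neighbors import KNeighborsClassifier

knn_ball = KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree')
knn_kd = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
knn_brute = KNeighborsClassifier(n_neighbors=3, algorithm='brute')
knn_auto = KNeighborsClassifier(n_neighbors=3, algorithm='auto')  # the default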
In [ ]:
# use scikit-learn to classify
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Create the kNN object
knn = KNeighborsClassifier(n_neighbors=3)

# Use the known data
knn.fit(dataSet, labels)

# Predict the unknown (scikit-learn expects a 2D array of shape (n_samples, n_features))
print('Scikit-learn classification: \n{}'.format(knn.predict(np.array([[0, 0, 1]]))))