We are going to learn how to label some unlabeled data based on characteristics it shares with labeled data. In terms of vocabulary, we are going to create a classifier for a specific target variable using feature comparisons.
The starting point is the labeled dataset with various features. In this n-dimensional space, we then calculate the distance between our unknown datum and its k closest neighbors.
The process has three main parts: choosing k, computing distances, and applying the labeling logic to the unknown.
This tutorial uses two main resources: the book Machine Learning In Action and the scikit-learn package.
A common question: Since k is an input, how do we choose the size of k?
We can optimize k before we use kNN. This technique usually involves something like n-fold cross-validation: testing various values of k lets us compare the mean error rates and choose "the best" k-value.
An alternative approach: optimize k while we use kNN. What if we let k vary (for each unknown) depending on the distribution of distances? A simple application would be to calculate all the distances and set k to be the number of neighbors within one standard deviation of the mean distance.
In summary, optimizing k is a great idea. I don't have a great answer for you, but many people seem to like n-fold cross-validation [reference paper].
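As a rough sketch of the first approach (not part of the original tutorial): scikit-learn's cross_val_score can estimate accuracy for several candidate values of k. The built-in iris dataset is used here purely for illustration.
In [ ]:
# Sketch: choose k by comparing cross-validated accuracy for several candidates.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# try several candidate values of k and record the mean accuracy from 5-fold cross-validation
candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidate_ks:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()

# pick the k with the best cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print("mean accuracy per k: {}".format(mean_scores))
print("best k: {}".format(best_k))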
We can use Euclidean distance:
$ \text{distance} = \sqrt{\sum \limits_{i=1}^{n} \left(a_i - b_i \right)^2} $
Alternatively, we could measure distance with other metrics, for example ones suited to curved spaces. The scikit-learn package has various options.
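As a small, illustrative sketch (the points a and b below are made up): Euclidean distance can be computed directly with NumPy, and KNeighborsClassifier accepts a metric keyword if we want something other than the default.
In [ ]:
# Sketch: Euclidean distance by hand, plus an alternative metric in scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

a = np.array([1.0, 1.1, 0.0])
b = np.array([0.0, 0.1, 3.0])

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))
print("euclidean distance: {}".format(euclidean))

# scikit-learn exposes other metrics through the `metric` keyword,
# e.g. Manhattan (city-block) distance:
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric='manhattan')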
The method for computing distance is an important choice, since it directly quantifies the "nearness" of the neighbors used in the labeling logic.
This computation can also be the source of major memory issues. If we can recognize the points "closer" to the unknown without computing and storing distances for the entire dataset, then we can drastically improve the performance of kNN.
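One concrete way to avoid computing and storing every pairwise distance is a spatial index such as scikit-learn's BallTree. The sketch below uses made-up points for illustration.
In [ ]:
# Sketch: query a BallTree for the k nearest neighbors without
# manually computing and storing every distance ourselves.
import numpy as np
from sklearn.neighbors import BallTree

points = np.array([[1.0, 1.1, 0.0],
                   [1.0, 1.0, 1.0],
                   [0.0, 0.0, 20.0],
                   [0.0, 0.1, 3.0]])

tree = BallTree(points)
# distances and indices of the 3 nearest neighbors of the unknown point
dist, ind = tree.query(np.array([[0.0, 0.0, 2.0]]), k=3)
print("distances: {}".format(dist))
print("indices: {}".format(ind))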
Several questions can be considered while building the logic for labeling the unknown using kNN.
A typical technique is to treat the labels of the k nearest neighbors as votes for a particular label. The candidate label with the most votes could win.
However, the decision making behind the returned label is entirely up to the author of the technique.
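As a minimal sketch of the voting idea (the neighbor labels below are made up), Python's collections.Counter can do the tallying; note that ties remain an open design choice.
In [ ]:
# Sketch: tally the labels of the k nearest neighbors as votes.
from collections import Counter

# suppose these are the labels of the k = 3 nearest neighbors
neighbor_labels = ['DOG', 'DOG', 'CAT']

tally = Counter(neighbor_labels)
# most_common(1) returns a list with the (label, count) pair that has the most votes
predicted_label = tally.most_common(1)[0][0]
print("tally: {}".format(tally))
print("predicted label: {}".format(predicted_label))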
In [ ]:
import numpy as np
import operator

def createDataSet():
    # features: [boolean for "poops in a litter box", average number of "morning meows", average number of "kisses per greeting"]
    dataSet = np.array([ [1.0,1.1,0],[1.0,1.0,1],[0,0,20],[0,0.1,3] ])
    #labels = [0,0,1,1]
    labels = np.array(['CAT','CAT','DOG','DOG'])
    return dataSet, labels

dataSet, labels = createDataSet()

print("dataSet: \n{}".format(dataSet))
print()
print("labels: \n{}".format(labels))
print()
print("dataSet.shape: \n{}".format(dataSet.shape))
We can build our own custom kNN classifier based on code similar to that in Machine Learning In Action.
In [ ]:
def kNN(unknown, dataSet, labels, k):
    """
    Apply the kNN algorithm to classify an unknown point.
    """
    # get the number of labeled points in the dataset
    dataSetSize = dataSet.shape[0]
    # tile creates a matrix the same size as dataSet with repeated rows of the unknown
    diffMat = np.tile(unknown, (dataSetSize, 1)) - dataSet
    # DISTANCE CALCULATION (Euclidean distance from the unknown to the labeled points)
    sqDiff = diffMat ** 2
    sqDistances = sqDiff.sum(axis=1)
    distances = sqDistances ** 0.5
    # sort the indices of the distances (closest first)
    sortedDistIndex = distances.argsort()
    # LABEL LOGIC (here the labels of the k nearest neighbors are summed up as votes to define the label)
    classCount = {}
    for i in range(k):
        # grab the label of the i-th nearest neighbor
        voteLabel = labels[sortedDistIndex[i]]
        # tally the labels like votes
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # sort the tally (note: if it's a tie, then what?)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    print("tally: \n{}\n".format(sortedClassCount))
    # return the label with the most votes
    return sortedClassCount[0][0]
In [ ]:
# use the tally to label the unknown
print('CUSTOM classification: \n{}'.format(kNN([0, 0, 2], dataSet, labels, 3)))
We can use the KNeighborsClassifier from scikit-learn to accomplish this task as well, and it gives us several algorithms from which we can choose. The following choices control how the nearest neighbors are found.
BallTree - a decision-making technique for determining which distances to calculate.
KDTree - a decision-making technique for determining which distances to calculate.
Brute - calculates all distances between the unknown and the known labeled points.
Auto - attempts to choose the best algorithm.
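For illustration, this choice is exposed through the algorithm keyword of KNeighborsClassifier ('auto' is the default); a brief sketch:
In [ ]:
# Sketch: explicitly selecting the neighbor-search algorithm in scikit-learn.
from sklearn.neighbors import KNeighborsClassifier

knn_ball = KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree')
knn_kd = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
knn_brute = KNeighborsClassifier(n_neighbors=3, algorithm='brute')
knn_auto = KNeighborsClassifier(n_neighbors=3, algorithm='auto')  # the default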
In [ ]:
# use scikit-learn to classify
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Create the kNN object
knn = KNeighborsClassifier(n_neighbors=3)

# Use the known data
knn.fit(dataSet, labels)

# Predict the unknown (scikit-learn expects a 2D array of shape (n_samples, n_features))
print('Scikit-learn classification: \n{}'.format(knn.predict(np.array([[0, 0, 1]]))))