In this homework notebook we use nearest neighbor classification to classify back injuries for patients in a hospital, based on measurements of the shape and orientation of their pelvis and spine.

The data set contains information from 310 patients. For each patient there are six measurements (the x) and a label (the y). The label has three possible values: 'NO' (normal), 'DH' (herniated disk), or 'SL' (spondylolisthesis).
Note: Before attempting this homework, please go through the *Nearest neighbor for handwritten digit recognition* notebook.
We import all necessary packages for the homework. Notice that we do NOT import any of the sklearn packages. This is because we want you to implement a nearest neighbor classifier manually, as in the *Nearest neighbor for handwritten digit recognition* notebook.
In [ ]:
import numpy as np
We now load the dataset. We divide the data into a training set of 248 patients and a separate test set of 62 patients. The following arrays are created:

* `trainx`: the training data's features, one point per row.
* `trainy`: the training data's labels.
* `testx`: the test data's features, one point per row.
* `testy`: the test data's labels.

We will use the training set (`trainx` and `trainy`), with nearest neighbor classification, to predict labels for the test data (`testx`). We will then compare these predictions with the correct labels, `testy`.

Notice that we code the three labels as `0 = 'NO'`, `1 = 'DH'`, `2 = 'SL'`.
In [ ]:
# Load data set and code labels as 0 = 'NO', 1 = 'DH', 2 = 'SL'
labels = [b'NO', b'DH', b'SL']
data = np.loadtxt('column_3C.dat', converters={6: lambda s: labels.index(s)})
# Separate features from labels
x = data[:,0:6]
y = data[:,6]
# Divide into training and test set
training_indices = list(range(0,20)) + list(range(40,188)) + list(range(230,310))
test_indices = list(range(20,40)) + list(range(188,230))
trainx = x[training_indices,:]
trainy = y[training_indices]
testx = x[test_indices,:]
testy = y[test_indices]
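As an optional sanity check (not part of the graded assignment), you can confirm that the split sizes match the counts above: 248 training points and 62 test points, each with 6 features.

```python
# Expected: (248, 6) (248,) for the training set, (62, 6) (62,) for the test set
print(trainx.shape, trainy.shape)
print(testx.shape, testy.shape)
```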
In this exercise we will build a nearest neighbor classifier based on L2 (Euclidean) distance.
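For two feature vectors $u, v \in \mathbb{R}^6$, the L2 distance is

$$d_{L2}(u, v) = \sqrt{\sum_{i=1}^{6} (u_i - v_i)^2}.$$

Because the square root is monotonically increasing, comparing squared distances selects the same nearest neighbor, so the square root can be skipped in your implementation.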
**For you to do:** Write a function, `NN_L2`, which takes as input the training data (`trainx` and `trainy`) and the test points (`testx`) and predicts labels for these test points using 1-NN classification. These labels should be returned in a `numpy` array with one entry per test point. For `NN_L2`, the L2 norm should be used as the distance metric.
**Code**
# test function
testy_L2 = NN_L2(trainx, trainy, testx)
print( type( testy_L2) )
print( len(testy_L2) )
print( testy_L2[40:50] )
**Output**
<class 'numpy.ndarray'>
62
[ 2. 2. 1. 0. 0. 2. 0. 0. 0. 0.]
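If you get stuck, here is a minimal sketch of one possible approach. The name `NN_L2_sketch` is ours, chosen to avoid clashing with the `NN_L2` you will write; other implementations, including fully vectorized ones, are equally valid.

```python
import numpy as np

def NN_L2_sketch(trainx, trainy, testx):
    # 1-NN with Euclidean distance: for each test point, find the closest
    # training point and copy its label. Comparing squared distances yields
    # the same nearest neighbor, so the square root is skipped.
    preds = np.zeros(len(testx))
    for i, pt in enumerate(testx):
        sq_dists = np.sum((trainx - pt)**2, axis=1)  # distance to every training point
        preds[i] = trainy[np.argmin(sq_dists)]
    return preds
```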
In [ ]:
# Modify this Cell
def NN_L2(trainx, trainy, testx):
    # inputs: trainx, trainy, testx <-- as defined above
    # output: an np.array of the predicted values for testy
    ### BEGIN SOLUTION
    ### END SOLUTION
After you are done, run the cell below to check your function. If an error is triggered, you should go back and revise your function.
In [ ]:
testy_L2 = NN_L2(trainx, trainy, testx)
assert( type( testy_L2).__name__ == 'ndarray' )
assert( len(testy_L2) == 62 )
assert( np.all( testy_L2[50:60] == [ 0., 0., 0., 0., 2., 0., 2., 0., 0., 0.] ) )
assert( np.all( testy_L2[0:10] == [ 0., 0., 0., 1., 1., 0., 1., 0., 0., 1.] ) )
We now compute nearest neighbors using the L1 distance (sometimes called the Manhattan distance).
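For two feature vectors $u, v \in \mathbb{R}^6$, the L1 distance is

$$d_{L1}(u, v) = \sum_{i=1}^{6} |u_i - v_i|.$$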
**For you to do:** Write a function, `NN_L1`, which again takes as input the arrays `trainx`, `trainy`, and `testx`, and predicts labels for the test points using 1-nearest neighbor classification. For `NN_L1`, the L1 distance metric should be used. As before, the predicted labels should be returned in a `numpy` array with one entry per test point.
Notice that `NN_L1` and `NN_L2` may well produce different predictions on the test set.
**Code**
# test function
testy_L2 = NN_L2(trainx, trainy, testx)
testy_L1 = NN_L1(trainx, trainy, testx)
print( type( testy_L1) )
print( len(testy_L1) )
print( testy_L1[40:50] )
print( all(testy_L1 == testy_L2) )
**Output**
<class 'numpy.ndarray'>
62
[ 2. 2. 0. 0. 0. 0. 0. 0. 0. 0.]
False
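As with the L2 version, here is a minimal sketch of one possible approach; only the distance computation changes. The name `NN_L1_sketch` is ours, to keep it separate from the `NN_L1` you will write.

```python
import numpy as np

def NN_L1_sketch(trainx, trainy, testx):
    # 1-NN with Manhattan distance: identical to the L2 version except that
    # distances are sums of absolute coordinate differences.
    preds = np.zeros(len(testx))
    for i, pt in enumerate(testx):
        dists = np.sum(np.abs(trainx - pt), axis=1)  # L1 distance to every training point
        preds[i] = trainy[np.argmin(dists)]
    return preds
```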
In [ ]:
# Modify this Cell
def NN_L1(trainx, trainy, testx):
    # inputs: trainx, trainy, testx <-- as defined above
    # output: an np.array of the predicted values for testy
    ### BEGIN SOLUTION
    ### END SOLUTION
Again, use the following cell to check your code.
In [ ]:
testy_L1 = NN_L1(trainx, trainy, testx)
testy_L2 = NN_L2(trainx, trainy, testx)
assert( type( testy_L1).__name__ == 'ndarray' )
assert( len(testy_L1) == 62 )
assert( not all(testy_L1 == testy_L2) )
assert( all(testy_L1[50:60]== [ 0., 2., 1., 0., 2., 0., 0., 0., 0., 0.]) )
assert( all( testy_L1[0:10] == [ 0., 0., 0., 0., 1., 0., 1., 0., 0., 1.]) )
Let's see if the L1 and L2 distance functions yield different error rates for nearest neighbor classification of the test data.
In [ ]:
def error_rate(testy, testy_fit):
    # Fraction of test points whose predicted label differs from the true label
    return float(sum(testy != testy_fit)) / len(testy)

print("Error rate of NN_L1: ", error_rate(testy, testy_L1))
print("Error rate of NN_L2: ", error_rate(testy, testy_L2))
We will now look a bit more deeply into the specific types of errors made by nearest neighbor classification, by constructing the *confusion matrix*.
Since there are three labels, the confusion matrix is a 3x3 matrix whose rows correspond to the true label and whose columns correspond to the predicted label. For example, the entry at row DH, column SL, contains the number of test points whose correct label was DH but which were classified as SL.
**For you to do:** Write a function, `confusion`, which takes as input the true labels for the test set (that is, `testy`) as well as the predicted labels, and returns the confusion matrix. The confusion matrix should be a `np.array` of shape `(3,3)`.
**Code**
L2_neo = confusion(testy, testy_L2)
print( type(L2_neo) )
print( L2_neo.shape )
print( L2_neo )
**Output**
<class 'numpy.ndarray'>
(3, 3)
[[ 0. 9. 2.]
[ 0. 0. 0.]
[ 3. 0. 0.]]
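If it helps, here is one straightforward way to tally a matrix following the definition above. The name `confusion_sketch` is ours, to keep it separate from the `confusion` you will write.

```python
import numpy as np

def confusion_sketch(testy, testy_fit):
    # Tally counts into a 3x3 matrix: row = true label, column = predicted label.
    # Labels are stored as the floats 0., 1., 2., so cast to int for indexing.
    conf = np.zeros((3, 3))
    for true_label, pred_label in zip(testy, testy_fit):
        conf[int(true_label), int(pred_label)] += 1
    return conf
```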
In [ ]:
# Modify this cell
def confusion(testy, testy_fit):
    # inputs: the correct labels, the fitted NN labels
    # output: a 3x3 np.array representing the confusion matrix as above
    ### BEGIN SOLUTION
    ### END SOLUTION
Now check your code by running the following cell.
In [ ]:
# Test Function
L1_neo = confusion(testy, testy_L1)
assert( type(L1_neo).__name__ == 'ndarray' )
assert( L1_neo.shape == (3,3) )
assert( np.all(L1_neo == [[ 0., 2., 2.],[ 10., 0., 0.],[ 0., 0., 0.]]) )
L2_neo = confusion(testy, testy_L2)
assert( np.all(L2_neo == [[ 0., 1., 2.],[ 10., 0., 0.],[ 0., 0., 0.]]) )
Note down the error rates and confusion matrices computed above, since you will need to enter them as part of this week's assignment.