Nearest neighbor for spine injury classification

In this homework notebook we use nearest neighbor classification to classify back injuries for patients in a hospital, based on measurements of the shape and orientation of their pelvis and spine.

The data set contains information from 310 patients. For each patient there are six measurements (the x) and a label (the y). The label has three possible values: 'NO' (normal), 'DH' (herniated disk), or 'SL' (spondylolisthesis).

Note: Before attempting this homework, please go through the *Nearest neighbor for handwritten digit recognition* notebook.

1. Setup notebook

We import all necessary packages for the homework. Notice that we do NOT import any of the sklearn packages. This is because we want you to implement a nearest neighbor classifier manually, as in the *Nearest neighbor for handwritten digit recognition* notebook.


In [ ]:
import numpy as np

We now load the dataset. We divide the data into a training set of 248 patients and a separate test set of 62 patients. The following arrays are created:

  • trainx : The training data's features, one point per row.
  • trainy : The training data's labels.
  • testx : The test data's features, one point per row.
  • testy : The test data's labels.

We will use the training set (trainx and trainy), with nearest neighbor classification, to predict labels for the test data (testx). We will then compare these predictions with the correct labels, testy.

Notice that we code the three labels numerically, as 0 = 'NO', 1 = 'DH', 2 = 'SL' (stored as floats).


In [ ]:
# Load data set and code labels as 0 = 'NO', 1 = 'DH', 2 = 'SL'
labels = [b'NO', b'DH', b'SL']
data = np.loadtxt('column_3C.dat', converters={6: lambda s: labels.index(s)} )

# Separate features from labels
x = data[:,0:6]
y = data[:,6]

# Divide into training and test set
training_indices = list(range(0,20)) + list(range(40,188)) + list(range(230,310))
test_indices = list(range(20,40)) + list(range(188,230))

trainx = x[training_indices,:]
trainy = y[training_indices]
testx = x[test_indices,:]
testy = y[test_indices]
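
# Optional sanity check (not part of the graded homework): confirm the
# 248/62 split described above, with six features per patient.
print("trainx:", trainx.shape, "trainy:", trainy.shape)   # expect (248, 6) (248,)
print("testx: ", testx.shape, "testy: ", testy.shape)     # expect (62, 6) (62,)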

2. Nearest neighbor classification with L2 distance

In this exercise we will build a nearest neighbor classifier based on L2 (Euclidean) distance: for two points $u, v \in \mathbb{R}^6$, the distance is $\|u - v\|_2 = \sqrt{\sum_{i=1}^{6} (u_i - v_i)^2}$.

**For you to do:** Write a function, NN_L2, which takes as input the training data (trainx and trainy) and the test points (testx) and predicts labels for these test points using 1-NN classification. These labels should be returned in a numpy array with one entry per test point. For NN_L2, the L2 norm should be used as the distance metric.

**Code**

# test function 
testy_L2 = NN_L2(trainx, trainy, testx)
print( type( testy_L2) )
print( len(testy_L2) )
print( testy_L2[40:50] )

**Output**

<class 'numpy.ndarray'>
62
[ 2.  2.  1.  0.  0.  2.  0.  0.  0.  0.]

In [ ]:
# Modify this Cell

def NN_L2(trainx, trainy, testx):
    # inputs: trainx, trainy, testx <-- as defined above
    # output: an np.array of the predicted values for testy 
    
    ### BEGIN SOLUTION
    ### END SOLUTION

After you are done, run the cell below to check your function. If an assertion fails, go back and revise your implementation.


In [ ]:
testy_L2 = NN_L2(trainx, trainy, testx)

assert( type( testy_L2).__name__ == 'ndarray' )
assert( len(testy_L2) == 62 ) 
assert( np.all( testy_L2[50:60] == [ 0.,  0.,  0.,  0.,  2.,  0.,  2.,  0.,  0.,  0.] )  )
assert( np.all( testy_L2[0:10] == [ 0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.] ) )
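
If you get stuck, here is one possible vectorized approach (a minimal sketch under the setup above, not necessarily the intended solution; the name NN_L2_sketch is used only so it does not clobber your own NN_L2):

def NN_L2_sketch(trainx, trainy, testx):
    # Pairwise differences via broadcasting: shape (num_test, num_train, 6)
    diffs = testx[:, np.newaxis, :] - trainx[np.newaxis, :, :]
    # Squared L2 distances, shape (num_test, num_train); minimizing the
    # squared distance is equivalent to minimizing the distance itself
    sq_dists = np.sum(diffs**2, axis=2)
    # For each test point, the index of its nearest training point
    nearest = np.argmin(sq_dists, axis=1)
    # Predict the label of that nearest training point
    return trainy[nearest]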

3. Nearest neighbor classification with L1 distance

We now compute nearest neighbors using the L1 distance (sometimes called Manhattan distance): $\|u - v\|_1 = \sum_{i=1}^{6} |u_i - v_i|$.

**For you to do:** Write a function, NN_L1, which again takes as input the arrays trainx, trainy, and testx, and predicts labels for the test points using 1-nearest neighbor classification. For NN_L1, the L1 distance metric should be used. As before, the predicted labels should be returned in a numpy array with one entry per test point.

Notice that NN_L1 and NN_L2 may well produce different predictions on the test set.

**Code**

# test function 
testy_L2 = NN_L2(trainx, trainy, testx)
testy_L1 = NN_L1(trainx, trainy, testx)

print( type( testy_L1) )
print( len(testy_L1) )
print( testy_L1[40:50] )
print( all(testy_L1 == testy_L2) )

**Output**

<class 'numpy.ndarray'>
62
[ 2.  2.  0.  0.  0.  0.  0.  0.  0.  0.]
False

In [ ]:
# Modify this Cell

def NN_L1(trainx, trainy, testx):
    # inputs: trainx, trainy, testx <-- as defined above
    # output: an np.array of the predicted values for testy 
    
    ### BEGIN SOLUTION
    ### END SOLUTION

Again, use the following cell to check your code.


In [ ]:
testy_L1 = NN_L1(trainx, trainy, testx)
testy_L2 = NN_L2(trainx, trainy, testx)

assert( type( testy_L1).__name__ == 'ndarray' )
assert( len(testy_L1) == 62 ) 
assert( not all(testy_L1 == testy_L2) )
assert( all(testy_L1[50:60]== [ 0.,  2.,  1.,  0.,  2.,  0.,  0.,  0.,  0.,  0.]) )
assert( all( testy_L1[0:10] == [ 0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.]) )
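
The same broadcasting pattern works for L1; only the distance computation changes. A minimal sketch, again under the setup above (the name NN_L1_sketch is hypothetical):

def NN_L1_sketch(trainx, trainy, testx):
    # L1 distances: sum of absolute coordinate differences, shape (num_test, num_train)
    dists = np.sum(np.abs(testx[:, np.newaxis, :] - trainx[np.newaxis, :, :]), axis=2)
    nearest = np.argmin(dists, axis=1)   # nearest training point per test point
    return trainy[nearest]               # its label is the prediction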

4. Test errors and the confusion matrix

Let's see if the L1 and L2 distance functions yield different error rates for nearest neighbor classification of the test data.


In [ ]:
def error_rate(testy, testy_fit):
    # Fraction of test points whose predicted label differs from the true label
    return float(np.sum(testy != testy_fit)) / len(testy)

print("Error rate of NN_L1: ", error_rate(testy,testy_L1) )
print("Error rate of NN_L2: ", error_rate(testy,testy_L2) )

We will now look a bit more deeply into the specific types of errors made by nearest neighbor classification, by constructing the *confusion matrix*.

Since there are three labels, the confusion matrix is a 3x3 matrix whose rows correspond to the true label and whose columns correspond to the predicted label; both are ordered by label code (NO, DH, SL). For example, the entry at row DH, column SL contains the number of test points whose true label was DH but which were classified as SL.

**For you to do:** Write a function, confusion, which takes as input the true labels for the test set (that is, testy) as well as the predicted labels, and returns the confusion matrix as an np.array of shape (3,3).

**Code**

L2_neo = confusion(testy, testy_L2)  
print( type(L2_neo) )
print( L2_neo.shape )
print( L2_neo )

**Output**

<class 'numpy.ndarray'>
(3, 3)
[[ 0.  9.  2.]
 [ 0.  0.  0.]
 [ 3.  0.  0.]]

In [ ]:
# Modify this cell

def confusion(testy,testy_fit):
    # inputs: the correct labels, the fitted NN labels 
    # output: a 3x3 np.array representing the confusion matrix as above
    
    ### BEGIN SOLUTION
    ### END SOLUTION

Now check your code by running the following cell.


In [ ]:
# Test Function

L1_neo = confusion(testy, testy_L1) 
assert( type(L1_neo).__name__ == 'ndarray' )
assert( L1_neo.shape == (3,3) )
assert( np.all(L1_neo == [[ 0.,  2.,  2.],[ 10.,  0.,  0.],[ 0.,  0.,  0.]]) )
L2_neo = confusion(testy, testy_L2)  
assert( np.all(L2_neo == [[ 0.,  1.,  2.],[ 10.,  0.,  0.],[ 0.,  0.,  0.]]) )
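
One straightforward way to build the matrix is a single pass over the (true, predicted) label pairs. A minimal sketch (the name confusion_sketch is hypothetical, to avoid clobbering your own confusion):

def confusion_sketch(testy, testy_fit):
    # conf[i, j] = number of test points with true label i and predicted label j
    conf = np.zeros((3, 3))
    for t, p in zip(testy, testy_fit):
        conf[int(t), int(p)] += 1
    return conf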

5. Some questions for you

Note down the answers to these, since you will need to enter them as part of this week's assignment.

  • In the test set, which class (NO, DH, or SL) was most frequently misclassified by the L1-based nearest neighbor classifier?
  • In the test set, which class (NO, DH, or SL) was never misclassified by the L2-based nearest neighbor classifier?
  • On how many of the test points did the two classification schemes (based on L1 and L2 distance) yield different predictions? (A one-line way to compute this is sketched below.)
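
All three answers can be read off from the arrays and confusion matrices computed above. For instance, the disagreement count for the last question takes one line (assuming testy_L1 and testy_L2 are still defined):

# Number of test points on which the L1- and L2-based classifiers disagree
print( np.sum(testy_L1 != testy_L2) )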