Module 13 - Programming Assignment

This is the notebook for the Module 13 Programming Assignment.

Here are a few tips for using the iPython HTML notebook:

  1. You can use tab completion. Try typing le<tab> and see the available functions.
  2. You can change the type of cell by picking "Code" or "Markdown" from the menu at the left.
  3. If you keep typing in a Markdown text area, you will eventually get scroll bars. To prevent this, hit return when you come to the end of the window. Only a double return creates a new paragraph.
  4. You can find out more about Markdown text here: http://daringfireball.net/projects/markdown/ (Copy this link and put it in another tab for reference--Don't click it or you'll leave your notebook!).
  5. Every so often, restart the kernel, clear all output and run all code cells so you can be certain that you didn't define something out of order.

You should rename this notebook to be <your JHED id>_PR13.ipynb. Do it right now.

Make certain the entire notebook executes before you submit it. (See Hint #5 above)

Change the following variables:


In [1]:
name = "Shehzad Qureshi"
jhed_id = "squresh6"
if name == "Student Name" or jhed_id == "sname1":
    raise Exception("Change the name and/or JHED ID...preferably to yours.")

Add whatever additional imports you require here. Stick with the standard libraries and those required by the class. The import gives you access to these functions: http://ipython.org/ipython-doc/stable/api/generated/IPython.core.display.html (Copy this link) Which, among other things, will permit you to display HTML as the result of evaluated code (see HTML() or display_html()).


In [2]:
from IPython.core.display import *
import numpy as np
from scipy import stats
from operator import itemgetter
import math

The goals of this assignment are three-fold:

  1. Implement k-Nearest Neighbor regression as described in Module 13.
  2. Use the evaluation procedure described in Module 9 to determine the best value of k. For this you can simply split the data randomly into a training and test set with a 67/33 split.
  3. Use 10-fold cross-validation to establish confidence bounds on your model: compute the mean squared error (don't use the "1/2" version) and the variance of the squared error, and use these to estimate the 95% confidence interval (the MSE formula is shown below).

Use the data in concrete_compressive_strength.csv for this assignment.
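
For reference, the mean squared error used throughout this notebook is the plain average of squared residuals, with no 1/2 factor:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$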

k-nearest neighbor

We'll start with our function for k-nearest neighbor. This is the naive implementation: it calculates the Euclidean distance between the query and every row of the data set, then returns the average of the y values of the k rows with the smallest distances.

I opted to use NumPy's vectorized norm routine (np.linalg.norm) instead of manually calculating each distance. This is several orders of magnitude faster!
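
For a query $q$ and a data row with features $x$, the distance used is the Euclidean (L2) distance,

$$d(x, q) = \lVert x - q \rVert_2 = \sqrt{\sum_j (x_j - q_j)^2},$$

which np.linalg.norm(x - query, axis=1) computes for every row of the feature matrix in a single call.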


In [3]:
def knn(k, dataset, query):
    # split off the target column (y) from the feature columns (x)
    y, x = np.hsplit(dataset, np.array([1]))
    # Euclidean distance from the query to every row, in one vectorized call
    distances = np.linalg.norm(x - query, axis=1)
    # pair each distance with its target value and keep the k closest
    nearest = sorted(zip(distances, dataset[:,0]), key=itemgetter(0))[:k]
    # the prediction is the average of the k nearest target values
    return np.average(np.array(zip(*nearest)[1]))

This is a simple test to show that knn works. We use the last row of the data set as our query and remove it from the data set.


In [4]:
dataset = np.loadtxt('concrete_compressive_strength.csv', delimiter=',')
query = dataset[-1:][0][1:]
dataset = dataset[:-1]
knn(3, dataset, query)


Out[4]:
40.32

Model Evaluation

We need to determine the best value of k for our k-nearest neighbor algorithm. The function below partitions the data set into a training and validation set: it splits the data into the requested number of segments, uses one segment as the test set, and concatenates the remaining segments into the training set.

This function will be reused once again for cross-validation.


In [5]:
def partition(dataset, fold, k):
    # split the data into k roughly equal segments
    segments = np.array_split(dataset, k)
    # the requested (1-indexed) fold becomes the test set
    test = segments[fold-1]
    # the remaining segments are concatenated into the training set
    training = [segments[i] for i in xrange(k) if i != (fold-1)]
    return np.concatenate(training), test

This function determines the best k value from a specified range of values (defaults to [1, 9]). It divides the data into training and validation sets and, for each candidate k, tests every example in the validation set against the training set. The mean squared error over the validation set is recorded for each k, and the best (i.e. lowest-error) k value is returned. The default uses a 67/33 training/validation split.


In [6]:
def determine_best_k_value(dataset, split=3, low_k=1, high_k=9):
    # split=3 yields a 67/33 training/validation split (validation is one of three segments)
    training_set, test_set = partition(dataset, split, split)
    error = []
    for kval in xrange(low_k, high_k + 1):
        # squared error of each validation example predicted from the training set
        t = [(knn(kval, training_set, query[1:]) - query[0])**2 for query in test_set]
        error.append((kval, np.average(np.array(t))))
    # return the k with the lowest mean squared error
    return sorted(error, key=itemgetter(1))[0][0]

This function is a wrapper around the above function. It runs through a specified number of iterations (default: 100), shuffling the dataset each time to randomize the training and validation sets, and records the best k value found on each iteration. The most frequently chosen k is returned.


In [7]:
def find_best_k(dataset, iterations=100):
    kval = []
    for i in xrange(iterations):
        # reshuffle so each iteration sees a different training/validation split
        np.random.shuffle(dataset)
        val = determine_best_k_value(dataset)
        kval.append(val)
    # show the histogram of chosen k values and return the most frequent one
    print np.bincount(kval)
    return np.argmax(np.bincount(kval))

And now we try to find the best k value using the given data.


In [8]:
dataset = np.loadtxt('concrete_compressive_strength.csv', delimiter=',')
%time best_k = find_best_k(dataset)
print best_k


[ 0 11 21 37 19  4  4  2  1  1]
Wall time: 10min 46s
3

So our best k value is determined to be 3. Without the vectorized norm optimization each knn evaluation would take 3 minutes, so this is significantly better!

Cross validation

We will now do a 10-fold cross validation of our model. This means we partition our data into 10 "folds" and then iteratively use each fold as a validation set against the rest of the data (the training set), computing the mean squared error for each fold.


In [9]:
def cross_validation(dataset, k, folds=10):
    error = []
    for i in xrange(1, folds+1):
        # use fold i as the test set and the remaining folds as training data
        training_set, test_set = partition(dataset, i, folds)
        # mean squared error over this fold
        t = [(knn(k, training_set, query[1:]) - query[0])**2 for query in test_set]
        error.append(np.average(np.array(t)))
    return error

This function calculates the lower and upper endpoints of the confidence interval for the mean, given a confidence level. Wikipedia wasn't as helpful as I hoped it would be, so after some googling, this StackOverflow answer really helped.

http://stackoverflow.com/a/15034143
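
The interval computed below is the standard Student's t confidence interval for the mean of the per-fold errors,

$$\bar{e} \pm t_{\frac{1+c}{2},\,n-1} \cdot \frac{s}{\sqrt{n}},$$

where $\bar{e}$ is the mean of the fold errors, $s/\sqrt{n}$ is the standard error of the mean (stats.sem), $n$ is the number of folds, and $c$ is the confidence level.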


In [10]:
def confidence_interval(data, confidence=0.95):
    a = 1.0*np.array(data)
    n = len(a)
    # sample mean and standard error of the mean
    m, se = np.mean(a), stats.sem(a)
    # half-width of the interval from the t distribution with n-1 degrees of freedom
    h = se * stats.t.ppf((1+confidence)/2., n-1)
    return m-h, m+h

And now we determine our mean-squared error, variance of squared error and confidence interval using the k value we found above.


In [11]:
dataset = np.loadtxt('concrete_compressive_strength.csv', delimiter=',')
error = cross_validation(dataset, best_k)
print "Mean: ", np.mean(np.array(error))
print "Variance:", np.var(np.array(error))
print "Confidence Interval: ", confidence_interval(error)


Mean:  167.001813355
Variance: 21582.9501497
Confidence Interval:  (56.222945289799938, 277.78068142001672)

Summary

k-nearest neighbor attempts to perform the same function as linear regression, except that it doesn't build an explicit model. Instead, it uses a distance metric to find the k data points nearest to the query and returns the average of their y values as the prediction. The metric used above was the Euclidean distance between the query and the data points.
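
In equation form, the prediction the knn function computes for a query $q$ is the mean of the targets of its $k$ nearest neighbors:

$$\hat{y}(q) = \frac{1}{k} \sum_{i \in N_k(q)} y_i, \qquad N_k(q) = \text{the } k \text{ rows with the smallest } \lVert x_i - q \rVert_2.$$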

We can use a model-evaluation procedure to determine what the best value of k should be. We divided the data into a 2/1 split and used the former as training data and the latter as validation data. We then tested each example in the validation set against the training set and computed the mean squared error for each candidate k value. This process was repeated with a reshuffled data set on every iteration for 100 iterations. What is fascinating about the results is that the histogram forms a nice bell curve around the best value of 3. Values of 2 and 4 were also chosen fairly often, but every other value tapered off. Values of k > 7 were rarely selected and are probably a waste of resources.

Finally we performed a 10-fold cross-validation on our model. This involved running through the data with 10 folds, similar to the above, except that this time we used our best k value. We determined the mean squared error, variance of squared error and the confidence intervals.

One thing I would have liked to have done was to perform mean normalization on the data. Unfortunately lack of time prevented this. I think mean normalization would have reduced the cross-validation mean and variance. I'm not sure whether it would have changed which k value the evaluation procedure picks, though.
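
As a rough sketch of what that could look like (this was not run for the results above, and mean_normalize is a hypothetical helper, not part of the assignment code), the feature columns could be mean-normalized with NumPy while leaving the target column untouched:

def mean_normalize(dataset):
    # hypothetical helper: scale each feature column to zero mean and unit range,
    # leaving the target values in column 0 unchanged
    normalized = dataset.copy()
    features = normalized[:, 1:]
    col_range = features.max(axis=0) - features.min(axis=0)
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    normalized[:, 1:] = (features - features.mean(axis=0)) / col_range
    return normalized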

