This is the notebook for the Module 13 Programming Assignment.
Here are a few tips for using the IPython HTML notebook:
You should rename this notebook to <your JHED id>_PR13.ipynb. Do it right now.
Make certain the entire notebook executes before you submit it. (See Hint #5 above.)
Change the following variables:
In [1]:
name = "Shehzad Qureshi"
jhed_id = "squresh6"
if name == "Student Name" or jhed_id == "sname1":
    raise Exception("Change the name and/or JHED ID...preferably to yours.")
Add whatever additional imports you require here. Stick with the standard libraries and those required by the class. The import below gives you access to these functions: http://ipython.org/ipython-doc/stable/api/generated/IPython.core.display.html which, among other things, will permit you to display HTML as the result of evaluated code (see HTML() or display_html()).
In [2]:
from IPython.core.display import *
import numpy as np
from scipy import stats
from operator import itemgetter
import math
The goals of this assignment are three-fold:
1. Implement the k-nearest neighbor algorithm for regression.
2. Use a training/validation split to determine the best value of k.
3. Evaluate the resulting model with 10-fold cross-validation, reporting the mean squared error, its variance, and a confidence interval.
Use the data in concrete_compressive_strength.csv for this assignment.
We'll start with our function for k-nearest neighbor. This is the naive implementation that calculates the Euclidean distance between the query and every row of the data set. It then returns the average of the y values of the k rows with the smallest distances.
I opted to use Numpy's fantastic normalization routine instead of manually calculating each distance. This method is several orders of magnitude faster!
In [3]:
def knn(k, dataset, query):
    # column 0 is the target y; the remaining columns are the features
    y, x = np.hsplit(dataset, np.array([1]))
    # Euclidean distance from the query to every row, fully vectorized
    distances = np.linalg.norm(x - query, axis=1)
    # keep the k rows with the smallest distances and average their y values
    nearest = sorted(zip(distances, dataset[:, 0]), key=itemgetter(0))[:k]
    return np.average(np.array(zip(*nearest)[1]))
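As an aside, the sketch below (not part of the original notebook; the helper name euclidean_distances_loop is hypothetical) shows the manual, row-by-row distance computation the paragraph above refers to, and checks that the vectorized np.linalg.norm call produces the same values:
def euclidean_distances_loop(x, query):
    # slow, "manual" version: one Euclidean distance per row
    return np.array([sum((xi - qi) ** 2 for xi, qi in zip(row, query)) ** 0.5
                     for row in x])

x_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
q_demo = np.array([1.0, 1.0])
# both approaches should agree to floating-point precision
assert np.allclose(euclidean_distances_loop(x_demo, q_demo),
                   np.linalg.norm(x_demo - q_demo, axis=1))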
This is a simple test to show that knn works. We use the last row of the data set as our query.
In [4]:
dataset = np.loadtxt('concrete_compressive_strength.csv', delimiter=',')
query = dataset[-1:][0][1:]   # features of the last row (drop the y column)
dataset = dataset[:-1]        # remove that row from the data set
knn(3, dataset, query)
Out[4]:
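As an additional sanity check (not part of the original notebook, and using a made-up toy dataset), with k=1 the prediction should simply be the y value of the single closest row:
# toy data: column 0 is y, the remaining columns are features
toy = np.array([[10.0, 0.0, 0.0],
                [20.0, 1.0, 1.0],
                [30.0, 5.0, 5.0]])
print(knn(1, toy, np.array([0.9, 0.9])))   # expect 20.0 (closest row)
print(knn(2, toy, np.array([0.9, 0.9])))   # expect 15.0 (average of 20.0 and 10.0)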
We need to determine the best value of k for our k-nearest neighbor algorithm. The function below partitions the data set into a training set and a validation set: it splits the data into k segments, holds out one segment as the test (validation) set, and concatenates the remaining segments into the training set.
This function will be reused once again for cross-validation.
In [5]:
def partition(dataset, fold, k):
    # split into k roughly equal segments; fold is 1-based
    segments = np.array_split(dataset, k)
    test = segments[fold - 1]
    training = [segments[i] for i in xrange(k) if i != (fold - 1)]
    return np.concatenate(training), test
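A quick illustration of what partition returns (a hypothetical toy example, not part of the assignment data): with 8 rows and 4 folds, each call yields 6 training rows and 2 test rows.
demo = np.arange(16).reshape(8, 2)   # 8 rows, 2 columns
training, test = partition(demo, 1, 4)
print(training.shape)   # (6, 2) -- the three folds that were not held out
print(test.shape)       # (2, 2) -- fold 1, held out for validation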
This function determines the best k value from a specified range of values (defaults to [1, 9]). It divides the data into training and validation sets and tests each example in the validation set against the training set for each candidate k value. The mean squared error over the validation set is recorded for each k, and the best (i.e. lowest-error) k value is returned. The default uses a 67/33 training/validation split.
In [6]:
def determine_best_k_value(dataset, split=3, low_k=1, high_k=9):
    # hold out one of `split` segments as the validation set (67/33 by default)
    training_set, test_set = partition(dataset, split, split)
    error = []
    for kval in xrange(low_k, high_k + 1):
        # squared error of the knn prediction for every validation example
        t = [(knn(kval, training_set, query[1:]) - query[0]) ** 2 for query in test_set]
        error.append((kval, np.average(np.array(t))))
    # return the k with the smallest mean squared error
    return sorted(error, key=itemgetter(1))[0][0]
This function is a wrapper for the above function to find the best k value. It runs through a specified number of iterations (default: 100) and finds the best k value for each iteration. Each iteration also uses a shuffled instance of the dataset to randomize the training and validation sets.
In [7]:
def find_best_k(dataset, iterations=100):
    kval = []
    for i in xrange(iterations):
        # shuffle in place so each iteration gets a different train/validation split
        np.random.shuffle(dataset)
        val = determine_best_k_value(dataset)
        kval.append(val)
    # histogram of how often each k was selected as best
    print np.bincount(kval)
    return np.argmax(np.bincount(kval))
And now we try to find the best k value using the given data.
In [8]:
dataset = np.loadtxt('concrete_compressive_strength.csv', delimiter=',')
%time best_k = find_best_k(dataset)
print best_k
So our best k value is determined to be 3. Without the optimization of using Numpy's normalization routine, each knn evaluation would take 3 minutes, so this is significantly better!
Next we define our k-fold cross-validation routine. It reuses the partition function above and returns the mean squared error of each fold.
In [9]:
def cross_validation(dataset, k, folds=10):
    error = []
    for i in xrange(1, folds + 1):
        # hold out fold i as the test set, train on the remaining folds
        training_set, test_set = partition(dataset, i, folds)
        t = [(knn(k, training_set, query[1:]) - query[0]) ** 2 for query in test_set]
        error.append(np.average(np.array(t)))
    return error
This function calculates the lower and upper endpoints (the confidence interval) for a given confidence level. Wikipedia wasn't as helpful as I hoped it would be, so after some googling, this StackOverflow link really helped.
In [10]:
def confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    # half-width of the interval, using the Student-t critical value with n-1 dof
    h = se * stats.t.ppf((1 + confidence) / 2., n - 1)
    return m - h, m + h
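As a sketch of a cross-check (not in the original notebook), SciPy's stats.t.interval should produce the same endpoints as the hand-rolled formula above, since it applies the same Student-t interval around the mean:
sample = [2.1, 2.5, 1.9, 2.4, 2.2, 2.6]   # made-up data for illustration
lo, hi = confidence_interval(sample)
lo2, hi2 = stats.t.interval(0.95, len(sample) - 1,
                            loc=np.mean(sample), scale=stats.sem(sample))
print((lo, hi))
print((lo2, hi2))   # expected to agree to floating-point precision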
And now we determine our mean squared error, the variance of the squared error, and the confidence interval using the k value we found above.
In [11]:
dataset = np.loadtxt('concrete_compressive_strength.csv', delimiter=',')
error = cross_validation(dataset, best_k)
print "Mean: ", np.mean(np.array(error))
print "Variance:", np.var(np.array(error))
print "Confidence Interval: ", confidence_interval(error)
k-nearest neighbor attempts to perform the same function as linear regression, except that it doesn't build a model. Instead, it uses a distance metric to find the k nearest data points and returns the average of their y values. The metric used above was the Euclidean distance between the query and the data points.
We can use a learning procedure to determine what the best value of k should be. We divided the data into a 67/33 split and used the former as training data and the latter as validation data. We then tested each example in the validation set against the training set and computed the mean squared error for each candidate k value. This process was repeated with a reshuffled data set on every iteration for 100 iterations to obtain a good k value. What is fascinating about the results of this process is that the histogram of selected values forms a nice bell curve around the best value of 3. Values of 2 and 4 were also chosen fairly often, but every other value tapered off. Values of k > 7 were rarely, if ever, selected and are probably a waste of resources.
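A minimal sketch of how that histogram could be visualized (not part of the original notebook; it assumes matplotlib is available and simply re-collects the per-iteration winners):
import matplotlib.pyplot as plt

winners = []
for i in range(100):
    np.random.shuffle(dataset)
    winners.append(determine_best_k_value(dataset))

counts = np.bincount(winners)          # how often each k was selected as best
plt.bar(np.arange(len(counts)), counts)
plt.xlabel('k')
plt.ylabel('times selected as best')
plt.show()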
Finally, we performed a 10-fold cross-validation on our model. This involved running through the data in 10 folds, similar to the above, except that this time we used our best k value. We determined the mean squared error, the variance of the squared error, and the confidence interval.
One thing I would have liked to have done was to perform mean normalization on the data. Unfortunately, a lack of time prevented this. I think mean normalization would have reduced the cross-validation mean and variance, though I'm not sure whether it would skew the selection of the best k value. A sketch of what that might look like follows.
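A minimal sketch of the mean normalization described above (not implemented or evaluated in this notebook; mean_normalize is a hypothetical helper), assuming we scale only the feature columns and leave the target column untouched:
def mean_normalize(dataset):
    # column 0 is the target y and is left as-is; each feature column is scaled
    # to (x - mean) / (max - min)
    normalized = dataset.astype(float)
    features = normalized[:, 1:]
    col_range = features.max(axis=0) - features.min(axis=0)
    col_range[col_range == 0] = 1.0   # guard against constant columns
    normalized[:, 1:] = (features - features.mean(axis=0)) / col_range
    return normalized

# e.g. one could then re-run: cross_validation(mean_normalize(dataset), best_k)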