In [1]:
from sklearn.datasets import load_boston
from sklearn.cross_validation import train_test_split  # moved to sklearn.model_selection in 0.18+
from sklearn.preprocessing import scale
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.cross_validation import KFold  # moved to sklearn.model_selection in 0.18+
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
For this exercise, we will be using a dataset of housing prices in Boston during the 1970s. Python's super-awesome sklearn package already has the data we need to get started. Below is the command to load the data. The data is stored as a dictionary.
The 'DESCR' key holds a description of the data, and the command for printing it is below. Note all the features we have to work with. From the dictionary, we need the data and the target variable (in this case, housing price). Store these as variables named "data" and "price", respectively. Once you have these, print their shapes to see that everything checks out with the DESCR.
In [2]:
boston = load_boston()
print boston.DESCR
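Since the returned object behaves like a dictionary, a quick way to see what it holds (a minimal check; the exact keys can vary a bit across sklearn versions):
print boston.keys()  # expect entries like 'data', 'target', 'feature_names', 'DESCR'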
In [3]:
data = boston.data
price = boston.target
In [4]:
print data.shape
print price.shape
Now, using sklearn's train_test_split (I've already imported it for you), let's make a random train-test split with the test size equal to 30% of our data (i.e. set the test_size parameter to 0.3). For consistency, let's also set the random_state parameter to 11.
Name the variables data_train, price_train for the training data and data_test, price_test for the test data. As a sanity check, let's also print the shapes of these variables.
In [6]:
data_train, data_test, price_train, price_test = train_test_split(data, price, test_size=0.30, random_state=11)
In [7]:
print data_train.shape
print data_test.shape
print price_train.shape
print price_test.shape
Before we get too far ahead, let's scale our data. Let's subtract the min from each column (feature) and divide by the difference between the max and min for each column.
Here's where things can get tricky. Remember, our test data is unseen, yet we need to scale it. We cannot scale using its own min/max, because the test data is unseen and might not be available to us en masse. Instead, we use the training min/max to scale the test data.
Be sure to check which axis you use to take the mins/maxes!
Let's add a "_stand" suffix to our train/test variable names for the standardized values.
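As an aside, sklearn's MinMaxScaler implements this same fit-on-train, transform-test pattern, so you can sanity-check your manual version against it. A minimal sketch (the _mm-suffixed names are just for this comparison):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_train_mm = scaler.fit_transform(data_train)  # learns the training min/max
data_test_mm = scaler.transform(data_test)        # reuses the training min/max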
In [8]:
mins = np.min(data_train, axis=0)   # per-feature (column) minima
maxes = np.max(data_train, axis=0)  # per-feature maxima
diff = maxes - mins                 # per-feature ranges
In [9]:
diff
In [10]:
data_train_stand = (data_train - mins) / diff
data_test_stand = (data_test - mins) / diff   # scaled with the *training* min/max
In [11]:
minPrice = np.min(price_train)
maxPrice = np.max(price_train)
diffPrice = maxPrice - minPrice
In [12]:
price_train_stand = (price_train - minPrice) / diffPrice
price_test_stand = (price_test - minPrice) / diffPrice
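One thing worth eyeballing: because we reused the training min/max, the scaled training prices land exactly in [0, 1], while the scaled test prices can spill slightly outside that range. A quick check:
# train prices scale exactly to [0, 1]; test prices may fall outside
# because they were scaled with the *training* min/max
print price_train_stand.min(), price_train_stand.max()
print price_test_stand.min(), price_test_stand.max()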
Now, here's where things might get really messy. Let's implement 10-fold cross-validation for K-NN across a range of K values (given below - 9 total). We'll keep our K for K-fold CV constant at 10.
Let's measure error with RMSE (root-mean-squared error), with K-NN itself using Euclidean distance. Save the error for each fold at each K value (10 folds x 9 K values = 90 values) as you loop through.
Take a look at sklearn's K-fold CV. Also, sklearn has its own K-NN implementation, as well as an implementation of mean squared error, though you'll have to take the root yourself. I've imported these for you already. :)
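Concretely, with hypothetical arrays y_true (actual prices) and y_pred (predictions), the RMSE is just the square root of sklearn's MSE:
# illustration only -- y_true and y_pred stand in for real arrays
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# or, by hand:
rmse_by_hand = np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))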
In [12]:
kValues = [1, 2, 3, 4, 5, 10, 20, 40, 80]
In [13]:
# 10-fold CV over the training set; with shuffle=True, passing random_state
# would make the folds repeatable
folds = KFold(len(data_train_stand), n_folds=10, shuffle=True)
In [14]:
for train_index, val_index in folds:
    print train_index
In [15]:
scores = {}
for k in kValues:
    currentScores = []
    for train_index, val_index in folds:
        # split the standardized training set into CV-train and validation folds
        current_train_data, current_val_data = data_train_stand[train_index], data_train_stand[val_index]
        current_train_price, current_val_price = price_train_stand[train_index], price_train_stand[val_index]
        neigh = KNeighborsRegressor(n_neighbors=k)
        neigh.fit(current_train_data, current_train_price)
        guesses = neigh.predict(current_val_data)
        rmse = np.sqrt(mean_squared_error(current_val_price, guesses))
        currentScores.append(rmse)
    scores[k] = currentScores  # 10 fold-level RMSEs for this K
In [16]:
keys = sorted(scores.keys())
means = []
stdevs = []
for each in keys:
    current = scores[each]
    means.append(np.mean(current))
    stdevs.append(np.std(current))
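Before plotting, it can help to eyeball the numbers directly; a quick summary along these lines (the exact winner depends on the shuffled folds):
for k, m, s in zip(keys, means, stdevs):
    print 'K = %2d: mean RMSE = %.4f +/- %.4f' % (k, m, s)
print 'lowest mean CV RMSE at K =', keys[np.argmin(means)]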
In [17]:
figure = plt.figure()
plt.plot(keys, means, 'bo:')
plt.xlabel('K')
plt.ylabel('Normalized RMSE')
plt.title('Error as a function of # of neighbors')
In [18]:
figure = plt.figure()
plt.plot(keys, means, 'bo:')
plt.xlabel('K')
plt.ylabel('Normalized RMSE')
plt.title('Error as a function of # of neighbors')
plt.xlim([0, 10]);
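If you want to close the loop, here is a sketch of the final step (assuming we pick the K that minimized the mean CV error above): refit on the full training set, then score once on the held-out test data.
bestK = keys[np.argmin(means)]
neigh = KNeighborsRegressor(n_neighbors=bestK)
neigh.fit(data_train_stand, price_train_stand)
test_rmse = np.sqrt(mean_squared_error(price_test_stand, neigh.predict(data_test_stand)))
print 'test RMSE at K = %d: %.4f' % (bestK, test_rmse)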