Notebook version: 2.2 (Sep 08, 2017)
Changes: v.1.0 - First version
Changes: v.1.1 - Stock dataset included.
Changes: v.2.0 - Notebook for UTAD course. Advertising data incorporated
Changes: v.2.1 - Text and code revisited. General introduction removed.
Changes: v.2.2 - Compatibility with python 2 and 3.
In [1]:
# Import some libraries that will be necessary for working with data and displaying plots
# To visualize plots in the notebook
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pylab
# Packages used to read datasets
import scipy.io # To read matlab files
import pandas as pd # To read datasets in csv format
# For the student tests (only for python 2)
import sys
if sys.version_info.major == 2:
    from test_helper import Test
# That's default image size for this interactive session
pylab.rcParams['figure.figsize'] = 9, 6
We describe next the regression task that we will use in the session. The dataset is an adaptation of the `STOCK` dataset, taken originally from the StatLib Repository. The goal of this problem is to predict the value of the stock of a given airplane company, given the values of the stocks of another 9 companies on the same day.
If you are reading this text from the python notebook with its full functionality, you can explore the results of the regression experiments using two alternative datasets:
- The `CONCRETE` dataset, taken from the Machine Learning Repository at the University of California Irvine. The goal of the CONCRETE task is to predict the compressive strength of cement mixtures based on eight observed variables related to the composition of the mixture and the age of the material.
- The Advertising dataset, taken from the book An Introduction to Statistical Learning with applications in R, with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani. The goal of this problem is to predict the sales of a given product, knowing the investment in different advertising sectors. More specifically, the input and output variables can be described as follows:
  - Input features: the advertising budgets invested in TV, Radio and Newspaper campaigns.
  - Response variable: the Sales of the product.
To do so, just replace `stock` by `concrete` or `advertising` in the next cell. Remember that you must run the cells again to see the changes.
In [2]:
# SELECT dataset
# Available options are 'stock', 'concrete' or 'advertising'
ds_name = 'stock'
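If you are curious about the raw Advertising data before loading it, the following optional snippet (not part of the original notebook; it assumes the same `datasets/Advertising.csv` path used in the loading cell below) previews the first rows with pandas:

```python
# Optional preview of the first rows of the Advertising dataset.
# Assumes the CSV file is available at the path used in the loading cell below.
df_preview = pd.read_csv('datasets/Advertising.csv', header=0)
print(df_preview.head())   # First rows of the table
print(df_preview.shape)    # Number of samples and columns
```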
In [3]:
# Let us start by loading the data into the workspace, and visualizing the dimensions of all matrices
if ds_name == 'stock':
    # STOCK DATASET
    data = scipy.io.loadmat('datasets/stock.mat')
    X_tr = data['xTrain']
    S_tr = data['sTrain']
    X_tst = data['xTest']
    S_tst = data['sTest']
elif ds_name == 'concrete':
    # CONCRETE DATASET
    data = scipy.io.loadmat('datasets/concrete.mat')
    X_tr = data['X_tr']
    S_tr = data['S_tr']
    X_tst = data['X_tst']
    S_tst = data['S_tst']
elif ds_name == 'advertising':
    # ADVERTISING DATASET
    df = pd.read_csv('datasets/Advertising.csv', header=0)
    X_tr = df.values[:150, 1:4]
    S_tr = df.values[:150, [-1]]  # The brackets around -1 make sure S_tr is a column vector, as in the other datasets
    X_tst = df.values[150:, 1:4]
    S_tst = df.values[150:, [-1]]
else:
    print('Unknown dataset')
# Print the data dimension and the dataset sizes
print("SELECTED DATASET: " + ds_name)
print("---- The size of the training set is {0}, that is: {1} samples with dimension {2}.".format(
X_tr.shape, X_tr.shape[0], X_tr.shape[1]))
print("---- The target variable of the training set contains {0} samples with dimension {1}".format(
S_tr.shape[0], S_tr.shape[1]))
print("---- The size of the test set is {0}, that is: {1} samples with dimension {2}.".format(
X_tst.shape, X_tst.shape[0], X_tst.shape[1]))
print("---- The target variable of the test set contains {0} samples with dimension {1}".format(
S_tst.shape[0], S_tst.shape[1]))
In [4]:
pylab.subplots_adjust(hspace=0.2)
for idx in range(X_tr.shape[1]):
    ax1 = plt.subplot(3,3,idx+1)
    ax1.plot(X_tr[:,idx],S_tr,'.')
    ax1.get_xaxis().set_ticks([])
    ax1.get_yaxis().set_ticks([])
plt.show()
A first very simple method to build the regression model is to use the average of all the target values in the training set as the output of the model, discarding the value of the observation input vector.
This approach can be considered as a baseline: any other method that makes effective use of the observation variables, which are statistically related to $s$, should improve on its performance.
The prediction is thus given by the average of the training targets, $$\hat{s} = \frac{1}{K}\sum_{k=0}^{K-1} s^{(k)},$$ where $s^{(k)}$, $k=0,\ldots,K-1$, are the target values in the training set and $K$ is the number of training samples.
In [5]:
# Mean of all target values in the training set
s_hat = np.mean(S_tr)
print(s_hat)
This constant value is used as the prediction for any input ${\bf x}$.
In [6]:
# We start by defining a function that calculates the average square error
def square_error(s, s_est):
    # Squeeze can be used to make sure that s and s_est have the appropriate dimensions.
    y = np.mean(np.power((s - s_est), 2))
    # y = np.mean(np.power((np.squeeze(s) - np.squeeze(s_est)), 2))
    return y
# Mean square error of the baseline prediction over the training data
# MSE_tr = <FILL IN>
MSE_tr = square_error(S_tr, s_hat)
# Mean square error of the baseline prediction over the test data
# MSE_tst = <FILL IN>
MSE_tst = square_error(S_tst, s_hat)
print('Average square error in the training set (baseline method): {0}'.format(MSE_tr))
print('Average square error in the test set (baseline method): {0}'.format(MSE_tst))
Note that, in the previous piece of code, the function `square_error` can also be used when the second argument is a scalar instead of a vector with the same length as the first argument: thanks to NumPy broadcasting, the scalar is subtracted from each component of the vector provided as the first argument.
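For instance, the following small illustration (not part of the original exercise; the toy values are arbitrary) shows that both a scalar and a column vector are accepted as the second argument:

```python
# Illustration of the broadcasting behaviour of square_error.
toy_targets = np.array([[1.0], [2.0], [3.0]])
print(square_error(toy_targets, 2.0))                    # scalar second argument
print(square_error(toy_targets, 2.0 * np.ones((3, 1))))  # column-vector second argument
# Both calls return the same value, 2/3.
```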
In [7]:
if sys.version_info.major == 2:
    Test.assertTrue(np.isclose(MSE_tr, square_error(S_tr, s_hat)), 'Incorrect value for MSE_tr')
    Test.assertTrue(np.isclose(MSE_tst, square_error(S_tst, s_hat)), 'Incorrect value for MSE_tst')
The principles of the $k$-nn method are the following:
- For each point ${\bf x}$ where a prediction is needed, find the $k$ training observations closest to ${\bf x}$ (e.g., in Euclidean distance).
- Output as the prediction the average of the target values of those $k$ neighbors.
The number of neighbors is a hyperparameter that plays an important role in the performance of the method. You can test its influence by changing $k$ in the following piece of code. In particular, you can start with $k=1$ and observe the effect of increasing the value of $k$.
In [8]:
# We implement unidimensional regression using the k-nn method
# In other words, the estimations are to be made using only one variable at a time
from scipy import spatial
var = 0 # pick a variable (e.g., any value from 0 to 8 for the STOCK dataset)
k = 1 # Number of neighbors
n_points = 1000 # Number of points in the 'x' axis (for representational purposes)
# For representational purposes, we will compute the output of the regression model
# in a series of equally spaced-points along the x-axis
grid_min = np.min([np.min(X_tr[:,var]), np.min(X_tst[:,var])])
grid_max = np.max([np.max(X_tr[:,var]), np.max(X_tst[:,var])])
X_grid = np.linspace(grid_min,grid_max,num=n_points)
def knn_regression(X1, S1, X2, k):
    """ Compute the k-NN regression estimate for the observations contained in
        the rows of X2, for the training set given by the rows in X1 and the
        components of S1. k is the number of neighbours of the k-NN algorithm
    """
    if X1.ndim == 1:
        X1 = np.asmatrix(X1).T
    if X2.ndim == 1:
        X2 = np.asmatrix(X2).T
    distances = spatial.distance.cdist(X1, X2, 'euclidean')
    neighbors = np.argsort(distances, axis=0, kind='quicksort', order=None)
    closest = neighbors[range(k), :]
    est_values = np.zeros([X2.shape[0], 1])
    for idx in range(X2.shape[0]):
        est_values[idx] = np.mean(S1[closest[:, idx]])
    return est_values
est_tst = knn_regression(X_tr[:,var], S_tr, X_tst[:,var], k)
est_grid = knn_regression(X_tr[:,var], S_tr, X_grid, k)
plt.plot(X_tr[:,var], S_tr,'b.',label='Training points')
plt.plot(X_tst[:,var], S_tst,'rx',label='Test points')
plt.plot(X_grid, est_grid,'g-',label='Regression model')
plt.axis('tight')
plt.legend(loc='best')
plt.show()
We see that a small $k$ results in a regression curve with many large oscillations: the curve captures any noise that may be present in the training data and overfits the training set. On the other hand, if we pick too large a $k$ (e.g., $k=200$), the regression curve becomes too smooth, averaging out the target values of the training set over large intervals of the observation variable.
The next code illustrates this effect by plotting the average training and test square errors as a function of $k$.
In [9]:
var = 0
k_max = 60
k_max = np.minimum(k_max, X_tr.shape[0]) # k_max cannot be larger than the number of samples
#Be careful with the use of range, e.g., range(3) = [0,1,2] and range(1,3) = [1,2]
MSEk_tr = [square_error(S_tr, knn_regression(X_tr[:, var], S_tr, X_tr[:, var], k))
           for k in range(1, k_max+1)]
MSEk_tst = [square_error(S_tst, knn_regression(X_tr[:, var], S_tr, X_tst[:, var], k))
            for k in range(1, k_max+1)]
kgrid = np.arange(1, k_max+1)
plt.plot(kgrid, MSEk_tr,'bo', label='Training square error')
plt.plot(kgrid, MSEk_tst,'ro', label='Test square error')
plt.xlabel('$k$')
plt.axis('tight')
plt.legend(loc='best')
plt.show()
As we can see, the error initially decreases, achieving a minimum (in the test set) for some finite value of $k$ ($k\approx 10$ for the `STOCK` dataset). Increasing the value of $k$ beyond that point results in poorer performance.
Analyze the training MSE for $k=1$. Why is it smaller than for any other $k$? Under which conditions will it be exactly zero?
Modify the code above to visualize the square error from $k=1$ up to $k$ equal to the number of training instances. Can you relate the square error of the $k$-NN method with that of the baseline method for a certain value of $k$?
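As a hint for the second question, the following snippet (a quick numerical check, not part of the original exercises) compares the $k$-NN prediction with $k$ equal to the number of training samples against the baseline prediction:

```python
# When k equals the number of training samples, every neighbourhood contains the
# whole training set, so the k-NN estimate collapses to the training mean, i.e.,
# the baseline predictor.
k_all = X_tr.shape[0]
est_all = knn_regression(X_tr[:, 0], S_tr, X_tst[:, 0], k_all)
print(np.allclose(est_all, np.mean(S_tr)))   # Expected: True
print(np.isclose(square_error(S_tst, est_all),
                 square_error(S_tst, np.mean(S_tr))))  # Expected: True
```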
Looking at the scatter plots, we can observe that some observation variables seem to have a clearer relationship with the target value than others. Thus, we can expect that not all variables are equally useful for the regression task. In the following plots, we carry out a study of the performance that can be achieved with each individual variable.
Note that, in practice, the test labels are not available for the selection of hyperparameter $k$, so we should be careful about the conclusions of this experiment. A more realistic approach will be studied later when we introduce the concept of model validation.
In [10]:
k_max = 20
var_performance = []
k_values = []
for var in range(X_tr.shape[1]):
    MSE_tr = [square_error(S_tr, knn_regression(X_tr[:, var], S_tr, X_tr[:, var], k))
              for k in range(1, k_max+1)]
    MSE_tst = [square_error(S_tst, knn_regression(X_tr[:, var], S_tr, X_tst[:, var], k))
               for k in range(1, k_max+1)]
    MSE_tr = np.asarray(MSE_tr)
    MSE_tst = np.asarray(MSE_tst)
    # For each variable, we select the value of k for which the training error is minimum
    pos = np.argmin(MSE_tr)
    k_values.append(pos + 1)
    var_performance.append(MSE_tst[pos])
plt.stem(range(X_tr.shape[1]), var_performance)
plt.title('Results of unidimensional regression ($k$NN)')
plt.xlabel('Variable')
plt.ylabel('Test MSE')
plt.figure(2)
plt.stem(range(X_tr.shape[1]), k_values)
plt.xlabel('Variable')
plt.ylabel('$k$')
plt.title('Selection of the hyperparameter')
plt.show()
In the previous subsection, we have studied the performance of the $k$-nn method when using only one variable. Doing so was convenient, because it allowed us to plot the regression curves in a 2-D plot, and to get some insight about the consequences of modifying the number of neighbors.
For completeness, we evaluate now the performance of the $k$-nn method in this dataset when using all variables together. In fact, when designing a regression model, we should proceed in this manner, using all available information to make as accurate an estimation as possible. In this way, we can also account for correlations that might be present among the different observation variables, and that may carry very relevant information for the regression task.
For instance, in the `STOCK` dataset, it may be that the combination of the stock values of two airplane companies is more informative about the price of the target company, while the value of a single company is not enough.
Also, in the `CONCRETE` dataset, it may be that for the particular problem at hand the combination of a large proportion of water and a small proportion of coarse grain is a clear indication of a certain compressive strength of the material, while the proportion of water or coarse grain alone is not enough to reach that conclusion.
In [11]:
k_max = 20
MSE_tr = [square_error(S_tr, knn_regression(X_tr, S_tr, X_tr, k)) for k in range(1, k_max+1)]
MSE_tst = [square_error(S_tst, knn_regression(X_tr, S_tr, X_tst, k)) for k in range(1, k_max+1)]
plt.plot(np.arange(k_max)+1, MSE_tr,'bo',label='Training square error')
plt.plot(np.arange(k_max)+1, MSE_tst,'ro',label='Test square error')
plt.xlabel('k')
plt.ylabel('Square error')
plt.legend(loc='best')
plt.show()
In this case, we can check that the average test square error is much lower than the error that was achieved when using only one variable, and also far better than the baseline method. It is also interesting to note that in this particular case the best performance is achieved for a small value of $k$, with the error increasing for larger values of the hyperparameter.
Nevertheless, as we discussed previously, these results should be taken with care. How would we select the value of $k$ if test labels are (obviously) not available for model validation?
A drawback of the $k$-nn method is that the selection of $k$ influences the final error of the algorithm. In the previous experiments, we kept the value of $k$ that minimized the square error on the training set. However, we also noticed that the location of the minimum is not necessarily the same from the perspective of the test data. Ideally, we would like the designed regression model to work as well as possible on future unlabeled patterns that are not available during the training phase. This property is known as generalization.
Fitting the training data is only pursued in the hope that we are also indirectly obtaining a model that generalizes well. In order to achieve this goal, there are some strategies that try to guarantee a correct generalization of the model. One such approach is known as cross-validation.
Since using the test labels during the training phase is not allowed (they should be kept aside to simulate the future application of the regression model on unseen patterns), we need to figure out some way to improve our estimation of the hyperparameter that requires only training data. Cross-validation allows us to do so through the following steps:
- Split the training data into $M$ disjoint folds of (approximately) the same size.
- For each candidate value of the hyperparameter ($k$, in our case), carry out $M$ training/validation rounds: in round $i$, train the model on all folds except the $i$-th one, and compute the validation error on the held-out fold.
- Average the $M$ validation errors obtained for each value of the hyperparameter, and select the value that minimizes this average validation error.
- Finally, retrain the model on the whole training set using the selected hyperparameter.
In [12]:
### This fragment of code runs k-nn with M-fold cross validation
# Parameters:
M = 5 # Number of folds for M-cv
k_max = 40 # Maximum value of the k-nn hyperparameter to explore
# First we compute the train error curve, that will be useful for comparative visualization.
MSE_tr = [square_error(S_tr, knn_regression(X_tr, S_tr, X_tr, k)) for k in range(1, k_max+1)]
## M-CV
# Obtain the indices for the different folds
n_tr = X_tr.shape[0]
permutation = np.random.permutation(n_tr)
# Split the indices in M subsets with (almost) the same size.
set_indices = {i: [] for i in range(M)}
i = 0
for pos in range(n_tr):
    set_indices[i].append(permutation[pos])
    i = (i+1) % M
# Obtain the validation errors
MSE_val = np.zeros((1, k_max))
for i in range(M):
    val_indices = set_indices[i]
    # Take out the val_indices from the set of indices.
    tr_indices = list(set(permutation) - set(val_indices))
    MSE_val_iter = [square_error(S_tr[val_indices],
                                 knn_regression(X_tr[tr_indices, :], S_tr[tr_indices],
                                                X_tr[val_indices, :], k))
                    for k in range(1, k_max+1)]
    MSE_val = MSE_val + np.asarray(MSE_val_iter).T
MSE_val = MSE_val/M
# Select the best k based on the validation error
k_best = np.argmin(MSE_val) + 1
# Compute the final test MSE for the selected k
MSE_tst = square_error(S_tst, knn_regression(X_tr, S_tr, X_tst, k_best))
plt.plot(np.arange(k_max)+1, MSE_tr, 'bo', label='Training square error')
plt.plot(np.arange(k_max)+1, MSE_val.T, 'go', label='Validation square error')
plt.plot([k_best, k_best], [0, MSE_tst],'r-')
plt.plot(k_best, MSE_tst,'ro',label='Test error')
plt.legend(loc='best')
plt.show()
In practice, most well-known machine learning methods are already implemented and available for Python. Probably the most complete library of machine learning tools is Scikit-learn. The following piece of code uses the method `KNeighborsRegressor` available in Scikit-learn; the example has been taken from the Scikit-learn documentation. As you can check, this routine allows us to build the estimation at a particular point using a weighted average of the targets of the neighbors.
To obtain the estimation at a point ${\bf x}$, the prediction is computed as $$\hat{s}({\bf x}) = \frac{\sum_{k=1}^{K} w_k\, s^{(k)}}{\sum_{k=1}^{K} w_k},$$ where $s^{(k)}$ are the targets of the $K$ nearest neighbors of ${\bf x}$, and where $w_k = 1$ when `weights='uniform'`, while $w_k$ is the inverse of the distance between ${\bf x}$ and the $k$-th neighbor when `weights='distance'`.
In [13]:
# Author: Alexandre Gramfort <alexandre.gramfort@inria.fr>
# Fabian Pedregosa <fabian.pedregosa@inria.fr>
#
# License: BSD 3 clause (C) INRIA
###############################################################################
# Generate sample data
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors
np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()
# Add noise to targets
y[::5] += 1 * (0.5 - np.random.rand(8))
###############################################################################
# Fit regression model
n_neighbors = 5
for i, weights in enumerate(['uniform', 'distance']):
    knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
    y_ = knn.fit(X, y).predict(T)
    plt.subplot(2, 1, i + 1)
    plt.scatter(X, y, c='k', label='data')
    plt.plot(T, y_, c='g', label='prediction')
    plt.axis('tight')
    plt.legend()
    plt.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors,
                                                                weights))
plt.show()
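As a final note, the cross-validation procedure that we implemented manually above can also be carried out with the tools included in Scikit-learn. The following sketch (not part of the original notebook, and just one possible way of doing it) uses `GridSearchCV` to select the number of neighbors for the dataset loaded at the beginning of the session:

```python
# Sketch of hyperparameter selection with Scikit-learn's built-in cross-validation.
# It reuses X_tr, S_tr, X_tst, S_tst and square_error from the cells above.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

param_grid = {'n_neighbors': list(range(1, 41))}
grid = GridSearchCV(KNeighborsRegressor(), param_grid,
                    scoring='neg_mean_squared_error', cv=5)
grid.fit(X_tr, S_tr.ravel())
k_best_skl = grid.best_params_['n_neighbors']
print('Best k according to 5-fold cross-validation:', k_best_skl)
print('Test MSE with the selected k:',
      square_error(S_tst.ravel(), grid.predict(X_tst)))
```

The selected value of $k$ may differ slightly from the one obtained with our manual implementation, since the folds are generated with a different random partition.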