Notebook version: 2.1 (Sep 27, 2019)
Changes: v.1.0 - First version. Extracted from regression_intro_knn v.1.0.
v.1.1 - Compatibility with python 2 and python 3
v.2.0 - New notebook generated. Fuses code from Notebooks R1, R2, and R3
v.2.1 - Updated index notation
In [2]:
# Import some libraries that will be necessary for working with data and displaying plots
# To visualize plots in the notebook
%matplotlib inline
import numpy as np
import scipy.io # To read matlab files
import pandas as pd # To read data tables from csv files
# For plots and graphical results
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pylab
# For the student tests (only for python 2)
import sys
if sys.version_info.major == 2:
    from test_helper import Test
# That's default image size for this interactive session
pylab.rcParams['figure.figsize'] = 9, 6
The goal of regression methods is to predict the value of some target variable $S$ from the observation of one or more input variables $X_0, X_1, \ldots, X_{N-1}$ (which we will collect in a single vector $\bf X$).
Regression problems arise in situations where the value of the target variable is not easily accessible, but we can measure other dependent variables, from which we can try to predict $S$.
The only information available to estimate the relation between the inputs and the target is a dataset $\mathcal D$ containing several observations of all variables.
$$\mathcal{D} = \{{\bf x}_k, s_k\}_{k=0}^{K-1}$$
The dataset $\mathcal{D}$ must be used to find a function $f$ that, for any observation vector ${\bf x}$, computes an output $\hat{s} = f({\bf x})$ that is a good prediction of the true value of the target, $s$.
Note that, to build the regression model, we exploit the statistical dependence between the random variable $S$ and the random vector ${\bf X}$. In this respect, we can assume that the available dataset $\mathcal{D}$ consists of i.i.d. samples from the joint distribution $p_{S,{\bf X}}(s,{\bf x})$. If we had access to the true distribution, a statistical approach would be more accurate; however, in many situations such knowledge is not available, while using training data to carry out the design is feasible (e.g., relying on historical records, or on the manual labelling of a set of patterns).
The scikit-learn package contains several datasets related to regression problems.
We can load these datasets as follows:
In [3]:
from sklearn import datasets
# Load the dataset. Select it by uncommenting the appropriate line
D_all = datasets.load_boston()
#D_all = datasets.load_diabetes()
# Extract data and data parameters.
X = D_all.data          # Input data matrix (one row per observation, input variables only)
S = D_all.target        # Target variable
n_samples = X.shape[0]  # Number of observations
n_vars = X.shape[1]     # Number of input variables
This dataset contains
In [4]:
print(n_samples)
observations of the target variable and
In [5]:
print(n_vars)
input variables.
When the instances of the dataset are multidimensional, they cannot be visualized directly, but we can get a first rough idea about the regression task by plotting the target variable versus one of the input variables. These representations are known as scatter plots. The Python methods plot and scatter from the matplotlib package can be used to produce them.
In [6]:
# Scatter plots of the target versus all input variables
nrows = 4
ncols = 1 + (X.shape[1] - 1) // nrows   # Integer division, so that subplot() receives integer arguments

# Some adjustment for the subplot.
pylab.subplots_adjust(hspace=0.2)

# Plot all variables
for idx in range(X.shape[1]):
    ax = plt.subplot(nrows, ncols, idx + 1)
    ax.scatter(X[:, idx], S)   # <-- This is the key command
    ax.get_xaxis().set_ticks([])
    ax.get_yaxis().set_ticks([])
    plt.ylabel('Target')
In order to evaluate the performance of a given predictor, we need to quantify the quality of its predictions. This is usually done by means of a loss function $l(s,\hat{s})$. Two common losses are the square error, $l(s,\hat{s}) = (s-\hat{s})^2$, and the absolute error, $l(s,\hat{s}) = |s-\hat{s}|$.
Note that both the square and absolute errors are functions of the estimation error $e = s-{\hat s}$. However, this is not necessarily the case. As an example, imagine a situation in which we would like to introduce a penalty which increases with the magnitude of the estimated variable. In such a case, the following cost would better fit our needs: $l(s,{\hat s}) = s^2 \left(s-{\hat s}\right)^2$.
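As an illustration, the following cell (a minimal sketch; the particular values of $s$ and ${\hat s}$ are made up) evaluates this weighted cost for two pairs with the same estimation error but different target magnitudes:
In [ ]:
# Weighted cost l(s, s_est) = s**2 * (s - s_est)**2
# Two hypothetical (s, s_est) pairs with the same error e = s - s_est = 0.5
for s, s_est in [(1.0, 0.5), (10.0, 9.5)]:
    cost = s**2 * (s - s_est)**2
    print('s = {}, s_est = {}, cost = {}'.format(s, s_est, cost))
# The second pair is penalized 100 times more, because its target is 10 times larger.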
In [7]:
# In this section we will plot together the square and absolute errors
grid = np.linspace(-3,3,num=100)
plt.plot(grid, grid**2, 'b-', label='Square error')
plt.plot(grid, np.absolute(grid), 'r--', label='Absolute error')
plt.xlabel('Error')
plt.ylabel('Cost')
plt.legend(loc='best')
plt.show()
In general, we do not care much about an isolated application of the regression model; instead, we are looking for a generally good behavior, for which we need to average the loss function over a set of samples. In this notebook, we will use the average of the square loss, to which we will refer as the mean-square error (MSE).
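In terms of the notation introduced above, the MSE over a set of $K$ samples is
$$\text{MSE} = \frac{1}{K} \sum_{k=0}^{K-1} \left(s_k - {\hat s}_k\right)^2$$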
The following code fragment defines a function to compute the MSE based on the availability of two vectors, one of them containing the predictions of the model, and the other the true target values.
In [7]:
# We start by defining a function that calculates the average square error
def square_error(s, s_est):
    # Squeeze is used to make sure that s and s_est have the appropriate dimensions.
    y = np.mean(np.power((np.squeeze(s) - np.squeeze(s_est)), 2))
    return y
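A quick sanity check of the function (the vectors below are arbitrary toy values):
In [ ]:
# Three samples with errors 0, 0 and 1: the MSE should be 1/3
print(square_error(np.array([1, 2, 3]), np.array([1, 2, 4])))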
The major goal of the regression problem is that the predictor should make good predictions for arbitrary new inputs, not taken from the dataset used by the regression algorithm.
Thus, in order to evaluate the prediction accuracy of some regression algorithm, we need some data, not used during the predictor design, to test its performance on new data. To do so, the original dataset is usually divided into (at least) two disjoint sets: a training set, $\cal{D}_{\text{train}}$, used to fit the regression model, and a test set, $\cal{D}_{\text{test}}$, used only to assess its performance.
A good regression algorithm uses $\cal{D}_{\text{train}}$ to obtain a predictor that attains a small average loss over $\cal{D}_{\text{test}}$,
$$ {\bar R}_{\text{test}} = \frac{1}{K_{\text{test}}} \sum_{ ({\bf x},s) \in \mathcal{D}_{\text{test}}} l(s, f({\bf x})) $$
where $K_{\text{test}}$ is the size of the test set.
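The following sketch computes this average test loss for an arbitrary loss function (the name average_loss and the lambda in the example are illustrative choices, not part of the original notebook):
In [ ]:
# Average of an arbitrary loss over pairs of targets and predictions
def average_loss(s, s_est, loss):
    return np.mean([loss(si, si_est) for si, si_est in zip(np.squeeze(s), np.squeeze(s_est))])

# With the square loss this reduces to the MSE computed by square_error (toy values)
print(average_loss(np.array([1, 2, 3]), np.array([1, 2, 4]), lambda s, s_est: (s - s_est)**2))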
As a designer, you only have access to training data. However, for illustration purposes, you may be given a test dataset for many examples in this course. Note that, in such a case, using the test data to adjust the regression model is completely forbidden. You should work as if such a test dataset were not available at all, and resort to it only to assess the performance of the model after the design is complete.
To reproduce this train/test setting, we next split the Boston dataset into training and test partitions, using 60% and 40% of the data, respectively.
In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, s_train, s_test = train_test_split(X, S, test_size=0.4, random_state=0)
A first very simple method to build the regression model is to use the average of all the target values in the training set as the output of the model, discarding the value of the observation input vector.
This approach can be considered a baseline: any other method that makes effective use of the observation variables, which are statistically related to $s$, should improve on its performance.
The following code fragment uses the training data to compute the baseline regression model, and shows the MSE calculated over the test partition.
In [9]:
S_baseline = np.mean(s_train)
print('The baseline estimator is:', S_baseline)
#Compute MSE for the train data
#MSE_train = square_error(s_train, S_baseline)
#Compute MSE for the test data. IMPORTANT: Note that we still use
#S_baseline as the prediction.
MSE_test = square_error(s_test, S_baseline)
#print('The MSE for the training data is:', MSE_train)
print('The MSE for the test data is:', MSE_test)
Generally speaking, we can distinguish two approaches when designing a regression model:
Parametric approach: In this case, a parametric form is assumed a priori for the estimation function, and the goal of the design is to find the most appropriate values of the parameters according to a certain criterion.
For instance, we could assume a linear expression $${\hat s} = f({\bf x}) = {\bf w}^\top {\bf x}$$ and adjust the parameter vector in order to minimize the average of the quadratic error over the training data. This is known as least-squares regression, and we will study it in Section 8 of this notebook.
Non-parametric approach: In this case, the analytical shape of the regression model is not assumed a priori.
The $k$-nearest neighbors ($k$-nn) method is an example of this approach. Its principle is the following: to predict the target for a new observation ${\bf x}$, find the $k$ points in the training set that are closest to ${\bf x}$ (typically in Euclidean distance), and output the average of their target values.
The number of neighbors is a hyperparameter that plays an important role in the performance of the method. You can test its influence by changing $k$ in the following piece of code.
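As a complement to the sklearn implementation used below, this is a minimal from-scratch sketch of the $k$-nn prediction rule (the function name knn_predict is an illustrative choice):
In [ ]:
# Minimal k-nn regression: average the targets of the k nearest training points
def knn_predict(X_tr, s_tr, X_te, k):
    s_hat = np.zeros(X_te.shape[0])
    for i, x in enumerate(X_te):
        dist = np.sqrt(np.sum((X_tr - x)**2, axis=1))   # Euclidean distance to every training point
        nearest = np.argsort(dist)[:k]                  # Indices of the k closest points
        s_hat[i] = np.mean(s_tr[nearest])               # Average of their targets
    return s_hat

# It should closely match KNeighborsRegressor (up to ties among equidistant neighbors)
print('MSE for the test data (from scratch, k=5):', square_error(s_test, knn_predict(X_train, s_train, X_test, 5)))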
In [10]:
from sklearn import neighbors
n_neighbors = 1
knn = neighbors.KNeighborsRegressor(n_neighbors)
knn.fit(X_train, s_train)
s_hat_train = knn.predict(X_train)
s_hat_test = knn.predict(X_test)
print('The MSE for the training data is:', square_error(s_train, s_hat_train))
print('The MSE for the test data is:', square_error(s_test, s_hat_test))
In [11]:
max_k = 25
n_neighbors_list = np.arange(max_k)+1
MSE_train = []
MSE_test = []
for n_neighbors in n_neighbors_list:
    knn = neighbors.KNeighborsRegressor(n_neighbors)
    knn.fit(X_train, s_train)
    s_hat_train = knn.predict(X_train)
    s_hat_test = knn.predict(X_test)
    MSE_train.append(square_error(s_train, s_hat_train))
    MSE_test.append(square_error(s_test, s_hat_test))
plt.plot(n_neighbors_list, MSE_train,'bo', label='Training square error')
plt.plot(n_neighbors_list, MSE_test,'ro', label='Test square error')
plt.xlabel('$k$')
plt.axis('tight')
plt.legend(loc='best')
plt.show()
Although the above figure illustrates the evolution of the training and test MSE for different choices of the number of neighbors, it is important to note that this figure, and in particular the red points, cannot be used to select the value of that parameter. Remember that it is only legitimate to use the test data to assess the final performance of the method, which also implies that any parameters inherent to the method must be adjusted using the training data only.
One drawback of the $k$-nn method is that the selection of $k$ influences the final error of the algorithm. In the previous experiments, we kept the value of $k$ that minimized the square error on the training set. However, we also noticed that the location of the minimum is not necessarily the same from the perspective of the test data. Ideally, we would like the designed regression model to work as well as possible on future unlabeled patterns that are not available during the training phase. This property is known as generalization. Fitting the training data is pursued only in the hope that we are also indirectly obtaining a model that generalizes well. In order to achieve this goal, there are strategies that try to guarantee a correct generalization of the model. One such approach is known as cross-validation.
Since using the test labels during the training phase is not allowed (they should be kept aside to simulate the future application of the regression model on unseen patterns), we need some way to estimate the hyperparameter using training data only. Cross-validation allows us to do so through the following steps:
- Split the training data into $M$ folds of (approximately) equal size.
- For each candidate value of the hyperparameter, train the model $M$ times, each time leaving out one fold, and compute the error on the left-out (validation) fold.
- Average the $M$ validation errors, and select the hyperparameter value achieving the smallest average.
- Retrain the model on the whole training set using the selected value.
Exercise: Use the KFold function from the sklearn library to validate the parameter $k$. Use a 10-fold validation strategy. What is the best number of neighbors according to this strategy? What is the corresponding MSE averaged over the test data?
In [12]:
from sklearn.model_selection import KFold
max_k = 25
n_neighbors_list = np.arange(max_k)+1
MSE_val = np.zeros((max_k,))
nfolds = 10
kf = KFold(n_splits=nfolds)
for train, val in kf.split(X_train):
    for idx, n_neighbors in enumerate(n_neighbors_list):
        knn = neighbors.KNeighborsRegressor(n_neighbors)
        knn.fit(X_train[train, :], s_train[train])
        s_hat_val = knn.predict(X_train[val, :])
        MSE_val[idx] += square_error(s_train[val], s_hat_val)
MSE_val = MSE_val / nfolds   # Average the validation error over the folds
selected_k = np.argmin(MSE_val) + 1
plt.plot(n_neighbors_list, MSE_train,'bo', label='Training square error')
plt.plot(n_neighbors_list, MSE_val,'ro', label='Validation square error')
plt.plot(selected_k, MSE_test[selected_k-1],'gs', label='Test square error')
plt.xlabel('$k$')
plt.axis('tight')
plt.legend(loc='best')
plt.show()
print('Cross-validation selected the following value for the number of neighbors:', selected_k)
print('Test MSE:', MSE_test[selected_k-1])
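As a cross-check (not part of the original exercise), sklearn can perform the same search directly with GridSearchCV. The following is a sketch assuming the same 10-fold splitting over the training data, so it should select (approximately) the same value of $k$:
In [ ]:
from sklearn.model_selection import GridSearchCV

# Grid search over the number of neighbors, scored by (negated) MSE
param_grid = {'n_neighbors': list(n_neighbors_list)}
grid = GridSearchCV(neighbors.KNeighborsRegressor(), param_grid,
                    cv=10, scoring='neg_mean_squared_error')
grid.fit(X_train, s_train)
print('Best number of neighbors:', grid.best_params_['n_neighbors'])
print('Test MSE for the selected model:', square_error(s_test, grid.predict(X_test)))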
The goal is to learn a (possibly non-linear) regression model from a set of $K$ labeled points, $\{{\bf x}_k,s_k\}_{k=0}^{K-1}$.
We assume a parametric function of the form:
$${\hat s}({\bf x}) = f({\bf x}) = w_0 z_0({\bf x}) + w_1 z_1({\bf x}) + \dots + w_{m-1} z_{m-1}({\bf x})$$
where $z_i({\bf x})$ are particular transformations of the input vector variables.
Some examples are:
- Linear regression on the original variables: $z_i({\bf x}) = x_i$, possibly with $z_0({\bf x}) = 1$ to include a bias (intercept) term.
- Polynomial regression on a scalar input: $z_i(x) = x^i$.
Least squares (LS) regression finds the coefficients of the model with the aim of minimizing the sum of the squared residuals. If we define ${\bf w} = [w_0,w_1,\dots,w_{m-1}]^\top$, the LS solution is defined as
\begin{equation}
{\bf w}_{LS} = \arg \min_{\bf w} \sum_{k=0}^{K-1} e_k^2 = \arg \min_{\bf w} \sum_{k=0}^{K-1} \left[s_k - {\hat s}_k \right]^2
\end{equation}
The prediction of the model for a single transformed input vector ${\bf z}_k$ (computed from ${\bf x}_k$) can be expressed as the inner product
$${\hat s}_k = {\bf z}_k^\top {\bf w}$$
Collecting all the targets in a vector ${\bf s}$ and all the transformed inputs as the rows of a matrix ${\bf Z}$, the cost can be written as $C({\bf w}) = \sum_{k=0}^{K-1} e_k^2 = \|{\bf s} - {\bf Z}{\bf w}\|^2$. Since this expression depends quadratically on ${\bf w}$ and is non-negative, we know that there is only one point where the gradient of $C({\bf w})$ becomes zero, and that point is necessarily a minimum of the cost
$$\nabla_{\bf w} \|{\bf s} - {\bf Z}{\bf w}\|^2\Bigg|_{{\bf w} = {\bf w}_{LS}} = {\bf 0}$$
Exercise: Solve the previous problem to show that $${\bf w}_{LS} = \left( {\bf Z}^\top{\bf Z} \right)^{-1} {\bf Z}^\top{\bf s}$$
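A quick numerical check of this expression (a sketch with randomly generated data; the matrix sizes below are arbitrary):
In [ ]:
# Verify that the closed-form solution matches numpy's least-squares solver
Z_check = np.random.randn(50, 4)    # 50 samples, 4 transformed variables
s_check = np.random.randn(50, 1)
w_closed = np.linalg.inv(Z_check.T.dot(Z_check)).dot(Z_check.T).dot(s_check)
w_lstsq = np.linalg.lstsq(Z_check, s_check, rcond=None)[0]
print('Maximum difference between both solutions:', np.max(np.abs(w_closed - w_lstsq)))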
The next fragment of code fits polynomials of increasing degree to randomly generated training data.
In [13]:
n_points = 20
n_grid = 200
frec = 3
std_n = 0.2
max_degree = 20
colors = 'brgcmyk'
#Location of the training points
X_tr = (3 * np.random.random((n_points,1)) - 0.5)
#Labels are obtained from a sinusoidal function, and contaminated by noise
S_tr = np.cos(frec*X_tr) + std_n * np.random.randn(n_points,1)
#Equally spaced points in the X-axis
X_grid = np.linspace(np.min(X_tr),np.max(X_tr),n_grid)
#We start by building the Z matrix
Z = []
for el in X_tr.tolist():
    Z.append([el[0]**k for k in range(max_degree+1)])
Z = np.matrix(Z)

Z_grid = []
for el in X_grid.tolist():
    Z_grid.append([el**k for k in range(max_degree+1)])
Z_grid = np.matrix(Z_grid)

plt.plot(X_tr, S_tr, 'b.')

for k in [1, 2, n_points]: # range(max_degree+1):
    Z_iter = Z[:, :k+1]

    # Least squares solution
    #w_LS = (np.linalg.inv(Z_iter.T.dot(Z_iter))).dot(Z_iter.T).dot(S_tr)

    # Least squares solution, with fewer numerical errors
    w_LS, resid, rank, s = np.linalg.lstsq(Z_iter, S_tr, rcond=None)

    # Estimates at all grid points
    fout = Z_grid[:, :k+1].dot(w_LS)
    fout = np.array(fout).flatten()
    plt.plot(X_grid, fout, colors[k % len(colors)] + '-', label='Degree ' + str(k))

plt.legend(loc='best')
plt.ylim(1.2 * np.min(S_tr), 1.2 * np.max(S_tr))
plt.show()
It may seem that increasing the degree of the polynomial is always beneficial, as we can implement a more expressive function. A polynomial of degree $M$ includes all polynomials of lower degree as particular cases. However, if we increase the number of parameters without control, the polynomial eventually becomes expressive enough to fit any given set of training points to arbitrary precision, which does not necessarily mean that we are obtaining a model that extrapolates well to new data.
The conclusion is that, when adjusting a parametric model using least squares, we need to validate the model, for which we can use the cross-validation techniques introduced in Section 7. In this context, validating the model implies selecting its structure (e.g., the degree of the polynomial, or which transformations $z_i({\bf x})$ to use) based on training data only, keeping the test data exclusively for the final performance assessment.
The code below shows the performance of several models. However, no validation process is carried out, so the reported test MSEs should not be used as a criterion to select the best model.
In [14]:
# Linear model with no bias
w_LS, resid, rank, s = np.linalg.lstsq(X_train, s_train, rcond=None)
s_hat_test = X_test.dot(w_LS)
print('Test MSE for linear model without bias:', square_error(s_test, s_hat_test))

# Linear model with bias
Z_train = np.hstack((np.ones((X_train.shape[0], 1)), X_train))
Z_test = np.hstack((np.ones((X_test.shape[0], 1)), X_test))
w_LS, resid, rank, s = np.linalg.lstsq(Z_train, s_train, rcond=None)
s_hat_test = Z_test.dot(w_LS)
print('Test MSE for linear model with bias:', square_error(s_test, s_hat_test))

# Polynomial model of degree 2 (bias, linear, and squared terms)
Z_train = np.hstack((np.ones((X_train.shape[0], 1)), X_train, X_train**2))
Z_test = np.hstack((np.ones((X_test.shape[0], 1)), X_test, X_test**2))
w_LS, resid, rank, s = np.linalg.lstsq(Z_train, s_train, rcond=None)
s_hat_test = Z_test.dot(w_LS)
print('Test MSE for polynomial model (order 2):', square_error(s_test, s_hat_test))
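As a sketch of how these models could be compared using training data only (the helper name build_Z and the candidate list are illustrative, not part of the original notebook), a 10-fold cross-validation over the training partition could proceed as follows:
In [ ]:
# Cross-validated comparison of the three LS models, using training data only
def build_Z(X, model):
    # model is one of 'linear', 'linear_bias', 'poly2' (illustrative names)
    if model == 'linear':
        return X
    elif model == 'linear_bias':
        return np.hstack((np.ones((X.shape[0], 1)), X))
    else:
        return np.hstack((np.ones((X.shape[0], 1)), X, X**2))

kf = KFold(n_splits=10)
for model in ['linear', 'linear_bias', 'poly2']:
    val_mse = 0
    for train, val in kf.split(X_train):
        Z_tr = build_Z(X_train[train, :], model)
        Z_val = build_Z(X_train[val, :], model)
        w_LS = np.linalg.lstsq(Z_tr, s_train[train], rcond=None)[0]
        val_mse += square_error(s_train[val], Z_val.dot(w_LS))
    print('Validation MSE for', model, ':', val_mse / 10)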
In [ ]: