Notebook title: Introduction to Regression. The k nearest neighbors method

Notebook version: 1.1 (Sep 25, 2015)

Author: Jerónimo Arenas García (jarenas@tsc.uc3m.es)
        Jesús Cid Sueiro (jcid@tsc.uc3m.es)

Changes: v.1.0 - First version
Changes: v.1.1 - Stock dataset included.

Pending changes: Test with other datasets. Try to get more intuitive results for the cross-validation part

In [1]:
# Import some libraries that will be necessary for working with data and displaying plots

# To visualize plots in the notebook
%matplotlib inline 

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.io       # To read matlab files
import pylab

1. Introduction to Regression. The $k$ nearest neighbors method

1.1 Introduction to the Estimation and Regression problems

In this session we will introduce the problem of estimating the value of a target variable $s$ from the knowledge of some variables statistically related to it, $x_i,~i=1,\dots,N$. We will collect all these variables in a vector ${\bf x}$, to which we will refer as the pattern or observation vector. The statistical relation between $s$ and ${\bf x}$ can be given by:

  • A training set, consisting of several observation vectors for which the true target variable is known $\{{\bf x}^{(k)}, s^{(k)}\}_{k=1}^K$
  • The joint distribution $p(s,{\bf x})$

In this session (and also in the rest of the course) we will mostly focus on the first case above, to which we refer as the machine learning approach (i.e., learning from data).

Designing a regressor is equivalent to designing a function that takes the observation vector as its input argument and outputs the estimated value of the target variable.

Cost functions

In practice, once we have built a regressor, it will be used in many experiments, each of them involving different values of the target variable and also different observations (even for a common $s$).

The real goal of regression is not to fit the training set, but to make accurate predictions on future data.

This requirement of accuracy implies that we need some way of measuring and penalizing the discrepancy between the true and estimated values of $s$. This is done by defining a cost function, such as:

  • Square error: $c(e) = e^2 = (s - {\hat s})^2$
  • Absolute error: $c(e) = |e| = |s - {\hat s}|$

The above expressions apply to a particular application of the regression model. In practice, we are interested in an appropriate average performance, i.e., the performance averaged over the future applications of the design.

Since we only have labels for the training data, some idea about the performance of the system can be obtained by computing the average of the cost function over the training data:

$${\bar C} = \frac{1}{K}\sum_{k=1}^K c\left(e^{(k)}\right)$$

The square and absolute costs are functions of the estimation error, which is defined as $e=s-{\hat s}$. However, this is not necessarily the case. As an example, imagine a situation in which we would like to introduce a penalty that increases with the magnitude of the target variable. In such a case, the following cost would better fit our needs: $c\left(s,{\hat s}\right) = s^2 \left(s-{\hat s}\right)^2$.
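To make the average cost concrete, the following minimal sketch (not part of the original notebook; the values of $s$ and ${\hat s}$ are made up for illustration) computes ${\bar C}$ for the square, absolute and magnitude-weighted costs over a toy set of true and estimated targets.

In [ ]:
# Illustrative sketch (made-up values): average cost over a toy set of targets and estimates
s_true = np.array([1.0, 2.0, 3.0, 4.0])   # true targets s^(k)
s_est = np.array([1.2, 1.8, 3.5, 3.9])    # estimated targets
e = s_true - s_est                        # estimation errors

C_square = np.mean(e**2)                  # average square error
C_abs = np.mean(np.abs(e))                # average absolute error
C_weighted = np.mean(s_true**2 * e**2)    # cost penalizing errors on large target values

print(C_square)
print(C_abs)
print(C_weighted)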


In [2]:
# In this section we will plot together the square and absolute errors
grid = np.linspace(-3,3,num=100)
plt.plot(grid,grid**2,'b-',label='Square error')
plt.plot(grid,np.absolute(grid),'r--',label='Absolute error')
plt.xlabel('Error')
plt.ylabel('Cost')
plt.legend(loc='best')


Out[2]:
<matplotlib.legend.Legend at 0x1062f2890>

Parametric and non-parametric regression models

Generally speaking, we can distinguish two approaches when designing a regression model:

  • Parametric approach: In this case, the estimation function is given a parametric form a priori, and the goal of the design is to find the most appropriate values of the parameters according to a certain criterion.

    For instance, we could assume a linear expression $${\hat s}({\bf x}) = {\bf w}^\top {\bf x}$$ and adjust the parameter vector so as to minimize the average quadratic error over the training data. This is known as least-squares regression, and we will study it in a future session (a minimal preview sketch is shown right after this list).

  • Non-parametric approach: In this case, the analytical shape of the regression model is not assumed a priori.
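As a quick preview of the parametric approach, the sketch below fits the linear model above by least squares on synthetic data using `np.linalg.lstsq`. This is only an illustration on made-up data; the method itself is studied in a later session.

In [ ]:
# Preview sketch of least-squares regression on synthetic data (illustration only)
np.random.seed(0)
X = np.random.randn(100, 3)                      # 100 observation vectors, 3 variables
w_true = np.array([1.0, -2.0, 0.5])
s = X.dot(w_true) + 0.1 * np.random.randn(100)   # noisy linear targets

w_hat = np.linalg.lstsq(X, s)[0]                 # parameters minimizing the average square error
s_hat = X.dot(w_hat)                             # estimations for the training patterns
print(w_hat)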

Today, we will explain some important facts about the regression problem using a non-parametric method known as the $k$-nearest neighbors ($k$-NN) approach.

1.2 The dataset

We describe next the regression task that we will use in the session. The dataset is an adaptation of the `STOCK` dataset, taken originally from the StatLib Repository. The goal of this problem is to predict the value of the stock of a given airplane company, given the values of the stocks of nine other companies on the same day.

(Alternatively, if you are reading this text from the Python notebook with its full functionality, you can explore the results of the regression experiments using the `CONCRETE` dataset, which is taken from the Machine Learning Repository at the University of California, Irvine. To do so, just uncomment the `CONCRETE` block of code and comment out the `STOCK` block in the cell below. Remember that you must run the cells again to see the changes. The goal of the `CONCRETE` task is to predict the compressive strength of cement mixtures based on eight observed variables related to the composition of the mixture and the age of the material.)


In [3]:
# Let us start by loading the data into the workspace, and visualizing the dimensions of all matrices

# CONCRETE DATASET. 
# Use this code block to use the concrete dataset.
#data = scipy.io.loadmat('concrete.mat')
#X_tr = data['X_tr']
#X_tst = data['X_tst']
#S_tr = data['S_tr']
#S_tst = data['S_tst']

# STOCK
# Use this code block to use the airplanes dataset.
data = scipy.io.loadmat('stock.mat')
X_tr = data['xTrain']
X_tst = data['xTest']
S_tr = data['sTrain']
S_tst = data['sTest']

print X_tr.shape
print S_tr.shape
print X_tst.shape
print S_tst.shape


(380, 9)
(380, 1)
(190, 9)
(190, 1)

Questions

Please, reflect about the following issues:

  • What is the difference between a variable and a pattern? How many of each do we have in this case?
  • What is the legal use of the test labels `S_tst`?
  • What is the legal use of the test observation vectors `X_tst`?

Scatter plots

We can get a first rough idea about the regression task if we plot each of the one-dimensional variables against the target data. These representations are known as scatter plots.


In [4]:
pylab.subplots_adjust(hspace=0.2)

for idx in range(X_tr.shape[1]):
    ax1 = plt.subplot(3,3,idx+1)
    ax1.plot(X_tr[:,idx],S_tr,'.')
    ax1.get_xaxis().set_ticks([])
    ax1.get_yaxis().set_ticks([])
    #plt.ylabel('Compressive strength')

plt.show()


1.3 Baseline estimation. Using the average of the training set labels

A first, very simple method to build the regression model is to use the average of all the target values in the training set as the output of the model, discarding the value of the observation vector.

This approach can be considered a baseline: any other method that makes effective use of the observation variables, which are statistically related to $s$, should improve upon its performance.


In [5]:
#We start by defining a function that calculates the average square error
def square_error(s,s_est):
    y = np.mean((s - s_est)**2)
    return y

s_hat = np.mean(S_tr)
print 'Average square error in the training set (baseline method): ' \
    + str(square_error(S_tr,s_hat))
print 'Average square error in the test set (baseline method): ' \
    + str(square_error(S_tst,s_hat))


Average square error in the training set (baseline method): 44.1434101238
Average square error in the test set (baseline method): 43.7992669018

Note that in the previous piece of code, the function `square_error` can be used when the second argument is a number instead of a vector with the same length as the first argument: in that case, the scalar is subtracted from each of the components of the vector provided as the first argument.
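For instance, both of the following calls are valid (a small illustration, assuming `square_error` has been defined as in the cell above):

In [ ]:
# Both calls are valid: the scalar in the second one is subtracted from every component of s
s = np.array([1.0, 2.0, 3.0])
print(square_error(s, np.array([1.5, 2.5, 2.0])))   # estimate given as a vector
print(square_error(s, 2.0))                         # estimate given as a single number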

1.4 Unidimensional regression with the k-nn method

The principles of the k-nn method are as follows:

  • For each point where a prediction is to be made, find the $k$ closest neighbors to that point (in the training set)
  • Obtain the estimation averaging the labels corresponding to the selected neighbors

The number of neighbors is a hyperparameter that plays an important role in the performance of the method. You can test its influence by changing $k$ in the following piece of code. In particular, you can start with $k=1$ and observe the effect of increasing the value of $k$.
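Before looking at the full implementation in the next cell, the two steps can be traced by hand on a tiny made-up example (this sketch is only for illustration and is not part of the original code):

In [ ]:
# Toy illustration of the k-NN steps for a single one-dimensional test point (made-up data)
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # training observations (one variable)
s_train = np.array([0.1, 0.9, 2.2, 2.8, 4.1])   # training targets
x0 = 1.7                                        # point where the prediction is needed
k = 3

dist = np.abs(x_train - x0)                     # step 1: distances to all training points
nearest = np.argsort(dist)[:k]                  # indices of the k closest neighbors
s0_hat = np.mean(s_train[nearest])              # step 2: average of their targets
print(s0_hat)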


In [6]:
# We implement unidimensional regression using the k-nn method
# In other words, the estimations are to be made using only one variable at a time

from scipy import spatial

var = 0 # Pick a variable (0 to 8 for the STOCK dataset, 0 to 7 for CONCRETE)
k = 9 #Number of neighbors
n_points = 1000 #Number of points in the 'x' axis (for representational purposes)

#For representational purposes, we will compute the output of the regression model
#in a series of equally spaced-points along the x-axis
grid_min = np.min([np.min(X_tr[:,var]), np.min(X_tst[:,var])])
grid_max = np.max([np.max(X_tr[:,var]), np.max(X_tst[:,var])])
X_grid = np.linspace(grid_min,grid_max,num=n_points)

def knn_regression(X1,S1,X2,k):
    """ Compute the k-NN regression estimate for the observations contained in
        the rows of X2, for the training set given by the rows in X1 and the
        components of S1. k is the number of neighbours of the k-NN algorithm
    """
    if X1.ndim == 1:
        X1 = np.asmatrix(X1).T
    if X2.ndim == 1:
        X2 = np.asmatrix(X2).T
    distances = spatial.distance.cdist(X1,X2,'euclidean')
    neighbors = np.argsort(distances, axis=0, kind='quicksort', order=None)
    closest = neighbors[range(k),:]
    
    est_values = np.zeros([X2.shape[0],1])
    for idx in range(X2.shape[0]):
        est_values[idx] = np.mean(S1[closest[:,idx]])
        
    return est_values

est_tst = knn_regression(X_tr[:,var],S_tr,X_tst[:,var],k)
est_grid = knn_regression(X_tr[:,var],S_tr,X_grid,k)

plt.plot(X_tr[:,var],S_tr,'b.',label='Training points')
plt.plot(X_tst[:,var],S_tst,'rx',label='Test points')
plt.plot(X_grid,est_grid,'g-',label='Regression model')
plt.axis('tight')
plt.legend(loc='best')


Out[6]:
<matplotlib.legend.Legend at 0x1073ccdd0>

Evolution of the error with the number of neighbors ($k$)

We see that a small $k$ results in a regression curve that exhibits many large oscillations: the curve captures any noise that may be present in the training data, and overfits the training set. On the other hand, picking a too large $k$ (e.g., 200) makes the regression curve too smooth, averaging out the values of the labels in the training set over large intervals of the observation variable.

The next piece of code illustrates this effect by plotting the average training and test square errors as a function of $k$. As we can see, the error initially decreases, achieving a minimum (in the test set) for $k\approx 10$. Increasing the value of $k$ beyond that point results in poorer performance.


In [7]:
var = 0
k_max = 40

#Be careful with the use of range, e.g., range(3) = [0, 1, 2]
sq_error_tr = [square_error(S_tr,knn_regression(X_tr[:,var],S_tr,X_tr[:,var],k+1)) for k in range(k_max)]
sq_error_tst = [square_error(S_tst,knn_regression(X_tr[:,var],S_tr,X_tst[:,var],k+1)) for k in range(k_max)]

plt.plot(np.arange(k_max)+1,sq_error_tr,'b-',label='Training square error')
plt.plot(np.arange(k_max)+1,sq_error_tst,'r',label='Test square error')
plt.axis('tight')

plt.legend(loc='best')


Out[7]:
<matplotlib.legend.Legend at 0x10773f310>

Discussion topics and exercises:

  1. Analyze the training square error for $k=1$
  2. Modify the code above to visualize the square error from $k=1$ up to $k$ equal to the number of training instances. Can you relate the square error of the $k$-NN method with that of the baseline method for a certain value of $k$?
  3. Modify the code above to print on the screen the exact value of $k$ minimizing the test square error.

Analysis of the performance for the different variables

Having a look at the scatter plots, we can observe that some observation variables seem to have a clearer relationship with the target value. For instance, variable 1 seems to be linearly correlated with the target, in spite of the noise. Thus, we can expect that not all variables are equally useful for the regression task. In the following plot, we carry out a study of the performance that can be achieved with each variable. Since the test labels cannot be used to select the hyperparameter $k$, we will pick, for each variable, the value that minimizes the training error.


In [8]:
k_max = 20

var_performance = []
k_values = []

for var in range(X_tr.shape[1]):
    sq_error_tr = [square_error(S_tr,knn_regression(X_tr[:,var],S_tr,X_tr[:,var],k+1)) for k in range(k_max)]
    sq_error_tst = [square_error(S_tst,knn_regression(X_tr[:,var],S_tr,X_tst[:,var],k+1)) for k in range(k_max)]
    sq_error_tr = np.asarray(sq_error_tr)
    sq_error_tst = np.asarray(sq_error_tst)

    # We select the variable associated to the value of k for which the training error is minimum
    pos = np.argmin(sq_error_tr)
    k_values.append(pos+1)
    var_performance.append(sq_error_tst[pos])
    
plt.stem(range(X_tr.shape[1]),var_performance)
plt.title('Results of unidimensional regression (kNN)')
plt.xlabel('Variable')
plt.ylabel('Quadratic error')

plt.figure(2)
plt.stem(range(X_tr.shape[1]),k_values)
plt.xlabel('Variable')
plt.ylabel('k')
plt.title('Selection of the hyperparameter')


Out[8]:
<matplotlib.text.Text at 0x10811b310>

1.5 Multidimensional regression with the k-nn method

In the previous subsection, we have studied the performance of the k-nn method when using only one variable. Doing so was convenient, because it allowed us to plot the regression curves in a 2-D plot, and to get some insight about the consequences of modifying the number of neighbors.

For completeness, we evaluate now the performance of the k-nn method in this dataset when using all variables together. In fact, when designing a regression model, we should proceed in this manner, using all available information to make as accurate an estimation as possible. In this way, we can also account for correlations that might be present among the different observation variables, and that may carry very relevant information for the regression task.

For instance, in the STOCK dataset, it may be that the combination of the stock values of two airplane companies is more informative about the price of the target company, while the value for a single company is not enough.

Also, in the CONCRETE dataset, it may be that, for the particular problem at hand, the combination of a large proportion of water and a small proportion of coarse grain is a clear indication of a certain compressive strength of the material, while the proportion of water or coarse grain alone is not enough to reach that conclusion.


In [12]:
k_max = 20

sq_error_tr = [square_error(S_tr,knn_regression(X_tr,S_tr,X_tr,k+1)) for k in range(k_max)]
sq_error_tst = [square_error(S_tst,knn_regression(X_tr,S_tr,X_tst,k+1)) for k in range(k_max)]

plt.plot(np.arange(k_max)+1,sq_error_tr,'b-',label='Training square error')
plt.plot(np.arange(k_max)+1,sq_error_tst,'r--',label='Test square error')
plt.xlabel('k')
plt.ylabel('Square error')

plt.legend(loc='best')


Out[12]:
<matplotlib.legend.Legend at 0x109cd5610>

In this case, we can check that the average test square error is much lower than the error that was achieved when using only one variable, and also far better than the baseline method. It is also interesting to note that in this particular case the best performance is achieved for a small value of $k$, with the error increasing for larger values of the hyperparameter.
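As a quick check, the following sketch (assuming the variables `sq_error_tst`, `S_tr` and `S_tst` from the previous cells are still in the workspace) locates the value of $k$ minimizing the test error curve above and compares the corresponding error with the baseline of Section 1.3:

In [ ]:
# Sketch: best k on the multidimensional test-error curve, compared with the baseline
# (it assumes sq_error_tst, S_tr and S_tst from the previous cells are still defined)
sq_error_tst = np.asarray(sq_error_tst)
k_best = np.argmin(sq_error_tst) + 1       # +1 because position 0 corresponds to k = 1
print('Best k on the test set: ' + str(k_best))
print('Test square error (k-NN, all variables): ' + str(sq_error_tst[k_best - 1]))
print('Test square error (baseline): ' + str(square_error(S_tst, np.mean(S_tr))))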

1.6 Hyperparameter selection via cross-validation

A drawback of the $k$-NN method is that the selection of $k$ has a direct influence on the final error of the algorithm. In the previous experiments, we kept the value of $k$ that minimized the square error on the training set. However, we also noticed that the location of the minimum is not necessarily the same from the perspective of the test data. Ideally, we would like the designed regression model to work as well as possible on future unlabeled patterns that are not available during the training phase. This property is known as generalization. Fitting the training data is pursued only in the hope that we are also indirectly obtaining a model that generalizes well. In order to achieve this goal, there are some strategies that try to guarantee a correct generalization of the model. One such approach is known as cross-validation.

Since using the test labels during the training phase is not allowed (they should be kept aside to simulate the future application of the regression model on unseen patterns), we need some way of adjusting the hyperparameter that relies only on the training data. Cross-validation allows us to do so through the following steps:

  • Split the training data into several (generally non-overlapping) subsets. If we use $M$ subsets, the method is referred to as $M$-fold cross-validation. If each pattern constitutes its own subset, the method is usually referred to as leave-one-out (LOO) cross-validation.
  • Carry out the training of the system $M$ times. For each run, use a different partition as a validation set, and use the remaining partitions as the training set. Evaluate the performance for different choices of the hyperparameter (i.e., for different values of $k$ for the $k$-NN method).
  • Average the validation error over all partitions, and pick the hyperparameter that provided the minimum validation error.
  • Rerun the algorithm using all the training data, keeping the value of the parameter that came out of the cross-validation process.

In [10]:
## This fragment of code runs k-nn with M-fold cross validation

# Obtain the indices for the different folds
n_tr = X_tr.shape[0]
M = 3
permutation = np.random.permutation(n_tr)

#Not very smart, but it works
set_indices = {}
for k in range(M):
    set_indices[k] = []
k = 0
for pos in range(n_tr):
    set_indices[k].append(permutation[pos])
    k = (k+1) % M
    
# Now, we run the cross-validation process using the k-nn method
var = 0
k_max = 40

# Obtain the validation errors
sq_error_val = np.zeros((1,k_max))
# Use a separate name for the fold index, since k is reused below as the number-of-neighbours index
for fold in range(M):
    val_indices = set_indices[fold]
    tr_indices = []
    for other_fold in range(M):
        if other_fold != fold:
            tr_indices += set_indices[other_fold]
    
    sq_error_val_iter = [square_error(S_tr[val_indices],knn_regression(X_tr[tr_indices,var],S_tr[tr_indices] \
                                                                       ,X_tr[val_indices,var],k+1)) for k in range(k_max)]
    sq_error_val = sq_error_val + np.asarray(sq_error_val_iter).T
    
sq_error_val = sq_error_val/M

# We compute now the train and test errors curves
sq_error_tr = [square_error(S_tr,knn_regression(X_tr[:,var],S_tr,X_tr[:,var],k+1)) for k in range(k_max)]
sq_error_tst = [square_error(S_tst,knn_regression(X_tr[:,var],S_tr,X_tst[:,var],k+1)) for k in range(k_max)]

plt.plot(np.arange(k_max)+1,sq_error_tr,'b-',label='Training square error')
plt.plot(np.arange(k_max)+1,sq_error_tst,'r--',label='Test square error')
plt.plot(np.arange(k_max)+1,sq_error_val.T,'g--',label='Validation square error')

plt.legend(loc='best')


Out[10]:
<matplotlib.legend.Legend at 0x108623950>
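Note that the cell above only computes the validation error curve; the last step of the procedure (selecting $k$ and rerunning the method on all the training data) is not carried out. A minimal sketch of that step, assuming `sq_error_val`, `X_tr`, `S_tr`, `X_tst`, `S_tst` and `var` from the previous cell are still in the workspace, could be:

In [ ]:
# Sketch of the final cross-validation step: pick the k with the lowest validation error
# and apply the k-NN regressor trained on the full training set to the test data
# (variables from the previous cell are assumed to be available)
k_cv = np.argmin(sq_error_val) + 1         # +1 because position 0 corresponds to k = 1
est_tst_cv = knn_regression(X_tr[:, var], S_tr, X_tst[:, var], k_cv)
print('k selected by cross-validation: ' + str(k_cv))
print('Test square error with the selected k: ' + str(square_error(S_tst, est_tst_cv)))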

1.7 Scikit-learn implementation

In practice, most well-known machine learning methods are already implemented and available for Python. Probably the most complete library of machine learning tools is Scikit-learn. The following piece of code uses the method

KNeighborsRegressor

available in Scikit-learn. The example has been taken from the Scikit-learn documentation. As you can check, this routine allows us to build the estimation for a particular point using a weighted average of the targets of the neighbors:

To obtain the estimation at a point ${\bf x}$:

  • Find the $k$ closest points to ${\bf x}$ in the training set
  • Average the corresponding targets, weighting each value according to the distance of each point to ${\bf x}$, so that closer points have a larger influence in the estimation.
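With the `'distance'` option used below, scikit-learn weights each neighbor by the inverse of its distance, so that (denoting by $d_i$ the distance from ${\bf x}$ to its $i$-th neighbor and by $s^{(i)}$ the corresponding target) the estimate takes the form

$${\hat s}({\bf x}) = \frac{\sum_{i=1}^k s^{(i)}/d_i}{\sum_{i=1}^k 1/d_i}$$

With the `'uniform'` option, all neighbors receive the same weight and the estimate reduces to the plain average used in the previous sections.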

In [11]:
# Author: Alexandre Gramfort <alexandre.gramfort@inria.fr>
#         Fabian Pedregosa <fabian.pedregosa@inria.fr>
#
# License: BSD 3 clause (C) INRIA


###############################################################################
# Generate sample data
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors

np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()

# Add noise to targets
y[::5] += 1 * (0.5 - np.random.rand(8))

###############################################################################
# Fit regression model
n_neighbors = 5

for i, weights in enumerate(['uniform', 'distance']):
    knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
    y_ = knn.fit(X, y).predict(T)

    plt.subplot(2, 1, i + 1)
    plt.scatter(X, y, c='k', label='data')
    plt.plot(T, y_, c='g', label='prediction')
    plt.axis('tight')
    plt.legend()
    plt.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors,
                                                                weights))

plt.show()


Exercise

Use the scikit-learn implementation of the $k$-NN method to compute the generalization error on the CONCRETE dataset. Compare the performance obtained when using uniform and distance-based weights in the computation of the estimates. Visualize the regression curves and the error for different values of $k$.


In [ ]: