Regression problems generally deal with estimating a set of unknown parameters ($\beta$) from a set of known "independent" variables $X$ such that an arbitrary function $f(X,\beta)$ (the model) can approximate a set of "dependent" variables $Y$, or $f(X,\beta)\approx Y$.

Many algorithms exist, depending on whether $f$ is linear or non-linear and on the particularities of the $X$ and $Y$ datasets.
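
For the linear case the parameters can be estimated with ordinary least squares. Below is a minimal sketch on synthetic data (the data and variable names are illustrative and not part of the diabetes example that follows; a reasonably recent numpy is assumed for the rcond argument):

In [ ]:
import numpy as np

# Illustrative linear case: Y is generated from X with known parameters,
# and ordinary least squares recovers an estimate of beta from X and Y alone.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))                      # known "independent" variables
beta_true = np.array([1.5, -2.0, 0.5])             # the unknown parameters
Y = X.dot(beta_true) + 0.1 * rng.normal(size=100)  # observed "dependent" variable

beta_hat, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)  # should be close to beta_true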

Diabetes dataset

Most classes of ML algorithms are developed and tested on widely used, general-purpose datasets. Some standard datasets are famous, such as the diabetes dataset. While many things can be learned from it, here it is used especially as an example for simple regression.

Ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of n = 442 diabetes patients, as well as the response of interest: a quantitative measure of disease progression one year after baseline.

  • features: 10 physiological baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements)
  • samples: 442 patients
  • target (response): a quantitative measure of disease progression one year after baseline (see the snippet after this list)
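
As a quick check, these fields can be inspected directly from the scikit-learn loader (a minimal sketch; it assumes a scikit-learn version whose load_diabetes result exposes feature_names):

In [ ]:
from sklearn import datasets

diabetes = datasets.load_diabetes()
print(diabetes.feature_names)    # the 10 baseline variables
print(diabetes.data.shape)       # (442, 10): 442 patients, 10 features
print(diabetes.target[:5])       # disease progression one year after baseline
# Note: the feature columns come pre-centered and scaled, which is why
# the raw values printed in the next cell are small.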

Linear regression

In the example below, a linear regression model is used to predict diabetes progression from a single feature. Which feature does best at predicting disease progression?


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model


diabetes = datasets.load_diabetes()
#diabetes
print diabetes.data.shape, diabetes.target.shape
print diabetes.data[:5,:3]

ft = 2  # feature index: 0 - age, 1 - sex, 2 - bmi (body mass index)

X = diabetes.data[:, np.newaxis]  # insert a new axis: (442, 10) becomes (442, 1, 10)
print "Expansion:", diabetes.data.shape, X.shape
print X[0]
print X[0, 0, ft]

Xt = X[:, :, ft]  # keep only the chosen feature, as a (442, 1) column vector
print Xt.shape
print Xt[:5]


(442, 10) (442,)
[[ 0.03807591  0.05068012  0.06169621]
 [-0.00188202 -0.04464164 -0.05147406]
 [ 0.08529891  0.05068012  0.04445121]
 [-0.08906294 -0.04464164 -0.01159501]
 [ 0.00538306 -0.04464164 -0.03638469]]
Expansion: (442, 10) (442, 1, 10)
[[ 0.03807591  0.05068012  0.06169621  0.02187235 -0.0442235  -0.03482076
  -0.04340085 -0.00259226  0.01990842 -0.01764613]]
0.0616962065187
(442, 1)
[[ 0.06169621]
 [-0.05147406]
 [ 0.04445121]
 [-0.01159501]
 [-0.03638469]]

Cross-validation

Cross-validation is a general validation method that uses part of the dataset to train the model (the training data), while the rest is used to estimate how effective the model is at predicting (the test data). This splits X and Y into four arrays: training X, training Y, testing X and testing Y. Usually several cross-validation rounds are performed by randomly picking the training samples, and the training set is conventionally about 80% of the whole data; a minimal example of such a split is sketched below.
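
As a concrete illustration, here is a minimal sketch of a single random 80/20 split and of a K-fold split. It assumes a recent scikit-learn where the splitting utilities live in sklearn.model_selection (older releases kept them in sklearn.cross_validation):

In [ ]:
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split, KFold

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# One random split: 80% of the samples for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regr = linear_model.LinearRegression().fit(X_train, y_train)
print(regr.score(X_test, y_test))  # R^2 on the held-out test data

# K-fold: the data is split into K parts, each part serving once as the test set.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    regr = linear_model.LinearRegression().fit(X[train_idx], y[train_idx])
    print(regr.score(X[test_idx], y[test_idx]))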

Goodness of fit

  1. Residual sum of squares, $u = \sum (y_t - y_p)^2$, where $y_p$ are the predicted values; divided by the number of samples it gives the mean square error.
  2. Total sum of squares, $v = \sum (y_t - \langle y_t \rangle)^2$, the spread of the observed values around their mean.
  3. $R^2$, the coefficient of determination, defined as $1 - u/v$. The best possible score is 1.0; lower values are worse (see the sketch after this list).
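
The sketch below computes these quantities by hand for a single-feature model and checks them against scikit-learn (it is self-contained and mirrors the split used in the next cell; sklearn.metrics.r2_score and the estimator's score method both return $R^2$):

In [ ]:
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:, 2:3]   # a single feature (bmi), kept as a 2D column
y = diabetes.target

X_train, X_test = X[:-20], X[-20:]   # last 20 samples held out for testing
y_train, y_test = y[:-20], y[-20:]

regr = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

u = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
v = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
print(u / len(y_test))                     # mean square error
print(1 - u / v)                           # R^2 computed by hand
print(r2_score(y_test, y_pred))            # same value from sklearn.metrics
print(regr.score(X_test, y_test))          # and from the estimator itself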

Task

Modify the script to perform cross-validation on 100 randomly picked training sets. Design a cross-validation function of your own, then look up scikit-learn's own splitting functions and use them instead. What is KFold cross-validation?


In [24]:
#Split the data into training/testing sets
X_train = Xt[:-20]
X_test = Xt[-20:]

#Split the targets into training/testing sets
y_train = diabetes.target[:-20]
y_test = diabetes.target[-20:]

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_predict = regr.predict(X_test)
score = regr.score(X_test, y_test)

print 'Betas (regression coefficients): \n', regr.coef_
print("Mean square error: %.2f" % np.mean((y_predict - y_test) ** 2))
print('Variance score: %.2f' % score)

plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, y_predict, color='blue', linewidth=3)


Betas (regression coefficients): 
[ 306.72757499]
Mean square error: 5472.26
Variance score: -0.13
Out[24]:
[<matplotlib.lines.Line2D at 0x7f95b899e250>]

In [29]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X = diabetes.data  # this time use all 10 features, not just one

#Split the data into training/testing sets
X_train = X[:-20]
X_test = X[-20:]

#Split the targets into training/testing sets
y_train = diabetes.target[:-20]
y_test = diabetes.target[-20:]

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_predict = regr.predict(X_test)
score = regr.score(X_test, y_test)

print('Coefficients: \n', regr.coef_)
print("Mean square error: %.2f"% np.mean((y_predict - y_test) ** 2))
print('Variance score: %.2f' % score)


('Coefficients: \n', array([  3.03499549e-01,  -2.37639315e+02,   5.10530605e+02,
         3.27736980e+02,  -8.14131709e+02,   4.92814588e+02,
         1.02848452e+02,   1.84606489e+02,   7.43519617e+02,
         7.60951722e+01]))
Mean square error: 2004.57
Variance score: 0.59

Multivariate case, a PLS Regression example

So far our regression target was one-dimensional. The harder problem, however, involves multiple target vectors, also called multivariate (multi-target) regression. One can still use linear regression techniques, with various corrections that account for a number of pitfalls, one being that the X matrix must be full rank for the least-squares optimization (the engine of linear regression) to work. There are also many other techniques, one of my favorites being partial least squares regression (PLS-R), which I choose to exemplify.

PLS is also called projection to latent structures, which might be the more appropriate name, since the multiple regression problem

$Y = XB + E$

is solved by projecting the X and Y matrices into a latent (hidden) lower-dimensional space that describes them. Just as with least-squares fitting, it is difficult to explain here how this is done, and it requires some knowledge of factor analysis. The end result is a decomposition into products of score and loading matrices:

$X = T P^{\top} + E$

$Y = U Q^{\top} + F$

from which the coefficient matrix B is estimated.
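
To connect these formulas with the code below: after fitting, scikit-learn's PLSRegression exposes the score and loading matrices as attributes. A minimal sketch (the attribute names x_scores_, x_loadings_, y_scores_, y_loadings_ and coef_ are those of recent scikit-learn releases, and the orientation of coef_ varies between versions):

In [ ]:
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))                                               # predictors
Y = X[:, :2].dot(rng.normal(size=(2, 3))) + 0.1 * rng.normal(size=(100, 3))  # 3 targets

pls = PLSRegression(n_components=2).fit(X, Y)
print(pls.x_scores_.shape)    # T: (n_samples, n_components)
print(pls.x_loadings_.shape)  # P: (n_features, n_components)
print(pls.y_scores_.shape)    # U: (n_samples, n_components)
print(pls.y_loadings_.shape)  # Q: (n_targets, n_components)
print(pls.coef_.shape)        # B, used by predict(); orientation depends on the version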


In [5]:
import numpy as np
from sklearn.cross_decomposition import PLSRegression

n = 1000
q = 3
p = 10
X = np.random.normal(size=n * p).reshape((n, p))
B = np.array([[1, 2] + [0] * (p - 2)] * q).T
# each Yj = 1*X1 + 2*X2 + noise
#B = np.array([[1, 0.5, 2, 4] + [0] * (p - 4)] * q).T

Y = np.dot(X, B) + np.random.normal(size=n * q).reshape((n, q)) + 5

pls2 = PLSRegression(n_components=2)
pls2.fit(X, Y)
print("True B (such that: Y = XB + Err)")
print(B)
# compare the estimated coefficients with the true B
print("Estimated B")
print(np.round(pls2.coef_, 1))  # may be transposed relative to B in newer scikit-learn versions
Yp = pls2.predict(X)
from sklearn.metrics import r2_score
print "R2",r2_score(Y, Yp)

T, U = pls2.transform(X, Y) #Apply the dimension reduction learned on the training data

## Notice that the first component is well correlated with the informative columns of X (here X1 and X2)
cp = 0  # latent component to plot: 0 - first, 1 - second, etc.
xc = 1  # X matrix column to plot against it
import matplotlib.pyplot as plt
plt.plot(T[:, cp], X[:, xc], "ob")
plt.title("T vs X, component " + str(cp + 1))
plt.ylabel("X column " + str(xc))
plt.xlabel("latent matrix scores (T)")


True B (such that: Y = XB + Err)
[[1 1 1]
 [2 2 2]
 [0 0 0]
 [0 0 0]
 [0 0 0]
 [0 0 0]
 [0 0 0]
 [0 0 0]
 [0 0 0]
 [0 0 0]]
Estimated B
[[ 1.  1.  1.]
 [ 2.  2.  2.]
 [-0.  0.  0.]
 [-0. -0. -0.]
 [-0. -0. -0.]
 [-0. -0. -0.]
 [-0. -0. -0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]
 [-0. -0. -0.]]
R2 0.841716030032
Out[5]:
<matplotlib.text.Text at 0x7feb7df8c4d0>

In [ ]: