Regression Models

By Saurabh Mahindre (github.com/Saurabh7) as part of a Google Summer of Code 2014 project, mentored by Heiko Strathmann (github.com/karlnapf, herrstrathmann.de)

This notebook demonstrates various regression methods provided in Shogun. Linear models like Least Squares regression, Ridge regression and Least Angle regression, as well as kernel-based methods like Kernel Ridge regression, are discussed and applied to toy and real-life data.

Introduction

Regression is a case of supervised learning where the goal is to learn a mapping from inputs $x\in\mathcal{X}$ to outputs $y\in\mathcal{Y}$, given a labeled set of input-output pairs $\mathcal{D} = \{(x_i,y_i)\}_{i=1}^{N} \subseteq \mathcal{X} \times \mathcal{Y}$. In regression analysis the response variable $y_i$ is continuous. Regression finds applications in many fields, for example predicting stock prices or consumption spending. In linear regression, the mapping is a linear (straight-line) equation.

Least Squares regression

A linear regression model can be defined as $\text y = {\bf w} \cdot {\bf x} + b$. Here $\text y$ is the predicted value, $\text x$ the independent variable and $\bf w$ the so-called weights. We aim to find the linear function (line) that best explains the data, i.e. that minimises some measure of deviation from the training data $\mathcal{D}$. One such measure is the sum of squared distances. The Ordinary Least Squares method minimizes the sum of squared distances between the observed responses in the dataset and the responses predicted by the linear approximation.

These distances, called residuals, have to be minimized. This can be represented as:$$E({\bf{w}}) = \sum_{i=1}^N(y_i-{\bf w}\cdot {\bf x}_i)^2$$ One can differentiate with respect to $\bf w$ and equate to zero to determine the $\bf w$ that minimises $E({\bf w})$. This leads to a solution of the form:

$${\bf w} = \left(\sum_{i=1}^N{\bf x}_i{\bf x}_i^T\right)^{-1}\left(\sum_{i=1}^N y_i{\bf x}_i\right)$$
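
To make this closed-form solution concrete, here is a minimal NumPy sketch (not Shogun code; the toy data and variable names are made up for illustration) that solves the normal equations directly:


In [ ]:
# NumPy-only sketch of the normal equations (illustration, not Shogun code)
import numpy as np

x = np.random.rand(50)*10                    # 1-D inputs
y = 3*x + 2 + np.random.randn(50)            # noisy line with slope 3 and intercept 2

# append a constant feature so the bias b is learned as part of w
X = np.vstack((x, np.ones_like(x)))          # shape (2, N), columns are samples
w = np.linalg.solve(X.dot(X.T), X.dot(y))    # (sum x_i x_i^T)^{-1} (sum y_i x_i)
print(w)                                     # approximately [3, 2]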

Prediction using Least Squares

Regression using least squares is demonstrated below on toy data. Shogun provides the CLeastSquaresRegression class for this. The data is a straight line with slope 3 plus a lot of noise. Comparing with the mathematical equation above, we thus expect $\text w$ to be around 3 for a good prediction. Once the data is converted to Shogun format, we are ready to train the machine. The training data is labeled using CRegressionLabels.


In [ ]:
%pylab inline
%matplotlib inline
# import all shogun classes
from modshogun import *
slope=3

X_train=rand(30)*10
y_train=slope*(X_train)+random.randn(30)*2+2
y_true=slope*(X_train)+2
X_test=concatenate((linspace(0,10, 50),X_train))

#Convert data to shogun format features
feats_train=RealFeatures(X_train.reshape(1,len(X_train)))
feats_test=RealFeatures(X_test.reshape(1,len(X_test)))
labels_train=RegressionLabels(y_train)

Training and generating weights

LeastSquaresRegression has to be initialised with the training features and training labels. Once that is done, we call train() to learn from the data. This also generates the $\text w$ from the general equation described above. To access $\text w$, use get_w().


In [ ]:
ls=LeastSquaresRegression(feats_train, labels_train)
ls.train()
w=ls.get_w()
print 'Weights:'
print w

This value of $\text w$ is pretty close to 3, which certifies a pretty good fit for the training data. Now let's apply this trained machine to our test data to get the output values.


In [ ]:
out=ls.apply(feats_test).get_labels()

As an aid to visualisation, a plot of the output and also of the residuals is shown. The sum of the squares of these residuals is minimised.


In [ ]:
figure(figsize=(20,5))
#Regression and true plot
pl1=subplot(131)
title('Regression')
_=plot(X_train, labels_train.get_labels(), 'ro')
_=plot(X_test,out, color='blue')
_=plot(X_train, y_true, color='green')
p1 = Rectangle((0, 0), 1, 1, fc="r")
p2 = Rectangle((0, 0), 1, 1, fc="b")
p3 = Rectangle((0, 0), 1, 1, fc="g")
pl1.legend((p1, p2, p3), ["Training samples", "Predicted output", "True relationship"], loc=2)
xlabel('Samples (X)', fontsize=12)
ylabel('Response variable (Y)', fontsize=12)

#plot residues
pl2=subplot(132)
title("Squared error and output")
_=plot(X_test,out, linewidth=2)
gray()
_=scatter(X_train, labels_train.get_labels(), c=ones(30), cmap=gray(), s=40)
for i in range(50,80):
    plot([X_test[i],X_test[i]],[out[i],y_train[i-50]] , linewidth=2, color='red')
p1 = Rectangle((0, 0), 1, 1, fc="r")
p2 = Rectangle((0, 0), 1, 1, fc="b")
pl2.legend((p1, p2), ["Error/residuals to be squared", "Predicted output"], loc=2)
xlabel('Samples (X)', fontsize=12)
ylabel('Response variable (Y)', fontsize=12)
jet()

Ridge Regression

The function we choose should not only fit the training data well but also generalise well. If the coefficients/weights are unconstrained, they are susceptible to high variance and overfitting. To control variance, one has to regularize the coefficients, i.e. control how large the coefficients grow. This is what is done in Ridge regression, which is the L2-regularized (sum of squared components of $\bf w$) form of least squares. A penalty is imposed on the size of the coefficients. The error to be minimized is:

$$E({\bf{w}}) = \sum_{i=1}^N(y_i-{\bf w}\cdot {\bf x}_i)^2 + \tau||{\bf w}||^2$$

Here $\tau$ imposes a penalty on the weights. By differentiating the regularised training error and equating to zero, we find the optimal $\bf w$, given by:

$${\bf w} = \left(\tau {\bf I}+ \sum_{i=1}^N{\bf x}_i{\bf x}_i^T\right)^{-1}\left(\sum_{i=1}^N y_i{\bf x}_i\right)$$
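
As a quick sanity check of this closed form, the following NumPy sketch (illustration only; the data, the true weights and the value of $\tau$ are arbitrary choices) computes the ridge solution directly:


In [ ]:
# NumPy-only sketch of the ridge solution (illustration, not Shogun code)
import numpy as np

N, D = 50, 3
X = np.random.randn(D, N)                    # columns are samples
w_true = np.array([1.0, -2.0, 0.5])
y = w_true.dot(X) + 0.1*np.random.randn(N)

tau = 0.8
w_ridge = np.linalg.solve(tau*np.eye(D) + X.dot(X.T), X.dot(y))
print(w_ridge)                               # shrinks towards zero as tau grows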

Ridge regression can be performed in Shogun using the CLinearRidgeRegression class. It takes the regularization constant $\tau$ as an additional argument. Let us solve the basic regression example from above using it.


In [ ]:
tau=0.8
rr=LinearRidgeRegression(tau, feats_train, labels_train)
rr.train()
w=rr.get_w()
print w
out=rr.apply(feats_test).get_labels()

In [ ]:
figure(figsize=(20,5))
#Regression and true plot
pl1=subplot(131)
title('Ridge Regression')
_=plot(X_train, labels_train.get_labels(), 'ro')
_=plot(X_test, out, color='blue')
_=plot(X_train, y_true, color='green')
p1 = Rectangle((0, 0), 1, 1, fc="r")
p2 = Rectangle((0, 0), 1, 1, fc="b")
p3 = Rectangle((0, 0), 1, 1, fc="g")
pl1.legend((p1, p2, p3), ["Training samples", "Predicted output", "True relationship"], loc=2)
xlabel('Samples (X)', fontsize=12)
ylabel('Response variable (Y)', fontsize=12)
jet()

Relationship between weights and regularization

The prediction in the basic regression example was similar to that of least squares. To actually see ridge regression's forte, we analyse how the weights change with the regularization constant. Data with slightly more dimensions is sampled for this, because overfitting is more likely to occur in such data. Here the set_tau() method is used to set the regularization constant.


In [ ]:
#Generate Data
def generate_data(N, D):
    # sample N points with D dimensions and a random weight vector
    w = randn(D, 1)
    X = randn(N, D)
    # responses are a noisy linear function of the inputs
    y = dot(X, w) + randn(N, 1)
    return X, y.T

def generate_weights(taus, feats_train, labels_train):
    preproc=PruneVarSubMean(True)
    preproc.init(feats_train)
    feats_train.add_preprocessor(preproc)
    feats_train.apply_preprocessor()     
    weights=[]
    rr=LinearRidgeRegression(taus[0], feats_train, labels_train)
    
    #vary regularization
    for t in taus:
        rr.set_tau(t)
        rr.train()
        weights.append(rr.get_w())
    return weights, rr

def plot_regularization(taus, weights):     
    ax = gca()
    ax.set_color_cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])
    ax.plot(taus, weights, linewidth=2)
    xlabel('Tau', fontsize=12)
    ylabel('Weights', fontsize=12)
    ax.set_xscale('log')

The mean squared error (MSE) of an estimator measures the average of the squares of the errors. The CMeanSquaredError class is used to compute the MSE as:

$$\frac{1}{|L|} \sum_{i=1}^{|L|} (L_i - R_i)^2$$

Here $L$ is the vector of predicted labels and $R$ is the vector of real labels.
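
In plain NumPy the same quantity is just the mean of the squared differences, as the small sketch below shows (the label vectors here are made up for illustration):


In [ ]:
# MSE between predicted labels L and real labels R (illustration only)
import numpy as np

L = np.array([2.9, 6.1, 9.2])                # predicted labels
R = np.array([3.0, 6.0, 9.0])                # real labels
print(np.mean((L - R)**2))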

We use 5-fold cross-validation to compute MSE and have a look at how MSE varies with regularisation.


In [ ]:
def xval_results(taus):
    errors=[]
    for t in taus:
        rr.set_tau(t)
        splitting_strategy=CrossValidationSplitting(labels_train, 5)
        # evaluation method
        evaluation_criterium=MeanSquaredError()
        # cross-validation instance
        cross_validation=CrossValidation(rr, feats_train, labels_train, splitting_strategy, evaluation_criterium, False)
        cross_validation.set_num_runs(100)
        cross_validation.set_conf_int_alpha(0.05)
        result=cross_validation.evaluate()
        result=CrossValidationResult.obtain_from_generic(result)
        errors.append(result.mean)
    return errors

Data with 10 dimensions and 200 samples is now sampled.


In [ ]:
n = 500
taus = logspace(-6, 4, n)
figure(figsize=(20,6))
suptitle('Effect of Regularisation for 10-dimensional data with 200 samples', fontsize=12)

matrix, y=generate_data(200,10)
feats_train = RealFeatures(matrix.T)
labels_train= RegressionLabels(y[0])
weights, rr=generate_weights(taus, feats_train, labels_train)
errors=xval_results(taus)

p1=subplot(121)
plot_regularization(taus, weights)

p2=subplot(122)
plot(taus, errors)
p2.set_xscale('log')
xlabel('Tau', fontsize=12)
ylabel('Error', fontsize=12)
jet()

As seen from the plot of errors, regularisation doesn't seem to affect the errors significantly. One interpretation could be that there is little overfitting because we have a large number of samples. For a sample size that is small compared to the dimensionality, the test set performance may be poor: the regression function fits the noise too much, while the interesting part of the signal is too small. We now generate 10 samples of 10-dimensional data to test this.


In [ ]:
figure(figsize=(20,6))
suptitle('Effect of Regularisation for 10-dimensional data with 10 samples', fontsize=12)
matrix, y=generate_data(10,10)
feats_train = RealFeatures(matrix.T)
labels_train= RegressionLabels(y[0])
weights, rr=generate_weights(taus, feats_train, labels_train)
errors=xval_results(taus)

p1=subplot(121)
plot_regularization(taus, weights)


p2=subplot(122)
plot(taus, errors)
p2.set_xscale('log')
xlabel('Tau', fontsize=12)
ylabel('Error', fontsize=12)
jet()

The first plot is the famous ridge trace that is the signature of this technique. The plot is straightforward to read: it presents the standardized regression coefficients (weights) on the vertical axis and various values of tau (the regularisation constant) along the horizontal axis. Since the values of tau ($\tau$) span several orders of magnitude, we adopt a logarithmic scale along this axis. As tau is increased, the values of the regression estimates change, often wildly at first. At some point the coefficients settle down and then gradually drift towards zero. Often the value of tau for which these coefficients are at their stable values is the best one. This should be supported by a low error value for that tau.

Least Angle Regression and LASSO

LASSO (Least Absolute Shrinkage and Selection Operator) is another regularized version of Least Squares regression, which penalizes the L1-norm of the parameter vector. This intuitively enforces sparse solutions, whereas L2-norm penalties usually result in smooth and dense solutions.

$$ \min \|X^T\beta - y\|^2 + \lambda\|\beta\|_1$$

In Shogun, the following equivalent form is solved, where increasing $C$ selects more variables:

$$\min \|X^T\beta - y\|^2 \quad s.t. \|\beta\|_1 \leq C $$

One way to solve this regularized form is by using Least Angle Regression (LARS).

LARS is essentially forward stagewise regression made fast. It can be briefly described as follows.

  1. Start with an empty set.
  2. Select $x_j$ that is most correlated with residuals.
  3. Proceed in the direction of $x_j$ until another variable $x_k$ is equally correlated with residuals.
  4. Choose equiangular direction between $x_j$ and $x_k$.
  5. Proceed until a third variable enters the active set, etc.

Note that instead of making tiny hops in the direction of one variable at a time, LARS makes optimally-sized leaps in optimal directions. These directions are chosen to make equal angles (equal correlations) with each of the variables currently in the active set (hence equiangular).
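
Step 2 is easy to see in isolation. The following NumPy sketch (illustration only, independent of Shogun; the data is random) centres and normalises the variables, then picks the one most correlated with the current residual, which is exactly how the first variable enters the path:


In [ ]:
# Illustration of LARS step 2: pick the variable most correlated with the residual
import numpy as np

N, D = 20, 5
X = np.random.randn(N, D)
Xc = X - X.mean(axis=0)                      # zero mean
Xn = Xc / np.linalg.norm(Xc, axis=0)         # unit norm
y = np.random.randn(N)

beta = np.zeros(D)                           # start with an empty (all-zero) model
residual = y - Xn.dot(beta)
correlations = Xn.T.dot(residual)
j = np.argmax(np.abs(correlations))
print("First variable to enter the path: %d" % j)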

Shogun provides tools for Least Angle Regression (LARS) and LASSO via the CLeastAngleRegression class. As explained in the mathematical formulation, LARS is just like stepwise regression, but increases the estimated variables in a direction equiangular to each one's correlations with the residual. Its working is shown below by plotting the LASSO path. Data is generated in a similar way to the previous section.


In [ ]:
#sample some data
X=rand(10)*1.5
for i in range(9):
    x=random.standard_normal(10)*0.5
    X=vstack((X, x))
y=ones(10)

feats_train=RealFeatures(X)
labels_train=RegressionLabels(y)

CLeastAngleRegression requires the features to be normalized with a zero mean and unit norm. Hence we use two preprocessors: PruneVarSubMean and NormOne.


In [ ]:
#Preprocess data
preproc=PruneVarSubMean()
preproc.init(feats_train)
feats_train.add_preprocessor(preproc)
feats_train.apply_preprocessor()

preprocessor=NormOne()
preprocessor.init(feats_train)
feats_train.add_preprocessor(preprocessor)
feats_train.apply_preprocessor()

print "(No. of attributes, No. of samples) of data:"
print feats_train.get_feature_matrix().shape

Next we train on the data. Keeping in mind that we had 10 attributes/dimensions in our data, let us have a look at the size of the LASSO path, which is obtained readily using get_path_size().


In [ ]:
#Train and generate weights
la=LeastAngleRegression()
la.set_labels(labels_train)
la.train(feats_train)

size=la.get_path_size()
print ("Size of path is %s" %size)

The weights generated ($\beta_i$) and their norm ($\sum_i|\beta_i|$) change at each step, i.e. whenever a new variable is added to the path. To get the weights at each of these steps, the get_w_for_var() method is used. The argument is the index of the variable, which should be in the range [0, path_size).


In [ ]:
#calculate weights
weights=[]
for i in range(size):
    weights.append(la.get_w_for_var(i))  
s = sum(abs(array(weights)), axis=1)
print ('Max. norm is %s' %s[-1])

figure(figsize=(30,7))
#plot 1
ax=subplot(131)
title('Lasso path')
ax.plot(s, weights, linewidth=2)
ymin, ymax = ylim()
ax.vlines(s[1:-1], ymin, ymax, linestyle='dashed')
xlabel("Norm")
ylabel("weights")

#Restrict norm to half for early termination
la.set_max_l1_norm(s[-1]*0.5)
la.train(feats_train)
size=la.get_path_size()
weights=[]
for i in range(size):
    weights.append(la.get_w_for_var(i))
s = sum(abs(array(weights)), axis=1)

#plot 2
ax2=subplot(132)
title('Lasso path with restricted norm')
ax2.plot(s, weights, linewidth=2)
ax2.vlines(s[1:-1], ymin, ymax, linestyle='dashed')
xlabel("Norm")
ylabel("weights")
print ('Restricted norm is  %s' %(s[-1]))

Each color in the plot represents a coefficient and the vertical lines denote steps. It is clear that the weights are piecewise linear functions of the norm.

Kernel Ridge Regression

Kernel ridge regression (KRR) is a kernel-based, regularized form of regression. The dual form of Ridge regression can be shown to be: $${\bf \alpha}=\left({\bf X}^T{\bf X}+\tau{\bf I}\right)^{-1}{\bf y} \quad \quad(1)$$ It can be seen that the equation to compute $\alpha$ only contains the vectors $\bf X$ in inner products with each other. If a non-linear mapping $\Phi : x \rightarrow \Phi(x) \in \mathcal F$ is used, the equation can be defined in terms of inner products $\Phi(x_i)^T \Phi(x_j)$ instead. We can then use the kernel trick, where a kernel function that can be evaluated efficiently is chosen: $K({\bf x_i, x_j})=\Phi({\bf x_i})^T\Phi({\bf x_j})$. This works because it is sufficient to know these inner products only, instead of the actual vectors $\bf x_i$. Linear regression methods like the Ridge Regression discussed above can then be carried out in the feature space by using a kernel function representing a non-linear map, which amounts to nonlinear regression in the original input space.
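
To make equation $(1)$ and the kernel trick concrete, here is a small NumPy sketch of kernel ridge regression with a Gaussian kernel (an illustration only; the kernel width, $\tau$ and the toy data are arbitrary, and the helper gaussian_kernel is defined just for this example):


In [ ]:
# NumPy-only sketch of kernel ridge regression with a Gaussian kernel (illustration only)
import numpy as np

def gaussian_kernel(A, B, width):
    # A: (n, d), B: (m, d) -> (n, m) kernel matrix exp(-||a-b||^2 / width)
    sq_dists = ((A[:, None, :] - B[None, :, :])**2).sum(axis=2)
    return np.exp(-sq_dists/width)

X = np.random.rand(40, 1)                            # training inputs in [0, 1]
y = np.sin(6*X[:, 0]) + 0.1*np.random.randn(40)      # nonlinear target with noise

tau, width = 0.5, 0.5
K = gaussian_kernel(X, X, width)
alpha = np.linalg.solve(K + tau*np.eye(len(X)), y)   # alpha = (K + tau*I)^{-1} y

X_test = np.linspace(0, 1, 100)[:, None]
y_pred = gaussian_kernel(X_test, X, width).dot(alpha)  # nonlinear predictions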

KRR can be performed in Shogun using the CKernelRidgeRegression class. Let us apply it to a nonlinear regression problem from the Boston Housing Dataset, where the task is to predict prices of houses by finding a relationship with the various attributes provided. The per capita crime rate attribute is used in this particular example.


In [ ]:
feats=RealFeatures(CSVFile('../../../data/uci/housing/fm_housing.dat'))
train_labels=RegressionLabels(CSVFile('../../../data/uci/housing/housing_label.dat'))
mat = feats.get_feature_matrix()

crime_rate=mat[0]
feats_train=RealFeatures(crime_rate.reshape(1, len(mat[0])))

preproc=RescaleFeatures()
preproc.init(feats_train)
feats_train.add_preprocessor(preproc)
feats_train.apply_preprocessor(True)

# Store preprocessed feature matrix.
preproc_data=feats_train.get_feature_matrix()

size=500
x1=linspace(0, 1, size)

In [ ]:
width=0.5
tau=0.5
kernel=GaussianKernel(feats_train, feats_train, width)
krr=KernelRidgeRegression(tau, kernel, train_labels)
krr.train(feats_train)

feats_test=RealFeatures(x1.reshape(1,len(x1)))
kernel.init(feats_train, feats_test)
out = krr.apply().get_labels()

In [ ]:
#Visualization of regression
fig=figure(figsize=(6,6))
#first plot with only one attribute
title("Regression with 1st attribute")
_=scatter(preproc_data[0], train_labels.get_labels(), c=ones(len(train_labels.get_labels())), cmap=gray(), s=20)
_=xlabel('Crime rate')
_=ylabel('Median value of homes')

_=plot(x1,out, linewidth=3)

As seen from the example, KRR (using the kernel trick) can apply techniques for linear regression in the feature space to perform nonlinear regression in the input space.

Support Vector Regression

In Kernel Ridge Regression $(1)$ we have seen that the result is a dense solution. Thus all training examples are active, which limits its use to smaller training sets. Support Vector Regression (SVR) uses the concept of support vectors, as in Support Vector Machines, which leads to a sparse solution. In the SVM a penalty was paid for being on the wrong side of the discriminating plane. Here we do the same thing: we introduce a penalty for being far away from the predicted line, but once you are close enough, i.e. in some “epsilon-tube” around this line, there is no penalty.
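
The “no penalty inside the tube” idea corresponds to the $\epsilon$-insensitive loss, sketched below in NumPy (illustration only; the residuals and the value of epsilon are made up):


In [ ]:
# Epsilon-insensitive loss used by SVR (illustration only)
import numpy as np

def eps_insensitive_loss(y_true, y_pred, epsilon):
    # zero inside the epsilon-tube, linear outside it
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

residuals = np.array([-0.3, 0.05, 0.8, 2.0])
print(eps_insensitive_loss(residuals, np.zeros_like(residuals), 0.5))
# values: 0, 0, 0.3 and 1.5 -- only points outside the tube are penalised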

We are given a labeled set of input-output pairs $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^N\subseteq \mathcal{X} \times \mathcal{Y}$ where $x\in\mathcal{X}$ and $y\in \mathcal{Y}$, and the primal problem is as follows: $$\arg\min_{\mathbf{w},\mathbf{\xi},\mathbf{\xi}^*,b} \left(\frac{1}{2} \|\mathbf{w}\|^2 +C \sum_{i=1}^N (\xi_i+ \xi_i^*)\right)$$

subject to the constraints: $$ {\bf w}^T{\bf x}_i+b-c_i-\xi_i\leq 0, \, \forall i=1\dots N$$ $$ -{\bf w}^T{\bf x}_i-b-c_i^*-\xi_i^*\leq 0, \, \forall i=1\dots N $$ with $c_i=y_i+ \epsilon$ and $c_i^*=-y_i+ \epsilon$.

The resulting dual optimization problem is: $$ \max_{{\bf \alpha},{\bf \alpha}^*} -\frac{1}{2}\sum_{i,j=1}^N(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*) {\bf x}_i^T {\bf x}_j-\sum_{i=1}^N(\alpha_i+\alpha_i^*)\epsilon - \sum_{i=1}^N(\alpha_i-\alpha_i^*)y_i$$ $$ \mbox{wrt}: {\bf \alpha},{\bf \alpha}^*\in{\bf R}^N \quad \mbox{s.t.}: \quad 0\leq \alpha_i,\alpha_i^*\leq C,\, \forall i=1\dots N, \quad \sum_{i=1}^N(\alpha_i-\alpha_i^*)=0 $$

This class also supports the $\nu$-SVR version of the problem, where $\nu$ replaces the $\epsilon$ parameter and represents an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. The resulting problem generally takes a bit longer to solve. The details and a comparison of these two versions can be found in [1].

Let us try regression using Shogun's LibSVR. The dataset from the last section is used. The svr_param argument is the $\epsilon$-tube width in the $\epsilon$-SVR version and the $\nu$ parameter in the other case.


In [ ]:
# Use different kernels
gaussian_kernel=GaussianKernel(feats_train, feats_train, 0.1)
#Polynomial kernel of degree 2
poly_kernel=PolyKernel(feats_train, feats_train, 2, True)
linear_kernel=LinearKernel(feats_train, feats_train)

kernels=[linear_kernel, poly_kernel, gaussian_kernel]

In [ ]:
svr_param=1
svr_C=10
svr=LibSVR(svr_C, svr_param, gaussian_kernel, train_labels, LIBSVR_EPSILON_SVR)

#Visualization of regression
x1=linspace(0, 1, size)
feats_test_=RealFeatures(x1.reshape(1,len(x1)))

def svr_regress(kernels):
    fig=figure(figsize=(8,8))
    for i, kernel in enumerate(kernels):
        svr.set_kernel(kernel)
        svr.train()
        out=svr.apply(feats_test_).get_labels()
        #subplot(1,len(kernels), i)
        #first plot with only one attribute
        title("Support Vector Regression")
        _=scatter(preproc_data[0], train_labels.get_labels(), c=ones(len(train_labels.get_labels())), cmap=gray(), s=20)
        _=xlabel('Crime rate')
        _=ylabel('Median value of homes')
        _=plot(x1,out, linewidth=3)
        ylim([0, 40])
    p1 = Rectangle((0, 0), 1, 1, fc="r")
    p2 = Rectangle((0, 0), 1, 1, fc="b")
    p3 = Rectangle((0, 0), 1, 1, fc="g")
    _=legend((p1, p2, p3), ["Gaussian Kernel", "Linear Kernel", "Polynomial Kernel"], loc=1)

    
svr_regress(kernels)

Let us compare the time taken by the two different models, similar to what is done in section 6 of [1]. The Boston Housing Dataset is used.


In [ ]:
import time
gaussian_kernel=GaussianKernel(feats, feats, 13)
nus=[0.2, 0.4, 0.6, 0.8]
epsilons=[0.16, 0.09, 0.046, 0.0188]
svr_C=10

def compare_svr(nus, epsilons):
    time_eps=[]
    time_nus=[]
    for i in range(len(epsilons)):
        svr_param=1
        svr=LibSVR(svr_C, epsilons[i], gaussian_kernel, train_labels, LIBSVR_EPSILON_SVR)
        t_start=time.clock()
        svr.train()
        time_test=(time.clock() - t_start)
        time_eps.append(time_test)
        
    for i in range(len(nus)):
        svr_param=1
        svr=LibSVR(svr_C, nus[i], gaussian_kernel, train_labels,  LIBSVR_NU_SVR)
        t_start=time.clock()
        svr.train()
        time_test=(time.clock() - t_start)
        time_nus.append(time_test)

    print "-"*72      
    print "|", "%15s" % 'Nu' ,"|", "%15s" % 'Epsilon',"|","%15s" % 'Time (Nu)' ,"|", "%15s" % 'Time(Epsilon)' ,"|"
    for i in range(len(nus)):
        print "-"*72      
        print "|", "%15s" % nus[i] ,"|", "%15s" %epsilons[i],"|","%15s" %time_nus[i] ,"|", "%15s" %time_eps[i] ,"|"  
    print "-"*72  
 
title_='SVR Performance on Boston Housing dataset'
print "%50s" %title_
compare_svr(nus, epsilons)

References

[1] Training $\nu$-Support Vector Regression: Theory and Algorithms by Chih-Chung Chang and Chih-Jen Lin