Ridge Regression

Given a dataset $$S = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$$ we want to solve the Tikhonov minimization problem with square loss:

$$\min_f\frac{1}{2}\sum\limits_{i=1}^n(f(X_i)-Y_i)^2 + \frac{\lambda}{2}\|f\|_k^2$$

where $f$ is the regression function and $\|\cdot\|_k$ is the norm of the reproducing kernel Hilbert space induced by the kernel $k$. According to the representer theorem, the solution can be written as:

$$f = \sum\limits_{i=1}^nc_ik(X_i,\cdot)$$

So the formulation can be rewritten in matrix form as $$\min_c\frac{1}{2}\|Y-Kc\|^2 + \frac{\lambda}{2}c^tKc$$ where $K$ is the kernel matrix whose entry in the $i$-th row and $j$-th column is $k(X_i, X_j)$, $Y$ is the vector of targets, and $c$ is the vector of coefficients.

The closed-form solution to this minimization problem is $c^* = (K+\lambda I)^{-1}Y$. Given a new data point $X^*$, the model predicts $f(X^*) = \sum\limits_{i=1}^n c_i^* k(X_i,X^*) = Y^t(K+\lambda I)^{-1}k(X,X^*)$, where $k(X,X^*)$ is the vector with entries $k(X_i,X^*)$.
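As a concrete illustration, here is a minimal sketch of the closed-form solution in NumPy. The Gaussian kernel and the helper names (`gaussian_kernel`, `fit_kernel_ridge`, `predict_kernel_ridge`) are illustrative choices, not part of the derivation above.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # pairwise k(a, b) = exp(-||a - b||^2 / (2 sigma^2)); the kernel choice is an assumption
    sq_dists = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A.dot(B.T)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def fit_kernel_ridge(X, Y, lam, sigma=1.0):
    # c* = (K + lambda I)^{-1} Y; solve a linear system instead of forming the inverse
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), Y)

def predict_kernel_ridge(X_train, c, X_new, sigma=1.0):
    # f(X*) = sum_i c_i k(X_i, X*)
    return gaussian_kernel(X_new, X_train, sigma).dot(c)
```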

Inverting $(K+\lambda I)$ directly is computationally expensive, and we typically have to solve the system for many values of $\lambda$ while tuning the regularization. Alternatively, we can compute an eigendecomposition of $K$ once and reuse it to obtain $c^*$ for any $\lambda$.

Let the eigendecomposition of $K$ be $K = Q\Lambda Q^t$, where $Q$ is an orthogonal matrix and $\Lambda$ is diagonal. The decomposition takes $O(n^3)$ time, but afterwards $K + \lambda I = Q(\Lambda+\lambda I)Q^t$ is easy to invert for any $\lambda$, and $c^* = Q(\Lambda+\lambda I)^{-1}Q^tY$.
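A minimal sketch of this reuse, assuming the kernel matrix $K$ has already been computed (the function name is hypothetical): decompose once with `np.linalg.eigh`, then each value of $\lambda$ costs only matrix-vector products.

```python
import numpy as np

def kernel_ridge_path(K, Y, lambdas):
    # one O(n^3) eigendecomposition of the symmetric kernel matrix: K = Q diag(eigvals) Q^t
    eigvals, Q = np.linalg.eigh(K)
    QtY = Q.T.dot(Y)
    # c* = Q (Lambda + lambda I)^{-1} Q^t Y; each lambda now costs only O(n^2)
    return {lam: Q.dot(QtY / (eigvals + lam)) for lam in lambdas}
```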

A special case is the linear kernel $k(X_i, X_j) = X_i^tX_j$, for which the problem reduces to linear regression with an extra L2-norm regularizer on the weights. In this case the regression function becomes a dot product between a weight vector $w$ and the new data point $X^*$, where $w = (X^tX + \lambda I)^{-1}X^tY$ and $X$ is the data matrix whose rows are the $X_i$.
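A minimal sketch of this primal closed form (the notebook cells below compute the same $w$ through an SVD of the data matrix, which makes sweeping over $\lambda$ cheap):

```python
import numpy as np

def linear_ridge(X, Y, lam):
    # w = (X^t X + lambda I)^{-1} X^t Y; again solve a system rather than inverting
    d = X.shape[1]
    return np.linalg.solve(X.T.dot(X) + lam * np.eye(d), X.T.dot(Y))

# prediction for a new point is the dot product f(X*) = w^t X*
```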

A common approach to finding a good penalty parameter $\lambda$ is cross validation. In practice we either hold out a validation set (when data is plentiful) or use leave-one-out (when the training set is small); the cells below demonstrate the hold-out approach on the Boston housing data.


In [94]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
import numpy as np
from sklearn.preprocessing import RobustScaler

boston = load_boston()

X = boston.data

## split into training, validation and test sets; a single validation set (not full CV) is used just to demonstrate the idea
boston_X_train = X[:-250]
boston_X_val = X[-250:-100]
boston_X_test = X[-100:]

boston_Y_train = boston.target[:-250]
boston_Y_val = boston.target[-250:-100]
boston_Y_test = boston.target[-100:]

In [95]:
### fit the model on the training data via the thin SVD X = U S V^t,
### which gives w = (X^t X + alpha I)^{-1} X^t Y = V diag(s_i / (s_i^2 + alpha)) U^t Y
alpha = 1.0
U, s, Vt = np.linalg.svd(boston_X_train, full_matrices=False)
temp = np.diag(s/(s*s+alpha))
w = np.dot(np.dot(np.transpose(Vt), np.dot(temp, np.transpose(U))), boston_Y_train)

In [96]:
# mean squared error on the held-out test set
predicts = np.dot(boston_X_test, w)
print np.mean((predicts - boston_Y_test) ** 2)


128.254245632

In [97]:
# plot true test targets (black) and ridge predictions (red) against sample index
plt.scatter(range(len(boston_Y_test)), boston_Y_test,  color='black')
plt.scatter(range(len(predicts)), predicts,  color='red')


Out[97]:
<matplotlib.collections.PathCollection at 0x10c36b8d0>

In [99]:
alpha = 0.01
U, s, Vt = np.linalg.svd(boston_X_train, full_matrices=False)
alpha_opt = alpha
val_cost = np.inf
# sweep alpha over a geometric grid, reusing the SVD computed once above
while alpha <= 10000:
    temp = np.diag(s/(s*s+alpha))
    w = np.dot(np.dot(np.transpose(Vt), np.dot(temp, np.transpose(U))), boston_Y_train)
    predicts = np.dot(boston_X_val, w)
    new_val_cost = np.mean((predicts - boston_Y_val) ** 2)
    if new_val_cost < val_cost:
        val_cost = new_val_cost
        alpha_opt = alpha
        
    print "validation error is ", val_cost, " when alpha is ", alpha, "norm of w is ",np.dot(w,w)
    alpha *=2

print "best regularization parameter is: ", alpha_opt


validation error is  347.939901981  when alpha is  0.01 norm of w is  241.317362774
validation error is  342.821678984  when alpha is  0.02 norm of w is  237.00208692
validation error is  333.087470488  when alpha is  0.04 norm of w is  228.827390371
validation error is  315.427301925  when alpha is  0.08 norm of w is  214.111409312
validation error is  286.031981616  when alpha is  0.16 norm of w is  189.981933085
validation error is  243.744549481  when alpha is  0.32 norm of w is  156.232013012
validation error is  194.890426535  when alpha is  0.64 norm of w is  119.117327227
validation error is  152.294130171  when alpha is  1.28 norm of w is  89.0616921819
validation error is  125.273670565  when alpha is  2.56 norm of w is  70.9701518437
validation error is  114.792173635  when alpha is  5.12 norm of w is  61.2455565318
validation error is  114.792173635  when alpha is  10.24 norm of w is  54.0411606499
validation error is  114.792173635  when alpha is  20.48 norm of w is  45.501402487
validation error is  114.792173635  when alpha is  40.96 norm of w is  34.3637858733
validation error is  114.792173635  when alpha is  81.92 norm of w is  22.0420068173
validation error is  114.792173635  when alpha is  163.84 norm of w is  11.6759611377
validation error is  114.792173635  when alpha is  327.68 norm of w is  5.28255857516
validation error is  107.444029409  when alpha is  655.36 norm of w is  2.28284388506
validation error is  89.4185714182  when alpha is  1310.72 norm of w is  1.08231447151
validation error is  82.7122227069  when alpha is  2621.44 norm of w is  0.579874027596
validation error is  82.7122227069  when alpha is  5242.88 norm of w is  0.312389311208
best regularization parameter is:  2621.44
