Linear Regression and Subset Selection

StatML: Lecture 3

Prof. James Sharpnack

Reading: "The Elements of Statistical Learning," Hastie, Tibshirani, Friedman, Ch. 3 (ESL)


In [64]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
import scipy as sc

Recall LinearRegression.fit

Throughout let $p < n$

The fit method in OLS solves the following on the training set: $$ \hat \beta = (X^\top X)^{-1} X^\top y, $$ where $X$ is an $n \times p$ array and $y$ is a length-$n$ array.

Linear solve:

$$ (X^\top X) \hat \beta = X^\top y $$
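As a quick sanity check, here is a minimal sketch (not part of the original lecture code; the random design and variable names are illustrative) that solves the normal equations directly and compares with LinearRegression:

In [ ]:
# Solve the normal equations directly and compare with sklearn's LinearRegression.
rng = np.random.default_rng(0)
n_chk, p_chk = 50, 3
X_chk = rng.normal(size=(n_chk, p_chk))
y_chk = X_chk @ np.array([1., -2., .5]) + rng.normal(size=n_chk)

beta_normal = np.linalg.solve(X_chk.T @ X_chk, X_chk.T @ y_chk)   # (X^T X) beta = X^T y
beta_sklearn = linear_model.LinearRegression(fit_intercept=False).fit(X_chk, y_chk).coef_
print(np.allclose(beta_normal, beta_sklearn))                     # expect True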

Recall LinearRegression.predict

Applying predict to the training set gives $$ \hat y = X \hat \beta = X (X^\top X)^{-1} X^\top y, $$ which is the projection of $y$ onto the column space of $X$, i.e. a projection in $n$-dimensional space.

Projections are idempotent, $$ P := X (X^\top X)^{-1} X^\top $$ has $$ P P = X (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top = X (X^\top X)^{-1} X^\top. $$
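A small numerical check of idempotence (a sketch on an illustrative random full-rank design):

In [ ]:
# Check that the hat matrix P = X (X^T X)^{-1} X^T is idempotent.
rng = np.random.default_rng(1)
X_chk = rng.normal(size=(30, 4))                 # any full-rank n x p design
P = X_chk @ np.linalg.inv(X_chk.T @ X_chk) @ X_chk.T
print(np.allclose(P @ P, P))                     # expect True: P is a projection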


Exercise 3.1

Suppose that we have a perfectly reasonable $n \times p$ design matrix $X$, and $n$ response vector $y$ such that $X^\top X$ is invertible. Suppose that we duplicate the columns of $X$ to make an $n \times (2p)$ matrix. Suppose that we want to run OLS with the new data by finding solutions to the normal equation.

  1. From the projection intuition above, what is the impact on $\hat y$?
  2. Show that a valid solution to the new normal equations is $\tilde \beta = \frac 12 [\hat \beta; \hat \beta]$.
  3. What form do other valid solutions take?

STOP

Answer to 3.1

  1. No impact on $\hat y$, because there is no change in the column space of $X$.
  2. Let $\tilde X = (X,X)$; then $$\tilde X^\top \tilde X = [X^\top X, X^\top X; X^\top X, X^\top X].$$ This is not invertible! But the normal equations can still be solved: $$ \tilde X^\top \tilde X \tilde \beta = \tilde X^\top y = [X^\top y; X^\top y]. $$ Plugging in $\tilde \beta = \frac 12 [\hat \beta; \hat \beta]$ gives $$ [X^\top X, X^\top X; X^\top X, X^\top X] \frac{[\hat \beta ; \hat \beta]}{2} = \frac 12 [X^\top X \hat \beta + X^\top X \hat \beta; X^\top X \hat \beta + X^\top X \hat \beta] = [X^\top X \hat \beta; X^\top X \hat \beta], $$ and because $\hat \beta$ solves the original normal equations this equals $$ [X^\top y; X^\top y] = \tilde X^\top y.$$
  3. There was nothing special about $1/2$; we could repeat the arguments above with $[\theta \hat \beta; (1-\theta) \hat \beta]$ for any $\theta \in \mathbb R$.
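A numerical illustration of parts 1 and 2 (a sketch on an illustrative random design, not part of the original notebook):

In [ ]:
# Duplicating columns leaves y-hat unchanged, and [beta_hat/2; beta_hat/2]
# solves the new normal equations.
rng = np.random.default_rng(2)
X_chk = rng.normal(size=(40, 3))
y_chk = rng.normal(size=40)

beta_hat = np.linalg.solve(X_chk.T @ X_chk, X_chk.T @ y_chk)
X_tilde = np.hstack([X_chk, X_chk])                              # duplicated columns
beta_tilde = 0.5 * np.concatenate([beta_hat, beta_hat])

print(np.allclose(X_chk @ beta_hat, X_tilde @ beta_tilde))       # same fitted values
print(np.allclose(X_tilde.T @ X_tilde @ beta_tilde, X_tilde.T @ y_chk))  # normal eqns hold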

Regression by Successive Orthogonalization

ESL pg. 54

  1. Input $x_0=1, x_1, \ldots, x_p$ columns of design matrix.
  2. Init $z_0 = x_0 = 1$
  3. For $j = 1,\ldots,p$
    • Regress $x_j$ on $z_0,\ldots,z_{j-1}$ giving $$\hat \gamma_{j,l} = \frac{z_l^\top x_j}{z_l^\top z_l},$$ $l = 0, \ldots, j-1$ and $z_j = x_j - \sum_{k=0}^{j-1} \hat \gamma_{j,k} z_k$
  4. Regress $y$ on the residual $z_p$ to give $\hat \beta_p$
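A direct implementation sketch of the algorithm above (the function name and random test data are ours, not from ESL or sklearn):

In [ ]:
def successive_orthogonalization(X, y):
    """
    Orthogonalize the columns of X (column 0 assumed to be the intercept of ones)
    and regress y on the last residual z_p to obtain beta_p.
    """
    n, p1 = X.shape
    Z = np.zeros_like(X, dtype=float)
    Z[:, 0] = X[:, 0]
    for j in range(1, p1):
        z = X[:, j].astype(float)
        for k in range(j):
            gamma = Z[:, k] @ X[:, j] / (Z[:, k] @ Z[:, k])  # regress x_j on z_k
            z = z - gamma * Z[:, k]
        Z[:, j] = z
    beta_p = Z[:, -1] @ y / (Z[:, -1] @ Z[:, -1])            # regress y on z_p
    return Z, beta_p

# Check: beta_p matches the last OLS coefficient (illustrative random data).
rng = np.random.default_rng(3)
X1 = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
y_chk = rng.normal(size=50)
Z, beta_p = successive_orthogonalization(X1, y_chk)
ols_last = linear_model.LinearRegression(fit_intercept=False).fit(X1, y_chk).coef_[-1]
print(np.isclose(beta_p, ols_last))                           # expect True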

Regression by Successive Orthogonalization

What does "regress onto" mean?

Solving the normal equation $$ Z^\top Z \hat \gamma_j = Z^\top x_j $$

where $Z$ has columns $z_0, \ldots z_{j-1}$.

Why is this any easier?

$Z$ is orthogonal, i.e. the columns are orthogonal, $$ z_j^\top z_k = 0, j\ne k $$ which means $Z^\top Z$ is diagonal (easy to invert).

Exercise 3.2.1. Show that Successive Orthogonalization is equivalent to the Gram-Schmidt procedure for finding an orthonormal basis of the column space of $X$.

Regression by Successive Orthogonalization

Regress $y$ on the residual $z_p$ to give $\hat \beta_p$?

We know that $z_p$ is the only basis element that contains $x_p$ and that regressing $y$ onto $Z$ is equivalent to regressing $y$ onto $X$.

Why?

We can write these matrices as $$ X = Z \Gamma $$ where $Z$ is orthogonal and $\Gamma$ is upper triangular. Let $D$ be the diagonal matrix with $\| z_j\|$ on the diagonal. Then $$ X = Z D^{-1} D \Gamma = Q R $$ where $Q = Z D^{-1}$ is $n \times p$ and $R = D \Gamma$ is $p \times p$.

$Q$ is orthonormal ($Q^\top Q = I$) and $R$ is upper triangular.

Regression by Successive Orthogonalization

  • Normal eqn is $$ X^\top X \hat \beta = X^\top y \equiv R^\top R \hat \beta = R^\top (Q^\top y) $$
  • Upper triangular matrices are easy to invert! For example, $$ \begin{bmatrix} 1 & 2 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 4 \\ 5 \end{bmatrix} $$ is solved by back-substitution: the second row gives $b = 5/3$, and then the first row gives $a = 4 - 2b = 2/3$.
  • One of many decompositions that can make linear regression easy (after the decomposition is made).
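A sketch of the QR route in code (uses np.linalg.qr and scipy's solve_triangular; the random design is illustrative):

In [ ]:
# OLS via QR: solve R beta = Q^T y by back-substitution.
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
X_chk = rng.normal(size=(60, 5))
y_chk = rng.normal(size=60)

Q, R = np.linalg.qr(X_chk)                    # reduced QR: Q is n x p, R is p x p
beta_qr = solve_triangular(R, Q.T @ y_chk)    # back-substitution (R is upper triangular)
print(np.allclose(beta_qr, np.linalg.solve(X_chk.T @ X_chk, X_chk.T @ y_chk)))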

Regression by Successive Orthogonalization

  1. Input $x_0=1, x_1, \ldots, x_p$ columns of design matrix.
  2. Init $z_0 = x_0 = 1$
  3. For $j = 1,\ldots,p$
    • Regress $x_j$ on $z_0,\ldots,z_{j-1}$ giving $$\hat \gamma_{j,l} = \frac{z_l^\top x_j}{z_l^\top z_l},$$ $l = 0, \ldots, j-1$ and $z_j = x_j - \sum_{k=0}^{j-1} \hat \gamma_{j,k} z_k$
  4. Regress $y$ on the residual $z_p$ to give $\hat \beta_p$
  • Only gives us $\hat \beta_p$! Not a great algorithm in its current form.
  • The algorithm exposes the effect of correlated inputs $x_j$.

Suppose that $y$ follows the linear model $$ y = X \beta + \epsilon $$ where $\epsilon_i$ is iid normal$(0,\sigma^2)$. Then

  1. Regress $y$ on the residual $z_p$ to give $\hat \beta_p$

This means $$ \hat \beta_p = \frac{y^\top z_p}{z_p^\top z_p}. $$ If $\|z_p\|$ is small then this is unstable (high variance). When does this happen? Recall $$ z_j = x_j - \sum_{k=0}^{j-1} \hat \gamma_{j,k} z_k. $$

If $x_p$ is highly correlated with $x_0,\ldots,x_{p-1}$ (so the residual from regressing it onto them is small), then $\hat \beta_p$ is unstable.

Exercise 3.2.2

Below is some code for generating a design matrix with correlated X variables, along with the response vector. The rho parameter is the pairwise correlation of the X variables, so rho = 1 means that they are perfectly correlated. Use the code below to generate the true beta once (it is also a parameter of sim_corr_lm) and set sigma = 1. Choose a sequence of rhos, Rhos = [0, .2, .4, .6, .8, .9, .95, .99], and for each one run 100 trials of the following: simulate from the linear model, fit OLS, and save the first coefficient beta_1. For each rho in the list, calculate the variance of beta_1 and plot that variance as a function of rho.


In [34]:
def sim_corr_lm(n,p,rho,beta,sigma):
    """
    Simulate an n x p design matrix whose columns have pairwise correlation rho,
    and the response y = X @ beta + noise with noise sd sigma
    """
    assert p < n and rho < 1 and rho >= 0, "p must be less than n and rho in [0,1)"
    Sigma = (1 - rho)*np.eye(p) + rho*np.ones((p,p))
    X = np.random.multivariate_normal(np.zeros(p),Sigma,n)
    y = X @ beta + np.random.normal(0,sigma,n)
    return X,y

In [51]:
n, p, rho = 100, 2, .8
beta = np.random.normal(0,1,p)
sigma = 1.
X, y = sim_corr_lm(n,p,rho,beta,sigma)
plt.plot(X[:,0],X[:,1],'.')


Out[51]:
[<matplotlib.lines.Line2D at 0x7f8b01ffd4a8>]

STOP


In [52]:
## Solution to 3.2.2
def sample_coef_corr_lm(trials,**kwargs):
    """
    Sample the OLS coefficients for rho correlated input
    """
    beta_sim = []
    for t in range(trials):
        X,y = sim_corr_lm(**kwargs)
        beta_sim += [linear_model.LinearRegression(fit_intercept=False).fit(X,y).coef_]
    return np.array(beta_sim)

In [62]:
## Sample coefficients with different rho and plot variance of beta_1 as rho increases
Rhos = [0,.2,.4,.6,.8,.9,.95,.99]
coef_vars = [sample_coef_corr_lm(100,n=n,p=p,rho=rho,beta=beta,sigma=sigma)[:,0].var(axis=0) for rho in Rhos]
plt.plot(Rhos,coef_vars)
plt.xlabel('X correlation')
plt.ylabel('beta variance')


Out[62]:
Text(0, 0.5, 'beta variance')

Solution to Exercise 3.2.1

In Gram-Schmidt the projection operator is $$\frac{\langle x, z\rangle}{\langle z, z \rangle} z,$$ and so the projection of $x_j$ onto the space spanned by $z_k$ is $$ \frac{x_j^\top z_k}{z_k^\top z_k} z_k = \hat \gamma_{j,k} z_k.$$

The result is immediate from the algorithm: $$z_j = x_j - \sum_{k=0}^{j-1} \hat \gamma_{j,k} z_k$$

Singular value decomposition

  • Recall that the QR decomposition was computed to make LinearRegression.fit easier.
  • There are other decompositions that can also be used: Cholesky and Singular Value.

Singular Value Decomposition (for $n > p$ and $X^\top X$ invertible)

$$ X = U D V^\top, $$
  • U is orthonormal ($U^\top U = I$) $n \times p$
  • V is orthonormal $p \times p$
  • D is diagonal
  1. If $X$ is singular, there is a non-zero vector $z$ such that $Xz = 0$, and then a singular value is $0$ (equivalently, an eigenvalue of $X^\top X$ is $0$). This is equivalent to a residual in Successive Orthogonalization being zero.
  2. Computing the SVD is more expensive than QR in general.

Singular Value Decomposition

Suppose that we precomputed the SVD, $$ X = UDV^\top. $$ Then the Gram matrix is $$ X^\top X = V D U^\top U D V^\top = V D^2 V^\top, $$ the Spectral decomposition.

You should derive that $$ \hat \beta = V D^{-1} U^\top y. $$
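A sketch checking this formula numerically (illustrative random data; uses the thin SVD via full_matrices=False):

In [ ]:
# Check the SVD form of the OLS solution, beta = V D^{-1} U^T y.
rng = np.random.default_rng(5)
X_chk = rng.normal(size=(80, 6))
y_chk = rng.normal(size=80)

U_chk, d_chk, Vt_chk = sc.linalg.svd(X_chk, full_matrices=False)  # thin SVD: U is n x p
beta_svd = Vt_chk.T @ ((U_chk.T @ y_chk) / d_chk)                 # V D^{-1} U^T y
print(np.allclose(beta_svd, np.linalg.solve(X_chk.T @ X_chk, X_chk.T @ y_chk)))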

Singular Value Decomposition

The coefficient formula $$ \hat \beta = V D^{-1} U^\top y $$ means that $$ \hat \beta = \sum_{j=1}^p d_j^{-1} (u_j^\top y) v_j, $$ which means that

$\hat \beta$ is unstable (high variance in some direction) if some singular values are very small.

Exercise 3.3

Below you can see some code for computing the SVD of $X$. Using the same list of Rhos, compute the singular values for each rho and plot them in order from largest to smallest; you should get one line for each rho. What does this say about the stability of $\hat \beta$ as rho approaches 1?


In [105]:
## Simulate again
n, p, rho = 100, 8, .8
beta = np.random.normal(0,1,p)
sigma = 1.
X, y = sim_corr_lm(n,p,rho,beta,sigma)

## SVD
U,d,Vt = sc.linalg.svd(X)

## Check to make sure that we understand
print(U.shape, Vt.shape, d.shape)
np.abs(X - U[:,:p] @ np.diag(d) @ Vt).sum()


(100, 100) (8, 8) (8,)
Out[105]:
1.558768684874895e-12

STOP


In [103]:
## Simulate with different rho and store the singular values of X
Rhos = [0,.2,.4,.6,.8,.9,.95,.99]
res_mat = []
for rho in Rhos:
    X, y = sim_corr_lm(n,p,rho,beta,sigma)
    U,d,Vt = sc.linalg.svd(X)
    res_mat += [d]

In [104]:
## plot the ordered singular values for each rho
colors = plt.cm.jet(np.linspace(0,1,len(Rhos)))
for col, rho, res_vec in zip(colors,Rhos,res_mat):
    plt.plot(res_vec,label=str(rho),c=col)
plt.legend()
_ = plt.xlabel('singular value index')


Ridge regression

We can summarize the above statements as

If $X$ is nearly singular then $\hat \beta$ is unstable.

One solution is Ridge regression: $\hat \beta$ solves $$ \min_\beta \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2. $$

  • Ridge regularization will "pull" $\beta$ towards 0.
  • $\lambda$ is a tuning parameter and the second term is the ridge penalty
  • if we center $X$ ($x_{i,j} \gets x_{i,j} - \bar x_j$ for all but the intercept) then the intercept estimate is $\hat \beta_0 = \bar y$; we can then subtract $\bar y$ from $y$ and perform ridge without an intercept
  • normalizing $X$ ($x_{i,j} \gets x_{i,j} / \| x_j \|$) puts all penalty terms on the same scale

Ridge regression

Ridge objective (with no intercept) $$ (y - X \beta)^\top (y - X \beta) + \lambda \beta^\top \beta $$ is equal, up to an additive constant that does not depend on $\beta$, to $$- 2 y^\top X \beta + \beta^\top (X^\top X) \beta + \lambda \beta^\top \beta = -2 (X^\top y)^\top \beta + \beta^\top (X^\top X + \lambda I) \beta.$$

Setting the gradient with respect to $\beta$ to zero, the ridge solution satisfies new normal equations, $$ (X^\top X + \lambda I) \hat \beta = X^\top y. $$

Ridge regression always has a unique solution because $(X^\top X + \lambda I)^{-1}$ exists for $\lambda > 0$! But the estimate is biased!
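A minimal sketch solving the ridge normal equations directly and comparing with sklearn's Ridge (fit_intercept=False so the objectives match; the data and lam value are illustrative):

In [ ]:
# Solve (X^T X + lambda I) beta = X^T y and compare with sklearn.linear_model.Ridge.
rng = np.random.default_rng(6)
X_chk = rng.normal(size=(70, 5))
y_chk = rng.normal(size=70)
lam = 2.0

beta_ridge = np.linalg.solve(X_chk.T @ X_chk + lam * np.eye(5), X_chk.T @ y_chk)
beta_sk = linear_model.Ridge(alpha=lam, fit_intercept=False).fit(X_chk, y_chk).coef_
print(np.allclose(beta_ridge, beta_sk))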

Exercise 3.4

Recall that we can write the OLS solution as $$ \hat \beta = V D^{-1} U^\top y . $$

Consider the solution to the ridge normal equation,
$$ (X^\top X + \lambda I) \hat \beta = X^\top y. $$ Show that it can also be solved with a similar form but where the singular values $d_j$ are augmented.

STOP

Solution to exercise 3.4

If $X = U D V^\top$ then $$ X^\top X + \lambda I = V D^2 V^\top + \lambda I = V (D^2 + \lambda I) V^\top, $$ because $V$ is a square orthogonal matrix ($V V^\top = I$). So the ridge solution can be shown (exercise) to be $$ \hat \beta = V (D^2 + \lambda I)^{-1} D U^\top y. $$ Compared to the OLS solution $$ \hat \beta = V D^{-1} U^\top y, $$ ridge modifies the singular values to $$ d_j \gets \frac{d_j^2 + \lambda}{d_j}. $$
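A sketch checking the SVD form of the ridge solution against the direct solve (illustrative data and lambda):

In [ ]:
# Ridge via SVD: beta = V (D^2 + lambda I)^{-1} D U^T y.
rng = np.random.default_rng(7)
X_chk = rng.normal(size=(70, 5))
y_chk = rng.normal(size=70)
lam = 2.0

U_chk, d_chk, Vt_chk = sc.linalg.svd(X_chk, full_matrices=False)
beta_svd = Vt_chk.T @ (d_chk / (d_chk**2 + lam) * (U_chk.T @ y_chk))
beta_direct = np.linalg.solve(X_chk.T @ X_chk + lam * np.eye(5), X_chk.T @ y_chk)
print(np.allclose(beta_svd, beta_direct))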

Subset selection

Two motivations for selecting variables

  1. Fewer variables can lead to lower risk
  2. We want to discover a subset of variables with large effects

A model for sparsity $$ y_i = x_i^\top \beta + \epsilon_i $$ where for some of the $j = 1,\ldots,p$, $\beta_j = 0$.

Def The support of $\beta$ is $$ \textrm{supp}(\beta) = \{j = 1,\ldots,p : \beta_j \ne 0 \}. $$ The goal is then to find $\textrm{supp}(\beta)$, or a $\beta$ such that $|\textrm{supp}(\beta)| \le s$.

Computational challenges of subset selection

Combinatorial ($L_0$) subset selection: select $S \subseteq \{1,\ldots, p\}$ with $|S| \le s$ that minimizes $$ \min_\beta \sum_{i=1}^n \left(y_i - \sum_{j \in S} x_{i,j} \beta_j \right)^2 $$

This optimization is NP hard in general!

Def The $L_0$ norm (not actually a norm) is $\|\beta\|_0 = |{\rm supp}(\beta)|$.

Then subset selection is $$ \min \| y - X \beta \|^2 \textrm{ s.t. } \beta \in \mathbb R^p, \| \beta \|_0 \le s $$
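For small $p$ this can be solved by brute force; a sketch (the function name is ours, and the search is exponential in $p$):

In [ ]:
# Brute-force best subset selection: only feasible for small p.
from itertools import combinations

def best_subset(X, y, s):
    """Search all subsets of at most s columns for the smallest residual sum of squares."""
    n, p = X.shape
    best_rss, best_S = np.inf, ()
    for size in range(1, s + 1):
        for S in combinations(range(p), size):
            XS = X[:, list(S)]
            beta_S = np.linalg.lstsq(XS, y, rcond=None)[0]    # OLS on the subset
            rss = np.sum((y - XS @ beta_S) ** 2)
            if rss < best_rss:
                best_rss, best_S = rss, S
    return best_S, best_rss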

Greedy methods

Basic idea: at each step, choose an action that improves empirical risk

Forward Stepwise

  • Input X standardized, $x_0 = 1$, y
  • Let $S_0 = \{0\}$
  • For $s = 1,\ldots, p$:

    • For $j \notin S_{s-1}$: $$ R_j = \min \| y - X \beta \|^2 \textrm{ s.t. supp}(\beta) = S_{s-1} \cup \{ j \} $$
    • Add the minimizing $j$ to $S_{s-1}$ to form $S_s$
  • Intermediate OLS steps can be slow

  • Correlations can cause issues: a variable added early on only because it is correlated with important variables is never removed later. A code sketch of the procedure follows.
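A sketch of Forward Stepwise in code (greedy RSS reduction; the function name is ours, and the design is assumed to carry the intercept as column 0):

In [ ]:
def forward_stepwise(X, y, max_vars=None):
    """
    Greedily add the column that most reduces the OLS residual sum of squares.
    Column 0 (the intercept column of ones) is always included.
    """
    n, p = X.shape
    S = [0]                                          # S_0 = {0}
    steps = p - 1 if max_vars is None else max_vars
    for _ in range(steps):
        rss = {}
        for j in (j for j in range(p) if j not in S):
            XS = X[:, S + [j]]
            beta = np.linalg.lstsq(XS, y, rcond=None)[0]
            rss[j] = np.sum((y - XS @ beta) ** 2)    # R_j
        S.append(min(rss, key=rss.get))              # add the minimizer of R_j
    return S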