In [64]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
import scipy as sc
import scipy.linalg  # make sure the linalg submodule is loaded so sc.linalg works below
Applying predict to the training set gives $$ \hat y = X \hat \beta = X (X^\top X)^{-1} X^\top y, $$ which is the projection of $y$ onto the column space of $X$ (a projection in $n$-dimensional space).
Projections are idempotent: the matrix $$ P := X (X^\top X)^{-1} X^\top $$ satisfies $$ P P = X (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top = X (X^\top X)^{-1} X^\top = P. $$
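A quick numerical check of idempotence on a small random design (the dimensions and names such as X_chk are just for illustration):
In [ ]:
## Illustration: the hat matrix P = X (X^T X)^{-1} X^T is idempotent,
## so P @ P should match P up to floating point error.
n_chk, p_chk = 20, 3
X_chk = np.random.normal(size=(n_chk, p_chk))
P = X_chk @ np.linalg.inv(X_chk.T @ X_chk) @ X_chk.T
print(np.abs(P @ P - P).max())   # should be on the order of machine precision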
Suppose that we have a perfectly reasonable $n \times p$ design matrix $X$ and an $n$-vector response $y$, with $X^\top X$ invertible. Now suppose that we duplicate the columns of $X$ to make an $n \times (2p)$ matrix, and that we want to run OLS with the new data by finding solutions to the normal equations.
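A small sketch of what goes wrong (the names X_full and X_dup are hypothetical): duplicating the columns makes $X^\top X$ rank deficient, so the normal equations no longer have a unique solution.
In [ ]:
## Illustration: duplicating columns makes X^T X singular
n_dup, p_dup = 50, 3
X_full = np.random.normal(size=(n_dup, p_dup))
X_dup = np.hstack([X_full, X_full])                 # n x (2p) design with duplicated columns
print(np.linalg.matrix_rank(X_full.T @ X_full))     # p: invertible
print(np.linalg.matrix_rank(X_dup.T @ X_dup))       # still p, not 2p: singular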
Why is this any easier?
$Z$ is orthogonal, i.e. its columns are mutually orthogonal, $$ z_j^\top z_k = 0, \quad j \ne k, $$ which means $Z^\top Z$ is diagonal (easy to invert).
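A small sketch (using an orthonormal Z built with np.linalg.qr purely for illustration): with orthogonal columns, each OLS coefficient is a simple univariate projection.
In [ ]:
## Illustration: with orthogonal columns, each OLS coefficient is y^T z_j / z_j^T z_j
A = np.random.normal(size=(30, 4))
Z_orth, _ = np.linalg.qr(A)                         # orthonormal columns (for illustration)
y_orth = np.random.normal(size=30)
coef_proj = Z_orth.T @ y_orth / np.sum(Z_orth**2, axis=0)
coef_ols = np.linalg.lstsq(Z_orth, y_orth, rcond=None)[0]
print(np.allclose(coef_proj, coef_ols))             # True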
Exercise 3.2.1. Show that successive orthogonalization is equivalent to the Gram-Schmidt procedure for finding an orthonormal basis of the column space of $X$.
Why?
We can write these matrices as $$ X = Z \Gamma $$ where $Z$ is orthogonal and $\Gamma$ is upper triangular. Let $D$ be the diagonal matrix with $\| z_j\|$ on the diagonal. Then $$ X = Z D^{-1} D \Gamma = Q R $$ where $Q = Z D^{-1}$ is $n \times p$ and $R = D \Gamma$ is $p \times p$.
$Q$ is orthonormal ($Q^\top Q = I$) and $R$ is upper triangular.
With the QR decomposition the normal equations reduce to the upper triangular system $R \hat\beta = Q^\top y$, which is solved by back substitution. For example, $$ \begin{bmatrix} 1 & 2 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 4 \\ 5 \end{bmatrix} $$ gives $b = 5/3$ and then $a = 4 - 2b = 2/3$.
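A short sketch of the QR approach in code (the names X_qr and y_qr are hypothetical; scipy's solve_triangular performs the back substitution):
In [ ]:
## Illustration: least squares via QR, solving R beta = Q^T y by back substitution
X_qr = np.random.normal(size=(50, 4))
y_qr = np.random.normal(size=50)
Q, R = np.linalg.qr(X_qr)                                # thin QR: Q is 50 x 4, R is 4 x 4
beta_qr = sc.linalg.solve_triangular(R, Q.T @ y_qr, lower=False)
beta_ols = np.linalg.lstsq(X_qr, y_qr, rcond=None)[0]
print(np.allclose(beta_qr, beta_ols))                    # True
## The 2 x 2 example above:
print(sc.linalg.solve_triangular(np.array([[1., 2.], [0., 3.]]), np.array([4., 5.])))  # [2/3, 5/3]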
Suppose that $y$ follows the linear model $$ y = X \beta + \epsilon $$ where the $\epsilon_i$ are iid Normal$(0,\sigma^2)$. Then, since $$ \hat \beta_p = \frac{y^\top z_p}{z_p^\top z_p}, $$ we have $$ \mathrm{Var}(\hat \beta_p) = \frac{\sigma^2}{\|z_p\|^2}. $$ If $\|z_p\|$ is small then this estimate is unstable (high variance). When does this happen? Recall $$ z_j = x_j - \sum_{k=0}^{j-1} \hat \gamma_{j,k} z_k. $$
If $x_p$ is highly correlated with $x_0,\ldots,x_{p-1}$ (so the residual $z_p$ from regressing $x_p$ onto them is small), then $\hat \beta_p$ is unstable.
Exercise 3.2.2. Below is some code for generating a design matrix with correlated X variables, along with the response vector. The rho parameter is the correlation of the X variables, so rho = 1 means that they are perfectly correlated. Use the code below to generate the true beta once (also a parameter of sim_corr_lm) and set sigma = 1. Choose a sequence of rho's, Rhos = [0,.2,.4,.6,.8,.9,.95,.99], and for each one run 100 trials of the following: simulate from the linear model, fit OLS, and save the first coefficient beta_1. For each rho in the list, calculate the variance of beta_1 and plot that variance as a function of rho.
In [34]:
def sim_corr_lm(n, p, rho, beta, sigma):
    """
    Simulate a design matrix whose columns have pairwise correlation rho,
    along with the response y = X @ beta + Normal(0, sigma^2) noise
    """
    assert p < n and rho < 1 and rho >= 0, "p must be less than n and rho in [0,1)"
    Sigma = (1 - rho)*np.eye(p) + rho*np.ones((p, p))
    X = np.random.multivariate_normal(np.zeros(p), Sigma, n)
    y = X @ beta + np.random.normal(0, sigma, n)
    return X, y
In [51]:
n, p, rho = 100, 2, .8
beta = np.random.normal(0,1,p)
sigma = 1.
X, y = sim_corr_lm(n,p,rho,beta,sigma)
plt.plot(X[:,0],X[:,1],'.')
Out[51]:
In [52]:
## Solution to 3.2.2
def sample_coef_corr_lm(trials, **kwargs):
    """
    Sample the OLS coefficients for rho correlated input
    """
    beta_sim = []
    for t in range(trials):
        X, y = sim_corr_lm(**kwargs)
        beta_sim += [linear_model.LinearRegression(fit_intercept=False).fit(X, y).coef_]
    return np.array(beta_sim)
In [62]:
## Sample coefficients with different rho and plot variance of beta_1 as rho increases
Rhos = [0,.2,.4,.6,.8,.9,.95,.99]
coef_vars = [sample_coef_corr_lm(100,n=n,p=p,rho=rho,beta=beta,sigma=sigma)[:,0].var(axis=0) for rho in Rhos]
plt.plot(Rhos,coef_vars)
plt.xlabel('X correlation')
plt.ylabel('beta variance')
Out[62]:
Solution to 3.2.1. In Gram-Schmidt the projection operator is $$\frac{\langle x, z\rangle}{\langle z, z \rangle} z,$$ and so the projection of $x_j$ onto the space spanned by $z_k$ is $$ \frac{x_j^\top z_k}{z_k^\top z_k} z_k = \hat \gamma_{j,k} z_k.$$ The result is then immediate from the algorithm's update: $$z_j = x_j - \sum_{k=0}^{j-1} \hat \gamma_{j,k} z_k.$$
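A numerical sketch of the equivalence (the names X_gs and Q_gs are hypothetical): running the successive orthogonalization update and normalizing the $z_j$'s reproduces the $Q$ from np.linalg.qr up to column signs.
In [ ]:
## Illustration: Gram-Schmidt / successive orthogonalization vs np.linalg.qr
X_gs = np.random.normal(size=(20, 3))
Z = np.zeros_like(X_gs)
for j in range(X_gs.shape[1]):
    z = X_gs[:, j].copy()
    for k in range(j):
        z -= (X_gs[:, j] @ Z[:, k]) / (Z[:, k] @ Z[:, k]) * Z[:, k]   # subtract gamma_{j,k} z_k
    Z[:, j] = z
Q_gs = Z / np.linalg.norm(Z, axis=0)
Q_np, _ = np.linalg.qr(X_gs)
print(np.allclose(np.abs(Q_gs), np.abs(Q_np)))   # True, up to column signs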
Singular Value Decomposition (for $n > p$ and $X^\top X$ invertible)
$$ X = U D V^\top. $$ Below is some code for computing the SVD of $X$. Using the same selection of Rhos, for each rho compute the singular values and plot them in order from largest to smallest; you should get one line for each rho. What does this say about the stability of $\hat \beta$ as rho approaches 1?
In [105]:
## Simulate again
n, p, rho = 100, 8, .8
beta = np.random.normal(0,1,p)
sigma = 1.
X, y = sim_corr_lm(n,p,rho,beta,sigma)
## SVD
U,d,Vt = sc.linalg.svd(X)
## Check the shapes and that U[:, :p] @ np.diag(d) @ Vt reconstructs X
print(U.shape, Vt.shape, d.shape)
np.abs(X - U[:,:p] @ np.diag(d) @ Vt).sum()
Out[105]:
In [103]:
## Simulate with different rho and store the singular values of X
Rhos = [0,.2,.4,.6,.8,.9,.95,.99]
res_mat = []
for rho in Rhos:
    X, y = sim_corr_lm(n, p, rho, beta, sigma)
    U, d, Vt = sc.linalg.svd(X)
    res_mat += [d]
In [104]:
## Plot the ordered singular values for each rho
colors = plt.cm.jet(np.linspace(0, 1, len(Rhos)))
for col, rho, res_vec in zip(colors, Rhos, res_mat):
    plt.plot(res_vec, label=str(rho), c=col)
plt.legend()
_ = plt.xlabel('singular value index')
One solution is ridge regression: the $\hat \beta$ that solves $$ \min_{\beta} \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2. $$
The ridge objective (with no intercept) $$ (y - X \beta)^\top (y - X \beta) + \lambda \beta^\top \beta $$ equals, up to an additive constant $y^\top y$ that does not depend on $\beta$, $$- 2 y^\top X \beta + \beta^\top (X^\top X) \beta + \lambda \beta^\top \beta = -2 (X^\top y)^\top \beta + \beta^\top (X^\top X + \lambda I) \beta.$$
Setting the gradient with respect to $\beta$ to zero, the ridge solution satisfies new normal equations, $$ (X^\top X + \lambda I) \hat \beta = X^\top y. $$
Ridge regularization always has a solution because $(X^\top X + \lambda I)^{-1}$ exists for any $\lambda > 0$! But it is biased!
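A quick sketch checking the ridge normal equations against sklearn (lam is a hypothetical penalty level; alpha in Ridge plays the role of $\lambda$; X, y, p are the simulated data from above):
In [ ]:
## Illustration: solve (X^T X + lambda I) beta = X^T y directly and compare with sklearn Ridge
lam = 1.0
beta_normal_eq = np.linalg.solve(X.T @ X + lam*np.eye(p), X.T @ y)
beta_sklearn = linear_model.Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_normal_eq, beta_sklearn))   # True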
Recall that we can write the OLS solution as $$ \hat \beta = V D^{-1} U^\top y . $$
Consider the solution to the ridge normal equation,
$$
(X^\top X + \lambda I) \hat \beta = X^\top y.
$$
Show that it can also be solved with a similar form but where the singular values $d_j$ are augmented.
If $X = U D V^\top$ then $$ X^\top X + \lambda I = V D^2 V^\top + \lambda I = V (D^2 + \lambda I) V^\top $$ because when $p < n$ the matrix $V$ is $p \times p$ orthogonal, so $V V^\top = I$. So the ridge solution can be shown to be (exercise) $$ \hat \beta = V (D^2 + \lambda I)^{-1} D U^\top y; $$ compared to OLS, $$ \hat \beta = V D^{-1} U^\top y, $$ ridge modifies the singular values to $$ d_j \gets \frac{d_j^2 + \lambda}{d_j}. $$
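A sketch verifying the SVD form of the ridge solution numerically (using the X, y simulated above and a hypothetical lam):
In [ ]:
## Illustration: ridge via the thin SVD, beta = V (D^2 + lambda I)^{-1} D U^T y
lam = 1.0
U, d, Vt = sc.linalg.svd(X, full_matrices=False)        # thin SVD: U is n x p, d has length p
beta_svd = Vt.T @ (d/(d**2 + lam) * (U.T @ y))
beta_ridge = np.linalg.solve(X.T @ X + lam*np.eye(p), X.T @ y)
print(np.allclose(beta_svd, beta_ridge))                # True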
A model for sparsity $$ y_i = x_i^\top \beta + \epsilon_i $$ where for some of the $j = 1,\ldots,p$, $\beta_j = 0$.
Def. The support of $\beta$ is $$ \textrm{supp}(\beta) = \{j = 1,\ldots,p : \beta_j \ne 0 \}. $$ The goal is then to find supp$(\beta)$, or a $\beta$ such that $|$supp$(\beta)| \le s$.
Def. The $L_0$ "norm" (not actually a norm) is $\|\beta\|_0 = |{\rm supp}(\beta)|$.
Best subset selection is then $$ \min_{\beta \in \mathbb R^p} \| y - X \beta \|^2 \quad \textrm{ s.t. } \quad \| \beta \|_0 \le s. $$
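A brute-force sketch (the helper best_subset is hypothetical and only feasible for small p): for each support of size at most s, fit OLS on those columns and keep the smallest residual sum of squares.
In [ ]:
## Illustration: exhaustive best subset selection (only feasible for small p)
from itertools import combinations

def best_subset(X, y, s_max):
    best = (np.inf, ())
    for s in range(1, s_max + 1):
        for supp in combinations(range(X.shape[1]), s):
            Xs = X[:, list(supp)]
            beta_s = np.linalg.lstsq(Xs, y, rcond=None)[0]
            rss = np.sum((y - Xs @ beta_s)**2)
            best = min(best, (rss, supp))
    return best

print(best_subset(X, y, s_max=3))    # (best RSS, chosen support)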
Forward stepwise selection. Basic idea: at each step, choose the action that most improves the empirical risk. For $s = 1,\ldots, p$: add to the active set the variable that most reduces the residual sum of squares, then refit OLS on the active set. The intermediate OLS steps can be slow; see the sketch below.
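A sketch of this greedy procedure (the helper forward_stepwise is hypothetical):
In [ ]:
## Illustration: greedy forward stepwise selection
def forward_stepwise(X, y, s_max):
    active = []
    for _ in range(s_max):
        best_rss, best_j = np.inf, None
        for j in range(X.shape[1]):
            if j in active:
                continue
            cols = active + [j]
            beta_c = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            rss = np.sum((y - X[:, cols] @ beta_c)**2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        active.append(best_j)
    return active

print(forward_stepwise(X, y, s_max=3))    # indices added, in order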