In [64]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
import scipy as sc
import scipy.linalg  # ensure the linalg submodule is loaded so sc.linalg.svd works
Applying predict to the training set gives $$ \hat y = X \hat \beta = X (X^\top X)^{-1} X^\top y, $$ which is the projection of $y$ onto the column space of $X$ (a projection in $n$-dimensional space).
Projections are idempotent: the matrix $$ P := X (X^\top X)^{-1} X^\top $$ satisfies $$ P P = X (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top = X (X^\top X)^{-1} X^\top = P. $$
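A quick numerical sanity check of both facts (a sketch with a small random design; the names `X_demo`, `y_demo`, `P` are just for this illustration):
In [ ]:
## Sketch: check numerically that P is idempotent and that P y gives the fitted values
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))          # small random design matrix
y_demo = rng.normal(size=20)
P = X_demo @ np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T   # hat matrix
beta_hat = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
print(np.allclose(P @ P, P))                       # idempotent: P P = P
print(np.allclose(P @ y_demo, X_demo @ beta_hat))  # P y = X beta_hat = y_hat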
Why is this any easier?
$Z$ is orthogonal, i.e. the columns are orthogonal, $$ z_j^\top z_k = 0, j\ne k $$ which means $Z^\top Z$ is diagonal (easy to invert).
Practice Exercise. Show that Successive Orthogonalization is equivalent to the Gram-Schmidt procedure for finding an orthonormal basis of the column space of $X$.
Why?
We can write these matrices as $$ X = Z \Gamma $$ where $Z$ is orthogonal and $\Gamma$ is upper triangular. Let $D$ be the diagonal matrix with $\| z_j\|$ on the diagonal. Then $$ X = Z D^{-1} D \Gamma = Q R $$ where $Q = Z D^{-1}$ is $n \times p$ and $R = D \Gamma$ is $p \times p$.
$Q$ is orthonormal ($Q^\top Q = I$) and $R$ is upper triangular.
For example, the upper-triangular system $$ \begin{bmatrix} 1 & 2 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 4 \\ 5 \end{bmatrix} $$ is solved by back-substitution: the last row gives $b = 5/3$, then the first gives $a = 4 - 2b = 2/3$.
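To make this concrete, here is a minimal sketch (the variable names are mine): it solves the small upper-triangular system above with `scipy.linalg.solve_triangular`, and then solves a least squares problem the same way after a QR factorization, since $R \hat \beta = Q^\top y$ is upper triangular.
In [ ]:
## Sketch: back-substitution for the 2x2 system above, then least squares via QR
R_small = np.array([[1., 2.], [0., 3.]])
print(sc.linalg.solve_triangular(R_small, np.array([4., 5.]), lower=False))  # [2/3, 5/3]

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 4))
y_demo = rng.normal(size=30)
Q, R = np.linalg.qr(X_demo)                    # reduced QR: Q is n x p, R is p x p
beta_qr = sc.linalg.solve_triangular(R, Q.T @ y_demo, lower=False)
print(np.allclose(beta_qr, np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]))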
Suppose that $y$ follows the linear model $$ y = X \beta + \epsilon $$ where the $\epsilon_i$ are iid normal$(0,\sigma^2)$. Then $$ \mathrm{Var}(\hat \beta_p) = \frac{\sigma^2}{\|z_p\|^2}. $$
This follows because $$ \hat \beta_p = \frac{y^\top z_p}{z_p^\top z_p}. $$ If $\|z_p\|$ is small then this estimate is unstable (high variance). When does this happen? Recall that $$ z_j = x_j - \sum_{k=0}^{j-1} \hat \gamma_{j,k} z_k. $$
If $x_p$ is highly correlated with $x_0,\ldots,x_{p-1}$ (so the residual from regressing $x_p$ on them is small), then $z_p$ is small and $\hat \beta_p$ is unstable.
In [34]:
def sim_corr_lm(n,p,rho,beta,sigma):
    """
    Simulate a design matrix with all pairwise column correlations equal to rho,
    and a response from the linear model y = X beta + noise
    """
    assert p < n and rho < 1 and rho >= 0, "p must be less than n and rho in [0,1)"
    Sigma = (1 - rho)*np.eye(p) + rho*np.ones((p,p))
    X = np.random.multivariate_normal(np.zeros(p),Sigma,n)
    y = X @ beta + np.random.normal(0,sigma,n)
    return X,y
In [51]:
n, p, rho = 100, 2, .8
beta = np.random.normal(0,1,p)
sigma = 1.
X, y = sim_corr_lm(n,p,rho,beta,sigma)
plt.plot(X[:,0],X[:,1],'.')
Out[51]:
In [52]:
def sample_coef_corr_lm(trials,**kwargs):
    """
    Sample the OLS coefficients for rho correlated input
    """
    beta_sim = []
    for t in range(trials):
        X,y = sim_corr_lm(**kwargs)
        beta_sim += [linear_model.LinearRegression(fit_intercept=False).fit(X,y).coef_]
    return np.array(beta_sim)
In [62]:
## Sample coefficients for a range of rho values and plot the variance of the first coefficient as rho increases
Rhos = [0,.2,.4,.6,.8,.9,.95,.99]
coef_vars = [sample_coef_corr_lm(100,n=n,p=p,rho=rho,beta=beta,sigma=sigma)[:,0].var(axis=0) for rho in Rhos]
plt.plot(Rhos,coef_vars)
plt.xlabel('X correlation')
plt.ylabel('beta variance')
Out[62]:
Singular Value Decomposition (for $n > p$ and $X^\top X$ invertible)
Suppose that we have precomputed the SVD, $$ X = U D V^\top. $$ Then the Gram matrix is $$ X^\top X = V D U^\top U D V^\top = V D^2 V^\top, $$ which is its spectral decomposition.
In homework 1, you should derive that $$ \hat \beta = V D^{-1} U^\top y $$
Hint: show that $(X^\top X)^{-1} = V D^{-2} V^\top$
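This is not the homework derivation, just a numerical check of the identity on a small simulated design (the `_demo` names are mine); `full_matrices=False` gives the thin SVD so the shapes match the formula.
In [ ]:
## Sketch: check numerically that V D^{-1} U^T y matches the OLS coefficients
X_demo, y_demo = sim_corr_lm(n=50, p=4, rho=.5, beta=np.ones(4), sigma=1.)
U_demo, d_demo, Vt_demo = sc.linalg.svd(X_demo, full_matrices=False)  # thin SVD
beta_svd = Vt_demo.T @ np.diag(1. / d_demo) @ U_demo.T @ y_demo
beta_ols = linear_model.LinearRegression(fit_intercept=False).fit(X_demo, y_demo).coef_
print(np.allclose(beta_svd, beta_ols))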
In [105]:
## Simulate again
n, p, rho = 100, 8, .8
beta = np.random.normal(0,1,p)
sigma = 1.
X, y = sim_corr_lm(n,p,rho,beta,sigma)
## SVD
U,d,Vt = sc.linalg.svd(X)
## Check to make sure that we understand
print(U.shape, Vt.shape, d.shape)
np.abs(X - U[:,:p] @ np.diag(d) @ Vt).sum()
Out[105]:
In [103]:
## Simulate with each rho and store the singular values of X
Rhos = [0,.2,.4,.6,.8,.9,.95,.99]
res_mat = []
for rho in Rhos:
    X, y = sim_corr_lm(n,p,rho,beta,sigma)
    U,d,Vt = sc.linalg.svd(X)
    res_mat += [d]
In [104]:
## plot the ordered singular values for each rho
colors = plt.cm.jet(np.linspace(0,1,len(Rhos)))
for col, rho, res_vec in zip(colors,Rhos,res_mat):
    plt.plot(res_vec,label=str(rho),c=col)
plt.legend()
_ = plt.xlabel('singular value index')
One solution is Ridge regression: $\hat \beta$ solves $$ \min_{\beta \in \mathbb R^p} \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2. $$
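As an illustration (a sketch, not part of the original analysis), the coefficient-sampling experiment from above can be repeated with sklearn's `Ridge`, where `alpha` plays the role of $\lambda$ and `fit_intercept=False` matches the no-intercept setup; the choices `lam = 10` and `beta_demo` are arbitrary.
In [ ]:
## Sketch: ridge shrinks the sampling variance of the coefficients for correlated X
lam = 10.
beta_demo = np.array([1., -1.])
ols_coefs, ridge_coefs = [], []
for t in range(200):
    X_t, y_t = sim_corr_lm(n=100, p=2, rho=.95, beta=beta_demo, sigma=1.)
    ols_coefs += [linear_model.LinearRegression(fit_intercept=False).fit(X_t, y_t).coef_]
    ridge_coefs += [linear_model.Ridge(alpha=lam, fit_intercept=False).fit(X_t, y_t).coef_]
print(np.array(ols_coefs)[:, 0].var(), np.array(ridge_coefs)[:, 0].var())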
In matrix form, the Ridge objective (with no intercept) is $$ (y - X \beta)^\top (y - X \beta) + \lambda \beta^\top \beta, $$ whose gradient in $\beta$ is proportional to $$ X^\top X \beta - X^\top y + \lambda \beta. $$ Setting it to zero gives $$ \hat \beta^{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y. $$
Ridge regularization always has a solution because $(X^\top X + \lambda I)^{-1}$ exists: for $\lambda > 0$, $X^\top X + \lambda I$ is positive definite even when $X^\top X$ is singular.
If $X = U D V^\top$ then $$ X^\top X + \lambda I = V D^2 V^\top + \lambda I = V (D^2 + \lambda I) V^\top, $$ because when $p < n$ we have $V V^\top = I$. So the Ridge solution can be shown to be (exercise) $$ \hat \beta = V (D^2 + \lambda I)^{-1} D U^\top y. $$ Compare to OLS, $$ \hat \beta = V D^{-1} U^\top y: $$ Ridge modifies each singular value to $$ d_j \gets \frac{d_j^2 + \lambda}{d_j}, $$ which is bounded away from zero. But Ridge regression is biased!
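A numerical check of the SVD form against sklearn's `Ridge` (again with `fit_intercept=False`, so `alpha` corresponds to $\lambda$; the simulated design and `lam = 5` are arbitrary choices):
In [ ]:
## Sketch: check that V (D^2 + lam I)^{-1} D U^T y matches the sklearn ridge fit
lam = 5.
X_demo, y_demo = sim_corr_lm(n=100, p=5, rho=.9, beta=np.ones(5), sigma=1.)
U_demo, d_demo, Vt_demo = sc.linalg.svd(X_demo, full_matrices=False)
beta_svd_ridge = Vt_demo.T @ np.diag(d_demo / (d_demo**2 + lam)) @ U_demo.T @ y_demo
beta_sklearn = linear_model.Ridge(alpha=lam, fit_intercept=False).fit(X_demo, y_demo).coef_
print(np.allclose(beta_svd_ridge, beta_sklearn))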
A model for sparsity $$ y_i = x_i^\top \beta + \epsilon_i $$ where for some of the $j = 1,\ldots,p$, $\beta_j = 0$.
Def. The support of $\beta$ is $$ \textrm{supp}(\beta) = \{j = 1,\ldots,p : \beta_j \ne 0 \}. $$ The goal is then to find $\textrm{supp}(\beta)$, or to find $\beta$ such that $|\textrm{supp}(\beta)| \le s$.
Def. The $L_0$ "norm" (not actually a norm) is $\|\beta\|_0 = |{\rm supp}(\beta)|$.
Then subset selection is $$ \min \| y - X \beta \|^2 \textrm{ s.t. } \beta \in \mathbb R^p, \| \beta \|_0 \le s $$
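For small $p$, the constraint can be handled by brute force, scanning every subset of size at most $s$; the helper `best_subset` below is a sketch I am adding (not a standard library routine), mainly to show that the search is combinatorial in $p$, which motivates the greedy approach described next.
In [ ]:
## Sketch: exhaustive best-subset search (feasible only for small p)
from itertools import combinations

def best_subset(X, y, s):
    """Return the column subset of size <= s with the smallest residual sum of squares."""
    n, p = X.shape
    best_rss, best_S = np.inf, ()
    for k in range(1, s + 1):
        for S in combinations(range(p), k):
            cols = list(S)
            beta_S = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            rss = ((y - X[:, cols] @ beta_S) ** 2).sum()
            if rss < best_rss:
                best_rss, best_S = rss, S
    return best_S, best_rss

## e.g. best_subset(X, y, s=3) with X, y from the simulations above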
Basic idea (greedy forward selection): at each step, choose the action that most improves the empirical risk.
For $s = 1,\ldots, p$: add the predictor not yet selected that most reduces the residual sum of squares, then refit OLS on the selected predictors.
The intermediate OLS steps can be slow (see the sketch below).
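A minimal sketch of this greedy procedure (the name `forward_stepwise` is mine); it refits OLS on the current subset for every candidate addition, which is exactly where the cost comes from.
In [ ]:
## Sketch: greedy forward selection, refitting OLS as each variable is added
def forward_stepwise(X, y, s):
    """Greedily add the column that most reduces the RSS, up to s columns."""
    n, p = X.shape
    selected = []
    for step in range(s):
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            beta_c = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]  # intermediate OLS fit
            rss = ((y - X[:, cols] @ beta_c) ** 2).sum()
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
    return selected

## e.g. forward_stepwise(X, y, s=3)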