Chapter 06
Linear Model Selection and Regularization

1 Subset Selection

Methods for selecting subsets of predictors.

1.1 Best Subset Selection

Fit a separate least squares regression for each possible combination of the $p$ predictors. Best subset selection therefore considers $2^p$ models, which quickly becomes infeasible as $p$ grows.
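A minimal sketch of best subset selection (not from the text), assuming a NumPy matrix `X` of shape $n \times p$ and a response vector `y`, and using scikit-learn's `LinearRegression` for the least squares fits:

```python
# Best subset selection sketch: for each model size k, fit all C(p, k)
# least squares models and keep the one with the lowest training RSS.
# Only practical for small p, since 2^p models are fit in total.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    n, p = X.shape
    best_per_size = {}
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in combinations(range(p), k):
            cols = list(subset)
            model = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - model.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        best_per_size[k] = (best_vars, best_rss)
    # Compare across sizes with cross-validation, Cp, AIC, BIC, or adjusted R^2.
    return best_per_size
```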

1.2 Stepwise Selection

A computationally efficient alternative to best subset selection that can be applied even when $p$ is very large.

Forward Stepwise Selection

Begin with a model containing no predictors, and then add predictors to the model one at a time until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
In total, forward stepwise selection considers $1+\sum_{k=0}^{p-1}(p-k)=1+p(p+1)/2$ models.

Backward Stepwise Selection

Unlike forward stepwise selection, it begins with the full least squares model containing all $p$ predictors, and then iteratively removes the least useful predictor, one at a time. Like forward stepwise selection, the backward selection approach searches through only $1+p(p+1)/2$ models.

Using Forward (FSS) and Backward (BSS) Stepwise Selection

BSS requires that the number of samples $n$ be larger than the number of variables $p$, so that the full model can be fit. In contrast, FSS can be used even when $n<p$.
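A sketch of the greedy search using scikit-learn's `SequentialFeatureSelector` on synthetic data. Note that it scores candidate variables by cross-validated fit rather than by training RSS, so it is an approximation of the procedure described above rather than the book's exact algorithm:

```python
# Greedy stepwise selection sketch: add (forward) or remove (backward)
# one variable at a time, keeping the change that helps the fit most.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       random_state=0)

forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward", cv=5
).fit(X, y)

backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="backward", cv=5
).fit(X, y)

print("forward keeps: ", forward.get_support(indices=True))
print("backward keeps:", backward.get_support(indices=True))
```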

1.3 Choosing the Optimal Model

We need a way to determine which of these models is best. $RSS$ and $R^2$ are not suitable for selecting the best model among a collection of models with different numbers of predictors, because the training error can be a poor estimate of the test error.

$C_p$, AIC, BIC, and Adjusted $R^2$

  • $C_p$

The $C_p$ estimate of the test $\text{MSE}$ is computed using the equation: $$ C_p=\frac{1}{n}(RSS+2d\hat{\sigma}^2) $$ where $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement and $d$ is the number of predictors.

  • AIC

This criterion is defined for a large class of models fit by maximum likelihood: $$ AIC = \frac{1}{n\hat{\sigma}^2}(RSS+2d\hat{\sigma}^2) $$

  • BIC

Derived from a Bayesian point of view, but ends up looking similar to $C_p$: $$ BIC=\frac{1}{n}(RSS+\log(n)\,d\hat{\sigma}^2) $$ Since $\log n > 2$ for $n > 7$, BIC generally places a heavier penalty on models with many variables than $C_p$ does.

  • Adjusted $R^2$

Since RSS always decreases as more variables are added to the model, $R^2$ always increases as more variables are added. The adjusted $R^2$ statistic is calculated as $$ \text{Adjusted } R^2 = 1 -\frac{RSS/(n-d-1)}{TSS/(n-1)} $$ Unlike the other criteria, a large value of adjusted $R^2$ indicates a model with small test error (a sketch computing all four criteria follows this list).
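A sketch that computes all four criteria for a fitted model with $d$ predictors, directly from the formulas above; the estimate $\hat{\sigma}^2$ is assumed to come from the residuals of the full model containing all $p$ predictors (one common choice):

```python
# Compute Cp, AIC, BIC, and adjusted R^2 for a model with d predictors,
# given its fitted values y_hat and an estimate sigma2_hat of Var(epsilon).
import numpy as np

def selection_criteria(y, y_hat, d, sigma2_hat):
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "adj_R2": adj_r2}
```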

2 Shrinkage Methods

Shrinking the coefficient estimates can significantly reduce their variance.

2.1 Ridge Regression

In particular, the ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize $$ \sum_{i=1}^{n}\big(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\big)^2+\lambda\sum_{j=1}^{p}\beta_j^2=RSS+\lambda\sum_{j=1}^{p}\beta_j^2 $$ where $\lambda \ge 0$ is a tuning parameter, to be determined separately.
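A ridge regression sketch using scikit-learn on synthetic data; the tuning parameter $\lambda$ is called `alpha` in scikit-learn, and the predictors are standardized first because the ridge penalty is not scale invariant:

```python
# Ridge regression sketch: larger lambda shrinks the coefficient vector
# toward zero (smaller l2 norm) without setting individual entries to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

for lam in [0.01, 1.0, 100.0]:
    model = make_pipeline(StandardScaler(), Ridge(alpha=lam)).fit(X, y)
    coefs = model.named_steps["ridge"].coef_
    print(f"lambda={lam:>6}: ||beta||_2 = {np.linalg.norm(coefs):.2f}")
```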

2.2 The Lasso

Ridge regression will include all $p$ predictors in the final model. The penalty $\lambda \sum \beta_j^2$ will shrink all of the coefficients toward zero, but it will not set any of them exactly to zero.

The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients minimize the quantity: $$ \sum_{i=1}^{n}\big(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\big)^2+\lambda\sum_{j=1}^{p}|\beta_j|=RSS+\lambda\sum_{j=1}^{p}|\beta_j| $$

The $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large.
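A lasso sketch on synthetic data illustrating the variable-selection effect: as $\lambda$ (scikit-learn's `alpha`) grows, more coefficients are set exactly to zero:

```python
# Lasso sketch: the l1 penalty zeroes out coefficients, so the lasso
# performs variable selection as lambda increases.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for lam in [0.1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(), Lasso(alpha=lam)).fit(X, y)
    n_nonzero = np.sum(model.named_steps["lasso"].coef_ != 0)
    print(f"lambda={lam:>5}: {n_nonzero} nonzero coefficients")
```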

3 Dimension Reduction Methods

All of the approaches above use the original predictors, $X_1,X_2,\ldots,X_p$. Let $Z_1,Z_2,\ldots,Z_M$ represent $M<p$ linear combinations of our original $p$ predictors. That is, $$ Z_m=\sum_{j=1}^{p}\phi_{jm}X_j $$ for some constants $\phi_{1m},\ldots,\phi_{pm}$, $m=1,\ldots,M$. We can then fit the linear regression model $$ y_i=\theta_0+\sum_{m=1}^{M}\theta_m z_{im}+\epsilon_i, \quad i=1,\ldots,n $$

3.1 Principal Component Regression

A popular approach for deriving a low-dimensional set of features from a large set of variables. It involves constructing the first $M$ principal components, $Z_1,\ldots,Z_M$, and then using these components as the predictors in a linear regression model fit by least squares. By estimating only $M \ll p$ coefficients, we can mitigate overfitting.

In PCR, the number of principal components, $M$, is typically chosen by cross-validation. Before performing PCR, it is essential to standardize each predictor (subtract its mean and divide by its standard deviation).
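A PCR sketch assembled from scikit-learn pieces (`StandardScaler`, `PCA`, `LinearRegression`), with $M$ chosen by cross-validation via `GridSearchCV` on synthetic data:

```python
# Principal component regression sketch: standardize, take the first M
# principal components, then regress y on them; choose M by cross-validation.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=15.0, random_state=0)

pcr = Pipeline([
    ("scale", StandardScaler()),   # subtract the mean, divide by the std. dev.
    ("pca", PCA()),                # derive the components Z_1, ..., Z_M
    ("ols", LinearRegression()),   # least squares on the M components
])

search = GridSearchCV(pcr, {"pca__n_components": range(1, 21)}, cv=5)
search.fit(X, y)
print("chosen M:", search.best_params_["pca__n_components"])
```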

3.2 Partial Least Squares

The PCR approach that we just described suffers from a drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
After standardizing the $p$ predictors, PLS computes the first direction $Z_1$ by setting each $\phi_{j1}$ equal to the coefficient from the simple linear regression of $Y$ onto $X_j$. Hence, in computing $Z_1=\sum_{j=1}^p\phi_{j1}X_j$, PLS places the highest weight on the variables that are most strongly related to the response.
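A PLS sketch using scikit-learn's `PLSRegression` (which standardizes the data internally by default), choosing the number of directions $M$ by cross-validation on synthetic data:

```python
# Partial least squares sketch: unlike PCR, PLS uses the response y when
# constructing the directions Z_1, ..., Z_M.
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=15.0, random_state=0)

# Compare cross-validated fit for several choices of M.
for m in [1, 2, 5, 10]:
    score = cross_val_score(PLSRegression(n_components=m), X, y, cv=5).mean()
    print(f"M={m:>2}: mean CV R^2 = {score:.3f}")
```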

4 Considerations in High Dimensions

Most traditional statistical methods are designed for low-dimensional data sets in which $n$, the number of observations, is much greater than $p$, the number of variables. In high-dimensional settings it is possible to fit the training data perfectly, yet perform very poorly on independent test data.

4.1 Regression in High Dimensions

Solutions for the curse of dimensionality (see the sketch after this list):

  • regularization or shrinkage
  • appropriate tuning parameter selection.
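A sketch combining both remedies in an $n<p$ setting: `LassoCV` fits an $\ell_1$-regularized model and selects the tuning parameter by cross-validation (synthetic data with $n=50$, $p=200$ for illustration):

```python
# High-dimensional sketch (n < p): regularize with the lasso and choose
# lambda by cross-validation; the fit stays sparse instead of interpolating.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# n = 50 observations, p = 200 variables, only 5 of them truly relevant.
X, y = make_regression(n_samples=50, n_features=200, n_informative=5,
                       noise=5.0, random_state=0)

model = LassoCV(cv=5).fit(X, y)
print("chosen lambda:", model.alpha_)
print("nonzero coefficients:", np.sum(model.coef_ != 0), "out of 200")
```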