Chapter 06
Linear Model Selection and Regularization
Methods for selecting subsets of predictors.
Fit a separate least squares regression for each possible combination of the $p$ predictors. Best subset selection therefore considers $2^p$ possible models.
For computational reasons, best subset selection cannot be applied with very large $p$.
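As a concrete sketch, best subset selection can be implemented by looping over every non-empty subset of predictors. The code below is a minimal illustration, assuming NumPy arrays `X` (shape $n \times p$) and `y`; the helper name `best_subset` is purely illustrative.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """Return the lowest-RSS model for each subset size k = 1, ..., p."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            model = LinearRegression().fit(X[:, subset], y)
            rss = np.sum((y - model.predict(X[:, subset])) ** 2)
            if k not in best or rss < best[k][1]:
                best[k] = (subset, rss)
    # The winners for different k would then be compared with C_p, BIC,
    # adjusted R^2, or cross-validation.
    return best
```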
Begin with a model containing no predictors, and then add predictors to the model, one at a time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
When $k$ predictors are already in the model, $p-k$ candidate models are considered at the next step, so forward stepwise selection fits a total of $1+\sum_{k=0}^{p-1}(p-k)=1+p(p+1)/2$ models.
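A minimal sketch of forward stepwise selection with scikit-learn's `SequentialFeatureSelector`, again assuming arrays `X` and `y`; note that this selector adds the variable that most improves the cross-validated score rather than the training RSS, and the stopping point of 5 predictors is an arbitrary illustrative choice.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Greedily add one predictor at a time, keeping the variable that most
# improves the cross-validated fit at each step.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,            # illustrative stopping point
    direction="forward",
    scoring="neg_mean_squared_error",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())               # boolean mask of the selected predictors
```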
Unlike forward stepwise selection, backward stepwise selection begins with the full least squares model containing all $p$ predictors, and then iteratively removes the least useful predictor, one at a time. Like forward stepwise selection, the backward selection approach searches through only $1+p(p+1)/2$ models.
Backward stepwise selection requires that the number of samples $n$ is larger than the number of variables $p$, so that the full model can be fit. In contrast, forward stepwise selection can be used even when $n<p$.
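Backward stepwise selection can be sketched with the same selector by switching the direction; as before, `X`, `y`, and the stopping point are illustrative assumptions.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Start from all p predictors and greedily drop the least useful one.
bwd = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward", cv=5
)
bwd.fit(X, y)
```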
We need a way to determine which of these models is best. $RSS$ and $R^2$ are not suitable for selecting the best model among a collection of models with different numbers of predictors, since the training error can be a poor estimate of the test error.
The $C_p$ estimate of the test $\text{MSE}$ is computed using the equation: $$ C_p=\frac{1}{n}(RSS+2d\hat{\sigma}^2) $$ where $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement and $d$ is the number of predictors.
The AIC criterion is defined for a large class of models fit by maximum likelihood: $$ AIC = \frac{1}{n\hat{\sigma}^2}(RSS+2d\hat{\sigma}^2) $$
Derived from a Bayesian point of view, the BIC ends up looking similar to $C_p$: $$ BIC=\frac{1}{n}(RSS+\log(n)\,d\hat{\sigma}^2) $$
Since $RSS$ always decreases as more variables are added to the model, $R^2$ always increases as more variables are added. The adjusted $R^2$ statistic is calculated as $$ \text{Adjusted } R^2 = 1 -\frac{RSS/(n-d-1)}{TSS/(n-1)} $$
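All four criteria can be computed directly from a fitted model's $RSS$. Below is a small helper following the formulas above, assuming $n$, $d$, $TSS$, and an estimate `sigma2` of $\text{Var}(\epsilon)$ (for example, taken from the full least squares model) are available. Among a set of candidate models, the one with the smallest $C_p$, AIC, or BIC, or the largest adjusted $R^2$, is selected.

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2):
    """Model selection criteria for a least squares fit with d predictors."""
    cp = (rss + 2 * d * sigma2) / n
    aic = (rss + 2 * d * sigma2) / (n * sigma2)
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "adjusted_R2": adj_r2}
```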
In particular, the ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize $$ \sum_{i=1}^{n}\big(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\big)^2+\lambda\sum_{j=1}^{p}\beta_j^2=RSS+\lambda\sum_{j=1}^{p}\beta_j^2 $$ where $\lambda \ge 0$ is a tuning parameter, to be determined separately.
Ridge regression will include all $p$ predictors in the final model. The penalty $\lambda \sum \beta_j^2$ will shrink all of the coefficients toward zero, but it will not set any of them exactly to zero.
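A minimal ridge regression sketch with scikit-learn, assuming arrays `X` and `y`; the standardization step and the grid of $\lambda$ values (called `alphas` in scikit-learn) are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ridge = make_pipeline(
    StandardScaler(),                               # ridge is not scale-invariant
    RidgeCV(alphas=np.logspace(-3, 3, 100), cv=5),  # choose lambda by 5-fold CV
)
ridge.fit(X, y)
print(ridge[-1].alpha_)   # selected lambda
print(ridge[-1].coef_)    # all p coefficients are shrunk but nonzero
```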
The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficient estimates minimize the quantity: $$ \sum_{i=1}^{n}\big(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\big)^2+\lambda\sum_{j=1}^{p}|\beta_j|=RSS+\lambda\sum_{j=1}^{p}|\beta_j| $$
The $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large, so the lasso performs variable selection.
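A parallel sketch for the lasso, again assuming `X` and `y` and a cross-validated choice of $\lambda$; with a sufficiently large penalty some entries of `coef_` are exactly zero.

```python
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))
lasso.fit(X, y)
coefs = lasso[-1].coef_
print("selected predictors:", (coefs != 0).sum(), "of", len(coefs))
```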
All of the approaches above use the original predictors, $X_1,X_2,\ldots,X_p$. Let $Z_1,Z_2,\ldots,Z_M$ represent $M<p$ linear combinations of our original $p$ predictors. That is, $$ Z_m=\sum_{j=1}^{p}\phi_{jm}X_j $$ for some constants $\phi_{1m},\ldots,\phi_{pm}$, $m=1,\ldots,M$. We can then fit the linear regression model $$ y_i=\theta_0+\sum_{m=1}^{M}\theta_mz_{im}+\epsilon_i, \quad i=1,\ldots,n $$
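A toy illustration of this dimension reduction framework, using simulated data and an arbitrary matrix `Phi` of constants $\phi_{jm}$; PCR and PLS below differ only in how these constants are chosen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p, M = 100, 10, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
Phi = rng.normal(size=(p, M))       # placeholder for the phi_{jm} constants

Z = X @ Phi                         # Z_m = sum_j phi_{jm} X_j, shape (n, M)
fit = LinearRegression().fit(Z, y)  # regress y on Z_1, ..., Z_M
print(fit.coef_)                    # theta_1, ..., theta_M
```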
Principal components regression (PCR) is a popular approach for deriving a low-dimensional set of features from a large set of variables. It involves constructing the first $M$ principal components, $Z_1,\ldots,Z_M$, and then using these components as the predictors in a linear regression model fit by least squares. By estimating only $M \ll p$ coefficients, we can mitigate overfitting.
In PCR, the number of principal components, $M$, is typically chosen by cross-validation. Before performing PCR, it is essential to standardize each predictor (subtract the mean and divide by the standard deviation).
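A PCR sketch along these lines: standardize, compute the principal components, fit least squares on them, and choose $M$ by 5-fold cross-validation. The grid of candidate $M$ values and the scoring metric are assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])
search = GridSearchCV(
    pcr,
    {"pca__n_components": range(1, 11)},  # candidate M values (must not exceed p)
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)                # cross-validated choice of M
```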
The PCR approach described above suffers from a drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
After standardizing the $p$ predictors, partial least squares (PLS) computes the first direction $Z_1$ by setting each $\phi_{j1}$ equal to the coefficient from the simple linear regression of $Y$ onto $X_j$. Hence, in computing $Z_1=\sum_{j=1}^p\phi_{j1}X_j$, PLS places the highest weight on the variables that are most strongly related to the response.
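A corresponding PLS sketch with scikit-learn's `PLSRegression`; $M=2$ components is an arbitrary choice that would normally be tuned by cross-validation.

```python
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)  # predictors are scaled internally by default
pls.fit(X, y)
print(pls.x_weights_[:, 0])          # weights phi_{j1} defining the first PLS direction
```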
Most traditional statistical methods are designed for low-dimensional data sets in which $n$, the number of observations, is much greater than $p$, the number of variables. In the high-dimensional setting, it is possible to fit the training data perfectly, yet perform very poorly on independent test data.
Solutions for the curse of dimensionality include fitting less flexible models, such as forward stepwise selection, ridge regression, the lasso, and principal components regression, as sketched below.
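A hedged simulation of the $n<p$ setting, illustrating both the problem (ordinary least squares fits the training data essentially perfectly) and one of the solutions above (the lasso); the sample sizes and the data-generating process are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(1)
n, p = 50, 200                                   # high-dimensional: p >> n
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + rng.normal(size=n)   # only 5 predictors matter

ols = LinearRegression().fit(X, y)
print("OLS training R^2:", ols.score(X, y))      # essentially 1: severe overfitting

lasso = LassoCV(cv=5).fit(X, y)
print("lasso nonzero coefficients:", (lasso.coef_ != 0).sum())
```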