Chapter 08
Tree-Based Methods
Roughly speaking, there are two steps:
1. Divide the predictor space into $J$ distinct and non-overlapping regions (boxes) $R_1,\ldots,R_J$.
2. For every observation that falls into region $R_j$, make the same prediction: the mean of the response values of the training observations in $R_j$.
The goal is to find boxes $R_1,\ldots,R_J$ that minimize the RSS, given by $$ \sum_{j=1}^{J}\sum_{i\in R_j}(y_i-\hat{y}_{R_j})^2 $$
where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box.
In greater detail, for any $j$ and $s$, we define the pair of half-planes $$ R_1(j,s)=\{X \mid X_j < s \} \quad\text{and}\quad R_2(j,s)=\{X \mid X_j \ge s\} $$ and seek the values of $j$ and $s$ that minimize $$ \sum_{i:\,x_i \in R_1(j,s)}(y_i-\hat{y}_{R_1})^2 + \sum_{i:\,x_i\in R_2(j,s)}(y_i-\hat{y}_{R_2})^2 $$ where $\hat{y}_{R_1}$ is the mean response for the training observations in $R_1(j,s)$ and $\hat{y}_{R_2}$ is the mean response for the training observations in $R_2(j,s)$.
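A minimal sketch of one step of this recursive binary splitting: an exhaustive search over predictors $j$ and cutpoints $s$ for the split that minimizes the two-region RSS above. The function and variable names here are illustrative, not from the text.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, rss) for the best split R1 = {X_j < s}, R2 = {X_j >= s}."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):                      # try every predictor
        for s in np.unique(X[:, j]):        # try every observed cutpoint
            left = X[:, j] < s
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue                    # skip splits that leave a region empty
            # RSS of each half-plane around its own mean response
            rss = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best
```

Growing the full tree just repeats this search within each resulting region until a stopping criterion (e.g., a minimum number of observations per node) is reached.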
The strategy to reduce the variance of a regression tree is to grow a very large tree $T_0$ and then prune it back in order to obtain a subtree. Instead of considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$.
For each value of $\alpha$ there corresponds a subtree $T \subset T_0$ such that $$ \sum_{m=1}^{|T|}\sum_{i:\,x_i\in R_m} (y_i-\hat{y}_{R_m})^2 +\alpha|T| $$ is as small as possible. Here $|T|$ denotes the number of terminal nodes of the tree $T$, $R_m$ is the rectangle (the subset of predictor space) corresponding to the $m$th terminal node, and $\hat{y}_{R_m}$ is the mean response of the training observations in $R_m$.
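As a brief sketch of how this cost-complexity pruning could look in practice, scikit-learn exposes the penalty above through `ccp_alpha`: grow the large tree $T_0$, extract the candidate values of $\alpha$, and pick the one with the best cross-validated error. The dataset names `X` and `y` are placeholders, not part of the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

big_tree = DecisionTreeRegressor(random_state=0)
path = big_tree.cost_complexity_pruning_path(X, y)   # candidate alphas for T0

scores = []
for alpha in path.ccp_alphas:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    scores.append(cross_val_score(pruned, X, y, cv=5).mean())

best_alpha = path.ccp_alphas[int(np.argmax(scores))]
final_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```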
The decision trees mentioned above suffer from high variance. Recall that given a set of $n$ independent observations $Z_1,\ldots,Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is $\sigma^2/n$. So we could compute $\hat{f}^1(x),\hat{f}^2(x),\ldots,\hat{f}^B(x)$ using $B$ separate training sets, and average them in order to obtain a single low-variance statistical learning model: $$ \hat{f}_{avg}(x)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^b(x) $$ In practice, we do not have access to multiple training sets. Instead, we can bootstrap, taking repeated samples from the (single) training data set. We then train our method on the $b$th bootstrap training set to obtain $\hat{f}^{*b}(x)$, and finally average all the predictions.
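A minimal sketch of this bagging procedure, assuming `X`, `y`, and `X_test` are NumPy arrays (placeholders, not from the text): fit $B$ deep trees on bootstrap samples of the single training set and average their predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
B = 100
preds = []
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))          # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])  # deep, unpruned tree
    preds.append(tree.predict(X_test))
f_bag = np.mean(preds, axis=0)                          # averaged, lower-variance prediction
```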
As in the bagging approach, we build a number of decision trees on bootstrapped training samples. However, when building these decision trees, each time a split in a tree is considered, a random sample of $m$ predictors is chosen as split candidates from the full set of $p$ predictors. Typically we choose $m \approx \sqrt{p}$, where $p$ is the total number of predictors.
We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable.
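A short sketch of this random forest idea with scikit-learn, where `max_features="sqrt"` restricts each split to $m \approx \sqrt{p}$ candidate predictors as described above; `X` and `y` are again placeholders.

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=500,        # B bootstrapped trees
    max_features="sqrt",     # m = sqrt(p) candidate predictors at each split
    random_state=0,
).fit(X, y)
```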