Chapter 08
Tree-Based Methods
Roughly speaking, there are two steps:
1. Divide the predictor space into $J$ distinct and non-overlapping regions (boxes) $R_1,\ldots,R_J$.
2. For every observation that falls into region $R_j$, make the same prediction: the mean of the response values of the training observations in $R_j$.
The goal is to find boxes $R_1,\ldots,R_J$ that minimize the RSS, given by $$ \sum_{j=1}^{J}\sum_{i\in R_j}(y_i-\hat{y}_{R_j})^2 $$
where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box.
In greater detail, for any $j$ and $s$, we define the pair of half-planes $$ R_1(j,s)=\{X \mid X_j < s \} \quad\text{and}\quad R_2(j,s)=\{X \mid X_j \ge s\} $$ and seek the values of $j$ and $s$ that minimize $$ \sum_{i:\,x_i \in R_1(j,s)}(y_i-\hat{y}_{R_1})^2 + \sum_{i:\,x_i\in R_2(j,s)}(y_i-\hat{y}_{R_2})^2 $$ where $\hat{y}_{R_1}$ is the mean response for the training observations in $R_1(j,s)$ and $\hat{y}_{R_2}$ is the mean response for the training observations in $R_2(j,s)$.
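A minimal sketch of one step of this recursive binary splitting: an exhaustive search over predictors $j$ and cutpoints $s$ for the split that minimizes the two-region RSS above. The function and variable names here are illustrative, not from the text.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, rss) for the best split R1 = {X_j < s}, R2 = {X_j >= s}."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):                      # try every predictor
        for s in np.unique(X[:, j]):        # try every observed cutpoint
            left = X[:, j] < s
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue                    # skip splits that leave a region empty
            # RSS of each half-plane around its own mean response
            rss = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best
```

Growing the full tree just repeats this search within each resulting region until a stopping criterion (e.g., a minimum number of observations per node) is reached.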
The strategy to reduce the variance of a regression tree is to grow a very large tree $T_0$ and then prune it back in order to obtain a subtree. Instead of considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$.
For each value of $\alpha$ there corresponds a subtree $T \subset T_0$ such that $$ \sum_{m=1}^{|T|}\sum_{i:\,x_i\in R_m} (y_i-\hat{y}_{R_m})^2 +\alpha|T| $$ is as small as possible. Here $|T|$ denotes the number of terminal nodes of the tree $T$, $R_m$ is the rectangle (the subset of predictor space) corresponding to the $m$th terminal node, and $\hat{y}_{R_m}$ is the mean response of the training observations in $R_m$.
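As a brief sketch of how this cost-complexity pruning could look in practice, scikit-learn exposes the penalty above through `ccp_alpha`: grow the large tree $T_0$, extract the candidate values of $\alpha$, and pick the one with the best cross-validated error. The dataset names `X` and `y` are placeholders, not part of the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

big_tree = DecisionTreeRegressor(random_state=0)
path = big_tree.cost_complexity_pruning_path(X, y)   # candidate alphas for T0

scores = []
for alpha in path.ccp_alphas:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    scores.append(cross_val_score(pruned, X, y, cv=5).mean())

best_alpha = path.ccp_alphas[int(np.argmax(scores))]
final_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```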
The decision trees mentioned above suffer from high variance. Recall that given a set of $n$ independent observations $Z_1,\ldots,Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is $\sigma^2/n$. So we could compute $\hat{f}^1(x),\hat{f}^2(x),\ldots,\hat{f}^B(x)$ using $B$ separate training sets, and average them in order to obtain a single low-variance statistical learning model: $$ \hat{f}_{avg}(x)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^b(x) $$ In practice, we do not have access to multiple training sets. Instead, we can bootstrap, taking repeated samples from the (single) training data set. We then train our method on the $b$th bootstrap training set to obtain $\hat{f}^{*b}(x)$, and finally average all the predictions.
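A minimal sketch of this bagging procedure, assuming `X`, `y`, and `X_test` are NumPy arrays (placeholders, not from the text): fit $B$ deep trees on bootstrap samples of the single training set and average their predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
B = 100
preds = []
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))          # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])  # deep, unpruned tree
    preds.append(tree.predict(X_test))
f_bag = np.mean(preds, axis=0)                          # averaged, lower-variance prediction
```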
As in the bagging approach, we build a number of decision trees on bootstrapped training samples. However, when building these decision trees, each time a split in a tree is considered, a random sample of $m$ predictors is chosen as split candidates from the full set of $p$ predictors. Typically we choose $m \approx \sqrt{p}$, where $p$ is the total number of predictors.
We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable.
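A short sketch of this random forest idea with scikit-learn, where `max_features="sqrt"` restricts each split to $m \approx \sqrt{p}$ candidate predictors as described above; `X` and `y` are again placeholders.

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=500,        # B bootstrapped trees
    max_features="sqrt",     # m = sqrt(p) candidate predictors at each split
    random_state=0,
).fit(X, y)
```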