Summary

Random Forest

  • Train the model: rf.boston <- randomForest(medv ~ ., data = Boston, subset = train)
    • The reported "Mean of squared residuals" is the out-of-bag (OOB) MSE.
    • mtry is the number of predictors randomly sampled as split candidates at each split.
    • fit$mse is a vector with one entry per tree: element i is the OOB MSE of the ensemble after i trees (see the sketch after this list).
  • Make predictions:
    • pred <- predict(fit, Boston[-train,])
  • The only real tuning parameter is mtry, the number of predictors considered at each split.
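
A minimal sketch of reading that OOB error vector (assumes rf.boston was fit as above, with randomForest loaded and the MASS Boston data):

    plot(rf.boston$mse, type = "l", xlab = "number of trees", ylab = "OOB MSE")  # error curve as trees are added
    rf.boston$mse[rf.boston$ntree]  # OOB MSE using the full ensemble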

Boosting

  • Train the model: boost.boston <- gbm(medv ~ ., data = Boston[train,], distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)
    • interaction.depth is the number of splits in each tree, i.e. its depth, which controls the order of interactions the model can capture.
    • distribution = "gaussian" means squared-error loss (regression).
  • Calling summary() on the fit prints a variable-importance table and plot.
  • Plot the partial dependence on a single variable: plot(boost.boston, i = "lstat")
  • Make predictions: predmat <- predict(boost.boston, newdata = Boston[-train,], n.trees = n.trees)
    • n.trees can be a vector; you then get one column of predictions per value.
  • A trick: apply((predmat - medv)^2, 2, mean)
    • predmat is a matrix but medv is a vector; R recycles medv down each column, so every column of predmat - medv holds the test residuals for one value of n.trees (see the small demo after this list).
  • The tuning parameters are the number of trees, the shrinkage parameter, and the tree depth.
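
A tiny, standalone demo of the recycling behind that trick (base R only):

    m <- matrix(1:6, nrow = 3)  # columns (1,2,3) and (4,5,6), playing the role of predmat
    v <- c(10, 20, 30)          # plays the role of medv
    m - v                       # v is recycled down each column
    apply((m - v)^2, 2, mean)   # column-wise mean squared difference, as in the trick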

Boosting


In [ ]:
require(gbm)
require(tidyverse)
require(MASS)                         # Boston data set lives here
set.seed(101)
train <- sample(1:nrow(Boston), 300)  # same split recreated in the Random Forest section below

In [ ]:
boost.boston <- gbm(medv ~ ., data = Boston[train,], distribution = "gaussian",  # squared-error loss
                    n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)    # many shallow trees, slow learning rate
summary(boost.boston)  # variable-importance table and plot
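
The notebook stops at summary(), but gbm also provides gbm.perf() to estimate the optimal number of trees; a minimal sketch using the OOB criterion, the only one available here since the fit has no CV folds or held-out fraction (gbm's docs caution that OOB tends to underestimate the optimum):

In [ ]:
best.iter <- gbm.perf(boost.boston, method = "OOB")  # OOB-based estimate of the best iteration
best.iter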

In [ ]:
plot(boost.boston, i = "lstat")  # partial dependence of medv on lstat

In [ ]:
plot(boost.boston, i = "rm")  # partial dependence of medv on rm

In [ ]:
n.trees <- seq(100, 10000, 100)  # evaluate the ensemble at 100, 200, ..., 10000 trees
predmat <- predict(boost.boston, newdata = Boston[-train,], n.trees = n.trees)
dim(predmat)  # rows = test observations, columns = length(n.trees)

In [ ]:
# medv is recycled down each column of predmat, so this gives the test MSE at each n.trees value
perr <- with(Boston[-train,], apply((predmat - medv)^2, 2, mean))

In [ ]:
ggplot() +
    geom_point(mapping = aes(x = n.trees, y = perr), color = "blue")
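
A small follow-up (assumes perr and n.trees from the cells above): locate the tree count minimizing the test error.

In [ ]:
n.trees[which.min(perr)]  # number of trees with the lowest test MSE
min(perr)                 # the corresponding test MSE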

Random Forest


In [ ]:
require(randomForest)
require(MASS)

In [ ]:
set.seed(101)

In [ ]:
dim(Boston)

In [ ]:
train <- sample(1:nrow(Boston), 300)

In [ ]:
length(train)

In [ ]:
rf.boston <- randomForest(medv ~ ., data = Boston, subset = train)
str(rf.boston)

In [ ]:
oob.err <- double(13)   # Boston has 13 predictors, so try mtry = 1..13
test.err <- double(13)
for (mtry in 1:13) {
    fit <- randomForest(medv ~ ., data = Boston, subset = train, mtry = mtry, ntree = 400)
    oob.err[mtry] <- fit$mse[400]  # OOB MSE after all 400 trees
    pred <- predict(fit, Boston[-train,])
    test.err[mtry] <- with(Boston[-train,], mean((pred - medv)^2))
    cat(mtry, "")  # progress indicator
}

In [ ]:
matplot(1:mtry, cbind(test.err, oob.err), pch = 19, col = c("red", "blue"), type = "b",
        xlab = "mtry", ylab = "Mean squared error")
legend("topright", legend = c("Test", "OOB"), pch = 19, col = c("red", "blue"))
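
A short follow-up (assumes the test.err and oob.err vectors from the loop above): report which mtry minimizes each curve.

In [ ]:
which.min(test.err)  # mtry with the lowest test MSE
which.min(oob.err)   # mtry with the lowest OOB MSE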