Drawing by Phil Cutler.
Any tutorial on Random Forests (RF) should also include a review of decision trees, as these are the models that are ensembled together to create the Random Forest model -- or, put another way, the "trees that comprise the forest." Much of the complexity and detail of the Random Forest algorithm lives inside the individual decision trees, so understanding decision trees is essential to understanding the RF algorithm as a whole. Therefore, before proceeding, it is recommended that you read through the accompanying Classification and Regression Trees Tutorial.
The Random Forest algorithm is preceded by the Random Subspace Method (aka "attribute bagging"), which accounts for half of the source of randomness in a Random Forest. The Random Subspace Method is an ensemble method that consists of several classifiers, each operating in a subspace of the original feature space. The outputs of the models are then combined, usually by a simple majority vote. Tin Kam Ho applied the random subspace method to decision trees in 1995.
Leo Breiman and Adele Cutler combined Breiman's bagging idea with the random subspace method to create a "Random Forest", a name which is trademarked by the duo. Due to the trademark, the algorithm is sometimes called Random Decision Forests.
The introduction of random forests proper was first made in a paper by Leo Breiman [1]. This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:
The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation.
Although Breiman's implementation of Random Forests used his CART algorithm to construct the decision trees, many modern implementations of Random Forest use entropy-based algorithms for constructing the trees.
Bagging (Bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve classification by combining the predictions of classifiers trained on randomly generated training sets. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set $X = x_1, ..., x_n$ with responses $Y = y_1, ..., y_n$, bagging repeatedly ($B$ times) selects a random sample with replacement of the training set and fits trees to these samples:
For $b = 1, ..., B$:

1. Sample, with replacement, $n$ training examples from $X$, $Y$; call these $X_b$, $Y_b$.
2. Train a classification or regression tree $\hat{f}_b$ on $X_b$, $Y_b$.
After training, predictions for unseen samples $x'$ can be made by averaging the predictions from all the individual regression trees on $x'$:
$$ \hat{f} = \frac{1}{B}\sum_{b=1}^{B} \hat{f}_{b}(x')$$

or by taking the majority vote in the case of classification trees.
The above procedure describes the original bagging algorithm for trees. Random Forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging".
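The two sources of randomness described above -- bootstrap sampling of the rows and a random feature subset at each split -- can be sketched in a few lines. This is an illustrative Python sketch (not the R code used elsewhere in this tutorial), using scikit-learn's `DecisionTreeClassifier`, whose `max_features` argument implements the per-split feature subsampling:

```python
# Minimal sketch of bagging + "feature bagging" (a Random Forest by hand).
# Each tree sees a bootstrap sample of the rows, and max_features="sqrt"
# restricts the candidate features considered at every split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

B = 25  # number of trees
n = X.shape[0]
trees = []
for b in range(B):
    idx = rng.randint(0, n, size=n)  # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

def predict(X_new):
    # Majority vote across the ensemble (the classification case)
    votes = np.stack([t.predict(X_new) for t in trees])  # shape (B, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)

print("training accuracy:", (predict(X) == y).mean())
```

In practice you would use a library implementation, but the sketch makes clear that each tree is independent of the others, which is why the training loop is trivially parallelizable.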
In Extremely Randomized Trees (aka Extra-Trees) [2], randomness goes one step further in the way splits are computed. As in Random Forests, a random subset of candidate features is used, but instead of looking for the best split, thresholds (for the split) are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually allows the variance of the model to be reduced a bit more, at the expense of a slightly greater increase in bias.
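To make the Extra-Trees split rule concrete, here is a Python sketch of a single node's split selection (simplified for illustration; `gini` and `extra_trees_split` are names invented here, not a library API). Note that the threshold for each candidate feature is drawn uniformly at random rather than optimized:

```python
# Sketch of the Extra-Trees split rule at one node: draw ONE random
# threshold per candidate feature, then keep the (feature, threshold)
# pair with the largest Gini impurity decrease.
import numpy as np

def gini(y):
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def extra_trees_split(X, y, candidate_features, rng):
    best = None
    parent = gini(y)
    for j in candidate_features:
        lo, hi = X[:, j].min(), X[:, j].max()
        t = rng.uniform(lo, hi)          # random threshold, NOT the best one
        left = y[X[:, j] <= t]
        right = y[X[:, j] > t]
        # Impurity decrease achieved by this random split
        gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or gain > best[2]:
            best = (j, t, gain)
    return best  # (feature index, threshold, impurity decrease)

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 2] > 0).astype(int)  # class depends only on feature 2
print(extra_trees_split(X, y, candidate_features=[0, 1, 2, 3, 4], rng=rng))
```

Because no threshold search is performed, node splitting is cheaper than in a standard Random Forest, which is part of why Extra-Trees can be faster to train.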
Extremely Randomized Trees is implemented in the extraTrees R package and is also available in the h2o R package as part of the h2o.randomForest() function via the histogram_type = "Random" argument.
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run: each training case is left out of the bootstrap samples of roughly one-third of the trees, and aggregating the predictions of only those trees for each case yields the out-of-bag (OOB) error estimate.
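Many implementations expose this internal estimate directly. As an illustration (in Python rather than the R used elsewhere in this tutorial), scikit-learn's `RandomForestClassifier` computes it when `oob_score=True`:

```python
# The OOB estimate in practice: each sample is scored only by the trees
# that did NOT see it in their bootstrap sample, so no held-out test set
# is needed to estimate generalization accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)
```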
In every tree grown in the forest, put down the OOB cases and count the number of votes cast for the correct class. Now randomly permute the values of variable $m$ in the OOB cases and put these cases down the tree. Subtract the number of votes for the correct class in the variable-$m$-permuted OOB data from the number of votes for the correct class in the untouched OOB data. The average of this number over all trees in the forest is the raw importance score for variable $m$.
If the values of this score from tree to tree are independent, then the standard error can be computed by a standard computation. The correlations of these scores between trees have been computed for a number of data sets and proved to be quite low, so we compute standard errors in the classical way, divide the raw score by its standard error to get a $z$-score, and assign a significance level to the $z$-score assuming normality.
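The permutation-importance procedure above can be sketched from scratch. This Python illustration simplifies Breiman's per-tree OOB vote counts to held-out accuracy (all names here are invented for the sketch, not a library API), but the structure is the same: permute variable $m$, measure the drop in performance, and divide the raw score by its standard error to get a $z$-score:

```python
# Sketch of permutation importance with a z-score, simplified to use
# held-out accuracy instead of per-tree OOB votes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

rng = np.random.RandomState(0)
base = (rf.predict(X_te) == y_te).mean()
scores = []
for m in range(X_te.shape[1]):
    drops = []
    for _ in range(10):                    # repeat the permutation a few times
        X_perm = X_te.copy()
        rng.shuffle(X_perm[:, m])          # permute variable m only
        drops.append(base - (rf.predict(X_perm) == y_te).mean())
    raw = np.mean(drops)                   # raw importance score
    se = np.std(drops) / np.sqrt(len(drops))
    scores.append((raw, raw / se if se > 0 else 0.0))  # (raw, z-score)
```

Permuting an informative variable should produce a clearly positive raw score, while permuting a noise variable should leave accuracy roughly unchanged.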
If the number of variables is very large, forests can be run once with all the variables, then run again using only the most important variables from the first run.
For each case, consider all the trees for which it is OOB. Subtract the percentage of votes for the correct class in the variable-$m$-permuted OOB data from the percentage of votes for the correct class in the untouched OOB data. This is the local importance score for variable $m$ for this case.
Variable importance in Extremely Randomized Trees is explained here.
Leo Breiman famously claimed that "Random Forests do not overfit." This is perhaps not exactly the case, however they are certainly more robust to overfitting than a Gradient Boosting Machine (GBM). Random Forests can be overfit by growing trees that are "too deep", for example. However, it is hard to overfit a Random Forest by adding more trees to the forest -- typically that will increase accuracy (at the expense of computation time).
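A quick empirical check of the "more trees don't overfit" claim (a Python illustration on synthetic data, not a proof): held-out accuracy typically plateaus, rather than degrading, as the number of trees grows.

```python
# Grow forests of increasing size and compare held-out accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

accs = {}
for n in (5, 50, 200):
    rf = RandomForestClassifier(n_estimators=n, random_state=7).fit(X_tr, y_tr)
    accs[n] = rf.score(X_te, y_te)
print(accs)  # accuracy plateaus (and does not collapse) as n grows
```

Contrast this with tree depth: deeper trees can overfit, which is why depth (not forest size) is the hyperparameter to watch.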
Missing values do not necessarily have to be imputed in a Random Forest implementation, although some software packages will require it.
Here is a short article, The Unreasonable Effectiveness of Random Forests, by Ahmed El Deeb, about the utility of Random Forests. It summarizes some of the algorithm's pros and cons nicely.
The oldest and most well known implementation of the Random Forest algorithm in R is the randomForest package. There are also a number of packages that implement variants of the algorithm, and in the past few years, there have been several "big data" focused implementations contributed to the R ecosystem as well.
Here is a non-comprehensive list:
- party::cForest
- obliqueRF
- randomForest
- randomForest + foreach
- rFerns
- ranger
- quantregForest
- extraTrees
- inTrees
- Boruta
- RRF
- rotationForest
- wsrf
- ParallelForest
- randomForestSRC
- randomUniformForest
Since there are so many different Random Forest implementations available, there have been several benchmarks to compare the performance of popular implementations, including implementations outside of R. A few examples:
Authors: Fortran original by Leo Breiman and Adele Cutler, R port by Andy Liaw and Matthew Wiener.
Backend: Fortran
Features:
In [ ]:
# randomForest example
#install.packages("randomForest")
#install.packages("cvAUC")
library(randomForest)
library(cvAUC)
In [ ]:
# Load binary-response dataset
#train <- read.csv("data/higgs_train_10k.csv")
#test <- read.csv("data/higgs_test_5k.csv")
train <- read.csv("higgs_train_10k.csv")
test <- read.csv("higgs_test_5k.csv")
# Dimensions
dim(train)
dim(test)
# Columns
names(train)
In [ ]:
# Identify the response column
ycol <- "response"
# Identify the predictor columns
xcols <- setdiff(names(train), ycol)
# Convert response to factor (required by randomForest)
train[ , ycol] <- as.factor(train[ , ycol])
test[ , ycol] <- as.factor(test[ , ycol])
In [ ]:
# Train a default RF model with 50 trees
set.seed(1) # For reproducibility
system.time(model <- randomForest(x = train[ , xcols],
y = train[ , ycol],
xtest = test[ , xcols],
ntree = 50))
In [7]:
# Generate predictions on test dataset
preds <- model$test$votes[, 2]
labels <- test[ , ycol]
# Compute AUC on the test set
cvAUC::AUC(predictions = preds, labels = labels)
In [ ]:
#library(caret)
#library(doParallel)
#library(e1071)
In [ ]:
# Train a "parRF" model using caret
#registerDoParallel(cores = 8)
#registerDoParallel(cores = 4)
#model <- caret::train(x = train[,xcols],
# y = train[,ycol],
# method = "parRF",
# preProcess = NULL,
# weights = NULL,
# metric = "Accuracy",
# maximize = TRUE,
# trControl = trainControl(),
# tuneGrid = NULL,
# tuneLength = 3)
Authors: Jan Vitek, Arno Candel, H2O.ai contributors
Backend: Java
Features:
Implementation details are presented in slide decks by Michal Malohlava and Jan Vitek.
In [1]:
#install.packages("h2o")
library(h2o)
#h2o.shutdown(prompt = FALSE)
h2o.init(nthreads = -1) #Start a local H2O cluster using nthreads = num available cores
In [2]:
# Load binary-response dataset
train <- h2o.importFile("./higgs_train_10k.csv")
test <- h2o.importFile("./higgs_test_5k.csv")
# Dimensions
dim(train)
dim(test)
# Columns
names(train)
In [3]:
# Identify the response column
ycol <- "response"
# Identify the predictor columns
xcols <- setdiff(names(train), ycol)
# Convert response to factor (required for classification in H2O)
train[ , ycol] <- as.factor(train[ , ycol])
test[ , ycol] <- as.factor(test[ , ycol])
In [4]:
# Train a default RF model with 50 trees
system.time(model <- h2o.randomForest(x = xcols,
y = ycol,
training_frame = train,
seed = 1, #for reproducibility
ntrees = 50))
In [5]:
# Compute AUC on test dataset
# H2O computes many model performance metrics automatically, including AUC
perf <- h2o.performance(model = model, newdata = test)
h2o.auc(perf)
Authors: Mark Seligman
Backend: C++
The Arborist provides a fast, open-source implementation of the Random Forest algorithm. The Arborist achieves its speed through efficient C++ code and parallel, distributed tree construction. This slide deck provides detail about the implementation and vision of the project.
Features:
Authors: Marvin N. Wright and Andreas Ziegler
Backend: C++
Ranger is a fast implementation of random forest (Breiman 2001) or recursive partitioning, particularly suited for high dimensional data. Classification, regression, probability estimation and survival forests are supported. Classification and regression forests are implemented as in the original Random Forest (Breiman 2001), survival forests as in Random Survival Forests (Ishwaran et al. 2008). For probability estimation forests see Malley et al. (2012).
Features: