Stacking (sometimes called "stacked generalization") involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm, the metalearner, is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. If an arbitrary metalearning algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although in practice, a single-layer logistic regression model is often used for metalearning.
Stacking typically yields performance better than any single one of the trained models in the ensemble.
Leo Breiman, known for his work on classification and regression trees and as the creator of the Random Forest algorithm, formalized stacking in his 1996 paper, "Stacked Regressions". Although the idea originated with David Wolpert in 1992 under the name "Stacked Generalization", the modern form of stacking that uses internal k-fold cross-validation was Dr. Breiman's contribution.
However, it wasn't until 2007 that the theoretical background for stacking was developed, which is when the algorithm took on the name "Super Learner". Until then, the mathematical reasons why stacking worked were unknown and stacking was considered a "black art." The Super Learner algorithm learns the optimal combination of the base learner fits. In an article titled "Super Learner", Mark van der Laan et al. proved that the Super Learner ensemble represents an asymptotically optimal system for learning.
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained by any of the constituent algorithms alone.
A common task in machine learning is to perform model selection by specifying a number of models with different parameters; grid search is one example. The first phase of the Super Learner algorithm is computationally equivalent to performing model selection via cross-validation. The second phase (the metalearning step) is simply training another single model, without cross-validation, on the level-one data.
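To make the two phases concrete, here is a minimal sketch in base R (illustrative only, not the h2oEnsemble internals), using a GLM and a grand-mean model as base learners and a logistic regression metalearner:
set.seed(1)
n <- 500
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- rbinom(n, 1, plogis(X$x1 - X$x2))
V <- 5
folds <- sample(rep(1:V, length.out = n))
# Phase 1: V-fold cross-validation generates the level-one data Z
# (one row per observation, one column per base learner); each
# observation is predicted by models that never saw it in training.
Z <- matrix(NA, n, 2, dimnames = list(NULL, c("glm_pred", "mean_pred")))
for (v in 1:V) {
  tr <- folds != v
  base_glm <- glm(y ~ x1 + x2, data = data.frame(X, y = y)[tr, ], family = binomial())
  Z[!tr, "glm_pred"]  <- predict(base_glm, X[!tr, ], type = "response")
  Z[!tr, "mean_pred"] <- mean(y[tr])
}
# Phase 2: train a single metalearner (no cross-validation) on the level-one data.
meta <- glm(y ~ ., data = data.frame(Z, y = y), family = binomial())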
Stacking is a broad class of algorithms that involves training a second-level "metalearner" to ensemble a group of base learners. The three packages in the R ecosystem that implement the Super Learner algorithm (stacking on cross-validated predictions) are SuperLearner, subsemble, and h2oEnsemble.
Among ensemble software in R, there is also caretEnsemble, but it implements a bootstrapped (rather than cross-validated) version of stacking via the caretStack() function. The bootstrapped version trains faster, since bootstrapping (with a single train/test split) is a fraction of the work of k-fold cross-validation; however, ensemble performance suffers as a result of this shortcut.
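As a rough sketch of that bootstrapped approach, assuming the caretEnsemble API (caretList() to fit the base learners, caretStack() to combine them) and a hypothetical data frame train_df with a factor response Class:
library(caret)
library(caretEnsemble)
# Bootstrap resampling instead of k-fold cross-validation
ctrl <- trainControl(method = "boot", number = 10,
                     savePredictions = "final", classProbs = TRUE)
# train_df and Class are placeholders for your own data
models <- caretList(Class ~ ., data = train_df,
                    trControl = ctrl,
                    methodList = c("glm", "rpart"))
stack <- caretStack(models, method = "glm")  # GLM metalearner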
Authors: Eric Polley, Erin LeDell, Mark van der Laan
Backend: R, with constituent algorithms written in a variety of languages
The original Super Learner implementation is the SuperLearner R package (2010).
Features:
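A minimal usage sketch (the SL.mean and SL.glm wrappers ship with the package; the simulated data here are just for illustration):
library(SuperLearner)
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1))
# NNLS metalearner (the default) over a two-model base library
sl <- SuperLearner(Y = Y, X = X, family = binomial(),
                   SL.library = c("SL.mean", "SL.glm"))
sl  # prints the CV risk and metalearner weight of each base learner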
Authors: Erin LeDell, Stephanie Sapp
Backend: R, with constituent algorithms written in a variety of languages
Subsemble is a general subset ensemble prediction method, which can be used for small, moderate, or large datasets. Subsemble partitions the full dataset into subsets of observations, fits a specified underlying algorithm on each subset, and uses a unique form of k-fold cross-validation to output a prediction function that combines the subset-specific fits. An oracle result provides a theoretical performance guarantee for Subsemble.
Features:
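A minimal usage sketch, assuming the subsemble() interface with SuperLearner-style wrappers (simulated data for illustration):
library(subsemble)
set.seed(1)
x <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
y <- rbinom(200, 1, plogis(x$x1))
# Partition the data into 3 subsets, fit each wrapper on each subset,
# and combine the subset-specific fits with a GLM metalearner
sub <- subsemble(x = x, y = y, family = binomial(),
                 learner = c("SL.glm", "SL.mean"),
                 metalearner = "SL.glm",
                 subsets = 3)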
Author: Erin LeDell
Backend: Java
H2O Ensemble has been implemented as a stand-alone R package called h2oEnsemble. The package is an extension to the h2o R package that allows the user to train an ensemble in the H2O cluster using any of H2O's supervised machine learning algorithms.
Features:
In [1]:
# Install from GitHub
#install.packages("devtools")
#install.packages("h2o")
#library(devtools)
#install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")
library(h2oEnsemble)
This is an example of binary classification using the h2o.ensemble() function, the main stacking function in h2oEnsemble.
This demo uses a subset of the HIGGS dataset, which has 28 numeric features and a binary response. The machine learning task in this example is to distinguish between a signal process which produces Higgs bosons (Y = 1) and a background process which does not (Y = 0). The dataset contains approximately the same number of positive vs negative examples. In other words, this is a balanced, rather than imbalanced, dataset.
In [2]:
h2o.init(nthreads = -1) # Start an H2O cluster with nthreads = num cores
In [3]:
# Import a sample binary outcome train/test set into R
train <- h2o.importFile("higgs_train_10k.csv")
test <- h2o.importFile("higgs_test_5k.csv")
y <- "response"
x <- setdiff(names(train), y)
family <- "binomial"
For binary classification, the response should be encoded as a factor (known as the enum type in Java). You can specify column types in the h2o.importFile() command, or convert the response column afterwards as follows:
In [5]:
#For binary classification, response should be a factor
train[ , y] <- as.factor(train[ , y])
test[ , y] <- as.factor(test[ , y])
For this example, we will use the default base learner library for h2o.ensemble, which includes the default H2O GLM, Random Forest, GBM and Deep Neural Net (all using default model parameter values). We will also use the default metalearner, the H2O GLM.
In [6]:
# Specify the base learner library & the metalearner
learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
"h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "h2o.deeplearning.wrapper"
Train the ensemble (using 5-fold internal CV) to generate the level-one data. Note that more CV folds will take longer to train, but should increase performance.
In [7]:
# Train the ensemble using 5-fold CV to generate level-one data
# More CV folds will take longer to train, but should increase performance
system.time(fit <- h2o.ensemble(x = x,
                                y = y,
                                training_frame = train,
                                family = family,
                                learner = learner,
                                metalearner = metalearner,
                                cvControl = list(V = 5, shuffle = TRUE)))
The predict() method for an h2o.ensemble() fit returns a list of two objects: pred$pred contains the ensemble predictions, and pred$basepred is a matrix of predictions from each of the base learners.
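For example, to generate predictions on the test set:
# Generate ensemble and base learner predictions on the test set
pred <- predict(fit, newdata = test)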
However, a more useful function is h2o.ensemble_performance(), which computes performance metrics for the base learners and the ensemble on a test set.
In [12]:
# Compute test set performance:
perf <- h2o.ensemble_performance(fit, newdata = test)
perf
In [13]:
# To print the results for a particular metric, like MSE, do the following:
print(perf, metric = "MSE")
In [9]:
# To access results directly in the ensemble performance object:
perf$ensemble@metrics$AUC
In [14]:
# Here is an example of how to generate a base learner library using custom base learners:
h2o.randomForest.1 <- function(..., ntrees = 1000, nbins = 100, seed = 1) {
  h2o.randomForest.wrapper(..., ntrees = ntrees, nbins = nbins, seed = seed)
}
h2o.deeplearning.1 <- function(..., hidden = c(500, 500), activation = "Rectifier", seed = 1) {
  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, seed = seed)
}
h2o.deeplearning.2 <- function(..., hidden = c(200, 200, 200), activation = "Tanh", seed = 1) {
  h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, seed = seed)
}
learner <- c("h2o.randomForest.1", "h2o.deeplearning.1", "h2o.deeplearning.2")
This example uses the technique of generating a random grid of base learners for maximum diversity, along with the h2o.stack() function. The h2o.stack() function produces the exact same output as h2o.ensemble(); however, you can use h2o.stack() to combine (via metalearning) a set of existing H2O models (where cross-validation was used and the predictions were saved).
In [15]:
#library(h2oEnsemble)
#h2o.init(nthreads = -1)
# Import a sample binary outcome train/test set into R
train <- h2o.importFile("higgs_train_10k.csv")
test <- h2o.importFile("higgs_test_5k.csv")
y <- "response"
x <- setdiff(names(train), y)
family <- "binomial"
#For binary classification, response should be a factor
train[ , y] <- as.factor(train[ , y])
test[ , y] <- as.factor(test[ , y])
In [16]:
# Random Grid Search (e.g. 120 second maximum)
# This is set to run fairly quickly, increase max_runtime_secs
# or max_models to cover more of the hyperparameter space.
# Also, you can expand the hyperparameter space of each of the
# algorithms by modifying the hyper param code below.
search_criteria <- list(strategy = "RandomDiscrete",
                        max_runtime_secs = 120)
nfolds <- 5
In [17]:
# GBM Hyperparameters
learn_rate_opt <- c(0.01, 0.03)
max_depth_opt <- c(3, 4, 5, 6, 9)
sample_rate_opt <- c(0.7, 0.8, 0.9, 1.0)
col_sample_rate_opt <- c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
hyper_params <- list(learn_rate = learn_rate_opt,
                     max_depth = max_depth_opt,
                     sample_rate = sample_rate_opt,
                     col_sample_rate = col_sample_rate_opt)
gbm_grid <- h2o.grid("gbm", x = x, y = y,
                     training_frame = train,
                     ntrees = 100,
                     seed = 1,
                     nfolds = nfolds,
                     fold_assignment = "Modulo",
                     keep_cross_validation_predictions = TRUE,
                     hyper_params = hyper_params,
                     search_criteria = search_criteria)
gbm_models <- lapply(gbm_grid@model_ids, function(model_id) h2o.getModel(model_id))
In [18]:
# RF Hyperparameters
mtries_opt <- 8:20
max_depth_opt <- c(5, 10, 15, 20, 25)
sample_rate_opt <- c(0.7, 0.8, 0.9)
col_sample_rate_per_tree_opt <- c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
hyper_params <- list(mtries = mtries_opt,
                     max_depth = max_depth_opt,
                     sample_rate = sample_rate_opt,
                     col_sample_rate_per_tree = col_sample_rate_per_tree_opt)
rf_grid <- h2o.grid("randomForest", x = x, y = y,
                    training_frame = train,
                    ntrees = 200,
                    seed = 1,
                    nfolds = nfolds,
                    fold_assignment = "Modulo",
                    keep_cross_validation_predictions = TRUE,
                    hyper_params = hyper_params,
                    search_criteria = search_criteria)
rf_models <- lapply(rf_grid@model_ids, function(model_id) h2o.getModel(model_id))
In [19]:
# Deep Learning Hyperparameters
activation_opt <- c("Rectifier", "RectifierWithDropout",
                    "Maxout", "MaxoutWithDropout")
hidden_opt <- list(c(10, 10), c(20, 15), c(50, 50, 50))
l1_opt <- c(0, 1e-3, 1e-5)
l2_opt <- c(0, 1e-3, 1e-5)
hyper_params <- list(activation = activation_opt,
                     hidden = hidden_opt,
                     l1 = l1_opt,
                     l2 = l2_opt)
dl_grid <- h2o.grid("deeplearning", x = x, y = y,
                    training_frame = train,
                    epochs = 15,
                    seed = 1,
                    nfolds = nfolds,
                    fold_assignment = "Modulo",
                    keep_cross_validation_predictions = TRUE,
                    hyper_params = hyper_params,
                    search_criteria = search_criteria)
dl_models <- lapply(dl_grid@model_ids, function(model_id) h2o.getModel(model_id))
In [20]:
# GLM Hyperparameters
alpha_opt <- seq(0, 1, 0.1)
lambda_opt <- c(0, 1e-7, 1e-5, 1e-3, 1e-1)
hyper_params <- list(alpha = alpha_opt,
                     lambda = lambda_opt)
glm_grid <- h2o.grid("glm", x = x, y = y,
                     training_frame = train,
                     family = "binomial",
                     nfolds = nfolds,
                     fold_assignment = "Modulo",
                     keep_cross_validation_predictions = TRUE,
                     hyper_params = hyper_params,
                     search_criteria = search_criteria)
glm_models <- lapply(glm_grid@model_ids, function(model_id) h2o.getModel(model_id))
In [21]:
# Create a list of all the base models
models <- c(gbm_models, rf_models, dl_models, glm_models)
# Specify a default GLM as the metalearner
metalearner <- "h2o.glm.wrapper"
In [22]:
# Let's stack!
stack <- h2o.stack(models = models,
                   response_frame = train[ , y],
                   metalearner = metalearner)
In [23]:
# Compute test set performance:
perf <- h2o.ensemble_performance(stack, newdata = test)
perf
In [24]:
# Now let's re-fit the metalearner using a DL model (a non-negative GLM follows in the next cell)
# Note: DL is not usually the best metalearner
stack2 <- h2o.metalearn(stack, metalearner = "h2o.deeplearning.wrapper")
perf2 <- h2o.ensemble_performance(stack2, newdata = test, score_base_models = FALSE)
perf2
In [26]:
# It's always a good idea to try a GLM restricted to non-negative weights as a metalearner
# Many empirical studies show that non-negative weights can lead to better performance
h2o.glm_nn <- function(..., non_negative = TRUE) h2o.glm.wrapper(..., non_negative = non_negative)
stack3 <- h2o.metalearn(stack, metalearner = "h2o.glm_nn")
perf3 <- h2o.ensemble_performance(stack3, newdata = test, score_base_models = FALSE)
perf3
The test set AUC of the top base learner was 0.779, whereas the best ensemble achieved 0.788. That gain is more than enough to jump 100+ spots on a Kaggle leaderboard. :-)