In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models (Wikipedia, 2015). This notebook demonstrates an easy way to carry out ensemble learning with H2O models using h2oEnsemble.
We give our users the ability to build, compare and stack different H2O, MXNet, TensorFlow and Caffe models quickly and easily using the H2O platform.
In [35]:
# Load R Packages
suppressPackageStartupMessages(library(h2o))
suppressPackageStartupMessages(library(mlbench)) # for Boston Housing Data
In [36]:
# Install h2oEnsemble from GitHub if needed
# Reference: https://github.com/h2oai/h2o-3/tree/master/h2o-r/ensemble
if (!require(h2oEnsemble)) {
install.packages("https://h2o-release.s3.amazonaws.com/h2o-ensemble/R/h2oEnsemble_0.1.8.tar.gz", repos = NULL)
}
suppressPackageStartupMessages(library(h2oEnsemble)) # for model stacking
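Before converting data to H2O frames later on, an H2O cluster must be running. If one is not already started, a minimal sketch (the nthreads setting here is an assumption, not part of the original notebook):
# Start or connect to a local H2O cluster; nthreads = -1 uses all available cores
h2o.init(nthreads = -1)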
The dataset used in this demo is Boston Housing from mlbench. It contains housing values in suburbs of Boston.
In [38]:
# Import data
data(BostonHousing)
head(BostonHousing)
dim(BostonHousing)
In [39]:
# Split data
set.seed(1234)
row_train <- sample(1:nrow(BostonHousing), 400)
train <- BostonHousing[row_train,]
test <- BostonHousing[-row_train,]
In [40]:
# Training data - quick summary
dim(train)
head(train)
summary(train)
In [41]:
# Test data - quick summary
dim(test)
head(test)
summary(test)
We are now ready to train regression models using different algorithms in H2O.
- H2O Gradient Boosting Machines (CPU)
- H2O Distributed Random Forest (CPU)
- H2O Deep Water (GPU; MXNet, TensorFlow or Caffe backend; sketched below)
Note 1: Although the three algorithms used in this example are different, the core parameters are consistent (see below). This allows H2O users to get quick and easy access to different existing (and future) algorithms with a very shallow learning curve. The core parameters are:
- x = features
- y = target
- training_frame = h_train
Note 2: For model stacking, we need to generate holdout predictions from cross-validation. The parameters required for model stacking are:
- nfolds = 5
- fold_assignment = 'Modulo'
- keep_cross_validation_predictions = TRUE
In [42]:
# Convert R data frames into H2O data frames
h_train <- as.h2o(train)
h_test <- as.h2o(test)
In [43]:
# Regression - define features (x) and target (y)
target <- "medv"
features <- setdiff(colnames(train), target)
print(features)
In [45]:
# Train an H2O GBM model
model_gbm <- h2o.gbm(x = features, y = target,
training_frame = h_train,
model_id = "h2o_gbm",
learn_rate = 0.1,
learn_rate_annealing = 0.99,
sample_rate = 0.8,
col_sample_rate = 0.8,
nfolds = 5,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE,
ntrees = 100)
In [46]:
# Train an H2O DRF model
model_drf <- h2o.randomForest(x = features, y = target,
training_frame = h_train,
model_id = "h2o_drf",
nfolds = 5,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE,
ntrees = 100)
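The stacking cell below also includes model_dw, an H2O Deep Water model, whose training cell is not shown above. The following is a minimal sketch of how it might be trained, assuming a Deep Water enabled build of H2O (the h2o.deepwater() call and the backend/epochs values are illustrative assumptions, not the original notebook's settings):
# Train an H2O Deep Water model (sketch; requires a Deep Water build of H2O with a GPU backend)
# Hyper-parameters below are illustrative assumptions
model_dw <- h2o.deepwater(x = features, y = target,
                          training_frame = h_train,
                          model_id = "h2o_dw",
                          backend = "mxnet",
                          epochs = 20,
                          nfolds = 5,
                          fold_assignment = "Modulo",
                          keep_cross_validation_predictions = TRUE)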
In [47]:
# Create a list to include all the models for stacking
models <- list(model_dw, model_gbm, model_drf)
In [48]:
# Define a metalearner (one of the H2O supervised machine learning algorithms)
metalearner <- "h2o.glm.wrapper"
In [49]:
# Use h2o.stack() to carry out metalearning
stack <- h2o.stack(models = models,
response_frame = h_train$medv,
metalearner = metalearner)
In [50]:
# Finally, we evaluate the predictive performance of the ensemble as well as the individual base models.
h2o.ensemble_performance(stack, newdata = h_test)
In [51]:
# Use the ensemble to make predictions
yhat_test <- predict(stack, h_test)
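As a quick sanity check, you could pull the ensemble predictions back into R and compute the test RMSE. A minimal sketch, assuming predict() on an h2oEnsemble stack returns a list whose pred element is an H2OFrame of ensemble predictions:
# Compare ensemble predictions against the true target values (sketch)
pred_ens <- as.data.frame(yhat_test$pred)[, 1]  # ensemble predictions as an R vector
actual   <- test$medv                           # true medv values from the R test set
sqrt(mean((pred_ens - actual)^2))               # test RMSE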