Easy Ensemble Learning with h2oEnsemble

Introduction

In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models (Wikipedia, 2015). This notebook demonstrates an easy way to carry out ensemble learning with H2O Deep Water models and other H2O models using h2oEnsemble.

Key Benefit

We give our users the ability to build, compare and stack different H2O, MXNet, TensorFlow and Caffe models quickly and easily using the H2O platform.

Setup

We need three R packages for this demo: h2o, h2oEnsemble and mlbench.


In [35]:
# Load R Packages
suppressPackageStartupMessages(library(h2o))
suppressPackageStartupMessages(library(mlbench))     # for Boston Housing Data

In [36]:
# Install h2oEnsemble from GitHub if needed
# Reference: https://github.com/h2oai/h2o-3/tree/master/h2o-r/ensemble
if (!require(h2oEnsemble)) {
    install.packages("https://h2o-release.s3.amazonaws.com/h2o-ensemble/R/h2oEnsemble_0.1.8.tar.gz", repos = NULL)
}
suppressPackageStartupMessages(library(h2oEnsemble)) # for model stacking

In [37]:
# Start and connect to H2O Cluster with Deep Water
h2o.init(nthreads = -1)
h2o.no_progress()


 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         12 minutes 45 seconds 
    H2O cluster version:        3.11.0.99999 
    H2O cluster version age:    1 month and 14 days  
    H2O cluster name:           ubuntu 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   8.64 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.2.3 (2015-12-10) 

Boston Housing Data

The dataset used in this demo is Boston Housing from mlbench; it contains housing values in suburbs of Boston.

  • Reference: UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Housing)
  • Source: This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University.
  • Creator: Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
  • Type: Regression
  • Dimensions: 506 instances, 13 features and 1 numeric target (note: in mlbench's version, CHAS is loaded as a factor).
  • 13 Features:
    • CRIM: per capita crime rate by town
    • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
    • INDUS: proportion of non-retail business acres per town
    • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    • NOX: nitric oxides concentration (parts per 10 million)
    • RM: average number of rooms per dwelling
    • AGE: proportion of owner-occupied units built prior to 1940
    • DIS: weighted distances to five Boston employment centres
    • RAD: index of accessibility to radial highways
    • TAX: full-value property-tax rate per $10,000
    • PTRATIO: pupil-teacher ratio by town
    • B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    • LSTAT: % lower status of the population
  • Target:
    • MEDV: Median value of owner-occupied homes in $1000's (this is the value we want to predict)

In [38]:
# Import data
data(BostonHousing)
head(BostonHousing)
dim(BostonHousing)


     crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7
  1. 506
  2. 14

Splitting Data into Training/Test Set

We want to evaluate the predictive performance on a holdout dataset. The following code splits the Boston Housing data randomly into:

  • Training: 400 instances
  • Test: 106 instances

In [39]:
# Split data
set.seed(1234)
row_train <- sample(1:nrow(BostonHousing), 400)
train <- BostonHousing[row_train,]
test <- BostonHousing[-row_train,]

In [40]:
# Training data - quick summary
dim(train)
head(train)
summary(train)


  1. 400
  2. 14
       crim  zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
58  0.01432 100  1.32    0 0.411 6.816 40.5 8.3248   5 256    15.1 392.90  3.95 31.6
315 0.36920   0  9.90    0 0.544 6.567 87.3 3.6023   4 304    18.4 395.69  9.28 23.8
308 0.04932  33  2.18    0 0.472 6.849 70.3 3.1827   7 222    18.4 396.90  7.53 28.2
314 0.26938   0  9.90    0 0.544 6.266 82.8 3.2628   4 304    18.4 393.39  7.90 21.6
433 6.44405   0 18.10    0 0.584 6.425 74.8 2.2004  24 666    20.2  97.95 12.03 16.1
321 0.16760   0  7.38    0 0.493 6.426 52.3 4.5404   5 287    19.6 396.90  7.20 23.8
      crim                zn             indus       chas         nox        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:370   Min.   :0.3850  
 1st Qu.: 0.07782   1st Qu.:  0.00   1st Qu.: 5.13   1: 30   1st Qu.:0.4520  
 Median : 0.24751   Median :  0.00   Median : 8.56           Median :0.5380  
 Mean   : 3.33351   Mean   : 12.01   Mean   :10.98           Mean   :0.5549  
 3rd Qu.: 3.48946   3rd Qu.: 18.50   3rd Qu.:18.10           3rd Qu.:0.6258  
 Max.   :73.53410   Max.   :100.00   Max.   :27.74           Max.   :0.8710  
       rm             age              dis              rad       
 Min.   :3.561   Min.   :  6.20   Min.   : 1.130   Min.   : 1.00  
 1st Qu.:5.883   1st Qu.: 47.08   1st Qu.: 2.103   1st Qu.: 4.00  
 Median :6.205   Median : 77.75   Median : 3.239   Median : 5.00  
 Mean   :6.273   Mean   : 69.25   Mean   : 3.824   Mean   : 9.44  
 3rd Qu.:6.626   3rd Qu.: 94.03   3rd Qu.: 5.234   3rd Qu.:24.00  
 Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
      tax           ptratio            b              lstat      
 Min.   :187.0   Min.   :12.60   Min.   :  2.52   Min.   : 1.73  
 1st Qu.:279.0   1st Qu.:17.40   1st Qu.:376.46   1st Qu.: 7.17  
 Median :330.0   Median :19.10   Median :391.99   Median :11.25  
 Mean   :404.8   Mean   :18.52   Mean   :359.94   Mean   :12.61  
 3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.54   3rd Qu.:16.43  
 Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97  
      medv      
 Min.   : 5.00  
 1st Qu.:17.27  
 Median :21.15  
 Mean   :22.51  
 3rd Qu.:24.85  
 Max.   :50.00  

In [41]:
# Test data - quick summary
dim(test)
head(test)
summary(test)


  1. 106
  2. 14
      crim   zn indus chas   nox    rm   age    dis rad tax ptratio      b lstat medv
2  0.02731  0.0  7.07    0 0.469 6.421  78.9 4.9671   2 242    17.8 396.90  9.14 21.6
10 0.17004 12.5  7.87    0 0.524 6.004  85.9 6.5921   5 311    15.2 386.71 17.10 18.9
13 0.09378 12.5  7.87    0 0.524 5.889  39.0 5.4509   5 311    15.2 390.50 15.71 21.7
18 0.78420  0.0  8.14    0 0.538 5.990  81.7 4.2579   4 307    21.0 386.75 14.67 17.5
24 0.98843  0.0  8.14    0 0.538 5.813 100.0 4.0952   4 307    21.0 394.54 19.88 14.5
28 0.95577  0.0  8.14    0 0.538 6.047  88.8 4.4534   4 307    21.0 306.38 17.28 14.8
      crim                zn             indus        chas         nox        
 Min.   : 0.00906   Min.   : 0.000   Min.   : 0.740   0:101   Min.   :0.4000  
 1st Qu.: 0.09535   1st Qu.: 0.000   1st Qu.: 5.945   1:  5   1st Qu.:0.4480  
 Median : 0.30770   Median : 0.000   Median :10.300           Median :0.5350  
 Mean   : 4.67018   Mean   : 8.929   Mean   :11.720           Mean   :0.5540  
 3rd Qu.: 4.86247   3rd Qu.: 0.000   3rd Qu.:18.100           3rd Qu.:0.6128  
 Max.   :88.97620   Max.   :95.000   Max.   :27.740           Max.   :0.8710  
       rm             age              dis             rad        
 Min.   :4.926   Min.   :  2.90   Min.   :1.202   Min.   : 1.000  
 1st Qu.:5.910   1st Qu.: 37.98   1st Qu.:2.084   1st Qu.: 4.000  
 Median :6.231   Median : 76.35   Median :3.117   Median : 5.000  
 Mean   :6.330   Mean   : 66.01   Mean   :3.686   Mean   : 9.962  
 3rd Qu.:6.562   3rd Qu.: 94.35   3rd Qu.:4.906   3rd Qu.:24.000  
 Max.   :8.398   Max.   :100.00   Max.   :9.188   Max.   :24.000  
      tax           ptratio            b              lstat       
 Min.   :193.0   Min.   :13.00   Min.   :  0.32   Min.   : 2.960  
 1st Qu.:287.5   1st Qu.:16.60   1st Qu.:368.61   1st Qu.: 6.758  
 Median :367.5   Median :18.40   Median :389.75   Median :11.690  
 Mean   :421.3   Mean   :18.23   Mean   :344.37   Mean   :12.806  
 3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:395.49   3rd Qu.:17.407  
 Max.   :711.0   Max.   :21.20   Max.   :396.90   Max.   :30.810  
      medv      
 Min.   : 5.00  
 1st Qu.:15.72  
 Median :21.45  
 Mean   :22.61  
 3rd Qu.:26.57  
 Max.   :50.00  

Training Different Regression Models

We are now ready to train regression models using different algorithms in H2O.

  • First of all, we convert R data frames into H2O data frames.
  • Then, we define the names of features and target.
  • Finally, we train three different models:
      - H2O Deep Water (using MXNet as the GPU backend)
      - H2O Gradient Boosting Machines (CPU)
      - H2O Distributed Random Forest (CPU)

Note 1: Although the three algorithms used in this example are different, the core parameters are consistent (see below). This allows H2O users to get quick and easy access to different existing (and future) algorithms with a very shallow learning curve. The core parameters are:

- x = features
- y = target
- training_frame = h_train

Note 2: For model stacking, we need to generate holdout predictions from cross-validation. The parameters required for model stacking are:

- nfolds = 5
- fold_assignment = 'Modulo'
- keep_cross_validation_predictions = TRUE

In [42]:
# Convert R data frames into H2O data frames
h_train <- as.h2o(train)
h_test <- as.h2o(test)

In [43]:
# Regression - define features (x) and target (y)
target <- "medv"
features <- setdiff(colnames(train), target)
print(features)


 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
 [8] "dis"     "rad"     "tax"     "ptratio" "b"       "lstat"  

H2O Deep Water Model

For more information, enter ?h2o.deepwater in R to look at the full list of parameters.


In [44]:
# Train a Deep Water model using MXNet as GPU backend
model_dw <- h2o.deepwater(x = features, y = target,
                          training_frame = h_train,
                          model_id = "h2o_deepwater",
                          learning_rate = 1e-3, 
                          mini_batch_size = 64,
                          hidden = c(50, 50),
                          activation = "Rectifier",
                          score_duty_cycle = 1,
                          score_training_samples = 0,
                          epochs = 200,
                          nfolds = 5,
                          fold_assignment = "Modulo",
                          keep_cross_validation_predictions = TRUE,
                          backend = "mxnet",
                          network = "auto")

H2O GBM Model

For more information, enter ?h2o.gbm in R to look at the full list of parameters.


In [45]:
# Train an H2O GBM model
model_gbm <- h2o.gbm(x = features, y = target,
                     training_frame = h_train,
                     model_id = "h2o_gbm",
                     learn_rate = 0.1,
                     learn_rate_annealing = 0.99,
                     sample_rate = 0.8,
                     col_sample_rate = 0.8,
                     nfolds = 5,
                     fold_assignment = "Modulo",
                     keep_cross_validation_predictions = TRUE,
                     ntrees = 100)

H2O DRF Model

For more information, enter ?h2o.randomForest in R to look at the full list of parameters.


In [46]:
# Train an H2O DRF model
model_drf <- h2o.randomForest(x = features, y = target,
                              training_frame = h_train,
                              model_id = "h2o_drf",
                              nfolds = 5,
                              fold_assignment = "Modulo",
                              keep_cross_validation_predictions = TRUE,
                              ntrees = 100)
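
Before stacking, it is worth comparing the cross-validated performance of the three base models. The following is a minimal optional sketch (not part of the original workflow) using h2o.mse() with xval = TRUE, which reads the MSE computed on the 5-fold cross-validation holdout predictions generated above:

# Optional check: cross-validated MSE for each base model
sapply(list(model_dw, model_gbm, model_drf), h2o.mse, xval = TRUE)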

Model Stacking

Now that we have three different models, we are ready to carry out model stacking.


In [47]:
# Create a list to include all the models for stacking
models <- list(model_dw, model_gbm, model_drf)

In [48]:
# Define a metalearner (one of the H2O supervised machine learning algorithms)
metalearner <- "h2o.glm.wrapper"
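
The GLM wrapper is just one option. h2oEnsemble also ships wrapper functions for its other supervised algorithms, any of which can be used as the metalearner, for example:

# Alternative metalearners shipped with h2oEnsemble (examples):
# metalearner <- "h2o.randomForest.wrapper"
# metalearner <- "h2o.gbm.wrapper"
# metalearner <- "h2o.deeplearning.wrapper"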

In [49]:
# Use h2o.stack() to carry out metalearning
stack <- h2o.stack(models = models, 
                   response_frame = h_train$medv,
                   metalearner = metalearner)


[1] "Metalearning"

In [50]:
# Finally, we evaluate the predictive performance of the ensemble as well as the individual models.
h2o.ensemble_performance(stack, newdata = h_test)


Base learner performance, sorted by specified metric:
        learner      MSE
1 h2o_deepwater 8.377644
2       h2o_gbm 8.106541
3       h2o_drf 7.443517


H2O Ensemble Performance on <newdata>:
----------------
Family: gaussian

Ensemble performance (MSE): 5.80436983051916

In [51]:
# Use the ensemble to make predictions
yhat_test <- predict(stack, h_test)
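
As a sanity check, we can recompute the test-set MSE from these predictions and confirm it matches the ensemble MSE reported above. This is a minimal sketch, assuming predict() returns a list whose $pred element holds the ensemble predictions as an H2OFrame (as documented in the h2oEnsemble README):

# Recompute the ensemble's test-set MSE by hand
pred   <- as.data.frame(yhat_test$pred)[, 1]  # ensemble predictions
actual <- as.data.frame(h_test$medv)[, 1]     # observed target values
mean((actual - pred)^2)                       # should be close to 5.80 above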