Easy Ensemble Learning with h2oEnsemble

Introduction

In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models (Wikipedia, 2015). This notebook demonstrates an easy way to carry out ensemble learning with H2O Deep Water models and other H2O models using h2oEnsemble.

Key Benefit

We give our users the ability to build, compare and stack different H2O, MXNet, TensorFlow and Caffe models quickly and easily using the H2O platform.

Setup

We need three R packages for this demo: h2o, h2oEnsemble and mlbench.


In [35]:
# Load R Packages
suppressPackageStartupMessages(library(h2o))
suppressPackageStartupMessages(library(mlbench))     # for Boston Housing Data

In [36]:
# Install h2oEnsemble from GitHub if needed
# Reference: https://github.com/h2oai/h2o-3/tree/master/h2o-r/ensemble
if (!require(h2oEnsemble)) {
    install.packages("https://h2o-release.s3.amazonaws.com/h2o-ensemble/R/h2oEnsemble_0.1.8.tar.gz", repos = NULL)
}
suppressPackageStartupMessages(library(h2oEnsemble)) # for model stacking

In [37]:
# Start and connect to H2O Cluster with Deep Water
h2o.init(nthreads = -1)
h2o.no_progress()


 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         12 minutes 45 seconds 
    H2O cluster version:        3.11.0.99999 
    H2O cluster version age:    1 month and 14 days  
    H2O cluster name:           ubuntu 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   8.64 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.2.3 (2015-12-10) 

Boston Housing Data

The dataset used in this demo is Boston Housing from mlbench; it contains housing values in suburbs of Boston.

  • Reference: UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Housing)
  • Source: This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University.
  • Creator: Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
  • Type: Regression
  • Dimensions: 506 instances, 13 features and 1 numeric target (note: in mlbench's version, CHAS is loaded as a factor).
  • 13 Features:
    • CRIM: per capita crime rate by town
    • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
    • INDUS: proportion of non-retail business acres per town
    • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    • NOX: nitric oxides concentration (parts per 10 million)
    • RM: average number of rooms per dwelling
    • AGE: proportion of owner-occupied units built prior to 1940
    • DIS: weighted distances to five Boston employment centres
    • RAD: index of accessibility to radial highways
    • TAX: full-value property-tax rate per $10,000
    • PTRATIO: pupil-teacher ratio by town
    • B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    • LSTAT: % lower status of the population
  • Target:
    • MEDV: Median value of owner-occupied homes in $1000's (this is the value we want to predict)

In [38]:
# Import data
data(BostonHousing)
head(BostonHousing)
dim(BostonHousing)


     crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7
  1. 506
  2. 14

Splitting Data into Training/Test Set

We want to evaluate the predictive performance on a holdout dataset. The following code splits the Boston Housing data randomly into:

  • Training: 400 instances
  • Test: 106 instances

In [39]:
# Split data
set.seed(1234)
row_train <- sample(1:nrow(BostonHousing), 400)
train <- BostonHousing[row_train,]
test <- BostonHousing[-row_train,]

In [40]:
# Training data - quick summary
dim(train)
head(train)
summary(train)


  1. 400
  2. 14
       crim  zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
58  0.01432 100  1.32    0 0.411 6.816 40.5 8.3248   5 256    15.1 392.90  3.95 31.6
315 0.36920   0  9.90    0 0.544 6.567 87.3 3.6023   4 304    18.4 395.69  9.28 23.8
308 0.04932  33  2.18    0 0.472 6.849 70.3 3.1827   7 222    18.4 396.90  7.53 28.2
314 0.26938   0  9.90    0 0.544 6.266 82.8 3.2628   4 304    18.4 393.39  7.90 21.6
433 6.44405   0 18.10    0 0.584 6.425 74.8 2.2004  24 666    20.2  97.95 12.03 16.1
321 0.16760   0  7.38    0 0.493 6.426 52.3 4.5404   5 287    19.6 396.90  7.20 23.8
      crim                zn             indus       chas         nox        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:370   Min.   :0.3850  
 1st Qu.: 0.07782   1st Qu.:  0.00   1st Qu.: 5.13   1: 30   1st Qu.:0.4520  
 Median : 0.24751   Median :  0.00   Median : 8.56           Median :0.5380  
 Mean   : 3.33351   Mean   : 12.01   Mean   :10.98           Mean   :0.5549  
 3rd Qu.: 3.48946   3rd Qu.: 18.50   3rd Qu.:18.10           3rd Qu.:0.6258  
 Max.   :73.53410   Max.   :100.00   Max.   :27.74           Max.   :0.8710  
       rm             age              dis              rad       
 Min.   :3.561   Min.   :  6.20   Min.   : 1.130   Min.   : 1.00  
 1st Qu.:5.883   1st Qu.: 47.08   1st Qu.: 2.103   1st Qu.: 4.00  
 Median :6.205   Median : 77.75   Median : 3.239   Median : 5.00  
 Mean   :6.273   Mean   : 69.25   Mean   : 3.824   Mean   : 9.44  
 3rd Qu.:6.626   3rd Qu.: 94.03   3rd Qu.: 5.234   3rd Qu.:24.00  
 Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
      tax           ptratio            b              lstat      
 Min.   :187.0   Min.   :12.60   Min.   :  2.52   Min.   : 1.73  
 1st Qu.:279.0   1st Qu.:17.40   1st Qu.:376.46   1st Qu.: 7.17  
 Median :330.0   Median :19.10   Median :391.99   Median :11.25  
 Mean   :404.8   Mean   :18.52   Mean   :359.94   Mean   :12.61  
 3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.54   3rd Qu.:16.43  
 Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97  
      medv      
 Min.   : 5.00  
 1st Qu.:17.27  
 Median :21.15  
 Mean   :22.51  
 3rd Qu.:24.85  
 Max.   :50.00  

In [41]:
# Test data - quick summary
dim(test)
head(test)
summary(test)


  1. 106
  2. 14
      crim   zn indus chas   nox    rm   age    dis rad tax ptratio      b lstat medv
2  0.02731  0.0  7.07    0 0.469 6.421  78.9 4.9671   2 242    17.8 396.90  9.14 21.6
10 0.17004 12.5  7.87    0 0.524 6.004  85.9 6.5921   5 311    15.2 386.71 17.10 18.9
13 0.09378 12.5  7.87    0 0.524 5.889  39.0 5.4509   5 311    15.2 390.50 15.71 21.7
18 0.78420  0.0  8.14    0 0.538 5.990  81.7 4.2579   4 307    21.0 386.75 14.67 17.5
24 0.98843  0.0  8.14    0 0.538 5.813 100.0 4.0952   4 307    21.0 394.54 19.88 14.5
28 0.95577  0.0  8.14    0 0.538 6.047  88.8 4.4534   4 307    21.0 306.38 17.28 14.8
      crim                zn             indus        chas         nox        
 Min.   : 0.00906   Min.   : 0.000   Min.   : 0.740   0:101   Min.   :0.4000  
 1st Qu.: 0.09535   1st Qu.: 0.000   1st Qu.: 5.945   1:  5   1st Qu.:0.4480  
 Median : 0.30770   Median : 0.000   Median :10.300           Median :0.5350  
 Mean   : 4.67018   Mean   : 8.929   Mean   :11.720           Mean   :0.5540  
 3rd Qu.: 4.86247   3rd Qu.: 0.000   3rd Qu.:18.100           3rd Qu.:0.6128  
 Max.   :88.97620   Max.   :95.000   Max.   :27.740           Max.   :0.8710  
       rm             age              dis             rad        
 Min.   :4.926   Min.   :  2.90   Min.   :1.202   Min.   : 1.000  
 1st Qu.:5.910   1st Qu.: 37.98   1st Qu.:2.084   1st Qu.: 4.000  
 Median :6.231   Median : 76.35   Median :3.117   Median : 5.000  
 Mean   :6.330   Mean   : 66.01   Mean   :3.686   Mean   : 9.962  
 3rd Qu.:6.562   3rd Qu.: 94.35   3rd Qu.:4.906   3rd Qu.:24.000  
 Max.   :8.398   Max.   :100.00   Max.   :9.188   Max.   :24.000  
      tax           ptratio            b              lstat       
 Min.   :193.0   Min.   :13.00   Min.   :  0.32   Min.   : 2.960  
 1st Qu.:287.5   1st Qu.:16.60   1st Qu.:368.61   1st Qu.: 6.758  
 Median :367.5   Median :18.40   Median :389.75   Median :11.690  
 Mean   :421.3   Mean   :18.23   Mean   :344.37   Mean   :12.806  
 3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:395.49   3rd Qu.:17.407  
 Max.   :711.0   Max.   :21.20   Max.   :396.90   Max.   :30.810  
      medv      
 Min.   : 5.00  
 1st Qu.:15.72  
 Median :21.45  
 Mean   :22.61  
 3rd Qu.:26.57  
 Max.   :50.00  

Training Different Regression Models

We are now ready to train regression models using different algorithms in H2O.

  • First of all, we convert R data frames into H2O data frames.
  • Then, we define the names of features and target.
  • Finally, we train three different models:
      - H2O Deep Water (using MXNet as the GPU backend)
      - H2O Gradient Boosting Machines (CPU)
      - H2O Distributed Random Forest (CPU)

Note 1: Although the three algorithms used in this example are different, the core parameters are consistent (see below). This allows H2O users to get quick and easy access to different existing (and future) algorithms with a very shallow learning curve. The core parameters are:

- x = features
- y = target
- training_frame = h_train

Note 2: For model stacking, we need to generate holdout predictions from cross-validation. The parameters required for model stacking are:

- nfolds = 5
- fold_assignment = 'Modulo'
- keep_cross_validation_predictions = TRUE

In [42]:
# Convert R data frames into H2O data frames
h_train <- as.h2o(train)
h_test <- as.h2o(test)

In [43]:
# Regression - define features (x) and target (y)
target <- "medv"
features <- setdiff(colnames(train), target)
print(features)


 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
 [8] "dis"     "rad"     "tax"     "ptratio" "b"       "lstat"  

H2O Deep Water Model

For more information, enter ?h2o.deepwater in R to look at the full list of parameters.


In [44]:
# Train a Deep Water model using MXNet as GPU backend
model_dw <- h2o.deepwater(x = features, y = target,
                          training_frame = h_train,
                          model_id = "h2o_deepwater",
                          learning_rate = 1e-3, 
                          mini_batch_size = 64,
                          hidden = c(50, 50),
                          activation = "Rectifier",
                          score_duty_cycle = 1,
                          score_training_samples = 0,
                          epochs = 200,
                          nfolds = 5,
                          fold_assignment = "Modulo",
                          keep_cross_validation_predictions = TRUE,
                          backend = "mxnet",
                          network = "auto")

H2O GBM Model

For more information, enter ?h2o.gbm in R to look at the full list of parameters.


In [45]:
# Train an H2O GBM model
model_gbm <- h2o.gbm(x = features, y = target,
                     training_frame = h_train,
                     model_id = "h2o_gbm",
                     learn_rate = 0.1,
                     learn_rate_annealing = 0.99,
                     sample_rate = 0.8,
                     col_sample_rate = 0.8,
                     nfolds = 5,
                     fold_assignment = "Modulo",
                     keep_cross_validation_predictions = TRUE,
                     ntrees = 100)

H2O DRF Model

For more information, enter ?h2o.randomForest in R to look at the full list of parameters.


In [46]:
# Train an H2O DRF model
model_drf <- h2o.randomForest(x = features, y = target,
                              training_frame = h_train,
                              model_id = "h2o_drf",
                              nfolds = 5,
                              fold_assignment = "Modulo",
                              keep_cross_validation_predictions = TRUE,
                              ntrees = 100)
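
Before stacking, it is worth comparing the cross-validated performance of the three base models. The following is a minimal optional sketch (not part of the original workflow) using h2o.mse() with xval = TRUE, which reads the MSE computed on the 5-fold cross-validation holdout predictions generated above:

# Optional check: cross-validated MSE for each base model
sapply(list(model_dw, model_gbm, model_drf), h2o.mse, xval = TRUE)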

Model Stacking

Now that we have three different models, we are ready to carry out model stacking.


In [47]:
# Create a list to include all the models for stacking
models <- list(model_dw, model_gbm, model_drf)

In [48]:
# Define a metalearner (one of the H2O supervised machine learning algorithms)
metalearner <- "h2o.glm.wrapper"
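
The GLM wrapper is just one option. h2oEnsemble also ships wrapper functions for its other supervised algorithms, any of which can be used as the metalearner, for example:

# Alternative metalearners shipped with h2oEnsemble (examples):
# metalearner <- "h2o.randomForest.wrapper"
# metalearner <- "h2o.gbm.wrapper"
# metalearner <- "h2o.deeplearning.wrapper"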

In [49]:
# Use h2o.stack() to carry out metalearning
stack <- h2o.stack(models = models, 
                   response_frame = h_train$medv,
                   metalearner = metalearner)


[1] "Metalearning"

In [50]:
# Finally, we evaluate the predictive performance of the ensemble as well as the individual models.
h2o.ensemble_performance(stack, newdata = h_test)


Base learner performance, sorted by specified metric:
        learner      MSE
1 h2o_deepwater 8.377644
2       h2o_gbm 8.106541
3       h2o_drf 7.443517


H2O Ensemble Performance on <newdata>:
----------------
Family: gaussian

Ensemble performance (MSE): 5.80436983051916

In [51]:
# Use the ensemble to make predictions
yhat_test <- predict(stack, h_test)
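
As a sanity check, we can recompute the test-set MSE from these predictions and confirm it matches the ensemble MSE reported above. This is a minimal sketch, assuming predict() returns a list whose $pred element holds the ensemble predictions as an H2OFrame (as documented in the h2oEnsemble README):

# Recompute the ensemble's test-set MSE by hand
pred   <- as.data.frame(yhat_test$pred)[, 1]  # ensemble predictions
actual <- as.data.frame(h_test$medv)[, 1]     # observed target values
mean((actual - pred)^2)                       # should be close to 5.80 above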