Machine Learning with H2O - Tutorial 3c: Regression Models (Ensembles)


Objective:

  • This tutorial explains how to create stacked ensembles of regression models for better out-of-sample performance.

Wine Quality Dataset:


Steps:

  1. Build GBM models using random grid search and extract the best one.
  2. Build DRF models using random grid search and extract the best one.
  3. Build DNN models using random grid search and extract the best one.
  4. Use model stacking to combine the different models.

Full Technical Reference:



In [1]:
# Start and connect to a local H2O cluster
suppressPackageStartupMessages(library(h2o))
h2o.init(nthreads = -1)


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/RtmpUi2yAy/h2o_joe_started_from_r.out
    /tmp/RtmpUi2yAy/h2o_joe_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 138 milliseconds 
    H2O cluster version:        3.10.4.4 
    H2O cluster version age:    5 days  
    H2O cluster name:           H2O_started_from_R_joe_upv818 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   5.21 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 3.3.2 (2016-10-31) 



In [2]:
# Import wine quality data from a local CSV file
wine = h2o.importFile("winequality-white.csv")
head(wine, 5)


  |======================================================================| 100%
  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
  7.0            0.27              0.36         20.7            0.045      45                   170                   1.0010   3.00  0.45       8.8      6
  6.3            0.30              0.34         1.6             0.049      14                   132                   0.9940   3.30  0.49       9.5      6
  8.1            0.28              0.40         6.9             0.050      30                   97                    0.9951   3.26  0.44       10.1     6
  7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.40       9.9      6
  7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.40       9.9      6

In [3]:
# Define features (or predictors)
features = colnames(wine)  # we want to use all the information
features = setdiff(features, 'quality')    # we need to exclude the target 'quality'
features


  1. 'fixed acidity'
  2. 'volatile acidity'
  3. 'citric acid'
  4. 'residual sugar'
  5. 'chlorides'
  6. 'free sulfur dioxide'
  7. 'total sulfur dioxide'
  8. 'density'
  9. 'pH'
  10. 'sulphates'
  11. 'alcohol'

In [4]:
# Split the H2O data frame into training/test sets
# so we can evaluate out-of-sample performance
wine_split = h2o.splitFrame(wine, ratios = 0.8, seed = 1234)

wine_train = wine_split[[1]] # using 80% for training
wine_test = wine_split[[2]]  # using the remaining 20% for out-of-sample evaluation

In [5]:
dim(wine_train)


  1. 3932
  2. 12

In [6]:
dim(wine_test)


  1. 966
  2. 12



In [7]:
# Define the criteria for random grid search:
# randomly sample at most 9 of the hyper-parameter combinations
search_criteria = list(strategy = "RandomDiscrete",
                       max_models = 9,
                       seed = 1234)

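Besides `max_models`, the RandomDiscrete strategy can also be capped by wall-clock time. A sketch of an alternative criteria list, not used in this tutorial; the 600-second budget is illustrative:

```r
# Alternative: stop the random search after a time budget instead of a model count
search_criteria_timed = list(strategy = "RandomDiscrete",
                             max_runtime_secs = 600,  # illustrative 10-minute cap
                             seed = 1234)
```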

Step 1: Build GBM Models using Random Grid Search and Extract the Best Model


In [8]:
# Define the range of hyper-parameters for GBM grid search
# 27 combinations in total
hyper_params <- list(
    sample_rate = c(0.7, 0.8, 0.9),
    col_sample_rate = c(0.7, 0.8, 0.9),
    max_depth = c(3, 5, 7)
)

In [9]:
# Set up GBM grid search
# Add a seed for reproducibility
gbm_rand_grid <- h2o.grid(
  
    # Core parameters for model training
    x = features,
    y = 'quality',
    training_frame = wine_train,
    ntrees = 10000,
    nfolds = 5,
    seed = 1234,

    # Parameters for grid search
    grid_id = "gbm_rand_grid",
    hyper_params = hyper_params,
    algorithm = "gbm",
    search_criteria = search_criteria,

    # Parameters for early stopping
    stopping_metric = "MSE",
    stopping_rounds = 15,
    score_tree_interval = 1,
    
    # Parameters required for stacked ensembles
    fold_assignment = "Modulo",
    keep_cross_validation_predictions = TRUE
  
)


  |======================================================================| 100%

In [10]:
# Sort and show the grid search results
gbm_rand_grid <- h2o.getGrid(grid_id = "gbm_rand_grid", sort_by = "mse", decreasing = FALSE)
print(gbm_rand_grid)


H2O Grid Details
================

Grid ID: gbm_rand_grid 
Used hyper parameters: 
  -  col_sample_rate 
  -  max_depth 
  -  sample_rate 
Number of models: 9 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing mse
  col_sample_rate max_depth sample_rate             model_ids
1             0.9         7         0.9 gbm_rand_grid_model_5
2             0.8         7         0.7 gbm_rand_grid_model_4
3             0.7         7         0.7 gbm_rand_grid_model_1
4             0.9         7         0.7 gbm_rand_grid_model_6
5             0.7         5         0.8 gbm_rand_grid_model_0
6             0.8         3         0.9 gbm_rand_grid_model_7
7             0.7         3         0.7 gbm_rand_grid_model_8
8             0.9         3         0.9 gbm_rand_grid_model_2
9             0.8         3         0.8 gbm_rand_grid_model_3
                  mse
1 0.41467703216892454
2  0.4188744246328386
3 0.42294704197026883
4  0.4285238866231086
5 0.44601214899796604
6 0.46338551281728263
7  0.4681243149102324
8 0.46849996267402233
9  0.4690100493856379

In [11]:
# Extract the best model from random grid search
best_gbm_model_id <- gbm_rand_grid@model_ids[[1]] # top of the list
best_gbm_from_rand_grid <- h2o.getModel(best_gbm_model_id)
summary(best_gbm_from_rand_grid)


Model Details:
==============

H2ORegressionModel: gbm
Model Key:  gbm_rand_grid_model_5 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             168                      168              103543         7
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         7    7.00000         13         82    43.80953

H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.09975218
RMSE:  0.3158357
MAE:  0.2350127
RMSLE:  0.04701275
Mean Residual Deviance :  0.09975218



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.414677
RMSE:  0.6439542
MAE:  0.4747976
RMSLE:  0.09641845
Mean Residual Deviance :  0.414677


Cross-Validation Metrics Summary: 
                        mean           sd cv_1_valid  cv_2_valid cv_3_valid
mae               0.47480187  0.014146665 0.49126652  0.44133535  0.4956603
mse               0.41468513   0.01769817 0.42318034   0.3743674 0.44260392
r2                0.47613242  0.022100862  0.4654261  0.51358205  0.4544707
residual_deviance 0.41468513   0.01769817 0.42318034   0.3743674 0.44260392
rmse               0.6436622  0.013858161  0.6505231   0.6118557  0.6652848
rmsle             0.09633803 0.0027988742 0.09947316 0.089695655 0.09997074
                  cv_4_valid cv_5_valid
mae               0.46371725 0.48202994
mse               0.39864415 0.43462983
r2                 0.5115735 0.43560982
residual_deviance 0.39864415 0.43462983
rmse               0.6313827  0.6592646
rmsle             0.09391075 0.09863986

Scoring History: 
            timestamp   duration number_of_trees training_rmse training_mae
1 2017-04-20 22:29:35 45.502 sec               0       0.89009      0.67683
2 2017-04-20 22:29:35 45.508 sec               1       0.85417      0.64726
3 2017-04-20 22:29:35 45.514 sec               2       0.81998      0.62140
4 2017-04-20 22:29:35 45.518 sec               3       0.79127      0.60341
5 2017-04-20 22:29:35 45.522 sec               4       0.76588      0.58860
  training_deviance
1           0.79225
2           0.72961
3           0.67237
4           0.62611
5           0.58657

---
              timestamp   duration number_of_trees training_rmse training_mae
164 2017-04-20 22:29:36 46.246 sec             163       0.32082      0.23913
165 2017-04-20 22:29:36 46.251 sec             164       0.32012      0.23846
166 2017-04-20 22:29:36 46.256 sec             165       0.31826      0.23707
167 2017-04-20 22:29:36 46.261 sec             166       0.31756      0.23653
168 2017-04-20 22:29:36 46.265 sec             167       0.31725      0.23615
169 2017-04-20 22:29:36 46.270 sec             168       0.31584      0.23501
    training_deviance
164           0.10293
165           0.10248
166           0.10129
167           0.10085
168           0.10065
169           0.09975

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
               variable relative_importance scaled_importance percentage
1               alcohol         3803.677246          1.000000   0.269828
2      volatile acidity         1756.441528          0.461775   0.124600
3   free sulfur dioxide         1483.690063          0.390067   0.105251
4                    pH         1039.827271          0.273374   0.073764
5               density          968.992920          0.254752   0.068739
6  total sulfur dioxide          921.516785          0.242270   0.065371
7        residual sugar          919.475098          0.241733   0.065226
8             sulphates          869.059875          0.228479   0.061650
9         fixed acidity          867.632874          0.228104   0.061549
10          citric acid          784.689758          0.206298   0.055665
11            chlorides          681.660889          0.179211   0.048356


Step 2: Build DRF Models using Random Grid Search and Extract the Best Model


In [12]:
# Define the range of hyper-parameters for DRF grid search
# 27 combinations in total
hyper_params <- list(
    sample_rate = c(0.5, 0.6, 0.7),
    col_sample_rate_per_tree = c(0.7, 0.8, 0.9),
    max_depth = c(3, 5, 7)
)

In [13]:
# Set up DRF grid search
# Add a seed for reproducibility
drf_rand_grid <- h2o.grid(
  
    # Core parameters for model training
    x = features,
    y = 'quality',
    training_frame = wine_train,
    ntrees = 200,
    nfolds = 5,
    seed = 1234,

    # Parameters for grid search
    grid_id = "drf_rand_grid",
    hyper_params = hyper_params,
    algorithm = "randomForest",
    search_criteria = search_criteria,
    
    # Parameters required for stacked ensembles
    fold_assignment = "Modulo",
    keep_cross_validation_predictions = TRUE
  
)


  |======================================================================| 100%

In [14]:
# Sort and show the grid search results
drf_rand_grid <- h2o.getGrid(grid_id = "drf_rand_grid", sort_by = "mse", decreasing = FALSE)
print(drf_rand_grid)


H2O Grid Details
================

Grid ID: drf_rand_grid 
Used hyper parameters: 
  -  col_sample_rate_per_tree 
  -  max_depth 
  -  sample_rate 
Number of models: 9 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing mse
  col_sample_rate_per_tree max_depth sample_rate             model_ids
1                      0.9         7         0.7 drf_rand_grid_model_5
2                      0.9         7         0.5 drf_rand_grid_model_6
3                      0.8         7         0.5 drf_rand_grid_model_4
4                      0.7         7         0.5 drf_rand_grid_model_1
5                      0.7         5         0.6 drf_rand_grid_model_0
6                      0.9         3         0.7 drf_rand_grid_model_2
7                      0.8         3         0.6 drf_rand_grid_model_3
8                      0.8         3         0.7 drf_rand_grid_model_7
9                      0.7         3         0.5 drf_rand_grid_model_8
                  mse
1 0.48533899185762636
2   0.487315432336594
3 0.49004168463947945
4  0.4927544483353685
5  0.5307039662299886
6  0.5846039939024897
7  0.5850640013528532
8  0.5855927668634072
9  0.5857362760598669

In [15]:
# Extract the best model from random grid search
best_drf_model_id <- drf_rand_grid@model_ids[[1]] # top of the list
best_drf_from_rand_grid <- h2o.getModel(best_drf_model_id)
summary(best_drf_from_rand_grid)


Model Details:
==============

H2ORegressionModel: drf
Model Key:  drf_rand_grid_model_5 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             200                      200              239751         7
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         7    7.00000         70        111    90.26500

H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  0.4881925
RMSE:  0.6987078
MAE:  0.55672
RMSLE:  0.1038554
Mean Residual Deviance :  0.4881925



H2ORegressionMetrics: drf
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.485339
RMSE:  0.6966628
MAE:  0.5534049
RMSLE:  0.1036737
Mean Residual Deviance :  0.485339


Cross-Validation Metrics Summary: 
                        mean           sd cv_1_valid cv_2_valid cv_3_valid
mae               0.55340695  0.005552894  0.5597114  0.5389095  0.5607463
mse               0.48534155  0.008135928  0.4963012  0.4643429 0.49416298
r2                 0.3867965  0.011288858  0.3730577 0.39667627  0.3909218
residual_deviance 0.48534155  0.008135928  0.4963012  0.4643429 0.49416298
rmse               0.6966151  0.005873075  0.7044865  0.6814271  0.7029673
rmsle             0.10364755 0.0016516468 0.10672623 0.09989605 0.10539349
                   cv_4_valid cv_5_valid
mae                 0.5552819  0.5523856
mse                 0.4826099 0.48929074
r2                 0.40869704 0.36462972
residual_deviance   0.4826099 0.48929074
rmse                0.6947013  0.6994932
rmsle             0.102883786 0.10333821

Scoring History: 
            timestamp   duration number_of_trees training_rmse training_mae
1 2017-04-20 22:30:26 31.592 sec               0                           
2 2017-04-20 22:30:26 31.596 sec               1       0.78987      0.60592
3 2017-04-20 22:30:26 31.599 sec               2       0.78827      0.61267
4 2017-04-20 22:30:26 31.602 sec               3       0.77905      0.60767
5 2017-04-20 22:30:26 31.605 sec               4       0.77239      0.60398
  training_deviance
1                  
2           0.62390
3           0.62137
4           0.60692
5           0.59659

---
              timestamp   duration number_of_trees training_rmse training_mae
196 2017-04-20 22:30:26 32.328 sec             195       0.69872      0.55659
197 2017-04-20 22:30:26 32.332 sec             196       0.69869      0.55663
198 2017-04-20 22:30:26 32.336 sec             197       0.69879      0.55671
199 2017-04-20 22:30:26 32.341 sec             198       0.69883      0.55681
200 2017-04-20 22:30:26 32.345 sec             199       0.69872      0.55668
201 2017-04-20 22:30:26 32.350 sec             200       0.69871      0.55672
    training_deviance
196           0.48820
197           0.48817
198           0.48830
199           0.48836
200           0.48821
201           0.48819

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
               variable relative_importance scaled_importance percentage
1               alcohol        60814.687500          1.000000   0.303835
2               density        30903.451172          0.508158   0.154396
3      volatile acidity        24629.945312          0.405000   0.123053
4   free sulfur dioxide        17794.929688          0.292609   0.088905
5             chlorides        14527.195312          0.238876   0.072579
6  total sulfur dioxide        13325.857422          0.219122   0.066577
7           citric acid         9832.670898          0.161683   0.049125
8         fixed acidity         7867.541504          0.129369   0.039307
9                    pH         7696.380371          0.126555   0.038452
10       residual sugar         7565.418945          0.124401   0.037797
11            sulphates         5199.086426          0.085491   0.025975


Step 3: Build DNN Models using Random Grid Search and Extract the Best Model


In [16]:
# Define the range of hyper-parameters for DNN grid search
# 81 combinations in total
hyper_params <- list(
    activation = c('tanh', 'rectifier', 'maxout'),
    hidden = list(c(50), c(50,50), c(50,50,50)),
    l1 = c(0, 1e-3, 1e-5),
    l2 = c(0, 1e-3, 1e-5)
)

In [17]:
# Set up DNN grid search
# Add a seed for reproducibility
dnn_rand_grid <- h2o.grid(
  
    # Core parameters for model training
    x = features,
    y = 'quality',
    training_frame = wine_train,
    epochs = 20,
    nfolds = 5,
    seed = 1234,

    # Parameters for grid search
    grid_id = "dnn_rand_grid",
    hyper_params = hyper_params,
    algorithm = "deeplearning",
    search_criteria = search_criteria,
    
    # Parameters required for stacked ensembles
    fold_assignment = "Modulo",
    keep_cross_validation_predictions = TRUE
  
)


  |======================================================================| 100%

In [18]:
# Sort and show the grid search results
dnn_rand_grid <- h2o.getGrid(grid_id = "dnn_rand_grid", sort_by = "mse", decreasing = FALSE)
print(dnn_rand_grid)


H2O Grid Details
================

Grid ID: dnn_rand_grid 
Used hyper parameters: 
  -  activation 
  -  hidden 
  -  l1 
  -  l2 
Number of models: 9 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing mse
  activation       hidden     l1     l2             model_ids
1       Tanh [50, 50, 50] 1.0E-5 1.0E-5 dnn_rand_grid_model_7
2     Maxout [50, 50, 50] 1.0E-5 1.0E-5 dnn_rand_grid_model_3
3       Tanh [50, 50, 50]    0.0 1.0E-5 dnn_rand_grid_model_0
4     Maxout         [50] 1.0E-5  0.001 dnn_rand_grid_model_6
5     Maxout     [50, 50]    0.0 1.0E-5 dnn_rand_grid_model_8
6  Rectifier     [50, 50] 1.0E-5    0.0 dnn_rand_grid_model_2
7     Maxout         [50]    0.0    0.0 dnn_rand_grid_model_5
8     Maxout         [50]  0.001    0.0 dnn_rand_grid_model_4
9       Tanh [50, 50, 50]  0.001    0.0 dnn_rand_grid_model_1
                 mse
1 0.5190244167006844
2 0.5213907219250163
3 0.5228407400854493
4 0.5232635053190984
5 0.5307057081827324
6 0.5319141335049986
7 0.5324977175515652
8  0.533464620888763
9 0.5351207487276177

In [19]:
# Extract the best model from random grid search
best_dnn_model_id <- dnn_rand_grid@model_ids[[1]] # top of the list
best_dnn_from_rand_grid <- h2o.getModel(best_dnn_model_id)
summary(best_dnn_from_rand_grid)


Model Details:
==============

H2ORegressionModel: deeplearning
Model Key:  dnn_rand_grid_model_7 
Status of Neuron Layers: predicting quality, regression, gaussian distribution, Quadratic loss, 5,751 weights/biases, 75.7 KB, 81,920 training samples, mini-batch size 1
  layer units   type dropout       l1       l2 mean_rate rate_rms momentum
1     1    11  Input  0.00 %                                              
2     2    50   Tanh  0.00 % 0.000010 0.000010  0.002212 0.000846 0.000000
3     3    50   Tanh  0.00 % 0.000010 0.000010  0.006017 0.002141 0.000000
4     4    50   Tanh  0.00 % 0.000010 0.000010  0.098442 0.099110 0.000000
5     5     1 Linear         0.000010 0.000010  0.001662 0.000851 0.000000
  mean_weight weight_rms mean_bias bias_rms
1                                          
2    0.002276   0.231624  0.012922 0.116485
3    0.001190   0.192257 -0.023824 0.266163
4    0.005200   0.143210 -0.006688 0.146705
5   -0.064008   0.182394 -0.080526 0.000000

H2ORegressionMetrics: deeplearning
** Reported on training data. **
** Metrics reported on full training frame **

MSE:  0.3929263
RMSE:  0.6268384
MAE:  0.492776
RMSLE:  0.09360466
Mean Residual Deviance :  0.3929263



H2ORegressionMetrics: deeplearning
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.5190244
RMSE:  0.7204335
MAE:  0.5658033
RMSLE:  0.1070117
Mean Residual Deviance :  0.5190244


Cross-Validation Metrics Summary: 
                        mean           sd cv_1_valid cv_2_valid cv_3_valid
mae               0.56580484  0.006983193 0.57709104 0.54840773   0.562632
mse                0.5190286 0.0135020735 0.53879184 0.48284402 0.52024233
r2                0.34431395  0.016484078 0.31938225 0.37263763  0.3587778
residual_deviance  0.5190286 0.0135020735 0.53879184 0.48284402 0.52024233
rmse               0.7203119  0.009471315  0.7340244  0.6948698  0.7212783
rmsle             0.10697291 0.0020405161 0.11117696 0.10213999 0.10771321
                  cv_4_valid cv_5_valid
mae               0.57193834  0.5689551
mse                0.5248941  0.5283706
r2                0.35688958  0.3138824
residual_deviance  0.5248941  0.5283706
rmse              0.72449577 0.72689104
rmsle             0.10707782 0.10675656

Scoring History: 
            timestamp          duration training_speed   epochs iterations
1 2017-04-20 22:31:49         0.000 sec                 0.00000          0
2 2017-04-20 22:31:50  1 min  6.798 sec  32089 obs/sec  2.09741          1
3 2017-04-20 22:31:52  1 min  9.066 sec  32689 obs/sec 20.83418         10
       samples training_rmse training_deviance training_mae
1     0.000000                                             
2  8247.000000       0.73865           0.54561      0.57744
3 81920.000000       0.62684           0.39293      0.49278


Step 4: Model Stacking


In [20]:
# Define a list of models to be stacked
# i.e. best model from each grid
all_ids = list(best_gbm_model_id, best_drf_model_id, best_dnn_model_id)

In [21]:
# Stack models
# GLM as the default metalearner
ensemble = h2o.stackedEnsemble(x = features,
                               y = 'quality',
                               training_frame = wine_train,
                               model_id = "my_ensemble",
                               base_models = all_ids)


  |======================================================================| 100%
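The fitted ensemble behaves like any other H2O model, so held-out predictions can be generated directly. A minimal sketch, assuming the `ensemble` and `wine_test` objects created above and a running H2O cluster:

```r
# Score the stacked ensemble on the held-out test set
# (assumes `ensemble` and `wine_test` exist in the current session)
ensemble_pred <- h2o.predict(ensemble, newdata = wine_test)
head(ensemble_pred)  # a single 'predict' column with predicted quality values
```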


Comparison of Model Performance on Test Data


In [22]:
cat('Best GBM model from Grid (MSE) : ', h2o.performance(best_gbm_from_rand_grid, wine_test)@metrics$MSE, "\n")
cat('Best DRF model from Grid (MSE) : ', h2o.performance(best_drf_from_rand_grid, wine_test)@metrics$MSE, "\n")
cat('Best DNN model from Grid (MSE) : ', h2o.performance(best_dnn_from_rand_grid, wine_test)@metrics$MSE, "\n")
cat('Stacked Ensembles        (MSE) : ', h2o.performance(ensemble, wine_test)@metrics$MSE, "\n")


Best GBM model from Grid (MSE) :  0.4013943 
Best DRF model from Grid (MSE) :  0.4781568 
Best DNN model from Grid (MSE) :  0.505399 
Stacked Ensembles        (MSE) :  0.3992703
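If the ensemble is worth keeping, it can be persisted and reloaded with H2O's model I/O functions. A minimal sketch; the directory name is illustrative and requires a running H2O cluster:

```r
# Save the stacked ensemble to disk and load it back later
# (the "models" directory is illustrative; h2o.saveModel returns the saved path)
model_path <- h2o.saveModel(object = ensemble, path = "models", force = TRUE)
ensemble_reloaded <- h2o.loadModel(model_path)
```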