Machine Learning with H2O - Tutorial 3c: Regression Models (Ensembles)


Objective:

  • This tutorial explains how to create stacked ensembles of regression models for better out-of-sample (test set) performance.

Wine Quality Dataset:

  • White wine quality data from the UCI Machine Learning Repository: 4,898 samples, 11 physicochemical features, and a sensory quality score as the target.

Steps:

  1. Build GBM models using random grid search and extract the best one.
  2. Build DRF models using random grid search and extract the best one.
  3. Build DNN models using random grid search and extract the best one.
  4. Use model stacking to combine the different models.

Full Technical Reference:



In [1]:
# Start and connect to a local H2O cluster
suppressPackageStartupMessages(library(h2o))
h2o.init(nthreads = -1)


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/RtmpddUCFh/h2o_joe_started_from_r.out
    /tmp/RtmpddUCFh/h2o_joe_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 905 milliseconds 
    H2O cluster version:        3.10.3.5 
    H2O cluster version age:    10 days  
    H2O cluster name:           H2O_started_from_R_joe_qbs574 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   5.21 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.3.2 (2016-10-31) 



In [2]:
# Import wine quality data from a local CSV file
wine = h2o.importFile("winequality-white.csv")
head(wine, 5)


  |======================================================================| 100%
  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
  7.0            0.27              0.36         20.7            0.045      45                   170                   1.0010   3.00  0.45       8.8      6
  6.3            0.30              0.34         1.6             0.049      14                   132                   0.9940   3.30  0.49       9.5      6
  8.1            0.28              0.40         6.9             0.050      30                   97                    0.9951   3.26  0.44       10.1     6
  7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.40       9.9      6
  7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.40       9.9      6

In [3]:
# Define features (or predictors)
features = colnames(wine)  # we want to use all the information
features = setdiff(features, 'quality')    # we need to exclude the target 'quality'
features


  1. 'fixed acidity'
  2. 'volatile acidity'
  3. 'citric acid'
  4. 'residual sugar'
  5. 'chlorides'
  6. 'free sulfur dioxide'
  7. 'total sulfur dioxide'
  8. 'density'
  9. 'pH'
  10. 'sulphates'
  11. 'alcohol'

In [4]:
# Split the H2O data frame into training/test sets
# so we can evaluate out-of-sample performance
wine_split = h2o.splitFrame(wine, ratios = 0.8, seed = 1234)

wine_train = wine_split[[1]] # using 80% for training
wine_test = wine_split[[2]]  # using the remaining 20% for out-of-sample evaluation

In [5]:
dim(wine_train)


  1. 3932
  2. 12

In [6]:
dim(wine_test)


  1. 966
  2. 12



In [7]:
# Define the criteria for random grid search
search_criteria = list(strategy = "RandomDiscrete",
                       max_models = 9,
                       seed = 1234)
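Besides max_models, RandomDiscrete search criteria can also be bounded by wall-clock time and stopped early once the leaderboard stops improving. A minimal sketch (the values below are illustrative assumptions, not tuned settings from this tutorial):

```r
# Extended search criteria: cap total runtime and stop early when
# recent models no longer improve the cross-validated MSE
search_criteria_extended = list(strategy = "RandomDiscrete",
                                max_models = 9,
                                max_runtime_secs = 600,     # stop after 10 minutes
                                stopping_metric = "MSE",    # metric to monitor
                                stopping_rounds = 5,        # look-back window of models
                                stopping_tolerance = 1e-3,  # minimum relative improvement
                                seed = 1234)
```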


Step 1: Build GBM Models using Random Grid Search and Extract the Best Model


In [8]:
# Define the range of hyper-parameters for GBM grid search
# 27 combinations in total
hyper_params <- list(
    sample_rate = c(0.7, 0.8, 0.9),
    col_sample_rate = c(0.7, 0.8, 0.9),
    max_depth = c(3, 5, 7)
)

In [9]:
# Set up GBM grid search
# Add a seed for reproducibility
gbm_rand_grid <- h2o.grid(
  
    # Core parameters for model training
    x = features,
    y = 'quality',
    training_frame = wine_train,
    ntrees = 10000,
    nfolds = 5,
    seed = 1234,

    # Parameters for grid search
    grid_id = "gbm_rand_grid",
    hyper_params = hyper_params,
    algorithm = "gbm",
    search_criteria = search_criteria,

    # Parameters for early stopping
    stopping_metric = "MSE",
    stopping_rounds = 15,
    score_tree_interval = 1,
    
    # Parameters required for stacked ensembles
    fold_assignment = "Modulo",
    keep_cross_validation_predictions = TRUE
  
)


  |======================================================================| 100%

In [10]:
# Sort and show the grid search results
gbm_rand_grid <- h2o.getGrid(grid_id = "gbm_rand_grid", sort_by = "mse", decreasing = FALSE)
print(gbm_rand_grid)


H2O Grid Details
================

Grid ID: gbm_rand_grid 
Used hyper parameters: 
  -  col_sample_rate 
  -  max_depth 
  -  sample_rate 
Number of models: 9 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing mse
  col_sample_rate max_depth sample_rate             model_ids
1             0.9         7         0.9 gbm_rand_grid_model_5
2             0.8         7         0.7 gbm_rand_grid_model_4
3             0.7         7         0.7 gbm_rand_grid_model_1
4             0.9         7         0.7 gbm_rand_grid_model_6
5             0.7         5         0.8 gbm_rand_grid_model_0
6             0.8         3         0.9 gbm_rand_grid_model_7
7             0.7         3         0.7 gbm_rand_grid_model_8
8             0.9         3         0.9 gbm_rand_grid_model_2
9             0.8         3         0.8 gbm_rand_grid_model_3
                  mse
1 0.41467703216892454
2  0.4188744246328386
3 0.42294704197026883
4  0.4285238866231086
5 0.44601214899796604
6 0.46338551281728263
7  0.4681243149102324
8 0.46849996267402233
9  0.4690100493856379

In [11]:
# Extract the best model from random grid search
best_gbm_model_id <- gbm_rand_grid@model_ids[[1]] # top of the list
best_gbm_from_rand_grid <- h2o.getModel(best_gbm_model_id)
summary(best_gbm_from_rand_grid)


Model Details:
==============

H2ORegressionModel: gbm
Model Key:  gbm_rand_grid_model_5 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             168                      168              103536         7
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         7    7.00000         13         82    43.80953

H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.09975218
RMSE:  0.3158357
MAE:  0.2350127
RMSLE:  0.04701275
Mean Residual Deviance :  0.09975218



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.414677
RMSE:  0.6439542
MAE:  0.4747976
RMSLE:  0.09641845
Mean Residual Deviance :  0.414677


Cross-Validation Metrics Summary: 
                        mean           sd cv_1_valid  cv_2_valid cv_3_valid
mae               0.47480187  0.014146665 0.49126652  0.44133535  0.4956603
mse               0.41468513   0.01769817 0.42318034   0.3743674 0.44260392
r2                0.47613242  0.022100862  0.4654261  0.51358205  0.4544707
residual_deviance 0.41468513   0.01769817 0.42318034   0.3743674 0.44260392
rmse               0.6436622  0.013858161  0.6505231   0.6118557  0.6652848
rmsle             0.09633803 0.0027988742 0.09947316 0.089695655 0.09997074
                  cv_4_valid cv_5_valid
mae               0.46371725 0.48202994
mse               0.39864415 0.43462983
r2                 0.5115735 0.43560982
residual_deviance 0.39864415 0.43462983
rmse               0.6313827  0.6592646
rmsle             0.09391075 0.09863986

Scoring History: 
            timestamp   duration number_of_trees training_rmse training_mae
1 2017-03-01 05:53:56 36.700 sec               0       0.89009      0.67683
2 2017-03-01 05:53:56 36.705 sec               1       0.85417      0.64726
3 2017-03-01 05:53:56 36.709 sec               2       0.81998      0.62140
4 2017-03-01 05:53:56 36.713 sec               3       0.79127      0.60341
5 2017-03-01 05:53:56 36.717 sec               4       0.76588      0.58860
  training_deviance
1           0.79225
2           0.72961
3           0.67237
4           0.62611
5           0.58657

---
              timestamp   duration number_of_trees training_rmse training_mae
164 2017-03-01 05:53:57 37.376 sec             163       0.32082      0.23913
165 2017-03-01 05:53:57 37.381 sec             164       0.32012      0.23846
166 2017-03-01 05:53:57 37.386 sec             165       0.31826      0.23707
167 2017-03-01 05:53:57 37.390 sec             166       0.31756      0.23653
168 2017-03-01 05:53:57 37.394 sec             167       0.31725      0.23615
169 2017-03-01 05:53:57 37.399 sec             168       0.31584      0.23501
    training_deviance
164           0.10293
165           0.10248
166           0.10129
167           0.10085
168           0.10065
169           0.09975

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
               variable relative_importance scaled_importance percentage
1               alcohol         3803.677246          1.000000   0.269828
2      volatile acidity         1756.441528          0.461775   0.124600
3   free sulfur dioxide         1483.690063          0.390067   0.105251
4                    pH         1039.827271          0.273374   0.073764
5               density          968.992920          0.254752   0.068739
6  total sulfur dioxide          921.516785          0.242270   0.065371
7        residual sugar          919.475098          0.241733   0.065226
8             sulphates          869.059875          0.228479   0.061650
9         fixed acidity          867.632874          0.228104   0.061549
10          citric acid          784.689758          0.206298   0.055665
11            chlorides          681.660889          0.179211   0.048356


Step 2: Build DRF Models using Random Grid Search and Extract the Best Model


In [12]:
# Define the range of hyper-parameters for DRF grid search
# 27 combinations in total
hyper_params <- list(
    sample_rate = c(0.5, 0.6, 0.7),
    col_sample_rate_per_tree = c(0.7, 0.8, 0.9),
    max_depth = c(3, 5, 7)
)

In [13]:
# Set up DRF grid search
# Add a seed for reproducibility
drf_rand_grid <- h2o.grid(
  
    # Core parameters for model training
    x = features,
    y = 'quality',
    training_frame = wine_train,
    ntrees = 200,
    nfolds = 5,
    seed = 1234,

    # Parameters for grid search
    grid_id = "drf_rand_grid",
    hyper_params = hyper_params,
    algorithm = "randomForest",
    search_criteria = search_criteria,
    
    # Parameters required for stacked ensembles
    fold_assignment = "Modulo",
    keep_cross_validation_predictions = TRUE
  
)


  |======================================================================| 100%

In [14]:
# Sort and show the grid search results
drf_rand_grid <- h2o.getGrid(grid_id = "drf_rand_grid", sort_by = "mse", decreasing = FALSE)
print(drf_rand_grid)


H2O Grid Details
================

Grid ID: drf_rand_grid 
Used hyper parameters: 
  -  col_sample_rate_per_tree 
  -  max_depth 
  -  sample_rate 
Number of models: 9 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing mse
  col_sample_rate_per_tree max_depth sample_rate             model_ids
1                      0.9         7         0.7 drf_rand_grid_model_5
2                      0.9         7         0.5 drf_rand_grid_model_6
3                      0.8         7         0.5 drf_rand_grid_model_4
4                      0.7         7         0.5 drf_rand_grid_model_1
5                      0.7         5         0.6 drf_rand_grid_model_0
6                      0.9         3         0.7 drf_rand_grid_model_2
7                      0.8         3         0.6 drf_rand_grid_model_3
8                      0.8         3         0.7 drf_rand_grid_model_7
9                      0.7         3         0.5 drf_rand_grid_model_8
                  mse
1 0.48533899185762636
2   0.487315432336594
3 0.49004168463947945
4  0.4927544483353685
5  0.5307039662299886
6  0.5846039939024897
7  0.5850640013528532
8  0.5855927668634072
9  0.5857362760598669

In [15]:
# Extract the best model from random grid search
best_drf_model_id <- drf_rand_grid@model_ids[[1]] # top of the list
best_drf_from_rand_grid <- h2o.getModel(best_drf_model_id)
summary(best_drf_from_rand_grid)


Model Details:
==============

H2ORegressionModel: drf
Model Key:  drf_rand_grid_model_5 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             200                      200              239751         7
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         7    7.00000         70        111    90.26500

H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  0.4881925
RMSE:  0.6987078
MAE:  0.55672
RMSLE:  0.1038554
Mean Residual Deviance :  0.4881925



H2ORegressionMetrics: drf
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.485339
RMSE:  0.6966628
MAE:  0.5534049
RMSLE:  0.1036737
Mean Residual Deviance :  0.485339


Cross-Validation Metrics Summary: 
                        mean           sd cv_1_valid cv_2_valid cv_3_valid
mae               0.55340695  0.005552894  0.5597114  0.5389095  0.5607463
mse               0.48534155  0.008135928  0.4963012  0.4643429 0.49416298
r2                 0.3867965  0.011288858  0.3730577 0.39667627  0.3909218
residual_deviance 0.48534155  0.008135928  0.4963012  0.4643429 0.49416298
rmse               0.6966151  0.005873075  0.7044865  0.6814271  0.7029673
rmsle             0.10364755 0.0016516468 0.10672623 0.09989605 0.10539349
                   cv_4_valid cv_5_valid
mae                 0.5552819  0.5523856
mse                 0.4826099 0.48929074
r2                 0.40869704 0.36462972
residual_deviance   0.4826099 0.48929074
rmse                0.6947013  0.6994932
rmsle             0.102883786 0.10333821

Scoring History: 
            timestamp   duration number_of_trees training_rmse training_mae
1 2017-03-01 05:54:44 31.563 sec               0                           
2 2017-03-01 05:54:44 31.566 sec               1       0.78987      0.60592
3 2017-03-01 05:54:44 31.569 sec               2       0.78827      0.61267
4 2017-03-01 05:54:44 31.571 sec               3       0.77905      0.60767
5 2017-03-01 05:54:44 31.574 sec               4       0.77239      0.60398
  training_deviance
1                  
2           0.62390
3           0.62137
4           0.60692
5           0.59659

---
              timestamp   duration number_of_trees training_rmse training_mae
196 2017-03-01 05:54:45 32.325 sec             195       0.69872      0.55659
197 2017-03-01 05:54:45 32.329 sec             196       0.69869      0.55663
198 2017-03-01 05:54:45 32.336 sec             197       0.69879      0.55671
199 2017-03-01 05:54:45 32.341 sec             198       0.69883      0.55681
200 2017-03-01 05:54:45 32.344 sec             199       0.69872      0.55668
201 2017-03-01 05:54:45 32.348 sec             200       0.69871      0.55672
    training_deviance
196           0.48820
197           0.48817
198           0.48830
199           0.48836
200           0.48821
201           0.48819

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
               variable relative_importance scaled_importance percentage
1               alcohol        60814.687500          1.000000   0.303835
2               density        30903.451172          0.508158   0.154396
3      volatile acidity        24629.945312          0.405000   0.123053
4   free sulfur dioxide        17794.929688          0.292609   0.088905
5             chlorides        14527.195312          0.238876   0.072579
6  total sulfur dioxide        13325.857422          0.219122   0.066577
7           citric acid         9832.670898          0.161683   0.049125
8         fixed acidity         7867.541504          0.129369   0.039307
9                    pH         7696.380371          0.126555   0.038452
10       residual sugar         7565.418945          0.124401   0.037797
11            sulphates         5199.086426          0.085491   0.025975


Step 3: Build DNN Models using Random Grid Search and Extract the Best Model


In [16]:
# Define the range of hyper-parameters for DNN grid search
# 81 combinations in total
hyper_params <- list(
    activation = c('tanh', 'rectifier', 'maxout'),
    hidden = list(c(50), c(50,50), c(50,50,50)),
    l1 = c(0, 1e-3, 1e-5),
    l2 = c(0, 1e-3, 1e-5)
)

In [17]:
# Set up DNN grid search
# Add a seed for reproducibility
dnn_rand_grid <- h2o.grid(
  
    # Core parameters for model training
    x = features,
    y = 'quality',
    training_frame = wine_train,
    epochs = 20,
    nfolds = 5,
    seed = 1234,

    # Parameters for grid search
    grid_id = "dnn_rand_grid",
    hyper_params = hyper_params,
    algorithm = "deeplearning",
    search_criteria = search_criteria,
    
    # Parameters required for stacked ensembles
    fold_assignment = "Modulo",
    keep_cross_validation_predictions = TRUE
  
)


  |======================================================================| 100%

In [18]:
# Sort and show the grid search results
dnn_rand_grid <- h2o.getGrid(grid_id = "dnn_rand_grid", sort_by = "mse", decreasing = FALSE)
print(dnn_rand_grid)


H2O Grid Details
================

Grid ID: dnn_rand_grid 
Used hyper parameters: 
  -  activation 
  -  hidden 
  -  l1 
  -  l2 
Number of models: 9 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing mse
  activation       hidden     l1     l2             model_ids
1     Maxout [50, 50, 50] 1.0E-5 1.0E-5 dnn_rand_grid_model_3
2  Rectifier     [50, 50] 1.0E-5    0.0 dnn_rand_grid_model_2
3     Maxout     [50, 50]    0.0 1.0E-5 dnn_rand_grid_model_8
4       Tanh [50, 50, 50] 1.0E-5 1.0E-5 dnn_rand_grid_model_7
5     Maxout         [50] 1.0E-5  0.001 dnn_rand_grid_model_6
6       Tanh [50, 50, 50]    0.0 1.0E-5 dnn_rand_grid_model_0
7     Maxout         [50]    0.0    0.0 dnn_rand_grid_model_5
8     Maxout         [50]  0.001    0.0 dnn_rand_grid_model_4
9       Tanh [50, 50, 50]  0.001    0.0 dnn_rand_grid_model_1
                 mse
1 0.5132317444689928
2 0.5147930385440149
3 0.5231170352359251
4 0.5243904925311967
5 0.5257152424817406
6 0.5276392369040392
7 0.5300169058534957
8 0.5450070477599134
9 0.5486681769362338

In [19]:
# Extract the best model from random grid search
best_dnn_model_id <- dnn_rand_grid@model_ids[[1]] # top of the list
best_dnn_from_rand_grid <- h2o.getModel(best_dnn_model_id)
summary(best_dnn_from_rand_grid)


Model Details:
==============

H2ORegressionModel: deeplearning
Model Key:  dnn_rand_grid_model_3 
Status of Neuron Layers: predicting quality, regression, gaussian distribution, Quadratic loss, 11,451 weights/biases, 143.8 KB, 81,920 training samples, mini-batch size 1
  layer units   type dropout       l1       l2 mean_rate rate_rms momentum
1     1    11  Input  0.00 %                                              
2     2    50 Maxout  0.00 % 0.000010 0.000010  0.001362 0.000463 0.000000
3     3    50 Maxout  0.00 % 0.000010 0.000010  0.002507 0.000914 0.000000
4     4    50 Maxout  0.00 % 0.000010 0.000010  0.035343 0.053778 0.000000
5     5     1 Linear         0.000010 0.000010  0.000370 0.000208 0.000000
  mean_weight weight_rms mean_bias bias_rms
1                                          
2   -0.002575   0.198772  0.427465 0.066787
3   -0.031066   0.149890  0.957756 0.031819
4   -0.021807   0.144457  0.817316 0.199981
5    0.000514   0.119923  0.022593 0.000000

H2ORegressionMetrics: deeplearning
** Reported on training data. **
** Metrics reported on full training frame **

MSE:  0.428123
RMSE:  0.6543111
MAE:  0.5181838
RMSLE:  0.09789516
Mean Residual Deviance :  0.428123



H2ORegressionMetrics: deeplearning
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.5132317
RMSE:  0.7164019
MAE:  0.5610945
RMSLE:  0.1065543
Mean Residual Deviance :  0.5132317


Cross-Validation Metrics Summary: 
                        mean           sd  cv_1_valid cv_2_valid cv_3_valid
mae                0.5610955  0.004813636  0.56158286  0.5569238  0.5517425
mse               0.51323146  0.012347572   0.5097897   0.517834 0.49660873
r2                 0.3509187  0.026187148   0.3560186  0.3271749 0.38790733
residual_deviance 0.51323146  0.012347572   0.5097897   0.517834 0.49660873
rmse               0.7162994  0.008560747   0.7139956  0.7196069  0.7047047
rmsle             0.10652595 0.0017459263 0.107351966 0.10474491 0.10536039
                  cv_4_valid cv_5_valid
mae                 0.563053  0.5721752
mse                0.4975557  0.5443691
r2                0.39038515 0.29310754
residual_deviance  0.4975557  0.5443691
rmse              0.70537627  0.7378137
rmsle              0.1041935 0.11097897

Scoring History: 
            timestamp   duration training_speed   epochs iterations
1 2017-03-01 05:55:44  0.000 sec                 0.00000          0
2 2017-03-01 05:55:45 43.336 sec  25691 obs/sec  2.09741          1
3 2017-03-01 05:55:47 46.208 sec  25850 obs/sec 20.83418         10
       samples training_rmse training_deviance training_mae
1     0.000000                                             
2  8247.000000       0.71247           0.50762      0.55722
3 81920.000000       0.65431           0.42812      0.51818


Model Stacking


In [20]:
# Define a list of models to be stacked
# i.e. best model from each grid
all_ids = list(best_gbm_model_id, best_drf_model_id, best_dnn_model_id)

In [21]:
# Stack models
# GLM as the default metalearner
ensemble = h2o.stackedEnsemble(x = features,
                               y = 'quality',
                               training_frame = wine_train,
                               model_id = "my_ensemble",
                               base_models = all_ids)


  |======================================================================| 100%


Comparison of Model Performance on Test Data


In [22]:
cat('Best GBM model from Grid (MSE) : ', h2o.performance(best_gbm_from_rand_grid, wine_test)@metrics$MSE, "\n")
cat('Best DRF model from Grid (MSE) : ', h2o.performance(best_drf_from_rand_grid, wine_test)@metrics$MSE, "\n")
cat('Best DNN model from Grid (MSE) : ', h2o.performance(best_dnn_from_rand_grid, wine_test)@metrics$MSE, "\n")
cat('Stacked Ensembles        (MSE) : ', h2o.performance(ensemble, wine_test)@metrics$MSE, "\n")


Best GBM model from Grid (MSE) :  0.4013943 
Best DRF model from Grid (MSE) :  0.4781568 
Best DNN model from Grid (MSE) :  0.5543555 
Stacked Ensembles        (MSE) :  0.3989076
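Since the stacked ensemble beats its best base learner on the held-out test set, it is the model worth keeping. A minimal sketch of scoring new data with it and persisting it to disk (file paths here are illustrative assumptions, not from the tutorial):

```r
# Score the held-out test set with the stacked ensemble;
# the result is an H2OFrame with a single 'predict' column
pred = h2o.predict(ensemble, wine_test)
head(pred, 5)

# Persist the ensemble to disk and reload it later
model_path = h2o.saveModel(ensemble, path = "./models", force = TRUE)
ensemble_reloaded = h2o.loadModel(model_path)
```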