Machine Learning with H2O - Tutorial 3b: Regression Models (Grid Search)

Objective:

This tutorial explains how to fine-tune regression models for better out-of-sample performance, measured on a held-out test set.

Wine Quality Dataset:

Steps:

GBM with default settings
GBM with manual settings
GBM with manual settings & cross-validation
GBM with manual settings, cross-validation and early stopping
GBM with cross-validation, early stopping and full grid search
GBM with cross-validation, early stopping and random grid search
Model stacking (combining different GLM, DRF, GBM and DNN models)

Full Technical Reference:

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/RBooklet.pdf



In [1]:

    
# Start and connect to a local H2O cluster
suppressPackageStartupMessages(library(h2o))
h2o.init(nthreads = -1)









    



H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/Rtmptn0ivG/h2o_joe_started_from_r.out
    /tmp/Rtmptn0ivG/h2o_joe_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 808 milliseconds 
    H2O cluster version:        3.10.3.5 
    H2O cluster version age:    10 days  
    H2O cluster name:           H2O_started_from_R_joe_jyt717 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   5.21 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.3.2 (2016-10-31)



In [2]:

    
# Import wine quality data from a local CSV file
wine = h2o.importFile("winequality-white.csv")
head(wine, 5)









    



  |======================================================================| 100%






    





fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
7.0            0.27              0.36         20.7            0.045      45                   170                   1.0010   3.00  0.45       8.8      6
6.3            0.30              0.34         1.6             0.049      14                   132                   0.9940   3.30  0.49       9.5      6
8.1            0.28              0.40         6.9             0.050      30                   97                    0.9951   3.26  0.44       10.1     6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.40       9.9      6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.40       9.9      6



In [3]:

    
# Define features (or predictors)
features = colnames(wine)  # we want to use all the information
features = setdiff(features, 'quality')    # we need to exclude the target 'quality'
features









    





	'fixed acidity'
	'volatile acidity'
	'citric acid'
	'residual sugar'
	'chlorides'
	'free sulfur dioxide'
	'total sulfur dioxide'
	'density'
	'pH'
	'sulphates'
	'alcohol'



In [4]:

    
# Split the H2O data frame into training/test sets
# so we can evaluate out-of-sample performance
wine_split = h2o.splitFrame(wine, ratios = 0.8, seed = 1234)

wine_train = wine_split[[1]] # about 80% of the rows for training
wine_test = wine_split[[2]]  # the remaining ~20% for out-of-sample evaluation
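
As far as we understand it, `h2o.splitFrame` assigns each row to a split at random according to the given ratios, so the resulting sizes are close to, but not necessarily exactly, the requested proportions. The base R sketch below mimics that idea on a toy data frame. The names `toy` and `split_frame` are hypothetical, and this is an illustration only, not H2O's implementation.

```r
# Toy stand-in for an H2O frame (hypothetical data, not the wine dataset)
set.seed(1234)
toy <- data.frame(x = rnorm(1000), y = rnorm(1000))

# Assign each row to train with probability `ratio`, otherwise to test
split_frame <- function(df, ratio = 0.8) {
  in_train <- runif(nrow(df)) < ratio
  list(train = df[in_train, , drop = FALSE],
       test  = df[!in_train, , drop = FALSE])
}

parts <- split_frame(toy)

# Row counts are near 800/200 but usually not exactly that
c(train = nrow(parts$train), test = nrow(parts$test))
```

This is why it is worth checking `dim()` on both frames after splitting rather than assuming exact sizes.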



In [5]:

    
dim(wine_train)



In [6]:

    
dim(wine_test)

Step 1 - Gradient Boosting Machines (GBM) with Default Settings



In [7]:

    
# Build a Gradient Boosting Machines (GBM) model with default settings
gbm_default = h2o.gbm(x = features,
                      y = 'quality',
                      training_frame = wine_train,
                      seed = 1234,
                      model_id = 'gbm_default')









    



  |======================================================================| 100%



In [8]:

    
# Check the model performance on test dataset
h2o.performance(gbm_default, wine_test)









    





H2ORegressionMetrics: gbm

MSE:  0.4551121
RMSE:  0.67462
MAE:  0.5219768
RMSLE:  0.1001376
Mean Residual Deviance :  0.4551121
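
To make the reported metrics concrete, here is a minimal base R sketch of how they relate to each other. The function `regression_metrics` and the toy vectors are hypothetical and only illustrate the standard definitions; they are not taken from H2O's code. Note that in the output above the Mean Residual Deviance matches the MSE, which is what one would expect for a squared-error loss.

```r
# Standard regression metrics from actual and predicted values
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(MSE   = mse,
    RMSE  = sqrt(mse),                 # RMSE is the square root of MSE
    MAE   = mean(abs(err)),
    RMSLE = sqrt(mean((log1p(predicted) - log1p(actual))^2)))
}

# Toy example (made-up quality scores and predictions)
actual    <- c(6, 5, 7, 6, 5)
predicted <- c(5.8, 5.4, 6.5, 6.1, 5.2)
m <- regression_metrics(actual, predicted)
print(round(m, 4))

# Sanity check against the output above: sqrt(MSE) reproduces the RMSE
rmse_from_mse <- sqrt(0.4551121)
print(rmse_from_mse)
```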

Step 2 - GBM with Manual Settings



In [9]:

    
# Build a GBM with manual settings
gbm_manual = h2o.gbm(x = features,
                     y = 'quality',
                     training_frame = wine_train,
                     seed = 1234,
                     model_id = 'gbm_manual',
                     ntrees = 100,
                     sample_rate = 0.9,
                     col_sample_rate = 0.9)









    



  |======================================================================| 100%



In [10]:

    
# Check the model performance on test dataset
h2o.performance(gbm_manual, wine_test)









    





H2ORegressionMetrics: gbm

MSE:  0.4432567
RMSE:  0.6657752
MAE:  0.5114358
RMSLE:  0.0989581
Mean Residual Deviance :  0.4432567

Step 3 - GBM with Manual Settings & Cross-Validation (CV)



In [11]:

    
# Build a GBM with manual settings & cross-validation
gbm_manual_cv = h2o.gbm(x = features,
                        y = 'quality',
                        training_frame = wine_train,
                        seed = 1234,
                        model_id = 'gbm_manual_cv',
                        ntrees = 100,
                        sample_rate = 0.9,
                        col_sample_rate = 0.9,
                        nfolds = 5)









    



  |======================================================================| 100%



In [12]:

    
# Check the cross-validation model performance
gbm_manual_cv









    





Model Details:
==============

H2ORegressionModel: gbm
Model ID:  gbm_manual_cv 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             100                      100               32355         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          7         31    20.59000


H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.2743835
RMSE:  0.5238162
MAE:  0.4075921
RMSLE:  0.07748354
Mean Residual Deviance :  0.2743835



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.4502182
RMSE:  0.670983
MAE:  0.5185164
RMSLE:  0.1000784
Mean Residual Deviance :  0.4502182


Cross-Validation Metrics Summary: 
                        mean          sd cv_1_valid cv_2_valid cv_3_valid
mae                0.5183839 0.005799581  0.5173063 0.53289497   0.507699
mse               0.45017532 0.009006354  0.4428255 0.47013062 0.45373598
r2                0.43098298 0.017648963 0.43654183 0.38503054  0.4380041
residual_deviance 0.45017532 0.009006354  0.4428255 0.47013062 0.45373598
rmse                0.670884 0.006707203 0.66545135  0.6856607  0.6735993
rmsle             0.10007383 7.270967E-4 0.09834545 0.10152771 0.09989998
                   cv_4_valid cv_5_valid
mae                0.51518273 0.51883656
mse                0.43171054 0.45247397
r2                 0.46120018  0.4341383
residual_deviance  0.43171054 0.45247397
rmse               0.65704685 0.67266184
rmsle             0.100463666 0.10013236
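
The output reports two slightly different cross-validation figures: the MSE "computed for combined holdout predictions" (0.4502182) and the mean of the per-fold MSE values (0.45017532). A plausible reason is that the folds are not all the same size, so pooling all holdout predictions weights each fold by its row count. The base R sketch below shows the effect on toy data, with `lm` as a stand-in for GBM. All names and data here are hypothetical.

```r
set.seed(1234)
n <- 103                           # deliberately not divisible by 5
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)

k <- 5
fold <- sample(rep_len(1:k, n))    # random fold labels; sizes differ by at most 1

holdout_pred <- numeric(n)
fold_mse <- numeric(k)
for (i in 1:k) {
  test_idx <- fold == i
  fit <- lm(y ~ x, data = toy[!test_idx, ])   # train on the other k-1 folds
  holdout_pred[test_idx] <- predict(fit, newdata = toy[test_idx, ])
  fold_mse[i] <- mean((toy$y[test_idx] - holdout_pred[test_idx])^2)
}

pooled_mse    <- mean((toy$y - holdout_pred)^2)  # combined holdout predictions
mean_fold_mse <- mean(fold_mse)                  # simple mean of per-fold MSEs
c(pooled = pooled_mse, mean_of_folds = mean_fold_mse)
```

The pooled value equals the fold-size-weighted mean of the per-fold values, so the two numbers coincide only when every fold has the same number of rows.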



In [13]:

    
# Check the model performance on test dataset
h2o.performance(gbm_manual_cv, wine_test)
# The test metrics should match gbm_manual above: the final model is trained on the
# full training frame with the same parameters; CV only adds holdout estimates









    





H2ORegressionMetrics: gbm

MSE:  0.4432567
RMSE:  0.6657752
MAE:  0.5114358
RMSLE:  0.0989581
Mean Residual Deviance :  0.4432567

Step 4 - GBM with Manual Settings, CV and Early Stopping



In [14]:

    
# Build a GBM with manual settings, CV and early stopping
gbm_manual_cv_es = h2o.gbm(x = features,
                           y = 'quality',
                           training_frame = wine_train,
                           seed = 1234,
                           model_id = 'gbm_manual_cv_es',
                           ntrees = 10000,              # increase the number of trees
                           sample_rate = 0.9,
                           col_sample_rate = 0.9,
                           nfolds = 5,
                           stopping_metric = 'MSE',     # metric monitored for early stopping
                           stopping_rounds = 15,        # stop when MSE stops improving for 15 scoring rounds
                           score_tree_interval = 1)     # score the model after every tree









    



  |======================================================================| 100%
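
The idea behind early stopping can be sketched in a few lines of base R. This is a simplified "patience" rule under our own assumptions: H2O's documentation describes its rule in terms of a moving average of the stopping metric, so the exact stopping point in H2O can differ. The function `stop_index` and the curve `mse_curve` are hypothetical.

```r
# Return the scoring event at which training would stop:
# the first point where the metric has failed to improve `rounds` times in a row
stop_index <- function(metric, rounds) {
  best <- Inf
  since_best <- 0
  for (i in seq_along(metric)) {
    if (metric[i] < best) {
      best <- metric[i]
      since_best <- 0
    } else {
      since_best <- since_best + 1
    }
    if (since_best >= rounds) return(i)
  }
  length(metric)   # rule never triggered: all scoring events are used
}

# Made-up holdout MSE curve: improves quickly, flattens, then drifts upwards
trees <- 1:300
mse_curve <- 0.44 + 0.35 * exp(-trees / 30) + 1e-6 * (trees - 150)^2

stopped_at <- stop_index(mse_curve, rounds = 15)
c(stopped_at = stopped_at, best_at = which.min(mse_curve))
```

With `ntrees = 10000` acting only as an upper bound, a rule like this is what lets the model above finish with 155 trees instead of 10000.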



In [15]:

    
# Check the model summary
# which also includes cross-validation model performance
summary(gbm_manual_cv_es)









    



Model Details:
==============

H2ORegressionModel: gbm
Model Key:  gbm_manual_cv_es 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             155                      155               49780         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          7         32    20.37419

H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.2210799
RMSE:  0.4701914
MAE:  0.3620056
RMSLE:  0.06954328
Mean Residual Deviance :  0.2210799



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.4428879
RMSE:  0.6654982
MAE:  0.5094015
RMSLE:  0.09937082
Mean Residual Deviance :  0.4428879


Cross-Validation Metrics Summary: 
                        mean           sd cv_1_valid cv_2_valid cv_3_valid
mae               0.50952744  0.006480625  0.4993615  0.5179948 0.49803287
mse               0.44306728  0.006771577 0.43059766 0.45220712  0.4484727
r2                0.44011936  0.012657494  0.4521007 0.40847594 0.44452316
residual_deviance 0.44306728  0.006771577 0.43059766 0.45220712  0.4484727
rmse                0.665594 0.0050969874  0.6561994  0.6724635  0.6696811
rmsle             0.09938694 8.0789777E-4 0.09720557 0.09966757  0.0995299
                  cv_4_valid cv_5_valid
mae                0.5125335  0.5197147
mse               0.43233046  0.4517285
r2                0.46042648 0.43507057
residual_deviance 0.43233046  0.4517285
rmse               0.6575184  0.6721075
rmsle             0.10051537 0.10001628

Scoring History: 
            timestamp   duration number_of_trees training_rmse training_mae
1 2017-03-01 05:00:06  7.074 sec               0       0.89009      0.67683
2 2017-03-01 05:00:06  7.079 sec               1       0.85999      0.65134
3 2017-03-01 05:00:06  7.082 sec               2       0.83216      0.62951
4 2017-03-01 05:00:06  7.085 sec               3       0.80819      0.61514
5 2017-03-01 05:00:06  7.088 sec               4       0.78820      0.60640
  training_deviance
1           0.79225
2           0.73959
3           0.69248
4           0.65317
5           0.62126

---
              timestamp   duration number_of_trees training_rmse training_mae
151 2017-03-01 05:00:07  7.582 sec             150       0.47228      0.36394
152 2017-03-01 05:00:07  7.586 sec             151       0.47191      0.36351
153 2017-03-01 05:00:07  7.590 sec             152       0.47147      0.36321
154 2017-03-01 05:00:07  7.593 sec             153       0.47125      0.36301
155 2017-03-01 05:00:07  7.597 sec             154       0.47047      0.36229
156 2017-03-01 05:00:07  7.601 sec             155       0.47019      0.36201
    training_deviance
151           0.22305
152           0.22270
153           0.22228
154           0.22208
155           0.22135
156           0.22108

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
               variable relative_importance scaled_importance percentage
1               alcohol         3619.511230          1.000000   0.312433
2      volatile acidity         1571.861328          0.434274   0.135682
3   free sulfur dioxide         1227.192017          0.339049   0.105930
4                    pH          727.217407          0.200916   0.062773
5        residual sugar          694.166016          0.191784   0.059920
6  total sulfur dioxide          680.544678          0.188021   0.058744
7         fixed acidity          664.078186          0.183472   0.057323
8               density          651.062378          0.179876   0.056199
9           citric acid          594.493042          0.164247   0.051316
10            sulphates          592.997131          0.163833   0.051187
11            chlorides          561.783325          0.155210   0.048493



In [16]:

    
# Check the model performance on test dataset
h2o.performance(gbm_manual_cv_es, wine_test)









    





H2ORegressionMetrics: gbm

MSE:  0.4287345
RMSE:  0.6547782
MAE:  0.4990124
RMSLE:  0.09753734
Mean Residual Deviance :  0.4287345

Step 5 - GBM with CV, Early Stopping and Full Grid Search



In [17]:

    
# define the criteria for full grid search
search_criteria = list(strategy = "Cartesian")



In [18]:

    
# define the range of hyper-parameters for grid search
param_list <- list(
  sample_rate = c(0.7, 0.8, 0.9),
  col_sample_rate = c(0.7, 0.8, 0.9)
)
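
A Cartesian (full) grid search trains one model for every combination of the listed values. The base R sketch below enumerates those combinations with `expand.grid`; the names `grid_space` and `full_grid` are hypothetical and the snippet does not call H2O. The count of fits assumes that each grid model with `nfolds = 5` trains five cross-validation models plus one final model.

```r
# Same hyper-parameter values as in the cell above
grid_space <- list(
  sample_rate = c(0.7, 0.8, 0.9),
  col_sample_rate = c(0.7, 0.8, 0.9)
)

# Every combination of the two parameters: 3 x 3 rows
full_grid <- expand.grid(grid_space)
print(full_grid)

n_models <- nrow(full_grid)

# Assumption: 5 CV models + 1 final model per combination
n_fits <- n_models * (5 + 1)
c(models = n_models, fits = n_fits)
```

Because the number of combinations is the product of the list lengths, adding a third parameter with three values would triple the work.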



In [19]:

    
# Set up GBM grid search
# Add a seed for reproducibility
# Full Grid Search
gbm_full_grid <- h2o.grid(
  
    # Core parameters for model training
    x = features,
    y = 'quality',
    training_frame = wine_train,
    ntrees = 10000,
    nfolds = 5,
    seed = 1234,

    # Parameters for grid search
    grid_id = "gbm_full_grid",
    hyper_params = param_list,
    algorithm = "gbm",
    search_criteria = search_criteria,

    # Parameters for early stopping
    stopping_metric = "MSE",
    stopping_rounds = 15,
    score_tree_interval = 1
  
)









    



  |======================================================================| 100%



In [20]:

    
# Sort and show the grid search results
gbm_full_grid <- h2o.getGrid(grid_id = "gbm_full_grid", sort_by = "mse")
print(gbm_full_grid)









    



H2O Grid Details
================

Grid ID: gbm_full_grid 
Used hyper parameters: 
  -  col_sample_rate 
  -  sample_rate 
Number of models: 9 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing mse
  col_sample_rate sample_rate             model_ids                 mse
1             0.8         0.9 gbm_full_grid_model_7 0.43780785687779805
2             0.7         0.9 gbm_full_grid_model_6 0.44060532786277523
3             0.8         0.8 gbm_full_grid_model_4 0.44096100224896634
4             0.9         0.9 gbm_full_grid_model_8 0.44288792056243054
5             0.9         0.8 gbm_full_grid_model_5 0.44475412455519636
6             0.9         0.7 gbm_full_grid_model_2  0.4457317997358452
7             0.7         0.8 gbm_full_grid_model_3   0.448140619501795
8             0.7         0.7 gbm_full_grid_model_0  0.4528872144586896
9             0.8         0.7 gbm_full_grid_model_1  0.4529771807006373



In [21]:

    
# Extract the best model from full grid search
best_model_id <- gbm_full_grid@model_ids[[1]] # top of the list
best_gbm_from_full_grid <- h2o.getModel(best_model_id)
summary(best_gbm_from_full_grid)









    



Model Details:
==============

H2ORegressionModel: gbm
Model Key:  gbm_full_grid_model_7 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             187                      187               57184         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000          7         31    19.16043

H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.2103961
RMSE:  0.4586895
MAE:  0.3519789
RMSLE:  0.06802612
Mean Residual Deviance :  0.2103961



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.4378079
RMSE:  0.6616705
MAE:  0.5053946
RMSLE:  0.09860186
Mean Residual Deviance :  0.4378079


Cross-Validation Metrics Summary: 
                        mean           sd cv_1_valid cv_2_valid cv_3_valid
mae                0.5049402  0.010057319  0.5143834 0.52700925 0.48693994
mse               0.43750215  0.009719265 0.44019926 0.46233726  0.4309816
r2                0.44687334  0.019917823  0.4398835  0.3952249  0.4661876
residual_deviance 0.43750215  0.009719265 0.44019926 0.46233726  0.4309816
rmse               0.6613589 0.0073008444 0.66347516  0.6799539  0.6564919
rmsle             0.09856781 8.5846684E-4 0.09796896  0.1006569 0.09713692
                  cv_4_valid cv_5_valid
mae                0.5013457 0.49502298
mse               0.42164272 0.43234992
r2                0.47376537 0.45930532
residual_deviance 0.42164272 0.43234992
rmse               0.6493402  0.6575332
rmsle             0.09908475 0.09799155

Scoring History: 
            timestamp   duration number_of_trees training_rmse training_mae
1 2017-03-01 05:00:59 51.743 sec               0       0.89009      0.67683
2 2017-03-01 05:00:59 51.747 sec               1       0.85982      0.65191
3 2017-03-01 05:00:59 51.750 sec               2       0.83245      0.63065
4 2017-03-01 05:00:59 51.753 sec               3       0.80845      0.61621
5 2017-03-01 05:00:59 51.756 sec               4       0.78862      0.60647
  training_deviance
1           0.79225
2           0.73930
3           0.69298
4           0.65360
5           0.62192

---
              timestamp   duration number_of_trees training_rmse training_mae
183 2017-03-01 05:01:00 52.357 sec             182       0.46286      0.35558
184 2017-03-01 05:01:00 52.361 sec             183       0.46192      0.35486
185 2017-03-01 05:01:00 52.366 sec             184       0.46141      0.35436
186 2017-03-01 05:01:00 52.370 sec             185       0.46085      0.35376
187 2017-03-01 05:01:00 52.375 sec             186       0.45960      0.35273
188 2017-03-01 05:01:00 52.379 sec             187       0.45869      0.35198
    training_deviance
183           0.21424
184           0.21337
185           0.21290
186           0.21239
187           0.21124
188           0.21040

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
               variable relative_importance scaled_importance percentage
1               alcohol         3585.425537          1.000000   0.302645
2      volatile acidity         1600.595215          0.446417   0.135106
3   free sulfur dioxide         1222.376831          0.340929   0.103181
4                    pH          799.083313          0.222870   0.067451
5        residual sugar          786.511230          0.219363   0.066389
6  total sulfur dioxide          760.961853          0.212238   0.064233
7             chlorides          686.944031          0.191593   0.057985
8             sulphates          632.622375          0.176443   0.053400
9           citric acid          617.363770          0.172187   0.052112
10        fixed acidity          607.181152          0.169347   0.051252
11              density          547.883850          0.152809   0.046247



In [22]:

    
# Check the model performance on test dataset
h2o.performance(best_gbm_from_full_grid, wine_test)









    





H2ORegressionMetrics: gbm

MSE:  0.4196124
RMSE:  0.647775
MAE:  0.4896544
RMSLE:  0.09630233
Mean Residual Deviance :  0.4196124

Step 6 - GBM with CV, Early Stopping and Random Grid Search



In [23]:

    
# define the criteria for random grid search
search_criteria = list(strategy = "RandomDiscrete",
                       max_models = 9,
                       seed = 1234)



In [24]:

    
# define the range of hyper-parameters for grid search
# 27 combinations in total
param_list <- list(
    sample_rate = c(0.7, 0.8, 0.9),
    col_sample_rate = c(0.7, 0.8, 0.9),
    max_depth = c(3, 5, 7)
)
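
With `strategy = "RandomDiscrete"` and `max_models = 9`, only 9 of the 27 possible combinations are trained. The base R sketch below illustrates the idea by sampling rows from the full set of combinations. It uses R's own random number generator, so the combinations it picks will not match the ones H2O selects, and the names `hyper_space`, `all_combos` and `picked` are hypothetical.

```r
# Same hyper-parameter values as in the cell above
hyper_space <- list(
  sample_rate = c(0.7, 0.8, 0.9),
  col_sample_rate = c(0.7, 0.8, 0.9),
  max_depth = c(3, 5, 7)
)

# All 3 x 3 x 3 combinations
all_combos <- expand.grid(hyper_space)

# Draw 9 distinct combinations at random (sampling without replacement)
set.seed(1234)
picked <- all_combos[sample(nrow(all_combos), 9), ]
print(picked)

c(total = nrow(all_combos), sampled = nrow(picked))
```

Random search trades completeness for coverage: for the same budget of 9 models it can explore a third parameter (`max_depth`) that the full grid in Step 5 left fixed.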



In [25]:

    
# Set up GBM grid search
# Add a seed for reproducibility
gbm_rand_grid <- h2o.grid(
  
    # Core parameters for model training
    x = features,
    y = 'quality',
    training_frame = wine_train,
    ntrees = 10000,
    nfolds = 5,
    seed = 1234,

    # Parameters for grid search
    grid_id = "gbm_rand_grid",
    hyper_params = param_list,
    algorithm = "gbm",
    search_criteria = search_criteria,

    # Parameters for early stopping
    stopping_metric = "MSE",
    stopping_rounds = 15,
    score_tree_interval = 1
  
)









    



  |======================================================================| 100%



In [26]:

    
# Sort and show the grid search results
gbm_rand_grid <- h2o.getGrid(grid_id = "gbm_rand_grid", sort_by = "mse", decreasing = FALSE)
print(gbm_rand_grid)









    



H2O Grid Details
================

Grid ID: gbm_rand_grid 
Used hyper parameters: 
  -  col_sample_rate 
  -  max_depth 
  -  sample_rate 
Number of models: 9 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing mse
  col_sample_rate max_depth sample_rate             model_ids
1             0.9         7         0.9 gbm_rand_grid_model_5
2             0.7         7         0.7 gbm_rand_grid_model_1
3             0.9         7         0.7 gbm_rand_grid_model_6
4             0.8         7         0.7 gbm_rand_grid_model_4
5             0.7         5         0.8 gbm_rand_grid_model_0
6             0.8         3         0.9 gbm_rand_grid_model_7
7             0.9         3         0.9 gbm_rand_grid_model_2
8             0.8         3         0.8 gbm_rand_grid_model_3
9             0.7         3         0.7 gbm_rand_grid_model_8
                  mse
1  0.4227388012308513
2  0.4327748309201154
3  0.4369533108701783
4  0.4397321318633594
5   0.448140619501795
6  0.4647039373596571
7  0.4690321721360509
8 0.47384072192391513
9 0.47745552186979223



In [27]:

    
# Extract the best model from random grid search
best_model_id <- gbm_rand_grid@model_ids[[1]] # top of the list
best_gbm_from_rand_grid <- h2o.getModel(best_model_id)
summary(best_gbm_from_rand_grid)









    



Model Details:
==============

H2ORegressionModel: gbm
Model Key:  gbm_rand_grid_model_5 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1             142                      142               87919         7
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         7    7.00000         16         82    44.04930

H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  0.1197865
RMSE:  0.3461018
MAE:  0.2597976
RMSLE:  0.05153244
Mean Residual Deviance :  0.1197865



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.4227388
RMSE:  0.6501837
MAE:  0.4856414
RMSLE:  0.09723137
Mean Residual Deviance :  0.4227388


Cross-Validation Metrics Summary: 
                         mean           sd cv_1_valid cv_2_valid cv_3_valid
mae                 0.4854467  0.006546657 0.49238658  0.4940408 0.46821254
mse                0.42264447  0.003234772 0.42546168  0.4278089 0.42508337
r2                 0.46589816  0.010486309  0.4586358  0.4403908 0.47349313
residual_deviance  0.42264447  0.003234772 0.42546168  0.4278089 0.42508337
rmse                0.6501016 0.0024916055 0.65227425 0.65407103 0.65198416
rmsle             0.097222246  5.291425E-4 0.09675461 0.09714792 0.09677877
                  cv_4_valid  cv_5_valid
mae               0.48830998  0.48428363
mse               0.41522205  0.41964635
r2                0.48177877   0.4751923
residual_deviance 0.41522205  0.41964635
rmse              0.64437723  0.64780116
rmsle             0.09868797 0.096741945

Scoring History: 
            timestamp   duration number_of_trees training_rmse training_mae
1 2017-03-01 05:01:37 30.289 sec               0       0.89009      0.67683
2 2017-03-01 05:01:37 30.294 sec               1       0.85417      0.64726
3 2017-03-01 05:01:37 30.298 sec               2       0.81998      0.62140
4 2017-03-01 05:01:37 30.302 sec               3       0.79127      0.60341
5 2017-03-01 05:01:37 30.307 sec               4       0.76588      0.58860
  training_deviance
1           0.79225
2           0.72961
3           0.67237
4           0.62611
5           0.58657

---
              timestamp   duration number_of_trees training_rmse training_mae
138 2017-03-01 05:01:38 30.900 sec             137       0.35396      0.26628
139 2017-03-01 05:01:38 30.904 sec             138       0.35279      0.26520
140 2017-03-01 05:01:38 30.909 sec             139       0.35093      0.26371
141 2017-03-01 05:01:38 30.914 sec             140       0.34963      0.26282
142 2017-03-01 05:01:38 30.919 sec             141       0.34683      0.26059
143 2017-03-01 05:01:38 30.924 sec             142       0.34610      0.25980
    training_deviance
138           0.12528
139           0.12446
140           0.12315
141           0.12224
142           0.12029
143           0.11979

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

Variable Importances: 
               variable relative_importance scaled_importance percentage
1               alcohol         3769.951416          1.000000   0.275925
2      volatile acidity         1721.885010          0.456739   0.126026
3   free sulfur dioxide         1449.235962          0.384418   0.106070
4                    pH          991.609131          0.263030   0.072576
5               density          915.934631          0.242957   0.067038
6        residual sugar          876.687866          0.232546   0.064165
7  total sulfur dioxide          873.630737          0.231735   0.063942
8             sulphates          831.926697          0.220673   0.060889
9         fixed acidity          830.125793          0.220195   0.060757
10          citric acid          749.599487          0.198835   0.054864
11            chlorides          652.373901          0.173046   0.047748



In [28]:

    
# Check the model performance on test dataset
h2o.performance(best_gbm_from_rand_grid, wine_test)









    





H2ORegressionMetrics: gbm

MSE:  0.404719
RMSE:  0.6361753
MAE:  0.473215
RMSLE:  0.09498904
Mean Residual Deviance :  0.404719



In [29]:

    
h2o.performance(best_gbm_from_rand_grid, wine_test)@metrics$MSE









    




0.404718976240411

Comparison of Model Performance on Test Data



In [30]:

    
cat('GBM with Default Settings                        :', 
          h2o.performance(gbm_default, wine_test)@metrics$MSE, "\n")
cat('GBM with Manual Settings                         :', 
          h2o.performance(gbm_manual, wine_test)@metrics$MSE, "\n")
cat('GBM with Manual Settings & CV                    :', 
          h2o.performance(gbm_manual_cv, wine_test)@metrics$MSE, "\n")
cat('GBM with Manual Settings, CV & Early Stopping    :', 
          h2o.performance(gbm_manual_cv_es, wine_test)@metrics$MSE, "\n")
cat('GBM with CV, Early Stopping & Full Grid Search   :', 
          h2o.performance(best_gbm_from_full_grid, wine_test)@metrics$MSE, "\n")
cat('GBM with CV, Early Stopping & Random Grid Search :', 
          h2o.performance(best_gbm_from_rand_grid, wine_test)@metrics$MSE, "\n")









    



GBM with Default Settings                        : 0.4551121 
GBM with Manual Settings                         : 0.4432567 
GBM with Manual Settings & CV                    : 0.4432567 
GBM with Manual Settings, CV & Early Stopping    : 0.4287345 
GBM with CV, Early Stopping & Full Grid Search   : 0.4196124 
GBM with CV, Early Stopping & Random Grid Search : 0.404719
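
To put the differences in perspective, the test MSE values printed above can be expressed as a percentage improvement over the default model. This is a small base R sketch using only the numbers from the output; the vector name `test_mse` and the short labels are our own. Keep in mind that these figures come from a single train/test split, so small gaps between models may not be stable.

```r
# Test MSE values copied from the comparison above
test_mse <- c(
  "Default settings"     = 0.4551121,
  "Manual settings"      = 0.4432567,
  "Manual + CV"          = 0.4432567,
  "Manual + CV + ES"     = 0.4287345,
  "Full grid search"     = 0.4196124,
  "Random grid search"   = 0.404719
)

# Relative reduction in MSE compared with the default model, in percent
improvement_pct <- 100 * (test_mse[["Default settings"]] - test_mse) /
  test_mse[["Default settings"]]
print(round(improvement_pct, 1))

# Model with the lowest test MSE
best <- names(which.min(test_mse))
print(best)
```

On this split the random grid search model reduces the test MSE by roughly 11% relative to the default settings.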
