Machine Learning with H2O - Tutorial 3b: Regression Models (Grid Search)


Objective:

  • This tutorial explains how to fine-tune regression models for better out-of-sample performance, evaluated on a held-out test set.

Wine Quality Dataset:

  • White wine quality data from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality): 4,898 wines described by 11 physicochemical features, with a quality score as the regression target.

Steps:

  1. GBM with default settings
  2. GBM with manual settings
  3. GBM with manual settings & cross-validation
  4. GBM with manual settings, cross-validation and early stopping
  5. GBM with cross-validation, early stopping and full grid search
  6. GBM with cross-validation, early stopping and random grid search
  7. Model stacking (combining different GLM, DRF, GBM and DNN models)

Full Technical Reference:

  • H2O Python Module Documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html

In [1]:
# Start and connect to a local H2O cluster
import h2o
h2o.init(nthreads = -1)


Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from /home/joe/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp68uwhnzg
  JVM stdout: /tmp/tmp68uwhnzg/h2o_joe_started_from_python.out
  JVM stderr: /tmp/tmp68uwhnzg/h2o_joe_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: 01 secs
H2O cluster version: 3.10.5.2
H2O cluster version age: 10 days
H2O cluster name: H2O_from_python_joe_wncaln
H2O cluster total nodes: 1
H2O cluster free memory: 5.210 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
Python version: 3.6.1 final


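If memory is tight on a shared machine, h2o.init() also accepts a max_mem_size argument to cap the JVM heap; a minimal sketch, assuming roughly 4 GB is free (if a cluster is already running, init simply connects to it and these settings are ignored):

In [ ]:
# Optional: start H2O with an explicit memory cap
h2o.init(nthreads = -1, max_mem_size = "4G")
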

In [2]:
# Import wine quality data from a local CSV file
wine = h2o.import_file("winequality-white.csv")
wine.head(5)


Parse progress: |█████████████████████████████████████████████████████████| 100%
fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
7              0.27              0.36         20.7            0.045      45                   170                   1.001    3     0.45       8.8      6
6.3            0.3               0.34         1.6             0.049      14                   132                   0.994    3.3   0.49       9.5      6
8.1            0.28              0.4          6.9             0.05       30                   97                    0.9951   3.26  0.44       10.1     6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.4        9.9      6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.4        9.9      6
Out[2]:

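Before modelling, it is worth confirming that every column (including the target 'quality') parsed as numeric. The frame's describe() method prints column types and summary statistics:

In [ ]:
# Confirm column types and summary statistics
wine.describe()
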

In [3]:
# Define features (or predictors)
features = list(wine.columns) # we want to use all the information
features.remove('quality')    # we need to exclude the target 'quality' (otherwise there is nothing to predict)
features


Out[3]:
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

In [4]:
# Split the H2O data frame into training/test sets
# so we can evaluate out-of-sample performance
wine_split = wine.split_frame(ratios = [0.8], seed = 1234)

wine_train = wine_split[0] # using 80% for training
wine_test = wine_split[1]  # using the remaining 20% for out-of-sample evaluation

In [5]:
wine_train.shape


Out[5]:
(3932, 12)

In [6]:
wine_test.shape


Out[6]:
(966, 12)

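Note that split_frame() assigns rows probabilistically, so the split is approximately (not exactly) 80/20; here 3,932 + 966 = 4,898 rows, the full dataset. A quick sanity check:

In [ ]:
# Sanity check: the two splits should cover the full dataset
print(wine.nrow, wine_train.nrow + wine_test.nrow)  # both 4898
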

Step 1 - Gradient Boosting Machines (GBM) with Default Settings


In [7]:
# Build a Gradient Boosting Machines (GBM) model with default settings

# Import the function for GBM
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Set up GBM for regression
# Add a seed for reproducibility
gbm_default = H2OGradientBoostingEstimator(model_id = 'gbm_default', 
                                           seed = 1234)

# Use .train() to build the model
gbm_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)


gbm Model Build progress: |███████████████████████████████████████████████| 100%

In [8]:
# Check the model performance on test dataset
gbm_default.model_performance(wine_test)


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.45511211588709155
RMSE: 0.6746199788674299
MAE: 0.5219768028633305
RMSLE: 0.10013755931021842
Mean Residual Deviance: 0.45511211588709155
Out[8]:

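This model used H2O's default GBM settings (at the time of writing, 50 trees of depth 5 with a learn rate of 0.1). You can confirm what was actually built from the model summary:

In [ ]:
# Inspect the structure the default model actually built
gbm_default.summary()
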

Step 2 - GBM with Manual Settings


In [9]:
# Build a GBM with manual settings

# Set up GBM for regression
# Add a seed for reproducibility
gbm_manual = H2OGradientBoostingEstimator(model_id = 'gbm_manual', 
                                          seed = 1234,
                                          ntrees = 100,
                                          sample_rate = 0.9,
                                          col_sample_rate = 0.9)

# Use .train() to build the model
gbm_manual.train(x = features, 
                 y = 'quality', 
                 training_frame = wine_train)


gbm Model Build progress: |███████████████████████████████████████████████| 100%

In [10]:
# Check the model performance on test dataset
gbm_manual.model_performance(wine_test)


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.44325665649714924
RMSE: 0.6657752297113112
MAE: 0.5114358481376113
RMSLE: 0.09895809708429235
Mean Residual Deviance: 0.44325665649714924
Out[10]:


Step 3 - GBM with Manual Settings & Cross-Validation (CV)


In [11]:
# Build a GBM with manual settings & cross-validation

# Set up GBM for regression
# Add a seed for reproducibility
gbm_manual_cv = H2OGradientBoostingEstimator(model_id = 'gbm_manual_cv', 
                                             seed = 1234,
                                             ntrees = 100,
                                             sample_rate = 0.9,
                                             col_sample_rate = 0.9,
                                             nfolds = 5)
                                            
# Use .train() to build the model
gbm_manual_cv.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)


gbm Model Build progress: |███████████████████████████████████████████████| 100%

In [12]:
# Check the cross-validation model performance
gbm_manual_cv


Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_manual_cv


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.27438346229216
RMSE: 0.5238162485950202
MAE: 0.4075920913493524
RMSLE: 0.0774835431572533
Mean Residual Deviance: 0.27438346229216

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 0.45021820302163834
RMSE: 0.6709830124687497
MAE: 0.5185163944803867
RMSLE: 0.10007842575662584
Mean Residual Deviance: 0.45021820302163834
Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
mae 0.5183839 0.0057996 0.5173063 0.5328950 0.507699 0.5151827 0.5188366
mean_residual_deviance 0.4501753 0.0090064 0.4428255 0.4701306 0.4537360 0.4317105 0.4524740
mse 0.4501753 0.0090064 0.4428255 0.4701306 0.4537360 0.4317105 0.4524740
r2 0.4309830 0.0176490 0.4365418 0.3850305 0.4380041 0.4612002 0.4341383
residual_deviance 0.4501753 0.0090064 0.4428255 0.4701306 0.4537360 0.4317105 0.4524740
rmse 0.670884 0.0067072 0.6654513 0.6856607 0.6735993 0.6570468 0.6726618
rmsle 0.1000738 0.0007271 0.0983455 0.1015277 0.0999000 0.1004637 0.1001324
Scoring History: 
timestamp duration number_of_trees training_rmse training_mae training_deviance
2017-06-29 23:25:30 2.770 sec 0.0 0.8900853 0.6768335 0.7922518
2017-06-29 23:25:30 2.779 sec 1.0 0.8599939 0.6513445 0.7395894
2017-06-29 23:25:30 2.784 sec 2.0 0.8321552 0.6295051 0.6924822
2017-06-29 23:25:30 2.791 sec 3.0 0.8081890 0.6151421 0.6531695
2017-06-29 23:25:30 2.797 sec 4.0 0.7882018 0.6064024 0.6212620
--- --- --- --- --- --- ---
2017-06-29 23:25:30 3.190 sec 96.0 0.5267685 0.4106401 0.2774850
2017-06-29 23:25:30 3.193 sec 97.0 0.5255850 0.4095203 0.2762396
2017-06-29 23:25:30 3.196 sec 98.0 0.5252608 0.4091795 0.2758989
2017-06-29 23:25:30 3.200 sec 99.0 0.5239927 0.4076997 0.2745683
2017-06-29 23:25:30 3.203 sec 100.0 0.5238162 0.4075921 0.2743835
See the whole table with table.as_data_frame()
Variable Importances: 
variable relative_importance scaled_importance percentage
alcohol 3520.4504395 1.0 0.3371040
volatile acidity 1474.0030518 0.4186973 0.1411445
free sulfur dioxide 1111.8027344 0.3158126 0.1064617
pH 621.6004639 0.1765684 0.0595219
residual sugar 608.0207520 0.1727111 0.0582216
total sulfur dioxide 592.9692993 0.1684356 0.0567803
fixed acidity 558.3989868 0.1586158 0.0534700
density 545.3736572 0.1549159 0.0522228
citric acid 479.6713257 0.1362528 0.0459314
sulphates 474.8290405 0.1348774 0.0454677
chlorides 456.0965881 0.1295563 0.0436740
Out[12]:


In [13]:
# Check the model performance on test dataset
gbm_manual_cv.model_performance(wine_test)
# Results should match gbm_manual above, as the model is trained
# with the same parameters (cross-validation only adds evaluation)


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.44325665649714924
RMSE: 0.6657752297113112
MAE: 0.5114358481376113
RMSLE: 0.09895809708429235
Mean Residual Deviance: 0.44325665649714924
Out[13]:

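The metrics for gbm_manual_cv can also be pulled out programmatically, which is handy when comparing many models; a minimal sketch using the model's metric accessors:

In [ ]:
# Retrieve metrics programmatically instead of reading the printout
print('Train MSE:', gbm_manual_cv.mse(train = True))
print('CV MSE   :', gbm_manual_cv.mse(xval = True))
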

Step 4 - GBM with Manual Settings, CV and Early Stopping


In [14]:
# Build a GBM with manual settings, CV and early stopping

# Set up GBM for regression
# Add a seed for reproducibility
gbm_manual_cv_es = H2OGradientBoostingEstimator(model_id = 'gbm_manual_cv_es', 
                                                seed = 1234,
                                                ntrees = 10000,   # increase the number of trees 
                                                sample_rate = 0.9,
                                                col_sample_rate = 0.9,
                                                nfolds = 5,
                                                stopping_metric = 'mse',  # let early stopping determine
                                                stopping_rounds = 15,     # the optimal number of trees
                                                score_tree_interval = 1)  # by monitoring MSE after every tree
# Use .train() to build the model
gbm_manual_cv_es.train(x = features, 
                       y = 'quality', 
                       training_frame = wine_train)


gbm Model Build progress: |███████████████████████████████████████████████| 100%

In [15]:
# Check the model summary
gbm_manual_cv_es.summary()


Model Summary: 
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
155.0 155.0 49771.0 5.0 5.0 5.0 7.0 32.0 20.374193
Out[15]:


In [16]:
# Check the cross-validation model performance
gbm_manual_cv_es


Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_manual_cv_es


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.22107991282362896
RMSE: 0.470191357665822
MAE: 0.36200557768890596
RMSLE: 0.06954327915133354
Mean Residual Deviance: 0.22107991282362896

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 0.44288792056243054
RMSE: 0.665498249856775
MAE: 0.5094014952755754
RMSLE: 0.09937081861305609
Mean Residual Deviance: 0.44288792056243054
Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
mae 0.5095274 0.0064806 0.4993615 0.5179948 0.4980329 0.5125335 0.5197147
mean_residual_deviance 0.4430673 0.0067716 0.4305977 0.4522071 0.4484727 0.4323305 0.4517285
mse 0.4430673 0.0067716 0.4305977 0.4522071 0.4484727 0.4323305 0.4517285
r2 0.4401194 0.0126575 0.4521007 0.4084759 0.4445232 0.4604265 0.4350706
residual_deviance 0.4430673 0.0067716 0.4305977 0.4522071 0.4484727 0.4323305 0.4517285
rmse 0.665594 0.0050970 0.6561994 0.6724635 0.6696811 0.6575184 0.6721075
rmsle 0.0993869 0.0008079 0.0972056 0.0996676 0.0995299 0.1005154 0.1000163
Scoring History: 
timestamp duration number_of_trees training_rmse training_mae training_deviance
2017-06-29 23:25:37 6.820 sec 0.0 0.8900853 0.6768335 0.7922518
2017-06-29 23:25:37 6.825 sec 1.0 0.8599939 0.6513445 0.7395894
2017-06-29 23:25:37 6.829 sec 2.0 0.8321552 0.6295051 0.6924822
2017-06-29 23:25:37 6.832 sec 3.0 0.8081890 0.6151421 0.6531695
2017-06-29 23:25:37 6.836 sec 4.0 0.7882018 0.6064024 0.6212620
--- --- --- --- --- --- ---
2017-06-29 23:25:38 7.395 sec 151.0 0.4719094 0.3635129 0.2226985
2017-06-29 23:25:38 7.399 sec 152.0 0.4714653 0.3632114 0.2222796
2017-06-29 23:25:38 7.403 sec 153.0 0.4712549 0.3630109 0.2220812
2017-06-29 23:25:38 7.409 sec 154.0 0.4704735 0.3622888 0.2213453
2017-06-29 23:25:38 7.413 sec 155.0 0.4701914 0.3620056 0.2210799
See the whole table with table.as_data_frame()
Variable Importances: 
variable relative_importance scaled_importance percentage
alcohol 3619.5112305 1.0 0.3124334
volatile acidity 1571.8613281 0.4342745 0.1356818
free sulfur dioxide 1227.1920166 0.3390491 0.1059302
pH 727.2174072 0.2009159 0.0627728
residual sugar 694.1660156 0.1917845 0.0599199
total sulfur dioxide 680.5446777 0.1880212 0.0587441
fixed acidity 664.0781860 0.1834718 0.0573227
density 651.0623779 0.1798758 0.0561992
citric acid 594.4930420 0.1642468 0.0513162
sulphates 592.9971313 0.1638335 0.0511870
chlorides 561.7833252 0.1552097 0.0484927
Out[16]:


In [17]:
# Check the model performance on test dataset
gbm_manual_cv_es.model_performance(wine_test)


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.4287344643828695
RMSE: 0.6547781795256081
MAE: 0.4990124321946826
RMSLE: 0.09753734379917677
Mean Residual Deviance: 0.4287344643828695
Out[17]:

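Early stopping halted training at 155 trees out of the 10,000 allowed. To see how the training error evolved and where stopping kicked in, inspect the scoring history (returned as a pandas DataFrame):

In [ ]:
# Scoring history: one row per scored tree (score_tree_interval = 1)
history = gbm_manual_cv_es.scoring_history()
history[['number_of_trees', 'training_rmse']].tail()
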

Step 5 - GBM with CV, Early Stopping and Full Grid Search

In [18]:
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch

In [19]:
# define the criteria for full grid search
search_criteria = {'strategy': "Cartesian"}

In [20]:
# define the range of hyper-parameters for grid search
# 3 x 3 = 9 combinations in total
hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9]}

In [21]:
# Set up GBM grid search
# Add a seed for reproducibility
gbm_full_grid = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        model_id = 'gbm_full_grid', 
                        seed = 1234,
                        ntrees = 10000,   
                        nfolds = 5,
                        stopping_metric = 'mse', 
                        stopping_rounds = 15,     
                        score_tree_interval = 1),
                    search_criteria = search_criteria, # full grid search
                    hyper_params = hyper_params)

In [22]:
# Use .train() to start the grid search
gbm_full_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)


gbm Grid Build progress: |████████████████████████████████████████████████| 100%

In [23]:
# Sort and show the grid search results
gbm_full_grid_sorted = gbm_full_grid.get_grid(sort_by='mse', decreasing=False)
print(gbm_full_grid_sorted)


    col_sample_rate sample_rate  \
0               0.8         0.9   
1               0.7         0.9   
2               0.8         0.8   
3               0.9         0.9   
4               0.9         0.8   
5               0.9         0.7   
6               0.7         0.8   
7               0.7         0.7   
8               0.8         0.7   

                                                     model_ids  \
0  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_1_model_7   
1  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_1_model_6   
2  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_1_model_4   
3  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_1_model_8   
4  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_1_model_5   
5  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_1_model_2   
6  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_1_model_3   
7  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_1_model_0   
8  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_1_model_1   

                   mse  
0  0.43780785687779805  
1  0.44060532786277523  
2  0.44096100224896634  
3  0.44288792056243054  
4  0.44475412455519636  
5   0.4457317997358452  
6    0.448140619501795  
7   0.4528872144586896  
8   0.4529771807006373  


In [24]:
# Extract the best model from full grid search
best_model_id = gbm_full_grid_sorted.model_ids[0]
best_gbm_from_full_grid = h2o.get_model(best_model_id)
best_gbm_from_full_grid.summary()


Model Summary: 
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
187.0 187.0 57180.0 5.0 5.0 5.0 7.0 31.0 19.160427
Out[24]:


In [25]:
# Check the model performance on test dataset
best_gbm_from_full_grid.model_performance(wine_test)


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.4196124030489544
RMSE: 0.6477749632773363
MAE: 0.48965435078727043
RMSLE: 0.09630232810628427
Mean Residual Deviance: 0.4196124030489544
Out[25]:

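The best model from a grid can be persisted to disk and reloaded later; a minimal sketch (the path below is just a placeholder):

In [ ]:
# Save the best model to disk and load it back
model_path = h2o.save_model(model = best_gbm_from_full_grid,
                            path = "/tmp/h2o_models",  # placeholder path
                            force = True)              # overwrite if present
loaded_model = h2o.load_model(model_path)
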
Step 6 - GBM with CV, Early Stopping and Random Grid Search

In [26]:
# define the criteria for random grid search
search_criteria = {'strategy': "RandomDiscrete", 
                   'max_models': 9,
                   'seed': 1234}

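Besides max_models, a random search can also be bounded by wall-clock time; a hedged variant of the criteria above using the standard max_runtime_secs option:

In [ ]:
# Alternative: bound the random search by time instead of model count
search_criteria_timed = {'strategy': "RandomDiscrete",
                         'max_runtime_secs': 120,  # stop after ~2 minutes
                         'seed': 1234}
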
In [27]:
# define the range of hyper-parameters for grid search
# 27 combinations in total (the random search will try at most 9 of them)
hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

In [28]:
# Set up GBM grid search
# Add a seed for reproducibility
gbm_rand_grid = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        model_id = 'gbm_rand_grid', 
                        seed = 1234,
                        ntrees = 10000,   
                        nfolds = 5,
                        stopping_metric = 'mse', 
                        stopping_rounds = 15,     
                        score_tree_interval = 1),
                    search_criteria = search_criteria, # random grid search
                    hyper_params = hyper_params)

In [29]:
# Use .train() to start the grid search
gbm_rand_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)


gbm Grid Build progress: |████████████████████████████████████████████████| 100%

In [30]:
# Sort and show the grid search results
gbm_rand_grid_sorted = gbm_rand_grid.get_grid(sort_by='mse', decreasing=False)
print(gbm_rand_grid_sorted)


    col_sample_rate max_depth sample_rate  \
0               0.9         7         0.9   
1               0.7         7         0.7   
2               0.9         7         0.7   
3               0.8         7         0.7   
4               0.7         5         0.8   
5               0.8         3         0.9   
6               0.9         3         0.9   
7               0.8         3         0.8   
8               0.7         3         0.7   

                                                     model_ids  \
0  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_2_model_5   
1  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_2_model_1   
2  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_2_model_6   
3  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_2_model_4   
4  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_2_model_0   
5  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_2_model_7   
6  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_2_model_2   
7  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_2_model_3   
8  Grid_GBM_py_4_sid_9f52_model_python_1498775122375_2_model_8   

                   mse  
0   0.4227388012308513  
1   0.4327748309201154  
2   0.4369533108701783  
3   0.4397321318633594  
4    0.448140619501795  
5   0.4647039373596571  
6   0.4690321721360509  
7  0.47384072192391513  
8  0.47745552186979223  


In [31]:
# Extract the best model from random grid search
best_model_id = gbm_rand_grid_sorted.model_ids[0]
best_gbm_from_rand_grid = h2o.get_model(best_model_id)
best_gbm_from_rand_grid.summary()


Model Summary: 
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
142.0 142.0 87920.0 7.0 7.0 7.0 16.0 82.0 44.049297
Out[31]:


In [32]:
# Check the model performance on test dataset
best_gbm_from_rand_grid.model_performance(wine_test)


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.4047189762404106
RMSE: 0.636175271635428
MAE: 0.47321498369668896
RMSLE: 0.09498904157909563
Mean Residual Deviance: 0.4047189762404106
Out[32]:


Comparison of Model Performance on Test Data


In [33]:
print('GBM with Default Settings                        :', gbm_default.model_performance(wine_test).mse())
print('GBM with Manual Settings                         :', gbm_manual.model_performance(wine_test).mse())
print('GBM with Manual Settings & CV                    :', gbm_manual_cv.model_performance(wine_test).mse())
print('GBM with Manual Settings, CV & Early Stopping    :', gbm_manual_cv_es.model_performance(wine_test).mse())
print('GBM with CV, Early Stopping & Full Grid Search   :', 
          best_gbm_from_full_grid.model_performance(wine_test).mse())
print('GBM with CV, Early Stopping & Random Grid Search :', 
          best_gbm_from_rand_grid.model_performance(wine_test).mse())


GBM with Default Settings                        : 0.45511211588709155
GBM with Manual Settings                         : 0.44325665649714924
GBM with Manual Settings & CV                    : 0.44325665649714924
GBM with Manual Settings, CV & Early Stopping    : 0.4287344643828695
GBM with CV, Early Stopping & Full Grid Search   : 0.4196124030489544
GBM with CV, Early Stopping & Random Grid Search : 0.4047189762404106
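
Finally, the winning model can score new data with predict(); a quick sketch on the test frame:

In [ ]:
# Generate predictions on the test set with the best model
predictions = best_gbm_from_rand_grid.predict(wine_test)
predictions.head(5)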