Machine Learning with H2O - Tutorial 3c: Regression Models (Ensembles)


Objective:

  • This tutorial explains how to create stacked ensembles of regression models for better out-of-sample performance on a held-out test set.

Wine Quality Dataset:

  • The white-wine subset of the UCI Wine Quality dataset (winequality-white.csv): 4,898 wines described by 11 physicochemical features, with the goal of predicting the 'quality' score.

Steps:

  1. Build GBM models using random grid search and extract the best one.
  2. Build DRF models using random grid search and extract the best one.
  3. Build DNN models using random grid search and extract the best one.
  4. Use model stacking to combine the best model from each grid.

Full Technical Reference:



In [1]:
# Import all required modules
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch

# Start and connect to a local H2O cluster
h2o.init(nthreads = -1)


Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from /home/joe/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp6kskjz_d
  JVM stdout: /tmp/tmp6kskjz_d/h2o_joe_started_from_python.out
  JVM stderr: /tmp/tmp6kskjz_d/h2o_joe_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: 01 secs
H2O cluster version: 3.10.5.2
H2O cluster version age: 10 days
H2O cluster name: H2O_from_python_joe_i7ekvz
H2O cluster total nodes: 1
H2O cluster free memory: 5.210 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
Python version: 3.6.1 final



In [2]:
# Import wine quality data from a local CSV file
wine = h2o.import_file("winequality-white.csv")
wine.head(5)


Parse progress: |█████████████████████████████████████████████████████████| 100%
Out[2]:
fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
7              0.27              0.36         20.7            0.045      45                   170                   1.001    3     0.45       8.8      6
6.3            0.3               0.34         1.6             0.049      14                   132                   0.994    3.3   0.49       9.5      6
8.1            0.28              0.4          6.9             0.05       30                   97                    0.9951   3.26  0.44       10.1     6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.4        9.9      6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.4        9.9      6


In [3]:
# Define features (or predictors)
features = list(wine.columns) # we want to use all the information
features.remove('quality')    # we need to exclude the target 'quality' (otherwise there is nothing to predict)
features


Out[3]:
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']
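
Before splitting and modeling, a quick sanity check of the target distribution can be helpful. A minimal sketch (not part of the original notebook) using H2OFrame.table() to count wines per quality score:

# Optional sanity check (sketch): number of wines per quality score
wine['quality'].table()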

In [4]:
# Split the H2O data frame into training/test sets
# so we can evaluate out-of-sample performance on held-out data
wine_split = wine.split_frame(ratios = [0.8], seed = 1234)

wine_train = wine_split[0] # using 80% for training
wine_test = wine_split[1]  # using the remaining 20% for out-of-sample evaluation

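As a side note, split_frame can also produce a three-way split if a separate validation frame is wanted. The names below (wine_train2 etc.) are hypothetical and unused in the rest of this tutorial:

# Three-way split (sketch): roughly 70% train / 15% validation / 15% test
splits = wine.split_frame(ratios = [0.7, 0.15], seed = 1234)
wine_train2, wine_valid2, wine_test2 = splits[0], splits[1], splits[2]
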
In [5]:
wine_train.shape


Out[5]:
(3932, 12)

In [6]:
wine_test.shape


Out[6]:
(966, 12)



In [7]:
# Define the criteria for random grid search (shared by all three grids below)
search_criteria = {'strategy': "RandomDiscrete", 
                   'max_models': 9,
                   'seed': 1234}
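
RandomDiscrete also supports time- and metric-based stopping in addition to max_models. A hedged sketch of a time-budgeted alternative (not used in this tutorial; max_runtime_secs is a standard H2O search-criteria key):

# Alternative (sketch): cap the random search by wall-clock time instead of model count
search_criteria_timed = {'strategy': "RandomDiscrete",
                         'max_runtime_secs': 120,
                         'seed': 1234}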


Step 1: Build GBM Models using Random Grid Search and Extract the Best Model


In [8]:
# Define the range of hyper-parameters for GBM grid search
# (27 combinations in total; the random search samples at most 9 of them,
# per the search_criteria defined above)
hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

In [9]:
# Set up GBM grid search
# Add a seed for reproducibility
gbm_rand_grid = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        model_id = 'gbm_rand_grid', 
                        seed = 1234,
                        ntrees = 10000,   
                        nfolds = 5,
                        fold_assignment = "Modulo",               # needed for stacked ensembles
                        keep_cross_validation_predictions = True, # needed for stacked ensembles
                        stopping_metric = 'mse', 
                        stopping_rounds = 15,     
                        score_tree_interval = 1),
                    search_criteria = search_criteria, 
                    hyper_params = hyper_params)

In [10]:
# Use .train() to start the grid search
gbm_rand_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)


gbm Grid Build progress: |████████████████████████████████████████████████| 100%

In [11]:
# Sort and show the grid search results
gbm_rand_grid_sorted = gbm_rand_grid.get_grid(sort_by='mse', decreasing=False)
print(gbm_rand_grid_sorted)


    col_sample_rate max_depth sample_rate  \
0               0.9         7         0.9   
1               0.8         7         0.7   
2               0.7         7         0.7   
3               0.9         7         0.7   
4               0.7         5         0.8   
5               0.8         3         0.9   
6               0.7         3         0.7   
7               0.9         3         0.9   
8               0.8         3         0.8   

                                                     model_ids  \
0  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_5   
1  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_4   
2  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_1   
3  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_6   
4  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_0   
5  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_7   
6  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_8   
7  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_2   
8  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_3   

                   mse  
0  0.41467703216892454  
1   0.4188744246328386  
2  0.42294704197026883  
3   0.4285238866231086  
4  0.44601214899796604  
5  0.46338551281728263  
6   0.4681243149102324  
7  0.46849996267402233  
8   0.4690100493856379  


In [12]:
# Extract the best model from random grid search
best_gbm_model_id = gbm_rand_grid_sorted.model_ids[0]
best_gbm_from_rand_grid = h2o.get_model(best_gbm_model_id)
best_gbm_from_rand_grid.summary()


Model Summary: 
number_of_trees  number_of_internal_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
168.0            168.0                     103534.0             7.0        7.0        7.0         13.0        82.0        43.809525
Out[12]:
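
Because each grid model was trained with 5-fold cross-validation, the per-fold metrics of the winning model are available. A quick sketch using the standard accessor:

# Per-fold cross-validation metrics of the best GBM (sketch)
best_gbm_from_rand_grid.cross_validation_metrics_summary()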


Step 2: Build DRF Models using Random Grid Search and Extract the Best Model


In [13]:
# Define the range of hyper-parameters for DRF grid search
# (27 combinations in total; again at most 9 are sampled)
hyper_params = {'sample_rate': [0.5, 0.6, 0.7],
                'col_sample_rate_per_tree': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

In [14]:
# Set up DRF grid search
# Add a seed for reproducibility
drf_rand_grid = H2OGridSearch(
                    H2ORandomForestEstimator(
                        model_id = 'drf_rand_grid', 
                        seed = 1234,
                        ntrees = 200,   
                        nfolds = 5,
                        fold_assignment = "Modulo",                 # needed for stacked ensembles
                        keep_cross_validation_predictions = True),  # needed for stacked ensembles
                    search_criteria = search_criteria, 
                    hyper_params = hyper_params)

In [15]:
# Use .train() to start the grid search
drf_rand_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)


drf Grid Build progress: |████████████████████████████████████████████████| 100%

In [16]:
# Sort and show the grid search results
drf_rand_grid_sorted = drf_rand_grid.get_grid(sort_by='mse', decreasing=False)
print(drf_rand_grid_sorted)


    col_sample_rate_per_tree max_depth sample_rate  \
0                        0.9         7         0.7   
1                        0.9         7         0.5   
2                        0.8         7         0.5   
3                        0.7         7         0.5   
4                        0.7         5         0.6   
5                        0.9         3         0.7   
6                        0.8         3         0.6   
7                        0.8         3         0.7   
8                        0.7         3         0.5   

                                                     model_ids  \
0  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_5   
1  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_6   
2  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_4   
3  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_1   
4  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_0   
5  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_2   
6  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_3   
7  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_7   
8  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_8   

                   mse  
0  0.48533899185762636  
1    0.487315432336594  
2  0.49004168463947945  
3   0.4927544483353685  
4   0.5307039662299886  
5   0.5846039939024897  
6   0.5850640013528532  
7   0.5855927668634072  
8   0.5857362760598669  


In [17]:
# Extract the best model from random grid search
best_drf_model_id = drf_rand_grid_sorted.model_ids[0]
best_drf_from_rand_grid = h2o.get_model(best_drf_model_id)
best_drf_from_rand_grid.summary()


Model Summary: 
number_of_trees  number_of_internal_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
200.0            200.0                     239756.0             7.0        7.0        7.0         70.0        111.0       90.265
Out[17]:
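
For tree-based models it is often useful to see which features drive the predictions. A sketch using H2O's variable importances (use_pandas returns a pandas DataFrame):

# Variable importances of the best DRF (sketch)
best_drf_from_rand_grid.varimp(use_pandas = True)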


Step 3: Build DNN Models using Random Grid Search and Extract the Best Model


In [18]:
# Define the range of hyper-parameters for DNN grid search
# (81 combinations in total; at most 9 are sampled)
hyper_params = {'activation': ['tanh', 'rectifier', 'maxout'],
                'hidden': [[50], [50,50], [50,50,50]],
                'l1': [0, 1e-3, 1e-5],
                'l2': [0, 1e-3, 1e-5]}

In [19]:
# Set up DNN grid search
# Add a seed for reproducibility
dnn_rand_grid = H2OGridSearch(
                    H2ODeepLearningEstimator(
                        model_id = 'dnn_rand_grid', 
                        seed = 1234,
                        epochs = 20,   
                        nfolds = 5,
                        fold_assignment = "Modulo",                # needed for stacked ensembles
                        keep_cross_validation_predictions = True), # needed for stacked ensembles
                    search_criteria = search_criteria, 
                    hyper_params = hyper_params)

In [20]:
# Use .train() to start the grid search
dnn_rand_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)


deeplearning Grid Build progress: |███████████████████████████████████████| 100%

In [21]:
# Sort and show the grid search results
dnn_rand_grid_sorted = dnn_rand_grid.get_grid(sort_by='mse', decreasing=False)
print(dnn_rand_grid_sorted)


    activation        hidden      l1      l2  \
0    Rectifier      [50, 50]  1.0E-5     0.0   
1       Maxout      [50, 50]     0.0  1.0E-5   
2       Maxout  [50, 50, 50]  1.0E-5  1.0E-5   
3         Tanh  [50, 50, 50]     0.0  1.0E-5   
4         Tanh  [50, 50, 50]  1.0E-5  1.0E-5   
5       Maxout          [50]     0.0     0.0   
6       Maxout          [50]  1.0E-5   0.001   
7         Tanh  [50, 50, 50]   0.001     0.0   
8       Maxout          [50]   0.001     0.0   

                                                              model_ids  \
0  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_2   
1  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_8   
2  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_3   
3  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_0   
4  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_7   
5  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_5   
6  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_6   
7  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_1   
8  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_4   

                  mse  
0  0.5108409743268761  
1   0.515796569537398  
2  0.5198162831198087  
3  0.5209406735213454  
4  0.5236483939826322  
5   0.525274387870051  
6  0.5264857429694172  
7  0.5273379914285496  
8  0.5349398188631833  


In [22]:
# Extract the best model from random grid search
best_dnn_model_id = dnn_rand_grid_sorted.model_ids[0]
best_dnn_from_rand_grid = h2o.get_model(best_dnn_model_id)
best_dnn_from_rand_grid.summary()


Status of Neuron Layers: predicting quality, regression, gaussian distribution, Quadratic loss, 3,201 weights/biases, 44.9 KB, 81,920 training samples, mini-batch size 1

layer  units  type       dropout  l1     l2   mean_rate  rate_rms   momentum  mean_weight  weight_rms  mean_bias  bias_rms
1      11     Input      0.0
2      50     Rectifier  0.0      1e-05  0.0  0.0014578  0.0006493  0.0       -0.0066843   0.2054468   0.3404232  0.0963176
3      50     Rectifier  0.0      1e-05  0.0  0.0232022  0.0358157  0.0       -0.0484576   0.1955158   0.8763745  0.1542610
4      1      Linear              1e-05  0.0  0.0005779  0.0003537  0.0       -0.0018716   0.1864522   0.0838900  0.0000000
Out[22]:
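
The best DNN can also score the held-out test set directly. A minimal sketch:

# Score the 20% held-out test set with the best DNN and peek at the predictions (sketch)
dnn_test_pred = best_dnn_from_rand_grid.predict(wine_test)
dnn_test_pred.head(5)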


Step 4: Model Stacking


In [23]:
# Define a list of models to be stacked
# i.e. best model from each grid
all_ids = [best_gbm_model_id, best_drf_model_id, best_dnn_model_id]

In [24]:
# Set up Stacked Ensemble
ensemble = H2OStackedEnsembleEstimator(model_id = "my_ensemble",
                                       base_models = all_ids)

In [25]:
# Use .train() to start model stacking
# GLM is the default metalearner
ensemble.train(x = features, 
               y = 'quality', 
               training_frame = wine_train)


stackedensemble Model Build progress: |███████████████████████████████████| 100%
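
Once built, the ensemble can be saved to disk and reloaded like any other H2O model. A sketch, assuming /tmp/h2o_models is a writable placeholder path (support for persisting stacked ensembles may depend on the H2O version):

# Save the stacked ensemble and load it back (sketch)
model_path = h2o.save_model(model = ensemble, path = "/tmp/h2o_models", force = True)
loaded_ensemble = h2o.load_model(model_path)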


Comparison of Model Performance on Test Data


In [26]:
print('Best GBM model from Grid (MSE) : ', best_gbm_from_rand_grid.model_performance(wine_test).mse())
print('Best DRF model from Grid (MSE) : ', best_drf_from_rand_grid.model_performance(wine_test).mse())
print('Best DNN model from Grid (MSE) : ', best_dnn_from_rand_grid.model_performance(wine_test).mse())
print('Stacked Ensembles        (MSE) : ', ensemble.model_performance(wine_test).mse())


Best GBM model from Grid (MSE) :  0.4013942890547201
Best DRF model from Grid (MSE) :  0.4781568285687009
Best DNN model from Grid (MSE) :  0.5199803154635598
Stacked Ensembles        (MSE) :  0.39948493548786057
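
The same comparison can be written as a loop, which makes it easy to report further regression metrics such as RMSE and MAE (accessor names assume a reasonably recent H2O version):

# Compare all four models on the held-out test set across several metrics (sketch)
models = [('Best GBM', best_gbm_from_rand_grid),
          ('Best DRF', best_drf_from_rand_grid),
          ('Best DNN', best_dnn_from_rand_grid),
          ('Stacked Ensemble', ensemble)]
for name, model in models:
    perf = model.model_performance(wine_test)
    print('%-16s MSE: %.4f  RMSE: %.4f  MAE: %.4f' % (name, perf.mse(), perf.rmse(), perf.mae()))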