Machine Learning with H2O - Tutorial 3c: Regression Models (Ensembles)


Objective:

  • This tutorial explains how to create stacked ensembles of regression models for better out-of-sample performance on a held-out test set.

Wine Quality Dataset:

  • The white-wine subset of the UCI Wine Quality dataset (winequality-white.csv): 4,898 wines described by 11 physicochemical features, with the goal of predicting the 'quality' score.

Steps:

  1. Build GBM models using random grid search and extract the best one.
  2. Build DRF models using random grid search and extract the best one.
  3. Build DNN models using random grid search and extract the best one.
  4. Use model stacking to combine the best model from each grid.

Full Technical Reference:



In [1]:
# Import all required modules
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch

# Start and connect to a local H2O cluster
h2o.init(nthreads = -1)


Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from /home/joe/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp6kskjz_d
  JVM stdout: /tmp/tmp6kskjz_d/h2o_joe_started_from_python.out
  JVM stderr: /tmp/tmp6kskjz_d/h2o_joe_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: 01 secs
H2O cluster version: 3.10.5.2
H2O cluster version age: 10 days
H2O cluster name: H2O_from_python_joe_i7ekvz
H2O cluster total nodes: 1
H2O cluster free memory: 5.210 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
Python version: 3.6.1 final



In [2]:
# Import wine quality data from a local CSV file
wine = h2o.import_file("winequality-white.csv")
wine.head(5)


Parse progress: |█████████████████████████████████████████████████████████| 100%
Out[2]:
fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density  pH    sulphates  alcohol  quality
7              0.27              0.36         20.7            0.045      45                   170                   1.001    3     0.45       8.8      6
6.3            0.3               0.34         1.6             0.049      14                   132                   0.994    3.3   0.49       9.5      6
8.1            0.28              0.4          6.9             0.05       30                   97                    0.9951   3.26  0.44       10.1     6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.4        9.9      6
7.2            0.23              0.32         8.5             0.058      47                   186                   0.9956   3.19  0.4        9.9      6


In [3]:
# Define features (or predictors)
features = list(wine.columns) # we want to use all the information
features.remove('quality')    # we need to exclude the target 'quality' (otherwise there is nothing to predict)
features


Out[3]:
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']
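
Before splitting and modeling, a quick sanity check of the target distribution can be helpful. A minimal sketch (not part of the original notebook) using H2OFrame.table() to count wines per quality score:

# Optional sanity check (sketch): number of wines per quality score
wine['quality'].table()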

In [4]:
# Split the H2O data frame into training/test sets
# so we can evaluate out-of-sample performance on held-out data
wine_split = wine.split_frame(ratios = [0.8], seed = 1234)

wine_train = wine_split[0] # using 80% for training
wine_test = wine_split[1]  # using the remaining 20% for out-of-sample evaluation

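As a side note, split_frame can also produce a three-way split if a separate validation frame is wanted. The names below (wine_train2 etc.) are hypothetical and unused in the rest of this tutorial:

# Three-way split (sketch): roughly 70% train / 15% validation / 15% test
splits = wine.split_frame(ratios = [0.7, 0.15], seed = 1234)
wine_train2, wine_valid2, wine_test2 = splits[0], splits[1], splits[2]
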
In [5]:
wine_train.shape


Out[5]:
(3932, 12)

In [6]:
wine_test.shape


Out[6]:
(966, 12)



In [7]:
# Define the criteria for random grid search (shared by all three grids below)
search_criteria = {'strategy': "RandomDiscrete", 
                   'max_models': 9,
                   'seed': 1234}
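
RandomDiscrete also supports time- and metric-based stopping in addition to max_models. A hedged sketch of a time-budgeted alternative (not used in this tutorial; max_runtime_secs is a standard H2O search-criteria key):

# Alternative (sketch): cap the random search by wall-clock time instead of model count
search_criteria_timed = {'strategy': "RandomDiscrete",
                         'max_runtime_secs': 120,
                         'seed': 1234}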


Step 1: Build GBM Models using Random Grid Search and Extract the Best Model


In [8]:
# Define the range of hyper-parameters for GBM grid search
# (27 combinations in total; the random search samples at most 9 of them,
# per the search_criteria defined above)
hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

In [9]:
# Set up GBM grid search
# Add a seed for reproducibility
gbm_rand_grid = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        model_id = 'gbm_rand_grid', 
                        seed = 1234,
                        ntrees = 10000,   
                        nfolds = 5,
                        fold_assignment = "Modulo",               # needed for stacked ensembles
                        keep_cross_validation_predictions = True, # needed for stacked ensembles
                        stopping_metric = 'mse', 
                        stopping_rounds = 15,     
                        score_tree_interval = 1),
                    search_criteria = search_criteria, 
                    hyper_params = hyper_params)

In [10]:
# Use .train() to start the grid search
gbm_rand_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)


gbm Grid Build progress: |████████████████████████████████████████████████| 100%

In [11]:
# Sort and show the grid search results
gbm_rand_grid_sorted = gbm_rand_grid.get_grid(sort_by='mse', decreasing=False)
print(gbm_rand_grid_sorted)


    col_sample_rate max_depth sample_rate  \
0               0.9         7         0.9   
1               0.8         7         0.7   
2               0.7         7         0.7   
3               0.9         7         0.7   
4               0.7         5         0.8   
5               0.8         3         0.9   
6               0.7         3         0.7   
7               0.9         3         0.9   
8               0.8         3         0.8   

                                                     model_ids  \
0  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_5   
1  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_4   
2  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_1   
3  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_6   
4  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_0   
5  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_7   
6  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_8   
7  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_2   
8  Grid_GBM_py_4_sid_94fe_model_python_1498775536496_1_model_3   

                   mse  
0  0.41467703216892454  
1   0.4188744246328386  
2  0.42294704197026883  
3   0.4285238866231086  
4  0.44601214899796604  
5  0.46338551281728263  
6   0.4681243149102324  
7  0.46849996267402233  
8   0.4690100493856379  


In [12]:
# Extract the best model from random grid search
best_gbm_model_id = gbm_rand_grid_sorted.model_ids[0]
best_gbm_from_rand_grid = h2o.get_model(best_gbm_model_id)
best_gbm_from_rand_grid.summary()


Model Summary: 
number_of_trees  number_of_internal_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
168.0            168.0                     103534.0             7.0        7.0        7.0         13.0        82.0        43.809525
Out[12]:
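
Because each grid model was trained with 5-fold cross-validation, the per-fold metrics of the winning model are available. A quick sketch using the standard accessor:

# Per-fold cross-validation metrics of the best GBM (sketch)
best_gbm_from_rand_grid.cross_validation_metrics_summary()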


Step 2: Build DRF Models using Random Grid Search and Extract the Best Model


In [13]:
# Define the range of hyper-parameters for DRF grid search
# (27 combinations in total; again at most 9 are sampled)
hyper_params = {'sample_rate': [0.5, 0.6, 0.7],
                'col_sample_rate_per_tree': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

In [14]:
# Set up DRF grid search
# Add a seed for reproducibility
drf_rand_grid = H2OGridSearch(
                    H2ORandomForestEstimator(
                        model_id = 'drf_rand_grid', 
                        seed = 1234,
                        ntrees = 200,   
                        nfolds = 5,
                        fold_assignment = "Modulo",                 # needed for stacked ensembles
                        keep_cross_validation_predictions = True),  # needed for stacked ensembles
                    search_criteria = search_criteria, 
                    hyper_params = hyper_params)

In [15]:
# Use .train() to start the grid search
drf_rand_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)


drf Grid Build progress: |████████████████████████████████████████████████| 100%

In [16]:
# Sort and show the grid search results
drf_rand_grid_sorted = drf_rand_grid.get_grid(sort_by='mse', decreasing=False)
print(drf_rand_grid_sorted)


    col_sample_rate_per_tree max_depth sample_rate  \
0                        0.9         7         0.7   
1                        0.9         7         0.5   
2                        0.8         7         0.5   
3                        0.7         7         0.5   
4                        0.7         5         0.6   
5                        0.9         3         0.7   
6                        0.8         3         0.6   
7                        0.8         3         0.7   
8                        0.7         3         0.5   

                                                     model_ids  \
0  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_5   
1  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_6   
2  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_4   
3  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_1   
4  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_0   
5  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_2   
6  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_3   
7  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_7   
8  Grid_DRF_py_4_sid_94fe_model_python_1498775536496_2_model_8   

                   mse  
0  0.48533899185762636  
1    0.487315432336594  
2  0.49004168463947945  
3   0.4927544483353685  
4   0.5307039662299886  
5   0.5846039939024897  
6   0.5850640013528532  
7   0.5855927668634072  
8   0.5857362760598669  


In [17]:
# Extract the best model from random grid search
best_drf_model_id = drf_rand_grid_sorted.model_ids[0]
best_drf_from_rand_grid = h2o.get_model(best_drf_model_id)
best_drf_from_rand_grid.summary()


Model Summary: 
number_of_trees  number_of_internal_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
200.0            200.0                     239756.0             7.0        7.0        7.0         70.0        111.0       90.265
Out[17]:
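
For tree-based models it is often useful to see which features drive the predictions. A sketch using H2O's variable importances (use_pandas returns a pandas DataFrame):

# Variable importances of the best DRF (sketch)
best_drf_from_rand_grid.varimp(use_pandas = True)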


Step 3: Build DNN Models using Random Grid Search and Extract the Best Model


In [18]:
# Define the range of hyper-parameters for DNN grid search
# (81 combinations in total; at most 9 are sampled)
hyper_params = {'activation': ['tanh', 'rectifier', 'maxout'],
                'hidden': [[50], [50,50], [50,50,50]],
                'l1': [0, 1e-3, 1e-5],
                'l2': [0, 1e-3, 1e-5]}

In [19]:
# Set up DNN grid search
# Add a seed for reproducibility
dnn_rand_grid = H2OGridSearch(
                    H2ODeepLearningEstimator(
                        model_id = 'dnn_rand_grid', 
                        seed = 1234,
                        epochs = 20,   
                        nfolds = 5,
                        fold_assignment = "Modulo",                # needed for stacked ensembles
                        keep_cross_validation_predictions = True), # needed for stacked ensembles
                    search_criteria = search_criteria, 
                    hyper_params = hyper_params)

In [20]:
# Use .train() to start the grid search
dnn_rand_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)


deeplearning Grid Build progress: |███████████████████████████████████████| 100%

In [21]:
# Sort and show the grid search results
dnn_rand_grid_sorted = dnn_rand_grid.get_grid(sort_by='mse', decreasing=False)
print(dnn_rand_grid_sorted)


    activation        hidden      l1      l2  \
0    Rectifier      [50, 50]  1.0E-5     0.0   
1       Maxout      [50, 50]     0.0  1.0E-5   
2       Maxout  [50, 50, 50]  1.0E-5  1.0E-5   
3         Tanh  [50, 50, 50]     0.0  1.0E-5   
4         Tanh  [50, 50, 50]  1.0E-5  1.0E-5   
5       Maxout          [50]     0.0     0.0   
6       Maxout          [50]  1.0E-5   0.001   
7         Tanh  [50, 50, 50]   0.001     0.0   
8       Maxout          [50]   0.001     0.0   

                                                              model_ids  \
0  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_2   
1  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_8   
2  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_3   
3  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_0   
4  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_7   
5  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_5   
6  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_6   
7  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_1   
8  Grid_DeepLearning_py_4_sid_94fe_model_python_1498775536496_3_model_4   

                  mse  
0  0.5108409743268761  
1   0.515796569537398  
2  0.5198162831198087  
3  0.5209406735213454  
4  0.5236483939826322  
5   0.525274387870051  
6  0.5264857429694172  
7  0.5273379914285496  
8  0.5349398188631833  


In [22]:
# Extract the best model from random grid search
best_dnn_model_id = dnn_rand_grid_sorted.model_ids[0]
best_dnn_from_rand_grid = h2o.get_model(best_dnn_model_id)
best_dnn_from_rand_grid.summary()


Status of Neuron Layers: predicting quality, regression, gaussian distribution, Quadratic loss, 3,201 weights/biases, 44.9 KB, 81,920 training samples, mini-batch size 1

layer  units  type       dropout  l1     l2   mean_rate  rate_rms   momentum  mean_weight  weight_rms  mean_bias  bias_rms
1      11     Input      0.0
2      50     Rectifier  0.0      1e-05  0.0  0.0014578  0.0006493  0.0       -0.0066843   0.2054468   0.3404232  0.0963176
3      50     Rectifier  0.0      1e-05  0.0  0.0232022  0.0358157  0.0       -0.0484576   0.1955158   0.8763745  0.1542610
4      1      Linear              1e-05  0.0  0.0005779  0.0003537  0.0       -0.0018716   0.1864522   0.0838900  0.0000000
Out[22]:
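
The best DNN can also score the held-out test set directly. A minimal sketch:

# Score the 20% held-out test set with the best DNN and peek at the predictions (sketch)
dnn_test_pred = best_dnn_from_rand_grid.predict(wine_test)
dnn_test_pred.head(5)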


Step 4: Model Stacking


In [23]:
# Define a list of models to be stacked
# i.e. best model from each grid
all_ids = [best_gbm_model_id, best_drf_model_id, best_dnn_model_id]

In [24]:
# Set up Stacked Ensemble
ensemble = H2OStackedEnsembleEstimator(model_id = "my_ensemble",
                                       base_models = all_ids)

In [25]:
# Use .train() to start model stacking
# GLM is the default metalearner
ensemble.train(x = features, 
               y = 'quality', 
               training_frame = wine_train)


stackedensemble Model Build progress: |███████████████████████████████████| 100%
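
Once built, the ensemble can be saved to disk and reloaded like any other H2O model. A sketch, assuming /tmp/h2o_models is a writable placeholder path (support for persisting stacked ensembles may depend on the H2O version):

# Save the stacked ensemble and load it back (sketch)
model_path = h2o.save_model(model = ensemble, path = "/tmp/h2o_models", force = True)
loaded_ensemble = h2o.load_model(model_path)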


Comparison of Model Performance on Test Data


In [26]:
print('Best GBM model from Grid (MSE) : ', best_gbm_from_rand_grid.model_performance(wine_test).mse())
print('Best DRF model from Grid (MSE) : ', best_drf_from_rand_grid.model_performance(wine_test).mse())
print('Best DNN model from Grid (MSE) : ', best_dnn_from_rand_grid.model_performance(wine_test).mse())
print('Stacked Ensembles        (MSE) : ', ensemble.model_performance(wine_test).mse())


Best GBM model from Grid (MSE) :  0.4013942890547201
Best DRF model from Grid (MSE) :  0.4781568285687009
Best DNN model from Grid (MSE) :  0.5199803154635598
Stacked Ensembles        (MSE) :  0.39948493548786057
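
The same comparison can be written as a loop, which makes it easy to report further regression metrics such as RMSE and MAE (accessor names assume a reasonably recent H2O version):

# Compare all four models on the held-out test set across several metrics (sketch)
models = [('Best GBM', best_gbm_from_rand_grid),
          ('Best DRF', best_drf_from_rand_grid),
          ('Best DNN', best_dnn_from_rand_grid),
          ('Stacked Ensemble', ensemble)]
for name, model in models:
    perf = model.model_performance(wine_test)
    print('%-16s MSE: %.4f  RMSE: %.4f  MAE: %.4f' % (name, perf.mse(), perf.rmse(), perf.mae()))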