Machine Learning with H2O - Tutorial 4b: Classification Models (Ensembles)


Objective:

  • This tutorial explains how to create stacked ensembles of classification models for better out-of-sample (holdout) performance.

Titanic Dataset:

  • This tutorial uses the Kaggle Titanic data (kaggle_titanic.csv); the goal is to predict the binary outcome 'Survived' from passenger attributes such as class, sex, age, and fare.

Steps:

  1. Build GBM models using random grid search and extract the best one.
  2. Build DRF models using random grid search and extract the best one.
  3. Use model stacking to combine different models.

Full Technical Reference:



In [1]:
# Import all required modules
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch

# Start and connect to a local H2O cluster
h2o.init(nthreads = -1)


Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from /home/joe/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpbf3ymcq4
  JVM stdout: /tmp/tmpbf3ymcq4/h2o_joe_started_from_python.out
  JVM stderr: /tmp/tmpbf3ymcq4/h2o_joe_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: 01 secs
H2O cluster version: 3.10.5.2
H2O cluster version age: 10 days
H2O cluster name: H2O_from_python_joe_oj6td1
H2O cluster total nodes: 1
H2O cluster free memory: 5.210 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
Python version: 3.6.1 final



In [2]:
# Import Titanic data (local CSV)
titanic = h2o.import_file("kaggle_titanic.csv")
titanic.head(5)


Parse progress: |█████████████████████████████████████████████████████████| 100%
PassengerId  Survived  Pclass  Name                                                  Sex     Age  SibSp  Parch  Ticket  Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                               male    22   1      0      nan     7.25            S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)   female  38   1      0      nan     71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                                female  26   0      0      nan     7.925           S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)          female  35   1      0      113803  53.1     C123   S
5            0         3       Allen, Mr. William Henry                              male    35   0      0      373450  8.05            S
Out[2]:


In [3]:
# Convert 'Survived' and 'Pclass' to categorical values
titanic['Survived'] = titanic['Survived'].asfactor()
titanic['Pclass'] = titanic['Pclass'].asfactor()
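
As an optional sanity check (a sketch, not part of the original notebook run), H2OFrame's levels() and table() can confirm the conversion and show the class balance:

# Sketch: verify 'Survived' is now categorical and inspect the class counts
print(titanic['Survived'].levels())   # expected: [['0', '1']]
print(titanic['Survived'].table())    # frequency of each class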

In [4]:
# Define features (or predictors) manually
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
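
An equivalent, programmatic way to build the same list (a sketch, assuming the column names shown above) is to drop the response and identifier-like columns from the full column list:

# Sketch: derive the feature list by excluding the response and ID-like columns
ignore = ['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin']
features = [col for col in titanic.columns if col not in ignore]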

In [5]:
# Split the H2O data frame into training/test sets
# so we can evaluate out-of-sample performance
# (note: split_frame gives approximately the requested ratio)
titanic_split = titanic.split_frame(ratios = [0.8], seed = 1234)

titanic_train = titanic_split[0] # using 80% for training
titanic_test = titanic_split[1]  # using the remaining 20% for out-of-sample evaluation

In [6]:
titanic_train.shape


Out[6]:
(712, 12)

In [7]:
titanic_test.shape


Out[7]:
(179, 12)



In [8]:
# define the criteria for random grid search
search_criteria = {'strategy': "RandomDiscrete", 
                   'max_models': 9,
                   'seed': 1234}
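
Besides max_models, the RandomDiscrete strategy also accepts a time budget and early-stopping settings for the search itself; a sketch with illustrative values:

# Illustrative alternative: cap the random search by runtime and stop early
# once the leaderboard metric stops improving
search_criteria_alt = {'strategy': "RandomDiscrete",
                       'max_runtime_secs': 300,
                       'stopping_metric': 'AUC',
                       'stopping_tolerance': 1e-3,
                       'stopping_rounds': 5,
                       'seed': 1234}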


Step 1: Build GBM Models using Random Grid Search and Extract the Best Model


In [9]:
# define the range of hyper-parameters for GBM grid search
# 27 combinations in total
hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

In [10]:
# Set up GBM grid search
# Add a seed for reproducibility
gbm_rand_grid = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        model_id = 'gbm_rand_grid',
                        seed = 1234,
                        ntrees = 10000,                           # large cap; early stopping picks the actual size
                        nfolds = 5,
                        fold_assignment = "Modulo",               # needed for stacked ensembles
                        keep_cross_validation_predictions = True, # needed for stacked ensembles
                        stopping_metric = 'mse',                  # stop when MSE stops improving ...
                        stopping_rounds = 15,                     # ... over 15 consecutive scoring rounds
                        score_tree_interval = 1),                 # score after every new tree
                    search_criteria = search_criteria,            # random grid search
                    hyper_params = hyper_params)

In [11]:
# Use .train() to start the grid search
gbm_rand_grid.train(x = features, 
                    y = 'Survived', 
                    training_frame = titanic_train)


gbm Grid Build progress: |████████████████████████████████████████████████| 100%

In [12]:
# Sort and show the grid search results
gbm_rand_grid_sorted = gbm_rand_grid.get_grid(sort_by='auc', decreasing=True)
print(gbm_rand_grid_sorted)


    col_sample_rate max_depth sample_rate  \
0               0.8         3         0.8   
1               0.9         3         0.9   
2               0.8         3         0.9   
3               0.7         3         0.7   
4               0.9         7         0.7   
5               0.7         5         0.8   
6               0.7         7         0.7   
7               0.8         7         0.7   
8               0.9         7         0.9   

                                                     model_ids  \
0  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_3   
1  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_2   
2  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_7   
3  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_8   
4  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_6   
5  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_0   
6  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_1   
7  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_4   
8  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_5   

                  auc  
0   0.866063061922249  
1  0.8641045122620403  
2  0.8622050567726142  
3  0.8610653834789583  
4  0.8580515807690684  
5  0.8557004769743785  
6  0.8542568908024145  
7  0.8530665653623739  
8  0.8423325313410156  


In [13]:
# Extract the best model from random grid search
best_gbm_model_id = gbm_rand_grid_sorted.model_ids[0]
best_gbm_from_rand_grid = h2o.get_model(best_gbm_model_id)
best_gbm_from_rand_grid.summary()


Model Summary:
number_of_trees:           74.0
number_of_internal_trees:  74.0
model_size_in_bytes:       14528.0
min_depth:                 3.0
max_depth:                 3.0
mean_depth:                3.0
min_leaves:                4.0
max_leaves:                8.0
mean_leaves:               6.918919
Out[13]:
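
To see how this model scored during cross-validation (the metric the grid was sorted on), its AUC can be queried directly; a small sketch:

# Sketch: cross-validation AUC of the best GBM (matches the grid's sort key)
print('Best GBM cross-validation AUC:', best_gbm_from_rand_grid.auc(xval=True))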


Step 2: Build DRF Models using Random Grid Search and Extract the Best Model


In [14]:
# define the range of hyper-parameters for DRF grid search
# 27 combinations in total
hyper_params = {'sample_rate': [0.5, 0.6, 0.7],
                'col_sample_rate_per_tree': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

In [15]:
# Set up DRF grid search
# Add a seed for reproducibility
drf_rand_grid = H2OGridSearch(
                    H2ORandomForestEstimator(
                        model_id = 'drf_rand_grid', 
                        seed = 1234,
                        ntrees = 200,   
                        nfolds = 5,
                        fold_assignment = "Modulo",                # needed for stacked ensembles
                        keep_cross_validation_predictions = True), # needed for stacked ensembles
                    search_criteria = search_criteria, # random grid search
                    hyper_params = hyper_params)

In [16]:
# Use .train() to start the grid search
drf_rand_grid.train(x = features, 
                    y = 'Survived', 
                    training_frame = titanic_train)


drf Grid Build progress: |████████████████████████████████████████████████| 100%

In [17]:
# Sort and show the grid search results
drf_rand_grid_sorted = drf_rand_grid.get_grid(sort_by='auc', decreasing=True)
print(drf_rand_grid_sorted)


    col_sample_rate_per_tree max_depth sample_rate  \
0                        0.8         7         0.5   
1                        0.9         7         0.5   
2                        0.9         7         0.7   
3                        0.7         7         0.5   
4                        0.7         5         0.6   
5                        0.8         3         0.7   
6                        0.8         3         0.6   
7                        0.7         3         0.5   
8                        0.9         3         0.7   

                                                         model_ids  \
0  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_4   
1  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_6   
2  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_5   
3  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_1   
4  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_0   
5  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_7   
6  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_3   
7  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_8   
8  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_2   

                  auc  
0  0.8625469587607109  
1  0.8618504917479212  
2  0.8604828837955342  
3  0.8603182643197839  
4  0.8588240260014351  
5  0.8546536659490945  
6   0.854248448778017  
7   0.853218521801528  
8  0.8494998100544511  


In [18]:
# Extract the best model from random grid search
best_drf_model_id = drf_rand_grid_sorted.model_ids[0]
best_drf_from_rand_grid = h2o.get_model(best_drf_model_id)
best_drf_from_rand_grid.summary()


Model Summary:
number_of_trees:           200.0
number_of_internal_trees:  200.0
model_size_in_bytes:       125030.0
min_depth:                 7.0
max_depth:                 7.0
mean_depth:                7.0
min_leaves:                24.0
max_leaves:                61.0
mean_leaves:               41.0
Out[18]:


Step 3: Model Stacking


In [19]:
# Define a list of models to be stacked
# i.e. the best model from each grid
all_ids = [best_gbm_model_id, best_drf_model_id]
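
The ensemble is not limited to one model per grid. As a variation (a sketch), every model from both grids could be stacked, since they all share the same folds and keep_cross_validation_predictions = True:

# Variation (sketch): stack all grid models instead of only the best of each
all_grid_ids = gbm_rand_grid.model_ids + drf_rand_grid.model_ids

Passing all_grid_ids as base_models below would then stack all eighteen models.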

In [20]:
# Set up Stacked Ensemble
ensemble = H2OStackedEnsembleEstimator(model_id = "my_ensemble",
                                       base_models = all_ids)

In [21]:
# Use .train() to start model stacking
# a GLM is used as the default metalearner
ensemble.train(x = features, 
               y = 'Survived', 
               training_frame = titanic_train)


stackedensemble Model Build progress: |███████████████████████████████████| 100%


Comparison of Model Performance on Test Data


In [22]:
print('Best GBM model from Grid (AUC) : ', best_gbm_from_rand_grid.model_performance(titanic_test).auc())
print('Best DRF model from Grid (AUC) : ', best_drf_from_rand_grid.model_performance(titanic_test).auc())
print('Stacked Ensembles        (AUC) : ', ensemble.model_performance(titanic_test).auc())


Best GBM model from Grid (AUC) :  0.8892284186401833
Best DRF model from Grid (AUC) :  0.8903106697224344
Stacked Ensembles        (AUC) :  0.8918385536032595
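
If the stacked ensemble is the model worth keeping, it can be written to disk with h2o.save_model and reloaded later with h2o.load_model; a sketch (the path is illustrative):

# Sketch: persist the winning ensemble for later use (path is illustrative)
model_path = h2o.save_model(model = ensemble, path = "./models", force = True)
print(model_path)

# In a new session, the model can be restored with:
# ensemble_reloaded = h2o.load_model(model_path)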