Machine Learning with H2O - Tutorial 4b: Classification Models (Ensembles)


Objective:

  • This tutorial explains how to create stacked ensembles of classification models for better out-of-sample (holdout) performance.

Titanic Dataset:

  • This tutorial uses the Kaggle Titanic data (kaggle_titanic.csv); the goal is to predict the binary outcome 'Survived' from passenger attributes such as class, sex, age, and fare.

Steps:

  1. Build GBM models using random grid search and extract the best one.
  2. Build DRF models using random grid search and extract the best one.
  3. Use model stacking to combine different models.

Full Technical Reference:



In [1]:
# Import all required modules
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch

# Start and connect to a local H2O cluster
h2o.init(nthreads = -1)


Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_131"; OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11); OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
  Starting server from /home/joe/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpbf3ymcq4
  JVM stdout: /tmp/tmpbf3ymcq4/h2o_joe_started_from_python.out
  JVM stderr: /tmp/tmpbf3ymcq4/h2o_joe_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: 01 secs
H2O cluster version: 3.10.5.2
H2O cluster version age: 10 days
H2O cluster name: H2O_from_python_joe_oj6td1
H2O cluster total nodes: 1
H2O cluster free memory: 5.210 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
Python version: 3.6.1 final



In [2]:
# Import Titanic data (local CSV)
titanic = h2o.import_file("kaggle_titanic.csv")
titanic.head(5)


Parse progress: |█████████████████████████████████████████████████████████| 100%
PassengerId  Survived  Pclass  Name                                                  Sex     Age  SibSp  Parch  Ticket  Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                               male    22   1      0      nan     7.25            S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)   female  38   1      0      nan     71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                                female  26   0      0      nan     7.925           S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)          female  35   1      0      113803  53.1     C123   S
5            0         3       Allen, Mr. William Henry                              male    35   0      0      373450  8.05            S
Out[2]:


In [3]:
# Convert 'Survived' and 'Pclass' to categorical values
titanic['Survived'] = titanic['Survived'].asfactor()
titanic['Pclass'] = titanic['Pclass'].asfactor()
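
As an optional sanity check (a sketch, not part of the original notebook run), H2OFrame's levels() and table() can confirm the conversion and show the class balance:

# Sketch: verify 'Survived' is now categorical and inspect the class counts
print(titanic['Survived'].levels())   # expected: [['0', '1']]
print(titanic['Survived'].table())    # frequency of each class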

In [4]:
# Define features (or predictors) manually
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
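
An equivalent, programmatic way to build the same list (a sketch, assuming the column names shown above) is to drop the response and identifier-like columns from the full column list:

# Sketch: derive the feature list by excluding the response and ID-like columns
ignore = ['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin']
features = [col for col in titanic.columns if col not in ignore]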

In [5]:
# Split the H2O data frame into training/test sets
# so we can evaluate out-of-sample performance
# (note: split_frame gives approximately the requested ratio)
titanic_split = titanic.split_frame(ratios = [0.8], seed = 1234)

titanic_train = titanic_split[0] # using 80% for training
titanic_test = titanic_split[1]  # using the remaining 20% for out-of-sample evaluation

In [6]:
titanic_train.shape


Out[6]:
(712, 12)

In [7]:
titanic_test.shape


Out[7]:
(179, 12)



In [8]:
# define the criteria for random grid search
search_criteria = {'strategy': "RandomDiscrete", 
                   'max_models': 9,
                   'seed': 1234}
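
Besides max_models, the RandomDiscrete strategy also accepts a time budget and early-stopping settings for the search itself; a sketch with illustrative values:

# Illustrative alternative: cap the random search by runtime and stop early
# once the leaderboard metric stops improving
search_criteria_alt = {'strategy': "RandomDiscrete",
                       'max_runtime_secs': 300,
                       'stopping_metric': 'AUC',
                       'stopping_tolerance': 1e-3,
                       'stopping_rounds': 5,
                       'seed': 1234}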


Step 1: Build GBM Models using Random Grid Search and Extract the Best Model


In [9]:
# define the range of hyper-parameters for GBM grid search
# 27 combinations in total
hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

In [10]:
# Set up GBM grid search
# Add a seed for reproducibility
gbm_rand_grid = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        model_id = 'gbm_rand_grid',
                        seed = 1234,
                        ntrees = 10000,                           # large cap; early stopping picks the actual size
                        nfolds = 5,
                        fold_assignment = "Modulo",               # needed for stacked ensembles
                        keep_cross_validation_predictions = True, # needed for stacked ensembles
                        stopping_metric = 'mse',                  # stop when MSE stops improving ...
                        stopping_rounds = 15,                     # ... over 15 consecutive scoring rounds
                        score_tree_interval = 1),                 # score after every new tree
                    search_criteria = search_criteria,            # random grid search
                    hyper_params = hyper_params)

In [11]:
# Use .train() to start the grid search
gbm_rand_grid.train(x = features, 
                    y = 'Survived', 
                    training_frame = titanic_train)


gbm Grid Build progress: |████████████████████████████████████████████████| 100%

In [12]:
# Sort and show the grid search results
gbm_rand_grid_sorted = gbm_rand_grid.get_grid(sort_by='auc', decreasing=True)
print(gbm_rand_grid_sorted)


    col_sample_rate max_depth sample_rate  \
0               0.8         3         0.8   
1               0.9         3         0.9   
2               0.8         3         0.9   
3               0.7         3         0.7   
4               0.9         7         0.7   
5               0.7         5         0.8   
6               0.7         7         0.7   
7               0.8         7         0.7   
8               0.9         7         0.9   

                                                     model_ids  \
0  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_3   
1  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_2   
2  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_7   
3  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_8   
4  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_6   
5  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_0   
6  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_1   
7  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_4   
8  Grid_GBM_py_6_sid_bee0_model_python_1498775809226_1_model_5   

                  auc  
0   0.866063061922249  
1  0.8641045122620403  
2  0.8622050567726142  
3  0.8610653834789583  
4  0.8580515807690684  
5  0.8557004769743785  
6  0.8542568908024145  
7  0.8530665653623739  
8  0.8423325313410156  


In [13]:
# Extract the best model from random grid search
best_gbm_model_id = gbm_rand_grid_sorted.model_ids[0]
best_gbm_from_rand_grid = h2o.get_model(best_gbm_model_id)
best_gbm_from_rand_grid.summary()


Model Summary:
number_of_trees:           74.0
number_of_internal_trees:  74.0
model_size_in_bytes:       14528.0
min_depth:                 3.0
max_depth:                 3.0
mean_depth:                3.0
min_leaves:                4.0
max_leaves:                8.0
mean_leaves:               6.918919
Out[13]:
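
To see how this model scored during cross-validation (the metric the grid was sorted on), its AUC can be queried directly; a small sketch:

# Sketch: cross-validation AUC of the best GBM (matches the grid's sort key)
print('Best GBM cross-validation AUC:', best_gbm_from_rand_grid.auc(xval=True))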


Step 2: Build DRF Models using Random Grid Search and Extract the Best Model


In [14]:
# define the range of hyper-parameters for DRF grid search
# 27 combinations in total
hyper_params = {'sample_rate': [0.5, 0.6, 0.7],
                'col_sample_rate_per_tree': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

In [15]:
# Set up DRF grid search
# Add a seed for reproducibility
drf_rand_grid = H2OGridSearch(
                    H2ORandomForestEstimator(
                        model_id = 'drf_rand_grid', 
                        seed = 1234,
                        ntrees = 200,   
                        nfolds = 5,
                        fold_assignment = "Modulo",                # needed for stacked ensembles
                        keep_cross_validation_predictions = True), # needed for stacked ensembles
                    search_criteria = search_criteria, # random grid search
                    hyper_params = hyper_params)

In [16]:
# Use .train() to start the grid search
drf_rand_grid.train(x = features, 
                    y = 'Survived', 
                    training_frame = titanic_train)


drf Grid Build progress: |████████████████████████████████████████████████| 100%

In [17]:
# Sort and show the grid search results
drf_rand_grid_sorted = drf_rand_grid.get_grid(sort_by='auc', decreasing=True)
print(drf_rand_grid_sorted)


    col_sample_rate_per_tree max_depth sample_rate  \
0                        0.8         7         0.5   
1                        0.9         7         0.5   
2                        0.9         7         0.7   
3                        0.7         7         0.5   
4                        0.7         5         0.6   
5                        0.8         3         0.7   
6                        0.8         3         0.6   
7                        0.7         3         0.5   
8                        0.9         3         0.7   

                                                         model_ids  \
0  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_4   
1  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_6   
2  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_5   
3  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_1   
4  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_0   
5  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_7   
6  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_3   
7  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_8   
8  Grid_DRF_py_6_sid_bee0_model_python_1498775809226_13356_model_2   

                  auc  
0  0.8625469587607109  
1  0.8618504917479212  
2  0.8604828837955342  
3  0.8603182643197839  
4  0.8588240260014351  
5  0.8546536659490945  
6   0.854248448778017  
7   0.853218521801528  
8  0.8494998100544511  


In [18]:
# Extract the best model from random grid search
best_drf_model_id = drf_rand_grid_sorted.model_ids[0]
best_drf_from_rand_grid = h2o.get_model(best_drf_model_id)
best_drf_from_rand_grid.summary()


Model Summary:
number_of_trees:           200.0
number_of_internal_trees:  200.0
model_size_in_bytes:       125030.0
min_depth:                 7.0
max_depth:                 7.0
mean_depth:                7.0
min_leaves:                24.0
max_leaves:                61.0
mean_leaves:               41.0
Out[18]:


Step 3: Model Stacking


In [19]:
# Define a list of models to be stacked
# i.e. the best model from each grid
all_ids = [best_gbm_model_id, best_drf_model_id]
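
The ensemble is not limited to one model per grid. As a variation (a sketch), every model from both grids could be stacked, since they all share the same folds and keep_cross_validation_predictions = True:

# Variation (sketch): stack all grid models instead of only the best of each
all_grid_ids = gbm_rand_grid.model_ids + drf_rand_grid.model_ids

Passing all_grid_ids as base_models below would then stack all eighteen models.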

In [20]:
# Set up Stacked Ensemble
ensemble = H2OStackedEnsembleEstimator(model_id = "my_ensemble",
                                       base_models = all_ids)

In [21]:
# Use .train() to start model stacking
# a GLM is used as the default metalearner
ensemble.train(x = features, 
               y = 'Survived', 
               training_frame = titanic_train)


stackedensemble Model Build progress: |███████████████████████████████████| 100%


Comparison of Model Performance on Test Data


In [22]:
print('Best GBM model from Grid (AUC) : ', best_gbm_from_rand_grid.model_performance(titanic_test).auc())
print('Best DRF model from Grid (AUC) : ', best_drf_from_rand_grid.model_performance(titanic_test).auc())
print('Stacked Ensembles        (AUC) : ', ensemble.model_performance(titanic_test).auc())


Best GBM model from Grid (AUC) :  0.8892284186401833
Best DRF model from Grid (AUC) :  0.8903106697224344
Stacked Ensembles        (AUC) :  0.8918385536032595
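
If the stacked ensemble is the model worth keeping, it can be written to disk with h2o.save_model and reloaded later with h2o.load_model; a sketch (the path is illustrative):

# Sketch: persist the winning ensemble for later use (path is illustrative)
model_path = h2o.save_model(model = ensemble, path = "./models", force = True)
print(model_path)

# In a new session, the model can be restored with:
# ensemble_reloaded = h2o.load_model(model_path)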