Using the H2OPipeline

This notebook will provide an overview of the H2OPipeline and its nuanced behavior.

The H2OPipeline generates a sklearn-esque pipeline of H2O steps, finished with an optional H2OEstimator. Note that as of version 0.1.0, the behavior of the H2OPipeline has changed slightly with the inclusion of the exclude_from_ppc and exclude_from_fit parameters.

At its core, the pipeline is a list of two-element tuples of the form ('name', SomeH2OTransformer()), optionally capped with an H2OEstimator as the final step. The pipeline procedurally fits each stage, transforming the training data prior to fitting the next stage. When predicting or transforming new (test) data, each stage calls either transform or predict as appropriate.
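
For illustration, a pipeline with two preprocessing stages and a final estimator might be declared as follows (a minimal sketch using the same classes demonstrated later in this notebook; the step names are arbitrary labels and no parameters are tuned here):

from skutil.h2o import H2OPipeline
from skutil.h2o.transform import H2OSelectiveScaler
from skutil.h2o.select import H2OMulticollinearityFilterer
from h2o.estimators import H2OGradientBoostingEstimator

# each stage is a ('name', transformer) tuple; the optional final step is an estimator
pipe = H2OPipeline([
        ('scl', H2OSelectiveScaler()),             # fit, then transform the training frame
        ('mcf', H2OMulticollinearityFilterer()),   # fit on the transformed frame, then transform again
        ('gbm', H2OGradientBoostingEstimator())    # fit on the fully-transformed frame
    ],
    target_feature='target')

# pipe.fit(train_frame)     -> fits each stage in order, transforming as it goes
# pipe.predict(test_frame)  -> transforms through each stage, then predicts with the final estimator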

On the topic of exclusions and feature_names:

Prior to version 0.1.0, H2OTransformers did not accept the keyword exclude_features. Its addition necessitated two new keywords in the H2OPipeline and a slight change in the behavior of feature_names:

  • exclude_from_ppc - If set in the H2OPipeline constructor, these features will be universally omitted from every preprocessing stage. Since exclude_features can also be set individually on each transformer, any stage whose exclude_features has been explicitly set will exclude the union of exclude_from_ppc and its own exclude_features (see the sketch following this list).
  • exclude_from_fit - If set in the H2OPipeline constructor, these features will be omitted from the training_cols_ fit attribute, which are the columns passed to the final stage in the pipeline.
  • feature_names - The former behavior of the H2OPipeline only used feature_names in the fit of the first transformer, passing the remaining columns to the next transformer as the feature_names parameter. The new behavior is more discriminating in the case of explicitly-set attributes. In the case where a transformer's feature_names parameter has been explicitly set, only those names will be used in the fit. This is useful in cases where someone may only want to, for instance, drop one of two multicollinear features using the H2OMulticollinearityFilterer rather than fitting against the entire dataset. It also adheres to the now expected behavior of the exclusion parameters.
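
To make these rules concrete, here is a minimal sketch with hypothetical column names ('a' through 'c' plus a target 'y'); the real demo below uses the Boston housing data:

from skutil.h2o import H2OPipeline
from skutil.h2o.transform import H2OSelectiveScaler
from skutil.h2o.select import H2OMulticollinearityFilterer
from h2o.estimators import H2OGradientBoostingEstimator

pipe = H2OPipeline([
        ('scl', H2OSelectiveScaler()),                                  # no exclude_features set: excludes just ['a']
        ('mcf', H2OMulticollinearityFilterer(exclude_features=['b'])),  # explicitly set: excludes the union, ['a', 'b']
        ('gbm', H2OGradientBoostingEstimator())
    ],
    exclude_from_ppc=['a'],   # omitted from every preprocessing fit
    exclude_from_fit=['c'],   # omitted from training_cols_, i.e. never passed to 'gbm'
    feature_names=None,       # None: the first stage is fit on all remaining (non-excluded) columns
    target_feature='y')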

We will start by loading the Boston housing dataset from sklearn and uploading it into an H2OFrame. Fortunately, skutil makes this very easy (note that you must start your h2o cluster first!):


In [2]:
import h2o
h2o.connect(ip='10.7.54.204', port=54321) # I started this on the command line


Connecting to H2O server at http://10.7.54.204:54321... successful.
H2O cluster uptime: 12 mins 21 secs
H2O cluster version: 3.10.0.7
H2O cluster version age: 25 days
H2O cluster name: fp7y
H2O cluster total nodes: 1
H2O cluster free memory: 3.313 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: locked, healthy
H2O connection url: http://10.7.54.204:54321
H2O connection proxy: None
Python version: 2.7.12 final
Out[2]:
<H2OConnection to http://10.7.54.204:54321, no session>

In [4]:
from skutil.h2o import load_boston_h2o
from skutil.h2o import h2o_train_test_split

X = load_boston_h2o(include_tgt=True, shuffle=True, tgt_name='target')
X_train, X_test = h2o_train_test_split(X, train_size=0.7) # this splits our data

X_train.head()


Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
CRIM      ZN   INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  B       LSTAT  target
0.17783   0    9.69   0     0.585  5.569  73.5  2.3999  6    391  19.2     395.77  15.1   17.5
6.80117   0    18.1   0     0.713  6.081  84.4  2.7175  24   666  20.2     396.9   14.7   20
0.08707   0    12.83  0     0.437  6.14   45.8  4.0905  5    398  18.7     386.96  10.27  20.8
9.51363   0    18.1   0     0.713  6.728  94.1  2.4961  24   666  20.2     6.68    18.71  14.9
1.13081   0    8.14   0     0.538  5.713  94.1  4.233   4    307  21       360.17  22.6   12.7
8.71675   0    18.1   0     0.693  6.471  98.8  1.7257  24   666  20.2     391.98  17.12  13.1
0.04462   25   4.86   0     0.426  6.619  70.4  5.4007  4    281  19       395.63  7.22   23.9
4.03841   0    18.1   0     0.532  6.229  90.7  3.0993  24   666  20.2     395.33  12.87  19.6
37.6619   0    18.1   0     0.679  6.202  78.7  1.8629  24   666  20.2     18.82   14.52  10.9
7.02259   0    18.1   0     0.718  6.006  95.3  1.8746  24   666  20.2     319.98  15.7   14.2
Out[4]:

Fit our pipeline

There are several demos out there that show the entire data munging and exploration process using skutil. We won't duplicate those efforts here; instead, we'll jump straight into the H2OPipeline demo.


In [5]:
from skutil.h2o import H2OPipeline
from skutil.h2o.transform import H2OSelectiveScaler
from skutil.h2o.select import H2OMulticollinearityFilterer
from h2o.estimators import H2OGradientBoostingEstimator

# Declare our pipe - this one is intentionally a bit complex in behavior
pipe = H2OPipeline([
        ('scl', H2OSelectiveScaler(feature_names=['B','PTRATIO','CRIM'])), # will ONLY operate on these features
        ('mcf', H2OMulticollinearityFilterer(exclude_features=['CHAS'])),  # will exclude this AS WELL AS 'TAX'
        ('gbm', H2OGradientBoostingEstimator())
    ],
    
    exclude_from_ppc=['TAX'], # excluded from all preprocessor fits
    feature_names=None,       # fit the first stage on ALL features (minus exceptions)
    target_feature='target')  # will be excluded from all preprocessor fits, as it's the target

# do actual fit:
pipe.fit(X_train)


gbm Model Build progress: |███████████████████████████████████████████████████████████████████| 100%
Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Method
Model Key:  GBM_model_python_1476531369030_1
Model Summary: 
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
50.0 50.0 11663.0 5.0 5.0 5.0 8.0 21.0 13.66

ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 2.63251954428
RMSE: 1.62250409685
MAE: 1.06575154719
RMSLE: 0.0756197889648
Mean Residual Deviance: 2.63251954428
Scoring History: 
timestamp duration number_of_trees training_rmse training_mae training_deviance
2016-10-15 06:48:39 0.022 sec 0.0 9.4733020 6.8840244 89.7434513
2016-10-15 06:48:40 0.181 sec 1.0 8.6632611 6.3015832 75.0520937
2016-10-15 06:48:40 0.222 sec 2.0 7.9437570 5.7884133 63.1032755
2016-10-15 06:48:40 0.244 sec 3.0 7.2968971 5.3183637 53.2447073
2016-10-15 06:48:40 0.267 sec 4.0 6.7243868 4.9080230 45.2173780
--- --- --- --- --- --- ---
2016-10-15 06:48:40 0.809 sec 46.0 1.6875792 1.1103921 2.8479235
2016-10-15 06:48:40 0.817 sec 47.0 1.6761373 1.1002459 2.8094363
2016-10-15 06:48:40 0.826 sec 48.0 1.6588035 1.0889463 2.7516292
2016-10-15 06:48:40 0.835 sec 49.0 1.6443885 1.0810185 2.7040134
2016-10-15 06:48:40 0.845 sec 50.0 1.6225041 1.0657515 2.6325195
See the whole table with table.as_data_frame()
Variable Importances: 
variable relative_importance scaled_importance percentage
LSTAT 102906.4062500 1.0 0.6340451
RM 32338.6660156 0.3142532 0.1992507
NOX 6833.0043945 0.0664002 0.0421007
DIS 6525.0424805 0.0634075 0.0402032
CRIM 3430.0273438 0.0333315 0.0211337
TAX 2854.7770996 0.0277415 0.0175894
PTRATIO 2323.4426269 0.0225782 0.0143156
AGE 2037.4730225 0.0197993 0.0125536
B 1366.0084228 0.0132743 0.0084165
INDUS 653.3645630 0.0063491 0.0040256
RAD 650.1253052 0.0063176 0.0040057
CHAS 380.0959778 0.0036936 0.0023419
ZN 2.9641106 0.0000288 0.0000183
Out[5]:
H2OPipeline(exclude_from_fit=None, exclude_from_ppc=['TAX'],
      feature_names=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'],
      steps=[('scl', H2OSelectiveScaler(exclude_features=['TAX'],
          feature_names=['B', 'PTRATIO', 'CRIM'], target_feature='target',
          with_mean=True, with_std=True)), ('mcf', H2OMulticollinearityFilterer(exclude_features=['TAX', 'CHAS'],
               feature_names=['CRIM', 'ZN', 'INDUS'..._warn=True, target_feature='target',
               threshold=0.85, use='complete.obs')), ('gbm', )],
      target_feature='target')

Validating our hypotheses

Let's ensure each stage behaved as we expected.


In [6]:
# First stage should ONLY be fit on these features: ['B','PTRATIO','CRIM']
step = pipe.steps[0][1] # extract the transformer from the tuple
step.means


Out[6]:
{'B': 357.52463276836176,
 'CRIM': 3.748034491525425,
 'PTRATIO': 18.409887005649722}

In [8]:
# Second stage should be fit on everything BUT ['CHAS', 'TAX'] (and of course, the target)
step = pipe.steps[1][1]
step.correlations_ # looks like we had nothing to drop anyway


Out[8]:
[]

In [9]:
# here are the features we ultimately fit the estimator on:
pipe.training_cols_


Out[9]:
['CRIM',
 'ZN',
 'INDUS',
 'CHAS',
 'NOX',
 'RM',
 'AGE',
 'DIS',
 'RAD',
 'TAX',
 'PTRATIO',
 'B',
 'LSTAT']

In [11]:
# Let's check our R^2:
from skutil.h2o.metrics import h2o_r2_score

test_pred = pipe.predict(X_test)
print('Testing R^2: %.5f' % h2o_r2_score(X_test['target'], test_pred))


gbm prediction progress: |████████████████████████████████████████████████████████████████████| 100%
Testing R^2: 0.77387