Using the `H2OPipeline`

This notebook gives an overview of the H2OPipeline and the subtleties of its behavior.

The H2OPipeline generates a sklearn-esque pipeline of H2O steps, finished with an optional H2OEstimator. Note that as of version 0.1.0, the behavior of the H2OPipeline has changed slightly with the addition of the exclude_from_ppc and exclude_from_fit parameters.

The pipeline, at its core, consists of a list of length-two tuples in the form of ('name', SomeH2OTransformer()), ending with an optional H2OEstimator as the final step. The pipeline fits each stage in order, transforming the training data before fitting the next stage. When predicting on or transforming new (test) data, each transformer stage calls transform, and the final estimator (if present) calls predict.
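The fit-then-transform chaining described above can be sketched in plain Python. This is a toy illustration only, not skutil's implementation: ToyPipeline, Centerer and RangeEstimator are hypothetical names, and the data is a simple list of numbers rather than an H2OFrame.

```python
class Centerer(object):
    """Toy transformer: subtracts the mean learned during fit."""

    def fit(self, data):
        self.mean_ = sum(data) / float(len(data))
        return self

    def transform(self, data):
        return [x - self.mean_ for x in data]


class RangeEstimator(object):
    """Toy final step: flags whether a value lies in the range seen in training."""

    def fit(self, data):
        self.lo_, self.hi_ = min(data), max(data)
        return self

    def predict(self, data):
        return [self.lo_ <= x <= self.hi_ for x in data]


class ToyPipeline(object):
    """Hypothetical stand-in for a pipeline of ('name', step) tuples."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, data):
        # Fit each transformer, then hand its output to the next stage
        for _, step in self.steps[:-1]:
            data = step.fit(data).transform(data)
        # The final step only needs to be fit on the transformed data
        self.steps[-1][1].fit(data)
        return self

    def predict(self, data):
        # New data passes through the already-fit transformers, then predict
        for _, step in self.steps[:-1]:
            data = step.transform(data)
        return self.steps[-1][1].predict(data)


toy = ToyPipeline([('center', Centerer()), ('range', RangeEstimator())])
toy.fit([1.0, 2.0, 3.0])
toy_predictions = toy.predict([2.5, 10.0])
```

Note that the test values are centered with the mean learned from the training data, which is the point of fitting each stage once and reusing it at prediction time.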

On the topic of exclusions and feature_names:

Prior to version 0.1.0, H2OTransformers did not take the keyword exclude_features. Its addition necessitated two new keywords in the H2OPipeline, and a slight change in behavior of feature_names:

exclude_from_ppc - If set in the H2OPipeline constructor, these features are omitted from every preprocessing stage. Because exclude_features can also be set on each individual transformer, a stage whose exclude_features has been set explicitly will exclude the union of exclude_from_ppc and exclude_features.

exclude_from_fit - If set in the H2OPipeline constructor, these features are omitted from the training_cols_ fit attribute, which holds the columns passed to the final stage in the pipeline.

feature_names - Formerly, the H2OPipeline used feature_names only in the fit of the first transformer, passing the remaining columns to the next transformer as its feature_names parameter. The new behavior is more discriminating when attributes are set explicitly: if a transformer's feature_names parameter has been explicitly set, only those names are used in its fit. This is useful when, for instance, someone wants to drop only one of two multicollinear features using the H2OMulticollinearityFilterer rather than fitting against the entire dataset. It also adheres to the now-expected behavior of the exclusion parameters.
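The three parameters above boil down to simple list logic, which can be sketched in plain Python. This is a rough sketch, not skutil's actual code: the helper names (stage_exclusions, stage_columns, final_training_cols) are hypothetical, and filtering explicitly set feature_names against the exclusions is an assumption made for the illustration.

```python
def stage_exclusions(exclude_from_ppc, exclude_features):
    """Union of pipeline-level and stage-level exclusions, order preserved."""
    merged = []
    for name in list(exclude_from_ppc or []) + list(exclude_features or []):
        if name not in merged:
            merged.append(name)
    return merged


def stage_columns(explicit_names, available_cols, exclusions, target):
    """Columns a preprocessing stage is fit on (hypothetical helper).

    Explicitly set feature_names win; otherwise every available column is
    a candidate. Exclusions and the target are always filtered out.
    """
    candidates = explicit_names if explicit_names is not None else available_cols
    return [c for c in candidates if c not in exclusions and c != target]


def final_training_cols(available_cols, exclude_from_fit, target):
    """Columns handed to the final estimator (hypothetical helper)."""
    skipped = list(exclude_from_fit or []) + [target]
    return [c for c in available_cols if c not in skipped]


# Illustrative values: the Boston housing column names plus a target column
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
        'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'target']

# Stage with explicit feature_names and no stage-level exclusions
scaler_excl = stage_exclusions(['TAX'], None)
scaler_cols = stage_columns(['B', 'PTRATIO', 'CRIM'], cols, scaler_excl, 'target')

# Stage with its own exclude_features and no explicit feature_names
filter_excl = stage_exclusions(['TAX'], ['CHAS'])
filter_cols = stage_columns(None, cols, filter_excl, 'target')

# exclude_from_ppc does not affect what the final estimator sees
train_cols = final_training_cols(cols, None, 'target')
```

In this sketch, a feature listed only in exclude_from_ppc still reaches the final estimator; it would have to appear in exclude_from_fit to be withheld from it.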

We will start by loading the Boston housing dataset from sklearn and uploading it into an H2OFrame. Fortunately, skutil makes this very easy (note that you must start your h2o cluster first!):



In [2]:

    
import h2o
h2o.connect(ip='10.7.54.204', port=54321) # I started this on command line









    



Connecting to H2O server at http://10.7.54.204:54321... successful.






    




H2O cluster uptime:         12 mins 21 secs
H2O cluster version:        3.10.0.7
H2O cluster version age:    25 days
H2O cluster name:           fp7y
H2O cluster total nodes:    1
H2O cluster free memory:    3.313 Gb
H2O cluster total cores:    8
H2O cluster allowed cores:  8
H2O cluster status:         locked, healthy
H2O connection url:         http://10.7.54.204:54321
H2O connection proxy:       None
Python version:             2.7.12 final






    Out[2]:





<H2OConnection to http://10.7.54.204:54321, no session>



In [4]:

    
from skutil.h2o import load_boston_h2o
from skutil.h2o import h2o_train_test_split

X = load_boston_h2o(include_tgt=True, shuffle=True, tgt_name='target')
X_train, X_test = h2o_train_test_split(X, train_size=0.7) # this splits our data

X_train.head()









    



Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%






    





    CRIM   ZN   INDUS   CHAS   NOX    RM   AGE    DIS   RAD   TAX   PTRATIO      B   LSTAT   target
 0.17783    0    9.69      0 0.585 5.569  73.5 2.3999     6   391      19.2 395.77   15.1     17.5
 6.80117    0   18.1      0 0.713 6.081  84.4 2.7175    24   666      20.2 396.9   14.7     20  
 0.08707    0   12.83      0 0.437 6.14  45.8 4.0905     5   398      18.7 386.96   10.27     20.8
 9.51363    0   18.1      0 0.713 6.728  94.1 2.4961    24   666      20.2   6.68   18.71     14.9
 1.13081    0    8.14      0 0.538 5.713  94.1 4.233     4   307      21  360.17   22.6     12.7
 8.71675    0   18.1      0 0.693 6.471  98.8 1.7257    24   666      20.2 391.98   17.12     13.1
 0.04462   25    4.86      0 0.426 6.619  70.4 5.4007     4   281      19  395.63    7.22     23.9
 4.03841    0   18.1      0 0.532 6.229  90.7 3.0993    24   666      20.2 395.33   12.87     19.6
37.6619    0   18.1      0 0.679 6.202  78.7 1.8629    24   666      20.2  18.82   14.52     10.9
 7.02259    0   18.1      0 0.718 6.006  95.3 1.8746    24   666      20.2 319.98   15.7     14.2







    Out[4]:

Fit our pipeline

Several other demos show the entire data munging and exploration process using skutil. We won't duplicate that effort here; instead, we will jump straight into the H2OPipeline demo.



In [5]:

    
from skutil.h2o import H2OPipeline
from skutil.h2o.transform import H2OSelectiveScaler
from skutil.h2o.select import H2OMulticollinearityFilterer
from h2o.estimators import H2OGradientBoostingEstimator

# Declare our pipe - this one is intentionally a bit complex in behavior
pipe = H2OPipeline([
        ('scl', H2OSelectiveScaler(feature_names=['B','PTRATIO','CRIM'])), # will ONLY operate on these features
        ('mcf', H2OMulticollinearityFilterer(exclude_features=['CHAS'])),  # will exclude 'CHAS' AS WELL AS 'TAX' (from exclude_from_ppc)
        ('gbm', H2OGradientBoostingEstimator())
    ],
    
    exclude_from_ppc=['TAX'], # excluded from all preprocessor fits
    feature_names=None,       # fit the first stage on ALL features (minus exceptions)
    target_feature='target')  # will be excluded from all preprocessor fits, as it's the target

# do actual fit:
pipe.fit(X_train)









    



gbm Model Build progress: |███████████████████████████████████████████████████████████████████| 100%
Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Method
Model Key:  GBM_model_python_1476531369030_1
Model Summary: 






    





number_of_trees  number_of_internal_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
50.0             50.0                      11663.0              5.0        5.0        5.0         8.0         21.0        13.66






    



ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 2.63251954428
RMSE: 1.62250409685
MAE: 1.06575154719
RMSLE: 0.0756197889648
Mean Residual Deviance: 2.63251954428
Scoring History: 






    





timestamp            duration   number_of_trees  training_rmse  training_mae  training_deviance
2016-10-15 06:48:39  0.022 sec  0.0              9.4733020      6.8840244     89.7434513
2016-10-15 06:48:40  0.181 sec  1.0              8.6632611      6.3015832     75.0520937
2016-10-15 06:48:40  0.222 sec  2.0              7.9437570      5.7884133     63.1032755
2016-10-15 06:48:40  0.244 sec  3.0              7.2968971      5.3183637     53.2447073
2016-10-15 06:48:40  0.267 sec  4.0              6.7243868      4.9080230     45.2173780
---                  ---        ---              ---            ---           ---
2016-10-15 06:48:40  0.809 sec  46.0             1.6875792      1.1103921     2.8479235
2016-10-15 06:48:40  0.817 sec  47.0             1.6761373      1.1002459     2.8094363
2016-10-15 06:48:40  0.826 sec  48.0             1.6588035      1.0889463     2.7516292
2016-10-15 06:48:40  0.835 sec  49.0             1.6443885      1.0810185     2.7040134
2016-10-15 06:48:40  0.845 sec  50.0             1.6225041      1.0657515     2.6325195






    



See the whole table with table.as_data_frame()
Variable Importances: 






    




variable  relative_importance  scaled_importance  percentage
LSTAT     102906.4062500       1.0                0.6340451
RM        32338.6660156        0.3142532          0.1992507
NOX       6833.0043945         0.0664002          0.0421007
DIS       6525.0424805         0.0634075          0.0402032
CRIM      3430.0273438         0.0333315          0.0211337
TAX       2854.7770996         0.0277415          0.0175894
PTRATIO   2323.4426269         0.0225782          0.0143156
AGE       2037.4730225         0.0197993          0.0125536
B         1366.0084228         0.0132743          0.0084165
INDUS     653.3645630          0.0063491          0.0040256
RAD       650.1253052          0.0063176          0.0040057
CHAS      380.0959778          0.0036936          0.0023419
ZN        2.9641106            0.0000288          0.0000183






    Out[5]:





H2OPipeline(exclude_from_fit=None, exclude_from_ppc=['TAX'],
      feature_names=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'],
      steps=[('scl', H2OSelectiveScaler(exclude_features=['TAX'],
          feature_names=['B', 'PTRATIO', 'CRIM'], target_feature='target',
          with_mean=True, with_std=True)), ('mcf', H2OMulticollinearityFilterer(exclude_features=['TAX', 'CHAS'],
               feature_names=['CRIM', 'ZN', 'INDUS'..._warn=True, target_feature='target',
               threshold=0.85, use='complete.obs')), ('gbm', )],
      target_feature='target')

Validating our hypotheses

Let's make sure each stage behaved the way we expected.



In [6]:

    
# First stage should ONLY be fit on these features: ['B','PTRATIO','CRIM']
step = pipe.steps[0][1] # extract the transformer from the tuple
step.means









    Out[6]:





{'B': 357.52463276836176,
 'CRIM': 3.748034491525425,
 'PTRATIO': 18.409887005649722}
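To make the idea of a selective scaler concrete, here is a small plain-Python sketch that learns statistics only for the named columns and leaves every other column untouched. It is not the H2OSelectiveScaler source: the function names are hypothetical, the data is a dict of lists rather than an H2OFrame, and the use of the population standard deviation is an assumption.

```python
def fit_selective_scaler(columns, feature_names):
    """Learn a mean and standard deviation for the named columns only.

    columns: dict mapping column name -> list of floats.
    """
    means, stds = {}, {}
    for name in feature_names:
        values = columns[name]
        n = float(len(values))
        mean = sum(values) / n
        # Population variance (divide by n); this choice is an assumption
        variance = sum((v - mean) ** 2 for v in values) / n
        means[name] = mean
        stds[name] = variance ** 0.5
    return means, stds


def apply_selective_scaler(columns, means, stds):
    """Center and scale fitted columns; copy all others through unchanged.

    A constant column (std of zero) would need special handling that is
    omitted here for brevity.
    """
    scaled = {}
    for name, values in columns.items():
        if name in means:
            scaled[name] = [(v - means[name]) / stds[name] for v in values]
        else:
            scaled[name] = list(values)
    return scaled


# Toy frame: only 'B' is scaled, 'TAX' passes through as-is
toy_frame = {'B': [1.0, 3.0], 'TAX': [300.0, 400.0]}
toy_means, toy_stds = fit_selective_scaler(toy_frame, ['B'])
toy_scaled = apply_selective_scaler(toy_frame, toy_means, toy_stds)
```

As in the output above, the learned means contain an entry only for the columns the stage was fit on.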



In [8]:

    
# Second stage should be fit on everything BUT ['CHAS', 'TAX'] (and of course, the target)
step = pipe.steps[1][1]
step.correlations_ # looks like we had nothing to drop anyway









    Out[8]:





[]
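An empty result here suggests that no pair of candidate features was correlated beyond the filter's threshold (0.85 in the repr above). The general idea of threshold-based multicollinearity filtering can be sketched as follows. This is a simplified illustration in plain Python, not the H2OMulticollinearityFilterer algorithm: the function names are hypothetical, and the rule for choosing which member of a correlated pair to drop (always the later one) is an assumption that may differ from skutil's.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of floats.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    norm_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (norm_x * norm_y)


def filter_collinear(columns, names, threshold=0.85):
    """Greedy filter: keep a column unless it is too correlated with one already kept."""
    kept, dropped = [], []
    for name in names:
        too_close = any(
            abs(pearson(columns[name], columns[other])) > threshold
            for other in kept
        )
        if too_close:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


# Toy data: 'b' is nearly a multiple of 'a'; 'c' is only weakly related to 'a'
toy_cols = {
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.1, 5.9, 8.0],
    'c': [4.0, 1.0, 3.0, 2.0],
}
kept_cols, dropped_cols = filter_collinear(toy_cols, ['a', 'b', 'c'])
```

With these toy values, 'b' is dropped because it nearly duplicates 'a', while 'c' survives.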



In [9]:

    
# here are the features we ultimately fit the estimator on:
pipe.training_cols_









    Out[9]:





['CRIM',
 'ZN',
 'INDUS',
 'CHAS',
 'NOX',
 'RM',
 'AGE',
 'DIS',
 'RAD',
 'TAX',
 'PTRATIO',
 'B',
 'LSTAT']



In [11]:

    
# Let's check our R^2:
from skutil.h2o.metrics import h2o_r2_score

test_pred = pipe.predict(X_test)
print('Testing R^2: %.5f' % h2o_r2_score(X_test['target'], test_pred))









    



gbm prediction progress: |████████████████████████████████████████████████████████████████████| 100%
Testing R^2: 0.77387
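
For reference, the coefficient of determination is conventionally defined as R^2 = 1 - SS_res / SS_tot. The sketch below computes it in plain Python on toy numbers. That h2o_r2_score follows exactly this definition is an assumption, and the function name r2_sketch is hypothetical.

```python
def r2_sketch(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / float(len(actual))
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Perfect predictions give 1.0; always predicting the mean gives 0.0
perfect = r2_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r2_sketch([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Read this way, the test score above means the pipeline explains roughly 77% of the variance in the held-out target.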

