H2OPipeline
This notebook provides an overview of the H2OPipeline and its nuanced behavior.
The H2OPipeline generates a sklearn-esque pipeline of H2O steps, finished with an optional H2OEstimator. Note that as of version 0.1.0, the behavior of the H2OPipeline has changed slightly with the inclusion of the exclude_from_ppc and exclude_from_fit parameters.
At its core, the pipeline is a list of length-two tuples of the form ('name', SomeH2OTransformer()), optionally terminated by an H2OEstimator as the final step. The pipeline fits each stage in order, transforming the training data before fitting the next stage. When predicting on or transforming new (test) data, each stage calls either transform or predict at its respective step.
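The stage-by-stage behavior described above can be sketched in plain Python. This is a toy illustration only, not the skutil implementation: ToyPipeline, Center, Doubler and ScaleByTrainingMax are hypothetical names invented for this example, the data is a plain list of floats rather than an H2OFrame, and the sketch assumes the final step is always an estimator.

```python
class Center:
    """Hypothetical transformer: subtracts the training mean."""
    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class Doubler:
    """Hypothetical stateless transformer: doubles every value."""
    def fit(self, X):
        return self

    def transform(self, X):
        return [2 * x for x in X]


class ScaleByTrainingMax:
    """Hypothetical final estimator: divides by the largest |x| seen in training."""
    def fit(self, X):
        self.max_ = max(abs(x) for x in X)
        return self

    def predict(self, X):
        return [x / self.max_ for x in X]


class ToyPipeline:
    """Toy stand-in for a pipeline of ('name', stage) tuples."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X):
        *transformers, (_, last) = self.steps
        for _, stage in transformers:
            # fit the stage, then transform the data before the next stage sees it
            X = stage.fit(X).transform(X)
        last.fit(X)
        return self

    def predict(self, X):
        *transformers, (_, last) = self.steps
        for _, stage in transformers:
            # new data is only transformed, never re-fit
            X = stage.transform(X)
        return last.predict(X)


pipe = ToyPipeline([
    ('center', Center()),
    ('double', Doubler()),
    ('est', ScaleByTrainingMax()),
]).fit([1.0, 2.0, 3.0])

preds = pipe.predict([2.0, 4.0])
print(preds)
```

The point of the sketch is that the statistics learned during fit (here the training mean and maximum) are reused unchanged when new data flows through the same stages.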
On the topic of exclusions and feature_names:
Prior to version 0.1.0, H2OTransformers did not take the keyword exclude_features. Its addition necessitated two new keywords in the H2OPipeline, and a slight change in the behavior of feature_names:
exclude_from_ppc - If set in the H2OPipeline constructor, these features will be universally omitted from every preprocessing stage. Since exclude_features can also be set individually in each transformer, a stage where exclude_features has been explicitly set will exclude the union of exclude_from_ppc and exclude_features.
exclude_from_fit - If set in the H2OPipeline constructor, these features will be omitted from the training_cols_ fit attribute, which holds the columns passed to the final stage in the pipeline.
feature_names - The former behavior of the H2OPipeline only used feature_names in the fit of the first transformer, passing the remaining columns to the next transformer as its feature_names parameter. The new behavior is more discriminating in the case of explicitly-set attributes: where a transformer's feature_names parameter has been explicitly set, only those names will be used in the fit. This is useful when, for instance, you only want to drop one of two multicollinear features using the H2OMulticollinearityFilterer rather than fitting against the entire dataset. It also adheres to the now-expected behavior of the exclusion parameters.

We will start by loading the Boston housing dataset from sklearn and uploading it into an H2OFrame. Fortunately, skutil makes this very easy (note that you must start your h2o cluster first!):
In [2]:
import h2o
h2o.connect(ip='10.7.54.204', port=54321)  # the cluster was started beforehand from the command line
Out[2]:
In [4]:
from skutil.h2o import load_boston_h2o
from skutil.h2o import h2o_train_test_split
X = load_boston_h2o(include_tgt=True, shuffle=True, tgt_name='target')
X_train, X_test = h2o_train_test_split(X, train_size=0.7) # this splits our data
X_train.head()
Out[4]:
In [5]:
from skutil.h2o import H2OPipeline
from skutil.h2o.transform import H2OSelectiveScaler
from skutil.h2o.select import H2OMulticollinearityFilterer
from h2o.estimators import H2OGradientBoostingEstimator
# Declare our pipe - this one is intentionally a bit complex in behavior
pipe = H2OPipeline([
        ('scl', H2OSelectiveScaler(feature_names=['B', 'PTRATIO', 'CRIM'])),  # will ONLY operate on these features
        ('mcf', H2OMulticollinearityFilterer(exclude_features=['CHAS'])),     # will exclude this AS WELL AS 'TAX'
        ('gbm', H2OGradientBoostingEstimator())
    ],
    exclude_from_ppc=['TAX'],  # excluded from all preprocessor fits
    feature_names=None,        # fit the first stage on ALL features (minus exceptions)
    target_feature='target')   # will be excluded from all preprocessor fits, as it's the target

# do actual fit:
pipe.fit(X_train)
Out[5]:
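To make the exclusion rules concrete for the pipeline declared above, here is a minimal plain-Python sketch of how the columns for each preprocessing stage might be resolved. It is an assumption-laden illustration, not skutil's code: resolve_stage_columns is a hypothetical helper, and the column list assumes the frame carries the standard sklearn Boston feature names plus the 'target' column.

```python
# Assumed column layout: the 13 standard Boston features plus the target.
ALL_COLS = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
            'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'target']


def resolve_stage_columns(all_cols, target, exclude_from_ppc=None,
                          feature_names=None, exclude_features=None):
    """Hypothetical helper: columns one preprocessing stage would be fit on."""
    # A stage excludes the union of the pipeline-level and stage-level
    # exclusions, and the target is never part of a preprocessor fit.
    excluded = set(exclude_from_ppc or []) | set(exclude_features or []) | {target}
    # Explicitly-set feature_names restrict the stage to just those names.
    candidates = feature_names if feature_names is not None else all_cols
    return [c for c in candidates if c not in excluded]


# 'scl' stage: explicit feature_names, no stage-level exclusions
scl_cols = resolve_stage_columns(ALL_COLS, 'target',
                                 exclude_from_ppc=['TAX'],
                                 feature_names=['B', 'PTRATIO', 'CRIM'])

# 'mcf' stage: no feature_names, but excludes 'CHAS' on top of 'TAX'
mcf_cols = resolve_stage_columns(ALL_COLS, 'target',
                                 exclude_from_ppc=['TAX'],
                                 exclude_features=['CHAS'])

print(scl_cols)
print(mcf_cols)
```

Under these assumptions the scaler sees only its three named features, while the multicollinearity filterer sees every column except 'CHAS', 'TAX' and the target. The cells that follow inspect the fitted stages to check this against the real pipeline.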
In [6]:
# First stage should ONLY be fit on these features: ['B','PTRATIO','CRIM']
step = pipe.steps[0][1] # extract the transformer from the tuple
step.means
Out[6]:
In [8]:
# Second stage should be fit on everything BUT ['CHAS', 'TAX'] (and of course, the target)
step = pipe.steps[1][1]
step.correlations_  # looks like we had nothing to drop anyway
Out[8]:
In [9]:
# here are the features we ultimately fit the estimator on:
pipe.training_cols_
Out[9]:
In [11]:
# Let's check our R^2:
from skutil.h2o.metrics import h2o_r2_score
test_pred = pipe.predict(X_test)
print('Testing R^2: %.5f' % h2o_r2_score(X_test['target'], test_pred))
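For readers who want to see what the score measures, the sketch below computes the standard coefficient of determination, R^2 = 1 - SS_res / SS_tot, on plain Python lists. It assumes h2o_r2_score follows this usual definition; toy_r2 and the sample values are hypothetical and are not taken from the notebook's data.

```python
def toy_r2(y_true, y_pred):
    """Standard R^2 on plain lists (illustrative; not skutil's implementation)."""
    mean_true = sum(y_true) / len(y_true)
    # residual sum of squares: error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values purely for illustration
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 8.0]
score = toy_r2(y_true, y_pred)
print('Toy R^2: %.5f' % score)
```

A score of 1.0 means the predictions match the targets exactly, while a score of 0.0 means the model does no better than predicting the mean of the targets.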