skutil

Skutil brings the best of both worlds to H2O and sklearn, delivering an easy transition into the world of distributed computing that H2O offers, while providing the same, familiar interface that sklearn users have come to know and love. This notebook will give an example of how to use skutil preprocessors with H2OEstimators and H2OFrames.
Author: Taylor G Smith
Contact: tgsmith61591@gmail.com
Python packages you will need:

python 2.7
numpy >= 1.6
scipy >= 0.17
scikit-learn >= 0.16
pandas >= 0.18
cython >= 0.22
h2o >= 3.8.2.9

Misc. requirements (for compiling Fortran a la f2py):

gfortran
gcc
Note that the El Capitan Apple Developer tools upgrade necessitates upgrading gcc! Use:
brew upgrade gcc
This notebook is intended for an audience with a working understanding of machine learning principles and a background in Python development, ideally sklearn or H2O users. Note that this notebook is not designed to teach machine learning, but to demonstrate use of the skutil package.
In [5]:
from __future__ import print_function, division, absolute_import
import warnings
import skutil
import sklearn
import h2o
import pandas as pd
import numpy as np
# we'll be plotting inline...
%matplotlib inline
print('Skutil version: %s' % skutil.__version__)
print('H2O version: %s' % h2o.__version__)
print('Numpy version: %s' % np.__version__)
print('Sklearn version: %s' % sklearn.__version__)
print('Pandas version: %s' % pd.__version__)
In [7]:
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # I started this cluster up via CLI with:
    # $ java -Xmx2g -jar /anaconda/h2o_jar/h2o.jar
    h2o.init(ip='10.7.187.84', port=54321, start_h2o=False)
In [8]:
from sklearn.datasets import load_breast_cancer
from skutil.h2o.util import from_pandas
# import data, load into pandas
bc = load_breast_cancer()
X = pd.DataFrame.from_records(data=bc.data, columns=bc.feature_names)
X['target'] = bc.target
# push to h2o cloud
X = from_pandas(X)
print(X.shape)
X.head()
Out[8]:
In [9]:
# Here are our feature names:
x = list(bc.feature_names)
y = 'target'
In [11]:
from skutil.h2o import h2o_train_test_split
# first, let's make sure our target is a factor
X[y] = X[y].asfactor()
# we'll use 75% of the data for training, 25% for validation
X_train, X_val = h2o_train_test_split(X, train_size=0.75, random_state=42)
# make sure we did it right...
assert X.shape[0] == (X_train.shape[0] + X_val.shape[0])
skutil.h2o

Skutil provides an h2o module which delivers some skutil feature_selection classes that can operate on an H2OFrame. Each BaseH2OTransformer has the following __init__ signature:
BaseH2OTransformer(self, feature_names=None, target_feature=None)
The selector will only operate on the feature_names (if provided—else it will operate on all features) and will always exclude the target_feature.
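For instance, a minimal sketch of that contract (the column subset here is purely illustrative):

from skutil.h2o import H2ONearZeroVarianceFilterer

# operate on just these two columns; the target is always excluded.
# with feature_names=None, the selector would consider all non-target columns
sel = H2ONearZeroVarianceFilterer(feature_names=['mean radius', 'mean texture'],
                                  target_feature='target')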
The first step would be to ensure our data is balanced, as we don't want imbalanced minority/majority classes. The problem of class imbalance is well-documented, and many solutions have been proposed. Skutil provides a mechanism by which we could over-sample the minority class using the H2OOversamplingClassBalancer, or under-sample the majority class using the H2OUndersamplingClassBalancer.
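Had the classes been skewed, the over-sampling route might look like the sketch below (the ratio parameter and balance method follow skutil's README; treat the exact signature as an assumption):

from skutil.h2o import H2OOversamplingClassBalancer

# assumed signature: over-sample the minority class until it reaches
# a 0.5 minority/majority ratio
balancer = H2OOversamplingClassBalancer(target_feature=y, ratio=0.5)
X_balanced = balancer.balance(X_train)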
Fortunately for us, the classes in this dataset are fairly balanced, so we can move on to the next piece.
Some predictors contain few unique values and are considered "near-zero variance" predictors. For many parametric models, this may cause the fit to be unstable. Skutil's NearZeroVarianceFilterer and H2ONearZeroVarianceFilterer drop features with variance below a given threshold (based on caret's preprocessor).
Note: sklearn offers the same behavior under VarianceThreshold (in sklearn.feature_selection)
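For comparison, sklearn's equivalent operates on numpy arrays or pandas frames rather than H2OFrames:

from sklearn.feature_selection import VarianceThreshold

# drop any feature whose variance falls below the threshold
vt = VarianceThreshold(threshold=1e-4)
X_dense = vt.fit_transform(bc.data)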
In [12]:
from skutil.h2o import H2ONearZeroVarianceFilterer
# Let's determine whether we're at risk for any near-zero variance
nzv = H2ONearZeroVarianceFilterer(feature_names=x, target_feature=y, threshold=1e-4)
nzv.fit(X_train)
# let's see if anything was dropped...
nzv.drop_
Out[12]:
In [13]:
nzv.var_
Out[13]:
Multicollinearity (MC) can be detrimental to the fit of parametric models (for our example, we're going to use a tree-based model, which is non-parametric, but the demo is still useful), and can cause confounding results in some models' variable importances. With skutil, we can filter out features that are correlated beyond a certain absolute threshold. When a violating correlation is identified, the feature with the higher mean absolute correlation is removed.
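To make that drop rule concrete, here is a toy numpy illustration (not skutil's internals, just the same logic):

import numpy as np

# toy 3x3 correlation matrix: features 0 and 1 are highly correlated
corr = np.array([[1.00, 0.95, 0.20],
                 [0.95, 1.00, 0.40],
                 [0.20, 0.40, 1.00]])

# the (0, 1) pair violates a 0.90 threshold; compute each feature's
# mean absolute correlation with the other features
mac = (np.abs(corr).sum(axis=0) - 1.0) / (corr.shape[0] - 1)
print(mac)  # [0.575, 0.675, 0.3] -> feature 1 has the higher MAC and is dropped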
Before filtering out collinear features, let's take a look at the correlation matrix.
In [14]:
from skutil.h2o import h2o_corr_plot
# note that we want to exclude the target!!
h2o_corr_plot(X_train[x], xticklabels=x, yticklabels=x)
In [15]:
from skutil.h2o import H2OMulticollinearityFilterer
# Are we at risk of any multicollinearity?
mcf = H2OMulticollinearityFilterer(feature_names=x, target_feature=y, threshold=0.90)
mcf.fit(X_train)
# we can inspect the offending pairwise correlations
mcf.correlations_
Out[15]:
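As a short follow-up (assuming the filterer exposes the same drop_ attribute as the variance filterer above), we can list the removed features and apply the filter:

# features flagged for removal...
print(mcf.drop_)

# ...and a filtered copy of the training frame
X_train_mc = mcf.transform(X_train)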
As you'll see in the next section (Pipelines), where certain preprocessing steps take place matters. If there is a subset of features you don't want to model on or preprocess, you can drop them out. Sometimes this is more effective than creating a list of potentially thousands of feature names to pass as the feature_names parameter.
In [10]:
from skutil.h2o import H2OFeatureDropper
# maybe I don't like 'mean fractal dimension'
dropper = H2OFeatureDropper(feature_names=['mean fractal dimension'], target_feature=y)
transformed = dropper.fit_transform(X_train)
# we can ensure it's not there
assert 'mean fractal dimension' not in transformed.columns
skutil.h2o modeling

Skutil's h2o module allows us to form the Pipeline objects we're familiar with from sklearn. This permits us to string a series of preprocessors together, with an optional H2OEstimator as the last step. Like sklearn Pipelines, the first argument is a single list of length-two tuples (where the first arg is the name of the step, and the second is the Estimator/Transformer); however, the H2OPipeline takes two more arguments: feature_names and target_feature.

Note that the feature_names arg is the set of names the first preprocessor will operate on; after that, all remaining feature names (i.e., not the target) are passed to the next step.
In [16]:
from skutil.h2o import H2OPipeline
from h2o.estimators import H2ORandomForestEstimator
from skutil.h2o.metrics import h2o_accuracy_score # same as sklearn's, but with H2OFrames
# let's fit a pipeline with our estimator...
pipe = H2OPipeline([
    ('nzv', H2ONearZeroVarianceFilterer(threshold=1e-1)),
    ('mcf', H2OMulticollinearityFilterer(threshold=0.95)),
    ('rf' , H2ORandomForestEstimator(ntrees=50, max_depth=8, min_rows=5))
],
    # feature_names is the set of features the first transformer
    # will operate on. The remaining features will be passed
    # to the next step
    feature_names=x,
    target_feature=y)
# fit...
pipe = pipe.fit(X_train)
# eval accuracy on validation set
pred = pipe.predict(X_val)
actual = X_val[y]
pred = pred['predict']
print('Validation accuracy: %.5f' % h2o_accuracy_score(actual, pred))
In [17]:
pipe.training_cols_
Out[17]:
In [18]:
from skutil.h2o import H2ORandomizedSearchCV
from skutil.h2o import H2OKFold
from scipy.stats import uniform, randint
# define our random state
rand_state = 2016
# we have the option to choose the model that maximizes CV scores,
# or the model that minimizes std deviations between CV scores.
# let's choose the former for this example
minimize = 'bias'
# let's redefine our pipeline
pipe = H2OPipeline([
    ('nzv', H2ONearZeroVarianceFilterer()),
    ('mcf', H2OMulticollinearityFilterer()),
    ('rf' , H2ORandomForestEstimator(seed=rand_state))
])
# our hyperparameters over which to search...
hyper = {
    'nzv__threshold' : uniform(1e-4, 1e-1), # scipy's uniform(loc, scale) samples
    'mcf__threshold' : uniform(0.7, 0.29),  # from the range [loc, loc + scale)
    'rf__ntrees'     : randint(50, 100),
    'rf__max_depth'  : randint(10, 12),
    'rf__min_rows'   : randint(25, 50)
}
# define our grid search
search = H2ORandomizedSearchCV(
    estimator=pipe,
    param_grid=hyper,
    feature_names=x,
    target_feature=y,
    n_iter=2, # keep it small for our demo...
    random_state=rand_state,
    scoring='accuracy_score',
    cv=H2OKFold(n_folds=3, shuffle=True, random_state=rand_state),
    verbose=3,
    minimize=minimize
)
# fit
search.fit(X_train)
Out[18]:
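Once fit, the search exposes sklearn-style attributes (attribute names assumed to mirror sklearn's search API, which skutil follows):

# the best CV score and the refit pipeline
print('Best CV accuracy: %.5f' % search.best_score_)
best_pipe = search.best_estimator_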
In [19]:
from skutil.utils import report_grid_score_detail
# now let's look deeper...
sort_by = 'std' if minimize == 'variance' else 'score'
report_grid_score_detail(search, charts=True, sort_results=True,
                         ascending=minimize == 'variance',
                         sort_by=sort_by)
Out[19]:
In [20]:
search.varimp()
Out[20]:
So our best estimator achieves a mean cross-validation accuracy of 93%! We can predict with our best estimator as follows:
In [21]:
val_preds = search.predict(X_val)
# print accuracy
print('Validation accuracy: %.5f' % h2o_accuracy_score(actual, val_preds['predict']))
val_preds.head()
Out[21]:
(Not shown: the other models we built and evaluated against the validation set along the way.) In a real situation, you will probably have a holdout set and will have built several models. Once you have a collection of models and want to select one, you introduce the holdout set only once, at the very end.
In [22]:
import os
# get absolute path
cwd = os.getcwd()
model_path = os.path.join(cwd, 'grid.pkl')
# save -- it's that easy!!!
search.save(location=model_path, warn_if_exists=False)
In [23]:
search = H2ORandomizedSearchCV.load(model_path)
new_predictions = search.predict(X_val)
new_predictions.head()
Out[23]:
In [24]:
h2o.shutdown(prompt=False) # shutdown cluster
os.unlink(model_path) # remove the pickle file...