H2O Tutorial

Author: Spencer Aiello

Contact: spencer@h2oai.com

This tutorial steps through a quick introduction to H2O's Python API. The goal is to introduce H2O's capabilities from Python through a complete example. To help those who are accustomed to Scikit-Learn and Pandas, the demo includes specific callouts for the differences between H2O and those packages; this is intended to help anyone who needs to do machine learning on really big data make the transition. It is not meant to be a tutorial on machine learning or algorithms.

Detailed documentation about H2O and its Python API is available at http://docs.h2o.ai.

Setting up your system for this demo

The following code creates two CSV files using data from the Boston Housing dataset, which is built into scikit-learn, and writes them to the local directory.


In [49]:
import pandas as pd
import numpy
from numpy.random import choice
from sklearn.datasets import load_boston

import h2o
h2o.init()


H2O cluster uptime: 4 seconds 805 milliseconds
H2O cluster version: 3.3.0.99999
H2O cluster name: ece
H2O cluster total nodes: 1
H2O cluster total memory: 8.89 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

In [50]:
# transfer the boston data from pandas to H2O
boston_data = load_boston()
X = pd.DataFrame(data=boston_data.data, columns=boston_data.feature_names)
X["Median_value"] = boston_data.target
X = h2o.H2OFrame(python_obj=X.to_dict("list"))


Uploaded py1b3c880b-d4e9-4b9b-b6a8-e5094ee3961b into cluster with 506 rows and 14 cols

In [51]:
# select 10% for validation
r = X.runif(seed=123456789)
train = X[r < 0.9,:]
valid = X[r >= 0.9,:]

h2o.export_file(train, "Boston_housing_train.csv", force=True)
h2o.export_file(valid, "Boston_housing_test.csv", force=True)

Enable inline plotting in the Jupyter Notebook


In [52]:
%matplotlib inline
import matplotlib.pyplot as plt

Intro to H2O Data Munging

Read csv data into H2O. This loads the data into H2O's column-compressed, in-memory, key-value store.


In [53]:
fr = h2o.import_file("Boston_housing_train.csv")


Imported Boston_housing_train.csv. Parsed 462 rows and 14 cols

View the top of the H2O frame.


In [54]:
fr.head()


H2OFrame with 10 rows and 14 columns: 
      CRIM    ZN       B  LSTAT  Median_value    AGE  TAX  RAD  CHAS    NOX  \
0  0.00632  18.0  396.90   4.98          24.0   65.2  296    1     0  0.538   
1  0.02729   0.0  392.83   4.03          34.7   61.1  242    2     0  0.469   
2  0.03237   0.0  394.63   2.94          33.4   45.8  222    3     0  0.458   
3  0.06905   0.0  396.90   5.33          36.2   54.2  222    3     0  0.458   
4  0.02985   0.0  394.12   5.21          28.7   58.7  222    3     0  0.458   
5  0.08829  12.5  395.60  12.43          22.9   66.6  311    5     0  0.524   
6  0.14455  12.5  396.90  19.15          27.1   96.1  311    5     0  0.524   
7  0.21124  12.5  386.63  29.93          16.5  100.0  311    5     0  0.524   
8  0.17004  12.5  386.71  17.10          18.9   85.9  311    5     0  0.524   
9  0.22489  12.5  392.52  20.45          15.0   94.3  311    5     0  0.524   

      RM  INDUS  PTRATIO     DIS  
0  6.575   2.31     15.3  4.0900  
1  7.185   7.07     17.8  4.9671  
2  6.998   2.18     18.7  6.0622  
3  7.147   2.18     18.7  6.0622  
4  6.430   2.18     18.7  6.0622  
5  6.012   7.87     15.2  5.5605  
6  6.172   7.87     15.2  5.9505  
7  5.631   7.87     15.2  6.0821  
8  6.004   7.87     15.2  6.5921  
9  6.377   7.87     15.2  6.3467  
Out[54]:

View the bottom of the H2O frame.


In [55]:
fr.tail()


H2OFrame with 10 rows and 14 columns: 
      CRIM  ZN       B  LSTAT  Median_value   AGE  TAX  RAD  CHAS    NOX  \
0  0.28960   0  396.90  21.14          19.7  72.9  391    6     0  0.585   
1  0.26838   0  396.90  14.10          18.3  70.6  391    6     0  0.585   
2  0.23912   0  396.90  12.92          21.2  65.3  391    6     0  0.585   
3  0.17783   0  395.77  15.10          17.5  73.5  391    6     0  0.585   
4  0.22438   0  396.90  14.33          16.8  79.7  391    6     0  0.585   
5  0.06263   0  391.99   9.67          22.4  69.1  273    1     0  0.573   
6  0.04527   0  396.90   9.08          20.6  76.7  273    1     0  0.573   
7  0.06076   0  396.90   5.64          23.9  91.0  273    1     0  0.573   
8  0.10959   0  393.45   6.48          22.0  89.3  273    1     0  0.573   
9  0.04741   0  396.90   7.88          11.9  80.8  273    1     0  0.573   

      RM  INDUS  PTRATIO     DIS  
0  5.390   9.69     19.2  2.7986  
1  5.794   9.69     19.2  2.8927  
2  6.019   9.69     19.2  2.4091  
3  5.569   9.69     19.2  2.3999  
4  6.027   9.69     19.2  2.4982  
5  6.593  11.93     21.0  2.4786  
6  6.120  11.93     21.0  2.2875  
7  6.976  11.93     21.0  2.1675  
8  6.794  11.93     21.0  2.3889  
9  6.030  11.93     21.0  2.5050  
Out[55]:

Select a column

fr["VAR_NAME"]


In [56]:
fr["CRIM"].head() # Tab completes


H2OFrame with 10 rows and 1 columns: 
      CRIM
0  0.00632
1  0.02729
2  0.03237
3  0.06905
4  0.02985
5  0.08829
6  0.14455
7  0.21124
8  0.17004
9  0.22489
Out[56]:

Select a few columns


In [57]:
columns = ["CRIM", "RM", "RAD"]
fr[columns].head()


H2OFrame with 10 rows and 3 columns: 
      CRIM     RM  RAD
0  0.00632  6.575    1
1  0.02729  7.185    2
2  0.03237  6.998    3
3  0.06905  7.147    3
4  0.02985  6.430    3
5  0.08829  6.012    5
6  0.14455  6.172    5
7  0.21124  5.631    5
8  0.17004  6.004    5
9  0.22489  6.377    5
Out[57]:

Select a subset of rows

Unlike in Pandas, columns may be identified either by index or by column name. Therefore, when subsetting by rows, you must also pass the column selection (use : to select all columns).


In [58]:
fr[2:7,:]  # explicitly select all columns with :


H2OFrame with 5 rows and 14 columns: 
      CRIM    ZN       B  LSTAT  Median_value   AGE  TAX  RAD  CHAS    NOX  \
0  0.03237   0.0  394.63   2.94          33.4  45.8  222    3     0  0.458   
1  0.06905   0.0  396.90   5.33          36.2  54.2  222    3     0  0.458   
2  0.02985   0.0  394.12   5.21          28.7  58.7  222    3     0  0.458   
3  0.08829  12.5  395.60  12.43          22.9  66.6  311    5     0  0.524   
4  0.14455  12.5  396.90  19.15          27.1  96.1  311    5     0  0.524   

      RM  INDUS  PTRATIO     DIS  
0  6.998   2.18     18.7  6.0622  
1  7.147   2.18     18.7  6.0622  
2  6.430   2.18     18.7  6.0622  
3  6.012   7.87     15.2  5.5605  
4  6.172   7.87     15.2  5.9505  
Out[58]:
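
Row slices can also be combined with an explicit column selection by name. A minimal sketch (the list-of-names form mirrors the fr[:400, x] usage later in this tutorial):

     >>> fr[2:7, "CRIM"]           # rows 2-6, a single column by name
     >>> fr[2:7, ["CRIM", "RM"]]   # rows 2-6, several columns by name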

Key attributes:

  * columns, names, col_names
  * len, shape, dim, nrow, ncol
  * types

Note:

Since the data is not in local Python memory, there is no "values" attribute. If you want to pull all of the data into local Python memory, do so explicitly with h2o.export_file and then read the data back into Python from disk.
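
For example, a minimal sketch reusing the h2o.export_file call from the setup above (the filename here is arbitrary):

     >>> h2o.export_file(fr, "frame_local_copy.csv", force=True)  # write the frame from the cluster to disk
     >>> local_df = pd.read_csv("frame_local_copy.csv")           # ordinary in-memory pandas DataFrame
     >>> print local_df.shape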


In [59]:
# The columns attribute is exactly like Pandas
print "Columns:", fr.columns, "\n"
print "Columns:", fr.names, "\n"
print "Columns:", fr.col_names, "\n"

# There are a number of attributes to get at the shape
print "length:", str( len(fr) ), "\n"
print "shape:", fr.shape, "\n"
print "dim:", fr.dim, "\n"
print "nrow:", fr.nrow, "\n"
print "ncol:", fr.ncol, "\n"

# Use the "types" attribute to list the column types
print "types:", fr.types, "\n"


Columns: [u'CRIM', u'ZN', u'B', u'LSTAT', u'Median_value', u'AGE', u'TAX', u'RAD', u'CHAS', u'NOX', u'RM', u'INDUS', u'PTRATIO', u'DIS'] 

Columns: [u'CRIM', u'ZN', u'B', u'LSTAT', u'Median_value', u'AGE', u'TAX', u'RAD', u'CHAS', u'NOX', u'RM', u'INDUS', u'PTRATIO', u'DIS'] 

Columns: [u'CRIM', u'ZN', u'B', u'LSTAT', u'Median_value', u'AGE', u'TAX', u'RAD', u'CHAS', u'NOX', u'RM', u'INDUS', u'PTRATIO', u'DIS'] 

length: 462 

shape: (462, 14) 

dim: [462, 14] 

nrow: 462 

ncol: 14 

types: {u'CRIM': u'Numeric', u'ZN': u'Numeric', u'B': u'Numeric', u'LSTAT': u'Numeric', u'Median_value': u'Numeric', u'AGE': u'Numeric', u'TAX': u'Numeric', u'RAD': u'Numeric', u'CHAS': u'Numeric', u'NOX': u'Numeric', u'RM': u'Numeric', u'INDUS': u'Numeric', u'PTRATIO': u'Numeric', u'DIS': u'Numeric'} 

Select rows based on value


In [60]:
fr.shape


Out[60]:
(462, 14)

Boolean masks can be used to subselect rows based on a criterion.


In [61]:
mask = fr["CRIM"]>1
fr[mask,:].shape


Out[61]:
(155, 14)

Get summary statistics of the data and additional data distribution information.


In [62]:
fr.describe()


Rows: 462 Cols: 14
  chunk_type                 chunk_name  count  count_percentage        size  \
0        CBS                       Bits      1          7.142858      128  B   
1        C1N  1-Byte Integers (w/o NAs)      1          7.142858      530  B   
2         C2            2-Byte Integers      1          7.142858      992  B   
3        C2S           2-Byte Fractions      1          7.142858     1008  B   
4        CUD               Unique Reals      4         28.571430      7.2 KB   
5        C8D               64-bit Reals      6         42.857143     22.1 KB   

   size_percentage  
0         0.392710  
1         1.626066  
2         3.043505  
3         3.092594  
4        22.556300  
5        69.288826  
                            size  number_of_rows  number_of_chunks_per_column  \
0  172.16.2.59:54321     31.8 KB             462                            1   
1               mean     31.8 KB             462                            1   
2                min     31.8 KB             462                            1   
3                max     31.8 KB             462                            1   
4             stddev        0  B               0                            0   
5              total     31.8 KB             462                            1   

   number_of_chunks  
0                14  
1                14  
2                14  
3                14  
4                 0  
5                14  

Column-by-Column Summary:

CRIM ZN B LSTAT Median_value AGE TAX RAD CHAS NOX RM INDUS PTRATIO DIS
type real real real real real real int int int real real real real real
mins 0.00632 0.0 0.32 1.73 5.0 6.0 187.0 1.0 0.0 0.385 3.561 0.46 12.6 1.1296
maxs 88.9762 100.0 396.9 37.97 50.0 100.0 711.0 24.0 1.0 0.871 8.78 27.74 22.0 12.1265
sigma 8.68268014543 23.2086423052 90.7500779002 7.1419482934 9.21258527358 27.9631409743 167.460295078 8.64357146773 0.242812755044 0.115349440715 0.707139172922 6.85982058776 2.16522966932 2.11032018051
zero_count 0 343 0 0 0 0 0 0 433 0 0 0 0 0
missing_count 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Set up the predictor and response column names

Using H2O algorithms, it's easier to reference predictor and response columns by name in a single frame (i.e., don't split up X and y).


In [63]:
x = fr.names
y="Median_value"
x.remove(y)

Machine Learning With H2O

H2O is a machine learning library built in Java with interfaces in Python, R, Scala, and JavaScript. It is open source and well-documented.

Unlike Scikit-learn, H2O allows for categorical and missing data.
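
As a small illustration with made-up data (a hedged sketch, not part of the Boston workflow), a column can be marked categorical with asfactor(), and None entries are expected to parse as missing values:

     >>> toy = h2o.H2OFrame(python_obj={"color": ["red", "blue", None], "value": [1.0, None, 3.5]})
     >>> toy["color"] = toy["color"].asfactor()  # mark the column as categorical (enum)
     >>> print toy.types                         # expect 'enum' for color, 'real' for value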

The basic workflow is as follows:

  • Fit the training data with a machine learning algorithm
  • Predict on the testing data

Simple model


In [64]:
model = h2o.random_forest(x=fr[:400,x],y=fr[:400,y],seed=42) # Define and fit first 400 points

In [65]:
model.predict(fr[400:fr.nrow,:])        # Predict the rest


H2OFrame with 62 rows and 1 columns: 
   predict
0   12.712
1   10.266
2   10.480
3   12.416
4   10.462
5   14.062
6   16.392
7   14.582
8   14.836
9   15.867
Out[65]:

The performance of the model can be checked using the holdout dataset.


In [66]:
perf = model.model_performance(fr[400:fr.nrow,:])
perf.r2()      # get the r2 on the holdout data
perf.mse()     # get the mse on the holdout data
perf           # display the performance object


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 14.1917271618
R^2: 0.374431026605
Mean Residual Deviance: 14.1917271618
Out[66]:

Train-Test Split

Instead of taking the first 400 observations for training, we can use H2O to create a random train-test split of the data.


In [67]:
r = fr.runif(seed=12345)   # build random uniform column over [0,1]
train= fr[r<0.75,:]     # perform a 75-25 split
test = fr[r>=0.75,:]

model = h2o.random_forest(x=train[x],y=train[y],seed=42)

perf = model.model_performance(test)
perf.r2()


Out[67]:
0.8542968512013444

There was a massive jump in the R^2 value. This is because the original data is not shuffled, so the first 400 rows are not a representative sample of the full dataset.

Cross validation

H2O's machine learning algorithms take an optional parameter nfolds to specify the number of cross-validation folds to build. H2O's cross-validation uses an internal weight vector to build the folds in an efficient manner (instead of physically building the splits).

In conjunction with the nfolds parameter, a user may specify the way in which observations are assigned to each fold with the fold_assignment parameter (see the sketch after this list), which can be set to one of:

    * AUTO: Perform random assignment.
    * Random: Each row has an equal (1/nfolds) chance of landing in any fold.
    * Modulo: Rows are assigned to folds by taking the row index modulo nfolds.
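
A minimal sketch, assuming the model builder accepts fold_assignment as a keyword argument in the same way it accepts nfolds:

     >>> model_mod = h2o.random_forest(x=fr[x], y=fr[y], nfolds=10, fold_assignment="Modulo")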

In [68]:
model = h2o.random_forest(x=fr[x],y=fr[y], nfolds=10) # build a 10-fold cross-validated model

In [69]:
scores = numpy.array([m.r2() for m in model.xvals]) # iterate over the xval models using the xvals attribute
print "Expected R^2: %.2f +/- %.2f \n" % (scores.mean(), scores.std()*1.96)
print "Scores:", scores.round(2)


Expected R^2: 0.86 +/- 0.02 

Scores: [ 0.87  0.87  0.86  0.88  0.87  0.85  0.87  0.86  0.84  0.87]

However, you can still make use of cross_val_score from Scikit-Learn.

Cross validation: H2O and Scikit-Learn


In [70]:
from sklearn.cross_validation import cross_val_score
from h2o.cross_validation import H2OKFold
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.model.regression import h2o_r2_score
from sklearn.metrics.scorer import make_scorer

You still must use H2O to make the folds. Currently, there is no H2OStratifiedKFold. Additionally, the H2ORandomForestEstimator is analogous to the scikit-learn RandomForestRegressor object, with its own fit method.
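
A minimal sketch of that scikit-learn-style interface (the same fit/predict pattern appears later in the Transformations section):

     >>> est = H2ORandomForestEstimator(seed=42)
     >>> est.fit(fr[x], fr[y])        # train on H2OFrames instead of numpy arrays
     >>> preds = est.predict(fr[x])   # predictions come back as an H2OFrame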


In [71]:
model = H2ORandomForestEstimator(seed=42)

In [72]:
scorer = make_scorer(h2o_r2_score)   # make h2o_r2_score into a scikit_learn scorer
custom_cv = H2OKFold(fr, n_folds=10, seed=42) # make a cv 
scores = cross_val_score(model, fr[x], fr[y], scoring=scorer, cv=custom_cv)

print "Expected R^2: %.2f +/- %.2f \n" % (scores.mean(), scores.std()*1.96)
print "Scores:", scores.round(2)


Expected R^2: 0.87 +/- 0.11 

Scores: [ 0.81  0.85  0.88  0.95  0.75  0.83  0.88  0.92  0.89  0.91]

There isn't much difference in the R^2 value, since the fold strategy is exactly the same. However, there was a major difference in computation time and memory usage.
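
To check that claim on your own cluster, here is a minimal sketch comparing the two approaches with IPython's %time magic (reusing the frame, scorer, and folds defined above):

     >>> %time m_internal = h2o.random_forest(x=fr[x], y=fr[y], nfolds=10)
     >>> %time s_external = cross_val_score(H2ORandomForestEstimator(seed=42), fr[x], fr[y], scoring=scorer, cv=custom_cv)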

Since the progress bar printout gets annoying, let's disable it.


In [73]:
h2o.__PROGRESS_BAR__=False
h2o.no_progress()

Grid search in H2O is still under active development and will be available very soon. However, it is possible to make use of Scikit-Learn's grid search infrastructure (with some performance penalties).

Randomized grid search: H2O and Scikit-Learn


In [74]:
from sklearn import __version__
sklearn_version = __version__
print sklearn_version


0.16.1

If you have 0.16.1, then your system can't handle complex randomized grid searches (this works in every other version of sklearn, including the soon-to-be-released 0.16.2 and older versions).

The steps to perform a randomized grid search:

  1. Import model and RandomizedSearchCV
  2. Define model
  3. Specify parameters to test
  4. Define grid search object
  5. Fit data to grid search object
  6. Collect scores

All of the steps above will be repeated here.

Because 0.16.1 is installed, we use scipy to define specific parameter distributions.

ADVANCED TIP:

Turn off reference counting when spawning jobs in parallel (n_jobs=-1, or n_jobs > 1). We'll turn it back on again after the parallel job completes.

If you don't want to run jobs in parallel, don't turn off the reference counting.

The pattern is:

     >>> h2o.turn_off_ref_cnts()
     >>> .... parallel job ....
     >>> h2o.turn_on_ref_cnts()

In [75]:
%%time
from h2o.estimators.random_forest import H2ORandomForestEstimator  # Import model
from sklearn.grid_search import RandomizedSearchCV  # Import grid search
from scipy.stats import randint, uniform

model = H2ORandomForestEstimator(seed=42)        # Define model

params = {"ntrees": randint(20,50),
          "max_depth": randint(1,10),
          "min_rows": randint(1,10),    # scikit's  min_samples_leaf
          "mtries": randint(2,fr[x].shape[1]),} # Specify parameters to test

scorer = make_scorer(h2o_r2_score)   # make h2o_r2_score into a scikit_learn scorer
custom_cv = H2OKFold(fr, n_folds=10, seed=42) # make a cv 
random_search = RandomizedSearchCV(model, params, 
                                   n_iter=30, 
                                   scoring=scorer, 
                                   cv=custom_cv, 
                                   random_state=42,
                                   n_jobs=1)       # Define grid search object

random_search.fit(fr[x], fr[y])

print "Best R^2:", random_search.best_score_, "\n"
print "Best params:", random_search.best_params_


Best R^2: 0.853718179113 

Best params: {'mtries': 8, 'ntrees': 25, 'min_rows': 1, 'max_depth': 9}
CPU times: user 31.9 s, sys: 3.37 s, total: 35.3 s
Wall time: 1min 28s

We might be tempted to think that we just had a large improvement; however, we must be cautious. The function below creates a more detailed report.


In [76]:
def report_grid_score_detail(random_search, charts=True):
    """Input fit grid search estimator. Returns df of scores with details"""
    df_list = []

    for line in random_search.grid_scores_:
        results_dict = dict(line.parameters)
        results_dict["score"] = line.mean_validation_score
        results_dict["std"] = line.cv_validation_scores.std()*1.96
        df_list.append(results_dict)

    result_df = pd.DataFrame(df_list)
    result_df = result_df.sort("score", ascending=False)
    
    if charts:
        for col in get_numeric(result_df):
            if col not in ["score", "std"]:
                plt.scatter(result_df[col], result_df.score)
                plt.title(col)
                plt.show()

        for col in list(result_df.columns[result_df.dtypes == "object"]):
            cat_plot = result_df.score.groupby(result_df[col]).mean()
            cat_plot.sort()
            cat_plot.plot(kind="barh", xlim=(.5, None), figsize=(7, cat_plot.shape[0]/2))
            plt.show()
    return result_df

def get_numeric(X):
    """Return list of numeric dtypes variables"""
    return X.dtypes[X.dtypes.apply(lambda x: str(x).startswith(("float", "int", "bool")))].index.tolist()

In [77]:
report_grid_score_detail(random_search).head()


Out[77]:
    max_depth  min_rows  mtries  ntrees     score       std
3           9         1       8      25  0.853718  0.121141
17          6         3       4      38  0.837344  0.132959
14          9         1       2      32  0.827494  0.155412
1           9         5       6      22  0.827051  0.150251
5           9         6       6      29  0.820891  0.148135

Based on the grid search report, we can narrow the parameters to search and rerun the analysis. The parameters below were chosen after a few runs:


In [78]:
%%time

params = {"ntrees": randint(30,40),
          "max_depth": randint(4,10),
          "mtries": randint(4,10),}

custom_cv = H2OKFold(fr, n_folds=5, seed=42)           # In small datasets, the fold size can have a big
                                                       # impact on the std of the resulting scores. More
random_search = RandomizedSearchCV(model, params,      # folds --> Less examples per fold --> higher 
                                   n_iter=10,          # variation per sample
                                   scoring=scorer, 
                                   cv=custom_cv, 
                                   random_state=43, 
                                   n_jobs=1)       

random_search.fit(fr[x], fr[y])

print "Best R^2:", random_search.best_score_, "\n"
print "Best params:", random_search.best_params_

report_grid_score_detail(random_search)


Best R^2: 0.860880234682 

Best params: {'mtries': 8, 'ntrees': 30, 'max_depth': 8}
CPU times: user 5.78 s, sys: 564 ms, total: 6.34 s
Wall time: 15 s

Transformations

Rule of machine learning: Don't use your testing data to inform your training data. Unfortunately, this happens all the time when preparing a dataset for the final model. But on smaller datasets, you must be especially careful.

At the moment, there are no classes for managing data transformations. On the one hand, this requires the user to tote around some extra state, but on the other, it allows the user to be more explicit about transforming H2OFrames.

Basic steps:

  1. Remove the response variable from transformations.
  2. Import transformer
  3. Define transformer
  4. Fit train data to transformer
  5. Transform test and train data
  6. Re-attach the response variable.

First, let's normalize the data using the means and standard deviations of the training data. Then, let's perform a principal component analysis on the training data and select the top 5 components. Finally, we use those components to reduce the train and test design matrices.


In [79]:
from h2o.transforms.preprocessing import H2OScaler
from h2o.transforms.decomposition import H2OPCA

Normalize Data: Use the means and standard deviations from the training data.


In [80]:
y_train = train.pop("Median_value")
y_test  = test.pop("Median_value")

In [81]:
norm = H2OScaler()
norm.fit(train)
X_train_norm = norm.transform(train)
X_test_norm  = norm.transform(test)

In [82]:
print X_test_norm.shape
X_test_norm


(122, 13)
H2OFrame with 122 rows and 13 columns: 
        CRIM          ZN            B       LSTAT          AGE           TAX  \
0 -24.736214 -246.063411  3478.064487  -52.441200  -413.735371 -30080.217069   
1 -23.659575   36.576994  2566.038932  126.336348   883.754751 -15288.430659   
2 -24.369586   36.576994  3478.064487    5.261797   399.320666 -15288.430659   
3 -24.548962   36.576994  2909.713605   22.994204  -844.343447 -15288.430659   
4 -20.490610 -246.063411  3478.064487  -31.147777  -198.431334 -15953.230048   
5 -15.926989 -246.063411  3478.064487   44.869018   648.620078 -15953.230048   
6 -19.578204 -246.063411  3249.836086   27.281958   716.610827 -15953.230048   
7 -19.406096 -246.063411  2682.373253    1.846128   725.109671 -15953.230048   
8 -13.047041 -246.063411 -9717.444492   56.642174   795.933367 -15953.230048   
9 -23.933599 -246.063411  3169.911743  -17.557776 -1093.642859 -20606.825773   

         RAD     CHAS       NOX        RM      INDUS   PTRATIO       DIS  
0 -51.962844 -0.01489 -0.011189  0.630504 -63.773819  0.684709  4.823702  
1 -34.909401 -0.01489 -0.003593 -0.476113 -23.687122 -7.112459  4.865433  
2 -34.909401 -0.01489 -0.003593 -0.200189 -23.687122 -7.112459  5.168671  
3 -34.909401 -0.01489 -0.003593 -0.287784 -23.687122 -7.112459  3.541759  
4 -43.436123 -0.01489 -0.001981 -0.243987 -21.784942  5.808563  1.982793  
5 -43.436123 -0.01489 -0.001981 -0.103105 -21.784942  5.808563  0.450669  
6 -43.436123 -0.01489 -0.001981 -0.262235 -21.784942  5.808563  1.337103  
7 -43.436123 -0.01489 -0.001981  0.154571 -21.784942  5.808563  1.452652  
8 -43.436123 -0.01489 -0.001981 -0.136683 -21.784942  5.808563 -0.004606  
9 -34.909401 -0.01489 -0.006470 -0.231577 -37.143289  1.798590  0.178888  
Out[82]:

Then, we can apply PCA and keep the top 5 components.


In [83]:
pca = H2OPCA(n_components=5)
pca.fit(X_train_norm)
X_train_norm_pca = pca.transform(X_train_norm)
X_test_norm_pca  = pca.transform(X_test_norm)

In [84]:
# prop of variance explained by top 5 components?

In [85]:
print X_test_norm_pca.shape
X_test_norm_pca[:5]


(122, 5)
H2OFrame with 122 rows and 5 columns: 
            PC1           PC2          PC3         PC4        PC5
0 -30275.268227   -625.603879   190.440193  369.210711  12.389230
1 -15481.014400    465.946145  1014.829847 -446.059456  38.351308
2 -15611.350493   1372.839872   586.131836 -231.060297  -0.387572
3 -15551.885105    817.877766  -530.932422  323.576391  21.281381
4 -16276.796219   1286.235678   185.614799  287.705163  -4.813496
5 -16264.821892   1280.602054   945.464828  -89.920838  17.023511
6 -16233.006802   1054.058236  1003.812227 -119.775113   5.928040
7 -16156.125715    491.806467  1005.791501 -122.633986  -4.737604
8 -14477.165840 -11793.980470   962.696650 -136.697963  -1.805725
9 -20857.888688    356.531393  -552.379748  682.064306  21.077590
Out[85]:


In [86]:
model = H2ORandomForestEstimator(seed=42)
model.fit(X_train_norm_pca,y_train)
y_hat  = model.predict(X_test_norm_pca)

In [87]:
h2o_r2_score(y_test,y_hat)


Out[87]:
0.5344823402879314

Although this is much simpler than keeping track of all of these transformations manually, it becomes somewhat of a burden when you want to chain together multiple transformers.

Pipelines

"Tranformers unite!"

If your raw data is a mess and you have to perform several transformations before using it, use a pipeline to keep things simple.

Steps:

  1. Import Pipeline, transformers, and model
  2. Define pipeline. The first and only argument is a list of tuples where the first element of each tuple is a name you give the step and the second element is a defined transformer. The last step is optionally an estimator class (like a RandomForest).
  3. Fit the training data to pipeline
  4. Either transform or predict the testing data

In [88]:
from h2o.transforms.preprocessing import H2OScaler
from h2o.transforms.decomposition import H2OPCA
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [89]:
from sklearn.pipeline import Pipeline                # Import Pipeline <other imports not shown>
model = H2ORandomForestEstimator(seed=42)
pipe = Pipeline([("standardize", H2OScaler()),       # Define pipeline as a series of steps
                 ("pca", H2OPCA(n_components=5)),
                 ("rf", model)])                     # Notice the last step is an estimator

pipe.fit(train, y_train)                             # Fit training data
y_hat = pipe.predict(test)                           # Predict testing data (due to last step being an estimator)
h2o_r2_score(y_test, y_hat)                          # Notice the final score is identical to before


Out[89]:
0.533575578318213

This is so much easier!!!

But, wait a second, we did worse after applying these transformations! We might wonder how different hyperparameters for the transformations impact the final score.

Combining randomized grid search and pipelines

"Yo dawg, I heard you like models, so I put models in your models to model models."

Steps:

  1. Import Pipeline, grid search, transformers, and estimators
  2. Define pipeline
  3. Define parameters to test in the form "(step name)__(argument name)". A double underscore separates the step name from the argument name.
  4. Define grid search
  5. Fit to grid search

In [90]:
pipe = Pipeline([("standardize", H2OScaler()),
                 ("pca", H2OPCA()),
                 ("rf", H2ORandomForestEstimator(seed=42))])

params = {"standardize__center":    [True, False],           # Parameters to test
          "standardize__scale":     [True, False],
          "pca__n_components":      randint(2, 6),
          "rf__ntrees":             randint(50,80),
          "rf__max_depth":          randint(4,10),
          "rf__min_rows":           randint(5,10), }
#           "rf__mtries":             randint(1,4),}           # gridding over mtries is 
                                                               # problematic with pca grid over 
                                                               # n_components above 

from sklearn.grid_search import RandomizedSearchCV
from h2o.cross_validation import H2OKFold
from h2o.model.regression import h2o_r2_score
from sklearn.metrics.scorer import make_scorer

custom_cv = H2OKFold(fr, n_folds=5, seed=42)
random_search = RandomizedSearchCV(pipe, params,
                                   n_iter=30,
                                   scoring=make_scorer(h2o_r2_score),
                                   cv=custom_cv,
                                   random_state=42,
                                   n_jobs=1)


random_search.fit(fr[x],fr[y])
results = report_grid_score_detail(random_search)
results.head()


Out[90]:
    pca__n_components  rf__max_depth  rf__min_rows  rf__ntrees     score  standardize__center  standardize__scale       std
18                  5              7             5          69  0.369010                False                True  0.084034
11                  5              6             5          67  0.368902                False               False  0.091818
8                   5              8             7          69  0.364806                False               False  0.099530
6                   5              8             6          74  0.355026                 True                True  0.102501
7                   3              8             6          79  0.351689                False                True  0.096872

Currently Under Development (drop-in scikit-learn pieces):

* Richer set of transforms (only PCA and Scale are implemented)
* Richer set of estimators (only RandomForest is available)
* Full H2O Grid Search

Other Tips: Model Save/Load

It is useful to save constructed models to disk and reload them between H2O sessions. Here's how:


In [91]:
best_estimator = random_search.best_estimator_                        # fetch the pipeline from the grid search
h2o_model      = h2o.get_model(best_estimator._final_estimator._id)    # fetch the model from the pipeline

In [92]:
save_path = h2o.save_model(h2o_model, path=".", force=True)
print save_path


/Users/ece/0xdata/h2o-dev/DRF_model_python_1442338733924_1342

In [93]:
# assumes new session
my_model = h2o.load_model(path=save_path)

In [94]:
my_model.predict(fr)


H2OFrame with 462 rows and 1 columns: 
     predict
0  27.582987
1  27.582987
2  27.582987
3  27.582987
4  27.582987
5  27.582987
6  27.582987
7  27.582987
8  27.582987
9  27.582987
Out[94]: