H2O Tutorial

Author: Spencer Aiello

Contact: spencer@h2oai.com

This tutorial steps through a quick introduction to H2O's Python API. The goal is to introduce H2O's capabilities from Python through a complete example. To help those accustomed to Scikit-learn and Pandas, the demo includes specific call-outs for differences between H2O and those packages; this is intended to help anyone who needs to do machine learning on really big data make the transition. It is not meant to be a tutorial on machine learning or algorithms.

Detailed documentation about H2O and its Python API is available at http://docs.h2o.ai.

Setting up your system for this demo

The following code creates two CSV files in the local directory using data from the Boston Housing dataset, which is built into scikit-learn.


In [1]:
import pandas as pd
import numpy
from numpy.random import choice
from sklearn.datasets import load_boston
from h2o.estimators.random_forest import H2ORandomForestEstimator


import h2o
h2o.init()


Warning: Version mismatch. H2O is version 3.5.0.99999, but the python package is version UNKNOWN.
H2O cluster uptime: 14 minutes 39 seconds 914 milliseconds
H2O cluster version: 3.5.0.99999
H2O cluster name: ludirehak
H2O cluster total nodes: 1
H2O cluster total memory: 3.56 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

In [2]:
# transfer the boston data from pandas to H2O
boston_data = load_boston()
X = pd.DataFrame(data=boston_data.data, columns=boston_data.feature_names)
X["Median_value"] = boston_data.target
X = h2o.H2OFrame(python_obj=X.to_dict("list"))


Parse Progress: [##################################################] 100%
Uploaded pyb34ffe50-362a-4887-9ce7-3f672e6d1249 into cluster with 506 rows and 14 cols

In [3]:
# select 10% for validation
r = X.runif(seed=123456789)
train = X[r < 0.9,:]
valid = X[r >= 0.9,:]

h2o.export_file(train, "Boston_housing_train.csv", force=True)
h2o.export_file(valid, "Boston_housing_test.csv", force=True)


Export File Progress: [##################################################] 100%

Export File Progress: [##################################################] 100%

Enable inline plotting in the Jupyter Notebook


In [4]:
%matplotlib inline
import matplotlib.pyplot as plt

Intro to H2O Data Munging

Read CSV data into H2O. This loads the data into H2O's column-compressed, in-memory key-value store.


In [5]:
fr = h2o.import_file("Boston_housing_train.csv")


Parse Progress: [##################################################] 100%
Imported Boston_housing_train.csv. Parsed 462 rows and 14 cols

View the top of the H2O frame.


In [6]:
fr.head()


H2OFrame with 462 rows and 14 columns: 
CRIM 0.00632 0.0 0.03237 0.06905 0.02985 0.1 0.14455 0.21124 0.2 0.22489
ZN 18.0 0.0 0.0 0.0 0.0 12.5 12.5 12.5 12.5 12.5
B 396.9 392.83 394.63 396.9 394.12 395.6 396.9 386.63 386.7 392.52
LSTAT 5.0 4.03 2.94 5.33 5.21 12.43 19.2 29.93 17.1 20.5
Median_value 24.0 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.0
AGE 65.2 61.1 45.8 54.2 58.7 66.6 96.1 100.0 85.9 94.3
TAX 296 242 222 222 222 311 311 311 311 311
RAD 1 2 3 3 3 5 5 5 5 5
CHAS 0 0 0 0 0 0 0 0 0 0
NOX 0.538 0.5 0.458 0.458 0.458 0.524 0.524 0.524 0.524 0.524
RM 6.575 7.2 6.998 7.147 6.43 6.0 6.172 5.631 6.0 6.377
INDUS 2.31 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 7.87
PTRATIO 15.3 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 15.2
DIS 4.09 4.9671 6.1 6.1 6.1 5.5605 5.9505 6.1 6.5921 6.3467

View the bottom of the H2O Frame


In [7]:
fr.tail()


H2OFrame with 462 rows and 14 columns: 
CRIM 0.2896 0.26838 0.2 0.2 0.2 0.1 0.0 0.06076 0.10959 0.04741
ZN 0 0 0 0 0 0 0 0 0 0
B 396.9 396.9 396.9 395.77 396.9 391.99 396.9 396.9 393.45 396.9
LSTAT 21.14 14.1 12.92 15.1 14.33 9.67 9.08 5.64 6.5 7.88
Median_value 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9 22.0 11.9
AGE 72.9 70.6 65.3 73.5 79.7 69.1 76.7 91.0 89.3 80.8
TAX 391 391 391 391 391 273 273 273 273 273
RAD 6 6 6 6 6 1 1 1 1 1
CHAS 0 0 0 0 0 0 0 0 0 0
NOX 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
RM 5.39 5.8 6.019 5.569 6.027 6.593 6.12 6.976 6.8 6.03
INDUS 9.69 9.69 9.69 9.69 9.69 11.93 11.93 11.93 11.93 11.93
PTRATIO 19.2 19.2 19.2 19.2 19.2 21.0 21.0 21.0 21.0 21.0
DIS 2.7986 2.8927 2.4091 2.3999 2.5 2.4786 2.2875 2.1675 2.3889 2.505

Select a column

fr["VAR_NAME"]


In [8]:
fr["CRIM"].head() # Tab completes


H2OFrame with 462 rows and 1 columns: 
CRIM 0.00632 0.0 0.03237 0.06905 0.02985 0.1 0.14455 0.21124 0.2 0.22489

Select a few columns


In [9]:
columns = ["CRIM", "RM", "RAD"]
fr[columns].head()


H2OFrame with 462 rows and 3 columns: 
CRIM 0.00632 0.0 0.03237 0.06905 0.02985 0.1 0.14455 0.21124 0.2 0.22489
RM 6.575 7.2 6.998 7.147 6.43 6.0 6.172 5.631 6.0 6.377
RAD 1 2 3 3 3 5 5 5 5 5

Select a subset of rows

Columns may be identified by index or by column name. Unlike in Pandas, when subsetting by rows you must also pass a column selection.


In [10]:
fr[2:7,:]  # explicitly select all columns with :


H2OFrame with 5 rows and 14 columns: 
CRIM ZN B LSTAT Median_value AGE TAX RAD CHAS NOX RM INDUS PTRATIO DIS
0 0.03237 0.0 394.63 2.94 33.4 45.8 222 3 0 0.458 6.998 2.18 18.7 6.0622
1 0.06905 0.0 396.90 5.33 36.2 54.2 222 3 0 0.458 7.147 2.18 18.7 6.0622
2 0.02985 0.0 394.12 5.21 28.7 58.7 222 3 0 0.458 6.430 2.18 18.7 6.0622
3 0.08829 12.5 395.60 12.43 22.9 66.6 311 5 0 0.524 6.012 7.87 15.2 5.5605
4 0.14455 12.5 396.90 19.15 27.1 96.1 311 5 0 0.524 6.172 7.87 15.2 5.9505
Out[10]:
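For comparison, a Pandas sketch of the same positional row slice (a small hypothetical frame stands in for the Boston data; Pandas also accepts the bare `df[2:7]` form without a column selector):

```python
import pandas as pd

# Hypothetical miniature stand-in for the Boston frame
df = pd.DataFrame({"CRIM": [0.006, 0.027, 0.032, 0.069, 0.030, 0.088, 0.145],
                   "RM":   [6.575, 6.421, 7.185, 6.998, 7.147, 6.012, 6.172]})

rows_only = df[2:7]          # Pandas: column selector optional
rows_cols = df.iloc[2:7, :]  # explicit form, analogous to fr[2:7,:] in H2O

print(rows_only.shape)  # (5, 2)
```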

Key attributes:

  * columns, names, col_names
  * len, shape, dim, nrow, ncol
  * types

Note:

Since the data is not in local Python memory, there is no "values" attribute. If you want to pull all of the data into local Python memory, do so explicitly with h2o.export_file and read the data back into Python from disk.


In [11]:
# The columns attribute is exactly like Pandas
print "Columns:", fr.columns, "\n"
print "Columns:", fr.names, "\n"
print "Columns:", fr.col_names, "\n"

# There are a number of attributes to get at the shape
print "length:", str( len(fr) ), "\n"
print "shape:", fr.shape, "\n"
print "dim:", fr.dim, "\n"
print "nrow:", fr.nrow, "\n"
print "ncol:", fr.ncol, "\n"

# Use the "types" attribute to list the column types
print "types:", fr.types, "\n"


Columns: [u'CRIM', u'ZN', u'B', u'LSTAT', u'Median_value', u'AGE', u'TAX', u'RAD', u'CHAS', u'NOX', u'RM', u'INDUS', u'PTRATIO', u'DIS'] 

Columns: [u'CRIM', u'ZN', u'B', u'LSTAT', u'Median_value', u'AGE', u'TAX', u'RAD', u'CHAS', u'NOX', u'RM', u'INDUS', u'PTRATIO', u'DIS'] 

Columns: [u'CRIM', u'ZN', u'B', u'LSTAT', u'Median_value', u'AGE', u'TAX', u'RAD', u'CHAS', u'NOX', u'RM', u'INDUS', u'PTRATIO', u'DIS'] 

length: 462 

shape: (462, 14) 

dim: [462, 14] 

nrow: 462 

ncol: 14 

types: {u'CRIM': u'Numeric', u'ZN': u'Numeric', u'B': u'Numeric', u'LSTAT': u'Numeric', u'Median_value': u'Numeric', u'AGE': u'Numeric', u'TAX': u'Numeric', u'RAD': u'Numeric', u'CHAS': u'Numeric', u'NOX': u'Numeric', u'RM': u'Numeric', u'INDUS': u'Numeric', u'PTRATIO': u'Numeric', u'DIS': u'Numeric'} 

Select rows based on value


In [12]:
fr.shape


Out[12]:
(462, 14)

Boolean masks can be used to subselect rows that satisfy a criterion.


In [13]:
mask = fr["CRIM"]>1
fr[mask,:].shape


Out[13]:
(155, 14)

Get summary statistics of the data and additional data distribution information.


In [14]:
fr.describe()


Rows: 462 Cols: 14

Chunk compression summary:
chunk_type chunk_name count count_percentage size size_percentage
CBS Bits 1 7.1428576 128 B 0.4
C1N 1-Byte Integers (w/o NAs) 1 7.1428576 530 B 1.6260661
C2 2-Byte Integers 1 7.1428576 992 B 3.043505
C2S 2-Byte Fractions 1 7.1428576 1008 B 3.0925937
CUD Unique Reals 4 28.57143 7.2 KB 22.5563
C8D 64-bit Reals 6 42.857143 22.1 KB 69.288826
Frame distribution summary:
size number_of_rows number_of_chunks_per_column number_of_chunks
172.16.2.38:54321 31.8 KB 462.0 1.0 14.0
mean 31.8 KB 462.0 1.0 14.0
min 31.8 KB 462.0 1.0 14.0
max 31.8 KB 462.0 1.0 14.0
stddev 0 B 0.0 0.0 0.0
total 31.8 KB 462.0 1.0 14.0
Column-by-Column Summary:

CRIM ZN B LSTAT Median_value AGE TAX RAD CHAS NOX RM INDUS PTRATIO DIS
type real real real real real real int int int real real real real real
mins 0.00632 0.0 0.32 1.73 5.0 6.0 187.0 1.0 0.0 0.385 3.561 0.46 12.6 1.1296
maxs 88.9762 100.0 396.9 37.97 50.0 100.0 711.0 24.0 1.0 0.871 8.78 27.74 22.0 12.1265
mean 3.6 11.1 357.2 12.7 22.6 68.9 407.6 9.4 0.1 0.6 6.3 11.1 18.5 3.8
sigma 8.7 23.2 90.8 7.1 9.2 28.0 167.5 8.6 0.2 0.1 0.7 6.9 2.2 2.1
zero_count 0 343 0 0 0 0 0 0 433 0 0 0 0 0
missing_count 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Set up the predictor and response column names

With H2O algorithms, it's easier to reference predictor and response columns by name in a single frame (i.e., don't split up X and y).


In [15]:
x = fr.names
y="Median_value"
x.remove(y)

Machine Learning With H2O

H2O is a machine learning library built in Java with interfaces in Python, R, Scala, and JavaScript. It is open source and well-documented.

Unlike Scikit-learn, H2O allows for categorical and missing data.
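To see the contrast concretely, here is a hedged scikit-learn-only sketch (synthetic data, my own illustration): scikit-learn's RandomForestRegressor rejects rows containing NaN outright, so an explicit imputation step is required before fitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

rng = np.random.RandomState(42)
X = rng.rand(20, 3)
X[0, 1] = np.nan                  # introduce a missing value
y = rng.rand(20)

# Scikit-learn's random forest refuses NaNs outright...
try:
    RandomForestRegressor(n_estimators=5).fit(X, y)
    raised = False
except ValueError:
    raised = True

# ...so an explicit imputation step is needed first.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
RandomForestRegressor(n_estimators=5, random_state=42).fit(X_filled, y)
print(raised)  # True
```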

The basic workflow is as follows:

  • Fit the training data with a machine learning algorithm
  • Predict on the testing data

Simple model


In [16]:
# Define and fit first 400 points
model = H2ORandomForestEstimator(seed=42)
model.train(x=x, y=y, training_frame=fr[:400,:])


drf Model Build Progress: [##################################################] 100%

In [17]:
model.predict(fr[400:fr.nrow,:])        # Predict the rest


H2OFrame with 62 rows and 1 columns: 
predict
0 12.736
1 10.100
2 10.048
3 12.742
4 10.498
5 14.902
6 17.218
7 15.148
8 14.738
9 16.491
Out[17]:

The performance of the model can be checked using the holdout dataset


In [18]:
perf = model.model_performance(fr[400:fr.nrow,:])
perf.r2()      # get the r2 on the holdout data
perf.mse()     # get the mse on the holdout data
perf           # display the performance object


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 13.4756382476
R^2: 0.405996106866
Mean Residual Deviance: 13.4756382476
Out[18]:

Train-Test Split

Instead of taking the first 400 observations for training, we can use H2O to create a random train/test split of the data.


In [19]:
r = fr.runif(seed=12345)   # build random uniform column over [0,1]
train= fr[r<0.75,:]     # perform a 75-25 split
test = fr[r>=0.75,:]

model = H2ORandomForestEstimator(seed=42)
model.train(x=x, y=y, training_frame=train, validation_frame=test)

perf = model.model_performance(test)
perf.r2()


drf Model Build Progress: [##################################################] 100%
Out[19]:
0.8530416308371256

There was a massive jump in the R^2 value. This is because the original data is not shuffled, so the first 400 rows are not representative of the remaining rows.
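The effect of an unshuffled head/tail split can be illustrated with a purely synthetic sketch (random stand-in data, not the Boston set): when the rows are ordered, the training slice and the holdout slice come from very different distributions, while a random split does not have this problem.

```python
import numpy as np

rng = np.random.RandomState(0)
# 500 target values sorted in increasing order, like un-shuffled rows
target = np.sort(rng.normal(loc=22.5, scale=9.0, size=500))

head_mean = target[:400].mean()   # what a head/tail split trains on
tail_mean = target[400:].mean()   # what it is judged on

# The ordered split sees very different distributions...
print(round(abs(head_mean - tail_mean), 1))

# ...while a random split of the same data does not.
shuffled = target[rng.permutation(500)]
print(round(abs(shuffled[:400].mean() - shuffled[400:].mean()), 1))
```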

Cross validation

H2O's machine learning algorithms take an optional parameter nfolds to specify the number of cross-validation folds to build. H2O's cross-validation uses an internal weight vector to build the folds in an efficient manner (instead of physically building the splits).

In conjunction with the nfolds parameter, a user may specify the way in which observations are assigned to each fold with the fold_assignment parameter, which can be set to either:

    * AUTO: Perform random assignment.
    * Random: Each row has an equal (1/nfolds) chance of landing in any fold.
    * Modulo: Observations are assigned to folds by row index modulo nfolds.
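The Modulo scheme is deterministic and easy to sketch in plain Python (my own illustration of the assignment rule, not H2O code):

```python
from collections import Counter

nfolds = 5
nrows = 12

# Modulo assignment: row i goes to fold i % nfolds -- deterministic,
# and as balanced as possible, with no randomness involved.
folds = [i % nfolds for i in range(nrows)]
print(folds)   # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1]

# Fold sizes differ by at most one row.
sizes = Counter(folds)
print(sorted(sizes.values()))  # [2, 2, 2, 3, 3]
```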

In [20]:
model = H2ORandomForestEstimator(nfolds=10) # build a 10-fold cross-validated model
model.train(x=x, y=y, training_frame=fr)


drf Model Build Progress: [##################################################] 100%

In [21]:
scores = numpy.array([m.r2() for m in model.xvals]) # iterate over the xval models using the xvals attribute
print "Expected R^2: %.2f +/- %.2f \n" % (scores.mean(), scores.std()*1.96)
print "Scores:", scores.round(2)


Expected R^2: 0.87 +/- 0.01 

Scores: [ 0.87  0.87  0.87  0.88  0.87  0.86  0.87  0.87  0.87  0.88]

However, you can still make use of cross_val_score from Scikit-learn.

Cross validation: H2O and Scikit-Learn


In [22]:
from sklearn.cross_validation import cross_val_score
from h2o.cross_validation import H2OKFold
from h2o.model.regression import h2o_r2_score
from sklearn.metrics.scorer import make_scorer

You still must use H2O to create the folds; currently, there is no H2OStratifiedKFold. Note also that H2ORandomForestEstimator is similar to scikit-learn's RandomForestRegressor, but has its own train method.


In [23]:
model = H2ORandomForestEstimator(seed=42)

In [24]:
scorer = make_scorer(h2o_r2_score)   # make h2o_r2_score into a scikit_learn scorer
custom_cv = H2OKFold(fr, n_folds=10, seed=42) # make a cv 
scores = cross_val_score(model, fr[x], fr[y], scoring=scorer, cv=custom_cv)

print "Expected R^2: %.2f +/- %.2f \n" % (scores.mean(), scores.std()*1.96)
print "Scores:", scores.round(2)


drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%
Expected R^2: 0.87 +/- 0.10 

Scores: [ 0.83  0.86  0.87  0.95  0.79  0.78  0.89  0.92  0.91  0.91]

There isn't much difference in the R^2 value, since the fold strategy is exactly the same. However, there is a major difference in computation time and memory usage.

Since the progress bar printout gets annoying, let's disable it.


In [25]:
h2o.__PROGRESS_BAR__=False
h2o.no_progress()

Grid search in H2O is still under active development and will be available very soon. In the meantime, it is possible to make use of Scikit-learn's grid search infrastructure (with some performance penalties).

Randomized grid search: H2O and Scikit-Learn


In [26]:
from sklearn import __version__
sklearn_version = __version__
print sklearn_version


0.16.1

If you have 0.16.1, your sklearn installation can't handle complex randomized grid searches (they work in every other version of sklearn, including the soon-to-be-released 0.16.2 and older versions).

The steps to perform a randomized grid search:

  1. Import model and RandomizedSearchCV
  2. Define model
  3. Specify parameters to test
  4. Define grid search object
  5. Fit data to grid search object
  6. Collect scores

All of the steps from above will be repeated.

Because 0.16.1 is installed, we use scipy to define specific parameter distributions.

ADVANCED TIP:

Turn off reference counting when spawning jobs in parallel (n_jobs=-1 or n_jobs > 1), and turn it back on after the parallel job completes.

If you don't want to run jobs in parallel, leave reference counting on.

Pattern is:

     >>> h2o.turn_off_ref_cnts()
     >>> .... parallel job ....
     >>> h2o.turn_on_ref_cnts()

In [27]:
%%time
from sklearn.grid_search import RandomizedSearchCV  # Import grid search
from scipy.stats import randint, uniform

model = H2ORandomForestEstimator(seed=42)        # Define model

params = {"ntrees": randint(20,50),
          "max_depth": randint(1,10),
          "min_rows": randint(1,10),    # scikit's  min_samples_leaf
          "mtries": randint(2,fr[x].shape[1]),} # Specify parameters to test

scorer = make_scorer(h2o_r2_score)   # make h2o_r2_score into a scikit_learn scorer
custom_cv = H2OKFold(fr, n_folds=10, seed=42) # make a cv 
random_search = RandomizedSearchCV(model, params, 
                                   n_iter=30, 
                                   scoring=scorer, 
                                   cv=custom_cv, 
                                   random_state=42,
                                   n_jobs=1)       # Define grid search object

random_search.fit(fr[x], fr[y])

print "Best R^2:", random_search.best_score_, "\n"
print "Best params:", random_search.best_params_


Best R^2: 0.84110882468 

Best params: {'mtries': 3, 'ntrees': 21, 'min_rows': 2, 'max_depth': 8}
CPU times: user 40.1 s, sys: 3.78 s, total: 43.9 s
Wall time: 1min 35s

We might be tempted to think that we just found a large improvement; however, we must be cautious. The function below creates a more detailed report.


In [28]:
def report_grid_score_detail(random_search, charts=True):
    """Input fit grid search estimator. Returns df of scores with details"""
    df_list = []

    for line in random_search.grid_scores_:
        results_dict = dict(line.parameters)
        results_dict["score"] = line.mean_validation_score
        results_dict["std"] = line.cv_validation_scores.std()*1.96
        df_list.append(results_dict)

    result_df = pd.DataFrame(df_list)
    result_df = result_df.sort("score", ascending=False)
    
    if charts:
        for col in get_numeric(result_df):
            if col not in ["score", "std"]:
                plt.scatter(result_df[col], result_df.score)
                plt.title(col)
                plt.show()

        for col in list(result_df.columns[result_df.dtypes == "object"]):
            cat_plot = result_df.score.groupby(result_df[col]).mean()
            cat_plot.sort()
            cat_plot.plot(kind="barh", xlim=(.5, None), figsize=(7, cat_plot.shape[0]/2))
            plt.show()
    return result_df

def get_numeric(X):
    """Return list of numeric dtypes variables"""
    return X.dtypes[X.dtypes.apply(lambda x: str(x).startswith(("float", "int", "bool")))].index.tolist()

In [29]:
report_grid_score_detail(random_search).head()


Out[29]:
max_depth min_rows mtries ntrees score std
13 8 2 3 21 0.841109 0.141148
23 9 4 4 31 0.835138 0.116733
1 7 4 4 46 0.833804 0.149032
14 7 2 9 35 0.829966 0.152863
27 6 3 4 47 0.822536 0.165389

Based on the grid search report, we can narrow the parameters to search and rerun the analysis. The parameters below were chosen after a few runs:


In [30]:
%%time

params = {"ntrees": randint(30,40),
          "max_depth": randint(4,10),
          "mtries": randint(4,10),}

custom_cv = H2OKFold(fr, n_folds=5, seed=42)           # In small datasets, the fold size can have a big
                                                       # impact on the std of the resulting scores. More
random_search = RandomizedSearchCV(model, params,      # folds --> Less examples per fold --> higher 
                                   n_iter=10,          # variation per sample
                                   scoring=scorer, 
                                   cv=custom_cv, 
                                   random_state=43, 
                                   n_jobs=1)       

random_search.fit(fr[x], fr[y])

print "Best R^2:", random_search.best_score_, "\n"
print "Best params:", random_search.best_params_

report_grid_score_detail(random_search)


Best R^2: 0.858211324241 

Best params: {'mtries': 8, 'ntrees': 33, 'max_depth': 9}
CPU times: user 7.35 s, sys: 649 ms, total: 8 s
Wall time: 17.6 s

Transformations

Rule of machine learning: don't use your testing data to inform your training data. Unfortunately, this happens all the time when preparing a dataset for the final model, and on smaller datasets you must be especially careful.

At the moment, there are no classes for managing data transformations. On the one hand, this requires the user to tote around some extra state, but on the other, it allows the user to be more explicit about transforming H2OFrames.

Basic steps:

  1. Remove the response variable from transformations.
  2. Import transformer
  3. Define transformer
  4. Fit train data to transformer
  5. Transform test and train data
  6. Re-attach the response variable.

First, let's normalize the data using the means and standard deviations of the training data. Then let's perform a principal component analysis on the training data, select the top 5 components, and use them to reduce the train and test design matrices.
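As a plain scikit-learn sketch of the same discipline (StandardScaler and PCA standing in for the H2O transforms, with random stand-in matrices): both transforms are fit on the training data only, then applied to both splits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X_train = rng.rand(80, 13)   # stand-ins for the train/test design matrices
X_test = rng.rand(20, 13)

# Fit both transforms on the training data only...
norm = StandardScaler().fit(X_train)
pca = PCA(n_components=5).fit(norm.transform(X_train))

# ...then apply the *trained* transforms to both splits.
X_train_red = pca.transform(norm.transform(X_train))
X_test_red = pca.transform(norm.transform(X_test))

print(X_train_red.shape)  # (80, 5)
print(X_test_red.shape)   # (20, 5)
```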


In [31]:
from h2o.transforms.preprocessing import H2OScaler
from h2o.transforms.decomposition import H2OPCA

Normalize Data: Use the means and standard deviations from the training data.


In [32]:
y_train = train.pop("Median_value")
y_test  = test.pop("Median_value")

In [33]:
norm = H2OScaler()
norm.fit(train)
X_train_norm = norm.transform(train)
X_test_norm  = norm.transform(test)

In [34]:
print X_test_norm.shape
X_test_norm


(122, 13)
H2OFrame with 122 rows and 13 columns: 
CRIM ZN B LSTAT AGE TAX RAD CHAS NOX RM INDUS PTRATIO DIS
0 -24.736214 -246.063411 3478.064487 -52.441200 -413.735371 -30080.217069 -51.962844 -0.01489 -0.011189 0.630504 -63.773819 0.684709 4.823702
1 -23.659575 36.576994 2566.038932 126.336348 883.754751 -15288.430659 -34.909401 -0.01489 -0.003593 -0.476113 -23.687122 -7.112459 4.865433
2 -24.369586 36.576994 3478.064487 5.261797 399.320666 -15288.430659 -34.909401 -0.01489 -0.003593 -0.200189 -23.687122 -7.112459 5.168671
3 -24.548962 36.576994 2909.713605 22.994204 -844.343447 -15288.430659 -34.909401 -0.01489 -0.003593 -0.287784 -23.687122 -7.112459 3.541759
4 -20.490610 -246.063411 3478.064487 -31.147777 -198.431334 -15953.230048 -43.436123 -0.01489 -0.001981 -0.243987 -21.784942 5.808563 1.982793
5 -15.926989 -246.063411 3478.064487 44.869018 648.620078 -15953.230048 -43.436123 -0.01489 -0.001981 -0.103105 -21.784942 5.808563 0.450669
6 -19.578204 -246.063411 3249.836086 27.281958 716.610827 -15953.230048 -43.436123 -0.01489 -0.001981 -0.262235 -21.784942 5.808563 1.337103
7 -19.406096 -246.063411 2682.373253 1.846128 725.109671 -15953.230048 -43.436123 -0.01489 -0.001981 0.154571 -21.784942 5.808563 1.452652
8 -13.047041 -246.063411 -9717.444492 56.642174 795.933367 -15953.230048 -43.436123 -0.01489 -0.001981 -0.136683 -21.784942 5.808563 -0.004606
9 -23.933599 -246.063411 3169.911743 -17.557776 -1093.642859 -20606.825773 -34.909401 -0.01489 -0.006470 -0.231577 -37.143289 1.798590 0.178888
Out[34]:

Then, we can apply PCA and keep the top 5 components. A user warning is expected here.


In [54]:
pca = H2OPCA(k=5)
pca.fit(X_train_norm)
X_train_norm_pca = pca.transform(X_train_norm)
X_test_norm_pca  = pca.transform(X_test_norm)

In [55]:
# prop of variance explained by top 5 components?

In [56]:
print X_test_norm_pca.shape
X_test_norm_pca[:5]


(122, 5)
H2OFrame with 122 rows and 5 columns: 
PC1 PC2 PC3 PC4 PC5
0 -30275.268227 -625.603879 190.440193 369.210711 12.389230
1 -15481.014400 465.946145 1014.829847 -446.059456 38.351308
2 -15611.350493 1372.839872 586.131836 -231.060297 -0.387572
3 -15551.885105 817.877766 -530.932422 323.576391 21.281381
4 -16276.796219 1286.235678 185.614799 287.705163 -4.813496
5 -16264.821892 1280.602054 945.464828 -89.920838 17.023511
6 -16233.006802 1054.058236 1003.812227 -119.775113 5.928040
7 -16156.125715 491.806467 1005.791501 -122.633986 -4.737604
8 -14477.165840 -11793.980470 962.696650 -136.697963 -1.805725
9 -20857.888689 356.531393 -552.379748 682.064306 21.077590
Out[56]:


In [57]:
model = H2ORandomForestEstimator(seed=42)
model.train(x=X_train_norm_pca.names, y=y_train.names, training_frame=X_train_norm_pca.cbind(y_train))
y_hat  = model.predict(X_test_norm_pca)

In [58]:
h2o_r2_score(y_test,y_hat)


Out[58]:
0.5344823408872756

Although this is MUCH simpler than keeping track of all of these transformations manually, it gets to be somewhat of a burden when you want to chain together multiple transformers.

Pipelines

"Tranformers unite!"

If your raw data is a mess and you have to perform several transformations before using it, use a pipeline to keep things simple.

Steps:

  1. Import Pipeline, transformers, and model
  2. Define the pipeline. The first and only argument is a list of tuples, where the first element of each tuple is a name you give the step and the second element is a defined transformer. The last step may optionally be an estimator class (like a RandomForest).
  3. Fit the pipeline to the training data
  4. Either transform or predict the testing data

In [59]:
from h2o.transforms.preprocessing import H2OScaler
from h2o.transforms.decomposition import H2OPCA

In [60]:
from sklearn.pipeline import Pipeline                # Import Pipeline <other imports not shown>
model = H2ORandomForestEstimator(seed=42)
pipe = Pipeline([("standardize", H2OScaler()),       # Define pipeline as a series of steps
                 ("pca", H2OPCA(k=5)),
                 ("rf", model)])                     # Notice the last step is an estimator

pipe.fit(train, y_train)                             # Fit training data
y_hat = pipe.predict(test)                           # Predict testing data (due to last step being an estimator)
h2o_r2_score(y_test, y_hat)                          # Notice the final score is identical to before


Out[60]:
0.5091520528378605

This is so much easier!!!

But, wait a second, we did worse after applying these transformations! We might wonder how different hyperparameters for the transformations impact the final score.

Combining randomized grid search and pipelines

"Yo dawg, I heard you like models, so I put models in your models to model models."

Steps:

  1. Import Pipeline, grid search, transformers, and estimators
  2. Define pipeline
  3. Specify the parameters to test in the form "(step name)__(argument name)"; a double underscore separates the step name from the argument name.
  4. Define grid search
  5. Fit to grid search
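The double-underscore convention comes from scikit-learn itself; a minimal sklearn-only sketch (the step names here are my own, chosen to mirror the pipeline below):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

pipe = Pipeline([("standardize", StandardScaler()),
                 ("pca", PCA()),
                 ("rf", RandomForestRegressor())])

# Every step parameter is addressable as "<step name>__<argument name>"
params = pipe.get_params()
print("pca__n_components" in params)   # True
print("rf__max_depth" in params)       # True

# set_params (and the grid searches built on it) use the same keys
pipe.set_params(pca__n_components=3, rf__max_depth=5)
```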

In [61]:
pipe = Pipeline([("standardize", H2OScaler()),
                 ("pca", H2OPCA()),
                 ("rf", H2ORandomForestEstimator(seed=42))])

params = {"standardize__center":    [True, False],           # Parameters to test
          "standardize__scale":     [True, False],
          "pca__k":                 randint(2, 6),
          "rf__ntrees":             randint(50,80),
          "rf__max_depth":          randint(4,10),
          "rf__min_rows":           randint(5,10), }
#           "rf__mtries":             randint(1,4),}           # gridding over mtries is 
                                                               # problematic with pca grid over 
                                                               # k above 

from sklearn.grid_search import RandomizedSearchCV
from h2o.cross_validation import H2OKFold
from h2o.model.regression import h2o_r2_score
from sklearn.metrics.scorer import make_scorer

custom_cv = H2OKFold(fr, n_folds=5, seed=42)
random_search = RandomizedSearchCV(pipe, params,
                                   n_iter=30,
                                   scoring=make_scorer(h2o_r2_score),
                                   cv=custom_cv,
                                   random_state=42,
                                   n_jobs=1)


random_search.fit(fr[x],fr[y])
results = report_grid_score_detail(random_search)
results.head()


Out[61]:
pca__k rf__max_depth rf__min_rows rf__ntrees score standardize__center standardize__scale std
0 5 8 5 79 0.373792 True False 0.109543
9 5 6 5 69 0.367386 False True 0.092738
12 5 7 6 74 0.364720 False False 0.089120
21 3 6 5 61 0.341235 False False 0.095376
23 3 6 9 65 0.332632 False True 0.106529

Currently Under Development (drop-in scikit-learn pieces):

* Richer set of transforms (only PCA and Scale are implemented)
* Richer set of estimators (only RandomForest is available)
* Full H2O Grid Search

Other Tips: Model Save/Load

It is useful to save constructed models to disk and reload them between H2O sessions. Here's how:


In [62]:
best_estimator = random_search.best_estimator_                        # fetch the pipeline from the grid search
h2o_model      = h2o.get_model(best_estimator._final_estimator._id)    # fetch the model from the pipeline

In [63]:
save_path = h2o.save_model(h2o_model, path=".", force=True)
print save_path


/Users/ludirehak/h2o-3/DRF_model_python_1445557087082_2736

In [64]:
# assumes new session
my_model = h2o.load_model(path=save_path)

In [65]:
my_model.predict(fr)


H2OFrame with 462 rows and 1 columns: 
predict
0 29.843769
1 29.843769
2 29.843769
3 29.843769
4 29.843769
5 29.843769
6 29.843769
7 29.843769
8 29.843769
9 29.843769
Out[65]:

