# H2O Tutorial

Author: Spencer Aiello

Contact: spencer@h2oai.com

This tutorial steps through a quick introduction to H2O's Python API. The goal is to introduce H2O's capabilities from Python through a complete example. To help those who are accustomed to Scikit-learn and Pandas, the demo includes specific callouts about differences between H2O and those packages; this is intended to help anyone who needs to do machine learning on really big data make the transition. It is not meant to be a tutorial on machine learning or algorithms. Detailed documentation about H2O and its Python API is available at http://docs.h2o.ai.

## Setting up your system for this demo

The following code creates two csv files using data from the Boston Housing dataset, which is built into scikit-learn, and writes them to the local directory.


In [1]:
import pandas as pd
import numpy
from numpy.random import choice
from sklearn.datasets import load_boston

import h2o
h2o.init()


H2O cluster uptime: 42 seconds 642 milliseconds
H2O cluster version: 3.7.0.99999
H2O cluster name: spIdea
H2O cluster total nodes: 1
H2O cluster total free memory: 12.44 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321
H2O Connection proxy: None
Python Version: 3.5.0

In [2]:
# transfer the boston data from pandas to H2O
boston_data = load_boston()
X = pd.DataFrame(data=boston_data.data, columns=boston_data.feature_names)
X["Median_value"] = boston_data.target
X = h2o.H2OFrame.from_python(X.to_dict("list"))


Parse Progress: [##################################################] 100%

In [3]:
# select 10% for validation
r = X.runif(seed=123456789)
train = X[r < 0.9,:]
valid = X[r >= 0.9,:]

h2o.export_file(train, "Boston_housing_train.csv", force=True)
h2o.export_file(valid, "Boston_housing_test.csv", force=True)


Export File Progress: [##################################################] 100%

Export File Progress: [##################################################] 100%

Enable inline plotting in the Jupyter Notebook


In [4]:
%matplotlib inline
import matplotlib.pyplot as plt

## Intro to H2O Data Munging

Read csv data into H2O. This loads the data into the H2O column compressed, in-memory, key-value store.


In [5]:
fr = h2o.import_file("Boston_housing_train.csv")


Parse Progress: [##################################################] 100%

View the top of the H2O frame.


In [6]:
fr.head()


B       AGE   NOX    LSTAT  TAX  RM     ZN    INDUS  RAD  DIS     PTRATIO  CRIM     Median_value  CHAS
396.9   65.2  0.538  4.98   296  6.575  18    2.31   1    4.09    15.3     0.00632  24            0
396.9   78.9  0.469  9.14   242  6.421  0     7.07   2    4.9671  17.8     0.02731  21.6          0
392.83  61.1  0.469  4.03   242  7.185  0     7.07   2    4.9671  17.8     0.02729  34.7          0
394.63  45.8  0.458  2.94   222  6.998  0     2.18   3    6.0622  18.7     0.03237  33.4          0
396.9   54.2  0.458  5.33   222  7.147  0     2.18   3    6.0622  18.7     0.06905  36.2          0
394.12  58.7  0.458  5.21   222  6.43   0     2.18   3    6.0622  18.7     0.02985  28.7          0
395.6   66.6  0.524  12.43  311  6.012  12.5  7.87   5    5.5605  15.2     0.08829  22.9          0
396.9   96.1  0.524  19.15  311  6.172  12.5  7.87   5    5.9505  15.2     0.14455  27.1          0
386.63  100   0.524  29.93  311  5.631  12.5  7.87   5    6.0821  15.2     0.21124  16.5          0
386.71  85.9  0.524  17.1   311  6.004  12.5  7.87   5    6.5921  15.2     0.17004  18.9          0
Out[6]:

View the bottom of the H2O Frame


In [7]:
fr.tail()


B       AGE   NOX    LSTAT  TAX  RM     ZN  INDUS  RAD  DIS     PTRATIO  CRIM     Median_value  CHAS
396.9   42.6  0.585  13.59  391  5.926  0   9.69   6    2.3817  19.2     0.27957  24.5          0
393.29  28.8  0.585  17.6   391  5.67   0   9.69   6    2.7986  19.2     0.17899  23.1          0
396.9   70.6  0.585  14.1   391  5.794  0   9.69   6    2.8927  19.2     0.26838  18.3          0
396.9   65.3  0.585  12.92  391  6.019  0   9.69   6    2.4091  19.2     0.23912  21.2          0
395.77  73.5  0.585  15.1   391  5.569  0   9.69   6    2.3999  19.2     0.17783  17.5          0
396.9   79.7  0.585  14.33  391  6.027  0   9.69   6    2.4982  19.2     0.22438  16.8          0
391.99  69.1  0.573  9.67   273  6.593  0   11.93  1    2.4786  21       0.06263  22.4          0
396.9   76.7  0.573  9.08   273  6.12   0   11.93  1    2.2875  21       0.04527  20.6          0
393.45  89.3  0.573  6.48   273  6.794  0   11.93  1    2.3889  21       0.10959  22            0
396.9   80.8  0.573  7.88   273  6.03   0   11.93  1    2.505   21       0.04741  11.9          0
Out[7]:

Select a column

fr["VAR_NAME"]


In [8]:
fr["CRIM"].head() # Tab completes


CRIM
0.00632
0.02731
0.02729
0.03237
0.06905
0.02985
0.08829
0.14455
0.21124
0.17004
Out[8]:

Select a few columns


In [9]:
columns = ["CRIM", "RM", "RAD"]
fr[columns].head()


CRIM     RM     RAD
0.00632  6.575  1
0.02731  6.421  2
0.02729  7.185  2
0.03237  6.998  3
0.06905  7.147  3
0.02985  6.43   3
0.08829  6.012  5
0.14455  6.172  5
0.21124  5.631  5
0.17004  6.004  5
Out[9]:

Select a subset of rows

Unlike in Pandas, an H2OFrame is indexed in two dimensions: columns may be identified by index or by column name, and when subsetting by rows you must also pass a column selection (use ":" to select all columns).
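For example, the same row range can be paired with a column selection by name or by position. A minimal sketch (not an executed cell in this notebook):

    fr[2:7, ["CRIM", "RM", "RAD"]]   # rows 2-6, columns picked by name
    fr[2:7, 0:3]                     # the same rows, the first three columns picked by position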


In [10]:
fr[2:7,:]  # explicitly select all columns with :


B       AGE   NOX    LSTAT  TAX  RM     ZN    INDUS  RAD  DIS     PTRATIO  CRIM     Median_value  CHAS
392.83  61.1  0.469  4.03   242  7.185  0     7.07   2    4.9671  17.8     0.02729  34.7          0
394.63  45.8  0.458  2.94   222  6.998  0     2.18   3    6.0622  18.7     0.03237  33.4          0
396.9   54.2  0.458  5.33   222  7.147  0     2.18   3    6.0622  18.7     0.06905  36.2          0
394.12  58.7  0.458  5.21   222  6.43   0     2.18   3    6.0622  18.7     0.02985  28.7          0
395.6   66.6  0.524  12.43  311  6.012  12.5  7.87   5    5.5605  15.2     0.08829  22.9          0
Out[10]:

Key attributes:

  * columns, names, col_names
  * len, shape, dim, nrow, ncol
  * types

Note:

Since the data is not in local Python memory, there is no "values" attribute. If you want to pull all of the data into local Python memory, do so explicitly with h2o.export_file and then read the data back into Python from disk.
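A minimal sketch of that round trip, assuming pandas is available (the file name here is purely illustrative):

    import pandas as pd

    # write the H2O frame to disk, then read it back into local Python memory
    h2o.export_file(fr, "frame_dump.csv", force=True)
    local_df = pd.read_csv("frame_dump.csv")
    print(local_df.shape)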


In [11]:
# The columns attribute is exactly like Pandas
print("Columns:" + str(fr.columns) + "\n")
print("Columns:" + str(fr.names) + "\n")
print("Columns:" + str(fr.col_names) + "\n")

# There are a number of attributes to get at the shape
print("length:" + str( len(fr) ) + "\n")
print("shape:" + str(fr.shape) + "\n")
print("dim:" + str(fr.dim) + "\n")
print("nrow:" + str(fr.nrow) + "\n")
print("ncol:" + str(fr.ncol) + "\n")

# Use the "types" attribute to list the column types
print("types:" + str(fr.types) + "\n")


Columns:['B', 'AGE', 'NOX', 'LSTAT', 'TAX', 'RM', 'ZN', 'INDUS', 'RAD', 'DIS', 'PTRATIO', 'CRIM', 'Median_value', 'CHAS']

Columns:['B', 'AGE', 'NOX', 'LSTAT', 'TAX', 'RM', 'ZN', 'INDUS', 'RAD', 'DIS', 'PTRATIO', 'CRIM', 'Median_value', 'CHAS']

Columns:['B', 'AGE', 'NOX', 'LSTAT', 'TAX', 'RM', 'ZN', 'INDUS', 'RAD', 'DIS', 'PTRATIO', 'CRIM', 'Median_value', 'CHAS']

length:453

shape:(453, 14)

dim:[453, 14]

nrow:453

ncol:14

types:{'B': 'real', 'AGE': 'real', 'NOX': 'real', 'LSTAT': 'real', 'TAX': 'int', 'RM': 'real', 'ZN': 'real', 'INDUS': 'real', 'RAD': 'int', 'DIS': 'real', 'PTRATIO': 'real', 'CRIM': 'real', 'Median_value': 'real', 'CHAS': 'int'}

Select rows based on value


In [12]:
fr.shape


Out[12]:
(453, 14)

Boolean masks can be used to subselect rows based on a criterion.


In [13]:
mask = fr["CRIM"]>1
fr[mask,:].shape


Out[13]:
(149, 14)

Get summary statistics of the data and additional data distribution information.


In [14]:
fr.describe()


Rows:453 Cols:14

Chunk compression summary: 
chunk_type chunk_name count count_percentage size size_percentage
CBS Bits 1 7.1428576 127 B 0.3965280
C1N 1-Byte Integers (w/o NAs) 1 7.1428576 521 B 1.6267016
C2 2-Byte Integers 1 7.1428576 974 B 3.041089
C2S 2-Byte Fractions 1 7.1428576 990 B 3.0910454
CUD Unique Reals 4 28.57143 7.1 KB 22.680155
C8D 64-bit Reals 6 42.857143 21.6 KB 69.16448
Frame distribution summary: 
size number_of_rows number_of_chunks_per_column number_of_chunks
172.20.10.3:54321 31.3 KB 453.0 1.0 14.0
mean 31.3 KB 453.0 1.0 14.0
min 31.3 KB 453.0 1.0 14.0
max 31.3 KB 453.0 1.0 14.0
stddev 0 B 0.0 0.0 0.0
total 31.3 KB 453.0 1.0 14.0

         B        AGE      NOX      LSTAT   TAX      RM       ZN      INDUS   RAD     DIS      PTRATIO  CRIM     Median_value  CHAS
type     real     real     real     real    int      real     real    real    int     real     real     real     real          int
mins     0.32     2.9      0.385    1.73    187      3.561    0       0.46    1       1.1296   12.6     0.00632  5             0
mean     356.739  67.778   0.5533   12.519  402.227  6.2727   11.540  10.931  9.2759  3.8400   18.409   3.2903   22.529        0.0640
maxs     396.9    100      0.871    37.97   711      8.78     100     27.74   24      12.1265  22       73.5341  50            1
sigma    91.577   27.903   0.1173   7.0063  166.331  0.7085   23.315  6.8248  8.5632  2.0958   2.1838   7.8309   9.0837        0.2451
zeros    0        0        0        0       0        0        330     0       0       0        0        0        0             424
missing  0        0        0        0       0        0        0       0       0       0        0        0        0             0
0        396.9    65.2     0.538    4.98    296      6.575    18      2.31    1       4.09     15.3     0.00632  24            0
1        396.9    78.9     0.469    9.14    242      6.421    0       7.07    2       4.9671   17.8     0.02731  21.6          0
2        392.83   61.1     0.469    4.03    242      7.185    0       7.07    2       4.9671   17.8     0.02729  34.7          0
3        394.63   45.8     0.458    2.94    222      6.998    0       2.18    3       6.0622   18.7     0.03237  33.4          0
4        396.9    54.2     0.458    5.33    222      7.147    0       2.18    3       6.0622   18.7     0.06905  36.2          0
5        394.12   58.7     0.458    5.21    222      6.43     0       2.18    3       6.0622   18.7     0.02985  28.7          0
6        395.6    66.6     0.524    12.43   311      6.012    12.5    7.87    5       5.5605   15.2     0.08829  22.9          0
7        396.9    96.1     0.524    19.15   311      6.172    12.5    7.87    5       5.9505   15.2     0.14455  27.1          0
8        386.63   100      0.524    29.93   311      5.631    12.5    7.87    5       6.0821   15.2     0.21124  16.5          0
9        386.71   85.9     0.524    17.1    311      6.004    12.5    7.87    5       6.5921   15.2     0.17004  18.9          0

Set up the predictor and response column names

With H2O algorithms, it's easier to reference the predictor and response columns by name within a single frame (i.e., don't split up X and y).


In [15]:
x = fr.names[:]
y="Median_value"
x.remove(y)

## Machine Learning With H2O

H2O is a machine learning library built in Java with interfaces in Python, R, Scala, and JavaScript. It is open source and well-documented.

Unlike Scikit-learn, H2O allows for categorical and missing data.
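For instance, an integer indicator column such as CHAS can be treated as a categorical (enum) column before modeling. A small sketch using H2OFrame methods (not an executed cell here; asfactor, isfactor, and isna are assumed to behave as in current H2O releases):

    chas_factor = fr["CHAS"].asfactor()    # convert the 0/1 indicator to a categorical column
    print(chas_factor.isfactor())

    # missing values need no special preprocessing, but they can be counted explicitly
    print(fr["CRIM"].isna().sum())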

The basic work flow is as follows:

  * Fit the training data with a machine learning algorithm
  * Predict on the testing data

## Simple model


In [16]:
model = h2o.random_forest(x=fr[:400,x],y=fr[:400,y],seed=42) # Define and fit first 400 points


drf Model Build Progress: [##################################################] 100%
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:1: DeprecationWarning: `h2o.random_forest` is deprecated. Use the estimators sub module to build an H2ORandomForestEstimator.
  if __name__ == '__main__':

In [17]:
model.predict(fr[400:fr.nrow,:])        # Predict the rest


predict
17.036
14.746
14.2716
15.932
15.5033
14.4836
15.144
20.239
17.938
17.17
Out[17]:

The performance of the model can be checked using the holdout dataset


In [18]:
perf = model.model_performance(fr[400:fr.nrow,:])
perf.r2()      # get the r2 on the holdout data
perf.mse()     # get the mse on the holdout data
perf           # display the performance object


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 10.819764680847584
R^2: 0.4094125118767129
Mean Residual Deviance: 10.819764680847584
Out[18]:

## Train-Test Split

Instead of taking the first 400 observations for training, we can use H2O to create a random train-test split of the data.


In [19]:
r = fr.runif(seed=12345)   # build random uniform column over [0,1]
train= fr[r<0.75,:]     # perform a 75-25 split
test = fr[r>=0.75,:]

model = h2o.random_forest(x=train[x],y=train[y],seed=42)

perf = model.model_performance(test)
perf.r2()


drf Model Build Progress: [##################################################] 100%
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:5: DeprecationWarning: `h2o.random_forest` is deprecated. Use the estimators sub module to build an H2ORandomForestEstimator.
Out[19]:
0.8741251833488446

There was a massive jump in the R^2 value. This is because the original data is not shuffled, so the first 400 rows are not representative of the held-out remainder.

## Cross validation

H2O's machine learning algorithms take an optional parameter nfolds to specify the number of cross-validation folds to build. H2O's cross-validation uses an internal weight vector to build the folds in an efficient manner (instead of physically building the splits).

In conjunction with the nfolds parameter, a user may specify the way in which observations are assigned to each fold with the fold_assignment parameter (see the sketch after this list), which can be set to one of:

    * AUTO: Perform random assignment
    * Random: Each row has an equal (1/nfolds) chance of landing in any fold.
    * Modulo: Observations are assigned to folds by taking the row index modulo nfolds.
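A minimal sketch of setting fold_assignment, using the H2ORandomForestEstimator interface that appears later in this tutorial (the argument values are illustrative, not taken from this notebook):

    from h2o.estimators.random_forest import H2ORandomForestEstimator

    # 10-fold cross-validation with deterministic (Modulo) fold assignment
    cv_model = H2ORandomForestEstimator(nfolds=10, fold_assignment="Modulo", seed=42)
    cv_model.train(x=x, y=y, training_frame=fr)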

In [20]:
model = h2o.random_forest(x=fr[x],y=fr[y], nfolds=10) # build a 10-fold cross-validated model


drf Model Build Progress: [##################################################] 100%
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:1: DeprecationWarning: `h2o.random_forest` is deprecated. Use the estimators sub module to build an H2ORandomForestEstimator.
  if __name__ == '__main__':

In [22]:
scores = numpy.array([m.r2() for m in model.xvals]) # iterate over the xval models using the xvals attribute
print("Expected R^2: %.2f +/- %.2f \n" % (scores.mean(), scores.std()*1.96))
print("Scores:" + str(scores.round(2)))


Expected R^2: 0.88 +/- 0.03 

Scores:[ 0.87  0.86  0.88  0.89  0.85  0.89  0.89  0.88  0.86  0.89]

However, you can still make use of cross_val_score from Scikit-Learn.

## Cross validation: H2O and Scikit-Learn


In [23]:
from sklearn.cross_validation import cross_val_score
from h2o.cross_validation import H2OKFold
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.model.regression import h2o_r2_score
from sklearn.metrics.scorer import make_scorer


/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/scipy/_lib/decorator.py:205: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  first = inspect.getargspec(caller)[0][0]  # first arg

You still must use H2O to make the folds. Currently, there is no H2OStratifiedKFold. Additionally, the H2ORandomForestEstimator is analogous to the scikit-learn RandomForestRegressor object and comes with its own fit method.
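A quick illustration of that last point, using the classes imported above (a sketch only; in this notebook the actual fitting is driven by cross_val_score below):

    rf = H2ORandomForestEstimator(seed=42)
    rf.fit(fr[x], fr[y])        # scikit-learn style fit; H2O warns that train() is preferred outside sklearn
    preds = rf.predict(fr[x])   # returns an H2OFrame of predictions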


In [24]:
model = H2ORandomForestEstimator(seed=42)

In [26]:
scorer = make_scorer(h2o_r2_score)   # make h2o_r2_score into a scikit_learn scorer
custom_cv = H2OKFold(fr, n_folds=10, seed=42) # make a cv 
scores = cross_val_score(model, fr[x], fr[y], scoring=scorer, cv=custom_cv)

print("Expected R^2: %.2f +/- %.2f \n" % (scores.mean(), scores.std()*1.96))
print("Scores:" +  str(scores.round(2)))


drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%

drf Model Build Progress: [##################################################] 100%
Expected R^2: 0.88 +/- 0.09 

Scores:[ 0.91  0.85  0.88  0.91  0.89  0.79  0.91  0.91  0.82  0.91]

There isn't much difference in the R^2 value since the fold strategy is exactly the same. However, there was a major difference in terms of computation time and memory usage.

Since the progress bar printout gets annoying, let's disable it.


In [27]:
h2o.__PROGRESS_BAR__=False
h2o.no_progress()

Grid search in H2O is still under active development and will be available very soon. However, it is possible to make use of Scikit-learn's grid search infrastructure (with some performance penalties).

## Randomized grid search: H2O and Scikit-Learn


In [28]:
from sklearn import __version__
sklearn_version = __version__
print(sklearn_version)

If you have 0.16.1, then your system can't handle complex randomized grid searches (it works in every other version of sklearn, including the soon-to-be-released 0.16.2 and older versions).

The steps to perform a randomized grid search:

  1. Import model and RandomizedSearchCV
  2. Define model
  3. Specify parameters to test
  4. Define grid search object
  5. Fit data to grid search object
  6. Collect scores

All of the steps from above will be repeated.

Because 0.16.1 is installed, we use scipy to define the parameter distributions explicitly.

ADVANCED TIP:

Turn off reference counting for spawning jobs in parallel (n_jobs=-1, or n_jobs > 1). We'll turn it back on again after the parallel job finishes.

If you don't want to run jobs in parallel, don't turn off the reference counting.

Pattern is:

     >>> h2o.turn_off_ref_cnts()
     >>> .... parallel job ....
     >>> h2o.turn_on_ref_cnts()

In [30]:
%%time
from h2o.estimators.random_forest import H2ORandomForestEstimator  # Import model
from sklearn.grid_search import RandomizedSearchCV  # Import grid search
from scipy.stats import randint, uniform

model = H2ORandomForestEstimator(seed=42)        # Define model

params = {"ntrees": randint(20,50),
          "max_depth": randint(1,10),
          "min_rows": randint(1,10),    # scikit's  min_samples_leaf
          "mtries": randint(2,fr[x].shape[1]),} # Specify parameters to test

scorer = make_scorer(h2o_r2_score)   # make h2o_r2_score into a scikit_learn scorer
custom_cv = H2OKFold(fr, n_folds=10, seed=42) # make a cv 
random_search = RandomizedSearchCV(model, params, 
                                   n_iter=30, 
                                   scoring=scorer, 
                                   cv=custom_cv, 
                                   random_state=42,
                                   n_jobs=1)       # Define grid search object

random_search.fit(fr[x], fr[y])

print("Best R^2:" + str(random_search.best_score_) + "\n")
print("Best params:" + str(random_search.best_params_))


Best R^2:0.8787633665910857

Best params:{'max_depth': 9, 'min_rows': 1, 'ntrees': 27, 'mtries': 9}
CPU times: user 58.5 s, sys: 3.73 s, total: 1min 2s
Wall time: 2min 3s

We might be tempted to think that we just had a large improvement; however, we must be cautious. The function below creates a more detailed report.


In [31]:
def report_grid_score_detail(random_search, charts=True):
    """Input fit grid search estimator. Returns df of scores with details"""
    df_list = []

    for line in random_search.grid_scores_:
        results_dict = dict(line.parameters)
        results_dict["score"] = line.mean_validation_score
        results_dict["std"] = line.cv_validation_scores.std()*1.96
        df_list.append(results_dict)

    result_df = pd.DataFrame(df_list)
    result_df = result_df.sort("score", ascending=False)
    
    if charts:
        for col in get_numeric(result_df):
            if col not in ["score", "std"]:
                plt.scatter(result_df[col], result_df.score)
                plt.title(col)
                plt.show()

        for col in list(result_df.columns[result_df.dtypes == "object"]):
            cat_plot = result_df.score.groupby(result_df[col]).mean()[0]
            cat_plot.sort()
            cat_plot.plot(kind="barh", xlim=(.5, None), figsize=(7, cat_plot.shape[0]/2))
            plt.show()
    return result_df

def get_numeric(X):
    """Return list of numeric dtypes variables"""
    return X.dtypes[X.dtypes.apply(lambda x: str(x).startswith(("float", "int", "bool")))].index.tolist()

In [32]:
report_grid_score_detail(random_search).head()


/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:12: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
Out[32]:
max_depth min_rows mtries ntrees score std
14 9 1 9 27 0.878763 0.058736
28 8 2 7 22 0.873955 0.120582
16 8 2 8 48 0.870907 0.077211
7 9 3 9 38 0.847962 0.125709
19 6 4 6 31 0.841485 0.167333

Based on the grid search report, we can narrow the parameters to search and rerun the analysis. The parameters below were chosen after a few runs:


In [34]:
%%time

params = {"ntrees": randint(30,40),
          "max_depth": randint(4,10),
          "mtries": randint(4,10),}

# In small datasets, the fold size can have a big impact on the std of the
# resulting scores. More folds --> fewer examples per fold --> higher variation per sample.
custom_cv = H2OKFold(fr, n_folds=5, seed=42)

random_search = RandomizedSearchCV(model, params,
                                   n_iter=10,
                                   scoring=scorer,
                                   cv=custom_cv,
                                   random_state=43,
                                   n_jobs=1)

random_search.fit(fr[x], fr[y])

print("Best R^2:" + str(random_search.best_score_) + "\n")
print("Best params:" + str(random_search.best_params_))

report_grid_score_detail(random_search)


Best R^2:0.8741357719344148

Best params:{'max_depth': 8, 'ntrees': 39, 'mtries': 6}
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:12: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
CPU times: user 10.6 s, sys: 648 ms, total: 11.2 s
Wall time: 24.4 s

## Transformations

Rule of machine learning: Don't use your testing data to inform your training data. Unfortunately, this happens all the time when preparing a dataset for the final model. But on smaller datasets, you must be especially careful.

At the moment, there are no classes for managing data transformations. On the one hand, this requires the user to tote around some extra state, but on the other, it allows the user to be more explicit about transforming H2OFrames.

Basic steps:

  1. Remove the response variable from transformations.
  2. Import transformer
  3. Define transformer
  4. Fit train data to transformer
  5. Transform test and train data
  6. Re-attach the response variable (see the sketch after this list).
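Steps 1 and 6 are easy to miss, since the cells below keep the response as a separate frame rather than re-attaching it. A minimal sketch of the bracketing pattern (pop and cbind are H2OFrame methods; H2OScaler stands in for any transformer, and the later cells perform the pop on train and test directly):

    from h2o.transforms.preprocessing import H2OScaler

    response = train.pop("Median_value")           # 1. remove the response before transforming
    scaler = H2OScaler()                           # 2-3. import and define the transformer
    scaler.fit(train)                              # 4. fit on the training data only
    train_scaled = scaler.transform(train)         # 5. transform the training (and test) data
    train_scaled = train_scaled.cbind(response)    # 6. re-attach the response column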

First, let's normalize the data using the means and standard deviations of the training data. Then let's perform a principal component analysis on the training data and select the top 5 components. Finally, we use those components to reduce the dimensionality of the train and test design matrices.


In [35]:
from h2o.transforms.preprocessing import H2OScaler
from h2o.transforms.decomposition import H2OPCA

Normalize Data: Use the means and standard deviations from the training data.


In [36]:
y_train = train.pop("Median_value")
y_test  = test.pop("Median_value")

In [37]:
norm = H2OScaler()
norm.fit(train)
X_train_norm = norm.transform(train)
X_test_norm  = norm.transform(test)

In [38]:
print(X_test_norm.shape)
X_test_norm


(113, 13)
B        AGE       NOX          LSTAT     TAX       RM          ZN        INDUS     RAD       DIS       PTRATIO   CRIM      CHAS
3006.79  -613.976  -0.0109691   -67.7964  -30014.5  0.531103    -247.991  -58.7935  -54.5035  4.6657    0.693564  -25.2944  -0.0170139
3202.85  778.017   -0.00328778  46.5703   -15315.2  -0.0667628  35.1418   -20.3865  -37.345   4.43096   -7.01271  -24.4359  -0.0170139
3202.85  412.722   -0.00328778  5.08501   -15315.2  -0.184743   35.1418   -20.3865  -37.345   5.0114    -7.01271  -24.6431  -0.0170139
1744.96  457       -0.0016584   -16.1515  -15975.9  -0.121772   -247.991  -18.564   -45.9243  1.30266   5.75769   -20.6599  -0.0170139
2999.02  885.945   -0.0016584   51.7206   -15975.9  -0.32661    -247.991  -18.564   -45.9243  0.532034  5.75769   -17.9778  -0.0170139
2210.48  -1801.18  -0.012133    -54.3913  -28197.7  0.366074    -247.991  -26.8664  -54.5035  3.94846   -1.06787  -24.5668  -0.0170139
2992.11  -1701.56  -0.012133    -36.0475  -28197.7  -0.0385343  -247.991  -26.8664  -54.5035  3.94846   -1.06787  -24.3225  -0.0170139
2857.38  -760.647  -0.0164392   -60.6705  -24399    0.39937     2017.07   -64.5984  -37.345   9.42058   -7.23289  -25.4325  -0.0170139
3202.85  -575.233  -0.011551    -23.489   -19774.6  -0.244096   318.275   -38.8813  -11.6072  6.4936    2.89536   -24.7517  -0.0170139
1577.4   703.297   -0.011551    13.3397   -19774.6  -0.215867   318.275   -38.8813  -11.6072  6.25507   2.89536   -24.228   -0.0170139
Out[38]:

Then, we can apply PCA and keep the top 5 components.


In [39]:
pca = H2OPCA(k=5)
pca.fit(X_train_norm)
X_train_norm_pca = pca.transform(X_train_norm)
X_test_norm_pca  = pca.transform(X_test_norm)


/Users/spencer/0xdata/h2o-3/h2o-py/h2o/transforms/decomposition.py:61: UserWarning: 

	`fit` is not recommended outside of the sklearn framework. Use `train` instead.
  return super(H2OPCA, self).fit(X)

In [40]:
# prop of variance explained by top 5 components?

In [41]:
print(X_test_norm_pca.shape)
X_test_norm_pca[:5]


(113, 5)
PC1      PC2       PC3        PC4       PC5
30161.1  668.607   -0.595681  466.145   2.55589
15579.7  -1308.37  895.183    -429.89   6.64585
15584.5  -1310.74  573.971    -254.186  -4.39019
16060.6  217.81    744.26     -26.9187  -13.6447
16207.6  -1024.08  1132.87    -234.328  11.172
28276.6  1230.04   -1070.76   1040.16   24.3973
28370.5  454.905   -976.171   991.594   31.6011
24584    125.492   -1291.95   -1441.28  6.97843
20024.6  -774.698  -370.114   -34.056   14.0372
19810    846.846   737.13     -647.294  5.65924
Out[41]:


In [42]:
model = H2ORandomForestEstimator(seed=42)
model.fit(X_train_norm_pca,y_train)
y_hat  = model.predict(X_test_norm_pca)


/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:2: UserWarning: 

	`fit` is not recommended outside of the sklearn framework. Use `train` instead.
  from ipykernel import kernelapp as app

In [43]:
h2o_r2_score(y_test,y_hat)


Out[43]:
0.4814662917795077

Although this is MUCH simpler than keeping track of all of these transformations manually, it gets to be somewhat of a burden when you want to chain together multiple transformers.

## Pipelines

"Tranformers unite!"

If your raw data is a mess and you have to perform several transformations before using it, use a pipeline to keep things simple.

Steps:

  1. Import Pipeline, transformers, and model
  2. Define pipeline. The first and only argument is a list of tuples, where the first element of each tuple is a name you give the step and the second element is a defined transformer. The last step may optionally be an estimator class (like a RandomForest).
  3. Fit the training data to pipeline
  4. Either transform or predict the testing data

In [44]:
from h2o.transforms.preprocessing import H2OScaler
from h2o.transforms.decomposition import H2OPCA
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [45]:
from sklearn.pipeline import Pipeline                # Import Pipeline <other imports not shown>
model = H2ORandomForestEstimator(seed=42)
pipe = Pipeline([("standardize", H2OScaler()),       # Define pipeline as a series of steps
                 ("pca", H2OPCA(k=5)),
                 ("rf", model)])                     # Notice the last step is an estimator

pipe.fit(train, y_train)                             # Fit training data
y_hat = pipe.predict(test)                           # Predict testing data (due to last step being an estimator)
h2o_r2_score(y_test, y_hat)                          # Notice the final score is identical to before


Out[45]:
0.4044178315789908

This is so much easier!!!

But, wait a second, we did worse after applying these transformations! We might wonder how different hyperparameters for the transformations impact the final score.

## Combining randomized grid search and pipelines

"Yo dawg, I heard you like models, so I put models in your models to model models."

Steps:

  1. Import Pipeline, grid search, transformers, and estimators
  2. Define pipeline
  3. Define parameters to test in the form "(step name)__(argument name)". A double underscore separates the two words.
  4. Define grid search
  5. Fit to grid search

In [46]:
pipe = Pipeline([("standardize", H2OScaler()),
                 ("pca", H2OPCA()),
                 ("rf", H2ORandomForestEstimator(seed=42))])

params = {"standardize__center":    [True, False],           # Parameters to test
          "standardize__scale":     [True, False],
          "pca__k":                 randint(2, 6),
          "rf__ntrees":             randint(50,80),
          "rf__max_depth":          randint(4,10),
          "rf__min_rows":           randint(5,10), }
#           "rf__mtries":             randint(1,4),}           # gridding over mtries is 
                                                               # problematic with pca grid over 
                                                               # k above 

from sklearn.grid_search import RandomizedSearchCV
from h2o.cross_validation import H2OKFold
from h2o.model.regression import h2o_r2_score
from sklearn.metrics.scorer import make_scorer

custom_cv = H2OKFold(fr, n_folds=5, seed=42)
random_search = RandomizedSearchCV(pipe, params,
                                   n_iter=30,
                                   scoring=make_scorer(h2o_r2_score),
                                   cv=custom_cv,
                                   random_state=42,
                                   n_jobs=1)


random_search.fit(fr[x],fr[y])
results = report_grid_score_detail(random_search)
results.head()


/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:12: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
Out[46]:
pca__k rf__max_depth rf__min_rows rf__ntrees score standardize__center standardize__scale std
13 5 9 5 55 0.481439 False False 0.155300
0 5 5 5 59 0.436773 True False 0.085499
10 5 7 5 55 0.427226 False True 0.150276
16 5 7 9 71 0.416305 False False 0.106783
17 5 5 5 70 0.407332 False True 0.141057

Currently Under Development (drop-in scikit-learn pieces):

* Richer set of transforms (only PCA and Scale are implemented)
* Richer set of estimators (only RandomForest is available)
* Full H2O Grid Search

## Other Tips: Model Save/Load

It is useful to save constructed models to disk and reload them between H2O sessions. Here's how:


In [47]:
best_estimator = random_search.best_estimator_                        # fetch the pipeline from the grid search
h2o_model      = h2o.get_model(best_estimator._final_estimator._id)    # fetch the model from the pipeline

In [48]:
save_path = h2o.save_model(h2o_model, path=".", force=True)
print(save_path)


/Users/spencer/0xdata/h2o-3/DRF_model_python_1449627030339_2050

In [49]:
# assumes new session
my_model = h2o.load_model(path=save_path)

In [50]:
my_model.predict(X_test_norm_pca)


predict
18.1033
21.9245
26.2502
19.8352
21.5933
16.0593
16.0593
14.0941
17.0808
17.9399
Out[50]: