H2O Tutorial: EEG Eye State Classification

Author: Erin LeDell

Contact: erin@h2o.ai

This tutorial steps through a quick introduction to H2O's Python API. The goal of this tutorial is to introduce through a complete example H2O's capabilities from Python.

Most of the functionality for a Pandas DataFrame is exactly the same syntax for an H2OFrame, so if you are comfortable with Pandas, data frame manipulation will come naturally to you in H2O. The modeling syntax in the H2O Python API may also remind you of scikit-learn.

References: H2O Python API documentation and H2O general documentation

Install H2O in Python

Prerequisites

This tutorial assumes you have Python 2.7 installed. The h2o Python package has a few dependencies which can be installed using pip. The packages that are required are (which also have their own dependencies):

pip install requests
pip install tabulate
pip install scikit-learn

If you have any problems (for example, installing the scikit-learn package), check out this page for tips.

Install h2o

Once the dependencies are installed, you can install H2O. We will use the latest stable version of the h2o package, which is currently "Tibshirani-8." The installation instructions are on the "Install in Python" tab on this page.

# The following command removes the H2O module for Python (if it already exists).
pip uninstall h2o

# Next, use pip to install this version of the H2O Python module.
pip install http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/8/Python/h2o-3.6.0.8-py2.py3-none-any.whl

For reference, the Python documentation for the latest stable release of H2O is here.

Start up an H2O cluster

In a Python terminal, we can import the h2o package and start up an H2O cluster.


In [2]:
import h2o

# Start an H2O Cluster on your local machine
h2o.init()


H2O cluster uptime: 12 minutes 16 seconds 831 milliseconds
H2O cluster version: 3.6.0.3
H2O cluster name: H2O_started_from_python
H2O cluster total nodes: 1
H2O cluster total memory: 3.56 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

If you already have an H2O cluster running that you'd like to connect to (for example, in a multi-node Hadoop environment), then you can specify the IP and port of that cluster as follows:


In [2]:
# This will not actually do anything since it's a fake IP address
# h2o.init(ip="123.45.67.89", port=54321)

Download EEG Data

The following code downloads a copy of the EEG Eye State dataset. All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video frames. '1' indicates the eye-closed and '0' the eye-open state. All values are in chronological order with the first measured value at the top of the data.

We can import the data directly into H2O using the import_file method in the Python API. The import path can be a URL, a local path, a path to an HDFS file, or a file on Amazon S3.


In [4]:
#csv_url = "http://www.stat.berkeley.edu/~ledell/data/eeg_eyestate_splits.csv"
csv_url = "https://h2o-public-test-data.s3.amazonaws.com/eeg_eyestate_splits.csv"
data = h2o.import_file(csv_url)


Parse Progress: [##################################################] 100%

Explore Data

Once we have loaded the data, let's take a quick look. First the dimension of the frame:


In [6]:
data.shape


Out[6]:
(14980, 16)

Now let's take a look at the top of the frame:


In [7]:
data.head()


AF3 F7 F3 FC5 T7 P7 O1 O2 P8 T8 FC6 F4 F8 AF4 eyeDetectionsplit
4329.234009.234289.234148.214350.264586.154096.924641.034222.054238.464211.284280.514635.9 4393.85 0valid
4324.624004.624293.854148.724342.054586.674097.444638.974210.774226.674207.694279.494632.824384.1 0test
4327.694006.674295.384156.414336.924583.594096.924630.264207.694222.054206.674282.054628.724389.23 0train
4328.724011.794296.414155.9 4343.594582.564097.444630.774217.444235.384210.774287.694632.314396.41 0train
4326.154011.794292.314151.284347.694586.674095.9 4627.694210.774244.1 4212.824288.214632.824398.46 0train
4321.034004.624284.1 4153.334345.644587.184093.334616.924202.564232.824209.744281.034628.214389.74 0train
4319.494001.034280.514151.794343.594584.624089.744615.9 4212.314226.674201.034269.744625.134378.46 0test
4325.644006.674278.464143.084344.1 4583.084087.184614.874205.644230.264195.9 4266.674622.054380.51 0test
4326.154010.774276.414139.494345.134584.1 4091.284608.214187.694229.744202.054273.854627.184389.74 0test
4326.154011.284276.924142.054344.1 4582.564092.824608.724194.364228.724212.824277.954637.444393.33 0train
Out[7]:

The first 14 columns are numeric values that represent EEG measurements from the headset. The "eyeDetection" column is the response. There is an additional column called "split" that was added (by me) in order to specify partitions of the data (so we can easily benchmark against other tools outside of H2O using the same splits). I randomly divided the dataset into three partitions: train (60%), valid (%20) and test (20%) and marked which split each row belongs to in the "split" column.

Let's take a look at the column names. The data contains derived features from the medical images of the tumors.


In [6]:
data.columns


Out[6]:
[u'AF3',
 u'F7',
 u'F3',
 u'FC5',
 u'T7',
 u'P7',
 u'O1',
 u'O2',
 u'P8',
 u'T8',
 u'FC6',
 u'F4',
 u'F8',
 u'AF4',
 u'eyeDetection',
 u'split']

To select a subset of the columns to look at, typical Pandas indexing applies:


In [8]:
columns = ['AF3', 'eyeDetection', 'split']
data[columns].head()


AF3 eyeDetectionsplit
4329.23 0valid
4324.62 0test
4327.69 0train
4328.72 0train
4326.15 0train
4321.03 0train
4319.49 0test
4325.64 0test
4326.15 0test
4326.15 0train
Out[8]:

Now let's select a single column, for example -- the response column, and look at the data more closely:


In [9]:
y = 'eyeDetection'
data[y]


eyeDetection
0
0
0
0
0
0
0
0
0
0
Out[9]:

It looks like a binary response, but let's validate that assumption:


In [10]:
data[y].unique()


C1
0
1
Out[10]:

If you don't specify the column types when you import the file, H2O makes a guess at what your column types are. If there are 0's and 1's in a column, H2O will automatically parse that as numeric by default.

Therefore, we should convert the response column to a more efficient "enum" representation -- in this case it is a categorial variable with two levels, 0 and 1. If the only column in my data that is categorical is the response, I typically don't bother specifying the column type during the parse, and instead use this one-liner to convert it aftewards:


In [11]:
data[y] = data[y].asfactor()

Now we can check that there are two levels in our response column:


In [12]:
data[y].nlevels()


Out[12]:
[2]

We can query the categorical "levels" as well ('0' and '1' stand for "eye open" and "eye closed") to see what they are:


In [13]:
data[y].levels()


Out[13]:
[['0', '1']]

We may want to check if there are any missing values, so let's look for NAs in our dataset. For tree-based methods like GBM and RF, H2O handles missing feature values automatically, so it's not a problem if we are missing certain feature values. However, it is always a good idea to check to make sure that you are not missing any of the training labels.

To figure out which, if any, values are missing, we can use the isna method on the diagnosis column. The columns in an H2O Frame are also H2O Frames themselves, so all the methods that apply to a Frame also apply to a single column.


In [14]:
data.isna()


C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Out[14]:


In [15]:
data[y].isna()


C1
0
0
0
0
0
0
0
0
0
0
Out[15]:

The isna method doesn't directly answer the question, "Does the response column contain any NAs?", rather it returns a 0 if that cell is not missing (Is NA? FALSE == 0) and a 1 if it is missing (Is NA? TRUE == 1). So if there are no missing values, then summing over the whole column should produce a summand equal to 0.0. Let's take a look:


In [16]:
data[y].isna().sum()


Out[16]:
0.0

Great, no missing labels. :-)

Out of curiosity, let's see if there is any missing data in this frame:


In [17]:
data.isna().sum()


Out[17]:
0.0

The sum is still zero, so there are no missing values in any of the cells.

The next thing I may wonder about in a binary classification problem is the distribution of the response in the training data. Is one of the two outcomes under-represented in the training set? Many real datasets have what's called an "imbalanace" problem, where one of the classes has far fewer training examples than the other class. Let's take a look at the distribution:


In [18]:
data[y].table()


eyeDetection Count
0 8257
1 6723
Out[18]:

Ok, the data is not exactly evenly distributed between the two classes -- there are more 0's than 1's in the dataset. However, this level of imbalance shouldn't be much of an issue for the machine learning algos. (We will revisit this later in the modeling section below).

Let's calculate the percentage that each class represents:


In [19]:
n = data.shape[0]  # Total number of training samples
data[y].table()['Count']/n


Count
0.551202
0.448798
Out[19]:

Split H2O Frame into a train and test set

So far we have explored the original dataset (all rows). For the machine learning portion of this tutorial, we will break the dataset into three parts: a training set, validation set and a test set.

If you want H2O to do the splitting for you, you can use the split_frame method. However, we have explicit splits that we want (for reproducibility reasons), so we can just subset the Frame to get the partitions we want.

Subset the data H2O Frame on the "split" column:


In [20]:
train = data[data['split']=="train"]
train.shape


Out[20]:
(8988, 16)

In [21]:
valid = data[data['split']=="valid"]
valid.shape


Out[21]:
(2996, 16)

In [22]:
test = data[data['split']=="test"]
test.shape


Out[22]:
(2996, 16)

Machine Learning in H2O

We will do a quick demo of the H2O software using a Gradient Boosting Machine (GBM). The goal of this problem is to train a model to predict eye state (open vs closed) from EEG data.

Train and Test a GBM model


In [23]:
# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

We first create a model object of class, "H2OGradientBoostingEstimator". This does not actually do any training, it just sets the model up for training by specifying model parameters.


In [24]:
model = H2OGradientBoostingEstimator(distribution='bernoulli',
                                    ntrees=100,
                                    max_depth=4,
                                    learn_rate=0.1)

Specify the predictor set and response

The model object, like all H2O estimator objects, has a train method, which will actually perform model training. At this step we specify the training and (optionally) a validation set, along with the response and predictor variables.

The x argument should be a list of predictor names in the training frame, and y specifies the response column. We have already set y = "eyeDetector" above, but we still need to specify x.


In [25]:
x = list(train.columns)
x


Out[25]:
[u'AF3',
 u'F7',
 u'F3',
 u'FC5',
 u'T7',
 u'P7',
 u'O1',
 u'O2',
 u'P8',
 u'T8',
 u'FC6',
 u'F4',
 u'F8',
 u'AF4',
 u'eyeDetection',
 u'split']

In [27]:
del x[12:14]  #Remove the 13th and 14th columns, 'eyeDetection' and 'split'
x


Out[27]:
[u'AF3',
 u'F7',
 u'F3',
 u'FC5',
 u'T7',
 u'P7',
 u'O1',
 u'O2',
 u'P8',
 u'T8',
 u'FC6',
 u'F4']

Now that we have specified x and y, we can train the model:


In [28]:
model.train(x=x, y=y, training_frame=train, validation_frame=valid)


gbm Model Build Progress: [##################################################] 100%

Inspect Model

The type of results shown when you print a model, are determined by the following:

  • Model class of the estimator (e.g. GBM, RF, GLM, DL)
  • The type of machine learning problem (e.g. binary classification, multiclass classification, regression)
  • The data you specify (e.g. training_frame only, training_frame and validation_frame, or training_frame and nfolds)

Below, we see a GBM Model Summary, as well as training and validation metrics since we supplied a validation_frame. Since this a binary classification task, we are shown the relevant performance metrics, which inclues: MSE, R^2, LogLoss, AUC and Gini. Also, we are shown a Confusion Matrix, where the threshold for classification is chosen automatically (by H2O) as the threshold which maximizes the F1 score.

The scoring history is also printed, which shows the performance metrics over some increment such as "number of trees" in the case of GBM and RF.

Lastly, for tree-based methods (GBM and RF), we also print variable importance.


In [27]:
print(model)


Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1448559565749_9080

Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
100.0 23614.0 4.0 4.0 4.0 10.0 16.0 14.9

ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.114026790434
R^2: 0.539835211
LogLoss: 0.376005292812
AUC: 0.936370388939
Gini: 0.872740777878

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.43076103173:
0 1 Error Rate
0 4102.0 814.0 0.1656 (814.0/4916.0)
1 534.0 3538.0 0.1311 (534.0/4072.0)
Total 4636.0 4352.0 0.15 (1348.0/8988.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.4307610 0.8399810 225.0
max f2 0.2903699 0.8927355 280.0
max f0point5 0.5822403 0.8685376 167.0
max accuracy 0.5002179 0.8529150 199.0
max precision 0.9870341 1.0 0.0
max absolute_MCC 0.5002179 0.7030709 199.0
max min_per_class_accuracy 0.4473262 0.8492677 218.0
Gains/Lift Table: Avg response rate: 45.30 %

group lower_threshold cumulative_data_fraction response_rate cumulative_response_rate capture_rate cumulative_capture_rate lift cumulative_lift gain cumulative_gain
1 0.9206586 0.0500668 1.0 1.0 0.1105108 0.1105108 2.2072692 2.2072692 120.7269155 120.7269155
2 0.8721988 0.1000223 0.9955457 0.9977753 0.1097741 0.2202849 2.1974372 2.2023587 119.7437221 120.2358657
3 0.8154129 0.1500890 0.9888889 0.9948110 0.1092829 0.3295678 2.1827439 2.1958156 118.2743942 119.5815572
4 0.7587556 0.2000445 0.9599109 0.9860957 0.1058448 0.4354126 2.1187818 2.1765785 111.8781750 117.6578538
5 0.7006297 0.25 0.9020045 0.9692924 0.0994597 0.5348723 1.9909666 2.1394892 99.0966610 113.9489195
6 0.6450366 0.3000668 0.8644444 0.9517983 0.0955305 0.6304028 1.9080616 2.1008750 90.8061559 110.0875017
7 0.5826161 0.3500223 0.7260579 0.9195804 0.0800589 0.7104617 1.6026052 2.0297615 60.2605222 102.9761496
8 0.5183693 0.3999777 0.5924276 0.8787204 0.0653242 0.7757859 1.3076472 1.9395725 30.7647206 93.9572534
9 0.4640757 0.4500445 0.5222222 0.8390606 0.0577112 0.8334971 1.1526850 1.8520325 15.2685003 85.2032512
10 0.4124356 0.5 0.4365256 0.7988429 0.0481336 0.8816306 0.9635295 1.7632613 -3.6470480 76.3261297
11 0.3623212 0.5499555 0.3385301 0.7570301 0.0373281 0.9189587 0.7472270 1.6709693 -25.2773025 67.0969286
12 0.3185648 0.6000223 0.2666667 0.7161135 0.0294695 0.9484283 0.5886051 1.5806552 -41.1394892 58.0655197
13 0.2778815 0.6499777 0.1826281 0.6751113 0.0201375 0.9685658 0.4031093 1.4901523 -59.6890711 49.0152268
14 0.2374738 0.6999332 0.1469933 0.6374185 0.0162083 0.9847741 0.3244538 1.4069543 -67.5546182 40.6954270
15 0.2014388 0.75 0.0644444 0.5991693 0.0071218 0.9918959 0.1422462 1.3225278 -85.7753766 32.2527832
16 0.1656835 0.7999555 0.0356347 0.5639777 0.0039293 0.9958251 0.0786555 1.2448507 -92.1344529 24.4850685
17 0.1326734 0.8499110 0.0200445 0.5320068 0.0022102 0.9980354 0.0442437 1.1742822 -95.5756298 17.4282216
18 0.1036424 0.8999777 0.0088889 0.5029052 0.0009823 0.9990177 0.0196202 1.1100471 -98.0379830 11.0047092
19 0.0758269 0.9499332 0.0044543 0.4766924 0.0004912 0.9995088 0.0098319 1.0521885 -99.0168066 5.2188506
20 0.0081794 1.0 0.0044444 0.4530485 0.0004912 1.0 0.0098101 1.0 -99.0189915 0.0

ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.124121459821
R^2: 0.499326493922
LogLoss: 0.400023227684
AUC: 0.917514329947
Gini: 0.835028659894

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.460577730095:
0 1 Error Rate
0 1364.0 271.0 0.1657 (271.0/1635.0)
1 223.0 1138.0 0.1639 (223.0/1361.0)
Total 1587.0 1409.0 0.1649 (494.0/2996.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.4605777 0.8216606 211.0
max f2 0.3288904 0.8816658 269.0
max f0point5 0.5924794 0.8415575 158.0
max accuracy 0.4749281 0.8354473 205.0
max precision 0.9728563 1.0 0.0
max absolute_MCC 0.4620595 0.6691432 210.0
max min_per_class_accuracy 0.4605777 0.8342508 211.0
Gains/Lift Table: Avg response rate: 45.43 %

group lower_threshold cumulative_data_fraction response_rate cumulative_response_rate capture_rate cumulative_capture_rate lift cumulative_lift gain cumulative_gain
1 0.9213965 0.0500668 1.0 1.0 0.1102131 0.1102131 2.2013226 2.2013226 120.1322557 120.1322557
2 0.8773782 0.1001335 0.9933333 0.9966667 0.1094783 0.2196914 2.1866471 2.1939848 118.6647073 119.3984815
3 0.8248667 0.1502003 0.9866667 0.9933333 0.1087436 0.3284350 2.1719716 2.1866471 117.1971590 118.6647073
4 0.7540418 0.2002670 0.92 0.975 0.1013960 0.4298310 2.0252168 2.1462895 102.5216752 114.6289493
5 0.7023105 0.25 0.8590604 0.9519359 0.0940485 0.5238795 1.8910690 2.0955180 89.1069042 109.5518001
6 0.6471993 0.3000668 0.7666667 0.9210234 0.0844967 0.6083762 1.6876806 2.0274695 68.7680627 102.7469496
7 0.5924509 0.3501335 0.7066667 0.8903718 0.0778839 0.6862601 1.5556013 1.9599955 55.5601274 95.9995489
8 0.5395983 0.4002003 0.6133333 0.8557131 0.0675974 0.7538575 1.3501445 1.8837005 35.0144502 88.3700537
9 0.4826484 0.4499332 0.5167785 0.8182493 0.0565760 0.8104335 1.1375962 1.8012305 13.7596221 80.1230549
10 0.4285123 0.5 0.4066667 0.7770360 0.0448200 0.8552535 0.8952045 1.7105070 -10.4795494 71.0506980
11 0.3800505 0.5500668 0.42 0.7445388 0.0462895 0.9015430 0.9245555 1.6389701 -7.5444526 63.8970132
12 0.3377836 0.6001335 0.3066667 0.7080089 0.0337987 0.9353417 0.6750723 1.5585560 -32.4927749 55.8555959
13 0.2898491 0.6498665 0.1946309 0.6687211 0.0213079 0.9566495 0.4284453 1.4720709 -57.1554670 47.2070862
14 0.2490383 0.6999332 0.1 0.6280401 0.0110213 0.9676708 0.2201323 1.3825187 -77.9867744 38.2518745
15 0.2117511 0.75 0.0866667 0.5919003 0.0095518 0.9772226 0.1907813 1.3029635 -80.9218712 30.2963507
16 0.1752189 0.8000668 0.1 0.5611181 0.0110213 0.9882439 0.2201323 1.2352019 -77.9867744 23.5201852
17 0.1445499 0.8497997 0.0469799 0.5310291 0.0051433 0.9933872 0.1034178 1.1689663 -89.6582162 16.8966260
18 0.1085791 0.8998665 0.0266667 0.5029674 0.0029390 0.9963262 0.0587019 1.1071934 -94.1298065 10.7193393
19 0.0788782 0.9499332 0.02 0.4775123 0.0022043 0.9985305 0.0440265 1.0511586 -95.5973549 5.1158593
20 0.0087775 1.0 0.0133333 0.4542724 0.0014695 1.0 0.0293510 1.0 -97.0649033 0.0

Scoring History:
timestamp duration number_of_trees training_MSE training_logloss training_AUC training_lift training_classification_error validation_MSE validation_logloss validation_AUC validation_lift validation_classification_error
2015-11-26 10:25:17 0.036 sec 1.0 0.2393135 0.6716245 0.7312101 1.9746189 0.3159769 0.2393579 0.6717168 0.7310525 2.0087068 0.3197597
2015-11-26 10:25:17 0.071 sec 2.0 0.2324178 0.6576643 0.7498719 2.0431475 0.3172007 0.2326440 0.6581388 0.7446153 2.0202614 0.3147530
2015-11-26 10:25:18 0.102 sec 3.0 0.2261150 0.6447969 0.7735294 2.0350459 0.3147530 0.2265954 0.6458099 0.7649803 2.0103402 0.2973965
2015-11-26 10:25:18 0.138 sec 4.0 0.2205965 0.6333786 0.7768265 2.0432395 0.2890521 0.2210435 0.6343441 0.7709237 2.0264511 0.3034045
2015-11-26 10:25:18 0.183 sec 5.0 0.2152810 0.6223737 0.7890473 2.1408557 0.2900534 0.2157338 0.6233813 0.7850207 2.0890102 0.3017356
--- --- --- --- --- --- --- --- --- --- --- --- --- ---
2015-11-26 10:25:21 3.451 sec 34.0 0.1554295 0.4845459 0.8784078 2.2023641 0.2209613 0.1583969 0.4916962 0.8697198 2.2013226 0.2166222
2015-11-26 10:25:21 3.612 sec 35.0 0.1541826 0.4815032 0.8802874 2.2024603 0.2212951 0.1571958 0.4887622 0.8718223 2.2013226 0.2192924
2015-11-26 10:25:21 3.770 sec 36.0 0.1533055 0.4793517 0.8821833 2.2024073 0.2170672 0.1562774 0.4865299 0.8738704 2.2013226 0.2176235
2015-11-26 10:25:21 3.927 sec 37.0 0.1523871 0.4767990 0.8825759 2.2024073 0.2159546 0.1554071 0.4841040 0.8742059 2.2013226 0.2152870
2015-11-26 10:25:22 4.811 sec 100.0 0.1140268 0.3760053 0.9363704 2.2072692 0.1499777 0.1241215 0.4000232 0.9175143 2.2013226 0.1648865
Variable Importances:
variable relative_importance scaled_importance percentage
P7 1308.6181641 1.0 0.2080290
O1 1043.3781738 0.7973129 0.1658642
F7 835.8756714 0.6387468 0.1328779
AF3 730.2410889 0.5580246 0.1160853
F4 465.6864014 0.3558612 0.0740295
O2 465.3877869 0.3556330 0.0739820
T8 340.6835938 0.2603384 0.0541580
FC6 282.6287231 0.2159749 0.0449291
FC5 249.6704864 0.1907894 0.0396897
F3 238.9603577 0.1826051 0.0379872
T7 233.7027283 0.1785874 0.0371514
P8 95.7217865 0.0731472 0.0152167

Model Performance on a Test Set

Once a model has been trained, you can also use it to make predictions on a test set. In the case above, we just ran the model once, so our validation set (passed as validation_frame), could have also served as a "test set." We technically have already created test set predictions and evaluated test set performance.

However, when performing model selection over a variety of model parameters, it is common for users to train a variety of models (using different parameters) using the training set, train, and a validation set, valid. Once the user selects the best model (based on validation set performance), the true test of model performance is performed by making a final set of predictions on the held-out (never been used before) test set, test.

You can use the model_performance method to generate predictions on a new dataset. The results are stored in an object of class, "H2OBinomialModelMetrics".


In [28]:
perf = model.model_performance(test)
print(perf.__class__)


<class 'h2o.model.metrics_base.H2OBinomialModelMetrics'>

Individual model performance metrics can be extracted using methods like r2, auc and mse. In the case of binary classification, we may be most interested in evaluating test set Area Under the ROC Curve (AUC).


In [29]:
perf.r2()


Out[29]:
0.49537936872725086

In [30]:
perf.auc()


Out[30]:
0.91671733144306

In [31]:
perf.mse()


Out[31]:
0.12372290870105287

Cross-validated Performance

To perform k-fold cross-validation, you use the same code as above, but you specify nfolds as an integer greater than 1, or add a "fold_column" to your H2O Frame which indicates a fold ID for each row.

Unless you have a specific reason to manually assign the observations to folds, you will find it easiest to simply use the nfolds argument.

When performing cross-validation, you can still pass a validation_frame, but you can also choose to use the original dataset that contains all the rows. We will cross-validate a model below using the original H2O Frame which is called data.


In [32]:
cvmodel = H2OGradientBoostingEstimator(distribution='bernoulli',
                                       ntrees=100,
                                       max_depth=4,
                                       learn_rate=0.1,
                                       nfolds=5)

cvmodel.train(x=x, y=y, training_frame=data)


gbm Model Build Progress: [##################################################] 100%

This time around, we will simply pull the training and cross-validation metrics out of the model. To do so, you use the auc method again, and you can specify train or xval as True to get the correct metric.


In [33]:
print(cvmodel.auc(train=True))
print(cvmodel.auc(xval=True))


0.926208136139
0.909288088259

One way of evaluting models with different parameters is to perform a grid search over a set of parameter values. For example, in GBM, here are three model parameters that may be useful to search over:

  • ntrees: Number of trees
  • max_depth: Maximum depth of a tree
  • learn_rate: Learning rate in the GBM

We will define a grid as follows:


In [34]:
ntrees_opt = [5,50,100]
max_depth_opt = [2,3,5]
learn_rate_opt = [0.1,0.2]

hyper_params = {'ntrees': ntrees_opt, 
                'max_depth': max_depth_opt,
                'learn_rate': learn_rate_opt}

Define an "H2OGridSearch" object by specifying the algorithm (GBM) and the hyper parameters:


In [35]:
from h2o.grid.grid_search import H2OGridSearch

gs = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params = hyper_params)

An "H2OGridSearch" object also has a train method, which is used to train all the models in the grid.


In [36]:
gs.train(x=x, y=y, training_frame=train, validation_frame=valid)


gbm Grid Build Progress: [##################################################] 100%

Compare Models


In [37]:
print(gs)


Grid Search Results for H2OGradientBoostingEstimator:
Model Id Hyperparameters: [learn_rate, ntrees, max_depth] mse
Grid_GBM_py_17_model_python_1448559565749_9958_model_17 [0.2, 100, 5] 0.0511374
Grid_GBM_py_17_model_python_1448559565749_9958_model_16 [0.2, 50, 5] 0.0825649
Grid_GBM_py_17_model_python_1448559565749_9958_model_8 [0.1, 100, 5] 0.0827864
Grid_GBM_py_17_model_python_1448559565749_9958_model_7 [0.1, 50, 5] 0.1148579
Grid_GBM_py_17_model_python_1448559565749_9958_model_14 [0.2, 100, 3] 0.1183549
Grid_GBM_py_17_model_python_1448559565749_9958_model_13 [0.2, 50, 3] 0.1433345
Grid_GBM_py_17_model_python_1448559565749_9958_model_5 [0.1, 100, 3] 0.1446235
Grid_GBM_py_17_model_python_1448559565749_9958_model_4 [0.1, 50, 3] 0.1671745
Grid_GBM_py_17_model_python_1448559565749_9958_model_11 [0.2, 100, 2] 0.1718563
Grid_GBM_py_17_model_python_1448559565749_9958_model_15 [0.2, 5, 5] 0.1796775
Grid_GBM_py_17_model_python_1448559565749_9958_model_10 [0.2, 50, 2] 0.1876036
Grid_GBM_py_17_model_python_1448559565749_9958_model_2 [0.1, 100, 2] 0.1879522
Grid_GBM_py_17_model_python_1448559565749_9958_model_1 [0.1, 50, 2] 0.2024828
Grid_GBM_py_17_model_python_1448559565749_9958_model_6 [0.1, 5, 5] 0.2042339
Grid_GBM_py_17_model_python_1448559565749_9958_model_12 [0.2, 5, 3] 0.2126602
Grid_GBM_py_17_model_python_1448559565749_9958_model_3 [0.1, 5, 3] 0.2264710
Grid_GBM_py_17_model_python_1448559565749_9958_model_9 [0.2, 5, 2] 0.2276228
Grid_GBM_py_17_model_python_1448559565749_9958_model_0 [0.1, 5, 2] 0.2355396


In [38]:
# print out the auc for all of the models
auc_table = gs.sort_by('auc(valid=True)',increasing=False)
print(auc_table)


Grid Search Results for H2OGradientBoostingEstimator:
Model Id Hyperparameters: [learn_rate, ntrees, max_depth] auc(valid=True)
Grid_GBM_py_17_model_python_1448559565749_9958_model_17 [0.2, 100, 5] 0.9604258
Grid_GBM_py_17_model_python_1448559565749_9958_model_8 [0.1, 100, 5] 0.9422169
Grid_GBM_py_17_model_python_1448559565749_9958_model_16 [0.2, 50, 5] 0.9417059
Grid_GBM_py_17_model_python_1448559565749_9958_model_7 [0.1, 50, 5] 0.9169205
Grid_GBM_py_17_model_python_1448559565749_9958_model_14 [0.2, 100, 3] 0.9134707
Grid_GBM_py_17_model_python_1448559565749_9958_model_13 [0.2, 50, 3] 0.8833912
Grid_GBM_py_17_model_python_1448559565749_9958_model_5 [0.1, 100, 3] 0.8803587
Grid_GBM_py_17_model_python_1448559565749_9958_model_4 [0.1, 50, 3] 0.8480756
Grid_GBM_py_17_model_python_1448559565749_9958_model_15 [0.2, 5, 5] 0.8447013
Grid_GBM_py_17_model_python_1448559565749_9958_model_6 [0.1, 5, 5] 0.8227407
Grid_GBM_py_17_model_python_1448559565749_9958_model_11 [0.2, 100, 2] 0.8102571
Grid_GBM_py_17_model_python_1448559565749_9958_model_2 [0.1, 100, 2] 0.7821284
Grid_GBM_py_17_model_python_1448559565749_9958_model_10 [0.2, 50, 2] 0.7818671
Grid_GBM_py_17_model_python_1448559565749_9958_model_12 [0.2, 5, 3] 0.7646862
Grid_GBM_py_17_model_python_1448559565749_9958_model_1 [0.1, 50, 2] 0.7548347
Grid_GBM_py_17_model_python_1448559565749_9958_model_3 [0.1, 5, 3] 0.7282790
Grid_GBM_py_17_model_python_1448559565749_9958_model_9 [0.2, 5, 2] 0.6927028
Grid_GBM_py_17_model_python_1448559565749_9958_model_0 [0.1, 5, 2] 0.6761445

The "best" model in terms of validation set AUC is listed first in auc_table.


In [39]:
best_model = h2o.get_model(auc_table['Model Id'][0])
best_model.auc()


Out[39]:
0.9894042107804035

The last thing we may want to do is generate predictions on the test set using the "best" model, and evaluate the test set AUC.


In [40]:
best_perf = best_model.model_performance(test)
best_perf.auc()


Out[40]:
0.9609710824540837

The test set AUC is approximately 0.96. Not bad!!


In [ ]: