Author: Erin LeDell
Contact: erin@h2o.ai
This tutorial is a quick introduction to H2O's Python API. The goal is to introduce H2O's capabilities from Python through a complete example. To help those who are accustomed to scikit-learn and pandas, the demo also calls out specific differences between H2O and those packages; this is intended to help anyone who needs to do machine learning on really big data make the transition. It is not meant to be a tutorial on machine learning or algorithms.
Detailed documentation about H2O and its Python API is available at http://docs.h2o.ai.
This tutorial assumes you have Python 2.7 installed. The h2o Python package has a few dependencies, which can be installed using pip. The required packages (which have their own dependencies) are:
pip install requests
pip install tabulate
pip install scikit-learn
If you have any problems (for example, installing the scikit-learn
package), check out this page for tips.
Once the dependencies are installed, you can install H2O. We will use the latest stable version of the h2o package, which is called "Tibshirani-3." The installation instructions are on the "Install in Python" tab on this page.
# The following command removes the H2O module for Python (if it already exists).
pip uninstall h2o
# Next, use pip to install this version of the H2O Python module.
pip install http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/Python/h2o-3.6.0.3-py2.py3-none-any.whl
In [2]:
import h2o
# Start an H2O Cluster on your local machine
h2o.init()
If you already have an H2O cluster running that you'd like to connect to (for example, in a multi-node Hadoop environment), then you can specify the IP and port of that cluster as follows:
In [ ]:
# This will not actually do anything since it's a fake IP address
# h2o.init(ip="123.45.67.89", port=54321)
The following code downloads a copy of the Wisconsin Diagnostic Breast Cancer dataset.
We can import the data directly into H2O using the Python API.
In [3]:
csv_url = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wisc/wisc-diag-breast-cancer-shuffled.csv"
data = h2o.import_file(csv_url)
In [4]:
data.shape
Out[4]:
Now let's take a look at the top of the frame:
In [5]:
data.head()
Out[5]:
The first two columns contain an ID and the response. The "diagnosis" column is the response. Let's take a look at the column names. The data contains features derived from medical images of the tumors.
In [6]:
data.columns
Out[6]:
To select a subset of the columns to look at, typical Pandas indexing applies:
In [7]:
columns = ["id", "diagnosis", "area_mean"]
data[columns].head()
Out[7]:
Now let's select a single column, for example -- the response column, and look at the data more closely:
In [8]:
data['diagnosis']
Out[8]:
It looks like a binary response, but let's validate that assumption:
In [9]:
data['diagnosis'].unique()
Out[9]:
In [10]:
data['diagnosis'].nlevels()
Out[10]:
We can query the categorical "levels" as well ('B' and 'M' stand for "Benign" and "Malignant" diagnosis):
In [11]:
data['diagnosis'].levels()
Out[11]:
Since "diagnosis" column is the response we would like to predict, we may want to check if there are any missing values, so let's look for NAs. To figure out which, if any, values are missing, we can use the isna
method on the diagnosis column. The columns in an H2O Frame are also H2O Frames themselves, so all the methods that apply to a Frame also apply to a single column.
In [12]:
data.isna()
Out[12]:
In [13]:
data['diagnosis'].isna()
Out[13]:
The isna method doesn't directly answer the question, "Does the diagnosis column contain any NAs?"; rather, it returns a 0 if that cell is not missing (Is NA? FALSE == 0) and a 1 if it is missing (Is NA? TRUE == 1). So if there are no missing values, then summing over the whole column should produce a sum equal to 0. Let's take a look:
In [14]:
data['diagnosis'].isna().sum()
Out[14]:
Great, no missing labels.
Out of curiosity, let's see if there is any missing data in this frame:
In [15]:
data.isna().sum()
Out[15]:
The next thing I may wonder about in a binary classification problem is the distribution of the response in the training data. Is one of the two outcomes under-represented in the training set? Many real datasets have what's called an "imbalance" problem, where one of the classes has far fewer training examples than the other class. Let's take a look at the distribution, both visually and numerically.
In [17]:
# TO DO: Insert a bar chart or something showing the proportion of M to B in the response.
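A minimal sketch of that bar chart is below. It assumes matplotlib and pandas are installed; as_data_frame() is used here to pull the small summary table out of H2O for plotting.
In [ ]:
# Sketch: bar chart of the class counts (assumes matplotlib and pandas are available)
import matplotlib.pyplot as plt

counts = data['diagnosis'].table().as_data_frame(use_pandas=True)  # columns: 'diagnosis', 'Count'
plt.bar(range(len(counts)), counts['Count'])
plt.xticks(range(len(counts)), counts['diagnosis'])
plt.xlabel('diagnosis')
plt.ylabel('count')
plt.show()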
In [16]:
data['diagnosis'].table()
Out[16]:
Ok, the data is not exactly evenly distributed between the two classes -- there are almost twice as many Benign samples as there are Malignant samples. However, this level of imbalance shouldn't be much of an issue for the machine learning algorithms. (We will revisit this in the modeling section below.)
In [18]:
n = data.shape[0] # Total number of training samples
data['diagnosis'].table()['Count']/n
Out[18]:
We will do a quick demo of the H2O software -- trying to predict malignant tumors using various machine learning algorithms.
The response, y, is the 'diagnosis' column, and the predictors, x, are all the columns aside from the first two columns ('id' and 'diagnosis').
In [19]:
y = 'diagnosis'
In [20]:
x = data.columns
del x[0:2]  # remove the first two columns ('id' and 'diagnosis') from the predictor list
x
Out[20]:
In [21]:
train, test = data.split_frame(ratios=[0.75], seed=1)
In [22]:
train.shape
Out[22]:
In [23]:
test.shape
Out[23]:
In [24]:
# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator
We first create a model object of class "H2OGradientBoostingEstimator". This does not actually do any training; it just sets the model up for training by specifying model parameters.
In [29]:
model = H2OGradientBoostingEstimator(distribution='bernoulli',
                                     ntrees=100,
                                     max_depth=4,
                                     learn_rate=0.1)
The model object, like all H2O estimator objects, has a train method, which will actually perform model training. At this step we specify the training set and (optionally) a validation set, along with the response and predictor variables.
In [30]:
model.train(x=x, y=y, training_frame=train, validation_frame=test)
The type of results shown when you print a model is determined by how the data was supplied for training: a training_frame only, a training_frame and a validation_frame, or a training_frame and nfolds.
Below, we see a GBM Model Summary, as well as training and validation metrics, since we supplied a validation_frame. Since this is a binary classification task, we are shown the relevant performance metrics, which include: MSE, R^2, LogLoss, AUC and Gini. We are also shown a Confusion Matrix, where the threshold for classification is chosen automatically (by H2O) as the threshold which maximizes the F1 score.
The scoring history is also printed, which shows the performance metrics over some increment such as "number of trees" in the case of GBM and RF.
Lastly, for tree-based methods (GBM and RF), we also print variable importance.
In [32]:
print(model)
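If you only want a specific piece of those results rather than the full printout, the trained model also exposes accessor methods. The sketch below assumes these accessors behave as in recent H2O releases; valid=True requests the metric computed on the validation_frame.
In [ ]:
# Sketch: pull individual results out of the trained model instead of printing everything
print(model.auc(train=True))               # AUC on the training data
print(model.auc(valid=True))               # AUC on the validation data
print(model.logloss(valid=True))           # LogLoss on the validation data
print(model.confusion_matrix(valid=True))  # confusion matrix at the max-F1 threshold
print(model.scoring_history())             # metrics recorded as trees were added
print(model.varimp())                      # variable importances (tree-based models)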
Once a model has been trained, you can also use it to make predictions on a test set. In the case above, we passed the test set as the validation_frame
in training, so we have technically already created test set predictions and performance.
However, when performing model selection over a variety of model parameters, it is common for users to break their dataset into three pieces: Training, Validation and Test.
After training a variety of models using different parameters (and evaluating them on a validation set), the user may choose a single model and then evaluate model performance on a separate test set. This is when the model_performance
method, shown below, is most useful.
In [132]:
perf = model.model_performance(test)
perf.auc()
Out[132]:
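A minimal sketch of that three-piece workflow is below, reusing split_frame with two ratios; the 0.70/0.15 split ratios are arbitrary choices for illustration.
In [ ]:
# Sketch: train/validation/test workflow (ratios are arbitrary illustrative values)
train2, valid2, test2 = data.split_frame(ratios=[0.70, 0.15], seed=1)

model2 = H2OGradientBoostingEstimator(distribution='bernoulli',
                                      ntrees=100,
                                      max_depth=4,
                                      learn_rate=0.1)
model2.train(x=x, y=y, training_frame=train2, validation_frame=valid2)

# After choosing a model on the validation set, evaluate it once on the held-out test set
final_perf = model2.model_performance(test2)
print(final_perf.auc())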
To perform k-fold cross-validation, you use the same code as above, but you specify nfolds
as an integer greater than 1, or add a "fold_column" to your H2O Frame which indicates a fold ID for each row.
Unless you have a specific reason to manually assign the observations to folds, you will find it easiest to simply use the nfolds
argument.
When performing cross-validation, you can still pass a validation_frame
, but you can also choose to use the original dataset that contains all the rows. We will cross-validate a model below using the original H2O Frame which we call data
.
In [35]:
cvmodel = H2OGradientBoostingEstimator(distribution='bernoulli',
                                       ntrees=100,
                                       max_depth=4,
                                       learn_rate=0.1,
                                       nfolds=5)
cvmodel.train(x=x, y=y, training_frame=data)
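The fold_column alternative mentioned above would look roughly like the sketch below; the column name 'fold' is an arbitrary choice, and kfold_column is used here to assign each row a fold ID.
In [ ]:
# Sketch: manual fold assignment via a fold column (the column name 'fold' is arbitrary)
data['fold'] = data.kfold_column(n_folds=5, seed=1)

cvmodel2 = H2OGradientBoostingEstimator(distribution='bernoulli',
                                        ntrees=100,
                                        max_depth=4,
                                        learn_rate=0.1)
cvmodel2.train(x=x, y=y, training_frame=data, fold_column='fold')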
One way of evaluating models with different parameters is to perform a grid search over a set of parameter values. For example, in GBM, here are three model parameters that may be useful to search over:
- ntrees: Number of trees
- max_depth: Maximum depth of a tree
- learn_rate: Learning rate in the GBM
We will define a grid as follows:
In [37]:
ntrees_opt = [5,50,100]
max_depth_opt = [2,3,5]
learn_rate_opt = [0.1,0.2]
hyper_params = {'ntrees': ntrees_opt,
                'max_depth': max_depth_opt,
                'learn_rate': learn_rate_opt}
Define an "H2OGridSearch"
object by specifying the algorithm (GBM) and the hyper parameters:
In [39]:
from h2o.grid.grid_search import H2OGridSearch
gs = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params = hyper_params)
An "H2OGridSearch"
object also has a train
method, which is used to train all the models in the grid.
In [40]:
gs.train(x=x, y=y, training_frame=train, validation_frame=test)
In [42]:
print(gs)
In [43]:
# Print out the AUC for all of the models
for g in gs:
    print(g.model_id + " auc: " + str(g.auc()))
In [ ]:
#TO DO: Compare grid search models
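One simple way to compare the grid's models is to rank them by validation AUC, best first; a sketch reusing the gs object trained above is shown below.
In [ ]:
# Sketch: rank the grid's models by validation AUC, best first
auc_by_model = [(g.model_id, g.auc(valid=True)) for g in gs]
for model_id, auc in sorted(auc_by_model, key=lambda pair: pair[1], reverse=True):
    print("%s  validation AUC: %.4f" % (model_id, auc))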