H2O Tutorial: Breast Cancer Classification

Author: Erin LeDell

Contact: erin@h2o.ai

This tutorial steps through a quick introduction to H2O's Python API. The goal is to introduce H2O's capabilities from Python through a complete example. To help those accustomed to Scikit-learn and Pandas, the demo also includes specific call-outs about where H2O differs from those packages; this is intended to ease the transition for anyone who needs to do machine learning on truly big data. It is not meant to be a tutorial on machine learning or algorithms.

Detailed documentation about H2O and its Python API is available at http://docs.h2o.ai.

Install H2O in Python

Prerequisites

This tutorial assumes you have Python 2.7 installed. The h2o Python package has a few dependencies, which can be installed using pip. The required packages (each of which has its own dependencies) are:

pip install requests
pip install tabulate
pip install scikit-learn

If you have any problems (for example, installing the scikit-learn package), check out this page for tips.

Install h2o

Once the dependencies are installed, you can install H2O. We will use the latest stable version of the h2o package, which is called "Tibshirani-3." The installation instructions are on the "Install in Python" tab on this page.

# The following command removes the H2O module for Python (if it already exists).
pip uninstall h2o

# Next, use pip to install this version of the H2O Python module.
pip install http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/Python/h2o-3.6.0.3-py2.py3-none-any.whl

Start up an H2O cluster

In a Python terminal, we can import the h2o package and start up an H2O cluster.


In [2]:
import h2o

# Start an H2O Cluster on your local machine
h2o.init()



No instance found at ip and port: localhost:54321. Trying to start local jar...


JVM stdout: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpA5iLxS/h2o_me_started_from_python.out
JVM stderr: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmptfhX9Q/h2o_me_started_from_python.err
Using ice_root: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpViw3QS


Java Version: java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)


Starting H2O JVM and connecting: ........... Connection successful!
H2O cluster uptime: 1 seconds 30 milliseconds
H2O cluster version: 3.6.0.3
H2O cluster name: H2O_started_from_python
H2O cluster total nodes: 1
H2O cluster total memory: 3.56 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

If you already have an H2O cluster running that you'd like to connect to (for example, in a multi-node Hadoop environment), then you can specify the IP and port of that cluster as follows:


In [ ]:
# This will not actually do anything since it's a fake IP address
# h2o.init(ip="123.45.67.89", port=54321)

Download Data

The following code downloads a copy of the Wisconsin Diagnostic Breast Cancer dataset.

We can import the data directly into H2O using the Python API.


In [3]:
csv_url = "http://www.stat.berkeley.edu/~ledell/data/wisc-diag-breast-cancer-shuffled.csv"
data = h2o.import_file(csv_url)


Parse Progress: [##################################################] 100%

Explore Data

Once we have loaded the data, let's take a quick look. First the dimension of the frame:


In [4]:
data.shape


Out[4]:
(569, 32)

Now let's take a look at the top of the frame:


In [5]:
data.head()


id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave_points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave_points_worst symmetry_worst fractal_dimension_worst
8.71002e+08B 8.219 20.7 53.27 203.9 0.09405 0.1305 0.1321 0.02168 0.2222 0.08261 0.1935 1.962 1.243 10.21 0.01243 0.05416 0.07753 0.01022 0.02309 0.01178 9.092 29.72 58.08 249.8 0.163 0.431 0.5381 0.07879 0.3322 0.1486
8.81053e+06B 11.84 18.94 75.51 428 0.08871 0.069 0.02669 0.01393 0.1533 0.06057 0.2222 0.8652 1.444 17.12 0.005517 0.01727 0.02045 0.006747 0.01616 0.002922 13.3 24.99 85.22 546.3 0.128 0.188 0.1471 0.06913 0.2535 0.07993
8.95115e+07B 12.2 15.21 78.01 457.9 0.08673 0.06545 0.01994 0.01692 0.1638 0.06129 0.2575 0.8073 1.959 19.01 0.005403 0.01418 0.01051 0.005142 0.01333 0.002065 13.75 21.38 91.11 583.1 0.1256 0.1928 0.1167 0.05556 0.2661 0.07961
9.15946e+07M 15.05 19.07 97.26 701.9 0.09215 0.08597 0.07486 0.04335 0.1561 0.05915 0.386 1.198 2.63 38.49 0.004952 0.0163 0.02967 0.009423 0.01152 0.001718 17.58 28.06 113.8 967 0.1246 0.2101 0.2866 0.112 0.2282 0.06954
864292 B 10.51 20.19 68.64 334.2 0.1122 0.1303 0.06476 0.03068 0.1922 0.07782 0.3336 1.86 2.041 19.91 0.01188 0.03747 0.04591 0.01544 0.02287 0.006792 11.16 22.75 72.62 374.4 0.13 0.2049 0.1295 0.06136 0.2383 0.09026
9.1544e+07 B 12.22 20.04 79.47 453.1 0.1096 0.1152 0.08175 0.02166 0.2124 0.06894 0.1811 0.7959 0.9857 12.58 0.006272 0.02198 0.03966 0.009894 0.0132 0.003813 13.16 24.17 85.13 515.3 0.1402 0.2315 0.3535 0.08088 0.2709 0.08839
9.19039e+07B 11.67 20.02 75.21 416.2 0.1016 0.09453 0.042 0.02157 0.1859 0.06461 0.2067 0.8745 1.393 15.34 0.005251 0.01727 0.0184 0.005298 0.01449 0.002671 13.35 28.81 87 550.6 0.155 0.2964 0.2758 0.0812 0.3206 0.0895
9.01257e+06B 15.19 13.21 97.65 711.8 0.07963 0.06934 0.03393 0.02657 0.1721 0.05544 0.1783 0.4125 1.338 17.72 0.005012 0.01485 0.01551 0.009155 0.01647 0.001767 16.2 15.73 104.5 819.1 0.1126 0.1737 0.1362 0.08178 0.2487 0.06766
899987 M 25.73 17.46 174.2 2010 0.1149 0.2363 0.3368 0.1913 0.1956 0.06121 0.9948 0.8509 7.222 153.1 0.006369 0.04243 0.04266 0.01508 0.02335 0.003385 33.13 23.58 229.3 3234 0.153 0.5937 0.6451 0.2756 0.369 0.08815
854039 M 16.13 17.88 107 807.2 0.104 0.1559 0.1354 0.07752 0.1998 0.06515 0.334 0.6857 2.183 35.03 0.004185 0.02868 0.02664 0.009067 0.01703 0.003817 20.21 27.26 132.7 1261 0.1446 0.5804 0.5274 0.1864 0.427 0.1233
Out[5]:

The first two columns contain an ID and the response; the response is the "diagnosis" column. Let's take a look at the column names. The remaining columns are features derived from medical images of the tumors.


In [6]:
data.columns


Out[6]:
[u'id',
 u'diagnosis',
 u'radius_mean',
 u'texture_mean',
 u'perimeter_mean',
 u'area_mean',
 u'smoothness_mean',
 u'compactness_mean',
 u'concavity_mean',
 u'concave_points_mean',
 u'symmetry_mean',
 u'fractal_dimension_mean',
 u'radius_se',
 u'texture_se',
 u'perimeter_se',
 u'area_se',
 u'smoothness_se',
 u'compactness_se',
 u'concavity_se',
 u'concave_points_se',
 u'symmetry_se',
 u'fractal_dimension_se',
 u'radius_worst',
 u'texture_worst',
 u'perimeter_worst',
 u'area_worst',
 u'smoothness_worst',
 u'compactness_worst',
 u'concavity_worst',
 u'concave_points_worst',
 u'symmetry_worst',
 u'fractal_dimension_worst']

To select a subset of the columns to look at, typical Pandas indexing applies:


In [7]:
columns = ["id", "diagnosis", "area_mean"]
data[columns].head()


id           diagnosis  area_mean
8.71002e+08  B          203.9
8.81053e+06  B          428
8.95115e+07  B          457.9
9.15946e+07  M          701.9
864292       B          334.2
9.1544e+07   B          453.1
9.19039e+07  B          416.2
9.01257e+06  B          711.8
899987       M          2010
854039       M          807.2
Out[7]:

Now let's select a single column -- for example, the response column -- and look at the data more closely:


In [8]:
data['diagnosis']


diagnosis
B
B
B
M
B
B
B
B
M
M
Out[8]:

It looks like a binary response, but let's validate that assumption:


In [9]:
data['diagnosis'].unique()


C1
B
M
Out[9]:


In [10]:
data['diagnosis'].nlevels()


Out[10]:
[2]

We can query the categorical "levels" as well ('B' and 'M' stand for "Benign" and "Malignant" diagnosis):


In [11]:
data['diagnosis'].levels()


Out[11]:
[['B', 'M']]

Since the "diagnosis" column is the response we would like to predict, we may want to check whether it has any missing values, so let's look for NAs. To figure out which, if any, values are missing, we can use the isna method on the diagnosis column. The columns of an H2O Frame are H2O Frames themselves, so all the methods that apply to a Frame also apply to a single column.


In [12]:
data.isna()


C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 C21 C22 C23 C24 C25 C26 C27 C28 C29 C30 C31 C32
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Out[12]:


In [13]:
data['diagnosis'].isna()


C1
0
0
0
0
0
0
0
0
0
0
Out[13]:

The isna method doesn't directly answer the question, "Does the diagnosis column contain any NAs?"; rather, it returns a 0 if a cell is not missing (Is NA? FALSE == 0) and a 1 if it is missing (Is NA? TRUE == 1). So if there are no missing values, summing over the whole column should produce a sum equal to 0.0. Let's take a look:


In [14]:
data['diagnosis'].isna().sum()


Out[14]:
0.0

Great, no missing labels.
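For readers coming from Pandas: the analogous check there is isnull(), which relies on the same 0/1 coding trick. A quick comparison sketch (the Series below is made up for illustration):

```python
import pandas as pd

# A toy Series with one missing value (illustrative only)
s = pd.Series(["B", "M", None, "B"])

# isnull() returns a boolean mask; summing it counts the missing cells,
# just as summing the output of H2O's isna() does.
n_missing = s.isnull().sum()
print(n_missing)  # 1
```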

Out of curiosity, let's see if there is any missing data in this frame:


In [15]:
data.isna().sum()


Out[15]:
0.0

The next thing I may wonder about in a binary classification problem is the distribution of the response in the training data. Is one of the two outcomes under-represented in the training set? Many real datasets have what's called a class "imbalance" problem, where one class has far fewer training examples than the other. Let's take a look at the distribution, both visually and numerically.


In [17]:
# TO DO: Insert a bar chart or something showing the proportion of M to B in the response.

In [16]:
data['diagnosis'].table()


diagnosis Count
B 357
M 212
Out[16]:
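As a stop-gap for the bar chart mentioned above, the counts from table() can be rendered as a quick text-based bar chart in pure Python (counts hardcoded from the output above):

```python
# Class counts taken from data['diagnosis'].table() above
counts = {"B": 357, "M": 212}

# Scale the longest bar to 50 characters
scale = 50.0 / max(counts.values())
for label in sorted(counts):
    bar = "#" * int(round(counts[label] * scale))
    print("%s %4d %s" % (label, counts[label], bar))
```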

Ok, the data is not exactly evenly distributed between the two classes -- there are almost twice as many Benign samples as Malignant samples. However, this level of imbalance shouldn't be much of an issue for the machine learning algorithms. (We will revisit this in the modeling section below.)


In [18]:
n = data.shape[0]  # Total number of training samples
data['diagnosis'].table()['Count']/n


Count
0.627417
0.372583
Out[18]:

Machine Learning in H2O

We will do a quick demo of the H2O software -- trying to predict malignant tumors using various machine learning algorithms.

Specify the predictor set and response

The response, y, is the 'diagnosis' column, and the predictors, x, are all the columns aside from the first two columns ('id' and 'diagnosis').


In [19]:
y = 'diagnosis'

In [20]:
x = data.columns
del x[0:2]
x


Out[20]:
[u'radius_mean',
 u'texture_mean',
 u'perimeter_mean',
 u'area_mean',
 u'smoothness_mean',
 u'compactness_mean',
 u'concavity_mean',
 u'concave_points_mean',
 u'symmetry_mean',
 u'fractal_dimension_mean',
 u'radius_se',
 u'texture_se',
 u'perimeter_se',
 u'area_se',
 u'smoothness_se',
 u'compactness_se',
 u'concavity_se',
 u'concave_points_se',
 u'symmetry_se',
 u'fractal_dimension_se',
 u'radius_worst',
 u'texture_worst',
 u'perimeter_worst',
 u'area_worst',
 u'smoothness_worst',
 u'compactness_worst',
 u'concavity_worst',
 u'concave_points_worst',
 u'symmetry_worst',
 u'fractal_dimension_worst']

Split H2O Frame into a train and test set


In [21]:
train, test = data.split_frame(ratios=[0.75], seed=1)
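split_frame assigns each row to a split by drawing a random uniform per row, so the 75/25 split is approximate rather than exact. Conceptually (a NumPy sketch, not H2O's actual implementation):

```python
import numpy as np

rng = np.random.RandomState(1)   # plays the role of seed=1
mask = rng.rand(569) < 0.75      # one uniform draw per row

n_train = int(mask.sum())        # rows whose draw fell below 0.75
n_test = 569 - n_train
print(n_train, n_test)           # close to, but not exactly, 3:1
```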

In [22]:
train.shape


Out[22]:
(428, 32)

In [23]:
test.shape


Out[23]:
(141, 32)

Train and Test a GBM model


In [24]:
# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

We first create a model object of class "H2OGradientBoostingEstimator". This does not actually do any training; it just sets the model up for training by specifying the model parameters.


In [29]:
model = H2OGradientBoostingEstimator(distribution='bernoulli',
                                    ntrees=100,
                                    max_depth=4,
                                    learn_rate=0.1)

The model object, like all H2O estimator objects, has a train method, which will actually perform model training. At this step we specify the training and (optionally) a validation set, along with the response and predictor variables.


In [30]:
model.train(x=x, y=y, training_frame=train, validation_frame=test)


gbm Model Build Progress: [##################################################] 100%

Inspect Model

The results shown when you print a model are determined by the following:

  • Model class of the estimator (e.g. GBM, RF, GLM, DL)
  • The type of machine learning problem (e.g. binary classification, multiclass classification, regression)
  • The data you specify (e.g. training_frame only, training_frame and validation_frame, or training_frame and nfolds)

Below, we see a GBM Model Summary, as well as training and validation metrics (since we supplied a validation_frame). Since this is a binary classification task, we are shown the relevant performance metrics, which include: MSE, R^2, LogLoss, AUC and Gini. We are also shown a Confusion Matrix, where the classification threshold is chosen automatically by H2O as the threshold which maximizes the F1 score.
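Of those metrics, LogLoss is worth unpacking: it is the average negative log-likelihood of the predicted class probabilities, so confident wrong predictions are penalized heavily. A minimal pure-Python sketch of the formula (not H2O's implementation):

```python
import math

def log_loss(y_true, p_pred):
    """Binomial log loss: mean of -[y*log(p) + (1-y)*log(1-p)]."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / float(n)

# Mostly-confident, correct predictions give a small loss
print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))  # ~0.1446
```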

The scoring history is also printed, which shows the performance metrics over some increment such as "number of trees" in the case of GBM and RF.

Lastly, for tree-based methods (GBM and RF), we also print variable importance.


In [32]:
print(model)


Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1448480209718_6

Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
100.0 18324.0 4.0 4.0 4.0 8.0 14.0 10.31

ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 1.55261137469e-06
R^2: 0.999993333015
LogLoss: 0.000519099361538
AUC: 1.0
Gini: 1.0

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.989733166545:
B M Error Rate
B 270.0 0.0 0.0 (0.0/270.0)
M 0.0 158.0 0.0 (0.0/158.0)
Total 270.0 158.0 0.0 (0.0/428.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 1.0 1.0 143.0
max f2 1.0 1.0 143.0
max f0point5 1.0 1.0 143.0
max accuracy 1.0 1.0 143.0
max precision 1.0 1.0 0.0
max absolute_MCC 1.0 1.0 143.0
max min_per_class_accuracy 1.0 1.0 143.0
ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.0507094587533
R^2: 0.78540767359
LogLoss: 0.247694592147
AUC: 0.970200085143
Gini: 0.940400170285

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.409828576406:
B M Error Rate
B 83.0 4.0 0.046 (4.0/87.0)
M 4.0 50.0 0.0741 (4.0/54.0)
Total 87.0 54.0 0.0567 (8.0/141.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.4 0.9 51.0
max f2 0.0 0.9 60.0
max f0point5 0.7 1.0 43.0
max accuracy 0.7 0.9 43.0
max precision 1.0 1.0 0.0
max absolute_MCC 0.7 0.9 43.0
max min_per_class_accuracy 0.4 0.9 51.0
Scoring History:
timestamp duration number_of_trees training_MSE training_logloss training_AUC training_classification_error validation_MSE validation_logloss validation_AUC validation_classification_error
2015-11-25 11:42:58 0.006 sec 1.0 0.2 0.6 1.0 0.0 0.2 0.6 1.0 0.1
2015-11-25 11:42:58 0.010 sec 2.0 0.2 0.5 1.0 0.0 0.2 0.5 1.0 0.1
2015-11-25 11:42:58 0.013 sec 3.0 0.1 0.4 1.0 0.0 0.2 0.5 1.0 0.1
2015-11-25 11:42:58 0.017 sec 4.0 0.1 0.4 1.0 0.0 0.1 0.4 1.0 0.1
2015-11-25 11:42:58 0.021 sec 5.0 0.1 0.4 1.0 0.0 0.1 0.4 1.0 0.1
--- --- --- --- --- --- --- --- --- --- --- ---
2015-11-25 11:42:59 0.566 sec 96.0 0.0 0.0 1.0 0.0 0.1 0.2 1.0 0.1
2015-11-25 11:42:59 0.572 sec 97.0 0.0 0.0 1.0 0.0 0.1 0.2 1.0 0.1
2015-11-25 11:42:59 0.579 sec 98.0 0.0 0.0 1.0 0.0 0.1 0.2 1.0 0.1
2015-11-25 11:42:59 0.585 sec 99.0 0.0 0.0 1.0 0.0 0.1 0.2 1.0 0.1
2015-11-25 11:42:59 0.592 sec 100.0 0.0 0.0 1.0 0.0 0.1 0.2 1.0 0.1
Variable Importances:
variable relative_importance scaled_importance percentage
radius_worst 177.5 1.0 0.3
perimeter_worst 102.7 0.6 0.2
concave_points_worst 94.2 0.5 0.2
concave_points_mean 88.6 0.5 0.2
concavity_mean 9.3 0.1 0.0
--- --- --- ---
compactness_mean 0.0 0.0 0.0
radius_se 0.0 0.0 0.0
smoothness_mean 0.0 0.0 0.0
fractal_dimension_mean 0.0 0.0 0.0
symmetry_mean 0.0 0.0 0.0

Model Performance on a Test Set

Once a model has been trained, you can also use it to make predictions on a test set. In the case above, we passed the test set as the validation_frame in training, so we have technically already created test set predictions and performance.

However, when performing model selection over a variety of model parameters, it is common for users to break their dataset into three pieces: Training, Validation and Test.

After training a variety of models using different parameters (and evaluating them on a validation set), the user may choose a single model and then evaluate model performance on a separate test set. This is when the model_performance method, shown below, is most useful.


In [131]:
perf = model.model_performance(test)
perf.r2()


Out[131]:
0.8270440872454804

In [132]:
perf.auc()


Out[132]:
0.9814814814814814

Cross-validated Performance

To perform k-fold cross-validation, you use the same code as above, but you specify nfolds as an integer greater than 1, or add a "fold_column" to your H2O Frame which indicates a fold ID for each row.

Unless you have a specific reason to manually assign the observations to folds, you will find it easiest to simply use the nfolds argument.
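A fold_column is just an integer fold ID per row. If you did build one manually, a simple (if naive) assignment is round-robin by row index -- shown here as a pure-Python sketch; for unshuffled data a random assignment would be preferable:

```python
from collections import Counter

n_rows, k = 569, 5
fold_ids = [i % k for i in range(n_rows)]  # round-robin fold assignment

# Folds end up nearly equal in size: 569 = 5 * 113 + 4
print(Counter(fold_ids))
```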

When performing cross-validation, you can still pass a validation_frame, but you can also choose to use the original dataset that contains all the rows. We will cross-validate a model below using the original H2O Frame which we call data.


In [35]:
cvmodel = H2OGradientBoostingEstimator(distribution='bernoulli',
                                       ntrees=100,
                                       max_depth=4,
                                       learn_rate=0.1,
                                       nfolds=5)

cvmodel.train(x=x, y=y, training_frame=data)


gbm Model Build Progress: [##################################################] 100%

One way of evaluating models with different parameters is to perform a grid search over a set of parameter values. For example, in GBM, here are three model parameters that may be useful to search over:

  • ntrees: Number of trees
  • max_depth: Maximum depth of a tree
  • learn_rate: Learning rate in the GBM

We will define a grid as follows:


In [37]:
ntrees_opt = [5,50,100]
max_depth_opt = [2,3,5]
learn_rate_opt = [0.1,0.2]

hyper_params = {'ntrees': ntrees_opt, 
                'max_depth': max_depth_opt,
                'learn_rate': learn_rate_opt}
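A grid search trains one model per combination of hyperparameter values, so this grid will train 3 x 3 x 2 = 18 models -- easy to sanity-check with itertools:

```python
import itertools

ntrees_opt = [5, 50, 100]
max_depth_opt = [2, 3, 5]
learn_rate_opt = [0.1, 0.2]

# One model will be trained per (ntrees, max_depth, learn_rate) combination
combos = list(itertools.product(ntrees_opt, max_depth_opt, learn_rate_opt))
print(len(combos))  # 18
```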

Define an "H2OGridSearch" object by specifying the algorithm (GBM) and the hyperparameters:


In [39]:
from h2o.grid.grid_search import H2OGridSearch

gs = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params = hyper_params)

An "H2OGridSearch" object also has a train method, which is used to train all the models in the grid.


In [40]:
gs.train(x=x, y=y, training_frame=train, validation_frame=test)


gbm Grid Build Progress: [##################################################] 100%

Compare Models


In [42]:
print(gs)


Grid Search Results for H2OGradientBoostingEstimator:
Model Id Hyperparameters: [learn_rate, ntrees, max_depth] mse
Grid_GBM_py_17_model_python_1448480209718_18_model_14 [0.2, 100, 3] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_17 [0.2, 100, 5] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_16 [0.2, 50, 5] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_8 [0.1, 100, 5] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_11 [0.2, 100, 2] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_13 [0.2, 50, 3] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_5 [0.1, 100, 3] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_7 [0.1, 50, 5] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_2 [0.1, 100, 2] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_10 [0.2, 50, 2] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_4 [0.1, 50, 3] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_1 [0.1, 50, 2] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_15 [0.2, 5, 5] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_12 [0.2, 5, 3] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_9 [0.2, 5, 2] 0.1
Grid_GBM_py_17_model_python_1448480209718_18_model_6 [0.1, 5, 5] 0.1
Grid_GBM_py_17_model_python_1448480209718_18_model_3 [0.1, 5, 3] 0.1
Grid_GBM_py_17_model_python_1448480209718_18_model_0 [0.1, 5, 2] 0.1


In [43]:
# print out the auc for all of the models
for g in gs:
    print(g.model_id + " auc: " + str(g.auc()))


Grid_GBM_py_17_model_python_1448480209718_18_model_0 auc: 0.990963431786
Grid_GBM_py_17_model_python_1448480209718_18_model_13 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_16 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_17 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_2 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_15 auc: 0.998476324426
Grid_GBM_py_17_model_python_1448480209718_18_model_1 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_3 auc: 0.997444913268
Grid_GBM_py_17_model_python_1448480209718_18_model_9 auc: 0.993682606657
Grid_GBM_py_17_model_python_1448480209718_18_model_11 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_7 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_12 auc: 0.998663853727
Grid_GBM_py_17_model_python_1448480209718_18_model_4 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_8 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_10 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_5 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_6 auc: 0.997691045476
Grid_GBM_py_17_model_python_1448480209718_18_model_14 auc: 1.0
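With 18 AUC values to scan, eyeballing the printout is error-prone; it is easier to collect the (model_id, auc) pairs and sort them. The sketch below uses stand-in values -- against a live cluster you would build the list as [(g.model_id, g.auc()) for g in gs]:

```python
# Stand-in (model_id, auc) pairs for illustration
results = [("model_0", 0.990963431786),
           ("model_15", 0.998476324426),
           ("model_13", 1.0)]

# Sort descending by AUC; the best model comes first
ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])  # model_13
```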

In [ ]:
#TO DO: Compare grid search models