H2O Tutorial: Breast Cancer Classification

Author: Erin LeDell

Contact: erin@h2o.ai

This tutorial is a quick introduction to H2O's Python API. Its goal is to introduce H2O's capabilities from Python through a complete example. To help those who are accustomed to scikit-learn and Pandas, the demo includes specific callouts for differences between H2O and those packages; this is intended to help anyone who needs to do machine learning on really big data make the transition. It is not meant to be a tutorial on machine learning or algorithms.

Detailed documentation about H2O and the Python API is available at http://docs.h2o.ai.

Install H2O in Python

Prerequisites

This tutorial assumes you have Python 2.7 installed. The h2o Python package has a few dependencies, which can be installed using pip. The required packages (which also have their own dependencies) are:

pip install requests
pip install tabulate
pip install scikit-learn

If you have any problems (for example, installing the scikit-learn package), check out this page for tips.

Install h2o

Once the dependencies are installed, you can install H2O. We will use the latest stable version of the h2o package, which is called "Tibshirani-3." The installation instructions are on the "Install in Python" tab on this page.

# The following command removes the H2O module for Python (if it already exists).
pip uninstall h2o

# Next, use pip to install this version of the H2O Python module.
pip install http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/Python/h2o-3.6.0.3-py2.py3-none-any.whl

Start up an H2O cluster

In a Python terminal, we can import the h2o package and start up an H2O cluster.


In [2]:
import h2o

# Start an H2O Cluster on your local machine
h2o.init()



No instance found at ip and port: localhost:54321. Trying to start local jar...


JVM stdout: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpA5iLxS/h2o_me_started_from_python.out
JVM stderr: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmptfhX9Q/h2o_me_started_from_python.err
Using ice_root: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpViw3QS


Java Version: java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)


Starting H2O JVM and connecting: ........... Connection successful!
H2O cluster uptime: 1 seconds 30 milliseconds
H2O cluster version: 3.6.0.3
H2O cluster name: H2O_started_from_python
H2O cluster total nodes: 1
H2O cluster total memory: 3.56 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321

If you already have an H2O cluster running that you'd like to connect to (for example, in a multi-node Hadoop environment), then you can specify the IP and port of that cluster as follows:


In [ ]:
# This will not actually do anything since it's a fake IP address
# h2o.init(ip="123.45.67.89", port=54321)

Download Data

The following code downloads a copy of the Wisconsin Diagnostic Breast Cancer dataset.

We can import the data directly into H2O using the Python API.


In [3]:
csv_url = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wisc/wisc-diag-breast-cancer-shuffled.csv"
data = h2o.import_file(csv_url)


Parse Progress: [##################################################] 100%

Explore Data

Once we have loaded the data, let's take a quick look. First the dimension of the frame:


In [4]:
data.shape


Out[4]:
(569, 32)

Now let's take a look at the top of the frame:


In [5]:
data.head()


id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave_points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave_points_worst symmetry_worst fractal_dimension_worst
8.71002e+08 B 8.219 20.7 53.27 203.9 0.09405 0.1305 0.1321 0.02168 0.2222 0.08261 0.1935 1.962 1.243 10.21 0.01243 0.05416 0.07753 0.01022 0.02309 0.01178 9.092 29.72 58.08 249.8 0.163 0.431 0.5381 0.07879 0.3322 0.1486
8.81053e+06 B 11.84 18.94 75.51 428 0.08871 0.069 0.02669 0.01393 0.1533 0.06057 0.2222 0.8652 1.444 17.12 0.005517 0.01727 0.02045 0.006747 0.01616 0.002922 13.3 24.99 85.22 546.3 0.128 0.188 0.1471 0.06913 0.2535 0.07993
8.95115e+07 B 12.2 15.21 78.01 457.9 0.08673 0.06545 0.01994 0.01692 0.1638 0.06129 0.2575 0.8073 1.959 19.01 0.005403 0.01418 0.01051 0.005142 0.01333 0.002065 13.75 21.38 91.11 583.1 0.1256 0.1928 0.1167 0.05556 0.2661 0.07961
9.15946e+07 M 15.05 19.07 97.26 701.9 0.09215 0.08597 0.07486 0.04335 0.1561 0.05915 0.386 1.198 2.63 38.49 0.004952 0.0163 0.02967 0.009423 0.01152 0.001718 17.58 28.06 113.8 967 0.1246 0.2101 0.2866 0.112 0.2282 0.06954
864292 B 10.51 20.19 68.64 334.2 0.1122 0.1303 0.06476 0.03068 0.1922 0.07782 0.3336 1.86 2.041 19.91 0.01188 0.03747 0.04591 0.01544 0.02287 0.006792 11.16 22.75 72.62 374.4 0.13 0.2049 0.1295 0.06136 0.2383 0.09026
9.1544e+07 B 12.22 20.04 79.47 453.1 0.1096 0.1152 0.08175 0.02166 0.2124 0.06894 0.1811 0.7959 0.9857 12.58 0.006272 0.02198 0.03966 0.009894 0.0132 0.003813 13.16 24.17 85.13 515.3 0.1402 0.2315 0.3535 0.08088 0.2709 0.08839
9.19039e+07 B 11.67 20.02 75.21 416.2 0.1016 0.09453 0.042 0.02157 0.1859 0.06461 0.2067 0.8745 1.393 15.34 0.005251 0.01727 0.0184 0.005298 0.01449 0.002671 13.35 28.81 87 550.6 0.155 0.2964 0.2758 0.0812 0.3206 0.0895
9.01257e+06 B 15.19 13.21 97.65 711.8 0.07963 0.06934 0.03393 0.02657 0.1721 0.05544 0.1783 0.4125 1.338 17.72 0.005012 0.01485 0.01551 0.009155 0.01647 0.001767 16.2 15.73 104.5 819.1 0.1126 0.1737 0.1362 0.08178 0.2487 0.06766
899987 M 25.73 17.46 174.2 2010 0.1149 0.2363 0.3368 0.1913 0.1956 0.06121 0.9948 0.8509 7.222 153.1 0.006369 0.04243 0.04266 0.01508 0.02335 0.003385 33.13 23.58 229.3 3234 0.153 0.5937 0.6451 0.2756 0.369 0.08815
854039 M 16.13 17.88 107 807.2 0.104 0.1559 0.1354 0.07752 0.1998 0.06515 0.334 0.6857 2.183 35.03 0.004185 0.02868 0.02664 0.009067 0.01703 0.003817 20.21 27.26 132.7 1261 0.1446 0.5804 0.5274 0.1864 0.427 0.1233
Out[5]:

The first two columns contain an ID and the response; the "diagnosis" column is the response. The remaining columns contain features derived from medical images of the tumors. Let's take a look at the column names.


In [6]:
data.columns


Out[6]:
[u'id',
 u'diagnosis',
 u'radius_mean',
 u'texture_mean',
 u'perimeter_mean',
 u'area_mean',
 u'smoothness_mean',
 u'compactness_mean',
 u'concavity_mean',
 u'concave_points_mean',
 u'symmetry_mean',
 u'fractal_dimension_mean',
 u'radius_se',
 u'texture_se',
 u'perimeter_se',
 u'area_se',
 u'smoothness_se',
 u'compactness_se',
 u'concavity_se',
 u'concave_points_se',
 u'symmetry_se',
 u'fractal_dimension_se',
 u'radius_worst',
 u'texture_worst',
 u'perimeter_worst',
 u'area_worst',
 u'smoothness_worst',
 u'compactness_worst',
 u'concavity_worst',
 u'concave_points_worst',
 u'symmetry_worst',
 u'fractal_dimension_worst']

To select a subset of the columns to look at, typical Pandas indexing applies:


In [7]:
columns = ["id", "diagnosis", "area_mean"]
data[columns].head()


id diagnosis area_mean
8.71002e+08 B 203.9
8.81053e+06 B 428
8.95115e+07 B 457.9
9.15946e+07 M 701.9
864292 B 334.2
9.1544e+07 B 453.1
9.19039e+07 B 416.2
9.01257e+06 B 711.8
899987 M 2010
854039 M 807.2
Out[7]:

Now let's select a single column (for example, the response column) and look at the data more closely:


In [8]:
data['diagnosis']


diagnosis
B
B
B
M
B
B
B
B
M
M
Out[8]:

It looks like a binary response, but let's validate that assumption:


In [9]:
data['diagnosis'].unique()


C1
B
M
Out[9]:


In [10]:
data['diagnosis'].nlevels()


Out[10]:
[2]

We can query the categorical "levels" as well ('B' and 'M' stand for "Benign" and "Malignant" diagnosis):


In [11]:
data['diagnosis'].levels()


Out[11]:
[['B', 'M']]

Since the "diagnosis" column is the response we would like to predict, we should check whether it has any missing values, so let's look for NAs. To figure out which values, if any, are missing, we can use the isna method, shown below first on the whole frame and then on the diagnosis column. The columns in an H2O Frame are also H2O Frames themselves, so all the methods that apply to a Frame also apply to a single column.


In [12]:
data.isna()


C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 C21 C22 C23 C24 C25 C26 C27 C28 C29 C30 C31 C32
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Out[12]:


In [13]:
data['diagnosis'].isna()


C1
0
0
0
0
0
0
0
0
0
0
Out[13]:

The isna method doesn't directly answer the question "Does the diagnosis column contain any NAs?"; rather, it returns a 0 if a cell is not missing (Is NA? FALSE == 0) and a 1 if it is missing (Is NA? TRUE == 1). So if there are no missing values, summing over the whole column should produce 0.0. Let's take a look:


In [14]:
data['diagnosis'].isna().sum()


Out[14]:
0.0

Great, no missing labels.
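
The indicator-and-sum logic used above can be sketched in plain Python, without H2O. The labels list below is made-up illustration data (not the tutorial's dataset), with None standing in for a missing value.

```python
# Hypothetical toy column; None stands in for a missing value (NA).
labels = ["B", "M", None, "B", None]

# One indicator per cell: 1 if the cell is missing, 0 otherwise.
is_missing = [1 if value is None else 0 for value in labels]

# Summing the indicators counts the missing cells.
n_missing = sum(is_missing)

print(is_missing)
print(n_missing)
```

A sum of 0 would mean the column has no missing cells, which is the check performed on the diagnosis column.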

Out of curiosity, let's see if there is any missing data in this frame:


In [15]:
data.isna().sum()


Out[15]:
0.0

The next thing I may wonder about in a binary classification problem is the distribution of the response in the training data. Is one of the two outcomes under-represented in the training set? Many real datasets have what's called an "imbalance" problem, where one of the classes has far fewer training examples than the other class. Let's take a look at the distribution, both visually and numerically.


In [17]:
# TO DO: Insert a bar chart or something showing the proportion of M to B in the response.

In [16]:
data['diagnosis'].table()


diagnosis Count
B 357
M 212
Out[16]:

OK, the data is not evenly distributed between the two classes: there are about 1.7 times as many Benign samples as Malignant samples. However, this level of imbalance shouldn't be much of an issue for the machine learning algorithms. (We will revisit this later in the modeling section below.)


In [18]:
n = data.shape[0]  # Total number of training samples
data['diagnosis'].table()['Count']/n


Count
0.627417
0.372583
Out[18]:
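
As a quick sanity check, the proportions above can be reproduced with plain arithmetic from the two counts in the table() output. This is only a sketch of the calculation; the variable names are chosen for illustration.

```python
# Class counts copied from the table() output above.
counts = {"B": 357, "M": 212}

n = sum(counts.values())  # total number of rows

# float() keeps the division exact under Python 2.7 as well as Python 3.
proportions = dict((label, float(count) / n) for label, count in counts.items())

# Number of benign samples per malignant sample.
ratio = float(counts["B"]) / counts["M"]

print(n)
print(round(proportions["B"], 6))
print(round(proportions["M"], 6))
print(round(ratio, 2))
```

The two proportions round to 0.627417 and 0.372583, matching the H2O output, and the ratio of benign to malignant samples comes out at about 1.68.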

Machine Learning in H2O

We will do a quick demo of the H2O software -- trying to predict malignant tumors using various machine learning algorithms.

Specify the predictor set and response

The response, y, is the 'diagnosis' column, and the predictors, x, are all the columns aside from the first two columns ('id' and 'diagnosis').


In [19]:
y = 'diagnosis'

In [20]:
x = data.columns
del x[0:2]  # drop the first two names: 'id' and 'diagnosis'
x


Out[20]:
[u'radius_mean',
 u'texture_mean',
 u'perimeter_mean',
 u'area_mean',
 u'smoothness_mean',
 u'compactness_mean',
 u'concavity_mean',
 u'concave_points_mean',
 u'symmetry_mean',
 u'fractal_dimension_mean',
 u'radius_se',
 u'texture_se',
 u'perimeter_se',
 u'area_se',
 u'smoothness_se',
 u'compactness_se',
 u'concavity_se',
 u'concave_points_se',
 u'symmetry_se',
 u'fractal_dimension_se',
 u'radius_worst',
 u'texture_worst',
 u'perimeter_worst',
 u'area_worst',
 u'smoothness_worst',
 u'compactness_worst',
 u'concavity_worst',
 u'concave_points_worst',
 u'symmetry_worst',
 u'fractal_dimension_worst']
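
Because Python slices exclude the end index, removing the first two column names requires the slice 0:2; the slice 0:1 removes only the first. The sketch below shows the difference on a shortened, illustrative column list, along with a name-based filter that does not depend on column order.

```python
# Shortened, illustrative column list (the real frame has 32 columns).
columns = ["id", "diagnosis", "radius_mean", "texture_mean"]

# [0:1] covers only index 0, while [0:2] covers indices 0 and 1.
drop_one = columns[:]  # copy so the original list stays intact
del drop_one[0:1]

drop_two = columns[:]
del drop_two[0:2]

print(drop_one)
print(drop_two)

# Alternative that does not depend on column positions: filter by name.
y = "diagnosis"
x = [name for name in columns if name not in ("id", y)]
print(x)
```

With the 0:1 slice the response name is still in the list, so it is worth printing x and confirming that neither 'id' nor 'diagnosis' remains before training.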

Split H2O Frame into a train and test set


In [21]:
train, test = data.split_frame(ratios=[0.75], seed=1)

In [22]:
train.shape


Out[22]:
(428, 32)

In [23]:
test.shape


Out[23]:
(141, 32)
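
The sizes above (428 and 141 rows) are close to, but not exactly, 75% and 25% of 569. My understanding from the H2O documentation is that split_frame assigns rows probabilistically rather than cutting exact fractions. The sketch below illustrates that idea with Python's random module; probabilistic_split is a hypothetical helper, not H2O's implementation, and its seed will not reproduce H2O's split.

```python
import random


def probabilistic_split(n_rows, ratio, seed):
    """Send each row index to train with probability `ratio`, else to test."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx


train_idx, test_idx = probabilistic_split(n_rows=569, ratio=0.75, seed=1)

# The sizes land near 75% / 25% but are generally not exact.
print(len(train_idx))
print(len(test_idx))
```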

Train and Test a GBM model


In [24]:
# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

We first create a model object of class "H2OGradientBoostingEstimator". This does not actually do any training; it just sets the model up for training by specifying model parameters.


In [29]:
model = H2OGradientBoostingEstimator(distribution='bernoulli',
                                    ntrees=100,
                                    max_depth=4,
                                    learn_rate=0.1)

The model object, like all H2O estimator objects, has a train method, which will actually perform model training. At this step we specify the training set and (optionally) a validation set, along with the response and predictor variables.


In [30]:
model.train(x=x, y=y, training_frame=train, validation_frame=test)


gbm Model Build Progress: [##################################################] 100%

Inspect Model

The type of results shown when you print a model is determined by the following:

  • Model class of the estimator (e.g. GBM, RF, GLM, DL)
  • The type of machine learning problem (e.g. binary classification, multiclass classification, regression)
  • The data you specify (e.g. training_frame only, training_frame and validation_frame, or training_frame and nfolds)

Below, we see a GBM Model Summary, as well as training and validation metrics since we supplied a validation_frame. Since this is a binary classification task, we are shown the relevant performance metrics, which include: MSE, R^2, LogLoss, AUC and Gini. Also, we are shown a Confusion Matrix, where the threshold for classification is chosen automatically (by H2O) as the threshold which maximizes the F1 score.

The scoring history is also printed, which shows the performance metrics over some increment such as "number of trees" in the case of GBM and RF.

Lastly, for tree-based methods (GBM and RF), we also print variable importance.
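
To make the "threshold which maximizes the F1 score" idea concrete, here is a minimal plain-Python sketch. The labels and scores are made up, f1_at_threshold is a hypothetical helper, and H2O's own threshold search may differ in its details.

```python
def f1_at_threshold(labels, scores, threshold):
    """F1 when every score >= threshold is predicted positive (label 1)."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = float(tp) / (tp + fp)
    recall = float(tp) / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (1 = malignant) and predicted probabilities.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.60, 0.80, 0.95]

# Try each distinct score as a candidate threshold and keep the best one.
candidates = sorted(set(scores))
best_threshold = max(candidates, key=lambda t: f1_at_threshold(labels, scores, t))
best_f1 = f1_at_threshold(labels, scores, best_threshold)

print(best_threshold)
print(round(best_f1, 4))
```

On this toy data the best threshold is 0.40, where all three positives are caught at the cost of one false positive.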


In [32]:
print(model)


Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1448480209718_6

Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
100.0 18324.0 4.0 4.0 4.0 8.0 14.0 10.31

ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 1.55261137469e-06
R^2: 0.999993333015
LogLoss: 0.000519099361538
AUC: 1.0
Gini: 1.0

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.989733166545:
B M Error Rate
B 270.0 0.0 0.0 (0.0/270.0)
M 0.0 158.0 0.0 (0.0/158.0)
Total 270.0 158.0 0.0 (0.0/428.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 1.0 1.0 143.0
max f2 1.0 1.0 143.0
max f0point5 1.0 1.0 143.0
max accuracy 1.0 1.0 143.0
max precision 1.0 1.0 0.0
max absolute_MCC 1.0 1.0 143.0
max min_per_class_accuracy 1.0 1.0 143.0
ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.0507094587533
R^2: 0.78540767359
LogLoss: 0.247694592147
AUC: 0.970200085143
Gini: 0.940400170285

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.409828576406:
B M Error Rate
B 83.0 4.0 0.046 (4.0/87.0)
M 4.0 50.0 0.0741 (4.0/54.0)
Total 87.0 54.0 0.0567 (8.0/141.0)
Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx
max f1 0.4 0.9 51.0
max f2 0.0 0.9 60.0
max f0point5 0.7 1.0 43.0
max accuracy 0.7 0.9 43.0
max precision 1.0 1.0 0.0
max absolute_MCC 0.7 0.9 43.0
max min_per_class_accuracy 0.4 0.9 51.0
Scoring History:
timestamp duration number_of_trees training_MSE training_logloss training_AUC training_classification_error validation_MSE validation_logloss validation_AUC validation_classification_error
2015-11-25 11:42:58 0.006 sec 1.0 0.2 0.6 1.0 0.0 0.2 0.6 1.0 0.1
2015-11-25 11:42:58 0.010 sec 2.0 0.2 0.5 1.0 0.0 0.2 0.5 1.0 0.1
2015-11-25 11:42:58 0.013 sec 3.0 0.1 0.4 1.0 0.0 0.2 0.5 1.0 0.1
2015-11-25 11:42:58 0.017 sec 4.0 0.1 0.4 1.0 0.0 0.1 0.4 1.0 0.1
2015-11-25 11:42:58 0.021 sec 5.0 0.1 0.4 1.0 0.0 0.1 0.4 1.0 0.1
--- --- --- --- --- --- --- --- --- --- --- ---
2015-11-25 11:42:59 0.566 sec 96.0 0.0 0.0 1.0 0.0 0.1 0.2 1.0 0.1
2015-11-25 11:42:59 0.572 sec 97.0 0.0 0.0 1.0 0.0 0.1 0.2 1.0 0.1
2015-11-25 11:42:59 0.579 sec 98.0 0.0 0.0 1.0 0.0 0.1 0.2 1.0 0.1
2015-11-25 11:42:59 0.585 sec 99.0 0.0 0.0 1.0 0.0 0.1 0.2 1.0 0.1
2015-11-25 11:42:59 0.592 sec 100.0 0.0 0.0 1.0 0.0 0.1 0.2 1.0 0.1
Variable Importances:
variable relative_importance scaled_importance percentage
radius_worst 177.5 1.0 0.3
perimeter_worst 102.7 0.6 0.2
concave_points_worst 94.2 0.5 0.2
concave_points_mean 88.6 0.5 0.2
concavity_mean 9.3 0.1 0.0
--- --- --- ---
compactness_mean 0.0 0.0 0.0
radius_se 0.0 0.0 0.0
smoothness_mean 0.0 0.0 0.0
fractal_dimension_mean 0.0 0.0 0.0
symmetry_mean 0.0 0.0 0.0
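
Several of the validation figures printed above can be checked by hand from the confusion matrix. The sketch below recomputes the error rates and the F1 score, and uses the standard relation Gini = 2 * AUC - 1. Treating M as the positive class is an assumption made for this sketch.

```python
# Validation confusion matrix from the output above (rows = actual class).
b_as_b, b_as_m = 83, 4  # actual B predicted as B / as M
m_as_b, m_as_m = 4, 50  # actual M predicted as B / as M

total = b_as_b + b_as_m + m_as_b + m_as_m
error_b = float(b_as_m) / (b_as_b + b_as_m)
error_m = float(m_as_b) / (m_as_b + m_as_m)
error_all = float(b_as_m + m_as_b) / total

# Precision, recall and F1 with M as the positive class (assumption).
precision = float(m_as_m) / (m_as_m + b_as_m)
recall = float(m_as_m) / (m_as_m + m_as_b)
f1 = 2 * precision * recall / (precision + recall)

# Gini coefficient derived from the validation AUC printed above.
auc = 0.970200085143
gini = 2 * auc - 1

print(round(error_b, 4))
print(round(error_m, 4))
print(round(error_all, 4))
print(round(f1, 4))
print(round(gini, 6))
```

The error rates round to 0.046, 0.0741 and 0.0567, matching the Error column, and the Gini value agrees with the printed 0.9404 to the precision shown.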

Model Performance on a Test Set

Once a model has been trained, you can also use it to make predictions on a test set. In the case above, we passed the test set as the validation_frame in training, so we have technically already created test set predictions and performance.

However, when performing model selection over a variety of model parameters, it is common for users to break their dataset into three pieces: Training, Validation and Test.

After training a variety of models using different parameters (and evaluating them on a validation set), the user may choose a single model and then evaluate model performance on a separate test set. This is when the model_performance method, shown below, is most useful.


In [132]:
perf = model.model_performance(test)
perf.auc()


Out[132]:
0.9814814814814814
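
For readers less familiar with AUC: it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up data. It illustrates what the number means, not how H2O computes it, and pairwise_auc is a hypothetical helper.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = malignant) and predicted probabilities.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.60, 0.80, 0.95]

auc = pairwise_auc(labels, scores)
print(round(auc, 4))
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9.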

Cross-validated Performance

To perform k-fold cross-validation, you use the same code as above, but you specify nfolds as an integer greater than 1, or add a "fold_column" to your H2O Frame which indicates a fold ID for each row.

Unless you have a specific reason to manually assign the observations to folds, you will find it easiest to simply use the nfolds argument.

When performing cross-validation, you can still pass a validation_frame, but you can also choose to use the original dataset that contains all the rows. We will cross-validate a model below using the original H2O Frame which we call data.


In [35]:
cvmodel = H2OGradientBoostingEstimator(distribution='bernoulli',
                                       ntrees=100,
                                       max_depth=4,
                                       learn_rate=0.1,
                                       nfolds=5)

cvmodel.train(x=x, y=y, training_frame=data)


gbm Model Build Progress: [##################################################] 100%
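
To show what a fold ID per row looks like, here is a plain-Python sketch that assigns the 569 rows to 5 folds by taking the row index modulo the number of folds. modulo_folds is a hypothetical helper for illustration; it is not necessarily how H2O assigns folds when only nfolds is given.

```python
def modulo_folds(n_rows, nfolds):
    """Return a fold ID (0 .. nfolds-1) per row: row i goes to fold i % nfolds."""
    return [i % nfolds for i in range(n_rows)]


nfolds = 5
fold_ids = modulo_folds(n_rows=569, nfolds=nfolds)

# Each fold serves once as the holdout set; the remaining folds form the training set.
fold_sizes = []
for k in range(nfolds):
    holdout = [i for i, f in enumerate(fold_ids) if f == k]
    training = [i for i, f in enumerate(fold_ids) if f != k]
    fold_sizes.append((len(training), len(holdout)))

print(fold_sizes)
```

Every row is held out exactly once, so the cross-validated metrics are computed on predictions for all 569 rows.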

One way of evaluating models with different parameters is to perform a grid search over a set of parameter values. For example, in GBM, here are three model parameters that may be useful to search over:

  • ntrees: Number of trees
  • max_depth: Maximum depth of a tree
  • learn_rate: Learning rate in the GBM

We will define a grid as follows:


In [37]:
ntrees_opt = [5,50,100]
max_depth_opt = [2,3,5]
learn_rate_opt = [0.1,0.2]

hyper_params = {'ntrees': ntrees_opt, 
                'max_depth': max_depth_opt,
                'learn_rate': learn_rate_opt}
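
A full (Cartesian) grid trains one model per combination of these values. As a rough sketch in plain Python, the combinations can be enumerated with itertools.product; with 3 x 3 x 2 values there are 18 of them. The variable names below are illustrative only.

```python
from itertools import product

hyper_params = {'ntrees': [5, 50, 100],
                'max_depth': [2, 3, 5],
                'learn_rate': [0.1, 0.2]}

# Cartesian product of all value lists: one dict per candidate model.
names = sorted(hyper_params)
combos = [dict(zip(names, values))
          for values in product(*(hyper_params[name] for name in names))]

print(len(combos))
print(combos[0])
```

The count of 18 is a useful check on how long a grid will take to train, since every added value multiplies the number of models.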

Define an "H2OGridSearch" object by specifying the algorithm (GBM) and the hyper parameters:


In [39]:
from h2o.grid.grid_search import H2OGridSearch

gs = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params = hyper_params)

An "H2OGridSearch" object also has a train method, which is used to train all the models in the grid.


In [40]:
gs.train(x=x, y=y, training_frame=train, validation_frame=test)


gbm Grid Build Progress: [##################################################] 100%

Compare Models


In [42]:
print(gs)


Grid Search Results for H2OGradientBoostingEstimator:
Model Id Hyperparameters: [learn_rate, ntrees, max_depth] mse
Grid_GBM_py_17_model_python_1448480209718_18_model_14 [0.2, 100, 3] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_17 [0.2, 100, 5] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_16 [0.2, 50, 5] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_8 [0.1, 100, 5] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_11 [0.2, 100, 2] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_13 [0.2, 50, 3] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_5 [0.1, 100, 3] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_7 [0.1, 50, 5] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_2 [0.1, 100, 2] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_10 [0.2, 50, 2] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_4 [0.1, 50, 3] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_1 [0.1, 50, 2] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_15 [0.2, 5, 5] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_12 [0.2, 5, 3] 0.0
Grid_GBM_py_17_model_python_1448480209718_18_model_9 [0.2, 5, 2] 0.1
Grid_GBM_py_17_model_python_1448480209718_18_model_6 [0.1, 5, 5] 0.1
Grid_GBM_py_17_model_python_1448480209718_18_model_3 [0.1, 5, 3] 0.1
Grid_GBM_py_17_model_python_1448480209718_18_model_0 [0.1, 5, 2] 0.1


In [43]:
# print out the auc for all of the models
for g in gs:
    print(g.model_id + " auc: " + str(g.auc()))


Grid_GBM_py_17_model_python_1448480209718_18_model_0 auc: 0.990963431786
Grid_GBM_py_17_model_python_1448480209718_18_model_13 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_16 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_17 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_2 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_15 auc: 0.998476324426
Grid_GBM_py_17_model_python_1448480209718_18_model_1 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_3 auc: 0.997444913268
Grid_GBM_py_17_model_python_1448480209718_18_model_9 auc: 0.993682606657
Grid_GBM_py_17_model_python_1448480209718_18_model_11 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_7 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_12 auc: 0.998663853727
Grid_GBM_py_17_model_python_1448480209718_18_model_4 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_8 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_10 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_5 auc: 1.0
Grid_GBM_py_17_model_python_1448480209718_18_model_6 auc: 0.997691045476
Grid_GBM_py_17_model_python_1448480209718_18_model_14 auc: 1.0
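
One simple way to compare the models is to rank them by AUC. The sketch below does this in plain Python on a subset of the values printed above, with model IDs shortened to their suffixes for readability; it does not call any H2O API.

```python
# (model suffix, AUC) pairs copied from the output above; a subset for brevity.
results = [("model_0", 0.990963431786),
           ("model_13", 1.0),
           ("model_15", 0.998476324426),
           ("model_3", 0.997444913268),
           ("model_9", 0.993682606657),
           ("model_12", 0.998663853727),
           ("model_6", 0.997691045476)]

# Sort from highest to lowest AUC.
ranked = sorted(results, key=lambda pair: pair[1], reverse=True)

for model_id, auc in ranked:
    print("{0} auc: {1}".format(model_id, auc))
```

Many models in this grid tie at an AUC of 1.0, so a second criterion (for example, the mse column in the grid results) would be needed to separate them.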

In [ ]:
#TO DO: Compare grid search models