Gradient Boosting Machines



Image Source: brucecompany.com

Introduction

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in an iterative fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

It is recommended that you read through the accompanying Classification and Regression Trees Tutorial for an overview of decision trees.

History

Boosting is one of the most powerful learning ideas introduced in the last twenty years. It was originally designed for classification problems, but it can be extended to regression as well. The motivation for boosting was a procedure that combines the outputs of many "weak" classifiers to produce a powerful "committee." A weak classifier (e.g. decision tree) is one whose error rate is only slightly better than random guessing.

AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire in 1996, and it is now considered a special case of Gradient Boosting. There are some differences between the AdaBoost algorithm and modern Gradient Boosting: in AdaBoost, the "shortcomings" of the existing weak learners are identified by high-weight data points, whereas in Gradient Boosting the shortcomings are identified by gradients.

The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman, simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean. The latter two papers introduced the abstract view of boosting algorithms as iterative functional gradient descent algorithms. That is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.

In general, in terms of model performance, we have the following hierarchy:

$$Boosting > Random \: Forest > Bagging > Single \: Tree$$

Boosting

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers $G_m(x)$, $m = 1, 2, ... , M$.

Stagewise Additive Modeling

Boosting builds an additive model:

$$F(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m)$$

where $b(x; \gamma_m)$ is a tree and $\gamma_m$ parameterizes the splits. With boosting, the parameters $(\beta_m, \gamma_m)$ are fit in a stagewise fashion: each new term is added while the terms already in the model are left unchanged. This slows down the fitting process, so the model overfits less quickly.
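Concretely, forward stagewise fitting solves, at each stage $m$, for the new basis function and its coefficient while leaving all earlier terms untouched (this is the standard forward stagewise additive modeling setup, as presented in Elements of Statistical Learning):

$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^N L\big(y_i, F_{m-1}(x_i) + \beta \, b(x_i; \gamma)\big), \qquad F_m(x) = F_{m-1}(x) + \beta_m b(x; \gamma_m)$$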

AdaBoost

  • AdaBoost builds an additive logistic regression model by stagewise fitting.
  • AdaBoost uses an exponential loss function of the form $L(y, F(x)) = \exp(-yF(x))$, which behaves similarly to the binomial negative log-likelihood (deviance) loss.
  • The principal attraction of the exponential loss in the context of additive modeling is computational: it leads to the simple, modular reweighting scheme of the AdaBoost algorithm.
  • Instead of fitting trees to residuals, the special form of the exponential loss function in AdaBoost leads to fitting trees to weighted versions of the original data.

Source: Elements of Statistical Learning
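The reweighting idea is simple enough to sketch directly. Below is a minimal, illustrative R implementation of discrete AdaBoost using rpart stumps as the weak learners; it follows the classic recipe described above, not the internals of any package covered later, and the function name adaboost_sketch and the -1/+1 label coding are assumptions made for this sketch.

library(rpart)

adaboost_sketch <- function(X, y, M = 50) {
  # y is assumed to be coded as -1/+1
  n <- length(y)
  w <- rep(1 / n, n)                     # start with uniform observation weights
  learners <- vector("list", M)
  alpha <- numeric(M)
  df <- data.frame(X, y = factor(y))
  for (m in seq_len(M)) {
    # Fit a stump to the weighted version of the data
    fit <- rpart(y ~ ., data = df, weights = w,
                 control = rpart.control(maxdepth = 1))
    pred <- ifelse(predict(fit, df, type = "class") == "1", 1, -1)
    # Weighted misclassification error and learner weight
    err <- sum(w * (pred != y)) / sum(w)
    alpha[m] <- log((1 - err) / err)
    # Exponential reweighting: misclassified points get up-weighted
    w <- w * exp(alpha[m] * (pred != y))
    w <- w / sum(w)
    learners[[m]] <- fit
  }
  list(learners = learners, alpha = alpha)  # ensemble votes with weights alpha
}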

Gradient Boosting Algorithm

  • The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function.
  • Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.
  • The latter two papers introduced the abstract view of boosting algorithms as iterative functional gradient descent algorithms. That is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction.
  • This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.

Friedman's Gradient Boosting Algorithm for a generic loss function, $L(y_i, \gamma)$:

Source: Elements of Statistical Learning
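Since the algorithm figure is not reproduced here, the steps of Friedman's gradient tree boosting algorithm (following the presentation in Elements of Statistical Learning) are, in outline:

1. Initialize $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^N L(y_i, \gamma)$.
2. For $m = 1, \ldots, M$:
   a. Compute the pseudo-residuals $r_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}$.
   b. Fit a regression tree to the $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, \ldots, J_m$.
   c. For each region, compute $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)$.
   d. Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.
3. Output $\hat{f}(x) = f_M(x)$.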

Loss Functions and Gradients

Source: Elements of Statistical Learning
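Since the table is not reproduced here, a few common loss functions and their negative gradients (the quantities the trees are fit to), again following Elements of Statistical Learning, are:

  • Squared error, $\frac{1}{2}(y_i - f(x_i))^2$: negative gradient $y_i - f(x_i)$, i.e. the ordinary residual.
  • Absolute error, $|y_i - f(x_i)|$: negative gradient $\mathrm{sign}(y_i - f(x_i))$.
  • Huber loss: the residual when $|y_i - f(x_i)| \le \delta_m$, and $\delta_m \, \mathrm{sign}(y_i - f(x_i))$ otherwise.
  • Binomial deviance (two-class classification): negative gradient $y_i - p(x_i)$, the observed label minus the currently predicted probability.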

The optimal number of iterations, T, and the learning rate, λ, depend on each other.
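The dependence comes from shrinkage: each new tree's contribution is scaled by the learning rate before it is added to the model,

$$f_m(x) = f_{m-1}(x) + \lambda \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm}), \qquad 0 < \lambda \le 1,$$

so a smaller λ makes each step more conservative and typically requires a larger number of iterations T to reach the same training loss; in practice, a small λ combined with more trees (and early stopping) tends to generalize better.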

Stochastic GBM

Friedman (2002) proposed the stochastic gradient boosting algorithm, which simply samples uniformly without replacement from the dataset before estimating the next gradient step. He found that this additional step greatly improved performance.

Some implementations of Stochastic GBM sample both columns and rows (per split and per tree) for better generalization. XGBoost and H2O are two implementations that offer per-tree column and row sampling, which is one reason they are popular among competitive data miners (e.g. on Kaggle); a rough mapping of the relevant parameters is sketched below.
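The parameter names differ across packages; the following comment sketch maps the row/column sampling knobs for the three implementations covered below (names as documented by each package; the values shown are arbitrary examples, not recommendations):

# gbm: row sampling only, drawn per tree
# gbm(..., bag.fraction = 0.5)

# xgboost: row sampling per tree, plus column sampling per tree and per level
# xgb.train(params = list(subsample = 0.8,
#                         colsample_bytree = 0.5,
#                         colsample_bylevel = 0.8), ...)

# h2o: row and column sampling, per tree and per split
# h2o.gbm(..., sample_rate = 0.8,
#         col_sample_rate = 0.7,
#         col_sample_rate_per_tree = 0.5)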

Practical Tips

  • It's more common to grow shorter trees ("shrubs" or "stumps") in GBM than in Random Forest.
  • It's useful to try a variety of column sample (and column sample per tree) rates.
  • Don't assume that the set of optimal tuning parameters for one implementation of GBM will carry over and also be optimal in a different GBM implementation.

GBM Software in R

This is not a comprehensive list of GBM software in R; below we detail a few of the most popular implementations: gbm, xgboost and h2o.

The CRAN Machine Learning Task View lists the following projects as well. The Hinge-loss is optimized by the boosting implementation in package bst. Package GAMBoost can be used to fit generalized additive models by a boosting algorithm. An extensible boosting framework for generalized linear, additive and nonparametric models is available in package mboost. Likelihood-based boosting for Cox models is implemented in CoxBoost and for mixed models in GMMBoost. GAMLSS models can be fitted using boosting by gamboostLSS.

gbm

Authors: Originally written by Greg Ridgeway, added to by various authors, currently maintained by Harry Southworth

Backend: C++

The gbm R package is an implementation of extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine. This is the original R implementation of GBM. A presentation by Mark Landry is available here.

Features:

  • Stochastic GBM.
  • Supports up to 1024 factor levels.
  • Supports classification and regression trees.
  • Includes regression methods for:
    • least squares
    • absolute loss
    • t-distribution loss
    • quantile regression
    • logistic
    • multinomial logistic
    • Poisson
    • Cox proportional hazards partial likelihood
    • AdaBoost exponential loss
    • Huberized hinge loss
    • Learning to Rank measures (LambdaMart)
  • Out-of-bag estimator for the optimal number of iterations is provided.
  • Easy to overfit since early stopping functionality is not automated in this package.
  • If internal cross-validation is used, this can be parallelized to all cores on the machine.
  • Currently undergoing a major refactoring & rewrite (and has been for some time).
  • GPL-2/3 License.

In [19]:
#install.packages("gbm")
#install.packages("cvAUC")
library(gbm)
library(cvAUC)


Loading required package: ROCR
Loading required package: gplots

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

    lowess

Loading required package: data.table
 
cvAUC version: 1.1.0
Notice to cvAUC users: Major speed improvements in version 1.1.0
 

In [20]:
# Load 2-class HIGGS dataset
train <- read.csv("data/higgs_train_10k.csv")
test <- read.csv("data/higgs_test_5k.csv")

In [21]:
set.seed(1)
model <- gbm(formula = response ~ ., 
             distribution = "bernoulli",
             data = train,
             n.trees = 70,
             interaction.depth = 5,
             shrinkage = 0.3,
             bag.fraction = 0.5,
             train.fraction = 1.0,
             n.cores = NULL)  #will use all cores by default

In [22]:
print(model)


gbm(formula = response ~ ., distribution = "bernoulli", data = train, 
    n.trees = 70, interaction.depth = 5, shrinkage = 0.3, bag.fraction = 0.5, 
    train.fraction = 1, n.cores = NULL)
A gradient boosted model with bernoulli loss function.
70 iterations were performed.
There were 28 predictors of which 28 had non-zero influence.

In [23]:
# Generate predictions on test dataset
preds <- predict(model, newdata = test, n.trees = 70)
labels <- test[,"response"]

# Compute AUC on the test set
cvAUC::AUC(predictions = preds, labels = labels)


0.774160905116416
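The out-of-bag estimator mentioned in the feature list above can also be used to pick the number of trees for prediction instead of hard-coding n.trees. A minimal sketch reusing the fitted model object (gbm itself warns that the OOB estimate tends to be conservative, and method = "cv" is only available when the model was fit with cv.folds > 1):

# Estimate the optimal number of iterations from the out-of-bag improvements
# (requires bag.fraction < 1, which was used above)
best_iter <- gbm.perf(model, method = "OOB")

# Predict with the estimated number of trees and re-check test AUC
preds_best <- predict(model, newdata = test, n.trees = best_iter)
cvAUC::AUC(predictions = preds_best, labels = labels)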

xgboost

Authors: Tianqi Chen, Tong He, Michael Benesty

Backend: C++

The xgboost R package provides an R API to "Extreme Gradient Boosting", an efficient implementation of the gradient boosting framework. A parameter tuning guide and more resources are available here. The xgboost package is quite popular on Kaggle for data mining competitions.

Features:

  • Stochastic GBM with column and row sampling (per split and per tree) for better generalization.
  • Includes efficient linear model solver and tree learning algorithms.
  • Parallel computation on a single machine.
  • Supports various objective functions, including regression, classification and ranking.
  • The package is made to be extensible, so that users are also allowed to define their own objectives easily.
  • Apache 2.0 License.

In [24]:
#install.packages("xgboost")
#install.packages("cvAUC")
library(xgboost)
library(Matrix)
library(cvAUC)

In [25]:
# Load 2-class HIGGS dataset
train <- read.csv("data/higgs_train_10k.csv")
test <- read.csv("data/higgs_test_5k.csv")

In [26]:
# Set seed because we column-sample
set.seed(1)

y <- "response"
train.mx <- sparse.model.matrix(response ~ ., train)
test.mx <- sparse.model.matrix(response ~ ., test)
dtrain <- xgb.DMatrix(train.mx, label = train[,y])
dtest <- xgb.DMatrix(test.mx, label = test[,y])

train.gdbt <- xgb.train(params = list(objective = "binary:logistic",
                                      #num_class = 2,
                                      #eval_metric = "mlogloss",
                                      eta = 0.3,
                                      max_depth = 5,
                                      subsample = 1,
                                      colsample_bytree = 0.5),
                        data = dtrain,
                        nrounds = 70,
                        watchlist = list(train = dtrain, test = dtest))


[0]	train-error:0.327500	test-error:0.338000
[1]	train-error:0.297800	test-error:0.324600
[2]	train-error:0.288800	test-error:0.318400
[3]	train-error:0.289100	test-error:0.315600
[4]	train-error:0.278800	test-error:0.310600
[5]	train-error:0.268100	test-error:0.304600
[6]	train-error:0.263300	test-error:0.303400
[7]	train-error:0.261300	test-error:0.301800
[8]	train-error:0.251000	test-error:0.303800
[9]	train-error:0.247400	test-error:0.305000
[10]	train-error:0.240000	test-error:0.301600
[11]	train-error:0.240600	test-error:0.303000
[12]	train-error:0.232800	test-error:0.303400
[13]	train-error:0.233000	test-error:0.300200
[14]	train-error:0.228000	test-error:0.298000
[15]	train-error:0.225600	test-error:0.296400
[16]	train-error:0.222500	test-error:0.295000
[17]	train-error:0.219800	test-error:0.294000
[18]	train-error:0.216700	test-error:0.294400
[19]	train-error:0.214100	test-error:0.293000
[20]	train-error:0.212100	test-error:0.297200
[21]	train-error:0.210800	test-error:0.298000
[22]	train-error:0.210800	test-error:0.296400
[23]	train-error:0.207800	test-error:0.295400
[24]	train-error:0.203000	test-error:0.296600
[25]	train-error:0.203800	test-error:0.295800
[26]	train-error:0.198500	test-error:0.294600
[27]	train-error:0.197100	test-error:0.294200
[28]	train-error:0.196400	test-error:0.294000
[29]	train-error:0.193600	test-error:0.295600
[30]	train-error:0.190600	test-error:0.297000
[31]	train-error:0.188300	test-error:0.296600
[32]	train-error:0.186000	test-error:0.296800
[33]	train-error:0.182700	test-error:0.295800
[34]	train-error:0.183100	test-error:0.296000
[35]	train-error:0.181800	test-error:0.296200
[36]	train-error:0.179700	test-error:0.298200
[37]	train-error:0.179000	test-error:0.296400
[38]	train-error:0.176400	test-error:0.299200
[39]	train-error:0.173600	test-error:0.299200
[40]	train-error:0.171900	test-error:0.299800
[41]	train-error:0.170400	test-error:0.298600
[42]	train-error:0.169700	test-error:0.298400
[43]	train-error:0.168100	test-error:0.299000
[44]	train-error:0.167700	test-error:0.299800
[45]	train-error:0.167500	test-error:0.297800
[46]	train-error:0.166400	test-error:0.297200
[47]	train-error:0.165300	test-error:0.297400
[48]	train-error:0.163400	test-error:0.297200
[49]	train-error:0.161800	test-error:0.298000
[50]	train-error:0.158000	test-error:0.298000
[51]	train-error:0.156600	test-error:0.298200
[52]	train-error:0.154700	test-error:0.299800
[53]	train-error:0.153200	test-error:0.298600
[54]	train-error:0.151500	test-error:0.296600
[55]	train-error:0.150500	test-error:0.298800
[56]	train-error:0.149000	test-error:0.299200
[57]	train-error:0.146800	test-error:0.301000
[58]	train-error:0.143400	test-error:0.299000
[59]	train-error:0.142800	test-error:0.302000
[60]	train-error:0.141100	test-error:0.301600
[61]	train-error:0.141000	test-error:0.301400
[62]	train-error:0.140500	test-error:0.302600
[63]	train-error:0.136700	test-error:0.303600
[64]	train-error:0.134900	test-error:0.303600
[65]	train-error:0.131900	test-error:0.304400
[66]	train-error:0.130900	test-error:0.303000
[67]	train-error:0.129000	test-error:0.300200
[68]	train-error:0.128800	test-error:0.299800
[69]	train-error:0.127600	test-error:0.300800

In [29]:
# Generate predictions on test dataset
preds <- predict(train.gdbt, newdata = dtest)
labels <- test[,y]

# Compute AUC on the test set
cvAUC::AUC(predictions = preds, labels = labels)


0.770787069995295
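Notice in the training log above that the test error bottoms out around rounds 17-19 while the training error keeps falling, a sign that 70 rounds overfits at this learning rate. A hedged sketch of using cross-validation to choose nrounds with xgb.cv; the argument names follow recent releases of the xgboost package (older versions used e.g. early.stop.round), so check your installed version:

# 5-fold CV with early stopping on AUC: stop once the CV AUC has not
# improved for 10 consecutive rounds
cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eta = 0.3,
                           max_depth = 5,
                           colsample_bytree = 0.5),
             data = dtrain,
             nrounds = 200,
             nfold = 5,
             metrics = "auc",
             maximize = TRUE,
             early_stopping_rounds = 10)

# Per-round CV metrics are returned; in recent versions the selected round
# is available as cv$best_iteration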

In [30]:
#Advanced functionality of xgboost
#install.packages("Ckmeans.1d.dp")
library(Ckmeans.1d.dp)

# Compute feature importance matrix
names <- dimnames(data.matrix(train[,-1]))[[2]]
importance_matrix <- xgb.importance(names, model = train.gdbt)

# Plot feature importance
xgb.plot.importance(importance_matrix[1:10,])


Warning message:
In set(allTrees, i = which(allTrees[, Feature] != "Leaf"), j = "Yes.Feature", :
  Supplied 1665 items to be assigned to 1584 items of column 'Yes.Feature' (81 unused)
Warning message:
In set(allTrees, i = which(allTrees[, Feature] != "Leaf"), j = "Yes.Cover", :
  Supplied 1665 items to be assigned to 1584 items of column 'Yes.Cover' (81 unused)
Warning message:
In set(allTrees, i = which(allTrees[, Feature] != "Leaf"), j = "Yes.Quality", :
  Supplied 1665 items to be assigned to 1584 items of column 'Yes.Quality' (81 unused)
Warning message:
In set(allTrees, i = which(allTrees[, Feature] != "Leaf"), j = "No.Feature", :
  Supplied 1665 items to be assigned to 1584 items of column 'No.Feature' (81 unused)
Warning message:
In set(allTrees, i = which(allTrees[, Feature] != "Leaf"), j = "No.Cover", :
  Supplied 1665 items to be assigned to 1584 items of column 'No.Cover' (81 unused)
Warning message:
In set(allTrees, i = which(allTrees[, Feature] != "Leaf"), j = "No.Quality", :
  Supplied 1665 items to be assigned to 1584 items of column 'No.Quality' (81 unused)

h2o

Authors: Arno Candel, Cliff Click, H2O.ai contributors

Backend: Java

H2O GBM Tuning guide by Arno Candel and H2O GBM Vignette.

Features:

  • Distributed and parallelized computation on either a single node or a multi-node cluster.
  • Automatic early stopping based on convergence of user-specified metrics to a user-specified relative tolerance.
  • Stochastic GBM with column and row sampling (per split and per tree) for better generalization.
  • Support for exponential family distributions (Poisson, Gamma, Tweedie) and additional loss functions such as quantile regression (including Laplace), in addition to the Bernoulli (binomial), Gaussian and multinomial distributions.
  • Grid search for hyperparameter optimization and model selection.
  • Data-distributed, which means the entire dataset does not need to fit into memory on a single node, hence scales to any size training set.
  • Uses histogram approximations of continuous variables for speedup.
  • Uses dynamic binning - bin limits are reset at each tree level based on the split bins' min and max values discovered during the last pass.
  • Uses squared error to determine optimal splits.
  • Distributed implementation details outlined in a blog post by Cliff Click.
  • Unlimited factor levels.
  • Multiclass trees (one for each class) built in parallel with each other.
  • Apache 2.0 Licensed.
  • Model export in plain Java code for deployment in production environments.
  • GUI for training & model eval/viz (H2O Flow).

In [67]:
#install.packages("h2o")
library(h2o)
#h2o.shutdown(prompt = FALSE)  #if required
h2o.init(nthreads = -1)  #Start a local H2O cluster using nthreads = num available cores


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T//RtmpsSxaHm/h2o_me_started_from_r.out
    /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T//RtmpsSxaHm/h2o_me_started_from_r.err


Starting H2O JVM and connecting: . Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 117 milliseconds 
    H2O cluster version:        3.8.2.6 
    H2O cluster name:           H2O_started_from_R_me_ren836 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.56 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.3.0 (2016-05-03) 


In [68]:
# Load 2-class HIGGS dataset
train <- h2o.importFile("data/higgs_train_10k.csv")
test <- h2o.importFile("data/higgs_test_5k.csv")
print(dim(train))
print(dim(test))


  |======================================================================| 100%
  |======================================================================| 100%
[1] 10000    29
[1] 5000   29

In [69]:
# Identify the response column
y <- "response"

# Identify the predictor columns
x <- setdiff(names(train), y)

# Convert response to factor
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])

In [70]:
# Train an H2O GBM model
model <- h2o.gbm(x = x,
                 y = y,
                 training_frame = train,
                 ntrees = 70,
                 learn_rate = 0.3,
                 sample_rate = 1.0,
                 max_depth = 5,
                 col_sample_rate_per_tree = 0.5,
                 seed = 1)


  |======================================================================| 100%

In [71]:
# Get model performance on a test set
perf <- h2o.performance(model, test)
print(perf)


H2OBinomialMetrics: gbm

MSE:  0.1938163
R^2:  0.220466
LogLoss:  0.5701889
AUC:  0.7735663
Gini:  0.5471326

Confusion Matrix for F1-optimal threshold:
          0    1    Error        Rate
0       994 1321 0.570626  =1321/2315
1       281 2404 0.104655   =281/2685
Totals 1275 3725 0.320400  =1602/5000

Maximum Metrics: Maximum metrics at their respective thresholds
                      metric threshold    value idx
1                     max f1  0.306264 0.750078 288
2                     max f2  0.110686 0.861722 364
3               max f0point5  0.582499 0.733454 170
4               max accuracy  0.427211 0.704600 237
5              max precision  0.946492 0.966102  14
6                 max recall  0.017942 1.000000 397
7            max specificity  0.995085 0.999568   0
8           max absolute_MCC  0.582499 0.407274 170
9 max min_per_class_accuracy  0.514693 0.702376 198

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

In [72]:
# To retrieve individual metrics
h2o.auc(perf)


0.773566288998556

In [73]:
# Print confusion matrix
h2o.confusionMatrix(perf)


          0    1              Error        Rate
0       994 1321  0.570626349892009  =1321/2315
1       281 2404  0.104655493482309   =281/2685
Totals 1275 3725             0.3204  =1602/5000

In [74]:
# Plot scoring history over time
plot(model)



In [75]:
# Retrieve feature importance
vi <- h2o.varimp(model)
vi[1:10,]


   variable relative_importance scaled_importance  percentage
1       x26    541.335571289062 1                  0.200105710165661
2       x28    229.342498779297 0.423660499961553  0.0847768852139454
3       x25    202.556091308594 0.374178424717693  0.0748752394068022
4        x6    158.774505615234 0.293301445602679  0.0586912940649391
5       x23    154.050354003906 0.284574600625397  0.0569450025532545
6       x27    150.640319824219 0.27827530244411   0.0556844770171426
7        x4    137.782974243164 0.254524146482867  0.0509317350862629
8       x10    111.809761047363 0.20654427120153   0.041330688069431
9        x1    105.712783813477 0.195281428785008  0.0390769289891889
10      x22    98.4624099731445 0.181887936421173  0.0363968146881255

In [76]:
# Plot feature importance
barplot(vi$scaled_importance,
        names.arg = vi$variable,
        space = 1,
        las = 2,
        main = "Variable Importance: H2O GBM")


Note that all models, data and model metrics can be viewed via the H2O Flow GUI, which should already be running since you started the H2O cluster with h2o.init().


In [77]:
# Early stopping example
# Keep in mind that when you use early stopping, you should pass a validation set
# Since the validation set is used to determine the stopping point, a separate test set should be used for model evaluation
# A runnable version of this sketch appears after this cell

#fit <- h2o.gbm(x = x,
#               y = y,
#               training_frame = train,
#               model_id = "gbm_fit3",
#               validation_frame = valid,  #only used if stopping_rounds > 0
#               ntrees = 500,
#               score_tree_interval = 5,      #used for early stopping
#               stopping_rounds = 3,          #used for early stopping
#               stopping_metric = "misclassification", #used for early stopping
#               stopping_tolerance = 0.0005,  #used for early stopping
#               seed = 1)
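A runnable variant of the sketch above, carving a validation frame out of the training data with h2o.splitFrame (the split ratio and stopping settings here are arbitrary choices for illustration):

# Split the training frame so a separate validation frame drives early stopping
splits <- h2o.splitFrame(train, ratios = 0.75, seed = 1)
train2 <- splits[[1]]
valid <- splits[[2]]

fit <- h2o.gbm(x = x,
               y = y,
               training_frame = train2,
               validation_frame = valid,               # drives early stopping
               ntrees = 500,
               score_tree_interval = 5,                # score every 5 trees
               stopping_rounds = 3,
               stopping_metric = "misclassification",
               stopping_tolerance = 0.0005,
               seed = 1)

# Evaluate on the untouched test set
h2o.auc(h2o.performance(fit, test))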

In [78]:
# GBM hyperparameters
gbm_params <- list(learn_rate = seq(0.01, 0.1, 0.01),
                   max_depth = seq(2, 10, 1),
                   sample_rate = seq(0.5, 1.0, 0.1),
                   col_sample_rate = seq(0.1, 1.0, 0.1))
search_criteria <- list(strategy = "RandomDiscrete", 
                         max_models = 20)

# Train and validate a grid of GBMs
gbm_grid <- h2o.grid("gbm", x = x, y = y,
                      grid_id = "gbm_grid",
                      training_frame = train,
                      validation_frame = test,  #test frame will only be used to calculate metrics
                      ntrees = 70,
                      seed = 1,
                      hyper_params = gbm_params,
                      search_criteria = search_criteria)

gbm_gridperf <- h2o.getGrid(grid_id = "gbm_grid", 
                            sort_by = "auc", 
                            decreasing = TRUE)
print(gbm_gridperf)


  |======================================================================| 100%
H2O Grid Details
================

Grid ID: gbm_grid 
Used hyper parameters: 
  -  sample_rate 
  -  max_depth 
  -  learn_rate 
  -  col_sample_rate 
Number of models: 20 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by decreasing auc
   sample_rate max_depth learn_rate col_sample_rate         model_ids
1          0.8         6       0.06               1  gbm_grid_model_9
2          0.8         7       0.05             0.8  gbm_grid_model_7
3          0.9         7       0.08             0.6 gbm_grid_model_16
4          0.9         9       0.04             0.4 gbm_grid_model_14
5          0.7         5       0.07             0.5 gbm_grid_model_11
6          0.5         7       0.05             0.3 gbm_grid_model_17
7          0.5         6       0.03             0.8  gbm_grid_model_3
8          0.5         9       0.01             0.9  gbm_grid_model_6
9          0.6         9       0.01             0.9 gbm_grid_model_12
10         0.5         8       0.08             0.2  gbm_grid_model_2
11         0.6        10       0.09             0.4 gbm_grid_model_15
12         0.6         8       0.09             0.2  gbm_grid_model_4
13         0.5        10        0.1             0.8  gbm_grid_model_8
14         0.5         3       0.05             0.7 gbm_grid_model_10
15         0.6         4       0.02             0.4 gbm_grid_model_13
16         0.9         2       0.05               1  gbm_grid_model_0
17           1         2       0.07             0.2  gbm_grid_model_1
18         0.6         2       0.09             0.1  gbm_grid_model_5
19         0.8         2       0.03             0.1 gbm_grid_model_19
20         0.6         2       0.01             0.4 gbm_grid_model_18
                 auc
1  0.786214108457916
2  0.785642820082773
3   0.78515100691386
4  0.785017717018393
5  0.782504916925082
6  0.779035598939794
7  0.778822753397605
8  0.778125575652272
9  0.777363884632246
10 0.774896372536007
11 0.774681676862499
12 0.773695476428925
13 0.772144744621548
14 0.765276413641099
15 0.759501992913193
16 0.747648845075634
17  0.74619657243063
18 0.739949242049463
19 0.731288616463756
20  0.71596430050959

The grid search helped a lot. The first model we trained only had a 0.774 test set AUC, but the top GBM in our grid has a test set AUC of 0.786. More information about grid search is available in the H2O grid search R tutorial.
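To keep working with the winning model, it can be pulled out of the grid by its id; a short sketch using the sorted grid object from above:

# Grab the id of the top-performing (highest AUC) model and fetch the model
best_model_id <- gbm_gridperf@model_ids[[1]]
best_gbm <- h2o.getModel(best_model_id)

# Confirm its performance on the test set
h2o.auc(h2o.performance(best_gbm, newdata = test))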