Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
It is recommended that you read through the accompanying Classification and Regression Trees Tutorial for an overview of decision trees.
Boosting is one of the most powerful learning ideas introduced in the last twenty years. It was originally designed for classification problems, but it can be extended to regression as well. The motivation for boosting was a procedure that combines the outputs of many "weak" classifiers to produce a powerful "committee." A weak classifier (e.g. decision tree) is one whose error rate is only slightly better than random guessing.
AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire in 1996, and is now considered a special case of Gradient Boosting. There are some differences between the AdaBoost algorithm and modern Gradient Boosting: in AdaBoost, the "shortcomings" of the existing weak learners are identified by highly weighted data points, whereas in Gradient Boosting they are identified by gradients.
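To make the "gradients" idea concrete: at each iteration $m$, Gradient Boosting fits the next weak learner to the pseudo-residuals, i.e. the pointwise negative gradient of the loss with respect to the current model's predictions,
$$r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}$$
For squared-error loss these reduce to the ordinary residuals $y_i - F_{m-1}(x_i)$.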
The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman, simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean. The latter two papers introduced the abstract view of boosting algorithms as iterative functional gradient descent algorithms, that is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
In general, in terms of model performance, we have the following hierarchy:
$$Boosting > Random \: Forest > Bagging > Single \: Tree$$
The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers $G_m(x)$, $m = 1, 2, ... , M$.
Boosting builds an additive model:
$$F(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m)$$
where $b(x; \gamma_m)$ is a tree and $\gamma_m$ parameterizes its splits. With boosting, the parameters $(\beta_m, \gamma_m)$ are fit in a stagewise fashion, which slows the fitting process down and makes the model overfit less quickly.
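Concretely, "stagewise" means that at stage $m$ the new basis function and its coefficient are chosen to improve the current fit, while all previously added terms are left unchanged:
$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^N L\big(y_i, F_{m-1}(x_i) + \beta \, b(x_i; \gamma)\big), \qquad F_m(x) = F_{m-1}(x) + \beta_m b(x; \gamma_m)$$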
Friedman's Gradient Boosting Algorithm for a generic loss function, $L(y_i, \gamma)$:
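In words, the algorithm initializes the model with a constant, and then at each iteration computes the pseudo-residuals (the negative gradient of the loss at the current fit), fits a small tree to them, and adds a shrunken multiple of that tree to the model. As a minimal illustration only (it omits the per-terminal-node line search in Friedman's full algorithm), here is a sketch in R for the squared-error case, using rpart stumps as weak learners; boost_sketch, n_trees and learn_rate are illustrative names, not part of any package:

library(rpart)

# Minimal gradient boosting sketch for squared-error loss (illustrative only)
boost_sketch <- function(X, y, n_trees = 100, learn_rate = 0.1) {
  # Initialize with the constant that minimizes squared error: the mean of y
  F_hat <- rep(mean(y), length(y))
  trees <- vector("list", n_trees)
  df <- data.frame(X)
  for (m in seq_len(n_trees)) {
    # Pseudo-residuals = negative gradient of the loss; for squared error, y - F
    df$r <- y - F_hat
    # Fit a weak learner (a depth-1 regression tree) to the pseudo-residuals
    trees[[m]] <- rpart(r ~ ., data = df, control = rpart.control(maxdepth = 1))
    # Take a small (shrunken) step in the direction of the new tree
    F_hat <- F_hat + learn_rate * predict(trees[[m]], df)
  }
  list(init = mean(y), trees = trees, learn_rate = learn_rate)
}

A prediction for a new observation is then the initial constant plus learn_rate times the sum of the fitted trees' predictions at that point.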
The optimal number of iterations, T, and the learning rate, λ, depend on each other.
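In practice, one typically fixes a small learning rate and then chooses the number of trees on held-out data. With the gbm package, for example, this could look like the sketch below, assuming a hypothetical cv_model fit with gbm(..., cv.folds = 5) (the gbm example later in this tutorial does not use cross-validation):

# Estimate the optimal number of trees from the cross-validation error curve
best_iter <- gbm.perf(cv_model, method = "cv")
# Predict using only that many trees
preds <- predict(cv_model, newdata = test, n.trees = best_iter)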
Friedman (2002) proposed the stochastic gradient boosting algorithm, which simply samples a subset of the training data uniformly without replacement before estimating each gradient step. He found that this additional step greatly improved performance.
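In terms of the sketch above, the only change is that each tree is fit on a random subset of the rows; bag_fraction and idx below are illustrative names:

# Inside the boosting loop: subsample rows without replacement before fitting tree m
bag_fraction <- 0.5
idx <- sample(nrow(df), size = floor(bag_fraction * nrow(df)))
trees[[m]] <- rpart(r ~ ., data = df[idx, ], control = rpart.control(maxdepth = 1))
F_hat <- F_hat + learn_rate * predict(trees[[m]], df)  # still update the fit on all rows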
Some implementations of Stochastic GBM have both column and row sampling (per split and per tree) for better generalization. XGBoost and H2O are two implementations that offer per-tree column and row sampling. This is one reason these implementations are popular among competitors in data mining competitions (e.g. Kaggle).
This is not a comprehensive list of GBM software in R; below we detail a few of the most popular implementations: gbm, xgboost and h2o.
The CRAN Machine Learning Task View lists the following projects as well. The Hinge-loss is optimized by the boosting implementation in package bst. Package GAMBoost can be used to fit generalized additive models by a boosting algorithm. An extensible boosting framework for generalized linear, additive and nonparametric models is available in package mboost. Likelihood-based boosting for Cox models is implemented in CoxBoost and for mixed models in GMMBoost. GAMLSS models can be fitted using boosting by gamboostLSS.
Authors: Originally written by Greg Ridgeway, added to by various authors, currently maintained by Harry Southworth
Backend: C++
The gbm R package is an implementation of extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine. This is the original R implementation of GBM. A presentation by Mark Landry is available here.
Features:
In [19]:
#install.packages("gbm")
#install.packages("cvAUC")
library(gbm)
library(cvAUC)
In [20]:
# Load 2-class HIGGS dataset
train <- read.csv("data/higgs_train_10k.csv")
test <- read.csv("data/higgs_test_5k.csv")
In [21]:
set.seed(1)
model <- gbm(formula = response ~ .,
             distribution = "bernoulli",
             data = train,
             n.trees = 70,
             interaction.depth = 5,
             shrinkage = 0.3,
             bag.fraction = 0.5,
             train.fraction = 1.0,
             n.cores = NULL)  #will use all cores by default
In [22]:
print(model)
In [23]:
# Generate predictions on test dataset
preds <- predict(model, newdata = test, n.trees = 70)
labels <- test[,"response"]
# Compute AUC on the test set
cvAUC::AUC(predictions = preds, labels = labels)
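Note that with distribution = "bernoulli", predict.gbm returns predictions on the log-odds (link) scale by default; AUC depends only on the ranking of the scores, so the value above is unaffected. To get predicted probabilities instead, request the response scale:

# Same model, predictions on the probability scale (same ranking, hence same AUC)
prob_preds <- predict(model, newdata = test, n.trees = 70, type = "response")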
Authors: Tianqi Chen, Tong He, Michael Benesty
Backend: C++
The xgboost R package provides an R API to "Extreme Gradient Boosting", which is an efficient implementation of the gradient boosting framework. A parameter tuning guide and more resources are available here. The xgboost package is quite popular on Kaggle for data mining competitions.
Features:
In [24]:
#install.packages("xgboost")
#install.packages("cvAUC")
library(xgboost)
library(Matrix)
library(cvAUC)
In [25]:
# Load 2-class HIGGS dataset
train <- read.csv("data/higgs_train_10k.csv")
test <- read.csv("data/higgs_test_5k.csv")
In [26]:
# Set seed because we column-sample
set.seed(1)
y <- "response"
train.mx <- sparse.model.matrix(response ~ ., train)
test.mx <- sparse.model.matrix(response ~ ., test)
dtrain <- xgb.DMatrix(train.mx, label = train[,y])
dtest <- xgb.DMatrix(test.mx, label = test[,y])
train.gdbt <- xgb.train(params = list(objective = "binary:logistic",
                                      #num_class = 2,
                                      #eval_metric = "mlogloss",
                                      eta = 0.3,
                                      max_depth = 5,
                                      subsample = 1,
                                      colsample_bytree = 0.5),
                        data = dtrain,
                        nrounds = 70,
                        watchlist = list(train = dtrain, test = dtest))
In [29]:
# Generate predictions on test dataset
preds <- predict(train.gdbt, newdata = dtest)
labels <- test[,y]
# Compute AUC on the test set
cvAUC::AUC(predictions = preds, labels = labels)
In [30]:
#Advanced functionality of xgboost
#install.packages("Ckmeans.1d.dp")
library(Ckmeans.1d.dp)
# Compute feature importance matrix
names <- dimnames(data.matrix(train[,-1]))[[2]]
importance_matrix <- xgb.importance(names, model = train.gdbt)
# Plot feature importance
xgb.plot.importance(importance_matrix[1:10,])
Authors: Arno Candel, Cliff Click, H2O.ai contributors
Backend: Java
The h2o R package provides an interface to the H2O machine learning platform, which includes a distributed, Java-based GBM implementation. See the H2O GBM Tuning guide by Arno Candel and the H2O GBM Vignette.
Features:
In [67]:
#install.packages("h2o")
library(h2o)
#h2o.shutdown(prompt = FALSE) #if required
h2o.init(nthreads = -1) #Start a local H2O cluster using nthreads = num available cores
In [68]:
# Load 2-class HIGGS dataset
train <- h2o.importFile("data/higgs_train_10k.csv")
test <- h2o.importFile("data/higgs_test_5k.csv")
print(dim(train))
print(dim(test))
In [69]:
# Identify the response column
y <- "response"
# Identify the predictor columns
x <- setdiff(names(train), y)
# Convert response to factor
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])
In [70]:
# Train an H2O GBM model
model <- h2o.gbm(x = x,
                 y = y,
                 training_frame = train,
                 ntrees = 70,
                 learn_rate = 0.3,
                 sample_rate = 1.0,
                 max_depth = 5,
                 col_sample_rate_per_tree = 0.5,
                 seed = 1)
In [71]:
# Get model performance on a test set
perf <- h2o.performance(model, test)
print(perf)
In [72]:
# To retrieve individual metrics
h2o.auc(perf)
In [73]:
# Print confusion matrix
h2o.confusionMatrix(perf)
In [74]:
# Plot scoring history over time
plot(model)
In [75]:
# Retrieve feature importance
vi <- h2o.varimp(model)
vi[1:10,]
In [76]:
# Plot feature importance
barplot(vi$scaled_importance,
        names.arg = vi$variable,
        space = 1,
        las = 2,
        main = "Variable Importance: H2O GBM")
Note that all models, data and model metrics can be viewed via the H2O Flow GUI, which should already be running since you started the H2O cluster with h2o.init().
In [77]:
# Early stopping example
# Keep in mind that when you use early stopping, you should pass a validation set
# Since the validation set is used to determine the stopping point, a separate test set should be used for model evaluation
#fit <- h2o.gbm(x = x,
#               y = y,
#               training_frame = train,
#               model_id = "gbm_fit3",
#               validation_frame = valid,  #only used if stopping_rounds > 0
#               ntrees = 500,
#               score_tree_interval = 5,   #used for early stopping
#               stopping_rounds = 3,       #used for early stopping
#               stopping_metric = "misclassification",  #used for early stopping
#               stopping_tolerance = 0.0005,  #used for early stopping
#               seed = 1)
In [78]:
# GBM hyperparameters
gbm_params <- list(learn_rate = seq(0.01, 0.1, 0.01),
                   max_depth = seq(2, 10, 1),
                   sample_rate = seq(0.5, 1.0, 0.1),
                   col_sample_rate = seq(0.1, 1.0, 0.1))
search_criteria <- list(strategy = "RandomDiscrete",
                        max_models = 20)
# Train and validate a grid of GBMs
gbm_grid <- h2o.grid("gbm", x = x, y = y,
                     grid_id = "gbm_grid",
                     training_frame = train,
                     validation_frame = test,  #test frame will only be used to calculate metrics
                     ntrees = 70,
                     seed = 1,
                     hyper_params = gbm_params,
                     search_criteria = search_criteria)
gbm_gridperf <- h2o.getGrid(grid_id = "gbm_grid",
                            sort_by = "auc",
                            decreasing = TRUE)
print(gbm_gridperf)
The grid search helped a lot. The first model we trained only had a 0.774 test set AUC, but the top GBM in our grid has a test set AUC of 0.786. More information about grid search is available in the H2O grid search R tutorial.