Import the main toolkit.
In [2]:
import time, os, re, zipfile
import numpy as np, pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
Now import the machine learning packages.
In [4]:
import sklearn as sk, xgboost as xg
import sklearn.preprocessing, sklearn.metrics    # make sk.preprocessing / sk.metrics available below
from sklearn.model_selection import train_test_split
# from sklearn.cross_validation import train_test_split  # old location, removed in scikit-learn 0.20
Mind the seed!!
In [5]:
random_state = np.random.RandomState( seed = 0x0BADC0DE )
Let's begin this introduction with usage examples.
The demonstration uses the dataset from the Otto Group Product Classification Challenge. We load the data directly from the ZIP archive.
In [6]:
df_train = pd.read_csv( zipfile.ZipFile( 'train.csv.zip' ).open( 'train.csv' ), index_col = 'id' )
X = np.asanyarray( df_train.drop( 'target', axis = 1 ) )
y = sk.preprocessing.LabelEncoder( ).fit_transform( df_train[ 'target' ] )
As usual, split the data into train, validation and test sets.
In [7]:
X_train, X_, y_train, y_ = train_test_split( X, y, test_size = 0.25, random_state = random_state )
X_valid, X_test, y_valid, y_test = train_test_split( X_, y_, test_size = 0.5, random_state = random_state )
Use the scikit-learn compatible interface of XGBoost.
In [9]:
clf_ = xg.XGBClassifier( n_estimators = 50,
gamma = 1.0,
max_depth = 1000,
objective = "multi:softmax",
nthread = -1,
silent = False )
Fit a gradient boosted tree ensemble.
In [10]:
clf_.fit( X_train, y_train, eval_set = [ ( X_valid, y_valid ), ], verbose = True )
Out[10]:
Now let's validate.
In [11]:
y_predict = clf_.predict( X_test )
y_score = clf_.predict_proba( X_test )
Let's check out the confusion matrix.
In [12]:
pd.DataFrame( sk.metrics.confusion_matrix( y_test, y_predict ), index = clf_.classes_, columns = clf_.classes_ )
Out[12]:
Let's plot the one-vs-all ROC curves.
In [13]:
fig = plt.figure( figsize = ( 16, 9 ) )
axis = fig.add_subplot( 111 )
axis.set_title( 'ROC-AUC (ovr) curves for the heldout dataset' )
axis.set_xlabel( "False positive rate" ) ; axis.set_ylabel( "True positive rate" )
axis.set_ylim( -0.01, 1.01 ) ; axis.set_xlim( -0.01, 1.01 )
for cls_ in clf_.classes_ :
    fpr, tpr, _ = sk.metrics.roc_curve( y_test, y_score[:, cls_], pos_label = cls_ )
    axis.plot( fpr, tpr, lw = 2, zorder = cls_, label = "C%d" % ( cls_, ) )
axis.legend( loc = 'lower right', shadow = True, ncol = 3 )
Out[13]:
Internally XGBoost relies heavily on a custom dataset format, DMatrix. The interface exposed to Python has three capabilities:
Let's load the train dataset using the numpy interface:
In [20]:
train_dmat = xg.DMatrix( data = X_train,
label = y_train,
feature_names = None,
feature_types = None )
test_dmat = xg.DMatrix( data = X_test, label = y_test )
DMatrix exports several useful methods:
For a more detailed list, have a look at the official manual.
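As a quick illustration, a few of the commonly used methods can be exercised on the train_dmat defined above (a minimal sketch; the cached file name is ours, and the exact set of methods may differ slightly between XGBoost versions):
In [ ]:
# Basic introspection and manipulation of a DMatrix.
print( train_dmat.num_row( ), train_dmat.num_col( ) )        # dataset dimensions
labels = train_dmat.get_label( )                              # the labels stored alongside the features
head_dmat = train_dmat.slice( list( range( 100 ) ) )          # a sub-DMatrix with the first 100 rows
train_dmat.save_binary( 'train.dmatrix' )                     # cache to a binary file for fast reloading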
Having defined the datasets, it is the right time to initialize the booster. To this end one uses xgboost.train. Among other arguments, it accepts a dictionary of booster parameters, which allows for more flexible booster initialization.
In [26]:
xgb_params = {
'bst:max_depth':2,
'bst:eta':1,
'silent':1,
'objective':'multi:softmax',
'num_class': 9,
'nthread': 2,
'eval_metric' : 'auc'
}
The xgboost.train function initializes an appropriate booster and then fits it on the provided training DMatrix. Besides the booster parameters and the training DMatrix, it accepts:
In [27]:
xgbooster_ = xg.train( params = xgb_params,
dtrain = train_dmat,
num_boost_round = 10,
evals = (),
obj = None,
feval = None,
maximize = False,
early_stopping_rounds = None,
evals_result = None,
verbose_eval = True,
learning_rates = None,
xgb_model = None )
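For example, per-round validation tracking and early stopping can be wired through evals, early_stopping_rounds and evals_result (a sketch of ours; we switch the evaluation metric to mlogloss, which, unlike the original cell, reports a per-round multi-class score):
In [ ]:
# Hypothetical variant of the call above with a watchlist and early stopping.
history = { }
es_params = dict( xgb_params, eval_metric = 'mlogloss' )      # assumption: multi-class log-loss metric
xgbooster_es_ = xg.train( params = es_params,
                          dtrain = train_dmat,
                          num_boost_round = 100,
                          evals = [ ( train_dmat, 'train' ), ( test_dmat, 'eval' ) ],
                          early_stopping_rounds = 5,           # stop if 'eval' stalls for 5 rounds
                          evals_result = history,              # per-round metric values end up here
                          verbose_eval = True )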
The method xgboost.Booster.update performs one iteration of gradient boosting:
The method xgboost.Booster.boost performs one iteration of boosting on custom gradient statistics:
The method xgboost.Booster.predict returns either the predicted value (or margin) or the index of the predicted leaf. Its parameters are:
The returned result is a numpy ndarray.
In [28]:
y_predict = xgbooster_.predict( test_dmat )
y_score = xgbooster_.predict( test_dmat, output_margin = True )
Besides these methods, xgboost.Booster also exports load_model( fname ) and save_model( fname ).
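A minimal sketch of how these pieces fit together (the file name is hypothetical, and this is our own illustration rather than the library's internal code):
In [ ]:
# Drive the boosting rounds manually through Booster.update().
booster_ = xg.Booster( params = xgb_params, cache = [ train_dmat, test_dmat ] )
for it in range( 10 ) :
    booster_.update( train_dmat, iteration = it )             # one iteration of gradient boosting

leaf_idx = booster_.predict( test_dmat, pred_leaf = True )    # indices of the predicted leaves, per tree

booster_.save_model( 'otto.model' )                           # hypothetical file name
booster_restored_ = xg.Booster( params = xgb_params, model_file = 'otto.model' )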
Let's check out the confusion matrix.
In [29]:
pd.DataFrame( sk.metrics.confusion_matrix( y_test, y_predict ), index = clf_.classes_, columns = clf_.classes_ )
Out[29]:
Let's plot the one-vs-all ROC curves.
In [30]:
fig = plt.figure( figsize = ( 16, 9 ) )
axis = fig.add_subplot( 111 )
axis.set_title( 'ROC-AUC (ovr) curves for the heldout dataset' )
axis.set_xlabel( "False positive rate" ) ; axis.set_ylabel( "True positive rate" )
axis.set_ylim( -0.01, 1.01 ) ; axis.set_xlim( -0.01, 1.01 )
for cls_ in clf_.classes_ :
    fpr, tpr, _ = sk.metrics.roc_curve( y_test, y_score[:, cls_], pos_label = cls_ )
    axis.plot( fpr, tpr, lw = 2, zorder = cls_, label = "C%d" % ( cls_, ) )
axis.legend( loc = 'lower right', shadow = True, ncol = 3 )
Out[30]:
Consider some dataset $(x_i, y_i)_{i=1}^n \in X\times Y$, where $X$ is the feature space and $Y$ is the space of target values. Let $\mathtt{Obj}(\theta)$ be the objective function, which we wish to minimize by learning a model (or a set of parameters). The objective has two components, which depend on the parameter set $\theta$: $$\mathtt{Obj}(\theta) = L(\theta) + \Omega(\theta)\,. $$ Here $L$ is the training loss function, which reflects the overall quality and fitness of the learned model, and $\Omega$ is the regularization term, which controls overfitting by penalizing the model's complexity. The working assumption here is that the more complex the model, the more likely it is to fail to generalize and to overfit.
Let $\mathcal{M}$ be the set of base models for the ensemble, for instance, CART trees. A learner tries to find a pool of $K$ models $(f_k)_{k=1}^K \in \mathcal{M}$, which, when aggregated into an ensemble $$ \hat{y}(x) = \sum_{k=1}^K f_k(x)\,, $$ minimize the objective function $$\mathtt{Obj} = \sum_{i=1}^n l( y_i, \hat{y}(x_i) ) + \sum_{k=1}^K \Omega(f_k)\,,$$ with $l(y, \hat{y})$ being the loss and $\Omega(\cdot)$ the complexity regularizer.
In the general nonlinear and non-convex case, solving the following optimization problem is hard: $$ \sum_{i=1}^n l( y_i, \hat{y}_i ) + \sum_{k=1}^K \Omega(f_k) \rightarrow \mathop{\mathtt{min}}_{f_k\in\mathcal{M} } \,,$$ with $\hat{y}_i = \sum_{k=1}^K f_k(x_i)$. No general technique for solving it is known, other than applying a clever variant of exhaustive search. Furthermore, minimizing the objective to the extreme might be undesirable, as the data might be subject to random noise and other impurities.
If optimizing the objective function (either the original or one modified in some way) over a single base model (function) is tractable and can be done efficiently, then one can apply a greedy search method, whose search space is much narrower.
Forward Stagewise Additive Modelling is a general greedy approach to building additive ensembles. The basic idea is to construct a suboptimal model incrementally, in a greedy fashion. For the objective function $\mathtt{Obj}$ the approach is outlined below: at iteration $k$ the increment is chosen greedily,
$$ f_k \leftarrow \mathop{\mathtt{argmin}}\limits_{f\in \mathcal{M}} \sum_{i=1}^n l( y_i, F_{k-1}(x_i) + f(x_i) ) + \Omega(f) \,,$$
where the complexity penalty $\Omega(F_{k-1}) = \sum_{j=1}^{k-1} \Omega(f_j)$ is omitted, since it does not depend on $f_k$; the ensemble is then updated to $F_k = F_{k-1} + f_k$.
The gradient boosting algorithm, in its simplest form, is outlined below: at iteration $k$ one computes the gradient of the training loss at the current ensemble,
$$ g^{k-1}_i = \biggl. \frac{\partial}{\partial \hat{y}_i} L\bigl( (y_i)_{i=1}^n, (\hat{y}_i)_{i=1}^n \bigr) \biggr\rvert_{\hat{y}_i = F_{k-1}(x_i)}\,, $$
where $L\bigl( (y_i)_{i=1}^n, (\hat{y}_i)_{i=1}^n \bigr) = \sum_{i=1}^n l( y_i, \hat{y}_i )$; a base model with regions $(R_j)$ is fit to these gradients, and each leaf value is chosen by
$$ w_j \leftarrow \mathop{\mathtt{argmin}}\limits_{w} \sum_{i\,:\,x_i\in R_j} l( y_i, \hat{y}_i + w ) + \Omega(w)\,,$$
where $\Omega(w)$ is the regularizer term for weight $w$.
In order to permit the use of arbitrary convex loss functions and still achieve high performance during learning, the author of XGBoost implemented a clever trick: the minimization with respect to the increment $f(\cdot)$ is performed on the second order Taylor series approximation of the loss $l$ at $(x_i, y_i)$ and $F(\cdot)$. In particular, the minimization over $f(\cdot)$ is done on
$$ q_{y, x} = l( y, F(x) ) + \frac{\partial l}{\partial \hat{y}}\bigg\vert_{(y,F(x))}\!\! f(x)
 + \frac{1}{2} \frac{\partial^2 l}{\partial \hat{y}^2}\bigg\vert_{(y,F(x))}\! f(x)^2 \,, $$
rather than $l( y, F(x) + f(x) )$.
Since $\Omega(F_{k-1})$ and $l( y_i, F_{k-1}(x_i) )$ are unaffected by the choice of $f\in\mathcal{M}$ at iteration $k$, the greedy step can be reduced to: $$f_k \leftarrow \mathop{\mathtt{argmin}}\limits_{f\in \mathcal{M}} \sum_{i=1}^n g^{k-1}_i f(x_i) + \frac{1}{2} h^{k-1}_i f(x_i)^2 + \Omega(f)\,,$$ where $g^{k-1}_i = \frac{\partial l(y, \hat{y})}{\partial \hat{y}}$ and $h^{k-1}_i = \frac{\partial^2 l(y, \hat{y})}{\partial \hat{y}^2}$ evaluated at $y=y_i$ and $\hat{y}=F_{k-1}(x_i)$.
This approximation turned out to be efficient in practice, and it was deemed better to approximate the loss than to optimize it exactly.
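This second-order machinery is also what the obj callback of xgboost.train taps into: the callback receives the current predictions and returns the per-observation gradient and hessian. Below is a minimal sketch for the squared-error loss $l(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$, for which $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the synthetic regression data and all names are ours, not part of the Otto example above.
In [ ]:
# A custom objective supplies the gradient statistics (g_i, h_i) to the booster.
def squared_error_obj( preds, dtrain ) :
    labels = dtrain.get_label( )
    grad = preds - labels                     # g_i : first derivative of the loss w.r.t. the prediction
    hess = np.ones_like( preds )              # h_i : second derivative of the loss
    return grad, hess

X_demo = random_state.normal( size = ( 256, 5 ) )
y_demo = X_demo.sum( axis = 1 ) + 0.1 * random_state.normal( size = 256 )
demo_dmat = xg.DMatrix( X_demo, label = y_demo )

demo_booster_ = xg.train( params = { 'max_depth' : 3, 'eta' : 0.3, 'silent' : 1 },
                          dtrain = demo_dmat,
                          num_boost_round = 20,
                          obj = squared_error_obj )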
The XGBoost library implements two base models $f:X\to \mathbb{R}$: a linear approximator and a nonlinear tree-like model. A linear base model is of the form $$ f(x) = \beta_0 + x\beta \,, $$ where $\beta_0\in \mathbb{R}$ is the intercept and $\beta\in \mathbb{R}^{d\times 1}$ is the vector of coefficients.
Regression tree models are a subclass of piecewise-constant models: $$ f(x) = \sum_{j=1}^J w_j 1_{R_j}(x)\,, $$ where $R = (R_j)_{j=1}^J \subseteq X$ is a partition of the domain $X$ into non-overlapping regions, and $w_j$ is the level of $f$ for all $x$ which fall within leaf $j$, $x\in R_j$. Since $R$ is a partition, for any $x\in X$ there is a unique $j(x)=1,\ldots,J$ such that $x\in R_{j(x)}$. Therefore, in a more compact representation: for any $x\in X$ $$ f(x) = w_{j(x)} \,.$$
The regression tree model $f$ is directly used to predict the target value $y\in \mathbb{R}$ given some $x\in X$, embedded in $\mathbb{R}^p$.
The specific feature of tree models is that the partition $R$ is constructed in a recursive manner until some criteria on the tree's structural score are met, and the final tree structure reflects the recursion tree.
It is important to note that XGBoost implements binary trees, which does not restrict the model in any way. However, this adds the need for an extra preprocessing step for categorical features. Specifically, the binary structure requires that such features be one-hot encoded, which is likely to use excessive volumes of memory, especially when the set of possible categories is on the order of thousands.
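For illustration, a categorical column can be one-hot encoded before being handed to XGBoost (a minimal sketch on a made-up column; the Otto features above are already numeric):
In [ ]:
# Hypothetical categorical column expanded into indicator features.
df_demo = pd.DataFrame( { 'color' : [ 'red', 'green', 'blue', 'green' ] } )
df_encoded = pd.get_dummies( df_demo, columns = [ 'color' ] )    # one binary column per category
X_encoded = np.asarray( df_encoded, dtype = float )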
XGBoost uses criteria derived from the objective function that permit automatic tree-pruning. Consider some tree $f$ with structure $$ f = \sum_{j=1}^J w_j 1_{R_j} \,,$$ where $(R_j)_{j=1}^J\subseteq X$ is its partition and $w\in\mathbb{R}^J$ the vector of leaf values. For this tree the complexity regularization is $$ \Omega(f) = \gamma J + \frac{\lambda}{2} \sum_{j=1}^J w_j^2 + \alpha \sum_{j=1}^J \bigl|w_j\bigr| \,. $$ As one can see, both excessively large leaf values and an excessive number of leaves are penalized.
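To the best of our understanding, these three penalties map onto the booster parameters gamma (the per-leaf cost $\gamma$), lambda (the $L^2$ weight $\lambda$) and alpha (the $L^1$ weight $\alpha$); in the scikit-learn wrapper the latter two appear as reg_lambda and reg_alpha. A minimal sketch with values chosen purely for illustration:
In [ ]:
# Sketch: tree-booster regularization knobs corresponding to gamma, lambda and alpha above.
reg_params = { 'objective' : 'multi:softmax', 'num_class' : 9, 'silent' : 1, 'max_depth' : 4,
               'gamma' : 1.0,       # cost of each additional leaf (the gamma * J term)
               'lambda' : 1.0,      # L2 penalty on the leaf values
               'alpha' : 0.1 }      # L1 penalty on the leaf values
regularized_booster_ = xg.train( params = reg_params, dtrain = train_dmat, num_boost_round = 10 )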
The structural score of an XGBoost regression tree is the minimal value of the objective function for a fixed partition structure $R = (R_j)_{j=1}^J$: $$ \mathtt{Obj}^*(R) = \min_{w} \mathtt{Obj}(R, w) = \min_{w} \sum_{i=1}^n \bigl( g^{k-1}_i w_{j(x_i)} + \frac{1}{2} h^{k-1}_i w_{j(x_i)}^2 \bigr)
 + \frac{\lambda}{2} \sum_{j=1}^J w_j^2 + \alpha \sum_{j=1}^J \bigl|w_j\bigr| + \gamma J
\,. $$
This is not an intermediate value of the objective function, but rather its difference against $\sum_{i=1}^n l( y_i, F_{k-1}(x_i) )$.
Using the map $x\mapsto j(x)$, which gives the unique leaf index $j=1,\ldots,J$ such that $x\in R_j$, the objective function can be transformed into \begin{align*} \mathtt{Obj}(R, w) &= \sum_{i=1}^n \bigl( g^{k-1}_i w_{j(x_i)} + \frac{1}{2} h^{k-1}_i w_{j(x_i)}^2 \bigr)
 + \frac{\lambda}{2} \sum_{j=1}^J w_j^2 + \alpha \sum_{j=1}^J \bigl|w_j\bigr| + \gamma J \\
&= \sum_{j=1}^J \bigl( w_j G_{k-1}(R_j) + \frac{1}{2} \bigl( H_{k-1}(R_j) + \lambda \bigr) w_j^2
 + \alpha \bigl|w_j\bigr| + \gamma \bigr) \,,
\end{align*} where for any $P\subseteq X$, the values $G_{k-1}(P) = \sum_{i\,:\,x_i\in P} g^{k-1}_i$ and $H_{k-1}(P) = \sum_{i\,:\,x_i\in P} h^{k-1}_i$ are called the first and the second order gradient scores, respectively. When $P = R_j$ these are the $j$-th leaf gradient statistics, which depend only on the ensemble $F_{k-1}$ and are constant relative to the increment $f$.
It is worth noting that, since there are no cross-interactions between the scores $w_j$ of different leaves, this minimization problem reduces to $J$ independent univariate optimization problems: $$ w_j G_{k-1}(R_j) + \frac{1}{2} \bigl( H_{k-1}(R_j) + \lambda \bigr) w_j^2
 + \alpha \bigl|w_j\bigr| + \gamma \to \min_{w_j}\,,$$
for $j=1,\ldots, J$. Let's assume that $H_{k-1}(R_j) + \lambda > 0$, since otherwise this problem has no solution.
The optimal leaf value $w_j^*$ in the general case is given by $$ w^*_j = - \frac{1}{H_{k-1}(R_j) + \lambda}\begin{cases} G_{k-1}(R_j) + \alpha & \text{ if } G_{k-1}(R_j) \leq -\alpha\\ 0&\text{ if } G_{k-1}(R_j) \in [-\alpha, \alpha]\\ G_{k-1}(R_j) - \alpha & \text{ if } G_{k-1}(R_j) \geq \alpha \end{cases} \,, $$ see appendix for a solution.
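A direct transcription of this closed form (our own helper, not library code):
In [ ]:
# Closed-form leaf value w*_j from the leaf gradient statistics G, H (soft-thresholding of G).
def optimal_leaf_weight( G, H, lam = 1.0, alpha = 0.0 ) :
    assert H + lam > 0.0
    if G <= -alpha :
        return -( G + alpha ) / ( H + lam )
    if G >= alpha :
        return -( G - alpha ) / ( H + lam )
    return 0.0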
Trees in XGBoost are built by a greedy recursive construction algorithm; the growth process continues until no more splits are possible.
The first step is the most computationally intensive, since it requires $O( J d n\log n )$ operations.
For simplicity, let's consider the case of pure $L^2$ regularization, $\alpha = 0$. In this case the following weights give the optimal leaf scores $$ w^*_j = -\frac{G_{k-1}(R_j)}{H_{k-1}(R_j) + \lambda}\,.$$ The structural score becomes $$ \mathtt{Obj}^*(R) = \gamma J - \frac{1}{2}\sum_{j=1}^J \frac{G_{k-1}^2(R_j)}{H_{k-1}(R_j) + \lambda} \,. $$
Any split $R_j \rightarrow R_{j_1}\!\| R_{j_2}$ yields the following gain: $$ \mathtt{Gain} = \frac{1}{2}\Biggl( \frac{G_{k-1}^2(R_{j_1})}{H_{k-1}(R_{j_1}) + \lambda}
 + \frac{G_{k-1}^2(R_{j_2})}{H_{k-1}(R_{j_2}) + \lambda}
 - \frac{G_{k-1}^2(R_j)}{ H_{k-1}(R_j) + \lambda}
\Biggr) - \gamma\,.$$
Note that $G_{k-1}(\cdot)$ and $H_{k-1}(\cdot)$ are additive by construction: $$G_{k-1}(R_j) = G_{k-1}(R_{j_1}) + G_{k-1}(R_{j_2}) \,,$$ and $$H_{k-1}(R_j) = H_{k-1}(R_{j_1}) + H_{k-1}(R_{j_2}) \,.$$
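A small helper illustrating the gain computation for a candidate split, relying on this additivity (again our own sketch, with $\alpha = 0$ as above):
In [ ]:
# Gain of splitting a parent leaf with statistics (G_parent, H_parent) into two children.
def split_gain( G_parent, H_parent, G_left, H_left, lam = 1.0, gamma = 0.0 ) :
    G_right, H_right = G_parent - G_left, H_parent - H_left      # additivity of G and H
    score = lambda G, H : G * G / ( H + lam )
    return 0.5 * ( score( G_left, H_left )
                   + score( G_right, H_right )
                   - score( G_parent, H_parent ) ) - gamma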
The processing pipeline for one iteration of gradient boosting is as follows:
XGBoost implements two types of boosters: the linear and the tree boosters.
The gbtree booster ( GBTree : public IGradBooster) is located in gbtree-inl.hpp, and is responsible for constructing one or more regression trees for the provided fixed gradient statistics $(g^{k-1}_i, h^{k-1}_i)_{i=1}^n$ (const std::vector <bst_gpair > &), which are kept intact during this booster's run.
The main procedure that does all the work is GBTree->DoBoost(), which in turn invokes GBTree->BoostNewTrees(). The latter workhorse grows the required number of trees, determined by the parameter num_parallel_tree. For regular tree boosting this parameter is set to $1$, but for a boosted random forest its value determines the size of the forest.
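For example, a boosted random forest can be requested by raising num_parallel_tree and adding row and column subsampling (a sketch with hyperparameter values of our own choosing):
In [ ]:
# Grow a forest of 100 trees in a single boosting round.
rf_params = { 'objective' : 'multi:softmax', 'num_class' : 9, 'silent' : 1,
              'max_depth' : 6,
              'num_parallel_tree' : 100,       # number of trees grown per boosting round
              'subsample' : 0.8,               # row subsampling per tree
              'colsample_bytree' : 0.8,        # column subsampling per tree
              'eta' : 1.0 }                    # no shrinkage for a plain forest
rf_booster_ = xg.train( params = rf_params, dtrain = train_dmat, num_boost_round = 1 )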
Each regression tree in GBTree->BoostNewTrees() is constructed via sequential application of so-called updaters, among which the most frequently used are
The ColMaker updater does not take into consideration the tree size cost, and thus constructs the deepest possible binary regression tree. The ColMaker->Update() method performs the following steps:
The TreePruner updater is much simpler than the ColMaker. Its method TreePruner->Update() recursively tries to do cost-complexity pruning with respect to the split gains computed earlier. The pruning rule is whether the structural gain exceeds the threshold $\gamma$. In contrast to tree growth, pruning is not done in parallel.
The tree-based booster has a much richer set of parameters, though some coincide with those of the linear booster:
By default, the booster uses all features and the entire training dataset to construct each new tree. Other important default values are:
The gblinear booster ( GBLinear : public IGradBooster ) is defined in gblinear-inl.hpp. It constructs boosted linear models for the given gradient statistics $(g^{k-1}_i, h^{k-1}_i)_{i=1}^n$ (std::vector <bst_gpair > &), which, though passed without a const qualifier, are also left unchanged during the run.
Unlike gbtree, this booster does not rely on any auxiliary updaters. The main procedure that does all the work is GBLinear->DoBoost(), which estimates a linear approximation.
The linear booster requires only three parameters:
The regularization term of the objective function for a linear base model $f(x) = \beta_0 + x\beta$ is $$ \Omega(f) = \frac{\lambda}{2} \| \beta \|_2^2 + \alpha \| \beta \|_1 + \lambda_{\texttt{bias}} |\beta_0|^2 \,.$$
Note that, unlike scikit-learn, XGBoost defaults these parameters to $0$, i.e. no regularization.
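A minimal configuration sketch for the linear booster (parameter values are ours, chosen only to show where lambda, alpha and lambda_bias go):
In [ ]:
# Sketch: gblinear booster with explicit regularization.
linear_params = { 'booster' : 'gblinear',
                  'objective' : 'multi:softmax', 'num_class' : 9, 'silent' : 1,
                  'lambda' : 0.1,          # L2 penalty on the coefficients
                  'alpha' : 0.0,           # L1 penalty on the coefficients
                  'lambda_bias' : 0.0 }    # L2 penalty on the intercept
linear_booster_ = xg.train( params = linear_params, dtrain = train_dmat, num_boost_round = 10 )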
When we allow the model to become more complex (e.g. deeper trees), it has a better ability to fit the training data, resulting in a less biased model. However, such a complex model requires more data to fit.
The first way is to directly control model complexity:
* max_depth, min_child_weight and gamma.
The second way is to add randomness to make training robust to noise:
* subsample, colsample_bytree;
* reduce the step size eta, but remember to increase num_round when you do so (see the sketch below).
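A sketch combining both strategies (all values below are illustrative choices of ours, not recommended defaults):
In [ ]:
# A more conservative configuration: shallower trees, per-split gain threshold, subsampling, small eta.
conservative_params = { 'objective' : 'multi:softmax', 'num_class' : 9, 'silent' : 1,
                        'max_depth' : 4,
                        'min_child_weight' : 10,      # require more hessian mass in each leaf
                        'gamma' : 1.0,                # minimum gain needed to make a split
                        'subsample' : 0.8,            # row subsampling
                        'colsample_bytree' : 0.8,     # column subsampling
                        'eta' : 0.1 }                 # smaller step size ...
conservative_booster_ = xg.train( params = conservative_params,
                                  dtrain = train_dmat,
                                  num_boost_round = 100 )    # ... compensated by more boosting rounds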
When the dataset is extremely imbalanced, this can affect the training of the XGBoost model, and there are two ways to improve it. If you care about the ranking order (AUC) of your prediction: balance the positive and negative weights via scale_pos_weight.
If you care about predicting the right probability: in such a case, set the parameter max_delta_step.
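A sketch for a hypothetical imbalanced binary problem (the Otto data above is multi-class, so these parameters are purely illustrative):
In [ ]:
# Hypothetical binary-classification setup for an imbalanced dataset.
n_pos, n_neg = 1000, 50000                                           # made-up class counts
imbalanced_params = { 'objective' : 'binary:logistic', 'silent' : 1,
                      'scale_pos_weight' : float( n_neg ) / n_pos,   # if the ranking order (AUC) matters
                      'max_delta_step' : 1,                          # if calibrated probabilities matter
                      'eval_metric' : 'auc' }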
The problem $$ w G + \frac{1}{2}( H+\lambda ) w^2 + \alpha \bigl|w\bigr| \to \min_w\,,$$ can be solved by the following substitution: $w = w_+ - w_-$, with $w_- \cdot w_+ = 0$ and $w_-, w_+\geq 0$. Notice that it is assumed that $H+\lambda > 0$.
Indeed, the Lagrangian is
$$ L = ( w_+ G + \frac{1}{2}( H+\lambda ) w_+^2 + \alpha w_+ - \mu_+w_+ ) + (- w_- G + \frac{1}{2}( H+\lambda ) w_-^2 + \alpha w_- - \mu_-w_- ) - \mu w_-w_+\,, $$whence the KKT conditions are
Multiplying the FOC by $w_+$ and $w_-$ respectively, and using $w_-\cdot w_+ = 0$, yields $$ \bigl( G + ( H+\lambda ) w_+ + \alpha - \mu_+ \bigr) w_+ = 0 \,, $$ and $$ \bigl( -G + ( H+\lambda ) w_- + \alpha - \mu_- \bigr) w_- = 0 \,. $$
A pair $w_- = w_+ = 0$ is feasible and satisfies the FOC, while a pair $w_+ > 0$ and $w_- > 0$ is infeasible.
Now, if $w_->0$ and $w_+=0$, then $\mu_-=0$ and $$ -G + ( H + \lambda ) w_- + \alpha = 0 \,,$$ which yields $$ w_- = -\frac{\alpha - G}{ H + \lambda } \,,$$ whence $ \alpha < G $.
Conversely, if $w_+>0$ and $w_-=0$, then $\mu_+=0$, and $$ G + ( H+\lambda ) w_+ + \alpha = 0 \,. $$ This implies that $$ w_+ = -\frac{\alpha + G}{H+\lambda}\,, $$ which requires that $G<-\alpha$.
However, since the inequalities $G<-\alpha$ and $G>\alpha$ are incompatible for $\alpha \geq 0$, at most one of the two cases above can occur.
Therefore the optimal $w$ in this problem is given by $$ w = - (H+\lambda)^{-1}\bigl( \min(G + \alpha;0) + \max( G - \alpha;0) \bigr) \,. $$
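The closed form can be sanity-checked numerically against a brute-force grid search (our own sketch):
In [ ]:
# Verify the closed-form minimizer of w*G + (H + lambda)/2 * w^2 + alpha*|w| on a grid.
def closed_form_w( G, H, lam, alpha ) :
    return - ( min( G + alpha, 0.0 ) + max( G - alpha, 0.0 ) ) / ( H + lam )

G, H, lam, alpha = -3.0, 2.0, 1.0, 0.5
grid = np.linspace( -10, 10, 100001 )
objective = grid * G + 0.5 * ( H + lam ) * grid ** 2 + alpha * np.abs( grid )
print( closed_form_w( G, H, lam, alpha ), grid[ objective.argmin( ) ] )   # the two values should agree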