XGBoost: eXtreme Gradient Boosting

XGBoost

  • implementation of the Gradient Boosting algorithm
  • open source library supporting Python, R, and Julia on Windows, Linux, and Mac
  • author: Tianqi Chen at the University of Washington
  • developed by DMLC (Distributed Machine Learning Community)

Review: What is Gradient Boosting again?

  • ensemble method for regression and classification: many weak learners combine to create a strong learner
  • builds sequentially, typically using shallow trees, to form an additive model (see the sketch below)
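To make the idea concrete, here is a minimal sketch (illustrative only, not the sklearn or XGBoost implementation) in which each shallow tree is fit to the current residuals and added to the ensemble:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_boost(X, y, n_trees=100, learning_rate=0.1):
    # start from a constant prediction, then repeatedly model the residuals
    pred = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=2)  # weak learner: a shallow tree
        tree.fit(X, y - pred)                      # fit the current residuals
        pred += learning_rate * tree.predict(X)    # additive update
        trees.append(tree)
    return trees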

Bias/Variance tradeoff for sklearn ensemble Gradient Boosting:

  • control variance by using a small learning rate, subsampling (bagging), and max_features when building the trees (see the sketch after this list)
  • boosting reduces error mainly by reducing bias: it focuses on poor predictions and tries to model them better in the next iteration
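For example, in the sklearn implementation these knobs look like this (a sketch with illustrative values, not tuned settings):

from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=500,
                                 learning_rate=0.05,   # small learning rate
                                 subsample=0.8,        # bagging-style row subsampling
                                 max_features='sqrt',  # random subset of features per split
                                 max_depth=3)          # shallow trees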

What makes it XGBoost eXtreme?

  • focuses on improving speed and performance
  • Speed: parallelizes the building of each regression tree by evaluating the splits/branching in parallel - uses multiple CPUs.
  • Performance: the model adds a regularization term to the loss function to better control model complexity, on top of the hyperparameters seen in the sklearn implementation.

    • recall the objective function:
$$Obj(\Theta) = L(\theta) + \Omega(\Theta)$$

where the loss for linear regression is often the sum of squared errors:
$$L(\theta) = \sum_i (y_i-\hat{y}_i)^2 $$ or the logistic loss: $$L(\theta) = \sum_i[ y_i\ln (1+e^{-\hat{y}_i}) + (1-y_i)\ln (1+e^{\hat{y}_i})]$$

For XGBoost there is a regularization term to address model complexity: $$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$$

where gamma prunes leaves that do not provide sufficient gain, T is the number of leaves, lambda is the L2 regularization term (as in ridge regression), and w is the vector of scores on the leaves.
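In the XGBoost parameter dictionary these regularization terms appear directly as gamma and lambda (a sketch with illustrative values only):

param = {'objective': 'binary:logistic',
         'max_depth': 4,
         'eta': 0.1,      # learning rate
         'gamma': 1.0,    # the gamma term: minimum loss reduction required to keep a split
         'lambda': 1.0}   # the lambda term: L2 regularization on the leaf weights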

Other great features:

  • has a useful method called cv which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required (see the sketch after this list)
  • can run as a distributed system on multiple computers, e.g. on Hadoop
  • more flexibility: custom optimization objectives and evaluation criteria.
  • Can handle missing data - sparse aware (use the DMatrix API to create a sparse matrix)
  • Can build from existing trained models (warm_start). *sklearn GBM also has this feature.
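For example, cv can be called like this (a minimal sketch; dtrain is assumed to be a DMatrix built from your training data):

import xgboost as xgb

cv_results = xgb.cv(param, dtrain,
                    num_boost_round=300,
                    nfold=5,
                    early_stopping_rounds=10)  # stop adding trees once the CV metric stops improving
print(len(cv_results))  # number of boosting rounds actually kept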

Setup:

1. install on Mac

sudo pip install xgboost

2. ssh into AWS

check number of processors:

cat /proc/cpuinfo | grep processor | wc -l

3. install all packages

sudo dnf install gcc gcc-c++ make git unzip python python2-numpy python2-scipy python2-scikit-learn python2-pandas python2-matplotlib

4. install XGBoost

git clone --recursive https://github.com/dmlc/xgboost

cd xgboost

make -j32

cd python-package

sudo python setup.py install

-OR-

sudo pip install xgboost

5. confirm installation

python -c "import xgboost; print(xgboost.__version__)"

6. Train your xgboost model

Exit, then copy the script and training data from your local machine to the AWS instance (or use git clone):

scp -r -i xgboost-keypair.pem ../work fedora@52.53.173.84:/home/fedora/
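For reference, a minimal script.py could look like the sketch below (illustrative only; it assumes a train.csv in the work directory with the label in the first column and a 32-core instance):

import numpy as np
import xgboost as xgb

data = np.loadtxt('train.csv', delimiter=',')
dtrain = xgb.DMatrix(data[:, 1:], label=data[:, 0])   # features and label

param = {'objective': 'binary:logistic', 'max_depth': 4, 'eta': 0.1,
         'nthread': 32}                               # use all the CPUs on the instance
bst = xgb.train(param, dtrain, num_boost_round=300)
bst.save_model('model.bin')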

7. Log back into AWS

ssh -i xgboost-keypair.pem fedora@52.53.185.166

cd work

nohup python script.py &

disown # if you want to leave it running and logout

top # check that the CPUs are being used

8. Then exit and terminate your instance in the AWS console if you are only using it one time. Remember you get charged for usage!

Useful Functions:

xgb is the native booster interface and XGBClassifier is the sklearn wrapper. Depending on how you train/fit/cv, there are different methods and parameters for each. Using both can be useful, as in the examples below.
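A minimal sketch of the two interfaces side by side (X_train and y_train are assumed to be numpy arrays):

import xgboost as xgb
from xgboost import XGBClassifier

# native booster interface: works on DMatrix objects
dtrain = xgb.DMatrix(X_train, label=y_train)
bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=100)

# sklearn wrapper: behaves like any other sklearn estimator
clf = XGBClassifier(n_estimators=100, objective='binary:logistic')
clf.fit(X_train, y_train)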

  • feature importance: model.feature_importances_ (didn't work for me), or compute it by hand from the booster's fscore:

def get_xgb_imp(xgb, feat_names):
    from numpy import array
    imp_vals = xgb.booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f' + str(i), 0.)) for i in range(len(feat_names))}
    total = array(imp_dict.values()).sum()
    return {k: v / total for k, v in imp_dict.items()}

print get_xgb_imp(xgbm,feat_names)

where feat_names is the list of feature names and xgbm is the trained model

early stopping hyperparameter: if the model does not improve its evaluation metric in n rounds, it stops building additional trees
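For example, with the sklearn wrapper (a minimal sketch; X_train, y_train, X_test, y_test are assumed to exist, and early_stopping_rounds is passed to fit as in the xgboost versions of this era):

from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=300, learning_rate=0.1)
clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        early_stopping_rounds=10)  # stop if the test metric has not improved in 10 rounds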

use a watchlist in the train method to have the evaluation metric printed to the screen for both the train and test sets as you build your model:

watchlist = [ (xg_train,'train'), (xg_test, 'test') ]
num_round = 300

Training a model requires a parameter list and data set.

bst = xgb.train(param, xg_train, num_round, watchlist)

The watchlist output will look like this (in this example, train mlogloss is the left column and test mlogloss is the right column):

------------------new round------------------------
Will train until validation_1 error hasn't decreased in 10 rounds.
[0] validation_0-mlogloss:2.873181 validation_1-mlogloss:2.873842
[1] validation_0-mlogloss:2.755566 validation_1-mlogloss:2.755575
[2] validation_0-mlogloss:2.640008 validation_1-mlogloss:2.640016
[3] validation_0-mlogloss:2.528404 validation_1-mlogloss:2.528415

The logloss decreases as the model fits the data.

Misc: