Source code and notes adapted from:
http://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
http://machinelearningmastery.com/train-xgboost-models-cloud-amazon-web-services/
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
and from Matt's lecture on gradient boosting.
Performance: the model adds a regularization term to the loss function to better control model complexity and overfitting, on top of the hyperparameters seen in the sklearn implementation.
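Per the XGBoost paper (linked at the bottom of these notes), the objective being minimized is the usual training loss plus a complexity penalty on each tree:

$$\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$$

where T is the number of leaves in a tree and w the leaf weights, so gamma and lambda (reg_lambda in the Python API) penalize complex trees directly.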
Other great features include easy training in the cloud. Setup on an AWS EC2 (Fedora) instance, from
http://machinelearningmastery.com/train-xgboost-models-cloud-amazon-web-services/ :
# install the build tools and the Python scientific stack (Fedora/dnf)
sudo dnf install gcc gcc-c++ make git unzip python python2-numpy python2-scipy python2-scikit-learn python2-pandas python2-matplotlib
# clone and build XGBoost from source, then install the Python package
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
make -j32
cd python-package
sudo python setup.py install
-OR-
sudo pip install xgboost
# verify the installation
python -c "import xgboost; print(xgboost.__version__)"
# copy your working directory up to the instance
scp -r -i xgboost-keypair.pem ../work fedora@52.53.173.84:/home/fedora/
# log in to the instance
ssh -i xgboost-keypair.pem fedora@52.53.185.166
cd work
# run the script in the background so it keeps going after you disconnect
nohup python script.py &
disown # if you want to leave it running and logout
top # see that the CPUs are being used
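For context, a minimal sketch of what a script.py like the one above might contain (hypothetical: the dataset and parameter values are placeholders); with n_jobs=-1 (nthread=-1 in older versions) XGBoost uses every available core, which is what top should show:

# hypothetical script.py: cross-validated XGBoost run that uses all cores
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# placeholder data; substitute your own training set
X, y = make_classification(n_samples=20000, n_features=50, random_state=7)

# n_jobs=-1 asks for one thread per core (older xgboost versions call this nthread)
model = XGBClassifier(n_estimators=300, max_depth=6, n_jobs=-1)

scores = cross_val_score(model, X, y, cv=5, scoring='neg_log_loss')
print("mean log loss: %.4f" % -scores.mean())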
The native Booster object (from xgb.train) and XGBClassifier (the sklearn wrapper) are two interfaces to the same library. Depending on how you train/fit/cross-validate, there are different methods and parameters for each; using both can be useful, as in the examples below.
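A minimal sketch of the two interfaces side by side (illustrative data; the parameter values are just placeholders):

import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# native API: data in a DMatrix, parameters in a dict, training via xgb.train
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {'objective': 'multi:softprob', 'num_class': 3, 'max_depth': 3}
bst = xgb.train(params, dtrain, num_boost_round=50)
native_preds = bst.predict(dtest)  # class probabilities for multi:softprob

# sklearn wrapper: same booster underneath, but fit/predict/grid-search style usage
clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(X_train, y_train)
wrapper_preds = clf.predict(X_test)  # class labels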
# normalized feature importances from a fitted sklearn-wrapper model
def get_xgb_imp(xgb, feat_names):
    from numpy import array
    # raw importance scores keyed by 'f0', 'f1', ... (newer xgboost versions use get_booster() instead of booster())
    imp_vals = xgb.booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f' + str(i), 0.)) for i in range(len(feat_names))}
    total = array(list(imp_dict.values())).sum()
    # normalize so the importances sum to 1
    return {k: v / total for k, v in imp_dict.items()}

print(get_xgb_imp(xgbm, feat_names))  # xgbm is a fitted XGBClassifier
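With the sklearn wrapper there is also a built-in shortcut in recent xgboost versions: the fitted estimator exposes feature_importances_, which should already be normalized to sum to 1, so something like this (assuming xgbm and feat_names as above) gives comparable numbers:

importances = xgbm.feature_importances_  # one score per input column
for name, score in sorted(zip(feat_names, importances), key=lambda t: -t[1]):
    print("%s: %.3f" % (name, score))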
Early stopping hyperparameter: if the model does not improve its evaluation metric for n rounds, it stops building additional trees (a sketch combining this with the watchlist follows the example output below).
Use a watchlist in the train method to have the evaluation metric printed to the screen for both the train and test sets as the model is built:
watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 300
bst = xgb.train(param, xg_train, num_round, watchlist)
The watchlist output looks like this (train metric in the left column, test log loss in the right column; the validation_0/validation_1 labels here come from the sklearn wrapper's eval_set, whereas xgb.train uses the names you give in the watchlist):

------------------new round------------------------
Will train until validation_1 error hasn't decreased in 10 rounds.
[0]  validation_0-mlogloss:2.873181  validation_1-mlogloss:2.873842
[1]  validation_0-mlogloss:2.755566  validation_1-mlogloss:2.755575
[2]  validation_0-mlogloss:2.640008  validation_1-mlogloss:2.640016
[3]  validation_0-mlogloss:2.528404  validation_1-mlogloss:2.528415

The log loss decreases as the model fits the data.
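Putting early stopping and the watchlist together, a minimal sketch (xg_train, xg_test, and param as above; the values are placeholders): training stops once the last set in the watchlist has not improved for 10 rounds.

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
bst = xgb.train(param, xg_train, num_boost_round=300,
                evals=watchlist, early_stopping_rounds=10)

# best round found before the test metric stopped improving
print(bst.best_iteration, bst.best_score)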
Paper on XGBoost: https://arxiv.org/pdf/1603.02754v3.pdf
Windows installation: https://www.ibm.com/developerworks/community/blogs/jfp/entry/Installing_XGBoost_For_Anaconda_on_Windows?lang=en
Mac installation: https://www.ibm.com/developerworks/community/blogs/jfp/entry/Installing_XGBoost_on_Mac_OSX?lang=en
Hyperparameter tuning: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
Hadoop and XGBoost: http://xgboost.readthedocs.io/en/latest/tutorials/aws_yarn.html
Spark and XGBoost: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html
Different examples of XGBoost: https://github.com/dmlc/xgboost/tree/master/demo