We use the XGBoost Python package to separate signal from background for rare radiative decays $b \rightarrow s (d) \gamma$. XGBoost is a scalable, distributed implementation of gradient tree boosting that parallelizes the construction of each individual tree (the boosting iterations themselves remain sequential), making cross-validation fast relative to other iterative algorithms. Refer to the original paper by Chen et al.: https://arxiv.org/pdf/1603.02754v1.pdf as well as the GitHub repository: https://github.com/dmlc/xgboost
Author: Justin Tan - 5/04/17
In [39]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import xgboost as xgb
import time, os
In [40]:
# Set training mode and decay channel.
mode = 'gamma_only'
channel = 'kstar0'
Load feature vectors saved as Pandas dataframe, convert to data matrix structure used by XGBoost.
In [41]:
# Kstar0 channel dataframe
df = pd.read_hdf('/home/ubuntu/radiative/df/kstar0/kstar0_gamma_sig_cont.h5', 'df')
In [9]:
# Alternative: rho0 channel dataframe
df = pd.read_hdf('/home/ubuntu/radiative/df/rho0/std_norm_sig_cus.h5', 'df')
In [ ]:
from sklearn.model_selection import train_test_split
# Split data into training, testing sets
df_X_train, df_X_test, df_y_train, df_y_test = train_test_split(df[df.columns[:-1]], df['labels'],
                                                                test_size = 0.05, random_state = 24601)
feature_names = list(df.columns[:-1])
dTrain = xgb.DMatrix(data = df_X_train.values, label = df_y_train.values, feature_names = feature_names)
dTest = xgb.DMatrix(data = df_X_test.values, label = df_y_test.values, feature_names = feature_names)
# Save to XGBoost binary file for faster loading
dTrain.save_binary("dTrain_" + mode + "_" + channel + ".buffer")
dTest.save_binary("dTest_" + mode + "_" + channel + ".buffer")
Specify the starting hyperparameters for the boosting algorithm. Ideally these would be optimized using cross-validation. Refer to https://github.com/dmlc/xgboost/blob/master/doc/parameter.md for the full list. The regularization parameters are important: they control model complexity and inject randomness to make training robust to noise.
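To make the randomness concrete: a short NumPy sketch of what the `subsample` and `colsample_bytree` parameters do conceptually (XGBoost performs this sampling internally for each tree; the sizes here are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
n_rows, n_cols = 1000, 30

# 'subsample' = 0.9: each tree is grown on a random 90% of the training rows
row_idx = rng.choice(n_rows, size=int(0.9 * n_rows), replace=False)

# 'colsample_bytree' = 0.9: each tree sees a random 90% of the feature columns
col_idx = rng.choice(n_cols, size=int(0.9 * n_cols), replace=False)
```

Because different trees see different random subsets of rows and columns, their errors decorrelate, which is what makes the ensemble robust to noise.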
In [46]:
# Boosting hyperparameters
params = {'eta': 0.2, 'seed':0, 'subsample': 0.9, 'colsample_bytree': 0.9, 'gamma': 0.05,
'objective': 'binary:logistic', 'max_depth':5, 'min_child_weight':1, 'silent':0}
# Specify multiple evaluation metrics for validation set
params['eval_metric'] = 'error@0.5'
pList = list(params.items())+[('eval_metric', 'auc')]
In [48]:
# Number of boosted trees to construct
nTrees = 75
# Specify validation set to watch performance
evalList = [(dTrain,'train'), (dTest,'eval')]
evalDict = {}
print("Starting model training\n")
start_time = time.time()
# Train the model using the above parameters
bst = xgb.train(params = pList, dtrain = dTrain, evals = evalList, num_boost_round = nTrees,
evals_result = evalDict, early_stopping_rounds = 20)
# Save our model
model_name = mode + "_" + channel + "_" + str(nTrees) + '.model'
bst.save_model(model_name)
delta_t = time.time() - start_time
print("Training ended. Elapsed time: (%.3f s)" %(delta_t))
Hyperparameters are the settings that control the behaviour of the algorithm and are not learned from the training data. The best value for each depends on the dataset. We can optimize them by performing a grid search over the parameter space using $n$-fold cross-validation: the original sample is randomly partitioned into $n$ equally sized subsamples; a single subsample is retained as the validation set and the remaining $n-1$ subsamples are used as training data. This is repeated with each subsample used exactly once as the validation data. XGBoost is compatible with scikit-learn's API, so we can reuse code from our AdaBoost notebook. See http://scikit-learn.org/stable/modules/grid_search.html
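The partitioning scheme above can be sketched in plain NumPy (illustrative only; `RandomizedSearchCV` below constructs the folds internally):

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Randomly partition sample indices into n_folds disjoint validation folds,
    returning (train_indices, validation_indices) pairs."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, n_folds)
    # Fold i is held out for validation; the remaining n-1 folds form the training set
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(n_folds)]

splits = kfold_indices(100, 5)
# Every sample appears in exactly one validation fold
all_val = np.concatenate([val for _, val in splits])
```

Each candidate hyperparameter setting is scored by averaging the validation metric over the $n$ folds.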
In [ ]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import scipy.stats as stats
# Set number of parameter settings to be sampled
n_iter = 25
# Set parameter distributions for random search CV using the AUC metric
cv_paramDist = {'learning_rate': stats.uniform(loc = 0.05, scale = 0.15), # 'n_estimators': stats.randint(150, 300),
'colsample_bytree': stats.uniform(0.8, 0.195),
'subsample': stats.uniform(loc = 0.8, scale = 0.195),
'max_depth': [3, 4, 5, 6],
'min_child_weight': [1, 2, 3]}
fixed_params = {'n_estimators': 350, 'seed': 24601, 'objective': 'binary:logistic'}
xgb_randCV = RandomizedSearchCV(xgb.XGBClassifier(**fixed_params), cv_paramDist, scoring = 'roc_auc', cv = 5,
n_iter = n_iter, verbose = 2, n_jobs = -1)
start = time.time()
xgb_randCV.fit(df_X_train.values, df_y_train.values)
print("RandomizedSearchCV complete. Time elapsed: %.2f seconds for %d candidates" % ((time.time() - start), n_iter))
In [25]:
# Best set of hyperparameters
xgb_randCV.best_params_
Out[25]:
In [ ]:
optParams = {'eta': 0.1, 'seed':0, 'subsample': 0.95, 'colsample_bytree': 0.9, 'gamma': 0.05,
'objective': 'binary:logistic', 'max_depth':6, 'min_child_weight':1, 'silent':0}
In [33]:
# Cross-validation on optimal parameters
xgb_cv = xgb.cv(params = optParams, dtrain = dTrain, nfold = 5, metrics = ['error', 'auc'], verbose_eval = 10,
stratified = True, as_pandas = True, early_stopping_rounds = 30, num_boost_round = 500)
Model performance varies dramatically with the choice of hyperparameters. When tuning a large number of hyperparameters, a grid search may be prohibitively expensive; instead we can randomly sample from distributions over the hyperparameters and evaluate the model at the sampled points.
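The sampling step can be sketched by drawing candidate settings directly from the same scipy distributions used above (purely illustrative; `RandomizedSearchCV` performs this sampling internally):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(24601)

# uniform(loc, scale) is uniform on [loc, loc + scale]
param_dist = {'learning_rate': stats.uniform(loc=0.05, scale=0.15),
              'subsample': stats.uniform(loc=0.8, scale=0.195)}

# Draw a few candidate hyperparameter settings; each would be scored by CV
candidates = [{name: dist.rvs(random_state=rng) for name, dist in param_dist.items()}
              for _ in range(3)]
```

Random search covers a high-dimensional parameter space more efficiently than a grid of the same size, since each draw varies every parameter simultaneously.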
In [50]:
def plot_ROC_curve(y_true, network_output, meta):
    """
    Plots the receiver operating characteristic curve
    Inputs: y_true: Binary class labels
            network_output: Classifier output probabilities
            meta: Subtitle string describing the model
    Output: AUC: Area under the ROC curve
    """
from sklearn.metrics import roc_curve, auc
# Compute ROC curve, integrate
fpr, tpr, thresholds = roc_curve(y_true, network_output)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.axes([.1,.1,.8,.7])
plt.figtext(.5,.9, r'$\mathrm{Receiver \;operating \;characteristic}$', fontsize=15, ha='center')
plt.figtext(.5,.85, meta, fontsize=10,ha='center')
plt.plot(fpr, tpr, color='darkorange',
lw=2, label='ROC curve - custom (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1.0, linestyle='--')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel(r'$\mathrm{False \;Positive \;Rate}$')
plt.ylabel(r'$\mathrm{True \;Positive \;Rate}$')
plt.legend(loc="lower right")
plt.savefig("graphs/" + "clf_ROCcurve.pdf",format='pdf', dpi=1000)
plt.show()
plt.gcf().clear()
In [51]:
# Predict on the test set using the trained model
xgb_pred = bst.predict(dTest)
meta = r'XGBoost - max_depth: 5, subsample: 0.9, $\eta = 0.2$'
In [52]:
plot_ROC_curve(df_y_test.values, xgb_pred, meta)
Plot the importances of the 20 highest-ranked features, where importance is measured by the number of times each feature is used to split the data across all boosted trees (the default metric returned by get_fscore).
In [49]:
%matplotlib inline
importances = bst.get_fscore()
df_importance = pd.DataFrame({'Importance': list(importances.values()), 'Feature': list(importances.keys())})
df_importance.sort_values(by = 'Importance', inplace = True)
df_importance[-20:].plot(kind = 'barh', x = 'Feature', color = 'orange', figsize = (10,10),
title = 'Feature Importances')
Out[49]: