XGBoost

We use the XGBoost Python package to separate signal from background for rare radiative decays $b \rightarrow s (d) \gamma$. XGBoost is a scalable, distributed implementation of gradient tree boosting that parallelises the construction of each tree, making cross-validation fast relative to other iterative algorithms. Refer to the original paper by Chen et al.: https://arxiv.org/pdf/1603.02754v1.pdf as well as the GitHub repository: https://github.com/dmlc/xgboost

Author: Justin Tan - 5/04/17


In [39]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import xgboost as xgb
import time, os

Training


In [40]:
# Set training mode, hadronic channel.
mode = 'gamma_only'
channel = 'kstar0'

Load the feature vectors saved as a Pandas DataFrame and convert them to the DMatrix structure used by XGBoost.


In [41]:
df = pd.read_hdf('/home/ubuntu/radiative/df/kstar0/kstar0_gamma_sig_cont.h5', 'df')

In [9]:
df = pd.read_hdf('/home/ubuntu/radiative/df/rho0/std_norm_sig_cus.h5', 'df')

In [ ]:
from sklearn.model_selection import train_test_split
# Split data into training, testing sets
df_X_train, df_X_test, df_y_train, df_y_test = train_test_split(df[df.columns[:-1]], df['labels'],
                                                                test_size = 0.05, random_state = 24601)

dTrain = xgb.DMatrix(data = df_X_train.values, label = df_y_train.values, feature_names = df.columns[:-1])
dTest = xgb.DMatrix(data = df_X_test.values, label = df_y_test.values, feature_names = df.columns[:-1])
# Save to XGBoost binary file for faster loading
dTrain.save_binary("dTrain" + mode + channel + ".buffer")
dTest.save_binary("dTest" + mode + channel + ".buffer")

Specify the starting hyperparameters for the boosting algorithm; ideally these would be optimized using cross-validation. Refer to https://github.com/dmlc/xgboost/blob/master/doc/parameter.md for the full list. The most important regularization parameters control model complexity and add randomness to make training robust against noise.

  • eta: Step-size shrinkage (learning rate); scales down the contribution of each newly added tree to make boosting more conservative
  • subsample: Fraction of training instances sampled to grow each tree
  • max_depth: Maximum depth of the tree structure. Larger depth $\rightarrow$ greater complexity/overfitting
  • gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree

In [46]:
# Boosting hyperparameters
params = {'eta': 0.2, 'seed':0, 'subsample': 0.9, 'colsample_bytree': 0.9, 'gamma': 0.05, 
             'objective': 'binary:logistic', 'max_depth':5, 'min_child_weight':1, 'silent':0}

# Specify multiple evaluation metrics for the validation set.
params['eval_metric'] = 'error@0.5'
# A dict cannot hold duplicate keys, so convert to a list of tuples and
# append a second 'eval_metric' entry to track both error and AUC
pList = list(params.items())+[('eval_metric', 'auc')]

In [48]:
# Number of boosted trees to construct
nTrees = 75
# Specify validation set to watch performance
evalList  = [(dTrain,'train'), (dTest,'eval')]
evalDict = {}

print("Starting model training\n")
start_time = time.time()
# Train the model using the above parameters
bst = xgb.train(params = pList, dtrain = dTrain, evals = evalList, num_boost_round = nTrees, 
          evals_result = evalDict, early_stopping_rounds = 20)

# Save our model
model_name = mode + channel + str(nTrees) + '.model'
bst.save_model(model_name)

delta_t = time.time() - start_time
print("Training ended. Elapsed time: (%.3f s)" %(delta_t))


Starting model training

[0]	train-error:0.094659	train-auc:0.901981	eval-error:0.093589	eval-auc:0.902183
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 20 rounds.
[1]	train-error:0.089944	train-auc:0.932601	eval-error:0.089384	eval-auc:0.932535
[2]	train-error:0.079728	train-auc:0.946918	eval-error:0.078795	eval-auc:0.947209
[3]	train-error:0.079144	train-auc:0.949071	eval-error:0.078444	eval-auc:0.949497
[4]	train-error:0.078459	train-auc:0.949874	eval-error:0.077205	eval-auc:0.950388
[5]	train-error:0.077653	train-auc:0.952025	eval-error:0.076503	eval-auc:0.952759
[6]	train-error:0.075482	train-auc:0.958021	eval-error:0.074239	eval-auc:0.958771
[7]	train-error:0.071186	train-auc:0.961964	eval-error:0.069933	eval-auc:0.962892
[8]	train-error:0.072895	train-auc:0.962241	eval-error:0.071744	eval-auc:0.963223
[9]	train-error:0.071959	train-auc:0.963297	eval-error:0.070903	eval-auc:0.964307
[10]	train-error:0.071651	train-auc:0.964055	eval-error:0.070654	eval-auc:0.965013
[11]	train-error:0.067705	train-auc:0.968405	eval-error:0.066532	eval-auc:0.96918
[12]	train-error:0.065186	train-auc:0.970447	eval-error:0.064278	eval-auc:0.971304
[13]	train-error:0.064557	train-auc:0.971126	eval-error:0.06364	eval-auc:0.971924
[14]	train-error:0.063972	train-auc:0.971778	eval-error:0.062993	eval-auc:0.972546
[15]	train-error:0.062718	train-auc:0.972939	eval-error:0.061718	eval-auc:0.973896
[16]	train-error:0.061496	train-auc:0.97418	eval-error:0.060544	eval-auc:0.975273
[17]	train-error:0.06127	train-auc:0.974411	eval-error:0.060397	eval-auc:0.97553
[18]	train-error:0.059228	train-auc:0.97605	eval-error:0.058151	eval-auc:0.977215
[19]	train-error:0.058629	train-auc:0.976385	eval-error:0.05744	eval-auc:0.977519
[20]	train-error:0.058149	train-auc:0.976687	eval-error:0.057153	eval-auc:0.977784
[21]	train-error:0.057846	train-auc:0.976899	eval-error:0.056802	eval-auc:0.978034
[22]	train-error:0.057032	train-auc:0.977463	eval-error:0.055859	eval-auc:0.978567
[23]	train-error:0.055977	train-auc:0.978043	eval-error:0.054834	eval-auc:0.979155
[24]	train-error:0.055463	train-auc:0.978427	eval-error:0.054298	eval-auc:0.979548
[25]	train-error:0.054959	train-auc:0.978816	eval-error:0.053845	eval-auc:0.979854
[26]	train-error:0.054301	train-auc:0.979176	eval-error:0.053318	eval-auc:0.980146
[27]	train-error:0.054225	train-auc:0.979264	eval-error:0.053106	eval-auc:0.980206
[28]	train-error:0.053993	train-auc:0.979367	eval-error:0.052939	eval-auc:0.980297
[29]	train-error:0.053319	train-auc:0.979752	eval-error:0.052246	eval-auc:0.980647
[30]	train-error:0.052698	train-auc:0.979959	eval-error:0.051535	eval-auc:0.980821
[31]	train-error:0.052484	train-auc:0.980049	eval-error:0.051535	eval-auc:0.980907
[32]	train-error:0.052303	train-auc:0.980144	eval-error:0.051516	eval-auc:0.980987
[33]	train-error:0.052124	train-auc:0.980235	eval-error:0.051211	eval-auc:0.981068
[34]	train-error:0.051921	train-auc:0.980379	eval-error:0.051165	eval-auc:0.98121
[35]	train-error:0.05149	train-auc:0.980628	eval-error:0.05074	eval-auc:0.981483
[36]	train-error:0.051074	train-auc:0.98098	eval-error:0.050213	eval-auc:0.981812
[37]	train-error:0.050828	train-auc:0.981114	eval-error:0.050056	eval-auc:0.981897
[38]	train-error:0.05047	train-auc:0.981337	eval-error:0.049779	eval-auc:0.982123
[39]	train-error:0.050322	train-auc:0.981376	eval-error:0.049641	eval-auc:0.982149
[40]	train-error:0.050158	train-auc:0.981432	eval-error:0.04953	eval-auc:0.982181
[41]	train-error:0.049599	train-auc:0.981713	eval-error:0.048994	eval-auc:0.98245
[42]	train-error:0.049067	train-auc:0.982134	eval-error:0.048301	eval-auc:0.982821
[43]	train-error:0.04889	train-auc:0.982231	eval-error:0.048042	eval-auc:0.982911
[44]	train-error:0.048361	train-auc:0.982548	eval-error:0.047811	eval-auc:0.983236
[45]	train-error:0.048242	train-auc:0.98262	eval-error:0.047728	eval-auc:0.983301
[46]	train-error:0.048001	train-auc:0.982729	eval-error:0.047451	eval-auc:0.983402
[47]	train-error:0.04753	train-auc:0.98296	eval-error:0.046905	eval-auc:0.983616
[48]	train-error:0.047389	train-auc:0.983019	eval-error:0.046868	eval-auc:0.9837
[49]	train-error:0.047333	train-auc:0.983054	eval-error:0.046739	eval-auc:0.983733
[50]	train-error:0.047225	train-auc:0.983086	eval-error:0.046674	eval-auc:0.983746
[51]	train-error:0.047026	train-auc:0.983142	eval-error:0.04648	eval-auc:0.983797
[52]	train-error:0.046918	train-auc:0.9832	eval-error:0.046166	eval-auc:0.983854
[53]	train-error:0.046805	train-auc:0.983243	eval-error:0.046101	eval-auc:0.983886
[54]	train-error:0.046769	train-auc:0.983256	eval-error:0.046101	eval-auc:0.983894
[55]	train-error:0.046635	train-auc:0.983394	eval-error:0.046055	eval-auc:0.983964
[56]	train-error:0.046592	train-auc:0.983417	eval-error:0.046037	eval-auc:0.983982
[57]	train-error:0.046243	train-auc:0.983611	eval-error:0.045686	eval-auc:0.984178
[58]	train-error:0.045946	train-auc:0.983743	eval-error:0.045214	eval-auc:0.984292
[59]	train-error:0.045595	train-auc:0.983964	eval-error:0.044882	eval-auc:0.984506
[60]	train-error:0.045264	train-auc:0.984176	eval-error:0.044688	eval-auc:0.984717
[61]	train-error:0.045208	train-auc:0.984218	eval-error:0.044632	eval-auc:0.984755
[62]	train-error:0.045091	train-auc:0.984267	eval-error:0.04454	eval-auc:0.984815
[63]	train-error:0.044893	train-auc:0.984393	eval-error:0.04429	eval-auc:0.984939
[64]	train-error:0.044695	train-auc:0.984482	eval-error:0.044179	eval-auc:0.985011
[65]	train-error:0.044519	train-auc:0.98455	eval-error:0.043948	eval-auc:0.98507
[66]	train-error:0.044462	train-auc:0.984574	eval-error:0.043967	eval-auc:0.985088
[67]	train-error:0.04441	train-auc:0.984603	eval-error:0.043911	eval-auc:0.985112
[68]	train-error:0.044242	train-auc:0.984659	eval-error:0.043948	eval-auc:0.985166
[69]	train-error:0.044082	train-auc:0.984745	eval-error:0.0438	eval-auc:0.985246
[70]	train-error:0.044037	train-auc:0.984769	eval-error:0.043828	eval-auc:0.985257
[71]	train-error:0.043992	train-auc:0.984785	eval-error:0.04381	eval-auc:0.985272
[72]	train-error:0.043922	train-auc:0.984809	eval-error:0.043773	eval-auc:0.985291
[73]	train-error:0.043885	train-auc:0.984821	eval-error:0.0438	eval-auc:0.985301
[74]	train-error:0.043832	train-auc:0.984855	eval-error:0.04369	eval-auc:0.985329
Training ended. Elapsed time: (141.286 s)
  • Crossfeed accuracy: 96%, AUC = 0.991
  • Continuum accuracy: ~ 99%, AUC ~ 1
  • Custom accuracy: ~

Optimizing Hyperparameters

The parameters which control the behaviour of the algorithm itself, rather than being learned from the training data, are called hyperparameters. The best value of each depends on the dataset. We can optimize these by performing a grid search over the parameter space, using $n$-fold cross-validation: the original sample is randomly partitioned into $n$ equally sized subsamples; a single subsample is retained as the validation set and the remaining $n-1$ subsamples are used as training data. This is repeated so that each subsample is used exactly once as the validation data. XGBoost is compatible with scikit-learn's API, so we can reuse code from our AdaBoost notebook. See http://scikit-learn.org/stable/modules/grid_search.html
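As an illustration of the exhaustive variant, a minimal GridSearchCV sketch over a small, untuned grid might look like the following (the grid values are placeholders, not recommendations); the notebook itself uses the randomised search below:

from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Illustrative grid: every combination is scored with 5-fold CV, so the
# number of fits grows multiplicatively with each added parameter
param_grid = {'max_depth': [3, 5],
              'learning_rate': [0.1, 0.2],
              'subsample': [0.8, 0.9]}

grid_cv = GridSearchCV(xgb.XGBClassifier(n_estimators = 100, objective = 'binary:logistic', seed = 24601),
                       param_grid, scoring = 'roc_auc', cv = 5, n_jobs = -1)
grid_cv.fit(df_X_train.values, df_y_train.values)
print(grid_cv.best_params_, grid_cv.best_score_)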


In [ ]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import scipy.stats as stats

# Set number of parameter settings to be sampled
n_iter = 25

# Set parameter distributions for random search CV using the AUC metric
cv_paramDist = {'learning_rate': stats.uniform(loc = 0.05, scale = 0.15), # 'n_estimators': stats.randint(150, 300),
                'colsample_bytree': stats.uniform(0.8, 0.195),
                'subsample': stats.uniform(loc = 0.8, scale = 0.195),
                'max_depth': [3, 4, 5, 6],
                'min_child_weight': [1, 2, 3]}

fixed_params = {'n_estimators': 350, 'seed': 24601, 'objective': 'binary:logistic'}
xgb_randCV = RandomizedSearchCV(xgb.XGBClassifier(**fixed_params), cv_paramDist, scoring = 'roc_auc', cv = 5, 
                                n_iter = n_iter, verbose = 2, n_jobs = -1)

start = time.time()
xgb_randCV.fit(df_X_train.values, df_y_train.values)
print("RandomizedSearchCV complete. Time elapsed: %.2f seconds for %d candidates" % ((time.time() - start), n_iter))


Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV] colsample_bytree=0.913962813044, learning_rate=0.140240583722, max_depth=5, min_child_weight=2, subsample=0.849028425443 
[CV] colsample_bytree=0.913962813044, learning_rate=0.140240583722, max_depth=5, min_child_weight=2, subsample=0.849028425443 
[CV] colsample_bytree=0.913962813044, learning_rate=0.140240583722, max_depth=5, min_child_weight=2, subsample=0.849028425443 
[CV] colsample_bytree=0.913962813044, learning_rate=0.140240583722, max_depth=5, min_child_weight=2, subsample=0.849028425443 
[CV]  colsample_bytree=0.913962813044, learning_rate=0.140240583722, max_depth=5, min_child_weight=2, subsample=0.849028425443, total=15.0min
[CV] colsample_bytree=0.913962813044, learning_rate=0.140240583722, max_depth=5, min_child_weight=2, subsample=0.849028425443 
[CV]  colsample_bytree=0.913962813044, learning_rate=0.140240583722, max_depth=5, min_child_weight=2, subsample=0.849028425443, total=15.8min
[CV] colsample_bytree=0.843093720178, learning_rate=0.0667731283631, max_depth=6, min_child_weight=3, subsample=0.856622998556 
[CV]  colsample_bytree=0.913962813044, learning_rate=0.140240583722, max_depth=5, min_child_weight=2, subsample=0.849028425443, total=17.7min
[CV] colsample_bytree=0.843093720178, learning_rate=0.0667731283631, max_depth=6, min_child_weight=3, subsample=0.856622998556 
[CV]  colsample_bytree=0.913962813044, learning_rate=0.140240583722, max_depth=5, min_child_weight=2, subsample=0.849028425443, total=17.7min
[CV] colsample_bytree=0.843093720178, learning_rate=0.0667731283631, max_depth=6, min_child_weight=3, subsample=0.856622998556 

In [25]:
# Best set of hyperparameters
xgb_randCV.best_params_


Out[25]:
{'colsample_bytree': 0.82779669726408012,
 'learning_rate': 0.10568674671267647,
 'max_depth': 6,
 'min_child_weight': 1,
 'subsample': 0.99190140241207247}

In [ ]:
optParams = {'eta': 0.1, 'seed':0, 'subsample': 0.95, 'colsample_bytree': 0.9, 'gamma': 0.05, 
             'objective': 'binary:logistic', 'max_depth':6, 'min_child_weight':1, 'silent':0}

In [33]:
# Cross-validation on optimal parameters
xgb_cv = xgb.cv(params = optParams, dtrain = dTrain, nfold = 5, metrics = ['error', 'auc'], verbose_eval = 10, 
                stratified = True, as_pandas = True, early_stopping_rounds = 30, num_boost_round = 500)


[0]	train-auc:0.861859+0.000278173	train-error:0.205978+0.000608732	test-auc:0.860835+0.00146269	test-error:0.207193+0.00196467
[10]	train-auc:0.897839+0.00052163	train-error:0.179376+0.000859587	test-auc:0.896055+0.00147139	test-error:0.181291+0.0016627
[20]	train-auc:0.90633+0.000345617	train-error:0.172124+0.000265181	test-auc:0.903926+0.0016062	test-error:0.174492+0.00190713
[30]	train-auc:0.910556+0.000486815	train-error:0.168388+0.000565942	test-auc:0.907547+0.00107931	test-error:0.171029+0.00167272
[40]	train-auc:0.913759+9.17322e-05	train-error:0.165542+0.000198846	test-auc:0.910145+0.00118215	test-error:0.168821+0.00146569
[50]	train-auc:0.916147+0.000203117	train-error:0.163324+0.000250475	test-auc:0.911946+0.00116153	test-error:0.167316+0.00156567
[60]	train-auc:0.91799+0.000224505	train-error:0.161606+0.00043436	test-auc:0.913203+0.00112846	test-error:0.16619+0.00176853
[70]	train-auc:0.91963+0.000194173	train-error:0.159949+0.000354578	test-auc:0.914253+0.00102195	test-error:0.164908+0.00175753
[80]	train-auc:0.920769+0.000328737	train-error:0.158756+0.000654639	test-auc:0.914799+0.000959314	test-error:0.164298+0.00155099
[90]	train-auc:0.922105+0.000330918	train-error:0.157303+0.00073026	test-auc:0.915592+0.000951738	test-error:0.163728+0.00134518
[100]	train-auc:0.923058+0.000232756	train-error:0.156294+0.00066907	test-auc:0.916012+0.000981623	test-error:0.163143+0.00140979
[110]	train-auc:0.923902+0.000186051	train-error:0.155393+0.000527797	test-auc:0.916307+0.000978867	test-error:0.162889+0.00160488
[120]	train-auc:0.924669+0.000266909	train-error:0.154601+0.000667477	test-auc:0.916556+0.000938846	test-error:0.162596+0.00137669
[130]	train-auc:0.92551+0.000270898	train-error:0.153712+0.000626003	test-auc:0.916832+0.000912862	test-error:0.162156+0.00159439
[140]	train-auc:0.926292+0.000363167	train-error:0.152825+0.000674223	test-auc:0.917143+0.000836029	test-error:0.161805+0.00157142
[150]	train-auc:0.927003+0.000315187	train-error:0.152045+0.000610195	test-auc:0.917344+0.000833794	test-error:0.16175+0.00143372
[160]	train-auc:0.927731+0.000339234	train-error:0.151279+0.000635061	test-auc:0.917523+0.000800286	test-error:0.161549+0.00137565
[170]	train-auc:0.928358+0.00031241	train-error:0.15058+0.00061379	test-auc:0.91764+0.000844162	test-error:0.161414+0.00134731
[180]	train-auc:0.928953+0.000259039	train-error:0.149815+0.000508516	test-auc:0.917723+0.000891276	test-error:0.161335+0.00147907
[190]	train-auc:0.929623+0.000301913	train-error:0.149031+0.000603195	test-auc:0.917874+0.000876784	test-error:0.161231+0.00150901
[200]	train-auc:0.930226+0.000354664	train-error:0.148341+0.000672246	test-auc:0.917956+0.000836115	test-error:0.161216+0.0013835
[210]	train-auc:0.930786+0.000368693	train-error:0.147719+0.000693539	test-auc:0.91801+0.000793866	test-error:0.161289+0.00131375
[220]	train-auc:0.931466+0.000314597	train-error:0.146976+0.000739318	test-auc:0.918202+0.000809104	test-error:0.161027+0.00129596
[230]	train-auc:0.931947+0.000307046	train-error:0.146421+0.000681009	test-auc:0.918229+0.000807531	test-error:0.161017+0.00134792
[240]	train-auc:0.932517+0.000277878	train-error:0.145746+0.000593019	test-auc:0.918297+0.000826561	test-error:0.160867+0.00135259
[250]	train-auc:0.933033+0.000316446	train-error:0.145152+0.000592179	test-auc:0.918327+0.000842301	test-error:0.160918+0.00118708
[260]	train-auc:0.933541+0.000313808	train-error:0.144526+0.000668587	test-auc:0.918369+0.000865626	test-error:0.160852+0.00123704
[270]	train-auc:0.934128+0.000309396	train-error:0.143809+0.000639402	test-auc:0.918451+0.000847163	test-error:0.160771+0.00120248
[280]	train-auc:0.934665+0.000317387	train-error:0.143155+0.000650691	test-auc:0.918499+0.00080859	test-error:0.160664+0.0012407
[290]	train-auc:0.935228+0.000322943	train-error:0.142517+0.000614368	test-auc:0.918582+0.000791712	test-error:0.16058+0.00107991
[300]	train-auc:0.935708+0.00036061	train-error:0.141971+0.000695065	test-auc:0.91857+0.000818663	test-error:0.16058+0.00114249
[310]	train-auc:0.936212+0.000359583	train-error:0.141409+0.000684801	test-auc:0.918581+0.000831321	test-error:0.160651+0.000983923
[320]	train-auc:0.936782+0.000322594	train-error:0.140715+0.000719932	test-auc:0.918674+0.000847351	test-error:0.16045+0.00109373
[330]	train-auc:0.937271+0.000333185	train-error:0.140074+0.000682436	test-auc:0.918693+0.000838091	test-error:0.16045+0.0010565
[340]	train-auc:0.937734+0.00033167	train-error:0.139489+0.000651873	test-auc:0.918698+0.000824699	test-error:0.160417+0.00124092
[350]	train-auc:0.938191+0.000304982	train-error:0.138838+0.000639185	test-auc:0.918696+0.000823397	test-error:0.160483+0.00118835
[360]	train-auc:0.938634+0.000300864	train-error:0.138306+0.000658089	test-auc:0.918697+0.000821008	test-error:0.160407+0.00117152
[370]	train-auc:0.939053+0.000259074	train-error:0.137796+0.000686887	test-auc:0.91872+0.000808779	test-error:0.160463+0.00113559
[380]	train-auc:0.939542+0.00024093	train-error:0.137197+0.000720103	test-auc:0.918751+0.000832329	test-error:0.160384+0.00109876
[390]	train-auc:0.940039+0.000286544	train-error:0.136526+0.000756489	test-auc:0.918804+0.000825468	test-error:0.160339+0.00105465
[400]	train-auc:0.940469+0.000279302	train-error:0.135967+0.000782004	test-auc:0.918807+0.000832574	test-error:0.16028+0.00101883
[410]	train-auc:0.940942+0.000270736	train-error:0.135301+0.000775515	test-auc:0.918793+0.000829554	test-error:0.160265+0.00107146
[420]	train-auc:0.941348+0.000275828	train-error:0.134882+0.000680412	test-auc:0.918784+0.000851939	test-error:0.160313+0.00096992
[430]	train-auc:0.941779+0.000323782	train-error:0.134347+0.000708861	test-auc:0.918768+0.00085264	test-error:0.160234+0.00100316
[440]	train-auc:0.942243+0.000328164	train-error:0.133657+0.000661872	test-auc:0.918801+0.000847173	test-error:0.160292+0.000902565
[450]	train-auc:0.942617+0.000337573	train-error:0.133182+0.00074273	test-auc:0.918814+0.000845156	test-error:0.160376+0.00112526
[460]	train-auc:0.943058+0.000323187	train-error:0.132605+0.000753002	test-auc:0.918824+0.000807352	test-error:0.160292+0.00101506

Model performance varies dramatically with the choice of hyperparameters. When tuning a large number of hyperparameters, a grid search may be prohibitively expensive; instead we can randomly sample from distributions over the hyperparameters and evaluate the model at those points.
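A natural follow-up (a hedged sketch, not part of the original run) is to read the best boosting round off the cross-validation output and retrain a final model with that many trees:

# xgb_cv is the DataFrame returned by xgb.cv(..., as_pandas = True) above;
# columns follow the '<set>-<metric>-mean' / '-std' naming convention
best_round = int(xgb_cv['test-auc-mean'].idxmax()) + 1
print("Best test AUC: %.4f at round %d" % (xgb_cv['test-auc-mean'].max(), best_round))

# Retrain on the full training set with the tuned parameters and tree count
bst_opt = xgb.train(params = optParams, dtrain = dTrain, num_boost_round = best_round)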

Inference


In [50]:
def plot_ROC_curve(y_true, network_output, meta):
    """
    Plots the receiver-operating characteristic curve
    Inputs: y_true:            Binary class labels
            network_output:    Classifier output probabilities
    Output: Displays and saves the ROC curve, labelled with its AUC

    """
    from sklearn.metrics import roc_curve, auc
    
    # Compute ROC curve, integrate
    fpr, tpr, thresholds = roc_curve(y_true, network_output)    
    roc_auc = auc(fpr, tpr)
    
    plt.figure()
    plt.axes([.1,.1,.8,.7])                           
    plt.figtext(.5,.9, r'$\mathrm{Receiver \;operating \;characteristic}$', fontsize=15, ha='center')
    plt.figtext(.5,.85, meta, fontsize=10,ha='center')
    plt.plot(fpr, tpr, color='darkorange',
                     lw=2, label='ROC curve - custom (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=1.0, linestyle='--')
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel(r'$\mathrm{False \;Positive \;Rate}$')
    plt.ylabel(r'$\mathrm{True \;Positive \;Rate}$')
    plt.legend(loc="lower right")
    plt.savefig("graphs/" + "clf_ROCcurve.pdf",format='pdf', dpi=1000)
    plt.show()
    plt.gcf().clear()

In [51]:
# Predict signal probability for the test set with the trained booster
xgb_pred = bst.predict(dTest)
meta = r'XGBoost - max_depth: 5, subsample: 0.9, $\eta = 0.2$'
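The booster saved after training could equally be reloaded from disk instead of reusing the in-memory object; a minimal sketch, assuming model_name from the training cell:

# Reload the saved booster and predict on the held-out test set
bst_loaded = xgb.Booster()
bst_loaded.load_model(model_name)
xgb_pred = bst_loaded.predict(dTest)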

In [52]:
plot_ROC_curve(df_y_test.values, xgb_pred, meta)


<matplotlib.figure.Figure at 0x7f7535935ac8>

Feature Importances

Plot the importances of the 20 features used most often to split the data across all boosted trees (get_fscore returns the number of splits per feature, the default 'weight' importance).


In [49]:
%matplotlib inline
importances = bst.get_fscore()
df_importance = pd.DataFrame({'Importance': list(importances.values()), 'Feature': list(importances.keys())})
df_importance.sort_values(by = 'Importance', inplace = True)
df_importance[-20:].plot(kind = 'barh', x = 'Feature', color = 'orange', figsize = (10,10), 
                         title = 'Feature Importances')


Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f74f00859b0>
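
XGBoost also ships a built-in helper for the same plot; a minimal sketch, assuming an xgboost version in which plot_importance accepts max_num_features:

# Built-in alternative: plots split-count ('weight') importances directly
ax = xgb.plot_importance(bst, max_num_features = 20)
plt.show()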