The files for_yandex_x.root contain a single TTree called T
that has all of the feautres we want to consider for various jet-tagging algorithms. The values of x mean:
Only jets with measured pt > 20 GeV
and 2.2 < eta < 4.2
are kept in these samples.
We want to start by considering the following taggers:
An SV tagger that requires an SV in the jet and uses features similar to what we used in the LHCb Run 1 tagger. This will now include anything given below that starts with SV, along with the feature SVPT / JetPT.
An SV tagger that requires an SV in the jet but uses not only SV features, but also whatever else is useful.
A muon tagger that requires a muon in the jet and uses only Mu features, along with MUPT / JetPT.
A hardest-track tagger that uses features of the highest-pt track in the jet, along with HardPT / JetPT.
A tagger that requires zero SVs, but can use whatever else is useful.
In each case, we want to consider separation of b, c, and light, though some of these may only be useful for b OR c vs light.
To remind you, in Run 1 the SV tagger used
The features kept for each jet are list below with descriptions.
Features that starts with "True" is truth-level information we can use for labeling, but cannot be used in the classification algorithm since in experimental data we don't have this.
TrueParton is the truth-level parton that the jet matches to. This should be +-5(+-4) for all jets in the beauty(charm) file, and various numbers that correspond to u,d,s,g for light jets.
TrueMaxBPT is the true maximum b-hadron pt within the jet. This should be zero except for b-jets.
TrueMaxCPT is the true maximum c-hadron pt within the jet. This should be zero for light jets. It can be non-zero for b-jets since many contain a b --> c decay.
TrueJet(Px,Py,Pz,E) is the true 4-momentum of the jet.
Features that start with "Jet" are properities that all jets have.
Jet(PT,ETA) is the measured jet pt(eta).
JetSigma(1,2) is the "jet width" along the major and minor axes.
JetQ is the "jet charge".
JetMult is the total number of charged and neutral particles in the jet.
JetNChr is the number of charged particles in the jet.
JetNNeu is the number of neutral particles in the jet.
JetPTD is another jet width type feature.
Features that start with "SV" are only set for jets that have a secondary vertex associated to them.
NSV is the number of SVs in the jet. Only the one with the highest pt is given in the SV features.
SV(X,Y,Z) is the position.
SVPerp is (SVX**2 + SVY**2)**0.5
.
SV(Px,Py,Pz,E) is the 4-momentum of all tracks that make up the SV.
SV(PT,ETA) are the transverse component and eta of the 4-momentum.
SVM is the invariant mass.
SVMCor is the corrected mass.
SVMINPERP is the minimum perp location of all 2-body SV contained in the n-body SV.
SVDR is DeltaR between the direction of flight (PV to SV vector) and the jet axis.
SVN is the number of tracks in the SV.
SVNJ is the number of tracks in the SV that are also in the jet.
SVQ is the net charge of the tracks in the SV (we usually only use its absolute value).
SVSumIPChi2 is the sum of the IP chi2 of all tracks in the SV.
SVTZ is the pseudo lifetime of the SV.
SVMINIPCHI2 is the min value of IP chi2 of all tracks in the SV.
SVGhostMax is the max value of the ghost prob of all tracks in the SV.
Features that start with "Mu" are only for jets that contain a muon, which is defined as a track in the jet with ProbNNmu > 0.5 and PT > 500 MeV.
NMU is the number of muons in the jet. If more than one, then only the one with the highest PT is reported in the Mu features.
MuPT is the pt of the muon.
MuIPChi2 is the IP chi2 of the muon.
MuDR is DeltaR between the muon and the jet axis.
MuPNN is ProbNNmu for the muon.
Features that start with "Hard" are for the highest-pt track in the jet.
HardPT is its pt.
HardIPChi2 is its IP chi2.
HardDR is DeltaR between the hardest track and the jet axis.
PV(X,Y,Z) is the PV location.
NDispl(6,9,16) is the number of tracks in the jet that have IP chi2 > (6,9,16). Clearly these are highly correlated, so perhaps we should only consider using one of them.
In [1]:
%pylab inline
In [2]:
import root_numpy
import pandas
import sys
sys.path.insert(0, 'utils/')
In [3]:
import utils
from rep.metaml import FoldingClassifier
from rep.estimators import XGBoostClassifier
In [4]:
b_jets = pandas.DataFrame(root_numpy.root2array('datasets_2016_Sept/for_yandex_5.root', treename='T'))
c_jets = pandas.DataFrame(root_numpy.root2array('datasets_2016_Sept/for_yandex_4.root', treename='T'))
light_jets = pandas.DataFrame(root_numpy.root2array('datasets_2016_Sept/for_yandex_0.root', treename='T'))
In [5]:
hist(numpy.log(b_jets.SVPT[b_jets.SVPT > -999] + 1), bins=100, alpha=0.4, normed=True, label='SVPT');
hist(numpy.log(b_jets.TrueMaxBPT[b_jets.TrueMaxBPT > -999] + 1), bins=100, alpha=0.4, normed=True, label='TrueMaxBPT');
hist(numpy.log(b_jets.TrueMaxCPT[b_jets.TrueMaxCPT > -999] + 1), bins=100, alpha=0.4, normed=True, label='TrueMaxCPT');
xlabel('log(PT+1)')
xlim(6, 12)
legend(loc='best')
Out[5]:
In [6]:
hist(numpy.log(c_jets.SVPT[c_jets.SVPT > -999] + 1), bins=100, alpha=0.4, normed=True, label='SVPT');
hist(numpy.log(c_jets.TrueMaxCPT[c_jets.TrueMaxCPT > -999] + 1), bins=100, alpha=0.4, normed=True, label='TrueMaxCPT');
xlabel('log(PT+1)')
xlim(6, 12)
legend(loc='best')
Out[6]:
In [7]:
print len(b_jets), sum(abs(b_jets.TrueParton) == 5), sum(b_jets.TrueMaxBPT != 0), sum(b_jets.TrueMaxCPT != 0)
print len(c_jets), sum(abs(c_jets.TrueParton) == 4), sum(c_jets.TrueMaxBPT != 0), sum(c_jets.TrueMaxCPT != 0)
print len(light_jets), sum(light_jets.TrueParton != -1000), sum(light_jets.TrueMaxBPT != 0), sum(light_jets.TrueMaxCPT != 0)
In [8]:
b_jets, c_jets, light_jets = utils.add_features(b_jets, c_jets, light_jets)
In [9]:
b_jets.head()
Out[9]:
In [10]:
c_jets.head()
Out[10]:
In [11]:
light_jets.head()
Out[11]:
In [12]:
true_features = filter(lambda x: x.startswith('True'), b_jets.columns)
SV_features = filter(lambda x: x.startswith('SV'), b_jets.columns)
jet_features = filter(lambda x: x.startswith('Jet'), b_jets.columns)
muon_features = filter(lambda x: x.startswith('Mu'), b_jets.columns)
hard_features = filter(lambda x: x.startswith('Hard'), b_jets.columns)
other_features = ['PVX', 'PVY', 'PVZ', 'NDispl6', 'NDispl9', 'NDispl16']
In [13]:
plt.figure(figsize=(20, 65))
kw_hist = {'normed': True, 'bins': 60, 'alpha': 0.4}
for n, name in enumerate(b_jets.columns):
plt.subplot(17, 4, n + 1)
val = numpy.percentile(b_jets[b_jets[name] != -1000][name].values, [1, 99])
plt.hist(b_jets[b_jets[name] != -1000][name].values, label='b', range=val, **kw_hist)
plt.hist(c_jets[c_jets[name] != -1000][name].values, label='c', range=val, **kw_hist)
plt.hist(light_jets[light_jets[name] != -1000][name].values, range=val, label='light', **kw_hist)
plt.legend(loc='best')
plt.xlabel(name)
In [14]:
data = pandas.concat([b_jets, c_jets, light_jets], axis=0)
data['Label'] = [0] * len(b_jets) + [1] * len(c_jets) + [2] * len(light_jets)
In [15]:
def fit_model(data, selection, model):
data_fit = data.query(selection)
data_fit['Weight'] = utils.compute_weights(data_fit['Label'].values)
print 'Classes', sum(data_fit.Label == 0), sum(data_fit.Label == 1), sum(data_fit.Label == 2)
model.fit(data_fit, data_fit.Label, data_fit.Weight)
utils.plot_feature_importances(model.feature_importances_, model.features)
plt.show()
predictions = model.predict_proba(data_fit)
utils.generate_plots({i: predictions[:, i] for i in range(3)},
data_fit.Label.values, data_fit.Weight.values, data_fit)
plt.show()
return model
In [16]:
sv_selection = '(TrueParton != -1000) & (NSV > 0)'
In [17]:
print SV_features
In [19]:
%%time
from rep.metaml import FoldingClassifier
from rep.estimators import XGBoostClassifier
xgb_sv = FoldingClassifier(XGBoostClassifier(nthreads=12,
eta=0.05,
n_estimators=200,
max_depth=9,
subsample=0.5,
min_child_weight=100,
colsample=0.7),
features=SV_features + ['NSV'], random_state=42)
xgb_sv = fit_model(data, sv_selection, xgb_sv)
In [22]:
from sklearn.base import clone
xgb_base = FoldingClassifier(XGBoostClassifier(nthreads=12, eta=0.05,
n_estimators=200,
max_depth=9,
subsample=0.5,
min_child_weight=100,
colsample=0.7),
features=SV_features + ['NSV'], random_state=42)
xgb_b_c = clone(xgb_base)
xgb_c_light = clone(xgb_base)
xgb_b_light = clone(xgb_base)
data_sv = data.query(sv_selection)
data_sv['Weight'] = utils.compute_weights(data_sv['Label'].values)
mask_b_c = data_sv.Label != 2
mask_b_light = data_sv.Label != 1
mask_c_light = data_sv.Label != 0
xgb_b_c.fit(data_sv[mask_b_c], data_sv.loc[mask_b_c, 'Label'] == 0, data_sv.loc[mask_b_c, 'Weight'])
xgb_c_light.fit(data_sv[mask_c_light], data_sv.loc[mask_c_light, 'Label'] == 1 , data_sv.loc[mask_c_light, 'Weight'])
xgb_b_light.fit(data_sv[mask_b_light], data_sv.loc[mask_b_light, 'Label'] == 0 , data_sv.loc[mask_b_light, 'Weight'])
Out[22]:
In [23]:
from sklearn.metrics import roc_auc_score
In [24]:
print roc_auc_score(data_sv.loc[mask_b_c, 'Label'] == 0, xgb_b_c.predict_proba(data_sv[mask_b_c])[:, 1],
sample_weight=data_sv.loc[mask_b_c, 'Weight'])
print roc_auc_score(data_sv.loc[mask_c_light, 'Label'] == 1, xgb_c_light.predict_proba(data_sv[mask_c_light])[:, 1],
sample_weight=data_sv.loc[mask_c_light, 'Weight'])
print roc_auc_score(data_sv.loc[mask_b_light, 'Label'] == 0, xgb_b_light.predict_proba(data_sv[mask_b_light])[:, 1],
sample_weight=data_sv.loc[mask_b_light, 'Weight'])
In [25]:
%%time
xgb_sv_all = FoldingClassifier(XGBoostClassifier(nthreads=12,
eta=0.05,
n_estimators=200,
max_depth=9,
subsample=0.5,
min_child_weight=100,
colsample=0.7),
features=SV_features + ['NSV'] + \
jet_features + muon_features + hard_features + other_features,
random_state=42)
xgb_sv_all = fit_model(data, sv_selection, xgb_sv_all)
In [26]:
muon_selection = '(TrueParton != -1000) & (NMu > 0)'
In [27]:
print muon_features
In [28]:
xgb_muon = FoldingClassifier(XGBoostClassifier(nthreads=12,
eta=0.05,
n_estimators=200,
max_depth=6,
subsample=0.5,
min_child_weight=100,
colsample=0.7),
features=muon_features + ['NMu'],
random_state=42)
xgb_muon = fit_model(data, muon_selection, xgb_muon)
In [29]:
hard_selection = '(TrueParton != -1000)'
In [30]:
print hard_features
In [31]:
xgb_hard = FoldingClassifier(XGBoostClassifier(nthreads=12,
eta=0.05,
n_estimators=200,
max_depth=5,
subsample=0.5,
min_child_weight=100,
colsample=0.7),
features=hard_features,
random_state=42)
xgb_hard = fit_model(data, hard_selection, xgb_hard)
In [32]:
zero_sv_selection = '(TrueParton != -1000) & (NSV == 0)'
In [34]:
xgb_zero = FoldingClassifier(XGBoostClassifier(nthreads=12,
eta=0.05,
n_estimators=200,
max_depth=9,
subsample=0.5,
min_child_weight=100,
colsample=0.7),
features=jet_features + muon_features + hard_features + other_features,
random_state=42)
xgb_zero = fit_model(data, zero_sv_selection, xgb_zero)
In [37]:
predictions = {}
for name, cl, selection in zip(['BDT_SV', 'BDT_SV_EXT', 'BDT_MUON', 'BDT_HARD', 'BDT_ZERO_SV'],
[xgb_sv, xgb_sv_all, xgb_muon, xgb_hard, xgb_zero],
[sv_selection, sv_selection, muon_selection, hard_selection, zero_sv_selection]):
pred = numpy.zeros(shape=(len(data), 3)) - 1
mask = data.eval(selection).values
pred[mask, :] = cl.predict_proba(data.loc[mask, :])
predictions[name] = pred
In [42]:
for key in predictions.keys():
for i, name in zip(range(3), ['_b', '_c', '_light']):
data[key + name] = predictions[key][:, i]
In [43]:
data.head()
Out[43]:
In [44]:
root_numpy.array2root(data[data.Label == 0].to_records(index=False),
'datasets_2016_Sept/predicted/jets_5.root',
treename='T')
root_numpy.array2root(data[data.Label == 1].to_records(index=False),
'datasets_2016_Sept/predicted/jets_4.root',
treename='T')
root_numpy.array2root(data[data.Label == 2].to_records(index=False),
'datasets_2016_Sept/predicted/jets_0.root',
treename='T')