About

Train a BDT to decide whether a track comes from the same side (SS) or the opposite side (OS).

Labels:

0 (NAN) - cannot establish whether the track is SS or OS
-1 (OS) - opposite-side tracks (in good agreement with genuinely OS tracks)
1 (SS) - tracks whose grandmother, great-grandmother or great-great-grandmother is the same as that of the signal B

From tests we conclude that SS and NAN tracks should have an inverted track sign for $K_s$ and $K^*$ decays. Thus we train OS vs SS, NAN.



In [1]:

    
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib



In [2]:

    
import sys
sys.path.insert(0, "../")

Import



In [3]:

    
import pandas
import root_numpy
from folding_group import FoldingGroupClassifier
from decisiontrain import DecisionTrainClassifier
from rep.estimators import SklearnClassifier

Read $B^\pm \to J/\psi K^\pm$ MC samples



In [4]:

    
data = pandas.DataFrame(root_numpy.root2array('../datasets/MC/csv/WG/Bu_JPsiK/2012/Tracks.root'))



In [5]:

    
from utils import data_tracks_preprocessing
data = data_tracks_preprocessing(data)









    



/mnt/mfs/miniconda/envs/rep_py2/lib/python2.7/site-packages/pandas/core/indexing.py:266: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
/mnt/mfs/miniconda/envs/rep_py2/lib/python2.7/site-packages/pandas/core/indexing.py:426: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s






    



Initial statistics: {'parts': 33632195, 'Events': 1488891}
after  (ghostProb < 0.4)  selection, statistics: {'parts': 32813556, 'Events': 1488885}
after   ( (PIDNNk > 0.0) | (PIDNNm > 0.0) | (PIDNNe > 0.0) | (PIDNNpi > 0.0) | (PIDNNp > 0.0))   selection, statistics: {'parts': 32808324, 'Events': 1488885}



In [6]:

    
for group in range(-1, 2, 1):
    print group, 1. * numpy.sum(data.OS_SS.values == group) / len(data)









    



-1 0.0846154165022
0 0.904321385024
1 0.0110631984737
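The fractions above show how unbalanced the labels are: about 90% of the tracks are NAN and only about 1% are SS. A minimal sketch of the same per-label fraction computation using `numpy.unique`; the helper `label_fractions` and the toy `labels` array are illustrative and not part of the analysis code:

```python
import numpy as np


def label_fractions(labels):
    """Return {label: fraction of entries carrying that label}."""
    values, counts = np.unique(labels, return_counts=True)
    total = float(len(labels))
    return {int(v): c / total for v, c in zip(values, counts)}


# toy labels (hypothetical): 2 OS (-1), 7 NAN (0), 1 SS (1)
labels = np.array([-1, -1, 0, 0, 0, 0, 0, 0, 0, 1])
fractions = label_fractions(labels)
```

Here `fractions` maps each label to its share of the sample, which is the quantity printed by the loop above.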



In [7]:

    
len(data)









    Out[7]:





32808324



In [8]:

    
features = ['cos_diff_phi', 'diff_pt', 'partPt', 'partP', 'nnkrec', 'diff_eta', 'EOverP', 
            'ptB', 'sum_PID_mu_k', 'proj', 'PIDNNe', 'sum_PID_k_e', 'PIDNNk', 'sum_PID_mu_e', 'PIDNNm',
            'phi', 'IP', 'IPerr', 'IPs', 'veloch', 'max_PID_k_e', 'ghostProb', 
            'IPPU', 'eta', 'max_PID_mu_e', 'max_PID_mu_k', 'partlcs']

Distributions for same-side vs opposite-side tracks



In [9]:

    
kw = {'bins': 100, 'alpha': 0.4, 'normed': True}
figure(figsize=(20, 35))
for n, f in enumerate(features):
    subplot(10, 4, n + 1)
    # the range of the OS tracks is used for all three histograms
    os_values = data.loc[data.OS_SS == -1, f].values
    r = (numpy.min(os_values), numpy.max(os_values))
    hist(os_values, label='OS', range=r, **kw)
    hist(data.loc[data.OS_SS == 0, f].values, label='NAN', range=r, **kw)
    hist(data.loc[data.OS_SS == 1, f].values, label='SS', range=r, **kw)
    title(f)
    legend()

Training OS vs SS



In [10]:

    
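# keep only tracks with a definite label (OS = -1, SS = 1)
data_os_ss = data[data.OS_SS != 0]
# reweight SS tracks by n_OS / n_SS so that both classes carry the same total weight
weight = numpy.ones(len(data_os_ss))
weight[data_os_ss.OS_SS.values >= 0] *= 1. * sum(data_os_ss.OS_SS < 0) / sum(data_os_ss.OS_SS >= 0)
data_os_ss['weight'] = weight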









    



/mnt/mfs/miniconda/envs/rep_py2/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
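The weights defined in the cell above leave OS tracks at weight 1 and scale SS tracks by n_OS / n_SS, so both classes end up with the same total weight. A small sketch of that balancing on toy labels; the function name `balancing_weights` and the toy array are assumptions made for illustration:

```python
import numpy as np


def balancing_weights(is_ss):
    """Weight 1 for OS tracks and n_OS / n_SS for SS tracks."""
    is_ss = np.asarray(is_ss, dtype=bool)
    weights = np.ones(len(is_ss))
    weights[is_ss] *= float(np.sum(~is_ss)) / np.sum(is_ss)
    return weights


# toy sample (hypothetical): 8 OS tracks and 2 SS tracks
is_ss = np.array([False] * 8 + [True] * 2)
w = balancing_weights(is_ss)
```

With 8 OS and 2 SS tracks each SS track gets weight 4, and the summed weight of each class is 8.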



In [20]:

    
len(data_os_ss)









    Out[20]:





3139055



In [24]:

    
from hep_ml.losses import LogLossFunction



In [30]:

    
loss = LogLossFunction(regularization=100)
tt_base = DecisionTrainClassifier(learning_rate=0.1, n_estimators=10000, depth=6, loss=loss,
                                  max_features=15, n_threads=12)
tt_folding = FoldingGroupClassifier(SklearnClassifier(tt_base), n_folds=2, random_state=432, 
                                    train_features=features, group_feature='group_column')
%time tt_folding.fit(data_os_ss, data_os_ss.OS_SS >= 0)
pass









    



CPU times: user 43min 38s, sys: 15 s, total: 43min 53s
Wall time: 10min 47s
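The folding classifier trains one model per fold; on the training data each track is then presumably predicted by a model that did not see it. A rough sketch of group-wise fold assignment, assuming that `group_column` identifies events so that all tracks of one event share a fold. The helper `assign_folds` is hypothetical and is not the code in `folding_group`:

```python
import numpy as np


def assign_folds(groups, n_folds, seed):
    """Map every distinct group id to one fold, so that no group is split across folds."""
    rng = np.random.RandomState(seed)
    unique_groups, inverse = np.unique(groups, return_inverse=True)
    fold_of_group = rng.randint(0, n_folds, size=len(unique_groups))
    return fold_of_group[inverse]


# hypothetical event ids, one entry per track
groups = np.array([10, 10, 11, 12, 12, 12, 13])
folds = assign_folds(groups, n_folds=2, seed=432)
# a model for fold k would be trained on folds != k and applied to folds == k
```

Splitting by group rather than by track avoids correlated tracks from one event appearing in both the training and the prediction part of a fold.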



In [31]:

    
import cPickle
# pickle data is binary, so the file is opened in binary mode
with open('../models/dt_ss_os_only.pkl', 'wb') as f:
    cPickle.dump(tt_folding, f)



In [32]:

    
prob = tt_folding.predict_proba(data_os_ss)[:, 1]









    



KFold prediction using folds column



In [33]:

    
from sklearn.metrics import roc_auc_score
roc_auc_score(data_os_ss.OS_SS >= 0, prob, sample_weight=data_os_ss.weight)









    Out[33]:





0.69828377263840868
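The weighted ROC AUC can be read as the weighted probability that a randomly chosen SS track gets a higher score than a randomly chosen OS track, with ties counted as one half. A minimal pairwise sketch of that definition; the function `weighted_roc_auc` is illustrative, it needs memory quadratic in the sample size and is only meant for small arrays:

```python
import numpy as np


def weighted_roc_auc(labels, scores, weights):
    """Weighted fraction of (positive, negative) pairs that are ranked correctly."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    pos_s, pos_w = scores[labels], weights[labels]
    neg_s, neg_w = scores[~labels], weights[~labels]
    # 1 where the positive is ranked above the negative, 0.5 on ties, 0 otherwise
    wins = (pos_s[:, None] > neg_s[None, :]) + 0.5 * (pos_s[:, None] == neg_s[None, :])
    pair_w = pos_w[:, None] * neg_w[None, :]
    return np.sum(wins * pair_w) / np.sum(pair_w)


# toy example (hypothetical values)
labels = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
auc_unweighted = weighted_roc_auc(labels, scores, np.ones(4))
auc_weighted = weighted_roc_auc(labels, scores, np.array([1.0, 3.0, 1.0, 1.0]))
```

In the toy example three of the four pairs are ordered correctly, so the unweighted value is 0.75; giving the misranked negative weight 3 lowers it to 0.625.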



In [34]:

    
from rep.report.metrics import RocAuc
tt_folding.test_on(data_os_ss, data_os_ss.OS_SS >= 0).learning_curve(RocAuc())









    



KFold prediction using folds column
KFold prediction using folds column






    Out[34]:



In [43]:

    
# keep only the first 7000 trees in the classifier of each fold
tt_folding.estimators[0].clf.estimators = tt_folding.estimators[0].clf.estimators[:7000]
tt_folding.estimators[1].clf.estimators = tt_folding.estimators[1].clf.estimators[:7000]



In [46]:

    
prob = tt_folding.predict_proba(data_os_ss)[:, 1]









    



KFold prediction using folds column



In [53]:

    
report = tt_folding.test_on(data_os_ss, data_os_ss.OS_SS >= 0)









    



KFold prediction using folds column



In [55]:

    
report.feature_importance()









    Out[55]:

Calibration to the probability of being SS



In [44]:

    
from utils import plot_calibration

Before calibration



In [47]:

    
plot_calibration(prob, data_os_ss.OS_SS.values >= 0, weight=data_os_ss.weight.values)
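A calibration plot compares the predicted probability with the observed (weighted) fraction of positives in bins of the prediction; for a calibrated classifier the two agree. A minimal sketch of the binned computation, assuming equal-width bins. The helper `reliability_curve` is hypothetical and not necessarily what `plot_calibration` from `utils` does:

```python
import numpy as np


def reliability_curve(probs, labels, weights, n_bins=10):
    """Weighted mean prediction and weighted positive fraction per probability bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # bin index in [0, n_bins - 1]; a probability of exactly 1.0 goes to the last bin
    index = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    w_sum = np.bincount(index, weights=weights, minlength=n_bins)
    p_sum = np.bincount(index, weights=weights * probs, minlength=n_bins)
    y_sum = np.bincount(index, weights=weights * labels, minlength=n_bins)
    filled = w_sum > 0  # skip empty bins
    return p_sum[filled] / w_sum[filled], y_sum[filled] / w_sum[filled]


# toy predictions (hypothetical values)
mean_pred, pos_frac = reliability_curve(
    probs=[0.1, 0.15, 0.8, 0.9], labels=[0, 1, 1, 1], weights=[1, 1, 1, 1], n_bins=2)
```

Plotting `pos_frac` against `mean_pred` and comparing with the diagonal gives the usual reliability diagram.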

After calibration



In [48]:

    
from utils import calibrate_probs



In [49]:

    
prob_calib, calibrator = calibrate_probs(data_os_ss.OS_SS.values >= 0, data_os_ss.weight.values, prob,
                                         logistic=True)
plot_calibration(prob_calib, data_os_ss.OS_SS.values >= 0, weight=data_os_ss.weight.values)
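A sketch of what a logistic calibration can look like, assuming that `logistic=True` selects a Platt-style fit of the form sigmoid(a * logit(p) + b); this is an assumption about `calibrate_probs`, whose implementation is not shown here. The functions below are hypothetical and fit the two parameters with Newton's method on the weighted log-loss:

```python
import numpy as np


def _design(probs):
    """Columns: logit of the clipped probability, and a constant 1 for the offset."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    return np.column_stack([np.log(p / (1 - p)), np.ones_like(p)])


def fit_logistic_calibration(probs, labels, weights, n_iter=25):
    """Return (a, b) minimising the weighted log-loss of sigmoid(a * logit(p) + b)."""
    X = _design(probs)
    y = np.asarray(labels, dtype=float)
    w = np.asarray(weights, dtype=float)
    theta = np.array([1.0, 0.0])  # start from the identity transformation
    for _ in range(n_iter):
        q = 1.0 / (1.0 + np.exp(-X.dot(theta)))
        grad = X.T.dot(w * (q - y))
        # tiny ridge term only keeps the Hessian invertible
        hess = (X * (w * q * (1 - q))[:, None]).T.dot(X) + 1e-9 * np.eye(2)
        theta = theta - np.linalg.solve(hess, grad)
    return theta


def apply_logistic_calibration(probs, theta):
    return 1.0 / (1.0 + np.exp(-_design(probs).dot(theta)))


# toy data (hypothetical): raw scores are too compressed towards 0.5
rng = np.random.RandomState(0)
true_p = rng.uniform(0.05, 0.95, size=2000)
y = (rng.uniform(size=2000) < true_p).astype(float)
raw = 1.0 / (1.0 + np.exp(-0.5 * np.log(true_p / (1 - true_p))))
w = np.ones(2000)
theta = fit_logistic_calibration(raw, y, w)
calibrated = apply_logistic_calibration(raw, theta)
```

Because the fit includes an offset term, the weighted mean of the calibrated probabilities matches the weighted fraction of positives at the optimum. Plain Newton steps can fail on perfectly separable data, so a production fit would need a safeguard.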



In [50]:

    
# pickle data is binary, so the file is opened in binary mode
with open('../models/os_ss_calibrator_only.pkl', 'wb') as f:
    cPickle.dump(calibrator, f)
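Pickle files hold binary data, so opening them in binary mode (`'wb'` for writing, `'rb'` for reading) is the portable choice. A minimal round-trip sketch using the standard `pickle` module, which is available in both Python 2 and Python 3; the dictionary `model` is only a stand-in for a fitted object and the temporary path is made up for the example:

```python
import os
import pickle
import tempfile

# stand-in for a fitted calibrator (hypothetical content)
model = {'a': 1.7, 'b': -0.2}

path = os.path.join(tempfile.mkdtemp(), 'calibrator.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, 'rb') as f:
    restored = pickle.load(f)
```

Loading the file back right after saving is a cheap check that the stored model can actually be restored.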



In [ ]:

    
probs_nan = tt_folding.predict_proba(data[data.OS_SS == 0])[:, 1]
probs_nan_calib = calibrator.predict_proba(probs_nan)









    



KFold prediction using random classifier (length of data passed not equal to length of train)



In [ ]:

    
hist(prob_calib[data_os_ss.OS_SS.values < 0], normed=True, alpha=0.4, label='OS', bins=100);
hist(prob_calib[data_os_ss.OS_SS.values > 0], normed=True, alpha=0.4, label='SS', bins=100);
hist(probs_nan_calib, normed=True, alpha=0.4, label='NAN', bins=100);
legend();

OS vs SS and NAN



In [20]:

    
hist(prob_calib[data_os_ss.OS_SS.values < 0], normed=True, alpha=0.4, label='OS', bins=100);
hist(prob_calib[data_os_ss.OS_SS.values > 0], normed=True, alpha=0.4, label='SS', bins=100);
hist(prob_calib[data_os_ss.OS_SS.values == 0], normed=True, alpha=0.4, label='NAN', bins=100);
legend();