Boosting to Uniformity

In physics applications we frequently need to achieve uniformity of predictions along some features. For instance, when testing for the existence of a new particle, we need the classifier to be uniform in background along the mass (otherwise one can get a false discovery due to a peaking background).

This notebook compares several classifiers. The goal is to obtain a flat efficiency in signal (without significantly losing classification quality) in the Dalitz features.

The classifiers compared are

  • plain GradientBoosting
  • uBoost
  • gradient boosting with knn-Ada loss (UGB+knnAda)
  • gradient boosting with FlatnessLoss (UGB+FlatnessLoss)

We use the dataset from the uBoost paper for demonstration purposes. There is plenty of data here, so the results are quite stable.


In [1]:
# downloading data
!wget -O ../data/dalitzdata.root -nc https://github.com/arogozhnikov/hep_ml/blob/data/data_to_download/dalitzdata.root?raw=true


File '../data/dalitzdata.root' already there; not retrieving.

In [2]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [3]:
import pandas, numpy
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
# this wrapper makes it possible to train on a subset of features
from rep.estimators import SklearnClassifier

from hep_ml.commonutils import train_test_split
from hep_ml import uboost, gradientboosting as ugb, losses

Loading data


In [4]:
import root_numpy
used_columns = ["Y1", "Y2", "Y3", "M2AB", "M2AC"]
data = pandas.DataFrame(root_numpy.root2array('../data/dalitzdata.root', treename='tree'))
labels = data['labels']
data = data.drop('labels', axis=1)

Distribution of signal and background events in the Dalitz features

As we can see, the background is concentrated mostly in the corners of the Dalitz plot,
and for traditional classifiers this results in poor signal efficiency in the corners.


In [5]:
def plot_distribution(data_frame, var_name1='M2AB', var_name2='M2AC', bins=40):
    """The function to plot 2D distribution histograms"""
    pylab.hist2d(data_frame[var_name1], data_frame[var_name2], bins=bins, cmap=cm.Blues)
    pylab.xlabel(var_name1)
    pylab.ylabel(var_name2)
    pylab.colorbar()

pylab.figure(figsize=(12, 6))
subplot(1, 2, 1), pylab.title("signal"),       plot_distribution(data[labels==1])
subplot(1, 2, 2), pylab.title("background"),   plot_distribution(data[labels==0])
pass


Preparation of train/test datasets


In [6]:
trainX, testX, trainY, testY = train_test_split(data, labels, random_state=42)

Setting up classifiers, training


In [7]:
uniform_features  = ["M2AB", "M2AC"]
train_features = ["Y1", "Y2", "Y3"]
n_estimators = 150
base_estimator = DecisionTreeClassifier(max_depth=4)

uBoost training takes a long time, so we reduce the number of efficiency_steps, use prediction smoothing, and run uBoost in several threads.


In [8]:
from rep.metaml import ClassifiersFactory

classifiers = ClassifiersFactory()

base_ada = GradientBoostingClassifier(max_depth=4, n_estimators=n_estimators, learning_rate=0.1)
classifiers['AdaBoost'] = SklearnClassifier(base_ada, features=train_features)


knnloss = ugb.KnnAdaLossFunction(uniform_features, knn=10, uniform_label=1)
ugbKnn = ugb.UGradientBoostingClassifier(loss=knnloss, max_depth=4, n_estimators=n_estimators,
                                        learning_rate=0.4, train_features=train_features)
classifiers['uGB+knnAda'] = SklearnClassifier(ugbKnn) 

uboost_clf = uboost.uBoostClassifier(uniform_features=uniform_features, uniform_label=1,
                                     base_estimator=base_estimator, 
                                     n_estimators=n_estimators, train_features=train_features, 
                                     efficiency_steps=12, n_threads=4)
classifiers['uBoost'] = SklearnClassifier(uboost_clf)

flatnessloss = ugb.KnnFlatnessLossFunction(uniform_features, fl_coefficient=3., power=1.3, uniform_label=1)
ugbFL = ugb.UGradientBoostingClassifier(loss=flatnessloss, max_depth=4, 
                                       n_estimators=n_estimators, 
                                       learning_rate=0.1, train_features=train_features)
classifiers['uGB+FL'] = SklearnClassifier(ugbFL)


classifiers.fit(trainX, trainY, parallel_profile='threads-4')
pass


model AdaBoost     was trained in 94.24 seconds
model uGB+knnAda   was trained in 210.28 seconds
model uBoost       was trained in 599.23 seconds
model uGB+FL       was trained in 446.55 seconds
Totally spent 599.25 seconds on training

Let's look at the results of training

Dependence of quality on the number of trees built (ROC AUC is the area under the ROC curve; the higher, the better).
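The next cell plots this with REP's learning_curve. For reference, the same kind of curve can be computed by hand with sklearn; the sketch below is not a cell of the original notebook and is for illustration only. It refits a plain GradientBoostingClassifier on the training features and evaluates the ROC AUC after each boosting stage.

from sklearn.metrics import roc_auc_score

# illustration: AUC vs number of trees for plain gradient boosting
gb = GradientBoostingClassifier(max_depth=4, n_estimators=n_estimators, learning_rate=0.1)
gb.fit(trainX[train_features], trainY)
staged_aucs = [roc_auc_score(testY, proba[:, 1])
               for proba in gb.staged_predict_proba(testX[train_features])]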


In [9]:
from rep.report.metrics import RocAuc
report = classifiers.test_on(testX, testY)

ylim(0.88, 0.94)
report.learning_curve(RocAuc(), steps=1)


Out[9]:

SDE (squared deviation of efficiency) learning curve

SDE vs the number of trees built. SDE is a metric of non-uniformity; lower is better.
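The next cell uses BinBasedSDE from hep_ml.metrics. As a rough illustration of what it measures, here is a simplified toy version for a single uniform feature and a single global efficiency (an assumption-laden sketch, not the hep_ml implementation, which also handles several features, several target efficiencies and event weights):

def simple_bin_sde(predictions, mass, labels, global_eff=0.5, n_bins=15):
    """Toy SDE: spread of per-bin signal efficiency around the global efficiency."""
    predictions, mass, labels = map(numpy.asarray, (predictions, mass, labels))
    sig_pred, sig_mass = predictions[labels == 1], mass[labels == 1]
    # threshold chosen so that the global signal efficiency equals global_eff
    threshold = numpy.percentile(sig_pred, 100 * (1 - global_eff))
    passed = sig_pred > threshold
    edges = numpy.linspace(sig_mass.min(), sig_mass.max(), n_bins + 1)
    bin_ids = numpy.digitize(sig_mass, edges[1:-1])
    deviations, weights = [], []
    for b in numpy.unique(bin_ids):
        in_bin = bin_ids == b
        deviations.append((passed[in_bin].mean() - global_eff) ** 2)
        weights.append(in_bin.sum())
    return numpy.sqrt(numpy.average(deviations, weights=weights))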


In [10]:
from hep_ml.metrics import BinBasedSDE, KnnBasedCvM
report.learning_curve(BinBasedSDE(uniform_features, uniform_label=1))


Out[10]:

CvM learning curve

CvM is a metric of non-uniformity based on the Cramér-von Mises distance. We use the kNN version here.
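The next cell evaluates KnnBasedCvM from hep_ml.metrics. For intuition only, a bin-based stand-in can be written in a few lines: it compares the local (per-bin) distribution of signal predictions with the global one. This is a sketch under simplifying assumptions, not the kNN metric used by the library:

def simple_bin_cvm(predictions, mass, labels, n_bins=15):
    """Toy CvM: mean squared difference between local and global CDFs of signal predictions."""
    predictions, mass, labels = map(numpy.asarray, (predictions, mass, labels))
    sig_pred, sig_mass = predictions[labels == 1], mass[labels == 1]
    global_sorted = numpy.sort(sig_pred)
    global_cdf = numpy.arange(1, len(global_sorted) + 1) / float(len(global_sorted))
    edges = numpy.linspace(sig_mass.min(), sig_mass.max(), n_bins + 1)
    bin_ids = numpy.digitize(sig_mass, edges[1:-1])
    cvm_parts, weights = [], []
    for b in numpy.unique(bin_ids):
        local = numpy.sort(sig_pred[bin_ids == b])
        # local empirical CDF evaluated at every global prediction value
        local_cdf = numpy.searchsorted(local, global_sorted, side='right') / float(len(local))
        cvm_parts.append(numpy.mean((local_cdf - global_cdf) ** 2))
        weights.append(len(local))
    return numpy.average(cvm_parts, weights=weights)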


In [11]:
report.learning_curve(KnnBasedCvM(uniform_features, uniform_label=1))


Out[11]:

ROC curves after training


In [12]:
report.roc().plot(new_plot=True, figsize=[10, 9])


Signal efficiency

The global cut corresponds to an average signal efficiency of 0.5. In the ideal case the picture should be white.
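To make "global cut" concrete: the threshold on the classifier output is chosen so that the requested fraction of all signal events passes it, and the 2D plot then shows the fraction passing in each (M2AB, M2AC) bin. A minimal sketch using the plain GradientBoostingClassifier gb fitted in the earlier sketch (illustration only, not how efficiencies_2d is implemented):

# cut such that 50% of all signal events pass globally
signal_proba = gb.predict_proba(testX[train_features])[:, 1][numpy.asarray(testY) == 1]
threshold = numpy.percentile(signal_proba, 50)
# per-bin efficiency = fraction of signal events in that (M2AB, M2AC) bin with proba above the cut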


In [13]:
report.efficiencies_2d(uniform_features, efficiency=0.5, signal_label=1, n_bins=15, 
                       labels_dict={1: 'signal'})


/Users/axelr/venvs/rep_env/lib/python2.7/site-packages/matplotlib/axes/_axes.py:475: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.
  warnings.warn("No labelled objects found. "
Out[13]:

The same for a global efficiency of 0.7.


In [14]:
report.efficiencies_2d(uniform_features, efficiency=0.7, signal_label=1, n_bins=15, 
                       labels_dict={1: 'signal'})


Out[14]: