hep_ml.reweight contains methods to reweight distributions. Typically, reweighting of Monte Carlo simulation is used to compensate for drawbacks of the simulation, though there are many other applications.
In this example we reweight multidimensional distributions: given an original and a target distribution, the aim is to find new weights for the original distribution such that the two multidimensional distributions coincide.
This is a toy example without real physical meaning.
Pay attention: equality of the distributions of each individual feature $\neq$ equality of the multivariate distribution.
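A tiny synthetic illustration of this point (an aside with made-up numpy samples, not part of the example data): the two 2-dimensional samples below have identical marginal distributions, but completely different joint distributions.

import numpy
rng = numpy.random.RandomState(0)
# sample A: x and y are independent standard normals
xa, ya = rng.normal(size=100000), rng.normal(size=100000)
# sample B: y is an exact copy of x, so the features are fully correlated
xb = rng.normal(size=100000)
yb = xb
# each marginal is standard normal in both samples, yet the joints differ:
print('correlation in A: %.2f' % numpy.corrcoef(xa, ya)[0, 1])  # ~ 0.00
print('correlation in B: %.2f' % numpy.corrcoef(xb, yb)[0, 1])  # 1.00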
In [1]:
%pylab inline
# %pylab puts numpy and matplotlib names (numpy, hist, subplot, title, ...) into the namespace
figsize(16, 8)
import root_numpy
import pandas
from hep_ml import reweight
In [2]:
storage = 'https://github.com/arogozhnikov/hep_ml/blob/data/data_to_download'
# download the toy datasets (-nc: skip the download if the file already exists)
!wget -O ../data/MC_distribution.root -nc $storage/MC_distribution.root?raw=true
!wget -O ../data/RD_distribution.root -nc $storage/RD_distribution.root?raw=true
In [3]:
columns = ['hSPD', 'pt_b', 'pt_phi', 'vchi2_b', 'mu_pt_sum']

# original = simulation (MC), target = real data (RD)
original = root_numpy.root2array('../data/MC_distribution.root', branches=columns)
target = root_numpy.root2array('../data/RD_distribution.root', branches=columns)
original = pandas.DataFrame(original)
target = pandas.DataFrame(target)

original_weights = numpy.ones(len(original))
In [4]:
from hep_ml.metrics_utils import ks_2samp_weighted

hist_settings = {'bins': 100, 'density': True, 'alpha': 0.7}  # 'density' replaces the deprecated 'normed'

def draw_distributions(new_original_weights):
    # plot weighted original vs target histograms and print the weighted KS distance per feature
    for i, column in enumerate(columns, 1):
        xlim = numpy.percentile(target[column], [0.01, 99.99])
        subplot(2, 3, i)
        hist(original[column], weights=new_original_weights, range=xlim, **hist_settings)
        hist(target[column], range=xlim, **hist_settings)
        title(column)
        print('KS over %s = %s' % (column, ks_2samp_weighted(original[column], target[column],
                                                             weights1=new_original_weights,
                                                             weights2=numpy.ones(len(target), dtype=float))))
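As a quick sanity check of the metric (a synthetic aside with toy samples, not part of this example): identical samples should give a weighted KS distance near zero, while shifted samples give a large one.

rng = numpy.random.RandomState(0)
a = rng.normal(size=10000)
b = rng.normal(loc=1.0, size=10000)
w = numpy.ones(10000)
print('KS(a, a) = %.3f' % ks_2samp_weighted(a, a, weights1=w, weights2=w))  # ~ 0: same sample
print('KS(a, b) = %.3f' % ks_2samp_weighted(a, b, weights1=w, weights2=w))  # large: shifted mean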
In [5]:
# note: we have rather few events in both samples
len(original), len(target)
Out[5]:
In [6]:
# distributions before reweighting: all original weights equal to 1
draw_distributions(original_weights)
In [7]:
# bins-based reweighting: weights from the ratio of (smoothed) multidimensional histograms
bins_reweighter = reweight.BinsReweighter(n_bins=20, n_neighs=1.)
bins_reweighter.fit(original, target)
bins_weights = bins_reweighter.predict_weights(original)
draw_distributions(bins_weights)
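Conceptually, bins-based reweighting assigns every original event the ratio of target to original histogram densities in its bin. Below is a minimal one-dimensional sketch of that idea (my own illustration under that assumption, not hep_ml's implementation, which works in several dimensions and smooths over neighbouring bins).

def bins_weights_1d(original_col, target_col, n_bins=20):
    # common equal-width binning over the combined range of both samples
    edges = numpy.linspace(min(original_col.min(), target_col.min()),
                           max(original_col.max(), target_col.max()), n_bins + 1)
    hist_original, _ = numpy.histogram(original_col, bins=edges, density=True)
    hist_target, _ = numpy.histogram(target_col, bins=edges, density=True)
    # density ratio per bin, guarding against empty original bins
    ratio = hist_target / numpy.clip(hist_original, 1e-10, None)
    # weight of an event = ratio in the bin it falls into
    indices = numpy.clip(numpy.digitize(original_col, edges) - 1, 0, n_bins - 1)
    return ratio[indices]

# e.g. weights for a single feature: bins_weights_1d(original['hSPD'], target['hSPD'])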
In [8]:
# gradient boosted reweighter: tree-based, works on all features jointly
reweighter = reweight.GBReweighter(n_estimators=50, learning_rate=0.1, max_depth=3, min_samples_leaf=1000,
                                   gb_args={'subsample': 0.6})
reweighter.fit(original, target)
gb_weights = reweighter.predict_weights(original)
draw_distributions(gb_weights)
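Reweighting always costs statistical precision: strongly varying weights reduce the effective number of events. A quick way to quantify this (an added check, not part of the original notebook) is Kish's effective sample size $n_\mathrm{eff} = (\sum_i w_i)^2 / \sum_i w_i^2$, which equals the number of events for uniform weights and shrinks as the weights become less uniform.

def effective_sample_size(weights):
    # Kish's formula: len(weights) for uniform weights, smaller otherwise
    return weights.sum() ** 2 / (weights ** 2).sum()

print('no reweighting: n_eff = %.0f' % effective_sample_size(original_weights))
print('bins weights:   n_eff = %.0f' % effective_sample_size(bins_weights))
print('GB weights:     n_eff = %.0f' % effective_sample_size(gb_weights))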
In [9]:
def check_ks_of_expression(expression):
    # compare a (possibly nonlinear) combination of features, which probes correlations, not only marginals
    col_original = original.eval(expression, engine='python')
    col_target = target.eval(expression, engine='python')
    w_target = numpy.ones(len(col_target), dtype='float')
    print('No reweight   KS: %s' % ks_2samp_weighted(col_original, col_target, weights1=original_weights, weights2=w_target))
    print('Bins reweight KS: %s' % ks_2samp_weighted(col_original, col_target, weights1=bins_weights, weights2=w_target))
    print('GB reweight   KS: %s' % ks_2samp_weighted(col_original, col_target, weights1=gb_weights, weights2=w_target))
In [10]:
check_ks_of_expression('hSPD')
In [11]:
check_ks_of_expression('hSPD * pt_phi')
In [12]:
check_ks_of_expression('hSPD * pt_phi * vchi2_b')
In [13]:
check_ks_of_expression('pt_b * pt_phi / hSPD ')
In [14]:
check_ks_of_expression('hSPD * pt_b * vchi2_b / pt_phi')
Finally, let's check how well a classifier is able to distinguish the two distributions; ROC AUC is taken as the measure of quality.
For this purpose we split the data into train and test parts, then train a classifier to distinguish the (reweighted) original distribution from the target. If ROC AUC $\approx$ 0.5 on the test part, the distributions are indistinguishable; if ROC AUC = 1.0, they are perfectly separable.
In [15]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn
from sklearn.metrics import roc_auc_score

data = numpy.concatenate([original, target])
labels = numpy.array([0] * len(original) + [1] * len(target))

weights = {}
weights['original'] = original_weights
weights['bins'] = bins_weights
weights['gb'] = gb_weights

for name, new_weights in weights.items():
    # rescale original weights so both classes have the same total weight
    W = numpy.concatenate([new_weights / new_weights.sum() * len(target), [1] * len(target)])
    Xtr, Xts, Ytr, Yts, Wtr, Wts = train_test_split(data, labels, W, random_state=42, train_size=0.51)
    clf = GradientBoostingClassifier(subsample=0.3, n_estimators=30).fit(Xtr, Ytr, sample_weight=Wtr)
    print('%s: ROC AUC = %s' % (name, roc_auc_score(Yts, clf.predict_proba(Xts)[:, 1], sample_weight=Wts)))