hep_ml.reweight contains methods to reweight distributions. Typically we use reweighting of monte-carlo to fight drawbacks of simulation, though there are many applications.
In this example we reweight multidimensional distibutions: original
and target
, the aim is to find new weights for original distribution, such that these multidimensional distributions will coincide.
Here we have a toy example without a real physics meaning.
Pay attention: equality of distibutions for each feature $\neq$ equality of multivariate distributions.
All samples are divided into training and validation part. Training part is used to fit reweighting rule and test part is used to estimate reweighting quality.
In [1]:
%matplotlib inline
import numpy
import root_numpy
import pandas
from hep_ml import reweight
from matplotlib import pyplot as plt
In [2]:
storage = 'https://github.com/arogozhnikov/hep_ml/blob/data/data_to_download/'
!wget -O ../data/MC_distribution.root -nc $storage/MC_distribution.root?raw=true
!wget -O ../data/RD_distribution.root -nc $storage/RD_distribution.root?raw=true
In [3]:
columns = ['hSPD', 'pt_b', 'pt_phi', 'vchi2_b', 'mu_pt_sum']
original = root_numpy.root2array('../data/MC_distribution.root', branches=columns)
target = root_numpy.root2array('../data/RD_distribution.root', branches=columns)
original = pandas.DataFrame(original)
target = pandas.DataFrame(target)
original_weights = numpy.ones(len(original))
In [4]:
from sklearn.model_selection import train_test_split
# divide original samples into training ant test parts
original_train, original_test = train_test_split(original)
# divide target samples into training ant test parts
target_train, target_test = train_test_split(target)
original_weights_train = numpy.ones(len(original_train))
original_weights_test = numpy.ones(len(original_test))
In [5]:
from hep_ml.metrics_utils import ks_2samp_weighted
hist_settings = {'bins': 100, 'density': True, 'alpha': 0.7}
def draw_distributions(original, target, new_original_weights):
plt.figure(figsize=[15, 7])
for id, column in enumerate(columns, 1):
xlim = numpy.percentile(numpy.hstack([target[column]]), [0.01, 99.99])
plt.subplot(2, 3, id)
plt.hist(original[column], weights=new_original_weights, range=xlim, **hist_settings)
plt.hist(target[column], range=xlim, **hist_settings)
plt.title(column)
print('KS over ', column, ' = ', ks_2samp_weighted(original[column], target[column],
weights1=new_original_weights, weights2=numpy.ones(len(target), dtype=float)))
In [6]:
# pay attention, actually we have very few data
len(original), len(target)
Out[6]:
In [7]:
draw_distributions(original, target, original_weights)
In [8]:
draw_distributions(original_train, target_train, original_weights_train)
In [9]:
draw_distributions(original_test, target_test, original_weights_test)
In [10]:
bins_reweighter = reweight.BinsReweighter(n_bins=20, n_neighs=1.)
bins_reweighter.fit(original_train, target_train)
bins_weights_test = bins_reweighter.predict_weights(original_test)
# validate reweighting rule on the test part comparing 1d projections
draw_distributions(original_test, target_test, bins_weights_test)
In [11]:
reweighter = reweight.GBReweighter(n_estimators=50, learning_rate=0.1, max_depth=3, min_samples_leaf=1000,
gb_args={'subsample': 0.4})
reweighter.fit(original_train, target_train)
gb_weights_test = reweighter.predict_weights(original_test)
# validate reweighting rule on the test part comparing 1d projections
draw_distributions(original_test, target_test, gb_weights_test)
In [12]:
def check_ks_of_expression(expression):
col_original = original_test.eval(expression, engine='python')
col_target = target_test.eval(expression, engine='python')
w_target = numpy.ones(len(col_target), dtype='float')
print('No reweight KS:', ks_2samp_weighted(col_original, col_target,
weights1=original_weights_test, weights2=w_target))
print('Bins reweight KS:', ks_2samp_weighted(col_original, col_target,
weights1=bins_weights_test, weights2=w_target))
print('GB Reweight KS:', ks_2samp_weighted(col_original, col_target,
weights1=gb_weights_test, weights2=w_target))
In [13]:
check_ks_of_expression('hSPD')
In [14]:
check_ks_of_expression('hSPD * pt_phi')
In [15]:
check_ks_of_expression('hSPD * pt_phi * vchi2_b')
In [16]:
check_ks_of_expression('pt_b * pt_phi / hSPD ')
In [17]:
check_ks_of_expression('hSPD * pt_b * vchi2_b / pt_phi')
let's check how well a classifier is able to distinguish these distributions. ROC AUC is taken as a measure of quality.
For this puprose we split the data into train and test, then we train a classifier do distinguish these distributions. If ROC AUC = 0.5 on test, distibutions are identical, if ROC AUC = 1.0, they are ideally separable.
In [18]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
data = numpy.concatenate([original_test, target_test])
labels = numpy.array([0] * len(original_test) + [1] * len(target_test))
weights = {}
weights['original'] = original_weights_test
weights['bins'] = bins_weights_test
weights['gb_weights'] = gb_weights_test
for name, new_weights in weights.items():
W = numpy.concatenate([new_weights / new_weights.sum() * len(target_test), [1] * len(target_test)])
Xtr, Xts, Ytr, Yts, Wtr, Wts = train_test_split(data, labels, W, random_state=42, train_size=0.51)
clf = GradientBoostingClassifier(subsample=0.3, n_estimators=50).fit(Xtr, Ytr, sample_weight=Wtr)
print(name, roc_auc_score(Yts, clf.predict_proba(Xts)[:, 1], sample_weight=Wts))
In [19]:
# define base reweighter
reweighter_base = reweight.GBReweighter(n_estimators=50,
learning_rate=0.1, max_depth=3, min_samples_leaf=1000,
gb_args={'subsample': 0.4})
reweighter = reweight.FoldingReweighter(reweighter_base, n_folds=2)
# it is not needed divide data into train/test parts; rewighter can be train on the whole samples
reweighter.fit(original, target)
# predict method provides unbiased weights prediction for the whole sample
# folding reweighter contains two reweighters, each is trained on one half of samples
# during predictions each reweighter predicts another half of samples not used in training
folding_weights = reweighter.predict_weights(original)
draw_distributions(original, target, folding_weights)
In [21]:
data = numpy.concatenate([original, target])
labels = numpy.array([0] * len(original) + [1] * len(target))
weights = {}
weights['original'] = original_weights
weights['2-folding'] = folding_weights
for name, new_weights in weights.items():
W = numpy.concatenate([new_weights / new_weights.sum() * len(target), [1] * len(target)])
Xtr, Xts, Ytr, Yts, Wtr, Wts = train_test_split(data, labels, W, random_state=42, train_size=0.51)
clf = GradientBoostingClassifier(subsample=0.3, n_estimators=30).fit(Xtr, Ytr, sample_weight=Wtr)
print(name, roc_auc_score(Yts, clf.predict_proba(Xts)[:, 1], sample_weight=Wts))
In [ ]: