hep_ml.reweight contains methods to reweight distributions. Typically reweighting is applied to Monte Carlo simulation to compensate for drawbacks of the simulation, though there are many other applications.
In this example we reweight multidimensional distributions: given an original and a target sample, the aim is to find new weights for the original sample such that the two multidimensional distributions coincide.
This is a toy example without real physical meaning.
Pay attention: equality of distributions for each feature $\neq$ equality of the multivariate distributions.
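A minimal numpy illustration of why this matters: both samples below have the same one-dimensional marginals (each feature is standard normal), yet their joint distributions are completely different.

import numpy

rng = numpy.random.RandomState(0)
x = rng.normal(size=100000)
signs = rng.choice([-1, 1], size=100000)

sample_a = numpy.column_stack([x, x])          # second feature equals the first
sample_b = numpy.column_stack([x, signs * x])  # same marginals, sign flipped at random

# identical marginals, but the correlations reveal different joints:
print(numpy.corrcoef(sample_a.T)[0, 1])  # ~ 1.0
print(numpy.corrcoef(sample_b.T)[0, 1])  # ~ 0.0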
All samples are divided into training and validation parts. The training part is used to fit the reweighting rule, and the test part is used to estimate the reweighting quality.
In [1]:
%pylab inline
figsize(16, 8)
import root_numpy
import pandas
from hep_ml import reweight
In [2]:
storage = 'https://github.com/arogozhnikov/hep_ml/blob/data/data_to_download/'
!wget -O ../data/MC_distribution.root -nc $storage/MC_distribution.root?raw=true
!wget -O ../data/RD_distribution.root -nc $storage/RD_distribution.root?raw=true
In [3]:
columns = ['hSPD', 'pt_b', 'pt_phi', 'vchi2_b', 'mu_pt_sum']
original = root_numpy.root2array('../data/MC_distribution.root', branches=columns)
target = root_numpy.root2array('../data/RD_distribution.root', branches=columns)
original = pandas.DataFrame(original)
target = pandas.DataFrame(target)
original_weights = numpy.ones(len(original))
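Note: root_numpy is no longer maintained and may not install on recent setups. An equivalent load with uproot might look like the sketch below; the tree names inside the files are not verified here, so the helper simply assumes each file contains a single tree.

import uproot
import pandas

# hypothetical equivalent of the root_numpy load above (tree names unverified)
def load_tree(path, columns):
    with uproot.open(path) as f:
        tree_name = f.keys()[0]  # assumes the file contains exactly one tree
        return f[tree_name].arrays(columns, library='pd')

original = load_tree('../data/MC_distribution.root', columns)
target = load_tree('../data/RD_distribution.root', columns)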
In [4]:
from sklearn.model_selection import train_test_split
# divide original samples into training and test parts
original_train, original_test = train_test_split(original)
# divide target samples into training and test parts
target_train, target_test = train_test_split(target)
original_weights_train = numpy.ones(len(original_train))
original_weights_test = numpy.ones(len(original_test))
In [5]:
from hep_ml.metrics_utils import ks_2samp_weighted

hist_settings = {'bins': 100, 'density': True, 'alpha': 0.7}

def draw_distributions(original, target, new_original_weights):
    # compare weighted 1d projections of original vs target for each feature
    for index, column in enumerate(columns, 1):
        xlim = numpy.percentile(target[column], [0.01, 99.99])
        subplot(2, 3, index)
        hist(original[column], weights=new_original_weights, range=xlim, **hist_settings)
        hist(target[column], range=xlim, **hist_settings)
        title(column)
        print('KS over', column, '=',
              ks_2samp_weighted(original[column], target[column],
                                weights1=new_original_weights,
                                weights2=numpy.ones(len(target), dtype=float)))
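For reference, the weighted KS distance used here is the maximum absolute difference between the two weighted empirical CDFs. A minimal sketch of that definition (an illustration only, not the hep_ml implementation) could look like:

import numpy

def ks_weighted(data1, data2, weights1, weights2):
    data1, weights1 = numpy.asarray(data1), numpy.asarray(weights1)
    data2, weights2 = numpy.asarray(data2), numpy.asarray(weights2)
    order1, order2 = numpy.argsort(data1), numpy.argsort(data2)
    data1, weights1 = data1[order1], weights1[order1]
    data2, weights2 = data2[order2], weights2[order2]
    # weighted empirical CDFs with a leading zero, so index 0 corresponds to F = 0
    cdf1 = numpy.concatenate([[0.], numpy.cumsum(weights1) / numpy.sum(weights1)])
    cdf2 = numpy.concatenate([[0.], numpy.cumsum(weights2) / numpy.sum(weights2)])
    grid = numpy.sort(numpy.concatenate([data1, data2]))
    # evaluate both CDFs on the pooled grid and take the largest gap
    F1 = cdf1[numpy.searchsorted(data1, grid, side='right')]
    F2 = cdf2[numpy.searchsorted(data2, grid, side='right')]
    return numpy.max(numpy.abs(F1 - F2))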
In [6]:
# note: these samples are rather small
len(original), len(target)
In [7]:
draw_distributions(original, target, original_weights)
In [8]:
draw_distributions(original_train, target_train, original_weights_train)
In [9]:
draw_distributions(original_test, target_test, original_weights_test)
In [10]:
bins_reweighter = reweight.BinsReweighter(n_bins=20, n_neighs=1.)
bins_reweighter.fit(original_train, target_train)
bins_weights_test = bins_reweighter.predict_weights(original_test)
# validate reweighting rule on the test part comparing 1d projections
draw_distributions(original_test, target_test, bins_weights_test)
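The idea behind bins-based reweighting: an event's weight is the ratio of target to original bin densities at the event's position. A rough one-dimensional sketch of this idea (a hypothetical helper, not the hep_ml implementation):

import numpy

def bins_weights_1d(original, target, n_bins=20):
    original, target = numpy.asarray(original), numpy.asarray(target)
    # bin edges taken from percentiles of the pooled sample
    edges = numpy.percentile(numpy.concatenate([original, target]),
                             numpy.linspace(0, 100, n_bins + 1))
    density_orig, _ = numpy.histogram(original, bins=edges, density=True)
    density_targ, _ = numpy.histogram(target, bins=edges, density=True)
    bin_ids = numpy.clip(numpy.searchsorted(edges, original, side='right') - 1,
                         0, n_bins - 1)
    # small constant in the denominator guards against empty bins
    return density_targ[bin_ids] / (density_orig[bin_ids] + 1e-10)

Roughly speaking, BinsReweighter does this in several dimensions at once and additionally smooths the histograms over neighbouring bins (the n_neighs parameter), which is why it quickly runs out of data as the number of features grows.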
In [11]:
reweighter = reweight.GBReweighter(n_estimators=50, learning_rate=0.1, max_depth=3, min_samples_leaf=1000,
gb_args={'subsample': 0.4})
reweighter.fit(original_train, target_train)
gb_weights_test = reweighter.predict_weights(original_test)
# validate reweighting rule on the test part comparing 1d projections
draw_distributions(original_test, target_test, gb_weights_test)
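Conceptually, gradient boosted reweighting builds shallow trees that split the feature space into regions where original and target disagree, and multiplies the event weights by the per-region target/original ratio. Below is a much-simplified, hypothetical one-iteration sketch using an ordinary sklearn tree as the region finder; the real GBReweighter grows its trees with a dedicated symmetrized chi-squared splitting criterion, so this is only an illustration of the weight-update step.

import numpy
from sklearn.tree import DecisionTreeClassifier

def boosting_step(X_orig, X_targ, w_orig, w_targ, learning_rate=0.1):
    # fit a shallow tree to separate original from target (a stand-in region finder)
    X = numpy.concatenate([X_orig, X_targ])
    y = numpy.array([0] * len(X_orig) + [1] * len(X_targ))
    w = numpy.concatenate([w_orig, w_targ])
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=1000)
    tree.fit(X, y, sample_weight=w)
    # multiply original weights by the per-leaf target/original weight ratio
    leaves_orig, leaves_targ = tree.apply(X_orig), tree.apply(X_targ)
    new_w = w_orig.copy()
    for leaf in numpy.unique(leaves_orig):
        ratio = (w_targ[leaves_targ == leaf].sum() + 1e-10) \
              / (w_orig[leaves_orig == leaf].sum() + 1e-10)
        new_w[leaves_orig == leaf] *= ratio ** learning_rate
    return new_w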
In [12]:
def check_ks_of_expression(expression):
    # compute the weighted KS distance for an arbitrary combination of features
    col_original = original_test.eval(expression, engine='python')
    col_target = target_test.eval(expression, engine='python')
    w_target = numpy.ones(len(col_target), dtype='float')
    print('No reweight KS:', ks_2samp_weighted(col_original, col_target,
          weights1=original_weights_test, weights2=w_target))
    print('Bins reweight KS:', ks_2samp_weighted(col_original, col_target,
          weights1=bins_weights_test, weights2=w_target))
    print('GB Reweight KS:', ks_2samp_weighted(col_original, col_target,
          weights1=gb_weights_test, weights2=w_target))
In [13]:
check_ks_of_expression('hSPD')
In [14]:
check_ks_of_expression('hSPD * pt_phi')
In [15]:
check_ks_of_expression('hSPD * pt_phi * vchi2_b')
In [16]:
check_ks_of_expression('pt_b * pt_phi / hSPD ')
In [17]:
check_ks_of_expression('hSPD * pt_b * vchi2_b / pt_phi')
Let's check how well a classifier is able to distinguish these distributions; ROC AUC is taken as the measure of quality.
For this purpose we split the data into train and test parts, then train a classifier to distinguish the two distributions. If ROC AUC = 0.5 on the test part, the distributions are identical; if ROC AUC = 1.0, they are perfectly separable.
In [18]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

data = numpy.concatenate([original_test, target_test])
labels = numpy.array([0] * len(original_test) + [1] * len(target_test))

weights = {}
weights['original'] = original_weights_test
weights['bins'] = bins_weights_test
weights['gb_weights'] = gb_weights_test

for name, new_weights in weights.items():
    # normalize so the total weight of the original part matches the target part
    W = numpy.concatenate([new_weights / new_weights.sum() * len(target_test), [1] * len(target_test)])
    Xtr, Xts, Ytr, Yts, Wtr, Wts = train_test_split(data, labels, W, random_state=42, train_size=0.51)
    clf = GradientBoostingClassifier(subsample=0.3, n_estimators=50).fit(Xtr, Ytr, sample_weight=Wtr)
    print(name, roc_auc_score(Yts, clf.predict_proba(Xts)[:, 1], sample_weight=Wts))
In [19]:
from hep_ml.reweight import FoldingReweighter
In [20]:
# define the base reweighter
reweighter_base = reweight.GBReweighter(n_estimators=50,
                                        learning_rate=0.1, max_depth=3, min_samples_leaf=1000,
                                        gb_args={'subsample': 0.4})
reweighter = reweight.FoldingReweighter(reweighter_base, n_folds=2)
# there is no need to divide the data into train/test parts;
# the folding reweighter can be trained on the whole samples
reweighter.fit(original, target)
# the predict method provides unbiased weights for the whole sample:
# the folding reweighter contains two reweighters, each trained on one half of the samples,
# and during prediction each reweighter predicts the half it was not trained on
folding_weights = reweighter.predict_weights(original)
draw_distributions(original, target, folding_weights)
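A conceptual sketch of the folding scheme (simplified, with a hypothetical helper): each fold's reweighter is fit on the other fold and predicts only its own fold, so no event receives a weight from a model that saw it during training.

import numpy
from sklearn.model_selection import KFold

def folding_weights_sketch(original, target, make_reweighter, n_folds=2):
    weights = numpy.zeros(len(original))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    folds_orig = list(kf.split(original))
    folds_targ = list(kf.split(target))
    for (train_o, test_o), (train_t, _) in zip(folds_orig, folds_targ):
        # fit on the complementary fold, predict the held-out fold
        rw = make_reweighter()
        rw.fit(original.iloc[train_o], target.iloc[train_t])
        weights[test_o] = rw.predict_weights(original.iloc[test_o])
    return weights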
In [21]:
data = numpy.concatenate([original, target])
labels = numpy.array([0] * len(original) + [1] * len(target))
weights = {}
weights['original'] = original_weights
weights['2-folding'] = folding_weights
for name, new_weights in weights.items():
    W = numpy.concatenate([new_weights / new_weights.sum() * len(target), [1] * len(target)])
    Xtr, Xts, Ytr, Yts, Wtr, Wts = train_test_split(data, labels, W, random_state=42, train_size=0.51)
    clf = GradientBoostingClassifier(subsample=0.3, n_estimators=30).fit(Xtr, Ytr, sample_weight=Wtr)
    print(name, roc_auc_score(Yts, clf.predict_proba(Xts)[:, 1], sample_weight=Wts))