Demonstration of distribution reweighting

hep_ml.reweight contains methods to reweight distributions. Typically we reweight Monte Carlo samples to compensate for drawbacks of simulation, though there are many other applications.

In this example we reweight multidimensional distributions: given an original and a target distribution, the aim is to find new weights for the original distribution such that these multidimensional distributions coincide.
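
Equivalently, we look for weights approximating the density ratio between the two distributions: $w(x) \propto f_{\text{target}}(x) / f_{\text{original}}(x)$.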

Here we use a toy example without real physics meaning.

Pay attention: equality of distributions for each feature $\neq$ equality of multivariate distributions.
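
A tiny illustration of this statement (my own sketch, not part of the original analysis): the two samples below have identical 1d marginal distributions, but clearly different joint 2d distributions.

import numpy
rng = numpy.random.RandomState(0)
x = rng.normal(size=10000)
sample_a = numpy.stack([x, x], axis=1)    # points on the diagonal y = x
sample_b = numpy.stack([x, -x], axis=1)   # points on the anti-diagonal y = -x
# both coordinates of both samples are standard normal (equal marginals),
# yet the correlation is +1 in sample_a and -1 in sample_b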

All samples are divided into training and validation parts. The training part is used to fit the reweighting rule, and the test part is used to estimate the reweighting quality.


In [1]:
%matplotlib inline

import numpy
import root_numpy
import pandas
from hep_ml import reweight
from matplotlib import pyplot as plt


Welcome to JupyROOT 6.14/04

Downloading data


In [2]:
storage = 'https://github.com/arogozhnikov/hep_ml/blob/data/data_to_download/'
!wget -O ../data/MC_distribution.root -nc $storage/MC_distribution.root?raw=true
!wget -O ../data/RD_distribution.root -nc $storage/RD_distribution.root?raw=true


File ‘../data/MC_distribution.root’ already there; not retrieving.
File ‘../data/RD_distribution.root’ already there; not retrieving.

In [3]:
columns = ['hSPD', 'pt_b', 'pt_phi', 'vchi2_b', 'mu_pt_sum']

original = root_numpy.root2array('../data/MC_distribution.root', branches=columns)
target = root_numpy.root2array('../data/RD_distribution.root', branches=columns)

original = pandas.DataFrame(original)
target = pandas.DataFrame(target)

original_weights = numpy.ones(len(original))

Prepare train and test samples

  • train part is used to fit the reweighting rule
  • test part is used to evaluate the reweighting rule by comparing:
    • Kolmogorov-Smirnov distances for 1d projections
    • n-dim distributions using ML (see below).

In [4]:
from sklearn.model_selection import train_test_split
# divide original samples into training and test parts
original_train, original_test = train_test_split(original)
# divide target samples into training and test parts
target_train, target_test = train_test_split(target)

original_weights_train = numpy.ones(len(original_train))
original_weights_test = numpy.ones(len(original_test))

In [5]:
from hep_ml.metrics_utils import ks_2samp_weighted
hist_settings = {'bins': 100, 'density': True, 'alpha': 0.7}

def draw_distributions(original, target, new_original_weights):
    # draw 1d projections of original (with the given weights) and target,
    # and print the weighted KS distance for every variable
    plt.figure(figsize=[15, 7])
    for i, column in enumerate(columns, 1):
        xlim = numpy.percentile(target[column], [0.01, 99.99])
        plt.subplot(2, 3, i)
        plt.hist(original[column], weights=new_original_weights, range=xlim, **hist_settings)
        plt.hist(target[column], range=xlim, **hist_settings)
        plt.title(column)
        print('KS over ', column, ' = ', ks_2samp_weighted(original[column], target[column], 
                                         weights1=new_original_weights, weights2=numpy.ones(len(target), dtype=float)))

Original distributions

KS = Kolmogorov-Smirnov distance
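
Concretely, ks_2samp_weighted computes the weighted analogue of the two-sample KS statistic, $KS = \max_x |F_{\text{original}}(x) - F_{\text{target}}(x)|$, where $F$ denotes the weighted empirical distribution function of a sample.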


In [6]:
# pay attention: we have relatively little data, especially in the target sample
len(original), len(target)


Out[6]:
(1000000, 21441)

In [7]:
draw_distributions(original, target, original_weights)


KS over  hSPD  =  0.5203540728277889
KS over  pt_b  =  0.21639364439970188
KS over  pt_phi  =  0.4020113592414034
KS over  vchi2_b  =  0.40466385087324064
KS over  mu_pt_sum  =  0.21639364439970188

Train part of the distributions


In [8]:
draw_distributions(original_train, target_train, original_weights_train)


KS over  hSPD  =  0.5210722288557113
KS over  pt_b  =  0.21779417910324533
KS over  pt_phi  =  0.40672785074642664
KS over  vchi2_b  =  0.4085994129292934
KS over  mu_pt_sum  =  0.21779417910324533

Test part of the distributions


In [9]:
draw_distributions(original_test, target_test, original_weights_test)


KS over  hSPD  =  0.5193503182242225
KS over  pt_b  =  0.213457129266594
KS over  pt_phi  =  0.3883895351612351
KS over  vchi2_b  =  0.39544763999336957
KS over  mu_pt_sum  =  0.213457129266594

Bins-based reweighting in n dimensions

The typical way to reweight distributions is based on bins: the space of variables is split into bins, and in each bin the weights of original events are multiplied by the ratio of target to original densities in that bin.
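
To make the mechanics concrete, here is a minimal 1d sketch of this idea (my illustration using the original and target frames from above, not hep_ml's implementation; BinsReweighter works in several dimensions and additionally smooths over n_neighs neighboring bins):

edges = numpy.linspace(*numpy.percentile(target['hSPD'], [0.01, 99.99]), 21)
hist_original, _ = numpy.histogram(original['hSPD'], bins=edges, density=True)
hist_target, _ = numpy.histogram(target['hSPD'], bins=edges, density=True)
# per-bin multiplier: ratio of target to original densities (guarding empty bins)
bin_ratio = hist_target / numpy.clip(hist_original, 1e-10, None)
# every original event receives the multiplier of the bin it falls into
bin_index = numpy.clip(numpy.searchsorted(edges, original['hSPD']) - 1, 0, len(bin_ratio) - 1)
toy_bin_weights = bin_ratio[bin_index]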


In [10]:
bins_reweighter = reweight.BinsReweighter(n_bins=20, n_neighs=1.)
bins_reweighter.fit(original_train, target_train)

bins_weights_test = bins_reweighter.predict_weights(original_test)
# validate reweighting rule on the test part comparing 1d projections
draw_distributions(original_test, target_test, bins_weights_test)


KS over  hSPD  =  0.2951658559973105
KS over  pt_b  =  0.08121477746324601
KS over  pt_phi  =  0.13957791696908367
KS over  vchi2_b  =  0.24605269011792386
KS over  mu_pt_sum  =  0.08121477746324601

Gradient Boosted Reweighter

This algorithm is inspired by gradient boosting and is able to fight the curse of dimensionality. It uses decision trees and a special loss function (ReweightLossFunction).

GBReweighter supports negative weights (so MC can be reweighted to sPlotted real data).
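
The intuition behind one boosting step can be sketched with a single tree (a simplified illustration of mine, not hep_ml's actual loss or update rule, and assuming numpy-array weights): the tree finds regions where the two samples disagree most, and each leaf then acts like one adaptive bin.

from sklearn.tree import DecisionTreeClassifier

def one_tree_reweight(original, target, original_weights):
    # label the two samples and let a shallow tree separate them
    X = numpy.concatenate([original, target])
    y = numpy.array([0] * len(original) + [1] * len(target))
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=1000).fit(X, y)
    leaf_original = tree.apply(numpy.asarray(original))
    leaf_target = tree.apply(numpy.asarray(target))
    new_weights = numpy.array(original_weights, dtype=float)
    for leaf in numpy.unique(leaf_original):
        mask = leaf_original == leaf
        # fraction of target events in this leaf over fraction of original weight in it
        ratio = (leaf_target == leaf).mean() / (original_weights[mask].sum() / original_weights.sum())
        new_weights[mask] *= ratio
    return new_weights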


In [11]:
reweighter = reweight.GBReweighter(n_estimators=50, learning_rate=0.1, max_depth=3, min_samples_leaf=1000, 
                                   gb_args={'subsample': 0.4})
reweighter.fit(original_train, target_train)

gb_weights_test = reweighter.predict_weights(original_test)
# validate reweighting rule on the test part comparing 1d projections
draw_distributions(original_test, target_test, gb_weights_test)


KS over  hSPD  =  0.04718368649398397
KS over  pt_b  =  0.026187858656329777
KS over  pt_phi  =  0.023381726229074
KS over  vchi2_b  =  0.028600191400877273
KS over  mu_pt_sum  =  0.026187858656329777

Comparing some simple expressions:

the most interesting part is checking derived variables (expressed via the original variables), since their distributions probe the agreement of the multidimensional distributions.


In [12]:
def check_ks_of_expression(expression):
    col_original = original_test.eval(expression, engine='python')
    col_target = target_test.eval(expression, engine='python')
    w_target = numpy.ones(len(col_target), dtype='float')
    print('No reweight   KS:', ks_2samp_weighted(col_original, col_target, 
                                                 weights1=original_weights_test, weights2=w_target))
    print('Bins reweight KS:', ks_2samp_weighted(col_original, col_target, 
                                                 weights1=bins_weights_test, weights2=w_target))
    print('GB Reweight   KS:', ks_2samp_weighted(col_original, col_target,
                                                 weights1=gb_weights_test, weights2=w_target))

In [13]:
check_ks_of_expression('hSPD')


No reweight   KS: 0.5193503182242225
Bins reweight KS: 0.2951658559973105
GB Reweight   KS: 0.04718368649398397

In [14]:
check_ks_of_expression('hSPD * pt_phi')


No reweight   KS: 0.09252929341563598
Bins reweight KS: 0.13072965686675675
GB Reweight   KS: 0.03051476562999561

In [15]:
check_ks_of_expression('hSPD * pt_phi * vchi2_b')


No reweight   KS: 0.3701594318230964
Bins reweight KS: 0.2725790982061164
GB Reweight   KS: 0.03781407972773909

In [16]:
check_ks_of_expression('pt_b * pt_phi / hSPD ')


No reweight   KS: 0.4691427379219105
Bins reweight KS: 0.27596281132356254
GB Reweight   KS: 0.043978959119210626

In [17]:
check_ks_of_expression('hSPD * pt_b * vchi2_b / pt_phi')


No reweight   KS: 0.48417531393470276
Bins reweight KS: 0.2893709029176066
GB Reweight   KS: 0.03194834745037389

GB-discrimination

Let's check how well a classifier is able to distinguish these distributions. ROC AUC is taken as the measure of quality.

For this purpose we split the data into train and test parts, then train a classifier to distinguish these distributions. If ROC AUC = 0.5 on the test part, the distributions are identical; if ROC AUC = 1.0, they are perfectly separable.


In [18]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

data = numpy.concatenate([original_test, target_test])
labels = numpy.array([0] * len(original_test) + [1] * len(target_test))

weights = {}
weights['original'] = original_weights_test
weights['bins'] = bins_weights_test
weights['gb_weights'] = gb_weights_test


for name, new_weights in weights.items():
    # normalize the original weights to the same total as the target sample
    W = numpy.concatenate([new_weights / new_weights.sum() * len(target_test), [1] * len(target_test)])
    Xtr, Xts, Ytr, Yts, Wtr, Wts = train_test_split(data, labels, W, random_state=42, train_size=0.51)
    clf = GradientBoostingClassifier(subsample=0.3, n_estimators=50).fit(Xtr, Ytr, sample_weight=Wtr)
    
    print(name, roc_auc_score(Yts, clf.predict_proba(Xts)[:, 1], sample_weight=Wts))


original 0.882831894651019
bins 0.7860968212062143
gb_weights 0.5338000084993204

Folding reweighter

With FoldingReweighter one can do cross-validation more simply and, in the end, obtain unbiased weights for the whole original sample.


In [19]:
# define base reweighter
reweighter_base = reweight.GBReweighter(n_estimators=50, 
                                        learning_rate=0.1, max_depth=3, min_samples_leaf=1000, 
                                        gb_args={'subsample': 0.4})
reweighter = reweight.FoldingReweighter(reweighter_base, n_folds=2)
# there is no need to divide the data into train/test parts; the reweighter can be trained on the whole samples
reweighter.fit(original, target)

# predict method provides unbiased weights prediction for the whole sample
# folding reweighter contains two reweighters, each is trained on one half of samples
# during predictions each reweighter predicts another half of samples not used in training
folding_weights = reweighter.predict_weights(original)

draw_distributions(original, target, folding_weights)


KFold prediction using folds column
KS over  hSPD  =  0.04906960924441417
KS over  pt_b  =  0.0302104042975897
KS over  pt_phi  =  0.023896752700647228
KS over  vchi2_b  =  0.028907928010168615
KS over  mu_pt_sum  =  0.0302104042975897

GB discrimination for reweighting rule


In [21]:
data = numpy.concatenate([original, target])
labels = numpy.array([0] * len(original) + [1] * len(target))

weights = {}
weights['original'] = original_weights
weights['2-folding'] = folding_weights


for name, new_weights in weights.items():
    # normalize the original weights to the same total as the target sample
    W = numpy.concatenate([new_weights / new_weights.sum() * len(target), [1] * len(target)])
    Xtr, Xts, Ytr, Yts, Wtr, Wts = train_test_split(data, labels, W, random_state=42, train_size=0.51)
    clf = GradientBoostingClassifier(subsample=0.3, n_estimators=30).fit(Xtr, Ytr, sample_weight=Wtr)
    
    print(name, roc_auc_score(Yts, clf.predict_proba(Xts)[:, 1], sample_weight=Wts))


original 0.8878493574684831
2-folding 0.5501818764659429
