sPlot

This notebook explains sPlot and how to use hep_ml.splot. sPlot is a way to reconstruct features of mixture components based on known properties of distributions. This method is frequently used in High Energy Physics.

If you prefer explanations without code, find them here



In [1]:

    
import numpy
from matplotlib import pyplot as plt



In [2]:

    
size = 10000
sig_data = numpy.random.normal(-1, 1, size=size)
bck_data = numpy.random.normal(1, 1, size=size)

Simple sPlot example

First we start with a simple (and not very useful in practice) example:

Assume we have two types of particles (say, electrons and positrons).
Distribution of some characteristic is different for them (let this be $p_x$ momentum projection).



In [3]:

    
plt.hist(sig_data, color='b', alpha=0.5, bins=30, label='electron')
plt.hist(bck_data, color='r', alpha=0.5, bins=30, label='positron')
plt.xlim(-5, 5), plt.xlabel('$p_x$')
plt.legend()









    Out[3]:





<matplotlib.legend.Legend at 0x7f5907b94d30>

Observed distributions

Picture above shows how this distibution should look like, but due to inaccuracies during classification we will observe a different picture.

Let's assume that with a probability of 80% particle is classified correctly (and we are not using $p_x$ during classification).

And when we look at distribution of px for particles which were classified as electrons or positrons, we see that they were distorted. We lost the original shapes of distributions.



In [4]:

    
n_sig1, n_bck1 = 8000, 2000
n_sig2, n_bck2 = 2000, 8000
first_bin = numpy.concatenate([sig_data[:n_sig1], bck_data[:n_bck1]])
second_bin = numpy.concatenate([sig_data[n_sig1:], bck_data[n_bck1:]])



In [5]:

    
plt.figure(figsize=[15, 6])
plt.subplot(121)
plt.bar([0, 2], [n_sig1, n_sig2], width=1, color='b', alpha=0.5)
plt.bar([0, 2], [n_bck1, n_bck2], width=1, bottom=[n_sig1, n_sig2], color='r', alpha=0.5)
plt.xlim(-0.5, 3.5)

plt.axis('off')
plt.xticks([0.5, 2.5], ['as electrons', 'as positrons'])
plt.text(0.5, -300, 'as electron', horizontalalignment='center', verticalalignment='top', fontsize=20)
plt.text(2.5, -300, 'as positron', horizontalalignment='center', verticalalignment='top', fontsize=20)
plt.title('Proportion of events being classified as')

plt.subplot(122)
plt.hist(first_bin, alpha=0.5, bins=30, label='as electrons', color=(0.22, 0., 0.66))
plt.hist(second_bin, alpha=0.5, bins=30, label='as positrons', color=(0.66, 0., 0.22))
plt.legend()
plt.title('Distributions')
plt.xlim(-5, 5), plt.xlabel('$p_x$')
pass

Applying sWeights

We can think of it in the following way: there are 2 bins. In first 80% are electrons, 20% are signal. And visa versa in second bin.

To reconstruct initial distribution, we can plot histogram, where each event from first bin has weight 0.8, and each event from second bin has weight -0.2. This numbers are called sWeights.

So, if we had 8000 $e^{-}$ + 2000 $e^{+}$ in first bin and 8000 $e^{+}$ + 2000 $e^{-}$ ($ e^-, e^+$ are electron and positron). After summing with introduced sWeights:

$$ \big[ 8000 e^{-} + 2000 e^{+} \big] \times 0.8 + \big[ 2000 e^{-} + 8000 e^{+} \big] \times (- 0.2) = 6800 e^{-} $$

Positrons with positive and negative weights compensated each other, and we will get pure electrons.

At this moment we ignore normalization of sWeights (because it doesn't play role when we want to reconstruct shape).



In [6]:

    
def plot_with_weights(datas, weights, **kargs):
    assert len(datas) == len(weights)
    data = numpy.concatenate(datas)
    weight = numpy.concatenate([numpy.ones(len(d)) * w for d, w in zip(datas, weights) ])
    plt.hist(data, weights=weight, alpha=0.5, bins=30, **kargs)



In [7]:

    
plot_with_weights([first_bin, second_bin], [n_bck2, -n_bck1], density=True, label='reconstructed electron')
plot_with_weights([first_bin, second_bin], [-n_sig2, n_sig1], density=True, color='r', label='reconstructed positron')
plt.xlabel('px')
plt.legend();

Compare

let's compare reconstructed distribution for electrons with an original one:



In [8]:

    
plot_with_weights([first_bin, second_bin], [n_bck2, -n_bck1], density=True, label='reconstructed electons', edgecolor='none')
plot_with_weights([sig_data], [1], density=True, label='original electons', edgecolor='none')
plt.legend();

More complex case

In the case when we have only two 'bins' is simple and straightforward. But when there are more than two bins, the solution is not unique. There are many appropriate combinations of sWeights, which one to choose?



In [9]:

    
plt.bar([0, 2, 4], [3, 2, 1], width=1, color='b', alpha=0.5)
plt.bar([0, 2, 4], [1, 2, 3], width=1, bottom=[3, 2, 1], color='r', alpha=0.5)

plt.xlim(-1, 6)
plt.ylim(-0.5, 5)
plt.axis('off')

plt.text(0.5, -0.5, 'Bin 1', horizontalalignment='center', verticalalignment='top', fontsize=20)
plt.text(2.5, -0.5, 'Bin 2', horizontalalignment='center', verticalalignment='top', fontsize=20)
plt.text(4.5, -0.5, 'Bin 3', horizontalalignment='center', verticalalignment='top', fontsize=20)
pass

Things in practice are however even more complex. We have not bins, but continuos distribution (which can be treated as many bins).

Typically this is a distribution over mass. By fitting mass we are able to split mixture into two parts: signal channel and everything else.

Building sPlot over mass

Let's show how this works. First we generate two fake distributions (signal and background) with 2 variables: mass and momentum $p$.



In [10]:

    
from scipy.stats import norm, expon



In [11]:

    
plt.figure(figsize=[15, 6])
size = 10000

sig_mass_distr = norm(loc=4, scale=1)
bck_mass_distr = expon(scale=4)

sig_mass = sig_mass_distr.rvs(size=size)
bck_mass = bck_mass_distr.rvs(size=size)
sig_p = numpy.random.normal(5, 1, size=size)
bck_p = numpy.random.normal(3, 1, size=size)

plt.subplot(121)

plt.hist(sig_mass, bins=20, density=True)
plt.hist(bck_mass, bins=20, density=True, range=(0, 10), alpha=0.5)
plt.xlabel('mass')

plt.subplot(122)

plt.hist(sig_p, bins=20, density=True)
plt.hist(bck_p, bins=20, density=True, range=(0, 10), alpha=0.5)
plt.xlabel('p');

Of course we don't have labels which events are signal and which are background beforehand

And we observe the mixture of two distributions:



In [12]:

    
plt.figure(figsize=[15, 6])
mass = numpy.concatenate([sig_mass, bck_mass])
p = numpy.concatenate([sig_p, bck_p])

sorter = numpy.argsort(mass)
mass = mass[sorter]
p = p[sorter]

plt.subplot(121)
plt.hist(mass, bins=20, range=(0, 10))
plt.xlabel('mass')

plt.subplot(122)
plt.hist(p, bins=20)
plt.xlabel('p')









    Out[12]:





Text(0.5,0,'p')

We have no information about real labels

But we know a priori that background is distributed as exponential distribution and signal - as gaussian (more complex models can be met in practice, but idea is the same).

After fitting the mixture (let me skip this process), we will get the following result:



In [13]:

    
x = numpy.linspace(0, 10)
plt.hist(mass, bins=30, range=[0, 10], density=True, alpha=0.4)
plt.plot(x, norm.pdf(x, loc=4, scale=1) / 2., label='signal')
plt.plot(x, expon.pdf(x, scale=4) / 2., label='bck')
plt.plot(x, 0.5 * (norm.pdf(x, loc=4, scale=1) + expon.pdf(x, scale=4)), label='sig + bck')
plt.legend(fontsize=20)









    Out[13]:





<matplotlib.legend.Legend at 0x7f59078e2128>

Fitting doesn't give us information about real labels

But it gives an information about probabilities, thus now we can estimate number of signal and background events within each bin.

We won't use bins, but instead we will get for each event probability that it is signal or background:



In [14]:

    
import pandas
probs = pandas.DataFrame(dict(sig=sig_mass_distr.pdf(mass), bck=bck_mass_distr.pdf(mass)))
probs = probs.div(probs.sum(axis=1), axis=0)



In [15]:

    
plt.plot(mass, probs.sig, label='sig probability')
plt.plot(mass, probs.bck, label='bck probability')
plt.xlim(0, 10), plt.legend(), plt.xlabel('mass')









    Out[15]:





((0, 10), <matplotlib.legend.Legend at 0x7f58e6834cf8>, Text(0.5,0,'mass'))

Appying sPlot

sPlot converts probabilities to sWeights, using an implementation from hep_ml:



In [16]:

    
from hep_ml import splot
sWeights = splot.compute_sweights(probs)

As you can see, there are also negative sWeights, which are needed to compensate the contributions of other class.



In [17]:

    
plt.plot(mass, sWeights.sig, label='sig sWeight')
plt.plot(mass, sWeights.bck, label='bck sWeight')
plt.xlim(0, 10), plt.legend(), plt.xlabel('mass');

Using sWeights to reconstruct initial distribution

Let's check that we achieved our goal and can reconstruct momentum distribution for signal and background:



In [18]:

    
plt.figure(figsize=[15, 7])
plt.subplot(121)
hist_conf = dict(bins=30, alpha=0.4, range=[0, 10])
plt.hist(sig_p, label='original sig p', **hist_conf)
plt.hist(p, weights=sWeights.sig, label='reconstructed sig p', **hist_conf)
plt.xlabel('p'), plt.legend()

plt.subplot(122)
plt.hist(bck_p, label='original bck p', **hist_conf)
plt.hist(p, weights=sWeights.bck, label='reconstructed bck p', **hist_conf)
plt.xlabel('p'), plt.legend()
pass

An important requirement of sPlot

Reconstructed variable (i.e. $p$) and splotted variable (i.e. mass) shall be statistically independent for each class.

Read the line above again. Reconstructed and splotted variable are correlated:



In [19]:

    
numpy.corrcoef(abs(mass - 4), p) [0, 1]









    Out[19]:





-0.33919239053759365

But within each class there is no correlation, so the requirement is satisfied:



In [20]:

    
print(numpy.corrcoef(abs(sig_mass - 4), sig_p)[0, 1])
print(numpy.corrcoef(abs(bck_mass - 4), bck_p)[0, 1])









    



0.016468135185733136
-0.008327865674083126

as a demonstration why this is important let's use sweights to reconstruct mass (obviously mass is correlated with mass):



In [21]:

    
hist_conf = dict(bins=30, alpha=0.5, range=[-1, 7])
plt.hist(sig_mass, label='original sig mass', **hist_conf)
plt.hist(mass, weights=sWeights.sig, label='reconstructed sig mass', **hist_conf)
plt.xlabel('mass'), plt.legend()
pass

$\def\ps{p_s(x)}$ $\def\pb{p_b(x)}$ $\def\ws{sw_s(x)}$ $\def\wb{sw_b(x)}$

Derivation of sWeights (optional)

Now, after we seen how this works, let's derive a formula for sWeights.

The only information we have from fitting over mass is $ \ps $, $ \pb$ which are probabilities of event $x$ to be signal and background.

Our main goal is to correctly reconstruct histogram. Let's reconstruct the number of signal events in particular bin. Let's introduce unknown $p_s$ and $p_b$ - probability that signal or background event will be in the named bin.

(Since mass and reconstructed variable are statistically independent for each class, $p_s$ and $p_b$ do not depend on mass.)

The mathematical expectation should be obviously equal to $p_s N_s$, where $N_s$ is total amount of signal events available from fitting.

Let's also introduce random variable $1_{x \in bin}$, which is 1 iff event $x$ lies in selected bin.

The estimate for number of signal event in bin is equal to: $$X = \sum_x \ws \; 1_{x \in bin},$$ where $\ws$ are sPlot weights and are subject to find.

First main property of sweights

Property 1. We expect out estimation to be unbiased $$\mathbb{E} \, X = p_s N_s $$

Corollary Let's understand what this means for sPlot weights.

$ p_s N_s = \mathbb{E} \, X = \sum_x w_s \; \mathbb{E} \, 1_{x \in bin} = \sum_x w_s \; (p_s \ps + p_b \pb) $

In the line above I used the assumption that variables are statistically independent for each class.

Since the previous equation should hold for all possible $p_s$ and $p_b$, we get two equalities:
$ p_s N_s = \sum_x \ws \; p_s \ps $
$ 0 = \sum_x \ws \; p_b \pb $

After reduction:
$ N_s = \sum_x \ws \; \ps $
$ 0 = \sum_x \ws \; \pb $

This way we can guarantee that mean contribution of background are 0 (expectation is zero, but observed number will not be zero due to statistical deviation).

Under assumption of linearity:

assuming that splot weight can be computed as a linear combination of conditional probabilities:

$ \ws = a_1 \pb + a_2 \ps$

We can easily reconstruct those numbers, first let's rewrite our system:

$ \sum_x (a_1 \pb + a_2 \ps) \; \ps = 0$
$ \sum_x (a_1 \pb + a_2 \ps) \; \pb = N_{sig}$

$ a_1 V_{bb} + a_2 V_{bs} = 0$
$ a_1 V_{sb} + a_2 V_{ss} = N_{sig}$

Where $V_{ss} = \sum_x \ps \; \ps $, $V_{bs} = V_{sb} = \sum_x \ps \; \pb$, $V_{bb} = \sum_x \pb \; \pb$

Having solved this linear equation, we get needed coefficients (as those in the paper)

NB. There is little difference between $V$ matrix I use and $V$ matrix in the paper.

Minimization of variation

$\def\Var{\mathbb{V}\,}$

Previous part allows one to get the correct result. But there is still no explanation of reason for linearity.

Apart from having correct mean, we should also minimize variation of any reconstructed variable. Let's try to optimize it

$$ \Var X = \sum_x \ws^2 \; \Var 1_{x \in bin} = \sum_x \ws^2 \; (p_s \ps + p_b \pb)(1 - p_s \ps - p_b \pb)$$

A bit complex, isn't it? Instead of optimizing such a complex expression (which is individual for each bin), let's minimize it's uniform upper estimate $$ \Var X = \sum_x \ws^2 \; \Var 1_{x \in bin} \leq \sum_x \ws^2 $$

so if we are going to minimize this upper estimate, we should solve the following optimization problem with constraints:
$\sum_x \ws^2 \to \min $
$\sum_x \ws \; \pb = 0$
$\sum_x \ws \; \ps = N_{sig}$

Let's write lagrangian of optimization problem:
$$ \mathcal{L} = \sum_x \ws^2

\lambda_1 \left[\sum_x \ws \; \pb \right]
\lambda_2 \left[\sumx \ws \; \ps - N{sig} \right]$$

After taking derivative with respect to $ \ws $ we get the equality: $$ 0 = \dfrac{\partial \mathcal{L}}{\partial \; \ws} = 2 \ws

\lambda_1 \pb
\lambda_2 \ps $$

which holds for every $x$. Thus, after renaming for convenience $a_1 = - \lambda_1 / 2, $ $a_2 = - \lambda_2 / 2, $ we are getting needed linear dependency.

Uncorrelatedness

The main assuption we used here is that distribution inside each bin is abolutely indentical.

In other words, we stated that there is no correlation between the index of bin and the reconstructed variable. Remember that bin corresponds to some interval in mass, and finaly we get:

reconstructed variable shall not be correlated with mass variables (or any other splotted variable)

Conclusion

sPlot allows reconstruction of some variables.
the only information used is probabilities taken from fit over variable. If fact, any probability estimates fit well.
the source of probabilities should be statistically independent from reconstructed variable (for each class!).
mixture may contain more than 2 classes (this is supported by hep_ml.splot as well)