Bayes Theorem

Goals:

  • Grasp how the mathematics of probability can be used to do statistical inference.
  • Start working through real inference problems, with pencil, paper, and PGMs.

In [ ]:
exec(open('code/bayes_theorem.py').read()) # see code here for later demos
%matplotlib inline

References

  • Gelman ch. 1
  • Ivezic 5.1-5.3
  • MacKay 2.3

Sampling distributions and likelihoods

You've just been introduced to PGMs, which provide a visual representation of how data are generated.

  • In inference problems, we have one (and only one) dataset to learn from

  • The sampling distribution, $p(\mathrm{data}|\mathrm{params})$, plays a central role, since it quantifies how probable a given data set is given an assumed set of parameters.

  • When evaluated as a function of the parameters, $p(\mathrm{data}|\mathrm{params})$ is known as the likelihood function

Likelihood gotchas

  • The sampling distribution is a PDF, normalized in data space

  • The likelihood function is not a PDF for the parameters - it's not normalized in parameter space (see the quick numerical check after this list)

  • Either way, PGMs are still useful for visualizing the factorization of a joint probability into conditionally independent pieces
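As a quick numerical check of the non-normalization point, consider a made-up coin-flip experiment (a binomial sampling distribution, chosen here purely for illustration): summing over the possible data at fixed parameter gives 1, but integrating the same expression over the parameter does not.

In [ ]:
import scipy.stats as st
from scipy.integrate import quad

k, n = 7, 10   # hypothetical data: 7 successes in 10 trials

# sampling distribution: normalized over data space (sum over all possible k)
print(sum(st.binom.pmf(kk, n, 0.5) for kk in range(n + 1)))  # = 1

# likelihood function: the same PMF viewed as a function of the parameter
like = lambda q: st.binom.pmf(k, n, q)
print(quad(like, 0.0, 1.0)[0])                               # = 1/(n+1), not 1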

PGMs in inference

  • Double circled (or, sometimes, gray shaded) nodes denote variables that are fixed by observation: these nodes indicate the data.

  • The PGM still illustrates a particular factorization of the joint PDF of all variables.

  • $\theta \sim$ ???
  • For each $k$:
    • $\mu_k|\theta =$ ???
    • $N_k|\mu_k \sim$ ???

$P(\theta) \prod_k P(\mu_k|\theta) P(N_k|\mu_k)$

Other ingredients for principled inference

$p($data|params$)$ clearly has a role to play in inferring which parameter values are consistent with the data. What else do we need?

  • $p($params$)$, the prior distribution
  • $p($params|data$)$, the posterior distribution

The prior distribution

$p($params$)$

  • The marginal probability of a set of parameter values (integrated over possible data sets).
  • Consequently, independent of the measured data.
  • Interpretation: what we know about the model parameters before incorporating new knowledge in the form of the measured data.

The posterior distribution

$p($params|data$)$

  • The probability of a model given the measured data.
  • Interpretation: what we know about the model parameters after incorporating new knowledge in the form of the measured data. In other words, the product of statistical inference.

Bayes Theorem

The ingredients above are all related through the definition of conditional probability

$p(\mathrm{params}|\mathrm{data}) = \frac{p(\mathrm{data}|\mathrm{params})\,p(\mathrm{params})}{p(\mathrm{data})}$

The crux of this theorem, what makes it more than a trivial restatement, is that probability distributions are the appropriate mathematical tool for encoding knowledge about models and parameters.

  • $p(\mathrm{params})$: prior - what we know before doing the experiment
  • $p(\mathrm{data}|\mathrm{params})$: sampling distribution - probability of obtaining our data set
  • $p(\mathrm{params}|\mathrm{data})$: posterior - what we know after doing the experiment ("the answer")
  • $p(\mathrm{data})$: evidence - marginal probability of obtaining our data, averaged over all possible parameter values (more on this later)

All of these terms are implicitly also conditional on the choice of model.

PGMs and Bayes

  • There are two ways of factorizing the joint PDF for all variables (verified numerically in the sketch after this list):

$p(\mathrm{params}, \mathrm{data}) = p(\mathrm{params}|\mathrm{data}) \, p(\mathrm{data})$

$= p(\mathrm{data}|\mathrm{params}) \, p(\mathrm{params})$

  • The second one is the factorization illustrated by the PGM - and it is proportional to the posterior PDF for the parameters given the data.

  • PGMs help us design inferences, by illustrating the conditional dependencies between parameters and data
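As a check of the factorization identity above, we can tabulate a small, made-up discrete joint distribution (two possible parameter values, three possible data values) and confirm that both factorizations reproduce the same joint probabilities.

In [ ]:
import numpy as np

# hypothetical toy model: 2 parameter values, 3 possible data values
prior = np.array([0.3, 0.7])                     # p(params)
sampling = np.array([[0.5, 0.3, 0.2],            # p(data|params=0)
                     [0.1, 0.4, 0.5]])           # p(data|params=1)

joint = prior[:, None] * sampling                # p(data|params) p(params)
evidence = joint.sum(axis=0)                     # p(data), marginalized over params
posterior = joint / evidence                     # p(params|data), i.e. Bayes Theorem

# the other factorization, p(params|data) p(data), recovers the same joint PDF
print(np.allclose(posterior * evidence, joint))  # True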

Example: measuring the flux of a source

Say we want to measure the flux of a galaxy. In a given integration time, $T$, the number of counts, $N$, that we collect in our fancy CCD will be Poisson distributed

$N|\mu \sim \mathrm{Poisson}(\mu)$

where $\mu=FAT$ is the average number of counts we would expect in time $T$, the product of the integration time, the source flux ($F$, counts per unit time and area), and the collecting area of our telescope ($A$).
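The sampling distribution is all we need to generate mock data. A minimal sketch, using a made-up "true" value of $\mu$ (purely for illustration; in a real measurement we get only a single draw):

In [ ]:
import scipy.stats as st

mu_true = 4.2                              # hypothetical true mean number of counts
mock_N = st.poisson.rvs(mu_true, size=10)  # ten mock realizations of the experiment
print(mock_N)                              # our real dataset is just one such number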

Presumably we know $A$ and $T$ well, so for convenience we can make $\mu$ rather than $F$ the free parameter of our model.

$N|\mu \sim \mathrm{Poisson}(\mu)$

$\mu \sim \mathrm{some~prior}$

We'll talk more about how to choose a prior in a few minutes. For now, we'll make a common choice, the uniform distribution (for $\mu\geq0$ in this case).

  • This is an improper distribution, i.e. one that can't technically be normalized. This doesn't necessarily matter, as long as the posterior distribution turns out to be proper.

$\mu \sim \mathrm{Uniform}(0,\infty)$

What about the evidence, $P(N)$?

  • This is constant with respect to model parameters by definition, so we don't actually need to calculate it (although we could, by marginalizing the sampling distribution over $\mu$).
  • That's because we know the posterior will be a probability distribution - as long as it's proper, the normalizing constant must be whatever makes it integrate to 1.

So, we have everything we need to calculate $p(\mu|N)\propto P(N|\mu)p(\mu)$.

Say we measure $N=5$. Now what?

Broadly speaking, we have 3 options for computing the posterior.

  • Brute force over a grid in parameter space: straightforward, but expensive in many dimensions
  • Analytically: super-efficient, but only possible for certain forms of the likelihood and prior
  • Monte Carlo: a more intelligent brute-force method

Brute force


In [ ]:
N = 5
mu = np.linspace(0.0, 20.0, 100) # np == numpy
prior = 1.0                      # uniform (proportional to a constant)
like = st.poisson.pmf(N, mu)     # st == scipy.stats
post = like * prior
plt.rcParams['figure.figsize'] = (6.0,4.0) # plt == matplotlib.pyplot
plt.plot(mu, post, 'bo');
plt.plot([N]*2, [0.0, np.max(post)*1.1], 'k--');
plt.xlabel(r'$\mu$', fontsize=22);
plt.ylabel('posterior', fontsize=22);
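The curve above is only proportional to the posterior. We can normalize it numerically on the grid - this normalizing integral plays the role of the evidence in Bayes Theorem (up to the arbitrary constant in our improper prior). A minimal sketch, assuming the cell above has been run:

In [ ]:
# normalize the gridded posterior; reuses mu and post from the previous cell
norm = np.sum(post) * (mu[1] - mu[0])  # simple rectangle-rule integral of likelihood x prior
post_normed = post / norm
print(norm, np.sum(post_normed) * (mu[1] - mu[0]))
# (the constant comes out close to 1 here - a coincidence of the Poisson form
#  with a flat prior in mu, not a general feature)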

Analytical solution

If both the prior and likelihood are built from standard statistical distributions, we can sometimes take advantage of conjugacy.

Conjugate distributions are like eigenfunctions of Bayes Theorem: special cases in which, for a particular sampling distribution, the posterior has the same functional form as the prior.

$f(x|\theta)\,g(\theta|\phi) \propto g\left[\theta\left|\tilde{\phi}(x,\phi)\right.\right]$

The Gamma distribution is the conjugate prior for the Poisson sampling distribution

$p(x|\alpha,\beta) = \frac{1}{\Gamma(\alpha)}\beta^\alpha x^{\alpha-1} e^{-\beta x}$ for $x\geq0$

Our uniform prior is a limiting case of the Gamma distribution ($\alpha=1$, $\beta\rightarrow0$), so we can take advantage of this.

If we take the prior $\mu \sim \mathrm{Gamma}(\alpha_0,\beta_0)$, the posterior will be $\mu|N \sim \mathrm{Gamma}(\alpha_0+N,\beta_0+1)$.
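To see where this comes from, multiply the likelihood by the prior and keep only the factors that depend on $\mu$:

$p(\mu|N) \propto \frac{\mu^N e^{-\mu}}{N!} \times \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)}\mu^{\alpha_0-1}e^{-\beta_0\mu} \propto \mu^{(\alpha_0+N)-1} e^{-(\beta_0+1)\mu}$,

which is an unnormalized $\mathrm{Gamma}(\alpha_0+N,\beta_0+1)$ density - no integration required.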

Here we can demo how the posterior distribution depends on these prior hyperparameters, as well as the observed data.


In [ ]:
plt.rcParams['figure.figsize'] = (8.0,5.0)
# returns alpha,beta describing the posterior
bayesDemo(alpha0=1.0, beta0=0.001, N=5)
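For reference, here is a minimal sketch of what such a demo might look like. This is a hypothetical stand-in (gamma_poisson_demo) written for these notes, not the bayesDemo defined in code/bayes_theorem.py.

In [ ]:
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

def gamma_poisson_demo(alpha0, beta0, N):
    # conjugate update: Gamma(alpha0, beta0) prior + Poisson datum N -> Gamma(alpha0+N, beta0+1)
    alpha1, beta1 = alpha0 + N, beta0 + 1.0
    mu = np.linspace(0.0, 20.0, 200)
    plt.plot(mu, st.gamma.pdf(mu, alpha0, scale=1.0/beta0), label='prior')
    plt.plot(mu, st.gamma.pdf(mu, alpha1, scale=1.0/beta1), label='posterior')
    plt.xlabel(r'$\mu$', fontsize=22)
    plt.legend()
    return alpha1, beta1

gamma_poisson_demo(alpha0=1.0, beta0=0.001, N=5)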

Bayes theorem as a model for accumulating information

$p(\mathrm{params}|\mathrm{data}) = \frac{p(\mathrm{data}|\mathrm{params})\,p(\mathrm{params})}{p(\mathrm{data})}$

  • Within this formalism, today's posterior becomes tomorrow's prior seamlessly.

Assuming independent measurements, $p(\theta|x_1,x_2) \propto p(x_2|\theta)\,p(x_1|\theta)\,p(\theta) \propto p(x_2|\theta)\,p(\theta|x_1)$

  • We can try this out with the demo on the previous slide.
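A minimal sketch of this for the Gamma-Poisson case, with two made-up independent measurements $N_1$ and $N_2$: updating on both at once gives the same posterior as updating on the first and then using that posterior as the prior for the second.

In [ ]:
import numpy as np

# conjugate Gamma-Poisson update: alpha -> alpha + N, beta -> beta + 1 per measurement
alpha0, beta0 = 1.0, 0.001           # prior hyperparameters, as in the demo above
N1, N2 = 5, 8                        # two hypothetical independent measurements

both_at_once = (alpha0 + N1 + N2, beta0 + 2.0)

# sequentially: today's posterior becomes tomorrow's prior
alpha1, beta1 = alpha0 + N1, beta0 + 1.0
sequential = (alpha1 + N2, beta1 + 1.0)

print(np.allclose(both_at_once, sequential))  # True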

Exercise: categorizing galaxy clusters

X-ray imaging data for 361 galaxy clusters were analyzed, and 57 of them were found to be morphologically "relaxed" according to some metric (arxiv:1502.06020). We want to constrain the fraction of relaxed clusters in the Universe (the probability that a randomly chosen cluster is relaxed), $f$, assuming that the data set is representative.

  1. What sampling distribution is associated with this type of categorical data?
  2. What prior distribution is nicely conjugate to it? (Look this up.) Choose hyperparameters corresponding to a broadly reasonable prior (e.g. uniform).
  3. Sketch a PGM for this experiment, and write down the corresponding conditional probability expressions.
  4. Prove the conjugacy relation for the sampling and prior distributions above. (This does not require integration.)
  5. Write down the posterior for $f$. What is the most probable value for $f$? You can consult a reference for the mode of the appropriate distribution.

Choosing a prior

The prior distribution encodes "what we know before doing the experiment at hand".

  • Things are straightforward if we have prior knowledge from e.g. previous measurements or trustworthy simulations, and don't mind including it in our analysis.
  • Otherwise, one generally assigns an "uninformative" prior.
  • But note that no distribution is truly devoid of information.

"Uninformative" priors

Common choices are

  • Uniform
  • Uniform in the log (i.e. in the order of magnitude)
  • The Jeffreys prior for a given problem, which is proportional to the square root of the Fisher information and is invariant under reparametrization (worked out below for the Poisson example)
  • Priors with maximal entropy for a given problem

The last 2 are not seen as often in astrophysics, but are worth knowing about.

  • Again, there is no genuinely uninformative option. At best, we can make a minimally informative and defensible choice that doesn't significantly influence the result.
  • If the data are so inadequate that our choice of prior really matters, our job as analysts is to clearly state/defend our assumptions and their influence on the results.
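As a concrete example (added here for reference), the Jeffreys prior for the Poisson flux measurement above follows directly from the Fisher information. The log of the sampling distribution is $\ln P(N|\mu) = N\ln\mu - \mu - \ln N!$, so

$I(\mu) = \mathrm{E}\left[-\frac{\partial^2 \ln P(N|\mu)}{\partial\mu^2}\right] = \frac{\mathrm{E}[N]}{\mu^2} = \frac{1}{\mu}$,

and the Jeffreys prior is $p(\mu) \propto \sqrt{I(\mu)} = \mu^{-1/2}$ - different from both the uniform and uniform-in-the-log choices.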

Exercise: defend a prior

Given the choice of uniform, uniform-in-the-log ($\propto1/x$), or some other prior of your choosing for the following measurables, which do you think is most appropriate?

  • The height of a classmate
  • The mass of a star
  • The mass of a neutron star
  • The distance to a supernova

Summarizing posterior distributions

The posterior for model parameters may be "the answer" to an inference problem, but we often want a summary of the form $\theta = \theta_0 \pm \Delta\theta$ or $\theta_0{}_{-\Delta\theta_m}^{+\Delta\theta_p}$.

  • This is interpreted as a statement about the marginal posterior of $\theta$, namely that $\theta$ is in the given credible interval with some (specified) probability.
  • Note that credible intervals are technically different from the confidence intervals that appear in non-Bayesian statistics, but in practice the terms are often used interchangeably.

There are several conventions in use for how to come up with a best estimate and credible interval:

  • Most-probable value and interval (most intuitive)
  • Median and symmetric percentiles (easy)
  • Mean and symmetric interval (huh?)

And some things you probably shouldn't do (valid only in certain limits)

  • Mean and standard deviation
  • Intervals from differences in log-likelihood (or log-posterior)

1. Most-probable value and interval

2. Median and symmetric percentiles

3. Mean and symmetric interval

Differences between the conventions can be pronounced in the case of limits.

[Figure: comparison of the most probable, median/percentiles, and mean/symmetric conventions]

Joint constraints (contours) for multiple parameters work similarly, although here only the maximum-probability approach is in wide use.
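As a minimal sketch of the median/percentile convention, given an array of posterior samples (here drawn from a made-up Gamma posterior, purely for illustration), the summary takes only a few lines:

In [ ]:
import numpy as np
import scipy.stats as st

samples = st.gamma.rvs(6.0, scale=1.0, size=100000)       # hypothetical posterior samples
lo, med, hi = np.percentile(samples, [16.0, 50.0, 84.0])  # central ~68% credible interval
print('mu = %.2f +%.2f -%.2f' % (med, hi - med, med - lo))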

Everything in this course is an inference

We've focussed on one task so far, but this framework applies any time we want to draw a conclusion from data.

  • constraining model parameters
  • choosing between model frameworks
  • source detection
  • interpolation and extrapolation
  • ...

Bonus numerical exercise: conjugate distributions

Adapt the code in bayes_theorem.py and above for the "flux of a source" demo to produce an equivalent demo for the "relaxed galaxy cluster" example. It should take as input the prior hyperparameters and data (number of relaxed clusters and total number of clusters), produce plots of the prior and posterior (optionally also the likelihood), and return the parameters describing the posterior distribution.


In [ ]:

Bonus numerical exercise: independent Monte Carlo

So far, we've used examples where the posterior is a standard distribution. In general this is not the case, and we resort to methods which produce samples from the posterior in order to estimate its density and/or integrate it (e.g. to marginalize some parameters). Sampling can be useful even in simple cases, especially for downstream propagation of uncertainties. This exercise will get you some practice working with scipy, and introduce the simplest possible sampling method.

For either the Poisson or Binomial examples above, make up some reasonable data and generate $10^4$ samples from the posterior distribution. Then do some post-processing:

  1. For the Poisson experiment, convert the posterior for $\mu$ (average number of counts) into something akin to magnitude. Simplifying away things like the reference, integration time and collecting area, use the formula $m=1-2.5\log_{10}(\mu/50)$.
  2. For the Binomial experiment, predict the number of relaxed clusters found in a hypothetical much larger sample (say 5000 clusters in total). That is, generate a Poisson sample from each of the samples of $\mu=5000f$ corresponding to the larger sample size.

Visualize the distributions for these derived parameters by plotting histograms of the samples.


In [ ]:

Bonus numerical exercise: confidence intervals

Write a function for determining the maximum-probability best fit and 1-dimensional confidence interval at a given level from an array of samples. (You can also write one that takes a posterior function as input instead of samples, but in the future we'll be using samples a lot.) Note that some kind of intelligent smoothing or kernel density estimation is generally needed to suppress numerical noise in histograms. Feel free to do the same for the percentile-based confidence intervals described above.

MegaBonus: Write a function for generating 2-dimensional confidence contours the same way. (Note that the corner package provides a nifty way to do this already.)


In [ ]: