Assigning Priors

"There are only two problems in inference: how to assign probability distributions, and how to do integrals." - John Skilling

All our inferences necessarily depend on our model parameters' prior PDFs, because you can't do inference without making assumptions.

The personal and subjective nature of probability in Bayesian statistics (remember, you are quantifying your degree of belief in the parameter values) is in slight tension with the collective aspiration of the scientific community to learn the "objective truth" about the world.

There are several things we can do when publishing inferences:
- State clearly what priors we assumed, so that anyone reproducing our analysis can compare their results given their assumptions with ours.
- Try to make it easy for others to carry out the same inference but with different assumptions. Making our posterior samples available is a good example of this: those samples can often be re-weighted in an importance sampling analysis that involves different assumptions.
- Carry out "sensitivity analyses" so that our readers don't have to do the above: if a conclusion is sensitive to the prior PDF we assumed, that means there was relatively little information in our data about that parameter.
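The re-weighting mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the "posterior samples" are stand-in random numbers rather than the output of a real analysis, the original prior is taken to be uniform on a made-up range, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in posterior samples for a positive parameter x, imagined to have been
# drawn under a uniform prior on [x_min, x_max]. All numbers here are made up.
x_min, x_max = 0.1, 10.0
samples = rng.normal(loc=3.0, scale=0.5, size=20000)
samples = samples[(samples > x_min) & (samples < x_max)]

def log_prior_uniform(x):
    # Original assumption: uniform in x on [x_min, x_max]
    return np.full_like(x, -np.log(x_max - x_min))

def log_prior_loguniform(x):
    # Alternative assumption: uniform in log x, i.e. Pr(x) proportional to 1/x
    return -np.log(x) - np.log(np.log(x_max / x_min))

# Importance weights: ratio of the new prior to the old prior at each sample
log_w = log_prior_loguniform(samples) - log_prior_uniform(samples)
weights = np.exp(log_w - log_w.max())
weights /= weights.sum()

mean_old = samples.mean()
mean_new = np.sum(weights * samples)
n_eff = 1.0 / np.sum(weights**2)  # effective number of samples after re-weighting

print("Posterior mean, original prior:   ", mean_old)
print("Posterior mean, re-weighted prior:", mean_new)
print("Effective sample size:", n_eff, "of", samples.size)
```

If the effective sample size comes out much smaller than the number of samples, the new prior disagrees strongly with the old one and the re-weighted results should be treated with care.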

Still, to be able to derive the posterior PDF for the parameters, we must assign a prior PDF. Let's look at some guidelines.

Uninformative Priors

One way to assign priors is to plead ignorance.

Not knowing the value of a parameter up to an additive constant indicates that a uniform (i.e. top hat) prior might be appropriate.

Not knowing the value of a parameter up to a multiplicative constant indicates that a prior uniform in the log of the parameter might be appropriate. Equating ${\rm Pr}(x)\,dx = {\rm Pr}(\log{x})\,d \log{x}$ and assigning the above uniform PDF leads to ${\rm Pr}(x) \propto 1/x$, which is sometimes known loosely as the "Jeffreys Prior".

At first sight, ${\rm Pr}(x) \propto 1/x$ seems like a bad idea, because it rises steeply with decreasing $x$, "biasing" the result. But suppose $x$ is galaxy mass, and you want to plead ignorance: you assign a uniform prior between 0 and $10^{14} M_{\odot}$. With this assignment you are saying that a priori ${\rm Pr}(x > 10^{12} M_{\odot}) = 0.99$ - a highly informative statement!
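The galaxy mass example can be checked with a little arithmetic. In the sketch below the upper limit and the threshold come from the text, but the lower limit `x_min` is a hypothetical value introduced here, because a $1/x$ prior cannot be normalized all the way down to zero.

```python
import numpy as np

x_max = 1e14  # upper limit from the text, in solar masses
x_cut = 1e12  # threshold from the text, in solar masses
x_min = 1e8   # hypothetical lower limit, needed to normalize the 1/x prior

# Uniform prior on [0, x_max]: prior mass above x_cut
p_uniform = 1.0 - x_cut / x_max

# Prior uniform in log x on [x_min, x_max]: prior mass above x_cut
p_loguniform = np.log(x_max / x_cut) / np.log(x_max / x_min)

print("Pr(x > 1e12) under the uniform prior:    ", p_uniform)
print("Pr(x > 1e12) under the log-uniform prior:", p_loguniform)
```

With these particular limits the log-uniform prior assigns equal mass to each decade, so the top two of the six decades hold one third of the prior probability.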

A computational problem with uninformative priors is that they can lead to parameter space volumes that are unmanageably large.

Exercise:

Consider a uniform joint prior PDF in N parameter dimensions. What fraction of the a priori allowed volume lies in a hypercubic shell whose thickness is a fraction f of the side length?



In [ ]:

    
f = 0.01
N = 2
# Compute difference between two hypervolumes:
dV = 'not yet coded'

print("Volume fraction for f =",f," is",dV)

This effect can cause real computational problems when attempting to characterize posterior PDFs - you've seen it already, just in our attempts with two-dimensional grids!
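One possible way to complete the exercise above is sketched here. It assumes the shell lines every face of a unit hypercube, so that the interior cube has side length $1 - 2f$; if the shell is read differently (for example, as trimming only one side per dimension), the factor of 2 changes. The function name is hypothetical.

```python
def shell_volume_fraction(f, N):
    """Fraction of a unit N-cube's volume within a distance f of its surface.

    Assumes the shell lines every face, so the inner cube has side 1 - 2f.
    """
    if not 0.0 <= f <= 0.5:
        raise ValueError("f must lie between 0 and 0.5")
    # Difference between the full hypervolume (1) and the inner hypervolume
    return 1.0 - (1.0 - 2.0 * f) ** N

f = 0.01
for N in (1, 2, 10, 100, 1000):
    print("N =", N, " volume fraction for f =", f, " is", shell_volume_fraction(f, N))
```

Under this reading the shell fraction approaches 1 as N grows, which is why almost all of a high-dimensional uniform prior volume sits near its edges.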

Maximum Entropy Priors

Attempts have been made to put uninformative priors on a sound theoretical footing, by deriving them from various symmetries and invariants. (Jeffreys' principle is one such attempt that is worth reading about.)

The uniform and logarithmic priors are examples of distributions derivable from the Principle of Maximum Entropy, which is a formalization of the request for an uninformative prior.

The entropy of a PDF $p(x)$ is the functional $H(p) = - \int p \log p dx$, and measures something like "randomness." The distribution $p$ that maximizes the entropy $H$ given some constraints is the least informative (or most non-committal) one.

Maximizing the entropy of $p(x)$ is an exercise in the calculus of variations. The constraints (which are often integral in nature) are added in with Lagrange multipliers, and then the functional derivative of $H$ is taken and set to zero, and the optimal $p(x)$ solved for. The values of the multipliers come from requiring that the resulting PDF $p(x)$ satisfies the constraints, including normalization. See the Maximum Entropy Wikipedia page and the related page of Maximum Entropy probability distributions for more details.
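As a sketch of how this procedure goes, take the case of a known mean $\mu$ for a parameter restricted to $x \geq 0$. The multipliers $\lambda_0$ and $\lambda_1$ and the functional $J$ are names introduced here for illustration. The constrained functional is

$J(p) = -\int p \log p \, dx - \lambda_0 \left( \int p \, dx - 1 \right) - \lambda_1 \left( \int x \, p \, dx - \mu \right)$

Setting its functional derivative with respect to $p$ to zero gives

$-\log p - 1 - \lambda_0 - \lambda_1 x = 0 \quad \Rightarrow \quad p(x) = \exp(-1 - \lambda_0 - \lambda_1 x)$

Normalization over $x \geq 0$ requires $\exp(-1 - \lambda_0) = \lambda_1$, and the mean constraint then requires $1/\lambda_1 = \mu$, so that

$p(x) = \frac{1}{\mu} \exp(-x/\mu)$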

Examples of constraints, and the maximum entropy distributions that result from them, include:
- No constraint, except that $\int p(x) dx = 1$ over a bounded range, gives $p(x) \propto {\rm constant}$, the uniform distribution.
- Known mean, $\int x p(x) dx = \mu$, for a parameter restricted to $x \geq 0$, gives $p(x) \propto \exp(-x/\mu)$, the exponential distribution.
- Known mean $\mu$ and variance $\sigma^2$ gives $p(x) \propto \exp(-(x - \mu)^2/2\sigma^2)/\sigma$, the Gaussian distribution.
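The Gaussian case in the list above can be spot-checked numerically. The sketch below evaluates $H(p) = -\int p \log p \, dx$ on a grid for three PDFs that share the same mean and variance. The Laplace and top hat comparison distributions, the grid limits, and the helper name `entropy` are choices made here for illustration, and comparing against two alternatives is a sanity check, not a proof.

```python
import numpy as np

def entropy(p, dx):
    """Approximate H = -int p log p dx by a Riemann sum, taking 0 log 0 = 0."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask])) * dx

mu, sigma = 0.0, 1.0
x = np.linspace(-20.0, 20.0, 400001)
dx = x[1] - x[0]

# Gaussian with mean mu and variance sigma^2
gauss = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Laplace with the same variance: scale b satisfies 2 b^2 = sigma^2
b = sigma / np.sqrt(2.0)
laplace = np.exp(-np.abs(x - mu) / b) / (2.0 * b)

# Top hat with the same variance: half-width a satisfies a^2 / 3 = sigma^2
a = sigma * np.sqrt(3.0)
tophat = np.where(np.abs(x - mu) <= a, 1.0 / (2.0 * a), 0.0)

H = {
    "Gaussian": entropy(gauss, dx),
    "Laplace": entropy(laplace, dx),
    "top hat": entropy(tophat, dx),
}
for name, value in H.items():
    print(name, "entropy:", value)
```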

This last one provides some justification for our repeated interpretation of points with error bars as Gaussian sampling distributions: when we are just told a mean and a width, the Gaussian assumption is the least committal one we can make - and over a large number of situations making this assumption, we are not persuaded that it's a bad idea to do so.

Informative Priors

Another way to think of the prior PDF is as an opportunity - to include more information into our analysis.

One good prior PDF is the posterior PDF from a previous analysis. If that analysis used data $c$, and we now have new data $d$ that are independent of $c$ given $x$, then

${\rm Pr}(x|d,c,H) \propto {\rm Pr}(d|x,H)\;{\rm Pr}(x|c,H)$

Note that ${\rm Pr}(x|c,H)\propto {\rm Pr}(c|x,H)\;{\rm Pr}(x|H)$ by Bayes' Theorem, so that using posteriors as priors is equivalent to joint inference from multiple datasets:

${\rm Pr}(x|d,c,H) \propto {\rm Pr}(d|x,H)\;{\rm Pr}(c|x,H)\;{\rm Pr}(x|H)$
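This equivalence is easy to verify on a grid. The sketch below is a toy setup, not taken from any real analysis: both datasets are mock Gaussian draws with a known width `sigma`, the initial prior is uniform over the grid, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Mock datasets c and d: independent Gaussian draws about the same true x
x_true, sigma = 2.0, 1.0
c = rng.normal(x_true, sigma, size=5)
d = rng.normal(x_true, sigma, size=8)

x = np.linspace(-5.0, 10.0, 3001)
dx = x[1] - x[0]

def log_likelihood(data, x):
    # Gaussian sampling distribution with known sigma, summed over data points
    return -0.5 * np.sum((data[:, None] - x[None, :]) ** 2, axis=0) / sigma**2

def normalize(log_p):
    # Exponentiate safely and normalize to unit integral on the grid
    p = np.exp(log_p - log_p.max())
    return p / (p.sum() * dx)

log_prior = np.zeros_like(x)  # Pr(x|H): uniform on the grid

# Sequential route: the posterior from c becomes the prior for d
post_c = normalize(log_likelihood(c, x) + log_prior)
post_sequential = normalize(log_likelihood(d, x) + np.log(post_c))

# Joint route: both likelihoods times the original prior
post_joint = normalize(log_likelihood(d, x) + log_likelihood(c, x) + log_prior)

print("Largest difference between the two routes:",
      np.max(np.abs(post_sequential - post_joint)))
```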

More generally, a good prior PDF is one that accurately represents your beliefs.

Such "subjective priors" are typically treated with justified caution, because scientists aspire to perform objective analyses. However, a prior PDF assignment is just one type of assumption among the many that are involved in the model, and it can and should be tested in the same way as the others.

Subjective priors can be derived by "elicitation" - asking experts for opinions. This is not done often in astronomy, but if you look carefully you might find examples.

In a domain like astrophysics some very good prior PDFs are generated as the outputs of simulations. Approximate realizations of complex systems are generated by computer models implementing the known laws of physics plus a range of plausible assumptions, and the result can often be interpreted as a prior distribution for some simpler phenomenological model parameters.

An example from KIPAC's research would be our treatment of large catalogs of simulated dark matter halos as ensembles of samples drawn from the prior PDF for model dark matter halo parameters.
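As a rough sketch of the general idea (not of the KIPAC analysis itself, whose details are not given here), a set of simulated parameter values can be turned into an evaluable prior PDF with a normalized histogram. The "catalog" below is a set of stand-in random numbers, not real simulation output, and the binning and the function name `prior_pdf` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for a simulated catalog of some halo parameter:
# these are random numbers, NOT real simulation output.
sim_values = rng.lognormal(mean=1.5, sigma=0.3, size=50000)

# Treat the catalog as samples from the prior and estimate its density
density, edges = np.histogram(sim_values, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

def prior_pdf(x):
    # Interpolate the histogram density; zero outside the simulated range
    return np.interp(x, centers, density, left=0.0, right=0.0)

x_test = np.median(sim_values)
print("Prior density at the catalog median:", prior_pdf(x_test))
print("Prior density at a negative value:  ", prior_pdf(-1.0))
```

A kernel density estimate or a parametric fit to the samples would be alternative choices; the histogram is used here only because it is the simplest to write down.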


