In [ ]:
exec(open('code/bayes_theorem.py').read()) # see code here for later demos
%matplotlib inline
You've just been introduced to PGMs, which provide a visual representation of how data are generated.
In inference problems, we have one (and only one) dataset to learn from
The sampling distribution, $p(\mathrm{data}|\mathrm{params})$, plays a central role, since it quantifies how probable a given data set is given an assumed set of parameters.
When evaluated as a function of the parameters, $p(\mathrm{data}|\mathrm{params})$ is known as the likelihood function
The sampling distribution is a PDF, normalized in data space
The likelihood function is not a PDF for the parameters - it's not normalized in parameter space
Either way, PGMs are still useful for visualizing the factorization of a joint probability into conditionally independent pieces
Double circled (or, sometimes, gray shaded) nodes denote variables that are fixed by observation: these nodes indicate the data.
The PGM still illustrates a particular factorization of the joint PDF of all variables.
$P(\theta) \prod_k P(\mu_k|\theta) P(N_k|\mu_k)$
The other ingredient we need is the prior distribution, $p(\mathrm{params})$, which quantifies what we know about the parameters before accounting for the data.
The ingredients above are all related through the definition of conditional probability, which rearranges into Bayes' Theorem:
$p(\mathrm{params}|\mathrm{data}) = \frac{p(\mathrm{data}|\mathrm{params})\,p(\mathrm{params})}{p(\mathrm{data})}$
The crux of this theorem, what makes it more than a trivial restatement, is that probability distributions are the appropriate mathematical tool for encoding knowledge about models and parameters.
All of these terms are implicitly also conditional on the choice of model.
$p(\mathrm{params}, \mathrm{data}) = p(\mathrm{params}|\mathrm{data}) \, p(\mathrm{data})$
$= p(\mathrm{data}|\mathrm{params}) \, p(\mathrm{params})$
The second one is the factorization illustrated by the PGM - and it is proportional to the posterior PDF for the parameters given the data.
PGMs help us design inferences, by illustrating the conditional dependencies between parameters and data
Say we want to measure the flux of a galaxy. In a given integration time, $T$, the number of counts, $N$, that we collect in our fancy CCD will be Poisson distributed
$N|\mu \sim \mathrm{Poisson}(\mu)$
where $\mu=FAT$ is the average number of counts we would expect in time $T$, the product of the integration time, the source flux ($F$, counts per unit time and area), and the collecting area of our telescope ($A$).
Presumably we know $A$ and $T$ well, so for convenience we can make $\mu$ rather than $F$ the free parameter of our model.
$N|\mu \sim \mathrm{Poisson}(\mu)$
$\mu \sim \mathrm{some~prior}$
We'll talk more about how to choose a prior in a few minutes. For now, we'll make a common choice, the uniform distribution (for $\mu\geq0$ in this case).
$\mu \sim \mathrm{Uniform}(0,\infty)$
What about the evidence, $P(N)$? It doesn't depend on $\mu$ at all, so for the purpose of inferring $\mu$ it's just a normalizing constant that we can ignore.
So, we have everything we need to calculate $p(\mu|N)\propto P(N|\mu)p(\mu)$.
Say we measure $N=5$. Now what?
Broadly speaking, we have 3 options for computing the posterior: brute-force evaluation over a grid of parameter values, analytic solutions that exploit conjugacy, and methods that generate samples from the posterior.
In [ ]:
N = 5
mu = np.linspace(0.0, 20.0, 100) # np == numpy
prior = 1.0 # uniform (proportional to a constant)
like = st.poisson.pmf(N, mu) # st == scipy.stats
post = like * prior # un-normalized posterior, evaluated on the mu grid
plt.rcParams['figure.figsize'] = (6.0,4.0) # plt == matplotlib.pyplot
plt.plot(mu, post, 'bo');
plt.plot([N]*2, [0.0, np.max(post)*1.1], 'k--'); # dashed line marking mu = N
plt.xlabel(r'$\mu$', fontsize=22);
plt.ylabel('posterior', fontsize=22);
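The product above is only proportional to the posterior; to get a properly normalized PDF we would divide by the evidence. A minimal sketch (not part of the original demo), reusing the `mu` and `post` arrays from the cell above and assuming `bayes_theorem.py` has made `numpy` available as `np` as the comments suggest:
In [ ]:
# Sketch: normalize the gridded posterior by approximating the evidence with a
# simple Riemann sum over the mu grid.
dmu = mu[1] - mu[0]                  # grid spacing
evidence = np.sum(post) * dmu        # approximate integral of likelihood*prior
post_normed = post / evidence        # now integrates to ~1 over the grid
print(np.sum(post_normed) * dmu)     # sanity check: should print ~1.0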
If both the prior and likelihood are built from standard statistical distributions, we can sometimes take advantage of conjugacy.
Conjugate distributions are like eigenfunctions of Bayes' Theorem: special cases for which the form of the posterior is the same as that of the prior, for a specific sampling distribution.
$f(x|\theta)\,g(\theta|\phi) \propto g\left[\theta\left|\tilde{\phi}(x,\phi)\right.\right]$
The Gamma distribution is the conjugate prior for the Poisson sampling distribution:
$p(x|\alpha,\beta) = \frac{1}{\Gamma(\alpha)}\beta^\alpha x^{\alpha-1} e^{-\beta x}$ for $x\geq0$
Our uniform prior is a limiting case of the Gamma distribution (with $\alpha=1$ and $\beta\rightarrow0$), so we can take advantage of this.
If we take the prior $\mu \sim \mathrm{Gamma}(\alpha_0,\beta_0)$, the posterior will be $\mu|N \sim \mathrm{Gamma}(\alpha_0+N,\beta_0+1)$.
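As a quick cross-check (a sketch, not part of the original notebook), we can compare this analytic result to the brute-force grid calculation above for our uniform prior ($\alpha_0=1$, $\beta_0\rightarrow0$) and $N=5$. One gotcha: `scipy.stats.gamma` is parameterized by a shape and a scale, where the scale is $1/\beta$.
In [ ]:
# Sketch: compare the conjugate Gamma posterior to the gridded calculation above.
# The uniform prior corresponds to alpha0=1, beta0->0; scipy's gamma uses scale = 1/rate.
alpha0, beta0 = 1.0, 0.0
alpha1, beta1 = alpha0 + N, beta0 + 1.0                 # conjugate update for one N
post_grid = post / (np.sum(post) * (mu[1] - mu[0]))     # normalized grid posterior
plt.plot(mu, st.gamma.pdf(mu, alpha1, scale=1.0/beta1), 'r-', label='Gamma posterior');
plt.plot(mu, post_grid, 'bo', label='grid');
plt.xlabel(r'$\mu$', fontsize=22);
plt.legend();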
Here we can demo how the posterior distribution depends on these prior hyperparameters, as well as the observed data.
In [ ]:
plt.rcParams['figure.figsize'] = (8.0,5.0)
# returns alpha,beta describing the posterior
bayesDemo(alpha0=1.0, beta0=0.001, N=5)
$p(\mathrm{params}|\mathrm{data}) = \frac{p(\mathrm{data}|\mathrm{params})\,p(\mathrm{params})}{p(\mathrm{data})}$
Assuming independent measurements, $p(\theta|x_1,x_2) \propto p(x_2|\theta)\,p(x_1|\theta)\,p(\theta) \propto p(x_2|\theta)\,p(\theta|x_1)$ - that is, the posterior from the first measurement plays the role of the prior when we analyze the second.
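To make this concrete with the Poisson/Gamma example (a sketch, with a made-up second measurement): updating the posterior from the first observation with a second, independent count gives exactly the same Gamma parameters as analyzing both observations at once.
In [ ]:
# Sketch: sequential vs. simultaneous conjugate updates for two independent
# Poisson measurements (N2 is made up). Each update adds N to alpha and 1 to beta.
alpha0, beta0 = 1.0, 0.001
N1, N2 = 5, 8
alpha_seq, beta_seq = (alpha0 + N1) + N2, (beta0 + 1.0) + 1.0   # one at a time
alpha_sim, beta_sim = alpha0 + (N1 + N2), beta0 + 2.0           # both at once
print(alpha_seq, beta_seq, alpha_sim, beta_sim)                  # identical, as expected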
X-ray imaging data for 361 galaxy clusters were analyzed, and 57 of them were found to be morphologically "relaxed" according to some metric (arxiv:1502.06020). We want to constrain the fraction of relaxed clusters in the Universe (the probability that a randomly chosen cluster is relaxed), $f$, assuming that the data set is representative.
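Here the sampling distribution is Binomial: if $k$ of $n$ observed clusters are relaxed, then $k|f \sim \mathrm{Binomial}(n,f)$. As a first look (a sketch; the full demo is left for the exercise below), we can evaluate this likelihood on a grid of $f$:
In [ ]:
# Sketch: Binomial likelihood for the relaxed fraction f, using the numbers in
# the text (57 relaxed clusters out of 361), evaluated on a grid of f values.
k, n = 57, 361
f = np.linspace(0.0, 1.0, 500)
like_f = st.binom.pmf(k, n, f)       # sampling distribution as a function of f
plt.plot(f, like_f, 'b-');
plt.xlabel(r'$f$', fontsize=22);
plt.ylabel('likelihood', fontsize=22);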
The prior distribution encodes "what we know before doing the experiment at hand".
Common choices are
The last 2 are not seen as often in astrophysics, but are worth knowing about.
The posterior for model parameters may be "the answer" to an inference problem, but we often want a summary of the form $\theta = \theta_0 \pm \Delta\theta$ or $\theta_0{}_{-\Delta\theta_m}^{+\Delta\theta_p}$.
There are several conventions in use for how to come up with a best estimate and credible interval:
1. Most-probable value and interval
2. Median and symmetric percentiles
3. Mean and symmetric interval
There are also some things you probably shouldn't do, since they are valid only in certain limits.
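For instance, convention 2 is easy to apply to the flux example, since we know the posterior is Gamma(6, 1) for $N=5$ and our uniform prior (a sketch using `scipy.stats.gamma.ppf`):
In [ ]:
# Sketch: median and symmetric 68.3% percentile interval (convention 2) for the
# Gamma(alpha=6, beta=1) posterior of the flux example. scipy uses scale = 1/beta.
alpha1, beta1 = 6.0, 1.0
med = st.gamma.ppf(0.5, alpha1, scale=1.0/beta1)
lower, upper = st.gamma.ppf([0.1585, 0.8415], alpha1, scale=1.0/beta1)
print("mu = {:.2f} -{:.2f} +{:.2f}".format(med, med - lower, upper - med))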
Differences between the conventions can be pronounced in the case of limits.
(Figure comparing the conventions: most probable | median/percentiles | mean/symmetric)
Joint constraints (contours) for multiple parameters work similarly, although here only the maximum-probability approach is in wide use.
Adapt the code in bayes_theorem.py and above for the "flux of a source" demo to produce an equivalent demo for the "relaxed galaxy cluster" example. It should take as input the prior hyperparameters and data (number of relaxed clusters and total number of clusters), produce plots of the prior and posterior (optionally also the likelihood), and return the parameters describing the posterior distribution.
In [ ]:
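One possible sketch of the requested demo (the function name, signature, and defaults are just one choice), using the fact that the Beta distribution is the conjugate prior for the Binomial: a Beta($\alpha_0,\beta_0$) prior updates to a Beta($\alpha_0+k,\ \beta_0+n-k$) posterior when $k$ of $n$ clusters are relaxed.
In [ ]:
# Sketch of a Beta-Binomial analogue of bayesDemo (names and defaults are just
# one possible choice). Prior: f ~ Beta(alpha0, beta0); data: k relaxed of n.
def betaBinomialDemo(alpha0=1.0, beta0=1.0, k=57, n=361):
    alpha1, beta1 = alpha0 + k, beta0 + (n - k)     # conjugate update
    f = np.linspace(0.0, 1.0, 500)
    plt.plot(f, st.beta.pdf(f, alpha0, beta0), 'b--', label='prior');
    plt.plot(f, st.beta.pdf(f, alpha1, beta1), 'b-', label='posterior');
    plt.xlabel(r'$f$', fontsize=22);
    plt.ylabel('density', fontsize=22);
    plt.legend();
    return alpha1, beta1                            # parameters of the posterior

betaBinomialDemo(alpha0=1.0, beta0=1.0, k=57, n=361)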
So far, we've used examples where the posterior is a standard distribution. In general this is not the case, and we resort to methods which produce samples from the posterior in order to estimate its density and/or integrate it (e.g. to marginalize some parameters). Sampling can be useful even in simple cases, especially for downstream propagation of uncertainties. This exercise will get you some practice working with scipy, and introduce the simplest possible sampling method.
For either the Poisson or Binomial examples above, make up some reasonable data and generate $10^4$ samples from the posterior distribution. Then do some post-processing:
Visualize the distributions for these derived parameters by plotting histograms of the samples.
In [ ]:
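A minimal sketch for the Poisson/Gamma case, with made-up values of the collecting area and integration time so that the flux $F=\mu/(AT)$ can serve as a derived parameter (the specific numbers are invented for illustration):
In [ ]:
# Sketch: sample the Gamma posterior for mu (uniform prior, N=5), then propagate
# the samples to the derived flux F = mu/(A*T). The A and T values are made up.
alpha1, beta1 = 1.0 + 5, 0.0 + 1.0
mu_samples = st.gamma.rvs(alpha1, scale=1.0/beta1, size=10**4)
A, T = 100.0, 1000.0                     # invented collecting area and exposure time
F_samples = mu_samples / (A * T)         # derived parameter, sample by sample
plt.hist(F_samples, bins=50, histtype='step');
plt.xlabel(r'$F$', fontsize=22);
plt.ylabel('samples', fontsize=22);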
Write a function for determining the maximum-probability best fit and 1-dimensional confidence interval at a given level from an array of samples. (You can also write one that takes a posterior function as input instead of samples, but in the future we'll be using samples a lot.) Note that some kind of intelligent smoothing or kernel density estimation is generally needed to suppress numerical noise in histograms. Feel free to do the same for the percentile-based confidence intervals described above.
MegaBonus: Write a function for generating 2-dimensional confidence contours the same way. (Note that the corner package provides a nifty way to do this already.)
In [ ]:
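One possible sketch of such a function (names and defaults are just one choice), using `scipy.stats.gaussian_kde` to smooth the samples and a grid scan to find the highest-density region at the requested level:
In [ ]:
# Sketch: maximum-probability estimate and highest-density interval from 1D samples,
# using a Gaussian KDE evaluated on a grid to suppress histogram noise.
def hpd_from_samples(samples, level=0.683, ngrid=1000):
    kde = st.gaussian_kde(samples)
    grid = np.linspace(np.min(samples), np.max(samples), ngrid)
    dens = kde(grid)
    dx = grid[1] - grid[0]
    order = np.argsort(dens)[::-1]                  # grid points, highest density first
    enclosed = np.cumsum(dens[order]) * dx          # probability enclosed above each threshold
    keep = order[:np.searchsorted(enclosed, level) + 1]
    best = grid[np.argmax(dens)]                    # maximum-probability value
    return best, grid[keep].min(), grid[keep].max() # assumes a unimodal posterior

# quick test on samples from the Gamma(6, 1) posterior of the flux example
test_samples = st.gamma.rvs(6.0, scale=1.0, size=10**4)
print(hpd_from_samples(test_samples))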