In [ ]:
exec(open('code/montecarlo1.py').read()) # see code here for later demos
%matplotlib inline
So far, we've restricted ourselves to inferences where we could evaluate the posterior everywhere, either analytically or over a grid in parameter space.
In general, evaluating the posterior throughout the entire parameter space is too costly. We want to focus resources on mapping the posterior where it is non-tiny. Generating samples from the posterior itself automatically accomplishes this.
Almost always, we are ultimately interested in integrals of the posterior, i.e. marginal distributions of parameters. The tasks of Monte Carlo sampling and Monte Carlo integration are essentially indistinguishable. (Similar machinery is useful for difficult optimization problems.)
The essence of MC integration:
$\int w(x)\,p(x)\,dx = \int w(x)\,dP(x) \approx \overline{w(x_i)}; ~ x_i\sim P$
i.e., if we can factor the integrand into a PDF and a weight, and sample from the PDF, then our integral becomes an average over the samples.
In other words, given a list of samples of $\theta$ drawn from $p(\theta)$, we can estimate expectation values, credible intervals, and marginal distributions directly from the samples.
All of these computations would be weighted if $w(\theta)\neq1$.
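For instance, here is a minimal sketch (with an illustrative integrand, not one from these notes): estimating $\int x^2\,N(x|0,1)\,dx = 1$ by averaging $w(x)=x^2$ over standard normal samples.
In [ ]:
import numpy as np

# Monte Carlo integration sketch: estimate the integral of x^2 * N(x|0,1),
# i.e. w(x) = x^2 with P the standard normal; the true value is 1.
x = np.random.randn(100000)  # x_i ~ N(0,1)
print(np.mean(x**2))         # average of w(x_i); should be close to 1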
A posterior is already naturally factored into a likelihood function and a prior PDF.
$p(\theta|x) \propto p(x|\theta)\,p(\theta)$
Applying this in the MC integration context leads to the Simple Monte Carlo algorithm:
while we want more samples
draw theta from p(theta)
compute weight = p(x|theta)
store theta and weight
Obtaining marginal distribution(s) for $\theta$ then reduces to constructing weighted histograms of the samples.
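Here is a toy sketch of SMC; the model (a single datum $x=1$ with likelihood $N(x|\theta,1)$ and a $\mathrm{Uniform}(-5,5)$ prior) is an assumption chosen purely for illustration.
In [ ]:
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# Toy Simple Monte Carlo: datum x = 1, likelihood N(x|theta,1),
# prior Uniform(-5,5). These model choices are illustrative assumptions.
x_obs = 1.0
theta = np.random.uniform(-5.0, 5.0, 100000)       # draw theta from the prior
weight = st.norm.pdf(x_obs, loc=theta, scale=1.0)  # weight = likelihood
# a weighted histogram of the samples approximates the marginal posterior
plt.hist(theta, bins=100, weights=weight, density=True);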
SMC is indeed simple (as long as the prior is simple to draw from), but if the priors are not very informative then it still wastes many likelihood evaluations where the posterior is small. However, refinements of this approach lead to some of the advanced algorithms we'll cover later.
For now, we'll focus on the most common methods, which use a unit weight function (i.e. obtain draws directly from the posterior).
But first, a bit more context re: random number generation.
Useful terms to know:
Random: predictable only in the sense of following a PDF
Pseudorandom: not random, but "unpredictable enough" for practical purposes. Various computer algorithms produce pseudorandom sequences that approximate the uniform distribution on [0,1).
Quasirandom: sequence that doesn't even pretend to be random, but does converge to a target PDF more quickly than a random or pseudorandom process would
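For concreteness, a quick comparison of the last two (assuming scipy >= 1.7 for its quasi-Monte Carlo module):
In [ ]:
import numpy as np
import scipy.stats.qmc as qmc

# Pseudorandom draws approximate Uniform[0,1)^2 with random-looking scatter...
pseudo = np.random.rand(256, 2)
# ...while a quasirandom (Sobol) sequence fills the space more evenly.
quasi = qmc.Sobol(d=2, scramble=True).random(256)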
Here we assume that we have a reliable source of uniform pseudorandom numbers, and want to turn these into samples of another PDF.
Two simple approaches are inverse transform sampling and rejection sampling, covered in turn below.
Recall the definition of the CDF (and its inverse, $F^{-1}$, the quantile function),
$F(x) = P(X \leq x) = \int_{-\infty}^x p(x')\,dx'$
By this definition, quantiles of $X$ are uniformly distributed on [0,1]. If $F^{-1}$ is easy to evaluate, we can use this straightforwardly:
draw u from Uniform(0,1)
compute x = F_inverse(u)
As an example, take the exponential distribution, which has $p(x)=\lambda e^{-\lambda x}$ and $F(x)=1-e^{-\lambda x}$ for $x\geq0$.
The quantile function is, therefore, $F^{-1}(P) = -\ln(1-P)/\lambda$.
Check it out (see code):
In [ ]:
lam = 1.0
u = np.random.rand(1000)      # uniform draws on [0,1)
x = -np.log(1.0 - u) / lam    # apply the quantile function, F_inverse
plt.rcParams['figure.figsize'] = (15.0, 8.0)
# plot a histogram of these x's along with the exponential PDF
inv_trans_demo(x, lam)
For rejection sampling, we need to define an envelope function which everywhere exceeds the target PDF, $p(x)$, and can be sampled. Let this be $A\,g(x)$, where $A$ is a scaling factor and $g(x)$ is a PDF we know.
Then the algorithm is
while we want more samples
draw x from g(x)
draw u from Uniform(0,1)
if u <= p(x)/(A*g(x)), keep the sample x
otherwise, reject x
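A minimal sketch with an illustrative target (deliberately different from the exercise at the end of these notes): $p(x)=2x$ on $[0,1]$, enveloped by $A\,g(x)$ with $g=\mathrm{Uniform}(0,1)$ and $A=2$.
In [ ]:
import numpy as np

# Rejection sampling sketch. Illustrative target (an assumption, not from
# the notes): p(x) = 2x on [0,1]; envelope g(x) = Uniform(0,1) with A = 2,
# so A*g(x) = 2 >= p(x) everywhere on [0,1].
def p(x):
    return 2.0 * x

A = 2.0
samples = []
while len(samples) < 10000:
    x = np.random.rand()        # draw x from g (here g(x) = 1 on [0,1])
    u = np.random.rand()        # draw u from Uniform(0,1)
    if u <= p(x) / (A * 1.0):   # keep x with probability p(x)/(A*g(x))
        samples.append(x)       # otherwise, reject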
In general, there's a trade-off between efficiency and generality in sampling algorithms. We've already seen examples of the very specialized and efficient (independent sampling) and the very general but inefficient (SMC).
We'll cover a number of different algorithms later, but for now let's focus on one that is widely used and simple to implement: the Metropolis-Hastings algorithm, which is one example of Markov Chain Monte Carlo.
Markov chains provide a powerful way to sample PDFs, provided that the transition kernel satisfies a few requirements: the chain must be able to reach any point where the target density is non-zero, must not get trapped in cycles, and must leave the target distribution invariant.
If we have detailed balance, $p(x)\,T(y|x) = p(y)\,T(x|y)$, then integrating both sides over $x$ gives
$\int p(x)\,T(y|x)\,dx = p(y) \int T(x|y)\,dx = p(y)$
i.e. $p$ is left invariant by the transition kernel, which shows that the chain should converge to $P$.
The chain takes some number of steps to converge to the target distribution from wherever it starts. Samples are also correlated with one another.
Consequences:
We should discard the pre-convergence ("burn-in") portion of the chain.
We can keep only every $N$th sample (thinning) without losing much information.
We'll spend more time looking at these features later.
Let $P$ be the distribution we want to sample. The algorithm is then
set x to an initial state
compute p(x)
while we want more samples
draw y from Q(y|x)
compute p(y)
draw u from Uniform(0,1)
if u <= p(y)/p(x) * Q(x|y)/Q(y|x), set x=y
otherwise, x stays the same
store x as a sample
Notice that the probability of accepting a step (once it's proposed) is
$\alpha(y,x) = \mathrm{min}\left[1, \frac{p(y)Q(x|y)}{p(x)Q(y|x)}\right]$
Let's look again at the requirement of detailed balance
the probability of being at $x$ and moving to $y$ must equal the probability of being at $y$ and moving to $x$
The first of these is $p(x)Q(y|x)\alpha(y,x)$, where
$p(x)$ is the posterior density (probability of being at $x$, if we're sampling $P$ properly)
$Q(y|x)$ is the proposal distribution (probability of attempting a move to $y$ from $x$)
$\alpha(y,x)$ is the probability of accepting the proposed move
With this definition of $\alpha$, detailed balance is automatically satisfied!
$p(x)Q(y|x)\alpha(y,x) \equiv p(y)Q(x|y)\alpha(x,y)$
(To check: if $p(y)Q(x|y) \leq p(x)Q(y|x)$, then $\alpha(y,x)=\frac{p(y)Q(x|y)}{p(x)Q(y|x)}$ and $\alpha(x,y)=1$, so both sides equal $p(y)Q(x|y)$; the opposite case follows by symmetry.)
Note that even if a step is rejected, we still keep a sample (the original state, repeated). The difficulty of finding a better point to jump to is itself important information!
For random-walk proposals in high-dimensional problems, an acceptance fraction of $\sim25\%$ has been shown to be optimal under idealized assumptions (Roberts, Gelman & Gilks 1997).
If the proposal distribution is translation invariant, $Q(y|x)=Q\left(\left|y-x\right|\right)$, then it is symmetric in its arguments and drops out of the acceptance probability:
$\alpha(y,x) = \mathrm{min}\left[1, \frac{p(y)}{p(x)}\right]$
This is the original Metropolis algorithm, and is the easiest case to implement.
In this case, we always accept a jump to higher $p$, and sometimes accept one to lower $p$.
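A minimal sketch, assuming an illustrative standard normal target (and working with $\ln p$ to avoid numerical underflow):
In [ ]:
import numpy as np

# Metropolis sketch for an assumed standard normal target (illustration
# only). The Gaussian proposal is symmetric, so Q drops out of alpha.
def log_p(x):
    return -0.5 * x**2          # log of the target density, up to a constant

x = 0.0                         # initial state
log_px = log_p(x)
samples = []
for i in range(10000):
    y = x + 0.5 * np.random.randn()    # propose y from Q(y|x) = N(x, 0.5^2)
    log_py = log_p(y)
    # accept with probability min[1, p(y)/p(x)]
    if np.log(np.random.rand()) <= log_py - log_px:
        x, log_px = y, log_py
    samples.append(x)           # store x whether or not the step was accepted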
Monte Carlo sampling provides a way to characterize posterior distributions without exhaustively exploring the parameter space.
Markov Chain Monte Carlo methods are general enough to accommodate the full power of our Bayesian approach to inference, and are reasonably straightforward to implement.
Later on, we'll look at ways to make MCMC sampling more efficient without sacrificing this generality.
But first, we'll get some experience building and running basic MCMC algorithms, and assessing whether they have successfully converged to the posterior distribution.
In [ ]:
Implement a rejection sampler for the method illustrated above. For this example,
$p(x)$ is the $\chi^2$ distribution with 3 degrees of freedom
$A=\pi$
$g(x)$ is a normal distribution with mean 0 and standard deviation 5
Use a different envelope function if you wish. Verify that your samples do indeed approximate the target PDF.
In [ ]:
Hypothetical: a collaborator of yours is having trouble with the fact that rejected steps lead to repeated samples in the Metropolis algorithm. They maintain that just recording the new parameter values whenever a step is accepted ought to be good enough.
Without resorting to "mathematical proofs" or "reliable references", and without doing any actual sampling, can you come up with a toy problem that convincingly shows that your collaborator's approach is wrong? The simpler the better!