In [ ]:
exec(open('../code/montecarlo1.py').read()) # see code here for later demos
%matplotlib inline
So far, we've restricted ourselves to inferences where we could afford to evaluate the posterior exhaustively, e.g. over a grid in parameter space.
In general, evaluating the posterior throughout the entire parameter space is too costly. We want to focus resources on mapping the posterior where it is non-tiny. Generating samples from the posterior itself automatically accomplishes this.
Almost always, we are ultimately interested in integrals of the posterior, i.e. marginal distributions of parameters. The tasks of Monte Carlo sampling and Monte Carlo integration are essentially indistinguishable. (Similar machinery is useful for difficult optimization problems.)
The essence of MC integration:
$\int w(x)\,p(x)\,dx = \int w(x)\,dP(x) \approx \overline{w(x_i)}; ~ x_i\sim P$
i.e., if we can factor the integrand into a PDF and a weight, and sample from the PDF, then our integral becomes an average.
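As a one-line illustration (with a weight function of our own choosing, not from the notes): estimating $\int x^2\,p(x)\,dx$ for a standard normal $p$, whose exact value is 1.
In [ ]:
import numpy as np
# w(x) = x^2 (an arbitrary choice for illustration), p = standard normal
x = np.random.randn(100000)  # x_i ~ P
print(np.mean(x**2))         # the integral estimate: an average of w(x_i); should be near 1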
A posterior is already naturally factored into a likelihood function and a prior PDF.
$P(\theta|x) \propto P(x|\theta)\,P(\theta)$
Applying this in the MC integration context leads to the Simple Monte Carlo algorithm:
while we want more samples
    draw theta from P(theta)
    compute weight = P(x|theta)
    store theta and weight
Obtaining marginal distribution(s) for $\theta$ then reduces to constructing weighted histograms of the samples.
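As a minimal sketch (the data, known sigma, and uniform prior below are invented for illustration), SMC for the mean of Gaussian data might look like this:
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# invented data: Gaussian with unknown mean mu and known sigma
data = np.array([4.2, 5.1, 4.8, 5.5, 4.9])
sigma = 1.0

# draw mu from the prior, here Uniform(0, 10)
mu = np.random.uniform(0.0, 10.0, 100000)
# weight = likelihood P(data|mu)
weight = np.exp(-0.5 * np.sum((data[None, :] - mu[:, None])**2, axis=1) / sigma**2)

# marginal posterior for mu: a weighted histogram of the samples
plt.hist(mu, bins=100, weights=weight, density=True);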
SMC is indeed simple (as long as the prior is simple to draw from), but if the priors are not very informative then it still wastes many likelihood evaluations where the posterior is small. However, refinements of this approach lead to some of the advanced algorithms we'll cover later.
For now, we'll focus on the most common methods, which use a weight function of unity, i.e. obtain draws directly from the posterior.
But first, a bit more context re: random number generation.
Useful terms to know:
Random: predictable only in the sense of following a PDF
Pseudorandom: not random, but "unpredictable enough" for practical purposes. Various computer algorithms produce pseudorandom sequences that approximate the uniform distribution on [0,1).
Quasirandom: sequence that doesn't even pretend to be random, but does converge to a target PDF more quickly than a random or pseudorandom process would
Recall the definition of the CDF (and its inverse, $F^{-1}$, the quantile function),
$F(x) = P(X \leq x) = \int_{-\infty}^x p(x')\,dx'$
By this definition, the quantiles of $X$, i.e. $F(X)$, are uniformly distributed on [0,1]. If $F^{-1}$ is easy to evaluate, we can take advantage of this straightforwardly (the inverse transform method):
draw u from Uniform(0,1)
compute x = F_inverse(u)
In [ ]:
lam = 1.0
u = np.random.rand(1000)
x = -np.log(1 - u) / lam  # F_inverse(u) for the exponential distribution
plt.rcParams['figure.figsize'] = (15.0, 8.0)
# plot a histogram of these x's against the exponential PDF
inv_trans_demo(x, lam)
For rejection sampling, we need to define an envelope function which everywhere exceeds the target PDF, $p(x)$, and which can be sampled. Let this be $A\,g(x)$, where $A$ is a scaling factor and $g(x)$ is a PDF we know how to draw from.
Then the algorithm is
while we want more samples
    draw x from g(x)
    draw u from Uniform(0,1)
    if u <= p(x)/(A*g(x)), keep the sample x
    otherwise, reject x
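To leave the chi-squared exercise below untouched, here is a sketch with a different target/envelope pair: a standard normal target under a scaled Cauchy envelope. The ratio p(x)/g(x) peaks at about 1.52 (at x = ±1), so A = 1.6 is a safe scaling.
In [ ]:
import numpy as np
from scipy.stats import norm, cauchy

A = 1.6  # safe scaling: max of norm.pdf(x)/cauchy.pdf(x) is about 1.52
samples = []
while len(samples) < 10000:          # while we want more samples
    x = np.random.standard_cauchy()  # draw x from g(x)
    u = np.random.rand()             # draw u from Uniform(0,1)
    if u <= norm.pdf(x) / (A * cauchy.pdf(x)):
        samples.append(x)            # keep the sample x; otherwise reject
samples = np.array(samples)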
In general, there's a trade-off between efficiency and generality in sampling algorithms. We've already seen examples of the very specialized and efficient (independent sampling) and the very general but inefficient (SMC).
We'll cover a number of different algorithms later, but for now let's focus on one that is widely used and simple to implement: the Metropolis-Hastings algorithm, which is one example of Markov Chain Monte Carlo.
Markov chains provide a powerful way to sample PDFs, provided that the transition kernel satisfies a few requirements (chiefly, that the chain be ergodic and have the target distribution as its stationary distribution).
Consequences:
Chains take time to converge to the target distribution, depending on where they start
Samples within a chain are correlated with one another
We'll spend more time looking at these features later.
Let $P$ be the distribution we want to sample. The algorithm is then
set x to an initial state
compute P(x)
while we want more samples
    draw y from Q(y|x)
    compute P(y)
    draw u from Uniform(0,1)
    if u <= P(y)/P(x) * Q(x|y)/Q(y|x), set x=y
    otherwise, x stays the same
    store x as a sample
Notice that the accept/reject step depends on the acceptance ratio
$\frac{P(y)Q(x|y)}{P(x)Q(y|x)}$, where
$P(x)$ is the posterior density (the probability of being at $x$, if we're sampling $P$ properly)
$Q(y|x)$ is the proposal distribution (probability of attempting a move to $y$ from $x$)
This construction allows us to satisfy detailed balance.
If the proposal distribution is symmetric, e.g. translation invariant with $Q(y|x)=Q\left(\left|y-x\right|\right)$, then it drops out of the acceptance ratio entirely. This is the original Metropolis algorithm, and is the easiest case to implement.
In this case, we always accept a jump to higher $P$, and sometimes accept one to lower $P$.
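Here is a minimal sketch of this symmetric-proposal case, working with the log posterior for numerical stability; the 2D Gaussian target at the end is a made-up example, not from the notes.
In [ ]:
import numpy as np

def metropolis(log_post, x0, width, nsteps):
    # log_post: function returning ln P(x); x0: starting point;
    # width: standard deviation of the Gaussian proposal Q
    x = np.array(x0, dtype=float)
    lnp = log_post(x)
    chain = np.empty((nsteps, len(x)))
    for i in range(nsteps):
        y = x + width * np.random.randn(len(x))  # draw y from Q(y|x)
        lnpy = log_post(y)
        # accept if u <= P(y)/P(x); Q cancels because it's symmetric
        if np.log(np.random.rand()) <= lnpy - lnp:
            x, lnp = y, lnpy
        chain[i] = x  # store x (repeated whenever the proposal is rejected)
    return chain

# e.g. sampling a 2D standard normal from a poor starting point
chain = metropolis(lambda x: -0.5 * np.sum(x**2), [5.0, -5.0], 0.5, 10000)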
This is so important to understand that we'll go through an example and exercise in detail in this sandbox notebook.
We need to be careful about
To illustrate, here are 4 chains from the sandbox notebook with different starting points and width=0.1.
The Gelman-Rubin statistic tests the similarity of independent chains intended to sample the same PDF. To be meaningful, the chains should be started from different locations, and burn-in should be removed.
For a given parameter, $\theta$, the $R$ statistic compares the variance across chains with the variance within a chain. Intuitively, if the chains are random-walking in very different places, i.e. not sampling the same distribution, $R$ will be large.
We'd like to see $R\approx 1$ (e.g. $R<1.1$ is often used).
In detail, given chains $j=1,\ldots,m$, each of length $n$,
Let $B=\frac{n}{m-1} \sum_j \left(\bar{\theta}_j - \bar{\theta}\right)^2$, where $\bar{\theta}_j$ is the average $\theta$ for chain $j$ and $\bar{\theta}$ is the global average. This is proportional to the variance of the individual-chain averages for $\theta$.
Let $W=\frac{1}{m}\sum_j s_j^2$, where $s_j^2$ is the estimated variance of $\theta$ within chain $j$. This is the average of the individual-chain variances for $\theta$.
Let $V=\frac{n-1}{n}W + \frac{1}{n}B$. This is an estimate for the overall variance of $\theta$.
The convergence statistic is then $R=\sqrt{V/W}$, which approaches 1 as the chains converge to a common distribution.
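In code, a direct transcription of these formulas for a single parameter might look like the following (here chains is assumed to be an m-by-n array, one row per chain, with burn-in already removed):
In [ ]:
import numpy as np

def gelman_rubin(chains):
    # chains: (m, n) array; rows are chains, burn-in already removed
    m, n = chains.shape
    means = chains.mean(axis=1)                          # per-chain averages
    B = n / (m - 1) * np.sum((means - means.mean())**2)  # between-chain term
    W = np.mean(chains.var(axis=1, ddof=1))              # mean within-chain variance
    V = (n - 1) / n * W + B / n                          # overall variance estimate
    return np.sqrt(V / W)                                # R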
The autocorrelation of a sequence, as a function of lag $k$, is defined as
$\rho_k = \frac{\sum_{i=1}^{n-k}\left(\theta_{i} - \bar{\theta}\right)\left(\theta_{i+k} - \bar{\theta}\right)}{\sum_{i=1}^{n-k}\left(\theta_{i} - \bar{\theta}\right)^2} = \frac{\mathrm{Cov}_i\left(\theta_i,\theta_{i+k}\right)}{\mathrm{Var}(\theta)}$
The larger lag one needs to get a small autocorrelation, the less informative individual samples are.
The pandas function autocorrelation_plot() may be useful for this.
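For example, assuming chain is an array of samples with one column per parameter (as produced by the Metropolis sketch above):
In [ ]:
import pandas as pd
from pandas.plotting import autocorrelation_plot

# autocorrelation vs. lag for the first parameter of `chain`
autocorrelation_plot(pd.Series(chain[:, 0]))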
In [ ]:
Implement a rejection sampler for the example figure shown above. For this example,
$p(x)$ is the $\chi^2$ distribution with 3 degrees of freedom
$A=\pi$
$g(x)$ is a normal distribution with mean 0 and standard deviation 5
Use a different envelope function if you wish. Verify that your samples do indeed approximate the target PDF.
In [ ]:
Hypothetical: a collaborator of yours is having trouble with the fact that rejected steps lead to repeated samples in the Metropolis algorithm. They maintain that just recording the new parameter values whenever a step is accepted ought to be good enough.
Without resorting to "mathematical proofs" or "reliable references", and without doing any actual sampling, can you come up with a toy problem that convincingly shows that your collaborator's approach is wrong? The simpler the better!
Write some code to perform the Gelman-Rubin convergence test. Try it out on
multiple chains from the sandbox notebook. Fiddle with the sampler to get chains that do/do not display nice convergence after e.g. 5000 steps.
multiple "chains" produced from independent sampling, e.g. from the inverse-transform or rejection examples above or one of the examples in previous chunks.
You'll be expected to test convergence from now on, so having a function to do so will be helpful.
MegaBonus: modify your code to compute the $R$ statistic for the eigenvectors of the covariance of the posterior (yikes). This can be informative when there are strong parameter degeneracies, as in the convergence example above. The eigenvectors can be estimated efficiently from a chain using singular value decomposition.
In [ ]: