"Importance Sampling", Owen, 2013

"In many applications we want to compute $\mu = \mathbb{E}[f(\mathbf{X})]$ where $f(\mathbf{x})$ is nearly zero outside a region $A$ for which $\mathbb{P}(\mathbf{X} \in A)$ is small. The set $A$ may have small volume, or it may be in the tail of the $\mathbf{X}$ distribution. A plain Monte Carlo sample from the distribution $\mathbf{X}$ could fail to have even one point inside the region $A$. Problems of this type arise in high energy physics, Bayesian inference, rare event simulation for finance and insurance, and rendering in computer graphics among other areas.

"It is clear intuitively that we must get some samples from the interesting or important region. We do this by sampling from a distribution that over-weights the important region, hence the name importance sampling. Having oversampled the important region, we have to adjust our estimate somehow to account for having sampled from this other distribution.

"Importance sampling can bring enormous gains, making an otherwise infeasible problem amenable to Monte Carlo. It can also backfire, yielding an estimate with infinite variance when simple Monte Carlo would have had a finite variance. It is the hardest variance reduction method to use well.

"Importance sampling is more than just a variance reduction method. It can be used to study one distribution while sampling from another. As a result we can use importance sampling as an alternative to acceptance-rejection sampling, as a method for sensitivity analysis and as the foundation for some methods of computing normalizing constants of probability densities. Importance sampling is also an important prerequisite for sequential Monte Carlo. For these reasons we spend a whole chapter on it."

9.1 Basic importance sampling

$\def\E{\mathbb{E}} \def\X{\mathbf{X}} \def\D{\mathcal{D}} \def\x{\mathbf{x}} \def\R{\mathbb{R}} \def\Var{\text{Var}} \def\Cov{\text{Cov}} \def\Corr{\text{Corr}} \def\Q{\mathcal{Q}}$

"Suppose that our problem is to find $\mu = \E[f(\X)] = \int_\D f(\x)\,p(\x)\,d\x$ where $p$ is a probability density on $\D \subseteq \R^d$ and $f$ is the integrand. We take $p(\x) = 0$ for all $\x \notin \D$. If $q$ is a positive probability density function on $\R^d$, then

$$ \mu = \int_\D f(\x)\,p(\x)\,d\x \\ = \int_\D \frac{f(\x)\,p(\x)}{q(\x)} q(\x) \,d\x \\ = \E_q\left[ \frac{ f(\X)\, p(\X)} {q(\X)} \right] \tag{9.1} $$

where $\E_q(\cdot)$ denotes expectation for $\X \sim q$. We also write $\Var_q(\cdot)$, $\Cov_q(\cdot, \cdot)$, and $\Corr_q(\cdot, \cdot)$ for variance, covariance and correlation when $\X \sim q$. Our original goal is then to find $\E_p[f(\X)]$. By making a multiplicative adjustment to $f$ we compensate for sampling from $q$ instead of $p$. The adjustment factor $p(\x)/q(\x)$ is called the likelihood ratio. The distribution $q$ is the importance distribution and $p$ is the nominal distribution.
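As a concrete illustration (an example of ours, not from the text): if the nominal distribution is $p = N(0,1)$ and the importance distribution is $q = N(\theta, 1)$ for some shift $\theta$, then the likelihood ratio is

$$ \frac{p(x)}{q(x)} = \frac{\exp(-x^2/2)}{\exp(-(x-\theta)^2/2)} = \exp\left( -\theta x + \tfrac{\theta^2}{2} \right), $$

so, for $\theta > 0$, points far to the right, which $q$ oversamples, receive weights well below $1$, while points to the left receive weights above $1$.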

"The importance distribution $q$ does not have to be positive everywhere. It is enough to have $q(x) > 0$ whenever $f(\x)\,p(\x) \ne 0$. That is, for $\Q = \{\x \mid q(\x) > 0 \}$ we have $\x \in \Q$ whenever $f(\x)\,p(\x) \ne 0$. So if $\x \in \D \cap \Q^c$ we know that $f(\x) = 0$, while if $\x \in \Q \cup \D^c$ we have $p(\x) = 0$. Now

$$ \E_q\left( \frac{f(\X)p(\X)} {q(\X)} \right) = \int_\Q \frac{f(\x) \, p(\x)}{q(\x)} q(\x)\,d\x = \int_\Q f(\x)\,p(\x)\,d\x \\ = \int_\D f(\x)\,p(\x)\,d\x + \int_{\Q \cap \D^c} f(\x)p(\x)\,d\x - \int_{\D \cap \Q^c} f(\x) \, p(\x) \, d \x \\ = \int_\D f(\x)\,p(\x)\,d\x = \mu \tag{9.2} $$

"It is natural to wonder what happens for $\x$ with $q(\x) = 0$ in the denominator. The answer is that there are no such points $\x \in \Q$ and we will never see one when sampling $\X \sim q$. Later we will see examples where $q(\x)$ close to $0$ causes extreme difficulties, but $q(\x) = 0$ is not a problem if $f(\x)\,p(\x) = 0$ too.

"When we want $q$ to work for many different functions $f_j$ then we need $q(\x) > 0$ at every $\x$ where any $f_j(\x)\,p(\x) \ne 0$. Then a density $q$ with $q(\x) > 0$ whenever $p(\x) > 0$ will suffice, and will allow us to add new function $f_j$ to our list after we've drawn the sample.

"The importance sampling estimate of $\mu = \E_p[f(\X)]$ is:

$$ \hat{\mu}_q = \frac{1}{n} \sum_{i=1}^n \frac{f(\X_i)\,p(\X_i)} {q(\X_i)},\: \X_i \sim q \tag{9.3} $$

"To use (9.3), we must be able to compute $fp/q$. Assuming that we can compute $f$, this estimate requires that we can compute $p(\x)/q(\x)$ at any $\x$ we might sample. When $p$ or $q$ has an unknown normalization constant, then we will resort to a ratio estimate. For now, we assume that $p/q$ is computable, and study the variance of $\hat{\mu}_q$. Exponential tilting is one way to chooe $q$ with computable $p/q$ even when $p$ is unnormalized.

Theorem 9.1 Let $\hat{\mu}_q$ be given by (9.3) where $\mu = \int_\D f(\x) \, p(\x) \, d\x$ and $q(\x) > 0$ whenever $f(\x)\, p(\x) \ne 0$. Then $\E_q[\hat{\mu}_q] = \mu$, and $\Var_q(\hat{\mu}_q) = \sigma^2_q/n$ where

$$ \sigma^2_q = \int_\D \frac{(f(\x)\,p(\x))^2} {q(\x)} \, d\x - \mu^2 \\ =\int_\D \frac{(f(\x)\, p(\x) - \mu q(\x))^2} {q(\x)} \, d\x \tag{9.4} $$

Proof That $\E_q(\hat{\mu}_q) = \mu$ follows directly from (9.2). Using $\Q = \{ \x \mid q(\x) > 0\}$, we find that

$$ \Var_q(\hat{\mu}_q) = \frac{1}{n} \left( \int_\Q \left( \frac{f(\x)\, p(\x)} {q(\x)} \right)^2 q(\x)\, d\x - \mu^2 \right) \\ = \frac{1}{n}\left( \int_\D \left( \frac{f(\x)\, p(\x)} {q(\x)} \right)^2 q(\x)\,d\x - \mu^2 \right) $$
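In practice $\sigma^2_q$ is usually estimated from the same sample: the terms $f(\X_i)\,p(\X_i)/q(\X_i)$ in (9.3) are i.i.d. under $q$ with mean $\mu$ and variance $\sigma^2_q$, so their sample variance estimates $\sigma^2_q$ and gives a standard error for $\hat{\mu}_q$. A minimal sketch, continuing the same toy setup as above (our assumption, not the text's example):

```python
# Sketch: estimate sigma^2_q of Theorem 9.1 by the sample variance of the
# i.i.d. terms in (9.3), and form the standard error sqrt(sigma^2_q / n).
# Toy setup (assumed, not from the text): p = N(0,1), q = N(4,1), f = 1{x > 4}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10_000

p, q = stats.norm(0.0, 1.0), stats.norm(4.0, 1.0)

x = q.rvs(size=n, random_state=rng)
y = (x > 4.0) * p.pdf(x) / q.pdf(x)   # terms f(X_i) p(X_i) / q(X_i)

mu_hat = y.mean()
sigma2_hat = y.var(ddof=1)            # sample estimate of sigma^2_q in (9.4)
se = np.sqrt(sigma2_hat / n)          # estimated sqrt of Var_q(mu_hat)

print(mu_hat, se)
# For comparison, plain Monte Carlo from p has per-sample variance
# mu(1 - mu), thousands of times larger than sigma^2_q in this example.
```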