Iterative Importance Sampling: concepts and a simple example

A Bayesian method

Classically, a single ensemble is used: sample from the prior, weight by the likelihood, derive posterior estimates.

posterior $\propto$ likelihood * prior

p( x | o ) $\propto$ p( o | x ) * b(x)

In the Gaussian case: $$ p( o | x ) \propto e^{-\frac{1}{2} \left( \frac{x-o}{\sigma_o} \right)^2}$$

Basic procedure:

  • sample $x_i$ from the prior b(x)
  • compute the weights $w_i = p( o | x_i )$ and normalize them so that $\sum_i w_i = 1$
  • estimate posterior quantities by carrying the weights around (see the sketch below), e.g.
  • mean = $\sum_i x_i w_i $
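
A minimal sketch of this basic procedure (hypothetical Gaussian prior and single observation, independent of the iis module):

import numpy as np
from scipy.stats import norm

# hypothetical setup: Gaussian prior b(x), single observation o with error sigma_o
prior = norm(loc=0, scale=5)            # b(x)
o, sigma_o = 1.0, 1.0

x = prior.rvs(size=1000)                # sample x_i from the prior b(x)
w = norm(loc=o, scale=sigma_o).pdf(x)   # w_i = p(o | x_i), Gaussian case
w /= w.sum()                            # normalize the weights
mean = np.sum(x * w)                    # weighted estimate of the posterior mean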

Alternatively, instead of carrying weights around, resample the ensemble according to the weights (as sketched below):

  • posterior weights of 3 members: 1 (w=0.2), 2 (w=0.5), 3 (w=0.3)
  • equivalent resampled ensemble (10 members): [1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
  • mean = $\frac{1}{10} \sum_{i=1}^{10} x_i $
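
A possible sketch of this resampling variant (using multinomial resampling via numpy; the deterministic 10-member ensemble above is only reproduced in expectation):

import numpy as np

x = np.array([1.0, 2.0, 3.0])   # the 3 ensemble members
w = np.array([0.2, 0.5, 0.3])   # their posterior weights

# draw a 10-member resampled ensemble: each member is picked with probability w_i
ids = np.random.choice(len(x), size=10, p=w)
resampled = x[ids]

mean = resampled.mean()         # plain, unweighted mean of the resampled ensemble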

The main problem with such an approach is that if the prior belief is too far off the posterior, the weights may all be numerically zero, or be concentrated on a few samples only, yielding a poor approximation of the PDF. The problem is especially acute when exploring a high-dimensional parameter space, since the volume to be covered grows exponentially with the number of parameters.
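
As a quick illustration of this weight degeneracy (a hypothetical one-dimensional setup, not part of the iis module), the effective sample size $1 / \sum_i w_i^2$ drops from the ensemble size towards 1 when the prior is centred far from the observation:

import numpy as np
from scipy.stats import norm

prior = norm(loc=10, scale=1)           # prior centred far from the observation
o, sigma_o = 0.0, 1.0                   # observation at 0 with unit error

x = prior.rvs(size=1000)
w = norm(loc=o, scale=sigma_o).pdf(x)   # likelihood weights p(o | x_i)
w /= w.sum()

# effective sample size: ~1000 for uniform weights, close to 1 when the weights collapse
print(1.0 / np.sum(w**2))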

The idea behind iterative importance sampling is to repeat the weighting/resampling procedure iteratively using intermediate, or bridging, distributions, so that the sequence of distributions sampled by the particles converges smoothly toward the posterior (this has much in common with Markov chain Monte Carlo, except that here we work with an ensemble of particles at each iteration step). One challenge is to design the weighting so as to avoid an over-concentration of weights, a problem sometimes called ensemble collapse or degeneracy.

Annan and Hargreaves (2010) proposed an iterative importance sampling (IIS) scheme in which the weighting function is a scaled version of the posterior, via an exponent $\epsilon$ (0.05 by default). Moreover, each weighting/resampling step is counterbalanced by the addition of noise, or jitter. The noise is drawn from a (multivariate) normal approximation of the ensemble, but with variance scaled down by the same factor $\epsilon$.

Each iteration consists of two steps (a multivariate sketch follows the list):

  • weighting/re-sampling using $f^\epsilon $
    where $ f = p( x | o ) \propto p( o | x ) b(x) $
  • addition of jitter $N(0, \epsilon \Sigma)$ where $\Sigma$ is the covariance matrix of the resampled ensemble.
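
A rough sketch of one such iteration in the multivariate case (hypothetical target $f$ and starting ensemble, written with plain numpy/scipy rather than the iis module; the one-dimensional notebook version follows below):

import numpy as np
from scipy.stats import multivariate_normal

epsilon = 0.05
f = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))   # stands in for p(x|o), up to a constant
ensemble = np.random.uniform(-10, 10, size=(500, 2))      # some proposal ensemble

# step 1: weighting / resampling using f**epsilon
w = np.exp(f.logpdf(ensemble) * epsilon)
w /= w.sum()
ids = np.random.choice(len(ensemble), size=len(ensemble), p=w)
ensemble = ensemble[ids]

# step 2: jitter drawn from N(0, epsilon * Sigma), Sigma = covariance of the resampled ensemble
Sigma = np.cov(ensemble, rowvar=False)
ensemble = ensemble + np.random.multivariate_normal(np.zeros(2), epsilon * Sigma, size=len(ensemble))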

In the Gaussian case, these two steps correspond to the following operations on the PDF of any proposal distribution $\phi$:

  • weighting by $f^\epsilon$: $\phi \rightarrow \phi f^\epsilon$ (does not require f to be Gaussian)
  • jittering: $\phi \rightarrow \phi^{\frac{1}{1+\epsilon}}$ (up to normalization)
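
A quick numerical check of the jittering identity in the Gaussian case (a standalone sketch): raising a Gaussian with variance $\sigma^2$ to the power $\frac{1}{1+\epsilon}$ multiplies its variance by $1+\epsilon$, which is exactly what adding $N(0, \epsilon \sigma^2)$ noise does to the samples:

import numpy as np

epsilon = 0.05
sigma = 2.0

x = np.random.normal(0, sigma, size=100000)   # samples from phi = N(0, sigma**2)
# jittering: add N(0, epsilon * sigma**2) noise
x_jittered = x + np.random.normal(0, np.sqrt(epsilon) * sigma, size=x.size)

# both numbers should be close to sigma**2 * (1 + epsilon) = 4.2
print(np.var(x_jittered), sigma**2 * (1 + epsilon))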

It is not too difficult to prove that if one first samples the model parameters from a proposal distribution $g$, the ensemble will follow, at each iteration step $i$, the distribution:

$$\phi_i = g^{1-\alpha_i} f^{\alpha_i}$$

where $\alpha_i$ is a sequence that satisfies:

$$\alpha_{i+1} = \frac{\alpha_i + \epsilon}{1 + \epsilon}$$

with $\alpha_0 = 0$. The sequence converges, with limits:

$$\alpha_i \rightarrow 1$$

$$\phi_i \rightarrow f = p( x | o )$$
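
As a sketch of why this holds (assuming the jittering step acts as $\phi \rightarrow \phi^{\frac{1}{1+\epsilon}}$, i.e. the Gaussian approximation above): if the ensemble follows $\phi_i = g^{1-\alpha_i} f^{\alpha_i}$, then one iteration gives

$$ g^{1-\alpha_i} f^{\alpha_i} \xrightarrow{\text{weighting}} g^{1-\alpha_i} f^{\alpha_i+\epsilon} \xrightarrow{\text{jittering}} g^{\frac{1-\alpha_i}{1+\epsilon}} f^{\frac{\alpha_i+\epsilon}{1+\epsilon}} = g^{1-\alpha_{i+1}} f^{\alpha_{i+1}}$$

and since $1-\alpha_{i+1} = \frac{1-\alpha_i}{1+\epsilon}$ with $\alpha_0 = 0$, the recursion has the closed form $\alpha_i = 1 - (1+\epsilon)^{-i}$, which indeed tends to 1.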

A numerical example (almost without the iis module)

The following part of this notebook presents a simple example of the convergence of a sequence of distributions under successive weighting and jittering (variance-increase) steps. The iis module does not do much more than provide a simple user interface, including multivariate models, along with a couple of handy functions for plotting and diagnostics. Of course, the iis module could be extended with new methods, etc.


In [ ]:
import sys 
import numpy as np
from scipy.stats import norm, uniform, lognorm
from iis.resampling import multinomial_resampling, residual_resampling # , stratified_sampling

In [ ]:
#+++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Initialization
#+++++++++++++++++++++++++++++++++++++++++++++++++++++++
# ensemble size
size = 500

# posterior distribution (the truth)
posterior = norm(loc=0, scale=1) 
#posterior = lognorm(1) # log-normal distribution : test deviation from theoretical gaussian case

# proposal distribution from which to start the process (no prior here)
proposal = uniform(loc=0, scale=100)

# initialize (or reset) ensemble and a few other things 
ensemble = proposal.rvs(size=size)
ensembles = [] # initialize ensembles
alpha = 0
alphas = []
iteration = 0

In [ ]:
#+++++++++++++++++++++++++++++++++++++++++++++++++++++
# estimation (can be executed as many times as needed)
#+++++++++++++++++++++++++++++++++++++++++++++++++++++
# algorithm parameters
epsilon = 0.05  # exponent

# number of iterations
iterations = 60

# sampling method (important !)
#
# - multinomial_resampling : purely random ("random walk") sampling, non-deterministic
# - residual_resampling : partly deterministic (most of the particles are copied directly according to the
#    integer parts of the weights, then multinomial resampling is applied to the residual weights)
# - stratified_sampling : NOT IMPLEMENTED (used by Annan and Hargreaves, 2010).
# - deterministic_sampling : NOT IMPLEMENTED; starts like residual resampling, but aggregates residual weights into
#    larger particles (called support particles) according to a recursive partition of space, along a user-input
#    preferred sub-space. See Li et al (2012) in Signal Processing, doi:10.1016/j.sigpro.2011.12.019
#
resample_ensemble = residual_resampling  

for iteration in range(iterations):
    ensembles.append(ensemble) # ensemble before update, for the record
    alphas.append(alpha)
    
    # weight the samples according to a scaled version of the target (posterior) density
    weights = np.exp(posterior.logpdf(ensemble)*epsilon)
    weights /= weights.sum() # normalized weights
    
    # resample according to weights
    ids = resample_ensemble(weights, ensemble.size)
    ensemble = ensemble[ids]  # new ensemble 
    
    # add jitter
    variance = np.var(ensemble) # after resampling
    jitter = np.random.normal(loc=0, scale=np.sqrt(variance*epsilon), size=ensemble.size)
    ensemble += jitter
    
    # diagnostic only: update alpha to check convergence properties
    # supposedly sampled distribution will be: proposal**(1-alpha) * posterior**alpha
    alpha = (alpha + epsilon)/(1+epsilon)
    
    #if alpha > 0.99999: break

In [ ]:
#+++++++++++++++++++++++++++++++++++++++++++++++++++++
# check results
#+++++++++++++++++++++++++++++++++++++++++++++++++++++
%matplotlib inline
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, sharex=True)
bins = 10

mi, ma = -20, 20
x = np.linspace(mi, ma, 1000)

# PDF
ax = axes[0]
ax.hist(ensemble, bins, histtype='step', label='result', density=True)
ax.plot(x, posterior.pdf(x), label='truth')
ax.legend(frameon=False)
ax.set_title("Histogram (empirical PDF)")

# CDF
ax = axes[1]
ax.hist(ensemble, bins, histtype='step', label='result', density=True, cumulative=True)
ax.plot(x, posterior.cdf(x), label='true posterior')
ax.legend(frameon=False)
ax.set_title("Empirical cumulative distribution function (CDF))")

fig.set_figwidth(17)
ax.set_xlim([mi, ma])

In [ ]:
#+++++++++++++++++++++++++++++++++++++++++++++++++++++
# check convergence properties
#+++++++++++++++++++++++++++++++++++++++++++++++++++++
%matplotlib inline
import matplotlib.pyplot as plt

pct = [50, 5, 95, 16, 84]
n = len(ensembles)
data = np.empty((n, len(pct)))

for i, ens in enumerate(ensembles):
    data[i] = np.percentile(ens, pct)
    
x = np.arange(n)
f90 = plt.fill_between(x, data[:, 1], data[:, 2], alpha=0.2)
f67 = plt.fill_between(x, data[:, 3], data[:, 4], alpha=0.5)
plt.plot(x, data[:, 0], label='model')

# add truth
mi, ma = plt.xlim()
truth = posterior.ppf(np.array(pct)/100.) # quantiles
plt.hlines(truth[0], mi, ma, linestyle='-', label='truth', color='red')
plt.hlines(truth[[1,2]], mi, ma, linestyle=':', color='red')
plt.hlines(truth[[3,4]], mi, ma, linestyle='--', color='red')

plt.legend(frameon=False)
plt.xlim([mi, ma])
plt.xlabel("Iteration number")
plt.ylabel("Quantiles")

# also plot alpha
fig = plt.figure()
plt.plot(x, alphas, label='alpha')
plt.legend(frameon=False, loc="upper left")
plt.title('Alpha value to check convergence')
plt.grid()
fig.set_figheight(3)
plt.ylim([0,1])
plt.xlabel("Iteration number")