Frequentism and Bayesianism II

Higher-dimensional Models: Outlier Elimination via Marginalization

So far we've seen two relatively simple problems where Bayesian and frequentist approaches give largely the same result, and in which frequentism is often the simpler method to employ. Here we'll take a look at a Bayesian analysis of a more complicated model in which a frequentist approach is much more difficult to apply.

One of the advantages of Bayesianism is that its unified probabilistic framework allows for some really interesting approaches to situations that would be very difficult to treat rigorously within a frequentist framework.

For example, let's imagine that we have a set of data which contains some outliers: that is, some number of the data points have errors much larger than their recorded uncertainties $e_i$. We'll take the approach of modeling this via nuisance parameters $\{g_i\}$, each lying between 0 and 1, which control whether point $i$ is drawn from the "true" error distribution or from a separate "outlier" distribution. What's more, we will incorporate the detection of these outliers into the model itself!

In this setup, the model is now $(2 + N)$-dimensional:

$$ \theta = [\mu, \sigma, g_1, g_2, \ldots, g_N] $$
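
Concretely, in the code below $\theta$ is just a flat array of length $N + 2$. A hypothetical pair of helpers (not used later, purely to make the layout explicit) might look like:

import numpy as np

# hypothetical helpers showing the layout of the parameter vector theta
def pack(mu, sigma, g):
    return np.concatenate([[mu, sigma], g])

def unpack(theta):
    return theta[0], theta[1], theta[2:]  # mu, sigma, array of g_i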

The expression for the likelihood will look like this:

$$\mathcal{L}(D~|~\theta) = \prod_{i=1}^N \left[ \frac{g_i}{\sqrt{2\pi(\sigma^2 + e_i^2)}}\exp\left(\frac{-(F_i - \mu)^2}{2(\sigma^2 + e_i^2)}\right) + \frac{1 - g_i}{\sqrt{2\pi\sigma_\ast^2}}\exp\left(\frac{-(F_i - \mu)^2}{2\sigma_\ast^2}\right) \right]$$

So if $g_i = 1$, the point is considered "good" data, and if $g_i = 0$, the point is considered "bad" data drawn from the distribution with width $\sigma_\ast \gg \sigma$.
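
Numerically, it's safest to evaluate each term of this product in log space, since the exponentials underflow badly for points far from $\mu$. As a minimal sketch (the numbers here are purely illustrative, not taken from the dataset below), one bracketed term can be computed stably with np.logaddexp:

import numpy as np

# one term of the product above, evaluated in log space to avoid underflow
# (illustrative values, not from the dataset generated below)
mu, sigma, sigma_star = 1000.0, 15.0, 500.0
F_i, e_i, g_i = 1040.0, 32.0, 0.9

var = sigma ** 2 + e_i ** 2
log_good = (np.log(g_i) - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (F_i - mu) ** 2 / var)
log_bad = (np.log(1 - g_i) - 0.5 * np.log(2 * np.pi * sigma_star ** 2)
           - 0.5 * (F_i - mu) ** 2 / sigma_star ** 2)
log_term = np.logaddexp(log_good, log_bad)  # log of the bracketed sum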


In [1]:
import numpy as np
from scipy import stats
import emcee

np.random.seed(42)  # for reproducibility
N = 50
mu_true, sigma_true = 1000, 15  # stochastic flux model
F_true = stats.norm(mu_true, sigma_true).rvs(N)  # (unknown) true flux
e_true = np.sqrt(F_true)  # Poisson-like errors on the true flux
F = np.random.normal(F_true, e_true)  # observed flux

# Create 10% outliers
sigma_star = 500
mask = (np.random.random(N) < 0.1)
F[mask] += stats.norm(0, sigma_star).rvs(mask.sum())
F = np.maximum(F, 0)  # clip at zero so the error estimate below is defined
e = np.sqrt(F)  # recorded errors

def log_prior(theta):
    # flat prior on mu and sigma; each g_i must lie in [0, 1]
    if np.all(theta[2:] >= 0) and np.all(theta[2:] <= 1):
        return 0
    else:
        return -np.inf

def log_likelihood(theta, F, e, sigma_star):
    mu, sigma, g = theta[0], theta[1], theta[2:]
    var = sigma ** 2 + e ** 2
    # log of each mixture component; the factor of -0.5 must be applied
    # to each component *before* they are combined with logaddexp
    logL_good = (np.log(g) - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (F - mu) ** 2 / var)
    logL_bad = (np.log(1 - g) - 0.5 * np.log(2 * np.pi * sigma_star ** 2)
                - 0.5 * (F - mu) ** 2 / sigma_star ** 2)
    return np.sum(np.logaddexp(logL_good, logL_bad))

def log_posterior(theta, F, e, sigma_star):
    logP = log_prior(theta)
    if np.isneginf(logP):
        return logP  # skip the likelihood outside the prior support
    return logP + log_likelihood(theta, F, e, sigma_star)

ndim, nwalkers = N + 2, 150
nsteps, nburn = 5000, 3000

starting_guesses = np.random.rand(nwalkers, ndim)  # g_i start in [0, 1)
starting_guesses[:, 0] *= 2000  # mu
starting_guesses[:, 1] *= 20    # sigma

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior,
                                args=[F, e, sigma_star])
sampler.run_mcmc(starting_guesses, nsteps)

# chain has shape (nwalkers, nsteps, ndim); discard the burn-in and flatten
sample = sampler.chain[:, nburn:, :].reshape(-1, ndim)
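
With the trace in hand, marginalizing over the $N$ nuisance parameters is as simple as ignoring their columns. Here is a sketch of how one might summarize the results (the 0.5 cutoff on the posterior mean of each $g_i$ is an arbitrary but serviceable threshold for flagging outliers):

In [2]:
# the first two columns of the flattened trace are mu and sigma;
# marginalizing over the g_i means simply ignoring the remaining columns
mu_samples, sigma_samples = sample[:, 0], sample[:, 1]
print("mu    = {0:.0f} +/- {1:.0f}".format(mu_samples.mean(), mu_samples.std()))
print("sigma = {0:.0f} +/- {1:.0f}".format(sigma_samples.mean(), sigma_samples.std()))

# the posterior mean of each g_i estimates the probability that point i
# is "good"; points where it falls below 0.5 are flagged as outliers
g_mean = sample[:, 2:].mean(axis=0)
outliers = (g_mean < 0.5)
print("flagged {0} of {1} points as outliers".format(outliers.sum(), N))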


