Foreground-Background Modeling

Next, we will use a mixture model to handle a realistic data-analysis task: we have some interlopers in our sample that we can't cleanly separate from our objects of interest, so we want to model them as a background component while also inferring the properties of our sample.

Before we jump into this, however, let's write a different version of the Gaussian Mixture Model code that uses sampling rather than the Expectation-Maximization algorithm. Doing this will make it easier to see how to adapt the mixture-model approach to non-Gaussian models.


In [1]:
# Boilerplate setup!
%matplotlib inline
import numpy as np
import pylab as plt
# Some utility code for this notebook
import utils

To start out, we'll use the same dataset as in the E-M notebook:


In [2]:
(amps,means,covs),ax = utils.get_clusters_D()
N = 1000
Xi = utils.sample_clusters(amps, means, covs, N=N)
X = np.vstack(Xi)

# Plot the true cluster memberships
plt.clf()
for i,x in enumerate(Xi):
    plt.plot(x[:,0], x[:,1], 'o', color=utils.colors[i])
plt.title('True cluster labels');
plt.show();


Now, instead of using Expectation-Maximization (EM), which was fairly fast and easy, we're going to treat this as a general model from which we want to sample. One could also use a non-linear optimizer to get just the maximum a posteriori (MAP) answer.

We're going to use the emcee sampler here, so the first thing we need is a log-posterior function: the log-likelihood of our observed data given the parameters of the mixture model, plus a log-prior.

For K components in D dimensions, the mixture model is going to have (K-1) amplitudes (mixture weights, i.e. the fraction of points drawn from each component), K×D means, and K×D×(D+1)/2 covariance terms.
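
For example, with K = 2 components in D = 2 dimensions (as below), that works out to 1 + 4 + 6 = 11 parameters, which is the "number of dimensions to sample" we'll see reported when we set up the sampler:

K, D = 2, 2
nparams = (K - 1) + K*D + K*D*(D + 1)//2
print('Number of parameters:', nparams)   # -> 11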


In [3]:
# Gaussian Mixture Models via sampling using emcee
import emcee

# We need to define the log-posterior probability of drawing our data
# from the Gaussian mixture model described by a given set of parameters.
# The 'emcee' sampler is going to call this function many times with different
# parameter values.
def gmm_logprob(params, K, X):
    N,D = X.shape
    prob = np.zeros(N)

    # 'params' is a numpy array... unpack it into separate amplitude, mean,
    # and covariance arrays.
    amps, means, covs = utils.unpack_gmm_params(params, K, D)

    # Now we have to compute the likelihoods, by evaluating the
    # weighted Gaussians.
    for k in range(K):
        prob += amps[k] * utils.gaussian_probability(X, means[k,:], covs[k,:,:])

    # Total log-likelihood:
    loglikelihood = np.sum(np.log(prob))

    # We're being bad Bayesians here, using improper flat priors on everything.
    # Tsk tsk!
    logprior = 0.
    logposterior = loglikelihood + logprior
    return logposterior
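
Note that gmm_logprob works in probability space and only takes the log at the end, which is why we'll cavalierly ignore some underflow/overflow warnings later. A more numerically stable sketch of the same likelihood (not used in this notebook) would combine the components in log space with scipy's logsumexp; here the log-Gaussian is computed directly with numpy, and we're still assuming the sampled amplitudes stay positive:

from scipy.special import logsumexp

def log_gaussian(X, mean, cov):
    # Log of a multivariate normal density, evaluated at each row of X.
    d = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.sum(d.dot(np.linalg.inv(cov)) * d, axis=1)
    return -0.5 * (maha + logdet + X.shape[1] * np.log(2. * np.pi))

def gmm_logprob_stable(params, K, X):
    N, D = X.shape
    amps, means, covs = utils.unpack_gmm_params(params, K, D)
    # Per-component log of (amplitude * Gaussian density); shape (K, N).
    logterms = np.array([np.log(amps[k]) + log_gaussian(X, means[k,:], covs[k,:,:])
                         for k in range(K)])
    # Log-sum-exp over components, then sum over data points.
    # (Still using the same improper flat prior, so log-prior = 0.)
    return np.sum(logsumexp(logterms, axis=0))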

In [4]:
# Set up for running emcee:
K = 2
D = 2
nwalkers = 50

# Initialization: set the amplitudes equal
amps = np.ones(K)
# Start with the mean of the whole dataset as the initial mean values
mn = np.mean(X, axis=0)
means = np.array([mn for k in range(K)])
# Hack --- separate the initial mean estimates in the x direction...
means[:,0] += np.linspace(-K, K, K)
# Set the covariances to the identity... it would be better to use the
# sample variance, but this works fine; it just takes a little longer to burn in.
covs = np.array([np.eye(D) for k in range(K)])

In [5]:
# To use emcee, we need to pack all of our parameters into a single vector.
# There is a utility function for that...
# initial parameter vector p0:
p0 = utils.pack_gmm_params(amps, means, covs)

# Now, to initialize emcee we need to randomly place each walker at some
# position in the parameter space.  Typically one just scatters them around
# an initial point.  It's okay to scatter them in a tiny ball; it will just take
# a bit longer to converge.
pp = np.vstack([p0 + np.random.normal(scale=0.1, size=len(p0))
                for i in range(nwalkers)])

ndim = len(p0)
print('Number of dimensions to sample:', ndim)

# Now create the emcee Sampler object, giving it our "gmm_logprob" function.
sampler = emcee.EnsembleSampler(nwalkers, ndim, gmm_logprob, args=(K,X))

# Let's have a look at the initial parameter values we're giving the sampler:
utils.plot_gmm_samples(X, K, pp)
plt.title('Gaussian mixture: initial samples');


Number of dimensions to sample: 11

And finally, run the sampler. This will take a little while, since it's going to call our log-posterior function 50,000 times (50 walkers × 1,000 steps).


In [6]:
# Cavalierly ignore some numerical warnings (overflow/underflow because
# we're working in probability rather than log-prob space in gmm_logprob)
np.seterr(all='ignore')
# During development, you may run this more than once... reset the sampler state.
sampler.reset()
# Run the sampler!  This will take 10 seconds or more.
pos,prob,state = sampler.run_mcmc(pp, 1000)

In [7]:
# What do the samples in our final step look like?
utils.plot_gmm_samples(X, K, pos)
plt.title('Gaussian mixture: final samples');



In [8]:
# Let's explore the emcee results a bit.  First, look at how the log-posterior
# values vary over time.
plt.clf()
plt.plot(sampler.lnprobability.T, 'k-', alpha=0.1);


From this, it looks like the chains are pretty well converged after about 600 iterations.
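
As another quick check, emcee also records the walkers' acceptance fraction; a mean value somewhere around 0.2-0.5 is generally considered reasonable:

print('Mean acceptance fraction: %.3f' % np.mean(sampler.acceptance_fraction))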

Next, let's look at what the individual parameter tracks look like.


In [9]:
nparams = len(p0)
param_names = ['log amp2', 'x1','y1','x2','y2', 'xx1','xy1','yy1', 'xx2','xy2','yy2']
for i in range(nparams):
    plt.clf()
    plt.plot(sampler.chain[:,:,i].T, 'k-', alpha=0.1)
    plt.title(param_names[i])
    plt.xlabel('emcee step')
    plt.show()
    # Don't really need to look at them all...
    if i == 2:
        break



In [10]:
import triangle
# How many burn-in steps to drop.
nburn = 600
# Drop the first 'nburn' steps from each walker, then flatten the chain.
keepchain = sampler.chain[:, nburn:, :].reshape((-1, ndim))
# Uncomment this to make a gigantic corner plot!
#triangle.corner(keepchain, labels=param_names);
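
With the burn-in dropped, we can also report simple numerical summaries of the posterior; here's a minimal sketch using the 16th/50th/84th percentiles of each parameter:

lo, med, hi = np.percentile(keepchain, [16, 50, 84], axis=0)
for name, l, m, h in zip(param_names, lo, med, hi):
    print('%-9s  %8.3f  (+%.3f / -%.3f)' % (name, m, h - m, m - l))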

Alright, now that we've seen how to learn a mixture model using MCMC sampling, let's build a non-Gaussian mixture model.


In [5]:
# Here's our foreground-background synthetic dataset:
xlo,xhi = 0,10
ylo,yhi = 0,8
# Here are the real objects in our sample:
amp_true = 3
offset_true = 4
Nf = 200
sf = 0.5
xf = np.linspace(xlo, xhi, Nf)
yf = offset_true + amp_true * np.sin(xf) + np.random.normal(scale=sf, size=Nf)
# And here are the background objects
Nb = 400
xb = np.random.uniform(xlo, xhi, size=Nb)
yb = np.random.uniform(ylo, yhi, size=Nb)
Xi = [np.vstack((xf,yf)).T, np.vstack((xb,yb)).T]
X = np.vstack(Xi)
sigmas = np.zeros(Nf+Nb) + sf

fg_true = Nf / float(Nf + Nb)

plt.clf()
for i,x in enumerate(Xi):
    plt.plot(x[:,0], x[:,1], 'o', color=utils.colors[i], ms=3)
plt.title('Foreground-background data');
plt.axis([xlo,xhi,ylo,yhi])
plt.show();


Here, our foreground model is Gaussian scatter around a sinusoid in x. The background is uniform.

We're going to assume no scatter in the x coordinates, and that our errors in y are known.

We're going to infer the amplitude and offset of the sinusoid.

The first step here, as usual, will be to write out our log-posterior function.
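
In the notation of the code, each data point's likelihood is a two-component mixture,

    p(y_i) = fg * N(y_i | offset + amp*sin(x_i), sigma_i^2)  +  (1 - fg) / (yhi - ylo),

and the log-posterior is the sum over points of log p(y_i), plus a (flat) log-prior.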


In [6]:
# Log posterior probability for our sinusoidal foreground-background model.
def sinusoid_logprob(params, X, sigmas, ylo, yhi):

    # Unpack the parameter vector
    fg, offset, amp = params

    # If the foreground fraction is outside the range [0,1], return -inf
    # (zero prior probability).
    if not 0 <= fg <= 1:
        return -np.inf
    
    # Pull out the 'x' and 'y' coordinates of the data
    x = X[:,0]
    y = X[:,1]

    # The foreground likelihood is a Gaussian around the predicted y.
    ypred = offset + amp * np.sin(x)
    likelihood = fg * utils.gaussian_probability_1d(y, ypred, sigmas**2)

    # The background likelihood is just a constant (uniform in y).
    # To be normalized, that constant must integrate to 1 over the
    # y range, so its value is 1/(yhi - ylo).
    likelihood += (1 - fg) * 1/(yhi - ylo)
    
    # Bad Bayesian!  No cookie!  (Improper flat priors again, so log-prior = 0.)
    logposterior = np.sum(np.log(likelihood)) + 0.
    
    return logposterior
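
As a quick sanity check (an addition here, not part of the original analysis), we can confirm that the true parameters score a higher log-posterior than an "everything is background" model with the same sinusoid:

p_true = np.array([fg_true, offset_true, amp_true])
p_bg = np.array([0., offset_true, amp_true])   # fg = 0: every point is background
print('log-posterior at truth:         %.1f' % sinusoid_logprob(p_true, X, sigmas, ylo, yhi))
print('log-posterior, all background:  %.1f' % sinusoid_logprob(p_bg, X, sigmas, ylo, yhi))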

Now let's set up the emcee sampler...


In [7]:
# If we haven't already...
import emcee

# Initialization...
nwalkers = 50

# Foreground weight
fg = 0.5
# Sinusoidal offset
offset = 3.
# Sinusoidal amplitude
amp = (yhi - ylo)/4.

# Parameter vector
p0 = np.array([fg, offset, amp])

# Scatter initial walker positions around p0.
pp = np.vstack([p0 + np.random.normal(scale=0.1, size=len(p0))
                for i in range(nwalkers)])
ndim = len(p0)
print('Number of dimensions to sample:', ndim)

# Now create the emcee Sampler object, giving it our "sinusoid_logprob" function.
sampler = emcee.EnsembleSampler(nwalkers, ndim, sinusoid_logprob, args=(X, sigmas, ylo, yhi))

# Let's have a look at the initial parameter values we're giving the sampler:
utils.plot_sinusoid_samples(Xi, xf, pp)
plt.axis([xlo,xhi,ylo,yhi])
plt.title('Sinusoid: initial samples');


Number of dimensions to sample: 3

In [8]:
# Cavalierly ignore some numerical warnings (overflow/underflow because
# we're working in probability rather than log-prob space in sinusoid_logprob)
np.seterr(all='ignore')
# During development, you may run this more than once... reset the sampler state.
sampler.reset()
# Run the sampler!  This will take a few seconds to run.
pos,logprob,state = sampler.run_mcmc(pp, 1000)

In [9]:
utils.plot_sinusoid_samples(Xi, xf, pos)
plt.axis([xlo,xhi,ylo,yhi])
plt.title('Sinusoid: final samples');



In [10]:
# How do the parameters we infer compare with reality?
for i,(truth,name,lo,hi) in enumerate([
    (fg_true,'Foreground fraction', 0, 1),
    (offset_true, 'Sinusoidal offset', 3, 5),
    (amp_true, 'Sinusoidal amplitude', 2, 4)]):
    plt.clf()
    plt.hist(pos[:,i], 40, range=(lo,hi), histtype='step')
    plt.axvline(truth, color='k', lw=2, linestyle='--');
    plt.title(name);
    plt.show();
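
The histograms above use only the final walker positions (just nwalkers = 50 samples per parameter). A fuller summary would use the whole post-burn-in chain; a minimal sketch, assuming roughly 200 burn-in steps here:

nburn = 200
keep = sampler.chain[:, nburn:, :].reshape((-1, ndim))
for i, name in enumerate(['Foreground fraction', 'Sinusoidal offset', 'Sinusoidal amplitude']):
    lo, med, hi = np.percentile(keep[:, i], [16, 50, 84])
    print('%-20s  %6.3f  (+%.3f / -%.3f)' % (name, med, hi - med, med - lo))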



In [ ]: