Next, we will use a mixture model to handle a realistic data analysis task: We have some interlopers in our sample, and we can't cleanly separate them from our objects of interest, so we want to model them as a background signal while we also infer properties of our sample.
Before we jump into this, however, let's write a different version of the Gaussian Mixture Model code that uses sampling rather than the Expectation-Maximization algorithm. Doing this will make it easier to see how to adapt the mixture-model approach to non-Gaussian models.
In [16]:
# Boilerplate setup!
from __future__ import print_function
%matplotlib inline
import numpy as np
import pylab as plt
# Some utility code for this notebook
import utils
To start out, we'll use the same dataset as the E-M notebook:
In [17]:
(amps,means,covs),ax = utils.get_clusters_D()
N = 1000
Xi = utils.sample_clusters(amps, means, covs, N=N)
X = np.vstack(Xi)
# Plot the true cluster memberships
plt.clf()
for i,x in enumerate(Xi):
    plt.plot(x[:,0], x[:,1], 'o', color=utils.colors[i])
plt.title('True cluster labels');
plt.show();
Now, instead of using Expectation-Maximization (EM), which was fairly fast and easy, we're going to treat this as a general model from which we want to sample. One could also use a non-linear optimizer to get just the maximum a posteriori (MAP) answer.
We're going to use the emcee sampler here, which means our first task is to define a log-posterior function (here, just the log-likelihood plus a flat log-prior) for our observed data given the parameters of the mixture model.
For K components in D dimensions, the mixture model has (K-1) free amplitudes (mixture strengths, i.e. the fraction of points drawn from each component), K×D means, and K×D×(D+1)/2 covariance terms. For our K=2, D=2 case that works out to 1 + 4 + 6 = 11 parameters.
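Written out, the quantity our log-probability function needs to compute (up to the flat prior) is the standard mixture log-likelihood,

$$ \log p(X \mid \{a_k, \mu_k, \Sigma_k\}) \;=\; \sum_{i=1}^{N} \log \sum_{k=1}^{K} a_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} a_k = 1, $$

where the a_k are the mixing amplitudes and the mu_k, Sigma_k are the component means and covariances.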
In [28]:
# Gaussian Mixture Models via sampling using emcee
import emcee
# We need to define the log-posterior probability of drawing our data
# from the Gaussian mixture model described by a given set of parameters.
# The 'emcee' sampler is going to call this function many times with different
# parameter values.
#
# YOU NEED TO FILL IN SOME CODE BELOW!
#
def gmm_logprob(params, K, X):
    '''
    Compute the log of the posterior probability of the given parameter
    values, given data X.
    '''
    N,D = X.shape
    # 'params' is a numpy array... unpack it into separate amplitude, mean,
    # and covariance arrays.
    amps, means, covs = utils.unpack_gmm_params(params, K, D)
    # Now we have to compute the likelihoods, by evaluating the
    # weighted Gaussians.
    prob = np.zeros(N)
    for k in range(K):
        #
        # ADD YOUR CODE HERE...
        #
        # Note that 'amps[k]' is the mixing amplitude of the k'th component,
        # means[k,d] is the mean of component k in dimension d,
        # and covs[k,d1,d2] is the covariance matrix of component k.
        # Also note that there is a handy function in the 'utils' module:
        #   p_k = gaussian_probability(X, means_k, covs_k)
        #
        # The line below is wrong -- it's just a placeholder so the notebook runs!
        prob += np.random.uniform(size=N)
    # Total log-likelihood:
    loglikelihood = np.sum(np.log(prob))
    # We're being bad Bayesians here, using improper flat priors on everything.
    # Tsk tsk!
    logprior = 0.
    logposterior = loglikelihood + logprior
    return logposterior
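If you get stuck, here is one possible fill-in for the loop body above. It is a sketch, not necessarily the intended solution, and it assumes (as the hint suggests) that utils.gaussian_probability(X, mean, cov) returns an array of per-point densities for a single Gaussian component:

for k in range(K):
    # Accumulate the density of component k, weighted by its mixing amplitude.
    prob += amps[k] * utils.gaussian_probability(X, means[k], covs[k])

This replaces the random placeholder line inside the loop.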
In [29]:
# Set up for running emcee:
K = 2
D = 2
nwalkers = 50
# Initialization: set the amplitudes equal
amps = np.ones(K)
# Start with the mean of the whole dataset as the initial mean values
mn = np.mean(X, axis=0)
means = np.array([mn for k in range(K)])
# Hack --- separate the initial mean estimates in the x direction...
means[:,0] += np.linspace(-K, K, K)
# Set the covariances to the identity... it would be better to use the
# sample variance, but this works fine, just takes a little longer to burn in.
covs = np.array([np.eye(D) for k in range(K)])
In [30]:
# To use emcee, we need to pack all of our parameters into a single vector.
# There is a utility function for that...
# initial parameter vector p0:
p0 = utils.pack_gmm_params(amps, means, covs)
# Now, to initialize emcee we need to randomly place each walker at some
# position in the parameter space. Typically one just scatters them around
# an initial point. It's okay to scatter them in a tiny ball, it will just take
# a bit longer to converge.
pp = np.vstack([p0 + np.random.normal(scale=0.1, size=len(p0))
for i in range(nwalkers)])
ndim = len(p0)
print('Number of dimensions to sample:', ndim)
# Now create the emcee Sampler object, giving it our "gmm_logprob" function.
sampler = emcee.EnsembleSampler(nwalkers, ndim, gmm_logprob, args=(K,X))
# Let's have a look at the initial parameter values we're giving the sampler:
utils.plot_gmm_samples(X, K, pp)
plt.title('Gaussian mixture: emcee walker initialization');
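As an aside, the MAP estimate mentioned earlier could be obtained with a general-purpose optimizer instead of (or before) sampling. A minimal sketch, assuming gmm_logprob has been filled in (scipy is not otherwise used in this notebook):

import scipy.optimize
# Minimize the negative log-posterior, starting from p0 (gradient-free method).
res = scipy.optimize.minimize(lambda p: -gmm_logprob(p, K, X), p0, method='Nelder-Mead')
print('Approximate MAP parameters:', res.x)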
And finally, run the sampler. This will take a little while, since it calls our log-posterior function 50,000 times (50 walkers × 1000 steps).
In [31]:
# Cavalierly ignore some numerical warnings (overflow/underflow because
# we're working in probability rather than log-prob space in gmm_logprob)
np.seterr(all='ignore')
# During development, you may run this more than once... reset the sampler state.
sampler.reset()
# Run the sampler! This will take 10 seconds or more.
pos,prob,state = sampler.run_mcmc(pp, 1000)
In [33]:
# What do the samples in our final step look like?
utils.plot_gmm_samples(X, K, pos)
plt.title('Gaussian mixture: emcee final samples');
In [34]:
# Let's explore the emcee results a bit. First, look at how the log-posterior
# values vary over time.
plt.clf()
plt.plot(sampler.lnprobability.T, 'k-', alpha=0.1);
If your log-probability function is right, the chains should look pretty well converged after about 600 iterations.
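Another quick health check is the walkers' mean acceptance fraction; as a rule of thumb, values roughly between 0.2 and 0.5 indicate a reasonably well-behaved run.

# Mean acceptance fraction across all walkers.
print('Mean acceptance fraction:', np.mean(sampler.acceptance_fraction))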
Next, let's look at what the individual parameter tracks look like.
In [35]:
nparams = len(p0)
param_names = ['log amp2', 'x1','y1','x2','y2', 'xx1','xy1','yy1', 'xx2','xy2','yy2']
for i in range(nparams):
    plt.clf()
    plt.plot(sampler.chain[:,:,i].T, 'k-', alpha=0.1)
    plt.title(param_names[i])
    plt.xlabel('emcee step')
    plt.show()
    # Don't really need to look at them all...
    if i == 2:
        break
In [10]:
import triangle
# How many burn-in steps to drop.
nburn = 600
# Drop the first 'nburn' steps from each walker's chain, then flatten.
keepchain = sampler.chain[:, nburn:, :].reshape((-1, ndim))
# Uncomment this to make a gigantic corner plot!
#triangle.corner(keepchain, labels=param_names);
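Even without the corner plot, we can summarize each marginal posterior numerically, e.g. with its median and a central 68% interval. A quick sketch using the post-burn-in samples:

# Median and central 68% interval of each parameter, after burn-in.
for i, name in enumerate(param_names):
    p16, p50, p84 = np.percentile(keepchain[:, i], [16, 50, 84])
    print('%10s: %.3f (+%.3f / -%.3f)' % (name, p50, p84 - p50, p50 - p16))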
Alright, now that we've seen how to learn a mixture model using MCMC sampling, let's build a non-Gaussian mixture model.
In [36]:
# Here's our foreground-background synthetic dataset:
xlo,xhi = 0,10
ylo,yhi = 0,8
# Here are the real objects in our sample:
amp_true = 3
offset_true = 4
Nf = 200
sf = 0.5
xf = np.linspace(xlo, xhi, Nf)
yf = offset_true + amp_true * np.sin(xf) + np.random.normal(scale=sf, size=Nf)
# And here are the background objects
Nb = 400
xb = np.random.uniform(xlo, xhi, size=Nb)
yb = np.random.uniform(ylo, yhi, size=Nb)
Xi = [np.vstack((xf,yf)).T, np.vstack((xb,yb)).T]
X = np.vstack(Xi)
sigmas = np.zeros(Nf+Nb) + sf
fg_true = Nf / float(Nf + Nb)
plt.clf()
for i,x in enumerate(Xi):
    plt.plot(x[:,0], x[:,1], 'o', color=utils.colors[i], ms=3)
plt.title('Foreground-background data');
plt.axis([xlo,xhi,ylo,yhi])
plt.show();
Here, our foreground model is Gaussian scatter around a sinusoid in x. The background is uniform.
We're going to assume no scatter in the x coordinates, and that our errors in y are known.
We're going to infer the amplitude and offset of the sinusoid, along with the fraction of points belonging to the foreground.
The first step here, as usual, will be to write out our log-posterior function.
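Concretely, the per-point likelihood is a two-component mixture of a Gaussian around the sinusoid and a uniform background,

$$ p(y_i \mid x_i, f, b, A) \;=\; f \, \mathcal{N}\!\left(y_i \mid b + A \sin x_i, \; \sigma_i^2\right) \;+\; (1 - f)\,\frac{1}{y_{\rm hi} - y_{\rm lo}}, $$

where f is the foreground fraction, b the offset, and A the amplitude. The log-posterior is the sum over points of the log of this, plus our (flat) log-prior.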
In [37]:
# Log posterior probability for our sinusoidal foreground-background model.
#
# YOU NEED TO ADD SOME CODE BELOW!
#
def sinusoid_logprob(params, X, sigmas, ylo, yhi):
    # Unpack the parameter vector
    fg, offset, amp = params
    # If the foreground fraction is outside the range [0,1], return
    # -inf (zero posterior probability).
    if not 0 <= fg <= 1:
        return -np.inf
    # Pull out the 'x' and 'y' coordinates of the data
    x = X[:,0]
    y = X[:,1]
    # The foreground likelihood is a Gaussian around the predicted y.
    ## ADD SOME CODE HERE to compute the predicted y coordinate
    # The line below is wrong -- it's just a placeholder!
    ypred = np.random.uniform(ylo, yhi, size=len(y))
    # Now, the foreground likelihood is a Gaussian around the prediction,
    # weighted by the foreground fraction.
    likelihood = fg * utils.gaussian_probability_1d(y, ypred, sigmas**2)
    # The background likelihood is a uniform value, weighted by (1 - fg).
    # To be normalized, the constant must integrate to 1 over the y range,
    # so it equals 1/(yhi - ylo).
    likelihood += (1 - fg) * 1./(yhi - ylo)
    # Bad Bayesian! No cookie!
    logprior = 0.
    logposterior = np.sum(np.log(likelihood)) + logprior
    return logposterior
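If you need a hint: the prediction is just the model sinusoid evaluated at each data point's x coordinate. One possible replacement for the placeholder line:

# The foreground model's predicted y at each x.
ypred = offset + amp * np.sin(x)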
Now let's set up the emcee sampler...
In [45]:
# If we haven't already...
import emcee
# Initialization...
nwalkers = 50
# Foreground weight
fg = 0.5
# Sinusoidal offset
offset = 3.
# Sinusoidal amplitude
amp = (yhi - ylo)/4.
# Parameter vector
p0 = np.array([fg, offset, amp])
# Scatter initial walker positions around p0.
pp = np.vstack([p0 + np.random.normal(scale=0.1, size=len(p0))
for i in range(nwalkers)])
ndim = len(p0)
print('Number of dimensions to sample:', ndim)
# Now create the emcee Sampler object, giving it our "sinusoid_logprob" function.
sampler = emcee.EnsembleSampler(nwalkers, ndim, sinusoid_logprob, args=(X, sigmas, ylo, yhi))
# Let's have a look at the initial parameter values we're giving the sampler:
utils.plot_sinusoid_samples(Xi, xf, pp)
plt.axis([xlo,xhi,ylo,yhi])
plt.title('Sinusoid: emcee walker initialization');
In [46]:
# Cavalierly ignore some numerical warnings (overflow/underflow because
# we're working in probability rather than log-prob space in sinusoid_logprob)
np.seterr(all='ignore')
# During development, you may run this more than once... reset the sampler state.
sampler.reset()
# Run the sampler! This will take a few seconds to run.
pos,logprob,state = sampler.run_mcmc(pp, 1000)
In [47]:
utils.plot_sinusoid_samples(Xi, xf, pos)
plt.axis([xlo,xhi,ylo,yhi])
plt.title('Sinusoid: final emcee samples');
In [51]:
# How do the parameters we infer compare with reality?
for i,(truth,name,lo,hi) in enumerate([
        (fg_true, 'Foreground fraction', 0, 1),
        (offset_true, 'Sinusoidal offset', 3, 5),
        (amp_true, 'Sinusoidal amplitude', 2, 4)]):
    plt.clf()
    # Only draw the histogram if any samples actually fall in the plot range.
    if np.any(np.logical_and(pos[:,i] > lo, pos[:,i] < hi)):
        plt.hist(pos[:,i], 40, range=(lo,hi), histtype='step')
    plt.axvline(truth, color='k', lw=2, linestyle='--');
    plt.title(name);
    plt.show();
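To put numbers on that comparison, you can summarize the post-burn-in chain the same way as before. A quick sketch, assuming a similar ~600-step burn-in for this run:

# Median and central 68% interval for each parameter, next to the true value.
nburn = 600
keep = sampler.chain[:, nburn:, :].reshape((-1, ndim))
for i, (truth, name) in enumerate([(fg_true, 'Foreground fraction'),
                                   (offset_true, 'Sinusoidal offset'),
                                   (amp_true, 'Sinusoidal amplitude')]):
    p16, p50, p84 = np.percentile(keep[:, i], [16, 50, 84])
    print('%s: %.3f (+%.3f / -%.3f)  [true: %.3f]' % (name, p50, p84 - p50, p50 - p16, truth))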
For bonus points, you could try including multiple foreground mixture components!