In [1]:
%matplotlib inline
%config IPython.matplotlib.backend = 'retina'
%config InlineBackend.figure_format = 'retina'
from IPython.display import HTML
In [2]:
from traitlets.config.manager import BaseJSONConfigManager
path = "/Users/bl/.jupyter/nbconfig"
cm = BaseJSONConfigManager(config_dir=path)
cm.update('livereveal', {
'theme': 'simple', 'start_slideshow_at': 'selected',
'width': 800, 'height': 800, 'margin': 0,
})
Out[2]:
This is a tutorial session.
Don't be like this:
In [3]:
HTML('<img src="./monkey.gif" width=600>')
Out[3]:
Play with the code! Try and do the exercises.
Please interrupt me if you are lost or if you disagree with what I say.
All questions are welcome, especially the ones that you find "simple", because 1) they are probably not simple, 2) other people are probably wondering the same thing, and 3) they are often the most relevant contributions.
If you haven't already, install these packages using conda and/or pip:
conda install numpy scipy pandas matplotlib jupyter pip
pip install emcee corner
Then start a Jupyter kernel (jupyter notebook) and open a copy of this notebook.
In [4]:
import matplotlib
import matplotlib.pyplot as plt
from cycler import cycler
matplotlib.rc("font", family="serif", size=14)
matplotlib.rc("figure", figsize="10, 5")
colors = ['k', 'c', 'm', 'y']
matplotlib.rc('axes', prop_cycle=cycler("color", colors))
import scipy.optimize
import numpy as np
In [5]:
HTML('<img src="hoggmograph.gif" width=700>')
Out[5]:
This notebook covers the following topics:
Fitting a line to data with y errors.
Basics of Bayesian inference and MCMC: gridding, rejection sampling, Metropolis-Hastings, convergence, etc.
Fitting a line to data with x and y errors. Marginalization of latent variables.
Hamiltonian Monte Carlo for high-dimensional inference. Fitting multiple lines to data (multi-component models). Nested sampling for multimodal solutions. Not covered here: see ixkael.com
We can put more weight on some of these topics depending on your interests.
See Hogg, Bovy and Lang (2010): https://arxiv.org/abs/1008.4686
Let's generate a model:
In [6]:
ncomponents = 1
slopes_true = np.random.uniform(0, 1, ncomponents)
intercepts_true = np.random.uniform(0, 1, ncomponents)
component_fractionalprobs = np.random.dirichlet(np.arange(1., ncomponents+1.))
print('Slopes:', slopes_true)
print('Intercepts:', intercepts_true)
print('Fractional probabilities:', component_fractionalprobs)
# This notebook is ready for you to play with 2+ components and more complicated models.
Let's generate some data drawn from that model:
In [7]:
ndatapoints = 20
xis_true = np.random.uniform(0, 1, ndatapoints)
x_grid = np.linspace(0, 1, 100)
numberpercomponent = np.random.multinomial(ndatapoints, component_fractionalprobs)
print('Number of objects per component:', numberpercomponent)
allocations = np.concatenate([np.repeat(i, nb).astype(int)
for i, nb in enumerate(numberpercomponent)])
np.random.shuffle(allocations)
print('Component allocations:', allocations)
def model_linear(xs, slope, intercept): return xs * slope + intercept
yis_true = model_linear(xis_true, slopes_true[allocations], intercepts_true[allocations])
In [8]:
sigma_yis = np.repeat(0.1, ndatapoints) * np.random.uniform(0.5, 2.0, ndatapoints)
yis_noisy = yis_true + np.random.randn(ndatapoints) * sigma_yis
In [9]:
y_min, y_max = np.min(yis_noisy - sigma_yis), np.max(yis_noisy + sigma_yis)
for i in range(ncomponents):
    y_min = np.min([y_min, np.min(model_linear(x_grid, slopes_true[i], intercepts_true[i]))])
    y_max = np.max([y_max, np.max(model_linear(x_grid, slopes_true[i], intercepts_true[i]))])
In [10]:
for i in range(ncomponents):
    plt.plot(x_grid, model_linear(x_grid, slopes_true[i], intercepts_true[i]), c=colors[i])
    ind = allocations == i
    plt.errorbar(xis_true[ind], yis_noisy[ind], sigma_yis[ind], fmt='o', c=colors[i])
plt.xlabel('$x$'); plt.ylabel('$y$'); plt.ylim([y_min, y_max])
Out[10]:
We are going to pretend we don't know the true model.
Forget what you saw (please).
Here is the noisy data to be analyzed. Can you (mentally) fit a line through it?
In [11]:
plt.errorbar(xis_true, yis_noisy, sigma_yis, fmt='o')
plt.xlabel('$x$'); plt.ylabel('$y$'); plt.ylim([y_min, y_max])
Out[11]:
Let's define a loss/cost function: the total weighted squared error, also called chi-squared: $$ \chi^2 = \sum_i \left( \frac{ \hat{y}_i - y_i^\mathrm{mod}(x_i, s, m) }{\sigma_i} \right)^2 $$
In [12]:
def loss(observed_yis, yi_uncertainties, model_yis):
    scaled_differences = (observed_yis - model_yis) / yi_uncertainties
    return np.sum(scaled_differences**2, axis=0)
We want to minimize this chi-squared to obtain the best possible fit to the data.
Let us look at the fit for a couple of (random) sets of parameters.
In [13]:
random_slopes = np.array([0.25, 0.25, 0.75, 0.75])
random_intercepts = np.array([0.25, 0.75, 0.25, 0.75])
In [14]:
random_slopes = np.random.uniform(0, 1, 4)
random_intercepts = np.random.uniform(0, 1, 4)
In [15]:
fig, axs = plt.subplots(1, 2, sharex=True)
axs[0].errorbar(xis_true, yis_noisy, sigma_yis, fmt='o')
for i, (slope, intercept) in enumerate(zip(random_slopes, random_intercepts)):
    axs[0].plot(x_grid, model_linear(x_grid, slope, intercept), c=colors[i])
    axs[1].scatter(slope, intercept, marker='x', c=colors[i])
    chi2 = loss(yis_noisy[:, None], sigma_yis[:, None],
                model_linear(xis_true[:, None], slope, intercept))
    axs[1].text(slope, intercept+0.05, '$\chi^2 = %.1f$' % chi2,
                horizontalalignment='center')
axs[0].set_xlabel('$x$'); axs[0].set_ylabel('$y$')
axs[0].set_ylim([0, y_max]); axs[1].set_ylim([0, 1]);
axs[1].set_xlabel('slope'); axs[1].set_ylabel('intercept')
fig.tight_layout()
Let us try a brute-force search, and grid our 2D parameter space.
EXERCISE
Create a 100 x 100 grid covering our parameter space.
Evaluate the loss function on the grid, and plot exp(-0.5*loss).
Also find the point that has the minimal loss value.
In [16]:
# SOLUTION
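# One possible sketch of a solution (the names slope_grid, intercept_grid, loss_grid,
# slope_ml and intercept_ml are assumed by the plotting cells below).
n_grid = 100
slope_grid, intercept_grid = np.meshgrid(np.linspace(0, 1, n_grid),
                                         np.linspace(0, 1, n_grid))
# Evaluate the chi-squared loss at every grid point (a plain loop, kept simple for clarity).
loss_grid = np.zeros_like(slope_grid)
for j in range(n_grid):
    for k in range(n_grid):
        loss_grid[j, k] = loss(yis_noisy, sigma_yis,
                               model_linear(xis_true, slope_grid[j, k], intercept_grid[j, k]))
# Locate the grid point with the minimal loss: the maximum-likelihood point on the grid.
j_min, k_min = np.unravel_index(np.argmin(loss_grid), loss_grid.shape)
slope_ml, intercept_ml = slope_grid[j_min, k_min], intercept_grid[j_min, k_min]
print('Grid maximum-likelihood solution (slope, intercept):', slope_ml, intercept_ml)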
In [17]:
fig, axs = plt.subplots(1, 2, sharex=False, sharey=False)
axs[0].errorbar(xis_true, yis_noisy, sigma_yis, fmt='o')
axs[0].plot(x_grid, model_linear(x_grid, slope_ml, intercept_ml))
axs[0].set_xlabel('$x$'); axs[0].set_ylabel('$y$')
axs[0].set_ylim([y_min, y_max])
axs[1].set_xlabel('slope'); axs[1].set_ylabel('intercept')
axs[1].axvline(slope_ml, c=colors[1]); axs[1].axhline(intercept_ml, c=colors[1])
axs[1].pcolormesh(slope_grid, intercept_grid, np.exp(-0.5*loss_grid), cmap='ocean_r')
fig.tight_layout()
Why visualize $\exp(-\frac{1}{2}\chi^2)$ and not simply the $\chi^2$?
Because the former is proportional to our likelihood:
$$\begin{align} p(D| P, M) &= p(\{ \hat{y}_i \} \vert \{\sigma_i, x_i\}, \textrm{intercept}, \textrm{slope}) \\ &= \prod_{i=1}^{N} p(\hat{y}_i \vert x_i, \sigma_i, b, m)\\ &= \prod_{i=1}^{N} \mathcal{N}\left(\hat{y}_i - y^\mathrm{mod}(x_i; m, b); \sigma^2_i \right) \ = \prod_{i=1}^{N} \mathcal{N}\left(\hat{y}_i - m x_i - b; \sigma^2_i \right) \\ &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma_i}\exp\left( - \frac{1}{2} \frac{(\hat{y}_i - m x_i - b)^2}{\sigma^2_i} \right) \\ &\propto \ \exp\left( - \sum_{i=1}^{N} \frac{1}{2} \frac{(\hat{y}_i - m x_i - b)^2}{\sigma^2_i} \right) \ = \ \exp\left(-\frac{1}{2}\chi^2\right) \end{align} $$
This follows because the data points are independent and the noise is Gaussian.
Let's visualize the $\chi^2$ for individual objects
In [18]:
model_yis = model_linear(xis_true, slope_ml, intercept_ml)
object_chi2s = 0.5*((yis_noisy - model_yis) / sigma_yis)**2
In [19]:
fig, ax = plt.subplots(1, 1)
ax.plot(x_grid, model_linear(x_grid, slope_ml, intercept_ml))
v = ax.scatter(xis_true, yis_noisy, c=object_chi2s, cmap='coolwarm', zorder=0)
ax.errorbar(xis_true, yis_noisy, sigma_yis, fmt='o', zorder=-1)
ax.set_xlabel('$x$'); ax.set_ylabel('$y$'); ax.set_ylim([y_min, y_max])
plt.colorbar(v); fig.tight_layout()
Digression
Is a line a good model?
Should we be aiming at maximizing the likelihood only?
Here is a danger of Maximum Likelihood: there is always a model that perfectly fits all of the data.
This model does not have to be complicated...
EXERCISE (5 min): can you write a very flexible model that fits the data perfectly, i.e. goes through every single point? What $\chi^2$ does it lead to?
NOTE: this might not be trivial, so just look for a model that goes through most of the data points.
HINT: numpy has good infrastructure for constructing and fitting polynomials (try ?np.polyfit).
If you pick a more complicated model you might need to use scipy.optimize.minimize.
In [20]:
# SOLUTION
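# One possible sketch: a high-degree polynomial can pass through (almost) every data point.
# np.polyfit of degree ndatapoints-1 interpolates the points (numpy may warn that the fit
# is poorly conditioned); np.poly1d turns the coefficients into a callable, which the
# plotting cell below assumes under the name bestfit_polynomial.
degree = ndatapoints - 1
coefficients = np.polyfit(xis_true, yis_noisy, degree)
bestfit_polynomial = np.poly1d(coefficients)
# The resulting chi-squared is essentially zero: a "perfect" but meaningless fit.
print('chi2 of the polynomial fit:', loss(yis_noisy, sigma_yis, bestfit_polynomial(xis_true)))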
In [21]:
plt.plot(x_grid, bestfit_polynomial(x_grid))
plt.errorbar(xis_true, yis_noisy, sigma_yis, fmt='o')
plt.ylim([y_min, y_max])
Out[21]:
In [22]:
HTML('<img src="hoggmograph.gif" width=700>')
# Copyright Daniela Huppenkothen, Astrohackweek 2015 in NYC
Out[22]:
This is Bayes' theorem, with an explicit Model and Fixed parameters conditioned on:
$$p(P | D, M, F) = \frac{p(D | P, M, F)\ p(P | M, F)}{p(D | M, F)}$$
In our case, if we omit the explicit dependence on a linear model:
$$p\bigl(m, s \ \bigl\vert \ \{ \hat{y}_i, \sigma_i, x_i\} \bigr) \ \propto \ p\bigl(\{ \hat{y}_i \} \ \bigl\vert \ m, s, \{\sigma_i, x_i\}\bigr) \ p\bigl(m, s\bigr) \ = \ \exp\bigl(-\frac{1}{2}\chi^2\bigr)\ p\bigl(m, s\bigr) $$
In [23]:
# Let us play with Bayes theorem and pick some un-motivated prior:
prior_grid = np.exp(-slope_grid**-1) * np.exp(-intercept_grid**-1)
likelihood_grid = np.exp(-0.5*loss_grid)
posterior_grid = likelihood_grid * prior_grid
In [24]:
fig, axs = plt.subplots(1, 3)
for i in range(3):
    axs[i].set_ylabel('intercept'); axs[i].set_xlabel('slope');
axs[0].set_title('Prior'); axs[1].set_title('Likelihood'); axs[2].set_title('Posterior')
axs[1].axvline(slope_ml, c=colors[1]); axs[1].axhline(intercept_ml, c=colors[1])
axs[0].pcolormesh(slope_grid, intercept_grid, prior_grid, cmap='ocean_r')
axs[1].pcolormesh(slope_grid, intercept_grid, likelihood_grid, cmap='ocean_r')
axs[2].pcolormesh(slope_grid, intercept_grid, posterior_grid, cmap='ocean_r')
fig.tight_layout()
Discussion: what priors are adequate here?
Three common types of priors are:
Problems with 'gridding': the number of likelihood evaluations, the resolution of the grid, etc.
In [25]:
fig, ax = plt.subplots(1, 1, figsize=(5, 4))
ax.set_xlabel('slope'); ax.set_ylabel('intercept');
ax.scatter(slope_grid.ravel(), intercept_grid.ravel(), marker='.', s=1)
ax.set_ylim([0, 1])
ax.set_xlim([0, 1])
fig.tight_layout()
print('Number of points/evaluations of the likelihood:', slope_grid.size)
EXERCISE
Write three functions returning:
the log of the likelihood, ln_like(params, args...).
the log of the prior, ln_prior(params, args...).
the log of the posterior, ln_post(params, args...).
The likelihood is pretty much our previous loss function.
The prior should return -np.inf outside of our parameter space of interest. At this stage use a uniform prior in $[0, 1] \times [0, 1]$.
Think about what other priors could be used. Include the correct normalization in the prior and the likelihood if possible.
In [26]:
def ln_like(params, xs, observed_yis, yi_uncertainties):
    model_yis = model_linear(xs, params[0], params[1])
    chi2s = ((observed_yis - model_yis) / yi_uncertainties)**2
    return np.sum(-0.5 * chi2s - 0.5*np.log(2*np.pi) - np.log(yi_uncertainties))

def ln_prior(params):
    if np.any(params < 0) or np.any(params > 1):
        return -np.inf
    return 0.

def ln_post(params, xs, observed_yis, yi_uncertainties):
    lnprior_val = ln_prior(params)
    if not np.isfinite(lnprior_val):
        return lnprior_val
    else:
        lnlike_val = ln_like(params, xs, observed_yis, yi_uncertainties)
        return lnprior_val + lnlike_val
In [27]:
x0 = np.array([0.5, 0.5])
print('Likelihood:', ln_like(x0, xis_true, yis_noisy, sigma_yis))
print('Prior:', ln_prior(x0))
print('Posterior:', ln_post(x0, xis_true, yis_noisy, sigma_yis))
EXERCISE (2 min)
Find the maximum of the log posterior. Try different optimizers in scipy.optimize.minimize. Be careful about the sign of the objective function (is it plus or minus the log posterior?).
In [28]:
# SOLUTION
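# One possible sketch: minimize the negative log posterior (note the sign flip).
# The helper neg_ln_post and the name params_map are illustrative choices, not fixed by the
# notebook. A gradient-free method like Nelder-Mead copes with the hard prior boundaries
# (where the objective is +inf).
def neg_ln_post(params, *args):
    return -ln_post(params, *args)

result = scipy.optimize.minimize(neg_ln_post, x0=np.array([0.5, 0.5]),
                                 args=(xis_true, yis_noisy, sigma_yis),
                                 method='Nelder-Mead')
params_map = result.x
print('MAP solution (slope, intercept):', params_map)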
EXERCISE
Implement rejection sampling. Randomly draw points in our 2D parameter space. Keep each point with a probability proportional to the posterior distribution.
HINT: you will find that you need to normalize the posterior distribution in some way to make the sampling possible. Use the MAP solution we just found!
In [29]:
# SOLUTION
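# One possible sketch of rejection sampling, normalizing by the posterior value at the MAP
# point (params_map is assumed from the previous solution sketch). The plotting cell below
# expects the accepted points in an array called params_drawn.
ln_post_max = ln_post(params_map, xis_true, yis_noisy, sigma_yis)
ndraws = 20000
proposals = np.random.uniform(0, 1, size=(ndraws, 2))       # uniform draws in the unit square
ln_posts = np.array([ln_post(p, xis_true, yis_noisy, sigma_yis) for p in proposals])
# Keep each point with probability posterior(point) / posterior(MAP), which lies in [0, 1].
acceptance_probabilities = np.exp(ln_posts - ln_post_max)
keep = np.random.uniform(0, 1, ndraws) < acceptance_probabilities
params_drawn = proposals[keep, :]
print('Acceptance rate:', params_drawn.shape[0] / ndraws)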
In [30]:
fig, axs = plt.subplots(1, 2, sharex=True, sharey=True)
axs[0].pcolormesh(slope_grid, intercept_grid, likelihood_grid, cmap='ocean_r')
axs[1].hist2d(params_drawn[:, 0], params_drawn[:, 1], 30, cmap="ocean_r");
axs[0].set_title('Gridding'); axs[1].set_title('Rejection sampling');
axs[0].set_xlabel('slope'); axs[0].set_ylabel('intercept'); axs[1].set_xlabel('slope');
The Metropolis-Hastings algorithm
For a given target probability $p(\theta)$ and a (symmetric) proposal density $p(\theta_{i+1}|\theta_i)$, we repeat the following: propose a new point $\theta'$ by drawing from the proposal density centered on the current point $\theta_i$; compute the ratio $r = p(\theta')/p(\theta_i)$; accept the move (set $\theta_{i+1} = \theta'$) with probability $\min(1, r)$, otherwise keep $\theta_{i+1} = \theta_i$.
EXERCISE
Use your implementation of the Metropolis-Hastings algorithm to draw samples from our 2D posterior distribution of interest.
Measure the proportion of parameter draws that are accepted: the acceptance rate.
Plot the chain and visualize the burn-in phase.
Compare the sampling to our previous gridded version.
Estimate the mean and standard deviation of the distribution from the samples. Are they accurate?
In [31]:
# SOLUTION
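# One possible sketch of Metropolis-Hastings with an isotropic Gaussian proposal.
# The step size proposal_scale is an arbitrary choice: tune it and watch the acceptance rate.
# The chain is stored in params_drawn, which the plotting cells below assume.
nsteps = 20000
proposal_scale = 0.05
params_drawn = np.zeros((nsteps, 2))
params_drawn[0, :] = np.array([0.5, 0.5])                   # starting point
current_ln_post = ln_post(params_drawn[0, :], xis_true, yis_noisy, sigma_yis)
naccepted = 0
for t in range(1, nsteps):
    proposal = params_drawn[t-1, :] + proposal_scale * np.random.randn(2)
    proposal_ln_post = ln_post(proposal, xis_true, yis_noisy, sigma_yis)
    # Accept with probability min(1, p(proposal)/p(current)): compare the log ratio to log(u).
    if np.log(np.random.uniform(0, 1)) < proposal_ln_post - current_ln_post:
        params_drawn[t, :] = proposal
        current_ln_post = proposal_ln_post
        naccepted += 1
    else:
        params_drawn[t, :] = params_drawn[t-1, :]
print('Acceptance rate:', naccepted / nsteps)
# Crude estimates, discarding the first half of the chain as burn-in.
print('Mean:', params_drawn[nsteps//2:].mean(axis=0), 'Std:', params_drawn[nsteps//2:].std(axis=0))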
In [32]:
fig, axs = plt.subplots(1, 2, sharex=True, sharey=True)
axs[0].pcolormesh(slope_grid, intercept_grid, likelihood_grid, cmap='ocean_r')
axs[1].hist2d(params_drawn[:, 0], params_drawn[:, 1], 30, cmap="ocean_r");
axs[0].set_xlabel('slope'); axs[0].set_ylabel('intercept'); axs[1].set_xlabel('slope');
In [33]:
fig, ax = plt.subplots(2, sharex=True)
for i in range(2):
    ax[i].plot(params_drawn[:, i]);
MCMC is approximate and is only valid if it has converged. But we can't prove that a chain has converged; we can only show that it hasn't.
What to do? Be paranoid.
It is crucial to 1) run many chains in various setups, 2) check that the results are stable, and 3) look at the auto-correlation:
$$\rho_k = \frac{\mathrm{Cov}[X_t, X_{t+k}]}{\sqrt{\mathrm{Var}[X_t]\,\mathrm{Var}[X_{t+k}]}}$$
See http://rstudio-pubs-static.s3.amazonaws.com/258436_5c7f6f9a84bd47aeaa33ee763e57a531.html and www.astrostatistics.psu.edu/RLectures/diagnosticsMCMC.pdf
EXERCISE
Visualize the chains, the autocorrelation, etc., for short and long chains with different proposal distributions in the Metropolis-Hastings algorithm.
In [34]:
# SOLUTION
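# One possible sketch of a naive O(n * maxlag) empirical autocorrelation estimate.
# The name and signature autocorr_naive(chain, maxlag) are assumed by the plotting cell below.
def autocorr_naive(chain, maxlag):
    centred = chain - np.mean(chain)
    variance = np.var(chain)
    # Average the product of the chain with a lagged copy of itself, for each lag k.
    return np.array([np.mean(centred[:len(chain)-k] * centred[k:]) / variance
                     for k in range(maxlag)])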
In [35]:
for i in range(2):
    plt.plot(autocorr_naive(params_drawn[:, i], 500))
plt.xscale('log'); plt.xlabel('$\Delta$'); plt.ylabel('Autocorrelation');
EXERCISE
Let's use a more advanced sampler. Look at the documentation of the emcee package and use it to (again) draw samples from our 2D posterior distribution of interest. Make 2D plots with both plt.hist2d and plt.contourf. For the latter, add 68% and 95% confidence contours.
In [36]:
# SOLUTION
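# One possible sketch using emcee's EnsembleSampler. The plotting cells below assume the
# sampler object is called sampler; older emcee versions expose the chain as sampler.chain
# with shape (nwalkers, nsteps, ndim), newer versions prefer sampler.get_chain().
import emcee
ndim, nwalkers, nsteps = 2, 20, 2000
# Start the walkers in a small ball inside the unit square.
p0 = np.array([0.5, 0.5]) + 0.01 * np.random.randn(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, ln_post,
                                args=[xis_true, yis_noisy, sigma_yis])
sampler.run_mcmc(p0, nsteps)
flat_samples = sampler.chain.reshape((-1, ndim))   # flattened samples for hist2d/contourf plots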
In [37]:
fig, ax = plt.subplots(2, sharex=True)
for i in range(2):
    ax[i].plot(sampler.chain[:, :, i].T, '-k', alpha=0.2);
In [38]:
fig, axs = plt.subplots(1, 3, sharex=True, sharey=True)
for i in range(axs.size):
    axs[i].errorbar(sampler.chain[:, i, 0], sampler.chain[:, i, 1], fmt="-o", alpha=0.5, c='k');