I use notebooks more often than I use an executable .py script. This is partly because notebooks were my first major introduction to Python, but I keep using them because they let me break the problems I'm solving into separate blocks.
Either install Anaconda (which comes with Jupyter notebooks), or use pip install jupyter for Python 2 or pip3 install jupyter for Python 3.
Then it's as simple as typing jupyter notebook into your terminal to launch the application!
A major advantage of notebooks is that you can use Markdown and $\LaTeX$ to incorporate discussion and directions into your code. This will likely make it more readable for another user (or even your future self!).
In this example, we're going to work through fitting a model to some data with emcee. The most common demonstration is a linear fit to data with errors, but I'm going to change it up a little to show that you can use it for models other than straight lines. Let's examine how well we can fit the general sinusoid:
$$y(t) = A\sin(2\pi ft+ \phi),$$
where $A$ is the amplitude, $f$ is the frequency, $t$ is time, and $\phi$ is the phase.
Commands beginning with '%' are known as magic commands in IPython. Depending on what you'd like to do, there are any number of useful magic commands. By far the most common one I use is %matplotlib inline, which embeds plots directly in the notebook rather than opening them in a new window:
In [1]:
%matplotlib inline
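As an illustration of another magic I find handy, %timeit benchmarks a single statement; the particular statement below is just an arbitrary example:
In [ ]:
# %timeit runs the statement repeatedly and reports how long it takes
%timeit sum(range(100000))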
Often I reserve a single notebook cell for all of my imports, similar to how you would normally import all of your libraries at the beginning of the program. I can always come back to add more. This is also usually where I configure the general look I want my plots to have, using the matplotlib.rcParams.update() function.
In [2]:
from __future__ import division

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import emcee

# global plot styling
matplotlib.rcParams.update({'font.size': 16,
                            'xtick.major.size': 12,
                            'ytick.major.size': 12,
                            'xtick.major.width': 1,
                            'ytick.major.width': 1,
                            'ytick.minor.size': 5,
                            'xtick.minor.size': 5,
                            'axes.linewidth': 1,
                            'font.family': 'serif',
                            'font.serif': 'Times New Roman',
                            'text.usetex': True})
In [3]:
A_true = 2
f_true = 0.2
phi_true = np.pi/4.
param_names = ["$A$", "$f$", r"$\phi$"]
Now we can generate some synthetic data. We'll need to add some noise as well.
In [4]:
N = 50
t = np.sort(10*np.random.rand(N))    # random observation times in [0, 10)
t_lin = np.linspace(0, 10, 100)      # dense grid for plotting the model
y_true = A_true * np.sin(2*np.pi*f_true*t + phi_true)
displacement_err = 0.5*np.random.rand(N)*np.sqrt(max(abs(y_true)))   # random per-point uncertainties
y_alt = y_true + displacement_err*np.random.randn(N)                 # noisy observations
plt.errorbar(t, y_alt, yerr=displacement_err, fmt=".k")
plt.plot(t_lin, A_true * np.sin(2*np.pi*f_true*t_lin + phi_true), "-k", lw=1, alpha=0.6)
plt.xlabel("$t$")
plt.ylabel("$y$")
plt.tight_layout()
plt.show()
Now define the likelihood function. This is the probability of our data given our model (including its free parameters). In practice, we work with the logarithm of the likelihood function.
$$ L_i = \frac{1}{\sqrt{2\pi \sigma_i^2}} \exp \left( -\frac{1}{2} \frac{(x_i - \mu_i)^2}{\sigma_i^2} \right)$$

The individual likelihoods must then be multiplied together; equivalently, their logarithms may be summed:
$$ \ln(L) = -\frac{1}{2} \sum_i \left( \ln(2\pi) + \ln(\sigma_i^2) + \frac{(x_i - \mu_i)^2}{\sigma_i^2} \right) $$
In [5]:
def lnlike(theta, t, y, yerr):
    A, f, phi = theta
    model = A * np.sin(2*np.pi*f*t + phi)   # model prediction at each time
    inv_sigma2 = 1.0/(yerr**2)
    # Gaussian log-likelihood, summed over all data points
    return -0.5*(np.sum((y-model)**2*inv_sigma2 - np.log(inv_sigma2) + np.log(2.*np.pi)))
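As a quick sanity check (assuming the cells above have been run), the log-likelihood evaluated at the true parameters should be noticeably larger than at a poor guess:
In [ ]:
# the true parameters should score better than an arbitrary bad guess
print(lnlike((A_true, f_true, phi_true), t, y_alt, displacement_err))
print(lnlike((0.5, 0.05, 0.0), t, y_alt, displacement_err))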
We may wish to impose priors on our parameters. The prior represents any information you already know about your parameters. For example, perhaps we are modelling some sort of physical system, and we know that the amplitude cannot be negative. Thus, we would wish to exclude any such evaluations of the likelihood, since we know they would give an unphysical result. Let us impose the condition $A>0$. We also know that the phase must satisfy $0 \le \phi < 2\pi$.
In [6]:
def lnprior(theta):
    A, f, phi = theta
    # uniform prior: zero log-probability inside the allowed region, -inf outside
    if (0 < phi) and (phi < 2.*np.pi) and (0 < A) and (0 < f):
        return 0.0
    else:
        return -np.inf
We can now define the posterior distribution, which is the product of the likelihood and the prior.
In [7]:
def lnpost(theta, t, y, yerr):
    lnp = lnprior(theta)
    if not np.isfinite(lnp):
        return -np.inf
    return lnp + lnlike(theta, t, y, yerr)
The key parameters for emcee (aside from the log-posterior defined above) are the number of dimensions and the number of walkers. $n_{dim}$ is given by the number of free parameters (in our case, 3), and $n_{walkers}$ is the number of chains we'd like to generate. We'll also need to specify an initial starting position for every walker. A good approach is to pick a sensible estimate and distribute the walkers randomly around this point.
In [8]:
ndim = 3         # number of free parameters (A, f, phi)
nwalkers = 100   # number of chains
initial_guess = [2, 0.2, np.pi/4.]
# start each walker in a small Gaussian ball around the initial guess
pos = [initial_guess + 1e-4*np.random.randn(ndim) for i in range(nwalkers)]
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnpost, args=(t, y_alt, displacement_err))
In [9]:
sampler.run_mcmc(pos, 5000)
burnin = 1000
samples = sampler.chain[:, burnin:, :].reshape((-1, ndim))
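The burn-in length here is a judgement call. One way to check it's reasonable (a minimal sketch, assuming the sampler above has been run) is to look at the mean acceptance fraction and plot the walker traces for each parameter; the chains should look stationary well before the cut:
In [ ]:
print("Mean acceptance fraction: %.3f" % np.mean(sampler.acceptance_fraction))

# trace of every walker for each parameter, with the burn-in cut marked
fig, axes = plt.subplots(ndim, 1, sharex=True, figsize=(8, 6))
for i in range(ndim):
    axes[i].plot(sampler.chain[:, :, i].T, color="k", alpha=0.1)
    axes[i].axvline(burnin, color="r", ls="--")
    axes[i].set_ylabel(param_names[i])
axes[-1].set_xlabel("step")
plt.tight_layout()
plt.show()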
In [10]:
import corner
fig = corner.corner(samples, truths=initial_guess, labels=param_names, verbose=True)
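The corner plot is great for visualising the posterior, but we often also want numbers to quote. A common convention (a minimal sketch using numpy directly, rather than anything emcee-specific) is the median of each marginalised distribution with the 16th and 84th percentiles as roughly 1-sigma uncertainties:
In [ ]:
# median and 16th/84th percentiles of each marginalised posterior
for i, name in enumerate(param_names):
    low, med, high = np.percentile(samples[:, i], [16, 50, 84])
    print("%s = %.3f (+%.3f / -%.3f)" % (name, med, high - med, med - low))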
In [11]:
from chainconsumer import ChainConsumer
c = ChainConsumer()
c.add_chain(samples, parameters=param_names, name="samples", walkers=nwalkers)
c.configure(statistics='cumulative', flip=True, diagonal_tick_labels=False)
fig = c.plot(figsize=2.5, truth=initial_guess)
Similarly, now that I've defined the ChainConsumer instance, I can ask for statistics in a new cell without rerunning the plot!
In [12]:
print(c.diagnostic_gelman_rubin())
In [13]:
pythonarr = c.get_correlations()
latextab = c.get_correlation_table()
In [14]:
print(pythonarr)
In [15]:
print(latextab)
You should think of notebooks as an option for documenting analysis code and sharing your work with others. They're also fantastic for data exploration and generating plots, especially when the data you want to plot takes time to produce.