The rad-ness of notebooks

I use notebooks more often than I use an executable .py script. This is partly because notebooks were my first major introduction to Python, but my continued use comes down to the fact that they let me break the problems I'm solving into separate blocks.

Either install Anaconda (which comes with Jupyter notebooks), or install it yourself with pip install jupyter for Python 2 or pip3 install jupyter for Python 3.

Then it's as simple as typing jupyter notebook into your terminal to launch the application!

What problem are you trying to solve?

A major advantage of notebooks is that you can use Markdown and $\LaTeX$ to incorporate discussion and directions into your code. This will likely make it more readable for another user (or even your future self!).

In this example, we're going to work through fitting a model to data with emcee. The most common demonstration is a linear fit to data with errors, but I'm going to change it up a little to show that you can use it for models other than straight lines. Let's examine how well we can fit the general sinusoid: $$y(t) = A\sin(2\pi ft+ \phi),$$ where $A$ is the amplitude, $f$ is the frequency, $t$ is time, and $\phi$ is the phase.

Setting up your notebook

Commands beginning with '%' are known as magic commands in IPython. Depending on what you'd like to do, there are any number of useful magic commands. By far the most common one I use is %matplotlib inline, which embeds plots directly in the notebook rather than opening them in a new window:


In [1]:
%matplotlib inline
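
There are plenty of other magics. As a quick aside (this cell is purely illustrative and isn't used in the analysis below), the built-in %timeit magic times a snippet over several runs, which is handy when deciding whether a cell is worth optimising:

# aside: %timeit is another built-in IPython magic; it times a snippet
# over several runs and reports the result
import numpy as np
%timeit np.sort(np.random.rand(10000))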

Often I reserve a single notebook cell for all of my imports, similar to how you would normally import all of your libraries at the beginning of the program. I can always come back to add more. This is also usually where I configure the general look I want my plots to have, using the matplotlib.rcParams.update() function.


In [2]:
from __future__ import division

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import emcee

matplotlib.rcParams.update({'font.size': 16,
                            'xtick.major.size': 12, 
                            'ytick.major.size': 12, 
                            'xtick.major.width': 1, 
                            'ytick.major.width': 1, 
                            'ytick.minor.size': 5, 
                            'xtick.minor.size': 5, 
                            'axes.linewidth': 1, 
                            'font.family': 'serif', 
                            'font.serif': 'Times New Roman', 
                            'text.usetex': True})

Generating data

Often I would read data in from a file, but I'll just generate some here for simplicity. Let's define the "true" parameters of our model and attempt to recover them from data with random errors.


In [3]:
A_true = 2
f_true = 0.2
phi_true = np.pi/4.

param_names = ["$A$", "$f$", r"$\phi$"]

Now we can generate some synthetic data. We'll need to add some noise as well.


In [4]:
N = 50
t = np.sort(10*np.random.rand(N))      # random observation times in [0, 10)
t_lin = np.linspace(0, 10, 100)        # dense grid for plotting the true curve
y_true = A_true * np.sin(2*np.pi*f_true*t + phi_true)

# per-point uncertainties, then noisy observations drawn from them
displacement_err = 0.5*np.random.rand(N)*np.sqrt(max(abs(y_true)))

y_alt = y_true + displacement_err*np.random.randn(N)

plt.errorbar(t, y_alt, yerr=displacement_err, fmt=".k")
plt.plot(t_lin, A_true * np.sin(2*np.pi*f_true*t_lin + phi_true), "-k", lw=1, alpha=0.6)
plt.xlabel("$t$")
plt.ylabel("$y$")
plt.tight_layout()
plt.show()
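
As mentioned above, I would usually read this kind of data in from a file rather than generating it. Purely for illustration (the filename and column layout here are made up), that might look something like:

# hypothetical alternative to the cell above: read columns of
# time, displacement, and displacement error from a plain-text file
# ("my_data.txt" is an invented filename for illustration only)
t, y_alt, displacement_err = np.loadtxt("my_data.txt", unpack=True)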


Establishing the model

Now define the likelihood function. This is the probability of our data given our model (including its free parameters). Assuming independent Gaussian uncertainties, the likelihood of a single data point $x_i$ with uncertainty $\sigma_i$, given the model prediction $\mu_i$, is

$$ L_i = \frac{1}{\sqrt{2\pi \sigma_i^2}} \exp \left( -\frac{1}{2} \frac{(x_i - \mu_i)^2}{\sigma_i^2} \right)$$

The individual likelihoods are then multiplied together; equivalently, and more conveniently (it avoids numerical underflow), we can work with the logarithm of the likelihood and sum the log-likelihoods:

$$ \ln(L) = -\frac{1}{2} \sum_i \left( \ln(2\pi) + \ln(\sigma_i^2) + \frac{(x_i - \mu_i)^2}{\sigma_i^2} \right) $$

In [5]:
def lnlike(theta, t, y, yerr):
    A, f, phi = theta
    model = A * np.sin(2*np.pi*f*t + phi)
    inv_sigma2 = 1.0/(yerr**2)
    return -0.5*(np.sum((y-model)**2*inv_sigma2 - np.log(inv_sigma2) + np.log(2.*np.pi)))
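
As a quick sanity check (purely illustrative, and not required for the fit), the log-likelihood evaluated at the true parameters should be finite, and typically much larger than at an obviously wrong guess:

# illustrative sanity check of lnlike: finite at the true parameters,
# and much smaller (more negative) for a clearly wrong parameter set
print(lnlike([A_true, f_true, phi_true], t, y_alt, displacement_err))
print(lnlike([10.0, 1.0, 0.0], t, y_alt, displacement_err))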

We may wish to impose priors on our parameters. The prior represents any information you already know about them. For example, perhaps we are modelling a physical system in which the amplitude cannot be negative, so we want to exclude that region of parameter space from the fit, since it would give an unphysical result. Let us impose the condition $A > 0$; similarly, the frequency must be positive. We also know that the phase can be restricted to $0 \le \phi < 2\pi$.


In [6]:
def lnprior(theta):
    A, f, phi = theta
    # flat (uniform) prior: zero log-probability inside the allowed
    # region, -inf (i.e. zero probability) outside it
    if (0 < phi) and (phi < 2.*np.pi) and (0 < A) and (0 < f):
        return 0.0
    else:
        return -np.inf

We can now define the posterior distribution, which is proportional to the product of the likelihood and the prior. Since we are working with logarithms, this product becomes a sum.


In [7]:
def lnpost(theta, t, y, yerr):
    lnp = lnprior(theta)
    if not np.isfinite(lnp):
        return -np.inf
    return lnp + lnlike(theta, t, y, yerr)
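
Again purely as an illustrative check, the log-posterior should be finite for a sensible parameter set and $-\infty$ wherever the prior forbids the parameters (for example, a negative amplitude):

# illustrative check of lnpost: finite inside the prior bounds,
# -inf for parameters the prior excludes (here, a negative amplitude)
print(lnpost([2.0, 0.2, np.pi/4.], t, y_alt, displacement_err))
print(lnpost([-1.0, 0.2, np.pi/4.], t, y_alt, displacement_err))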

Initialising emcee

The key parameters for emcee (aside from the likelihood defined earlier) are the number of dimensions and the number of walkers. $n_{dim}$ is given by the number of free parameters (in our case, 3), and $n_{walkers}$ is the number of chains we'd like to generate. We'll also need to specify an initial starting position for every walker. A good approach is to pick a sensible estimate and distribute the walkers randomly around this point.


In [8]:
ndim = 3
nwalkers = 100

initial_guess = [2, 0.2, np.pi/4.]

# start each walker in a tiny Gaussian ball around the initial guess
pos = [initial_guess + 1e-4*np.random.randn(ndim) for i in range(nwalkers)]

sampler = emcee.EnsembleSampler(nwalkers, ndim, lnpost, args=(t, y_alt, displacement_err))

Run emcee


In [9]:
sampler.run_mcmc(pos, 5000)

# discard the first 1000 steps of each walker as burn-in, then flatten
# the remaining chain into a single (nsamples x ndim) array
burnin = 1000
samples = sampler.chain[:, burnin:, :].reshape((-1, ndim))
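
The burn-in length of 1000 steps is a judgement call. One way to check it is reasonable (a minimal sketch using the sampler.chain array above; the figure size and styling are arbitrary) is to plot the trace of every walker for each parameter and confirm the chains have settled well before the cut:

# illustrative check of the burn-in choice: plot every walker's trace
# for each parameter; the chains should look stationary before `burnin`
fig, axes = plt.subplots(ndim, 1, figsize=(8, 6), sharex=True)
for i in range(ndim):
    axes[i].plot(sampler.chain[:, :, i].T, color="k", alpha=0.1)
    axes[i].axvline(burnin, color="r", ls="--")
    axes[i].set_ylabel(param_names[i])
axes[-1].set_xlabel("step number")
plt.tight_layout()
plt.show()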

Plotting options

My favourite thing about notebooks is that I no longer have to rerun any of the previous cells in order to change the plots that I'm making. Let's examine two plotting approaches:

Corner


In [10]:
import corner
fig = corner.corner(samples, truths=initial_guess, labels=param_names, verbose=True)


ChainConsumer


In [11]:
from chainconsumer import ChainConsumer

c = ChainConsumer()
c.add_chain(samples, parameters=param_names, name="samples", walkers=nwalkers)
c.configure(statistics='cumulative', flip=True, diagonal_tick_labels=False)
fig = c.plot(figsize=2.5, truth=initial_guess)


Similarly, now that I've created the ChainConsumer instance, I can ask for statistics in a new cell without rerunning the plot!

Gelman-Rubin Statistic

This is a measure of chain convergence: it compares the variance within each chain to the variance between chains, and values close to 1 indicate that the chains have converged.


In [12]:
print(c.diagnostic_gelman_rubin())


Gelman-Rubin Statistic values for chain samples
$A$: 0.99999 (Passed)
$f$: 0.99988 (Passed)
$\phi$: 1.00016 (Passed)
True

Parameter correlation

We can ask for a Python array of the correlations, or a LaTeX table that could go straight into a paper!


In [13]:
pythonarr = c.get_correlations()
latextab = c.get_correlation_table()

In [14]:
print(pythonarr)


(['$A$', '$f$', '$\\phi$'], array([[ 1.        , -0.53267615,  0.54624959],
       [-0.53267615,  1.        , -0.97461997],
       [ 0.54624959, -0.97461997,  1.        ]]))

In [15]:
print(latextab)


\begin{table}
    \centering
    \caption{Parameter Correlations}
    \label{tab:parameter_correlations}
    \begin{tabular}{c|ccc}
         & $A$ & $f$ & $\phi$\\ 
        \hline
           $A$ &  1.00 & -0.53 &  0.55 \\ 
           $f$ & -0.53 &  1.00 & -0.97 \\ 
        $\phi$ &  0.55 & -0.97 &  1.00 \\ 
        \hline
    \end{tabular}
\end{table}

That's really all there is!

You should think of notebooks as an option for documenting analysis code and sharing your work with others. They're also fantastic for data exploration and for generating plots, especially when the data you want to plot takes time to produce.

Other ideas

  1. Interactive data processing pipelines, where you might wish to examine specific plots and make changes before continuing
  2. Run a coding demonstration or workshop (notebooks support Julia, Python and R!)
  3. Keep a research journal