This notebook is adapted from a lesson from the 2019 KIPAC/StatisticalMethods course, (c) 2019 Adam Mantz and Phil Marshall, licensed under the GPLv2.

Generative Models and Probabilistic Graphical Models

Goals:

  • Introduce generative models in the context of mocking data and inference
  • Introduce probabilistic graphical models as a tool for model visualization
  • Practice drawing some PGMs

You've already seen Bayes' Theorem and the roles of the prior, sampling, and posterior distributions.

Fully specifying these requires us to write down (or at least approximate) how the data set comes to exist, e.g.

  • physical processes happening out there in the Universe
  • instrumental effects and the measurement process
  • any computations done prior to calling the result a "data set"

A generative model is formally the joint distribution of all our data and model parameters,

$P(\mathrm{data},\mathrm{params}) = P(\mathrm{params}) P(\mathrm{data}|\mathrm{params})$

It encodes the modelling information needed for both inference and creating mock data.

What are generative models useful for?

  • Performing inference: constructing the sampling distribution or likelihood function
  • Testing inference: does our analysis, run on mock data, recover the input model?
  • Checking inferences: do mock data generated from a fitted model resemble the real data?

$P(\mathrm{data},\mathrm{params})$ is an abstract beast, but usually it factorizes in helpful ways.

This is where probabilistic graphical models (PGMs) come in. A PGM is a sketch that encodes the conditional dependences within a generative model.

PGMs are to inference what free-body diagrams are to kinematics. Everyone hates having to draw them, yet everyone makes fewer mistakes when they do.

Let's do a simple example, first specifying a problem in words and then building the PGM.

Here's an image (and a zoom-in):

Our measurement is the number of counts in each pixel. Here is a generative model in words:

  • There's an object emitting light, whose properties are parametrized by $\theta$.
  • From $\theta$, we can determine the average flux falling on a given pixel $k$, $F_k$.
  • Given the exposure time of our observation, $T$ (and some conversion factors), $F_k$ determines the average number of counts expected, $\mu_k$.
  • The number of counts measured, $N_k$, is a Poisson draw, given the average $\mu_k$.

Notice that the model was described in terms of conditional relationships.

  • $\theta \sim$ some prior

For every pixel $k$:

  • $F_k \Leftarrow \theta,k$
  • $\mu_k \Leftarrow F_k,T$
  • $N_k \sim \mathrm{Poisson}(\mu_k)$

NB: $\Leftarrow$ indicates a deterministic dependence, while $\sim$ means "is distributed as".
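In code, this recipe maps onto lines almost one to one. Here is a minimal mock-data sketch; the specific prior, pixel grid, and flux model below (a uniform prior and a Gaussian blob profile, with conversion factors absorbed into $T$) are made-up placeholders, not part of the lesson.

In [ ]:
import numpy as np
import scipy.stats as st

npix = 100   # number of pixels (arbitrary)
T = 1000.0   # exposure time, fixed

theta = st.uniform.rvs(0.5, 1.0)  # theta ~ some prior (here, uniform on [0.5, 1.5])
k = np.arange(npix)
F = theta * np.exp(-0.5 * ((k - 50.0) / 10.0)**2)  # F_k <= theta, k (deterministic)
mu = F * T                                         # mu_k <= F_k, T (deterministic)
N = st.poisson.rvs(mu)                             # N_k ~ Poisson(mu_k)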

The PGM shows most of the same information, visually:

Ingredients of a PGM (sometimes also called a directed acyclic graph):

  • Nodes (dots and circles) represent PDFs for parameters
  • Edges (arrows) represent conditional relationships
  • Plates (rectangles) represent repeated model components whose contents are conditionally independent

Types of nodes:

  • Circles represent a PDF. This parameter is a stochastic ($\sim$) function of the parameters feeding into it.
  • Points represent a delta-function PDF. This parameter is a deterministic ($\Leftarrow$) function of the parameters feeding into it.
  • Double circles (or shading) indicate measured data. They are stochastic ($\sim$) in the context of generating mock data, but fixed in the context of parameter inference.

Q: According to this PGM, how can we factorize $p(\theta,T,\{F_k, \mu_k, N_k\})$?

What does all this imply in the context of

  1. parameter inference?
  2. mock data generation?

The key here is that the PGM shows conditional dependences - therefore, it also shows (by omission) where parameters are conditionally independent.

That feature, plus the directional aspect, means that the PGM is a map to the most logical sequence of steps (lines in code) for either generating mock data or evaluating the posterior density of real data.

Q: How are these PGMs different, and what does the difference mean?

In this case, some PDFs are delta functions, so we can straightforwardly marginalize over such deterministic variables:

$p(\theta,\{N_k\}) = \int dF_k\, d\mu_k\, dT\; p(\theta)\, p(T) \prod_k P(N_k|\mu_k)\, p(\mu_k|F_k,T)\, p(F_k|\theta)$

$\quad = \underbrace{p(\theta)}_{\mathrm{prior}(\theta)} ~ \underbrace{\prod_k P\left(N_k|\mu_k(\theta,T)\right)}_{\mathrm{sampling~distribution~of~}\vec{N}}$

We could have written/drawn everything without explicitly mentioning $F_k$ (or even $\mu_k$). Like all simplifications, this is sometimes helpful and sometimes a pitfall.
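This factorized form is also exactly the quantity a fitting code would evaluate. A minimal sketch, reusing the made-up prior and flux model from the mock-data example above:

In [ ]:
def log_posterior(theta, N, T):
    # prior(theta): the same placeholder uniform prior as before
    lnp = st.uniform.logpdf(theta, 0.5, 1.0)
    # mu_k(theta, T), with the deterministic F_k marginalized out
    k = np.arange(len(N))
    mu = theta * np.exp(-0.5 * ((k - 50.0) / 10.0)**2) * T
    # sampling distribution: sum of log-Poisson terms over pixels
    return lnp + st.poisson.logpmf(N, mu).sum()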

Exercise

  • On your own, write down the probability expressions (factorization of the generative model) illustrated by these two graphs.
  • Then, discuss their meaning with your neighbor, and prepare to report back to the class.

Take-home messages

  • Both simulation of mock data and inference from data require a model for how the Universe (or our computer) generates data.
  • PGMs are a helpful way of visualizing the conditional dependences of a model (how the probability expressions factorize).

Note: the daft Python package can be useful for making PGMs programmatically, though it's no substitute for paper.
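For instance, a minimal daft sketch of the pixel-model PGM above might look like this (the node coordinates and plate rectangle are arbitrary layout choices):

In [ ]:
import daft

pgm = daft.PGM([2.5, 4.0], origin=[0.0, 0.0])
# Circle: stochastic parameter with a prior
pgm.add_node(daft.Node("theta", r"$\theta$", 1.25, 3.5))
# Points: deterministic (delta-function) nodes
pgm.add_node(daft.Node("F", r"$F_k$", 1.25, 2.5, fixed=True))
pgm.add_node(daft.Node("T", r"$T$", 2.1, 1.5, fixed=True))
pgm.add_node(daft.Node("mu", r"$\mu_k$", 1.25, 1.5, fixed=True))
# Double circle / shaded: observed data
pgm.add_node(daft.Node("N", r"$N_k$", 1.25, 0.7, observed=True))
# Edges: conditional dependences
pgm.add_edge("theta", "F")
pgm.add_edge("F", "mu")
pgm.add_edge("T", "mu")
pgm.add_edge("mu", "N")
# Plate: repetition over pixels
pgm.add_plate(daft.Plate([0.6, 0.2, 1.3, 2.8], label=r"pixels $k$"))
pgm.render()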

Exercise: linear regression

Your data is a list of $\{x_k,y_k,\sigma_k\}$ triplets, where $\sigma_k$ is some estimate of the "error" on $y_k$. You think a linear model, $y(x)=a+bx$, might explain these data. To start exploring this idea, you decide to generate some simulated data to compare with your real data set.

In the absence of any better information, assume that $\vec{x}$ and $\vec{\sigma}$ are (somehow) known precisely, and that the "error" on $y_k$ is Gaussian (mean of $a+bx_k$ and standard deviation $\sigma_k$).

  1. Draw the PGM, and write down the corresponding probability expressions, for this problem.

  2. What (unspecified) assumptions, if any, would you have to make to actually generate data? Which assumptions do you think are unlikely to hold in practice? Choose one (or more) of these assumptions and work out how to generalize the PGM/generative model to avoid making it.

Bonus numerical exercise:

Extending the linear regression exercise, simulate a few data sets, given some values (your choice) for the input parameters. The commented-out code below is a (crummy) starting point; one possible completion follows it.


In [ ]:
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['xtick.labelsize'] = 'x-large'
plt.rcParams['ytick.labelsize'] = 'x-large'

In [ ]:
"""
# Choose some linear model parameters, somehow
a = 
b =

# Choose some x and sigma values... somehow
n = 10 # Number of data points. Feel free to change.
x = np.array([
sigma = np.array([

# Work out the values for any intermediate nodes in your PGM
    
# generate the "observed" y values
y = st.norm.rvs(
""";

In [ ]:
"""
# plot x, y and sigma in the usual way
plt.rcParams['figure.figsize'] = (12.0, 5.0)
plt.errorbar(x, y, yerr=sigma, fmt='none');
plt.plot(x, y, 'bo');
plt.xlabel('x', fontsize=14);
plt.ylabel('y', fontsize=14);
""";

Bonus exercise: Exoplanet transit photometry

You've taken several images of a particular field, in order to record the transit of an exoplanet in front of a star (resulting in a temporary decrease in its brightness). Some kind of model, parametrized by $\theta$, describes the time series of the resulting flux. Before we get to measure a number of counts, however, each image is affected by time-specific effects, e.g. changes in the weather. To account for these, you've also measured a second star in the same field in every exposure. The assumption is that the average intrinsic flux of this second star is constant in time, so it can be used to correct for photometric variations, putting the multiple measurements of the target star on a common scale.

Draw a PGM and write down the corresponding probability expressions for this problem.

Thanks to Anja von der Linden for inspiring (and then correcting) the above problem.

Note: Sketchy solutions for the PGM-drawing exercises can be found with the corresponding material from DSFP Session 4.