Generative Models

Goals:

  • Introduce generative models in the context of mock data generation and inference
  • Introduce probabilistic graphical models as a tool for model visualization
  • Practice building some simple models

Further reading

  • Ivezic et al., Sections 3.3 and 3.7
  • Bishop, 'Pattern Recognition and Machine Learning,' Sections 8.1 and 8.2

A generative model formalizes our understanding of how a data set comes to exist, including

  • physical processes happening out there in the Universe
  • instrumental effects and the measurement process
  • any computations done prior to calling the result a "data set"

In other words, it's what we need in order to generate a mock data set.

To actually generate mock data, we need to specify the sampling distribution, $p(\mathrm{data}|\mathrm{model})$. This PDF is the mathematical expression of our generative model.

  • The assumed "$\mathrm{model}$" specifies the form and parameters of the sampling distribution
  • A random draw from $p(\mathrm{data}|\mathrm{model})$ is a data set, "$\mathrm{data}$"
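For example, if the model asserts that each datum $y_k$ is drawn independently from a Gaussian with mean $\mu$ and standard deviation $\sigma$, then $p(\mathrm{data}|\mathrm{model}) = \prod_k \mathrm{Normal}(y_k|\mu,\sigma)$, and a mock data set is a single random draw of the whole vector $\vec{y}$.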

What are generative models useful for?

  • Performing inference: constructing the sampling distribution or likelihood function
  • Testing inference: does our analysis, run on mock data, recover the input model?
  • Checking inferences: do mock data generated from a fitted model resemble the real data?

A probabilistic graphical model (PGM) is a very useful way of visualizing a generative model.

  • They sketch out the procedure one would follow to generate mock data in practice.
  • They illustrate the interdependence of model parameters, and the dependence of data on parameters.
  • They also (therefore) represent a conditional factorization of the PDF for all the data and model parameters.

Many, many mistakes can be avoided by sketching out a PGM at the outset of a statistical analysis.

Technically, a PGM is a type of directed acyclic graph, where nodes and edges represent parts of the model.

Let's look at a very simple example...

Here's an image (and a zoom-in):

Our measurement is the number of counts in each pixel. Here is a generative model:

  • There's an object emitting light, whose properties are parametrized by $\theta$.
  • From $\theta$, we can determine the average flux falling on a given pixel $k$, $F_k$.
  • Given the exposure time of our observation, $T$, and some conversion factors, $F_k$ determines the average number of counts expected, $\mu_k$.
  • The number of counts measured, $N_k$, is a Poisson draw, given the average $\mu_k$.

Notice that the model was described in terms of conditional relationships.

  • $\theta \Rightarrow F_k$
  • $F_k,T \Rightarrow \mu_k$
  • $N_k \sim \mathrm{Poisson}(\mu_k)$
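In code, generating a mock data set amounts to following this chain from top to bottom. Here is a minimal sketch, in which the specific form of $F_k(\theta)$ and the counts conversion factor are invented purely for illustration:

In [ ]:
import numpy as np
import scipy.stats as st

# Illustrative assumptions (not specified above): theta is a single amplitude,
# and the pixel fluxes follow a known profile s_k scaled by theta.
theta = 5.0                           # object parameter(s)
s = np.array([0.1, 0.3, 0.4, 0.2])    # known light profile over pixels k
T = 100.0                             # exposure time
conversion = 1.0                      # assumed flux*time -> counts factor

F = theta * s             # deterministic: average flux falling on each pixel
mu = conversion * F * T   # deterministic: expected counts in each pixel
N = st.poisson.rvs(mu)    # stochastic: one mock data set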

The PGM will do the same, visually.

This is what it looks like:

Ingredients of a PGM:

  • Nodes represent PDFs for parameters
  • Edges represent conditional relationships
  • Plates represent repeated model components whose contents are conditionally independent

Types of nodes:

  • Circles represent a PDF. This parameter is a stochastic function of the parameters feeding into it.
  • Points represent a delta-function PDF. This parameter is a deterministic function of the parameters feeding into it.
  • Double circles (or shading) indicate measured data. They are stochastic in the context of generating mock data, but fixed in the context of parameter inference.

Q: What is this PGM telling us?

Q: How are these PGMs different, and what does the difference mean?

By mapping the conditional dependences of a model, PGMs illustrate how to factorize (and hence draw samples from) the joint PDF for all variables:

$p(\theta,T,\{F_k, \mu_k, N_k\}) = p(\theta)\,p(T) \prod_k p(N_k|\mu_k)\,p(\mu_k|F_k,T)\,p(F_k|\theta)$

In this case, some PDFs are delta functions, so we can straightforwardly marginalize over such deterministic variables:

$p(\theta,T,\{\mu_k, N_k\}) = \int p(\theta)\,p(T) \left[ \prod_k p(N_k|\mu_k)\,p(\mu_k|F_k,T)\,p(F_k|\theta) \right] \prod_k dF_k$

Marginalizing over the deterministic $\mu_k$ in the same way, and conditioning on the known exposure time $T$, we are left with

$p(\theta,\{N_k\}|T) = \underbrace{p(\theta)}_{\mathrm{prior}(\theta)} ~ \underbrace{\prod_k p\left(N_k|\mu_k(\theta,T)\right)}_{\mathrm{sampling~distribution~of~}\vec{N}}$

Exercise

  • On your own, write down the probability expressions illustrated by these two graphs.
  • Then, discuss their meaning with your neighbor, and prepare to report back to the class.

Take-home messages

  • Both simulation of mock data and model inference from data require a model for how the Universe (or our computer) generates data.
  • PGMs are a helpful way of visualizing the conditional dependences of a model (how the probability expressions factorize).

Note: the daft Python package is useful for making pretty PGMs.
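For example, a daft sketch of the Poisson counts model above might look like the following. The node positions and plate coordinates are arbitrary choices, and the calls assume a reasonably recent version of the package:

In [ ]:
import daft

pgm = daft.PGM()
pgm.add_node("theta", r"$\theta$", 0.5, 1.5)          # stochastic: circle
pgm.add_node("T", r"$T$", 3.5, 2.5, fixed=True)       # fixed: point
pgm.add_node("F", r"$F_k$", 1.5, 1.5, fixed=True)     # deterministic: point
pgm.add_node("mu", r"$\mu_k$", 2.5, 1.5, fixed=True)  # deterministic: point
pgm.add_node("N", r"$N_k$", 3.5, 1.5, observed=True)  # data: shaded circle
pgm.add_edge("theta", "F")
pgm.add_edge("F", "mu")
pgm.add_edge("T", "mu")
pgm.add_edge("mu", "N")
pgm.add_plate([1.0, 0.9, 3.0, 1.2], label=r"pixels $k$")  # repeated components
pgm.render();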

Exercise: linear regression

Your data is a list of $\{x_k,y_k,\sigma_k\}$ triplets, where $\sigma_k$ is some estimate of the "error" on $y_k$. You think a linear model, $y(x)=a+bx$, might explain these data. To start exploring this idea, you decide to generate some simulated data, to compare with your real dataset.

In the absence of any better information, assume that $\vec{x}$ and $\vec{\sigma}$ are (somehow) known precisely, and that the "error" on $y_k$ is Gaussian, i.e. that $y_k$ is drawn from a Gaussian with mean $a+bx_k$ and standard deviation $\sigma_k$.

  1. Draw the PGM, and write down the corresponding probability expressions, for this problem.

  2. What (unspecified) assumptions, if any, would you have to make to actually generate data? Which assumptions do you think are unlikely to hold in practice? Choose one (or more) of these assumptions and work out how to generalize the PGM/generative model to avoid making it.

Bonus numerical exercise:

Extending the linear regression exercise, simulate a few data sets, given some values (your choice) for the input parameters. The commented code below is a (crummy) starting point.


In [ ]:
'''
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
plt.rcParams['xtick.labelsize'] = 'x-large'
plt.rcParams['ytick.labelsize'] = 'x-large'
%matplotlib inline
''';

In [ ]:
"""
# Choose some linear model parameters, somehow
a = 
b =

# Choose some x and sigma values... somehow
n = 10 # Number of data points. Feel free to change.
x = np.array([
sigma = np.array([

# Work out the values for any intermediate nodes in your PGM
    
# generate the "observed" y values
y = st.norm.rvs(
""";

In [ ]:
"""
# plot x, y and sigma in the usual way
plt.rcParams['figure.figsize'] = (12.0, 5.0)
plt.errorbar(x, y, yerr=sigma, fmt='none');
plt.plot(x, y, 'bo');
plt.xlabel('x', fontsize=14);
plt.ylabel('y', fontsize=14);
""";

Bonus exercise: Exoplanet transit photometry

You've taken several images of a particular field, in order to record the transit of an exoplanet in front of a star (resulting in a temporary decrease in its brightness). Some kind of model, parametrized by $\theta$, describes the time series of the resulting flux. Before we get to measure a number of counts, however, each image is affected by time-specific effects, e.g. changes in the weather. To account for these, you've also measured 10 other stars in the same field in every exposure. The assumption is that the average intrinsic flux of these stars should be constant in time, so that they can be used to correct for photometric variations, putting the multiple measurements of the target star on the same scale.

Draw the PGM and write down the corresponding probability expressions for this problem.

Thanks to Anja von der Linden for inspiring (and then correcting) the above problem.

Bonus numerical exercise: Galaxy cluster center offsets

You've measured the centers of a sample of galaxy clusters in two ways: by choosing a brightest cluster galaxy (BCG) and by finding the centroid of each cluster's X-ray emission. The difference between the two should say something about the fidelity of the BCG selection method, among other things. The BCG positions are determined essentially perfectly, but the X-ray centroids come with a Gaussian statistical uncertainty of typically $\sim30$ kpc (standard deviation) in both the $x$ and $y$ directions.

The underlying model is assumed to be that the BCG and true X-ray centroid coincide perfectly in a fraction $f$ of clusters. In the remaining clusters, the true X-ray centroid and BCG are displaced according to a 2D Gaussian whose width in either direction is $\sigma$.

  1. Draw the PGM and write down the corresponding probability expressions for this problem.
  2. Simulate some data sets and visualize them, e.g. as a histogram of the offset distances.
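As a possible starting point for part 2, the cell below sketches one way to simulate and visualize such offsets. It assumes the imports from the linear regression exercise above, and all parameter values and variable names are arbitrary choices introduced for illustration.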

In [ ]:
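"""
# One possible simulation of the model above (arbitrary parameter choices).
n_clusters = 1000
f = 0.7            # fraction with coincident BCG and true X-ray centroid
sigma = 100.0      # width (kpc) of the 2D offset Gaussian for the rest
sigma_xray = 30.0  # Gaussian uncertainty (kpc) on each X-ray coordinate

# True offsets: exactly zero with probability f, otherwise 2D Gaussian
coincident = st.uniform.rvs(size=n_clusters) < f
dx = np.where(coincident, 0.0, st.norm.rvs(scale=sigma, size=n_clusters))
dy = np.where(coincident, 0.0, st.norm.rvs(scale=sigma, size=n_clusters))

# Measured offsets include the X-ray centroiding uncertainty
dx_obs = dx + st.norm.rvs(scale=sigma_xray, size=n_clusters)
dy_obs = dy + st.norm.rvs(scale=sigma_xray, size=n_clusters)

# Visualize as a histogram of measured offset distances
plt.hist(np.sqrt(dx_obs**2 + dy_obs**2), bins=50)
plt.xlabel('offset distance (kpc)', fontsize=14)
plt.ylabel('number of clusters', fontsize=14)
""";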