Generative Models

Goals:

  • Introduce generative models in the context of mocking data and inference
  • Introduce probabilistic graphical models as a tool for model visualization
  • Practice building some simple models

Further reading

  • Ivezic et al., Sections 3.3 and 3.7

A generative model formalizes our understanding of how a data set comes to exist, including

  • physical processes happening out there in the Universe
  • instrumental effects and the measurement process
  • any computations done prior to calling the result a "data set"

In other words, it's what we need in order to generate a mock data set.

To actually generate mock data, we need to specify the sampling distribution, $p(\mathrm{data}|\mathrm{model})$. This PDF is the mathematical expression of our generative model.

  • The assumed "$\mathrm{model}$" specifies the form and parameters of the sampling distribution
  • A random draw from $p(\mathrm{data}|\mathrm{model})$ is a data set, "$\mathrm{data}$"
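For instance, if the model says the data are 10 independent draws from a Gaussian of known mean and width, then one mock data set is one set of such draws. Here is a minimal sketch (the Gaussian model and its parameter values are made up for illustration):

In [ ]:
import scipy.stats as st

# Toy sampling distribution: the data are 10 independent Gaussian draws.
mean, stdev = 5.0, 2.0  # "model" parameters (arbitrary example values)
mock_data = st.norm.rvs(loc=mean, scale=stdev, size=10)  # one mock data set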

What are generative models useful for?

  • Performing inference: constructing the sampling distribution or likelihood function
  • Testing inference: does our analysis, run on mock data, recover the input model?
  • Checking inferences: do mock data generated from a fitted model resemble the real data?

Probabilistic graphical models (PGMs) are a very useful way of visualizing generative models.

  • They sketch out the procedure for how one would generate mock data in practice.
  • They illustrate the interdependence of model parameters, and the dependence of data on parameters.
  • They also (therefore) represent a conditional factorization of the PDF for all the data and model parameters.

Many, many mistakes can be avoided by sketching out a PGM at the outset of a statistical analysis.

Technically, a PGM is a type of directed acyclic graph, where nodes and edges represent parts of the model.

Let's look at a very simple example...

Here's an image (and a zoom-in):

Our measurement is the number of counts in each pixel. Here is a generative model:

  • There's an object emitting light, whose properties are parametrized by $\theta$.
  • From $\theta$, we can determine the average flux falling on a given pixel $k$, $F_k$.
  • Given the exposure time of our observation, $T$, and some conversion factors, $F_k$ determines the average number of counts expected, $\mu_k$.
  • The number of counts measured, $N_k$, is a Poisson draw, given the average $\mu_k$.

Notice that the model was described in terms of conditional relationships; a code sketch following these steps appears after the list.

  • $\theta \Rightarrow F_k$
  • $F_k,T \Rightarrow \mu_k$
  • $N_k \sim \mathrm{Poisson}(\mu_k)$
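Following these relationships in order is exactly the recipe for generating a mock data set. Here is a minimal sketch (the functional form of $F_k(\theta)$, the flux-to-counts conversion, and all numerical values are arbitrary toy choices, not part of the notes):

In [ ]:
import numpy as np
import scipy.stats as st

theta = 2.0   # object properties (toy model: a single amplitude parameter)
T = 1000.0    # exposure time, assumed known
npix = 25     # number of pixels

# Deterministic steps: theta -> F_k, then (F_k, T) -> mu_k.
# The Gaussian profile for F_k(theta) is a made-up example form.
k = np.arange(npix)
F = theta * np.exp(-0.5 * (k - 0.5*npix)**2 / 9.0)
mu = F * T  # expected counts (conversion factor of 1 assumed)

# Stochastic step: N_k ~ Poisson(mu_k)
N = st.poisson.rvs(mu)  # one mock data set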

The PGM will do the same, visually.

This is what it looks like:

Ingredients of a PGM:

  • Nodes represent PDFs for parameters
  • Edges represent conditional relationships
  • Plates represent repeated model components whose contents are conditionally independent

Types of nodes:

  • Circles represent a PDF. This parameter is a stochastic function of the parameters feeding into it.
  • Points represent a delta-function PDF. This parameter is a deterministic function of the parameters feeding into it.
  • Double circles (or shading) indicate measured data. They are stochastic in the context of generating mock data, but fixed in the context of parameter inference.

Q: What is this PGM telling us?

Q: How are these PGMs different, and what does the difference mean?

By mapping the conditional dependences of a model, PGMs illustrate how to factorize (and hence draw samples from) the joint PDF for all variables:

$p(\theta,T,\{F_k, \mu_k, N_k\}) = p(\theta)p(T) \prod_k P(N_k|\mu_k)p(\mu_k|F_k,T)p(F_k|\theta)$

In this case, some of these PDFs are delta functions ($T$ is fixed, and $F_k$ and $\mu_k$ are deterministic), so we can straightforwardly marginalize over the deterministic variables:

$p(\theta,\{N_k\}) = \int p(\theta)\,p(T) \prod_k P(N_k|\mu_k)\,p(\mu_k|F_k,T)\,p(F_k|\theta) \; dT \prod_k dF_k\,d\mu_k$

$= \underbrace{p(\theta)}_{\mathrm{prior}(\theta)} ~ \underbrace{\prod_k P\left(N_k|\mu_k(\theta,T)\right)}_{\mathrm{sampling~distribution~of~}\vec{N}}$

Exercise

  • On your own, write down the probability expressions illustrated by these two graphs.
  • Then, discuss their meaning with your neighbor, and prepare to report back to the class.

Take-home messages

  • Both simulation of mock data and model inference from data require a model for how the Universe (or our computer) generates data.
  • PGMs are a helpful way of visualizing the conditional dependences of a model (how the probability expressions factorize).

Note: the daft Python package is useful for making pretty PGMs.
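For instance, a rough daft rendering of the Poisson-counts PGM above might look like the cell below. The node placements and styling are arbitrary choices, and the calls shown follow daft's classic Node/Plate interface, which may differ slightly between versions:

In [ ]:
import daft

pgm = daft.PGM([3.0, 3.0], origin=[0.0, 0.0])

# Stochastic parameter (circle) and fixed quantity (point)
pgm.add_node(daft.Node("theta", r"$\theta$", 1.5, 2.6))
pgm.add_node(daft.Node("T", r"$T$", 2.5, 1.4, fixed=True))

# Per-pixel quantities, inside a plate
pgm.add_node(daft.Node("F", r"$F_k$", 1.5, 1.8, fixed=True))
pgm.add_node(daft.Node("mu", r"$\mu_k$", 1.5, 1.1, fixed=True))
pgm.add_node(daft.Node("N", r"$N_k$", 1.5, 0.4, observed=True))
pgm.add_plate(daft.Plate([0.9, 0.05, 1.2, 2.1], label=r"pixels $k$"))

# Edges encode the conditional dependences
pgm.add_edge("theta", "F")
pgm.add_edge("F", "mu")
pgm.add_edge("T", "mu")
pgm.add_edge("mu", "N")

pgm.render()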

Exercise: linear regression

Your data are a list of $\{x_k,y_k,\sigma_k\}$ triplets, where $\sigma_k$ is some estimate of the "error" on $y_k$. You think a linear model, $y(x)=a+bx$, might explain these data. To start exploring this idea, you decide to generate some simulated data to compare with your real data set.

In the absence of any better information, assume that $\vec{x}$ and $\vec{\sigma}$ are (somehow) known precisely, and that each $y_k$ is Gaussian-distributed with mean $a+bx_k$ and standard deviation $\sigma_k$.

  1. Draw the PGM, and write down the corresponding probability expressions, for this problem.

  2. What (unspecified) assumptions, if any, would you have to make to actually generate data? Which assumptions do you think are unlikely to hold in practice? Choose one (or more) of these assumptions and work out how to generalize the PGM/generative model to avoid making it.

Bonus numerical exercise:

Extending the linear regression exercise, simulate a few data sets, given some values (your choice) for the input parameters. The commented code below is a (crummy) starting point.


In [ ]:
'''
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['xtick.labelsize'] = 'x-large'
plt.rcParams['ytick.labelsize'] = 'x-large'
''';

In [ ]:
"""
# Choose some linear model parameters, somehow
a = 
b =

# Choose some x and sigma values... somehow
n = 10 # Number of data points. Feel free to change.
x = np.array([
sigma = np.array([

# Work out the values for any intermediate nodes in your PGM
    
# generate the "observed" y values
y = st.norm.rvs(
""";

In [ ]:
"""
# plot x, y and sigma in the usual way
plt.rcParams['figure.figsize'] = (12.0, 5.0)
plt.errorbar(x, y, yerr=sigma, fmt='none');
plt.plt(x, y, 'bo');
plt.xlabel('x', fontsize=14);
plt.ylabel('y', fontsize=14);
""";

Bonus exercise: Exoplanet transit photometry

You've taken several images of a particular field, in order to record the transit of an exoplanet in front of a star (resulting in a temporary decrease in its brightness). Some kind of model, parametrized by $\theta$, describes the time series of the resulting flux. Before we get to measure a number of counts, however, each image is affected by time-dependent effects, e.g. changing weather. To account for these, you've also measured 10 other stars in the same field in every exposure. The assumption is that the average intrinsic flux of these stars should be constant in time, so that they can be used to correct for photometric variations, putting the multiple measurements of the target star on the same scale.

Draw the PGM and write down the corresponding probability expressions for this problem.

Thanks to Anja von der Linden for inspiring (and then correcting) the above problem.

Bonus numerical exercise: Galaxy cluster center offsets

You've measured the centers of a sample of galaxy clusters in two ways: by choosing a brightest cluster galaxy (BCG) and by finding the centroid of each cluster's X-ray emission. The difference between the two should say something about the fidelity of the BCG selection method, among other things. The BCG positions are determined essentially perfectly, but the X-ray centroids come with a Gaussian statistical uncertainty of typically $\sim30$ kpc (standard deviation) in both the $x$ and $y$ directions.

The underlying model is assumed to be that the BCG and true X-ray centroid coincide perfectly in a fraction $f$ of clusters. In the remaining clusters, the true X-ray centroid and BCG are displaced according to a 2D Gaussian whose width in either direction is $\sigma$.

  1. Draw the PGM and write down the corresponding probability expressions for this problem.
  2. Simulate some data sets and visualize them, e.g. as a histogram of the offset distances.

In [ ]:
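"""
# One possible (crummy) starting point, in the spirit of the linear
# regression exercise; every numerical choice below is arbitrary.
# The model: a fraction f of clusters have zero true BCG/X-ray offset,
# the rest have true offsets drawn from a 2D Gaussian of width sigma
# per coordinate, and the measured X-ray centroid adds ~30 kpc Gaussian
# noise per coordinate.
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

nclus = 1000       # number of clusters
f = 0.7            # fraction with coincident centers
sigma = 100.0      # width of the true offset distribution (kpc)
sigma_x = 30.0     # X-ray centroid uncertainty per coordinate (kpc)

# True offsets: zero for a fraction f, 2D Gaussian for the rest
coincident = st.uniform.rvs(size=nclus) < f
dx = np.where(coincident, 0.0, st.norm.rvs(scale=sigma, size=nclus))
dy = np.where(coincident, 0.0, st.norm.rvs(scale=sigma, size=nclus))

# Measured offsets include the X-ray measurement uncertainty
dx = dx + st.norm.rvs(scale=sigma_x, size=nclus)
dy = dy + st.norm.rvs(scale=sigma_x, size=nclus)

# Visualize, e.g., the distribution of measured offset distances
plt.hist(np.sqrt(dx**2 + dy**2), bins=50)
plt.xlabel('offset distance (kpc)');
plt.ylabel('number of clusters');
""";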