A generative model formalizes our understanding of how a data set comes to exist, including both the underlying physical processes and the process by which they are measured.
In other words, it's what we need in order to generate a mock data set.
To actually generate mock data, we need to specify the sampling distribution, $p(\mathrm{data}|\mathrm{model})$. This PDF is the mathematical expression of our generative model.
What are generative models useful for? Among other things, writing one down forces us to specify the sampling distribution we'll need for inference, and it lets us generate mock data sets for testing an analysis and for checking whether a fitted model can plausibly reproduce the real data.
A probabilistic graphical model (PGM) is a very useful way of visualizing a generative model.
Many, many mistakes can be avoided by sketching out a PGM at the outset of a statistical analysis.
Technically, a PGM is a type of directed acyclic graph, where nodes and edges represent parts of the model.
Let's look at a very simple example...
Here's an image (and a zoom-in):
Our measurement is the number of counts in each pixel. Here is a generative model:

* There's an astronomical source whose properties are parametrized by $\theta$.
* Given $\theta$, we can compute the flux falling on each pixel, $F_k$.
* Given $F_k$ and the exposure time of the observation, $T$, we can compute the expected number of counts in each pixel, $\mu_k$.
* The number of counts actually measured in each pixel, $N_k$, is a Poisson draw with mean $\mu_k$.
Notice that the model was described in terms of conditional relationships.
The PGM will do the same, visually.
This is what it looks like:
Ingredients of a PGM:

* Nodes represent PDFs for model variables.
* Edges represent conditional relationships between them.
* Plates represent repeated model components whose contents are conditionally independent.

Types of nodes:

* Circles: a variable described by a PDF.
* Points: a variable that is fixed or deterministic (its PDF is a delta function).
* Double (or shaded) circles: a variable whose value is measured, i.e. data.
Q: What is this PGM telling us?
Q: How are these PGMs different, and what does the difference mean?
By mapping the conditional dependences of a model, PGMs illustrate how to factorize (and hence draw samples from) the joint PDF for all variables:
$p(\theta,T,\{F_k, \mu_k, N_k\}) = p(\theta)p(T) \prod_k P(N_k|\mu_k)p(\mu_k|F_k,T)p(F_k|\theta)$
In this case, some of these PDFs are delta functions: $F_k$ and $\mu_k$ are deterministic given their parents, and the exposure time $T$ is fixed. We can therefore straightforwardly marginalize over these variables:

$p(\theta,\{N_k\}) = \int dT \left[\prod_k dF_k\, d\mu_k\right] p(\theta)\,p(T) \prod_k P(N_k|\mu_k)\,p(\mu_k|F_k,T)\,p(F_k|\theta)$

$\quad\quad = \underbrace{p(\theta)} ~ \underbrace{\prod_k P\left(N_k|\mu_k(\theta,T)\right)}$

$\quad\quad = \mathrm{prior}(\theta) ~\times~ (\mathrm{sampling~distribution~of~}\vec{N})$

where $T$ in the last two lines stands for its fixed value.
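To make the connection to mock data explicit, here is a minimal sketch of ancestral sampling from this factorized joint PDF: each sampling or assignment statement corresponds to one factor, proceeding from parent nodes to their children. The prior on $\theta$, the flux profile, and the numerical values below are made-up assumptions purely for illustration.

```python
import numpy as np
import scipy.stats as st

npix = 100    # number of pixels (arbitrary)
T = 300.0     # fixed exposure time (arbitrary units)

# p(theta): draw the model parameters from their prior.
# Here theta is just an overall flux normalization with a made-up uniform prior.
theta = st.uniform.rvs(loc=0.0, scale=10.0)

# p(F_k|theta) and p(mu_k|F_k,T) are delta functions, i.e. deterministic steps.
# The Gaussian-blob flux profile is purely illustrative.
F = theta * np.exp(-0.5 * ((np.arange(npix) - 0.5 * npix) / 5.0) ** 2)
mu = F * T

# P(N_k|mu_k): the observed counts are independent Poisson draws.
N = st.poisson.rvs(mu)
```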
Note: the daft Python package is useful for making pretty PGMs.
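For instance, a PGM with the structure of the counts-in-pixels model above could be drawn roughly as follows. The node placements are arbitrary, and this sketch assumes a reasonably recent version of daft (where `PGM()` needs no explicit shape and `add_plate` accepts a rectangle directly).

```python
import daft

pgm = daft.PGM()

# Model parameters and the fixed exposure time
pgm.add_node("theta", r"$\theta$", 1.0, 2.0)
pgm.add_node("T", r"$T$", 3.0, 2.0, fixed=True)

# Per-pixel quantities live on a plate
pgm.add_node("F", r"$F_k$", 1.0, 1.0)
pgm.add_node("mu", r"$\mu_k$", 2.0, 1.0)
pgm.add_node("N", r"$N_k$", 3.0, 1.0, observed=True)
pgm.add_plate([0.5, 0.4, 3.0, 1.1], label=r"pixels $k$")

# Edges encode the conditional dependences
pgm.add_edge("theta", "F")
pgm.add_edge("F", "mu")
pgm.add_edge("T", "mu")
pgm.add_edge("mu", "N")

pgm.render()
```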
Your data is a list of $\{x_k,y_k,\sigma_k\}$ triplets, where $\sigma_k$ is some estimate of the "error" on $y_k$. You think a linear model, $y(x)=a+bx$, might explain these data. To start exploring this idea, you decide to generate some simulated data to compare with your real data set.
In the absence of any better information, assume that $\vec{x}$ and $\vec{\sigma}$ are (somehow) known precisely, and that each $y_k$ is drawn from a Gaussian with mean $a+bx_k$ and standard deviation $\sigma_k$.
Draw the PGM and write down the corresponding probability expressions for this problem.
What (unspecified) assumptions, if any, would you have to make to actually generate data? Which assumptions do you think are unlikely to hold in practice? Choose one (or more) of these assumptions and work out how to generalize the PGM/generative model to avoid making it.
In [ ]:
'''
import numpy as np
import scipy.stats as st
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
plt.rcParams['xtick.labelsize'] = 'x-large'
plt.rcParams['ytick.labelsize'] = 'x-large'
%matplotlib inline
''';
In [ ]:
"""
# Choose some linear model parameters, somehow
a =
b =
# Choose some x and sigma values... somehow
n = 10 # Number of data points. Feel free to change.
x = np.array([
sigma = np.array([
# Work out the values for any intermediate nodes in your PGM
# generate the "observed" y values
y = st.norm.rvs(
""";
In [ ]:
"""
# plot x, y and sigma in the usual way
plt.rcParams['figure.figsize'] = (12.0, 5.0)
plt.errorbar(x, y, yerr=sigma, fmt='none');
plt.plot(x, y, 'bo');
plt.xlabel('x', fontsize=14);
plt.ylabel('y', fontsize=14);
""";
You've taken several images of a particular field in order to record the transit of an exoplanet in front of a star (resulting in a temporary decrease in the star's apparent brightness). Some kind of model, parametrized by $\theta$, describes the time series of the resulting flux. Before we get to measure a number of counts, however, each image is affected by time-specific effects, e.g. those related to changing weather. To account for these, you've also measured 10 other stars in the same field in every exposure. The assumption is that the average intrinsic flux of these stars should be constant in time, so that they can be used to correct for photometric variations, putting the multiple measurements of the target star on the same scale.
Draw the PGM and write down the corresponding probability expressions for this problem.
Thanks to Anja von der Linden for inspiring (and then correcting) the above problem.
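In case it helps to get started, here is one possible (by no means unique) structure for the joint PDF. The symbols are illustrative assumptions: $F(t_i;\theta)$ is the model flux of the target at the time $t_i$ of exposure $i$, $c_i$ is a per-exposure factor absorbing weather and other time-specific effects, $f_j$ is the constant intrinsic flux of reference star $j$ ($j=1,\dots,10$), and $N_i$ and $M_{ij}$ are the counts measured for the target and reference stars. One might then write

$p\left(\theta, \{c_i\}, \{f_j\}, \{N_i\}, \{M_{ij}\}\right) = p(\theta) \prod_j p(f_j) \prod_i \left[ p(c_i)\, P\left(N_i|c_i F(t_i;\theta)\right) \prod_j P\left(M_{ij}|c_i f_j\right) \right]$

with, e.g., Poisson sampling distributions for the counts. Your own PGM and notation may well differ in the details.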
You've measured the centers of a sample of galaxy clusters in two ways: by choosing a brightest cluster galaxy (BCG) and by finding the centroid of each cluster's X-ray emission. The difference between the two should say something about the fidelity of the BCG selection method, among other things. The BCG positions are determined essentially perfectly, but the X-ray centroids come with a Gaussian statistical uncertainty of typically $\sim30$ kpc (standard deviation) in both the $x$ and $y$ directions.
The underlying model is assumed to be that the BCG and true X-ray centroid coincide perfectly in a fraction $f$ of clusters. In the remaining clusters, the true X-ray centroid and BCG are displaced according to a 2D Gaussian whose width in either direction is $\sigma$.
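The next cell is left for your own work, but as a starting point, here is a minimal sketch of how mock BCG/X-ray offsets could be generated under this model. Only the 30 kpc measurement scatter comes from the problem statement; the number of clusters and the values of $f$ and $\sigma$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng()

nclus = 200        # number of clusters (arbitrary)
f = 0.7            # fraction with perfectly coincident BCG and X-ray centroid (arbitrary)
sigma = 100.0      # width of the intrinsic offset distribution in each direction, kpc (arbitrary)
sigma_xray = 30.0  # X-ray centroid measurement uncertainty per coordinate, kpc

# True offsets: exactly zero for a fraction f of clusters, 2D Gaussian otherwise
coincident = rng.random(nclus) < f
true_offset = rng.normal(0.0, sigma, size=(nclus, 2))
true_offset[coincident] = 0.0

# Measured offsets include the Gaussian centroiding error in each coordinate
measured_offset = true_offset + rng.normal(0.0, sigma_xray, size=(nclus, 2))
```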
In [ ]: