Settings that make the plots look pretty on my Mac:
In [1]:
%pylab inline
%config InlineBackend.figure_format = 'retina'
You will need:
- pystan: `pip install pystan` has always worked well for me. (If you need a good Python distribution, I recommend anaconda, which comes with a complete numpy, scipy, IPython, etc. stack.)
- seaborn: `pip install seaborn`. Seaborn is a plotting library.
- corner: `pip install corner`. We use it below for corner plots of the posterior samples.
In [2]:
import corner
import pystan
import seaborn as sns
These are my favourite plotting styles:
In [3]:
sns.set_context('talk') #sns.set_context('notebook') is probably better on your monitor
sns.set_palette('colorblind')
sns.set_style('ticks')
We are going to focus today on Hogg et al. (2010). In this paper, the "just so" story will evolve as we go, but it always involves measuring data $x$ and $y$ which are (often) related by a linear relationship:
$$
y = m x + b
$$
For some reason (maybe we are cosmologists?) we are interested in the parameters $m$ and $b$. Please read Hogg et al.'s (2010) cautionary note about this!
I used tabula to scrape the data table from Hogg's PDF. When you post things on the arXiv, you should always include a link to machine-readable tables to accompany the document. You can post your data on GitHub, or, if you want a more archive-quality repository that is citable (e.g., one that generates a DOI), consider something like Zenodo. Hogg & company, normally so good in this respect, dropped the ball for this paper.
In [4]:
table1 = genfromtxt('HoggTable1.tsv', names=True)
In [5]:
errorbar(table1['x'], table1['y'], xerr=table1['sigma_x'], yerr=table1['sigma_y'], fmt='.')
Out[5]:
(I know that this fit can be computed analytically---and Hogg et al. (2010) tell you how---but we're going to be computing numerically anyway, so let's do it that way first.)
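For reference, the analytic solution is just standard weighted least squares, following the linear-algebra recipe in Hogg et al. (2010): with design matrix $A = [1, x]$ and data covariance $C = \mathrm{diag}(\sigma_y^2)$, the best-fit parameters are $(A^T C^{-1} A)^{-1} A^T C^{-1} y$. A minimal sketch (the helper name wls_line is mine, not from the notebook's files):

def wls_line(xs, ys, sigma_ys):
    # Weighted least squares for a line: returns (m_best, b_best).
    A = column_stack((ones_like(xs), xs))        # design matrix [1, x]
    Cinv = diag(1.0/sigma_ys**2)                 # inverse data covariance
    cov = np.linalg.inv(A.T.dot(Cinv).dot(A))    # parameter covariance matrix
    b_best, m_best = cov.dot(A.T.dot(Cinv).dot(ys))
    return m_best, b_best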
To do some fits, we will be using Stan (named, by the way, for Stanislaw Ulam). Stan is a specialised language for describing joint distributions of data and parameters. Once the description is written down, Stan compiles it to a C++ program that draws samples from the posterior distribution $p\left( \theta \mid d \right)$ of parameters $\theta$ given data $d$; a C++ compiler on your machine translates that into executable code, which we then drive from Python. Whew.
For exercise 1, we assume that $$ y_\mathrm{true} = m x_\mathrm{true} + b $$ and $$ y_\mathrm{obs} \sim N\left( y_\mathrm{true}, \sigma_y \right). $$
In [75]:
model = pystan.StanModel(file='linear_model.stan')
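The contents of linear_model.stan aren't reproduced in this notebook. For readers without the file, here is a minimal sketch of what an equivalent model might look like, passed inline via model_code; the data names (npts, xs, ys, sigma_ys) match the dictionary below, but with no explicit priors Stan falls back on improper flat priors for m and b, which may or may not match the actual file:

# A hypothetical inline equivalent of linear_model.stan (my sketch, not the file).
linear_model_code = """
data {
  int<lower=1> npts;
  vector[npts] xs;
  vector[npts] ys;
  vector<lower=0>[npts] sigma_ys;
}
parameters {
  real m;
  real b;
}
model {
  ys ~ normal(m*xs + b, sigma_ys);  // y_obs ~ N(m*x + b, sigma_y)
}
"""
# model = pystan.StanModel(model_code=linear_model_code)  # equivalent usage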
Exercise 1 asks us to fit all the data from the 5th to the 20th point, using the uncertainties in $y$ but ignoring the uncertainties in $x$.
In [76]:
ex1N0 = 4  # drop the first four points (Python is zero-indexed), keeping points 5 through 20
In [77]:
ex1_data = {'npts': table1['x'][ex1N0:].shape[0],
            'xs': table1['x'][ex1N0:],
            'ys': table1['y'][ex1N0:],
            'sigma_ys': table1['sigma_y'][ex1N0:]}
In [78]:
ex1_fit = model.sampling(data=ex1_data)
ex1_fit.plot()
ex1_fit
Out[78]:
In [79]:
ex1_samples = ex1_fit.extract(permuted=True)  # dict of parameter name -> array of post-warmup samples, chains merged
In [80]:
corner.corner(column_stack((ex1_samples['m'], ex1_samples['b'])), labels=[r'$m$', r'$b$']);
We always want to plot the output of our models in data space, to check whether the fit is good. Let's define a function that does this:
In [81]:
def plot_errorbands(xs, ms, bs, *args, **kwargs):
    # Plot the median fitted line plus 68% and 95% credible bands in data space.
    ys = []
    for m, b in zip(ms, bs):
        ys.append(m*xs + b)
    ys = array(ys)
    line, = plot(xs, median(ys, axis=0), *args, **kwargs)
    fill_between(xs, percentile(ys, 84, axis=0), percentile(ys, 16, axis=0), alpha=0.25, color=line.get_color())
    fill_between(xs, percentile(ys, 97.5, axis=0), percentile(ys, 2.5, axis=0), alpha=0.25, color=line.get_color())
A somewhat subtle point: our model has implications for $y_\mathrm{true}$, since $y_\mathrm{true} = m x + b$; every posterior sample of $(m, b)$ implies a value of $y_\mathrm{true}$ at each $x$. Thus we have a posterior on $y_\mathrm{true}$:
In [82]:
def plot_ytrues_no_scatter(xs, ms, bs, *args, **kwargs):
    # Plot the posterior mean and standard deviation of y_true = m*x + b at each x.
    ytrues = []
    for m, b in zip(ms, bs):
        ytrues.append(m*xs + b)
    ytrues = array(ytrues)
    errorbar(xs, mean(ytrues, axis=0), yerr=std(ytrues, axis=0), fmt='.', *args, **kwargs)
You can see that our uncertainties on $y_\mathrm{true}$ are very small. In fact, they are so small that our model is clearly inconsistent with the measurements---the green dots and errorbars in the plot below are the model predictions, and they are wildly inconsistent with the data in black. Something must be done.
In [83]:
plot_errorbands(linspace(np.min(table1['x'][ex1N0:]), np.max(table1['x'][ex1N0:]), 1000), ex1_samples['m'], ex1_samples['b'])
plot_ytrues_no_scatter(table1['x'][ex1N0:], ex1_samples['m'], ex1_samples['b'], color='g')
errorbar(table1['x'][ex1N0:], table1['y'][ex1N0:], yerr=table1['sigma_y'][ex1N0:], fmt='.', color='k')
Out[83]:
Places to go from here:
Ilya suggested that we try inflating the observed uncertainties by some constant factor until the fit is good (and I poked a lot of fun at his expense, because this model amounts to saying you don't trust the observational errorbars---thanks for being a good sport, Ilya). I have limited the scaling factor to between 0.1 and 10.0.
In [84]:
model_ilya = pystan.StanModel(file='linear_model_ilya_errors.stan')
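Again, the .stan file isn't shown in this notebook; here is a minimal sketch of what linear_model_ilya_errors.stan might contain. The parameter name ilyas_idiot_factor matches the samples extracted below, and its uniform bounds come from the 0.1-10 limits mentioned above; the rest is my guess:

# A hypothetical sketch of linear_model_ilya_errors.stan (not the actual file).
ilya_model_code = """
data {
  int<lower=1> npts;
  vector[npts] xs;
  vector[npts] ys;
  vector<lower=0>[npts] sigma_ys;
}
parameters {
  real m;
  real b;
  real<lower=0.1, upper=10> ilyas_idiot_factor;  // error inflation factor
}
model {
  ys ~ normal(m*xs + b, ilyas_idiot_factor*sigma_ys);
}
"""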
In [85]:
fit_ilya = model_ilya.sampling(data=ex1_data)
fit_ilya.plot()
fit_ilya
Out[85]:
In [86]:
samples_ilya = fit_ilya.extract(permuted=True)
In particular, Ilya's suggestion would imply that the observers have understated their errorbars by a factor of ~5:
In [87]:
sns.distplot(samples_ilya['ilyas_idiot_factor'])
xlabel('Ilya Scaling Factor')
ylabel(r'$p\left( \mathrm{factor} \right)$')
Out[87]:
In [88]:
plot_errorbands(linspace(np.min(table1['x'][ex1N0:]), np.max(table1['x'][ex1N0:]), 1000), samples_ilya['m'], samples_ilya['b'])
plot_ytrues_no_scatter(table1['x'][ex1N0:], samples_ilya['m'], samples_ilya['b'], color='g')
errorbar(table1['x'][ex1N0:], table1['y'][ex1N0:], yerr=mean(samples_ilya['ilyas_idiot_factor'], axis=0)*table1['sigma_y'][ex1N0:], fmt='.', color='k')
Out[88]:
An alternative model is that the observers got it right, but there is some "intrinsic" scatter in the relation. We talked about how this could come about in lecture. Our just-so story is now: $$ y_\mathrm{true} \sim N\left( m x_\mathrm{true} + b, \sigma \right) $$ with $$ y_\mathrm{obs} \sim N\left( y_\mathrm{true}, \sigma_y \right) $$
In [90]:
model_scatter = pystan.StanModel(file='linear_model_scatter.stan')
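A sketch of what linear_model_scatter.stan might look like---note that $y_\mathrm{true}$ is now an explicit per-point parameter, which is why we can extract posterior samples for it below. The half-normal prior on sigma is my guess, not necessarily what the file uses:

# A hypothetical sketch of linear_model_scatter.stan (not the actual file).
scatter_model_code = """
data {
  int<lower=1> npts;
  vector[npts] xs;
  vector[npts] ys;
  vector<lower=0>[npts] sigma_ys;
}
parameters {
  real m;
  real b;
  real<lower=0> sigma;    // intrinsic scatter about the line
  vector[npts] y_true;    // latent true y values
}
model {
  sigma ~ normal(0, 10);             // weakly informative prior (assumed)
  y_true ~ normal(m*xs + b, sigma);  // y_true ~ N(m*x + b, sigma)
  ys ~ normal(y_true, sigma_ys);     // y_obs ~ N(y_true, sigma_y)
}
"""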
In [91]:
fit_scatter = model_scatter.sampling(data=ex1_data)
In [92]:
fit_scatter.plot()
fit_scatter
Out[92]:
In [93]:
samples_scatter = fit_scatter.extract(permuted=True)
In [94]:
plot_errorbands(linspace(np.min(table1['x'][ex1N0:]), np.max(table1['x'][ex1N0:]), 1000), samples_scatter['m'], samples_scatter['b'])
errorbar(table1['x'][ex1N0:], mean(samples_scatter['y_true'], axis=0), yerr=std(samples_scatter['y_true'], axis=0), fmt='.', color='g')
errorbar(table1['x'][ex1N0:], table1['y'][ex1N0:], yerr=table1['sigma_y'][ex1N0:], fmt='.', color='k')
Out[94]:
Note that in the last three models, I have sneakily put in posteriors over $y_\mathrm{true}$. These models are hierarchical or multilevel or partially pooled (different communities have different names for this). The posterior over $y_\mathrm{true}$ is a weighted combination of the observational uncertainty and the modelling uncertainty; by imposing a model, we allow observations of multiple data points to inform each other and reduce the uncertainty. For example, here is the posterior on $y_\mathrm{true}$ for the worst-measured data point compared to the observational uncertainty. Because the point lies above the fitted line, the posterior trends lower than the observation, but the error is significantly reduced.
In [102]:
ind = argmax(table1['sigma_y'][ex1N0:])  # index of the worst-measured (largest sigma_y) point
sns.distplot(samples_scatter['y_true'][:,ind])
errorbar(table1['y'][ex1N0:][ind], 0.030, xerr=table1['sigma_y'][ex1N0:][ind], fmt='.', color='k', label='Observation')
errorbar(mean(samples_scatter['y_true'][:,ind]), 0.020, xerr=std(samples_scatter['y_true'][:,ind]), fmt='.', color=sns.color_palette()[0], label='Posterior')
legend(loc='best')
xlabel(r'$y_\mathrm{true}$')
ylabel(r'$p\left( y_\mathrm{true} \right)$')
Out[102]:
Now let's try a fit including the "bad" data at face value:
In [103]:
data_bad = {'npts': table1['x'].shape[0],
            'xs': table1['x'],
            'ys': table1['y'],
            'sigma_ys': table1['sigma_y']}
In [104]:
fit_bad = model_scatter.sampling(data=data_bad)
In [105]:
fit_bad.plot()
fit_bad
Out[105]:
In [106]:
samples_bad = fit_bad.extract(permuted=True)
Ugh---that fit is eye-wateringly bad! The outliers are pulling the slope way down! How can we reject them?
In [107]:
plot_errorbands(linspace(np.min(table1['x'][ex1N0:]), np.max(table1['x'][ex1N0:]), 1000), samples_scatter['m'], samples_scatter['b'])
plot_errorbands(linspace(np.min(table1['x']), np.max(table1['x']), 1000), samples_bad['m'], samples_bad['b'])
errorbar(table1['x'][ex1N0:], mean(samples_scatter['y_true'], axis=0), yerr=std(samples_scatter['y_true'], axis=0), fmt='.', color=sns.color_palette()[0])
errorbar(table1['x'], mean(samples_bad['y_true'], axis=0), yerr=std(samples_bad['y_true'], axis=0), fmt='.', color=sns.color_palette()[1])
errorbar(table1['x'][ex1N0:], table1['y'][ex1N0:], yerr=table1['sigma_y'][ex1N0:], fmt='.', color='k')
Out[107]:
Well, let's not reject them. Let's suppose that our model is now a mixture of good data points and bad; the good ones come from a line, while the bad ones just come from a Gaussian in $y$, independent of the $x$ values (see Hogg et al. (2010) for discussion). Thus: $$ y_\mathrm{true} \sim A \, N\left( m x_\mathrm{true} + b, \sigma \right) + (1-A) \, N\left( \mu, \sigma' \right) $$ with $$ y_\mathrm{obs} \sim N\left( y_\mathrm{true}, \sigma_y \right). $$ Here we will have to impose reasonable priors on all the parameters, because otherwise, when $A = 1$ or $A = 0$, some parameters become "not identifiable" (that is, not constrained at all---they simply drop out of the likelihood), and we need the priors to keep the sampler from running off to infinity.
See the Stan file for the specific priors I impose.
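For readers without the file, here is a rough sketch of what linear_model_scatter_outlier.stan might look like. Stan cannot sample the discrete good/bad labels directly, so the mixture is marginalized out with log_mix; all the priors below are illustrative guesses, not necessarily the ones actually imposed:

# A hypothetical sketch of linear_model_scatter_outlier.stan (not the actual file).
outlier_model_code = """
data {
  int<lower=1> npts;
  vector[npts] xs;
  vector[npts] ys;
  vector<lower=0>[npts] sigma_ys;
}
parameters {
  real m;
  real b;
  real<lower=0> sigma;        // intrinsic scatter of the line
  real<lower=0, upper=1> A;   // fraction of "good" points
  real mu;                    // mean of the outlier distribution
  real<lower=0> sigma_out;    // width of the outlier distribution
  vector[npts] y_true;
}
model {
  // Illustrative weakly informative priors (assumed, not from the file):
  m ~ normal(0, 10);
  b ~ normal(0, 100);
  sigma ~ normal(0, 100);
  mu ~ normal(0, 1000);
  sigma_out ~ normal(0, 1000);
  for (i in 1:npts) {
    // Marginalize over the good/bad label for each point.
    target += log_mix(A,
                      normal_lpdf(y_true[i] | m*xs[i] + b, sigma),
                      normal_lpdf(y_true[i] | mu, sigma_out));
  }
  ys ~ normal(y_true, sigma_ys);
}
"""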
In [108]:
model_outliers = pystan.StanModel(file='linear_model_scatter_outlier.stan')
In [109]:
fit_outliers = model_outliers.sampling(data=data_bad)
In [110]:
fit_outliers
Out[110]:
In [111]:
samples_outliers = fit_outliers.extract(permuted=True)
And now the fit looks much more reasonable:
In [112]:
plot_errorbands(linspace(np.min(table1['x'][ex1N0:]), np.max(table1['x'][ex1N0:]), 1000), samples_scatter['m'], samples_scatter['b'])
plot_errorbands(linspace(np.min(table1['x']), np.max(table1['x']), 1000), samples_outliers['m'], samples_outliers['b'])
errorbar(table1['x'][ex1N0:], mean(samples_scatter['y_true'], axis=0), yerr=std(samples_scatter['y_true'], axis=0), fmt='.', color=sns.color_palette()[0])
errorbar(table1['x'], mean(samples_outliers['y_true'], axis=0), yerr=std(samples_outliers['y_true'], axis=0), fmt='.', color=sns.color_palette()[1])
errorbar(table1['x'], table1['y'], yerr=table1['sigma_y'], fmt='.', color='k')
Out[112]:
I think we are going to stop here, but useful homework could be: introduce an $x_\mathrm{true}$ parameter, incorporate the $x$ observational uncertainties into one or more of these models, check the fits, etc.
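As a starting point for that homework, here is a hypothetical sketch of how the $x$ uncertainties might enter the scatter model. This is untested and would also need 'sigma_xs': table1['sigma_x'] added to the data dictionary:

# Hypothetical homework starter: latent x_true with observed xs ~ N(x_true, sigma_x).
xy_model_code = """
data {
  int<lower=1> npts;
  vector[npts] xs;
  vector[npts] ys;
  vector<lower=0>[npts] sigma_xs;
  vector<lower=0>[npts] sigma_ys;
}
parameters {
  real m;
  real b;
  real<lower=0> sigma;
  vector[npts] x_true;  // latent true x values
  vector[npts] y_true;
}
model {
  sigma ~ normal(0, 10);                 // assumed weakly informative prior
  y_true ~ normal(m*x_true + b, sigma);
  xs ~ normal(x_true, sigma_xs);         // x_obs ~ N(x_true, sigma_x)
  ys ~ normal(y_true, sigma_ys);         // y_obs ~ N(y_true, sigma_y)
}
"""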