In [1]:
%pylab inline
%config InlineBackend.figure_format = 'retina'
In [2]:
import arviz as az
import pandas as pd
import pystan
import seaborn as sns
from io import StringIO
sns.set_context('notebook')
sns.set_palette('colorblind')
sns.set_style('ticks')
In [18]:
def plot_fitted_line(fit, x, y, xerr=None, yerr=None):
    # Extend the plotting range 10% beyond the data on each side
    xmin = np.min(x)
    xmax = np.max(x)
    xmin, xmax = (xmin - 0.1*(xmax-xmin)), (xmax + 0.1*(xmax-xmin))
    xs = linspace(xmin, xmax, 100)
    # Evaluate the line at each posterior draw of (m, b)
    ys = []
    for m, b in zip(fit.posterior.m.values.flatten(), fit.posterior.b.values.flatten()):
        ys.append(m*xs + b)
    ys = array(ys)
    # Posterior median line with 68% and 95% credible bands
    l, = plot(xs, median(ys, axis=0))
    fill_between(xs, percentile(ys, 84, axis=0), percentile(ys, 16, axis=0), color=l.get_color(), alpha=0.25)
    fill_between(xs, percentile(ys, 97.5, axis=0), percentile(ys, 2.5, axis=0), color=l.get_color(), alpha=0.25)
    # Plot the data
    errorbar(x, y, xerr=xerr, yerr=yerr, color='k', fmt='.')

def plot_inferred_ys_noscatter(fit, x):
    # Posterior mean and standard deviation of the fitted line at each observed x
    ys = []
    for m, b in zip(fit.posterior.m.values.flatten(), fit.posterior.b.values.flatten()):
        ys.append(m*x + b)
    ys = array(ys)
    errorbar(x, mean(ys, axis=0), yerr=std(ys, axis=0), fmt='.')
Here is the Hogg, Bovy & Lang (2010) data set (though they are normally careful about such things, they neglected to publish the data separately, so I had to scrape it from the PDF):
In [3]:
hogg_data = pd.read_csv(StringIO("""ID x y sigma_x sigma_y rho_xy
1 201 592 61 9 -0.84
2 244 401 25 4 0.31
3 47 583 38 11 0.64
4 287 402 15 7 -0.27
5 203 495 21 5 -0.33
6 58 173 15 9 0.67
7 210 479 27 4 -0.02
8 202 504 14 4 -0.05
9 198 510 30 11 -0.84
10 158 416 16 7 -0.69
11 165 393 14 5 0.30
12 201 442 25 5 -0.46
13 157 317 52 5 -0.03
14 131 311 16 6 0.50
15 166 400 34 6 0.73
16 160 337 31 5 -0.52
17 186 423 42 9 0.90
18 125 334 26 8 0.40
19 218 533 16 6 -0.78
20 146 344 22 5 -0.56
"""), sep='\t')
hogg_data
Out[3]:
In [5]:
errorbar(hogg_data['x'], hogg_data['y'], yerr=hogg_data['sigma_y'], fmt='.')
Out[5]:
In [6]:
igood = 4  # skip the first four rows (IDs 1-4), which are the outliers
In [7]:
errorbar(hogg_data['x'][igood:], hogg_data['y'][igood:], yerr=hogg_data['sigma_y'][igood:], fmt='.')
Out[7]:
In [8]:
model = pystan.StanModel(file='../stan/linear.stan')
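The contents of ../stan/linear.stan are not reproduced here. A minimal sketch of what such a model might contain, assuming the data names (nobs, xobs, yobs, sigma_y) and parameter names (m, b) used in the cells below (the actual file may differ, e.g. in its priors):

linear_code = """
data {
  int<lower=0> nobs;
  vector[nobs] xobs;
  vector[nobs] yobs;
  vector<lower=0>[nobs] sigma_y;
}
parameters {
  real m;
  real b;
}
model {
  // Observational likelihood: y scatters about the line
  // with the reported per-point uncertainties
  yobs ~ normal(m * xobs + b, sigma_y);
}
"""
# model = pystan.StanModel(model_code=linear_code)  # equivalent to compiling from file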
In [9]:
data = {
'nobs': len(hogg_data['x'])-igood,
'xobs': hogg_data['x'][igood:],
'yobs': hogg_data['y'][igood:],
'sigma_y': hogg_data['sigma_y'][igood:]
}
In [10]:
fit = model.sampling(data=data)
In [13]:
az.plot_trace(fit, var_names=['m', 'b'])
Out[13]:
In [14]:
az.effective_sample_size(fit, var_names=['m', 'b'])
Out[14]:
In [16]:
fit = az.convert_to_inference_data(fit)
In [21]:
plot_fitted_line(fit, hogg_data['x'][igood:], hogg_data['y'][igood:], yerr=hogg_data['sigma_y'][igood:])
In [22]:
model_scatter = pystan.StanModel(file='../stan/linear_scatter.stan')
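Likewise, a plausible sketch of ../stan/linear_scatter.stan, assuming it adds an intrinsic-scatter parameter sigma_int (the name that appears in the trace plot below) in quadrature with the reported uncertainties:

scatter_code = """
data {
  int<lower=0> nobs;
  vector[nobs] xobs;
  vector[nobs] yobs;
  vector<lower=0>[nobs] sigma_y;
}
parameters {
  real m;
  real b;
  real<lower=0> sigma_int;  // intrinsic scatter about the line
}
model {
  // Reported uncertainty and intrinsic scatter add in quadrature
  yobs ~ normal(m * xobs + b, sqrt(square(sigma_y) + square(sigma_int)));
}
"""
# model_scatter = pystan.StanModel(model_code=scatter_code)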
In [23]:
fit_scatter = model_scatter.sampling(data=data)
In [24]:
az.plot_trace(fit_scatter, var_names=['m', 'b', 'sigma_int'])
Out[24]:
In [25]:
fit_scatter = az.convert_to_inference_data(fit_scatter)
In [27]:
plot_fitted_line(fit_scatter, hogg_data['x'][igood:], hogg_data['y'][igood:], yerr=hogg_data['sigma_y'][igood:])
Incorporate the uncertainty in the measurement of $x$ from Hogg, Bovy & Lang (2010) into your model. You will need to introduce a variable $x_\mathrm{true}$ (just like our $y_\mathrm{true}$) and include an observational likelihood for its values. Make plots of your fits similar to those we have produced. Is your model reasonable? (Remember to fit data points 5 to 20, as the first four points are outliers!)
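One way to structure this model (a sketch only; it ignores the reported $x$-$y$ correlations $\rho_{xy}$, and you would need to add sigma_x to the data dictionary):

xerr_code = """
data {
  int<lower=0> nobs;
  vector[nobs] xobs;
  vector[nobs] yobs;
  vector<lower=0>[nobs] sigma_x;
  vector<lower=0>[nobs] sigma_y;
}
parameters {
  real m;
  real b;
  vector[nobs] x_true;  // latent true x values
}
model {
  xobs ~ normal(x_true, sigma_x);          // observational likelihood for x
  yobs ~ normal(m * x_true + b, sigma_y);  // y is predicted from the true x
}
"""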
Can you write a version of this model that uses a mixture model to fit out the outliers, as we did above?
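One possible shape for the mixture (a sketch with illustrative names: theta is the probability that a point comes from the linear foreground, and the outliers are modelled as a broad Gaussian background in y):

mix_code = """
data {
  int<lower=0> nobs;
  vector[nobs] xobs;
  vector[nobs] yobs;
  vector<lower=0>[nobs] sigma_y;
}
parameters {
  real m;
  real b;
  real<lower=0, upper=1> theta;  // fraction of "good" points
  real mu_bg;                    // background mean
  real<lower=0> sigma_bg;        // background scatter
}
model {
  for (i in 1:nobs) {
    target += log_mix(theta,
                      normal_lpdf(yobs[i] | m * xobs[i] + b, sigma_y[i]),
                      normal_lpdf(yobs[i] | mu_bg, sigma_bg));
  }
}
"""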
Can you extend your model to account for either intrinsic scatter or inaccuracy in the measurement uncertainties reported in both dimensions, as we did before? Could you account for both effects in this data set?
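For the two-dimensional version, one option (again only a sketch, and not the only sensible parameterization) is to broaden each dimension's likelihood in the x-error model above in quadrature:

both_dims_fragment = """
  // additional parameters:
  real<lower=0> sigma_int_x;
  real<lower=0> sigma_int_y;
  // broadened likelihoods in the model block:
  xobs ~ normal(x_true, sqrt(square(sigma_x) + square(sigma_int_x)));
  yobs ~ normal(m * x_true + b, sqrt(square(sigma_y) + square(sigma_int_y)));
"""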
Cosmology! Download the Type Ia supernova dataset from Scolnic, et al. (2017); you can find it here. It gives measurements of the redshift (assumed perfectly measured) and distance modulus (imperfectly measured) for some hundreds of supernovae from the PanSTARRS survey. Recall that the distance modulus is $$ \mu = 5 \log_{10} \left( \frac{d_L}{10 \, \mathrm{pc}} \right) $$ and the luminosity distance (see Hogg (1999)) is given in terms of the cosmological parameters of a flat universe by $$ d_L\left( z \mid H_0, \Omega_M, w\right) = \frac{c}{H_0} \int_{0}^z \frac{\mathrm{d} z'}{\sqrt{\Omega_M \left( 1 + z' \right)^3 + \left( 1 - \Omega_M \right) \left( 1 + z' \right)^{3(1+w)}}}. $$
This is a non-linear version of the line fitting problem: we have a perfectly-measured redshift; we use cosmology---rather than a linear relation---to predict the luminosity distance, which predicts the distance modulus. We then compare to the observed distance modulus (plus uncertainty!) to constrain the cosmology. You can either learn about how to do integrals like the above in Stan (see here; but you will need to install a newer version of Stan and do some fancy coding---this is expert-mode!) or you can use the following rational function approximation to $d_L$ that is good to $z \simeq 1.5$ or so: $$ d_L = \frac{c}{H_0} \frac{z + z^2\frac{3 - 10 w +3 w^2 + 10 w \Omega_M + 6 w^2 \Omega_M - 9 w^2 \Omega_M^2}{4 \left( 1 - 3 w + 3 w \Omega_M \right)}}{1 + z \frac{1 - 2 w -3 w^2 +2 w \Omega_M + 12 w^2 \Omega_M - 9 w^2 \Omega_M^2}{2 \left( 1 - 3 w + 3 w \Omega_M\right)}} $$
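A direct transcription of this approximation into Python, handy for checking your Stan implementation (with $c$ in km/s and $H_0$ in km/s/Mpc, the result is in Mpc):

def dl_approx(z, H0, Om, w):
    # Rational-function approximation to d_L, good to z ~ 1.5
    c = 299792.458  # speed of light, km/s
    num = (3 - 10*w + 3*w**2 + 10*w*Om + 6*w**2*Om - 9*w**2*Om**2) / (4*(1 - 3*w + 3*w*Om))
    den = (1 - 2*w - 3*w**2 + 2*w*Om + 12*w**2*Om - 9*w**2*Om**2) / (2*(1 - 3*w + 3*w*Om))
    return (c/H0) * (z + num*z**2) / (1 + den*z)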
Recall that there are some limits on variables: $0 \leq \Omega_M < 1$, and $0 < H_0$. So you will need to declare these in the Stan parameters block:
real<lower=0,upper=1> Om;
real<lower=0> H0;
It is up to you whether you want to encode the weak energy condition $w > -1$ or not.
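Putting the pieces together, one possible shape for the whole model using the rational approximation (a sketch; it assumes the data arrive as redshifts z and observed distance moduli mu_obs with uncertainties sigma_mu, and all names are illustrative):

cosmo_code = """
data {
  int<lower=0> nobs;
  vector[nobs] z;
  vector[nobs] mu_obs;
  vector<lower=0>[nobs] sigma_mu;
}
parameters {
  real<lower=0, upper=1> Om;
  real<lower=0> H0;  // km/s/Mpc
  real w;            // use real<lower=-1> w; to impose the weak energy condition
}
model {
  vector[nobs] mu_model;
  real num;
  real den;
  num = (3 - 10*w + 3*w^2 + 10*w*Om + 6*w^2*Om - 9*w^2*Om^2) / (4*(1 - 3*w + 3*w*Om));
  den = (1 - 2*w - 3*w^2 + 2*w*Om + 12*w^2*Om - 9*w^2*Om^2) / (2*(1 - 3*w + 3*w*Om));
  for (i in 1:nobs) {
    real dl = (299792.458 / H0) * (z[i] + num * z[i]^2) / (1 + den * z[i]);  // Mpc
    mu_model[i] = 5 * log10(dl) + 25;  // 10 pc = 1e-5 Mpc
  }
  mu_obs ~ normal(mu_model, sigma_mu);
}
"""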
The paper Pearson, et al. (2015) tried to calibrate a large number of "mass proxies" for galaxy groups based on simulations. One of the data sets can be found here.
You can, of course, take each proxy one-by-one in a linear fit similar to those we have been doing. But a better approach is to imagine that the true mass and proxies are drawn from a multivariate normal distribution. Stan has facilities for modelling the mean vector and covariance matrix of this multivariate normal (see here and here or here). Write down a model that has (log) true mass and some number of proxies drawn from such an MVN, and then constrains the parameters of this MVN using some number of observations with uncertainty. Fit it to the data in the datasets directory.
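A sketch of the core of such a model, using Stan's Cholesky-factor correlation machinery (all names illustrative; the first component of each vector is log mass and the rest are the proxies, observed with Gaussian uncertainties sigma_obs):

mvn_code = """
data {
  int<lower=0> nobs;
  int<lower=1> ndim;                      // 1 (log mass) + number of proxies
  vector[ndim] obs[nobs];                 // observed log mass and proxies
  vector<lower=0>[ndim] sigma_obs[nobs];  // their uncertainties
}
parameters {
  vector[ndim] mu;                // mean of the common distribution
  cholesky_factor_corr[ndim] L;   // correlation matrix (Cholesky factor)
  vector<lower=0>[ndim] tau;      // per-dimension scales
  vector[ndim] x_true[nobs];      // latent true values
}
model {
  L ~ lkj_corr_cholesky(1.0);     // uniform prior over correlation matrices
  // (you will want priors on mu and tau too)
  for (i in 1:nobs) {
    x_true[i] ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L));
    obs[i] ~ normal(x_true[i], sigma_obs[i]);
  }
}
"""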
Now extend your model to incorporate some number of observations of only the proxy values (i.e. do not include a term in the likelihood for log mass in this subset of observations); the model can still predict masses from the MVN using the proxy observations and the inferred common distribution.
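The extension only changes the likelihood: proxy-only points still get latent vectors drawn from the MVN, but only their proxy components appear in the observational likelihood. A sketch of the extra terms (illustrative names; obs_prox[i] holds the ndim-1 proxy observations for the i-th proxy-only point):

proxy_only_fragment = """
  for (i in 1:nobs_proxy_only) {
    x_true_p[i] ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L));
    obs_prox[i] ~ normal(x_true_p[i][2:ndim], sigma_prox[i]);  // no log-mass term
  }
"""

The first components of the x_true_p vectors are then the model's mass predictions for these points.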
Comment on the "meaning" of your model; in particular, what are you assuming about the proxies in the "training" set compared to the "observation" set?