Logistics

We are going to use parallel tempering, implemented via the Python emcee package, to explore our posterior, which consists of the set of distances and gas-to-dust conversion coefficients for the twelve velocity slices. Since we need to explore a 24-dimensional parameter space, we are going to use 50 walkers, 20,000 steps each, at 5 different temperatures. If you would like to edit these parameters, simply edit "nwalkers", "ntemps", and "nsteps" in the cell below. However, we are only going to keep the lowest temperature chain ($\beta=1$) for analysis. Since the sampler.chain object from PTSampler returns an array of shape (Ntemps, Nwalkers, Nsteps, Ndim), the samples for all walkers, steps, and dimensions at $\beta=1$ correspond to sampler.chain[0,:,:,:]. To decrease your value of $\beta$, simply increase the index for the first dimension. For more information on how PTSampler works, see http://dan.iel.fm/emcee/current/user/pt/.

We will set off our walkers in a Gaussian ball around a) the kinematic distance estimates for the Cepheus molecular cloud given by a flat rotation curve from Leroy & Rosolowsky 2006 and b) the gas-to-dust coefficient given by the literature. We perturb the walkers with Gaussian draws of mean 0 and variance 1. You can edit the starting positions of the walkers by editing the "result" variable below. We are going to discard the first half of every walker's chain as burn-in.
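As a minimal sketch of the chain indexing and burn-in trimming described above (using a small placeholder array with the same axis order as PTSampler's sampler.chain; the real extraction happens in a later cell):

import numpy as np

ntemps, nwalkers, nsteps, ndim = 5, 50, 200, 24        # small placeholder; the real run uses 20,000 steps
chain = np.zeros((ntemps, nwalkers, nsteps, ndim))     # same shape convention as sampler.chain
cold_chain = chain[0, :, :, :]                         # keep only the beta=1 (coldest) temperature
post_burn = cold_chain[:, nsteps//2:, :]               # discard the first half of every walker's chain as burn-in
flat_samples = post_burn.reshape(-1, ndim)             # pool all walkers and steps into one sample set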

Setting up the positional arguments for PTSampler

We need to feed PTSampler the required positional arguments for the log_likelihood and log_prior functions. We do this using the fetch_args function from the io module, which creates an instance of the PixStars class (from the pixclass module) to hold our data and metadata. fetch_args accepts three arguments:

  • A string specifying the HDF5 filename containing your data, in our case 359992.h5
  • The prior bounds you want to impose on a) the distances (flat prior), in units of distance modulus, and b) the gas-to-dust coefficients (also flat prior); in the script, the sampled coefficients act as prefactors multiplying the literature value you supply. The bounds are given as a list in the format [lower_distance, upper_distance, lower_coefficient, upper_coefficient]. The distance bounds must lie between 4 and 19, because that is the distance modulus range of our stellar posterior array.
  • The gas-to-dust coefficient you'd like to use, given as a float; for this tutorial, we are pulling a value from the literature of 0.06 magnitudes/K. This value is then multiplied by the set of c coefficients we're determining as part of the parameter estimation problem.

fetch_args will then return the correct arguments for the log_likelihood and log_prior functions within the model module; a minimal usage sketch follows.
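For concreteness, here is that sketch (the call matches the cell below; the unpacking of ldata into (unique_co, indices, unique_post, ratio) is the same one used later in this notebook):

from dustcurve import io

# filename, [lower_distance, upper_distance, lower_coefficient, upper_coefficient], literature gas-to-dust value
ldata, pdata = io.fetch_args('359992.h5', [4, 19, 0, 100], 0.06)
unique_co, indices, unique_post, ratio = ldata   # positional arguments consumed by model.log_likelihood
# pdata holds the corresponding arguments for model.log_prior (passed to PTSampler via logpargs)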

Here we go!


In [1]:
import emcee
from dustcurve import model
import seaborn as sns
import numpy as np
from dustcurve import pixclass
import matplotlib.pyplot as plt
import pandas as pd
import warnings
from dustcurve import io
from dustcurve import hputils
from dustcurve import kdist
%matplotlib inline

#this code pulls snippets from the PHYS201 week 9 MCMC notebook written by Vinny Manoharan and the PHYS201 L9 solutions,
#written by Tom Dimiduk and Kevin Shain

#suppress obnoxious deprecation warning that doesn't affect output
warnings.filterwarnings("ignore", category=Warning, module="emcee")

#our pixel of choice
fnames='359992.h5'

#fetch the required likelihood and prior arguments for PTSampler
ldata,pdata=io.fetch_args(fnames,[4,19,0,100],0.06)

# the model has 24 parameters; we'll use 50 walkers, 20,000 steps each, at 5 different temps
ndim=24
nslices=12
nwalkers = 50
nsteps = 20000
ntemps=5

#setting off the walkers at the kinematic distance given by the literature, assuming a flat rotation curve, theta=220 km/s, R=8.5 kpc
#Details on rotation curve given in Rosolowsky and Leroy 2006
vslices=np.linspace(-15.6,-1.3,12)
klong=np.ones(12)*hputils.pix2lb_scalar(256,int(fnames[:-3]))[0]
klat=np.ones(12)*hputils.pix2lb_scalar(256,int(fnames[:-3]))[1]
kin_dist=kdist.kdist(klong,klat,vslices)   #rename so we don't shadow the kdist module
kdistmod=5*np.log10(kin_dist)-5            #convert kinematic distances to distance moduli

#slightly perturb the starting positions for each walker, in a ball centered around result
result=kdistmod.tolist()
result.extend(1.0 for i in range (nslices))
starting_positions = [[result + np.random.randn(ndim) for i in range(nwalkers)] for j in range(ntemps)]

#set up the sampler object
sampler = emcee.PTSampler(ntemps, nwalkers, ndim, model.log_likelihood, model.log_prior, loglargs=(ldata), logpargs=[pdata])

#burn in, and save final positions for all parameters, which we'll then set off our walkers at for the "real" thing
post_burn_pos, prob, state = sampler.run_mcmc(starting_positions, 300)
sampler.reset()

print("Setup complete")


Total number of stars used in analysis: 765
Setup complete

In [2]:
# run the sampler and time how long it takes
%time sampler.run_mcmc(post_burn_pos, nsteps) #, thin=thin_int)
print('Sampler Done')


CPU times: user 57min 47s, sys: 306 ms, total: 57min 48s
Wall time: 57min 48s
Sampler Done

The sampler is done running, so now let's check out the results. We are going to print out our mean acceptance fraction across all walkers for the coldest temperature chain.

We are going to discard the first half of each walker's chain as burn-in; to change the number of steps discarded, simply edit the step index n in sampler.chain[0,:,n:,:] to your desired value. Next, we are going to compute and print out the 16th, 50th, and 84th percentiles of the chain for each parameter, using the quantile method of a pandas DataFrame: the 50th percentile represents our best guess for each parameter, the difference between the 16th and 50th percentiles gives a lower uncertainty, and the difference between the 50th and 84th percentiles gives an upper uncertainty. Finally, we are going to plot what the chains look like for all distances and conversion coefficients.


In [3]:
#Extract the coldest [beta=1] temperature chain from the sampler object; discard first half of samples as burnin
samples_cold = sampler.chain[0,:,int(nsteps/2):,:]
traces_cold = samples_cold.reshape(-1, ndim).T

#check out acceptance fraction:
print("Our mean acceptance fraction for the coldest chain is %.2f" % np.mean(sampler.acceptance_fraction[0]))

#find best fit values for each of the 24 parameters (12 d's and 12 c's)
theta=pd.DataFrame(traces_cold)
quantile_50=theta.quantile(.50, axis=1).values
quantile_84=theta.quantile(.84, axis=1).values
quantile_16=theta.quantile(.16, axis=1).values

upperlim=quantile_84-quantile_50
lowerlim=quantile_50-quantile_16

#print out distances
for i in range(0,int(len(quantile_50)/2)):
    print('d%i: %.3f + %.3f - %.3f' % (i+1,quantile_50[i],upperlim[i], lowerlim[i]))

#print out coefficients
for i in range(int(len(quantile_50)/2), int(len(quantile_50))):
    print('c%i: %.3f + %.3f - %.3f' % (i+1-int(len(quantile_50)/2),quantile_50[i],upperlim[i], lowerlim[i]))


Our mean acceptance fraction for the coldest chain is 0.04
d1: 9.711 + 1.904 - 0.603
d2: 11.957 + 2.692 - 5.439
d3: 12.299 + 0.538 - 4.661
d4: 9.759 + 0.484 - 1.948
d5: 9.801 + 0.429 - 1.938
d6: 7.839 + 2.497 - 1.359
d7: 7.568 + 4.271 - 2.238
d8: 7.962 + 3.958 - 2.541
d9: 7.501 + 2.519 - 2.581
d10: 6.301 + 1.482 - 1.694
d11: 7.239 + 3.375 - 2.607
d12: 10.200 + 2.002 - 3.782
c1: 3.922 + 2.315 - 2.471
c2: 1.826 + 2.256 - 1.360
c3: 2.717 + 1.387 - 1.284
c4: 3.741 + 1.728 - 1.612
c5: 1.986 + 1.740 - 0.962
c6: 0.779 + 1.087 - 0.563
c7: 1.996 + 1.755 - 1.289
c8: 0.925 + 0.800 - 0.697
c9: 0.562 + 0.681 - 0.426
c10: 0.988 + 0.757 - 0.649
c11: 0.875 + 0.975 - 0.636
c12: 2.858 + 3.700 - 2.070
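Note that the distance values above are distance moduli. To convert a result back to a physical distance, invert the mu = 5*log10(d) - 5 relation we used when initializing the walkers; for example, using the d1 median above:

import numpy as np
mu = 9.711                      # 50th-percentile distance modulus for d1
d_pc = 10**((mu + 5.0) / 5.0)   # roughly 875 pc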

Let's see what our chains look like by producing trace plots:


In [20]:
#set up subplots for chain plotting
fig, axes = plt.subplots(ndim, figsize=(10,60))

plt.tight_layout()

for i in range(0,ndim):
    if i<int(ndim/2):
        axes[i].set(ylabel='d%i' % (i+1))
    else:
        axes[i].set(ylabel='c%i' % (i+1-int(ndim/2)))

#plot traces for each parameter
for i in range(0,ndim):
    sns.tsplot(traces_cold[i],ax=axes[i])


Now let's overplot the reddening profiles corresponding to our most probable parameters on top of the stacked stellar posterior surfaces. This plot is normalized so that a) each individual stellar posterior array sums to one and b) each distance column in the stacked posterior array contains the same amount of "ink".
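For reference, here is a rough sketch of that normalization; it illustrates the idea rather than the actual internals of plot_posterior, and the (n_reddening, n_distance) axis layout of each stellar posterior array is an assumption:

import numpy as np

def stack_and_normalize(post_array):
    # post_array: (nstars, n_reddening, n_distance) per-star posterior grids (assumed layout)
    per_star = post_array / post_array.sum(axis=(1, 2), keepdims=True)   # a) each stellar posterior sums to one
    stacked = per_star.sum(axis=0)                                       # stack the normalized surfaces
    stacked = stacked / stacked.sum(axis=0, keepdims=True)               # b) equal "ink" in every distance column
    return stacked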


In [21]:
from dustcurve import pixclass
pixObj=pixclass.PixStars('/n/fink1/czucker/Data/'+ fnames)
post_array=pixObj.get_p()

unique_co,indices,unique_post,ratio=ldata

from dustcurve import plot_posterior

#plot the reddening profile over the stacked, normalized stellar posterior surfaces  
#normcol=True, normsurf=True
plot_posterior.plot_posterior(np.asarray(post_array),np.linspace(4,19,120),np.linspace(0,7,700),quantile_50,ratio,unique_co,y_range=[0,2],vmax=.03,normcol=True)


Now that we know how our code works, let's run a Gelman-Rubin test. We are going to determine the Gelman-Rubin convergence diagnostic for every parameter. We'll use the same number of walkers (50) and steps (20,000) as before, except now we are running 5 independent chains and checking whether these chains converge for all parameters at the coldest temperature.
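For reference, here is a minimal sketch of the Gelman-Rubin statistic for a single parameter, assuming chains is an (nchains, nsamples) array of post-burn-in samples; diagnostics.run_chains is assumed to compute something equivalent for each of the 24 parameters:

import numpy as np

def gelman_rubin(chains):
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_hat = (1.0 - 1.0 / n) * W + B / n    # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)              # values near 1 indicate convergence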


In [6]:
from dustcurve import diagnostics

gr,chain_ensemble=diagnostics.run_chains(fnames,nwalkers=50,nsteps=20000,runs=5, bounds=[4,19,0,100])
print(gr)


Total number of stars used in analysis: 765
[ 1.00016284  1.03629772  1.06899136  0.99488074  1.07929996  1.06205883
  1.36458192  1.07956806  1.03616228  1.03676851  1.03004837  1.05759881
  1.08172989  1.02692952  1.04061463  1.09071466  1.12370605  0.96801972
  1.01784889  1.00766189  0.98295746  0.96367963  0.96388008  0.99256639]

It looks like most of the parameters are converging (GR diagnostic < 1.1); let's see what they look like side by side for all 5 runs:


In [19]:
#extract the independent runs
traces_cold_0=chain_ensemble[0,:,:]
traces_cold_1=chain_ensemble[1,:,:]
traces_cold_2=chain_ensemble[2,:,:]
traces_cold_3=chain_ensemble[3,:,:]
traces_cold_4=chain_ensemble[4,:,:]

#find best fit values for each of the 24 parameters (12 d's and 12 c's)
theta_0=pd.DataFrame(traces_cold_0)
quantile_50_0=theta_0.quantile(.50, axis=1).values

theta_1=pd.DataFrame(traces_cold_1)
quantile_50_1=theta_1.quantile(.50, axis=1).values

theta_2=pd.DataFrame(traces_cold_2)
quantile_50_2=theta_2.quantile(.50, axis=1).values

theta_3=pd.DataFrame(traces_cold_3)
quantile_50_3=theta_3.quantile(.50, axis=1).values

theta_4=pd.DataFrame(traces_cold_4)
quantile_50_4=theta_4.quantile(.50, axis=1).values

print("Values of each parameter in 5 independent runs with perturbed starting positions")
#print out distances
for i in range(0,int(len(quantile_50_0)/2)):
    print('d%i: %.3f, %.3f ,%.3f, %.3f, %.3f' % (i+1,quantile_50_0[i],quantile_50_1[i], quantile_50_2[i], quantile_50_3[i], quantile_50_4[i]))

#print out coefficients
for i in range(int(len(quantile_50_0)/2), int(len(quantile_50_0))):
    print('c%i: %.3f, %.3f ,%.3f, %.3f, %.3f' % (i+1-int(len(quantile_50_0)/2),quantile_50_0[i],quantile_50_1[i], quantile_50_2[i], quantile_50_3[i], quantile_50_4[i]))


Values of each parameter in 5 independent runs with perturbed starting positions
d1: 9.541, 10.296 ,9.581, 9.605, 9.410
d2: 12.512, 10.700 ,10.271, 11.387, 12.276
d3: 12.019, 12.170 ,7.739, 7.749, 7.701
d4: 9.753, 9.738 ,9.834, 9.719, 9.779
d5: 9.846, 10.168 ,9.827, 9.850, 9.835
d6: 7.743, 7.851 ,7.720, 10.135, 10.224
d7: 7.714, 7.624 ,12.463, 7.788, 7.598
d8: 8.011, 9.569 ,7.849, 8.014, 5.789
d9: 5.364, 7.436 ,6.869, 5.897, 5.315
d10: 6.999, 6.180 ,5.958, 5.197, 5.697
d11: 7.925, 6.485 ,7.389, 7.667, 7.602
d12: 11.003, 10.075 ,9.819, 9.868, 11.677
c1: 3.993, 3.426 ,2.581, 2.539, 4.033
c2: 3.755, 2.799 ,2.569, 2.350, 4.161
c3: 3.316, 4.035 ,2.947, 3.319, 3.141
c4: 3.014, 4.632 ,3.411, 2.512, 3.948
c5: 2.715, 1.244 ,2.435, 3.338, 1.769
c6: 0.868, 1.040 ,0.851, 0.819, 0.931
c7: 1.544, 2.075 ,2.398, 1.056, 1.890
c8: 1.224, 1.599 ,1.641, 1.816, 1.256
c9: 0.517, 0.712 ,0.546, 0.445, 0.498
c10: 0.892, 0.941 ,0.798, 0.939, 0.925
c11: 0.708, 0.631 ,0.641, 0.566, 0.553
c12: 2.835, 3.070 ,2.535, 2.127, 3.647

For reference, let's also print the starting positions we set off our walkers around (the kinematic distance moduli followed by the unit gas-to-dust prefactors):


In [26]:
print(result)


[10.377253433743746, 10.128298567395227, 9.842684999016447, 9.507909950556135, 9.103934749541587, 8.595629319664706, 7.912504707481961, 6.876329221961026, 4.718117798982229, 4.122309952610406, 6.7841708223851, 7.985466620506397, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Now let's save the results to a file!


In [7]:
import h5py
#Save the results of the sampler:
output=fnames

fwrite = h5py.File("/n/fink1/czucker/Output/"+str(output), "w")
chaindata = fwrite.create_dataset("/chains", sampler.chain.shape, dtype='f')
chaindata[:,:,:,:]=sampler.chain
    
probdata = fwrite.create_dataset("/probs", sampler.lnprobability.shape, dtype='f')
probdata[:,:,:]=sampler.lnprobability

gr=fwrite.create_dataset("/gr", chain_ensemble.shape, dtype='f')
gr[:,:,:]=chain_ensemble

fwrite.close()
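If you want to load these results back in later, the saved datasets can be read with h5py in the same way (assuming the same output path):

import h5py

with h5py.File("/n/fink1/czucker/Output/" + str(output), "r") as fread:
    chains = fread["/chains"][:]    # shape (ntemps, nwalkers, nsteps, ndim)
    lnprobs = fread["/probs"][:]    # log-probabilities from the sampler
    gr_chains = fread["/gr"][:]     # chain ensemble saved from the Gelman-Rubin runs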
