Logistics

We are going to use parallel tempering, implemented via the Python emcee package, to explore our posterior, which consists of the set of distances and gas-to-dust conversion coefficients for the six velocity slices toward the center of the Cepheus molecular cloud. Since we need to explore a 12-dimensional parameter space, we are going to use 50 walkers, each taking 10,000 steps, at 5 different temperatures. If you would like to change these parameters, simply edit "nwalkers", "ntemps", and "nsteps" in the cell below. However, we are only going to keep the lowest-temperature chain ($\beta=1$) for analysis. Since the sampler.chain attribute of PTSampler is an array of shape (Ntemps, Nwalkers, Nsteps, Ndim), the samples for all walkers, steps, and dimensions at $\beta=1$ correspond to sampler.chain[0,:,:,:]. To move to a lower value of $\beta$ (a hotter chain), simply increase the index along the first dimension. For more information on how PTSampler works, see http://dan.iel.fm/emcee/current/user/pt/.

We will set off our walkers in a Gaussian ball around a) the kinematic distance estimates for the Cepheus molecular cloud given by a flat rotation curve from Leroy & Rosolowsky 2006 and b) the gas-to-dust coefficient given by the literature, perturbing each walker with a small amount of Gaussian noise. You can edit the starting positions of the walkers by editing the "result_dist" and "result_coeff" variables below. We are going to discard the first half of every walker's chain as burn-in.
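As a minimal sketch of that chain indexing (the array below is just a stand-in for sampler.chain, with a shortened number of steps purely for illustration):

import numpy as np

# Stand-in for sampler.chain, which has shape (Ntemps, Nwalkers, Nsteps, Ndim)
ntemps, nwalkers, nsteps, ndim = 5, 50, 100, 12
chain = np.zeros((ntemps, nwalkers, nsteps, ndim))

cold_chain = chain[0, :, :, :]               # the beta = 1 chain we keep for analysis
hotter_chain = chain[1, :, :, :]             # next temperature up (smaller beta)
post_burnin = cold_chain[:, nsteps//2:, :]   # discard the first half of each walker's chain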

Setting up the positional arguments for PTSampler

We need to feed PTSampler the required positional arguments for the log_likelihood and log_prior functions. We do this using the fetch_args function from the io module, which creates an instance of the pixclass object that holds our data and metadata. fetch_args accepts three arguments:

  • The h5 filenames containing your data; in our case, 10 HEALPix nside 128 pixels centered around (l,b)=(109.75, 13.75), which cover a total area of ~2 sq. deg.
  • The prior bounds you want to impose on the distances (a flat prior) and the standard deviation you'd like for the log-normal prior on the conversion coefficients. The distance bounds must lie between 4 and 19, because that's the distance modulus range of our stellar posterior array. The bounds must be given in the format [lowerbound_distance, upperbound_distance, sigma]
  • The gas-to-dust coefficient you'd like to use, given as a float; for this tutorial, we are pulling a value from the literature of 0.06 magnitudes/K. This value is then multiplied by the set of c coefficients we're determining as part of the parameter estimation problem.

fetch_args will then return the correct arguments for the log_likelihood and log_prior functions within the model module.

Here we go!


In [1]:
import emcee
from dustcurve import model
import seaborn as sns
import numpy as np
from dustcurve import pixclass
import matplotlib.pyplot as plt
import pandas as pd
import warnings
from dustcurve import io
from dustcurve import hputils
from dustcurve import kdist
%matplotlib inline

#bounds for prior function, in form [lower_dist, upper_dist, sigma] 
bounds=[4,19,0.001]

#ratio for gas-to-dust coefficient
ratio=0.06

#suppress obnoxious deprecation warning that doesn't affect output
warnings.filterwarnings("ignore", category=Warning, module="emcee")

#our pixels of choice
indices=hputils.list_indices(128,(109.0,110.5,13.0,14.5)).astype(str)
fnames=[str(i)+'.h5' for i in indices]

#fetch the required likelihood and prior arguments for PTSampler
io.fetch_args(fnames)

# the model has 12 parameters; we'll use 50 walkers, 10000 steps each, at 5 different temps
ndim=12
nslices=6
nwalkers = 50
nsteps = 10000
ntemps=5

#setting off the walkers at the kinematic distance given by the literature, assuming a flat rotation curve, theta=220 km/s, R=8.5 kpc
#Details on rotation curve given in Rosolowsky and Leroy 2006
vslices=np.linspace(-11.7,-5.2,nslices)
klong=np.ones(nslices)*109.75
klat=np.ones(nslices)*13.75
#convert the kinematic distances (in pc) to distance moduli: mu = 5*log10(d) - 5
kin_dist=kdist.kdist(klong,klat,vslices)
kdistmod=5*np.log10(kin_dist)-5



#slightly perturb the starting positions for each walker, in a Gaussian ball centered on the kinematic distance moduli and coefficient values above
#distances are perturbed with unit-variance Gaussian noise; coefficients with noise of standard deviation 0.5
result_dist=kdistmod.tolist()
result_coeff= [1.0 for i in range (nslices)]
starting_positions_dist=np.array([[result_dist + np.random.randn(nslices) for i in range(nwalkers)] for j in range(ntemps)]).clip(4,19)
starting_positions_coeff=np.array([[result_coeff + 0.5*np.random.randn(nslices) for i in range(nwalkers)] for j in range(ntemps)]).clip(0)
starting_positions=np.concatenate((starting_positions_dist,starting_positions_coeff), axis=2)

#set up the sampler object
sampler = emcee.PTSampler(ntemps, nwalkers, ndim, model.log_likelihood, model.log_prior, loglargs=[ratio], logpargs=[np.array([bounds])],threads=10)

print("Setup complete")


Setup complete

In [2]:
# run the sampler and time how long it takes
%time sampler.run_mcmc(starting_positions, nsteps) 
print('Sampler Done')


CPU times: user 1min 47s, sys: 7.5 s, total: 1min 55s
Wall time: 2h 32min 18s
Sampler Done

The sampler is done running, so now let's check out the results. We are going to print out our mean acceptance fraction across all walkers for the coldest temperature chain.

We are also going to discard the first half of each walker's chain as burn-in; to change the number of steps to burn off, simply edit the step index n in sampler.chain[0,:,n:,:]. Next, we are going to compute and print out the 50th, 16th, and 84th percentiles of the chains for each parameter, using the quantile method of a pandas DataFrame. The 50th percentile represents our best guess for each parameter, the difference between the 50th and 16th percentiles gives us a lower limit, and the difference between the 84th and 50th percentiles gives us an upper limit:
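If you'd rather skip pandas, the same numbers can be pulled with np.percentile; a minimal sketch, equivalent to the pandas approach in the cell below and assuming traces_cold has shape (ndim, nsamples):

# Sketch: compute the 16th/50th/84th percentiles with NumPy instead of pandas
q16, q50, q84 = np.percentile(traces_cold, [16, 50, 84], axis=1)
upperlim = q84 - q50   # upper error bar
lowerlim = q50 - q16   # lower error bar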


In [3]:
#Extract the coldest [beta=1] temperature chain from the sampler object; discard first half of samples as burnin
samples_cold = sampler.chain[0,:,int(.5*nsteps):,:]
traces_cold = samples_cold.reshape(-1, ndim).T


#check out acceptance fraction:
print("Our mean acceptance fraction for the coldest chain is %.2f" % np.mean(sampler.acceptance_fraction[0]))

#find best fit values for each of the 12 parameters (6 d's and 6 c's)
theta=pd.DataFrame(traces_cold)
quantile_50=theta.quantile(.50, axis=1).values
quantile_84=theta.quantile(.84, axis=1).values
quantile_16=theta.quantile(.16, axis=1).values

upperlim=quantile_84-quantile_50
lowerlim=quantile_50-quantile_16

#print out distances
for i in range(0,int(len(quantile_50)/2)):
    print('d%i: %.3f + %.3f - %.3f' % (i+1,quantile_50[i],upperlim[i], lowerlim[i]))

#print out coefficients
for i in range(int(len(quantile_50)/2), int(len(quantile_50))):
    print('c%i: %.3f + %.3f - %.3f' % (i+1-int(len(quantile_50)/2),quantile_50[i],upperlim[i], lowerlim[i]))


Our mean acceptance fraction for the coldest chain is 0.15
d1: 8.439 + 0.042 - 0.044
d2: 7.061 + 0.043 - 0.044
d3: 8.440 + 0.041 - 0.044
d4: 8.440 + 0.039 - 0.041
d5: 7.074 + 0.050 - 0.048
d6: 5.647 + 0.094 - 0.110
c1: 1.019 + 0.000 - 0.001
c2: 1.016 + 0.001 - 0.001
c3: 1.008 + 0.001 - 0.001
c4: 1.004 + 0.001 - 0.001
c5: 1.002 + 0.001 - 0.001
c6: 1.006 + 0.001 - 0.001

Let's see what our chains look like by producing trace plots:


In [12]:
#set up subplots for chain plotting
fig, axes = plt.subplots(ndim, figsize=(10,60))

plt.tight_layout()

for i in range(0,ndim):
    if i<int(ndim/2):
        axes[i].set(ylabel='d%i' % (i+1))
    else:
        axes[i].set(ylabel='c%i' % (i-5))

#plot traces for each parameter
for i in range(0,ndim):
    sns.tsplot(traces_cold[i],ax=axes[i])


Now we are going to use the seaborn distplot function to plot histograms of the post-burn-in traces (the last half of each chain) for each parameter.


In [13]:
#set up subplots for histogram plotting
fig, axes = plt.subplots(ndim, figsize=(10,60))

plt.tight_layout()

for i in range(0,ndim):
    if i<int(ndim/2):
        axes[i].set(ylabel='d%i' % (i+1))
    else:
        axes[i].set(ylabel='c%i' % (i-5))

#plot histograms for each parameter
for i in range(0,ndim):
    sns.distplot(traces_cold[i],ax=axes[i],hist=True,norm_hist=False)



Now let's overplot the reddening profiles corresponding to our most probable parameters on top of the stacked stellar posterior surfaces. We show two options. In the first plot, the surfaces have been normalized so that a) each individual stellar posterior array sums to one and b) each distance column in the stacked posterior array contains the same amount of "ink." The second plot is normalized only so that each individual stellar posterior array sums to one. Each colored reddening profile corresponds to a single pixel in the Dame et al. 2001 CO spectral cube. There are stars in ~175 Dame et al. 2001 CO pixels in our region of interest.
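The normalization itself happens inside plot_posterior; the sketch below just illustrates the idea behind the first option, assuming post_all has shape (nstars, n_reddening, n_distance) as built in the cell that follows:

# Sketch of the normalization described above (the real work happens in plot_posterior)
per_star = post_all / post_all.sum(axis=(1, 2), keepdims=True)   # each stellar posterior sums to one
stacked = per_star.sum(axis=0)                                   # stack the normalized surfaces
equal_ink = stacked / stacked.sum(axis=0, keepdims=True)         # each distance column gets the same "ink"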


In [15]:
from dustcurve import pixclass

post_all=np.empty((0,700,120))

for file in fnames:
    pixObj=pixclass.PixStars('/n/fink1/czucker/Data/'+file)
    post_array=np.asarray(pixObj.get_p()[:,:,:])
    post_all=np.vstack((post_all,post_array))
        
from dustcurve import plot_posterior
from dustcurve import globalvars as gv

#plot the reddening profiles over the stacked, normalized stellar posterior surfaces
#first call uses normcol=True (equal "ink" in each distance column); the second uses normcol=False
plot_posterior.plot_posterior(np.asarray(post_all),np.linspace(4,19,120),np.linspace(0,7,700),quantile_50,ratio,gv.unique_co,y_range=[0,2],vmax=0.03,normcol=True)
plot_posterior.plot_posterior(np.asarray(post_all),np.linspace(4,19,120),np.linspace(0,7,700),quantile_50,ratio,gv.unique_co,y_range=[0,2],vmax=10.0,normcol=False)


Now we want to see how similar the parameters at different steps are. To do this, we draw one thousand random samples from the last half of the chain and plot the reddening profiles corresponding to those parameters in light blue. Then, we plot the "best fit" reddening profile corresponding to the 50th-percentile parameters (essentially the median of the last half of the chains). We normalize the surfaces in the same two ways as above.
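The random draws are handled inside plot_samples, but if you want to pull them out yourself, a minimal sketch (assuming traces_cold holds the post-burn-in samples with shape (ndim, nsamples), as above):

# Sketch: draw 1000 random parameter vectors from the post-burn-in samples
ndraws = 1000
idx = np.random.randint(traces_cold.shape[1], size=ndraws)
random_draws = traces_cold[:, idx]   # shape (ndim, ndraws), one parameter vector per column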


In [16]:
from dustcurve import plot_posterior
from dustcurve import globalvars as gv
import numpy as np

plot_posterior.plot_samples(np.asarray(post_all),np.linspace(4,19,120),np.linspace(0,7,700),quantile_50,traces_cold,ratio,gv.unique_co,y_range=[0,2],vmax=0.03,normcol=True)
plot_posterior.plot_samples(np.asarray(post_all),np.linspace(4,19,120),np.linspace(0,7,700),quantile_50,traces_cold,ratio,gv.unique_co,y_range=[0,2],vmax=20,normcol=False)


Now let's save the results to a file!


In [17]:
import h5py
#Save the results of the sampler:
#our pixels of choice
output='2degrees_june27_0.001sig.h5'

fwrite = h5py.File("/n/fink1/czucker/Output/"+str(output), "w")
chaindata = fwrite.create_dataset("/chains", sampler.chain.shape, dtype='f')
chaindata[:,:,:,:]=sampler.chain
    
probdata = fwrite.create_dataset("/probs", sampler.lnprobability.shape, dtype='f')
probdata[:,:,:]=sampler.lnprobability


fwrite.close()
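If you later want to read the chains back in, a minimal sketch using the same h5py calls (the dataset names match those written above):

# Sketch: re-load the saved chains and log-probabilities for later analysis
fread = h5py.File("/n/fink1/czucker/Output/"+str(output), "r")
saved_chains = fread["/chains"][:]   # shape (ntemps, nwalkers, nsteps, ndim)
saved_probs = fread["/probs"][:]     # shape (ntemps, nwalkers, nsteps)
fread.close()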
