Shifty Lines: Line Detection and Doppler Shift Estimation

We have some X-ray spectra that have absorption and emission lines in it. The original spectrum is seen through a stellar wind, which moves either toward or away from us, Doppler-shifting the absorbed lines. Not all lines will be absorbed, some may be at their original position. There may also be more than one redshift in the same spectrum. There are various other complications: for example, we rarely have the complete spectrum, but separate intervals of it (due to issues with calibration). In principle, however, the other segments may give valuable information about any other given segment.

Simplest problem: estimating Doppler shift and line presence at the same time

Second-simplest problem: some lines are Doppler shifted, some are not

Full problem: estimating the number of redshifts and the lines belonging to each in the same model

Note: we are going to need to add some kind of OU process or spline to model the background.


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_context("notebook", font_scale=2.5, rc={"axes.labelsize": 26})
sns.set_style("darkgrid")

plt.rc("font", size=24, family="serif", serif="Computer Sans")
plt.rc("text", usetex=True)

import cPickle as pickle

import numpy as np
import scipy.special
import shutil

from astropy import units
import astropy.constants as const

import emcee
import corner

import dnest4

Let's first load our first example spectrum:


In [ ]:
datadir = "../data/"
datafile = "8525_nodip_0.dat"

In [ ]:
data = np.loadtxt(datadir+datafile)

In [ ]:
wavelength_left = data[:,0]
wavelength_right = data[:,1]

wavelength_mid = data[:,0] + (data[:,1]-data[:,0])/2.
counts = data[:,2]
counts_err = data[:,3]

In [ ]:
plt.figure(figsize=(16,4))
plt.plot(wavelength_mid, counts)
plt.gca().invert_xaxis()
plt.xlim(wavelength_mid[-1], wavelength_mid[0])

Next, we're going to need the lines we're interested in. Let's use the Silicon lines. Note that these are all in electron volt. However, the data are in Angstrom, which means I need to convert them.

I'm going to use astropy.units to do that:


In [ ]:
siXIV = 1864.9995 * units.eV
siXIII = 2005.494 * units.eV

siXII = 1845.02 * units.eV
siXII_err = 0.07 * units.eV

siXI = 1827.51 * units.eV
siXI_err = 0.06 * units.eV

siX = 1808.39 * units.eV
siX_err = 0.05 * units.eV

siIX = 1789.57 * units.eV
siIX_err = 0.07 * units.eV

siVIII = 1772.01 * units.eV
siVIII_err = 0.09 * units.eV

siVII = 1756.68 * units.eV
siVII_err = 0.08 * units.eV

si_all = [siXIV, siXIII, siXII, siXI, siX, siIX, siVIII, siVII]
si_err_all = [0.0*units.eV, 0.0*units.eV, siXII_err, siXI_err, 
              siX_err, siIX_err, siVIII_err, siVII_err]

Now I can do the actual conversion:


In [ ]:
si_all_angstrom = [(const.h*const.c/s.to(units.Joule)).to(units.Angstrom) 
                   for s in si_all]
si_err_all_angstrom = [(const.h*const.c/s.to(units.Joule)).to(units.Angstrom) 
                       for s in si_err_all]

si_err_all_angstrom[0] = 0.0*units.Angstrom
si_err_all_angstrom[1] = 0.0*units.Angstrom

In [ ]:
for s in si_all_angstrom:
    print(s)

Let's plot the lines onto the spectrum:


In [ ]:
plt.figure(figsize=(16,4))
plt.errorbar(wavelength_mid, counts, yerr=counts_err, fmt="o")
plt.gca().invert_xaxis()
plt.xlim(wavelength_mid[-1], wavelength_mid[0])
for s in si_all_angstrom:
    plt.vlines(s.value, np.min(counts), np.max(counts), lw=2)

We currently don't have the positions of the longer-wavelength lines, so we're going to cut the spectrum at 7.1 Angstrom:


In [ ]:
maxind = wavelength_mid.searchsorted(7.1)

In [ ]:
wnew_mid = wavelength_mid[:maxind]
wnew_left = wavelength_left[:maxind]
wnew_right = wavelength_right[:maxind]

cnew = counts[:maxind]
enew = counts_err[:maxind]

In [ ]:
plt.figure(figsize=(16,4))
plt.errorbar(wnew_mid, cnew, yerr=enew, fmt="o")
plt.gca().invert_xaxis()
plt.xlim(wnew_mid[-1], wnew_mid[0])
for s in si_all_angstrom:
    plt.vlines(s.value, np.min(counts), np.max(counts), lw=2)

We are going to save the spectrum as well as the line centers:


In [ ]:
# the full spectrum in a format usable by ShiftyLines
np.savetxt(datadir+"8525_nodip_full.txt", np.array([wavelength_left, wavelength_right,
                                                   counts, counts_err]).T)

# the cut spectrum with the Si lines only
np.savetxt(datadir+"8525_nodip_cut.txt", np.array([wnew_left, wnew_right, cnew, enew]).T)

In [ ]:
## convert from astropy.units to float
si_all_val = np.array([s.value for s in si_all_angstrom])
si_err_all_val = np.array([s.value for s in si_err_all_angstrom])

In [ ]:
np.savetxt(datadir+"si_lines.txt", np.array(si_all_val))

Adding some more lines for later

We have some more lines that are going to become important/interesting later, because they'll be shifted at a different redshift:


In [ ]:
# line energies in keV
al_ly_alpha = 1.72855 * 1000 * units.eV
mg_he_gamma = 1.65910 * 1000 * units.eV
mg_he_delta = 1.69606 * 1000 * units.eV
mg_ly_beta = 1.74474 * 1000 * units.eV
mg_ly_gamma = 1.84010 * 1000 * units.eV
mg_ly_delta = 1.88423 * 1000 * units.eV
fe_xxiv = 1.88494 * 1000 * units.eV

The lines need to be converted to Angstroms


In [ ]:
other_lines_all = [al_ly_alpha, mg_he_gamma, mg_ly_beta, mg_ly_gamma, mg_ly_delta, fe_xxiv]

other_lines_all_angstrom = [(const.h*const.c/s.to(units.Joule)).to(units.Angstrom) 
                   for s in other_lines_all]

other_lines_all_val = np.array([s.value for s in other_lines_all_angstrom])

for l in other_lines_all_angstrom:
    print(str(l.value) + " " +  str(l.unit))

What does the spectrum with the line centroids look like?


In [ ]:
plt.figure(figsize=(16,4))
plt.errorbar(wavelength_mid, counts, yerr=counts_err, fmt="o")
plt.gca().invert_xaxis()
plt.xlim(wavelength_mid[-1], wavelength_mid[0])
for s in si_all_angstrom:
    plt.vlines(s.value, np.min(counts), np.max(counts), lw=2)
for l in other_lines_all_angstrom:
    plt.vlines(l.value, np.min(counts), np.max(counts), lw=2)

Let's save the extended list of lines to a file:


In [ ]:
# make extended array of lines
lines_extended = np.hstack([si_all_val, other_lines_all_val])

# save the lines
np.savetxt(datadir+"lines_extended.txt", np.array(lines_extended))

Simulating Data

In order to test any methods we are creating, we are first going to produce some simulated data.

Set the seed so that the output simulations will always be the same:


In [ ]:
np.random.seed(20160216)

The spectral lines are modelled as simple Gaussians with an amplitude $A$, a width $\sigma$ and a position $\lambda_0$.

Because energy data comes naturally binned (the original channels detect photons between a certain minimum and maximum energy), we integrate over energy bins to get an accurate estimate of the flux in each energy bin. This also allows the use of uneven binning.

In order to integrate over the bins correctly, I also define the cumulative distribution function (CDF) of a Gaussian below, which is, in fact, the integral of the Gaussian function.

This also means that the amplitude is defined as the integrated area under the Gaussian rather than the height of the Gaussian, but this is closer to the physical quantities astronomers might be interested in (equivalent width) anyway.


In [ ]:
def gaussian_cdf(x, w0, width):
    return 0.5*(1. + scipy.special.erf((x-w0)/(width*np.sqrt(2.))))

def spectral_line(wleft, wright, w0, amplitude, width):
    """
    Use the CDF of a Gaussian distribution to define spectral 
    lines. We use the CDF to integrate over the energy bins, 
    rather than taking the mid-bin energy.
    
    Parameters
    ----------
    wleft: array
        Left edges of the energy bins
        
    wright: array
        Right edges of the energy bins
        
    w0: float
        The centroid of the line
        
    amplitude: float
        The amplitude of the line
        
    width: float
        The width of the line
        
    Returns
    -------
    line_flux: array
        The array of line fluxes integrated over each bin
    """
    line_flux = amplitude*(gaussian_cdf(wright, w0, width)-
                           gaussian_cdf(wleft, w0, width))
    return line_flux

A simple test:


In [ ]:
w0 = 6.6
amp = 0.01
width = 0.01

line_flux = spectral_line(wnew_left, wnew_right, w0, amp, width)

plt.plot(wnew_mid, line_flux)

Simulating Spectra

In order to test our algorithm, we'd like to simulate some test data where we know the "ground truth" (i.e. the input parameters that made the spectrum).

Below is a (admittedly complicated) function that will simulate data for various test cases. We'll address these test cases one by one below and simulate a spectrum to test.


In [ ]:
def fake_spectrum(wleft, wright, line_pos, logbkg=np.log(0.09), err=0.007, dshift=0.0,
                  sample_logamp=False, sample_logq=False, sample_signs=False,
                  logamp_hypermean=None, logamp_hypersigma=np.log(0.08), nzero=0, 
                  logq_hypermean=np.log(500), logq_hypersigma=np.log(50)):
    """
    Make a fake spectrum with emission/absorption lines.
    The background is constant, though that should later become an OU process or 
    something similar.
    
    NOTE: The amplitude *must not* fall below zero! I'm not entirely sure how to deal 
    with that yet!
    
    Parameters
    ----------
    wleft: np.ndarray
        Left edges of the energy bins
        
    wright: np.ndarray
        Right edges of the energy bins
    
    line_pos: np.ndarray
        The positions of the line centroids
    
    bkg: float
        The value of the constant background
        
    err: float
        The width of the Gaussian error distribution
        
    dshift: float, default 0.0
        The Doppler shift of the spectral lines.
        
    sample_amp: bool, default False
        Sample all amplitudes? If not, whatever value is set in 
        `amp_hypermean` will be set as collective amplitude for all 
        lines
        
    sample_width: bool, default False
        Sample all line widths?  If not, whatever value is set in 
        `width_hypersigma` will be set as collective amplitude for all 
        lines
    
    sample_signs: bool, default False
        Sample the sign of the line amplitude (i.e. whether the line is an 
        absorption or emission line)? If False, all lines will be absorption 
        lines
        
    logamp_hypermean: {float | None}, default None
        The mean of the Gaussian prior distribution on the line amplitude. If None, 
        it is set to the same value as `bkg`.
        
    logamp_hypersigma: float, default 0.08
        The width of the Gaussian prior distribution on the line amplitude. 
        
    nzero: int, default 0
        The number of lines to set to zero amplitude
        
    logq_hypermean: float, default 0.01
        The mean of the Gaussian prior distribution on the 
        q-factor, q=(line centroid wavelength)/(line width)
        
    logq_hypersigma: float, default 0.01
        The width of the Gaussian prior distribution on the
        q-factor, q=(line centroid wavelength)/(line width)
    
    Returns
    -------
    model_flux: np.ndarray
        The array of model line fluxes for each bin
        
    fake_flux: np.ndarray
        The array of fake fluxes (with errors) for each bin
        
    """
    
    # number of lines
    nlines = line_pos.shape[0]
    
    # shift spectral lines
    line_pos_shifted = line_pos*(1. + dshift)
   
    # if sampling the amplitudes
    if sample_logamp:  
        # sample line amplitudes
        logamps = np.random.normal(logamp_hypermean, logamp_hypersigma, size=nlines)
    else:
        logamps = np.zeros(nlines)+logamp_hypermean
        
    
    amps = np.exp(logamps)
    
    if nzero > 0:
        zero_ind = np.random.choice(np.arange(nlines), size=nzero)
        for z in zero_ind:
            amps[int(z)] = 0.0
    
    if sample_signs:
        # sample sign of the amplitudes
        signs = np.random.choice([-1., 1.], p=[0.5, 0.5], size=nlines)
    else:
        # all lines are absorption lines
        signs = -1.*np.ones(nlines)
        
    # include signs in the amplitudes
    amps *= signs
    
    if sample_logq:
        # widths of the lines
        logq = np.random.normal(logq_hypermean, logq_hypersigma, size=nlines)
    else:
        logq = np.ones(nlines)*logq_hypermean
        
        
    widths = line_pos_shifted/np.exp(logq)
    
    model_flux = np.zeros_like(wleft) + np.exp(logbkg)
    
    for si, a, w in zip(line_pos_shifted, amps, widths):
        model_flux += spectral_line(wleft, wright, si, a, w)
      
    fake_flux = model_flux + np.random.normal(0.0, 0.007, size=model_flux.shape[0])
    
    pars = {"wavelength_left": wleft, "wavelength_right": wright, "err":err,
            "model_flux": model_flux, "fake_flux": fake_flux, "logbkg":logbkg,
            "dshift": dshift, "line_pos": line_pos_shifted, "logamp": logamps,
           "signs": signs, "logq": logq }
    
    
    return pars

Test 1: A spectrum with no redshift and strong lines

As a first simple check, we simulate a spectrum with strong absorption lines at all available line positions. The amplitudes and widths ($q$-values) are the same for all lines. There is no Doppler shift.

We use the wavelength bins from the real data for generating the simulation:


In [ ]:
froot = "test_noshift1"

## set amplitude and q
logamp_mean = np.log(0.3)
logq_mean = np.log(600.)

# set Doppler shift
dshift = 0.0

# set background
logbkg = np.log(0.09)

# do not sample amplitudes or q-factors(all are the same!)
sample_logamp = False
sample_logq = False

# all lines are absorption lines
sample_signs = False

# error on the data points (will sample from a Gaussian distribution)
err = 0.007

# do not set any lines to zero!
nzero = 0

In [ ]:
pars = fake_spectrum(wnew_left, wnew_right, si_all_val, logbkg=logbkg, err=err, 
                     dshift=dshift, sample_logamp=sample_logamp, sample_logq=sample_logq, 
                     logamp_hypermean=logamp_mean, logq_hypermean=logq_mean, 
                     sample_signs=sample_signs, nzero=nzero)

In [ ]:
model_flux = pars["model_flux"]
fake_flux = pars["fake_flux"]

fake_err = np.zeros_like(fake_flux) + pars["err"]

Let's plot the spectrum:


In [ ]:
plt.figure(figsize=(14,6))
plt.errorbar(wnew_mid, fake_flux, yerr=fake_err, fmt="o", label="simulated flux", alpha=0.7)
plt.plot(wnew_mid, model_flux, label="simulated model", lw=3)
plt.xlim(wnew_mid[0], wnew_mid[-1])
plt.gca().invert_xaxis()
plt.legend(prop={"size":18})
plt.xlabel("Wavelength [Angstrom]")
plt.ylabel("Normalized Flux")
plt.savefig(datadir+froot+"_lc.png", format="png")

We're going to save the data and the parameters in a pickle file for later use. We'll also save the fake data itself in a way that I can easily input it into ShiftyLines.


In [ ]:
# save the whole dictionary in a pickle file
f = open(datadir+froot+"_data.pkl", "w")
pickle.dump(pars, f)
f.close()

# save the fake data in an ASCII file for input into ShiftyLines
np.savetxt(datadir+froot+".txt", np.array([wnew_left, wnew_right, fake_flux, fake_err]).T)

Sampling the Model

If DNest4 and ShiftyLines are installed and compiled, you should now be able to run this model (from the ShiftyLines/code/ directory) with

>>> ./main -d ../data/test_noshift1.txt -t 8

The last option sets the number of threads to run; this will depend on your computer and how many CPUs you can keep busy with this.

In this run, we set the number of levels in the OPTIONS file to $100$, based on running the sampler until the likelihood change between two levels fell below $1$ or so.

Results

Here are the results of the initial simulation. For this run, we set the number of Doppler shifts to exactly $1$ and do not sample over redshifts. This means the model is equivalent to simple template fitting, but it makes a fairly good test.

First, we need to move the posterior run to a new directory so it doesn't get overwritten by subsequent runs. We'll write a function for that


In [ ]:
import shutil
def move_dnest_output(froot, dnest_dir="./"):
    shutil.move(dnest_dir+"posterior_sample.txt", froot+"_posterior_sample.txt")
    shutil.move(dnest_dir+"sample.txt", froot+"_sample.txt")
    shutil.move(dnest_dir+"sample_info.txt", froot+"_sample_info.txt")
    shutil.move(dnest_dir+"weights.txt", froot+"_weights.txt")
    shutil.move(dnest_dir+"levels.txt", froot+"_levels.txt")
    return

In [ ]:
move_dnest_output("../data/%s"%froot, "../code/")

We'll need to load the data and the posterior samples for plotting:


In [ ]:
# the pickle file with the data + parameters:
f = open(datadir+froot+"_data.pkl")
data = pickle.load(f)
f.close()

print("Keys in data dictionary: " + str(data.keys()))

# the posterior samples
sample = np.atleast_2d(np.loadtxt(datadir+froot+"_posterior_sample.txt"))
nsamples = sample.shape[0]
print("We have %i samples from the posterior."%nsamples)

First plot: a random set of realizations from the posterior overplotted on the data:


In [ ]:
# randomly pick some samples from the posterior to plot
s_ind = np.random.choice(np.arange(nsamples, dtype=int), size=20)

# the middle of the wavelength bins for plotting
wmid = data["wavelength_left"] + (data["wavelength_right"] - data["wavelength_left"])/2.

# the error on the data
yerr = np.zeros_like(wmid) + data['err']

plt.figure(figsize=(14,6))
plt.errorbar(wmid, data["fake_flux"], yerr=yerr, fmt="o")
for i in s_ind:
    plt.plot(wmid, sample[i,-wmid.shape[0]:], lw=2, alpha=0.7)

plt.xlim(wmid[0], wmid[-1])
plt.gca().invert_xaxis()
plt.xlabel("Wavelength [Angstrom]")
plt.ylabel("Normalized Flux")
plt.tight_layout()
plt.savefig(datadir+froot+"_samples.png", format="png")

What's the posterior distribution over the constant background?


In [ ]:
fig = plt.figure(figsize=(12,9))
# Plot a historgram and kernel density estimate
ax = fig.add_subplot(111)
sns.distplot(sample[:,0], hist_kws={"histtype":"stepfilled"}, ax=ax)
_, ymax = ax.get_ylim()

ax.vlines(np.exp(data["logbkg"]), 0, ymax, lw=3, color="black")

ax.set_xlabel("Normalized Background Flux")
ax.set_ylabel("$N_{\mathrm{samples}}$")
plt.tight_layout()
plt.savefig(datadir+froot+"_bkg.png", format="png")

What does the OU process modelling the variable background look like?


In [ ]:
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(14,6))
sns.distplot(sample[:,1], hist_kws={"histtype":"stepfilled"}, ax=ax1)
#_, ymax = ax.get_ylim()
#ax.vlines(np.exp(data["logbkg"]), 0, ymax, lw=3, color="black")

ax1.set_xlabel(r"OU time scale $\tau$")
ax1.set_ylabel(r"$N_{\mathrm{samples}}$")

sns.distplot(sample[:,2], hist_kws={"histtype":"stepfilled"}, ax=ax2)
#_, ymax = ax.get_ylim()
#ax.vlines(np.exp(data["logbkg"]), 0, ymax, lw=3, color="black")

ax2.set_xlabel(r"OU amplitude")
ax2.set_ylabel(r"$N_{\mathrm{samples}}$")

fig.tight_layout()

plt.savefig(datadir+froot+"_ou.png", format="png")

Now we can look at the hyperparameters for the log-amplitude and log-q priors:


In [ ]:
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(14,6))
sns.distplot(np.log(sample[:,5]), hist_kws={"histtype":"stepfilled"}, ax=ax1)
_, ymax = ax1.get_ylim()
ax1.vlines(data["logamp"], 0, ymax, lw=3, color="black")

ax1.set_xlabel(r"$\mu_{\mathrm{\log{A}}}$")
ax1.set_ylabel(r"$N_{\mathrm{samples}}$")
ax1.set_title("Location parameter of the amplitude prior")


sns.distplot(sample[:,6], hist_kws={"histtype":"stepfilled"}, ax=ax2)
#_, ymax = ax.get_ylim()
#ax.vlines(np.exp(data["logbkg"]), 0, ymax, lw=3, color="black")

ax2.set_xlabel(r"$\sigma_{\mathrm{\log{A}}}$")
ax2.set_ylabel(r"$N_{\mathrm{samples}}$")
ax2.set_title("Scale parameter of the amplitude prior")

fig.tight_layout()

plt.savefig(datadir+froot+"_logamp_prior.png", format="png")

Let's do the same for the width:


In [ ]:
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(14,6))
sns.distplot(sample[:,7], hist_kws={"histtype":"stepfilled"}, ax=ax1)
_, ymax = ax1.get_ylim()
ax1.vlines(data["logq"], 0, ymax, lw=3, color="black")

ax1.set_xlabel(r"$\mu_{\mathrm{\log{q}}}$")
ax1.set_ylabel(r"$N_{\mathrm{samples}}$")
ax1.set_title(r"Location parameter of the $q$ prior")


sns.distplot(sample[:,8], hist_kws={"histtype":"stepfilled"}, ax=ax2)
#_, ymax = ax.get_ylim()
#ax.vlines(np.exp(data["logbkg"]), 0, ymax, lw=3, color="black")

ax2.set_xlabel(r"$\sigma_{\mathrm{\log{q}}}$")
ax2.set_ylabel(r"$N_{\mathrm{samples}}$")
ax2.set_title(r"Scale parameter of the $q$ prior")

fig.tight_layout()

plt.savefig(datadir+froot+"_logq_prior.png", format="png")

The next parameter (pp) in the model is the threshold determining the sign of the amplitude (i.e. whether a line is an absorption or an emission line). The sign is sampled as a random variable between $0$ and $1$; the threshold sets the boundary below which a line will become an absorption line. Above the threshold, the sign will be flipped to return an emission line.

For a spectrum with mostly absorption lines, pp should be quite high, close to $1$. For a spectrum with mostly emission lines, pp should be close to $0$.


In [ ]:
fig = plt.figure(figsize=(12,9))
# Plot a historgram and kernel density estimate
ax = fig.add_subplot(111)
sns.distplot(sample[:,9], hist_kws={"histtype":"stepfilled"}, ax=ax)

ax.set_xlabel("Threshold parameter")
ax.set_ylabel("$N_{\mathrm{samples}}$")
plt.tight_layout()

plt.savefig(datadir+froot+"_pp.png", format="png")

Hmmm, the model doesn't seem to care much about that? Funny!

The Doppler shift is next!


In [ ]:
fig = plt.figure(figsize=(12,9))
plt.locator_params(axis = 'x', nbins = 6)

# Plot a historgram and kernel density estimate
ax = fig.add_subplot(111)
sns.distplot(sample[:,11], hist_kws={"histtype":"stepfilled"}, ax=ax)
plt.xticks(rotation=45)


_, ymax = ax.get_ylim()
ax.vlines(data["dshift"], 0, ymax, lw=3, color="black")

ax.set_xlabel(r"Doppler shift $d=v/c$")
ax.set_ylabel("$N_{\mathrm{samples}}$")
plt.tight_layout()

plt.savefig(datadir+froot+"_dshift.png", format="png")

What is the posterior on all the line amplitudes and widths? Let's try overplotting them all:


In [ ]:
nlines = 8
ncolumns = 3
nrows = int(nlines/3)+1

fig = plt.figure(figsize=(nrows*4,ncolumns*4))
plt.locator_params(axis = 'x', nbins = 6)

# log-amplitudes
for i in range(8):
    ax = plt.subplot(nrows, ncolumns, i+1)    
    sns.distplot(sample[:,12+i], hist_kws={"histtype":"stepfilled"}, ax=ax)
    #ax.hist(sample[:,12+i], histtype="stepfilled", alpha=0.7)
    plt.locator_params(axis = 'x', nbins = 6)
    xlabels = ax.get_xticklabels() 
    for l in xlabels: 
        l.set_rotation(45) 

    _, ymax = ax.get_ylim()
    ax.vlines(data["logamp"][i], 0, ymax, lw=3, color="black")

    ax.set_xlabel(r"$\log{A}$")
    ax.set_ylabel("$N_{\mathrm{samples}}$")
    
plt.tight_layout()
plt.savefig(datadir+froot+"_logamp.png", format="png")

Looks like it samples amplitudes correctly! Let's make the same Figure for log-q:


In [ ]:
fig = plt.figure(figsize=(ncolumns*4,nrows*4))
plt.locator_params(axis = 'x', nbins = 6)

# log-amplitudes
for i in range(8):
    ax = plt.subplot(nrows, ncolumns, i+1)    
    sns.distplot(sample[:,20+i], hist_kws={"histtype":"stepfilled"}, ax=ax)
    #ax.hist(sample[:,20+i], histtype="stepfilled", alpha=0.7)
    plt.locator_params(axis = 'x', nbins = 6)
    xlabels = ax.get_xticklabels() 
    for l in xlabels: 
        l.set_rotation(45) 

    _, ymax = ax.get_ylim()
    ax.vlines(data["logq"][i], 0, ymax, lw=3, color="black")

    ax.set_xlabel(r"$\log{q}$")
    ax.set_ylabel("$N_{\mathrm{samples}}$")
    
plt.tight_layout()
plt.savefig(datadir+froot+"_logq.png", format="png")

Final thing, just to be sure: the signs of the amplitudes!


In [ ]:
fig = plt.figure(figsize=(12,9))
plt.locator_params(axis = 'x', nbins = 6)

# Plot a historgram and kernel density estimate
ax = fig.add_subplot(111)
for i in range(8):
    sns.distplot(sample[:,28+i], hist_kws={"histtype":"stepfilled"}, ax=ax, alpha=0.6)
    plt.xticks(rotation=45)


_, ymax = ax.get_ylim()
ax.vlines(data["dshift"], 0, ymax, lw=3, color="black")

ax.set_xlabel(r"Emission/absorption line sign")
ax.set_ylabel("$N_{\mathrm{samples}}$")
plt.tight_layout()

plt.savefig(datadir+froot+"_signs.png", format="png")

A General Plotting function for the Posterior

We'll make some individual plotting functions that we can then combine to plot useful Figures on the whole posterior sample!


In [ ]:
def plot_samples(data, sample, fout, close=True):
    """
    FIRST PLOT: SPECTRUM + SAMPLES FROM THE POSTERIOR
    """
    
    # number of posterior samples
    nsamples = sample.shape[0]

    # randomly pick some samples from the posterior to plot
    s_ind = np.random.choice(np.arange(nsamples, dtype=int), size=20)

    # the middle of the wavelength bins for plotting
    wmid = data["wavelength_left"] + (data["wavelength_right"] - data["wavelength_left"])/2.

    # the error on the data
    yerr = np.zeros_like(wmid) + data['err']

    plt.figure(figsize=(14,6))
    plt.errorbar(wmid, data["fake_flux"], yerr=yerr, fmt="o")
    for i in s_ind:
        plt.plot(wmid, sample[i,-wmid.shape[0]:], lw=2, alpha=0.7)

    plt.xlim(wmid[0], wmid[-1])
    plt.gca().invert_xaxis()
    plt.xlabel("Wavelength [Angstrom]")
    plt.ylabel("Normalized Flux")
    plt.tight_layout()
    plt.savefig(fout+"_samples.png", format="png")
    if close:
        plt.close()
    
    return

def plot_bkg(data, sample, fout, close=True):
    """
    PLOT THE BACKGROUND POSTERIOR
    """
    fig = plt.figure(figsize=(12,9))
    # Plot a histogram and kernel density estimate
    ax = fig.add_subplot(111)
    sns.distplot(sample[:,0], hist_kws={"histtype":"stepfilled"}, ax=ax)
    plt.xticks(rotation=45)

    _, ymax = ax.get_ylim()
    ax.vlines(np.exp(data["logbkg"]), 0, ymax, lw=3, color="black")

    ax.set_xlabel("Normalized Background Flux")
    ax.set_ylabel("$N_{\mathrm{samples}}$")
    plt.tight_layout()
    plt.savefig(fout+"_bkg.png", format="png")
    if close:
        plt.close()
    return
    
def plot_ou_bkg(sample, fout, close=True):
    """
    PLOT THE POSTERIOR FOR THE OU PROCESS
    """
    fig, (ax1, ax2) = plt.subplots(1,2,figsize=(14,6))
    sns.distplot(sample[:,1], hist_kws={"histtype":"stepfilled"}, ax=ax1)
    #_, ymax = ax.get_ylim()
    #ax.vlines(np.exp(data["logbkg"]), 0, ymax, lw=3, color="black")

    ax1.set_xlabel(r"OU time scale $\tau$")
    ax1.set_ylabel(r"$N_{\mathrm{samples}}$")

    sns.distplot(sample[:,2], hist_kws={"histtype":"stepfilled"}, ax=ax2)
    plt.xticks(rotation=45)

    #_, ymax = ax.get_ylim()
    #ax.vlines(np.exp(data["logbkg"]), 0, ymax, lw=3, color="black")

    ax2.set_xlabel(r"OU amplitude")
    ax2.set_ylabel(r"$N_{\mathrm{samples}}$")

    fig.tight_layout()

    plt.savefig(fout+"_ou.png", format="png")
    if close:
        plt.close()
    return

def plot_logamp_hyper(data, sample, fout, close=True):
    """
    PLOT THE POSTERIOR FOR THE LOG-AMP HYPERPARAMETERS
    """    
    
    fig, (ax1, ax2) = plt.subplots(1,2,figsize=(14,6))
    sns.distplot(np.log(sample[:,5]), hist_kws={"histtype":"stepfilled"}, ax=ax1)
    _, ymax = ax1.get_ylim()
    ax1.vlines(data["logamp"], 0, ymax, lw=3, color="black")

    ax1.set_xlabel(r"$\mu_{\mathrm{\log{A}}}$")
    ax1.set_ylabel(r"$N_{\mathrm{samples}}$")
    ax1.set_title("Location parameter of the amplitude prior")


    sns.distplot(sample[:,6], hist_kws={"histtype":"stepfilled"}, ax=ax2)
    plt.xticks(rotation=45)

    #_, ymax = ax.get_ylim()
    #ax.vlines(np.exp(data["logbkg"]), 0, ymax, lw=3, color="black")

    ax2.set_xlabel(r"$\sigma_{\mathrm{\log{A}}}$")
    ax2.set_ylabel(r"$N_{\mathrm{samples}}$")
    ax2.set_title("Scale parameter of the amplitude prior")

    fig.tight_layout()

    plt.savefig(fout+"_logamp_prior.png", format="png")
    if close:
        plt.close()
    return

def plot_logq_hyper(data, sample, fout, close=True):
    """
    PLOT THE POSTERIOR FOR THE LOG-Q HYPERPARAMETERS
    """
    
    fig, (ax1, ax2) = plt.subplots(1,2,figsize=(14,6))
    sns.distplot(sample[:,7], hist_kws={"histtype":"stepfilled"}, ax=ax1)
    _, ymax = ax1.get_ylim()
    ax1.vlines(data["logq"], 0, ymax, lw=3, color="black")

    ax1.set_xlabel(r"$\mu_{\mathrm{\log{q}}}$")
    ax1.set_ylabel(r"$N_{\mathrm{samples}}$")
    ax1.set_title(r"Location parameter of the $q$ prior")


    sns.distplot(sample[:,8], hist_kws={"histtype":"stepfilled"}, ax=ax2)
    plt.xticks(rotation=45)

    #_, ymax = ax.get_ylim()
    #ax.vlines(np.exp(data["logbkg"]), 0, ymax, lw=3, color="black")

    ax2.set_xlabel(r"$\sigma_{\mathrm{\log{q}}}$")
    ax2.set_ylabel(r"$N_{\mathrm{samples}}$")
    ax2.set_title(r"Scale parameter of the $q$ prior")

    fig.tight_layout()

    plt.savefig(fout+"_logq_prior.png", format="png")
    if close:
        plt.close()
    return

def plot_threshold(sample, fout, close=True):
    """
    PLOT THE POSTERIOR FOR THE THRESHOLD PARAMETER
    """
    
    fig = plt.figure(figsize=(12,9))
    # Plot a historgram and kernel density estimate
    ax = fig.add_subplot(111)
    sns.distplot(sample[:,9], hist_kws={"histtype":"stepfilled"}, ax=ax)
    plt.xticks(rotation=45)

    ax.set_xlabel("Threshold parameter")
    ax.set_ylabel("$N_{\mathrm{samples}}$")
    plt.tight_layout()

    plt.savefig(fout+"_pp.png", format="png")
    if close:
        plt.close()
    return

def plot_dshift(data, sample, fout, dshift_ind=0, close=True):
    """ 
    PLOT THE POSTERIOR FOR THE DOPPLER SHIFT
    """
    
    fig = plt.figure(figsize=(12,9))
    plt.locator_params(axis = 'x', nbins = 6)

    # Plot a historgram and kernel density estimate
    ax = fig.add_subplot(111)
    sns.distplot(sample[:,11+dshift_ind], hist_kws={"histtype":"stepfilled"}, ax=ax)
    plt.xticks(rotation=45)


    _, ymax = ax.get_ylim()
    ax.vlines(data["dshift"], 0, ymax, lw=3, color="black")

    ax.set_xlabel(r"Doppler shift $d=v/c$")
    ax.set_ylabel("$N_{\mathrm{samples}}$")
    plt.tight_layout()

    plt.savefig(fout+"_dshift%i.png"%dshift_ind, format="png")
    if close:
        plt.close()
    return

def plot_logamp(data, sample, fout, ndshift, nlines, ncolumns=3, 
                dshift_ind=0, close=True):
    """
    PLOT THE POSTERIOR FOR THE LINE LOG-AMPLITUDES
    """
    
    nrows = int(nlines/ncolumns)+1

    fig = plt.figure(figsize=(nrows*4,ncolumns*4))
    plt.locator_params(axis = 'x', nbins = 6)

    # index of column where the log-amplitudes start:
    start_ind = 11 + ndshift + dshift_ind*nlines
    
    # log-amplitudes
    for i in range(nlines):
        ax = plt.subplot(nrows, ncolumns, i+1)    
#        ax.hist(sample[:,start_ind+i], histtype="stepfilled", alpha=0.7)
        sns.distplot(sample[:,start_ind+i], hist_kws={"histtype":"stepfilled"}, ax=ax)

        plt.locator_params(axis = 'x', nbins = 6)
        xlabels = ax.get_xticklabels() 
        for l in xlabels: 
            l.set_rotation(45) 

        _, ymax = ax.get_ylim()
        ax.vlines(data["logamp"][i], 0, ymax, lw=3, color="black")

        ax.set_xlabel(r"$\log{A}$")
        ax.set_ylabel("$N_{\mathrm{samples}}$")

    plt.tight_layout()
    if dshift_ind == 0:
        plt.savefig(fout+"_logamp.png", format="png")
    else:
        plt.savefig(fout+"_logamp%i.png"%dshift_ind, format="png")

    if close:
        plt.close()
    
    return

def plot_logq(data, sample, fout, ndshift, nlines, ncolumns=3, 
              dshift_ind=0, close=True):
    """
    PLOT THE POSTERIOR FOR THE LINE LOG-Q
    """

    nrows = int(nlines/ncolumns)+1

    fig = plt.figure(figsize=(nrows*4,ncolumns*4))
    plt.locator_params(axis = 'x', nbins = 6)

    # set starting index for the logq values:
    start_ind = 11 + ndshift + nlines*(dshift_ind + 1)
    
    # log-amplitudes
    for i in range(nlines):
        ax = plt.subplot(nrows, ncolumns, i+1)    
        #ax.hist(sample[:,start_ind+i], histtype="stepfilled", alpha=0.7)
        sns.distplot(sample[:,start_ind+i], hist_kws={"histtype":"stepfilled"}, ax=ax)
        
        plt.locator_params(axis = 'x', nbins = 6)
        xlabels = ax.get_xticklabels() 
        for l in xlabels: 
            l.set_rotation(45) 

        _, ymax = ax.get_ylim()
        ax.vlines(data["logq"][i], 0, ymax, lw=3, color="black")

        ax.set_xlabel(r"$\log{q}$")
        ax.set_ylabel("$N_{\mathrm{samples}}$")

    plt.tight_layout()
    if dshift_ind == 0:
        plt.savefig(fout+"_logq.png", format="png")
    else:
        plt.savefig(fout+"_logq%i.png"%dshift_ind, format="png")
    
    if close:
        plt.close()
    return


def plot_signs(data, sample, fout, ndshift, nlines, ncolumns=3, 
               dshift_ind=0, close=True):
    """
    PLOT THE POSTERIOR FOR THE LINE AMPLITUDE SIGNS
    """

    nrows = int(nlines/ncolumns)+1
    
    fig = plt.figure(figsize=(nrows*4,ncolumns*4))
    plt.locator_params(axis = 'x', nbins = 6)

    # set starting index for the logq values:
    start_ind = 11 + ndshift + nlines*(dshift_ind + 2)
    
    # log-amplitudes
    for i in range(nlines):
        ax = plt.subplot(nrows, ncolumns, i+1)    
#        ax.hist(sample[:,start_ind+i], histtype="stepfilled", alpha=0.7)
        sns.distplot(sample[:,start_ind+i], hist_kws={"histtype":"stepfilled"}, ax=ax)

        plt.locator_params(axis = 'x', nbins = 6)
        xlabels = ax.get_xticklabels() 
        for l in xlabels: 
            l.set_rotation(45) 

        ax.set_xlabel(r"Emission/absorption line sign")
        ax.set_ylabel("$N_{\mathrm{samples}}$")

    plt.tight_layout()
    if dshift_ind == 0:
        plt.savefig(fout+"_signs.png", format="png")
    else:
        plt.savefig(fout+"_signs%i.png"%dshift_ind, format="png")
    
    if close:
        plt.close()
    return

In [ ]:
def plot_posterior_summary(froot, datadir="../data/", nlines=8, 
                           ndshift=1, ncolumns=3, close=True):
    """
    Plot summeries of the posterior distribution. Mostly histograms.
    Plots a bunch of Figures to png files.
    
    Parameters
    ----------
    froot: str
        The root string of the data file and file with posterior samples to be loaded.
        
    datadir: str
        The directory with the data.
        Default: "../data/"
        
    nlines: int
        The number of lines in the model.
        Default: 8
        
    ndshift: int
        The number of (possible) Doppler shifts in the model.
        Default: 1
        
    ncolumns: int
        The number of columns in multi-panel plots. Default: 3
    
    close: bool
        Close plots at the end of the plotting? Default: True
    """
    
    
    # the pickle file with the data + parameters:
    f = open(datadir+froot+"_data.pkl")
    data = pickle.load(f)
    f.close()

    print("Keys in data dictionary: " + str(data.keys()))

    # the posterior samples
    sample = np.atleast_2d(np.loadtxt(datadir+froot+"_posterior_sample.txt"))
    nsamples = sample.shape[0]
    print("We have %i samples from the posterior."%nsamples)
    
    # set the directory path and file name for output files:
    fout = datadir+froot
    
    # plot the spectrum with some draws from the posterior
    plot_samples(data, sample, fout, close=close)

    # plot a histogram of the background parameter
    plot_bkg(data, sample, fout, close=close)
    
    # plot histograms of the OU parameters
    plot_ou_bkg(sample, fout, close=close)
    
    # plot the hyper parameters of the log-amp prior
    plot_logamp_hyper(data, sample, fout, close=close)
    
    # plot the hyper parameters of the log-q prior
    plot_logq_hyper(data, sample, fout, close=close)
    
    # plot the threshold for the amplitude sign
    plot_threshold(sample, fout, close=close)
    
    # for the varying number of Doppler shifts, plot their posterior
    if ndshift == 1:
        plot_dshift(data, sample, fout, dshift_ind=0, close=close)
        plot_logamp(data, sample, fout, ndshift, nlines, 
                    ncolumns=ncolumns, dshift_ind=0, close=close)
        plot_logq(data, sample, fout, ndshift, nlines, 
                  ncolumns=ncolumns, dshift_ind=0, close=close)
        plot_signs(data, sample, fout, ndshift, nlines, 
                   ncolumns=ncolumns, dshift_ind=0, close=close)        
    else:
        for i in range(ndshift):
            plot_dshift(data, sample, fout, dshift_ind=i, close=close)
            plot_logamp(data, sample, fout, ndshift, nlines, 
                        ncolumns=ncolumns, dshift_ind=i, close=close)
            plot_logq(data, sample, fout, ndshift, nlines, 
                      ncolumns=ncolumns, dshift_ind=i, close=close)
            plot_signs(data, sample, fout, ndshift, nlines, 
                       ncolumns=ncolumns, dshift_ind=i, close=close)        

    return

Let's run this function on the data we just made individual plots from to see whether it worked.

The model had $8$ lines and $1$ redshift:


In [ ]:
plot_posterior_summary(froot, datadir="../data/", nlines=8, ndshift=1, ncolumns=3,
                      close=True)

A general function for simulating data sets

Based on what we just did, we'll write a function that takes parameters as an input and spits out the files we need:


In [ ]:
def fake_data(wleft, wright, line_pos, input_pars, froot):
    """
    Simulate spectra, including a (constant) background and a set of 
    (Gaussian) lines with given positions. 
    
    The model integrates over wavelength/frequency/energy bins, hence 
    it requires the left and right bin edges rather than the centre of 
    the bin.
    
    Produces (1) a pickle file with the simulated data and parameters used
    to produce the simulation; (2) an ASCII file that can be fed straight into 
    ShiftyLines; (3) a Figure of the simulated data and the model that 
    produced it.
    
    Parameters
    ----------
    wleft: numpy.ndarray
        The left bin edges of the wavelength/frequency/energy bins
        
    wright: numpy.ndarray
        The right bin edges of the wavelength/frequency/energy bins
        
    line_pos: numpy.ndarray
        The positions of the spectral lines in the same units as 
        `wleft` and `wright`; will be translated into centroid
        wavelength/frequency/energy of the Gaussian modelling the line
        
    input_pars: dict
        A dictionary containing the following keywords (for detailed 
        descriptions see the docstring for `fake_spectrum`):
            logbkg: the log-background
            err: the error on the data points
            dshift: the Doppler shift (just one!)
            sample_logamp: sample the log-amplitudes?
            sample_logq: sample the log-q values?
            sample_signs: sample the amplitude signs (if no, all lines 
                          are absorption lines!)
            logamp_mean: location of normal sampling distribution for log-amplitudes
            logq_mean: location of normal sampling distribution for log-q
            logamp_sigma: scale of normal sampling distribution for log-amplitudes
            logq_sigma: scale of normal sampling distribution for log-q
            nzero: Number of lines to set to zero

    froot: str
        A string describing the directory and file name of the output files
    
    """

    # read out all the parameters
    logbkg = input_pars["logbkg"]
    err = input_pars["err"]
    dshift = input_pars["dshift"]
    sample_logamp = input_pars["sample_logamp"]
    sample_logq = input_pars["sample_logq"]
    sample_signs = input_pars["sample_signs"]
    logamp_mean = input_pars["logamp_mean"]
    logq_mean = input_pars["logq_mean"]
    logamp_sigma = input_pars["logamp_sigma"]
    logq_sigma = input_pars["logq_sigma"]
    nzero = input_pars["nzero"]

    # simulate fake spectrum 
    pars = fake_spectrum(wleft, wright, line_pos, logbkg=logbkg, err=err, dshift=dshift, 
                         sample_logamp=sample_logamp, sample_logq=sample_logq, 
                         logamp_hypermean=logamp_mean, logq_hypermean=logq_mean, 
                         logamp_hypersigma=logamp_sigma, logq_hypersigma=logq_sigma,
                         sample_signs=sample_signs, nzero=nzero)
    
    # read out model and fake flux, construct error
    model_flux = pars["model_flux"]
    fake_flux = pars["fake_flux"]
    fake_err = np.zeros_like(fake_flux) + pars["err"]
    
    # get the middle of each bin
    wmid = wleft + (wright-wleft)/2.
    
    # plot the resulting data and model to a file
    plt.figure(figsize=(14,6))
    plt.errorbar(wmid, fake_flux, yerr=fake_err, fmt="o", label="simulated flux", alpha=0.7)
    plt.plot(wmid, model_flux, label="simulated model", lw=3)
    plt.xlim(wmid[0], wmid[-1])
    plt.gca().invert_xaxis()
    plt.legend(prop={"size":18})
    plt.xlabel("Wavelength [Angstrom]")
    plt.ylabel("Normalized Flux")
    plt.savefig(froot+"_lc.png", format="png")
    
    # save the whole dictionary in a pickle file
    f = open(froot+"_data.pkl", "w")
    pickle.dump(pars, f)
    f.close()

    # save the fake data in an ASCII file for input into ShiftyLines
    np.savetxt(froot+".txt", np.array([wleft, wright, fake_flux, fake_err]).T)
    
    return

A Spectrum with Weak Absorption Lines

The model should still work if the lines are very weak. We will simulate a spectrum with weak lines to test how the strength of the lines will affect the inferences drawn from the model:


In [ ]:
froot = "../data/test_noshift2"

input_pars = {}

# set amplitude and q
input_pars["logamp_mean"] = np.log(0.1)
input_pars["logq_mean"] = np.log(600.)

# set the width of the amplitude and q distribution (not used here)

input_pars["logamp_sigma"] =np.log(0.08)
input_pars["logq_sigma"] = np.log(50)

# set Doppler shift
input_pars["dshift"] = 0.0

# set background
input_pars["logbkg"] = np.log(0.09)

# do not sample amplitudes or q-factors(all are the same!)
input_pars["sample_logamp"] = False
input_pars["sample_logq"] = False


# lines are either absorption or emission lines this time!
input_pars["sample_signs"] = True

# error on the data points (will sample from a Gaussian distribution)
input_pars["err"] = 0.007

# do not set any lines to zero!
input_pars["nzero"] = 0

In [ ]:
fake_data(wnew_left, wnew_right, si_all_val, input_pars, froot)

This time, we run DNest4 with $100$ levels.

Results

Let's first move the samples into the right directory and give it the right filename


In [ ]:
move_dnest_output("../data/%s"%froot, "../code/")

In [ ]:
plot_posterior_summary(froot, datadir="../data/", nlines=8, ndshift=1,
                      ncolumns=3, close=False)

It looks like for this spectrum the amplitude really is too weak to constrain anything, so the Doppler shift does whatever the hell it wants.

I'm not sure I like this behaviour; I might need to ask Brendon about it!

A Spectrum With Emission+Absorption Lines

We'll do the same test as the first, but with varying strong emission and absorption lines:


In [ ]:
froot = "../data/test_noshift3"

input_pars = {}

# set amplitude and q
input_pars["logamp_mean"] = np.log(0.3)
input_pars["logq_mean"] = np.log(600.)

# set the width of the amplitude and q distribution (not used here)

input_pars["logamp_sigma"] = np.log(0.08)
input_pars["logq_sigma"] = np.log(50)

# set Doppler shift
input_pars["dshift"] = 0.0

# set background
input_pars["logbkg"] = np.log(0.09)

# do not sample amplitudes or q-factors(all are the same!)
input_pars["sample_logamp"] = False
input_pars["sample_logq"] = False


# lines are either absorption or emission lines this time!
input_pars["sample_signs"] = True

# error on the data points (will sample from a Gaussian distribution)
input_pars["err"] = 0.007

# do not set any lines to zero!
input_pars["nzero"] = 0

Now run the new function:


In [ ]:
fake_data(wnew_left, wnew_right, si_all_val, input_pars, froot)

Run the model the same as with test_noshift2.txt, but with $150$ levels.

Results


In [ ]:
move_dnest_output(froot, dnest_dir="../code/")

In [ ]:
plot_posterior_summary(froot, datadir="../data/", nlines=8, ndshift=1, ncolumns=3,
                      close=False)

In [ ]:

A Spectrum With Variable Absorption Lines

In this test, we'll see how the model deals with variable absorption lines:


In [ ]:
froot = "../data/test_noshift4"

input_pars = {}

# set amplitude and q
input_pars["logamp_mean"] = np.log(0.3)
input_pars["logq_mean"] = np.log(600.)

# set the width of the amplitude and q distribution (not used here)

input_pars["logamp_sigma"] = 0.5
input_pars["logq_sigma"] = np.log(50)

# set Doppler shift
input_pars["dshift"] = 0.0

# set background
input_pars["logbkg"] = np.log(0.09)

# sample amplitudes, but not q-factors(all are the same!)
input_pars["sample_logamp"] = True
input_pars["sample_logq"] = False


# lines are either absorption or emission lines this time!
input_pars["sample_signs"] = False

# error on the data points (will sample from a Gaussian distribution)
input_pars["err"] = 0.007

# do not set any lines to zero!
input_pars["nzero"] = 0

In [ ]:
fake_data(wnew_left, wnew_right, si_all_val, input_pars, froot)

Results

I set the number of levels to $200$.


In [ ]:
move_dnest_output(froot, dnest_dir="../code/")

In [ ]:
plot_posterior_summary(froot, datadir="../data/", nlines=8, ndshift=1, ncolumns=3,
                      close=False)

In [ ]:

A Spectrum with Lines Turned Off

What does the model do if lines are just not there? This is an important question, so we will now make a spectrum with three lines having amplitudes $A=0$:


In [ ]:
froot = "../data/test_noshift5"

input_pars = {}

# set amplitude and q
input_pars["logamp_mean"] = np.log(0.3)
input_pars["logq_mean"] = np.log(600.)

# set the width of the amplitude and q distribution (not used here)

input_pars["logamp_sigma"] = 0.5
input_pars["logq_sigma"] = np.log(50)

# set Doppler shift
input_pars["dshift"] = 0.0

# set background
input_pars["logbkg"] = np.log(0.09)

# do not sample amplitudes or q-factors(all are the same!)
input_pars["sample_logamp"] = False
input_pars["sample_logq"] = False


# lines are either absorption or emission lines this time!
input_pars["sample_signs"] = False

# error on the data points (will sample from a Gaussian distribution)
input_pars["err"] = 0.007

# Set three lines straight to zero!
input_pars["nzero"] = 3

In [ ]:
np.random.seed(20160221)
fake_data(wnew_left, wnew_right, si_all_val, input_pars, froot)

Results


In [ ]:
plot_posterior_summary(froot, datadir="../data/", nlines=8, ndshift=1, ncolumns=3,
                      close=False)

In [ ]:

A Doppler-shifted Spectrum with Absorption lines

We are now going to look how well the model constrains the Doppler shift. Again, we build a simple model where all lines have the sample amplitude,


In [ ]:
froot = "../data/test_shift1"

input_pars = {}

# set amplitude and q
input_pars["logamp_mean"] = np.log(0.3)
input_pars["logq_mean"] = np.log(600.)

# set the width of the amplitude and q distribution (not used here)

input_pars["logamp_sigma"] = 0.5
input_pars["logq_sigma"] = np.log(50)

# set Doppler shift
input_pars["dshift"] = 0.01

# set background
input_pars["logbkg"] = np.log(0.09)

# do not sample amplitudes or q-factors(all are the same!)
input_pars["sample_logamp"] = False
input_pars["sample_logq"] = False


# lines are only absorption lines this time!
input_pars["sample_signs"] = False

# error on the data points (will sample from a Gaussian distribution)
input_pars["err"] = 0.007

# Set three lines straight to zero!
input_pars["nzero"] = 0

In [ ]:
np.random.seed(20160220)
fake_data(wnew_left, wnew_right, si_all_val, input_pars, froot)

Results


In [ ]:
move_dnest_output(froot, "../code/")

In [ ]:
plot_posterior_summary(froot, datadir="../data/", nlines=8, ndshift=1, ncolumns=3,
                      close=False)

A Shifted Spectrum with Emission/Absorption Lines with Variable Amplitudes and Signs

More complicated: a spectrum with a single Doppler shift and variable line amplitudes of both emission and absorption lines:


In [ ]:
froot = "../data/test_shift2"

input_pars = {}

# set amplitude and q
input_pars["logamp_mean"] = np.log(0.3)
input_pars["logq_mean"] = np.log(600.)

# set the width of the amplitude and q distribution (not used here)

input_pars["logamp_sigma"] = 0.5
input_pars["logq_sigma"] = np.log(50)

# set Doppler shift
input_pars["dshift"] = 0.01

# set background
input_pars["logbkg"] = np.log(0.09)

# do not sample amplitudes or q-factors(all are the same!)
input_pars["sample_logamp"] = True
input_pars["sample_logq"] = False


# lines are only absorption lines this time!
input_pars["sample_signs"] = True

# error on the data points (will sample from a Gaussian distribution)
input_pars["err"] = 0.007

# Set three lines straight to zero!
input_pars["nzero"] = 0

In [ ]:
np.random.seed(20160221)
fake_data(wnew_left, wnew_right, si_all_val, input_pars, froot)

Results


In [ ]:

Adding a Noise Process

At this point, I should be adding an OU process to the data generation process to simulate the effect of a variable background in the spectrum.

THIS IS STILL TO BE DONE!


In [ ]:

Testing Multiple Doppler Shift Components

For all of the above simulations, we also ought to test how well the model works if I add additional Doppler shift components to sample over.

For this, you'll need to change the line

:dopplershift(3*nlines+1, 1, true, MyConditionalPrior())

in MyModel.cpp to read

:dopplershift(3*nlines+1, 3, false, MyConditionalPrior())

and recompile. This will sample over up to three different Doppler shifts at the same time. In theory, we expect that the posterior will have a strong mode at either zero Doppler shifts (for the non-Doppler-shifted data) or at $1$ for all types of data set (where for some, the Doppler shift is rather strongly constrained to $0$).

Results

A place holder for the results from this experiment.


In [ ]:


In [ ]:


In [ ]:


In [ ]:

The extended Spectrum: More lines!

Let's also make some simulations for the extended spectrum with more lines. This will require the extended_lines.txt file. The file name for the file with the lines centroid is currently hard-coded in the main.cpp file.

Change the line

    Data::get_instance().load_lines("../data/si_lines.txt");

to

    Data::get_instance().load_lines("../data/lines_extended.txt");

and also change the Doppler shifts in MyModel.cpp back to

:dopplershift(3*nlines+1, 1, true, MyConditionalPrior())

before recompiling. We will change the last line again in a little while, but first we'll test the general performance of the model on the extended data set.

An extended spectrum with variable line amplitudes and a single Doppler shift


In [ ]:
froot = "../data/test_extended_shift1"

input_pars = {}

# set amplitude and q
input_pars["logamp_mean"] = np.log(0.2)
input_pars["logq_mean"] = np.log(600.)

# set the width of the amplitude and q distribution (not used here)

input_pars["logamp_sigma"] = 0.4
input_pars["logq_sigma"] = np.log(50)

# set Doppler shift
input_pars["dshift"] = 0.01

# set background
input_pars["logbkg"] = np.log(0.09)

# do not sample amplitudes or q-factors(all are the same!)
input_pars["sample_logamp"] = True
input_pars["sample_logq"] = False


# lines are only absorption lines this time!
input_pars["sample_signs"] = True

# error on the data points (will sample from a Gaussian distribution)
input_pars["err"] = 0.007

# Set three lines straight to zero!
input_pars["nzero"] = 0

In [ ]:
np.random.seed(20162210)
fake_data(wavelength_left, wavelength_right, lines_extended, input_pars, froot)

In [ ]:


In [ ]:

A Spectrum with Two Doppler Shifts

What if the lines are shifted with respect to each other?

Let's simulate a spectrum where the silicon lines are Doppler shifted by one value, but the other lines are shifted by a different Doppler shift.


In [ ]:
froot = "test_extended_shift2"

# set amplitude and q
logamp_mean = np.log(0.2)
logq_mean = np.log(600.)

# set the width of the amplitude and q distribution (not used here)

logamp_sigma = 0.4
logq_sigma = np.log(50)

# set Doppler shift
dshift1 = 0.01
dshift2 = 0.02

# set background
logbkg1 = np.log(0.09)
logbkg2 = -15.

# do not sample amplitudes or q-factors(all are the same!)
sample_logamp = True
sample_logq = False

# lines are only absorption lines this time!
sample_signs = True

# error on the data points (will sample from a Gaussian distribution)
err = 0.007

# Set three lines straight to zero!
nzero = 0

np.random.seed(20160201)
pars1 = fake_spectrum(wavelength_left, wavelength_right, si_all_val, logbkg=logbkg1,
                     dshift=dshift1, err=err, sample_logamp=sample_logamp, 
                     sample_logq=sample_logq, logamp_hypermean=logamp_mean, 
                     logamp_hypersigma=logamp_sigma, logq_hypermean=logq_mean,
                     logq_hypersigma=logq_sigma, sample_signs=sample_signs, nzero=nzero)

pars2 = fake_spectrum(wavelength_left, wavelength_right, other_lines_all_val, logbkg=logbkg2,
                     dshift=dshift2, err=err, sample_logamp=sample_logamp, 
                     sample_logq=sample_logq, logamp_hypermean=logamp_mean, 
                     logamp_hypersigma=logamp_sigma, logq_hypermean=logq_mean,
                     logq_hypersigma=logq_sigma, sample_signs=sample_signs, nzero=nzero)

In [ ]:
model_flux_c =  pars1["model_flux"]+pars2["model_flux"]
fake_flux_c = model_flux_c + np.random.normal(0.0, err, size=model_flux_c.shape[0])
fake_err_c = np.zeros_like(fake_flux_c) + pars1["err"]

plt.figure(figsize=(14,6))
plt.errorbar(wnew_mid, fake_flux, yerr=fake_err, fmt="o", label="simulated flux", alpha=0.7)
plt.plot(wnew_mid, model_flux, label="simulated model", lw=3)
plt.xlim(wnew_mid[0], wnew_mid[-1])
plt.gca().invert_xaxis()
plt.legend(prop={"size":18})
plt.xlabel("Wavelength [Angstrom]")
plt.ylabel("Normalized Flux")
plt.savefig(datadir+froot+"_lc.png", format="png")

Let's save all the output files as we did before:


In [ ]:
pars = {"wavelength_left": wavelength_left, "wavelength_right": wavelength_right, "err":err,
        "model_flux": model_flux_c, "fake_flux": fake_flux_c,
        "dshift": [dshift1, dshift2], 
        "line_pos": np.hstack([pars1["line_pos"], pars2["line_pos"]]), 
        "logamp": np.hstack([pars1["logamp"], pars2["logamp"]]),
        "signs": np.hstack([pars1["signs"], pars2["signs"]]),
        "logq":  np.hstack([pars1["logq"], pars2["logq"]]) }

In [ ]:
# save the whole dictionary in a pickle file
f = open(froot+"_data.pkl", "w")
pickle.dump(pars, f)
f.close()

# save the fake data in an ASCII file for input into ShiftyLines
np.savetxt(froot+".txt", np.array([wavelength_left, wavelength_right, 
                                   fake_flux_c, fake_err_c]).T)

Results


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

A simple model for Doppler-shifted Spectra

Below we define a basic toy model which samples over all the line amplitudes, widths as well as a Doppler shift. We'll later extend this to work in DNest4.


In [ ]:
logmin = -1.e16

def gaussian_cdf(x, w0, width):
    return 0.5*(1. + scipy.special.erf((x-w0)/(width*np.sqrt(2.))))

def spectral_line(wleft, wright, w0, amplitude, width):
    """
    Use the CDF of a Gaussian distribution to define spectral 
    lines. We use the CDF to integrate over the energy bins, 
    rather than taking the mid-bin energy.
    
    Parameters
    ----------
    wleft: array
        Left edges of the energy bins
        
    wright: array
        Right edges of the energy bins
        
    w0: float
        The centroid of the line
        
    amplitude: float
        The amplitude of the line
        
    width: float
        The width of the line
        
    Returns
    -------
    line_flux: array
        The array of line fluxes integrated over each bin
    """
    line_flux = amplitude*(gaussian_cdf(wright, w0, width)-
                           gaussian_cdf(wleft, w0, width))
    return line_flux

class LinePosterior(object):
    def __init__(self, x_left, x_right, y, yerr, line_pos):
        """
        A class to compute the posterior of all the lines in 
        a spectrum.
        
        Parameters
        ----------
        x_left: np.ndarray
            The left edges of the independent variable (wavelength bins)
        
        x_right: np.ndarray
            The right edges of the independent variable (wavelength bins)
            
        y: np.ndarray
            The dependent variable (flux)
            
        yerr: np.ndarray
            The uncertainty on the dependent variable (flux)
            
        line_pos: np.ndarray
            The rest-frame positions of the spectral lines
            
        Attributes
        ----------
        x_left: np.ndarray
            The left edges of the independent variable (wavelength bins)
        
        x_right: np.ndarray
            The right edges of the independent variable (wavelength bins)
          
        x_mid: np.ndarray
            The mid-bin positions
            
        y: np.ndarray
            The dependent variable (flux)
            
        yerr: np.ndarray
            The uncertainty on the dependent variable (flux)
            
        line_pos: np.ndarray
            The rest-frame positions of the spectral lines

        nlines: int
            The number of lines in the model
        
        """
    
        self.x_left = x_left
        self.x_right = x_right
        self.x_mid = x_left + (x_right-x_left)/2.
        
        self.y = y
        
        assert np.size(yerr) == 1, "Multiple errors are not supported!"
        
        self.yerr = yerr
        self.line_pos = line_pos
        self.nlines = len(line_pos)
        
    def logprior(self, t0):
        """
        The prior of the model. Currently there are Gaussian priors on the 
        line width as well as the amplitude and the redshift.
    
        Parameters
        ----------
        t0: iterable
            The list or array with the parameters of the model
            Contains:
                * Doppler shift
                * a background parameter
                * all line amplitudes
                * all line widths

            
        Returns
        -------
        logp: float
            The log-prior of the model
        
        """
        # t0 must have twice the number of lines (amplitude, width for each) plus a 
        # background plus the redshift
        assert len(t0) == 2*self.nlines+2, "Wrong number of parameters!"
        
        # get out the individual parameters
        dshift = t0[0] # Doppler shift
        log_bkg = t0[1]
        amps = t0[2:2+self.nlines]
        log_widths = t0[2+self.nlines:]
        
        # prior on the Doppler shift is Gaussian
        dshift_hypermean = 0.0
        dshift_hypersigma = 0.01
        
        dshift_prior = scipy.stats.norm(dshift_hypermean, dshift_hypersigma)
        p_dshift = np.log(dshift_prior.pdf(dshift))
        
        #print("p_dshift: " + str(p_dshift))
        
        # Prior on the background is uniform
        logbkg_min = -10.0
        logbkg_max = 10.0
        p_bkg = (log_bkg >= logbkg_min and log_bkg <= logbkg_max)/(logbkg_max-logbkg_min)
        if p_bkg == 0:
            p_logbkg = logmin
        else:
            p_logbkg = 0.0
        
        #print("p_logbkg: " + str(p_logbkg))
        
        # prior on the amplitude is Gaussian
        amp_hypermean = 0.0
        amp_hypersigma = 0.1
        
        amp_prior = scipy.stats.norm(amp_hypermean, amp_hypersigma)
        
        p_amp = 0.0
        for a in amps:
            p_amp += np.log(amp_prior.pdf(a))
        
        #print("p_amp: " + str(p_amp))
        # prior on the log-widths is uniform:
        logwidth_min = -5.
        logwidth_max = 3.
        
        p_logwidths = 0.0
        for w in log_widths:
            #print("w: " + str(w))
            p_width = (w >= logwidth_min and w <= logwidth_max)/(logwidth_max-logwidth_min)
            if p_width == 0.0:
                p_logwidths += logmin
            else:
                continue
                
        #print("p_logwidths: " + str(p_logwidths))
        logp = p_dshift + p_logbkg + p_amp + p_logwidths
        
        return logp
    
    @staticmethod
    def _spectral_model(x_left, x_right, dshift, logbkg, line_pos, amplitudes, logwidths):
        """
        The spectral model underlying the data. It uses the object 
        attributes `x_left` and `x_right` to integrate over the bins 
        correctly.
        
        Parameters
        ----------
        x_left: np.ndarray
            The left bin edges of the ordinate (wavelength bins)
            
        x_right: np.ndarray
            The right bin edges of the ordinate (wavelength bins)
            
        dshift: float
            The Doppler shift
            
        logbkg: float
            Logarithm of the constant background level
        
        line_pos: iterable
            The rest frame positions of the line centroids in the same 
            units as `x_left` and `x_right`
        
        amplitudes: iterable
            The list of all line amplitudes
            
        logwidths: iterable
            The list of the logarithm of all line widths
        
        Returns
        -------
        flux: np.ndarray
            The integrated flux in the bins defined by `x_left` and `x_right`
            
        """
        
        assert len(line_pos) == len(amplitudes), "Line positions and amplitudes must have same length"
        assert len(line_pos) == len(logwidths), "Line positions and widths must have same length"
        
        # shift the line position by the redshift
        line_pos_shifted = line_pos + dshift
        
        #print(line_pos_shifted)
        # exponentiate logarithmic quantities
        bkg = np.exp(logbkg)
        widths = np.exp(logwidths)
        
        #print(widths)
        # background flux
        flux = np.zeros_like(x_left) + bkg

        #print(amplitudes)
        
        # add all the line fluxes
        for x0, a, w in zip(line_pos_shifted, amplitudes, widths):
            flux += spectral_line(x_left, x_right, x0, a, w)

        return flux

        
        
    def loglikelihood(self, t0):
        """
        The Gaussian likelihood of the model. 
    
        Parameters
        ----------
        t0: iterable
            The list or array with the parameters of the model
            Contains:
                * Doppler shift
                * a background parameter
                * all line amplitudes
                * all line widths

            
        Returns
        -------
        loglike: float
            The log-likelihood of the model
        
        """
        
        assert len(t0) == 2*self.nlines+2, "Wrong number of parameters!"
        
        # get out the individual parameters
        dshift = t0[0] # Doppler shift
        logbkg = t0[1]
        amplitudes = t0[2:2+self.nlines]
        logwidths = t0[2+self.nlines:]

        model_flux = self._spectral_model(self.x_left, self.x_right, 
                                          dshift, logbkg, self.line_pos, 
                                          amplitudes, logwidths)

        loglike = -(len(self.y)/2.)*np.log(2.*np.pi*self.yerr**2.) - \
                    np.sum((self.y-model_flux)**2./(2.*self.yerr**2.))
            
        return loglike
    
    def logposterior(self, t0):
        """
        The Gaussian likelihood of the model. 
    
        Parameters
        ----------
        t0: iterable
            The list or array with the parameters of the model
            Contains:
                * Doppler shift
                * a background parameter
                * all line amplitudes
                * all line widths

            
        Returns
        -------
        logpost: float
            The log-likelihood of the model
        
        """

        
        # assert the number of input parameters is correct:
        assert len(t0) == 2*self.nlines+2, "Wrong number of parameters!"

        logpost = self.logprior(t0) + self.loglikelihood(t0)
        #print("prior: " + str(self.logprior(t0)))
        #print("likelihood: " + str(self.loglikelihood(t0)))
        #print("posterior: " + str(self.logposterior(t0)))
        if np.isfinite(logpost):
            return logpost
        else:
            return logmin

    def __call__(self, t0):
        return self.logposterior(t0)

Now we can use emcee to sample.

We're going to use one of our example data sets, one without Doppler shift and strong lines:


In [ ]:
data = np.loadtxt(datadir+"test_spectra_noshift_sameamp_samewidth3.txt")

x_left = data[:,0]
x_right = data[:,1]
flux = data[:,3]
f_err = 0.007

lpost = LinePosterior(x_left, x_right, flux, f_err, si_all_val)

plt.errorbar(lpost.x_mid, flux, yerr=f_err, fmt="o", label="Fake data")
plt.plot(lpost.x_mid, data[:,4], label="Underlying model")

In [ ]:
d_test = 0.0
bkg_test = np.log(0.09)
amp_test = np.zeros_like(si_all_val) - 0.5
logwidth_test = np.log(np.zeros_like(si_all_val) + 0.01)

p_test = np.hstack([d_test, bkg_test, amp_test, logwidth_test])

In [ ]:
lpost.logprior(p_test)

In [ ]:
d_test = 0.0
bkg_test = np.log(0.09)
amp_test = np.zeros_like(si_all_val) - 0.5
logwidth_test = np.log(np.zeros_like(si_all_val) + 0.01)

p_test = np.hstack([d_test, bkg_test, amp_test, logwidth_test])

lpost.loglikelihood(p_test)

In [ ]:
nwalkers = 1000
niter = 300
burnin = 300

# starting positions for all parameters, from the prior
# the values are taken from the `logprior` method in `LinePosterior`.
# If the hyperparameters of the prior change in there, they'd better 
# change here, too!
dshift_start = np.random.normal(0.0, 0.01, size=nwalkers)
logbkg_start = np.random.uniform(-10., 10., size=nwalkers)
amp_start = np.random.normal(0.0, 0.1, size=(lpost.nlines, nwalkers))
logwidth_start = np.random.uniform(-5., 3., size=(lpost.nlines, nwalkers))
p0 = np.vstack([dshift_start, logbkg_start, amp_start, logwidth_start]).T

ndim = p0.shape[1]

In [ ]:
sampler = emcee.EnsembleSampler(nwalkers, ndim, lpost)
pos, prob, state = sampler.run_mcmc(p0, burnin)

In [ ]:
sampler.reset()

## do the actual MCMC run
pos, prob, state = sampler.run_mcmc(pos, niter, rstate0=state)

In [ ]:
mcall = sampler.flatchain

In [ ]:
mcall.shape

In [ ]:
for i in range(mcall.shape[1]):
    pmean = np.mean(mcall[:,i])
    pstd = np.std(mcall[:,i])
    print("Parameter %i: %.4f +/- %.4f"%(i, pmean, pstd))

In [ ]:
mcall.shape

In [ ]:
corner.corner(mcall);

In [ ]:
lpost = LinePosterior(x_left, x_right, flux, f_err, si_all_val)

plt.errorbar(lpost.x_mid, flux, yerr=f_err, fmt="o", label="Fake data")
plt.plot(lpost.x_mid, data[:,4], label="Underlying model")

randind = np.random.choice(np.arange(mcall.shape[0]), replace=False, size=20)
for ri in randind:
    ri = int(ri)
    p = mcall[ri]
    dshift = p[0]
    logbkg = p[1]
    line_pos = lpost.line_pos
    amplitudes = p[2:2+lpost.nlines]
    logwidths = p[2+lpost.nlines:] 
    plt.plot(lpost.x_mid, lpost._spectral_model(x_left, x_right, dshift, logbkg, line_pos,
                                               amplitudes, logwidths))

In [ ]:


In [ ]:


In [ ]:

Things to think about:

  • what if we don't have ten lines, but a hundred?
  • want to do actual inference on individual lines (i.e. want to know where they are)
  • how do we incorporate errors in the line positions?
  • should we do a straight model comparison for the redshifts?
  • how do we incorporate sections of data?