CHIPPR

This notebook demonstrates the use of the Cosmological Hierarchical Inference with Probabilistic Photometric Redshifts (CHIPPR) package to estimate population distributions based on a catalog of probability distributions.

The package supports two primary objectives: simulation of catalogs and inference of posterior distributions over parameters defining population distributions.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import timeit
import cProfile, pstats
from io import StringIO
import os

import chippr

In [ ]:
help(chippr)

Simulation

Many of chippr's modules are used to produce mock catalogs of individual posterior probability distributions.

To create a catalog, we must first define a true redshift distribution function. It may be a Gaussian distribution of the gauss class, a Gaussian mixture distribution of the gmix class, or a binned discrete distribution of the discrete class. In this case, we will consider a mixture of three Gaussian distributions to represent the true redshift distribution.


In [ ]:
# amplitudes, means, and standard deviations of the three mixture components
true_amps = np.array([0.20, 0.35, 0.55])
true_means = np.array([0.5, 0.2, 0.75])
true_sigmas = np.array([0.4, 0.2, 0.1])
n_mix_comps = len(true_amps)

true_funcs = []
for c in range(n_mix_comps):
    true_funcs.append(chippr.gauss(true_means[c], true_sigmas[c]**2))  # gauss takes the mean and variance

true_nz = chippr.gmix(true_amps, true_funcs, limits=(0., 1.))

chippr supports the use of a parameter file to specify the various options that the catalog simulator and the inference module should turn on and off; an illustrative sketch of such a file appears below.
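
For orientation, a parameter file might contain lines like the following. The keys bin_min, bin_max, and n_gals are the ones read later in this notebook, but the values and the exact syntax expected by chippr.utils.ingest are assumptions here.

    bin_min 0.0
    bin_max 1.0
    n_gals 10000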


In [ ]:
param_loc = 'params.txt'
params = chippr.utils.ingest(param_loc)
params = chippr.defaults.check_sim_params(params)
print(params)

To make a catalog, we must specify an interim prior redshift distribution $n^{*}(z)$, regardless of the quantity we ultimately wish to infer from the catalog. So far, only discrete distributions are supported, though support for other forms will be added soon. The simplest discrete distribution is uniform.


In [ ]:
bin_ends = np.array([params['bin_min'], params['bin_max']])
weights = np.array([1.])

int_prior = chippr.discrete(bin_ends, weights)
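
For illustration, a non-uniform interim prior can be constructed the same way by supplying more bins with unequal weights. The two-bin example below is hypothetical (the variable names are ours) and is not used in the rest of this notebook.


In [ ]:
# Hypothetical non-uniform interim prior over two bins; not used below.
alt_bin_ends = np.array([params['bin_min'],
                         (params['bin_min'] + params['bin_max']) / 2.,
                         params['bin_max']])
alt_weights = np.array([1., 2.])  # relative per-bin weights
alt_prior = chippr.discrete(alt_bin_ends, alt_weights)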

Now we are ready to make a catalog. To do this, we instantiate a catalog object and then create a catalog of individual posterior distributions based on the true redshift distribution and the interim prior. By default, the catalog generator makes some informative plots along the way: a histogram of the true redshifts and a scatterplot of the true redshifts against the centers of the individual Gaussian posteriors. Support for other posterior forms will be added soon. Additionally, the catalog is expressed as normalized binned histogram heights; support for other parametrizations of the individual posteriors may be added in the future.


In [ ]:
results_loc = os.path.join('..', '..', 'research', 'results', 'demo')

posteriors = chippr.catalog(params=param_loc, loc=results_loc)

output = posteriors.create(true_nz, int_prior, N=params['n_gals'])

data = np.exp(output['log_interim_posteriors'])
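
Since the catalog entries are normalized binned histogram heights, a quick sanity check is to confirm their normalization. The convention assumed below, namely that each interim posterior integrates to one over an evenly spaced coarse grid, is our guess.


In [ ]:
# Sanity check under an assumed convention: each interim posterior
# should integrate to one over the evenly spaced coarse grid.
dz_coarse = posteriors.z_coarse[1] - posteriors.z_coarse[0]
print(np.sum(data, axis=1)[:5] * dz_coarse)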

We can also plot a histogram of the centers of the individual Gaussian posteriors, a binned version of the true redshift distribution, and the $n(z)$ resulting from stacking the individual posteriors.


In [ ]:
plt.hist(posteriors.samps.T[1], bins=100, density=True, color="k")
plt.plot(posteriors.z_coarse, true_nz.evaluate(posteriors.z_coarse), "r-")
plt.plot(posteriors.z_coarse, np.sum(data, axis=0) / len(data), "go")  # stacked estimator: average of the interim posteriors
plt.xlabel("z")

It is also informative to see what a few individual likelihoods and binned posteriors look like.


In [ ]:
for n, post in enumerate(data[:10]):
    plt.plot(posteriors.z_coarse, post, 'ko')
    plt.plot(posteriors.z_fine, posteriors.obs_lfs[n], 'k-')
    plt.show()

We finish by saving the data as a plaintext file. Support for more file formats will be added soon.


In [ ]:
saved_location = 'data'
saved_type = '.txt'
posteriors.write(loc=saved_location, style=saved_type)

Inference

chippr currently contains one inference module to probe the posterior distribution of parameters defining the redshift distribution function.

To perform inference, we must create a catalog object. This may be done by making a new catalog as is done above or by reading in an existing catalog file.


In [ ]:
param_loc = 'params.txt'
results_loc = os.path.join('..', '..', 'research', 'results', 'demo')
simulated_posteriors = chippr.catalog(params=param_loc, loc=results_loc)

saved_location = 'data'
saved_type = '.txt'
data = simulated_posteriors.read(loc=saved_location, style=saved_type)

The catalog file contains three components: the bin_ends, the log_interim_prior, and the log_interim_posteriors. The bin endpoints can be processed to enable their use in constructing a prior distribution over the parameters determining the redshift distribution function.


In [ ]:
zs = data['bin_ends']
log_nz_intp = data['log_interim_prior']
log_z_posts = data['log_interim_posteriors']

z_difs = zs[1:] - zs[:-1]  # bin widths
z_mids = (zs[1:] + zs[:-1]) / 2.  # bin midpoints
n_bins = len(z_mids)

The prior distribution must be an mvn object, defined by a mean vector and covariance matrix over the parameters defining the redshift distribution. In this case, it is intuitive to build the prior from the definition of the binning strategy, since the parameters are normalized histogram bin heights, the same parametrization used for the catalog entries themselves.


In [ ]:
# squared-exponential covariance between histogram bin heights
a = 1.   # amplitude
b = 20.  # inverse squared correlation length in z
c = 0.   # optional diagonal jitter
prior_var = np.zeros((n_bins, n_bins))
for k in range(n_bins):
    prior_var[k] = a * np.exp(-0.5 * b * (z_mids[k] - z_mids) ** 2)
prior_var += c * np.identity(n_bins)

prior_mean = log_nz_intp
prior = chippr.mvn(prior_mean, prior_var)
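
Before using the prior, it can be helpful to visualize its covariance matrix to confirm that nearby bins are strongly correlated while distant bins are not; this cell is purely illustrative.


In [ ]:
# Purely illustrative: inspect the correlation structure of the prior.
plt.imshow(prior_var, interpolation='nearest')
plt.colorbar()
plt.xlabel('bin')
plt.ylabel('bin')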

We create a log_z_dens object from the catalog dictionary and the prior distribution. We include the optional specification of the true distribution, since it is available in this case; a parameter file may also be supplied to set default constants for the inference. Now the log_z_dens object can plot some samples from the prior so we can see what it looks like.


In [ ]:
nz = chippr.log_z_dens(data, prior, truth=true_nz, vb=True)

prior_samples = prior.sample(7)
chippr.log_z_dens_plots.plot_ivals(prior_samples, nz.info, nz.plot_dir)

We perform calculations of a few of the simplest estimators of the redshift distribution function $\hat{n}(z)$. The stacked estimator is defined as $\hat{n}(z) = \frac{1}{N}\sum_{i=1}^{N} p(z \mid \vec{d}_{i}, n^{*}(z))$. The marginalized maximum a posteriori estimator is the histogram of the per-galaxy posterior modes, $\hat{n}(z) = \hat{n}(\{\mathrm{argmax}_{z}[p(z \mid \vec{d}_{i}, n^{*}(z))]\}_{i=1}^{N})$. The marginalized expected value estimator is the histogram of the per-galaxy posterior means, $\hat{n}(z) = \hat{n}(\{\mathrm{E}[z \mid \vec{d}_{i}, n^{*}(z)]\}_{i=1}^{N})$.


In [ ]:
nz_stacked = nz.calculate_stacked()
nz_mmap = nz.calculate_mmap()
nz_mexp = nz.calculate_mexp()
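
To connect the stacked formula to the catalog arrays, the cell below computes the stacked estimator by hand as the average of the individual interim posteriors; whether this matches the normalization convention of nz.calculate_stacked() is an assumption.


In [ ]:
# Minimal hand-rolled stacked estimator: the average of the individual
# interim posteriors. Agreement with the package's normalization
# convention is an assumption.
nz_stacked_by_hand = np.mean(np.exp(log_z_posts), axis=0)
plt.plot(z_mids, nz_stacked_by_hand, 'go')
plt.xlabel('z')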

The log_z_dens object enables easy comparison between estimators via the Kullback-Leibler divergence (when the true distribution is available) and root-mean-square differences; a hand-rolled version of the former is sketched below for reference.
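
As a point of reference, the Kullback-Leibler divergence between two binned density estimates can be computed by hand as below; this helper is illustrative and is not the package's own implementation.


In [ ]:
# Illustrative only: Kullback-Leibler divergence between two binned
# density estimates p and q on bins of width dz; not the package's
# own implementation.
def kld(p, q, dz):
    p = p / np.sum(p * dz)  # normalize both to integrate to one
    q = q / np.sum(q * dz)
    return np.sum(dz * p * np.log(p / q))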

We may next calculate the marginalized maximum likelihood estimator (which actually returns the parameters maximizing the posterior probability).


In [ ]:
nz_mmle = nz.calculate_mmle(nz_stacked)

If we are very ambitious, we can run an MCMC sampler (currently use of emcee is supported, but other samplers may be added in the future) to probe the posterior distribution of the parameter values. To do this, we initialize the sampler with samples from the prior distribution.


In [ ]:
n_ivals = 2 * n_bins  # emcee requires at least twice as many walkers as parameters
initial_values = prior.sample(n_ivals)

nz_samps = nz.calculate_samples(initial_values)

In [ ]:
nz_stats = nz.compare()

The log_z_dens object stores the estimators that have been calculated as well as all metadata associated with the posterior samples. The storage of the metadata and samples will soon be eliminated in favor of saved files, as that information can consume a great deal of memory.


In [ ]:
nz.info['estimators'].keys()

Currently, the results of all previously calculated estimators (and the true redshift density function, if it was provided) may be plotted automatically.


In [ ]:
nz.plot_estimators()

The log_z_dens object supports writing the information associated with the estimators to a file in the pickle format, though other formats may be added in the future.


In [ ]:
nz.write('nz.p')

Here we demonstrate that the written estimators may be loaded from files as well for future use.


In [ ]:
nz.info = nz.read('nz.p')
print(nz)
