qp Demo

Alex Malz & Phil Marshall

In this notebook we use the qp module to approximate some simple, standard, 1-D PDFs using sets of quantiles, samples, and histograms, and assess their relative accuracy. We also show how such analyses can be extended to use "composite" PDFs made up of mixtures of standard distributions.

Requirements

To run qp, you will need to first install the module.
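
For example, here is a hedged sketch of one possible installation route; consult the qp README for the authoritative instructions. The repository location below is an assumption.


In [ ]:
# one possible way to install qp from source; commented out so the
# notebook can run without network access (repository URL is an assumption)
# ! git clone https://github.com/aimalz/qp.git
# ! pip install -e qp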


In [ ]:
import numpy as np
import scipy.stats as sps
import scipy.interpolate as spi

import matplotlib
matplotlib.use('Agg')  # non-interactive fallback; %matplotlib inline below takes precedence in the notebook
import matplotlib.pyplot as plt
%matplotlib inline

import qp

The qp.PDF Class

This is the basic element of qp - an object representing a probability density function. This class is stored in the module pdf.py. The PDF must be initialized with some representation of the distribution.


In [ ]:
# ! cat qp/pdf.py
P = qp.PDF(vb=True)

Approximating a Gaussian

Let's summon a PDF object, and initialize it with a standard function - a Gaussian.


In [ ]:
dist = sps.norm(loc=0, scale=1)
print(type(dist))
demo_limits = (-5., 5.)
P = qp.PDF(funcform=dist, limits=demo_limits)
P.plot()

Samples

Let's sample the PDF to see how it looks. When we plot the PDF object, both the true and sampled distributions are displayed.


In [ ]:
np.random.seed(42)

samples = P.sample(1000, using='mix_mod', vb=False)
S = qp.PDF(samples=samples, limits=demo_limits)
S.plot()

Quantile Parametrization

Now, let's compute a set of evenly spaced quantiles. These will be carried by the PDF object as P.quantiles. We also demonstrate the initialization of a PDF object with quantiles and no truth function.
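
For intuition, "evenly spaced quantiles" can be computed directly from the percent point function (inverse CDF) of the true distribution. This is a standalone sketch of the idea; qp's exact convention for placing the quantile levels may differ.


In [ ]:
# a minimal sketch, assuming quantile levels at the midpoints of N equal
# probability intervals (qp's internal convention may differ)
N_quantiles = 10
levels = (np.arange(N_quantiles) + 0.5) / N_quantiles
locations = dist.ppf(levels)  # inverse CDF of the standard normal
print(levels)
print(locations)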


In [ ]:
quantiles = P.quantize(N=10)
Q = qp.PDF(quantiles=quantiles, limits=demo_limits)
Q.plot()

Histogram Parametrization

Let's also compute a histogram representation, which will be carried by the PDF object as P.histogram. The value in each bin is the integral of the PDF over the range defined by that bin's ends. We can also initialize a PDF object with a histogram and no truth function.


In [ ]:
histogram = P.histogramize(N=10, binrange=demo_limits)
H = qp.PDF(histogram=histogram, limits=demo_limits)
H.plot()
print(H.truth)
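
As a sanity check of the description above, the per-bin integrals of the true PDF can be computed directly from differences of the Gaussian CDF; over 10 equal-width bins between the limits they should sum to (nearly) 1. This standalone computation uses scipy only, not qp's internals.


In [ ]:
bin_ends = np.linspace(demo_limits[0], demo_limits[-1], 10 + 1)
# per-bin integrals of the true PDF, via differences of the CDF
bin_integrals = dist.cdf(bin_ends[1:]) - dist.cdf(bin_ends[:-1])
print(bin_integrals)
print(np.sum(bin_integrals))  # ~1 between -5 and 5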

Evaluating the Approximate PDF by Interpolation

Once we have chosen a parametrization to approximate the PDF with, we can evaluate the approximate PDF at any point by interpolation (or extrapolation). qp uses scipy.interpolate.interp1d to do this, with linear as the default interpolation scheme. (Most other options do not enable extrapolation, nearest being the exception.)
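
For intuition, here is roughly the kind of call qp makes under the hood: a minimal sketch using scipy.interpolate.interp1d directly, with linear interpolation and extrapolation enabled. qp's actual internals may differ in detail.


In [ ]:
# tabulate the true PDF on a few knots, then interpolate/extrapolate
x_knots = np.linspace(-3., 3., 10)
y_knots = dist.pdf(x_knots)
interpolator = spi.interp1d(x_knots, y_knots, kind='linear', fill_value='extrapolate')
print(interpolator(np.array([0.314, 4.0])))  # one point inside, one outside the knots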

Let's test this interpolation by evaluating an approximation at a single point using the quantile parametrization.


In [ ]:
print(P.approximate(np.array([0.314]), using='quantiles'))

In [ ]:
P.mix_mod

(We can also integrate any approximation.)


In [ ]:
print(P.integrate([0., 1.], using='quantiles'))
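
We can cross-check that result against the exact probability mass of the true Gaussian on the same interval, computed as a difference of CDF values:


In [ ]:
# exact integral of the standard normal PDF over [0, 1]
print(dist.cdf(1.) - dist.cdf(0.))  # ~0.3413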

We can also interpolate the function onto an evenly spaced grid, with points both inside and outside the quantile range, as follows:


In [ ]:
grid = np.linspace(-3., 3., 100)
gridded = P.approximate(grid, using='quantiles')

We can also change the interpolation scheme:


In [ ]:
print(P.scheme)
print(P.approximate(np.array([0.314]), using='quantiles', scheme='nearest'))
print(P.scheme)

The "Evaluated" or "Gridded" Parametrization

A qp.PDF object may also be initialized with a parametrization of a function evaluated on a grid. This is also what is produced by the qp.PDF.approximate() method. So, let's take the output of a qp.PDF approximation evaluation, and use it to instantiate a new qp.PDF object. Note that the evaluate method can be used to return PDF evaluations from either the true PDF or one of its approximations, via the using keyword argument.


In [ ]:
grid = np.linspace(-3., 3., 20)
gridded = P.evaluate(grid, using='mix_mod', vb=False)

G = qp.PDF(gridded=gridded, limits=demo_limits)
G.sample(100, vb=False)
G.plot()

Let's unpack this a little. The G PDF object has an attribute G.gridded which contains the initial gridded function. This lookup table is used when making further approximations. To check this, let's look at whether this G PDF object knows what the true PDF is, which approximation it's going to use, and then how it performs at making a new approximation to the PDF on a coarser grid:


In [ ]:
print(G.truth)

In [ ]:
print(G.last, 'approximation,', G.scheme, 'interpolation')

In [ ]:
# 10-point grid for a coarse approximation:
coarse_grid = np.linspace(-3.5, 3.5, 10)
coarse_evaluation = G.approximate(coarse_grid, using='gridded')
print(coarse_evaluation)

Mixture Model Fit

We can fit a parametric mixture model to samples from any parametrization. Currently, only a Gaussian mixture model is supported.


In [ ]:
MM = qp.PDF(funcform=dist, limits=demo_limits)
MM.sample(1000, vb=False)
MM.mix_mod_fit(n_components=5)
MM.plot()

Comparing Parametrizations

qp supports both qualitative and quantitative comparisons between different distributions, across parametrizations.

Qualitative Comparisons: Plotting

Let's visualize the PDF object in order to compare the truth and the approximations. The solid, black line shows the true PDF evaluated between the bounds. The green rugplot shows the locations of the 1000 samples we took. The vertical, dotted, blue lines show the quantiles we asked for, and the horizontal, dotted, red lines show the 10 equally spaced bins we asked for. Note that the quantiles refer to the probability distribution between the bounds, because we are not able to integrate numerically over an infinite range. Interpolations of each parametrization are plotted as dashed lines in their corresponding colors. Note that the interpolations of the quantile and histogram parametrizations are so close to each other that the difference is almost imperceptible!


In [ ]:
P.plot()

Quantitative Comparisons


In [ ]:
symm_lims = np.array([-1., 1.])
all_lims = [symm_lims, 2.*symm_lims, 3.*symm_lims]

Next, let's compare the different parametrizations to the truth using the Kullback-Leibler divergence (KLD). The KLD measures how much two probability distributions differ; a smaller value indicates closer agreement. It is expressed in bits of information: the information lost when the approximation is used in place of the true distribution. The KLD calculator here takes a shared grid upon which to evaluate the true distribution and the interpolated approximation, and returns the KLD of the approximation relative to the truth, which is not in general the same as the KLD of the truth relative to the approximation. Below, we'll calculate the KLD of each approximation relative to the truth over different ranges, showing that it increases as the range grows to include areas where the true and interpolated distributions diverge.
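
First, as a reminder of what the metric computes, here is a minimal direct implementation of the KLD on a shared grid, applied to two scipy Gaussians rather than qp objects. This is a sketch only; qp.metrics.calculate_kld handles the gridding and normalization itself.


In [ ]:
def kld_on_grid(p_dist, q_dist, limits, dx=0.01):
    # evaluate both PDFs on a shared grid
    grid = np.arange(limits[0], limits[-1] + dx, dx)
    p = p_dist.pdf(grid)
    q = q_dist.pdf(grid)
    # KLD of q relative to p; log base 2 gives bits (natural log would give nats)
    return np.sum(p * np.log2(p / q)) * dx

offset_dist = sps.norm(loc=0.1, scale=1.1)  # a deliberately imperfect "approximation"
print(kld_on_grid(dist, offset_dist, (-1., 1.)))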


In [ ]:
for PDF in [Q, H, S]:
    D = []
    for lims in all_lims:
        D.append(qp.metrics.calculate_kld(P, PDF, limits=lims, vb=False))
    print(PDF.first + ' approximation: KLD over 1, 2, 3 sigma ranges = ' + str(D))

Holy smokes, does the quantile approximation blow everything else out of the water, thanks to using spline interpolation.

The progression of KLD values should follow that of the root mean square error (RMSE), another measure of how close two functions are to one another. The RMSE also increases as the range grows to include areas where the true and interpolated distributions diverge. Unlike the KLD, the RMSE is symmetric: swapping the two distributions leaves its value unchanged.
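
The RMSE admits an equally short sketch on a shared grid, reusing the offset Gaussian from the KLD example above (again, qp.metrics.calculate_rmse does the gridding for qp objects itself):


In [ ]:
def rmse_on_grid(p_dist, q_dist, limits, dx=0.01):
    # root mean squared difference between the two PDFs on a shared grid
    grid = np.arange(limits[0], limits[-1] + dx, dx)
    return np.sqrt(np.mean((p_dist.pdf(grid) - q_dist.pdf(grid)) ** 2))

print(rmse_on_grid(dist, offset_dist, (-1., 1.)))
print(rmse_on_grid(offset_dist, dist, (-1., 1.)))  # identical, by symmetry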


In [ ]:
for PDF in [Q, H, S]:
    D = []
    for lims in all_lims:
        D.append(qp.metrics.calculate_rmse(P, PDF, limits=lims, vb=False))
    print(PDF.first + ' approximation: RMSE over 1, 2, 3 sigma ranges = ' + str(D))

Both the KLD and RMSE metrics suggest that the quantile approximation is better in the high density region, but samples work better when the tails are included. We might expect the answer to the question of which approximation to use to depend on the application, and whether the tails need to be captured or not.

Finally, we can compute the moments of each approximation and compare them to the moments of the true distribution.


In [ ]:
pdfs = [P, Q, H, S]
which_moments = list(range(3))
all_moments = []
for pdf in pdfs:
    moments = []
    for n in which_moments:
        moments.append(qp.metrics.calculate_moment(pdf, n))
    all_moments.append(moments)
    
print('moments: '+str(which_moments))
for i in range(len(pdfs)):
    print(pdfs[i].first+': '+str(all_moments[i]))

The first three moments have an interesting interpretation. The zeroth moment should always be 1 when calculated over the full range of the variable, but the quantile approximation is off by about $7\%$. We know the first moment in this case is 0, and indeed the evaluation of the first moment for the true distribution deviates from 0 by less than floating-point precision. The samples parametrization's estimate of the first moment is biased at the $2\%$ level. The second moment of the true distribution is 1, and the quantile parametrization (and, to a lesser extent, the histogram parametrization) fails to provide a good estimate of it.
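
For reference, the $n$-th moment is just the integral of $x^n$ weighted by the PDF; here is a direct numerical version for the true distribution (a sketch of the quantity qp.metrics.calculate_moment estimates):


In [ ]:
dx = 0.01
moment_grid = np.arange(demo_limits[0], demo_limits[-1] + dx, dx)
weights = dist.pdf(moment_grid)
for n in range(3):
    # n-th moment: integral of x^n p(x) dx over the limits; expect ~1, ~0, ~1
    print(n, np.sum(moment_grid ** n * weights) * dx)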

Advanced Usage

Composite PDFs

In addition to individual scipy.stats.rv_continuous objects, qp can be initialized with true distributions that are linear combinations of scipy.stats.rv_continuous objects. To do this, one must create the component distributions and specify their relative weights. This can be done by running qp.PDF.mix_mod_fit() on an existing qp.PDF object once samples have been calculated, or it can be done by hand.


In [ ]:
component_1 = {}
component_1['function'] = sps.norm(loc=-2., scale=1.)
component_1['coefficient'] = 4.
component_2 = {}
component_2['function'] = sps.norm(loc=2., scale=1.)
component_2['coefficient'] = 1.
dist_info = [component_1, component_2]

composite_lims = (-5., 5.)

C_dist = qp.composite(dist_info)
C = qp.PDF(funcform=C_dist, limits=composite_lims)
C.plot()

We can calculate the quantiles for such a distribution.


In [ ]:
Cq = qp.PDF(funcform=C_dist, limits = composite_lims)
Cq.quantize(N=20, limits=composite_lims, vb=False)
Cq.plot()

Similarly, the histogram parametrization is also supported for composite PDFs.


In [ ]:
Ch = qp.PDF(funcform=C_dist, limits = composite_lims)
Ch.histogramize(N=20, binrange=composite_lims, vb=True)
Ch.plot()

Finally, samples from this distribution may also be taken, and a PDF may be reconstructed from them. Note: this uses scipy.stats.gaussian_kde, which determines its bandwidth/kernel size using Scott's Rule, Silverman's Rule, a fixed bandwidth, or a callable function that returns a bandwidth.
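
To illustrate those bandwidth options concretely, here is a standalone scipy.stats.gaussian_kde sketch on some toy Gaussian samples, independent of how qp invokes the KDE internally:


In [ ]:
toy_samples = sps.norm(loc=0., scale=1.).rvs(100)
kde_scott = sps.gaussian_kde(toy_samples)                             # Scott's Rule (default)
kde_silverman = sps.gaussian_kde(toy_samples, bw_method='silverman')  # Silverman's Rule
kde_fixed = sps.gaussian_kde(toy_samples, bw_method=0.3)              # fixed bandwidth factor
x = np.linspace(-4., 4., 100)
plt.plot(x, kde_scott(x), label='Scott')
plt.plot(x, kde_silverman(x), label='Silverman')
plt.plot(x, kde_fixed(x), label='fixed 0.3')
plt.legend()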


In [ ]:
Cs = qp.PDF(funcform=C_dist, limits = composite_lims)
Cs.sample(N=20, using='mix_mod', vb=False)
Cs.plot()

In [ ]:
qD = qp.metrics.calculate_kld(C, Cq, limits=composite_lims, dx=0.001, vb=True)
hD = qp.metrics.calculate_kld(C, Ch, limits=composite_lims, dx=0.001, vb=True)
sD = qp.metrics.calculate_kld(C, Cs, limits=composite_lims, dx=0.001, vb=True)
print(qD, hD, sD)

PDF Ensembles

qp also includes infrastructure for handling ensembles of PDF objects that share metaparameters, such as histogram bin ends, but have unique per-object parameters, such as histogram bin heights. A qp.Ensemble object takes as input the number of items in the ensemble and, optionally, a list whose contents correspond to one of the built-in formats.
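
To make the shared-versus-unique split concrete, a histogram ensemble pairs one shared array of bin ends with a per-member array of bin heights. The arrays below are purely illustrative and are not qp's storage format.


In [ ]:
# shared metaparameters: one set of bin ends for every member
shared_bin_ends = np.linspace(-5., 5., 11)
bin_mids = 0.5 * (shared_bin_ends[:-1] + shared_bin_ends[1:])
# unique per-member parameters: each member gets its own bin heights
member_heights = [sps.norm(loc=m, scale=1.).pdf(bin_mids) for m in (0., 1.)]
print(shared_bin_ends.shape, [h.shape for h in member_heights])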

Let's demonstrate on PDFs with a functional form, meaning the per-member information is a list of scipy.stats.rv_continuous or qp.composite objects.


In [ ]:
N = 10
in_dists = []
for i in range(N):
    dist = sps.norm(loc=sps.uniform.rvs(), scale=sps.uniform.rvs())
    in_dists.append(dist)
    
E = qp.Ensemble(N, funcform=in_dists, vb=True)

As with individual qp.PDF objects, we can evaluate the PDFs at given points, convert to other formats, and integrate.


In [ ]:
eval_range = np.linspace(-5., 5., 100)
E.evaluate(eval_range, using='mix_mod', vb=False)

In [ ]:
E.quantize(N=10)

In [ ]:
E.integrate(demo_limits, using='mix_mod')

Previous versions of qp included a built-in function for "stacking" the member PDFs of a qp.Ensemble object. This functionality has been removed to discourage use of this procedure in science applications. However, we provide a simple function one may use should this functionality be desired.


In [ ]:
def stack(ensemble, loc, using, vb=True):
    """
    Produces an average of the PDFs in the ensemble

    Parameters
    ----------
    ensemble: qp.Ensemble
        the ensemble of PDFs to stack
    loc: float or ndarray of floats
        location(s) at which to evaluate the PDFs
    using: string
        which parametrization to use for the approximation
    vb: boolean
        report on progress

    Returns
    -------
    stacked: tuple, ndarray, float
        pair of arrays for locations where approximations were evaluated
        and the values of the stacked PDFs at those points
    """
    evaluated = ensemble.evaluate(loc, using=using, norm=True, vb=vb)
    stack = np.mean(evaluated[1], axis=0)
    stacked = (evaluated[0], stack)
    return stacked

In [ ]:
stacked = stack(E, eval_range, using='quantiles')
plt.plot(stacked[0], stacked[1])