Before doing anything, make a copy of this notebook in your clone of the private course repository. Edit that copy, not the version in the public repository clone. Please follow the instructions in the private repo README by locating your solutions on a path like phys366_2019/HW_week<N>/<Your Name(s)>/
and note that case matters. Take a minute to look hard at the directory structure and choose your filenames carefully: if your solutions are not in the right place, they will not be graded.
Remember that you will submit your solution by pull request to the private repository.
This week you also need to produce a project pitch. As explained in the "Project Milestones" instructions:
With your team, write up a proto-abstract for your project concept, explaining what you want to learn, what data can help, and roughly how you plan to proceed.
As with your homework solution, push your (team's) pitch to your fork of the private repo in a folder like phys366_2019/Project_pitches/<your name(s)>/
and submit a PR to the private repository. You can write your pitch in markdown, plain text, or Jupyter notebook form - all will display reasonably well on the repo.
Once you've submitted your solution, don't forget to also fill out the very quick (and required!) weekly feedback form.
Let's work through the example inference described by Jake Vanderplas here (pages 4-6), and get some practice in handling and interpreting both Bayesian credible regions and Frequentist confidence intervals.
Let's explore Jaynes' model that Vanderplas relays:
Consider a device that “...will operate without failure for a time θ because of a protective chemical inhibitor injected into it; but at time θ the supply of the chemical is exhausted, and failures then commence, following an exponential failure law [with a characteristic decay time of 1 day]. It is not feasible to observe the depletion of this inhibitor directly; one can observe only the resulting failures. From data on actual failure times, estimate the time θ of guaranteed safe operation...”
Suppose we observe three device failures. We don't know the exhaustion time $\theta$, but we see the three devices fail at $\mathbf{x} = [10, 12, 15]$ days. What is the value of $\theta$?
This is an inference problem: we have some data $\mathbf{x}$ and a model for how it was generated; the task is to infer the exhaustion time $\theta$.
The Bayesian solution is to compute the posterior probability distribution for $\theta$ given the data $\mathbf{x}$, and report a credible region for $\theta$.
The Frequentist solution is to find an estimator $\hat{\theta}$ for $\theta$ given the data $\mathbf{x}$, and report a confidence interval for that estimator.
Both approaches will use the same model, which dictates a sampling distribution for $\mathbf{x}$, and hence the same likelihood function (which is the sampling distribution written as a function of the model parameter $\theta$). Whether or not each approach actually makes use of this likelihood function will turn out to be crucial.
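To make the generative model concrete, here is a minimal sketch of how failure times could be simulated for an assumed exhaustion time (the value theta_true = 10 and the variable names are purely illustrative):
In [ ]:
import numpy as np
import scipy.stats
theta_true = 10.0   # an assumed exhaustion time, chosen arbitrarily for illustration
# After theta_true, failures follow an exponential law with a decay time of 1 day:
x_fake = theta_true + scipy.stats.expon.rvs(size=3)
print(x_fake)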
The results need to be articulated carefully - in Vanderplas' words:
Bayesian credible region: "Given our observed data, there is a 95% probability that the true value of $\theta$ lies within the credible region."
Frequentist confidence interval: “If the experiment is repeated many times, in 95% of these cases the computed confidence interval will contain the true value of $\theta$."
Vanderplas points out that the mean of $x$ over its sampling distribution $P(x|\theta)$ is $\theta + 1$, and hence that an estimator for $\theta$ is $\hat{\theta} = \bar{x} - 1$ where $\bar{x}$ is the sample mean of the data $\mathbf{x}$.
Vanderplas then derives an approximate form for the sampling distribution for this estimator:
"The exponential distribution has a standard deviation of 1, so in the limit of large $N$, we can use the standard error of the mean to show that the sampling distribution of $\hat{\theta}$ will approach normal with variance $\sigma =1/\sqrt{N}$."
Under this Gaussian approximation, the $\pm 2\sigma$ range corresponds to the 95.4% Frequentist confidence interval for the estimator $\hat{\theta}$:
$ CI \approx (\hat{\theta} - 2/\sqrt{N}, \hat{\theta} + 2/\sqrt{N})$
We'll use "95%" interchangeably with "$2\sigma$" below, but note that this is actually very bad practice. People do actually mean 95% when they say 95% in general, and the difference between a 95% range and a 95.4 (and change) % range can be noticeable, given that we're out in the tails of the distribution.
It seems problematic that the Frequentist confidence interval you calculated above contains only unphysical values of the exhaustion time $\theta$ (values that are all greater than the minimum observed failure time). Recall that the meaning of the confidence interval is the following, in Vanderplas' words:
“If this experiment is repeated many times, in 95% of these cases the computed confidence interval will contain the true value of $\theta$."
We saw in class that for a Gaussian likelihood function, the quantity $-2 \log \mathcal{L} = \chi^2$ is expected to be distributed (over datasets) according to a $\chi^2$ distribution, such that the 2 sigma, 95% confidence interval for the single parameter $\theta$ has edges lying where $-2 \Delta \log \mathcal{L} = 4$ relative to the maximum likelihood point.
Assuming a very broad uniform prior PDF for $\theta$, Vanderplas gives the posterior PDF for $\theta$ as:
$P(\theta|\mathbf{x}) = N e^{N(\theta-\min \mathbf{x})}$, for $\theta \leq \min \mathbf{x}$; 0 for $\theta > \min \mathbf{x}$
(i.e. the posterior PDF is simply proportional to the likelihood function in part 4, as expected for a uniform prior).
The 95% maximum density credible region extends from $\theta_1$ to $\theta_2 = \min \mathbf{x}$, where $\theta_1$ is determined by the integral
$\int_{\theta_1}^{\theta_2} P(\theta|\mathbf{x}) d\theta = 0.95$
Vanderplas does the integral for us:
$\theta_1 = \theta_2 + \frac{1}{N} \log \left[ 1 - 0.95 ( 1 - e^{-N \theta_2} ) \right] $
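For the observed failure times these endpoints can be evaluated directly; here is a minimal sketch (variable names illustrative):
In [ ]:
import numpy as np
x = np.array([10., 12., 15.])    # the observed failure times, in days
N = len(x)
theta2 = x.min()                 # upper edge of the credible region
theta1 = theta2 + np.log(1.0 - 0.95*(1.0 - np.exp(-N*theta2))) / N
print("95% credible region: ({:.2f}, {:.2f})".format(theta1, theta2))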
The Bayesian credible region has the following meaning - again articulated by Vanderplas:
"Given our observed data, there is a 95% probability that the true value of $\theta$ lies within the credible region."
To test that our credible region is accurate, we can again do a numerical test involving a large number of datasets, but this time each dataset needs its own value of $\theta$, drawn from the prior; the 95% credible region computed from each dataset should contain the corresponding true $\theta$ with 95% probability.
In [ ]:
import numpy as np
import scipy.stats
x0 = np.array([10., 12., 15.])  # the observed failure times, in days
In [ ]:
# From class notes: the relationship between the confidence level in a confidence interval,
# and the number of sigma it corresponds to:
dimensions = 1
level = 0.95
sigma = np.sqrt(scipy.stats.chi2.ppf(level, dimensions))
print("A {:.1f} sigma confidence interval in {:d} dimensions is also a {:.0f}% confidence interval, assuming the estimator is Gaussian distributed.".format(sigma, dimensions, 100*level))
In [ ]:
# Generate 10000 datasets of N=3 failure times, all with the same true value of theta;
# compute the CI from each dataset, and test its coverage.
Ns = 10000
theta = 10.0
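One possible way to carry out this test is sketched below; it is not necessarily the intended approach, and it assumes Ns and theta from the cell above, with the failures simulated via scipy.stats.expon.
In [ ]:
# Sketch: each row of x_sim is one N=3 dataset, drawn from theta + Exponential(1).
N = 3
x_sim = theta + scipy.stats.expon.rvs(size=(Ns, N))
theta_hat = x_sim.mean(axis=1) - 1.0    # one estimate per dataset
half_width = 2.0 / np.sqrt(N)           # ~2 sigma in the Gaussian approximation
covered = (theta_hat - half_width <= theta) & (theta <= theta_hat + half_width)
print("Fraction of intervals containing the true theta: {:.3f}".format(covered.mean()))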
Plot:
In [ ]:
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# Generate 10000 datasets of N=3 failure times, all with the same true value of theta;
# compute the CI from each dataset, and test its coverage.
Ns = 10000
theta = 10.0
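One possible plot, as a sketch only (it assumes theta_hat and theta as computed in the coverage-test sketch above; histogramming the estimator is just one way to visualize the result):
In [ ]:
# Histogram of the estimator over the simulated datasets, with the true theta marked.
plt.hist(theta_hat, bins=50, density=True, histtype='step', label=r'$\hat{\theta}$ over datasets')
plt.axvline(theta, color='k', linestyle='--', label=r'true $\theta$')
plt.xlabel(r'$\hat{\theta}$ (days)')
plt.ylabel('density')
plt.legend();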
Commentary:
In [ ]:
# Draw 10000 values of theta from the prior uniform range,
# and generate an N=3 dataset for each one.
# Then compute the CR from each dataset and test it for coverage.
# Here I've assumed that an upper limit of 30 puts us in the regime of the "very broad uniform prior" assumed above.
Ns = 10000
thetas = 30.0*np.random.random(Ns)
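Again, one possible sketch only (it assumes thetas and Ns from the cell above, and reuses the credible-region formula quoted earlier):
In [ ]:
# Sketch: one N=3 dataset per theta value, then the CR endpoints for each dataset.
N = 3
x_sim = thetas[:, None] + scipy.stats.expon.rvs(size=(Ns, N))
theta2 = x_sim.min(axis=1)
theta1 = theta2 + np.log(1.0 - 0.95*(1.0 - np.exp(-N*theta2))) / N
covered = (theta1 <= thetas) & (thetas <= theta2)
print("Fraction of credible regions containing the true theta: {:.3f}".format(covered.mean()))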
Commentary: