This notebook provides an interactive overview of some of the ideas developed in the paper "Error Statistics" by Mayo and Spanos.

Goals

For a sample of data:

(1) quantify the extent to which the sample is consistent with coming from a particular, hypothetical data source

(2) if inconsistent, determine which other, particular data sources the sample is consistent with.

Introduction

Two notions of probability:

Frequentist: probabilities represent relative frequencies of occurrence, e.g. $P(X;\mu)$ speaks to the probability of outcome $X$, given $\mu$.

Bayesian: probabilities represent degrees of belief, e.g. $P(\mu;X)$ speaks to the probability of $\mu$ being true, given $X$.

Both are useful.

Bayesian analyses:

  • incorporate prior knowledge
  • produce posterior probabilities

Frequentist analyses:

  • allow for lack of prior knowledge
  • do not produce posterior probabilities

Let's explore the use of frequentist statistics. $P(X;\mu)$ describes a set of probabilities for observed data $X$ given an assumption about the world, parameterized by $\mu$. Specifically, let's look at the statistics of errors of inference.

Exploration


In [ ]:
from scipy.stats import norm # properties of the distribution
from numpy.random import normal # samples from the distribution
import numpy as np
import scipy
from matplotlib import pyplot as plt
%matplotlib inline

All hypotheses discussed herein will be expressed with Gaussian / normal distributions. Let's look at the properties of this distribution.

Start by plotting it. We'll set the mean to 0 and the width to 1: the standard normal distribution.


In [ ]:
x = np.arange(-10, 10, 0.001)
plt.plot(x, norm.pdf(x, 0, 1)) # final arguments are the mean (loc) and standard deviation (scale)

Now look at the cumulative distribution function of the standard normal, which integrates the (unit-normalized) density from negative infinity up to the function argument.


In [ ]:
norm.cdf(0)

The function also accepts a list.


In [ ]:
norm.cdf([-1., 0, 1])

Now let's be more explicit about the parameters of the distribution.


In [ ]:
mu = 0
sigma = 1
n = norm(loc=mu, scale=sigma) # freeze a distribution with explicit parameters
n.cdf([-1., 0, 1])

In [ ]:
mu = 0
sigma = 2
n = norm(loc=mu, scale=sigma)
n.cdf([-1., 0, 1])

In addition to exploring properties of the exact function, we can sample points from it.


In [ ]:
[normal() for _ in range(5)]

We can also approximate the exact distribution by sampling a large number of points from it.


In [ ]:
size = 1000000
num_bins = 300

plt.hist([normal() for _ in range(size)],num_bins)
plt.xlim([-10,10])

Data samples

If we have a sample of points, we can summarize it in a model-nonspecific way by calculating the sample mean.

Here, we draw them from a Gaussian for convenience.


In [ ]:
n = 10
my_sample = [normal() for _ in range(n)]
my_sample_mean = np.mean(my_sample)
print(my_sample_mean)

Now let's generate a large number of data samples and plot the corresponding distribution of sample means.


In [ ]:
n = 10
means_10 = []
for _ in range(10000):
    my_sample = [normal() for _ in range(n)]
    my_sample_mean = np.mean(my_sample)
    means_10.append(my_sample_mean)

    
plt.hist(means_10,100)
plt.xlim([-1.5,1.5])
plt.xlabel("P(mean(X))")
plt.show()

In [ ]:
n = 100
means_100 = []
for _ in range(10000):
    my_sample = [normal() for _ in range(n)]
    my_sample_mean = np.mean(my_sample)
    means_100.append(my_sample_mean)
    
plt.hist(means_100,100)
plt.xlim([-1.5,1.5])
plt.xlabel("P(mean(X))")
plt.show()

In [ ]:
# show the 1/sqrt(n) scaling of the spread of the sample mean:
# the inverse variance of the sample mean should grow linearly with n
n_s = []
inv_var_of_mean = []
for i in range(1, 1000, 50):
    means = []
    for _ in range(5000):
        my_sample = [normal() for _ in range(i)]
        means.append(np.mean(my_sample))
    std_of_mean = np.std(means)
    inv_var_of_mean.append(1. / (std_of_mean * std_of_mean))
    n_s.append(i)

plt.scatter(n_s, inv_var_of_mean)
plt.xlim([0, 1000])
plt.ylabel("1/var(mean(X))")
plt.xlabel("n (sample size)")
plt.show()

Note that as the number of points per sample increases, the spread of the sample mean decreases (as $1/\sqrt{n}$).
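
As a quick numerical check of this scaling, we can compare the estimated standard deviation of the sample mean to the analytic value $\sigma/\sqrt{n}$ for a few sample sizes (a small added sketch; the sample sizes chosen are arbitrary).


In [ ]:
# compare the numerical spread of the sample mean to the analytic value sigma/sqrt(n)
for n_check in [10, 100, 1000]:
    means = [np.mean([normal() for _ in range(n_check)]) for _ in range(5000)]
    print(n_check, np.std(means), 1. / np.sqrt(n_check))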

Notation: the random variable ranging over all possible samples of size $n$ is called $X$. A specific realization of $X$, like the one actually observed in an experiment, is called $X_0$.

What can we say about the data?

  • are the data consistent with having been sampled from a certain distribution?
  • if not, what distribution are they consistent with?

Hypotheses

In this tutorial, a hypothesis is expressed as a distribution from which the data may have been drawn. Our goal is to provide a procedure for rejecting the null hypothesis and, when the null is rejected, for making warranted inferences about one or more alternate hypotheses.

Simplification: the hypothesis space is defined as all normal distributions with variable mean $\mu$ and fixed variance. Generalizing this assumption changes almost nothing.

Corollary: the hypothesis space is one-dimensional, and the logical negation of a hypothesis is easy to state.

A Test Statistic

To relate observed data to hypotheses, we need to define a test statistic, which summarizes a particular experimental result. This statistic is also a function of the hypothesis, and will have different sampling distributions under different hypotheses.

$d(X;H_\mu) = (\bar X - \mu)/(\sigma/\sqrt n)$,

where $\bar X$ is the mean of $X$. For Gaussian hypotheses, $d$ is distributed as a unit normal.


In [ ]:
def d(X=[0], mu=0, sigma=1):
    # test statistic: d(X; H_mu) = (X_bar - mu) / (sigma / sqrt(n))
    X_bar = np.mean(X)
    return (X_bar - mu) / sigma * np.sqrt(len(X))

In [ ]:
n = 10
my_sample = [normal() for _ in range(n)]
d(my_sample)

Let's numerically determine the sampling distribution of $d$ under the hypothesis $H_0: \mu = 0, \sigma = 1$.


In [ ]:
size = 100000
n = 10
d_sample = []

for _ in range(size):
    my_sample = [normal() for _ in range(n)] # get a sample of size n
    d_sample.append(d(my_sample)) # add test statistic for this sample to the list
plt.hist(d_sample,100)
plt.xlabel("P(d(X);H0)")

With this sampling distribution (which can be calculated exactly), we know exactly how likely a particular result $d(X_0)$ is. We also know how likely it is to observe a result even more extreme than $d(X_0)$, $P(d(X) > d(X_0); \mu)$.
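
Since $d(X)$ is unit-normal under $H_0$, this tail probability can be computed directly from the cdf rather than read off the histogram; here is a quick sketch using the last sample drawn above.


In [ ]:
d_obs = d(my_sample)           # d(X_0) for the observed sample
p_value = 1 - norm.cdf(d_obs)  # P(d(X) > d(X_0); H0)
print(d_obs, p_value)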

Rejecting the null

This probability is the famous p-value. When the p-value for a particular experimental outcome is less than some pre-determined amount (usually called $\alpha$), we can:

  • infer that $H_0$ is falsified at level $\alpha$
  • take the action that has been specified for this situation
  • infer that $X_0$ indicates something about an alternate hypothesis: if $H_0$ corresponds to $\mu = \mu_0$, then we infer that $\mu > \mu_0 + \delta$ for some $\delta$ (how large a $\delta$ is warranted is the subject of the severity analysis below).

If $H_0$ is rejected, we can also begin to speak about the statistical properties of alternatives $H_1 \neq H_0$.

Neyman-Pearson digression

The traditional frequentist procedure (due to Neyman and Pearson) is to construct a test that fixes the probability of rejecting $H_0$ when it is true, and maximizes the power: the probability of rejecting $H_0$ when $H_1$ is true. In other words, for a fixed probability of rejecting $H_0$ when it's true, maximize the probability of accepting $H_1$ when it's true.

The N-P construction is fixed before $X_0$ is observed. We wish to extend this and, when $H_0$ is rejected, infer regions of alternate parameter space that are severely tested by the outcome $X_0$.
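
For concreteness, here is a minimal sketch of this trade-off in the Gaussian setup used throughout: for a one-sided test of size $\alpha$, the power against a point alternative $\mu_1$ follows directly from the unit-normal sampling distribution of $d(X)$. The values of $\alpha$, $\mu_1$, $\sigma$, and $n$ below are illustrative choices, not fixed by the text.


In [ ]:
# power of a one-sided test of size alpha against a point alternative mu1
alpha = 0.05                                         # illustrative size
mu0, mu1 = 0, 0.5                                    # null and alternative means (illustrative)
sigma_pow, n_pow = 2, 100                            # assumed width and sample size
d_crit = norm.ppf(1 - alpha)                         # reject H0 when d(X) > d_crit
shift = (mu1 - mu0) / (sigma_pow / np.sqrt(n_pow))   # mean of d(X) under H1
power = 1 - norm.cdf(d_crit - shift)                 # P(reject H0; H1)
print(power)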

Inference of an alternate hypothesis

When the null hypothesis is rejected, we are interested in _ranges of alternate hypotheses that, if not true, are highly likely to have produced a test statistic less significant than $d(X_0)$_. We say these ranges of parameter space, which can be thought of as composite hypotheses, have been severely tested. We call this level of testing the severity; it is a function of the observed data ($X_0$), the range of alternate hypotheses ($H$), and the construction of the test itself.

This is the point of the tutorial: we are warranted to infer a range of hypothesis space when that range has been severely tested.


In [ ]:
# look at the distributions of sample means for two hypotheses

def make_histograms(mu0=0,mu1=1,num_samples=10000,n=100,sigma=1):

    #d0_sample = []
    #d1_sample = []

    m0_sample = []
    m1_sample = []
    for _ in range(num_samples):
        H0_sample = [normal(loc=mu0,scale=sigma) for _ in range(n)] # get a sample of size n from H0
        H1_sample = [normal(loc=mu1,scale=sigma) for _ in range(n)] # get a sample of size n from H1

        m0_sample.append( np.mean(H0_sample) ) # add mean for this sample to the m0 list
        m1_sample.append( np.mean(H1_sample) ) # add mean for this sample to the m1 list

        # remember that the test statistic is unit-normal-distributed for Gaussian hypotheses,
        # so these distributions should be identical
        #d0_sample.append( d(H0_sample,mu0,sigma) ) # add test statistic for this sample to the d0 list
        #d1_sample.append( d(H1_sample,mu1,sigma) ) # add test statistic for this sample to the d1 list

    plt.hist(m0_sample,100,label="H0")
    plt.hist(m1_sample,100,label="H1")
    plt.xlabel(r"$\bar{X}$")
    plt.legend()

In [ ]:
num_samples = 10000
n = 100

mu0 = 0
mu1 = 1
sigma=2

make_histograms(mu0=mu0,mu1=mu1,num_samples=num_samples,n=n,sigma=sigma)

Now, imagine that we observe $\bar X_0 = 0.4$. The probability of $\bar X > 0.4$ is only about $2\%$ under $H_0$, so let's say we've rejected $H_0$.
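
We can check that number directly: under $H_0$ the sample mean is normal with mean $0$ and standard deviation $\sigma/\sqrt{n} = 0.2$ (a quick sketch using the values set in the previous cell).


In [ ]:
# P(X_bar > 0.4; H0), where X_bar ~ N(mu0, sigma/sqrt(n)) under H0
1 - norm.cdf(0.4, loc=mu0, scale=sigma / np.sqrt(n))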

Question: which regions of $\mu$ (of the form $\mu > \mu_1$) have been severely tested?

$\mathrm{SEV}(\mu>\mu_1) = P(d(X) < d(X_0);\, \neg(\mu>\mu_1)) = P(d(X) < d(X_0);\, \mu \le \mu_1) \rightarrow P(d(X) < d(X_0);\, \mu = \mu_1)$

So we only need to calculate the probability of a result less anomalous than $d(X_0)$, given $\mu_1$.


In [ ]:
# severity for the interval: mu > mu_1

# note that we calculate the probability in terms of the _lower bound_ of the interval, 
# since it will provide the _lowest_ severity

def severity(mu_1=0, x=[0], sigma=1, n=100):
    # find the mean of the observed data
    x_bar = np.mean(x)
    # calculate the test statistic w.r.t. mu_1 (the lower bound of the interval)
    dx = (x_bar - mu_1) / sigma * np.sqrt(n)
    # the test statistic is distributed as a unit normal
    return norm.cdf(dx)

Calculate the severity for a range of alternate hypotheses ($\mu > \mu_1$) whose lower bound $\mu_1$ sits below the observed outcome.


In [ ]:
sigma = 2
mu_1 = 0.2
x = [0.4]
severity(mu_1=mu_1,x=x,sigma=sigma)

In [ ]:
num_samples = 10000
n = 100

mu0 = 0
mu1 = 0.2
sigma=2

make_histograms(mu0=mu0,mu1=mu1,num_samples=num_samples,n=n,sigma=sigma)

Calculate the severity for a set of observations.


In [ ]:
x_bar_values = [[0.4], [0.6], [1.]]
colors = ["b", "k", "r"]

mu_values = np.linspace(0, 1, 100)
for x, color in zip(x_bar_values, colors):
    sev = [severity(mu_1=mu_1, x=x, sigma=sigma) for mu_1 in mu_values]
    plt.plot(mu_values, sev, color, label=x)
plt.ylim(0, 1.1)
plt.ylabel(r"severity for $H: \mu > \mu_1$")
plt.legend(loc="lower left")
plt.xlabel(r"$\mu_1$")

Get some intuition by considering some test points (checked numerically in the cell below):

  • given the interval $\mu > \mu_1 = 0.4$, the probability of observing something less than or equal to $0.4$ (blue line) is $0.5$. The interval is not severely tested and its inference is not warranted.
  • a result of $\bar X_0 = 1$ warrants inference of $\mu \gtrsim 0.7$.
  • the warranted intervals of inference depend on the observed outcome ($\bar X$ values).
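
These numbers can be read off the plot or checked directly with the severity function defined above; here is a quick check at the values quoted in the bullets.


In [ ]:
# numerical check of the test points above, using the severity function defined earlier
print(severity(mu_1=0.4, x=[0.4], sigma=2, n=100))  # ~0.5: mu > 0.4 is not severely tested by X_bar = 0.4
print(severity(mu_1=0.7, x=[1.0], sigma=2, n=100))  # ~0.93: mu > 0.7 is warranted by X_bar = 1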
