Statistical Inference

Sampling Distributions

With randomization, we typically only ever see one realization out of all possible sets of assignments. From this, we get a single estimate of the average treatment effect. However, had the assignments been different, the estimate would have been different, too. If we were to repeatedly conduct the same experiment, the collection of average treatment effect estimates would form a sampling distribution. The sampling distribution refers to the collection of estimates associated with every possible random assignment (Gerber and Green, 2012).

The mean of this sampling distribution—that is, the average estimated ATE across all possible random assignments—is equal to the true ATE (Gerber and Green, 2012). With a single experiment, however, we get a single, albeit unbiased, estimate of the ATE. This will almost certainly differ from the true ATE, but is unbiased in the sense that, on average, it yields the true ATE.
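
To make this concrete, here is a minimal simulation sketch; the variable names and values below are hypothetical. We fix a set of potential outcomes with an assumed true treatment effect of 2.0 and repeatedly re-randomize the assignment, collecting the estimated ATE each time.

import numpy as np

# sketch: build a sampling distribution by repeated random assignment
np.random.seed(0)

n_units = 100
tau = 2.0                                    # assumed true treatment effect
y0 = np.random.normal(loc=10, size=n_units)  # hypothetical potential outcomes under control
y1 = y0 + tau                                # hypothetical potential outcomes under treatment

estimates = []
for _ in range(10000):
    treated = np.random.permutation(n_units) < n_units // 2  # assign a random half to treatment
    estimates.append(y1[treated].mean() - y0[~treated].mean())

np.mean(estimates)  # close to the true ATE of 2.0

The average of these estimates lands close to 2.0, illustrating the unbiasedness described above.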

Let's look at a hypothetical sampling distribution.


In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline
mpl.style.use('ggplot')
mpl.rc('savefig', dpi=100)

In [2]:
np.random.seed(42)

# data
mu, sigma = 0, 1
x = mu + sigma * np.random.randn(100000)

# plot
pd.Series(x).plot(kind='hist', bins=50,
                  color='#000000', alpha=0.5, density=True)

# plot options
plt.title('Sampling Distribution Example')
plt.xlabel('ATE Estimates')
plt.ylabel('Density')
plt.tick_params(axis='both',
                top=False, bottom=False,
                left=False, right=False)


This distribution was generated using 100,000 random data points with zero mean ($\mu = 0$) and unit variance ($\sigma^2 = 1$). Thus, the true ATE is zero, meaning there is no treatment effect, and we see that the histogram is centered around that value. We can also see the variation around zero. We'll discuss this next.

Standard Error

While experiments provide unbiased estimates of the ATE, they are not necessarily precise—there is variability around the true ATE. Sampling variability tells us about the uncertainty associated with our estimate. The statistic of choice for characterizing this variability is the standard error.

The standard error is the standard deviation of the sampling distribution. The formula for the population standard deviation is:

$$\sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2 } \text{.}$$

In other words, we take the square root of the average squared deviation from the mean. (Note: when the data is a random sample from a population, the sample standard deviation divides the sum of squared deviations by $N-1$ rather than $N$.)

For our hypothetical sampling distribution, we can calculate the standard deviation as follows.


In [3]:
(((x - x.mean()) ** 2).mean()) ** 0.5


Out[3]:
1.0009009542948903

We could also use NumPy's built-in standard deviation method.


In [4]:
x.std()


Out[4]:
1.0009009542948903
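
As noted above, the sample standard deviation divides by $N-1$ rather than $N$; with NumPy this corresponds to setting the ddof argument:

x.std(ddof=1)  # sample standard deviation: divides by N - 1 rather than N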

The standard error of the estimated ATE can be expressed in terms of potential outcomes (Gerber and Green, 2012):

$$ SE(\widehat{ATE}) = \sqrt{ \frac{1}{N-1} \left( \frac{ m\sigma^2_{y_{0i}} }{ N-m } + \frac{ (N-m) \sigma^2_{y_{1i}} }{ m } + 2 cov(y_{0i}, y_{1i}) \right) }\text{,}$$

where $m$ is the number of units in the treatment group, $N$ is the total number of units (treatment and control), and $\sigma^2$ is the variance. The last term represents the covariance of $y_{0i}$ and $y_{1i}$ and measures the association between the potential outcomes.
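
As an illustration, this quantity can be computed directly when both potential outcomes are known, which is only possible in a simulation; the potential outcomes and group sizes below are hypothetical.

# sketch: true standard error of the estimated ATE, computable only when
# both potential outcomes are known (e.g., in a simulation)
np.random.seed(1)

N, m = 100, 50                               # assumed total units and treatment-group size
y0 = np.random.normal(loc=10, size=N)        # hypothetical potential outcomes under control
y1 = y0 + np.random.normal(loc=2, size=N)    # hypothetical potential outcomes under treatment

var_y0 = y0.var()                            # population variances (divide by N)
var_y1 = y1.var()
cov_y0_y1 = np.cov(y0, y1, bias=True)[0, 1]  # covariance of the potential outcomes

se_true = np.sqrt((m * var_y0 / (N - m)
                   + (N - m) * var_y1 / m
                   + 2 * cov_y0_y1) / (N - 1))
se_true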

Using this formulation, Gerber and Green discuss the ways in which the experimental design can produce more precise estimates of the ATE. They do this by changing one input at a time while holding the others fixed. These implications are summarized below.

  1. A larger $N$ leads to a smaller standard error. This is true even if $m$ is held constant, which implies that only the control group increases. Similarly, increasing the number of units in the treatment group, $m$, also decreases the standard error.
  2. The smaller that $\sigma^2_{y_{0i}}$ or $\sigma^2_{y_{1i}}$ are, the smaller the standard error. This implies that the treatment and control group units should be as similar as possible in terms of potential outcomes.
  3. The smaller the covariance between $y_{0i}$ and $y_{1i}$, the smaller the standard error.

Gerber and Green provide an additional suggestion:

  • if $\sigma^2_{y_{0i}}$ and $\sigma^2_{y_{1i}}$ are similar, about half of the units should be assigned to treatment and the rest to control
  • if the variances of the potential outcomes are different, assign more units to the group whose potential outcomes have the larger variance
    • in practice, however, this variance is rarely known
    • thus, try to assign an equal number of units to each condition

Standard errors measure uncertainty, indicating how estimates vary across all possible random assignments (Gerber and Green, 2012). One goal of an experiment is to reduce variability. There are several ways to do this, which we'll discuss further in a subsequent chapter, but, in general, one can:

  • make more precise measurements of the response
  • transform response data
    • for example, instead of measuring only "post-test" responses, issue a "pre-test" and measure outcomes based on the difference between these values
  • group observations so that they're as homogeneous as possible in terms of potential outcomes
  • use more experimental units

Estimating the Standard Error

As described earlier, we usually only have access to results from a single randomization. Our goal, then, is to estimate the true standard error, which is unknown, using data from a single experiment.

In this scenario, we have access to four of the five inputs. We know the number of units in each group and can estimate the variance of the potential outcomes, $y_{0i}$ and $y_{1i}$, using the observed outcomes. These are unbiased estimates because of the randomized assignment.

However, because we never observe both potential outcomes for a single unit, we cannot estimate the covariance between $y_{0i}$ and $y_{1i}$. A common, conservative approach is to assume a constant treatment effect for all units, which implies that the correlation between $y_{0i}$ and $y_{1i}$ is 1.0.

The formula for estimating the standard error of the treatment effect is (Gerber and Green, 2012):

$$ \widehat{SE} = \sqrt{ \frac{\widehat{ \sigma^2_{y_{0i}} }}{ N-m } + \frac{ \widehat{ \sigma^2_{y_{1i}} } }{ m } } $$

On average, the standard error estimate "does a reasonably good job of estimating the true level of sampling variability" (Gerber and Green, 2012).
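
For example, with observed outcomes from a single experiment, the estimate might be computed along these lines; the treatment and control outcomes below are simulated stand-ins for experimental data.

# sketch: estimated standard error from a single experiment's observed outcomes
np.random.seed(2)

N, m = 50, 25
treated_obs = np.random.normal(loc=6, size=m)      # observed outcomes, treatment group
control_obs = np.random.normal(loc=5, size=N - m)  # observed outcomes, control group

var_y1_hat = treated_obs.var(ddof=1)  # sample variance as an estimate of the variance of y_1i
var_y0_hat = control_obs.var(ddof=1)  # sample variance as an estimate of the variance of y_0i

se_hat = np.sqrt(var_y0_hat / (N - m) + var_y1_hat / m)
se_hat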

A few caveats:

  • there must be at least two units in each experimental group
  • estimated standard errors may be inaccurate if:
    • there isn't much data about the variances
    • the assumptions about the covariance are incorrect (often, there isn't a way to tell)

Hypothesis Testing

Hypothesis testing is a strategy for determining whether any observed differences between experimental groups are "real." Rather than gathering evidence in favor of this hypothesis directly, we assume the opposite is true and look for evidence against it.

For example, if we believe there is a difference in observed outcomes between experimental groups—that is, $y_{0i} \neq y_{1i}$—we start with $y_{0i} = y_{1i}$, which says there is no difference. This is referred to as the null hypothesis. Then, if we find evidence to reject the null hypothesis, we are left with what's called the alternative hypothesis—$y_{0i} \neq y_{1i}$, in this case. Note that we can never be 100% sure that the alternative hypothesis is true. In other words, we never accept the alternative hypothesis.

This process is called hypothesis testing and there are several methods we can use. For example, some readers might be familiar with t-tests. Here, we'll focus on using randomization for making inferences. In particular, we'll use permutation tests. There are several reasons for this. Approximate methods, such as t-tests and F-tests, assume that outcomes are independent and that they follow a normal distribution with mean $\mu$ and variance $\sigma^2$ (Oehlert, 2010). This may not be the case. Another benefit of permutation tests is that they do not require large samples. In fact, "any sample size will do, and the method can be applied to all sorts of outcomes, including counts, durations, or ranks" (Gerber and Green, 2012).

Permutation tests simulate all possible randomizations. These simulated randomizations provide an exact sampling distribution of the estimated average treatment effect under the null hypothesis (Gerber and Green, 2012).

With this, we calculate the probability of observing an estimated ATE as large as the one observed in the experiment. If the experimental estimate of the ATE is 5.0, for example, we find the proportion of times that the randomized estimates were greater than or equal to 5.0. This probability is the $p$-value for a one-tailed hypothesis. The one-tailed hypothesis might be something like $y_{1i} > y_{0i}$ if we believe the treatment group has larger expected outcomes. This is based on the specified alternative hypothesis. A two-tailed hypothesis, on the other hand, is based on $y_{0i} \neq y_{1i}$, simply meaning that the outcomes are different. For a two-tailed hypothesis, we would find the proportion of times that the absolute value of the randomized estimates was greater than or equal to 5.0.

In practice, we rarely simulate all possible randomizations. As Gerber and Green point out, if $N = 50$ and half the units are randomly assigned to treatment and half to control, there would be more than 126 trillion possible randomizations:

$$\frac{50!}{25!(50-25)!} = 126,410,606,437,752\text{.}$$
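
This count is the binomial coefficient $\binom{50}{25}$, which we can verify with Python's math.comb (available in Python 3.8+):

from math import comb

comb(50, 25)  # 126410606437752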

Instead, we can approximate the sampling distribution by randomly sampling from the set of all possible random assignments (Gerber and Green, 2012). This can be implemented by randomly shuffling the labels—in this case, the treatment assignments—a total of $P$ times. A common value of $P$ is 10,000.

Let's describe the interpretation of the $p$-value. Randomly shuffling the labels implies that the treatment assignment, $D_i$, has no effect on the outcomes. In effect, we're building the null distribution, whose expectation is zero. Hence, the more often we see a permuted test statistic at least as extreme as the experimental one, the less likely it is that the experimental estimate is statistically different from zero.
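
As a reference point, here is a minimal sketch of what such a shuffle-based, two-tailed test might look like; the actual permutation_test() used below lives in code/ and may be implemented differently.

# sketch: approximate permutation test by randomly shuffling the labels
def permutation_test_sketch(outcomes, m, n_permutations=10000, seed=None):
    """Two-tailed p-value for the difference in means, where the first
    m entries of `outcomes` belong to the treatment group."""
    rng = np.random.RandomState(seed)
    observed = outcomes[:m].mean() - outcomes[m:].mean()
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(outcomes)              # re-randomize the labels
        stat = shuffled[:m].mean() - shuffled[m:].mean()
        if abs(stat) >= abs(observed):                    # at least as extreme as observed
            count += 1
    return count / n_permutations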

Let's look at a code example. In code/, there is a module named permutation. Let's import permutation_test() from it.


In [5]:
from code.permutation import permutation_test

Let's set our $N$ and $m$ variables.


In [6]:
N, m = 50, 25

Next, we'll generate some data from a normal distribution. In the first example, the treatment, t, and the control, c, will be centered at 5. We'll combine the treatment and control arrays.


In [7]:
np.random.seed(42)
t = np.random.normal(loc=5, size=m)
c = np.random.normal(loc=5, size=N-m)
outcomes = np.append(t, c)

The experimental test statistic is:


In [8]:
t.mean() - c.mean()


Out[8]:
0.12393169319609765

Since both groups are drawn from the same distribution, we should expect a high $p$-value.


In [9]:
permutation_test(outcomes, m)


Out[9]:
0.64829999999999999

Using the data from [7], we find that the permuted test statistics are at least as extreme as the experimental test statistic about 65% of the time. This means we don't have enough evidence to say that the experimental test statistic is different from zero.

Let's see what happens if we change t and c.


In [10]:
np.random.seed(42)
t = np.random.normal(loc=6, size=m)
c = np.random.normal(loc=5, size=N-m)
outcomes = np.append(t, c)

The new experimental test statistic is:


In [11]:
t.mean() - c.mean()


Out[11]:
1.1239316931960976

Because the treatment and control means actually differ, the $p$-value should be low.


In [12]:
permutation_test(outcomes, m)


Out[12]:
0.0001

This says we saw a value at least as extreme as 1.12 only one time out of ten thousand. This gives us enough evidence to reject the null hypothesis that the average treatment effect is zero.

