The causal effect of a treatment, $D$, on an outcome, $y$, for an observational or experimental unit, $i$, is defined by the difference between the outcomes that would occur under each of the treatment possibilities (Gelman, 2006). In the binary treatment case, $D_i \in \left\{0, 1\right\}$. The potential outcomes, that is, the outcomes that would be observed for unit $i$ under the treatment and control conditions, are defined as follows:

$$\begin{cases} y_{1i} \quad \text{if}\ D_i = 1 \\ y_{0i} \quad \text{if}\ D_i = 0\text{.} \end{cases}$$

Only one of these outcomes, the one associated with unit $i$'s assignment, is ever observed. For a unit assigned to the treatment condition, $y_{1i}$ is observed and $y_{0i}$, referred to as the counterfactual, is unobserved.
Using the potential outcomes notation, the observed outcome, $y_i$, can be written as:
$$y_i = y_{0i} + (y_{1i} - y_{0i})D_i\text{,}$$

where $y_{1i} - y_{0i}$ is the causal effect of the treatment, $D$, on unit $i$ (Angrist and Pischke, 2009).
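As a minimal sketch of the observed-outcome equation, the following Python snippet applies it to three hypothetical units. The function name `observed_outcome` and all numeric values are illustrative assumptions, not data from any study.

```python
def observed_outcome(y0, y1, d):
    """Observed outcome: y_0i + (y_1i - y_0i) * D_i."""
    return y0 + (y1 - y0) * d


# Hypothetical potential outcomes for three units (made-up values).
y0 = [3.0, 5.0, 2.0]  # outcome each unit would have without treatment
y1 = [4.0, 5.5, 1.0]  # outcome each unit would have with treatment
d = [1, 0, 1]         # treatment indicator D_i

# The outcome we actually get to see for each unit.
observed = [observed_outcome(a, b, t) for a, b, t in zip(y0, y1, d)]

# Unit-level causal effects, computable here only because both
# potential outcomes were invented for the example.
unit_effects = [b - a for a, b in zip(y0, y1)]

print(observed)
print(unit_effects)
```

In real data only `observed` and `d` are available, so `unit_effects` could not be computed directly.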
We can define the average treatment effect, or ATE, as:
$$ATE \equiv \frac{1}{N}\sum_{i=1}^N (y_{1i} - y_{0i})\text{.}$$

Because both potential outcomes are never observed for a single unit, treatment effects must be inferred by comparing average observed outcomes for treated and untreated units:

$$\frac{1}{N_1}\sum_{i:\,D_i=1} y_i - \frac{1}{N_0}\sum_{i:\,D_i=0} y_i\text{,}$$

where $N_1$ and $N_0$ are the numbers of treated and untreated units. Another way to express this is using expected value notation. For a random variable, $X$, its expected value is:
$$E[X] = x_1p_1 + x_2p_2 + \dots + x_kp_k = \sum_{i=1}^{k}x_ip_i\text{,}$$

where $p_i$ is the probability of seeing $x_i$ (in this formula, $i$ indexes the $k$ possible values of $X$, not units).
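To make the expected value formula concrete, here is a small sketch that evaluates it for a fair six-sided die. The die is an assumed example, and `expected_value` is a hypothetical helper name. Exact fractions are used so the probabilities sum to exactly 1.

```python
from fractions import Fraction


def expected_value(values, probs):
    """Return the sum of x_i * p_i for a discrete random variable."""
    if sum(probs) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(values, probs))


# Assumed example: a fair die, each face with probability 1/6.
faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

die_mean = expected_value(faces, probs)
print(die_mean)
```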
We can rewrite the comparison of average outcomes for treated and untreated units in terms of conditional expectations:

$$E[y_i \mid D_i=1] - E[y_i \mid D_i=0]\text{.}$$

This is the difference in expected outcomes for treated and untreated units.
Recall our medical treatment study example, where the treatment is only administered to the patients who need it most. A naive comparison of outcomes for treated and untreated units, in this example, tells us something about potential outcomes, but not necessarily what we want to know: the effect of the treatment (Angrist and Pischke, 2009). In this case, the observed difference in average outcomes decomposes as follows:

\begin{align} E[y_i \mid D_i=1] - E[y_i \mid D_i=0] &= \underbrace{E[y_{1i} \mid D_i=1] - E[y_{0i} \mid D_i=1]}_\text{Average treatment effect on the treated}\\ &\quad + \underbrace{E[y_{0i} \mid D_i=1] - E[y_{0i} \mid D_i=0]}_\text{Selection bias}\text{.} \end{align}

The term $E[y_{1i} \mid D_i=1] - E[y_{0i} \mid D_i=1]$ represents the difference between the average outcome for those who were treated and what would have happened to them had they not been treated. To this causal effect, we add a term called selection bias. This represents the difference in average $y_{0i}$, that is, the outcomes we would have observed without treatment, between those who were and weren't treated. In our example, because those who were treated were of the highest need, their outcomes without treatment are presumably worse than those of the patients who weren't treated (we have said nothing about how they respond to the treatment). Assuming larger values of $y$ mean better outcomes, this makes selection bias negative.
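The role of the selection bias term can be checked on a toy dataset. The sketch below uses six hypothetical patients whose potential outcomes are both known (which is never the case in practice), assumes a constant treatment effect of +1, and treats the three patients with the worst untreated outcomes. All values and variable names are illustrative assumptions.

```python
from statistics import mean

# Hypothetical potential outcomes for six patients (larger = healthier).
y0 = [2.0, 3.0, 4.0, 7.0, 8.0, 9.0]  # outcome without treatment
y1 = [v + 1.0 for v in y0]           # assumed constant effect of +1
d = [1, 1, 1, 0, 0, 0]               # the three neediest patients are treated

# Observed outcome: y_1i for treated units, y_0i for untreated units.
y = [a + (b - a) * t for a, b, t in zip(y0, y1, d)]

# Naive comparison: E[y | D=1] - E[y | D=0].
naive = (mean([v for v, t in zip(y, d) if t == 1])
         - mean([v for v, t in zip(y, d) if t == 0]))

# Average treatment effect on the treated: E[y1 - y0 | D=1].
att = mean([b - a for a, b, t in zip(y0, y1, d) if t == 1])

# Selection bias: E[y0 | D=1] - E[y0 | D=0].
selection_bias = (mean([a for a, t in zip(y0, d) if t == 1])
                  - mean([a for a, t in zip(y0, d) if t == 0]))

print(naive, att, selection_bias)
```

Here the naive comparison is negative even though every patient benefits from treatment, because the negative selection bias outweighs the positive effect on the treated.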
Let's consider each component of the decomposition above. Why do we focus on the average treatment effect on the treated, and how do we get rid of selection bias? It turns out that random assignment of $D_i$ can explain the former and solve the latter.
Random assignment of $D_i$ makes $D_i$ independent of potential outcomes. This means that knowing $D_i$ gives us no information about the potential outcomes, $y_{0i}$ and $y_{1i}$. That is, the treated and untreated units have the same potential outcomes on average.
For the difference in expected outcomes between treated and untreated units, this implies that:

\begin{align} E[y_i \mid D_i=1] - E[y_i \mid D_i=0]&=E[y_{1i} \mid D_i=1] - E[y_{0i} \mid D_i=0]\\ &=\underbrace{E[y_{1i} \mid D_i=1] - E[y_{0i} \mid D_i=1]}_\text{Average treatment effect on the treated}\text{.} \end{align}

Going from the first line to the second is possible because of the independence of $y_{0i}$ and $D_i$, which lets us replace $E[y_{0i} \mid D_i=0]$ with $E[y_{0i} \mid D_i=1]$. Under random assignment, the average effect on the treated is no different from the average effect for everyone else, so we don't necessarily have to focus on the average treatment effect on the treated. But this notation is useful because we can simplify the second line further:
\begin{align} E[y_{1i} \mid D_i=1] - E[y_{0i} \mid D_i=1] &=E[y_{1i} - y_{0i} \mid D_i=1]\\ &=E[y_{1i} - y_{0i}]\text{.} \end{align}

Under random assignment, then, comparing the observed outcomes for treated and untreated units recovers the average treatment effect.
What about selection bias? Notice that because $E[y_{0i} \mid D_i=0]$ is equal to $E[y_{0i} \mid D_i=1]$, selection bias is eliminated!
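A short simulation can illustrate this. The sketch below is only an illustration under stated assumptions: untreated outcomes are drawn from a normal distribution with mean 50 and standard deviation 10, the treatment effect is a constant +2, and need-based assignment treats every unit whose untreated outcome is below 50. None of these choices come from the text. The helpers `avg` and `compare` are hypothetical names.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
N = 50_000
TRUE_EFFECT = 2.0       # assumed constant treatment effect

# Simulated potential outcomes (both known only because we generate them).
y0 = [rng.gauss(50.0, 10.0) for _ in range(N)]
y1 = [v + TRUE_EFFECT for v in y0]


def avg(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def compare(d):
    """Return (naive difference in means, selection bias) for assignment d."""
    treated_y = [b for b, t in zip(y1, d) if t == 1]   # observed y for treated
    control_y = [a for a, t in zip(y0, d) if t == 0]   # observed y for untreated
    treated_y0 = [a for a, t in zip(y0, d) if t == 1]  # counterfactual for treated
    naive = avg(treated_y) - avg(control_y)
    bias = avg(treated_y0) - avg(control_y)
    return naive, bias


# Need-based assignment: treat units whose untreated outcome is below 50.
d_need = [1 if v < 50.0 else 0 for v in y0]
# Random assignment: a coin flip that ignores the potential outcomes.
d_random = [rng.randint(0, 1) for _ in range(N)]

naive_need, bias_need = compare(d_need)
naive_random, bias_random = compare(d_random)

print("need-based:", naive_need, bias_need)
print("random:    ", naive_random, bias_random)
```

With need-based assignment, the naive difference is dominated by a large negative selection bias. With random assignment, the bias is close to zero and the naive difference is close to the assumed effect.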