In the previous chapter, we saw how permutation tests could help us determine whether observed differences between experimental groups are real. Randomized experiments make this possible. While the estimates they produce are unbiased, however, they are not necessarily precise.
This chapter describes how to reduce the variability of the outcomes using regression. We make use of covariates—that is, other variables that affect outcomes—in order to eliminate observed differences between experimental groups.
A regression model typically describes a single variable in terms of one or more other variables. A line, for example, can be described by the equation $y = mx + b$, where $b$ is the $y$-intercept and $m$ is the slope. The slope tells us the amount by which $y$ increases for every unit increase in $x$. In other words, this equation describes the relationship between $x$ and $y$ in terms of the slope $m$ and the intercept $b$.
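To make this concrete, here is a minimal sketch that recovers $m$ and $b$ from points lying on a line; the particular numbers and the use of NumPy's `polyfit` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data: points that lie exactly on the line y = 2x + 1,
# so the slope m is 2 and the intercept b is 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# Fit a first-degree polynomial (a straight line) to the points.
m, b = np.polyfit(x, y, deg=1)
print(m, b)  # approximately 2.0 and 1.0: each unit increase in x raises y by m
```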
By convention, $y$ and $x$ are typically referred to as the dependent and independent variables, respectively.
Dependent variables are also sometimes referred to as "response variables," "regressands," or "targets." The corresponding terms for independent variables are "predictor variables," "regressors," or "features."
Previously, we expressed the potential outcome equation as:
$$y_i = y_{0i} + (y_{1i} - y_{0i})D_i\text{.}$$

It turns out, the regression equation isn't much different:
$$y_i = \beta_0 + \beta_1 D_i + \epsilon_i\text{.}$$

Notice the similarities. $D_i$ is the treatment status and is either $0$ or $1$. The parameter of interest—that is, what we're interested in estimating—is $\beta_1$. It tells us the amount by which $y$ changes with treatment. In other words, it is the ATE.
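The sketch below illustrates this with simulated data; the effect size, sample size, and the use of statsmodels are assumptions for illustration. With a binary treatment and no other regressors, the OLS coefficient on $D_i$ is just the difference in group means.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data (hypothetical): a true treatment effect of 2.0.
n = 1_000
D = rng.integers(0, 2, size=n)                 # random assignment, D_i in {0, 1}
y = 5.0 + 2.0 * D + rng.normal(0, 1, size=n)   # y_i = beta_0 + beta_1 * D_i + eps_i

# Regress y on a constant and D; the coefficient on D estimates beta_1.
model = sm.OLS(y, sm.add_constant(D)).fit()
print(model.params)  # roughly [5.0, 2.0]

# The same estimate is simply the difference in means between groups.
print(y[D == 1].mean() - y[D == 0].mean())
```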