In this notebook, I reproduce the semiparametric results from
Carneiro, P., Heckman, J. J., & Vytlacil, E. J. (2011). Estimating marginal returns to education. American Economic Review, 101(6), 2754-81.
The authors analyze the returns to college for white males born between 1957 and 1963 using data from the National Longitudinal Survey of Youth 1979. The authors provide some replication material on their website but do not include geographic identifiers. Therefore, we make use of a mock data set that merges background characteristics and local variables randomly.
In a future update, the semiparametric estimation method will be included in the open-source package grmpy for the simulation and estimation of the generalized Roy model in Python. Currently, grmpy is limited to the estimation of a parametric normal version of the generalized Roy model.
For more, see the online documentation.
In [1]:
import warnings

import numpy as np

from grmpy.estimate.estimate import fit
from grmpy.estimate.estimate_semipar import plot_common_support
from tutorial_semipar_auxiliary import plot_semipar_mte

warnings.filterwarnings('ignore')
The method of Local Instrumental Variables (LIV) is based on the generalized Roy model, which is characterized by the following equations:
\begin{align} Y_1 &= \mu_1(X) + U_1 && \text{(potential outcome with college)} \\ Y_0 &= \mu_0(X) + U_0 && \text{(potential outcome without college)} \\ D &= \mathbb{1}[\mu_D(Z) > V] && \text{(college choice)} \\ Y &= D Y_1 + (1 - D) Y_0 && \text{(observed outcome)} \end{align}where $U_D = F_V(V)$ denotes the quantiles of the unobserved resistance to treatment, so that $D = \mathbb{1}[P(Z) > U_D]$ with $P(Z) = \Pr(D = 1 | Z)$.
We work with the linear-in-the-parameters version of the generalized Roy model, in which $\mu_1(X) = X \beta_1$ and $\mu_0(X) = X \beta_0$:
\begin{align} E(Y|X = \overline{x}, P(Z) = p) = \overline{x} \beta_0 + p \overline{x} (\beta_1 - \beta_0) + K(p), \end{align}where $K(p) = E(U_1 - U_0 | D = 1, P(Z) = p)$ is a nonlinear function of $p$ that captures heterogeneity along the unobservable resistance to treatment $u_D$.
In addition, assume that $(X, Z)$ is independent of $\{U_1, U_0, V\}$. Then, the MTE is
1) additively separable in $X$ and $U_D$, which means that the shape of the MTE is independent of $X$, and
2) identified over the common support of $P(Z)$, unconditional on $X$.
The propensity score $P(Z)$, i.e. the probability of going to college ($D = 1$), plays a crucial role for the identification of the MTE. The common support is defined as the intersection of the support of $P(Z)$ given $D = 1$ and the support of $P(Z)$ given $D = 0$, i.e., those evaluations of $P(Z)$ for which we obtain positive frequencies in both subsamples. We will plot it below. The larger the common support, the larger the region over which the MTE is identified.
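The idea behind the common support can be illustrated with simulated propensity scores. The helper below is a minimal sketch, not grmpy's `plot_common_support` implementation: it bins $P(Z)$ and keeps the range of bins with positive frequency in both treatment states. The function name, bin count, and toy data are all illustrative assumptions.

```python
import numpy as np


def common_support(p, d, bins=30):
    """Return the overlapping range of P(Z) across treatment states.

    p: array of estimated propensity scores; d: binary treatment indicator.
    A bin of P(Z) belongs to the common support if it has positive
    frequency in both the treated (d == 1) and untreated (d == 0) samples.
    """
    edges = np.linspace(0.0, 1.0, bins + 1)
    hist_treated, _ = np.histogram(p[d == 1], bins=edges)
    hist_untreated, _ = np.histogram(p[d == 0], bins=edges)
    overlap = (hist_treated > 0) & (hist_untreated > 0)
    lower = edges[:-1][overlap].min()
    upper = edges[1:][overlap].max()
    return lower, upper


# Toy example: treatment is more likely for high P(Z).
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=1000)
d = rng.binomial(1, p)
low, high = common_support(p, d)
```

Trimming the sample to `[low, high]` (cf. the `trim_support` flag in the initialization file below) restricts estimation to the region where the MTE is identified.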
The LIV estimator, $\Delta^{LIV}$, is derived as follows (Heckman and Vytlacil 2001, 2005):
\begin{equation} \begin{split} \Delta^{LIV} (\overline{x}, u_D) &= \frac{\partial E(Y|X = \overline{x}, P(Z) = p)}{\partial p} \bigg\rvert_{p = u_D} \\ & \\ &= \overline{x}(\beta_1 - \beta_0) + E(U_1 - U_0 | U_D = u_D) \\ &\\ & = \underbrace{\overline{x}(\beta_1 - \beta_0)}_{\substack{observable \\ component}} + \underbrace{\frac{\partial K}{\partial p} \bigg\rvert_{p = u_D}}_{\substack{k(p): \ unobservable \\ component}} = MTE(\overline{x}, u_D) \end{split} \end{equation}Since we do not make any assumption about the functional form of the unobservables, we estimate $k(p)$ nonparametrically. In particular, $k(p)$ is the first derivative of a locally quadratic kernel regression.
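The mechanics of taking the first derivative of a locally quadratic kernel regression can be sketched as follows. At each evaluation point $p_0$ we fit $y \approx b_0 + b_1 (p - p_0) + b_2 (p - p_0)^2$ by weighted least squares; the coefficient $b_1$ estimates the derivative at $p_0$. This is a simplified sketch, not grmpy's implementation: the Gaussian kernel and the function name are assumptions, and the bandwidth of 0.322 merely echoes the value used in the initialization file below.

```python
import numpy as np


def local_quadratic_derivative(x, y, grid, bandwidth=0.322):
    """Estimate g'(x) on `grid` via a locally quadratic kernel regression.

    At each grid point x0, fit y ~ b0 + b1*(x - x0) + b2*(x - x0)**2 by
    weighted least squares with Gaussian kernel weights; b1 is the
    estimated first derivative at x0.
    """
    derivs = np.empty_like(grid, dtype=float)
    for i, x0 in enumerate(grid):
        u = x - x0
        w = np.exp(-0.5 * (u / bandwidth) ** 2)  # Gaussian kernel weights
        X = np.column_stack([np.ones_like(u), u, u ** 2])
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        derivs[i] = beta[1]  # local slope = derivative estimate
    return derivs


# Toy check on a known function: y = p**2, so the true derivative is 2p.
rng = np.random.default_rng(1)
p = rng.uniform(0, 1, 2000)
y = p ** 2 + rng.normal(0, 0.05, p.size)
grid = np.linspace(0.2, 0.8, 7)
k_prime = local_quadratic_derivative(p, y, grid)
```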
For the semiparametric estimation, we need to specify the following sections in the initialization file:
In [2]:
%%file files/tutorial_semipar.yml
---
ESTIMATION:
    file: data/aer-replication-mock.pkl
    dependent: wage
    indicator: state
    semipar: True
    show_output: True
    logit: True
    nbins: 30
    bandwidth: 0.322
    gridsize: 500
    trim_support: True
    reestimate_p: False
    rbandwidth: 0.05
    derivative: 1
    degree: 2
    ps_range: [0.005, 0.995]
TREATED:
    order:
        - exp
        - expsq
        - lwage5
        - lurate
        - cafqt
        - cafqtsq
        - mhgc
        - mhgcsq
        - numsibs
        - numsibssq
        - urban14
        - lavlocwage17
        - lavlocwage17sq
        - avurate
        - avuratesq
        - d57
        - d58
        - d59
        - d60
        - d61
        - d62
        - d63
UNTREATED:
    order:
        - exp
        - expsq
        - lwage5
        - lurate
        - cafqt
        - cafqtsq
        - mhgc
        - mhgcsq
        - numsibs
        - numsibssq
        - urban14
        - lavlocwage17
        - lavlocwage17sq
        - avurate
        - avuratesq
        - d57
        - d58
        - d59
        - d60
        - d61
        - d62
        - d63
CHOICE:
    params:
        - 1.0
    order:
        - const
        - cafqt
        - cafqtsq
        - mhgc
        - mhgcsq
        - numsibs
        - numsibssq
        - urban14
        - lavlocwage17
        - lavlocwage17sq
        - avurate
        - avuratesq
        - d57
        - d58
        - d59
        - d60
        - d61
        - d62
        - d63
        - lwage5_17numsibs
        - lwage5_17mhgc
        - lwage5_17cafqt
        - lwage5_17
        - lurate_17
        - lurate_17numsibs
        - lurate_17mhgc
        - lurate_17cafqt
        - tuit4c
        - tuit4cnumsibs
        - tuit4cmhgc
        - tuit4ccafqt
        - pub4
        - pub4numsibs
        - pub4mhgc
        - pub4cafqt
DIST:
    params:
        - 0.1
        - 0.0
        - 0.0
        - 0.1
        - 0.0
        - 1.0
Note that I do not include a constant in the TREATED and UNTREATED sections. The reason is that in the semiparametric setup, $\beta_1$ and $\beta_0$ are determined by running a Double Residual Regression without an intercept: $$ e_Y = e_X \beta_0 \ + \ e_{X \ \times \ p} (\beta_1 - \beta_0) \ + \ \epsilon $$
where $e_X$, $e_{X \ \times \ p}$, and $e_Y$ are the residuals of a local linear regression of $X$, $X \times p$, and $Y$ on $\widehat{P}(Z)$.
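The logic of the double residual regression can be sketched on simulated data. The snippet below is an illustration under stated assumptions, not grmpy's implementation: it substitutes a simple Nadaraya-Watson smoother for the local linear regression, the bandwidth and data-generating process are made up, and all function names are hypothetical. The key steps match the equation above: residualize $Y$, $X$, and $X \times p$ on the propensity score, then run OLS without an intercept.

```python
import numpy as np


def kernel_smooth(p, v, bandwidth=0.05):
    """Nadaraya-Watson smoother of v on p (stand-in for local linear)."""
    w = np.exp(-0.5 * ((p[:, None] - p[None, :]) / bandwidth) ** 2)
    num = w @ v
    den = w.sum(axis=1)
    return num / den if v.ndim == 1 else num / den[:, None]


# Simulate a partially linear model: y = X*b0 + p*X*(b1 - b0) + K(p) + eps.
rng = np.random.default_rng(2)
n = 1000
p = rng.uniform(0.05, 0.95, n)
X = rng.normal(size=(n, 2))
beta0 = np.array([1.0, -0.5])
delta = np.array([0.5, 0.25])          # delta = beta1 - beta0
K = np.sin(3 * p)                      # nonparametric component K(p)
y = X @ beta0 + p * (X @ delta) + K + rng.normal(0, 0.1, n)

# Step 1: residualize Y, X, and X*p on the propensity score.
Xp = X * p[:, None]
e_y = y - kernel_smooth(p, y)
e_X = X - kernel_smooth(p, X)
e_Xp = Xp - kernel_smooth(p, Xp)

# Step 2: OLS of e_y on (e_X, e_Xp) without an intercept.
regressors = np.hstack([e_X, e_Xp])
coef, *_ = np.linalg.lstsq(regressors, e_y, rcond=None)
b0_hat, delta_hat = coef[:2], coef[2:]
```

Residualizing on $\widehat{P}(Z)$ sweeps out the unknown function $K(p)$, so the remaining regression identifies $\beta_0$ and $\beta_1 - \beta_0$ without any functional-form assumption on the unobservables.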
We now proceed to our replication.
Conduct the estimation based on the initialization file.
In [3]:
rslt = fit('files/tutorial_semipar.yml', semipar=True)
The rslt dictionary contains information on the estimated parameters and the final MTE.
In [4]:
list(rslt)
Out[4]:
Before plotting the MTE, let's see what else we can learn.
For instance, we can account for the variation in $X$.
Note that we divide the MTE by 4 to investigate the effect of one additional year of college education.
In [5]:
np.min(rslt['mte_min']) / 4, np.max(rslt['mte_max']) / 4
Out[5]:
Next we plot the MTE based on the estimation results. As shown in the figure below, the replicated MTE comes very close to the original, but its 90 percent confidence bands are wider. This is due to the use of a mock data set, which merges basic and local variables randomly. The bootstrap method, which is used to estimate the confidence bands, is sensitive to these discrepancies in the data.
In [6]:
mte, quantiles = plot_semipar_mte(rslt, 'files/tutorial_semipar.yml', nbootstraps=250)
People with the highest returns to education (those with low unobserved resistance $u_D$) are more likely to go to college. Note that the returns vary considerably with $u_D$. Low-$u_D$ students have returns of up to 40% per year of college, whereas high-$u_D$ people, who would lose from attending college, have returns of approximately -18%.
The magnitude of total heterogeneity is probably even higher, as the MTE depicts the average gain of college attendance at the mean values of X, i.e. $\bar{x} (\beta_1 - \beta_0)$. Accounting for variation in $X$, we observe returns as high as 64% and as low as -57%.