Replication of Carneiro, Heckman, & Vytlacil's (2011) Local Instrumental Variables approach

In this notebook, I reproduce the semiparametric results from

Carneiro, P., Heckman, J. J., & Vytlacil, E. J. (2011). Estimating marginal returns to education. American Economic Review, 101(6), 2754-81.

The authors analyze the returns to college for white males born between 1957 and 1963 using data from the National Longitudinal Survey of Youth 1979. The authors provide some replication material on their website but do not include geographic identifiers. Therefore, we make use of a mock data set that merges background characteristics and local data randomly.

In a future update, the semiparametric estimation method will be included in the open-source package grmpy for the simulation and estimation of the generalized Roy model in Python. Currently, grmpy is limited to the estimation of a parametric normal version of the generalized Roy model.
For more, see the online documentation.

0) Imports


In [1]:
import numpy as np

from grmpy.estimate.estimate_semipar import plot_common_support
from tutorial_semipar_auxiliary import plot_semipar_mte

from grmpy.estimate.estimate import fit

import warnings
warnings.filterwarnings('ignore')

1) The LIV Framework

The method of Local Instrumental Variables (LIV) is based on the generalized Roy model, which is characterized by the following equations:

\begin{align*}
&\textbf{Potential Outcomes} & & \textbf{Choice} \\
& Y_1 = \beta_1 X + U_{1} & & I = Z \gamma - V \\
& Y_0 = \beta_0 X + U_{0} & & D = \left\{ \begin{array}{ll} 1 & \text{if } I > 0 \\ 0 & \text{if } I \leq 0 \end{array} \right. \\
& \textbf{Observed Outcome} & & \\
& Y = D Y_1 + (1-D) Y_0 & &
\end{align*}

We work with the linear-in-the-parameters version of the generalized Roy model:

\begin{align} E(Y|X = \overline{x}, P(Z) = p) = \overline{x} \beta_0 + p \overline{x} (\beta_1 - \beta_0) + K(p), \end{align}

where $K(p) = E(U_1 - U_0 | D = 1, P(Z) = p)$ is a nonlinear function of $p$ that captures heterogeneity along the unobservable resistance to treatment $u_D$.

In addition, assume that $(X, Z)$ is independent of $\{U_1, U_0, V\}$. Then, the MTE is

1) additively separable in $X$ and $U_D$, which means that the shape of the MTE is independent of $X$, and

2) identified over the common support of $P(Z)$, unconditional on $X$.

The common support of $P(Z)$ plays a crucial role for the identification of the MTE. $P(Z)$ denotes the probability of going to college ($D=1$). The common support is defined as the intersection of the support of $P(Z)$ given $D = 1$ and the support of $P(Z)$ given $D = 0$, i.e., those evaluations of $P(Z)$ for which we obtain positive frequencies in both subsamples. We will plot it below. The larger the common support, the larger the region over which the MTE is identified.
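The bin-based logic behind determining the common support can be sketched as follows. This is a toy illustration with simulated propensity scores, not grmpy's internal routine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated propensity scores for the treated (D=1) and untreated (D=0)
# subsamples; treated scores are skewed toward 1, untreated toward 0.
ps_treated = rng.beta(4, 2, size=1000)
ps_untreated = rng.beta(2, 4, size=1000)

# Split [0, 1] into bins (the tutorial below uses nbins=30) and keep those
# bins with positive frequencies in BOTH subsamples.
bins = np.linspace(0, 1, 31)
hist_treated, _ = np.histogram(ps_treated, bins=bins)
hist_untreated, _ = np.histogram(ps_untreated, bins=bins)
overlap = (hist_treated > 0) & (hist_untreated > 0)

lower, upper = bins[:-1][overlap].min(), bins[1:][overlap].max()
print(f"Common support lies between {lower:.3f} and {upper:.3f}")
```

Data with propensity scores outside $[\text{lower}, \text{upper}]$ would then be trimmed before the MTE is estimated.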

The LIV estimator, $\Delta^{LIV}$, is derived as follows (Heckman and Vytlacil 2001, 2005):

\begin{equation}
\begin{split}
\Delta^{LIV} (\overline{x}, u_D) &= \frac{\partial E(Y|X = \overline{x}, P(Z) = p)}{\partial p} \bigg\rvert_{p = u_D} \\
&= \overline{x}(\beta_1 - \beta_0) + E(U_1 - U_0 | U_D = u_D) \\
&= \underbrace{\overline{x}(\beta_1 - \beta_0)}_{\substack{\text{observable} \\ \text{component}}} + \underbrace{\frac{\partial K}{\partial p} \bigg\rvert_{p = u_D}}_{\substack{k(p): \ \text{unobservable} \\ \text{component}}} = MTE(\overline{x}, u_D)
\end{split}
\end{equation}

Since we do not make any assumption about the functional form of the unobservables, we estimate $k(p)$ non-parametrically. In particular, $k(p)$ is the first derivative of a locally quadratic kernel regression.
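As a sketch of this step: in a locally quadratic fit, the coefficient on the linear term at each grid point is the local slope, i.e., the derivative estimate. This is a toy implementation with a Gaussian kernel, not grmpy's actual kernel routine:

```python
import numpy as np

def local_quadratic_derivative(x, y, grid, bandwidth):
    """Estimate the first derivative of E(y|x) on `grid` via a locally
    quadratic regression with Gaussian kernel weights."""
    derivs = np.empty(len(grid))
    for j, p0 in enumerate(grid):
        w = np.exp(-0.5 * ((x - p0) / bandwidth) ** 2)  # kernel weights
        # Design matrix [1, (x - p0), (x - p0)^2]; the coefficient on the
        # linear term is the local slope, i.e., the derivative at p0.
        X = np.vander(x - p0, 3, increasing=True)
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        derivs[j] = beta[1]
    return derivs

# Sanity check on a known function: d/dp (p^2) = 2p.
p = np.linspace(0, 1, 200)
k_prime = local_quadratic_derivative(p, p**2, np.array([0.3, 0.5, 0.7]), 0.1)
```

In grmpy, the analogous estimator is applied to the residualized outcome as a function of the propensity score, with the bandwidth and gridsize taken from the initialization file.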

2) The Initialization File

For the semiparametric estimation, we need information on the following sections:

  • ESTIMATION: Specify the dependent variable (wage) and the indicator variable (treatment dummy) of the input data frame. For the estimation of the propensity score $P(Z)$, we choose a probability model, here logit. Furthermore, we select 30 bins to determine the common support in the treated and untreated subsamples. For the locally quadratic regression, we follow the specification of Carneiro et al. (2011) and choose a bandwidth of 0.322. The respective gridsize for the locally quadratic regression is set to 500. Fan and Marron (1994) find that a gridsize of 400 is a good default for graphical analysis. Since the data set is large (1785 observations) and the kernel regression function has a very low runtime, we increase the gridsize to 500. Setting it to the default or increasing it even further does not affect the final MTE.
    Note that the MTE identified by LIV consists of two components: $\overline{x}(\beta_1 - \beta_0)$ (which does not depend on $P(Z) = p$) and $k(p)$ (which does). The latter is estimated nonparametrically. The key "ps_range" in the initialization file specifies the interval over which $k(p)$ is estimated. After the data outside the overlapping support are trimmed, the locally quadratic kernel estimator uses the remaining data to predict $k(p)$ over the entire "ps_range" specified by the user. If "ps_range" is larger than the common support, grmpy extrapolates the values of the MTE outside this region. Strictly speaking, interpretations of the MTE are only valid within the common support. Here, we set "ps_range" to [0.005, 0.995].
    The other parameters in this section are set by default and, normally, do not need to be changed.
  • TREATED, UNTREATED, CHOICE: In this section, the variables of the outcome equations (treated, untreated) and the college decision (choice) are specified.
  • DIST: The distribution of the unobservables is not relevant in the semiparametric approach and can be ignored.
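To illustrate the probability model selected by "logit: True", the propensity score can be obtained from a standard logit fit. Below is a minimal hand-rolled Newton-Raphson version on simulated choice data; it is a sketch of the idea, not grmpy's code:

```python
import numpy as np

def fit_logit(D, Z, tol=1e-9, max_iter=50):
    """Fit Pr(D=1|Z) = 1 / (1 + exp(-Z @ gamma)) by Newton-Raphson."""
    gamma = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ gamma))
        grad = Z.T @ (D - p)                      # score vector
        H = Z.T @ (Z * (p * (1 - p))[:, None])    # Fisher information
        step = np.linalg.solve(H, grad)
        gamma = gamma + step
        if np.abs(step).max() < tol:
            break
    return gamma

rng = np.random.default_rng(0)
n = 2000
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_gamma = np.array([0.2, 1.0, -0.5])
# Simulate the college decision D from the latent index model I = Z @ gamma - V.
D = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(Z @ true_gamma)))).astype(float)

gamma_hat = fit_logit(D, Z)
ps = 1.0 / (1.0 + np.exp(-Z @ gamma_hat))  # estimated propensity scores P(Z)
```

The estimated propensity scores `ps` are the evaluations of $P(Z)$ whose overlap across the treated and untreated subsamples defines the common support.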

In [2]:
%%file files/tutorial_semipar.yml
---
ESTIMATION:
    file: data/aer-replication-mock.pkl
    dependent: wage
    indicator: state
    semipar: True
    show_output: True
    logit: True
    nbins: 30
    bandwidth: 0.322
    gridsize: 500
    trim_support: True
    reestimate_p: False
    rbandwidth: 0.05
    derivative: 1
    degree: 2
    ps_range: [0.005, 0.995]
TREATED:
    order:
    - exp
    - expsq
    - lwage5
    - lurate
    - cafqt
    - cafqtsq
    - mhgc
    - mhgcsq
    - numsibs
    - numsibssq
    - urban14
    - lavlocwage17
    - lavlocwage17sq
    - avurate
    - avuratesq
    - d57
    - d58
    - d59
    - d60
    - d61
    - d62
    - d63
UNTREATED:
    order:
    - exp
    - expsq
    - lwage5
    - lurate
    - cafqt
    - cafqtsq
    - mhgc
    - mhgcsq
    - numsibs
    - numsibssq
    - urban14
    - lavlocwage17
    - lavlocwage17sq
    - avurate
    - avuratesq
    - d57
    - d58
    - d59
    - d60
    - d61
    - d62
    - d63
CHOICE:
    params:
    - 1.0
    order:
    - const
    - cafqt
    - cafqtsq
    - mhgc
    - mhgcsq
    - numsibs
    - numsibssq
    - urban14
    - lavlocwage17
    - lavlocwage17sq
    - avurate
    - avuratesq
    - d57
    - d58
    - d59
    - d60
    - d61
    - d62
    - d63
    - lwage5_17numsibs
    - lwage5_17mhgc
    - lwage5_17cafqt
    - lwage5_17
    - lurate_17
    - lurate_17numsibs
    - lurate_17mhgc
    - lurate_17cafqt
    - tuit4c
    - tuit4cnumsibs
    - tuit4cmhgc
    - tuit4ccafqt
    - pub4
    - pub4numsibs
    - pub4mhgc
    - pub4cafqt
DIST:
    params:
    - 0.1
    - 0.0
    - 0.0
    - 0.1
    - 0.0
    - 1.0


Overwriting files/tutorial_semipar.yml

Note that I do not include a constant in the TREATED and UNTREATED sections. The reason is that in the semiparametric setup, $\beta_1$ and $\beta_0$ are determined by running a Double Residual Regression without an intercept: $$ e_Y = e_X \beta_0 \ + \ e_{X \ \times \ p} (\beta_1 - \beta_0) \ + \ \epsilon $$

where $e_X$, $e_{X \ \times \ p}$, and $e_Y$ are the residuals of a local linear regression of $X$, $X \times p$, and $Y$ on $\widehat{P}(Z)$.
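A stylized version of this Double Residual Regression on simulated data can look as follows. It assumes a simple Gaussian-kernel local linear smoother, which may differ from grmpy's exact routine:

```python
import numpy as np

def loclin_fit(v, p, bandwidth=0.1):
    """Fitted values from a local linear regression of v on p."""
    fitted = np.empty_like(v, dtype=float)
    for i, p0 in enumerate(p):
        w = np.exp(-0.5 * ((p - p0) / bandwidth) ** 2)  # kernel weights
        X = np.column_stack([np.ones_like(p), p - p0])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], v * sw, rcond=None)
        fitted[i] = beta[0]  # local intercept = fitted value at p0
    return fitted

rng = np.random.default_rng(0)
n = 400
p = rng.uniform(0.1, 0.9, size=n)      # propensity scores
x = rng.normal(size=n)                 # a single covariate
beta0, beta1 = 1.0, 2.0
K = np.sin(3 * p)                      # some nonlinear K(p)
y = beta0 * x + p * x * (beta1 - beta0) + K + rng.normal(scale=0.1, size=n)

# Residualize Y, X, and X*p on P(Z), then run a no-intercept OLS.
e_y = y - loclin_fit(y, p)
e_x = x - loclin_fit(x, p)
e_xp = x * p - loclin_fit(x * p, p)

coef, *_ = np.linalg.lstsq(np.column_stack([e_x, e_xp]), e_y, rcond=None)
# coef[0] estimates beta_0; coef[1] estimates beta_1 - beta_0.
```

Because the partialling out removes everything that depends on $P(Z)$ alone, including $K(p)$, no intercept is needed in the final regression, which is why the TREATED and UNTREATED sections contain no constant.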

We now proceed to our replication.

3) Estimation

Conduct the estimation based on the initialization file.


In [3]:
rslt = fit('files/tutorial_semipar.yml', semipar=True)


                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                 1747
Model:                          Logit   Df Residuals:                     1717
Method:                           MLE   Df Model:                           29
Date:                Sat, 18 Jan 2020   Pseudo R-squ.:                  0.2858
Time:                        10:25:10   Log-Likelihood:                -864.74
converged:                       True   LL-Null:                       -1210.8
Covariance Type:            nonrobust   LLR p-value:                4.178e-127
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
const              288.3699    151.012      1.910      0.056      -7.609     584.349
cafqt               -6.4256      5.019     -1.280      0.200     -16.263       3.411
cafqtsq              0.3348      0.072      4.665      0.000       0.194       0.476
mhgc                -0.2733      2.146     -0.127      0.899      -4.480       3.933
mhgcsq               0.0180      0.007      2.624      0.009       0.005       0.031
numsibs             -0.4059      2.403     -0.169      0.866      -5.116       4.304
numsibssq            0.0012      0.011      0.104      0.917      -0.021       0.024
urban14              0.3387      0.140      2.418      0.016       0.064       0.613
lavlocwage17       -54.1077     29.111     -1.859      0.063    -111.164       2.948
lavlocwage17sq       2.6770      1.420      1.885      0.059      -0.107       5.461
avurate             -0.0936      0.633     -0.148      0.882      -1.334       1.146
avuratesq            0.0139      0.049      0.286      0.775      -0.081       0.109
d57                  0.3166      0.251      1.262      0.207      -0.175       0.808
d58                  0.3065      0.253      1.214      0.225      -0.189       0.802
d59                 -0.2110      0.251     -0.840      0.401      -0.703       0.281
d60                  0.0341      0.237      0.144      0.886      -0.430       0.498
d61                  0.0863      0.238      0.362      0.717      -0.381       0.553
d62                  0.2900      0.224      1.293      0.196      -0.150       0.730
d63                 -0.0237      0.239     -0.099      0.921      -0.492       0.444
lwage5_17numsibs     0.0170      0.237      0.072      0.943      -0.448       0.482
lwage5_17mhgc        0.0050      0.214      0.023      0.981      -0.414       0.424
lwage5_17cafqt       0.7582      0.498      1.521      0.128      -0.219       1.735
lwage5_17           -1.5203      2.738     -0.555      0.579      -6.887       3.846
lurate_17           -0.1394      0.248     -0.563      0.573      -0.625       0.346
lurate_17numsibs    -0.0028      0.020     -0.140      0.888      -0.042       0.037
lurate_17mhgc        0.0074      0.019      0.386      0.700      -0.030       0.045
lurate_17cafqt      -0.0174      0.044     -0.394      0.693      -0.104       0.069
tuit4c               0.0114      0.060      0.191      0.849      -0.105       0.128
tuit4cnumsibs        0.0039      0.005      0.806      0.420      -0.006       0.013
tuit4cmhgc          -0.0008      0.005     -0.167      0.867      -0.010       0.008
tuit4ccafqt         -0.0041      0.010     -0.398      0.690      -0.024       0.016
pub4                 0.4641      0.873      0.532      0.595      -1.247       2.175
pub4numsibs          0.0451      0.074      0.611      0.541      -0.100       0.190
pub4mhgc            -0.0408      0.069     -0.594      0.553      -0.176       0.094
pub4cafqt           -0.0164      0.164     -0.100      0.920      -0.338       0.305
====================================================================================

    Common support lies between:

        0.053615848983571426 and
        0.9670786072336038
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.077
Model:                            OLS   Adj. R-squared (uncentered):              0.052
Method:                 Least Squares   F-statistic:                              3.079
Date:                Sat, 18 Jan 2020   Prob (F-statistic):                    1.09e-10
Time:                        10:25:10   Log-Likelihood:                         -978.64
No. Observations:                1659   AIC:                                      2045.
Df Residuals:                    1615   BIC:                                      2283.
Df Model:                          44                                                  
Covariance Type:            nonrobust                                                  
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
exp                  -0.0061      0.032     -0.190      0.850      -0.069       0.057
expsq                 0.0036      0.002      1.940      0.053   -4.09e-05       0.007
lwage5                0.3281      0.171      1.919      0.055      -0.007       0.663
lurate               -0.0279      0.023     -1.239      0.216      -0.072       0.016
cafqt                 0.0169      0.152      0.111      0.912      -0.281       0.315
cafqtsq              -0.0263      0.054     -0.487      0.626      -0.132       0.080
mhgc                 -0.1185      0.060     -1.987      0.047      -0.235      -0.002
mhgcsq                0.0057      0.003      1.719      0.086      -0.001       0.012
numsibs               0.0387      0.039      0.988      0.323      -0.038       0.116
numsibssq            -0.0032      0.004     -0.788      0.431      -0.011       0.005
urban14               0.0554      0.059      0.943      0.346      -0.060       0.171
lavlocwage17         -6.7889     12.889     -0.527      0.598     -32.070      18.493
lavlocwage17sq        0.3212      0.629      0.511      0.610      -0.913       1.555
avurate               0.3364      0.236      1.428      0.154      -0.126       0.798
avuratesq            -0.0242      0.018     -1.339      0.181      -0.060       0.011
d57                  -0.3912      0.127     -3.072      0.002      -0.641      -0.141
d58                  -0.1951      0.125     -1.555      0.120      -0.441       0.051
d59                  -0.1554      0.100     -1.561      0.119      -0.351       0.040
d60                  -0.1414      0.093     -1.519      0.129      -0.324       0.041
d61                  -0.0852      0.091     -0.938      0.349      -0.263       0.093
d62                  -0.0729      0.088     -0.826      0.409      -0.246       0.100
d63                  -0.0315      0.085     -0.371      0.710      -0.198       0.135
exp_ps                0.1174      0.050      2.361      0.018       0.020       0.215
expsq_ps             -0.0116      0.003     -3.789      0.000      -0.018      -0.006
lwage5_ps            -0.7168      0.297     -2.414      0.016      -1.299      -0.134
lurate_ps             0.0284      0.041      0.685      0.494      -0.053       0.110
cafqt_ps              0.1328      0.393      0.338      0.735      -0.638       0.904
cafqtsq_ps            0.0757      0.115      0.658      0.511      -0.150       0.302
mhgc_ps               0.2407      0.138      1.744      0.081      -0.030       0.511
mhgcsq_ps            -0.0095      0.007     -1.383      0.167      -0.023       0.004
numsibs_ps           -0.0789      0.080     -0.992      0.321      -0.235       0.077
numsibssq_ps          0.0070      0.009      0.805      0.421      -0.010       0.024
urban14_ps            0.0360      0.123      0.293      0.769      -0.205       0.277
lavlocwage17_ps       9.1985     23.452      0.392      0.695     -36.800      55.197
lavlocwage17sq_ps    -0.4198      1.146     -0.366      0.714      -2.668       1.829
avurate_ps           -0.3721      0.452     -0.823      0.411      -1.259       0.515
avuratesq_ps          0.0270      0.035      0.770      0.442      -0.042       0.096
d57_ps                1.0237      0.227      4.513      0.000       0.579       1.469
d58_ps                0.6714      0.217      3.095      0.002       0.246       1.097
d59_ps                0.3508      0.183      1.919      0.055      -0.008       0.709
d60_ps                0.4781      0.169      2.838      0.005       0.148       0.809
d61_ps                0.2989      0.161      1.852      0.064      -0.018       0.615
d62_ps                0.2595      0.166      1.561      0.119      -0.066       0.585
d63_ps                0.2084      0.152      1.374      0.170      -0.089       0.506
==============================================================================
Omnibus:                       64.414   Durbin-Watson:                   2.065
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              146.286
Skew:                          -0.212   Prob(JB):                     1.72e-32
Kurtosis:                       4.392   Cond. No.                     1.52e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.52e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

The rslt dictionary contains information on the estimated parameters and the final MTE.


In [4]:
list(rslt)


Out[4]:
['quantiles', 'mte', 'mte_x', 'mte_u', 'mte_min', 'mte_max', 'X', 'b1-b0']

Before plotting the MTE, let's see what else we can learn. For instance, we can account for the variation in $X$.
Note that we divide the MTE by 4 to investigate the effect of one additional year of college education.


In [5]:
np.min(rslt['mte_min']) / 4, np.max(rslt['mte_max']) / 4


Out[5]:
(-0.5703098725428841, 0.6420576895047017)

Next, we plot the MTE based on the estimation results. As shown in the figure below, the replicated MTE comes very close to the original, but its 90 percent confidence bands are wider. This is due to the use of a mock data set, which merges background and local variables randomly. The bootstrap method, which is used to estimate the confidence bands, is sensitive to these discrepancies in the data.


In [6]:
mte, quantiles = plot_semipar_mte(rslt, 'files/tutorial_semipar.yml', nbootstraps=250)


People with the highest returns to education (those with low unobserved resistance $u_D$) are more likely to go to college. Note that the returns vary considerably with $u_D$. Low-$u_D$ students have returns of up to 40% per year of college, whereas high-$u_D$ individuals, who would lose from attending college, have returns of approximately -18%.

The magnitude of total heterogeneity is probably even higher, as the MTE depicts the average gain of college attendance at the mean values of $X$, i.e., $\overline{x} (\beta_1 - \beta_0)$. Accounting for variation in $X$, we observe returns as high as 64% and as low as -57%.