This notebook includes replication code for the Monte Carlo experiments reported in Graham and Pinto (2018). In addition to several standard Python scientific computing libraries, we use functions from the ipt library, available on GitHub at https://github.com/bryangraham/ipt. It includes implementations of Wooldridge's (2004) generalized inverse probability weighting estimator for average partial effects (APE), the "Oaxaca-Blinder" type APE estimator described in the paper, and our own locally efficient, doubly robust estimator.
Graham, Bryan S. and Pinto, Cristine Campos de Xavier. (2018). "Semiparametrically efficient estimation of the average linear regression function," CEMMAP Working Paper.
In [1]:
# Direct Python to plot all figures inline (i.e., not in a separate window)
%matplotlib inline
# Load libraries
import numpy as np
import numpy.linalg
import scipy as sp
import scipy.optimize
import pandas as pd
In [2]:
# Append location of ipt module base directory to system path
# NOTE: only required if permanent install of ipt package not made
import sys
sys.path.append('/Users/bgraham/Dropbox/Sites/software/ipt/')
# Load ipt module
import ipt as ipt
In [3]:
import warnings

def fxn():
    warnings.warn("runtime", RuntimeWarning)

# Suppress RuntimeWarnings raised inside the context manager so they
# do not clutter the notebook output
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()
In [4]:
from platform import python_version
print(python_version())
We consider four designs, each described in detail in Graham and Pinto (2018). In all cases the conditional distribution $f(X|W)$ is Poisson with a conditional mean of $\exp\left(k\left(W\right)'\phi\right)$. In designs $1$ and $3$ the $k\left(W\right)$ vector includes a constant and a linear term. In designs $2$ and $4$ a quadratic term is also included.
The outcome variable is generated according to $Y=a\left(W\right)+b\left(W\right)X+U$. In the first two designs $a\left(W\right)$ and $b\left(W\right)$ are linear functions of $W$, while in the last two they are quadratic functions. Both $W$ and $U$ are standard normal random variables, uncorrelated with each other. The parameter values are chosen such that the standard error of a semiparametrically efficient estimator would be 0.05 in each design (when $N = 1,000$). In this sense each design is equally difficult.
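Concretely, in terms of the coefficient vectors $a=\left(a_{0},a_{1},a_{2}\right)$, $b=\left(b_{0},b_{1},b_{2}\right)$ and $c=\left(c_{0},c_{1},c_{2}\right)$ appearing in the simulation code below, each design sets
$$a\left(W\right)=a_{0}+a_{1}W+a_{2}W^{2},\quad b\left(W\right)=b_{0}+b_{1}W+b_{2}W^{2},\quad \mathrm{E}\left[X|W\right]=\exp\left(c_{0}+c_{1}W+c_{2}W^{2}\right).$$
Since $W$ is standard normal, $\mathrm{E}\left[W^{2}\right]=1$, so the target average slope coefficient is $\beta_{0}=\mathrm{E}\left[b\left(W\right)\right]=b_{0}+b_{2}$; the bias and coverage calculations below are relative to this value.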
We evaluate the performance of three estimators: (i) the generalized inverse probability weighting estimator introduced by Wooldridge (2004), (ii) the "Oaxaca-Blinder" imputation type estimator discussed in the paper, and (iii) our own locally efficient doubly robust estimator.
For the Wooldridge (2004) estimator $f(X|W)$ is modelled as a Poisson distribution with a conditional mean of $\exp\left(k\left(W\right)'\phi\right)$, where $k\left(W\right)$ includes a constant and a linear term. The Wooldridge (2004) estimator is therefore consistent in designs $1$ and $3$. The "Oaxaca-Blinder" estimator assumes that both $a\left(W\right)$ and $b\left(W\right)$ are linear functions of $W$. This estimator will be consistent in designs $1$ and $2$.
Our doubly robust estimator is based on the same submodels as the Wooldridge (2004) and "Oaxaca-Blinder" ones. It is consistent across designs $1$, $2$ and $3$. It is locally efficient in design $1$.
All three estimators are inconsistent in design $4$.
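In summary, across the four designs:

| Design | GIPW (Wooldridge, 2004) | "Oaxaca-Blinder" | Doubly robust |
|--------|-------------------------|------------------|---------------|
| 1 | consistent | consistent | consistent (locally efficient) |
| 2 | inconsistent | consistent | consistent |
| 3 | consistent | inconsistent | consistent |
| 4 | inconsistent | inconsistent | inconsistent |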
In [5]:
import time

N = 1000   # Sample size
S = 5000   # Number of Monte Carlo replications

# Lists with a, b, c coefficient vectors and efficiency bound for each design
D1 = [[1, 1, 0],   [2, 1.22, 0],   [0.1, 0.5, 0],   [0.05]]
D2 = [[1, 1, 0],   [2, 1.26, 0],   [0.1, 0.5, 0.1], [0.05]]
D3 = [[1, 1, 0.5], [2, 1, 0.5],    [0.1, 0.5, 0],   [0.05]]
D4 = [[1, 1, 0.5], [2, 1.05, 0.5], [0.1, 0.5, 0.1], [0.05]]

Designs = [D1, D2, D3, D4]
NumDesigns = len(Designs)

# Initialize matrices to store Monte Carlo results
bias = np.zeros((S, 3*NumDesigns))
coverage = np.zeros((S, 3*NumDesigns))
se = np.zeros((S, 3*NumDesigns))

# Set random seed
np.random.seed(361)

d = 0
start = time.time()

for Design in Designs:
    print("Simulating design " + str(d+1) + " of " + str(len(Designs)))
    a = Design[0]
    b = Design[1]
    c = Design[2]
    asym_se = Design[3][0]

    for s in range(0, S):
        # Simulate s-th dataset
        W, U = np.random.multivariate_normal([0, 0], [[1, 0], [0, 1]], N).T
        X = np.random.poisson(np.exp(c[0] + c[1]*W + c[2]*W**2))
        Y = ((a[0] + a[2]) + a[1]*W + a[2]*(W**2 - 1)) \
            + ((b[0] + b[2]) + b[1]*W + b[2]*(W**2 - 1))*X + U

        W = pd.DataFrame(W, columns=['W'])
        X = pd.Series(X, name='X')
        Y = pd.Series(Y, name='Y')

        # Doubly robust estimator
        [beta_hat_dr, vcov_beta_hat_dr] = ipt.avreg_dr(Y, X, W, psmodel='poisson', c_id=None, s_wgt=None, \
                                                       silent=True)

        # Wooldridge (2004) estimator
        [beta_hat_ipw, vcov_beta_hat_ipw] = ipt.avreg_ipw(Y, X, W, psmodel='poisson', c_id=None, s_wgt=None, \
                                                          silent=True)

        # "Oaxaca-Blinder" estimator
        [beta_hat_ob, vcov_beta_hat_ob] = ipt.avreg_ob(Y, X, W, c_id=None, s_wgt=None, \
                                                       silent=True)

        # Bias relative to the target average slope b[0] + b[2]
        bias[s, [d, d+NumDesigns, d+2*NumDesigns]] = (beta_hat_dr[0] - b[0] - b[2])[0], \
                                                     (beta_hat_ipw[0] - b[0] - b[2])[0], \
                                                     (beta_hat_ob[0] - b[0] - b[2])[0]

        # Coverage of nominal 95 percent confidence intervals
        coverage[s, [d, d+NumDesigns, d+2*NumDesigns]] = \
            ((b[0] + b[2] <= beta_hat_dr[0] + 1.96*np.sqrt(vcov_beta_hat_dr[0,0])) *
             (b[0] + b[2] >= beta_hat_dr[0] - 1.96*np.sqrt(vcov_beta_hat_dr[0,0])))[0], \
            ((b[0] + b[2] <= beta_hat_ipw[0] + 1.96*np.sqrt(vcov_beta_hat_ipw[0,0])) *
             (b[0] + b[2] >= beta_hat_ipw[0] - 1.96*np.sqrt(vcov_beta_hat_ipw[0,0])))[0], \
            ((b[0] + b[2] <= beta_hat_ob[0] + 1.96*np.sqrt(vcov_beta_hat_ob[0,0])) *
             (b[0] + b[2] >= beta_hat_ob[0] - 1.96*np.sqrt(vcov_beta_hat_ob[0,0])))[0]

        # Estimated standard errors
        se[s, [d, d+NumDesigns, d+2*NumDesigns]] = np.sqrt(vcov_beta_hat_dr[0,0]), \
                                                   np.sqrt(vcov_beta_hat_ipw[0,0]), \
                                                   np.sqrt(vcov_beta_hat_ob[0,0])

        end = time.time()
        if (s+1) % 1000 == 0:
            print("Time required for MC rep " + str(s+1) + " of " + str(S) + ": " + str(end-start))
            start = time.time()

    d += 1
In [6]:
# Print options and row and column labels for Monte Carlo results
pd.options.display.precision=4
pd.set_option('display.float_format', lambda x: '%.4f' % x)
# Design labels: first letter = outcome model specification, second letter =
# propensity score specification; S = correctly specified, R = misspecified
# (see the discussion above)
Designs = ['1 (S,S)', '2 (S,R)', '3 (R,S)', '4 (R,R)']
Estimators = ['DR', 'GIPW', 'Oaxaca-Blinder']
# Report bias and coverage results
print("Mean Bias")
mean_bias = pd.DataFrame(np.mean(bias, axis=0).reshape(3,NumDesigns,order='C'), columns=Designs, index=Estimators)
print(mean_bias)
print("")
print("Median Bias")
median_bias = pd.DataFrame(np.median(bias, axis=0).reshape(3,NumDesigns,order='C'), columns=Designs, index=Estimators)
print(median_bias)
print("")
print("Standard deviation")
std_dev = pd.DataFrame(np.std(bias, axis=0).reshape(3,NumDesigns,order='C'), columns=Designs, index=Estimators)
print(std_dev)
print("")
print("Mean Standard Error")
mean_std_err = pd.DataFrame(np.mean(se, axis=0).reshape(3,NumDesigns,order='C'), columns=Designs, index=Estimators)
print(mean_std_err)
print("")
print("Median Standard Error")
median_std_err = pd.DataFrame(np.median(se, axis=0).reshape(3,NumDesigns,order='C'), columns=Designs, index=Estimators)
print(median_std_err)
print("")
print("Coverage (nominal 95%)")
actual_cov = pd.DataFrame(np.mean(coverage, axis=0).reshape(3,NumDesigns,order='C'), columns=Designs, index=Estimators)
print(actual_cov)
print("")
The standard error associated with a Monte Carlo coverage estimate is $\sqrt{\alpha\left(1-\alpha\right)/B}$. With $B = 5,000$ simulations and $\alpha = 0.05$ this results in a standard error of approximately 0.003, or a 95 percent confidence interval of $[0.944, 0.956]$. Overall the Monte Carlo results are consistent with theoretical expectations. This is especially true when considering larger sample sizes (as would be expected).
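As a quick numerical check of this calculation (a minimal snippet using only numpy):

alpha, B = 0.05, 5000
mc_se = np.sqrt(alpha*(1 - alpha)/B)           # Monte Carlo SE of a coverage estimate
print(mc_se)                                   # approximately 0.0031
print(0.95 - 1.96*mc_se, 0.95 + 1.96*mc_se)    # approximately 0.944 and 0.956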
In [7]:
# This imports an attractive notebook style from GitHub
from IPython.display import HTML
from urllib.request import urlopen
html = urlopen('http://bit.ly/1Bf5Hft')
HTML(html.read().decode('utf-8'))
Out[7]: