We are interested in estimating the proportion of the population whose incomes are below the poverty line. Let $Y$ denote income and $c$ the poverty line, so the parameter of interest is $\theta=P\left[Y\leq c\right]$. Suppose we have collected $n$ independent observations $\left\{ Y_{i}\right\}_{i=1}^{n}$.
The Analogy Principle estimator is the fraction of individuals in the sample with incomes at or below the poverty line: $\hat{\theta}_{n}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left(Y_{i}\leq c\right)$.
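As a minimal illustration (the incomes and poverty line below are purely hypothetical), the estimator is just the sample mean of the indicator:

import numpy as np

# Hypothetical toy data: six incomes and a poverty line of 10
incomes = np.array([8.2, 15.0, 9.7, 22.3, 11.1, 7.5])
c = 10.0
# Plug-in estimator: fraction of incomes at or below c
theta_hat = np.mean(incomes <= c)
print(theta_hat)  # 0.5 for this sample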
It can be shown that an asymptotic 95% confidence interval for $\theta$ is $$ P\left[\hat{\theta}_{n}-\frac{1.96}{\sqrt{n}}\sqrt{\hat{\theta}_{n}\left(1-\hat{\theta}_{n}\right)}<\theta<\hat{\theta}_{n}+\frac{1.96}{\sqrt{n}}\sqrt{\hat{\theta}_{n}\left(1-\hat{\theta}_{n}\right)}\right]\approx0.95. $$ Alternatively, using the variance stabilization method, we can derive another confidence interval, $$ P\left[f\left(\sin^{-1}\left(\sqrt{\hat{\theta}_{n}}\right)-\frac{0.98}{\sqrt{n}}\right)<\theta<f\left(\sin^{-1}\left(\sqrt{\hat{\theta}_{n}}\right)+\frac{0.98}{\sqrt{n}}\right)\right]\approx0.95, $$ with $$ f\left(x\right)=\begin{cases} 0, & x\leq0,\\ \sin^{2}\left(x\right), & 0<x\leq\pi/2,\\ 1, & x>\pi/2. \end{cases} $$
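The half-width $0.98/\sqrt{n}$ in the second interval comes from the delta method. For $g\left(\theta\right)=\sin^{-1}\left(\sqrt{\theta}\right)$ we have $g'\left(\theta\right)=1/\left(2\sqrt{\theta\left(1-\theta\right)}\right)$, so the limiting variance $g'\left(\theta\right)^{2}\theta\left(1-\theta\right)$ equals $1/4$ regardless of $\theta$: $$ \sqrt{n}\left(\sin^{-1}\left(\sqrt{\hat{\theta}_{n}}\right)-\sin^{-1}\left(\sqrt{\theta}\right)\right)\xrightarrow{d}N\left(0,\frac{1}{4}\right). $$ The 95% half-width is therefore $1.96/\left(2\sqrt{n}\right)=0.98/\sqrt{n}$, and $2\sqrt{n}\left(\sin^{-1}\left(\sqrt{\hat{\theta}_{n}}\right)-\sin^{-1}\left(\sqrt{\theta}\right)\right)$ is the statistic compared to $1.96$ in the code below.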
Evaluate the two confidence intervals above numerically for all combinations of $n\in\left\{ 10,100,1000\right\}$ and $\theta\in\left\{ .1,.3,.5\right\}$ as follows. For 1000 realizations of $n\hat{\theta}_{n}\sim\text{Bin}\left(n,\theta\right)$, construct both 95% confidence intervals and keep track of how many times (out of 1000) each confidence interval fails to contain $\theta$. Report the observed proportion of rejections for each $\left(n,\theta\right)$ combination. Does your study reveal any differences in the performance of these two competing methods?
In [1]:
# In Python 2 integer division truncates; this import makes 1 / 2 == .5.
# A __future__ import must precede all other statements in the cell.
from __future__ import division

import numpy as np
import itertools
import pandas as pd
In [2]:
# Number of simulations
S = 1000
# Number of observations in each sample
N = [10, 100, 1000]
# True parameter values
theta = [.1, .3, .5]
In [3]:
def rejections(n, p, s):
    """Compute rejection probabilities.

    Parameters
    ----------
    n : int
        Number of draws
    p : float
        Success probability
    s : int
        Number of simulations

    Returns
    -------
    r1, r2 : float
        Rejection probabilities for the two CIs
    """
    # Generate s binomial draws of the number of successes
    X = np.random.binomial(n, p, s)
    # Compute the estimator of the success probability
    theta_hat = X / n
    # Compute the two statistics behind the CIs. If theta_hat is 0 or 1,
    # W1 evaluates to inf, which correctly counts as a rejection (the
    # degenerate Wald CI has zero width), though NumPy emits a
    # division-by-zero RuntimeWarning.
    W1 = np.abs(n**.5 * (theta_hat - p) / (theta_hat * (1 - theta_hat))**.5)
    W2 = np.abs(2 * n**.5 * (np.arcsin(theta_hat**.5) - np.arcsin(p**.5)))
    # Compute the fraction of rejections
    r1 = np.mean(W1 > 1.96)
    r2 = np.mean(W2 > 1.96)
    return r1, r2
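As a quick sanity check (an illustrative call, not part of the study; exact numbers vary across runs), the function can be invoked on a single combination:

# Rejection rates for n = 100, theta = .3; both should be near the nominal .05
r1, r2 = rejections(100, .3, S)
print(r1, r2)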
In [4]:
# Initialize containers for the labels and data
index, df = [], []
# To avoid nested loops, iterate over the Cartesian product of all options
for (n, p) in itertools.product(N, theta):
    r = rejections(n, p, S)
    # Append results to labels and data
    df.append(r)
    index.append((n, p))
# Construct a multi-level index
index = pd.MultiIndex.from_tuples(index, names=['n', 'p'])
# Construct a pandas DataFrame for nice output
table = pd.DataFrame(df, columns=['r1', 'r2'], index=index)
print(table.unstack('p'))
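For reference when reading the table: with $S=1000$ replications, the Monte Carlo standard error of an estimated rejection rate near the nominal .05 is about $\sqrt{.05\times.95/1000}\approx.007$, so observed rates within roughly two standard errors of .05 (about .036 to .064) are consistent with correct coverage. A quick computation:

# Monte Carlo standard error of an estimated rejection rate near .05
mc_se = (.05 * .95 / S)**.5
print(mc_se)  # approximately 0.0069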