Confidence intervals for a proportion estimator

We are interested in estimating the proportion of the population whose income is below the poverty line. Let $Y$ denote income and $c$ the poverty line, so the parameter of interest is $\theta=P\left[Y\leq c\right]$. Suppose we have collected $n$ independent observations $\left\{ Y_{i}\right\}_{i=1}^{n}$.

The Analogy Principle estimator is the fraction of individuals below the poverty line in the sample: $\hat{\theta}_{n}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left(Y_{i}\leq c\right)$.
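
As a quick illustration, the estimator is just the sample mean of the indicator $\mathbf{1}\left(Y_{i}\leq c\right)$. The sketch below computes it for simulated log-normal incomes and an arbitrary poverty line; both choices are illustrative assumptions, not part of the exercise.

import numpy as np

# Illustrative assumptions: log-normal incomes and a poverty line of 1.0
c = 1.0
Y = np.random.lognormal(mean=0, sigma=1, size=1000)
# Fraction of simulated incomes at or below the poverty line
theta_hat = np.mean(Y <= c)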

It can be shown that an asymptotic 95% confidence interval for $\theta$ satisfies $$ P\left[\hat{\theta}_{n}-\frac{1.96}{\sqrt{n}}\sqrt{\hat{\theta}_{n}\left(1-\hat{\theta}_{n}\right)}<\theta<\hat{\theta}_{n}+\frac{1.96}{\sqrt{n}}\sqrt{\hat{\theta}_{n}\left(1-\hat{\theta}_{n}\right)}\right]\approx0.95. $$ Alternatively, using the variance-stabilization method, which relies on $2\sqrt{n}\left(\sin^{-1}\sqrt{\hat{\theta}_{n}}-\sin^{-1}\sqrt{\theta}\right)\xrightarrow{d}N\left(0,1\right)$ so that the half-width is $1.96/\left(2\sqrt{n}\right)=0.98/\sqrt{n}$, we can derive another confidence interval, $$ P\left[f\left(\sin^{-1}\left(\sqrt{\hat{\theta}_{n}}\right)-\frac{0.98}{\sqrt{n}}\right)<\theta<f\left(\sin^{-1}\left(\sqrt{\hat{\theta}_{n}}\right)+\frac{0.98}{\sqrt{n}}\right)\right]\approx0.95, $$ where $$ f\left(x\right)=\begin{cases} 0, & x\leq0,\\ \sin^{2}\left(x\right), & 0<x\leq\pi/2,\\ 1, & x>\pi/2. \end{cases} $$
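
Although the simulation below works with the equivalent test statistics rather than with the interval endpoints, the two intervals can also be constructed directly. The sketch below mirrors the formulas above; the helper names ci_normal and ci_arcsine are hypothetical.

import numpy as np

def f(x):
    """Back-transformation sin^2(x), clipped to [0, 1] as in the definition above."""
    return np.sin(np.clip(x, 0, np.pi / 2)) ** 2

def ci_normal(theta_hat, n, z=1.96):
    """Asymptotic normal confidence interval for theta."""
    half = z / n ** .5 * (theta_hat * (1 - theta_hat)) ** .5
    return theta_hat - half, theta_hat + half

def ci_arcsine(theta_hat, n, z=1.96):
    """Variance-stabilized (arcsine) confidence interval for theta."""
    center = np.arcsin(theta_hat ** .5)
    half = z / (2 * n ** .5)  # 1.96 / 2 = 0.98
    return f(center - half), f(center + half)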

Evaluate the two confidence intervals above numerically for all combinations of $n\in\left\{ 10,100,1000\right\}$ and $\theta\in\left\{ .1,.3,.5\right\}$ as follows. For 1000 realizations of $n\hat{\theta}_{n}\sim\text{Bin}\left(n,\theta\right)$, construct both 95% confidence intervals and keep track of how many times (out of 1000) each confidence interval does not contain $\theta$. Report the observed proportion of rejections for each $\left(n,\theta\right)$ combination. Does your study reveal any differences in the performance of these two competing methods?

Set up the environment


In [1]:
# In Python 2.7, dividing two integers truncates the result. Import this first
# (a __future__ import must precede other statements) so that 1 / 2 == .5
from __future__ import division

import numpy as np
import itertools
import pandas as pd

Define parameters of the simulation


In [2]:
# Number of simulations
S = 1000
# Number of observations in each sample
N = [10, 100, 1000]
# True parameter values
theta = [.1, .3, .5]

Define the function that returns rejection probabilities


In [3]:
def rejections(n, p, s):
    """Compute rejection probabilties.
    
    Parameters
    ----------
    n : int
        Number of observations (Bernoulli trials) in each sample
    p : float
        Success probability
    s : int
        Number of simulations
    
    Returns
    -------
    r1, r2 : float
        Rejection frequencies for the two confidence intervals
        
    """
    # Generate the data
    X = np.random.binomial(n, p, s)
    # Compute the estimator of success probability
    theta_hat = X / n
    # Compute the two test statistics corresponding to the two CIs
    W1 = np.abs(n**.5 * (theta_hat - p) / (theta_hat * (1 - theta_hat))**.5)
    W2 = np.abs(2 * n**.5 * (np.arcsin(theta_hat**.5) - np.arcsin(p**.5)))
    # Compute the fraction of rejections
    r1 = np.mean(W1 > 1.96)
    r2 = np.mean(W2 > 1.96)
    return r1, r2
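
Note that the rejection rule $\left|W_{1}\right|>1.96$ is equivalent to the true $\theta$ falling outside the first confidence interval; when $\hat{\theta}_{n}$ equals 0 or 1 the statistic evaluates to infinity, so the resulting zero-width interval is still counted as a rejection (at the cost of a harmless RuntimeWarning). The self-contained snippet below sketches the same check via interval containment; the values of n, p, and s are illustrative only.

from __future__ import division

import numpy as np

# Illustrative values, not those of the study
n, p, s = 100, .3, 1000
theta_hat = np.random.binomial(n, p, s) / n
half = 1.96 / n ** .5 * (theta_hat * (1 - theta_hat)) ** .5
# Fraction of simulated samples whose interval misses the true p
r1_check = np.mean((p < theta_hat - half) | (p > theta_hat + half))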

Run simulation study


In [4]:
# Initialize containers for the labels and data
index, df = [], []

# To avoid nested loops, iterate over the Cartesian product of all (n, theta) combinations
for (n, p) in itertools.product(N, theta):
    r = rejections(n, p, S)
    # Append results to labels and data
    df.append(r)
    index.append([n, p])

# Construct multi-level index
index = pd.MultiIndex.from_tuples(index, names=['n', 'p'])
# Construct pandas DataFrame for nice output
table = pd.DataFrame(df, columns=['r1', 'r2'], index=index)

print(table.unstack('p'))


         r1                   r2              
p       0.1    0.3    0.5    0.1    0.3    0.5
n                                             
10    0.338  0.142  0.116  0.351  0.046  0.116
100   0.077  0.049  0.056  0.048  0.049  0.056
1000  0.045  0.051  0.061  0.050  0.051  0.061
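
To judge whether the deviations from the nominal level of 0.05 are within simulation noise, the rejection rates can be compared with the Monte Carlo standard error of a proportion based on S = 1000 replications:

# Monte Carlo standard error of a rejection proportion at the 5% nominal level;
# rates within roughly two standard errors of 0.05 (about 0.036 to 0.064) are
# consistent with correct coverage
se = (.05 * (1 - .05) / S) ** .5  # approximately 0.0069

By this yardstick, both intervals are essentially at the nominal level for n = 100 and n = 1000 (the main exception being the normal interval at n = 100, theta = 0.1), while for n = 10 both over-reject, most severely at theta = 0.1. The variance-stabilized interval does noticeably better at n = 10, theta = 0.3, so the differences between the two methods show up mainly in small samples with small success probabilities.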