Bootstrapping

$$X = \theta + \epsilon$$

where

  • $\theta \in \mathbb{R}$ and
  • $\epsilon$ is the error with CDF $F_\epsilon$
    • $\mathbf{E}[\epsilon] = 0$
    • the density of $\epsilon$ is symmetric

Recall:

  • if $\epsilon\sim \mathcal{N}$ (normal), then the "best" estimator for $\theta$ is the sample mean $\bar{X}$.

  • However, if $\epsilon$ is not normal, this may not be the case.

For example, if $f_\epsilon(x)=\frac{1}{2} e^{-|x|}$ ("double exponential"), then the MLE is the sample median.
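
Spelling out why: for observations $X_i = \theta + \epsilon_i$ with this density, the log-likelihood is

$$\ell(\theta) = -n\log 2 - \sum_{i=1}^n |X_i - \theta|,$$

so maximizing $\ell$ means minimizing $\sum_{i=1}^n |X_i - \theta|$, and that sum is minimized by the sample median.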

This is important because if $F_\epsilon$ is not known, then it's hard to choose a good estimator.

  • The biggest problem with not knowing $F_\epsilon$ is that its tails might be very heavy.

    ("heavy tail" refers to how slowly the density goes to zero as its argument goes to infinity)

  • This is a problem when the number of data points $n$ is not very large. Indeed, then the CLT is NOT a good approximation, so $\bar{X}$ is NOT approximately normal, and the MLE for $\theta$ is NOT $\approx \bar{X}$.

Question: what to do when $F_\epsilon$ is not known and $n$ is not large? ($n=50$ is too low)

  • Answer: resample from the sample a number of times.

Definition:

A bootstrap sample $X^*_i$, $i=1,\dots,n$, is a set of $n$ variables drawn independently, with replacement, from the data points $X_i$, $i=1,\dots,n$.

(notice that the size of a bootstrap sample is the same as that of the original sample)
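
A minimal sketch of drawing one bootstrap sample in Python (the function name `bootstrap_sample` and the toy data are illustrative, not from the notes):

```python
import numpy as np

def bootstrap_sample(x, rng):
    """Draw n points from the data x, independently and with replacement."""
    n = len(x)
    return rng.choice(x, size=n, replace=True)

rng = np.random.default_rng(0)
x = rng.standard_normal(50)        # toy data set, n = 50
x_star = bootstrap_sample(x, rng)  # one bootstrap sample, also of size 50
```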

IDEA: trimmed mean $\bar{X}_\beta$ where $\beta\in(0,1)$:

  • throw away a proportion $\beta$ of data points from the left & right tails of your data

  • This certainly guards against outliers (a quick sketch follows).
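
A sketch of the trimmed mean, assuming (one reading of the bullet above) that a fraction $\beta$ is cut from *each* tail; SciPy's `trim_mean` uses the same convention:

```python
import numpy as np
from scipy.stats import trim_mean

def trimmed_mean(x, beta):
    """Cut a fraction beta of points from each tail, then average the rest."""
    x = np.sort(x)
    k = int(beta * len(x))          # points discarded at each end
    return x[k:len(x) - k].mean()

x = np.array([-7.0, 0.8, 0.9, 1.0, 1.1, 1.2, 9.0])  # two wild outliers
print(trimmed_mean(x, 0.2))   # 1.0 -- the outliers are cut away
print(trim_mean(x, 0.2))      # SciPy's built-in version agrees
```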

IDEA: Let $\bar{\epsilon}_\beta = \bar{X}_\beta - \theta$ be the error of the trimmed mean.

Consider the 0.025 & 0.975 quantiles $c_1$ and $c_2$ of the distribution of $\bar{\epsilon}_\beta$.

Then we would be able to say $[\bar{X}_\beta-c_2, \bar{X}_\beta-c_1]$ is a $95\%$ confidence interval for $\theta$.
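
Spelling out the step: the event $c_1 \le \bar{X}_\beta - \theta \le c_2$ is the same as the event $\bar{X}_\beta - c_2 \le \theta \le \bar{X}_\beta - c_1$, so

$$P\left(\bar{X}_\beta - c_2 \le \theta \le \bar{X}_\beta - c_1\right) = P\left(c_1 \le \bar{\epsilon}_\beta \le c_2\right) = 0.95.$$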

Problem: we don't know $c_1$ and $c_2$, so we have to estimate them.

  • The most obvious idea is to order the $n$ data points and pick the ones corresponding to the $2.5\%$ and $97.5\%$ quantiles. This is a bad idea, because there is not enough data (those sample quantiles will be extremely sensitive).

Solution: let us use a large number $B$ of bootstrap samples (a code sketch follows the recipe):

  • these are $X^*_{ij}$ where $i=1,\dots,n$ and $j=1,\dots,B$

  • then for each $j$ we compute the $2.5\%$ and $97.5\%$ sample quantiles

  • then, we average these sample quantiles over $j=1,\dots,B$

  • the resulting estimates are $\hat{C}_1$ and $\hat{C}_2$; we could call them bootstrapped sample quantiles.

  • finally, the bootstrapped $95\%$ confidence interval for $\theta$ is

    $$\left[\bar{X}_\beta - \hat{C}_2, \bar{X}_\beta - \hat{C}_1\right]$$
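
A sketch of the whole recipe in Python. One assumption here: the notes do not say which values the per-sample quantiles are taken of, so below they are taken of the centered values $X^*_{ij} - \bar{X}_\beta$, which estimate quantiles of the error distribution:

```python
import numpy as np
from scipy.stats import trim_mean

def bootstrap_ci(x, beta=0.1, B=2000, level=0.95, seed=0):
    """Bootstrapped confidence interval for theta, following the recipe above.

    Assumption: per-sample quantiles are computed on the centered values
    X*_ij - Xbar_beta, so that they estimate quantiles of the error.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    alpha = (1 - level) / 2
    xbar_beta = trim_mean(x, beta)       # trimmed mean of the original data
    q_lo, q_hi = np.empty(B), np.empty(B)
    for j in range(B):
        x_star = rng.choice(x, size=n, replace=True)  # bootstrap sample j
        eps_star = x_star - xbar_beta                 # centered values
        q_lo[j], q_hi[j] = np.quantile(eps_star, [alpha, 1 - alpha])
    C1_hat, C2_hat = q_lo.mean(), q_hi.mean()  # average quantiles over j
    return xbar_beta - C2_hat, xbar_beta - C1_hat

# toy heavy-tailed data: theta = 3 with double-exponential errors
rng = np.random.default_rng(1)
x = 3.0 + rng.laplace(size=50)
print(bootstrap_ci(x, beta=0.1))
```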

Remark:

The textbook by Stapleton has an excellent treatment of the extension of this idea to the case of linear regression with non-normal errors.

Remark:

The only good (not hard) alternative to the bootstrap for confidence intervals is the ordinary $t$ method.

  • when tails are fat, however, this leads to big mistakes

    Using $t$-tables to build confidence intervals when $n$ is not large works well only for NORMAL errors.
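
For comparison, a sketch of the ordinary $t$ interval (exact only under normal errors):

```python
import numpy as np
from scipy import stats

def t_interval(x, level=0.95):
    """Ordinary t confidence interval: Xbar +/- t_{n-1} * s / sqrt(n)."""
    n = len(x)
    xbar = x.mean()
    s = x.std(ddof=1)                             # sample standard deviation
    t_crit = stats.t.ppf((1 + level) / 2, n - 1)  # two-sided critical value
    half = t_crit * s / np.sqrt(n)
    return xbar - half, xbar + half
```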

