These are my notes and worked exercises in Python from Geyer's overview, tutorial and notes contrasting the subsampling bootstrap of Politis, Romano and Wolf (hereafter, PRW) and the classic Efron nonparametric bootstrap as presented by Efron and Tibshirani (hereafter, ET). See the references for more exposition; here I'm replicating the R results in Python, along with some notes and a few additional examples from the texts.
An important difference between subsampling and bootstrapping is that subsampling is done without replacement: each subsample is an actual sample (of size b) from the true distribution F, so subsample statistics have exactly the distribution of statistics computed on a size-b sample from F. Bootstrap sampling is done with replacement, yielding samples from the empirical distribution $\hat{F}_n$, which is not the same thing as sampling from F. As Geyer puts it:
Each of these procedures does the Wrong Thing.
- The Right Thing is samples of the right size n from the right distribution F.
- The Politis and Romano thing is samples of the wrong size b ≪ n from the right distribution F.
- The Efron thing is samples of the right size n from the wrong distribution $\hat{F}_n$.
The distinction becomes pronounced for statistics that are not locally smooth. In the Extreme Value example below, the bootstrap distribution of the sample max looks nothing like its true sampling distribution.
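To make the three bullets concrete, here is a minimal sketch of the three sampling schemes (my illustration, not from the tutorials), assuming purely for concreteness that F is Uniform(0, $\theta$):
In [ ]:
import numpy as np

np.random.seed(0)
theta, n, b = 1.0, 100, 10
x = np.random.uniform(0, theta, size=n)                  # the one sample we actually observe

right_thing = np.random.uniform(0, theta, size=n)        # size n from F -- unavailable in practice
prw_thing = np.random.choice(x, size=b, replace=False)   # size b, without replacement: a genuine (small) sample from F
efron_thing = np.random.choice(x, size=n, replace=True)  # size n, with replacement: a sample from F_hat_n, the Efron bootstrap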
Two kinds of subsampling are considered in these tutorials: subsampling of i.i.d. data (size-b samples drawn without replacement, as in the Extreme Value example below) and subsampling of stationary time series (overlapping blocks of b consecutive observations, as in the hormone example that follows).
An R implementation of a subsampling approach to the hormone example of ET sections 8.5 and 8.6 (the moving-blocks version) is provided in the overview. We'll replicate that in this IPython notebook, first in R using the Python-R bridge (rpy2), then in Python using numpy/scipy. (Note that you must have rpy2 installed for the former but not for the latter.)
In [1]:
%load_ext rmagic
In [2]:
%%R
library(bootstrap)
print(lutenhorm)
summary(lutenhorm)
y <- lutenhorm[ , 4]
plot(y)
acf(y)
n <- length(y)
foo <- function(w) {
    z <- w - mean(w)
    m <- length(w)
    out <- lm(z[-1] ~ z[-m] + 0)
    as.numeric(coefficients(out))
}
print(beta.hat <- foo(y))
b <- 8
nboot <- n - b + 1
beta.star <- double(nboot)
for (i in 1:nboot) {
    y.star <- y[seq(i, i + b - 1)]
    beta.star[i] <- foo(y.star)
}
plot(beta.star)
abline(h = beta.hat, lty = 2)
hist(beta.star)
abline(v = beta.hat, lty = 2)
sd(beta.star) * sqrt(b / n)
(If the above output cell does not include a table of 48x5 values followed by 4 graphs, you've something disagreeable in your configuration)
Here are three versions of the hormone example (subsampling as above, plus the bootstrapped-disturbances and moving-blocks approaches from ET) in Python, using pandas DataFrames, scipy and numpy for vectorized numerics, and matplotlib for graphics. This and the other examples assume the notebook server was started with the --pylab=inline flag, so that the numpy/scipy/matplotlib modules are pre-imported and configured and plots are displayed within the page. I am also using a special matplotlibrc file to get a ggplot color scheme, but things are fine without it.
We'll start by grabbing the hormone data from the rpy2 namespace (it was loaded by the previous example). You could do this using the rmagic %Rpull lutenhorm, but that would bring the data across as a nested list of Python lists. Instead, use the load_data() function from pandas.rpy.common to bring it across as appropriate DataFrame or Series objects built on fast ndarrays:
In [3]:
from pandas.rpy.common import load_data
lutenhorm = load_data('lutenhorm')
# (to do this without R, load from a local csv file instead:
#  from pandas import read_csv
#  lutenhorm = read_csv("data/lutenhorm.csv"))
print lutenhorm
print type(lutenhorm)
In [4]:
from statsmodels.graphics.tsaplots import plot_acf
y = lutenhorm.V4
print type(y)
scatter(y.index, y)
plot_acf(y, alpha=.05, lags=20);
The statistic of interest is the slope parameter $\beta$ of an AR(1) model for the centered observations, $z_i = \beta z_{i-1} + \epsilon_i$, where the $\epsilon_i$ are random disturbances. We estimate it from the full sample:
In [5]:
def beta_lse(z, demean=False):
    """estimate the AR(1) beta as the LSE slope relating centered observations"""
    if demean:
        z = z - mean(z)
    zlag = z[:-1]
    return np.dot(zlag/np.dot(zlag, zlag), z[1:])
beta_hat = beta_lse(y, demean=True)
print beta_hat
In the following examples, this convenience function plots a bootstrap distribution histogram along with bootstrap confidence intervals (based on the realized distribution) and normal 95% confidence intervals about the sample statistic.
In [6]:
def plot_bootstrap(theta_star, theta_hat, blockratio=1.0, bins=10, title="bootstrapped values"):
    """
    given a vector of bootstrap realizations of a statistic,
    show a histogram with normal and bootstrap confidence intervals.
    """
    figure()
    scatter(arange(len(theta_star)), theta_star)
    plt.title(title)
    figure()
    b_star = np.sort(theta_star[~np.isnan(theta_star)])  # get rid of nans while we sort
    se = np.std(b_star) * np.sqrt(blockratio)
    ci = [b_star[int(len(b_star)*.025)], b_star[int(len(b_star)*.975)]]  # bootstrap 95% CI based on empirical percentiles
    print "theta_hat = %.4f, E(theta_star)=%.4f, se(theta_hat) = %.4f, 95%% CI = [%.4f, %.4f]" % (
        theta_hat, mean(b_star), se, ci[0], ci[1])
    hist(b_star, bins=bins)
    plt.title("counts for %s" % title)
    xlabel("value")
    ylabel("count")
    axvline(x=theta_hat, linestyle='-', color='k')
    axvline(x=mean(b_star), linestyle='-', color='r')
    axvline(x=theta_hat+1.96*se, linestyle='--', color='b')  # standard normal 95% CI
    axvline(x=theta_hat-1.96*se, linestyle='--', color='b')
    axvline(x=ci[0], linestyle='--', color='r')
    axvline(x=ci[1], linestyle='--', color='r')
The tutorial begins with a subsampling approach to the hormone data. This selects each successive b-length block as a sample. There are few enough of these that the complete set can be used in computing $\hat{\beta}^*$, no random subsampling required:
In [7]:
b = 8 # block length
b_star = np.array([beta_lse(y[i:i+b], demean=True) for i in range(len(y)-b+1)])
plot_bootstrap(b_star, beta_hat, blockratio=float(b)/len(y),
title="complete subsampling with recentering, b=%d" % b)
z_star = np.sqrt(b) * (b_star - beta_hat)
plot_bootstrap(z_star, 0, blockratio=float(b)/len(y),
title="root statistic distribution, b=%d" % b)
In the histogram above of $\hat{\beta}^*$ values, the original sample statistic is a solid black line, the mean of the bootstrapped statistics a solid red line. This estimate is biased low. Normal 95% confidence intervals about the estimate are dashed blue lines, and bootstrap percentile intervals are dashed red lines. When the distribution of $\hat{\beta}^*$ is not normal, these intervals will differ, with the percentile interval capturing the actual 95% density of the bootstrap distribution.
This (and the R code above) is not quite in agreement with the problem statement -- each block is re-centered when fitting, which is not compatible with having a single AR(1) model for the entire series with the mean as a given parameter value. But the stderr is not too bad with a block size of 8, because of stationarity: subsequences of the series are expected to have the same mean and variance as the whole. At smaller block sizes the estimates become biased extremely low, because of the error in estimating the global mean from such a small sample. (If this is a live notebook, you can try out different values of b by editing the code above and typing ctrl-Enter to re-run the estimate.)
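Alternatively, here is a quick sketch (mine, reusing beta_lse and y from above with the same se scaling) that sweeps a few block sizes to show the effect:
In [ ]:
# sweep a few block sizes to see the small-b bias in the recentered subsample estimate
for b_try in (4, 6, 8, 12, 16):
    b_star_try = np.array([beta_lse(y[i:i+b_try], demean=True) for i in range(len(y)-b_try+1)])
    se_try = np.std(b_star_try) * np.sqrt(float(b_try)/len(y))
    print "b=%2d  mean(beta*)=%.3f  se=%.3f" % (b_try, np.mean(b_star_try), se_try)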
The moving-blocks bootstrap procedure described in ET is quite different from the subsampling approach: for each bootstrap sample, randomly select blocks and assemble them into a length-n time series, then compute $\hat{\beta}^*$ for each such length-n series.
In [8]:
B = 200
b = 3 # divides evenly into length(z). See text for scaling factor when not even.
beta_star = np.empty(B)
z = y - mean(y)
z_star = np.empty(len(z))
for boot_i in range(B):
    for block_i, start in enumerate(np.random.randint(len(z) - b + 1, size=len(z)/b)):
        z_star[block_i*b:(block_i+1)*b] = z[start:start+b]
    beta_star[boot_i] = beta_lse(z_star)
plot_bootstrap(beta_star, beta_hat, title="%d moving-block bootstraps, b=%d" % (B, b))
ET's first hormone approach is model-based -- it focuses on the random disturbances driving the AR(1) model. The disturbances $\epsilon_i$ are from an unknown distribution and are estimated from the sample data. This estimated distribution is then resampled to drive the fitted AR(1) model and generate new series with the same statistical structure as the original sample.
Using the estimated $\hat{\beta}$ from the sample, the estimated disturbances are $\hat{\epsilon}_i = z_i - \hat{\beta} z_{i-1}$:
In [9]:
def eps_beta(z, beta):
    """estimate AR(1) disturbances from centered observations z given beta"""
    return (z - z.shift(1) * beta).values[1:]
eps_hat = eps_beta(z, beta_hat)
hist(eps_hat)
Using the estimated disturbances $\hat{\epsilon}$, bootstrapped $z^*$'s are computed sequentially, $z^*_i = \hat{\beta} z^*_{i-1} + \epsilon^*_i$, where each $\epsilon^*_i$ is drawn with replacement from $\hat{\epsilon}$:
In [10]:
def bootstrap_z(z0, beta_hat, eps_hat):
    """
    return a z vector simulated using bootstrapped
    disturbances in an AR(1) model, given an estimated probability model
    (beta_hat, eps_hat).

    beta_hat is the estimated slope term of the AR(1) model and eps_hat is
    the empirical distribution of disturbances, represented as a vector of
    equally likely estimated eps values. We also need the constant z0
    to initialize the AR(1) simulation.
    """
    z_star = np.array([z0] * (len(eps_hat) + 1))
    for i in range(len(eps_hat)):
        z_star[i+1] = (beta_hat * z_star[i]) + np.random.choice(eps_hat)
    return z_star[1:]  # leave out z0, it's not bootstrappy
# continuing with z, beta_hat, and eps_hat from above
B = 200 # number of bootstrap simulations
b_star = np.array([beta_lse(bootstrap_z(z.iloc[0], beta_hat, eps_hat))
                   for i in range(B)])
plot_bootstrap(b_star, beta_hat, title="%d bootstrapped values" % len(b_star))
In [11]:
b = 3 # block length
b_star = np.array([beta_lse(z[i:i+b], demean=False) for i in range(len(z)-b+1)])
plot_bootstrap(b_star, beta_hat, blockratio=float(b)/len(z),
title="complete moving-block bootstrap, b=%d" % b)
The estimated standard errors for b=3 (and b=5) agree with those reported in the text, and the bias is still lousy.
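As a quick check of the b=5 figure, here is a variant of the cell above (a sketch reusing z, beta_hat, and plot_bootstrap; only the block length changes):
In [ ]:
b = 5  # block length
b_star = np.array([beta_lse(z[i:i+b], demean=False) for i in range(len(z)-b+1)])
plot_bootstrap(b_star, beta_hat, blockratio=float(b)/len(z),
               title="complete moving-block bootstrap, b=%d" % b)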
Next, a longer time-series example from the tutorial. Here x is a long series (read from data/lts.csv); we take overlapping subsamples of length b = 2500 whose starting points are spaced spc = 100 apart, estimate the AR(1) $\beta$ on each with ar.burg, and use the quantiles of the root statistic to form a confidence interval for $\beta$, compared at the end with the usual 1.96-standard-error interval. First in R:
In [12]:
%%R
x = read.table("data/lts.csv", header=TRUE)$x
foo <- function(w) ar.burg(w, aic = FALSE, order.max = 1)$ar
print(beta.hat <- foo(x))
n <- length(x)
b <- 2500
spc <- 100
ii <- seq(1, n - b + 1, spc)
nboot <- length(ii)
beta.star <- double(nboot)
for (i in 1:nboot) {
    x.star <- x[seq(ii[i], ii[i] + b - 1)]
    beta.star[i] <- foo(x.star)
}
alpha <- 0.05
z.star <- sqrt(b) * (beta.star - beta.hat)
crit.val <- quantile(z.star,
probs = c(1 - alpha / 2, alpha / 2))
beta.hat - crit.val / sqrt(n)
# histogram from which critical values are derived
hist(z.star)
abline(v = crit.val, lty = 2)
# for comparison, interval using 1.96 bootstrap s. e.
beta.hat + c(-1, 1) * qnorm(0.975) * sd(beta.star) *
sqrt(b / n)
cat("Calculation took", proc.time()[1], "seconds\n")
As in the tutorial, we use the Burg method to estimate the AR(1) $\beta$ parameter. This isn't necessary for an AR(1) like this example, but it is generally superior for higher-order AR models, particularly when the covariance matrix would be ill-conditioned. An implementation is available in the spectrum module, which may be installed with the unix command % easy_install -U spectrum. (Warning: it is slow, and uses a different sign convention than R's ar.burg.)
In [13]:
from pandas import read_csv
from spectrum import arburg
def beta_burg(z, demean=True):
    """estimate the AR(1) beta using the Burg method"""
    if demean:
        z = z - np.mean(z)
    return -np.real(arburg(z, 1)[0][0])
x = read_csv("data/lts.csv").x.values
n = len(x)
b = 2500 # block length
spc = 100
beta_hat = beta_burg(x)
print beta_hat
b_star = np.array([beta_burg(x[i:i+b]) for i in range(0, n-b+1, spc)])
z_star = np.sqrt(b) * (b_star - beta_hat)
plot_bootstrap(z_star, 0, blockratio=float(b)/n,
title="root statistic distribution, b=%d" % b)
Here the statistic of interest is the sample maximum $\hat{\theta}$ of a sample from the uniform distribution on $[0, \theta]$. It can be shown that $n(\theta - \hat{\theta})$ is asymptotically exponentially distributed with mean $\theta$. The subsample and bootstrap computations below examine how well this exponential distribution is reproduced.
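A quick sketch of why: for $t \ge 0$,
$$P\big(n(\theta - \hat{\theta}) > t\big) = P\big(\hat{\theta} < \theta - t/n\big) = \Big(1 - \frac{t}{n\theta}\Big)^n \longrightarrow e^{-t/\theta},$$
so $n(\theta - \hat{\theta})$ converges in distribution to an exponential with mean $\theta$; this is the Exp($1/\hat{\theta}$) reference distribution used in the Q-Q plots below.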
R code for the example (Geyer):
In [14]:
%%R
X <- read.table("http://www.stat.umn.edu/geyer/5601/mydata/expo.txt", header=T)
x = X$x
n <- length(x)
print(theta.hat <- max(x))
nboot <- 1000
b <- 50
theta.star <- double(nboot)
theta.bogo <- double(nboot)
for (i in 1:nboot) {
    theta.star[i] <- max(sample(x, b, replace = FALSE))
    theta.bogo[i] <- max(sample(x, replace = TRUE))
}
z.star <- b * (theta.hat - theta.star)
plot(qexp(ppoints(nboot), 1 / theta.hat), sort(z.star),
xlab = "Quantiles of Exp(1 / theta.hat) distribution")
z.bogo <- n * (theta.hat - theta.bogo)
points(qexp(ppoints(nboot), 1 / theta.hat), sort(z.bogo),
pch="B", col=2)
abline(0, 1)
First, the subsampling approach: take b-length samples (without replacement) from the n observations, then take the max of each. The distribution of these maxima is compared to the true distribution in a Q-Q plot.
In [15]:
from pandas import read_csv
from scipy.stats.distributions import expon
from statsmodels.graphics.gofplots import qqplot
df = read_csv("data/expo.csv")
x = df.x.values
n = len(x)
theta_hat = max(x)
print theta_hat
nboot = 1000
b = 50
theta_star_sub = np.max([np.random.choice(x, b, replace=False) for i in range(nboot)], axis=1)
z_star_sub = b * (theta_hat - theta_star_sub)
qqplot(z_star_sub, expon, scale=theta_hat, line='45');
Now, bootstrapping n-length samples with replacement. The distribution is quite different: each bootstrap draw misses the sample max $\hat{\theta}$ with probability $1 - 1/n$, so the chance that a given bootstrap sample contains $\hat{\theta}$ is $1 - (1 - 1/n)^n \approx 1 - \frac{1}{e} \approx 63\%$ (and similarly for any other particular value from the full sample). The bootstrap distribution of the max therefore has a large atom at $\hat{\theta}$ and does not converge to the sampling distribution as n increases.
In [16]:
theta_star_boot = np.max([np.random.choice(x, n, replace=True) for i in range(nboot)], axis=1)
z_star_boot = n * (theta_hat - theta_star_boot)
qqplot(z_star_boot, expon, scale=theta_hat, line='45');
This is clearer when displayed as a histogram:
In [17]:
hist(theta_star_boot)
PRW point out (section 1.3) that in instances where the simple bootstrap fails, consistency can often be restored by reducing the bootstrap sample size, foreshadowing their subsampling results. In this example, the code is the same as for subsampling, but done with replacement:
In [18]:
nboot = 1000
b = 50
theta_star_sub = np.max([np.random.choice(x, b, replace=True) for i in range(nboot)], axis=1)
z_star_sub = b * (theta_hat - theta_star_sub)
qqplot(z_star_sub, expon, scale=theta_hat, line='45');
One difficulty with subsampling is the need to know the convergence rate, so that the distribution of a statistic computed on b-length subsamples can be rescaled to be comparable with the n-length sample statistic (which is what we need to construct confidence intervals from subsample estimates). PRW show how to use subsampling to estimate the convergence rate itself. Geyer continues with the example of the max order statistic over the $[0, \theta]$ interval, where the rate is known to be $n$ (rather than the typical $\sqrt{n}$ law), and estimates it by constructing nested (hence correlated) subsamples of several sizes and fitting a convergence rate using the method from Chapter 8 of PRW.
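The regression in the code below rests on a scaling argument: if $b^{\beta}(\hat{\theta}_b - \theta)$ has a nondegenerate limiting distribution, then any fixed quantile spread of the subsample deviations $\hat{\theta}_b - \hat{\theta}_n$ shrinks like $b^{-\beta}$, so
$$\log(\text{high quantile} - \text{low quantile}) \approx \text{const} - \beta \log b,$$
and $\beta$ can be estimated as minus the least-squares slope of the mean log spreads on $\log b$, which is exactly the cov(-y, log(b)) / var(log(b)) computation in the R code.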
In [19]:
%%R
X <- read.table(url("http://www.stat.umn.edu/geyer/5601/mydata/big-unif.txt"), header = TRUE)
x = X$x
print(n <- length(x))
print(theta.hat <- max(x))
nboot <- 2e4 - 1
b <- c(40, 60, 90, 135)
b <- sort(b)
theta.star <- matrix(NA, nboot, length(b))
for (i in 1:nboot) {
    x.star <- x
    for (j in length(b):1) {
        x.star <- sample(x.star, b[j], replace = FALSE)
        theta.star[i, j] <- max(x.star)
    }
}
zlist <- list()
for (i in 1:length(b)) {
    zlist[[i]] <- theta.star[ , i] - theta.hat
}
names(zlist) <- b
boxplot(zlist, xlab = "subsample size",
ylab = expression(hat(theta)[b] - hat(theta)[n]))
qlist <- list()
k <- (nboot + 1) * seq(0.05, 0.45, 0.05)
l <- (nboot + 1) * seq(0.55, 0.95, 0.05)
for (i in 1:length(b)) {
    z.star <- zlist[[i]]
    sz.star <- sort(z.star, partial = c(k, l))
    qlist[[i]] <- sz.star[l] - sz.star[k]
}
names(qlist) <- b
lqlist <- lapply(qlist, log)
stripchart(lqlist, xlab = "subsample size",
ylab = "log(high quantile - low quantile)",
vertical = TRUE)
y <- sapply(lqlist, mean)
print(beta <- cov(- y, log(b)) / var(log(b)))
inter <- mean(y) + beta * mean(log(b))
plot(log(rep(b, each = length(k))), unlist(lqlist),
xlab = "log(subsample size)", ylab = "log(high quantile - low quantile)")
abline(inter, - beta)
# confidence interval calculation
m <- 3
b <- b[m]
theta.star <- theta.star[ , m]
alpha <- 0.05
z.star <- b^beta * (theta.star - theta.hat)
# two-sided interval
crit.val <- sort(z.star)[(nboot + 1) * c(1 - alpha / 2, alpha / 2)]
theta.hat - crit.val / n^beta
# histogram from which critical values are derived
hist(z.star)
abline(v = crit.val, lty = 2)
# one-sided confidence interval actually makes more sense
# since we know theta.hat < theta
crit.val <- sort(z.star)[(nboot + 1) * alpha]
theta.hat - c(0, crit.val) / n^beta
cat("Calculation took", proc.time()[1], "seconds\n")
In [20]:
x = read_csv("http://www.stat.umn.edu/geyer/5601/mydata/big-unif.txt").x.values
n = len(x)
theta_hat = max(x)
print "theta_hat= %f" % theta_hat
nboot = int(2e4 - 1)
bs = np.array((135, 90, 60, 40))
theta_star = np.empty((nboot, len(bs)))
for i in range(nboot):
    x_star = x
    for j, b in enumerate(bs):
        x_star = np.random.choice(x_star, b, replace=False)
        theta_star[i, j] = max(x_star)
print theta_star
zs = theta_star - theta_hat
boxplot(zs, positions=arange(len(bs)));
xlabel("subsample size")
xticks(arange(len(bs)), bs)
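To finish the rate estimation in Python, here is a sketch of the remaining steps, mirroring the R code above (mine, not part of the original notebook; it reuses theta_star, bs, nboot, and theta_hat):
In [ ]:
# estimate the convergence-rate exponent from the subsample quantile spreads,
# mirroring the R calculation above
ranks_lo = ((nboot + 1) * 0.05 * np.arange(1, 10)).astype(int) - 1   # 5%..45% order-statistic ranks (0-based)
ranks_hi = ((nboot + 1) * 0.05 * np.arange(11, 20)).astype(int) - 1  # 55%..95%
log_spread = np.empty((len(bs), len(ranks_lo)))
for j in range(len(bs)):
    sz = np.sort(theta_star[:, j] - theta_hat)
    log_spread[j] = np.log(sz[ranks_hi] - sz[ranks_lo])              # log(high quantile - low quantile)
ybar = log_spread.mean(axis=1)
logb = np.log(bs)
beta_rate = np.cov(-ybar, logb)[0, 1] / np.var(logb, ddof=1)         # least-squares slope, as in the R code
print "estimated rate exponent beta = %.3f" % beta_rate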
(... work in progress ...)
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License.