Subsampling Bootstrap

These are my notes and worked exercises in Python from Geyer's overview, tutorial and notes contrasting the subsampling bootstrap of Politis, Romano and Wolf (hereafter, PRW) with the classic Efron nonparametric bootstrap as presented in Efron and Tibshirani (hereafter, ET). See the references for more exposition; here I'm replicating the R results in Python, along with some notes and a few additional examples from the texts.

Sampling with and without replacement

An important difference between subsampling and bootstrapping is that subsampling is done without replacement: each subsample of size b is an actual sample from the true distribution F, so the subsample statistics follow the true sampling distribution of the statistic (for sample size b). Bootstrap sampling is done with replacement, yielding samples from the empirical distribution $\hat{F}_n$, which is not the same thing as F. As Geyer puts it:

Each of these procedures does the Wrong Thing.

  • The Right Thing is samples of the right size n from the right distribution F.
  • The Politis and Romano thing is samples of the wrong size b ≪ n from the right distribution F.
  • The Efron thing is samples of the right size n from the wrong distribution $\hat{F}_n$.
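
To make the contrast concrete, here is a minimal numpy sketch (my own illustration, not Geyer's; the standard-normal F and the sizes n and b are arbitrary):

import numpy as np

n, b = 100, 20
x = np.random.standard_normal(n)                    # the one observed sample, drawn from F

fresh     = np.random.standard_normal(n)            # the Right Thing: a fresh size-n sample from F (only possible in a simulation)
subsample = np.random.choice(x, b, replace=False)   # the PRW thing: a genuine (small) sample from F
resample  = np.random.choice(x, n, replace=True)    # the Efron thing: a size-n sample from F_hat_n, not from F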

The distinction becomes pronounced for statistics that are not locally smooth. The Extreme Value example below is a simple case where the bootstrap distribution looks nothing like the sampling distribution of the sample max.

Two kinds of sub-sampling are considered in these tutorials:

  1. moving blocks: consecutive blocks of length b. There are $(n - b + 1)$ such blocks available. This preserves dependence in the original series up to lag b.
  2. IID subsampling: individual observations drawn without replacement to build subsamples of length b; assumes the original observations are IID. (A small sketch of both schemes follows this list.)
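
And a minimal sketch of the two schemes (again my own illustration; the toy series and block length are arbitrary):

import numpy as np

x = np.arange(12)    # stand-in series; the values make the block structure easy to see
b = 4

# 1. moving blocks: all n - b + 1 consecutive windows of length b (dependence preserved within each block)
blocks = [x[i:i + b] for i in range(len(x) - b + 1)]

# 2. IID subsampling: a random size-b draw without replacement (ordering and dependence ignored)
iid_subsample = np.random.choice(x, b, replace=False)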

Bootstrapping: Hormone Example (Efron and Tibshirani 8.5 and 8.6)

An R implementation of a subsampling approach (the moving blocks version) to ET's 8.5 and 8.6 hormone example is provided in the overview. We'll replicate it in this IPython notebook, first in R using the Python-to-R bridge (rpy2), then in Python using numpy/scipy. (You must have rpy2 installed for the former but not the latter.)


In [1]:
%load_ext rmagic

In [2]:
%%R
library(bootstrap)
print(lutenhorm)
summary(lutenhorm)

y <- lutenhorm[ , 4]
plot(y)
acf(y)
n <- length(y)

foo <- function(w) {
    z <- w - mean(w)
    m <- length(w)
    out <- lm(z[-1] ~ z[-m] + 0)
    as.numeric(coefficients(out))
}

print(beta.hat <- foo(y))

b <- 8
nboot <- n - b + 1
beta.star <- double(nboot)
for (i in 1:nboot) {
    y.star <- y[seq(i, i + b - 1)]
    beta.star[i] <- foo(y.star)
}

plot(beta.star)
abline(h = beta.hat, lty = 2)
hist(beta.star)
abline(v = beta.hat, lty = 2)
sd(beta.star) * sqrt(b / n)


   V1  V2  V3  V4  V5
1   1 2.2 5.5 2.4 4.3
2   2 2.2 4.5 2.4 4.6
3   3 2.3 5.1 2.4 4.7
4   4 2.0 5.5 2.2 4.1
5   5 1.6 5.7 2.1 4.1
6   6 1.4 5.1 1.5 5.2
7   7 1.8 4.3 2.3 5.0
8   8 2.2 4.8 2.3 4.4
9   9 2.9 5.6 2.5 4.2
10 10 2.6 5.9 2.0 5.1
11 11 2.4 6.0 1.9 5.1
12 12 2.1 5.1 1.7 4.7
13 13 3.0 5.2 2.2 4.4
14 14 2.5 4.4 1.8 3.9
15 15 2.7 5.5 3.2 5.4
16 16 2.2 5.4 3.2 5.9
17 17 2.4 4.1 2.7 4.2
18 18 2.7 4.4 2.2 4.1
19 19 3.1 4.7 2.2 4.1
20 20 2.5 4.6 1.9 3.6
21 21 2.4 6.0 1.9 3.1
22 22 2.3 5.6 1.8 4.8
23 23 2.4 5.1 2.7 5.1
24 24 1.9 4.7 3.0 5.1
25 25 3.3 4.8 2.3 4.5
26 26 3.8 5.5 2.0 4.6
27 27 3.7 5.1 2.0 5.8
28 28 3.5 5.2 2.9 5.0
29 29 3.1 5.0 2.9 5.1
30 30 2.7 4.0 2.7 4.5
31 31 4.1 3.7 2.7 4.2
32 32 4.0 4.8 2.3 6.0
33 33 3.4 5.9 2.6 5.6
34 34 3.2 5.5 2.4 5.4
35 35 3.7 4.9 1.8 5.0
36 36 3.6 4.4 1.7 4.4
37 37 4.1 4.7 1.5 4.6
38 38 2.0 4.2 1.4 5.7
39 39 4.6 5.5 2.1 5.2
40 40 4.1 4.9 3.3 5.0
41 41 3.2 4.8 3.5 4.4
42 42 2.9 4.5 3.5 5.7
43 43 2.7 4.9 3.1 5.7
44 44 3.0 4.9 2.6 4.8
45 45 0.0 4.5 2.1 3.4
46 46 0.0 4.2 3.4 5.5
47 47 0.0 4.9 3.0 5.5
48 48 0.0 5.9 2.9 5.6
[1] 0.5857651

(If the above output cell does not include a table of 48x5 values followed by 4 graphs, something is amiss in your configuration.)

Python version of Hormone Example

Here are three versions of the hormone example (subsampling as above, plus the bootstrapped-disturbances and moving-blocks methods from ET) in Python, using Pandas DataFrames, scipy and numpy for vectorized numerics, and matplotlib for graphics. This and the other examples assume the notebook server was started with the --pylab=inline flag, so that the numpy/scipy/matplotlib modules are pre-imported and configured and graphical results are displayed within the page. I am also using a custom matplotlibrc file to get a ggplot color scheme, but things are fine without it.

We'll start by grabbing the hormone data from the rpy2 namespace (it was loaded by the previous example). You could do this with the rmagic %Rpull lutenhorm, but that would bring the data across as a nested list of Python lists. Instead, use Pandas' load_data() function (from pandas.rpy.common) to bring it across as DataFrame or Series objects built on fast ndarrays:


In [3]:
from pandas.rpy.common import load_data
lutenhorm = load_data('lutenhorm')

# To do this without R, load from a local csv file instead:
# from pandas import read_csv
# lutenhorm = read_csv("data/lutenhorm.csv")

print lutenhorm
print type(lutenhorm)


    V1   V2   V3   V4   V5
1    1  2.2  5.5  2.4  4.3
2    2  2.2  4.5  2.4  4.6
3    3  2.3  5.1  2.4  4.7
4    4  2.0  5.5  2.2  4.1
5    5  1.6  5.7  2.1  4.1
6    6  1.4  5.1  1.5  5.2
7    7  1.8  4.3  2.3  5.0
8    8  2.2  4.8  2.3  4.4
9    9  2.9  5.6  2.5  4.2
10  10  2.6  5.9  2.0  5.1
11  11  2.4  6.0  1.9  5.1
12  12  2.1  5.1  1.7  4.7
13  13  3.0  5.2  2.2  4.4
14  14  2.5  4.4  1.8  3.9
15  15  2.7  5.5  3.2  5.4
16  16  2.2  5.4  3.2  5.9
17  17  2.4  4.1  2.7  4.2
18  18  2.7  4.4  2.2  4.1
19  19  3.1  4.7  2.2  4.1
20  20  2.5  4.6  1.9  3.6
21  21  2.4  6.0  1.9  3.1
22  22  2.3  5.6  1.8  4.8
23  23  2.4  5.1  2.7  5.1
24  24  1.9  4.7  3.0  5.1
25  25  3.3  4.8  2.3  4.5
26  26  3.8  5.5  2.0  4.6
27  27  3.7  5.1  2.0  5.8
28  28  3.5  5.2  2.9  5.0
29  29  3.1  5.0  2.9  5.1
30  30  2.7  4.0  2.7  4.5
31  31  4.1  3.7  2.7  4.2
32  32  4.0  4.8  2.3  6.0
33  33  3.4  5.9  2.6  5.6
34  34  3.2  5.5  2.4  5.4
35  35  3.7  4.9  1.8  5.0
36  36  3.6  4.4  1.7  4.4
37  37  4.1  4.7  1.5  4.6
38  38  2.0  4.2  1.4  5.7
39  39  4.6  5.5  2.1  5.2
40  40  4.1  4.9  3.3  5.0
41  41  3.2  4.8  3.5  4.4
42  42  2.9  4.5  3.5  5.7
43  43  2.7  4.9  3.1  5.7
44  44  3.0  4.9  2.6  4.8
45  45  0.0  4.5  2.1  3.4
46  46  0.0  4.2  3.4  5.5
47  47  0.0  4.9  3.0  5.5
48  48  0.0  5.9  2.9  5.6
<class 'pandas.core.frame.DataFrame'>

In [4]:
from statsmodels.graphics.tsaplots import plot_acf
y = lutenhorm.V4
print type(y)
scatter(y.index, y)
plot_acf(y, alpha=.05, lags=20);


<class 'pandas.core.series.Series'>

The statistic of interest is the slope parameter $\beta$ of an AR(1) model, $z_i = \beta z_{i-1} + \epsilon_i$, fit to the centered observations $z_i$. The least-squares estimate is $\hat{\beta} = \sum_i z_{i-1} z_i / \sum_i z_{i-1}^2$, which we compute from the full sample:


In [5]:
def beta_lse(z, demean=False):
    """estimate the AR(1) beta as the LSE slope relating centered observations"""
    if demean:
        z = z - mean(z)
    zlag = z[:-1]
    return np.dot(zlag/np.dot(zlag, zlag), z[1:])

beta_hat = beta_lse(y, demean=True)
print beta_hat


0.585765124555

In the following examples, this convenience function plots a bootstrap distribution histogram along with bootstrap confidence intervals (based on the realized distribution) and normal 95% confidence intervals about the sample statistic.


In [6]:
def plot_bootstrap(theta_star, theta_hat, blockratio=1.0, bins=10, title="bootstrapped values"):
    """ 
    given a vector of bootstrap realizations of a statistic, 
    show a histogram with normal and bootstrap confidence intervals. 

    """
    figure()
    scatter(arange(len(theta_star)), theta_star)
    plt.title(title)
    figure()
    b_star = np.sort(theta_star[~np.isnan(theta_star)]) # get rid of nans while we sort
    se = np.std(b_star) * np.sqrt(blockratio)
    ci = [b_star[int(len(b_star)*.025)], b_star[int(len(b_star)*.975)]] # bootstrap 95% CI based on empirical percentiles
    print "theta_hat = %.4f, E(theta_star)=%.4f, se(theta_hat) = %.4f, 95%% CI = [%.4f, %.4f]" % (
            theta_hat, mean(b_star), se, ci[0], ci[1])
    hist(b_star, bins=bins)
    plt.title("counts for %s" % title)
    xlabel("value")
    ylabel("count")
    axvline(x=theta_hat, linestyle='-', color='k')
    axvline(x=mean(b_star), linestyle='-', color='r')
    axvline(x=theta_hat+1.96*se, linestyle='--', color='b') # standard normal 95% CI
    axvline(x=theta_hat-1.96*se, linestyle='--', color='b')
    axvline(x=ci[0], linestyle='--', color='r')
    axvline(x=ci[1], linestyle='--', color='r')

Subsampling Bootstrap

The tutorial begins with a subsampling approach to the hormone data. This takes each successive b-length block as a subsample. There are few enough of these that the complete set can be used in computing $\hat{\beta}^*$, so no random subsampling is required:


In [7]:
b = 8 # block length
b_star = np.array([beta_lse(y[i:i+b], demean=True) for i in range(len(y)-b+1)])
plot_bootstrap(b_star, beta_hat, blockratio=float(b)/len(y), 
               title="complete subsampling with recentering, b=%d" % b)
z_star = np.sqrt(b) * (b_star - beta_hat) 
plot_bootstrap(z_star, 0, blockratio=float(b)/len(y), 
               title="root statistic distribution, b=%d" % b)


theta_hat = 0.5858, E(theta_star)=0.3642, se(theta_hat) = 0.1153, 95% CI = [-0.0960, 0.8645]
theta_hat = 0.0000, E(theta_star)=-0.6267, se(theta_hat) = 0.3262, 95% CI = [-1.9285, 0.7882]

In the histogram above of $\hat{\beta}^*$ values, the original sample statistic is a solid black line and the mean of the bootstrapped statistics a solid red line; the estimate is biased low. Normal 95% confidence intervals about the estimate are dashed blue lines, and bootstrap percentile intervals are dashed red lines. When the distribution of $\hat{\beta}^*$ is not normal, these intervals will differ, with the percentile interval capturing the central 95% of the bootstrap distribution.

This code (like the R code above) is not quite in agreement with the problem statement: both re-center each block when fitting, which is not compatible with a single AR(1) model for the entire series with the mean as a given parameter value. But the standard error is not too bad with a block size of 8, because of stationarity: subsequences of the series are expected to have the same mean and variance as the whole. At smaller block sizes the estimates become biased severely low, because of the error in estimating the global mean from such a small sample; the sketch below illustrates the effect. (If this is a live notebook, you can try different values of b by editing the code above and pressing ctrl-Enter to re-run the estimate.)
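
To see the effect, here is a small variant (my own, with an arbitrarily small block size) comparing per-block recentering against centering once with the full-sample mean:

b_small = 4
z_global = y - mean(y)   # center once, using the full-sample mean
b_star_local = np.array([beta_lse(y[i:i+b_small], demean=True)            # each block recentered
                         for i in range(len(y) - b_small + 1)])
b_star_global = np.array([beta_lse(z_global[i:i+b_small], demean=False)   # global centering only
                          for i in range(len(y) - b_small + 1)])
print "mean beta*: per-block centering %.3f, global centering %.3f (beta_hat = %.3f)" % (
        b_star_local.mean(), b_star_global.mean(), beta_hat)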

Moving Blocks Bootstrap

The moving-blocks bootstrap procedure described in ET is quite different from the subsampling approach: for each bootstrap sample, randomly select blocks (with replacement) and assemble them into a length-n time series, then compute $\hat{\beta}^*$ from each such length-n series.


In [8]:
B = 200
b = 3 # divides evenly into length(z). See text for scaling factor when not even.
beta_star = np.empty(B)
z = y - mean(y)
z_star = np.empty(len(z))
for boot_i in range(B):
    for block_i, start in enumerate(np.random.randint(len(z) - b + 1, size=len(z)/b)):
        z_star[block_i*b:(block_i+1)*b] = z[start:start+b]
    beta_star[boot_i] = beta_lse(z_star)
plot_bootstrap(beta_star, beta_hat, title="%d moving-block bootstraps, b=%d" % (B, b))


theta_hat = 0.5858, E(theta_star)=0.3815, se(theta_hat) = 0.1285, 95% CI = [0.1334, 0.6156]

Bootstrapped z's

ET's first hormone approach is model-based rather than block-based: it focuses on the random disturbances driving the AR(1) model. The disturbances $\epsilon_i$ come from an unknown distribution, estimated here by the empirical distribution of the fitted residuals. This estimated distribution is then sampled to generate new series having the same statistical structure as the original sample.

Using the estimated $\hat{\beta}$ from the sample, the estimated disturbances are $\hat{\epsilon}_i = z_i - \hat{\beta} z_{i-1}$:


In [9]:
def eps_beta(z, beta):
    """estimate AR(1) disturbances from centered observations z given beta"""
    return (z - z.shift(1) * beta).values[1:]

eps_hat = eps_beta(z, beta_hat)
hist(eps_hat)


Out[9]:
(array([  2.,  10.,  10.,   7.,   6.,   3.,   4.,   2.,   0.,   3.]),
 array([-0.72427046, -0.53427046, -0.34427046, -0.15427046,  0.03572954,
        0.22572954,  0.41572954,  0.60572954,  0.79572954,  0.98572954,
        1.17572954]),
 <a list of 10 Patch objects>)

Using the estimated disturbances $\hat{\epsilon}$, bootstrapped $z^*$'s are computed sequentially, $z^*_i = \hat{\beta} z^*_{i-1} + \epsilon^*_i$, where each $\epsilon^*_i$ is drawn with replacement from $\hat{\epsilon}$:


In [10]:
def bootstrap_z(z0, beta_hat, eps_hat):
    """
    return a z vector simulated using bootstrapped 
    disturbances in an AR(1) model, given an estimated probability model 
    (beta_hat, eps_hat).
    
    beta_hat is the estimated slope term of the AR(1) model and eps_hat is 
    the empirical distribution of disturbances, represented as a vector of
    equally likely estimated eps values.  We also need the constant z0
    to initialize the AR(1) simulation.

    """
    z_star = np.array([z0]*(len(eps_hat)+1))
    for i in range(len(eps_hat)):
        z_star[i+1] = (beta_hat * z_star[i]) + np.random.choice(eps_hat)
    return z_star[1:] # leave out z0, it's not bootstrappy
        
# continuing with z, beta_hat, and eps_hat from above
B = 200 # number of bootstrap simulations
b_star = np.array([beta_lse(bootstrap_z(z.iloc[0], beta_hat, eps_hat))
                   for i in range(B)])
plot_bootstrap(b_star, beta_hat, title="%d bootstrapped values" % len(b_star))


theta_hat = 0.5858, E(theta_star)=0.5668, se(theta_hat) = 0.1207, 95% CI = [0.3185, 0.7637]

For comparison, here is complete subsampling of the globally centered series z using length-3 moving blocks, with no per-block recentering:


In [11]:
b = 3 # block length
b_star = np.array([beta_lse(z[i:i+b], demean=False) for i in range(len(z)-b+1)])
plot_bootstrap(b_star, beta_hat, blockratio=float(b)/len(z), 
               title="complete moving-block bootstrap, b=%d" % b)


theta_hat = 0.5858, E(theta_star)=0.4079, se(theta_hat) = 0.2298, 95% CI = [-2.5000, 1.7500]

The estimated standard errors for b=3 (and b=5) agree with those reported in the text, and the bias is still lousy.

Long Timeseries example


In [12]:
%%R
x = read.table("data/lts.csv", header=TRUE)$x
foo <- function(w) ar.burg(w, aic = FALSE, order.max = 1)$ar
print(beta.hat <- foo(x))

n <- length(x)
b <- 2500
spc <- 100
ii <- seq(1, n - b + 1, spc)
nboot <- length(ii)
beta.star <- double(nboot)
for (i in 1:nboot) {
    x.star <- x[seq(ii[i], ii[i] + b - 1)]
    beta.star[i] <- foo(x.star)
}

alpha <- 0.05
z.star <- sqrt(b) * (beta.star - beta.hat)
crit.val <- quantile(z.star,
    probs = c(1 - alpha / 2, alpha / 2))
beta.hat - crit.val / sqrt(n)

# histogram from which critical values are derived
hist(z.star)
abline(v = crit.val, lty = 2)

# for comparison, interval using 1.96 bootstrap s. e.
beta.hat + c(-1, 1) * qnorm(0.975) * sd(beta.star) *
    sqrt(b / n)

cat("Calculation took", proc.time()[1], "seconds\n")


[1] 0.99072
Calculation took 5.412 seconds

Python version of Long Timeseries Example

As in the tutorial, we use the Burg method to estimate the AR(1) $\beta$ parameter. This isn't necessary for an AR(1) like this example, but it is generally superior for higher-order AR(p) models, particularly when the covariance matrix would be ill-conditioned. An implementation is available in the spectrum module, which can be installed with the unix command easy_install -U spectrum. (Warning: it is slow, and uses the opposite sign convention from R's ar.burg.)


In [13]:
from pandas import read_csv
from spectrum import arburg

def beta_burg(z, demean=True):
    """estimate the AR(1) beta using the Burg method"""
    if demean:
        z = z - np.mean(z)
    return -np.real(arburg(z, 1)[0][0])
    
x = read_csv("data/lts.csv").x.values
n = len(x)
b = 2500 # block length
spc = 100
beta_hat = beta_burg(x)
print beta_hat
b_star = np.array([beta_burg(x[i:i+b]) for i in range(0, n-b+1, spc)])
z_star = np.sqrt(b) * (b_star - beta_hat) 
plot_bootstrap(z_star, 0, blockratio=float(b)/n, 
               title="root statistic distribution, b=%d" % b)


0.990719999947
theta_hat = 0.0000, E(theta_star)=-0.0971, se(theta_hat) = 0.0249, 95% CI = [-0.4131, 0.1349]

Extreme Values (Efron and Tibshirani 7.4)

Here the statistic of interest is the sample maximum $\hat{\theta}$ of an IID sample from the uniform distribution on $[0, \theta]$. It can be shown that $n(\theta - \hat{\theta})$ converges in distribution to an exponential with mean $\theta$: $P(n(\theta - \hat{\theta}) > t) = (1 - t/(n\theta))^n \to e^{-t/\theta}$. The subsample and bootstrap computations below examine how well this exponential limit is reproduced.
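
As a quick self-contained Monte Carlo check of that limit (my addition; $\theta$, the sample size and the number of replications are arbitrary):

theta, n_sim, reps = 1.0, 1000, 5000
maxima = (theta * np.random.rand(reps, n_sim)).max(axis=1)   # sample max of each simulated uniform sample
scaled = n_sim * (theta - maxima)                            # n * (theta - max)
print np.mean(scaled), np.std(scaled)   # both should be close to theta = 1 for an Exp(mean theta) limit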

R code for the example (Geyer):


In [14]:
%%R
X <- read.table("http://www.stat.umn.edu/geyer/5601/mydata/expo.txt", header=T) 
x = X$x
n <- length(x)
print(theta.hat <- max(x))

nboot <- 1000
b <- 50
theta.star <- double(nboot)
theta.bogo <- double(nboot)
for (i in 1:nboot) {
    theta.star[i] <- max(sample(x, b, replace = FALSE))
    theta.bogo[i] <- max(sample(x, replace = TRUE))
}

z.star <- b * (theta.hat - theta.star)
plot(qexp(ppoints(nboot), 1 / theta.hat), sort(z.star),
    xlab = "Quantiles of Exp(1 / theta.hat) distribution")
z.bogo <- n * (theta.hat - theta.bogo)
points(qexp(ppoints(nboot), 1 / theta.hat), sort(z.bogo),
    pch="B", col=2)
abline(0, 1)


[1] 3.1415

Python version of Extreme Value example

First, the subsampling approach: take b-length samples without replacement from the n observations, then take the max of each. The scaled statistic $b(\hat{\theta} - \hat{\theta}^*)$ is compared to the exponential (mean $\hat{\theta}$) limit in a Q-Q plot.


In [15]:
from pandas import read_csv
from scipy.stats.distributions import expon
from statsmodels.graphics.gofplots import qqplot
df = read_csv("data/expo.csv")
x = df.x.values
n = len(x)
theta_hat = max(x)
print theta_hat
nboot = 1000
b = 50
theta_star_sub = np.max([np.random.choice(x, b, replace=False) for i in range(nboot)], axis=1)
z_star_sub = b * (theta_hat - theta_star_sub)
qqplot(z_star_sub, expon, scale=theta_hat, line='45');


3.1415

Now, bootstrapping n-length samples with replacement. The distribution is quite different because there is a $1 - (1 - 1/n)^n \approx 1 - \frac{1}{e} \approx 63\%$ chance that any given bootstrap sample contains the actual sample max $\hat{\theta}$ (and similarly for any other particular value from the full sample). The bootstrap distribution of $n(\hat{\theta} - \hat{\theta}^*)$ therefore has an atom of probability $\approx 0.632$ at zero, and it does not converge to the sampling distribution as n increases; a quick numerical check follows the next cell.


In [16]:
theta_star_boot = np.max([np.random.choice(x, n, replace=True) for i in range(nboot)], axis=1)
z_star_boot = n * (theta_hat - theta_star_boot)
qqplot(z_star_boot, expon, scale=theta_hat, line='45');
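
A quick numerical check of that 63% figure, using theta_star_boot and theta_hat from the cell above (my addition):

# fraction of bootstrap replicates whose max equals the sample max; should be near 1 - 1/e
print np.mean(theta_star_boot == theta_hat), 1 - 1 / np.e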


This is seen more clearly in a histogram:


In [17]:
hist(theta_star_boot)


Out[17]:
(array([   2.,    0.,    0.,    5.,    9.,   37.,    0.,    0.,    0.,  947.]),
 array([ 3.1128 ,  3.11567,  3.11854,  3.12141,  3.12428,  3.12715,
        3.13002,  3.13289,  3.13576,  3.13863,  3.1415 ]),
 <a list of 10 Patch objects>)

PRW point out (section 1.3) that in instances where the simple bootstrap fails, consistency can often be restored by reducing the bootstrap sample size, foreshadowing their subsampling results. In this example, the code is the same as for subsampling, but done with replacement:


In [18]:
nboot = 1000
b = 50
theta_star_sub = np.max([np.random.choice(x, b, replace=True) for i in range(nboot)], axis=1)
z_star_sub = b * (theta_hat - theta_star_sub)
qqplot(z_star_sub, expon, scale=theta_hat, line='45');


Estimating the Rate

One difficulty with subsampling is the need to know the convergence rate of the statistic, so that the distribution of a b-length subsample statistic can be rescaled to be comparable with that of the full n-length sample statistic (needed for constructing confidence intervals from subsample estimates). PRW show how to use subsampling itself to estimate the rate: if the rate is $b^\beta$, the spread of $\hat{\theta}_b - \hat{\theta}_n$ scales like $b^{-\beta}$, so regressing the log of a quantile range of the subsample distribution on $\log b$ across several subsample sizes estimates $-\beta$. Geyer continues with the example of the max order statistic over the $[0, \theta]$ interval, for which the convergence rate is known to be $n$ (that is, $\beta = 1$, rather than the typical $\sqrt{n}$ law), constructing correlated (nested) subsamples of several sizes and fitting the rate using the method from Chapter 8 of PRW.


In [19]:
%%R
X <- read.table(url("http://www.stat.umn.edu/geyer/5601/mydata/big-unif.txt"), header = TRUE)
x = X$x

print(n <- length(x))
print(theta.hat <- max(x))

nboot <- 2e4 - 1
b <- c(40, 60, 90, 135)
b <- sort(b)
theta.star <- matrix(NA, nboot, length(b))
for (i in 1:nboot) {
    x.star <- x
    for (j in length(b):1) {
        x.star <- sample(x.star, b[j], replace = FALSE)
        theta.star[i, j] <- max(x.star)
    }
}
zlist <- list()
for (i in 1:length(b)) {
    zlist[[i]] <- theta.star[ , i] - theta.hat
}
names(zlist) <- b
boxplot(zlist, xlab = "subsample size",
    ylab = expression(hat(theta)[b] - hat(theta)[n]))

qlist <- list()
k <- (nboot + 1) * seq(0.05, 0.45, 0.05)
l <- (nboot + 1) * seq(0.55, 0.95, 0.05)
for (i in 1:length(b)) {
    z.star <- zlist[[i]]
    sz.star <- sort(z.star, partial = c(k, l))
    qlist[[i]] <- sz.star[l] - sz.star[k]
}
names(qlist) <- b

lqlist <- lapply(qlist, log)
stripchart(lqlist, xlab = "subsample size",
    ylab = "log(high quantile - low quantile)",
    vertical = TRUE)

y <- sapply(lqlist, mean)
print(beta <- cov(- y, log(b)) / var(log(b)))
inter <- mean(y) + beta * mean(log(b))
plot(log(rep(b, each = length(k))), unlist(lqlist),
  xlab = "log(subsample size)", ylab = "log(high quantile - low quantile)")
abline(inter, - beta)

# confidence interval calculation
m <- 3
b <- b[m]
theta.star <- theta.star[ , m]
alpha <- 0.05
z.star <- b^beta * (theta.star - theta.hat)

# two-sided interval
crit.val <- sort(z.star)[(nboot + 1) * c(1 - alpha / 2, alpha / 2)]
theta.hat - crit.val / n^beta

# histogram from which critical values are derived
hist(z.star)
abline(v = crit.val, lty = 2)

# one-sided confidence interval actually makes more sense
# since we know theta.hat < theta
crit.val <- sort(z.star)[(nboot + 1) * alpha]
theta.hat - c(0, crit.val) / n^beta

cat("Calculation took", proc.time()[1], "seconds\n")


[1] 10000
[1] 2.717583
[1] 0.8805016
Calculation took 31.605 seconds

Python version of rate estimation


In [20]:
x = read_csv("http://www.stat.umn.edu/geyer/5601/mydata/big-unif.txt").x.values
n = len(x)
theta_hat = max(x)
print "theta_hat= %f" % theta_hat

nboot = int(2e4 - 1)
bs = np.array((135, 90, 60, 40))
theta_star = np.empty((nboot, len(bs)))
for i in range(nboot):
    x_star = x
    for j, b in enumerate(bs):
        x_star = np.random.choice(x_star, b, replace = False)
        theta_star[i, j] = max(x_star)
print theta_star
zs = theta_star - theta_hat
boxplot(zs, positions=arange(len(bs)));
xlabel("subsample size")
xticks(arange(len(bs)), bs)


theta_hat= 2.717583
[[ 2.7160474   2.7160474   2.69963314  2.69963314]
 [ 2.69933923  2.69933923  2.69933923  2.66839546]
 [ 2.71426336  2.71426336  2.71426336  2.67169819]
 ..., 
 [ 2.68393348  2.66152502  2.66152502  2.66152502]
 [ 2.71562287  2.65969069  2.65969069  2.48717823]
 [ 2.70318573  2.70318573  2.66981915  2.54924967]]
Out[20]:
([<matplotlib.axis.XTick at 0x10c534c10>,
  <matplotlib.axis.XTick at 0x10c39e410>,
  <matplotlib.axis.XTick at 0x10e895f10>,
  <matplotlib.axis.XTick at 0x10e899a10>],
 <a list of 4 Text xticklabel objects>)
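
The quantile-range regression and the confidence-interval step have not been ported yet; here is a sketch mirroring the R code above, reusing bs, theta_star, theta_hat, nboot and n from the preceding cell (the b = 90 column is used for the interval, matching the R code's m = 3):

# log of (high quantile - low quantile) for each subsample size, as in the R qlist/lqlist
zs = theta_star - theta_hat
probs_lo = np.arange(5, 50, 5)     # 5%, 10%, ..., 45%
probs_hi = np.arange(55, 100, 5)   # 55%, 60%, ..., 95%
mean_log_qrange = np.array(
    [np.mean(np.log(np.percentile(zs[:, j], probs_hi) - np.percentile(zs[:, j], probs_lo)))
     for j in range(len(bs))])

# slope of -mean_log_qrange against log(b) estimates the rate exponent
logb = np.log(bs)
beta_rate = -np.cov(mean_log_qrange, logb)[0, 1] / np.var(logb, ddof=1)
print "estimated rate exponent: %.4f (the true exponent here is 1)" % beta_rate

# two-sided 95% interval for theta based on the b = 90 subsamples
m = int(np.where(bs == 90)[0][0])
alpha = 0.05
z_star = bs[m] ** beta_rate * (theta_star[:, m] - theta_hat)
z_sorted = np.sort(z_star)
crit_hi = z_sorted[int((nboot + 1) * (1 - alpha / 2)) - 1]
crit_lo = z_sorted[int((nboot + 1) * (alpha / 2)) - 1]
print "95%% CI for theta: [%.4f, %.4f]" % (theta_hat - crit_hi / n ** beta_rate,
                                           theta_hat - crit_lo / n ** beta_rate)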

(... work in progress ...)