Fabrizzio Sanchez suggested bootstrapping a statistical model as a non-trivial task on which to demonstrate the speed of Julia.
In this notebook I will demonstrate a parametric bootstrap: response vectors are simulated from the probability model with the parameters set at their estimates from the original fit, and the model is refit to each simulated response.
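In outline, the procedure is just a loop (a schematic sketch; fitmodel and simulate stand in for whatever fitting and simulation routines the model provides and are not functions from any package):

function parboot(fitmodel::Function, simulate::Function, nsim::Integer)
    estimates = Any[]
    for i in 1:nsim
        push!(estimates, fitmodel(simulate()))  # refit to a response drawn from the fitted model
    end
    estimates
end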
A simple example is a linear model, which can be fit using the GLM package.
In [1]:
using GLM, RDatasets
ls = dataset("datasets","LifeCycleSavings")
Out[1]:
In [2]:
m1 = lm(SR ~ Pop15 + Pop75 + DPI + DDPI, ls)
Out[2]:
We will take the fitted values and the estimate of the scale parameter from this model as the mean and scale parameter of a multivariate normal distribution from which to simulate new response vectors.
Getting the fitted values turned out to be more complicated than I thought it would be because I hadn't written the appropriate method. For now we will use the somewhat mysterious extractor m1.model.rr.mu to get them.
We will use the IsoNormal type of multivariate normal distribution from the Distributions package for simulating the responses.
In [6]:
fitted(m1)
In [3]:
using Distributions
d = IsoNormal(m1.model.rr.mu, scale(m1.model))
A simple method of obtaining the coefficients from fitting the model matrix to a new simulated response is the lm method for a model matrix and a response.
In [4]:
X = copy(m1.model.pp.X) # copy the X matrix from the predictor field, pp
Out[4]:
In [5]:
srand(1234321)
lm(X,rand(d))
Out[5]:
Suppose that we want the coefficient estimates, $\widehat{\beta}$, and the scale parameter estimate, $\widehat{\sigma}$, from, say, 10000 simulated responses for this model.
For a single fit we can extract both of these and concatenate them as a vector.
In [6]:
sm1 = lm(X,rand(d))
[coef(sm1), scale(sm1)]
Out[6]:
Let's express this as a function.
In [7]:
samp1(X::Matrix,d) = (sm = lm(X,rand(d)); [coef(sm), scale(sm)])
Out[7]:
In [8]:
samp1(X,d)
Out[8]:
The simplest way to create a parametric bootstrap sample of these parameter estimates is with a comprehension.
In [9]:
srand(1234321)
bss = [samp1(X,d) for i in 1:10000]
Out[9]:
In [10]:
@time [samp1(X,d) for i in 1:10000];
As we see, this produces the sample in the form of an array of vectors, as opposed to, say, a matrix. Furthermore, this process is not blazingly fast: it allocates a lot of memory and spends about 1/6 of the time in garbage collection.
What is happening here is that all the work of creating the model object and factoring the model matrix is (unnecessarily) being done for every sample.
As indicated by the rather long name of the object type for the fitted models, a dense QR decomposition is used to obtain the coefficients.
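To get a rough sense of just the refactorization cost being repeated on every sample (a sketch; timings will vary):

@time for i in 1:10_000; qrfact(X); end   # redecomposing the 50×5 model matrix 10,000 times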
In [11]:
qrd = m1.model.pp.qr;
typeof(qrd)
Out[11]:
In [12]:
srand(1234321)
y1 = rand(d)
β = qrd\y1
Out[12]:
We see that we get the same coefficient estimates using the pre-computed QR decomposition of the model matrix. To get $\widehat{\sigma}$ we could use these coefficient estimates to generate the fitted values and residuals, then accumulate the sum of squared residuals. However, there is a faster way using the vector $\bf Q^\prime y$. The coefficient estimates satisfy $\bf R\widehat{\beta}=Q_1^\prime y$ and the residual sum of squares is $\|\bf Q_2^\prime y\|^2$, where $\bf Q_1$ is the first $p$ columns of the $n\times n$ matrix $\bf Q$ and $\bf Q_2$ is the last $n-p$ columns.
This method would not be practical for large $n$ if we needed to generate and store $\bf Q$ explicitly. But we don't.
In [13]:
Q = qrd[:Q]
typeof(Q)
Out[13]:
In [14]:
R = Triangular(qrd[:R],:U)
Out[14]:
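Putting the formulas together by hand first (a quick sketch using qrd, Q, R, and y1 from the cells above; note that ordinary indexing such as qty[1:p] makes copies, which the function below avoids with sub):

n, p = size(qrd)
qty = Q'y1                                         # Q'y, computed without forming Q explicitly
betahat = R\qty[1:p]                               # solve R β̂ = Q₁'y
sigmahat = sqrt(Base.sumabs2(qty[(p+1):n])/(n-p))  # σ̂ from ‖Q₂'y‖²/(n-p)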
We can create both $\widehat{\sigma}$ and $\widehat{\beta}$ in a simple function.
In [15]:
function betasigmahat(qrd,y::Vector)
    n,p = size(qrd)
    length(y) == n || throw(DimensionMismatch(""))
    qty = qrd[:Q]'y
    sigmahat = sqrt(Base.sumabs2(sub(qty,(p+1):n))/(n-p))
    Triangular(qrd[:R],:U)\sub(qty,1:p), sigmahat
end
Out[15]:
In [16]:
betasigmahat(qrd,y1)
Out[16]:
We can wrap all this in a function to create the bootstrap sample.
In [17]:
function boot(m::LinearModel,nsim::Integer)
    qrd = m.pp.qr                        # the QR decomposition computed in the original fit
    qmat = qrd[:Q]
    rmat = Triangular(qrd[:R],:U)
    n,p = size(qrd)
    df = float(n-p)
    d = IsoNormal(m.rr.mu, scale(m))     # simulation distribution from the fitted model
    betas = Array(Float64,(size(qrd,2),nsim))
    sigmas = Array(Float64,nsim)
    for j in 1:nsim
        qty = qmat'rand(d)                                   # Q'y for a simulated response y
        betas[:,j] = rmat\sub(qty,1:p)                       # solve R β̂ = Q₁'y
        sigmas[j] = sqrt(Base.sumabs2(sub(qty,(p+1):n))/df)  # σ̂ from ‖Q₂'y‖²
    end
    betas, sigmas
end
Out[17]:
In [18]:
srand(1234321)
boot(m1.model,1) # check that the results are calculated properly
Out[18]:
In [19]:
srand(1234321)
boot(m1.model,10000)
Out[19]:
In [20]:
gc();@time boot(m1.model,10000);
We are doing much better in the boot function, except that we are still allocating a lot of memory. We can avoid much of this by using mutating functions. For example, each call to rand(d) allocates a vector of length n which eventually must be garbage collected. This is not a terrible burden when $n=50$, but when $n=100000$ you don't want this wasted allocation occurring nsim times. The mutating version of rand avoids this.
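For example (a minimal sketch reusing d from above; rand!(d, y) overwrites y with a fresh draw instead of allocating a new vector):

y = Array(Float64, length(m1.model.rr.mu))  # allocate the response buffer once
rand!(d, y)                                 # refill it in place for each simulation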
In [21]:
function boot1(m::LinearModel,nsim::Integer)
    qrd = m.pp.qr
    qmat = qrd[:Q]
    rmat = Triangular(qrd[:R],:U)
    n,p = size(qrd)
    df = float(n-p)
    d = IsoNormal(m.rr.mu, scale(m))
    betas = Array(Float64,(size(qrd,2),nsim))
    sigmas = Array(Float64,nsim)
    y = Array(Float64,n)                 # reusable response buffer
    for j in 1:nsim
        qty = qmat'rand!(d,y)            # overwrite y in place, then form Q'y
        betas[:,j] = rmat\sub(qty,1:p)
        sigmas[j] = sqrt(Base.sumabs2(sub(qty,(p+1):n))/df)
    end
    betas, sigmas
end
Out[21]:
In [22]:
srand(1234321)
boot1(m1.model,1)
Out[22]:
In [23]:
srand(1234321);gc();@time boot1(m1.model,10000)
Out[23]:
We have cut down on the memory allocated (24 Mb total down to 20 Mb total) but only by about 1/6. The other allocations are occurring in the multiplication by $\bf Q$ and the solutions with respect to $\bf R$. There are mutating versions of both these operations. A multiplication of the form $\bf Q^\prime y$ ends up calling the relevant method for At_mul_B. The general function is Ac_mul_B, indicating the conjugate transpose. The two functions are identical for real matrices and we will use the more general name Ac_mul_B and its mutating version Base.Ac_mul_B!. Similarly, R\qty ends up calling A_ldiv_B, with mutating version Base.A_ldiv_B!.
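On the objects defined earlier the pattern looks like this (a sketch; note that it destroys the current contents of y1):

Base.Ac_mul_B!(Q, y1)    # overwrite y1 with Q'y1
b = y1[1:5]              # copy out the leading p = 5 elements
Base.A_ldiv_B!(R, b)     # overwrite b with R\b, the coefficient estimates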
In [24]:
function boot2(m::LinearModel,nsim::Integer)
    qrd = m.pp.qr
    qmat = qrd[:Q]
    rmat = Triangular(qrd[:R],:U)
    n,p = size(qrd)
    df = float(n-p)
    d = IsoNormal(m.rr.mu, scale(m))
    betas = Array(Float64,(size(qrd,2),nsim))
    sigmas = Array(Float64,nsim)
    y = Array(Float64,n)
    for j in 1:nsim
        Base.Ac_mul_B!(qmat,rand!(d,y))  # y now holds Q'y
        Base.A_ldiv_B!(rmat,copy!(sub(betas,:,j),sub(y,1:p)))
        sigmas[j] = sqrt(Base.sumabs2(sub(y,(p+1):n))/df)
    end
    betas, sigmas
end
Out[24]:
In [25]:
srand(1234321)
boot2(m1.model,1)
Out[25]:
In [26]:
srand(1234321); gc(); @time samp2 = boot2(m1.model,10000);
It doesn't look like we have saved a lot, but remember that we are allocating 4 Mb of that 14 Mb for the result.
A common idiom in Julia is to create a mutating version of a function that overwrites one or more arguments, and a non-mutating version that allocates the object to be overwritten and calls the mutating version.
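As a toy illustration of the idiom (double! and double are hypothetical, not from any package):

double!(out::Vector, x::Vector) = (for i in 1:length(x); out[i] = 2x[i]; end; out)
double(x::Vector) = double!(similar(x), x)   # allocate the result, then call the mutating method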
We first create the mutating boot2! method, then define the non-mutating version to allocate the storage and call the mutating version.
In [27]:
function boot2!(betas::Matrix, sigmas::Vector, m::LinearModel)
    qrd = m.pp.qr
    n,p = size(qrd)
    size(betas,2) == length(sigmas) && size(betas,1) == p || throw(DimensionMismatch(""))
    qmat = qrd[:Q]
    rmat = Triangular(qrd[:R],:U)
    df = float(n-p)
    d = IsoNormal(m.rr.mu, scale(m))
    y = Array(Float64,n)
    for j in 1:length(sigmas)
        Base.Ac_mul_B!(qmat,rand!(d,y))
        Base.A_ldiv_B!(rmat,copy!(sub(betas,:,j),sub(y,1:p)))
        sigmas[j] = sqrt(Base.sumabs2(sub(y,(p+1):n))/df)
    end
    betas, sigmas
end
function boot2(m::LinearModel, nsim::Integer)
    boot2!(Array(Float64,(size(m.pp.qr,2),nsim)),Array(Float64,nsim),m)
end
Out[27]:
In [28]:
srand(1234321)
boot2!(zeros(5,1),zeros(1),m1.model)
Out[28]:
In [29]:
betas = zeros(5,10000)
sigmas = zeros(10000)
srand(1234321); gc(); @time boot2!(betas,sigmas,m1.model);
So despite all these heroic efforts we seem to be stuck allocating about 14 Mb in this case.
We should profile to see what is happening. To get a reasonable execution time we increase the number of samples from 10,000 to 100,000.
In [30]:
betas = zeros(5,100_000)
sigmas = zeros(100_000)
srand(1234321)
gc()
@profile boot2!(betas,sigmas,m1.model);
In [31]:
using ProfileView
At this point we would typically run ProfileView.view() but the resulting plot doesn't display well in an IJulia notebook. Also, when @profile is called through IJulia there are several functions specific to the IJulia interface that show up.
Running this in the REPL does produce a profile flame plot which shows, as expected, that essentially all the time is being used in the inner loop: roughly 1/3 in the Ac_mul_B! call, nearly 2/3 in the A_ldiv_B! call, and a small amount in the sumabs2 call.
Furthermore, only a small fraction of the time in the A_ldiv_B! call is actually spent in the calculation. Most of the time is spent in the sub function. I misunderstood how it worked. This frequently happens to me, which is why @profile and ProfileView.view() are so valuable.
It is likely that view from the ArrayViews package will be faster and allocate less memory. (Because the capabilities in ArrayViews are likely to break existing code, they will not be introduced into the base system until after release 0.3.0 is out.)
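A rough way to see the difference in construction cost (a sketch; timings will vary):

using ArrayViews
@time for i in 1:100_000; sub(y1, 1:5); end    # constructing a SubArray each iteration
@time for i in 1:100_000; view(y1, 1:5); end   # ArrayViews' view objects are lighter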
In [32]:
using ArrayViews
function boot3!(betas::Matrix, sigmas::Vector, m::LinearModel)
    qrd = m.pp.qr
    n,p = size(qrd)
    size(betas,2) == length(sigmas) && size(betas,1) == p || throw(DimensionMismatch(""))
    qmat = qrd[:Q]
    rmat = Triangular(qrd[:R],:U)
    df = float(n-p)
    d = IsoNormal(m.rr.mu, scale(m))
    y = Array(Float64,n)
    for j in 1:length(sigmas)
        Base.Ac_mul_B!(qmat,rand!(d,y))
        Base.A_ldiv_B!(rmat,copy!(view(betas,:,j),view(y,1:p)))
        sigmas[j] = sqrt(Base.sumabs2(view(y,(p+1):n))/df)
    end
    betas, sigmas
end
function boot3(m::LinearModel, nsim::Integer)
    boot3!(Array(Float64,(size(m.pp.qr,2),nsim)),Array(Float64,nsim),m)
end
Out[32]:
In [33]:
srand(1234321)
boot3!(zeros(5,1),[0.],m1.model)
Out[33]:
In [34]:
betas = zeros(5,10_000)
sigmas = zeros(10_000)
srand(1234321)
gc()
@time boot3!(betas,sigmas,m1.model);
The profile indicates that we are unlikely to make this much faster, as almost all the time is being spent in the BLAS or LAPACK functions, plus the simulation from the multivariate normal. We are still allocating over 5 Mb during the execution, but I'm not sure exactly where this is occurring.
In addition to allowing memory allocation to be more finely controlled, having a mutating version of the simulation function can help when parallelizing the simulation using distributed arrays or memory-mapped arrays.
A memory-mapped array corresponds to a stored file. Numbers are stored in their native binary representation.
Suppose that we were simulating a huge number of replications for this model or for more complex models.
In [35]:
nsim = 10_000_000
n, p = size(m1.model.pp.X)
s = open("/tmp/mmapbetas.bin","w+")
betas = mmap_array(Float64,(p,nsim),s)
typeof(betas)
Out[35]:
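The mutating version can then fill such memory-mapped arrays directly (a sketch; the sigmas file name is chosen by analogy with the betas one, and with nsim = 10,000,000 the simulation itself would take a while):

s2 = open("/tmp/mmapsigmas.bin","w+")
sigmas = mmap_array(Float64,(nsim,),s2)
boot3!(betas, sigmas, m1.model)   # results are written straight into the mapped files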