A recent paper by DeBruine and Barr describes the use of simulations to explore the formulation and interpretation of mixed-effects models, especially in cases of crossed random-effects terms. They use the lme4 package for R to illustrate how the model is constructed and to perform the simulations.
R and lme4 are wonderful for illustrating the concepts but not always the best for performing the simulations. One of my purposes in creating the MixedModels package for Julia was to develop and implement data structures and algorithms that accelerate compute-intensive processes such as simulation of mixed-effects models. The purpose of this notebook is to show how the simulations can be performed in Julia quickly and with minimal adjustment from the R representation.
Julia executables for different operating systems are available at the download site. This notebook was prepared using Julia version 1.1.1 and the packages shown below.
In [1]:
# Attach packages from the standard library
using InteractiveUtils, LinearAlgebra, Random, Statistics
# Attach user-contributed packages (may need to Pkg.add these)
using DataFrames, MixedModels, RData, StatsModels, StatsPlots
# Check the version of Julia being used
versioninfo()
The simulations will be based on a model defined using a formula and a data frame such as the one shown in Table 2 of DeBruine and Barr's paper. The easiest way for an R user to obtain such a data frame in Julia is to create it in R, using the methods described in that paper, and save it as, say, data.RData, which can be read in Julia using the RData package.
In [2]:
dat = load("data.RData")["dat"]
Out[2]:
In [3]:
describe(dat)
Out[3]:
The values of the response, RT, are arbitrary. They need to be present in the data frame to construct the model but are not used in the simulation.
A mixed-effects model formula in Julia is similar to one in R but must be enclosed in a call to the @formula macro. The formula for the model to be considered is
In [4]:
f1 = @formula(RT ~ 1 + category + (1+category|subj_id) + (1|item_id));
One issue to be resolved is the numerical coding of the category variable. DeBruine and Barr used a $\pm0.5$ encoding. Statisticians, such as myself, prefer a $\pm 1$ encoding, which is generated by the "Helmert" or the "Effects" contrasts for a two-level factor. The coding of categorical covariates is specified as a dictionary of key-value pairs in the call that creates a LinearMixedModel object. Without going into too many details, the name of the column is given as a symbol, like :category, and the generator for the coding as a function call, like EffectsCoding(). The => operator creates a key-value pair.
In [5]:
m1 = LinearMixedModel(f1, dat, Dict(:category => EffectsCoding()));
Just to check that the coding occurred as expected, the model matrix for the fixed-effects parameters is
In [6]:
m1.X
Out[6]:
and that for the subject-specific random effects is
In [7]:
first(m1.reterms)
Out[7]:
(If you are concerned about the size of this matrix, don't worry - it is stored in a much more compact way.)
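One rough way to see this is to inspect the concrete type and memory footprint of the term. The exact type name displayed depends on the MixedModels version, but it is a specialized random-effects type rather than a dense Matrix{Float64}, and Base.summarysize should report far fewer bytes than a dense matrix of the displayed size would need.
typeof(first(m1.reterms))              # a specialized random-effects term type, not Matrix{Float64}
Base.summarysize(first(m1.reterms))    # bytes actually stored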
The fixed-effects parameter vector in DeBruine and Barr's formulation is [800.0, 50.0] for the $\pm 0.5$ coding, which corresponds to [800.0, 25.0] in the $\pm 1$ coding. It is written as $\beta$. The standard deviation for the residual, written $\sigma$, is 200.0.
In [8]:
β = [800.0; 25.0]
Out[8]:
In [9]:
σ = 200.0
Out[9]:
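As a quick check that the two parameter vectors describe the same pair of cell means, one can write out the fixed-effects rows for the two levels of category by hand under each coding and multiply by the corresponding parameter vector:
Xhalf = [1.0 -0.5; 1.0 0.5]    # rows for the two levels of category under the ±0.5 coding
Xunit = [1.0 -1.0; 1.0 1.0]    # the same rows under the ±1 (Effects) coding
Xhalf * [800.0, 50.0] ≈ Xunit * [800.0, 25.0]    # true; both give cell means of 775 and 825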
The standard deviation of the by-subject intercept random effects is 100.0 and that of the by-item intercept random effects is 80.0. The standard deviation of the by-subject slope random effects is 40.0 in the $\pm0.5$ coding or 20.0 in the $\pm 1$ coding. The within-subject correlation is 0.2.
Internally, both the lme4 and MixedModels packages use parameters derived from the relative covariance factor of the random-effects distribution. In this case the correlation matrix for the by-subject random effects is
In [10]:
cormat = [1. 0.2; 0.2 1.]
Out[10]:
Combining this with the standard deviations produces the covariance matrix
In [11]:
covmat = Diagonal([100.; 20.]) * cormat * Diagonal([100.; 20.])
Out[11]:
The covariance factor is the lower Cholesky factor of this positive definite symmetric matrix
In [12]:
covfac = cholesky(covmat).L
Out[12]:
The relative covariance factor is covfac/σ
In [13]:
Λ = covfac ./ σ
Out[13]:
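Scaling Λ back up by σ should recover the covariance matrix constructed above, which gives a quick consistency check:
σ^2 * (Λ * Λ') ≈ covmat    # expected to be true, up to floating-point rounding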
This leads to a covariance parameter vector for the subject-specific random effects consisting of the elements of the lower triangle of Λ in column-major order (linear indices 1, 2, and 4 of this 2×2 matrix)
In [14]:
θ = Λ[[1,2,4]]
Out[14]:
For the item-specific random effects the relative covariance factor is simply the ratio of the standard deviation of the by-item intercepts to σ. This value is appended to the current θ vector.
In [15]:
push!(θ, 80.0/200.0)
Out[15]:
Simulating a response vector and then fitting the model to it are performed as
In [16]:
refit!(simulate!(m1, β=β, σ=σ, θ=θ))
Out[16]:
A Julia convention is to append ! to the names of mutating functions, which can change the value of one or more of their arguments. Typically it is the first argument that is changed. That is, the call to push! modifies θ, and the call to simulate! modifies and returns m1, which is then further modified by the call to refit!.
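As a minimal illustration of the convention, push! applied to an ordinary vector modifies that vector in place:
v = [1.0, 2.0]
push!(v, 3.0)    # returns and mutates v, which is now [1.0, 2.0, 3.0]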
The focus of this simulation is the estimate of the effect of category.
In [17]:
last(m1.β)
Out[17]:
and its standard error
In [18]:
last(m1.stderror)
Out[18]:
It is useful to create a function to perform the simulation and store the values in a DataFrame. For reproducibility we pass the random number generator initialized with a known seed.
In [19]:
function lmmSim!(m, rng, N, β, σ, θ)
    b = similar(β, N)    # estimates of the category effect
    se = similar(b)      # corresponding standard errors
    for i in 1:N
        # pass rng so each simulation draws from the seeded generator;
        # recent versions of MixedModels accept an RNG as the first argument to simulate!
        refit!(simulate!(rng, m, β=β, σ=σ, θ=θ))
        b[i] = last(m.β)
        se[i] = last(m.stderror)
    end
    DataFrame(estimate=b, stderr=se)
end
Out[19]:
Just to check that it works as expected
In [20]:
rng = MersenneTwister(1234321) # random number generator
lmmSim!(m1, rng, 10, β, σ, θ)
Out[20]:
Now collect a larger sample
In [21]:
sim = @time lmmSim!(m1, rng, 10000, β, σ, θ);
In [22]:
@df sim density(:estimate)
Out[22]:
In [23]:
describe(sim)
Out[23]:
An empirical p-value for the one-sided test of $H_0:\beta_1\le0$ versus $H_a:\beta_1>0$ is
In [24]:
sum(x ≤ 0 for x in sim.estimate)/length(sim.estimate)
Out[24]:
The simulated sample from the estimator can be used to create empirical confidence intervals on the parameters. Because the distribution of the estimate is relatively symmetric, it is reasonable to evaluate the bounds on a $1-\alpha$ confidence interval as the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles of the sample. A 95% confidence interval evaluated in this way would be
In [25]:
quantile(sim.estimate, [0.025, 0.975])
Out[25]:
An alternative, which can be useful for skewed but unimodal distributions, is to consider all the contiguous blocks of indices containing $1-\alpha$ proportion of the sorted values and pick the one with the shortest distance between the end points. This is similar to a highest posterior density interval in a Bayesian analysis.
In [26]:
function hpdinterval(samp, coverage)
    sorted = sort(samp)
    n = length(sorted)
    k = round(Int, n * coverage)    # each candidate interval spans k + 1 sorted values
    # choose the contiguous block with the smallest distance between its end points
    minj = argmin([sorted[j + k] - sorted[j] for j in 1:(n - k)])
    [sorted[minj], sorted[minj + k]]
end
Out[26]:
In [27]:
hpdinterval(sim.estimate, 0.95)
Out[27]:
This is essentially the same as the interval from the equal-tail quantiles.