Mixed-effects model simulation using MixedModels.jl

A recent paper by DeBruin and Barr describes the use of simulations to explote the formulation and interpretation of mixed-effects models, especially in cases of crossed random-effects terms. They use the lme4 package for R to illustrate how the model is constructed and to perform the simulations.

R and lme4 are wonderful for illustrating the concepts but not always the best for performing the simulations. One of my purposes in creating the MixedModels package for Julia was to develop and implement data structures and algorithms that accelerate compute-intensive processes such as simulation of mixed-effects models.

The purpose of this notebook is to show how the simulations can be performed in Julia quickly and with minimal adjustment from the R representation.

Preparing the Julia environment

Julia executables for different operating systems are available at the download site. This notebook was prepared using Julia version 1.1.1 and the packages shown below.


In [1]:
# Attach packages from the standard library
using InteractiveUtils, LinearAlgebra, Random, Statistics
# Attach user-contributed packages (may need to Pkg.add these)
using DataFrames, MixedModels, RData, StatsModels, StatsPlots
# Check the version of Julia being used
versioninfo()


Julia Version 1.1.1
Commit 55e36cc308 (2019-05-16 04:10 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, ivybridge)

The simulations will be based on a model defined using a formula and a data frame such as shown in Table 2 of DeBruin and Barr's paper. The easiest way for an R user to obtain such a data frame in Julia is to create it in R, using the methods described in that paper, and save it as, say, data.RData, which can be read in Julia using the RData package.


In [2]:
dat = load("data.RData")["dat"]


Out[2]:

5,000 rows × 4 columns

subj_iditem_idcategoryRT
StringStringStringFloat64
1S001I01ingroup798.76
2S001I02ingroup800.02
3S001I03ingroup800.843
4S001I04ingroup800.143
5S001I05ingroup800.781
6S001I06ingroup799.472
7S001I07ingroup799.051
8S001I08ingroup799.584
9S001I09ingroup801.917
10S001I10ingroup798.339
11S001I11ingroup800.221
12S001I12ingroup799.701
13S001I13ingroup800.044
14S001I14ingroup799.906
15S001I15ingroup799.697
16S001I16ingroup799.318
17S001I17ingroup798.748
18S001I18ingroup799.089
19S001I19ingroup798.742
20S001I20ingroup801.326
21S001I21ingroup801.264
22S001I22ingroup799.85
23S001I23ingroup799.253
24S001I24ingroup799.586
25S001I25ingroup799.155
26S001I26outgroup799.412
27S001I27outgroup799.352
28S001I28outgroup799.644
29S001I29outgroup800.85
30S001I30outgroup798.959
&vellip&vellip&vellip&vellip&vellip

In [3]:
describe(dat)


Out[3]:

4 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolUnion…AnyUnion…AnyUnion…NothingDataType
1subj_idS001S100100String
2item_idI01I5050String
3categoryingroupoutgroup2String
4RT799.971796.716799.97803.884Float64

The values of the response, RT, are arbitrary. They need to be present in the data frame to construct the model but are not used in the simulation.

Creating a LinearMixedModel object

A mixed-effects model formula in Julia is similar to one in R but must be enclosed in a call to the @formula macro. The formula for the model to be considered is


In [4]:
f1 = @formula(RT ~ 1 + category + (1+category|subj_id) + (1|item_id));

One issue to be resolved is the numerical coding of the category variable. DeBruin and Barr used a $\pm0.5$ encoding. Statisticians, such as I, prefer a $\pm 1$ encoding which is generated by the "Helmert" or the "Effects" contrasts for a two-level factor. Coding of categorical covariates is specified as a dictionary of key-value pairs in the call to create a LinearMixedModel object. Without going into too many details, the name of the column is given as a symbol, like :cateegory, and the generator for the coding as a function call, like EffectsCoding(). The => operator creates a key-value pair.


In [5]:
m1 = LinearMixedModel(f1, dat, Dict(:category => EffectsCoding()));

Just to check that the coding occurred as expected, the model matrix for the fixed-effects parameters is


In [6]:
m1.X


Out[6]:
5000×2 Array{Float64,2}:
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 ⋮        
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0

and that for the subject-specific random effects is


In [7]:
first(m1.reterms)


Out[7]:
5000×200 ReMat{Float64,2}:
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮                         ⋮         ⋱            ⋮                      
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0

(If you are concerned about the size of this matrix, don't worry - it is stored in a much more compact way.)

Specifying parameter values

The fixed-effects parameter vector in DeBruine and Barr's formula is [800.0, 50.0] for the $\pm 0.5$ coding, which corresponds to [800.0, 25.0] in the $\pm 1$ coding. It is written as $\beta$. The standard deviation for the residual, written $\sigma$, is 200.0.


In [8]:
β = [800.0; 25.0]


Out[8]:
2-element Array{Float64,1}:
 800.0
  25.0

In [9]:
σ = 200.0


Out[9]:
200.0

The standard deviation of the by-subject intercept random-effects is 100.0 and the by-item intercept random-effects is 80.0. The standard deviation of the by-subject slope random effects is 40.0 in the $\pm0.5$ coding or 20.0 in the $\pm 1$ coding. The within-subject correlation is 0.2.

Internally both the lme4 and the MixedModels package use parameters derived from the relative covariance factor of the random effects distribution. In this case the correlation matrix for the random effects by subject is


In [10]:
cormat = [1. 0.2; 0.2 1.]


Out[10]:
2×2 Array{Float64,2}:
 1.0  0.2
 0.2  1.0

Combining this with the standard deviations produces the covariance matrix


In [11]:
covmat = Diagonal([100.; 20.]) * cormat * Diagonal([100.; 20.])


Out[11]:
2×2 Array{Float64,2}:
 10000.0  400.0
   400.0  400.0

The covariance factor is the lower Cholesky factor of this positive definite symmetric matrix


In [12]:
covfac = cholesky(covmat).L


Out[12]:
2×2 LowerTriangular{Float64,Array{Float64,2}}:
 100.0    ⋅    
   4.0  19.5959

The relative covariance factor is covfac/σ


In [13]:
Λ = covfac ./ σ


Out[13]:
2×2 LowerTriangular{Float64,Array{Float64,2}}:
 0.5    ⋅       
 0.02  0.0979796

leading to a covariance parameter vector for the subject-specific random effects of


In [14]:
θ = Λ[[1,2,4]]


Out[14]:
3-element Array{Float64,1}:
 0.5                
 0.02               
 0.09797958971132711

For the item-specific random effects the relative covariance factor is simply the ratio of the standard deviation of the intercepts to σ. This is appended to the current


In [15]:
push!(θ, 80.0/200.0)


Out[15]:
4-element Array{Float64,1}:
 0.5                
 0.02               
 0.09797958971132711
 0.4                

Simulation of a response vector and fitting the model to that vector is performed as


In [16]:
refit!(simulate!(m1, β=β, σ=σ, θ=θ))


Out[16]:
Linear mixed model fit by maximum likelihood
 RT ~ 1 + category + (1 + category | subj_id) + (1 | item_id)
     logLik        -2 logLik          AIC             BIC       
 -3.38407253×10⁴  6.76814506×10⁴  6.76954506×10⁴  6.77410709×10⁴

Variance components:
                 Column         Variance   Std.Dev.    Corr.
 subj_id  (Intercept)          8301.70025  91.113667
          category: outgroup    289.67972  17.019980  0.14
 item_id  (Intercept)          6566.69818  81.035166
 Residual                     40790.96882 201.967742
 Number of obs: 5000; levels of grouping factors: 100, 50

  Fixed-effects parameters:
──────────────────────────────────────────────────────────
                    Estimate  Std.Error   z value  P(>|z|)
──────────────────────────────────────────────────────────
(Intercept)         816.44      14.9167  54.7331    <1e-99
category: outgroup   37.1347    11.9327   3.11201   0.0019
──────────────────────────────────────────────────────────

A Julia convention is to append ! to the names of mutating functions, which can change the value of one or more of the functions arguments. Typically it is the first argument that is changed. That is, the call to push! modifies θ and the call to simulate! modifies and returns m1 which is then further modified by the call to refit!.

Simulating the distribution of an estimator

The focus of this simulation is the estimate of the effect of category.


In [17]:
last(m1.β)


Out[17]:
37.134667905555

and its standard error


In [18]:
last(m1.stderror)


Out[18]:
11.932684292413413

It is useful to create a function to perform the simulation and store the values in a DataFrame. For reproducibility we pass the random number generator initialized with a known seed.


In [19]:
function lmmSim!(m, rng, N, β, σ, θ)
    b = similar(β, N)
    se = similar(b)
    for i in 1:N
        refit!(simulate!(m, β=β, σ=σ, θ=θ))
        b[i] = last(m.β)
        se[i] = last(m.stderror)
    end
    DataFrame(estimate=b, stderr=se)
end


Out[19]:
lmmSim! (generic function with 1 method)

Just to check that it works as expected


In [20]:
rng = MersenneTwister(1234321)   # random number generator
lmmSim!(m1, rng, 10, β, σ, θ)


Out[20]:

10 rows × 2 columns

estimatestderr
Float64Float64
152.361710.0055
229.104110.4875
335.516210.869
421.100911.7573
518.065511.7528
69.0906410.748
72.0521512.6805
814.364712.9878
937.885113.0451
1043.148111.7467

Now collect a larger sample


In [21]:
sim = @time lmmSim!(m1, rng, 10000, β, σ, θ);


303.806155 seconds (644.81 M allocations: 32.730 GiB, 1.48% gc time)

In [22]:
@df sim density(:estimate)


Out[22]:
-20 0 20 40 60 0.00 0.01 0.02 0.03 y1

In [23]:
describe(sim)


Out[23]:

2 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolFloat64Float64Float64Float64NothingNothingDataType
1estimate25.1166-23.530425.146569.0275Float64
2stderr11.60727.7039811.580516.1908Float64

An empirical p-value for the one-sided test of $H_0:\beta_1\le0$ versus $H_a:\beta_1>0$ is


In [24]:
sum(x  0 for x in sim.estimate)/length(sim.estimate)


Out[24]:
0.0158

Confidence intervals

The simulated sample from the estimator can be used to create empirical confidence intervals on the parameters. Because the distribution of the estimate is relatively symmetric, it is reasonable to evaluate the bounds on a $1-\alpha$ confidence interval as the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles of the sample. A 95% confidence interval evaluated in this way would be


In [25]:
quantile(sim.estimate, [0.025, 0.975])


Out[25]:
2-element Array{Float64,1}:
  2.149911837488085
 48.37219081741177 

An alternative, which can be useful for skewed but unimodal distributions, is to consider all the contiguous blocks of indices containing $1-\alpha$ proportion of the sorted values and pick the one with the shortest distance between the end points. This is similar to a highest posterior density interval in a Bayesian analysis.


In [26]:
function hpdinterval(samp, coverage)
    sorted = sort(samp)
    n = length(sorted)
    k = round(Int, n * coverage)
    minj = argmin([(sorted[j + k] - sorted[j]) for j in 1:(n-k)])
    [sorted[minj], sorted[minj + k]]
end


Out[26]:
hpdinterval (generic function with 1 method)

In [27]:
hpdinterval(sim.estimate, 0.95)


Out[27]:
2-element Array{Float64,1}:
  2.2296495929604245
 48.45062854829372  

This is essentially the same as the interval from the equal-tail quantiles.