Mixed-effects model simulation using MixedModels.jl

A recent paper by DeBruin and Barr describes the use of simulations to explote the formulation and interpretation of mixed-effects models, especially in cases of crossed random-effects terms. They use the lme4 package for R to illustrate how the model is constructed and to perform the simulations.

R and lme4 are wonderful for illustrating the concepts but not always the best for performing the simulations. One of my purposes in creating the MixedModels package for Julia was to develop and implement data structures and algorithms that accelerate compute-intensive processes such as simulation of mixed-effects models.

The purpose of this notebook is to show how the simulations can be performed in Julia quickly and with minimal adjustment from the R representation.

Preparing the Julia environment

Julia executables for different operating systems are available at the download site. This notebook was prepared using Julia version 1.1.1 and the packages shown below.



In [1]:

    
# Attach packages from the standard library
using InteractiveUtils, LinearAlgebra, Random, Statistics
# Attach user-contributed packages (may need to Pkg.add these)
using DataFrames, MixedModels, RData, StatsModels, StatsPlots
# Check the version of Julia being used
versioninfo()









    



Julia Version 1.1.1
Commit 55e36cc308 (2019-05-16 04:10 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, ivybridge)

The simulations will be based on a model defined using a formula and a data frame such as shown in Table 2 of DeBruin and Barr's paper. The easiest way for an R user to obtain such a data frame in Julia is to create it in R, using the methods described in that paper, and save it as, say, data.RData, which can be read in Julia using the RData package.



In [2]:

    
dat = load("data.RData")["dat"]









    Out[2]:




5,000 rows × 4 columns
subj_id item_id category RT
String String String Float64
1 S001 I01 ingroup 798.76
2 S001 I02 ingroup 800.02
3 S001 I03 ingroup 800.843
4 S001 I04 ingroup 800.143
5 S001 I05 ingroup 800.781
6 S001 I06 ingroup 799.472
7 S001 I07 ingroup 799.051
8 S001 I08 ingroup 799.584
9 S001 I09 ingroup 801.917
10 S001 I10 ingroup 798.339
11 S001 I11 ingroup 800.221
12 S001 I12 ingroup 799.701
13 S001 I13 ingroup 800.044
14 S001 I14 ingroup 799.906
15 S001 I15 ingroup 799.697
16 S001 I16 ingroup 799.318
17 S001 I17 ingroup 798.748
18 S001 I18 ingroup 799.089
19 S001 I19 ingroup 798.742
20 S001 I20 ingroup 801.326
21 S001 I21 ingroup 801.264
22 S001 I22 ingroup 799.85
23 S001 I23 ingroup 799.253
24 S001 I24 ingroup 799.586
25 S001 I25 ingroup 799.155
26 S001 I26 outgroup 799.412
27 S001 I27 outgroup 799.352
28 S001 I28 outgroup 799.644
29 S001 I29 outgroup 800.85
30 S001 I30 outgroup 798.959
&vellip &vellip &vellip &vellip &vellip



In [3]:

    
describe(dat)









    Out[3]:




4 rows × 8 columns
variable mean min median max nunique nmissing eltype
Symbol Union… Any Union… Any Union… Nothing DataType
1 subj_id S001 S100 100 String
2 item_id I01 I50 50 String
3 category ingroup outgroup 2 String
4 RT 799.971 796.716 799.97 803.884 Float64

The values of the response, RT, are arbitrary. They need to be present in the data frame to construct the model but are not used in the simulation.

Creating a LinearMixedModel object

A mixed-effects model formula in Julia is similar to one in R but must be enclosed in a call to the @formula macro. The formula for the model to be considered is



In [4]:

    
f1 = @formula(RT ~ 1 + category + (1+category|subj_id) + (1|item_id));

One issue to be resolved is the numerical coding of the category variable. DeBruin and Barr used a $\pm0.5$ encoding. Statisticians, such as I, prefer a $\pm 1$ encoding which is generated by the "Helmert" or the "Effects" contrasts for a two-level factor. Coding of categorical covariates is specified as a dictionary of key-value pairs in the call to create a LinearMixedModel object. Without going into too many details, the name of the column is given as a symbol, like :cateegory, and the generator for the coding as a function call, like EffectsCoding(). The => operator creates a key-value pair.



In [5]:

    
m1 = LinearMixedModel(f1, dat, Dict(:category => EffectsCoding()));

Just to check that the coding occurred as expected, the model matrix for the fixed-effects parameters is



In [6]:

    
m1.X









    Out[6]:





5000×2 Array{Float64,2}:
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 1.0  -1.0
 ⋮        
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0
 1.0   1.0

and that for the subject-specific random effects is



In [7]:

    
first(m1.reterms)









    Out[7]:





5000×200 ReMat{Float64,2}:
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  -1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮                         ⋮         ⋱            ⋮                      
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0
 0.0   0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  1.0  1.0

(If you are concerned about the size of this matrix, don't worry - it is stored in a much more compact way.)

Specifying parameter values

The fixed-effects parameter vector in DeBruine and Barr's formula is [800.0, 50.0] for the $\pm 0.5$ coding, which corresponds to [800.0, 25.0] in the $\pm 1$ coding. It is written as $\beta$. The standard deviation for the residual, written $\sigma$, is 200.0.



In [8]:

    
β = [800.0; 25.0]









    Out[8]:





2-element Array{Float64,1}:
 800.0
  25.0



In [9]:

    
σ = 200.0









    Out[9]:





200.0

The standard deviation of the by-subject intercept random-effects is 100.0 and the by-item intercept random-effects is 80.0. The standard deviation of the by-subject slope random effects is 40.0 in the $\pm0.5$ coding or 20.0 in the $\pm 1$ coding. The within-subject correlation is 0.2.

Internally both the lme4 and the MixedModels package use parameters derived from the relative covariance factor of the random effects distribution. In this case the correlation matrix for the random effects by subject is



In [10]:

    
cormat = [1. 0.2; 0.2 1.]









    Out[10]:





2×2 Array{Float64,2}:
 1.0  0.2
 0.2  1.0

Combining this with the standard deviations produces the covariance matrix



In [11]:

    
covmat = Diagonal([100.; 20.]) * cormat * Diagonal([100.; 20.])









    Out[11]:





2×2 Array{Float64,2}:
 10000.0  400.0
   400.0  400.0

The covariance factor is the lower Cholesky factor of this positive definite symmetric matrix



In [12]:

    
covfac = cholesky(covmat).L









    Out[12]:





2×2 LowerTriangular{Float64,Array{Float64,2}}:
 100.0    ⋅    
   4.0  19.5959

The relative covariance factor is covfac/σ



In [13]:

    
Λ = covfac ./ σ









    Out[13]:





2×2 LowerTriangular{Float64,Array{Float64,2}}:
 0.5    ⋅       
 0.02  0.0979796

leading to a covariance parameter vector for the subject-specific random effects of



In [14]:

    
θ = Λ[[1,2,4]]









    Out[14]:





3-element Array{Float64,1}:
 0.5                
 0.02               
 0.09797958971132711

For the item-specific random effects the relative covariance factor is simply the ratio of the standard deviation of the intercepts to σ. This is appended to the current



In [15]:

    
push!(θ, 80.0/200.0)









    Out[15]:





4-element Array{Float64,1}:
 0.5                
 0.02               
 0.09797958971132711
 0.4

Simulation of a response vector and fitting the model to that vector is performed as



In [16]:

    
refit!(simulate!(m1, β=β, σ=σ, θ=θ))









    Out[16]:





Linear mixed model fit by maximum likelihood
 RT ~ 1 + category + (1 + category | subj_id) + (1 | item_id)
     logLik        -2 logLik          AIC             BIC       
 -3.38407253×10⁴  6.76814506×10⁴  6.76954506×10⁴  6.77410709×10⁴

Variance components:
                 Column         Variance   Std.Dev.    Corr.
 subj_id  (Intercept)          8301.70025  91.113667
          category: outgroup    289.67972  17.019980  0.14
 item_id  (Intercept)          6566.69818  81.035166
 Residual                     40790.96882 201.967742
 Number of obs: 5000; levels of grouping factors: 100, 50

  Fixed-effects parameters:
──────────────────────────────────────────────────────────
                    Estimate  Std.Error   z value  P(>|z|)
──────────────────────────────────────────────────────────
(Intercept)         816.44      14.9167  54.7331    <1e-99
category: outgroup   37.1347    11.9327   3.11201   0.0019
──────────────────────────────────────────────────────────

A Julia convention is to append ! to the names of mutating functions, which can change the value of one or more of the functions arguments. Typically it is the first argument that is changed. That is, the call to push! modifies θ and the call to simulate! modifies and returns m1 which is then further modified by the call to refit!.

Simulating the distribution of an estimator

The focus of this simulation is the estimate of the effect of category.



In [17]:

    
last(m1.β)









    Out[17]:





37.134667905555

and its standard error



In [18]:

    
last(m1.stderror)









    Out[18]:





11.932684292413413

It is useful to create a function to perform the simulation and store the values in a DataFrame. For reproducibility we pass the random number generator initialized with a known seed.



In [19]:

    
function lmmSim!(m, rng, N, β, σ, θ)
    b = similar(β, N)
    se = similar(b)
    for i in 1:N
        refit!(simulate!(m, β=β, σ=σ, θ=θ))
        b[i] = last(m.β)
        se[i] = last(m.stderror)
    end
    DataFrame(estimate=b, stderr=se)
end









    Out[19]:





lmmSim! (generic function with 1 method)

Just to check that it works as expected



In [20]:

    
rng = MersenneTwister(1234321)   # random number generator
lmmSim!(m1, rng, 10, β, σ, θ)









    Out[20]:




10 rows × 2 columns
estimate stderr
Float64 Float64
1 52.3617 10.0055
2 29.1041 10.4875
3 35.5162 10.869
4 21.1009 11.7573
5 18.0655 11.7528
6 9.09064 10.748
7 2.05215 12.6805
8 14.3647 12.9878
9 37.8851 13.0451
10 43.1481 11.7467

Now collect a larger sample



In [21]:

    
sim = @time lmmSim!(m1, rng, 10000, β, σ, θ);









    



303.806155 seconds (644.81 M allocations: 32.730 GiB, 1.48% gc time)



In [22]:

    
@df sim density(:estimate)









    Out[22]:



In [23]:

    
describe(sim)









    Out[23]:




2 rows × 8 columns
variable mean min median max nunique nmissing eltype
Symbol Float64 Float64 Float64 Float64 Nothing Nothing DataType
1 estimate 25.1166 -23.5304 25.1465 69.0275 Float64
2 stderr 11.6072 7.70398 11.5805 16.1908 Float64

An empirical p-value for the one-sided test of $H_0:\beta_1\le0$ versus $H_a:\beta_1>0$ is



In [24]:

    
sum(x ≤ 0 for x in sim.estimate)/length(sim.estimate)









    Out[24]:





0.0158

Confidence intervals

The simulated sample from the estimator can be used to create empirical confidence intervals on the parameters. Because the distribution of the estimate is relatively symmetric, it is reasonable to evaluate the bounds on a $1-\alpha$ confidence interval as the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles of the sample. A 95% confidence interval evaluated in this way would be



In [25]:

    
quantile(sim.estimate, [0.025, 0.975])









    Out[25]:





2-element Array{Float64,1}:
  2.149911837488085
 48.37219081741177

An alternative, which can be useful for skewed but unimodal distributions, is to consider all the contiguous blocks of indices containing $1-\alpha$ proportion of the sorted values and pick the one with the shortest distance between the end points. This is similar to a highest posterior density interval in a Bayesian analysis.



In [26]:

    
function hpdinterval(samp, coverage)
    sorted = sort(samp)
    n = length(sorted)
    k = round(Int, n * coverage)
    minj = argmin([(sorted[j + k] - sorted[j]) for j in 1:(n-k)])
    [sorted[minj], sorted[minj + k]]
end









    Out[26]:





hpdinterval (generic function with 1 method)



In [27]:

    
hpdinterval(sim.estimate, 0.95)









    Out[27]:





2-element Array{Float64,1}:
  2.2296495929604245
 48.45062854829372

This is essentially the same as the interval from the equal-tail quantiles.

	subj_id	item_id	category	RT
	String	String	String	Float64
1	S001	I01	ingroup	798.76
2	S001	I02	ingroup	800.02
3	S001	I03	ingroup	800.843
4	S001	I04	ingroup	800.143
5	S001	I05	ingroup	800.781
6	S001	I06	ingroup	799.472
7	S001	I07	ingroup	799.051
8	S001	I08	ingroup	799.584
9	S001	I09	ingroup	801.917
10	S001	I10	ingroup	798.339
11	S001	I11	ingroup	800.221
12	S001	I12	ingroup	799.701
13	S001	I13	ingroup	800.044
14	S001	I14	ingroup	799.906
15	S001	I15	ingroup	799.697
16	S001	I16	ingroup	799.318
17	S001	I17	ingroup	798.748
18	S001	I18	ingroup	799.089
19	S001	I19	ingroup	798.742
20	S001	I20	ingroup	801.326
21	S001	I21	ingroup	801.264
22	S001	I22	ingroup	799.85
23	S001	I23	ingroup	799.253
24	S001	I24	ingroup	799.586
25	S001	I25	ingroup	799.155
26	S001	I26	outgroup	799.412
27	S001	I27	outgroup	799.352
28	S001	I28	outgroup	799.644
29	S001	I29	outgroup	800.85
30	S001	I30	outgroup	798.959
&vellip	&vellip	&vellip	&vellip	&vellip

	variable	mean	min	median	max	nunique	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Union…	Nothing	DataType
1	subj_id		S001		S100	100		String
2	item_id		I01		I50	50		String
3	category		ingroup		outgroup	2		String
4	RT	799.971	796.716	799.97	803.884			Float64

	estimate	stderr
	Float64	Float64
1	52.3617	10.0055
2	29.1041	10.4875
3	35.5162	10.869
4	21.1009	11.7573
5	18.0655	11.7528
6	9.09064	10.748
7	2.05215	12.6805
8	14.3647	12.9878
9	37.8851	13.0451
10	43.1481	11.7467