Complexity in fitting Linear Mixed Models

Linear mixed-effects models are increasingly used for the analysis of data from experiments in fields like psychology where several subjects are each exposed to each of several different items. In addition to a response, which here will be assumed to be on a continuous scale, such as a response time, a number of experimental conditions are systematically varied during the experiment. In the language of statistical experimental design the latter variables are called experimental factors whereas factors like Subject and Item are blocking factors. That is, these are known sources of variation that usually are not of interest by themselves but still should be accounted for when looking for systematic variation in the response.

An example data set

The data from experiment 2 in Kronmueller and Barr (2007) are available in .rds (R Data Set) format in the file kb07_exp2_rt.rds in the github repository provided by Dale Barr. Files in this format can be loaded using the RData package for Julia.


In [1]:
using BenchmarkTools, DataFrames, Distributions, FreqTables
using MixedModels, RData, Statistics, StatsBase, StatsModels
using Gadfly


┌ Info: Loading DataFrames support into Gadfly.jl
└ @ Gadfly /home/bates/.julia/packages/Gadfly/09PWZ/src/mapping.jl:228

In [2]:
kb07 = load("../kronmueller-barr-2007/kb07_exp2_rt.rds")


Out[2]:

1,789 rows × 7 columns

subjitemspkrprecloadrt_truncrt_raw
Categorical…Categorical…Categorical…Categorical…Categorical…Int32Int32
1301newbreakyes22672267
2302oldmaintainno38563856
3303oldbreakno15671567
4304newmaintainno17321732
5305newbreakno26602660
6306oldmaintainyes27632763
7307oldbreakyes35283528
8308newmaintainyes17411741
9309newbreakyes36923692
103010oldmaintainno19491949
113011oldbreakno21892189
123012newmaintainno22072207
133013newbreakno20782078
143014oldmaintainyes19011901
153015oldbreakyes40154015
163016newmaintainyes18801880
173017newbreakyes14441444
183018oldmaintainno16831683
193019oldbreakno20372037
203020newmaintainno11681168
213021newbreakno19301930
223022oldmaintainyes18431843
233023oldbreakyes49694969
243024newmaintainyes17981798
253025newbreakyes24362436
263026oldmaintainno20182018
273027oldbreakno22782278
283028newmaintainno18661866
293029newbreakno17431743
303030oldmaintainyes19631963
&vellip&vellip&vellip&vellip&vellip&vellip&vellip&vellip

In [3]:
describe(kb07)


Out[3]:

7 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolUnion…AnyUnion…AnyUnion…NothingDataType
1subj3010356CategoricalString{UInt8}
2item13232CategoricalString{UInt8}
3spkrnewold2CategoricalString{UInt8}
4precbreakmaintain2CategoricalString{UInt8}
5loadnoyes2CategoricalString{UInt8}
6rt_trunc2182.25791940.05171Int32
7rt_raw2226.245791940.015923Int32

The blocking factors are subj and item with 56 and 32 levels respectively. There are three experimental factors each with two levels: spkr (speaker), prec (precedence), and load (cognitive load). The response time, rt_raw, is measured in milliseconds. A few very large values, e.g. the maximum which is nearly 16 seconds, which could skew the results, are truncated in the rt_trunc column. In addition, three erroneously recorded responses (values of 300 ms.) have been dropped, resulting in a slight imbalance in the data.


In [4]:
subjitemtbl = freqtable(kb07, :subj, :item)


Out[4]:
56×32 Named Array{Int64,2}
subj ╲ item │  1   2   3   4   5   6   7   8  …  25  26  27  28  29  30  31  32
────────────┼──────────────────────────────────────────────────────────────────
30          │  1   1   1   1   1   1   1   1  …   1   1   1   1   1   1   1   1
31          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
34          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
35          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
36          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
37          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
38          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
39          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
41          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
42          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
43          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
44          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
⋮              ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮  ⋱   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮
89          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
91          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
93          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
94          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
95          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
96          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
97          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
98          │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
100         │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
101         │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
102         │  1   1   1   1   1   1   1   1      1   1   1   1   1   1   1   1
103         │  1   1   1   1   1   1   1   1  …   1   1   1   1   1   1   1   1

In [5]:
countmap(vec(subjitemtbl))


Out[5]:
Dict{Int64,Int64} with 2 entries:
  0 => 3
  1 => 1789

All of the experimental factors vary within subject and within item, as can be verified by examining the frequency tables for the experimental and grouping factors. For example


In [6]:
freqtable(kb07, :spkr, :subj)


Out[6]:
2×56 Named Array{Int64,2}
spkr ╲ subj │  30   31   34   35   36   37  …   97   98  100  101  102  103
────────────┼──────────────────────────────────────────────────────────────
new         │  16   16   16   16   16   16  …   16   16   16   16   16   16
old         │  16   16   16   16   16   16  …   16   16   16   16   15   16

A table of mean responses and standard deviations for combinations of the experimental factors, as shown in Table 3 of the paper and on the data repository can be reproduced as


In [7]:
cellmeans = by(kb07, [:spkr, :prec, :load], meanRT = :rt_trunc => mean, sdRT = :rt_trunc => std, n = :rt_trunc => length)


Out[7]:

8 rows × 6 columns

spkrprecloadmeanRTsdRTn
Categorical…Categorical…Categorical…Float64Float64Int64
1newbreakno2345.17972.693223
2newbreakyes2505.341012.04224
3newmaintainno1760.66587.445224
4newmaintainyes1843.28609.672224
5oldbreakno2536.291058.78224
6oldbreakyes2674.321119.67224
7oldmaintainno1770.76669.901222
8oldmaintainyes2018.83725.359224

An interaction plot of these means


In [8]:
plot(cellmeans, x=:load, y=:meanRT, color=:prec, xgroup=:spkr, Geom.subplot_grid(Geom.point))


Out[8]:
load by spkr break maintain prec old new no yes no yes 500 750 1000 1250 1500 1750 2000 2250 2500 2750 3000 3250 3500 3750 4000 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 2050 2100 2150 2200 2250 2300 2350 2400 2450 2500 2550 2600 2650 2700 2750 2800 2850 2900 2950 3000 3050 3100 3150 3200 3250 3300 3350 3400 3450 3500 3550 3600 3650 3700 3750 0 1000 2000 3000 4000 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 meanRT

shows that the dominant factor is the precedent with the response times under break being considerably larger than under maintain.

Formulating a simple model

A simple model with main-effects for each of the experimental factors and with random effects for subject and for item is described by the formula rt_trunc ~ 1 + spkr + prec + load + (1|subj) + (1|item). In the MixedModels package, which uses the formula specifications from the StatsModels package, a formula must be wrapped in a call to the @formula macro. The model is created as an instance of a LinearMixedModel type then fit with a call to the fit! generic.

(By convention, the names in Julia of mutating functions, which modify the value of one or more of their arguments, end in ! as a warning to the user that arguments, usually just the first argument, can be overwritten with new values.)


In [9]:
m1 = fit!(LinearMixedModel(@formula(rt_trunc ~ 1 + spkr + prec + load + (1|subj) + (1|item)), kb07))


Out[9]:
Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + load + (1 | subj) + (1 | item)
     logLik        -2 logLik          AIC             BIC       
 -1.44115694×10⁴  2.88231387×10⁴  2.88371387×10⁴  2.88755646×10⁴

Variance components:
            Variance  Std.Dev. 
 subj      101529.32 318.63666
 item      132314.19 363.75018
 Residual  521145.70 721.90422
 Number of obs: 1789; levels of grouping factors: 56, 32

  Fixed-effects parameters:
                Estimate Std.Error  z value P(>|z|)
(Intercept)      2369.24   84.3435  28.0903  <1e-99
spkr: old        136.167   34.1367  3.98886   <1e-4
prec: maintain  -667.173   34.1367 -19.5441  <1e-84
load: yes        156.709   34.1366  4.59065   <1e-5

The first fit of such a model can take several seconds because the Just-In-Time (JIT) compiler must analyze and compile a considerable amount of code. (All of the code in the MixedModels package is Julia code.) Subsequent fits of this or similar models are much faster.

When timing fits that take a few seconds or less the @btime macro will be used, as it performs the evaluation of the expression several times and takes the minimum of these evaluation times.
This is to eliminate the noise introduced by garbage collection and compilation.

Furthermore, in what follows the formulas and models will be declared const, which in Julia means that the type of the value cannot be changed, although the values in, say, a vector can be.

Also the expression is written using $ before an argument name. This is an interpolation in a macro call. The effect is to time only the expression evaluation and not the name lookup for the arguments.


In [10]:
const f1 = @formula(rt_trunc ~ 1 + spkr + prec + load + (1|subj) + (1|item));
@btime fit!(LinearMixedModel($f1, $kb07));


  2.324 ms (3082 allocations: 841.83 KiB)

The coding for the experimental factors in the model matrix for this model is the default or "dummy" coding, where the first level is the reference level and the estimated coefficients represent differences of the response at the indicated level relative to the first.

In this case the model matrix for the fixed effects is


In [11]:
m1.X


Out[11]:
1789×4 Array{Float64,2}:
 1.0  0.0  0.0  1.0
 1.0  1.0  1.0  0.0
 1.0  1.0  0.0  0.0
 1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0
 1.0  1.0  0.0  1.0
 1.0  0.0  1.0  1.0
 1.0  0.0  0.0  1.0
 1.0  1.0  1.0  0.0
 1.0  1.0  0.0  0.0
 1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0
 ⋮                 
 1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0
 1.0  1.0  0.0  1.0
 1.0  0.0  1.0  1.0
 1.0  0.0  0.0  1.0
 1.0  1.0  1.0  0.0
 1.0  1.0  0.0  0.0
 1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0
 1.0  1.0  0.0  1.0

In two-level factorial experiments like this it is an advantage to use a $\pm1$ coding, which can be derived from EffectsCoding or HelmertCoding (these codings are the same for a two-level factor).


In [12]:
const contrasts = Dict(:spkr=>HelmertCoding(), :prec=>HelmertCoding(), :load=>HelmertCoding());

In [13]:
m1 = @btime fit!(LinearMixedModel($f1, $kb07, $contrasts))


  2.397 ms (3300 allocations: 849.63 KiB)
Out[13]:
Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + load + (1 | subj) + (1 | item)
     logLik        -2 logLik          AIC             BIC       
 -1.44115694×10⁴  2.88231387×10⁴  2.88371387×10⁴  2.88755646×10⁴

Variance components:
            Variance  Std.Dev. 
 subj      101529.32 318.63666
 item      132314.19 363.75018
 Residual  521145.70 721.90422
 Number of obs: 1789; levels of grouping factors: 56, 32

  Fixed-effects parameters:
                Estimate Std.Error  z value P(>|z|)
(Intercept)      2182.09   78.9884  27.6254  <1e-99
spkr: old        68.0833   17.0684  3.98886   <1e-4
prec: maintain  -333.586   17.0684 -19.5441  <1e-84
load: yes        78.3546   17.0683  4.59065   <1e-5

The estimate of the (Intercept) coefficient is now approximately the sample mean. (It may be slightly different from the sample mean due to the
slight imbalance in the design because of two unusually long response times being dropped.)

The coefficients for the experimental factors are half the previously estimated effects, because of the $\pm1$ coding, Their standard errors are also half those from the previous fit. The estimates of the variance components, the log-likelihood and other goodness of fit criteria, and the z statistics and p-values for the effects remain the same.

The model matrix for the fixed effects is now


In [14]:
m1.X


Out[14]:
1789×4 Array{Float64,2}:
 1.0  -1.0  -1.0   1.0
 1.0   1.0   1.0  -1.0
 1.0   1.0  -1.0  -1.0
 1.0  -1.0   1.0  -1.0
 1.0  -1.0  -1.0  -1.0
 1.0   1.0   1.0   1.0
 1.0   1.0  -1.0   1.0
 1.0  -1.0   1.0   1.0
 1.0  -1.0  -1.0   1.0
 1.0   1.0   1.0  -1.0
 1.0   1.0  -1.0  -1.0
 1.0  -1.0   1.0  -1.0
 1.0  -1.0  -1.0  -1.0
 ⋮                    
 1.0  -1.0   1.0  -1.0
 1.0  -1.0  -1.0  -1.0
 1.0   1.0   1.0   1.0
 1.0   1.0  -1.0   1.0
 1.0  -1.0   1.0   1.0
 1.0  -1.0  -1.0   1.0
 1.0   1.0   1.0  -1.0
 1.0   1.0  -1.0  -1.0
 1.0  -1.0   1.0  -1.0
 1.0  -1.0  -1.0  -1.0
 1.0   1.0   1.0   1.0
 1.0   1.0  -1.0   1.0

Model construction versus model optimization

The m1 object is created in the call to the constructor function, LinearMixedModel, then the parameters are optimized or fit in the call to fit!. Usually the process of fitting a model will take longer than creating the numerical representation but, for simple models like this, the creation time can be a significant portion of the overall running time.


In [15]:
@btime LinearMixedModel($f1, $kb07, $contrasts);


  923.172 μs (1583 allocations: 805.44 KiB)

In [16]:
@btime fit!($m1);


  1.370 ms (1716 allocations: 44.09 KiB)

Factors affecting the time to optimize the parameters

The optimization process is summarized in the optsum property of the model.


In [17]:
m1.optsum


Out[17]:
Initial parameter vector: [1.0, 1.0]
Initial objective value:  28881.249730604657

Optimizer (from NLopt):   LN_BOBYQA
Lower bounds:             [0.0, 0.0]
ftol_rel:                 1.0e-12
ftol_abs:                 1.0e-8
xtol_rel:                 0.0
xtol_abs:                 [1.0e-10, 1.0e-10]
initial_step:             [0.75, 0.75]
maxfeval:                 -1

Function evaluations:     28
Final parameter vector:   [0.441384, 0.503876]
Final objective value:    28823.13873586134
Return code:              FTOL_REACHED

For this model there are two parameters to be optimized because the objective function, negative twice the log-likelihood, can be profiled with respect to all the other parameters. (See section 3 of Bates et al. 2015 for details.) Both these parameters must be non-negative (i.e. both have a lower bound of zero) and both have an initial value of one. After 21 function evaluations an optimum is declared according to the function value tolerance, either $10^{-8}$ in absolute terms or $10^{-12}$ relative to the current value.

The optimization itself has a certain amount of setup and summary time but the majority of the time is spent in the evaluation of the objective - the profiled log-likelihood.

Each function evaluation is of the form


In [18]:
const θ1 = m1.θ;

In [19]:
@btime objective(updateL!(setθ!($m1, $θ1)));


  42.720 μs (57 allocations: 1.42 KiB)

Remember that this is a "best case" time. During the course of the optimization some of the function evaluations could take longer due to garbage collection or other factors.

On this machine 21 function evaluations, each taking 42 microseconds or more, gives the total function evaluation time of at least 0.85 ms., which is practically all of the time to fit the model.

The majority of the time for the function evaluation for this model is in the call to updateL!


In [20]:
@btime updateL!($m1);


  32.016 μs (17 allocations: 320 bytes)

This is an operation that updates the lower Cholesky factor (often written as L) of a blocked sparse matrix.

There are 4 rows and columns of blocks. The first row and column correspond to the random effects for subject, the second to the random effects for item, the third to the fixed-effects parameters and the fourth to the response. Their sizes and types are


In [21]:
describeblocks(m1)


1,1: LinearAlgebra.Diagonal{Float64,Array{Float64,1}} (56, 56) LinearAlgebra.Diagonal{Float64,Array{Float64,1}}
2,1: Array{Float64,2} (32, 56) Array{Float64,2}
2,2: LinearAlgebra.Diagonal{Float64,Array{Float64,1}} (32, 32) Array{Float64,2}
3,1: Array{Float64,2} (4, 56) Array{Float64,2}
3,2: Array{Float64,2} (4, 32) Array{Float64,2}
3,3: Array{Float64,2} (4, 4) Array{Float64,2}
4,1: Array{Float64,2} (1, 56) Array{Float64,2}
4,2: Array{Float64,2} (1, 32) Array{Float64,2}
4,3: Array{Float64,2} (1, 4) Array{Float64,2}
4,4: Array{Float64,2} (1, 1) Array{Float64,2}

There are two lower-triangular blocked matrices: A with fixed entries determined by the model and data, and L which is updated for each evaluation of the objective function. The type of the A block is given before the size and the type of the L block is after the size. For scalar random effects, generated by a random-effects term like (1|G), the (1,1) block is always diagonal for both A and L. Its size is the number of levels of the grouping factor, G.

Because subject and item are crossed, the (2,1) block of A is dense, as is the (2,1) block of L. The (2,2) block of A is diagonal because, like the (1,1) block, it is generated from a scalar random effects term. However, the (2,2) block of L ends up being dense as a result of "fill-in" in the sparse Cholesky factorization. All the blocks associated with the fixed-effects or the response are stored as dense matrices but their dimensions are (relatively) small.

Increasing the complexity

In general, adding more terms to a model will increase the time required to fit the model. However, there is a big difference between adding fixed-effects terms and adding complexity to the random effects.

Adding the two- and three-factor interactions to the fixed-effects terms increases the time required to fit the model.


In [22]:
const f2 = @formula(rt_trunc ~ 1 + spkr*prec*load + (1|subj) + (1|item));

In [23]:
const m2 = @btime fit!(LinearMixedModel($f2, $kb07, $contrasts))


  3.902 ms (5685 allocations: 1.65 MiB)
Out[23]:
Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + load + spkr & prec + spkr & load + prec & load + spkr & prec & load + (1 | subj) + (1 | item)
     logLik       -2 logLik         AIC            BIC      
 -1.4409243×10⁴ 2.88184861×10⁴ 2.88404861×10⁴ 2.89008696×10⁴

Variance components:
            Variance   Std.Dev. 
 subj      101581.912 318.71917
 item      132326.015 363.76643
 Residual  519722.754 720.91799
 Number of obs: 1789; levels of grouping factors: 56, 32

  Fixed-effects parameters:
                                        Estimate Std.Error  z value P(>|z|)
(Intercept)                               2182.0   78.9917  27.6231  <1e-99
spkr: old                                68.0177   17.0451  3.99045   <1e-4
prec: maintain                          -333.652   17.0451 -19.5747  <1e-84
load: yes                                78.4464   17.0451   4.6023   <1e-5
spkr: old & prec: maintain              -21.6417   17.0451 -1.26968  0.2042
spkr: old & load: yes                    18.1151   17.0451  1.06278  0.2879
prec: maintain & load: yes                4.2647   17.0451 0.250201  0.8024
spkr: old & prec: maintain & load: yes   23.2834   17.0451  1.36599  0.1719

(Notice that none of the interactions are statistically significant.)

In this case, the increase in fitting time is more because the number of function evaluations to determine the optimum increases than because of increased evaluation time for the objective function.


In [24]:
m2.optsum.feval


Out[24]:
32

In [25]:
const θ2 = m2.θ;

In [26]:
@btime objective(updateL!(setθ!($m2, $θ2)));


  48.086 μs (57 allocations: 1.42 KiB)

Increasing complexity of the random effects

Another way in which the model can be extended is to switch to vector-valued random effects. Sometimes this is described as having random slopes, so that a subject not only brings their own shift in the typical response but also their own shift in the change due to, say, Load versus No Load. Instead of just one, scalar, change associated with each subject there is an entire vector of changes in the coefficients.

A model with a random slopes for each of the experimental factors for both subject and item is specified as


In [27]:
const f3 = @formula(rt_trunc ~ 1 + spkr*prec*load + (1+spkr+prec+load|subj) + (1+spkr+prec+load|item));

In [28]:
const m3 = @btime fit!(LinearMixedModel($f3, $kb07, $contrasts))


  517.561 ms (715239 allocations: 31.42 MiB)
Out[28]:
Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + load + spkr & prec + spkr & load + prec & load + spkr & prec & load + (1 + spkr + prec + load | subj) + (1 + spkr + prec + load | item)
     logLik        -2 logLik          AIC             BIC       
 -1.43189855×10⁴   2.8637971×10⁴   2.8695971×10⁴   2.8855164×10⁴

Variance components:
             Variance   Std.Dev.    Corr.
 subj       91130.9567 301.879043
             1092.7366  33.056567  1.00
             3463.3273  58.850041 -0.62 -0.62
             4481.7460  66.945844  0.36  0.36  0.51
 item      131377.4183 362.460230
             1690.3829  41.114266  0.42
            61168.0277 247.321709 -0.69  0.37
             1882.8841  43.392212  0.29  0.14 -0.13
 Residual  447630.8098 669.052173
 Number of obs: 1789; levels of grouping factors: 56, 32

  Fixed-effects parameters:
                                        Estimate Std.Error  z value P(>|z|)
(Intercept)                              2181.64   77.3507  28.2046  <1e-99
spkr: old                                67.7496   17.9606  3.77213  0.0002
prec: maintain                           -333.92    47.155 -7.08133  <1e-11
load: yes                                78.8007    19.726  3.99476   <1e-4
spkr: old & prec: maintain               -21.996   15.8191 -1.39047  0.1644
spkr: old & load: yes                    18.3832   15.8191  1.16209  0.2452
prec: maintain & load: yes               4.53273   15.8191 0.286535  0.7745
spkr: old & prec: maintain & load: yes   23.6377   15.8191  1.49425  0.1351

There are several interesting aspects of this model fit.

First, the number of parameters optimized directly has increased substantially. What was previously a 2-dimensional optimization has now become 20 dimensional.


In [29]:
m3.optsum


Out[29]:
Initial parameter vector: [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0]
Initial objective value:  29303.38411137385

Optimizer (from NLopt):   LN_BOBYQA
Lower bounds:             [0.0, -Inf, -Inf, -Inf, 0.0, -Inf, -Inf, 0.0, -Inf, 0.0, 0.0, -Inf, -Inf, -Inf, 0.0, -Inf, -Inf, 0.0, -Inf, 0.0]
ftol_rel:                 1.0e-12
ftol_abs:                 1.0e-8
xtol_rel:                 0.0
xtol_abs:                 [1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10, 1.0e-10]
initial_step:             [0.75, 1.0, 1.0, 1.0, 0.75, 1.0, 1.0, 0.75, 1.0, 0.75, 0.75, 1.0, 1.0, 1.0, 0.75, 1.0, 1.0, 0.75, 1.0, 0.75]
maxfeval:                 -1

Function evaluations:     781
Final parameter vector:   [0.451204, 0.0494081, -0.0547485, 0.0358954, 0.0, -0.0657285, -0.0891811, 0.0204788, 0.0277562, 0.0, 0.541752, 0.0259224, -0.253773, 0.0189352, 0.0557164, 0.267834, 0.000872977, 0.0226436, 0.0620244, 0.0]
Final objective value:    28637.9710101938
Return code:              FTOL_REACHED

and the number of function evaluations to convergence has gone from under 40 to over 650.

The time required for each function evaluation has also increased considerably,


In [30]:
const θ3 = m3.θ;

In [31]:
@btime objective(updateL!(setθ!($m3, $θ3)));


  592.043 μs (559 allocations: 29.98 KiB)

resulting in much longer times for model fitting - about half a second for the "best case" on this machine.

Notice that the estimates of the fixed-effects coefficients and their standard errors have not changed substantially except for the standard error of P (Precedent), which is also the largest effect.

The parameters in the optimization for this model can be arranged as two lower-triangular 4 by 4 matrices.


In [32]:
m3.λ[1]


Out[32]:
4×4 LinearAlgebra.LowerTriangular{Float64,Array{Float64,2}}:
  0.451204     ⋅          ⋅          ⋅ 
  0.0494081   0.0         ⋅          ⋅ 
 -0.0547485  -0.0657285  0.0204788   ⋅ 
  0.0358954  -0.0891811  0.0277562  0.0

In [33]:
m3.λ[2]


Out[33]:
4×4 LinearAlgebra.LowerTriangular{Float64,Array{Float64,2}}:
  0.541752    ⋅            ⋅          ⋅ 
  0.0259224  0.0557164     ⋅          ⋅ 
 -0.253773   0.267834     0.0226436   ⋅ 
  0.0189352  0.000872977  0.0620244  0.0

which generate the covariance matrices for the random effects. The cumulative proportion of the variance in the principal components of these covariance matrices, available as


In [34]:
m3.rePCA


Out[34]:
2-element Array{Array{Float64,1},1}:
 [0.93984, 1.0, 1.0, 1.0]      
 [0.853217, 0.991263, 1.0, 1.0]

show that 93% of the variation in the random effects for subject is in the first principal direction and 99% in the first two principal directions. The random effects for item also have 99% of the variation in the first two principal directions.

Furthermore the estimates of the standard deviations of the "slope" random effects are much smaller than the those of the intercept random effects except for the P coefficient random effect for item, which suggests that the model could be reduced to rt_trunc ~ 1 + spkr*prec*load + (1|subj) + (1+prec|item) or even RTtrunc ~ 1 + spkr+prec+load + (1|subj) + (1+prec|item).


In [35]:
const f4 = @formula(rt_trunc ~ 1 + spkr+prec+load + (1|subj) + (1+prec|item));

In [36]:
const m4 = @btime fit!(LinearMixedModel($f4, $kb07, $contrasts))


  17.886 ms (75285 allocations: 3.31 MiB)
Out[36]:
Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + load + (1 | subj) + (1 + prec | item)
     logLik        -2 logLik          AIC             BIC       
 -1.43319251×10⁴  2.86638501×10⁴  2.86818501×10⁴  2.87312548×10⁴

Variance components:
            Variance   Std.Dev.   Corr.
 item      133015.242 364.71255
            63766.936 252.52116 -0.70
 subj       88819.437 298.02590
 Residual  462443.388 680.03190
 Number of obs: 1789; levels of grouping factors: 32, 56

  Fixed-effects parameters:
                Estimate Std.Error  z value P(>|z|)
(Intercept)      2181.85   77.4681  28.1645  <1e-99
spkr: old         67.879   16.0785  4.22172   <1e-4
prec: maintain  -333.791   47.4472 -7.03499  <1e-11
load: yes        78.5904   16.0785  4.88792   <1e-5

In [37]:
m4.optsum.feval


Out[37]:
93

In [38]:
const θ4 = m4.θ;

In [39]:
@btime objective(updateL!(setθ!($m4, $θ4)));


  136.383 μs (289 allocations: 14.30 KiB)

These two model fits can be compared with one of the information criteria, AIC or BIC, for which "smaller is better". They both indicate a preference for the smaller model, m4.

These criteria are values of the objective, negative twice the log-likelihood at convergence, plus a penalty that depends on the number of parameters being estimated.

Because model m4 is a special case of model m3, a likelihood ratio test could also be used. The alternative model, m3, will always produce an objective that is less than or equal to that from the null model, m4. The difference in these value is similar to the change in the residual sum of squares in a linear model fit. This objective would be called the deviance if there was a way of defining a saturated model but it is not clear what this should be. However, if there was a way to define a deviance then the difference in the deviances would be the same as the differences in these objectives, which is


In [40]:
diff(objective.([m3, m4]))


Out[40]:
1-element Array{Float64,1}:
 25.879117466858588

This difference is compared to a $\chi^2$ distribution with degrees of freedom corresponding to the difference in the number of parameters


In [41]:
diff(dof.([m4, m3]))


Out[41]:
1-element Array{Int64,1}:
 20

producing a p-value of


In [42]:
ccdf(Chisq(20), first(diff(objective.([m3,m4]))))


Out[42]:
0.16984148520447773

The "maximal" model as proposed by Barr et al., 2013 would include all possible interactions of experimental and grouping factors.


In [43]:
const f5 = @formula(rt_trunc ~ 1 + spkr*prec*load + (1+spkr*prec*load|subj) + (1+spkr*prec*load|item));

In [44]:
const m5 = @btime fit!(LinearMixedModel($f5, $kb07, $contrasts))


  10.297 s (2568234 allocations: 107.73 MiB)
Out[44]:
Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + load + spkr & prec + spkr & load + prec & load + spkr & prec & load + (1 + spkr + prec + spkr & prec + load + spkr & load + prec & load + spkr & prec & load | subj) + (1 + spkr + prec + spkr & prec + load + spkr & load + prec & load + spkr & prec & load | item)
     logLik        -2 logLik          AIC             BIC       
 -1.42891729×10⁴  2.85783458×10⁴  2.87403458×10⁴  2.91849882×10⁴

Variance components:
             Variance   Std.Dev.    Corr.
 subj       90608.5644 301.012565
             5143.6952  71.719559  0.43
             5644.8543  75.132245 -0.48 -0.09
             9013.7657  94.940854 -0.20 -0.76  0.55
             7685.1682  87.665091  0.22  0.22  0.39  0.20
             1817.5847  42.633141 -0.45 -0.51  0.10  0.29  0.45
             7501.2782  86.609919 -0.10 -0.14 -0.04  0.04 -0.86 -0.72
             3885.1757  62.331178  0.48  0.41  0.37  0.16  0.09 -0.78  0.39
 item      130494.7324 361.240546
             1749.7106  41.829542  0.36
            62347.1395 249.694092 -0.68  0.43
             1036.0759  32.188132 -0.57 -0.79 -0.03
             2871.3208  53.584706  0.20  0.06 -0.17 -0.03
             1472.8731  38.378029 -0.30  0.03  0.31 -0.20 -0.53
             4738.0323  68.833366  0.07  0.26  0.22  0.25 -0.15 -0.03
             4990.5418  70.643767 -0.05 -0.48 -0.31  0.65  0.69 -0.72  0.11
 Residual  401618.2118 633.733550
 Number of obs: 1789; levels of grouping factors: 56, 32

  Fixed-effects parameters:
                                        Estimate Std.Error  z value P(>|z|)
(Intercept)                              2181.86   76.9448  28.3562  <1e-99
spkr: old                                67.7443    19.263   3.5168  0.0004
prec: maintain                          -333.925   47.6832   -7.003  <1e-11
load: yes                                 78.581   21.2485  3.69819  0.0002
spkr: old & prec: maintain              -21.7763   20.4419 -1.06528  0.2868
spkr: old & load: yes                    18.3885   17.4074  1.05636  0.2908
prec: maintain & load: yes               4.53807   22.5067 0.201632  0.8402
spkr: old & prec: maintain & load: yes   23.4181   21.2101   1.1041  0.2695

As is common in models with high-dimensional vector-valued random effects, the dominant portion of the variation is in the first few principal components


In [45]:
m5.rePCA


Out[45]:
2-element Array{Array{Float64,1},1}:
 [0.724878, 0.846927, 0.94087, 0.999998, 1.0, 1.0, 1.0, 1.0]     
 [0.796978, 0.94188, 0.975067, 0.996119, 0.999996, 1.0, 1.0, 1.0]

For both the subjects and the items practically all the variation of these 8-dimensional random effects is in the first 4 principal components.

The dimension of $\theta$, the parameters in the optimization, increases considerably


In [46]:
const θ5 = m5.θ;
length(θ5)


Out[46]:
72

Of these 72 parameters, 36 are estimated from variation between items, yet there are only 32 items.

Because the dimension of the optimization problem has gotten much larger the number of function evaluations to convergence increases correspondingly.


In [47]:
m5.optsum.feval


Out[47]:
2810

Also, each function evaluation requires more time


In [48]:
@btime objective(updateL!(setθ!($m5, $θ5)));


  2.284 ms (561 allocations: 30.02 KiB)

almost all of which is for the call to updateL!.


In [49]:
@btime updateL!($m5);


  2.228 ms (519 allocations: 28.88 KiB)

To provide more granularity in the plots of execution time shown below, fit one more model without random effects for the third-order interaction of the experimental factors.


In [50]:
const f6 = @formula(rt_trunc ~ 1 + spkr*prec*load + 
    (1+spkr+prec+load+spkr&prec+spkr&load+prec&load|subj) + 
    (1+spkr+prec+load+spkr&prec+spkr&load+prec&load|item));

In [51]:
const m6 = @btime fit!(LinearMixedModel($f6, $kb07, $contrasts))


  2.950 s (1526668 allocations: 59.76 MiB)
Out[51]:
Linear mixed model fit by maximum likelihood
 rt_trunc ~ 1 + spkr + prec + load + spkr & prec + spkr & load + prec & load + spkr & prec & load + (1 + spkr + prec + load + spkr & prec + spkr & load + prec & load | subj) + (1 + spkr + prec + load + spkr & prec + spkr & load + prec & load | item)
     logLik        -2 logLik          AIC             BIC       
 -1.43019526×10⁴  2.86039053×10⁴  2.87339053×10⁴   2.9090717×10⁴

Variance components:
             Variance    Std.Dev.    Corr.
 subj       92941.31347 304.862778
             5017.82659  70.836619  0.49
             5026.50804  70.897871 -0.53 -0.18
             7197.78240  84.839745  0.24  0.24  0.33
             8159.97984  90.332607 -0.19 -0.76  0.55  0.17
             1310.88763  36.206182 -0.58 -0.52  0.43  0.52  0.43
             7451.93456  86.324588 -0.17 -0.23 -0.09 -0.93  0.03 -0.60
 item      130595.36541 361.379808
              229.13521  15.137213  1.00
            62451.29181 249.902565 -0.68 -0.68
             2103.67100  45.865793  0.21  0.21 -0.10
              708.89408  26.625065 -0.59 -0.59 -0.04 -0.35
              335.37064  18.313128 -0.78 -0.78  0.69 -0.41  0.54
             4296.57645  65.548276  0.11  0.11  0.27 -0.32  0.10  0.42
 Residual  417488.17855 646.133251
 Number of obs: 1789; levels of grouping factors: 56, 32

  Fixed-effects parameters:
                                        Estimate Std.Error  z value P(>|z|)
(Intercept)                              2181.91   77.2928  28.2292  <1e-99
spkr: old                                67.7697   18.1705  3.72966  0.0002
prec: maintain                            -333.9   47.6945 -7.00081  <1e-11
load: yes                                78.5329   20.6802  3.79749  0.0001
spkr: old & prec: maintain              -21.7283   20.0317 -1.08469  0.2781
spkr: old & load: yes                    18.3631    16.349  1.12319  0.2614
prec: maintain & load: yes               4.51266   22.3772 0.201663  0.8402
spkr: old & prec: maintain & load: yes     23.37   15.2775   1.5297  0.1261

In [52]:
const θ6 = m6.θ;
length(θ6)


Out[52]:
56

In [53]:
@btime objective(updateL!(setθ!($m6, $θ6)));


  1.568 ms (561 allocations: 30.02 KiB)

In [54]:
@btime updateL!($m6);


  1.538 ms (519 allocations: 28.88 KiB)

Summary of goodness of fit

Apply the goodness of fit measures to m1 to m6 creating a data frame


In [55]:
const mods = [m1, m2, m3, m4, m5, m6];
gofsumry = DataFrame(dof=dof.(mods), deviance=deviance.(mods),
    AIC = aic.(mods), AICc = aicc.(mods), BIC = bic.(mods))


Out[55]:

6 rows × 5 columns

dofdevianceAICAICcBIC
Int64Float64Float64Float64Float64
1728823.128837.128837.228875.6
21128818.528840.528840.628900.9
32928638.028696.028697.028855.2
4928663.928681.928682.028731.3
58128578.328740.328748.129185.0
66528603.928733.928738.929090.7

Here dof or degrees of freedom is the total number of parameters estimated in the model and deviance is simply negative twice the log-likelihood at convergence, without a correction for a saturated model. All the information criteria are on a scale of "smaller is better" and all would select model 4 as "best".


In [56]:
gofnms = (:AIC, :AICc, :BIC)
map(nm -> argmin(gofsumry[nm]), NamedTuple{gofnms, NTuple{3, Symbol}}(gofnms))


Out[56]:
(AIC = 4, AICc = 4, BIC = 4)

In [57]:
nfe(m) = length(coef(m));
nre1(m) = MixedModels.vsize(first(m.reterms));
nlv1(m) = MixedModels.nlevs(first(m.reterms));
nre2(m) = MixedModels.vsize(last(m.reterms));
nlv2(m) = MixedModels.nlevs(last(m.reterms));
(m) = sum(MixedModels., m.reterms);
nev = map(m -> m.optsum.feval, mods);
μsev = [42.035, 46.950, 589.784, 134.539, 2248, 1552];
fit = [0.002072, 0.004233, 0.428509, 0.015163, 5.536, 2.423]
dimsumry = DataFrame(p = nfe.(mods), q1 = nre1.(mods), n1 = nlv1.(mods),
    q2 = nre2.(mods), n2 = nlv2.(mods),  = .(mods),
    npar = dof.(mods), nev = nev, μsev = μsev, 
    fit = fit)


Out[57]:

6 rows × 10 columns

pq1n1q2n2nparnevμsevfit
Int64Int64Int64Int64Int64Int64Int64Int64Float64Float64
14156132272842.0350.002072
281561322113246.950.004233
384564322029781589.7840.428509
442321564993134.5390.015163
58856832728128102248.05.536
68756732566513561552.02.423

In this table, p is the dimension of the fixed-effects vector, q1 is the dimension of the random-effects for each of the n1 levels of the first grouping factor while q2 and n2 are similar for the second grouping factor. is the dimension of the parameter vector and npar is the total number of parameters being estimated, and is equal to nθ + p + 1.

nev is the number of function evaluations to convergence. Because this number will depend on the number of parameters in the model and several other factors such as the setting of the convergence criteria, only its magnitude should be considered reproducible. μsev is the best-case time, in microseconds, per function evaluation and fit is the best-case time, in seconds, to fit the model.

Relationships between model complexity and fitting time

As would be expected, the time required to fit a model increases with the complexity of the model. In particular, the number of function evaluations to convergence increases with the number of parameters being optimized.


In [58]:
plot(dimsumry, x=:nθ, y=:nev, Geom.point,
    Guide.xlabel("# of parameters in the optimization"),
    Guide.ylabel("Function evaluations to convergence"))


Out[58]:
# of parameters in the optimization -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -4×10³ -3×10³ -2×10³ -1×10³ 0 1×10³ 2×10³ 3×10³ 4×10³ 5×10³ 6×10³ 7×10³ -3000 -2900 -2800 -2700 -2600 -2500 -2400 -2300 -2200 -2100 -2000 -1900 -1800 -1700 -1600 -1500 -1400 -1300 -1200 -1100 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700 4800 4900 5000 5100 5200 5300 5400 5500 5600 5700 5800 5900 6000 -3000 0 3000 6000 -3000 -2800 -2600 -2400 -2200 -2000 -1800 -1600 -1400 -1200 -1000 -800 -600 -400 -200 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2200 2400 2600 2800 3000 3200 3400 3600 3800 4000 4200 4400 4600 4800 5000 5200 5400 5600 5800 6000 Function evaluations to convergence

The relationship is more-or-less linear, based on this small sample.

The time to evaluate the objective also increases with n$\theta$, the number parameters in the random-effects covariance matrices.


In [59]:
plot(dimsumry, x=:nθ, y=:μsev, Geom.point, 
    Guide.xlabel("# of parameters in the optimization"),
    Guide.ylabel("Microseconds per function evaluation"))


Out[59]:
# of parameters in the optimization -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -3000 -2500 -2000 -1500 -1000 -500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 -2500 -2400 -2300 -2200 -2100 -2000 -1900 -1800 -1700 -1600 -1500 -1400 -1300 -1200 -1100 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700 4800 4900 5000 -2500 0 2500 5000 -2600 -2400 -2200 -2000 -1800 -1600 -1400 -1200 -1000 -800 -600 -400 -200 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2200 2400 2600 2800 3000 3200 3400 3600 3800 4000 4200 4400 4600 4800 5000 Microseconds per function evaluation

This relationship also appears linear, based on this small sample.

Finally, consider the time to fit the model versus the number of parameters in the optimization,


In [60]:
plot(dimsumry, x=:nθ, y=:fit, Geom.point, 
    Guide.xlabel("# of parameters in the optimization"),
    Guide.ylabel("Time (s) to fit the model"))


Out[60]:
# of parameters in the optimization -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 -100 0 100 200 -80 -75 -70 -65 -60 -55 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 h,j,k,l,arrows,drag to pan i,o,+,-,scroll,shift-drag to zoom r,dbl-click to reset c for coordinates ? for help ? -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 -6.0 -5.8 -5.6 -5.4 -5.2 -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2 10.4 10.6 10.8 11.0 11.2 11.4 11.6 11.8 12.0 -10 0 10 20 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 Time (s) to fit the model

which, given the shape of the previous plots, would be expected to be approximately quadratic.