Often statistical models are used in order to determine which of the predictor variables have a significant relationship with the response variable. LMM
has a number of methods to aid with this kind of statistical inference.
Below we will fit a linear mixed model using the Ruby gem mixed_models, and demostrate various types of hypotheses tests and confidence intervals available for objects of class LMM
.
Data and linear mixed model
Likelihood ratio test
Fixed effects hypotheses tests
Fixed effects confidence intervals
Random effects hypotheses tests
We use the same data and model formulation as in a previous example, where we have looked at various parameter estimates, and have shown that the model fit is good.
The data set, which is simulated, contains two numeric variables Age and Aggression, and two categorical variables Location and Species. These data are available for 100 (human and alien) individuals.
We model the Aggression level of an individual as a linear function of the Age (Aggression decreases with Age), with a different constant added for each Species (i.e. each species has a different base level of aggression). Moreover, we assume that there is a random fluctuation in Aggression due to the Location that an individual is at. Additionally, there is a random fluctuation in how Age affects Aggression at each different Location.
Thus, the Aggression level of an individual of Species $spcs$ who is at the Location $lctn$ can be expressed as: $$Aggression = \beta_{0} + \gamma_{spcs} + Age \cdot \beta_{1} + b_{lctn,0} + Age \cdot b_{lctn,1} + \epsilon,$$ where $\epsilon$ is a random residual, and the random vector $(b_{lctn,0}, b_{lctn,1})^T$ follows a multivariate normal distribution (the same distribution but different realizations of the random vector for each Location). That is, we have a linear mixed model with fixed effects $\beta_{0}, \beta_{1}, \gamma_{Dalek}, \gamma_{Ood}, \dots$, and random effects $b_{Asylum,0}, b_{Asylum,1}, b_{Earth,0},\dots$.
We fit the model with the convenient method LMM#from_formula
, which mimics the behaviour of the function lmer
from the R
package lme4
.
In [1]:
require 'mixed_models'
alien_species = Daru::DataFrame.from_csv '../examples/data/alien_species.csv'
# mixed_models expects that all variable names in the data frame are ruby Symbols:
alien_species.vectors = Daru::Index.new(alien_species.vectors.map { |v| v.to_sym })
model_fit = LMM.from_formula(formula: "Aggression ~ Age + Species + (Age | Location)",
data: alien_species)
puts "Fixed effects terms estimates and some diagnostics:"
puts model_fit.fix_ef_summary.inspect(20)
puts "Random effects correlation structure:"
puts model_fit.ran_ef_summary.inspect(12)
Given two nested models, the LMM.likelihood_ratio_test
tests whether the restricted simpler model is adequate. In this context 'nested' means that all predictors used in the restricted model must also be predictors in the full model (i.e. one model is a reduced version of the other more complex model). This method works only if both models were fit using the deviance (as opposed to REML criterion) as the objective function for the minimization (i.e. fit with reml: false
). LMM.likelihood_ratio_test
returns the p-value of the test.
Two methods are available to compute the p-value: approximation by a Chi squared distribution as delineated in section 2.4.1 in Pinheiro & Bates "Mixed Effects Models in S and S-PLUS" (2000), and a method based on bootstrapping as delineated in section 4.2.3 in Davison & Hinkley "Bootstrap Methods and their Application" (1997).
For example we can test the model formulation as above against a simpler model, which assumes that Age is neither a fixed effect nor a random effect. We can compute a likelihood ratio using the method LMM.likelihood_ratio
.
In [2]:
complex_model = LMM.from_formula(formula: "Aggression ~ Age + Species + (Age | Location)",
data: alien_species, reml: false)
simple_model = LMM.from_formula(formula: "Aggression ~ Species + (1 | Location)",
data: alien_species, reml: false)
LMM.likelihood_ratio(simple_model, complex_model)
Out[2]:
In [3]:
chi2_p_value = LMM.likelihood_ratio_test(simple_model, complex_model, method: :chi2)
Out[3]:
The p-value is tiny, which implies that Age is a significant predictor of Aggression, and that the more complex model should be prefered.
In [4]:
bootstrap_p_value = LMM.likelihood_ratio_test(simple_model, complex_model, method: :bootstrap, nsim: 1000)
Out[4]:
Even though the p-value is not as extreme as with the Chi squared test, it still shows significance of the variable Age.
Significance tests for the fixed effects can be performed with LMM#fix_ef_p
(or its alias LMM#fix_ef_test
). For a given fixed effects coefficient estimate the tested null hypothesis is that the true value of the coefficient is zero (i.e. no linear relationship to the response).
That is, for the above model formulation we carry out hypotheses tests for each fixed effects terms $\beta_{i}$ or $\gamma_{species}$, testing the null $H_{0} : \beta_{i} = 0$ against the alternative $H_{a} : \beta_{i} \neq 0$, or respectively the null $H_{0} : \gamma_{species} = 0$ against the alternative $H_{a} : \gamma_{species} \neq 0$.
LMM
currently offers three methods of hypotheses testing for fixed effects: Wald Z test, likelihood ratio test, and a bootstrap test. For a good discussion of the validity of different methods see this entry from the wiki of the r-sig-mixed-models mailing list. Moreover, due to the equivalence of hypotheses tests and confidence intervals, an additional hypothesis testing tool are bootstrap confidence intervals which are described in a different section below.
The likelihood ratio test for fixed effects is actually merely a convenience method. It is a convenient interface to LMM.likelihood_ratio_test
with method: :chi2
described above. For example, we can test whether the fixed effects term Age is a significant predictor of Aggression in complex_model
as follows.
In [5]:
lrt_p_value = complex_model.fix_ef_p(variable: :Age, method: :lrt)
Out[5]:
We see that Age does not seem be significant as a fixed effects term, even though it is in general a significant predictor as we have seen before.
In [6]:
bootstrap_p_value = complex_model.fix_ef_p(variable: :Age, method: :bootstrap, nsim: 1000)
Out[6]:
This result confirms the conclusion based of method: :lrt
.
In [7]:
model_fit.fix_ef_sd
Out[7]:
Based on the values returned by LMM#fix_ef_sd
, the Wald Z test statistics for all fixed effects coefficients are computed by the method LMM#fix_ef_z
.
In [8]:
model_fit.fix_ef_z
Out[8]:
Based on the above Wald Z test statistics, hypotheses tests for each fixed effects term can be carried out.
The corresponding (approximate) p-values are obtained with LMM#fix_ef_p
as follows.
In [9]:
model_fit.fix_ef_p(method: :wald)
Out[9]:
We see that Aggression of each Species is significantly different from the base level (which is the species Dalek in this model), while statistically we don't have enough evidence to conclude that Age of an individual is a good predictor of his/her/its aggression level (which agrees with the conclusion obtained above with method: :lrt
and method: :bootstrap
).
Confidence intervals for the fixed effects terms can be computed with the method LMM#fix_ef_conf_int
. The following types of confidence intervals are available:
A detailed description of the bootstrap methods available in LMM#fix_ef_conf_int
is given in this blog post.
For example we can compute the studentized bootstrap confidence intervals (also called bootstrap-t) with 98% coverage as follows.
In [10]:
bootstrap_t_intervals = model_fit.fix_ef_conf_int(level: 0.98, method: :bootstrap,
boottype: :studentized, nsim: 1000)
Out[10]:
We see that Age is the only fixed effects predictor whose confidence interval contains zero, which implies that it probably has little linear relationship as a fixed effect to the response variable Aggression.
In [11]:
conf_int = model_fit.fix_ef_conf_int(level: 0.9, method: :wald)
Out[11]:
For greated visual clarity we can put the coefficient estimates and the confidence intervals into a Daru::DataFrame
:
In [12]:
df = Daru::DataFrame.rows(conf_int.values, order: [:lower90, :upper90], index: model_fit.fix_ef_names)
df[:coef] = model_fit.fix_ef.values
df
Out[12]:
With method: :all
, LMM#fix_ef_conf_int
returns a Daru::DataFrame containing the confidence intervals obtained by each of the available methods. The data frame can be printed in form of a nice looking table for inspection.
For example for the alien species data we obtain all types of 95% confidence intervals with:
In [13]:
ci = model_fit.fix_ef_conf_int(method: :all, nsim: 1000)
Out[13]:
Since here we are dealing with data that was simulated according to the assumptions of the linear mixed model, all parameters end up approximately meeting the normality assumptions, and therefore all confidence interval methods turn out to be pretty much equivalent. Often when analyzing less ideal data, this will not be the case. Then it might be necessary to compare different types of confidence intervals in order to draw the right conclusions.
In [14]:
complex_model.ran_ef_p(variable: :intercept, grouping: :Location, method: :lrt)
Out[14]:
This suggests that the variability in Aggression is significantly influenced by the effect of Location.
In [15]:
complex_model.ran_ef_p(variable: :Age, grouping: :Location, method: :bootstrap, nsim: 1000)
Out[15]:
We see that in fact the change of Age due to Location is a significant random effect.