Below we fit a linear mixed model using the Ruby gem mixed_models, and demonstrate many of the inference and prediction methods available for objects of class LMM.
- Example Data
- Linear Mixed Model
- Some model attributes
- Fitted values and residuals
- Fixed effects hypothesis tests and confidence intervals
- Predictions and prediction intervals
The data set, which is simulated, contains two numeric variables Age and Aggression, and two categorical variables Location and Species. These data are available for 100 (human and alien) individuals.
We will fit the model with the method LMM#from_formula, which mimics the behaviour of the function lmer from the R package lme4. The data is supplied to LMM#from_formula as a Daru::DataFrame (from the excellent Ruby gem daru). We load the data, and display the first 10 lines with:
In [1]:
require 'daru'
alien_species = Daru::DataFrame.from_csv '../examples/data/alien_species.csv'
# mixed_models expects all variable names in the data frame to be Ruby Symbols:
alien_species.vectors = Daru::Index.new(alien_species.vectors.map { |v| v.to_sym })
alien_species.head
Out[1]:
We model the Aggression level of an individual as a linear function of the Age (Aggression decreases with Age), with a different constant added for each Species (i.e. each species has a different base level of aggression). Moreover, we assume that there is a random fluctuation in Aggression due to the Location that an individual is at. Additionally, there is a random fluctuation in how Age affects Aggression at each different Location.
Thus, the Aggression level of an individual of Species $spcs$ who is at the Location $lctn$ can be expressed as: $$Aggression = \beta_{0} + \gamma_{spcs} + Age \cdot \beta_{1} + b_{lctn,0} + Age \cdot b_{lctn,1} + \epsilon,$$ where $\epsilon$ is a random residual, and the random vector $(b_{lctn,0}, b_{lctn,1})^T$ follows a multivariate normal distribution (the same distribution but different realizations of the random vector for each Location). That is, we have a linear mixed model with fixed effects $\beta_{0}, \beta_{1}, \gamma_{Dalek}, \gamma_{Ood}, \dots$, and random effects $b_{Asylum,0}, b_{Asylum,1}, b_{Earth,0},\dots$.
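Equivalently, stacking all observations, the same model can be written in the standard matrix form of a linear mixed model (this restatement uses generic LMM notation rather than symbols from the gem): $$y = X\beta + Zb + \epsilon, \quad b \sim N(0, \Sigma_b), \quad \epsilon \sim N(0, \sigma^2 I),$$ where $y$ is the vector of Aggression values, $X$ and $Z$ are the fixed and random effects design matrices implied by the covariates, $\beta$ collects $\beta_{0}, \beta_{1}, \gamma_{Dalek}, \gamma_{Ood}, \dots$, and $b$ stacks the location-specific pairs $(b_{lctn,0}, b_{lctn,1})^T$.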
We fit this model in mixed_models using a syntax familiar from the R package lme4, and display the estimated fixed and random effects coefficients:
In [2]:
require 'mixed_models'
model_fit = LMM.from_formula(formula: "Aggression ~ Age + Species + (Age | Location)",
                             data: alien_species)
puts "Fixed effects:"
puts model_fit.fix_ef
puts "Random effects:"
puts model_fit.ran_ef
Apart from the fixed and random effects coefficients (seen above), we can access many attributes of the fitted model. Among others:
- fix_ef_names and ran_ef_names are Arrays of the names of the fixed and random effects.
- reml indicates whether the profiled REML criterion or the profiled deviance function was optimized by the model fitting algorithm.
- formula returns the R-like formula used to fit the model as a String.
- model_data, optimization_result and dev_fun store the various model matrices in an LMMData object, the results of the utilized optimization algorithm, and the corresponding objective function as a Proc.
- sigma2 is the residual variance (unless weights was specified in the model fit).
- sigma_mat is the covariance matrix of the multivariate normal random effects vector.
We can look at some of these parameters for our example model:
In [3]:
puts "REML criterion used: \t#{model_fit.reml}"
puts "Residual variance: \t#{model_fit.sigma2}"
puts "Formula: \t" + model_fit.formula
puts "Variance of the intercept due to 'location' (i.e. variance of b0): \t#{model_fit.sigma_mat[0,0]}"
puts "Variance of the effect of 'age' due to 'location' (i.e. variance of b1): \t#{model_fit.sigma_mat[1,1]}"
puts "Covariance of b0 and b1: \t#{model_fit.sigma_mat[0,1]}"
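From these quantities one can, for example, obtain the correlation of the random intercept and the random slope. A minimal sketch in plain Ruby, using hypothetical values in place of the sigma_mat entries printed above (the actual numbers depend on the fitted model):

```ruby
# hypothetical variances and covariance, standing in for
# sigma_mat[0,0], sigma_mat[1,1] and sigma_mat[0,1]
var_b0    = 104.26
var_b1    = 0.105
cov_b0_b1 = -2.03

# correlation = covariance / product of standard deviations
corr = cov_b0_b1 / Math.sqrt(var_b0 * var_b1)
puts "Correlation of random intercept and slope: #{corr.round(4)}"
```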
Some further convenience methods are:
- sigma returns the square root of sigma2.
- theta returns the optimal solution of the minimization of the deviance function or the REML criterion (whichever was used to fit the model).
- deviance returns the value of the deviance function or the REML criterion at the optimal solution.
In [4]:
puts "Residual standard deviation: \t#{model_fit.sigma}"
puts "REML criterion: \t#{model_fit.deviance}"
In [5]:
puts "Fitted values at the population level:"
model_fit.fitted(with_ran_ef: false)
Out[5]:
In [6]:
puts "Model residuals:"
model_fit.residuals
Out[6]:
We can assess the goodness of the model fit (to some extent) by plotting the residuals against the fitted values, and checking for unexpected patterns. We use the gem gnuplotrb for plotting.
In [7]:
require 'gnuplotrb'
include GnuplotRB
x, y = model_fit.fitted, model_fit.residuals
fitted_vs_residuals = Plot.new([[x,y], with: 'points', pointtype: 6, notitle: true],
                               xlabel: 'Fitted', ylabel: 'Residuals')
Out[7]:
We see that the residuals look more or less like noise, which is good.
We can further assess the validity of the linear mixed model by checking whether the residuals appear to be approximately normally distributed.
In [8]:
# construct a histogram of the residuals with 10 bins
bin_width = (y.max - y.min)/10.0
bins = (y.min..y.max).step(bin_width).to_a
# relative frequency of residuals falling into each bin;
# the last bin is closed on the right, so that the maximal residual is counted too
rel_freq = Array.new(bins.length-1){0.0}
y.each do |r|
  0.upto(bins.length-2) do |i|
    if r >= bins[i] && (r < bins[i+1] || (i == bins.length-2 && r <= bins[i+1]))
      rel_freq[i] += 1.0/y.length
    end
  end
end
# plot the histogram with bars centered at the bin midpoints
bins_center = bins[0...-1].map { |b| b + bin_width/2.0 }
residuals_hist = Plot.new([[bins_center, rel_freq], with: 'boxes', notitle: true],
                          style: 'fill solid 0.5')
Out[8]:
The histogram does not appear to be too different from a bell shaped curve, although it might be slightly skewed to the right.
We can further explore the validity of the normality assumption by looking at the Q-Q plot of the residuals.
In [9]:
require 'distribution'
observed = model_fit.residuals.sort
n = observed.length
# theoretical quantiles of N(0, sigma^2); the argument (t - 0.5)/n keeps the
# quantile function away from 0 and 1, where it would be infinite
theoretical = (1..n).to_a.map { |t| Distribution::Normal.p_value((t - 0.5)/n.to_f) * model_fit.sigma }
qq_plot = Plot.new([[theoretical, observed], with: 'points', pointtype: 6, notitle: true],
                   ['x', with: 'lines', notitle: true],
                   xlabel: 'Normal theoretical quantiles', ylabel: 'Observed quantiles',
                   title: 'Q-Q plot of the residuals')
Out[9]:
The straight line in the above plot is simply the diagonal. We see that the observed quantiles agree with the theoretical quantiles fairly well, as expected from a "good" model.
The covariance matrix of the fixed effects estimates is returned by LMM#fix_ef_cov_mat, and the standard deviations of the fixed effects coefficients are returned by LMM#fix_ef_sd. Methods for hypothesis tests and confidence intervals can be based on these values.
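To illustrate the relationship between these quantities, here is a minimal sketch of how a Wald Z statistic and its two-sided p-value can be computed from a coefficient estimate and its standard deviation. The numbers are hypothetical, and the normal CDF is implemented in plain Ruby so that the snippet runs without any gems (LMM performs these computations internally):

```ruby
# standard normal CDF via the error function
def normal_cdf(x)
  0.5 * (1.0 + Math.erf(x / Math.sqrt(2.0)))
end

# hypothetical fixed effects coefficient estimate and its standard deviation
beta_hat = 1.0627
sd       = 0.2757

z = beta_hat / sd                    # Wald Z statistic
p = 2.0 * (1.0 - normal_cdf(z.abs))  # two-sided p-value

puts "z = #{z.round(4)}, p-value = #{p}"
```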
In [10]:
model_fit.fix_ef_sd
Out[10]:
The Wald Z test statistics for the fixed effects coefficients can be computed with:
In [11]:
model_fit.fix_ef_z
Out[11]:
Based on the above Wald Z test statistics, we can carry out a hypothesis test for each fixed effects term $\beta_{i}$ or $\gamma_{species}$, testing the null $H_{0} : \beta_{i} = 0$ against the alternative $H_{a} : \beta_{i} \neq 0$, or respectively the null $H_{0} : \gamma_{species} = 0$ against the alternative $H_{a} : \gamma_{species} \neq 0$.
The corresponding (approximate) p-values are obtained with:
In [12]:
model_fit.fix_ef_p(method: :wald)
Out[12]:
We see that the aggression level of each species is significantly different from the base level (which is the species Dalek in this model), while statistically we don't have enough evidence to conclude that the age of an individual is a good predictor of his/her/its aggression level.
We can use the Wald method for confidence intervals as well. For example, 90% confidence intervals for each fixed effects coefficient estimate can be computed as follows.
In [13]:
conf_int = model_fit.fix_ef_conf_int(level: 0.9, method: :wald)
Out[13]:
For greater visual clarity, we can put the coefficient estimates and the confidence intervals into a Daru::DataFrame:
In [14]:
df = Daru::DataFrame.rows(conf_int.values, order: [:lower90, :upper90], index: model_fit.fix_ef_names)
df[:coef] = model_fit.fix_ef.values
df
Out[14]:
We can also use the fitted model to predict the response for new observations, with or without taking the random effects into account. We first load a new data set of individuals whose aggression levels we want to predict:
In [15]:
newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'
newdata.vectors = Daru::Index.new(newdata.vectors.map { |v| v.to_sym })
newdata
Out[15]:
In [16]:
puts "Predictions of aggression levels on a new data set:"
pred = model_fit.predict(newdata: newdata, with_ran_ef: true)
Out[16]:
Now we can add the computed predictions to the data set, in order to better see which of the individuals are likely to be particularly dangerous.
In [17]:
newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'
newdata.vectors = Daru::Index.new(newdata.vectors.map { |v| v.to_sym })
newdata[:Predicted_Aggression] = pred
newdata
Out[17]:
Since the estimated fixed and random effects coefficients most likely are not exactly the true values, we probably should look at interval estimates of the predictions, rather than the point estimates computed above.
Two types of such interval estimates are currently available in LMM. On the one hand, a confidence interval is an interval estimate of the mean value of the response for given covariates (i.e. a population parameter); on the other hand, a prediction interval is an interval estimate of a future observation (for further explanation of this distinction see for example https://stat.ethz.ch/education/semesters/ss2010/seminar/06_Handout.pdf).
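The difference between the two can be sketched with hypothetical numbers: both intervals are centered at the predicted value, but the prediction interval additionally accounts for the residual variance, so it is always wider. This is a simplified illustration in plain Ruby, not the exact computation performed by predict_with_intervals; all numeric values are made up:

```ruby
# standard normal quantile via bisection on the CDF (plain Ruby, no gems)
def normal_quantile(p)
  lo, hi = -10.0, 10.0
  40.times do
    mid = (lo + hi) / 2.0
    cdf = 0.5 * (1.0 + Math.erf(mid / Math.sqrt(2.0)))
    if cdf < p
      lo = mid
    else
      hi = mid
    end
  end
  (lo + hi) / 2.0
end

pred     = 900.0   # hypothetical predicted aggression level
var_mean = 25.0    # hypothetical variance of the estimated mean response
sigma2   = 100.0   # hypothetical residual variance
z = normal_quantile(1.0 - (1.0 - 0.88) / 2.0)  # quantile for an 88% interval

# confidence interval: uncertainty about the mean response only
conf_int = [pred - z * Math.sqrt(var_mean), pred + z * Math.sqrt(var_mean)]
# prediction interval: adds the residual variance of a future observation
pred_int = [pred - z * Math.sqrt(var_mean + sigma2), pred + z * Math.sqrt(var_mean + sigma2)]

puts "88% confidence interval: #{conf_int.inspect}"
puts "88% prediction interval: #{pred_int.inspect}"
```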
In [18]:
puts "88% confidence intervals for the predictions:"
ci = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :confidence)
Daru::DataFrame.new(ci, order: [:pred, :lower88, :upper88])
Out[18]:
In [19]:
puts "88% prediction intervals for the predictions:"
pi = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :prediction)
Daru::DataFrame.new(pi, order: [:pred, :lower88, :upper88])
Out[19]: