Below we will fit a linear mixed model using the Ruby gem mixed_models and demonstrate the available prediction methods.
We use the same data and model formulation as in several previous examples, where we have looked at various parameter estimates (1) and demostrated many types hypotheses tests as well as confidence intervals (2).
The data set, which is simulated, contains two numeric variables Age and Aggression, and two categorical variables Location and Species. These data are available for 100 (human and alien) individuals.
We model the Aggression level of an individual of Species $spcs$ who is at the Location $lctn$ as:
$$Aggression = \beta_{0} + \gamma_{spcs} + Age \cdot \beta_{1} + b_{lctn,0} + Age \cdot b_{lctn,1} + \epsilon,$$where $\epsilon$ is a random residual, and the random vector $(b_{lctn,0}, b_{lctn,1})^T$ follows a multivariate normal distribution (the same distribution but different realizations of the random vector for each Location).
We fit this model in mixed_models
using a syntax familiar from the R
package lme4
.
In [1]:
require 'mixed_models'
alien_species = Daru::DataFrame.from_csv '../examples/data/alien_species.csv'
# mixed_models expects that all variable names in the data frame are ruby Symbols:
alien_species.vectors = Daru::Index.new(alien_species.vectors.map { |v| v.to_sym })
model_fit = LMM.from_formula(formula: "Aggression ~ Age + Species + (Age | Location)",
data: alien_species)
model_fit.fix_ef_summary
Out[1]:
In [2]:
newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'
newdata.vectors = Daru::Index.new(newdata.vectors.map { |v| v.to_sym })
newdata
Out[2]:
In [3]:
puts "Predictions of aggression levels on a new data set:"
pred = model_fit.predict(newdata: newdata, with_ran_ef: true)
Out[3]:
Now we can add the computed predictions to the data set, in order to see better which of the individuals are likely to be particularly dangerous.
In [4]:
newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'
newdata.vectors = Daru::Index.new(newdata.vectors.map { |v| v.to_sym })
newdata[:Predicted_Agression] = pred
newdata
Out[4]:
Since the estimated fixed and random effects coefficients most likely are not exactly the true values, we probably should look at interval estimates of the predictions, rather than the point estimates computed above.
Two types of such interval estimates are currently available in LMM
. On the one hand, a confidence interval is an interval estimate of the mean value of the response for given covariates (i.e. a population parameter); on the other hand, a prediction interval is an interval estimate of a future observation (for further explanation of this distinction see for example https://stat.ethz.ch/education/semesters/ss2010/seminar/06_Handout.pdf).
In [5]:
puts "88% confidence intervals for the predictions:"
ci = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :confidence)
Daru::DataFrame.new(ci, order: [:pred, :lower88, :upper88])
Out[5]:
In [6]:
puts "88% prediction intervals for the predictions:"
pi = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :prediction)
Daru::DataFrame.new(pi, order: [:pred, :lower88, :upper88])
Out[6]:
Remark: You might notice that #predict
with with_ran_ef: true
produces some values outside of the confidence intervals, because the confidence intervals are computed from #predict
with with_ran_ef: false
.
However, #predict
with with_ran_ef: false
should always give values which lie in the center of the confidence or prediction intervals.