LMM predictions and prediction intervals

Below we will fit a linear mixed model using the Ruby gem mixed_models and demonstrate the available prediction methods.

Data and linear mixed model

We use the same data and model formulation as in several previous examples, where we looked at various parameter estimates (1) and demonstrated many types of hypothesis tests as well as confidence intervals (2).

The data set, which is simulated, contains two numeric variables Age and Aggression, and two categorical variables Location and Species. These data are available for 100 (human and alien) individuals.

We model the Aggression level of an individual of Species $spcs$ who is at the Location $lctn$ as:

$$Aggression = \beta_{0} + \gamma_{spcs} + Age \cdot \beta_{1} + b_{lctn,0} + Age \cdot b_{lctn,1} + \epsilon,$$

where $\epsilon$ is a random residual, and the random vector $(b_{lctn,0}, b_{lctn,1})^T$ follows a multivariate normal distribution (the same distribution but different realizations of the random vector for each Location).

We fit this model in mixed_models using a syntax familiar from the R package lme4.


In [1]:
require 'mixed_models'

alien_species = Daru::DataFrame.from_csv '../examples/data/alien_species.csv'
# mixed_models expects that all variable names in the data frame are ruby Symbols:
alien_species.vectors = Daru::Index.new(alien_species.vectors.map { |v| v.to_sym })

model_fit = LMM.from_formula(formula: "Aggression ~ Age + Species + (Age | Location)", 
                             data: alien_species)
model_fit.fix_ef_summary


Out[1]:
Daru::DataFrame:47316264430760 rows: 5 cols: 4
                          coef                  sd                   z_score              WaldZ_p_value
intercept                 1016.2867207023459    60.19727495769054    16.882603430415077   0.0
Age                       -0.06531615342788907  0.0898848636725299   -0.7266646547504374  0.46743141066211646
Species_lvl_Human         -499.69369529020855   0.2682523406941929   -1862.774781375937   0.0
Species_lvl_Ood           -899.5693213535765    0.28144708140043684  -3196.2289922406003  0.0
Species_lvl_WeepingAngel  -199.58895804200702   0.27578357795259995  -723.7158917283725   0.0
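
As a rough illustration of how these coefficients are read, the following sketch computes a population-level prediction (fixed effects only, with the random effects set to zero) by hand. The helper fixed_effects_prediction is hypothetical and not part of mixed_models. It also assumes that Dalek is the reference level of Species, which is inferred from the absence of a Dalek coefficient in the summary.

```ruby
# Fixed-effects estimates copied from the fix_ef_summary output above.
coef = {
  intercept:                1016.2867207023459,
  Age:                      -0.06531615342788907,
  Species_lvl_Human:        -499.69369529020855,
  Species_lvl_Ood:          -899.5693213535765,
  Species_lvl_WeepingAngel: -199.58895804200702
}

# Hypothetical helper (not part of mixed_models): prediction from the
# fixed effects only, i.e. with all random effects set to zero.
def fixed_effects_prediction(coef, age:, species:)
  # Assumption: Dalek is the reference level, so it has no coefficient
  # of its own and contributes a shift of zero.
  species_shift = coef.fetch(:"Species_lvl_#{species}", 0.0)
  coef[:intercept] + coef[:Age] * age + species_shift
end

human_153 = fixed_effects_prediction(coef, age: 153, species: "Human")
puts human_153.round(2) # prints 506.6
```

The Age coefficient is small compared to the Species shifts, so under this model the species dominates the population-level prediction.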

Predictions and prediction intervals

Often, the objective of a statistical model is the prediction of future observations based on new data input.

We consider the following new data set containing age, geographic location and species for ten individuals.


In [2]:
newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'
newdata.vectors = Daru::Index.new(newdata.vectors.map { |v| v.to_sym })
newdata


Out[2]:
Daru::DataFrame:47316263806300 rows: 10 cols: 3
   Age  Species       Location
0  209  Dalek         OodSphere
1  90   Ood           Earth
2  173  Ood           Asylum
3  153  Human         Asylum
4  255  WeepingAngel  OodSphere
5  256  WeepingAngel  Asylum
6  37   Dalek         Earth
7  146  WeepingAngel  Earth
8  127  WeepingAngel  Asylum
9  41   Ood           Asylum

Point estimates

Based on the fitted linear mixed model, we can predict the aggression levels of these individuals. The argument with_ran_ef specifies whether the random effects estimates are included in the calculation.


In [3]:
puts "Predictions of aggression levels on a new data set:"
# with_ran_ef: true includes the estimated random effects in the predictions
pred = model_fit.predict(newdata: newdata, with_ran_ef: true)


Predictions of aggression levels on a new data set:
Out[3]:
[1070.9125752531208, 182.45206492790737, -17.06446875476354, 384.7881586199103, 876.1240725686446, 674.7113391148862, 1092.6985606350866, 871.1508855262363, 687.4629975728096, -4.016260100144294]

Next, we add the computed predictions to the data set, to see more easily which individuals are likely to be particularly dangerous.


In [4]:
newdata = Daru::DataFrame.from_csv '../examples/data/alien_species_newdata.csv'
newdata.vectors = Daru::Index.new(newdata.vectors.map { |v| v.to_sym })
newdata[:Predicted_Agression] = pred
newdata


Out[4]:
Daru::DataFrame:47316262633840 rows: 10 cols: 4
   Age  Species       Location   Predicted_Agression
0  209  Dalek         OodSphere  1070.9125752531208
1  90   Ood           Earth      182.45206492790737
2  173  Ood           Asylum     -17.06446875476354
3  153  Human         Asylum     384.7881586199103
4  255  WeepingAngel  OodSphere  876.1240725686446
5  256  WeepingAngel  Asylum     674.7113391148862
6  37   Dalek         Earth      1092.6985606350866
7  146  WeepingAngel  Earth      871.1508855262363
8  127  WeepingAngel  Asylum     687.4629975728096
9  41   Ood           Asylum     -4.016260100144294
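
To read off the most dangerous individuals more directly, one option is to rank the rows by predicted aggression. The sketch below does this in plain Ruby on values copied from the output above, so it does not depend on any particular Daru sorting method; the variable names are illustrative only.

```ruby
# Rows copied from Out[4]: [index, species, location, predicted aggression].
rows = [
  [0, "Dalek",        "OodSphere", 1070.9125752531208],
  [1, "Ood",          "Earth",     182.45206492790737],
  [2, "Ood",          "Asylum",    -17.06446875476354],
  [3, "Human",        "Asylum",    384.7881586199103],
  [4, "WeepingAngel", "OodSphere", 876.1240725686446],
  [5, "WeepingAngel", "Asylum",    674.7113391148862],
  [6, "Dalek",        "Earth",     1092.6985606350866],
  [7, "WeepingAngel", "Earth",     871.1508855262363],
  [8, "WeepingAngel", "Asylum",    687.4629975728096],
  [9, "Ood",          "Asylum",    -4.016260100144294]
]

# Sort in descending order of predicted aggression.
ranked = rows.sort_by { |row| -row[3] }

# Print the three individuals with the highest predicted aggression.
ranked.first(3).each do |index, species, location, pred|
  puts "#{index}: #{species} at #{location}, predicted aggression #{pred.round(1)}"
end
```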

Interval estimates

Since the estimated fixed and random effects coefficients are most likely not exactly equal to the true values, we should look at interval estimates of the predictions rather than only at the point estimates computed above.

Two types of such interval estimates are currently available in LMM. On the one hand, a confidence interval is an interval estimate of the mean value of the response for given covariates (i.e. a population parameter); on the other hand, a prediction interval is an interval estimate of a future observation (for further explanation of this distinction see for example https://stat.ethz.ch/education/semesters/ss2010/seminar/06_Handout.pdf).


In [5]:
puts "88% confidence intervals for the predictions:"
ci = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :confidence)
Daru::DataFrame.new(ci, order: [:pred, :lower88, :upper88])


88% confidence intervals for the predictions:
Out[5]:
Daru::DataFrame:47316259596660 rows: 10 cols: 3
   pred                lower88             upper88
0  1002.6356446359171  906.275473617091    1098.995815654743
1  110.83894554025937  17.15393113018095   204.5239599503378
2  105.41770480574462  10.164687937713381  200.67072167377586
3  506.59965393767027  411.8519191795299   601.3473886958107
4  800.0421435362272   701.9091174988788   898.1751695735755
5  799.9768273827992   701.8009453018722   898.1527094637263
6  1013.870023025514   920.443931319159    1107.296114731869
7  807.1616042598671   712.571759209002    901.7514493107321
8  808.402611174997    714.191640124036    902.613582225958
9  114.03943705822599  20.614034870631627  207.46483924582034

In [6]:
puts "88% prediction intervals for the predictions:"
pi = model_fit.predict_with_intervals(newdata: newdata, level: 0.88, type: :prediction)
Daru::DataFrame.new(pi, order: [:pred, :lower88, :upper88])


88% prediction intervals for the predictions:
Out[6]:
Daru::DataFrame:47316258683700 rows: 10 cols: 3
   pred                lower88             upper88
0  1002.6356446359171  809.9100501459104   1195.3612391259237
1  110.83894554025937  -76.53615884686141  298.2140499273802
2  105.41770480574462  -85.09352864481423  295.92893825630347
3  506.59965393767027  317.0988995529618   696.1004083223787
4  800.0421435362272   603.7713980881146   996.3128889843398
5  799.9768273827992   603.6203777073699   996.3332770582285
6  1013.870023025514   827.0127232317805   1200.7273228192475
7  807.1616042598671   617.9767304115936   996.3464781081406
8  808.402611174997    619.9754792487822   996.8297431012118
9  114.03943705822599  -72.8161447158925   300.8950188323445
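
The difference between the two interval types shows up in their widths. The sketch below compares the half-widths for the first three individuals, using values copied from the two outputs above. The helper half_width is hypothetical and only does arithmetic on the printed numbers; it makes no claim about how mixed_models computes the intervals internally.

```ruby
# [pred, lower88, upper88] for individuals 0..2, copied from Out[5] and Out[6].
confidence = [
  [1002.6356446359171, 906.275473617091,   1098.995815654743],
  [110.83894554025937, 17.15393113018095,  204.5239599503378],
  [105.41770480574462, 10.164687937713381, 200.67072167377586]
]
prediction = [
  [1002.6356446359171, 809.9100501459104,  1195.3612391259237],
  [110.83894554025937, -76.53615884686141, 298.2140499273802],
  [105.41770480574462, -85.09352864481423, 295.92893825630347]
]

# Hypothetical helper: half the distance between the interval bounds.
def half_width(interval)
  _pred, lower, upper = interval
  (upper - lower) / 2.0
end

confidence.zip(prediction).each_with_index do |(ci_row, pi_row), i|
  puts format("%d: CI half-width %.2f, PI half-width %.2f",
              i, half_width(ci_row), half_width(pi_row))
end
```

For each of these individuals the prediction interval is wider than the confidence interval, as expected, since a future observation carries additional variability beyond the uncertainty in the mean response.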

Remark: You might notice that some of the values produced by #predict with with_ran_ef: true lie outside of the confidence intervals. This happens because the confidence intervals are computed from #predict with with_ran_ef: false. In contrast, #predict with with_ran_ef: false should always give values which lie at the center of the confidence and prediction intervals.
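
The remark can be checked directly on the numbers shown above. The following sketch uses values copied from Out[3] and Out[5], with illustrative variable names, and lists the individuals whose prediction with random effects falls outside the 88% confidence interval.

```ruby
# Predictions with with_ran_ef: true, copied from Out[3].
pred_with_ran_ef = [
  1070.9125752531208, 182.45206492790737, -17.06446875476354,
  384.7881586199103, 876.1240725686446, 674.7113391148862,
  1092.6985606350866, 871.1508855262363, 687.4629975728096,
  -4.016260100144294
]

# [lower88, upper88] bounds of the confidence intervals, copied from Out[5].
ci_bounds = [
  [906.275473617091,   1098.995815654743],
  [17.15393113018095,  204.5239599503378],
  [10.164687937713381, 200.67072167377586],
  [411.8519191795299,  601.3473886958107],
  [701.9091174988788,  898.1751695735755],
  [701.8009453018722,  898.1527094637263],
  [920.443931319159,   1107.296114731869],
  [712.571759209002,   901.7514493107321],
  [714.191640124036,   902.613582225958],
  [20.614034870631627, 207.46483924582034]
]

# Indices of the individuals whose prediction lies outside the interval.
outside = pred_with_ran_ef.each_index.reject do |i|
  lower, upper = ci_bounds[i]
  pred_with_ran_ef[i].between?(lower, upper)
end

p outside # prints [2, 3, 5, 8, 9]
```

All five of these individuals are located at Asylum in the new data set, and in each case the prediction with random effects lies below the lower bound of the confidence interval.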