Fitting a linear mixed model and accessing the parameter estimates

Below we demonstrate how to fit linear mixed models using the Ruby gem mixed_models. We also show how to access various parameters estimated by the linear mixed model, and we use them to assess the goodness of fit of the resulting model.

We will fit the model with the user friendly method LMM#from_formula, which mimics the behaviour of the function lmer from the R package lme4. The formula language used by mixed_models is in fact a subset of the lme4 formula interface. It allows to fit all of the same models, but does not allow for cetain shortcuts, namely the symbols *, || and /. However, for all of these shortcuts longer alternative formulations exist in all cases (e.g. you need to use the equivalent formulation a + b + a:b instead of a*b). An expanation of the lme4 formula interface can be found in the lme4 vignette. Additionally, there is a multitude of posts on stackexchange (such as this) discussing the formula language of lme4.

Example data
Linear mixed model
Model attributes
Fitted values and residuals
Assessing the quality of the model fit

Example data

The data set, which is simulated, contains two numeric variables Age and Aggression, and two categorical variables Location and Species. These data are available for 100 (human and alien) individuals.

The data is supplied to LMM#from_formula as a Daru::DataFrame (from the excellent Ruby gem daru). We load the data, and display the first 10 lines with:



In [1]:

    
require 'daru'
alien_species = Daru::DataFrame.from_csv '../examples/data/alien_species.csv'
# mixed_models expects that all variable names in the data frame are ruby Symbols:
alien_species.vectors = Daru::Index.new(alien_species.vectors.map { |v| v.to_sym })
alien_species.head









    Out[1]:




Daru::DataFrame:47084194581820  rows: 10  cols: 4
Aggression Age Species Location
0 877.54242028595 204.95 Dalek Asylum
1 852.528392188206 39.88 WeepingAngel OodSphere
2 388.791416909388 107.34 Human Asylum
3 170.010124622982 210.01 Ood OodSphere
4 1078.31219494376 270.22 Dalek OodSphere
5 164.924992952256 157.65 Ood OodSphere
6 865.838374677443 136.15 WeepingAngel OodSphere
7 1052.36035549029 241.31 Dalek Earth
8 -8.57251993382567 86.84 Ood Asylum
9 1070.71900405899 206.7 Dalek OodSphere

Linear mixed model

We model the Aggression level of an individual as a linear function of the Age (Aggression decreases with Age), with a different constant added for each Species (i.e. each species has a different base level of aggression). Moreover, we assume that there is a random fluctuation in Aggression due to the Location that an individual is at. Additionally, there is a random fluctuation in how Age affects Aggression at each different Location.

Thus, the Aggression level of an individual of Species $spcs$ who is at the Location $lctn$ can be expressed as: $$Aggression = \beta_{0} + \gamma_{spcs} + Age \cdot \beta_{1} + b_{lctn,0} + Age \cdot b_{lctn,1} + \epsilon,$$ where $\epsilon$ is a random residual, and the random vector $(b_{lctn,0}, b_{lctn,1})^T$ follows a multivariate normal distribution (the same distribution but different realizations of the random vector for each Location). That is, we have a linear mixed model with fixed effects $\beta_{0}, \beta_{1}, \gamma_{Dalek}, \gamma_{Ood}, \dots$, and random effects $b_{Asylum,0}, b_{Asylum,1}, b_{Earth,0},\dots$.

We fit this model in mixed_models using a syntax familiar from the R package lme4 (as described above), and display the estimated fixed and random effects coefficients:



In [2]:

    
require 'mixed_models'
model_fit = LMM.from_formula(formula: "Aggression ~ Age + Species + (Age | Location)", 
                             data: alien_species)
puts "Fixed effects:"
puts model_fit.fix_ef
puts "Random effects:"
puts model_fit.ran_ef









    



Fixed effects:
{:intercept=>1016.2867207023459, :Age=>-0.06531615342788907, :Species_lvl_Human=>-499.69369529020855, :Species_lvl_Ood=>-899.5693213535765, :Species_lvl_WeepingAngel=>-199.58895804200702}
Random effects:
{:intercept_Asylum=>-116.68080676073654, :Age_Asylum=>-0.0335339121374082, :intercept_Earth=>83.86571636827462, :Age_Earth=>-0.136139966451407, :intercept_OodSphere=>32.81508999155884, :Age_OodSphere=>0.16967387859160207}

A summary of some important information about the fixed and random effects of the model can be conveniently displayed with the methods LMM#fix_ef_summary and LMM#ran_ef_summary.

LMM#fix_ef_summary contains the fixed effects coefficient estimates, the standard deviations of the estimates, as well as the corresponding z scores and Wald Z p-values testing the significance of each fixed effects term.



In [3]:

    
model_fit.fix_ef_summary









    Out[3]:




Daru::DataFrame:47084187305780  rows: 5  cols: 4
coef sd z_score WaldZ_p_value
intercept 1016.2867207023459 60.19727495769054 16.882603430415077 0.0
Age -0.06531615342788907 0.0898848636725299 -0.7266646547504374 0.46743141066211646
Species_lvl_Human -499.69369529020855 0.2682523406941929 -1862.774781375937 0.0
Species_lvl_Ood -899.5693213535765 0.28144708140043684 -3196.2289922406003 0.0
Species_lvl_WeepingAngel -199.58895804200702 0.27578357795259995 -723.7158917283725 0.0

LMM#ran_ef_summary summarizes the correlation structure of the random effects terms, that is, the correlation matrix of the vector $(b_{lctn,0}, b_{lctn,1})^T$ in the present data analysis.



In [4]:

    
model_fit.ran_ef_summary









    Out[4]:




Daru::DataFrame:47084184227680  rows: 2  cols: 2
Location Location_Age
Location 104.26376362946607 -0.05986390378124856
Location_Age -0.05986390378124856 0.15567077553868963

In particular, we see that the standard deviation of the constant term corresponding to the variable Location is rather big (104.26...), and therefore Location must contribute hugely to the variance in Aggression. The variability of the effect of Age with respect to Location, however, seems rather small (0.15567...) in comparison.

Model attributes

Apart from the fixed and random effects coefficients (seen above), we can access many attributes of the fitted model. Among others:

fix_ef_names and ran_ef_names are Arrays of names of the fixed and random effects.
reml is an indicator whether the profiled REML criterion or the profiled deviance function was optimized by the model fitting algorithm.
formula returns the R-like formula used to fit the model as a String.
model_data, optimization_result and dev_fun store the various model matrices in an LMMData object, the results of the utilized optimization algorithm, and the corresponding objective function as a Proc.
sigma2 is the residual variance (unless weights was specified in the model fit).
sigma_mat is the covariance matrix of the multivariate normal random effects vector.

We can look at some of these parameters for our example model:



In [5]:

    
puts "REML criterion used: \t#{model_fit.reml}"
puts "Residual variance: \t#{model_fit.sigma2}"
puts "Formula: \t" + model_fit.formula
puts "Variance of the intercept due to 'location' (i.e. variance of b0): \t#{model_fit.sigma_mat[0,0]}"
puts "Variance of the effect of 'age' due to 'location' (i.e. variance of b1): \t#{model_fit.sigma_mat[1,1]}"
puts "Covariance of b0 and b1: \t#{model_fit.sigma_mat[0,1]}"









    



REML criterion used: 	true
Residual variance: 	0.9496833447256825
Formula: 	Aggression ~ Age + Species + (Age | Location)
Variance of the intercept due to 'location' (i.e. variance of b0): 	10870.932406181171
Variance of the effect of 'age' due to 'location' (i.e. variance of b1): 	0.024233390356817094
Covariance of b0 and b1: 	-0.9716403033290799

Some further convenience methods are (apart from #fix_ef_summary and #ran_ef_summary as seen above):

#sigma returns the square root of sigma2.
#theta returns the optimal solution of the minimization of the deviance function or the REML criterion (whichever was used to fit the model).
#deviance returns the value of the deviance function or the REML criterion at the optimal solution.
#ran_ef_cov is a version of #ran_ef_summary that returns the covariance structure of the random effects rather than the correlation structure.



In [6]:

    
puts "Residual standard deviation: \t#{model_fit.sigma}"
puts "REML criterion: \t#{model_fit.deviance}"









    



Residual standard deviation: 	0.9745169802141379
REML criterion: 	333.7155391015166

Fitted values and residual

There are methods to get the fitted (i.e. predicted) values of the response variable, with or without inclusion of random effects, as well as the model residuals (which are differences between the true and the predicted values).



In [7]:

    
puts "Fitted values at the population level:"
model_fit.fitted(with_ran_ef: false)









    



Fitted values at the population level:






    Out[7]:





[1002.9001750573, 814.0929544616347, 509.58198950318774, 103.00035396737837, 998.6369897230617, 106.42030776086267, 807.8049683711317, 1000.525279718662, 111.04534458509147, 1002.7858717888012, 1008.0797460241316, 498.79959889531176, 100.90435860387743, 111.03358767747443, 102.37593154060778, 812.6239941710414, 97.21856806594167, 800.9297900613121, 510.0032786927976, 502.6950542857511, 503.7329279637203, 999.8864877381372, 497.2875299434561, 1003.3325679929926, 804.1328942254158, 811.9113949371432, 99.91285939484214, 803.9356394420636, 998.4671677241491, 99.50267395131493, 500.71140270614603, 1014.38536747606, 112.41241167633723, 1010.303107886817, 114.52212343205804, 813.8107886788262, 811.7526766843134, 498.6990120190328, 1005.0836940663943, 110.67826780282678, 501.00205958890024, 810.6305451684223, 511.318092861301, 1000.7623773556052, 508.8465296155897, 1008.3730155530228, 1011.249538949987, 102.4595362169955, 1010.9490846442187, 1007.0346875692853, 506.2286581861998, 502.15031756616247, 813.0589997528712, 499.75256157382466, 812.9329395767553, 498.25682166032595, 514.1586923738798, 499.76366531990743, 510.7733561417124, 798.2159038863833, 511.4147607683742, 506.48077853843154, 1012.1763751671288, 1013.9856326170814, 101.35504006252984, 1013.1593832762185, 1006.1091576752121, 106.82265526597848, 1002.7754212042528, 806.3314359497986, 1002.5860043593118, 514.9699189994543, 1007.2214917680891, 806.255016050288, 502.0582217898292, 1011.9458091455283, 804.6221122145907, 997.8786691817638, 802.2648522373782, 1011.1705064043393, 806.8657220848387, 1011.1763848581478, 798.7684785443834, 1009.0999843406752, 806.0166120902761, 102.78481066106633, 99.6640048502818, 506.1835900403346, 998.8949885291019, 499.4357782296994, 814.6050331045093, 1012.1182437905779, 109.60251075586939, 501.6904918460301, 98.0232630761733, 106.71945574356243, 99.98274767900989, 114.889853375857, 798.1950027172865, 106.6208283518863]



In [8]:

    
puts "Model residuals:"
model_fit.residuals









    



Model residuals:






    Out[8]:





[-1.8041727180516318, -1.1462465432206272, -0.5102357042338213, -1.438530578977577, 1.0108397561168658, -1.059491760131607, 2.1172177445057514, 0.8212947077424815, -0.024972828168071004, 0.04645157374602604, -0.2521512808752959, 0.5337610637689068, -0.5306298541011536, -0.8746282732136024, -0.5173146938876982, -1.076703511382675, -0.37352823345474917, 0.4342880104853748, -0.008686417241847266, 0.9633162152576915, -1.8591988675074163, -0.06035456182189591, -0.8747549066542888, 2.156663624530097, -0.12014582672816232, 1.5012618544112684, 0.5916114379897799, -1.4808531218288863, -0.16478112799086375, 2.0314537895963163, 0.010773844423852097, 0.8539551421497436, -1.2568149547061997, 0.07403493882179646, -0.11003692956987265, 1.027133585975207, 1.2893893800322758, -0.14846806796293777, -0.4342951733146947, 1.7157974749037237, 1.017366682325644, -0.6984875406660649, 0.16845508494236583, -1.423216161059372, -0.14893911781359748, 0.1337897922242064, -0.21024649931018757, 0.40638495225081783, -0.40370493998443635, -1.5086471211782282, 0.22131784542784771, -0.3343567440182369, -1.3354988341066019, 1.1064057146309096, 0.3472979813135453, 0.21318943374672017, -0.44914337321006315, -0.3144066453174901, -0.8024013676736104, -1.138768647945767, -0.270266988866922, 0.32795235324385885, -0.07521784973812373, 1.8354848431990831, 0.5871444360181073, -0.30368225978327246, -0.8505153742187304, -0.5012451945714531, -0.8060105059505531, 0.5330080478594255, 1.4260343762148295, -1.050545171033491, -0.6140910982546757, -0.2504994695145797, 0.6581363886527924, 0.4752780907743954, 0.7314233385786792, -0.7084492435404854, -1.2272348687666863, -0.7943531062586544, 0.24601716633890192, 0.45095282087356736, 0.2410007816512234, 0.5882243348621614, -0.045955678732184424, 2.3011995430929595, -0.805550441639582, 1.2857127416019694, 0.9605154956996103, -0.05181584374145132, -0.04108522357819311, -0.4196304756428617, 1.2953557251604657, 0.3168318472534679, -1.6086909528895887, 0.39499568456598766, -1.2512383164572825, -0.1762592429552683, 0.09344137531570595, 1.2049892111455733]

Assessing the quality of the model fit

We can assess the goodness of the model fit (to some extent) by plotting the residuals agains the fitted values, and checking for unexpected patterns. We use the gem gnuplotrb for plotting.



In [9]:

    
require 'gnuplotrb'
include GnuplotRB

x, y = model_fit.fitted, model_fit.residuals
fitted_vs_residuals = Plot.new([[x,y], with: 'points', pointtype: 6, notitle: true],
                               xlabel: 'Fitted', ylabel: 'Residuals')









    Out[9]:

We see that the residuals look more or less like noise, which is good.

We can further analyze the validity of the linear mixed model somewhat, by checking if the residuals appear to be approximately normally distributed.



In [10]:

    
bin_width = (y.max - y.min)/10.0
bins = (y.min..y.max).step(bin_width).to_a
rel_freq = Array.new(bins.length-1){0.0}
y.each do |r|
  0.upto(bins.length-2) do |i|
    if r >= bins[i] && r < bins[i+1] then
      rel_freq[i] += 1.0/y.length
    end
  end
end
bins_center = bins[0...-1].map { |b| b + bin_width/2.0 }
  
residuals_hist = Plot.new([[bins_center, rel_freq], with: 'boxes', notitle: true],
                           style: 'fill solid 0.5')









    Out[10]:

The histogram does not appear to be too different from a bell shaped curve, although it might be slightly skewed to the right.

We can further explore the validity of the normality assumption by looking at the Q-Q plot of the residuals.



In [11]:

    
require 'distribution'

observed = model_fit.residuals.sort
n = observed.length
theoretical = (1..n).to_a.map { |t| Distribution::Normal.p_value(t.to_f/n.to_f) * model_fit.sigma}
qq_plot = Plot.new([[theoretical, observed], with: 'points', pointtype: 6, notitle: true],
                   ['x', with: 'lines', notitle: true],
                   xlabel: 'Normal theoretical quantiles', ylabel: 'Observed quantiles',
                   title: 'Q-Q plot of the residuals')









    Out[11]:

The straight line in the above plot is simply the diagonal. We see that the observed quantiles aggree with the theoretical values fairly well, as expected from a "good" model.

Daru::DataFrame:47084194581820 rows: 10 cols: 4
	Aggression	Age	Species	Location
0	877.54242028595	204.95	Dalek	Asylum
1	852.528392188206	39.88	WeepingAngel	OodSphere
2	388.791416909388	107.34	Human	Asylum
3	170.010124622982	210.01	Ood	OodSphere
4	1078.31219494376	270.22	Dalek	OodSphere
5	164.924992952256	157.65	Ood	OodSphere
6	865.838374677443	136.15	WeepingAngel	OodSphere
7	1052.36035549029	241.31	Dalek	Earth
8	-8.57251993382567	86.84	Ood	Asylum
9	1070.71900405899	206.7	Dalek	OodSphere

Daru::DataFrame:47084187305780 rows: 5 cols: 4
	coef	sd	z_score	WaldZ_p_value
intercept	1016.2867207023459	60.19727495769054	16.882603430415077	0.0
Age	-0.06531615342788907	0.0898848636725299	-0.7266646547504374	0.46743141066211646
Species_lvl_Human	-499.69369529020855	0.2682523406941929	-1862.774781375937	0.0
Species_lvl_Ood	-899.5693213535765	0.28144708140043684	-3196.2289922406003	0.0
Species_lvl_WeepingAngel	-199.58895804200702	0.27578357795259995	-723.7158917283725	0.0

Daru::DataFrame:47084184227680 rows: 2 cols: 2
	Location	Location_Age
Location	104.26376362946607	-0.05986390378124856
Location_Age	-0.05986390378124856	0.15567077553868963