Linear Mixed Models

Consider a study that compares the effect of a new drug to a placebo.

| | | Drug | | | | Placebo | | | |--------|---|-----------|---|---|---|-----------|---|----| | | | Age Group | | | | Age Group | | | |--------|---|-----------|---|---|---|-----------|---|----| | | A | B | C | | A | B | C | | | Gender | | | | | | | | | | Male | | | | | | | | | | Female | | | | | | | | | | | | | | | | | | | Table 1: Study of effect if a drug over placebo for different age groups for males and females | | | School X | | | | School Y | | | |--------|---|-----------|---|---|---|-----------|---|----| | | | Section | | | | Section | | | |--------|---|-----------|---|---|---|-----------|---|----| | | A | B | C | | A | B | C | | | Gender | | | | | | | | | | Male | | | | | | | | | | Female | | | | | | | | | | | | | | | | | | | Table 2: Study of marks scored for different schools over

The definition for fixed effects and random-effects is a bit hazy, but here we refer to the fixed effects arising due to a factor

Terminology


Factor: Classifications identifying the source of datum. So 'Drug', 'Placebo' and 'Gender' are the three factors in the above study.

Levels: Individual classes of the factors. So 'Male' and 'Female' are levels of 'Gender' factor and 'A', 'B', 'C' are the levels of 'gender' factor. 'Drug' and 'Placebo' are levels of 'Type of drug administered'

Crossed Factors: All levels of 'Age' factor appear for both levels of the 'Type of drug' factorand hence is a 'crossed' factor.

Nested Factors: In the table summarizing school wise grades, the section A under school X is comnpletely independent of section A under school Y. In fact, it does not matter what particular section we refer to, as long as we have 3 sections for both. Since the levels of the 'section' factor appear only once, 'section' is a nested factor within the 'type of school'.

Effects: Extent to which different levels of a factor affect the variable of interest

In order to define our meaning of fixed and random effects,, we refer to fixed effects as the effects that can be attributed to a finite set of levels of a factor. The random effects arise due to the levels of factor being infinite. So if the performance of two schools is being considered and we also record the student ids along with their gender and scores, the gender in itself will give rise to a fixed effect for there are just two levels: male and female. But, the student ids being levels of the student identifier factor can have infinite levels, because the student can be sampled randomly.

Fixed Effects are simply the effects arising due to the levels of a fixed covariate, a factor whose all levels are of interest such that they represent specific conditions that are used to define specific contrasts

Random Effects arise out of random factors, a factor whose levels were randomly sampled from a population of levels. Thus, all possible levels of random factors cannot be part of the study and hence only an inference can be made about the entire population of levels given a random sample of the levels

Matrix Specification

A linear mixed model is described by the following model:

$$Y= \underbrace{X\beta}_\text{Fixed} + \underbrace{Zu + \epsilon}_\text{Random}$$

where

$n$ = Number of observations

$p$ = Number of fixed effect covariates

$q$ = Number of random effect covariates

$Y$ = response column vector $(n \times 1)$

$X$ = $n \times p$ model matrix representing the known values of the $p$ fixed-effect covariates

$\beta$ = $p \times 1$ fixed-effect parameter column vector

$Z$ = $n \times q$ model matrix for representing the known values of the $q$ random-effect covariates

$u$ = $q \times 1$ fixed-effect parameter column vector

$\epsilon$ = Residual column vector

By definition the random-effects are random variables and the $q$ random-effects parameter inthe $u$ column vector follow a multivarite normal distribution:

$$u \sim N(0,D)$$

where $D$ is a $q \times q$ variance-covariance matrix

$\epsilon$ column vector constitutes the residuals associated with the $n$ observations.

Residuals and random-effects parameter vector $u$ are assumed to be independent and $\epsilon$ is assumed to be normally distributed:

$$\epsilon \sim N(0,R)$$

where $R$ is a $n\times n$ variance-covariance matrix

Our aim is to estimate the column vectors $\beta$ , $u$ (by estimating for the $D$ matrix) and $\epsilon$(by estimating the $R$ matrix)

It is possible to represent the above model in a marginailised form:

$$Y= \underbrace{X\beta}_\text{Fixed} + \underbrace{\epsilon^*}_\text{Random}$$

where $\epsilon^* \sim N(0, ZDZ'+R)$ which helps to relax the constraints on both $D$ and $R$ being positive definite to $ZDZ'+R$ being positive definite.

and hence $Y \sim N(X\beta, ZDZ'+R)$


In [13]:
import numpy as np
import statsmodels.api as sm
import pandas
import statsmodels.formula.api as smf

In [24]:
data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Penicillin.csv', index_col=0)

In [14]:
print data.describe(include='all')


          diameter plate sample
count   144.000000   144    144
unique         NaN    24      6
top            NaN     x      F
freq           NaN     6     24
mean     22.972222   NaN    NaN
std       2.031034   NaN    NaN
min      18.000000   NaN    NaN
25%      22.000000   NaN    NaN
50%      23.000000   NaN    NaN
75%      24.000000   NaN    NaN
max      27.000000   NaN    NaN

In [16]:
print (data.columns.values)


['diameter' 'plate' 'sample']

In [21]:
print data['sample'].describe()
print data['plate'].describe()


count     144
unique      6
top         F
freq       24
Name: sample, dtype: object
count     144
unique     24
top         x
freq        6
Name: plate, dtype: object

We take the Penicillin growth example from the 'lme4' manual. Diameter refers to the diameter of growth inhibition in a culturing experiment. There are 24 plates and 6 types of samples, each plate e The diameter varies with the plate and sample type. Clearly plate and sample are categorical variables with 24 and 6 levels each. The type of plate here is assumed to be sampled randomly(we are not interested in studying what effect a particular plate say A has, but are more interested in studying the effect of plate ). This is an example of crossed random effect. Crossed because each sample is used in each plate.


In [39]:
table = data.pivot_table(['plate'],  rows='sample', columns='plate', aggfunc=len)
table


Out[39]:
diameter
plate a b c d e f g h i j ... o p q r s t u v w x
sample
A 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
B 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
C 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
D 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
E 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
F 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1

6 rows × 24 columns


In [1]:
%load_ext rpy2.ipython
%R library(lme4)


Loading required package: lattice
Loading required package: Matrix

In [7]:
%R str(Penicillin)


'data.frame':	144 obs. of  3 variables:
 $ diameter: num  27 23 26 23 23 21 27 23 26 23 ...
 $ plate   : Factor w/ 24 levels "a","b","c","d",..: 1 1 1 1 1 1 2 2 2 2 ...
 $ sample  : Factor w/ 6 levels "A","B","C","D",..: 1 2 3 4 5 6 1 2 3 4 ...

In [104]:
## Crossed Random Effects
data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Penicillin.csv', index_col=0)

vcf = {"plate" : "1 + C(plate)", "sample" : "0 + C(sample)"}
model = sm.MixedLM.from_formula('diameter ~ 1', groups="plate", vc_formula=vcf, data=data)
result = model.fit()
result.summary()


Out[104]:
Model: MixedLM Dependent Variable: diameter
No. Observations: 144 Method: REML
No. Groups: 24 Scale: 1.4191
Min. group size: 6 Likelihood: -306.6280
Max. group size: 6 Converged: Yes
Mean group size: 6.0
Coef. Std.Err. z P>|z| [0.025 0.975]
Intercept 22.972 0.179 128.476 0.000 22.622 23.323
plate RE 0.095
sample RE 2.614

The same example in R:


In [31]:
%R print (summary(fm <- lmer(diameter ~ 1 + (1|plate) + (1|sample), Penicillin)))


Linear mixed model fit by REML ['lmerMod']
Formula: diameter ~ 1 + (1 | plate) + (1 | sample) 
   Data: Penicillin 

REML criterion at convergence: 330.8606 

Random effects:
 Groups   Name        Variance Std.Dev.
 plate    (Intercept) 0.7169   0.8467  
 sample   (Intercept) 3.7309   1.9316  
 Residual             0.3024   0.5499  
Number of obs: 144, groups: plate, 24; sample, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  22.9722     0.8086   28.41

Dyestuff example

The dyestuff dataset is a self-sufficient dataset to understand fixed and random effects. The dataset involves batch to batch variability in the strength of dye. We have 6 batches and 5 samples for each batch, totalling to 30 observations. It is easy ti imagine two sources of variability here, the one within-batch and the other one across-batch. We use mixed models to quantify the batch-batch variability:


In [106]:
data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Dyestuff.csv', index_col=0)
model = smf.mixedlm('Yield ~ 1', groups="Batch", data=data)
result = model.fit()
result.summary()


Out[106]:
Model: MixedLM Dependent Variable: Yield
No. Observations: 30 Method: REML
No. Groups: 6 Scale: 2451.2239
Min. group size: 5 Likelihood: -159.8271
Max. group size: 5 Converged: Yes
Mean group size: 5.0
Coef. Std.Err. z P>|z| [0.025 0.975]
Intercept 1527.500 19.384 78.802 0.000 1489.508 1565.492
groups RE 1764.171 31.658

Same example in R:


In [39]:
%R print( summary(fm <- lmer(Yield ~ 1 + (1|Batch), Dyestuff)))


Linear mixed model fit by REML ['lmerMod']
Formula: Yield ~ 1 + (1 | Batch) 
   Data: Dyestuff 

REML criterion at convergence: 319.6543 

Random effects:
 Groups   Name        Variance Std.Dev.
 Batch    (Intercept) 1764     42.00   
 Residual             2451     49.51   
Number of obs: 30, groups: Batch, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1527.50      19.38    78.8

Dyestuff2 '

Same as the dyestuff example, but with batch-batch variability missing:


In [42]:
%R print(summary(Dyestuff2))


 Batch     Yield       
 A:5   Min.   :-0.892  
 B:5   1st Qu.: 2.765  
 C:5   Median : 5.365  
 D:5   Mean   : 5.666  
 E:5   3rd Qu.: 8.151  
 F:5   Max.   :13.434  

In [44]:
%R print( summary(fm <- lmer(Yield ~ 1 + (1|Batch), Dyestuff2)))


Linear mixed model fit by REML ['lmerMod']
Formula: Yield ~ 1 + (1 | Batch) 
   Data: Dyestuff2 

REML criterion at convergence: 161.8283 

Random effects:
 Groups   Name        Variance Std.Dev.
 Batch    (Intercept)  0.00    0.000   
 Residual             13.81    3.716   
Number of obs: 30, groups: Batch, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)   5.6656     0.6784   8.352

The variance for batch random effect is 0, thus there is no random effect. It is as good as a linear model:


In [108]:
%R print (summary(fm <- lm(Yield ~1, Dyestuff2)))


Call:
lm(formula = Yield ~ 1, data = Dyestuff2)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5576 -2.9006 -0.3006  2.4854  7.7684 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.6656     0.6784   8.352 3.32e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.716 on 29 degrees of freedom


In [109]:
data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Dyestuff2.csv', index_col=0)
vcf = {"batch" : "0 + C(Batch)"}
model = sm.MixedLM.from_formula('Yield ~ 1', groups="Batch",  data=data)
result = model.fit()
result.summary()


/usr/local/lib/python2.7/dist-packages/statsmodels-0.7.0-py2.7-linux-x86_64.egg/statsmodels/regression/mixed_linear_model.py:1912: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
Out[109]:
Model: MixedLM Dependent Variable: Yield
No. Observations: 30 Method: REML
No. Groups: 6 Scale: 13.8063
Min. group size: 5 Likelihood: -80.9141
Max. group size: 5 Converged: Yes
Mean group size: 5.0
Coef. Std.Err. z P>|z| [0.025 0.975]
Intercept 5.666 0.678 8.352 0.000 4.336 6.995
groups RE 0.000 1.235

Pastes

The experiment consists of strength of chemical pastes packaged in different casks, at two time points. There are 10 batches, with 3 casks sampled randomly and two tests performed. In total 10x3x2 observations.

So there are 30 samples with levels 'A:a', 'A:b', 'A:c' and so on, 3 samples for the 10 batches(This sort of naming saves us from the trouble of using different variable names for these 30 levels) The 'a', 'b', 'c' refer to the levels of the cask. But notice that cask 'a' of batch 'A' has no relation to cask 'a' of batch 'B' and so on. 'a', 'b', 'c' differentiate the casks within a batch but have no relation to other batches otherwise.

This is an example of nested random effects because each level of sample occurs with one and only one level of the batch(compare it with the pivot table we had earlier where all the entries were one)


In [86]:
data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Pastes.csv', index_col=0)

table = data.pivot_table(['batch', 'sample'],  rows='batch', columns='sample', aggfunc=len)
table


Out[86]:
strength ... cask
sample A:a A:b A:c B:a B:b B:c C:a C:b C:c D:a ... G:c H:a H:b H:c I:a I:b I:c J:a J:b J:c
batch
A 2 2 2 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
B NaN NaN NaN 2 2 2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
C NaN NaN NaN NaN NaN NaN 2 2 2 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
D NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
E NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
F NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
G NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
H NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 2 2 2 NaN NaN NaN NaN NaN NaN
I NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 2 2 2 NaN NaN NaN
J NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2 2 2

10 rows × 60 columns


In [110]:
data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Pastes.csv', index_col=0)
vcf = {"batch" : "0 + C(batch)", "sample" : "0 + C(sample)"}
model = sm.MixedLM.from_formula('strength ~ 1', groups="batch", vc_formula=vcf, data=data)
result = model.fit()
result.summary()


Out[110]:
Model: MixedLM Dependent Variable: strength
No. Observations: 60 Method: REML
No. Groups: 10 Scale: 0.6780
Min. group size: 6 Likelihood: -123.4954
Max. group size: 6 Converged: Yes
Mean group size: 6.0
Coef. Std.Err. z P>|z| [0.025 0.975]
Intercept 60.053 0.677 88.714 0.000 58.727 61.380
batch RE 1.658 2.901
sample RE 8.433 4.349

In R:


In [101]:
%R print( summary(fm <- lmer(strength ~ 1 + (1|batch) + (1|sample), Pastes)))


Linear mixed model fit by REML ['lmerMod']
Formula: strength ~ 1 + (1 | batch) + (1 | sample) 
   Data: Pastes 

REML criterion at convergence: 246.9907 

Random effects:
 Groups   Name        Variance Std.Dev.
 sample   (Intercept) 8.434    2.9041  
 batch    (Intercept) 1.657    1.2874  
 Residual             0.678    0.8234  
Number of obs: 60, groups: sample, 30; batch, 10

Fixed effects:
            Estimate Std. Error t value
(Intercept)  60.0533     0.6769   88.72

In [ ]: