As an application of the variance component model, this note demonstrates the workflow for heritability analysis in genetics, using a sample data set cg10k
with 6,670 individuals and 630,860 SNPs. Person IDs and phenotype names are masked for privacy. cg10k.bed
, cg10k.bim
, and cg10k.fam
is a set of Plink files in binary format. cg10k_traits.txt
contains 13 phenotypes of the 6,670 individuals.
In [31]:
;ls cg10k.bed cg10k.bim cg10k.fam cg10k_traits.txt
Machine information:
In [32]:
versioninfo()
We will use the SnpArrays.jl
package to read in binary SNP data and compute the empirical kinship matrix. Issue
Pkg.clone("https://github.com/OpenMendel/SnpArrays.jl.git")
within Julia
to install the SnpArrays
package.
In [33]:
using SnpArrays
In [34]:
# read in genotype data from Plink binary file (~50 secs on my laptop)
@time cg10k = SnpArray("cg10k")
Out[34]:
In [35]:
people, snps = size(cg10k)
Out[35]:
In [36]:
# summary statistics (~50 secs on my laptop)
@time maf, _, missings_by_snp, = summarize(cg10k);
In [37]:
# 5 number summary and average MAF (minor allele frequencies)
quantile(maf, [0.0 .25 .5 .75 1.0]), mean(maf)
Out[37]:
In [38]:
# Pkg.add("Plots")
# Pkg.add("PyPlot")
using Plots
pyplot()
histogram(maf, xlab = "Minor Allele Frequency (MAF)", label = "MAF")
Out[38]:
In [39]:
# proportion of missing genotypes
sum(missings_by_snp) / length(cg10k)
Out[39]:
In [40]:
# proportion of rare SNPs with maf < 0.05
countnz(maf .< 0.05) / length(maf)
Out[40]:
In [41]:
# GRM using SNPs with maf > 0.01 (default) (~10 mins on my laptop)
srand(123)
@time Φgrm = grm(cg10k; method = :GRM)
Out[41]:
Read in the phenotype data and compute descriptive statistics.
In [42]:
# Pkg.add("DataFrames")
using DataFrames
cg10k_trait = readtable(
"cg10k_traits.txt";
separator = ' ',
names = [:FID; :IID; :Trait1; :Trait2; :Trait3; :Trait4; :Trait5; :Trait6;
:Trait7; :Trait8; :Trait9; :Trait10; :Trait11; :Trait12; :Trait13],
eltypes = [String; String; Float64; Float64; Float64; Float64; Float64;
Float64; Float64; Float64; Float64; Float64; Float64; Float64; Float64]
)
# do not display FID and IID for privacy
cg10k_trait[:, 3:end]
Out[42]:
In [43]:
describe(cg10k_trait[:, 3:end])
In [44]:
Y = convert(Matrix{Float64}, cg10k_trait[:, 3:15])
histogram(Y, layout = 13)
Out[44]:
To prepare variance component model fitting, we form an instance of VarianceComponentVariate
. The two variance components are $(2\Phi, I)$.
In [45]:
using VarianceComponentModels
# form data as VarianceComponentVariate
cg10kdata = VarianceComponentVariate(Y, (2Φgrm, eye(size(Y, 1))))
fieldnames(cg10kdata)
Out[45]:
In [46]:
cg10kdata
Out[46]:
Before fitting the variance component model, we pre-compute the eigen-decomposition of $2\Phi_{\text{GRM}}$, the rotated responses, and the constant part in log-likelihood, and store them as a TwoVarCompVariateRotate
instance, which is re-used in various variane component estimation procedures.
In [47]:
# pre-compute eigen-decomposition (~50 secs on my laptop)
@time cg10kdata_rotated = TwoVarCompVariateRotate(cg10kdata)
fieldnames(cg10kdata_rotated)
Out[47]:
In [48]:
# # Pkg.add("JLD")
# using JLD
# @save "cg10k.jld"
# whos()
To load workspace
In [49]:
using SnpArrays, JLD, DataFrames, VarianceComponentModels, Plots
pyplot()
@load "cg10k.jld"
whos()
We use Fisher scoring algorithm to fit variance component model for each single trait.
In [50]:
# heritability from single trait analysis
hST = zeros(13)
# standard errors of estimated heritability
hST_se = zeros(13)
# additive genetic effects
σ2a = zeros(13)
# enviromental effects
σ2e = zeros(13)
tic()
for trait in 1:13
println(names(cg10k_trait)[trait + 2])
# form data set for trait j
traitj_data = TwoVarCompVariateRotate(cg10kdata_rotated.Yrot[:, trait], cg10kdata_rotated.Xrot,
cg10kdata_rotated.eigval, cg10kdata_rotated.eigvec, cg10kdata_rotated.logdetV2)
# initialize model parameters
traitj_model = VarianceComponentModel(traitj_data)
# estimate variance components
_, _, _, Σcov, _, _ = mle_fs!(traitj_model, traitj_data; solver=:Ipopt, verbose=false)
σ2a[trait] = traitj_model.Σ[1][1]
σ2e[trait] = traitj_model.Σ[2][1]
@show σ2a[trait], σ2e[trait]
h, hse = heritability(traitj_model.Σ, Σcov)
hST[trait] = h[1]
hST_se[trait] = hse[1]
end
toc()
Out[50]:
In [51]:
# heritability and standard errors
[hST'; hST_se']
Out[51]:
In [52]:
# additive genetic effects (2x2 psd matrices) from bavariate trait analysis;
Σa = Array{Matrix{Float64}}(13, 13)
# environmental effects (2x2 psd matrices) from bavariate trait analysis;
Σe = Array{Matrix{Float64}}(13, 13)
tic()
for i in 1:13
for j in (i+1):13
println(names(cg10k_trait)[i + 2], names(cg10k_trait)[j + 2])
# form data set for (trait1, trait2)
traitij_data = TwoVarCompVariateRotate(cg10kdata_rotated.Yrot[:, [i;j]], cg10kdata_rotated.Xrot,
cg10kdata_rotated.eigval, cg10kdata_rotated.eigvec, cg10kdata_rotated.logdetV2)
# initialize model parameters
traitij_model = VarianceComponentModel(traitij_data)
# estimate variance components
mle_fs!(traitij_model, traitij_data; solver=:Ipopt, verbose=false)
Σa[i, j] = traitij_model.Σ[1]
Σe[i, j] = traitij_model.Σ[2]
@show Σa[i, j], Σe[i, j]
end
end
toc()
Out[52]:
In [53]:
traitidx = 5:7
# form data set
trait57_data = TwoVarCompVariateRotate(cg10kdata_rotated.Yrot[:, traitidx], cg10kdata_rotated.Xrot,
cg10kdata_rotated.eigval, cg10kdata_rotated.eigvec, cg10kdata_rotated.logdetV2)
# initialize model parameters
trait57_model = VarianceComponentModel(trait57_data)
# estimate variance components
@time mle_fs!(trait57_model, trait57_data; solver=:Ipopt, verbose=true)
trait57_model
Out[53]:
We then run the MM algorithm, starting from the Fisher scoring answer. MM finds an improved solution with objective value 8.955397e+03.
In [54]:
# trait59_model contains the fitted model by Fisher scoring now
@time mle_mm!(trait57_model, trait57_data; verbose=true)
trait57_model
Out[54]:
Do another run of MM algorithm from default starting point. It leads to a slightly better local optimum -1.470104e+04, slighly worse than the Fisher scoring result. Follow up anlaysis should use the Fisher scoring result.
In [55]:
# default starting point
trait57_model = VarianceComponentModel(trait57_data)
@time _, _, _, Σcov, = mle_mm!(trait57_model, trait57_data; verbose=true)
trait57_model
Out[55]:
Heritability from 3-variate estimate and their standard errors.
In [56]:
h, hse = heritability(trait57_model.Σ, Σcov)
[h'; hse']
Out[56]:
In [57]:
# initialize model parameters
traitall_model = VarianceComponentModel(cg10kdata_rotated)
# estimate variance components using Fisher scoring algorithm
@time mle_fs!(traitall_model, cg10kdata_rotated; solver=:Ipopt, verbose=true)
From the output we can see the Fisher scoring algorithm ran into some numerical issues. Let's try the MM algorithm.
In [58]:
# reset model parameters
traitall_model = VarianceComponentModel(cg10kdata_rotated)
# estimate variance components using Fisher scoring algorithm
@time mle_mm!(traitall_model, cg10kdata_rotated; verbose=true)
Out[58]:
It converges after ~1000 iterations.
In [59]:
#using JLD
#@save "copd.jld"
#whos()