Bayesian Linear Mixed Models (Genomic Data)

In [1]:
using JWAS,JWAS.Datasets,DataFrames,CSV

In [2]:
phenofile  = Datasets.dataset("example","phenotypes.txt")
pedfile    = Datasets.dataset("example","pedigree.txt")
genofile   = Datasets.dataset("example","genotypes.txt")

phenotypes = CSV.read(phenofile,delim = ',',header=true,missingstrings=["NA"])
pedigree   = get_pedigree(pedfile,separator=",",header=true);


The delimiter in pedigree.txt is ','.
Pedigree informatin:
#individuals: 12
#sires:       4
#dams:        5
#founders:    3

In [3]:
first(phenotypes,5)


Out[3]:

5 rows × 9 columns

IDy1y2y3x1x2x3damweights
StringFloat64⍰Float64⍰Float64⍰Float64Float64StringString⍰Float64
1a1-0.063.58-1.180.92.0mmissing1.0
2a3-2.073.191.00.72.0fmissing1.0
3a4-2.636.97-0.830.61.0ma21.0
4a52.31missing-1.520.42.0ma21.0
5a7missingmissingmissing0.10.1ma31.0
Univariate Linear Mixed Model (Genomic data)

In [4]:
model_equation1  ="y1 = intercept + x1*x3 + x2 + x3 + ID + dam";

In [5]:
model1 = build_model(model_equation1);

In [6]:
set_covariate(model1,"x1");

In [7]:
set_random(model1,"x2");
set_random(model1,"ID dam",pedigree);

In [8]:
add_genotypes(model1,genofile,separator=',');


The delimiter in genotypes.txt is ','.
The header (marker IDs) is provided in genotypes.txt.
5 markers on 7 individuals were added.

In [9]:
out1=runMCMC(model1,phenotypes,methods="BayesC",estimatePi=true);


Checking phenotypes...
Individual IDs (strings) are provided in the first column of the phenotypic data.
Phenotypes for all traits included in the model for individual a7 in the row 5 are missing. This record is deleted.
The number of observations with both genotypes and phenotypes used in the analysis is 4.
The number of observations with both phenotype and pedigree information used in the analysis is 4.
Prior information for genomic variance is not provided and is generated from the data.
Prior information for residual variance is not provided and is generated from the data.
Prior information for random effect variance is not provided and is generated from the data.
Prior information for random effect variance is not provided and is generated from the data.
Missing values are found in independent variables: dam.

The prior for marker effects variance is calculated from the genetic variance and π.
The mean of the prior for the marker effects variance is: 0.617255



A Linear Mixed Model was build using model equations:

y1 = intercept + x1*x3 + x2 + x3 + ID + dam

Model Information:

Term            C/F          F/R            nLevels
intercept       factor       fixed                1
x1*x3           interaction  fixed                2
x2              factor       random               2
x3              factor       fixed                2
ID              factor       random              12
dam             factor       random              12

MCMC Information:

methods                                      BayesC
                              complete genomic data
                   (i.e., non-single-step analysis)
chain_length                                    100
burnin                                            0
estimatePi                                     true
estimateScale                                 false
starting_value                                false
printout_frequency                              101
output_samples_frequency                          1
constraint                                    false
missing_phenotypes                             true
update_priors_frequency                           0
seed                                          false

Hyper-parameters Information:

random effect variances (y1:x2):              [1.253]
random effect variances (y1:ID,y1:dam): [1.253 0.0; 0.0 1.253]
residual variances:                           1.253
genetic variances (genomic):                  1.253
marker effect variances:                      0.617
π                                               0.0

Degree of freedom for hyper-parameters:

residual variances:                           4.000
random effect variances:                      5.000
polygenic effect variances:                   6.000
marker effect variances:                      4.000



The file MCMC_samples_residual_variance.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_polygenic_effects_variance.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_marker_effects_y1.txt is created to save MCMC samples for marker_effects_y1.
The file MCMC_samples_marker_effects_variances.txt is created to save MCMC samples for marker_effects_variances.
The file MCMC_samples_pi.txt is created to save MCMC samples for pi.
The file MCMC_samples_y1.x2_variances.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_y1.ID_y1.dam_variances.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_EBV_y1.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_genetic_variance.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_heritability.txt already exists!!! It is overwritten by the new output.
running MCMC for BayesC...100%|█████████████████████████| Time: 0:00:02

The version of Julia and Platform in use:

Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)


The analysis has finished. Results are saved in the returned variable and text files. MCMC samples are saved in text files.



In [10]:
keys(out1)


Out[10]:
Base.KeySet for a Dict{Any,Any} with 9 entries. Keys:
  "marker effects"
  "EBV_y1"
  "Pi"
  "heritability"
  "location parameters"
  "residual variance"
  "polygenic effects covariance matrix"
  "genetic_variance"
  "marker effects variance"

In [11]:
out1["EBV_y1"]


Out[11]:

7 rows × 3 columns

IDEBVPEV
AnyAnyAny
1a10.4464831.80448
2a30.2465891.93395
3a4-1.203951.3241
4a5-0.3347941.48462
5a60.2391632.05027
6a70.5325152.21422
7a8-0.4318051.82995

In [12]:
out1["Pi"]


Out[12]:

1 rows × 3 columns

πEstimateStd_Error
AnyAnyAny
1π0.4278330.279688
Multivariate Linear Mixed Model (Genomic data)

In [13]:
model_equation2 ="y1 = intercept + x1 + x3 + ID + dam
                  y2 = intercept + x1 + x2 + x3 + ID
                  y3 = intercept + x1 + x1*x3 + x2 + ID";

In [14]:
model2 = build_model(model_equation2);

In [15]:
set_covariate(model2,"x1");

In [16]:
set_random(model2,"x2");
set_random(model2,"ID dam",pedigree);


x2 is not found in model equation 1.
dam is not found in model equation 2.
dam is not found in model equation 3.

In [17]:
add_genotypes(model2,genofile,separator=',');


The delimiter in genotypes.txt is ','.
The header (marker IDs) is provided in genotypes.txt.
5 markers on 7 individuals were added.

In [18]:
out2=runMCMC(model2,phenotypes,methods="BayesC",estimatePi=true);


Checking phenotypes...
Individual IDs (strings) are provided in the first column of the phenotypic data.
Phenotypes for all traits included in the model for individual a7 in the row 5 are missing. This record is deleted.
The number of observations with both genotypes and phenotypes used in the analysis is 4.
The number of observations with both phenotype and pedigree information used in the analysis is 4.
Prior information for genomic variance is not provided and is generated from the data.
Prior information for residual variance is not provided and is generated from the data.
Prior information for random effect variance is not provided and is generated from the data.
Prior information for random effect variance is not provided and is generated from the data.
Missing values are found in independent variables: dam.

Pi (Π) is not provided.
Pi (Π) is generated assuming all markers have effects on all traits.

The prior for marker effects covariance matrix is calculated from genetic covariance matrix and Π.
The mean of the prior for the marker effects covariance matrix is:
 0.617255  0.0       0.0     
 0.0       0.532118  0.0     
 0.0       0.0       0.155597



A Linear Mixed Model was build using model equations:

y1 = intercept + x1 + x3 + ID + dam
y2 = intercept + x1 + x2 + x3 + ID
y3 = intercept + x1 + x1*x3 + x2 + ID

Model Information:

Term            C/F          F/R            nLevels
intercept       factor       fixed                1
x1              covariate    fixed                1
x3              factor       fixed                2
ID              factor       random              12
dam             factor       random              12
x2              factor       random               2
x1*x3           interaction  fixed                2

MCMC Information:

methods                                      BayesC
                              complete genomic data
                   (i.e., non-single-step analysis)
chain_length                                    100
burnin                                            0
estimatePi                                     true
estimateScale                                 false
starting_value                                false
printout_frequency                              101
output_samples_frequency                          1
constraint                                    false
missing_phenotypes                             true
update_priors_frequency                           0
seed                                          false

Hyper-parameters Information:

random effect variances (y2:x2,y3:x2):
 1.081  0.0  
 0.0    0.316
random effect variances (y1:ID,y2:ID,y3:ID,y1:dam):
 1.253  0.0    0.0    0.0  
 0.0    1.081  0.0    0.0  
 0.0    0.0    0.316  0.0  
 0.0    0.0    0.0    1.253
residual variances:           
 1.253  0.0    0.0  
 0.0    1.081  0.0  
 0.0    0.0    0.316
genetic variances (polygenic):
 1.253  0.0    0.0    0.0  
 0.0    1.081  0.0    0.0  
 0.0    0.0    0.316  0.0  
 0.0    0.0    0.0    1.253
genetic variances (genomic):  
 1.253  0.0    0.0  
 0.0    1.081  0.0  
 0.0    0.0    0.316
marker effect variances:      
 0.617  0.0    0.0  
 0.0    0.532  0.0  
 0.0    0.0    0.156

Π: (Y(yes):included; N(no):excluded)

["y1", "y2", "y3"]         probability
["Y", "Y", "N"]                 0.0
["N", "N", "N"]                 0.0
["Y", "N", "N"]                 0.0
["N", "Y", "Y"]                 0.0
["Y", "N", "Y"]                 0.0
["N", "N", "Y"]                 0.0
["Y", "Y", "Y"]                 1.0
["N", "Y", "N"]                 0.0

Degree of freedom for hyper-parameters:

residual variances:                           7.000
random effect variances:                      6.000
polygenic effect variances:                   8.000
marker effect variances:                      7.000



The file MCMC_samples_residual_variance.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_polygenic_effects_variance.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_marker_effects_y1.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_marker_effects_y2.txt is created to save MCMC samples for marker_effects_y2.
The file MCMC_samples_marker_effects_y3.txt is created to save MCMC samples for marker_effects_y3.
The file MCMC_samples_marker_effects_variances.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_pi.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_y2.x2_y3.x2_variances.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_y1.ID_y2.ID_y3.ID_y1.dam_variances.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_EBV_y1.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_EBV_y2.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_EBV_y3.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_genetic_variance.txt already exists!!! It is overwritten by the new output.
The file MCMC_samples_heritability.txt already exists!!! It is overwritten by the new output.
running MCMC for BayesC...100%|█████████████████████████| Time: 0:00:03

The version of Julia and Platform in use:

Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)


The analysis has finished. Results are saved in the returned variable and text files. MCMC samples are saved in text files.



In [19]:
keys(out2)


Out[19]:
Base.KeySet for a Dict{Any,Any} with 11 entries. Keys:
  "location parameters"
  "polygenic effects covariance matrix"
  "EBV_y3"
  "genetic_variance"
  "EBV_y2"
  "EBV_y1"
  "marker effects"
  "heritability"
  "residual variance"
  "Pi"
  "marker effects variance"

In [20]:
out2["Pi"]


Out[20]:

8 rows × 3 columns

πEstimateStd_Error
AnyAnyAny
1[1.0, 1.0, 0.0]0.1367070.118103
2[0.0, 0.0, 0.0]0.1355370.110178
3[1.0, 0.0, 0.0]0.1470730.118746
4[0.0, 1.0, 1.0]0.1105080.0945137
5[1.0, 0.0, 1.0]0.1087520.117394
6[0.0, 0.0, 1.0]0.1096410.10215
7[1.0, 1.0, 1.0]0.1298290.117483
8[0.0, 1.0, 0.0]0.1219550.106232

In [21]:
out2["EBV_y2"]


Out[21]:

7 rows × 3 columns

IDEBVPEV
AnyAnyAny
1a1-1.155080.718204
2a30.6574932.40265
3a4-0.5863961.05399
4a5-0.653141.99496
5a6-0.1645681.64079
6a7-0.7236951.48565
7a8-0.8140142.1087