In [1]:
list_pkgs <- c("plyr", "dplyr", "tidyr", "readr", "car", "lme4", "lmerTest", "ggplot2", "cowplot")
new_pkgs <- list_pkgs[!(list_pkgs %in% installed.packages()[,"Package"])]
if(length(new_pkgs) > 0){ install.packages(new_pkgs) }
library(plyr)
library(dplyr)
library(tidyr)
library(readr)
library(car)
library(lme4)
library(lmerTest)
library(ggplot2)
library(cowplot)
Move on to an F3 population and look at the importance of population stratification.
An F3 is created by random mating of F2 individuals. F2 individuals are genetically different from each other, so the F3 generation typically has family structure (sets of siblings sharing the same parents). This phenomenon is generally called population stratification. It is important because individuals with correlated phenotypes have correlated genotypes due to relatedness. Treating them as independent leads to a form of pseudoreplication.
We can start by running the null model for a set of simulated markers to see how the analysis performs with a true null hypothesis. Do we get the right number of false positives?
We can also try to run this model by randomizing either genotypes and phenotypes, while ignoring the structure of the population (so we ignore family). Do we get the same significance results?
We can account for the influence of population stratification by including a random effect term that accounts for family. This is basically the same as running the variance component model, so we are fitting the family variance component as a random effect while fitting the fixed effect of a locus. This same approach could be done in a more complex population by using an animal model to fit the family variance component.
We can now go back and run the null model with simulated markers and look at how including the family covariance parameter changes the behaviour of the model.
We can also use this mixed model to simultaneously estimate QTL effects and family variance components, which is a simple version of the structure of a model that is now widely used for mapping loci in natural populations (including for human association studies)
We can also look at whether a locus is pleiotropic and affects multiple traits.
An F3 population has more recombination, which means you get higher resolution, but your model is more complex
In [2]:
f3_data = read_csv("./F3 geno pheno with QTL effect.csv")
f3_data
In [3]:
f3_data %>%
gather(trait, value, Trait2a:Trait3b) %>%
ggplot(aes(trait, value)) + geom_violin() + geom_jitter(width = 0.1, alpha = 0.1)
Let's see what happens when we use the same model as in the F2 population, using the set of null markers with no effect.
Remeber, we expect no more than 5% of significant markers in this set.
In [4]:
simulated_genotypes = read_csv("sim genotypes.csv")
f3_data_null = inner_join(select(f3_data, ID:Trait3), simulated_genotypes, by = "ID")
#f3_data_null
In [5]:
n_markers = 675
null_marker_fits_trait3 = list()
for(i in seq(n_markers)){
current_marker = paste0('G', i)
f3_data_null[[paste0(current_marker, '_D')]] = ifelse(f3_data_null[[current_marker]], 0, 1)
model_formula = paste0("Trait3 ~ Sex + LSB + LSW + ", current_marker, "+", current_marker, "_D")
null_marker_fits_trait3[[i]] = lm(as.formula(model_formula), data = f3_data_null)
}
In [6]:
table(laply(null_marker_fits_trait3, function(x) summary(x)$coefficients[5, 'Pr(>|t|)']) < 0.05) / n_markers
Almost four times what we should be getting!
We can take stratification into account using a random effect for family structure. The lme4 and lmerTest packages allow us to fit the model and extract the relevant p-value. The only difference is the (1|FAMILY) term in the model formula.
In [7]:
n_markers = 675
null_marker_mixed_fits_trait3 = list()
for(i in seq(n_markers)){
current_marker = paste0('G', i)
f3_data_null[[paste0(current_marker, '_D')]] = ifelse(f3_data_null[[current_marker]], 0, 1)
model_formula = paste0("Trait3 ~ (1|FAMILY) + Sex + LSB + LSW + ", current_marker, "+", current_marker, "_D")
null_marker_mixed_fits_trait3[[i]] = lmer(as.formula(model_formula), data = f3_data_null)
}
In [8]:
table(laply(null_marker_mixed_fits_trait3, function(x) summary(x)$coefficients[5, 'Pr(>|t|)']) < 0.05) / n_markers
Much better!
For more than 1 trait we can fit the models separetly for each trait or run them in a single model.
In [9]:
n_markers = 31
interval = 6 # Play around with this value.
fl_marker_fits_trait2a = list()
for(i in seq(n_markers)) f3_data[[paste0(paste0('G', i), '_D')]] = ifelse(f3_data[[paste0('G', i)]], 0, 1)
for(i in seq(n_markers)){
current_marker = paste0('G', i)
model_formula = paste0("Trait2a ~ (1|FAMILY) + Sex + LSB + LSW + ", current_marker, "+", current_marker, "_D")
if((i - interval) >= 1)
model_formula = paste0(model_formula, "+ G", paste0(i - interval), "+ G", paste0(i - interval), "_D")
if((i + interval) <= n_markers)
model_formula = paste0(model_formula, "+ G", paste0(i + interval), "+ G", paste0(i + interval), "_D")
fl_marker_fits_trait2a[[i]] = lmer(as.formula(model_formula), data = f3_data)
}
In [10]:
ldply(fl_marker_fits_trait2a, function(x) summary(x)$coefficients[5, 'Pr(>|t|)']) %>%
ggplot(aes(1:n_markers, -log10(V1))) + geom_line() + labs(x = "marker", y = "LPR") +
scale_x_continuous(breaks = seq(1, n_markers, by = 2))
In [11]:
n_markers = 31
interval = 6 # Play around with this value.
fl_marker_fits_trait3a = list()
for(i in seq(n_markers)){
current_marker = paste0('G', i)
model_formula = paste0("Trait3a ~ (1|FAMILY) + Sex + LSB + LSW + ", current_marker, "+", current_marker, "_D")
if((i - interval) >= 1)
model_formula = paste0(model_formula, "+ G", paste0(i - interval), "+ G", paste0(i - interval), "_D")
if((i + interval) <= n_markers)
model_formula = paste0(model_formula, "+ G", paste0(i + interval), "+ G", paste0(i + interval), "_D")
fl_marker_fits_trait3a[[i]] = lmer(as.formula(model_formula), data = f3_data)
}
In [12]:
ldply(fl_marker_fits_trait3a, function(x) summary(x)$coefficients[5, 'Pr(>|t|)']) %>%
ggplot(aes(1:n_markers, -log10(V1))) + geom_line() + labs(x = "marker", y = "LPR") +
scale_x_continuous(breaks = seq(1, n_markers, by = 2))
It's also possible to run both traits at the same time and get an estimate for the significance of the marker for all traits. This is more complicated, so we'll do it for a single marker which we already know there is an effect.
First we need to get the data in the apropriate "narrow" format, with a single column for the value of the traits and a column for the trait identity:
In [13]:
narrow_f3_data = gather(f3_data, variable, value, Trait2a, Trait3a)
narrow_f3_data
Now we set up a mixed model with a (variable|FAMILY) mixed effect, which will fit a genetic covariance matrix between the 2 traits.
In [14]:
null_2trait_fit = lmer(value ~ (variable|FAMILY) + variable + variable:Sex + variable:LSB + variable:LSW, data = narrow_f3_data)
summary(null_2trait_fit)
Now we add the marker term to this model
In [15]:
marker17_2trait_fit = update(null_2trait_fit, . ~ . + variable:G17 + variable:G17_D)
summary(marker17_2trait_fit)
Now we can compare the two models (with and without the marker) using a likelihood ratio test. As expected the marker term is highly significant.
In [16]:
anova(null_2trait_fit, marker17_2trait_fit)
We can also extract the vectors of additive and dominance effects.
In [17]:
additive = summary(marker17_2trait_fit)$coefficients[9:10, 1]
dominance = summary(marker17_2trait_fit)$coefficients[11:12, 1]
additive
dominance