In [1]:
list_pkgs <- c("plyr", "dplyr", "tidyr", "readr", "car", "ggplot2", "cowplot")
new_pkgs <- list_pkgs[!(list_pkgs %in% installed.packages()[,"Package"])]
if(length(new_pkgs) > 0){ install.packages(new_pkgs) }
library(plyr)
library(dplyr)
library(tidyr)
library(readr)
library(car)
library(ggplot2)
library(cowplot)
Introduce the basic concepts of QTL analysis - we will start with mapping in an F2 population and later consider issues that arise in more complex populations.
We should start by looking at the structure of data sets and distribution of traits.
We will use data from an F2 data to do basic analysis. Typical F2 populations are formed by crosses between inbred lines, which typically differ in some trait(s) of interest. We will start with the simplest ANOVA model to look at whether there are differences in the phenotypes of the different genotypes at a locus (we will use marker 16 and trait 2a for this first analysis).
We will next adapt this analysis to run it as a regression model, including creating index variables to separate additive from dominance effects.
This basic model can be adapted as an iterative analysis that walks across markers. We can use this to test all markers for effects on trait 2a and then run the analysis for trait 3a. We will discuss use of LPR or LOD as a measure of significance.
For trait 2a we see a single locus, but because of LD the adjacent markers are also significant. For trait 3a we are left with a confusing pattern - how could we identify the most likely positions of QTL?
How could we include marker co-factors if we want to identify positions of QTL?
If we run this model on a very large number of markers, how would we be able to identify significant loci? We could correct for multiple testing, but the tests are not independent. We could use a permutation test to establish the empirical significance threshold. We can run a simple permutational model for trait 1 to establish the significance of a single test. How can we establish the threshold for a set of tests?
We could also use this permutation approach simulate a true positive as a way of looking at power.
We'll begin this tutorial with a simple unstructured F2 population. This is a population constructed by crossing a pair of inbred lines to produce a 'hybrid' F1 population and randomly crossing these F1 individuals to produce a F2. The F2 has a single generation of recombination between the genomes of the two parental lines and all individuals are equally related because they all had genetically identical parents.
Let's first load the data we need and make some plots to see if our traits were measured and imported correctly.
In [2]:
f2_data = read_csv("./F2 geno pheno with QTL effect.csv")
f2_data
In [3]:
f2_data %>%
gather(trait, value, Trait2a:Trait3a) %>%
ggplot(aes(trait, value)) + geom_violin() + geom_jitter(width = 0.1, alpha = 0.1)
The simplest way of mapping a QTL is using an Anova. We divide genotypes at a particular marker position into 3 classes (the two homozygotes and the heterozygote) and look for diferences among the classes. Later we'll see how to dissect this further. This model is comparing the different genotypes states of the G16 marker with respect to the Trait2a phenotype. We are also controling for differences in Sex, litter size at birth (LSB) and litter size at weaning (LSW).
The lm() function in R can be used for this test.
In [4]:
anova_fit_f2_marker_16 = lm(Trait2a ~ Sex + LSB + LSW + as.factor(G16), data = f2_data)
Anova(anova_fit_f2_marker_16)
In [5]:
coef(anova_fit_f2_marker_16)
Another way of doing this is using a regression model, in which we code the different genotypes numericaly (-1 for one homozygote, 0 for the heterozygotes, and +1 for the other homozygote). The coeficient associated with this encoding is related to the additive effect of the marker.
In [6]:
reg_fit_f2_marker_16_A = lm(Trait2a ~ Sex + LSB + LSW + G16, data = f2_data)
summary(reg_fit_f2_marker_16_A)
We can also use a coding of 1 for the heterozygotes and 0 for the homozygotes to extract a coefficient related to the dominance effect of the marker.
In [7]:
f2_data$G16_D = ifelse(f2_data$G16, 0, 1) # This is the dominance coding
reg_fit_f2_marker_16_AD = lm(Trait2a ~ Sex + LSB + LSW + G16 + G16_D, data = f2_data)
summary(reg_fit_f2_marker_16_AD)
If we have a set of markers we can run a series of single markers models along the chromossome to locate the position of the QTL in that cromossome. Let's create dominance collumns and run the regression models for each marker
In [8]:
n_markers = 31
marker_fits_Trait2a = list()
for(i in seq(n_markers)){
current_marker = paste0('G', i)
f2_data[[paste0(current_marker, '_D')]] = ifelse(f2_data[[current_marker]], 0, 1)
model_formula = paste0("Trait2a ~ Sex + LSB + LSW + ", current_marker, "+", current_marker, "_D")
marker_fits_Trait2a[[i]] = lm(as.formula(model_formula), data = f2_data)
}
Let's look at one of the model fits to see if everything is ok
In [9]:
summary(marker_fits_Trait2a[[3]])
We can use this set of fits to see how the log significance changes along the cromossome. The regression model allows us to separate the additive and dominance effects.
In [10]:
ldply(marker_fits_Trait2a, function(x) summary(x)$coefficients[5, 'Pr(>|t|)']) %>%
ggplot(aes(1:n_markers, -log10(V1))) + geom_line() + labs(x = "marker", y = "LPR") +
scale_x_continuous(breaks = seq(1, n_markers, by = 2))
In [11]:
ldply(marker_fits_Trait2a, function(x) summary(x)$coefficients[6, 'Pr(>|t|)']) %>%
ggplot(aes(1:n_markers, -log10(V1))) + geom_line() + labs(x = "marker", y = "LPR") +
scale_x_continuous(breaks = seq(1, n_markers, by = 2))
We could also do both graphs at once using some faceting.
In [12]:
# Additive and dominance significance ploted together
ldply(marker_fits_Trait2a, function(x) summary(x)$coefficients[c(5, 6), 'Pr(>|t|)']) %>%
rename(additive = G1, dominance = G1_D) %>%
mutate(marker = 1:n_markers) %>%
gather(variable, value, additive:dominance) %>%
ggplot(aes(marker, -log10(value)), group = variable) +
geom_line() + labs(x = "marker", y = "LPR") +
scale_x_continuous(breaks = seq(1, n_markers, by = 2)) + facet_wrap(~variable)
In [13]:
n_markers = 31
marker_fits_Trait3a = list()
for(i in seq(n_markers)){
current_marker = paste0('G', i)
model_formula = paste0("Trait3a ~ Sex + LSB + LSW + ", current_marker, "+", current_marker, "_D")
marker_fits_Trait3a[[i]] = lm(as.formula(model_formula), data = f2_data)
}
In [14]:
ldply(marker_fits_Trait3a, function(x) summary(x)$coefficients[c(5, 6), 'Pr(>|t|)']) %>%
rename(additive = G1, dominance = G1_D) %>%
mutate(marker = 1:n_markers) %>%
gather(variable, value, additive:dominance) %>%
ggplot(aes(marker, -log10(value)), group = variable) +
geom_line() + labs(x = "marker", y = "LPR") +
scale_x_continuous(breaks = seq(1, n_markers, by = 2)) + facet_wrap(~variable)
We can take LD into account in the mapping by including flanking markers. This provides more localised estimates of the QTLs.
In [15]:
fit_f2_m13 = lm(Trait2a ~ Sex + LSB + LSW + G13 + G13_D , data = f2_data)
fit_f2_m13_fl11_15 = lm(Trait2a ~ Sex + LSB + LSW + G13 + G13_D + G11 + G11_D + G15 + G15_D, data = f2_data)
summary(fit_f2_m13)$coefficients
summary(fit_f2_m13_fl11_15)$coefficients
Now let's redo all the individual markers, but including 2 flanking markers in the models. The markers at the edge of the chromossome only get one flanking marker.
In [16]:
n_markers = 31
interval = 6 # Play around with this value.
fl_marker_fits_Trait3a = list()
for(i in seq(n_markers)){
current_marker = paste0('G', i)
f2_data[[paste0(current_marker, '_D')]] = ifelse(f2_data[[current_marker]], 0, 1)
model_formula = paste0("Trait3a ~ Sex + LSB + LSW + ", current_marker, "+", current_marker, "_D")
if((i - interval) >= 1)
model_formula = paste0(model_formula, "+ G", paste0(i - interval), "+ G", paste0(i - interval), "_D")
if((i + interval) <= n_markers)
model_formula = paste0(model_formula, "+ G", paste0(i + interval), "+ G", paste0(i + interval), "_D")
fl_marker_fits_Trait3a[[i]] = lm(as.formula(model_formula), data = f2_data)
}
In [17]:
ldply(fl_marker_fits_Trait3a, function(x) summary(x)$coefficients[c(5, 6), 'Pr(>|t|)']) %>%
rename(additive = G1, dominance = G1_D) %>%
mutate(marker = 1:n_markers) %>%
gather(variable, value, additive:dominance) %>%
ggplot(aes(marker, -log(value)), group = variable) +
geom_line() + labs(x = "marker", y = "Log") +
scale_x_continuous(breaks = seq(1, n_markers, by = 2)) + facet_wrap(~variable)
Let's run a null model, with no QTLs, and see how many false postives we get. The expectation is the significance treashold we choose (5% maybe?).
In [18]:
simulated_genotypes = read_csv("sim genotypes.csv")
f2_data_null = inner_join(select(f2_data, ID:trait3), simulated_genotypes, by = "ID")
f2_data_null
In [19]:
null_markers = 675
marker_fits_trait1 = list()
for(i in seq(null_markers)){
current_marker = paste0('G', i)
f2_data_null[[paste0(current_marker, '_D')]] = ifelse(f2_data_null[[current_marker]], 0, 1)
model_formula = paste0("trait1 ~ Sex + LSB + LSW + ", current_marker, "+", current_marker, "_D")
marker_fits_trait1[[i]] = lm(as.formula(model_formula), data = f2_data_null)
}
In [20]:
table(laply(marker_fits_trait1, function(x) summary(x)$coefficients[5, 'Pr(>|t|)']) < 0.05) / null_markers