In this tutorial we explore the consequences of genetic covariation for multivariate evolution using a toy dataset measured in 5 related species rodent species, named A to E. In the dataset we have 4 quantitative trais (Humer, Ulna, Tibia and Femur). Each species have a sample size of $N = 60$. We will use this dataset to illustrate several points related to multivariate evolution and data analyses.
As usual, first we need to install a few packages
In [1]:
list_pkgs <- c("evolqg", "ggplot2", "GGally")
new_pkgs <- list_pkgs[!(list_pkgs %in% installed.packages()[,"Package"])]
if(length(new_pkgs) > 0){ install.packages(new_pkgs) }
library(evolqg)
library(ggplot2)
library(GGally)
library(MASS)
library(scales)
Fortunately, the dataset we are going to use is included in the evolqg package, so we can just load it into our workspace using the data function.
In [2]:
data(dentus)
This creates a data.frame in the workspace names dentus with all the data we need.
In [3]:
dentus
First, let's visualize the data by making simple graphs (scatterplots) of all the traits in the dataset with different symbol and colors for all species. There should be a total of 6 graphs (Trait 1 x 2, 1 x 3, and so on). It can also be represented as a scatterplot matrix:
In [4]:
ggpairs(dentus, mapping = aes(color = species), columns = c("humerus", "ulna","femur", "tibia", "species"))
There is a lot of information on these graphs. What can you see in terms of association between traits within each species? Are they similar? Are there differences? The graphs also give you information about the size differences between species.
Let us examine now how the variation within-groups (species) and between-groups is distributed. One common method to summarize or transform variation in several traits into fewer number of variables is principal component analysis (PCA). Usually, a system with 30-40 traits can be represented by a much smaller number of principal components, say 2 to 7. These principal components are composite traits using all the original traits, but unlike the original traits, they are uncorrelated to each other.
If you know some linear algebra, the pincipal components are the eigenvectors of the covariance matrix between the traits. The eigenvalues are the variances in the dataset in the direction of the corresponding eigenvector. If you don't know linear algebra, here is a figure to ilustrate a principal componente in two dimensions.
Also, take a look at this great tutorial on PCA
Let's extract the first 2 principal components from this dataset and then plot the two new variables and examine the distribution of points and species on it. What can you see in this graph in regard to the distribution of variation within and between groups?
In [5]:
# PCA is the eigenvalues of the covariance matrix, so let's calcualte the covariance matrix for the full dataset
fullcov = cov(dentus[,1:4])
fullcov
In [6]:
#Now we take the eigen decomposition:
eigen_fullcov = eigen(fullcov)
eigen_fullcov$values
In [7]:
# These coluns are all the PCs
eigen_fullcov$vectors
In [8]:
# We can use these PCs to project the original dataset and get the scores for each individual in the PCs
dentus_fullPCscores = as.matrix(dentus[,1:4]) %*% eigen_fullcov$vectors
# These must be uncorrelated, so all correlation should be zero
round(cor(dentus_fullPCscores), 5)
In [9]:
# Now the plot of the first two principal components
ggplot(data.frame(PC1 = dentus_fullPCscores[,1], PC2 = dentus_fullPCscores[,2], species = dentus$species),
aes(-PC1, PC2, color = species)) + geom_point() +
labs(x = paste("PC1 (", percent(eigen_fullcov$values[1]/sum(eigen_fullcov$values)), ")", sep=""),
y = paste("PC2 (", percent(eigen_fullcov$values[2]/sum(eigen_fullcov$values)), ")", sep=""))
Now run a discriminant analyses with all 5 species as grouping (independent) variable and the 4 quantitative as predictor variables and plot the first discriminant function against the second one. Notice how both methods rotate the axis to find solutions to the problem. In the principal components you maximize the variance captured by each variable while on the DF you maximize the differences between groups while controlling for the variation within each group.
In [10]:
lda <- lda(species ~ .,
dentus,
prior = c(1,1,1,1,1)/5)
prop.lda = lda$svd^2/sum(lda$svd^2)
plda <- predict(object = lda,
newdata = dentus)
dataset = data.frame(species = dentus[,"species"], lda = plda$x)
ggplot(dataset) + geom_point(aes(lda.LD1, lda.LD2, colour = species), size = 2.5) +
labs(x = paste("LD1 (", percent(prop.lda[1]), ")", sep=""),
y = paste("LD2 (", percent(prop.lda[2]), ")", sep=""))
Now let’s turn our attention to the species Matrices and how traits are associated within each one of them.
Compute the correlation matrix (all 6 pairs of correlation) for each species. For our discussion here this matrix is very similar to the variance/covariance matrix so for sake of comparison we will use the correlation (all traits have very similar variance except for one in one species).
In [11]:
# We can calculate the within species matrix by subsettting the original data.frame for each matrix:
corA = cor(dentus[dentus$species == "A", 1:4])
corB = cor(dentus[dentus$species == "B", 1:4])
corC = cor(dentus[dentus$species == "C", 1:4])
corD = cor(dentus[dentus$species == "D", 1:4])
corE = cor(dentus[dentus$species == "E", 1:4])
#now let's group them in a list
cor_mats = list(A = corA, B = corB, C = corC, D = corD, E = corE)
cor_mats
Do species have the same correlation structure (patterns and magnitude of correlation)? How would that structure affect the evolutionary potential of each species? To anwser these questions we will simulate a 1000 random vectors of selection and multiply each species matrix by these vectors. Doing this we obtain 1000 response vectors produced by this simulated selection.
In [12]:
random_betas = matrix(rnorm(1000*4), 4, 1000)
# These are the simulated responses
response_list = lapply(cor_mats, function(x) x %*% random_betas)
We can now relate the structure of the correlation matrix with those responses calculating a series of statistics:
In [13]:
# First we need a vector correlation funcion:
corVector = function(x, y) x %*% y / (Norm(x) * Norm(y))
# And a function that calculates the vector correlation between the columns of two matrices:
corColumns = function(x, y){
n = ncol(x)
correlations = numeric(n)
for(i in 1:n)
correlations[i] = corVector(x[,i], y[,i])
return(correlations)
}
# And a function that calculates the vector correlation between the columns of a matrix and a vector:
corColumnVector = function(x, vector){
n = ncol(x)
correlations = numeric(n)
for(i in 1:n)
correlations[i] = corVector(x[,i], vector)
return(correlations)
}
In [14]:
# Correlation between selection vector ($\beta$) x response vector $\delta z$
cor_beta_dz = lapply(response_list, corColumns, random_betas)
# Correlation between response vector x PC1 (first principal component of each species)
species = list("A", "B", "C", "D", "E")
cor_dz_PC1 = lapply(species, function(sp) abs(corColumnVector(response_list[[sp]], eigen(cor_mats[[sp]])$vector[,1])))
names(cor_dz_PC1) = species
# Correlation between response vector x PC1 (first principal component of each species)
cor_dz_PC2 = lapply(species, function(sp) abs(corColumnVector(response_list[[sp]], eigen(cor_mats[[sp]])$vector[,2])))
names(cor_dz_PC2) = species
# We can joint all these correlation in a data.frame
corelation_df = NULL
for(sp in species){
corelation_df = rbind(corelation_df,
data.frame(beta_dz = cor_beta_dz[[sp]],
dz_PC1 = cor_dz_PC1[[sp]],
dz_PC2 = cor_dz_PC2[[sp]],
species = sp))
}
Now make a graph for each species of the correlation between PC1 x response vector in the x-axis and response vector x pc2 on the y-axis. Do you see any differences in the potential of responses for each species? Which species is most “constrained” or limited and the potential responses? Does that make sense when you look at the association between the 4 traits?
In [15]:
ggplot(corelation_df, aes(dz_PC1, dz_PC2)) + geom_point(alpha = 0.1) + facet_wrap(~species, ncol = 2)
We can also look at the distribution of the aligmente between the response and the selection for each species
In [16]:
ggplot(corelation_df, aes(species, beta_dz)) + geom_violin() + geom_jitter(alpha = 0.1)