Multivariate evolution

Gabriel Marroig and Diogo Melo

14/09/2016

In this tutorial we explore the consequences of genetic covariation for multivariate evolution using a toy dataset measured in 5 related species rodent species, named A to E. In the dataset we have 4 quantitative trais (Humer, Ulna, Tibia and Femur). Each species have a sample size of $N = 60$. We will use this dataset to illustrate several points related to multivariate evolution and data analyses.

As usual, first we need to install a few packages


In [1]:
list_pkgs <- c("evolqg", "ggplot2", "GGally")
new_pkgs <- list_pkgs[!(list_pkgs %in% installed.packages()[,"Package"])]
if(length(new_pkgs) > 0){ install.packages(new_pkgs) }

library(evolqg)
library(ggplot2)
library(GGally)
library(MASS)
library(scales)


Loading required package: plyr

Fortunately, the dataset we are going to use is included in the evolqg package, so we can just load it into our workspace using the data function.


In [2]:
data(dentus)

This creates a data.frame in the workspace names dentus with all the data we need.


In [3]:
dentus


humerusulnafemurtibiaspecies
3.214619 8.03607213.51779 19.01131 A
5.222875 10.83837515.54714 21.01718 A
5.193021 11.91177017.24719 22.43190 A
6.547400 11.29308514.89040 20.17706 A
4.724383 9.89713514.81104 19.59753 A
4.630147 9.44117415.24754 19.25937 A
6.771560 11.90540015.57164 22.06777 A
5.229635 9.88157513.98772 19.05881 A
4.786389 10.02060715.67659 19.62829 A
4.171654 8.99820415.52306 20.06622 A
5.428925 9.64538314.02324 21.07972 A
4.689755 10.72900015.81852 19.60883 A
5.826453 10.69012414.72827 19.57330 A
6.627715 10.61544616.00286 21.46821 A
4.455203 9.69060015.16681 20.01375 A
4.946527 11.13724514.24756 19.52614 A
5.380983 10.63365713.41578 18.87848 A
4.595959 8.72382014.98695 20.80915 A
4.206078 8.77137214.92444 19.57403 A
5.634718 9.50918615.48804 20.39082 A
5.291808 10.42198315.78412 20.54966 A
4.829492 9.82493714.07292 19.55087 A
3.503918 9.07436613.97675 18.90484 A
4.555670 10.16981516.37251 21.33269 A
3.917700 8.99146115.23725 20.41206 A
4.101148 8.79678215.40484 20.73443 A
5.099787 9.90273413.84876 20.25140 A
6.661751 11.99038815.30423 19.88481 A
3.903197 9.74791213.72621 19.37218 A
6.145626 10.24605414.21860 18.78641 A
5.78581914.5325520.6287726.88133E
7.15398616.1043719.6630426.53573E
7.46251113.0995022.0709128.38975E
7.96141813.9219920.4357226.26523E
6.87621214.2698720.7734527.98202E
8.19253013.4993020.4909027.00043E
9.15693513.8412320.6446728.66664E
7.31932616.0593220.8873327.28251E
6.70632914.0315921.6591328.69696E
6.40063414.6991320.2632328.91090E
7.22177013.0508420.4219727.56913E
7.10743315.1314221.1851228.13049E
6.01252113.1261018.6641926.92798E
6.71368114.3391622.4627927.50479E
7.56383614.9540020.1363428.04306E
7.09920813.4646619.8455125.82650E
6.76522314.7154420.6270128.76826E
7.39189214.1899820.2610726.43137E
7.33755613.2678119.3376626.36457E
5.84474014.1108520.5182826.41921E
6.80769215.1388720.0701126.00979E
9.01503612.3400021.5781228.13771E
5.59446214.9814421.2759827.25072E
7.25906014.5113221.9468828.78272E
7.93019615.2594621.8696928.05573E
8.69504015.5229920.8840929.06015E
7.34272613.8327621.4299428.26561E
7.91111814.2995520.2139827.89367E
7.45021215.3971919.9135828.18544E
7.87118814.4250321.1333727.91589E

First, let's visualize the data by making simple graphs (scatterplots) of all the traits in the dataset with different symbol and colors for all species. There should be a total of 6 graphs (Trait 1 x 2, 1 x 3, and so on). It can also be represented as a scatterplot matrix:


In [4]:
ggpairs(dentus,  mapping = aes(color = species), columns = c("humerus", "ulna","femur", "tibia", "species"))


`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There is a lot of information on these graphs. What can you see in terms of association between traits within each species? Are they similar? Are there differences? The graphs also give you information about the size differences between species.

Principal components and discriminant analysis

Let us examine now how the variation within-groups (species) and between-groups is distributed. One common method to summarize or transform variation in several traits into fewer number of variables is principal component analysis (PCA). Usually, a system with 30-40 traits can be represented by a much smaller number of principal components, say 2 to 7. These principal components are composite traits using all the original traits, but unlike the original traits, they are uncorrelated to each other.

If you know some linear algebra, the pincipal components are the eigenvectors of the covariance matrix between the traits. The eigenvalues are the variances in the dataset in the direction of the corresponding eigenvector. If you don't know linear algebra, here is a figure to ilustrate a principal componente in two dimensions.

Also, take a look at this great tutorial on PCA

Let's extract the first 2 principal components from this dataset and then plot the two new variables and examine the distribution of points and species on it. What can you see in this graph in regard to the distribution of variation within and between groups?


In [5]:
# PCA is the eigenvalues of the covariance matrix, so let's calcualte the covariance matrix for the full dataset
fullcov = cov(dentus[,1:4])
fullcov


humerusulnafemurtibia
humerus 5.246708 9.336696 7.85146610.44042
ulna 9.33669619.10211116.13515221.28146
femur 7.85146616.13515220.10682025.94645
tibia10.44041621.28145725.94644934.96317

In [6]:
#Now we take the eigen decomposition:
eigen_fullcov = eigen(fullcov)
eigen_fullcov$values


  1. 72.626035563574
  2. 5.70583545013819
  3. 0.59169798120015
  4. 0.495240904446538

In [7]:
# These coluns are all the PCs
eigen_fullcov$vectors


-0.2302141 0.4138996 0.6109147-0.6344067
-0.4663372 0.7473891-0.3214898 0.3472514
-0.5145071-0.3053580-0.5708807-0.5622580
-0.6817723-0.4205392 0.4444353 0.4010120

In [8]:
# We can use these PCs to project the original dataset and get the scores for each individual in the PCs
dentus_fullPCscores = as.matrix(dentus[,1:4]) %*% eigen_fullcov$vectors

# These must be uncorrelated, so all correlation should be zero
round(cor(dentus_fullPCscores), 5)


1000
0100
0010
0001

In [9]:
# Now the plot of the first two principal components
ggplot(data.frame(PC1 = dentus_fullPCscores[,1], PC2 = dentus_fullPCscores[,2], species = dentus$species), 
       aes(-PC1, PC2, color = species)) + geom_point() +
 labs(x = paste("PC1 (", percent(eigen_fullcov$values[1]/sum(eigen_fullcov$values)), ")", sep=""),
       y = paste("PC2 (", percent(eigen_fullcov$values[2]/sum(eigen_fullcov$values)), ")", sep=""))


Now run a discriminant analyses with all 5 species as grouping (independent) variable and the 4 quantitative as predictor variables and plot the first discriminant function against the second one. Notice how both methods rotate the axis to find solutions to the problem. In the principal components you maximize the variance captured by each variable while on the DF you maximize the differences between groups while controlling for the variation within each group.


In [10]:
lda <- lda(species ~ ., 
           dentus, 
           prior = c(1,1,1,1,1)/5)

prop.lda = lda$svd^2/sum(lda$svd^2)

plda <- predict(object = lda,
                newdata = dentus)

dataset = data.frame(species = dentus[,"species"], lda = plda$x)

ggplot(dataset) + geom_point(aes(lda.LD1, lda.LD2, colour = species), size = 2.5) + 
  labs(x = paste("LD1 (", percent(prop.lda[1]), ")", sep=""),
       y = paste("LD2 (", percent(prop.lda[2]), ")", sep=""))


Response to selection

Now let’s turn our attention to the species Matrices and how traits are associated within each one of them.

Compute the correlation matrix (all 6 pairs of correlation) for each species. For our discussion here this matrix is very similar to the variance/covariance matrix so for sake of comparison we will use the correlation (all traits have very similar variance except for one in one species).


In [11]:
# We can calculate the within species matrix by subsettting the original data.frame for each matrix:
corA = cor(dentus[dentus$species == "A", 1:4])
corB = cor(dentus[dentus$species == "B", 1:4])
corC = cor(dentus[dentus$species == "C", 1:4])
corD = cor(dentus[dentus$species == "D", 1:4])
corE = cor(dentus[dentus$species == "E", 1:4])

#now let's group them in a list
cor_mats = list(A = corA, B = corB, C = corC, D = corD, E = corE)
cor_mats


$A
humerusulnafemurtibia
humerus1.00000000.80423490.23324360.2880717
ulna0.80423491.00000000.30382860.3321251
femur0.23324360.30382861.00000000.7690596
tibia0.28807170.33212510.76905961.0000000
$B
humerusulnafemurtibia
humerus1.0000000 0.433285690.172933720.1840682
ulna0.4332857 1.000000000.039905350.1373924
femur0.1729337 0.039905351.000000000.6896008
tibia0.1840682 0.137392440.689600821.0000000
$C
humerusulnafemurtibia
humerus1.000000000.788776140.014993360.3469841
ulna0.788776141.000000000.080682660.3640083
femur0.014993360.080682661.000000000.1662252
tibia0.346984120.364008260.166225211.0000000
$D
humerusulnafemurtibia
humerus1.00000000.77895530.51640910.5862964
ulna0.77895531.00000000.72415030.6674222
femur0.51640910.72415031.00000000.5806471
tibia0.58629640.66742220.58064711.0000000
$E
humerusulnafemurtibia
humerus 1.00000000-0.122277270.02229512 0.3655136
ulna-0.12227727 1.000000000.01355941 0.1594792
femur 0.02229512 0.013559411.00000000 0.2598259
tibia 0.36551364 0.159479200.25982593 1.0000000

Do species have the same correlation structure (patterns and magnitude of correlation)? How would that structure affect the evolutionary potential of each species? To anwser these questions we will simulate a 1000 random vectors of selection and multiply each species matrix by these vectors. Doing this we obtain 1000 response vectors produced by this simulated selection.


In [12]:
random_betas = matrix(rnorm(1000*4), 4, 1000)

# These are the simulated responses
response_list = lapply(cor_mats, function(x) x %*% random_betas)

We can now relate the structure of the correlation matrix with those responses calculating a series of statistics:

  • Correlation between selection vector x response vector
  • Correlation between response vector x PC1 (first principal component of each species)
  • Correlation between response vector x PC2 (second principal component of each species)

In [13]:
# First we need a vector correlation funcion:
corVector = function(x, y) x %*% y / (Norm(x) * Norm(y))

# And a function that calculates the vector correlation between the columns of two matrices:
corColumns = function(x, y){
    n = ncol(x)
    correlations = numeric(n)
    for(i in 1:n)
        correlations[i] = corVector(x[,i], y[,i])
    return(correlations)
}
    
# And a function that calculates the vector correlation between the columns of a matrix and a vector:
corColumnVector = function(x, vector){
    n = ncol(x)
    correlations = numeric(n)
    for(i in 1:n)
        correlations[i] = corVector(x[,i], vector)
    return(correlations)
}

In [14]:
# Correlation between selection vector ($\beta$) x response vector $\delta z$
cor_beta_dz = lapply(response_list, corColumns, random_betas)

# Correlation between response vector x PC1 (first principal component of each species)
species = list("A", "B", "C", "D", "E")
cor_dz_PC1 = lapply(species, function(sp) abs(corColumnVector(response_list[[sp]], eigen(cor_mats[[sp]])$vector[,1])))
names(cor_dz_PC1) = species
    
# Correlation between response vector x PC1 (first principal component of each species)
cor_dz_PC2 = lapply(species, function(sp) abs(corColumnVector(response_list[[sp]], eigen(cor_mats[[sp]])$vector[,2])))
names(cor_dz_PC2) = species
    
# We can joint all these correlation in a data.frame
corelation_df = NULL
for(sp in species){
    corelation_df = rbind(corelation_df, 
                      data.frame(beta_dz = cor_beta_dz[[sp]], 
                                 dz_PC1 = cor_dz_PC1[[sp]], 
                                 dz_PC2 = cor_dz_PC2[[sp]], 
                                 species = sp))
}

Now make a graph for each species of the correlation between PC1 x response vector in the x-axis and response vector x pc2 on the y-axis. Do you see any differences in the potential of responses for each species? Which species is most “constrained” or limited and the potential responses? Does that make sense when you look at the association between the 4 traits?


In [15]:
ggplot(corelation_df, aes(dz_PC1, dz_PC2)) + geom_point(alpha = 0.1) + facet_wrap(~species, ncol = 2)


We can also look at the distribution of the aligmente between the response and the selection for each species


In [16]:
ggplot(corelation_df, aes(species, beta_dz)) + geom_violin() + geom_jitter(alpha = 0.1)