Advances in RAD-seq phylogenetics

Deren Eaton

Phylogenetic methods are continually advancing, both in the speed of analyses and in the statistical models that are applied.

The problem of phylogenetic inference

There are $(2n-3)!!$ possible rooted binary topologies relating $n$ samples, so tree space quickly becomes too large to search exhaustively.
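To get a feel for how fast tree space grows, the double factorial can be evaluated directly. The following is a minimal sketch; the function name `n_rooted_trees` is made up here for illustration.

```python
def n_rooted_trees(n):
    """Number of rooted binary topologies for n labeled tips: (2n-3)!!"""
    count = 1
    # multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n <= 2)
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count

for n in (3, 4, 5, 10, 20, 50):
    print(n, n_rooted_trees(n))
```

With only 10 samples there are already more than 34 million rooted topologies, which is why heuristic tree searches are used instead of exhaustive ones.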

Large numbers of taxa

The mega-phylogeny approach (Smith et al. 2008) and related methods mine data from online resources to assemble large supermatrices that contain a few traditional markers (e.g., COI, cytB, ITS) sampled across hundreds or thousands of taxa.

Typical software:

  • raxml (fast & memory-efficient ML inference)
  • exabayes (slower, but more memory-efficient Bayesian inference)

Full genomes

Early phylogenomic studies typically compared a few species (often model organisms) for which full genome data were available. The primary difficulty with full genome data is identifying suitable phylogenetic markers: many genomic regions are difficult to align, and homology between genes is difficult to establish. For this reason, many studies with full genomes restrict phylogenetic analyses to transcriptomes.

Example full genome papers:

  • Dunn et al.
  • Plants

Many sequences and many taxa

Example links.

The influence of missing data

In large-scale megaphylogenies (data-mined matrices of a few traditional markers across thousands of taxa), the proportion of missing data can exceed 90%.

In large-scale sub-genomic data sets, like RAD-seq, the proportion of missing data typically ranges from 10% to 90%.

Importantly, the first type of data set is more susceptible to the problem of terraces in phylogenetic tree space (Sanderson et al. 2012), because many taxa in the phylogeny share no information, whereas in the second type there is typically still substantial phylogenetic information for all taxa.
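The difference between the two kinds of missing data can be illustrated on a toy matrix. This is only a sketch: the taxon names, the sequences, and the choice of `N`, `-` and `?` as missing-data symbols are assumptions made for the example.

```python
MISSING = set("N-?")  # assumed symbols for missing or ambiguous data

def missing_fraction(alignment):
    """Proportion of cells in the matrix that contain no data."""
    total = sum(len(seq) for seq in alignment.values())
    missing = sum(ch.upper() in MISSING
                  for seq in alignment.values() for ch in seq)
    return missing / total

def shared_sites(alignment, a, b):
    """Number of columns where taxa a and b both have data."""
    return sum(x.upper() not in MISSING and y.upper() not in MISSING
               for x, y in zip(alignment[a], alignment[b]))

# hypothetical matrix: taxon_A and taxon_B were sequenced for different markers
toy = {
    "taxon_A": "ACGTNNNN",
    "taxon_B": "NNNNACGT",
    "taxon_C": "ACGTACGT",
}

print(missing_fraction(toy))
print(shared_sites(toy, "taxon_A", "taxon_B"))
print(shared_sites(toy, "taxon_A", "taxon_C"))
```

Here only a third of the matrix is missing, yet taxon_A and taxon_B share no sites at all. The overall proportion of missing data therefore says less than the pattern of overlap among taxa.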

Models

Likelihood calculation: how likely is a tree

Maximum likelihood inference finds the parameter values of a defined model $M$ that maximize the likelihood of the data, $P(D|M)$. Different models have different numbers of parameters.

  • Jukes-Cantor (JC) = 1 parameter (simple)
  • General time-reversible (GTR) = 6 rate parameters (complex)
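To make the simplest model concrete, the sketch below gives the standard Jukes-Cantor transition probabilities, assuming the branch length `t` is measured in expected substitutions per site. The function name `jc69_prob` is chosen here for illustration.

```python
import math

BASES = "ACGT"

def jc69_prob(x, y, t):
    """JC69 probability that base x is replaced by base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    if x == y:
        return 0.25 + 0.75 * e
    return 0.25 - 0.25 * e

# each row of the transition matrix sums to 1
for t in (0.0, 0.1, 1.0, 10.0):
    row = [jc69_prob("A", y, t) for y in BASES]
    print(t, [round(p, 4) for p in row], round(sum(row), 4))
```

Under JC all substitutions occur at the same rate, so a single number describes the process. GTR instead allows a separate rate for each pair of bases.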

Calculation

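The likelihood of the full tree is the product of the likelihoods at each of the $N$ sites.

$L = \prod_{j=1}^N L(j)$

In practice the log-likelihood is used, which turns the product into a sum:

$\ln L = \sum_{j=1}^N \ln L(j)$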
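A small sketch of this calculation for the simplest possible case: two sequences separated by a single branch under Jukes-Cantor. The toy sequences, the function names, and the coarse grid search are assumptions for illustration; real software uses Felsenstein's pruning algorithm on full trees and numerical optimizers.

```python
import math

def jc69_prob(x, y, t):
    """JC69 probability of base x changing to base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(x, y, t):
    """Likelihood of one site for two sequences separated by distance t > 0."""
    return 0.25 * jc69_prob(x, y, t)  # 0.25 = base frequency under JC

def log_likelihood(seq1, seq2, t):
    """Sum of per-site log-likelihoods (the log of the product over sites)."""
    return sum(math.log(site_likelihood(x, y, t)) for x, y in zip(seq1, seq2))

# hypothetical data: 1000 sites, 10% of which differ
seq1 = "ACGTACGTAC" * 100
seq2 = "ACGTACGAAC" * 100

# the raw product of site likelihoods underflows to zero
product = 1.0
for x, y in zip(seq1, seq2):
    product *= site_likelihood(x, y, 0.1)
print(product)

# the sum of logs stays finite
print(log_likelihood(seq1, seq2, 0.1))

# coarse grid search for the branch length that maximizes the log-likelihood
grid = [i / 1000 for i in range(1, 1001)]
best_t = max(grid, key=lambda t: log_likelihood(seq1, seq2, t))
print(best_t)
```

Summing logs avoids numerical underflow, which is the practical reason likelihoods are reported on the log scale. The grid search shows the "estimating the parameters of the model" step in miniature, with one parameter (the branch length) chosen to maximize the likelihood.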

Phylogenetic invariants

This is an old method that has recently been revived to great popularity. It is often described as non-parametric, in the sense that it does not estimate parameters, such as branch lengths, to fit a model. Instead, it uses algebraic relationships among site-pattern frequencies (invariants) that are expected to hold under the general Markov model.