phylogeny_msc-checkpoint


Estimating phylogenies from sequence capture data

Here we show one example of how the phased multiple sequence alignments (MSAs) assembled with the SECAPR pipeline can be used for phylogeny estimation. In the example below we use the Multi Species Coalescent (MSC) program BEAST2.

  1. Locus selection: We used the secapr locus_selection function in order to choose the 50 exon loci with the highest read coverage across all samples (see locus selection workflow here).
  2. Phasing: We generated phased allele sequence MSAs fo the selected 50 loci (see phasing workflow here).
  3. Generating BEAST xml file with BEAUTI: We loaded the 50 phased allele sequence MSAs into BEAUTI v2.4.4. We chose the STACEY-specific BirthDeathCollapse species tree model with a collapse height of 1e-5. This tree-model allows taxon-assignment free analyses under the MSC model, which allows to observe every sequence as an individual tip in the species tree (rather than having to assign sequences to expected clusters (= species) prior to analysis. Further priors were: bdcGrowthRate = log normal(M=4.6, S=1.5); collapseWeight = beta (alpha=2, beta=2); popPriorScale = log normal(M=-7, S=2); relativeDeathRate = beta (alpha=1.0, beta=1.0).
  4. Run BEAST v2.4.4: We set the MCMC for 1 billion generations, logging every 100,000 generations. After approximately 500 million generations all parameters had reached convergence (assessed with Tracer v.1.6, Rambaut et al. 2013) and the MCMC was stopped (approximately 80 hours on a Mac Pro, Late 2013, 3.5 GHz 6-Core Intel Xeon E5 processor).
  5. Summarizing posterior species tree distribution: The resulting distribution of species trees was summarized with TreeAnnotator (v2.4.4), using mean heights and excluding 10% burn-in. Command: treeannotator -burnin 10 -heights mean species.trees summary_tree_mean_heights.tre
  6. Generating similarity matrix from the posterior: We then used the SpeciesDelimitationAnalyser (part of the STACEY distribution) to calculate the posterior probability for each pair of species of belonging to the same cluster. This probability was calculated from the complete posterior tree distribution (excl. 10% burn-in), using a user-defined collapse-height value of 1e-5. Command: java -jar speciesDA.jar -burnin 557 -collapseheight 1e-5 species.trees species_da_results_1e-5.txt
  7. Plot the similarity matrix: We applied our custom-made R-script for plotting the similarity matrix. The plotting-script can be found here.