tetrad
is a species tree inference tool based on the SVDQuartets algorithm of Chifman and Kubatko. It uses the theory of phylogenetic invariants to resolve quartet trees from SNPs for all sets of quartets in a larger tree, and then joins the quartets together into a supertree using QMC. Here I demonstrate how to call tetrad from the ipyrad-analysis tools, which is convenient for keeping all of your analyses in jupyter notebooks. Alternatively, you could also call tetrad as a command-line tool (see the tetrad docs).
The following features of tetrad
are highlighted:
Parallelization: tetrad can be massively parallelized on an HPC cluster. It approximately scales linearly with the number of available cores. Using the flexible ipyparallel backend, you can even parallelize over multiple nodes using MPI ([more details coming soon]).
Bootstrap sampling: In contrast to SVDquartets, tetrad is designed particularly to work well with RAD-seq data, though it can apply to any SNP data set. Bootstrap replicates resample the number of loci with replacement to the same size as the original data set.
SNP subsampling: The underlying model assumes that the examined SNPs are unlinked, and tetrad uses a very efficient method to maximize the number of unlinked SNPs used in the analysis. A very crude way to achieve unlinked SNPs would be to sample one SNP per locus before beginning the analysis. However, a site that is variable in the context of all your samples is not necessarily informative for any given quartet of samples. Instead, tetrad subsamples a single SNP from every locus separately for every quartet set in the analysis, and repeats this independently in every bootstrap replicate. This way the maximum amount of unlinked SNP information is used in every quartet inference.
In [1]:
# conda install ipyrad -c bioconda
# conda install tetrad -c eaton-lab -c conda-forge
In [2]:
import ipyrad.analysis as ipa
import toytree
The input data file should be the .snps.hdf5
file produced by ipyrad (v.0.9.13 or newer). If you did not assemble your data in ipyrad then you can convert your SNPs data to this format from any VCF using the vcf_to_hdf5 tool.
In [3]:
# the path to your sequence data in HDF5 format
data = "/home/deren/Documents/virentes-reference/analysis-ipyrad/ref_min4_outfiles/ref_min4.snps.hdf5"
Here you can enter parameters for the analysis. By default only a subsample of the total quartets will be inferred. To instead infer all quartets simply enter a very large number for the nquartets
parameter and it will use the maximum. When you initialize the object it will print the size of your dataset, the number of loci, and the number of quartets. The number of loci is of interest because this is the maximum number of SNPs that will be used in an analysis.
In [7]:
# init analysis object with input data and (optional) parameter options
tet = ipa.tetrad(
name="virentes-min4",
data=data,
nquartets=1e6,
nboots=16,
)
In [8]:
tet.run(auto=True)
In [22]:
tre = toytree.tree(tet.trees.tree).root(["HE", "NI"])
tre.draw(node_labels="support", use_edge_lengths=False);
In [23]:
tre = toytree.tree(tet.trees.cons).root(["HE", "NI"])
tre.draw(node_labels="support", use_edge_lengths=False);
In [21]:
mtre = toytree.mtree(tet.trees.boots)
mtre.treelist = [i.root(["HE", "NI"]) for i in mtre.treelist]
mtre.draw_cloud_tree(
height=600,
width=400,
use_edge_lengths=False,
html=True,
);
If you want to run more bootstrap replicates you can simply reinit an analysis object with the same name
and set the number of bootstrap replicates to a higher value and call .run()
again. If you try calling .run()
again and you have already completed all of the requested results then it will simply print that it is finished. If you want it to overwrite existing results with the same name then you can use the force
arg in run.
In [24]:
# analysis is finished so it will not run
tet.run()
Here I set the number of requested bootstrap replicates to 20 and call .run()
again. You can see that the analysis continues from 17, since we already completed 16 bootstrap replicates earlier, and will go until it completes 20 bootstraps.
In [26]:
# increase nboots and continue from existing analysis object
tet.params.nboots = 20
tet.run(auto=True)
Alternatively, maybe you are returning to this analysis after a while and decide you want to do more bootstraps. You can re-load the analysis object by entering the same name
and working_dir
as in the original analysis, and adding the load=True
argument. I set the number of bootstraps to 25 now. This will load the results from before and add new results when you call .run()
.
In [29]:
# # re-init analysis object (will load existing results at this name)
# tet = ipa.tetrad(
# name="virentes-min4",
# data=data,
# nquartets=1e6,
# nboots=25,
# load=True,
# )
# tet.run(auto=True)