The ipyrad-analysis bpp
tool w/ algorithm 00 can be used infer parameters on a fixed species tree. Please see the full BPP documentation for details about BPP analyses. The ipyrad-analysis wrapper is intended to make it easy to auomate many BPP analyses on large RAD-seq datasets.
In [1]:
# conda install ipyrad toytree ipcoal -c conda-forge -c bioconda
# conda install bpp -c eaton-lab
In [2]:
import ipyrad.analysis as ipa
import pandas as pd
import numpy as np
import toytree
import toyplot
import ipcoal
In [3]:
# generate a tree with divergence times in generations
TREE = toytree.rtree.unittree(ntips=5, treeheight=3e6, seed=123)
# set larger Ne on one lineage than the other
TREE = TREE.set_node_values("Ne", {i: 5e5 for i in (0, 1, 2, 5, 6)}, default=2e5)
# draw in 'p' style to show Ne
TREE.draw(ts='p', node_hover=True);
In [4]:
# create a simulation model for this tree/network with 3 diploids per spp.
model = ipcoal.Model(tree=TREE, nsamples=6, mut=1e-8)
In [5]:
# simulate N loci
model.sim_loci(nloci=1000, nsites=150)
# write result to a database file
model.write_loci_to_hdf5(name="test-bpp", outdir="/tmp", diploid=True)
In BPP we can set a prior on the percent divergence $\tau$ (pronounced Tau) between the root and tips of the tree. When combined with an estimate of the per-site-per-generation mutation rate ($\mu$) and generation time (g) this can be converted into units of years. In our ipcoal
simulation we actually have access to the root ancestral sequence as well as the tip sequences, so we can measure the true percent sequence divergence.
In [6]:
# concatenate seqs into single alignment
concat = np.concatenate(model.seqs, axis=1)
# get ancestral root node sequence
rootseq = np.concatenate(model.ancestral_seq)
# what is sequence divergence relative to root?
np.sum(rootseq != concat[0]) / rootseq.size
Out[6]:
In addition to setting a prior on Tau we must also set a prior on $\theta$ (pronounced theta), the population mutation rate ($\theta$=4Ne$\mu$). Here we know the Ne and $\mu$ used in the simulation scenario, so again we can calculate it exactly. For real data we will not know these prior values precisely, and so we will want to put a wide prior to incorporate our uncertainty.
In [13]:
# theta in simulation
4 * TREE.treenode.Ne * 1e-8
Out[13]:
In [8]:
SEQS = "/tmp/test-bpp.seqs.hdf5"
In [9]:
IMAP = {
"r0": ["r0-0", "r0-1", "r0-2"],
"r1": ["r1-0", "r1-1", "r1-2"],
"r2": ["r2-0", "r2-1", "r2-2"],
"r3": ["r3-0", "r3-1", "r3-2"],
"r4": ["r4-0", "r4-1", "r4-2"],
}
In [10]:
# create a bpp object to run algorithm 00
tool = ipa.bpp(
name="test-bpp",
workdir="/tmp",
data=SEQS,
guidetree=TREE,
imap=IMAP,
infer_sptree=0,
infer_delimit=0,
maxloci=1000,
burnin=2e3,
nsample=2e4,
sampfreq=5,
)
In [11]:
tool.kwargs
Out[11]:
Here we will enter our expectations on the generation time and mutation rate of our simulated organisms so that we can toggle the theta and tau priors until the resulting prior distributions on Ne and T seem reasonable.
In our simulation the root divergence time was 3M generations. Assuming a generation time of 1 and prior on root tau of ~3% (as we measured above) yields a prior on the root divergence time with 95% CI from about 0.5-6.5 Ma. So this is a pretty good but not highly informative prior.
Similarly, our simulation used Ne=200K or 500K on different lineages. If we assume a wide prior on the per-site per-generation mutation rate ($\mu$) with highest density of 1e-8 (as used in our simulations) and a prior on $\theta$ with highest density at about 0.01 (we calculated 0.008 from the simulated data above) yields a 95% CI on Ne that is again accurate but not highly informative.
These priors incorporate estimated knowledge about our organisms, by describing gentime and mutrate as distributions, and puts informative but wide priors on the parameters we hope to infer (theta and tau).
In [14]:
# toggle these prior settings
tool.kwargs["thetaprior"] = (2.5, 0.004)
tool.kwargs["tauprior"] = (3, 0.01)
# draw the distributions when using our assumptions for transforming
tool.draw_priors(
gentime_min=0.99, gentime_max=1.01,
mutrate_min=5e-9, mutrate_max=2e-8,
);
In [15]:
tool.run(auto=True, nreps=2, force=True)
You can load your results anytime later after your runs have finished by providing the name and workdir for the job to ipa.bpp. This allows you to mess around with plotting and comparing distributions later. You also have the option here to combine the posteriors from multiple replicate runs (that used the same data and priors), or to analyze each separately.
In [1]:
import ipyrad.analysis as ipa
In [2]:
# reload results by creating object with a given name and workdir
tool = ipa.bpp(name="test-bpp", workdir="/tmp")
# call summarize on the tool
table, mcmc = tool.summarize_results("00", individual_results=False)
In [3]:
# tables is a summary of the posteriors
table
Out[3]:
The results are often difficult to interpret without converting the units into a more easily interpretable type. For example, the theta parameters represent the population effective mutation rate (4*Ne*u
), but we are typically more interested in just knowing N
. Similarly, the tau parameters are in units of percent sequence divergence (u/gen) but we instead like to know the divergence time estimates in units of years, or millions of years.
To convert these units we need an estimate of the per-site per-generation mutation-rate (u) and the generation time of our organism (g). Rather than offering a point estimate it is more appropriate in the Bayesian context to provide these values a statistical distribution. This is similar to how we described our priors using a gamma or inverse-gamma distribution earlier, and how we used a distribution of u and g to check the reasonable conversion of the units.
In the example below the converted units can be compared with the true simulation scenario at the beginning of this notebook where we set the divergence times and Ne values on the tree. Among the tip-level taxa we can see that mean Ne estimates are around 2e5 or ~6e5, which is pretty accurate, and the root divergence time is at 3e6, which is correct.
In [17]:
df = tool.transform(mcmc, 0.99, 1.01, 5e-9, 2e-8).T
In [18]:
df
Out[18]:
You can plot several posterior distributions on a shared axis using the .draw_posteriors()
function. This takes a list of tuples as input, where each tuple is the (mean, variance) of a parameter estimate. You can draw the posteriors from a single analysis using this function or combine results from several different analyses if you wish.
In [43]:
# get only the divergence time result
subd = df.loc[[i for i in df.index if "div_" in i], :]
# get results as tuple pairs with (mean, var)
tuples = [
(i, j) for (i, j) in
zip(subd["mean"], subd.loc[:, "S.D"] ** 2)
]
# draw the results
c, a, m = tool.draw_posteriors(
gamma_tuples=tuples,
labels=subd.index,
);
# style the axes
a.y.ticks.labels.angle = -90