# ipyrad-analysis toolkit: bpp (alg 00)

The ipyrad-analysis bpp tool w/ algorithm 00 can be used infer parameters on a fixed species tree. Please see the full BPP documentation for details about BPP analyses. The ipyrad-analysis wrapper is intended to make it easy to auomate many BPP analyses on large RAD-seq datasets.

In [1]:
# conda install ipyrad toytree ipcoal -c conda-forge -c bioconda
# conda install bpp -c eaton-lab

In [2]:
import pandas as pd
import numpy as np
import toytree
import toyplot
import ipcoal

### Simulate data under a specified model

We are interested in estimating the Ne and divergence time values on this tree.

In [3]:
# generate a tree with divergence times in generations
TREE = toytree.rtree.unittree(ntips=5, treeheight=3e6, seed=123)

# set larger Ne on one lineage than the other
TREE = TREE.set_node_values("Ne", {i: 5e5 for i in (0, 1, 2, 5, 6)}, default=2e5)

# draw in 'p' style to show Ne
TREE.draw(ts='p', node_hover=True);

In [4]:
# create a simulation model for this tree/network with 3 diploids per spp.
model = ipcoal.Model(tree=TREE, nsamples=6, mut=1e-8)

In [5]:
# simulate N loci
model.sim_loci(nloci=1000, nsites=150)

# write result to a database file
model.write_loci_to_hdf5(name="test-bpp", outdir="/tmp", diploid=True)

wrote 1000 loci to /tmp/test-bpp.seqs.hdf5

### True measurements that will inform our priors

In BPP we can set a prior on the percent divergence $\tau$ (pronounced Tau) between the root and tips of the tree. When combined with an estimate of the per-site-per-generation mutation rate ($\mu$) and generation time (g) this can be converted into units of years. In our ipcoal simulation we actually have access to the root ancestral sequence as well as the tip sequences, so we can measure the true percent sequence divergence.

In [6]:
# concatenate seqs into single alignment
concat = np.concatenate(model.seqs, axis=1)

# get ancestral root node sequence
rootseq = np.concatenate(model.ancestral_seq)

# what is sequence divergence relative to root?
np.sum(rootseq != concat[0]) / rootseq.size

Out[6]:
0.03360666666666667

In addition to setting a prior on Tau we must also set a prior on $\theta$ (pronounced theta), the population mutation rate ($\theta$=4Ne$\mu$). Here we know the Ne and $\mu$ used in the simulation scenario, so again we can calculate it exactly. For real data we will not know these prior values precisely, and so we will want to put a wide prior to incorporate our uncertainty.

In [13]:
# theta in simulation
4 * TREE.treenode.Ne * 1e-8

Out[13]:
0.008

### BPP analysis setup

In [8]:
SEQS = "/tmp/test-bpp.seqs.hdf5"

In [9]:
IMAP = {
"r0": ["r0-0", "r0-1", "r0-2"],
"r1": ["r1-0", "r1-1", "r1-2"],
"r2": ["r2-0", "r2-1", "r2-2"],
"r3": ["r3-0", "r3-1", "r3-2"],
"r4": ["r4-0", "r4-1", "r4-2"],
}

In [10]:
# create a bpp object to run algorithm 00
tool = ipa.bpp(
name="test-bpp",
workdir="/tmp",
data=SEQS,
guidetree=TREE,
imap=IMAP,
infer_sptree=0,
infer_delimit=0,
maxloci=1000,
burnin=2e3,
nsample=2e4,
sampfreq=5,
)

### The parameters of the object

In [11]:
tool.kwargs

Out[11]:
{'binary': '/tmp/bpp-4.1.4-linux-x86_64/bin/bpp',
'infer_sptree': 0,
'infer_delimit': 0,
'infer_delimit_args': (0, 2),
'speciesmodelprior': 1,
'seed': 12345,
'burnin': 2000,
'nsample': 20000,
'sampfreq': 5,
'thetaprior': (3, 0.002),
'tauprior': (3, 0.002),
'phiprior': (1, 1),
'usedata': 1,
'cleandata': 0,
'finetune': (0.01, 0.02, 0.03, 0.04, 0.05, 0.01, 0.01),
'copied': False}

### Set priors and plot priors of transformed values

Here we will enter our expectations on the generation time and mutation rate of our simulated organisms so that we can toggle the theta and tau priors until the resulting prior distributions on Ne and T seem reasonable.

In our simulation the root divergence time was 3M generations. Assuming a generation time of 1 and prior on root tau of ~3% (as we measured above) yields a prior on the root divergence time with 95% CI from about 0.5-6.5 Ma. So this is a pretty good but not highly informative prior.

Similarly, our simulation used Ne=200K or 500K on different lineages. If we assume a wide prior on the per-site per-generation mutation rate ($\mu$) with highest density of 1e-8 (as used in our simulations) and a prior on $\theta$ with highest density at about 0.01 (we calculated 0.008 from the simulated data above) yields a 95% CI on Ne that is again accurate but not highly informative.

These priors incorporate estimated knowledge about our organisms, by describing gentime and mutrate as distributions, and puts informative but wide priors on the parameters we hope to infer (theta and tau).

In [14]:
# toggle these prior settings
tool.kwargs["thetaprior"] = (2.5, 0.004)
tool.kwargs["tauprior"] = (3, 0.01)

# draw the distributions when using our assumptions for transforming
tool.draw_priors(
gentime_min=0.99, gentime_max=1.01,
mutrate_min=5e-9, mutrate_max=2e-8,
);

### Run inference

Distribute jobs in parallel and run replicate jobs that start MCMC from different random seeds.

In [15]:
tool.run(auto=True, nreps=2, force=True)

Parallel connection | latituba: 8 cores
[locus filter] full data: 1000
[locus filter] post filter: 1000
[ipa bpp] bpp v4.1.4
[ipa.bpp] distributed 2 bpp jobs (name=test-bpp, nloci=1000)
[####################] 100% 3:36:37 | progress on all reps
Parallel connection closed.

You can load your results anytime later after your runs have finished by providing the name and workdir for the job to ipa.bpp. This allows you to mess around with plotting and comparing distributions later. You also have the option here to combine the posteriors from multiple replicate runs (that used the same data and priors), or to analyze each separately.

In [1]:

In [2]:
# reload results by creating object with a given name and workdir
tool = ipa.bpp(name="test-bpp", workdir="/tmp")

# call summarize on the tool
table, mcmc = tool.summarize_results("00", individual_results=False)

[ipa.bpp] found 2 existing result files
[ipa.bpp] summarizing algorithm '00' results
[ipa.bpp] combining mcmc files

### The "00" tables result

The main result of the "00" algorithm is a dataframe with the inferred parameters of the multispecies coalescent model.

In [3]:
# tables is a summary of the posteriors
table

Out[3]:
theta_1r0 theta_2r1 theta_3r2 theta_4r3 theta_5r4 theta_6r4r3r2r1r0 theta_7r4r3 theta_8r2r1r0 theta_9r1r0 tau_6r4r3r2r1r0 tau_7r4r3 tau_8r2r1r0 tau_9r1r0 lnL
mean 0.027995 0.028212 0.025590 0.009736 0.009295 0.007839 0.005385 0.022274 0.017841 0.034456 0.024037 0.026410 0.016234 -418047.448228
median 0.027980 0.028194 0.025576 0.009733 0.009287 0.007836 0.005425 0.022171 0.017826 0.034457 0.024008 0.026409 0.016230 -418046.894000
S.D 0.000885 0.000904 0.000717 0.000304 0.000296 0.000704 0.001142 0.002391 0.001121 0.000394 0.000616 0.000444 0.000313 79.616227
min 0.024424 0.024989 0.022996 0.008656 0.008194 0.005509 0.002602 0.014451 0.013855 0.032841 0.021608 0.024757 0.015049 -418388.687000
max 0.032145 0.032099 0.028488 0.011026 0.010516 0.010827 0.009983 0.032692 0.023367 0.036106 0.026211 0.028206 0.017480 -417729.014000
2.5% 0.026300 0.026481 0.024243 0.009154 0.008728 0.006511 0.003146 0.017910 0.015693 0.033689 0.022892 0.025541 0.015626 -418202.863000
97.5% 0.029785 0.030023 0.027033 0.010342 0.009899 0.009229 0.007568 0.027284 0.020108 0.035223 0.025280 0.027283 0.016861 -417892.551000
2.5%HPD 0.026283 0.026471 0.024219 0.009152 0.008691 0.006477 0.002991 0.017773 0.015733 0.033673 0.022859 0.025528 0.015623 -418201.078000
97.5%HPD 0.029757 0.030004 0.027007 0.010338 0.009860 0.009190 0.007314 0.027099 0.020134 0.035200 0.025239 0.027268 0.016855 -417891.188000
ESS* 12262.897304 11997.844527 13891.059855 6974.445253 6525.824045 436.125051 123.289724 1897.302231 3587.614817 579.708878 147.436912 2641.537188 4378.739376 2087.556267
Eff* 0.613145 0.599892 0.694553 0.348722 0.326291 0.021806 0.006164 0.094865 0.179381 0.028985 0.007372 0.132077 0.218937 0.104378

### Converting units from the table results

The results are often difficult to interpret without converting the units into a more easily interpretable type. For example, the theta parameters represent the population effective mutation rate (4*Ne*u), but we are typically more interested in just knowing N. Similarly, the tau parameters are in units of percent sequence divergence (u/gen) but we instead like to know the divergence time estimates in units of years, or millions of years.

To convert these units we need an estimate of the per-site per-generation mutation-rate (u) and the generation time of our organism (g). Rather than offering a point estimate it is more appropriate in the Bayesian context to provide these values a statistical distribution. This is similar to how we described our priors using a gamma or inverse-gamma distribution earlier, and how we used a distribution of u and g to check the reasonable conversion of the units.

In the example below the converted units can be compared with the true simulation scenario at the beginning of this notebook where we set the divergence times and Ne values on the tree. Among the tip-level taxa we can see that mean Ne estimates are around 2e5 or ~6e5, which is pretty accurate, and the root divergence time is at 3e6, which is correct.

In [17]:
df = tool.transform(mcmc, 0.99, 1.01, 5e-9, 2e-8).T

In [18]:
df

Out[18]:
mean median S.D min max
Ne_1r0 6.16e+05 5.78e+05 2.05e+05 1.89e+05 2.55e+06
Ne_2r1 6.2e+05 5.83e+05 2.06e+05 1.9e+05 2.57e+06
Ne_3r2 5.63e+05 5.29e+05 1.87e+05 1.72e+05 2.28e+06
Ne_4r3 2.14e+05 2.01e+05 7.12e+04 6.57e+04 8.87e+05
Ne_5r4 2.04e+05 1.92e+05 6.8e+04 6.26e+04 8.47e+05
Ne_6r4r3r2r1r0 1.73e+05 1.62e+05 5.93e+04 5.47e+04 7.23e+05
Ne_7r4r3 1.2e+05 1.11e+05 4.54e+04 2.8e+04 5.7e+05
Ne_8r2r1r0 4.91e+05 4.59e+05 1.71e+05 1.45e+05 2.17e+06
Ne_9r1r0 3.92e+05 3.68e+05 1.32e+05 1.3e+05 1.71e+06
div_6r4r3r2r1r0 3.03e+06 2.85e+06 1e+06 9.51e+05 1.24e+07
div_7r4r3 2.11e+06 1.98e+06 7.01e+05 6.42e+05 8.56e+06
div_8r2r1r0 2.32e+06 2.18e+06 7.7e+05 7.2e+05 9.46e+06
div_9r1r0 1.43e+06 1.34e+06 4.74e+05 4.4e+05 5.81e+06
lnL NaN NaN NaN NaN NaN

### Plot posteriors

You can plot several posterior distributions on a shared axis using the .draw_posteriors() function. This takes a list of tuples as input, where each tuple is the (mean, variance) of a parameter estimate. You can draw the posteriors from a single analysis using this function or combine results from several different analyses if you wish.

In [43]:
# get only the divergence time result
subd = df.loc[[i for i in df.index if "div_" in i], :]

# get results as tuple pairs with (mean, var)
tuples = [
(i, j) for (i, j) in
zip(subd["mean"], subd.loc[:, "S.D"] ** 2)
]

# draw the results
c, a, m = tool.draw_posteriors(
gamma_tuples=tuples,
labels=subd.index,
);

# style the axes
a.y.ticks.labels.angle = -90

... coming