ipyrad-analysis toolkit: bpp (alg 00)

The ipyrad-analysis bpp tool w/ algorithm 00 can be used infer parameters on a fixed species tree. Please see the full BPP documentation for details about BPP analyses. The ipyrad-analysis wrapper is intended to make it easy to auomate many BPP analyses on large RAD-seq datasets.

Load packages


In [1]:
# conda install ipyrad toytree ipcoal -c conda-forge -c bioconda
# conda install bpp -c eaton-lab

In [2]:
import ipyrad.analysis as ipa
import pandas as pd
import numpy as np
import toytree
import toyplot
import ipcoal

Simulate data under a specified model

We are interested in estimating the Ne and divergence time values on this tree.


In [3]:
# generate a tree with divergence times in generations
TREE = toytree.rtree.unittree(ntips=5, treeheight=3e6, seed=123)

# set larger Ne on one lineage than the other
TREE = TREE.set_node_values("Ne", {i: 5e5 for i in (0, 1, 2, 5, 6)}, default=2e5)

# draw in 'p' style to show Ne
TREE.draw(ts='p', node_hover=True);


idx: 0 dist: 1000000.0000 support: 100.0000 height: 0.0000 name: r0 Ne: 500000.00000idx: 1 dist: 1000000.0000 support: 100.0000 height: 0.0000 name: r1 Ne: 500000.00001idx: 2 dist: 2000000.0000 support: 100.0000 height: 0.0000 name: r2 Ne: 500000.00002idx: 3 dist: 2000000.0000 support: 100.0000 height: 0.0000 name: r3 Ne: 200000.00003idx: 4 dist: 2000000.0000 support: 100.0000 height: 0.0000 name: r4 Ne: 200000.00004idx: 5 dist: 1000000.0000 support: 100.0000 height: 1000000.0000 name: 5 Ne: 500000.00005idx: 6 dist: 1000000.0000 support: 100.0000 height: 2000000.0000 name: 6 Ne: 500000.00006idx: 7 dist: 1000000.0000 support: 100.0000 height: 2000000.0000 name: 7 Ne: 200000.00007idx: 8 dist: 1000000.0000 support: 100.0000 height: 3000000.0000 name: 8 Ne: 200000.00008r0r1r2r3r4015000003000000

In [4]:
# create a simulation model for this tree/network with 3 diploids per spp.
model = ipcoal.Model(tree=TREE, nsamples=6, mut=1e-8)

In [5]:
# simulate N loci
model.sim_loci(nloci=1000, nsites=150)

# write result to a database file
model.write_loci_to_hdf5(name="test-bpp", outdir="/tmp", diploid=True)


wrote 1000 loci to /tmp/test-bpp.seqs.hdf5

True measurements that will inform our priors

In BPP we can set a prior on the percent divergence $\tau$ (pronounced Tau) between the root and tips of the tree. When combined with an estimate of the per-site-per-generation mutation rate ($\mu$) and generation time (g) this can be converted into units of years. In our ipcoal simulation we actually have access to the root ancestral sequence as well as the tip sequences, so we can measure the true percent sequence divergence.


In [6]:
# concatenate seqs into single alignment
concat = np.concatenate(model.seqs, axis=1)

# get ancestral root node sequence
rootseq = np.concatenate(model.ancestral_seq)

# what is sequence divergence relative to root?
np.sum(rootseq != concat[0]) / rootseq.size


Out[6]:
0.03360666666666667

In addition to setting a prior on Tau we must also set a prior on $\theta$ (pronounced theta), the population mutation rate ($\theta$=4Ne$\mu$). Here we know the Ne and $\mu$ used in the simulation scenario, so again we can calculate it exactly. For real data we will not know these prior values precisely, and so we will want to put a wide prior to incorporate our uncertainty.


In [13]:
# theta in simulation
4 * TREE.treenode.Ne * 1e-8


Out[13]:
0.008

BPP analysis setup


In [8]:
SEQS = "/tmp/test-bpp.seqs.hdf5"

In [9]:
IMAP = {
    "r0": ["r0-0", "r0-1", "r0-2"],
    "r1": ["r1-0", "r1-1", "r1-2"],
    "r2": ["r2-0", "r2-1", "r2-2"],
    "r3": ["r3-0", "r3-1", "r3-2"],
    "r4": ["r4-0", "r4-1", "r4-2"],
}

In [10]:
# create a bpp object to run algorithm 00
tool = ipa.bpp(
    name="test-bpp",
    workdir="/tmp",
    data=SEQS,
    guidetree=TREE,
    imap=IMAP,
    infer_sptree=0,
    infer_delimit=0,
    maxloci=1000,
    burnin=2e3,
    nsample=2e4,
    sampfreq=5,
)

The parameters of the object


In [11]:
tool.kwargs


Out[11]:
{'binary': '/tmp/bpp-4.1.4-linux-x86_64/bin/bpp',
 'infer_sptree': 0,
 'infer_delimit': 0,
 'infer_delimit_args': (0, 2),
 'speciesmodelprior': 1,
 'seed': 12345,
 'burnin': 2000,
 'nsample': 20000,
 'sampfreq': 5,
 'thetaprior': (3, 0.002),
 'tauprior': (3, 0.002),
 'phiprior': (1, 1),
 'usedata': 1,
 'cleandata': 0,
 'finetune': (0.01, 0.02, 0.03, 0.04, 0.05, 0.01, 0.01),
 'copied': False}

Set priors and plot priors of transformed values

Here we will enter our expectations on the generation time and mutation rate of our simulated organisms so that we can toggle the theta and tau priors until the resulting prior distributions on Ne and T seem reasonable.

In our simulation the root divergence time was 3M generations. Assuming a generation time of 1 and prior on root tau of ~3% (as we measured above) yields a prior on the root divergence time with 95% CI from about 0.5-6.5 Ma. So this is a pretty good but not highly informative prior.

Similarly, our simulation used Ne=200K or 500K on different lineages. If we assume a wide prior on the per-site per-generation mutation rate ($\mu$) with highest density of 1e-8 (as used in our simulations) and a prior on $\theta$ with highest density at about 0.01 (we calculated 0.008 from the simulated data above) yields a 95% CI on Ne that is again accurate but not highly informative.

These priors incorporate estimated knowledge about our organisms, by describing gentime and mutrate as distributions, and puts informative but wide priors on the parameters we hope to infer (theta and tau).


In [14]:
# toggle these prior settings 
tool.kwargs["thetaprior"] = (2.5, 0.004)
tool.kwargs["tauprior"] = (3, 0.01)

# draw the distributions when using our assumptions for transforming
tool.draw_priors(
    gentime_min=0.99, gentime_max=1.01, 
    mutrate_min=5e-9, mutrate_max=2e-8,
);


0.51.01.52.02.5prior on mutation rates (x10^-8)95% CI: 0.6 - 2.10.000.010.020.03prior on theta (4Neu)95% CI: 0.002 - 0.0260250000500000750000prior on Ne95% CI: 29742 - 565130
0.991.001.01prior on generation times95% CI: 1.0 - 1.00.000.030.060.09prior on root tau (%)95% CI: 0.0062 - 0.07220369prior on root div time (Ma)95% CI: 0.4 - 6.6

Run inference

Distribute jobs in parallel and run replicate jobs that start MCMC from different random seeds.


In [15]:
tool.run(auto=True, nreps=2, force=True)


Parallel connection | latituba: 8 cores
[locus filter] full data: 1000
[locus filter] post filter: 1000
[ipa bpp] bpp v4.1.4
[ipa.bpp] distributed 2 bpp jobs (name=test-bpp, nloci=1000)
[####################] 100% 3:36:37 | progress on all reps 
Parallel connection closed.

Re-load results

You can load your results anytime later after your runs have finished by providing the name and workdir for the job to ipa.bpp. This allows you to mess around with plotting and comparing distributions later. You also have the option here to combine the posteriors from multiple replicate runs (that used the same data and priors), or to analyze each separately.


In [1]:
import ipyrad.analysis as ipa

In [2]:
# reload results by creating object with a given name and workdir
tool = ipa.bpp(name="test-bpp", workdir="/tmp")

# call summarize on the tool
table, mcmc = tool.summarize_results("00", individual_results=False)


[ipa.bpp] found 2 existing result files
[ipa.bpp] summarizing algorithm '00' results
[ipa.bpp] combining mcmc files

The "00" tables result

The main result of the "00" algorithm is a dataframe with the inferred parameters of the multispecies coalescent model.


In [3]:
# tables is a summary of the posteriors
table


Out[3]:
theta_1r0 theta_2r1 theta_3r2 theta_4r3 theta_5r4 theta_6r4r3r2r1r0 theta_7r4r3 theta_8r2r1r0 theta_9r1r0 tau_6r4r3r2r1r0 tau_7r4r3 tau_8r2r1r0 tau_9r1r0 lnL
mean 0.027995 0.028212 0.025590 0.009736 0.009295 0.007839 0.005385 0.022274 0.017841 0.034456 0.024037 0.026410 0.016234 -418047.448228
median 0.027980 0.028194 0.025576 0.009733 0.009287 0.007836 0.005425 0.022171 0.017826 0.034457 0.024008 0.026409 0.016230 -418046.894000
S.D 0.000885 0.000904 0.000717 0.000304 0.000296 0.000704 0.001142 0.002391 0.001121 0.000394 0.000616 0.000444 0.000313 79.616227
min 0.024424 0.024989 0.022996 0.008656 0.008194 0.005509 0.002602 0.014451 0.013855 0.032841 0.021608 0.024757 0.015049 -418388.687000
max 0.032145 0.032099 0.028488 0.011026 0.010516 0.010827 0.009983 0.032692 0.023367 0.036106 0.026211 0.028206 0.017480 -417729.014000
2.5% 0.026300 0.026481 0.024243 0.009154 0.008728 0.006511 0.003146 0.017910 0.015693 0.033689 0.022892 0.025541 0.015626 -418202.863000
97.5% 0.029785 0.030023 0.027033 0.010342 0.009899 0.009229 0.007568 0.027284 0.020108 0.035223 0.025280 0.027283 0.016861 -417892.551000
2.5%HPD 0.026283 0.026471 0.024219 0.009152 0.008691 0.006477 0.002991 0.017773 0.015733 0.033673 0.022859 0.025528 0.015623 -418201.078000
97.5%HPD 0.029757 0.030004 0.027007 0.010338 0.009860 0.009190 0.007314 0.027099 0.020134 0.035200 0.025239 0.027268 0.016855 -417891.188000
ESS* 12262.897304 11997.844527 13891.059855 6974.445253 6525.824045 436.125051 123.289724 1897.302231 3587.614817 579.708878 147.436912 2641.537188 4378.739376 2087.556267
Eff* 0.613145 0.599892 0.694553 0.348722 0.326291 0.021806 0.006164 0.094865 0.179381 0.028985 0.007372 0.132077 0.218937 0.104378

Converting units from the table results

The results are often difficult to interpret without converting the units into a more easily interpretable type. For example, the theta parameters represent the population effective mutation rate (4*Ne*u), but we are typically more interested in just knowing N. Similarly, the tau parameters are in units of percent sequence divergence (u/gen) but we instead like to know the divergence time estimates in units of years, or millions of years.

To convert these units we need an estimate of the per-site per-generation mutation-rate (u) and the generation time of our organism (g). Rather than offering a point estimate it is more appropriate in the Bayesian context to provide these values a statistical distribution. This is similar to how we described our priors using a gamma or inverse-gamma distribution earlier, and how we used a distribution of u and g to check the reasonable conversion of the units.

In the example below the converted units can be compared with the true simulation scenario at the beginning of this notebook where we set the divergence times and Ne values on the tree. Among the tip-level taxa we can see that mean Ne estimates are around 2e5 or ~6e5, which is pretty accurate, and the root divergence time is at 3e6, which is correct.


In [17]:
df = tool.transform(mcmc, 0.99, 1.01, 5e-9, 2e-8).T

In [18]:
df


Out[18]:
mean median S.D min max
Ne_1r0 6.16e+05 5.78e+05 2.05e+05 1.89e+05 2.55e+06
Ne_2r1 6.2e+05 5.83e+05 2.06e+05 1.9e+05 2.57e+06
Ne_3r2 5.63e+05 5.29e+05 1.87e+05 1.72e+05 2.28e+06
Ne_4r3 2.14e+05 2.01e+05 7.12e+04 6.57e+04 8.87e+05
Ne_5r4 2.04e+05 1.92e+05 6.8e+04 6.26e+04 8.47e+05
Ne_6r4r3r2r1r0 1.73e+05 1.62e+05 5.93e+04 5.47e+04 7.23e+05
Ne_7r4r3 1.2e+05 1.11e+05 4.54e+04 2.8e+04 5.7e+05
Ne_8r2r1r0 4.91e+05 4.59e+05 1.71e+05 1.45e+05 2.17e+06
Ne_9r1r0 3.92e+05 3.68e+05 1.32e+05 1.3e+05 1.71e+06
div_6r4r3r2r1r0 3.03e+06 2.85e+06 1e+06 9.51e+05 1.24e+07
div_7r4r3 2.11e+06 1.98e+06 7.01e+05 6.42e+05 8.56e+06
div_8r2r1r0 2.32e+06 2.18e+06 7.7e+05 7.2e+05 9.46e+06
div_9r1r0 1.43e+06 1.34e+06 4.74e+05 4.4e+05 5.81e+06
lnL NaN NaN NaN NaN NaN

Plot posteriors

You can plot several posterior distributions on a shared axis using the .draw_posteriors() function. This takes a list of tuples as input, where each tuple is the (mean, variance) of a parameter estimate. You can draw the posteriors from a single analysis using this function or combine results from several different analyses if you wish.


In [43]:
# get only the divergence time result
subd = df.loc[[i for i in df.index if "div_" in i], :]

# get results as tuple pairs with (mean, var)
tuples = [
    (i, j) for (i, j) in 
    zip(subd["mean"], subd.loc[:, "S.D"] ** 2)
]

# draw the results
c, a, m = tool.draw_posteriors(
    gamma_tuples=tuples,
    labels=subd.index,
);

# style the axes
a.y.ticks.labels.angle = -90


10000006000000div_6r4r3r2r1r0div_7r4r3div_8r2r1r0div_9r1r0

Plot results on a tree

... coming