ipyrad-analysis toolkit: bpp (alg 00)

The ipyrad-analysis bpp tool w/ algorithm 00 can be used infer parameters on a fixed species tree. Please see the full BPP documentation for details about BPP analyses. The ipyrad-analysis wrapper is intended to make it easy to auomate many BPP analyses on large RAD-seq datasets.

Load packages



In [1]:

    
# conda install ipyrad toytree ipcoal -c conda-forge -c bioconda
# conda install bpp -c eaton-lab



In [2]:

    
import ipyrad.analysis as ipa
import pandas as pd
import numpy as np
import toytree
import toyplot
import ipcoal

Simulate data under a specified model

We are interested in estimating the Ne and divergence time values on this tree.



In [3]:

    
# generate a tree with divergence times in generations
TREE = toytree.rtree.unittree(ntips=5, treeheight=3e6, seed=123)

# set larger Ne on one lineage than the other
TREE = TREE.set_node_values("Ne", {i: 5e5 for i in (0, 1, 2, 5, 6)}, default=2e5)

# draw in 'p' style to show Ne
TREE.draw(ts='p', node_hover=True);



In [4]:

    
# create a simulation model for this tree/network with 3 diploids per spp.
model = ipcoal.Model(tree=TREE, nsamples=6, mut=1e-8)



In [5]:

    
# simulate N loci
model.sim_loci(nloci=1000, nsites=150)

# write result to a database file
model.write_loci_to_hdf5(name="test-bpp", outdir="/tmp", diploid=True)









    



wrote 1000 loci to /tmp/test-bpp.seqs.hdf5

True measurements that will inform our priors

In BPP we can set a prior on the percent divergence $\tau$ (pronounced Tau) between the root and tips of the tree. When combined with an estimate of the per-site-per-generation mutation rate ($\mu$) and generation time (g) this can be converted into units of years. In our ipcoal simulation we actually have access to the root ancestral sequence as well as the tip sequences, so we can measure the true percent sequence divergence.



In [6]:

    
# concatenate seqs into single alignment
concat = np.concatenate(model.seqs, axis=1)

# get ancestral root node sequence
rootseq = np.concatenate(model.ancestral_seq)

# what is sequence divergence relative to root?
np.sum(rootseq != concat[0]) / rootseq.size









    Out[6]:





0.03360666666666667

In addition to setting a prior on Tau we must also set a prior on $\theta$ (pronounced theta), the population mutation rate ($\theta$=4Ne$\mu$). Here we know the Ne and $\mu$ used in the simulation scenario, so again we can calculate it exactly. For real data we will not know these prior values precisely, and so we will want to put a wide prior to incorporate our uncertainty.



In [13]:

    
# theta in simulation
4 * TREE.treenode.Ne * 1e-8









    Out[13]:





0.008

BPP analysis setup



In [8]:

    
SEQS = "/tmp/test-bpp.seqs.hdf5"



In [9]:

    
IMAP = {
    "r0": ["r0-0", "r0-1", "r0-2"],
    "r1": ["r1-0", "r1-1", "r1-2"],
    "r2": ["r2-0", "r2-1", "r2-2"],
    "r3": ["r3-0", "r3-1", "r3-2"],
    "r4": ["r4-0", "r4-1", "r4-2"],
}



In [10]:

    
# create a bpp object to run algorithm 00
tool = ipa.bpp(
    name="test-bpp",
    workdir="/tmp",
    data=SEQS,
    guidetree=TREE,
    imap=IMAP,
    infer_sptree=0,
    infer_delimit=0,
    maxloci=1000,
    burnin=2e3,
    nsample=2e4,
    sampfreq=5,
)

The parameters of the object



In [11]:

    
tool.kwargs









    Out[11]:





{'binary': '/tmp/bpp-4.1.4-linux-x86_64/bin/bpp',
 'infer_sptree': 0,
 'infer_delimit': 0,
 'infer_delimit_args': (0, 2),
 'speciesmodelprior': 1,
 'seed': 12345,
 'burnin': 2000,
 'nsample': 20000,
 'sampfreq': 5,
 'thetaprior': (3, 0.002),
 'tauprior': (3, 0.002),
 'phiprior': (1, 1),
 'usedata': 1,
 'cleandata': 0,
 'finetune': (0.01, 0.02, 0.03, 0.04, 0.05, 0.01, 0.01),
 'copied': False}

Set priors and plot priors of transformed values

Here we will enter our expectations on the generation time and mutation rate of our simulated organisms so that we can toggle the theta and tau priors until the resulting prior distributions on Ne and T seem reasonable.

In our simulation the root divergence time was 3M generations. Assuming a generation time of 1 and prior on root tau of ~3% (as we measured above) yields a prior on the root divergence time with 95% CI from about 0.5-6.5 Ma. So this is a pretty good but not highly informative prior.

Similarly, our simulation used Ne=200K or 500K on different lineages. If we assume a wide prior on the per-site per-generation mutation rate ($\mu$) with highest density of 1e-8 (as used in our simulations) and a prior on $\theta$ with highest density at about 0.01 (we calculated 0.008 from the simulated data above) yields a 95% CI on Ne that is again accurate but not highly informative.

These priors incorporate estimated knowledge about our organisms, by describing gentime and mutrate as distributions, and puts informative but wide priors on the parameters we hope to infer (theta and tau).



In [14]:

    
# toggle these prior settings 
tool.kwargs["thetaprior"] = (2.5, 0.004)
tool.kwargs["tauprior"] = (3, 0.01)

# draw the distributions when using our assumptions for transforming
tool.draw_priors(
    gentime_min=0.99, gentime_max=1.01, 
    mutrate_min=5e-9, mutrate_max=2e-8,
);

Run inference

Distribute jobs in parallel and run replicate jobs that start MCMC from different random seeds.



In [15]:

    
tool.run(auto=True, nreps=2, force=True)









    



Parallel connection | latituba: 8 cores
[locus filter] full data: 1000
[locus filter] post filter: 1000
[ipa bpp] bpp v4.1.4
[ipa.bpp] distributed 2 bpp jobs (name=test-bpp, nloci=1000)
[####################] 100% 3:36:37 | progress on all reps 
Parallel connection closed.

Re-load results

You can load your results anytime later after your runs have finished by providing the name and workdir for the job to ipa.bpp. This allows you to mess around with plotting and comparing distributions later. You also have the option here to combine the posteriors from multiple replicate runs (that used the same data and priors), or to analyze each separately.



In [1]:

    
import ipyrad.analysis as ipa



In [2]:

    
# reload results by creating object with a given name and workdir
tool = ipa.bpp(name="test-bpp", workdir="/tmp")

# call summarize on the tool
table, mcmc = tool.summarize_results("00", individual_results=False)









    



[ipa.bpp] found 2 existing result files
[ipa.bpp] summarizing algorithm '00' results
[ipa.bpp] combining mcmc files

The "00" tables result

The main result of the "00" algorithm is a dataframe with the inferred parameters of the multispecies coalescent model.



In [3]:

    
# tables is a summary of the posteriors
table









    Out[3]:







  
    
      
      theta_1r0
      theta_2r1
      theta_3r2
      theta_4r3
      theta_5r4
      theta_6r4r3r2r1r0
      theta_7r4r3
      theta_8r2r1r0
      theta_9r1r0
      tau_6r4r3r2r1r0
      tau_7r4r3
      tau_8r2r1r0
      tau_9r1r0
      lnL
    
  
  
    
      mean
      0.027995
      0.028212
      0.025590
      0.009736
      0.009295
      0.007839
      0.005385
      0.022274
      0.017841
      0.034456
      0.024037
      0.026410
      0.016234
      -418047.448228
    
    
      median
      0.027980
      0.028194
      0.025576
      0.009733
      0.009287
      0.007836
      0.005425
      0.022171
      0.017826
      0.034457
      0.024008
      0.026409
      0.016230
      -418046.894000
    
    
      S.D
      0.000885
      0.000904
      0.000717
      0.000304
      0.000296
      0.000704
      0.001142
      0.002391
      0.001121
      0.000394
      0.000616
      0.000444
      0.000313
      79.616227
    
    
      min
      0.024424
      0.024989
      0.022996
      0.008656
      0.008194
      0.005509
      0.002602
      0.014451
      0.013855
      0.032841
      0.021608
      0.024757
      0.015049
      -418388.687000
    
    
      max
      0.032145
      0.032099
      0.028488
      0.011026
      0.010516
      0.010827
      0.009983
      0.032692
      0.023367
      0.036106
      0.026211
      0.028206
      0.017480
      -417729.014000
    
    
      2.5%
      0.026300
      0.026481
      0.024243
      0.009154
      0.008728
      0.006511
      0.003146
      0.017910
      0.015693
      0.033689
      0.022892
      0.025541
      0.015626
      -418202.863000
    
    
      97.5%
      0.029785
      0.030023
      0.027033
      0.010342
      0.009899
      0.009229
      0.007568
      0.027284
      0.020108
      0.035223
      0.025280
      0.027283
      0.016861
      -417892.551000
    
    
      2.5%HPD
      0.026283
      0.026471
      0.024219
      0.009152
      0.008691
      0.006477
      0.002991
      0.017773
      0.015733
      0.033673
      0.022859
      0.025528
      0.015623
      -418201.078000
    
    
      97.5%HPD
      0.029757
      0.030004
      0.027007
      0.010338
      0.009860
      0.009190
      0.007314
      0.027099
      0.020134
      0.035200
      0.025239
      0.027268
      0.016855
      -417891.188000
    
    
      ESS*
      12262.897304
      11997.844527
      13891.059855
      6974.445253
      6525.824045
      436.125051
      123.289724
      1897.302231
      3587.614817
      579.708878
      147.436912
      2641.537188
      4378.739376
      2087.556267
    
    
      Eff*
      0.613145
      0.599892
      0.694553
      0.348722
      0.326291
      0.021806
      0.006164
      0.094865
      0.179381
      0.028985
      0.007372
      0.132077
      0.218937
      0.104378

Converting units from the table results

The results are often difficult to interpret without converting the units into a more easily interpretable type. For example, the theta parameters represent the population effective mutation rate (4*Ne*u), but we are typically more interested in just knowing N. Similarly, the tau parameters are in units of percent sequence divergence (u/gen) but we instead like to know the divergence time estimates in units of years, or millions of years.

To convert these units we need an estimate of the per-site per-generation mutation-rate (u) and the generation time of our organism (g). Rather than offering a point estimate it is more appropriate in the Bayesian context to provide these values a statistical distribution. This is similar to how we described our priors using a gamma or inverse-gamma distribution earlier, and how we used a distribution of u and g to check the reasonable conversion of the units.

In the example below the converted units can be compared with the true simulation scenario at the beginning of this notebook where we set the divergence times and Ne values on the tree. Among the tip-level taxa we can see that mean Ne estimates are around 2e5 or ~6e5, which is pretty accurate, and the root divergence time is at 3e6, which is correct.



In [17]:

    
df = tool.transform(mcmc, 0.99, 1.01, 5e-9, 2e-8).T



In [18]:

    
df









    Out[18]:







  
    
      
      mean
      median
      S.D
      min
      max
    
  
  
    
      Ne_1r0
      6.16e+05
      5.78e+05
      2.05e+05
      1.89e+05
      2.55e+06
    
    
      Ne_2r1
      6.2e+05
      5.83e+05
      2.06e+05
      1.9e+05
      2.57e+06
    
    
      Ne_3r2
      5.63e+05
      5.29e+05
      1.87e+05
      1.72e+05
      2.28e+06
    
    
      Ne_4r3
      2.14e+05
      2.01e+05
      7.12e+04
      6.57e+04
      8.87e+05
    
    
      Ne_5r4
      2.04e+05
      1.92e+05
      6.8e+04
      6.26e+04
      8.47e+05
    
    
      Ne_6r4r3r2r1r0
      1.73e+05
      1.62e+05
      5.93e+04
      5.47e+04
      7.23e+05
    
    
      Ne_7r4r3
      1.2e+05
      1.11e+05
      4.54e+04
      2.8e+04
      5.7e+05
    
    
      Ne_8r2r1r0
      4.91e+05
      4.59e+05
      1.71e+05
      1.45e+05
      2.17e+06
    
    
      Ne_9r1r0
      3.92e+05
      3.68e+05
      1.32e+05
      1.3e+05
      1.71e+06
    
    
      div_6r4r3r2r1r0
      3.03e+06
      2.85e+06
      1e+06
      9.51e+05
      1.24e+07
    
    
      div_7r4r3
      2.11e+06
      1.98e+06
      7.01e+05
      6.42e+05
      8.56e+06
    
    
      div_8r2r1r0
      2.32e+06
      2.18e+06
      7.7e+05
      7.2e+05
      9.46e+06
    
    
      div_9r1r0
      1.43e+06
      1.34e+06
      4.74e+05
      4.4e+05
      5.81e+06
    
    
      lnL
      NaN
      NaN
      NaN
      NaN
      NaN

Plot posteriors

You can plot several posterior distributions on a shared axis using the .draw_posteriors() function. This takes a list of tuples as input, where each tuple is the (mean, variance) of a parameter estimate. You can draw the posteriors from a single analysis using this function or combine results from several different analyses if you wish.



In [43]:

    
# get only the divergence time result
subd = df.loc[[i for i in df.index if "div_" in i], :]

# get results as tuple pairs with (mean, var)
tuples = [
    (i, j) for (i, j) in 
    zip(subd["mean"], subd.loc[:, "S.D"] ** 2)
]

# draw the results
c, a, m = tool.draw_posteriors(
    gamma_tuples=tuples,
    labels=subd.index,
);

# style the axes
a.y.ticks.labels.angle = -90

Plot results on a tree

... coming

	theta_1r0	theta_2r1	theta_3r2	theta_4r3	theta_5r4	theta_6r4r3r2r1r0	theta_7r4r3	theta_8r2r1r0	theta_9r1r0	tau_6r4r3r2r1r0	tau_7r4r3	tau_8r2r1r0	tau_9r1r0	lnL
mean	0.027995	0.028212	0.025590	0.009736	0.009295	0.007839	0.005385	0.022274	0.017841	0.034456	0.024037	0.026410	0.016234	-418047.448228
median	0.027980	0.028194	0.025576	0.009733	0.009287	0.007836	0.005425	0.022171	0.017826	0.034457	0.024008	0.026409	0.016230	-418046.894000
S.D	0.000885	0.000904	0.000717	0.000304	0.000296	0.000704	0.001142	0.002391	0.001121	0.000394	0.000616	0.000444	0.000313	79.616227
min	0.024424	0.024989	0.022996	0.008656	0.008194	0.005509	0.002602	0.014451	0.013855	0.032841	0.021608	0.024757	0.015049	-418388.687000
max	0.032145	0.032099	0.028488	0.011026	0.010516	0.010827	0.009983	0.032692	0.023367	0.036106	0.026211	0.028206	0.017480	-417729.014000
2.5%	0.026300	0.026481	0.024243	0.009154	0.008728	0.006511	0.003146	0.017910	0.015693	0.033689	0.022892	0.025541	0.015626	-418202.863000
97.5%	0.029785	0.030023	0.027033	0.010342	0.009899	0.009229	0.007568	0.027284	0.020108	0.035223	0.025280	0.027283	0.016861	-417892.551000
2.5%HPD	0.026283	0.026471	0.024219	0.009152	0.008691	0.006477	0.002991	0.017773	0.015733	0.033673	0.022859	0.025528	0.015623	-418201.078000
97.5%HPD	0.029757	0.030004	0.027007	0.010338	0.009860	0.009190	0.007314	0.027099	0.020134	0.035200	0.025239	0.027268	0.016855	-417891.188000
ESS*	12262.897304	11997.844527	13891.059855	6974.445253	6525.824045	436.125051	123.289724	1897.302231	3587.614817	579.708878	147.436912	2641.537188	4378.739376	2087.556267
Eff*	0.613145	0.599892	0.694553	0.348722	0.326291	0.021806	0.006164	0.094865	0.179381	0.028985	0.007372	0.132077	0.218937	0.104378

	mean	median	S.D	min	max
Ne_1r0	6.16e+05	5.78e+05	2.05e+05	1.89e+05	2.55e+06
Ne_2r1	6.2e+05	5.83e+05	2.06e+05	1.9e+05	2.57e+06
Ne_3r2	5.63e+05	5.29e+05	1.87e+05	1.72e+05	2.28e+06
Ne_4r3	2.14e+05	2.01e+05	7.12e+04	6.57e+04	8.87e+05
Ne_5r4	2.04e+05	1.92e+05	6.8e+04	6.26e+04	8.47e+05
Ne_6r4r3r2r1r0	1.73e+05	1.62e+05	5.93e+04	5.47e+04	7.23e+05
Ne_7r4r3	1.2e+05	1.11e+05	4.54e+04	2.8e+04	5.7e+05
Ne_8r2r1r0	4.91e+05	4.59e+05	1.71e+05	1.45e+05	2.17e+06
Ne_9r1r0	3.92e+05	3.68e+05	1.32e+05	1.3e+05	1.71e+06
div_6r4r3r2r1r0	3.03e+06	2.85e+06	1e+06	9.51e+05	1.24e+07
div_7r4r3	2.11e+06	1.98e+06	7.01e+05	6.42e+05	8.56e+06
div_8r2r1r0	2.32e+06	2.18e+06	7.7e+05	7.2e+05	9.46e+06
div_9r1r0	1.43e+06	1.34e+06	4.74e+05	4.4e+05	5.81e+06
lnL	NaN	NaN	NaN	NaN	NaN