ipyrad-analysis toolkit: mrbayes

In these analyses our interest is primarily in inferring accurate branch lengths under a relaxed molecular clock model. This means that tips are forced to line up at the present (time) but that rates of substitutions are allowed to vary among branches to best explain the variation in the sequence data.

There is a huge range of models that can be employed using mrbayes by employing different combinations of parameter settings, model definitions, and prior settings. The ipyrad-analysis tool here is intended to make it easy to run such jobs many times (e.g., distributed in parallel) once you have decided on your settings. In addition, we provide a number of pre-set models (e.g., clock_model=2) that may be useful for simple scenarios.

Here we use simulations to demonstrate the accuracy of branch length estimation when sequences come from a single versus multiple distinct genealogies (e.g., gene tree vs species tree), and show an option to fix the topology to speed up analyses when your only interest is to estimate branch lengths.


In [1]:
# conda install ipyrad toytree mrbayes -c conda-forge -c bioconda

In [2]:
import toytree
import ipcoal
import ipyrad.analysis as ipa

Simulate a gene tree with 14 tips and MRCA of 1M generations


In [3]:
TREE = toytree.rtree.bdtree(ntips=8, b=0.8, d=0.2, seed=123)
TREE = TREE.mod.node_scale_root_height(1e6)
TREE.draw(ts='o', layout='d', scalebar=True);


r0r1r2r3r4r5r6r705000001000000

Simulate sequences on single gene tree and write to NEXUS

When Ne is greater the gene tree is more likely to deviate from the species tree topology and branch lengths.


In [68]:
# init simulator
model = ipcoal.Model(TREE, Ne=2e4, nsamples=2, recomb=0)

# simulate sequence data on coalescent genealogies
model.sim_loci(nloci=1, nsites=20000)

# write results to database file
model.write_concat_to_nexus(name="mbtest-1", outdir='/tmp', diploid=True)


wrote concat locus (8 x 20000bp) to /tmp/mbtest-1.nex

In [69]:
# the simulated genealogy of haploid alleles
gene = model.df.genealogy[0]

# draw the genealogy
toytree.tree(gene).draw(ts='o', layout='d', scalebar=True);


r1-1r1-0r0-1r0-0r2-1r2-0r3-1r3-0r4-1r4-0r5-1r5-0r6-1r6-0r7-1r7-005077001015400

View an example locus

This shows the 2 haploid samples simulated for each tip in the species tree.


In [70]:
model.draw_seqview(idx=0, start=0, end=50);


r0-0r0-1r1-0r1-1r2-0r2-1r3-0r3-1r4-0r4-1r5-0r5-1r6-0r6-1r7-0r7-1

(1) Infer a tree under a relaxed molecular clock model


In [71]:
# init the mb object
mb = ipa.mrbayes(
    data="/tmp/mbtest-1.nex",
    name="itest-1",
    workdir="/tmp",
    clock_model=2,
    constraints=TREE,
    ngen=int(1e6),
    nruns=2,
)

# summary of priors/params
print(mb.params)


Existing results loaded for run [3], see .trees attribute.
brlenspr      clock:uniform       
clockratepr   normal(0.01,0.005)  
clockvarpr    igr                 
igrvarpr      exp(10.0)           
nchains       4                   
ngen          1000000             
nruns         2                   
samplefreq    1000                
topologypr    fixed(fixedtree)    


In [72]:
# start the run
mb.run(force=True)


job itest-1 finished successfully

In [76]:
# load the inferred tree
mbtre = toytree.tree("/tmp/itest-1.nex.con.tre", 10)

# scale root node to 1e6
mbtre = mbtre.mod.node_scale_root_height(1e6)

# draw inferred tree 
c, a, m = mbtre.draw(ts='o', layout='d', scalebar=True);

# draw TRUE tree in orange on the same axes
TREE.draw(
    axes=a, 
    ts='o', layout='d', scalebar=True, 
    edge_colors="darkorange",
    node_sizes=0,
);


r1r0r2r3r4r5r7r6r0r1r2r3r4r5r6r705000001000000

In [77]:
# check convergence statistics
mb.convergence_stats


Out[77]:
Mean Variance Lower Upper Median minESS avgESS PSRF
Parameter
TH 9.748e-03 5.926e-07 8.215e-03 1.137e-02 9.745e-03 338.719 492.504 1.002
TL 3.681e-02 5.710e-06 3.254e-02 4.197e-02 3.658e-02 33.052 375.649 1.019
r(A<->C) 1.649e-01 2.324e-04 1.332e-01 1.865e-01 1.642e-01 7.771 353.493 1.046
r(A<->G) 1.626e-01 1.629e-04 1.361e-01 1.890e-01 1.628e-01 683.485 693.271 1.000
r(A<->T) 1.948e-01 2.129e-04 1.699e-01 2.270e-01 1.922e-01 20.844 306.530 1.007
r(C<->G) 1.814e-01 1.886e-04 1.550e-01 2.091e-01 1.831e-01 84.910 297.140 1.004
r(C<->T) 1.312e-01 1.544e-04 1.069e-01 1.559e-01 1.292e-01 55.086 365.485 1.015
r(G<->T) 1.651e-01 1.697e-04 1.401e-01 1.912e-01 1.631e-01 36.611 320.227 1.009
pi(A) 2.488e-01 7.344e-06 2.438e-01 2.544e-01 2.485e-01 387.327 441.524 1.000
pi(C) 2.518e-01 1.120e-05 2.462e-01 2.572e-01 2.517e-01 6.667 320.241 1.064
pi(G) 2.484e-01 9.192e-06 2.434e-01 2.547e-01 2.482e-01 13.418 239.950 1.021
pi(T) 2.510e-01 8.676e-06 2.451e-01 2.568e-01 2.506e-01 42.040 251.688 1.017
alpha 2.994e+00 1.636e+00 9.028e-01 5.494e+00 2.847e+00 631.629 691.315 1.002
igrvar 1.309e-04 3.692e-08 1.019e-06 5.165e-04 5.420e-05 6.539 78.078 1.060
clockrate 1.064e-02 1.476e-05 3.318e-03 1.871e-02 9.703e-03 86.095 91.622 1.000

(2) Concatenated sequences from a species tree

Here we use concatenated sequence data from 100 loci where each represents one or more distinct genealogies. In addition, Ne is increased to 1e5, allowing for more genealogical variation. We expect the accuracy of estimated edge lengths will decrease since we are now adequately modeling the genealogical variation when using concatenation.


In [78]:
# init simulator
model = ipcoal.Model(TREE, Ne=1e5, nsamples=2, recomb=0)

# simulate sequence data on coalescent genealogies
model.sim_loci(nloci=100, nsites=200)

# write results to database file
model.write_concat_to_nexus(name="mbtest-2", outdir='/tmp', diploid=True)


wrote concat locus (8 x 20000bp) to /tmp/mbtest-2.nex

In [87]:
# the simulated genealogies of haploid alleles
genes = model.df.genealogy[:4]

# draw the genealogies of the first four loci
toytree.mtree(genes).draw_tree_grid(ts='o', layout='r');


r2-1r2-0r0-1r4-1r4-0r3-1r3-0r1-0r0-0r1-1r5-1r5-0r7-1r7-0r6-0r6-1r1-1r0-0r1-0r0-1r2-1r2-0r3-1r3-0r4-1r4-0r5-1r5-0r7-1r7-0r6-1r6-0r3-1r3-0r0-0r1-1r1-0r0-1r2-1r2-0r4-1r4-0r5-1r5-0r6-1r6-0r7-1r7-0r0-1r0-0r1-1r2-0r1-0r2-1r4-1r4-0r3-1r3-0r5-1r5-0r6-1r6-0r7-1r7-0

In [88]:
# init the mb object
mb = ipa.mrbayes(
    data="/tmp/mbtest-2.nex",
    workdir="/tmp",
    name="itest-2",
    clock_model=2,
    constraints=TREE,
    ngen=int(1e6),
    nruns=2,
)

# summary of priors/params
print(mb.params)

# start the run
mb.run(force=True)


brlenspr      clock:uniform       
clockratepr   normal(0.01,0.005)  
clockvarpr    igr                 
igrvarpr      exp(10.0)           
nchains       4                   
ngen          1000000             
nruns         2                   
samplefreq    1000                
topologypr    fixed(fixedtree)    

job itest-2 finished successfully

In [91]:
# load the inferred tree
mbtre = toytree.tree("/tmp/itest-2.nex.con.tre", 10)

# scale root node from unitless to 1e6
mbtre = mbtre.mod.node_scale_root_height(1e6)

# draw inferred tree 
c, a, m = mbtre.draw(ts='o', layout='d', scalebar=True);

# draw true tree in orange on the same axes
TREE.draw(
    axes=a, 
    ts='o', layout='d', scalebar=True, 
    edge_colors="darkorange",
    node_sizes=0,
);


r1r0r2r3r4r5r7r6r0r1r2r3r4r5r6r705000001000000

In [92]:
mb.convergence_stats


Out[92]:
Mean Variance Lower Upper Median minESS avgESS PSRF
Parameter
TH 1.438e-02 8.827e-07 1.262e-02 1.641e-02 1.437e-02 561.836 656.418 1.001
TL 6.213e-02 1.072e-05 5.609e-02 6.894e-02 6.199e-02 528.868 598.330 1.000
r(A<->C) 1.307e-01 1.363e-04 1.094e-01 1.541e-01 1.309e-01 656.427 670.789 1.000
r(A<->G) 1.323e-01 1.300e-04 1.110e-01 1.544e-01 1.324e-01 576.671 663.836 1.000
r(A<->T) 2.159e-01 2.071e-04 1.889e-01 2.456e-01 2.159e-01 589.471 617.054 0.999
r(C<->G) 2.303e-01 2.222e-04 1.996e-01 2.583e-01 2.304e-01 552.482 599.817 0.999
r(C<->T) 1.479e-01 1.360e-04 1.258e-01 1.703e-01 1.473e-01 593.189 604.339 1.000
r(G<->T) 1.428e-01 1.458e-04 1.210e-01 1.671e-01 1.421e-01 457.278 527.656 1.004
pi(A) 2.479e-01 8.765e-06 2.416e-01 2.532e-01 2.479e-01 515.518 517.603 1.000
pi(C) 2.490e-01 8.842e-06 2.436e-01 2.552e-01 2.489e-01 360.866 439.721 1.000
pi(G) 2.509e-01 8.810e-06 2.454e-01 2.570e-01 2.510e-01 388.195 427.827 1.000
pi(T) 2.522e-01 9.324e-06 2.465e-01 2.582e-01 2.523e-01 378.545 414.868 0.999
alpha 4.138e-02 7.639e-04 6.544e-05 8.931e-02 3.943e-02 544.383 647.692 1.000
igrvar 1.155e-04 3.021e-08 1.036e-06 3.720e-04 6.472e-05 402.092 483.518 0.999
clockrate 1.194e-02 1.689e-05 4.819e-03 1.972e-02 1.162e-02 74.123 84.988 1.012

To see the NEXUS file (data, parameters, priors):


In [93]:
mb.print_nexus_string()


#NEXUS

[log block]
log start filename=/tmp/itest-2.nex.log replace;

[data block]
execute /tmp/mbtest-2.nex;

[tree block]
begin trees;
  tree fixedtree = ((r7,r6),(r5,(r4,(r3,(r2,(r1,r0))))));
end;

[mb block]
begin mrbayes;
  set autoclose=yes nowarn=yes;

  lset nst=6 rates=gamma;

  prset brlenspr=clock:uniform;
  prset clockvarpr=igr;
  prset igrvarpr=exp(10.0);
  prset clockratepr=normal(0.01,0.005);
  prset topologypr=fixed(fixedtree);

  

  mcmcp ngen=1000000 nrun=2 nchains=4;
  mcmcp relburnin=yes burninfrac=0.25;
  mcmcp samplefreq=1000;
  mcmcp printfreq=10000 diagnfr=5000;
  mcmcp filename=/tmp/itest-2.nex;
  mcmc;

  sump filename=/tmp/itest-2.nex;
  sumt filename=/tmp/itest-2.nex contype=allcompat;
end;

[log block]
log stop filename=/tmp/itest-2.nex.log append;

(3) Tree inference (not fixed topology) and plotting support values

Here we will try to infer the topology from a concatenated data set (i.e., not set a constraint on the topology). I increased the ngen setting since the MCMC chain takes longer to converge when searching over topology space.


In [95]:
# init the mb object
mb = ipa.mrbayes(
    data="/tmp/mbtest-2.nex",
    name="itest-3",
    workdir="/tmp",
    clock_model=2,
    ngen=int(2e6),
    nruns=2,
)

# summary of priors/params
print(mb.params)

# start run
mb.run(force=True)


brlenspr      clock:uniform       
clockratepr   normal(0.01,0.005)  
clockvarpr    igr                 
igrvarpr      exp(10.0)           
nchains       4                   
ngen          2000000             
nruns         2                   
samplefreq    1000                
topologypr    uniform             

job itest-3 finished successfully

In [ ]:
# load the inferred tree
mbtre = toytree.tree("/tmp/itest-3.nex.con.tre", 10)

# scale root node from unitless to 1e6
mbtre = mbtre.mod.node_scale_root_height(1e6)

# draw inferred tree 
c, a, m = mbtre.draw(
    #ts='s', 
    layout='d',
    scalebar=True, 
    node_sizes=18, 
    node_labels="prob{percent}",
);