This notebook uses the Pedicularis example data set from the first empirical ipyrad tutorial. Here I show how to run BUCKy on a large set of loci parsed from the output file with the .alleles.loci ending. All code in this notebook is Python; you can simply follow along and execute the same code in a Jupyter notebook of your own.
In [1]:
## conda install -c BioBuilds mrbayes
## conda install -c ipyrad ipyrad
## conda install -c ipyrad bucky
In [2]:
## import Python libraries
import ipyrad.analysis as ipa
import ipyparallel as ipp
To execute code in parallel we will use the ipyparallel Python library. A quick guide to starting a parallel cluster locally can be found here, and instructions for setting up a remote cluster on an HPC system are available here. In either case, this notebook assumes you have started an ipcluster instance that this notebook can find, which the cell below will test.
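If no cluster is running yet, one can typically be launched from a separate terminal before executing the cell below (the engine count here is just an example; adjust it to the number of cores available):

```shell
## start a local ipcluster with 4 engines, detached from the terminal
ipcluster start --n=4 --daemonize
```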
In [3]:
## look for running ipcluster instance, and create load-balancer
ipyclient = ipp.Client()
print("{} engines found".format(len(ipyclient)))
The two required arguments are name and data. The data argument should be a .loci file or a .alleles.loci file. The name will be used to name output files, which will be written to {workdir}/{name}/{number}.nex. BUCKy does not deal well with missing data, so loci are only included if they contain data for all samples in the analysis. By default, all samples found in the loci file are used, unless you enter a list of names (the samples argument) to subsample taxa, which we do here. It is best to select one individual per species or subspecies. You can set a number of additional parameters in the .params dictionary. Here I use the maxloci parameter to limit the total number of loci so that the example analysis finishes faster; in practice, BUCKy runs quite fast and you would typically use all of your loci in a real analysis.
In [4]:
## make a list of sample names you wish to include in your BUCKy analysis
samples = [
"29154_superba",
"30686_cyathophylla",
"41478_cyathophylloides",
"33413_thamno",
"30556_thamno",
"35236_rex",
"40578_rex",
"38362_rex",
"33588_przewalskii",
]
In [5]:
## initiate a bucky object
c = ipa.bucky(
name="buckytest",
data="analysis-ipyrad/pedic_outfiles/pedic.alleles.loci",
workdir="analysis-bucky",
samples=samples,
minsnps=0,
maxloci=100,
)
In [6]:
## print the params dictionary
c.params
Out[6]:
In [7]:
## This will write nexus files to {workdir}/{name}/{number}.nex
c.write_nexus_files(force=True)
In [8]:
## print an example nexus file
! cat analysis-bucky/buckytest/1.nex
There are four parts to a full BUCKy analysis. The first is converting the data into nexus files; the following steps are .run_mrbayes(), then .run_mbsum(), and finally .run_bucky(). Each uses the files produced by the previous function, in order. You can use the force flag to overwrite existing files. An ipyclient should be provided to distribute the jobs in parallel. Parallelization is especially important for the mrbayes analyses, where more cores lead to approximately linear speed improvements. An ipyrad bucky analysis object will run all four steps sequentially by simply calling the .run() command. See the end of the notebook for results.
In [9]:
## run the complete analysis
c.run(force=True, ipyclient=ipyclient)
In [18]:
## (1) This will write nexus files to {workdir}/{name}/{number}.nex
c.write_nexus_files(force=True)
In [19]:
## (2) distributes mrbayes jobs across the parallel client
c.run_mrbayes(force=True, ipyclient=ipyclient)
In [10]:
## (3) this step is fast, simply summing the gene-tree posteriors
c.run_mbsum(force=True, ipyclient=ipyclient)
In [11]:
## (4) infer concordance factors with BUCKy. This will run in parallel
## for however many alpha values are in c.params.bucky_alpha list
c.run_bucky(force=True, ipyclient=ipyclient)
View the results in the file {workdir}/{name}/CF-{alpha-value}.concordance. We haven't yet developed any further ipyrad tools for parsing these results, but hope to do so in the future. The main results you are typically interested in are the Primary Concordance Tree and the Splits in the Primary Concordance Tree.
In [17]:
## print the first 50 lines of a results file
! head -n 50 analysis-bucky/buckytest/CF-a1.0.concordance
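Since ipyrad does not yet parse these results, a minimal sketch of extracting one section from a .concordance file is shown below. The section header string and the mock file contents are assumptions based on typical BUCKy output, not output from this analysis; the helper function and mock are hypothetical names introduced only for illustration.

```python
def get_section(text, header):
    """Return the lines following `header`, up to the next blank line."""
    lines = text.splitlines()
    try:
        start = lines.index(header) + 1
    except ValueError:
        return []
    section = []
    for line in lines[start:]:
        if not line.strip():
            break
        section.append(line.strip())
    return section

## mock of a BUCKy .concordance file (structure assumed, not real output)
mock = """\
translate
 1 29154_superba,
 2 30686_cyathophylla,
 3 33588_przewalskii;

Primary Concordance Tree Topology:
(1,(2,3));

Splits in the Primary Concordance Tree:
"""

print(get_section(mock, "Primary Concordance Tree Topology:"))
## -> ['(1,(2,3));']
```

The same function could then be pointed at a real file with, e.g., `get_section(open("analysis-bucky/buckytest/CF-a1.0.concordance").read(), ...)`, once you have confirmed the header strings in your own output.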