The ipyrad.analysis tool kit

Deren Eaton

Install software

All required software for this walkthrough is available on conda.


In [1]:
# conda install -c ipyrad ipyrad structure clumpp bpp 
# conda install -c eaton-lab toytree toyplot
# conda install -c bioconda raxml
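A quick sanity check is to import the installed packages and print their versions (a minimal sketch; version numbers will vary):


In [ ]:
## imports should succeed without error after installation
import ipyrad
import toytree
print ipyrad.__version__
print toytree.__version__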

Start an ipyparallel cluster

In a separate terminal, run the following command to start a cluster of engines. If you are working in a notebook that is running remotely, use the Jupyter dashboard to open a new terminal.


In [49]:
# ipcluster start --n=4

You should then be able to connect to the engines in your notebook:


In [10]:
## connect to the cluster
import ipyparallel as ipp
ipyclient = ipp.Client()

## print number of engines
print len(ipyclient), "connected engines"


4 connected engines
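The engines can also be used directly through the standard ipyparallel API; for example, a quick check that every engine responds:


In [ ]:
## ask each engine to report its process ID
import os
print ipyclient[:].apply_sync(os.getpid)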

Assemble a RAD data set

The code below assembles the example simulated data set from the ipyrad tutorial.


In [ ]:
## import ipyrad
import ipyrad as ip

Minimal workflow: scroll down for details.


In [ ]:
## create an Assembly object
data = ip.Assembly("simdata")

## set I/O paths for the data
data.set_params("project_dir", "~/workshop")
data.set_params("raw_fastq_path", "ipsimdata/rad_example_R1_.fastq.gz")
data.set_params("barcodes_path", "ipsimdata/rad_example_barcodes.txt")

## run all steps of the Assembly
data.run("1234567")

Modify more parameters


In [3]:
## set params
data.set_params("filter_adapters", 2)
data.set_params("output_formats", "lpask")

## show params
data.get_params()


  0   assembly_name               simdata                                      
  1   project_dir                 /home/deren/workshop                         
  2   raw_fastq_path              ./ipsimdata/rad_example_R1_.fastq.gz         
  3   barcodes_path               ./ipsimdata/rad_example_barcodes.txt         
  4   sorted_fastq_path                                                        
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    rad                                          
  8   restriction_overhang        ('TGCAG', '')                                
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                                            
  12  mindepth_majrule            6                                            
  13  maxdepth                    10000                                        
  14  clust_threshold             0.85                                         
  15  max_barcode_mismatch        0                                            
  16  filter_adapters             2                                            
  17  filter_min_trim_len         35                                           
  18  max_alleles_consens         2                                            
  19  max_Ns_consens              (5, 5)                                       
  20  max_Hs_consens              (8, 8)                                       
  21  min_samples_locus           4                                            
  22  max_SNPs_locus              (20, 20)                                     
  23  max_Indels_locus            (8, 8)                                       
  24  max_shared_Hs_locus         0.5                                          
  25  trim_reads                  (0, 0, 0, 0)                                 
  26  trim_loci                   (0, 0, 0, 0)                                 
  27  output_formats              ('l', 'p', 'a', 's', 'k')                    
  28  pop_assign_file                                                          

Assemble the data set

You can run one or more steps just like in the CLI.
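For example, a sketch of running the same assembly in two stages instead of all at once:


In [ ]:
## demultiplex and filter first...
data.run("12")

## ...then cluster, call consensus, and write outputs later
data.run("34567")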


In [4]:
## run all steps of assembly
data.run("1234567")


  Assembly: simdata
  [####################] 100%  sorting reads         | 0:00:03 | s1 | 
  [####################] 100%  writing/compressing   | 0:00:01 | s1 | 
  [####################] 100%  processing reads      | 0:00:03 | s2 | 
  [####################] 100%  dereplicating         | 0:00:00 | s3 | 
  [####################] 100%  clustering            | 0:00:01 | s3 | 
  [####################] 100%  building clusters     | 0:00:00 | s3 | 
  [####################] 100%  chunking              | 0:00:00 | s3 | 
  [####################] 100%  aligning              | 0:00:08 | s3 | 
  [####################] 100%  concatenating         | 0:00:00 | s3 | 
  [####################] 100%  inferring [H, E]      | 0:00:03 | s4 | 
  [####################] 100%  calculating depths    | 0:00:00 | s5 | 
  [####################] 100%  chunking clusters     | 0:00:00 | s5 | 
  [####################] 100%  consens calling       | 0:00:23 | s5 | 
  [####################] 100%  concat/shuffle input  | 0:00:00 | s6 | 
  [####################] 100%  clustering across     | 0:00:00 | s6 | 
  [####################] 100%  building clusters     | 0:00:00 | s6 | 
  [####################] 100%  aligning clusters     | 0:00:04 | s6 | 
  [####################] 100%  database indels       | 0:00:00 | s6 | 
  [####################] 100%  indexing clusters     | 0:00:01 | s6 | 
  [####################] 100%  building database     | 0:00:00 | s6 | 
  [####################] 100%  filtering loci        | 0:00:06 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:00 | s7 | 
  [####################] 100%  building arrays       | 0:00:00 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:00 | s7 | 
  Outfiles written to: ~/workshop/simdata_outfiles

Access assembly results

You can easily access summary stats for the assembly as a data frame.


In [5]:
## summary stats
data.stats


Out[5]:
      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens
1A_0      6      19862                19862            1000              1000    0.001852   0.000758           1000
1B_0      6      20043                20043            1000              1000    0.001900   0.000752           1000
1C_0      6      20136                20136            1000              1000    0.002084   0.000745           1000
1D_0      6      19966                19966            1000              1000    0.001803   0.000754           1000
2E_0      6      20017                20017            1000              1000    0.001831   0.000765           1000
2F_0      6      19933                19933            1000              1000    0.001996   0.000755           1000
2G_0      6      20030                20030            1000              1000    0.001940   0.000763           1000
2H_0      6      20199                20198            1000              1000    0.001747   0.000756           1000
3I_0      6      19885                19885            1000              1000    0.001818   0.000755           1000
3J_0      6      19822                19822            1000              1000    0.001968   0.000783           1000
3K_0      6      19965                19965            1000              1000    0.002072   0.000760           1000
3L_0      6      20008                20008            1000              1000    0.002042   0.000748           1000
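Fuller per-step statistics are also available as data frames through the .stats_dfs attribute; for example, the step 3 clustering stats:


In [ ]:
## data frame of step 3 (clustering) statistics
data.stats_dfs.s3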

Plot statistics


In [35]:
import toyplot

## plot barplot
c, a, m = toyplot.bars(
    data.stats.hetero_est, 
    height=250, width=500,
    )

## style the axes
a.x.ticks.locator = toyplot.locator.Explicit(
    locations=range(len(data.stats)),
    labels=data.stats.index)
a.y.label.text = "Heterozygosity"
a.y.ticks.show = True


[barplot: per-sample heterozygosity estimates for samples 1A_0 through 3L_0; y-axis "Heterozygosity", 0.000 to 0.002]
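The canvas returned by toyplot can also be rendered to a file, for example as a standalone HTML figure (the PDF backend is used the same way in the abba-baba section below):


In [ ]:
## save the barplot canvas to an HTML file
import toyplot.html
toyplot.html.render(c, "heterozygosity.html")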

Access result files

You can also access the stats files for each step, and the output files for downstream analyses.


In [27]:
## s2 stats file
print data.stats_files.s2

## the .loci file location
print data.outfiles.loci


/home/deren/workshop/simdata_edits/s2_rawedit_stats.txt
/home/deren/workshop/simdata_outfiles/simdata.loci

ipyrad.analysis tools

The ipyrad.analysis module includes wrapper tools that make it easy to run many evolutionary analysis programs efficiently from within a notebook.


In [1]:
## import the toolkit
import ipyrad.analysis as ipa

RAxML analysis

Simply enter the location of the phylip file, which can be accessed from the .outfiles attribute of the Assembly object. You can also provide a name and output directory, and set many other optional parameters.


In [2]:
import ipyrad as ip
import ipyparallel as ipp

data = ip.load_json("/home/deren/workshop/simdata.json")
ipyclient = ipp.Client()


  loading Assembly: simdata
  from saved path: ~/workshop/simdata.json

Minimal workflow: scroll down for details.


In [3]:
## create a raxml object
s = ipa.raxml(
        name=data.name,
        phyfile=data.outfiles.phy, 
        workdir="~/workshop/analysis-raxml");

## run the analysis
s.run()


Error: set a new name for this job:
File exists: /home/deren/workshop/analysis-raxml/RAxML_info.simdata

This error is expected when results from a previous run already exist under the same name; either set a new name or overwrite the old results with run(force=True), as shown below.

Modify parameters and other functions


In [4]:
## modify params
s.params.T = 4
s.params.N = 100

In [5]:
## print the raxml command as a string
print s.command


raxmlHPC-PTHREADS-SSE3 -f a -T 4 -m GTRGAMMA -N 100 -x 12345 -p 54321 -n simdata -w /home/deren/workshop/analysis-raxml -s /home/deren/workshop/simdata_outfiles/simdata.phy

In [6]:
## overwrite existing result with this 'name'
s.run(force=True)


job simdata finished successfully

Access the tree files and plot


In [7]:
print s.trees


bestTree                   ~/workshop/analysis-raxml/RAxML_bestTree.simdata
bipartitions               ~/workshop/analysis-raxml/RAxML_bipartitions.simdata
bipartitionsBranchLabels   ~/workshop/analysis-raxml/RAxML_bipartitionsBranchLabels.simdata
boostrap                   ~/workshop/analysis-raxml/RAxML_bootstrap.simdata
info                       ~/workshop/analysis-raxml/RAxML_info.simdata


In [8]:
import toytree
tre = toytree.tree(s.trees.bipartitions)
tre.root(wildcard='3')
tre.draw(
    width=300,
    node_labels=tre.get_node_values("support"),
    node_size=20,
    );


[rooted tree: tips 3L_0 through 1A_0, with 100% bootstrap support shown on all internal nodes]

introgression (abba-baba) analysis

The baba object can be used to set up abba-baba tests, to calculate results, and to generate plots to visualize them.

Minimal example, scroll down for details.


In [9]:
## create a baba object
b = ipa.baba(data=data.outfiles.loci)

## generate tests given the rooted tree
b.tests = [
    {"p4":["3L_0"], 
     "p3":["2F_0"], 
     "p2":["1D_0"],
     "p1":["1A_0"]}]

## run jobs distributed across the cluster
b.run(ipyclient)
b.results_table


[####################] 100%  calculating D-stats  | 0:00:00 |  
Out[9]:
   dstat  bootmean  bootstd     Z  ABBA  BABA  nloci
0 -0.098    -0.098    0.228  0.43  11.5  14.0   1000
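A taxon in a test can also be represented by several samples, in which case their allele frequencies are pooled when counting site patterns. A sketch using the same API:


In [ ]:
## the same test with multiple samples per lineage
b.tests = [
    {"p4": ["3L_0", "3K_0"],
     "p3": ["2F_0", "2E_0"],
     "p2": ["1D_0", "1C_0"],
     "p1": ["1A_0", "1B_0"]}]
b.run(ipyclient)
b.results_table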

Auto-generate tests

Instead of writing out many tests explicitly, you can enter a rooted tree to the baba object and use the generate_tests_from_tree() function to auto-generate four-taxon tests that fit the tree and any constraints.


In [10]:
## init baba object
b = ipa.baba(data=data.outfiles.loci, newick=tre)

## generate all possible tests on this tree
b.generate_tests_from_tree()

## set constraints on tests
cdict = {"p4": ["3L_0"], 
         "p3": ["2E_0", "2F_0"], 
         "p2": ["1D_0"]}

## generate a constrained set of tests
b.generate_tests_from_tree(
    constraint_dict=cdict,
    constraint_exact=False,
    )


674 tests generated from tree
12 tests generated from tree

Run all tests linked to a baba object


In [11]:
## run the tests (in this case 12) linked to the baba object
b.run(ipyclient)

## show results table
b.results_table


[####################] 100%  calculating D-stats  | 0:00:03 |  
Out[11]:
     dstat  bootmean  bootstd      Z  ABBA    BABA  nloci
0   -0.181    -0.186    0.217  0.836  11.0  15.875   1000
1   -0.030    -0.037    0.213  0.143  13.0  13.812   1000
2   -0.005    -0.004    0.216  0.021  13.5  13.625   1000
3   -0.057    -0.060    0.222  0.255  12.5  14.000   1000
4   -0.135    -0.128    0.224  0.603  12.0  15.750   1000
5   -0.004    -0.007    0.208  0.021  14.0  14.125   1000
6    0.009     0.007    0.211  0.041  14.5  14.250   1000
7   -0.018    -0.009    0.227  0.080  13.5  14.000   1000
8   -0.231    -0.226    0.224  1.030  10.0  16.000   1000
9   -0.059    -0.076    0.211  0.279  12.0  13.500   1000
10  -0.020    -0.009    0.220  0.089  12.5  13.000   1000
11  -0.098    -0.095    0.223  0.440  11.5  14.000   1000

Plot results


In [12]:
b.plot(
    height=350,
    pct_tree_x = 0.4,
    pct_tree_y = 0.2,
    );


[plot: the rooted tree with the 12 tests drawn below it; bars show Z-scores (0 to 3.0) and D-statistics (-0.8 to 0.8) for each test]

In [15]:
## save the plot
import toyplot.pdf
canvas, axes, mark = b.plot(height=350, pct_tree_x=0.4, pct_tree_y=0.2)
toyplot.pdf.render(canvas, "/home/deren/workshop/abba-baba.pdf")

## save the results table
b.results_table.to_csv("~/workshop/abba-baba.csv", sep="\t")

Species tree inference by phylogenetic invariants

The program tetrad follows the approach of SVDquartets: it infers all possible quartet trees from a large SNP alignment and then uses the Quartet MaxCut program (Snir et al. 2012) to join the quartets into a species tree.


In [16]:
## create a tetrad class object
tet = ipa.tetrad(
    name=data.name,
    seqfile=data.outfiles.snpsphy,
    mapfile=data.outfiles.snpsmap,
    workdir="~/workshop/analysis-tetrad",
    nboots=100
    )

## run the analysis
tet.run(ipyclient)


loading seq array [12 taxa x 4314 bp]
max unlinked SNPs per quartet (nloci): 984
host compute node: [4 cores] on oud
inferring 495 induced quartet trees
[####################] 100%  initial tree | 0:00:00 |  
[####################] 100%  boot 100     | 0:00:40 |  

Access result tetrad trees and draw


In [17]:
tet.trees


Out[17]:
boots                /home/deren/workshop/analysis-tetrad/simdata.boots
cons                 /home/deren/workshop/analysis-tetrad/simdata.cons
nhx                  /home/deren/workshop/analysis-tetrad/simdata.nhx
tree                 /home/deren/workshop/analysis-tetrad/simdata.tree

In [63]:
## load unrooted result tree with toytree and draw
tre = toytree.tree(tet.trees.cons)
tre.draw(
    node_labels=tre.get_node_values("support"),
    node_size=20,
    );


[unrooted consensus tree: tips 1D_0 through 3J_0, with 100% bootstrap support on all resolved nodes]
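The consensus tree is unrooted; as with the RAxML tree above, it can be rooted before drawing:


In [ ]:
## root on the '3' clade and redraw
tre.root(wildcard="3")
tre.draw(
    node_labels=tre.get_node_values("support"),
    node_size=20,
    );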

Infer a species tree with BPP


In [ ]:
# conda install bpp -c ipyrad

In [19]:
## setup: define how samples group into 'species'
IMAP = {
    "1": ["1A_0", "1B_0", "1C_0"],
    "D": ["1D_0"],
    "2": ["2F_0", "2E_0", "2G_0"],
    "H": ["2H_0"],
    "3": ["3J_0", "3I_0", "3K_0"],
    "L": ["3L_0"],    
}

## setup: define a guidetree
GUIDE = "(((1,D),(2,H)),(3,L));"

## init a bpp object 
bpp = ipa.bpp(
    locifile=data.outfiles.loci, 
    imap=IMAP, 
    guidetree=GUIDE,
    workdir="~/workshop/analysis-bpp"
    );

## submit jobs to run on the cluster
bpp.submit_bpp_jobs("A00", nreps=2, ipyclient=ipyclient)


submitted 2 bpp jobs [A00] (1000 loci)

Set parameters and filters

You can define all of the parameter settings that will be written to the BPP .ctl file by modifying settings in the .params attribute. Similarly, you can control which loci are included in the analysis through the .filters attribute.


In [20]:
## set some parameters 
bpp.params.burnin = 1000
bpp.params.nsample = 5000
bpp.params.infer_sptree = 1
bpp.params.infer_delimit = 0 

## set some filters
bpp.filters.maxloci = 200
bpp.filters.minsnps = 2

## submit jobs to run on the cluster
bpp.submit_bpp_jobs("A00", nreps=2, ipyclient=ipyclient)


submitted 2 bpp jobs [A00] (200 loci)

Track running jobs

Unlike some of the other ipyrad.analysis tools, the bpp object does not "block" while jobs are running: after it sends jobs to the cluster you can continue to interact with the notebook. This is useful because BPP is not multi-threaded, so you will likely want to submit many different jobs at once. You can check on running jobs as shown below.


In [21]:
## a list of submitted jobs
print bpp.asyncs

## a list of result files produced by jobs
print bpp.files


[<AsyncResult: _call_bpp>, <AsyncResult: _call_bpp>, <AsyncResult: _call_bpp>, <AsyncResult: _call_bpp>]
locifile    ~/workshop/simdata_outfiles/simdata.loci
mcmcfiles   ['~/workshop/analysis-bpp/A00-r0.mcmc.txt', '~/workshop/analysis-bpp/A00-r1.mcmc.txt']
outfiles    ['~/workshop/analysis-bpp/A00-r0.out.txt', '~/workshop/analysis-bpp/A00-r1.out.txt']
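The entries in .asyncs are standard ipyparallel AsyncResult objects, so you can poll or block on them directly:


In [ ]:
## check (without blocking) whether each job has finished
print [job.ready() for job in bpp.asyncs]

## or block until all submitted jobs are done
for job in bpp.asyncs:
    job.wait()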

Structure analyses


In [1]:
import ipyrad as ip
import ipyrad.analysis as ipa
import ipyparallel as ipp

data = ip.load_json("/home/deren/workshop/simdata.json")
ipyclient = ipp.Client()


  loading Assembly: simdata
  from saved path: ~/workshop/simdata.json

In [2]:
# conda install structure -c ipyrad
# conda install clumpp -c ipyrad

In [3]:
## create a structure class object
s = ipa.structure(
    name=data.name,
    strfile=data.outfiles.str,
    mapfile=data.outfiles.snpsmap,
    workdir="~/workshop/analysis-structure",    
    );

s.mainparams.burnin = 100
s.mainparams.numreps = 1000

## submit jobs to run on the cluster
for kpop in [2, 3, 4, 5]:
    s.submit_structure_jobs(kpop=kpop, nreps=5, ipyclient=ipyclient)


['simdata', '/home/deren/workshop/analysis-structure', 572500868, 12, 984, 2, 0]
['simdata', '/home/deren/workshop/analysis-structure', 709458039, 12, 984, 2, 1]
['simdata', '/home/deren/workshop/analysis-structure', 292133950, 12, 984, 2, 2]
['simdata', '/home/deren/workshop/analysis-structure', 316688770, 12, 984, 2, 3]
['simdata', '/home/deren/workshop/analysis-structure', 265959297, 12, 984, 2, 4]
submitted 5 structure jobs [simdata-K-2]
['simdata', '/home/deren/workshop/analysis-structure', 572500868, 12, 984, 3, 0]
['simdata', '/home/deren/workshop/analysis-structure', 709458039, 12, 984, 3, 1]
['simdata', '/home/deren/workshop/analysis-structure', 292133950, 12, 984, 3, 2]
['simdata', '/home/deren/workshop/analysis-structure', 316688770, 12, 984, 3, 3]
['simdata', '/home/deren/workshop/analysis-structure', 265959297, 12, 984, 3, 4]
submitted 5 structure jobs [simdata-K-3]
['simdata', '/home/deren/workshop/analysis-structure', 572500868, 12, 984, 4, 0]
['simdata', '/home/deren/workshop/analysis-structure', 709458039, 12, 984, 4, 1]
['simdata', '/home/deren/workshop/analysis-structure', 292133950, 12, 984, 4, 2]
['simdata', '/home/deren/workshop/analysis-structure', 316688770, 12, 984, 4, 3]
['simdata', '/home/deren/workshop/analysis-structure', 265959297, 12, 984, 4, 4]
submitted 5 structure jobs [simdata-K-4]
['simdata', '/home/deren/workshop/analysis-structure', 572500868, 12, 984, 5, 0]
['simdata', '/home/deren/workshop/analysis-structure', 709458039, 12, 984, 5, 1]
['simdata', '/home/deren/workshop/analysis-structure', 292133950, 12, 984, 5, 2]
['simdata', '/home/deren/workshop/analysis-structure', 316688770, 12, 984, 5, 3]
['simdata', '/home/deren/workshop/analysis-structure', 265959297, 12, 984, 5, 4]
submitted 5 structure jobs [simdata-K-5]

Modify parameter settings


In [5]:
s.mainparams.burnin = 10000
s.mainparams.numreps = 100000
s.extraparams.usepopinfo = 0

Summarize results with CLUMPP


In [6]:
## get results for a single K value
s.get_clumpp_table(3)

## make a dict for all results
tables = {}
for kpop in [2, 3, 4, 5]:
    tables[kpop] = s.get_clumpp_table(kpop)


[]
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-6-42ce0b7e0eb0> in <module>()
      1 ## get results for a single K value
----> 2 s.get_clumpp_table(3)
...
OSError: [Errno 2] No such file or directory

The OSError here means the CLUMPP executable could not be found when get_clumpp_table() tried to launch it as a subprocess; make sure CLUMPP is installed (e.g., with the conda command above) and available on your PATH before summarizing results.
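Once CLUMPP is available and the tables load, the ancestry proportions can be plotted with toyplot. A minimal sketch, assuming each table is a samples x K data frame of admixture proportions:


In [ ]:
## sketch: stacked barplot of ancestry proportions for K=3
import toyplot
table = tables[3]
c, a, m = toyplot.bars(
    table.values,
    height=250, width=500)
a.x.ticks.locator = toyplot.locator.Explicit(
    locations=range(len(table)),
    labels=table.index)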
