The ipyrad.analysis tool kit

Deren Eaton

Install software

All required software for this walkthrough is available on conda.


In [1]:
# conda install -c ipyrad ipyrad structure clumpp bpp 
# conda install -c eaton-lab toytree toyplot
# conda install -c bioconda raxml
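A quick sanity check is to import the installed packages and print their versions (a minimal sketch; version numbers will vary):


In [ ]:
## imports should succeed without error after installation
import ipyrad
import toytree
print ipyrad.__version__
print toytree.__version__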

Start an ipyparallel cluster

In a separate terminal, run the following command to start a cluster of engines. If you are working in a notebook that is running remotely, use the Jupyter dashboard to open a new terminal.


In [49]:
# ipcluster start --n=4

You should then be able to connect to the engines in your notebook:


In [10]:
## connect to the cluster
import ipyparallel as ipp
ipyclient = ipp.Client()

## print number of engines
print len(ipyclient), "connected engines"


4 connected engines
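The engines can also be used directly through the standard ipyparallel API; for example, a quick check that every engine responds:


In [ ]:
## ask each engine to report its process ID
import os
print ipyclient[:].apply_sync(os.getpid)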

Assemble a RAD data set

The code below assembles the example simulated data set from the ipyrad tutorial.


In [ ]:
## import ipyrad
import ipyrad as ip

Minimal workflow: scroll down for details.


In [ ]:
## create an Assembly object
data = ip.Assembly("simdata")

## set I/O paths for the data
data.set_params("project_dir", "~/workshop")
data.set_params("raw_fastq_path", "ipsimdata/rad_example_R1_.fastq.gz")
data.set_params("barcodes_path", "ipsimdata/rad_example_barcodes.txt")

## run all steps of the Assembly
data.run("1234567")

Modify more parameters


In [3]:
## set params
data.set_params("filter_adapters", 2)
data.set_params("output_formats", "lpask")

## show params
data.get_params()


  0   assembly_name               simdata                                      
  1   project_dir                 /home/deren/workshop                         
  2   raw_fastq_path              ./ipsimdata/rad_example_R1_.fastq.gz         
  3   barcodes_path               ./ipsimdata/rad_example_barcodes.txt         
  4   sorted_fastq_path                                                        
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    rad                                          
  8   restriction_overhang        ('TGCAG', '')                                
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                                            
  12  mindepth_majrule            6                                            
  13  maxdepth                    10000                                        
  14  clust_threshold             0.85                                         
  15  max_barcode_mismatch        0                                            
  16  filter_adapters             2                                            
  17  filter_min_trim_len         35                                           
  18  max_alleles_consens         2                                            
  19  max_Ns_consens              (5, 5)                                       
  20  max_Hs_consens              (8, 8)                                       
  21  min_samples_locus           4                                            
  22  max_SNPs_locus              (20, 20)                                     
  23  max_Indels_locus            (8, 8)                                       
  24  max_shared_Hs_locus         0.5                                          
  25  trim_reads                  (0, 0, 0, 0)                                 
  26  trim_loci                   (0, 0, 0, 0)                                 
  27  output_formats              ('l', 'p', 'a', 's', 'k')                    
  28  pop_assign_file                                                          

Assemble the data set

You can run one or more steps just like in the CLI.
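For example, a sketch of running the same assembly in two stages instead of all at once:


In [ ]:
## demultiplex and filter first...
data.run("12")

## ...then cluster, call consensus, and write outputs later
data.run("34567")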


In [4]:
## run all steps of assembly
data.run("1234567")


  Assembly: simdata
  [####################] 100%  sorting reads         | 0:00:03 | s1 | 
  [####################] 100%  writing/compressing   | 0:00:01 | s1 | 
  [####################] 100%  processing reads      | 0:00:03 | s2 | 
  [####################] 100%  dereplicating         | 0:00:00 | s3 | 
  [####################] 100%  clustering            | 0:00:01 | s3 | 
  [####################] 100%  building clusters     | 0:00:00 | s3 | 
  [####################] 100%  chunking              | 0:00:00 | s3 | 
  [####################] 100%  aligning              | 0:00:08 | s3 | 
  [####################] 100%  concatenating         | 0:00:00 | s3 | 
  [####################] 100%  inferring [H, E]      | 0:00:03 | s4 | 
  [####################] 100%  calculating depths    | 0:00:00 | s5 | 
  [####################] 100%  chunking clusters     | 0:00:00 | s5 | 
  [####################] 100%  consens calling       | 0:00:23 | s5 | 
  [####################] 100%  concat/shuffle input  | 0:00:00 | s6 | 
  [####################] 100%  clustering across     | 0:00:00 | s6 | 
  [####################] 100%  building clusters     | 0:00:00 | s6 | 
  [####################] 100%  aligning clusters     | 0:00:04 | s6 | 
  [####################] 100%  database indels       | 0:00:00 | s6 | 
  [####################] 100%  indexing clusters     | 0:00:01 | s6 | 
  [####################] 100%  building database     | 0:00:00 | s6 | 
  [####################] 100%  filtering loci        | 0:00:06 | s7 | 
  [####################] 100%  building loci/stats   | 0:00:00 | s7 | 
  [####################] 100%  building arrays       | 0:00:00 | s7 | 
  [####################] 100%  writing outfiles      | 0:00:00 | s7 | 
  Outfiles written to: ~/workshop/simdata_outfiles

Access assembly results

You can easily access summary stats for the assembly as a data frame.


In [5]:
## summary stats
data.stats


Out[5]:
      state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens
1A_0      6      19862                19862            1000              1000    0.001852   0.000758           1000
1B_0      6      20043                20043            1000              1000    0.001900   0.000752           1000
1C_0      6      20136                20136            1000              1000    0.002084   0.000745           1000
1D_0      6      19966                19966            1000              1000    0.001803   0.000754           1000
2E_0      6      20017                20017            1000              1000    0.001831   0.000765           1000
2F_0      6      19933                19933            1000              1000    0.001996   0.000755           1000
2G_0      6      20030                20030            1000              1000    0.001940   0.000763           1000
2H_0      6      20199                20198            1000              1000    0.001747   0.000756           1000
3I_0      6      19885                19885            1000              1000    0.001818   0.000755           1000
3J_0      6      19822                19822            1000              1000    0.001968   0.000783           1000
3K_0      6      19965                19965            1000              1000    0.002072   0.000760           1000
3L_0      6      20008                20008            1000              1000    0.002042   0.000748           1000
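Fuller per-step statistics are also available as data frames through the .stats_dfs attribute; for example, the step 3 clustering stats:


In [ ]:
## data frame of step 3 (clustering) statistics
data.stats_dfs.s3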

Plot statistics


In [35]:
import toyplot

## plot barplot
c, a, m = toyplot.bars(
    data.stats.hetero_est, 
    height=250, width=500,
    )

## style the axes
a.x.ticks.locator = toyplot.locator.Explicit(
    locations=range(len(data.stats)),
    labels=data.stats.index)
a.y.label.text = "Heterozygosity"
a.y.ticks.show = True


[barplot: per-sample heterozygosity estimates for samples 1A_0 through 3L_0; y-axis "Heterozygosity", 0.000 to 0.002]
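The canvas returned by toyplot can also be rendered to a file, for example as a standalone HTML figure (the PDF backend is used the same way in the abba-baba section below):


In [ ]:
## save the barplot canvas to an HTML file
import toyplot.html
toyplot.html.render(c, "heterozygosity.html")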

Access result files

You can also access the stats files for each step, and the output files for downstream analyses.


In [27]:
## s2 stats file
print data.stats_files.s2

## the .loci file location
print data.outfiles.loci


/home/deren/workshop/simdata_edits/s2_rawedit_stats.txt
/home/deren/workshop/simdata_outfiles/simdata.loci

ipyrad.analysis tools

The ipyrad.analysis module includes wrapper tools that make it easy to run many evolutionary analysis programs efficiently from within a notebook.


In [1]:
## import the toolkit
import ipyrad.analysis as ipa

RAxML analysis

Simply enter the location of the phylip file, which can be accessed from the .outfiles attribute of the Assembly object. You can also provide a name and output directory, and set many other optional parameters.


In [2]:
import ipyrad as ip
import ipyparallel as ipp

data = ip.load_json("/home/deren/workshop/simdata.json")
ipyclient = ipp.Client()


  loading Assembly: simdata
  from saved path: ~/workshop/simdata.json

Minimal workflow: scroll down for details.


In [3]:
## create a raxml object
s = ipa.raxml(
        name=data.name,
        phyfile=data.outfiles.phy, 
        workdir="~/workshop/analysis-raxml");

## run the analysis
s.run()


Error: set a new name for this job:
File exists: /home/deren/workshop/analysis-raxml/RAxML_info.simdata

This error is expected when results from a previous run already exist under the same name; either set a new name or overwrite the old results with run(force=True), as shown below.

Modify parameters and other functions


In [4]:
## modify params
s.params.T = 4
s.params.N = 100

In [5]:
## print the raxml command as a string
print s.command


raxmlHPC-PTHREADS-SSE3 -f a -T 4 -m GTRGAMMA -N 100 -x 12345 -p 54321 -n simdata -w /home/deren/workshop/analysis-raxml -s /home/deren/workshop/simdata_outfiles/simdata.phy

In [6]:
## overwrite existing result with this 'name'
s.run(force=True)


job simdata finished successfully

Access the tree files and plot


In [7]:
print s.trees


bestTree                   ~/workshop/analysis-raxml/RAxML_bestTree.simdata
bipartitions               ~/workshop/analysis-raxml/RAxML_bipartitions.simdata
bipartitionsBranchLabels   ~/workshop/analysis-raxml/RAxML_bipartitionsBranchLabels.simdata
boostrap                   ~/workshop/analysis-raxml/RAxML_bootstrap.simdata
info                       ~/workshop/analysis-raxml/RAxML_info.simdata


In [8]:
import toytree
tre = toytree.tree(s.trees.bipartitions)
tre.root(wildcard='3')
tre.draw(
    width=300,
    node_labels=tre.get_node_values("support"),
    node_size=20,
    );


[rooted tree: tips 3L_0 through 1A_0, with 100% bootstrap support shown on all internal nodes]

introgression (abba-baba) analysis

The baba object can be used to set up abba-baba tests, to calculate results, and to generate plots to visualize them.

Minimal example, scroll down for details.


In [9]:
## create a baba object
b = ipa.baba(data=data.outfiles.loci)

## generate tests given the rooted tree
b.tests = [
    {"p4":["3L_0"], 
     "p3":["2F_0"], 
     "p2":["1D_0"],
     "p1":["1A_0"]}]

## run jobs distributed across the cluster
b.run(ipyclient)
b.results_table


[####################] 100%  calculating D-stats  | 0:00:00 |  
Out[9]:
   dstat  bootmean  bootstd     Z  ABBA  BABA  nloci
0 -0.098    -0.098    0.228  0.43  11.5  14.0   1000
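A taxon in a test can also be represented by several samples, in which case their allele frequencies are pooled when counting site patterns. A sketch using the same API:


In [ ]:
## the same test with multiple samples per lineage
b.tests = [
    {"p4": ["3L_0", "3K_0"],
     "p3": ["2F_0", "2E_0"],
     "p2": ["1D_0", "1C_0"],
     "p1": ["1A_0", "1B_0"]}]
b.run(ipyclient)
b.results_table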

Auto-generate tests

Instead of writing out many tests explicitly, you can enter a rooted tree to the baba object and use the generate_tests_from_tree() function to auto-generate four-taxon tests that fit the tree and any constraints.


In [10]:
## init baba object
b = ipa.baba(data=data.outfiles.loci, newick=tre)

## generate all possible tests on this tree
b.generate_tests_from_tree()

## set constraints on tests
cdict = {"p4": ["3L_0"], 
         "p3": ["2E_0", "2F_0"], 
         "p2": ["1D_0"]}

## generate a constrained set of tests
b.generate_tests_from_tree(
    constraint_dict=cdict,
    constraint_exact=False,
    )


674 tests generated from tree
12 tests generated from tree

Run all tests linked to a baba object


In [11]:
## run the tests (in this case 12) linked to the baba object
b.run(ipyclient)

## show results table
b.results_table


[####################] 100%  calculating D-stats  | 0:00:03 |  
Out[11]:
     dstat  bootmean  bootstd      Z  ABBA    BABA  nloci
0   -0.181    -0.186    0.217  0.836  11.0  15.875   1000
1   -0.030    -0.037    0.213  0.143  13.0  13.812   1000
2   -0.005    -0.004    0.216  0.021  13.5  13.625   1000
3   -0.057    -0.060    0.222  0.255  12.5  14.000   1000
4   -0.135    -0.128    0.224  0.603  12.0  15.750   1000
5   -0.004    -0.007    0.208  0.021  14.0  14.125   1000
6    0.009     0.007    0.211  0.041  14.5  14.250   1000
7   -0.018    -0.009    0.227  0.080  13.5  14.000   1000
8   -0.231    -0.226    0.224  1.030  10.0  16.000   1000
9   -0.059    -0.076    0.211  0.279  12.0  13.500   1000
10  -0.020    -0.009    0.220  0.089  12.5  13.000   1000
11  -0.098    -0.095    0.223  0.440  11.5  14.000   1000

Plot results


In [12]:
b.plot(
    height=350,
    pct_tree_x = 0.4,
    pct_tree_y = 0.2,
    );


[plot: the rooted tree with the 12 tests drawn below it; bars show Z-scores (0 to 3.0) and D-statistics (-0.8 to 0.8) for each test]

In [15]:
## save the plot
import toyplot.pdf
canvas, axes, mark = b.plot(height=350, pct_tree_x=0.4, pct_tree_y=0.2)
toyplot.pdf.render(canvas, "/home/deren/workshop/abba-baba.pdf")

## save the results table
b.results_table.to_csv("~/workshop/abba-baba.csv", sep="\t")

Species tree inference by phylogenetic invariants

The program tetrad follows the approach of SVDquartets: it infers all possible quartet trees from a large SNP alignment and then uses the Quartet MaxCut program (Snir et al. 2012) to join the quartets into a species tree.


In [16]:
## create a tetrad class object
tet = ipa.tetrad(
    name=data.name,
    seqfile=data.outfiles.snpsphy,
    mapfile=data.outfiles.snpsmap,
    workdir="~/workshop/analysis-tetrad",
    nboots=100
    )

## run the analysis
tet.run(ipyclient)


loading seq array [12 taxa x 4314 bp]
max unlinked SNPs per quartet (nloci): 984
host compute node: [4 cores] on oud
inferring 495 induced quartet trees
[####################] 100%  initial tree | 0:00:00 |  
[####################] 100%  boot 100     | 0:00:40 |  

Access result tetrad trees and draw


In [17]:
tet.trees


Out[17]:
boots                /home/deren/workshop/analysis-tetrad/simdata.boots
cons                 /home/deren/workshop/analysis-tetrad/simdata.cons
nhx                  /home/deren/workshop/analysis-tetrad/simdata.nhx
tree                 /home/deren/workshop/analysis-tetrad/simdata.tree

In [63]:
## load unrooted result tree with toytree and draw
tre = toytree.tree(tet.trees.cons)
tre.draw(
    node_labels=tre.get_node_values("support"),
    node_size=20,
    );


[unrooted consensus tree: tips 1D_0 through 3J_0, with 100% bootstrap support on all resolved nodes]
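The consensus tree is unrooted; as with the RAxML tree above, it can be rooted before drawing:


In [ ]:
## root on the '3' clade and redraw
tre.root(wildcard="3")
tre.draw(
    node_labels=tre.get_node_values("support"),
    node_size=20,
    );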

Infer a species tree with BPP


In [ ]:
# conda install bpp -c ipyrad

In [19]:
## setup: define how samples group into 'species'
IMAP = {
    "1": ["1A_0", "1B_0", "1C_0"],
    "D": ["1D_0"],
    "2": ["2F_0", "2E_0", "2G_0"],
    "H": ["2H_0"],
    "3": ["3J_0", "3I_0", "3K_0"],
    "L": ["3L_0"],    
}

## setup: define a guidetree
GUIDE = "(((1,D),(2,H)),(3,L));"

## init a bpp object 
bpp = ipa.bpp(
    locifile=data.outfiles.loci, 
    imap=IMAP, 
    guidetree=GUIDE,
    workdir="~/workshop/analysis-bpp"
    );

## submit jobs to run on the cluster
bpp.submit_bpp_jobs("A00", nreps=2, ipyclient=ipyclient)


submitted 2 bpp jobs [A00] (1000 loci)

Set parameters and filters

You can define all of the parameter settings that will be written to the BPP .ctl file by modifying settings in the .params attribute. Similarly, you can control which loci are included in the analysis through the .filters attribute.


In [20]:
## set some parameters 
bpp.params.burnin = 1000
bpp.params.nsample = 5000
bpp.params.infer_sptree = 1
bpp.params.infer_delimit = 0 

## set some filters
bpp.filters.maxloci = 200
bpp.filters.minsnps = 2

## submit jobs to run on the cluster
bpp.submit_bpp_jobs("A00", nreps=2, ipyclient=ipyclient)


submitted 2 bpp jobs [A00] (200 loci)

Track running jobs

Unlike some of the other ipyrad.analysis tools, the bpp object does not "block" while jobs are running: after it sends jobs to the cluster you can continue to interact with the notebook. This is useful because BPP is not multi-threaded, so you will likely want to submit many different jobs at once. You can check on running jobs as shown below.


In [21]:
## a list of submitted jobs
print bpp.asyncs

## a list of result files produced by jobs
print bpp.files


[<AsyncResult: _call_bpp>, <AsyncResult: _call_bpp>, <AsyncResult: _call_bpp>, <AsyncResult: _call_bpp>]
locifile    ~/workshop/simdata_outfiles/simdata.loci
mcmcfiles   ['~/workshop/analysis-bpp/A00-r0.mcmc.txt', '~/workshop/analysis-bpp/A00-r1.mcmc.txt']
outfiles    ['~/workshop/analysis-bpp/A00-r0.out.txt', '~/workshop/analysis-bpp/A00-r1.out.txt']
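The entries in .asyncs are standard ipyparallel AsyncResult objects, so you can poll or block on them directly:


In [ ]:
## check (without blocking) whether each job has finished
print [job.ready() for job in bpp.asyncs]

## or block until all submitted jobs are done
for job in bpp.asyncs:
    job.wait()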

Structure analyses


In [1]:
import ipyrad as ip
import ipyrad.analysis as ipa
import ipyparallel as ipp

data = ip.load_json("/home/deren/workshop/simdata.json")
ipyclient = ipp.Client()


  loading Assembly: simdata
  from saved path: ~/workshop/simdata.json

In [2]:
# conda install structure -c ipyrad
# conda install clumpp -c ipyrad

In [3]:
## create a structure class object
s = ipa.structure(
    name=data.name,
    strfile=data.outfiles.str,
    mapfile=data.outfiles.snpsmap,
    workdir="~/workshop/analysis-structure",    
    );

s.mainparams.burnin = 100
s.mainparams.numreps = 1000

## submit jobs to run on the cluster
for kpop in [2, 3, 4, 5]:
    s.submit_structure_jobs(kpop=kpop, nreps=5, ipyclient=ipyclient)


['simdata', '/home/deren/workshop/analysis-structure', 572500868, 12, 984, 2, 0]
['simdata', '/home/deren/workshop/analysis-structure', 709458039, 12, 984, 2, 1]
['simdata', '/home/deren/workshop/analysis-structure', 292133950, 12, 984, 2, 2]
['simdata', '/home/deren/workshop/analysis-structure', 316688770, 12, 984, 2, 3]
['simdata', '/home/deren/workshop/analysis-structure', 265959297, 12, 984, 2, 4]
submitted 5 structure jobs [simdata-K-2]
['simdata', '/home/deren/workshop/analysis-structure', 572500868, 12, 984, 3, 0]
['simdata', '/home/deren/workshop/analysis-structure', 709458039, 12, 984, 3, 1]
['simdata', '/home/deren/workshop/analysis-structure', 292133950, 12, 984, 3, 2]
['simdata', '/home/deren/workshop/analysis-structure', 316688770, 12, 984, 3, 3]
['simdata', '/home/deren/workshop/analysis-structure', 265959297, 12, 984, 3, 4]
submitted 5 structure jobs [simdata-K-3]
['simdata', '/home/deren/workshop/analysis-structure', 572500868, 12, 984, 4, 0]
['simdata', '/home/deren/workshop/analysis-structure', 709458039, 12, 984, 4, 1]
['simdata', '/home/deren/workshop/analysis-structure', 292133950, 12, 984, 4, 2]
['simdata', '/home/deren/workshop/analysis-structure', 316688770, 12, 984, 4, 3]
['simdata', '/home/deren/workshop/analysis-structure', 265959297, 12, 984, 4, 4]
submitted 5 structure jobs [simdata-K-4]
['simdata', '/home/deren/workshop/analysis-structure', 572500868, 12, 984, 5, 0]
['simdata', '/home/deren/workshop/analysis-structure', 709458039, 12, 984, 5, 1]
['simdata', '/home/deren/workshop/analysis-structure', 292133950, 12, 984, 5, 2]
['simdata', '/home/deren/workshop/analysis-structure', 316688770, 12, 984, 5, 3]
['simdata', '/home/deren/workshop/analysis-structure', 265959297, 12, 984, 5, 4]
submitted 5 structure jobs [simdata-K-5]

Modify parameter settings


In [5]:
s.mainparams.burnin = 10000
s.mainparams.numreps = 100000
s.extraparams.usepopinfo = 0

Summarize results with CLUMPP


In [6]:
## get results for a single K value
s.get_clumpp_table(3)

## make a dict for all results
tables = {}
for kpop in [2, 3, 4, 5]:
    tables[kpop] = s.get_clumpp_table(kpop)


[]
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-6-42ce0b7e0eb0> in <module>()
      1 ## get results for a single K value
----> 2 s.get_clumpp_table(3)
...
OSError: [Errno 2] No such file or directory

The OSError here means the CLUMPP executable could not be found when get_clumpp_table() tried to launch it as a subprocess; make sure CLUMPP is installed (e.g., with the conda command above) and available on your PATH before summarizing results.
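Once CLUMPP is available and the tables load, the ancestry proportions can be plotted with toyplot. A minimal sketch, assuming each table is a samples x K data frame of admixture proportions:


In [ ]:
## sketch: stacked barplot of ancestry proportions for K=3
import toyplot
table = tables[3]
c, a, m = toyplot.bars(
    table.values,
    height=250, width=500)
a.x.ticks.locator = toyplot.locator.Explicit(
    locations=range(len(table)),
    labels=table.index)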
