DaCosta & Sorenson (2016) single-end dd-RAD data set

Here we demonstrate a de novo assembly for an empirical RAD data set using the ipyrad Python API. This example was run on a workstation with 20 cores, where it takes roughly 10 minutes or less to run completely, but it can also be run on a laptop in under an hour.

We will use the Lagonosticta and Vidua data set from DaCosta & Sorenson 2016. This data set is composed of single-end 101bp reads from a ddRAD-seq library prepared with the SbfI and EcoRI enzymes and is available on NCBI by its study accession SRP059199. At the end of this notebook we also demonstrate the use of ipyrad.analysis tools to run downstream analyses on this data set.

The figure below from this paper shows the general workflow in which two fairly distinct clades were sequenced together but then analyzed separately.

Setup (software and data files)

If you haven't done so yet, start by installing ipyrad using conda (see ipyrad installation instructions) as well as the packages in the cell below. This is easiest to do in a terminal. Then open a jupyter-notebook, like this one, and follow along with the tutorial by copying and executing the code in the cells, and adding your own documentation between them using markdown. Feel free to modify parameters to see their effects on the downstream results.



In [1]:

    
## conda install ipyrad -c ipyrad
## conda install toytree -c eaton-lab
## conda install entrez-direct -c bioconda
## conda install sratools -c bioconda



In [8]:

    
## imports
import ipyrad as ip
import ipyrad.analysis as ipa
import ipyparallel as ipp

In contrast to the ipyrad CLI, the ipyrad API gives users much more fine-scale control over the parallelization of their analysis, but this also requires learning a little bit about the library that we use to do this, called ipyparallel. This library is designed for use with jupyter-notebooks to allow massive-scale multi-processing while working interactively.

Understanding the nuts and bolts of it might take a little while, but it is fairly easy to get started using it, especially in the way it is integrated with ipyrad. To start a parallel client you must first run the command-line program 'ipcluster'. This starts a number of independent Python processes (kernels) to which we can then send bits of work. The cluster can be stopped and restarted independently of this notebook, which is convenient when working on a cluster where many cores are not always immediately available.

Open a terminal and type the following command to start an ipcluster instance with N engines.



In [9]:

    
## ipcluster start --n=20
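If you would rather not hard-code the number of engines, the command above can be assembled from the core count that Python reports. This is only a sketch: `ipcluster_command` is a hypothetical helper (it is not part of ipyrad or ipyparallel), and it only builds the command string, which you would still run in a terminal yourself.

```python
import os


def ipcluster_command(n_engines=None):
    """Return the terminal command that starts an ipcluster with n engines.

    Hypothetical helper for illustration only. When n_engines is None it
    falls back to os.cpu_count(), or to 1 if the count is unavailable.
    """
    if n_engines is None:
        n_engines = os.cpu_count() or 1
    if n_engines < 1:
        raise ValueError("n_engines must be at least 1")
    return "ipcluster start --n={}".format(n_engines)


# the command used in this notebook
print(ipcluster_command(20))
```

Using fewer engines than cores leaves some capacity free for the notebook itself and for other users of a shared machine.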



In [11]:

    
## connect to cluster
ipyclient = ipp.Client()
ipyclient.ids









    Out[11]:





[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Download the data set (Finches)

These data are archived on the NCBI Sequence Read Archive (SRA) under accession SRP059199. The code below downloads the fastq data files directly from the SRA using the ipyrad.analysis sratools wrapper and saves them into a directory called fastqs-Finches/.



In [3]:

    
## download the Finches data set from NCBI
sra = ipa.sratools(accession="SRP059199", workdir="fastqs-Finches")
sra.run(force=True, ipyclient=ipyclient)









    



[####################] 100%  Downloading fastq files | 0:00:52 |  
24 fastq files downloaded to /home/deren/Documents/ipyrad/tests/fastqs-Finches
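Before moving on it can be worth confirming that the expected number of files arrived. The sketch below counts files matching a glob pattern with the standard library; `count_fastqs` is a hypothetical helper, and the demonstration runs on a throwaway directory with made-up file names rather than on the real download.

```python
import tempfile
from pathlib import Path


def count_fastqs(workdir, pattern="*.fastq.gz"):
    """Count the files directly inside workdir that match the glob pattern."""
    return sum(1 for _ in Path(workdir).glob(pattern))


# Demonstration on a temporary directory (file names are invented).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sampleA.fastq.gz", "sampleB.fastq.gz", "notes.txt"):
        (Path(tmp) / name).touch()
    n_found = count_fastqs(tmp)

print(n_found)
```

For this tutorial you would call `count_fastqs("fastqs-Finches")` and compare the result with the 24 files reported in the download message above, assuming the files carry the `.fastq.gz` suffix used later in `sorted_fastq_path`.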

Create an Assembly object

This object stores the parameters of the assembly and the organization of data files.



In [4]:

    
## you must provide a name for the Assembly
data = ip.Assembly("Finches")









    



New Assembly: Finches

Set parameters for the Assembly. This will raise an error if any of the parameters are not allowed because they are the wrong type, or out of the allowed range.



In [6]:

    
## set parameters
data.set_params("project_dir", "analysis-ipyrad/Finches")
data.set_params("sorted_fastq_path", "fastqs-Finches/*.fastq.gz")
data.set_params("datatype", "ddrad")
data.set_params("restriction_overhang", ("CCTGCAGG", "AATTC"))
data.set_params("clust_threshold", "0.85")
data.set_params("filter_adapters", "2")
data.set_params("max_Hs_consens", (5, 5))
data.set_params("output_formats", "psvnkua")

## see/print all parameters
data.get_params()









    



0   assembly_name               Finches                                      
1   project_dir                 ./analysis-ipyrad/Finches                    
2   raw_fastq_path                                                           
3   barcodes_path                                                            
4   sorted_fastq_path           ./fastqs-Finches/*.fastq.gz                  
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    ddrad                                        
8   restriction_overhang        ('CCTGCAGG', 'AATTC')                        
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6                                            
13  maxdepth                    10000                                        
14  clust_threshold             0.85                                         
15  max_barcode_mismatch        0                                            
16  filter_adapters             2                                            
17  filter_min_trim_len         35                                           
18  max_alleles_consens         2                                            
19  max_Ns_consens              (5, 5)                                       
20  max_Hs_consens              (5, 5)                                       
21  min_samples_locus           4                                            
22  max_SNPs_locus              (20, 20)                                     
23  max_Indels_locus            (8, 8)                                       
24  max_shared_Hs_locus         0.5                                          
25  trim_reads                  (0, 0, 0, 0)                                 
26  trim_loci                   (0, 0, 0, 0)                                 
27  output_formats              ('p', 's', 'v', 'n', 'k', 'u')               
28  pop_assign_file

Assemble the data set



In [7]:

    
## run steps 1 & 2 of the assembly
data.run("12")









    



Assembly: Finches
[####################] 100%  loading reads         | 0:00:03 | s1 | 
[####################] 100%  processing reads      | 0:01:08 | s2 |



In [8]:

    
## access the stats of the assembly (so far) from the .stats attribute
data.stats









    Out[8]:






  
    
      
                                  state  reads_raw  reads_passed_filter
Anomalospiza_imberbis                 2     935241               889028
Clytospiza_monteiri                   2    1220879              1168704
Lagonosticta_larvata                  2    1001743               944653
Lagonosticta_rara                     2     992534               934386
Lagonosticta_rhodopareia              2    1020850               961277
Lagonosticta_rubricata_congica        2    1064587               997387
Lagonosticta_rubricata_rubricata      2    1079701              1009532
Lagonosticta_rufopicta                2     898117               846834
Lagonosticta_sanguinodorsalis         2    1034739               980499
Lagonosticta_senegala_rendalli        2     834688               786701
Lagonosticta_senegala_rhodopsis       2     792799               749508
Vidua_chalybeata_amauropteryx         2    1031674              1028936
Vidua_chalybeata_neumanni             2     709824               678382
Vidua_fischeri                        2     998161               935631
Vidua_hypocherina                     2     889818               844395
Vidua_interjecta                      2     741675               708862
Vidua_macroura_arenosa                2     939649               891987
Vidua_macroura_macroura               2     729322               690496
Vidua_obtusa                          2     809186               763098
Vidua_orientalis                      2     862619               824073
Vidua_paradisaea                      2     833981               791532
Vidua_purpurascens                    2     927116               883478
Vidua_raricola                        2     956686               907898
Vidua_regia                           2    1012887               965657


In [9]:

    
## run steps 3-5 (within-sample steps) of the assembly
data.run("345")









    



Assembly: Finches
[####################] 100%  dereplicating         | 0:00:03 | s3 | 
[####################] 100%  clustering            | 0:00:43 | s3 | 
[####################] 100%  building clusters     | 0:00:05 | s3 | 
[####################] 100%  chunking              | 0:00:01 | s3 | 
[####################] 100%  aligning              | 0:05:02 | s3 | 
[####################] 100%  concatenating         | 0:00:04 | s3 | 
[####################] 100%  inferring [H, E]      | 0:00:35 | s4 | 
[####################] 100%  calculating depths    | 0:00:03 | s5 | 
[####################] 100%  chunking clusters     | 0:00:02 | s5 | 
[####################] 100%  consens calling       | 0:01:24 | s5 |

Branch to create separate data sets for Vidua and Lagonosticta



In [11]:

    
## create data set with only Vidua samples + outgroup
subs = [i for i in data.samples if "Vidua" in i] +\
       [i for i in data.samples if "Anomalo" in i]
vidua = data.branch("vidua", subsamples=subs)

## create data set with only Lagonosticta samples + outgroup
subs = [i for i in data.samples if "Lagon" in i] +\
       [i for i in data.samples if "Clyto" in i]
lagon = data.branch("lagon", subsamples=subs)
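To see which samples the substring tests above pick out, the same selection can be run on a plain list of names without an Assembly object. This is a sketch: the names are copied from the stats table, and `select` is a hypothetical helper, not an ipyrad function.

```python
# sample names copied from the stats table earlier in this notebook
sample_names = [
    "Anomalospiza_imberbis", "Clytospiza_monteiri",
    "Lagonosticta_larvata", "Lagonosticta_rara",
    "Lagonosticta_rhodopareia", "Lagonosticta_rubricata_congica",
    "Lagonosticta_rubricata_rubricata", "Lagonosticta_rufopicta",
    "Lagonosticta_sanguinodorsalis", "Lagonosticta_senegala_rendalli",
    "Lagonosticta_senegala_rhodopsis", "Vidua_chalybeata_amauropteryx",
    "Vidua_chalybeata_neumanni", "Vidua_fischeri", "Vidua_hypocherina",
    "Vidua_interjecta", "Vidua_macroura_arenosa", "Vidua_macroura_macroura",
    "Vidua_obtusa", "Vidua_orientalis", "Vidua_paradisaea",
    "Vidua_purpurascens", "Vidua_raricola", "Vidua_regia",
]


def select(names, *substrings):
    """Return the names containing any of the substrings, in input order."""
    return [n for n in names if any(s in n for s in substrings)]


vidua_subs = select(sample_names, "Vidua", "Anomalo")   # ingroup + outgroup
lagon_subs = select(sample_names, "Lagon", "Clyto")     # ingroup + outgroup

print(len(vidua_subs), len(lagon_subs))
```

The two subsets do not overlap and together cover all 24 samples. Unlike the notebook cell, this helper keeps the input order rather than placing the outgroup last, which should not matter for a subsample list.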

Assemble each data set through final steps

Or, if you are pressed for time, you can choose just one of the Assemblies going forward. If you do, you may want to choose Vidua since that is the one we use in the downstream analysis tools at the end of this notebook.



In [15]:

    
vidua.run("6")









    



Assembly: vidua
[####################] 100%  concat/shuffle input  | 0:00:01 | s6 | 
[####################] 100%  clustering across     | 0:00:08 | s6 | 
[####################] 100%  building clusters     | 0:00:02 | s6 | 
[####################] 100%  aligning clusters     | 0:00:07 | s6 | 
[####################] 100%  database indels       | 0:00:03 | s6 | 
[####################] 100%  indexing clusters     | 0:00:01 | s6 | 
[####################] 100%  building database     | 0:00:04 | s6 |



In [16]:

    
lagon.run("6")









    



Assembly: lagon
[####################] 100%  concat/shuffle input  | 0:00:01 | s6 | 
[####################] 100%  clustering across     | 0:00:06 | s6 | 
[####################] 100%  building clusters     | 0:00:01 | s6 | 
[####################] 100%  aligning clusters     | 0:00:04 | s6 | 
[####################] 100%  database indels       | 0:00:03 | s6 | 
[####################] 100%  indexing clusters     | 0:00:01 | s6 | 
[####################] 100%  building database     | 0:00:03 | s6 |

Branch to create several final data sets with different parameter settings

Here we use nested for-loops to iterate over assemblies and parameter values.



In [17]:

    
## iterate over data set and parameters
for assembly in [vidua, lagon]:
    for min_sample in [4, 10]:
        
        ## create new assembly, apply new name and parameters
        newname = "{}_min{}".format(assembly.name, min_sample)
        newdata = assembly.branch(newname)
        newdata.set_params("min_samples_locus", min_sample)
        
        ## run step 7 
        newdata.run("7")









    



Assembly: vidua_min4
[####################] 100%  filtering loci        | 0:00:09 | s7 | 
[####################] 100%  building loci/stats   | 0:00:01 | s7 | 
[####################] 100%  building vcf file     | 0:00:04 | s7 | 
[####################] 100%  writing vcf file      | 0:00:00 | s7 | 
[####################] 100%  building arrays       | 0:00:07 | s7 | 
[####################] 100%  writing outfiles      | 0:00:01 | s7 | 
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min4_outfiles

Assembly: vidua_min10
[####################] 100%  filtering loci        | 0:00:02 | s7 | 
[####################] 100%  building loci/stats   | 0:00:02 | s7 | 
[####################] 100%  building vcf file     | 0:00:02 | s7 | 
[####################] 100%  writing vcf file      | 0:00:00 | s7 | 
[####################] 100%  building arrays       | 0:00:07 | s7 | 
[####################] 100%  writing outfiles      | 0:00:01 | s7 | 
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min10_outfiles

Assembly: lagon_min4
[####################] 100%  filtering loci        | 0:00:02 | s7 | 
[####################] 100%  building loci/stats   | 0:00:01 | s7 | 
[####################] 100%  building vcf file     | 0:00:01 | s7 | 
[####################] 100%  writing vcf file      | 0:00:00 | s7 | 
[####################] 100%  building arrays       | 0:00:06 | s7 | 
[####################] 100%  writing outfiles      | 0:00:01 | s7 | 
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/lagon_min4_outfiles

Assembly: lagon_min10
[####################] 100%  filtering loci        | 0:00:02 | s7 | 
[####################] 100%  building loci/stats   | 0:00:01 | s7 | 
[####################] 100%  building vcf file     | 0:00:02 | s7 | 
[####################] 100%  writing vcf file      | 0:00:00 | s7 | 
[####################] 100%  building arrays       | 0:00:06 | s7 | 
[####################] 100%  writing outfiles      | 0:00:01 | s7 | 
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/lagon_min10_outfiles
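The nested loops above visit every combination of assembly and `min_samples_locus` value. As a small sketch, `itertools.product` from the standard library produces the same combinations in the same order; here it is used only to list the branch names that the loop creates, with plain strings standing in for the Assembly objects.

```python
from itertools import product

assembly_names = ["vidua", "lagon"]   # stand-ins for the Assembly objects
min_samples = [4, 10]

# same naming scheme as in the loop above
branch_names = [
    "{}_min{}".format(name, min_sample)
    for name, min_sample in product(assembly_names, min_samples)
]

print(branch_names)
```

Listing the names first is a cheap way to check a parameter grid before launching step 7 for every combination.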

View final stats

The .stats attribute shows a stats summary for each sample, and a number of stats dataframes can be accessed for each step from the .stats_dfs attribute of the Assembly.



In [3]:

    
vm4 = ip.load_json("analysis-ipyrad/Finches/vidua_min4.json")
vm4.stats









    



loading Assembly: vidua_min4
from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min4.json






    Out[3]:






  
    
      
                               state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens
Anomalospiza_imberbis              6     935241               889028           19116              8770    0.007562   0.003823           8584
Vidua_chalybeata_amauropteryx      6    1031674              1028936           19674              8620    0.003011   0.000674           8535
Vidua_chalybeata_neumanni          6     709824               678382           22053              9035    0.005203   0.003346           8879
Vidua_fischeri                     6     998161               935631           26309              9733    0.004910   0.002578           9595
Vidua_hypocherina                  6     889818               844395           23806              9539    0.004199   0.002698           9406
Vidua_interjecta                   6     741675               708862           22383              9345    0.006617   0.002864           9180
Vidua_macroura_arenosa             6     939649               891987           26536             10385    0.009633   0.003643           9994
Vidua_macroura_macroura            6     729322               690496           22449              9519    0.007330   0.002658           9350
Vidua_obtusa                       6     809186               763098           26641             10039    0.006043   0.003828           9834
Vidua_orientalis                   6     862619               824073           25127              9821    0.006461   0.002730           9666
Vidua_paradisaea                   6     833981               791532           26286              9636    0.006648   0.004084           9360
Vidua_purpurascens                 6     927116               883478           20212              9335    0.004603   0.002489           9196
Vidua_raricola                     6     956686               907898           27028             10303    0.006500   0.002545          10115
Vidua_regia                        6    1012887               965657           26355              9798    0.004225   0.002482           9657


In [4]:

    
lm4 = ip.load_json("analysis-ipyrad/Finches/lagon_min4.json")
lm4.stats









    



loading Assembly: lagon_min4
from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/lagon_min4.json






    Out[4]:






  
    
      
                                  state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens
Clytospiza_monteiri                   6    1220879              1168704           32301             10216    0.008451   0.003611           9775
Lagonosticta_larvata                  6    1001743               944653           25409             10557    0.008231   0.002764          10378
Lagonosticta_rara                     6     992534               934386           29094             10016    0.008781   0.002809           9799
Lagonosticta_rhodopareia              6    1020850               961277           27767              9947    0.008752   0.003191           9714
Lagonosticta_rubricata_congica        6    1064587               997387           26398             10138    0.010114   0.003007           9887
Lagonosticta_rubricata_rubricata      6    1079701              1009532           22306              8723    0.008899   0.003063           8456
Lagonosticta_rufopicta                6     898117               846834           25233             10023    0.006609   0.003789           9802
Lagonosticta_sanguinodorsalis         6    1034739               980499           27716              9046    0.009435   0.004015           8677
Lagonosticta_senegala_rendalli        6     834688               786701           21363              9960    0.007350   0.002776           9747
Lagonosticta_senegala_rhodopsis       6     792799               749508           26921              9379    0.006013   0.003181           9194


In [5]:

    
## or read the full stats file as a bash command (cat)
!cat $vm4.stats_files.s7









    



## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci              13949              0          13949
filtered_by_rm_duplicates            1645           1645          12304
filtered_by_max_indels                233            233          12071
filtered_by_max_snps                  268              9          12062
filtered_by_max_shared_het             83             55          12007
filtered_by_min_sample               3583           3310           8697
filtered_by_max_alleles               724            251           8446
total_filtered_loci                  8446              0           8446


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

                               sample_coverage
Anomalospiza_imberbis                     3921
Vidua_chalybeata_amauropteryx             5701
Vidua_chalybeata_neumanni                 6548
Vidua_fischeri                            6296
Vidua_hypocherina                         6377
Vidua_interjecta                          6590
Vidua_macroura_arenosa                    6548
Vidua_macroura_macroura                   6396
Vidua_obtusa                              6898
Vidua_orientalis                          6746
Vidua_paradisaea                          6699
Vidua_purpurascens                        6674
Vidua_raricola                            6991
Vidua_regia                               6267


## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci

    locus_coverage  sum_coverage
1                0             0
2                0             0
3                0             0
4              915           915
5              480          1395
6              444          1839
7              334          2173
8              368          2541
9              348          2889
10             400          3289
11             423          3712
12             760          4472
13            1895          6367
14            2079          8446


## The distribution of SNPs (var and pis) per locus.
## var = Number of loci with n variable sites (pis + autapomorphies)
## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)
## ipyrad API location: [assembly].stats_dfs.s7_snps

     var  sum_var   pis  sum_pis
0    941        0  3168        0
1   1169     1169  2220     2220
2   1144     3457  1454     5128
3   1096     6745   813     7567
4    855    10165   384     9103
5    723    13780   200    10103
6    629    17554    94    10667
7    516    21166    63    11108
8    391    24294    26    11316
9    293    26931     9    11397
10   221    29141     7    11467
11   151    30802     3    11500
12   115    32182     2    11524
13    74    33144     1    11537
14    46    33788     2    11565
15    34    34298     0    11565
16    20    34618     0    11565
17    11    34805     0    11565
18    11    35003     0    11565
19     5    35098     0    11565
20     1    35118     0    11565


## Final Sample stats summary

                               state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens  loci_in_assembly
Anomalospiza_imberbis              7     935241               889028           19116              8770       0.008  3.823e-03           8584              3921
Vidua_chalybeata_amauropteryx      7    1031674              1028936           19674              8620       0.003  6.740e-04           8535              5701
Vidua_chalybeata_neumanni          7     709824               678382           22053              9035       0.005  3.346e-03           8879              6548
Vidua_fischeri                     7     998161               935631           26309              9733       0.005  2.578e-03           9595              6296
Vidua_hypocherina                  7     889818               844395           23806              9539       0.004  2.698e-03           9406              6377
Vidua_interjecta                   7     741675               708862           22383              9345       0.007  2.864e-03           9180              6590
Vidua_macroura_arenosa             7     939649               891987           26536             10385       0.010  3.643e-03           9994              6548
Vidua_macroura_macroura            7     729322               690496           22449              9519       0.007  2.658e-03           9350              6396
Vidua_obtusa                       7     809186               763098           26641             10039       0.006  3.828e-03           9834              6898
Vidua_orientalis                   7     862619               824073           25127              9821       0.006  2.730e-03           9666              6746
Vidua_paradisaea                   7     833981               791532           26286              9636       0.007  4.084e-03           9360              6699
Vidua_purpurascens                 7     927116               883478           20212              9335       0.005  2.489e-03           9196              6674
Vidua_raricola                     7     956686               907898           27028             10303       0.007  2.545e-03          10115              6991
Vidua_regia                        7    1012887               965657           26355              9798       0.004  2.482e-03           9657              6267
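The locus coverage table in the stats file above can be read as a running total: `sum_coverage` accumulates `locus_coverage` from the smallest to the largest number of samples. The sketch below reproduces that column from hand-copied values and also counts the loci shared by at least 10 samples, which gives a rough idea of how much data a stricter `min_samples_locus` setting would keep.

```python
from itertools import accumulate

# locus_coverage for vidua_min4, copied from the stats file above:
# {number of samples with data: number of loci}
locus_coverage = {
    4: 915, 5: 480, 6: 444, 7: 334, 8: 368, 9: 348,
    10: 400, 11: 423, 12: 760, 13: 1895, 14: 2079,
}

# running total, in order of increasing sample count
sum_coverage = list(
    accumulate(locus_coverage[n] for n in sorted(locus_coverage))
)

# loci with data for at least 10 samples
loci_min10 = sum(count for n, count in locus_coverage.items() if n >= 10)

print(sum_coverage[-1], loci_min10)
```

The final running total equals the 8446 loci retained in the filters table, so the two parts of the stats file agree with each other.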



In [6]:

    
## the same full stats for lagon
!cat $lm4.stats_files.s7









    



## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci              14533              0          14533
filtered_by_rm_duplicates            1717           1717          12816
filtered_by_max_indels                282            282          12534
filtered_by_max_snps                  264             14          12520
filtered_by_max_shared_het             84             65          12455
filtered_by_min_sample               5209           4689           7766
filtered_by_max_alleles               797            268           7498
total_filtered_loci                  7498              0           7498


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

                                  sample_coverage
Clytospiza_monteiri                          4776
Lagonosticta_larvata                         6057
Lagonosticta_rara                            5737
Lagonosticta_rhodopareia                     6061
Lagonosticta_rubricata_congica               6183
Lagonosticta_rubricata_rubricata             4741
Lagonosticta_rufopicta                       5500
Lagonosticta_sanguinodorsalis                5512
Lagonosticta_senegala_rendalli               5491
Lagonosticta_senegala_rhodopsis              5309


## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci

    locus_coverage  sum_coverage
1                0             0
2                0             0
3                0             0
4             1141          1141
5              794          1935
6              840          2775
7              752          3527
8              911          4438
9             1359          5797
10            1701          7498


## The distribution of SNPs (var and pis) per locus.
## var = Number of loci with n variable sites (pis + autapomorphies)
## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)
## ipyrad API location: [assembly].stats_dfs.s7_snps

    var  sum_var   pis  sum_pis
0   313        0  2301        0
1   640      640  1969     1969
2   806     2252  1296     4561
3   895     4937   835     7066
4   812     8185   521     9150
5   808    12225   255    10425
6   705    16455   151    11331
7   617    20774    73    11842
8   516    24902    41    12170
9   375    28277    25    12395
10  277    31047    16    12555
11  196    33203     8    12643
12  197    35567     3    12679
13  119    37114     2    12705
14   84    38290     0    12705
15   54    39100     1    12720
16   42    39772     1    12736
17   17    40061     0    12736
18   16    40349     0    12736
19    3    40406     0    12736
20    6    40526     0    12736


## Final Sample stats summary

                                  state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens  loci_in_assembly
Clytospiza_monteiri                   7    1220879              1168704           32301             10216       0.008      0.004           9775              4776
Lagonosticta_larvata                  7    1001743               944653           25409             10557       0.008      0.003          10378              6057
Lagonosticta_rara                     7     992534               934386           29094             10016       0.009      0.003           9799              5737
Lagonosticta_rhodopareia              7    1020850               961277           27767              9947       0.009      0.003           9714              6061
Lagonosticta_rubricata_congica        7    1064587               997387           26398             10138       0.010      0.003           9887              6183
Lagonosticta_rubricata_rubricata      7    1079701              1009532           22306              8723       0.009      0.003           8456              4741
Lagonosticta_rufopicta                7     898117               846834           25233             10023       0.007      0.004           9802              5500
Lagonosticta_sanguinodorsalis         7    1034739               980499           27716              9046       0.009      0.004           8677              5512
Lagonosticta_senegala_rendalli        7     834688               786701           21363              9960       0.007      0.003           9747              5491
Lagonosticta_senegala_rhodopsis       7     792799               749508           26921              9379       0.006      0.003           9194              5309
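Reading the filters table at the top of the stats file above, `total_filters` appears to count every locus a filter would catch on its own, while `applied_order` counts the loci actually removed when the filters are applied in sequence. The sketch below checks that reading against the lagon_min4 numbers by subtracting each `applied_order` value in turn; the values are copied by hand from the output.

```python
# values copied from the lagon_min4 stats file above
prefiltered = 14533
applied_order = [
    ("rm_duplicates", 1717),
    ("max_indels", 282),
    ("max_snps", 14),
    ("max_shared_het", 65),
    ("min_sample", 4689),
    ("max_alleles", 268),
]

# subtract the loci removed at each step to get the retained_loci column
retained = prefiltered
retained_by_step = []
for filter_name, removed in applied_order:
    retained -= removed
    retained_by_step.append((filter_name, retained))

for filter_name, n_loci in retained_by_step:
    print("{:<16s} {}".format(filter_name, n_loci))
```

The running value reproduces the `retained_loci` column and ends at the 7498 loci reported as the final total.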

Analysis tools

There is a lot more information about analysis tools in the ipyrad documentation. Here I'll show just a quick example of how you can easily access the data files for these assemblies and use them in downstream analysis software. The ipyrad analysis tools include convenient wrappers that make it easier to parallelize analyses of RAD-seq data. You should still read the full tutorial of the software you are using to understand the scope of the parameters involved and their impacts, but once you understand that, the ipyrad analysis tools provide an easy way to set up scripts to sample different distributions of SNPs and to run many replicates in parallel.



In [13]:

    
import ipyrad.analysis as ipa



In [14]:

    
## you can re-load assemblies at a later time from their JSON file
min4 = ip.load_json("analysis-ipyrad/Finches/vidua_min4.json")
min10 = ip.load_json("analysis-ipyrad/Finches/vidua_min10.json")









    



loading Assembly: vidua_min4
from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min4.json
loading Assembly: vidua_min10
from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min10.json

RAxML -- ML concatenation tree inference



In [29]:

    
## conda install raxml -c bioconda
## conda install toytree -c eaton-lab



In [48]:

    
## create a raxml analysis object for the min10 data set
rax = ipa.raxml(
    name=min10.name, 
    data=min10.outfiles.phy,
    workdir="analysis-raxml",
    T=20,
    N=100,    
    o=[i for i in min10.samples if "Ano" in i],
    )



In [49]:

    
## print the raxml command and call it
print(rax.command)
rax.run(force=True)









    



raxmlHPC-PTHREADS-SSE3 -f a -T 20 -m GTRGAMMA -N 100 -x 12345 -p 54321 -n vidua_min10 -w /home/deren/Documents/ipyrad/tests/analysis-raxml -s /home/deren/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min10_outfiles/vidua_min10.phy -o Anomalospiza_imberbis
job vidua_min10 finished successfully
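The printed command above shows how the arguments given to the analysis object map onto RAxML flags. As an illustration only, the sketch below rebuilds a command of the same shape from keyword arguments; `build_raxml_command` is a hypothetical helper and is not how ipyrad constructs the command internally. The flag values are taken from the printed output, and the relative paths are shortened placeholders.

```python
import shlex


def build_raxml_command(name, phy, workdir, threads=20, nboots=100,
                        outgroup=None, binary="raxmlHPC-PTHREADS-SSE3"):
    """Assemble a RAxML rapid-bootstrap command line as a single string.

    Hypothetical helper; the model and seed values mirror the command
    printed in the notebook output above.
    """
    args = [
        binary,
        "-f", "a",             # rapid bootstrap + best-tree search
        "-T", str(threads),    # number of threads
        "-m", "GTRGAMMA",      # substitution model
        "-N", str(nboots),     # number of bootstrap replicates
        "-x", "12345",         # bootstrap random seed
        "-p", "54321",         # parsimony random seed
        "-n", name,            # run name used in output file names
        "-w", workdir,         # output directory
        "-s", phy,             # input alignment
    ]
    if outgroup:
        # RAxML accepts a comma-separated list of outgroup names
        args += ["-o", ",".join(outgroup)]
    return shlex.join(args)


cmd = build_raxml_command(
    name="vidua_min10",
    phy="vidua_min10.phy",
    workdir="analysis-raxml",
    outgroup=["Anomalospiza_imberbis"],
)
print(cmd)
```

Building the argument list first and joining it at the end keeps each flag next to its value, which makes it easier to compare against the command that the wrapper prints.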



In [50]:

    
## access the resulting tree files
rax.trees









    Out[50]:





bestTree                   ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bestTree.vidua_min10
bipartitions               ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bipartitions.vidua_min10
bipartitionsBranchLabels   ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bipartitionsBranchLabels.vidua_min10
bootstrap                  ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bootstrap.vidua_min10
info                       ~/Documents/ipyrad/tests/analysis-raxml/RAxML_info.vidua_min10



In [51]:

    
## plot a tree in the notebook with toytree
import toytree
tre = toytree.tree(rax.trees.bipartitions)
tre.root(wildcard="Ano")
tre.draw(
    width=350,
    height=400,
    node_labels=tre.get_node_values("support"),
    #use_edge_lengths=True,
    );









    





tetrad -- quartet tree inference



In [39]:

    
## create a tetrad analysis object
tet = ipa.tetrad(
    name=min4.name, 
    seqfile=min4.outfiles.snpsphy,
    mapfile=min4.outfiles.snpsmap,
    nboots=100,
    )









    



loading seq array [14 taxa x 35118 bp]
max unlinked SNPs per quartet (nloci): 7505



In [40]:

    
## run tree inference
tet.run(ipyclient)









    



host compute node: [40 cores] on tinus
inferring 1001 induced quartet trees
[####################] 100%  initial tree | 0:00:04 |  
[####################] 100%  boot 100     | 0:01:35 |
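
The 1001 quartets reported above are simply every way of choosing 4 of the 14 taxa in the data set. A quick check with the standard library, using placeholder tip names rather than the real sample names:

```python
from itertools import combinations

ntaxa = 14
# placeholder labels; any 14 distinct names give the same count
tips = ["taxon_{}".format(i) for i in range(ntaxa)]

# every unordered set of four tips is one induced quartet
quartets = list(combinations(tips, 4))
print(len(quartets))
```

The count grows roughly with the fourth power of the number of taxa, so the full set of quartets becomes large quickly as samples are added.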



In [41]:

    
## access tree files
tet.trees









    Out[41]:





boots   ~/Documents/ipyrad/tests/analysis-tetrad/vidua_min4.boots
cons    ~/Documents/ipyrad/tests/analysis-tetrad/vidua_min4.cons
nhx     ~/Documents/ipyrad/tests/analysis-tetrad/vidua_min4.nhx
tree    ~/Documents/ipyrad/tests/analysis-tetrad/vidua_min4.tree



In [112]:

    
## plot results (as above; the tetrad tree is unrooted by default,
## so we root it on the outgroup before drawing)
import toytree
qtre = toytree.tree(tet.trees.nhx)
qtre.root(wildcard="Ano")
qtre.draw(
    width=350,
    height=400,
    node_labels=qtre.get_node_values("support"),
    );









    







In [54]:

    
## draw a cloud-tree to see variation among bootstrap trees
## note that the trees are UNROOTED here, but tips are in the 
## same order in all trees.
boots = toytree.multitree(tet.trees.boots, fixed_order=tre.get_tip_labels())
boots.draw_cloudtree(orient='right', edge_style={"opacity": 0.05});









    





STRUCTURE -- population cluster inference



In [ ]:

    
## conda install structure clumpp -c ipyrad



In [56]:

    
## create a structure analysis object for the no-outgroup data set
struct = ipa.structure(
    name=min10.name, 
    data=min10.outfiles.str, 
    mapfile=min10.outfiles.snpsmap,    
)

## set params for analysis (should be longer in real analyses)
struct.mainparams.burnin=1000
struct.mainparams.numreps=8000



In [57]:

    
## run structure across 10 random replicates of sampled unlinked SNPs
for kpop in [2, 4, 6, 8]:
    struct.run(kpop=kpop, nreps=10, ipyclient=ipyclient)









    



submitted 10 structure jobs [vidua_min10-K-2]
submitted 10 structure jobs [vidua_min10-K-4]
submitted 10 structure jobs [vidua_min10-K-6]
submitted 10 structure jobs [vidua_min10-K-8]



In [59]:

    
## wait for all of these jobs to finish
ipyclient.wait()









    Out[59]:





True

Use Clumpp to permute replicates



In [64]:

    
## these options make it run faster
struct.clumppparams.m = 3               ## use largegreedy algorithm
struct.clumppparams.greedy_option = 2   ## test nrepeat possible orders
struct.clumppparams.repeats = 10000     ## number of repeats



In [65]:

    
## collect results
tables = {}
for kpop in [2, 4, 6, 8]:
    tables[kpop] = struct.get_clumpp_table(kpop)









    



mean scores across 10 replicates.
mean scores across 10 replicates.
mean scores across 10 replicates.
mean scores across 10 replicates.
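
Replicates have to be permuted before they are averaged because cluster labels are arbitrary: the group called cluster 1 in one replicate may be called cluster 2 in the next. The sketch below shows the idea on made-up membership values. It uses a brute-force search over column orders and aligns every replicate to the first one, which is a simplification and not CLUMPP's actual algorithm; the function names are hypothetical.

```python
from itertools import permutations


def align_columns(reference, replicate):
    """Reorder the columns of `replicate` to best match `reference`."""
    k = len(reference[0])

    def score(perm):
        # sum of elementwise products; maximizing this is equivalent to
        # minimizing the squared distance between the two matrices
        return sum(
            ref_row[j] * rep_row[perm[j]]
            for ref_row, rep_row in zip(reference, replicate)
            for j in range(k)
        )

    best = max(permutations(range(k)), key=score)
    return [[row[p] for p in best] for row in replicate]


def mean_membership(replicates):
    """Average membership matrices after aligning each to the first."""
    reference = replicates[0]
    aligned = [reference] + [align_columns(reference, r) for r in replicates[1:]]
    nrows, ncols = len(reference), len(reference[0])
    return [
        [sum(rep[i][j] for rep in aligned) / len(aligned) for j in range(ncols)]
        for i in range(nrows)
    ]


# made-up values: rep_b is rep_a with the two cluster labels swapped
rep_a = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
rep_b = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]

realigned = align_columns(rep_a, rep_b)
averaged = mean_membership([rep_a, rep_b])
print(averaged)
```

Averaging rep_a and rep_b without alignment would give 0.5 for every sample, hiding the structure both replicates agree on. The brute-force search is only practical for small K, which is why faster heuristics such as the greedy options set above are used for larger values.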

Plot the results as a barplot

Usually, the next step in a structure analysis would be to do some kind of statistical analysis to compare models and identify K values that fit the data well.
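
One commonly used comparison is the delta K statistic of Evanno et al. (2005), which looks for the K at which the log probability of the data changes most sharply relative to its variation among replicates. The sketch below is a minimal version computed from replicate means. The log-probability values are invented placeholders and not results from this data set, the function name is hypothetical, and how you collect the per-replicate values from your own runs is left open. The method is defined for consecutive K values, so applying it to the steps of 2 used here is a further simplification.

```python
from statistics import mean, stdev


def delta_k(lnprob_by_k):
    """Delta K for each interior K, from replicate log probabilities.

    lnprob_by_k maps K to a list of ln P(data | K) values, one per
    replicate. K values are assumed to be evenly spaced.
    """
    ks = sorted(lnprob_by_k)
    result = {}
    for k_prev, k, k_next in zip(ks, ks[1:], ks[2:]):
        # absolute second-order change in the mean log probability
        curvature = abs(
            mean(lnprob_by_k[k_next])
            - 2 * mean(lnprob_by_k[k])
            + mean(lnprob_by_k[k_prev])
        )
        # scaled by the spread among replicates at this K
        result[k] = curvature / stdev(lnprob_by_k[k])
    return result


# invented placeholder values, for illustration only
toy = {
    2: [-5200.0, -5210.0, -5190.0],
    4: [-4800.0, -4805.0, -4795.0],
    6: [-4780.0, -4800.0, -4760.0],
    8: [-4790.0, -4830.0, -4750.0],
}

scores = delta_k(toy)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The smallest and largest K never receive a score because the statistic needs a neighbor on each side, so the range of K values tested should extend beyond the values you consider plausible.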



In [66]:

    
## order of bars will be taken from ladderized tree above
myorder = tre.get_tip_labels()



In [69]:

    
## import toyplot (packaged with toytree) 
import toyplot

## plot bars for each K-value (mean of 10 reps)
for kpop in [2, 4, 6, 8]:
    table = tables[kpop]
    table = table.loc[myorder]
    
    ## plot barplot w/ hover
    canvas, axes, mark = toyplot.bars(
                            table, 
                            title=[[i] for i in table.index.tolist()],
                            width=400, 
                            height=200, 
                            yshow=False,                            
                            style={"stroke": toyplot.color.near_black},
                            )









    





TREEMIX -- ML tree & admixture co-inference



In [70]:

    
## conda install treemix -c ipyrad



In [93]:

    
## group taxa into 'populations'
imap = {
    'orient': ['Vidua_orientalis'],
    'interj': ['Vidua_interjecta'],
    'obtusa': ['Vidua_obtusa'],
    'paradi': ['Vidua_paradisaea'],
    'hypoch': ['Vidua_hypocherina'],
    'macrou': ['Vidua_macroura_macroura', 'Vidua_macroura_arenosa'],
    'fische': ['Vidua_fischeri'],
    'regia' : ['Vidua_regia'],
    'chalyb': ['Vidua_chalybeata_amauropteryx', 'Vidua_chalybeata_neumanni'],
    'purpur': ['Vidua_purpurascens'],
    'rarico': ['Vidua_raricola'],
    #'outgro': ['Anomalospiza_imberbis'],
}

## optional: loci will be filtered if they do not have data for at
## least N samples in each species. Minimums cannot be <1.
minmap = {
    'orient': 1,
    'interj': 1,
    'obtusa': 1,
    'paradi': 1,
    'hypoch': 1,
    'macrou': 2,
    'fische': 1,
    'regia' : 1,
    'chalyb': 2,
    'purpur': 1,
    'rarico': 1,
    #'outgro': 1,
    }
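
To make the effect of minmap concrete, the sketch below applies the rule described in the comment above to a single site: the site is kept only if every population has data for at least its minimum number of samples. This is an illustration of the rule and not ipyrad's implementation; the function name and the toy inputs are hypothetical.

```python
def passes_minmap(samples_with_data, imap, minmap):
    """True if every population meets its minimum sample coverage."""
    present = set(samples_with_data)
    for pop, members in imap.items():
        n_present = sum(1 for sample in members if sample in present)
        # populations missing from minmap default to a minimum of 1
        if n_present < minmap.get(pop, 1):
            return False
    return True


# a two-population subset of the dictionaries defined above
toy_imap = {
    'macrou': ['Vidua_macroura_macroura', 'Vidua_macroura_arenosa'],
    'regia': ['Vidua_regia'],
}
toy_minmap = {'macrou': 2, 'regia': 1}

# all three samples have data: the site is kept
keep = passes_minmap(
    ['Vidua_macroura_macroura', 'Vidua_macroura_arenosa', 'Vidua_regia'],
    toy_imap, toy_minmap)

# one of the two macroura samples is missing: the site is dropped
drop = passes_minmap(
    ['Vidua_macroura_macroura', 'Vidua_regia'],
    toy_imap, toy_minmap)

print(keep, drop)
```

Requiring both samples in the two-sample populations trades fewer retained SNPs for allele frequency estimates that are based on more than one individual.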



In [104]:

    
## create a treemix analysis object
tmix = ipa.treemix(
    name=min10.name, 
    data=min10.outfiles.snpsphy,
    imap=imap,
    minmap=minmap,  
    )

## set params on treemix object
tmix.params.m = 1
tmix.params.root = "interj,orient,paradi,obtusa"
tmix.params.global_ = 1



In [105]:

    
## you can simply write the input files and run them externally
## or, as we show below, use the .run() command to run them here.
tmix.write_output_file()









    



ntaxa 14; nSNPs total 35118; nSNPs written 16364



In [108]:

    
## a dictionary for storing treemix objects
tdict = {}

## run 4 replicates for each number of migration edges (m = 0-3)
for rep in range(4):
    for mig in range(4):
        
        ## create new treemix object copy
        name = "mig-{}-rep-{}".format(mig, rep)
        tmp = tmix.copy(name)

        ## set params on new object
        tmp.params.m = mig
    
        ## run treemix analysis
        tmp.run()
        
        ## store the treemix object
        tdict[name] = tmp



In [110]:

    
import toyplot
import numpy as np

canvas = toyplot.Canvas(width=800, height=1200)
idx = 0
for mig in range(4):
    for rep in range(4):
        tmp = tdict["mig-{}-rep-{}".format(mig, rep)]
        ax = canvas.cartesian(grid=(4, 4, idx), padding=25, margin=(25, 50, 100, 25))
        ax = tmp.draw(ax)
        idx += 1









    





baba: D-statistic (ABBA-BABA) admixture inference/hypothesis testing



In [15]:

    
## create a baba analysis object
bb = ipa.baba(
    data=min4.outfiles.loci,
    newick="analysis-raxml/RAxML_bestTree.vidua_min10" 
)



In [16]:

    
## this will generate tests from the tree, using constraints.
bb.generate_tests_from_tree(
    constraint_exact=False,
    constraint_dict={
        "p4": ['Anomalospiza_imberbis'],
        'p3': ['Vidua_macroura_macroura', 'Vidua_macroura_arenosa'],
    })









    



168 tests generated from tree



In [17]:

    
## run inference and significance testing on tests
bb.run(ipyclient=ipyclient)









    



[####################] 100%  calculating D-stats  | 0:00:24 |



In [18]:

    
## sorted results for the tests performed 
bb.results_table.sort_values(by="Z", ascending=False)









    Out[18]:






  
    
      
	dstat	bootmean	bootstd	Z	ABBA	BABA	nloci
25	-0.195	-1.952e-01	0.071	2.751	43.250	64.250	3207
22	0.212	2.116e-01	0.078	2.729	61.000	39.625	3143
1	-0.186	-1.816e-01	0.069	2.701	44.656	65.094	3262
13	-0.192	-1.888e-01	0.074	2.608	41.875	61.750	3186
15	-0.212	-2.081e-01	0.084	2.528	39.625	61.000	3143
...	...	...	...	...	...	...	...
120	-0.009	-2.353e-03	0.104	0.085	34.750	35.375	2717
157	0.004	5.400e-03	0.085	0.047	46.812	46.438	3222
0	0.002	4.923e-03	0.071	0.030	58.312	58.062	3268
76	-0.003	-6.335e-04	0.094	0.027	36.500	36.688	2765
70	0.003	1.563e-03	0.095	0.027	36.688	36.500	2765
    
  

168 rows × 7 columns

summary

The top results show the greatest support for admixture between V. macroura and V. paradisaea, though the signal is not very strong (Z=2.7).
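
To see where these numbers come from, the sketch below recomputes the D-statistic for test 25 from its ABBA and BABA counts and converts the Z-score to a p-value. Two things here are assumptions rather than something stated in the output: that Z is |D| divided by the bootstrap standard deviation (which matches the table to within rounding), and that a normal approximation with a Bonferroni correction over the 168 tests is a reasonable way to judge significance.

```python
import math


def d_statistic(abba, baba):
    """Patterson's D: normalized difference between ABBA and BABA counts."""
    return (abba - baba) / (abba + baba)


def two_sided_p(z):
    """Two-sided p-value for a Z-score under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


# values for test 25, copied from the results table
abba, baba, bootstd = 43.250, 64.250, 0.071
ntests = 168

d = d_statistic(abba, baba)
z = abs(d) / bootstd          # assumed definition of Z (see text)
p = two_sided_p(z)
alpha = 0.05 / ntests         # Bonferroni-corrected threshold

print(round(d, 3))
print(p < alpha)
```

Under these assumptions the strongest test has an uncorrected p-value of roughly 0.006, well above the corrected threshold of about 0.0003, which is consistent with describing the signal as not very strong.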



In [19]:

    
## the test that had the most significant result: (BABA)
bb.tests[25]









    Out[19]:





{'p1': ['Vidua_paradisaea'],
 'p2': ['Vidua_orientalis', 'Vidua_interjecta'],
 'p3': ['Vidua_macroura_arenosa'],
 'p4': ['Anomalospiza_imberbis']}



In [20]:

    
## next best (ABBA)
bb.tests[22]









    Out[20]:





{'p1': ['Vidua_orientalis'],
 'p2': ['Vidua_paradisaea'],
 'p3': ['Vidua_macroura_macroura'],
 'p4': ['Anomalospiza_imberbis']}

	state	reads_raw	reads_passed_filter
Anomalospiza_imberbis	2	935241	889028
Clytospiza_monteiri	2	1220879	1168704
Lagonosticta_larvata	2	1001743	944653
Lagonosticta_rara	2	992534	934386
Lagonosticta_rhodopareia	2	1020850	961277
Lagonosticta_rubricata_congica	2	1064587	997387
Lagonosticta_rubricata_rubricata	2	1079701	1009532
Lagonosticta_rufopicta	2	898117	846834
Lagonosticta_sanguinodorsalis	2	1034739	980499
Lagonosticta_senegala_rendalli	2	834688	786701
Lagonosticta_senegala_rhodopsis	2	792799	749508
Vidua_chalybeata_amauropteryx	2	1031674	1028936
Vidua_chalybeata_neumanni	2	709824	678382
Vidua_fischeri	2	998161	935631
Vidua_hypocherina	2	889818	844395
Vidua_interjecta	2	741675	708862
Vidua_macroura_arenosa	2	939649	891987
Vidua_macroura_macroura	2	729322	690496
Vidua_obtusa	2	809186	763098
Vidua_orientalis	2	862619	824073
Vidua_paradisaea	2	833981	791532
Vidua_purpurascens	2	927116	883478
Vidua_raricola	2	956686	907898
Vidua_regia	2	1012887	965657
