DaCosta & Sorenson (2016) single-end dd-RAD data set

Here we demonstrate a denovo assembly for an empirical RAD data set using the ipyrad Python API. This example was run on a workstation with 20 cores available and takes about <10 minutes to run completely, but can be run on even a laptop in about less than an hour.

We will use the Lagonosticta and Vidua data set from DaCosta & Sorenson 2016. This data set is composed of single-end 101bp reads from a ddRAD-seq library prepared with the SbfI and EcoRI enzymes and is available on NCBI by its study accession SRP059199. At the end of this notebook we also demonstrate the use of ipyrad.analysis tools to run downstream analyses on this data set.

The figure below from this paper shows the general workflow in which two fairly distinct clades were sequenced together but then analyzed separately.

Setup (software and data files)

If you haven't done so yet, start by installing ipyrad using conda (see ipyrad installation instructions) as well as the packages in the cell below. This is easiest to do in a terminal. Then open a jupyter-notebook, like this one, and follow along with the tutorial by copying and executing the code in the cells, and adding your own documentation between them using markdown. Feel free to modify parameters to see their effects on the downstream results.


In [1]:
## conda install ipyrad -c ipyrad
## conda install toytree -c eaton-lab
## conda install entrez-direct -c bioconda
## conda install sratools -c bioconda

In [8]:
## imports
import ipyrad as ip
import ipyrad.analysis as ipa
import ipyparallel as ipp

In contrast to the ipyrad CLI, the ipyrad API gives users much more fine-scale control over the parallelization of their analysis, but this also requires learning a little bit about the library that we use to do this, called ipyparallel. This library is designed for use with jupyter-notebooks to allow massive-scale multi-processing while working interactively.

Understanding the nuts and bolts of it might take a little while, but it is fairly easy to get started using it, especially in the way it is integrated with ipyrad. To start a parallel client to you must run the command-line program 'ipcluster'. This will essentially start a number of independent Python processes (kernels) which we can then send bits of work to do. The cluster can be stopped and restarted independently of this notebook, which is convenient for working on a cluster where connecting to many cores is not always immediately available.

Open a terminal and type the following command to start an ipcluster instance with N engines.


In [9]:
## ipcluster start --n=20

In [11]:
## connect to cluster
ipyclient = ipp.Client()
ipyclient.ids


Out[11]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Download the data set (Finches)

These data are archived on the NCBI sequence read archive (SRA) under accession id SRP059199. For convenience, the data are also hosted at a public Dropbox link which is a bit easier to access. Run the code below to download and decompress the fastq data files, which will save them into a directory called fastqs-Finches/.


In [3]:
## download the Pedicularis data set from NCBI
sra = ipa.sratools(accession="SRP059199", workdir="fastqs-Finches")
sra.run(force=True, ipyclient=ipyclient)


[####################] 100%  Downloading fastq files | 0:00:52 |  
24 fastq files downloaded to /home/deren/Documents/ipyrad/tests/fastqs-Finches

Create an Assembly object

This object stores the parameters of the assembly and the organization of data files.


In [4]:
## you must provide a name for the Assembly
data = ip.Assembly("Finches")


New Assembly: Finches

Set parameters for the Assembly. This will raise an error if any of the parameters are not allowed because they are the wrong type, or out of the allowed range.


In [6]:
## set parameters
data.set_params("project_dir", "analysis-ipyrad/Finches")
data.set_params("sorted_fastq_path", "fastqs-Finches/*.fastq.gz")
data.set_params("datatype", "ddrad")
data.set_params("restriction_overhang", ("CCTGCAGG", "AATTC"))
data.set_params("clust_threshold", "0.85")
data.set_params("filter_adapters", "2")
data.set_params("max_Hs_consens", (5, 5))
data.set_params("output_formats", "psvnkua")

## see/print all parameters
data.get_params()


0   assembly_name               Finches                                      
1   project_dir                 ./analysis-ipyrad/Finches                    
2   raw_fastq_path                                                           
3   barcodes_path                                                            
4   sorted_fastq_path           ./fastqs-Finches/*.fastq.gz                  
5   assembly_method             denovo                                       
6   reference_sequence                                                       
7   datatype                    ddrad                                        
8   restriction_overhang        ('CCTGCAGG', 'AATTC')                        
9   max_low_qual_bases          5                                            
10  phred_Qscore_offset         33                                           
11  mindepth_statistical        6                                            
12  mindepth_majrule            6                                            
13  maxdepth                    10000                                        
14  clust_threshold             0.85                                         
15  max_barcode_mismatch        0                                            
16  filter_adapters             2                                            
17  filter_min_trim_len         35                                           
18  max_alleles_consens         2                                            
19  max_Ns_consens              (5, 5)                                       
20  max_Hs_consens              (5, 5)                                       
21  min_samples_locus           4                                            
22  max_SNPs_locus              (20, 20)                                     
23  max_Indels_locus            (8, 8)                                       
24  max_shared_Hs_locus         0.5                                          
25  trim_reads                  (0, 0, 0, 0)                                 
26  trim_loci                   (0, 0, 0, 0)                                 
27  output_formats              ('p', 's', 'v', 'n', 'k', 'u')               
28  pop_assign_file                                                          

Assemble the data set


In [7]:
## run steps 1 & 2 of the assembly
data.run("12")


Assembly: Finches
[####################] 100%  loading reads         | 0:00:03 | s1 | 
[####################] 100%  processing reads      | 0:01:08 | s2 | 

In [8]:
## access the stats of the assembly (so far) from the .stats attribute
data.stats


Out[8]:
state reads_raw reads_passed_filter
Anomalospiza_imberbis 2 935241 889028
Clytospiza_monteiri 2 1220879 1168704
Lagonosticta_larvata 2 1001743 944653
Lagonosticta_rara 2 992534 934386
Lagonosticta_rhodopareia 2 1020850 961277
Lagonosticta_rubricata_congica 2 1064587 997387
Lagonosticta_rubricata_rubricata 2 1079701 1009532
Lagonosticta_rufopicta 2 898117 846834
Lagonosticta_sanguinodorsalis 2 1034739 980499
Lagonosticta_senegala_rendalli 2 834688 786701
Lagonosticta_senegala_rhodopsis 2 792799 749508
Vidua_chalybeata_amauropteryx 2 1031674 1028936
Vidua_chalybeata_neumanni 2 709824 678382
Vidua_fischeri 2 998161 935631
Vidua_hypocherina 2 889818 844395
Vidua_interjecta 2 741675 708862
Vidua_macroura_arenosa 2 939649 891987
Vidua_macroura_macroura 2 729322 690496
Vidua_obtusa 2 809186 763098
Vidua_orientalis 2 862619 824073
Vidua_paradisaea 2 833981 791532
Vidua_purpurascens 2 927116 883478
Vidua_raricola 2 956686 907898
Vidua_regia 2 1012887 965657

In [9]:
## run steps 3-5 (within-sample steps) of the assembly
data.run("345")


Assembly: Finches
[####################] 100%  dereplicating         | 0:00:03 | s3 | 
[####################] 100%  clustering            | 0:00:43 | s3 | 
[####################] 100%  building clusters     | 0:00:05 | s3 | 
[####################] 100%  chunking              | 0:00:01 | s3 | 
[####################] 100%  aligning              | 0:05:02 | s3 | 
[####################] 100%  concatenating         | 0:00:04 | s3 | 
[####################] 100%  inferring [H, E]      | 0:00:35 | s4 | 
[####################] 100%  calculating depths    | 0:00:03 | s5 | 
[####################] 100%  chunking clusters     | 0:00:02 | s5 | 
[####################] 100%  consens calling       | 0:01:24 | s5 | 

Branch to create separate data sets for Vidua and Lagonistica


In [11]:
## create data set with only Vidua samples + outgroup
subs = [i for i in data.samples if "Vidua" in i] +\
       [i for i in data.samples if "Anomalo" in i]
vidua = data.branch("vidua", subsamples=subs)

## create data set with only Lagonostica sampes + outgroup
subs = [i for i in data.samples if "Lagon" in i] +\
       [i for i in data.samples if "Clyto" in i]
lagon = data.branch("lagon", subsamples=subs)

Assemble each data set through final steps

Or, if you are pressed for time, you can choose just one of the Assemblies going forward. If you do, you may want to choose Vidua since that is the one we use in the downstream analysis tools at the end of this notebook.


In [15]:
vidua.run("6")


Assembly: vidua
[####################] 100%  concat/shuffle input  | 0:00:01 | s6 | 
[####################] 100%  clustering across     | 0:00:08 | s6 | 
[####################] 100%  building clusters     | 0:00:02 | s6 | 
[####################] 100%  aligning clusters     | 0:00:07 | s6 | 
[####################] 100%  database indels       | 0:00:03 | s6 | 
[####################] 100%  indexing clusters     | 0:00:01 | s6 | 
[####################] 100%  building database     | 0:00:04 | s6 | 

In [16]:
lagon.run("6")


Assembly: lagon
[####################] 100%  concat/shuffle input  | 0:00:01 | s6 | 
[####################] 100%  clustering across     | 0:00:06 | s6 | 
[####################] 100%  building clusters     | 0:00:01 | s6 | 
[####################] 100%  aligning clusters     | 0:00:04 | s6 | 
[####################] 100%  database indels       | 0:00:03 | s6 | 
[####################] 100%  indexing clusters     | 0:00:01 | s6 | 
[####################] 100%  building database     | 0:00:03 | s6 | 

Branch to create several final data sets with different parameter settings

Here we use nested for-loops to iterate over assemblies and parameter values.


In [17]:
## iterate over data set and parameters
for assembly in [vidua, lagon]:
    for min_sample in [4, 10]:
        
        ## create new assembly, apply new name and parameters
        newname = "{}_min{}".format(assembly.name, min_sample)
        newdata = assembly.branch(newname)
        newdata.set_params("min_samples_locus", min_sample)
        
        ## run step 7 
        newdata.run("7")


Assembly: vidua_min4
[####################] 100%  filtering loci        | 0:00:09 | s7 | 
[####################] 100%  building loci/stats   | 0:00:01 | s7 | 
[####################] 100%  building vcf file     | 0:00:04 | s7 | 
[####################] 100%  writing vcf file      | 0:00:00 | s7 | 
[####################] 100%  building arrays       | 0:00:07 | s7 | 
[####################] 100%  writing outfiles      | 0:00:01 | s7 | 
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min4_outfiles

Assembly: vidua_min10
[####################] 100%  filtering loci        | 0:00:02 | s7 | 
[####################] 100%  building loci/stats   | 0:00:02 | s7 | 
[####################] 100%  building vcf file     | 0:00:02 | s7 | 
[####################] 100%  writing vcf file      | 0:00:00 | s7 | 
[####################] 100%  building arrays       | 0:00:07 | s7 | 
[####################] 100%  writing outfiles      | 0:00:01 | s7 | 
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min10_outfiles

Assembly: lagon_min4
[####################] 100%  filtering loci        | 0:00:02 | s7 | 
[####################] 100%  building loci/stats   | 0:00:01 | s7 | 
[####################] 100%  building vcf file     | 0:00:01 | s7 | 
[####################] 100%  writing vcf file      | 0:00:00 | s7 | 
[####################] 100%  building arrays       | 0:00:06 | s7 | 
[####################] 100%  writing outfiles      | 0:00:01 | s7 | 
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/lagon_min4_outfiles

Assembly: lagon_min10
[####################] 100%  filtering loci        | 0:00:02 | s7 | 
[####################] 100%  building loci/stats   | 0:00:01 | s7 | 
[####################] 100%  building vcf file     | 0:00:02 | s7 | 
[####################] 100%  writing vcf file      | 0:00:00 | s7 | 
[####################] 100%  building arrays       | 0:00:06 | s7 | 
[####################] 100%  writing outfiles      | 0:00:01 | s7 | 
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/lagon_min10_outfiles

View final stats

The .stats attribute shows a stats summary for each sample, and a number of stats dataframes can be accessed for each step from the .stats_dfs attribute of the Assembly.


In [3]:
vm4 = ip.load_json("analysis-ipyrad/Finches/vidua_min4.json")
vm4.stats


loading Assembly: vidua_min4
from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min4.json
Out[3]:
state reads_raw reads_passed_filter clusters_total clusters_hidepth hetero_est error_est reads_consens
Anomalospiza_imberbis 6 935241 889028 19116 8770 0.007562 0.003823 8584
Vidua_chalybeata_amauropteryx 6 1031674 1028936 19674 8620 0.003011 0.000674 8535
Vidua_chalybeata_neumanni 6 709824 678382 22053 9035 0.005203 0.003346 8879
Vidua_fischeri 6 998161 935631 26309 9733 0.004910 0.002578 9595
Vidua_hypocherina 6 889818 844395 23806 9539 0.004199 0.002698 9406
Vidua_interjecta 6 741675 708862 22383 9345 0.006617 0.002864 9180
Vidua_macroura_arenosa 6 939649 891987 26536 10385 0.009633 0.003643 9994
Vidua_macroura_macroura 6 729322 690496 22449 9519 0.007330 0.002658 9350
Vidua_obtusa 6 809186 763098 26641 10039 0.006043 0.003828 9834
Vidua_orientalis 6 862619 824073 25127 9821 0.006461 0.002730 9666
Vidua_paradisaea 6 833981 791532 26286 9636 0.006648 0.004084 9360
Vidua_purpurascens 6 927116 883478 20212 9335 0.004603 0.002489 9196
Vidua_raricola 6 956686 907898 27028 10303 0.006500 0.002545 10115
Vidua_regia 6 1012887 965657 26355 9798 0.004225 0.002482 9657

In [4]:
lm4 = ip.load_json("analysis-ipyrad/Finches/lagon_min4.json")
lm4.stats


loading Assembly: lagon_min4
from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/lagon_min4.json
Out[4]:
state reads_raw reads_passed_filter clusters_total clusters_hidepth hetero_est error_est reads_consens
Clytospiza_monteiri 6 1220879 1168704 32301 10216 0.008451 0.003611 9775
Lagonosticta_larvata 6 1001743 944653 25409 10557 0.008231 0.002764 10378
Lagonosticta_rara 6 992534 934386 29094 10016 0.008781 0.002809 9799
Lagonosticta_rhodopareia 6 1020850 961277 27767 9947 0.008752 0.003191 9714
Lagonosticta_rubricata_congica 6 1064587 997387 26398 10138 0.010114 0.003007 9887
Lagonosticta_rubricata_rubricata 6 1079701 1009532 22306 8723 0.008899 0.003063 8456
Lagonosticta_rufopicta 6 898117 846834 25233 10023 0.006609 0.003789 9802
Lagonosticta_sanguinodorsalis 6 1034739 980499 27716 9046 0.009435 0.004015 8677
Lagonosticta_senegala_rendalli 6 834688 786701 21363 9960 0.007350 0.002776 9747
Lagonosticta_senegala_rhodopsis 6 792799 749508 26921 9379 0.006013 0.003181 9194

In [5]:
## or read the full stats file as a bash command (cat)
!cat $vm4.stats_files.s7



## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci              13949              0          13949
filtered_by_rm_duplicates            1645           1645          12304
filtered_by_max_indels                233            233          12071
filtered_by_max_snps                  268              9          12062
filtered_by_max_shared_het             83             55          12007
filtered_by_min_sample               3583           3310           8697
filtered_by_max_alleles               724            251           8446
total_filtered_loci                  8446              0           8446


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

                               sample_coverage
Anomalospiza_imberbis                     3921
Vidua_chalybeata_amauropteryx             5701
Vidua_chalybeata_neumanni                 6548
Vidua_fischeri                            6296
Vidua_hypocherina                         6377
Vidua_interjecta                          6590
Vidua_macroura_arenosa                    6548
Vidua_macroura_macroura                   6396
Vidua_obtusa                              6898
Vidua_orientalis                          6746
Vidua_paradisaea                          6699
Vidua_purpurascens                        6674
Vidua_raricola                            6991
Vidua_regia                               6267


## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci

    locus_coverage  sum_coverage
1                0             0
2                0             0
3                0             0
4              915           915
5              480          1395
6              444          1839
7              334          2173
8              368          2541
9              348          2889
10             400          3289
11             423          3712
12             760          4472
13            1895          6367
14            2079          8446


## The distribution of SNPs (var and pis) per locus.
## var = Number of loci with n variable sites (pis + autapomorphies)
## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)
## ipyrad API location: [assembly].stats_dfs.s7_snps

     var  sum_var   pis  sum_pis
0    941        0  3168        0
1   1169     1169  2220     2220
2   1144     3457  1454     5128
3   1096     6745   813     7567
4    855    10165   384     9103
5    723    13780   200    10103
6    629    17554    94    10667
7    516    21166    63    11108
8    391    24294    26    11316
9    293    26931     9    11397
10   221    29141     7    11467
11   151    30802     3    11500
12   115    32182     2    11524
13    74    33144     1    11537
14    46    33788     2    11565
15    34    34298     0    11565
16    20    34618     0    11565
17    11    34805     0    11565
18    11    35003     0    11565
19     5    35098     0    11565
20     1    35118     0    11565


## Final Sample stats summary

                               state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens  loci_in_assembly
Anomalospiza_imberbis              7     935241               889028           19116              8770       0.008  3.823e-03           8584              3921
Vidua_chalybeata_amauropteryx      7    1031674              1028936           19674              8620       0.003  6.740e-04           8535              5701
Vidua_chalybeata_neumanni          7     709824               678382           22053              9035       0.005  3.346e-03           8879              6548
Vidua_fischeri                     7     998161               935631           26309              9733       0.005  2.578e-03           9595              6296
Vidua_hypocherina                  7     889818               844395           23806              9539       0.004  2.698e-03           9406              6377
Vidua_interjecta                   7     741675               708862           22383              9345       0.007  2.864e-03           9180              6590
Vidua_macroura_arenosa             7     939649               891987           26536             10385       0.010  3.643e-03           9994              6548
Vidua_macroura_macroura            7     729322               690496           22449              9519       0.007  2.658e-03           9350              6396
Vidua_obtusa                       7     809186               763098           26641             10039       0.006  3.828e-03           9834              6898
Vidua_orientalis                   7     862619               824073           25127              9821       0.006  2.730e-03           9666              6746
Vidua_paradisaea                   7     833981               791532           26286              9636       0.007  4.084e-03           9360              6699
Vidua_purpurascens                 7     927116               883478           20212              9335       0.005  2.489e-03           9196              6674
Vidua_raricola                     7     956686               907898           27028             10303       0.007  2.545e-03          10115              6991
Vidua_regia                        7    1012887               965657           26355              9798       0.004  2.482e-03           9657              6267

In [6]:
## the same full stats for lagon
!cat $lm4.stats_files.s7



## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci              14533              0          14533
filtered_by_rm_duplicates            1717           1717          12816
filtered_by_max_indels                282            282          12534
filtered_by_max_snps                  264             14          12520
filtered_by_max_shared_het             84             65          12455
filtered_by_min_sample               5209           4689           7766
filtered_by_max_alleles               797            268           7498
total_filtered_loci                  7498              0           7498


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

                                  sample_coverage
Clytospiza_monteiri                          4776
Lagonosticta_larvata                         6057
Lagonosticta_rara                            5737
Lagonosticta_rhodopareia                     6061
Lagonosticta_rubricata_congica               6183
Lagonosticta_rubricata_rubricata             4741
Lagonosticta_rufopicta                       5500
Lagonosticta_sanguinodorsalis                5512
Lagonosticta_senegala_rendalli               5491
Lagonosticta_senegala_rhodopsis              5309


## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci

    locus_coverage  sum_coverage
1                0             0
2                0             0
3                0             0
4             1141          1141
5              794          1935
6              840          2775
7              752          3527
8              911          4438
9             1359          5797
10            1701          7498


## The distribution of SNPs (var and pis) per locus.
## var = Number of loci with n variable sites (pis + autapomorphies)
## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)
## ipyrad API location: [assembly].stats_dfs.s7_snps

    var  sum_var   pis  sum_pis
0   313        0  2301        0
1   640      640  1969     1969
2   806     2252  1296     4561
3   895     4937   835     7066
4   812     8185   521     9150
5   808    12225   255    10425
6   705    16455   151    11331
7   617    20774    73    11842
8   516    24902    41    12170
9   375    28277    25    12395
10  277    31047    16    12555
11  196    33203     8    12643
12  197    35567     3    12679
13  119    37114     2    12705
14   84    38290     0    12705
15   54    39100     1    12720
16   42    39772     1    12736
17   17    40061     0    12736
18   16    40349     0    12736
19    3    40406     0    12736
20    6    40526     0    12736


## Final Sample stats summary

                                  state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens  loci_in_assembly
Clytospiza_monteiri                   7    1220879              1168704           32301             10216       0.008      0.004           9775              4776
Lagonosticta_larvata                  7    1001743               944653           25409             10557       0.008      0.003          10378              6057
Lagonosticta_rara                     7     992534               934386           29094             10016       0.009      0.003           9799              5737
Lagonosticta_rhodopareia              7    1020850               961277           27767              9947       0.009      0.003           9714              6061
Lagonosticta_rubricata_congica        7    1064587               997387           26398             10138       0.010      0.003           9887              6183
Lagonosticta_rubricata_rubricata      7    1079701              1009532           22306              8723       0.009      0.003           8456              4741
Lagonosticta_rufopicta                7     898117               846834           25233             10023       0.007      0.004           9802              5500
Lagonosticta_sanguinodorsalis         7    1034739               980499           27716              9046       0.009      0.004           8677              5512
Lagonosticta_senegala_rendalli        7     834688               786701           21363              9960       0.007      0.003           9747              5491
Lagonosticta_senegala_rhodopsis       7     792799               749508           26921              9379       0.006      0.003           9194              5309

Analysis tools

Thee is a lot more information about analysis tools in the ipyrad documentation. But here I'll show just a quick example of how you can easily access the data files for these assemblies and use them in downstream analysis software. The ipyrad analysis tools include convenient wrappers to make it easier to parallelize analyses of RAD-seq data. You should still read the full tutorial of the software you are using to understand the full scope of the parameters involved and their impacts, but once you understand that, the ipyrad analysis tools provide an easy way to setup up scripts to sample different distributions of SNPs and to run many replicates in parallel.


In [13]:
import ipyrad.analysis as ipa

In [14]:
## you can re-load assemblies at a later time from their JSON file
min4 = ip.load_json("analysis-ipyrad/Finches/vidua_min4.json")
min10 = ip.load_json("analysis-ipyrad/Finches/vidua_min10.json")


loading Assembly: vidua_min4
from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min4.json
loading Assembly: vidua_min10
from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min10.json

RAxML -- ML concatenation tree inference


In [29]:
## conda install raxml -c bioconda
## conda install toytree -c eaton-lab

In [48]:
## create a raxml analysis object for the min13 data sets
rax = ipa.raxml(
    name=min10.name, 
    data=min10.outfiles.phy,
    workdir="analysis-raxml",
    T=20,
    N=100,    
    o=[i for i in min10.samples if "Ano" in i],
    )

In [49]:
## print the raxml command and call it
print rax.command
rax.run(force=True)


raxmlHPC-PTHREADS-SSE3 -f a -T 20 -m GTRGAMMA -N 100 -x 12345 -p 54321 -n vidua_min10 -w /home/deren/Documents/ipyrad/tests/analysis-raxml -s /home/deren/Documents/ipyrad/tests/analysis-ipyrad/Finches/vidua_min10_outfiles/vidua_min10.phy -o Anomalospiza_imberbis
job vidua_min10 finished successfully

In [50]:
## access the resulting tree files
rax.trees


Out[50]:
bestTree                   ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bestTree.vidua_min10
bipartitions               ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bipartitions.vidua_min10
bipartitionsBranchLabels   ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bipartitionsBranchLabels.vidua_min10
bootstrap                  ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bootstrap.vidua_min10
info                       ~/Documents/ipyrad/tests/analysis-raxml/RAxML_info.vidua_min10

In [51]:
## plot a tree in the notebook with toytree
import toytree
tre = toytree.tree(rax.trees.bipartitions)
tre.root(wildcard="Ano")
tre.draw(
    width=350,
    height=400,
    node_labels=tre.get_node_values("support"),
    #use_edge_lengths=True,
    );


Anomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaidx: 1 name: 1 dist: 0.0202770172748 support: 100100idx: 2 name: 2 dist: 0.00497052067772 support: 100100idx: 3 name: 3 dist: 0.000609155416378 support: 100100idx: 4 name: 4 dist: 0.00079846260878 support: 100100idx: 5 name: 5 dist: 0.000828930900595 support: 100100idx: 6 name: 6 dist: 0.000759628385216 support: 9696idx: 7 name: 7 dist: 0.00392468254923 support: 100100idx: 8 name: 8 dist: 0.00369545217793 support: 100100idx: 9 name: 9 dist: 0.00054103209537 support: 100100idx: 10 name: 10 dist: 0.00167952306397 support: 100100idx: 11 name: 11 dist: 0.00093349964352 support: 100100idx: 12 name: 12 dist: 0.000395638998582 support: 9999

tetrad -- quartet tree inference


In [39]:
## create a tetrad analysis object
tet = ipa.tetrad(
    name=min4.name, 
    seqfile=min4.outfiles.snpsphy,
    mapfile=min4.outfiles.snpsmap,
    nboots=100,
    )


loading seq array [14 taxa x 35118 bp]
max unlinked SNPs per quartet (nloci): 7505

In [40]:
## run tree inference
tet.run(ipyclient)


host compute node: [40 cores] on tinus
inferring 1001 induced quartet trees
[####################] 100%  initial tree | 0:00:04 |  
[####################] 100%  boot 100     | 0:01:35 |  

In [41]:
## access tree files
tet.trees


Out[41]:
boots   ~/Documents/ipyrad/tests/analysis-tetrad/vidua_min4.boots
cons    ~/Documents/ipyrad/tests/analysis-tetrad/vidua_min4.cons
nhx     ~/Documents/ipyrad/tests/analysis-tetrad/vidua_min4.nhx
tree    ~/Documents/ipyrad/tests/analysis-tetrad/vidua_min4.tree

In [112]:
## plot results (just like above, but unrooted by default)
import toytree
qtre = toytree.tree(tet.trees.nhx)
qtre.root(wildcard="Ano")
qtre.draw(
    width=350,
    height=400,
    node_labels=tre.get_node_values("support"),
    );


Anomalospiza_imberbisVidua_interjectaVidua_orientalisVidua_paradisaeaVidua_obtusaVidua_hypocherinaVidua_macroura_arenosaVidua_macroura_macrouraVidua_regiaVidua_fischeriVidua_purpurascensVidua_chalybeata_amauropteryxVidua_raricolaVidua_chalybeata_neumanniidx: 1 name: added-node dist: 100 support: 100100idx: 2 name: 9 dist: 100 support: 100100idx: 3 name: 10 dist: 100 support: 100100idx: 4 name: 11 dist: 100 support: 100100idx: 5 name: 5 dist: 97 support: 9797idx: 6 name: 6 dist: 75 support: 7575idx: 7 name: 7 dist: 100 support: 100100idx: 8 name: 8 dist: 100 support: 100100idx: 9 name: 1 dist: 81 support: 8181idx: 10 name: 2 dist: 100 support: 100100idx: 11 name: 3 dist: 100 support: 100100idx: 12 name: 4 dist: 54 support: 5454

In [54]:
## draw a cloud-tree to see variation among bootstrap trees
## note that the trees are UNROOTED here, but tips are in the 
## same order in all trees.
boots = toytree.multitree(tet.trees.boots, fixed_order=tre.get_tip_labels())
boots.draw_cloudtree(orient='right', edge_style={"opacity": 0.05});


Anomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricola

STRUCTURE -- population cluster inference


In [ ]:
## conda install structure clumpp -c ipyrad

In [56]:
## create a structure analysis object for the no-outgroup data set
struct = ipa.structure(
    name=min10.name, 
    data=min10.outfiles.str, 
    mapfile=min10.outfiles.snpsmap,    
)

## set params for analysis (should be longer in real analyses)
struct.mainparams.burnin=1000
struct.mainparams.numreps=8000

In [57]:
## run structure across 10 random replicates of sampled unlinked SNPs
for kpop in [2, 4, 6, 8]:
    struct.run(kpop=kpop, nreps=10, ipyclient=ipyclient)


submitted 10 structure jobs [vidua_min10-K-2]
submitted 10 structure jobs [vidua_min10-K-4]
submitted 10 structure jobs [vidua_min10-K-6]
submitted 10 structure jobs [vidua_min10-K-8]

In [59]:
## wait for all of these jobs to finish
ipyclient.wait()


Out[59]:
True

Use Clumpp to permute replicates


In [64]:
## these options make it run faster
struct.clumppparams.m = 3               ## use largegreedy algorithm
struct.clumppparams.greedy_option = 2   ## test nrepeat possible orders
struct.clumppparams.repeats = 10000     ## number of repeats

In [65]:
## collect results
tables = {}
for kpop in [2, 4, 6, 8]:
    tables[kpop] = struct.get_clumpp_table(kpop)


mean scores across 10 replicates.
mean scores across 10 replicates.
mean scores across 10 replicates.
mean scores across 10 replicates.

Plot the results as a barplot

Usually a next step in a structure analysis would be do some kind of statistical analysis to compare models and identify K values that fit the data well.


In [66]:
## order of bars will be taken from ladderized tree above
myorder = tre.get_tip_labels()

In [69]:
## import toyplot (packaged with toytree) 
import toyplot

## plot bars for each K-value (mean of 10 reps)
for kpop in [2, 4, 6, 8]:
    table = tables[kpop]
    table = table.ix[myorder]
    
    ## plot barplot w/ hover
    canvas, axes, mark = toyplot.bars(
                            table, 
                            title=[[i] for i in table.index.tolist()],
                            width=400, 
                            height=200, 
                            yshow=False,                            
                            style={"stroke": toyplot.color.near_black},
                            )


Anomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricola051015
Anomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricola051015
Anomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricola051015
Anomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricolaAnomalospiza_imberbisVidua_orientalisVidua_interjectaVidua_obtusaVidua_paradisaeaVidua_hypocherinaVidua_macroura_macrouraVidua_macroura_arenosaVidua_fischeriVidua_regiaVidua_chalybeata_amauropteryxVidua_purpurascensVidua_chalybeata_neumanniVidua_raricola051015

TREEMIX -- ML tree & admixture co-inference


In [70]:
## conda install treemix -c ipyrad

In [93]:
## group taxa into 'populations'
imap = {
    'orient': ['Vidua_orientalis'],
    'interj': ['Vidua_interjecta'],
    'obtusa': ['Vidua_obtusa'],
    'paradi': ['Vidua_paradisaea'],
    'hypoch': ['Vidua_hypocherina'],
    'macrou': ['Vidua_macroura_macroura', 'Vidua_macroura_arenosa'],
    'fische': ['Vidua_fischeri'],
    'regia' : ['Vidua_regia'],
    'chalyb': ['Vidua_chalybeata_amauropteryx', 'Vidua_chalybeata_neumanni'],
    'purpur': ['Vidua_purpurascens'],
    'rarico': ['Vidua_raricola'],
    #'outgro': ['Anomalospiza_imberbis'],
}

## optional: loci will be filtered if they do not have data for at
## least N samples in each species. Minimums cannot be <1.
minmap = {
    'orient': 1,
    'interj': 1,
    'obtusa': 1,
    'paradi': 1,
    'hypoch': 1,
    'macrou': 2,
    'fische': 1,
    'regia' : 1,
    'chalyb': 2,
    'purpur': 1,
    'rarico': 1,
    #'outgro': 1,
    }

In [104]:
## create a treemix analysis object
tmix = ipa.treemix(
    name=min10.name, 
    data=min10.outfiles.snpsphy,
    imap=imap,
    minmap=minmap,  
    )

## set params on treemix object
tmix.params.m = 1
tmix.params.root = "interj,orient,paradi,obtusa"
tmix.params.global_ = 1

In [105]:
## you can simply write the input files and run them externally
## or, as we show below, use the .run() command to run them here.
tmix.write_output_file()


ntaxa 14; nSNPs total 35118; nSNPs written 16364

In [108]:
## a dictionary for storing treemix objects
tdict = {}

## iterate over values of m
for rep in xrange(4):
    for mig in xrange(4):
        
        ## create new treemix object copy
        name = "mig-{}-rep-{}".format(mig, rep)
        tmp = tmix.copy(name)

        ## set params on new object
        tmp.params.m = mig
    
        ## run treemix analysis
        tmp.run()
        
        ## store the treemix object
        tdict[name] = tmp

In [110]:
import toyplot
import numpy as np

canvas = toyplot.Canvas(width=800, height=1200)
idx = 0
for mig in range(4):
    for rep in range(4):
        tmp = tdict["mig-{}-rep-{}".format(mig, rep)]
        ax = canvas.cartesian(grid=(4, 4, idx), padding=25, margin=(25, 50, 100, 25))
        ax = tmp.draw(ax)
        idx += 1


orientinterjparadiobtusamacrouhypochfischeregiararicopurpurchalyb-0.10-0.050.000.05Drift parameterorientinterjparadiobtusamacrouhypochfischeregiararicopurpurchalyb-0.10-0.050.000.05Drift parameterorientinterjparadiobtusamacrouhypochfischeregiararicopurpurchalyb-0.10-0.050.000.05Drift parameterorientinterjparadiobtusamacrouhypochfischeregiararicopurpurchalyb-0.10-0.050.000.05Drift parameterorientinterjparadiobtusamacrouhypochfischeregiararicopurpurchalybweight: 0.491017-0.10-0.050.000.05Drift parameterorientinterjparadiobtusamacrouhypochfischeregiararicopurpurchalybweight: 0.491017-0.10-0.050.000.05Drift parameterorientinterjparadiobtusamacrouhypochfischeregiararicopurpurchalybweight: 0.491017-0.10-0.050.000.05Drift parameterorientinterjparadiobtusamacrouhypochfischeregiararicopurpurchalybweight: 0.491017-0.10-0.050.000.05Drift parameterorientinterjobtusaparadihypochmacrouregiafischeraricopurpurchalybweight: 0.460687weight: 0.0121953-0.10-0.050.000.05Drift parameterorientinterjobtusaparadihypochmacrouregiafischeraricopurpurchalybweight: 0.460687weight: 0.0121953-0.10-0.050.000.05Drift parameterorientinterjobtusaparadihypochmacrouregiafischeraricopurpurchalybweight: 0.460687weight: 0.0121953-0.10-0.050.000.05Drift parameterorientinterjobtusaparadihypochmacrouregiafischeraricopurpurchalybweight: 0.460687weight: 0.0121953-0.10-0.050.000.05Drift parameterorientinterjobtusaparadihypochmacrouraricochalybpurpurregiafischeweight: 0.0132624weight: 0.442482weight: 0.278645-0.15-0.10-0.050.000.05Drift parameterorientinterjobtusaparadihypochmacrouraricochalybpurpurregiafischeweight: 0.0132624weight: 0.442482weight: 0.278645-0.15-0.10-0.050.000.05Drift parameterorientinterjobtusaparadihypochmacrouraricochalybpurpurregiafischeweight: 0.0132624weight: 0.442482weight: 0.278645-0.15-0.10-0.050.000.05Drift parameterorientinterjobtusaparadihypochmacrouraricochalybpurpurregiafischeweight: 0.0132624weight: 0.442482weight: 0.278645-0.15-0.10-0.050.000.05Drift parameter

baba: D-statistic (ABBA-BABA) admixture inference/hypothesis testing


In [15]:
## create a baba analysis object
bb = ipa.baba(
    data=min4.outfiles.loci,
    newick="analysis-raxml/RAxML_bestTree.vidua_min10" 
)

In [16]:
## this will generate tests from the tree, using constraints.
bb.generate_tests_from_tree(
    constraint_exact=False,
    constraint_dict={
        "p4": ['Anomalospiza_imberbis'],
        'p3': ['Vidua_macroura_macroura', 'Vidua_macroura_arenosa'],
    })


168 tests generated from tree

In [17]:
## run inference and significance testing on tests
bb.run(ipyclient=ipyclient)


[####################] 100%  calculating D-stats  | 0:00:24 |  

In [18]:
## sorted results for the tests performed 
bb.results_table.sort_values(by="Z", ascending=False)


Out[18]:
dstat bootmean bootstd Z ABBA BABA nloci
25 -0.195 -1.952e-01 0.071 2.751 43.250 64.250 3207
22 0.212 2.116e-01 0.078 2.729 61.000 39.625 3143
1 -0.186 -1.816e-01 0.069 2.701 44.656 65.094 3262
13 -0.192 -1.888e-01 0.074 2.608 41.875 61.750 3186
15 -0.212 -2.081e-01 0.084 2.528 39.625 61.000 3143
... ... ... ... ... ... ... ...
120 -0.009 -2.353e-03 0.104 0.085 34.750 35.375 2717
157 0.004 5.400e-03 0.085 0.047 46.812 46.438 3222
0 0.002 4.923e-03 0.071 0.030 58.312 58.062 3268
76 -0.003 -6.335e-04 0.094 0.027 36.500 36.688 2765
70 0.003 1.563e-03 0.095 0.027 36.688 36.500 2765

168 rows × 7 columns

summary

The top results show greatest support for admixture between V. macroura and V. paradisaea, though the signal is not very strong (Z=2.7).


In [19]:
## the test that had the most significant result: (BABA)
bb.tests[25]


Out[19]:
{'p1': ['Vidua_paradisaea'],
 'p2': ['Vidua_orientalis', 'Vidua_interjecta'],
 'p3': ['Vidua_macroura_arenosa'],
 'p4': ['Anomalospiza_imberbis']}

In [20]:
## next best (ABBA)
bb.tests[22]


Out[20]:
{'p1': ['Vidua_orientalis'],
 'p2': ['Vidua_paradisaea'],
 'p3': ['Vidua_macroura_macroura'],
 'p4': ['Anomalospiza_imberbis']}