Simulate a really big data set for testing

This is will be used to test memory limits in ipyrad, and also for comparing run times with other software. RAD data sets have three size dimensions that greatly affect run times: the number of taxa, the number of loci, and the length of reads. Here we will set all three to quite large values, by simulating paired-end 150bp reads for 300 taxa for 100K loci. For simplicity we do not simulate any missing data, but this is not expected to affect run times or memory limits except to the extent that it increased the total number of loci in the data set.

In [7]:
## import Python libraries
import ipyrad as ip

In [9]:

## this will take about XX minutes to run, sorry, the code is not parallelized
## we simulate 360 tips by using the default 12 taxon tree and requesting 40
## individuals per taxon. Default is theta=0.002. Crown age= 5*2Nu (check this)
simrrls -L 100000 -f pairddrad -l 150 -n 30 -o Big_i360_L100K

## because it takes a long time to simulate this data set, you can alternatively
## just download the data set that we already simulated using the code above. 
## The data set is hosted on anaconda, just run the following to get it. 
# conda download -c ipyrad bigData

Process is terminated.

Load the data set with ipyrad

In [20]:
data = ip.Assembly("bigHorsePaired")

data.set_params("project_dir", "bigdata")
data.set_params("raw_fastq_path", "bigHorsePaired_R*_.fastq.gz")
data.set_params("barcodes_path", "bigHorsePaired_barcodes.txt")
data.set_params("datatype", "pairddrad")
data.set_params("restriction_overhang", ("TGCAG", "CCGG"))

  New Assembly: bigHorsePaired
  0   assembly_name               bigHorsePaired                               
  1   project_dir                 ./bigdata                                    
  2   raw_fastq_path              ./bigHorsePaired_R*_.fastq.gz                
  3   barcodes_path               ./bigHorsePaired_barcodes.txt                
  4   sorted_fastq_path                                                        
  5   assembly_method             denovo                                       
  6   reference_sequence                                                       
  7   datatype                    pairddrad                                    
  8   restriction_overhang        ('TGCAG', 'CCGG')                            
  9   max_low_qual_bases          5                                            
  10  phred_Qscore_offset         33                                           
  11  mindepth_statistical        6                                            
  12  mindepth_majrule            6                                            
  13  maxdepth                    10000                                        
  14  clust_threshold             0.85                                         
  15  max_barcode_mismatch        0                                            
  16  filter_adapters             0                                            
  17  filter_min_trim_len         35                                           
  18  max_alleles_consens         2                                            
  19  max_Ns_consens              (5, 5)                                       
  20  max_Hs_consens              (8, 8)                                       
  21  min_samples_locus           4                                            
  22  max_SNPs_locus              (20, 20)                                     
  23  max_Indels_locus            (8, 8)                                       
  24  max_shared_Hs_locus         0.5                                          
  25  edit_cutsites               (0, 0)                                       
  26  trim_overhang               (0, 0, 0, 0)                                 
  27  output_formats              ['l', 'p', 's', 'v']                         
  28  pop_assign_file                                                          

Demultiplex the data files

All of the software methods we are comparing will use the same demultiplexed data files, so we will not count this step towards the total run times comparisons.

In [ ]:'1')

The data files

These are the demultiplexed data files that each method will analyze.

In [ ]:
ls -l bigdata/bigHorsePaired_fastqs/

Assemble the data set with ipyrad

In [ ]:"234567")

Assemble the data set with stacks

In [ ]:

Assemble the data set with ddocent

In [ ]:

Assemble the data set with AftrRAD

We do not attempt this, since this software was unable to assemble much smaller data sets.

Compare run times

In [ ]: