This notebook will be used to test memory limits in ipyrad, and also for comparing run times with other software. RAD data sets have three size dimensions that greatly affect run times: the number of taxa, the number of loci, and the length of reads. Here we set all three to quite large values by simulating paired-end 150bp reads for 360 taxa at 100K loci. For simplicity we do not simulate any missing data, which is not expected to affect run times or memory limits except to the extent that it increases the total number of loci in the data set.
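As a rough sense of scale, a back-of-the-envelope sketch using the simulation settings below (every sample has every locus since no missing data are simulated):
In [ ]:
## approximate size of the simulated data set
ntaxa, nloci, readlen = 360, 100000, 150
nreads = ntaxa * nloci            ## read pairs
nbases = nreads * 2 * readlen     ## paired-end 150bp
print("{:,} read pairs, {:.1f} Gbp of sequence".format(nreads, nbases / 1e9))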
In [7]:
## import Python libraries
import ipyrad as ip
In [9]:
%%bash
## this will take about XX minutes to run, sorry, simrrls is not parallelized.
## we simulate 360 tips by using the default 12 taxon tree and requesting 30
## individuals per taxon. Default is theta=0.002. Crown age= 5*2Nu (check this)
simrrls -L 100000 -f pairddrad -l 150 -n 30 -o bigHorsePaired
## because it takes a long time to simulate this data set, you can alternatively
## just download the data set that we already simulated using the code above.
## The data set is hosted on anaconda, just run the following to get it.
# conda install -c ipyrad bigData
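Either way, it is worth a quick sanity check on the raw files before demultiplexing. A minimal sketch: it assumes the R1/R2 files match the raw_fastq_path pattern used below, and it streams each whole file, so it is slow at this size:
In [ ]:
## count reads in each fastq file (4 lines per read)
import glob
import gzip

for fname in sorted(glob.glob("bigHorsePaired_R*_.fastq.gz")):
    nlines = sum(1 for line in gzip.open(fname))
    print(fname, nlines // 4, "reads")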
In [20]:
## create a named Assembly and set params for the paired ddRAD data
data = ip.Assembly("bigHorsePaired")
data.set_params("project_dir", "bigdata")
data.set_params("raw_fastq_path", "bigHorsePaired_R*_.fastq.gz")
data.set_params("barcodes_path", "bigHorsePaired_barcodes.txt")
data.set_params("datatype", "pairddrad")
data.set_params("restriction_overhang", ("TGCAG", "CCGG"))

## print the params to confirm the settings
data.get_params()
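ipyrad parallelizes its steps over an ipyparallel cluster, so for run-time comparisons it helps to attach a client with a known number of engines. A sketch, assuming an ipcluster was started separately (e.g., `ipcluster start --n=8` in a terminal); per the ipyrad API, run() can optionally be handed this client through its ipyclient argument, otherwise ipyrad launches a local cluster itself:
In [ ]:
## connect to a running ipcluster to control how many cores ipyrad uses
import ipyparallel as ipp

ipyclient = ipp.Client()
print(len(ipyclient.ids), "engines connected")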
In [ ]:
## step 1: demultiplex the raw reads to samples
data.run("1")
In [ ]:
ls -l bigdata/bigHorsePaired_fastqs/
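The same information is available from the API; after step 1 the stats table reports per-sample raw read counts (a quick check only — with no missing data each sample should have roughly one read pair per locus):
In [ ]:
## per-sample stats after demultiplexing
data.stats.head()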
In [ ]:
data.run("234567")