ipyrad-analysis toolkit: digest genomes

The purpose of this tool is to digest a genome file in silico using the same restriction enzymes that were used for an empirical data set to attempt to extract homologous data from the genome file. This can be a useful procedure for adding additional outgroup samples to a data set.

Required software


In [1]:
# conda install ipyrad -c bioconda

In [2]:
import ipyrad.analysis as ipa

A genome file

You will need a genome file in fasta format (optionally it can be gzip compressed).


In [3]:
genome = "/home/deren/Downloads/Ahypochondriacus_459_v2.0.fa"

Initialize the tool

You can generate single or paired-end data, and you will likely want to restrict the size of selected fragments to be within an expected size selection window, as is typically done in empirical data sets. Here I select all fragments occuring between two restriction enzymes where the intervening fragment is 300-500bp in length. I then ask that the analysis returns the digested fragments as 150bp fastq reads, and to provide 10 copies of each one.


In [4]:
digest = ipa.digest_genome(
    fasta=genome,
    name="amaranthus-digest",
    workdir="digested_genomes",
    re1="CTGCAG",
    re2="AATTC",
    ncopies=10,
    readlen=150,
    min_size=300,
    max_size=500,    
)

In [5]:
fio = open(genome)
scaffolds = fio.read().split(">")[1:]

In [14]:
ordered = sorted(scaffolds, key=lambda x: len(x), reverse=True)

In [16]:
len(ordered[0])


Out[16]:
38760089

In [5]:
digest.run()

Check results


In [6]:
ll digested_genomes/


total 23984
-rw-r--r-- 1 deren 12270437 Dec 27 14:36 amaranthus-digest_R1_.fastq.gz
-rw-r--r-- 1 deren 12285598 Dec 27 14:36 amaranthus-digest_R2_.fastq.gz