Format reference databases

This notebook describes how to format reference databases for taxonomy classification. Taxonomy classification will be more accurate, and faster, if the reference sequences are trimmed to the same region and read length as the query sequences. In the case of marker gene sequences, this means that the reference dataset should consist of simulated amplicons. This notebook only needs to be run once for a given reference database/primer combination, unless the length of the trimmed reference sequences changes.
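Conceptually, a simulated amplicon is the subsequence of a reference sequence lying between the forward primer site and the reverse primer site, truncated to the read length. The sketch below illustrates the idea in plain Python with exact string matching only; it is a simplification, since real extraction tools (such as the one used later in this notebook) handle degenerate bases and primer mismatches.

```python
def simulate_amplicon(reference, fwd_primer, rev_primer_rc, read_length):
    """Extract the region between primer sites and trim to read_length.

    Exact-match sketch only: rev_primer_rc must already be the
    reverse complement of the reverse primer, and degenerate bases
    are not expanded.
    """
    start = reference.find(fwd_primer)
    if start == -1:
        return None  # forward primer not found: discard this reference
    end = reference.find(rev_primer_rc, start + len(fwd_primer))
    if end == -1:
        return None  # reverse primer site not found: discard
    amplicon = reference[start + len(fwd_primer):end]
    return amplicon[:read_length]

# Toy reference: flanking sequence + primer sites around a 16 nt insert
ref = "AAACCC" + "GTGCCA" + "TTTTGGGGCCCCAAAA" + "ATTAGA" + "GGGTTT"
print(simulate_amplicon(ref, "GTGCCA", "ATTAGA", 8))  # TTTTGGGG
```

References in which a primer site cannot be located are discarded, which is exactly the filtering behavior described for the real extraction step below.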

The first step is to retrieve an appropriate reference database. We will use the Greengenes 13_8 QIIME release for classification of 16S rRNA sequences, and UNITE v. 7.1 for fungal ITS sequences. Here we will use "cleaned" versions that have ambiguous taxa strings removed, which are located in the tax-credit reference sequence collection. These are the most recent versions at the time of writing, but you may wish to check for more recent updates for Greengenes and UNITE, and re-"clean" using this notebook (or, more directly, use tax_credit.framework_functions.clean_database()).

Any other database may be used; simply provide the filepath(s) to the desired database(s) in the cell below.
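Each database entry is a five-element tuple: reference sequences, taxonomy file, forward primer, reverse primer, and a primer pair name used in output filenames. A hypothetical custom entry might look like the following (the filepaths here are placeholders, and the 515f/806r primer sequences are shown only as an example):

```python
from os.path import join

reference_database_dir = "ref_dbs"  # placeholder directory

custom_db = {
    'my_custom_db': (
        join(reference_database_dir, 'my_custom_db/seqs.fasta'),    # reference sequences
        join(reference_database_dir, 'my_custom_db/taxonomy.tsv'),  # taxonomy map
        'GTGCCAGCMGCCGCGGTAA',   # forward primer (515f, as an example)
        'GGACTACHVGGGTWTCTAAT',  # reverse primer (806r, as an example)
        '515f-806r',             # primer pair name, used in output filenames
    )
}
```

Additional entries of the same shape can be merged into the `reference_dbs` dictionary defined below.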

In the following cell, we define the filepaths of the reference databases that we wish to trim, as well as the primers that we will use for trimming. We will use the standard 515f/806r primers for the 16S rRNA gene V4 region, and the BITSf/B58S3r primers for the fungal ITS1 region.

In [4]:
from tax_credit.framework_functions import extract_amplicons
from os.path import expandvars, join, splitext, exists, basename

from skbio import DNA
import qiime2
from qiime2.plugins import feature_classifier

In [6]:
# Define the read length to use. This should be the same length, or larger, 
# than the actual lengths of your sequence reads.
read_length = 250

# Define the minimum acceptable amplicon length
min_read_length = 80

# Filepath to directory that contains all reference dataset directories
reference_database_dir = expandvars("$HOME/Desktop/ref_dbs/")

# Dictionary of reference dataset names, with filepaths to the reference
# sequences and taxonomy file, primer sequences, and primer pair name.
# Format: {dataset-name : (reference sequences, taxonomy file,
#                          forward primer sequence, reverse primer sequence,
#                          name of primer pair)}
# NOTE: the filenames below are examples; adjust them to match the files
# in your reference_database_dir. Verify primer sequences before use.
reference_dbs = {
    'gg_13_8_otus': (
        join(reference_database_dir, 'gg_13_8_otus/99_otus_clean.fasta'),
        join(reference_database_dir, 'gg_13_8_otus/99_otu_taxonomy_clean.tsv'),
        'GTGCCAGCMGCCGCGGTAA', 'GGACTACHVGGGTWTCTAAT', '515f-806r'),
    'unite-7.1': (
        join(reference_database_dir, 'unite-7.1/seqs_clean.fasta'),
        join(reference_database_dir, 'unite-7.1/taxonomy_clean.tsv'),
        'ACCTGCGGARGGATCA', 'GAGATCCRTTGYTRAAAGTT', 'BITSf-B58S3r'),
    'unite-7.1-ITS1Ff-ITS2r': (
        join(reference_database_dir, 'unite-7.1/seqs_clean.fasta'),
        join(reference_database_dir, 'unite-7.1/taxonomy_clean.tsv'),
        'CTTGGTCATTTAGAGGAAGTAA', 'GCTGCGTTCTTCATCGATGC', 'ITS1Ff-ITS2r'),
    'silva_123': (
        join(reference_database_dir, 'silva_123/seqs.fasta'),
        join(reference_database_dir, 'silva_123/taxonomy.tsv'),
        'GTGCCAGCMGCCGCGGTAA', 'GGACTACHVGGGTWTCTAAT', '515f-806r'),
    'silva_123_clean': (
        join(reference_database_dir, 'silva_123/seqs_clean.fasta'),
        join(reference_database_dir, 'silva_123/taxonomy_clean.tsv'),
        'GTGCCAGCMGCCGCGGTAA', 'GGACTACHVGGGTWTCTAAT', '515f-806r'),
}

Now run the following cell. This will generate simulated amplicons from each reference sequence, trim the amplicons to read_length nt, and write the trimmed reads to a new file in the same location as the source file. If the original file was called my_sequences.fasta, and amplicons were generated using 515f-806r with read_length = 250, the trimmed reads will be written as my_sequences_515f-806r_trim250.qza. Sequences that do not align to the primers will be discarded, as will amplicons shorter than min_read_length.
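For instance, the output filename is assembled from the source file's basename, the primer pair name, and the read length:

```python
from os.path import splitext

# Build the output filename the same way the cell below does
base, ext = splitext('my_sequences.fasta')
reads_out = '{0}_{1}_trim{2}{3}'.format(base, '515f-806r', 250, '.qza')
print(reads_out)  # my_sequences_515f-806r_trim250.qza
```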

In [7]:
for db, data in reference_dbs.items():
    seqs, taxa, fwd_primer, rev_primer, primer_pair = data
    base, ext = splitext(seqs)
    reads_out = '{0}_{1}_trim{2}{3}'.format(
        base, primer_pair, read_length, '.qza')
    if exists(reads_out):
        print('file exists: ' + reads_out)
    else:
        seqs_in = qiime2.Artifact.import_data('FeatureData[Sequence]', seqs)
        reads = feature_classifier.methods.extract_reads(
            sequences=seqs_in, length=read_length,
            f_primer=fwd_primer, r_primer=rev_primer)
        reads.reads.save(reads_out)
        # Legacy (pre-QIIME 2) approach:
        # extract_amplicons(clean_fasta, amplicons_out, reads_out,
        #                   DNA(fwd_primer), DNA(rev_primer),
        #                   read_length, min_read_length)

In [ ]: