This notebook describes the generation of reference databases for both novel-taxa and simulated community analyses. Novel-taxa analysis is a form of cross-validated taxonomic classification, wherein random unique sequences are sampled from the reference database as a test set; all sequences sharing taxonomic affiliation at a given taxonomic level are removed from the reference database (training set); and taxonomy is assigned to the query sequences at the given taxonomic level. Thus, this test interrogates the behavior of a taxonomy classifier when challenged with "novel" sequences that are not represented by close matches within the reference sequence database. Such an analysis is performed to assess the degree to which "overassignment" occurs for sequences that are not represented in a reference database.
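To make the novel-taxa split concrete, here is a minimal sketch of the idea, not the `tax_credit` implementation. The function name `novel_taxa_split`, its parameters, and the toy taxonomy are hypothetical, and we assume taxonomy strings are semicolon-delimited lineages. The sketch also skips the requirement that a sister taxon remain in the training set.

```python
import random


def novel_taxa_split(taxonomy, level, n_test, seed=0):
    """Hypothetical helper: split sequence IDs into novel-taxa test/training sets.

    taxonomy: dict mapping sequence ID -> semicolon-delimited lineage string.
    level: 0-based index of the rank at which test taxa should be "novel".
    """
    rng = random.Random(seed)
    # Parse each lineage string into a tuple of rank labels.
    lineages = {sid: tuple(t.strip() for t in tax.split(';'))
                for sid, tax in taxonomy.items()}
    # Randomly sample the test (query) sequences.
    test_ids = rng.sample(sorted(lineages), n_test)
    # Lineage prefixes (down to `level`) that must vanish from the training set.
    held_out = {lineages[sid][:level + 1] for sid in test_ids}
    # Drop every sequence sharing a held-out lineage, including the test sequences.
    train_ids = [sid for sid in sorted(lineages)
                 if lineages[sid][:level + 1] not in held_out]
    return test_ids, train_ids


# Toy data (made up for illustration): three ranks per lineage.
example_taxonomy = {
    's1': 'k__A; p__B; g__X',
    's2': 'k__A; p__B; g__X',
    's3': 'k__A; p__B; g__Y',
    's4': 'k__A; p__C; g__Z',
}
test_ids, train_ids = novel_taxa_split(example_taxonomy, level=2, n_test=1)
print(test_ids, train_ids)
```

If `s1` is drawn as the query, both `s1` and `s2` leave the training set, because they share the same lineage at the chosen level.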
Simulated community analysis is a more conventional form of cross-validated classification, wherein unique sequences are randomly sampled from a reference dataset and used as a test set for taxonomic classification. The training set has those sequences removed, but keeps other sequences that share their taxonomic affiliation: it must contain taxonomies identical to those represented by the test sequences.
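By contrast, a simulated-community split removes only the sampled sequences and must leave their taxonomies represented in the training set. The following is a minimal sketch under that reading; `simulated_community_split` and the toy data are hypothetical and are not taken from `tax_credit`.

```python
import random
from collections import Counter


def simulated_community_split(taxonomy, n_test, seed=0):
    """Hypothetical helper: sample test IDs whose taxonomy stays in the training set.

    taxonomy: dict mapping sequence ID -> taxonomy string.
    May return fewer than n_test IDs if too few taxonomies have spare members.
    """
    rng = random.Random(seed)
    # Number of sequences per taxonomy string still available for training.
    remaining = Counter(taxonomy.values())
    ids = sorted(taxonomy)
    rng.shuffle(ids)
    test_ids = []
    for sid in ids:
        if len(test_ids) == n_test:
            break
        tax = taxonomy[sid]
        # Only take a sequence if another one with the same taxonomy remains.
        if remaining[tax] > 1:
            remaining[tax] -= 1
            test_ids.append(sid)
    test_set = set(test_ids)
    train_ids = [sid for sid in sorted(taxonomy) if sid not in test_set]
    return test_ids, train_ids


# Toy data (made up for illustration): only g__X has two representatives.
example_reference = {
    's1': 'k__A; p__B; g__X',
    's2': 'k__A; p__B; g__X',
    's3': 'k__A; p__B; g__Y',
    's4': 'k__A; p__C; g__Z',
}
sim_test_ids, sim_train_ids = simulated_community_split(example_reference, n_test=1)
print(sim_test_ids, sim_train_ids)
```

In the toy data only `s1` or `s2` can serve as a query, since every other taxonomy would disappear from the training set if its single sequence were removed.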
This section describes the preparation of the data sets necessary for "novel taxa" analysis. The goals of this step are:
In this first cell, we describe each data set/database as a dictionary entry: the dataset name is the key, and the value is a list containing the reference sequence fasta path, taxonomy path, database name, forward primer sequence, reverse primer sequence, forward primer name, and reverse primer name.
MODIFY these values to generate novel-taxa files for a new reference database.
In [1]:
from tax_credit.framework_functions import (generate_simulated_datasets,
                                            test_simulated_communities,
                                            test_novel_taxa_datasets)
from os.path import expandvars, join
import pandas as pd
In [2]:
project_dir = expandvars("$HOME/Desktop/projects/short-read-tax-assignment")
data_dir = join(project_dir, "data")
# List databases as fasta/taxonomy file pairs
databases = {'B1-REF': [expandvars("$HOME/Desktop/ref_dbs/gg_13_8_otus/rep_set/99_otus.fasta"),
                        expandvars("$HOME/Desktop/ref_dbs/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt"),
                        "gg_13_8_otus", "GTGCCAGCMGCCGCGGTAA", "ATTAGAWACCCBDGTAGTCC", "515f", "806r"],
             'F1-REF': [expandvars("$HOME/Desktop/ref_dbs/sh_qiime_release_20.11.2016/developer/sh_refs_qiime_ver7_99_20.11.2016_dev.fasta"),
                        expandvars("$HOME/Desktop/ref_dbs/sh_qiime_release_20.11.2016/developer/sh_taxonomy_qiime_ver7_99_20.11.2016_dev.txt"),
                        "unite_20.11.2016", "ACCTGCGGARGGATCA", "GAGATCCRTTGYTRAAAGTT", "BITSf", "B58S3r"]
             }
Now we will import these to a dataframe and view it. You should not need to modify the following cell.
In [3]:
# Arrange data set / database info in data frame
simulated_community_definitions = pd.DataFrame.from_dict(databases, orient="index")
simulated_community_definitions.columns = ["Reference file path", "Reference tax path", "Reference id",
                                           "Fwd primer", "Rev primer", "Fwd primer id", "Rev primer id"]
simulated_community_definitions
Out[3]:
Generate "clean" reference taxonomy and sequence database by removing taxonomy strings with empty or ambiguous levels.
Set simulated community parameters, including amplicon length and the number of iterations to perform. Iterations will split our query sequence files into N chunks.
This will take a few minutes to run. Get some coffee.
In [4]:
read_length = 250
iterations = 3
generate_simulated_datasets(simulated_community_definitions, data_dir, read_length, iterations)
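The notebook does not show how `generate_simulated_datasets` partitions the query sequences across iterations. As a rough illustration of splitting a list into N nearly equal chunks, a minimal sketch might look like this; `split_into_chunks` is a hypothetical helper, and the real function may partition differently.

```python
def split_into_chunks(items, n_chunks):
    """Hypothetical helper: split a list into n_chunks contiguous, near-equal parts."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


# Seven made-up query IDs split across three iterations.
query_ids = ['q0', 'q1', 'q2', 'q3', 'q4', 'q5', 'q6']
chunks = split_into_chunks(query_ids, 3)
print(chunks)  # [['q0', 'q1', 'q2'], ['q3', 'q4'], ['q5', 'q6']]
```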
For peace of mind, we can test our novel taxa and simulated community datasets to confirm that:
1) For simulated communities, test (query) taxa IDs are not in training (ref) set, but all taxonomy strings are
2) For novel taxa, test taxa IDs and taxonomies are not in training (ref) set, but sister branch taxa are
If no errors print, all tests pass.
In [5]:
test_simulated_communities(simulated_community_definitions, data_dir, iterations)
As a sanity check, confirm that novel taxa were generated successfully.
In [6]:
test_novel_taxa_datasets(simulated_community_definitions, data_dir, iterations)
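The two conditions checked above can be written as simple set comparisons. The sketch below is our own illustration, not the code behind `test_simulated_communities` or `test_novel_taxa_datasets`. The function names are hypothetical, and we assume each data set is available as a dict mapping sequence ID to a semicolon-delimited taxonomy string.

```python
def lineage_prefix(tax, depth):
    """Return the first `depth` ranks of a semicolon-delimited lineage as a tuple."""
    return tuple(t.strip() for t in tax.split(';'))[:depth]


def check_simulated_community(test_tax, train_tax):
    """Hypothetical check: IDs are disjoint, but every test taxonomy is in training."""
    problems = []
    shared_ids = set(test_tax) & set(train_tax)
    if shared_ids:
        problems.append('test IDs found in training set: %s' % sorted(shared_ids))
    missing = set(test_tax.values()) - set(train_tax.values())
    if missing:
        problems.append('test taxonomies missing from training set: %s' % sorted(missing))
    return problems


def check_novel_taxa(test_tax, train_tax, level):
    """Hypothetical check: test lineages are absent at `level`, but a sister taxon exists.

    level: 0-based index of the rank at which the test taxa should be novel.
    """
    problems = []
    shared_ids = set(test_tax) & set(train_tax)
    if shared_ids:
        problems.append('test IDs found in training set: %s' % sorted(shared_ids))
    train_at_level = {lineage_prefix(t, level + 1) for t in train_tax.values()}
    train_parents = {lineage_prefix(t, level) for t in train_tax.values()}
    for sid, tax in sorted(test_tax.items()):
        if lineage_prefix(tax, level + 1) in train_at_level:
            problems.append('%s: taxonomy still present in training set' % sid)
        # A training lineage sharing the parent but not the taxon is a sister branch.
        if lineage_prefix(tax, level) not in train_parents:
            problems.append('%s: no sister taxon in training set' % sid)
    return problems


# Toy data (made up for illustration).
novel_test = {'s1': 'k__A; p__B; g__X'}
novel_train = {'s3': 'k__A; p__B; g__Y', 's4': 'k__A; p__C; g__Z'}
sim_test = {'s1': 'k__A; p__B; g__X'}
sim_train = {'s2': 'k__A; p__B; g__X', 's3': 'k__A; p__B; g__Y'}
print(check_novel_taxa(novel_test, novel_train, level=2))
print(check_simulated_community(sim_test, sim_train))
```

An empty list means the split passed; each string in a non-empty list names one violated condition.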