Novel-taxa classification evaluation

The following notebooks describe the evaluation of taxonomy classifiers using "novel taxa" datasets. Novel-taxa analysis is a form of cross-validated taxonomic classification with three steps: random unique sequences are sampled from the reference database as a test set; all sequences sharing their taxonomic affiliation at a given taxonomic level are removed from the reference database (the training set); and taxonomy is assigned to the query sequences at that taxonomic level. This test therefore interrogates the behavior of a taxonomy classifier when challenged with "novel" sequences that have no close matches in the reference sequence database. Its purpose is to assess the degree to which "overassignment" occurs for sequences that are not represented in a reference database.

At each level L, unique taxonomic clades are randomly sampled and used as QUERY sequences. All sequences that match a sampled taxonomic annotation at L are excluded from REF. Hence, species-level QUERY assignment asks how accurately a classifier assigns an "unknown" species that is not represented in REF, though other species in the same genus are. Genus-level QUERY assignment asks how accurately it assigns an "unknown" genus that is not represented in REF, though other genera in the same family are, and so on up the taxonomic levels.
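The sampling and exclusion described above can be sketched in a few lines of Python. This is a minimal illustration, not the notebooks' implementation: the function name `novel_taxa_split`, the layout of `source` (sequence ID mapped to a list of taxon names), and the `n_query` and `seed` parameters are all assumptions made for the example. It also omits the restriction to branching taxa described under Definitions.

```python
import random


def novel_taxa_split(source, level, n_query, seed=0):
    """Split `source` into QUERY and REF at rank index `level`.

    `source` maps sequence IDs to lineages (lists of taxon names).
    This layout is hypothetical, chosen only for the sketch.
    """
    rng = random.Random(seed)

    # Group sequence IDs by their lineage truncated at `level`.
    clades = {}
    for seq_id, lineage in source.items():
        clades.setdefault(tuple(lineage[:level + 1]), []).append(seq_id)

    # Randomly pick clades, then one sequence from each as QUERY.
    chosen = rng.sample(sorted(clades), min(n_query, len(clades)))
    query = {}
    for clade in chosen:
        seq_id = rng.choice(sorted(clades[clade]))
        query[seq_id] = source[seq_id]

    # REF drops every sequence that shares a sampled clade at `level`.
    removed = set(chosen)
    ref = {seq_id: lineage for seq_id, lineage in source.items()
           if tuple(lineage[:level + 1]) not in removed}
    return query, ref


# Toy three-rank lineages (family, genus, species), so index 2 = species here.
toy_source = {
    "s1": ["Lactobacillaceae", "Lactobacillus", "Lactobacillus brevis"],
    "s2": ["Lactobacillaceae", "Lactobacillus", "Lactobacillus brevis"],
    "s3": ["Lactobacillaceae", "Lactobacillus", "Lactobacillus sanfranciscensis"],
    "s4": ["Lactobacillaceae", "Pediococcus", "Pediococcus damnosus"],
}
query, ref = novel_taxa_split(toy_source, level=2, n_query=1)
```

In this toy run, whichever species is sampled as QUERY disappears entirely from `ref`, while its sister species in the same genus (if any) remain available for assignment.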

The steps involved in preparing and executing novel-taxa analysis are described in a series of notebooks:

1) Novel taxa dataset generation only needs to be performed once for a given reference database. Only run this notebook if you wish to make novel taxa datasets from a different reference database, or to alter the parameters used to make the novel taxa datasets. The default included in Tax-Credit is the Greengenes 13_8 release, amplified in silico with primers 515f and 806r, and trimmed to 250 nt from the 5' end.

2) Taxonomic classification of novel taxa sequences is performed using the datasets generated in step 1. This notebook currently describes classification using QIIME 1 classifiers and can serve as a template for any classifier that is called via a command line interface. Python-based classifiers can be used by following the example of q2-feature-classifier.

3) Classifier evaluation is performed based on taxonomic classifications generated by each classifier used in step 2.
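As a rough illustration of what the evaluation in step 3 might measure, the sketch below sorts a single assignment into one of four outcomes. The category names, the function `categorize_assignment`, and the exact rules are assumptions made for this example and are not taken from the evaluation notebook. The sketch assumes that, because the query's taxon at level L is absent from REF, the best achievable answer is the correct lineage down to level L - 1.

```python
def categorize_assignment(expected, observed, level):
    """Categorize one novel-taxa assignment (hypothetical scheme).

    expected: true lineage of the query, a list with index 0 = kingdom.
    observed: lineage assigned by the classifier (may be shorter).
    level:    the level L at which the query taxon was removed from REF.
    """
    expected, observed = list(expected), list(observed)
    target = expected[:level]   # lineage through L - 1, still present in REF
    shared = observed[:level]   # the part of the assignment above level L

    if shared != target[:len(shared)]:
        return "misclassification"    # wrong lineage above level L
    if len(observed) > level:
        return "overclassification"   # claimed a taxon at L or deeper
    if len(observed) == level:
        return "match"                # correct lineage, stopped at L - 1
    return "underclassification"      # correct lineage, stopped too early


truth = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
         "Lactobacillaceae", "Lactobacillus", "Lactobacillus brevis"]

# Species-level test (L = 6): the classifier names a sister species.
result = categorize_assignment(
    truth, truth[:6] + ["Lactobacillus sanfranciscensis"], level=6)
```

Here `result` is "overclassification", the "overassignment" behavior that the novel-taxa test is designed to expose.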

Definitions

The dataset generation notebook uses a few terms of its own. They are defined below.

source = original reference database sequences and taxonomy.
QUERY = 'novel' query sequences and taxonomies randomly drawn from source.
REF = source - novel taxa, used for taxonomy assignment.
L = taxonomic level being tested.
- 0 = kingdom, 1 = phylum, 2 = class, 3 = order, 4 = family, 5 = genus, 6 = species
branching = a taxon at level L that "branches" into two or more lineages at L + 1.
- A "branched" taxon, then, is one of these lineages. E.g., in the example below Lactobacillaceae, Lactobacillus, and Pediococcus are branching, while Paralactobacillus is unbranching. The Lactobacillus and Pediococcus species are "branched". Paralactobacillus selangorensis is "unbranched".
- The novel taxa analysis only uses "branching" taxa, such that for each QUERY at level L, REF must contain one or more other taxa that share the same clade at level L - 1.

Lactobacillaceae
           ├── Lactobacillus
           │         ├── Lactobacillus brevis
           │         └── Lactobacillus sanfranciscensis
           ├── Pediococcus
           │         ├── Pediococcus damnosus
           │         └── Pediococcus claussenii
           └── Paralactobacillus
                     └── Paralactobacillus selangorensis
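
The branching and branched definitions can be checked against the example tree with a short Python sketch. The helper names `branching_taxa` and `branched_taxa`, the tuple representation of lineages, and the `depth` index (position within these three-rank toy tuples, 0 = family, not the level numbering L used above) are all assumptions for illustration and do not come from the dataset generation notebook.

```python
from collections import defaultdict

# The example tree, one tuple per species: (family, genus, species).
lineages = [
    ("Lactobacillaceae", "Lactobacillus", "Lactobacillus brevis"),
    ("Lactobacillaceae", "Lactobacillus", "Lactobacillus sanfranciscensis"),
    ("Lactobacillaceae", "Pediococcus", "Pediococcus damnosus"),
    ("Lactobacillaceae", "Pediococcus", "Pediococcus claussenii"),
    ("Lactobacillaceae", "Paralactobacillus", "Paralactobacillus selangorensis"),
]


def branching_taxa(lineages, depth):
    """Taxa at `depth` with two or more distinct children at depth + 1."""
    children = defaultdict(set)
    for lineage in lineages:
        if len(lineage) > depth + 1:
            children[lineage[:depth + 1]].add(lineage[depth + 1])
    return {taxon for taxon, kids in children.items() if len(kids) >= 2}


def branched_taxa(lineages, depth):
    """Taxa at `depth` whose parent at depth - 1 is branching."""
    parents = branching_taxa(lineages, depth - 1)
    return {lineage[:depth + 1] for lineage in lineages
            if len(lineage) > depth and lineage[:depth] in parents}


branching_genera = sorted(t[-1] for t in branching_taxa(lineages, 1))
branched_species = sorted(t[-1] for t in branched_taxa(lineages, 2))
```

With this input, `branching_genera` contains Lactobacillus and Pediococcus, and `branched_species` contains their four species. Paralactobacillus selangorensis is excluded because removing it would leave no sister species in REF.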