Cross-validated classification evaluation

The following notebooks describe the evaluation of taxonomy classifiers using cross-validated read data sets. This is a conventional cross-validated classification: unique sequences are randomly sampled from a reference dataset and used as a test set for taxonomic classification, and the classifier is trained on a training set from which those sequences have been removed. Other sequences that share their taxonomic affiliation are not removed, so the training set must contain taxonomies identical to those represented by the test sequences.
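The sampling constraint described above can be sketched in a few lines of Python. This is only an illustration, not the code used by tax-credit: the function name `stratified_split`, the `test_fraction` and `seed` parameters, and the toy taxonomy strings are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(taxonomy_by_seq, test_fraction=0.1, seed=0):
    """Split sequence IDs into (train, test) lists so that every taxonomy
    present in the test set is also present in the training set.

    Hypothetical helper for illustration; not part of tax-credit.
    """
    rng = random.Random(seed)

    # Group sequence IDs by their full taxonomy string.
    groups = defaultdict(list)
    for seq_id, taxonomy in taxonomy_by_seq.items():
        groups[taxonomy].append(seq_id)

    train, test = [], []
    for taxonomy, seq_ids in sorted(groups.items()):
        seq_ids = sorted(seq_ids)
        rng.shuffle(seq_ids)
        # Hold out at most len - 1 sequences, so at least one sequence
        # with this taxonomy always stays in the training set.
        n_test = min(int(round(len(seq_ids) * test_fraction)),
                     len(seq_ids) - 1)
        test.extend(seq_ids[:n_test])
        train.extend(seq_ids[n_test:])
    return train, test


# Toy reference: four sequences share one taxonomy, one is a singleton.
reference = {
    's1': 'k__Bacteria;p__A',
    's2': 'k__Bacteria;p__A',
    's3': 'k__Bacteria;p__A',
    's4': 'k__Bacteria;p__A',
    's5': 'k__Bacteria;p__B',
}
train_ids, test_ids = stratified_split(reference, test_fraction=0.5, seed=1)
```

A taxonomy represented by a single sequence can never be held out under this scheme, because removing it would leave the training set without that taxonomy.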

Structuring new results for comparison to precomputed results

To prepare results from another classifier for analysis, you need tab-delimited taxonomy files that map each query sequence ID to its taxonomy assignment, one sequence per line, in the format: seqID taxonomy;label;string
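As a minimal sketch of that file format, the Python below writes and re-reads a two-column, tab-delimited taxonomy file. The helper names, the sequence IDs, and the taxonomy strings are made up for this example, and a temporary directory stands in for your real results location.

```python
import os
import tempfile


def write_tax_assignments(assignments, path):
    """Write {seq_id: taxonomy} as lines of 'seqID<TAB>taxonomy'."""
    with open(path, 'w') as handle:
        for seq_id, taxonomy in assignments.items():
            handle.write('{0}\t{1}\n'.format(seq_id, taxonomy))


def read_tax_assignments(path):
    """Read a tab-delimited taxonomy file back into {seq_id: taxonomy}."""
    assignments = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            fields = line.split('\t')
            assignments[fields[0]] = fields[1]
    return assignments


# Hypothetical assignments, for illustration only.
example = {
    'seq1': 'k__Bacteria;p__Firmicutes;c__Bacilli',
    'seq2': 'k__Bacteria;p__Proteobacteria',
}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'query_tax_assignments.txt')
    write_tax_assignments(example, out_path)
    with open(out_path) as handle:
        first_line = handle.readline()
    loaded = read_tax_assignments(out_path)
```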

An example of how to generate these is presented in the taxonomy assignment notebook in this directory, which was used to generate the precomputed data in the tax-credit repository.

Your taxonomy files should be called query_tax_assignments.txt, and nested in the following directory structure:


results_dir is the name of the top-level directory; you will set this value in the first code cell of the analysis notebooks, and you can name the directory whatever you like. cross-validated identifies the specific analysis being run, and the directory must be named cross-validated for the framework to find your results.

This directory structure is identical to that for the precomputed results. You can review that directory structure for an example of how this should look.
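Before running the analysis notebooks, it can help to confirm that your taxonomy files sit under the expected analysis directory. The sketch below is not part of the tax-credit framework: `find_tax_assignments` is a hypothetical helper, and the intermediate directory names used in the demonstration (`level-a`, `level-b`) are placeholders rather than the real nesting levels. It searches recursively, so it makes no assumption about how many levels lie between cross-validated and the files.

```python
import tempfile
from pathlib import Path


def find_tax_assignments(results_dir, analysis_name='cross-validated'):
    """Return sorted paths of every query_tax_assignments.txt found
    anywhere below results_dir/analysis_name.

    Hypothetical helper for illustration; not part of tax-credit.
    """
    analysis_dir = Path(results_dir) / analysis_name
    if not analysis_dir.is_dir():
        raise FileNotFoundError(
            'expected analysis directory not found: {0}'.format(analysis_dir))
    return sorted(analysis_dir.rglob('query_tax_assignments.txt'))


# Demonstration in a temporary directory with placeholder sub-directories.
with tempfile.TemporaryDirectory() as results_dir:
    nested = Path(results_dir) / 'cross-validated' / 'level-a' / 'level-b'
    nested.mkdir(parents=True)
    (nested / 'query_tax_assignments.txt').write_text('seq1\tk__Bacteria\n')

    found = find_tax_assignments(results_dir)
    found_names = [path.name for path in found]
    relative_parts = found[0].relative_to(results_dir).parts
```

An empty result, or a `FileNotFoundError`, suggests that the analysis directory is misnamed or that the files are not called query_tax_assignments.txt.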


The steps involved in preparing and executing the cross-validated analysis are described in a series of notebooks:

1) Dataset generation only needs to be performed once for a given reference database. Run this notebook only if you wish to make datasets from a different reference database, or to alter the parameters used to make the novel taxa datasets. The default included in Tax-Credit is the Greengenes 13_8 release, amplified in silico with primers 515f and 806r, and trimmed to 250 nt from the 5' end. This notebook is located in the novel-taxa notebook directory, because simulated reads and "novel taxa" are generated simultaneously from the same source datasets.

2) Taxonomic classification of simulated reads is performed using the datasets generated in step 1. This template currently describes classification using QIIME 1 classifiers and can be used as a template for classifiers that are called via command line interface. Python-based classifiers can be used following the example of q2-feature-classifier.

3) Classifier evaluation is performed based on taxonomic classifications generated by each classifier used in step 2.
