An extensible framework for optimizing classification enhances marker-gene taxonomic assignments: Supplementary Notebooks

These notebooks were used to perform the analyses presented in Bokulich, Kaehler, et al. (in preparation). They can be used to reproduce the analyses in that paper or to extend them to other data sets.

To run any of the analysis notebooks, you'll need to install the tax-credit project; see the README for installation instructions. A static version of these notebooks can be viewed as a webpage on nbviewer without installing anything.

Questions should be posted as issues on the tax-credit repository issue tracker.

Contents

  • Mock community classification performance: Comparative performance of classification methods for assigning taxonomy to sequences from mock communities, mixtures of microbial cells with known taxonomy and composition, which are then sequenced. Mock communities allow quantitative assessments of method performance, evaluate how methods perform under true biological conditions (with true error rates), and give some idea of how methods may perform on natural communities.
  • Cross-validated classification performance: Comparative performance of classification methods to recover the true taxonomy of annotated reference sequences, using a typical cross-validation scheme. The reference dataset is split into query and reference sets; hence, the sequences being classified are not present in the reference dataset used for classification, but sequences with matching annotations are present. This allows calculation of classic precision and recall scores to evaluate classifier performance.
  • Novel taxa classification performance: Comparative performance of classification methods to recover the nearest correct lineage of annotated reference sequences when sequences bearing the correct taxonomy are absent from the reference dataset. The reference dataset is split into query and reference sets, and any sequences matching query taxonomies are removed from the reference; hence, the sequences being classified have no matches in the reference dataset used for classification, but other taxa within the same clade are present. A correct assignment is to the nearest common ancestor lineage. This tests classifier performance when challenged with a sequence that is unknown to the reference, and evaluates overall rates of overclassification, underclassification, and misidentification to the wrong lineage.
  • Simulated community classification performance: Comparative performance of classification methods for assigning taxonomy to sequences from simulated communities, which are synthetically composed collections of reference sequences that resemble natural microbial communities. As the simulated community's composition is known, this evaluation assesses how well individual methods reconstruct the expected composition. This provides some idea of how methods may perform on natural communities.
  • Computational runtime comparison: How fast do methods perform across a range of operating conditions?
  • Reference dataset performance comparison: This time, we keep the classification method and mock community constant, but compare different reference datasets to assess whether different datasets/versions provide more accurate taxonomic assignments of a mock community.
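
The outcome categories used in the novel taxa evaluation (correct match, overclassification, underclassification, misidentification) and the precision and recall scores used in the cross-validated evaluation can be illustrated with a minimal sketch. This is not the tax-credit implementation: the function names, the semicolon-delimited lineage format, and the mapping of categories to true positives, false positives, and false negatives are assumptions made for illustration only.

```python
from collections import Counter


def parse_lineage(lineage):
    """Split a semicolon-delimited lineage string into a list of taxon names."""
    return [taxon.strip() for taxon in lineage.split(";") if taxon.strip()]


def categorize(expected, observed):
    """Compare an observed lineage to the expected lineage (hypothetical helper).

    match:               observed is identical to expected
    overclassification:  observed extends deeper than expected
    underclassification: observed is a correct but shallower prefix of expected
    misclassification:   observed diverges from expected at some level
    """
    exp, obs = parse_lineage(expected), parse_lineage(observed)
    if obs == exp:
        return "match"
    if obs[:len(exp)] == exp:
        return "overclassification"
    if exp[:len(obs)] == obs:
        return "underclassification"
    return "misclassification"


def precision_recall(categories):
    """Compute precision and recall from a list of outcome categories.

    Assumption: matches are true positives, overclassifications and
    misclassifications are false positives, and underclassifications
    are false negatives.
    """
    counts = Counter(categories)
    tp = counts["match"]
    fp = counts["overclassification"] + counts["misclassification"]
    fn = counts["underclassification"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: the expected lineage stops at class level.
expected = "k__Bacteria; p__Firmicutes; c__Bacilli"
observed_lineages = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales",
    "k__Bacteria; p__Firmicutes",
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
]
categories = [categorize(expected, obs) for obs in observed_lineages]
print(categories)
print(precision_recall(categories))
```

Under these assumptions, an unassigned (empty) observation counts as an underclassification, since it is a trivially correct but uninformative prefix of the expected lineage.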

Extending the analyses to include evaluation of new taxonomic assignment methods

Given a set of precomputed evaluation results for taxonomic assignment methods (most likely those included in the tax-credit GitHub repository) and, optionally, a set of results generated by the user, the user's results can be analyzed in the context of the precomputed results. This lets users rapidly determine how a new method performs relative to the existing methods. Users can also submit their results to the tax-credit repository as a pull request so that they are included as precomputed results for future users.
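
The idea of placing user-generated results alongside precomputed ones can be sketched in a few lines of plain Python. The record layout, the function names, the method names, and the scores below are all hypothetical and made up for illustration; the notebooks rely on the tax-credit package's own evaluation functions, which are not reproduced here.

```python
def combine_results(precomputed, user_results=None):
    """Merge precomputed and optional user records, tagging each with its source."""
    combined = [dict(record, source="precomputed") for record in precomputed]
    combined += [dict(record, source="user") for record in (user_results or [])]
    return combined


def rank_by_metric(results, metric="f_measure"):
    """Return the records sorted from best to worst on the chosen metric."""
    return sorted(results, key=lambda record: record[metric], reverse=True)


# Made-up scores, for illustration only.
precomputed = [
    {"method": "method-a", "f_measure": 0.80},
    {"method": "method-b", "f_measure": 0.70},
]
user_results = [
    {"method": "my-new-method", "f_measure": 0.75},
]

for record in rank_by_metric(combine_results(precomputed, user_results)):
    print(record["method"], record["source"], record["f_measure"])
```

Tagging each record with its source keeps the new method distinguishable from the precomputed baselines after sorting.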

