Taxonomic assignment of marker gene reads
Lack of evaluation standards for comparisons of taxonomic assignment methods (and, more generally, bioinformatics methods)
We need a standardized and extensible evaluation framework for taxonomic assignment methods, and standardized, extensible evaluations should become the norm for bioinformatics methods comparisons.
Standardized
An important concern here is over-fitting. We'll come back to that.
Extensible: beyond the initial presentation, it's easy for users (not only the initial developers!) to:
Three core evaluations
Precision, recall and F-measure: a qualitative compositional analysis against mock communities.
At a given taxonomic level, an assignment is a:
true positive ($TP$), if a taxonomic assignment is present in the results, and is present in the mock community
false positive ($FP$), if a taxonomic assignment is present in the results, but is not present in the mock community
false negative ($FN$), if a taxonomic assignment is not present in the results, but is present in the mock community
true negative ($TN$), if a taxonomic assignment is not present in the results, and is not present in the mock community
$ precision = \dfrac{TP}{TP + FP} $ or the fraction of taxonomic assignments that are actually members of the mock community
$ recall = \dfrac{TP}{TP + FN} $ or the fraction of the mock community members that were observed in the results
$ F\text{-}measure = \dfrac{2 \times precision \times recall}{precision + recall} $ or the harmonic mean of precision and recall
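As an illustration, a minimal sketch of this evaluation (not the framework's code; the taxa listed are hypothetical): treat the observed and expected taxa at a given level as sets, then compute precision, recall, and their harmonic mean, the F-measure.

```python
# Minimal sketch of evaluation 1: compare observed vs. expected taxa as sets.
expected = {'g__Lactobacillus', 'g__Bacillus', 'g__Escherichia'}   # mock community (hypothetical)
observed = {'g__Lactobacillus', 'g__Bacillus', 'g__Clostridium'}   # assigner output (hypothetical)

tp = len(observed & expected)   # assigned and truly present
fp = len(observed - expected)   # assigned but not in the mock community
fn = len(expected - observed)   # in the mock community but not assigned

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(precision, recall, f_measure)
```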
Pearson/Spearman correlation: a quantitative compositional analysis against mock communities.
At a given taxonomic level, compute the correlation between the relative abundances of the taxa as predicted by the taxonomy assigner, and the known community compositions.
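For example, a minimal sketch of this computation using scipy, with hypothetical relative abundance vectors (not the framework's code):

```python
# Correlate predicted vs. known relative abundances for one mock community.
from scipy.stats import pearsonr, spearmanr

expected = [0.40, 0.30, 0.20, 0.10, 0.00]   # known mock community composition (hypothetical)
observed = [0.35, 0.33, 0.18, 0.09, 0.05]   # relative abundances from the assigner (hypothetical)

print(pearsonr(expected, observed))    # Pearson correlation and p-value
print(spearmanr(expected, observed))   # Spearman correlation and p-value
```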
Correlation of distances between communities based on taxonomic composition and taxonomy-free OTU composition.
At a given taxonomic level, collapse OTUs by their taxonomic assignments and compute the Bray-Curtis distances between the samples. Compare the resulting Bray-Curtis distance matrix to the UniFrac distance matrix generated based on the uncollapsed (i.e., taxonomy-free) OTUs.
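A minimal sketch of this comparison, assuming OTUs have already been collapsed by taxonomy at one level; the count matrix and the taxonomy-free distances are hypothetical placeholders (in the framework the latter come from UniFrac on the uncollapsed OTU table):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# rows = samples, columns = taxa after collapsing OTUs by taxonomic assignment
collapsed_counts = np.array([[10., 5., 0.],
                             [8., 6., 1.],
                             [0., 2., 12.]])
rel_abund = collapsed_counts / collapsed_counts.sum(axis=1, keepdims=True)

# Bray-Curtis distances between samples based on taxonomic composition
bc = pdist(rel_abund, metric='braycurtis')

# Taxonomy-free distances between the same sample pairs, in the same order
unifrac = np.array([0.21, 0.65, 0.70])

# Correlate the two sets of between-sample distances (a Mantel test adds a
# permutation-based p-value on top of this correlation)
rho, p = spearmanr(bc, unifrac)
print(rho)
```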
Data is stored in a public GitHub repository.
This includes input data:
Expected results:
And pre-computed results for other methods:
The developer of a new method can apply that method to the input data.
Instructions are included for applying the method to new data, and for formatting the output to be plugged into the evaluation framework.
The developer's output will be BIOM tables with their method's taxonomic assignments added. These are identical in format to the pre-computed result BIOM tables in the GitHub repository.
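For illustration, a minimal sketch of a BIOM table carrying taxonomic assignments as observation metadata, built with the biom-format Python API. The OTUs, samples, and counts are hypothetical, and storing taxonomy under a 'taxonomy' key is an assumption based on the common QIIME convention, not a specification of the framework's exact schema.

```python
import numpy as np
from biom import Table

data = np.array([[5, 0],
                 [3, 7]])                    # OTU-by-sample counts (hypothetical)
observation_ids = ['OTU1', 'OTU2']
sample_ids = ['MockA', 'MockB']
observation_metadata = [
    {'taxonomy': ['k__Bacteria', 'p__Firmicutes', 'c__Bacilli']},
    {'taxonomy': ['k__Bacteria', 'p__Proteobacteria']},
]

table = Table(data, observation_ids, sample_ids,
              observation_metadata=observation_metadata)
print(table)
```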
The developer's BIOM tables can be analyzed in the evaluation framework.
The evaluation framework is a pair of IPython Notebooks, which currently must be executed locally.
We are investigating other strategies, such as using Google App Engine, to support remote execution of these notebooks, which would avoid the need for local installation (not difficult, but an unnecessary barrier to use).
The evaluation framework automatically includes the pre-computed results.
The developer can then immediately determine how their method compares to pre-existing methods.
All of the metrics described above are applied, and the result is an IPython Notebook-based summary including figures and tables comparing all methods. All analysis results, in the form of raw data, are easily extractable for further analysis.
At this stage, the developer can make a decision about how to proceed.
The developer now knows how their method compares to pre-existing methods. There are a few possible outcomes:
To illustrate the utility of this evaluation framework, we developed new alignment-based taxonomy assigners based on uclust and usearch. These methods query an OTU representative sequence against a user-specified reference database, using either uclust or usearch, and return the top $n$ hits. Of these top hits, the most specific taxonomic assignment that is shared by at least $p$ percent of the hits is assigned to the OTU.
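The consensus step can be sketched as follows; this is a simplified illustration, not the actual implementation, and the function name and threshold parameter are hypothetical.

```python
# Simplified sketch of the consensus step: return the most specific taxonomy
# shared by at least min_consensus_fraction of the top hits.
from collections import Counter

def consensus_assignment(hit_taxonomies, min_consensus_fraction=0.51):
    """hit_taxonomies: list of taxonomies, each a list of levels,
    e.g. ['k__Bacteria', 'p__Firmicutes', 'c__Bacilli']."""
    n = len(hit_taxonomies)
    consensus = []
    max_depth = max(len(t) for t in hit_taxonomies)
    for level in range(max_depth):
        # Count the taxonomy prefixes observed at this level across all hits
        counts = Counter(tuple(t[:level + 1]) for t in hit_taxonomies
                         if len(t) > level)
        assignment, count = counts.most_common(1)[0]
        # Descend only while a sufficient fraction of the hits agree
        if count / n >= min_consensus_fraction:
            consensus = list(assignment)
        else:
            break
    return consensus

# Two of the three hits agree at the class level (>= 51%), so class is retained
hits = [['k__Bacteria', 'p__Firmicutes', 'c__Bacilli'],
        ['k__Bacteria', 'p__Firmicutes', 'c__Clostridia'],
        ['k__Bacteria', 'p__Firmicutes', 'c__Bacilli']]
print(consensus_assignment(hits))
# ['k__Bacteria', 'p__Firmicutes', 'c__Bacilli']
```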
This notebook is viewable on nbviewer.
In this evaluation, we found that our uclust-based taxonomy assigner runs approximately 1000x faster than some of the popular methods that we compared it against, while achieving twice the precision for genus-level assignments.
It is not possible to develop a completely comprehensive collection of test data sets and metrics. Just as users can submit results generated by new methods, users can submit pull requests containing new test data sets and evaluation metrics. These currently involve more work to integrate (because pre-existing methods must be re-run), but we are working on simplifying that.
Support for including new reference databases. Just as new methods can be compared, this framework also supports the inclusion of new reference databases. If there is sufficient interest (e.g., in using the framework to investigate incremental changes to marker gene reference databases and taxonomies), we will simplify this process.
We are in the process of investigating other strategies for taxonomic assignment using this framework, including an Infernal-based assigner and a scikit-learn-based naive Bayesian classifier.
Slides by Greg Caporaso
Corresponding author: gregcaporaso@gmail.com
An extensible framework for optimizing classification enhances short-amplicon taxonomic assignments.
Nicholas A. Bokulich$^{*}$, Jai Ram Rideout$^{*}$, Kyle Patnode, Zack Ellett, Daniel McDonald, Benjamin Wolfe, Corinne F. Maurice, Rachel J. Dutton, Peter J. Turnbaugh, Rob Knight, J. Gregory Caporaso.
Manuscript in preparation.
$^{*}$ These authors contributed equally.