Taxonomic assignment of marker gene reads
Lack of evaluation standards for comparisons of taxonomic assignment methods (and, more generally, bioinformatics methods)
We need a standardized and extensible evaluation framework for taxonomic assignment methods, and standardized, extensible evaluations should become the norm for bioinformatics methods comparisons.
Standardized
An important concern here is over-fitting. We'll come back to that.
Extensible: beyond the initial presentation, it's easy for users (not only the initial developers!) to:
Three core evaluations
Precision, recall and F-measure: a qualitative compositional analysis against mock communities.
At a given taxonomic level, an assignment is a:
true positive ($TP$), if a taxonomic assignment is present in the results, and is present in the mock community
false positive ($FP$), if a taxonomic assignment is present in the results, but is not present in the mock community
false negative ($FN$), if a taxonomic assignment is not present in the results, but is present in the mock community
true negative ($TN$), if a taxonomic assignment is not present in the results, and is not present in the mock community
$ precision = \dfrac{TP}{TP + FP} $ or the fraction of taxonomic assignments that are actually members of the mock community
$ recall = \dfrac{TP}{TP + FN} $ or the fraction of the mock community members that were observed in the results
$ F\text{-}measure = \dfrac{2 \times precision \times recall}{precision + recall} $ or the harmonic mean of precision and recall
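As an illustration, a minimal sketch of this evaluation (not the framework's code; the taxa listed are hypothetical): treat the observed and expected taxa at a given level as sets, then compute precision, recall, and their harmonic mean, the F-measure.

```python
# Minimal sketch of evaluation 1: compare observed vs. expected taxa as sets.
expected = {'g__Lactobacillus', 'g__Bacillus', 'g__Escherichia'}   # mock community (hypothetical)
observed = {'g__Lactobacillus', 'g__Bacillus', 'g__Clostridium'}   # assigner output (hypothetical)

tp = len(observed & expected)   # assigned and truly present
fp = len(observed - expected)   # assigned but not in the mock community
fn = len(expected - observed)   # in the mock community but not assigned

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(precision, recall, f_measure)
```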
Pearson/Spearman correlation: a quantitative compositional analysis against mock communities.
At a given taxonomic level, compute the correlation between the relative abundances of the taxa as predicted by the taxonomy assigner, and the known community compositions.
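For example, a minimal sketch of this computation using scipy, with hypothetical relative abundance vectors (not the framework's code):

```python
# Correlate predicted vs. known relative abundances for one mock community.
from scipy.stats import pearsonr, spearmanr

expected = [0.40, 0.30, 0.20, 0.10, 0.00]   # known mock community composition (hypothetical)
observed = [0.35, 0.33, 0.18, 0.09, 0.05]   # relative abundances from the assigner (hypothetical)

print(pearsonr(expected, observed))    # Pearson correlation and p-value
print(spearmanr(expected, observed))   # Spearman correlation and p-value
```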
Correlation of distances between communities based on taxonomic composition and taxonomy-free OTU composition.
At a given taxonomic level, collapse OTUs by their taxonomic assignments and compute the Bray-Curtis distances between the samples. Compare the resulting Bray-Curtis distance matrix to the UniFrac distance matrix generated based on the uncollapsed (i.e., taxonomy-free) OTUs.
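A minimal sketch of this comparison, assuming OTUs have already been collapsed by taxonomy at one level; the count matrix and the taxonomy-free distances are hypothetical placeholders (in the framework the latter come from UniFrac on the uncollapsed OTU table):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# rows = samples, columns = taxa after collapsing OTUs by taxonomic assignment
collapsed_counts = np.array([[10., 5., 0.],
                             [8., 6., 1.],
                             [0., 2., 12.]])
rel_abund = collapsed_counts / collapsed_counts.sum(axis=1, keepdims=True)

# Bray-Curtis distances between samples based on taxonomic composition
bc = pdist(rel_abund, metric='braycurtis')

# Taxonomy-free distances between the same sample pairs, in the same order
unifrac = np.array([0.21, 0.65, 0.70])

# Correlate the two sets of between-sample distances (a Mantel test adds a
# permutation-based p-value on top of this correlation)
rho, p = spearmanr(bc, unifrac)
print(rho)
```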
Data is stored in a public GitHub repository.
This includes input data:
Expected results:
And pre-computed results for other methods:
The developer of a new method can apply that method to the input data.
Instructions are included for applying the method to new data, and for formatting the output to be plugged into the evaluation framework.
The developer's output will be BIOM tables with their method's taxonomic assignments added. These are identical in format to the pre-computed result BIOM tables in the GitHub repository.
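For illustration, a minimal sketch of a BIOM table carrying taxonomic assignments as observation metadata, built with the biom-format Python API. The OTUs, samples, and counts are hypothetical, and storing taxonomy under a 'taxonomy' key is an assumption based on the common QIIME convention, not a specification of the framework's exact schema.

```python
import numpy as np
from biom import Table

data = np.array([[5, 0],
                 [3, 7]])                    # OTU-by-sample counts (hypothetical)
observation_ids = ['OTU1', 'OTU2']
sample_ids = ['MockA', 'MockB']
observation_metadata = [
    {'taxonomy': ['k__Bacteria', 'p__Firmicutes', 'c__Bacilli']},
    {'taxonomy': ['k__Bacteria', 'p__Proteobacteria']},
]

table = Table(data, observation_ids, sample_ids,
              observation_metadata=observation_metadata)
print(table)
```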
The developer's BIOM tables can be analyzed in the evaluation framework.
The evaluation framework is a pair of IPython Notebooks, which currently must be executed locally.
We are investigating other strategies, such as using Google App Engine, to support remote execution of these notebooks, which would avoid the need for local installation (not difficult, but an unnecessary barrier to use).
The evaluation framework automatically includes the pre-computed results.
The developer can then immediately determine how their method compares to pre-existing methods.
All of the metrics described above are applied, and the result is an IPython Notebook-based summary including figures and tables comparing all methods. All analysis results, in the form of raw data, are easily extractable for further analysis.
At this stage, the developer can make a decision about how to proceed.
The developer now knows how their method compares to pre-existing methods. There are a few possible outcomes:
To illustrate the utility of this evaluation framework, we developed new alignment-based taxonomy assigners based on uclust and usearch. These methods query an OTU representative sequence against a user-specified reference database, using either uclust or usearch, and return the top $n$ hits. Of these top hits, the most specific taxonomic assignment that is shared by at least $p$ percent of the hits is assigned to the OTU.
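The consensus step can be sketched as follows; this is a simplified illustration, not the actual implementation, and the function name and threshold parameter are hypothetical.

```python
# Simplified sketch of the consensus step: return the most specific taxonomy
# shared by at least min_consensus_fraction of the top hits.
from collections import Counter

def consensus_assignment(hit_taxonomies, min_consensus_fraction=0.51):
    """hit_taxonomies: list of taxonomies, each a list of levels,
    e.g. ['k__Bacteria', 'p__Firmicutes', 'c__Bacilli']."""
    n = len(hit_taxonomies)
    consensus = []
    max_depth = max(len(t) for t in hit_taxonomies)
    for level in range(max_depth):
        # Count the taxonomy prefixes observed at this level across all hits
        counts = Counter(tuple(t[:level + 1]) for t in hit_taxonomies
                         if len(t) > level)
        assignment, count = counts.most_common(1)[0]
        # Descend only while a sufficient fraction of the hits agree
        if count / n >= min_consensus_fraction:
            consensus = list(assignment)
        else:
            break
    return consensus

# Two of the three hits agree at the class level (>= 51%), so class is retained
hits = [['k__Bacteria', 'p__Firmicutes', 'c__Bacilli'],
        ['k__Bacteria', 'p__Firmicutes', 'c__Clostridia'],
        ['k__Bacteria', 'p__Firmicutes', 'c__Bacilli']]
print(consensus_assignment(hits))
# ['k__Bacteria', 'p__Firmicutes', 'c__Bacilli']
```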
This notebook is viewable on nbviewer.
In this evaluation, we found that our uclust-based taxonomy assigner runs approximately 1000x faster than some of the popular methods that we compared it against, while achieving twice the precision for genus-level assignments.
It is not possible to develop a completely comprehensive collection of test data sets and metrics. Just as users can submit results generated by new methods, users can submit pull requests containing new test data sets and evaluation metrics. These currently involve more work to integrate (because pre-existing methods must be re-run), but we are working on simplifying that.
Support for including new reference databases. Just as new methods can be compared, this framework also supports the inclusion of new reference databases. If there is sufficient interest (e.g., in using the framework to investigate incremental changes to marker gene reference databases and taxonomies), we will simplify this process.
We are in the process of investigating other strategies for taxonomic assignment using this framework, including an Infernal-based assigner and a scikit-learn-based naive Bayesian classifier.
Slides by Greg Caporaso
Corresponding author: gregcaporaso@gmail.com
An extensible framework for optimizing classification enhances short-amplicon taxonomic assignments.
Nicholas A. Bokulich$^{*}$, Jai Ram Rideout$^{*}$, Kyle Patnode, Zack Ellett, Daniel McDonald, Benjamin Wolfe, Corinne F. Maurice, Rachel J. Dutton, Peter J. Turnbaugh, Rob Knight, J. Gregory Caporaso.
Manuscript in preparation.
$^{*}$ These authors contributed equally.