evaluate-classification-database-comparison


Database comparisons

The purpose of this notebook is to evaluate classification accuracy across different reference databases. Selected mock community sequences are taxonomically classified using two or more reference databases, e.g., Greengenes 13_8 trimmed to 250 nt and the same database trimmed to 150 nt. This notebook can also be modified to perform taxonomic classification with any number of desired reference databases/versions. Limit the analysis to only a few mock communities and method/parameter combinations; the goal here is to compare the databases, not the methods.

In the example below, we compare RDP classifications of a single 16S rRNA dataset (mock-3) and a single fungal ITS dataset (mock-9) using three different types of reference database: reference sequences that are full length (e.g., the full 16S rRNA gene), trimmed to 250 nt (approximately the full V4 region), or trimmed to 150 nt or 100 nt (the sequence read length).
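The trimmed reference databases are simply truncated copies of the full-length references. A minimal sketch of that trimming step in plain Python (illustrative only, not part of tax-credit; the function name and records are hypothetical):

```python
# Illustrative sketch: trim reference sequences to a fixed length,
# as done to build the trimmed databases compared below.

def trim_references(records, length):
    """Yield (header, seq) pairs with each sequence truncated to `length` nt.

    Sequences shorter than `length` are dropped, so every retained
    reference has a uniform trimmed length.
    """
    for header, seq in records:
        if len(seq) >= length:
            yield header, seq[:length]

refs = [(">ref1", "ACGT" * 100),   # 400 nt
        (">ref2", "ACGT" * 30)]    # 120 nt, too short to keep
trimmed = list(trim_references(refs, 150))
```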

Prepare the environment

First we'll import various functions that we'll need for generating the report.


In [1]:
%matplotlib inline
from os.path import join, expandvars
import pandas as pd
from tax_credit.plotting_functions import (pointplot_from_data_frame,
                                           boxplot_from_data_frame,
                                           heatmap_from_data_frame,
                                           per_level_kruskal_wallis)
from tax_credit.eval_framework import (evaluate_results,
                                       method_by_reference_comparison)

Configure local environment-specific values

This is the only cell that you will need to edit to generate basic reports locally. After editing this cell, you can run all cells in this notebook to generate your analysis report. This will take a few minutes to run, as results are computed at multiple taxonomic levels.

To generate the default results contained within tax-credit, no values in this cell need to be changed except project_dir. To analyze results separately from the tax-credit precomputed results, the other variables in this cell will need to be set.


In [2]:
## project_dir should be the directory where you've downloaded (or cloned) the 
## tax-credit repository. 
project_dir = expandvars("$HOME/Desktop/projects/tax-credit")

## expected_results_dir contains expected composition data in the structure
## expected_results_dir/<dataset name>/<reference name>/expected/
expected_results_dir = join(project_dir, "data/precomputed-results/", "mock-community")

## mock_results_fp designates the files to which summary results are written.
## If this file exists, it can be read in to generate results plots, instead
## of computing new scores.
mock_results_fp = join(expected_results_dir, 'mock_results.tsv')

## results_dirs should contain the directory or directories where
## results can be found. By default, this is the same location as expected 
## results included with the project. If other results should be included, 
## absolute paths to those directories should be added to this list.
results_dirs = [expected_results_dir]

## Minimum number of times an OTU must be observed for it to be included in
## analyses. Edit this to analyze the effect of the minimum count on
## taxonomic results.
min_count = 1

## Define the range of taxonomic levels over which to compute accuracy scores.
## The default given below will compute class (level 2) through species (level 6).
taxonomy_level_range = range(2,7)
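For reference, a Greengenes-style taxonomy string maps onto those level indices as follows (a sketch; the taxonomy string below is only an example):

```python
# Greengenes-style taxonomy: semicolon-delimited, kingdom (level 0)
# through species (level 6).
taxonomy = ("k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
            "o__Enterobacteriales; f__Enterobacteriaceae; "
            "g__Escherichia; s__coli")
levels = [rank.strip() for rank in taxonomy.split(';')]

# range(2, 7) therefore covers class through species:
selected = levels[2:7]
```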

Find mock community pre-computed tables, expected tables, and "query" tables

Next we'll use the paths defined above to find all of the tables that will be compared. These include the pre-computed result tables (i.e., the ones that the new methods will be compared to), the expected result tables (i.e., the tables containing the known composition of the mock microbial communities), and the query result tables (i.e., the tables generated with the new method(s) that we want to compare to the pre-computed result tables).


In [3]:
mock_results = evaluate_results(results_dirs, 
                                expected_results_dir, 
                                mock_results_fp, 
                                taxonomy_level_range=taxonomy_level_range, 
                                min_count=min_count,
                                taxa_to_keep=None, 
                                md_key='taxonomy', 
                                subsample=False,
                                force=False)


/Users/nbokulich/Desktop/projects/tax-credit/data/precomputed-results/mock-community/mock_results.tsv already exists.
Reading in pre-computed evaluation results.
To overwrite, set force=True

Restrict analyses to the set of datasets and methods used for the reference database comparisons.


In [4]:
mock_results = mock_results[mock_results['Method'].isin(['q2-nb', 'rdp', 'blast+', 'uclust', 'vsearch'])]
mock_results = mock_results[mock_results['Dataset'].isin(['mock-3', 'mock-9'])]

Evaluation 1: Compute and summarize precision, recall, and F-measure for mock communities

In this evaluation, we compute and summarize precision, recall, and F-measure of each result (pre-computed and query) based on the known composition of the mock communities. We then summarize the results in two ways: first with boxplots, and second with a table of the top methods based on their F-measures.

This is a qualitative evaluation, effectively telling us about the ability of the different methods to report the taxa that are present in each sample. These metrics are not concerned with the abundance of the different taxa.
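These qualitative metrics reduce to set operations on the expected and observed taxa at a given level. A minimal sketch (not tax-credit's implementation; the example taxa are hypothetical):

```python
def precision_recall_f(expected, observed):
    """Compute qualitative precision, recall, and F-measure from
    sets of taxonomy strings at a given taxonomic level."""
    tp = len(expected & observed)   # taxa correctly reported
    fp = len(observed - expected)   # taxa reported but not present
    fn = len(expected - observed)   # taxa present but not reported
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

expected = {'g__Escherichia', 'g__Staphylococcus', 'g__Bacillus'}
observed = {'g__Escherichia', 'g__Staphylococcus', 'g__Listeria'}
p, r, f = precision_recall_f(expected, observed)
```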

As a first step, we will evaluate how the accuracy metrics perform at each taxonomic level for each method within each reference dataset type. We can evaluate this for each method individually, as is currently set up below for the q2-feature-classifier naive Bayes classifier (q2-nb).


In [5]:
color_pallette = {# 16S refs
                  'gg_13_8_otus': 'orange', 
                  'gg_13_8_otus_clean': 'seagreen', # 515f-806r
                  'gg_13_8_otus_clean_trim150': 'blue', # 515f-806r
                  'gg_13_8_otus_clean_full16S': 'purple',
                  'silva_123_v4_trim250': 'pink',
                  'silva_123_clean_v4_trim250': 'red',
                  'silva_123_clean_full16S': 'gold',
                  # ITS Refs
                  'unite_20.11.2016': 'orange', 
                  'unite_20.11.2016_clean': 'seagreen', # ITS1Ff-ITS2r
                  'unite_20.11.2016_clean_trim100': 'blue', # ITS1Ff-ITS2r
                  'unite_20.11.2016_clean_fullITS': 'purple', 
                  'unite_20.11.2016_clean_trim150': 'cyan'} # ITS1Ff-ITS2r

y_vars = ["Precision", "Recall", "F-measure"]

In [6]:
pointplot_from_data_frame(mock_results[mock_results['Method'] == 'q2-nb'], "Level", y_vars, 
                          group_by="Dataset", color_by="Reference",
                          color_pallette=color_pallette)



In [7]:
pointplot_from_data_frame(mock_results[mock_results['Method'] == 'rdp'], "Level", y_vars, 
                          group_by="Dataset", color_by="Reference",
                          color_pallette=color_pallette)



In [19]:
pointplot_from_data_frame(mock_results[mock_results['Method'] == 'blast+'], "Level", y_vars, 
                          group_by="Dataset", color_by="Reference",
                          color_pallette=color_pallette)



In [20]:
pointplot_from_data_frame(mock_results[mock_results['Method'] == 'uclust'], "Level", y_vars, 
                          group_by="Dataset", color_by="Reference",
                          color_pallette=color_pallette)


Kruskal-Wallis FDR-corrected p-values comparing reference database / method combinations at each level of taxonomic assignment


In [21]:
result = per_level_kruskal_wallis(mock_results, y_vars, group_by='Reference', 
                                  dataset_col='Dataset', level_name='Level',
                                  levelrange=taxonomy_level_range, alpha=0.05, 
                                  pval_correction='fdr_bh')
result


Out[21]:
   Dataset   Variable              2              3              4              5              6
0   mock-3  Precision  5.113689e-116  1.552753e-115  2.312131e-100  2.498765e-207  6.371697e-167
1   mock-3     Recall   3.100730e-64   2.671537e-55   3.769743e-64   1.776997e-92  6.718745e-171
2   mock-3  F-measure   2.958127e-69   2.785821e-58   2.033205e-74  8.220949e-133  4.999565e-175
3   mock-9  Precision   7.316974e-86   1.845363e-82   1.860603e-94  4.041060e-105  1.040389e-128
4   mock-9     Recall  3.764580e-101  1.307634e-109  5.804092e-103  6.494176e-103  7.778683e-128
5   mock-9  F-measure  6.920112e-100  8.424047e-100  3.355415e-108  9.091524e-114  1.854305e-132
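The per-level test above can be sketched as follows (a simplified illustration, not tax-credit's implementation, assuming scipy is available; the score values are hypothetical): a Kruskal-Wallis H-test compares score distributions across reference databases at one taxonomic level, and the resulting p-values are then FDR-corrected with a Benjamini-Hochberg step-up procedure.

```python
from scipy.stats import kruskal

def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values (a minimal sketch)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for offset, i in enumerate(reversed(order)):
        rank = n - offset
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical F-measure scores for three reference databases at one level:
full16S = [0.91, 0.93, 0.90, 0.92]
trim250 = [0.88, 0.87, 0.89, 0.86]
trim150 = [0.70, 0.72, 0.71, 0.69]
h, p = kruskal(full16S, trim250, trim150)
```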

In [22]:
heatmap_from_data_frame(mock_results, metric="Precision", rows=["Method", "Parameters"], cols=["Reference", "Level"])



In [21]:
heatmap_from_data_frame(mock_results, metric="Recall", rows=["Method", "Parameters"], cols=["Reference", "Level"])