Evaluate mock community classification accuracy

The purpose of this notebook is to evaluate taxonomic classification accuracy of mock communities using different classification methods.

Prepare the environment

First we'll import various functions that we'll need for generating the report.


In [1]:
%matplotlib inline
from os.path import join, exists, expandvars
import pandas as pd
from IPython.display import display, Markdown
import seaborn.xkcd_rgb as colors
from tax_credit.plotting_functions import (pointplot_from_data_frame,
                                           boxplot_from_data_frame,
                                           heatmap_from_data_frame,
                                           per_level_kruskal_wallis,
                                           beta_diversity_pcoa,
                                           average_distance_boxplots,
                                           rank_optimized_method_performance_by_dataset)
from tax_credit.eval_framework import (evaluate_results,
                                       method_by_dataset_a1,
                                       parameter_comparisons,
                                       merge_expected_and_observed_tables,
                                       filter_df)

Configure local environment-specific values

This is the only cell that you will need to edit to generate basic reports locally. After editing this cell, you can run all cells in this notebook to generate your analysis report. This will take a few minutes to run, as results are computed at multiple taxonomic levels.

With the exception of project_dir, the values in this cell do not need to be changed to generate the default results contained within tax-credit. To analyze results computed outside of the tax-credit precomputed results, the other variables in this cell will need to be set accordingly.


In [2]:
## project_dir should be the directory where you've downloaded (or cloned) the 
## tax-credit repository. 
project_dir = expandvars("../..")

## expected_results_dir contains expected composition data in the structure
## expected_results_dir/<dataset name>/<reference name>/expected/
expected_results_dir = join(project_dir, "data/precomputed-results/", "mock-community")

## mock_results_fp designates the file to which summary results are written.
## If this file exists, it can be read in to generate results plots, instead
## of computing new scores.
mock_results_fp = join(expected_results_dir, 'mock_results.tsv')

## results_dirs should contain the directory or directories where
## results can be found. By default, this is the same location as expected 
## results included with the project. If other results should be included, 
## absolute paths to those directories should be added to this list.
results_dirs = [expected_results_dir]

## directory containing mock community data, e.g., feature table without taxonomy
mock_dir = join(project_dir, "data", "mock-community")

## Minimum number of times an OTU must be observed for it to be included in analyses. Edit this
## to analyze the effect of the minimum count on taxonomic results.
min_count = 1

## Define the range of taxonomic levels over which to compute accuracy scores.
## The default given below will compute class (level 2) through species (level 6)
taxonomy_level_range = range(2,7)
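
If you have results computed outside of tax-credit, their locations can be appended to results_dirs as described in the comment above. A minimal sketch (the path below is hypothetical; substitute the absolute path to your own results directory):

## Hypothetical example only: include externally computed results alongside the precomputed ones.
results_dirs = [expected_results_dir, expandvars("$HOME/my-classifier-results")]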

In [3]:
dataset_ids = ['mock-' + str(m) for m in (3, 12, 18, 22, 24, '26-ITS1', '26-ITS9')]

Find mock community pre-computed tables, expected tables, and "query" tables

Next we'll use the paths defined above to find all of the tables that will be compared. These include the pre-computed result tables (i.e., the ones that the new methods will be compared to), the expected result tables (i.e., the tables containing the known composition of the mock microbial communities), and the query result tables (i.e., the tables generated with the new method(s) that we want to compare to the pre-computed result tables).

Note: if you have added additional methods, set append=True. If you are attempting to recompute pre-computed results, set force=True.

This cell will take a few minutes to run if new results are being added, so hold onto your hat. If you are attempting to re-compute everything, it may take an hour or so, so go take a nap.


In [5]:
mock_results = evaluate_results(results_dirs, 
                                expected_results_dir, 
                                mock_results_fp, 
                                mock_dir,
                                taxonomy_level_range=taxonomy_level_range,
                                dataset_ids=dataset_ids,
                                min_count=min_count,
                                taxa_to_keep=None, 
                                md_key='taxonomy', 
                                subsample=False,
                                per_seq_precision=True,
                                exclude=['other'],
                                reference_ids=['unite_20.11.2016_clean_fullITS', 'gg_13_8_otus'],
                                append=False,
                                force=True)


/Users/benkaehler/miniconda3/envs/qiime2-dev/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)

In [6]:
mock_results['Reference'].unique()


Out[6]:
array(['gg_13_8_otus', 'unite_20.11.2016_clean_fullITS'], dtype=object)

Restrict analyses to a subset of datasets or references, e.g., to exclude taxonomy assignments made for the purpose of reference database comparisons. This can be performed as shown below. Alternatively, specific reference databases, datasets, methods, or parameters can be chosen by setting dataset_ids, reference_ids, method_ids, and parameter_ids in the evaluate_results command above.


In [7]:
mock_results = filter_df(mock_results, column_name='Method',
                         values=['q2-nb'], exclude=True)
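
filter_df can also be used to keep, rather than drop, a subset of rows. A minimal sketch, assuming exclude=False retains only the rows matching the listed values (e.g., restricting the comparison to the fungal reference):

# Sketch only: keep results computed against the UNITE reference
# (assumes exclude=False means "retain rows matching these values").
its_results = filter_df(mock_results, column_name='Reference',
                        values=['unite_20.11.2016_clean_fullITS'], exclude=False)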

In [8]:
mock_results[(mock_results['Method'] == 'naive-bayes') & (mock_results['Reference'] == 'unite_20.11.2016_clean_fullITS')]


Out[8]:
Dataset Level SampleID Reference Method Parameters Precision Recall F-measure Taxon Accuracy Rate Taxon Detection Rate
1960 mock-24 2 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.0 0.758148 1.000000 0.862439 0.666667 1.000000
1961 mock-24 3 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.0 0.758148 1.000000 0.862439 0.555556 1.000000
1962 mock-24 4 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.0 0.758148 1.000000 0.862439 0.466667 1.000000
1963 mock-24 5 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.0 0.758148 1.000000 0.862439 0.470588 1.000000
1964 mock-24 6 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.0 0.674483 0.889646 0.767266 0.181818 0.500000
1965 mock-24 2 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.1 0.759317 1.000000 0.863195 0.500000 0.750000
1966 mock-24 3 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.1 0.759317 1.000000 0.863195 0.444444 0.800000
1967 mock-24 4 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.1 0.759317 1.000000 0.863195 0.400000 0.857143
1968 mock-24 5 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.1 0.759317 1.000000 0.863195 0.411765 0.875000
1969 mock-24 6 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.1 0.675523 0.889646 0.767938 0.181818 0.500000
1970 mock-24 2 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.2 0.759317 1.000000 0.863195 0.500000 0.750000
1971 mock-24 3 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.2 0.759317 1.000000 0.863195 0.444444 0.800000
1972 mock-24 4 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.2 0.759317 1.000000 0.863195 0.400000 0.857143
1973 mock-24 5 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.2 0.759317 1.000000 0.863195 0.411765 0.875000
1974 mock-24 6 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.2 0.675523 0.889646 0.767938 0.181818 0.500000
1975 mock-24 2 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.3 0.759317 1.000000 0.863195 0.500000 0.750000
1976 mock-24 3 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.3 0.759317 1.000000 0.863195 0.444444 0.800000
1977 mock-24 4 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.3 0.759317 1.000000 0.863195 0.400000 0.857143
1978 mock-24 5 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.3 0.759317 1.000000 0.863195 0.411765 0.875000
1979 mock-24 6 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.3 0.675538 0.889646 0.767948 0.181818 0.500000
1980 mock-24 2 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.4 0.759317 1.000000 0.863195 0.500000 0.750000
1981 mock-24 3 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.4 0.759317 1.000000 0.863195 0.444444 0.800000
1982 mock-24 4 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.4 0.759317 1.000000 0.863195 0.400000 0.857143
1983 mock-24 5 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.4 0.759317 1.000000 0.863195 0.411765 0.875000
1984 mock-24 6 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.4 0.675538 0.889646 0.767948 0.181818 0.500000
1985 mock-24 2 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.5 0.759317 1.000000 0.863195 0.500000 0.750000
1986 mock-24 3 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.5 0.759317 1.000000 0.863195 0.444444 0.800000
1987 mock-24 4 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.5 0.759317 1.000000 0.863195 0.400000 0.857143
1988 mock-24 5 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.5 0.759317 1.000000 0.863195 0.411765 0.875000
1989 mock-24 6 Mock.1 unite_20.11.2016_clean_fullITS naive-bayes False:0.5 0.675554 0.889646 0.767958 0.181818 0.500000
... ... ... ... ... ... ... ... ... ... ... ...
17450 mock-26-ITS9 6 Mock.19 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.166667 0.181818
17451 mock-26-ITS9 6 Mock.2 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.181818 0.181818
17452 mock-26-ITS9 6 Mock.20 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.222222 0.181818
17453 mock-26-ITS9 6 Mock.21 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.000000 0.000000
17454 mock-26-ITS9 6 Mock.22 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.000000 0.000000
17455 mock-26-ITS9 6 Mock.23 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.125000 0.090909
17456 mock-26-ITS9 6 Mock.24 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.125000 0.090909
17457 mock-26-ITS9 6 Mock.25 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.200000 0.090909
17458 mock-26-ITS9 6 Mock.26 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.111111 0.090909
17459 mock-26-ITS9 6 Mock.27 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.000000 0.000000
17460 mock-26-ITS9 6 Mock.28 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.142857 0.090909
17461 mock-26-ITS9 6 Mock.29 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.111111 0.090909
17462 mock-26-ITS9 6 Mock.3 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.222222 0.181818
17463 mock-26-ITS9 6 Mock.30 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.000000 0.000000
17464 mock-26-ITS9 6 Mock.31 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.000000 0.000000
17465 mock-26-ITS9 6 Mock.32 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.000000 0.000000
17466 mock-26-ITS9 6 Mock.33 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.000000 0.000000
17467 mock-26-ITS9 6 Mock.34 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.000000 0.000000
17468 mock-26-ITS9 6 Mock.35 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.111111 0.090909
17469 mock-26-ITS9 6 Mock.36 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.000000 0.000000
17470 mock-26-ITS9 6 Mock.37 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.166667 0.090909
17471 mock-26-ITS9 6 Mock.38 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.142857 0.090909
17472 mock-26-ITS9 6 Mock.39 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.000000 0.000000
17473 mock-26-ITS9 6 Mock.4 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.181818 0.181818
17474 mock-26-ITS9 6 Mock.40 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.125000 0.090909
17475 mock-26-ITS9 6 Mock.5 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.200000 0.181818
17476 mock-26-ITS9 6 Mock.6 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.222222 0.181818
17477 mock-26-ITS9 6 Mock.7 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.153846 0.181818
17478 mock-26-ITS9 6 Mock.8 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.222222 0.181818
17479 mock-26-ITS9 6 Mock.9 unite_20.11.2016_clean_fullITS naive-bayes False:0.9 -1.000000 -1.000000 -1.000000 0.200000 0.181818

2650 rows × 11 columns

Compute and summarize precision, recall, and F-measure for mock communities

In this evaluation, we compute and summarize precision, recall, and F-measure of each result (pre-computed and query) based on the known composition of the mock communities. We then summarize the results in two ways: first with boxplots, and second with a table of the top methods based on their F-measures. Higher scores indicate better accuracy.
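
For reference, these metrics follow the standard per-sequence definitions. A minimal sketch of those definitions (illustrative only; the actual computation is implemented in tax_credit.eval_framework):

# Illustrative only: standard definitions, not the exact tax-credit implementation.
def precision_recall_f(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# e.g., 90 sequences correctly classified, 10 misclassified, 5 left unclassified
precision_recall_f(90, 10, 5)  # (0.9, ~0.947, ~0.923)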

As a first step, we will evaluate average method performance at each taxonomic level for each method within each reference dataset type.

Note that, as parameter configurations can cause results to vary widely, average results are not a good representation of the "best" results. See here for results using optimized parameters for each method.

First we will define our color palette and the variables we want to plot. Via seaborn, we can apply the xkcd crowdsourced color names. If that still doesn't match your hue, use hex codes.


In [9]:
color_pallette={
    'expected': 'black', 'rdp': colors['baby shit green'], 'sortmerna': colors['macaroni and cheese'],
    'uclust': 'coral', 'blast': 'indigo', 'blast+': colors['electric purple'], 'naive-bayes': 'dodgerblue',
    'vsearch': 'firebrick'
}

y_vars = ["Precision", "Recall", "F-measure", "Taxon Accuracy Rate", "Taxon Detection Rate"]

In [10]:
pointplot_from_data_frame(mock_results, "Level", y_vars, 
                          group_by="Reference", color_by="Method",
                          color_pallette=color_pallette)


Kruskal-Wallis between-method accuracy comparisons

Kruskal-Wallis FDR-corrected p-values comparing classification methods at each level of taxonomic assignment
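
For context, this is roughly what the wrapper computes. A minimal sketch, assuming per_level_kruskal_wallis wraps scipy's Kruskal-Wallis H-test with Benjamini-Hochberg FDR correction (the actual implementation lives in tax_credit.plotting_functions):

# Sketch only: group one metric by Method at a single level, test, then FDR-correct across levels.
from scipy.stats import kruskal
from statsmodels.stats.multitest import multipletests

def kruskal_by_method(df, metric, level):
    sub = df[df['Level'].astype(int) == level]
    groups = [g[metric].values for _, g in sub.groupby('Method')]
    return kruskal(*groups).pvalue

pvals = [kruskal_by_method(mock_results, 'F-measure', lvl) for lvl in range(2, 7)]
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')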


In [172]:
result = per_level_kruskal_wallis(mock_results, y_vars, group_by='Method', 
                                  dataset_col='Reference', level_name='Level',
                                  levelrange=range(2,7), alpha=0.05, 
                                  pval_correction='fdr_bh')
result


Out[172]:
Reference Variable 2 3 4 5 6
0 gg_13_8_otus Precision 1.077059e-02 9.455196e-02 2.558814e-01 1.905525e-03 1.022860e-02
1 gg_13_8_otus Recall 6.293206e-28 9.844389e-15 7.991753e-23 1.486565e-17 4.265048e-05
2 gg_13_8_otus F-measure 2.320716e-01 5.350801e-01 9.777715e-01 1.336838e-06 1.905525e-03
3 gg_13_8_otus Taxon Accuracy Rate 6.959912e-06 8.134356e-02 5.010372e-03 2.476416e-23 7.578604e-16
4 gg_13_8_otus Taxon Detection Rate 9.584542e-01 8.763574e-01 5.649290e-07 5.948615e-47 4.774055e-32
5 unite_20.11.2016_clean_fullITS Precision 4.601516e-28 1.930879e-29 1.067839e-30 7.536402e-53 1.084495e-132
6 unite_20.11.2016_clean_fullITS Recall 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
7 unite_20.11.2016_clean_fullITS F-measure 2.173682e-55 2.073542e-69 5.405091e-54 2.663020e-66 7.336023e-112
8 unite_20.11.2016_clean_fullITS Taxon Accuracy Rate 4.384513e-265 2.122026e-308 4.685986e-305 9.104506e-280 0.000000e+00
9 unite_20.11.2016_clean_fullITS Taxon Detection Rate 2.337681e-127 2.204186e-218 4.205315e-228 2.371274e-235 0.000000e+00

Heatmaps of per-level accuracy

Heatmaps show the performance of individual method/parameter combinations at each taxonomic level, in each reference database (i.e., for bacterial and fungal mock communities individually).


In [8]:
heatmap_from_data_frame(mock_results, metric="Precision", rows=["Method", "Parameters"], cols=["Reference", "Level"])



In [9]:
heatmap_from_data_frame(mock_results, metric="Recall", rows=["Method", "Parameters"], cols=["Reference", "Level"])



In [10]:
heatmap_from_data_frame(mock_results, metric="F-measure", rows=["Method", "Parameters"], cols=["Reference", "Level"])



In [11]:
heatmap_from_data_frame(mock_results, metric="Taxon Accuracy Rate", rows=["Method", "Parameters"], cols=["Reference", "Level"])