Evaluate classification accuracy

This notebook demonstrates how to evaluate the classification accuracy of "cross-validated" simulated communities. Because of the nature of this analysis, the metrics that we use to evaluate classification accuracy differ from those used for mock communities.

The key measure here is the rate of match vs. overclassification, hence precision, recall, and F-measure (P/R/F) are not useful metrics. Instead, we define and measure the following as percentages:

  • Match vs. overclassification rate
    • Match: exact match at level L
    • Underclassification: lineage assignment is correct, but shorter than expected (e.g., not to species level)
    • Misclassification: incorrect assignment

where L is the taxonomic level being tested.
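
The three outcomes defined above can be sketched in a few lines of Python. This is a minimal illustration, not the tax_credit implementation: the function names, the semicolon-delimited lineage format, and the convention that the first rank in the string is level 0 (so that level L covers the first L + 1 ranks) are all assumptions made for this example. An empty assignment is counted as an underclassification here.

```python
from collections import Counter

OUTCOMES = ('match', 'underclassification', 'misclassification')

def classify_assignment(expected, observed, level):
    """Label one assignment (hypothetical helper, not part of tax_credit).

    Assumes semicolon-delimited lineages whose first rank is level 0,
    so level L covers the first L + 1 ranks.
    """
    depth = level + 1
    exp = [rank.strip() for rank in expected.split(';')][:depth]
    obs = [rank.strip() for rank in observed.split(';') if rank.strip()][:depth]
    if obs == exp:
        return 'match'
    # Shorter than expected, but agreeing rank for rank as far as it goes.
    if len(obs) < len(exp) and exp[:len(obs)] == obs:
        return 'underclassification'
    return 'misclassification'

def outcome_percentages(pairs, level):
    """Return the percentage of each outcome over (expected, observed) pairs."""
    counts = Counter(classify_assignment(e, o, level) for e, o in pairs)
    total = sum(counts.values())
    return {name: 100.0 * counts[name] / total for name in OUTCOMES}

expected = 'k__Bacteria;p__Firmicutes;c__Bacilli'
pairs = [
    (expected, 'k__Bacteria;p__Firmicutes;c__Bacilli'),
    (expected, 'k__Bacteria;p__Firmicutes'),
    (expected, 'k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria'),
    (expected, 'k__Bacteria;p__Firmicutes;c__Bacilli'),
]
percentages = outcome_percentages(pairs, level=2)
print(percentages)
```

With these four toy assignments, half are matches and the remainder split evenly between the other two outcomes.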


In [8]:
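# The continuation lines of the two parenthesized imports are missing from this export.
# The names added below are those used later in the notebook; their module placement is assumed.
from tax_credit.framework_functions import (novel_taxa_classification_evaluation,
                                            extract_per_level_accuracy)
from tax_credit.eval_framework import parameter_comparisons
from tax_credit.plotting_functions import (pointplot_from_data_frame,
                                           heatmap_from_data_frame,
                                           per_level_kruskal_wallis)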
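# xkcd_rgb is a dictionary attribute of seaborn, not a submodule, so it must be imported with from-import.
from seaborn import xkcd_rgb as colors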
import pandas as pd
from os.path import expandvars, join, exists
from glob import glob
from IPython.display import display, Markdown

Evaluate classification results

First, enter the file and directory paths where your data are stored, and the destination for the results summary.

In [9]:
project_dir = "../.."
analysis_name = "cross-validated"
precomputed_results_dir = join(project_dir, "data", "precomputed-results", analysis_name)
expected_results_dir = join(project_dir, "data", analysis_name)
summary_fp = join(precomputed_results_dir, 'evaluate_classification_summary.csv')

results_dirs = glob(join(precomputed_results_dir, '*', '*', '*', '*'))
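
To illustrate what the four-wildcard pattern selects, the sketch below builds a throwaway directory tree and applies the same kind of glob. The directory names and the dataset / reference / method / parameters layout are hypothetical, chosen only for this example. Note that `glob` matches any entry exactly four levels below the root, files as well as directories, and skips anything shallower.

```python
import tempfile
from glob import glob
from os import makedirs, sep
from os.path import join, relpath

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: dataset / reference / method / parameters
    for parts in [('B1-REF', 'ref-a', 'rdp', '0.5'),
                  ('B1-REF', 'ref-a', 'uclust', '0.7'),
                  ('F1-REF', 'ref-b', 'rdp', '0.5')]:
        makedirs(join(root, *parts))
    # This directory is only three levels deep, so the pattern does not match it.
    makedirs(join(root, 'B1-REF', 'ref-a', 'incomplete'))

    matched = sorted(relpath(p, root)
                     for p in glob(join(root, '*', '*', '*', '*')))

for path in matched:
    print(path)
```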

This cell performs the classification evaluation and should not be modified.

In [10]:
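force = False
if force or not exists(summary_fp):
    # Recompute the evaluation when forced or when no cached summary exists.
    accuracy_results = novel_taxa_classification_evaluation(results_dirs, expected_results_dir,
                                                            summary_fp, test_type='cross-validated')
else:
    # Otherwise load the cached summary (pd.DataFrame.from_csv was removed in pandas 1.0).
    accuracy_results = pd.read_csv(summary_fp, index_col=0)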
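
The cell above follows a compute-or-load pattern: run the expensive evaluation only when forced or when no summary file exists, and read the cached file otherwise. A generic sketch of that pattern is shown here. The names `load_or_compute` and `expensive_evaluation` are hypothetical, and unlike the notebook's evaluation function, this helper writes the CSV itself.

```python
import tempfile
from os.path import exists, join

import pandas as pd

def load_or_compute(summary_fp, compute, force=False):
    """Recompute and cache when forced or uncached; otherwise load the cache."""
    if force or not exists(summary_fp):
        results = compute()
        results.to_csv(summary_fp)
    else:
        results = pd.read_csv(summary_fp, index_col=0)
    return results

calls = []  # records how many times the expensive step actually runs

def expensive_evaluation():
    calls.append(1)
    return pd.DataFrame({'Method': ['rdp', 'uclust'],
                         'Precision': [0.9, 0.8]})

with tempfile.TemporaryDirectory() as tmp:
    fp = join(tmp, 'summary.csv')
    first = load_or_compute(fp, expensive_evaluation)               # computes
    second = load_or_compute(fp, expensive_evaluation)              # loads cache
    third = load_or_compute(fp, expensive_evaluation, force=True)   # recomputes

print(len(calls))
```

Setting `force=True` is the way to refresh a stale summary after the underlying results change.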

Plot classification accuracy

Finally, we plot our results. Line plots show the mean +/- 95% confidence interval for each classification result at each taxonomic level (1 = phylum, 6 = species) in each dataset tested. Do not modify the cell below, except to adjust the color_pallette used for plotting. This palette can be a dictionary of colors for each group, as shown below, or a seaborn color palette.

match_ratio = proportion of correct matches.

underclassification_ratio = proportion of assignments to the correct lineage but to a shallower taxonomic level than expected (e.g., genus instead of species).

misclassification_ratio = proportion of assignments to an incorrect lineage.
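
For readers who want to see what a "mean +/- 95% confidence interval" involves, here is a minimal sketch that estimates the interval with a percentile bootstrap, which is also how seaborn's point plots estimate intervals by default. The function name, the toy scores, and the number of resamples are assumptions for illustration; the plotting code may use different settings.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=2000, ci=95, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement: one row per bootstrap replicate.
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    boot_means = samples.mean(axis=1)
    half = (100 - ci) / 2
    lower, upper = np.percentile(boot_means, [half, 100 - half])
    return values.mean(), lower, upper

# Toy per-parameter scores for one method at one taxonomic level.
scores = [0.82, 0.79, 0.91, 0.85, 0.88, 0.80, 0.86]
mean, lower, upper = bootstrap_mean_ci(scores)
print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```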

In [11]:
# Map each method name to a plotting color; every value in the "Method" column needs a key here.
color_pallette = {
    'expected': 'black', 'rdp': colors['baby shit green'], 'sortmerna': colors['macaroni and cheese'],
    'uclust': 'coral', 'blast': 'indigo', 'blast+': colors['electric purple'], 'naive-bayes': 'dodgerblue',
    'naive-bayes-bespoke': 'blue', 'vsearch': 'firebrick'
}

level_results = extract_per_level_accuracy(accuracy_results)

y_vars = ['Precision', 'Recall', 'F-measure']

In [12]:
pointplot_from_data_frame(level_results, "level", y_vars,
                          group_by="Dataset", color_by="Method",
                          color_palette=color_pallette)

KeyError                                  Traceback (most recent call last)
<ipython-input-12-6e3e883a7037> in <module>()
      1 pointplot_from_data_frame(level_results, "level", y_vars,
      2                           group_by="Dataset", color_by="Method",
----> 3                           color_palette=color_pallette)

~/projects/short-read-tax-assignment-bk/tax_credit/plotting_functions.py in pointplot_from_data_frame(df, x_axis, y_vars, group_by, color_by, color_palette, style_theme, plot_type)
     88     for y_var in y_vars:
     89         grid[y_var] = sns.FacetGrid(df, col=group_by, hue=color_by,
---> 90                                     palette=color_palette)
     91         grid[y_var] = grid[y_var].map(
     92             sns.pointplot, x_axis, y_var, marker="o", ms=4)

~/miniconda3/envs/qiime2-2017.6/lib/python3.5/site-packages/seaborn/axisgrid.py in __init__(self, data, row, col, hue, col_wrap, sharex, sharey, size, aspect, palette, row_order, col_order, hue_order, hue_kws, dropna, legend_out, despine, margin_titles, xlim, ylim, subplot_kws, gridspec_kws)
    234             hue_names = utils.categorical_order(data[hue], hue_order)
--> 236         colors = self._get_palette(data, hue, hue_order, palette)
    238         # Set up the lists of names for the row and column facet variables

~/miniconda3/envs/qiime2-2017.6/lib/python3.5/site-packages/seaborn/axisgrid.py in _get_palette(self, data, hue, hue_order, palette)
    156             # Allow for palette to map from hue variable names
    157             elif isinstance(palette, dict):
--> 158                 color_names = [palette[h] for h in hue_names]
    159                 colors = color_palette(color_names, n_colors)

~/miniconda3/envs/qiime2-2017.6/lib/python3.5/site-packages/seaborn/axisgrid.py in <listcomp>(.0)
    156             # Allow for palette to map from hue variable names
    157             elif isinstance(palette, dict):
--> 158                 color_names = [palette[h] for h in hue_names]
    159                 colors = color_palette(color_names, n_colors)

KeyError: 'naive-bayes-bespoke'
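
The traceback above shows seaborn looking up every hue value in the palette dictionary (`palette[h] for h in hue_names`), so any method missing from the dictionary raises a `KeyError`. A small pre-flight check, sketched below with a hypothetical helper name and toy data, reports missing keys before plotting.

```python
import pandas as pd

def missing_palette_keys(df, hue_col, palette):
    """Return hue values in df[hue_col] that have no entry in the palette dict."""
    return sorted(set(df[hue_col].unique()) - set(palette))

# Toy data: one method is absent from the palette.
demo = pd.DataFrame({'Method': ['rdp', 'uclust', 'naive-bayes-bespoke']})
demo_palette = {'rdp': 'green', 'uclust': 'coral'}

missing = missing_palette_keys(demo, 'Method', demo_palette)
print(missing)
```

Adding an entry for each reported name to the palette dictionary and re-running the plotting cell resolves the error.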

Naive-Bayes k-mer length picker

In [13]:
from pandas import DataFrame, concat, to_numeric

In [14]:
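from ast import literal_eval

# Keep only naive-bayes results and renumber rows so they align with `params` below.
nb_results = level_results[level_results['Method'] == 'naive-bayes']
nb_results = nb_results.reset_index(drop=True)
columns = ['Alpha', 'kmer', 'Confidence']
def decode_params(p):
    # Split the colon-delimited parameter string into its fields.
    p = p.split(':')
    # The second-to-last field is a Python literal; keep its first element as the k-mer length.
    # literal_eval parses literals only, whereas eval would execute arbitrary code.
    p[-2] = int(literal_eval(p[-2])[0])
    return p
params = DataFrame((decode_params(s) for s in nb_results['Parameters']), columns=columns)
keepers = ['Dataset', 'level', 'Method']
metrics = ['Precision', 'Recall', 'F-measure']
raw_param_results = concat([nb_results[keepers + metrics], params], axis=1)
raw_param_results = raw_param_results.apply(to_numeric, errors='ignore')
# Average the metrics over every unique dataset/level/parameter combination.
param_results = raw_param_results.groupby(keepers + columns, as_index=False).mean()
param_results.level = param_results.level.astype(int)
param_results.kmer = param_results.kmer.astype(int)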
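
To make the parameter decoding concrete, the sketch below runs the same logic on two made-up parameter strings. The `alpha:[k, k]:confidence` format is an assumption inferred from the column names and the indexing in `decode_params`; the actual strings in the results may differ.

```python
from ast import literal_eval

import pandas as pd

def decode_params(p):
    """Split a colon-delimited parameter string and extract the k-mer length."""
    fields = p.split(':')
    # Second-to-last field is assumed to be a list literal such as "[7,7]".
    fields[-2] = int(literal_eval(fields[-2])[0])
    return fields

# Hypothetical parameter strings, for illustration only.
example_strings = ['0.001:[7,7]:0.5', '0.01:[8,8]:0.7']
decoded = [decode_params(s) for s in example_strings]

params = pd.DataFrame(decoded, columns=['Alpha', 'kmer', 'Confidence'])
params = params.apply(pd.to_numeric)  # convert the string fields to numbers
print(params)
```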


In [15]:
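# Highlight species level (6) in orange; levels 1-5 share the same blue.
level_pallete = {n:'blue' for n in range(1,6)}
level_pallete[6] = 'orange'
pointplot_from_data_frame(param_results, "kmer", y_vars,
                          group_by="Dataset", color_by="level",
                          color_palette=level_pallete)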


Per-level classification accuracy statistic

Kruskal-Wallis FDR-corrected p-values comparing classification methods at each level of taxonomic assignment

In [17]:
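# Any further arguments are cut off in this export; the function's defaults are assumed for them.
result = per_level_kruskal_wallis(level_results, y_vars, group_by='Method',
                                  dataset_col='Dataset', alpha=0.05)
result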

   Dataset  Variable   1             2             3             4             5             6
0  B1-REF   Precision  5.415276e-03  1.167548e-01  7.945862e-20  6.120104e-10  3.674093e-18  1.191819e-15
1  B1-REF   Recall     8.646268e-22  1.737241e-21  4.183030e-23  7.505660e-21  1.023730e-12  1.244342e-10
2  B1-REF   F-measure  3.576752e-22  9.935459e-22  7.783520e-26  3.665289e-21  6.784682e-21  1.569797e-26
3  F1-REF   Precision  2.975429e-37  8.712832e-44  1.581147e-46  3.632389e-40  4.066763e-29  6.820908e-19
4  F1-REF   Recall     4.740560e-38  3.632389e-40  3.922101e-40  1.537953e-40  4.736284e-42  6.113875e-39
5  F1-REF   F-measure  4.740560e-38  1.013492e-40  6.337137e-41  5.051777e-42  4.284161e-46  2.968640e-53
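
The kind of test summarized in the table above can be sketched as follows: at each taxonomic level, a Kruskal-Wallis test compares the score distributions of the methods, and the resulting p-values are then adjusted with the Benjamini-Hochberg FDR procedure. This is an illustration on synthetic data, not the tax_credit implementation. The helper names are hypothetical, and applying the correction across levels within one variable is an assumption.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0, 1)
    return adjusted

def kruskal_by_level(df, y_var, group_by='Method', level_col='level'):
    """Kruskal-Wallis across groups at each level, FDR-adjusted across levels."""
    pvals = {}
    for level, level_df in df.groupby(level_col):
        groups = [g[y_var].to_numpy() for _, g in level_df.groupby(group_by)]
        pvals[level] = kruskal(*groups).pvalue
    levels = sorted(pvals)
    adjusted = benjamini_hochberg([pvals[lvl] for lvl in levels])
    return pd.Series(adjusted, index=levels, name=y_var)

# Synthetic scores: three methods with shifted means at three levels.
rng = np.random.default_rng(0)
rows = []
for level in (1, 2, 3):
    for method, shift in (('rdp', 0.0), ('uclust', 0.1), ('blast', 0.2)):
        for value in rng.normal(0.7 + shift, 0.05, size=10):
            rows.append({'level': level, 'Method': method, 'Precision': value})
demo = pd.DataFrame(rows)

adjusted = kruskal_by_level(demo, 'Precision')
print(adjusted)
```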

Heatmaps of method accuracy by parameter

Heatmaps show the performance of individual method/parameter combinations at each taxonomic level, in each reference database (i.e., for bacterial and fungal simulated datasets individually).

In [18]:
heatmap_from_data_frame(level_results, metric="Precision", rows=["Method", "Parameters"], cols=["Dataset", "level"])

In [19]:
heatmap_from_data_frame(level_results, metric="Recall", rows=["Method", "Parameters"], cols=["Dataset", "level"])
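
The table behind such a heatmap can be sketched as a pandas pivot: one row per method/parameter combination, one column per dataset/level pair, and the mean of the chosen metric in each cell. The values and parameter strings below are made up, and this is only an approximation of what `heatmap_from_data_frame` might compute internally.

```python
import pandas as pd

# Toy results: two method/parameter combinations, two datasets, two levels.
demo = pd.DataFrame({
    'Method': ['rdp'] * 4 + ['uclust'] * 4,
    'Parameters': ['0.5'] * 4 + ['0.51'] * 4,
    'Dataset': ['B1-REF', 'B1-REF', 'F1-REF', 'F1-REF'] * 2,
    'level': [5, 6, 5, 6] * 2,
    'Precision': [0.95, 0.80, 0.90, 0.70, 0.93, 0.75, 0.88, 0.65],
})

# Rows and columns mirror the rows=/cols= arguments used in the cells above.
table = demo.pivot_table(index=['Method', 'Parameters'],
                         columns=['Dataset', 'level'],
                         values='Precision', aggfunc='mean')
print(table)
```

A table of this shape can be passed to `seaborn.heatmap` to render the colored grid.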