analysis


Evaluate computational runtimes

The purpose of this notebook is to analyze and plot computational runtimes generated for a list of taxonomy assignment methods in this notebook.


In [1]:
from os.path import expandvars, join
import pandas as pd
# xkcd_rgb is a dict exposed at the seaborn top level, so import it by name
from seaborn import xkcd_rgb as colors
from tax_credit.plotting_functions import (lmplot_from_data_frame,
                                           calculate_linear_regress)

First, load the results file generated in this notebook. Modify the contents of the following cell, then "run all" cells.


In [2]:
runtime_results = '../../temp_results_runtime/runtime_results.txt'
outdir = '../../plots/'

In [3]:
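df = pd.read_csv(runtime_results, header=None, sep='\t',
                 names=["Method", "Number of Query Sequences",
                        "Number of Reference Sequences",
                        "Iteration", "Runtime (s)"])
# Take the median over repeated iterations of each method/size combination.
# The keys are passed as a list because recent pandas versions treat a
# tuple as a single key.
df = df.groupby(["Method", "Number of Query Sequences",
                 "Number of Reference Sequences"]).median().reset_index()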

In [4]:
color_palette = {
    'rdp': colors['baby shit green'],
    'sortmerna': colors['macaroni and cheese'],
    'uclust': 'coral',
    'blast': 'indigo',
    'blast+': colors['electric purple'],
    'naive-bayes': 'dodgerblue',
    'vsearch': 'firebrick',
}

Runtime as a function of number of reference sequences

In these plots, only a single query sequence is searched against the reference database, so the lines illustrate the effect of number of reference sequences on runtime. This tells us how long it takes to assign taxonomy to the first sequence in our database, and therefore provides a measure of time needed to index the reference. We are primarily interested in the slope of the line, which indicates the effect of additional reference sequences.


In [5]:
lm, reg = lmplot_from_data_frame(df[df["Number of Query Sequences"] == 1],
                                 "Number of Reference Sequences", "Runtime (s)", 
                                 hue="Method", regress=True, color_palette=color_palette)
reg


Out[5]:
        Method     Slope  Intercept         R     P-val  Std Error
0        blast  0.000100   3.711695  0.452791  0.367229   0.000098
1       blast+  0.000080   3.598454  0.512731  0.298300   0.000067
2  naive-bayes  0.000483   4.546587  0.736772  0.094814   0.000222
3          rdp  0.001696   8.200201  0.790514  0.061230   0.000657
4    sortmerna  0.000543   4.888434  0.937304  0.005773   0.000101
5       uclust  0.000028   4.064430  0.144915  0.784148   0.000095
6      vsearch  0.000072   3.496711  0.830637  0.040597   0.000024
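
To make the Slope, Intercept, and R columns concrete, here is a minimal sketch of an ordinary least-squares line fit in plain Python. The internals of calculate_linear_regress are not shown in this notebook, so this is an illustration of the general technique rather than that function's implementation. The helper name fit_line and the sample runtimes are hypothetical.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Made-up runtimes (not measured data): a 5 s fixed cost plus
# 0.001 s per reference sequence.
ref_counts = [10, 100, 1000, 10000]
runtimes = [5 + 0.001 * n for n in ref_counts]

slope, intercept, r = fit_line(ref_counts, runtimes)
print(round(slope, 6), round(intercept, 6), round(r, 6))
```

In this toy data the slope recovers the per-sequence cost and the intercept recovers the fixed cost, which is how the two columns are read in the tables in this notebook.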

In [6]:
lm.savefig(join(outdir, 'runtime_by_refcount.pdf'))

Runtime as a function of number of query sequences

These plots give us an idea of how runtime scales with the number of input sequences by varying the number of sequences that taxonomy is assigned to. Since database indexing is included in all of these steps, we care most about the slope of the line and very little about the y-intercept, which represents how long the database takes to index. Indexing can typically be performed once for multiple runs of a taxonomic assigner, so it is a one-time cost and thus less important.


In [7]:
lm, reg = lmplot_from_data_frame(df[df["Number of Reference Sequences"] == 10000],
                                 "Number of Query Sequences", "Runtime (s)",
                                 hue="Method", regress=True, color_palette=color_palette)
reg


Out[7]:
        Method     Slope  Intercept         R         P-val  Std Error
0        blast  0.133292   4.380001  0.999943  4.905518e-09   0.000713
1       blast+  0.026222   3.522446  0.999830  4.318983e-08   0.000242
2  naive-bayes  0.022984   9.130614  0.999315  7.037541e-07   0.000426
3          rdp  0.002920  23.789860  0.954874  3.008638e-03   0.000454
4    sortmerna  0.003819  12.814103  0.981001  5.380209e-04   0.000378
5       uclust  0.002248   3.984771  0.996961  1.383712e-05   0.000088
6      vsearch  0.030190   9.059477  0.996643  1.688392e-05   0.001240

In [8]:
lm.savefig(join(outdir, 'runtime_by_querycount.pdf'))
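
As a rough illustration of why the slope matters more than the intercept, the sketch below uses two fitted lines from the Out[7] table to estimate the query count at which a method with a low intercept but steep slope (blast) becomes slower than one with a high intercept but shallow slope (rdp). The helper names predicted_runtime and break_even are hypothetical. The result is a linear extrapolation from fits made with 10000 reference sequences on one machine, so treat it as indicative only.

```python
# Slope (s per query sequence) and intercept (s) copied from the Out[7] table.
fits = {
    'blast': (0.133292, 4.380001),
    'rdp': (0.002920, 23.789860),
}


def predicted_runtime(method, n_queries):
    """Runtime in seconds predicted by the fitted line for a method."""
    slope, intercept = fits[method]
    return intercept + slope * n_queries


def break_even(method_a, method_b):
    """Query count at which the two fitted lines intersect."""
    slope_a, intercept_a = fits[method_a]
    slope_b, intercept_b = fits[method_b]
    return (intercept_b - intercept_a) / (slope_a - slope_b)


n_star = break_even('blast', 'rdp')
print(f"fitted lines cross near {n_star:.0f} query sequences")
```

Beyond the crossing point, the per-sequence cost dominates the fixed indexing cost, so the method with the smaller slope is predicted to finish first.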