The purpose of this notebook is to analyze and plot the computational runtimes that were generated for a list of taxonomy assignment methods in a companion notebook.
In [1]:
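from os.path import expandvars, join

import pandas as pd
# xkcd_rgb is a dict of color names exposed in the seaborn namespace,
# so import it by name rather than as a submodule.
from seaborn import xkcd_rgb as colors
from tax_credit.plotting_functions import (lmplot_from_data_frame,
                                           calculate_linear_regress)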
First, load the results file generated by the companion notebook. Modify the paths in the following cell to match your setup, then "run all" cells.
In [2]:
runtime_results = '../../temp_results_runtime/runtime_results.txt'
outdir = '../../plots/'
In [3]:
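df = pd.read_csv(runtime_results, header=None, sep='\t',
                 names=["Method", "Number of Query Sequences",
                        "Number of Reference Sequences",
                        "Iteration", "Runtime (s)"])
# Collapse repeated iterations to the median for each configuration.
# The keys are passed as a list: recent pandas versions treat a tuple
# as a single key rather than as several column names.
df = df.groupby(["Method", "Number of Query Sequences",
                 "Number of Reference Sequences"]).median().reset_index()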
In [4]:
color_palette={
'rdp': colors['baby shit green'], 'sortmerna': colors['macaroni and cheese'],
'uclust': 'coral', 'blast': 'indigo', 'blast+': colors['electric purple'], 'naive-bayes': 'dodgerblue',
'vsearch': 'firebrick'
}
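The grouping step applied when loading the results collapses the repeated iterations of each (method, query count, reference count) combination to their median runtime. A minimal, pandas-free sketch of that aggregation is given here to make the behavior explicit. The helper name `median_runtimes` and all timing values are hypothetical and are not taken from the results file.

```python
from collections import defaultdict
from statistics import median


def median_runtimes(rows):
    """Collapse repeated iterations to one median runtime per group.

    Each row is (method, n_query, n_reference, iteration, runtime_s).
    """
    groups = defaultdict(list)
    for method, n_query, n_ref, _iteration, runtime in rows:
        groups[(method, n_query, n_ref)].append(runtime)
    return {key: median(values) for key, values in groups.items()}


# Hypothetical timings: three iterations per configuration.
example_rows = [
    ("uclust", 1, 10000, 0, 2.0),
    ("uclust", 1, 10000, 1, 2.4),
    ("uclust", 1, 10000, 2, 9.0),  # one unusually slow iteration
    ("vsearch", 1, 10000, 0, 1.0),
    ("vsearch", 1, 10000, 1, 1.2),
    ("vsearch", 1, 10000, 2, 1.1),
]

medians = median_runtimes(example_rows)
for key, value in sorted(medians.items()):
    print(key, value)
```

The median is less sensitive than the mean to an occasional slow iteration, such as the 9.0 s value in the made-up data above.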
In these plots, only a single query sequence is searched against the reference database, so the lines illustrate the effect of the number of reference sequences on runtime. This tells us how long it takes to assign taxonomy to the first query sequence, and therefore provides a measure of the time needed to index the reference. We are primarily interested in the slope of the line, which indicates the runtime cost of each additional reference sequence.
In [5]:
lm, reg = lmplot_from_data_frame(df[df["Number of Query Sequences"] == 1],
                                 "Number of Reference Sequences", "Runtime (s)",
                                 hue="Method", regress=True,
                                 color_palette=color_palette)
reg
Out[5]:
In [6]:
lm.savefig(join(outdir, 'runtime_by_refcount.pdf'))
These plots give us an idea of how runtime scales with the number of input sequences, by varying the number of query sequences that taxonomy is assigned to. Since database indexing is included in all of these runs, we care most about the slope of the line and very little about the y-intercept. The intercept represents how long the database takes to index, a step that can typically be performed once for multiple runs of a taxonomic assigner, so it is a one-time cost and less important.
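The slope and intercept interpretation can be sketched with an ordinary least-squares fit. This is a stand-alone illustration and not the implementation of `calculate_linear_regress`, whose internals are not shown in this notebook. The function `fit_line` and the runtimes are hypothetical, constructed so that the fixed and per-query costs are known in advance.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical runtimes: 5 s of one-time indexing plus 0.01 s per query.
query_counts = [1, 10, 100, 1000, 10000]
runtimes = [5.0 + 0.01 * n for n in query_counts]

slope, intercept = fit_line(query_counts, runtimes)
print(f"per-query cost (slope): {slope:.4f} s")
print(f"fixed cost (intercept): {intercept:.2f} s")
```

In this made-up case the fit recovers the per-query cost as the slope and the one-time indexing cost as the intercept, which is why the slope is the quantity to compare between methods.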
In [7]:
lm, reg = lmplot_from_data_frame(df[df["Number of Reference Sequences"] == 10000],
                                 "Number of Query Sequences", "Runtime (s)",
                                 hue="Method", regress=True,
                                 color_palette=color_palette)
reg
Out[7]:
In [8]:
lm.savefig(join(outdir, 'runtime_by_querycount.pdf'))