Evaluate computational runtimes

This notebook analyzes and plots the computational runtimes of several taxonomy assignment methods, using the runtime results generated in this notebook.


In [6]:
from os.path import expandvars
from tax_credit.plotting_functions import (lmplot_from_data_frame, calculate_linear_regress)
import pandas as pd

First, load the results file generated in this notebook. Modify the path in the following cell so that it points to that file, then run all cells.


In [7]:
runtime_results = expandvars('$HOME/Desktop/projects/tax-credit-runtime/runtime_results.txt')

In [8]:
df = pd.read_csv(runtime_results, header=None, sep='\t', 
                 names=["Method", "Number of Query Sequences",
                        "Number of Reference Sequences",
                        "Iteration", "Runtime (s)"])
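# Collapse repeated iterations to the median runtime for each condition.
# The grouping keys are passed as a list; recent pandas versions treat a tuple as a single key.
df = df.groupby(["Method", "Number of Query Sequences",
                 "Number of Reference Sequences"]).median().reset_index()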

Runtime as a function of number of reference sequences

In these plots, only a single query sequence is searched against the reference database, so the lines illustrate the effect of the number of reference sequences on runtime. This tells us how long it takes to assign taxonomy to the first query sequence, and therefore approximates the time needed to index the reference database. We are primarily interested in the slope of the line, which indicates the runtime cost of each additional reference sequence.


In [9]:
lmplot_from_data_frame(df[df["Number of Query Sequences"] == 1],
                       "Number of Reference Sequences", "Runtime (s)", 
                       "Method", regress=True)


Out[9]:
   Method     Slope     Intercept  R         P-val     Std Error
0  blast      0.000038   4.358554  0.487257  0.405112  0.000039
1  blast+     0.000099  17.233174  0.405522  0.498196  0.000129
2  rdp        0.000893   6.274990  0.995944  0.000310  0.000047
3  sortmerna  0.000890   6.024165  0.998926  0.000042  0.000024
4  uclust     0.000047   4.003719  0.597064  0.287752  0.000036
5  vsearch    0.000124  16.807911  0.717344  0.172541  0.000070
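
The Slope, Intercept, and R columns read like the output of an ordinary least-squares fit of runtime against the number of reference sequences. The internals of `calculate_linear_regress` are not shown here, so the following is only a minimal sketch of that kind of fit, written with the standard library. The function name `linear_fit` and the data points are hypothetical and are not taken from the results file.

```python
from math import sqrt
from statistics import mean


def linear_fit(xs, ys):
    """Ordinary least-squares fit of ys on xs; returns (slope, intercept, r)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # change in runtime per extra reference sequence
    intercept = y_bar - slope * x_bar  # fitted runtime at zero reference sequences
    r = sxy / sqrt(sxx * syy)          # Pearson correlation coefficient
    return slope, intercept, r


# Hypothetical runtimes (seconds) for one method at four reference database sizes.
ref_sizes = [1000, 2000, 3000, 4000]
runtimes = [7.0, 8.0, 9.0, 10.0]

slope, intercept, r = linear_fit(ref_sizes, runtimes)
print(f"slope={slope:.6f} s/sequence, intercept={intercept:.3f} s, r={r:.3f}")
```

In this made-up example every additional 1,000 reference sequences adds one second, so the fitted slope is the quantity of interest, just as in the table above.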

Runtime as a function of number of query sequences

These plots give us an idea of how runtime scales with the number of input sequences, by varying the number of sequences to which taxonomy is assigned. Since database indexing is included in all of these steps, we care most about the slope of the line and very little about the y-intercept. The intercept represents how long the database takes to index, a step that can typically be performed once for multiple runs of a taxonomic assigner, so it is a one-time cost and thus less important.


In [10]:
lmplot_from_data_frame(df[df["Number of Reference Sequences"] == 10000],
                       "Number of Query Sequences", "Runtime (s)",
                       "Method", regress=True)


Out[10]:
   Method     Slope     Intercept  R         P-val         Std Error
0  blast      0.150201   9.997375  0.999912  9.937231e-07  0.001152
1  blast+     0.032299  18.245219  0.999986  6.585328e-08  0.000100
2  rdp        0.002804  15.511504  0.999020  3.683914e-05  0.000072
3  sortmerna  0.025606  17.075100  0.999264  2.398645e-05  0.000568
4  uclust     0.002911   4.426683  0.999681  6.831395e-06  0.000042
5  vsearch    0.030392  17.064180  0.999701  6.199095e-06  0.000429
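
To see why the slope matters more than the intercept, the fitted lines can be used to project runtime for a larger job. The sketch below copies the Slope and Intercept values from the table above and evaluates runtime = slope × queries + intercept. The helper `predicted_runtime` and the query count are hypothetical, and the chosen count may fall outside the measured range, so the figures are a rough extrapolation rather than a benchmark.

```python
# (slope in seconds per query sequence, intercept in seconds), copied from the table above.
fits = {
    "blast": (0.150201, 9.997375),
    "blast+": (0.032299, 18.245219),
    "rdp": (0.002804, 15.511504),
    "sortmerna": (0.025606, 17.075100),
    "uclust": (0.002911, 4.426683),
    "vsearch": (0.030392, 17.064180),
}


def predicted_runtime(method, n_queries):
    """Evaluate the fitted line for one method at a given number of query sequences."""
    slope, intercept = fits[method]
    return slope * n_queries + intercept


n_queries = 100_000  # hypothetical job size, chosen only for illustration

# Rank methods from fastest to slowest projected runtime.
ranking = sorted(fits, key=lambda m: predicted_runtime(m, n_queries))
for method in ranking:
    print(f"{method:10s} {predicted_runtime(method, n_queries):10.1f} s")
```

At this scale the intercepts, which differ by at most about 14 seconds between methods, contribute almost nothing to the ranking; the per-sequence slope dominates.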
