Evaluate computational runtimes

The purpose of this notebook is to analyze and plot computational runtimes generated for a list of taxonomy assignment methods in this notebook.



In [6]:

    
from os.path import expandvars
from tax_credit.plotting_functions import (lmplot_from_data_frame, calculate_linear_regress)
import pandas as pd

First, load the results file generated in this notebook. Modify the contents of the following cell, then "run all" cells.



In [7]:

    
runtime_results = expandvars('$HOME/Desktop/projects/tax-credit-runtime/runtime_results.txt')



In [8]:

    
df = pd.read_csv(runtime_results, header=None, sep='\t', 
                 names=["Method", "Number of Query Sequences",
                        "Number of Reference Sequences",
                        "Iteration", "Runtime (s)"])
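# Collapse repeated iterations to the median runtime per method and data set size.
# The grouping keys are passed as a list; newer pandas treats a tuple as a single key.
df = df.groupby(["Method", "Number of Query Sequences",
                 "Number of Reference Sequences"]).median().reset_index()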
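The aggregation step above can be sketched without pandas to make explicit what is being computed: one median runtime per combination of method, query count, and reference count. The sample records, the `SAMPLE` constant, and the `median_runtimes` helper below are hypothetical and only mimic the five-column, tab-separated layout read above; they are not taken from the real results file.

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical records in the layout read above:
# method, number of query sequences, number of reference sequences,
# iteration, runtime in seconds.
SAMPLE = (
    "uclust\t1\t10000\t0\t4.1\n"
    "uclust\t1\t10000\t1\t4.5\n"
    "uclust\t1\t10000\t2\t9.0\n"
    "rdp\t1\t10000\t0\t15.2\n"
    "rdp\t1\t10000\t1\t15.6\n"
    "rdp\t1\t10000\t2\t15.4\n"
)


def median_runtimes(handle):
    """Return {(method, n_query, n_ref): median runtime in seconds}."""
    runtimes = defaultdict(list)
    for method, n_query, n_ref, _iteration, seconds in csv.reader(
            handle, delimiter="\t"):
        runtimes[(method, int(n_query), int(n_ref))].append(float(seconds))
    return {key: statistics.median(values)
            for key, values in runtimes.items()}


medians = median_runtimes(io.StringIO(SAMPLE))
for key, value in sorted(medians.items()):
    print(key, value)
```

The median is less sensitive than the mean to a single slow iteration, such as the 9.0 s outlier in the hypothetical uclust records.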

Runtime as a function of number of reference sequences

In these plots, only a single query sequence is searched against the reference database, so the lines illustrate the effect of number of reference sequences on runtime. This tells us how long it takes to assign taxonomy to the first sequence in our database, and therefore provides a measure of time needed to index the reference. We are primarily interested in the slope of the line, which indicates the effect of additional reference sequences.



In [9]:

    
lmplot_from_data_frame(df[df["Number of Query Sequences"] == 1],
                       "Number of Reference Sequences", "Runtime (s)", 
                       "Method", regress=True)

Runtime as a function of number of query sequences

These plots give us an idea of how runtime scales with the number of input sequences by varying the number of sequences that taxonomy is assigned to. Since database indexing is included in all of these runs, we care most about the slope of the line and very little about the y-intercept, which represents how long the database takes to index. Indexing can typically be performed once for multiple runs of a taxonomic assigner, so it is a one-time cost and thus less important.
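To make the slope and intercept interpretation concrete, here is a minimal ordinary least-squares sketch. It is not the `tax_credit` implementation, whose internals are not shown here; `fit_line` is a hypothetical helper, and the data are synthetic, built from an assumed 5 s one-time indexing cost plus 0.02 s per query sequence.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic (hypothetical) median runtimes: fixed indexing cost + per-query cost.
n_queries = [1, 10, 100, 1000, 10000]
runtimes = [5.0 + 0.02 * n for n in n_queries]

slope, intercept = fit_line(n_queries, runtimes)
print(f"per-query cost (slope): {slope:.4f} s")
print(f"one-time cost (intercept): {intercept:.2f} s")
```

On these synthetic data the fit recovers the per-query cost as the slope and the fixed indexing cost as the intercept, which is why the slope is the quantity to compare between methods.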



In [10]:

    
lmplot_from_data_frame(df[df["Number of Reference Sequences"] == 10000],
                       "Number of Query Sequences", "Runtime (s)",
                       "Method", regress=True)

    Out[10]:

	Method	Slope	Intercept	R	P-val	Std Error
0	blast	0.150201	9.997375	0.999912	9.937231e-07	0.001152
1	blast+	0.032299	18.245219	0.999986	6.585328e-08	0.000100
2	rdp	0.002804	15.511504	0.999020	3.683914e-05	0.000072
3	sortmerna	0.025606	17.075100	0.999264	2.398645e-05	0.000568
4	uclust	0.002911	4.426683	0.999681	6.831395e-06	0.000042
5	vsearch	0.030392	17.064180	0.999701	6.199095e-06	0.000429
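As a rough illustration of how to read these coefficients, the sketch below plugs the slopes and intercepts from the Out[10] table into runtime = intercept + slope × number of query sequences. The 1000-query workload and the `predicted_runtime` helper are hypothetical, and the estimate assumes the linear fit holds at that size, which may lie outside the measured range.

```python
# (slope in seconds per query sequence, intercept in seconds),
# copied from the Out[10] regression table (10,000 reference sequences).
FITS = {
    "blast": (0.150201, 9.997375),
    "blast+": (0.032299, 18.245219),
    "rdp": (0.002804, 15.511504),
    "sortmerna": (0.025606, 17.075100),
    "uclust": (0.002911, 4.426683),
    "vsearch": (0.030392, 17.064180),
}


def predicted_runtime(method, n_queries):
    """Estimated runtime in seconds, assuming the linear fit holds."""
    slope, intercept = FITS[method]
    return intercept + slope * n_queries


# Hypothetical workload of 1000 query sequences, fastest method first.
for method in sorted(FITS, key=lambda m: predicted_runtime(m, 1000)):
    print(f"{method:10s} {predicted_runtime(method, 1000):8.2f} s")
```

Because the intercept is paid once while the slope is paid per query sequence, the ranking of methods can change as the workload grows.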



In [ ]:

	Method	Slope	Intercept	R	P-val	Std Error
0	blast	0.000038	4.358554	0.487257	0.405112	0.000039
1	blast+	0.000099	17.233174	0.405522	0.498196	0.000129
2	rdp	0.000893	6.274990	0.995944	0.000310	0.000047
3	sortmerna	0.000890	6.024165	0.998926	0.000042	0.000024
4	uclust	0.000047	4.003719	0.597064	0.287752	0.000036
5	vsearch	0.000124	16.807911	0.717344	0.172541	0.000070
