In [13]:
from generate_samples import generate_sample, PROTEIN_ALPHABET
# Let's generate two sample sequences of proteins. Length of first sequence will be 50.
seq1, seq2 = generate_sample(PROTEIN_ALPHABET, 50)
print "Sequence 1: " + seq1
print "Sequence 2: " + seq2
I've already generated some linearly distributed sample sequences of length ~100 - ~3000. Let's use them to compare speeds of pairwise2 methods. We will perform alignment of all test sequences and draw nice plots! Attention: scripts below may take some time (up to ~ 2 minutes) to execute!
In [15]:
from Bio import pairwise2
from test_optimization import run_compare_test
# Compare globalxx and localxx methods.
# These methods yield global and local alignments for constant gap penalty = 1 and match score = 1
# Search only for score (do not backtrack alignments - much faster)
run_compare_test("Test Bio pairwise2 globalxx and localxx (score only)",
[pairwise2.align.globalxx, pairwise2.align.localxx], score_only=True)
# Backtrack alignments
run_compare_test("Test Bio pairwise2 globalxx and localxx",
[pairwise2.align.globalxx, pairwise2.align.localxx])
In [16]:
As we can see local alignment methods are ridiculously slow! Why is it so? To see what takes so long I inspected Bio.pairwise2 code. The code contains a main method "align" which performs alignment by invoking some other functions. Its code can be divided into several parts:
I created my class deriving base class of alignment from pairwise2. I overrode this "align" method by copying original code and adding measuring of time of its parts. Then i wrote a script which executes this code several times for a few sample sequence pairs and plots execution times of each part.
Again, code below may take some time to execute (~ 3 minutes).
In [19]:
from analyze_bio_speed import perform_test
from generate_samples import generate_sample, PROTEIN_ALPHABET
# Let's generate 3 samples of lengths ~ 1000
samples_count, seq_len = 3, 1000
seq_pairs = [generate_sample(PROTEIN_ALPHABET, seq_len)] * samples_count
# Analyze execution time of localxx with score_only=True
perform_test(seq_len, seq_pairs, "localxx", "Analysis of Bio pairwise2 localxx execution time (score only)", score_only=True)
# Analyze execution time of globalxx with score_only=True
perform_test(seq_len, seq_pairs, "globalxx", "Analysis of Bio pairwise2 globalxx execution time (score only)", score_only=True)
# Analyze execution time of localxx with backtracing alignments
perform_test(seq_len, seq_pairs, "localxx", "Analysis of Bio pairwise2 localxx execution time")
# Analyze execution time of globalxx with backtracing alignments
perform_test(seq_len, seq_pairs, "globalxx", "Analysis of Bio pairwise2 globalxx execution time")
:::python
""" Fragment of code in "align" method: """ """ finding starts: """ starts = _find_start(score_matrix, align_globally) best_score = max([x[0] for x in starts])
if score_only: return best_score
tolerance = 0 # This seems to be placeholder for a future feature of giving some tolerance on best score
""" filtering starts: """ starts = [(score, pos) for score, pos in starts if rint(abs(score - best_score)) <= rint(tolerance)]
""" _find_start method used in "align" method: """ def _find_start(score_matrix, align_globally): if align_globally: starts = [(score_matrix[-1][-1], (nrows - 1, ncols - 1))] else: starts = [] for row in range(nrows): for col in range(ncols): score = score_matrix[row][col] starts.append((score, (row, col))) return starts
Let's cosider local alignments olny (performance of global alignments is OK). I see the following problems with the code above:
:::python
""" Fragment of code in "align" method: """ nrows, ncols = len(score_matrix), len(score_matrix[0])
if align_globally: starts = [(score_matrix[nrows-1][ncols-1], (nrows-1,ncols-1))] if score_only: return starts[0][0] else: best_score = max(score_matrix[row][col] for row, col in itertools.product(xrange(nrows), xrange(ncols))) if score_only: return best_score
starts = _find_start(score_matrix, best_score)
""" _find_start method used in "align" method: """ def _find_start(score_matrix, best_score): nrows, ncols = len(score_matrix), len(score_matrix[0])
return [(best_score, (row, col)) for row, col in itertools.product(xrange(nrows), xrange(ncols))
if score_matrix[row][col] == best_score]
In [ ]:
# Let's see how new code affects performance
# Again - this methods can take some time, ~ 2 minutes each
from Bio import pairwise2
from lib import optimized_pairwise2
from test_optimization import run_compare_test
# Compare original and optimized localxx methods.
description = "Compare localxx methods (score only)"
print(description)
run_compare_test(description, [pairwise2.align.localxx, optimized_pairwise2.align.localxx], score_only=True)
description = "Compare localxx methods (including alignments)"
print(description)
run_compare_test(description, [pairwise2.align.localxx, optimized_pairwise2.align.localxx])
My task is too optimize existing library, so it is very important that its behavior doesn't change so it can easily replace the original. I assured correctness of results by creating a unit test module. It tests for several methods that results from optimized library are exactly same as results from the original. I don't perform unit test on generated samples. Instead I manually create some test samples for unit tests. They include special cases as empty sequences, two same sequences etc.
Executing all unittests takes really long (because we test original local alignment methods which are terribly slow), however done once it assures that i didn't mess up anything.
In [ ]:
from testdata.test_sequences import get_test_sequences_pairs, all_equal
from run_unittests import AlignmentEquivalenceTestCase
# get all sequence pairs for unit testing
sequence_pairs = [(seq1, seq2) for _, seq1, seq2 in get_test_sequences_pairs(['unit'])]
# creates test suite which tests all methods on all sequence pairs
test_suite = AlignmentsEquivalenceTestCase.get_test_suite(sequence_pairs)
# this takes very very long
unittest.TextTestRunner().run(test_suite)