This illustrates how to use global alignment to find a short sequence (e.g., a PCR primer) in a longer sequence (e.g., a 454 read) using Needleman-Wunsch alignment.
In [1]:
from skbio.alignment import global_pairwise_align_nucleotide
In [2]:
s = "ACCGTGGACCGTTAGGATTGGACCCAAGGTTG"
t = "T"*25 + "ACCGTGGACCGTAGGATTGGACCAAGGTTA" + "A"*25
a = global_pairwise_align_nucleotide(s, t)
In [3]:
print a.to_fasta()
We can then slice just the targeted region as follows.
In [4]:
gap_vector = a[0].gap_vector()
start_index = gap_vector.index(False)
end_index = (a.sequence_length() - 1) - gap_vector[::-1].index(False)
print a[0][start_index:end_index+1]
print a[1][start_index:end_index+1]
And count the mismatches and gaps.
In [5]:
mismatch_count = 0
for i in range(start_index, end_index+1):
if a[0][i] != a[1][i]:
mismatch_count += 1
print mismatch_count
This is a useful example for illustrating the problem that arises when penalize_terminal_gaps == True
. This issue arises because if you continue to penalize adding gaps to s1
after you've reached the end, the score get so low that it can be better than or equal to the following alignment. This is clearly not the result that we're looking for, so penalize_terminal_gaps
is False
for the global aligners.
In [6]:
print global_pairwise_align_nucleotide(s, t, penalize_terminal_gaps=True).to_fasta()