This illustrates how to use global alignment to find a short sequence (e.g., a PCR primer) in a longer sequence (e.g., a 454 read) using Needleman-Wunsch alignment.


In [1]:
from skbio.alignment import global_pairwise_align_nucleotide

In [2]:
s = "ACCGTGGACCGTTAGGATTGGACCCAAGGTTG"
t = "T"*25 + "ACCGTGGACCGTAGGATTGGACCAAGGTTA" + "A"*25

a = global_pairwise_align_nucleotide(s, t)


/Users/caporaso/Dropbox/code/scikit-bio/skbio/alignment/_pairwise.py:594: UserWarning: make_identity_substitution_matrix is deprecated and will soon be replaced, though at the time of this writing the new name has not been finalized. Updates will be posted to issue #161: https://github.com/biocore/scikit-bio/issues/161
  warn("make_identity_substitution_matrix is deprecated and will soon be "
/Users/caporaso/Dropbox/code/scikit-bio/skbio/alignment/_pairwise.py:540: EfficiencyWarning: You're using skbio's python implementation of Needleman-Wunsch alignment. This is known to be very slow (e.g., thousands of times slower than a native C implementation). We'll be adding a faster version soon (see https://github.com/biocore/scikit-bio/issues/254 to track progress on this).
  "to track progress on this).", EfficiencyWarning)

In [3]:
print a.to_fasta()


>0
-------------------------ACCGTGGACCGTTAGGATTGGACCCAAGGTTG-------------------------
>1
TTTTTTTTTTTTTTTTTTTTTTTTTACCGTGGACCGT-AGGATTGGACC-AAGGTTAAAAAAAAAAAAAAAAAAAAAAAAAA

We can then slice just the targeted region as follows.


In [4]:
gap_vector = a[0].gap_vector()
start_index = gap_vector.index(False)
end_index = (a.sequence_length() - 1) - gap_vector[::-1].index(False)
print a[0][start_index:end_index+1]
print a[1][start_index:end_index+1]


ACCGTGGACCGTTAGGATTGGACCCAAGGTTG
ACCGTGGACCGT-AGGATTGGACC-AAGGTTA

And count the mismatches and gaps.


In [5]:
mismatch_count = 0
for i in range(start_index, end_index+1):
    if a[0][i] != a[1][i]:
        mismatch_count += 1
print mismatch_count


3

This is a useful example for illustrating the problem that arises when penalize_terminal_gaps == True. This issue arises because if you continue to penalize adding gaps to s1 after you've reached the end, the score get so low that it can be better than or equal to the following alignment. This is clearly not the result that we're looking for, so penalize_terminal_gaps is False for the global aligners.


In [6]:
print global_pairwise_align_nucleotide(s, t, penalize_terminal_gaps=True).to_fasta()


>0
-------------------------ACCGTGGACCGTTAGGATTGGACCCAAGGTT-------------------------G
>1
TTTTTTTTTTTTTTTTTTTTTTTTTACCGTGGACCGT-AGGATTGGACC-AAGGTTAAAAAAAAAAAAAAAAAAAAAAAAAA