This jupyter notebook presents a pairwise alignment of two nucleotide sequences with an edit distance metric. If you don't have jupyter, just save the code blocks to a python file and run the program. Sometimes, the following link works to make a interactive copy of this notebook.
In [1]:
import numpy as np
Two nucleotide sequences to align.
In [2]:
seq1 = "GCATGCU"
In [3]:
seq2 = "GATTACA"
We use numpy to create arrays for storing the cost of subalignments. Following wikipedia [1], we make space for one extra row and column. Wikipedia writes the first sequence across the columns, which is the second index in python. If you want to check for yourself, make the two sequences different lengths and check the number of rows and columns. Numpy doesn't work well mixing strings and numbers in an array so we don't store the sequences in the table, we need to keep them separate.
In [4]:
A = np.zeros((len(seq2)+1, len(seq1)+1), np.int8)
print(A)
We can also create an array to record the direction of the best match.
In [5]:
D = np.array([" "] * (len(seq1)*len(seq2))).reshape(len(seq2), len(seq1))
print(D)
In [6]:
for i in range(len(seq2)+1):
A[i,0]= -i
In [7]:
print(A)
In [8]:
for i in range(len(seq1)+1):
A[0,i]= -i
In [9]:
print(A)
Given two sequences s1 and s2, the initialized matrix of scores, the matrix to record directions and the indicies i and j for which we want to work, an alignment proceeds by one of several cases. Insert a gap in s1, insert a gap in s2, match s1 and s2, or mismatch. We want to take the best (largest) score, and record from which direction it came.
In [10]:
def align(s1, s2, costs, directions, i, j):
"Compute cost and record direction of a partial alignment of S1 and S2 up to positions i and j."
# Compute cost if inserting a gap in s2
up = costs[i-1, j] - 1
# compute cost if inserting gap in s1
left = costs[i, j-1] - 1
# is the diagonal a match?
if s1[j-1] == s2[i-1]:
c = 1
# or a mismmatch?
else:
c = -1
# compute cost taking a letter from each seq
diag = costs[i-1,j-1] + c
# take the best cost
best = max(up, left, diag)
costs[i,j] = best
# Now record the direction we came from
# we can't store multiple arrows, so just pick one direction
if up == best:
directions[i-1,j-1] = "|"
elif left == best:
directions[i-1,j-1] = "_"
else:
directions[i-1,j-1] = "\\" # is one diagonal slash
Now we can fill in the tables in order by looping over the rows and columns.
In [11]:
for i in range(1, len(seq2)+1):
for j in range(1, len(seq1)+1):
align(seq1, seq2, A, D, i, j)
In [12]:
print(A)
In [13]:
for i in range(D.shape[0]):
for j in range(D.shape[1]):
print(D[i,j], sep="", end=" ")
print()
From D we can reconstruct the alignment starting at the bottom right corner. Since the entry is \ we take one letter from the end of each sequence, and move diagonaly up and to the left. That entry is also \ so we repeat. The next entry is |, so we insert a gap and move up. Continuing until we reach the upper left corner.
In [14]:
def PrintAlign(D, s1, s2, n, m):
# Do something amazing here
return
When you complete your homework, the following block should work:
In [15]:
PrintAlign(D, seq1, seq2, len(seq1), len(seq2))
https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
Neil C. Jones and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. The MIT Press, Cambridge, Massachusetts. (2004). ISBN-10: 0-262-10106-8 ISBN-13: 978-0-262-10106-6.