Pairwise Alignment Example

Humberto Ortiz-Zuazaga

Introduction

This jupyter notebook presents a pairwise alignment of two nucleotide sequences with an edit distance metric. If you don't have jupyter, just save the code blocks to a python file and run the program. Sometimes, the following link works to make a interactive copy of this notebook.

Preliminaries


In [1]:
import numpy as np

Two nucleotide sequences to align.


In [2]:
seq1 = "GCATGCU"

In [3]:
seq2 = "GATTACA"

We use numpy to create arrays for storing the cost of subalignments. Following wikipedia [1], we make space for one extra row and column. Wikipedia writes the first sequence across the columns, which is the second index in python. If you want to check for yourself, make the two sequences different lengths and check the number of rows and columns. Numpy doesn't work well mixing strings and numbers in an array so we don't store the sequences in the table, we need to keep them separate.


In [4]:
A = np.zeros((len(seq2)+1, len(seq1)+1), np.int8)
print(A)


[[0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]]

We can also create an array to record the direction of the best match.


In [5]:
D = np.array([" "] * (len(seq1)*len(seq2))).reshape(len(seq2), len(seq1))
print(D)


[[' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']]

Initialization

Fill in decreasing scores across the first row and down the first column.


In [6]:
for i in range(len(seq2)+1):
    A[i,0]= -i

In [7]:
print(A)


[[ 0  0  0  0  0  0  0  0]
 [-1  0  0  0  0  0  0  0]
 [-2  0  0  0  0  0  0  0]
 [-3  0  0  0  0  0  0  0]
 [-4  0  0  0  0  0  0  0]
 [-5  0  0  0  0  0  0  0]
 [-6  0  0  0  0  0  0  0]
 [-7  0  0  0  0  0  0  0]]

In [8]:
for i in range(len(seq1)+1):
    A[0,i]= -i

In [9]:
print(A)


[[ 0 -1 -2 -3 -4 -5 -6 -7]
 [-1  0  0  0  0  0  0  0]
 [-2  0  0  0  0  0  0  0]
 [-3  0  0  0  0  0  0  0]
 [-4  0  0  0  0  0  0  0]
 [-5  0  0  0  0  0  0  0]
 [-6  0  0  0  0  0  0  0]
 [-7  0  0  0  0  0  0  0]]

Computing scores of subalignments

Given two sequences s1 and s2, the initialized matrix of scores, the matrix to record directions and the indicies i and j for which we want to work, an alignment proceeds by one of several cases. Insert a gap in s1, insert a gap in s2, match s1 and s2, or mismatch. We want to take the best (largest) score, and record from which direction it came.


In [10]:
def align(s1, s2, costs, directions, i, j):
    "Compute cost and record direction of a partial alignment of S1 and S2 up to positions i and j."
    # Compute cost if inserting a gap in s2
    up = costs[i-1, j] - 1
    # compute cost if inserting gap in s1
    left = costs[i, j-1] - 1
    # is the diagonal a match?
    if s1[j-1] == s2[i-1]:
        c = 1
    # or a mismmatch?
    else:
        c = -1
    # compute cost taking a letter from each seq
    diag = costs[i-1,j-1] + c

    # take the best cost
    best = max(up, left, diag)
    costs[i,j] = best
    
    # Now record the direction we came from
    # we can't store multiple arrows, so just pick one direction
    if up == best:
        directions[i-1,j-1] = "|"
    elif left == best:
        directions[i-1,j-1] = "_"
    else:
        directions[i-1,j-1] = "\\" # is one diagonal slash

Now we can fill in the tables in order by looping over the rows and columns.


In [11]:
for i in range(1, len(seq2)+1):
    for j in range(1, len(seq1)+1):
        align(seq1, seq2, A, D, i, j)

Results


In [12]:
print(A)


[[ 0 -1 -2 -3 -4 -5 -6 -7]
 [-1  1  0 -1 -2 -3 -4 -5]
 [-2  0  0  1  0 -1 -2 -3]
 [-3 -1 -1  0  2  1  0 -1]
 [-4 -2 -2 -1  1  1  0 -1]
 [-5 -3 -3 -1  0  0  0 -1]
 [-6 -4 -2 -2 -1 -1  1  0]
 [-7 -5 -3 -1 -2 -2  0  0]]

In [13]:
for i in range(D.shape[0]):
    for j in range(D.shape[1]):
        print(D[i,j], sep="", end=" ")
    print()


\ _ _ _ _ _ _ 
| \ \ _ _ _ _ 
| | | \ _ _ _ 
| | | | \ _ _ 
| | \ | | \ _ 
| \ | | | \ _ 
| | \ | | | \ 

From D we can reconstruct the alignment starting at the bottom right corner. Since the entry is \ we take one letter from the end of each sequence, and move diagonaly up and to the left. That entry is also \ so we repeat. The next entry is |, so we insert a gap and move up. Continuing until we reach the upper left corner.

Homework

Can you develop a function PrintAlign(D, s1, s2, n, m) that recursively prints out the alignment given the directions, strings, and lengths of the strings? The book has an algorithm PrintLCS in Chapter 6 that almost does what you want.


In [14]:
def PrintAlign(D, s1, s2, n, m):
    # Do something amazing here
    return

When you complete your homework, the following block should work:


In [15]:
PrintAlign(D, seq1, seq2, len(seq1), len(seq2))

References

  1. https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm

  2. Neil C. Jones and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. The MIT Press, Cambridge, Massachusetts. (2004). ISBN-10: 0-262-10106-8 ISBN-13: 978-0-262-10106-6.