In [1]:
from sys import path
from os.path import join as joinpath
from os.path import normpath
path.append(normpath(".."))

from pyprot.base.aminoacid import AminoAcid
from pyprot.base.sequence import Sequence
from pyprot.data.fasta import getSequencesFromFasta
from pyprot.align.score import ScoreMatrix
from pyprot.align.align import Align, Aligned

Sequence alignment

Genetic material such as DNA and proteins comes in sequences made up of different blocks. The study of these sequences and their correlation is invaluable to the field of biology. Since genetic material is likely to mutate over time, it is not easy to be certain of these correlations in most cases. In order to know if two or more sequences have a common origin, we must examine their similarity in a formal way, and be able to quantify it. Since the modifications of the sequences can introduce new blocks as well as delete some, the first step to comparing sequences is aligning them by making the similar regions match. Even then, high levels of similarity do not necessarily imply a common ancestor exists (homology). Our goal in this section is to create a tool that allows us to perform such alignments and give a numeric value of the chance of homology.

ADTs

We will first look at an implementation of Abstract Data Types (ADTs) that will allow us to represent Sequences.

AminoAcid

The choice of genetic material is proteins. In the case of proteins, the blocks that make up the sequences are called amino acids. Therefore, they are the first thing we must be able to represent. The following tuple lists all amino acids and miscellaneous options.

The following class allows us to create AminoAcid objects that represent any of the amino acids from the list. We can then compare, print, hash, test them with any of the defined methods. Do note that we can create them by copying other AminoAcids or by providing their name in any form (long for complete, medium for 3 chars or short for 1 char).

Let's test it a little bit.


In [2]:
a1 = AminoAcid("A")
print(a1)
a2 = AminoAcid(a1)
print(a1 == a2)
a3 = AminoAcid("K")
print(a3.getName("long"))


A
True
lysine

Sequence

That's good and all, but we don't want to align single amino acids do we ? What we want to align is sequences. We can view a Sequence as an ordered list of AminoAcid objects. It should be iterable : we'll want to go through, access, and count its items. We'll also want to change them, insert new ones, delete some, and do all that transparently, as we would with a list. Finally it would be useful to check whether a sequence contains some other sub-sequence or single amino acid. The following is a class that does just that.

Since we may not want to type every sequence we use by hand, we can also read them from files. The format chosen here is .fasta, but this can be adapted for any other format.

Let's test a few of the capabilities of this class :


In [3]:
s = Sequence("ABX")
print(s)
s.extend("cysteine")
print(s)
print(AminoAcid("alanine") in s)
sCopy = Sequence(s) #This is a deep copy
sAlias = s #This is the same object
del s[1:3]
sAlias[0] = "K"
s.setSeparator("-")
s.setNameMode("long")
print(s, sAlias, sCopy, sep=", ")

sequences = [seq for seq in getSequencesFromFasta(normpath("resources/fasta/SH3-sequence.fasta"))]
print(sequences[0])


ABX
ABXC
True
lysine-cysteine, lysine-cysteine, ABXC
GGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDS

Score

Now that the easy part is done, we need to remember why we're here (it's sequence alignment, just in case). We can align two sequences any number of ways, but we're only interested in alignments that could represent two descendants of the same original sequence. Therefore, we need to assign a score value to each alignment, that will help us see how significative it is.

Because of the way mutations happen, all modifications should not be treated equally. Some amino acids are more related between them than others, therefore making them more likely to mutate into each other. Some changes are more or less likely to happen because of the very structure of the protein. Also, when an amino acid is inserted into or deleted from a sequence (thus creating a gap in the alignment), it usually is not the only one. This tells us that, not only should the score of the alignment depend on each pair that's aligned (meaning, on the same location), but it should also depend on how many gaps there are and how big they are.

The Score class allows us to give a numerical value to each possible pair of amino acids. These values can be set manually one by one, just as they can be loaded from files. The chosen format here was .iij.

Here's an example on how to load a Score object from a .iij file, and access the scores for certains pairs :


In [4]:
scoring = ScoreMatrix(normpath("resources/blosum/blosum62.iij"), "BLOSUM 62")
print(scoring)

a1, a2 = Sequence("HN") #That's call unpacking, pretty neat huh ?
print(a1, a2, " : ", scoring.getScore(a1, a2))


---------- BLOSUM 62 ----------
A   4   
R   -1  5   
N   -2  0   6   
D   -2  -2  1   6   
C   0   -3  -3  -3  9   
Q   -1  1   0   0   -3  5   
E   -1  0   0   2   -4  2   5   
G   0   -2  0   -1  -3  -2  -2  6   
H   -2  0   1   -1  -3  0   0   -2  8   
I   -1  -3  -3  -3  -1  -3  -3  -4  -3  4   
L   -1  -2  -3  -4  -1  -2  -3  -4  -3  2   4   
K   -1  2   0   -1  -3  1   1   -2  -1  -3  -2  5   
M   -1  -1  -2  -3  -1  0   -2  -3  -2  1   2   -1  5   
F   -2  -3  -3  -3  -2  -3  -3  -3  -1  0   0   -3  0   6   
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4  7   
S   1   -1  1   0   -1  0   0   0   -1  -2  -2  0   -1  -2  -1  4   
T   0   -1  0   -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1  1   5   
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1  1   -4  -3  -2  11  
Y   -2  -2  -2  -3  -2  -1  -2  -3  2   -1  -1  -2  -1  3   -3  -2  -2  2   7   
V   0   -3  -3  -3  -1  -2  -2  -3  -3  3   1   -2  1   -1  -2  -2  0   -3  -1  4   
B   -2  -1  3   4   -3  0   1   -1  0   -3  -4  0   -3  -3  -2  0   -1  -4  -3  -3  4   
Z   -1  0   0   1   -3  3   4   -2  0   -3  -3  1   -1  -3  -1  0   -1  -3  -2  -2  1   4   
X   0   -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2  0   0   -2  -1  -1  -1  -1  -1  

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   
H N  :  1

Needleman-Wunsch Alignment

I know, I know : all these lines and still no alignment in sight. What does that title even mean ? Well, in 1970 Saul B. Needleman and Christian D. Wunsch came up with an effective algorithm for aligning sequences. It provides us with the alignments that get the best score, given certain conditions. It uses a scoring system like the one we've just covered, with the addition of gap penalties : negative values added to the score when the alignment creates (initial penalty) and extends (extended penalty) a gap. Here's an overview of its steps :

  • Create a matrix with enough rows to fit one sequence ($A$) and enough columns to fit the other ($B$) : each cell $(i, j)$ represents a possible alignment between two amino acids $A_i$ and $B_j$.
  • Add an initial row and column to the matrix, with values (scores) determined a certain number of ways. Keep in mind that these cells represent the beginning of an alignment where one sequence only has gaps, same with the last row and column for the end of the alignment.
  • Go through every cell in the matrix and calculate its score based on the previous (left, top, left and top) cells, using the following formula where $(V,W,S)_{i,j}$ are 3 values contained in cell $(i,j)$ of the matrix, $Score(A_i, B_j)$ is the score between amino acids $A_i$ and $B_j$, $I$ is the initial gap penalty and $E$ the extended gap penalty. $$ V(i,j) = max \left\{ \begin{array}{ll} S(i-1, j) + I\\ V(i-1, j) - E \end{array} \right. \quad W(i,j) = max \left\{ \begin{array}{ll} S(i, j-1) + I\\ V(i, j-1) - E \end{array} \right. \quad S(i,j) = max \left\{ \begin{array}{ll} S(i-1, j-1) + Score(A_i, B_j)\\ V(i, j)\\ W(i, j) \end{array} \right. $$
  • Backtrack from some point of the matrix (the end of the alignment) to some other (the beginning), only passing by permitted cells. The cells allowed after cell $(i,j)$ are the ones where the value $S(i,j)$ comes from :
    • left if $S(i,j)=W(i,j)$ : sequence $A$ has a gap
    • top if $S(i,j)=V(i,j)$ : sequence $B$ has a gap
    • diagonal if $S(i,j)=S(i-1,j-1) + Score(A_i, B_j)$ : sequences $A$ and $B$ are aligned

Different types of alignments can be obtained by tweaking this algorithm.

  • Global alignments aim to align both sequences completely. In order to do that, we initialize the first row and sequences with multiples of $I$ and $E$, thus giving us negative values matching the gap required to get there. Backtracking starts at the end of the matrix and ends at the beginning.
  • Local alignments aim to produce the alignment with the best score, whithout regard for their lenght. We do not initialize the first row and column : completing an alignment with only gaps has no interest score-wise. Backtracking starts at the highest value(s) in the matrix and ends as soon as we reach a value of $0$. Local suboptimal alignments can be found by clearing the values of the local optimal alignment in the matrix, reevaluating scores for further rows and columns, and backtracking again.
  • Semiglobal alignments are intended for a global-like alignment between sequences that only overlap partially, or that have great difference in size (one is included in the other). The first row and column are not initialized for the same reason as with local alignments. Backtracking starts and tha highest value(s) but ends when we reach either the first line or first column (therefore finishing one sequence).

The following is a class that allows us to represent two aligned sequences, along with information about the way they were aligned and the result. Identity is the number of equal amino acids that are aligned, similarity is the number of similar (meaning equal or with a non negative score) amino acids that are aligned.

In order to use this class, we must (finally) implement the actual alignment algorithm. This is done by the following class.

Here is the result of a global alignment between the first two sequences from "SH3-sequences.fasta", calculated by LALIGN with the BLOSUM62 scoring matrix, initial and extended gap penalties of $-12$ and $-2$ :

n-w opt:  69  Z-score: 152.4  bits: 31.6 E(1): 6.7e-25
global/global (N-W) score: 69; 29.0% identity (62.9% similar) in 62 aa overlap (1-62:1-58)

GGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDS
      .: ::... .. .::::.:. :...:.   . :   .:. :. :.::.::.  .
---MEAIAKYDFKATADDELSFKRGDILKVLNEECDQNWYKAELN-GKDGFIPKNYIEMKPH

Next is the result from the AlignMatrix class


In [5]:
a = Align(scoring)

for align in a.globalAlign(sequences[0], sequences[1], -12, -2, False):
	print(align)


---------- Alignment ----------
Size       : 62
Type       : global
Score      : 73
Identity   : 18 (29.03%)
Similarity : 21 (33.87%)
Gaps       : 4

Upper seq. : sp|P12931|84-145
	0 Gaps, 62 AAs (positions 0 to 62)
Lower seq. : sp|P62993|1-58
	4 Gaps, 58 AAs (positions 0 to 58)

0
GGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDS
      .: ::... .. .::::.:. :...:.   . :   .:. :. :.::.::.  .  
---MEAIAKYDFKATADDELSFKRGDILKVLNEECDQNWYKAELN-GKDGFIPKNYIEMKPH
0


Besides the second one being much more readable, their output is pretty similar. There may be slight differences in the BLOSUM matrix used, responsible for the discrepancy between the scores.

Here is the result of a local alignment between the first two sequences from "maguk-sequences.fasta", calculated by LALIGN with the BLOSUM62 scoring matrix, initial and extended gap penalties of $-12$ and $-2$ :

Waterman-Eggert score: 2677;  1048.2 bits; E(1) <  0
69.5% identity (86.3% similar) in 767 aa overlap (153-904:61-817)

SHSHISPIKPTEA-VLPSPPTVPVIPVLPVPAENTVILP-TIPQANPPPVLVNTDSLETP
:..  .:   :.: ..:.   :  .:  :::...:  :  . :.  :  . .: :  :  
SQAGATPTPRTKAKLIPTGRDVGPVPPKPVPGKSTPKLNGSGPSWWPECTCTNRDWYEQ-

TYVNGTDADYEYEEITLERGNSGLGFSIAGGTDNPHIGDDSSIFITKIITGGAAAQDGRL
  :::.:. ..::::.::::::::::::::: ::::. :: .::::::: :::::.::::
--VNGSDGMFKYEEIVLERGNSGLGFSIAGGIDNPHVPDDPGIFITKIIPGGAAAMDGRL

RVNDCILRVNEVDVRDVTHSKAVEALKEAGSIVRLYVKRRKPVSEKIMEIKLIKGPKGLG
 ::::.:::::::: .:.::.::::::::: .::: :.::.:  : :::..:.:::::::
GVNDCVLRVNEVDVSEVVHSRAVEALKEAGPVVRLVVRRRQPPPETIMEVNLLKGPKGLG

FSIAGGVGNQHIPGDNSIYVTKIIEGGAAHKDGKLQIGDKLLAVNNVCLEEVTHEEAVTA
::::::.::::::::::::.:::::::::.:::.:::::.::::::. :..: :::::..
FSIAGGIGNQHIPGDNSIYITKIIEGGAAQKDGRLQIGDRLLAVNNTNLQDVRHEEAVAS

LKNTSDFVYLKVAKPTSMYMNDGYAPPDITNSSSQPVDNHVSPSSFLGQTPA--------
::::::.:::::::: :...:: ::::: ... .  .:::.: .: ::   :        
LKNTSDMVYLKVAKPGSLHLNDMYAPPDYASTFTALADNHISHNSSLGYLGAVESKVSYP

-----SPARYSPVSKAVLGDDEITREPRKVVLHRGSTGLGFNIVGGEDGEGIFISFILAG
      :.::::. . .:.....::::::..::.:::::::::::::::::::.::::::
APPQVPPTRYSPIPRHMLAEEDFTREPRKIILHKGSTGLGFNIVGGEDGEGIFVSFILAG

GPADLSGELRKGDRIISVNSVDLRAASHEQAAAALKNAGQAVTIVAQYRPEEYSRFEAKI
::::::::::.::::.:::.:.:: :.:::::::::.:::.::::::::::::::::.::
GPADLSGELRRGDRILSVNGVNLRNATHEQAAAALKRAGQSVTIVAQYRPEEYSRFESKI

HDLREQMMNSSISSGSGSLRTSQKRSLYVRALFDYDKTKDSGLPSQGLNFKFGDILHVIN
:::::::::::.::::::::::.:::::::::::::.:.:: ::::::.:..::::::::
HDLREQMMNSSMSSGSGSLRTSEKRSLYVRALFDYDRTRDSCLPSQGLSFSYGDILHVIN

ASDDEWWQARQVTPDGESDEVGVIPSKRRVEKKERARLKTVKFNSKTRDKGEIPDDMGSK
:::::::::: ::: :::...::::::.:::::::::::::::...:   : : ..    
ASDDEWWQARLVTPHGESEQIGVIPSKKRVEKKERARLKTVKFHART---GMIESNRDFP

GLKHVTSNASDSESSYRGQEEYVLSYEPVNQQEVNYTRPVIILGPMKDRINDDLISEFPD
::    :.   . .. .:::. .::::::..::..:.::::::::::::.::::::::: 
GL----SDDYYGAKNLKGQEDAILSYEPVTRQEIHYARPVIILGPMKDRVNDDLISEFPH

KFGSCVPHTTRPKRDYEVDGRDYHFVTSREQMEKDIQEHKFIEAGQYNNHLYGTSVQSVR
::::::::::::.:: ::::.:::::.::::::::::..:::::::.:..:::::.::::
KFGSCVPHTTRPRRDNEVDGQDYHFVVSREQMEKDIQDNKFIEAGQFNDNLYGTSIQSVR

EVAEKGKHCILDVSGNAIKRLQIAQLYPISIFIKPKSMENIMEMNKRLTEEQARKTFERA
 :::.::::::::::::::::: ::::::.:::::::.: .::::.: : :::.: ...:
AVAERGKHCILDVSGNAIKRLQQAQLYPIAIFIKPKSIEALMEMNRRQTYEQANKIYDKA

MKLEQEFTEHFTAIVQGDTLEDIYNQVKQIIEEQSGSYIWVPAKEKL
::::::: :.::::::::.::.:::..:::::.::: :::::. :::
MKLEQEFGEYFTAIVQGDSLEEIYNKIKQIIEDQSGHYIWVPSPEKL

In [6]:
for aligned in a.localAlign(sequences[0], sequences[1], -12, -2):
	aligned.chunkSize = 60
	print(aligned)
	break


---------- Alignment ----------
Size       : 51
Type       : local
Score      : 98
Identity   : 18 (35.29%)
Similarity : 20 (39.22%)
Gaps       : 1

Upper seq. : sp|P12931|84-145
	0 Gaps, 51 AAs (positions 6 to 57)
Lower seq. : sp|P62993|1-58
	1 Gaps, 50 AAs (positions 3 to 53)

6
VALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYV
.: ::... .. .::::.:. :...:.   . :   .:. :. :.::.::.
IAKYDFKATADDELSFKRGDILKVLNEECDQNWYKAELN-GKDGFIPKNYI
3


Once more, alignments are quite similar in terms of gap locations, scores and identity/similarity. The results suggest that these sequences might be related, given their high identity and similarity.

For each other pair of sequences (condensed results), we don't get such high changes of homology, as suggested by these condensed results :


In [7]:
for i in range(len(sequences)-1):
	for j in range(i+1, len(sequences)):
		for align in a.localAlign(sequences[i], sequences[j], -12, -2):
			align.condensed = True
			print(align)
			break


---------- Alignment ----------
Size       : 51
Type       : local
Score      : 98
Identity   : 18 (35.29%)
Similarity : 20 (39.22%)
Gaps       : 1

Upper seq. : sp|P12931|84-145
	0 Gaps, 51 AAs (positions 6 to 57)
Lower seq. : sp|P62993|1-58
	1 Gaps, 50 AAs (positions 3 to 53)

---------- Alignment ----------
Size       : 54
Type       : local
Score      : 93
Identity   : 21 (38.89%)
Similarity : 13 (24.07%)
Gaps       : None

Upper seq. : sp|P12931|84-145
	0 Gaps, 54 AAs (positions 3 to 57)
Lower seq. : sp|P41240|9-70
	0 Gaps, 54 AAs (positions 3 to 57)

---------- Alignment ----------
Size       : 51
Type       : local
Score      : 81
Identity   : 17 (33.33%)
Similarity : 18 (35.29%)
Gaps       : 2

Upper seq. : sp|P12931|84-145
	0 Gaps, 51 AAs (positions 6 to 57)
Lower seq. : sp|Q8IZP0|446-505
	2 Gaps, 49 AAs (positions 6 to 55)

---------- Alignment ----------
Size       : 55
Type       : local
Score      : 118
Identity   : 24 (43.64%)
Similarity : 19 (34.55%)
Gaps       : 1

Upper seq. : sp|P12931|84-145
	0 Gaps, 55 AAs (positions 6 to 61)
Lower seq. : sp|Q06187|214-274
	1 Gaps, 54 AAs (positions 6 to 60)

---------- Alignment ----------
Size       : 56
Type       : local
Score      : 135
Identity   : 25 (44.64%)
Similarity : 18 (32.14%)
Gaps       : 1

Upper seq. : sp|P62993|1-58
	1 Gaps, 55 AAs (positions 1 to 56)
Lower seq. : sp|P41240|9-70
	0 Gaps, 56 AAs (positions 4 to 60)

---------- Alignment ----------
Size       : 57
Type       : local
Score      : 107
Identity   : 22 (38.60%)
Similarity : 17 (29.82%)
Gaps       : 1

Upper seq. : sp|P62993|1-58
	0 Gaps, 57 AAs (positions 1 to 58)
Lower seq. : sp|Q8IZP0|446-505
	1 Gaps, 56 AAs (positions 4 to 60)

---------- Alignment ----------
Size       : 53
Type       : local
Score      : 90
Identity   : 18 (33.96%)
Similarity : 20 (37.74%)
Gaps       : 2

Upper seq. : sp|P62993|1-58
	1 Gaps, 52 AAs (positions 1 to 53)
Lower seq. : sp|Q06187|214-274
	1 Gaps, 52 AAs (positions 4 to 56)

---------- Alignment ----------
Size       : 52
Type       : local
Score      : 62
Identity   : 13 (25.00%)
Similarity : 20 (38.46%)
Gaps       : 2

Upper seq. : sp|P41240|9-70
	0 Gaps, 52 AAs (positions 6 to 58)
Lower seq. : sp|Q8IZP0|446-505
	2 Gaps, 50 AAs (positions 6 to 56)

---------- Alignment ----------
Size       : 55
Type       : local
Score      : 94
Identity   : 21 (38.18%)
Similarity : 15 (27.27%)
Gaps       : 1

Upper seq. : sp|P41240|9-70
	0 Gaps, 55 AAs (positions 6 to 61)
Lower seq. : sp|Q06187|214-274
	1 Gaps, 54 AAs (positions 6 to 60)

---------- Alignment ----------
Size       : 54
Type       : local
Score      : 73
Identity   : 16 (29.63%)
Similarity : 21 (38.89%)
Gaps       : 1

Upper seq. : sp|Q8IZP0|446-505
	1 Gaps, 53 AAs (positions 2 to 55)
Lower seq. : sp|Q06187|214-274
	0 Gaps, 54 AAs (positions 2 to 56)