In [1]:
from sys import path
from os.path import join as joinpath
from os.path import normpath
path.append(normpath(".."))
from pyprot.base.aminoacid import AminoAcid
from pyprot.base.sequence import Sequence
from pyprot.data.fasta import getSequencesFromFasta
from pyprot.align.score import ScoreMatrix
from pyprot.align.align import Align, Aligned
Genetic material such as DNA and proteins comes in sequences made up of different blocks. The study of these sequences and their correlation is invaluable to the field of biology. Since genetic material is likely to mutate over time, it is not easy to be certain of these correlations in most cases. In order to know if two or more sequences have a common origin, we must examine their similarity in a formal way, and be able to quantify it. Since the modifications of the sequences can introduce new blocks as well as delete some, the first step to comparing sequences is aligning them by making the similar regions match. Even then, high levels of similarity do not necessarily imply a common ancestor exists (homology). Our goal in this section is to create a tool that allows us to perform such alignments and give a numeric value of the chance of homology.
We will first look at an implementation of Abstract Data Types (ADTs) that will allow us to represent Sequences.
The choice of genetic material is proteins. In the case of proteins, the blocks that make up the sequences are called amino acids. Therefore, they are the first thing we must be able to represent. The following tuple lists all amino acids and miscellaneous options.
The following class allows us to create AminoAcid objects that represent any of the amino acids from the list. We can then compare, print, hash, test them with any of the defined methods. Do note that we can create them by copying other AminoAcids or by providing their name in any form (long for complete, medium for 3 chars or short for 1 char).
Let's test it a little bit.
In [2]:
a1 = AminoAcid("A")
print(a1)
a2 = AminoAcid(a1)
print(a1 == a2)
a3 = AminoAcid("K")
print(a3.getName("long"))
That's good and all, but we don't want to align single amino acids do we ? What we want to align is sequences. We can view a Sequence as an ordered list of AminoAcid objects. It should be iterable : we'll want to go through, access, and count its items. We'll also want to change them, insert new ones, delete some, and do all that transparently, as we would with a list. Finally it would be useful to check whether a sequence contains some other sub-sequence or single amino acid. The following is a class that does just that.
Since we may not want to type every sequence we use by hand, we can also read them from files. The format chosen here is .fasta, but this can be adapted for any other format.
Let's test a few of the capabilities of this class :
In [3]:
s = Sequence("ABX")
print(s)
s.extend("cysteine")
print(s)
print(AminoAcid("alanine") in s)
sCopy = Sequence(s) #This is a deep copy
sAlias = s #This is the same object
del s[1:3]
sAlias[0] = "K"
s.setSeparator("-")
s.setNameMode("long")
print(s, sAlias, sCopy, sep=", ")
sequences = [seq for seq in getSequencesFromFasta(normpath("resources/fasta/SH3-sequence.fasta"))]
print(sequences[0])
Now that the easy part is done, we need to remember why we're here (it's sequence alignment, just in case). We can align two sequences any number of ways, but we're only interested in alignments that could represent two descendants of the same original sequence. Therefore, we need to assign a score value to each alignment, that will help us see how significative it is.
Because of the way mutations happen, all modifications should not be treated equally. Some amino acids are more related between them than others, therefore making them more likely to mutate into each other. Some changes are more or less likely to happen because of the very structure of the protein. Also, when an amino acid is inserted into or deleted from a sequence (thus creating a gap in the alignment), it usually is not the only one. This tells us that, not only should the score of the alignment depend on each pair that's aligned (meaning, on the same location), but it should also depend on how many gaps there are and how big they are.
The Score
class allows us to give a numerical value to each possible pair of amino acids. These values can be set manually one by one, just as they can be loaded from files. The chosen format here was .iij
.
Here's an example on how to load a Score
object from a .iij
file, and access the scores for certains pairs :
In [4]:
scoring = ScoreMatrix(normpath("resources/blosum/blosum62.iij"), "BLOSUM 62")
print(scoring)
a1, a2 = Sequence("HN") #That's call unpacking, pretty neat huh ?
print(a1, a2, " : ", scoring.getScore(a1, a2))
I know, I know : all these lines and still no alignment in sight. What does that title even mean ? Well, in 1970 Saul B. Needleman and Christian D. Wunsch came up with an effective algorithm for aligning sequences. It provides us with the alignments that get the best score, given certain conditions. It uses a scoring system like the one we've just covered, with the addition of gap penalties : negative values added to the score when the alignment creates (initial penalty) and extends (extended penalty) a gap. Here's an overview of its steps :
Different types of alignments can be obtained by tweaking this algorithm.
The following is a class that allows us to represent two aligned sequences, along with information about the way they were aligned and the result. Identity is the number of equal amino acids that are aligned, similarity is the number of similar (meaning equal or with a non negative score) amino acids that are aligned.
In order to use this class, we must (finally) implement the actual alignment algorithm. This is done by the following class.
Here is the result of a global alignment between the first two sequences from "SH3-sequences.fasta", calculated by LALIGN with the BLOSUM62 scoring matrix, initial and extended gap penalties of $-12$ and $-2$ :
n-w opt: 69 Z-score: 152.4 bits: 31.6 E(1): 6.7e-25
global/global (N-W) score: 69; 29.0% identity (62.9% similar) in 62 aa overlap (1-62:1-58)
GGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDS
.: ::... .. .::::.:. :...:. . : .:. :. :.::.::. .
---MEAIAKYDFKATADDELSFKRGDILKVLNEECDQNWYKAELN-GKDGFIPKNYIEMKPH
Next is the result from the AlignMatrix class
In [5]:
a = Align(scoring)
for align in a.globalAlign(sequences[0], sequences[1], -12, -2, False):
print(align)
Besides the second one being much more readable, their output is pretty similar. There may be slight differences in the BLOSUM matrix used, responsible for the discrepancy between the scores.
Here is the result of a local alignment between the first two sequences from "maguk-sequences.fasta", calculated by LALIGN with the BLOSUM62 scoring matrix, initial and extended gap penalties of $-12$ and $-2$ :
Waterman-Eggert score: 2677; 1048.2 bits; E(1) < 0
69.5% identity (86.3% similar) in 767 aa overlap (153-904:61-817)
SHSHISPIKPTEA-VLPSPPTVPVIPVLPVPAENTVILP-TIPQANPPPVLVNTDSLETP
:.. .: :.: ..:. : .: :::...: : . :. : . .: : :
SQAGATPTPRTKAKLIPTGRDVGPVPPKPVPGKSTPKLNGSGPSWWPECTCTNRDWYEQ-
TYVNGTDADYEYEEITLERGNSGLGFSIAGGTDNPHIGDDSSIFITKIITGGAAAQDGRL
:::.:. ..::::.::::::::::::::: ::::. :: .::::::: :::::.::::
--VNGSDGMFKYEEIVLERGNSGLGFSIAGGIDNPHVPDDPGIFITKIIPGGAAAMDGRL
RVNDCILRVNEVDVRDVTHSKAVEALKEAGSIVRLYVKRRKPVSEKIMEIKLIKGPKGLG
::::.:::::::: .:.::.::::::::: .::: :.::.: : :::..:.:::::::
GVNDCVLRVNEVDVSEVVHSRAVEALKEAGPVVRLVVRRRQPPPETIMEVNLLKGPKGLG
FSIAGGVGNQHIPGDNSIYVTKIIEGGAAHKDGKLQIGDKLLAVNNVCLEEVTHEEAVTA
::::::.::::::::::::.:::::::::.:::.:::::.::::::. :..: :::::..
FSIAGGIGNQHIPGDNSIYITKIIEGGAAQKDGRLQIGDRLLAVNNTNLQDVRHEEAVAS
LKNTSDFVYLKVAKPTSMYMNDGYAPPDITNSSSQPVDNHVSPSSFLGQTPA--------
::::::.:::::::: :...:: ::::: ... . .:::.: .: :: :
LKNTSDMVYLKVAKPGSLHLNDMYAPPDYASTFTALADNHISHNSSLGYLGAVESKVSYP
-----SPARYSPVSKAVLGDDEITREPRKVVLHRGSTGLGFNIVGGEDGEGIFISFILAG
:.::::. . .:.....::::::..::.:::::::::::::::::::.::::::
APPQVPPTRYSPIPRHMLAEEDFTREPRKIILHKGSTGLGFNIVGGEDGEGIFVSFILAG
GPADLSGELRKGDRIISVNSVDLRAASHEQAAAALKNAGQAVTIVAQYRPEEYSRFEAKI
::::::::::.::::.:::.:.:: :.:::::::::.:::.::::::::::::::::.::
GPADLSGELRRGDRILSVNGVNLRNATHEQAAAALKRAGQSVTIVAQYRPEEYSRFESKI
HDLREQMMNSSISSGSGSLRTSQKRSLYVRALFDYDKTKDSGLPSQGLNFKFGDILHVIN
:::::::::::.::::::::::.:::::::::::::.:.:: ::::::.:..::::::::
HDLREQMMNSSMSSGSGSLRTSEKRSLYVRALFDYDRTRDSCLPSQGLSFSYGDILHVIN
ASDDEWWQARQVTPDGESDEVGVIPSKRRVEKKERARLKTVKFNSKTRDKGEIPDDMGSK
:::::::::: ::: :::...::::::.:::::::::::::::...: : : ..
ASDDEWWQARLVTPHGESEQIGVIPSKKRVEKKERARLKTVKFHART---GMIESNRDFP
GLKHVTSNASDSESSYRGQEEYVLSYEPVNQQEVNYTRPVIILGPMKDRINDDLISEFPD
:: :. . .. .:::. .::::::..::..:.::::::::::::.:::::::::
GL----SDDYYGAKNLKGQEDAILSYEPVTRQEIHYARPVIILGPMKDRVNDDLISEFPH
KFGSCVPHTTRPKRDYEVDGRDYHFVTSREQMEKDIQEHKFIEAGQYNNHLYGTSVQSVR
::::::::::::.:: ::::.:::::.::::::::::..:::::::.:..:::::.::::
KFGSCVPHTTRPRRDNEVDGQDYHFVVSREQMEKDIQDNKFIEAGQFNDNLYGTSIQSVR
EVAEKGKHCILDVSGNAIKRLQIAQLYPISIFIKPKSMENIMEMNKRLTEEQARKTFERA
:::.::::::::::::::::: ::::::.:::::::.: .::::.: : :::.: ...:
AVAERGKHCILDVSGNAIKRLQQAQLYPIAIFIKPKSIEALMEMNRRQTYEQANKIYDKA
MKLEQEFTEHFTAIVQGDTLEDIYNQVKQIIEEQSGSYIWVPAKEKL
::::::: :.::::::::.::.:::..:::::.::: :::::. :::
MKLEQEFGEYFTAIVQGDSLEEIYNKIKQIIEDQSGHYIWVPSPEKL
In [6]:
for aligned in a.localAlign(sequences[0], sequences[1], -12, -2):
aligned.chunkSize = 60
print(aligned)
break
Once more, alignments are quite similar in terms of gap locations, scores and identity/similarity. The results suggest that these sequences might be related, given their high identity and similarity.
For each other pair of sequences (condensed results), we don't get such high changes of homology, as suggested by these condensed results :
In [7]:
for i in range(len(sequences)-1):
for j in range(i+1, len(sequences)):
for align in a.localAlign(sequences[i], sequences[j], -12, -2):
align.condensed = True
print(align)
break