In [1]:
from sys import path
from os.path import join as joinpath
from os.path import normpath
path.append(normpath(".."))

from protein import loadFasta
from score import ScoreMatrix, blosumFromFasta
from align import Align


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-e04d5de57aa4> in <module>()
      4 path.append(normpath(".."))
      5 
----> 6 from protein import loadFasta
      7 from score import ScoreMatrix, blosumFromFasta
      8 from align import Align

ImportError: No module named 'protein'

Substitution matrices

In order to produce sequence alignments we've made extensive use of Soring systems in order to achieve sequence alignment. The scores assigned to each amino acid pair were not understood, we simply used standard ones. The scoring systems we used are called substitution matrices as they provide scores for each possible amino acid substitution. When comparing two sequences, one of the following is true:

  • Evolutionary model: sequences B and A do not have a common origin. Therefore B can be seen as completely random when compared to A. This does not mean B has to be completely different than A : amino acids $a$ and $b$ could still align by chance. We can calculate the probability of that happening by looking at the amino acid frequencies $p_a$ and $p_b$ in the population. $$Prob(a, b) = p_a \cdot p_b $$

  • Random model: sequences B and A have a common origin. Therefore B is not random, and a good alignment between A and B will provide us with the result of each of their mutations. The probability of amino acids $a$ and $b$ being aligned depends on the likelihood of substitutions from $a$ to $b$ or $b$ to $a$. $$Prob(a, b) = q_{a,b}$$

The goal of this section is to be able to tell these cases apart. In order to do that, we'll construct substitution matrices, assigning to each amino acid pair a score equal to the log-odds ratio $\frac{q_{a,b}}{p_a \cdot p_b}$.

BLOCKS

BLOCKS is a database of amino acid sequences. It was used to construct the standard BLOSUM matrices. We used it to retrieve two families of sequences : the SH3 and PDZ domains. These come in multiple "blocks" of gapless aligned sequences of the same size. We quickly formatted these informations into .fasta files compatible with our previous project. With them, we'll be able to construct our own BLOSUM matrices specific to these domains.

BLOSUM

BLOSUM or BLOcks SUbstitution Matrix uses the well-conserved (gapless and aligned) sequences found in BLOCKS to determine the numerical values of $p_a$, $p_b$ and $q_{a,b}$, based on the required identity between them. For example, BLOSUM 62 is the BLOSUM that was calculated for a required identity of 62%. Here's an overview of the algorithm for calculating the log-odds ratios with an identity of $X$% :

  • Separate all sequences into $t$ groups $G_{i}$, such that all sequences $s_{j} \in G_{i}$ in each group have at least $X$ percent identity with some other sequence of the same group. We assume the sequences have she same lenght $n$.

  • Calculate the weighted frequencies of all amino acid pairs, where $Count(G_{i_k}, a)$ is the number of occurrences of amino acid $a$ in column $k$ of sequences from $G_i$, $Size(G_i)$ is the number of sequences in group $G_i$ :

$$f_{a,b} = \sum_{k=1}^{n}{\sum_{i=1}^{t-1}{\sum_{j=i+1}^{t}{ \frac{Count(G_{i_k},a)}{Size(G_i)} \cdot \frac{Count(G_{j_k},b)}{Size(G_j)} + \frac{Count(G_{i_k},b)}{Size(G_i)} \cdot \frac{Count(G_{j_k},a)}{Size(G_j)}}}}$$
  • Calculate the probability of occurrences $a, b$ in the evolutionary model as the frequence of pairs $(a,b)$ divided by the frequence of all possible pairs $$q_{a,b} = \frac{f_{a,b}}{\sum_{1\leq a \leq b}{f_{a,b}}}$$

  • Calculate the probability of occurrences $a, b$ in the random model, based on the observed frequencies of each single amino acid $$ e_{a,b} = \left{ \begin{array}{ll}

    p_a^2 \quad \textrm{if} \quad a = b\\
    2 p_a p_b \quad \textrm{if} \quad a \neq b
    

    \end{array} \right. \quad \textrm{where} \quad pa = q{a,a} + \frac{1}{2}\sum{a \neq b}{q{a,b}} $$

  • Compute the log-odds ratio of all probabilities for the evolutionary and random models. This gives us a negative value (a bad score) if the random model is more likely, and a positive value the other way around. This value is the one used in BLOSUM (except we round it to the nearest integer). $$s_{a,b} = 2 \; \log_{2}{\frac{q_{a,b}}{e_{a,b}}}$$


In [ ]:
sh3Files = ("SH3-A.fasta", "SH3-B.fasta", "SH3-C.fasta", "SH3-D.fasta")
sh3Paths = [joinpath("..", "resources", "fasta", filename) for filename in sh3Files]
print("SH3 Family")
blosum40sh3 = blosumFromFasta(40, *sh3Paths)
print(blosum40sh3)
blosum70sh3 = blosumFromFasta(70, *sh3Paths)
print(blosum70sh3)

print("PDZ Family")
pdzFiles = ("PDZ-A.fasta", "PDZ-B.fasta")
pdzPaths = [joinpath("..", "resources", "fasta", filename) for filename in pdzFiles]
blosum40pdz = blosumFromFasta(40, *pdzPaths)
print(blosum40pdz)
blosum70pdz = blosumFromFasta(70, *pdzPaths)
print(blosum70pdz)

As a reminder, here is BLOSUM 62 :


In [ ]:
blosum62 = ScoreMatrix(normpath("../resources/blosum/blosum62.iij"), "BLOSUM 62")
print(blosum62)

Values are very different, which is normal since this matrix was built for a small, specific subset of the BLOCKS database and with a different required identity. Diagonal values also differ significantly. New substitutions are allowed (like between P and A), other substitutions no longer are (like all substitutions to and from Z, which is not present in the SH3 or PDZ domains).

Let's try to do some alignments with these BLOSUM matrices :


In [ ]:
a = Align(blosum62)
b = Align(blosum40pdz)
c = Align(blosum70pdz)
sequences = [seq for seq in loadFasta(normpath("../resources/fasta/PDZ-sequences.fasta"))]

for alignM in (a,b,c):
    for align in a.globalAlign(sequences[0], sequences[1], -12, -2, False):
        align.condensed = True
        print(align)
        break