In [6]:
from sys import path
from os.path import join as joinpath
from os.path import normpath
path.append(normpath(".."))

from pyprot.data.fasta import getSequencesFromFasta
from pyprot.align.align import Align, Aligned
from pyprot.align.score import PSSM

PSSM Profiles and alignment

Profiles

A profile is a way to represent multiple aligned sequences and their similarities or patterns.

Definitions

A protein domain is part of a protein that has its own structure and function, independent of the rest of the protein. Despite their use being similar across the proteins they can be found in, the same domains do not necessarily have the same amino acid sequence. However, they do have enough common amino acids to be identified and located by the means of sequence alignment. Our goal in this project is to locate a given domain within a given sequence, should it exist one or more times within that sequence. One such domain is the WW domain, known to be used by multiple species and sometimes included multiple times in a single protein. In this project we will use sequences from the WW domain that belong to human proteins to test our implementations : these have been saved to the file to-be-aligned.fasta. These domain sequences and many others can be found on the SMART database.

Multiple Sequence Alignment (or MSA) allows us to align multiple sequences together, and therefore observe how well some amino acids are preserved at some positions. This is of great interest in the study of domains, since it reveals which positions are of greater importance when trying to align new sequences to known domains. There are of course other applications, as well as numerous methods that achieve MSA, however we won't go into them since we won't be implementing any of these in this project. Two online tools were used to align all the sequences from to-be-aligned.fasta together : MUSCLE and CLUSTAL Omega. The resulting sequences can be respectively found in files msaresults-MUSCLE.fasta and msaresults-CLUSTALO.fasta.

PSSM

A Position Specific Scoring Matrix (or PSSM) is a kind of profile that uses a matrix, where each column contains information about the same columns of the aforementioned sequences, and each row matches one amino acid. In this manner, each cell points to a specific column of all observed sequences, as well as one amino acid : the value it contains is the frequency of this amino acid within these columns. PSSMs can be used to align sequences with, the result indicating whether that sequence contains a subsequence that's similar to all of the sequences represented by the PSSM.

One way to represent a profile such as a PSSM is a table containing each value from the matrix. A WebLogo is a more human-readable representation, where amino acids that appear in each column are represented sequencially and with sizes depending on their frequency within that column. Here is an example of a WebLogo, gathered from this site:

Implementation

The eponym class PSSM contains the frequency matrix as well as metadata that will allow us to use it efficiently as a scoring system when aligning sequences with it (more on that later). Here is the class, as well as a dictionary of absolute amino acid frequencies within the UniProt database (also used to provide scores) :

Scoring

Suppose we want to align a sequence to a PSSM: it will be the PSSM that determines the scores for the alignment -based on the frequencies gathered from previously aligned sequences-. Sure, PSSM has "Scoring Matrix" in it, but we haven't talked about scores yet (that is, unless you've read the code above). That's when you should think : "oh, but I know what a scoring matrix is, it gives out scores for each amino acid pair, that's what this is". Well, when we were aligning two sequences together, that was the case since we made the assumption that the positions of amino acids didn't matter. However the goal now is to align one sequence with a load of other aligned sequences -represented by a PSSM- that often don't have the same amino acid at the same position. The frequency and range of amino acids can vary for each column of the PSSM, therefore the column value is required (instead of the "other" amino acid) to provide a score.

So how is that score calculated you ask ? Well, exactly the same way we did for the Blosum matrices: with the log-odds ratio $\frac{q_{a,c}}{p_a}$, where $q_{a,c}$ is the evolutionary probability of amino acid $a$ being located at column $c$, and $p_a$ is the random probability of amino acid $a$ (which doesn't depend on the column). What differs from the previous project is the way these terms are calculated:

  • The evolutionary probability is computed as follows: $$q_{a,c} = \frac{\alpha f_{a,c} + \beta p_{a}}{\alpha + \beta}$$ where $f_{a,c}$ is the frequency of $a$ in column $c$, and $\alpha$ and $\beta$ are constants -called pseudocounts- used to prevent null probabilities from happening (since $\log{0} = - \infty$ isn't a nice value for a score). Pseudocounts not only avoid bad values, they determine the PSSM before any aligned sequences are added. Should there be no amino acids in any column, the score would entirely depend on these pseudocounts. Here, $\alpha$ is the total number of amino acids in the given column (without taking gaps into account) and $\beta$ is the total number of aligned sequences in the PSSM.
  • The random probability is based on the frequencies of amino acids in the whole UniProt database, which can be found here.

Alignment

This is going to be easy: in order to align a sequence with a PSSM, we can re-use the algorithm written for the first project to compute the alignment. The code has been adapted to better manage and represent MSAs, but the alignment is basically a local+suboptimal sequence alignment, with no affine gap penalty, where scores as well are provided by the PSSM. Note that gap penalties can be made to depend on the column and therefore are provided by the PSSM class, however in this case we use a constant gap penalty throug the whole alignment.

Results

The WW domain

Here is the HMM Logo (similar to a WebLogo) of the WW domain, provided by this site (note that columns with no "main" amino acid are not represented):

Once aligned with the two mentioned online tools, we can represent our human proteins-only WW domains with WebLogos directly from this site. Here are the results of the CLUSTAL Omega alignment:

and the MUSCLE alignemnt:

Despite the ranges being slightly different between the WW domain and our subset aligned "by hand", the similarities speak for themselves: the same amino acids are of great importance to the domain, only "smaller" amino acids differ. Conserved positions are mainly 3,6,18,19,54,57.

Sequence alignment

First protein

Protein D6C652 (Transcriptional coactivator YAP1-A) domains according to UniProt:

This tells us that protein contains two WW domains, starting at positions 141 and 199. Let's see if these match our results for its alignment against our PSSMs for the two MSAs of the WW domain (note that by default, the function multiAlign only returns one result per optimal or suboptimal lookup, and only goes to suboptimal depth of 3):


In [7]:
musclePath = normpath("resources/fasta/msaresults-MUSCLE.fasta")
clustalPath = normpath("resources/fasta/msaresults-OMEGA.fasta")
testPath = normpath("resources/fasta/test.fasta")

for aligned in (musclePath, clustalPath):
	print("\n >>> Creating PSSM from file {} ...".format(aligned), end="")
	pssm = PSSM("WW domain")
	for seq in getSequencesFromFasta(aligned):
		pssm.add(seq)
	pssm.setGapPenalty(4)
	print(" done\n\n")
	
	
	al = Align(pssm)
	toalign = [p for p in getSequencesFromFasta(testPath)][0] #First protein
	for aligned in al.multiAlign(toalign):
		print(aligned)


 >>> Creating PSSM from file resources\fasta\msaresults-MUSCLE.fasta ... done


---------- Multi-Seq. Alignment ----------
Size       : 59
Type       : local
Score      : 25.92
Gaps       : 28

PSSM : WW domain
Aligned seq. : sp|D6C652|YAP1A_XENLA Transcriptional coactivator YAP1-A OS=Xenopus laevis GN=yap1-a PE=1 SV=1
	28 Gaps, 31 AAs (positions 142 to 173)

142
-LPPGWEMAKT-PS-GQR-YFLN------------------------HIDQTTTWQDPR


---------- Multi-Seq. Alignment ----------
Size       : 60
Type       : local-suboptimal(1)
Score      : 23.73
Gaps       : 28

PSSM : WW domain
Aligned seq. : sp|D6C652|YAP1A_XENLA Transcriptional coactivator YAP1-A OS=Xenopus laevis GN=yap1-a PE=1 SV=1
	28 Gaps, 32 AAs (positions 200 to 232)

200
-LPDGWEQALTPEGEA---YFIN------------------------HKNKSTSWLDPRL


---------- Multi-Seq. Alignment ----------
Size       : 5
Type       : local-suboptimal(2)
Score      : 11.06
Gaps       : 1

PSSM : WW domain
Aligned seq. : sp|D6C652|YAP1A_XENLA Transcriptional coactivator YAP1-A OS=Xenopus laevis GN=yap1-a PE=1 SV=1
	1 Gaps, 4 AAs (positions 254 to 258)

254
-LPPP


---------- Multi-Seq. Alignment ----------
Size       : 60
Type       : local-suboptimal(3)
Score      : 10.73
Gaps       : 41

PSSM : WW domain
Aligned seq. : sp|D6C652|YAP1A_XENLA Transcriptional coactivator YAP1-A OS=Xenopus laevis GN=yap1-a PE=1 SV=1
	41 Gaps, 19 AAs (positions 121 to 140)

121
-LA-P-------PSA--P-----------------------------HL-RQSSYEIPDD



 >>> Creating PSSM from file resources\fasta\msaresults-OMEGA.fasta ... done


---------- Multi-Seq. Alignment ----------
Size       : 13
Type       : local
Score      : 25.83
Gaps       : 1

PSSM : WW domain
Aligned seq. : sp|D6C652|YAP1A_XENLA Transcriptional coactivator YAP1-A OS=Xenopus laevis GN=yap1-a PE=1 SV=1
	1 Gaps, 12 AAs (positions 142 to 154)

142
-LPPGWEMAKTPS


---------- Multi-Seq. Alignment ----------
Size       : 46
Type       : local-suboptimal(1)
Score      : 22.23
Gaps       : 26

PSSM : WW domain
Aligned seq. : sp|D6C652|YAP1A_XENLA Transcriptional coactivator YAP1-A OS=Xenopus laevis GN=yap1-a PE=1 SV=1
	26 Gaps, 20 AAs (positions 153 to 173)

153
-SGQRYFLNHID-------------------------QTTTWQDPR


---------- Multi-Seq. Alignment ----------
Size       : 12
Type       : local-suboptimal(2)
Score      : 20.5
Gaps       : 1

PSSM : WW domain
Aligned seq. : sp|D6C652|YAP1A_XENLA Transcriptional coactivator YAP1-A OS=Xenopus laevis GN=yap1-a PE=1 SV=1
	1 Gaps, 11 AAs (positions 200 to 211)

200
-LPDGWEQALTP


---------- Multi-Seq. Alignment ----------
Size       : 59
Type       : local-suboptimal(3)
Score      : 20.26
Gaps       : 35

PSSM : WW domain
Aligned seq. : sp|D6C652|YAP1A_XENLA Transcriptional coactivator YAP1-A OS=Xenopus laevis GN=yap1-a PE=1 SV=1
	35 Gaps, 24 AAs (positions 208 to 232)

208
-L-TP--EGE-A------YF--INHK----------------------NKSTSWLDPRL


  • For the MUSCLE PSSM, the first two alignments match the ones from UniProt, and the last two are lesser alignments (with lower scores) that do not seem to represent a domain -mainly because of their reduced size or number of amino acids-.
  • For the CLUSTAL PSSM, the first two alignments match part of one domain, the last two part of the other. This may be due to different PSSM values and/or non-ideal end conditions for the align algorithm (which could be modified to prefer longer alignments). This matter could be settled by examining all outputs (not one per backtrack) and determining if and when better domain alignments are returned.

Second protein

Protein P46935 (E3 ubiquitin-protein ligase NEDD4) domains according to UniProt:

Let's see how these correlate to our own results:


In [8]:
for aligned in (musclePath, clustalPath):
	print("\n >>> Creating PSSM from file {} ...".format(aligned), end="")
	pssm = PSSM("WW domain")
	for seq in getSequencesFromFasta(aligned):
		pssm.add(seq)
	pssm.setGapPenalty(4)
	print(" done\n\n")
	
	
	al = Align(pssm)
	toalign = [p for p in getSequencesFromFasta(testPath)][1] #Second protein
	for aligned in al.multiAlign(toalign):
		print(aligned)


 >>> Creating PSSM from file resources\fasta\msaresults-MUSCLE.fasta ... done


---------- Multi-Seq. Alignment ----------
Size       : 60
Type       : local
Score      : 27.34
Gaps       : 28

PSSM : WW domain
Aligned seq. : sp|P46935|NEDD4_MOUSE E3 ubiquitin-protein ligase NEDD4 OS=Mus musculus GN=Nedd4 PE=1 SV=3
	28 Gaps, 32 AAs (positions 461 to 493)

461
-LPPGWEER-T--H---T---DGRVFFIN------------------HNIKKTQWEDPRL


---------- Multi-Seq. Alignment ----------
Size       : 60
Type       : local-suboptimal(1)
Score      : 25.16
Gaps       : 28

PSSM : WW domain
Aligned seq. : sp|P46935|NEDD4_MOUSE E3 ubiquitin-protein ligase NEDD4 OS=Mus musculus GN=Nedd4 PE=1 SV=3
	28 Gaps, 32 AAs (positions 406 to 438)

406
-LPPGW-EEK------Q----DDRGRSYYVD----------------HNSKTTTWSKPTM


---------- Multi-Seq. Alignment ----------
Size       : 11
Type       : local-suboptimal(2)
Score      : 24.98
Gaps       : 1

PSSM : WW domain
Aligned seq. : sp|P46935|NEDD4_MOUSE E3 ubiquitin-protein ligase NEDD4 OS=Mus musculus GN=Nedd4 PE=1 SV=3
	1 Gaps, 10 AAs (positions 250 to 260)

250
-LPPGWEERQD


---------- Multi-Seq. Alignment ----------
Size       : 56
Type       : local-suboptimal(3)
Score      : 23.58
Gaps       : 27

PSSM : WW domain
Aligned seq. : sp|P46935|NEDD4_MOUSE E3 ubiquitin-protein ligase NEDD4 OS=Mus musculus GN=Nedd4 PE=1 SV=3
	27 Gaps, 29 AAs (positions 251 to 280)

251
-PP---GW--EE--RQ---DVLGRTYYVN----------------HESRRTQWKRP



 >>> Creating PSSM from file resources\fasta\msaresults-OMEGA.fasta ... done


---------- Multi-Seq. Alignment ----------
Size       : 11
Type       : local
Score      : 24.98
Gaps       : 1

PSSM : WW domain
Aligned seq. : sp|P46935|NEDD4_MOUSE E3 ubiquitin-protein ligase NEDD4 OS=Mus musculus GN=Nedd4 PE=1 SV=3
	1 Gaps, 10 AAs (positions 250 to 260)

250
-LPPGWEERQD


---------- Multi-Seq. Alignment ----------
Size       : 23
Type       : local-suboptimal(1)
Score      : 24.1
Gaps       : 2

PSSM : WW domain
Aligned seq. : sp|P46935|NEDD4_MOUSE E3 ubiquitin-protein ligase NEDD4 OS=Mus musculus GN=Nedd4 PE=1 SV=3
	2 Gaps, 21 AAs (positions 461 to 482)

461
-LPPGWEER-THTDGRVFFINHN


---------- Multi-Seq. Alignment ----------
Size       : 24
Type       : local-suboptimal(2)
Score      : 24.02
Gaps       : 13

PSSM : WW domain
Aligned seq. : sp|P46935|NEDD4_MOUSE E3 ubiquitin-protein ligase NEDD4 OS=Mus musculus GN=Nedd4 PE=1 SV=3
	13 Gaps, 11 AAs (positions 261 to 272)

261
-L------------GRTYYVNHES


---------- Multi-Seq. Alignment ----------
Size       : 12
Type       : local-suboptimal(3)
Score      : 23.48
Gaps       : 1

PSSM : WW domain
Aligned seq. : sp|P46935|NEDD4_MOUSE E3 ubiquitin-protein ligase NEDD4 OS=Mus musculus GN=Nedd4 PE=1 SV=3
	1 Gaps, 11 AAs (positions 406 to 417)

406
-LPPGWEEKQDD


  • For the MUSCLE PSSM, the three instances of the WW domain are found within the four results. One alignment is found to have a better score than the last WW domain, yet it is included within the latter. This may be due to the align algorithm once more, which could be tweaked to prefer longer alignments even when this means not starting at the maximum score.
  • For the CLUSTAL PSSM, no instances of the WW domain are found completely. Yet all of the results are subsets of the domains, which leads to the same conclusions as before.

Conclusion

The PSSM provides the required information to find local alignments within other sequences. By looking at the WebLogos, we can see that both online tools have aligned our sequences from the WW domain in a very similar fashion and resemble the HMM Logo of the WW domain, yet the MUSCLE alignment provides better results than the OMEGA alignemnt.

All of the WW domains from both aligned sequences have been found, however the algorithm can probably be tweaked to have better start and end conditions for the alignment, since it sometimes returns only part of the domain. We can also notice that, when a complete domain is found, its start and end index within the sequence differ very slighlty (one or two amino acids) with the UniProt alignment; this could be due to tiny differences in the PSSM values, to the arbitrary selection of results from the backtracking (only the first is displayed for each lookup), or even to the constant and arbitrary gap penalty (different gap penalties per column have been implemented but not used).

Considering that the two tested sequences come from different organisms than the ones used for creating the PSSM, the results are positive overall.