In [5]:
%matplotlib inline
from sys import path
from os.path import join as joinpath
from os.path import normpath
path.append(normpath(".."))
from pyprot.base.aminoacid import AminoAcid
from pyprot.data.dssp import DSSP
from pyprot.structure.gor import GOR3
Proteins have four levels of structure:
The goal of this project is to predict the secondary structure of a protein, based on its primary structure. This is useful in the context of multiple sequence alignment, since proteins that exerce the same function are likely to have similar secondary structure as well as related primary structures. Fortunately for us, secondary structures can be observed experimentally via multiple techniques, granting us the possibility to train and verify our prediction system with real-world data. Furthermore, the folding of proteins (into their secondary and tertiary stable structures) is highly deterministic, which means it can be predicted based on the primary structure alone.
DSSP stands for Define Secondary Structure of Proteins and is a standard for how the atomic 3D arrangement of a protein is translated into secondary structures. DSSP admits eight types of secondary structures and assigns one to each amino acid from a protein by examining their spacial coordinates. We won't be implementing DSSP, however we will need to parse .dssp
files in order to extract secondary structure information to train and verify our prediction system. Here is a class that parses such files:
Note that we won't be using all eight structures, but rather regroup them into four classes:
GOR stands for Garnier-Osguthorpe-Robson and is a secondary structure prediction method based on information theory. It has had several releases, each increasing the prediction accuracy, but we will only focus on the GOR III version here. This version uses two kinds of information to issue a prediction, all based on known protein-structure pairs parsed from a training dataset. In the following formulas, $R_j$ is the residue (amino acid) at index $j$ whose structure is being predicted, $S_j$ is one of the structures, $n-S_j$ represents all of the structures except for $S_j$, $f_{c_1,...c_k}$ is the frequency with which all conditions $c_1$ through $c_k$ are met within the training dataset and $I(\Delta S, ...) = I(S, ...) - I(n-S, ...)$ is the information difference between the predictions concerning $S$ and $n-S$.
Overall, the formula applied for the GOR III prediction is: $$I(\Delta S_j, R_{j-n} ... R_{j+n}) = I(\Delta S_j, R_j) + \sum_{m=-n, m \neq 0}^{m=n}{I(\Delta S_j, R_{j+m} | R_j)}$$
Here is an implementation of the algorithm, that we can train with new sequences then use to predict the structure of other sequences:
Our model can be seen as a set of binary classifiers, each predicting if a given amino acid belongs or not to one structure. There is one classifier per structure, each has its own values for $TP$ (true positives), $TN$ (true negatives), $FP$ (false positives) and $FN$ (false negatives).
$Q_3$ is equal to the number of correctly predicted residues, divided by the total number of residues. In our case, since we know of four different structures, anything over $\frac{1}{4} = 0.25$ is better than completely random predictions.
Matthews Correlation Coefficient (MCC) $$\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
Receiver Operating Characteristic (ROC) curve is a graphical representation of a binary classifier's results. The curve is obtained by plotting together the true positive rate or TPR along the $y$ axis, and the false positive rate or FPR along the $x$ axis. The data is generated for different threshold values which discriminate the positive and negative results. The curve is usually displayed along with a diagonal line joining the dots $(0,0)$ and $(1,1)$, representing the random classifier's ROC (which has a 50% chance of classifying an instance as positive or negative). If our classifier's curve is on top of that line, it means it performs better. Furthermore, the point from our curve closest to $(0,1)$ has the best threshold for efficiently classifying data.
Area Under Curve (AUC) is the area under a curve. The AUC of a ROC Curve is equal to the probability that its binary classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
Time to run that bytecode.
The first step towards predicting structures is to gather training and validation data. Fortunately, our dear TA has provided us with all the .dssp
files we need. The only miscellaneous stuff here is that we're not interested in the whole data, just some chains from some files (specified in info
files). The output goes in the PARSED
files in a format similar to fasta
, where the structure follows the sequence.
In [6]:
datasetPath = joinpath("resources", "dataset")
with open(joinpath(datasetPath, "CATH_info.txt")) as infoFile:
with open(joinpath(datasetPath, "CATH_info-PARSED.txt"), 'w') as outFile:
dsspPath = joinpath(datasetPath, "dssp", "")
for line in infoFile.readlines():
d = DSSP(dsspPath + line[0:4] + ".dssp")
description = "> " + d.identifier + "|" + d.protein + "|" + d.organism
seq, struct = d.getSequenceStructure(line[4])
outFile.writelines(l + "\n" for l in [description,seq,struct])
with open(joinpath(datasetPath, "CATH_info_test.txt")) as infoFile:
with open(joinpath(datasetPath, "CATH_info_test-PARSED.txt"), 'w') as outFile:
dsspTestPath = joinpath(datasetPath, "dssp_test", "")
for line in infoFile.readlines():
d = DSSP(dsspTestPath + line[0:4] + ".dssp")
description = "> " + d.identifier + "|" + d.protein + "|" + d.organism
seq, struct = d.getSequenceStructure(line[4])
outFile.writelines(l + "\n" for l in [description,seq,struct])
In [7]:
gor3Pred = GOR3()
with open(joinpath(datasetPath, "CATH_info-PARSED.txt")) as inFile:
index = 0
sequence = ""
for line in inFile.readlines():
line = line.strip().upper()
if not (line=="" or line[0]==">"):
#Line is a sequence
if index % 2 == 0:
sequence = line
#Line is a structure
else:
gor3Pred.train(sequence, line)
index += 1
In [8]:
with open(joinpath(datasetPath, "CATH_info_test-PARSED.txt")) as inFile:
index = 0
sequence = ""
for line in inFile.readlines():
line = line.strip().upper()
if not (line=="" or line[0]==">"):
#Line is a sequence
if index % 2 == 0:
sequence = line
#Line is a structure
else:
structure = line
prediction = gor3Pred.predict(sequence, structure)
inter = "".join([":" if s1==s2 else " " for s1,s2 in zip(structure, prediction)])
print("-------- STRUCTURE (top) vs PREDICTION (bottom) --------")
print("True Positive Rate:", round(inter.count(":")/len(inter), 2))
print()
chunk = 80
for start in range(0, len(structure), chunk):
stop = start+chunk+1 if start+chunk+1<=len(structure) else len(structure)
print(structure[start:stop])
print(inter[start:stop])
print(prediction[start:stop])
print()
print()
index += 1
In [9]:
q3, mcc = gor3Pred.getQuality()
print("----- Overall quality -----")
print("Q3:", round(q3, 2))
print("MCC:", round(mcc, 2))
print()
gor3Pred.plotROC()
As shown by the AUC and the ROC curves, our classifiers are definitely better than completely random: the overall Q3 of 0.37 bests the random Q3 (0.25). However they do not perform very well when compared to other experimental results (Garnier et al, 1996) where the GOR 3 algorithm was trained with a database of 267 proteins, which obtained a Q3 of around 63.3% (yet this may be overestimated by 1% due to overfitting the training data by choosing better pseudocounts). In fact our Q3 is even worse than that of a random classifier with a distribution matching the real-world frequency of the structures, which is around 0.38.
In conclusion, I wouldn't use this classifier for anything too serious. However the implementation of the GOR 3 algorithm should be complete, therefore with a better training set we may be able to achieve similar results to other experiments.