Xie J, Xu Z, Zhou S, Pan X, Cai S, Yang L, et al. (2013) The VHSE-Based Prediction of Proteasomal Cleavage Sites. PLoS ONE 8(9): e74506.
doi:10.1371/journal.pone.0074506
Abstract: "Prediction of proteasomal cleavage sites has been a focus of computational biology. Up to date, the predictive methods are mostly based on nonlinear classifiers and variables with little physicochemical meanings. In this paper, the physicochemical properties of 14 residues both upstream and downstream of a cleavage site are characterized by VHSE (principal component score vector of hydrophobic, steric, and electronic properties) descriptors. Then, the resulting VHSE descriptors are employed to construct prediction models by support vector machine (SVM). For both in vivo and in vitro datasets, the performance of VHSE-based method is comparatively better than that of the well-known PAProC, MAPPP, and NetChop methods. The results reveal that the hydrophobic property of 10 residues both upstream and downstream of the cleavage site is a dominant factor affecting in vivo and in vitro cleavage specificities, followed by residue’s electronic and steric properties. Furthermore, the difference in hydrophobic potential between residues flanking the cleavage site is proposed to favor substrate cleavages. Overall, the interpretable VHSE-based method provides a preferable way to predict proteasomal cleavage sites."
Notes:
Amino acids grouped by electrically charged, polar uncharged, hydrophobic, and special case sidechains.
Each amino acid has a single letter designation.
Image by Dancojocari [CC BY-SA 3.0 or GFDL], via Wikimedia Commons
Bradykinin is an inflammatory mediator. It is a peptide that causes blood vessels to dilate (enlarge), and therefore causes blood pressure to fall. A class of drugs called ACE inhibitors, which are used to lower blood pressure, increase bradykinin (by inhibiting its degradation) further lowering blood pressure.
>sp|P01042|KNG1_HUMAN Kininogen-1 OS=Homo sapiens GN=KNG1 PE=1 SV=2
MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNSQNQSNNQFVLYRIT
EATKTVGSDTFYSFKYEIKEGDCPVQSGKTWQDCEYKDAAKAATGECTATVGKRSSTKFS
VATQTCQITPAEGPVVTAQYDCLGCVHPISTQSPDLEPILRHGIQYFNNNTQHSSLFMLN
EVKRAQRQVVAGLNFRITYSIVQTNCSKENFLFLTPDCKSLWNGDTGECTDNAYIDIQLR
IASFSQNCDIYPGKDFVQPPTKICVGCPRDIPTNSPELEETLTHTITKLNAENNATFYFK
IDNVKKARVQVVAGKKYFIDFVARETTCSKESNEELTESCETKKLGQSLDCNAEVYVVPW
EKKIYPTVNCQPLGMISLMKRPPGFSPFRSSRIGEIKEETTVSPPHTSMAPAQDEERDSG
KEQGHTRRHDWGHEKQRKHNLGHGHKHERDQGHGHQRGHGLGHGHEQQHGLGHGHKFKLD
DDLEHQGGHVLDHGHKHKHGHGHGKHKNKGKKNGKHNGWKTEHLASSSEDSTTPSAQTQE
KTEGPTPIPSLAKPGVTVTFSDFQDSDLIATMMPPISPAPIQSDDDWIPDIQIDPNGLSF
NPISDFPDTTSPKCPGRPWKSVSEINPTTQMKESYYFDLTDGLS
The proteasome digests polypeptides into smaller peptides 5–25 amino acids in length and is the major protease responsible for generating peptide C termini.
Transporter associated with Antigen Processing (TAP) binds to peptides of length 9-20 amino acids and transports them into the endoplasmic reticulum (ER).
Image by Scray - Own work, CC BY-SA 3.0, Link
Text from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2913210/
Cytotoxic T lymphocytes (CTLs) are the effector cells of the adaptive immune response that deal with infected, or malfunctioning, cells. Whereas intracellular pathogens are shielded from antibodies, CTLs are endowed with the ability to recognize and destroy cells harbouring intracellular threats. This obviously requires that information on the intracellular protein metabolism (including that of any intracellular pathogen) be translocated to the outside of the cell, where the CTL reside. To this end, the immune system has created an elaborate system of antigen processing and presentation. During the initial phase of antigen processing, peptide antigens are generated from intracellular pathogens and translocated into the endoplasmic reticulum. In here, these peptide antigens are specifically sampled by major histocompatibility complex (MHC) class I molecules and then exported to the cell surface, where they are presented as stable peptide: MHC I complexes awaiting the arrival of scrutinizing T cells. Hence, identifying which peptides are able to induce CTLs is of general interest for our understanding of the immune system, and of particular interest for the development of vaccines and immunotherapy directed against infectious pathogens, as previously reviewed. Peptide binding to MHC molecules is the key feature in cell-mediated immunity, because it is the peptide–MHC class I complex that can be recognized by the T-cell receptor (TCR) and thereby initiate the immune response. The CTLs are CD8+ T cells, whose TCRs recognize foreign peptides in complex with MHC class I molecules. In addition to peptide binding to MHC molecules, several other events have to be considered to be able to explain why a given peptide is eventually presented at the cell surface. Generally, an immunogenic peptide is generated from proteins expressed within the presenting cell, and peptides originating from proteins with high expression rate will normally have a higher chance of being immunogenic, compared with peptides from proteins with a lower expression rate. There are, however, significant exceptions to this generalization, e.g. cross-presentation, but this will be ignored in the following. In the classical MHC class I presenting pathway (see image on right) proteins expressed within a cell will be degraded in the cytosol by the protease complex, named the proteasome. The proteasome digests polypeptides into smaller peptides 5–25 amino acids in length and is the major protease responsible for generating peptide C termini. Some of the peptides that survive further degradation by other cytosolic exopeptidases can be bound by the transporter associated with antigen presentation (TAP), reviewed by Schölz et al. This transporter molecule binds peptides of lengths 9–20 amino acids and transports the peptides into the endoplasmic reticulum, where partially folded MHC molecules [in humans called human leucocyte antigens (HLA)], will complete folding if the peptide is able to bind to the particular allelic MHC molecule. The latter step is furthermore facilitated by the endoplasmic-reticulum-hosted protein tapasin. Each of these steps has been characterized and their individual importance has been related to final presentation on the cell surface.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, metrics
from sklearn.preprocessing import MinMaxScaler
The principal component score Vector of Hydrophobic, Steric, and Electronic properties (VHSE) is a set of amino acid descriptors that come from A new set of amino acid descriptors and its application in peptide QSARs
In [2]:
# (3-letter, VHSE1, VHSE2, VHSE3, VHSE4, VHSE5, VHSE6, VHSE7, VHSE8)
vhse = {
"A": ("Ala", 0.15, -1.11, -1.35, -0.92, 0.02, -0.91, 0.36, -0.48),
"R": ("Arg", -1.47, 1.45, 1.24, 1.27, 1.55, 1.47, 1.30, 0.83),
"N": ("Asn", -0.99, 0.00, -0.37, 0.69, -0.55, 0.85, 0.73, -0.80),
"D": ("Asp", -1.15, 0.67, -0.41, -0.01, -2.68, 1.31, 0.03, 0.56),
"C": ("Cys", 0.18, -1.67, -0.46, -0.21, 0.00, 1.20, -1.61, -0.19),
"Q": ("Gln", -0.96, 0.12, 0.18, 0.16, 0.09, 0.42, -0.20, -0.41),
"E": ("Glu", -1.18, 0.40, 0.10, 0.36, -2.16, -0.17, 0.91, 0.02),
"G": ("Gly", -0.20, -1.53, -2.63, 2.28, -0.53, -1.18, 2.01, -1.34),
"H": ("His", -0.43, -0.25, 0.37, 0.19, 0.51, 1.28, 0.93, 0.65),
"I": ("Ile", 1.27, -0.14, 0.30, -1.80, 0.30, -1.61, -0.16, -0.13),
"L": ("Leu", 1.36, 0.07, 0.26, -0.80, 0.22, -1.37, 0.08, -0.62),
"K": ("Lys", -1.17, 0.70, 0.70, 0.80, 1.64, 0.67, 1.63, 0.13),
"M": ("Met", 1.01, -0.53, 0.43, 0.00, 0.23, 0.10, -0.86, -0.68),
"F": ("Phe", 1.52, 0.61, 0.96, -0.16, 0.25, 0.28, -1.33, -0.20),
"P": ("Pro", 0.22, -0.17, -0.50, 0.05, -0.01, -1.34, -0.19, 3.56),
"S": ("Ser", -0.67, -0.86, -1.07, -0.41, -0.32, 0.27, -0.64, 0.11),
"T": ("Thr", -0.34, -0.51, -0.55, -1.06, 0.01, -0.01, -0.79, 0.39),
"W": ("Trp", 1.50, 2.06, 1.79, 0.75, 0.75, -0.13, -1.06, -0.85),
"Y": ("Tyr", 0.61, 1.60, 1.17, 0.73, 0.53, 0.25, -0.96, -0.52),
"V": ("Val", 0.76, -0.92, 0.17, -1.91, 0.22, -1.40, -0.24, -0.03)}
There were eight dataset used in this study. The reference datasets (s1, s3, s5, s7) were converted into the actual datasets used in the analysis (s2, s4, s6, s8) using the vhse
vector. The s2 and s4 datasets were used for training the SVM model and the s6 and s8 were used for testing.
In [3]:
%ls data/proteasomal_cleavage
In [4]:
from aa_props import seq_to_aa_props
# Converts the raw input into our X matrix and y vector. The 'peptide_key'
# and 'activity_key' parameters are the names of the column in the dataframe
# for the peptide amino acid string and activity (not cleaved/cleaved)
# respectively. The 'sequence_len' allows for varying the number of flanking
# amino acids to cleavage site (which is at position 14 of 28 in each cleaved
# sample.
def dataset_to_X_y(dataframe, peptide_key, activity_key, sequence_len = 28, use_vhse = True):
raw_peptide_len = 28
if (sequence_len % 2 or sequence_len > raw_peptide_len or sequence_len <= 0):
raise ValueError("sequence_len needs to an even value (0,%d]" % (raw_peptide_len))
X = []
y = []
for (peptide, activity) in zip(dataframe[peptide_key], dataframe[activity_key]):
if (len(peptide) != raw_peptide_len):
# print "Skipping peptide! len(%s)=%d. Should be len=%d" \
# % (peptide, len(peptide), raw_peptide_len)
continue
y.append(activity)
num_amino_acids_to_clip = (raw_peptide_len - sequence_len) / 2
clipped_peptide = peptide if num_amino_acids_to_clip == 0 else \
peptide[num_amino_acids_to_clip:-num_amino_acids_to_clip]
# There is a single peptide in dataset s6 with an "'" in the sequence.
# The VHSE values used for it in the study match Proline (P).
clipped_peptide = clipped_peptide.replace('\'', 'P')
row = []
if use_vhse:
for amino_acid in clipped_peptide:
row.append(vhse[amino_acid][1]) # hydrophobic
row.append(vhse[amino_acid][3]) # steric
row.append(vhse[amino_acid][5]) # electric
else:
row = seq_to_aa_props(clipped_peptide)
X.append(row)
return (X, y)
To create the in vivo training set, the authors
This process created 5,087 training samples: 2,607 cleavage and 2,480 non-cleavage samples.
In [5]:
training_set = pd.DataFrame.from_csv("data/proteasomal_cleavage/s2_in_vivo_mhc_1_antijen_swiss_prot_dataset.csv")
print training_set.head(3)
The authors measured linear, polynomial, radial basis, and sigmoid kernel and found no significant difference in performance. The linear kernel was chosen for its simplicity and interpretability. The authors did not provide the C
value used in their linear model, so I used GridSearchCV
to find the best value.
In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFECV
def create_linear_svc_model(parameters, sequence_len = 28, use_vhse = True):
scaler = MinMaxScaler()
(X_train_unscaled, y_train) = dataset_to_X_y(training_set, \
"Sequence", "Activity", \
sequence_len = sequence_len, \
use_vhse = use_vhse)
X_train = pd.DataFrame(scaler.fit_transform(X_train_unscaled))
parameters={'estimator__C': [pow(2, i) for i in xrange(-25, 4, 1)]}
svc = svm.LinearSVC()
rfe = RFECV(estimator=svc, step=.1, cv=2, scoring='accuracy', n_jobs=8)
clf = GridSearchCV(rfe, parameters, scoring='accuracy', n_jobs=8, cv=2, verbose=1)
clf.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (clf.best_score_, clf.best_params_))
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
params = clf.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
print("%f (%f) with: %r" % (mean, stdev, param))
#
#svr = svm.LinearSVC()
#clf = GridSearchCV(svr, parameters, cv=10, scoring='accuracy', n_jobs=1)
#clf.fit(X_train, y_train)
#print("The best parameters are %s with a score of %0.2f" \
# % (clf.best_params_, clf.best_score_))
return (scaler, clf)
(vhse_scaler, vhse_model) = create_linear_svc_model(
parameters = {'estimator__C': [pow(2, i) for i in xrange(-25, 4, 1)]},
use_vhse = False)
In [39]:
def test_linear_svc_model(scaler, model, sequence_len = 28, use_vhse = True):
testing_set = pd.DataFrame.from_csv("data/proteasomal_cleavage/s6_in_vivo_mhc_1_ligands_dataset.csv")
(X_test_prescaled, y_test) = dataset_to_X_y(testing_set, \
"Sequences", "Activity", \
sequence_len = sequence_len,\
use_vhse = use_vhse)
X_test = pd.DataFrame(scaler.transform(X_test_prescaled))
y_predicted = model.predict(X_test)
accuracy = 100.0 * metrics.accuracy_score(y_test, y_predicted)
((tn, fp), (fn, tp)) = metrics.confusion_matrix(y_test, y_predicted, labels=[-1, 1])
sensitivity = 100.0 * tp/(tp + fn)
specificity = 100.0 * tn/(tn + fp)
mcc = metrics.matthews_corrcoef(y_test, y_predicted)
print "Authors reported performance"
print "Acc: 73.5, Sen: 82.3, Spe: 64.8, MCC: 0.48"
print "Notebook performance (sequence_len=%d, use_vhse=%s)" % (sequence_len, use_vhse)
print "Acc: %.1f, Sen: %.1f, Spe: %.1f, MCC: %.2f" \
%(accuracy, sensitivity, specificity, mcc)
test_linear_svc_model(vhse_scaler, vhse_model, use_vhse = False)
testing_set = pd.DataFrame.from_csv("data/proteasomal_cleavage/s6_in_vivo_mhc_1_ligands_dataset.csv")
(X_test_prescaled, y_test) = dataset_to_X_y(testing_set, \
"Sequences", "Activity", \
sequence_len = 28,\
use_vhse = False)
X_test = pd.DataFrame(vhse_scaler.transform(X_test_prescaled))
poslabels = ["-%02d" % (i) for i in range(14, 0, -1)] + ["+%02d" % (i) for i in range(1,15)]
# 18 H 17 S 15 E
proplables = ["H%02d" % (i) for i in range(18)] + ["S%02d" % (i) for i in range(17)] + ["E%02d" % (i) for i in range(15)]
cols = []
for poslabel in poslabels:
for proplable in proplables:
cols.append("%s%s" % (poslabel, proplable))
X_test.columns = cols
for col in X_test.columns[vhse_model.best_estimator_.get_support()]:
print col
The VHSE1
variable at the P1 position has the largest positive weight coefficient (10.49) in line with research showing that C-terminal residues are usually hydrophobic to aid in ER transfer and binding to the MHC molecule.
There is a mostly positive and mostly negative coefficents upstream and downstream of the cleavage site respectively. This potential difference appears to be conducive to cleavage.
In [ ]:
#h = svr.coef_[:, 0::3]
#s = svr.coef_[:, 1::3]
#e = svr.coef_[:, 2::3]
#%matplotlib notebook
#n_groups = h.shape[1]
#fig, ax = plt.subplots(figsize=(12,9))
#index = np.arange(n_groups)
#bar_width = 0.25
#ax1 = ax.bar(index + bar_width, h.T, bar_width, label="Hydrophobic", color='b')
#ax2 = ax.bar(index, s.T, bar_width, label="Steric", color='r')
#ax3 = ax.bar(index - bar_width, e.T, bar_width, label="Electronic", color='g')
#ax.set_xlim(-bar_width,len(index)+bar_width)
#plt.xlabel('Amino Acid Position')
#plt.ylabel('SVM Coefficient Value')
#plt.title('Hydrophobic, Steric, and Electronic Effect by Amino Acid Position')
#plt.xticks(index, range (n_groups/2, 0, -1) + [str(i)+"'" for i in range (1, n_groups/2+1)])
#plt.legend()
#plt.tight_layout()
#plt.show()
Matrix | Features | Sensitivity | Specificity | MCC |
---|---|---|---|---|
VHSE | 3x20=60 | 82.2 | 63.2 | 0.46 |
Full | 50x20=1000 | 81.2 | 64.1 | 0.46 |
In [ ]:
# Performance with no VHSE
(no_vhse_scaler, no_vhse_model) = create_linear_svc_model(
parameters = {'C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]},
use_vhse = False)
test_linear_svc_model(no_vhse_scaler, no_vhse_model, use_vhse = False)
In [ ]:
# Performance with more flanking residues and no VHSE
(full_flank_scaler, full_flank_model) = create_linear_svc_model(
parameters = {'C': [0.0001, 0.003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]},
use_vhse = False, sequence_len = 28)
test_linear_svc_model(full_flank_scaler, full_flank_model, use_vhse = False, sequence_len=28)
Chipper uses full 50 properties per residue instead of VHSE parameters
Method | Sensitivity | Specificity | MCC | Notes |
---|---|---|---|---|
PAProC | 45.6 | 30.0 | -0.25 | |
FragPredict | 83.5 | 16.5 | 0.00 | |
NetChop 1.0 | 39.8 | 46.3 | -0.14 | |
NetChop 2.0 | 73.6 | 42.4 | 0.16 | |
NetChop 3.0 | 81.0 | 48.0 | 0.31 | |
VHSE-based SVC | 82.3 | 64.8 | 0.48 | 3x20=60 features |
chipper VHSE (SVC) | 79.3 | 72.6 | 0.520 | 3x20=60 features |
chipper VHSE (LR) | 83.2 | 69.2 | 0.529 | 3x20=60 features |
chipper (SVC) | 79.3 | 77.9 | 0.572 | 50x20=1000 features |
chipper (LR) | 87.0 | 74.5 | 0.620 | 50x20=1000 features |
xgboost | 87.0 | 74.5 | 0.620 | 50x20=1000 features |
FANN | 82.7 | 78.8 | 0.616 | 50x20=1000 features |
Source notebook: https://github.com/massie/notebooks/blob/master/VHSE-Based%20Prediction%20of%20Proteasomal%20Cleavage%20Sites.ipynb
Chipper repo: https://github.com/massie/chipper