VHSE-Based Prediction of Proteasomal Cleavage Sites

Xie J, Xu Z, Zhou S, Pan X, Cai S, Yang L, et al. (2013) The VHSE-Based Prediction of Proteasomal Cleavage Sites. PLoS ONE 8(9): e74506. doi:10.1371/journal.pone.0074506

Abstract: "Prediction of proteasomal cleavage sites has been a focus of computational biology. Up to date, the predictive methods are mostly based on nonlinear classifiers and variables with little physicochemical meanings. In this paper, the physicochemical properties of 14 residues both upstream and downstream of a cleavage site are characterized by VHSE (principal component score vector of hydrophobic, steric, and electronic properties) descriptors. Then, the resulting VHSE descriptors are employed to construct prediction models by support vector machine (SVM). For both in vivo and in vitro datasets, the performance of VHSE-based method is comparatively better than that of the well-known PAProC, MAPPP, and NetChop methods. The results reveal that the hydrophobic property of 10 residues both upstream and downstream of the cleavage site is a dominant factor affecting in vivo and in vitro cleavage specificities, followed by residue’s electronic and steric properties. Furthermore, the difference in hydrophobic potential between residues flanking the cleavage site is proposed to favor substrate cleavages. Overall, the interpretable VHSE-based method provides a preferable way to predict proteasomal cleavage sites."

Notes:

Databases used in this study to create training and test sets
- Immune Epitope Database and Analysis Resource
- NCBI's reference sequence (RefSeq) database
- AntiJen - a kinetic, thermodynamic and cellular database v2.0
- ExPASy/SWISS-PROT

Peptide Formation

Image by GYassineMrabetTalk. (Own work) [Public domain], via Wikimedia Commons

Hydrophobic, Steric, and Electronic Properties

Amino acids grouped by electrically charged, polar uncharged, hydrophobic, and special case sidechains.

Each amino acid has a single letter designation.

Image by Dancojocari [CC BY-SA 3.0 or GFDL], via Wikimedia Commons

Protein Representation (FASTA)

Bradykinin is an inflammatory mediator. It is a peptide that causes blood vessels to dilate (enlarge), and therefore causes blood pressure to fall. A class of drugs called ACE inhibitors, which are used to lower blood pressure, increase bradykinin (by inhibiting its degradation) further lowering blood pressure.

>sp|P01042|KNG1_HUMAN Kininogen-1 OS=Homo sapiens GN=KNG1 PE=1 SV=2
MKLITILFLCSRLLLSLTQESQSEEIDCNDKDLFKAVDAALKKYNSQNQSNNQFVLYRIT
EATKTVGSDTFYSFKYEIKEGDCPVQSGKTWQDCEYKDAAKAATGECTATVGKRSSTKFS
VATQTCQITPAEGPVVTAQYDCLGCVHPISTQSPDLEPILRHGIQYFNNNTQHSSLFMLN
EVKRAQRQVVAGLNFRITYSIVQTNCSKENFLFLTPDCKSLWNGDTGECTDNAYIDIQLR
IASFSQNCDIYPGKDFVQPPTKICVGCPRDIPTNSPELEETLTHTITKLNAENNATFYFK
IDNVKKARVQVVAGKKYFIDFVARETTCSKESNEELTESCETKKLGQSLDCNAEVYVVPW
EKKIYPTVNCQPLGMISLMKRPPGFSPFRSSRIGEIKEETTVSPPHTSMAPAQDEERDSG
KEQGHTRRHDWGHEKQRKHNLGHGHKHERDQGHGHQRGHGLGHGHEQQHGLGHGHKFKLD
DDLEHQGGHVLDHGHKHKHGHGHGKHKNKGKKNGKHNGWKTEHLASSSEDSTTPSAQTQE
KTEGPTPIPSLAKPGVTVTFSDFQDSDLIATMMPPISPAPIQSDDDWIPDIQIDPNGLSF
NPISDFPDTTSPKCPGRPWKSVSEINPTTQMKESYYFDLTDGLS

Bradykinin Structure

By Yikrazuul (Own work) [Public domain], via Wikimedia Commons

MHC Class I Processing

The proteasome digests polypeptides into smaller peptides 5–25 amino acids in length and is the major protease responsible for generating peptide C termini.

Transporter associated with Antigen Processing (TAP) binds to peptides of length 9-20 amino acids and transports them into the endoplasmic reticulum (ER).

Image by Scray - Own work, CC BY-SA 3.0, Link

Text from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2913210/

Cytotoxic T lymphocytes (CTLs) are the effector cells of the adaptive immune response that deal with infected, or malfunctioning, cells. Whereas intracellular pathogens are shielded from antibodies, CTLs are endowed with the ability to recognize and destroy cells harbouring intracellular threats. This obviously requires that information on the intracellular protein metabolism (including that of any intracellular pathogen) be translocated to the outside of the cell, where the CTL reside. To this end, the immune system has created an elaborate system of antigen processing and presentation. During the initial phase of antigen processing, peptide antigens are generated from intracellular pathogens and translocated into the endoplasmic reticulum. In here, these peptide antigens are specifically sampled by major histocompatibility complex (MHC) class I molecules and then exported to the cell surface, where they are presented as stable peptide: MHC I complexes awaiting the arrival of scrutinizing T cells. Hence, identifying which peptides are able to induce CTLs is of general interest for our understanding of the immune system, and of particular interest for the development of vaccines and immunotherapy directed against infectious pathogens, as previously reviewed. Peptide binding to MHC molecules is the key feature in cell-mediated immunity, because it is the peptide–MHC class I complex that can be recognized by the T-cell receptor (TCR) and thereby initiate the immune response. The CTLs are CD8+ T cells, whose TCRs recognize foreign peptides in complex with MHC class I molecules. In addition to peptide binding to MHC molecules, several other events have to be considered to be able to explain why a given peptide is eventually presented at the cell surface. Generally, an immunogenic peptide is generated from proteins expressed within the presenting cell, and peptides originating from proteins with high expression rate will normally have a higher chance of being immunogenic, compared with peptides from proteins with a lower expression rate. There are, however, significant exceptions to this generalization, e.g. cross-presentation, but this will be ignored in the following. In the classical MHC class I presenting pathway (see image on right) proteins expressed within a cell will be degraded in the cytosol by the protease complex, named the proteasome. The proteasome digests polypeptides into smaller peptides 5–25 amino acids in length and is the major protease responsible for generating peptide C termini. Some of the peptides that survive further degradation by other cytosolic exopeptidases can be bound by the transporter associated with antigen presentation (TAP), reviewed by Schölz et al. This transporter molecule binds peptides of lengths 9–20 amino acids and transports the peptides into the endoplasmic reticulum, where partially folded MHC molecules [in humans called human leucocyte antigens (HLA)], will complete folding if the peptide is able to bind to the particular allelic MHC molecule. The latter step is furthermore facilitated by the endoplasmic-reticulum-hosted protein tapasin. Each of these steps has been characterized and their individual importance has been related to final presentation on the cell surface.



In [1]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, metrics
from sklearn.preprocessing import MinMaxScaler

The principal component score Vector of Hydrophobic, Steric, and Electronic properties (VHSE) is a set of amino acid descriptors that come from A new set of amino acid descriptors and its application in peptide QSARs

VHSE1 and VHSE2 are related to hydrophobic (H) properties,
VHSE3 and VHSE4 to steric (S) properties, and
VHSE5 to VHSE8 to electronic (E) properties.



In [2]:

    
# (3-letter, VHSE1, VHSE2, VHSE3, VHSE4, VHSE5, VHSE6, VHSE7, VHSE8)
vhse = {
"A": ("Ala", 0.15, -1.11, -1.35, -0.92, 0.02, -0.91, 0.36, -0.48),
"R": ("Arg", -1.47, 1.45, 1.24, 1.27, 1.55, 1.47, 1.30, 0.83),
"N": ("Asn", -0.99, 0.00, -0.37, 0.69, -0.55, 0.85, 0.73, -0.80),
"D": ("Asp", -1.15, 0.67, -0.41, -0.01, -2.68, 1.31, 0.03, 0.56),
"C": ("Cys", 0.18, -1.67, -0.46, -0.21, 0.00, 1.20, -1.61, -0.19),
"Q": ("Gln", -0.96, 0.12, 0.18, 0.16, 0.09, 0.42, -0.20, -0.41),
"E": ("Glu", -1.18, 0.40, 0.10, 0.36, -2.16, -0.17, 0.91, 0.02),
"G": ("Gly", -0.20, -1.53, -2.63, 2.28, -0.53, -1.18, 2.01, -1.34),
"H": ("His", -0.43, -0.25, 0.37, 0.19, 0.51, 1.28, 0.93, 0.65),
"I": ("Ile", 1.27, -0.14, 0.30, -1.80, 0.30, -1.61, -0.16, -0.13),
"L": ("Leu", 1.36, 0.07, 0.26, -0.80, 0.22, -1.37, 0.08, -0.62),
"K": ("Lys", -1.17, 0.70, 0.70, 0.80, 1.64, 0.67, 1.63, 0.13),
"M": ("Met", 1.01, -0.53, 0.43, 0.00, 0.23, 0.10, -0.86, -0.68),
"F": ("Phe", 1.52, 0.61, 0.96, -0.16, 0.25, 0.28, -1.33, -0.20),
"P": ("Pro", 0.22, -0.17, -0.50, 0.05, -0.01, -1.34, -0.19, 3.56),
"S": ("Ser", -0.67, -0.86, -1.07, -0.41, -0.32, 0.27, -0.64, 0.11),
"T": ("Thr", -0.34, -0.51, -0.55, -1.06, 0.01, -0.01, -0.79, 0.39),
"W": ("Trp", 1.50, 2.06, 1.79, 0.75, 0.75, -0.13, -1.06, -0.85),
"Y": ("Tyr", 0.61, 1.60, 1.17, 0.73, 0.53, 0.25, -0.96, -0.52),
"V": ("Val", 0.76, -0.92, 0.17, -1.91, 0.22, -1.40, -0.24, -0.03)}

There were eight dataset used in this study. The reference datasets (s1, s3, s5, s7) were converted into the actual datasets used in the analysis (s2, s4, s6, s8) using the vhse vector. The s2 and s4 datasets were used for training the SVM model and the s6 and s8 were used for testing.



In [3]:

    
%ls data/proteasomal_cleavage









    



s1_in_vivo_mhc_1_antijen_swiss_prot_reference.csv
s2_in_vivo_mhc_1_antijen_swiss_prot_dataset.csv
s3_in_vitro_from_iedb_ncbi_reference.csv
s4_in_vitro_from_iedb_ncbi_dataset.csv
s5_in_vivo_mhc_1_ligands_reference.csv
s6_in_vivo_mhc_1_ligands_dataset.csv
s7_in_vitro_ssx2_hivnef_rui_reference.csv
s8_in_vitro_ssx2_hivnef_rui_dataset.csv



In [4]:

    
from aa_props import seq_to_aa_props
# Converts the raw input into our X matrix and y vector. The 'peptide_key'
# and 'activity_key' parameters are the names of the column in the dataframe
# for the peptide amino acid string and activity (not cleaved/cleaved) 
# respectively. The 'sequence_len' allows for varying the number of flanking
# amino acids to cleavage site (which is at position 14 of 28 in each cleaved
# sample.
def dataset_to_X_y(dataframe, peptide_key, activity_key, sequence_len = 28, use_vhse = True):
    raw_peptide_len = 28
    if (sequence_len % 2 or sequence_len > raw_peptide_len or sequence_len <= 0):
        raise ValueError("sequence_len needs to an even value (0,%d]" % (raw_peptide_len))
    X = []
    y = []
    for (peptide, activity) in zip(dataframe[peptide_key], dataframe[activity_key]):
        if (len(peptide) != raw_peptide_len):
            # print "Skipping peptide! len(%s)=%d. Should be len=%d" \
            #                   % (peptide, len(peptide), raw_peptide_len)
            continue
        y.append(activity)

        num_amino_acids_to_clip = (raw_peptide_len - sequence_len) / 2
        clipped_peptide = peptide if num_amino_acids_to_clip == 0 else \
                        peptide[num_amino_acids_to_clip:-num_amino_acids_to_clip]
        # There is a single peptide in dataset s6 with an "'" in the sequence.
        # The VHSE values used for it in the study match Proline (P).
        clipped_peptide = clipped_peptide.replace('\'', 'P')
        row = []
        if use_vhse:
            for amino_acid in clipped_peptide:
                row.append(vhse[amino_acid][1]) # hydrophobic
                row.append(vhse[amino_acid][3]) # steric
                row.append(vhse[amino_acid][5]) # electric
        else:
            row = seq_to_aa_props(clipped_peptide)
        X.append(row)
    return (X, y)

Creating the In Vivo Data

To create the in vivo training set, the authors

Queried the AntiJen database (7,324 MHC-I ligands)
Removed ligands with unknown source protein in ExPASy/SWISS-PROT (6036 MHC-I ligands)
Removed duplicate ligands (3,148 ligands)
Removed the 231 ligands used for test samples by Saxova et al, (2,917 ligands)
Removed sequences less than 28 residues (2,607 ligands) to create the cleavage sample set
Assigned non-cleavage sites, removed sequences with less than 28 resides (2,480 ligands) to create the non-cleavage sample set

This process created 5,087 training samples: 2,607 cleavage and 2,480 non-cleavage samples.

Creating Samples from Ligands and Proteins

The C-terminus of the ligand is assumed to be a cleavage site and the midpoint between the N-terminus and C-terminus is assumed to not be a cleavage site.

Both the cleavage and non-cleavage sites are at the center position of each sample.

Format of Training Data

Each Sequence is 28 residues long, however the authors found the best performance using 20 residues.
The Activity is 1 for cleavage and -1 for no cleavage.
There are 28 * 8 = 224 features in the raw training set.



In [5]:

    
training_set = pd.DataFrame.from_csv("data/proteasomal_cleavage/s2_in_vivo_mhc_1_antijen_swiss_prot_dataset.csv")
print training_set.head(3)









    



                       Sequence  Name  Activity  VHSE11  VHSE12  VHSE13  \
0  AAAFCSAMYVGDLCGSVELVGQLFTFSP  cv-1         1    0.15   -1.11   -1.35   
1  AACHKCIDFYSRIRELRHYSDSVYGDTL  cv-2         1    0.15   -1.11   -1.35   
2  AACRAAGLQDCTMLVNGDDLVVICESAG  cv-3         1    0.15   -1.11   -1.35   

   VHSE14  VHSE15  VHSE16  VHSE17   ...     VHSE277  VHSE278  VHSE281  \
0   -0.92    0.02   -0.91    0.36   ...       -0.64     0.11     0.22   
1   -0.92    0.02   -0.91    0.36   ...       -0.79     0.39     1.36   
2   -0.92    0.02   -0.91    0.36   ...        0.36    -0.48    -0.20   

   VHSE282  VHSE283  VHSE284  VHSE285  VHSE286  VHSE287  VHSE288  
0    -0.17    -0.50     0.05    -0.01    -1.34    -0.19     3.56  
1     0.07     0.26    -0.80     0.22    -1.37     0.08    -0.62  
2    -1.53    -2.63     2.28    -0.53    -1.18     2.01    -1.34  

[3 rows x 227 columns]

Creating the Linear SVM Model

The authors measured linear, polynomial, radial basis, and sigmoid kernel and found no significant difference in performance. The linear kernel was chosen for its simplicity and interpretability. The authors did not provide the C value used in their linear model, so I used GridSearchCV to find the best value.



In [6]:

    
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFECV

def create_linear_svc_model(parameters, sequence_len = 28, use_vhse = True):
    scaler = MinMaxScaler()
    (X_train_unscaled, y_train) = dataset_to_X_y(training_set, \
                                             "Sequence", "Activity", \
                                             sequence_len = sequence_len, \
                                             use_vhse = use_vhse)
    X_train = pd.DataFrame(scaler.fit_transform(X_train_unscaled))
    parameters={'estimator__C': [pow(2, i) for i in xrange(-25, 4, 1)]}
    svc = svm.LinearSVC()
    rfe = RFECV(estimator=svc, step=.1, cv=2, scoring='accuracy', n_jobs=8)
    clf = GridSearchCV(rfe, parameters, scoring='accuracy', n_jobs=8, cv=2, verbose=1)
    clf.fit(X_train, y_train)

    # summarize results
    print("Best: %f using %s" % (clf.best_score_, clf.best_params_))
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    params = clf.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))
    #
    #svr = svm.LinearSVC()
    #clf = GridSearchCV(svr, parameters, cv=10, scoring='accuracy', n_jobs=1)
    #clf.fit(X_train, y_train)
    #print("The best parameters are %s with a score of %0.2f" \
    #      % (clf.best_params_, clf.best_score_))
    return (scaler, clf)

(vhse_scaler, vhse_model) = create_linear_svc_model(
    parameters = {'estimator__C': [pow(2, i) for i in xrange(-25, 4, 1)]},
    use_vhse = False)









    



Fitting 2 folds for each of 29 candidates, totalling 58 fits






    



/Users/matt/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
  **self._backend_args)
/Users/matt/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
  **self._backend_args)
/Users/matt/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
  **self._backend_args)
/Users/matt/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
  **self._backend_args)
/Users/matt/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
  **self._backend_args)
/Users/matt/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
  **self._backend_args)
/Users/matt/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
  **self._backend_args)
/Users/matt/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
  **self._backend_args)
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:   20.5s
[Parallel(n_jobs=8)]: Done  58 out of  58 | elapsed:  2.8min finished






    



Best: 0.801258 using {'estimator__C': 0.0009765625}
0.512483 (0.000096) with: {'estimator__C': 2.9802322387695312e-08}
0.512483 (0.000096) with: {'estimator__C': 5.960464477539063e-08}
0.512483 (0.000096) with: {'estimator__C': 1.1920928955078125e-07}
0.512483 (0.000096) with: {'estimator__C': 2.384185791015625e-07}
0.512483 (0.000096) with: {'estimator__C': 4.76837158203125e-07}
0.533517 (0.017587) with: {'estimator__C': 9.5367431640625e-07}
0.606644 (0.039590) with: {'estimator__C': 1.9073486328125e-06}
0.672105 (0.035842) with: {'estimator__C': 3.814697265625e-06}
0.720464 (0.016174) with: {'estimator__C': 7.62939453125e-06}
0.734421 (0.003394) with: {'estimator__C': 1.52587890625e-05}
0.755652 (0.010764) with: {'estimator__C': 3.0517578125e-05}
0.758404 (0.006046) with: {'estimator__C': 6.103515625e-05}
0.766464 (0.011749) with: {'estimator__C': 0.0001220703125}
0.779831 (0.014897) with: {'estimator__C': 0.000244140625}
0.790643 (0.013130) with: {'estimator__C': 0.00048828125}
0.801258 (0.011559) with: {'estimator__C': 0.0009765625}
0.798309 (0.001533) with: {'estimator__C': 0.001953125}
0.785925 (0.002513) with: {'estimator__C': 0.00390625}
0.782583 (0.000436) with: {'estimator__C': 0.0078125}
0.769609 (0.024421) with: {'estimator__C': 0.015625}
0.773737 (0.023831) with: {'estimator__C': 0.03125}
0.788284 (0.004087) with: {'estimator__C': 0.0625}
0.763318 (0.015773) with: {'estimator__C': 0.125}
0.753882 (0.012629) with: {'estimator__C': 0.25}
0.775310 (0.025010) with: {'estimator__C': 0.5}
0.777079 (0.007907) with: {'estimator__C': 1}
0.757617 (0.028552) with: {'estimator__C': 2}
0.779831 (0.005068) with: {'estimator__C': 4}
0.754669 (0.029928) with: {'estimator__C': 8}

Testing In Vivo SVM Model



In [39]:

    
def test_linear_svc_model(scaler, model, sequence_len = 28, use_vhse = True):
    testing_set = pd.DataFrame.from_csv("data/proteasomal_cleavage/s6_in_vivo_mhc_1_ligands_dataset.csv")
    (X_test_prescaled, y_test) = dataset_to_X_y(testing_set, \
                                                "Sequences", "Activity", \
                                                sequence_len = sequence_len,\
                                                use_vhse = use_vhse)
    X_test = pd.DataFrame(scaler.transform(X_test_prescaled))
    y_predicted = model.predict(X_test)
    accuracy = 100.0 * metrics.accuracy_score(y_test, y_predicted)
    ((tn, fp), (fn, tp)) = metrics.confusion_matrix(y_test, y_predicted, labels=[-1, 1])
    sensitivity = 100.0 * tp/(tp + fn)
    specificity = 100.0 * tn/(tn + fp)
    mcc = metrics.matthews_corrcoef(y_test, y_predicted)
    print "Authors reported performance"
    print "Acc: 73.5, Sen: 82.3, Spe: 64.8, MCC: 0.48"
    print "Notebook performance (sequence_len=%d, use_vhse=%s)" % (sequence_len, use_vhse)
    print "Acc: %.1f, Sen: %.1f, Spe: %.1f, MCC: %.2f" \
    %(accuracy, sensitivity, specificity, mcc)
    
test_linear_svc_model(vhse_scaler, vhse_model, use_vhse = False)

testing_set = pd.DataFrame.from_csv("data/proteasomal_cleavage/s6_in_vivo_mhc_1_ligands_dataset.csv")
(X_test_prescaled, y_test) = dataset_to_X_y(testing_set, \
                                                "Sequences", "Activity", \
                                                sequence_len = 28,\
                                                use_vhse = False)
X_test = pd.DataFrame(vhse_scaler.transform(X_test_prescaled))
poslabels = ["-%02d" % (i) for i in range(14, 0, -1)] + ["+%02d" % (i) for i in range(1,15)]
# 18 H  17 S  15 E
proplables = ["H%02d" % (i) for i in range(18)] + ["S%02d" % (i) for i in range(17)] + ["E%02d" % (i) for i in range(15)]

cols = []
for poslabel in poslabels:
    for proplable in proplables:
        cols.append("%s%s" % (poslabel, proplable))

X_test.columns = cols

for col in X_test.columns[vhse_model.best_estimator_.get_support()]:
    print col









    



Authors reported performance
Acc: 73.5, Sen: 82.3, Spe: 64.8, MCC: 0.48
Notebook performance (sequence_len=28, use_vhse=False)
Acc: 73.1, Sen: 85.1, Spe: 61.2, MCC: 0.48
-14S02
-13S10
-13E03
-13E10
-12H02
-12H13
-12S06
-12S12
-12E00
-12E04
-11S09
-11S10
-11S15
-10H11
-10H13
-10E06
-09H00
-09H03
-09H05
-09H07
-09H08
-09H10
-09H11
-09H12
-09S06
-09S12
-09E02
-09E03
-09E04
-09E06
-09E07
-09E08
-09E09
-09E10
-09E11
-08H00
-08H01
-08H03
-08H07
-08H08
-08H09
-08H10
-08H12
-08H13
-08S06
-08E01
-08E02
-08E03
-08E04
-08E05
-08E07
-08E08
-08E10
-08E13
-08E14
-07H01
-07H04
-07H11
-07H12
-07H13
-07H15
-07S12
-07S13
-07S15
-07E01
-07E02
-07E03
-07E06
-07E07
-07E13
-06H05
-06H09
-06H13
-06S05
-06S09
-06S10
-06S15
-06E00
-06E01
-06E02
-06E12
-06E13
-05S04
-05S13
-05S15
-05E02
-05E04
-04H03
-04H04
-04H12
-04S06
-04S09
-04S10
-04S15
-04S16
-04E01
-04E04
-04E06
-04E09
-04E11
-04E13
-03H00
-03H03
-03H06
-03H07
-03H17
-03S05
-03S06
-03S07
-03E01
-03E02
-03E03
-03E05
-03E06
-03E08
-03E11
-03E12
-03E13
-03E14
-02H01
-02H03
-02H07
-02H12
-02H13
-02S09
-02E00
-02E05
-02E07
-01H01
-01H03
-01H05
-01H06
-01H10
-01H11
-01H12
-01H13
-01H14
-01H15
-01H16
-01S00
-01S01
-01S02
-01S03
-01S05
-01S06
-01S08
-01S09
-01S10
-01S11
-01S12
-01S15
-01S16
-01E00
-01E01
-01E02
-01E05
-01E06
-01E07
-01E08
-01E09
-01E10
-01E11
-01E12
-01E13
-01E14
+01H02
+01H04
+01H08
+01H09
+01S04
+01S12
+01S13
+01S15
+01E00
+01E01
+01E02
+01E03
+01E05
+01E06
+01E07
+02H06
+02H08
+02S02
+02S05
+02E03
+03H00
+03H03
+03H07
+03H08
+03H09
+03H15
+03H17
+03S07
+03E00
+03E02
+03E07
+04H00
+04H06
+04H10
+04H11
+04S15
+04E10
+05H01
+05H03
+05H05
+05H10
+05H11
+05H12
+05H13
+05S00
+05S01
+05S02
+05S03
+05S05
+05S06
+05S08
+05S09
+05S10
+05S11
+05S12
+05S15
+05S16
+05E00
+05E01
+05E02
+05E05
+05E06
+05E07
+05E08
+05E09
+05E10
+05E11
+05E12
+05E14
+06H02
+06H07
+06H09
+06E07
+07S09
+07E02
+07E06
+07E11
+08H00
+08H01
+08H03
+08S09
+08S15
+08E08
+08E12
+09H07
+09S09
+09S10
+09S15
+09E02
+09E05
+10H09
+10E02
+11H00
+11H01
+11H03
+11E05
+11E06
+11E12
+12H00
+12H01
+12S02
+12S05
+12S15
+12E00
+12E13
+13H05
+13S04
+13S07
+13S15
+13E03
+13E12
+14H00
+14E04
+14E05
+14E14

Comparing Linear SVM to PAProC, FragPredict, and NetChop

Interpreting Model Weights

The VHSE1 variable at the P1 position has the largest positive weight coefficient (10.49) in line with research showing that C-terminal residues are usually hydrophobic to aid in ER transfer and binding to the MHC molecule.

There is a mostly positive and mostly negative coefficents upstream and downstream of the cleavage site respectively. This potential difference appears to be conducive to cleavage.



In [ ]:

    
#h = svr.coef_[:, 0::3]
#s = svr.coef_[:, 1::3]
#e = svr.coef_[:, 2::3]

#%matplotlib notebook

#n_groups = h.shape[1]

#fig, ax = plt.subplots(figsize=(12,9))

#index = np.arange(n_groups)
#bar_width = 0.25

#ax1 = ax.bar(index + bar_width, h.T, bar_width, label="Hydrophobic", color='b')
#ax2 = ax.bar(index, s.T, bar_width, label="Steric", color='r')
#ax3 = ax.bar(index - bar_width, e.T, bar_width, label="Electronic", color='g')

#ax.set_xlim(-bar_width,len(index)+bar_width)

#plt.xlabel('Amino Acid Position')
#plt.ylabel('SVM Coefficient Value')
#plt.title('Hydrophobic, Steric, and Electronic Effect by Amino Acid Position')
#plt.xticks(index, range (n_groups/2, 0, -1) + [str(i)+"'" for i in range (1, n_groups/2+1)])
#plt.legend()

#plt.tight_layout()
#plt.show()

PCA vs. full matrices

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components Source.
Mei et al, creators of the VHSE, applied PCA on 18 hydophobic, 17 steric, and 15 electronic properties. The first 2, 2, and 4 principle components account for 74.33, 78.68, and 77.9% of variability in original matrices.
The authors of this paper only used the first principle component from the hydrophobic, steric, and electronic matrices.
What performance would the authors have found if used the full matrices instead of PCA features?

Matrix	Features	Sensitivity	Specificity	MCC
VHSE	3x20=60	82.2	63.2	0.46
Full	50x20=1000	81.2	64.1	0.46



In [ ]:

    
# Performance with no VHSE
(no_vhse_scaler, no_vhse_model) = create_linear_svc_model(
    parameters = {'C': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]},
    use_vhse = False)
test_linear_svc_model(no_vhse_scaler, no_vhse_model, use_vhse = False)



In [ ]:

    
# Performance with more flanking residues and no VHSE
(full_flank_scaler, full_flank_model) = create_linear_svc_model(
    parameters = {'C': [0.0001, 0.003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]},
    use_vhse = False, sequence_len = 28)
test_linear_svc_model(full_flank_scaler, full_flank_model, use_vhse = False, sequence_len=28)

Chipper Implementation

Chipper notebook auto-generates training sets, e.g. v0.3.0 trained with 26,166 samples vs. 5,087 used in this study

Chipper Performance

Chipper uses full 50 properties per residue instead of VHSE parameters

Method	Sensitivity	Specificity	MCC	Notes
PAProC	45.6	30.0	-0.25
FragPredict	83.5	16.5	0.00
NetChop 1.0	39.8	46.3	-0.14
NetChop 2.0	73.6	42.4	0.16
NetChop 3.0	81.0	48.0	0.31
VHSE-based SVC	82.3	64.8	0.48	3x20=60 features
chipper VHSE (SVC)	79.3	72.6	0.520	3x20=60 features
chipper VHSE (LR)	83.2	69.2	0.529	3x20=60 features
chipper (SVC)	79.3	77.9	0.572	50x20=1000 features
chipper (LR)	87.0	74.5	0.620	50x20=1000 features
xgboost	87.0	74.5	0.620	50x20=1000 features
FANN	82.7	78.8	0.616	50x20=1000 features

Chipper LR Performance

Thank You

Source notebook: https://github.com/massie/notebooks/blob/master/VHSE-Based%20Prediction%20of%20Proteasomal%20Cleavage%20Sites.ipynb

Chipper repo: https://github.com/massie/chipper