epitopepredict advanced analyses

References:

Paul, S. et al., 2015. Development and validation of a broad scheme for prediction of HLA class II restricted T cell epitopes. Journal of Immunological Methods, 422, pp.28–34. Available at: http://dx.doi.org/10.1016/j.jim.2015.03.022.
Terry, F.E. et al., 2014. Time for T? Immunoinformatics addresses vaccine design for neglected tropical and emerging infectious diseases. Expert review of vaccines, 9, pp.1–15. Available at: http://www.ncbi.nlm.nih.gov/pubmed/25193104.



In [6]:

    
import numpy as np
import pandas as pd
pd.set_option('display.width', 100)
pd.set_option('max_colwidth', 80)
%matplotlib inline
import matplotlib as mpl
import seaborn as sns
sns.set_context("notebook", font_scale=1.4)
from IPython.display import display, HTML

import epitopepredict as ep
from epitopepredict import base, sequtils, analysis



In [7]:

    
genbankfile = '../testing/zaire-ebolavirus.gb'
zaire = sequtils.genbank_to_dataframe(genbankfile, cds=True)
savepath = 'test_iedbmhc1'
Pi = ep.get_predictor('iedbmhc1')
alleles = ep.mhc1_supertypes
Pi.predictProteins(zaire,length=11,alleles=alleles,path=savepath,overwrite=False)
Pi.load(path=savepath)









    



predictions done for 9 proteins in 6 alleles
results saved to /home/farrell/gitprojects/epitopepredict/examples/test_iedbmhc1

Binder clustering for finding regions with possible epitopes

One application of immunoinformatics is to screen out likely candidate antigens from the genome for further study. To do this using epitope prediction requires filtering the many potential binders in each protein. There is no theoretical basis for doing this.

Epitope clustering has been previously observed to be an indicator of T cell epitope regions. The findClusters method in the analysis module allows automatic cluster detection from a set of predicted binders from one or more proteins. It can be done for a whole genome.

The result is a table of sequence regions with the number of binders and density of epitope cluster.



In [8]:

    
#find clusters of binders in these results
pb = Pi.promiscuousBinders(cutoff=5,n=2)
cl = analysis.find_clusters(pb, dist=10, min_size=3, genome=zaire)
display(cl[:20])









    



40 clusters found in 7 proteins






    






  
    
      
      name
      start
      end
      binders
      clustersize
      gene
      peptide
    
  
  
    
      0
      ZEBOVgp7
      1451
      1481
      7
      30
      L
      FTTHFLTYPKIGLLYSFGAFVSYYLGNTIL
    
    
      1
      ZEBOVgp1
      176
      215
      6
      39
      NP
      QVHAEQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLI
    
    
      2
      ZEBOVgp7
      95
      118
      5
      23
      L
      CQQFLDEIIKYTMQDALFLKYYL
    
    
      3
      ZEBOVgp7
      1962
      1987
      5
      25
      L
      VFLSDTEGMLWLNDNLAPFFATGYL
    
    
      4
      ZEBOVgp5
      224
      249
      4
      25
      VP30
      ALWQQWDRQSLIMFITAFLNIALQL
    
    
      5
      ZEBOVgp4.2
      15
      35
      4
      20
      GP
      RTSFFLWVIILFQRTFSIPL
    
    
      6
      ZEBOVgp7
      139
      158
      3
      19
      L
      IQGNEFLHQMFFWYDLAIL
    
    
      7
      ZEBOVgp1
      142
      160
      3
      18
      NP
      TEANAGQFLSFASLFLPK
    
    
      8
      ZEBOVgp4.2
      225
      252
      3
      27
      GP
      GTNETEYLFEVDNLTYVQLESRFTPQF
    
    
      9
      ZEBOVgp3
      165
      186
      3
      21
      VP40
      VQLPQYFTFDLTALKLITQPL
    
    
      10
      ZEBOVgp7
      1625
      1642
      3
      17
      L
      YTCKSTASNFFHASLAY
    
    
      11
      ZEBOVgp7
      393
      419
      3
      26
      L
      TVLKALRPIVIFETYCVFKYSIAKHY
    
    
      12
      ZEBOVgp7
      1521
      1533
      2
      12
      L
      RLMSIDPHFSIY
    
    
      13
      ZEBOVgp7
      295
      312
      2
      17
      L
      KFLEPLCLAKIQLCSKY
    
    
      14
      ZEBOVgp7
      1547
      1562
      2
      15
      L
      RLFLRTSISSFLTFV
    
    
      15
      ZEBOVgp7
      1795
      1816
      2
      21
      L
      SVIDTTVYCRFTGIVSSMHYK
    
    
      16
      ZEBOVgp7
      1074
      1088
      2
      14
      L
      RKITLQRWSLWFSY
    
    
      17
      ZEBOVgp7
      941
      953
      2
      12
      L
      MIEMDDLFLPLI
    
    
      18
      ZEBOVgp7
      901
      914
      2
      13
      L
      ALAVPQVLGGLSF
    
    
      19
      ZEBOVgp7
      858
      874
      2
      16
      L
      ITAAFHTFFSVRILQY

plot regions on sequence

We can use this to see where our clusters are in the sequence, for exmaple



In [12]:

    
name = 'ZEBOVgp6'
#ep.plotting.plot_multiple([Pi], name, regions=cl, n=2)

overlaps



In [13]:

    
Pn = ep.get_predictor('netmhciipan')
savepath2='test_netmhciipan'
alleles2 = ep.mhc2_supertypes
Pn.predictProteins(zaire,length=11,alleles=alleles2,path=savepath2,overwrite=False)
Pn.load(path=savepath2)

cl = analysis.get_overlaps(cl,Pn.promiscuousBinders(n=2,cutoff=5),label='mhc2_ovlp')

#plot both sets of binders and overlay region of cluster in previous data
ax = ep.plot_tracks([Pi,Pn],name=name,legend=True,figsize=(14,4),n=2)
r = cl[cl.name==name]
print r
coords = (list(r.start),list(r.end-r.start))
coords = zip(*coords)
ep.plot_regions(coords, ax, color='gray')









    



/home/farrell/gitprojects/epitopepredict/epitopepredict/base.py:826: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  df = df.convert_objects(convert_numeric=True)
/home/farrell/gitprojects/epitopepredict/epitopepredict/analysis.py:165: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[label] = df.apply(lambda r: overlap(r,found),axis=1)






    



predictions done for 9 proteins in 7 alleles
results saved to /home/farrell/gitprojects/epitopepredict/examples/test_netmhciipan
17 with overlapping binders
        name  start  end  binders  clustersize  gene                peptide  mhc2_ovlp
28  ZEBOVgp6    142  158        2           16  VP24       EQLSLKMLSLIRSNIL          2
29  ZEBOVgp6     86  107        2           21  VP24  IESPLWALRVILAAGIQDQLI          0
30  ZEBOVgp6     68   80        2           12  VP24           WSMTRNLFPHLF          0

conservation analysis

Conservation of epitopes may be important across strains or species.



In [14]:

    
reload(ep.base)
reload(analysis)
reload(sequtils)

name = 'ZEBOVgp2'
proteinseq = zaire[zaire.locus_tag==name].translation.iloc[0]
#print proteinseq
#print pb
seqs = pb.peptide

#provide a list of seqs (e.g. strains)
filename='ebola_VP35.fa'
r = sequtils.fetch_protein_sequences('Filovirus[Orgn] AND VP35[Gene]', filename=filename)
#align fasta sequences
aln = sequtils.muscle_alignment(filename)
#sequtils.showAlignment(aln)
alnrows = sequtils.alignment_to_dataframe(aln)

#print (sequtils.formatAlignment(aln))
#print alnrows[:25][['accession','definition','perc_ident']]
c = analysis.epitope_conservation(seqs, alnrows=alnrows)









    



25 non-redundant sequences retrieved

use a local blast database to find orthologs and conservation of your epitopes

Blast the protein sequence locally to get an alignment with any orthologs, then perform the analysis as above. In this case we have created a blast database of viral refseq proteins locally. Online blast is generally too slow for such an analysis, especially if we want to examine epitopes in many proteins.



In [19]:

    
#localdb must be present in your file system, in the case it's in a folder called db
localdb = '../db/viral_refseq'
blr = analysis.get_orthologs(proteinseq, db=localdb)
alnrows, aln = analysis.align_blast_results(blr)
alnrows.to_csv('%s_aligned.csv' %name)
print alnrows
#now we run the same analysis using the set of refseq orthologs
c = analysis.epitope_conservation(seqs, alnrows=alnrows)









    



running blast..
7 hits, 7 filtered
          expect  identity       accession                                          definition  \
0   0.000000e+00       260  YP_003815424.1  polymerase complex protein [Tai Forest ebolavirus]   
1  8.425780e-175       254  YP_003815433.1       polymerase complex protein [Bundibugyo virus]   
3  1.446050e-155       222     YP_138521.1       polymerase complex protein [Sudan ebolavirus]   
2  5.109200e-161       219     NP_690581.1      polymerase complex protein [Reston ebolavirus]   
4  3.717640e-106       170  YP_004928136.1      polymerase complex protein [Lloviu cuevavirus]   
5   4.452270e-62       119  YP_009055223.1   polymerase complex protein [Marburg marburgvirus]   
6   2.145860e-61       118  YP_001531154.1   polymerase complex protein [Marburg marburgvirus]   

   perc_ident                                                                              seq  
0   76.470588  ISTRAAAINDPSLPIRNQCTRGPELSGWISEQLMTGKIPVHEI--FNDTEPHISSGSD---CLPRPKNTAPRTRNT...  
1   74.705882  -------------TTTGTRSCGPELSGWISEQLMTGKIPITDI--FNEIETLPSISPS---IHSKIKTPSVQTRSV...  
3   65.294118  ------------------RHHGPEVSGWFSEQLMTGKIPLTEV--FVDVENKPSPAPI---TIISKNPKTTRKSDK...  
2   64.411765  ---------------------GPETTGWISEQLMTGKIPVTDI--FIDIDNKPDQMEV---RLKPSSRSSTRTCTS...  
4   50.000000  ----------------------------VTDAIMTGRLSLDTIPGFQCSAPTSRSRPTGRKCPPWEGPLIEPKTSA...  
5   35.000000  ----------------------------VSEGLMTGKVPIDQVFGTNPLEKLYKRRKP----KGTVGLQCSPCLIS...  
6   34.705882  ----------------------------VSEGLMTGKVPIDQVFGANPLEKLYKRRKP----KGTVGLQCSPCLMS...



In [ ]:

	name	start	end	binders	clustersize	gene	peptide
0	ZEBOVgp7	1451	1481	7	30	L	FTTHFLTYPKIGLLYSFGAFVSYYLGNTIL
1	ZEBOVgp1	176	215	6	39	NP	QVHAEQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLI
2	ZEBOVgp7	95	118	5	23	L	CQQFLDEIIKYTMQDALFLKYYL
3	ZEBOVgp7	1962	1987	5	25	L	VFLSDTEGMLWLNDNLAPFFATGYL
4	ZEBOVgp5	224	249	4	25	VP30	ALWQQWDRQSLIMFITAFLNIALQL
5	ZEBOVgp4.2	15	35	4	20	GP	RTSFFLWVIILFQRTFSIPL
6	ZEBOVgp7	139	158	3	19	L	IQGNEFLHQMFFWYDLAIL
7	ZEBOVgp1	142	160	3	18	NP	TEANAGQFLSFASLFLPK
8	ZEBOVgp4.2	225	252	3	27	GP	GTNETEYLFEVDNLTYVQLESRFTPQF
9	ZEBOVgp3	165	186	3	21	VP40	VQLPQYFTFDLTALKLITQPL
10	ZEBOVgp7	1625	1642	3	17	L	YTCKSTASNFFHASLAY
11	ZEBOVgp7	393	419	3	26	L	TVLKALRPIVIFETYCVFKYSIAKHY
12	ZEBOVgp7	1521	1533	2	12	L	RLMSIDPHFSIY
13	ZEBOVgp7	295	312	2	17	L	KFLEPLCLAKIQLCSKY
14	ZEBOVgp7	1547	1562	2	15	L	RLFLRTSISSFLTFV
15	ZEBOVgp7	1795	1816	2	21	L	SVIDTTVYCRFTGIVSSMHYK
16	ZEBOVgp7	1074	1088	2	14	L	RKITLQRWSLWFSY
17	ZEBOVgp7	941	953	2	12	L	MIEMDDLFLPLI
18	ZEBOVgp7	901	914	2	13	L	ALAVPQVLGGLSF
19	ZEBOVgp7	858	874	2	16	L	ITAAFHTFFSVRILQY