epitopepredict API basic usage

Methodology

Predictors for each method inherit from the Predictor class and all implement a predict method for scoring a single sequence. This may wrap methods from other modules and/or call command line predictors. For example the TepitopePredictor uses the mhcpredict.tepitope module. This method should return a Pandas DataFrame. The predictProteins method is used for multiple proteins contained in a dataframe of sequences in a standard format. This is created from a genbank or fasta file (see examples below). For large numbers of sequences predictProteins you should provide a path so that the results are saved as each protein is completed to avoid memory issues, since many alleles might be called for each protein. Results are saved with one file per protein in csv format. Results can be loaded into the predictor individually or all together using the load method.



In [1]:

    
import os, math, time, pickle, subprocess
from importlib import reload
from collections import OrderedDict
import numpy as np
import pandas as pd
pd.set_option('display.width', 100)
import epitopepredict as ep
from epitopepredict import sequtils, tepitope, plotting, utilities, peptutils
from IPython.display import display, HTML, Image
%matplotlib inline
import matplotlib as mpl
import pylab as plt



In [2]:

    
#get preset alleles
m2_alleles = ep.get_preset_alleles('mhc2_supertypes')
m1_alleles = ep.get_preset_alleles('mhc1_supertypes')
print (m1_alleles)
print (m2_alleles)









    



['HLA-A*01:01', 'HLA-A*02:01', 'HLA-A*03:01', 'HLA-A*24:02', 'HLA-B*07:02', 'HLA-B*44:03']
['HLA-DRB1*01:01', 'HLA-DRB1*03:01', 'HLA-DRB1*04:01', 'HLA-DRB1*07:01', 'HLA-DRB1*08:01', 'HLA-DRB1*11:01', 'HLA-DRB1*13:01', 'HLA-DRB1*15:01']

predict random peptides



In [3]:

    
seqs = peptutils.create_random_sequences(10)
print (seqs[:5])
df = pd.DataFrame(seqs,columns=['peptide'])
P = ep.get_predictor('basicmhc1')
b = P.predict_peptides(df.peptide, alleles=m1_alleles, show_cmd=True, cpus=1)









    



['PHKDEAIYS', 'VKMVVEWMS', 'VDHPIKVQF', 'FNYSGMGSY', 'TAICDLPVI']

results are returned as dataframes, sorted by score



In [4]:

    
print (b[:5])









    



     peptide  pos  log50k     score  name       allele  rank
0  FNYSGMGSY    3  0.3656    957.25  temp  HLA-A*01:01   1.0
1  ILGMLVIRA    6  0.1113  14995.97  temp  HLA-A*01:01   2.0
2  VKMVVEWMS    1  0.1049  16071.19  temp  HLA-A*01:01   3.0
3  PHKDEAIYS    0  0.1042  16193.37  temp  HLA-A*01:01   4.0
4  TAICDLPVI    4  0.0928  18319.16  temp  HLA-A*01:01   5.0

predict n-mers from a protein sequence

This example loads protein sequences from mycobacterium tuberculosis from a genbank file



In [8]:

    
#load protein sequences into a dataframe
prots = ep.genbank_to_dataframe('../MTB-H37Rv.gb',cds=True)
prots[:5]









    Out[8]:







  
    
      
      type
      protein_id
      locus_tag
      gene
      db_xref
      product
      note
      translation
      pseudo
      start
      end
      strand
      length
      order
    
  
  
    
      2
      CDS
      CCP42723.1
      Rv0001
      dnaA
      GI:444893470
      Chromosomal replication initiator protein DnaA
      Rv0001, (MT0001, MTV029.01, P49993), len: 507 ...
      MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQRA...
      NaN
      0
      1524
      1
      507
      1
    
    
      4
      CDS
      CCP42724.1
      Rv0002
      dnaN
      GI:444893471
      DNA polymerase III (beta chain) DnaN (DNA nucl...
      Rv0002, (MTV029.02, MTCY10H4.0), len: 402 aa. ...
      MDAATTRVGLTDLTFRLLRESFADAVSWVAKNLPARPAVPVLSGVL...
      NaN
      2051
      3260
      1
      402
      2
    
    
      6
      CDS
      CCP42725.1
      Rv0003
      recF
      GI:444893472
      DNA replication and repair protein RecF (singl...
      Rv0003, (MTCY10H4.01), len: 385 aa. RecF, DNA ...
      MYVRHLGLRDFRSWACVDLELHPGRTVFVGPNGYGKTNLIEALWYS...
      NaN
      3279
      4437
      1
      385
      3
    
    
      8
      CDS
      CCP42726.1
      Rv0004
      NaN
      GI:444893473
      Conserved hypothetical protein
      Rv0004, (MTCY10H4.02), len: 187 aa. Conserved ...
      MTGSVDRPDQNRGERSMKSPGLDLVRRTLDEARAAARARGQDAGRG...
      NaN
      4433
      4997
      1
      187
      4
    
    
      10
      CDS
      CCP42727.1
      Rv0005
      gyrB
      GI:444893474
      DNA gyrase (subunit B) GyrB (DNA topoisomerase...
      Rv0005, (MTCY10H4.03), len: 675 aa. GyrB, DNA ...
      MAAQKKKAQDEYGAASITILEGLEAVRKRPGMYIGSTGERGLHHLI...
      NaN
      5239
      7267
      1
      675
      5



In [9]:

    
P = ep.get_predictor('tepitope')
mb_binders = P.predict_sequences(prots[:20], alleles=m2_alleles, cpus=1)









    



predictions done for 20 sequences in 8 alleles



In [10]:

    
mb_binders.head()









    Out[10]:







  
    
      
      peptide
      core
      pos
      score
      name
      allele
      rank
    
  
  
    
      174
      DTFVIGASNRF
      FVIGASNRF
      174
      2.00
      Rv0001
      HLA-DRB1*0101
      1.0
    
    
      175
      TFVIGASNRFA
      FVIGASNRF
      175
      2.00
      Rv0001
      HLA-DRB1*0101
      1.0
    
    
      176
      FVIGASNRFAH
      FVIGASNRF
      176
      2.00
      Rv0001
      HLA-DRB1*0101
      1.0
    
    
      45
      AWLNLVQPLTI
      LNLVQPLTI
      45
      1.55
      Rv0001
      HLA-DRB1*0101
      4.0
    
    
      46
      WLNLVQPLTIV
      LNLVQPLTI
      46
      1.55
      Rv0001
      HLA-DRB1*0101
      4.0

get promiscuous binders - those peptides above the cutoff for at least 3 alleles



In [5]:

    
pb = P.promiscuous_binders(n=3)
pb.shape









    Out[5]:





(0, 8)



In [11]:

    
print (pb[:3])









    



Empty DataFrame
Columns: [peptide, pos, name, alleles, core, score, mean, median_rank]
Index: []

plot binders in a protein sequence



In [12]:

    
#get names of proteins stored in results of predictor
print (P.get_names())
ax = plotting.plot_tracks([P],name='Rv0011c',cutoff=.94,n=2)
plt.tight_layout()
plt.savefig('mhc_rv0011c.png',dpi=150)
ax = plotting.plot_binder_map(P,name='Rv0011c',cutoff=10)
plt.savefig('mhc_rv0011c_map.png',dpi=150)









    



['Rv0001', 'Rv0002', 'Rv0003', 'Rv0004', 'Rv0005', 'Rv0006', 'Rv0007', 'Rv0008c', 'Rv0009', 'Rv0010c', 'Rv0011c', 'Rv0012', 'Rv0013', 'Rv0014c', 'Rv0015c', 'Rv0016c', 'Rv0017c', 'Rv0018c', 'Rv0019c', 'Rv0020c']
Rv0011c tepitope 82 11 93






    





<Figure size 432x288 with 0 Axes>



In [13]:

    
reload(plotting)
from bokeh.io import show, output_notebook
output_notebook()
p = plotting.bokeh_plot_tracks([P],name='Rv0011c',cutoff=.95,n=2,width=800)
show(p)









    





    
        
        Loading BokehJS ...
    






    














    



/usr/local/lib/python3.6/dist-packages/bokeh/core/json_encoder.py:80: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  elif np.issubdtype(type(obj), np.float):



In [ ]:

	type	protein_id	locus_tag	gene	db_xref	product	note	translation	pseudo	start	end	strand	length	order
2	CDS	CCP42723.1	Rv0001	dnaA	GI:444893470	Chromosomal replication initiator protein DnaA	Rv0001, (MT0001, MTV029.01, P49993), len: 507 ...	MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQRA...	NaN	0	1524	1	507	1
4	CDS	CCP42724.1	Rv0002	dnaN	GI:444893471	DNA polymerase III (beta chain) DnaN (DNA nucl...	Rv0002, (MTV029.02, MTCY10H4.0), len: 402 aa. ...	MDAATTRVGLTDLTFRLLRESFADAVSWVAKNLPARPAVPVLSGVL...	NaN	2051	3260	1	402	2
6	CDS	CCP42725.1	Rv0003	recF	GI:444893472	DNA replication and repair protein RecF (singl...	Rv0003, (MTCY10H4.01), len: 385 aa. RecF, DNA ...	MYVRHLGLRDFRSWACVDLELHPGRTVFVGPNGYGKTNLIEALWYS...	NaN	3279	4437	1	385	3
8	CDS	CCP42726.1	Rv0004	NaN	GI:444893473	Conserved hypothetical protein	Rv0004, (MTCY10H4.02), len: 187 aa. Conserved ...	MTGSVDRPDQNRGERSMKSPGLDLVRRTLDEARAAARARGQDAGRG...	NaN	4433	4997	1	187	4
10	CDS	CCP42727.1	Rv0005	gyrB	GI:444893474	DNA gyrase (subunit B) GyrB (DNA topoisomerase...	Rv0005, (MTCY10H4.03), len: 675 aa. GyrB, DNA ...	MAAQKKKAQDEYGAASITILEGLEAVRKRPGMYIGSTGERGLHHLI...	NaN	5239	7267	1	675	5

	peptide	core	pos	score	name	allele	rank
174	DTFVIGASNRF	FVIGASNRF	174	2.00	Rv0001	HLA-DRB1*0101	1.0
175	TFVIGASNRFA	FVIGASNRF	175	2.00	Rv0001	HLA-DRB1*0101	1.0
176	FVIGASNRFAH	FVIGASNRF	176	2.00	Rv0001	HLA-DRB1*0101	1.0
45	AWLNLVQPLTI	LNLVQPLTI	45	1.55	Rv0001	HLA-DRB1*0101	4.0
46	WLNLVQPLTIV	LNLVQPLTI	46	1.55	Rv0001	HLA-DRB1*0101	4.0