That is over 600,000 interactions across 50,000 proteins!
This notebook automatically downloads about 200 MB of BioGRID data. If you are going to run this notebook more than once, we recommend manually downloading and saving the data to disk. To do so, unzip the two files and place their contents in pygraphistry/demos/data.
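If you would rather script that one-time download, here is a minimal sketch (assuming Python 3, and using the same two S3 URLs as the cells below; pandas can also read the .gz archives directly, so unzipping is optional if you point the commented read_table lines at the .gz file names):
import os
import urllib.request

# Fetch both BioGRID archives once and cache them under ./data.
urls = [
    'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-ALL-3.3.123.tab2.txt.gz',
    'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-IDENTIFIERS-3.3.123.tab.txt.gz',
]
os.makedirs('./data', exist_ok=True)
for url in urls:
    target = os.path.join('./data', os.path.basename(url))
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)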
In [1]:
import pandas
import graphistry
graphistry.register(api=2)
#graphistry.register(key='MY_API_KEY', server='labs.graphistry.com', api=2) #https://www.graphistry.com/api-request
In [2]:
url1 = 'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-ALL-3.3.123.tab2.txt.gz'
rawdata = pandas.read_table(url1, na_values=['-'], engine='c', compression='gzip')
# If using local data, comment the two lines above and uncomment the line below
# rawdata = pandas.read_table('./data/BIOGRID-ALL-3.3.123.tab2.txt', na_values=['-'], engine='c')
cols = ['BioGRID ID Interactor A', 'BioGRID ID Interactor B', 'Official Symbol Interactor A',
        'Official Symbol Interactor B', 'Pubmed ID', 'Author', 'Throughput']
interactions = rawdata[cols].dropna()
interactions[:3]
Out[2]:
In [3]:
g = graphistry.bind(source="BioGRID ID Interactor A", destination="BioGRID ID Interactor B")
g.plot(interactions.sample(10000))
Out[3]:
In [4]:
# This downloads about 170 MB; it might take some time.
url2 = 'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-IDENTIFIERS-3.3.123.tab.txt.gz'
raw_proteins = pandas.read_table(url2, na_values=['-'], engine='c', compression='gzip')
# If using local data, comment the two lines above and uncomment the line below
# raw_proteins = pandas.read_table('./data/BIOGRID-IDENTIFIERS-3.3.123.tab.txt', na_values=['-'], engine='c')
protein_ids = raw_proteins[['BIOGRID_ID', 'ORGANISM_OFFICIAL_NAME']].drop_duplicates() \
                           .rename(columns={'ORGANISM_OFFICIAL_NAME': 'ORGANISM'})
protein_ids[:3]
Out[4]:
We extract the proteins referenced as either sources or targets of interactions.
In [5]:
source_proteins = interactions[["BioGRID ID Interactor A", "Official Symbol Interactor A"]].copy() \
                              .rename(columns={'BioGRID ID Interactor A': 'BIOGRID_ID',
                                               'Official Symbol Interactor A': 'SYMBOL'})
target_proteins = interactions[["BioGRID ID Interactor B", "Official Symbol Interactor B"]].copy() \
                              .rename(columns={'BioGRID ID Interactor B': 'BIOGRID_ID',
                                               'Official Symbol Interactor B': 'SYMBOL'})
all_proteins = pandas.concat([source_proteins, target_proteins], ignore_index=True).drop_duplicates()
all_proteins[:3]
Out[5]:
We join on the identification DB to get the organism to which each protein belongs.
In [6]:
protein_labels = pandas.merge(all_proteins, protein_ids, how='left', left_on='BIOGRID_ID', right_on='BIOGRID_ID')
protein_labels[:3]
Out[6]:
We assign colors to proteins based on their organism.
In [7]:
# Assign each organism an integer index; Graphistry maps these indices to colors.
organisms = protein_labels.ORGANISM.unique().tolist()
protein_labels['Color'] = protein_labels.ORGANISM.map(lambda x: organisms.index(x))
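As an aside, list.index rescans the organism list for every row. An optional one-pass equivalent (same order-of-appearance coding, assuming no missing ORGANISM values) uses pandas.factorize:
# Alternative: factorize returns an integer code per row plus the unique values.
# Rows with a missing ORGANISM would get code -1 here instead of a list index.
codes, organisms = pandas.factorize(protein_labels.ORGANISM)
protein_labels['Color'] = codes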
For convenience, let's add links to PubMed and RCSB.
In [8]:
def makeRcsbLink(id):
    if isinstance(id, str):
        url = 'http://www.rcsb.org/pdb/gene/' + id.upper()
        return '<a target="_blank" href="%s">%s</a>' % (url, id.upper())
    else:
        return 'n/a'
protein_labels.SYMBOL = protein_labels.SYMBOL.map(makeRcsbLink)
protein_labels[:3]
Out[8]:
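For instance, the helper turns a gene symbol into an RCSB link, while any non-string value (such as NaN) falls back to 'n/a':
# Example call (hypothetical symbol):
makeRcsbLink('tp53')
# -> '<a target="_blank" href="http://www.rcsb.org/pdb/gene/TP53">TP53</a>'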
In [9]:
def makePubmedLink(id):
    url = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=%s' % id
    return '<a target="_blank" href="%s">%s</a>' % (url, id)
interactions['Pubmed ID'] = interactions['Pubmed ID'].map(makePubmedLink)
interactions[:3]
Out[9]:
In [10]:
# This will upload ~10 MB of data; be patient!
g2 = g.bind(node='BIOGRID_ID', edge_title='Author', point_title='SYMBOL', point_color='Color')
g2.plot(interactions, protein_labels)
Out[10]:
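If the full upload is too slow on your connection, the same bindings also work on a sample, as in the earlier cell (note that the whole protein_labels node table is still uploaded):
# Optional: plot a random subset of the interactions instead of all of them.
g2.plot(interactions.sample(100000), protein_labels)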
In [ ]: