PyGraphistry Tutorial: Visualize Protein Interactions From BioGrid

The BioGrid dataset used here covers over 600,000 interactions across 50,000 proteins!

Notes

This notebook automatically downloads about 200 MB of BioGrid data. If you are going to run this notebook more than once, we recommend manually downloading the data and saving it to disk. To do so, unzip the two files and place their contents in pygraphistry/demos/data.
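
One way to avoid repeated downloads is to cache each file locally the first time it is needed. The sketch below assumes Python 3, and `fetch_once` is a hypothetical helper, not part of PyGraphistry or pandas. It downloads a URL into a data directory only if a file of the same name is not already there, and returns the local path.

```python
import os
import urllib.request


def fetch_once(url, data_dir="./data"):
    """Download url into data_dir unless a file of that name already exists.

    Hypothetical helper for illustration; returns the local file path.
    """
    os.makedirs(data_dir, exist_ok=True)
    local_path = os.path.join(data_dir, url.rsplit("/", 1)[-1])
    if not os.path.exists(local_path):
        # Only hit the network the first time the file is requested.
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

The returned path could then be passed to `pandas.read_table` in place of the URL, for example `pandas.read_table(fetch_once(url1), na_values=['-'], engine='c', compression='gzip')`. Note that this keeps the file gzipped on disk rather than unzipping it.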


In [1]:
import pandas
import graphistry
graphistry.register(api=2)
#graphistry.register(key='MY_API_KEY', server='labs.graphistry.com', api=2) #https://www.graphistry.com/api-request

Load Protein Interactions

Select the columns of interest and drop rows with missing values.


In [2]:
url1 = 'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-ALL-3.3.123.tab2.txt.gz'
rawdata = pandas.read_table(url1, na_values=['-'], engine='c', compression='gzip')

# If using local data, comment out the two lines above and uncomment the line below
# rawdata = pandas.read_table('./data/BIOGRID-ALL-3.3.123.tab2.txt', na_values=['-'], engine='c')

cols = ['BioGRID ID Interactor A', 'BioGRID ID Interactor B', 'Official Symbol Interactor A', 
        'Official Symbol Interactor B', 'Pubmed ID', 'Author', 'Throughput']
interactions = rawdata[cols].dropna()
interactions[:3]


/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (19,20) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[2]:
BioGRID ID Interactor A BioGRID ID Interactor B Official Symbol Interactor A Official Symbol Interactor B Pubmed ID Author Throughput
0 112315 108607 MAP2K4 FLNC 9006895 Marti A (1997) Low Throughput
1 124185 106603 MYPN ACTN2 11309420 Bang ML (2001) Low Throughput
2 106605 108625 ACVR1 FNTA 8599089 Wang T (1996) Low Throughput

Let's have a quick peek at the data

Bind the columns storing the source/destination of each edge. This is the bare minimum to create a visualization.


In [3]:
g = graphistry.bind(source="BioGRID ID Interactor A", destination="BioGRID ID Interactor B")
g.plot(interactions.sample(10000))


Out[3]:
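
The quick peek above plots a random subset of 10,000 edges, so each run shows a different graph. If a reproducible subset is preferred, a fixed seed can be passed to `DataFrame.sample`. The sketch below uses a made-up five-row edge table (the `src` and `dst` columns are illustrative only) and also caps the sample size at the table length, since sampling more rows than exist without replacement raises an error.

```python
import pandas

# Toy edge list standing in for the interactions table.
edges = pandas.DataFrame({"src": range(5), "dst": range(5, 10)})

# Never ask for more rows than the table holds.
n = min(10000, len(edges))

# A fixed random_state makes the same rows come back on every run.
subset = edges.sample(n=n, random_state=0)
```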

A Fancier Visualization With Custom Labels and Colors

Let's look up the name and organism of each protein in the BioGrid identification DB.


In [4]:
# This downloads 170 MB, it might take some time.
url2 = 'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-IDENTIFIERS-3.3.123.tab.txt.gz'
raw_proteins = pandas.read_table(url2, na_values=['-'], engine='c', compression='gzip')

# If using local data, comment out the two lines above and uncomment the line below
# raw_proteins = pandas.read_table('./data/BIOGRID-IDENTIFIERS-3.3.123.tab.txt', na_values=['-'], engine='c')


protein_ids = raw_proteins[['BIOGRID_ID', 'ORGANISM_OFFICIAL_NAME']].drop_duplicates() \
                          .rename(columns={'ORGANISM_OFFICIAL_NAME': 'ORGANISM'})
protein_ids[:3]


Out[4]:
BIOGRID_ID ORGANISM
0 1 Arabidopsis thaliana
7 2 Arabidopsis thaliana
22 3 Arabidopsis thaliana

We extract the proteins referenced as either sources or targets of interactions.


In [5]:
source_proteins = interactions[["BioGRID ID Interactor A", "Official Symbol Interactor A"]].copy() \
                              .rename(columns={'BioGRID ID Interactor A': 'BIOGRID_ID', 
                                               'Official Symbol Interactor A': 'SYMBOL'})

target_proteins = interactions[["BioGRID ID Interactor B", "Official Symbol Interactor B"]].copy() \
                              .rename(columns={'BioGRID ID Interactor B': 'BIOGRID_ID', 
                                               'Official Symbol Interactor B': 'SYMBOL'}) 

all_proteins = pandas.concat([source_proteins, target_proteins], ignore_index=True).drop_duplicates()
all_proteins[:3]


Out[5]:
BIOGRID_ID SYMBOL
0 112315 MAP2K4
1 124185 MYPN
2 106605 ACVR1

We join on the identification DB to get the organism to which each protein belongs.


In [6]:
protein_labels = pandas.merge(all_proteins, protein_ids, how='left', on='BIOGRID_ID')
protein_labels[:3]


Out[6]:
BIOGRID_ID SYMBOL ORGANISM
0 112315 MAP2K4 Homo sapiens
1 124185 MYPN Homo sapiens
2 106605 ACVR1 Homo sapiens
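
Because this is a left join, any protein whose ID is absent from the identification DB is kept with a missing ORGANISM. A quick way to check for such rows is the `indicator` option of `pandas.merge`. The sketch below uses small toy tables; the third protein (ID 999999, symbol UNKNOWN) is made up for illustration.

```python
import pandas

# Toy stand-ins for all_proteins and protein_ids; the last protein is invented.
all_proteins = pandas.DataFrame({"BIOGRID_ID": [112315, 124185, 999999],
                                 "SYMBOL": ["MAP2K4", "MYPN", "UNKNOWN"]})
protein_ids = pandas.DataFrame({"BIOGRID_ID": [112315, 124185],
                                "ORGANISM": ["Homo sapiens", "Homo sapiens"]})

# indicator=True adds a "_merge" column; for a left join each row is
# marked "both" (match found) or "left_only" (no match on the right).
checked = pandas.merge(all_proteins, protein_ids, how="left",
                       on="BIOGRID_ID", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
```

An empty `unmatched` table would mean every protein found its organism.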

We assign colors to proteins based on their organism.


In [7]:
colors = protein_labels.ORGANISM.unique().tolist()
protein_labels['Color'] = protein_labels.ORGANISM.map(lambda x: colors.index(x))
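
The cell above gives each organism an integer code equal to its position in the list of unique names. Since `colors.index(x)` scans that list for every row, a dictionary lookup may be faster on large tables. The sketch below is one possible alternative in plain Python, assuming every name to encode appears in the input; `encode_categories` and `color_of` are hypothetical names.

```python
def encode_categories(values):
    """Map each distinct value to an integer code in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
    return codes


organisms = ["Homo sapiens", "Mus musculus", "Homo sapiens", "Arabidopsis thaliana"]
color_of = encode_categories(organisms)

# Look up the code for each row in constant time.
colors = [color_of[name] for name in organisms]
```

In the notebook, the resulting dictionary could be applied with `protein_labels.ORGANISM.map(color_of)`.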

For convenience, let's add links to PubMed and RCSB.


In [8]:
def makeRcsbLink(symbol):
    # Non-string values (such as NaN) are rendered as 'n/a'.
    if isinstance(symbol, str):
        url = 'http://www.rcsb.org/pdb/gene/' + symbol.upper()
        return '<a target="_blank" href="%s">%s</a>' % (url, symbol.upper())
    else:
        return 'n/a'
    
protein_labels.SYMBOL = protein_labels.SYMBOL.map(makeRcsbLink)
protein_labels[:3]


Out[8]:
BIOGRID_ID SYMBOL ORGANISM Color
0 112315 <a target="_blank" href="http://www.rcsb.org/p... Homo sapiens 0
1 124185 <a target="_blank" href="http://www.rcsb.org/p... Homo sapiens 0
2 106605 <a target="_blank" href="http://www.rcsb.org/p... Homo sapiens 0

In [9]:
def makePubmedLink(pubmed_id):
    # Wrap the PubMed ID in a link to its PubMed search page.
    url = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=%s' % pubmed_id
    return '<a target="_blank" href="%s">%s</a>' % (url, pubmed_id)

interactions['Pubmed ID'] = interactions['Pubmed ID'].map(makePubmedLink)
interactions[:3]


Out[9]:
BioGRID ID Interactor A BioGRID ID Interactor B Official Symbol Interactor A Official Symbol Interactor B Pubmed ID Author Throughput
0 112315 108607 MAP2K4 FLNC <a target="_blank" href="http://www.ncbi.nlm.n... Marti A (1997) Low Throughput
1 124185 106603 MYPN ACTN2 <a target="_blank" href="http://www.ncbi.nlm.n... Bang ML (2001) Low Throughput
2 106605 108625 ACVR1 FNTA <a target="_blank" href="http://www.ncbi.nlm.n... Wang T (1996) Low Throughput
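
Both `makeRcsbLink` and `makePubmedLink` interpolate raw values straight into HTML. If a value ever contained characters such as `&`, `"` or a space, the URL or markup could come out malformed. The sketch below shows one way to guard against that with the Python 3 standard library; `make_link` is a hypothetical helper, not something the notebook or PyGraphistry provides.

```python
from html import escape
from urllib.parse import quote


def make_link(base_url, value):
    """Return an HTML anchor to base_url + value that opens in a new tab."""
    text = str(value)
    # Percent-encode the value for the URL, then HTML-escape both parts.
    url = base_url + quote(text, safe="")
    return '<a target="_blank" href="%s">%s</a>' % (escape(url, quote=True), escape(text))


pubmed = make_link("http://www.ncbi.nlm.nih.gov/pubmed/?term=", 9006895)
```

For plain numeric IDs such as the one above, the result is the same markup the original helper produces.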

Plotting

We bind columns to labels and colors and we are good to go.


In [10]:
# This will upload ~10MB of data, be patient!
g2 = g.bind(node='BIOGRID_ID', edge_title='Author', point_title='SYMBOL', point_color='Color')
g2.plot(interactions, protein_labels)


Uploading 7139 kB. This may take a while...
Out[10]:
