That is over 600,000 interactions across 50,000 proteins!
This notebook automatically downloads about 200 MB of BioGRID data. If you are going to run this notebook more than once, we recommend manually downloading and saving the data to disk. To do so, unzip the two files and place their contents in pygraphistry/demos/data.
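If you would rather script that one-time download, here is a minimal sketch (assuming Python 3, and using the same two S3 URLs as the cells below; pandas can also read the .gz archives directly, so unzipping is optional if you point the commented read_table lines at the .gz file names):
import os
import urllib.request

# Fetch both BioGRID archives once and cache them under ./data.
urls = [
    'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-ALL-3.3.123.tab2.txt.gz',
    'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-IDENTIFIERS-3.3.123.tab.txt.gz',
]
os.makedirs('./data', exist_ok=True)
for url in urls:
    target = os.path.join('./data', os.path.basename(url))
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)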
In [1]:
import pandas
import graphistry
graphistry.register(api=2)
#graphistry.register(key='MY_API_KEY', server='labs.graphistry.com', api=2) #https://www.graphistry.com/api-request
In [2]:
url1 = 'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-ALL-3.3.123.tab2.txt.gz'
rawdata = pandas.read_table(url1, na_values=['-'], engine='c', compression='gzip')
# If using local data, comment the two lines above and uncomment the line below
# rawdata = pandas.read_table('./data/BIOGRID-ALL-3.3.123.tab2.txt', na_values=['-'], engine='c')
cols = ['BioGRID ID Interactor A', 'BioGRID ID Interactor B', 'Official Symbol Interactor A',
        'Official Symbol Interactor B', 'Pubmed ID', 'Author', 'Throughput']
interactions = rawdata[cols].dropna()
interactions[:3]
Out[2]:
In [3]:
g = graphistry.bind(source="BioGRID ID Interactor A", destination="BioGRID ID Interactor B")
g.plot(interactions.sample(10000))
Out[3]:
In [4]:
# This downloads about 170 MB; it might take some time.
url2 = 'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-IDENTIFIERS-3.3.123.tab.txt.gz'
raw_proteins = pandas.read_table(url2, na_values=['-'], engine='c', compression='gzip')
# If using local data, comment the two lines above and uncomment the line below
# raw_proteins = pandas.read_table('./data/BIOGRID-IDENTIFIERS-3.3.123.tab.txt', na_values=['-'], engine='c')
protein_ids = raw_proteins[['BIOGRID_ID', 'ORGANISM_OFFICIAL_NAME']].drop_duplicates() \
                           .rename(columns={'ORGANISM_OFFICIAL_NAME': 'ORGANISM'})
protein_ids[:3]
Out[4]:
We extract the proteins referenced as either sources or targets of interactions.
In [5]:
source_proteins = interactions[["BioGRID ID Interactor A", "Official Symbol Interactor A"]].copy() \
                              .rename(columns={'BioGRID ID Interactor A': 'BIOGRID_ID',
                                               'Official Symbol Interactor A': 'SYMBOL'})
target_proteins = interactions[["BioGRID ID Interactor B", "Official Symbol Interactor B"]].copy() \
                              .rename(columns={'BioGRID ID Interactor B': 'BIOGRID_ID',
                                               'Official Symbol Interactor B': 'SYMBOL'})
all_proteins = pandas.concat([source_proteins, target_proteins], ignore_index=True).drop_duplicates()
all_proteins[:3]
Out[5]:
We join on the identification DB to get the organism to which each protein belongs.
In [6]:
protein_labels = pandas.merge(all_proteins, protein_ids, how='left', left_on='BIOGRID_ID', right_on='BIOGRID_ID')
protein_labels[:3]
Out[6]:
We assign colors to proteins based on their organism.
In [7]:
# Assign each organism an integer index; Graphistry maps these indices to colors.
organisms = protein_labels.ORGANISM.unique().tolist()
protein_labels['Color'] = protein_labels.ORGANISM.map(lambda x: organisms.index(x))
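As an aside, list.index rescans the organism list for every row. An optional one-pass equivalent (same order-of-appearance coding, assuming no missing ORGANISM values) uses pandas.factorize:
# Alternative: factorize returns an integer code per row plus the unique values.
# Rows with a missing ORGANISM would get code -1 here instead of a list index.
codes, organisms = pandas.factorize(protein_labels.ORGANISM)
protein_labels['Color'] = codes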
For convenience, let's add links to PubMed and RCSB.
In [8]:
def makeRcsbLink(id):
    if isinstance(id, str):
        url = 'http://www.rcsb.org/pdb/gene/' + id.upper()
        return '<a target="_blank" href="%s">%s</a>' % (url, id.upper())
    else:
        return 'n/a'
protein_labels.SYMBOL = protein_labels.SYMBOL.map(makeRcsbLink)
protein_labels[:3]
Out[8]:
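For instance, the helper turns a gene symbol into an RCSB link, while any non-string value (such as NaN) falls back to 'n/a':
# Example call (hypothetical symbol):
makeRcsbLink('tp53')
# -> '<a target="_blank" href="http://www.rcsb.org/pdb/gene/TP53">TP53</a>'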
In [9]:
def makePubmedLink(id):
    url = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=%s' % id
    return '<a target="_blank" href="%s">%s</a>' % (url, id)
interactions['Pubmed ID'] = interactions['Pubmed ID'].map(makePubmedLink)
interactions[:3]
Out[9]:
In [10]:
# This will upload ~10 MB of data; be patient!
g2 = g.bind(node='BIOGRID_ID', edge_title='Author', point_title='SYMBOL', point_color='Color')
g2.plot(interactions, protein_labels)
Out[10]:
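If the full upload is too slow on your connection, the same bindings also work on a sample, as in the earlier cell (note that the whole protein_labels node table is still uploaded):
# Optional: plot a random subset of the interactions instead of all of them.
g2.plot(interactions.sample(100000), protein_labels)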
In [ ]: