Author: Charles Tapley Hoyt
This notebook outlines a simple way to explore the citations, authors, and provenance information in a graph and its subgraphs.
In [1]:
import itertools as itt
import os
import time
from collections import defaultdict, Counter
from operator import itemgetter
import pandas as pd
import pybel
import pybel_tools as pbt
from pybel.constants import *
In [2]:
time.asctime()
Out[2]:
In [3]:
pybel.__version__
Out[3]:
In [4]:
pbt.__version__
Out[4]:
To make this notebook interoperable across many machines, the locations of the repositories containing the data used in this notebook are read from environment variables, set in ~/.bashrc
to point to the place where the repositories have been cloned. Assuming the repositories have been git clone
'd into the ~/dev
folder, the entries in ~/.bashrc
should look like:
...
export BMS_BASE=~/dev/bms
...
The biological model store (BMS) is the internal Fraunhofer SCAI repository for keeping BEL models under version control. It can be downloaded from https://tor-2.scai.fraunhofer.de/gf/project/bms/
In [5]:
bms_base = os.environ['BMS_BASE']
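A direct lookup with os.environ['BMS_BASE'] raises a KeyError when the variable is unset, which can be confusing for a first-time user. A defensive lookup might look like the following sketch (the helper name and error message are illustrative, not part of PyBEL):

```python
import os

def get_bms_base():
    """Return the BMS repository root from the environment, failing with a clear message if unset."""
    try:
        return os.environ['BMS_BASE']
    except KeyError:
        raise RuntimeError(
            'BMS_BASE is not set. Add "export BMS_BASE=~/dev/bms" to ~/.bashrc '
            'and restart the shell (or the Jupyter kernel).'
        )
```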
The Alzheimer's Disease Knowledge Assembly has been precompiled with the following command line script, and will be loaded from this format for improved performance. In general, derived data, such as the gpickle representation of a BEL script, are not saved under version control to ensure that the most up-to-date data is always used.
pybel convert --path "$BMS_BASE/aetionomy/alzheimers.bel" --pickle "$BMS_BASE/aetionomy/alzheimers.gpickle"
The BEL script can also be compiled from inside this notebook with the following python code:
>>> import os
>>> import pybel
>>> # Input from BEL script
>>> bel_path = os.path.join(bms_base, 'aetionomy', 'alzheimers.bel')
>>> graph = pybel.from_path(bel_path)
>>> # Output to gpickle for fast loading later
>>> pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers.gpickle')
>>> pybel.to_pickle(graph, pickle_path)
In [6]:
pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')
In [7]:
graph = pybel.from_pickle(pickle_path)
In [8]:
graph.version
Out[8]:
The number of unique references to documents in PubMed can be calculated with pbt.summary.count_pmids
In [9]:
pmid_counter = pbt.summary.count_pmids(graph)
The total number of PubMed references can be readily accessed with len()
on the Counter returned by pbt.summary.count_pmids.
In [10]:
len(pmid_counter)
Out[10]:
The top 15 most informative papers, in terms of the number of edges contributed, are displayed below.
In [11]:
for pmid, count in pmid_counter.most_common(15):
    print('https://www.ncbi.nlm.nih.gov/pubmed/{}\t{}'.format(pmid, count))
The NCBI eUtils platform is used to look up all PubMed references and enrich information about the authors, publication, volume, page, and title with pbt.mutation.fix_pubmed_citations.
In [12]:
pbt.mutation.parse_authors(graph)
In [13]:
%%time
erroneous_pmids = pbt.mutation.fix_pubmed_citations(graph, stringify_authors=False)
In [14]:
erroneous_pmids
Out[14]:
In [15]:
pmid_evidences = pbt.summary.get_evidences_by_pmid(graph, erroneous_pmids)

for pmid in sorted(pmid_evidences):
    print('https://www.ncbi.nlm.nih.gov/pubmed/{}'.format(pmid))
    for evidence in sorted(pmid_evidences[pmid]):
        print('\t', evidence, '\n')
The associations between authors and their publications can be summarized with pbt.summary.count_author_publications.
In [16]:
author_publication_counter = pbt.summary.count_author_publications(graph)
The total number of authors can be readily counted with len()
on the Counter returned by pbt.summary.count_author_publications.
In [17]:
len(author_publication_counter)
Out[17]:
The top 35 authors, in terms of the number of publications contributed to the graph, are displayed below.
In [18]:
author_publication_counter.most_common(35)
Out[18]:
It's also possible to look up the contributions of individual authors using the Counter's dictionary lookup and a simple substring search.
In [19]:
for author in author_publication_counter:
    if 'Heneka' in author:
        print(author, author_publication_counter[author])
The top 35 authors, in terms of the number of edges contributed to the graph, are displayed below.
In [20]:
author_counter = pbt.summary.count_authors(graph)
author_counter.most_common(35)
Out[20]:
In [21]:
target_subgraph = 'Apoptosis signaling subgraph'
In [22]:
subgraph = pbt.selection.get_subgraph_by_annotation_value(graph, annotation='Subgraph', value=target_subgraph)
pbt.summary.print_summary(subgraph)
The unique citations for every pair of nodes are calculated. This helps to remove the bias from edges that have many annotations and therefore undergo a cartesian explosion during compilation. This process can be compared with the output of pbt.summary.count_pmids.
In [23]:
citations = defaultdict(set)

for u, v, d in subgraph.edges_iter(data=True):
    c = d[CITATION]
    citations[u, v].add((c[CITATION_TYPE], c[CITATION_REFERENCE], c[CITATION_NAME]))

counter = Counter(itt.chain.from_iterable(citations.values()))

for (_, pmid, name), v in counter.most_common(35):
    print('https://www.ncbi.nlm.nih.gov/pubmed/{}\t{}\t{}'.format(int(pmid.strip()), v, name))
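The deduplication step above can be illustrated independently of PyBEL with plain data structures: each (source, target) pair maps to a set of citations, so a citation repeated across many annotated parallel edges between the same two nodes is counted only once. A minimal sketch with hypothetical node names and PubMed identifiers:

```python
import itertools as itt
from collections import Counter, defaultdict

# Hypothetical edge list: (source, target, pmid). The same citation appears on
# several parallel edges between nodes 'a' and 'b'.
edges = [
    ('a', 'b', '111'),
    ('a', 'b', '111'),  # duplicate from a second annotation on the same pair
    ('a', 'b', '222'),
    ('b', 'c', '111'),
]

# Collapse parallel edges: each node pair keeps a *set* of citations,
# so repeats within a pair are discarded.
citations = defaultdict(set)
for u, v, pmid in edges:
    citations[u, v].add(pmid)

# Count each citation once per node pair it supports.
counter = Counter(itt.chain.from_iterable(citations.values()))

print(counter.most_common())  # [('111', 2), ('222', 1)]
```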
While BEL documents are a repository for biological knowledge, they also provide insight into the most prolific authors and highest information-density papers. After making this information readily available through the functions provided in PyBEL Tools, other tools that handle citation networks could be integrated and utilities like SCAIView could be further leveraged to identify which publications would have the highest potential to improve the content of the network.