Author: Charles Tapley Hoyt
This notebook outlines a simple way to explore the citations, authors, and provenance information in a graph and its subgraphs.
In [1]:
import itertools as itt
import os
import time
from collections import defaultdict, Counter
from operator import itemgetter
import pandas as pd
import pybel
import pybel_tools as pbt
from pybel.constants import *
In [2]:
time.asctime()
Out[2]:
In [3]:
pybel.__version__
Out[3]:
In [4]:
pbt.__version__
Out[4]:
To make this notebook interoperable across many machines, the locations of the repositories containing the data used in this notebook are read from environment variables, set in ~/.bashrc
to point to the place where the repositories have been cloned. Assuming the repositories have been git clone
'd into the ~/dev
folder, the entries in ~/.bashrc
should look like:
...
export BMS_BASE=~/dev/bms
...
The biological model store (BMS) is the internal Fraunhofer SCAI repository for keeping BEL models under version control. It can be downloaded from https://tor-2.scai.fraunhofer.de/gf/project/bms/
In [5]:
bms_base = os.environ['BMS_BASE']
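A direct lookup with os.environ['BMS_BASE'] raises a KeyError when the variable is unset, which can be confusing for a first-time user. A defensive lookup might look like the following sketch (the helper name and error message are illustrative, not part of PyBEL):

```python
import os

def get_bms_base():
    """Return the BMS repository root from the environment, failing with a clear message if unset."""
    try:
        return os.environ['BMS_BASE']
    except KeyError:
        raise RuntimeError(
            'BMS_BASE is not set. Add "export BMS_BASE=~/dev/bms" to ~/.bashrc '
            'and restart the shell (or the Jupyter kernel).'
        )
```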
The Alzheimer's Disease Knowledge Assembly has been precompiled with the following command line script, and will be loaded from this format for improved performance. In general, derived data, such as the gpickle representation of a BEL script, are not saved under version control to ensure that the most up-to-date data is always used.
pybel convert --path "$BMS_BASE/aetionomy/alzheimers.bel" --pickle "$BMS_BASE/aetionomy/alzheimers.gpickle"
The BEL script can also be compiled from inside this notebook with the following python code:
>>> import os
>>> import pybel
>>> # Input from BEL script
>>> bel_path = os.path.join(bms_base, 'aetionomy', 'alzheimers.bel')
>>> graph = pybel.from_path(bel_path)
>>> # Output to gpickle for fast loading later
>>> pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers.gpickle')
>>> pybel.to_pickle(graph, pickle_path)
In [6]:
pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')
In [7]:
graph = pybel.from_pickle(pickle_path)
In [8]:
graph.version
Out[8]:
The number of unique references to documents in PubMed can be calculated with pbt.summary.count_pmids
In [9]:
pmid_counter = pbt.summary.count_pmids(graph)
The total number of PubMed references can be readily accessed with len()
on the Counter returned by pbt.summary.count_pmids.
In [10]:
len(pmid_counter)
Out[10]:
The top 15 most informative papers, in terms of the number of edges contributed, are displayed below.
In [11]:
for pmid, count in pmid_counter.most_common(15):
    print('https://www.ncbi.nlm.nih.gov/pubmed/{}\t{}'.format(pmid, count))
The NCBI eUtils platform is used to look up all PubMed references and enrich information about the authors, publication, volume, page, and title with pbt.mutation.fix_pubmed_citations.
In [12]:
pbt.mutation.parse_authors(graph)
In [13]:
%%time
erroneous_pmids = pbt.mutation.fix_pubmed_citations(graph, stringify_authors=False)
In [14]:
erroneous_pmids
Out[14]:
In [15]:
pmid_evidences = pbt.summary.get_evidences_by_pmid(graph, erroneous_pmids)

for pmid in sorted(pmid_evidences):
    print('https://www.ncbi.nlm.nih.gov/pubmed/{}'.format(pmid))
    for evidence in sorted(pmid_evidences[pmid]):
        print('\t', evidence, '\n')
The associations between authors and their publications can be summarized with pbt.summary.count_author_publications.
In [16]:
author_publication_counter = pbt.summary.count_author_publications(graph)
The total number of authors can be readily counted with len()
on the Counter returned by pbt.summary.count_author_publications.
In [17]:
len(author_publication_counter)
Out[17]:
The top 35 authors, in terms of the number of publications contributed to the graph, are displayed below.
In [18]:
author_publication_counter.most_common(35)
Out[18]:
It's also possible to look up the contributions of individual authors using the Counter's dictionary lookup and a simple substring search.
In [19]:
for author in author_publication_counter:
    if 'Heneka' in author:
        print(author, author_publication_counter[author])
The top 35 authors, in terms of the number of edges contributed to the graph, are displayed below.
In [20]:
author_counter = pbt.summary.count_authors(graph)
author_counter.most_common(35)
Out[20]:
In [21]:
target_subgraph = 'Apoptosis signaling subgraph'
In [22]:
subgraph = pbt.selection.get_subgraph_by_annotation_value(graph, annotation='Subgraph', value=target_subgraph)
pbt.summary.print_summary(subgraph)
The unique citations for every pair of nodes are calculated. This helps to remove the bias from edges that have many annotations and therefore undergo a cartesian explosion during compilation. This process can be compared with the output of pbt.summary.count_pmids.
In [23]:
citations = defaultdict(set)

for u, v, d in subgraph.edges_iter(data=True):
    c = d[CITATION]
    citations[u, v].add((c[CITATION_TYPE], c[CITATION_REFERENCE], c[CITATION_NAME]))

counter = Counter(itt.chain.from_iterable(citations.values()))

for (_, pmid, name), v in counter.most_common(35):
    print('https://www.ncbi.nlm.nih.gov/pubmed/{}\t{}\t{}'.format(int(pmid.strip()), v, name))
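The deduplication step above can be illustrated independently of PyBEL with plain data structures: each (source, target) pair maps to a set of citations, so a citation repeated across many annotated parallel edges between the same two nodes is counted only once. A minimal sketch with hypothetical node names and PubMed identifiers:

```python
import itertools as itt
from collections import Counter, defaultdict

# Hypothetical edge list: (source, target, pmid). The same citation appears on
# several parallel edges between nodes 'a' and 'b'.
edges = [
    ('a', 'b', '111'),
    ('a', 'b', '111'),  # duplicate from a second annotation on the same pair
    ('a', 'b', '222'),
    ('b', 'c', '111'),
]

# Collapse parallel edges: each node pair keeps a *set* of citations,
# so repeats within a pair are discarded.
citations = defaultdict(set)
for u, v, pmid in edges:
    citations[u, v].add(pmid)

# Count each citation once per node pair it supports.
counter = Counter(itt.chain.from_iterable(citations.values()))

print(counter.most_common())  # [('111', 2), ('222', 1)]
```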
While BEL documents are a repository for biological knowledge, they also provide insight into the most prolific authors and highest information-density papers. After making this information readily available through the functions provided in PyBEL Tools, other tools that handle citation networks could be integrated and utilities like SCAIView could be further leveraged to identify which publications would have the highest potential to improve the content of the network.