Author: Charles Tapley Hoyt
Estimated Run Time: 1 minute
This notebook demonstrates the utilities in PyBEL Tools that facilitate the exploration and expansion of subgraphs to allow for easier interpretation and contextualization of their underlying mechanisms. The data used in this notebook comes from the AETIONOMY Alzheimer's Disease (AD) knowledge assembly that has been annotated with the NeuroMMSig Knowledge Base.
In [1]:
import logging
import os
import sys
import time
from collections import Counter, defaultdict
from operator import itemgetter
import matplotlib.pyplot as plt
import networkx as nx
import pybel
import pybel_tools as pbt
from pybel.constants import *
from pybel_tools.visualization import to_jupyter
from pybel_tools.utils import barh, barv
In [2]:
%config InlineBackend.figure_format = 'svg'
%matplotlib inline
In [3]:
time.asctime()
Out[3]:
In [4]:
pybel.__version__
Out[4]:
In [5]:
pbt.__version__
Out[5]:
To make this notebook interoperable across many machines, the locations of the repositories that contain the data used in this notebook are read from environment variables, which are set in ~/.bashrc to point to the place where the repositories have been cloned. Assuming the repositories have been cloned with git clone into the ~/dev folder, the entries in ~/.bashrc should look like:
...
export BMS_BASE=~/dev/bms
...
The biological model store (BMS) is the internal Fraunhofer SCAI repository for keeping BEL models under version control. It can be downloaded from https://tor-2.scai.fraunhofer.de/gf/project/bms/
In [6]:
bms_base = os.environ['BMS_BASE']
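Because a plain dictionary lookup on os.environ raises a bare KeyError when the variable is missing, a slightly more defensive lookup can help on a freshly configured machine. The following is a minimal sketch only; the helper name get_bms_base is hypothetical and is not part of PyBEL or PyBEL Tools.

```python
import os


def get_bms_base(environ=os.environ):
    """Return the BMS_BASE path or raise a descriptive error.

    Hypothetical helper for illustration; not part of PyBEL or PyBEL Tools.
    """
    value = environ.get('BMS_BASE')
    if not value:
        raise RuntimeError('BMS_BASE is not set; export it in ~/.bashrc first')
    # Expand a leading "~" in case the shell did not expand it already
    return os.path.expanduser(value)


# Example with an explicit mapping instead of the real environment
example_base = get_bms_base({'BMS_BASE': '/home/user/dev/bms'})
print(example_base)
```

Passing the mapping as an argument keeps the helper easy to try out without touching the real environment.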
The Alzheimer's Disease Knowledge Assembly has been precompiled with the following command line script, and will be loaded from this format for improved performance. In general, derived data, such as the gpickle representation of a BEL script, are not saved under version control to ensure that the most up-to-date data is always used.
pybel convert --path "$BMS_BASE/aetionomy/alzheimers.bel" --pickle "$BMS_BASE/aetionomy/alzheimers.gpickle"
The BEL script can also be compiled from inside this notebook with the following Python code:
>>> import os
>>> import pybel
>>> # Input from BEL script
>>> bel_path = os.path.join(bms_base, 'aetionomy', 'alzheimers.bel')
>>> graph = pybel.from_path(bel_path)
>>> # Output to gpickle for fast loading later
>>> pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers.gpickle')
>>> pybel.to_pickle(graph, pickle_path)
In [7]:
pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')
In [8]:
graph = pybel.from_pickle(pickle_path)
In [9]:
graph.version
Out[9]:
In [10]:
# Add all canonical names for later
pbt.mutation.add_canonical_names(graph)
The GABA subgraph is explored in this example. It contains a representative group of genes, RNAs, proteins, biological processes, and pathologies, along with all of their relations. It is extracted with pbt.selection.get_subgraph_by_annotation_value.
In [11]:
example_subgraph_name = 'GABA subgraph'
In [12]:
subgraph = pbt.selection.get_subgraph_by_annotation_value(graph, annotation='Subgraph', value=example_subgraph_name)
pbt.summary.print_summary(subgraph)
In [13]:
to_jupyter(subgraph)
Out[13]:
The subgraph also contains elements with important unqualified edges, like the relationships between complexes and their members. These relationships can be enriched from the original graph using the function pbt.mutation.enrich_unqualified. For example, the connection between complex(p(HGNC:EGR1), p(HGNC:PSEN2)) and p(HGNC:PSEN2) is added during this process. The connection between p(HGNC:APP) and p(HGNC:APP, frag(672_713)) is also recovered.
In [14]:
pbt.mutation.enrich_unqualified(graph, subgraph)
pbt.summary.print_summary(subgraph)
In [15]:
to_jupyter(subgraph)
Out[15]:
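To make the idea of enriching unqualified edges concrete, here is a toy sketch using plain NetworkX rather than PyBEL Tools. It assumes that edges carry a 'relation' attribute and that a single pass copies every unqualified edge touching a node already in the subgraph; both the helper name enrich_unqualified_sketch and this exact rule are illustrative assumptions, not the library's implementation.

```python
import networkx as nx

# Relation labels treated as "unqualified" in this toy example (assumption)
UNQUALIFIED = {'hasComponent', 'hasVariant', 'hasMember', 'hasReactant', 'hasProduct'}

# Toy "full" graph: annotated edges carry a 'Subgraph' attribute
full = nx.MultiDiGraph()
full.add_edge('complex(EGR1, PSEN2)', 'p(PSEN2)', relation='hasComponent')
full.add_edge('complex(EGR1, PSEN2)', 'p(EGR1)', relation='hasComponent')
full.add_edge('p(APP)', 'p(APP, frag(672_713))', relation='hasVariant')
full.add_edge('complex(EGR1, PSEN2)', 'p(APP)', relation='increases', Subgraph='GABA subgraph')
full.add_edge('p(MAPT)', 'p(MAPT, pmod(Ph))', relation='hasVariant')  # unrelated to the subgraph

# Induce the subgraph from the annotated edges only
sub = nx.MultiDiGraph()
for u, v, data in full.edges(data=True):
    if data.get('Subgraph') == 'GABA subgraph':
        sub.add_edge(u, v, **data)


def enrich_unqualified_sketch(full, sub):
    """Copy unqualified edges that touch a node already in the subgraph."""
    anchor_nodes = set(sub.nodes())  # snapshot, so the single pass does not cascade
    for u, v, data in full.edges(data=True):
        if data.get('relation') in UNQUALIFIED and (u in anchor_nodes or v in anchor_nodes):
            sub.add_edge(u, v, **data)


enrich_unqualified_sketch(full, sub)
print(sorted(sub.nodes()))
```

In this toy graph, the members of the complex and the APP fragment are pulled in, while the unrelated MAPT variant edge is left out.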
The graph also contains some related nodes, like r(HGNC:GABRA5) and p(HGNC:GABRA5), that are disconnected. Inferring the transcription and translation relationships between genes, RNAs, and proteins allows these parts of the graph to be connected without much additional information. This can be accomplished with pbt.mutation.infer_central_dogma.
In [16]:
pbt.mutation.infer_central_dogma(subgraph)
pbt.summary.print_summary(subgraph)
In [17]:
to_jupyter(subgraph)
Out[17]:
Finally, some of the genes and RNAs that have been added have no other connections (they are leaves), and can be removed with pbt.mutation.prune_central_dogma.
In [18]:
pbt.mutation.prune_central_dogma(subgraph)
pbt.summary.print_summary(subgraph)
In [19]:
to_jupyter(subgraph)
Out[19]:
Expansion followed by contraction is analogous to the morphological operations used in image processing (strictly speaking, dilation followed by erosion is called "closing" there, although the function name uses "opening"). Inference of the central dogma followed by removal of leaf genes and RNAs is such a standard operation that both steps can be run with pbt.mutation.opening_on_central_dogma.
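The two steps can be illustrated on a toy graph with plain NetworkX. This is a sketch under stated assumptions, not the PyBEL Tools implementation: nodes are modelled as hypothetical (function, name) tuples, and the helper names infer_central_dogma_sketch and prune_central_dogma_sketch are made up for this example.

```python
import networkx as nx


def infer_central_dogma_sketch(g):
    """Add RNA -> protein and gene -> RNA edges, creating nodes as needed."""
    for func, name in list(g.nodes()):
        if func == 'Protein':
            g.add_edge(('RNA', name), ('Protein', name), relation='translatedTo')
    for func, name in list(g.nodes()):
        if func == 'RNA':
            g.add_edge(('Gene', name), ('RNA', name), relation='transcribedTo')


def is_inferred_leaf(g, node, func, relation):
    """True if `node` has type `func` and its only edge is one outgoing `relation` edge."""
    if node[0] != func or g.in_degree(node) != 0 or g.out_degree(node) != 1:
        return False
    _, _, data = next(iter(g.out_edges(node, data=True)))
    return data.get('relation') == relation


def prune_central_dogma_sketch(g):
    """Remove gene leaves first, then RNA leaves that this exposes."""
    for func, relation in (('Gene', 'transcribedTo'), ('RNA', 'translatedTo')):
        leaves = [node for node in list(g) if is_inferred_leaf(g, node, func, relation)]
        g.remove_nodes_from(leaves)


toy = nx.DiGraph()
toy.add_edge(('RNA', 'GABRA5'), ('Pathology', 'Alzheimer Disease'), relation='negativeCorrelation')
toy.add_edge(('Protein', 'GABRA5'), ('BiologicalProcess', 'GABA signaling'), relation='increases')
toy.add_edge(('Protein', 'GAD1'), ('BiologicalProcess', 'GABA signaling'), relation='increases')

components_before = nx.number_weakly_connected_components(toy)
infer_central_dogma_sketch(toy)
nodes_after_inference = toy.number_of_nodes()
prune_central_dogma_sketch(toy)
components_after = nx.number_weakly_connected_components(toy)
print(components_before, nodes_after_inference, toy.number_of_nodes(), components_after)
```

In the toy graph, the inferred translatedTo edge joins the RNA and protein of GABRA5, so the two components merge, while the gene and RNA nodes that were added only as leaves are pruned again.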
The fact that a subgraph contains more than one connected component probably means that there were errors in the original BEL script. An entire module, pbt.summary.error_summary, is devoted to analyzing the errors produced during compilation.
However, it's also possible that the disconnections are due to a lack of knowledge in the literature. In the curation process for the NeuroMMSig Database, many entity types were not considered. We've developed an algorithm for inferring additional members of a subgraph, including chemicals that occur as intermediates in biochemical processes, and higher-level entities such as biological processes. The set of tools for running the algorithm is available in the pbt.mutation.subgraph_expansion submodule (see pbt.mutation.fill_subgraph).
In [20]:
example_subgraph_name = 'Estrogen subgraph'
In [21]:
subgraph = pbt.selection.get_subgraph_by_annotation_value(graph, annotation='Subgraph', value=example_subgraph_name)
pbt.mutation.enrich_unqualified(graph, subgraph)
pbt.mutation.opening_on_central_dogma(subgraph)
pbt.summary.print_summary(subgraph)
In [22]:
to_jupyter(subgraph)
Out[22]:
The nodes along the periphery of this subgraph can be investigated with pbt.mutation.get_subgraph_peripheral_nodes. Below, it is used to show which nodes are not already in the Estrogen subgraph and how many in- and out-edges they have to it.
In [23]:
pnd = pbt.mutation.get_subgraph_peripheral_nodes(graph, subgraph, node_filters=pbt.filters.exclude_pathology_filter)
In [24]:
for node in sorted(pnd, key=lambda k: len(set(pnd[k]['successor']) | set(pnd[k]['predecessor'])), reverse=True):
    pred_d = pnd[node]['predecessor']
    succ_d = pnd[node]['successor']
    if 0 == len(pred_d) or 0 == len(succ_d):
        continue
    periphery = set(pred_d) | set(succ_d)
    if 4 > len(periphery):
        continue
    print(node, len(pred_d), len(succ_d), len(periphery))
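For readers without access to the data, the shape of the dictionary iterated over above can be imitated on a toy graph. This sketch assumes the layout implied by the indexing in the cell: each outside node maps to 'predecessor' and 'successor' dictionaries keyed by subgraph nodes. The helper peripheral_nodes_sketch is hypothetical and only mirrors that layout with plain NetworkX.

```python
from collections import defaultdict

import networkx as nx


def peripheral_nodes_sketch(full, sub_nodes):
    """Map each outside node to the subgraph nodes it is connected to.

    result[outside]['predecessor'] holds subgraph nodes pointing at `outside`;
    result[outside]['successor'] holds subgraph nodes that `outside` points at.
    """
    result = defaultdict(lambda: {'predecessor': defaultdict(list), 'successor': defaultdict(list)})
    for u, v, data in full.edges(data=True):
        if u in sub_nodes and v not in sub_nodes:
            result[v]['predecessor'][u].append(data)
        elif u not in sub_nodes and v in sub_nodes:
            result[u]['successor'][v].append(data)
    return result


full = nx.MultiDiGraph()
full.add_edge('A', 'X', relation='increases')
full.add_edge('B', 'X', relation='increases')
full.add_edge('X', 'C', relation='decreases')
full.add_edge('Y', 'A', relation='increases')

toy_pnd = peripheral_nodes_sketch(full, sub_nodes={'A', 'B', 'C'})

for node, entry in toy_pnd.items():
    periphery = set(entry['predecessor']) | set(entry['successor'])
    print(node, len(entry['predecessor']), len(entry['successor']), len(periphery))
```

Here the outside node X touches three distinct subgraph nodes in both directions, whereas Y touches only one, so only X would survive the two checks used in the cell above if the cut-off were lowered to suit this small example.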
The function pbt.mutation.expand_periphery automatically handles these calculations and allows a threshold to be specified for how "confident" it should be before adding a node to the subgraph. Node filters can be used to exclude pathologies, which have many connections to everything. The inferred edges are limited to causal edges to avoid adding many low-confidence relations. Luckily, the Estrogen subgraph is small and doesn't become unmanageable after expanding along the periphery. Other, larger subgraphs might have this issue. If the subgraph becomes too complicated, it might be useful to extract the causal subgraph using pbt.selection.get_causal_subgraph.
In [25]:
pbt.mutation.expand_periphery(
    graph,
    subgraph,
    node_filters=pbt.filters.exclude_pathology_filter,
    threshold=3,
)
pbt.summary.print_summary(subgraph)
In [26]:
to_jupyter(subgraph)
Out[26]:
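The thresholding idea can be sketched on a toy graph with plain NetworkX. The code below is an illustration under assumptions, not the PyBEL Tools implementation: the helper expand_periphery_sketch, the excluded_nodes parameter, and the set of relation labels treated as causal are all chosen for this example.

```python
import networkx as nx

# Relation labels treated as causal in this toy example (assumption)
CAUSAL = {'increases', 'directlyIncreases', 'decreases', 'directlyDecreases'}


def expand_periphery_sketch(full, sub, threshold=2, excluded_nodes=()):
    """Add outside nodes causally linked to at least `threshold` subgraph nodes."""
    inside = set(sub.nodes())
    touching = {}  # outside node -> list of (u, v, data) edges linking it to the subgraph
    for u, v, data in full.edges(data=True):
        if data.get('relation') not in CAUSAL:
            continue
        if u in inside and v not in inside:
            outside = v
        elif v in inside and u not in inside:
            outside = u
        else:
            continue
        if outside in excluded_nodes:
            continue
        touching.setdefault(outside, []).append((u, v, data))
    for outside, edges in touching.items():
        neighbors = {u if v == outside else v for u, v, _ in edges}
        if len(neighbors) >= threshold:
            for u, v, data in edges:
                sub.add_edge(u, v, **data)


full = nx.MultiDiGraph()
full.add_edge('A', 'B', relation='increases')
full.add_edge('A', 'X', relation='increases')    # X touches A and C: added
full.add_edge('X', 'C', relation='decreases')
full.add_edge('Y', 'A', relation='increases')    # Y touches only A: below threshold
full.add_edge('Z', 'A', relation='association')  # Z has no causal edges: ignored
full.add_edge('Z', 'B', relation='association')
full.add_edge('A', 'P', relation='increases')    # P stands in for an excluded pathology
full.add_edge('B', 'P', relation='increases')

sub = nx.MultiDiGraph()
sub.add_edge('A', 'B', relation='increases')
sub.add_node('C')

expand_periphery_sketch(full, sub, threshold=2, excluded_nodes={'P'})
print(sorted(sub.nodes()))
```

Raising the threshold makes the expansion more conservative, since an outside node must then be supported by more distinct subgraph nodes before it is added.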
In this case, we were able to infer connections that not only gave the Estrogen subgraph more context, but also connected its individual components.
A final touch to the subgraph might be to infer connections between nodes that have just been added. This can be done with pbt.mutation.expand_internal, which allows edge filters to be specified, or with pbt.mutation.expand_internal_causal, which is a thin wrapper that supplies the edge filter pbt.filters.keep_causal_edges. Again, expansion on unqualified edges and opening with the central dogma can make this expanded subgraph easier to interpret.
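Internal expansion can likewise be sketched with plain NetworkX. The helper expand_internal_sketch and the is_causal filter are hypothetical names for this toy example; the assumption is simply that an edge is copied when both of its endpoints are already in the subgraph and it passes the optional edge filter.

```python
import networkx as nx

# Relation labels treated as causal in this toy example (assumption)
CAUSAL = {'increases', 'directlyIncreases', 'decreases', 'directlyDecreases'}


def is_causal(data):
    """Toy edge filter: keep only edges whose relation is causal."""
    return data.get('relation') in CAUSAL


def expand_internal_sketch(full, sub, edge_filter=None):
    """Copy edges from `full` whose endpoints are both already in `sub`."""
    for u, v, key, data in full.edges(keys=True, data=True):
        if u not in sub or v not in sub or sub.has_edge(u, v, key):
            continue
        if edge_filter is None or edge_filter(data):
            sub.add_edge(u, v, key=key, **data)


full = nx.MultiDiGraph()
full.add_edge('A', 'B', relation='increases')
full.add_edge('B', 'X', relation='decreases')    # both endpoints inside, causal: copied
full.add_edge('X', 'A', relation='association')  # both endpoints inside, not causal
full.add_edge('A', 'Q', relation='increases')    # Q is outside the subgraph

sub = nx.MultiDiGraph()
sub.add_edge('A', 'B', key=0, relation='increases')
sub.add_node('X')

expand_internal_sketch(full, sub, edge_filter=is_causal)
causal_edge_count = sub.number_of_edges()

expand_internal_sketch(full, sub)  # no filter: the association edge is copied as well
unfiltered_edge_count = sub.number_of_edges()
print(causal_edge_count, unfiltered_edge_count)
```

Unlike expansion along the periphery, this step never adds nodes; it only fills in edges between nodes the subgraph already contains.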
Further, an unbiased expansion method could allow entities such as chemicals and bioprocesses to be annotated to subgraphs, and could enable more exotic enrichment algorithms, similar to those in NeuroMMSigDB, to be implemented.