This notebook demonstrates basic usage of BioThings Explorer, an engine for autonomously querying a distributed knowledge graph. BioThings Explorer can answer two classes of queries: "PREDICT" and "EXPLAIN". PREDICT queries are described in PREDICT_demo.ipynb. Here, we describe EXPLAIN queries and how to use BioThings Explorer to execute them. A more detailed overview of the BioThings Explorer system is provided in these slides.
EXPLAIN queries are designed to identify plausible reasoning chains to explain the relationship between two entities. For example, in this notebook, we explore the question:
"Why does imatinib have an effect on the treatment of chronic myelogenous leukemia (CML)?"
Later, we also compare those results to a similar query looking at imatinib's role in treating gastrointestinal stromal tumors (GIST).
First, install the biothings_explorer and biothings_schema packages, as described in this README. This only needs to be done once; the step is included here so that the notebook also runs in a fresh environment.
In [1]:
%%capture
!pip install git+https://github.com/biothings/biothings_explorer#egg=biothings_explorer
Next, import the relevant modules:
In [1]:
# import modules from biothings_explorer
from biothings_explorer.hint import Hint
from biothings_explorer.user_query_dispatcher import FindConnection
In this step, BioThings Explorer translates our query strings "chronic myelogenous leukemia" and "imatinib" into BioThings objects, which contain mappings to many common identifiers. Generally, the top result returned by the Hint module will be the correct item, but you should confirm that using the identifiers shown.
Search terms can correspond to any child of BiologicalEntity from the Biolink Model, including DiseaseOrPhenotypicFeature (e.g., "lupus"), ChemicalSubstance (e.g., "acetaminophen"), Gene (e.g., "CDK2"), BiologicalProcess (e.g., "T cell differentiation"), and Pathway (e.g., "Citric acid cycle").
In [2]:
ht = Hint()
# find all potential representations of CML
cml_hint = ht.query("chronic myelogenous leukemia")
# select the correct representation of CML
cml = cml_hint['Disease'][0]
cml
Out[2]:
In [3]:
# find all potential representations of imatinib
imatinib_hint = ht.query("imatinib")
# select the correct representation of imatinib
imatinib = imatinib_hint['ChemicalSubstance'][0]
imatinib
Out[3]:
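Taking element `[0]` assumes that the requested semantic type returned at least one hit. The sketch below shows one way to make that selection fail with a readable message instead of an `IndexError`. It runs on a toy dictionary rather than a real Hint result: `pick_first` is a hypothetical helper, not part of biothings_explorer, and the field names in `toy_hint` are illustrative assumptions.

```python
def pick_first(hint, semantic_type):
    """Return the top-ranked hit of one semantic type from a Hint-style dict."""
    hits = hint.get(semantic_type, [])
    if not hits:
        # Report which types did return hits, to help pick the right key.
        available = sorted(key for key, value in hint.items() if value)
        raise KeyError(f"no {semantic_type!r} hits; types with hits: {available}")
    return hits[0]


# Toy stand-in for the dictionary returned by ht.query();
# the structure and field names here are assumptions for illustration.
toy_hint = {
    "ChemicalSubstance": [{"name": "imatinib"}],
    "Disease": [],
}

top_hit = pick_first(toy_hint, "ChemicalSubstance")
print(top_hit["name"])
```

With a real result, the same helper could be called as `pick_first(imatinib_hint, 'ChemicalSubstance')`, after which the identifiers should still be checked by eye.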
In this section, we find all paths in the knowledge graph that connect imatinib and chronic myelogenous leukemia. To do that, we will use FindConnection. This class is a convenient wrapper around two advanced functions for query path planning and query path execution. More advanced features for both query path planning and query path execution are in development and will be documented in the coming months.
The parameters for FindConnection are described below:
In [5]:
help(FindConnection.__init__)
Here, we formulate a FindConnection query with "CML" as the input_obj and "imatinib" as the output_obj. With the intermediate_nodes parameter, we further specify that we are looking for paths joining chronic myelogenous leukemia and imatinib through one intermediate node that is a Gene. (The ability to search for longer reasoning paths that include additional intermediate nodes will be added shortly.)
In [4]:
fc = FindConnection(input_obj=cml, output_obj=imatinib, intermediate_nodes='Gene')
We next execute the connect method, which performs the query path planning and query path execution process. In short, BioThings Explorer deconstructs the query into individual API calls, executes those API calls, and then assembles the results.
A verbose log of this process is displayed below:
In [5]:
# setting verbose=True displays all the steps that BTE takes to find the connection
fc.connect(verbose=True)
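To make the "assemble the results" step more concrete, here is a toy sketch of joining two sets of edges on a shared intermediate node. It is not BioThings Explorer's implementation: `join_paths` is a hypothetical function, the edge lists stand in for the results of individual API calls, and `GENE_X` and `GENE_Y` are placeholder names.

```python
from collections import defaultdict


def join_paths(first_hop, second_hop):
    """Join (source, mid) edges with (mid, target) edges on the middle node."""
    targets_by_mid = defaultdict(list)
    for mid, target in second_hop:
        targets_by_mid[mid].append(target)
    # Keep only paths whose middle node appears in both hops.
    return [
        (source, mid, target)
        for source, mid in first_hop
        for target in targets_by_mid.get(mid, [])
    ]


# Toy edges; GENE_X and GENE_Y are placeholders, not real results.
disease_to_gene = [("CML", "ABL1"), ("CML", "BCR"), ("CML", "GENE_X")]
gene_to_drug = [("ABL1", "imatinib"), ("BCR", "imatinib"), ("GENE_Y", "imatinib")]

paths = join_paths(disease_to_gene, gene_to_drug)
for path in paths:
    print(" -> ".join(path))
```

Only genes that appear in both hops survive the join, which is why an intermediate node must be linked to both the disease and the drug to show up in the results.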
This section demonstrates post-query filtering done in Python. Later, more advanced filtering functions will be added to the query path execution module for interleaved filtering, thereby enabling longer query paths. More details to come...
First, all matching paths can be exported to a data frame. Let's examine a sample of those results.
In [6]:
df = fc.display_table_view()
# UMLS is not currently well-integrated in our ID-to-object translation system, so UMLS-only entries are removed here
umls_pattern = r"^UMLS:C\d+"
umls_only = df.node1_id.str.contains(umls_pattern)
df = df[~umls_only]
print(df.shape)
df.sample(10)
Out[6]:
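The filter above keeps every row whose intermediate-node ID does not start with a UMLS concept identifier. The same test can be sketched with the standard `re` module on a plain list, which makes it easy to check what the pattern matches. The identifiers below are made up for illustration. Writing the pattern as a raw string keeps Python from treating `\d` as a string escape.

```python
import re

# Matches IDs that begin with "UMLS:C" followed by at least one digit.
UMLS_ONLY = re.compile(r"^UMLS:C\d+")

# Made-up identifiers, for illustration only.
node_ids = ["UMLS:C0000001", "SYMBOL:ABL1", "OTHER:123", "UMLS:C9999999"]

# search() with a ^-anchored pattern mirrors pandas' str.contains on each value.
kept = [node_id for node_id in node_ids if not UMLS_ONLY.search(node_id)]
print(kept)
```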
While most results are based on edges from semmed, edges from DGIdb, biolink, disgenet, mydisease.info and drugcentral were also retrieved from their respective APIs.
Next, let's look to see which genes are mentioned the most.
In [7]:
df.node1_name.value_counts().head(10)
Out[7]:
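`value_counts()` tallies how often each gene name appears in the `node1_name` column and sorts the tallies in descending order. A rough standard-library equivalent, shown on a toy list rather than the real query results, is `collections.Counter`:

```python
from collections import Counter

# Toy column of intermediate-node names; the frequencies are invented.
gene_column = ["ABL1", "BCR", "ABL1", "KIT", "ABL1", "BCR"]

counts = Counter(gene_column)
top_two = counts.most_common(2)  # list of (name, count), most frequent first
print(top_two)
```

A gene appears once per retrieved path, so a high count means many distinct edges pass through that gene, not necessarily stronger evidence for any single edge.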
Not surprisingly, the top two genes that BioThings Explorer found joining imatinib to CML are ABL1 and BCR. These are the two genes fused in the "Philadelphia chromosome", the genetic abnormality that underlies CML, and the resulting fusion is the validated target of imatinib.
Let's examine some of the PubMed articles linking CML to ABL1 and ABL1 to imatinib.
In [8]:
# fetch all articles connecting 'chronic myelogenous leukemia' and 'ABL1'
articles = []
for info in fc.display_edge_info('chronic myelogenous leukemia', 'ABL1').values():
    if 'pubmed' in info['info']:
        articles += info['info']['pubmed']
print("There are " + str(len(articles)) + " articles supporting the edge between CML and ABL1. Sampling of 10 of those:")
for pmid in articles[:10]:
    print("http://pubmed.gov/" + str(pmid))
In [11]:
# fetch all articles connecting 'ABL1' and 'imatinib'
articles = []
for info in fc.display_edge_info('ABL1', 'imatinib').values():
    if 'pubmed' in info['info']:
        articles += info['info']['pubmed']
print("There are " + str(len(articles)) + " articles supporting the edge between ABL1 and imatinib. Sampling of 10 of those:")
for pmid in articles[:10]:
    print("http://pubmed.gov/" + str(pmid))
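The two cells above repeat the same loop, so it can be factored into a small helper. The sketch below is a hypothetical refactoring, not part of biothings_explorer. It assumes, as the cells above do, that each edge record carries an `info` dictionary whose optional `pubmed` entry is a list of IDs. It also removes duplicate IDs, which the original loop does not do. The PubMed IDs in `toy_edges` are placeholders.

```python
def collect_pubmed_ids(edge_info):
    """Gather unique PubMed IDs from edge records, preserving first-seen order."""
    seen = set()
    pmids = []
    for edge in edge_info.values():
        for pmid in edge.get("info", {}).get("pubmed", []):
            pmid = str(pmid)
            if pmid not in seen:
                seen.add(pmid)
                pmids.append(pmid)
    return pmids


# Toy stand-in for the dictionary returned by fc.display_edge_info(...);
# the IDs are placeholders, not real article identifiers.
toy_edges = {
    0: {"info": {"pubmed": ["111", "222"]}},
    1: {"info": {"source": "no articles on this edge"}},
    2: {"info": {"pubmed": ["222", 333]}},
}

pmids = collect_pubmed_ids(toy_edges)
urls = ["http://pubmed.gov/" + pmid for pmid in pmids]
print(urls)
```

With a real query, the helper could be called as `collect_pubmed_ids(fc.display_edge_info('ABL1', 'imatinib'))`.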
In [12]:
ht = Hint()
# find all potential representations of GIST
gist_hint = ht.query("gastrointestinal stromal tumor")
# select the correct representation of GIST
gist = gist_hint['Disease'][0]
gist
Out[12]:
In [13]:
fc = FindConnection(input_obj=gist, output_obj=imatinib, intermediate_nodes='Gene')
In [14]:
fc.connect(verbose=False) # skipping the verbose log here
In [15]:
df = fc.display_table_view()
# UMLS is not currently well-integrated in our ID-to-object translation system, so UMLS-only entries are removed here
umls_pattern = r"^UMLS:C\d+"
umls_only = df.node1_id.str.contains(umls_pattern)
df = df[~umls_only]
print(df.shape)
df.sample(10)
Out[15]:
In [16]:
df.node1_name.value_counts().head(10)
Out[16]:
Here, the top two genes that BioThings Explorer found that join imatinib to GIST are PDGFRA and KIT, the most commonly mutated genes found in GIST and validated targets of imatinib.
While several of the listed genes would be considered positive controls, others on the list could be viewed as testable hypotheses and discovery opportunities to be evaluated by domain experts.
This notebook demonstrated the use of BioThings Explorer in EXPLAIN mode to investigate the relationship between imatinib and two diseases that it treats: chronic myelogenous leukemia (CML) and gastrointestinal stromal tumors (GIST). In each case, BioThings Explorer autonomously queried a distributed knowledge graph of biomedical APIs to find the genes that most frequently link the drug to the disease, and in each case the relevant targets were retrieved.
There are still many areas for improvement (and some areas in which BioThings Explorer is still buggy). And of course, BioThings Explorer is dependent on the accessibility of the APIs that comprise the distributed knowledge graph. Nevertheless, we encourage users to try other variants of the EXPLAIN queries demonstrated in this notebook.