pyConTextNLP uses NetworkX directed graphs to represent the markup: nodes in the graph are the concepts identified in the sentence, and edges are the relationships between those concepts.
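As a rough picture of that representation, the sketch below builds a tiny directed graph by hand with NetworkX. The node names, the attribute values, and the choice to point the edge from the modifier to the target are assumptions made for illustration; they are not output produced by pyConTextNLP.

```python
import networkx as nx

# Toy markup graph; names and attribute values are illustrative assumptions.
g = nx.DiGraph()
g.add_node("no", mode="modifier", category=["definite_negated_existence"])
g.add_node("pneumothorax", mode="target", category=["pneumothorax"])

# A directed edge standing for "this modifier applies to this target".
g.add_edge("no", "pneumothorax")

print(list(g.nodes(data=True)))
print(list(g.edges()))
```

Because the graph is directed, the relationship runs one way only: the reverse edge does not exist unless it is added explicitly.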
In [1]:
import pyConTextNLP.pyConTextGraph as pyConText
import pyConTextNLP.itemData as itemData
import networkx as nx
pyConTextGraph contains the bulk of the pyConTextNLP functionality, including basic class definitions such as the ConTextMarkup class that represents the markup of a sentence.
itemData contains a class definition for an itemData and functions for reading itemData definitions, which are assumed to be in a tab-separated file specified as either a local file or a remote resource. In this example we will read definitions straight from the GitHub repository.
itemData in its most basic form is a four-tuple.
These example reports are taken (with modification) from the MIMIC2 demo data set, a publicly available database of de-identified medical records for deceased individuals.
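To make the idea of tab-separated itemData definitions more concrete, here is a minimal sketch that parses four-column rows into tuples using only the standard library. The field names (`literal`, `category`, `regex`, `rule`), the header row, and the two sample rows are hypothetical and chosen for illustration; the actual file layout and the pyConTextNLP reader may differ.

```python
import csv
import io
from collections import namedtuple

# Hypothetical field names for illustration; the real itemData fields may differ.
Item = namedtuple("Item", ["literal", "category", "regex", "rule"])

# Inline stand-in for a tab-separated definitions file with a header row.
tsv_text = (
    "Lex\tType\tRegex\tDirection\n"
    "no\tDEFINITE_NEGATED_EXISTENCE\t\tforward\n"
    "pneumothorax\tPNEUMOTHORAX\tpneumothora(x|ces)\t\n"
)

def read_items(handle):
    """Read four-column tab-separated rows into Item tuples, skipping the header."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    return [Item(*row) for row in reader if len(row) == 4]

items = read_items(io.StringIO(tsv_text))
for item in items:
    print(item)
```

Empty columns are kept as empty strings, so every row still yields a complete four-tuple.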
In [2]:
reports = [
"""IMPRESSION: Evaluation limited by lack of IV contrast; however, no evidence of
bowel obstruction or mass identified within the abdomen or pelvis. Non-specific interstitial opacities and bronchiectasis seen at the right
base, suggestive of post-inflammatory changes.""",
"""IMPRESSION: Evidence of early pulmonary vascular congestion and interstitial edema. Probable scarring at the medial aspect of the right lung base, with no
definite consolidation."""
,
"""IMPRESSION:
1. 2.0 cm cyst of the right renal lower pole. Otherwise, normal appearance
of the right kidney with patent vasculature and no sonographic evidence of
renal artery stenosis.
2. Surgically absent left kidney.""",
"""IMPRESSION: No pneumothorax.""",
"""IMPRESSION: No definite pneumothorax""",
"""IMPRESSION: New opacity at the left lower lobe consistent with pneumonia."""
]
In [3]:
modifiers = itemData.instantiateFromCSVtoitemData(
"https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_05042016.tsv")
targets = itemData.instantiateFromCSVtoitemData(
"https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/utah_crit.tsv")
In [4]:
def markup_sentence(s, modifiers, targets, prune_inactive=True):
    """
    Mark up sentence s with the given modifiers and targets and return the ConTextMarkup.
    """
    markup = pyConText.ConTextMarkup()
    markup.setRawText(s)
    markup.cleanText()
    markup.markItems(modifiers, mode="modifier")
    markup.markItems(targets, mode="target")
    markup.pruneMarks()
    markup.dropMarks('Exclusion')
    # apply modifiers to any targets within the modifier's scope
    markup.applyModifiers()
    markup.pruneSelfModifyingRelationships()
    if prune_inactive:
        markup.dropInactiveModifiers()
    return markup
In [5]:
reports[3]
Out[5]:
'IMPRESSION: No pneumothorax.'
In [6]:
markup = pyConText.ConTextMarkup()
In [7]:
isinstance(markup,nx.DiGraph)
Out[7]:
True
In [8]:
#### Set the text to be processed
In [9]:
markup.setRawText(reports[3].lower())
print(markup)
print(len(markup.getRawText()))
In [10]:
markup.cleanText()
print(markup)
print(len(markup.getText()))
The markItems method takes a list of itemData and uses the regular expressions to identify any instances of the itemData in the sentence. With the mode keyword we specify whether these itemData are targets or modifiers. This value will be stored as a data attribute of the node that is created in the graph for any identified concepts.
In [11]:
markup.markItems(modifiers, mode="modifier")
print(markup.nodes(data=True))
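# list() keeps this working on NetworkX versions where nodes() returns a view rather than a list
print(type(list(markup.nodes())[0]))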
One concept has been identified in the sentence: no. A tagObject is created for this concept, which keeps track of the actual phrase identified by the regular expression and the category of the itemData (definite_negated_existence); the category is a list because there can be multiple categories. There is also an absurdly long identifier for the node. Note that our mode, modifier, has been stored as a data element of the node. In NetworkX each node (or edge) has a dictionary for data.
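The marking step described above can be sketched with the standard `re` module. This is a toy stand-in, not the pyConTextNLP implementation: `mark_items`, the plain dictionaries used as nodes, and the two patterns are hypothetical names and values chosen for illustration.

```python
import re

def mark_items(text, items, mode):
    """Return one node dict per regex match; each node carries its own data."""
    nodes = []
    for category, pattern in items:
        for m in re.finditer(pattern, text):
            nodes.append({
                "phrase": m.group(0),    # the actual phrase that matched
                "span": m.span(),        # (start, end) character offsets
                "category": [category],  # a list, since several categories are possible
                "mode": mode,            # "modifier" or "target"
            })
    return nodes

text = "impression: no pneumothorax."
modifier_items = [("definite_negated_existence", r"\bno\b")]
target_items = [("pneumothorax", r"\bpneumothorax\b")]

nodes = (mark_items(text, modifier_items, "modifier")
         + mark_items(text, target_items, "target"))
for node in nodes:
    print(node)
```

Storing the mode on each node is what later lets modifiers and targets be told apart without re-running the regular expressions.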
In [12]:
markup.markItems(targets, mode="target")
In [13]:
print(markup.nodes(data=True))
In [14]:
markup.pruneMarks()
print(markup.nodes())
In [15]:
print(markup.edges())
In [16]:
markup.applyModifiers()
print(markup.edges())
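Edges appear only once modifiers are applied. A heavily simplified sketch of that idea follows, assuming a single forward-scope rule in which a modifier applies to every target that starts after it in the sentence. The function name, the node dictionaries, and the rule itself are assumptions for illustration; the scoping logic in pyConTextNLP is richer than this.

```python
def apply_modifiers(nodes):
    """Link each modifier to every target that starts after it (forward scope only)."""
    modifiers = [n for n in nodes if n["mode"] == "modifier"]
    targets = [n for n in nodes if n["mode"] == "target"]
    return [
        (m["phrase"], t["phrase"])
        for m in modifiers
        for t in targets
        if t["span"][0] >= m["span"][1]  # target begins after the modifier ends
    ]

# Hand-written nodes for the sentence "impression: no pneumothorax."
nodes = [
    {"phrase": "no", "span": (12, 14), "mode": "modifier"},
    {"phrase": "pneumothorax", "span": (15, 27), "mode": "target"},
]
edges = apply_modifiers(nodes)
print(edges)
```

Each returned pair plays the role of a directed edge from a modifier to a target it modifies.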
The value of pruning is shown in this notebook.