Demonstration of Basic Sentence Markup with pyConTextNLP

pyConTextNLP uses NetworkX directed graphs to represent the markup: nodes in the graph are the concepts identified in the sentence, and edges are the relationships between those concepts.


In [1]:
import pyConTextNLP.pyConTextGraph as pyConText
import pyConTextNLP.itemData as itemData
import networkx as nx
  • pyConTextGraph contains the bulk of the pyConTextNLP functionality, including basic class definitions such as the ConTextMarkup class that represents the markup of a sentence.
  • itemData contains a class definition for an itemData and functions for reading itemData definitions, which are assumed to be in a tab-separated file specified as either a local file or a remote resource. In this example we will read definitions straight from the GitHub repository.
    • An itemData in its most basic form is a four-tuple consisting of
      1. A literal (e.g. "pulmonary embolism", "no definite evidence of")
      2. A category (e.g. "CRITICAL_FINDING", "PROBABLE_EXISTENCE")
      3. A regular expression that defines how to identify the literal concept. If no regular expression is specified, a regular expression will be built directly from the literal by wrapping it with word boundaries (e.g. r"""\bpulmonary embolism\b""")
      4. A rule that defines how the concept works in the sentence (e.g. a negation term that looks forward in the sentence).
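
As a rough illustration of item 3, the following sketch builds a word-boundary pattern from a literal when no regular expression is supplied. It uses only the standard re module; build_pattern, the plain-tuple layout, the use of re.escape and re.IGNORECASE, and the "forward" rule string are assumptions made for this example, not the itemData implementation.

```python
import re

def build_pattern(literal, regex=""):
    """Compile the explicit regex if one is given, else wrap the literal in word boundaries."""
    source = regex if regex else r"\b" + re.escape(literal) + r"\b"
    return re.compile(source, re.IGNORECASE)

# Hypothetical four-tuples: (literal, category, regex, rule).
items = [
    ("pulmonary embolism", "CRITICAL_FINDING", "", ""),
    ("no", "DEFINITE_NEGATED_EXISTENCE", "", "forward"),
]

sentence = "No pulmonary embolism."
# Keep the (literal, category) pair of every item whose pattern occurs in the sentence.
matches = [
    (literal, category)
    for literal, category, regex, rule in items
    if build_pattern(literal, regex).search(sentence)
]
print(matches)
```

The word boundaries are what stop the literal "no" from matching inside a longer word such as "normal".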

Sentences

These example reports are taken (with modification) from the MIMIC2 demo data set, a publicly available database of de-identified medical records for deceased individuals.


In [2]:
reports = [
    """IMPRESSION: Evaluation limited by lack of IV contrast; however, no evidence of
      bowel obstruction or mass identified within the abdomen or pelvis. Non-specific interstitial opacities and bronchiectasis seen at the right
     base, suggestive of post-inflammatory changes.""",
    """IMPRESSION: Evidence of early pulmonary vascular congestion and interstitial edema. Probable scarring at the medial aspect of the right lung base, with no
     definite consolidation."""
    ,
    """IMPRESSION:
     
     1.  2.0 cm cyst of the right renal lower pole.  Otherwise, normal appearance
     of the right kidney with patent vasculature and no sonographic evidence of
     renal artery stenosis.
     2.  Surgically absent left kidney.""",
    """IMPRESSION:  No pneumothorax.""",
    """IMPRESSION: No definite pneumothorax""",
    """IMPRESSION:  New opacity at the left lower lobe consistent with pneumonia."""
]

Read the itemData definitions

We're reading directly from GitHub. You could read from a local file using a file:// URL.
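
If you want to try the local-file route, one way to build a file:// URL is with pathlib. The sketch below writes a small stand-in file, converts its path to a URL, and reads it back with urllib to confirm the URL resolves. The file name my_targets.tsv and its single row are made up for illustration and do not reflect the real knowledge-base contents.

```python
import tempfile
import urllib.request
from pathlib import Path

# Hypothetical local knowledge-base file with one made-up, tab-separated row.
tmp_dir = Path(tempfile.mkdtemp())
local_kb = tmp_dir / "my_targets.tsv"
local_kb.write_text("pneumothorax\tPNEUMOTHORAX\t\t\n", encoding="utf-8")

# Path.as_uri() requires an absolute path and yields a file:// URL.
local_url = local_kb.resolve().as_uri()

# Sanity check: the URL can be opened like any other URL.
with urllib.request.urlopen(local_url) as handle:
    first_line = handle.read().decode("utf-8").splitlines()[0]

print(local_url)
```

The resulting local_url string could then be passed in place of the GitHub URLs used in the next cell.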


In [3]:
modifiers = itemData.instantiateFromCSVtoitemData(
    "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_05042016.tsv")
targets = itemData.instantiateFromCSVtoitemData(
    "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/utah_crit.tsv")

Example function to analyze each sentence

This is the function we'll use for each report. The following sections of this document step through each line.


In [4]:
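def markup_sentence(s, modifiers, targets, prune_inactive=True):
    """
    Mark up a single sentence: tag modifiers and targets, prune marks,
    drop 'Exclusion' marks, apply modifiers to targets, and optionally
    drop inactive modifiers. Returns the ConTextMarkup graph.
    """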
    markup = pyConText.ConTextMarkup()
    markup.setRawText(s)
    markup.cleanText()
    markup.markItems(modifiers, mode="modifier")
    markup.markItems(targets, mode="target")
    markup.pruneMarks()
    markup.dropMarks('Exclusion')
    # apply modifiers to any targets within the modifiers scope
    markup.applyModifiers()
    markup.pruneSelfModifyingRelationships()
    if prune_inactive:
        markup.dropInactiveModifiers()
    return markup

We'll start with the simplest of our sentences.


In [5]:
reports[3]


Out[5]:
'IMPRESSION:  No pneumothorax.'

Marking up a sentence

We start by creating an instance of the ConTextMarkup class. This is a subclass of a NetworkX DiGraph. Information will be stored in the nodes and edges.


In [6]:
markup = pyConText.ConTextMarkup()

In [7]:
isinstance(markup,nx.DiGraph)


Out[7]:
True

Set the text to be processed

In [9]:
markup.setRawText(reports[3].lower())
print(markup)
print(len(markup.getRawText()))


__________________________________________
rawText: impression:  no pneumothorax.
cleanedText: None
__________________________________________

29

Clean the text

Prior to processing we do some basic cleaning of the text, such as replacing multiple white spaces with a single space. You'll notice this in the spacing between the colon and "no" in the raw and clean versions of the text.


In [10]:
markup.cleanText()
print(markup)
print(len(markup.getText()))


__________________________________________
rawText: impression:  no pneumothorax.
cleanedText: impression: no pneumothorax.
__________________________________________

28
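
The whitespace step can be approximated with a single regular-expression substitution. This is only a sketch of the idea using the standard re module; cleanText may perform additional cleaning that is not reproduced here.

```python
import re

raw_text = "impression:  no pneumothorax."

# Collapse every run of whitespace into one space and trim the ends.
cleaned_text = re.sub(r"\s+", " ", raw_text).strip()

print(cleaned_text)
print(len(raw_text), len(cleaned_text))  # prints "29 28"
```

The one-character difference corresponds to the double space after the colon in the raw text.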

Identify concepts in the sentence

The markItems method takes a list of itemData and uses the regular expressions to identify any instances of the itemData in the sentence. With the mode keyword we specify whether these itemData are targets or modifiers. This value will be stored as a data attribute of the node that is created in the graph for any identified concepts.


In [11]:
markup.markItems(modifiers, mode="modifier")
print(markup.nodes(data=True))
print(type(list(markup.nodes())[0]))  # list() keeps this working when nodes() returns a view


[(<id> 215455046329351668380259283016206441424 </id> <phrase> no </phrase> <category> ['definite_negated_existence'] </category> , {'category': 'modifier'})]
<class 'pyConTextNLP.pyConTextGraph.tagObject'>

What does our initial markup look like?

  • We've identified one concept in the sentence: no
  • We've created a tagObject for this concept. It keeps track of the actual phrase matched by the regular expression and the category of the itemData (definite_negated_existence); the category is stored as a list because there can be multiple categories. There is also an absurdly long identifier for the node. Note that our mode, modifier, has been stored as a data element of the node: in NetworkX each node (or edge) has a dictionary for data.
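
To make the shape of that information concrete, here is a simplified stand-in for the marking step written with plain dictionaries. The function mark_items, the dictionary keys, and the use of uuid1 for the identifier are assumptions for illustration; the real tagObject class and its identifier scheme may differ.

```python
import re
import uuid

def mark_items(text, items, mode):
    """Return one record per regex match; `mode` mimics the node's data attribute."""
    marks = []
    for literal, category in items:
        pattern = r"\b" + re.escape(literal) + r"\b"
        for match in re.finditer(pattern, text):
            marks.append({
                "id": uuid.uuid1().int,        # a long integer identifier (assumed scheme)
                "phrase": match.group(0),      # the text that was actually matched
                "categories": [category],      # a list, since several categories are possible
                "span": match.span(),          # character offsets in the cleaned text
                "data": {"category": mode},    # "modifier" or "target"
            })
    return marks

marks = mark_items(
    "impression: no pneumothorax.",
    [("no", "definite_negated_existence")],
    mode="modifier",
)
print(marks[0]["phrase"], marks[0]["span"], marks[0]["data"])
```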

Now let's mark up the targets


In [12]:
markup.markItems(targets, mode="target")

In [13]:
print(markup.nodes(data=True))


[(<id> 215496472358685501776852452868303018960 </id> <phrase> pneumothorax </phrase> <category> ['pneumothorax'] </category> , {'category': 'target'}), (<id> 215455046329351668380259283016206441424 </id> <phrase> no </phrase> <category> ['definite_negated_existence'] </category> , {'category': 'modifier'})]

What does our markup look like now?

We've added another node to the graph: this time, the target pneumothorax.

Prune Marks

After identifying concepts, we prune concepts that are a subset of another identified concept. This results in no changes here, but the importance will be shown later with a different sentence.
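
A minimal sketch of the pruning idea, assuming each mark carries a character span: a mark is dropped when its span lies inside a strictly longer span of another mark. The function prune_subsumed and the hand-written marks below are hypothetical; pruneMarks may apply further conditions that are not modeled here.

```python
def prune_subsumed(marks):
    """Keep only marks whose span is not contained in a strictly longer span."""
    kept = []
    for mark in marks:
        start, end = mark["span"]
        subsumed = any(
            other is not mark
            and other["span"][0] <= start
            and end <= other["span"][1]
            and (other["span"][1] - other["span"][0]) > (end - start)
            for other in marks
        )
        if not subsumed:
            kept.append(mark)
    return kept

# Hypothetical marks for the text "impression: no definite pneumothorax".
marks = [
    {"phrase": "no", "span": (12, 14)},
    {"phrase": "no definite", "span": (12, 23)},
    {"phrase": "pneumothorax", "span": (24, 36)},
]

pruned = prune_subsumed(marks)
print([mark["phrase"] for mark in pruned])
```

Here the shorter "no" would be discarded in favor of "no definite", which covers the same characters and more.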


In [14]:
markup.pruneMarks()
print(markup.nodes())


[<id> 215496472358685501776852452868303018960 </id> <phrase> pneumothorax </phrase> <category> ['pneumothorax'] </category> , <id> 215455046329351668380259283016206441424 </id> <phrase> no </phrase> <category> ['definite_negated_existence'] </category> ]

Are there any relationships in our markup?

We do not yet have any relationships (edges) between our concepts (target and modifier nodes).


In [15]:
print(markup.edges())


[]

Apply modifiers

We now call the applyModifiers method of the ConTextMarkup object to identify any relationships between the nodes.


In [16]:
markup.applyModifiers()
print(markup.edges())


[(<id> 215455046329351668380259283016206441424 </id> <phrase> no </phrase> <category> ['definite_negated_existence'] </category> , <id> 215496472358685501776852452868303018960 </id> <phrase> pneumothorax </phrase> <category> ['pneumothorax'] </category> )]

We now have a relationship!

We now have a directed edge from our no node to our pneumothorax node. This will be interpreted as pneumothorax being a definitely negated concept in the sentence.
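
The logic behind this edge can be sketched for the simplest case, a modifier that looks forward in the sentence: it is linked to every target that starts after it ends. The function apply_forward_modifiers and the dictionary layout are assumptions for illustration only; applyModifiers handles other rule types and scope limits that this sketch ignores.

```python
def apply_forward_modifiers(modifiers, targets):
    """Return (modifier, target) edges for every target that starts after a modifier ends."""
    edges = []
    for modifier in modifiers:
        for target in targets:
            if target["span"][0] >= modifier["span"][1]:
                edges.append((modifier["phrase"], target["phrase"]))
    return edges

# Hypothetical marks for the cleaned text "impression: no pneumothorax."
modifiers = [{"phrase": "no", "category": "definite_negated_existence", "span": (12, 14)}]
targets = [{"phrase": "pneumothorax", "category": "pneumothorax", "span": (15, 27)}]

edges = apply_forward_modifiers(modifiers, targets)

# A target counts as negated if it is reached by a negation modifier.
category_of = {m["phrase"]: m["category"] for m in modifiers}
negated = [t for m, t in edges if category_of[m] == "definite_negated_existence"]

print(edges)
print(negated)
```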

What's next?

The value of pruning is shown in this notebook.