RadNLP

Radiology NLP or

Rad (as in cool) NLP or

[Fill in the Blank] NLP

© Brian E. Chapman, PhD

RadNLP is a package that builds upon the pyConTextNLP algorithm's sentence-level text processing to perform simple document-level classification. The package also contains a number of functions for identifying sections of reports, identifying and eliminating boilerplate text, etc.

In this notebook I will demonstrate radnlp's most basic functionality: given a text of interest, create an overall document classification. The classification uses a simple maximum function: for each concept marked in a report, the maximal schema value that occurs is selected to characterize the report for that concept.

Report Schema and maximal value

The document classification is based on a schema that combines multiple concepts (e.g. existence, certitude, severity) into a single ordinal scale. The RadNLP GitHub repository includes a knowledge base directory (KBs) that contains the schema we have previously developed for critical findings projects. It is included below:

# Lines that start with the # symbol are comments and are ignored
# The schema consists of a numeric value, followed by a label (e.g. "AMBIVALENT"), 
# followed by a Python express that can evaluate to True or False
# The Python expression uses LABELS from the rules. processReports.py will substitute 
# the LABEL with any matched values identified from 
# the corresponding rules
1,AMBIVALENT,DISEASE_STATE == 2
2,Negative/Certain/Acute,DISEASE_STATE == 0 and CERTAINTY_STATE == 1
3,Negative/Uncertain/Chronic,DISEASE_STATE == 0 and CERTAINTY_STATE == 0 and ACUTE_STATE == 0
4,Positive/Uncertain/Chronic,DISEASE_STATE == 1 and CERTAINTY_STATE == 0 and ACUTE_STATE == 0 
5,Positive/Certain/Chronic,DISEASE_STATE == 1 and CERTAINTY_STATE == 1 and ACUTE_STATE == 0 
6,Negative/Uncertain/Acute,DISEASE_STATE == 0 and CERTAINTY_STATE == 0 
7,Positive/Uncertain/Acute,DISEASE_STATE == 1 and CERTAINTY_STATE == 0 and ACUTE_STATE == 1 
8,Positive/Certain/Acute,DISEASE_STATE == 1 and CERTAINTY_STATE == 1 and ACUTE_STATE == 1

A key idea is that the third field of each schema entry is a Python expression that evaluates to True or False.
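As a rough illustration of the file format (a sketch, not radnlp's own reader), each non-comment line can be split into a numeric value, a label, and an expression string. `SCHEMA_TEXT`, `parse_schema`, and `parsed_schema` are hypothetical names introduced here, and only three of the schema lines are repeated to keep the example short.

```python
# Three lines copied from the schema listing; the rest are omitted for brevity.
SCHEMA_TEXT = """\
# Lines that start with the # symbol are comments and are ignored
1,AMBIVALENT,DISEASE_STATE == 2
2,Negative/Certain/Acute,DISEASE_STATE == 0 and CERTAINTY_STATE == 1
8,Positive/Certain/Acute,DISEASE_STATE == 1 and CERTAINTY_STATE == 1 and ACUTE_STATE == 1
"""


def parse_schema(text):
    """Hypothetical parser: map schema value -> (label, expression string)."""
    schema = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first two commas only; the expression is the remainder.
        value, label, expression = line.split(",", 2)
        schema[int(value)] = (label, expression.strip())
    return schema


parsed_schema = parse_schema(SCHEMA_TEXT)
for value, (label, expression) in parsed_schema.items():
    print(value, label, "->", expression)
```

Keeping the expression as a plain string at this stage defers evaluation until the values for a particular finding are known.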

The radnlp.schema subpackage contains functions for reading schema and applying the schema to the pyConTextNLP findings given rules specified by the user.

There are two key functions in radnlp.schema:

def instantiate_schema(values, rule):
    """
    evaluates rule by substituting values into rule and evaluating the resulting literal.
    This is currently insecure
        * "For security the ast.literal_eval() method should be used."
    """
    r = rule
    for k in values.keys():
        r = r.replace(k, values[k].__str__())
    #return ast.literal_eval(r)
    return eval(r)

def assign_schema(values, rules):
    """
    """
    for k in rules.keys():
        if instantiate_schema(values, rules[k][1]):
            return k
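
The docstring above flags `eval` as insecure. Note that `ast.literal_eval` only accepts literals, so it would reject comparison expressions such as `1 == 1 and 1 == 0`. One possible alternative, sketched here under the assumption that schema expressions only contain names, constants, comparisons, and `and`/`or`, is to parse the expression and evaluate just those node types. `safe_instantiate_schema`, `assign_schema_safely`, and `demo_rules` are hypothetical names and are not part of radnlp; the `(label, expression)` layout of each rule is assumed from the `rules[k][1]` lookup above.

```python
import ast
import operator

# Comparison operators this sketch is willing to evaluate.
_COMPARISONS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
}


def _evaluate(node, values):
    """Evaluate a restricted expression tree; reject anything unexpected."""
    if isinstance(node, ast.BoolOp):
        results = [_evaluate(v, values) for v in node.values]
        return all(results) if isinstance(node.op, ast.And) else any(results)
    if isinstance(node, ast.Compare):
        left = _evaluate(node.left, values)
        for op, comparator in zip(node.ops, node.comparators):
            if type(op) not in _COMPARISONS:
                raise ValueError("unsupported comparison operator")
            right = _evaluate(comparator, values)
            if not _COMPARISONS[type(op)](left, right):
                return False
            left = right
        return True
    if isinstance(node, ast.Name):
        return values[node.id]  # look up a label such as DISEASE_STATE
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError("unsupported expression element: " + type(node).__name__)


def safe_instantiate_schema(values, rule):
    """Hypothetical eval-free counterpart to instantiate_schema."""
    tree = ast.parse(rule.strip(), mode="eval")
    return bool(_evaluate(tree.body, values))


def assign_schema_safely(values, rules):
    """Return the first schema key whose expression is true, else None."""
    for key, (_label, expression) in rules.items():
        if safe_instantiate_schema(values, expression):
            return key
    return None


# A subset of the schema shown earlier, as value -> (label, expression).
demo_rules = {
    2: ("Negative/Certain/Acute", "DISEASE_STATE == 0 and CERTAINTY_STATE == 1"),
    7: ("Positive/Uncertain/Acute",
        "DISEASE_STATE == 1 and CERTAINTY_STATE == 0 and ACUTE_STATE == 1 "),
    8: ("Positive/Certain/Acute",
        "DISEASE_STATE == 1 and CERTAINTY_STATE == 1 and ACUTE_STATE == 1"),
}

positive = {"DISEASE_STATE": 1, "CERTAINTY_STATE": 1, "ACUTE_STATE": 1}
negative = {"DISEASE_STATE": 0, "CERTAINTY_STATE": 1, "ACUTE_STATE": 1}
print(assign_schema_safely(positive, demo_rules))  # 8
print(assign_schema_safely(negative, demo_rules))  # 2
```

Unlike the string substitution in `instantiate_schema`, this version looks labels up by name, so a label that happens to be a substring of another label cannot be replaced by accident.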

For any given category (e.g. pulmonary_embolism), the maximal schema score encountered in the report is selected to characterize that report.
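
The maximum rule can be sketched in a few lines. The input format used here, a list of `(category, schema value)` pairs with one pair per marked finding, is an assumption made for illustration and not radnlp's internal representation; `max_schema_by_category` is a hypothetical name.

```python
def max_schema_by_category(findings):
    """Keep the largest schema value seen for each target category."""
    best = {}
    for category, value in findings:
        if value is None:
            continue  # no schema entry matched this finding
        if category not in best or value > best[category]:
            best[category] = value
    return best


# Illustrative values only: two PE mentions and one aneurysm mention.
findings = [
    ("pulmonary_embolism", 2),
    ("pulmonary_embolism", 8),
    ("aneurysm", 5),
]
report_scores = max_schema_by_category(findings)
print(report_scores)  # {'pulmonary_embolism': 8, 'aneurysm': 5}
```

Because the schema is an ordinal scale, a single positive, certain, acute mention (8) outranks any number of negated mentions of the same concept.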

radnlp Rules

radnlp uses rule files to specify how particular pyConTextNLP findings relate to radnlp concepts. For example, in the classificationRules3.csv provided in KBs, we provide rules that state:

  • The default disease state is 1.
  • PROBABLE_EXISTENCE AND DEFINITE_EXISTENCE map to a disease state of 1

The rules as currently written are not fully general and reflect the particular use cases we were working on.

Types of Rules

We currently support three types of rules:

  1. CLASSIFICATION_RULE: these are the rules that relate to disease, temporality, and certainty.
  2. CATEGORY_RULE: these are only partially developed. They attempt to address combinatorics problems in pyConTextNLP by making default findings more general (e.g. infarct) and then creating more specific findings by attaching anatomic locations to them (e.g. an infarct becomes a critical finding when attached to an anatomic concept like brain_anatomy or heart_anatomy).
  3. SEVERITY_RULE: again, not fully developed, but relates to extracting severity concepts (e.g. large or 4.3 cm).

# Lines that start with the # symbol are comments and are ignored,,,,,,,,,,,,,
# processReport current has three types of rules: @CLASSIFICATION_RULE, @CATEGORY_RULE, and @SEVERITY_RULE,,,,,,,,,,,
# classification rules would be for things like disease_state, certainty_state, temporality state,,,,,,,,,,,
# For each classification_rule set," there is a rule label (e.g. ""DISEASE_STATE"". This must match",,,,,,,,,,,,
# the terms used in the schema file,,,,,,,,,,,,,
# Each rule set requires a DEFAULT which is the schema value to be returned if no rule conditions are satisifed,,,,,,,,,,,,,
# Each rule set has zero or more rules consisting of a schema value to be returned if the rule evaluates to true,,,,,,,,,,,,,
# A rule evalutes to true if the target is modified by one or more of the ConText CATEGORIES listed following,,,,,,,,,,,,,
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,0,DEFINITE_NEGATED_EXISTENCE,PROBABLE_NEGATED_EXISTENCE,FUTURE,INDICATION,PSEUDONEG,,,,,
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,2,AMBIVALENT_EXISTENCE,,,,,,,,,
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,1,PROBABLE_EXISTENCE,DEFINITE_EXISTENCE,,,,,,,,
@CLASSIFICATION_RULE,DISEASE_STATE,DEFAULT,1,,,,,,,,,,
@CLASSIFICATION_RULE,CERTAINTY_STATE,RULE,0,PROBABLE_NEGATED_EXISTENCE,AMBIVALENT_EXISTENCE,PROBABLE_EXISTENCE,,,,,,,
@CLASSIFICATION_RULE,CERTAINTY_STATE,DEFAULT,1,,,,,,,,,,
@CLASSIFICATION_RULE,ACUTE_STATE,RULE,0,HISTORICAL,,,,,,,,,
@CLASSIFICATION_RULE,ACUTE_STATE,DEFAULT,1,,,,,,,,,,
#CATEGORY_RULE rules specify what Findings (e.g. DVT) can have the category modified by the following ANATOMIC modifies,,,,,,,,,,,,,
@CATEGORY_RULE,DVT,LOWER_DEEP_VEIN,UPPER_DEEP_VEIN,HEPATIC_VEIN,PORTAL_SYSTEM_VEIN,PULMONARY_VEIN,RENAL_VEIN,SINUS_VEIN,LOWER_SUPERFICIAL_VEIN,UPPER_SUPERFICIAL_VEIN,VARICOCELE,ARTERIAL,NON_VASCULAR
@CATEGORY_RULE,INFARCT,BRAIN_ANATOMY,HEART_ANATOMY,OTHER_CRITICAL_ANATOMY,,,,,,,,,
@CATEGORY_RULE,ANEURYSM,AORTIC_ANATOMY,,,,,,,,,,,
#SEVERITY_RUlE specifiy which targets to try to obtain severity measures for,,,,,,,,,,,,,
@SEVERITY_RULE,AORTIC_ANATOMY_ANEURYSM,SEVERITY,,,,,,,,,,,
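
To make the `@CLASSIFICATION_RULE` lines concrete, the sketch below parses a few of them and applies them to the set of ConText categories modifying a target. This is not radnlp's implementation: `parse_classification_rules` and `classify_target` are hypothetical names, and the choice that the first matching rule wins when several match is an assumption.

```python
# A few lines copied from the rules listing above (trailing commas kept).
RULES_TEXT = """\
# Lines that start with the # symbol are comments and are ignored,,,
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,0,DEFINITE_NEGATED_EXISTENCE,PROBABLE_NEGATED_EXISTENCE,FUTURE,INDICATION,PSEUDONEG,,,,,
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,2,AMBIVALENT_EXISTENCE,,,,,,,,,
@CLASSIFICATION_RULE,DISEASE_STATE,DEFAULT,1,,,,,,,,,,
@CLASSIFICATION_RULE,CERTAINTY_STATE,RULE,0,PROBABLE_NEGATED_EXISTENCE,AMBIVALENT_EXISTENCE,PROBABLE_EXISTENCE,,,,,,,
@CLASSIFICATION_RULE,CERTAINTY_STATE,DEFAULT,1,,,,,,,,,,
@CLASSIFICATION_RULE,ACUTE_STATE,RULE,0,HISTORICAL,,,,,,,,,
@CLASSIFICATION_RULE,ACUTE_STATE,DEFAULT,1,,,,,,,,,,
"""


def parse_classification_rules(text):
    """Map each rule label to its default value and list of (value, categories)."""
    rule_sets = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(",") if f.strip()]
        if not fields or fields[0] != "@CLASSIFICATION_RULE":
            continue  # comments, blank lines, and other rule types
        _, label, kind, value, *categories = fields
        entry = rule_sets.setdefault(label, {"default": None, "rules": []})
        if kind == "DEFAULT":
            entry["default"] = int(value)
        else:
            entry["rules"].append((int(value), set(categories)))
    return rule_sets


def classify_target(rule_sets, modifier_categories):
    """Return a value per rule label given the categories modifying a target."""
    found = {c.upper() for c in modifier_categories}
    values = {}
    for label, entry in rule_sets.items():
        values[label] = entry["default"]
        for value, categories in entry["rules"]:
            if found & categories:  # any listed category modifies the target
                values[label] = value
                break  # assumption: first matching rule wins
    return values


rule_sets = parse_classification_rules(RULES_TEXT)
print(classify_target(rule_sets, []))
print(classify_target(rule_sets, ["probable_negated_existence"]))
print(classify_target(rule_sets, ["historical"]))
```

The resulting dictionary has the same shape as the `values` argument that the schema expressions are evaluated against, which is how the rules file and the schema file connect.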

In [ ]:
%matplotlib inline

Licensing

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Program Description


In [ ]:
import radnlp.rules as rules
import radnlp.schema as schema
import radnlp.utils as utils
import radnlp.classifier as classifier
import radnlp.split as split

from IPython.display import clear_output, display, HTML
from ipywidgets import interact, interactive, fixed  # IPython.html.widgets moved to ipywidgets
import io
import ipywidgets as widgets # Widget definitions
import pyConTextNLP.itemData as itemData

from pyConTextNLP.display.html import mark_document_with_html

Example Data

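Below are three example radiology reports. The first two are pulled from the MIMIC2 demo data set; the third is a copy of the second in which one finding has been rewritten to be negative for PE.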


In [ ]:
reports = ["""1.  Pulmonary embolism with  filling defects noted within the upper and lower
     lobar branches of the right main pulmonary artery.
     2.  Bilateral pleural effusions, greater on the left.
     3.  Ascites.
     4.  There is edema of the gallbladder wall, without any evidence of
     distention, intra- or extra-hepatic biliary dilatation.  This, along with
     stranding within the mesentery, likely represents third spacing of fluid.
     5.  There are several wedge shaped areas of decreased perfusion within the
     spleen, which may represent splenic infarcts.
     
     Results were discussed with Dr. [**First Name8 (NamePattern2) 15561**] [**Last Name (NamePattern1) 13459**] 
     at 8 pm on [**3099-11-6**].""",
           
    """1. Filling defects within the subsegmental arteries in the region
     of the left lower lobe and lingula and within the right lower lobe consistent
     with pulmonary emboli.
     
     2. Small bilateral pleural effusions with associated bibasilar atelectasis.
     
     3. Left anterior pneumothorax.
     
     4. No change in the size of the thoracoabdominal aortic aneurysm.
     
     5. Endotracheal tube 1.8 cm above the carina. NG tube within the stomach,
     although the tip is pointed superiorly toward the fundus.""",
           
    """1. There are no pulmonary emboli observed.
     
     2. Small bilateral pleural effusions with associated bibasilar atelectasis.
     
     3. Left anterior pneumothorax.
     
     4. No change in the size of the thoracoabdominal aortic aneurysm.
     
     5. Endotracheal tube 1.8 cm above the carina. NG tube within the stomach,
     although the tip is pointed superiorly toward the fundus."""
]

In [ ]:
#!python -m textblob.download_corpora

Define locations of knowledge, schema, and rules files


In [ ]:
def getOptions():
    """Generates arguments for specifying database and other parameters"""
    options = {}
    options['lexical_kb'] = ["https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_04292013.tsv", 
                             "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/criticalfinder_generalized_modifiers.tsv"]
    options['domain_kb'] = ["https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/pe_kb.tsv"]#[os.path.join(DATADIR2,"pe_kb.tsv")]
    options["schema"] = "https://raw.githubusercontent.com/chapmanbe/RadNLP/master/KBs/schema2.csv"#"file specifying schema"
    options["rules"] = "https://raw.githubusercontent.com/chapmanbe/RadNLP/master/KBs/classificationRules3.csv" # "file specifying sentence level rules")
    return options

Define report analysis

For every report we do two steps:

  1. Mark up all the sentences in the report based on the provided targets and modifiers.
  2. Given this markup, apply our rules and schema to generate a document classification.

radnlp provides functions to do both of these steps:

  1. radnlp.utils.mark_report takes lists of modifiers and targets and generates a pyConTextNLP document graph.
  2. radnlp.classifier.classify_document_targets takes the document graph, rules, and schema and generates a document classification for each identified concept.

Because pyConTextNLP operates on sentences, we split the report into sentences. In this function we use radnlp.split.get_sentences, which is simply a wrapper around textblob for splitting the sentences.


In [ ]:
def analyze_report(report, modifiers, targets, rules, schema):
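    """
    Given an individual radiology report, marks up its sentences with
    pyConTextNLP and returns the document-level classification for
    each identified target category.
    report: a text string containing the radiology report
    modifiers, targets: pyConTextNLP itemData collections
    rules: the rule sets returned by radnlp.rules.read_rules
    schema: the schema returned by radnlp.schema.read_schema
    """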
    
    markup = utils.mark_report(split.get_sentences(report),
                         modifiers,
                         targets)
    return  classifier.classify_document_targets(markup,
                                          rules[0],
                                          rules[1],
                                          rules[2],
                                          schema)

In [ ]:
def process_report(report):
    """Loads the rules, schema, and knowledge bases, then classifies one report."""
    options = getOptions()

    _radnlp_rules = rules.read_rules(options["rules"])
    _schema = schema.read_schema(options["schema"])
    #_schema = readSchema(options["schema"])
    modifiers = itemData.itemData()
    targets = itemData.itemData()
    for kb in options['lexical_kb']:
        modifiers.extend( itemData.instantiateFromCSVtoitemData(kb) )
    for kb in options['domain_kb']:
        targets.extend( itemData.instantiateFromCSVtoitemData(kb) )
    return analyze_report(report, modifiers, targets, _radnlp_rules, _schema)

In [ ]:
rslt_0 = process_report(reports[0])

radnlp.classifier.classify_document_targets returns a dictionary whose keys are the target categories (e.g. pulmonary_embolism) and whose values are 3-tuples containing:

  1. The schema category (e.g. 8 or 2).
  2. The XML representation of the maximal schema node.
  3. A list of severity values (usually empty, since this is not really implemented yet).
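
To make the shape of this return value concrete, here is a hand-built stand-in for such a dictionary together with a hypothetical helper, `summarize_result`, that maps each schema value to its label from the schema file. The XML string and the values in `mock_result` are placeholders, not actual radnlp output.

```python
# Labels copied from the schema file shown earlier.
SCHEMA_LABELS = {
    1: "AMBIVALENT",
    2: "Negative/Certain/Acute",
    3: "Negative/Uncertain/Chronic",
    4: "Positive/Uncertain/Chronic",
    5: "Positive/Certain/Chronic",
    6: "Negative/Uncertain/Acute",
    7: "Positive/Uncertain/Acute",
    8: "Positive/Certain/Acute",
}


def summarize_result(result, labels=SCHEMA_LABELS):
    """Map each target category to the label of its schema value ("NA" if none)."""
    summary = {}
    for category, (schema_value, _node_xml, _severity) in result.items():
        summary[category] = labels.get(schema_value, "NA")
    return summary


# Placeholder standing in for the output of classify_document_targets.
mock_result = {
    "pulmonary_embolism": (8, "<node>...</node>", []),
}
print(summarize_result(mock_result))  # {'pulmonary_embolism': 'Positive/Certain/Acute'}
```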

In [ ]:
for key, value in rslt_0.items():
    print(("%s"%key).center(42,"-"))
    for v in value:
        print(v)

In [ ]:
rslt_1 = process_report(reports[1])

for key, value in rslt_1.items():
    print(("%s"%key).center(42,"-"))
    for v in value:
        print(v)

Negative Report

For the third report I simply rewrote one of the findings to be negative for PE. We now see a change in the schema classification.


In [ ]:
rslt_2 = process_report(reports[2])

for key, value in rslt_2.items():
    print(("%s"%key).center(42,"-"))
    for v in value:
        print(v)

In [ ]:
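# NOTE: `pec` and `codingKey` are not defined anywhere in this notebook,
# so this cell raises a NameError if run as-is.
keys = sorted(pec.markups.keys())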

pec.reports.insert(pec.reports.columns.get_loc(u'markup')+1,
                   "ConText Coding",
                   [codingKey.get(pec.markups[k][1].get("pulmonary_embolism",[None])[0],"NA") for k in keys])