RadNLP is a package that builds upon the pyConTextNLP algorithm's sentence-level text processing to perform simple document-level classification. The package also contains a number of functions for identifying sections of reports, identifying and eliminating boilerplate text, etc.
In this notebook I will demonstrate radnlp's most basic functionality: given a text of interest, create an overall document classification. The classification uses a simple maximum function: for each concept marked in a report, the maximal schema value that occurs is selected to characterize the report for that concept.
The document classification is based on a schema that combines multiple concepts (e.g. existence, certitude, severity) into a single ordinal scale. The RadNLP GitHub repository includes a knowledge base directory (KBs) that contains the schema we have previously developed for critical findings projects. It is included below:
# Lines that start with the # symbol are comments and are ignored
# The schema consists of a numeric value, followed by a label (e.g. "AMBIVALENT"),
# followed by a Python express that can evaluate to True or False
# The Python expression uses LABELS from the rules. processReports.py will substitute
# the LABEL with any matched values identified from
# the corresponding rules
1,AMBIVALENT,DISEASE_STATE == 2
2,Negative/Certain/Acute,DISEASE_STATE == 0 and CERTAINTY_STATE == 1
3,Negative/Uncertain/Chronic,DISEASE_STATE == 0 and CERTAINTY_STATE == 0 and ACUTE_STATE == 0
4,Positive/Uncertain/Chronic,DISEASE_STATE == 1 and CERTAINTY_STATE == 0 and ACUTE_STATE == 0
5,Positive/Certain/Chronic,DISEASE_STATE == 1 and CERTAINTY_STATE == 1 and ACUTE_STATE == 0
6,Negative/Uncertain/Acute,DISEASE_STATE == 0 and CERTAINTY_STATE == 0
7,Positive/Uncertain/Acute,DISEASE_STATE == 1 and CERTAINTY_STATE == 0 and ACUTE_STATE == 1
8,Positive/Certain/Acute,DISEASE_STATE == 1 and CERTAINTY_STATE == 1 and ACUTE_STATE == 1
A key idea is that the third field of each schema line is a Python expression that evaluates to True or False.
The radnlp.schema subpackage contains functions for reading a schema and applying it to the pyConTextNLP findings, given rules specified by the user.
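To make the file layout concrete, here is a minimal sketch of how schema lines like those above could be read into a dictionary. The helper `parse_schema` and the name `schema_sketch` are hypothetical and are not part of radnlp; the `(label, expression)` layout is only inferred from how the schema is indexed later in this notebook, and `radnlp.schema.read_schema` may work differently.

```python
# A shortened copy of the schema file shown above, used only for illustration.
SCHEMA_TEXT = """\
# Lines that start with the # symbol are comments and are ignored
1,AMBIVALENT,DISEASE_STATE == 2
2,Negative/Certain/Acute,DISEASE_STATE == 0 and CERTAINTY_STATE == 1
8,Positive/Certain/Acute,DISEASE_STATE == 1 and CERTAINTY_STATE == 1 and ACUTE_STATE == 1
"""


def parse_schema(text):
    """Hypothetical helper: map each schema value to a (label, expression) pair."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first two commas only, so the expression stays intact.
        value, label, expression = line.split(",", 2)
        parsed[int(value)] = (label, expression.strip())
    return parsed


schema_sketch = parse_schema(SCHEMA_TEXT)
for value, (label, expression) in sorted(schema_sketch.items()):
    print(value, label, "<-", expression)
```

Keying the dictionary by the numeric schema value keeps the ordinal scale explicit, which is what the maximum-based document classification relies on.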
There are two key functions in radnlp.schema:
def instantiate_schema(values, rule):
    """
    evaluates rule by substituting values into rule and evaluating the resulting literal.
    This is currently insecure
    * "For security the ast.literal_eval() method should be used."
    """
    r = rule
    for k in values.keys():
        r = r.replace(k, values[k].__str__())
    #return ast.literal_eval(r)
    return eval(r)

def assign_schema(values, rules):
    """
    """
    for k in rules.keys():
        if instantiate_schema(values, rules[k][1]):
            return k
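The docstring above flags `eval` as insecure. Note that `ast.literal_eval` only accepts literals, so it would reject a substituted rule such as `1 == 1 and 0 == 0`. As a sketch of one possible alternative (not what radnlp does), the rule can be parsed with `ast` and evaluated by hand, allowing only integer constants, `==` comparisons, and `and`/`or`. The names `safe_eval_rule` and `instantiate_schema_safe` are hypothetical.

```python
import ast


def safe_eval_rule(expr):
    """Evaluate integer equality tests joined by and/or; reject everything else."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.BoolOp):
            results = [_eval(v) for v in node.values]
            return all(results) if isinstance(node.op, ast.And) else any(results)
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and isinstance(node.ops[0], ast.Eq)):
            return _eval(node.left) == _eval(node.comparators[0])
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        raise ValueError("unsupported expression: %r" % expr)
    return _eval(ast.parse(expr, mode="eval"))


def instantiate_schema_safe(values, rule):
    """Same label substitution as instantiate_schema, without calling eval."""
    for label, value in values.items():
        rule = rule.replace(label, str(value))
    return safe_eval_rule(rule)


values = {"DISEASE_STATE": 1, "CERTAINTY_STATE": 0, "ACUTE_STATE": 1}
rule = "DISEASE_STATE == 1 and CERTAINTY_STATE == 0 and ACUTE_STATE == 1"
print(instantiate_schema_safe(values, rule))
```

Anything outside the small whitelisted grammar, such as a function call, raises `ValueError` instead of being executed.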
For any given category (e.g. pulmonary_embolism), the maximal schema score encountered in the report is selected to characterize that report.
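The maximum rule can be sketched in a few lines. This is an illustration only: `classify_by_max` and the sample scores are hypothetical, and the real classifier works on pyConTextNLP markup rather than on plain tuples.

```python
def classify_by_max(scored_findings):
    """Keep the highest schema value seen for each category.

    scored_findings: iterable of (category, schema_value) pairs, one per
    finding in the report.
    """
    best = {}
    for category, value in scored_findings:
        if category not in best or value > best[category]:
            best[category] = value
    return best


# Hypothetical sentence-level results for a single report.
findings = [
    ("pulmonary_embolism", 2),  # e.g. a negated mention
    ("pulmonary_embolism", 8),  # e.g. a positive, certain, acute mention
    ("aneurysm", 5),
]
print(classify_by_max(findings))
```

Because the schema is an ordinal scale, a single high-valued mention decides the document-level label for that category, regardless of how many lower-valued mentions accompany it.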
radnlp Rules
radnlp uses rule files to specify rules that define how particular pyConTextNLP findings relate to radnlp concepts. For example, in the classificationRules3.csv provided in KBs, we provide rules stating that PROBABLE_EXISTENCE and DEFINITE_EXISTENCE map to a disease state of 1.
Rules as currently implemented are not fully general and reflect the particular use cases we were working on. There are three types of rules:
CLASSIFICATION_RULE: these are the rules that relate to disease, temporality, and certainty.
CATEGORY_RULE: these are only partially developed concepts that attempt to address combinatorics problems in pyConTextNLP by making default findings more general (e.g. infarct) and then creating more specific findings by attaching anatomic locations to them (e.g. an infarct becomes a critical finding when attached to an anatomic concept like brain_anatomy or heart_anatomy).
SEVERITY_RULE: again, not fully developed, but relates to extracting severity concepts (e.g. large or 4.3 cm).
# Lines that start with the # symbol are comments and are ignored,,,,,,,,,,,,,
# processReport current has three types of rules: @CLASSIFICATION_RULE, @CATEGORY_RULE, and @SEVERITY_RULE,,,,,,,,,,,
# classification rules would be for things like disease_state, certainty_state, temporality state,,,,,,,,,,,
# For each classification_rule set," there is a rule label (e.g. ""DISEASE_STATE"". This must match",,,,,,,,,,,,
# the terms used in the schema file,,,,,,,,,,,,,
# Each rule set requires a DEFAULT which is the schema value to be returned if no rule conditions are satisifed,,,,,,,,,,,,,
# Each rule set has zero or more rules consisting of a schema value to be returned if the rule evaluates to true,,,,,,,,,,,,,
# A rule evalutes to true if the target is modified by one or more of the ConText CATEGORIES listed following,,,,,,,,,,,,,
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,0,DEFINITE_NEGATED_EXISTENCE,PROBABLE_NEGATED_EXISTENCE,FUTURE,INDICATION,PSEUDONEG,,,,,
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,2,AMBIVALENT_EXISTENCE,,,,,,,,,
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,1,PROBABLE_EXISTENCE,DEFINITE_EXISTENCE,,,,,,,,
@CLASSIFICATION_RULE,DISEASE_STATE,DEFAULT,1,,,,,,,,,,
@CLASSIFICATION_RULE,CERTAINTY_STATE,RULE,0,PROBABLE_NEGATED_EXISTENCE,AMBIVALENT_EXISTENCE,PROBABLE_EXISTENCE,,,,,,,
@CLASSIFICATION_RULE,CERTAINTY_STATE,DEFAULT,1,,,,,,,,,,
@CLASSIFICATION_RULE,ACUTE_STATE,RULE,0,HISTORICAL,,,,,,,,,
@CLASSIFICATION_RULE,ACUTE_STATE,DEFAULT,1,,,,,,,,,,
#CATEGORY_RULE rules specify what Findings (e.g. DVT) can have the category modified by the following ANATOMIC modifies,,,,,,,,,,,,,
@CATEGORY_RULE,DVT,LOWER_DEEP_VEIN,UPPER_DEEP_VEIN,HEPATIC_VEIN,PORTAL_SYSTEM_VEIN,PULMONARY_VEIN,RENAL_VEIN,SINUS_VEIN,LOWER_SUPERFICIAL_VEIN,UPPER_SUPERFICIAL_VEIN,VARICOCELE,ARTERIAL,NON_VASCULAR
@CATEGORY_RULE,INFARCT,BRAIN_ANATOMY,HEART_ANATOMY,OTHER_CRITICAL_ANATOMY,,,,,,,,,
@CATEGORY_RULE,ANEURYSM,AORTIC_ANATOMY,,,,,,,,,,,
#SEVERITY_RUlE specifiy which targets to try to obtain severity measures for,,,,,,,,,,,,,
@SEVERITY_RULE,AORTIC_ANATOMY_ANEURYSM,SEVERITY,,,,,,,,,,,
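As a rough illustration of how the @CLASSIFICATION_RULE lines above could be applied, the sketch below parses them and maps a set of ConText modifier categories to DISEASE_STATE, CERTAINTY_STATE, and ACUTE_STATE values. The helpers `parse_classification_rules` and `apply_rules` are hypothetical, and I assume here that the first matching RULE wins and that the DEFAULT applies otherwise; the actual logic in radnlp may differ.

```python
import csv
import io

# The @CLASSIFICATION_RULE lines from the rule file above (trailing commas trimmed).
RULES_TEXT = """\
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,0,DEFINITE_NEGATED_EXISTENCE,PROBABLE_NEGATED_EXISTENCE,FUTURE,INDICATION,PSEUDONEG
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,2,AMBIVALENT_EXISTENCE
@CLASSIFICATION_RULE,DISEASE_STATE,RULE,1,PROBABLE_EXISTENCE,DEFINITE_EXISTENCE
@CLASSIFICATION_RULE,DISEASE_STATE,DEFAULT,1
@CLASSIFICATION_RULE,CERTAINTY_STATE,RULE,0,PROBABLE_NEGATED_EXISTENCE,AMBIVALENT_EXISTENCE,PROBABLE_EXISTENCE
@CLASSIFICATION_RULE,CERTAINTY_STATE,DEFAULT,1
@CLASSIFICATION_RULE,ACUTE_STATE,RULE,0,HISTORICAL
@CLASSIFICATION_RULE,ACUTE_STATE,DEFAULT,1
"""


def parse_classification_rules(text):
    """Hypothetical parser: {label: {"DEFAULT": int, "RULES": [(value, categories)]}}."""
    parsed = {}
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0] != "@CLASSIFICATION_RULE":
            continue
        label, kind, value = row[1], row[2], int(row[3])
        entry = parsed.setdefault(label, {"DEFAULT": None, "RULES": []})
        if kind == "DEFAULT":
            entry["DEFAULT"] = value
        else:
            categories = {c.lower() for c in row[4:] if c}
            entry["RULES"].append((value, categories))
    return parsed


def apply_rules(parsed, modifier_categories):
    """Assumption: the first rule sharing a category with the modifiers wins."""
    found = {c.lower() for c in modifier_categories}
    states = {}
    for label, entry in parsed.items():
        states[label] = entry["DEFAULT"]
        for value, categories in entry["RULES"]:
            if found & categories:
                states[label] = value
                break
    return states


rules_sketch = parse_classification_rules(RULES_TEXT)
print(apply_rules(rules_sketch, ["probable_existence"]))
print(apply_rules(rules_sketch, ["definite_negated_existence", "historical"]))
```

The resulting dictionary has the same shape as the `values` argument that `instantiate_schema` substitutes into a schema expression.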
In [ ]:
%matplotlib inline
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
In [ ]:
import radnlp.rules as rules
import radnlp.schema as schema
import radnlp.utils as utils
import radnlp.classifier as classifier
import radnlp.split as split
from IPython.display import clear_output, display, HTML
from ipywidgets import interact, interactive, fixed
import io
import ipywidgets as widgets # Widget definitions
import pyConTextNLP.itemData as itemData
from pyConTextNLP.display.html import mark_document_with_html
In [ ]:
reports = ["""1. Pulmonary embolism with filling defects noted within the upper and lower
lobar branches of the right main pulmonary artery.
2. Bilateral pleural effusions, greater on the left.
3. Ascites.
4. There is edema of the gallbladder wall, without any evidence of
distention, intra- or extra-hepatic biliary dilatation. This, along with
stranding within the mesentery, likely represents third spacing of fluid.
5. There are several wedge shaped areas of decreased perfusion within the
spleen, which may represent splenic infarcts.
Results were discussed with Dr. [**First Name8 (NamePattern2) 15561**] [**Last Name (NamePattern1) 13459**]
at 8 pm on [**3099-11-6**].""",
"""1. Filling defects within the subsegmental arteries in the region
of the left lower lobe and lingula and within the right lower lobe consistent
with pulmonary emboli.
2. Small bilateral pleural effusions with associated bibasilar atelectasis.
3. Left anterior pneumothorax.
4. No change in the size of the thoracoabdominal aortic aneurysm.
5. Endotracheal tube 1.8 cm above the carina. NG tube within the stomach,
although the tip is pointed superiorly toward the fundus.""",
"""1. There are no pulmonary emboli observed.
2. Small bilateral pleural effusions with associated bibasilar atelectasis.
3. Left anterior pneumothorax.
4. No change in the size of the thoracoabdominal aortic aneurysm.
5. Endotracheal tube 1.8 cm above the carina. NG tube within the stomach,
although the tip is pointed superiorly toward the fundus."""
]
In [ ]:
#!python -m textblob.download_corpora
In [ ]:
def getOptions():
    """Generates arguments for specifying database and other parameters"""
    options = {}
    options['lexical_kb'] = ["https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_04292013.tsv",
                             "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/criticalfinder_generalized_modifiers.tsv"]
    options['domain_kb'] = ["https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/pe_kb.tsv"]#[os.path.join(DATADIR2,"pe_kb.tsv")]
    options["schema"] = "https://raw.githubusercontent.com/chapmanbe/RadNLP/master/KBs/schema2.csv"#"file specifying schema"
    options["rules"] = "https://raw.githubusercontent.com/chapmanbe/RadNLP/master/KBs/classificationRules3.csv" # "file specifying sentence level rules")
    return options
For every report we do two steps: mark up the report with pyConTextNLP, then classify the document based on that markup. radnlp provides functions to do both of these steps:
radnlp.utils.mark_report takes lists of modifiers and targets and generates a pyConTextNLP document graph.
radnlp.classifier.classify_document_targets takes the document graph, rules, and schema and generates a document classification for each identified concept.
Because pyConTextNLP operates on sentences, we split the report into sentences. In the function below we use radnlp.split.get_sentences, which is simply a wrapper around textblob for splitting the sentences.
In [ ]:
def analyze_report(report, modifiers, targets, rules, schema):
    """
    given an individual radiology report, creates a pyConTextGraph
    object that contains the context markup and returns the document
    classification for each identified target
    report: a text string containing the radiology report
    """
    markup = utils.mark_report(split.get_sentences(report),
                               modifiers,
                               targets)
    return classifier.classify_document_targets(markup,
                                                rules[0],
                                                rules[1],
                                                rules[2],
                                                schema)
In [ ]:
def process_report(report):
    options = getOptions()
    _radnlp_rules = rules.read_rules(options["rules"])
    _schema = schema.read_schema(options["schema"])
    #_schema = readSchema(options["schema"])
    modifiers = itemData.itemData()
    targets = itemData.itemData()
    for kb in options['lexical_kb']:
        modifiers.extend( itemData.instantiateFromCSVtoitemData(kb) )
    for kb in options['domain_kb']:
        targets.extend( itemData.instantiateFromCSVtoitemData(kb) )
    return analyze_report(report, modifiers, targets, _radnlp_rules, _schema)
In [ ]:
rslt_0 = process_report(reports[0])
radnlp.classifier.classify_document_targets returns a dictionary whose keys are the target categories (e.g. pulmonary_embolism) and whose values are 3-tuples. The cell below prints each element of the tuple on its own line:
In [ ]:
for key, value in rslt_0.items():
    print(("%s"%key).center(42,"-"))
    for v in value:
        print(v)
In [ ]:
rslt_1 = process_report(reports[1])
for key, value in rslt_1.items():
    print(("%s"%key).center(42,"-"))
    for v in value:
        print(v)
In [ ]:
rslt_2 = process_report(reports[2])
for key, value in rslt_2.items():
    print(("%s"%key).center(42,"-"))
    for v in value:
        print(v)
In [ ]:
# NOTE: pec and codingKey are not defined in this notebook and must be defined before running this cell.
keys = list(pec.markups.keys())
keys.sort()
pec.reports.insert(pec.reports.columns.get_loc(u'markup')+1,
                   "ConText Coding",
                   [codingKey.get(pec.markups[k][1].get("pulmonary_embolism",[None])[0],"NA") for k in keys])