Classifying Documents

In this notebook we demonstrate a basic document-level classification of reports with respect to a single finding (pulmonary embolism). We leverage the convenience of Pandas to read our data from a MySQL database and then use Pandas to add our classification as a new column in the DataFrame.

Many of the common pyConTextNLP tasks have been wrapped into functions contained in the radnlp package. We import multiple modules that will allow us to write concise code.


In [2]:
import pyConTextNLP.pyConTextGraph as pyConText
import pyConTextNLP.itemData as itemData
import pymysql
import numpy as np
import os
import radnlp.io  as rio
import radnlp.view as rview
import radnlp.rules as rules
import radnlp.schema as schema
import radnlp.utils as utils
import radnlp.split as split
import radnlp.classifier as classifier
import pandas as pd
from IPython.display import clear_output, display, HTML, Image
from IPython.html.widgets import interact, interactive, fixed
from IPython.display import clear_output
import ipywidgets as widgets
from radnlp.data import classrslts 
import networkx as nx


/opt/conda/lib/python3.5/site-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)

In [3]:
conn = pymysql.connect(host="mysql",
                       port=3306,user="jovyan",
                       passwd='jovyan',db='mimic2')

Annotation coloring scheme

Later in the notebook we are going to provide an HTML display of the marked up documents. Here we specify what colors we want to use for each category we have defined.


In [15]:
colors = {"critical_finding": "blue",
          "pneumonia": "blue",
          "pneumothorax": "blue",
          "diverticulitis": "blue",
          "definite_negated_existence": "red",
          "probable_negated_existence": "indianred",
          "ambivalent_existence": "orange",
          "probable_existence": "forestgreen",
          "definite_existence": "green",
          "historical": "goldenrod",
          "indication": "pink",
          "acute": "gold"}

Explanation of getOptions

This function is essentially a port of a command-line application where argparse would collect these options; here we simply return them in a dictionary.


In [37]:
def getOptions():
    """Generates arguments for specifying database and other parameters"""
    options = {}
    options['lexical_kb'] = ["https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_nlm.tsv"]
    options["schema"] = "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/schema2.csv"
    options["rules"] = "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/classificationRules3.csv" 
    return options

Define our domain terms

We need to define four-tuples for the targets we want to find and the modifiers we are interested in.

Each four-tuple contains the following items:

  1. a lexical term (e.g. "pulmonary embolism")
  2. a category (e.g. "critical_finding")
  3. a regular expression to capture the lexical term (can be an empty string)
  4. a rule that describes which direction the modifier operates in the sentence (an empty string for a target)

Edit these lines in the code block below

targets.extend([["pulmonary embolism", "CRITICAL_FINDING", "", ""],
                ["pneumonia", "CRITICAL_FINDING", "", ""]])
modifiers.extend((["no definite", "PROBABLE_NEGATED_EXISTENCE", "", "forward"],))

In [38]:
def get_kb_rules_schema(options):
    """
    Get the relevant kb, rules, and schema.
    
    """

    _radnlp_rules = rules.read_rules(options["rules"])
    _schema = schema.read_schema(options["schema"])
    
    modifiers = itemData.itemData()
    targets = itemData.itemData()
    for kb in options['lexical_kb']:
        modifiers.extend( itemData.instantiateFromCSVtoitemData(kb) )
    targets.extend([["pulmonary embolism", "CRITICAL_FINDING", "", ""],
                    ["pneumonia", "CRITICAL_FINDING", "", ""]])
    modifiers.extend((["no definite", "PROBABLE_NEGATED_EXISTENCE", "", "forward"],))
    return {"rules":_radnlp_rules,
            "schema":_schema,
            "modifiers":modifiers,
            "targets":targets}

In [39]:
def analyze_report(report, modifiers, targets, rules, schema):
    """
    given an individual radiology report, creates a pyConTextGraph
    object that contains the context markup
    report: a text string containing the radiology reports
    """
    markup = utils.mark_report(split.get_sentences(report),
                         modifiers,
                         targets)
    
    clssfy = classifier.classify_document_targets(markup,
                                                  rules[0],
                                                  rules[1],
                                                  rules[2],
                                                  schema)
    return classrslts(context_document=markup, exam_type="ctpa", report_text=report, classification_result=clssfy)

In [40]:
options = getOptions()
kb = get_kb_rules_schema(options)
#data = data.dropna()

Read the Data

We are using the MIMIC II clinical database, a comprehensive electronic medical record for ICU patients. We use only the MIMIC2 demo data set, which requires no data use agreement because the records come from roughly 4,000 deceased individuals. The data are stored in a MySQL database. We use Pandas to read the reports in, limiting ourselves to radiology reports for individuals with an ICD-9 code for pulmonary embolism.


In [41]:
data = \
pd.read_sql("""SELECT noteevents.subject_id, 
                      noteevents.text, 
                      icd9.code 
               FROM noteevents INNER JOIN icd9 ON 
                      noteevents.subject_id = icd9.subject_id 
               WHERE (   icd9.code LIKE '415.1%'
                      ) 
                      AND noteevents.category = 'RADIOLOGY_REPORT'""",
            conn).drop_duplicates()
data.head(10)


Out[41]:
subject_id text code
0 150 \n\n\n DATE: [**2721-6-30**] 9:45 PM\n ... 415.19
1 150 \n\n\n DATE: [**2721-7-1**] 4:01 PM\n ... 415.19
2 150 \n\n\n DATE: [**2721-7-2**] 4:17 PM\n ... 415.19
3 150 \n\n\n BONE SCAN ... 415.19
4 150 \n\n\n DATE: [**2721-7-12**] 5:24 PM\n ... 415.19
5 1165 \n\n\n DATE: [**3099-10-20**] 5:55 PM\n ... 415.11
6 1165 \n\n\n PERSANTINE MIBI ... 415.11
7 1165 \n\n\n DATE: [**3099-10-21**] 12:07 AM\n ... 415.11
8 1165 \n\n\n DATE: [**3099-10-23**] 9:52 AM\n ... 415.11
9 1165 \n\n\n DATE: [**3099-10-23**] 1:34 PM\n ... 415.11

Find Impression Sections and Do Simple Search for Chest and CT

While these radiology reports are all for patients with a pulmonary embolism diagnosis, the imaging studies were ordered for many different reasons. To enrich our data set, we do some simple keyword filtering to select chest CT reports.


In [42]:
doc_split=["IMPRESSION:", "INTERPRETATION:", "CONCLUSION:"]

def find_impression(text, split):
    for term in split:
        if term in text:
            return text.split(term)[1]
    return np.NaN

data["impression"] = data.apply(lambda row: find_impression(row["text"], doc_split), axis=1)
data = data.dropna(axis=0, inplace=False)
data["chest"] = data.apply(lambda x: 'chest' in x["text"].lower() and 'ct' in x["text"].lower(), axis=1)
data = data[data["chest"] == True]
data = data.reset_index()
print(data.shape)
data.head()


(661, 6)
Out[42]:
index subject_id text code impression chest
0 0 150 \n\n\n DATE: [**2721-6-30**] 9:45 PM\n ... 415.19 Small focal opacity in right upper lobe and ... True
1 5 1165 \n\n\n DATE: [**3099-10-20**] 5:55 PM\n ... 415.11 Limited study. The tracheal component of th... True
2 10 1165 \n\n\n DATE: [**3099-10-24**] 3:59 PM\n ... 415.11 \n \n Tracheal stent extending from th... True
3 18 1165 \n\n\n DATE: [**3099-11-4**] 11:04 PM\n ... 415.11 \n \n Increased density in the retroca... True
4 25 1165 \n\n\n DATE: [**3099-11-6**] 5:36 PM\n ... 415.11 \n 1. Pulmonary embolism with filling de... True

Document Markup and Classification

We are going to do a lot behind the scenes here. We have code that takes our lexical and domain definitions, marks up our documents according to those definitions, and uses pyConTextNLP to assign relationships between the concepts in each sentence of the document. Finally, we apply a schema that defines how to classify our documents based on the sentence-level markup. The schema we use here is an ordinal schema that accounts for existence, uncertainty, and acuity. The document classification is essentially a maximum function: the document is assigned the maximum score of any sentence in the document.
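A minimal, self-contained sketch of this maximum rule follows. The schema values here are hypothetical illustrations, not radnlp's actual scores; only the ordering (stronger positive assertions get higher values) matters for the idea.

```python
# Hypothetical ordinal schema: higher value = stronger positive assertion.
schema_scores = {
    "definite_negated_existence": 1,
    "probable_negated_existence": 2,
    "ambivalent_existence": 3,
    "probable_existence": 4,
    "definite_existence": 5,
}

def classify_document(sentence_categories):
    """The document score is the maximum schema value across its sentences."""
    return max(schema_scores[c] for c in sentence_categories)

print(classify_document(["definite_negated_existence",
                         "probable_existence",
                         "ambivalent_existence"]))  # 4
```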

We now need to apply our schema to the reports. Since our data is in a Pandas data frame, the easiest way to process our reports is with the DataFrame apply method.

  • We use lambda to create an anonymous function that applies analyze_report to the "impression" column with the modifiers, targets, rules, and schema we read in separately.
  • analyze_report returns a classrslts named tuple whose classification_result is a dictionary; its keys are the identified targets defined in the "targets" file and its values are tuples containing:
    • the schema value selected for the document
    • the node (evidence) used to select that schema value
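As a hedged sketch of consuming one of these results, the snippet below uses a stand-in named tuple with the same field names as radnlp.data.classrslts (as used in analyze_report above); the example target, schema value, and evidence string are illustrative assumptions, not real output.

```python
from collections import namedtuple

# Stand-in for radnlp.data.classrslts, using the same field names as in
# analyze_report above.
ClassRslts = namedtuple("ClassRslts",
                        ["context_document", "exam_type",
                         "report_text", "classification_result"])

# classification_result maps target category -> (schema value, evidence),
# per the description above. The values here are made up for illustration.
rslt = ClassRslts(context_document=None,
                  exam_type="ctpa",
                  report_text="...",
                  classification_result={"critical_finding": (4, "pulmonary embolism")})

for target, (value, evidence) in rslt.classification_result.items():
    print(target, value, evidence)
```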

In [46]:
data["pe rslt"] = \
    data.apply(lambda x: analyze_report(x["impression"], 
                                         kb["modifiers"], 
                                         kb["targets"],
                                         kb["rules"],
                                         kb["schema"]), axis=1)
data.head()


Out[46]:
index subject_id text code impression chest pe rslt
0 0 150 \n\n\n DATE: [**2721-6-30**] 9:45 PM\n ... 415.19 Small focal opacity in right upper lobe and ... True (__________________________________________\n,...
1 5 1165 \n\n\n DATE: [**3099-10-20**] 5:55 PM\n ... 415.11 Limited study. The tracheal component of th... True (__________________________________________\n,...
2 10 1165 \n\n\n DATE: [**3099-10-24**] 3:59 PM\n ... 415.11 \n \n Tracheal stent extending from th... True (__________________________________________\n,...
3 18 1165 \n\n\n DATE: [**3099-11-4**] 11:04 PM\n ... 415.11 \n \n Increased density in the retroca... True (__________________________________________\n,...
4 25 1165 \n\n\n DATE: [**3099-11-6**] 5:36 PM\n ... 415.11 \n 1. Pulmonary embolism with filling de... True (__________________________________________\n,...

In [44]:
def view_markup(reports, colors):
    @interact(i=widgets.IntSlider(min=0, max=len(reports)-1))
    def _view_markup(i):
        markup = reports["pe rslt"][i]
        rview.markup_to_pydot(markup)
        display(Image("tmp.png"))
        mt = rview.markup_to_html(markup, color_map=colors)

        display(HTML(mt))

In [45]:
view_markup(data, colors)


PE Finder Case Review

Report:

1) Satisfactory position of lines and tubes as described above. No pneumothorax. 2) Progression of diffuse bilateral air space disease, most prominent in left mid and lower lung zone and right apex. This most likely represents multifocal pneumonia in a patient with fever and infected sputum.

Classification:

critical_finding (Positive/Certain/Acute)

In [ ]: