Classifying Documents

In this notebook we demonstrate a basic document-level classification of reports with respect to a single finding (pulmonary embolism). We use Pandas to read our data from a MySQL database and then to add our classification as a new column in the DataFrame.

Many of the common pyConTextNLP tasks have been wrapped into functions contained in the radnlp package. We import multiple modules that will allow us to write concise code.


In [12]:
from utils import *
import pandas as pd
options = {}
options['lexical_kb'] = ["https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_nlm.tsv"]
options["schema"] = "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/schema2.csv"
options["rules"] = "https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/classificationRules3.csv"

data = get_data()
data.head(5)


Out[12]:
text impression code
0 \n\n\n DATE: [**2721-6-30**] 9:45 PM\n ... Small focal opacity in right upper lobe and ... 415.19
1 \n\n\n DATE: [**3099-10-20**] 5:55 PM\n ... Limited study. The tracheal component of th... 415.11
2 \n\n\n DATE: [**3099-10-24**] 3:59 PM\n ... \n \n Tracheal stent extending from th... 415.11
3 \n\n\n DATE: [**3099-11-4**] 11:04 PM\n ... \n \n Increased density in the retroca... 415.11
4 \n\n\n DATE: [**3099-11-6**] 5:36 PM\n ... \n 1. Pulmonary embolism with filling de... 415.11

Document Classification

Modify targets and modifiers as demonstrated below.

  • You can have as many entries per disease as you want (e.g. the two 4-tuples for pulmonary embolism).
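
To make the shape of these entries concrete, here is a minimal sketch that uses only the standard library. The `Item` namedtuple and its field names (`literal`, `category`, `regex`, `rule`) are assumptions made for illustration; they are a stand-in for the library's own item class, not its API.

```python
from collections import namedtuple

# Hypothetical stand-in for a pyConTextNLP item; the field names are assumptions.
Item = namedtuple("Item", ["literal", "category", "regex", "rule"])

# The same 4-element entries used for the targets in this notebook.
entries = [
    ["pulmonary embolism", "PULMONARY_EMBOLISM", "", ""],
    ["pulmonary emboli", "PULMONARY_EMBOLISM", "", ""],
    ["pneumonia", "LUNG_DISEASE", "", ""],
]
targets_sketch = [Item(*entry) for entry in entries]

# Several literals can share one category, which is how a disease
# ends up with more than one entry.
by_category = {}
for item in targets_sketch:
    by_category.setdefault(item.category, []).append(item.literal)
```

Here `by_category` maps each category to every literal phrase registered for it, so both spellings of pulmonary embolism fall under the same category.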

Define a color for each category you create.

  • Color names need to be valid HTML colors. You might need to experiment.
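
As a rough illustration of why the names must be valid HTML colors, the sketch below wraps a phrase in a colored `<span>`. The `highlight` helper is hypothetical and is not part of radnlp or pyConTextNLP; the lowercase lookup is an assumption that mirrors the lowercase keys used in the `colors` dictionary in this notebook.

```python
from html import escape

# A subset of the category-to-color mapping used in this notebook.
colors_sketch = {
    "pulmonary_embolism": "blue",
    "definite_negated_existence": "red",
}

def highlight(phrase, category, colors):
    """Wrap phrase in a span colored by its category (hypothetical helper)."""
    # Fall back to black when a category has no color assigned.
    color = colors.get(category.lower(), "black")
    return '<span style="color:{}">{}</span>'.format(color, escape(phrase))

snippet = highlight("pulmonary embolism", "PULMONARY_EMBOLISM", colors_sketch)
```

If a name is not a color the browser recognizes, the style is ignored and the phrase renders in the default text color, which is why some experimentation may be needed.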

We now need to apply our schema to the reports. Since our data is in a Pandas data frame, the easiest way to process our reports is with the DataFrame apply method.
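
The row-wise use of `apply` can be sketched on a toy DataFrame. The `toy_classifier` function below is a hypothetical keyword check that stands in for `analyze_report`; it is not how pyConTextNLP classifies text, and the column name `toy rslt` is likewise made up for this example.

```python
import pandas as pd

def toy_classifier(text):
    """Hypothetical stand-in for analyze_report: a naive keyword check."""
    lowered = text.lower()
    if "no pulmonary embolism" in lowered:
        return "negative"
    if "pulmonary embolism" in lowered:
        return "positive"
    return "not mentioned"

reports = pd.DataFrame({"impression": [
    "Pulmonary embolism with filling defects.",
    "No pulmonary embolism.",
    "Normal lower extremity ultrasound bilaterally.",
]})

# axis=1 passes each row to the lambda, so columns are accessed by name.
reports["toy rslt"] = reports.apply(
    lambda row: toy_classifier(row["impression"]), axis=1)
```

Because the result of `apply` is a Series aligned with the DataFrame index, assigning it to a new column keeps each classification next to the report it came from.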


In [10]:
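# Load the classification rules and the schema from the URLs configured above
radnlp_rules = rules.read_rules(options["rules"])
myschema = schema.read_schema(options["schema"])

# Start with empty item collections, then load modifiers from each lexical knowledge base
modifiers = itemData.itemData()
targets = itemData.itemData()
for kb in options['lexical_kb']:
    modifiers.extend( itemData.instantiateFromCSVtoitemData(kb) )
# Targets: the findings we want to detect (several literals may share a category)
targets.extend([["pulmonary embolism", "PULMONARY_EMBOLISM", "", ""],
                ["pulmonary emboli", "PULMONARY_EMBOLISM", "", ""],
               ["pneumonia", "LUNG_DISEASE", "", ""]])
# Extra negation modifiers that act forward on the text that follows them
modifiers.extend((["no definite", "PROBABLE_NEGATED_EXISTENCE", "", "forward"],
                  ["no", "DEFINITE_NEGATED_EXISTENCE", "", "forward"],))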

colors = {"pulmonary_embolism":"blue",
          "lung_disease":"turquoise",
          "probable_negated_existence":"pink",
          "definite_negated_existence":"red",
          "probable_existence":"green",
          "conj":"goldenrod",
         }
#data = data.dropna()

data["pe rslt"] = \
    data.apply(lambda x: analyze_report(x["impression"], 
                                         modifiers, 
                                         targets,
                                         radnlp_rules,
                                         myschema), axis=1)

In [11]:
view_markup(data, colors)


PE Finder Case Review
report    classification

Normal lower extremity ultrasound bilaterally.

