In [16]:
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['font.size'] = 16.0
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd # process data with pandas dataframe
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import time
import os
Some of the main features of the IPython Notebook app include:
Personal advantages for my projects:
The Ohsumed test collection (available at ftp://medir.ohsu.edu/pub/ohsumed) is a subset of the MEDLINE database, a bibliographic database of important, peer-reviewed medical literature maintained by the National Library of Medicine. The subset I consider in this project consists of the first 20,000 documents out of the 50,216 medical abstracts from the year 1991. The classification scheme consists of the 23 Medical Subject Headings (MeSH) categories of the cardiovascular diseases group:
| Category | Description |
|---|---|
| C01 | Bacterial Infections and Mycoses |
| C02 | Virus Diseases |
| C03 | Parasitic Diseases |
| C04 | Neoplasms |
| C05 | Musculoskeletal Diseases |
| C06 | Digestive System Diseases |
| C07 | Stomatognathic Diseases |
| C08 | Respiratory Tract Diseases |
| C09 | Otorhinolaryngologic Diseases |
| C10 | Nervous System Diseases |
| C11 | Eye Diseases |
| C12 | Urologic and Male Genital Diseases |
| C13 | Female Genital Diseases and Pregnancy Complications |
| C14 | Cardiovascular Diseases |
| C15 | Hemic and Lymphatic Diseases |
| C16 | Neonatal Diseases and Abnormalities |
| C17 | Skin and Connective Tissue Diseases |
| C18 | Nutritional and Metabolic Diseases |
| C19 | Endocrine Diseases |
| C20 | Immunologic Diseases |
| C21 | Disorders of Environmental Origin |
| C22 | Animal Diseases |
| C23 | Pathological Conditions, Signs and Symptoms |
The following code assumes that the data has been extracted into a folder named 'Data' in the same folder as this IPython notebook.
The code below iterates over the extracted files and converts the Ohsumed directory structure into a single sqlite database. The original dataset is already divided into training and test sets; we will keep this division.
In [29]:
import sqlite3 as sqlite
def get_all_tables(c):
    """
    Helper function - Gets a list of all the tables in the database.
    """
    all_tables = []
    c.execute('SELECT name FROM SQLITE_MASTER WHERE type = "table"')
    for tbl in c:
        all_tables.append(tbl[0])
    return all_tables

def drop_tables(c, tables):
    """
    Helper function - Checks that the specified tables exist, and drops those that do.
    """
    all_tables = get_all_tables(c)
    for t in tables:
        if t in all_tables:
            c.execute('DROP TABLE %s' % t)

def create_documents_table(c):
    """
    Helper function - Uses SQL to create the Documents table.
    """
    drop_tables(c, ['Documents'])
    c.execute("""CREATE TABLE Documents (
                     DOCID TEXT PRIMARY KEY,
                     NOTE_TEXT TEXT,
                     CATEGORY TEXT
                 )""")

def add_document(c, docid, text, category):
    """
    Helper function - Adds one document to the Documents table.
    """
    c.execute('insert or replace into Documents values ( ?, ?, ? )', (docid, text, category))

def ohsumed2sqlite(src_root_dir, dest_sqlite):
    start_time = time.time()
    print 'Converting ohsumed directory structure {0} to sqlite database'.format(src_root_dir)
    # Truncate any existing database file so we start from an empty database
    fout = open(dest_sqlite, 'w')
    fout.close()
    conn_out = sqlite.connect(dest_sqlite)
    c_out = conn_out.cursor()
    create_documents_table(c_out)
    # Process the ohsumed directory
    doc_categories = {}
    for root, dirs, files in os.walk(src_root_dir):
        for f in files:
            category = os.path.basename(root)  # the directory name is the category
            with open(os.path.join(root, f), "r") as cur_file:
                data = cur_file.read()
            # A document may appear under several category directories;
            # accumulate all of its categories in a comma-separated list
            if f in doc_categories:
                category = doc_categories[f] + ',' + category
            doc_categories[f] = category
            add_document(c_out, f, data, category)
    conn_out.commit()
    c_out.close()
    print("--- ohsumed2sqlite %s seconds ---" % (time.time() - start_time))
In [30]:
# Convert the training data
ohsumed2sqlite('.\\Data\\ohsumed-first-20000-docs\\training', 'training.sqlite')
# Convert the test data
ohsumed2sqlite('.\\Data\\ohsumed-first-20000-docs\\test', 'test.sqlite')
In [64]:
con = sqlite.connect('training.sqlite')
df_train = pd.read_sql_query("SELECT * from Documents", con)
con.close()
con = sqlite.connect('test.sqlite')
df_test = pd.read_sql_query("SELECT * from Documents", con)
con.close()
In [65]:
pd.options.display.max_colwidth = 200
df_train.sample(20)
Out[65]:
As we see, each document is assigned one or more classes. In this exercise I would like to implement a simple binary classification workflow, so I define two binary classes: documents tagged with category C14 (Cardiovascular Diseases) are the positive (True) class, and all other documents form the negative (False) class.
The following code converts the data to a binary classification dataset:
In [66]:
import re

def ConvertCategoryColToBinVal(df, poscat):
    # True if the document's category list contains the positive category
    df['CATEGORY'] = df['CATEGORY'].apply(lambda x: bool(re.search(poscat, x, re.IGNORECASE)))

ConvertCategoryColToBinVal(df_train, 'C14')
ConvertCategoryColToBinVal(df_test, 'C14')
Let's take another look at the data:
In [67]:
df_train.sample(20)
Out[67]:
In [72]:
print 'Training data has: ', len(df_train.index), ' documents'
print 'Test data has: ', len(df_test.index), ' documents'
In [73]:
plt.axis('equal')
plt.pie(
df_train.CATEGORY.value_counts().tolist(),
labels=['False', 'True'],
autopct='%1.1f%%',
colors=("#E13F29", "#D69A80"));
As the pie chart shows, the classes are imbalanced: the accuracy of a dumb classifier that labels all the documents as False is already >80%.
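A quick check of that claim from the class balance in the training data (after the conversion above the CATEGORY column is boolean, so its mean is the fraction of True documents):

```python
# An "always False" classifier is correct on every document that is not True
print 'Fraction of True documents:', df_train.CATEGORY.mean()
print 'Always-False accuracy     :', 1 - df_train.CATEGORY.mean()
```

As a slightly smarter baseline, the next cell labels a document as positive whenever its text mentions "Cardio":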
In [79]:
import re

def baseline_cpr_classifier(txt):
    # Naive baseline: a document is positive if its text mentions "Cardio"
    if re.search('Cardio', txt, re.IGNORECASE):
        return True
    else:
        return False
In [90]:
Ytrue = df_test.CATEGORY.tolist()
Ypred = []
for index, row in df_test.iterrows():
    Ypred.append(baseline_cpr_classifier(row['NOTE_TEXT']))
In [91]:
dictY = {'Actual' : Ytrue, 'Predicted': Ypred}
dfY = pd.DataFrame.from_dict(dictY)
dfY.sample(20)
Out[91]:
In [92]:
from ipy_table import *
confusion_matrix_binary = [
['', 'Predicted 0', 'Predicted 1'],
['Actual 0', 'True Negative', 'False Positive'],
['Actual 1', 'False Negative', 'True Positive']
]
make_table(confusion_matrix_binary)
apply_theme('basic_both')
Out[92]:
In [93]:
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(Ytrue, Ypred)
print conf_mat
print sum(conf_mat[0])
Some code to prettify the printout of the confusion matrix in the notebook:
In [87]:
def render_confusion_matrix(ytrue, ypred):
    return pd.crosstab(pd.Series(ytrue), pd.Series(ypred), rownames=['Actual'], colnames=['Predicted'], margins=True)
In [88]:
render_confusion_matrix(Ytrue, Ypred)
Out[88]:
In [89]:
from sklearn.metrics import classification_report, accuracy_score
print
print 'Accuracy: ', accuracy_score(Ytrue, Ypred)
print
print classification_report(Ytrue, Ypred)
Even though the accuracy is OK, the recall is miserable. Let's build a real classifier.
Our vector space is a dictionary of all the N-grams in our set of documents.
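As a toy illustration of what such a vector space looks like (using scikit-learn's default tokenizer and two made-up sentences; the real pipeline below uses a custom stemming tokenizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences and the unigram + bigram vocabulary they induce
toy_docs = ["acute myocardial infarction", "acute renal failure"]
toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_counts = toy_vect.fit_transform(toy_docs)

print toy_vect.get_feature_names()
# roughly: ['acute', 'acute myocardial', 'acute renal', 'failure',
#           'infarction', 'myocardial', 'myocardial infarction',
#           'renal', 'renal failure']
print toy_counts.toarray()
```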
In [99]:
from nltk.stem.porter import *
from nltk.tokenize import word_tokenize
import string
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # Remove punctuation, tokenize, keep alphabetic tokens only, then stem
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [item for item in tokens if item.isalpha()]
    stems = stem_tokens(tokens, stemmer)
    return stems
In [100]:
from sklearn.feature_extraction.text import CountVectorizer

# Read the training data from the pandas dataframe into lists
X_train = df_train['NOTE_TEXT'].tolist()
Y_train = df_train.CATEGORY.tolist()

# Build the uni/bi-gram count matrix
vectorizer = CountVectorizer(tokenizer=tokenize, ngram_range=(1, 2))
wcounts = vectorizer.fit_transform(X_train)
In [102]:
wcounts
Out[102]:
We have 6286 documents in our training set, each represented by 355860 features (the combination of all the uni/bi-grams).
In [103]:
feats = vectorizer.get_feature_names()
feats[:200]
Out[103]:
At this point in the workflow there are many feature selection and engineering techniques we could apply: some generic, some based on natural language processing, and some unique to the medical domain. However, to keep it simple for now, let's move on to the classifier.
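(Just as an illustration of the kind of step that could be slotted in here, and not used in the rest of this notebook, below is a minimal sketch of one generic technique: chi-squared feature selection, which keeps the k n-grams most associated with the class labels. The value k=20000 is an arbitrary choice for the example.)

```python
from sklearn.feature_selection import SelectKBest, chi2

# Keep only the 20,000 n-grams most associated with the labels (illustrative value)
selector = SelectKBest(chi2, k=20000)
wcounts_reduced = selector.fit_transform(wcounts, Y_train)
print wcounts_reduced.shape
```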
For the classifier we use Naive Bayes. By Bayes' rule, the probability of class $Y$ given the words $w_{1}, w_{2}, ..., w_{N}$ of a document is
$P(Y|w_{1}, w_{2}, ..., w_{N})=\frac{P(w_{1}, w_{2}, ..., w_{N}|Y) \cdot P(Y)}{P(W)}$
and the "naive" assumption that the words are conditionally independent given the class yields
$\implies P(Y|w_{1}, w_{2}, ..., w_{N})=\frac{1}{P(W)} \cdot P(Y) \cdot \prod_{k=1}^{N} P(w_{k}|Y)$
Since $P(W)$, the probability of the features, does not depend on $Y$, and all we care about is finding the most likely $Y$, we can drop $\frac{1}{P(W)}$ and stay with:
$P(Y|w_{1}, w_{2}, ..., w_{N}) \propto P(Y) \cdot \prod_{k=1}^{N} P(w_{k}|Y)$
We estimate the probabilities from the training set using a multinomial distribution with a smoothing factor $\alpha$:
$P(w_{i}|Y) = \frac{count_{i,Y}+\alpha}{\sum_{k} count_{k,Y}+\alpha \cdot |V|}$
where $count_{i,Y}$ is the number of occurrences of feature $w_{i}$ in the training documents of class $Y$ and $|V|$ is the number of features (the size of the vocabulary).
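To make the estimate concrete, here is a tiny sketch with made-up counts; it checks that scikit-learn's MultinomialNB should reproduce exactly this smoothed estimate (as log probabilities) in its feature_log_prob_ attribute:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (made-up numbers): 3 documents, 3 features, 2 classes
X_toy = np.array([[3, 0, 1],
                  [2, 1, 0],
                  [0, 4, 2]])
y_toy = [True, True, False]
alpha = 0.1

clf_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Manual smoothed estimate for the True class:
# (count_i + alpha) / (sum_k count_k + alpha * |V|)
true_counts = X_toy[:2].sum(axis=0)  # per-feature counts in the True documents
manual = (true_counts + alpha) / (true_counts.sum() + alpha * X_toy.shape[1])

# classes_ is sorted ([False, True]), so row 1 corresponds to the True class
print np.allclose(clf_toy.feature_log_prob_[1], np.log(manual))  # True
```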
In [111]:
from sklearn.naive_bayes import MultinomialNB
clf_nb = MultinomialNB(alpha=0.1)
clf_nb.fit(wcounts, Y_train)
Out[111]:
A cool feature of the Naive Bayes classifier is that it can list the most relevant features for each class. Below are the features that are most indicative of the positive documents. Some of them are trivial English words that will also appear in the list of features relevant to the negative documents; however, we also see some domain-specific features, such as the stems arteri and coronari:
In [112]:
pf = [(clf_nb.feature_log_prob_[1, i], feats[i]) for i in range(len(feats))]
pf.sort(reverse=True)
for p in pf[:25]:
    print 'Positive word %.2f: %s' % (p[0], p[1])
In [117]:
# Read the data from pandas dataframe to an array
X_test = df_test['NOTE_TEXT'].tolist()
Y_test = df_test.CATEGORY.tolist()
# Convert the text to arrays of numbers
counts_test = vectorizer.transform(X_test)
In [118]:
counts_test
Out[118]:
We have 7643 documents in our test set, each represented by the same 355860 features we derived while processing the training set.
In [119]:
# Predict the values of the test set
predictions = clf_nb.predict(counts_test)
In [120]:
# Look at the first 20 predictions
predictions[:20]
Out[120]:
In [122]:
render_confusion_matrix(Y_test, predictions)
Out[122]:
In [124]:
print
print 'Accuracy: ', accuracy_score(Y_test, predictions)
print
print classification_report(Y_test, predictions)
As expected, we improved the accuracy compared to the baseline classifier, and dramatically improved the recall on the positive documents, from 29% to 78%.
Before we move on to optimizing the classifier, let's look at another interesting type of output from Naive Bayes: insightful information beyond the hard predictions on the test data.
In [138]:
iwrong_predictions = [i for i,v in enumerate(zip(Y_test, predictions)) if v[0] != v[1]]
iwrong_predictions[:20]
Out[138]:
In [139]:
proba = clf_nb.predict_proba(counts_test)
log_proba = clf_nb.predict_log_proba(counts_test)
In [140]:
proba[:10]
Out[140]:
The output of clf_nb.predict_proba(counts_test) is an (N examples, 2) array. The first column gives the probability $P(Y=0)$, i.e. $P(False)$, and the second gives $P(Y=1)$, i.e. $P(True)$.
It is usually more convenient to work with the log values of the probabilities, i.e. the result of clf_nb.predict_log_proba.
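Two quick sanity checks, as a sketch assuming the column order follows clf_nb.classes_ = [False, True] as described above:

```python
# The probabilities should be the exponential of the log probabilities
print np.allclose(proba, np.exp(log_proba))

# Column 1 is P(True) and column 0 is P(False); the hard prediction should be
# True exactly when log P(True) - log P(False) > 0
print np.array_equal((log_proba[:, 1] - log_proba[:, 0]) > 0, predictions)
```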
In [141]:
diff_prob = proba[:,1] - proba[:,0]
diff_log_proba = log_proba[:,1] - log_proba[:,0]
In [142]:
print 'diff_prob:\n'
print 'mean:',np.mean(diff_prob)
print 'std:', np.std(diff_prob)
print '\ndiff_log_prob:\n'
print 'mean:', np.mean(diff_log_proba)
print 'std:', np.std(diff_log_proba)
In [143]:
diff_log_proba[-10:]
Out[143]:
In [149]:
# Plot histogram.
plt.hist(diff_log_proba, range=[-500, 500], bins=30, normed=True, alpha=0.5)
plt.axvline(0, color='r')
Out[149]:
In [151]:
pospts = [v[1] for i,v in enumerate(zip(Y_test, diff_log_proba)) if v[0] == 1]
negpts = [v[1] for i,v in enumerate(zip(Y_test, diff_log_proba)) if v[0] == 0]
In [202]:
plt.scatter(
    pospts,
    np.random.uniform(0.9, 1.1, len(pospts)),
    color='blue')
plt.scatter(
    negpts,
    np.random.uniform(0.9, 1.1, len(negpts)),
    color='red')
plt.xlim(-500, 450)
plt.ylim(0.8, 1.2)
plt.axvline(0, color='r')
# Highlight the region (mean +/- 2 std) where the misclassified documents lie
values = np.array(diff_log_proba[iwrong_predictions])
plt.axvspan(
    np.mean(values) - 2 * np.std(values),
    np.mean(values) + 2 * np.std(values),
    facecolor='b', alpha=0.1)
plt.axvline(np.mean(values), linewidth=1);
When optimizing the parameters of a classification pipeline we can write our own scoring function. The optimization routine GridSearchCV, which runs a brute-force parameter search, will select the parameters that achieve the highest score.
When classifying medical documents, we sometimes want to maximize sensitivity (recall) while keeping specificity and accuracy in check. In other words, sensitivity is our target optimization metric. To maximize sensitivity, I'll define a scorer function that returns the recall score.
In [199]:
from sklearn.metrics import recall_score, make_scorer
# Define scorer function that returns the recall score
recall_scorer = make_scorer(recall_score)
Now we build a full pipeline and use the GridSearchCV scikit-learn utility to fit the best parameters for the pipeline. Normally I use longer lists of parameters, but that takes hours to run.
NOTE - we will use GridSearchCV's default 3-fold cross-validation. Other cross-validation schemes can be defined if needed, as sketched below.
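For instance, with the older scikit-learn API used in this notebook (the one that still ships sklearn.grid_search), an explicit stratified split could be passed through the cv argument; a minimal sketch, not used below:

```python
from sklearn.cross_validation import StratifiedKFold

# Illustrative only: a stratified 5-fold split over the training labels
cv_scheme = StratifiedKFold(df_train.CATEGORY.tolist(), n_folds=5)

# It would then be handed to the grid search via the cv argument, e.g.:
# GridSearchCV(pipeline, parameters, scoring=recall_scorer, cv=cv_scheme)
```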
In [201]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('clf', MultinomialNB())
])

# Define some possible parameters for feature extraction and for the classifier
parameters = {
    'vect__max_features': (20000, 30000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'clf__alpha': (0.05, 0.1, 0.2)
}
# find the best parameters for both the feature extraction and the
# classifier, based on the scorer function we defined
grid_search = GridSearchCV(pipeline, parameters, verbose=1, scoring=recall_scorer)
grid_search.fit(df_train['NOTE_TEXT'].tolist(), df_train.CATEGORY.tolist())
Out[201]:
In [203]:
print("Best parameters set found on development set:")
print
print(grid_search.best_params_)
In [204]:
# Use the optimal classifier to make predictions on the test set
opt_predictions = grid_search.predict(df_test['NOTE_TEXT'].tolist())
In [205]:
render_confusion_matrix(df_test.CATEGORY.tolist(), opt_predictions)
Out[205]:
In [206]:
print
print 'Accuracy: ', accuracy_score(df_test.CATEGORY.tolist(), opt_predictions)
print
print classification_report(df_test.CATEGORY.tolist(), opt_predictions)
Comparing the performance of the optimized classifier to that of the classifier with the ad-hoc parameters shows that, while the accuracy of the optimized classifier is slightly lower (91% compared to 92%), the sensitivity for True documents, which is our target performance metric, increased from 78% to 82%.
In [ ]: