Experimental Results from a Decision Tree-based NER model

Decision Trees, as opposed to other machine learning techniques such as SVMs and neural networks, provide a human-interpretable classification model. We will exploit this both to generate pretty pictures and to glean information for feature selection in our high-dimensional datasets.

This report will provide precision, recall, and F-measure values for Decision Trees built on the orthographic; orthographic + morphological; and orthographic + morphological + lexical feature sets for the Adverse Reaction, Indication, Active Ingredient, and Inactive Ingredient entities. A viewable Decision Tree structure will also be generated for each fold.


The file 'decisiontree.py' builds a Decision Tree classifier on the sparse-format ARFF file passed in as a parameter. The resulting model is saved in the models directory as 'decisiontree_[featuresets]_[entity name].pkl'.
The file 'evaluate_decisiontree.py' evaluates a given Decision Tree model stored in a '.pkl' file, outputting the relevant statistics and saving a PDF image of the decision structure underlying the given model.
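The training half of that pipeline presumably has the following shape (an illustrative sketch, not the actual 'decisiontree.py' source: the real script parses the sparse ARFF input and derives the .pkl name from the input path):

```python
# Presumed shape of the training step inside decisiontree.py
# (illustrative only; function name and toy data are assumptions).
import os
import tempfile

import joblib  # bundled as sklearn.externals.joblib in older sklearn versions
from sklearn.tree import DecisionTreeClassifier

def train_and_save(X, y, out_path):
    """Fit a Decision Tree on (X, y) and pickle it to out_path."""
    clf = DecisionTreeClassifier()
    clf.fit(X, y)
    joblib.dump(clf, out_path)
    return clf

# Round-trip demonstration on toy data.
path = os.path.join(tempfile.mkdtemp(), "decisiontree_demo.pkl")
train_and_save([[0.0], [1.0], [0.0], [1.0]], [0, 1, 0, 1], path)
preds = joblib.load(path).predict([[0.0], [1.0]])
```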

All ARFF files were cleaned with 'arff_translator.py'. This cleaning consists of removing a comma that was mistakenly inserted into each instance during file creation.


In [1]:
#python3 arff_translator.py [filename]
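The core of that fix can be sketched as follows (a hypothetical illustration, not the actual 'arff_translator.py' source; where the stray comma lands in each instance is assumed here to be a doubled comma):

```python
def clean_arff_instance(line):
    """Remove the comma mistakenly inserted during file creation.
    Hypothetical sketch: assumes the stray comma appears as a doubled
    comma; the real script may target a different position."""
    return line.replace(",,", ",", 1)

cleaned = clean_arff_instance("{0 1,,3 'O'}")
```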

Adverse Reaction Feature Set

Orthographic Features


In [2]:
import os
import subprocess

""" Creates models for each fold and runs evaluation with results """
featureset = "o"
entity_name = "adversereaction"

for fold in range(1, 1):  # empty range: training has already been done; widen to retrain
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 decisiontree.py -tr %s" % (training_data))


for fold in range(1,11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_decisiontree.py -te %s" % (testing_data), shell=True)
    print(output.decode('utf-8'))


adversereaction_test-1.arff_o
Precision: 0.961538
Recall: 0.013789
F-measure: 0.027189
[[   25  1788]
 [    1 16927]]


adversereaction_test-2.arff_o
Precision: 0.750000
Recall: 0.008167
F-measure: 0.016158
[[    9  1093]
 [    3 19878]]


adversereaction_test-3.arff_o
Precision: 0.333333
Recall: 0.001961
F-measure: 0.003899
[[    1   509]
 [    2 10642]]


adversereaction_test-4.arff_o
Precision: 1.000000
Recall: 0.009394
F-measure: 0.018613
[[   11  1160]
 [    0 10655]]


adversereaction_test-5.arff_o
Precision: 0.571429
Recall: 0.010852
F-measure: 0.021299
[[   20  1823]
 [   15 18196]]


adversereaction_test-6.arff_o
Precision: 0.166667
Recall: 0.002210
F-measure: 0.004362
[[    2   903]
 [   10 13178]]


adversereaction_test-7.arff_o
Precision: 0.800000
Recall: 0.006098
F-measure: 0.012103
[[    4   652]
 [    1 18655]]


adversereaction_test-8.arff_o
Precision: 0.708333
Recall: 0.020118
F-measure: 0.039125
[[   17   828]
 [    7 15856]]


adversereaction_test-9.arff_o
Precision: 0.500000
Recall: 0.001765
F-measure: 0.003518
[[   2 1131]
 [   2 8715]]


adversereaction_test-10.arff_o
Precision: 0.538462
Recall: 0.006261
F-measure: 0.012378
[[    7  1111]
 [    6 15010]]


Average Precision: 0.6329762
Average Recall : 0.0080615
Average F-Measure: 0.0158644

Rather lackluster performance.
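For reference, each fold's statistics follow directly from its confusion matrix (rows are the true classes with Entity first, columns the predictions); the fold-1 numbers above can be reproduced as:

```python
import numpy as np

# Fold-1 confusion matrix: rows = true class, columns = predicted,
# with Entity first (as printed by evaluate_decisiontree.py).
cm = np.array([[25, 1788],
               [1, 16927]])

tp, fn = cm[0, 0], cm[0, 1]
fp = cm[1, 0]

precision = tp / (tp + fp)   # 25 / 26   ~ 0.961538
recall = tp / (tp + fn)      # 25 / 1813 ~ 0.013789
f_measure = 2 * precision * recall / (precision + recall)  # ~ 0.027189
```

The near-perfect precision with tiny recall means the tree almost never predicts Entity, but is usually right when it does.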


In [3]:
import graphviz
from sklearn.externals import joblib
from Tools import arff_converter
from sklearn import tree

featureset = "o" #Careful with high-dimensional datasets
entity_name = "adversereaction"

fold = 1  #change this to display a graph of the decision tree structure for a fold
training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
dataset = arff_converter.arff2df(training_data)
dtree = joblib.load('../Models/decisiontree/%s_%s/decisiontree_%s_%s_train-%i.arff.pkl' % (entity_name, featureset, featureset, entity_name,fold))
tree.export_graphviz(dtree, 
                     out_file="visual/temptree1.dot",
                     feature_names=dataset.columns.values[:-1],
                     class_names=["Entity", "Non-Entity"], label='all',
                     filled=True, rounded=True, proportion=False, leaves_parallel=True,
                     special_characters=True,
                     max_depth=3  #change for more detail, careful with large datasets
                    )
with open("visual/temptree1.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
#graphviz.Source(dot_graph).view()
#the above line is a fullscreen alternative, also generates a temporary file that requires manual removal


Out[3]:
[Rendered decision tree for fold 1, truncated at depth 3. Root: `has_punctuation` ≤ 0.5 (gini = 0.124, 140114 samples, value = [9283, 130831], class = Non-Entity). The left subtree splits on `single_character`, whose positive branch is a pure 42-sample Entity leaf; the remaining displayed splits use `all_capital`, `first_letter_capital`, `has_hyphen`, and `all_digit`.]

The tree above suggests that orthographic information alone may not be enough to train the classifier. Notice that the left subtree of the root splits on the 'single_character' feature: clearly an adverse reaction would not be a single character, yet the tree predicts that any instance where 'single_character' equals 1 is in fact an adverse reaction.


Orthographic + Morphological Features


In [3]:
import os
import subprocess

""" Creates models for each fold and runs evaluation with results """
featureset = "om"
entity_name = "adversereaction"

for fold in range(1, 1):  # empty range: training has already been done; widen to retrain
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 decisiontree.py -tr %s" % (training_data))


for fold in range(1,11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_decisiontree.py -te %s" % (testing_data), shell=True)
    print(output.decode('utf-8'))


adversereaction_test-1.arff_om
Precision: 0.810458
Recall: 0.478764
F-measure: 0.601942
[[  868   945]
 [  203 16725]]


adversereaction_test-2.arff_om
Precision: 0.475576
Recall: 0.468240
F-measure: 0.471879
[[  516   586]
 [  569 19312]]


adversereaction_test-3.arff_om
Precision: 0.487965
Recall: 0.437255
F-measure: 0.461220
[[  223   287]
 [  234 10410]]


adversereaction_test-4.arff_om
Precision: 0.795165
Recall: 0.533732
F-measure: 0.638733
[[  625   546]
 [  161 10494]]


adversereaction_test-5.arff_om
Precision: 0.767084
Recall: 0.432447
F-measure: 0.553088
[[  797  1046]
 [  242 17969]]


adversereaction_test-6.arff_om
Precision: 0.607207
Recall: 0.372376
F-measure: 0.461644
[[  337   568]
 [  218 12970]]


adversereaction_test-7.arff_om
Precision: 0.423135
Recall: 0.423780
F-measure: 0.423458
[[  278   378]
 [  379 18277]]


adversereaction_test-8.arff_om
Precision: 0.526387
Recall: 0.460355
F-measure: 0.491162
[[  389   456]
 [  350 15513]]


adversereaction_test-9.arff_om
Precision: 0.797601
Recall: 0.469550
F-measure: 0.591111
[[ 532  601]
 [ 135 8582]]


adversereaction_test-10.arff_om
Precision: 0.732477
Recall: 0.560823
F-measure: 0.635258
[[  627   491]
 [  229 14787]]


Average Precision: 0.6423055
Average Recall : 0.4637322
Average F-Measure: 0.5329495

It appears that adding the morphological features greatly increased classifier performance.
The underlying decision tree structure representing the classifier is shown below.


In [5]:
import graphviz
from sklearn.externals import joblib
from Tools import arff_converter
from sklearn import tree

featureset = "om" #Careful with high-dimensional datasets
entity_name = "adversereaction"

fold = 2  #change this to display a graph of the decision tree structure for a fold
training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
dataset = arff_converter.arff2df(training_data)
dtree = joblib.load('../Models/decisiontree/%s_%s/decisiontree_%s_%s_train-%i.arff.pkl' % (entity_name, featureset, featureset, entity_name,fold))
tree.export_graphviz(dtree, 
                     out_file="visual/temptree.dot",
                     feature_names=dataset.columns.values[:-1],
                     class_names=["Entity", "Non-Entity"], label='all',
                     filled=True, rounded=True, proportion=False, leaves_parallel=True,
                     special_characters=True,
                     max_depth=3  #change for more detail, careful with large datasets
                    )
with open("visual/temptree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
#graphviz.Source(dot_graph).view()
#the above line is a fullscreen alternative, also generates a temporary file that requires manual removal


Out[5]:
[Rendered decision tree for fold 2, truncated at depth 3. Root: `nia_m` ≤ 0.5 (gini = 0.134, 137872 samples, value = [9994, 127878], class = Non-Entity). Every displayed split uses a morphological feature (`nia_m`, `tis_m`, `mia_m`, `ema_m`, `amm_m`, `hep_m`, `chy_m`, `sue_m`, `neu_m`, `lym_m`), with Entity-majority nodes appearing once `nia_m` or `tis_m` exceeds 0.5.]

The decision tree structure above confirms that the features maximizing split purity are predominantly morphological; that is, the orthographic features may just be contributing noise when they are included.
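One way to quantify that observation is to sum the fitted tree's `feature_importances_` per feature group. The sketch below assumes the `_m` suffix convention for morphological feature names seen in the tree labels, and demonstrates on tiny synthetic data; in the notebook the call would use `dtree` and `dataset.columns.values[:-1]` instead:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def importance_by_group(tree_model, feature_names):
    """Sum a fitted tree's feature importances per group, where a
    '_m' suffix marks morphological features (naming convention
    assumed from the tree labels)."""
    totals = {"morphological": 0.0, "orthographic": 0.0}
    for name, imp in zip(feature_names, tree_model.feature_importances_):
        key = "morphological" if name.endswith("_m") else "orthographic"
        totals[key] += imp
    return totals

# Tiny synthetic demonstration: the label is driven entirely by the
# first ("morphological") feature, so its group should dominate.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 4))
y = X[:, 0]
names = ["nia_m", "tis_m", "has_punctuation", "single_character"]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
totals = importance_by_group(clf, names)
```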


Orthographic + Morphological + Lexical Features


In [4]:
import os
import subprocess

""" Creates models for each fold and runs evaluation with results """
featureset = "omt"
entity_name = "adversereaction"

for fold in range(1, 1):  # empty range: training has already been done; widen to retrain
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 decisiontree.py -tr %s" % (training_data))


for fold in range(1,11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_decisiontree.py -te %s" % (testing_data), shell=True)
    print(output.decode('utf-8'))


adversereaction_test-1.arff_omt
Precision: 0.795666
Recall: 0.708770
F-measure: 0.749708
[[ 1285   528]
 [  330 16598]]


adversereaction_test-2.arff_omt
Precision: 0.481679
Recall: 0.656080
F-measure: 0.555513
[[  723   379]
 [  778 19103]]


adversereaction_test-3.arff_omt
Precision: 0.569767
Recall: 0.672549
F-measure: 0.616906
[[  343   167]
 [  259 10385]]


adversereaction_test-4.arff_omt
Precision: 0.773176
Recall: 0.669513
F-measure: 0.717620
[[  784   387]
 [  230 10425]]


adversereaction_test-5.arff_omt
Precision: 0.703226
Recall: 0.532284
F-measure: 0.605930
[[  981   862]
 [  414 17797]]


adversereaction_test-6.arff_omt
Precision: 0.722424
Recall: 0.658564
F-measure: 0.689017
[[  596   309]
 [  229 12959]]


adversereaction_test-7.arff_omt
Precision: 0.537500
Recall: 0.786585
F-measure: 0.638614
[[  516   140]
 [  444 18212]]


adversereaction_test-8.arff_omt
Precision: 0.508292
Recall: 0.725444
F-measure: 0.597757
[[  613   232]
 [  593 15270]]


adversereaction_test-9.arff_omt
Precision: 0.813880
Recall: 0.683142
F-measure: 0.742802
[[ 774  359]
 [ 177 8540]]


adversereaction_test-10.arff_omt
Precision: 0.734308
Recall: 0.763864
F-measure: 0.748794
[[  854   264]
 [  309 14707]]


Average Precision: 0.6639918
Average Recall : 0.6856795
Average F-Measure: 0.6662661

The addition of lexical features clearly assists the classifier's recall of the minority class. However, their inclusion lowers precision relative to recall. This suggests that the lexical features introduce noise that skews the decision boundary towards the majority 'Non-Entity' class, yet they are still necessary to strengthen the boundary around the minority class, as the higher recall scores show.

Undersampling the majority class, coupled with feature selection, may lead to favorable results on this combined feature set.
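A minimal sketch of that undersampling step (plain NumPy; the function, `X`, `y`, and the label encoding are assumptions, not the project's actual pipeline):

```python
import numpy as np

def undersample_majority(X, y, majority_label, ratio=1.0, seed=0):
    """Keep all minority instances plus a random subset of the
    majority class sized ratio * (minority count)."""
    rng = np.random.RandomState(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj_idx, size=int(ratio * len(min_idx)), replace=False)
    idx = np.concatenate([min_idx, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

With `ratio=1.0` this yields a balanced training set; tree-based feature selection (e.g. dropping features with zero importance) could then be applied to the reduced data.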


In [1]:
import graphviz
from sklearn.externals import joblib
from Tools import arff_converter
from sklearn import tree

featureset = "omt" #Careful with high-dimensional datasets
entity_name = "adversereaction"

fold = 10  #change this to display a graph of the decision tree structure for a fold
training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
dataset = arff_converter.arff2df(training_data)
dtree = joblib.load('../Models/decisiontree/%s_%s/decisiontree_%s_%s_train-%i.arff.pkl' % (entity_name, featureset, featureset, entity_name,fold))
tree.export_graphviz(dtree, 
                     out_file="visual/temp1.dot",
                     feature_names=dataset.columns.values[:-1],
                     class_names=["Entity", "Non-Entity"], label='all',
                     filled=True, rounded=True, proportion=False, leaves_parallel=True,
                     special_characters=True,
                     max_depth=4  #change for more detail, careful with large datasets
                    )
with open("visual/temp1.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
#graphviz.Source(dot_graph).view()
#the above line is a fullscreen alternative, also generates a temporary file that requires manual removal


Out[1]:
[Rendered decision tree for fold 10, truncated at depth 4. Root: `nia_m` ≤ 0.5 (gini = 0.13, 142721 samples, value = [9978, 132743], class = Non-Entity). Morphological (`nia_m`, `mia_m`, `tis_m`, `ere_m`, `thr_m`) and lexical (`pain_t`, `in_t`, `primary_t`, `hepatitis_t_self`, `dyslipidemia_t_self`, `california_t_self`, `andor_t`, `platelet_t`, `febrile_t`) features dominate the displayed splits, with occasional orthographic tests such as `has_punctuation` and `has_hyphen`.]

Morphological Feature Set


In [ ]: