Experimental Results from a Random Forest (Decision Tree Ensemble) Based NER Model

Decision Trees, as opposed to other machine learning techniques such as SVMs and Neural Networks, provide a human-interpretable classification model. We will exploit this both to generate pretty pictures and to glean information for feature selection in our high-dimensionality datasets.

This report provides precision, recall, and F-measure values for Decision Trees built on the orthographic; orthographic + morphological; and orthographic + morphological + lexical feature sets for the Adverse Reaction, Indication, Active Ingredient, and Inactive Ingredient entities. A viewable Decision Tree structure is also generated for each fold.


The file 'decisiontree.py' builds a Decision Tree classifier on the sparse-format ARFF file passed in as a parameter. The resulting model is saved in the models directory as 'decisiontree_[featuresets]_[entity name].pkl'.
The file 'evaluate_decisiontree.py' evaluates a given Decision Tree model stored in a '.pkl' file, outputting the appropriate statistics and saving a PDF image of the decision structure underlying the given model.
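The training script itself is not reproduced here. As a rough sketch of its train-and-pickle step (the toy data and output file name below are hypothetical; the real script parses the sparse ARFF input instead):

```python
# Hypothetical sketch of decisiontree.py's core: fit a tree, persist it as .pkl.
from sklearn.tree import DecisionTreeClassifier
import joblib  # the notebook itself uses the older sklearn.externals.joblib

# Toy binary feature rows: [has_punctuation, first_letter_capital]
X = [[0, 1], [1, 0], [0, 0], [1, 1]]
y = ["Entity", "Non-Entity", "Non-Entity", "Entity"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Persist and reload the model the same way the report's scripts do.
joblib.dump(clf, "decisiontree_o_toy.pkl")
restored = joblib.load("decisiontree_o_toy.pkl")
print(restored.predict([[0, 1]])[0])
```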

All ARFF files were cleaned with 'arff_translator.py'. This cleaning consisted of removing a comma that was mistakenly inserted into each instance during file creation.

python3 arff_translator.py [filename]
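The exact cleanup rule lives in 'arff_translator.py'; the sketch below assumes the stray comma shows up as a doubled comma in each data row, which may not match the real script:

```python
# Hypothetical sketch of the per-line cleanup arff_translator.py performs:
# collapse one mistakenly doubled comma per instance, leaving headers intact.
def clean_arff_line(line: str) -> str:
    # ARFF declaration lines (@relation, @attribute, @data) are left untouched.
    if line.lstrip().startswith("@") or not line.strip():
        return line
    return line.replace(",,", ",", 1)  # drop the first spurious comma only

print(clean_arff_line("{0 1,,3 1}"))
```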

Adverse Reaction Feature Set

Orthographic Features


In [1]:
import os
import subprocess

""" Creates models for each fold and runs evaluation with results """
featureset = "o"
entity_name = "adversereaction"

for fold in range(1, 1):  # empty range: training has already been done
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 decisiontree.py -tr %s" % (training_data))


for fold in range(1,11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_decisiontree.py -te %s" % (testing_data), shell=True)
    print(output.decode('utf-8'))


adversereaction_test-1.arff
Precision: 0.961538
Recall: 0.013789
[[   25  1788]
 [    1 16927]]


adversereaction_test-2.arff
Precision: 0.750000
Recall: 0.008167
[[    9  1093]
 [    3 19878]]


adversereaction_test-3.arff
Precision: 0.333333
Recall: 0.001961
[[    1   509]
 [    2 10642]]


adversereaction_test-4.arff
Precision: 1.000000
Recall: 0.009394
[[   11  1160]
 [    0 10655]]


adversereaction_test-5.arff
Precision: 0.571429
Recall: 0.010852
[[   20  1823]
 [   15 18196]]


adversereaction_test-6.arff
Precision: 0.166667
Recall: 0.002210
[[    2   903]
 [   10 13178]]


adversereaction_test-7.arff
Precision: 0.800000
Recall: 0.006098
[[    4   652]
 [    1 18655]]


adversereaction_test-8.arff
Precision: 0.708333
Recall: 0.020118
[[   17   828]
 [    7 15856]]


adversereaction_test-9.arff
Precision: 0.500000
Recall: 0.001765
[[   2 1131]
 [   2 8715]]


adversereaction_test-10.arff
Precision: 0.538462
Recall: 0.006261
[[    7  1111]
 [    6 15010]]


Rather lackluster performance: precision swings wildly across folds, and recall never exceeds about 2%.
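The confusion matrices above are laid out as [[TP, FN], [FP, TN]] with the entity class as positive, so the printed precision and recall can be reproduced directly:

```python
# Recover precision/recall from a printed confusion matrix [[TP, FN], [FP, TN]].
def precision_recall(cm):
    (tp, fn), (fp, tn) = cm
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Fold 1 above: 25 entities found, 1788 missed, 1 false positive.
p, r = precision_recall([[25, 1788], [1, 16927]])
print("Precision: %f  Recall: %f" % (p, r))
```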

Orthographic + Morphological Features


In [2]:
import os
import subprocess

""" Creates models for each fold and runs evaluation with results """
featureset = "om"
entity_name = "adversereaction"

for fold in range(1, 1):  # empty range: training has already been done
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 decisiontree.py -tr %s" % (training_data))


for fold in range(1,11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_decisiontree.py -te %s" % (testing_data), shell=True)
    print(output.decode('utf-8'))


adversereaction_test-1.arff
Precision: 0.810458
Recall: 0.478764
[[  868   945]
 [  203 16725]]


adversereaction_test-2.arff
Precision: 0.475576
Recall: 0.468240
[[  516   586]
 [  569 19312]]


adversereaction_test-3.arff
Precision: 0.487965
Recall: 0.437255
[[  223   287]
 [  234 10410]]


adversereaction_test-4.arff
Precision: 0.795165
Recall: 0.533732
[[  625   546]
 [  161 10494]]


adversereaction_test-5.arff
Precision: 0.767084
Recall: 0.432447
[[  797  1046]
 [  242 17969]]


adversereaction_test-6.arff
Precision: 0.607207
Recall: 0.372376
[[  337   568]
 [  218 12970]]


adversereaction_test-7.arff
Precision: 0.423135
Recall: 0.423780
[[  278   378]
 [  379 18277]]


adversereaction_test-8.arff
Precision: 0.526387
Recall: 0.460355
[[  389   456]
 [  350 15513]]


adversereaction_test-9.arff
Precision: 0.797601
Recall: 0.469550
[[ 532  601]
 [ 135 8582]]


adversereaction_test-10.arff
Precision: 0.732477
Recall: 0.560823
[[  627   491]
 [  229 14787]]


It appears that adding the morphological features greatly increased classifier performance.
Below is the underlying decision tree structure representing the classifier.
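The F-measure values promised in the introduction are not printed by 'evaluate_decisiontree.py', but they follow from the precision/recall pairs above; averaging per-fold F1 quantifies the improvement:

```python
# Compute per-fold F1 from the precision/recall values printed above,
# then macro-average over the ten folds for each feature set.
def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

o_folds = [(0.961538, 0.013789), (0.750000, 0.008167), (0.333333, 0.001961),
           (1.000000, 0.009394), (0.571429, 0.010852), (0.166667, 0.002210),
           (0.800000, 0.006098), (0.708333, 0.020118), (0.500000, 0.001765),
           (0.538462, 0.006261)]
om_folds = [(0.810458, 0.478764), (0.475576, 0.468240), (0.487965, 0.437255),
            (0.795165, 0.533732), (0.767084, 0.432447), (0.607207, 0.372376),
            (0.423135, 0.423780), (0.526387, 0.460355), (0.797601, 0.469550),
            (0.732477, 0.560823)]

for name, folds in (("o", o_folds), ("om", om_folds)):
    scores = [f1(p, r) for p, r in folds]
    print("%-3s mean F1 over 10 folds: %.4f" % (name, sum(scores) / len(scores)))
```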


In [1]:
import graphviz
from sklearn.externals import joblib
from Tools import arff_converter
from sklearn import tree

featureset = "o" #Careful with high dimensional datasets
entity_name = "adversereaction"

fold = 1  #change this to display a graph of the decision tree structure for a fold
training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
dataset = arff_converter.arff2df(training_data)
dtree = joblib.load('../Models/decisiontree/%s_%s/decisiontree_%s_%s_train-%i.arff.pkl' % (entity_name, featureset, featureset, entity_name,fold))
tree.export_graphviz(dtree, 
                     out_file="visual/temptree.dot",
                     feature_names=dataset.columns.values[:-1],
                     class_names=["Entity", "Non-Entity"], label='all',
                     filled=True, rounded=True, proportion=False, leaves_parallel=True,
                     special_characters=True,
                     max_depth=3  #change for more detail, careful with large datasets
                    )
with open("visual/temptree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
#graphviz.Source(dot_graph).view()
#the above line is a fullscreen alternative, also generates a temporary file that requires manual removal


Out[1]:
[Rendered decision tree for fold 1, truncated to depth 3 (140114 samples). Root split: has_punctuation ≤ 0.5; subsequent splits include single_character, all_capital, first_letter_capital, has_hyphen, all_digit, and has_digit; each node reports gini, sample count, class counts [Entity, Non-Entity], and majority class.]
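Beyond pictures, the feature-selection goal stated in the introduction can use the fitted tree directly: sklearn trees expose feature_importances_. A toy sketch (the feature names and data are illustrative; the real workflow would joblib.load a model from '../Models/'):

```python
# Rank features by the impurity reduction they contribute in a fitted tree.
from sklearn.tree import DecisionTreeClassifier

feature_names = ["has_punctuation", "single_character", "all_capital"]  # illustrative
X = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 1], [1, 0, 1]]
y = [0, 1, 1, 1, 0, 1]  # toy labels: depend only on the first two features

dtree = DecisionTreeClassifier(random_state=0).fit(X, y)
ranked = sorted(zip(feature_names, dtree.feature_importances_),
                key=lambda t: -t[1])
for name, imp in ranked:
    print("%-18s %.3f" % (name, imp))
```

Features with near-zero importance are candidates for pruning from the high-dimensionality ARFF files.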
