Decision Trees, as opposed to other machine learning techniques such as SVMs and Neural Networks, provide a human-interpretable classification model. We will exploit this both to generate pretty pictures and to glean information for feature selection in our high-dimensionality datasets.
This report will provide precision, recall, and F-measure values for Decision Trees built on the orthographic; orthographic + morphological; and orthographic + morphological + lexical feature sets for the Adverse Reaction, Indication, Active Ingredient, and Inactive Ingredient entities. A viewable Decision Tree structure will also be generated for each fold.
The file 'decisiontree.py' builds a Decision Tree classifier on the sparse-format ARFF file passed in as a parameter. The resulting model is saved in the models directory with the format 'decisiontree_[featuresets]_[entity name].pkl'.
The file 'evaluate_decisiontree.py' evaluates a given Decision Tree model stored inside a '.pkl' file, outputting the appropriate statistics and saving a PDF image of the decision structure underlying the given model.
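For reference, the core of 'decisiontree.py' is presumably along the lines of the sketch below. The 'arff_converter.arff2df' helper and the model path convention are taken from cells later in this report; the tree hyperparameters are assumptions, not the script's confirmed settings.

from sklearn import tree
from sklearn.externals import joblib
from Tools import arff_converter

def train_decision_tree(training_data, featureset, entity_name, fold):
    # Load the sparse ARFF file into a DataFrame; the last column is assumed to be the class label.
    dataset = arff_converter.arff2df(training_data)
    X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
    # Fit a CART classifier with default hyperparameters (an assumption).
    dtree = tree.DecisionTreeClassifier()
    dtree.fit(X, y)
    # Persist the model under the naming convention described above.
    joblib.dump(dtree, '../Models/decisiontree/%s_%s/decisiontree_%s_%s_train-%i.arff.pkl'
                % (entity_name, featureset, featureset, entity_name, fold))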
All ARFF files were cleaned with 'arff_translator.py'. This cleaning consisted of removing a comma that was mistakenly inserted into each instance during file creation.
In [1]:
#python3 arff_translator.py [filename]
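A minimal sketch of the kind of cleanup 'arff_translator.py' applies is below; the assumption that the stray comma shows up as a doubled comma in each data row is hypothetical and may not match the script's actual logic.

def clean_arff(filename):
    with open(filename) as f:
        lines = f.readlines()
    in_data = False
    cleaned = []
    for line in lines:
        if line.strip().lower() == '@data':
            in_data = True
        elif in_data and line.strip():
            # Assumed: the mistakenly inserted comma appears as a doubled comma in each instance.
            line = line.replace(',,', ',', 1)
        cleaned.append(line)
    with open(filename, 'w') as f:
        f.writelines(cleaned)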
In [2]:
import os
import subprocess

""" Creates models for each fold and runs evaluation with results """
featureset = "o"
entity_name = "adversereaction"
for fold in range(1, 1):  # empty range: training has already been done; use range(1, 11) to retrain
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 decisiontree.py -tr %s" % (training_data))
for fold in range(1, 11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_decisiontree.py -te %s" % (testing_data), shell=True)
    print(output.decode('utf-8'))
Average Precision: 0.6329762
Average Recall : 0.0080615
Average F-Measure: 0.0158644
Rather lackluster performance. The near-zero recall indicates that the orthographic-only model almost never labels an instance as an adverse reaction.
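For reference, the per-fold scoring inside 'evaluate_decisiontree.py' is presumably along these lines; the sketch below uses sklearn.metrics, and the binary label encoding (1 = 'Entity') is an assumption.

from sklearn.metrics import precision_recall_fscore_support
from sklearn.externals import joblib
from Tools import arff_converter

def evaluate_fold(model_path, testing_data):
    dataset = arff_converter.arff2df(testing_data)
    X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
    dtree = joblib.load(model_path)
    predictions = dtree.predict(X)
    # Score the minority class; labels are assumed to be encoded 0/1 with 1 = 'Entity'.
    p, r, f, _ = precision_recall_fscore_support(y, predictions, average='binary')
    return p, r, f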
In [3]:
import graphviz
from sklearn.externals import joblib
from Tools import arff_converter
from sklearn import tree

featureset = "o"  # careful with high-dimensional datasets
entity_name = "adversereaction"
fold = 1  # change this to display the decision tree structure for another fold
training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
dataset = arff_converter.arff2df(training_data)
dtree = joblib.load('../Models/decisiontree/%s_%s/decisiontree_%s_%s_train-%i.arff.pkl' % (entity_name, featureset, featureset, entity_name, fold))
tree.export_graphviz(dtree,
                     out_file="visual/temptree1.dot",
                     feature_names=dataset.columns.values[:-1],
                     class_names=["Entity", "Non-Entity"], label='all',
                     filled=True, rounded=True, proportion=False, leaves_parallel=True,
                     special_characters=True,
                     max_depth=3)  # increase for more detail; careful with large datasets
with open("visual/temptree1.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
# graphviz.Source(dot_graph).view()
# the above line is a fullscreen alternative; it also generates a temporary file that requires manual removal
Out[3]:
The above tree suggests that orthographic information alone may not be enough to train the classifier. Notice that the left subtree of the root node splits on the 'single_character' feature. Clearly an adverse reaction would not be a single character, yet the tree predicts that any instance where 'single_character' holds the value 1 is in fact an adverse reaction.
In [3]:
import os
import subprocess

""" Creates models for each fold and runs evaluation with results """
featureset = "om"
entity_name = "adversereaction"
for fold in range(1, 1):  # empty range: training has already been done; use range(1, 11) to retrain
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 decisiontree.py -tr %s" % (training_data))
for fold in range(1, 11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_decisiontree.py -te %s" % (testing_data), shell=True)
    print(output.decode('utf-8'))
Average Precision: 0.6423055
Average Recall : 0.4637322
Average F-Measure: 0.5329495
It appears that adding the morphological features greatly increased classifier performance, with recall in particular rising from under 0.01 to roughly 0.46.
Below, find the underlying decision tree structure representing the classifier.
In [5]:
import graphviz
from sklearn.externals import joblib
from Tools import arff_converter
from sklearn import tree

featureset = "om"  # careful with high-dimensional datasets
entity_name = "adversereaction"
fold = 2  # change this to display the decision tree structure for another fold
training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
dataset = arff_converter.arff2df(training_data)
dtree = joblib.load('../Models/decisiontree/%s_%s/decisiontree_%s_%s_train-%i.arff.pkl' % (entity_name, featureset, featureset, entity_name, fold))
tree.export_graphviz(dtree,
                     out_file="visual/temptree.dot",
                     feature_names=dataset.columns.values[:-1],
                     class_names=["Entity", "Non-Entity"], label='all',
                     filled=True, rounded=True, proportion=False, leaves_parallel=True,
                     special_characters=True,
                     max_depth=3)  # increase for more detail; careful with large datasets
with open("visual/temptree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
# graphviz.Source(dot_graph).view()
# the above line is a fullscreen alternative; it also generates a temporary file that requires manual removal
Out[5]:
The decision tree structure above confirms that the features maximizing split purity are predominantly morphological; that is, the orthographic features may simply be contributing noise when included.
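One way to check this reading numerically rather than visually is to rank the fitted tree's impurity-based feature importances; a brief sketch, reusing the 'dtree' and 'dataset' objects loaded in the cell above (the top-ranked names should be predominantly morphological if the claim holds).

import pandas as pd

# Rank features by impurity-based importance and show the ten strongest.
importances = pd.Series(dtree.feature_importances_,
                        index=dataset.columns.values[:-1])
print(importances.sort_values(ascending=False).head(10))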
In [4]:
import os
import subprocess

""" Creates models for each fold and runs evaluation with results """
featureset = "omt"
entity_name = "adversereaction"
for fold in range(1, 1):  # empty range: training has already been done; use range(1, 11) to retrain
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 decisiontree.py -tr %s" % (training_data))
for fold in range(1, 11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_decisiontree.py -te %s" % (testing_data), shell=True)
    print(output.decode('utf-8'))
Average Precision: 0.6639918
Average Recall : 0.6856795
Average F-Measure: 0.6662661
The addition of lexical features clearly assists the classifier's recall of the minority class. It appears, however, that including lexical features lowers classifier precision relative to recall. This suggests that the lexical features introduce noise that skews the decision boundary towards the majority 'Non-entity' class, yet they are still necessary to strengthen the boundary around the minority class, as shown by the higher recall scores.
Undersampling the majority class instances, coupled with feature selection, may lead to favorable results on this combined feature set.
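As a sketch of what that undersampling step might look like (plain pandas; the 0/1 label encoding with 'Non-entity' as 0 is an assumption):

import pandas as pd

def undersample_majority(dataset, label_col, ratio=1.0, seed=0):
    # Keep every minority ('Entity') instance and a random subset of the
    # majority ('Non-entity') instances; the label encoding is assumed.
    minority = dataset[dataset[label_col] == 1]
    majority = dataset[dataset[label_col] == 0]
    n_keep = min(int(len(minority) * ratio), len(majority))
    majority_sample = majority.sample(n=n_keep, random_state=seed)
    # Shuffle the combined result before training.
    return pd.concat([minority, majority_sample]).sample(frac=1, random_state=seed)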
In [1]:
import graphviz
from sklearn.externals import joblib
from Tools import arff_converter
from sklearn import tree

featureset = "omt"  # careful with large datasets
entity_name = "adversereaction"
fold = 10  # change this to display the decision tree structure for another fold
training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
dataset = arff_converter.arff2df(training_data)
dtree = joblib.load('../Models/decisiontree/%s_%s/decisiontree_%s_%s_train-%i.arff.pkl' % (entity_name, featureset, featureset, entity_name, fold))
tree.export_graphviz(dtree,
                     out_file="visual/temp1.dot",
                     feature_names=dataset.columns.values[:-1],
                     class_names=["Entity", "Non-Entity"], label='all',
                     filled=True, rounded=True, proportion=False, leaves_parallel=True,
                     special_characters=True,
                     max_depth=4)  # increase for more detail; careful with large datasets
with open("visual/temp1.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
# graphviz.Source(dot_graph).view()
# the above line is a fullscreen alternative; it also generates a temporary file that requires manual removal
Out[1]: