Decision Trees, unlike other machine learning techniques such as SVMs and neural networks, provide a human-interpretable classification model. We will exploit this both to generate readable tree diagrams and to glean information for feature selection in our high-dimensionality datasets.
This report provides precision, recall, and F-measure values for Decision Trees built on the orthographic; orthographic + morphological; and orthographic + morphological + lexical feature sets for the Adverse Reaction, Indication, Active Ingredient, and Inactive Ingredient entities. A viewable Decision Tree structure is also generated for each fold.
The file 'decisiontree.py' builds a Decision Tree classifier on the sparse-format ARFF file passed in as a parameter. The resulting model is saved in the models directory as 'decisiontree_[featuresets]_[entity name].pkl'.
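As a minimal sketch of what 'decisiontree.py' presumably does under the hood (the toy matrix and labels here are illustrative stand-ins, not the actual parsed ARFF contents):

```python
import numpy as np
import joblib
from sklearn import tree

# Hypothetical toy data standing in for a parsed sparse-format ARFF file:
# two binary features per token, labeled entity vs. non-entity.
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y = np.array(["Entity", "Non-Entity", "Entity", "Non-Entity"])

clf = tree.DecisionTreeClassifier()
clf.fit(X, y)

# Persist the model under the naming scheme described above.
joblib.dump(clf, "decisiontree_o_adversereaction.pkl")
```

The saved '.pkl' file is what 'evaluate_decisiontree.py' later reloads for each fold.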
The file 'evaluate_decisiontree.py' evaluates a given Decision Tree model stored in a '.pkl' file, outputting the appropriate statistics and saving a PDF image of the decision structure underlying the model.
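For reference, the precision, recall, and F-measure that 'evaluate_decisiontree.py' reports can be computed directly with scikit-learn; the gold labels and predictions below are made-up placeholders for one fold:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold labels and fold predictions:
y_true = ["Entity", "Entity", "Non-Entity", "Non-Entity", "Entity"]
y_pred = ["Entity", "Non-Entity", "Non-Entity", "Entity", "Entity"]

# Binary evaluation with "Entity" as the positive class.
p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label="Entity", average="binary")
print("Precision: %.2f  Recall: %.2f  F-measure: %.2f" % (p, r, f))
```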
All ARFF files were cleaned with 'arff_translator.py'. This cleaning consisted of removing a stray comma from each instance that was mistakenly inserted during file creation.
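The exact position of the stray comma is specific to our file-creation bug; purely as an illustrative sketch, assuming the artifact shows up as a doubled comma in each data row, the cleaning step amounts to:

```python
def clean_instance(line):
    """Drop the first doubled comma from one ARFF data row.

    Hypothetical bug shape; the real fix lives in arff_translator.py.
    """
    return line.replace(",,", ",", 1)

# e.g. a sparse-format instance carrying the artifact:
print(clean_instance("{0 1,,3 1, 7 1}"))
```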
In [1]:
import os
import subprocess
""" Creates models for each fold and runs evaluation with results """
featureset = "o"
entity_name = "adversereaction"
for fold in range(1, 1):  # training has already been done
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 decisiontree.py -tr %s" % (training_data))
for fold in range(1, 11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_decisiontree.py -te %s" % (testing_data), shell=True)
    print(output.decode('utf-8'))
Rather lackluster performance.
In [2]:
import os
import subprocess
""" Creates models for each fold and runs evaluation with results """
featureset = "om"
entity_name = "adversereaction"
for fold in range(1, 1):  # training has already been done
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 decisiontree.py -tr %s" % (training_data))
for fold in range(1, 11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_decisiontree.py -te %s" % (testing_data), shell=True)
    print(output.decode('utf-8'))
It appears that adding the morphological features greatly increased classifier performance.
Below, find the underlying decision tree structure representing the classifier.
In [1]:
import graphviz
from sklearn.externals import joblib
from Tools import arff_converter
from sklearn import tree

featureset = "o"  # careful with high-dimensional datasets
entity_name = "adversereaction"
fold = 1  # change this to display a graph of the decision tree structure for a fold
training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
dataset = arff_converter.arff2df(training_data)
dtree = joblib.load('../Models/decisiontree/%s_%s/decisiontree_%s_%s_train-%i.arff.pkl' % (entity_name, featureset, featureset, entity_name, fold))
tree.export_graphviz(dtree,
                     out_file="visual/temptree.dot",
                     feature_names=dataset.columns.values[:-1],
                     class_names=["Entity", "Non-Entity"], label='all',
                     filled=True, rounded=True, proportion=False, leaves_parallel=True,
                     special_characters=True,
                     max_depth=3)  # change for more detail; careful with large datasets
with open("visual/temptree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
# graphviz.Source(dot_graph).view()
# the above line is a fullscreen alternative; it also generates a temporary file that requires manual removal
Out[1]:
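Beyond visualization, the trained tree itself carries the feature-selection signal mentioned in the introduction. A minimal sketch, using a toy model and made-up feature names in place of the `dtree` and `dataset.columns.values[:-1]` loaded above:

```python
import numpy as np
from sklearn import tree

# Toy stand-in for one fold's trained model; in the notebook these would
# come from the joblib.load and arff2df calls in the previous cell.
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y = ["Entity", "Non-Entity", "Entity", "Non-Entity"]
dtree = tree.DecisionTreeClassifier().fit(X, y)
feature_names = ["suffix_ia", "is_capitalized"]  # hypothetical feature names

# Rank features by Gini importance; zero-importance features are
# candidates to drop from the high-dimensionality feature sets.
ranked = sorted(zip(dtree.feature_importances_, feature_names), reverse=True)
for importance, name in ranked:
    print("%-16s %.3f" % (name, importance))
```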