Random Forest Decision Tree Ensembles build a set of Decision Trees, each on a random subset of a given data set, and combine the individual trees' votes into a single classification decision.
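The ensemble idea can be sketched with scikit-learn (assumed available; the toy data and parameter values below are illustrative, not those used in this report): each tree is fit on a bootstrap sample of the training data, and the forest combines the per-tree votes into one prediction.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy binary classification data standing in for the ARFF feature vectors.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# bootstrap=True: each tree sees a random sample (with replacement) of the data.
forest = RandomForestClassifier(n_estimators=10, bootstrap=True, random_state=0)
forest.fit(X, y)

print(len(forest.estimators_))  # the 10 individual decision trees
print(forest.predict(X[:1]))    # the combined classification decision
```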
This report provides precision, recall, and F-measure values for Random Forest classifiers built on the orthographic; orthographic + morphological; and orthographic + morphological + lexical feature sets for the Adverse Reaction, Indication, Active Ingredient, and Inactive Ingredient entities. A viewable decision tree structure is also generated for each fold.
The file 'randomforest.py' builds a Random Forest Ensemble classifier on the sparse-format ARFF file passed in as a parameter. The resulting model is saved in the models directory with the name format 'randomforest_[featuresets]_[entity name].pkl'.
The file 'evaluate_randomforest.py' evaluates a given Random Forest Ensemble model stored inside a '.pkl' file, outputting the appropriate statistics and saving a PDF image of the decision structure underlying the given model.
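A hedged sketch of the evaluation step described above, using scikit-learn: compute precision, recall, and F-measure from a fitted model's predictions, and export one tree of the ensemble in DOT form (which the script presumably renders to PDF via Graphviz). The names `model`, `X`, and `y` here are illustrative toy data, not the script's actual inputs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.tree import export_graphviz

# Stand-in for a model loaded from a '.pkl' file and its test fold.
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Precision, recall, and F-measure for the positive class.
precision, recall, fmeasure, _ = precision_recall_fscore_support(
    y, model.predict(X), average="binary")
print("P=%.3f R=%.3f F=%.3f" % (precision, recall, fmeasure))

# DOT description of the first tree's decision structure.
dot = export_graphviz(model.estimators_[0], out_file=None)
```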
All ARFF files were cleaned with 'arff_translator.py'. This cleaning consisted of removing a comma from each instance that was mistakenly inserted during file creation.
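A minimal sketch of the kind of per-instance cleanup 'arff_translator.py' is described as performing. The exact position of the stray comma is not stated in the report, so the doubled-comma assumption below is illustrative only.

```python
def clean_instance(line):
    """Collapse an accidentally doubled comma in one ARFF data row
    (assumption: the spurious comma appears adjacent to a real one)."""
    return line.replace(",,", ",")

print(clean_instance("1,0,,1,adversereaction"))  # "1,0,1,adversereaction"
```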
In [1]:
import os
import subprocess

""" Creates models for each fold and runs evaluation with results """
featureset = "o"
entity_name = "adversereaction"

for fold in range(1, 1):  # empty range: training has already been done
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 randomforest.py -tr %s" % training_data)

for fold in range(1, 11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_randomforest.py -te %s" % testing_data, shell=True)
    print(output.decode('utf-8'))
Orthographic features alone yield relatively high precision but very low recall. This implies that orthographic features alone are not enough to carve out a decision boundary covering all of the positive instances, hence the low recall. However, the decision boundary that is learned is very selective, as the high precision indicates.
In [2]:
import os
import subprocess

""" Creates models for each fold and runs evaluation with results """
featureset = "om"
entity_name = "adversereaction"

for fold in range(1, 1):  # empty range: training has already been done
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 randomforest.py -tr %s" % training_data)

for fold in range(1, 11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_randomforest.py -te %s" % testing_data, shell=True)
    print(output.decode('utf-8'))
Adding the morphological features appears to greatly increase classifier performance.
Below, find the underlying decision tree structure representing the classifier.
In [3]:
import os
import subprocess

""" Creates models for each fold and runs evaluation with results """
featureset = "omt"
entity_name = "adversereaction"

for fold in range(1, 1):  # empty range: training has already been done
    training_data = "../ARFF_Files/%s_ARFF/_%s/_train/%s_train-%i.arff" % (entity_name, featureset, entity_name, fold)
    os.system("python3 randomforest.py -tr %s" % training_data)

for fold in range(1, 11):
    testing_data = "../ARFF_Files/%s_ARFF/_%s/_test/%s_test-%i.arff" % (entity_name, featureset, entity_name, fold)
    output = subprocess.check_output("python3 evaluate_randomforest.py -te %s" % testing_data, shell=True)
    print(output.decode('utf-8'))