In [2]:
import __init__
import pandas as pd
from bin.splitLogFile import extractSummaryLine
Based on annotated ground truth, we tried to learn a model to classify domain-specific words.
We use as input a combination of 4 datasets:
To do so, we will explore the Cartesian product of:
We use 10-fold cross-validation.
Once you have downloaded the files, you can reproduce the experiment at home with this script:
python experiment/trainAll_domainClf.py > ../data/learnedModel/domain/log.txt
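The training loop amounts to scoring every point of that Cartesian product with 10-fold cross-validation. A minimal sketch of the idea, using synthetic data and hypothetical classifier/feature choices (not the actual datasets or the exact pipeline of `trainAll_domainClf.py`):

```python
# Sketch: score the Cartesian product of classifiers x feature transforms
# with 10-fold cross-validation. Data and component choices are stand-ins.
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import Normalizer, StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(200, 10)          # stand-in for word-embedding features
y = rng.randint(0, 3, 200)      # stand-in for domain labels

classifiers = [RandomForestClassifier(n_estimators=10, random_state=0),
               LogisticRegression()]
features = [Normalizer(), StandardScaler()]  # e.g. angular vs. raw features

results = {}
for clf, feat in product(classifiers, features):
    scores = cross_val_score(clf, feat.fit_transform(X), y,
                             cv=10, scoring='f1_macro')
    results[(type(clf).__name__, type(feat).__name__)] = scores.mean()
```

Each entry of `results` maps a (classifier, feature) pair to its mean cross-validated f1, which is what the summary file aggregates per domain combination.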
In [6]:
summaryDf = pd.DataFrame([extractSummaryLine(l) for l in open('../../data/learnedModel/domain/summary.txt').readlines()],
columns=['domain', 'strict', 'clf', 'feature', 'post', 'precision', 'recall', 'f1'])
summaryDf = summaryDf[summaryDf['clf'] != 'KNeighborsClassifier'].sort_values('f1', ascending=False)
print len(summaryDf)
summaryDf[:5]
Out[6]:
Considering the nature of the data (very close points for semantically close concepts, e.g. puppy and dog), KNN is not relevant.
As you can see, there are a lot of trained models (198); therefore,
we need a method to select the best combination, i.e. one robust to the number and variety of domains.
To do so, we'll select the best average model for each dataset combination.
In [8]:
summaryDf['f1'] = summaryDf['f1'].astype(float)
summaryDf[['feature', 'post', 'f1']].groupby(['feature', 'post']).describe().unstack(level=-1)
Out[8]:
We observe several things here:
If we had to select one model, we could choose the angular feature with no post-processing, which is the best in the edge case (4 domains).
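The selection rule described here picks the (feature, post) pair with the highest average f1 across domain combinations. A toy sketch of that step (hypothetical scores, not the actual summary data):

```python
# Sketch: pick the (feature, post) combination with the best mean f1.
# The scores below are made up; the real ones come from summary.txt.
import pandas as pd

toy = pd.DataFrame({
    'domain':  ['animal-plant', 'animal-plant',
                'animal-plant-vehicle', 'animal-plant-vehicle'],
    'feature': ['angular', 'raw', 'angular', 'raw'],
    'post':    ['noPost', 'noPost', 'noPost', 'noPost'],
    'f1':      [0.92, 0.90, 0.88, 0.80],
})

mean_f1 = toy.groupby(['feature', 'post'])['f1'].mean()
best_feature, best_post = mean_f1.idxmax()
```

Averaging over the `domain` column is what makes the choice robust to the number and variety of domains, rather than tuned to a single combination.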
In [20]:
summaryDf[summaryDf['domain'] == 'animal-plant-vehicle-other'][:1]
Out[20]:
In [19]:
summaryDf[(summaryDf['feature'] == 'angular') & (summaryDf['post'] == 'noPost') & (summaryDf['strict'] == '')]
Out[19]:
In [25]:
!python ../../toolbox/script/detailConceptClfError.py ../../data/voc/npy/wikiEn-skipgram.npy ../../data/learnedModel/domain/animal-plant-vehicle__RandomForestClassifier_angular_noPost.dill ../../data/domain/luu_animal.txt animal ../../data/domain/luu_plant.txt plant ../../data/domain/luu_vehicle.txt vehicle
Be aware that there is no cross-validation here, so we are overfitting.
Yet, we see that collisions seem to be due to multiple meanings for one concept:
'rocket', for example, is both a plant and a vehicle, which makes it an unsolvable case for this model.
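Why is this unsolvable? A word has a single embedding vector, so the classifier must assign it a single domain even when its training evidence points both ways. A synthetic illustration (not the actual model): a point sitting between two overlapping classes gets split per-class probabilities, yet `predict()` must still commit to one label.

```python
# Sketch: an ambiguous point between two overlapping classes, mimicking a
# polysemous word ('rocket') sitting between the plant and vehicle domains.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [1, 0],    # class 0, e.g. 'vehicle'
               rng.randn(50, 2) + [-1, 0]])  # class 1, e.g. 'plant'
y = np.array([0] * 50 + [1] * 50)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
proba = clf.predict_proba([[0.0, 0.0]])[0]  # the ambiguous point
label = clf.predict([[0.0, 0.0]])[0]        # forced single-label decision
```

Whichever label wins, the other reading of the word is counted as an error against the ground truth.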
Let's compare with the same domains but adding other:
In [26]:
!python ../../toolbox/script/detailConceptClfError.py ../../data/voc/npy/wikiEn-skipgram.npy ../../data/learnedModel/domain/animal-plant-vehicle-other__RandomForestClassifier_angular_noPost.dill ../../data/domain/luu_animal.txt animal ../../data/domain/luu_plant.txt plant ../../data/domain/luu_vehicle.txt vehicle ../../data/domain/all_1400.txt other
We observe 3 things:
Once again, the model proves its ability to 'challenge' the ground truth (which is highly biased here).
The recognition rate is satisfying, and the classification errors highlight one main issue: