In [2]:
import __init__

import pandas as pd

from bin.splitLogFile import extractSummaryLine

Experiment

Based on annotated ground truth, we tried to learn a model that classifies domain-specific words.

We use as input a combination of 4 datasets:

  • animal
  • vehicle
  • plant
  • other - a random sample from the whole vocabulary

To do so, we explore the Cartesian product of:

  • domains: a combination of N of the previously presented domains
  • strict: try to compose missing concepts
  • randomForest / knn: knn allows us to check whether there is anything consistent to learn; randomForest is a basic model used as a first approach to learn the function
  • feature: one of the features presented in the guided tour
  • postFeature: any extra processing to apply after the feature extraction (such as normalisation)
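
As a rough sketch of the size of this search space, the grid can be enumerated with itertools. It assumes that a domain combination contains at least two datasets, and it reuses the feature and post-processing names that appear in the results; the variable names are illustrative and are not taken from trainAll_domainClf.py.

```python
from itertools import combinations, product

# Names as they appear in the summary table; the grid layout itself is an assumption.
datasets = ['animal', 'plant', 'vehicle', 'other']
strictModes = ['', 'strict']
classifiers = ['RandomForestClassifier', 'KNeighborsClassifier']
features = ['angular', 'identity', 'polar']
postFeatures = ['noPost', 'postAbs', 'postNormalize']

# Every combination of at least two datasets, e.g. 'plant-vehicle'
domains = ['-'.join(combo)
           for size in range(2, len(datasets) + 1)
           for combo in combinations(datasets, size)]

# Cartesian product of all the dimensions
grid = list(product(domains, strictModes, classifiers, features, postFeatures))

print(len(domains))  # 11 domain combinations
print(len(grid))     # 396 configurations, i.e. 198 per classifier
```

Under these assumptions, half of the configurations (198) use the random forest, which is consistent with the number of rows kept in the summary.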

We use 10-fold cross-validation.

Once you have downloaded the files, you can use this script to reproduce the experiment at home:
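
For reference, the splitting scheme can be sketched in plain Python. This only illustrates how 10-fold cross-validation partitions the data; it is not the code used by the training script, `kFoldIndices` is a hypothetical helper, and the shuffling seed is arbitrary.

```python
import random

def kFoldIndices(nSamples, nFolds=10, seed=0):
    """Yield (trainIdx, testIdx) pairs; each sample is tested exactly once."""
    indices = list(range(nSamples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every nFolds-th shuffled index, starting at offset i
    folds = [indices[i::nFolds] for i in range(nFolds)]
    for i, testIdx in enumerate(folds):
        trainIdx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield trainIdx, testIdx

# 1111 is the size of the combined animal, plant and vehicle dataset
for trainIdx, testIdx in kFoldIndices(1111):
    assert not set(trainIdx) & set(testIdx)  # no overlap between train and test
```

Each model is trained ten times, each time on nine folds, and scored on the remaining one.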

python experiment/trainAll_domainClf.py > ../data/learnedModel/domain/log.txt

Results

Here is a summary of the results we gathered. Detailed reports can be found in the logs.


In [6]:
with open('../../data/learnedModel/domain/summary.txt') as summaryFile:
    summaryDf = pd.DataFrame([extractSummaryLine(l) for l in summaryFile],
                             columns=['domain', 'strict', 'clf', 'feature', 'post', 'precision', 'recall', 'f1'])

# Drop the KNN runs and rank the remaining models by f1-score
summaryDf = summaryDf[summaryDf['clf'] != 'KNeighborsClassifier'].sort_values('f1', ascending=False)
print(len(summaryDf))
summaryDf[:5]


198
Out[6]:
     domain         strict  clf                     feature   post           precision  recall  f1
338  plant-vehicle          RandomForestClassifier  identity  postNormalize  0.972      0.97    0.97
356  plant-vehicle  strict  RandomForestClassifier  identity  postNormalize  0.971      0.969   0.968
340  plant-vehicle          RandomForestClassifier  polar     postAbs        0.968      0.967   0.967
352  plant-vehicle  strict  RandomForestClassifier  angular   postAbs        0.97       0.967   0.967
359  plant-vehicle  strict  RandomForestClassifier  polar     postNormalize  0.966      0.963   0.964

Considering the nature of the data (points are very close for semantically close concepts, e.g. puppy and dog), KNN is not relevant.

As you can see, there are a lot of trained models (198). We therefore
need a method to select the best combination, i.e. one that is robust to the number and variety of domains.

To do so, we select the model with the best average score across the dataset combinations.


In [8]:
summaryDf['f1'] = summaryDf['f1'].astype(float)
summaryDf[['feature', 'post', 'f1']].groupby(['feature', 'post']).describe().unstack(level=-1)


Out[8]:
                         f1
                         count  mean      std       min    25%      50%     75%      max
feature   post
angular   noPost         22.0   0.897682  0.044522  0.811  0.85975  0.9140  0.93050  0.962
          postAbs        22.0   0.899136  0.045783  0.806  0.85975  0.9185  0.92900  0.967
          postNormalize  22.0   0.894500  0.042041  0.813  0.86375  0.8990  0.93275  0.958
identity  noPost         22.0   0.891591  0.043890  0.803  0.86200  0.8930  0.93000  0.956
          postAbs        22.0   0.778818  0.069272  0.639  0.72250  0.7875  0.82400  0.876
          postNormalize  22.0   0.897682  0.045976  0.802  0.86175  0.9070  0.92900  0.970
polar     noPost         22.0   0.897182  0.042480  0.807  0.87150  0.9085  0.93100  0.961
          postAbs        22.0   0.895000  0.044825  0.804  0.85875  0.9035  0.93050  0.967
          postNormalize  22.0   0.892591  0.043845  0.807  0.85875  0.8985  0.92675  0.964

We observe several things here:

  • The f1-score decreases as we add more varied domains (from ~95% for 2 domains to ~80% for 4).
  • On average, the results are satisfying for a basic model.
  • The choice of feature (angular, polar, Cartesian) has little impact on the average score; the one clear outlier in the table is identity with postAbs (mean f1 of about 0.78).
  • Adding the possibility to compose concepts (strict) improves the score very slightly.

If we had to select one model, we could choose the angular feature with no post-processing, which is the best in the edge case (4 domains):


In [20]:
summaryDf[summaryDf['domain'] == 'animal-plant-vehicle-other'][:1]


Out[20]:
     domain                      strict  clf                     feature  post    precision  recall  f1
126  animal-plant-vehicle-other          RandomForestClassifier  angular  noPost  0.843      0.841   0.823

In [19]:
summaryDf[(summaryDf['feature'] == 'angular') & (summaryDf['post'] == 'noPost') & (summaryDf['strict'] == '')]


Out[19]:
     domain                      strict  clf                     feature  post    precision  recall  f1
333  plant-vehicle                       RandomForestClassifier  angular  noPost  0.965      0.962   0.962
225  animal-vehicle                      RandomForestClassifier  angular  noPost  0.953      0.953   0.950
81   animal-plant                        RandomForestClassifier  angular  noPost  0.947      0.945   0.945
369  vehicle-other                       RandomForestClassifier  angular  noPost  0.923      0.92    0.919
9    animal-other                        RandomForestClassifier  angular  noPost  0.917      0.916   0.916
261  plant-other                         RandomForestClassifier  angular  noPost  0.919      0.915   0.915
153  animal-plant-vehicle                RandomForestClassifier  angular  noPost  0.901      0.899   0.895
306  plant-vehicle-other                 RandomForestClassifier  angular  noPost  0.894      0.887   0.878
54   animal-plant-other                  RandomForestClassifier  angular  noPost  0.865      0.861   0.859
198  animal-vehicle-other                RandomForestClassifier  angular  noPost  0.872      0.862   0.848
126  animal-plant-vehicle-other          RandomForestClassifier  angular  noPost  0.843      0.841   0.823
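
The drop in f1-score with the number of domains can be read directly from this table. The sketch below copies the f1 column of Out[19] by hand and averages it per number of combined domains; the dictionary and variable names are only illustrative.

```python
from collections import defaultdict

# f1-scores copied from Out[19] (angular feature, noPost, non-strict)
f1ByDomain = {
    'plant-vehicle': 0.962, 'animal-vehicle': 0.950, 'animal-plant': 0.945,
    'vehicle-other': 0.919, 'animal-other': 0.916, 'plant-other': 0.915,
    'animal-plant-vehicle': 0.895, 'plant-vehicle-other': 0.878,
    'animal-plant-other': 0.859, 'animal-vehicle-other': 0.848,
    'animal-plant-vehicle-other': 0.823,
}

# Group the scores by the number of datasets in each combination
scoresBySize = defaultdict(list)
for domain, f1 in f1ByDomain.items():
    scoresBySize[len(domain.split('-'))].append(f1)

meanF1BySize = {size: sum(scores) / len(scores)
                for size, scores in scoresBySize.items()}

for size in sorted(meanF1BySize):
    print('%d domains: mean f1 = %.3f' % (size, meanF1BySize[size]))
```

For this configuration, the average goes from roughly 0.93 with two domains to 0.87 with three and 0.82 with four.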

Error analysis

Here are the detailed classification errors for the combined animal, plant and vehicle domains:


In [25]:
!python ../../toolbox/script/detailConceptClfError.py ../../data/voc/npy/wikiEn-skipgram.npy ../../data/learnedModel/domain/animal-plant-vehicle__RandomForestClassifier_angular_noPost.dill ../../data/domain/luu_animal.txt animal ../../data/domain/luu_plant.txt plant ../../data/domain/luu_vehicle.txt vehicle


1388424 loaded from wikiEn-skipgram
mem usage 1.6GiB
loaded time 2.28821516037 s
input: elder  /  predicted: animal  /  true: plant  /  proba:[ 0.52770563  0.46344372  0.00885065]
input: periwinkle  /  predicted: animal  /  true: plant  /  proba:[ 0.59370291  0.38447223  0.02182487]
input: rocket  /  predicted: animal  /  true: plant  /  proba:[ 0.36572428  0.26895122  0.36532451]
input: dumper  /  predicted: animal  /  true: vehicle  /  proba:[ 0.53974509  0.17098901  0.28926589]
input: electric  /  predicted: animal  /  true: vehicle  /  proba:[ 0.60716588  0.10343283  0.28940128]
input: rocket  /  predicted: animal  /  true: vehicle  /  proba:[ 0.36572428  0.26895122  0.36532451]
input: semi  /  predicted: animal  /  true: vehicle  /  proba:[ 0.67749268  0.03053849  0.29196884]
input: tipper  /  predicted: animal  /  true: vehicle  /  proba:[ 0.48796488  0.11740702  0.3946281 ]

--  REPORT  --
             precision    recall  f1-score   support

     animal       0.99      1.00      0.99       542
      plant       1.00      0.99      1.00       466
    vehicle       1.00      0.95      0.98       103

avg / total       0.99      0.99      0.99      1111

Be aware that there is no cross-validation here, so we are overfitting and these scores are optimistic.

Yet, we see that the collisions seem to be due to concepts with several meanings:
rocket, for example, is both a plant and a vehicle, which makes it an unsolvable case for this model.
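
Such unsolvable cases can be listed before training by looking for words that are annotated in more than one domain. The following is a minimal sketch on toy word lists; `findAmbiguousWords` is a hypothetical helper, and the real lists would come from the luu_*.txt files, whose exact format is not shown here.

```python
from collections import defaultdict

def findAmbiguousWords(wordsByDomain):
    """Return {word: sorted domains} for words annotated in several domains."""
    domainsByWord = defaultdict(set)
    for domain, words in wordsByDomain.items():
        for word in words:
            domainsByWord[word].add(domain)
    return {word: sorted(domains)
            for word, domains in domainsByWord.items() if len(domains) > 1}

# Toy excerpt built from words that appear in the error logs of this section
wordsByDomain = {
    'animal': ['periwinkle', 'ant'],
    'plant': ['periwinkle', 'rocket', 'elder'],
    'vehicle': ['rocket', 'dumper'],
}

ambiguous = findAmbiguousWords(wordsByDomain)
print(ambiguous)
```

Since each word has a single vector, a classifier that outputs one label per word necessarily gets at least one of these annotations wrong.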

Let's compare with the same domains but adding other:


In [26]:
!python ../../toolbox/script/detailConceptClfError.py ../../data/voc/npy/wikiEn-skipgram.npy ../../data/learnedModel/domain/animal-plant-vehicle-other__RandomForestClassifier_angular_noPost.dill ../../data/domain/luu_animal.txt animal ../../data/domain/luu_plant.txt plant ../../data/domain/luu_vehicle.txt vehicle ../../data/domain/all_1400.txt other


1388424 loaded from wikiEn-skipgram
mem usage 1.6GiB
loaded time 2.23708605766 s
input: ant  /  predicted: other  /  true: animal  /  proba:[ 0.44698061  0.47728024  0.03956311  0.03617603]
input: aphid  /  predicted: plant  /  true: animal  /  proba:[ 0.4205326   0.06616029  0.50666878  0.00663833]
input: bryozoan  /  predicted: other  /  true: animal  /  proba:[ 0.28859803  0.37158572  0.18269929  0.15711696]
input: bullock  /  predicted: other  /  true: animal  /  proba:[ 0.44268397  0.47954619  0.0655728   0.01219704]
input: cephalochordate  /  predicted: other  /  true: animal  /  proba:[ 0.29835134  0.59678727  0.09063798  0.0142234 ]
input: cephalopod  /  predicted: other  /  true: animal  /  proba:[ 0.40773633  0.44026934  0.10582134  0.04617299]
input: coelenterate  /  predicted: other  /  true: animal  /  proba:[ 0.35773328  0.42768328  0.16387334  0.0507101 ]
input: collembolan  /  predicted: other  /  true: animal  /  proba:[ 0.28649462  0.42505779  0.27505705  0.01339054]
input: conch  /  predicted: other  /  true: animal  /  proba:[ 0.4149726   0.49854156  0.06766974  0.0188161 ]
input: cricket  /  predicted: other  /  true: animal  /  proba:[ 0.3628891   0.57340675  0.03811551  0.02558863]
input: female  /  predicted: other  /  true: animal  /  proba:[ 0.41602699  0.43227433  0.06870852  0.08299015]
input: fetus  /  predicted: other  /  true: animal  /  proba:[ 0.21438143  0.38281834  0.30153137  0.10126886]
input: fisher  /  predicted: other  /  true: animal  /  proba:[ 0.23903853  0.50360001  0.06098512  0.19637634]
input: game  /  predicted: other  /  true: animal  /  proba:[ 0.30137054  0.58407286  0.07058566  0.04397094]
input: gastropod  /  predicted: other  /  true: animal  /  proba:[ 0.33504196  0.44254691  0.19681332  0.02559781]
input: gnu  /  predicted: other  /  true: animal  /  proba:[ 0.27754129  0.55000117  0.14460895  0.02784859]
input: gorgonian  /  predicted: other  /  true: animal  /  proba:[ 0.28779823  0.45182579  0.19016982  0.07020616]
input: grub  /  predicted: other  /  true: animal  /  proba:[ 0.35815936  0.36892295  0.19039886  0.08251882]
input: guanaco  /  predicted: other  /  true: animal  /  proba:[ 0.28651099  0.52009469  0.16996517  0.02342914]
input: hart  /  predicted: other  /  true: animal  /  proba:[ 0.375365    0.41697974  0.1896232   0.01803206]
input: hydrozoan  /  predicted: other  /  true: animal  /  proba:[ 0.28298707  0.38912175  0.30523088  0.0226603 ]
input: jack  /  predicted: other  /  true: animal  /  proba:[ 0.30253979  0.46881654  0.18399569  0.04464798]
input: kiang  /  predicted: other  /  true: animal  /  proba:[ 0.18135746  0.54336857  0.23603194  0.03924203]
input: kudus  /  predicted: other  /  true: animal  /  proba:[ 0.41369482  0.44373094  0.09797738  0.04459685]
input: leech  /  predicted: other  /  true: animal  /  proba:[ 0.33710568  0.40904253  0.22422854  0.02962325]
input: livestock  /  predicted: other  /  true: animal  /  proba:[ 0.36199703  0.39503132  0.20583432  0.03713732]
input: liza  /  predicted: other  /  true: animal  /  proba:[ 0.35388255  0.47089143  0.08730651  0.08791951]
input: locust  /  predicted: plant  /  true: animal  /  proba:[ 0.1957628   0.27105741  0.51315542  0.02002436]
input: mare  /  predicted: other  /  true: animal  /  proba:[ 0.34153627  0.38334255  0.21900775  0.05611343]
input: medusa  /  predicted: other  /  true: animal  /  proba:[ 0.35433538  0.5468813   0.07576345  0.02301986]
input: moth  /  predicted: plant  /  true: animal  /  proba:[ 0.31682934  0.24001358  0.43623565  0.00692143]
input: mullet  /  predicted: other  /  true: animal  /  proba:[ 0.26172462  0.5953232   0.1202752   0.02267697]
input: periwinkle  /  predicted: plant  /  true: animal  /  proba:[ 0.22760357  0.08097104  0.62379982  0.06762556]
input: pest  /  predicted: plant  /  true: animal  /  proba:[ 0.34586507  0.24986497  0.3782927   0.02597726]
input: primate  /  predicted: other  /  true: animal  /  proba:[ 0.45622375  0.46011274  0.05276722  0.0308963 ]
input: ram  /  predicted: other  /  true: animal  /  proba:[ 0.28062712  0.54942798  0.14992112  0.02002378]
input: ratel  /  predicted: other  /  true: animal  /  proba:[ 0.42538728  0.45904248  0.05509626  0.06047398]
input: rhea  /  predicted: other  /  true: animal  /  proba:[ 0.23205132  0.51245318  0.21437706  0.04111845]
input: robin  /  predicted: other  /  true: animal  /  proba:[ 0.37436454  0.519887    0.06746879  0.03827967]
input: salp  /  predicted: other  /  true: animal  /  proba:[ 0.37061139  0.56412546  0.02246334  0.0427998 ]
input: saurian  /  predicted: other  /  true: animal  /  proba:[ 0.41519959  0.46158929  0.07983094  0.04338019]
input: skate  /  predicted: other  /  true: animal  /  proba:[ 0.35476907  0.39567611  0.23025952  0.0192953 ]
input: yak  /  predicted: other  /  true: animal  /  proba:[ 0.25577816  0.39443725  0.13638508  0.21339952]
input: young  /  predicted: other  /  true: animal  /  proba:[ 0.22593007  0.64861118  0.08981339  0.03564536]
input: alga  /  predicted: other  /  true: plant  /  proba:[ 0.11713995  0.45686998  0.40246224  0.02352783]
input: alstroemeria  /  predicted: other  /  true: plant  /  proba:[ 0.20058751  0.40987284  0.37376524  0.01577441]
input: annual  /  predicted: other  /  true: plant  /  proba:[ 0.19559753  0.62341408  0.15218006  0.02880833]
input: bitterroot  /  predicted: animal  /  true: plant  /  proba:[ 0.44683013  0.27589282  0.26372874  0.01354831]
input: centaury  /  predicted: other  /  true: plant  /  proba:[ 0.22686339  0.45119916  0.30944789  0.01248956]
input: climber  /  predicted: other  /  true: plant  /  proba:[ 0.26019133  0.39891531  0.1632122   0.17768115]
input: composite  /  predicted: other  /  true: plant  /  proba:[ 0.14099299  0.57778067  0.23784378  0.04338255]
input: conifer  /  predicted: animal  /  true: plant  /  proba:[ 0.48151509  0.10926372  0.38960809  0.0196131 ]
input: crucifer  /  predicted: other  /  true: plant  /  proba:[ 0.23371797  0.40145504  0.35424806  0.01057893]
input: elder  /  predicted: other  /  true: plant  /  proba:[ 0.16285167  0.6013301   0.21277506  0.02304317]
input: frangipani  /  predicted: other  /  true: plant  /  proba:[ 0.2298451   0.47447679  0.26744127  0.02823684]
input: guar  /  predicted: other  /  true: plant  /  proba:[ 0.12317942  0.60200737  0.24554262  0.02927059]
input: hawthorn  /  predicted: other  /  true: plant  /  proba:[ 0.04626004  0.55597733  0.37601294  0.0217497 ]
input: heath  /  predicted: other  /  true: plant  /  proba:[ 0.15354265  0.46233484  0.34902338  0.03509913]
input: hoya  /  predicted: other  /  true: plant  /  proba:[ 0.07286231  0.68775714  0.20756999  0.03181056]
input: liana  /  predicted: other  /  true: plant  /  proba:[ 0.1391583   0.58791597  0.21477134  0.05815439]
input: ling  /  predicted: other  /  true: plant  /  proba:[ 0.14403447  0.57713727  0.25246479  0.02636347]
input: mangrove  /  predicted: animal  /  true: plant  /  proba:[ 0.50842938  0.10948469  0.36002458  0.02206134]
input: maranta  /  predicted: other  /  true: plant  /  proba:[ 0.09980329  0.49384903  0.36726399  0.03908369]
input: marrow  /  predicted: other  /  true: plant  /  proba:[ 0.06658945  0.47905279  0.44372157  0.01063619]
input: papyrus  /  predicted: other  /  true: plant  /  proba:[ 0.08028603  0.46276602  0.38189614  0.07505181]
input: phytoplankton  /  predicted: other  /  true: plant  /  proba:[ 0.21428117  0.38745226  0.35188804  0.04637853]
input: pinon  /  predicted: other  /  true: plant  /  proba:[ 0.15043216  0.46326095  0.3546033   0.03170359]
input: ramp  /  predicted: other  /  true: plant  /  proba:[ 0.09954547  0.62049674  0.17999476  0.09996303]
input: rocket  /  predicted: vehicle  /  true: plant  /  proba:[ 0.17231564  0.23032237  0.26717962  0.33018237]
input: senna  /  predicted: other  /  true: plant  /  proba:[ 0.02432083  0.4445068   0.43489362  0.09627875]
input: soie  /  predicted: other  /  true: plant  /  proba:[ 0.04972663  0.51084713  0.36275289  0.07667335]
input: spermatophyte  /  predicted: other  /  true: plant  /  proba:[ 0.22809579  0.5651832   0.16126189  0.04545912]
input: ti  /  predicted: other  /  true: plant  /  proba:[ 0.1139587   0.43627034  0.38966767  0.0601033 ]
input: tobacco  /  predicted: other  /  true: plant  /  proba:[ 0.14717922  0.41151974  0.39556695  0.04573408]
input: tumbleweed  /  predicted: other  /  true: plant  /  proba:[ 0.26434696  0.34145432  0.32274227  0.07145644]
input: viola  /  predicted: other  /  true: plant  /  proba:[ 0.04577327  0.52191823  0.3469759   0.0853326 ]
input: woad  /  predicted: other  /  true: plant  /  proba:[ 0.12439215  0.42086573  0.38730698  0.06743514]
input: yam  /  predicted: other  /  true: plant  /  proba:[ 0.16391467  0.50504901  0.31289888  0.01813744]
input: apc  /  predicted: other  /  true: vehicle  /  proba:[ 0.26338496  0.46749353  0.04944805  0.21967347]
input: automobile  /  predicted: other  /  true: vehicle  /  proba:[ 0.03760072  0.50164853  0.01545129  0.44529946]
input: bicycle  /  predicted: other  /  true: vehicle  /  proba:[ 0.26801871  0.42687975  0.06401224  0.24108929]
input: bomber  /  predicted: other  /  true: vehicle  /  proba:[ 0.37439346  0.49298322  0.03853735  0.09408597]
input: buggy  /  predicted: other  /  true: vehicle  /  proba:[ 0.12294841  0.48245099  0.17824378  0.21635682]
input: bulldozer  /  predicted: other  /  true: vehicle  /  proba:[ 0.27321283  0.32869367  0.15183118  0.24626233]
input: bus  /  predicted: other  /  true: vehicle  /  proba:[ 0.16877928  0.49918392  0.06418505  0.26785174]
input: camper  /  predicted: animal  /  true: vehicle  /  proba:[ 0.45872386  0.27128794  0.10955703  0.16043116]
input: caravan  /  predicted: other  /  true: vehicle  /  proba:[ 0.19109092  0.55385374  0.13067359  0.12438175]
input: cart  /  predicted: other  /  true: vehicle  /  proba:[ 0.14680899  0.51545333  0.08329053  0.25444715]
input: chariot  /  predicted: other  /  true: vehicle  /  proba:[ 0.25945618  0.43236877  0.09895327  0.20922178]
input: chopper  /  predicted: other  /  true: vehicle  /  proba:[ 0.23607557  0.38920749  0.06213377  0.31258316]
input: coach  /  predicted: other  /  true: vehicle  /  proba:[ 0.29221715  0.37835215  0.101267    0.22816369]
input: convertible  /  predicted: other  /  true: vehicle  /  proba:[ 0.26067997  0.48882211  0.10872144  0.14177648]
input: cycle  /  predicted: other  /  true: vehicle  /  proba:[ 0.18152936  0.35876236  0.17274909  0.28695919]
input: dozer  /  predicted: other  /  true: vehicle  /  proba:[ 0.33144972  0.41516442  0.10895253  0.14443333]
input: dray  /  predicted: other  /  true: vehicle  /  proba:[ 0.14432557  0.53894097  0.04229209  0.27444136]
input: dumper  /  predicted: other  /  true: vehicle  /  proba:[ 0.1761083   0.36689896  0.23469432  0.22229841]
input: electric  /  predicted: other  /  true: vehicle  /  proba:[ 0.14161699  0.52688516  0.215438    0.11605985]
input: hearse  /  predicted: other  /  true: vehicle  /  proba:[ 0.13044007  0.5555208   0.0322955   0.28174363]
input: houseboat  /  predicted: other  /  true: vehicle  /  proba:[ 0.16409086  0.39546167  0.22919668  0.21125079]
input: jet  /  predicted: animal  /  true: vehicle  /  proba:[ 0.41659118  0.24045766  0.12852916  0.214422  ]
input: lorry  /  predicted: other  /  true: vehicle  /  proba:[ 0.19251535  0.3975372   0.10787281  0.30207464]
input: missile  /  predicted: other  /  true: vehicle  /  proba:[ 0.11792487  0.37550471  0.14647782  0.3600926 ]
input: moped  /  predicted: other  /  true: vehicle  /  proba:[ 0.23583078  0.39957087  0.11023425  0.2543641 ]
input: motorbike  /  predicted: other  /  true: vehicle  /  proba:[ 0.21363365  0.3919265   0.14481813  0.24962173]
input: motorcar  /  predicted: other  /  true: vehicle  /  proba:[ 0.24521803  0.37432921  0.15142443  0.22902833]
input: pedicab  /  predicted: other  /  true: vehicle  /  proba:[ 0.30970105  0.50462925  0.07222327  0.11344644]
input: plane  /  predicted: other  /  true: vehicle  /  proba:[ 0.20939531  0.4219785   0.22035301  0.14827318]
input: roadster  /  predicted: other  /  true: vehicle  /  proba:[ 0.16001166  0.41088383  0.10318553  0.32591898]
input: rv  /  predicted: other  /  true: vehicle  /  proba:[ 0.01757679  0.47998215  0.11445923  0.38798183]
input: scrambler  /  predicted: other  /  true: vehicle  /  proba:[ 0.21789814  0.33768964  0.11725724  0.32715498]
input: sedan  /  predicted: other  /  true: vehicle  /  proba:[ 0.17622515  0.40241205  0.19800924  0.22335355]
input: semi  /  predicted: other  /  true: vehicle  /  proba:[ 0.30736164  0.45821491  0.19476822  0.03965522]
input: ship  /  predicted: other  /  true: vehicle  /  proba:[ 0.21511132  0.32508371  0.19638452  0.26342046]
input: skateboard  /  predicted: other  /  true: vehicle  /  proba:[ 0.16724618  0.44982321  0.20380925  0.17912136]
input: steamboat  /  predicted: animal  /  true: vehicle  /  proba:[ 0.3762554   0.24981554  0.17481211  0.19911695]
input: streetcar  /  predicted: animal  /  true: vehicle  /  proba:[ 0.38663018  0.32709068  0.13960565  0.14667348]
input: stroller  /  predicted: animal  /  true: vehicle  /  proba:[ 0.34845657  0.30748563  0.08174152  0.26231628]
input: submarine  /  predicted: other  /  true: vehicle  /  proba:[ 0.13769305  0.53297819  0.07150524  0.25782352]
input: tanker  /  predicted: other  /  true: vehicle  /  proba:[ 0.14454939  0.33190776  0.28603584  0.23750701]
input: tipper  /  predicted: other  /  true: vehicle  /  proba:[ 0.07568354  0.46817615  0.07683698  0.37930332]
input: trailer  /  predicted: other  /  true: vehicle  /  proba:[ 0.19395311  0.46803951  0.07789979  0.26010759]
input: trolley  /  predicted: other  /  true: vehicle  /  proba:[ 0.26924116  0.30643631  0.11895962  0.30536291]
input: van  /  predicted: other  /  true: vehicle  /  proba:[ 0.09007364  0.50754203  0.17570541  0.22667892]
input: watercraft  /  predicted: other  /  true: vehicle  /  proba:[ 0.21619092  0.33447561  0.17472849  0.27460498]
input: starlings  /  predicted: animal  /  true: other  /  proba:[ 0.81435831  0.05387403  0.13077711  0.00099055]
input: boitani  /  predicted: animal  /  true: other  /  proba:[ 0.36062877  0.31341924  0.30542328  0.02052871]
input: procolobus  /  predicted: animal  /  true: other  /  proba:[ 0.51969579  0.35743106  0.0423133   0.08055985]
input: ericaceous  /  predicted: plant  /  true: other  /  proba:[ 0.09850226  0.20035667  0.6917039   0.00943717]
input: rainforests  /  predicted: animal  /  true: other  /  proba:[ 0.52915062  0.18908949  0.24105434  0.04070554]
input: syntopic  /  predicted: animal  /  true: other  /  proba:[ 0.6412854   0.2478721   0.10426714  0.00657537]
input: bonny  /  predicted: animal  /  true: other  /  proba:[ 0.39923775  0.36414631  0.21983564  0.01678029]
input: milkvetch  /  predicted: animal  /  true: other  /  proba:[ 0.32614018  0.31034904  0.31211206  0.05139873]
input: lysergol  /  predicted: plant  /  true: other  /  proba:[ 0.1764514   0.32334008  0.33983713  0.16037139]
input: kachinensis  /  predicted: plant  /  true: other  /  proba:[ 0.19585754  0.38765987  0.40251974  0.01396285]
input: sepat  /  predicted: animal  /  true: other  /  proba:[ 0.42503001  0.41305372  0.12566559  0.03625067]
input: myiagra  /  predicted: animal  /  true: other  /  proba:[ 0.46290275  0.33394081  0.16315522  0.04000121]
input: lumnitzera  /  predicted: plant  /  true: other  /  proba:[ 0.19724492  0.13226234  0.65796119  0.01253154]
input: diops  /  predicted: plant  /  true: other  /  proba:[ 0.20869245  0.26492036  0.50964693  0.01674026]
input: longstalk  /  predicted: plant  /  true: other  /  proba:[ 0.11710073  0.17947155  0.69121117  0.01221655]
input: melanochromis  /  predicted: animal  /  true: other  /  proba:[ 0.47510905  0.45094812  0.04655534  0.0273875 ]
input: baitings  /  predicted: animal  /  true: other  /  proba:[ 0.42424333  0.37487265  0.18321105  0.01767297]
input: tetradymia  /  predicted: plant  /  true: other  /  proba:[ 0.08688111  0.29131097  0.61200516  0.00980275]
input: cactoblastis  /  predicted: animal  /  true: other  /  proba:[ 0.43533487  0.29009788  0.18637892  0.08818833]
input: manystem  /  predicted: plant  /  true: other  /  proba:[ 0.03567812  0.1993369   0.75461795  0.01036703]
input: muddy  /  predicted: animal  /  true: other  /  proba:[ 0.43163088  0.2053937   0.25201071  0.1109647 ]
input: heteropods  /  predicted: animal  /  true: other  /  proba:[ 0.60800415  0.26074369  0.05995172  0.07130044]
input: megaloceros  /  predicted: animal  /  true: other  /  proba:[ 0.91891776  0.06200078  0.01184872  0.00723274]
input: calliphorid  /  predicted: animal  /  true: other  /  proba:[ 0.54669444  0.25647947  0.1646302   0.03219589]
input: oreobolus  /  predicted: plant  /  true: other  /  proba:[ 0.11708665  0.33905031  0.53823035  0.00563269]
input: familiaris  /  predicted: animal  /  true: other  /  proba:[ 0.42812561  0.40401686  0.10289357  0.06496395]
input: acanthostega  /  predicted: animal  /  true: other  /  proba:[ 0.52873984  0.19904484  0.18061651  0.09159881]
input: yews  /  predicted: plant  /  true: other  /  proba:[ 0.08739409  0.35370095  0.5306433   0.02826166]
input: cottonrose  /  predicted: plant  /  true: other  /  proba:[ 0.12215332  0.31770424  0.5202154   0.03992704]
input: macrocystidia  /  predicted: plant  /  true: other  /  proba:[ 0.32152523  0.31591634  0.35410521  0.00845322]
input: guinotia  /  predicted: animal  /  true: other  /  proba:[ 0.40230982  0.32105125  0.11917954  0.15745939]
input: chartoff  /  predicted: animal  /  true: other  /  proba:[ 0.40002482  0.33305514  0.10337872  0.16354132]
input: tecticornia  /  predicted: plant  /  true: other  /  proba:[ 0.29488675  0.15256598  0.53614724  0.01640003]
input: weedn  /  predicted: animal  /  true: other  /  proba:[ 0.41091637  0.29929518  0.25668337  0.03310508]
input: racemiflora  /  predicted: plant  /  true: other  /  proba:[ 0.09820956  0.37506615  0.4910996   0.03562468]
input: tsho  /  predicted: plant  /  true: other  /  proba:[ 0.17801377  0.30077783  0.4543915   0.0668169 ]
input: friddle  /  predicted: animal  /  true: other  /  proba:[ 0.39829057  0.34031825  0.14995455  0.11143664]
input: herriot  /  predicted: plant  /  true: other  /  proba:[ 0.27411987  0.31065559  0.39150357  0.02372098]
input: orleanesia  /  predicted: plant  /  true: other  /  proba:[ 0.1528525   0.24582055  0.57514492  0.02618203]
input: atripalpis  /  predicted: plant  /  true: other  /  proba:[ 0.12259691  0.37900551  0.49400073  0.00439686]
input: haredevil  /  predicted: animal  /  true: other  /  proba:[ 0.49531791  0.44536365  0.01503562  0.04428282]
input: agression  /  predicted: animal  /  true: other  /  proba:[ 0.44361778  0.4207662   0.04544173  0.09017429]
input: coniferous  /  predicted: animal  /  true: other  /  proba:[ 0.36015237  0.3384307   0.29377651  0.00764042]
input: pomatomus  /  predicted: animal  /  true: other  /  proba:[ 0.43809022  0.26796846  0.2876037   0.00633762]
input: seedbox  /  predicted: plant  /  true: other  /  proba:[ 0.08129003  0.40654893  0.48535383  0.0268072 ]
input: amicability  /  predicted: animal  /  true: other  /  proba:[ 0.41140301  0.34604853  0.24026431  0.00228415]
input: ceratophthalma  /  predicted: animal  /  true: other  /  proba:[ 0.6658318   0.22989368  0.07596553  0.02830899]
input: topolnitsa  /  predicted: animal  /  true: other  /  proba:[ 0.36489274  0.32614001  0.2742317   0.03473555]
input: enneacanthus  /  predicted: animal  /  true: other  /  proba:[ 0.40331055  0.37513319  0.18604871  0.03550754]
input: asfv  /  predicted: plant  /  true: other  /  proba:[ 0.29858698  0.26902847  0.3693616   0.06302295]
input: forepaugh  /  predicted: animal  /  true: other  /  proba:[ 0.5459933   0.21708621  0.15638582  0.08053467]
input: armigera  /  predicted: animal  /  true: other  /  proba:[ 0.40374934  0.32331048  0.23403465  0.03890554]
input: cacua  /  predicted: animal  /  true: other  /  proba:[ 0.43166514  0.3300851   0.10574005  0.13250971]
input: orthents  /  predicted: plant  /  true: other  /  proba:[ 0.30451453  0.31581307  0.35068857  0.02898383]
input: friedman  /  predicted: plant  /  true: other  /  proba:[ 0.04115728  0.43712115  0.45800424  0.06371733]
input: sloanii  /  predicted: animal  /  true: other  /  proba:[ 0.4195294   0.34433071  0.19485046  0.04128943]
input: chihuahuas  /  predicted: animal  /  true: other  /  proba:[ 0.48876835  0.27795145  0.10564748  0.12763272]
input: cinclidae  /  predicted: animal  /  true: other  /  proba:[ 0.53180844  0.16985219  0.27894323  0.01939614]
input: cuscutaceae  /  predicted: plant  /  true: other  /  proba:[ 0.23064011  0.16916667  0.60019322  0.        ]
input: angrysummit  /  predicted: animal  /  true: other  /  proba:[ 0.43779399  0.43555442  0.11231978  0.01433181]
input: eves  /  predicted: plant  /  true: other  /  proba:[ 0.08630488  0.33244298  0.46714026  0.11411188]
input: hamun  /  predicted: animal  /  true: other  /  proba:[ 0.59905033  0.33122977  0.03698902  0.03273088]
input: brachyceros  /  predicted: animal  /  true: other  /  proba:[ 0.58180292  0.3108812   0.07894479  0.02837108]
input: pollachius  /  predicted: animal  /  true: other  /  proba:[ 0.49145639  0.32555768  0.13705392  0.04593202]
input: australasiae  /  predicted: animal  /  true: other  /  proba:[ 0.36081657  0.29505046  0.22076636  0.12336662]

--  REPORT  --
             precision    recall  f1-score   support

     animal       0.91      0.92      0.91       542
      other       0.92      0.95      0.94      1400
      plant       0.94      0.93      0.93       466
    vehicle       0.98      0.55      0.71       103

avg / total       0.93      0.92      0.92      2511

We observe 3 things:

  • Recall on vehicle is very poor (0.55), which could be explained by the small size of the dataset (103 words) or the face.
  • Words with several meanings create noise:
    • grub is predicted as other instead of animal
    • ...
  • Almost all conflicts involve the 'other' class. Indeed, looking deeper into the other dataset,
    which is a random sample from the vocabulary, we notice that many of its words actually belong to one of the other domains (animal, plant, vehicle):
    • cuscutaceae is a plant
    • sloanii is an animal
    • ...

Once again, the model proves its ability to 'challenge' the ground truth (which is highly biased here).
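
These observations can be quantified by parsing the error lines printed above. The sketch below assumes the "input: ... / predicted: ... / true: ... / proba:[...]" layout shown in the log; `parseErrorLine` and `countConfusions` are hypothetical helpers and are not part of detailConceptClfError.py.

```python
import re
from collections import Counter

# Layout assumed from the log above: fields separated by ' / '
ERROR_LINE = re.compile(
    r'input:\s*(?P<word>\S+)\s*/\s*predicted:\s*(?P<predicted>\S+)'
    r'\s*/\s*true:\s*(?P<true>\S+)')

def parseErrorLine(line):
    """Return (word, predicted, true) or None if the line is not an error line."""
    match = ERROR_LINE.search(line)
    if match is None:
        return None
    return match.group('word'), match.group('predicted'), match.group('true')

def countConfusions(lines):
    """Count misclassifications per (true, predicted) pair."""
    counts = Counter()
    for line in lines:
        parsed = parseErrorLine(line)
        if parsed is not None:
            _, predicted, trueLabel = parsed
            counts[(trueLabel, predicted)] += 1
    return counts

# A few lines copied from the log above
sampleLog = [
    'mem usage 1.6GiB',
    'input: grub  /  predicted: other  /  true: animal  /  proba:[ 0.35815936  0.36892295  0.19039886  0.08251882]',
    'input: bus  /  predicted: other  /  true: vehicle  /  proba:[ 0.16877928  0.49918392  0.06418505  0.26785174]',
    'input: cuscutaceae  /  predicted: plant  /  true: other  /  proba:[ 0.23064011  0.16916667  0.60019322  0.        ]',
]

confusions = countConfusions(sampleLog)
print(confusions)
```

On the full log, the 46 errors whose true class is vehicle, out of a support of 103, account for the reported recall: 1 - 46/103 ≈ 0.55.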

Conclusion

The recognition rate is satisfying, yet the classification errors highlight one main issue:

  • Many words have several meanings but only one position in the word2vec space.
    A workaround could be to adapt the annotation of the dataset, but the real problem is inherent to word2vec.