In [2]:
import __init__
import pandas as pd
from bin.splitLogFile import extractSummaryLine
Based on annotated ground truth, we tried to learn a model to classify domain-specific words.
We use as input a combination of 4 datasets:
To do so, we will explore the Cartesian product of:
We use 10-fold cross-validation.
Once you have downloaded the files, you can reproduce the experiment at home with this script:
python experiment/trainAll_domainClf.py > ../data/learnedModel/domain/log.txt
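The training loop amounts to scoring every point of that Cartesian product with 10-fold cross-validation. A minimal sketch of the idea, using synthetic data and hypothetical classifier/feature choices (not the actual datasets or the exact pipeline of `trainAll_domainClf.py`):

```python
# Sketch: score the Cartesian product of classifiers x feature transforms
# with 10-fold cross-validation. Data and component choices are stand-ins.
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import Normalizer, StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(200, 10)          # stand-in for word-embedding features
y = rng.randint(0, 3, 200)      # stand-in for domain labels

classifiers = [RandomForestClassifier(n_estimators=10, random_state=0),
               LogisticRegression()]
features = [Normalizer(), StandardScaler()]  # e.g. angular vs. raw features

results = {}
for clf, feat in product(classifiers, features):
    scores = cross_val_score(clf, feat.fit_transform(X), y,
                             cv=10, scoring='f1_macro')
    results[(type(clf).__name__, type(feat).__name__)] = scores.mean()
```

Each entry of `results` maps a (classifier, feature) pair to its mean cross-validated f1, which is what the summary file aggregates per domain combination.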
In [6]:
summaryDf = pd.DataFrame([extractSummaryLine(l) for l in open('../../data/learnedModel/domain/summary.txt').readlines()],
columns=['domain', 'strict', 'clf', 'feature', 'post', 'precision', 'recall', 'f1'])
summaryDf = summaryDf[summaryDf['clf'] != 'KNeighborsClassifier'].sort_values('f1', ascending=False)
print len(summaryDf)
summaryDf[:5]
Out[6]:
Considering the nature of the data (very close points for semantically close concepts, e.g. puppy and dog), KNN is not relevant.
As you can see, there are a lot of trained models (198); therefore,
we need a method to select the best combination, i.e. one robust to the number and variety of domains.
To do so, we'll select the best average model for each dataset combination.
In [8]:
summaryDf['f1'] = summaryDf['f1'].astype(float)
summaryDf[['feature', 'post', 'f1']].groupby(['feature', 'post']).describe().unstack(level=-1)
Out[8]:
We observe several things here:
If we had to select one model, we could choose the angular feature with no post-processing, which is the best in the edge case (4 domains).
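The selection rule described here picks the (feature, post) pair with the highest average f1 across domain combinations. A toy sketch of that step (hypothetical scores, not the actual summary data):

```python
# Sketch: pick the (feature, post) combination with the best mean f1.
# The scores below are made up; the real ones come from summary.txt.
import pandas as pd

toy = pd.DataFrame({
    'domain':  ['animal-plant', 'animal-plant',
                'animal-plant-vehicle', 'animal-plant-vehicle'],
    'feature': ['angular', 'raw', 'angular', 'raw'],
    'post':    ['noPost', 'noPost', 'noPost', 'noPost'],
    'f1':      [0.92, 0.90, 0.88, 0.80],
})

mean_f1 = toy.groupby(['feature', 'post'])['f1'].mean()
best_feature, best_post = mean_f1.idxmax()
```

Averaging over the `domain` column is what makes the choice robust to the number and variety of domains, rather than tuned to a single combination.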
In [20]:
summaryDf[summaryDf['domain'] == 'animal-plant-vehicle-other'][:1]
Out[20]:
In [19]:
summaryDf[(summaryDf['feature'] == 'angular') & (summaryDf['post'] == 'noPost') & (summaryDf['strict'] == '')]
Out[19]:
In [25]:
!python ../../toolbox/script/detailConceptClfError.py ../../data/voc/npy/wikiEn-skipgram.npy ../../data/learnedModel/domain/animal-plant-vehicle__RandomForestClassifier_angular_noPost.dill ../../data/domain/luu_animal.txt animal ../../data/domain/luu_plant.txt plant ../../data/domain/luu_vehicle.txt vehicle
Be aware that there is no cross-validation here, so we are overfitting.
Yet, we see that collisions seem to be due to multiple meanings for one concept:
'rocket', for example, is both a plant and a vehicle, which makes it an unsolvable case for this model.
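Why is this unsolvable? A word has a single embedding vector, so the classifier must assign it a single domain even when its training evidence points both ways. A synthetic illustration (not the actual model): a point sitting between two overlapping classes gets split per-class probabilities, yet `predict()` must still commit to one label.

```python
# Sketch: an ambiguous point between two overlapping classes, mimicking a
# polysemous word ('rocket') sitting between the plant and vehicle domains.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [1, 0],    # class 0, e.g. 'vehicle'
               rng.randn(50, 2) + [-1, 0]])  # class 1, e.g. 'plant'
y = np.array([0] * 50 + [1] * 50)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
proba = clf.predict_proba([[0.0, 0.0]])[0]  # the ambiguous point
label = clf.predict([[0.0, 0.0]])[0]        # forced single-label decision
```

Whichever label wins, the other reading of the word is counted as an error against the ground truth.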
Let's compare with the same domains but adding other:
In [26]:
!python ../../toolbox/script/detailConceptClfError.py ../../data/voc/npy/wikiEn-skipgram.npy ../../data/learnedModel/domain/animal-plant-vehicle-other__RandomForestClassifier_angular_noPost.dill ../../data/domain/luu_animal.txt animal ../../data/domain/luu_plant.txt plant ../../data/domain/luu_vehicle.txt vehicle ../../data/domain/all_1400.txt other
We observe 3 things:
Once again, the model proves its ability to 'challenge' the ground truth (which is highly biased here).
The recognition rate is satisfying, and the classification errors highlight one main issue: