In [1]:

    
import __init__

import pandas as pd

from nltk.corpus import wordnet as wn
from nltk.corpus.reader import NOUN

import cpLib.conceptDB as db

import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline

In this notebook, we study how the norm of a concept vector relates to the 'precision' of the concept.

To do so, we look at the distribution of the norm as a function of the concept's depth in the WordNet taxonomy.
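For reference, the norm used throughout is read from the first component of the polar vectors. Assuming it is the usual Euclidean length of the embedding (an assumption, since the conversion to polar coordinates is not shown in this notebook), it could be recomputed from Cartesian coordinates with a hypothetical helper such as `euclideanNorm`:

```python
import math

def euclideanNorm(vect):
    """Euclidean (L2) norm of a vector given in Cartesian coordinates."""
    return math.sqrt(sum(x * x for x in vect))

# Toy 2-d vector, not taken from the corpus
print(euclideanNorm([3.0, 4.0]))  # 5.0
```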

Wikipedia corpus



In [6]:

    
def buildCptDf(d):
    # fetch the norm: first component of each polar vector
    normDict = {}
    for w in d.voc:
        normDict[w] = d.get(w).vect[0]

    cptDf = pd.DataFrame.from_dict(normDict, orient='index')
    cptDf.columns = ['norm']

    # depth: smallest max_depth over the noun synsets, 0.0 if the word has none
    def minDepth(w):
        synsets = wn.synsets(w, NOUN)
        return min(s.max_depth() for s in synsets) if synsets else 0.0

    cptDf['depth'] = cptDf.index.map(minDepth)

    return cptDf


d = db.DB('../../data/voc/npy/wikiEn-skipgram_polar.npy')
cptDf = buildCptDf(d)
cptDf[:5]









    



1388424 loaded from wikiEn-skipgram_polar
mem usage 1.6GiB
loaded time 6.01143097878 s






    Out[6]:

                    norm  depth
    tripolitan  5.527292    0.0
    verplank    5.397795    0.0
    mdbg        4.410326    0.0
    biysk       5.908517    0.0
    phintella   6.484352    0.0

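For each noun, we consider all of its candidate meanings (synsets) and keep the one that sits highest in the WordNet taxonomy, i.e. the one with the smallest depth, which is supposed to be the most general concept. Words without any noun synset get a depth of 0.0.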
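The selection rule can be sketched without WordNet. The sense depths below are made-up values for illustration (they are not taken from WordNet), and `shallowestDepth` is a hypothetical helper that mirrors the depth computation in `buildCptDf`:

```python
def shallowestDepth(senseDepths):
    """Return the smallest depth among a word's senses, or 0.0 if it has none."""
    if not senseDepths:
        return 0.0
    return min(senseDepths)

# Hypothetical depths, one value per noun sense of each word
toyDepths = {
    'bank': [7, 9, 12],    # several senses: keep the shallowest one
    'unknownword': [],     # no noun synset: falls back to 0.0
}

for word in sorted(toyDepths):
    print(word, shallowestDepth(toyDepths[word]))
```

Taking the minimum favours the most general reading of an ambiguous word; taking the maximum instead would favour its most specific reading.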



In [9]:

    
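def meanNormDf(cptDf):
    sumDf = []
    # Depth 0.0 (the fallback for words without a noun synset) is skipped;
    # note that range() also stops just before the deepest level
    for depth in range(1, int(cptDf['depth'].max())):
        normSerie = cptDf[cptDf['depth'] == depth]['norm'].rename(str(depth))
        
        # Keep only depths with more than 250 entries
        if len(normSerie) > 250:
            sumDf += [normSerie.mean()]
            # One density curve per retained depth, labelled with the depth
            normSerie.plot(kind='kde', legend=True)

    # Columns are numbered by position (increasing depth), not by depth value
    return pd.DataFrame(sumDf, columns=['mean of norm']).T
    
meanNormDf(cptDf)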









    Out[9]:

                         0         1         2         3         4         5         6
    mean of norm  4.019843  4.235762  4.353298   4.49749  4.843566  4.715548  4.659304

                         7         8         9        10        11        12        13
    mean of norm  4.700412  4.786809  4.895845  5.012274  5.032075  4.969377  4.975434

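We can observe that the mean norm tends to grow with depth: it rises from 4.02 to 4.84 over the first five retained depths, dips to 4.66, then climbs again to about 5.03 before levelling off at the deepest levels. Note that the column labels 0 to 13 number the retained depths in increasing order; they are not the depth values themselves.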
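The same per-depth averaging could also be written with a `groupby`, which has the side benefit of labelling each column with its actual depth. This is only a sketch under assumptions: `meanNormByDepth` and its `minCount` parameter are hypothetical names, the toy data is made up, and unlike the loop above it also considers the deepest level and draws no density plot.

```python
import pandas as pd

def meanNormByDepth(cptDf, minCount=250):
    """Mean norm per depth, keeping depths with more than minCount words."""
    known = cptDf[cptDf['depth'] > 0]  # drop the 0.0 fallback depth
    stats = known.groupby('depth')['norm'].agg(['mean', 'count'])
    kept = stats[stats['count'] > minCount]
    return kept['mean'].rename('mean of norm').to_frame().T

# Toy data (made up) with a tiny threshold so the example stays small
toyDf = pd.DataFrame(
    {'norm': [4.0, 4.2, 5.0, 5.2, 5.4, 6.0],
     'depth': [0.0, 1.0, 2.0, 2.0, 2.0, 3.0]},
    index=['a', 'b', 'c', 'd', 'e', 'f'])

print(meanNormByDepth(toyDf, minCount=1))
```

With `minCount=1`, only depth 2.0 has enough words to be kept, so the result has a single column.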

Text8 corpus



In [10]:

    
d = db.DB('../../data/voc/npy/text8_polar.npy')
cptDf = buildCptDf(d)
meanNormDf(cptDf)









    



71291 loaded from text8_polar
mem usage 54.4MiB
loaded time 0.254580974579 s






    Out[10]:

                          0          1          2          3          4          5
    mean of norm  17.668271  15.836827  14.535161  13.507571  12.191258   11.95251

                          6          7          8          9         10
    mean of norm  11.792441  11.238358  11.002446  10.873932  10.335458

Here the distribution is different, both in its shape and in the fact that the mean decreases with depth (from 17.67 down to 10.34).

Still, in both corpora the mean norm varies with depth in a largely monotonic way: strictly decreasing for Text8, and increasing overall for Wikipedia.
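The monotonicity of the two series can be checked directly on the reported means. In this small sketch, `stepSigns` is a hypothetical helper that counts increases and decreases between consecutive values; a series is monotonic when one of the two counts is zero. The numbers are copied from the two output tables.

```python
def stepSigns(values):
    """Count increases and decreases between consecutive values."""
    pairs = list(zip(values, values[1:]))
    ups = sum(1 for a, b in pairs if b > a)
    downs = sum(1 for a, b in pairs if b < a)
    return ups, downs

# Means copied from the Wikipedia and Text8 output tables
wikiMeans = [4.019843, 4.235762, 4.353298, 4.49749, 4.843566, 4.715548,
             4.659304, 4.700412, 4.786809, 4.895845, 5.012274, 5.032075,
             4.969377, 4.975434]
text8Means = [17.668271, 15.836827, 14.535161, 13.507571, 12.191258,
              11.95251, 11.792441, 11.238358, 11.002446, 10.873932,
              10.335458]

print('wikipedia (ups, downs):', stepSigns(wikiMeans))
print('text8 (ups, downs):', stepSigns(text8Means))
```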
