In [1]:
import __init__

import pandas as pd

from nltk.corpus import wordnet as wn
from nltk.corpus.reader import NOUN

import cpLib.conceptDB as db

import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline

In this notebook, we study how the norm of a concept vector relates to the 'precision' of the concept.

To do so, we look at the distribution of the norm as a function of the concept's depth in the WordNet taxonomy.

Wikipedia corpus


In [6]:
def buildCptDf(d):
    # norm: first component of each word's (polar) vector
    normDict = {}
    for w in d.voc:
        normDict[w] = d.get(w).vect[0]

    cptDf = pd.DataFrame.from_dict(normDict, orient='index')
    cptDf.columns = ['norm']

    # depth: shallowest noun synset of the word, 0.0 if WordNet has none
    def minDepth(w):
        synsets = wn.synsets(w, NOUN)
        return 0.0 if len(synsets) == 0 else min(s.max_depth() for s in synsets)

    cptDf['depth'] = cptDf.index.map(minDepth)

    return cptDf


d = db.DB('../../data/voc/npy/wikiEn-skipgram_polar.npy')
cptDf = buildCptDf(d)
cptDf[:5]


1388424 loaded from wikiEn-skipgram_polar
mem usage 1.6GiB
loaded time 6.01143097878 s
Out[6]:
                norm  depth
tripolitan  5.527292    0.0
verplank    5.397795    0.0
mdbg        4.410326    0.0
biysk       5.908517    0.0
phintella   6.484352    0.0

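For each noun, we consider all of its candidate meanings (synsets) and keep the one that sits highest in the WordNet taxonomy, i.e. the one with the smallest depth, which is assumed to be the most general concept. Words without any noun synset get a depth of 0.0.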
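The selection rule can be sketched on its own, without WordNet. This is a minimal illustration: the words and depth values in `TOY_SENSE_DEPTHS` are hypothetical placeholders, standing in for what `max_depth()` would return for each noun synset of a word.

```python
# Hypothetical per-sense depths (NOT real WordNet values); in the notebook
# these would come from s.max_depth() for each noun synset s of the word.
TOY_SENSE_DEPTHS = {
    "word_with_three_senses": [6, 8, 10],
    "word_with_one_sense": [9],
    "word_unknown_to_wordnet": [],
}


def shallowest_depth(sense_depths):
    """Return the smallest depth among the senses, or 0.0 if there are none."""
    return float(min(sense_depths)) if sense_depths else 0.0


for word, depths in sorted(TOY_SENSE_DEPTHS.items()):
    print(word, shallowest_depth(depths))
```

Taking the minimum means a polysemous word is represented by its most general sense, so a word with one generic sense and several very specific ones is counted as generic.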


In [9]:
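def meanNormDf(cptDf):
    sumDf = []
    # depth 0, which includes every word without a noun synset, is skipped;
    # note that range() also stops before the maximum depth
    for depth in range(1, int(cptDf['depth'].max())):
        normSerie = cptDf[cptDf['depth'] == depth]['norm'].rename(str(depth))

        # keep only depths with more than 250 entries
        if len(normSerie) > 250:
            sumDf += [normSerie.mean()]
            # density plot of the norm for this depth, labelled by depth
            normSerie.plot(kind='kde', legend=True)

    # columns of the result are positional (0, 1, ...), one per retained depth
    return pd.DataFrame(sumDf, columns=['mean of norm']).T

meanNormDf(cptDf)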


Out[9]:
                     0         1         2         3         4         5         6         7         8         9        10        11        12        13
mean of norm  4.019843  4.235762  4.353298   4.49749  4.843566  4.715548  4.659304  4.700412  4.786809  4.895845  5.012274  5.032075  4.969377  4.975434

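We observe that the mean norm tends to increase with depth in the taxonomy, at least up to a point: it rises steadily over the first retained depths, dips slightly, then climbs again before levelling off around 5.0 at the deepest levels. (The column labels are positions among the retained depths, not the depth values themselves.)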

Text8 corpus


In [10]:
d = db.DB('../../data/voc/npy/text8_polar.npy')
cptDf = buildCptDf(d)
meanNormDf(cptDf)


71291 loaded from text8_polar
mem usage 54.4MiB
loaded time 0.254580974579 s
Out[10]:
                      0          1          2          3          4          5          6          7          8          9         10
mean of norm  17.668271  15.836827  14.535161  13.507571  12.191258   11.95251  11.792441  11.238358  11.002446  10.873932  10.335458

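Here the distribution differs, both in its shape and in its trend: the mean norm decreases steadily as depth increases, from 17.67 down to 10.34.

Nevertheless, in both corpora we still observe a roughly monotonic relation between the norm and the depth in the taxonomy, although its direction differs: increasing for Wikipedia, decreasing for Text8.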
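One way to put a number on this monotonic relation is a Spearman rank correlation between the position of each retained depth and its mean norm. The sketch below is an addition, not part of the original analysis: it reuses the means from the two result tables, uses the column position as a stand-in for depth (which assumes the columns are ordered by increasing depth, as the loop produces them), and implements the correlation by hand to stay dependency-free.

```python
import math

# Mean norms copied from the two result tables, ordered by increasing depth.
wiki_means = [4.019843, 4.235762, 4.353298, 4.49749, 4.843566, 4.715548,
              4.659304, 4.700412, 4.786809, 4.895845, 5.012274, 5.032075,
              4.969377, 4.975434]
text8_means = [17.668271, 15.836827, 14.535161, 13.507571, 12.191258,
               11.95251, 11.792441, 11.238358, 11.002446, 10.873932,
               10.335458]


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


wiki_rho = spearman(list(range(len(wiki_means))), wiki_means)
text8_rho = spearman(list(range(len(text8_means))), text8_means)
print("Wikipedia: %.3f" % wiki_rho)
print("Text8:     %.3f" % text8_rho)
```

On these values the coefficient comes out at roughly 0.91 for Wikipedia and -1.0 for Text8, which matches the reading above: a strong but imperfect increasing trend in one corpus and a strictly decreasing one in the other. With only 14 and 11 points, this is a descriptive summary rather than a significance test.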