In [1]:
import __init__
import pandas as pd
from nltk.corpus import wordnet as wn
from nltk.corpus.reader import NOUN
import cpLib.conceptDB as db
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
In this notebook, we study the influence of the norm on the 'precision' of concepts.
To do so, we look at the distribution of norms according to the depth of each concept in the WordNet taxonomy.
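As a quick illustration of what 'depth' means here, the sketch below (not part of the original notebook; the word 'dog' is an arbitrary example) prints the max_depth() of each noun synset of a word, i.e. its distance from the root of the taxonomy:
In [ ]:
# Illustration only: each synset reports its distance from the taxonomy
# root via max_depth()
for s in wn.synsets('dog', NOUN):
    print(s.name(), s.max_depth())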
In [6]:
def buildCptDf(d):
    normDict = {}
    for w in d.voc:
        # fetch the norm: first component of the polar representation
        normDict[w] = d.get(w).vect[0]
    cptDf = pd.DataFrame.from_dict(normDict, orient='index')
    cptDf.columns = ['norm']
    # depth: shallowest noun sense of the word in the WordNet taxonomy
    cptDf['depth'] = cptDf.index.map(lambda w: 0.0 if len(wn.synsets(w, NOUN)) == 0
                                     else min(s.max_depth() for s in wn.synsets(w, NOUN)))
    return cptDf
d = db.DB('../../data/voc/npy/wikiEn-skipgram_polar.npy')
cptDf = buildCptDf(d)
cptDf[:5]
Out[6]:
For each noun, among all its potential meanings (synsets), we keep the highest level in the WordNet taxonomy (i.e. the smallest depth), which is supposed to correspond to the most general, encompassing concept.
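For instance, for a polysemous noun (the word 'bank' below is just a hypothetical example), this amounts to keeping the smallest max_depth() across its noun synsets:
In [ ]:
# Hypothetical example: keep the shallowest sense of a polysemous noun
word = 'bank'
depths = [s.max_depth() for s in wn.synsets(word, NOUN)]
print(depths, '-> kept depth:', min(depths))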
In [9]:
def meanNormDf(cptDf):
    meanByDepth = {}
    for depth in range(1, int(cptDf['depth'].max()) + 1):
        normSerie = cptDf[cptDf['depth'] == depth]['norm'].rename(str(depth))
        # only keep depths with at least 250 entries
        if len(normSerie) > 250:
            meanByDepth[str(depth)] = normSerie.mean()
            normSerie.plot(kind='kde', legend=True)
    return pd.DataFrame(meanByDepth, index=['mean of norm'])
meanNormDf(cptDf)
Out[9]:
We can observe that the deeper we go in the taxonomy, the higher the norm tends to be (as long as we do not go too deep).
In [10]:
d = db.DB('../../data/voc/npy/text8_polar.npy')
cptDf = buildCptDf(d)
meanNormDf(cptDf)
Out[10]:
Here we observe a different distribution, both in its shape and in the fact that the mean decreases with depth.
Nevertheless, the relation between the norm and the depth in the taxonomy remains monotonic.
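One possible way to quantify this monotonic relation (a sketch, not in the original notebook; it assumes scipy is available) is Spearman's rank correlation between depth and norm:
In [ ]:
# Spearman's rank correlation between depth and norm over the whole vocabulary
from scipy.stats import spearmanr
rho, p = spearmanr(cptDf['depth'], cptDf['norm'])
print('Spearman rho = %.3f (p = %.3g)' % (rho, p))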