WordNet and NLTK

(C) 2016-2019 by Damir Cavar <dcavar@iu.edu>

Version: 1.1, November 2019

This is a brief introduction to WordNet in NLTK.

More details on WordNet itself are available on the WordNet website.

Using WordNet

Some content and ideas in the following introduction are taken from the NLTK-howto on WordNet.

Import the WordNet corpus reader in NLTK using this code:


In [1]:
from nltk.corpus import wordnet

WordNet is a lexical resource that organizes nouns, verbs, adjectives, and adverbs into a taxonomy-like network. Lexical items are, for example, organized in groups of synonyms. In WordNet these synonym groups are called synsets, each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The following call lists all synsets that contain the word can:


In [2]:
wordnet.synsets('can')


Out[2]:
[Synset('can.n.01'),
 Synset('can.n.02'),
 Synset('can.n.03'),
 Synset('buttocks.n.01'),
 Synset('toilet.n.02'),
 Synset('toilet.n.01'),
 Synset('can.v.01'),
 Synset('displace.v.03')]

The output is a list of all synsets that contain the word can. The name of each synset is a dot-delimited triple that specifies a word, the part of speech (PoS) of that word, and a running number from 1 to n that distinguishes the senses of that word. The PoS-tag n stands for noun and the PoS-tag v for verb. A synset is named after one of its lemmas, which is why synsets like buttocks.n.01 and toilet.n.02 show up in the list for can.
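The naming scheme can be taken apart with plain string operations. The following is a minimal sketch that does not need NLTK; the helper name parse_synset_name is made up for this illustration, and the names are copied from the output above.

```python
def parse_synset_name(name):
    """Split a synset name such as 'can.n.01' into (word, pos, number)."""
    # Split from the right, so that a word that itself contains dots stays intact.
    word, pos, number = name.rsplit(".", 2)
    return word, pos, int(number)


# Names copied from the output of wordnet.synsets('can').
names = ["can.n.01", "buttocks.n.01", "toilet.n.02", "displace.v.03"]
for name in names:
    print(parse_synset_name(name))
```

Splitting from the right is a design choice: the PoS-tag and the number are always the last two fields, while the word itself is the least predictable part.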

You can request a specific synset by providing its full name:


In [3]:
wordnet.synset('can.v.01')


Out[3]:
Synset('can.v.01')

You can output the definition of any such synset:


In [6]:
wordnet.synset('displace.v.03').definition()


Out[6]:
'terminate the employment of; discharge from an office or position'

You can request all synsets of a word with a specific PoS by passing the word and the PoS-tag to the synsets function:


In [5]:
wordnet.synsets('can', pos=wordnet.VERB)


Out[5]:
[Synset('can.v.01'), Synset('displace.v.03')]

The possible PoS-tags are: ADJ, ADJ_SAT, ADV, NOUN, VERB.
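These constants stand for the single-letter codes that appear in synset names. As a small sketch without NLTK, the dictionary below mirrors those codes (a for ADJ, s for ADJ_SAT, r for ADV, n for NOUN, v for VERB); the dictionary POS_LABELS and the helper group_by_pos are names invented for this example.

```python
# Single-letter PoS codes as they appear in synset names, with readable labels.
POS_LABELS = {
    "a": "adjective",
    "s": "satellite adjective",
    "r": "adverb",
    "n": "noun",
    "v": "verb",
}


def group_by_pos(names):
    """Group synset names by the PoS code in their second-to-last field."""
    groups = {}
    for name in names:
        pos = name.rsplit(".", 2)[1]
        groups.setdefault(POS_LABELS[pos], []).append(name)
    return groups


# A few names taken from the outputs for the word can.
names = ["can.n.01", "can.n.02", "can.v.01", "displace.v.03"]
print(group_by_pos(names))
```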

I will use the plural form lemmas rather than lemmata.

WordNet contains a list of lemmas for each synset. You can print out the lemmas using the following function:


In [7]:
wordnet.synset('can.v.01').lemmas()


Out[7]:
[Lemma('can.v.01.can'), Lemma('can.v.01.tin'), Lemma('can.v.01.put_up')]

You can map the lemmas to a list of strings using the following list comprehension:


In [8]:
[str(lemma.name()) for lemma in wordnet.synset('can.v.01').lemmas()]


Out[8]:
['can', 'tin', 'put_up']
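Multi-word lemmas such as put_up are stored with underscores instead of spaces. If the names are meant for display, a simple replacement is enough. This is a minimal sketch; the function name readable is chosen here for illustration.

```python
def readable(lemma_names):
    """Replace the underscores of multi-word lemma names with spaces."""
    return [name.replace("_", " ") for name in lemma_names]


# Lemma names copied from the output above.
print(readable(["can", "tin", "put_up"]))
```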

The NLTK WordNet reader provides access to a multi-lingual WordNet, that is the Open Multilingual WordNet. The multi-lingual data is accessible using ISO-639 language codes (see the ISO-639 Wikipedia page):


In [9]:
wordnet.langs()


Out[9]:
['eng',
 'als',
 'arb',
 'bul',
 'cat',
 'cmn',
 'dan',
 'ell',
 'eus',
 'fas',
 'fin',
 'fra',
 'glg',
 'heb',
 'hrv',
 'ind',
 'ita',
 'jpn',
 'nld',
 'nno',
 'nob',
 'pol',
 'por',
 'qcn',
 'slv',
 'spa',
 'swe',
 'tha',
 'zsm']

To access the synsets of a word in another language, for example the Polish (pol) word kot, you can pass the language code to the synsets function:


In [11]:
wordnet.synsets('kot', lang='pol')


Out[11]:
[Synset('cat.n.01'),
 Synset('domestic_cat.n.01'),
 Synset('grunt.n.02'),
 Synset('raw_recruit.n.01'),
 Synset('sprog.n.01')]

We can even request the list of lemmas in a specific language for a given English synset. For example, the synset house.n.01 has the following lemmas in Japanese (jpn):


In [14]:
wordnet.synset('house.n.01').lemma_names('jpn')


Out[14]:
['ハウス',
 '人家',
 '人屋',
 '令堂',
 '住みか',
 '住み処',
 '住宅',
 '住家',
 '住み家',
 '住居',
 '住屋',
 'お宅',
 '宅',
 '室家',
 'お家',
 '家',
 '家宅',
 '家屋',
 '宿',
 '居',
 '居宅',
 '居所',
 '居館',
 '屋',
 '屋宇',
 '建屋',
 '戸',
 '棲家',
 '棲み家',
 '館']

We can save the synset request in a variable called house:


In [15]:
house = wordnet.synset('house.n.02')

We can now request the hypernyms for the word house using the variable:


In [16]:
house.hypernyms()


Out[16]:
[Synset('business.n.01')]

Try this for some other words like trout and poodle:


In [17]:
wordnet.synset('trout.n.01').hypernyms()


Out[17]:
[Synset('fish.n.02')]

In [18]:
wordnet.synset('poodle.n.01').hypernyms()


Out[18]:
[Synset('dog.n.01')]
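Calling hypernyms() repeatedly walks up the taxonomy one step at a time. The sketch below follows the first hypernym at each step. It only relies on objects that offer a hypernyms() method, so it is shown here with a hypothetical stand-in class, ToySynset, that is not part of NLTK. The toy data contains only the poodle and dog relation from the output above.

```python
def hypernym_chain(synset):
    """Follow the first hypernym of each synset until none is left."""
    chain = [synset]
    while True:
        parents = chain[-1].hypernyms()
        # Stop at a root, or if the first parent was already visited.
        if not parents or parents[0] in chain:
            break
        chain.append(parents[0])
    return chain


class ToySynset:
    """Hypothetical stand-in for a synset; only name() and hypernyms() exist."""

    def __init__(self, name, parents=()):
        self._name = name
        self._parents = list(parents)

    def name(self):
        return self._name

    def hypernyms(self):
        return self._parents


dog = ToySynset("dog.n.01")
poodle = ToySynset("poodle.n.01", [dog])
print([s.name() for s in hypernym_chain(poodle)])
```

With NLTK and the WordNet data installed, passing wordnet.synset('poodle.n.01') to the same function should yield the chain up to the root, since the function only calls hypernyms(). A synset can have more than one hypernym, so this chain is just one of several possible paths; the hypernym_paths() method of a synset returns all of them.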

We can also request the list of hyponyms for a given word. Here we request the list of hyponyms for dog:


In [19]:
wordnet.synset('dog.n.01').hyponyms()


Out[19]:
[Synset('basenji.n.01'),
 Synset('corgi.n.01'),
 Synset('cur.n.01'),
 Synset('dalmatian.n.02'),
 Synset('great_pyrenees.n.01'),
 Synset('griffon.n.02'),
 Synset('hunting_dog.n.01'),
 Synset('lapdog.n.01'),
 Synset('leonberg.n.01'),
 Synset('mexican_hairless.n.01'),
 Synset('newfoundland.n.01'),
 Synset('pooch.n.01'),
 Synset('poodle.n.01'),
 Synset('pug.n.01'),
 Synset('puppy.n.01'),
 Synset('spitz.n.01'),
 Synset('toy_dog.n.01'),
 Synset('working_dog.n.01')]

In the same way we can request the holonyms for certain words. For example, imagine we are interested in the member holonyms for bird, that is the groups that a bird is a member of:


In [23]:
bird = wordnet.synset('bird.n.01')
bird.member_holonyms()


Out[23]:
[Synset('aves.n.01'), Synset('flock.n.02')]

We can also request the root hypernym for some word:


In [25]:
wordnet.synset('finger.n.01').root_hypernyms()


Out[25]:
[Synset('entity.n.01')]

We can request the lowest common hypernym for two words, here for example for leg and arm:


In [26]:
wordnet.synset('leg.n.01').lowest_common_hypernyms(wordnet.synset('arm.n.01'))


Out[26]:
[Synset('limb.n.01')]
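To see what the lowest common hypernym means, consider the following sketch on a tiny hand-written taxonomy. The taxonomy is not WordNet data: it is a simplified tree made up for this example, arranged so that leg and arm meet at limb as in the output above. The names PARENT, ancestors and lowest_common_hypernym are chosen for this illustration only.

```python
# Hand-written toy taxonomy (child -> parent); a simplification, not WordNet data.
PARENT = {
    "leg": "limb",
    "arm": "limb",
    "limb": "body_part",
    "finger": "body_part",
    "body_part": "entity",
}


def ancestors(word):
    """Return the word itself followed by all of its ancestors up to the root."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_hypernym(first, second):
    """Return the first node on the path of `first` that is also above `second`."""
    others = set(ancestors(second))
    for node in ancestors(first):
        if node in others:
            return node
    return None


print(lowest_common_hypernym("leg", "arm"))
print(lowest_common_hypernym("leg", "finger"))
```

In this toy tree every node has exactly one parent, so there is exactly one answer. In WordNet a synset can have several hypernyms, which is why lowest_common_hypernyms returns a list.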

In addition to hypernyms, hyponyms, and holonyms, WordNet also provides the means to request antonyms, derivationally related forms, and pertainyms. Consider for example the adjective white. Antonymy is defined over lemmas rather than synsets, so we fetch all lemmas of the synset white.a.01 and request the antonyms for the first lemma:


In [28]:
wordnet.synset('white.a.01').lemmas()[0].antonyms()


Out[28]:
[Lemma('black.a.01.black')]

We can also fetch the lemma names of a synset in Slovenian, for example for the noun cold:


In [29]:
wordnet.synset('cold.n.01').lemma_names('slv')


Out[29]:
['mraz']

In the same way we can request the Spanish lemma names of a synset, here for the noun dog:


In [30]:
spa_dog = wordnet.synset('dog.n.01').lemma_names('spa')
print(spa_dog)


['can', 'perro']

We can now request the derivationally related forms for a lemma. In this example we request the derivationally related forms for the noun (PoS: n) singer, which are lemmas of three senses of the verb (PoS: v) sing:


In [31]:
wordnet.lemma('singer.n.01.singer').derivationally_related_forms()


Out[31]:
[Lemma('sing.v.01.sing'), Lemma('sing.v.02.sing'), Lemma('sing.v.03.sing')]

We can also request the pertainyms for specific words, that is the words that an adjective pertains to. For the adjective vocal this is the noun voice:


In [32]:
wordnet.lemma('vocal.a.01.vocal').pertainyms()


Out[32]:
[Lemma('voice.n.02.voice')]

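For verbs we can, for example, request the verb frames from WordNet. In the following example we first look up the definition of one sense of the verb say and then request the frames for all lemmas of the synset say.v.01. The name say.v.01 resolves to the synset state.v.01, as the lemma names in the output show: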


In [36]:
wordnet.synsets("say")
wordnet.synset('say.v.07').definition()


Out[36]:
'communicate or express nonverbally'

In [33]:
wordnet.synset('say.v.01').frame_ids()
for lemma in wordnet.synset('say.v.01').lemmas():
    print(lemma, lemma.frame_ids())
    print(" | ".join(lemma.frame_strings()))


Lemma('state.v.01.state') [8, 11, 26]
Somebody state something | Something state something | Somebody state that CLAUSE
Lemma('state.v.01.say') [8, 11, 26]
Somebody say something | Something say something | Somebody say that CLAUSE
Lemma('state.v.01.tell') [8, 11, 26]
Somebody tell something | Something tell something | Somebody tell that CLAUSE

In the following example we request the verb-frames for the ditransitive verb to give:


In [37]:
wordnet.synset('give.v.01').frame_ids()
for lemma in wordnet.synset('give.v.01').lemmas():
    print(lemma, lemma.frame_ids())
    print(" | ".join(lemma.frame_strings()))


Lemma('give.v.01.give') [14]
Somebody give somebody something
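A frame ID can be read as a sentence template with a slot for the verb. The sketch below rebuilds the frame strings from the two outputs above. The dictionary FRAMES only holds the four IDs that occur in those outputs, and the %s placeholder style is an assumption made for this illustration.

```python
# Templates read off the outputs for say and give; only four frame IDs are covered.
FRAMES = {
    8: "Somebody %s something",
    11: "Something %s something",
    14: "Somebody %s somebody something",
    26: "Somebody %s that CLAUSE",
}


def frame_strings(verb, frame_ids):
    """Fill the verb into the template of each given frame ID."""
    return [FRAMES[frame_id] % verb for frame_id in frame_ids]


print(" | ".join(frame_strings("tell", [8, 11, 26])))
print(" | ".join(frame_strings("give", [14])))
```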

Morphological Analysis and Lemmatization

For many tasks in NLP one needs a lemmatizer or morphological analyzer to map inflected word forms to lemmas. Morphy in the WordNet module of the NLTK can do that. To lemmatize a word, provide the word and the PoS to the morphy function in wordnet:


In [38]:
wordnet.morphy('calls', wordnet.NOUN)


Out[38]:
'call'

Morphy can cope with surface forms that result from various rules of English word formation, for example e-insertion or consonant doubling:


In [40]:
wordnet.morphy('stopped', wordnet.VERB)


Out[40]:
'stop'
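As I understand it, a lemmatizer of this kind combines a list of exceptions with suffix rules and accepts a candidate only if it is a known word. The following toy sketch illustrates that idea. It is not the NLTK implementation: the lexicon, the exception list and the rule subset are all made up for this example, and treating stopped as an exception entry is an assumption of the sketch.

```python
# Toy data, all hypothetical: a tiny lexicon, a tiny exception list,
# and a small subset of suffix rules (suffix, replacement) per PoS.
LEXICON = {
    "n": {"call", "church", "city"},
    "v": {"call", "stop", "bake", "go"},
}
EXCEPTIONS = {
    "n": {},
    "v": {"stopped": "stop", "went": "go"},
}
RULES = {
    "n": [("ies", "y"), ("ches", "ch"), ("s", "")],
    "v": [("ies", "y"), ("es", "e"), ("es", ""), ("ed", "e"), ("ed", ""),
          ("ing", "e"), ("ing", ""), ("s", "")],
}


def toy_morphy(form, pos):
    """Return a lemma for `form`, or None if no known word can be derived."""
    # 1. Irregular forms are looked up directly.
    if form in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][form]
    # 2. A form that is already a known word is its own lemma.
    if form in LEXICON[pos]:
        return form
    # 3. Otherwise try each suffix rule and keep the first known candidate.
    for suffix, replacement in RULES[pos]:
        if form.endswith(suffix):
            candidate = form[: len(form) - len(suffix)] + replacement
            if candidate in LEXICON[pos]:
                return candidate
    return None


print(toy_morphy("calls", "n"))
print(toy_morphy("stopped", "v"))
print(toy_morphy("baked", "v"))
```

Checking every candidate against the lexicon is what keeps the rules from producing non-words such as stopp for stopped.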

Similarity of Words

...

References

Fellbaum, Christiane (2005). WordNet and wordnets. In: Brown, Keith et al. (eds.), Encyclopedia of Language and Linguistics, Second Edition, Oxford: Elsevier, 665-670.
