11. Semantics 1: words - Lab exercises

11.E1 Accessing WordNet using NLTK

NLTK (Natural Language Toolkit) is a Python library for accessing many NLP tools and resources. The NLTK WordNet interface is described here: http://www.nltk.org/howto/wordnet.html

The NLTK Python package can be installed using pip:


In [ ]:
!pip install nltk

Import nltk and use its built-in download tool to get WordNet:


In [ ]:
import nltk
nltk.download('wordnet')

Import the wordnet module:


In [ ]:
from nltk.corpus import wordnet as wn

Access the synsets of a word using the synsets function:


In [ ]:
club_synsets = wn.synsets('club')
print(club_synsets)

Each synset has a definition method:


In [ ]:
for synset in club_synsets:
    print("{0}\t{1}".format(synset.name(), synset.definition()))

In [ ]:
dog = wn.synsets('dog')[0]
dog.definition()

List the lemmas of a synset:


In [ ]:
dog.lemmas()

List the hypernyms and hyponyms of a synset:


In [ ]:
dog.hypernyms()

In [ ]:
dog.hyponyms()

The closure method of synsets allows us to retrieve the transitive closure of the hypernym, hyponym, etc. relations:


In [ ]:
list(dog.closure(lambda s: s.hypernyms()))

The common_hypernyms and lowest_common_hypernyms methods take another synset as their argument:


In [ ]:
cat = wn.synsets('cat')[0]
dog.lowest_common_hypernyms(cat)

In [ ]:
dog.common_hypernyms(cat)

path_similarity returns a similarity score between 0 and 1, based on the shortest path connecting the two synsets in the hypernym/hyponym taxonomy:


In [ ]:
dog.path_similarity(cat)

To iterate over all synsets, optionally restricted to a POS tag, use all_synsets, which returns a generator:


In [ ]:
wn.all_synsets(pos='n')

In [ ]:
for c, noun in enumerate(wn.all_synsets(pos='n')):
    if c > 5:
        break
    print(noun.name())

Exercise (optional): use WordNet to implement the "Guess the category" game: the program lists lemmas that all share a hypernym, and the user has to guess the hypernym.
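
One possible solution is sketched below. The function name guess_the_category, the number of lemmas shown and the exact interaction are all arbitrary choices:


In [ ]:
import random

def guess_the_category(n_lemmas=5):
    # Pick a random noun synset with enough hyponyms to make a puzzle.
    candidates = [s for s in wn.all_synsets(pos='n')
                  if len(s.hyponyms()) >= n_lemmas]
    category = random.choice(candidates)
    # Print one lemma from each of n_lemmas randomly chosen hyponyms.
    for hyponym in random.sample(category.hyponyms(), n_lemmas):
        print(hyponym.lemmas()[0].name())
    guess = input("What is the common hypernym? ")
    answer = category.lemmas()[0].name()
    print("Correct!" if guess == answer else "It was: {0}".format(answer))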

11.E2 Using word embeddings

  • Download and extract the glove.6B word embedding, which was trained on 6 billion words of English text using the GloVe algorithm.

In [ ]:
!wget http://sandbox.hlt.bme.hu/~recski/stuff/glove.6B.50d.txt.gz
!gunzip -f glove.6B.50d.txt.gz
  • Read the embedding into a 2D numpy array. Word forms should be stored in a separate 1D array. Also create a word index: a dictionary that maps each word to its index in the embedding. Vectors should be normalized to a length of 1.

In [ ]:
import numpy as np

In [ ]:


In [ ]:


In [ ]:
# words, word_index, emb = read_embedding('glove.6B.50d.txt')
# emb = normalize_embedding(emb)
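
A minimal sketch of the two helpers referenced above, assuming the standard GloVe text format (one token per line, followed by its space-separated vector components):


In [ ]:
def read_embedding(fn):
    words, vectors = [], []
    with open(fn, encoding='utf-8') as f:
        for line in f:
            fields = line.rstrip().split(' ')
            words.append(fields[0])
            vectors.append([float(x) for x in fields[1:]])
    words = np.array(words)
    word_index = {word: i for i, word in enumerate(words)}
    return words, word_index, np.array(vectors)

def normalize_embedding(emb):
    # Divide each row by its L2 norm so that every vector has length 1.
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)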
  • Write a function that takes two words and the embedding as input and returns their cosine similarity.

In [ ]:


In [ ]:
# vec_sim('cat', 'dog', word_index, emb)
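
A possible implementation: since the vectors have been normalized to unit length, cosine similarity reduces to a dot product:


In [ ]:
def vec_sim(word1, word2, word_index, emb):
    # Dot product of two unit-length vectors equals their cosine similarity.
    return np.dot(emb[word_index[word1]], emb[word_index[word2]])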
  • Implement a function that takes a word as a parameter and returns the 5 words that are closest to it in the embedding space.

In [ ]:


In [ ]:
# print(nearest_n('dog', words, word_index, emb))
# print(nearest_n('king', words, word_index, emb))
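
One way to implement nearest_n; computing similarities to all words with a single matrix-vector product, and skipping the top hit (the query word itself), are assumptions based on the usage above:


In [ ]:
def nearest_n(word, words, word_index, emb, n=5):
    # One matrix-vector product gives similarities to every word at once.
    sims = emb.dot(emb[word_index[word]])
    # Sort in descending order and drop the query word itself.
    top = np.argsort(-sims)[1:n + 1]
    return list(words[top])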

11.E3 Vector similarity in WordNet

Use the code written in 11.E2 to analyze word groups in WordNet:

  • Create an embedding of WordNet synsets by mapping each of them to the mean of their lemmas' vectors.

In [ ]:


In [ ]:
# synset_emb = embed_synsets(words, word_index, emb)
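
A sketch of embed_synsets. Lowercasing lemma names and skipping lemmas that are missing from the embedding (e.g. multiword lemmas) are assumptions; synsets with no known lemma are left out of the result:


In [ ]:
def embed_synsets(words, word_index, emb):
    synset_emb = {}
    for synset in wn.all_synsets():
        # Collect the vectors of all lemmas present in the embedding.
        vecs = [emb[word_index[lemma.name().lower()]]
                for lemma in synset.lemmas()
                if lemma.name().lower() in word_index]
        if vecs:
            synset_emb[synset] = np.mean(vecs, axis=0)
    return synset_emb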
  • Write a function that measures the similarity of two synsets based on the cosine similarity of their vectors.

In [ ]:


In [ ]:
# synset_sim(dog, cat, synset_emb)
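
A possible synset_sim. The mean vectors are no longer unit length, so the dot product has to be normalized explicitly:


In [ ]:
def synset_sim(synset1, synset2, synset_emb):
    v1, v2 = synset_emb[synset1], synset_emb[synset2]
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))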
  • Write a function that takes a synset as input and retrieves the n most similar synsets, using the above embedding.

In [ ]:


In [ ]:
# nearest_n_synsets(wn.synsets('penguin')[0], synset_emb, 10)
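
A brute-force sketch of nearest_n_synsets that scores every embedded synset against the query; slow, but simple:


In [ ]:
def nearest_n_synsets(synset, synset_emb, n):
    sims = [(synset_sim(synset, other, synset_emb), other)
            for other in synset_emb if other != synset]
    # Sort by similarity only: synsets themselves are not comparable.
    sims.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in sims[:n]]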
  • Build the list of all words that appear both in WordNet and in the GloVe embedding. On a sample of 100 such words, measure the Spearman correlation between synset similarity and vector similarity (use scipy.stats.spearmanr).

In [ ]:


In [ ]:


In [ ]:


In [ ]:
# compare_sims(sample, synset_emb, word_index, emb)
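
A sketch of the sampling step and compare_sims. Interpreting the task as correlating the two similarity measures over all word pairs in the sample, and representing each word by its first synset, are assumptions:


In [ ]:
import random
from itertools import combinations
from scipy.stats import spearmanr

# Words that appear in both WordNet and the GloVe embedding.
common = [word for word in words if wn.synsets(word)]
sample = random.sample(common, 100)

def compare_sims(sample, synset_emb, word_index, emb):
    vec_sims, syn_sims = [], []
    for w1, w2 in combinations(sample, 2):
        s1, s2 = wn.synsets(w1)[0], wn.synsets(w2)[0]
        # Skip pairs whose first synsets got no vector in synset_emb.
        if s1 in synset_emb and s2 in synset_emb:
            vec_sims.append(vec_sim(w1, w2, word_index, emb))
            syn_sims.append(synset_sim(s1, s2, synset_emb))
    return spearmanr(vec_sims, syn_sims)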