NLTK (Natural Language Toolkit) is a Python library that provides access to many NLP tools and resources. The NLTK WordNet interface is described here: http://www.nltk.org/howto/wordnet.html
The nltk Python package can be installed using pip:
In [ ]:
!pip install nltk
Import nltk and use its internal download tool to get WordNet:
In [ ]:
import nltk
nltk.download('wordnet')
Import the wordnet module:
In [ ]:
from nltk.corpus import wordnet as wn
Access synsets of a word using the synsets function:
In [ ]:
club_synsets = wn.synsets('club')
print(club_synsets)
Each synset has a definition function:
In [ ]:
for synset in club_synsets:
    print("{0}\t{1}".format(synset.name(), synset.definition()))
In [ ]:
dog = wn.synsets('dog')[0]
dog.definition()
List lemmas of a synset:
In [ ]:
dog.lemmas()
List hypernyms and hyponyms of a synset:
In [ ]:
dog.hypernyms()
In [ ]:
dog.hyponyms()
The closure method of synsets allows us to retrieve the transitive closure of the hypernym, hyponym, etc. relations:
In [ ]:
list(dog.closure(lambda s: s.hypernyms()))
The common_hypernyms and lowest_common_hypernyms methods take another synset as their argument:
In [ ]:
cat = wn.synsets('cat')[0]
dog.lowest_common_hypernyms(cat)
In [ ]:
dog.common_hypernyms(cat)
path_similarity scores how similar two synsets are, based on the shortest path connecting them in the hypernym/hyponym taxonomy; the score ranges from 0 to 1:
In [ ]:
dog.path_similarity(cat)
To iterate over all synsets, optionally filtered by POS tag, use all_synsets, which returns a generator:
In [ ]:
wn.all_synsets(pos='n')
In [ ]:
for c, noun in enumerate(wn.all_synsets(pos='n')):
    if c > 5:
        break
    print(noun.name())
Exercise (optional): use WordNet to implement the "Guess the category" game: the program lists lemmas that all share a common hypernym, which the user has to guess.
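One possible solution sketch (the function name, the number of listed lemmas, and the game loop are choices of this sketch, not part of the exercise statement):

```python
import random

from nltk.corpus import wordnet as wn


def guess_the_category(n_lemmas=5):
    """Pick a random noun synset with enough hyponyms, print one lemma of
    each sampled hyponym, and let the user guess the shared hypernym."""
    candidates = [s for s in wn.all_synsets(pos='n')
                  if len(s.hyponyms()) >= n_lemmas]
    category = random.choice(candidates)
    for hyponym in random.sample(category.hyponyms(), n_lemmas):
        print(hyponym.lemmas()[0].name())
    guess = input("Guess the common hypernym: ")
    if guess in {lemma.name() for lemma in category.lemmas()}:
        print("Correct!")
    else:
        print("It was: " + category.name())
```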
Download and unpack pre-trained GloVe word embeddings:
In [ ]:
!wget http://sandbox.hlt.bme.hu/~recski/stuff/glove.6B.50d.txt.gz
!gunzip -f glove.6B.50d.txt.gz
In [ ]:
import numpy as np
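The two helpers called in the commented cell below are specified in exercise 11.E2; here is a minimal sketch of what they might look like, assuming the usual GloVe text format (one word per line, followed by the components of its vector):

```python
import numpy as np


def read_embedding(fn):
    """Read a GloVe-format text file into a word list, a word-to-row index,
    and an embedding matrix with one row per word."""
    words, vectors = [], []
    with open(fn, encoding='utf-8') as f:
        for line in f:
            fields = line.strip().split()
            words.append(fields[0])
            vectors.append([float(x) for x in fields[1:]])
    word_index = {word: i for i, word in enumerate(words)}
    return words, word_index, np.array(vectors)


def normalize_embedding(emb):
    """Scale each row to unit length so dot products become cosine similarities."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)
```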
In [ ]:
In [ ]:
In [ ]:
# words, word_index, emb = read_embedding('glove.6B.50d.txt')
# emb = normalize_embedding(emb)
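A possible vec_sim, assuming emb has already been normalized so that cosine similarity reduces to a dot product of rows:

```python
import numpy as np


def vec_sim(word1, word2, word_index, emb):
    """Cosine similarity of two words; assumes rows of emb are unit length."""
    return float(np.dot(emb[word_index[word1]], emb[word_index[word2]]))
```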
In [ ]:
In [ ]:
# vec_sim('cat', 'dog', word_index, emb)
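nearest_n might be implemented like this (a sketch; again assuming unit-length rows, and that the queried word itself is included in the result):

```python
import numpy as np


def nearest_n(word, words, word_index, emb, n=10):
    """Return the n words whose (normalized) vectors are closest to word's."""
    sims = emb @ emb[word_index[word]]
    top = np.argsort(-sims)[:n]
    return [words[i] for i in top]
```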
In [ ]:
In [ ]:
# print(nearest_n('dog', words, word_index, emb))
# print(nearest_n('king', words, word_index, emb))
Use the code written in 11.E2 to analyze word groups in WordNet:
In [ ]:
In [ ]:
# synset_emb = embed_synsets(words, word_index, emb)
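With unit-length synset vectors as above, synset_sim can again be a plain dot product (a sketch):

```python
import numpy as np


def synset_sim(synset1, synset2, synset_emb):
    """Cosine similarity of two synsets' (unit-length) embedding vectors."""
    return float(np.dot(synset_emb[synset1], synset_emb[synset2]))
```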
In [ ]:
In [ ]:
# synset_sim(dog, cat, synset_emb)
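nearest_n_synsets could rank all embedded synsets by cosine similarity to the query (a sketch; the full-vocabulary sort is simple but not the fastest option):

```python
import numpy as np


def nearest_n_synsets(synset, synset_emb, n=10):
    """Return the n synsets whose embeddings are closest to the given synset's."""
    target = synset_emb[synset]
    ranked = sorted(synset_emb,
                    key=lambda s: -np.dot(synset_emb[s], target))
    return ranked[:n]
```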
In [ ]:
In [ ]:
# nearest_n_synsets(wn.synsets('penguin')[0], synset_emb, 10)
In [ ]:
In [ ]:
In [ ]:
In [ ]:
# compare_sims(sample, synset_emb, word_index, emb)