In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words. For example, the lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence.
In [1]:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
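Before moving on, we can sanity-check the claims above ('was' → 'be', 'mice' → 'mouse') with a minimal sketch; the sentence here is our own example, reusing the `nlp` object just loaded:

```python
# Hypothetical check sentence, reusing the `nlp` pipeline loaded above:
check = nlp(u"I was chased by mice")
for token in check:
    print(token.text, '->', token.lemma_)
# 'was' lemmatizes to 'be', and 'mice' to its singular 'mouse'.
```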
In [2]:
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")
for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)
In the above sentence, `running`, `run` and `ran` all point to the same lemma `run` (hash value ...11841); storing the hash rather than the text itself avoids duplication.
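As a minimal sketch of how that lookup works (reusing `nlp` and `doc1` from above), spaCy's `StringStore` converts between the lemma text and its hash in either direction:

```python
# token.lemma is a hash into nlp.vocab.strings (spaCy's StringStore).
run_hash = nlp.vocab.strings['run']    # text -> hash
print(nlp.vocab.strings[run_hash])     # hash -> text: 'run'
print(doc1[4].lemma == run_hash)       # True: 'running' carries this same lemma hash
```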
In [3]:
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')
Here we're using an f-string to format the printed text, setting minimum field widths and left-aligning the lemma hash value.
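As a standalone illustration of that format spec (the values below are made up for the demo), `:{12}` pads to a minimum width of 12 while `:<{22}` left-aligns within 22 characters:

```python
# Hypothetical values, just to show padding and alignment:
print(f'{"running":{12}} {"VERB":{6}} {123456789:<{22}} run')
```

Strings left-align by default, but numbers right-align, which is why the explicit `<` is needed for the integer hash.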
In [4]:
doc2 = nlp(u"I saw eighteen mice today!")
show_lemmas(doc2)
Notice that the lemma of `saw` is `see`, that `mice` lemmatizes to its singular form `mouse`, and that `eighteen` is treated as its own number, *not* an expanded form of `eight`.
In [5]:
doc3 = nlp(u"I am meeting him tomorrow at the meeting.")
show_lemmas(doc3)
Here the lemma of `meeting` is determined by its part-of-speech tag.
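To make that explicit, here's a small sketch reusing `doc3` from above that pulls out both occurrences of `meeting` alongside the POS tag that drives the lemma choice:

```python
# Both 'meeting' tokens: the POS tag determines the lemma.
for token in doc3:
    if token.text.lower() == 'meeting':
        print(f'{token.text:{10}} {token.pos_:{6}} {token.lemma_}')
# The VERB occurrence lemmatizes to 'meet'; the NOUN occurrence stays 'meeting'.
```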
In [6]:
doc4 = nlp(u"That's an enormous automobile")
show_lemmas(doc4)
Note that lemmatization does *not* reduce words to their most basic synonym; that is, `enormous` doesn't become `big` and `automobile` doesn't become `car`.