Estnltk performs basic morphological disambiguation with a probabilistic disambiguator that relies on the local (sentence) context [KA01]. This approach works well for all types of texts: news articles, comments, mixed content, etc.
However, the quality of the disambiguation can be further improved by considering a broader context (e.g. the whole text, or a collection of texts). If morphologically ambiguous words (for example, proper names) recur in other parts of the text or in related texts, one can apply the assumption "one lemma per discourse" (inspired by the observation "one sense per discourse" from Word Sense Disambiguation) and choose the correct analysis based on the most frequently occurring lemma candidate [KA12].
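The "one lemma per discourse" heuristic can be sketched as follows. This is a minimal, self-contained illustration, not estnltk's actual implementation: it assumes a simplified data format (each word given as a list of candidate lemmas) and keeps, for each ambiguous word, the candidate that occurs most often across the whole corpus.

```python
from collections import Counter

def one_lemma_per_discourse(corpus_analyses):
    """Resolve ambiguous words by corpus-wide lemma frequency.

    Hypothetical data format: a list of texts, each text a list of
    words, each word a list of candidate lemmas.
    """
    # Count how often each lemma candidate appears anywhere in the corpus
    counts = Counter(lemma
                     for text in corpus_analyses
                     for word in text
                     for lemma in word)
    # For every word, keep the candidate with the highest corpus count
    return [[max(word, key=lambda lemma: counts[lemma]) for word in text]
            for text in corpus_analyses]

# 'mai' also occurs unambiguously elsewhere in the corpus,
# so it wins over 'maa' for the ambiguous word.
corpus = [[['jänes'], ['mai', 'maa']],
          [['mai'], ['konn']]]
print(one_lemma_per_discourse(corpus))
# → [['jänes', 'mai'], ['mai', 'konn']]
```

In case of a tie, this sketch simply keeps the first candidate; a real implementation would need a more careful tie-breaking policy.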
[KA01] Kaalep, Heiki-Jaan, and Tarmo Vaino. "Complete morphological analysis in the linguist's toolbox." Congressus Nonus Internationalis Fenno-Ugristarum Pars V (2001): 9-16.
[KA12] Kaalep, Heiki-Jaan, Riin Kirt, and Kadri Muischnek. "A trivial method for choosing the right lemma." Baltic HLT. 2012.
Consider the following example of a text collection:
In [1]:
corpus = ['Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.',
          'Lõpparvestuses läks Konnale esimene koht. Teise koha sai seekord Jänes. Uus võistlus toimub 2. mail.',
          'Konn paistis silma suurima punktide summaga. Uue võistluse toimumisajaks on 2. mai.']
After applying the default (local context) morphological disambiguation, some of the words will still be ambiguous, as the following script reveals:
In [2]:
from estnltk import Text
from estnltk.names import TEXT, ANALYSIS, ROOT, POSTAG, FORM
for text_str in corpus:
    text = Text(text_str)
    # Perform morphological analysis with default disambiguation
    text.tag_analysis()
    # Print out all words with ambiguous analyses
    for word in text.words:
        if len(word[ANALYSIS]) > 1:
            print(word[TEXT], [(a[ROOT], a[POSTAG], a[FORM]) for a in word[ANALYSIS]])
The class Disambiguator performs morphological disambiguation that also takes the broader (corpus-level) context into account:
In [3]:
from estnltk import Disambiguator
corpus = ['Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.',
          'Lõpparvestuses läks Konnale esimene koht. Teise koha sai seekord Jänes. Uus võistlus toimub 2. mail.',
          'Konn paistis silma suurima punktide summaga. Uue võistluse toimumisajaks on 2. mai.']
disamb = Disambiguator()
texts = disamb.disambiguate(corpus)
The method returns a list of Text objects. We can use the following script to check for morphological ambiguities in this list:
In [4]:
from estnltk.names import TEXT, ANALYSIS, ROOT, POSTAG, FORM
for text in texts:
    # Print out all words with ambiguous analyses
    for word in text.words:
        if len(word[ANALYSIS]) > 1:
            print(word[TEXT], [(a[ROOT], a[POSTAG], a[FORM]) for a in word[ANALYSIS]])
The output shows that the ambiguities in the content words (nouns kohale, koha, mail, summaga) have been removed.
Under the hood, the disambiguation process implemented in Disambiguator can be broken down into three steps:

1. pre-disambiguation, in which corpus-based heuristics resolve ambiguities related to proper names;
2. local (sentence-level) statistical disambiguation, as in the default morphological analysis;
3. post-disambiguation, in which corpus-based lemma frequencies are used to resolve the remaining ambiguities of content words.
By default, all three steps are performed on the input corpus. However, if needed, pre-disambiguation and post-disambiguation can also be disabled by passing pre_disambiguate=False and post_disambiguate=False as input arguments to the method disambiguate().
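The role of the two flags can be illustrated with a minimal sketch of the control flow. The step names below are stand-ins for illustration only, not estnltk's actual implementation:

```python
def disambiguate_sketch(corpus, pre_disambiguate=True, post_disambiguate=True):
    """Illustrative control flow only: each step just records its name.

    The local (sentence-level) step always runs; the corpus-based
    pre- and post-disambiguation steps are optional.
    """
    steps_run = []
    if pre_disambiguate:
        steps_run.append('pre-disambiguation')    # corpus-based, optional
    steps_run.append('local disambiguation')      # sentence-level, always on
    if post_disambiguate:
        steps_run.append('post-disambiguation')   # corpus-based, optional
    return steps_run

print(disambiguate_sketch(['text1', 'text2']))
# → ['pre-disambiguation', 'local disambiguation', 'post-disambiguation']
print(disambiguate_sketch(['text1'], pre_disambiguate=False))
# → ['local disambiguation', 'post-disambiguation']
```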
In the following example, disambiguation is applied both with pre-disambiguation enabled and disabled, and the differences in the results are printed out:
In [5]:
corpus = ['Jänes oli parajasti põllu peal. Hunti nähes ta ehmus ja pani jooksu.',
          'Talupidaja Jänes kommenteeris, et hunte on viimasel ajal liiga palju siginenud. Tema naaber, talunik Lammas, nõustus sellega.',
          'Jänesele ja Lambale oli selge, et midagi tuleb ette võtta. Eile algatasid nad huntidevastase kampaania.']
from estnltk.names import TEXT, ANALYSIS, ROOT, POSTAG, FORM
from estnltk import Disambiguator
disamb = Disambiguator()
texts_with_predisamb = disamb.disambiguate(corpus)
texts_without_predisamb = disamb.disambiguate(corpus, pre_disambiguate=False)
for i in range(len(texts_with_predisamb)):
    with_predisamb = texts_with_predisamb[i]
    without_predisamb = texts_without_predisamb[i]
    for j in range(len(with_predisamb.words)):
        word1 = with_predisamb.words[j]
        word2 = without_predisamb.words[j]
        if word1 != word2:
            print(word1[TEXT],
                  [(a[ROOT], a[POSTAG], a[FORM]) for a in word1[ANALYSIS]],
                  ' vs ',
                  [(a[ROOT], a[POSTAG], a[FORM]) for a in word2[ANALYSIS]])