In this tutorial, we start with Estnltk basics and introduce you to the Text class. We will take the class apart to bits and pieces and put it back together to give a good overview of what it can do for you and how you can work with it.
One of the most important classes in Estnltk is Text, which is essentially the main interface for doing everything Estnltk is capable of. It is actually a subclass of the standard Python dict class and stores all data relevant to the text in this form:
In [1]:
from estnltk import Text
text = Text('Tere maailm!')
print (repr(text))
You can use Text instances the same way as you would use a typical dict object in Python:
In [2]:
print (text['text'])
Naturally, you can also initialize a new instance from a dictionary:
In [3]:
text2 = Text({'text': 'Tere maailm!'})
print (text == text2)
As the Text class is essentially a dictionary, it has a number of advantages. The main disadvantage is that the dictionary can get quite verbose, so space can be an issue when storing large corpora with many layers of annotations.
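Because a Text is just a dictionary, it can be serialized with the standard json module. A minimal sketch (estnltk may also offer dedicated I/O helpers, which are not covered here):
import json
from estnltk import Text

text = Text('Tere maailm!')
text.tokenize_words()          # adds the words layer to the dictionary
serialized = json.dumps(text)  # works because Text is a dict subclass
restored = Text(json.loads(serialized))
print(restored.word_texts)     # stored layers survive the round-trip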
A Text instance can have different types of layers that hold annotations or denote special regions of the text. For instance, the words layer defines the word tokens, the named_entities layer denotes the positions of named entities, etc.
There are two types of layers: a simple layer has elements that cover exactly one continuous region of the text, while a multi layer has elements that can span several regions. For example, the sentence "Kõrred, millel on toitunud viljasääse vastsed, jäävad õhukeseks." has two clauses:
a. "Kõrred jäävad õhukeseks",
b. ", millel on toitunud viljasääse vastsed, " .
Clause a spans multiple regions in the original text.
Both types of layers require each layer element to define start and end attributes. Simple layer elements define start and end as integers of the range containing the element. Multi layer elements similarly define start and end attributes, but these are lists of the respective start and end positions of the element.
Simple layer:
In [4]:
from estnltk import Text
text = Text('Kõrred, millel on toitunud viljasääse vastsed, jäävad õhukeseks.')
text.tokenize_words()
text['words']
Out[4]:
Each word has a start and end attribute that tells where the word is located in the text. In case of multi layers, we see a slightly different result:
In [5]:
text.tag_clauses()
text['clauses']
Out[5]:
We see that the first clause has two spans in the text. Although the second clause has only one span, it is also defined as a multi layer element. Estnltk uses either the simple or the multi type consistently within a single layer. However, nothing stops you from mixing these two, if you wish.
In the next sections, we discuss typical NLP operations you can do with Estnltk and also explain how the results are stored in the dictionary underneath the Text instances.
One of the most basic tasks of any NLP pipeline is word and sentence tokenization. The Text class has methods tokenize_paragraphs(), tokenize_sentences() and tokenize_words(), which you can call to do this explicitly. However, there are also properties word_texts, sentence_texts and paragraph_texts that do this automatically when you use them and give you the texts of the tokenized words, sentences or paragraphs:
In [6]:
from estnltk import Text
text = Text('Üle oja mäele, läbi oru jõele. Ämber läks ümber.')
print (text.word_texts)
In order for the tokenization to happen, the Text instance applies the default tokenizer in the background and updates the text data:
In [7]:
from pprint import pprint
pprint (text)
As you can see, there is now a words element in the dictionary, which is a list of dictionaries denoting the start and end positions of the respective words. You also see sentences and paragraphs elements, because sentence and paragraph tokenization is a prerequisite for word tokenization and Estnltk did this automatically on your behalf.
The word_texts property does basically the same as the following snippet:
In [8]:
text = Text('Üle oja mäele, läbi oru jõele. Ämber läks ümber.')
text.tokenize_words() # this method applies text tokenization
print ([text['text'][word['start']:word['end']] for word in text['words']])
The only difference is that accessing the word_texts property twice does not perform the tokenization twice: the second access uses the start and end attributes already stored in the Text instance.
The default word tokenizer is a modification of WordPunctTokenizer:
In [9]:
from nltk.tokenize.regexp import WordPunctTokenizer
tok = WordPunctTokenizer()
print (tok.tokenize('Tere maailm!'))
Also, the default sentence tokenizer comes from NLTK:
In [10]:
import nltk.data
tok = nltk.data.load('tokenizers/punkt/estonian.pickle')
tok.tokenize('Esimene lause. Teine lause?')
Out[10]:
In order to plug in custom tokenization functionality, you need to implement the interface defined by NLTK's StringTokenizer and supply the tokenizers as keyword arguments when initializing Text objects. Of course, all other NLTK tokenizers follow this interface:
In [11]:
from nltk.tokenize.regexp import WhitespaceTokenizer
from nltk.tokenize.simple import LineTokenizer
kwargs = {
"word_tokenizer": WhitespaceTokenizer(),
"sentence_tokenizer": LineTokenizer()
}
plain = '''Hmm, lausemärgid jäävad sõnade külge. Ja laused
tuvastatakse praegu
reavahetuste järgi'''
text = Text(plain, **kwargs)
print (text.word_texts)
print (text.sentence_texts)
After both word and sentence tokenization, a Text instance looks like this:
In [12]:
text
Out[12]:
This is the full list of tokenization-related properties of Text:
Example:
In [13]:
from estnltk import Text
text = Text('Esimene lause. Teine lause')
text.text
Out[13]:
In [14]:
text.words
Out[14]:
In [15]:
text.word_texts
Out[15]:
In [16]:
text.word_starts
Out[16]:
In [17]:
text.word_ends
Out[17]:
In [18]:
text.word_spans
Out[18]:
In [19]:
text.sentences
Out[19]:
In [20]:
text.sentence_texts
Out[20]:
In [21]:
text.sentence_starts
Out[21]:
In [22]:
text.sentence_ends
Out[22]:
In [23]:
text.sentence_spans
Out[23]:
Note that if a dictionary already has words, sentences or paragraphs elements (or any other element that we introduce later), accessing these elements in a newly initialized Text object does not require recomputing them:
In [24]:
text = Text({'paragraphs': [{'end': 27, 'start': 0}],
'sentences': [{'end': 14, 'start': 0}, {'end': 27, 'start': 15}],
'text': 'Esimene lause. Teine lause.',
'words': [{'end': 7, 'start': 0, 'text': 'Esimene'},
{'end': 13, 'start': 8, 'text': 'lause'},
{'end': 14, 'start': 13, 'text': '.'},
{'end': 20, 'start': 15, 'text': 'Teine'},
{'end': 26, 'start': 21, 'text': 'lause'},
{'end': 27, 'start': 26, 'text': '.'}]}
)
print (text.word_texts) # tokenization is already done, just extract words using the positions
You should also remember this when you have defined custom tokenizers. In such cases you can force retokenization by calling tokenize_words(), tokenize_sentences() or tokenize_paragraphs().
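For illustration, here is a minimal sketch of forcing retokenization on a pre-tokenized dictionary, reusing the WhitespaceTokenizer from the earlier example (the stale words layer below is hypothetical):
from estnltk import Text
from nltk.tokenize.regexp import WhitespaceTokenizer

# a dictionary that already carries a (deliberately incomplete) words layer
doc = {'text': 'Esimene lause. Teine lause.',
       'words': [{'start': 0, 'end': 7, 'text': 'Esimene'}]}
text = Text(doc, word_tokenizer=WhitespaceTokenizer())
print(text.word_texts)   # uses the stored layer as-is
text.tokenize_words()    # forces retokenization with the custom tokenizer
print(text.word_texts)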
note
Things to remember!
- words, sentences and paragraphs are simple layers.
- Use properties to access the tokenized word/sentence texts and avoid tokenize_words(), tokenize_sentences() or tokenize_paragraphs(), unless you have a meaningful reason to use them (for example, preparing documents for indexing in a database).
In linguistics, morphology is the identification, analysis, and description of the structure of a given language's morphemes and other linguistic units, such as root words, lemmas, suffixes, parts of speech etc. Estnltk wraps the Vabamorf morphological analyzer, which can do both morphological analysis and synthesis.
Estnltk Text class properties for extracting morphological information:
These properties call the tag_analysis() method in the background, which in turn calls tokenize_paragraphs(), tokenize_sentences() and tokenize_words(), as word tokenization is required in order to add morphological analysis. Morphological analysis adds extra information to the words layer, which we'll explain in the following sections.
See postag_table, nounform_table and verbform_table for more detailed information about various analysis tags.
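As a quick sketch, a few of these properties in action; lemmas, postags and forms all appear later in this tutorial, and the tables referenced above explain the tag values:
from estnltk import Text
text = Text('Tere maailm!')
print(text.lemmas)   # dictionary forms of the words
print(text.postags)  # part of speech tags
print(text.forms)    # morphological form tags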
Before we continue with morphological analysis, we introduce a way to put together various pieces of information in a simple manner. Often you want to extract various information, such as words, lemmas and postags, and put them together so that you can easily access all of them. Estnltk has the ZipBuilder class, which can compile together the properties you need and then format them in various ways. First, you initiate the builder on a Text object by accessing the get attribute, and then chain together the attributes you wish to have. The last step is specifying the format you want the data to appear in:
get <item_1> <item_2> ... <item_n> as <format>
Output formats include a Pandas DataFrame:
In [25]:
from estnltk import Text
text = Text('Usjas kaslane ründas künklikul maastikul tünjat Tallinnfilmi režissööri')
text.get.word_texts.postags.postag_descriptions.as_dataframe
Out[25]:
A list of tuples:
In [26]:
list(text.get.word_texts.postags.postag_descriptions.as_zip)
Out[26]:
A list of lists:
In [27]:
text.get.word_texts.postags.postag_descriptions.as_list
Out[27]:
A dictionary:
In [28]:
text.get.word_texts.postags.postag_descriptions.as_dict
Out[28]:
All the properties can also be given as a list, which can be convenient in some situations:
In [29]:
text.get(['word_texts', 'postags', 'postag_descriptions']).as_dataframe
Out[29]:
Note
Estnltk does not stop the programmer from doing wrong things
You can chain together any Text property, but you must take care that all the properties act on the same layer/unit data. So, when you mix sentence and word properties, you get either an error or malformed output.
Morphological analysis is performed with the method tag_analysis() and is invoked by accessing any property requiring it. In that case, the methods tokenize_paragraphs(), tokenize_sentences() and tokenize_words() are also called, as word and sentence tokenization is required in order to add morphological analysis. Morphological analysis adds extra information to the words layer, which we'll explain below.
After doing morphological analysis, ideally only one unambiguous dictionary containing all the raw data is generated. However, sometimes the disambiguator cannot eliminate all ambiguity and you get multiple analysis variants:
In [30]:
from estnltk import Text
text = Text('mõeldud')
text.tag_analysis()
Out[30]:
The word mõeldud has quite a lot of ambiguity, as it can be interpreted either as a verb or an adjective. The adjective version itself can be thought of as singular or plural and with different suffixes.
This ambiguity also affects how properties work. In this case, there are two lemmas, and when accessing the lemmas property, Estnltk displays both unique cases, sorted alphabetically and separated by a pipe:
In [31]:
print (text.lemmas)
print (text.postags)
Now, we have already seen that morphological data is added to the word level dictionary under the analysis element. Let's also look at a single analysis dictionary element for the word "raudteejaamadelgi":
In [32]:
Text('raudteejaamadelgi').analysis
Out[32]:
In [33]:
{'clitic': 'gi', # the clitic; in Estonian, the -gi and -ki suffixes
'ending': 'del', # word suffix without the clitic
'form': 'pl ad', # word form, in this case plural and adessive (alalütlev) case
'lemma': 'raudteejaam', # the dictionary form of the word
'partofspeech': 'S', # POS tag, in this case substantive
'root': 'raud_tee_jaam', # root form (same as lemma, but verbs do not have the -ma suffix);
# also contains compound word markers and optional phonetic markers
'root_tokens': ['raud', 'tee', 'jaam']} # for compound words, a list of the simple roots the compound is made of
Out[33]:
In [34]:
from estnltk import Text
text = Text('Usjas kaslane ründas künklikul maastikul tünjat Tallinnfilmi režissööri')
text.get.word_texts.postags.postag_descriptions.as_dataframe
Out[34]:
In [35]:
text.get.word_texts.forms.descriptions.as_dataframe
Out[35]:
Also, see nounform_table, verbform_table and postag_table, which contain detailed information with examples about the morphological attributes.
By default, Estnltk does not add phonetic information to analyzed word roots, but this behaviour can be changed. Here are all the options that can be given to the Text class that will affect the analysis results:
disambiguate: boolean (default: True)
guess: boolean (default: True)
Use guessing in case of unknown words.
NB! In order to switch guessing off, disambiguation and proper name analysis have to be set to False as well (a sketch below the example demonstrates this).
propername: boolean (default: True)
compound: boolean (default: True)
phonetic: boolean (default: False)
In [36]:
from estnltk import Text
print (Text('tosinkond palki sai oma palga', phonetic=True, compound=False).roots)
See phonetic_markers for more information.
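As a minimal sketch of the NB above, switching guessing off requires disabling disambiguation and proper name analysis as well ("knsle" is a made-up unknown token here; the exact analysis output for unknown words may vary):
from estnltk import Text
# guessing off: disambiguate and propername must also be False
text = Text('Tere knsle', guess=False, disambiguate=False, propername=False)
text.tag_analysis()
for word in text['words']:
    print(word['text'], word['analysis'])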
Note
Things to remember about morphological analysis!
- Morphological analysis is stored in the analysis attribute of each word.
- Morphological analysis is in the words layer.
- Use the ZipBuilder class to simplify data retrieval.
- If you write something that needs better performance, access the Text directly as a dictionary, because when using properties, one loop per property is executed (see the sketch below).
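As a sketch of the last point, a single pass over the raw dictionary instead of one loop per property (the analysis structure is as shown earlier):
from estnltk import Text
text = Text('Tere maailm!')
text.tag_analysis()
# one loop over the words layer gathers several attributes at once
for word in text['words']:
    first = word['analysis'][0]  # first analysis variant of the word
    print(word['text'], first['lemma'], first['partofspeech'])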
The Giellatekno (gt) morphological analysis tagset is an alternative tagset that can be used in parallel with the default (Filosoft's) tagset. Estnltk has the function convert_to_gt(), which can be used to convert existing (Filosoft's) morphological analyses in a Text object to Giellatekno's analyses:
In [37]:
from estnltk.converters.gt_conversion import convert_to_gt
from estnltk import Text
text = Text('Rändur võttis istet.')
# Tag analysis in the text (using Filosoft's morphological categories)
text.tag_analysis()
# Convert analyses into gt format
text = convert_to_gt(text)
As a result of the conversion, a new layer named "gt_words" is attached to the Text object, containing words and their analyses in the gt format:
In [38]:
text['gt_words']
Out[38]:
The layer "gt_words"
has the same structure as the layer "words"
. However, the form
categories used in analyses are different, and the number of analyses in each word can also be different. Categories specific to the gt format are listed in table_verb_forms_gt and table_noun_forms_gt.
If the parameter layer_name is passed to the function convert_to_gt(), the gt analyses will be stored in the corresponding layer. For example, this can be used to overwrite the original morphological analysis layer "words":
In [39]:
from estnltk.converters.gt_conversion import convert_to_gt
from estnltk import Text
text2 = Text('Rändur haaras vile')
# Tag analysis in the text (using Filosoft's morphological categories)
text2.tag_analysis()
# Convert analyses to gt format, and overwrite the 'words' layer
text2 = convert_to_gt(text2, layer_name='words')
In [40]:
# Analyses with gt categories
text2.get.word_texts.postags.forms.as_dataframe
Out[40]:
(!) Important! If you overwrite the layer "words" with gt format analyses, tools depending on morphological analysis (e.g. the named entity recognizer, temporal expression tagger, or verb chain detector) are no longer applicable on the Text object, as most of these tools assume Filosoft's tagset and won't work with the gt tagset.
The reverse operation of morphological analysis is synthesis: given the dictionary form of a word and some options, generating all possible inflections that match the given criteria.
Estnltk has the function synthesize(), which accepts the lemma, the desired form, optionally the part of speech, and an optional hint (all of these are illustrated below).
Let's generate plural genitive forms for lemma "palk" (in English a paycheck and a log):
In [41]:
from estnltk import synthesize
synthesize('palk', 'pl g')
Out[41]:
We can hint the synthesizer so that it outputs only inflections that match the hint palka:
In [42]:
synthesize('palk', 'pl g', hint='palka')
Out[42]:
For fun, here is some demo code for synthesizing all forms of any given noun (see nounform_table):
In [43]:
from estnltk import synthesize
import pandas
cases = [
('n', 'nimetav'),
('g', 'omastav'),
('p', 'osastav'),
('ill', 'sisseütlev'),
('in', 'seesütlev'),
('el', 'seestütlev'),
('all', 'alaleütlev'),
('ad', 'alalütlev'),
('abl', 'alaltütlev'),
('tr', 'saav'),
('ter', 'rajav'),
('es', 'olev'),
('ab', 'ilmaütlev'),
('kom', 'kaasaütlev')]
def synthesize_all(word):
    case_rows = []
    sing_rows = []
    plur_rows = []
    for case, name in cases:
        case_rows.append(name)
        sing_rows.append(', '.join(synthesize(word, 'sg ' + case, 'S')))
        plur_rows.append(', '.join(synthesize(word, 'pl ' + case, 'S')))
    return pandas.DataFrame({'case': case_rows, 'singular': sing_rows, 'plural': plur_rows},
                            columns=['case', 'singular', 'plural'])
In [44]:
synthesize_all('kuusk')
Out[44]:
Let's try something funny as well:
In [45]:
synthesize_all('luuslang-lendur')
Out[45]:
Estnltk can also check spelling: the spelling property tells for each word whether it is spelled correctly, and spelling_suggestions provides suggestions for misspelled words:
In [46]:
from estnltk import Text
text = Text('Vikastes lausetes on trügivigasid!')
text.get.word_texts.spelling.spelling_suggestions.as_dataframe
Out[46]:
There is also a spellcheck_results property that gives both the spelling and the suggestions together. This is more efficient than calling spelling and spelling_suggestions separately:
In [47]:
text.spellcheck_results
Out[47]:
Lastly, there is the function fix_spelling(), which replaces incorrect words with the first suggestion in the list. It is very naive, but it may be handy:
In [48]:
print(text.fix_spelling())
Often, during preprocessing of text files, we wish to check whether the files satisfy certain assumptions. One such possible requirement is to check whether the files contain only characters that can be handled by our application. For example, an application assuming Estonian input might not work with Cyrillic characters. In such cases, it is necessary to detect invalid input.
Estnltk has predefined alphabets for Estonian and Russian, which can be combined with various punctuation and whitespace:
In [1]:
from estnltk import EST_ALPHA, RUS_ALPHA, DIGITS, WHITESPACE, PUNCTUATION, ESTONIAN, RUSSIAN
Estonian alphabet (EST_ALPHA):
abcdefghijklmnoprsšzžtuvwõäöüxyzABCDEFGHIJKLMNOPRSŠZŽTUVWÕÄÖÜXYZ
Russian alphabet (RUS_ALPHA):
абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
Standard punctuation (PUNCTUATION):
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~–
Digits:
0123456789
Whitespace:
' \t\n\r\x0b\x0c'
Estonian combined with punctuation and whitespace:
'abcdefghijklmnoprsšzžtuvwõäöüxyzABCDEFGHIJKLMNOPRSŠZŽTUVWÕÄÖÜXYZ0123456789 \t\n\r\x0b\x0c!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~–'
Russian combined with punctuation and whitespace:
'абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ0123456789 \t\n\r\x0b\x0c!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~–'
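These alphabets can be handed to a TextCleaner for validating texts. The combined constants appear to be plain concatenations of the building blocks, so a custom alphabet can presumably be assembled the same way; a sketch:
from estnltk import TextCleaner, EST_ALPHA, DIGITS, WHITESPACE
# a custom alphabet: Estonian letters plus digits and whitespace, but no punctuation
custom_alphabet = EST_ALPHA + DIGITS + WHITESPACE
cleaner = TextCleaner(custom_alphabet)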
In [50]:
from estnltk import Text, TextCleaner, RUSSIAN
td_ru = TextCleaner(RUSSIAN)
et_plain = 'Segan suhkrut malbelt tassis, kus nii armsalt aurab tee.'
ru_plain = 'Дождь, звонкой пеленой наполнил небо майский дождь.'
et_correct = Text(et_plain)
et_invalid = Text(ru_plain)
ru_correct = Text(ru_plain, text_cleaner=td_ru)
ru_invalid = Text(et_plain, text_cleaner=td_ru)
Now you can use the is_valid() method to check whether the text contains only characters defined in the alphabet:
In [51]:
et_correct.is_valid()
Out[51]:
In [52]:
et_invalid.is_valid()
Out[52]:
In [53]:
ru_correct.is_valid()
Out[53]:
In [54]:
ru_invalid.is_valid()
Out[54]:
In addition to checking just for correctness, we might want to get the list of invalid characters:
In [55]:
from estnltk import Text
text = Text('Esmaspäeval (27.04) liikus madalrōhkkond Pōhjalahelt Soome kohale.¶')
print (text.invalid_characters)
Surprisingly, in addition to ¶ we also see the character ō as invalid. The reason is that it is not the correct õ.
note
Different Unicode characters
- ō latin small letter o with macron (U+014D)
- õ latin small letter o with tilde (U+00F5)
It is really hard to distinguish the difference visually, but if we index the text, we will fail to find it via search later if we assume it uses the correct character õ.
So, let's replace the wrong ō and remove the other invalid characters using the method clean():
In [56]:
text = text.replace('ō', 'õ').clean()
print (text)
print (text.is_valid())
Estnltk Text class mimics the behaviour of some string functions for convenience: capitalize(), count(), endswith(), find(), index(), isalnum(), isalpha(), isdigit(), islower(), isspace(), istitle(), isupper(), lower(), lstrip(), replace(), rfind(), rindex(), rstrip(), startswith(), strip().
However, if the method modifies the string, such as strip(), it returns a new Text instance, invalidating all computed attributes such as the start and end positions resulting from tokenization. These attributes are not copied to the new instance. However, all the original keyword arguments are passed to the new copy. It is therefore recommended to use these methods only when the text does not yet have any layers.
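A minimal sketch illustrating this; as described above, computed layers are not carried over to the new instance:
from estnltk import Text
text = Text(' Tere maailm! ')
text.tokenize_words()
stripped = text.strip()      # string methods return a new Text instance
print('words' in text)       # True: the original keeps its computed layer
print('words' in stripped)   # False: computed layers are not copied over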
Here is an example showing a few of these methods at work:
In [57]:
from estnltk import Text
text = Text(' TERE MAAILM ').strip().capitalize().replace('maailm', 'estnltk!')
print (text)
You can split a Text into smaller Text instances along a layer with the split_by() method. An example of splitting by a simple layer (sentences):
In [58]:
from estnltk import Text
from pprint import pprint
text = Text('Esimene lause. Teine lause. Kolmas lause.')
for sentence in text.split_by('sentences'):
    pprint(sentence)
An example with a multi layer:
In [59]:
from estnltk import Text
text = Text('Kõrred, millel on toitunud viljasääse vastsed, jäävad õhukeseks.')
for clause in text.split_by('clauses'):
    print (clause)
note
Things to remember!
- The resulting splits are also Text instances.
- Simple layer elements that do not belong entirely to a single split are discarded!
- Multi layer element regions that do not belong entirely to a single split are discarded!
- Multi layer elements will end up in several splits, if the spans of the element are distributed over several splits.
- Start and end positions defining the layer element locations are modified so that they align with the split they are moved into.
- Splitting only deals with the start and end attributes of layer elements. Other attributes are not modified and are copied as they are.
- Multi layer split texts are by default separated with a space character ' '.
You can also split text using regular expressions with the split_by_regex() method:
In [60]:
from estnltk import Text
text = Text('Pidage meeles, et <red font>teete kodused tööd kõik ära</red font>, muidu tuleb pahandus!')
text.split_by_regex('<red font>.*?</red font>')
Out[60]:
By default, the matched regions are discarded and used as separators. This can be changed with the gaps=False argument, which reverses the behaviour:
In [61]:
text.split_by_regex('<red font>.*?</red font>', gaps=False)
Out[61]:
In addition to splitting, we use the term dividing when we do not actually want Text instances as the result. Instead, we may just want to access, say, the words one sentence at a time, while keeping references to the original instance. Estnltk has the divide() method, which takes two parameters: the layer to divide into bins and the layer that defines the bins:
In [62]:
from estnltk import Text
text = Text('Esimene lause. Teine lause.')
for sentence in text.divide('words', 'sentences'):
    for word in sentence:
        word['new_attribute'] = 'Estnltk greets the word ' + word['text']
In [63]:
text
Out[63]:
As the example shows, the divide() method is useful when you want to modify the original layer elements while processing them in groups, since the returned elements are references to the elements of the original Text instance.
note
Nota bene!
The original references are lost for elements whose start and end positions are in the multi layer format. The reason is that multi layer elements can span regions that end up in different splits/divisions, thus invalidating the start and end attributes. Updating the invalidated attributes requires modifying them, which we cannot do, as this would also modify the original element. Thus, a copy of the element is made instead, the attributes are updated, and the copy is returned.
The temporal expressions tagger identifies temporal expressions (timexes) in text and normalizes these expressions, providing the corresponding calendrical dates and times. The current version of the temporal expressions tagger is tuned for processing news texts, so the quality of the analysis may be suboptimal in other domains. The program outputs annotations in a format similar to TimeML's TIMEX3 (a more detailed description can be found in the annotation guidelines, which are currently only in Estonian).
The Text class has the property timexes, which returns a list of temporal expressions found in the text:
In [64]:
from estnltk import Text
from pprint import pprint
text = Text('Järgmisel kolmapäeval, kõige hiljemalt kell 18.00 algab viiepäevane koosolek, mida korraldatakse igal aastal')
The output is a list of four dictionaries, each representing a timex found in the text:
In [65]:
pprint(text.timexes)
There are a number of mandatory attributes present in the dictionaries:
type - following the TimeML specification, four types of temporal expressions are distinguished: DATE, TIME, DURATION and SET.
temporal_function - boolean value indicating whether the semantics of the expression are relative to the context.
For DATE and TIME expressions:
For DURATION expressions, temporal_function is mostly False, except for vague durations.
value - a mandatory attribute containing the semantics; it has four possible formats:
1. Date and time: yyyy-mm-ddThh:mm
2. Week-based: yyyy-Wnn-wdThh:mm
3. Time-based: Thh:mm
4. Time span: Pn1Yn2Mn3Wn4DTn5Hn6M, where ni denotes a value and Y (year), M (month), W (week), D (day), H (hours), M (minutes) denote the respective time granularity.
Formats (1) and (2) are used with the DATE, TIME and SET types. Format (1) is always preferred if both (1) and (2) can be used. Format (3) is used in cases where it is impossible to extract the date. Format (4) is used in time span expressions.
In addition, there are dedicated markers for special time notions (a sketch of reading the timex attributes follows this list):
- Different times of the day
- Weekends/workdays
- Seasons
- Quarters
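As a sketch, the attributes described above can be read directly from the dictionaries returned by the timexes property (assuming the attribute names map one-to-one to dictionary keys, as in the pprint output earlier):
from estnltk import Text
text = Text('Järgmisel kolmapäeval, kõige hiljemalt kell 18.00 algab viiepäevane koosolek, mida korraldatakse igal aastal')
for timex in text.timexes:
    # type, value and temporal_function are mandatory attributes
    print(timex['type'], timex['value'], timex['temporal_function'])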
By default, relative expressions are resolved with respect to the current date, so the output of the following example depends on when it is run:
In [66]:
from estnltk import Text
Text('Täna on ilus ilm').timexes
Out[66]:
However, when passing creation_date=datetime.datetime(1986, 12, 21), we see that the word "today" (täna) refers to December 21, 1986:
In [67]:
import datetime
Text('Täna on ilus ilm', creation_date=datetime.datetime(1986, 12, 21)).timexes
Out[67]:
Here are some examples of temporal expressions and fields that the tagger can extract. The document creation date is fixed to Dec 21, 1986 in the examples below. See annotation guidelines for more detailed explanations.
Example | Temporal expression | Type | Value | Modifier
---|---|---|---|---
Järgmisel reedel | Järgmisel reedel | DATE | 1986-12-26 |
2004. aastal | 2004. aastal | DATE | 2004 |
esmaspäeva hommikul | esmaspäeva hommikul | TIME | 1986-12-15TMO |
järgmisel reedel kell 14.00 | järgmisel reedel kell 14. 00 | TIME | 1986-12-26T14:00 |
neljapäeviti | neljapäeviti | SET | XXXX-WXX-XX |
hommikuti | hommikuti | SET | XXXX-XX-XXTMO |
selle kuu alguses | selle kuu alguses | DATE | 1986-12 | START
1990ndate lõpus | 1990ndate lõpus | DATE | 199 | END
VI sajandist e.m.a | VI sajandist e.m.a | DATE | BC05 |
kolm tundi | kolm tundi | DURATION | PT3H |
viis kuud | viis kuud | DURATION | P5M |
kaks minutit | kaks minutit | DURATION | PT2M |
teisipäeviti | teisipäeviti | SET | XXXX-WXX-XX |
kolm päeva igas kuus | kolm päeva | DURATION | P3D |
kolm päeva igas kuus | igas kuus | SET | P1M |
hiljuti | hiljuti | DATE | PAST_REF |
tulevikus | tulevikus | DATE | FUTURE_REF |
2009. aasta alguses | 2009. aasta alguses | DATE | 2009 | START
juuni alguseks 2007. aastal | juuni alguseks | DATE | 1986-06 | START
juuni alguseks 2007. aastal | 2007. aastal | DATE | 2007 |
2009. aasta esimesel poolel | 2009. aasta esimesel poolel | DATE | 2009 | FIRST_HALF
umbes 4 aastat | umbes 4 aastat | DURATION | P4Y | APPROX
peaaegu 4 aastat | peaaegu 4 aastat | DURATION | P4Y | LESS_THAN
12-15 märts 2009 | 12- | DATE | 2009-03-12 |
12-15 märts 2009 | 15 märts 2009 | DATE | 2009-03-15 |
12-15 märts 2009 | | DURATION | PXXD |
eelmise kuu lõpus | eelmise kuu lõpus | DATE | 1986-11 | END
2004. aasta suvel | 2004. aasta suvel | DATE | 2004-SU |
Detsembris oli keskmine temperatuur kaks korda madalam kui kuu aega varem | Detsembris | DATE | 1986-12 |
Detsembris oli keskmine temperatuur kaks korda madalam kui kuu aega varem | kuu aega varem | DATE | 1986-11 |
neljapäeval, 17. juunil | neljapäeval , 17. juunil | DATE | 1986-06-17 |
täna, 100 aastat tagasi | täna | DATE | 1986-12-21 |
täna, 100 aastat tagasi | 100 aastat tagasi | DATE | 1886 |
neljapäeva öösel vastu reedet | neljapäeva öösel vastu reedet | TIME | 1986-12-19TNI |
viimase aasta jooksul | viimase aasta jooksul | DURATION | P1Y |
viimase aasta jooksul | | DATE | 1985 |
viimase kolme aasta jooksul | viimase kolme aasta jooksul | DURATION | P3Y |
viimase kolme aasta jooksul | | DATE | 1983 |
aastaid tagasi | aastaid tagasi | DATE | PAST_REF |
aastate pärast | aastate pärast | DATE | FUTURE_REF |
A simple sentence, also called an independent clause, typically contains a finite verb, and expresses a complete thought. However, natural language sentences can also be long and complex, consisting of two or more clauses joined together. The clause structure can be made even more complex due to embedded clauses, which divide their parent clauses into two halves:
In [68]:
from estnltk import Text
text = Text('Mees, keda seal kohtasime, oli tuttav ja teretas meid.')
The clause annotations define embedded clauses and clause boundaries. Additionally, each word in a sentence is associated with a clause index:
In [69]:
text.get.word_texts.clause_indices.clause_annotations.as_dataframe
Out[69]:
Clause annotation information is stored in the words layer as clause_index and clause_annotation attributes:
In [70]:
text.words
Out[70]:
Clause indices and annotations can be explicitly tagged with the method tag_clause_annotations().
The property clause_texts can be used to see the full clauses themselves:
In [71]:
text.clause_texts
Out[71]:
The method tag_clauses() can be used to create a special clauses multi layer that lists the character-level start and end positions of the clause regions:
In [72]:
text.tag_clauses()
text['clauses']
Out[72]:
It might be useful to process each clause of the sentence independently:
In [73]:
for clause in text.split_by('clauses'):
    print (clause.text)
Because commas are important clause delimiters in Estonian, the quality of the clause segmentation may suffer due to accidentally missing commas in the input text. To address this issue, the clause segmenter can be initialized in a mode in which the program tries to be less sensitive to missing commas while detecting clause boundaries.
Example:
In [74]:
from estnltk import ClauseSegmenter
from estnltk import Text
segmenter = ClauseSegmenter(ignore_missing_commas=True)
text = Text('Keegi teine ka siin ju kirjutas et ütles et saab ise asjadele järgi minna aga vastust seepeale ei tulnudki.', clause_segmenter=segmenter)
for clause in text.split_by('clauses'):
print (clause.text)
Note that this mode is experimental: compared to the basic mode, it may introduce additional incorrect clause boundaries, although it also improves clause boundary detection in texts with (a lot of) missing commas.
The verb chain tagger identifies main verbs (predicates) in clauses. The current version of the program aims to detect the following verb chain constructions:
Verb chains are stored as a simple layer named verb_chains:
In [75]:
from estnltk import Text
text = Text('Ta oleks pidanud sinna minema, aga ei läinud.')
text.verb_chains
Out[75]:
Following is a brief description of the attributes:
- analysis_ids - the indices of the analysis ids of the words in the phrase of this chain.
- clause_index - the id of the clause this chain was tagged in.
- mood - mood of the finite verb. Possible values: 'indic' (indicative), 'imper' (imperative), 'condit' (conditional), 'quotat' (quotative) or '??' (undetermined).
- morph - for each word in the chain, lists its morphological features: part of speech tag and form (in one string, separated by '_'; multiple variants of the pos/form are separated by '/').
- other_verbs - boolean, marks whether there are other verbs in the context that could potentially be added to the verb chain; if True, then it is uncertain whether the chain is complete or not.
- pattern - the general pattern of the chain: for each word in the chain, lists whether it is 'ega', 'ei', 'ära', 'pole', 'ole', '&' (conjunction: ja/ning/ega/või), 'verb' (verb other than 'ole') or 'nom/adv' (nominal/adverb).
- phrase - the word indices of the sentence that make up the verb chain phrase.
- pol - grammatical polarity of the finite verb. Possible values: 'POS', 'NEG' or '??'. 'NEG' means that the chain begins with a negation word ei/pole/ega/ära; '??' is reserved for cases where it is uncertain whether ära forms a negated verb chain or not.
- roots - for each word in the chain, lists its corresponding 'root' value from the morphological analysis.
- tense - tense of the finite verb. Possible values depend on the mood value. Tenses of the indicative: 'present', 'imperfect', 'perfect', 'pluperfect'; tense of the imperative: 'present'; tenses of the conditional and quotative: 'present' and 'past'. Additionally, the tense may remain undetermined ('??').
- voice - voice of the finite verb. Possible values: 'personal', 'impersonal', '??' (undetermined).
Note that the words in the verb chain (in phrase, pattern, morph and roots) are ordered by the order of the grammatical relations, which may not coincide with the word order in the text. The first word is the finite verb (main verb) of the clause (except in negation constructions, where the first word is typically a negation word), and each following word is governed by the previous word in the chain. An exception: the chain may end with a conjunction of two infinite verbs (general pattern verb & verb); in this case, both infinite verbs can be considered as being governed by the preceding word in the chain.
Attributes start and end contain the start and end positions of each token in the phrase, and these token positions are listed in ascending order, regardless of the order of the grammatical relations.
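A sketch that prints a few of these attributes for each detected chain, reusing the example sentence from above:
from estnltk import Text
text = Text('Ta oleks pidanud sinna minema, aga ei läinud.')
for chain in text.verb_chains:
    # pattern, roots, mood, tense and pol are described in the list above
    print(chain['pattern'], chain['roots'], chain['mood'], chain['tense'], chain['pol'])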
The Estonian WordNet API provides the means to query the Estonian WordNet. A wordnet is a network of synsets, in which synsets are collections of synonymous words that are connected to other synsets via relations. For example, the synset containing the word "koer" ("dog") has a generalisation via the hypernymy relation in the form of the synset containing the word "koerlane" ("canine").
Estonian WordNet contains synsets with different types of part-of-speech: adverbs, adjectives, verbs and nouns.
Part of speech | API equivalent
---|---
Adverb | wn.ADV
Adjective | wn.ADJ
Noun | wn.NOUN
Verb | wn.VERB
The API is for the most part in conformance with the NLTK WordNet API (http://www.nltk.org/howto/wordnet.html). However, there are some differences due to the different structures of the WordNets.
Existing relations:
antonym, be_in_state, belongs_to_class, causes, fuzzynym, has_holo_location, has_holo_madeof, has_holo_member, has_holo_part, has_holo_portion, has_holonym, has_hyperonym, has_hyponym, has_instance, has_mero_location, has_mero_madeof, has_mero_member, has_mero_part, has_mero_portion, has_meronym, has_subevent, has_xpos_hyperonym, has_xpos_hyponym, involved, involved_agent, involved_instrument, involved_location, involved_patient, involved_target_direction, is_caused_by, is_subevent_of, near_antonym, near_synonym, role, role_agent, role_instrument, role_location, role_patient, role_target_direction, state_of, xpos_fuzzynym, xpos_near_antonym, xpos_near_synonym.
Before anything else, let's import the module:
In [76]:
from estnltk.wordnet import wn
The most common use for the API is to query synsets. Synsets can be queried in several ways. The easiest way is to query all the synsets which match some conditions. For that we can either use:
In [77]:
wn.all_synsets()
Out[77]:
which returns all the synsets there are, or:
In [78]:
wn.all_synsets(pos=wn.ADV)
Out[78]:
which returns all the synsets whose part of speech is "adverb". We can also query synsets by providing a lemma and a part of speech using:
In [79]:
wn.synsets("koer",pos=wn.VERB)
Out[79]:
By neglecting "pos", it matches once again all the synsets with "koer" as lemma:
In [80]:
wn.synsets("koer")
Out[80]:
The API allows querying a synset's details. For example, we can retrieve its name and pos:
In [81]:
synset = wn.synset("king.n.01")
synset.name
Out[81]:
We can also query definition and examples:
In [82]:
synset.definition()
Out[82]:
In [83]:
synset.examples()
Out[83]:
We can also query related synsets, such as hypernyms, hyponyms, meronyms and holonyms:
In [84]:
synset.hypernyms()
Out[84]:
In [85]:
synset.hyponyms()
Out[85]:
In [86]:
synset.meronyms()
Out[86]:
In [87]:
synset.holonyms()
Out[87]:
More specific relations can be queried with a universal method:
In [88]:
synset = wn.synset('jäätis.n.01')
synset.get_related_synsets('fuzzynym')
Out[88]:
We can also measure the similarity between two synsets using path_similarity, lch_similarity (Leacock-Chodorow) and wup_similarity (Wu-Palmer):
In [89]:
synset = wn.synset('jalats.n.01')
target_synset = wn.synset('kinnas.n.01')
In [90]:
synset.path_similarity(target_synset)
Out[90]:
In [91]:
synset.lch_similarity(target_synset)
Out[91]:
In [92]:
synset.wup_similarity(target_synset)
Out[92]:
In addition, we can also find the closest common ancestor via hypernyms:
In [93]:
synset.lowest_common_hypernyms(target_synset)
Out[93]: