In [1]:
from pandas import DataFrame, read_csv
from estnltk import Text
from estnltk.taggers import EventTagger
In [2]:
event_vocabulary = DataFrame([['Harv', 'sagedus'],
                              ['tugev peavalu', 'sümptom']],
                             columns=['term', 'type'])
or a file event vocabulary.csv in CSV format:
term,type
Harv,sagedus
tugev peavalu,sümptom
In [3]:
event_vocabulary = read_csv('data/event vocabulary.csv')
or a list of dicts:
In [4]:
event_vocabulary = [{'term': 'harv', 'type': 'sagedus'},
                    {'term': 'tugev peavalu', 'type': 'sümptom'}]
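The CSV form and the list-of-dicts form carry the same information. As a minimal sketch (standard library only, with the CSV content held in a string rather than read from the file above), the rows can be turned into a list of dicts like this:

```python
import csv
import io

# The same content as the CSV example, kept in memory for illustration.
csv_content = "term,type\nHarv,sagedus\ntugev peavalu,sümptom\n"

# DictReader uses the header row as keys, so each row becomes one vocabulary entry.
vocabulary_from_csv = [dict(row) for row in csv.DictReader(io.StringIO(csv_content))]
print(vocabulary_from_csv)
```

Every value read this way is a string, so numeric columns would need an explicit conversion.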
There must be one key (column) called term in event_vocabulary. It refers to the strings that are searched for in the text. Other keys (type in this example) are optional. No key may be named start, end, wstart_raw, wend_raw, cstart, wstart, or bstart.
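The key rules can be checked before a vocabulary is handed to the tagger. The helper below is hypothetical and not part of estnltk; it is a minimal sketch that assumes the vocabulary is given as a list of dicts.

```python
# Hypothetical helper (not an estnltk function): enforces the key rules stated above.
RESERVED_KEYS = {'start', 'end', 'wstart_raw', 'wend_raw', 'cstart', 'wstart', 'bstart'}

def check_vocabulary(vocabulary):
    """Raise ValueError if an entry lacks 'term' or uses a reserved key."""
    for i, entry in enumerate(vocabulary):
        if 'term' not in entry:
            raise ValueError('entry {} has no "term" key'.format(i))
        clash = RESERVED_KEYS & set(entry)
        if clash:
            raise ValueError('entry {} uses reserved keys: {}'.format(i, sorted(clash)))
    return True

check_vocabulary([{'term': 'harv', 'type': 'sagedus'},
                  {'term': 'tugev peavalu', 'type': 'sümptom'}])
```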
Create a Text object and an EventTagger object, then find the list of events.
In [5]:
text = Text('Tugev peavalu esineb valimis harva.')
event_tagger = EventTagger(event_vocabulary, search_method='ahocorasick', case_sensitive=False,
                           conflict_resolving_strategy='ALL', return_layer=True)
event_tagger.tag(text)
Out[5]:
The attributes start and end show at which character the event starts and ends.
The attributes wstart_raw (word start raw) and wend_raw (word end raw) show at which word the event starts and ends.
The attributes cstart (char start) and wstart (word start) are like start and wstart_raw but are calculated as if each event consisted of a single character (and therefore a single word).
The bstart (block start) attribute is like wstart_raw but is calculated as if every event and every gap between events (if one exists) consisted of one word. There is a gap between the events A and B if wend_raw of A < wstart_raw of B.
The cstart, wstart and bstart attributes are calculated only if there are no overlapping events in the text. Use conflict_resolving_strategy='MAX' or conflict_resolving_strategy='MIN' to remove overlaps.
attribute | Tugev peavalu | esineb valimis | harv | a.
---|---|---|---|---
start | 0 | | 29 |
end | 13 | | 33 |
wstart_raw | 0 | | 4 |
wend_raw | 2 | | 5 |
cstart | 0 | | 17 |
wstart | 0 | | 3 |
bstart | 0 | | 2 |
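The derived attributes can be reproduced from start, end, wstart_raw and wend_raw alone. The function below is a sketch written for this explanation, not estnltk's own code. It assumes the events are sorted and do not overlap. It also assumes that text before the first event counts as a gap, which the description above does not specify.

```python
def add_derived_positions(events):
    """Return copies of the events with cstart, wstart and bstart added.

    Assumes events are sorted by start and do not overlap.
    """
    result = []
    c_shift = 0    # characters removed so far by shrinking events to one char
    w_shift = 0    # words removed so far by shrinking events to one word
    block = 0      # index of the next block (event or gap)
    prev_wend = 0  # wend_raw of the previous event; 0 before the first event (assumption)
    for event in events:
        if prev_wend < event['wstart_raw']:
            block += 1  # a gap precedes this event and occupies one block
        result.append(dict(event,
                           cstart=event['start'] - c_shift,
                           wstart=event['wstart_raw'] - w_shift,
                           bstart=block))
        c_shift += event['end'] - event['start'] - 1
        w_shift += event['wend_raw'] - event['wstart_raw'] - 1
        block += 1
        prev_wend = event['wend_raw']
    return result

# The two events from the example text 'Tugev peavalu esineb valimis harva.'
events = [
    {'start': 0, 'end': 13, 'wstart_raw': 0, 'wend_raw': 2},
    {'start': 29, 'end': 33, 'wstart_raw': 4, 'wend_raw': 5},
]
for event in add_derived_positions(events):
    print(event['cstart'], event['wstart'], event['bstart'])
```

For the second event this gives cstart 17, wstart 3 and bstart 2, matching the table.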
The search_method is either 'ahocorasick' or 'naive'. 'naive' is generally slower but does not depend on the pyahocorasick package.
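To illustrate why a naive search tends to be slower, the sketch below scans the text once per vocabulary term, whereas the Aho-Corasick algorithm finds all terms in a single pass. The function naive_search is hypothetical and is not how estnltk implements its 'naive' method. It also assumes that lowercasing does not change the length of the text.

```python
def naive_search(text, terms, case_sensitive=True):
    """Return sorted (start, end, term) tuples for every occurrence of every term."""
    haystack = text if case_sensitive else text.lower()
    matches = []
    for term in terms:  # one full scan of the text per term
        needle = term if case_sensitive else term.lower()
        start = haystack.find(needle)
        while start != -1:
            matches.append((start, start + len(needle), term))
            start = haystack.find(needle, start + 1)  # also finds overlapping repeats
    return sorted(matches)

found = naive_search('Tugev peavalu esineb valimis harva.',
                     ['harv', 'tugev peavalu'], case_sensitive=False)
print(found)  # [(0, 13, 'tugev peavalu'), (29, 33, 'harv')]
```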
The conflict_resolving_strategy is either 'ALL', 'MIN' or 'MAX' (see the next example).
The events in the output are ordered by start and end.
The defaults are:
search_method='naive' # for Python < 3
search_method='ahocorasick' # for Python >= 3
case_sensitive=True
conflict_resolving_strategy='MAX'
return_layer=False
layer_name='events'
In [6]:
event_vocabulary = [
    {'term': 'kaks', 'value': 2, 'type': 'väike'},
    {'term': 'kümme', 'value': 10, 'type': 'keskmine'},
    {'term': 'kakskümmend', 'value': 20, 'type': 'suur'},
    {'term': 'kakskümmend kaks', 'value': 22, 'type': 'suur'}
]
text = Text('kakskümmend kaks')
conflict_resolving_strategy='ALL' returns all events.
In [7]:
event_tagger = EventTagger(event_vocabulary, search_method='naive', conflict_resolving_strategy='ALL', return_layer=True)
event_tagger.tag(text)
Out[7]:
conflict_resolving_strategy='MAX' returns all the events that are not contained by any other event.
In [8]:
event_tagger = EventTagger(event_vocabulary, search_method='naive', conflict_resolving_strategy='MAX', return_layer=True)
event_tagger.tag(text)
Out[8]:
conflict_resolving_strategy='MIN' returns all the events that don't contain any other event.
In [9]:
event_tagger = EventTagger(event_vocabulary, search_method='naive', conflict_resolving_strategy='MIN', return_layer=True)
event_tagger.tag(text)
Out[9]:
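The three strategies can be expressed as simple filters over (start, end, term) spans. The sketch below is an illustration of the definitions above, not estnltk's implementation. The spans are the ones a plain substring search finds in 'kakskümmend kaks', and the assumption is that two events with identical spans do not count as containing each other.

```python
def contains(outer, inner):
    """True if outer covers inner and their (start, end) positions differ."""
    return (outer[0], outer[1]) != (inner[0], inner[1]) \
        and outer[0] <= inner[0] and inner[1] <= outer[1]

def resolve(spans, strategy):
    """Filter (start, end, term) spans according to 'ALL', 'MAX' or 'MIN'."""
    if strategy == 'ALL':
        return list(spans)
    if strategy == 'MAX':  # keep spans not contained by any other span
        return [s for s in spans if not any(contains(o, s) for o in spans)]
    if strategy == 'MIN':  # keep spans that do not contain any other span
        return [s for s in spans if not any(contains(s, o) for o in spans)]
    raise ValueError('unknown strategy: {}'.format(strategy))

# Substring matches of the vocabulary terms in 'kakskümmend kaks'.
spans = [(0, 4, 'kaks'), (0, 11, 'kakskümmend'), (0, 16, 'kakskümmend kaks'),
         (4, 9, 'kümme'), (12, 16, 'kaks')]

print(resolve(spans, 'MAX'))  # [(0, 16, 'kakskümmend kaks')]
print(resolve(spans, 'MIN'))  # [(0, 4, 'kaks'), (4, 9, 'kümme'), (12, 16, 'kaks')]
```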