EventTagger

A class that finds a list of events in a Text object based on a user-provided vocabulary. Each event is annotated with several position attributes (start, end, wstart_raw, wend_raw, cstart, wstart, bstart) and with user-provided classifiers.

Usage


In [1]:
from pandas import DataFrame, read_csv
from estnltk import Text
from estnltk.taggers import EventTagger

Example 1

Create the event vocabulary as a pandas DataFrame


In [2]:
event_vocabulary = DataFrame([['Harv',          'sagedus'], 
                              ['tugev peavalu', 'sümptom']], 
                      columns=['term',          'type'])

or as a file event vocabulary.csv in CSV format:

term,type
Harv,sagedus
tugev peavalu,sümptom

In [3]:
event_vocabulary = read_csv('data/event vocabulary.csv')

or as a list of dicts


In [4]:
event_vocabulary = [{'term': 'harv',          'type': 'sagedus'},
                    {'term': 'tugev peavalu', 'type': 'sümptom'}]

The event_vocabulary must contain one key (column) called term. It holds the strings that are searched for in the text. Other keys (type in this example) are optional. No key may be named start, end, wstart_raw, wend_raw, cstart, wstart, or bstart, because the tagger uses these names for the attributes it computes.
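The key rules above can be checked before a vocabulary is handed to the tagger. The following is a minimal sketch for the list-of-dicts form only; check_vocabulary and RESERVED_KEYS are hypothetical names introduced for illustration and are not part of estnltk.

```python
# Attribute names that the text above reserves for the tagger's own output.
RESERVED_KEYS = {'start', 'end', 'wstart_raw', 'wend_raw',
                 'cstart', 'wstart', 'bstart'}


def check_vocabulary(vocabulary):
    """Hypothetical helper: raise ValueError if an entry breaks the key rules."""
    for i, entry in enumerate(vocabulary):
        # Every entry needs the string to search for.
        if 'term' not in entry:
            raise ValueError('entry %d has no "term" key' % i)
        # No entry may reuse a name the tagger computes itself.
        clash = RESERVED_KEYS.intersection(entry)
        if clash:
            raise ValueError('entry %d uses reserved keys: %s' % (i, sorted(clash)))
    return True


vocabulary = [{'term': 'harv',          'type': 'sagedus'},
              {'term': 'tugev peavalu', 'type': 'sümptom'}]
check_vocabulary(vocabulary)  # passes without raising
```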

Create a Text object and an EventTagger object, then find the list of events.


In [5]:
text = Text('Tugev peavalu esineb valimis harva.')
event_tagger = EventTagger(event_vocabulary, search_method='ahocorasick', case_sensitive=False,
                           conflict_resolving_strategy='ALL', return_layer=True)
event_tagger.tag(text)


Out[5]:
[{'bstart': 0,
  'cstart': 0,
  'end': 13,
  'start': 0,
  'term': 'tugev peavalu',
  'type': 'sümptom',
  'wend_raw': 2,
  'wstart': 0,
  'wstart_raw': 0},
 {'bstart': 2,
  'cstart': 17,
  'end': 33,
  'start': 29,
  'term': 'harv',
  'type': 'sagedus',
  'wend_raw': 5,
  'wstart': 3,
  'wstart_raw': 4}]

The attributes start and end give the character positions at which the event starts and ends.
The attributes wstart_raw (word start raw) and wend_raw (word end raw) give the word positions at which the event starts and ends.
The attributes cstart (char start) and wstart (word start) are like start and wstart_raw, but they are calculated as if every event consisted of a single character (for cstart) or a single word (for wstart).
The bstart (block start) attribute is like wstart_raw, but it is calculated as if every event and every gap between the events (if one exists) consisted of a single word. There is a gap between the events A and B if

wend_raw of A < wstart_raw of B.

The cstart, wstart and bstart attributes are calculated only if there are no overlapping events in the text. Use conflict_resolving_strategy='MAX' or conflict_resolving_strategy='MIN' to remove overlaps.
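Whether a list of events overlaps can be checked from start and end alone. The sketch below is not the tagger's own code; has_overlap is a hypothetical helper, and it assumes that two events which merely touch (the end of one equals the start of the next) do not count as overlapping, which is consistent with the 'MIN' output in Example 2.

```python
def has_overlap(events):
    """Hypothetical helper: True if any two events share at least one character."""
    ordered = sorted(events, key=lambda e: (e['start'], e['end']))
    # After sorting by start, any overlap shows up between neighbours.
    for previous, current in zip(ordered, ordered[1:]):
        if current['start'] < previous['end']:
            return True
    return False


# Spans from Example 1: 'tugev peavalu' and 'harv'.
example_1 = [{'start': 0, 'end': 13}, {'start': 29, 'end': 33}]
# Two of the spans found in Example 2 with 'ALL': 'kaks' and 'kakskümmend'.
example_2 = [{'start': 0, 'end': 4}, {'start': 0, 'end': 11}]

print(has_overlap(example_1))  # False
print(has_overlap(example_2))  # True
```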

Text: Tugev peavalu esineb valimis harva.

attribute    tugev peavalu    harv
start        0                29
end          13               33
wstart_raw   0                4
wend_raw     2                5
cstart       0                17
wstart       0                3
bstart       0                2
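The values of cstart, wstart and bstart can be reproduced from the raw attributes. The function below is a sketch reconstructed from the descriptions and outputs in this document, not the tagger's actual implementation; add_compressed_positions is a hypothetical name. It assumes the events are non-overlapping and sorted by start, and it further assumes that text before the first event counts as a gap, which the examples here do not cover.

```python
def add_compressed_positions(events):
    """Hypothetical helper: add cstart, wstart and bstart to sorted, non-overlapping events."""
    char_shift = 0     # characters removed by shrinking earlier events to one char
    word_shift = 0     # words removed by shrinking earlier events to one word
    block = 0          # index of the next block (an event or a gap)
    previous_wend = 0  # assumption: leading text is treated like a gap
    for event in events:
        # A gap exists if wend_raw of the previous event < wstart_raw of this one.
        if previous_wend < event['wstart_raw']:
            block += 1
        event['cstart'] = event['start'] - char_shift
        event['wstart'] = event['wstart_raw'] - word_shift
        event['bstart'] = block
        char_shift += event['end'] - event['start'] - 1
        word_shift += event['wend_raw'] - event['wstart_raw'] - 1
        block += 1
        previous_wend = event['wend_raw']
    return events


events = [{'start': 0,  'end': 13, 'wstart_raw': 0, 'wend_raw': 2},
          {'start': 29, 'end': 33, 'wstart_raw': 4, 'wend_raw': 5}]
for event in add_compressed_positions(events):
    print(event['cstart'], event['wstart'], event['bstart'])
# 0 0 0
# 17 3 2
```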

The search_method is either 'ahocorasick' or 'naive'. The 'naive' method is generally slower but does not depend on the pyahocorasick package.
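If it is not known in advance whether pyahocorasick is installed, the search method can be chosen at run time. This is a small sketch under the assumption that the package is importable as the module ahocorasick; pick_search_method is a hypothetical helper, not an estnltk function.

```python
import importlib.util


def pick_search_method():
    """Hypothetical helper: prefer 'ahocorasick' when its module can be imported."""
    # Assumption: the pyahocorasick package installs a module named `ahocorasick`.
    if importlib.util.find_spec('ahocorasick') is not None:
        return 'ahocorasick'
    return 'naive'


search_method = pick_search_method()
print(search_method)
```

The returned string could then be passed as the search_method argument of EventTagger.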

The conflict_resolving_strategy is either 'ALL', 'MIN' or 'MAX' (see the next example).

The events in the output are ordered by start and then by end.
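The ordering corresponds to sorting on the pair (start, end). The snippet below illustrates this with the five spans that appear in Out[7] of Example 2, given here in shuffled order; it is only an illustration of the ordering, not the tagger's code.

```python
# Spans (start, end) from Example 2, deliberately shuffled.
spans = [(12, 16), (0, 16), (4, 9), (0, 4), (0, 11)]
events = [{'start': start, 'end': end} for start, end in spans]

# Order by start first and by end second.
events.sort(key=lambda event: (event['start'], event['end']))

print([(event['start'], event['end']) for event in events])
# [(0, 4), (0, 11), (0, 16), (4, 9), (12, 16)]
```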

The defaults are:

search_method='naive' # for Python < 3
search_method='ahocorasick' # for Python >= 3
case_sensitive=True
conflict_resolving_strategy='MAX'
return_layer=False
layer_name='events'

Example 2


In [6]:
event_vocabulary = [
                    {'term': 'kaks', 'value': 2, 'type': 'väike'},
                    {'term': 'kümme', 'value': 10, 'type': 'keskmine'},
                    {'term': 'kakskümmend', 'value': 20, 'type': 'suur'},
                    {'term': 'kakskümmend kaks', 'value': 22, 'type': 'suur'}
                   ]
text = Text('kakskümmend kaks')

conflict_resolving_strategy='ALL' returns all events.


In [7]:
event_tagger = EventTagger(event_vocabulary, search_method='naive', conflict_resolving_strategy='ALL', return_layer=True)
event_tagger.tag(text)


Out[7]:
[{'end': 4,
  'start': 0,
  'term': 'kaks',
  'type': 'väike',
  'value': 2,
  'wend_raw': 1,
  'wstart_raw': 0},
 {'end': 11,
  'start': 0,
  'term': 'kakskümmend',
  'type': 'suur',
  'value': 20,
  'wend_raw': 1,
  'wstart_raw': 0},
 {'end': 16,
  'start': 0,
  'term': 'kakskümmend kaks',
  'type': 'suur',
  'value': 22,
  'wend_raw': 0,
  'wstart_raw': 0},
 {'end': 9,
  'start': 4,
  'term': 'kümme',
  'type': 'keskmine',
  'value': 10,
  'wend_raw': 1,
  'wstart_raw': 0},
 {'end': 16,
  'start': 12,
  'term': 'kaks',
  'type': 'väike',
  'value': 2,
  'wend_raw': 0,
  'wstart_raw': 2}]

conflict_resolving_strategy='MAX' returns all the events that are not contained in any other event.


In [8]:
event_tagger = EventTagger(event_vocabulary, search_method='naive', conflict_resolving_strategy='MAX', return_layer=True)
event_tagger.tag(text)


Out[8]:
[{'bstart': 0,
  'cstart': 0,
  'end': 16,
  'start': 0,
  'term': 'kakskümmend kaks',
  'type': 'suur',
  'value': 22,
  'wend_raw': 0,
  'wstart': 0,
  'wstart_raw': 0}]

conflict_resolving_strategy='MIN' returns all the events that don't contain any other event.


In [9]:
event_tagger = EventTagger(event_vocabulary, search_method='naive', conflict_resolving_strategy='MIN', return_layer=True)
event_tagger.tag(text)


Out[9]:
[{'bstart': 0,
  'cstart': 0,
  'end': 4,
  'start': 0,
  'term': 'kaks',
  'type': 'väike',
  'value': 2,
  'wend_raw': 1,
  'wstart': 0,
  'wstart_raw': 0},
 {'bstart': 1,
  'cstart': 1,
  'end': 9,
  'start': 4,
  'term': 'kümme',
  'type': 'keskmine',
  'value': 10,
  'wend_raw': 1,
  'wstart': 0,
  'wstart_raw': 0},
 {'bstart': 3,
  'cstart': 5,
  'end': 16,
  'start': 12,
  'term': 'kaks',
  'type': 'väike',
  'value': 2,
  'wend_raw': 0,
  'wstart': 2,
  'wstart_raw': 2}]
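The three strategies can be summarised in a few lines. The following sketch implements only the containment rules as they are worded above and is not the tagger's own code; contains and resolve are hypothetical names, and two events with identical spans are assumed not to contain each other.

```python
def contains(outer, inner):
    """True if `inner` lies inside `outer` and the two spans differ."""
    return (outer['start'] <= inner['start'] and inner['end'] <= outer['end']
            and (outer['start'], outer['end']) != (inner['start'], inner['end']))


def resolve(events, strategy):
    """Hypothetical helper: filter events according to the strategy."""
    if strategy == 'ALL':
        return list(events)
    if strategy == 'MAX':
        # Keep events that are not contained in any other event.
        return [e for e in events if not any(contains(o, e) for o in events)]
    if strategy == 'MIN':
        # Keep events that do not contain any other event.
        return [e for e in events if not any(contains(e, o) for o in events)]
    raise ValueError('unknown strategy: %r' % strategy)


# The five events found in 'kakskümmend kaks' with strategy 'ALL'.
found = [{'term': 'kaks',             'start': 0,  'end': 4},
         {'term': 'kakskümmend',      'start': 0,  'end': 11},
         {'term': 'kakskümmend kaks', 'start': 0,  'end': 16},
         {'term': 'kümme',            'start': 4,  'end': 9},
         {'term': 'kaks',             'start': 12, 'end': 16}]

print([e['term'] for e in resolve(found, 'MAX')])  # ['kakskümmend kaks']
print([e['term'] for e in resolve(found, 'MIN')])  # ['kaks', 'kümme', 'kaks']
```

On this input the sketch selects the same events as Out[8] and Out[9]. Events that overlap only partially, with neither containing the other, are left untouched by it.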