Estnltk includes an experimental noun phrase chunker, which can be used to detect non-overlapping noun phrases in a text.
The class NounPhraseChunker provides the method analyze_text(), which takes a Text object as input, detects potential noun phrases, and stores them in the layer NOUN_CHUNKS:
In [1]:
from estnltk import Text
from estnltk.np_chunker import NounPhraseChunker
from estnltk.names import TEXT, NOUN_CHUNKS
from pprint import pprint
# initialise the chunker
chunker = NounPhraseChunker()
text = Text('Suur karvane kass nurrus punasel diivanil, väike hiir aga hiilis temast mööda.')
# chunk the input text
text = chunker.analyze_text( text )
# output the results (found phrases)
pprint( text[NOUN_CHUNKS] )
By default, the method NounPhraseChunker.analyze_text() returns the input Text object.
The keyword argument return_type can be used to change the type of data returned.
If return_type='labels', the method returns the chunking results in the BIO annotation scheme:
In [2]:
from estnltk import Text
from estnltk.np_chunker import NounPhraseChunker
from estnltk.names import TEXT
# initialise the chunker
chunker = NounPhraseChunker()
text = Text('Suur karvane kass nurrus punasel diivanil, väike hiir aga hiilis temast mööda.')
# chunk the input text, get the results in BIO annotation format
np_labels = chunker.analyze_text( text, return_type='labels' )
# output results of the chunking in BIO annotation format
for word, np_label in zip(text.words, np_labels):
    print( word[TEXT]+' '+str(np_label) )
In the above example, the resulting list np_labels contains a label for each word in the input text, indicating the word's position in a phrase:
"B" denotes that the word begins a phrase, "I" indicates that the word is inside a phrase, and "O" indicates that the word does not belong to any noun phrase.
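To make the labelling scheme concrete, the following minimal sketch groups words into phrases from a list of BIO labels. It is plain Python and does not call Estnltk; the helper name bio_to_phrases is hypothetical, and the labels are written by hand for illustration rather than taken from actual chunker output.

```python
def bio_to_phrases(words, labels):
    """Group words into phrases according to their BIO labels."""
    phrases = []
    current = []
    for word, label in zip(words, labels):
        if label == 'B':
            # a new phrase starts; close the previous one, if any
            if current:
                phrases.append(current)
            current = [word]
        elif label == 'I' and current:
            # the word continues the currently open phrase
            current.append(word)
        else:
            # 'O' (or a stray 'I'): the word is outside any phrase
            if current:
                phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases

# Hand-written labels, for illustration only (not real chunker output)
words = ['Suur', 'karvane', 'kass', 'nurrus']
labels = ['B', 'I', 'I', 'O']
print(bio_to_phrases(words, labels))
```

Two consecutive "B" labels yield two separate one-word phrases, which is how the scheme keeps adjacent phrases apart.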
If the argument return_type="strings" is passed to the method, it returns only the chunking results, as a list of phrase strings:
In [3]:
from estnltk import Text
from estnltk.np_chunker import NounPhraseChunker
# initialise the chunker
chunker = NounPhraseChunker()
text = Text('Autojuhi lapitekk pälvis linna koduleheküljel paljude kodanike tähelepanu.')
# chunk the input text
phrase_strings = chunker.analyze_text( text, return_type="strings" )
The above example produces the following output:
In [4]:
print( phrase_strings )
If return_type="tokens" is set, the chunker returns a list of lists of tokens, where each token is given as a dictionary containing the analyses of the word:
In [5]:
from estnltk import Text
from estnltk.np_chunker import NounPhraseChunker
from estnltk.names import TEXT, ANALYSIS, LEMMA
# initialise the chunker
chunker = NounPhraseChunker()
text = Text('Autojuhi lapitekk pälvis linna koduleheküljel paljude kodanike tähelepanu.')
# chunk the input text
phrases = chunker.analyze_text( text, return_type="tokens" )
# output phrases word by word
for phrase in phrases:
    print()
    for token in phrase:
        # output text and first lemma
        print( token[TEXT], token[ANALYSIS][0][LEMMA] )
Note that, regardless of the return_type, the layer NOUN_CHUNKS will always be added to the input Text.
By default, the chunker does not tag phrases longer than 3 words, as the quality of tagging longer phrases is likely suboptimal, and the coverage of these phrases is also likely low [1].
Phrases longer than 3 words are therefore cut into one-word phrases.
This default behaviour can be turned off by passing cutPhrases=False as an argument to the method analyze_text():
In [6]:
from estnltk import Text
from estnltk.np_chunker import NounPhraseChunker
# initialise the chunker
chunker = NounPhraseChunker()
text = Text('Kõige väiksemate tassidega serviis toodi kusagilt vanast tolmusest kapist välja.')
# chunk the input text while allowing phrases longer than 3 words
phrase_strings = chunker.analyze_text( text, cutPhrases=False, return_type="strings" )
The output is as follows:
In [7]:
print( phrase_strings )
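The default cutting rule described above can be illustrated on BIO labels. The sketch below is one reading of "cut into one-word phrases" and not Estnltk's own implementation: the function name cut_long_phrases and the parameter max_len are hypothetical, and the actual chunker may treat the words of a cut phrase differently.

```python
def cut_long_phrases(labels, max_len=3):
    """Relabel every phrase longer than max_len words as one-word phrases."""
    result = list(labels)
    start = None
    # a trailing 'O' sentinel closes a phrase that ends the sequence
    for i, label in enumerate(list(labels) + ['O']):
        if label != 'I' and start is not None:
            # the phrase spanning positions start..i-1 has ended
            if i - start > max_len:
                for j in range(start, i):
                    result[j] = 'B'
            start = None
        if label == 'B':
            start = i
    return result

# Hand-written labels: one 4-word phrase, one outside word, one 2-word phrase
labels = ['B', 'I', 'I', 'I', 'O', 'B', 'I']
print(cut_long_phrases(labels))
```

Under this reading, the 4-word phrase becomes four one-word phrases, while the 2-word phrase is left unchanged.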
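The chunker can also be initialised with VISLCG3Parser instead of MaltParser, by passing it as the parser argument:
In [9]: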
from estnltk import Text
from estnltk.np_chunker import NounPhraseChunker
from estnltk.names import NOUN_CHUNKS
from estnltk.syntax.parsers import VISLCG3Parser
from pprint import pprint
# initialise the chunker using VISLCG3Parser instead of MaltParser
chunker = NounPhraseChunker( parser = VISLCG3Parser() )
text = Text('Maril oli väike tall.')
# chunk the input text
text = chunker.analyze_text( text )
# output the results (found phrases)
pprint( text[NOUN_CHUNKS] )