Tokenization is the process that identifies the text boundaries of words and sentences. We can identify the boundaries of sentences first and then tokenize each sentence to identify the words that compose it, or we can do word tokenization first and then segment the token sequence into sentences. Tokenization in polyglot relies on the Unicode Text Segmentation algorithm as implemented by the ICU Project.
You can install the C/C++ ICU library via the libicu-dev package. For example, on Ubuntu/Debian systems you can use the apt-get utility as follows:
In [ ]:
sudo apt-get install libicu-dev
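For a glimpse of what happens under the hood, the PyICU bindings expose ICU's BreakIterator, which implements the Unicode Text Segmentation rules that polyglot relies on. The following is a minimal sketch, assuming the pyicu package is installed; the locale and sample sentence are illustrative, not part of polyglot's API.
In [ ]:
from icu import BreakIterator, Locale

sentence = u"Tokenization splits raw text into words."
boundary = BreakIterator.createWordInstance(Locale("en_US"))
boundary.setText(sentence)

tokens = []
start = boundary.first()
for end in boundary:       # iterating yields successive boundary offsets
    token = sentence[start:end]
    if token.strip():      # drop whitespace-only segments
        tokens.append(token)
    start = end
print(tokens)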
In [4]:
from polyglot.text import Text
To call our word tokenizer, first we need to construct a Text object.
In [9]:
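# Chinese news text (translation): the kosher supermarket in Paris that
# suffered a terrorist attack two months ago reopened on Sunday after
# renovation; the January 9 gun attack that killed four people is
# reportedly linked to the Charlie Hebdo attack.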
blob = u"""
两个月前遭受恐怖袭击的法国巴黎的犹太超市在装修之后周日重新开放,法国内政部长以及超市的管理者都表示,这显示了生命力要比野蛮行为更强大。
该超市1月9日遭受枪手袭击,导致4人死亡,据悉这起事件与法国《查理周刊》杂志社恐怖袭击案有关。
"""
text = Text(blob)
The words property will call the word tokenizer.
In [10]:
text.words
Out[10]:
Since the ICU boundary-break algorithms are language aware, polyglot will detect the language of the text before calling the tokenizer.
In [26]:
print(text.language)
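polyglot also exposes the detection step directly through the Detector class. Here is a minimal sketch reusing the blob defined above:
In [ ]:
from polyglot.detect import Detector

detector = Detector(blob)
print(detector.language)  # the most likely language with its confidence score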
If we are interested in segmenting the text into sentences first, we can query the sentences property.
In [20]:
text.sentences
Out[20]:
The Sentence class inherits from Text; therefore, we can tokenize each sentence into words using the same words property.
In [21]:
first_sentence = text.sentences[0]
first_sentence.words
Out[21]:
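Since every sentence supports the same interface, we can, for instance, loop over all sentences and tokenize each one in turn. The snippet below is a small sketch using the text object from above.
In [ ]:
for sentence in text.sentences:
    # each Sentence tokenizes itself with the same words property
    print(len(sentence.words), sentence.words[:5])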
By default, the tokenize subcommand performs both sentence segmentation and word tokenization.
In [4]:
!polyglot tokenize --help
Each line of the output represents a sentence, with the words separated by spaces.
In [25]:
!polyglot --lang en tokenize --input testdata/cricket.txt
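For comparison, the same one-sentence-per-line output can be reproduced with the Python API shown earlier. This is a sketch assuming the same testdata/cricket.txt file and a Python 3 interpreter:
In [ ]:
from polyglot.text import Text

with open("testdata/cricket.txt") as f:
    doc = Text(f.read())

for sentence in doc.sentences:
    # mirror the CLI: one sentence per line, tokens separated by spaces
    print(u" ".join(sentence.words))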