In [1]:
from polyglot.detect import Detector
In [2]:
arabic_text = u"""
أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن " القوات الامنية تتوقف لليوم
الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب
انتشار قناصي التنظيم الذي يطلق على نفسه اسم "الدولة الاسلامية" والعبوات الناسفة
والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية ".
"""
In [3]:
detector = Detector(arabic_text)
print(detector.language)
In [4]:
mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
"""
If the text contains snippets from different languages, the detector is able to find the most probable languages used in the text. For each language, we can query the model's confidence level:
In [5]:
for language in Detector(mixed_text).languages:
    print(language)
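Beyond the printed summary, each detected language object also carries its fields as attributes. The sketch below assumes the `Language` objects expose `code` and `confidence` attributes, as in the polyglot versions this tutorial targets, and falls back gracefully when polyglot is not installed:

```python
# Minimal sketch: collecting per-language codes and confidences.
# Assumes Language objects have `code` and `confidence` attributes.
try:
    from polyglot.detect import Detector

    results = [
        (language.code, language.confidence)
        for language in Detector("Hello, world! Bonjour le monde!", quiet=True).languages
    ]
except ImportError:  # polyglot (or its cld2 backend) not installed
    results = []

for code, confidence in results:
    print(code, confidence)
```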
To take a closer look, we can inspect the text line by line; notice that the detection confidence drops for the first line:
In [14]:
for line in mixed_text.strip().splitlines():
    print(line + u"\n")
    for language in Detector(line).languages:
        print(language)
    print("\n")
In [7]:
detector = Detector("pizza")
print(detector)
If the detection is not reliable even under the best-effort strategy, an UnknownLanguage exception will be thrown.
In [9]:
print(Detector("4"))
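If you would rather handle the failure explicitly than silence it, the exception can be caught. In this sketch, the import path `polyglot.detect.base.UnknownLanguage` is an assumption worth checking against your installed version, and `"un"` is a hypothetical fallback code:

```python
# Sketch: catching an unreliable detection instead of letting it propagate.
# The UnknownLanguage import path is assumed; verify it for your version.
try:
    from polyglot.detect import Detector
    from polyglot.detect.base import UnknownLanguage

    try:
        language_code = Detector("4").language.code
    except UnknownLanguage:
        language_code = "un"  # hypothetical fallback meaning "unknown"
except ImportError:  # polyglot not installed
    language_code = None

print(language_code)
```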
Such an exception may not be desirable, especially for trivial cases such as short strings whose characters could belong to many languages. In these cases, we can silence the exception by setting quiet to True:
In [10]:
print(Detector("4", quiet=True))
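When detection is silenced this way, it can still be useful to check programmatically whether the result is trustworthy. This sketch assumes the Detector exposes a boolean `reliable` attribute, which is an assumption to verify against your installed polyglot version:

```python
# Sketch: checking detection reliability without raising an exception.
# The `reliable` attribute on Detector is an assumption; verify it exists.
try:
    from polyglot.detect import Detector

    detector = Detector("4", quiet=True)
    is_reliable = detector.reliable
except ImportError:  # polyglot not installed
    is_reliable = None

print(is_reliable)
```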
In [11]:
!polyglot detect --help
The detect subcommand tries to identify the language code for each line in a text file.
This can be convenient if each line represents a document or a sentence, such as the output of a tokenizer.
In [12]:
!polyglot detect --input testdata/cricket.txt
In [13]:
from polyglot.utils import pretty_list
print(pretty_list(Detector.supported_languages()))