In [1]:
from polyglot.detect import Detector
In [2]:
arabic_text = u"""
أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن " القوات الامنية تتوقف لليوم
الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب
انتشار قناصي التنظيم الذي يطلق على نفسه اسم "الدولة الاسلامية" والعبوات الناسفة
والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية ".
"""
In [3]:
detector = Detector(arabic_text)
print(detector.language)
In [4]:
mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
"""
If the text contains snippets from different languages, the detector is able to find the most probable languages used in the text. For each language, we can query the model's confidence level:
In [5]:
for language in Detector(mixed_text).languages:
    print(language)
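Beyond the printed summary, each detected language object also carries its fields as attributes. The sketch below assumes the `Language` objects expose `code` and `confidence` attributes, as in the polyglot versions this tutorial targets, and falls back gracefully when polyglot is not installed:

```python
# Minimal sketch: collecting per-language codes and confidences.
# Assumes Language objects have `code` and `confidence` attributes.
try:
    from polyglot.detect import Detector

    results = [
        (language.code, language.confidence)
        for language in Detector("Hello, world! Bonjour le monde!", quiet=True).languages
    ]
except ImportError:  # polyglot (or its cld2 backend) not installed
    results = []

for code, confidence in results:
    print(code, confidence)
```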
To take a closer look, we can inspect the text line by line; notice that the detection confidence drops for the first line:
In [14]:
for line in mixed_text.strip().splitlines():
    print(line + u"\n")
    for language in Detector(line).languages:
        print(language)
    print("\n")
In [7]:
detector = Detector("pizza")
print(detector)
If the detection is not reliable even under the best-effort strategy, an UnknownLanguage exception will be thrown.
In [9]:
print(Detector("4"))
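If you would rather handle the failure explicitly than silence it, the exception can be caught. In this sketch, the import path `polyglot.detect.base.UnknownLanguage` is an assumption worth checking against your installed version, and `"un"` is a hypothetical fallback code:

```python
# Sketch: catching an unreliable detection instead of letting it propagate.
# The UnknownLanguage import path is assumed; verify it for your version.
try:
    from polyglot.detect import Detector
    from polyglot.detect.base import UnknownLanguage

    try:
        language_code = Detector("4").language.code
    except UnknownLanguage:
        language_code = "un"  # hypothetical fallback meaning "unknown"
except ImportError:  # polyglot not installed
    language_code = None

print(language_code)
```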
Such an exception may not be desirable, especially for trivial cases such as short strings whose characters could belong to many languages. In these cases, we can silence the exception by setting quiet to True:
In [10]:
print(Detector("4", quiet=True))
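When detection is silenced this way, it can still be useful to check programmatically whether the result is trustworthy. This sketch assumes the Detector exposes a boolean `reliable` attribute, which is an assumption to verify against your installed polyglot version:

```python
# Sketch: checking detection reliability without raising an exception.
# The `reliable` attribute on Detector is an assumption; verify it exists.
try:
    from polyglot.detect import Detector

    detector = Detector("4", quiet=True)
    is_reliable = detector.reliable
except ImportError:  # polyglot not installed
    is_reliable = None

print(is_reliable)
```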
In [11]:
!polyglot detect --help
The detect subcommand tries to identify the language code for each line in a text file.
This can be convenient if each line represents a document or a sentence, such as the output of a tokenizer.
In [12]:
!polyglot detect --input testdata/cricket.txt
In [13]:
from polyglot.utils import pretty_list
print(pretty_list(Detector.supported_languages()))