Morphological Analysis

Polyglot offers trained morfessor models to generate morphemes from words. The goal of the Morpho project is to develop unsupervised data-driven methods that discover the regularities behind word forming in natural languages. In particular, Morpho project is focussing on the discovery of morphemes, which are the primitive units of syntax, the smallest individually meaningful elements in the utterances of a language. Morphemes are important in automatic generation and recognition of a language, especially in languages in which words may have many different inflected forms.

Languages Coverage

Using polyglot vocabulary dictionaries, we trained morfessor models on the most frequent words 50,000 words of each language.


In [1]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))


  1. Piedmontese language       2. Lombard language           3. Gan Chinese              
  4. Sicilian                   5. Scots                      6. Kirghiz, Kyrgyz          
  7. Pashto, Pushto             8. Kurdish                    9. Portuguese               
 10. Kannada                   11. Korean                    12. Khmer                    
 13. Kazakh                    14. Ilokano                   15. Polish                   
 16. Panjabi, Punjabi          17. Georgian                  18. Chuvash                  
 19. Alemannic                 20. Czech                     21. Welsh                    
 22. Chechen                   23. Catalan; Valencian        24. Northern Sami            
 25. Sanskrit (Saṁskṛta)       26. Slovene                   27. Javanese                 
 28. Slovak                    29. Bosnian-Croatian-Serbian  30. Bavarian                 
 31. Swedish                   32. Swahili                   33. Sundanese                
 34. Serbian                   35. Albanian                  36. Japanese                 
 37. Western Frisian           38. French                    39. Finnish                  
 40. Upper Sorbian             41. Faroese                   42. Persian                  
 43. Sinhala, Sinhalese        44. Italian                   45. Amharic                  
 46. Aragonese                 47. Volapük                   48. Icelandic                
 49. Sakha                     50. Afrikaans                 51. Indonesian               
 52. Interlingua               53. Azerbaijani               54. Ido                      
 55. Arabic                    56. Assamese                  57. Yoruba                   
 58. Yiddish                   59. Waray-Waray               60. Croatian                 
 61. Hungarian                 62. Haitian; Haitian Creole   63. Quechua                  
 64. Armenian                  65. Hebrew (modern)           66. Silesian                 
 67. Hindi                     68. Divehi; Dhivehi; Mald...  69. German                   
 70. Danish                    71. Occitan                   72. Tagalog                  
 73. Turkmen                   74. Thai                      75. Tajik                    
 76. Greek, Modern             77. Telugu                    78. Tamil                    
 79. Oriya                     80. Ossetian, Ossetic         81. Tatar                    
 82. Turkish                   83. Kapampangan               84. Venetian                 
 85. Manx                      86. Gujarati                  87. Galician                 
 88. Irish                     89. Scottish Gaelic; Gaelic   90. Nepali                   
 91. Cebuano                   92. Zazaki                    93. Walloon                  
 94. Dutch                     95. Norwegian                 96. Norwegian Nynorsk        
 97. West Flemish              98. Chinese                   99. Bosnian                  
100. Breton                   101. Belarusian               102. Bulgarian                
103. Bashkir                  104. Egyptian Arabic          105. Tibetan Standard, Tib... 
106. Bengali                  107. Burmese                  108. Romansh                  
109. Marathi (Marāṭhī)        110. Malay                    111. Maltese                  
112. Russian                  113. Macedonian               114. Malayalam                
115. Mongolian                116. Malagasy                 117. Vietnamese               
118. Spanish; Castilian       119. Estonian                 120. Basque                   
121. Bishnupriya Manipuri     122. Asturian                 123. English                  
124. Esperanto                125. Luxembourgish, Letzeb... 126. Latin                    
127. Uighur, Uyghur           128. Ukrainian                129. Limburgish, Limburgan... 
130. Latvian                  131. Urdu                     132. Lithuanian               
133. Fiji Hindi               134. Uzbek                    135. Romanian, Moldavian, ... 

Download Necessary Models


In [7]:
%%bash
polyglot download morph2.en morph2.ar


[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.ar is already up-to-date!

Example

Word Segmentation


In [15]:
from polyglot.text import Text, Word

In [20]:
words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
  w = Word(w, language="en")
  print("{:<20}{}".format(w, w.morphemes))


preprocessing       ['pre', 'process', 'ing']
processor           ['process', 'or']
invaluable          ['in', 'valuable']
thankful            ['thank', 'ful']
crossed             ['cross', 'ed']

Sentence Segmentation

If the text is not tokenized properly, morphological analysis could offer a smart of way of splitting the text into its original units. Here, is an example:


In [16]:
blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"

In [17]:
text.morphemes


Out[17]:
WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])

Command Line Interface


In [9]:
!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en morph | tail -n 30


which           which
India           In_dia
beat            beat 
Bermuda         Ber_mud_a
in              in   
Port            Port 
of              of   
Spain           Spa_in
in              in   
2007            2007 
,               ,    
which           which
was             wa_s 
equalled        equal_led
five            five 
days            day_s
ago             ago  
by              by   
South           South
Africa          Africa
in              in   
their           t_heir
victory         victor_y
over            over 
West            West 
Indies          In_dies
in              in   
Sydney          Syd_ney
.               .    

Demo

This demo does not reflect the models supplied by polyglot, however, we think it is indicative of what you should expect from morfessor

Demo

Citation

This is an interface to the implementation being described in the Morfessor2.0: Python Implementation and Extensions for Morfessor Baseline technical report.

@InProceedings{morfessor2,
               title:{Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
               author:  {Virpioja, Sami ; Smit, Peter ; Grönroos, Stig-Arne ; Kurimo, Mikko},
               year: {2013},
               publisher: {Department of Signal Processing and Acoustics, Aalto University},
               booktitle:{Aalto University publication series}
}