Polyglot offers trained Morfessor models to generate morphemes from words. The goal of the Morpho project is to develop unsupervised, data-driven methods that discover the regularities behind word formation in natural languages. In particular, the Morpho project focuses on the discovery of morphemes, which are the primitive units of syntax, the smallest individually meaningful elements in the utterances of a language. Morphemes are important in the automatic generation and recognition of a language, especially in languages in which words may have many different inflected forms.
Using polyglot vocabulary dictionaries, we trained Morfessor models on the 50,000 most frequent words of each language.
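The selection step described above can be sketched in a few lines. This is only an illustration of picking the most frequent words from a token stream; polyglot's vocabulary dictionaries are precomputed, and the function name `most_frequent_words` and the sample sentence are assumptions made for this example.

```python
from collections import Counter


def most_frequent_words(tokens, limit=50000):
    """Return up to `limit` distinct words, most frequent first.

    Hypothetical helper for illustration; not part of polyglot.
    """
    counts = Counter(tokens)
    # most_common sorts by descending count
    return [word for word, _ in counts.most_common(limit)]


# Tiny usage example with a made-up corpus
sample_tokens = "the cat and the dog and the bird".split()
print(most_frequent_words(sample_tokens, limit=2))
```

A word list produced this way would then serve as training data for a segmentation model such as Morfessor.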
In [1]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))
In [7]:
%%bash
polyglot download morph2.en morph2.ar
In [15]:
from polyglot.text import Text, Word
In [20]:
words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
    w = Word(w, language="en")
    print("{:<20}{}".format(w, w.morphemes))
If the text is not tokenized properly, morphological analysis can offer a smart way of splitting the text into its original units. Here is an example:
In [16]:
blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"
In [17]:
text.morphemes
Out[17]:
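To give a rough intuition for how a segmenter can recover units from unspaced text, here is a minimal sketch of a Viterbi (minimum-cost) search over a toy lexicon. It is not polyglot's or Morfessor's implementation: the costs in `TOY_COSTS`, the `UNKNOWN_CHAR_COST` penalty, and the `segment` function are all made up for this example, whereas a trained model learns its units and their costs from data.

```python
import math

# Toy unit costs (think of them as negative log-probabilities).
# Hypothetical values, not taken from any polyglot or Morfessor model.
TOY_COSTS = {
    "we": 2.0, "will": 2.5, "meet": 3.0,
    "to": 2.0, "day": 3.0, "today": 3.5, ".": 1.0,
}
UNKNOWN_CHAR_COST = 10.0  # penalty for a single character outside the lexicon


def segment(text, costs=TOY_COSTS, max_len=10):
    """Split `text` into units with the lowest total cost (Viterbi search)."""
    n = len(text)
    best = [0.0] + [math.inf] * n  # best[i]: cheapest cost of text[:i]
    back = [0] * (n + 1)           # back[i]: start of the last unit in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in costs:
                cost = costs[piece]
            elif len(piece) == 1:
                cost = UNKNOWN_CHAR_COST  # fall back to a single character
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the units.
    units, end = [], n
    while end > 0:
        units.append(text[back[end]:end])
        end = back[end]
    return units[::-1]


print(segment("Wewillmeettoday.".lower()))
```

With these toy costs the search prefers the single unit "today" over "to" + "day" because its total cost is lower; a real model makes the same kind of trade-off with learned costs.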
In [9]:
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en morph | tail -n 30
This demo does not reflect the models supplied by polyglot; however, we think it is indicative of what you should expect from Morfessor.
This is an interface to the implementation described in the technical report Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline.
@InProceedings{morfessor2,
  title = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
  author = {Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko},
  year = {2013},
  publisher = {Department of Signal Processing and Acoustics, Aalto University},
  booktitle = {Aalto University publication series}
}