Transliteration is the conversion of a text from one script to another. For instance, a Latin transliteration of the Greek phrase "Ελληνική Δημοκρατία", usually translated as 'Hellenic Republic', is "Ellēnikḗ Dēmokratía".
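To make the definition concrete, here is a minimal sketch of character-level transliteration in plain Python. It is only an illustration, not how polyglot works internally: the `GREEK_TO_LATIN` table and the `transliterate_greek` function are hypothetical names invented for this example, the table covers only the letters in the phrase above, and accents are dropped rather than preserved.

```python
import unicodedata

# Hypothetical, deliberately incomplete mapping: only the letters needed
# for the example phrase. Not part of polyglot.
GREEK_TO_LATIN = {
    "α": "a", "δ": "d", "ε": "e", "η": "ē", "ι": "i", "κ": "k",
    "λ": "l", "μ": "m", "ν": "n", "ο": "o", "ρ": "r", "τ": "t",
}


def transliterate_greek(text):
    """Map each Greek letter to a Latin counterpart, dropping accents."""
    # Decompose accented letters so the accent becomes a separate
    # combining mark that can be filtered out.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # skip the accent mark itself
        latin = GREEK_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)  # pass through spaces and unmapped characters
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    # Recompose so the result uses precomposed characters.
    return unicodedata.normalize("NFC", "".join(out))


print(transliterate_greek("Ελληνική Δημοκρατία"))
```

A fixed table like this only works for a single script pair with a near one-to-one letter correspondence. The rest of this page uses polyglot's downloadable models instead.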
In [1]:
from polyglot.transliteration import Transliterator
In [2]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("transliteration2"))
In [3]:
%%bash
polyglot download embeddings2.en transliteration2.ar
We transliterate each word in the text into the script of the target language.
In [7]:
from polyglot.text import Text
In [8]:
blob = """We will meet at eight o'clock on Thursday morning."""
text = Text(blob)
We can iterate over the transliteration of each word, here with Arabic ("ar") as the target language.
In [9]:
for x in text.transliterate("ar"):
    print(x)
In [20]:
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en transliteration --target ar | tail -n 30
This work is a direct implementation of the research described in the paper "False-Friend Detection and Entity Matching via Unsupervised Transliteration". The authors of this library strongly encourage you to cite the following paper if you use this software.
@article{chen2016false,
title = {False-Friend Detection and Entity Matching via Unsupervised Transliteration},
author = {Chen, Yanqing and Skiena, Steven},
journal = {arXiv preprint arXiv:1611.06722},
year = {2016}
}