Machine Translation in Python 3 with NLTK

(C) 2017 by Damir Cavar

Version: 1.0, November 2017

This is a brief introduction to the Machine Translation components in NLTK.

Loading an Aligned Corpus

Import the comtrans corpus reader from nltk.corpus. If the corpus data is not installed yet, download it first with nltk.download("comtrans").


In [1]:
from nltk.corpus import comtrans

We can load a word-level alignment corpus for English and French from the NLTK dataset:


In [2]:
words = comtrans.words("alignment-en-fr.txt")

Print the first 20 words in the corpus, one per line:


In [4]:
for word in words[:20]:
    print(word)
print("...")


Resumption
of
the
session
I
declare
resumed
the
session
of
the
European
Parliament
adjourned
on
Friday
17
December
1999
,
...

Access a word by index in the list:


In [5]:
print(words[0])


Resumption

We can load the aligned sentences. Here we will load just one sentence pair, the first one in the corpus:


In [ ]:
als = comtrans.aligned_sents("alignment-en-fr.txt")[0]
als

print(" ".join(als.words))
print(" ".join(als.mots))

The alignments can be accessed via the alignment property:


In [ ]:
als.alignment

The invert function returns the sentence pair with its two sides swapped and every alignment pair reversed accordingly:


In [ ]:
als.invert()

We can also create alignments directly using the NLTK translate module. We import the translation modules from NLTK:


In [ ]:
from nltk.translate import Alignment, AlignedSent

We can create an alignment example:


In [ ]:
als = AlignedSent(["Reprise", "de", "la", "session"],
                  ["Resumption", "of", "the", "session"],
                  Alignment([(0, 0), (1, 1), (2, 2), (3, 3)]))
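
As a small illustration of how such hand-built objects behave, the sketch below creates a sentence pair whose alignment is not simply the identity and then inverts it. The example sentences, the alignment string, and the variable names are made up for illustration and are not taken from the corpus; Alignment.fromstring parses the "i-j" pair notation.

```python
from nltk.translate import Alignment, AlignedSent

# Hypothetical toy pair; each alignment pair is (index in words, index in mots).
pair = AlignedSent(
    ["je", "ne", "sais", "pas"],
    ["I", "do", "not", "know"],
    Alignment.fromstring("0-0 1-2 2-3 3-2"),
)

print(pair.words)              # the first word list
print(pair.mots)               # the second word list
print(sorted(pair.alignment))  # alignment pairs in sorted order

# invert() swaps the two word lists and flips every index pair.
flipped = pair.invert()
print(flipped.words)
print(sorted(flipped.alignment))
```

Sorting the alignment only makes the printed order stable; the Alignment object itself behaves like a set of index pairs.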

Translating with IBM Model 1 in NLTK

We already imported comtrans from NLTK in the code above. We have to import IBMModel1 from nltk.translate:


In [ ]:
from nltk.translate import IBMModel1

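We can train an IBMModel1 on the first 10 sentence pairs of the aligned corpus, running the learning algorithm (EM) for 100 iterations. Called without a file name, comtrans.aligned_sents() starts with the German-English file, which is why the lookups below use German and English words. See the EM explanation on the slides and the following publications: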

  • Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.

  • Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), 263-311.


In [ ]:
com_ibm1 = IBMModel1(comtrans.aligned_sents()[:10], 100)
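
To make the EM procedure behind this training call more concrete, here is a minimal pure-Python sketch of expectation-maximization for IBM Model 1. It is not NLTK's implementation: the function name train_ibm1_sketch, the toy corpus, and the use of None as the NULL source word are assumptions made for illustration. The table it returns is keyed by (target word, source word) pairs.

```python
from collections import defaultdict


def train_ibm1_sketch(bitext, iterations):
    """Estimate t(target | source) by EM. bitext is a list of (target_words, source_words)."""
    # Prepend None to each source sentence as a stand-in for the NULL word.
    pairs = [(list(tgt), [None] + list(src)) for tgt, src in bitext]
    target_vocab = {tw for tgt, _ in pairs for tw in tgt}

    # Uniform start for every (target, source) pair that co-occurs in a sentence pair.
    uniform = 1.0 / len(target_vocab)
    t = {(tw, sw): uniform for tgt, src in pairs for tw in tgt for sw in src}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word
        for tgt, src in pairs:
            for tw in tgt:
                # E-step: distribute one unit of count for tw over all source
                # words, in proportion to the current probabilities.
                norm = sum(t[tw, sw] for sw in src)
                for sw in src:
                    frac = t[tw, sw] / norm
                    count[tw, sw] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts for each source word.
        for (tw, sw), c in count.items():
            t[tw, sw] = c / total[sw]
    return t


# Hypothetical three-sentence German-English toy corpus.
toy_bitext = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
probs = train_ibm1_sketch(toy_bitext, 20)
for key in [("haus", "house"), ("das", "house"), ("buch", "book")]:
    print(key, round(probs[key], 3))
```

Because "das" also occurs with "the" in both of its sentences, EM gradually moves its probability mass there, which leaves "house" to explain "haus". More iterations sharpen this effect.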

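The translation table is indexed as translation_table[target_word][source_word] and holds the probability of the target word given the source word. Here we look up the probability of German "bitte" given English "Please", rounded to three digits:


In [ ]:
print(round(com_ibm1.translation_table["bitte"]["Please"], 3))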

In [ ]:
print(round(com_ibm1.translation_table["Sitzungsperiode"]["session"], 3))
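
The same kind of lookup can be tried without the comtrans data by training on a few hand-written sentence pairs. The following is a sketch under that assumption: the toy corpus and the variable names are invented for illustration, and the first list passed to AlignedSent is treated as the target side and the second as the source side.

```python
from nltk.translate import AlignedSent, IBMModel1

# Hypothetical toy corpus: German words as target, English mots as source.
toy_corpus = [
    AlignedSent(["das", "haus"], ["the", "house"]),
    AlignedSent(["das", "buch"], ["the", "book"]),
    AlignedSent(["ein", "buch"], ["a", "book"]),
]

# Train IBM Model 1 with 10 EM iterations on the toy corpus.
toy_ibm1 = IBMModel1(toy_corpus, 10)

# translation_table[target_word][source_word]
table = toy_ibm1.translation_table
print(round(table["haus"]["house"], 3))
print(round(table["das"]["house"], 3))
```

With so little data the exact values are not meaningful; the point is only that "haus" ends up more probable than "das" as a translation of "house".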