(C) 2017 by Damir Cavar
Version: 1.0, November 2017
This is a brief introduction to the Machine Translation components in NLTK.
Import the comtrans module from nltk.corpus.
In [1]:
from nltk.corpus import comtrans
We can load a word-level alignment corpus for English and French from the NLTK dataset:
In [2]:
words = comtrans.words("alignment-en-fr.txt")
Print the first 20 words of the corpus, one per line:
In [4]:
for word in words[:20]:
    print(word)
print("...")
Access a word by index in the list:
In [5]:
print(words[0])
We can also load the aligned sentences. Here we load just one sentence, the first one in the corpus:
In [ ]:
als = comtrans.aligned_sents("alignment-en-fr.txt")[0]
als
print(" ".join(als.words))
print(" ".join(als.mots))
The alignments can be accessed via the alignment property:
In [ ]:
als.alignment
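Each pair (i, j) in the alignment links the word at position i in als.words to the word at position j in als.mots. The following is a minimal sketch in plain Python, without calling NLTK, that resolves such index pairs to word pairs. The sentences, the alignment, and the helper name aligned_word_pairs are made up for illustration and are not taken from the corpus.

```python
# Made-up example data (not from comtrans): an English sentence,
# a French sentence, and index pairs (position in words, position in mots).
example_words = ["the", "black", "cat"]
example_mots = ["le", "chat", "noir"]
example_alignment = [(0, 0), (1, 2), (2, 1)]


def aligned_word_pairs(words, mots, alignment):
    """Hypothetical helper: turn index pairs into (word, mot) pairs."""
    return [(words[i], mots[j]) for i, j in sorted(alignment)]


for word, mot in aligned_word_pairs(example_words, example_mots, example_alignment):
    print(word, "<->", mot)
```

The crossing pairs (1, 2) and (2, 1) show why the indices are needed: word order differs between the two languages, so position alone does not identify the translation.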
The invert function returns a new aligned sentence with the two sides swapped and every alignment pair reversed:
In [ ]:
als.invert()
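Conceptually, inverting an alignment flips each index pair, so that a link from position i on one side to position j on the other becomes a link from j to i. Below is a rough plain-Python sketch of that idea, assuming an alignment is represented as a list of (i, j) tuples. The helper invert_alignment and the pairs are hypothetical and only illustrate the idea; they are not NLTK code.

```python
def invert_alignment(alignment):
    """Hypothetical helper: flip every (i, j) pair to (j, i)."""
    return sorted((j, i) for i, j in alignment)


# Made-up alignment: a three-word sentence aligned to a four-word one,
# where the word at position 1 is linked to two words on the other side.
forward = [(0, 0), (1, 1), (1, 2), (2, 3)]
backward = invert_alignment(forward)

print(backward)
# Inverting twice gives back the original pairs.
print(invert_alignment(backward) == forward)
```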
We can also create alignments directly using the NLTK translate module. We import the translation modules from NLTK:
In [ ]:
from nltk.translate import Alignment, AlignedSent
We can create an alignment example:
In [ ]:
als = AlignedSent(["Reprise", "de", "la", "session"],
                  ["Resumption", "of", "the", "session"],
                  Alignment([(0, 0), (1, 1), (2, 2), (3, 3)]))
We already imported comtrans above. We still need to import IBMModel1 from nltk.translate:
In [ ]:
from nltk.translate import IBMModel1
We can train an IBMModel1 on the first 10 sentence pairs of the aligned corpus, running the learning algorithm for 100 iterations; see the EM explanation on the slides and the following publications:
Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), 263-311.
In [ ]:
com_ibm1 = IBMModel1(comtrans.aligned_sents()[:10], 100)
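To make the EM procedure behind IBM Model 1 more concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation. The function name train_ibm1_sketch, the use of None as a NULL source token, and the three-sentence toy corpus are assumptions made for illustration. Each sentence pair is given as (target words, source words), and the table stores the probability of a target word given a source word.

```python
from collections import defaultdict


def train_ibm1_sketch(sentence_pairs, iterations=20):
    """Hypothetical helper: estimate t(target | source) with EM."""
    # Prepend a NULL token (None) to every source sentence.
    data = [(list(tgt), [None] + list(src)) for tgt, src in sentence_pairs]

    # Start from a uniform distribution over the target vocabulary.
    target_vocab = {t for tgt, _ in data for t in tgt}
    prob = {}
    for tgt, src in data:
        for t in tgt:
            for s in src:
                prob[(t, s)] = 1.0 / len(target_vocab)

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (target, source)
        total = defaultdict(float)  # expected counts per source word

        # E-step: distribute each target word over the source words
        # of its sentence, in proportion to the current probabilities.
        for tgt, src in data:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    delta = prob[(t, s)] / norm
                    count[(t, s)] += delta
                    total[s] += delta

        # M-step: renormalize the expected counts per source word.
        for (t, s), c in count.items():
            prob[(t, s)] = c / total[s]

    return prob


# Toy corpus (made up, not from comtrans): (German target, English source).
toy_corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

toy_table = train_ibm1_sketch(toy_corpus, iterations=20)
print(round(toy_table[("Haus", "house")], 3))
print(round(toy_table[("das", "house")], 3))
```

Although "house" co-occurs equally often with "das" and "Haus", the model ends up preferring "Haus", because "das" is better explained by "the", which appears with it in two sentences.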
In [ ]:
print(round(com_ibm1.translation_table["bitte"]["Please"], 3))
In [ ]:
print(round(com_ibm1.translation_table["Sitzungsperiode"]["session"], 3))