In this lab, we will be using the Python Natural Language Toolkit (nltk) again to get to know the IBM models better. There are proper, open-source MT systems out there (such as Apertium and Moses); however, getting to know them would require more than 90 minutes.
For today's exercises, you will need the docker image again. Provided you have already downloaded it last time, you can start it by:
- docker ps -a: lists all the containers you have created. Pick the one you used last time (with any luck, there is only one)
- docker start <container id>
- docker exec -it <container id> bash

When that's done, update your git repository:
cd /nlp/python_nlp_2017_fall/
git pull
If git pull returns with errors, it is most likely because some of your files have changed (most likely the morphology or syntax notebooks, which you worked on in the previous labs). You can check this with git status. If the culprit is the file A.ipynb, you can resolve the problem like so:
cp A.ipynb A_mine.ipynb
git checkout A.ipynb
After that, git pull should work.
And start the notebook:
jupyter notebook --port=8888 --ip=0.0.0.0 --no-browser --allow-root
If you started the notebook but cannot access it in your browser, make sure Jupyter is not also running on the host system; if it is, stop it.
In [ ]:
import os
import shutil
import urllib.request
import nltk
def download_file(url, directory=''):
    real_dir = os.path.realpath(directory)
    if not os.path.isdir(real_dir):
        os.makedirs(real_dir)
    file_name = url.rsplit('/', 1)[-1]
    real_file = os.path.join(real_dir, file_name)
    if not os.path.isfile(real_file):
        with urllib.request.urlopen(url) as inf:
            with open(real_file, 'wb') as outf:
                shutil.copyfileobj(inf, outf)
We download and preprocess a subset of the Hunglish corpus. It consists of English-Hungarian translation pairs extracted from open-source software documentation. The sentences are already aligned, but there are no word alignments.
In [ ]:
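A minimal sketch of how the helper above could be used; the URL and the directory name below are placeholders, not the real location of the corpus subset:

# Hypothetical URL and target directory -- substitute the actual ones
download_file('http://example.com/hunglish_subset/docs.en-hu.txt', 'hunglish')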
Read the whole corpus (all files). Try not to read it all into memory. Write a function that reads the files and yields the translation (sentence) pairs one by one.
Note:
- The files are in iso-8859-2 (a.k.a. Latin-2) encoding.
In [ ]:
def read_files(directory=''):
    pass
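For orientation, one possible shape of such a generator is sketched below. The directory layout and the one-tab-separated-pair-per-line format are assumptions; adapt them to how the downloaded files actually look. The Latin-2 encoding is the one point to get right:

def read_files(directory=''):
    """Yields the sentence pairs of the corpus one by one (a sketch)."""
    real_dir = os.path.realpath(directory)
    for file_name in sorted(os.listdir(real_dir)):
        # the files are in iso-8859-2 (Latin-2) encoding
        with open(os.path.join(real_dir, file_name), encoding='iso-8859-2') as f:
            for line in f:
                # assumption: one tab-separated sentence pair per line
                fields = line.rstrip('\n').split('\t')
                if len(fields) == 2:
                    yield fields[0], fields[1]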
The text is not tokenized. Use nltk's word_tokenize() function to tokenize the snippets. Also, lowercase them. You can do this in read_files() above if you wish, or in the code you write for 1.4 below.
Note:
- The tokenizer model (punkt) is not installed by default. You have to download() it.
- Lowercasing everything is a crude solution. It would be more accurate to sent_tokenize() first and then just lowercase the word at the beginning of each sentence, or even better, tag the snippets for NER. However, we have neither the time nor the resources (models) to do that now.
In [ ]:
from nltk.tokenize import sent_tokenize, word_tokenize
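A minimal sketch of the tokenization step; the helper name tokenize is ours, not part of the exercise skeleton:

import nltk

nltk.download('punkt')  # the punkt tokenizer models are not installed by default

def tokenize(snippet):
    """Tokenizes and lowercases a text snippet."""
    return [token.lower() for token in word_tokenize(snippet)]

tokenize('Please press OK to continue.')
# ['please', 'press', 'ok', 'to', 'continue', '.']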
The models we are going to try expect a list of nltk.translate.api.AlignedSent objects. Create a bitext variable that is a list of AlignedSent objects created from the preprocessed, tokenized corpus.
Note that AlignedSent also allows you to specify an alignment between the words in the two texts. Unfortunately (but not unexpectedly), the corpus doesn't have this information.
In [ ]:
from nltk.translate.api import AlignedSent
bitext = [] # Your code here
assert len(bitext) == 135439
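For illustration, assuming read_files() yields pairs of already tokenized, lowercased sentences (the exact interface of your reader may differ, and the 'hunglish' directory name is a placeholder), the bitext could be assembled like this:

bitext = [AlignedSent(src_tokens, tgt_tokens)
          for src_tokens, tgt_tokens in read_files('hunglish')]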
In [ ]:
from nltk.translate import IBMModel1
ibm1 = IBMModel1(bitext, 5)
While the model doesn't have a translate() function, it does provide a way to compute the translation probability $P(F|E)$ with some additional code. That additional work is what you have to put in.
Remember that the formula for the translation probability is $P(F|E) = \sum_A P(F,A|E)$. Computing $P(F,A|E)$ is a bit hairy; luckily, IBMModel1 has a method that calculates at least part of it: prob_t_a_given_s(), which is in fact only $P(F|A,E)$. This function accepts an AlignmentInfo object that contains the source and target sentences, as well as the alignment between them.
Unfortunately, AlignmentInfo's representation of an alignment is completely different from the Alignment object's. Your first task is to do the conversion from the latter to the former. Given the example pair John loves Mary / De szereti János Marcsit,
- An Alignment is basically a list of source-target, 0-based index pairs: [(0, 2), (1, 1), (2, 3)].
- The alignment in AlignmentInfo objects is a tuple (!), where the ith position is the index of the target word that is aligned to the ith source word, or 0, if the ith source word is unaligned. Indices are 1-based, because the 0th word is NULL on both sides (see lecture page 35, slide 82). The tuple you return must also contain the alignment for this NULL word, which is not aligned with the NULL on the other side - in other words, the returned tuple starts with a 0. Example: (0, 3, 2, 4).
- If multiple target words are aligned with the same source word, you are free to use the index of any of them.
In [ ]:
from nltk.translate.ibm_model import AlignmentInfo
def alignment_to_info(alignment):
    """Converts from an Alignment object to the alignment format required by AlignmentInfo."""
    pass
assert alignment_to_info([(0, 2), (1, 1), (2, 3)]) == (0, 3, 2, 4)
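One possible implementation sketch; it also maps unaligned source words (represented by None in the Alignment) to 0:

def alignment_to_info(alignment):
    """Converts an Alignment (0-based index pairs) to the 1-based tuple
    format expected by AlignmentInfo."""
    mapping = dict(alignment)  # source index -> target index (or None)
    converted = [0]            # position 0 belongs to the NULL word
    for src_index in range(len(mapping)):
        tgt_index = mapping.get(src_index)
        converted.append(0 if tgt_index is None else tgt_index + 1)
    return tuple(converted)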
Your task is to write a function that, given a source and a target sentence and an alignment, creates an AlignmentInfo object and calls the model's prob_t_a_given_s() with it. The code here (test_prob_t_a_given_s()) might give you some clue as to how to construct the object.
Since prob_t_a_given_s() only computes $P(F|A,E)$, you have to add the $P(A|E)$ component yourself. See page 38, slide 95 and page 39, slide 100 in the lecture. What are $J$ and $K$ in the inverse setup?
Important: "interestingly", prob_t_a_given_s() translates from target to source. However, you still want to translate from source to target, so take care when filling the fields of the AlignmentInfo object.
Also note:
- prob_f_a_e() expects the alignment already in the converted (AlignmentInfo) format. Don't bother converting it for now!
In [ ]:
def prob_f_a_e(model, src_sentence, tgt_sentence, alig_in_tuple_format):
    pass
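One way to put this together is sketched below. The field swap follows the note above; the $1/(K+1)^J$ factor is our reading of the lecture formula in the inverse setup (with $K$ the target and $J$ the source sentence length here), so treat it as an assumption to check against the test cases:

def prob_f_a_e(model, src_sentence, tgt_sentence, alig_in_tuple_format):
    # prob_t_a_given_s() generates the model's target side from its source
    # side, which is the reverse of our naming here -- hence the swapped fields
    alignment_info = AlignmentInfo(
        tuple(alig_in_tuple_format),
        (None,) + tuple(tgt_sentence),      # the model's source, with NULL at index 0
        ('UNUSED',) + tuple(src_sentence),  # the model's target; the 0th word is never used
        None,
    )
    p_f_given_a_e = model.prob_t_a_given_s(alignment_info)
    # P(A|E) = 1 / (K+1)^J, with K = len(tgt_sentence) and J = len(src_sentence)
    p_a_given_e = 1.0 / (len(tgt_sentence) + 1) ** len(src_sentence)
    return p_f_given_a_e * p_a_given_e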
Write a function that, given an AlignedSent object, computes $P(F,A|E)$. Since IBMModel1 aligns the sentences of the training set with the most probable alignment, this function will effectively compute $P(F,A_{best}|E)$.
Don't forget to convert the alignment with the function you wrote in Exercise 2.1. before passing it to prob_f_a_e().
In [ ]:
def prob_best_a(model, aligned_sent):
    pass
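A possible sketch, reusing the conversion function from Exercise 2.1 (the IBM model stored its best alignment on each training AlignedSent):

def prob_best_a(model, aligned_sent):
    """Computes P(F, A_best|E) for a sentence pair aligned during training."""
    alig_tuple = alignment_to_info(aligned_sent.alignment)
    return prob_f_a_e(model, aligned_sent.words, aligned_sent.mots, alig_tuple)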
Write a function that, given an AlignedSent object, computes $P(F|E)$. It should enumerate all possible alignments (in the tuple format) and call the function you wrote in Exercise 2.2 with them.
Note: the itertools.product function can be very useful in enumerating the alignments.
In [ ]:
def prob_f_e(model, aligned_sent):
    pass
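A sketch of the enumeration with itertools.product; every source word may be aligned to NULL (0) or to any target position:

from itertools import product

def prob_f_e(model, aligned_sent):
    """Computes P(F|E) by summing P(F, A|E) over all possible alignments."""
    src_len = len(aligned_sent.words)
    tgt_len = len(aligned_sent.mots)
    total = 0.0
    for alignment in product(range(tgt_len + 1), repeat=src_len):
        total += prob_f_a_e(model, aligned_sent.words, aligned_sent.mots,
                            (0,) + alignment)
    return total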
Test cases for Exercises 2.3 – 2.5.
In [ ]:
import numpy
testext = [
AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']),
AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']),
AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']),
AlignedSent(['das', 'haus'], ['the', 'house']),
AlignedSent(['das', 'buch'], ['the', 'book']),
AlignedSent(['ein', 'buch'], ['a', 'book'])
]
ibm2 = IBMModel1(testext, 5)
# Tests for Exercise 2.3
assert numpy.allclose(prob_f_a_e(ibm2, ['ein', 'buch'], ['a', 'book'], (0, 1, 2)), 0.08283000979778607)
assert numpy.allclose(prob_f_a_e(ibm2, ['ein', 'buch'], ['a', 'book'], (0, 2, 1)), 0.0002015158225914316)
# Tests for Exercise 2.4
assert numpy.allclose(prob_best_a(ibm2, testext[4]), 0.059443309368677)
assert numpy.allclose(prob_best_a(ibm2, testext[2]), 1.3593610057711997e-05)
# Tests for Exercise 2.5
assert numpy.allclose(prob_f_e(ibm2, testext[4]), 0.13718805082588842)
assert numpy.allclose(prob_f_e(ibm2, testext[2]), 0.0001809283308942621)
NLTK also has some functions related to phrase-based translation, but these are far from a finished system. The components are scattered across two packages:
- a phrase_extraction() function that can extract phrases from parallel text, based on an alignment
- a StackDecoder object, which can be used to translate sentences based on a phrase table and a language model

If you are wondering where the rest of the training functionality is, you have spotted the problem: unfortunately, the part that assembles the phrase table from the extracted phrases is missing. Also missing are the classes that represent and compute a language model. So in the code block below, we only run the decoder on an example sentence with a "hand-crafted" model.
Note: This is the same code as in the documentation of the decoder (above).
In [ ]:
from collections import defaultdict
from math import log
from nltk.translate import PhraseTable
from nltk.translate.stack_decoder import StackDecoder
# The (probabilistic) phrase table
phrase_table = PhraseTable()
phrase_table.add(('niemand',), ('nobody',), log(0.8))
phrase_table.add(('niemand',), ('no', 'one'), log(0.2))
phrase_table.add(('erwartet',), ('expects',), log(0.8))
phrase_table.add(('erwartet',), ('expecting',), log(0.2))
phrase_table.add(('niemand', 'erwartet'), ('one', 'does', 'not', 'expect'), log(0.1))
phrase_table.add(('die', 'spanische', 'inquisition'), ('the', 'spanish', 'inquisition'), log(0.8))
phrase_table.add(('!',), ('!',), log(0.8))
# The "language model"
language_prob = defaultdict(lambda: -999.0)
language_prob[('nobody',)] = log(0.5)
language_prob[('expects',)] = log(0.4)
language_prob[('the', 'spanish', 'inquisition')] = log(0.2)
language_prob[('!',)] = log(0.1)
# Note: type() with three parameters creates a new type object
language_model = type('',(object,), {'probability_change': lambda self, context, phrase: language_prob[phrase],
'probability': lambda self, phrase: language_prob[phrase]})()
stack_decoder = StackDecoder(phrase_table, language_model)
stack_decoder.translate(['niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!'])
Run through the parallel corpus (already aligned by an IBM model) and extract all phrases from it. You can limit the length of the phrases you consider to 2 (3, ...) words, but you have to do it manually, because the max_phrase_length argument of phrase_extraction() doesn't work. Once you have all the phrases, create a phrase table similar to the one above. Don't forget that the decoder expects log probabilities. One possible approach is sketched below.
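A sketch of one possible approach, assuming bitext holds AlignedSent objects whose alignments were filled in by IBM model training; the maximum phrase length of 2 and the relative-frequency probability estimate are our choices, not prescribed by the exercise:

from collections import defaultdict
from math import log

from nltk.translate import PhraseTable
from nltk.translate.phrase_based import phrase_extraction

def build_phrase_table(aligned_corpus, max_len=2):
    pair_counts = defaultdict(int)
    src_counts = defaultdict(int)
    for asent in aligned_corpus:
        # keep only real alignment points (drop words aligned to NULL)
        alignment = [(i, j) for i, j in asent.alignment if j is not None]
        phrases = phrase_extraction(' '.join(asent.words),
                                    ' '.join(asent.mots),
                                    alignment)
        for _, _, src_phrase, tgt_phrase in phrases:
            src, tgt = tuple(src_phrase.split()), tuple(tgt_phrase.split())
            if len(src) <= max_len and len(tgt) <= max_len:
                pair_counts[src, tgt] += 1
                src_counts[src] += 1
    table = PhraseTable()
    for (src, tgt), count in pair_counts.items():
        # relative-frequency estimate of P(tgt|src), stored as a log probability
        table.add(src, tgt, log(count / src_counts[src]))
    return table

# phrase_table = build_phrase_table(bitext)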