11. Machine Translation — Lab exercises

Preparations

Introduction

In this lab, we will be using the Python Natural Language Toolkit (nltk) again, this time to get to know the IBM models better. There are proper, open-source MT systems out there (such as Apertium and Moses); however, getting to know them would require more than 90 minutes.

Infrastructure

For today's exercises, you will need the Docker image again. Provided you have already downloaded it last time, you can start it with the following commands:

  • docker ps -a: lists all the containers you have created. Pick the one you used last time (with any luck, there is only one)
  • docker start <container id>
  • docker exec -it <container id> bash

When that's done, update your git repository:

cd /nlp/python_nlp_2017_fall/
git pull

If git pull returns with errors, it is most likely because some of your files have local changes (most likely the morphology or syntax notebooks, which you worked on in the previous labs). You can check this with git status. If the culprit is the file A.ipynb, you can resolve the problem like so:

cp A.ipynb A_mine.ipynb
git checkout A.ipynb

After that, git pull should work.

And start the notebook:

jupyter notebook --port=8888 --ip=0.0.0.0 --no-browser --allow-root

If you started the notebook but cannot access it in your browser, make sure Jupyter is not also running on the host system. If it is, stop it.

Boilerplate

The following code imports the packages and defines the functions we are going to use.


In [ ]:
import os
import shutil
import urllib.request

import nltk

def download_file(url, directory=''):
    real_dir = os.path.realpath(directory)
    if not os.path.isdir(real_dir):
        os.makedirs(real_dir)

    file_name = url.rsplit('/', 1)[-1]
    real_file = os.path.join(real_dir, file_name)
    
    if not os.path.isfile(real_file):
        with urllib.request.urlopen(url) as inf:
            with open(real_file, 'wb') as outf:
                shutil.copyfileobj(inf, outf)

Exercises

1. Corpus acquisition

We download and preprocess a subset of the Hunglish corpus. It consists of English-Hungarian translation pairs extracted from open-source software documentation. The sentences are already aligned, but the corpus lacks word alignments.

1.1 Download

Download the corpus. The URL is ftp://ftp.mokk.bme.hu/Hunglish2/softwaredocs/bi/opensource_X.bi, where X is a number from 1 to 9. Use the download_file function defined above.
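One possible solution is sketched below; the directory name corpus is an arbitrary choice, and any equivalent loop works just as well.

In [ ]:
url_pattern = 'ftp://ftp.mokk.bme.hu/Hunglish2/softwaredocs/bi/opensource_{}.bi'
for i in range(1, 10):
    # download_file() skips files that already exist, so re-running is cheap
    download_file(url_pattern.format(i), 'corpus')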


In [ ]:

1.2 Conversion

Read the whole corpus (all files), but try not to read it all into memory at once. Write a function that

  • reads all files you have just downloaded
  • is a generator that yields tuples (Hungarian snippet, English snippet)

Note:

  • the files are encoded with the iso-8859-2 (a.k.a. Latin-2) encoding
  • the Hungarian and English snippets are separated by a tab
  • don't forget to strip whitespace from the returned snippets
  • throw away pairs with empty snippets

In [ ]:
def read_files(directory=''):
    pass
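One possible implementation is sketched below. It assumes the files were downloaded into the corpus directory used in the sketch for 1.1, and that the Hungarian snippet comes before the English one on each line; if your files store the two languages the other way around, swap the fields.

In [ ]:
def read_files(directory='corpus'):
    """Yields (Hungarian snippet, English snippet) tuples, one per corpus line."""
    real_dir = os.path.realpath(directory)
    for file_name in sorted(os.listdir(real_dir)):
        if not file_name.endswith('.bi'):
            continue
        with open(os.path.join(real_dir, file_name), encoding='iso-8859-2') as f:
            for line in f:
                fields = line.split('\t')
                if len(fields) != 2:
                    continue
                # Assumption: the Hungarian snippet comes first; swap if needed
                hu, en = fields[0].strip(), fields[1].strip()
                if hu and en:
                    yield hu, en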

1.3 Tokenization

The text is not tokenized. Use nltk's word_tokenize() function to tokenize the snippets. Also, lowercase them. You can do this in read_files() above if you wish, or in the code you write for 1.4 below.

Note:

  • The model for the sentence tokenizer (punkt) is not installed by default. You have to download it first, e.g. with nltk.download('punkt').
  • NLTK doesn't have Hungarian tokenizer models, so there might be errors in the Hungarian results.
  • Instead of just lowercasing everything, we might have chosen a more sophisticated solution, e.g. first calling sent_tokenize() and then lowercasing only the word at the beginning of each sentence, or, even better, tagging the snippets for NER. However, we have neither the time nor the resources (models) to do that now.

In [ ]:
from nltk.tokenize import sent_tokenize, word_tokenize
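A possible helper is sketched below; tokenize_snippet() is just an illustrative name, and you are free to fold the same logic into read_files() or into your solution for 1.4 instead.

In [ ]:
# The punkt model is needed by word_tokenize(); download it once
nltk.download('punkt')

def tokenize_snippet(snippet):
    # NLTK has no Hungarian tokenizer model, so the Hungarian output may
    # contain errors -- we accept that for this lab
    return [token.lower() for token in word_tokenize(snippet)]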

1.4 Create the training corpus

The models we are going to try expect a list of nltk.translate.api.AlignedSent objects. Create a bitext variable that is a list of AlignedSent objects created from the preprocessed, tokenized corpus.

Note that AlignedSent also allows you to specify an alignment between the words in the two texts. Unfortunately (but not unexpectedly), the corpus doesn't have this information.


In [ ]:
from nltk.translate.api import AlignedSent
bitext = []  # Your code here

assert len(bitext) == 135439
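A possible way to build the corpus, using read_files() and the tokenize_snippet() helper sketched above (or your own versions). The order of the two sides is an assumption: we put the Hungarian tokens first and the English tokens second, mirroring the German/English test corpus used later.

In [ ]:
bitext = [AlignedSent(tokenize_snippet(hu), tokenize_snippet(en))
          for hu, en in read_files('corpus')]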

2. IBM Models

NLTK implements IBM models 1-5. Unfortunately, the implementations don't provide end-to-end machine translation systems, only the alignment models.

2.1 IBM Model 1

Train an IBM Model 1 alignment. We do it in a separate code block, so that we don't rerun it by accident – training even a simple model takes some time.


In [ ]:
from nltk.translate import IBMModel1
ibm1 = IBMModel1(bitext, 5)
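Once training finishes, the learned lexical probabilities can be inspected in the model's translation_table, indexed first by a word from the first side of the AlignedSent pairs and then by a word from the second side (with bitext built as in the sketch for 1.4, that is Hungarian first, English second). The word pair below is only a placeholder; substitute any pair that actually occurs in your corpus.

In [ ]:
# Lexical probability learned by Model 1 -- the words below are placeholders,
# pick any (Hungarian, English) pair that occurs in the bitext
ibm1.translation_table['ablak']['window']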

2.2 Alignment conversion

While the model doesn't have a translate() function, it does provide a way to compute the translation probability $P(F|E)$ with some additional coding. That additional work is what you have to put in.

Remember that the formula for the translation probability is $P(F|E) = \sum_A P(F,A|E)$. Computing $P(F,A|E)$ is a bit hairy; luckily, IBMModel1 has a method that calculates at least part of it: prob_t_a_given_s(), which is in fact only $P(F|A,E)$. This function accepts an AlignmentInfo object that contains the source and target sentences as well as the alignment between them.

Unfortunately, AlignmentInfo's representation of an alignment is completely different from the Alignment object's. Your first task is to do the conversion from the latter to the former. Given the example pair John loves Mary / De szereti János Marcsit,

  • Alignment is basically a list of source-target, 0-based index pairs: [(0, 2), (1, 1), (2, 3)]
  • The alignment in the AlignmentInfo objects is a tuple (!), where the ith position is the index of the target word that is aligned to the ith source word, or 0, if the ith source word is unaligned. Indices are 1-based, because the 0th word is NULL on both sides (see lecture page 35, slide 82). The tuple you return must also contain the alignment for this NULL word, which is not aligned with the NULL on the other side - in other words, the returned tuple starts with a 0. Example: (0, 3, 2, 4). If multiple target words are aligned with the same source word, you are free to use the index of any of them.

In [ ]:
from nltk.translate.ibm_model import AlignmentInfo

def alignment_to_info(alignment):
    """Converts from an Alignment object to the alignment format required by AlignmentInfo."""
    pass

assert alignment_to_info([(0, 2), (1, 1), (2, 3)]) == (0, 3, 2, 4)
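A possible conversion is sketched below. The None guard is there because, after training, NLTK may mark a word it could not align with None instead of an index.

In [ ]:
def alignment_to_info(alignment):
    """Converts from an Alignment object to the alignment format required by AlignmentInfo."""
    # Size the tuple by the largest source index in the alignment; position 0
    # belongs to the NULL word and stays 0
    size = max(src for src, _ in alignment) + 1
    result = [0] * (size + 1)
    for src, tgt in alignment:
        # Indices become 1-based; unaligned (None) words map to NULL (0)
        result[src + 1] = 0 if tgt is None else tgt + 1
    return tuple(result)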

2.3. Compute $P(F,A|E)$

Your task is to write a function that, given a source sentence, a target sentence and an alignment, creates an AlignmentInfo object and calls the model's prob_t_a_given_s() with it. The code of test_prob_t_a_given_s() in NLTK's test suite might give you some clue as to how to construct the object.

Since prob_t_a_given_s() only computes $P(F|A,E)$, you have to add the $P(A|E)$ component. See page 38, slide 95 and page 39, slide 100 in the lecture. What are $J$ and $K$ in the inverse setup?

Important: "interestingly", prob_t_a_given_s() translates from target to source. However, you still want to translate from source to target, so take care when filling the fields of the AlignmentInfo object.

Also note:

  1. the alignment you pass to the function should already be in the right (AlignmentInfo) format. Don't bother converting it for now!
  2. Test cases for Exercises 2.3 – 2.5 are available below Exercise 2.5.

In [ ]:
def prob_f_a_e(model, src_sentence, tgt_sentence, alig_in_tuple_format):
    pass
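One way to put this together is sketched below. The factor $\frac{1}{(K+1)^J}$ plays the role of $P(A|E)$ in IBM Model 1: each of the $J$ words of $F$ independently chooses one of the $K$ words of $E$ or NULL. Following the note above, our source sentence goes into the trg_sentence field of AlignmentInfo and our target sentence into its src_sentence field.

In [ ]:
def prob_f_a_e(model, src_sentence, tgt_sentence, alig_in_tuple_format):
    # prob_t_a_given_s() works from target to source, so our source (F) goes
    # into trg_sentence and our target (E) into src_sentence
    alignment_info = AlignmentInfo(
        tuple(alig_in_tuple_format),
        [None] + list(tgt_sentence),      # E side; index 0 is the NULL word
        ['UNUSED'] + list(src_sentence),  # F side; index 0 is never accessed
        None,                             # cepts -- not needed by Model 1
    )
    p_f_given_a_e = model.prob_t_a_given_s(alignment_info)  # P(F|A,E)
    # P(A|E): each of the J source words picks one of the K+1 target positions
    J, K = len(src_sentence), len(tgt_sentence)
    return p_f_given_a_e / (K + 1) ** J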

2.4. Compute $P(F, A_{best}|E)$

Write a function that, given an AlignedSent object, computes $P(F,A|E)$. Since IBMModel1 aligns the sentences of the training set with the most probable alignment, this function will effectively compute $P(F,A_{best}|E)$.

Don't forget to convert the alignment with the function you wrote in Exercise 2.2 before passing it to prob_f_a_e().


In [ ]:
def prob_best_a(model, aligned_sent):
    pass
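A possible solution, relying on alignment_to_info() from Exercise 2.2 and prob_f_a_e() from Exercise 2.3; aligned_sent.words is treated as the source (F) side and aligned_sent.mots as the target (E) side, matching the test corpus below.

In [ ]:
def prob_best_a(model, aligned_sent):
    # The alignment stored on the sentence pair was produced by the model
    # during training, i.e. it is the most probable alignment it found
    alig = alignment_to_info(list(aligned_sent.alignment))
    return prob_f_a_e(model, aligned_sent.words, aligned_sent.mots, alig)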

2.5. Compute $P(F|E)$

Write a function that, given an AlignedSent object, computes $P(F|E)$. It should enumerate all possible alignments (in the tuple format) and call the function you wrote in Exercise 2.3 with them.

Note: the itertools.product function can be very useful in enumerating the alignments.


In [ ]:
def prob_f_e(model, aligned_sent):
    pass
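One possible implementation, calling prob_f_a_e() from Exercise 2.3: every word of the source (F) side can align to NULL (index 0) or to any of the $K$ target words, so the alignments to sum over are the $(K+1)^J$ elements of a Cartesian product, each prefixed with the 0 that belongs to the NULL word.

In [ ]:
from itertools import product

def prob_f_e(model, aligned_sent):
    src, tgt = aligned_sent.words, aligned_sent.mots
    J, K = len(src), len(tgt)
    total = 0.0
    # Enumerate every alignment function: each of the J source words maps to
    # one of the K+1 target positions (0 is the NULL word)
    for alig in product(range(K + 1), repeat=J):
        total += prob_f_a_e(model, src, tgt, (0,) + alig)
    return total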

Test cases for Exercises 2.3 – 2.5.


In [ ]:
import numpy

testext = [
    AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']),
    AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']),
    AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']),
    AlignedSent(['das', 'haus'], ['the', 'house']),
    AlignedSent(['das', 'buch'], ['the', 'book']),
    AlignedSent(['ein', 'buch'], ['a', 'book'])
]
ibm2 = IBMModel1(testext, 5)

# Tests for Exercise 2.3
assert numpy.allclose(prob_f_a_e(ibm2, ['ein', 'buch'], ['a', 'book'], (0, 1, 2)), 0.08283000979778607)
assert numpy.allclose(prob_f_a_e(ibm2, ['ein', 'buch'], ['a', 'book'], (0, 2, 1)), 0.0002015158225914316)

# Tests for Exercise 2.4
assert numpy.allclose(prob_best_a(ibm2, testext[4]), 0.059443309368677)
assert numpy.allclose(prob_best_a(ibm2, testext[2]), 1.3593610057711997e-05)

# Tests for Exercise 2.5
assert numpy.allclose(prob_f_e(ibm2, testext[4]), 0.13718805082588842)
assert numpy.allclose(prob_f_e(ibm2, testext[2]), 0.0001809283308942621)

3. Phrase-based translation

NLTK also has some functions related to phrase-based translation, but these are far from finished. The components are scattered across two modules:

  • phrase_based defines the function phrase_extraction() that can extract phrases from parallel text, based on an alignment
  • stack_decoder defines the StackDecoder object, which can be used to translate sentences based on a phrase table and a language model

3.1. Decoding example

If you were wondering where the rest of the training functionality is, you have spotted the problem: unfortunately, the part that assembles the phrase table from the extracted phrases is missing. Also missing are the classes that represent and compute a language model. So in the code block below, we only run the decoder on an example sentence with a "hand-crafted" model.

Note: This is the same code as in the documentation of the decoder (above).


In [ ]:
from collections import defaultdict
from math import log

from nltk.translate import PhraseTable
from nltk.translate.stack_decoder import StackDecoder

# The (probabilistic) phrase table
phrase_table = PhraseTable()
phrase_table.add(('niemand',), ('nobody',), log(0.8))
phrase_table.add(('niemand',), ('no', 'one'), log(0.2))
phrase_table.add(('erwartet',), ('expects',), log(0.8))
phrase_table.add(('erwartet',), ('expecting',), log(0.2))
phrase_table.add(('niemand', 'erwartet'), ('one', 'does', 'not', 'expect'), log(0.1))
phrase_table.add(('die', 'spanische', 'inquisition'), ('the', 'spanish', 'inquisition'), log(0.8))
phrase_table.add(('!',), ('!',), log(0.8))

# The "language model"
language_prob = defaultdict(lambda: -999.0)
language_prob[('nobody',)] = log(0.5)
language_prob[('expects',)] = log(0.4)
language_prob[('the', 'spanish', 'inquisition')] = log(0.2)
language_prob[('!',)] = log(0.1)
# Note: type() with three parameters creates a new type object
language_model = type('',(object,), {'probability_change': lambda self, context, phrase: language_prob[phrase],
                                     'probability': lambda self, phrase: language_prob[phrase]})()

stack_decoder = StackDecoder(phrase_table, language_model)

stack_decoder.translate(['niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!'])

3.2. Train the phrase table*

Run through the parallel corpus (already aligned by an IBM model) and extract all phrases from it. You can limit the length of the phrases you consider to 2 (3, ...) words, but you have to do it manually, because the max_phrase_length argument of phrase_extraction() doesn't work. Once you have all the phrases, create a phrase table similar to the one above. Don't forget that the decoder expects log probabilities.
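A possible sketch is given below. It assumes bitext has already been aligned by the IBM model trained in Exercise 2.1 (training fills in the alignment field of each AlignedSent), treats the words side as the source language, and scores each phrase pair with a simple relative-frequency estimate of the target phrase given the source phrase; other scoring schemes are just as valid, as long as you store log probabilities. The variable names below are arbitrary.

In [ ]:
from collections import defaultdict
from math import log

from nltk.translate import PhraseTable
from nltk.translate.phrase_based import phrase_extraction

MAX_PHRASE_LENGTH = 2  # enforced manually, see above

pair_counts = defaultdict(lambda: defaultdict(int))
src_counts = defaultdict(int)

# Note: running this over the full corpus may take a while
for aligned_sent in bitext:
    src_words, trg_words = aligned_sent.words, aligned_sent.mots
    # phrase_extraction() expects the sentences as strings and the alignment
    # as (source index, target index) pairs; unaligned (None) points are dropped
    alignment = [(i, j) for i, j in aligned_sent.alignment
                 if i is not None and j is not None]
    phrases = phrase_extraction(' '.join(src_words), ' '.join(trg_words), alignment)
    for _, _, src_phrase, trg_phrase in phrases:
        src_tuple, trg_tuple = tuple(src_phrase.split()), tuple(trg_phrase.split())
        if len(src_tuple) > MAX_PHRASE_LENGTH or len(trg_tuple) > MAX_PHRASE_LENGTH:
            continue
        pair_counts[src_tuple][trg_tuple] += 1
        src_counts[src_tuple] += 1

# Relative-frequency estimates, stored as log probabilities for the decoder
trained_phrase_table = PhraseTable()
for src_tuple, trg_counts in pair_counts.items():
    for trg_tuple, count in trg_counts.items():
        trained_phrase_table.add(src_tuple, trg_tuple,
                                 log(count / src_counts[src_tuple]))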