Week 8 bonus descriptions

Here are some cool mini-projects you can try to dive deeper into the topic.

More metrics: BLEU (5+ pts)

Pick BLEU or any other relevant metric (e.g. `sentence_bleu` from `nltk.translate.bleu_score`).

  • Train the model to maximize BLEU directly.
  • How does Levenshtein distance behave when you maximize BLEU, and vice versa?
  • Compare this with how both metrics behave when optimizing likelihood.

(use default parameters for BLEU: 4-gram, uniform weights)
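If you'd rather not depend on nltk, the default BLEU (4-gram, uniform weights) is short enough to reimplement. Below is a minimal self-contained sketch; the function name `simple_bleu` and the add-one smoothing are our choices here, not part of the assignment (nltk's `sentence_bleu` is the off-the-shelf alternative).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU: 4-gram, uniform weights, add-one smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # clipped n-gram matches, as in standard BLEU
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        # add-one smoothing keeps one missing n-gram from zeroing the whole score
        log_precisions.append(math.log((clipped + 1) / (total + 1)))
    # brevity penalty punishes hypotheses shorter than the reference
    brevity = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
    return brevity * math.exp(sum(log_precisions) / max_n)

reference = "a cat sat on the mat".split()
hypothesis = "a cat sat on mat".split()
score = simple_bleu(reference, hypothesis)
```

Since BLEU is non-differentiable, you maximize it the same way as Levenshtein in the main assignment: via policy gradient, with the (negative) score as the reward.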

Actor-critic (5+++ pts)

While self-critical training provides a large reduction of gradient variance, it has a few drawbacks:

  • It requires a lot of additional computation during training
  • It doesn't adjust $V(s)$ between decoder steps (one value per sequence).

There's a more general way of doing the same thing: learned baselines, also known as advantage actor-critic.

There are two main ways to apply that:

  • naive way: compute V(s) once per training example.
    • This only requires an additional 1-unit linear dense layer that grows out of the encoder, estimating V(s).
    • (implement this to get the main points)
  • every step: compute V(s) on each decoder step.
    • Again, it's just a 1-unit dense layer (no nonlinearity), but this time it lives inside the decoder recurrence.
    • (+3 additional pts for this variant)

In both cases, you should train V(s) to minimize the squared error $(V(s) - R(s,a))^2$, with $R$ being the actual Levenshtein distance. You can then use $ A(s,a) = R(s,a) - const(V(s)) $ for the policy gradient; here $const(\cdot)$ means the baseline is treated as a constant, so no gradient flows through it.
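The naive (once-per-example) variant can be sketched in a few lines. Everything below is illustrative: `value_head` is the hypothetical 1-unit layer, and `enc_state` / `R` stand in for your encoder's final state and your actual reward.

```python
import torch
import torch.nn as nn

batch, hidden_size = 32, 64
value_head = nn.Linear(hidden_size, 1)      # the 1-unit dense layer on top of the encoder

enc_state = torch.rand(batch, hidden_size)  # final encoder state, one per training example
R = torch.rand(batch)                       # actual reward, e.g. negative Levenshtein

V = value_head(enc_state).squeeze(-1)       # baseline V(s), one value per sequence
critic_loss = ((V - R) ** 2).mean()         # trains V(s) towards R(s, a)

advantage = R - V.detach()                  # const(V(s)): no gradient into the baseline
# policy_loss = -(advantage * logp_actions).mean()  # logp_actions come from your decoder
```

`detach()` implements the $const(\cdot)$ above: the advantage scales the policy gradient but the baseline is only updated through `critic_loss`.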

There's also one particularly interesting approach (+5 additional pts):

  • combining SCST and actor-critic:
    • compute baseline $V(s)$ via self-critical sequence training (just like in the main assignment)
    • learn correction $ C(s,a_{:t}) = R(s,a) - V(s) $ by minimizing $(R(s,a) - V(s) - C(s,a_{:t}))^2 $
    • use $ A(s,a_{:t}) = R(s,a) - V(s) - const(C(s,a_{:t})) $
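A sketch of that combination, under the same stand-in names as before (`correction_head` is the hypothetical per-step layer; `dec_states`, `R` and `V_greedy` stand in for your decoder states, sampled-sequence reward, and greedy-rollout reward):

```python
import torch
import torch.nn as nn

batch, n_steps, hidden_size = 32, 10, 64
correction_head = nn.Linear(hidden_size, 1)           # 1-unit layer inside the decoder

dec_states = torch.rand(batch, n_steps, hidden_size)  # decoder states over a_{:t}
R = torch.rand(batch)                                 # reward of the sampled sequence
V_greedy = torch.rand(batch)                          # reward of the greedy rollout (SCST baseline)

residual = (R - V_greedy).unsqueeze(1)                # R(s,a) - V(s), target for C
C = correction_head(dec_states).squeeze(-1)           # C(s, a_{:t}), one value per decoder step
corr_loss = ((residual - C) ** 2).mean()              # (R(s,a) - V(s) - C(s, a_{:t}))^2

A = residual - C.detach()                             # A(s, a_{:t}) for the policy gradient
```

As with the plain baseline, `detach()` plays the role of $const(\cdot)$: the correction is trained only through `corr_loss`.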

Implement attention (5+++ pts)

Some seq2seq tasks can benefit from the attention mechanism. In addition to taking the last time-step of the encoder hidden state, we can allow the decoder to peek at any time-step of its choice.

1) Modify encoder-decoder

Learn to feed the entire encoder output into the decoder. You can do so by passing the full sequence of encoder RNN hidden states directly into the decoder (make sure your encoder returns hidden states for every time-step, not only the final one).

class encoder:
    ...
    enc_sequences, (h, c) = self.lstm(x)
    ...

class decoder: 
    ...
    attention_applied = self.attn_layer(enc_sequences, h)
    h, c = self.lstm_decoder(prev_emb, (attention_applied, c))
    ...

For starters, you can take its last time-step (e.g. by slicing enc_sequences[:, -1]) inside the decoder step and feed it as input to make sure everything works.

2) Implement attention mechanism

The next thing we'll need is to implement the math of attention.

The simplest way to do so is to write a special layer. We gave you a prototype and some tests below.

3) Use attention inside decoder

That's almost it! Now use AttentionLayer inside the decoder and feed its output back to the lstm/gru/rnn (see the code demo below).

Train the full network just like you did before attention.

More points will be awarded for comparing learning results with attention vs. without attention.

Bonus bonus: visualize attention vectors (>= +3 points)

The best way to make sure your attention actually works is to visualize it.

A simple way to do so is to obtain the attention vectors from each tick (the values right after softmax, not the layer outputs) and draw them as images.

step-by-step guide:

  • compute scores between $h_{e, j}^i$ and $h_{d}^i$ $\forall j = 1, \dots, \text{len(enc\_seq)}$, where $i$ is the index of the decoder step
  • apply softmax to scores and get weight for each vector
  • obtain attention vector using enc_seq and weights
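The three steps above can be sketched with plain tensors, independently of the AttentionLayer class. Note one assumption: this sketch uses dot-product scoring, whereas the prototype below scores with a learned linear layer; all names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

batch, seq_len, hid = 4, 7, 16
rng = np.random.default_rng(0)
enc_seq = rng.standard_normal((batch, seq_len, hid))  # h_e for every encoder step
dec_state = rng.standard_normal((batch, hid))         # h_d at decoder step i

# 1) score each encoder state against the decoder state (dot-product scoring)
scores = np.einsum('bth,bh->bt', enc_seq, dec_state)  # (batch, seq_len)
# 2) softmax over encoder positions -- these weights are what you visualize
alphas = softmax(scores, axis=1)
# 3) attention vector: weighted sum of encoder states
attn = np.einsum('bt,bth->bh', alphas, enc_seq)       # (batch, hid)
```

The `alphas` matrix (one row per decoder step) is exactly what you would stack and draw as an image.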

In [ ]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

In [ ]:
class AttentionLayer(nn.Module):
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.attn = nn.Linear(hidden_size * 2, output_size)

    
    def forward(self, enc_seq, decoder_state): 
        scores = # your code here
        alphas = F.softmax(
            # your code here
        )
        attn_combined = # your code here
        return attn_combined

In [ ]:
# demo code
batch_size = 32
hidden_size = 256
seq_len = 41
dec_h_prev = torch.rand((batch_size, hidden_size))
enc_sequences = torch.rand((batch_size, seq_len, hidden_size))

attention = AttentionLayer(hidden_size, hidden_size)

# sanity check
demo_output = attention(enc_sequences, dec_h_prev)
print('actual shape:', demo_output.shape)
assert demo_output.shape == (32, 256)
assert np.all(np.isfinite(demo_output.detach().cpu().numpy()))

In [ ]: