Here are some cool mini-projects you can try to dive deeper into the topic.
Pick BLEU or any other relevant metric (e.g. from nltk.translate.bleu_score).
(Use the default parameters for BLEU: 4-grams, uniform weights.)
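For instance, here's a minimal sketch of computing corpus-level BLEU with NLTK's defaults; the reference and hypothesis lists below are made-up placeholders for your actual tokenized outputs.
In [ ]:
from nltk.translate.bleu_score import corpus_bleu

# made-up tokenized data: each hypothesis comes with a list of one or more references
references = [[['the', 'cat', 'sat', 'on', 'the', 'mat']],
              [['there', 'is', 'a', 'dog', 'in', 'the', 'yard']]]
hypotheses = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
              ['there', 'is', 'a', 'dog', 'in', 'the', 'garden']]

print('corpus BLEU:', corpus_bleu(references, hypotheses))  # defaults: 4-grams, uniform weights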
While self-critical training provides a large reduction of gradient variance, it has a few drawbacks:
There's a more general way of doing the same thing: learned baselines, also known as advantage actor-critic.
There are two main ways to apply that:
In both cases, you should train V(s) to minimize the squared error $(V(s) - R(s,a))^2$, with R being the actual Levenshtein-based reward. You can then use $A(s,a) = R(s,a) - \operatorname{const}(V(s))$ for the policy gradient, where $\operatorname{const}(\cdot)$ means the baseline is treated as a constant, i.e. no gradient flows through it.
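Here's a minimal sketch of such a learned baseline. All tensor names are hypothetical placeholders for your decoder states, the log-probabilities of the sampled sequences and the per-sequence rewards; flip the sign of the advantage if you use raw Levenshtein distance as a cost rather than a reward.
In [ ]:
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, hidden_size = 32, 256
state = torch.randn(batch_size, hidden_size)      # hypothetical decoder states s
logp_actions = torch.randn(batch_size)            # hypothetical log-probs of the sampled sequences
rewards = torch.rand(batch_size)                  # hypothetical Levenshtein-based R(s, a)

value_head = nn.Linear(hidden_size, 1)            # V(s)

V = value_head(state).squeeze(-1)                 # [batch]
value_loss = F.mse_loss(V, rewards)               # train V(s) to predict the actual R(s, a)

advantage = rewards - V.detach()                  # const(V(s)): no gradient through the baseline
policy_loss = -(advantage * logp_actions).mean()  # policy-gradient surrogate objective

loss = policy_loss + value_loss                   # backprop this and step your optimizer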
There's also one particularly interesting approach (+5 additional pts):
Some seq2seq tasks can benefit from the attention mechanism. In addition to taking the last time-step of the encoder hidden state, we can allow the decoder to peek at any time-step of its choice.
1) Modify encoder-decoder
Learn to feed the entire encoder output into the decoder. You can do so by passing the encoder RNN's full sequence of hidden states into the decoder (make sure the encoder returns the whole sequence, not only its final state).
class Encoder:
    ...
    # keep the full sequence of encoder hidden states, not only the final one
    enc_sequences, (h, c) = self.lstm(x)
    ...

class Decoder:
    ...
    # attend over all encoder states, conditioned on the previous decoder state
    attention_applied = self.attn_layer(enc_sequences, h)
    h, c = self.lstm_decoder(prev_emb, (attention_applied, c))
    ...
For starters, you can take its last tick (e.g. enc_sequences[:, -1]) inside the decoder step and feed it as input to make sure everything works.
2) Implement attention mechanism
The next thing we'll need is to implement the math of attention.
The simplest way to do so is to write a special layer. We've given you a prototype and some tests below.
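As a reminder, the general recipe is: compute a relevance score for each encoder time-step, normalize the scores with a softmax, and take the weighted sum of encoder states (the prototype below scores with a learned linear layer, but any scoring function $f$ fits this scheme):

$$ \text{score}_t = f(h^{dec},\, h^{enc}_t), \qquad \alpha_t = \frac{\exp(\text{score}_t)}{\sum_{t'} \exp(\text{score}_{t'})}, \qquad \text{attn} = \sum_t \alpha_t \, h^{enc}_t $$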
3) Use attention inside decoder
That's almost it! Now use AttentionLayer inside the decoder and feed its output back into the LSTM/GRU/RNN (see the code demo below).
Train the full network just like you did before attention.
More points will be awarded for comparing learning results with attention vs. without attention.
Bonus bonus: visualize attention vectors (>= +3 points)
The best way to make sure your attention actually works is to visualize it.
A simple way to do so is to obtain the attention vectors from each tick (the values right after the softmax, not the layer outputs) and draw them as images.
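For example, here's a minimal sketch with matplotlib; attn_matrix is a hypothetical [output_len, input_len] array of per-tick attention weights collected while decoding one example (random values are used here only to make the cell runnable).
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# hypothetical attention weights for one example: rows = decoder steps, columns = input positions
attn_matrix = np.random.rand(10, 15)
attn_matrix /= attn_matrix.sum(axis=1, keepdims=True)  # each row sums to 1, like softmax outputs

plt.imshow(attn_matrix, cmap='hot', interpolation='nearest')
plt.xlabel('input position')
plt.ylabel('decoder step')
plt.colorbar()
plt.show()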
In [ ]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
In [ ]:
class AttentionLayer(nn.Module):
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.attn = nn.Linear(hidden_size * 2, output_size)

    def forward(self, enc_seq, decoder_state):
        # enc_seq: [batch, seq_len, hidden], decoder_state: [batch, hidden]
        scores = ...  # your code here: one score per encoder time-step
        alphas = F.softmax(
            ...  # your code here: normalize the scores over the time dimension
        )
        attn_combined = ...  # your code here: weighted sum of enc_seq with alphas
        return attn_combined
In [ ]:
# demo code
batch_size = 32
hidden_size = 256
seq_len = 41
dec_h_prev = torch.rand((batch_size, hidden_size))
enc_sequences = torch.rand((batch_size, seq_len, hidden_size))
attention = AttentionLayer(hidden_size, hidden_size)
# sanity check
demo_output = attention(enc_sequences, dec_h_prev)
print('actual shape:', demo_output.shape)
assert demo_output.shape == (32, 256)
assert np.all(np.isfinite(demo_output.detach().cpu().numpy()))
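For comparison only, here is one possible variant that uses plain dot-product scores. It skips the learned linear scoring layer from the prototype, so treat it as an illustration of the expected interface and shapes, not as the intended solution.
In [ ]:
# illustration only: dot-product attention with the same interface and output shape
class DotProductAttention(nn.Module):
    def forward(self, enc_seq, decoder_state):
        # enc_seq: [batch, seq_len, hidden], decoder_state: [batch, hidden]
        scores = torch.bmm(enc_seq, decoder_state.unsqueeze(-1)).squeeze(-1)  # [batch, seq_len]
        alphas = F.softmax(scores, dim=1)                                     # weights over time-steps
        return torch.bmm(alphas.unsqueeze(1), enc_seq).squeeze(1)             # [batch, hidden]

assert DotProductAttention()(enc_sequences, dec_h_prev).shape == (batch_size, hidden_size)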
In [ ]: