Lesson 6 RNN (Code Along)


2018/7/22|8/10 –– Wayne H Nixalo


In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai.io import *
from fastai.conv_learner import *
from fastai.column_data import *

1. Setup

We're going to download the collected works of Nietzsche to use as our data for this class.


In [165]:
# PATH = Path('data/nietzsche/')
PATH = 'data/nietzsche/'

get_data("https://s3.amazonaws.com/text-datasets/nietzsche.txt", f'{PATH}nietzsche.txt')
text = open(f'{PATH}nietzsche.txt').read()
print('corpus length:', len(text))


nietzsche.txt: 606kB [00:01, 520kB/s]                             
corpus length: 600893


In [4]:
text[:400]


Out[4]:
'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not ground\nfor suspecting that all philosophers, in so far as they have been\ndogmatists, have failed to understand women--that the terrible\nseriousness and clumsy importunity with which they have usually paid\ntheir addresses to Truth, have been unskilled and unseemly methods for\nwinning a woman? Certainly she has never allowed herself '

In [5]:
chars = sorted(list(set(text)))
vocab_size = len(chars) + 1
print('total chars', vocab_size)


total chars 85

Sometimes it's useful to have a zero value in the dataset, eg: for padding.


In [6]:
chars.insert(0, '\0')
''.join(chars[1:-5])


Out[6]:
'\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz'

Map from chars to indices and back again:


In [7]:
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

idx will be the data we use form now on – it simply converts all characters to their index (based on the mapping above).


In [8]:
idx = [char_indices[c] for c in text]
idx[:10]


Out[8]:
[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [9]:
''.join(indices_char[i] for i in idx[:70])


Out[9]:
'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not gro'

2. Three char model

2.1 Create inputs

Create a list of every 4th character, starting at the 0th, 1st, 2nd, then 3rd characters.


In [10]:
cs = 3
c1_dat = [idx[i]   for i in range(0, len(idx)-cs, cs)] # every 1st char
c2_dat = [idx[i+1] for i in range(0, len(idx)-cs, cs)] # every 2nd
c3_dat = [idx[i+2] for i in range(0, len(idx)-cs, cs)] # every 3rd
c4_dat = [idx[i+3] for i in range(0, len(idx)-cs, cs)] # every 4th

Our inputs:


In [11]:
x1 = np.stack(c1_dat)
x2 = np.stack(c2_dat)
x3 = np.stack(c3_dat)

Our outputs:


In [12]:
y = np.stack(c4_dat)

The first 4 inputs and outputs:


In [13]:
x1[:4], x2[:4], x3[:4]


Out[13]:
(array([40, 30, 29,  1]), array([42, 25,  1, 43]), array([29, 27,  1, 45]))

In [14]:
y[:4]


Out[14]:
array([30, 29,  1, 40])

In [15]:
x1.shape, y.shape


Out[15]:
((200297,), (200297,))

2.2 Create and train model

Pick a size for our hidden state


In [16]:
n_hidden = 256

The number of latent factors to create (ie: size of the embedding matrix):


In [17]:
n_fac = 42 # about half the number of our characters

In [19]:
'0.3' in torch.__version__


Out[19]:
True

In [20]:
class Char3Model(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac) # embedding
        
        # the 'green arrow' from our diagram – the layer operation from input to hidden
        self.l_in = nn.Linear(n_fac, n_hidden)
        
        # the 'orange arrow' from our diagram – the layer operation from hidden to hidden
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        
        # the 'blue arrow' from our diagram – the layer operation from hidden to output
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, c1, c2, c3):
        in1 = F.relu(self.l_in(self.e(c1)))
        in2 = F.relu(self.l_in(self.e(c2)))
        in3 = F.relu(self.l_in(self.e(c3)))
        
        if '0.3' in torch.__version__:
            h = V(torch.zeros(in1.size()).cuda())
            h = F.tanh(self.l_hidden(h+in1))
            h = F.tanh(self.l_hidden(h+in2))
            h = F.tanh(self.l_hidden(h+in3))
        else:
            h = torch.zeros(in1.size()).cuda() # I dont think I have to wrap as Variable since this is pytorch 0.4, no?
            h = torch.tanh(self.l_hidden(h + in1))
            h = torch.tanh(self.l_hidden(h + in2))
            h = torch.tanh(self.l_hidden(h + in3))
        
        return F.log_softmax(self.l_out(h))

In [37]:
mdata = ColumnarModelData.from_arrays('.', [-1], np.stack([x1,x2,x3], axis=1), y, bs=512)
model = Char3Model(vocab_size, n_fac).cuda()

In [38]:
it = iter(mdata.trn_dl)
*xs,yt = next(it)
# tensor = model(*xs)
tensor = model(*V(xs))

In [39]:
optimizer = optim.Adam(model.parameters(), 1e-2)

In [40]:
set_lrs(optimizer, 1e-3)
fit(model, mdata, 1, optimizer, F.nll_loss)


epoch      trn_loss   val_loss                              
    0      2.118637   1.526361  

Out[40]:
[array([1.52636])]

In [41]:
set_lrs(optimizer, 1e-3)
fit(model, mdata, 1, optimizer, F.nll_loss)


epoch      trn_loss   val_loss                              
    0      1.924193   0.515864  

Out[41]:
[array([0.51586])]

2.3 Test model


In [35]:
def get_next(inp):
    """
    Takes a 3-char string. 
    Turns it into a Tensor of an array of the char index of the string.
    Passes that tensor to the model.
    Does an argmax to get the predicted char-number; then coverts to char.
    """
    idxs = T(np.array([char_indices[c] for c in inp]))
#     pred = model(*idxs)
    pred = model(*VV(idxs))
    i = np.argmax(to_np(pred))
    return chars[i]

In [25]:
get_next('y. '), get_next('ppl'), get_next(' th'), get_next('and')


Out[25]:
('T', 'e', 'e', ' ')

3. Our first RNN

Lecture 6

3.1 Create inputs

This is the size of our unrolled RNN:


In [43]:
cs = 8

For each of 0 thru 8, create a list of every 8th character with that starting point. These will be the 8 inputs to our model.


In [44]:
c_in_dat = [[idx[i + j] for i in range(cs)] for j in range(len(idx) - cs)]

In [45]:
c_out_dat = [idx[j + cs] for j in range(len(idx) - cs)]

In [46]:
xs = np.stack(c_in_dat, axis=0); xs.shape


Out[46]:
(600885, 8)

In [47]:
y = np.stack(c_out_dat); y.shape


Out[47]:
(600885,)

So each column below is one series of 8 characters from the text.


In [48]:
xs[:cs, :cs]


Out[48]:
array([[40, 42, 29, 30, 25, 27, 29,  1],
       [42, 29, 30, 25, 27, 29,  1,  1],
       [29, 30, 25, 27, 29,  1,  1,  1],
       [30, 25, 27, 29,  1,  1,  1, 43],
       [25, 27, 29,  1,  1,  1, 43, 45],
       [27, 29,  1,  1,  1, 43, 45, 40],
       [29,  1,  1,  1, 43, 45, 40, 40],
       [ 1,  1,  1, 43, 45, 40, 40, 39]])

they're overlapping. So after '[42, 29, 30, 25, 27, 29, 1, 1]' comes '1', and after '[29, 30, 25, 27, 29, 1, 1, 1]' comes '43', and so on. The nth row is the same as the nth column.

...and this is the next character after each sequence


In [49]:
y[:cs]


Out[49]:
array([ 1,  1, 43, 45, 40, 40, 39, 43])

3.2 Create and train model


In [50]:
val_idx = get_cv_idxs(len(idx) - cs - 1)

In [51]:
mdata = ColumnarModelData.from_arrays('.', val_idx, xs, y, bs=512)

In [52]:
class CharLoopModel(nn.Module):
    """This is an RNN."""
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
#         h  = torch.zeros(bs, n_hidden).cuda()
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
#             inp = torch.tanh(self.l_in(self.e(c)))   # the torch.tanh vs F.tanh warning didnt pop
#             h   = torch.tanh(self.l_hidden(h + inp)) # up on Mac, but did on Linux-gpu. Odd.
            inp = F.relu(self.l_in(self.e(c)))
            h   = F.tanh(self.l_hidden(h+inp))
        
        return F.log_softmax(self.l_out(h), dim=-1)

In [53]:
model = CharLoopModel(vocab_size, n_fac).cuda()
optimizer = optim.Adam(model.parameters(), 1e-2)

In [54]:
fit(model, mdata, 1, optimizer, F.nll_loss)


epoch      trn_loss   val_loss                              
    0      2.00313    1.999883  

Out[54]:
[array([1.99988])]

In [55]:
set_lrs(optimizer, 1e-3)
fit(model, mdata, 1, optimizer, F.nll_loss)


epoch      trn_loss   val_loss                              
    0      1.715826   1.720763  

Out[55]:
[array([1.72076])]

The input and hidden states represent qualitatively different types of information, so adding them together can potentially lose information. Instead we can concatenate them together.


In [56]:
class CharLoopConcatModel(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac + n_hidden, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
#         h  = torch.zeros(bs, n_hidden).cuda()
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = torch.cat((h, self.e(c)), 1)
            inp = F.relu(self.l_in(inp))
#             h   = torch.tanh(self.l_hidden(inp))
            h   = F.tanh(self.l_hidden(inp))
            
        return F.log_softmax(self.l_out(inp), dim=-1)

In [57]:
model = CharLoopConcatModel(vocab_size, n_fac).cuda()
optimizer = optim.Adam(model.parameters(), 1e-3)

In [58]:
it = iter(mdata.trn_dl)
*xs,yt = next(it)
# t = model(*xs)
t = model(*V(xs))

In [59]:
xs[0].size(0)


Out[59]:
512

In [60]:
t


Out[60]:
Variable containing:
-4.3426 -4.4830 -4.5473  ...  -4.5408 -4.5223 -4.5775
-4.3193 -4.4293 -4.2892  ...  -4.6967 -4.4340 -4.4535
-4.3042 -4.3908 -4.4010  ...  -4.6524 -4.4899 -4.4502
          ...             ⋱             ...          
-4.4318 -4.3545 -4.5647  ...  -4.6044 -4.4858 -4.4999
-4.3499 -4.4375 -4.4929  ...  -4.6093 -4.5143 -4.4863
-4.3761 -4.4803 -4.3790  ...  -4.4807 -4.3885 -4.5237
[torch.cuda.FloatTensor of size 512x85 (GPU 0)]

In [61]:
fit(model, mdata, 1, optimizer, F.nll_loss)


epoch      trn_loss   val_loss                              
    0      1.825499   1.793691  

Out[61]:
[array([1.79369])]

In [62]:
set_lrs(optimizer, 1e-4)
fit(model, mdata, 1, optimizer, F.nll_loss)


epoch      trn_loss   val_loss                              
    0      1.707226   1.711758  

Out[62]:
[array([1.71176])]

3.3 Test Model


In [67]:
if '0.3' in torch.__version__:
    def get_next(inp):
        idxs = T(np.array([char_indices[c] for c in inp]))
        p = model(*VV(idxs))
        i = np.argmax(to_np(p))
        return chars[i]
else:
    def get_next(inp):
    #     idxs = [T(np.array([char_indices[c] for c in inp]))]
        idxs = [T(np.array([char_indices[c]])) for c in inp]
        p = model(*idxs)
        i = np.argmax(to_np(p))
    #     pdb.set_trace()
        return chars[i]

In [68]:
get_next('for thos')


Out[68]:
'e'

In [69]:
get_next('part of ')


Out[69]:
't'

In [70]:
get_next('queens a')


Out[70]:
'n'

4. RNN with PyTorch

Lecture 6, 1:48:52


In [92]:
class CharRNN(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
#         h  = torch.zeros(1, bs, n_hidden)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        
        return F.log_softmax(self.l_out(outp[-1]))
#         return F.log_softmax(self.l_out(outp[-1]), dim=-1) # outp[-1] to get last hidden state

In [93]:
model = CharRNN(vocab_size, n_fac).cuda()
optimizer = optim.Adam(model.parameters(), 1e-3)

In [94]:
it = iter(mdata.trn_dl)
*xs,yt = next(it)

In [95]:
# tensor = model.e(V(torch.stack(xs))) # works w/o V(.). but takes longer when switching btwn w/wo V(.)?
# tensor = model.e(torch.stack(xs)) # these are ints so cannot require gradients
# tensor = model.e(T(torch.stack(xs)))
tensor = model.e(V(torch.stack(xs)))
tensor.size()


Out[95]:
torch.Size([8, 512, 42])

In [96]:
# htensor = V(torch.zeros(1, 512, n_hidden)) # V(.) required here, else: RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED
# NOTE: does not work: htensor = torch.zeros(1, 512, n_hidden, requires_grad=True) # requires_grad=True accomplishes what V(.) did in 0.3.1 for 0.4.
# htensor = T(torch.zeros(1, 512, n_hidden))
htensor = V(torch.zeros(1, 512, n_hidden))

In [97]:
outp, hn = model.rnn(tensor, htensor)
outp.size(), hn.size()


Out[97]:
(torch.Size([8, 512, 256]), torch.Size([1, 512, 256]))

I'm able to get this far in pytorch 0.4, using T instead of V. The problem is the next line keeps giving me a:

RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED

As per here, I'm going to use pytorch 0.3 from here to the end.


In [57]:
# the error when using pytorch 0.4:
tensor = model(*V(xs)); tensor.size()


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-57-77b686092abb> in <module>()
----> 1 tensor = model(*V(xs)); tensor.size()

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    475             result = self._slow_forward(*input, **kwargs)
    476         else:
--> 477             result = self.forward(*input, **kwargs)
    478         for hook in self._forward_hooks.values():
    479             hook_result = hook(self, input, result)

<ipython-input-45-0e09a685603e> in forward(self, *cs)
     10         h  = torch.zeros(1, bs, n_hidden)
     11         inp = self.e(torch.stack(cs))
---> 12         outp,h = self.rnn(inp, h)
     13 
     14 #         return F.log_softmax(self.l_out(outp[-1]))

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    475             result = self._slow_forward(*input, **kwargs)
    476         else:
--> 477             result = self.forward(*input, **kwargs)
    478         for hook in self._forward_hooks.values():
    479             hook_result = hook(self, input, result)

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/rnn.py in forward(self, input, hx)
    190             flat_weight=flat_weight
    191         )
--> 192         output, hidden = func(input, self.all_weights, hx, batch_sizes)
    193         if is_packed:
    194             output = PackedSequence(output, batch_sizes)

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/_functions/rnn.py in forward(input, *fargs, **fkwargs)
    322             func = decorator(func)
    323 
--> 324         return func(input, *fargs, **fkwargs)
    325 
    326     return forward

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/_functions/rnn.py in forward(input, weight, hx, batch_sizes)
    286             batch_first, dropout, train, bool(bidirectional),
    287             list(batch_sizes.data) if variable_length else (),
--> 288             dropout_ts)
    289 
    290         if cx is not None:

RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED

In [98]:
tensor = model(*V(xs)); tensor.size()


Out[98]:
torch.Size([512, 85])

In [99]:
fit(model, mdata, 4, optimizer, F.nll_loss)


epoch      trn_loss   val_loss                              
    0      1.858932   1.841601  
    1      1.686824   1.674297                              
    2      1.586414   1.589504                              
    3      1.527497   1.548447                              

Out[99]:
[array([1.54845])]

In [100]:
set_lrs(opt, 1e-4)
fit(model, mdata, 2, optimizer, F.nll_loss)


epoch      trn_loss   val_loss                              
    0      1.490229   1.523964  
    1      1.486616   1.506196                              

Out[100]:
[array([1.5062])]

4.1 Test model


In [101]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = model(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [102]:
get_next('for thos')


Out[102]:
'e'

In [103]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:] + c
    return res

In [104]:
get_next_n('for thos', 40)


Out[104]:
'for those of the consequence and probably and pr'

5. Multi-output model

5.1 Setup

Lecture 1:58:07

Let's take non-overlapping sets of characters this time.


In [107]:
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(0, len(idx) - cs - 1, cs)]

Then create the exact same thing, offset by 1, as our labels.


In [108]:
c_out_dat = [[idx[i+j] for i in range(cs)] for j in range(1, len(idx) - cs, cs)]

In [109]:
xs = np.stack(c_in_dat)
xs.shape


Out[109]:
(75111, 8)

In [110]:
ys = np.stack(c_out_dat)
ys.shape


Out[110]:
(75111, 8)

In [111]:
xs[:cs, :cs]


Out[111]:
array([[40, 42, 29, 30, 25, 27, 29,  1],
       [ 1,  1, 43, 45, 40, 40, 39, 43],
       [33, 38, 31,  2, 73, 61, 54, 73],
       [ 2, 44, 71, 74, 73, 61,  2, 62],
       [72,  2, 54,  2, 76, 68, 66, 54],
       [67,  9,  9, 76, 61, 54, 73,  2],
       [73, 61, 58, 67, 24,  2, 33, 72],
       [ 2, 73, 61, 58, 71, 58,  2, 67]])

In [112]:
ys[:cs, :cs]


Out[112]:
array([[42, 29, 30, 25, 27, 29,  1,  1],
       [ 1, 43, 45, 40, 40, 39, 43, 33],
       [38, 31,  2, 73, 61, 54, 73,  2],
       [44, 71, 74, 73, 61,  2, 62, 72],
       [ 2, 54,  2, 76, 68, 66, 54, 67],
       [ 9,  9, 76, 61, 54, 73,  2, 73],
       [61, 58, 67, 24,  2, 33, 72,  2],
       [73, 61, 58, 71, 58,  2, 67, 68]])

5.2 Create and train model


In [147]:
val_idx = get_cv_idxs(len(xs) - cs - 1)

In [148]:
mdata = ColumnarModelData.from_arrays('.', val_idx, xs, ys, bs=512)

In [149]:
class CharSeqRNN(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        return F.log_softmax(self.l_out(outp), dim=-1)

In [150]:
model = CharSeqRNN(vocab_size, n_fac).cuda()
optimizer = optim.Adam(model.parameters(), 1e-3)

In [151]:
it = iter(mdata.trn_dl)
*xst, yt = next(it)

In [152]:
def nll_loss_seq(inp, targ):
    sl,bs,nh = inp.size() # 8 x 512 x nhidden 
    targ = targ.transpose(0,1).contiguous().view(-1)
    return F.nll_loss(inp.view(-1, nh), targ)

In [153]:
fit(model, mdata, 4, optimizer, nll_loss_seq)


epoch      trn_loss   val_loss                              
    0      2.625081   2.430088  
    1      2.300546   2.209068                              
    2      2.14247    2.094057                              
    3      2.05383    2.020662                              

Out[153]:
[array([2.02066])]

In [154]:
set_lrs(opt, 1e-4)

In [155]:
fit(model, mdata, 1, optimizer, nll_loss_seq)


epoch      trn_loss   val_loss                              
    0      1.98154    1.969768  

Out[155]:
[array([1.96977])]

5.3 Identity init


In [156]:
model     = CharSeqRNN(vocab_size, n_fac).cuda()
optimizer = optim.Adam(model.parameters(), 1e-2)

In [157]:
m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))


Out[157]:
    1     0     0  ...      0     0     0
    0     1     0  ...      0     0     0
    0     0     1  ...      0     0     0
       ...          ⋱          ...       
    0     0     0  ...      1     0     0
    0     0     0  ...      0     1     0
    0     0     0  ...      0     0     1
[torch.cuda.FloatTensor of size 256x256 (GPU 0)]

In [158]:
fit(model, mdata, 4, optimizer, nll_loss_seq)


epoch      trn_loss   val_loss                              
    0      2.176792   2.035941  
    1      1.959119   1.921122                              
    2      1.881671   1.879322                              
    3      1.84743    1.848952                              

Out[158]:
[array([1.84895])]

In [159]:
set_lrs(optimizer, 1e-3)

In [160]:
fit(model, mdata, 4, opt, nll_loss_seq)


epoch      trn_loss   val_loss                              
    0      1.808441   1.848952  
    1      1.806329   1.848952                              
    2      1.807249   1.848952                              
    3      1.808862   1.848952                              

Out[160]:
[array([1.84895])]

In [161]:
set_lrs(optimizer, 1e-4)

In [162]:
fit(model, mdata, 4, optimizer, nll_loss_seq)


epoch      trn_loss   val_loss                              
    0      1.766545   1.800295  
    1      1.747102   1.790097                              
    2      1.739047   1.784535                              
    3      1.735754   1.780744                              

Out[162]:
[array([1.78074])]

6. Stateful model

Lecture 7


In [168]:
from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *

PATH = 'data/nietzsche/'
TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

## line counting: https://stackoverflow.com/a/3137099
# $ wc -l nietzsche/nietzsche.txt
## splitting: https://stackoverflow.com/a/2016918
# $ split -l 7947 nietzsche/nietzsche.txt
# $ mv xaa nietzsche/trn.txt
# $ mv xab nietzsche/val.txt

%ls {PATH}


nietzsche.txt  trn/  val/

In [169]:
%ls {PATH}trn


trn.txt

In [170]:
TEXT = data.Field(lower=True, tokenize=list) # torchtext
bs = 64; bptt = 8; n_fac = 42; n_hidden = 256

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
mdata = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)

len(mdata.trn_dl), mdata.nt, len(mdata.trn_ds), len(mdata.trn_ds[0].text)


Out[170]:
(942, 55, 1, 482908)

6.2 RNN


In [175]:
class CharSeqStatefulRNN(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        self.vocab_size = vocab_size
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h) # bptt here; throw away hidden state's history
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

In [172]:
m = CharSeqStatefulRNN(mdata.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [173]:
fit(m, mdata, 4, opt, F.nll_loss)


epoch      trn_loss   val_loss                               
    0      1.884467   1.854556  
    1      1.709602   1.698231                               
    2      1.634151   1.640654                               
    3      1.579375   1.601974                               

Out[173]:
[array([1.60197])]

In [174]:
set_lrs(opt, 1e-4)

fit(m, mdata, 4, opt, F.nll_loss)


epoch      trn_loss   val_loss                               
    0      1.496956   1.557951  
    1      1.500723   1.551696                               
    2      1.500952   1.547426                               
    3      1.498358   1.544698                               

Out[174]:
[array([1.5447])]

6.3 RNN loop


In [179]:
# # From pytorch source:
# def RNNCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
#     return F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))

In [195]:
class CharSeqStatefulRNN2(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNNCell(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp = []
        o = self.h
        for c in cs:
            o = self.rnn(self.e(c), o)
            outp.append(o)
        outp = self.l_out(torch.stack(outp))
        self.h = repackage_var(o)
        return F.log_softmax(outp, dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

In [196]:
m = CharSeqStatefulRNN2(mdata.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [197]:
fit(m, mdata, 4, opt, F.nll_loss)


epoch      trn_loss   val_loss                               
    0      1.900145   1.862242  
    1      1.718113   1.714746                               
    2      1.635495   1.642114                               
    3      1.582026   1.598145                               

Out[197]:
[array([1.59814])]

In [199]:
class CharSeqStatefulGRU(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.GRU(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

In [201]:
# # From pytorch source code – for reference

# def GRUCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
#     gi = F.linear(input, w_ih, b_ih)
#     gh = F.linear(hidden, w_hh, b_hh)
#     i_r, i_i, i_n = gi.chunk(3, 1)
#     h_r, h_i, h_n = gh.chunk(3, 1)
    
#     resetgate = F.sigmoid(i_r + h_r)
#     inputgate = F.sigmoid(i_i + h_i)
#     newgate = F.tanh(i_h + resetgate * h_n)
#     return newgate + inputgate * (hidden - newgate)

In [202]:
m = CharSeqStatefulGRU(mdata.nt, n_fac, 512).cuda()

opt = optim.Adam(m.parameters(), 1e-3)

In [203]:
fit(m, mdata, 6, opt, F.nll_loss)


epoch      trn_loss   val_loss                               
    0      1.767768   1.741309  
    1      1.592988   1.591748                               
    2      1.50757    1.533809                               
    3      1.448381   1.495046                               
    4      1.417075   1.476248                               
    5      1.38126    1.464628                               

Out[203]:
[array([1.46463])]

In [204]:
set_lrs(opt, 1e-4)

In [205]:
fit(m, mdata, 3, opt, F.nll_loss)


epoch      trn_loss   val_loss                               
    0      1.295676   1.430624  
    1      1.296015   1.42632                                
    2      1.293869   1.424653                               

Out[205]:
[array([1.42465])]

6.5 Putting it all together LSTM


In [206]:
from fastai import sgdr

n_hidden = 512

In [207]:
class CharSeqStatefulLSTM(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl):
        super().__init__()
        self.vocab_size,self.nl = vocab_size,nl
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h[0].size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = (V(torch.zeros(self.nl, bs, n_hidden)),
                  V(torch.zeros(self.nl, bs, n_hidden)))

In [208]:
m = CharSeqStatefulLSTM(mdata.nt, n_fac, 512, 2).cuda()
lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5)

In [210]:
os.makedirs(f'{PATH}models', exist_ok=True)

In [211]:
fit(m, mdata, 2, lo.opt, F.nll_loss)


epoch      trn_loss   val_loss                              
    0      1.909249   1.801902  
    1      1.769695   1.680969                              

Out[211]:
[array([1.68097])]

In [213]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(mdata.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, mdata, 2**4-1, lo.opt, F.nll_loss, callbacks=cb)


epoch      trn_loss   val_loss                              
    0      1.58525    1.527634  
    1      1.619767   1.540278                              
    2      1.500887   1.457802                              
    3      1.636409   1.550131                              
    4      1.561904   1.500154                              
    5      1.475814   1.435466                              
    6      1.418267   1.401623                              
    7      1.601189   1.53171                               
    8      1.581333   1.507951                              
    9      1.545162   1.485335                              
    10     1.508973   1.452861                              
    11     1.453223   1.420257                              
    12     1.416339   1.39364                               
    13     1.378315   1.37022                               
    14     1.348801   1.357095                              

Out[213]:
[array([1.35709])]

In [215]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(mdata.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, mdata, 2**6-1, lo.opt, F.nll_loss, callbacks=cb)


epoch      trn_loss   val_loss                              
    0      1.345566   1.355601  
    1      1.341196   1.353813                              
    2      1.335737   1.352657                              
    3      1.33972    1.350923                              
    4      1.326929   1.350483                              
    5      1.320362   1.346776                              
    6      1.323201   1.346167                              
    7      1.328892   1.345882                              
    8      1.322246   1.342471                              
    9      1.30893    1.339016                              
    10     1.299575   1.336992                              
    11     1.300942   1.33501                               
    12     1.29522    1.33449                               
    13     1.287398   1.333311                              
    14     1.29294    1.333083                              
    15     1.29369    1.334529                              
    16     1.28323    1.333356                              
    17     1.275178   1.331828                              
    18     1.264822   1.331893                              
    19     1.268571   1.328727                              
    20     1.263555   1.328655                              
    21     1.25193    1.328018                              
    22     1.241885   1.328129                              
    23     1.244996   1.327239                              
    24     1.234549   1.327502                              
    25     1.235547   1.327251                              
    26     1.224019   1.327427                              
    27     1.225533   1.32756                               
    28     1.222705   1.327103                              
    29     1.215969   1.327171                              
    30     1.220966   1.328729                              
    31     1.228109   1.32682                               
    32     1.23929    1.329681                              
    33     1.23149    1.328574                              
    34     1.212139   1.32914                               
    35     1.216606   1.330779                              
    36     1.206941   1.33133                               
    37     1.19608    1.331276                              
    38     1.195772   1.331107                              
    39     1.18828    1.333295                              
    40     1.180462   1.33281                               
    41     1.173319   1.335211                              
    42     1.16803    1.335397                              
    43     1.162477   1.336817                              
    44     1.159715   1.337362                              
    45     1.149905   1.338025                              
    46     1.137541   1.34039                               
    47     1.141907   1.341837                              
    48     1.128822   1.342572                              
    49     1.127333   1.344634                              
    50     1.125221   1.344653                              
    51     1.115946   1.346475                              
    52     1.11484    1.347534                              
    53     1.114747   1.347997                              
    54     1.103783   1.349139                              
    55     1.099163   1.349055                              
    56     1.100835   1.349907                              
    57     1.10285    1.350082                              
    58     1.094202   1.351048                              
    59     1.093313   1.351003                              
    60     1.094888   1.350991                              
    61     1.099354   1.350457                              
    62     1.08658    1.35037                               

Out[215]:
[array([1.35037])]

6.6 Test


In [216]:
def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

In [217]:
get_next('for thos')


Out[217]:
'e'

In [218]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:] + c
    return res

In [219]:
print(get_next_n('for thos', 400))


for those interpressingand antagonism.--i have not that anyone self-truthful in german and his lyproves astakes and cry.stappony; no mortal quality regard          1.      an innorness: men credit in (the espectregalisms, and as amongthe ageness of byings of the old faiths. if a place as this useful, and problems. it is there is, more preparations, aroundly, whenovers (when which one seems our: subplums a

In [220]:
print(get_next_n('the reason', 400))


the reason above rassi=fania ands," therefore, begansly all to and for the history of ideas (that, in a many things--it wrong" like the pathable or in us!50 prato--it had not to perspectivism--and shouldthe enre thatthey require correctively to a thought--excitations of the hypothesis, for a valuations, but direct.20] were men than philosophy "these god civilise moments and as, for example, or christian men

I made a mistake somewhere, the loss should be around 1.25, not 1.35. Anyway, basically works.