Building an RNN in PyTorch

In this notebook, I'll construct a character-level RNN with PyTorch. If you are unfamiliar with character-level RNNs, check out this great article by Andrej Karpathy. The network will train character by character on some text, then generate new text character by character. As an example, I will train on Anna Karenina, one of my favorite novels. I call this project Anna KaRNNa.

In [1]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
from torch.autograd import Variable

In [2]:
with open('anna.txt', 'r') as f:
    text = f.read()

Now that we have the text, we encode it as integers.

In [3]:
chars = tuple(set(text))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}
encoded = np.array([char2int[ch] for ch in text])
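
To make the mapping concrete, here is a minimal sketch of the same round trip on a toy string. The toy string and the use of sorted are my assumptions for illustration. The cell above uses tuple(set(text)), whose order can differ between Python sessions, so it helps that the checkpoint saved later stores the tokens alongside the weights.

```python
# Toy stand-in for the novel text (hypothetical example string)
toy_text = "anna"

# sorted() gives a reproducible order; this is an assumption of the sketch
toy_chars = tuple(sorted(set(toy_text)))
toy_int2char = dict(enumerate(toy_chars))
toy_char2int = {ch: ii for ii, ch in toy_int2char.items()}

# Encode each character as its integer index
toy_encoded = [toy_char2int[ch] for ch in toy_text]

# Decoding with the inverse mapping recovers the original string
toy_decoded = ''.join(toy_int2char[ii] for ii in toy_encoded)
print(toy_chars, toy_encoded, toy_decoded)
```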

Processing the data

We're one-hot encoding the data, so I'll make a function to do that.

I'll also create mini-batches for training. We'll take the encoded characters and split them into multiple sequences, given by n_seqs (also referred to as "batch size" in other places). Each of those sequences will be n_steps long.

In [5]:
def one_hot_encode(arr, n_labels):
    # Initialize the encoded array
    one_hot = np.zeros((np.multiply(*arr.shape), n_labels), dtype=np.float32)
    # Fill the appropriate elements with ones
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    # Finally reshape it to get back to the original array
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    return one_hot
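
As a quick check of what one-hot encoding produces, here is a small self-contained sketch on a toy array. The name one_hot_sketch and the toy labels are hypothetical. This variant counts elements with flat.size, so it is a close cousin of one_hot_encode rather than a copy of it.

```python
import numpy as np

def one_hot_sketch(arr, n_labels):
    # One row per element, with a single 1.0 at that element's label index
    flat = arr.reshape(-1)
    one_hot = np.zeros((flat.size, n_labels), dtype=np.float32)
    one_hot[np.arange(flat.size), flat] = 1.0
    # Restore the original leading dimensions and append the label axis
    return one_hot.reshape(arr.shape + (n_labels,))

# 2 sequences x 3 steps of integer labels drawn from 4 possible characters
toy = np.array([[0, 2, 1],
                [3, 3, 0]])
encoded_toy = one_hot_sketch(toy, n_labels=4)
print(encoded_toy.shape)  # (2, 3, 4)
```

Taking the argmax over the last axis recovers the original integer labels, which is a handy sanity check.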

In [6]:
def get_batches(arr, n_seqs, n_steps):
    '''Create a generator that returns mini-batches of size
       n_seqs x n_steps from arr.
    '''
    batch_size = n_seqs * n_steps
    n_batches = len(arr)//batch_size
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * batch_size]
    # Reshape into n_seqs rows
    arr = arr.reshape((n_seqs, -1))
    for n in range(0, arr.shape[1], n_steps):
        # The features
        x = arr[:, n:n+n_steps]
        # The targets, shifted by one
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+n_steps]
        except IndexError:
            # Last batch: wrap around to the first column for the final target
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y
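
To see the "targets are the inputs shifted by one" idea on numbers you can read, here is a compact sketch that builds the same kind of batches with np.roll. It is an alternative formulation I'm assuming for illustration, not the function above, and shifted_batches is a hypothetical name.

```python
import numpy as np

def shifted_batches(arr, n_seqs, n_steps):
    # Keep only enough elements to make full batches, then use n_seqs rows
    per_batch = n_seqs * n_steps
    n_batches = len(arr) // per_batch
    arr = arr[:n_batches * per_batch].reshape((n_seqs, -1))
    # Shift every row left by one; the last target wraps to the row's first element
    targets = np.roll(arr, -1, axis=1)
    for n in range(0, arr.shape[1], n_steps):
        yield arr[:, n:n+n_steps], targets[:, n:n+n_steps]

batches = list(shifted_batches(np.arange(20), n_seqs=2, n_steps=5))
for x, y in batches:
    print(x)
    print(y)
```

With 20 integers, 2 sequences and 5 steps, this yields two batches. In the second batch the final target of each row wraps around to the start of that row.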

Defining the network with PyTorch

Here I'll use PyTorch to define the architecture of the network. We start by defining the layers and operations we want, then define a method for the forward pass. I'm also going to write a method for predicting characters.

In [7]:
class CharRNN(nn.Module):
    def __init__(self, tokens, n_steps=100, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}
        self.dropout = nn.Dropout(drop_prob)
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
        self.fc = nn.Linear(n_hidden, len(self.chars))
    def forward(self, x, hc):
        ''' Forward pass through the network '''
        x, (h, c) = self.lstm(x, hc)
        x = self.dropout(x)
        # Stack up LSTM outputs
        x = x.view(x.size()[0]*x.size()[1], self.n_hidden)
        x = self.fc(x)
        return x, (h, c)
    def predict(self, char, h=None, cuda=False, top_k=None):
        ''' Given a character, predict the next character.
            Returns the predicted character and the hidden state.
        '''
        # Move the network to the requested device
        if cuda:
            self.cuda()
        else:
            self.cpu()
        if h is None:
            h = self.init_hidden(1)
        x = np.array([[self.char2int[char]]])
        x = one_hot_encode(x, len(self.chars))
        inputs = Variable(torch.from_numpy(x), volatile=True)
        if cuda:
            inputs = inputs.cuda()
        h = tuple([Variable(each.data, volatile=True) for each in h])
        out, h = self.forward(inputs, h)

        p = F.softmax(out, dim=1).data
        if cuda:
            p = p.cpu()
        if top_k is None:
            top_ch = np.arange(len(self.chars))
        else:
            p, top_ch = p.topk(top_k)
            top_ch = top_ch.numpy().squeeze()
        p = p.numpy().squeeze()
        char = np.random.choice(top_ch, p=p/p.sum())
        return self.int2char[char], h
    def init_weights(self):
        ''' Initialize weights for fully connected layer '''
        initrange = 0.1
        # Set bias tensor to all zeros
        self.fc.bias.data.fill_(0)
        # FC weights as random uniform
        self.fc.weight.data.uniform_(-1, 1)
    def init_hidden(self, n_seqs):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x n_seqs x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        return (Variable(weight.new(self.n_layers, n_seqs, self.n_hidden).zero_()),
                Variable(weight.new(self.n_layers, n_seqs, self.n_hidden).zero_()))

In [18]:
def train(net, data, epochs=10, n_seqs=10, n_steps=50, lr=0.001, clip=5, val_frac=0.1, cuda=False, print_every=10):
    ''' Train a network
        net: CharRNN network
        data: text data to train the network
        epochs: Number of epochs to train
        n_seqs: Number of mini-sequences per mini-batch, aka batch size
        n_steps: Number of character steps per mini-batch
        lr: learning rate
        clip: gradient clipping
        val_frac: Fraction of data to hold out for validation
        cuda: Train with CUDA on a GPU
        print_every: Number of steps for printing training and validation loss
    '''
    # Put the network in training mode so dropout is active
    net.train()
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    # create training and validation data
    val_idx = int(len(data)*(1-val_frac))
    data, val_data = data[:val_idx], data[val_idx:]
    if cuda:
        net.cuda()
    counter = 0
    n_chars = len(net.chars)
    for e in range(epochs):
        h = net.init_hidden(n_seqs)
        for x, y in get_batches(data, n_seqs, n_steps):
            counter += 1
            # One-hot encode our data and make them Torch tensors
            x = one_hot_encode(x, n_chars)
            x, y = torch.from_numpy(x), torch.from_numpy(y)
            inputs, targets = Variable(x), Variable(y)
            if cuda:
                inputs, targets = inputs.cuda(), targets.cuda()

            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            h = tuple([Variable(each.data) for each in h])

            # Clear gradients accumulated from the previous step
            net.zero_grad()

            output, h = net.forward(inputs, h)
            loss = criterion(output, targets.view(n_seqs*n_steps))
            loss.backward()

            # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()

            if counter % print_every == 0:
                # Get validation loss
                val_h = net.init_hidden(n_seqs)
                val_losses = []
                for x, y in get_batches(val_data, n_seqs, n_steps):
                    # One-hot encode our data and make them Torch tensors
                    x = one_hot_encode(x, n_chars)
                    x, y = torch.from_numpy(x), torch.from_numpy(y)
                    # Creating new variables for the hidden state, otherwise
                    # we'd backprop through the entire training history
                    val_h = tuple([Variable(each.data, volatile=True) for each in val_h])
                    inputs, targets = Variable(x, volatile=True), Variable(y, volatile=True)
                    if cuda:
                        inputs, targets = inputs.cuda(), targets.cuda()

                    output, val_h = net.forward(inputs, val_h)
                    val_loss = criterion(output, targets.view(n_seqs*n_steps))
                    val_losses.append(val_loss.item())
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))

Time to train

Now we can actually train the network. First we'll create the network itself, with some given hyperparameters. Then we define the mini-batch sizes (number of sequences and number of steps) and start the training. With the train function, we can set the number of epochs, the learning rate, and other parameters. We can also run the training on a GPU by setting cuda=True.
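
One of those other parameters is clip, which the training loop passes to PyTorch's gradient-norm clipping. As a rough NumPy sketch of what that rescaling does (my own illustration under simple assumptions, not PyTorch's implementation; clip_by_global_norm and the toy gradients are hypothetical):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    # Norm of all gradients taken together, as if concatenated into one vector
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Only ever shrink: the scale factor is capped at 1
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Two toy "parameter gradients" whose combined norm is 5
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

Gradients whose combined norm is already below the threshold pass through unchanged, so clipping only kicks in on the occasional very large update.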

In [19]:
if 'net' in locals():
    del net

In [20]:
net = CharRNN(chars, n_hidden=512, n_layers=2)

In [ ]:
n_seqs, n_steps = 128, 100
train(net, encoded, epochs=25, n_seqs=n_seqs, n_steps=n_steps, lr=0.001, cuda=True, print_every=10)

Getting the best model

To set your hyperparameters to get the best performance, you'll want to watch the training and validation losses. If your training loss is much lower than the validation loss, you're overfitting: increase regularization (more dropout) or use a smaller network. If the training and validation losses are close, you're likely underfitting, so you can increase the size of the network.

After training, we'll save the model so we can load it again later if we need to. Here I'm saving the parameters needed to create the same architecture, the hidden layer hyperparameters and the text characters.

In [17]:
checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}
with open('', 'wb') as f:
    torch.save(checkpoint, f)


Now that the model is trained, we'll want to sample from it. To sample, we pass in a character and have the network predict the next character. Then we take that character, pass it back in, and get another predicted character. Just keep doing this and you'll generate a bunch of text!

Top K sampling

Our predictions come from a categorical probability distribution over all the possible characters. We can make the sampled text more reasonable but less variable by only considering the $K$ most probable characters. This will prevent the network from giving us completely absurd characters while allowing it to introduce some noise and randomness into the sampled text.
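
Here is a minimal NumPy sketch of top-K sampling on a made-up probability vector. The name sample_top_k, the toy probabilities, and the seeded generator are assumptions for illustration. The idea is to keep the K largest probabilities, renormalize them so they sum to one, and draw from that reduced distribution.

```python
import numpy as np

def sample_top_k(probs, k, rng):
    probs = np.asarray(probs, dtype=np.float64)
    # Indices of the k most probable entries
    top_idx = np.argsort(probs)[-k:]
    # Renormalize so the kept probabilities sum to 1
    top_p = probs[top_idx] / probs[top_idx].sum()
    return int(rng.choice(top_idx, p=top_p))

rng = np.random.default_rng(0)
probs = [0.02, 0.40, 0.03, 0.35, 0.20]
draws = [sample_top_k(probs, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

With k=2 only the two most likely indices can ever be drawn, while the choice between them stays random.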

Typically you'll want to prime the network so you can build up a hidden state. Otherwise the network will start out generating characters at random. In general the first bunch of characters will be a little rough since it hasn't built up a long history of characters to predict from.

In [15]:
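def sample(net, size, prime='The', top_k=None, cuda=False):
    if cuda:
        net.cuda()
    else:
        net.cpu()

    # Switch to evaluation mode so dropout is disabled while sampling
    net.eval()

    # First off, run through the prime characters
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = net.predict(ch, h, cuda=cuda, top_k=top_k)
    # Keep the character predicted after the last prime character
    chars.append(char)

    # Now pass in the previous character and get a new one
    for ii in range(size):
        char, h = net.predict(chars[-1], h, cuda=cuda, top_k=top_k)
        chars.append(char)

    return ''.join(chars)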

In [16]:
print(sample(net, 2000, prime='Anna', top_k=5, cuda=False))

Anna, and she seemed to
have a tears of himself, but as though he did not tell them his face. He
was not settled at him, and with whom he had already sat down, and that
it was impossible to say."

"That's not somewhere the point to her." They despring them at that
mistake and him. The marshes was the chincer, was in a glance of
assisted at the corricted service. But when she had so given on the

"I have been always to the than except into a bare, but there,
as it is to be supposed." Again he they had not all to be
about the mistares and thousands as so towards it, and what the
mather was to the carriage with silence, and walked a love with which
he had true, that he had an acquaintance. The creeming
sense of the colorel sorts, and at the comminterness of his own
her hands, and had belonged to the peasants, he had a country peasant
his breathed handed, he sat sight of the same time as his friends, to see
where the household has taken in out of the same time and his sister
in a long wife.

"I'm not going, and you must be definite that, and I've taken up in the matter,
that the perplehey was to go and should be doing. This was an
official serious towards the same ways, and shaking it the child.

"I should have sent on a prince," the point said, at some time, he would
be thinking of that there was so a good first and happy before her former
and staying at his breaking and, and the divorce, there was nothing and
that struck any other self-conventional and annight for him
and the misingry, and alone was a good hair, and talking away with
her. And telling him to an except only a lighted attention of sorried
soffed and her, and had so callly the province was sitting in the
children at the trumples of the matter, struck him to see that he were
to go out of the path that shoted a look of half-pate, was still all to
holital. Sergey Ivanovitch was the same tell him with the contrary
to the crowd of his coat and wondering that it had
all suddened and so well of him.

The so

Loading a checkpoint

In [47]:
with open('', 'rb') as f:
    checkpoint = torch.load(f)
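loaded = CharRNN(checkpoint['tokens'], n_hidden=checkpoint['n_hidden'], n_layers=checkpoint['n_layers'])
# Restore the trained weights saved in the checkpoint
loaded.load_state_dict(checkpoint['state_dict'])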

In [51]:
print(sample(loaded, 2000, cuda=True, top_k=5, prime="And Levin said"))

And Levin said to her.

The simplest shame was not happy, and that the marshal, with her
strange side would as a chance, and with his head horses, and the point of
timp she fancied that something was stuck by the secrets were as he
wanted, the common. They were continually
to grief the clever, that with service of the second, this was his

"You say it's supposing it, I'm all simpler, though I can't see you thank God
in that clear woman went to the same to the sense of hands of
a significance after things about her the servant where a shinest must
doing the children that suddenly came to be a shall on the soul it will
change yesterday!" he asked, stranging at the post, but she said to him, and
he stupid at him, and would have been deserved in all this time as her stop,
throbbing the propinting hers in his eyes, as his white stock on the
wind of the sound of the drawing room, he walked to the cases, he fell that
she would be decided, and the prince was
complicated them all at once, he came in the world. He stood as
answer that she did not know to him; he was supposing
an else in haste to him, his wife showed her as after
social sturies, the most only
of the same and her lips seemed to her. And the crimson went up
for the conviction of the same trive to tell him of all them.

"Why is you take his horse in him.... I've been many poor of socious
professs of him. If I could not speak by the patures above it?" he said
at the party of the person on the colore,
tried to dinner the captar, a precious stand, shateful the
forest had to be so anything as he was surprised, and the sound was
towards into his fect in the compart. The carriage carried often all
today they should be suddred and saying all the chief condection of
it. A store of a peculiar fell work as the sick man would have been in
condition to see in the side of home the low constarations that he
should have asked it, but he was not to be so in society of the classes
with such a stolen of surplance with an answers of think a