At each timestep $t$:
the state is Markov, i.e.: $$ P(x_{t+1} \mid x_t) = P(x_{t+1} \mid x_1, x_2, \dots, x_t) $$
This means the state captures everything you need to know about the system.
Therefore we can model the system as a Markov decision process (MDP).
We don't want to choose actions deterministically, since we want to explore different ways of playing.
The policy must satisfy $P(a_L \mid x) + P(a_R \mid x) = 1$ for all observations, i.e. the agent must choose exactly one of $a_L$ or $a_R$.
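One way to satisfy this constraint (just an illustration here; the scores $s_L$ and $s_R$ are placeholder names, but it is the same softmax construction the policy network below uses) is to squash two action scores through a softmax:
$$ P(a_L \mid x) = \frac{e^{s_L}}{e^{s_L} + e^{s_R}}, \qquad P(a_R \mid x) = \frac{e^{s_R}}{e^{s_L} + e^{s_R}} $$
Both probabilities are positive and sum to 1 by construction.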
Definitions:
Draw $\theta$ from a Gaussian distribution with mean $\mathbf{\mu} = \{ \mu_1, \mu_2, \dots, \mu_K\}$ and axis-aligned (diagonal) variance $\mathbf{\sigma} = \{\sigma_1, \sigma_2, \dots, \sigma_K\}$.
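As a minimal sketch of this draw in PyTorch (the value of $K$ and the choices of $\mathbf{\mu}$ and $\mathbf{\sigma}$ below are made up, and each $\sigma_k$ is treated as a per-parameter standard deviation):
import torch

K = 4                          # number of policy parameters (made-up value)
mu = torch.zeros(K)            # per-parameter means
sigma = torch.full((K,), 0.1)  # per-parameter spread (axis-aligned)

# theta_k ~ N(mu_k, sigma_k^2), drawn independently for each dimension
theta = mu + sigma * torch.randn(K)
print(theta)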
We continue to use a parametrised policy:
$$ \pi(a \mid x; \theta) $$How to compute $\nabla_\theta \E_\tau[R_\tau]$? We have:
$$ \E_\tau[R_\tau \mid \pi; \theta] = \int_\tau R_\tau P(\tau \mid \pi;\theta)\,d\tau $$Take the derivative with respect to $\theta$, moving the gradient inside the integral:
$$ \nabla_\theta \E_\tau[R_\tau \mid \pi; \theta] = \nabla_\theta \int_\tau P(\tau \mid \pi; \theta)R_\tau \, d\tau \\ = \int_\tau \nabla_\theta P(\tau \mid \pi; \theta)R_\tau \, d\tau $$Let's compute the gradient for a single concrete trajectory $\tau = (x_0, a_0, r_1, x_1, \dots, x_{T-1}, a_{T-1}, r_T, x_T)$. Its probability factorises as:
$$ P(\tau \mid \pi; \theta) = p(x_0) \, \prod_{t=0}^{T-1} p(a_t \mid x_t; \theta)\, p(r_{t+1} \mid a_t, x_t)\, p(x_{t+1} \mid a_t, x_t) $$
The only terms that depend on $\theta$ are the action terms, so we have:
$$ \nabla_\theta P(\tau \mid \pi; \theta) = p(x_0) \, \prod_{t=0}^{T-1} \left( p(r_{t+1} \mid a_t, x_t)\, p(x_{t+1} \mid a_t, x_t) \right) \, \nabla_\theta \prod_{t=0}^{T-1} p(a_t \mid x_t; \theta) $$Similarly, the reward for the trajectory is:
$$ R_\tau = \sum_{t=1}^T r_t $$Using the log-derivative trick, $\nabla_\theta P(\tau \mid \pi; \theta) = P(\tau \mid \pi; \theta)\, \nabla_\theta \log P(\tau \mid \pi; \theta)$, define the gradient estimate for this trajectory as:
$$ \hat{g}(\tau) := R_\tau \nabla_\theta \log P(\tau \mid \pi; \theta ) $$Substituting back into the integral above shows that this estimate is unbiased:
$$ \E_\tau[\hat{g}(\tau) \mid \pi; \theta] = \int_\tau P(\tau \mid \pi; \theta)\, R_\tau\, \nabla_\theta \log P(\tau \mid \pi; \theta)\, d\tau = \nabla_\theta \E_\tau[ R_\tau \mid \pi; \theta ] $$We are using a neural net to represent the policy, i.e. to represent $P(a \mid x; \theta)$.
The gradient $\nabla_\theta \log P(a \mid x; \theta)$ can therefore be obtained by back-propagating the log-probability of the sampled action through the network. Overall, we want to do a gradient ascent update:
$$ \theta \leftarrow \theta + \alpha \nabla_\theta \E_\tau[R_\tau] $$So, we want to draw a sample of:
$$ \nabla_\theta \E_\tau[R_\tau] $$This sample is $\hat{g}(\tau)$, i.e.:
$$ \hat{g}(\tau) = R_\tau \nabla_\theta \log P(\tau \mid \pi; \theta) \\ = \left( \sum_{t'=1}^T r_{t'} \right) \sum_{t=0}^{T-1} \nabla_\theta \log P(a_t \mid x_t; \theta) $$
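As a rough sketch of how this estimator maps onto code (a one-step "trajectory" with a made-up observation, reward, and learning rate; it uses torch.distributions.Categorical and a hand-written update rather than the older .reinforce() call in the cells below):
import torch
from torch import nn
import torch.nn.functional as F
from torch.distributions import Categorical

torch.manual_seed(123)
policy_net = nn.Linear(2, 2)             # tiny policy: 2 features -> 2 action scores
alpha = 0.1                              # made-up learning rate

x = torch.FloatTensor([[1.0, 0.0]])      # made-up observation
probs = F.softmax(policy_net(x), dim=1)  # pi(a | x; theta)
dist = Categorical(probs)
a = dist.sample()                        # sample an action from the policy
log_prob = dist.log_prob(a)              # log pi(a | x; theta)

R = 1.0                                  # made-up return R_tau for this one-step trajectory
loss = -(R * log_prob).sum()             # surrogate loss whose gradient is -g_hat(tau)
policy_net.zero_grad()
loss.backward()
with torch.no_grad():
    for param in policy_net.parameters():
        param -= alpha * param.grad      # theta <- theta + alpha * g_hat(tau)
Descending the surrogate loss $-R_\tau \log \pi(a \mid x; \theta)$ is the same as ascending along $\hat{g}(\tau)$, which is exactly the update we wanted.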
In [73]:
import torch
from torch import autograd, nn, optim
import torch.nn.functional as F


class Policy(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.h1 = nn.Linear(input_size, output_size)

    def forward(self, x):
        x = self.h1(x)
        p = F.softmax(x)
        # sample an action index from the softmax distribution; with the old
        # (pre-0.4) stochastic-Variable API this sampling node supports .reinforce()
        a = torch.multinomial(p)
        return a


torch.manual_seed(123)
policy = Policy(2, 2)

# let's imagine that if the input is [1, 0], action should be 0
# if the input is [0, 1], action should be 1
# just something simple for now...
# we can do this supervised...
training = [
    {'input': [1, 0], 'target': 0},
    {'input': [0, 1], 'target': 1},
]

opt = optim.Adam(params=policy.parameters(), lr=0.1)
correct_since_last_print = 0
num_epochs = 100
print_every = num_epochs // 10
for epoch in range(num_epochs):
    for ex in training:
        policy.zero_grad()
        input = torch.FloatTensor(ex['input']).view(1, -1)
        target = ex['target']
        a = policy(autograd.Variable(input))
        reward = 0
        if a.data[0][0] == target:
            reward = 1
            correct_since_last_print += 1
        if epoch % print_every == 0:
            print('input', ex['input'], 'target', target, 'a', a.data[0][0], a.data[0][0] == target)
        # reward the sampled action, then backprop: this produces
        # reward * grad log p(a | x; theta), i.e. the REINFORCE gradient estimate
        a.reinforce(torch.FloatTensor([[reward]]))
        autograd.backward([a], [None])
        opt.step()
    if epoch % print_every == 0:
        print('acc', correct_since_last_print / print_every / 2)
        correct_since_last_print = 0
In [25]:
import torch
from torch import autograd

# gradient of (a + 1) with respect to a is 1
a = autograd.Variable(torch.FloatTensor([1.5]), requires_grad=True)
print('a', a)
b = a + 1
b.backward()
print('a.grad', a.grad)

# gradient of log(a) with respect to a is 1 / a
a = autograd.Variable(torch.FloatTensor([1.5]), requires_grad=True)
b = torch.log(a)
b.backward()
print('a.grad', a.grad)

# indexing before the log: only the selected element gets a gradient
a = autograd.Variable(torch.FloatTensor([1.5, 2.5]), requires_grad=True)
b = torch.log(a[0])
b.backward()
print('a.grad', a.grad)

a = autograd.Variable(torch.FloatTensor([1.5, 2.5]), requires_grad=True)
b = torch.log(a[1])
b.backward()
print('a.grad', a.grad)
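For reference, these gradients follow directly from the derivative of the logarithm, and only the indexed element receives any gradient:
$$ \frac{\partial}{\partial a}(a + 1) = 1, \qquad \frac{\partial}{\partial a} \log a = \frac{1}{a}, \quad \text{so } \frac{1}{1.5} \approx 0.667 \text{ and } \frac{1}{2.5} = 0.4 $$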