Policy Gradient

Created by Jonas Busk ([jbusk@dtu.dk](mailto:jbusk@dtu.dk)).

In this part, we will create an agent that can learn to solve tasks from OpenAI Gym by applying the Policy Gradient method. We will implement the agent with a probabilistic policy that, given a state of the environment, $s$, outputs a probability distribution over the available actions, $a$:

$$ p_\theta(a|s) $$

Since this is a deep learning course, we will model the policy as a neural network with parameters $\theta$ and train it with gradient descent (now the name 'Policy Gradient' should start to make sense). When the set of available actions is discrete, we can use a network with softmax output.

The core idea of training the policy network is simple: we want to maximize the expected total reward by increasing the probability of good actions and decreasing the probability of bad actions.

The expectation over the (discounted) total reward, $R$, is:

$$ \mathbb{E}[R|\theta] = \int p_\theta({\bf a}|{\bf s}) R({\bf a}) d{\bf a} \ , $$

where ${\bf a} = a_1,\ldots,a_T$, ${\bf s}=s_1,\ldots,s_T$.

Then we can use the gradient to maximize the total reward:

$$ \begin{align} \nabla_\theta \mathbb{E}[R|\theta] &= \nabla_\theta \int p_\theta({\bf a}|{\bf s}) R({\bf a}) \, d{\bf a} \\ &= \int \nabla_\theta p_\theta({\bf a}|{\bf s}) R({\bf a}) \, d{\bf a} \\ &= \int p_\theta({\bf a}|{\bf s}) \nabla_\theta \log p_\theta({\bf a}|{\bf s}) R({\bf a}) \, d{\bf a} \\ &= \mathbb{E}[R({\bf a}) \nabla_\theta \log p_\theta({\bf a}|{\bf s})] \end{align} $$

using the identity

$$ \nabla_\theta p_\theta({\bf a}|{\bf s}) = p_\theta({\bf a}|{\bf s}) \nabla_\theta \log p_\theta({\bf a}|{\bf s}) $$

to express the gradient as an average over $p_\theta({\bf a}|{\bf s})$.
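
The identity itself is just the chain rule applied to the logarithm:

$$ \nabla_\theta \log p_\theta({\bf a}|{\bf s}) = \frac{\nabla_\theta p_\theta({\bf a}|{\bf s})}{p_\theta({\bf a}|{\bf s})} \quad \Longleftrightarrow \quad \nabla_\theta p_\theta({\bf a}|{\bf s}) = p_\theta({\bf a}|{\bf s}) \, \nabla_\theta \log p_\theta({\bf a}|{\bf s}) \ . $$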

We cannot evaluate the average over roll-outs analytically, but we have an environment simulator that, when supplied with our current policy $p_\theta(a|s)$, can return the sequence of actions, states and rewards. This allows us to replace the integral by a Monte Carlo average over $V$ roll-outs:

$$ \nabla_\theta \mathbb{E}[R|\theta] \approx \frac{1}{V} \sum_{v=1}^V \nabla_\theta \log p_\theta({\bf a}^{(v)}|{\bf s}^{(v)}) R({\bf a}^{(v)}) \ . $$

In practice, to reduce the variance of the gradient, instead of $R$, we use the adjusted discounted future reward, also known as the advantage, $A$:

$$ A_t = R_t - b_t \ , $$

where the baseline, $b_t$, is the (discounted) total future reward at timestep $t$ averaged over the $V$ roll-outs:

$$ b_t = \frac{1}{V} \sum_{v=1}^V R_t^{(v)} \ . $$

This way we are always encouraging and discouraging roughly half of the performed actions, which gives us the final gradient estimator:

$$ \nabla_\theta \mathbb{E}[R|\theta] \approx \frac{1}{V} \sum_{v=1}^V \nabla_\theta \log p_\theta({\bf a}^{(v)}|{\bf s}^{(v)}) A({\bf a}^{(v)}) $$
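
In code, this estimator becomes a simple surrogate loss: the negative mean of the log-probabilities of the performed actions, weighted by their advantages, so that minimizing the loss with gradient descent performs gradient ascent on the expected reward. Below is a minimal NumPy sketch with made-up numbers (the TensorFlow loss defined later in this notebook has exactly the same form):

# surrogate policy gradient loss (illustrative sketch with made-up values)
import numpy as np
log_probs = np.log(np.array([0.2, 0.7, 0.5]))  # log p_theta(a|s) of the actions actually taken
advantages = np.array([1.3, -0.4, 0.8])        # advantages A of those actions
loss = -np.mean(log_probs * advantages)        # minimizing this ascends the expected reward
print(loss)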

And that's it! Please refer to [this blog post by Karpathy](http://karpathy.github.io/2016/05/31/rl/) for more discussion of the Policy Gradient method.

--

Note: For simple reinforcement learning problems (like the one we will address in this exercise) there are simpler methods that work just fine. However, the Policy Gradient method has been shown to also work well for complex problems with high dimensional inputs and many parameters, where simple methods become inadequate.

Policy Gradient code


In [1]:
# imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.python.ops.nn import relu, softmax
import gym
#from utils import Viewer

In this lab the starting point is the CartPole-v0 environment from OpenAI Gym; in the code below it has been swapped for a 6x6 Minesweeper environment (`minesweeper_tk`). Later you can change the code to explore other environments and solve different tasks.

Note: The policy implemented in this notebook is designed to work on environments with a discrete action space. Extending the code to also handle environments with a continuous action space is left as an optional exercise.


In [2]:
# create gym environment
#env = gym.make('CartPole-v0')

Let us see how the environment looks when we just take random actions.


In [3]:
# demo the environment
#env.reset() # reset the environment
#view = Viewer(env, custom_render=True) # we use this custom viewer to render the environment inline in the notebook
#for _ in range(200):
#    view.render()
#    env.render() # uncomment this to use gym's own render function
#    env.step(env.action_space.sample()) # take a random action
#view.render(close=True, display_gif=True) # display the environment inline in the notebook
#env.render(close=True) # uncomment this to use gym'm own render function

Taking random actions does not do a very good job of solving the task. Let us now apply the Policy Gradient method described above!

To start with, our policy will be a rather simple neural network with a couple of hidden layers. The input and output sizes are set below to match the state and action spaces of the environment (here hard-coded for the 6x6 Minesweeper board); in general they can be retrieved from the environment.


In [4]:
# setup policy network
n_inputs = 360
n_hidden = 150
n_hidden2 = 100
n_hidden3 = 100
n_outputs = 6*6


tf.reset_default_graph()

states_pl = tf.placeholder(tf.float32, [None, n_inputs], name='states_pl')
actions_pl = tf.placeholder(tf.int32, [None, 2], name='actions_pl')  # (batch index, action index) pairs for tf.gather_nd
advantages_pl = tf.placeholder(tf.float32, [None], name='advantages_pl')
learning_rate_pl = tf.placeholder(tf.float32, name='learning_rate_pl')

l_hidden = tf.layers.dense(inputs=states_pl, units=n_hidden, activation=relu, name='l_hidden')
l_hidden2 = tf.layers.dense(inputs=l_hidden, units=n_hidden2, activation=relu, name='l_hidden2')
#l_hidden3 = tf.layers.dense(inputs=l_hidden2, units=n_hidden3, activation=relu, name='l_hidden3')
l_out = tf.layers.dense(inputs=l_hidden2, units=n_outputs, activation=softmax, name='l_out')

# print network
print('states_pl:', states_pl.get_shape())
print('actions_pl:', actions_pl.get_shape())
print('advantages_pl:', advantages_pl.get_shape())
print('l_hidden:', l_hidden.get_shape())
print('l_hidden2:', l_hidden2.get_shape())
#print('l_hidden3:', l_hidden3.get_shape())
print('l_out:', l_out.get_shape())


states_pl: (?, 360)
actions_pl: (?, 2)
advantages_pl: (?,)
l_hidden: (?, 150)
l_hidden2: (?, 100)
l_out: (?, 36)

In [5]:
# define loss and optimizer
# gather_nd picks out the probability of each performed action; its log is weighted
# by the advantage, and minimizing the negative mean ascends the expected reward
loss_f = -tf.reduce_mean(tf.multiply(tf.log(tf.gather_nd(l_out, actions_pl)), advantages_pl))

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_pl)
train_f = optimizer.minimize(loss_f)

saver = tf.train.Saver() # we use this later to save the model

In [6]:
# test forward pass
from minesweeper_tk import Minesweeper
env = Minesweeper(display=False, ROWS = 6, COLS = 6, MINES = 7, rewards = {"win" : 100, "loss" : -10, "progress" : 1, "noprogress" : -3})
state = env.stateConverter(env.get_state()).flatten()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    action_probabilities = sess.run(fetches=l_out, feed_dict={states_pl: [state]})
print(action_probabilities)


[[ 0.02717245  0.03074444  0.02910808  0.02701524  0.02763182  0.02621428
   0.03039241  0.02795587  0.01859163  0.02113014  0.02862343  0.02655763
   0.04251134  0.02349014  0.02943581  0.02649014  0.02733519  0.03032993
   0.03046167  0.0249031   0.04035928  0.02424078  0.01707526  0.02313183
   0.03824328  0.02227735  0.02332473  0.02926064  0.02600256  0.02380204
   0.0260513   0.03191482  0.02403162  0.03425055  0.02755087  0.03238829]]

Note: As we create our solution, we will make very few assumptions about the environment. We aim to develop a general model for solving reinforcement learning problems, and therefore care little about the specific meaning of the inputs and outputs.


In [7]:
# helper functions

def get_rollout(sess, env, rollout_limit=None, stochastic=False, seed=None):
    """Generate rollout by iteratively evaluating the current policy on the environment."""
    rollout_limit = rollout_limit or env.spec.timestep_limit
    
    env.reset()
    s = env.stateConverter(env.get_state()).flatten()
    states, actions, rewards = [], [], []
    for i in range(rollout_limit):
        a = get_action(sess, s, stochastic)
        s1, r, done, _ = env.step(a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s1
        if done: break
    return states, actions, rewards, i+1

def get_action(sess, state, stochastic=False):
    """Choose an action, given a state, with the current policy network."""
    # get action probabilities
    a_prob = sess.run(fetches=l_out, feed_dict={states_pl: np.atleast_2d(state)})
    if stochastic:
        # sample an action from the distribution (inverse-CDF sampling via cumulative sum)
        return (np.cumsum(np.asarray(a_prob)) > np.random.rand()).argmax()
    else:
        # select action with highest probability
        return a_prob.argmax()

def get_advantages(rewards, rollout_limit, discount_factor, eps=1e-12):
    """Compute advantages"""
    returns = get_returns(rewards, rollout_limit, discount_factor)
    # standardize columns of returns to get advantages
    advantages = (returns - np.mean(returns, axis=0)) / (np.std(returns, axis=0) + eps)
    # restore original rollout lengths
    advantages = [adv[:len(rewards[i])] for i, adv in enumerate(advantages)]
    return advantages

def get_returns(rewards, rollout_limit, discount_factor):
    """Compute the cumulative discounted rewards, a.k.a. returns."""
    returns = np.zeros((len(rewards), rollout_limit))
    for i, r in enumerate(rewards):
        returns[i, len(r) - 1] = r[-1]
        for j in reversed(range(len(r)-1)):
            returns[i,j] = r[j] + discount_factor * returns[i,j+1]
    return returns
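
As a quick sanity check of the helpers above, here is a small usage sketch with made-up rewards (no TensorFlow session is needed, since these functions only use NumPy):

# sanity check of get_returns / get_advantages on a made-up reward sequence
toy_rewards = [[1.0, 1.0, 1.0]]
print(get_returns(toy_rewards, 3, 0.5))      # expect [[1.75, 1.5, 1.0]], since 1 + 0.5*(1 + 0.5*1) = 1.75
print(get_advantages(toy_rewards, 3, 0.5))   # with a single roll-out the baseline equals the return, so all advantages are ~0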

In [ ]:
# training settings

epochs = 10000 # number of training batches
batch_size = 200 # number of timesteps in a batch
rollout_limit = 25 # max rollout length
discount_factor = 1.0 # reward discount factor (gamma), 1.0 = no discount
learning_rate = 0.005 # you know this by now
early_stop_loss = 0 # stop training if loss < early_stop_loss, 0 or False to disable

# train policy network

try:
    statistics = []
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print('start training')
        for epoch in range(epochs):
            # generate rollouts until batch_size total timesteps are collected
            states, actions, rewards = [], [], []
            timesteps = 0
            while timesteps < batch_size:
                _rollout_limit = min(rollout_limit, batch_size - timesteps) # limit rollout to match batch_size
                s, a, r, t = get_rollout(sess, env, _rollout_limit, stochastic=True, seed=epoch)            
                states.append(s)
                actions.append(a)
                rewards.append(r)
                timesteps += t
            # compute advantages
            advantages = get_advantages(rewards, rollout_limit, discount_factor)
            # policy gradient update
            loss, _ = sess.run(fetches=[loss_f, train_f], feed_dict={
                states_pl: np.concatenate(states),
                actions_pl: np.column_stack((np.arange(timesteps), np.concatenate(actions))),
                advantages_pl: np.concatenate(advantages),
                learning_rate_pl: learning_rate
            })            
            # validation
            val_rewards = [get_rollout(sess, env, rollout_limit, stochastic=True, seed=(epochs+i))[2] for i in range(10)]
            # store and print training statistics
            mtr = np.mean([np.sum(r) for r in rewards])
            mvr = np.mean([np.sum(r) for r in val_rewards])
            statistics.append((mtr, mvr, loss))
            if epoch % 20 == 0:
                print('%4d. training reward: %6.2f, validation reward: %6.2f, loss: %7.4f' % (epoch+1, mtr, mvr, loss))
                saver.save(sess, 'tmp/model.ckpt')   

            if early_stop_loss and loss < early_stop_loss: break
        print('done')
        # save session
except KeyboardInterrupt:
    pass


start training
   1. training reward:  -4.24, validation reward:  -3.10, loss: -1.6095
  21. training reward:  -2.00, validation reward:  -2.00, loss: -0.0000
  41. training reward:  -2.00, validation reward:  -2.00, loss: -0.0000
  61. training reward:  -2.00, validation reward:  -2.00, loss: -0.0000
  81. training reward:  -2.00, validation reward:  -2.00, loss: -0.0000
 101. training reward:  -2.00, validation reward:  -2.00, loss: -0.0000
 121. training reward:  -2.00, validation reward:  -2.00, loss: -0.0000
 141. training reward:  -2.00, validation reward:  -2.00, loss: -0.0000

In [9]:
# plot training statistics
statistics = np.array(statistics).T
mean_training_rewards = statistics[0]
mean_validation_rewards = statistics[1]
losses = statistics[2]

plt.figure(figsize=(10,6))
plt.subplot(211)
plt.plot(losses, label='loss')
plt.xlabel('epoch'); plt.ylabel('loss')
plt.xlim((0, len(losses)))
plt.legend(loc=1); plt.grid()
plt.subplot(212)
plt.plot(mean_training_rewards, label='mean training reward')
plt.plot(mean_validation_rewards, label='mean validation reward')
plt.xlabel('epoch'); plt.ylabel('mean reward')
plt.xlim((0, len(mean_validation_rewards)))
plt.legend(loc=4); plt.grid()
plt.tight_layout(); plt.show()



In [70]:
# review solution
env = Minesweeper(display=True, ROWS = 6, COLS = 6)
import time
with tf.Session() as sess:
    saver.restore(sess, "tmp/model.ckpt")
    env.reset()
    s = env.stateConverter(env.get_state()).flatten()
    #view = Viewer(env, custom_render=True)
    for _ in range(500):
        a = get_action(sess, s, stochastic=False)
        s, r, done, _ = env.step(a)
    #view.render(close=True, display_gif=True)


INFO:tensorflow:Restoring parameters from tmp/model.ckpt
[2017-11-27 14:58:34,337] Restoring parameters from tmp/model.ckpt

Exercises

Now it is your turn! Play around with the code above and try to make it learn better and faster.

Experiment with the:

  • number of timesteps in a batch.
  • max length of rollouts.
  • discount factor.
  • learning rate.
  • number of hidden units and layers.

Exercise 1

Describe any changes you made to the code and why you think they improve the agent. Are you able to get solutions consistently?

Answer:

  • Changed the batch size to 600; it makes sense for the agent to look some way into the future, but not indefinitely.
  • Added a discount factor of 0.9 (0.8 seemed too low). We want to discount future rewards a bit, so the closer they are to the current state, the more they contribute to the credit an action receives.
  • Tuned the learning rate until the loss stopped oscillating.
  • Increased the number of units in layer 1 to 30 and added a second layer with 10 units.

Exercise 2

Consider the following sequence of rewards produced by an agent interacting with an environment for 10 timesteps: [0, 1, 1, 1, 0, 1, 1, 0, 0, 0].

  • What is the total reward?
  • What is the total future reward in each timestep?
  • What is the discounted future reward in each timestep if $\gamma = 0.9$?

[Hint: See previous notebook.]

Answer:

  • The total reward is the sum of all rewards: 5.
  • The total future reward is defined as $R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_T$. Thus we get the vector of future rewards: [5, 5, 4, 3, 2, 2, 1, 0, 0, 0].
  • The discounted future reward is defined as $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_T$. With $\gamma = 0.9$ this gives the vector: [3.560931, 3.95659, 3.2851, 2.539, 1.71, 1.9, 1.0, 0.0, 0.0, 0.0]. See below for the calculation.

In [4]:
l = [0, 1, 1, 1, 0, 1, 1, 0, 0, 0]
disc = 0.9
result = []
for i in range(len(l)):
    reward = 0  # start at 0; the loop below already includes l[i], so starting at l[i] would double-count it
    l2 = l[i:]
    for i2 in range(len(l2)):
        reward += disc**i2 * l2[i2]
    result.append(reward)
[round(r, 6) for r in result]  # round for readability


Out[4]:
[3.560931, 3.95659, 3.2851, 2.539, 1.71, 1.9, 1.0, 0.0, 0.0, 0.0]

Exercise 3

In the plot of the training and validation mean reward above, you will sometimes see the validation reward starts out lower than the training reward but later they cross. How can you explain this behavior? [Hint: Do we use the policy network in the same way during training and validation?]

Answer: No. During training we keep generating rollouts until the total number of timesteps reaches the batch size, while during validation we only perform a fixed number of rollouts of at most the rollout limit. Since the reported reward is a sum over each rollout, the two curves are computed over different amounts of experience and are not directly comparable; we can read the trend from the plot, but to compare them directly we would need to normalize them.

Exercise 4

How does the policy gradient method we have used address the exploration-exploitation dilemma (see the previous notebook for definition)?

Answer: In the current implementation we choose a stochastic action based on our policy (the get_action function). At the start we don't know what to do, so all actions have roughly equal probability and we essentially make random choices (exploration). After some time we begin to see which of those random choices work well, and we update our policy accordingly. This makes the actions that led to good results more likely and the actions that led to bad results less likely. As training progresses we therefore gradually move away from exploring the action space, because the probability of unusual actions shrinks as the network learns which actions are good (exploitation). You could say the learning rate determines how quickly we move into exploitation mode: a higher learning rate changes the policy faster, so the chance of explorative choices shrinks sooner as we settle on a well-defined strategy in the action space.

Optional exercises:

  • Explore! Train a policy for a different environment.
  • Let's get real! Modify the code to work on an environment with a continuous action space.

Exercise from Michael Nielsen's Book - Chapter 6

The idea of convolutional layers is to behave in an invariant way across images. It may seem surprising, then, that our network can learn more when all we've done is translate the input data. Can you explain why this is actually quite reasonable?

Answer: A convolutional NN has translational invariance: since it runs a kernel that is much smaller than the input image, it does not care where in the image a feature is; the feature will still be detected, just at another position in the convolutional feature map. However, by expanding the data through translation, rotation, stretching, distortion etc., we first of all get more training examples to train on, and second of all we can try to fill in some of the "gaps" between samples. The idea is that the images lie on a smaller manifold of the 784-dimensional (28 x 28) pixel space, and even with a lot of training data this manifold is probably sparsely populated. Imagine a 2D dataset with a lot of space between the points: we have seen earlier that neural networks with many parameters can overfit some crazy decision boundaries in the gaps between points. By augmenting our data, we can construct new datapoints from the current ones and hope that they fill out some of these gaps, so the network will have a harder time overfitting and will put down more reasonable decision boundaries.
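
As a small illustration of the augmentation discussed above, here is a sketch that translates an image by a couple of pixels with NumPy (the random array just stands in for e.g. a 28 x 28 MNIST digit):

# illustrative sketch: translate an image to create an extra training example
import numpy as np

image = np.random.rand(28, 28)   # stand-in for a 28 x 28 MNIST digit
shifted = np.zeros_like(image)
shifted[:, 2:] = image[:, :-2]   # shift 2 pixels to the right, padding the left edge with zeros
# 'shifted' can now be added to the training set with the same label as 'image'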

