Created by Jonas Busk ([jbusk@dtu.dk](mailto:jbusk@dtu.dk)).
In this part, we will create an agent that can learn to solve tasks from OpenAI Gym by applying the Policy Gradient method. We will implement the agent with a probabilistic policy that, given a state of the environment, $s$, outputs a probability distribution over the available actions, $a$:
$$ p_\theta(a|s) $$
Since this is a deep learning course, we will model the policy as a neural network with parameters $\theta$ and train it with gradient descent (now the name 'Policy Gradient' should start to make sense). When the set of available actions is discrete, we can use a network with softmax output.
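Concretely, for a discrete action space with $K$ actions, the network maps the state to a vector of scores $z = f_\theta(s)$ and the softmax turns these into a valid probability distribution:
$$ p_\theta(a=k|s) = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)} \ . $$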
The core idea of training the policy network is simple: we want to maximize the expected total reward by increasing the probability of good actions and decreasing the probability of bad actions.
The expectation over the (discounted) total reward, $R$, is:
$$ \mathbb{E}[R|\theta] = \int p_\theta({\bf a}|{\bf s}) R({\bf a}) d{\bf a} \ , $$
where ${\bf a} = a_1,\ldots,a_T$ and ${\bf s}=s_1,\ldots,s_T$.
Then we can use the gradient to maximize the total reward:
$$ \begin{align} \nabla_\theta \mathbb{E}[R|\theta] &= \nabla_\theta \int p_\theta({\bf a}|{\bf s}) R({\bf a}) \, d{\bf a} \\ &= \int \nabla_\theta p_\theta({\bf a}|{\bf s}) R({\bf a}) \, d{\bf a} \\ &= \int p_\theta({\bf a}|{\bf s}) \nabla_\theta \log p_\theta({\bf a}|{\bf s}) R({\bf a}) \, d{\bf a} \\ &= \mathbb{E}[R({\bf a}) \nabla_\theta \log p_\theta({\bf a}|{\bf s})] \end{align} $$
using the identity
$$ \nabla_\theta p_\theta({\bf a}|{\bf s}) = p_\theta({\bf a}|{\bf s}) \nabla_\theta \log p_\theta({\bf a}|{\bf s}) $$
to express the gradient as an average over $p_\theta({\bf a},{\bf s})$.
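This identity is simply the chain rule applied to the logarithm: $\nabla_\theta \log p_\theta({\bf a}|{\bf s}) = \nabla_\theta p_\theta({\bf a}|{\bf s}) / p_\theta({\bf a}|{\bf s})$, so multiplying both sides by $p_\theta({\bf a}|{\bf s})$ gives the expression above. It is often called the log-derivative or score function trick.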
We cannot evaluate the average over roll-outs analytically, but we have an environment simulator that, when supplied with our current policy $p_\theta(a|s)$, can return the sequences of actions, states and rewards. This allows us to replace the integral with a Monte Carlo average over $V$ roll-outs:
$$ \nabla_\theta \mathbb{E}[R|\theta] \approx \frac{1}{V} \sum_{v=1}^V \nabla_\theta \log p_\theta({\bf a}^{(v)}|{\bf s}^{(v)}) R({\bf a}^{(v)}) \ . $$
In practice, to reduce the variance of the gradient, instead of $R$ we use the adjusted discounted future reward, also known as the advantage, $A$:
$$ A_t = R_t - b_t \ , $$
where the baseline, $b_t$, is the (discounted) total future reward at timestep $t$ averaged over the $V$ roll-outs:
$$ b_t = \frac{1}{V} \sum_{v=1}^V R_t^{(v)} \ . $$
This way we are always encouraging and discouraging roughly half of the performed actions, which gives us the final gradient estimator:
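Here $R_t$ is the discounted future reward from timestep $t$ onwards: with discount factor $\gamma$ it is
$$ R_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k} = r_t + \gamma R_{t+1} \ , $$
which is exactly what the get_returns helper function further below computes by iterating backwards over each roll-out.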
$$ \nabla_\theta \mathbb{E}[R|\theta] \approx \frac{1}{V} \sum_{v=1}^V \nabla_\theta \log p_\theta({\bf a}^{(v)}|{\bf s}^{(v)}) A({\bf a}^{(v)}) $$
And that's it! Please refer to this blog post by Karpathy for more discussion on the Policy Gradient method.
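In the implementation below, the log-probability of a roll-out factorizes over timesteps, $\log p_\theta({\bf a}|{\bf s}) = \sum_t \log p_\theta(a_t|s_t)$, and each term is weighted by the advantage at that timestep, so the quantity actually minimized is the surrogate loss
$$ L(\theta) = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta(a_t|s_t) A_t \ , $$
where the sum runs over all $N$ timesteps collected in a batch (averaging over timesteps rather than roll-outs only rescales the gradient).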
--
Note: For simple reinforcement learning problems (like the one we will address in this exercise) there are simpler methods that work just fine. However, the Policy Gradient method has been shown to also work well for complex problems with high dimensional inputs and many parameters, where simple methods become inadequate.
In [1]:
# imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.python.ops.nn import relu, softmax
import gym
#from utils import Viewer
In this lab we will work with the CartPole-v0 environment. Later you can change the code below to explore other environments and solve different tasks.
Note: The policy implemented in this notebook is designed to work on environments with a discrete action space. Extending the code to also handle environments with a continuous action space is left as an optional exercise.
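If you decide to try the optional continuous-action extension later, a common approach is to let the network output the mean (and a log standard deviation) of a Gaussian and sample actions from that distribution instead of a softmax. The snippet below is only a NumPy sketch of the idea; the names sample_gaussian_action, mean and log_std are made up for illustration and are not part of this notebook's code.
In [ ]:
import numpy as np

def sample_gaussian_action(mean, log_std):
    """Hypothetical sketch: sample a 1-D continuous action from N(mean, exp(log_std)^2)."""
    std = np.exp(log_std)
    action = mean + std * np.random.randn()
    # log-density of the sampled action; in a continuous-action policy gradient,
    # this would play the role of log(softmax output) in the loss
    log_prob = -0.5 * (((action - mean) / std)**2 + 2 * log_std + np.log(2 * np.pi))
    return action, log_prob

a, logp = sample_gaussian_action(mean=0.5, log_std=-1.0)
print(a, logp)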
In [2]:
# create gym environment
#env = gym.make('CartPole-v0')
Let us see how the environment looks when we just take random actions.
In [3]:
# demo the environment
#env.reset() # reset the environment
#view = Viewer(env, custom_render=True) # we use this custom viewer to render the environment inline in the notebook
#for _ in range(200):
# view.render()
# env.render() # uncomment this to use gym's own render function
# env.step(env.action_space.sample()) # take a random action
#view.render(close=True, display_gif=True) # display the environment inline in the notebook
#env.render(close=True) # uncomment this to use gym's own render function
Taking random actions does not do a very good job at balancing the pole. Let us now apply the Policy Gradient method described above to solve this task!
To start with, our policy will be a rather simple neural network with one hidden layer. We can retrieve the shape of the state space (input) and action space (output) from the environment.
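As a side note, for a standard Gym environment these sizes would not have to be hard-coded; they can be read directly from the environment (shown commented out, for a discrete-action environment such as CartPole-v0):
In [ ]:
# for a standard gym environment, the network dimensions can be read off the environment:
#env = gym.make('CartPole-v0')
#n_inputs = env.observation_space.shape[0] # size of the observation vector
#n_outputs = env.action_space.n # number of discrete actions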
In [4]:
# setup policy network
n_inputs = 360
n_hidden = 150
n_hidden2 = 100
n_hidden3 = 100
n_outputs = 6*6
tf.reset_default_graph()
states_pl = tf.placeholder(tf.float32, [None, n_inputs], name='states_pl')
actions_pl = tf.placeholder(tf.int32, [None, 2], name='actions_pl')
advantages_pl = tf.placeholder(tf.float32, [None], name='advantages_pl')
learning_rate_pl = tf.placeholder(tf.float32, name='learning_rate_pl')
l_hidden = tf.layers.dense(inputs=states_pl, units=n_hidden, activation=relu, name='l_hidden')
l_hidden2 = tf.layers.dense(inputs=l_hidden, units=n_hidden2, activation=relu, name='l_hidden2')
#l_hidden3 = tf.layers.dense(inputs=l_hidden2, units=n_hidden3, activation=relu, name='l_hidden3')
l_out = tf.layers.dense(inputs=l_hidden2, units=n_outputs, activation=softmax, name='l_out')
# print network
print('states_pl:', states_pl.get_shape())
print('actions_pl:', actions_pl.get_shape())
print('advantages_pl:', advantages_pl.get_shape())
print('l_hidden:', l_hidden.get_shape())
print('l_hidden2:', l_hidden2.get_shape())
#print('l_hidden3:', l_hidden3.get_shape())
print('l_out:', l_out.get_shape())
In [5]:
# define loss and optimizer
loss_f = -tf.reduce_mean(tf.multiply(tf.log(tf.gather_nd(l_out, actions_pl)), advantages_pl))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_pl)
train_f = optimizer.minimize(loss_f)
saver = tf.train.Saver() # we use this later to save the model
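To see what the loss is doing: actions_pl holds one (row, action) index pair per timestep in a batch, and tf.gather_nd uses these pairs to pick out the probability of the action that was actually taken at each timestep. A small NumPy illustration of the same indexing (toy numbers, for intuition only):
In [ ]:
# toy illustration of the indexing performed by tf.gather_nd with (timestep, action) pairs
probs = np.array([[0.1, 0.9],   # action probabilities at timestep 0
                  [0.7, 0.3]])  # action probabilities at timestep 1
idx = np.array([[0, 1],         # action 1 was taken at timestep 0
                [1, 0]])        # action 0 was taken at timestep 1
print(probs[idx[:, 0], idx[:, 1]])  # [0.9 0.7] -- probabilities of the chosen actions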
In [6]:
# test forward pass
from minesweeper_tk import Minesweeper
env = Minesweeper(display=False, ROWS = 6, COLS = 6, MINES = 7, rewards = {"win" : 100, "loss" : -10, "progress" : 1, "noprogress" : -3})
state = env.stateConverter(env.get_state()).flatten()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    action_probabilities = sess.run(fetches=l_out, feed_dict={states_pl: [state]})
print(action_probabilities)
Note: As we create our solution, we will make very few assumptions about the cart-pole environment. We aim to develop a general model for solving reinforcement learning problems, and therefore care little about the specific meaning of the inputs and outputs.
In [7]:
# helper functions
def get_rollout(sess, env, rollout_limit=None, stochastic=False, seed=None):
    """Generate rollout by iteratively evaluating the current policy on the environment."""
    rollout_limit = rollout_limit or env.spec.timestep_limit
    env.reset()
    s = env.stateConverter(env.get_state()).flatten()
    states, actions, rewards = [], [], []
    for i in range(rollout_limit):
        a = get_action(sess, s, stochastic)
        s1, r, done, _ = env.step(a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s1
        if done: break
    return states, actions, rewards, i+1

def get_action(sess, state, stochastic=False):
    """Choose an action, given a state, with the current policy network."""
    # get action probabilities
    a_prob = sess.run(fetches=l_out, feed_dict={states_pl: np.atleast_2d(state)})
    if stochastic:
        # sample action from distribution
        return (np.cumsum(np.asarray(a_prob)) > np.random.rand()).argmax()
    else:
        # select action with highest probability
        return a_prob.argmax()

def get_advantages(rewards, rollout_limit, discount_factor, eps=1e-12):
    """Compute advantages"""
    returns = get_returns(rewards, rollout_limit, discount_factor)
    # standardize columns of returns to get advantages
    advantages = (returns - np.mean(returns, axis=0)) / (np.std(returns, axis=0) + eps)
    # restore original rollout lengths
    advantages = [adv[:len(rewards[i])] for i, adv in enumerate(advantages)]
    return advantages

def get_returns(rewards, rollout_limit, discount_factor):
    """Compute the cumulative discounted rewards, a.k.a. returns."""
    returns = np.zeros((len(rewards), rollout_limit))
    for i, r in enumerate(rewards):
        returns[i, len(r) - 1] = r[-1]
        for j in reversed(range(len(r)-1)):
            returns[i, j] = r[j] + discount_factor * returns[i, j+1]
    return returns
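A quick note on the stochastic branch of get_action above: comparing the cumulative sum of the action probabilities with a uniform random number is inverse-CDF sampling, i.e. it draws an action index with exactly the given probabilities, equivalent to np.random.choice. A small sanity check (illustrative only):
In [ ]:
p = np.array([0.2, 0.5, 0.3])
# inverse-CDF sampling as used in get_action
samples = [(np.cumsum(p) > np.random.rand()).argmax() for _ in range(10000)]
print(np.bincount(samples) / len(samples))                        # close to p
print(np.bincount(np.random.choice(len(p), 10000, p=p)) / 10000)  # same distribution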
In [ ]:
# training settings
epochs = 10000 # number of training batches
batch_size = 200 # number of timesteps in a batch
rollout_limit = 25 # max rollout length
discount_factor = 1.0 # reward discount factor (gamma), 1.0 = no discount
learning_rate = 0.005 # you know this by now
early_stop_loss = 0 # stop training if loss < early_stop_loss, 0 or False to disable
# train policy network
try:
    statistics = []
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print('start training')
        for epoch in range(epochs):
            # generate rollouts until batch_size total timesteps are collected
            states, actions, rewards = [], [], []
            timesteps = 0
            while timesteps < batch_size:
                _rollout_limit = min(rollout_limit, batch_size - timesteps) # limit rollout to match batch_size
                s, a, r, t = get_rollout(sess, env, _rollout_limit, stochastic=True, seed=epoch)
                states.append(s)
                actions.append(a)
                rewards.append(r)
                timesteps += t
            # compute advantages
            advantages = get_advantages(rewards, rollout_limit, discount_factor)
            # policy gradient update
            loss, _ = sess.run(fetches=[loss_f, train_f], feed_dict={
                states_pl: np.concatenate(states),
                actions_pl: np.column_stack((np.arange(timesteps), np.concatenate(actions))),
                advantages_pl: np.concatenate(advantages),
                learning_rate_pl: learning_rate
            })
            # validation
            val_rewards = [get_rollout(sess, env, rollout_limit, stochastic=True, seed=(epochs+i))[2] for i in range(10)]
            # store and print training statistics
            mtr = np.mean([np.sum(r) for r in rewards])
            mvr = np.mean([np.sum(r) for r in val_rewards])
            statistics.append((mtr, mvr, loss))
            if epoch % 20 == 0:
                print('%4d. training reward: %6.2f, validation reward: %6.2f, loss: %7.4f' % (epoch+1, mtr, mvr, loss))
                saver.save(sess, 'tmp/model.ckpt')
            if early_stop_loss and loss < early_stop_loss: break
        print('done')
        # save session
except KeyboardInterrupt:
    pass
In [9]:
# plot training statistics
statistics = np.array(statistics).T
mean_training_rewards = statistics[0]
mean_validation_rewards = statistics[1]
losses = statistics[2]
plt.figure(figsize=(10,6))
plt.subplot(211)
plt.plot(losses, label='loss')
plt.xlabel('epoch'); plt.ylabel('loss')
plt.xlim((0, len(losses)))
plt.legend(loc=1); plt.grid()
plt.subplot(212)
plt.plot(mean_training_rewards, label='mean training reward')
plt.plot(mean_validation_rewards, label='mean validation reward')
plt.xlabel('epoch'); plt.ylabel('mean reward')
plt.xlim((0, len(mean_validation_rewards)))
plt.legend(loc=4); plt.grid()
plt.tight_layout(); plt.show()
In [70]:
# review solution
env = Minesweeper(display=True, ROWS = 6, COLS = 6)
import time
with tf.Session() as sess:
    saver.restore(sess, "tmp/model.ckpt")
    env.reset()
    s = env.stateConverter(env.get_state()).flatten()
    #view = Viewer(env, custom_render=True)
    for _ in range(500):
        a = get_action(sess, s, stochastic=False)
        s, r, done, _ = env.step(a)
    #view.render(close=True, display_gif=True)
Now it is your turn! Play around with the code above and try to make it learn better and faster.
Experiment with the training settings above (batch size, rollout limit, discount factor, learning rate, early stopping) as well as the architecture of the policy network (number of hidden layers and units).
Describe any changes you made to the code and why you think they improve the agent. Are you able to get solutions consistently?
Answer:
Consider the following sequence of rewards produced by an agent interacting with an environment for 10 timesteps: [0, 1, 1, 1, 0, 1, 1, 0, 0, 0].
[Hint: See previous notebook.]
Answer:
In [4]:
l = [0, 1, 1, 1, 0, 1, 1, 0, 0, 0]
disc = 0.9
result = []
for i in range(len(l)):
    # discounted future reward at timestep i: sum_k disc**k * l[i+k]
    reward = 0
    l2 = l[i:]
    for i2 in range(len(l2)):
        reward += disc**i2 * l2[i2]
    result.append(reward)
result
Out[4]:
In the plot of the training and validation mean reward above, you will sometimes see the validation reward starts out lower than the training reward but later they cross. How can you explain this behavior? [Hint: Do we use the policy network in the same way during training and validation?]
Answer: No. During training we keep generating roll-outs until the total number of timesteps reaches batch_size, so roll-outs may be cut short to fit the batch, whereas during validation we run roll-outs of the full specified length. Since the reported reward is a sum over timesteps, this leads to a systematic difference between the two curves. We can still read the overall trend from the plot, but if we really wanted to compare the two curves directly we would need to normalize them (for example by roll-out length).
How does the policy gradient method we have used address the exploration-exploitation dilemma (see the previous notebook for definition)?
Answer: In the current implementation we choose actions stochastically according to the policy (see the get_action function). In the beginning the network does not know what to do, so all actions have roughly equal probability and we essentially make random choices, i.e. we explore. Over time we see which of these random choices work well and update the policy accordingly, which makes the actions that led to good results more likely and the actions that led to bad results less likely. As training progresses we therefore gradually move away from exploring the action space, because the probability of unusual actions shrinks as the network learns which actions are good. One could say that the learning rate determines how quickly we move into exploitation mode: a higher learning rate changes the policy faster, so the chance of explorative choices decreases sooner as we settle on a well-defined strategy in the action space.
The idea of convolutional layers is to behave in an invariant way across images. It may seem surprising, then, that our network can learn more when all we've done is translate the input data. Can you explain why this is actually quite reasonable?
Answer: A convolutional network is translation invariant: since it slides a kernel that is much smaller than the input image, it does not matter where in the image a feature appears; the feature will still be detected, just at another position in the convolutional layer. However, by augmenting the data with translations, rotations, stretching, distortion etc. we firstly get more training examples to train on, and secondly we can try to fill in some of the 'gaps' between samples. The idea is that the images lie on a much smaller manifold of their 700-something-dimensional input space, and even with a lot of training data this manifold is probably sparsely populated. Imagine a 2D dataset with a lot of empty space between the points: we have seen earlier that neural networks with many parameters can overfit with crazy decision boundaries in the gaps between points. By augmenting the data we construct new examples from our current data points and hope that they fill out some of these gaps, so the network will have a harder time overfitting and will put down more reasonable decision boundaries.
In [ ]: