REINFORCE in TensorFlow

Just like we did before for Q-learning, this time we'll design a neural network to learn CartPole-v0 via the policy gradient method (REINFORCE).


In [1]:
# This code creates a virtual display to draw game images on. 
# If you are running locally, just ignore it

import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY")) == 0:
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'


Starting virtual X frame buffer: Xvfb.

In [2]:
import gym
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make("CartPole-v0")

# gym compatibility: unwrap TimeLimit
if hasattr(env,'env'):
    env=env.env

env.reset()
n_actions = env.action_space.n
state_dim = env.observation_space.shape
plt.imshow(env.render("rgb_array"))


Out[2]:
<matplotlib.image.AxesImage at 0x7f0a91210128>

Building the policy network

For the REINFORCE algorithm, we'll need a model that predicts action probabilities given states.

For numerical stability, do not include the softmax layer in your network architecture: the network should output raw logits.

We'll apply softmax or log-softmax where appropriate.
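
As a quick illustration of why the softmax stays outside the network (a minimal NumPy sketch, not part of the assignment): with large logits, taking log(softmax(x)) directly overflows, while a log-softmax computed in log space stays finite.

import numpy as np

logits = np.array([1000.0, 0.0])                        # a large logit gap
naive = np.log(np.exp(logits) / np.exp(logits).sum())   # exp(1000) overflows -> [nan, -inf]
shifted = logits - logits.max()                         # the log-sum-exp trick
log_softmax = shifted - np.log(np.exp(shifted).sum())   # finite: [0., -1000.]
print(naive, log_softmax)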


In [3]:
import tensorflow as tf
tf.reset_default_graph()

# create input variables. We only need <s,a,R> for REINFORCE
states = tf.placeholder('float32', (None,)+state_dim, name="states")
actions = tf.placeholder('int32', name="action_ids")
cumulative_rewards = tf.placeholder('float32', name="cumulative_returns")

In [4]:
import keras
import keras.layers as L

# a simple two-layer MLP that outputs raw logits (no softmax, see above)
network = keras.models.Sequential()
network.add(L.Dense(256, activation="relu", input_shape=state_dim, name="layer_1"))
network.add(L.Dense(n_actions, activation="linear", name="layer_2"))
print(network.summary())

# symbolic outputs of the network for a batch of states
logits = network(states)
policy = tf.nn.softmax(logits)          # pi(a|s)
log_policy = tf.nn.log_softmax(logits)  # log pi(a|s)


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
layer_1 (Dense)              (None, 256)               1280      
_________________________________________________________________
layer_2 (Dense)              (None, 2)                 514       
=================================================================
Total params: 1,794
Trainable params: 1,794
Non-trainable params: 0
_________________________________________________________________
None
Using TensorFlow backend.

In [5]:
# utility: predict action probabilities pi(.|s) for a single state
def get_action_proba(s):
    return policy.eval({states: [s]})[0]

Loss function and updates

We now need to define the objective and the weight update for the policy gradient.

Our objective function is

$$ J \approx { 1 \over N } \sum _{s_i,a_i} \pi_\theta (a_i | s_i) \cdot G(s_i,a_i) $$

Following the REINFORCE algorithm, we instead optimize the surrogate objective

$$ \hat J \approx { 1 \over N } \sum _{s_i,a_i} \log \pi_\theta (a_i | s_i) \cdot G(s_i,a_i) $$

When you compute the gradient of that function with respect to the network weights $ \theta $, it becomes exactly the policy gradient.
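
To spell that step out (a standard identity, not specific to this notebook): differentiating the surrogate gives

$$ \nabla_\theta \hat J \approx { 1 \over N } \sum _{s_i,a_i} \nabla_\theta \log \pi_\theta (a_i | s_i) \cdot G(s_i,a_i) $$

which is the Monte-Carlo estimate of the policy gradient that REINFORCE prescribes, since $ \nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta / \pi_\theta $ (the log-derivative trick).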


In [6]:
# select log-probabilities for the chosen actions, log pi(a_i|s_i)
# indices[i] = (i, actions[i]), so gather_nd picks log_policy[i, actions[i]] for each sample in the batch
indices = tf.stack([tf.range(tf.shape(log_policy)[0]), actions], axis=-1)
log_policy_for_actions = tf.gather_nd(log_policy, indices)
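
An equivalent way to build the same tensor (an alternative sketch for illustration; `log_policy_for_actions_alt` is just a name chosen here, not used elsewhere in the notebook) is to mask the log-policy with a one-hot encoding of the taken actions:

# equivalent selection via a one-hot mask over the action axis
# (takes the same values as log_policy_for_actions above)
log_policy_for_actions_alt = tf.reduce_sum(log_policy * tf.one_hot(actions, n_actions), axis=-1)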

In [7]:
# REINFORCE surrogate objective: batch mean of log pi(a_i|s_i) * G(s_i,a_i)
J = tf.reduce_mean(log_policy_for_actions * cumulative_rewards)

In [8]:
# regularize with entropy: H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s), averaged over the batch
# (note the minus sign: policy * log_policy alone is the *negative* entropy)
entropy = -tf.reduce_mean(tf.reduce_sum(policy * log_policy, axis=-1))

In [9]:
# all trainable network weights
all_weights = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

# weight updates: maximizing J is the same as minimizing -J;
# the entropy bonus is subtracted so that the optimizer also favors exploratory (high-entropy) policies
loss = -J - 0.1 * entropy

update = tf.train.AdamOptimizer().minimize(loss, var_list=all_weights)



Computing cumulative rewards


In [26]:
def get_cumulative_rewards(rewards,    # rewards at each step
                           gamma=0.99  # discount for reward
                           ):
    """
    take a list of immediate rewards r(s,a) for the whole session 
    compute cumulative rewards R(s,a) (a.k.a. G(s,a) in Sutton '16)
    R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...

    The simple way to compute cumulative rewards is to iterate from last to first time tick
    and compute R_t = r_t + gamma*R_{t+1} recurrently

    You must return an array/list of cumulative rewards with as many elements as in the initial rewards.
    """

    # iterate from the last time step backwards: R_t = r_t + gamma * R_{t+1}
    cumulative_rewards = np.zeros(len(rewards))
    cumulative_rewards[-1] = rewards[-1]
    for t in range(len(rewards) - 2, -1, -1):
        cumulative_rewards[t] = rewards[t] + gamma * cumulative_rewards[t + 1]
    return cumulative_rewards
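
For reference, the same recurrence can be computed without an explicit Python loop (a sketch, assuming SciPy is available; the function name below is just illustrative):

from scipy.signal import lfilter

def get_cumulative_rewards_vectorized(rewards, gamma=0.99):
    # lfilter([1], [1, -gamma], x) computes y[n] = x[n] + gamma * y[n-1];
    # running it over the reversed rewards and reversing back gives R_t = r_t + gamma * R_{t+1}
    r = np.asarray(rewards, dtype=np.float64)
    return lfilter([1.0], [1.0, -gamma], r[::-1])[::-1]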

In [27]:
assert len(get_cumulative_rewards(range(100))) == 100
assert np.allclose(get_cumulative_rewards([0, 0, 1, 0, 0, 1, 0], gamma=0.9),
                   [1.40049, 1.5561, 1.729, 0.81, 0.9, 1.0, 0.0])
assert np.allclose(get_cumulative_rewards([0, 0, 1, -2, 3, -4, 0], gamma=0.5),
                   [0.0625, 0.125, 0.25, -1.5, 1.0, -4.0, 0.0])
assert np.allclose(get_cumulative_rewards([0, 0, 1, 2, 3, 4, 0], gamma=0),
                   [0, 0, 1, 2, 3, 4, 0])
print("looks good!")


looks good!

In [28]:
def train_step(_states, _actions, _rewards):
    """given full session, trains agent with policy gradient"""
    _cumulative_rewards = get_cumulative_rewards(_rewards)
    update.run({states: _states, actions: _actions,
                cumulative_rewards: _cumulative_rewards})

Playing the game


In [41]:
def generate_session(t_max=1000):
    """play env with REINFORCE agent and train at the session end"""

    # arrays to record session
    states, actions, rewards = [], [], []

    s = env.reset()

    for t in range(t_max):

        # action probabilities array, a.k.a. pi(a|s)
        action_probas = get_action_proba(s)

        # sample an action from the current policy
        a = np.random.choice(len(action_probas), p=action_probas)

        new_s, r, done, info = env.step(a)

        # record session history to train later
        states.append(s)
        actions.append(a)
        rewards.append(r)

        s = new_s
        if done:
            break

    train_step(states, actions, rewards)

    # technical: return session rewards to print them later
    return sum(rewards)

In [42]:
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

for i in range(100):

    rewards = [generate_session() for _ in range(100)]  # generate new sessions

    print("mean reward:%.3f" % (np.mean(rewards)))

    if np.mean(rewards) > 300:
        print("You Win!") # but you can train even further
        break


mean reward:27.180
mean reward:49.820
mean reward:103.360
mean reward:169.880
mean reward:153.010
mean reward:125.110
mean reward:299.820
mean reward:664.230
You Win!

Results & video


In [43]:
# record sessions
import gym.wrappers
env = gym.wrappers.Monitor(gym.make("CartPole-v0"),
                           directory="videos", force=True)
sessions = [generate_session() for _ in range(100)]
env.close()

In [44]:
# show video
from IPython.display import HTML
import os

video_names = list(
    filter(lambda s: s.endswith(".mp4"), os.listdir("./videos/")))

HTML("""
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format("./videos/"+video_names[-1]))  # this may or may not be _last_ video. Try other indices


Out[44]:

In [46]:
from submit import submit_cartpole
submit_cartpole(generate_session, "tonatiuh_rangel@hotmail.com", "Cecc5rcVxaVUYtsQ")


Submitted to Coursera platform. See results on assignment page!

In [ ]:
# That's all, thank you for your attention!
# Not having enough? There's an actor-critic waiting for you in the honor section.
# But make sure you've seen the videos first.