In this notebook you'll build a deep reinforcement learning agent for Atari KungFuMaster and train it with advantage actor-critic (A2C).
In [1]:
from __future__ import print_function, division
from IPython.core import display
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
#If you are running on a server, launch xvfb to record game videos
#Please make sure you have xvfb installed
import os
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'
For starters, let's take a look at the game itself:
In [2]:
import gym
from atari_util import PreprocessAtari
# We scale rewards to avoid exploding gradients during optimization.
reward_scale = 0.01
def make_env():
    env = gym.make("KungFuMasterDeterministic-v0")
    env = PreprocessAtari(
        env, height=42, width=42,
        crop=lambda img: img[60:-30, 5:],
        dim_order='tensorflow',
        color=False, n_frames=4,
        reward_scale=reward_scale)
    return env
env = make_env()
obs_shape = env.observation_space.shape
n_actions = env.action_space.n
print("Observation shape:", obs_shape)
print("Num actions:", n_actions)
print("Action names:", env.env.env.get_action_meanings())
In [3]:
s = env.reset()
for _ in range(100):
    s, _, _, _ = env.step(env.action_space.sample())
plt.title('Game image')
plt.imshow(env.render('rgb_array'))
plt.show()
plt.title('Agent observation (4-frame buffer)')
plt.imshow(s.transpose([0,2,1]).reshape([42,-1]))
plt.show()
We now have to build an agent for actor-critic training: a convolutional neural network that converts states into action probabilities $\pi$ and state values $V$.
Your assignment here is to build and apply a neural network with any framework you want.
For starters, we want you to implement a simple architecture along the lines sketched below.
After your agent gets a mean reward above 50, we encourage you to experiment with the model architecture to score even better.
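If you'd like a starting point, here is a minimal sketch of such a network, assuming Keras on top of TensorFlow 1.x (the same stack used in the rest of the notebook); the layer sizes are illustrative, not prescribed, and mirror the commented-out reference further down:

from keras.layers import Conv2D, Dense, Flatten, Input
from keras.models import Model

def build_network(state_shape, n_actions):
    """A small conv net with separate policy-logits and state-value heads."""
    inputs = Input(shape=state_shape)
    x = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')(inputs)
    x = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')(x)
    x = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')(x)
    x = Flatten()(x)
    x = Dense(128, activation='elu')(x)
    logits = Dense(n_actions, activation='linear')(x)    # unnormalized action preferences
    state_value = Dense(1, activation='linear')(x)       # V(s)
    return Model(inputs=inputs, outputs=[logits, state_value])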
In [4]:
import tensorflow as tf
tf.reset_default_graph()
sess = tf.InteractiveSession()
In [5]:
from keras.layers import Conv2D, Dense, Flatten, Input
from keras.models import Model


class Agent:
    def __init__(self, name, state_shape, n_actions, reuse=False):
        """A simple actor-critic agent"""
        with tf.variable_scope(name, reuse=reuse):
            # Prepare neural network architecture
            inputs = Input(shape=state_shape, name='input')
            x = Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation='relu', name='conv1')(inputs)
            x = Flatten()(x)
            x = Dense(32, activation='relu', name='dense')(x)
            logits_head = Dense(n_actions, activation='linear', name='logits_head')(x)
            v_head = Dense(1, activation='linear', name='v_head')(x)
            self.network = Model(inputs=inputs, outputs=[logits_head, v_head])
            self.network.summary()

            # Reference architecture from the assignment, kept for comparison:
            # self.network = Sequential()
            # self.network.add(Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation='elu'))
            # self.network.add(Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation='elu'))
            # self.network.add(Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation='elu'))
            # self.network.add(Flatten())
            # self.network.add(Dense(128, activation='elu'))
            # self.network.add(Dense(n_actions, activation='linear'))

            # prepare a graph for agent step
            self.state_t = tf.placeholder('float32', [None, ] + list(state_shape))
            self.agent_outputs = self.symbolic_step(self.state_t)

    def symbolic_step(self, state_t):
        """Takes a batch of observations, returns policy logits and state values (tf tensors)"""
        # Apply the agent's neural network to get policy logits and state values
        logits, state_value = self.network(state_t)
        state_value = tf.reshape(state_value, shape=[-1, ])  # [None, 1] -> [None]

        assert tf.is_numeric_tensor(state_value) and state_value.shape.ndims == 1, \
            "please return 1D tf tensor of state values [you got %s]" % repr(state_value)
        assert tf.is_numeric_tensor(logits) and logits.shape.ndims == 2, \
            "please return 2d tf tensor of logits [you got %s]" % repr(logits)
        # hint: if you triggered the state_values assert with shape [None, 1],
        # just take the [:, 0]-th element of state values as the new state values
        return logits, state_value

    def step(self, state_t):
        """Same as symbolic_step except it operates on numpy arrays"""
        sess = tf.get_default_session()
        return sess.run(self.agent_outputs, {self.state_t: state_t})

    def sample_actions(self, agent_outputs):
        """pick actions given numeric agent outputs (np arrays)"""
        logits, state_values = agent_outputs
        # subtract the max logit before exponentiating for numerical stability
        logits = logits - logits.max(axis=-1, keepdims=True)
        policy = np.exp(logits) / np.sum(np.exp(logits), axis=-1, keepdims=True)
        return np.array([np.random.choice(len(p), p=p) for p in policy])
In [6]:
agent = Agent("agent", obs_shape, n_actions)
sess.run(tf.global_variables_initializer())
In [7]:
state = [env.reset()]
logits, value = agent.step(state)
print("action logits:\n", logits)
print("state values:\n", value)
In [8]:
def evaluate(agent, env, n_games=1):
    """Plays n_games from start till done, returns per-game rewards"""
    game_rewards = []
    for _ in range(n_games):
        state = env.reset()
        total_reward = 0
        while True:
            action = agent.sample_actions(agent.step([state]))[0]
            state, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                break
        # We rescale the reward back to ensure compatibility
        # with other evaluations.
        game_rewards.append(total_reward / reward_scale)
    return game_rewards
In [9]:
env_monitor = gym.wrappers.Monitor(env, directory="kungfu_videos", force=True)
rw = evaluate(agent, env_monitor, n_games=3)
env_monitor.close()
print(rw)
In [10]:
#show video
import os
from IPython.display import HTML
video_names = [s for s in os.listdir("./kungfu_videos/") if s.endswith(".mp4")]
HTML("""
<video width="640" height="480" controls>
<source src="{}" type="video/mp4">
</video>
""".format("./kungfu_videos/" + video_names[-1])) #this may or may not be _last_ video. Try other indices
Out[10]:
To make actor-critic training more stable, we shall play several games in parallel. This means you'll have to initialize several parallel gym envs, send the agent's actions to them, and .reset() each env once it's done. To minimize learner brain damage, we've taken care of that for you - just make sure you read the code before you use it.
In [11]:
class EnvBatch:
    def __init__(self, n_envs=10):
        """ Creates n_envs environments and babysits them for ya' """
        self.envs = [make_env() for _ in range(n_envs)]

    def reset(self):
        """ Reset all games and return [n_envs, *obs_shape] observations """
        return np.array([env.reset() for env in self.envs])

    def step(self, actions):
        """
        Send a vector[batch_size] of actions into respective environments
        :returns: observations[n_envs, *obs_shape], rewards[n_envs], done[n_envs], info[n_envs]
        """
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        new_obs, rewards, done, infos = map(np.array, zip(*results))

        # reset environments automatically
        for i in range(len(self.envs)):
            if done[i]:
                new_obs[i] = self.envs[i].reset()

        return new_obs, rewards, done, infos
Let's try it out:
In [12]:
env_batch = EnvBatch(10)
batch_states = env_batch.reset()
batch_actions = agent.sample_actions(agent.step(batch_states))
batch_next_states, batch_rewards, batch_done, _ = env_batch.step(batch_actions)
print("State shape:", batch_states.shape)
print("Actions:", batch_actions[:3])
print("Rewards:", batch_rewards[:3])
print("Done:", batch_done[:3])
In [13]:
# These placeholders mean exactly the same as in "Let's try it out" section above
states_ph = tf.placeholder('float32', [None,] + list(obs_shape))
next_states_ph = tf.placeholder('float32', [None,] + list(obs_shape))
actions_ph = tf.placeholder('int32', (None,))
rewards_ph = tf.placeholder('float32', (None,))
is_done_ph = tf.placeholder('float32', (None,))
In [14]:
# logits[n_envs, n_actions] and state_values[n_envs,]
logits, state_values = agent.symbolic_step(states_ph)
next_logits, next_state_values = agent.symbolic_step(next_states_ph)
next_state_values = next_state_values * (1 - is_done_ph)
# probabilities and log-probabilities for all actions
probs = tf.nn.softmax(logits) # [n_envs, n_actions]
logprobs = tf.nn.log_softmax(logits) # [n_envs, n_actions]
# log-probabilities only for agent's chosen actions
logp_actions = tf.reduce_sum(logprobs * tf.one_hot(actions_ph, n_actions), axis=-1) # [n_envs,]
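For reference, here is what the next cell computes, written out from its code (discount $\gamma = 0.99$, entropy coefficient $0.001$, and $V(s')$ already zeroed out for terminal transitions by the `is_done_ph` mask above):

$A(s, a) = r + \gamma V(s') - V(s)$

$\mathcal{H}\big(\pi(\cdot|s)\big) = -\sum_a \pi(a|s) \log \pi(a|s)$

$L_{\text{actor}} = -\tfrac{1}{N} \sum_i \log \pi(a_i|s_i)\, A(s_i, a_i) \;-\; 0.001 \cdot \tfrac{1}{N} \sum_i \mathcal{H}\big(\pi(\cdot|s_i)\big)$

$L_{\text{critic}} = \tfrac{1}{N} \sum_i \big( V(s_i) - [\,r_i + \gamma V(s'_i)\,] \big)^2$

Both the advantage $A$ and the TD target $r + \gamma V(s')$ are treated as constants (stop_gradient) when differentiating.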
In [15]:
# compute advantage using rewards_ph, state_values and next_state_values
gamma = 0.99
advantage = rewards_ph + gamma * next_state_values - state_values
assert advantage.shape.ndims == 1, "please compute advantage for each sample, vector of shape [n_envs,]"

# compute policy entropy given logits. Mind the "-" sign!
entropy = -tf.reduce_sum(probs * logprobs, axis=-1)
assert entropy.shape.ndims == 1, "please compute pointwise entropy vector of shape [n_envs,] "

actor_loss = -tf.reduce_mean(logp_actions * tf.stop_gradient(advantage)) - 0.001 * tf.reduce_mean(entropy)

# compute target state values using the temporal difference formula; use rewards_ph and next_state_values
target_state_values = rewards_ph + gamma * next_state_values
critic_loss = tf.reduce_mean((state_values - tf.stop_gradient(target_state_values)) ** 2)

# clip gradients by norm before applying them to keep training stable
optimizer = tf.train.AdamOptimizer(1e-4)
grads_and_vars = optimizer.compute_gradients(actor_loss + critic_loss)
clipped_grads_and_vars = [(tf.clip_by_norm(g, clip_norm=40), v)
                          for g, v in grads_and_vars if g is not None]
train_step = optimizer.apply_gradients(clipped_grads_and_vars)

sess.run(tf.global_variables_initializer())
In [16]:
# Sanity checks to catch some errors. Specific to KungFuMaster in the assignment's default setup.
l_act, l_crit, adv, ent = sess.run([actor_loss, critic_loss, advantage, entropy], feed_dict={
    states_ph: batch_states,
    actions_ph: batch_actions,
    next_states_ph: batch_states,
    rewards_ph: batch_rewards,
    is_done_ph: batch_done,
})

assert abs(l_act) < 100 and abs(l_crit) < 100, "losses seem abnormally large"
assert 0 <= ent.mean() <= np.log(n_actions), "impossible entropy value, double-check the formula pls"
if ent.mean() < np.log(n_actions) / 2:
    print("Entropy is too low for an untrained agent")
print("You just might be fine!")
In [17]:
from IPython.display import clear_output
from tqdm import trange
from pandas import DataFrame
ewma = lambda x, span=100: DataFrame({'x':np.asarray(x)}).x.ewm(span=span).mean().values
env_batch = EnvBatch(10)
batch_states = env_batch.reset()
rewards_history = []
entropy_history = []
In [19]:
for i in trange(100000):
    batch_actions = agent.sample_actions(agent.step(batch_states))
    batch_next_states, batch_rewards, batch_done, _ = env_batch.step(batch_actions)

    feed_dict = {
        states_ph: batch_states,
        actions_ph: batch_actions,
        next_states_ph: batch_next_states,
        rewards_ph: batch_rewards,
        is_done_ph: batch_done,
    }
    batch_states = batch_next_states

    _, ent_t = sess.run([train_step, entropy], feed_dict)
    entropy_history.append(np.mean(ent_t))

    if i % 500 == 0:
        if i % 2500 == 0:
            rewards_history.append(np.mean(evaluate(agent, env, n_games=3)))
            if rewards_history[-1] >= 50:
                print("Your agent has earned the yellow belt")
            if rewards_history[-1] >= 1100:
                print('Good job, exiting')
                break

        clear_output(True)
        plt.figure(figsize=[8, 4])
        plt.subplot(1, 2, 1)
        plt.plot(rewards_history, label='rewards')
        plt.plot(ewma(np.array(rewards_history), span=10), marker='.', label='rewards ewma@10')
        plt.title("Session rewards")
        plt.grid()
        plt.legend()

        plt.subplot(1, 2, 2)
        plt.plot(entropy_history, label='entropy')
        plt.plot(ewma(np.array(entropy_history), span=1000), label='entropy ewma@1000')
        plt.title("Policy entropy")
        plt.grid()
        plt.legend()
        plt.show()
Relax and grab some refreshments while your agent is locked in an infinite loop of violence and death.
How to interpret plots:
The session reward is the easy one: it should generally go up over time, but it's okay if it fluctuates like crazy. It's also OK if the reward doesn't increase substantially during the first ~10k steps. However, if the reward reaches zero and doesn't recover over 2-3 evaluations, something is going wrong.
Since we use a policy-based method, we also keep track of policy entropy - the same quantity you used as a regularizer. The only important thing about it is that the entropy shouldn't drop too low (below ~0.1) before your agent gets the yellow belt. Or rather, it can drop there, but it shouldn't stay there for long.
If it does, the likely culprits are a bug in the entropy computation, an entropy coefficient that is too small for your architecture (so the policy collapses too early), or exploding gradients.
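For calibration: a uniform policy over $n$ actions has entropy $\ln n$, roughly $\ln 14 \approx 2.6$ for KungFuMaster's default 14-action set, so an untrained agent should sit near that value, while anything below 0.1 means the policy has become almost deterministic.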
If you're debugging, just run logits, values = agent.step(batch_states) and manually look at the logits and values. This will reveal the problem 9 times out of 10: you'll likely see NaNs, insanely large numbers, or all zeros. Try to catch the moment this first happens and investigate from there.
In [20]:
logits, values = agent.step(batch_states)
logits, values
Out[20]:
In [21]:
env_monitor = gym.wrappers.Monitor(env, directory="kungfu_videos", force=True)
final_rewards = evaluate(agent, env_monitor, n_games=20)
env_monitor.close()
print("Final mean reward:", np.mean(final_rewards))
video_names = list(filter(lambda s: s.endswith(".mp4"), os.listdir("./kungfu_videos/")))
In [22]:
HTML("""
<video width="640" height="480" controls>
<source src="{}" type="video/mp4">
</video>
""".format("./kungfu_videos/"+video_names[-1]))
Out[22]:
In [23]:
HTML("""
<video width="640" height="480" controls>
<source src="{}" type="video/mp4">
</video>
""".format("./kungfu_videos/" + video_names[-2])) # try other indices
Out[23]:
In [ ]:
# if you don't see videos, just navigate to ./kungfu_videos and download .mp4 files from there.
In [24]:
from submit import submit_kungfu
env = make_env()
submit_kungfu(agent, env, evaluate, "<your email>", "<your token>")
Well, 5k reward is just the beginning. Can you get past 200? With recurrent neural network memory, chances are you can even beat 400!
In [ ]:
import joblib
from six import BytesIO


def dumps(data):
    "converts whatever to a byte string"
    s = BytesIO()
    joblib.dump(data, s)
    return s.getvalue()


def loads(string):
    "converts a byte string back to whatever was dumps'ed into it"
    return joblib.load(BytesIO(string))
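A usage sketch for these helpers (illustrative only; the variable names are made up): serialize the session's current weights to bytes and restore them later.

weights = sess.run(tf.trainable_variables())   # list of numpy arrays
blob = dumps(weights)                          # bytes you could write to disk

restored = loads(blob)
sess.run([var.assign(value) for var, value in zip(tf.trainable_variables(), restored)])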