In this notebook you'll build a deep reinforcement learning agent for Atari KungFuMaster and train it with advantage actor-critic (A2C).
In [1]:
from __future__ import print_function, division
from IPython.core import display
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
#If you are running on a server, launch xvfb to record game videos
#Please make sure you have xvfb installed
import os
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'
For starters, let's take a look at the game itself:
In [2]:
import gym
from atari_util import PreprocessAtari
# We scale rewards to avoid exploding gradients during optimization.
reward_scale = 0.01
def make_env():
    env = gym.make("KungFuMasterDeterministic-v0")
    env = PreprocessAtari(
        env, height=42, width=42,
        crop=lambda img: img[60:-30, 5:],
        dim_order='tensorflow',
        color=False, n_frames=4,
        reward_scale=reward_scale)
    return env
env = make_env()
obs_shape = env.observation_space.shape
n_actions = env.action_space.n
print("Observation shape:", obs_shape)
print("Num actions:", n_actions)
print("Action names:", env.env.env.get_action_meanings())
In [3]:
s = env.reset()
for _ in range(100):
    s, _, _, _ = env.step(env.action_space.sample())
plt.title('Game image')
plt.imshow(env.render('rgb_array'))
plt.show()
plt.title('Agent observation (4-frame buffer)')
plt.imshow(s.transpose([0,2,1]).reshape([42,-1]))
plt.show()
We now have to build an agent for actor-critic training: a convolutional neural network that converts states into action probabilities $\pi$ and state values $V$.
Your assignment here is to build and apply a neural network with any framework you want.
For starters, we want you to implement a simple architecture along the lines sketched below.
After your agent gets a mean reward above 50, we encourage you to experiment with the model architecture to score even better.
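If you'd like a starting point, here is a minimal sketch of such a network, assuming Keras on top of TensorFlow 1.x (the same stack used in the rest of the notebook); the layer sizes are illustrative, not prescribed, and mirror the commented-out reference further down:

from keras.layers import Conv2D, Dense, Flatten, Input
from keras.models import Model

def build_network(state_shape, n_actions):
    """A small conv net with separate policy-logits and state-value heads."""
    inputs = Input(shape=state_shape)
    x = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')(inputs)
    x = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')(x)
    x = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')(x)
    x = Flatten()(x)
    x = Dense(128, activation='elu')(x)
    logits = Dense(n_actions, activation='linear')(x)    # unnormalized action preferences
    state_value = Dense(1, activation='linear')(x)       # V(s)
    return Model(inputs=inputs, outputs=[logits, state_value])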
In [4]:
import tensorflow as tf
tf.reset_default_graph()
sess = tf.InteractiveSession()
In [5]:
from keras.layers import Conv2D, Dense, Flatten, Input
from keras.models import Model


class Agent:
    def __init__(self, name, state_shape, n_actions, reuse=False):
        """A simple actor-critic agent"""
        with tf.variable_scope(name, reuse=reuse):
            # Prepare neural network architecture
            inputs = Input(shape=state_shape, name='input')
            x = Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation='relu', name='conv1')(inputs)
            x = Flatten()(x)
            x = Dense(32, activation='relu', name='dense')(x)
            logits_head = Dense(n_actions, activation='linear', name='logits_head')(x)
            v_head = Dense(1, activation='linear', name='v_head')(x)
            self.network = Model(inputs=inputs, outputs=[logits_head, v_head])
            self.network.summary()

            # Reference architecture from the assignment, kept for comparison:
            # self.network = Sequential()
            # self.network.add(Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation='elu'))
            # self.network.add(Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation='elu'))
            # self.network.add(Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation='elu'))
            # self.network.add(Flatten())
            # self.network.add(Dense(128, activation='elu'))
            # self.network.add(Dense(n_actions, activation='linear'))

            # prepare a graph for agent step
            self.state_t = tf.placeholder('float32', [None, ] + list(state_shape))
            self.agent_outputs = self.symbolic_step(self.state_t)

    def symbolic_step(self, state_t):
        """Takes a batch of observations, returns policy logits and state values (tf tensors)"""
        # Apply the agent's neural network to get policy logits and state values
        logits, state_value = self.network(state_t)
        state_value = tf.reshape(state_value, shape=[-1, ])  # [None, 1] -> [None]

        assert tf.is_numeric_tensor(state_value) and state_value.shape.ndims == 1, \
            "please return 1D tf tensor of state values [you got %s]" % repr(state_value)
        assert tf.is_numeric_tensor(logits) and logits.shape.ndims == 2, \
            "please return 2d tf tensor of logits [you got %s]" % repr(logits)
        # hint: if you triggered the state_values assert with shape [None, 1],
        # just take the [:, 0]-th element of state values as the new state values
        return logits, state_value

    def step(self, state_t):
        """Same as symbolic_step except it operates on numpy arrays"""
        sess = tf.get_default_session()
        return sess.run(self.agent_outputs, {self.state_t: state_t})

    def sample_actions(self, agent_outputs):
        """pick actions given numeric agent outputs (np arrays)"""
        logits, state_values = agent_outputs
        # subtract the max logit before exponentiating for numerical stability
        logits = logits - logits.max(axis=-1, keepdims=True)
        policy = np.exp(logits) / np.sum(np.exp(logits), axis=-1, keepdims=True)
        return np.array([np.random.choice(len(p), p=p) for p in policy])
In [6]:
agent = Agent("agent", obs_shape, n_actions)
sess.run(tf.global_variables_initializer())
In [7]:
state = [env.reset()]
logits, value = agent.step(state)
print("action logits:\n", logits)
print("state values:\n", value)
In [8]:
def evaluate(agent, env, n_games=1):
    """Plays n_games from start till done, returns per-game rewards"""
    game_rewards = []
    for _ in range(n_games):
        state = env.reset()
        total_reward = 0
        while True:
            action = agent.sample_actions(agent.step([state]))[0]
            state, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                break
        # We rescale the reward back to ensure compatibility
        # with other evaluations.
        game_rewards.append(total_reward / reward_scale)
    return game_rewards
In [9]:
env_monitor = gym.wrappers.Monitor(env, directory="kungfu_videos", force=True)
rw = evaluate(agent, env_monitor, n_games=3)
env_monitor.close()
print(rw)
In [10]:
#show video
import os
from IPython.display import HTML
video_names = [s for s in os.listdir("./kungfu_videos/") if s.endswith(".mp4")]
HTML("""
<video width="640" height="480" controls>
<source src="{}" type="video/mp4">
</video>
""".format("./kungfu_videos/" + video_names[-1])) #this may or may not be _last_ video. Try other indices
Out[10]:
To make actor-critic training more stable, we shall play several games in parallel. This means you'll have to initialize several parallel gym envs, send the agent's actions to them, and .reset() each env once it's done. To minimize learner brain damage, we've taken care of that for you - just make sure you read the code before you use it.
In [11]:
class EnvBatch:
    def __init__(self, n_envs=10):
        """ Creates n_envs environments and babysits them for ya' """
        self.envs = [make_env() for _ in range(n_envs)]

    def reset(self):
        """ Reset all games and return [n_envs, *obs_shape] observations """
        return np.array([env.reset() for env in self.envs])

    def step(self, actions):
        """
        Send a vector[batch_size] of actions into respective environments
        :returns: observations[n_envs, *obs_shape], rewards[n_envs], done[n_envs], info[n_envs]
        """
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        new_obs, rewards, done, infos = map(np.array, zip(*results))

        # reset environments automatically
        for i in range(len(self.envs)):
            if done[i]:
                new_obs[i] = self.envs[i].reset()

        return new_obs, rewards, done, infos
Let's try it out:
In [12]:
env_batch = EnvBatch(10)
batch_states = env_batch.reset()
batch_actions = agent.sample_actions(agent.step(batch_states))
batch_next_states, batch_rewards, batch_done, _ = env_batch.step(batch_actions)
print("State shape:", batch_states.shape)
print("Actions:", batch_actions[:3])
print("Rewards:", batch_rewards[:3])
print("Done:", batch_done[:3])
In [13]:
# These placeholders mean exactly the same as in "Let's try it out" section above
states_ph = tf.placeholder('float32', [None,] + list(obs_shape))
next_states_ph = tf.placeholder('float32', [None,] + list(obs_shape))
actions_ph = tf.placeholder('int32', (None,))
rewards_ph = tf.placeholder('float32', (None,))
is_done_ph = tf.placeholder('float32', (None,))
In [14]:
# logits[n_envs, n_actions] and state_values[n_envs,]
logits, state_values = agent.symbolic_step(states_ph)
next_logits, next_state_values = agent.symbolic_step(next_states_ph)
next_state_values = next_state_values * (1 - is_done_ph)
# probabilities and log-probabilities for all actions
probs = tf.nn.softmax(logits) # [n_envs, n_actions]
logprobs = tf.nn.log_softmax(logits) # [n_envs, n_actions]
# log-probabilities only for agent's chosen actions
logp_actions = tf.reduce_sum(logprobs * tf.one_hot(actions_ph, n_actions), axis=-1) # [n_envs,]
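For reference, here is what the next cell computes, written out from its code (discount $\gamma = 0.99$, entropy coefficient $0.001$, and $V(s')$ already zeroed out for terminal transitions by the `is_done_ph` mask above):

$A(s, a) = r + \gamma V(s') - V(s)$

$\mathcal{H}\big(\pi(\cdot|s)\big) = -\sum_a \pi(a|s) \log \pi(a|s)$

$L_{\text{actor}} = -\tfrac{1}{N} \sum_i \log \pi(a_i|s_i)\, A(s_i, a_i) \;-\; 0.001 \cdot \tfrac{1}{N} \sum_i \mathcal{H}\big(\pi(\cdot|s_i)\big)$

$L_{\text{critic}} = \tfrac{1}{N} \sum_i \big( V(s_i) - [\,r_i + \gamma V(s'_i)\,] \big)^2$

Both the advantage $A$ and the TD target $r + \gamma V(s')$ are treated as constants (stop_gradient) when differentiating.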
In [15]:
# compute advantage using rewards_ph, state_values and next_state_values
gamma = 0.99
advantage = rewards_ph + gamma * next_state_values - state_values
assert advantage.shape.ndims == 1, "please compute advantage for each sample, vector of shape [n_envs,]"

# compute policy entropy given logits. Mind the "-" sign!
entropy = -tf.reduce_sum(probs * logprobs, axis=-1)
assert entropy.shape.ndims == 1, "please compute pointwise entropy vector of shape [n_envs,] "

actor_loss = -tf.reduce_mean(logp_actions * tf.stop_gradient(advantage)) - 0.001 * tf.reduce_mean(entropy)

# compute target state values using the temporal difference formula; use rewards_ph and next_state_values
target_state_values = rewards_ph + gamma * next_state_values
critic_loss = tf.reduce_mean((state_values - tf.stop_gradient(target_state_values)) ** 2)

# clip gradients by norm before applying them to keep training stable
optimizer = tf.train.AdamOptimizer(1e-4)
grads_and_vars = optimizer.compute_gradients(actor_loss + critic_loss)
clipped_grads_and_vars = [(tf.clip_by_norm(g, clip_norm=40), v)
                          for g, v in grads_and_vars if g is not None]
train_step = optimizer.apply_gradients(clipped_grads_and_vars)

sess.run(tf.global_variables_initializer())
In [16]:
# Sanity checks to catch some errors. Specific to KungFuMaster in the assignment's default setup.
l_act, l_crit, adv, ent = sess.run([actor_loss, critic_loss, advantage, entropy], feed_dict={
    states_ph: batch_states,
    actions_ph: batch_actions,
    next_states_ph: batch_states,
    rewards_ph: batch_rewards,
    is_done_ph: batch_done,
})

assert abs(l_act) < 100 and abs(l_crit) < 100, "losses seem abnormally large"
assert 0 <= ent.mean() <= np.log(n_actions), "impossible entropy value, double-check the formula pls"
if ent.mean() < np.log(n_actions) / 2:
    print("Entropy is too low for an untrained agent")
print("You just might be fine!")
In [17]:
from IPython.display import clear_output
from tqdm import trange
from pandas import DataFrame
ewma = lambda x, span=100: DataFrame({'x':np.asarray(x)}).x.ewm(span=span).mean().values
env_batch = EnvBatch(10)
batch_states = env_batch.reset()
rewards_history = []
entropy_history = []
In [19]:
for i in trange(100000):
    batch_actions = agent.sample_actions(agent.step(batch_states))
    batch_next_states, batch_rewards, batch_done, _ = env_batch.step(batch_actions)

    feed_dict = {
        states_ph: batch_states,
        actions_ph: batch_actions,
        next_states_ph: batch_next_states,
        rewards_ph: batch_rewards,
        is_done_ph: batch_done,
    }
    batch_states = batch_next_states

    _, ent_t = sess.run([train_step, entropy], feed_dict)
    entropy_history.append(np.mean(ent_t))

    if i % 500 == 0:
        if i % 2500 == 0:
            rewards_history.append(np.mean(evaluate(agent, env, n_games=3)))
            if rewards_history[-1] >= 50:
                print("Your agent has earned the yellow belt")
            if rewards_history[-1] >= 1100:
                print('Good job, exiting')
                break

        clear_output(True)
        plt.figure(figsize=[8, 4])
        plt.subplot(1, 2, 1)
        plt.plot(rewards_history, label='rewards')
        plt.plot(ewma(np.array(rewards_history), span=10), marker='.', label='rewards ewma@10')
        plt.title("Session rewards")
        plt.grid()
        plt.legend()

        plt.subplot(1, 2, 2)
        plt.plot(entropy_history, label='entropy')
        plt.plot(ewma(np.array(entropy_history), span=1000), label='entropy ewma@1000')
        plt.title("Policy entropy")
        plt.grid()
        plt.legend()
        plt.show()
Relax and grab some refreshments while your agent is locked in an infinite loop of violence and death.
How to interpret plots:
The session reward is the easy one: it should generally go up over time, but it's okay if it fluctuates like crazy. It's also OK if the reward doesn't increase substantially during the first ~10k steps. However, if the reward reaches zero and doesn't recover over 2-3 evaluations, something is going wrong.
Since we use a policy-based method, we also keep track of policy entropy - the same quantity you used as a regularizer. The only important thing about it is that the entropy shouldn't drop too low (below ~0.1) before your agent gets the yellow belt. Or rather, it can drop there, but it shouldn't stay there for long.
If it does, the likely culprits are a bug in the entropy computation, an entropy coefficient that is too small for your architecture (so the policy collapses too early), or exploding gradients.
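For calibration: a uniform policy over $n$ actions has entropy $\ln n$, roughly $\ln 14 \approx 2.6$ for KungFuMaster's default 14-action set, so an untrained agent should sit near that value, while anything below 0.1 means the policy has become almost deterministic.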
If you're debugging, just run logits, values = agent.step(batch_states) and manually look at the logits and values. This will reveal the problem 9 times out of 10: you'll likely see NaNs, insanely large numbers, or all zeros. Try to catch the moment this first happens and investigate from there.
In [20]:
logits, values = agent.step(batch_states)
logits, values
Out[20]:
In [21]:
env_monitor = gym.wrappers.Monitor(env, directory="kungfu_videos", force=True)
final_rewards = evaluate(agent, env_monitor, n_games=20)
env_monitor.close()
print("Final mean reward:", np.mean(final_rewards))
video_names = list(filter(lambda s: s.endswith(".mp4"), os.listdir("./kungfu_videos/")))
In [22]:
HTML("""
<video width="640" height="480" controls>
<source src="{}" type="video/mp4">
</video>
""".format("./kungfu_videos/"+video_names[-1]))
Out[22]:
In [23]:
HTML("""
<video width="640" height="480" controls>
<source src="{}" type="video/mp4">
</video>
""".format("./kungfu_videos/" + video_names[-2])) # try other indices
Out[23]:
In [ ]:
# if you don't see videos, just navigate to ./kungfu_videos and download .mp4 files from there.
In [24]:
from submit import submit_kungfu
env = make_env()
submit_kungfu(agent, env, evaluate, "<your email>", "<your token>")
Well, 5k reward is just the beginning. Can you get past 200? With recurrent neural network memory, chances are you can even beat 400!
In [ ]:
import joblib
from six import BytesIO


def dumps(data):
    "converts whatever to a byte string"
    s = BytesIO()
    joblib.dump(data, s)
    return s.getvalue()


def loads(string):
    "converts a byte string back to whatever was dumps'ed into it"
    return joblib.load(BytesIO(string))
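A usage sketch for these helpers (illustrative only; the variable names are made up): serialize the session's current weights to bytes and restore them later.

weights = sess.run(tf.trainable_variables())   # list of numpy arrays
blob = dumps(weights)                          # bytes you could write to disk

restored = loads(blob)
sess.run([var.assign(value) for var, value in zip(tf.trainable_variables(), restored)])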