Deep Q-Network implementation

This notebook shamelessly demands that you implement a DQN - approximate Q-learning with experience replay and target networks - and see if it works any better this way.


In [1]:
#XVFB will be launched if you run on a server
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY")) == 0:
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'


Starting virtual X frame buffer: Xvfb.

Frameworks - we'll accept this homework in any deep learning framework. This particular notebook was designed for TensorFlow, but you will find it easy to adapt to almost any Python-based deep learning framework.


In [2]:
import gym
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Let's play some old videogames

This time we're gonna apply approximate Q-learning to an Atari game called Breakout. It's not the hardest thing out there, but it's definitely way more complex than anything we tried before.

Processing game image

Raw Atari images are large, 210x160x3 by default. However, we don't need that level of detail in order to learn from them.

We can thus save a lot of time by preprocessing the game image:

  • Resizing to a smaller shape, 64 x 64
  • Converting to grayscale
  • Cropping irrelevant image parts (top & bottom)

In [3]:
from gym.core import ObservationWrapper
from gym.spaces import Box

from scipy.misc import imresize
from skimage import color


class PreprocessAtari(ObservationWrapper):
    def __init__(self, env):
        """A gym wrapper that crops, scales image into the desired shapes and optionally grayscales it."""
        ObservationWrapper.__init__(self,env)
        
        self.img_size = (64, 64)
        self.observation_space = Box(0.0, 1.0, (self.img_size[0], self.img_size[1], 1))

    def observation(self, img):
        """what happens to each observation"""
        
        # Here's what you need to do:
        #  * crop image, remove irrelevant parts
        #  * resize image to self.img_size 
        #     (use imresize imported above or any library you want,
        #      e.g. opencv, skimage, PIL, keras)
        #  * cast image to grayscale
        #  * convert image pixels to (0,1) range, float32 type

        # Cropping is optional here - the resize below already passes the formal
        # tests - but you could remove the scoreboard first, e.g. via something
        # like img[34:-16, :, :] (approximate Breakout bounds).

        img2 = imresize(img, self.img_size)          # resize to 64x64
        img2 = color.rgb2gray(img2)                  # grayscale floats in [0, 1]
        img2 = img2.reshape(self.img_size + (1,))    # add a channel axis -> (64, 64, 1)
        # note: dividing by the per-frame max stretches contrast and assumes
        # the frame is not entirely black
        img2 = img2.astype('float32') / img2.max()
        return img2

In [4]:
import gym
#spawn game instance for tests
env = gym.make("BreakoutDeterministic-v0") #create raw env
env = PreprocessAtari(env)

observation_shape = env.observation_space.shape
n_actions = env.action_space.n


obs = env.reset()

#test observation
assert obs.ndim == 3, "observation must be [height, width, channels] even if there's just one channel"
assert obs.shape == observation_shape
assert obs.dtype == 'float32'
assert len(np.unique(obs))>2, "your image must not be binary"
assert 0 <= np.min(obs) and np.max(obs) <=1, "convert image pixels to (0,1) range"

print("Formal tests seem fine. Here's an example of what you'll get.")

plt.title("what your network gonna see")
plt.imshow(obs[:, :, 0], interpolation='none',cmap='gray');


Formal tests seem fine. Here's an example of what you'll get.

Frame buffer

Our agent can only process one observation at a time, so we gotta make sure it contains enough information to find optimal actions. For instance, the agent has to react to moving objects, so it must be able to estimate an object's velocity from what it sees.

To do so, we introduce a buffer that stores the 4 most recent frames. This time everything is pre-implemented for you.
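
For the curious: framebuffer.py is provided with the course materials, so you don't have to write this yourself. Purely as a sketch, a wrapper like that could look roughly as follows (channels-last / 'tensorflow' dim order; names and details are illustrative, not the actual implementation):

    # Sketch of a rolling 4-frame buffer wrapper (illustrative, simplified).
    import numpy as np
    from gym.core import Wrapper
    from gym.spaces import Box

    class SimpleFrameBuffer(Wrapper):
        def __init__(self, env, n_frames=4):
            super(SimpleFrameBuffer, self).__init__(env)
            h, w, c = env.observation_space.shape
            self.observation_space = Box(0.0, 1.0, (h, w, c * n_frames))
            self.framebuffer = np.zeros((h, w, c * n_frames), dtype='float32')
            self.n_channels = c

        def reset(self):
            self.framebuffer = np.zeros_like(self.framebuffer)
            self.update_buffer(self.env.reset())
            return self.framebuffer

        def step(self, action):
            new_img, reward, done, info = self.env.step(action)
            self.update_buffer(new_img)
            return self.framebuffer, reward, done, info

        def update_buffer(self, img):
            # drop the oldest frame, prepend the newest along the channel axis
            old_frames = self.framebuffer[:, :, :-self.n_channels]
            self.framebuffer = np.concatenate([img, old_frames], axis=-1)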


In [5]:
from framebuffer import FrameBuffer
def make_env():
    env = gym.make("BreakoutDeterministic-v4")
    env = PreprocessAtari(env)
    env = FrameBuffer(env, n_frames=4, dim_order='tensorflow')
    return env

env = make_env()
env.reset()
n_actions = env.action_space.n
state_dim = env.observation_space.shape

In [6]:
for _ in range(50):
    obs, _, _, _ = env.step(env.action_space.sample())


plt.title("Game image")
plt.imshow(env.render("rgb_array"))
plt.show()
plt.title("Agent observation (4 frames left to right)")
plt.imshow(obs.transpose([0,2,1]).reshape([state_dim[0],-1]));


Building a network

We now need to build a neural network that can map images to state Q-values. This network will be called on every step the agent takes, so it had better not be ResNet-152 unless you have an array of GPUs. Instead, you can use strided convolutions with a small number of filters to save time and memory.

You can build any architecture you want, but for reference, here's something that will more or less work:


In [7]:
import tensorflow as tf
tf.reset_default_graph()
sess = tf.InteractiveSession()

In [8]:
from keras.layers import Conv2D, Dense, Flatten
from keras.models import Sequential
class DQNAgent:
    def __init__(self, name, state_shape, n_actions, epsilon=0, reuse=False):
        """A simple DQN agent"""
        with tf.variable_scope(name, reuse=reuse):
            
            #< Define your network body here. Please make sure you don't use any layers created elsewhere >
            self.network = Sequential()
            self.network.add(Conv2D(filters=16, kernel_size=(3, 3), strides=(2, 2), activation='relu'))
            self.network.add(Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation='relu'))
            self.network.add(Conv2D(filters=64, kernel_size=(3, 3), strides=(2, 2), activation='relu'))
            self.network.add(Flatten())
            self.network.add(Dense(256, activation='relu'))
            self.network.add(Dense(n_actions, activation='linear'))
            
            # prepare a graph for agent step
            self.state_t = tf.placeholder('float32', [None,] + list(state_shape))
            self.qvalues_t = self.get_symbolic_qvalues(self.state_t)
            
        self.weights = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=name)
        self.epsilon = epsilon

    def get_symbolic_qvalues(self, state_t):
        """takes agent's observation, returns qvalues. Both are tf Tensors"""
        #< apply your network layers here >
        qvalues = self.network(state_t) #< symbolic tensor for q-values >
        
        assert tf.is_numeric_tensor(qvalues) and qvalues.shape.ndims == 2, \
            "please return 2d tf tensor of qvalues [you got %s]" % repr(qvalues)
        assert int(qvalues.shape[1]) == n_actions
        
        return qvalues
    
    def get_qvalues(self, state_t):
        """Same as symbolic step except it operates on numpy arrays"""
        sess = tf.get_default_session()
        return sess.run(self.qvalues_t, {self.state_t: state_t})
    
    def sample_actions(self, qvalues):
        """pick actions given qvalues. Uses epsilon-greedy exploration strategy. """
        epsilon = self.epsilon
        batch_size, n_actions = qvalues.shape
        random_actions = np.random.choice(n_actions, size=batch_size)
        best_actions = qvalues.argmax(axis=-1)
        should_explore = np.random.choice([0, 1], batch_size, p = [1-epsilon, epsilon])
        return np.where(should_explore, random_actions, best_actions)


Using TensorFlow backend.

In [9]:
agent = DQNAgent("dqn_agent", state_dim, n_actions, epsilon=0.5)
sess.run(tf.global_variables_initializer())


WARNING: Logging before flag parsing goes to stderr.
W0213 23:10:53.443150 139892449416960 deprecation_wrapper.py:119] From /opt/conda/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0213 23:10:53.446766 139892449416960 deprecation_wrapper.py:119] From /opt/conda/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0213 23:10:53.448719 139892449416960 deprecation_wrapper.py:119] From /opt/conda/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

Now let's try out our agent to see if it raises any errors.


In [10]:
def evaluate(env, agent, n_games=1, greedy=False, t_max=10000):
    """ Plays n_games full games. If greedy, picks actions as argmax(qvalues). Returns mean reward. """
    rewards = []
    for _ in range(n_games):
        s = env.reset()
        reward = 0
        for _ in range(t_max):
            qvalues = agent.get_qvalues([s])
            action = qvalues.argmax(axis=-1)[0] if greedy else agent.sample_actions(qvalues)[0]
            s, r, done, _ = env.step(action)
            reward += r
            if done: 
                break
                
        rewards.append(reward)
    return np.mean(rewards)

In [11]:
evaluate(env, agent, n_games=1)


Out[11]:
5.0

Experience replay

For this assignment, we provide you with an experience replay buffer. If you implemented an experience replay buffer in last week's assignment, you can copy-paste it here to get 2 bonus points.

The interface is fairly simple (a minimal sketch of such a buffer is shown right after this list):

  • exp_replay.add(obs, act, rw, next_obs, done) - saves (s,a,r,s',done) tuple into the buffer
  • exp_replay.sample(batch_size) - returns observations, actions, rewards, next_observations and is_done for batch_size random samples.
  • len(exp_replay) - returns number of elements stored in replay buffer.
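
The provided ReplayBuffer lives in replay_buffer.py; purely for reference, here's a rough sketch of a buffer exposing that interface (simplified, uniform sampling, not the actual implementation):

    # Minimal ring-buffer sketch with the same add/sample/len interface.
    import random
    import numpy as np

    class MinimalReplayBuffer:
        def __init__(self, capacity):
            self._storage = []
            self._capacity = capacity
            self._next_idx = 0

        def __len__(self):
            return len(self._storage)

        def add(self, obs, act, rw, next_obs, done):
            data = (obs, act, rw, next_obs, done)
            if len(self._storage) < self._capacity:
                self._storage.append(data)
            else:
                self._storage[self._next_idx] = data  # overwrite the oldest entry
            self._next_idx = (self._next_idx + 1) % self._capacity

        def sample(self, batch_size):
            batch = [random.choice(self._storage) for _ in range(batch_size)]
            obs, acts, rws, next_obs, dones = zip(*batch)
            return (np.array(obs), np.array(acts), np.array(rws),
                    np.array(next_obs), np.array(dones))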

In [12]:
from replay_buffer import ReplayBuffer
exp_replay = ReplayBuffer(10)

for _ in range(30):
    exp_replay.add(env.reset(), env.action_space.sample(), 1.0, env.reset(), done=False)

obs_batch, act_batch, reward_batch, next_obs_batch, is_done_batch = exp_replay.sample(5)

assert len(exp_replay) == 10, "experience replay size should be 10 because that's its maximum capacity"

In [13]:
def play_and_record(agent, env, exp_replay, n_steps=1):
    """
    Play the game for exactly n_steps, record every (s,a,r,s', done) to replay buffer. 
    Whenever the game ends, add a record with done=True and reset the game.
    :returns: sum of rewards over time
    
    Note: please do not env.reset() unless env is done.
    It is guaranteed that env has done=False when passed to this function.
    """
    # State at the beginning of rollout
    s = env.framebuffer
    
    # Play the game for n_steps as per instructions above
    last_info = None
    total_reward = 0
    for _ in range(n_steps):
        qvalues = agent.get_qvalues([s])
        a = agent.sample_actions(qvalues)[0]  # epsilon-greedy action
    
        next_s, r, done, info = env.step(a)
        # optional reward shaping: penalize losing a life. Note that this changes
        # both the stored rewards and the sum returned by this function.
        r = -10 if (last_info is not None and last_info['ale.lives'] > info['ale.lives']) else r
        last_info = info
        
        # store the transition in the experience replay buffer
        exp_replay.add(s, a, r, next_s, done) 

        total_reward += r
        s = next_s
        if done: 
            s = env.reset()
            last_info = None
    return total_reward

In [14]:
# testing your code. This may take a minute...
exp_replay = ReplayBuffer(20000)

play_and_record(agent, env, exp_replay, n_steps=10000)

# if you're using your own experience replay buffer, some of those tests may need correction. 
# just make sure you know what your code does
assert len(exp_replay) == 10000, "play_and_record should have added exactly 10000 steps, "\
                                 "but instead added %i"%len(exp_replay)
is_dones = list(zip(*exp_replay._storage))[-1]

assert 0 < np.mean(is_dones) < 0.1, "Please make sure you restart the game whenever it is 'done' and record the is_done correctly into the buffer."\
                                    "Got %f is_done rate over %i steps. [If you think it's your tough luck, just re-run the test]"%(np.mean(is_dones), len(exp_replay))
    
for _ in range(100):
    obs_batch, act_batch, reward_batch, next_obs_batch, is_done_batch = exp_replay.sample(10)
    assert obs_batch.shape == next_obs_batch.shape == (10,) + state_dim
    assert act_batch.shape == (10,), "actions batch should have shape (10,) but is instead %s"%str(act_batch.shape)
    assert reward_batch.shape == (10,), "rewards batch should have shape (10,) but is instead %s"%str(reward_batch.shape)
    assert is_done_batch.shape == (10,), "is_done batch should have shape (10,) but is instead %s"%str(is_done_batch.shape)
    assert all(int(i) in (0, 1) for i in is_dones), "is_done should be strictly True or False"
    assert all(0 <= a < n_actions for a in act_batch), "actions should be within [0, n_actions)"
    
print("Well done!")


Well done!

Target networks

We also employ the so-called "target network" - a copy of the neural network weights that is used to compute reference Q-values:

The network itself is an exact copy of the agent network, but its parameters are not trained. Instead, they are copied over from the agent's actual network every so often.

$$ Q_{reference}(s,a) = r + \gamma \cdot \max _{a'} Q_{target}(s',a') $$


In [15]:
target_network = DQNAgent("target_network", state_dim, n_actions)

In [16]:
def load_weigths_into_target_network(agent, target_network):
    """ assign target_network.weights variables to their respective agent.weights values. """
    assigns = []
    for w_agent, w_target in zip(agent.weights, target_network.weights):
        assigns.append(tf.assign(w_target, w_agent, validate_shape=True))
    tf.get_default_session().run(assigns)

In [17]:
load_weigths_into_target_network(agent, target_network) 

# check that it works
sess.run([tf.assert_equal(w, w_target) for w, w_target in zip(agent.weights, target_network.weights)]);
print("It works!")


It works!

Learning with... Q-learning

Here we write a function similar to agent.update from tabular q-learning.


In [18]:
# placeholders that will be fed with exp_replay.sample(batch_size)
obs_ph = tf.placeholder(tf.float32, shape=(None,) + state_dim)
actions_ph = tf.placeholder(tf.int32, shape=[None])
rewards_ph = tf.placeholder(tf.float32, shape=[None])
next_obs_ph = tf.placeholder(tf.float32, shape=(None,) + state_dim)
is_done_ph = tf.placeholder(tf.float32, shape=[None])

is_not_done = 1 - is_done_ph
gamma = 0.99

Take the Q-values for the actions the agent just took


In [19]:
current_qvalues = agent.get_symbolic_qvalues(obs_ph)
current_action_qvalues = tf.reduce_sum(tf.one_hot(actions_ph, n_actions) * current_qvalues, axis=1)

Compute Q-learning TD error:

$$ L = { 1 \over N} \sum_i [ Q_{\theta}(s,a) - Q_{reference}(s,a) ] ^2 $$

With Q-reference defined as

$$ Q_{reference}(s,a) = r(s,a) + \gamma \cdot \max_{a'} Q_{target}(s', a') $$

Where

  • $Q_{target}(s',a')$ denotes the q-value of the next state and next action predicted by target_network
  • $s, a, r, s'$ are current state, action, reward and next state respectively
  • $\gamma$ is a discount factor defined two cells above.
  • if $s'$ is terminal (is_done = 1), the bootstrapped term should be zeroed out - that's what is_not_done is for.

In [20]:
# compute q-values for NEXT states with target network
next_qvalues_target = target_network.get_symbolic_qvalues(next_obs_ph)

# compute state values by taking max over next_qvalues_target for all actions
next_state_values_target = tf.reduce_max(next_qvalues_target, axis=1)

# compute Q_reference(s,a) as per formula above.
# is_not_done zeroes out the bootstrapped term for terminal transitions.
reference_qvalues = rewards_ph + gamma * next_state_values_target * is_not_done

# Define loss function for sgd.
td_loss = (current_action_qvalues - reference_qvalues) ** 2
td_loss = tf.reduce_mean(td_loss)

train_step = tf.train.AdamOptimizer(1e-3).minimize(td_loss, var_list=agent.weights)


W0213 23:11:19.266005 139892449416960 deprecation.py:323] From /opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:1205: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

In [21]:
sess.run(tf.global_variables_initializer())

In [22]:
for chk_grad in tf.gradients(reference_qvalues, agent.weights):
    error_msg = "Reference q-values should have no gradient w.r.t. agent weights. Make sure you used target_network qvalues! "
    error_msg += "If you know what you're doing, ignore this assert."
    assert chk_grad is None or np.allclose(sess.run(chk_grad), sess.run(chk_grad * 0)), error_msg

assert tf.gradients(reference_qvalues, is_not_done)[0] is not None, "make sure you used is_not_done"
assert tf.gradients(reference_qvalues, rewards_ph)[0] is not None, "make sure you used rewards"
assert tf.gradients(reference_qvalues, next_obs_ph)[0] is not None, "make sure you used next states"
assert tf.gradients(reference_qvalues, obs_ph)[0] is None, "reference qvalues shouldn't depend on current observation!" # ignore if you're certain it's ok
print("Splendid!")


Splendid!

Main loop

It's time to put everything together and see if it learns anything.


In [23]:
from tqdm import trange
from IPython.display import clear_output
import matplotlib.pyplot as plt
from pandas import DataFrame
moving_average = lambda x, span, **kw: DataFrame({'x':np.asarray(x)}).x.ewm(span=span, **kw).mean().values
%matplotlib inline

mean_rw_history = []
td_loss_history = []

In [24]:
#exp_replay = ReplayBuffer(10**5)
exp_replay = ReplayBuffer(10**4)
play_and_record(agent, env, exp_replay, n_steps=10000)

def sample_batch(exp_replay, batch_size):
    obs_batch, act_batch, reward_batch, next_obs_batch, is_done_batch = exp_replay.sample(batch_size)
    return {
        obs_ph:obs_batch, actions_ph:act_batch, rewards_ph:reward_batch, 
        next_obs_ph:next_obs_batch, is_done_ph:is_done_batch
    }

In [25]:
for i in trange(10**5):
    
    # play
    play_and_record(agent, env, exp_replay, 10)
    
    # train
    _, loss_t = sess.run([train_step, td_loss], sample_batch(exp_replay, batch_size=64))
    td_loss_history.append(loss_t)
    
    # adjust agent parameters
    if i % 500 == 0:
        load_weigths_into_target_network(agent, target_network)
        agent.epsilon = max(agent.epsilon * 0.99, 0.01)
        mean_rw_history.append(evaluate(make_env(), agent, n_games=3))
    
    if i % 100 == 0:
        clear_output(True)
        print("buffer size = %i, epsilon = %.5f" % (len(exp_replay), agent.epsilon))
        
        assert not np.isnan(loss_t)
        # draw both curves on a single figure (create the figure before the first subplot)
        plt.figure(figsize=[12, 4])
        plt.subplot(1,2,1)
        plt.title("mean reward per game")
        plt.plot(mean_rw_history)
        plt.grid()

        plt.subplot(1,2,2)
        plt.title("TD loss history (moving average)")
        plt.plot(moving_average(np.array(td_loss_history), span=100, min_periods=100))
        plt.grid()
        plt.show()
    
    if np.mean(mean_rw_history[-10:]) > 10.:
        break


buffer size = 10000, epsilon = 0.33786

In [26]:
assert np.mean(mean_rw_history[-10:]) > 10.
print("That's good enough for tutorial.")


That's good enough for tutorial.

How to interpret plots:

This ain't supervised learning, so don't expect anything to improve monotonically.

  • TD loss is the MSE between the agent's current Q-values and the target Q-values. It may slowly increase or decrease - that's OK. The "not OK" behavior is going NaN or staying at exactly zero before the agent reaches perfect performance.
  • Mean reward is the expected sum of r(s,a) the agent gets over a full game session. It will oscillate, but on average it should get higher over time (after a few thousand iterations...).
    • In a basic Q-learning implementation it takes 5-10k steps to "warm up" the agent before it starts to get better.
  • Buffer size - this one is simple. It should go up and cap at its maximum size.
  • Epsilon - the agent's willingness to explore. If you see that the agent is already at 0.01 epsilon before its average reward climbs above 0, you need to increase epsilon. Set it back to somewhere around 0.2 - 0.5 and decrease the pace at which it goes down (see the sketch after this list).
  • Also, please ignore the first 100-200 steps of each plot - they're just oscillations caused by the way the moving average works.
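
For instance, one illustrative schedule (the numbers here are arbitrary assumptions, tune them yourself) is to decay epsilon geometrically towards a higher floor than in the loop above:

    # illustrative epsilon schedule: slower decay, higher floor
    def epsilon_schedule(i, start=0.5, floor=0.05, decay=0.995, every=500):
        return max(start * decay ** (i // every), floor)

You would then set agent.epsilon = epsilon_schedule(i) inside the training loop instead of multiplying it in place.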

At first your agent will lose quickly. Then it will learn to suck less and at least hit the ball a few times before it loses. Finally it will learn to actually score points.

Training will take time. A lot of it, actually. An optimistic estimate is that it's gonna start winning (average reward > 10) after about 10k steps.

But hey, look on the bright side of things:

Video


In [27]:
agent.epsilon=0 # Don't forget to reset epsilon back to previous value if you want to go on training

In [28]:
#record sessions
import gym.wrappers
env_monitor = gym.wrappers.Monitor(make_env(),directory="videos",force=True)
sessions = [evaluate(env_monitor, agent, n_games=1) for _ in range(100)]
env_monitor.close()


KeyboardInterrupt: the 100-game recording loop was interrupted manually inside env.step() (full traceback omitted).

In [ ]:
#show video
from IPython.display import HTML
import os

video_names = list(filter(lambda s:s.endswith(".mp4"),os.listdir("./videos/")))

HTML("""
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format("./videos/"+video_names[-1])) #this may or may not be _last_ video. Try other indices

More

If you want to play with DQN a bit more, here's a list of things you can try with it:

Easy:

  • Implementing double Q-learning shouldn't be a problem if you already have target networks in place (see the sketch after this list).

    • You will probably need tf.argmax to select best actions
    • Here's the original article
  • The dueling architecture is also quite straightforward if you have a standard DQN (also sketched below).

    • You will need to change the network architecture, namely the q-values layer
    • It must now contain two heads: V(s) and A(s,a), both dense layers
    • You should then add them up via an element-wise sum layer.
    • Here's an article
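
For reference, here's roughly what a double-DQN target could look like in this notebook's TF1 graph. This is only a sketch that assumes you reuse the placeholders, agent, target_network, next_qvalues_target and is_not_done defined earlier:

    # Double DQN: the online agent picks the argmax action, the target network evaluates it.
    next_qvalues_agent = agent.get_symbolic_qvalues(next_obs_ph)       # online net on next states
    next_best_actions = tf.argmax(next_qvalues_agent, axis=1)          # a* = argmax_a Q_agent(s', a)
    next_state_values_double = tf.reduce_sum(
        tf.one_hot(next_best_actions, n_actions) * next_qvalues_target, axis=1)
    reference_qvalues_double = rewards_ph + gamma * next_state_values_double * is_not_done

And a possible dueling head that could replace the last two dense layers of DQNAgent (again just a sketch; the layer sizes are arbitrary and it assumes you switch the body to the Keras functional API):

    # Dueling head sketch: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)
    from keras.layers import Dense, Lambda
    from keras import backend as K

    def dueling_head(features, n_actions):
        """features: a flat Keras tensor coming out of the convolutional body."""
        v = Dense(256, activation='relu')(features)
        v = Dense(1)(v)                      # state value V(s)
        a = Dense(256, activation='relu')(features)
        a = Dense(n_actions)(a)              # advantages A(s,a)
        # subtract the mean advantage so V and A are identifiable
        return Lambda(lambda va: va[0] + va[1] - K.mean(va[1], axis=1, keepdims=True))([v, a])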

Hard: Prioritized experience replay

In this section, you're invited to implement prioritized experience replay

  • You will probably need to provide a custom data structure
  • Once pool.update is called, collect the pool.experience_replay.observations, actions, rewards and is_alive and store them in your data structure
  • You can now sample such transitions in proportion to the error (see article) for training.

It's probably more convenient to explicitly declare inputs for "sample observations", "sample actions" and so on to plug them into q-learning.
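
If you go down this route, here's a very rough sketch of proportional prioritized sampling to give you the idea (no sum-tree, so sampling is O(N); the class name and constants are made up for illustration, and importance-sampling weights are omitted - see the article for the full method):

    # Rough proportional-priority replay sketch.
    import numpy as np

    class NaivePrioritizedBuffer:
        def __init__(self, capacity, alpha=0.6):
            self.capacity, self.alpha = capacity, alpha
            self.storage, self.priorities = [], []

        def add(self, transition, priority=1.0):
            if len(self.storage) >= self.capacity:
                self.storage.pop(0); self.priorities.pop(0)  # drop the oldest entry
            self.storage.append(transition)
            self.priorities.append(priority)

        def sample(self, batch_size):
            probs = np.array(self.priorities) ** self.alpha
            probs /= probs.sum()
            idxs = np.random.choice(len(self.storage), batch_size, p=probs)
            return idxs, [self.storage[i] for i in idxs]

        def update_priorities(self, idxs, td_errors, eps=1e-3):
            # priorities proportional to |TD error|, with a small epsilon floor
            for i, err in zip(idxs, td_errors):
                self.priorities[i] = abs(err) + eps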

Prioritized (and even ordinary) experience replay should greatly reduce the number of game sessions you need to play in order to achieve good performance.

While its effect on runtime is limited for Atari, more complicated environments (later in the course) will certainly benefit from it.

There is even more out there - see this overview article.


In [29]:
from submit import submit_breakout
env = make_env()
submit_breakout(agent, env, evaluate, "tonatiuh_rangel@hotmail.com", "CGeZfHhZ10uqx4Ud")


Submitted to Coursera platform. See results on assignment page!

In [ ]: