In [ ]:
import sys
if 'google.colab' in sys.modules:
    import os

    os.system('apt-get install -y xvfb')
    os.system('wget https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/xvfb -O ../xvfb')
    os.system('apt-get install -y python-opengl ffmpeg')
    os.system('pip install pyglet==1.2.4')

    os.system('python -m pip install -U pygame --user')

    print('setup complete')

# XVFB will be launched if you run on a server
import os
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'

Implementing Advantage Actor-Critic (A2C)

In this notebook you will implement the Advantage Actor-Critic algorithm, which trains on a batch of Atari 2600 environments running in parallel.

First, we will use the environment wrappers implemented in the file atari_wrappers.py. These wrappers preprocess observations (resize, grayscale, take the max between frames, skip frames, and stack them together) and rewards. Some of the wrappers help to reset the environment and pass a done flag equal to True when the agent dies. The file env_batch.py contains an implementation of the ParallelEnvBatch class that allows you to run multiple environments in parallel. To create an environment we can use the nature_dqn_env function. Note that if you are using PyTorch and are not using tensorboardX, you will need to implement a wrapper that logs the raw total rewards returned by the unwrapped environment, and redefine the implementation of the nature_dqn_env function here.


In [ ]:
import numpy as np
from atari_wrappers import nature_dqn_env, NumpySummaries


env = nature_dqn_env("SpaceInvadersNoFrameskip-v4", nenvs=8, summaries='Numpy')
obs = env.reset()
assert obs.shape == (8, 84, 84, 4)
assert obs.dtype == np.uint8

Next, we will need to implement a model that predicts logits and values. It is suggested that you use the same model as in the Nature DQN paper, with the modification that instead of a single output layer it has two output layers, both taking the output of the last hidden layer as input. Note that this model is different from the model you used in the homework where you implemented DQN. You can use your favorite deep learning framework here. We suggest that you use orthogonal initialization with parameter $\sqrt{2}$ for kernels and initialize biases with zeros.


In [ ]:
# import tensorflow as tf
# import torch

<Define your model here>
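
For reference, here is a minimal sketch of such a model in PyTorch (PyTorch is an assumption — the assignment lets you pick any framework). It assumes the 84x84x4 uint8 observations produced by the wrappers above, follows the Nature DQN convolutional stack, and puts two heads on top of the 512-unit hidden layer; the names below (e.g. NatureDQNBase) are ours, not part of the assignment code.


In [ ]:
import torch
import torch.nn as nn


class NatureDQNBase(nn.Module):
    """Sketch: Nature DQN torso with separate policy-logits and value heads."""

    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.hidden = nn.Sequential(nn.Flatten(), nn.Linear(64 * 7 * 7, 512), nn.ReLU())
        self.logits_head = nn.Linear(512, n_actions)
        self.values_head = nn.Linear(512, 1)

        # Orthogonal initialization with gain sqrt(2) for kernels, zeros for biases.
        for layer in self.modules():
            if isinstance(layer, (nn.Conv2d, nn.Linear)):
                nn.init.orthogonal_(layer.weight, gain=2 ** 0.5)
                nn.init.zeros_(layer.bias)

    def forward(self, obs):
        # obs: (batch, 84, 84, 4) uint8 -> float in [0, 1], channels first.
        x = obs.float().permute(0, 3, 1, 2) / 255.
        features = self.hidden(self.conv(x))
        return self.logits_head(features), self.values_head(features).squeeze(-1)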

You will also need to define and use a policy that wraps the model. While the model computes logits for all actions, the policy will sample actions and also compute their log probabilities. policy.act should return a dictionary of all the arrays that are needed to interact with an environment and train the model. Note that actions must be an np.ndarray while the other tensors need to have the type determined by your deep learning framework.


In [ ]:
class Policy:
    def __init__(self, model):
        self.model = model
    
    def act(self, inputs):
        <Implement policy by calling model, sampling actions and computing their log probs>
        # Should return a dict containing keys ['actions', 'logits', 'log_probs', 'values'].
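
A possible sketch of such a policy in PyTorch, assuming the model returns a (logits, values) pair as in the sketch above; the class name is ours so as not to clash with the Policy skeleton you are asked to fill in.


In [ ]:
import numpy as np
import torch
from torch.distributions import Categorical


class PolicySketch:
    """Sketch: samples actions from a categorical distribution over the model's logits."""

    def __init__(self, model):
        self.model = model

    def act(self, inputs):
        obs = torch.as_tensor(np.asarray(inputs))
        logits, values = self.model(obs)
        dist = Categorical(logits=logits)
        actions = dist.sample()
        return {
            'actions': actions.cpu().numpy(),   # must be an np.ndarray
            'logits': logits,                   # framework tensors below
            'log_probs': dist.log_prob(actions),
            'values': values,
        }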

Next we will pass the environment and policy to a runner that collects partial trajectories from the environment. The class that does this is already implemented for you.


In [ ]:
from runners import EnvRunner

This runner interacts with the environment for a given number of steps and returns a dictionary containing keys

  • 'observations'
  • 'rewards'
  • 'resets'
  • 'actions'
  • all other keys that you defined in Policy

Under each of these keys there is a Python list of interactions with the environment of the specified length $T$ — the size of the partial trajectory.

To train the part of the model that predicts state values you will need to compute the value targets. Any callable can be passed to EnvRunner to be applied to each partial trajectory after it is collected. Thus, we can implement and use a ComputeValueTargets callable. The formula for the value targets is simple:

$$ \hat v(s_t) = \left( \sum_{t'=0}^{T - 1 - t} \gamma^{t'}r_{t+t'} \right) + \gamma^{T-t} \hat{v}(s_{T}), \qquad t = 0, \ldots, T - 1, $$

In the implementation, however, do not forget to use the trajectory['resets'] flags: if the environment was reset at a step, the value target of the next step must not be added when computing the value target for the current step, i.e. the return is not bootstrapped across an episode boundary. You can access trajectory['state']['latest_observation'] to get the latest observations of the partial trajectory — $s_{T}$.


In [ ]:
class ComputeValueTargets:
    def __init__(self, policy, gamma=0.99):
        self.policy = policy
        self.gamma = gamma
    
    def __call__(self, trajectory):
        # This method should modify trajectory inplace by adding
        # an item with key 'value_targets' to it.
        <Compute value targets for a given partial trajectory>
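
For orientation, here is a sketch of this computation in PyTorch. It assumes the policy's act method returns a 'values' tensor of shape (nenvs,) and that trajectory['rewards'] and trajectory['resets'] are lists of per-step arrays of shape (nenvs,); the class name is ours.


In [ ]:
import torch


class ComputeValueTargetsSketch:
    """Sketch: discounted returns bootstrapped with the critic's value of the latest observation."""

    def __init__(self, policy, gamma=0.99):
        self.policy = policy
        self.gamma = gamma

    def __call__(self, trajectory):
        latest_obs = trajectory['state']['latest_observation']
        with torch.no_grad():
            value_target = self.policy.act(latest_obs)['values']  # \hat{v}(s_T), shape (nenvs,)

        value_targets = []
        # Walk the partial trajectory backwards; a reset zeroes the bootstrap term.
        for rewards, resets in zip(reversed(trajectory['rewards']),
                                   reversed(trajectory['resets'])):
            rewards = torch.as_tensor(rewards, dtype=torch.float32)
            not_done = 1. - torch.as_tensor(resets, dtype=torch.float32)
            value_target = rewards + self.gamma * not_done * value_target
            value_targets.append(value_target)

        trajectory['value_targets'] = value_targets[::-1]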

After computing the value targets we will transform the lists of interactions into tensors whose first dimension, batch_size, is equal to T * nenvs, i.e. you essentially need to flatten the first two dimensions.


In [ ]:
class MergeTimeBatch:
    """ Merges first two axes typically representing time and env batch. """
    def __call__(self, trajectory):
        # Modify trajectory inplace.
        <TODO: implement>
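
A sketch under the assumption that, after ComputeValueTargets, every key except 'state' holds a list of T per-step NumPy arrays or framework tensors whose first dimension is nenvs; the class name is ours.


In [ ]:
import numpy as np
import torch


class MergeTimeBatchSketch:
    """Sketch: stacks per-step arrays/tensors and flattens the time and env-batch axes."""

    def __call__(self, trajectory):
        for key, value in trajectory.items():
            if key == 'state':
                continue  # keep runner state (e.g. latest_observation) as is
            if isinstance(value, list) and torch.is_tensor(value[0]):
                trajectory[key] = torch.stack(value).flatten(0, 1)
            elif isinstance(value, list) and isinstance(value[0], np.ndarray):
                trajectory[key] = np.stack(value).reshape(-1, *value[0].shape[1:])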

In [ ]:
model = <Create your model here>
policy = Policy(model)
runner = EnvRunner(
    env, policy, nsteps=5,
    transforms=[
        ComputeValueTargets(policy),
        MergeTimeBatch(),
    ])

Now is the time to implement the advantage actor-critic algorithm itself. You can refer to your lecture, the Mnih et al. (2016) paper, and the lecture by Sergey Levine.


In [ ]:
class A2C:
    def __init__(self,
                 policy,
                 optimizer,
                 value_loss_coef=0.25,
                 entropy_coef=0.01,
                 max_grad_norm=0.5):
        self.policy = policy
        self.optimizer = optimizer
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm
    
    def policy_loss(self, trajectory):
        # You will need to compute advantages here.
        <TODO: implement>
    
    def value_loss(self, trajectory):
        <TODO: implement>
    
    def loss(self, trajectory):
        <TODO: implement>
      
    def step(self, trajectory):
        <TODO: implement>
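
One possible way to fill in this skeleton with PyTorch, assuming the merged trajectory contains 'log_probs', 'values' and 'value_targets' tensors of shape (batch_size,) and 'logits' of shape (batch_size, n_actions). Advantages are detached so the policy gradient does not flow through the critic; the class name and the extra entropy helper are ours.


In [ ]:
import torch
import torch.nn.functional as F
from torch.nn.utils import clip_grad_norm_


class A2CSketch:
    def __init__(self, policy, optimizer,
                 value_loss_coef=0.25, entropy_coef=0.01, max_grad_norm=0.5):
        self.policy = policy
        self.optimizer = optimizer
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm

    def policy_loss(self, trajectory):
        # Advantages = value targets - value predictions, treated as constants.
        advantages = (trajectory['value_targets'] - trajectory['values']).detach()
        return -(trajectory['log_probs'] * advantages).mean()

    def value_loss(self, trajectory):
        return F.mse_loss(trajectory['values'], trajectory['value_targets'].detach())

    def entropy(self, trajectory):
        probs = F.softmax(trajectory['logits'], dim=-1)
        log_probs = F.log_softmax(trajectory['logits'], dim=-1)
        return -(probs * log_probs).sum(-1).mean()

    def loss(self, trajectory):
        return (self.policy_loss(trajectory)
                + self.value_loss_coef * self.value_loss(trajectory)
                - self.entropy_coef * self.entropy(trajectory))

    def step(self, trajectory):
        self.optimizer.zero_grad()
        self.loss(trajectory).backward()
        clip_grad_norm_(self.policy.model.parameters(), self.max_grad_norm)
        self.optimizer.step()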

Now you can train your model. With reasonable hyperparameters, training on a single GTX 1080 for 10 million steps across all batched environments (which translates to about 5 hours of wall-clock time), it should be possible to achieve an average raw reward over the last 100 episodes (the average is taken over the last 100 episodes in each environment in the batch) of about 600. You should plot this quantity against runner.step_var — the number of interactions with all environments. It is highly encouraged to also provide plots of the following quantities (they are useful for debugging as well):

  • Coefficient of Determination between value targets and value predictions
  • Entropy of the policy $\pi$
  • Value loss
  • Policy loss
  • Value targets
  • Value predictions
  • Gradient norm
  • Advantages
  • A2C loss

For optimization we suggest you use RMSProp with a learning rate starting at 7e-4 and linearly decayed to 0, a smoothing constant (alpha in PyTorch, decay in TensorFlow) of 0.99, and epsilon equal to 1e-5.


In [ ]:
a2c = <Create instance of the algorithm> 

<Write your training loop>
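
A rough sketch of what the training loop could look like with the PyTorch pieces above. With nsteps=5 and nenvs=8 each update consumes 40 transitions, so 10 million environment steps correspond to 250,000 updates. It assumes EnvRunner exposes a get_next() method returning one transformed partial trajectory — check runners.py for the actual interface; logging and plotting are left out.


In [ ]:
import torch

total_env_steps = 10_000_000
transitions_per_update = 5 * 8                          # nsteps * nenvs
num_updates = total_env_steps // transitions_per_update

optimizer = torch.optim.RMSprop(model.parameters(), lr=7e-4, alpha=0.99, eps=1e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda update: 1. - update / num_updates)  # linear decay to 0

a2c = A2CSketch(policy, optimizer)

for update in range(num_updates):
    trajectory = runner.get_next()   # assumed EnvRunner interface
    a2c.step(trajectory)
    scheduler.step()
    # <log losses, entropy, value targets/predictions, rewards vs. runner.step_var here>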