In [ ]:
import sys
if 'google.colab' in sys.modules:
    import os
    os.system('apt-get install -y xvfb')
    os.system('wget https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/xvfb -O ../xvfb')
    os.system('apt-get install -y python-opengl ffmpeg')
    os.system('pip install pyglet==1.2.4')
    os.system('python -m pip install -U pygame --user')
    print('setup complete')
# XVFB will be launched if you run on a server
import os
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'
In this notebook you will implement the Advantage Actor-Critic (A2C) algorithm that trains on a batch of Atari 2600 environments running in parallel.
First, we will use the environment wrappers implemented in the file `atari_wrappers.py`. These wrappers preprocess observations (resize, convert to grayscale, take the maximum between consecutive frames, skip frames and stack them together) and rewards. Some of the wrappers also help to reset the environment and pass a `done` flag equal to `True` when the agent dies.
The file `env_batch.py` includes an implementation of the `ParallelEnvBatch` class that allows running multiple environments in parallel. To create an environment we can use the `nature_dqn_env` function. Note that if you are using PyTorch and not using `tensorboardX`, you will need to implement a wrapper that logs the raw total rewards returned by the unwrapped environment and redefine the implementation of the `nature_dqn_env` function here.
In [ ]:
import numpy as np
from atari_wrappers import nature_dqn_env, NumpySummaries
env = nature_dqn_env("SpaceInvadersNoFrameskip-v4", nenvs=8, summaries='Numpy')
obs = env.reset()
assert obs.shape == (8, 84, 84, 4)
assert obs.dtype == np.uint8
Next, we will need to implement a model that predicts logits and values. It is suggested that you use the same model as in the Nature DQN paper, with the modification that instead of a single output layer it has two output layers, each taking the output of the last hidden layer as input. Note that this model is different from the model you used in the homework where you implemented DQN. You can use your favorite deep learning framework here. We suggest that you use orthogonal initialization with parameter $\sqrt{2}$ for kernels and initialize biases with zeros.
In [ ]:
# import tensorflow as torch
# import torch as tf
<Define your model here>
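For reference, here is a minimal PyTorch sketch of such a model; the class name `NatureDQNModel` and the helper `orthogonal_init` are our own, and the code assumes the $84 \times 84 \times 4$ uint8 observations produced by the wrappers above.
In [ ]:
import numpy as np
import torch
import torch.nn as nn


def orthogonal_init(layer, gain=np.sqrt(2)):
    # Orthogonal initialization for kernels, zeros for biases.
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.zeros_(layer.bias)
    return layer


class NatureDQNModel(nn.Module):
    """Nature DQN backbone with two heads: action logits and state value."""

    def __init__(self, n_actions):
        super().__init__()
        self.backbone = nn.Sequential(
            orthogonal_init(nn.Conv2d(4, 32, kernel_size=8, stride=4)), nn.ReLU(),
            orthogonal_init(nn.Conv2d(32, 64, kernel_size=4, stride=2)), nn.ReLU(),
            orthogonal_init(nn.Conv2d(64, 64, kernel_size=3, stride=1)), nn.ReLU(),
            nn.Flatten(),
            orthogonal_init(nn.Linear(64 * 7 * 7, 512)), nn.ReLU(),
        )
        self.logits_head = orthogonal_init(nn.Linear(512, n_actions))
        self.value_head = orthogonal_init(nn.Linear(512, 1))

    def forward(self, obs):
        # obs: tensor of shape (batch, 84, 84, 4); convert to float NCHW in [0, 1].
        obs = obs.permute(0, 3, 1, 2).float() / 255.
        hidden = self.backbone(obs)
        return self.logits_head(hidden), self.value_head(hidden).squeeze(-1)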
You will also need to define and use a policy that wraps the model. While the model computes logits for all actions, the policy will sample actions and also compute their log probabilities. `policy.act` should return a dictionary of all the arrays that are needed to interact with the environment and train the model. Note that actions must be an `np.ndarray`, while the other tensors need to have the type determined by your deep learning framework.
In [ ]:
class Policy:
    def __init__(self, model):
        self.model = model

    def act(self, inputs):
        # Should return a dict containing keys ['actions', 'logits', 'log_probs', 'values'].
        <Implement policy by calling model, sampling actions and computing their log probs>
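A possible PyTorch sketch of such a policy is given below; the class name `CategoricalPolicy` is our own, and the code assumes the model returns a `(logits, values)` pair as in the model sketch above.
In [ ]:
import numpy as np
import torch


class CategoricalPolicy:
    """A sketch of a policy wrapper around a model returning (logits, values)."""

    def __init__(self, model):
        self.model = model

    def act(self, inputs):
        # inputs: np.ndarray of observations with shape (nenvs, 84, 84, 4).
        logits, values = self.model(torch.as_tensor(np.asarray(inputs)))
        dist = torch.distributions.Categorical(logits=logits)
        actions = dist.sample()
        return {
            'actions': actions.cpu().numpy(),     # actions must be an np.ndarray
            'logits': logits,                     # framework tensors, keep the graph
            'log_probs': dist.log_prob(actions),
            'values': values,
        }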
Next we will pass the environment and policy to a runner that collects partial trajectories from the environment. The class that does this is already implemented for you.
In [ ]:
from runners import EnvRunner
This runner interacts with the environment for a given number of steps and returns a dictionary containing the keys 'observations', 'rewards', 'resets', 'actions', and all other keys that you defined in `Policy`. Under each of these keys there is a Python list of interactions with the environment of the specified length $T$, the size of the partial trajectory.
To train the part of the model that predicts state values you will need to compute value targets. Any callable can be passed to `EnvRunner` to be applied to each partial trajectory after it is collected. Thus, we can implement and use a `ComputeValueTargets` callable.
The formula for the value targets is simple:

$$
\hat v(s_t) = \left( \sum_{t'=0}^{T - 1} \gamma^{t'} r_{t+t'} \right) + \gamma^T \hat{v}(s_{t+T}).
$$

In the implementation, however, do not forget to use the `trajectory['resets']` flags to check whether you need to add the value target of the next step when computing the value target for the current step. You can access `trajectory['state']['latest_observation']` to get the last observation in the partial trajectory, $s_{t+T}$.
In [ ]:
class ComputeValueTargets:
    def __init__(self, policy, gamma=0.99):
        self.policy = policy
        self.gamma = gamma

    def __call__(self, trajectory):
        # This method should modify trajectory inplace by adding
        # an item with key 'value_targets' to it.
        <Compute value targets for a given partial trajectory>
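A minimal PyTorch sketch of this computation is shown below; it assumes `trajectory['rewards']` and `trajectory['resets']` are lists of per-environment arrays and that the policy returns a `'values'` tensor, as in the sketches above.
In [ ]:
import torch


class ComputeValueTargetsSketch:
    """A sketch: adds 'value_targets' to a partial trajectory in place."""

    def __init__(self, policy, gamma=0.99):
        self.policy = policy
        self.gamma = gamma

    def __call__(self, trajectory):
        # Bootstrap from the value of the latest observation s_{t+T};
        # value targets are constants, so no gradients are needed here.
        with torch.no_grad():
            target = self.policy.act(trajectory['state']['latest_observation'])['values']
        value_targets = []
        # Walk the trajectory backwards, dropping the bootstrap after resets.
        for rewards, resets in zip(trajectory['rewards'][::-1],
                                   trajectory['resets'][::-1]):
            rewards = torch.as_tensor(rewards, dtype=torch.float32)
            not_done = 1. - torch.as_tensor(resets, dtype=torch.float32)
            target = rewards + self.gamma * not_done * target
            value_targets.append(target)
        trajectory['value_targets'] = value_targets[::-1]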
After computing value targets we will transform the lists of interactions into tensors whose first dimension, `batch_size`, is equal to `T * nenvs`, i.e. you essentially need to flatten the first two dimensions.
In [ ]:
class MergeTimeBatch:
    """ Merges first two axes typically representing time and env batch. """

    def __call__(self, trajectory):
        # Modify trajectory inplace.
        <TODO: implement>
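One way to do this in PyTorch is sketched below; it assumes the per-step entries are either framework tensors (e.g. 'logits', 'values') or NumPy arrays (e.g. 'actions', 'rewards'), as produced by the sketches above.
In [ ]:
import numpy as np
import torch


class MergeTimeBatchSketch:
    """A sketch: merges the first two axes (time and env batch) of every entry."""

    def __call__(self, trajectory):
        for key, value in trajectory.items():
            if key == 'state':  # runner bookkeeping, leave as is
                continue
            # Stack the per-step list into a (T, nenvs, ...) tensor first.
            if isinstance(value, list):
                value = (torch.stack(value) if torch.is_tensor(value[0])
                         else torch.as_tensor(np.asarray(value)))
            # Flatten (T, nenvs, ...) -> (T * nenvs, ...).
            trajectory[key] = value.reshape(-1, *value.shape[2:])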
In [ ]:
model = <Create your model here>
policy = Policy(model)
runner = EnvRunner(
    env, policy, nsteps=5,
    transforms=[
        ComputeValueTargets(policy),
        MergeTimeBatch(),
    ])
Now it is time to implement the advantage actor-critic algorithm itself. You can look at your lecture notes, the Mnih et al. (2016) paper, and the lecture by Sergey Levine.
In [ ]:
class A2C:
    def __init__(self,
                 policy,
                 optimizer,
                 value_loss_coef=0.25,
                 entropy_coef=0.01,
                 max_grad_norm=0.5):
        self.policy = policy
        self.optimizer = optimizer
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm

    def policy_loss(self, trajectory):
        # You will need to compute advantages here.
        <TODO: implement>

    def value_loss(self, trajectory):
        <TODO: implement>

    def loss(self, trajectory):
        <TODO: implement>

    def step(self, trajectory):
        <TODO: implement>
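A compact PyTorch sketch of these losses is given below. It assumes the merged trajectory contains flat tensors under 'log_probs', 'logits', 'values' and 'value_targets' as in the sketches above; the `entropy` helper is our own addition.
In [ ]:
import torch
import torch.nn.functional as F


class A2CSketch:
    """A sketch of the A2C update: policy gradient + value loss - entropy bonus."""

    def __init__(self, policy, optimizer,
                 value_loss_coef=0.25, entropy_coef=0.01, max_grad_norm=0.5):
        self.policy = policy
        self.optimizer = optimizer
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm

    def policy_loss(self, trajectory):
        # Advantages are treated as constants for the policy gradient.
        advantages = (trajectory['value_targets'] - trajectory['values']).detach()
        return -(trajectory['log_probs'] * advantages).mean()

    def value_loss(self, trajectory):
        return F.mse_loss(trajectory['values'], trajectory['value_targets'].detach())

    def entropy(self, trajectory):
        return torch.distributions.Categorical(
            logits=trajectory['logits']).entropy().mean()

    def loss(self, trajectory):
        return (self.policy_loss(trajectory)
                + self.value_loss_coef * self.value_loss(trajectory)
                - self.entropy_coef * self.entropy(trajectory))

    def step(self, trajectory):
        self.optimizer.zero_grad()
        self.loss(trajectory).backward()
        torch.nn.utils.clip_grad_norm_(self.policy.model.parameters(),
                                       self.max_grad_norm)
        self.optimizer.step()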
Now you can train your model. With reasonable hyperparameters, training on a single GTX 1080 for 10 million steps across all batched environments (which translates to about 5 hours of wall-clock time) should achieve an average raw reward over the last 100 episodes (the average is taken over the last 100 episodes in each environment of the batch) of about 600. You should plot this quantity with respect to `runner.step_var`, the number of interactions with all environments. It is highly encouraged to also provide plots of quantities that are useful for debugging, such as the policy entropy, policy loss, value loss, value targets, value predictions, gradient norm and advantages.
For optimization we suggest you use RMSProp with a learning rate starting from 7e-4 and linearly decayed to 0, smoothing constant (`alpha` in PyTorch, `decay` in TensorFlow) equal to 0.99 and epsilon equal to 1e-5.
In [ ]:
a2c = <Create instance of the algorithm>
<Write your training loop>
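Below is a sketch of one possible training setup in PyTorch. It reuses the sketch classes above, and it assumes that the batched environment exposes `action_space` and that `EnvRunner` provides a `get_next()` method returning one partial trajectory; check `runners.py` and `env_batch.py` for the actual interfaces.
In [ ]:
import torch

nsteps, nenvs = 5, 8
total_env_steps = 10_000_000            # interactions summed over all environments
num_updates = total_env_steps // (nsteps * nenvs)

model = NatureDQNModel(n_actions=env.action_space.n)
policy = CategoricalPolicy(model)
runner = EnvRunner(env, policy, nsteps=nsteps,
                   transforms=[ComputeValueTargetsSketch(policy),
                               MergeTimeBatchSketch()])

optimizer = torch.optim.RMSprop(model.parameters(), lr=7e-4, alpha=0.99, eps=1e-5)
# Linearly decay the learning rate from 7e-4 to 0 over the course of training.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: 1. - step / num_updates)

a2c = A2CSketch(policy, optimizer)

for update in range(num_updates):
    trajectory = runner.get_next()      # assumed runner API, see runners.py
    a2c.step(trajectory)
    scheduler.step()
    # Log rewards, losses, entropy, etc. against runner.step_var here.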