This notebook builds upon qlearning.ipynb (or, to be exact, the qlearning.py generated from it).
There's a powerful technique that you can use to improve sample efficiency for off-policy algorithms: [spoiler] Experience replay :)
The catch is that you can train Q-learning and EV-SARSA on <s,a,r,s'> tuples even if they aren't sampled under the current agent's policy. So here's what we're gonna do:
1. Play the game and sample a <s,a,r,s'> transition.
2. Update Q-values based on that <s,a,r,s'>.
3. Store the <s,a,r,s'> transition in a buffer; if the buffer is full, drop the oldest transitions first.
4. Sample a batch of transitions from the buffer and update Q-values on them as well.
To enable such training, first we must implement a memory structure that would act like such a buffer.
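In code, one step of that training loop will look roughly like this. This is only a preview sketch: it assumes an agent, an env, a replay buffer, a batch_size and a current state s already exist, and the real implementation, play_and_train_with_replay, comes later in this notebook.
# rough sketch of a single training step with experience replay
a = agent.get_action(s)
next_s, r, done, _ = env.step(a)

agent.update(s, a, r, next_s)       # the usual online update
replay.add(s, a, r, next_s, done)   # store the transition (step 3)

# replay a few random transitions from the buffer (step 4)
states, actions, rewards, next_states, is_done = replay.sample(batch_size)
for s_, a_, r_, next_s_ in zip(states, actions, rewards, next_states):
    agent.update(s_, a_, r_, next_s_)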
In [1]:
%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output
import pandas as pd
# XVFB will be launched if you run on a server
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY")) == 0:
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'
In [2]:
import random

class ReplayBuffer(object):
    def __init__(self, size):
        """
        Create Replay buffer.

        Parameters
        ----------
        size: int
            Max number of transitions to store in the buffer. When the buffer
            overflows, the oldest memories are dropped.

        Note: for this assignment you can pick any data structure you want.
              If you want to keep it simple, you can store a list of tuples of
              (s, a, r, s') in self._storage. However, you may find that there
              are faster and/or more memory-efficient ways to do so.
        """
        self._maxsize = size
        # here we store transitions as rows of a pandas DataFrame
        columns = ['state', 'action', 'reward', 'next_state', 'is_done']
        self._storage = pd.DataFrame(columns=columns)

    def __len__(self):
        return len(self._storage)

    def add(self, obs_t, action, reward, obs_tp1, done):
        '''
        Make sure _storage does not exceed _maxsize.
        Make sure the FIFO rule is followed: the oldest examples have to be removed first.
        '''
        data = {
            'state': obs_t,
            'action': action,
            'reward': reward,
            'next_state': obs_tp1,
            'is_done': done
        }

        # if the buffer is full, drop the oldest row (FIFO)
        if len(self._storage) == self._maxsize:
            self._storage.drop(0, axis=0, inplace=True)
            self._storage.reset_index(drop=True, inplace=True)

        # add data to storage (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        self._storage = pd.concat([self._storage, pd.DataFrame([data])], ignore_index=True)

    def sample(self, batch_size):
        """Sample a batch of experiences.

        Parameters
        ----------
        batch_size: int
            How many transitions to sample.

        Returns
        -------
        obs_batch: np.array
            batch of observations
        act_batch: np.array
            batch of actions executed given obs_batch
        rew_batch: np.array
            rewards received as results of executing act_batch
        next_obs_batch: np.array
            next set of observations seen after executing act_batch
        done_mask: np.array
            done_mask[i] = 1 if executing act_batch[i] resulted in
            the end of an episode and 0 otherwise.
        """
        # randomly generate batch_size integers to be used as indexes of samples
        idxes = np.random.randint(low=0, high=len(self._storage), size=batch_size)

        # collect <s,a,r,s',done> for each index
        rows = self._storage.loc[idxes]
        states = rows.state.values
        actions = rows.action.values
        rewards = rows.reward.values
        next_states = rows.next_state.values
        is_done = rows.is_done.values

        return (states, actions, rewards, next_states, is_done)
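As the docstring hints, the DataFrame-based storage above is easy to inspect but not the fastest option, since appending one row at a time copies the frame. If you'd rather keep it lightweight, a collections.deque with maxlen gives you the FIFO behaviour for free. Here is a minimal optional sketch with the same interface (the class name DequeReplayBuffer is just illustrative, and it is not required by the assignment; the tests below run against either version):
from collections import deque

class DequeReplayBuffer(object):
    """Same interface as ReplayBuffer above, backed by a deque.
    A deque created with maxlen=size silently drops its oldest element
    whenever a new one is appended past capacity, so FIFO is automatic."""

    def __init__(self, size):
        self._storage = deque(maxlen=size)

    def __len__(self):
        return len(self._storage)

    def add(self, obs_t, action, reward, obs_tp1, done):
        self._storage.append((obs_t, action, reward, obs_tp1, done))

    def sample(self, batch_size):
        # sample indices with replacement, like the DataFrame version above
        idxes = np.random.randint(0, len(self._storage), size=batch_size)
        batch = [self._storage[i] for i in idxes]
        states, actions, rewards, next_states, is_done = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, is_done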
Some tests to make sure your buffer works right
In [3]:
replay = ReplayBuffer(2)
obj1 = tuple(range(5))
obj2 = tuple(range(5, 10))
replay.add(*obj1)
assert replay.sample(1)==obj1, "If there's just one object in buffer, it must be retrieved by buf.sample(1)"
replay.add(*obj2)
assert len(replay._storage)==2, "Please make sure the __len__ method works as intended."
replay.add(*obj2)
assert len(replay._storage)==2, "When buffer is at max capacity, replace objects instead of adding new ones."
assert tuple(np.unique(a) for a in replay.sample(100))==obj2
replay.add(*obj1)
assert max(len(np.unique(a)) for a in replay.sample(100))==2
replay.add(*obj1)
assert tuple(np.unique(a) for a in replay.sample(100))==obj1
print("Success!")
Now let's use this buffer to improve training:
In [4]:
import gym
from qlearning import QLearningAgent
env = gym.make("Taxi-v2")
n_actions = env.action_space.n
In [18]:
def play_and_train_with_replay(env, agent, replay=None,
                               t_max=10**4, replay_batch_size=32):
    """
    This function should
    - run a full game, actions given by agent.get_action(s)
    - train agent using agent.update(...) whenever possible
    - return total reward
    :param replay: ReplayBuffer where agent can store and sample (s,a,r,s',done) tuples.
        If None, do not use experience replay
    """
    total_reward = 0.0
    s = env.reset()

    for t in range(t_max):
        # get agent to pick action given state s
        a = agent.get_action(s)

        next_s, r, done, _ = env.step(a)

        # update agent on the current transition
        agent.update(s, a, r, next_s)

        if replay is not None:
            # store current <s,a,r,s'> transition in buffer
            replay.add(s, a, r, next_s, done)

            # sample replay_batch_size random transitions from replay,
            # then update agent on each of them in a loop
            states, actions, rewards, next_states, is_done = replay.sample(replay_batch_size)
            for s_, a_, r_, next_s_ in zip(states, actions, rewards, next_states):
                agent.update(s_, a_, r_, next_s_)

        s = next_s
        total_reward += r
        if done:
            break

    return total_reward
In [19]:
# Create two agents: first will use experience replay, second will not.
agent_baseline = QLearningAgent(alpha=0.5, epsilon=0.25, discount=0.99,
                                get_legal_actions=lambda s: range(n_actions))

agent_replay = QLearningAgent(alpha=0.5, epsilon=0.25, discount=0.99,
                              get_legal_actions=lambda s: range(n_actions))
replay = ReplayBuffer(1000)
In [20]:
from IPython.display import clear_output
from pandas import DataFrame
moving_average = lambda x, span=100: DataFrame({'x': np.asarray(x)}).x.ewm(span=span).mean().values

rewards_replay, rewards_baseline = [], []

for i in range(1000):
    rewards_replay.append(play_and_train_with_replay(env, agent_replay, replay))
    rewards_baseline.append(play_and_train_with_replay(env, agent_baseline, replay=None))

    agent_replay.epsilon *= 0.99
    agent_baseline.epsilon *= 0.99

    if i % 100 == 0:
        clear_output(True)
        print('Baseline : eps =', agent_baseline.epsilon, 'mean reward =', np.mean(rewards_baseline[-10:]))
        print('ExpReplay: eps =', agent_replay.epsilon, 'mean reward =', np.mean(rewards_replay[-10:]))
        plt.plot(moving_average(rewards_replay), label='exp. replay')
        plt.plot(moving_average(rewards_baseline), label='baseline')
        plt.grid()
        plt.legend()
        plt.show()
In [21]:
from submit import submit_experience_replay
submit_experience_replay(rewards_replay, rewards_baseline, 'tonatiuh_rangel@hotmail.com', 'GWnGSUsbgj3Fcn0B')
Experience replay, if implemented correctly, will greatly improve the algorithm's initial convergence, but it shouldn't affect the final performance.
We will use the code you just wrote extensively in the next week of our course. If you feel you need more examples to understand how experience replay works, try using it for binarized state spaces (CartPole or other classic control envs).
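For instance, here is a minimal sketch of how you might binarize CartPole observations so that the tabular agents above can digest them. The bin edges and bin counts are arbitrary assumptions, not tuned values:
import numpy as np
import gym

# hypothetical discretization of CartPole's 4 observation components;
# the ranges and number of bins are rough guesses, not tuned values
BINS = [
    np.linspace(-2.4, 2.4, 6),    # cart position
    np.linspace(-3.0, 3.0, 6),    # cart velocity
    np.linspace(-0.21, 0.21, 6),  # pole angle (radians)
    np.linspace(-3.0, 3.0, 6),    # pole angular velocity
]

def binarize(observation):
    """Map a continuous observation to a hashable tuple of bin indices."""
    return tuple(int(np.digitize(x, bins)) for x, bins in zip(observation, BINS))

# usage: wrap states before handing them to the agent, e.g.
# s = binarize(env.reset()); next_s, r, done, _ = env.step(a); next_s = binarize(next_s)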
Next week we're gonna explore how Q-learning and similar algorithms can be applied to large state spaces, with deep learning models to approximate the Q-function.
However, the code you've written for this week is already capable of solving many RL problems, and as an added benefit, it is very easy to detach and reuse. You can use Q-learning, SARSA and Experience Replay for any RL problem you want to solve: just throw them into a file and import the stuff you need.