Batch Reinforcement Learning with Coach

In many real-world problems, a learning agent cannot interact with the real environment or with a simulated one. This might be due to the risk of taking sub-optimal actions in the real world, or due to the complexity of creating a simulator that correctly imitates the real environment's dynamics. In such cases, the learning agent is only exposed to data that was collected using some deployed policy, and we would like to use that data to learn a better policy for solving the problem. One such example is developing a better drug dosing or admission scheduling policy: we have data based on the policy that was used with patients so far, but we cannot experiment (and explore) on patients to collect new data.

But wait... If we don't have a simulator, how would we evaluate our newly learned policy and know if it is any good? Which algorithms should we use to address the problem of learning only from a batch of data?

Alternatively, what do we do if we don't have a simulator, but we can actually deploy our policy on the real-world environment, and we just want to separate the new data collection part from the learning part (e.g. we have a system that can quite easily run inference, but that is very hard to integrate with a reinforcement learning framework, such as Coach, for learning a new policy)?

We will try to address these questions and more in this tutorial, demonstrating how to use Batch Reinforcement Learning.

First, let's use a simple environment to collect the data to be used for learning a policy with Batch RL. In reality, we would probably already have a dataset of transitions of the form <current_observation, action, reward, next_state> to be used for learning a new policy. Ideally, for each transition we would also have $p(a|o)$, the probability assigned to each action given that transition's current_observation, under the policy that collected the data.
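To make this concrete, here is a minimal sketch of what a single logged transition might look like in Python. The class and field names are purely illustrative (they are not part of Coach's API); they just mirror the tuple described above.

from dataclasses import dataclass
from typing import List

# Illustrative container for one logged transition (not a Coach class)
@dataclass
class LoggedTransition:
    current_observation: List[float]   # the observation the deployed policy acted on
    action: int                        # the discrete action that was taken
    reward: float                      # the reward received for that action
    next_observation: List[float]      # the observation after the environment transitioned
    action_probabilities: List[float]  # p(a|o) under the deployed (behavior) policy

transition = LoggedTransition(
    current_observation=[0.99, 0.07, 0.99, 0.06, -0.07, -0.07],
    action=0,
    reward=-1.0,
    next_observation=[0.99, 0.06, 0.99, 0.02, -0.02, -0.40],
    action_probabilities=[0.42, 0.23, 0.35],
)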

Preliminaries

First, get the required imports and other general settings we need for this notebook.

Solving Acrobot with Batch RL


In [ ]:
from copy import deepcopy
import tensorflow as tf
import os

from rl_coach.agents.dqn_agent import DQNAgentParameters
from rl_coach.agents.ddqn_bcq_agent import DDQNBCQAgentParameters, KNNParameters
from rl_coach.base_parameters import VisualizationParameters
from rl_coach.core_types import TrainingSteps, EnvironmentEpisodes, EnvironmentSteps, CsvDataset
from rl_coach.environments.gym_environment import GymVectorEnvironment
from rl_coach.graph_managers.batch_rl_graph_manager import BatchRLGraphManager
from rl_coach.graph_managers.graph_manager import ScheduleParameters
from rl_coach.memories.memory import MemoryGranularity
from rl_coach.schedules import LinearSchedule
from rl_coach.memories.episodic import EpisodicExperienceReplayParameters
from rl_coach.architectures.head_parameters import QHeadParameters
from rl_coach.agents.ddqn_agent import DDQNAgentParameters
from rl_coach.base_parameters import TaskParameters
from rl_coach.spaces import SpacesDefinition, DiscreteActionSpace, VectorObservationSpace, StateSpace, RewardSpace

# Get all the outputs of this tutorial out of the 'Resources' folder
os.chdir('Resources')

# the dataset size to collect 
DATASET_SIZE = 50000

task_parameters = TaskParameters(experiment_path='.')

####################
# Graph Scheduling #
####################

schedule_params = ScheduleParameters()

# 100 epochs of training (each epoch runs over the entire dataset)
schedule_params.improve_steps = TrainingSteps(100)

# we evaluate the model every epoch
schedule_params.steps_between_evaluation_periods = TrainingSteps(1)

# only for when we have an environment
schedule_params.evaluation_steps = EnvironmentEpisodes(10)
schedule_params.heatup_steps = EnvironmentSteps(DATASET_SIZE)

################
#  Environment #
################
env_params = GymVectorEnvironment(level='Acrobot-v1')

Let's use OpenAI Gym's Acrobot-v1 to collect a dataset of experience, and then use that dataset to learn a policy that solves the environment with Batch RL.
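If Acrobot-v1 is unfamiliar, here is a short, self-contained sketch (using the standard Gym API; the exact printout may vary with your Gym version) for inspecting its spaces. It also previews the numbers used later in the SpacesDefinition: a 6-dimensional observation and 3 discrete actions.

import gym

env = gym.make('Acrobot-v1')
print(env.observation_space)  # Box(6,) - sin/cos of the two joint angles and their angular velocities
print(env.action_space)       # Discrete(3) - apply -1, 0 or +1 torque to the actuated joint

observation = env.reset()
# the reward is -1 on every step until the goal height is reached
observation, reward, done, info = env.step(env.action_space.sample())
print(reward)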

The Preset

First, we will collect a dataset using a policy that selects actions at random. Then we will use that dataset to train an agent in a Batch RL fashion.
Let's start simple - training an agent with Double DQN.


In [ ]:
tf.reset_default_graph() # just to clean things up; only needed for the tutorial

#########
# Agent #
#########
agent_params = DDQNAgentParameters()
agent_params.network_wrappers['main'].batch_size = 128
agent_params.algorithm.num_steps_between_copying_online_weights_to_target = TrainingSteps(100)
agent_params.algorithm.discount = 0.99

# to jump start the agent's q values, and speed things up, we'll initialize the last Dense layer's bias
# with a number in the order of the discounted reward of a random policy
agent_params.network_wrappers['main'].heads_parameters = \
[QHeadParameters(output_bias_initializer=tf.constant_initializer(-100))]

# NN configuration
agent_params.network_wrappers['main'].learning_rate = 0.0001
agent_params.network_wrappers['main'].replace_mse_with_huber_loss = False

# ER - we'll need an episodic replay buffer for off-policy evaluation
agent_params.memory = EpisodicExperienceReplayParameters()

# E-Greedy schedule - there is no exploration in Batch RL. Disabling E-Greedy. 
agent_params.exploration.epsilon_schedule = LinearSchedule(initial_value=0, final_value=0, decay_steps=1)
agent_params.exploration.evaluation_epsilon = 0


graph_manager = BatchRLGraphManager(agent_params=agent_params,
                                    env_params=env_params,
                                    schedule_params=schedule_params,
                                    vis_params=VisualizationParameters(dump_signals_to_csv_every_x_episodes=1),
                                    reward_model_num_epochs=30)
graph_manager.create_graph(task_parameters)
graph_manager.improve()

First we see Coach running a long heatup of 50,000 steps (as we have defined a DATASET_SIZE of 50,000 in the preliminaries section), in order to collect a dataset of random actions. Then we can see Coach training a supervised reward model that is needed for the Doubly Robust OPE (off-policy evaluation). Last, Coach starts using the collected dataset of experience to train a Double DQN agent. Since, for this environment, we actually do have a simulator, Coach will be using it to evaluate the learned policy.

As you can probably see, since this is a very simple environment, a dataset of just random actions is enough to get a Double DQN agent training and reaching rewards of around -100 (actually solving the environment). As you can also probably notice, the learning is not very stable, and if you take a look at the Q values predicted by the agent (e.g. in Coach Dashboard; this tutorial's experiment results are under the Resources folder), you will see them growing unboundedly. This is caused by the Batch RL setting: since we never interact with the environment again, and the random dataset exposes only small parts of the MDP, learning is even harder than in standard off-policy RL. This phenomenon is very nicely explained in Off-Policy Deep Reinforcement Learning without Exploration. We have implemented a discrete-actions variant of Batch Constrained Q-Learning (BCQ), which helps mitigate this issue.
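To give a feel for the idea behind the discrete-actions BCQ variant, here is a minimal, framework-agnostic sketch (not Coach's actual implementation, and the threshold value is arbitrary): the max in the Bellman target is restricted to actions that the behavior data makes sufficiently likely in the next state, which limits bootstrapping from actions the dataset never really covers.

import numpy as np

def ddqn_target(q_online, q_target, next_state, reward, discount):
    # Standard Double DQN target: argmax over ALL actions using the online network
    best_action = np.argmax(q_online(next_state))
    return reward + discount * q_target(next_state)[best_action]

def bcq_target(q_online, q_target, behavior_prob, next_state, reward, discount, threshold=0.3):
    # Discrete BCQ-style target: only consider actions whose estimated probability under
    # the behavior data (e.g. from a kNN or NN model) is close enough to the most likely one
    probs = behavior_prob(next_state)              # estimated p(a | next_state) from the dataset
    allowed = probs / probs.max() >= threshold     # drop actions the data rarely (or never) took
    constrained_q = np.where(allowed, q_online(next_state), -np.inf)
    best_action = np.argmax(constrained_q)
    return reward + discount * q_target(next_state)[best_action]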

Next, let's switch to a dataset containing data combined from several 'deployed' policies, as is often the case in real-world scenarios, where we already have a policy in place (hopefully not a random one) and we want to improve it. For instance, a recommender system that already uses some policy for generating recommendations, where we want to use Batch RL to learn a better policy.

We will demonstrate that by training an agent, and using its replay buffer content as the dataset from which we will learn a new policy, without any further interaction with the environment. This should allow both for a better trained agent and for more meaningful Off-Policy Evaluation (the more extensive the input data is, i.e. the more of the MDP it exposes, the better a new policy can be evaluated based on it).


In [ ]:
tf.reset_default_graph() # just to clean things up; only needed for the tutorial

# Experience Generating Agent parameters
experience_generating_agent_params = DDQNAgentParameters()

# schedule parameters
experience_generating_schedule_params = ScheduleParameters()
experience_generating_schedule_params.heatup_steps = EnvironmentSteps(1000)
experience_generating_schedule_params.improve_steps = TrainingSteps(
    DATASET_SIZE - experience_generating_schedule_params.heatup_steps.num_steps)
experience_generating_schedule_params.steps_between_evaluation_periods = EnvironmentEpisodes(10)
experience_generating_schedule_params.evaluation_steps = EnvironmentEpisodes(1)

# DQN params
experience_generating_agent_params.algorithm.num_steps_between_copying_online_weights_to_target = EnvironmentSteps(100)
experience_generating_agent_params.algorithm.discount = 0.99
experience_generating_agent_params.algorithm.num_consecutive_playing_steps = EnvironmentSteps(1)

# NN configuration
experience_generating_agent_params.network_wrappers['main'].learning_rate = 0.0001
experience_generating_agent_params.network_wrappers['main'].batch_size = 128
experience_generating_agent_params.network_wrappers['main'].replace_mse_with_huber_loss = False
experience_generating_agent_params.network_wrappers['main'].heads_parameters = \
[QHeadParameters(output_bias_initializer=tf.constant_initializer(-100))]

# ER size
experience_generating_agent_params.memory = EpisodicExperienceReplayParameters()
experience_generating_agent_params.memory.max_size = \
    (MemoryGranularity.Transitions,
     experience_generating_schedule_params.heatup_steps.num_steps +
     experience_generating_schedule_params.improve_steps.num_steps)

# E-Greedy schedule
experience_generating_agent_params.exploration.epsilon_schedule = LinearSchedule(1.0, 0.01, DATASET_SIZE)
experience_generating_agent_params.exploration.evaluation_epsilon = 0

# 50 epochs of training (the entire dataset is used each epoch)
schedule_params.improve_steps = TrainingSteps(50)

graph_manager = BatchRLGraphManager(agent_params=agent_params,
                                    experience_generating_agent_params=experience_generating_agent_params,
                                    experience_generating_schedule_params=experience_generating_schedule_params,
                                    env_params=env_params,
                                    schedule_params=schedule_params,
                                    vis_params=VisualizationParameters(dump_signals_to_csv_every_x_episodes=1),
                                    reward_model_num_epochs=30,
                                    train_to_eval_ratio=0.5)
graph_manager.create_graph(task_parameters)
graph_manager.improve()

Off-Policy Evaluation

As we mentioned earlier, one of the hardest problems in Batch RL is that we do not have a simulator, or we cannot easily deploy a trained policy on the real-world environment, in order to test how good it is. This is where OPE (Off-Policy Evaluation) comes in handy.

Coach supports several off-policy evaluators; some are useful for bandit problems (evaluating only a single-step return), while others address full-blown Reinforcement Learning problems. The main goal of the OPEs is to help us select the best model, either for collecting more data to run another round of Batch RL on, or for actual deployment in the real-world environment.
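As a reference for what such an estimator computes, here is a minimal sketch (not Coach's implementation, and ignoring discounting for simplicity) of the Weighted Importance Sampling estimator for episodic data. It assumes that for every logged step we know both the behavior policy's and the evaluated policy's probability for the action that was actually taken.

import numpy as np

def weighted_importance_sampling(episodes):
    # each episode is a list of (reward, behavior_prob, target_prob) tuples,
    # where the probabilities refer to the action that was actually taken
    weights, returns = [], []
    for episode in episodes:
        # per-episode importance weight: the product of likelihood ratios along the trajectory
        ratio = np.prod([target_p / behavior_p for _, behavior_p, target_p in episode])
        weights.append(ratio)
        returns.append(sum(reward for reward, _, _ in episode))
    weights, returns = np.array(weights), np.array(returns)
    # normalizing by the sum of weights (rather than by the number of episodes)
    # is what distinguishes WIS from ordinary importance sampling
    return float(np.sum(weights * returns) / np.sum(weights))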

Opening the experiment that we have just run (under the tutorials/Resources folder, with Coach Dashboard), you will be able to plot the actual simulator's Evaluation Reward. Usually we won't have this signal available, as we won't have a simulator, but since we're using a dummy environment for demonstration purposes, we can take a look and examine how the OPEs correlate with it.

Here are two example plots from Dashboard showing how well the Weighted Importance Sampling (an RL estimator) and the Doubly Robust (a bandits estimator) each correlate with the Evaluation Reward.

Using a Dataset to Feed a Batch RL Algorithm

Ok, so we now understand how things are expected to work. But, hey... if we don't have a simulator (which we did have in this tutorial so far, and used to generate a training/evaluation dataset), how will we feed Coach with the dataset to train and evaluate on?

The CSV

Coach defines a csv data format that can be used to fill its replay buffer. We have created an example csv from the same Acrobot-v1 environment, and have placed it under the tutorial's Resources folder.

Here are the first few lines from it, so you can get a sense of what to expect -

action all_action_probabilities episode_id episode_name reward transition_number state_feature_0 state_feature_1 state_feature_2 state_feature_3 state_feature_4 state_feature_5
0 [0.4159157,0.23191088,0.35217342] 0 acrobot -1 0 0.996893843 0.078757007 0.997566524 0.069721088 -0.078539907 -0.072449002
1 [0.46244532,0.22402011,0.31353462] 0 acrobot -1 1 0.997643051 0.068617369 0.999777604 0.021088905 -0.022653483 -0.40743716
0 [0.4961428,0.21575058,0.2881066] 0 acrobot -1 2 0.997613067 0.069051922 0.996147629 -0.087692077 0.023128103 -0.662019594
2 [0.49341106,0.22363988,0.28294897] 0 acrobot -1 3 0.997141344 0.075558854 0.972780655 -0.231727853 0.035575821 -0.771402023
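If your own data lives elsewhere (application logs, a database, etc.), a rough sketch for producing a csv in this layout with pandas might look like the following. The source transitions here are hypothetical, and you should double-check the exact serialization (especially of all_action_probabilities) against the example file under Resources.

import pandas as pd

# Hypothetical logged transitions, e.g. collected by a deployed policy.
# Each entry: (episode_id, transition_number, observation, action, action_probabilities, reward)
logged_transitions = [
    (0, 0, [0.996, 0.078, 0.997, 0.069, -0.078, -0.072], 0, [0.42, 0.23, 0.35], -1),
    (0, 1, [0.997, 0.068, 0.999, 0.021, -0.022, -0.407], 1, [0.46, 0.22, 0.32], -1),
]

rows = []
for episode_id, transition_number, observation, action, action_probs, reward in logged_transitions:
    row = {
        'action': action,
        'all_action_probabilities': str(action_probs).replace(' ', ''),
        'episode_id': episode_id,
        'episode_name': 'acrobot',
        'reward': reward,
        'transition_number': transition_number,
    }
    # one state_feature_<i> column per observation dimension, matching the header above
    row.update({'state_feature_{}'.format(i): value for i, value in enumerate(observation)})
    rows.append(row)

pd.DataFrame(rows).to_csv('my_acrobot_dataset.csv', index=False)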

The Preset


In [ ]:
tf.reset_default_graph() # just to clean things up; only needed for the tutorial

#########
# Agent #
#########
# note that we have moved to BCQ, which will help the training to converge better and faster
agent_params = DDQNBCQAgentParameters() 
agent_params.network_wrappers['main'].batch_size = 128
agent_params.algorithm.num_steps_between_copying_online_weights_to_target = TrainingSteps(100)
agent_params.algorithm.discount = 0.99

# to jump start the agent's q values, and speed things up, we'll initialize the last Dense layer
# with something in the order of the discounted reward of a random policy
agent_params.network_wrappers['main'].heads_parameters = \
[QHeadParameters(output_bias_initializer=tf.constant_initializer(-100))]

# NN configuration
agent_params.network_wrappers['main'].learning_rate = 0.0001
agent_params.network_wrappers['main'].replace_mse_with_huber_loss = False

# ER - we'll be needing an episodic replay buffer for off-policy evaluation
agent_params.memory = EpisodicExperienceReplayParameters()

# E-Greedy schedule - there is no exploration in Batch RL. Disabling E-Greedy. 
agent_params.exploration.epsilon_schedule = LinearSchedule(initial_value=0, final_value=0, decay_steps=1)
agent_params.exploration.evaluation_epsilon = 0

# can use either a kNN or a NN based model for predicting which actions not to max over in the bellman equation
agent_params.algorithm.action_drop_method_parameters = KNNParameters()


DATASET_PATH = 'acrobot_dataset.csv'
agent_params.memory = EpisodicExperienceReplayParameters()
agent_params.memory.load_memory_from_file_path = CsvDataset(DATASET_PATH, is_episodic=True)

spaces = SpacesDefinition(state=StateSpace({'observation': VectorObservationSpace(shape=6)}),
                          goal=None,
                          action=DiscreteActionSpace(3),
                          reward=RewardSpace(1))

graph_manager = BatchRLGraphManager(agent_params=agent_params,
                                    env_params=None,
                                    spaces_definition=spaces,
                                    schedule_params=schedule_params,
                                    vis_params=VisualizationParameters(dump_signals_to_csv_every_x_episodes=1),
                                    reward_model_num_epochs=30,
                                    train_to_eval_ratio=0.4)
graph_manager.create_graph(task_parameters)
graph_manager.improve()

Model Selection with OPE

Running the above preset will train an agent based on the experience in the csv dataset. Note that we are now finally demonstrating the real Batch Reinforcement Learning scenario, where we train and evaluate solely based on the recorded dataset. Coach uses the same dataset (after splitting it internally, of course) for both training and evaluation.

Now that we have run this preset, we have 100 agents (one is saved after every training epoch), and we have to decide which one to choose for deployment (either for running another round of experience collection and training, or for final deployment, meaning going into production).

Opening the experiment csv in Dashboard and displaying the OPE signals, we can now choose a checkpoint file for deployment on the end-node. Here is an example run, showing the Weighted Importance Sampling and Sequential Doubly Robust OPEs.

Based on this plot, we would probably choose a checkpoint from around Epoch 85. From here, if we are not satisfied with the deployed agent's performance, we can iteratively continue with data collection, policy training (maybe based on a combination of all the data collected so far), and deployment.
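If you prefer to automate the selection rather than eyeballing the Dashboard plot, a rough sketch along the following lines can work. The csv path and the column names ('Epoch', 'Weighted Importance Sampling') are assumptions - check the header of your own experiment's worker csv and adjust accordingly.

import pandas as pd

# Hypothetical path and signal names - adjust to your experiment's worker csv
experiment_csv = './worker_0.agent_0.csv'
ope_signal = 'Weighted Importance Sampling'

signals = pd.read_csv(experiment_csv)
per_epoch = signals.dropna(subset=[ope_signal])

# pick the epoch whose checkpoint scored best according to the chosen off-policy estimator
best_row = per_epoch.loc[per_epoch[ope_signal].idxmax()]
print('Best epoch according to {}: {}'.format(ope_signal, best_row['Epoch']))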

