Speeding up training by applying an auxiliary imitation loss on expert actions:
L_total = aac_lambda * L_a3c + guided_lambda * L_imitation
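For intuition, here is a minimal, self-contained numpy sketch of such a composite objective; the toy tensors, the softmax helper and the stand-in a3c_loss value are illustrative assumptions, not the actual GuidedAAC internals:

import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

# Toy batch of policy logits over 4 actions (e.g. hold, buy, sell, close):
pi_logits = np.random.randn(8, 4)
# Expert advice given as categorical distributions over the same action space:
expert_probs = softmax(np.random.randn(8, 4))

# Imitation term: cross-entropy between expert distribution and current policy:
imitation_loss = -np.mean(np.sum(expert_probs * np.log(softmax(pi_logits) + 1e-8), axis=-1))

a3c_loss = 0.5  # stand-in scalar for the usual actor-critic loss
aac_lambda, guided_lambda = 1.0, 1.0
total_loss = aac_lambda * a3c_loss + guided_lambda * imitation_loss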
This implementation is loosely referred to as 'guided policy search' after the algorithm described by S. Levine and P. Abbeel in "Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics", in the sense that we use the same idea of fitting a 'local' (here: single-episode) oracle for an environment with unknown dynamics and using the actions it demonstrates to shape the trajectory distribution used to train the agent; it is also connected to RLfD ideas.
Using expert actions is a proven way to speed up training by exploring more relevant regions of the state-action space;
It can also lead to suboptimal policies when the demonstrated actions are themselves suboptimal (the learnt expert model is irrelevant) and the agent is strictly tied to the expert trajectories (cannot explore and act on its own);
Levine et al., Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics
Brys et al., Reinforcement Learning from Demonstration through Shaping
Wiewiora et al., Principled Methods for Advising Reinforcement Learning Agents
Abbeel, Ng, Exploration and Apprenticeship Learning in Reinforcement Learning
For each training episode the expert has access to the entire data range but does not provide complete state-action trajectories. Instead it fits a simple local model of the external part of the environment, which provides advice in the form of a categorical probability distribution over the action space. Moreover, the imitation loss is defined on an action subspace, namely over the 'buy' and 'sell' actions. Such relaxed conditions seem to work better than strictly following a [possibly suboptimal] expert trajectory.
Note that expert advice is conditioned on external state observations only, i.e. price dynamics; the expert does not account for the current internal agent state (open position, account value etc.). It acts much like a real-life financial advisor: '...now is probably the time to buy'; it bears no responsibility for the advice given, so it is up to the agent to decide whether to follow it or not. The degree of agent 'independence' from the oracle is regulated by the guided_lambda hyperparameter. When set to zero, oracle advice is completely ignored. When set >> 1 it can be clearly seen (with sine wave data, for example) that agent performance degrades due to strong dependence on the oracle; reasonable values for this setup seem to lie within the 0.1 .. 1.0 range.
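The actual expert logic lives in the btgym.research.gps.oracle.Oracle class (configured below via expert_config); as a rough illustration only, the toy function below mimics the kind of advice the agent receives: a per-step categorical preference over the 'buy'/'sell' subspace, conditioned purely on the episode-local, fully known price series. All names, thresholds and the sigmoid mapping here are hypothetical, not the actual Oracle implementation:

import numpy as np

def toy_oracle_advice(close, horizon=5, pips_threshold=10, pips_scale=1e-4):
    # Toy stand-in for the episode-local expert: look `horizon` steps ahead,
    # measure the expected move in pips and turn it into a soft buy/sell preference.
    future = np.roll(close, -horizon)
    future[-horizon:] = close[-1]
    move_pips = (future - close) / pips_scale
    signal = np.clip(move_pips / pips_threshold, -5.0, 5.0)
    p_buy = 1.0 / (1.0 + np.exp(-signal))
    # Categorical advice over the {'buy', 'sell'} subspace:
    return np.stack([p_buy, 1.0 - p_buy], axis=-1)

# Example on a sine-like price series (cf. the test_sine data used below):
t = np.arange(512)
close = 1.05 + 0.002 * np.sin(2 * np.pi * t / 256)
advice = toy_oracle_advice(close)
print(advice[:3])  # per-step [p_buy, p_sell]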
In [ ]:
import warnings
warnings.filterwarnings("ignore") # suppress h5py deprecation warning
import os
import backtrader as bt
import numpy as np
from gym import spaces
from btgym import BTgymEnv, BTgymDataset, BTgymRandomDataDomain
from btgym.algorithms import Launcher
from btgym.research.gps.aac import GuidedAAC
from btgym.research.gps.policy import GuidedPolicy_0_0
from btgym.research.gps.strategy import GuidedStrategy_0_0, ExpertObserver
In [ ]:
# Set backtesting engine and parameters:
engine = bt.Cerebro()
engine.addstrategy(
    GuidedStrategy_0_0,
    drawdown_call=10,  # max % to lose, in percent of initial cash
    target_call=10,  # max % to win, same
    skip_frame=10,
    gamma=0.99,
    state_ext_scale=np.linspace(4e3, 1e3, num=6),
    reward_scale=7,
    expert_config={  # see btgym.research.gps.oracle.Oracle class for details
        'time_threshold': 5,
        'pips_threshold': 10,
        'pips_scale': 1e-4,
        'kernel_size': 10,
        'kernel_stddev': 1,
    },
)
# Expert actions observer:
engine.addobserver(ExpertObserver)
# Set leveraged account:
engine.broker.setcash(2000)
engine.broker.setcommission(commission=0.0001, leverage=10.0)  # commission to imitate spread
engine.addsizer(bt.sizers.SizerFix, stake=5000)
# Data: uncomment to get up to six months of 1-minute bars:
data_m1_6_month = [
    './data/DAT_ASCII_EURUSD_M1_201701.csv',
    './data/DAT_ASCII_EURUSD_M1_201702.csv',
    './data/DAT_ASCII_EURUSD_M1_201703.csv',
    #'./data/DAT_ASCII_EURUSD_M1_201704.csv',
    #'./data/DAT_ASCII_EURUSD_M1_201705.csv',
    #'./data/DAT_ASCII_EURUSD_M1_201706.csv',
]
# Uncomment single choice of source file:
dataset = BTgymRandomDataDomain(
    #filename=data_m1_6_month,
    #filename='./data/DAT_ASCII_EURUSD_M1_2016.csv',  # full year
    filename='./data/test_sine_1min_period256_delta0002.csv',  # simple sine
    trial_params=dict(
        start_weekdays={0, 1, 2, 3, 4, 5, 6},
        sample_duration={'days': 3, 'hours': 0, 'minutes': 0},
        start_00=False,
        time_gap={'days': 1, 'hours': 10},
        test_period={'days': 0, 'hours': 0, 'minutes': 0},
    ),
    episode_params=dict(
        start_weekdays={0, 1, 2, 3, 4, 5, 6},
        sample_duration={'days': 1, 'hours': 23, 'minutes': 50},
        start_00=False,
        time_gap={'days': 1, 'hours': 0},
    ),
)
env_config = dict(
    class_ref=BTgymEnv,
    kwargs=dict(
        dataset=dataset,
        engine=engine,
        render_modes=['episode', 'human', 'external', 'internal'],
        render_state_as_image=True,
        render_ylabel='OHL_diff. / Internals',
        render_size_episode=(12, 8),
        render_size_human=(9, 4),
        render_size_state=(11, 3),
        render_dpi=75,
        port=5000,
        data_port=4999,
        connect_timeout=90,
        verbose=0,
    )
)
cluster_config = dict(
    host='127.0.0.1',
    port=12230,
    num_workers=4,  # set according to the number of CPU cores available
    num_ps=1,
    num_envs=1,
    log_dir=os.path.expanduser('~/tmp/gps'),
)
policy_config = dict(
    class_ref=GuidedPolicy_0_0,
    kwargs={
        'lstm_layers': (256, 256),
        'lstm_2_init_period': 50,
        'conv_2d_layer_config': (
            (32, (3, 1), (2, 1)),
            (32, (3, 1), (2, 1)),
            (64, (3, 1), (2, 1)),
            (64, (3, 1), (2, 1))
        ),
        'encode_internal_state': False,
    }
)
trainer_config = dict(
    class_ref=GuidedAAC,
    kwargs=dict(
        opt_learn_rate=1e-4,  # scalar or random log-uniform
        opt_end_learn_rate=1e-5,
        opt_decay_steps=20 * 10**6,
        model_gamma=0.99,
        model_gae_lambda=1.0,
        model_beta=0.01,  # entropy reg., scalar or random log-uniform
        aac_lambda=1.0,  # main a3c loss weight
        guided_lambda=1.0,  # imitation loss weight
        guided_decay_steps=10 * 10**6,  # annealing guided_lambda to zero in 10M steps
        rollout_length=20,
        time_flat=True,
        use_value_replay=False,
        episode_train_test_cycle=[1, 0],
        model_summary_freq=100,
        episode_summary_freq=5,
        env_render_freq=5,
    )
)
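As a side note, guided_decay_steps above anneals the imitation weight toward zero during training; the exact schedule is handled inside GuidedAAC, but a simple linear decay (assumed here purely for illustration) conveys the idea:

def annealed_guided_lambda(step, start_value=1.0, decay_steps=10 * 10**6):
    # Linearly decay from start_value to zero over decay_steps, then stay at zero.
    return max(0.0, start_value * (1.0 - step / decay_steps))

print(annealed_guided_lambda(0), annealed_guided_lambda(5 * 10**6), annealed_guided_lambda(12 * 10**6))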
In [ ]:
launcher = Launcher(
    cluster_config=cluster_config,
    env_config=env_config,
    trainer_config=trainer_config,
    policy_config=policy_config,
    test_mode=False,
    max_env_steps=100 * 10**6,
    root_random_seed=0,
    purge_previous=1,  # ask to override previously saved model and logs
    verbose=0,
)
# Train it:
launcher.run()
In [ ]: