Speeding up training by applying an auxiliary imitation loss on expert actions:

L_total = aac_lambda * L_a3c + guided_lambda * L_imitation

Intuition:
  • This implementation is loosely referred to as 'guided policy search' after the algorithm described by S. Levine and P. Abbeel in 'Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics', in the sense that we use the same idea of fitting a 'local' (here: single-episode) oracle for an environment with unknown dynamics and using the actions it demonstrates to shape the trajectory distribution of the training agent; it is also connected to RLfD (Reinforcement Learning from Demonstration) ideas.

  • Using expert actions is a proven way to speed up training by exploring more relevant regions of the state-action space;

  • It can also lead to suboptimal policies when the demonstrated actions are themselves suboptimal (the learnt model is irrelevant) and the agent is strictly tied to expert trajectories (cannot explore and act on its own);

  • So it is essential to find a balanced degree of agent dependency on the advisor; the loss sketch below shows how this balance is expressed as a weighting of the two loss terms.
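A minimal sketch of this balance as it appears in the loss (pure Python; the function name gps_total_loss and the concrete numbers are illustrative assumptions, the actual terms are assembled inside the GuidedAAC trainer):

def gps_total_loss(aac_loss, imitation_loss, aac_lambda=1.0, guided_lambda=0.5):
    # Weighted sum of the main A3C loss and the expert-imitation loss.
    # Illustrative only; the real loss graph is built by GuidedAAC.
    return aac_lambda * aac_loss + guided_lambda * imitation_loss

# With guided_lambda=0 the expert advice is ignored entirely:
assert gps_total_loss(1.5, 0.7, aac_lambda=1.0, guided_lambda=0.0) == 1.5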
Papers:
  • S. Levine, P. Abbeel, 'Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics';
  • E. Wiewiora et al., 'Principled Methods for Advising Reinforcement Learning Agents'.

Implementation details:
  • For each training episode the expert has access to the entire data range, but it does not provide complete state-action trajectories. Instead it fits a simple local model of the external part of the environment and gives advice in the form of a categorical probability distribution over the action space. Moreover, the imitation loss is defined on an action subspace, namely over the 'buy' and 'sell' actions. Such relaxed conditions seem to work better than strictly following the [possibly suboptimal] expert trajectory.

  • Note that expert advice is conditioned on external state observations only, i.e. price dynamics; the expert does not account for the current inner agent state (open position, account value etc.). One can say it acts like a real-life financial advisor: '...now is probably the time to buy'; it bears no responsibility for the advice given, so it is up to the agent to decide whether to follow it or not. The degree of agent 'independence' from the oracle is regulated by the guided_lambda hyperparameter. When set to zero, oracle advice is completely ignored. When set >> 1, it can clearly be seen (with sine-wave data, for example) that agent performance degrades due to strong dependency on the oracle; reasonable values for this setup seem to lie within the 0.1 .. 1.0 range;

Expert estimated actions example:
  • Currently the oracle implements an extremely simple strategy: it finds local peaks along the entire episode data, filters them by time and value, and estimates signals by reversing position after every up- or down-peak. It then smooths the signals by convolving them with a Gaussian kernel, normalises, and outputs a discrete probability distribution over the action space; a rough sketch of this pipeline is given below.
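A self-contained sketch of such a pipeline (NumPy/SciPy; the function toy_oracle, its parameters and the hold/buy/sell column layout are assumptions made for illustration, and the time/value filtering is only approximated by the order parameter; see btgym.research.gps.oracle.Oracle for the actual implementation):

import numpy as np
from scipy.signal import argrelextrema

def toy_oracle(price, order=5, kernel_size=10, kernel_stddev=1.0):
    # Illustrative oracle: mark a 'buy' at every local price minimum and a
    # 'sell' at every local maximum (reversing position at each peak), then
    # smooth with a Gaussian kernel and normalise per step.
    signals = np.zeros((len(price), 3))                                  # columns: hold, buy, sell
    signals[:, 0] = 1e-8                                                 # tiny 'hold' mass everywhere
    signals[argrelextrema(price, np.less, order=order)[0], 1] = 1.0      # buy at local minima
    signals[argrelextrema(price, np.greater, order=order)[0], 2] = 1.0   # sell at local maxima
    x = np.arange(kernel_size) - kernel_size // 2
    kernel = np.exp(-0.5 * (x / kernel_stddev) ** 2)                     # Gaussian smoothing kernel
    for col in (1, 2):
        signals[:, col] = np.convolve(signals[:, col], kernel, mode='same')
    return signals / signals.sum(axis=-1, keepdims=True)                 # per-step categorical distribution

# Toy usage on a sine wave:
probs = toy_oracle(np.sin(np.linspace(0, 20, 1000)))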

Expert actions distribution:
  • It seems that using an MSE loss on action subsets leaves the agent a better degree of freedom than a cross-entropy loss over the entire action space, though this should be highly dependent on the quality of the advised actions; see the comparison sketch below.
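A sketch of the two imitation-loss variants being compared (NumPy; names are illustrative, and the assumption that 'buy' and 'sell' occupy columns 1 and 2 of the action distributions is made purely for the example):

import numpy as np

def imitation_mse_subspace(pi_probs, expert_probs, subspace=(1, 2)):
    # MSE only over the advised subset of actions (e.g. 'buy' and 'sell');
    # the remaining actions are left unconstrained.
    idx = list(subspace)
    return np.mean((pi_probs[:, idx] - expert_probs[:, idx]) ** 2)

def imitation_cross_entropy(pi_probs, expert_probs, eps=1e-8):
    # Cross-entropy over the full action space: a stricter constraint, since
    # every action probability is pushed towards the expert's.
    return -np.mean(np.sum(expert_probs * np.log(pi_probs + eps), axis=-1))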

Example of a distribution over actions learnt by the agent*:

*sample, unrelated to the image above

An addition for those who like to look at pictures: extended summaries.

  • TensorBoard summaries (see the Images tab) are updated and now include:
- action probabilities for an episode (see above);
- value function for an episode:

- visualisations of the hidden state of the LSTM blocks:
  • 256 cells along the vertical axis, environment timesteps along the horizontal one; it can clearly be seen how the RNN state gets reset when a deal is closed (a minimal rendering sketch follows below).
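A minimal matplotlib sketch of how such a picture can be rendered, assuming the per-step hidden states have already been collected into an array of shape [time_steps, 256] (this is not the actual summary code used by the trainer):

import matplotlib.pyplot as plt

def plot_lstm_state(hidden_states):
    # hidden_states: array of shape [time_steps, n_cells], e.g. [T, 256];
    # cells run along the vertical axis, environment time along the horizontal.
    plt.figure(figsize=(12, 4))
    plt.imshow(hidden_states.T, aspect='auto', cmap='viridis')
    plt.xlabel('environment step')
    plt.ylabel('LSTM cell')
    plt.colorbar(label='activation')
    plt.show()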

TODO:

    1. Relax agent dependency on advised actions by annealing guided_lambda in the course of training (a sketch of a linear annealing schedule is given after this list);
    2. Implement a potential-based approach to advising as in Wiewiora et al.
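A sketch of what a linear annealing schedule for guided_lambda could look like (the helper name is hypothetical; note that the trainer config below already exposes guided_decay_steps for this purpose):

def annealed_guided_lambda(step, start_value=1.0, decay_steps=10 * 10**6):
    # Hypothetical helper: linearly anneal guided_lambda from start_value
    # down to zero over decay_steps environment steps.
    return start_value * max(0.0, 1.0 - step / decay_steps)

assert annealed_guided_lambda(0) == 1.0
assert annealed_guided_lambda(10 * 10**6) == 0.0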

In [ ]:
import warnings
warnings.filterwarnings("ignore") # suppress h5py deprecation warning

import os
import backtrader as bt
import numpy as np
from gym import spaces

from btgym import BTgymEnv, BTgymDataset, BTgymRandomDataDomain
from btgym.algorithms import Launcher

from btgym.research.gps.aac import GuidedAAC
from btgym.research.gps.policy import GuidedPolicy_0_0
from btgym.research.gps.strategy import GuidedStrategy_0_0, ExpertObserver

In [ ]:
# Set backtesting engine and parameters:

engine = bt.Cerebro()

engine.addstrategy(
    GuidedStrategy_0_0,
    drawdown_call=10, # max % to lose, in percent of initial cash
    target_call=10,  # max % to win, same
    skip_frame=10,
    gamma=0.99,
    state_ext_scale=np.linspace(4e3, 1e3, num=6),
    reward_scale=7,
    expert_config=  # see btgym.research.gps.oracle.Oracle class for details
        {
            'time_threshold': 5,
            'pips_threshold': 10, 
            'pips_scale': 1e-4,
            'kernel_size': 10,
            'kernel_stddev': 1,
        },
)

# Expert actions observer:
engine.addobserver(ExpertObserver)

# Set leveraged account:
engine.broker.setcash(2000)
engine.broker.setcommission(commission=0.0001, leverage=10.0) # commission to imitate spread
engine.addsizer(bt.sizers.SizerFix, stake=5000)  

# Data: uncomment to get up to six months of 1-minute bars:
data_m1_6_month = [
    './data/DAT_ASCII_EURUSD_M1_201701.csv',
    './data/DAT_ASCII_EURUSD_M1_201702.csv',
    './data/DAT_ASCII_EURUSD_M1_201703.csv',
    #'./data/DAT_ASCII_EURUSD_M1_201704.csv',
    #'./data/DAT_ASCII_EURUSD_M1_201705.csv',
    #'./data/DAT_ASCII_EURUSD_M1_201706.csv',
]

# Uncomment a single source file:
dataset = BTgymRandomDataDomain(  
    #filename=data_m1_6_month,
    #filename='./data/DAT_ASCII_EURUSD_M1_2016.csv', # full year
    filename='./data/test_sine_1min_period256_delta0002.csv',  # simple sine 

    trial_params=dict(
        start_weekdays={0, 1, 2, 3, 4, 5, 6},
        sample_duration={'days': 3, 'hours': 0, 'minutes': 0},
        start_00=False,
        time_gap={'days': 1, 'hours': 10},
        test_period={'days': 0, 'hours': 0, 'minutes': 0},
    ),
    episode_params=dict(
        start_weekdays={0, 1, 2, 3, 4, 5, 6},
        sample_duration={'days': 1, 'hours': 23, 'minutes': 50},
        start_00=False,
        time_gap={'days': 1, 'hours': 0},
    ),
)

env_config = dict(
    class_ref=BTgymEnv, 
    kwargs=dict(
        dataset=dataset,
        engine=engine,
        render_modes=['episode', 'human', 'external', 'internal'],
        render_state_as_image=True,
        render_ylabel='OHL_diff. / Internals',
        render_size_episode=(12,8),
        render_size_human=(9, 4),
        render_size_state=(11, 3),
        render_dpi=75,
        port=5000,
        data_port=4999,
        connect_timeout=90,
        verbose=0,
    )
)

cluster_config = dict(
    host='127.0.0.1',
    port=12230,
    num_workers=4,  # set according to the number of CPUs available
    num_ps=1,
    num_envs=1,
    log_dir=os.path.expanduser('~/tmp/gps'),
)

policy_config = dict(
    class_ref=GuidedPolicy_0_0,
    kwargs={
        'lstm_layers': (256, 256),
        'lstm_2_init_period': 50,
        'conv_2d_layer_config': (
             (32, (3, 1), (2, 1)),
             (32, (3, 1), (2, 1)),
             (64, (3, 1), (2, 1)),
             (64, (3, 1), (2, 1))
         ),
        'encode_internal_state': False,
    }
)

trainer_config = dict(
    class_ref=GuidedAAC,
    kwargs=dict(
        opt_learn_rate=1e-4, # scalar or random log-uniform 
        opt_end_learn_rate=1e-5,
        opt_decay_steps=20*10**6,
        model_gamma=0.99,
        model_gae_lambda=1.0,
        model_beta=0.01, # Entropy reg, scalar or random log-uniform
        aac_lambda=1.0, # main a3c loss weight
        guided_lambda=1.0,  # Imitation loss weight
        guided_decay_steps=10*10**6,  # annealing guided_lambda to zero in 10M steps
        rollout_length=20,
        time_flat=True,
        use_value_replay=False,
        episode_train_test_cycle=[1,0],
        model_summary_freq=100,
        episode_summary_freq=5,
        env_render_freq=5,
    )
)

In [ ]:
launcher = Launcher(
    cluster_config=cluster_config,
    env_config=env_config,
    trainer_config=trainer_config,
    policy_config=policy_config,
    test_mode=False,
    max_env_steps=100*10**6,
    root_random_seed=0,
    purge_previous=1,  # ask before overriding previously saved model and logs
    verbose=0,
)

# Train it:
launcher.run()

In [ ]: