Discrete actions, multi-data, multi-asset setup: intro.


In [ ]:
from logbook import INFO, WARNING, DEBUG

import warnings
warnings.filterwarnings("ignore") # suppress h5py deprecation warning

import numpy as np
import os
import backtrader as bt

from btgym.research.casual_conv.strategy import CasualConvStrategyMulti
from btgym.research.casual_conv.networks import conv_1d_casual_attention_encoder

from btgym.algorithms.launcher.base import Launcher
from btgym.algorithms.aac import A3C
from btgym.algorithms.policy import StackedLstmPolicy

from btgym import MultiDiscreteEnv
from btgym.datafeed.casual import BTgymCasualDataDomain
from btgym.datafeed.multi import BTgymMultiData

from collections import OrderedDict

Problem formulation

Consider a setup with one riskless asset acting as broker account cash and K risky assets. For every risky asset there exists a track of historic price records, referred to as a data line. Apart from the asset data lines there can [optionally] be a number of exogenous data lines holding information and statistics, e.g. economic indices, encoded news, macroeconomic indicators, weather forecasts etc., which are considered relevant to decision-making. For this setup it is supposed that:

1. there are no interest rates for any asset;
2. broker actions are fixed-size market orders (`buy`, `sell`, `close`); short selling is permitted;
3. transaction costs are modelled via broker commission;
4. the 'market liquidity' and 'capital impact' assumptions are met;
5. time indexes match across all data lines provided.

Model

The problem is modelled as a discrete-time, finite-horizon, partially observable Markov decision process for equity/currency trading:

- *for every asset traded*, the agent's action space is discrete: `0: hold` [do nothing], `1: buy`, `2: sell`, `3: close` [position];
- the environment is episodic: a maximum episode duration and episode termination conditions
  are set;
- at every timestep of the episode the agent is given an environment state observation as a tensor of the last
  `m` time-embedded, preprocessed values for every data line included, and emits actions according to some stochastic policy;
- the agent's goal is to maximize expected cumulative capital by learning an optimal policy.

Environment setup explanation:

    1. This environment expects the dataset to be an instance of `btgym.datafeed.multi.BTgymMultiData`, which sets
    the number, specifications and sampling synchronisation of historic data for all assets and data lines
    one wants to consider jointly.

    2. Internally, every episodic asset's data is converted to a single bt.feed and added to the environment strategy
    as a separate named data line (see the backtrader docs for an extensive explanation of the data lines concept).
    The strategy is expected to properly handle all received data lines; see the sketch below.
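
    For orientation: a minimal, illustrative sketch (not part of btgym) of how a backtrader
    strategy can address its named data lines; `CasualConvStrategyMulti` does the equivalent
    internally and is used below.

        Example::

            import backtrader as bt

            class MultiDataStrategySketch(bt.Strategy):
                # hypothetical minimal strategy, for illustration only

                def next(self):
                    # iterate over all named data lines received:
                    for data in self.datas:
                        print(data._name, data.close[0])
                    # or address a single data line by its name:
                    usd = self.getdatabyname('USD')
                    print(usd.close[0])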

    3. btgym.spaces.ActionDictSpace and order execution. The strategy expects to receive a separate action
    for every asset, in the form of a dictionary: `{data_line_name_1: action, ..., data_line_name_K: action}`
    for K assets added, and issues orders for all assets within a single strategy step.
    Actions are supposed to be discrete [for this environment] and the same for every asset.
    Base actions are set by `strategy.params.portfolio_actions`; the default is `('hold', 'buy', 'sell', 'close')`,
    which equals `gym.spaces.Discrete` with depth `N=4` (number of actions: 0, 1, 2, 3).
    That is, for `K` assets the environment action space will be a shallow dictionary
    (`DictSpace`) of discrete spaces:
    `{data_line_name_1: gym.spaces.Discrete(N), ..., data_line_name_K: gym.spaces.Discrete(N)}`

        Example::

            if the data lines added via BTgymMultiData are: ['eurchf', 'eurgbp', 'eurjpy', 'eurusd'],
            and the base asset actions are ['hold', 'buy', 'sell', 'close'], then:

            env.action_space will be:
                DictSpace(
                    {
                        'eurchf': gym.spaces.Discrete(4),
                        'eurgbp': gym.spaces.Discrete(4),
                        'eurjpy': gym.spaces.Discrete(4),
                        'eurusd': gym.spaces.Discrete(4),
                    }
                )
            single environment action instance (as seen inside strategy):
                {
                    'eurchf': 'hold',
                    'eurgbp': 'buy',
                    'eurjpy': 'hold',
                    'eurusd': 'close',
                }
            corresponding action integer encoding as passed to environment via .step():
                {
                    'eurchf': 0,
                    'eurgbp': 1,
                    'eurjpy': 0,
                    'eurusd': 3,
                }
            vector of integers (categorical):
                (0, 1, 0, 3)
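
    The same structure can be reproduced with plain gym spaces (a sketch for illustration;
    btgym actually wraps these into its own `btgym.spaces.DictSpace`):

        Example::

            from gym import spaces

            action_space = spaces.Dict(
                {
                    'eurchf': spaces.Discrete(4),
                    'eurgbp': spaces.Discrete(4),
                    'eurjpy': spaces.Discrete(4),
                    'eurusd': spaces.Discrete(4),
                }
            )
            # draw a random action dictionary, e.g. {'eurchf': 2, 'eurgbp': 0, ...}:
            action = action_space.sample()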

    4. Environment action cardinality and encoding. Note that the total set of environment actions for `K` assets
    and `N` base actions is a cartesian product of `K` sets of `N` elements each, i.e. it holds `N^K` elements.
    It can be encoded as a vector of integers, a single scalar, binary or one-hot.
    As the cardinality grows exponentially with `K`, the multi-discrete action setup is only suited
    for a small number of assets.

        Example::

            A setup with 4 assets and 4 base actions [hold, buy, sell, close] spawns a total of 256 possible
            environment actions, expressed by a single integer in [0, 255] or by binary encoding (two bits per asset):
                vector str :                            vector int:     int:   binary:
                ('hold', 'hold', 'hold', 'hold')     -> (0, 0, 0, 0) -> 0   -> 00000000
                ('hold', 'hold', 'hold', 'buy')      -> (0, 0, 0, 1) -> 1   -> 00000001
                ...         ...         ...
                ('close', 'close', 'close', 'sell')  -> (3, 3, 3, 2) -> 254 -> 11111110
                ('close', 'close', 'close', 'close') -> (3, 3, 3, 3) -> 255 -> 11111111

    Internally there is some juggling with encodings, as we jump back and forth between
    the dictionary of names, categorical encoding, binary encoding and one-hot encoding.
    As a rule: the strategy operates with a dictionary of action names (strings), the environment
    sees an action as a dictionary of integers, while the policy estimator operates with
    either binary or one-hot encoding.
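
    For small `K` these conversions are straightforward. A sketch, assuming assets are ordered
    as in the examples above (the ordering convention here is an assumption):

        Example::

            import numpy as np

            asset_names = ['eurchf', 'eurgbp', 'eurjpy', 'eurusd']
            depth = 4  # number of base actions, N

            action_dict = {'eurchf': 0, 'eurgbp': 1, 'eurjpy': 0, 'eurusd': 3}

            # dictionary -> categorical vector:
            vector = [action_dict[name] for name in asset_names]  # [0, 1, 0, 3]

            # categorical vector -> single scalar, base-N positional encoding:
            scalar = 0
            for a in vector:
                scalar = scalar * depth + a  # 0*64 + 1*16 + 0*4 + 3 = 19

            # scalar -> binary string (two bits per asset) and one-hot vector:
            binary = format(scalar, '08b')                       # '00010011'
            one_hot = np.eye(depth ** len(asset_names))[scalar]  # length 256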

    5. Observation space: a nested DictSpace, where the 'external' part of the space
    should hold specifications for every data line added.

        Example::

            if the data lines added via BTgymMultiData are:
                'eurchf', 'eurgbp', 'eurjpy', 'eurusd';

            the environment observation space should be a DictSpace:
            {
                'raw': spaces.Box(low=-1000, high=1000, shape=(128, 4), dtype=np.float32),
                'external': DictSpace(
                    {
                        'eurusd': spaces.Box(low=-1000, high=1000, shape=(128, 1, num_features), dtype=np.float32),
                        'eurgbp': spaces.Box(low=-1000, high=1000, shape=(128, 1, num_features), dtype=np.float32),
                        'eurchf': spaces.Box(low=-1000, high=1000, shape=(128, 1, num_features), dtype=np.float32),
                        'eurjpy': spaces.Box(low=-1000, high=1000, shape=(128, 1, num_features), dtype=np.float32),
                    }
                ),
                'internal': spaces.Box(...),
                'datetime': spaces.Box(...),
                'metadata': DictSpace(...)
            }
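
    The nested observation space can likewise be sketched with plain gym spaces (again an
    approximation; btgym uses its own `DictSpace`, and `num_features` here matches the strategy
    configuration below):

        Example::

            import numpy as np
            from gym import spaces

            num_features = 16

            external = spaces.Dict(
                {
                    name: spaces.Box(low=-1000, high=1000, shape=(128, 1, num_features), dtype=np.float32)
                    for name in ('eurchf', 'eurgbp', 'eurjpy', 'eurusd')
                }
            )
            # every observation received from the environment should conform:
            obs = external.sample()
            assert external.contains(obs)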

In [ ]:
engine = bt.Cerebro()

num_features = 16

engine.addstrategy(
    CasualConvStrategyMulti,
    cash_name='EUR',  # just a name for the cash asset
    start_cash=2000,
    commission=0.0001, 
    leverage=10.0,
    asset_names={'USD', 'CHF'},  # trade these two only; JPY and GBP serve as information data lines
    drawdown_call=10,
    target_call=10,
    skip_frame=10,
    gamma=0.99,
    state_ext_scale={  # strategy-specific CWT preprocessing params:
        'USD': np.linspace(1, 2, num=num_features),
        'GBP': np.linspace(1, 2, num=num_features),
        'CHF': np.linspace(1, 2, num=num_features),
        'JPY': np.linspace(5e-3, 1e-2, num=num_features),
    },
    cwt_signal_scale=4e3,
    cwt_lower_bound=4.0, 
    cwt_upper_bound=90.0,
    reward_scale=7, 
    order_size={
        'CHF': 1000,
        #'GBP': 1000,
        #'JPY': 1000,
        'USD': 1000,
    },
)

data_config = [
    ('USD', {'filename': './data/DAT_ASCII_EURUSD_M1_2017.csv'}, ),
    ('GBP', {'filename': './data/DAT_ASCII_EURGBP_M1_2017.csv'}, ),
    ('JPY', {'filename': './data/DAT_ASCII_EURJPY_M1_2017.csv'}, ),
    ('CHF', {'filename': './data/DAT_ASCII_EURCHF_M1_2017.csv'}, ),
]

data_config = OrderedDict(data_config)

dataset = BTgymMultiData(
    data_class_ref=BTgymCasualDataDomain,
    data_config=data_config,
    trial_params=dict(
        start_weekdays={0, 1, 2, 3, 4, 5, 6},
        sample_duration={'days': 30, 'hours': 0, 'minutes': 0},
        start_00=False,
        time_gap={'days': 15, 'hours': 0},
        test_period={'days': 7, 'hours': 0, 'minutes': 0},
        expanding=True,
    ),
    episode_params=dict(
        start_weekdays={0, 1, 2, 3, 4, 5, 6},
        sample_duration={'days': 2, 'hours': 23, 'minutes': 55},
        start_00=False,
        time_gap={'days': 2, 'hours': 15},
    ),
    frozen_time_split={'year': 2017, 'month': 3, 'day': 1},
)
#########################

env_config = dict(
    class_ref=MultiDiscreteEnv, 
    kwargs=dict(
        dataset=dataset,
        engine=engine,
        render_modes=['episode'],
        render_state_as_image=True,
        render_size_episode=(12,16),
        render_size_human=(9, 4),
        render_size_state=(11, 3),
        render_dpi=75,
        port=5000,
        data_port=4999,
        connect_timeout=90,
        verbose=0,
    )
)

cluster_config = dict(
    host='127.0.0.1',
    port=12230,
    num_workers=4,  # set according to the number of CPU cores available
    num_ps=1,
    num_envs=1,
    log_dir=os.path.expanduser('~/tmp/multi_discrete'),
)
policy_config = dict(
    class_ref=StackedLstmPolicy,
    kwargs={
        'lstm_layers': (256, 256),
        'dropout_keep_prob': 1.0,  # 0.25 - 0.75 opt
        'encode_internal_state': False,
        'conv_1d_num_filters': 64,
        'state_encoder_class_ref': conv_1d_casual_attention_encoder,
    }
)

trainer_config = dict(
    class_ref=A3C,
    kwargs=dict(
        opt_learn_rate=1e-4,
        opt_end_learn_rate=1e-5,
        opt_decay_steps=50*10**6,
        model_gamma=0.99,
        model_gae_lambda=1.0,
        model_beta=0.001, # entropy reg: 0.001 for non_shared encoder_params; 0.05 if 'share_encoder_params'=True
        rollout_length=20,
        time_flat=True, 
        model_summary_freq=10,
        episode_summary_freq=1,
        env_render_freq=5,
    )
)

In [ ]:
launcher = Launcher(
    cluster_config=cluster_config,
    env_config=env_config,
    trainer_config=trainer_config,
    policy_config=policy_config,
    test_mode=False,
    max_env_steps=100*10**6,
    root_random_seed=0,
    purge_previous=1,  # ask before overriding previously saved model and logs
    verbose=0
)

# Train it:
launcher.run()
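
Training progress can then be monitored with TensorBoard pointed at the configured log
directory (assuming the standard TF summaries written under `cluster_config['log_dir']`):

    tensorboard --logdir ~/tmp/multi_discrete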
