Trading Strategy for Finance using LSTMs

This lab was developed by Mike Imas, Onur Yilmaz Ph.D., and Andy Steinbach Ph.D.

1. Environment Verification

Before we begin, let's verify WebSockets are working on your system. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Shift-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.


In [1]:
print("The answer should be three: " + str(1+2))


The answer should be three: 3

Let's execute the cell below to display information about the GPUs running on the server.


In [2]:
!nvidia-smi


Tue Dec  5 16:19:55 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   36C    P8    28W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

2. Lab Overview

This lab focuses on predicting financial time series data with a special type of recurrent neural network (RNN) called Long Short-Term Memory (LSTM), in the context of trading strategies in finance.

The goal of this lab is to present a deep learning (DL) approach that can potentially benefit complex trading strategies in finance. This lab is not a complete trading strategy that generates a profit and loss (PNL) curve. Rather, it shows how LSTM-based deep neural networks can be applied to predict financial time series data. The code provided in this lab can be repurposed to predict any financial time series and to support trading decisions such as opening a long position or closing a short position.

DL has been disrupting many application areas, including computer vision and natural language processing, and there has been a flurry of research and development activity in industry verticals such as healthcare and finance to exploit this technology for domain-specific use cases. DL-based investment strategies are also at the center of research and development activity in algorithmic trading.

This lab assumes that you are familiar with RNNs, TensorFlow, and Python. For more information on RNNs, LSTMs, and TensorFlow, please check the relevant DLI labs.

3. Implementation of an LSTM-Based Financial Data Predictor

In this section, we will go over the implementation of a Long Short-Term Memory (LSTM) based financial data predictor using a Kaggle dataset provided by Two Sigma Investments, a New York City based hedge fund. We picked this dataset because it provides expertly generated features for training and inference. The goal of this course is to give you hands-on experience with applying LSTMs to predict financial time series data.

What Are Long Short-Term Memory (LSTM) Networks?

The LSTM is a variant of the recurrent neural network (RNN) and was published by Hochreiter & Schmidhuber in 1997. RNNs extend regular artificial neural networks with connections that feed the hidden state of the network back into itself; these are called recurrent connections. The reason for adding recurrent connections is to give the network visibility not only of the current data sample but also of its previous hidden state. In some sense, this gives the network a sequential memory of what it has seen before, which makes RNNs applicable whenever a sequence of data (as with most financial data) is needed to make a classification decision or a regression estimate.
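Below is a minimal sketch of the recurrent update described above, using hypothetical weight matrices W_x and W_h; it is illustrative only and is not part of the lab code.

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # the new hidden state combines the current input with the previous hidden state
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# toy example: 4 input features, 8 hidden units, starting from a zero hidden state
rng = np.random.RandomState(0)
h_t = rnn_step(rng.randn(4), np.zeros(8), rng.randn(4, 8), rng.randn(8, 8), np.zeros(8))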

                                Figure 1: Recurrent Neural Networks (RNNs)

Unlike most plain RNNs, LSTMs do not suffer from the vanishing gradient problem. An LSTM cell is augmented with recurrent gates, including a forget gate. A defining feature of the LSTM is that it prevents backpropagated errors from vanishing (or exploding) and instead allows errors to flow backwards through an unlimited number of "virtual layers" unfolded in time. That is, the LSTM can learn "very deep" tasks that require memories of events that happened thousands or even millions of discrete time steps ago. Problem-specific LSTM-like topologies can be evolved and can work even when signals contain long delays or mix low- and high-frequency components.
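As a rough illustration of the gating mechanism (TensorFlow's BasicLSTMCell implements this internally; the sketch below is a simplified stand-in, not the lab's code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # a single weight matrix produces the forget, input, and output gates plus the candidate cell state
    z = np.concatenate([x_t, h_prev]) @ W + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates take values in (0, 1)
    c_t = f * c_prev + i * np.tanh(g)              # the forget gate lets gradients flow through the cell state
    h_t = o * np.tanh(c_t)
    return h_t, c_t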

a. Financial Terminology

Before we start running the code, we review some of the financial terminology used in this lab. The definitions are taken directly from Investopedia.com.

Trading Strategy: A set of objective rules defining the conditions that must be met for a trade entry and exit to occur. Trading strategies include specifications for trade entries, including trade filters and triggers, as well as rules for trade exits, money management, timeframes, order types, and other relevant information. A trading strategy, if based on quantifiable specifications, can be analyzed against historical data to project future performance.

Instrument: An instrument is a tradeable asset or negotiable item such as a security, commodity, derivative or index, or any item that underlies a derivative. An instrument is a means by which something of value is transferred, held or accomplished.

Security: It is a fungible, negotiable financial instrument that holds some type of monetary value. It represents an ownership position in a publicly-traded corporation (via stock), a creditor relationship with a governmental body or a corporation (represented by owning that entity's bond), or rights to ownership as represented by an option.

Stock: A stock is a type of security that signifies ownership in a corporation and represents a claim on part of the corporation's assets and earnings. It is delivered in the units of shares.

Share: Shares are units of ownership interest in a corporation or financial asset.

Long Position (Long): A long (or long position) is the buying of a security such as a stock, commodity or currency with the expectation that the asset will rise in value. The trader normally has no plan to sell the security in the near future. A key component of a long position investment is the ownership of the stock or bond.

Short Position (Short): A short, or short position, is a directional trading or investment strategy in which the investor sells shares of borrowed stock in the open market. The investor expects the price of the stock to decrease over time, at which point he or she will purchase the shares in the open market and return them to the broker from which they were borrowed.

Return: A return is the gain or loss of a security in a particular period. The return consists of the income and the capital gains relative to an investment, and it is usually quoted as a percentage. The general rule is that the more risk you take, the greater the potential for higher returns and losses.

Fundamental Analysis: It is a method of evaluating a security in an attempt to measure its intrinsic value, by examining related economic, financial and other qualitative and quantitative factors. Fundamental analysts study anything that can affect the security's value, including macroeconomic factors such as the overall economy and industry conditions, and microeconomic factors such as financial conditions and company management. For instance, for stocks and equity instruments, this method uses revenues, earnings, future growth, return on equity, profit margins and other data to determine a company's underlying value and potential for future growth.

Technical Analysis: It is the evaluation of securities by means of studying statistics generated by market activity, such as past prices and volume. Technical analysts do not attempt to measure a security's intrinsic value but instead use stock charts to identify patterns and trends that may suggest what a stock will do in the future.

b. Two Sigma (2$\sigma$) Investment Dataset in Kaggle

In December 2016, Two Sigma Investments, a New York City based hedge fund company, announced a Kaggle challenge called the Two Sigma Financial Modeling Challenge with a prize pool of $100,000. Two Sigma's goal was to explore what untapped value Kaggle's diverse data science community could discover in the financial markets.

The dataset published on Kaggle contains fundamental and technical features pertaining to a time-varying value of a financial instrument. These features are generated by fundamental and technical analysis. The variable to predict is "y", the return of an instrument. The features and the "y" variable are anonymized using transformations such as principal component analysis (PCA) in order to protect the original data. Each instrument has an id, and time is represented by the 'timestamp' feature. We picked this dataset because it includes an expertly generated feature set for training and inference. The structure of the data is depicted in Figure 2. Some of the features are as follows:

Fundamental Features: Macroeconomic factors (overall economy, industry conditions, financial conditions), revenues, earnings, future growth, profit margins, etc.

Technical Features: Price movements, analytical and statistical tools like mean, standard deviation, moving averages, etc.

"y" scalar variable: Return of the instrument

                                  Figure 2: Two Sigma Investment Dataset

The data is saved and accessed as an HDF5 file in the Kaggle Kernels environment. HDF5 stands for Hierarchical Data Format, version 5. The HDF format is designed specifically to store and organize large amounts of scientific data. Common file extensions include .hdf, .hdf5, or simply .h5.
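For example, the keys stored in one of the lab's HDF5 files can be inspected with pandas before loading (the path below is the one used later in this notebook; adjust it for your own environment):

import pandas as pd

with pd.HDFStore("2sigma/train.h5", 'r') as store:
    print(store.keys())        # e.g. ['/train']
    df_peek = store.get("train")
print(df_peek.shape)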

c. Step-by-Step Implementation

A typical DL workflow starts with data preparation, because raw data is usually not clean and ready to use. Building and training the deep neural network follow the data preparation. Lastly, the trained network is validated on a held-out dataset.

First, we import several widely used modules such as NumPy for numerical calculations, pandas for data management, matplotlib for visualizations, and TensorFlow for building and training deep neural networks.


In [3]:
#imports
import h5py
import pandas as pd 
import numpy as np
import pprint as pp 
import tensorflow as tf 
from tensorflow.contrib import rnn
import math
import matplotlib.pyplot as plt
import warnings
import prepareData as prepData

Data Preparation

The original data needs to be cleaned before training the network. Since cleaning the data takes a significant amount of time (around 20 minutes), we have stored the cleaned data in a separate .h5 file. If you would like to use the original data and run the cleaning code, please set the "usePreparedData" variable to "False".


In [4]:
# The data is prepared and stored in a separate .h5 file.
# Set usePreparedData = False to use the original data and run the data preparation code
usePreparedData = True
# The insampleCutoffTimestamp variable is used to split the data in time into training and test sets.
insampleCutoffTimestamp = 1650

# If usePreparedData is True, the prepared data is loaded. Otherwise, the original data is loaded.
if usePreparedData == True:
    #with pd.HDFStore("/home/mimas/2sigma/DLI_FSI/2sigma/train_prepared.h5", 'r') as train:
    with pd.HDFStore("2sigma/trainDataPrepared.h5", 'r') as train:
        df = train.get("train") 
else:
    with pd.HDFStore("2sigma/train.h5", 'r') as train:
        df = train.get("train")

There are multiple instruments in the dataset and each instrument has an id. Time is represented by the 'timestamp' feature. Let's look at the data.


In [5]:
# This will print the dataset
df


Out[5]:
id timestamp derived_0 derived_1 derived_2 derived_3 derived_4 fundamental_0 fundamental_1 fundamental_2 ... technical_43 technical_44 y y_lagged technical_diff krnldiff delta5diff krnl40 delta540 fmod29
0 10 0 0.370326 -0.006316 0.222831 -0.213030 0.729277 -0.335633 0.113292 1.621238 ... -2.000000e+00 0.000951 -0.011753 0.000046 0.000000 0.000000 0.000000 -0.041838 0.000011 0.666596
1 11 0 0.014765 -0.038064 -0.017425 0.320652 -0.034134 0.004413 0.114285 -0.210185 ... -2.000000e+00 0.000951 -0.001240 0.000046 0.000000 0.000000 0.000000 -0.041838 0.000011 0.666596
2 12 0 -0.010622 -0.050577 1.571245 -0.157525 -0.068550 -0.155937 1.060683 -0.764516 ... -2.000000e+00 0.000951 -0.020940 0.000046 0.006942 0.000000 0.000000 -0.041838 0.000011 0.666596
3 25 0 -0.003429 -0.012705 -0.005859 -0.037375 0.024913 0.178495 0.044287 -0.007262 ... -2.000000e+00 0.000951 -0.015959 0.000046 0.006766 0.000000 0.000000 -0.041838 0.000011 0.666596
4 26 0 0.176693 -0.025284 -0.057680 0.015100 0.180894 0.139445 -0.125687 -0.018707 ... 0.000000e+00 0.000951 -0.007338 0.000046 0.006236 0.000000 0.000000 -0.041838 0.000011 0.666596
5 27 0 0.346856 0.166239 -1.482727 -0.992249 -0.125916 0.345812 0.044287 -0.584239 ... -2.000000e+00 0.000951 0.031425 0.000046 0.010000 0.000000 0.000000 -0.041838 0.000011 0.666596
6 31 0 0.072036 0.014931 -0.005859 0.014063 0.024913 -0.193205 0.044287 0.019531 ... -2.000000e+00 0.000951 -0.032895 0.000046 0.006601 0.000000 0.000000 -0.041838 0.000011 0.666596
7 38 0 0.300062 0.071251 -0.074451 -0.065292 -0.011286 0.026365 0.210249 0.167494 ... -2.000000e+00 0.000951 0.015803 0.000046 0.007909 0.000000 0.000000 -0.041838 0.000011 0.666596
8 39 0 -0.003511 -0.034270 0.082372 -0.023937 -0.025750 0.007815 0.263451 -0.241212 ... -2.000000e+00 0.000951 -0.027593 0.000046 0.000000 0.000000 0.000000 -0.041838 0.000011 0.666596
9 40 0 -0.083330 0.081935 -1.482727 -0.206856 -0.839563 -0.234100 -0.291853 -2.522340 ... -2.000000e+00 0.000951 0.006662 0.000046 0.000000 0.000000 0.000000 -0.041838 0.000011 0.666596
10 41 0 0.435826 1.222721 0.363570 -0.005651 0.442866 0.125375 0.044287 0.292311 ... -2.000000e+00 0.000951 -0.001899 0.000046 0.000000 0.000000 0.000000 -0.041838 0.000011 0.666596
11 43 0 -0.003429 -0.012705 -0.005859 -0.037375 0.024913 -0.285388 0.044287 -0.193590 ... -2.000000e+00 0.000951 0.050219 0.000046 0.007407 0.000000 0.000000 -0.041838 0.000011 0.666596
12 44 0 0.034991 -0.019258 0.055769 -0.084496 0.259828 0.198800 0.265104 -0.160462 ... -2.000000e+00 0.000951 -0.018991 0.000046 0.000000 0.000000 0.000000 -0.041838 0.000011 0.666596
13 49 0 -0.003429 0.212615 -0.979520 -0.037375 0.024913 0.395701 0.044287 -0.780873 ... 0.000000e+00 0.000951 -0.005203 0.000046 -0.024503 0.000000 0.000000 -0.041838 0.000011 0.666596
14 54 0 0.071704 -0.044019 -0.005859 0.038046 0.024913 0.331778 -0.021114 0.019531 ... -2.000000e+00 0.000951 -0.006369 0.000046 0.000000 0.000000 0.000000 -0.041838 0.000011 0.666596
15 59 0 0.116360 0.164506 0.156510 -0.129252 1.310059 -0.264846 0.044287 0.370725 ... -2.000000e+00 0.000951 0.017768 0.000046 -0.012122 0.000000 0.000000 -0.041838 0.000011 0.666596
16 60 0 0.026824 -0.024105 -0.028991 0.433277 -0.006366 0.295138 0.044287 -0.285946 ... 0.000000e+00 0.000951 -0.001089 0.000046 0.007072 0.000000 0.000000 -0.041838 0.000011 0.666596
17 62 0 0.367122 0.675543 -0.008483 -0.367778 -0.015529 0.129848 0.044287 -0.251624 ... -2.000000e+00 0.000951 -0.008794 0.000046 -0.001446 0.000000 0.000000 -0.041838 0.000011 0.666596
18 63 0 0.453271 -0.036301 0.094657 -0.521106 0.049715 -0.352013 0.044287 -0.182694 ... 0.000000e+00 0.000951 0.040724 0.000046 -0.000027 0.000000 0.000000 -0.041838 0.000011 0.666596
19 68 0 -0.122734 0.038939 0.148015 2.493512 0.009746 -0.063111 0.044287 -0.162878 ... -2.000000e+00 0.000951 -0.003921 0.000046 -0.000538 0.000000 0.000000 -0.041838 0.000011 0.666596
20 69 0 -0.062361 -0.063724 0.021274 0.032845 0.174025 0.122335 -0.133188 -0.108286 ... 0.000000e+00 0.000951 -0.011317 0.000046 0.000000 0.000000 0.000000 -0.041838 0.000011 0.666596
21 70 0 0.175489 -0.024445 -0.034079 -0.020533 0.282478 -0.069417 0.247840 0.303953 ... -2.000000e+00 0.000951 0.044167 0.000046 0.008450 0.000000 0.000000 -0.041838 0.000011 0.666596
22 76 0 -0.190705 0.006221 -0.056835 0.063464 -0.149507 0.267959 -0.392849 0.080084 ... -2.000000e+00 0.000951 0.001395 0.000046 0.000000 0.000000 0.000000 -0.041838 0.000011 0.666596
23 79 0 0.008503 0.048732 0.123793 0.266366 -0.009772 -0.248937 0.044287 -0.157415 ... 0.000000e+00 0.000951 -0.012101 0.000046 -0.001014 0.000000 0.000000 -0.041838 0.000011 0.666596
24 80 0 0.043156 0.540945 0.278394 0.563343 -0.144004 -0.024376 0.044287 -0.337090 ... 0.000000e+00 0.000951 -0.070837 0.000046 -0.003050 0.000000 0.000000 -0.041838 0.000011 0.666596
25 82 0 0.086832 -0.022962 -0.005859 0.006783 0.024913 0.071911 0.112818 0.019531 ... -2.000000e+00 0.000951 -0.001766 0.000046 0.008571 0.000000 0.000000 -0.041838 0.000011 0.666596
26 83 0 -0.082307 -0.039761 0.102000 -0.798021 0.149622 0.028448 0.062647 0.044142 ... 0.000000e+00 0.000951 -0.012542 0.000046 -0.004632 0.000000 0.000000 -0.041838 0.000011 0.666596
27 85 0 0.005351 -0.031086 0.078334 0.417328 -0.020146 0.390538 0.256594 -0.167223 ... 0.000000e+00 0.000951 0.029697 0.000046 0.012955 0.000000 0.000000 -0.041838 0.000011 0.666596
28 87 0 0.045536 0.054391 0.330190 -0.816961 0.012602 -0.036462 -0.115197 -0.237428 ... 0.000000e+00 0.000951 0.030815 0.000046 -0.026302 0.000000 0.000000 -0.041838 0.000011 0.666596
29 90 0 -0.003925 -0.037522 -0.005859 -0.028244 0.024913 -0.049041 0.084827 0.019531 ... -2.000000e+00 0.000951 -0.016336 0.000046 0.005712 0.000000 0.000000 -0.041838 0.000011 0.666596
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1710726 2100 1812 0.079269 -0.021573 -0.033426 -0.333376 0.061463 0.217718 0.044287 0.127397 ... -3.330669e-16 -0.050890 0.008358 -0.009384 0.000000 0.000000 0.000000 -0.453188 -0.009168 -0.034480
1710727 2101 1812 -0.204339 0.134338 0.143729 0.702932 -0.290513 0.031102 -0.243759 -0.205067 ... -1.989691e+00 -0.001040 -0.003035 0.011515 0.004754 0.006968 -0.002417 0.046869 -0.001460 -0.034480
1710728 2102 1812 0.095661 -0.051734 -0.005859 0.035816 0.024913 -0.004488 -0.249359 0.019531 ... -1.266820e-11 0.005510 -0.001172 0.013182 0.000000 0.005418 0.000000 0.349820 -0.000827 -0.034480
1710729 2104 1812 0.134436 -0.030198 -0.057608 0.016190 0.146911 -0.340176 -0.104457 0.002421 ... -1.266820e-11 0.020394 0.003194 -0.002342 0.000000 0.000000 0.000000 -0.185159 0.000049 -0.034480
1710730 2107 1812 -0.180348 0.061339 -0.005859 0.049680 0.024913 -0.418095 -0.620948 0.019531 ... -1.660443e-06 0.052314 -0.019208 0.010236 0.000000 0.004216 0.002716 0.578964 -0.020938 -0.034480
1710731 2108 1812 -0.247698 0.026739 -0.005859 -0.280510 0.024913 0.198069 -0.620398 0.019531 ... -1.266820e-11 -0.019431 0.000383 -0.003761 0.000000 0.000000 0.000575 0.150522 0.005653 -0.034480
1710732 2109 1812 -0.425230 0.266186 0.075126 0.039726 -0.060703 -0.059133 0.044287 0.286111 ... -1.660443e-06 -0.065701 -0.012707 0.006391 0.003234 0.040423 0.003234 0.262781 0.003433 -0.034480
1710733 2114 1812 -2.225214 1.222721 1.571245 -0.347759 0.257406 0.172330 -0.098562 -0.000398 ... -1.266820e-11 -0.016923 -0.002151 -0.012710 -0.000356 -0.004449 -0.000356 0.243193 0.006373 -0.034480
1710734 2117 1812 -0.136062 -0.006129 -1.482727 -0.316265 -0.480613 -0.270387 0.044287 -0.297367 ... -1.266820e-11 -0.057015 0.034297 0.002157 -0.002306 0.015304 -0.001026 0.218020 -0.014916 -0.034480
1710735 2118 1812 -2.225214 1.222721 -0.005859 -0.037375 0.024913 0.204721 -0.136476 0.019531 ... -1.989691e+00 0.022479 -0.007932 -0.005968 -0.001756 0.004116 -0.001219 0.401249 -0.006091 -0.034480
1710736 2120 1812 -0.003429 -0.012705 -0.398379 -0.037375 0.024913 -0.222814 0.044287 -0.554627 ... -2.000000e+00 -0.032474 -0.033865 0.021858 -0.003194 0.003268 -0.003194 -0.272112 0.000000 -0.034480
1710737 2121 1812 0.126793 0.040152 -0.005859 -0.054294 0.024913 -0.423501 -0.416516 0.019531 ... -1.266820e-11 0.043590 -0.001192 -0.002113 0.000000 -0.023934 0.000000 0.336718 -0.007716 -0.034480
1710738 2126 1812 -0.181550 -0.039677 -0.082102 -0.140690 -0.231005 -0.145413 -0.150222 0.158479 ... -1.782362e+00 -0.010252 -0.003483 0.001174 -0.002802 -0.002365 -0.000005 -0.086226 -0.001102 -0.034480
1710739 2129 1812 -0.064731 -0.044927 -0.005859 0.615191 0.024913 0.040299 -0.058827 0.019531 ... -3.051758e-05 0.034427 0.005619 0.023354 0.002314 0.028929 0.003790 -0.228275 -0.000905 -0.034480
1710740 2130 1812 -2.225214 1.222721 1.174958 -0.439014 -1.318172 -0.219093 0.087497 -1.461111 ... -6.892265e-13 0.020224 -0.027963 0.010122 -0.003100 0.008168 -0.003100 0.414461 0.008308 -0.034480
1710741 2131 1812 -0.149930 0.070203 0.363126 -0.010869 0.164914 -0.262682 -0.315310 0.019663 ... -4.279235e-09 0.002708 -0.016629 0.006379 0.005736 0.018755 0.005736 0.305399 0.005956 -0.034480
1710742 2137 1812 -0.059745 -0.052186 -0.062670 -1.842369 0.073483 0.287247 0.044287 0.001499 ... -2.000000e+00 -0.024923 0.008800 -0.017717 0.000000 -0.043130 -0.002332 -0.035613 0.000000 -0.034480
1710743 2138 1812 0.219608 -0.034336 -0.031817 0.270888 0.045701 0.134292 0.364995 0.064963 ... -1.660443e-06 0.010777 0.002015 -0.004487 0.000000 0.000000 0.000000 -0.366031 -0.001975 -0.034480
1710744 2139 1812 0.062132 0.070332 0.351273 0.478199 0.092099 0.494140 0.044287 -0.330590 ... -6.892265e-13 0.027970 0.025162 -0.007581 -0.002689 -0.015003 0.000150 0.075615 -0.006907 -0.034480
1710745 2140 1812 0.213264 0.029603 -0.005859 -0.273137 0.024913 -0.464013 -0.404886 0.019531 ... -4.279235e-09 0.042746 -0.024362 0.013498 0.000000 0.000000 0.002586 1.009813 -0.006423 -0.034480
1710746 2142 1812 -0.223395 -0.042492 -0.060381 0.016019 0.218667 -0.160979 -0.138038 0.348972 ... -1.660443e-06 0.022390 0.008430 -0.004077 -0.001042 -0.000775 -0.001042 -0.193525 -0.001304 -0.034480
1710747 2145 1812 -0.154051 -0.029331 -0.010545 0.019339 -0.260369 -0.227888 0.006721 0.008255 ... -1.660443e-06 -0.032458 -0.000711 0.000967 0.000000 0.000000 0.002064 0.052745 -0.005793 -0.034480
1710748 2146 1812 -0.238458 0.316407 0.632261 0.531651 -0.154740 0.069316 0.044287 -0.543269 ... -3.330669e-16 -0.004821 -0.017794 -0.024961 0.000000 -0.029522 0.000741 0.585441 0.000000 -0.034480
1710749 2148 1812 0.089476 -0.038628 0.776538 3.759122 -0.177850 0.388419 0.193054 0.175415 ... -1.999969e+00 -0.022806 -0.001058 -0.002914 0.003262 -0.002499 0.000534 -0.490991 -0.012937 -0.034480
1710750 2149 1812 0.254593 0.064668 -0.034742 -1.336611 -0.037401 0.244451 0.049125 -0.130919 ... -6.892265e-13 -0.027780 -0.001462 0.003725 -0.000977 -0.003881 0.001808 -0.185266 -0.008880 -0.034480
1710751 2150 1812 -0.123364 -0.055977 -0.005859 0.010906 0.024913 -0.255730 -0.108285 0.019531 ... -2.328306e-10 0.001004 0.004604 0.001954 0.002757 -0.002948 0.002757 0.272229 0.005969 -0.034480
1710752 2151 1812 -2.225214 0.080905 -0.005859 3.369380 0.024913 -0.293557 0.044287 0.019531 ... -1.660443e-06 0.044597 -0.009241 0.007506 0.000000 0.007508 0.000899 0.121156 -0.010494 -0.034480
1710753 2154 1812 -0.077930 -0.038748 -0.031859 0.646608 -0.145526 -0.119539 -0.151587 -0.130524 ... -1.266820e-11 0.030816 -0.006852 0.002378 0.000000 0.000000 -0.005879 0.043207 -0.001385 -0.034480
1710754 2156 1812 -0.269845 -0.005322 -0.005859 -0.117539 0.024913 0.214088 -0.307293 0.019531 ... -1.999260e+00 -0.011706 -0.000785 -0.029726 0.003938 -0.032706 0.003938 0.392242 0.003202 -0.034480
1710755 2158 1812 -0.003429 -0.012705 -0.005859 -0.037375 0.024913 -0.065267 -1.059481 0.019531 ... 0.000000e+00 0.000951 0.003497 -0.011483 0.000000 -0.026854 -0.003424 -0.522720 0.000000 -0.034480

1710756 rows × 114 columns

If the original data is loaded, the data preparation code will be executed in the following cell. First, extreme values in each feature are removed. Then, some hand-crafted features are added to the feature set to boost the prediction accuracy. There are many other ways to do the feature engineering, including PCA and autoencoders, rather than creating hand-crafted features; as an exercise after the lab, we highly recommend adding an autoencoder to the code and checking the accuracy. Lastly, NaNs are replaced with the median of each feature (a sketch of this step is shown after the cell below).


In [7]:
if usePreparedData == False:
    # The original data is not clean and some of the samples are extreme outliers.
    # These values are removed from the feature set.
    df = prepData.removeExtremeValues(df, insampleCutoffTimestamp)
    # A bit of feature engineering: hand-crafted features are created here to boost the accuracy.
    df = prepData.createNewFeatures(df) 
    # Check whether we still have any NaNs and fill them
    df = prepData.fillNaNs(df) 
    df.to_hdf("2sigma/trainDataPrepared.h5", 'train')

Model Construction

Now, we set up the TensorFlow compute graph. The deep neural network used in this code is composed of an LSTM cell that runs over 10 time steps, a fully connected layer (FCL), and dropout layers to prevent overfitting. Choosing the number of time steps for a recurrent neural network is not a trivial task; it is another hyperparameter that needs to be searched. The network is depicted in the following figure.

                            Figure 3: Structure of the LSTM based deep neural network

Below is the code to build the deep neural network depicted in Figure 3:


In [8]:
def weight_variable(shape): 
    initial = tf.truncated_normal(shape, stddev=0.3)
    return tf.Variable(initial) 
    
def bias_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.3)
    return tf.Variable(initial) 

n_time_steps = 10
def getDNN (x, LSTMCellSize, keep_prob):
    with tf.name_scope('model'):
        with tf.name_scope('RNN'):
            # Create an LSTM cell with LSTMCellSize units and apply dropout to its output.
            cell = rnn.DropoutWrapper(rnn.BasicLSTMCell(LSTMCellSize, forget_bias=2, activation=tf.nn.tanh), output_keep_prob=keep_prob)
            # We use the cell to create the RNN.
            # outputs is a tensor of shape (batch_size, n_time_steps, LSTMCellSize); we use only the last time step.
            outputs, states = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32) 
            outputs_shape = outputs.get_shape().as_list()
                
        # fully connected output layer (linear): maps the last LSTM output to a single prediction
        with tf.name_scope('W_fc1'):
            W_fc1 = weight_variable([LSTMCellSize, 1])
        with tf.name_scope('b_fc1'):
            b_fc1 = bias_variable([1])
        with tf.name_scope('pred'):
            pred = tf.matmul(outputs[:,-1,:], W_fc1) + b_fc1

        return pred

In [9]:
# The column names that will be included in the feature set are added to colList.
# colList will be used throughout the lab.
colList=[]                  
for thisColumn in df.columns: 
    if thisColumn not in ('id', 'timestamp', 'y', 'CntNs', 'y_lagged'): 
        colList.append(thisColumn)
colList.append('y_lagged')

#if you do not reset the default graph you will need to restart the kernel
#every time this notebook is run
tf.reset_default_graph()

# Network Parameters 
# Number of units in the LSTM cell.
n_LSTMCell = len(colList)

# Placeholder for the input and the keep probability for the dropout layers
with tf.name_scope('input'):
    x= tf.placeholder(tf.float32, shape=[None, n_time_steps, len(colList)])
with tf.name_scope('keep_prob'):
    keep_prob = tf.placeholder(tf.float32)

# Build the LSTM-based deep neural network (LSTM cell with dropout, followed by a fully connected layer)
print('Building tensorflow graph')

# Graph construction for the LSTM-based deep neural network.
# The structure of the network is depicted in the figure above.
# See the getDNN function defined earlier for the network code.
pred = getDNN (x, n_LSTMCell, keep_prob)


Building tensorflow graph

Training and Testing

We split the data into two pieces in time to create training and testing sets. In order to have enough samples for each id, the cut-off timestamp for the training set was set to 1650 in the "insampleCutoffTimestamp" variable. Figure 4 shows how an instrument is split in time to create the training and testing sets. While training the model, the training set for each instrument will be fed separately to learn the temporal patterns in the data.

                                    Figure 4: Training and Testing Dataset

In the Kaggle challenge, the metric used to evaluate prediction accuracy was the Pearson correlation. In statistics, the Pearson correlation is a measure of the linear correlation between two variables X and Y. It takes a value between +1 and −1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used in the sciences and was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s.
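For reference, the sample Pearson correlation between predictions $x_i$ and actual values $y_i$ (with means $\bar{x}$ and $\bar{y}$) is

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$

which is what the TensorFlow code below computes from the covariance and the two variances.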

Depending on the frequency of the financial data, the Pearson correlation (R) can be very small. In finance, given the low signal-to-noise ratio, even a small R can deliver meaningful value. Please note that the algorithm that won the challenge achieved an R of only 0.038.

The following cell includes the code for creating the training and testing sets and for calculating the Pearson correlation.


In [10]:
# Placeholder for the output (label)
with tf.name_scope('label'):
    y = tf.placeholder(tf.float32, shape=[None, 1]) 
# Placeholder to be able to split the data into training and test set while training the network.
inSampleCutoff = tf.placeholder(tf.int32, shape = ())

# this is important - we only want to train on the in-sample set of rows using TensorFlow
y_inSample = y[0:inSampleCutoff]
pred_inSample = pred[0:inSampleCutoff]

# also extract out of sample predictions and actual values,
# we'll use them for evaluation while training the model.
y_outOfSample = y[inSampleCutoff:]
pred_outOfSample = pred[inSampleCutoff:]

with tf.name_scope('stats'):
    # Pearson correlation to evaluate the model
    covariance = tf.reduce_sum(tf.matmul(tf.transpose(tf.subtract(pred_inSample, tf.reduce_mean(pred_inSample))),tf.subtract(y_inSample, tf.reduce_mean(y_inSample))))
    var_pred = tf.reduce_sum(tf.square(tf.subtract(pred_inSample, tf.reduce_mean(pred_inSample))))
    var_y = tf.reduce_sum(tf.square(tf.subtract(y_inSample, tf.reduce_mean(y_inSample))))
    pearson_corr = covariance / tf.sqrt(var_pred * var_y) 

tf.summary.scalar("pearson_corr", pearson_corr)


Out[10]:
<tf.Tensor 'pearson_corr:0' shape=() dtype=string>

Many traditional machine learning and deep learning methods assume that the features and the predicted value follow a zero-mean, unit-variance Gaussian distribution. Empirical studies show that financial data such as asset returns is often not compatible with this assumption. That is why we normalize the "y" variable, per instrument, by subtracting its in-sample mean and dividing by its in-sample standard deviation (the statistics are computed in the following cell and applied during training). As an exercise, you can also normalize the features and see whether that improves the accuracy.
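Concretely, for an instrument id with in-sample mean $\mu_{id}$ and standard deviation $\sigma_{id}$ of $y$, the transform applied during training is

$$ \tilde{y}_{id,t} = \frac{y_{id,t} - \mu_{id}}{\sigma_{id}} $$

and the predictions are mapped back with the inverse transform before the out-of-sample results are recorded.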


In [11]:
# The in-sample (training) portion of the data is also selected here, this time in pandas.
# The split defined in the cell above (via inSampleCutoff) is applied by TensorFlow during training;
# the pandas split below is only used to compute per-id statistics from in-sample data.
dfInSample = df[df.timestamp <  insampleCutoffTimestamp]
# create a reference dataframe (that only depends on in-sample data)
# that gives us standard deviation and mean information on per-id basis
# we'll use it later for variance stabilization
meanStdById = dfInSample.groupby(['id']).agg( {'y':['mean', 'std']})

We are ready to launch the graph to train the model and see intermediate diagnostics and the final result. The important hyperparameters, including the number of epochs, the mini-batch limit, and the learning rate, are defined at the top of the cell. Initially, the number of epochs is set to 1 because it takes 15-20 minutes to complete training with 10 epochs even though we are using GPUs. To speed up training in the lab environment, we provide networks pre-trained for 10 and 15 epochs. An adaptive learning rate starting from 0.002 with exponential decay is used when training from scratch. The learning rate should be set to 0.00058 and 0.00061 when using the pre-trained models with 10 and 15 epochs, respectively.


In [12]:
# Training parameters
display_step = 100 
epoch = 1
pre_trained_model = 'SavedModels/model_epoch_10.ckpt'
mini_batch_limit = 1300

# set up adaptive learning rate:
globalStep = tf.placeholder(tf.float32)
# Ratio of globalStep / totalDecaySteps is designed to indicate how far we've progressed in training.
# the ratio is 0 at the beginning of training and is 1 at the end.
# adaptiveLearningRate will thus change from the starting learningRate to learningRate * decay_rate
# in order to simplify the code, we are fixing the total number of decay steps at 1 and pass globalStep
# as a fraction that starts with 0 and tends to 1.
# Learning rate should be set to 0.002 if you are training from scratch.
# Learning rate should be set to 0.00058 if you are using the pre-trained network with 10 epochs.
# Learning rate should be set to 0.00061 if you are using the pre-trained network with 15 epochs.
adaptiveLearningRate = tf.train.exponential_decay(
  0.00058,       # Start with this learning rate
  globalStep,  # globalStep / totalDecaySteps shows how far we've progressed in training
  1,           # totalDecaySteps
  0.3)         # decay_rate, the factor by which the starting learning rate will be 
               # multiplied when the training is finished
    
# Define loss and optimizer
# Note the loss only involves in-sample rows
# Regularization is added in the loss function to avoid over-fitting
rnn_variables = lstm_variables = [v for v in tf.trainable_variables()
                    if v.name.startswith('rnn')]

with tf.name_scope('loss'):
    loss = tf.nn.l2_loss(tf.subtract(y_inSample,pred_inSample)) + tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(scale=0.0001), tf.trainable_variables())

tf.summary.scalar("loss", loss)
optimizer = tf.train.AdamOptimizer(learning_rate=adaptiveLearningRate).minimize (loss) 

# Getting unique ids to train the network per id basis.
ids = df.id.unique()
ids.sort()

summary_op = tf.summary.merge_all()

# initialize the variables 
init = tf.global_variables_initializer()

totalActual = []
totalPredicted = []
import random
# Launch the graph 
# Implement cross-validation, but in a way that preserves the temporal structure for each id
with tf.Session() as sess:  
    # Global variables are initialized
    sess.run(init) 
    
    # Restore latest checkpoint
    model_saver = tf.train.Saver()
    model_saver.restore(sess, pre_trained_model)
    
    writer = tf.summary.FileWriter("logs", graph=tf.get_default_graph())
    step = 50  
    writer_step = 1;
    for i in range(epoch):
        print('Epoch: ', i, '******************************')        
        actual = []
        predicted = []
        
        random.shuffle(ids)

        for thisId in ids:
            # Getting the data of the current id
            this_df = df[df.id == thisId].copy()
            this_df = this_df.sort_values(['id', 'timestamp'])
                        
            # we need to pass the size of the in-sample (training) portion to the graph
            # optimization will only consider rows in the training set
            inSampleSize, _ = this_df[this_df.timestamp < insampleCutoffTimestamp].shape
            totalRows, _ = this_df.shape
            
            batch_y = this_df.loc[:,'y'].values            
            batch_x = this_df[colList].values
                    
            if totalRows < n_time_steps:
                continue

            # Data is formatted as a 3D tensor with shape (batch_size, n_time_steps, n_features) for the LSTM
            # The n_time_steps parameter determines how many steps the LSTM unrolls in time
            complete_x = np.zeros([totalRows-n_time_steps+1, n_time_steps, len(colList)])
            for n in range(n_time_steps):
                complete_x[:,n,:]=batch_x[n:totalRows-n_time_steps+n+1,:]
            
            batch_y = batch_y[n_time_steps-1:]
            inSampleSize -= n_time_steps - 1

            # variance stabilizing transform
            # some ids will not have enough in-sample rows; we cannot perform the transform on those
            # furthermore, since there are no in-sample rows to train on, we must skip them
            if inSampleSize < 10:
                continue
                
            # perform variance stabilization
            thisMean = meanStdById.loc[thisId][0]
            thisStd = meanStdById.loc[thisId][1]
            batch_y = (batch_y - thisMean) / thisStd
            
            batch_y = batch_y.reshape(-1,1)
            minibatchSize, _ = batch_y.shape

            # we want to make sure that RNN reaches steady state
            if minibatchSize < mini_batch_limit: 
                continue 
            
            # Run optimization 
            # note: keep_prob is set to 0.5 for training only!
            _, currentRate = sess.run([optimizer, adaptiveLearningRate], feed_dict={x: complete_x, y: batch_y, keep_prob:0.5, inSampleCutoff:inSampleSize, globalStep:i/epoch})

            # Obtain out of sample target variable and our prediction
            y_oos, pred_oos = sess.run([y_outOfSample, pred_outOfSample], feed_dict={x: complete_x, y: batch_y, keep_prob:1.0, inSampleCutoff:inSampleSize}) 
            
            # flatten the returned lists
            y_oos = [y for x in y_oos for y in x]
            pred_oos = [y for x in pred_oos for y in x]
            
            #reverse transform before recording the results
            if inSampleSize:            
                y_oos = [ (t*thisStd + thisMean) for t in y_oos]
                pred_oos = [ (t*thisStd + thisMean) for t in pred_oos]
            
            # record the results
            actual.extend(y_oos)
            predicted.extend(pred_oos)
                       
            totalActual.extend(y_oos)
            totalPredicted.extend(pred_oos)
            
            # Once every display_step show some diagnostics - the loss function, in-sample correlation, etc.
            if step % display_step == 0: 
                # Calculate batch accuracy 
                # Calculate batch loss 
                correl, lossResult, summary = sess.run([pearson_corr, loss, summary_op], feed_dict={x: complete_x, y: batch_y, keep_prob:1.0, inSampleCutoff:inSampleSize})
                
                writer.add_summary(summary, writer_step)
                writer_step += 1
                # corrcoef sometimes fails to compute correlation for a perfectly valid reason (e.g. stdev(pred_oos) is 0)
                # it sets the result to nan, but also gives an annoying warning
                # the following suppresses the warning
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    correl_oos = np.corrcoef(y_oos, pred_oos)[0,1]
                    
                print('LR: %s - Iter %s, minibatch loss = %s, minibatch corr = %s, oos %s (%s/%s)' % (currentRate, step, lossResult, correl, correl_oos, inSampleSize, totalRows))
                
            step += 1 
       
        print('Optimization Finished!') 
        print('Correl: ', np.corrcoef(actual, predicted)[0,1])


INFO:tensorflow:Restoring parameters from SavedModels/model_epoch_10.ckpt
Epoch:  0 ******************************
LR: 0.00058 - Iter 100, minibatch loss = 637.9, minibatch corr = 0.0112935, oos -0.0740918212272 (1270/1442)
LR: 0.00058 - Iter 200, minibatch loss = 813.608, minibatch corr = 0.0776712, oos 0.102692945818 (1641/1813)
LR: 0.00058 - Iter 300, minibatch loss = 783.563, minibatch corr = 0.0297509, oos 0.00751064257798 (1573/1745)
LR: 0.00058 - Iter 400, minibatch loss = 818.312, minibatch corr = 0.0306565, oos -0.00329758258569 (1641/1813)
LR: 0.00058 - Iter 500, minibatch loss = 816.253, minibatch corr = 0.0956083, oos 0.165470373543 (1641/1813)
LR: 0.00058 - Iter 600, minibatch loss = 814.606, minibatch corr = 0.0403289, oos -0.0952056588811 (1641/1813)
LR: 0.00058 - Iter 700, minibatch loss = 815.304, minibatch corr = 0.0473748, oos -0.0250306466663 (1641/1813)
LR: 0.00058 - Iter 800, minibatch loss = 730.31, minibatch corr = 0.0254695, oos nan (1485/1494)
Optimization Finished!
Correl:  0.0436318385103

In [15]:
! pwd


/notebooks

It takes 3-5 minutes to complete the training with 1 epoch. We also provide TensorBoard to review the model architecture and the loss and correlation variables. TensorBoard is a suite of web applications for inspecting and understanding your TensorFlow runs and graphs.

Click here to start TensorBoard.
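If the link above is not available in your environment, TensorBoard can typically be started from a terminal with tensorboard --logdir=logs, pointing at the same logs directory used by the summary writer above; the exact port and URL depend on your setup.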

You should get a correlation value around R = 0.04. Note that the correlation tends to increase with each epoch (but not always).

Let's use the pre-trained model with 15 epochs by setting the pre_trained_model variable to pre_trained_model = 'SavedModels/model_epoch_15.ckpt' in the cell above, changing the starting learning rate to 0.00061, and re-running everything using Kernel->Restart & Run All.

What is the correlation that you get this time? Was it improved?

Since training takes a significant amount of time, we recommend training the model with 20 epochs and checking the correlation in your own environment after this lab. You should get a correlation value around R = 0.05.

Optional Exercise

Please read ahead and only come back to these optional exercises if time permits.

Train from scratch [20-30 mins]

First, change the # of epochs to 20 in the above cell. Second, put the starting learning rate back to 0.002. Third, comment out the two lines where the pre-trained model is loaded (under "Restore latest checkpoint"). Then re-run everything using Kernel->Restart & Run All.

How can a portfolio manager assess the predicted signal?

We could scatter-plot actual returns against predicted returns; however, correlation is not visually apparent on a scatter plot when it is below 20-30%. The correlation we achieve with this signal is much weaker, which is typical of modern financial markets. The correlation levels we often observe in other applications of predictive models are all but impossible in financial markets, which are highly efficient (simply put, hard to predict). If someone did have a signal with a correlation of 30%, then with leverage that person would soon get extremely rich, and the observed inefficiency would disappear from the market.

To visually assess the signal, we instead split the out-of-sample data points into buckets based on the value of the predicted return. We then compute the per-bucket mean actual return and plot the mean actual returns (Y axis) against the mean predicted returns (X axis), one point per bucket. By taking the mean, we average out the variance within each bucket and uncover the predictive value of the signal.


In [16]:
actual = totalActual
predicted = totalPredicted

actualMeanReturn = []
predictedMeanReturn = []
stdActualReturns = []
# Buckets are created
buckets = np.arange(-0.02,0.02,0.002)

actual = np.array(actual)
predicted = np.array(predicted)

# Predicted values and the actual values are placed into buckets
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    for i in range(len(buckets)-1):
        index = np.logical_and(predicted>buckets[i], predicted<buckets[i+1])
        thisBucket = actual[index].mean()
        actualMeanReturn.append(thisBucket)
        predictedMeanReturn.append(predicted[index].mean())
        stdActualReturns.append(actual[index].std())

# Actual versus predicted values are plotted
plt.figure()
plt.plot(predictedMeanReturn,actualMeanReturn, marker='*')
plt.xlabel('predicted')
plt.ylabel('actual')
plt.grid(True)
plt.show()

plt.figure()
plt.errorbar(predictedMeanReturn, actualMeanReturn, yerr = stdActualReturns, marker='*')
plt.xlabel('predicted')
plt.ylabel('actual')
plt.grid(True)
plt.show()


How much variance is there?

Plot 2 answers this question by adding error bars to the previous plot. The length of each error bar equals the standard deviation of the actual returns within the respective bucket. Plots such as these would typically be used by a portfolio manager to assess the behavior of prospective signals and the signal levels at which action should be taken. The simplest trading system using this signal would buy the security when the predicted return is above some threshold (say, above 0.5%) and sell (or short-sell) the security when the signal is below a negative threshold (e.g. below -0.5%).
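A minimal sketch of such a threshold rule is shown below; the 0.5% threshold is an example value, not part of the lab, and totalPredicted comes from the training loop above.

import numpy as np

threshold = 0.005                      # 0.5% predicted return
pred_array = np.array(totalPredicted)  # out-of-sample predictions collected during training

# +1 = go long, -1 = go short (or sell), 0 = take no action
signal = np.where(pred_array > threshold, 1, np.where(pred_array < -threshold, -1, 0))
print('long:', (signal == 1).sum(), 'short:', (signal == -1).sum(), 'flat:', (signal == 0).sum())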

4. Next Steps

We recommend trying the following steps after the lab.

  1. Try other machine learning techniques such as random forests, ridge regression, or XGBoost and compare their correlation with that of the LSTM-based predictor (a sketch of a ridge baseline is shown after this list).

  2. Try using an autoencoder to extract a smaller set of features than the original dataset provides, use those features as input to the deep learning model, and analyze the performance.
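As a starting point for step 1, the sketch below fits a ridge-regression baseline on the same split; it assumes scikit-learn is available in your environment (it is not otherwise used in this lab), and df, colList, and insampleCutoffTimestamp come from the cells above.

from sklearn.linear_model import Ridge
import numpy as np

train_df = df[df.timestamp < insampleCutoffTimestamp]
test_df = df[df.timestamp >= insampleCutoffTimestamp]

ridge = Ridge(alpha=1.0)
ridge.fit(train_df[colList].values, train_df['y'].values)
ridge_pred = ridge.predict(test_df[colList].values)

print('Ridge out-of-sample correlation:', np.corrcoef(test_df['y'].values, ridge_pred)[0, 1])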

5. Summary

In this lab, a step-by-step implementation of an LSTM-based deep neural network for predicting financial time series data was presented. The performance of the model was evaluated with the Pearson correlation, and competitive performance was achieved. The code provided in this lab can be reused in more complex trading strategies.

6. Post-Lab

Finally, don't forget to save your work from this lab before time runs out and the instance shuts down!!

  1. You can download the data from this link.

  2. To use the data, please set the "usePreparedData" variable to False before running the code in your environment.

  3. Also, remove the line "model_saver.restore(sess, pre_trained_model)" to train the model from scratch on your data.

  4. You can execute the following cell block to archive the files you've been working on and download the archive from the link below.


In [34]:
! tar -cvf output3.zip --exclude="2sigma" --exclude="__*" --exclude="*.zip" *


SavedModels/
SavedModels/model_epoch_15.ckpt.index
SavedModels/model_epoch_15.ckpt.data-00000-of-00001
SavedModels/model_epoch_10.ckpt.data-00000-of-00001
SavedModels/model_epoch_10.ckpt.meta
SavedModels/checkpoint
SavedModels/model_epoch_10.ckpt.index
SavedModels/model_epoch_15.ckpt.meta
data.jpg
data_split.jpg
dnn.jpg
logs/
logs/events.out.tfevents.1512491595.7fbe5bb9497e
main.ipynb
prepareData.py
rnn.jpg

In [21]:
!tar -cvf output.zip main.ipynb prepareData.py dnn.jpg data.jpg data_split.jpg rnn.jpg


main.ipynb
prepareData.py
dnn.jpg
data.jpg
data_split.jpg
rnn.jpg

In [22]:
! ls


2sigma	     __pycache__  data_split.jpg  logs	      output.zip      rnn.jpg
SavedModels  data.jpg	  dnn.jpg	  main.ipynb  prepareData.py

In [ ]: