## Machine Learning Engineer Nanodegree

Calvin Ku September 27, 2016

## Definition

### Project Overview

The problem with trading is that you never know the best time to buy or sell a stock, because you never know whether its price will go up or down in the future. This simple trading bot is an attempt to address this problem.

Given the historical data of a stock, our chimp will tell you whether you should buy or sell or hold a particular stock today (in our case, the JPM).

#### Data used in this project

The only data used in this project is the JPM historical data collected from Yahoo Finance. The data ranges from December 30, 1983 to September 27, 2016. We don't use S&P 500 ETF as ETFs are generally arbitrageable which can render the techniques we will use in this project (namely, VPA) useless.

### Problem Statement

This project is about building a trading robot, which we call the Chimp. The Chimp is built to give the common user suggestions on whether to buy, sell, or hold a particular stock on a particular trading day. The goal of this project is to build a trading robot that can beat a random monkey bot. We are inspired by the famous 1973 saying of Princeton University professor Burton Malkiel that "a blindfolded monkey throwing darts at a newspaper's financial pages could select a portfolio that would do just as well as one carefully selected by experts", and by the Forbes article Any Monkey Can Beat the Market. Instead of competing on a portfolio basis, however, we set our battlefield on JPM.

We will use JPM as an example in this project but the same method can be applied to any stock. In the end we will evaluate our method by giving the monkey bot (which chooses the three actions equally on a random basis) and our Chimp 1000 dollars and see how they perform from September 26, 2011 to September 27, 2016 on JPM.

### Metrics

In this project we use total assets, i.e. the cash in hand plus the portfolio value (the number of shares in hand times the market price), as the metric. We also define the reward function as the relative change in total assets between the current state and the next one, i.e.: $$R(s_i) = \frac{Cash(s_{i + 1}) + PV(s_{i + 1}) - Cash(s_i) - PV(s_i)}{Cash(s_i) + PV(s_i)}$$

This simple metric is in line with what we want the trading bot to achieve in that our ultimate goal is to make as much profit as possible given what we have put into the market, and it doesn't matter whether it's in cash or in shares.
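As a minimal sketch of the metric and reward defined above, the snippet below computes total assets and the relative change between two consecutive states. The function names and the numbers are hypothetical and chosen only for illustration; they are not taken from the project code.

```python
def total_assets(cash, shares, price):
    """Cash in hand plus portfolio value (shares times market price)."""
    return cash + shares * price


def reward(prev_assets, next_assets):
    """Relative change in total assets between two consecutive states."""
    return (next_assets - prev_assets) / prev_assets


# Hypothetical example: 500 dollars in cash and 10 shares held while the
# price moves from 50 to 52 dollars between two trading days.
before = total_assets(cash=500.0, shares=10, price=50.0)
after = total_assets(cash=500.0, shares=10, price=52.0)
r = reward(before, after)
print(before, after, r)
```

Because the reward is a ratio, it does not matter whether the gain sits in cash or in shares, which matches the intent stated above.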

## Analysis

### Data Exploration

#### First look

Let's first take a glance at some statistics of our data and then see if there are any missing values.



In :

from __future__ import division

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time
import random
from collections import defaultdict
from sklearn.ensemble import RandomForestRegressor
from copy import copy, deepcopy
from numba import jit

pd.set_option('display.max_columns', 50)

dfSPY = pd.read_csv('allSPY.csv', index_col='Date', parse_dates=True, na_values = ['nan'])
dfJPM = pd.read_csv('JPM.csv', index_col='Date', parse_dates=True, na_values = ['nan'])

# del dfSPY.index.name
del dfJPM.index.name
# display(dfSPY)

start_date = '1983-12-30'
end_date = '2016-09-27'

dates = pd.date_range(start_date, end_date)

dfMain = pd.DataFrame(index=dates)
# dfMain = dfMain.join(dfSPY)
dfMain = dfMain.join(dfJPM)
dfMain.dropna(inplace=True)

print("Start date: {}".format(dfMain.index[0]))
print("End date: {}\n".format(dfMain.index[-1]))

print(dfMain.describe())




Start date: 1983-12-30 00:00:00
End date: 2016-09-27 00:00:00

Open         High          Low        Close        Volume  \
count  8256.000000  8256.000000  8256.000000  8256.000000  8.256000e+03
mean     46.472522    47.047441    45.882564    46.475389  1.290023e+07
std      20.463781    20.687878    20.250667    20.470789  1.835588e+07
min      10.250010    10.875000     9.624990    10.125000  3.780000e+04
25%      35.124990    35.527499    34.625010    35.060001  1.882275e+06
50%      41.000010    41.500000    40.500000    40.965000  7.037000e+06
75%      52.612501    53.092500    52.000000    52.542501  1.529572e+07
max     147.000000   149.124985   144.000000   147.000000  2.172942e+08

         Adj Close
count  8256.000000
mean     22.914243
std      17.191570
min       1.514014
25%       5.380651
50%      23.981890
75%      33.764691
max      68.076434




In :

print("\nInspect missing values:")
display(dfMain.isnull().sum())




Inspect missing values:

Open         0
High         0
Low          0
Close        0
Volume       0
dtype: int64



Since we won't be using data prior to 1993 for training, we can use SPY (S&P 500 ETF) to get trading days and see if we have any data missing for JPM.



In :

spy_dates = pd.date_range('1993-01-29', end_date)
dfSPY = dfSPY.ix[spy_dates, :]
dfSPY.dropna(inplace=True)
print("Number of days where JPM are traded: {}".format(len(dfMain.ix[dfSPY.index, :])))




Number of days where JPM are traded: 5960



It seems to be good. Let's look at the first few lines:



In :




| | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 1983-12-30 | 44.000008 | 44.500006 | 43.500014 | 44.000008 | 211500.0 | 2.602623 |
| 1984-01-03 | 43.937506 | 44.249986 | 43.624979 | 44.000008 | 385500.0 | 2.602623 |
| 1984-01-04 | 44.843758 | 45.874979 | 44.249986 | 45.874979 | 292500.0 | 2.713529 |
| 1984-01-05 | 46.812508 | 47.375008 | 46.250008 | 47.375008 | 344100.0 | 2.802256 |
| 1984-01-06 | 46.875014 | 47.375008 | 46.375021 | 46.875014 | 194400.0 | 2.772681 |



We can see that we have six columns: Open, High, Low, Close, Volume, Adj Close. The Adj Close is the closing price of that day adjusted for "future" dividends payout and splits. For our usage, we will need to adjust the rest of columns as well.

### Exploratory Visualization

Now let's look at the performance of JPM itself:



In :




Out:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8c13357410>



Starting from the beginning, the stock price generally has an upward trend, with a bad stretch from 2001 to 2003 and the crash at the end of 2008.

Now we can take a look at the correlations between the variables:



In :

g = sns.PairGrid(dfMain)
g.map_upper(plt.scatter, alpha=0.3)
g.map_lower(plt.scatter, alpha=0.3)
g.map_diag(sns.kdeplot, lw=3, legend=False)

g.fig.suptitle('Feature Pair Grid')




Out:



We can clearly see several distinct lines in the Adj Close rows and columns. This is because the Adj Close variable is adjusted for splits and dividend payouts. From this we know we'll need to adjust the other variables to match it.

### Algorithms and Techniques

#### Algorithms overview

The trading bot (from now on we refer to it as the Chimp, coined with our wishful expectation that she will be smarter than the Monkey Trader) consists of two parts. In the first part we implement Q-learning and run it through the historical data for some number of iterations to construct the Q-table. The Chimp can then go with the optimal policy by following the action of which the state-action pair has the maximum Q value. However, since the state space is vast and the coverage of the Q-table is very small, in the second part we use supervised learning to train on the Q-table and make predictions for the unseen states.

#### Reinforcement learning

##### Q-learning

The core idea of reinforcement learning is simple.

1. The Chimp senses its environment (data of some sort)
2. The Chimp takes an action
3. The Chimp gets a reward for that action she took
4. The Chimp "remembers" the association between the state-action pair and the reward, so that the next time she is in the same situation, she carries out the action she thinks best, given that
5. The Chimp has a really good memory, so she doesn't just remember the immediate reward for each state-action pair; she also remembers all the rewards that are prior to and lead to the current state-action pair, so that she can maximize the total reward she gets

One way to do this is to use a method called Q-learning. At the core of Q-learning is the Bellman equation.

In each iteration we use the Bellman equation to update the cell of our Q-table: $$Q(s, a) \longleftarrow (1 - \alpha) \ Q(s, a) + \alpha \ (R(s) + \gamma \ \max_{a'} Q(s', a'))$$

where $(s, a)$ is the state-action pair, $s'$ the next state, $\alpha$ the learning rate, $R(s)$ the reward function, and $\gamma$ the discount factor. The Chimp then follows the policy: $$\pi(s) = \operatorname{argmax}_{a} Q(s, a)$$

Although we don't have the Q value of any state-action pair to begin with, the reward function contains the "real" information, and throughout the iterations that information will slowly propagate to each state-action pair. At some point the Q values will converge to a practical level (hopefully), and we end up with a Q-table in pseudo-equilibrium.
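The update rule and the greedy policy can be sketched in a few lines. This is only an illustration under stated assumptions: the Q-table is a `defaultdict` keyed by `(state, action)`, the state labels `'s0'` and `'s1'` are placeholders, and the values of `alpha` and `gamma` are arbitrary rather than the ones used in this project.

```python
from collections import defaultdict

ACTIONS = ('buy', 'sell', 'hold')


def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Bellman update to the cell Q(state, action)."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    old_value = q_table[(state, action)]
    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * best_next)
    q_table[(state, action)] = new_value
    return new_value


def greedy_policy(q_table, state):
    """Return the action with the largest Q value in the given state."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


# Unseen cells default to 0.0, so the first update only propagates the reward.
q = defaultdict(float)
q_update(q, 's0', 'buy', reward=0.02, next_state='s1')
print(q[('s0', 'buy')], greedy_policy(q, 's0'))
```

Repeating such updates over the historical data is what lets the reward information spread through the table.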

##### Exploration-exploitation dilemma

However, there's a catch. How does the Chimp know what to do before she has learned anything?

One important concept of reinforcement learning is the exploration-exploitation dilemma. Essentially it means when we take an action, we have to choose between whether to explore new possibilities, or just to follow what we've known to be best, to the best of our knowledge. And of course, if we don't know much, then following that limited knowledge of ours wouldn't make much sense. On the other hand, if we've already known pretty much everything, then there's just not much to explore, and wandering around mindlessly wouldn't make sense either.

To implement this concept, we want our Chimp to set out without any bias (not that she's got much anyway), so we introduce the variable $\epsilon$, which represents the probability of the Chimp taking a random action. Initially we set $\epsilon = 1$, and gradually decrease its value as the Chimp gets to know more and more about her environment. As time goes on, the Chimp becomes wiser and wiser, spending most of her time following what's best for her and less time being an explorer.
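A minimal sketch of this $\epsilon$-greedy behaviour follows. The multiplicative decay schedule and its rate are assumptions made for illustration; the text above only says that $\epsilon$ starts at 1 and decreases gradually.

```python
import random

ACTIONS = ('buy', 'sell', 'hold')


def choose_action(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best known action.

    q_values maps each action to its Q value in the current state;
    actions that are missing count as 0.0.
    """
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_values.get(a, 0.0))


def decay(epsilon, rate=0.99, floor=0.0):
    """Shrink epsilon by a constant factor (hypothetical schedule)."""
    return max(floor, epsilon * rate)


epsilon = 1.0
for _ in range(3):
    action = choose_action({'buy': 0.3, 'hold': 0.1}, epsilon)
    epsilon = decay(epsilon)
print(action, epsilon)
```

With `epsilon = 1` every action is random; as it approaches 0 the Chimp almost always follows the Q-table.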

#### Supervised learning with random forest

For the supervised learning part, we will use the random forest. The random forest doesn't expect linear features, nor does it assume that features interact with each other linearly. It has the advantage of a decision tree, which can generally fit data of any shape, while being an ensemble mitigates a single decision tree's tendency to overfit. The random and ensemble nature of the algorithm makes it less likely to overfit the training data. Since we are combining supervised learning with reinforcement learning, the problem gets more complicated and we will have more parameters to tune, so it's good to have an algorithm that works "out of the box" most of the time.

Random forest also handles high dimensionality well, which makes feature engineering a bit easier: when results are poor, we don't need to worry as much about whether the newly engineered features are unrepresentative or the algorithm simply cannot handle the enlarged dimensionality. In addition, random forest itself is easy to tune. We can grid search over the number of candidate features for each split and the number of trees. The ensemble nature of the algorithm also makes it scalable when we need to: 1. train on more data, 2. build more trees, 3. take more stocks into account. Overall, random forest generally gives good results, and ensemble algorithms like random forest have become quite popular in the Kaggle community over the years, where they are widely regarded as outperforming many traditional algorithms.

##### How it works

The random forest is an ensemble method, which "ensembles" a bunch of decision trees. Each decision tree is generated by creating "nodes" with features in our dataset.

##### Decision tree

In the training stage, a data point comes down and through the nodes of the decision tree. Each node classifies the data point and sends it to the next node. Say, for example we are classifying people to determine whether their annual income is above or below average, and one feature of our data is gender. And we will probably have values like male/female/other. Now say this data point is a female, then it will get sent down the path accordingly to the next node. The node at the bottom of the decision tree is sometimes referred to as a leaf. Our data point will end up in one of the leaves, and the label of the data point is going to mark that leaf. For example, if our data point is above income average, then we mark that leaf as "above count +1". At the end of the training, all leaves will be marked with labels above or below.

In the predicting stage, we run data down the decision tree and see which leaves they end up with. And then we assign the data the labels of the leaves accordingly.

##### The ensembleness

We now know how each decision tree is constructed and have a forest of decision trees. The last step is to get these decision trees to vote. If 10 trees say this person is above and 5 say below, we predict the person as above.
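The voting step can be illustrated with the 10-versus-5 example above. The helper names are hypothetical. The averaging variant is included on the assumption that a regression forest (such as the `RandomForestRegressor` imported earlier) combines its trees by averaging their outputs rather than by counting votes.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def average_vote(predictions):
    """Regression counterpart: average the numeric outputs of the trees."""
    return sum(predictions) / len(predictions)


tree_votes = ['above'] * 10 + ['below'] * 5
print(majority_vote(tree_votes))
print(average_vote([1.0, 2.0, 3.0]))
```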

##### Randomness of random forest

As said earlier, the decision trees are constructed with the features of our dataset. However, not every tree sees the same data or the same features. This is where the random part of the algorithm comes in. Most implementations employ a method called bagging (bootstrap aggregating), which generates $m$ sub-datasets by sampling data points from the original dataset with replacement, where each sub-dataset has size $n'$ relative to the size of the original dataset, $n$. In addition, typically only a random subset of the features is considered when building each tree (in most implementations, at each split). Each bag is then used to construct a decision tree model.
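The two sources of randomness can be sketched with the standard library alone. This follows a common formulation, assumed here rather than taken from any particular implementation: rows are drawn with replacement to form a bag, and a subset of feature names is drawn without replacement. The feature names below are only examples.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) data points with replacement (one 'bag')."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features to consider (sampling without replacement)."""
    return rng.sample(feature_names, k)


rng = random.Random(0)
rows = list(range(10))
features = ['-1d_Vol', '-2d_Vol', '10d_Avg_Vol', '21d_Avg_Vol']

bag = bootstrap_sample(rows, rng)
subset = random_feature_subset(features, 2, rng)
print(bag, subset)
```

Because each bag repeats some rows and omits others, the trees differ from one another, which is what makes their combined vote more stable than any single tree.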

##### Other parts of random forest

We won't cover everything about the random forest here. However it's worth noting some of the more important specifics of random forest that are not covered here:

• Binning of the continuous variables—which are done slightly differently from implementation to implementation
• Splitting methods—when constructing the decision trees we need to decide which feature to be placed on top of the tree and which to be put at the bottom.
• Voting methods—we can decide to give each decision tree with the same voting power, or not.
• Modification of the algorithm for regression problems (recursive partitioning)

### Benchmark

We shall set three different benchmarks here. One theoretical, and two practical.

#### Theoretical benchmark

Since our goal is to make as much money as possible, the best role model we can have is a God Chimp. A God Chimp is a Chimp that can foresee the future and trades accordingly. In our case, this is not very hard to do. We can instantiate a Chimp object and get her to iterate through the entire dataset until the whole Q-table converges. With that Q-table in hand, we can get the pseudo-perfect action series, which can make us a good deal of money. We then compute the accuracy of the action series of our mortal Chimp against that of the God Chimp. Theoretically speaking, the higher the accuracy, the closer the behavior of the mortal Chimp is to that of the God Chimp, and the more money the mortal Chimp should make.

#### Practical benchmarks

That said, the ups and downs of the stock market are not uniformly distributed. This means our mortal Chimp could have a very decent accuracy against the God Chimp but, say, get most of the important decisions wrong, and therefore not do that well. Conversely, our mortal Chimp may appear to do a terrible job of mimicking the God Chimp and still make a lot of money. So we will need some practical benchmarks that are more down to earth.

We shall test our Chimp against 100,000 random Monkeys and the Patient Trader (PT) we have defined earlier. Since these two naive traders are not influenced by the media or manipulated by market makers, they are often said to perform better than the average investor. We are happy as long as our Chimp can perform better than the Monkey, which means our Chimp is at least better than chance (and therefore, by that argument, better than the average person). It would also be great if she can beat the PT. However, beating the PT generally means beating the market, which isn't easy to do, so we don't expect that much here.
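A random Monkey can be simulated in a few lines. The sketch below rests on assumptions that this section does not specify: each "buy" or "sell" trades exactly one share at the day's price, and orders that cannot be filled are skipped. The function names and the price series are hypothetical.

```python
import random


def run_monkey(prices, cash=1000.0, seed=None):
    """Trade one share at a time at random and return the final total assets."""
    rng = random.Random(seed)
    shares = 0
    for price in prices:
        action = rng.choice(('buy', 'sell', 'hold'))
        if action == 'buy' and cash >= price:
            cash -= price
            shares += 1
        elif action == 'sell' and shares > 0:
            cash += price
            shares -= 1
    # Total assets: cash in hand plus shares valued at the last price
    return cash + shares * prices[-1]


def average_monkey(prices, n_monkeys, cash=1000.0):
    """Average final assets over many independent Monkeys."""
    results = [run_monkey(prices, cash, seed=i) for i in range(n_monkeys)]
    return sum(results) / len(results)


flat_prices = [10.0] * 50  # with a constant price, no Monkey can gain or lose
print(run_monkey(flat_prices, seed=0), average_monkey(flat_prices, 20))
```

Averaging over many Monkeys smooths out the luck of any single run, which is why the benchmark uses a large population rather than one random trader.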

## Methodology

### Data Preprocessing



In :




| | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 1983-12-30 | 44.000008 | 44.500006 | 43.500014 | 44.000008 | 211500.0 | 2.602623 |
| 1984-01-03 | 43.937506 | 44.249986 | 43.624979 | 44.000008 | 385500.0 | 2.602623 |
| 1984-01-04 | 44.843758 | 45.874979 | 44.249986 | 45.874979 | 292500.0 | 2.713529 |
| 1984-01-05 | 46.812508 | 47.375008 | 46.250008 | 47.375008 | 344100.0 | 2.802256 |
| 1984-01-06 | 46.875014 | 47.375008 | 46.375021 | 46.875014 | 194400.0 | 2.772681 |



As said earlier, we need to adjust Open, High, Low, and Volume to match Adj Close. This can be done by computing an adjustment factor, dividing Adj Close by Close. We then multiply the prices by this factor, and divide the volume by this factor.
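As a worked example of this adjustment, the sketch below applies the factor to a single bar, using the values of the first row of the dataset (1983-12-30). The helper `adjust_bar` is hypothetical and written only for illustration.

```python
def adjust_bar(open_, high, low, close, volume, adj_close):
    """Scale prices by Adj Close / Close and scale volume by the inverse."""
    factor = adj_close / close
    return {
        'Open': open_ * factor,
        'High': high * factor,
        'Low': low * factor,
        'Volume': volume / factor,
        'Adj Factor': factor,
    }


# First row of the JPM data shown above
bar = adjust_bar(open_=44.000008, high=44.500006, low=43.500014,
                 close=44.000008, volume=211500.0, adj_close=2.602623)
print(bar)
```

Since Open equals Close on that day, the adjusted Open coincides with Adj Close, and the volume grows by the same ratio by which the prices shrink.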



In :

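# Adjustment factor: ratio of the adjusted close to the raw close
dfMain['Adj Factor'] = dfMain['Adj Close'] / dfMain['Close']

# Adjust Open, High, Low, Volume

dfMain['Open'] = dfMain['Open'] * dfMain['Adj Factor']
dfMain['High'] = dfMain['High'] * dfMain['Adj Factor']
dfMain['Low'] = dfMain['Low'] * dfMain['Adj Factor']
dfMain['Volume'] = dfMain['Volume'] / dfMain['Adj Factor']

# The factor is only needed for the adjustment itself
dfMain.drop('Adj Factor', axis=1, inplace=True)

display(dfMain.head())
display(dfMain.tail())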




| | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 1983-12-30 | 2.602623 | 2.632198 | 2.573048 | 44.000008 | 3.575624e+06 | 2.602623 |
| 1984-01-03 | 2.598926 | 2.617409 | 2.580440 | 44.000008 | 6.517272e+06 | 2.602623 |
| 1984-01-04 | 2.652532 | 2.713529 | 2.617410 | 45.874979 | 4.945011e+06 | 2.713529 |
| 1984-01-05 | 2.768984 | 2.802256 | 2.735712 | 47.375008 | 5.817363e+06 | 2.802256 |
| 1984-01-06 | 2.772681 | 2.802256 | 2.743106 | 46.875014 | 3.286531e+06 | 2.772681 |

| | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2016-09-21 | 66.839996 | 67.129997 | 66.309998 | 66.839996 | 14116800.0 | 66.839996 |
| 2016-09-22 | 66.989998 | 67.419998 | 66.839996 | 67.389999 | 12781700.0 | 67.389999 |
| 2016-09-23 | 67.389999 | 67.900002 | 67.180000 | 67.250000 | 13967400.0 | 67.250000 |
| 2016-09-26 | 66.599998 | 66.800003 | 65.540001 | 65.779999 | 16408100.0 | 65.779999 |
| 2016-09-27 | 65.410004 | 66.410004 | 65.110001 | 66.360001 | 13580600.0 | 66.360001 |



#### Feature engineering using volume price analysis

Volume price analysis has been around for over 100 years, and there are many legendary traders who made themselves famous (and wealthy) using it. In addition to this, the basic principle behind it kind of makes sense on its own, that:

1. Price can only be moved by volume; large spread pushed by large volume and small spread by low volume
2. If it's not the case, then there's an anomaly, and you need to be cautious

But then people, especially practitioners, tend to think of it as an art rather than a science, in that even though you have some clues about what's going on in the market, you still don't know what the best timing is. It takes practice and more practice until you "get it".

For us data scientists, everything is science, including art. If a human can stare at the candlesticks and tell you when to buy or sell, so can a computer. Thus the following features are extracted from the raw dataset:

For volume:

• -1d Volume
• -2d Volume
• -3d Volume
• -4d Volume
• -5d Volume
• 10d Average Volume
• 21d Average Volume
• 63d Average Volume

For price:

For wick:

• -1d upperwick/lowerwick
• -2d upperwick/lowerwick
• -3d upperwick/lowerwick
• -4d upperwick/lowerwick
• -5d upperwick/lowerwick
• 10d upperwick/lowerwick
• 21d upperwick/lowerwick
• 63d upperwick/lowerwick

where -nd denotes the value n days in the past.

#### More details on feature engineering

The reason why we choose 5, 10, 21, 63 days is because these are the common time ranges used in technical analysis, where 5 days correspond to one trading week, 10 to two, and 21 days correspond to one trading month, 63 to three. We don't want to explode our feature space so to start with we use the most recent 5-day data with longer term average data.

Spread and wicks are terms to describe the status of the candlestick chart (see below).

The spread describes the thick body part of the candlestick, which shows the difference between the opening price and the closing price. The time frame (in terms of opening/closing) can range from minutes to months depending on what we want to look at (in our case, the time frame is one day). The wicks are the thin lines that extend from the top and the bottom of the body, which show whether the stock traded at prices beyond the opening/closing prices during the day (or the specific time frame of interest). As shown in the picture, we can have white or black bodies on the candlestick chart to indicate the direction of the price movement, with white meaning $\text{closing price} > \text{opening price}$ and black the opposite. A candle can have an upper wick, a lower wick, both, or none at all.
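The parts of a single candle can be computed directly from its open, high, low, and close. The sketch below is illustrative: the function `candle_parts` is hypothetical, and the wick lengths use the usual definition (distance from the body to the high or low), which is an assumption rather than something taken from the project code.

```python
def candle_parts(open_, high, low, close):
    """Split one candlestick into spread, wicks, and body colour."""
    body_top = max(open_, close)
    body_bottom = min(open_, close)
    if close > open_:
        colour = 'white'
    elif close < open_:
        colour = 'black'
    else:
        colour = 'flat'  # open equals close: no body at all
    return {
        'spread': close - open_,           # signed body size
        'upper_wick': high - body_top,     # 0 when there is no upper wick
        'lower_wick': body_bottom - low,   # 0 when there is no lower wick
        'colour': colour,
    }


candle = candle_parts(open_=10.0, high=12.0, low=9.0, close=11.0)
print(candle)
```

A positive spread corresponds to a white body and a negative one to a black body, so the sign of the spread carries the direction of the day's move.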

Note that to implement Q-learning we need to make the variables discrete. We use 100 day maximum and 100 day average to divide the above features and get relative levels of those features.

We set the trading price of each trading day to be the Adjusted Close: $$Trade Price = Adj\ Close$$ This information is not available to the Chimp. The properties of the Chimp get updated with this information when she places an order. The portfolio value also gets updated using this price.



In :

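# Price Engineering
# Get opens
period_list = [1, 2, 3, 4, 5, 10, 21, 63, 100]
for x in period_list:
    dfMain['-' + str(x) + 'd_Open'] = dfMain['Open'].shift(x)

# Get adjusted closes (loop body reconstructed from the '-Nd_adjClose' columns used in later cells)
period_list = xrange(1, 5 + 1)
for x in period_list:
    dfMain['-' + str(x) + 'd_adjClose'] = dfMain['Adj Close'].shift(x)

# Get highs
period_list1 = xrange(1, 5 + 1)
for x in period_list1:
    dfMain['-' + str(x) + 'd_High'] = dfMain['High'].shift(x)

period_list2 = [10, 21, 63, 100]
for x in period_list2:
    # Rolling maximum of the highs, excluding the current day
    dfMain[str(x) + 'd_High'] = dfMain['High'].shift().rolling(window=x).max()

# Get lows
period_list1 = xrange(1, 5 + 1)
for x in period_list1:
    dfMain['-' + str(x) + 'd_Low'] = dfMain['Low'].shift(x)

period_list2 = [10, 21, 63, 100]
for x in period_list2:
    # Rolling minimum of the lows, excluding the current day
    # (the original line read from 'High' here, which the printed output reflects)
    dfMain[str(x) + 'd_Low'] = dfMain['Low'].shift().rolling(window=x).min()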




In :

# Get Volume Bases
dfMain['100d_Avg_Vol'] = dfMain['Volume'].shift().rolling(window=100).mean() * 1.5
dfMain['100d_Max_Vol'] = dfMain['Volume'].shift().rolling(window=100).max()

display(dfMain.tail())




Open
High
Low
Close
Volume
-1d_Open
-2d_Open
-3d_Open
-4d_Open
-5d_Open
-10d_Open
-21d_Open
-63d_Open
-100d_Open
-1d_High
-2d_High
-3d_High
-4d_High
-5d_High
10d_High
21d_High
63d_High
100d_High
-1d_Low
-2d_Low
-3d_Low
-4d_Low
-5d_Low
10d_Low
21d_Low
63d_Low
100d_Low
100d_Avg_Vol
100d_Max_Vol

2016-09-21
66.839996
67.129997
66.309998
66.839996
14116800.0
66.839996
66.750000
66.150002
66.089996
66.290001
66.269997
67.160004
65.750000
62.463746
62.602664
66.459999
66.190002
65.820000
66.639999
66.400002
66.849998
66.639999
66.260002
66.930000
67.250000
67.680000
67.769997
67.769997
67.769997
66.239998
65.849998
65.440002
66.089996
66.209999
66.260002
65.879997
58.296187
58.296187
2.162303e+07
4.445207e+07
0.555336
9.473810

2016-09-22
66.989998
67.419998
66.839996
67.389999
12781700.0
67.389999
66.839996
66.750000
66.150002
66.089996
66.290001
67.220001
66.070000
63.207953
63.198027
66.839996
66.459999
66.190002
65.820000
66.639999
67.129997
66.849998
66.639999
66.260002
66.930000
67.680000
67.769997
67.769997
67.769997
66.309998
66.239998
65.849998
65.440002
66.089996
66.260002
66.139999
58.296187
58.296187
2.158722e+07
4.445207e+07
0.553699
9.473810

2016-09-23
67.389999
67.900002
67.180000
67.250000
13967400.0
67.250000
66.989998
66.839996
66.750000
66.150002
66.089996
67.029999
65.989998
60.012824
62.414134
67.389999
66.839996
66.459999
66.190002
65.820000
67.419998
67.129997
66.849998
66.639999
66.260002
67.430000
67.769997
67.769997
67.769997
66.839996
66.309998
66.239998
65.849998
65.440002
66.260002
66.139999
58.296187
58.296187
2.162404e+07
4.445207e+07
0.558211
9.473810

2016-09-26
66.599998
66.800003
65.540001
65.779999
16408100.0
65.779999
67.389999
66.989998
66.839996
66.750000
66.150002
66.139999
65.910004
58.256495
61.273014
67.250000
67.389999
66.839996
66.459999
66.190002
67.900002
67.419998
67.129997
66.849998
66.639999
67.900002
67.900002
67.900002
67.900002
67.180000
66.839996
66.309998
66.239998
65.849998
66.260002
66.139999
58.296187
58.296187
2.154451e+07
4.445207e+07
0.555250
9.603815

2016-09-27
65.410004
66.410004
65.110001
66.360001
13580600.0
66.360001
66.599998
67.389999
66.989998
66.839996
66.750000
66.110001
66.330002
58.732788
61.124170
65.779999
67.250000
67.389999
66.839996
66.459999
66.800003
67.900002
67.419998
67.129997
66.849998
67.900002
67.900002
67.900002
67.900002
65.540001
67.180000
66.839996
66.309998
66.239998
66.260002
66.260002
59.090007
58.296187
2.153319e+07
4.445207e+07
0.564871
9.603815

Open         29.013163
Name: 2011-12-30 00:00:00, dtype: float64




In :

@jit
def relative_transform(num):
    # Map a ratio to a discrete level in -5..-1 or 1..5 (quarter-wide bins)
    if 0 <= num < 0.25:
        return 1
    elif 0.25 <= num < 0.5:
        return 2
    elif 0.5 <= num < 0.75:
        return 3
    elif 0.75 <= num < 1:
        return 4
    elif 1 <= num:
        return 5
    elif -0.25 <= num < 0:
        return -1
    elif -0.5 <= num < -0.25:
        return -2
    elif -0.75 <= num < -0.5:
        return -3
    elif -1 <= num < -0.75:
        return -4
    elif num < -1:
        return -5
    else:
        # NaN fails every comparison above; pass it through unchanged
        return num

# Volume Engineering
# Get volumes
period_list = xrange(1, 5 + 1)
for x in period_list:
    dfMain['-' + str(x) + 'd_Vol'] = dfMain['Volume'].shift(x)

# Get avg. volumes
period_list = [10, 21, 63]
for x in period_list:
    dfMain[str(x) + 'd_Avg_Vol'] = dfMain['Volume'].shift().rolling(window=x).mean()

# Get relative volumes 1
period_list = range(1, 5 + 1)
for x in period_list:
    dfMain['-' + str(x) + 'd_Vol1'] = dfMain['-' + str(x) + 'd_Vol'] / dfMain['100d_Avg_Vol']
    dfMain['-' + str(x) + 'd_Vol1'] = dfMain['-' + str(x) + 'd_Vol1'].apply(relative_transform)

# Get relative avg. volumes 1
period_list = [10, 21, 63]
for x in period_list:
    dfMain[str(x) + 'd_Avg_Vol1'] = dfMain[str(x) + 'd_Avg_Vol'] / dfMain['100d_Avg_Vol']
    dfMain[str(x) + 'd_Avg_Vol1'] = dfMain[str(x) + 'd_Avg_Vol1'].apply(relative_transform)

# Get relative volumes 2
period_list = xrange(1, 5 + 1)
for x in period_list:
    dfMain['-' + str(x) + 'd_Vol2'] = dfMain['-' + str(x) + 'd_Vol'] / dfMain['100d_Max_Vol']
    dfMain['-' + str(x) + 'd_Vol2'] = dfMain['-' + str(x) + 'd_Vol2'].apply(relative_transform)




In :

period_list1 = xrange(1, 5 + 1)
period_list2 = [10, 21, 63]

for x in period_list1:
    dfMain['-' + str(x) + 'd_Spread'] = dfMain['-' + str(x) + 'd_adjClose'] - dfMain['-' + str(x) + 'd_Open']

for x in period_list2:
    pass  # loop body missing in the source

period_list1 = xrange(1, 5 + 1)
period_list2 = [10, 21, 63]

for x in period_list1:
    pass  # loop body missing in the source

for x in period_list2:
    pass  # loop body missing in the source



2016-09-21
-3.0
1.0
-2.0
3.0
1.0
1.0

2016-09-22
1.0
-3.0
1.0
-2.0
3.0
1.0

2016-09-23
3.0
1.0
-3.0
1.0
-2.0
1.0

2016-09-26
-2.0
3.0
1.0
-3.0
1.0
1.0

2016-09-27
-5.0
-1.0
3.0
1.0
-3.0
-1.0




In :

# Get wicks
@jit
def upperwick(open, adj_close, high):
    # True when the high lies above both the open and the adjusted close
    if high > open and high > adj_close:
        return True
    else:
        return False

@jit
def lowerwick(open, adj_close, low):
    # True when the low lies below both the open and the adjusted close
    if low < open and low < adj_close:
        return True
    else:
        return False

start_time = time.time()

period_list1 = xrange(1, 5 + 1)
period_list2 = [10, 21, 63, 100]
for x in period_list1:
    dfMain.loc[:, '-' + str(x) + 'd_upperwick_bool'] = dfMain.apply(lambda row: upperwick(row['-' + str(x) + 'd_Open'], row['-' + str(x) + 'd_adjClose'], row['-' + str(x) + 'd_High']), axis=1)
    dfMain.loc[:, '-' + str(x) + 'd_lowerwick_bool'] = dfMain.apply(lambda row: lowerwick(row['-' + str(x) + 'd_Open'], row['-' + str(x) + 'd_adjClose'], row['-' + str(x) + 'd_Low']), axis=1)

for x in period_list2:
    dfMain.loc[:, str(x) + 'd_upperwick_bool'] = dfMain.apply(lambda row: upperwick(row['-' + str(x) + 'd_Open'], row['-1d_adjClose'], row[str(x) + 'd_High']), axis=1)
    dfMain.loc[:, str(x) + 'd_lowerwick_bool'] = dfMain.apply(lambda row: lowerwick(row['-' + str(x) + 'd_Open'], row['-1d_adjClose'], row[str(x) + 'd_Low']), axis=1)

print("Getting wicks took {} seconds.".format(time.time() - start_time))




Getting wicks took 8.72208499908 seconds.




In :

@jit
@jit

start_time = time.time()

# Transform upper wicks
period_list1 = xrange(1, 5 + 1)
period_list2 = [10, 21, 63]

for x in period_list1:
    has_upperwicks = dfMain['-' + str(x) + 'd_upperwick_bool']
    has_lowerwicks = dfMain['-' + str(x) + 'd_lowerwick_bool']

    dfMain.loc[has_upperwicks, '-' + str(x) + 'd_upperwick'] = dfMain.loc[has_upperwicks, :].apply(lambda row: get_upperwick_length(row['-' + str(x) + 'd_Open'], row['-' + str(x) + 'd_adjClose'], row['-' + str(x) + 'd_High']), axis=1)
    dfMain.loc[has_lowerwicks, '-' + str(x) + 'd_lowerwick'] = dfMain.loc[has_lowerwicks, :].apply(lambda row: get_lowerwick_length(row['-' + str(x) + 'd_Open'], row['-' + str(x) + 'd_adjClose'], row['-' + str(x) + 'd_Low']), axis=1)

    # Get relative upperwick length
    dfMain.loc[dfMain['-' + str(x) + 'd_upperwick_bool'], '-' + str(x) + 'd_upperwick'] = dfMain.loc[dfMain['-' + str(x) + 'd_upperwick_bool'], '-' + str(x) + 'd_upperwick'] / dfMain.loc[dfMain['-' + str(x) + 'd_upperwick_bool'], '100d_Avg_Spread']
    # Get relative lowerwick length
    dfMain.loc[dfMain['-' + str(x) + 'd_lowerwick_bool'], '-' + str(x) + 'd_lowerwick'] = dfMain.loc[dfMain['-' + str(x) + 'd_lowerwick_bool'], '-' + str(x) + 'd_lowerwick'] / dfMain.loc[dfMain['-' + str(x) + 'd_lowerwick_bool'], '100d_Avg_Spread']

    # Transform upperwick ratio to int
    dfMain.loc[dfMain['-' + str(x) + 'd_upperwick_bool'], '-' + str(x) + 'd_upperwick'] = dfMain.loc[dfMain['-' + str(x) + 'd_upperwick_bool'], '-' + str(x) + 'd_upperwick'].apply(relative_transform)
    # Transform lowerwick ratio to int
    dfMain.loc[dfMain['-' + str(x) + 'd_lowerwick_bool'], '-' + str(x) + 'd_lowerwick'] = dfMain.loc[dfMain['-' + str(x) + 'd_lowerwick_bool'], '-' + str(x) + 'd_lowerwick'].apply(relative_transform)

    # Assign 0 to no-upperwick days
    dfMain.loc[np.logical_not(dfMain['-' + str(x) + 'd_upperwick_bool']), '-' + str(x) + 'd_upperwick'] = 0
    # Assign 0 to no-lowerwick days
    dfMain.loc[np.logical_not(dfMain['-' + str(x) + 'd_lowerwick_bool']), '-' + str(x) + 'd_lowerwick'] = 0

for x in period_list2:
    has_upperwicks = dfMain[str(x) + 'd_upperwick_bool']
    has_lowerwicks = dfMain[str(x) + 'd_lowerwick_bool']

    dfMain.loc[has_upperwicks, str(x) + 'd_upperwick'] = dfMain.loc[has_upperwicks, :].apply(lambda row: get_upperwick_length(row['-' + str(x) + 'd_Open'], row['-1d_adjClose'], row[str(x) + 'd_High']), axis=1)
    dfMain.loc[has_lowerwicks, str(x) + 'd_lowerwick'] = dfMain.loc[has_lowerwicks, :].apply(lambda row: get_lowerwick_length(row['-' + str(x) + 'd_Open'], row['-1d_adjClose'], row[str(x) + 'd_Low']), axis=1)

    # Get relative upperwick length
    dfMain.loc[dfMain[str(x) + 'd_upperwick_bool'], str(x) + 'd_upperwick'] = dfMain.loc[dfMain[str(x) + 'd_upperwick_bool'], str(x) + 'd_upperwick'] / dfMain.loc[dfMain[str(x) + 'd_upperwick_bool'], '100d_Avg_Spread']
    # Get relative lowerwick length
    dfMain.loc[dfMain[str(x) + 'd_lowerwick_bool'], str(x) + 'd_lowerwick'] = dfMain.loc[dfMain[str(x) + 'd_lowerwick_bool'], str(x) + 'd_lowerwick'] / dfMain.loc[dfMain[str(x) + 'd_lowerwick_bool'], '100d_Avg_Spread']

    # Transform upperwick ratio to int
    dfMain.loc[dfMain[str(x) + 'd_upperwick_bool'], str(x) + 'd_upperwick'] = dfMain.loc[dfMain[str(x) + 'd_upperwick_bool'], str(x) + 'd_upperwick'].apply(relative_transform)
    # Transform lowerwick ratio to int
    dfMain.loc[dfMain[str(x) + 'd_lowerwick_bool'], str(x) + 'd_lowerwick'] = dfMain.loc[dfMain[str(x) + 'd_lowerwick_bool'], str(x) + 'd_lowerwick'].apply(relative_transform)

    # Assign 0 to no-upperwick days
    dfMain.loc[np.logical_not(dfMain[str(x) + 'd_upperwick_bool']), str(x) + 'd_upperwick'] = 0
    # Assign 0 to no-lowerwick days
    dfMain.loc[np.logical_not(dfMain[str(x) + 'd_lowerwick_bool']), str(x) + 'd_lowerwick'] = 0

print("Transforming wicks took {} seconds.".format(time.time() - start_time))




Transforming wicks took 7.03627705574 seconds.




In :

display(dfMain[['-1d_lowerwick', '-2d_lowerwick', '-3d_lowerwick', '-4d_lowerwick', '-5d_lowerwick', '10d_lowerwick', '21d_lowerwick', '63d_lowerwick']].isnull().sum())




-1d_lowerwick    65
-2d_lowerwick    65
-3d_lowerwick    65
-4d_lowerwick    64
-5d_lowerwick    63
10d_lowerwick    30
21d_lowerwick    41
63d_lowerwick    27
dtype: int64




In :

display(dfMain.tail())




    Open  High  Low  Close  Volume  -1d_Open  -2d_Open  -3d_Open  -4d_Open  -5d_Open  -10d_Open  -21d_Open  -63d_Open  -100d_Open  -1d_High  -2d_High  -3d_High  -4d_High  -5d_High  ...  -5d_lowerwick_bool  10d_upperwick_bool  10d_lowerwick_bool  21d_upperwick_bool  21d_lowerwick_bool  63d_upperwick_bool  63d_lowerwick_bool  100d_upperwick_bool  100d_lowerwick_bool  -1d_upperwick  -1d_lowerwick  -2d_upperwick  -2d_lowerwick  -3d_upperwick  -3d_lowerwick  -4d_upperwick  -4d_lowerwick  -5d_upperwick  -5d_lowerwick  10d_upperwick  10d_lowerwick  21d_upperwick  21d_lowerwick  63d_upperwick  63d_lowerwick

    1983-12-30  2.602623  2.632198  2.573048  44.000008  3.575624e+06  2.602623  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...  False  False  False  False  False  False  False  False  False  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
    1984-01-03  2.598926  2.617409  2.580440  44.000008  6.517272e+06  2.602623  2.602623  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  2.602623  NaN  NaN  NaN  NaN  2.632198  NaN  NaN  NaN  NaN  ...  False  False  False  False  False  False  False  False  False  NaN  NaN  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
    1984-01-04  2.652532  2.713529  2.617410  45.874979  4.945011e+06  2.713529  2.598926  2.602623  NaN  NaN  NaN  NaN  NaN  NaN  NaN  2.602623  2.602623  NaN  NaN  NaN  2.617409  2.632198  NaN  NaN  NaN  ...  False  False  False  False  False  False  False  False  False  NaN  NaN  NaN  NaN  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
    1984-01-05  2.768984  2.802256  2.735712  47.375008  5.817363e+06  2.802256  2.652532  2.598926  2.602623  NaN  NaN  NaN  NaN  NaN  NaN  2.713529  2.602623  2.602623  NaN  NaN  2.713529  2.617409  2.632198  NaN  NaN  ...  False  False  False  False  False  False  False  False  False  0.0  NaN  NaN  NaN  NaN  NaN  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
    1984-01-06  2.772681  2.802256  2.743106  46.875014  3.286531e+06  2.772681  2.768984  2.652532  2.598926  2.602623  NaN  NaN  NaN  NaN  NaN  2.802256  2.713529  2.602623  2.602623  NaN  2.802256  2.713529  2.617409  2.632198  NaN  ...  False  False  False  False  False  False  False  False  False  0.0  NaN  0.0  NaN  NaN  NaN  NaN  NaN  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

    5 rows × 105 columns

    2016-09-21  66.839996  67.129997  66.309998  66.839996  14116800.0  66.839996  66.750000  66.150002  66.089996  66.290001  66.269997  67.160004  65.750000  62.463746  62.602664  66.459999  66.190002  65.820000  66.639999  66.400002  66.849998  66.639999  66.260002  66.930000  67.250000  ...  True  True  True  True  False  True  True  True  True  1.0  2.0  4.0  3.0  2.0  3.0  3.0  2.0  5.0  1.0  4.0  2.0  5.0  0.0  5.0  5.0
    2016-09-22  66.989998  67.419998  66.839996  67.389999  12781700.0  67.389999  66.839996  66.750000  66.150002  66.089996  66.290001  67.220001  66.070000  63.207953  63.198027  66.839996  66.459999  66.190002  65.820000  66.639999  67.129997  66.849998  66.639999  66.260002  66.930000  ...  True  True  True  True  False  True  True  True  True  3.0  4.0  1.0  2.0  4.0  3.0  2.0  3.0  3.0  2.0  4.0  5.0  5.0  0.0  5.0  5.0
    2016-09-23  67.389999  67.900002  67.180000  67.250000  13967400.0  67.250000  66.989998  66.839996  66.750000  66.150002  66.089996  67.029999  65.989998  60.012824  62.414134  67.389999  66.839996  66.459999  66.190002  65.820000  67.419998  67.129997  66.849998  66.639999  66.260002  ...  True  True  True  True  False  True  True  True  True  1.0  2.0  3.0  4.0  1.0  2.0  4.0  3.0  2.0  3.0  1.0  5.0  3.0  0.0  3.0  5.0
    2016-09-26  66.599998  66.800003  65.540001  65.779999  16408100.0  65.779999  67.389999  66.989998  66.839996  66.750000  66.150002  66.139999  65.910004  58.256495  61.273014  67.250000  67.389999  66.839996  66.459999  66.190002  67.900002  67.419998  67.129997  66.849998  66.639999  ...  True  True  False  True  False  True  False  True  True  4.0  1.0  1.0  2.0  3.0  4.0  1.0  2.0  4.0  3.0  5.0  0.0  5.0  0.0  5.0  0.0
    2016-09-27  65.410004  66.410004  65.110001  66.360001  13580600.0  66.360001  66.599998  67.389999  66.989998  66.839996  66.750000  66.110001  66.330002  58.732788  61.124170  65.779999  67.250000  67.389999  66.839996  66.459999  66.800003  67.900002  67.419998  67.129997  66.849998  ...  True  True  False  True  False  True  False  True  True  2.0  2.0  4.0  1.0  1.0  2.0  3.0  4.0  1.0  2.0  5.0  0.0  5.0  0.0  5.0  0.0

    5 rows × 105 columns




In :




1983-12-30     2.602623  2.602623   2.602623
1984-01-03     2.602623  2.598926   2.602623
1984-01-04     2.713529  2.652532   2.713529
1984-01-05     2.802256  2.768984   2.802256
1984-01-06     2.772681  2.772681   2.772681




In :

display(dfMain.columns)




Index([u'Open', u'High', u'Low', u'Close', u'Volume', u'Adj Close',
u'-1d_Open', u'-2d_Open', u'-3d_Open', u'-4d_Open',
...
u'-4d_lowerwick', u'-5d_upperwick', u'-5d_lowerwick', u'10d_upperwick',
u'10d_lowerwick', u'21d_upperwick', u'21d_lowerwick', u'63d_upperwick',
dtype='object', length=106)




In :

dfMain.drop(['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', \
'-1d_Vol', '-2d_Vol', '-3d_Vol', '-4d_Vol', '-5d_Vol', '10d_Avg_Vol', '21d_Avg_Vol', '63d_Avg_Vol', \
'-1d_Open', '-2d_Open', '-3d_Open', '-4d_Open', '-5d_Open', '-10d_Open', '-21d_Open', '-63d_Open', '-100d_Open',  \
'-1d_High', '-2d_High', '-3d_High', '-4d_High', '-5d_High', '10d_High', '21d_High', '63d_High', '100d_High',  \
'-1d_Low', '-2d_Low', '-3d_Low', '-4d_Low', '-5d_Low', '10d_Low', '21d_Low', '63d_Low', '100d_Low',  \
'-1d_upperwick_bool', '-2d_upperwick_bool', '-3d_upperwick_bool', '-4d_upperwick_bool', '-5d_upperwick_bool', '10d_upperwick_bool', '21d_upperwick_bool', '63d_upperwick_bool', '100d_upperwick_bool', \
'-1d_lowerwick_bool', '-2d_lowerwick_bool', '-3d_lowerwick_bool', '-4d_lowerwick_bool', '-5d_lowerwick_bool', '10d_lowerwick_bool', '21d_lowerwick_bool', '63d_lowerwick_bool', '100d_lowerwick_bool'], \
axis=1, inplace=True)




In :

display(dfMain.columns)
dfMain.dropna(inplace=True)
len(dfMain.columns)




Index([u'-1d_Vol1', u'-2d_Vol1', u'-3d_Vol1', u'-4d_Vol1', u'-5d_Vol1',
u'10d_Avg_Vol1', u'21d_Avg_Vol1', u'63d_Avg_Vol1', u'-1d_Vol2',
u'-1d_lowerwick', u'-2d_upperwick', u'-2d_lowerwick', u'-3d_upperwick',
u'-3d_lowerwick', u'-4d_upperwick', u'-4d_lowerwick', u'-5d_upperwick',
u'-5d_lowerwick', u'10d_upperwick', u'10d_lowerwick', u'21d_upperwick',
dtype='object')

Out:

38




In :

data_full = copy(dfMain)




In :

display(data_full.tail())




    -1d_Vol1  -2d_Vol1  -3d_Vol1  -4d_Vol1  -5d_Vol1  10d_Avg_Vol1  21d_Avg_Vol1  63d_Avg_Vol1  -1d_Vol2  -2d_Vol2  -3d_Vol2  -4d_Vol2  -5d_Vol2  -1d_upperwick  -1d_lowerwick  -2d_upperwick  -2d_lowerwick  -3d_upperwick  -3d_lowerwick  -4d_upperwick  -4d_lowerwick  -5d_upperwick  -5d_lowerwick  10d_upperwick  10d_lowerwick  21d_upperwick  21d_lowerwick  63d_upperwick  63d_lowerwick

    1984-05-23  1.0  5.0  1.0  2.0  1.0  2.0  2.0  3.0  1.0  2.0  1.0  1.0  1.0  -5.0  -5.0  4.0  -4.0  -2.0  -4.0  -2.0  -3.0  3.0  5.0  0.0  0.0  2.0  0.0  2.0  2.0  0.0  3.0  0.0  0.0  5.0  0.0  5.0  0.0  2.474736
    1984-05-24  2.0  1.0  5.0  1.0  2.0  2.0  2.0  3.0  1.0  1.0  2.0  1.0  1.0  -5.0  -5.0  -5.0  4.0  -4.0  -4.0  -3.0  -4.0  4.0  0.0  3.0  5.0  0.0  0.0  2.0  0.0  2.0  2.0  4.0  0.0  5.0  0.0  5.0  0.0  2.497335
    1984-05-25  5.0  2.0  1.0  5.0  1.0  5.0  3.0  3.0  5.0  1.0  1.0  1.0  1.0  -3.0  -5.0  -5.0  -5.0  4.0  -3.0  -3.0  -4.0  5.0  2.0  4.0  0.0  3.0  5.0  0.0  0.0  2.0  0.0  4.0  0.0  5.0  0.0  5.0  0.0  2.486037
    1984-05-29  2.0  5.0  2.0  1.0  5.0  5.0  3.0  3.0  1.0  5.0  1.0  1.0  1.0  -2.0  -3.0  -5.0  -5.0  -5.0  -2.0  -3.0  -3.0  4.0  4.0  5.0  2.0  4.0  0.0  3.0  5.0  0.0  0.0  5.0  0.0  5.0  0.0  5.0  0.0  2.395634
    1984-05-30  2.0  2.0  5.0  2.0  1.0  5.0  3.0  3.0  1.0  1.0  5.0  1.0  1.0  -5.0  -2.0  -3.0  -5.0  -5.0  -2.0  -4.0  -4.0  2.0  5.0  4.0  4.0  5.0  2.0  4.0  0.0  3.0  5.0  5.0  0.0  5.0  0.0  5.0  0.0  2.271333

    2016-09-21  2.0  3.0  5.0  3.0  3.0  3.0  3.0  3.0  1.0  2.0  3.0  2.0  2.0  -3.0  1.0  -2.0  3.0  1.0  -1.0  1.0  2.0  1.0  2.0  4.0  3.0  2.0  3.0  3.0  2.0  5.0  1.0  4.0  2.0  5.0  0.0  5.0  5.0  66.839996
    2016-09-22  3.0  2.0  3.0  5.0  3.0  3.0  3.0  3.0  2.0  1.0  2.0  3.0  2.0  1.0  -3.0  1.0  -2.0  3.0  -1.0  1.0  2.0  3.0  4.0  1.0  2.0  4.0  3.0  2.0  3.0  3.0  2.0  4.0  5.0  5.0  0.0  5.0  5.0  67.389999
    2016-09-23  3.0  3.0  2.0  3.0  5.0  3.0  3.0  3.0  2.0  2.0  1.0  2.0  3.0  3.0  1.0  -3.0  1.0  -2.0  1.0  1.0  4.0  1.0  2.0  3.0  4.0  1.0  2.0  4.0  3.0  2.0  3.0  1.0  5.0  3.0  0.0  3.0  5.0  67.250000
    2016-09-26  3.0  3.0  3.0  2.0  3.0  3.0  3.0  3.0  2.0  2.0  2.0  1.0  2.0  -2.0  3.0  1.0  -3.0  1.0  1.0  1.0  4.0  4.0  1.0  1.0  2.0  3.0  4.0  1.0  2.0  4.0  3.0  5.0  0.0  5.0  0.0  5.0  0.0  65.779999
    2016-09-27  4.0  3.0  3.0  3.0  2.0  3.0  3.0  3.0  2.0  2.0  2.0  2.0  1.0  -5.0  -1.0  3.0  1.0  -3.0  -1.0  -1.0  3.0  2.0  2.0  4.0  1.0  1.0  2.0  3.0  4.0  1.0  2.0  5.0  0.0  5.0  0.0  5.0  0.0  66.360001



## Implementation

The problem with time series data, in contrast to cross-sectional data, is that we cannot rely on conventional methods such as cross-validation or the usual 4:3:3 train-cv-test framework. All of these methods assume that our data are drawn from the same population, so that a carefully selected sample, properly modelled, can say a lot about the entire population (and, of course, about other carefully drawn samples). We are not so lucky with time series data, mostly because:

1. Most if not all of the time, our model isn't the underlying model; that is, the data doesn't really come from the model we use. So it works only for a limited range of time, and as the time series gets longer, the difference between the model fitted on the training set and the underlying model starts to show.

2. The system we are looking at is time-dependent, so the underlying model itself most likely changes over time (unless we're talking about some grand unified model of the whole universe). In that case, assuming that one model structure will work over the entire period of data is just wishful thinking.

That said, we often hope that in the course of our research we can find some "invariants" (or at least pseudo-invariants) in our data that don't change as time goes by. Still, we don't know whether they are out there.

### Training-testing framework

As said above, we will employ a training-testing framework that rolls forward in time. In our case, we keep 35 trading months of data for training (how this is determined will be shown later) and use the model to predict the next 7 days. Since we are probably more interested in the performance of JPM in recent years, we will proceed as follows:

1. Use data from September 25, 2006 to September 25, 2011 as our validation dataset, and
2. Use data from September 26, 2011 to September 27, 2016 as our test dataset
3. Use the 35 months of data prior to each prediction period as our training set.
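
The rolling scheme described above can be sketched as follows. This is a minimal illustration rather than the notebook's own implementation: the helper name `rolling_windows` and its parameters are hypothetical, and the window sizes are counted in rows (trading days) instead of calendar months.

```python
def rolling_windows(n_rows, train_size, test_size, start):
    """Yield (train, test) row ranges that roll forward through time.

    Hypothetical helper: all sizes are counted in rows (trading days).
    """
    pos = start
    while pos < n_rows:
        # Training rows are the ones immediately before the prediction period
        train = range(max(0, pos - train_size), pos)
        # Prediction rows; the last window may be shorter than test_size
        test = range(pos, min(pos + test_size, n_rows))
        yield train, test
        # Roll forward by one prediction period
        pos += test_size


# Toy example: 20 rows, train on the previous 10 rows, predict 7 rows at a time
windows = list(rolling_windows(n_rows=20, train_size=10, test_size=7, start=10))
for train, test in windows:
    print(train[0], train[-1], "->", test[0], test[-1])
```

Each training window ends exactly where its prediction window starts, so the model never sees data from the period it is asked to predict.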

### Parameters for Q learning

We start off setting our parameters as follows:

• Discount factor: $\gamma = 0.75$
• Learning rate: $\alpha = \frac{1}{t}$, where $t$ is the number of times a state-action pair has been updated
• Exploitation-exploration ratio: $\epsilon \leftarrow \epsilon - \frac{1}{\text{iter\_number}}$ after each iteration
• Number of iterations: $\text{iter\_number} = 5000$
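
To make these parameters concrete, here is a minimal sketch of a single tabular Q-learning update with a $\frac{1}{t}$ learning rate, plus a linear epsilon decay. The function names `q_update` and `decay_epsilon` are hypothetical, and the sketch assumes the update count includes the current update, so the very first update uses $\alpha = 1$.

```python
def q_update(q_old, t, reward, max_next_q, gamma=0.75):
    """One tabular Q-learning update with learning rate alpha = 1 / (t + 1).

    t is how many times this state-action pair has been updated so far.
    """
    alpha = 1.0 / (t + 1)
    q_new = (1 - alpha) * q_old + alpha * (reward + gamma * max_next_q)
    return q_new, t + 1


def decay_epsilon(epsilon, iter_number):
    """Linearly shrink the exploration probability, never going below zero."""
    return max(epsilon - 1.0 / iter_number, 0.0)


# First update: alpha = 1, so the old value is fully replaced
q, t = q_update(q_old=0.0, t=0, reward=0.02, max_next_q=0.04)
# Second update: alpha = 1/2, so old and new estimates are averaged
q, t = q_update(q_old=q, t=t, reward=0.0, max_next_q=0.0)
epsilon = decay_epsilon(1.0, iter_number=5000)
```

With this schedule, each Q value is the running average of all the targets seen for that state-action pair, which is one common way to let the estimates settle as the number of visits grows.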

### Other assumptions

1. Trading price: as mentioned earlier, we assume the trading price to be the same as the Adjusted Close.

2. No transaction cost: this simplifies the problem so that we can focus on modelling.

3. Two actions: we limit ourselves to only two actions, buy and sell. Since there's no transaction cost, buying when there's no cash is equivalent to holding (and similarly, selling when there are no shares). Limiting the size of the action space makes it easier for our Q values to converge.
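
The third assumption can be checked with a small sketch. The stand-alone functions `buy` and `sell` below are hypothetical simplifications (whole shares only, no transaction cost) and are not the bot classes used later.

```python
def buy(cash, shares, price):
    """Buy as many whole shares as the cash allows, at no transaction cost."""
    n = int(cash // price)
    return cash - n * price, shares + n


def sell(cash, shares, price):
    """Sell every share held, at no transaction cost."""
    return cash + shares * price, 0


# First buy: spends almost all the cash on whole shares
cash, shares = buy(1000.0, 0, 36.61)
# Second buy: not enough cash for another share, so nothing changes
cash_again, shares_again = buy(cash, shares, 36.61)
print(shares, round(cash, 2), (cash_again, shares_again) == (cash, shares))
```

Because a repeated buy (or a sell with no shares) leaves the position unchanged, a separate hold action adds nothing under these assumptions.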

### Benchmarks for the two phases

As mentioned above, we will use a roll-forward training framework to build our Chimp. We will first give it a few tries and fine-tune our parameters on the validation dataset; we shall call this the validation phase. We will then move on to the test dataset, which will be referred to as the test phase.

We will set up benchmarks for the two phases for comparison. To recap, they include:

1. Performances of the Patient Trader
2. Performances of the Monkey
3. Action series of the God Chimp


In :

validation_start_date = datetime(2006, 9, 25)
validation_end_date = datetime(2011, 9, 27)
test_start_date = datetime(2011, 9, 26)
test_end_date = datetime(2016, 9, 27)

print("Validation phase")
# .loc slices the DatetimeIndex by label (the deprecated .ix did the same here)
validation_phase_data = data_full.loc[validation_start_date:validation_end_date, :]
print("Number of dates in validation dataset: {}\n".format(len(validation_phase_data)))

print("Test phase")
test_phase_data = data_full.loc[test_start_date:test_end_date, :]
print("Number of dates in test dataset: {}".format(len(test_phase_data)))




Validation phase
Number of dates in validation dataset: 1262

Test phase
Number of dates in test dataset: 1260



#### Performances of the Patient Trader

##### Validation phase (2006-9-25 ~ 2011-9-25)
###### Start

$Cash_{init} = 1000.00$

$Share_{init} = 0$

$PV_{init} = 0$

$Trading \ Price_{init} = 36.61$

$Share_{start} = floor(\frac{1000.00}{36.61}) = 27$

$PV_{start} = 36.61 \cdot 27 = 988.47$

$Cash_{start} = Cash_{init} - PV_{start} = 1000.00 - 988.47 = 11.53$

$Total \ Assets_{start} = Cash_{start} + PV_{start} = 1000.00$

###### End

$Cash_{end} = 11.53$

$Share_{end} = 27$

$Trading \ Price_{end} = 27.42$

$PV_{end} = 27.42 \cdot 27 = 740.34$

$Total \ Assets_{end} = Cash_{end} + PV_{end} = 11.53 + 740.34 = 751.87$

We can calculate the annual ROI by solving the following equation for $r_{val}$: $$(1 + r_{val})^{1260/252} = 0.7519$$

$$\Longrightarrow \boxed{r_{val} = -0.05543464 \approx -5.54\%}$$

##### Test phase (2011-9-26 ~ 2016-9-27)

Similarly, we will have $$\boxed{r_{test} = 0.1912884 \approx 19.13\%}$$
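
The annual ROI calculation above can be reproduced with a few lines of Python. This is a small sketch: the function name `annualized_roi` is hypothetical, and it assumes 252 trading days per year, as in the exponent $1260/252$.

```python
def annualized_roi(start_assets, end_assets, n_trading_days, days_per_year=252):
    """Solve (1 + r) ** (n_trading_days / days_per_year) = end / start for r."""
    years = n_trading_days / float(days_per_year)
    return (end_assets / float(start_assets)) ** (1.0 / years) - 1.0


# Patient Trader, validation phase: 1000.00 grows (shrinks) to 751.87 over 1260 days
r_val = annualized_roi(1000.00, 751.87, 1260)
print("r_val = {:.4f}".format(r_val))
```

The same function applies to any benchmark once its final total assets are known.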

#### Performances of the Monkey

We use a MonkeyBot class which places one and only one random order every day. We run it through the chosen time frame 100,000 times and get the following distributions:



In :

import random
import time
from copy import deepcopy

class MonkeyBot(object):
    def __init__(self, dfEnv, cash=1000, share=0, pv=0, now_yes_share=0, random_state=0):
        random.seed(random_state)
        self.cash = cash
        self.share = share
        self.pv = pv
        self.pv_history_list = []
        self.action_list = []
        self.env = deepcopy(dfEnv)

    def buy(self, stock_price):
        # Spend as much cash as possible on whole shares
        num_affordable = int(self.cash // stock_price)
        self.cash = self.cash - stock_price * num_affordable
        self.share = self.share + num_affordable
        self.pv = stock_price * self.share
        self.action_list.append('Buy')

    def sell(self, stock_price):
        # Sell all shares in hand
        self.cash = self.cash + stock_price * self.share
        self.pv = 0
        self.share = 0
        self.action_list.append('Sell')

    def hold(self, stock_price):
        self.pv = stock_price * self.share
        self.action_list.append('Hold')

    def reset(self):
        self.cash = 1000
        self.share = 0
        self.pv = 0

    def yes_share(self):
        # Represent chimp asset in state_action
        if self.share > 0:
            return 1
        else:
            return 0

    def make_decision(self, x):
        # Only 1 (buy) and 2 (sell) are drawn, so the hold branch is never taken
        random_choice = random.choice([1, 2])

        if random_choice == 0:
            self.hold(x)
        elif random_choice == 1:
            self.buy(x)
        else:
            self.sell(x)

        return self.pv # for frame-wise operation

    def simulate(self, iters):
        for i in range(iters):
            # One random order per trading day; 'Trade Price' is the assumed name of the price column
            self.env['Monkey PV'] = self.env['Trade Price'].apply(self.make_decision)
            self.pv_history_list.append(self.env['Monkey PV'].iloc[-1] + self.cash)
            self.reset()




In :

monkey_val = MonkeyBot(validation_phase_data, random_state=0)

start_time = time.time()
iters = 100000
monkey_val.simulate(iters)
print("{0} iterations took {1} seconds".format(iters, time.time() - start_time))




100000 iterations took 297.804918051 seconds




In :

plt.hist(monkey_val.pv_history_list, bins=50)




Out:

(array([  6.34800000e+03,   2.08350000e+04,   2.16270000e+04,
1.65640000e+04,   1.12580000e+04,   7.61500000e+03,
5.05600000e+03,   3.36200000e+03,   2.14800000e+03,
1.47100000e+03,   1.01100000e+03,   7.31000000e+02,
5.09000000e+02,   3.70000000e+02,   2.75000000e+02,
1.83000000e+02,   1.52000000e+02,   1.26000000e+02,
1.00000000e+02,   5.10000000e+01,   5.20000000e+01,
3.90000000e+01,   2.30000000e+01,   2.40000000e+01,
1.90000000e+01,   1.10000000e+01,   8.00000000e+00,
4.00000000e+00,   4.00000000e+00,   4.00000000e+00,
3.00000000e+00,   2.00000000e+00,   3.00000000e+00,
2.00000000e+00,   1.00000000e+00,   3.00000000e+00,
1.00000000e+00,   2.00000000e+00,   1.00000000e+00,
0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
0.00000000e+00,   1.00000000e+00]),
array([    85.447874  ,    342.45150846,    599.45514292,    856.45877738,
1113.46241184,   1370.4660463 ,   1627.46968076,   1884.47331522,
2141.47694968,   2398.48058414,   2655.4842186 ,   2912.48785306,
3169.49148752,   3426.49512198,   3683.49875644,   3940.5023909 ,
4197.50602536,   4454.50965982,   4711.51329428,   4968.51692874,
5225.5205632 ,   5482.52419766,   5739.52783212,   5996.53146658,
6253.53510104,   6510.5387355 ,   6767.54236996,   7024.54600442,
7281.54963888,   7538.55327334,   7795.5569078 ,   8052.56054226,
8309.56417672,   8566.56781118,   8823.57144564,   9080.5750801 ,
9337.57871456,   9594.58234902,   9851.58598348,  10108.58961794,
10365.5932524 ,  10622.59688686,  10879.60052132,  11136.60415578,
11393.60779024,  11650.6114247 ,  11907.61505916,  12164.61869362,
12421.62232808,  12678.62596254,  12935.629597  ]),
<a list of 50 Patch objects>)




In :

monkey_val_stats = pd.Series(monkey_val.pv_history_list)
print(monkey_val_stats.describe())




count    100000.000000
mean       1060.238834
std         728.393854
min          85.447874
25%         575.907935
50%         872.079011
75%        1327.014578
max       12935.629597
dtype: float64




In :

monkey_test = MonkeyBot(test_phase_data, random_state=0)

start_time = time.time()
iters = 100000
monkey_test.simulate(iters)
print("{0} iterations took {1} seconds".format(iters, time.time() - start_time))




100000 iterations took 276.959551096 seconds




In :

plt.hist(monkey_test.pv_history_list, bins=50)




Out:

(array([  4.00000000e+00,   2.10000000e+01,   1.36000000e+02,
4.72000000e+02,   1.08800000e+03,   2.06900000e+03,
3.36000000e+03,   4.82700000e+03,   6.27800000e+03,
7.27500000e+03,   7.81400000e+03,   8.06700000e+03,
7.98500000e+03,   7.47900000e+03,   7.03600000e+03,
6.17200000e+03,   5.41600000e+03,   4.60700000e+03,
4.00500000e+03,   3.22600000e+03,   2.75700000e+03,
2.14500000e+03,   1.71600000e+03,   1.30400000e+03,
1.03900000e+03,   8.09000000e+02,   6.83000000e+02,
5.25000000e+02,   3.81000000e+02,   2.92000000e+02,
2.53000000e+02,   1.93000000e+02,   1.49000000e+02,
1.02000000e+02,   7.70000000e+01,   4.60000000e+01,
4.70000000e+01,   3.40000000e+01,   2.40000000e+01,
2.50000000e+01,   1.20000000e+01,   1.10000000e+01,
1.50000000e+01,   7.00000000e+00,   3.00000000e+00,
4.00000000e+00,   2.00000000e+00,   4.00000000e+00,
3.00000000e+00,   1.00000000e+00]),
array([  422.648034  ,   508.28394454,   593.91985508,   679.55576562,
765.19167616,   850.8275867 ,   936.46349724,  1022.09940778,
1107.73531832,  1193.37122886,  1279.0071394 ,  1364.64304994,
1450.27896048,  1535.91487102,  1621.55078156,  1707.1866921 ,
1792.82260264,  1878.45851318,  1964.09442372,  2049.73033426,
2135.3662448 ,  2221.00215534,  2306.63806588,  2392.27397642,
2477.90988696,  2563.5457975 ,  2649.18170804,  2734.81761858,
2820.45352912,  2906.08943966,  2991.7253502 ,  3077.36126074,
3162.99717128,  3248.63308182,  3334.26899236,  3419.9049029 ,
3505.54081344,  3591.17672398,  3676.81263452,  3762.44854506,
3848.0844556 ,  3933.72036614,  4019.35627668,  4104.99218722,
4190.62809776,  4276.2640083 ,  4361.89991884,  4447.53582938,
4533.17173992,  4618.80765046,  4704.443561  ]),
<a list of 50 Patch objects>)




In :

monkey_test_stats = pd.Series(monkey_test.pv_history_list)
print(monkey_test_stats.describe())




count    100000.000000
mean       1606.960713
std         464.488921
min         422.648034
25%        1272.686842
50%        1542.634717
75%        1869.730139
max        4704.443561
dtype: float64


##### Validation phase (2006-9-25 ~ 2011-9-25)

Using the median of the final total assets (872.08), we can calculate $r_{val}$: $$(1 + r_{val})^{1260/252} = 0.8721$$

$$\Longrightarrow \boxed{r_{val} = -0.02699907 \approx -2.70\%}$$

##### Test phase (2011-9-26 ~ 2016-9-27)

Similarly, using the median (1542.63), $$\boxed{r_{test} = 0.09056276 \approx 9.06\%}$$
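
The Monkey benchmark can also be sketched in a stripped-down, pure-Python form. This is an illustration under simplifying assumptions, not the `MonkeyBot` class itself: the names `monkey_run` and `monkey_median` are hypothetical, prices are a plain list, and only the two actions buy and sell are drawn.

```python
import random
import statistics


def monkey_run(prices, cash=1000.0, rng=random):
    """Place one random order per day and return the final total assets."""
    shares = 0
    for price in prices:
        if rng.choice(["Buy", "Sell"]) == "Buy":
            n = int(cash // price)   # whole shares only, no transaction cost
            cash -= n * price
            shares += n
        else:
            cash += shares * price   # sell everything in hand
            shares = 0
    return cash + shares * prices[-1]


def monkey_median(prices, iters=1000, seed=0):
    """Median final total assets over many independent random runs."""
    rng = random.Random(seed)
    return statistics.median(monkey_run(prices, rng=rng) for _ in range(iters))


# With a constant price no strategy can gain or lose anything
flat_result = monkey_median([10.0] * 50, iters=20)
print(flat_result)
```

The median is used here because the distribution of final assets is strongly right-skewed, as the histograms above show.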

#### Action series of the God Chimp



In :

from collections import defaultdict
from sklearn.ensemble import RandomForestRegressor

class ChimpBot(MonkeyBot):
    """A Q-learning trading bot that extends the MonkeyBot."""
    # Two-action setup described above (assumed; the definition is missing here)
    valid_actions = ['Buy', 'Sell']

    epsilon = 1
    gamma = 0.75
    random_reward = [0] # assumed value (truncated in the original); not used below

    random_counter = 0
    policy_counter = 0

    track_key1 = {'Sell': 0, 'Buy': 0, 'Hold': 0}
    track_key2 = {'Sell': 0, 'Buy': 0, 'Hold': 0}

    track_random_decision = {'Sell': 0, 'Buy': 0, 'Hold': 0}

    reset_counter = 0

    def __init__(self, dfEnv, iter_random_rounds, test_mode=False, cash=1000, share=0, pv=0, random_state=0):
        # MonkeyBot.__init__ does not take iter_random_rounds, so pass only the portfolio arguments
        super(ChimpBot, self).__init__(dfEnv, cash, share, pv)
        random.seed(random_state)
        np.random.seed(random_state)
        # sets self.cash = 1000
        # sets self.share = 0
        # sets self.pv = 0
        # sets self.pv_history_list = []
        # sets self.env = dfEnv
        # implements buy(self, stock_price)
        # implements sell(self, stock_price)
        # implements hold(self, stock_price)
        self.test_mode = test_mode
        self.num_features = len(dfEnv.columns) - 1
        self.random_rounds = iter_random_rounds # Number of rounds where the bot chooses to go monkey

        self.iter_env = self.env.iterrows()
        self.now_env_index, self.now_row = next(self.iter_env)

        # self.now_yes_share = 0
        self.now_action = ''
        # self.now_q = 0

        self.prev_cash = self.cash
        self.prev_share = self.share
        self.prev_pv = self.pv

        self.q_df_columns = list(self.env.columns)
        self.q_df_columns.pop()
        self.q_df_columns.extend(['Action', 'Q Value'])
        self.q_df = pd.DataFrame(columns=self.q_df_columns)
        self.q_dict = defaultdict(lambda: (0, 0)) # element of q_dict is (state, act): (q_value, t)
        # self.q_dict_analysis preserves the datetime data and is not used by the ChimpBot
        self.q_dict_analysis = defaultdict(lambda: (0, 0))

        self.negative_reward = 0
        self.n_reward_hisotry = []
        self.net_reward = 0

        self.reset_counter = 0

    def make_q_df(self):
        result_dict = defaultdict(list)

        for index, row in self.q_dict.items():
            # index is the (state..., action) tuple; one column per element
            for i in range(len(index)):
                column_name = 'col' + str(i + 1)
                result_dict[column_name].append(index[i])
            result_dict['Q'].append(self.q_dict[index][0])

        self.q_df = pd.DataFrame(result_dict)
        q_df_column_list = ['col' + str(x) for x in range(1, self.num_features + 1 + 1)]
        q_df_column_list.append('Q')
        self.q_df = self.q_df[q_df_column_list]

        def transfer_action(x):
            if x == 'Buy':
                return 1
            elif x == 'Sell':
                return 2
            elif x == 'Hold':
                return 0
            else:
                raise ValueError("Wrong action!")

        def str_float_int(x):
            return int(float(x))

        arr_int = np.vectorize(str_float_int)

        self.q_df['col' + str(self.num_features + 1)] = self.q_df['col' + str(self.num_features + 1)].apply(transfer_action)
        self.q_df.iloc[:, :-1] = self.q_df.iloc[:, :-1].apply(arr_int)

    def split_q_df(self):
        self.q_df_X = self.q_df.iloc[:, :-1]
        self.q_df_y = self.q_df.iloc[:, -1]

    def train_on_q_df(self):
        reg = RandomForestRegressor(n_estimators=128, max_features='sqrt', n_jobs=-1, random_state=0)
        self.q_reg = reg
        self.q_reg = self.q_reg.fit(self.q_df_X, self.q_df_y)

    def update_q_model(self):
        print("Updating Q model...")
        start_time = time.time()
        self.make_q_df()
        self.split_q_df()
        self.train_on_q_df()

    def from_state_action_predict_q(self, state_action):
        state_action = [state_action]

        pred_q = self.q_reg.predict(state_action)

        return pred_q

    def max_q(self, now_row):
        def transfer_action(x):
            if x == 'Buy':
                return 1
            elif x == 'Sell':
                return 2
            elif x == 'Hold':
                return 0
            else:
                raise ValueError("Wrong action!")

        def str_float_int(x):
            return int(float(x))

        now_row2 = list(now_row)
        # now_row2.append(self.now_yes_share)
        max_q = ''
        q_compare_dict = {}

        if len(now_row2) > self.num_features:
            raise ValueError("Got ya bastard! @ MaxQ")

        # Populate the q_dict
        for act in set(self.valid_actions):
            now_row2.append(act)
            now_row_key = tuple(now_row2)

            _ = self.q_dict[now_row_key]

            try:
                self.q_reg
            except AttributeError:
                pass
                # print('No q_reg yet...going with default.')
            else:
                # Assumed check: the pair has never been updated (t == 0),
                # so seed its Q value with the regressor's prediction
                if _[1] == 0:
                    single_X = np.array(now_row_key)
                    # print(single_X)
                    arr_int = np.vectorize(str_float_int)
                    single_X[-1] = transfer_action(single_X[-1])
                    single_X = arr_int(single_X)
                    single_X = single_X.reshape(1, -1)
                    pred_q = self.q_reg.predict(single_X)[0]
                    q_now, t_now = self.q_dict[now_row_key]
                    dreamed_q = (1 - (1.0 / (t_now + 1))) * q_now + (1.0 / (t_now + 1)) * pred_q
                    self.q_dict[now_row_key] = (dreamed_q, t_now + 1)

            q_compare_dict[now_row_key] = self.q_dict[now_row_key]
            now_row2.pop()

        try:
            max(q_compare_dict.items(), key=lambda x: x[1])
        except ValueError:
            print("Wrong Q Value in Q Compare Dict!")
        else:
            key, qAndT = max(q_compare_dict.items(), key=lambda x: x[1])
            # print("Action: {0}, with Q-value: {1}".format(key[-1], qAndT[0]))
            return key[-1], qAndT[0], qAndT[1]

    def q_update(self):
        # print("Data Index: {}".format(self.now_env_index))
        now_states = list(self.now_row)
        # now_states = list(now_states)
        now_states.pop() # disregard the Trade Price

        prev_states = list(self.prev_states)

        if len(prev_states) > self.num_features:
            raise ValueError("Got ya bastard! @ Q_Update...something wrong with the self.prev_states!!!")

        # prev_states.append(self.prev_yes_share)
        prev_states.append(self.prev_action)
        prev_states_key = tuple(prev_states)

        if len(prev_states_key) > self.num_features + 2:
            raise ValueError("Got ya bastard! @ Q_Update")

        q_temp = self.q_dict[prev_states_key] # (q_value, t)

        # Learning rate 1 / (t + 1) applied to the target: reward + gamma * max Q of the next state
        q_temp0 = (1 - (1.0 / (q_temp[1] + 1))) * q_temp[0] + (1.0 / (q_temp[1] + 1)) * (self.reward + self.gamma * self.max_q(now_states)[1])

        self.q_dict[prev_states_key] = (q_temp0, q_temp[1] + 1)
        # For analysis purpose
        self.q_dict_analysis[prev_states_key] = (q_temp0, self.prev_env_index)
        # print("Now Action: {}".format())
        # print(prev_states_key)
        return (self.q_dict[prev_states_key])

    def policy(self, now_row):
        # max_q returns (action, q_value, t); the policy only needs the action
        return self.max_q(now_row)[0]

    def reset(self):
        # Portfolio change over iterations
        self.pv_history_list.append(self.pv + self.cash)

        self.iter_env = self.env.iterrows()
        self.now_env_index, self.now_row = next(self.iter_env)

        self.cash = 1000
        self.share = 0
        self.pv = 0

        self.prev_cash = self.cash
        self.prev_share = self.share
        self.prev_pv = self.pv

        if self.test_mode is True:
            self.epsilon = 0

        else:
            if self.epsilon - 1.0 / self.random_rounds > 1.0 / self.random_rounds: # Epsilon threshold: 0.01
                self.random_counter += 1
                self.epsilon = self.epsilon - 1.0 / self.random_rounds
            else:
                self.epsilon = 0.000001 # Epsilon threshold: 0.1
                self.policy_counter += 1

        self.net_reward = 0

        self.reset_counter += 1

        if self.reset_counter % self.random_rounds == 0:
            self.update_q_model()

        if self.reset_counter != self.random_rounds:
            self.action_list = []

    def make_decision(self, now_row):
        return self.policy(now_row)

    def update(self):
        # Update state
        now_states = list(self.now_row)

        if len(now_states) > self.num_features + 1:
            print(len(now_states))
            print(self.num_features)
            raise ValueError("Got ya bastard! @ Q_Update...something wrong with the self.now_row!!!")

        now_states.pop() # disregard the Trade Price

        if len(now_states) > self.num_features:
            print(now_states)
            raise ValueError("Got ya bastard! @ Q_Update...something wrong with now_states after pop!!!")

        # Exploitation-exploration decisioning
        self.decision = np.random.choice(2, p = [self.epsilon, 1 - self.epsilon]) # decide to go random or with the policy
        # self.decision = 0 # Force random mode

        # print("Random decision: {0}, Epsilon: {1}".format(self.decision, self.epsilon))
        if self.decision == 0: # if zero, go random
            action = random.choice(self.valid_actions)
        else: # else go with the policy
            action = self.make_decision(now_states)

        if len(now_states) > self.num_features:
            print(now_states)
            raise ValueError("Got ya bastard! @ Q_Update...something wrong with now_states after make_decision!!!")

        # Execute action and get reward (the last element of the row is the trade price)
        if action == 'Buy':
            # print(self.now_row)
            self.buy(self.now_row.iloc[-1])
        elif action == 'Sell':
            # print(self.now_row)
            self.sell(self.now_row.iloc[-1])
        elif action == 'Hold':
            # print(self.now_row)
            self.hold(self.now_row.iloc[-1])
        else:
            raise ValueError("Wrong action man!")

        try:
            self.prev_states
        except AttributeError:
            print("Running the first time...no prevs exist.")
        else:
            self.reward = ((self.cash - self.prev_cash) + (self.pv - self.prev_pv)) / (self.prev_cash + self.prev_pv)
            self.q_update()

        self.prev_states = now_states

        if len(now_states) > self.num_features:
            raise ValueError("Got ya bastard! @ Q_Update...something wrong with the now_states!!!")

        self.now_action = action
        self.prev_action = action
        # self.prev_yes_share = self.now_yes_share
        self.prev_env_index = deepcopy(self.now_env_index)
        self.prev_cash = self.cash
        self.prev_share = self.share
        self.prev_pv = self.pv

        try:
            self.now_env_index, self.now_row = next(self.iter_env)
        except StopIteration:
            pass
            # print("End of data.")
        else:
            pass

        try:
            _ = self.reward
        except AttributeError:
            print("No reward yet...0 assigned.")
            self.reward = 0

    def simulate(self):
        start_time = time.time()

        for i in range(self.random_rounds):
            for l in range(len(self.env)):
                self.update()
            self.reset()
        print("{0} rounds of simulation took {1} seconds".format(self.random_rounds, time.time() - start_time))
        return self.pv_history_list




In :

iter_random_rounds=5000
god_chimp = ChimpBot(dfEnv=data_full, iter_random_rounds=iter_random_rounds, random_state=0)
pv_history_list = god_chimp.simulate()

print(pv_history_list[-1])

pd.Series(pv_history_list).plot()




Running the first time...no prevs exist.
No reward yet...0 assigned.
5000 rounds of simulation took 9646.88984394 seconds
2.05413123456e+31

Out:

<matplotlib.axes._subplots.AxesSubplot at 0x7fb7a8d09390>




In :

print(pd.Series(god_chimp.action_list).describe())




count     8156
unique       2
freq      4123
dtype: object




In :

# Convert Q-Table to Dataframe from the God Chimp (full dataset)
iter_random_rounds=5000
result_dict = defaultdict(list)
for index, row in god_chimp.q_dict_analysis.items():
    # index is the (state..., action) tuple; the value is (q_value, date)
    for i in range(len(index)):
        column_name = 'col' + str(i + 1)
        result_dict[column_name].append(index[i])
    result_dict['Q'].append(god_chimp.q_dict_analysis[index][0])
    result_dict['Date'].append(god_chimp.q_dict_analysis[index][1])

god_chimp_q_df = pd.DataFrame(result_dict)

# Yes share column removed
column_list = ['col' + str(i) for i in range(1, 39)] + ['Date', 'Q']
god_chimp_q_df = god_chimp_q_df[column_list]
god_chimp_q_df.sort_values('Date', inplace=True)
god_chimp_q_df.reset_index(drop=True, inplace=True)

# Use the date as an unnamed index
god_chimp_q_df.set_index('Date', inplace=True)
god_chimp_q_df.index.name = None

print(len(god_chimp_q_df))




16312

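| | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 | col11 | col12 | col13 | col14 | col15 | col16 | col17 | col18 | col19 | col20 | col21 | col22 | col23 | col24 | col25 | col26 | col27 | col28 | col29 | col30 | col31 | col32 | col33 | col34 | col35 | col36 | col37 | col38 | Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1984-05-23 | 1.0 | 5.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 3.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | -5.0 | -5.0 | 4.0 | -4.0 | -2.0 | -4.0 | -2.0 | -3.0 | 3.0 | 5.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 0.0 | 3.0 | 0.0 | 0.0 | 5.0 | 0.0 | 5.0 | 0.0 | Sell | 0.008839 |
| 1984-05-23 | 1.0 | 5.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 3.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | -5.0 | -5.0 | 4.0 | -4.0 | -2.0 | -4.0 | -2.0 | -3.0 | 3.0 | 5.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 0.0 | 3.0 | 0.0 | 0.0 | 5.0 | 0.0 | 5.0 | 0.0 | Buy | 0.019101 |
| 1984-05-24 | 2.0 | 1.0 | 5.0 | 1.0 | 2.0 | 2.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | -5.0 | -5.0 | -5.0 | 4.0 | -4.0 | -4.0 | -3.0 | -4.0 | 4.0 | 0.0 | 3.0 | 5.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 4.0 | 0.0 | 5.0 | 0.0 | 5.0 | 0.0 | Sell | 0.015100 |
| 1984-05-24 | 2.0 | 1.0 | 5.0 | 1.0 | 2.0 | 2.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | -5.0 | -5.0 | -5.0 | 4.0 | -4.0 | -4.0 | -3.0 | -4.0 | 4.0 | 0.0 | 3.0 | 5.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 4.0 | 0.0 | 5.0 | 0.0 | 5.0 | 0.0 | Buy | 0.009537 |
| 1984-05-25 | 5.0 | 2.0 | 1.0 | 5.0 | 1.0 | 5.0 | 3.0 | 3.0 | 5.0 | 1.0 | 1.0 | 1.0 | 1.0 | -3.0 | -5.0 | -5.0 | -5.0 | 4.0 | -3.0 | -3.0 | -4.0 | 5.0 | 2.0 | 4.0 | 0.0 | 3.0 | 5.0 | 0.0 | 0.0 | 2.0 | 0.0 | 4.0 | 0.0 | 5.0 | 0.0 | 5.0 | 0.0 | Buy | -0.015552 |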
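
The flattening of the Q-table dictionary into the frame shown above can be sketched on a toy example. This is only an illustration under stated assumptions: the toy dictionary, its three-entry keys, the `(Q, date)` value layout and the helper name `q_dict_to_frame` are all hypothetical and are not taken from the Chimp's code.

```python
import pandas as pd

# Hypothetical toy Q-table: key = (feature_1, feature_2, action), value = (Q, date).
# The real keys have 38 entries; the (Q, date) value layout is an assumption.
toy_q = {
    (1.0, 5.0, "Sell"): (0.008839, "1984-05-23"),
    (1.0, 5.0, "Buy"): (0.019101, "1984-05-23"),
    (2.0, 1.0, "Sell"): (0.015100, "1984-05-24"),
}


def q_dict_to_frame(q_dict):
    """Flatten {state_action_tuple: (q, date)} into a date-indexed DataFrame."""
    rows = []
    for key, (q_value, date) in q_dict.items():
        # One column per key entry: col1, col2, ...
        row = {"col{}".format(i + 1): value for i, value in enumerate(key)}
        row["Q"] = q_value
        row["Date"] = pd.Timestamp(date)
        rows.append(row)
    # mergesort is stable, so rows sharing a date keep their insertion order
    frame = pd.DataFrame(rows).sort_values("Date", kind="mergesort").set_index("Date")
    frame.index.name = None
    n_key_cols = len(frame.columns) - 1
    ordered = ["col{}".format(i + 1) for i in range(n_key_cols)] + ["Q"]
    return frame[ordered]


toy_df = q_dict_to_frame(toy_q)
print(toy_df)
```

Building one dictionary per row and sorting once keeps the column order explicit and avoids the repeated `reset_index` / `del` steps.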




In :

def action_to_int(string):
    # Encode the action label as an integer; leave anything else untouched
    if string == 'Buy':
        return 1
    elif string == 'Sell':
        return 2
    else:
        return string

god_chimp_q_df.ix[:, -2] = god_chimp_q_df.ix[:, -2].apply(action_to_int)
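
The same encoding can also be written with a lookup dictionary. The snippet below is a small sketch on a made-up action column; the `action_codes` name and the sample labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical action column; the label-to-code mapping mirrors action_to_int.
actions = pd.Series(["Sell", "Buy", "Sell", "Buy"])
action_codes = {"Buy": 1, "Sell": 2}

# Labels missing from the dictionary are kept unchanged, like the else branch.
encoded = actions.map(lambda label: action_codes.get(label, label))
print(encoded.tolist())
```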




In :




Out:

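| | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 | col11 | col12 | col13 | col14 | col15 | col16 | col17 | col18 | col19 | col20 | col21 | col22 | col23 | col24 | col25 | col26 | col27 | col28 | col29 | col30 | col31 | col32 | col33 | col34 | col35 | col36 | col37 | col38 | Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1984-05-23 | 1.0 | 5.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 3.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | -5.0 | -5.0 | 4.0 | -4.0 | -2.0 | -4.0 | -2.0 | -3.0 | 3.0 | 5.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 0.0 | 3.0 | 0.0 | 0.0 | 5.0 | 0.0 | 5.0 | 0.0 | 2 | 0.008839 |
| 1984-05-23 | 1.0 | 5.0 | 1.0 | 2.0 | 1.0 | 2.0 | 2.0 | 3.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | -5.0 | -5.0 | 4.0 | -4.0 | -2.0 | -4.0 | -2.0 | -3.0 | 3.0 | 5.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 0.0 | 3.0 | 0.0 | 0.0 | 5.0 | 0.0 | 5.0 | 0.0 | 1 | 0.019101 |
| 1984-05-24 | 2.0 | 1.0 | 5.0 | 1.0 | 2.0 | 2.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | -5.0 | -5.0 | -5.0 | 4.0 | -4.0 | -4.0 | -3.0 | -4.0 | 4.0 | 0.0 | 3.0 | 5.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 4.0 | 0.0 | 5.0 | 0.0 | 5.0 | 0.0 | 2 | 0.015100 |
| 1984-05-24 | 2.0 | 1.0 | 5.0 | 1.0 | 2.0 | 2.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | -5.0 | -5.0 | -5.0 | 4.0 | -4.0 | -4.0 | -3.0 | -4.0 | 4.0 | 0.0 | 3.0 | 5.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 4.0 | 0.0 | 5.0 | 0.0 | 5.0 | 0.0 | 1 | 0.009537 |
| 1984-05-25 | 5.0 | 2.0 | 1.0 | 5.0 | 1.0 | 5.0 | 3.0 | 3.0 | 5.0 | 1.0 | 1.0 | 1.0 | 1.0 | -3.0 | -5.0 | -5.0 | -5.0 | 4.0 | -3.0 | -3.0 | -4.0 | 5.0 | 2.0 | 4.0 | 0.0 | 3.0 | 5.0 | 0.0 | 0.0 | 2.0 | 0.0 | 4.0 | 0.0 | 5.0 | 0.0 | 5.0 | 0.0 | 1 | -0.015552 |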



### Finding the right size for training set

As said earlier, one problem with time series data is finding the training window size within which the data can be treated as drawn from the same population as the data we want to predict. Only then can what we have learned from the training set be expected to generalize to the cross-validation and test sets.

To do this, we can make use of the God Chimp’s Q-table we just obtained:



In :

from sklearn.metrics import accuracy_score

def find_best_training_size(data_full, full_q_df, training_sizes, testing_size, target_data, random_state=0):
    start_time = time.time()
    accs = []
    d_counter = 0

    # Loop through all batches in validation dataset
    (u, ) = data_full.index.get_indexer_for([target_data.index[0]])
    for d in range(u, u + testing_size * (len(target_data) // testing_size), testing_size):
        acc_num_train_months = []
        d_counter += 1

        # Dates in the batch
        date_range = data_full.iloc[d:d + testing_size].index

        # Loop through all sizes of training sets
        for num_train_month in range(1, training_sizes + 1):
            # Prepare Training/Testing Datasets (21 trading days per month)
            X_train = full_q_df.iloc[d - (int(21 * num_train_month)):d, :-1]
            y_train = full_q_df.iloc[d - (int(21 * num_train_month)):d, -1]
            X_test = full_q_df.ix[date_range, :-1]
            y_test = full_q_df.ix[date_range, -1]

            # Fit data and make predictions
            reg = RandomForestRegressor(n_estimators=128, max_features='sqrt', oob_score=True, n_jobs=-1, random_state=random_state)
            reg.fit(X_train, y_train)

            y_pred = reg.predict(X_test)
            y_fit = reg.predict(X_train)

            pred_q = y_pred
            actions = X_test.ix[:, -1]
            data = {'Action': actions, 'Q': pred_q}
            df_pred = pd.DataFrame(data=data, index=y_test.index)

            # For each date, pick the action with the largest predicted Q
            pred_actions = []

            for date in date_range:
                max_q = [0, -1]
                for i, r in df_pred.ix[date].iterrows():
                    if r['Q'] > max_q[1]:
                        max_q = [r['Action'], r['Q']]
                pred_actions.append(max_q[0])

            # For each date, pick the action with the largest Q in the God Chimp's table
            best_actions = []

            for date in date_range:
                max_q = [0, -1]
                for i, r in full_q_df.ix[date].iterrows():
                    if r['Q'] > max_q[1]:
                        max_q = [r[-2], r['Q']]
                best_actions.append(max_q[0])

            acc_num_train_months.append(accuracy_score(best_actions, pred_actions))
        accs.append(np.array(acc_num_train_months))
        # Progress is printed as a fraction of batches done (0 to 1), despite the % sign
        print("Batch {0} completed....{1:.2f}%".format(d_counter, d_counter / len(range(u, u + testing_size * (len(target_data) // testing_size), testing_size))))
        # Running means over the batches seen so far, one entry per training size
        geo_means = np.power(reduce(lambda x,y: x*y, accs), (1/len(accs)))
        arithmetic_means = reduce(lambda x,y: x+y, accs) / len(accs)
        print("Geometric Means Max: {}".format((np.argmax(geo_means) + 1, np.max(geo_means))))
        print("Arithemtic Means Max: {}".format((np.argmax(arithmetic_means) + 1, np.max(arithmetic_means))))

    print("Grid search best num_train_year took {} seconds:".format(time.time() - start_time))

    return (geo_means, arithmetic_means)
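
The two nested loops in `find_best_training_size` pick, for every date, the action with the largest Q value. A minimal sketch of the same selection with a pandas group-by is given below. The sample frame and the helper name `best_action_per_date` are hypothetical. Unlike the loops, which start from a default action of 0 with a Q of -1, this version always returns one of the listed actions.

```python
import pandas as pd

# Hypothetical predictions: two candidate actions per trading day.
df_pred = pd.DataFrame(
    {
        "Action": [1, 2, 1, 2, 1, 2],
        "Q": [0.019, 0.009, 0.010, 0.015, -0.016, -0.020],
    },
    index=pd.to_datetime(
        ["1984-05-23", "1984-05-23", "1984-05-24", "1984-05-24", "1984-05-25", "1984-05-25"]
    ),
)


def best_action_per_date(frame):
    """Return, for each date, the action whose Q value is largest."""
    # Move the unnamed date index into a regular column so rows get unique labels
    flat = frame.reset_index().rename(columns={"index": "Date"})
    # idxmax gives the row label of the largest Q within each date
    winners = flat.loc[flat.groupby("Date")["Q"].idxmax()]
    return winners.set_index("Date")["Action"]


best = best_action_per_date(df_pred)
print(best.tolist())
```

Two such series, one from the predicted Q values and one from the God Chimp's table, can then be passed straight to `accuracy_score`.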




In :

means = find_best_training_size(data_full=data_full, full_q_df=god_chimp_q_df, training_sizes=120, testing_size=7, target_data=validation_phase_data, random_state=0)
# The function returns the tuple (geo_means, arithmetic_means)
geo_means = means[0]
arithmetic_means = means[1]




Batch 1 completed....0.01%
Geometric Means Max: (64, 0.7142857142857143)
Arithemtic Means Max: (64, 0.7142857142857143)
Batch 2 completed....0.01%
Geometric Means Max: (3, 0.5714285714285714)
Arithemtic Means Max: (2, 0.5714285714285714)
Batch 3 completed....0.02%
Geometric Means Max: (96, 0.6256455914125556)
Arithemtic Means Max: (96, 0.66666666666666663)
Batch 4 completed....0.02%
Geometric Means Max: (64, 0.62865124056670951)
Arithemtic Means Max: (26, 0.6428571428571429)
Batch 5 completed....0.03%
Geometric Means Max: (64, 0.61676569768093636)
Arithemtic Means Max: (26, 0.62857142857142867)
Batch 6 completed....0.03%
Geometric Means Max: (99, 0.62103954569315234)
Arithemtic Means Max: (86, 0.6428571428571429)
Batch 7 completed....0.04%
Geometric Means Max: (75, 0.59791895257491678)
Arithemtic Means Max: (75, 0.61224489795918369)
Batch 8 completed....0.04%
Geometric Means Max: (86, 0.58377634280084301)
Arithemtic Means Max: (86, 0.6071428571428571)
Batch 9 completed....0.05%
Geometric Means Max: (86, 0.56406972935286603)
Arithemtic Means Max: (86, 0.58730158730158732)
Batch 10 completed....0.06%
Geometric Means Max: (86, 0.57754617386855167)
Arithemtic Means Max: (86, 0.59999999999999998)
Batch 11 completed....0.06%
Geometric Means Max: (86, 0.58881149457503434)
Arithemtic Means Max: (86, 0.61038961038961037)
Batch 12 completed....0.07%
Geometric Means Max: (86, 0.5734297104905578)
Arithemtic Means Max: (86, 0.59523809523809523)
Batch 13 completed....0.07%
Geometric Means Max: (19, 0.57272798013191006)
Arithemtic Means Max: (19, 0.5934065934065933)
Batch 14 completed....0.08%
Geometric Means Max: (19, 0.56098822364309964)
Arithemtic Means Max: (19, 0.58163265306122436)
Batch 15 completed....0.08%
Geometric Means Max: (19, 0.55100859829002813)
Arithemtic Means Max: (19, 0.57142857142857129)
Batch 16 completed....0.09%
Geometric Means Max: (22, 0.53490979743126044)
Arithemtic Means Max: (6, 0.5535714285714286)
Batch 17 completed....0.09%
Geometric Means Max: (22, 0.54408691643996165)
Arithemtic Means Max: (6, 0.56302521008403361)
Batch 18 completed....0.10%
Geometric Means Max: (22, 0.54557098369822032)
Arithemtic Means Max: (6, 0.56349206349206349)
Batch 19 completed....0.11%
Geometric Means Max: (22, 0.55869872780346663)
Arithemtic Means Max: (6, 0.5714285714285714)
Batch 20 completed....0.11%
Geometric Means Max: (6, 0.56378168119990191)
Arithemtic Means Max: (6, 0.58571428571428563)
Batch 21 completed....0.12%
Geometric Means Max: (22, 0.56587991561770867)
Arithemtic Means Max: (6, 0.58503401360544216)
Batch 22 completed....0.12%
Geometric Means Max: (6, 0.57497242477007127)
Arithemtic Means Max: (6, 0.59740259740259738)
Batch 23 completed....0.13%
Geometric Means Max: (6, 0.57481788816184121)
Arithemtic Means Max: (6, 0.59627329192546585)
Batch 24 completed....0.13%
Geometric Means Max: (6, 0.56782888454738434)
Arithemtic Means Max: (6, 0.5892857142857143)
Batch 25 completed....0.14%
Geometric Means Max: (22, 0.55386919150437219)
Arithemtic Means Max: (6, 0.57714285714285718)
Batch 26 completed....0.14%
Geometric Means Max: (22, 0.56325011044611106)
Arithemtic Means Max: (6, 0.58791208791208793)
Batch 27 completed....0.15%
Geometric Means Max: (22, 0.56355091864686091)
Arithemtic Means Max: (6, 0.58201058201058209)
Batch 28 completed....0.16%
Geometric Means Max: (106, 0.56859858323892842)
Arithemtic Means Max: (106, 0.59183673469387732)
Batch 29 completed....0.16%
Geometric Means Max: (12, 0.56442352472214352)
Arithemtic Means Max: (12, 0.58620689655172398)
Batch 30 completed....0.17%
Geometric Means Max: (6, 0.55123680938043562)
Arithemtic Means Max: (6, 0.57619047619047614)
Batch 31 completed....0.17%
Geometric Means Max: (6, 0.55586372241481441)
Arithemtic Means Max: (6, 0.58064516129032262)
Batch 32 completed....0.18%
Geometric Means Max: (6, 0.55136449517862185)
Arithemtic Means Max: (30, 0.58035714285714279)
Batch 33 completed....0.18%
Geometric Means Max: (6, 0.55570699329744611)
Arithemtic Means Max: (6, 0.58008658008658009)
Batch 34 completed....0.19%
Geometric Means Max: (12, 0.54182726703042361)
Arithemtic Means Max: (12, 0.5714285714285714)
Batch 35 completed....0.19%
Geometric Means Max: (12, 0.54265135033554901)
Arithemtic Means Max: (6, 0.5714285714285714)
Batch 36 completed....0.20%
Geometric Means Max: (12, 0.54343080206465677)
Arithemtic Means Max: (6, 0.57142857142857151)
Batch 37 completed....0.21%
Geometric Means Max: (12, 0.54416915168583402)
Arithemtic Means Max: (12, 0.57142857142857151)
Batch 38 completed....0.21%
Geometric Means Max: (12, 0.53502085239721486)
Arithemtic Means Max: (12, 0.56390977443609025)
Batch 39 completed....0.22%
Geometric Means Max: (12, 0.53592475455602195)
Arithemtic Means Max: (12, 0.56410256410256421)
Batch 40 completed....0.22%
Geometric Means Max: (12, 0.52756323317798448)
Arithemtic Means Max: (12, 0.55714285714285716)
Batch 41 completed....0.23%
Geometric Means Max: (12, 0.53147668182880325)
Arithemtic Means Max: (12, 0.56097560975609762)
Batch 42 completed....0.23%
Geometric Means Max: (12, 0.52368038104958692)
Arithemtic Means Max: (12, 0.55442176870748305)
Batch 43 completed....0.24%
Geometric Means Max: (12, 0.52124517284914607)
Arithemtic Means Max: (12, 0.55149501661129574)
Batch 44 completed....0.24%
Geometric Means Max: (12, 0.5189312204298645)
Arithemtic Means Max: (12, 0.54870129870129869)
Batch 45 completed....0.25%
Geometric Means Max: (12, 0.5200437097985775)
Arithemtic Means Max: (12, 0.54920634920634925)
Batch 46 completed....0.26%
Geometric Means Max: (12, 0.52111006167266338)
Arithemtic Means Max: (12, 0.54968944099378891)
Batch 47 completed....0.26%
Geometric Means Max: (12, 0.52213308578289752)
Arithemtic Means Max: (12, 0.55015197568389063)
Batch 48 completed....0.27%
Geometric Means Max: (12, 0.51998951761759216)
Arithemtic Means Max: (12, 0.54761904761904767)
Batch 49 completed....0.27%
Geometric Means Max: (12, 0.517941711251518)
Arithemtic Means Max: (12, 0.54518950437317781)
Batch 50 completed....0.28%
Geometric Means Max: (12, 0.5159834047208105)
Arithemtic Means Max: (12, 0.54285714285714282)
Batch 51 completed....0.28%
Geometric Means Max: (12, 0.51701706198755848)
Arithemtic Means Max: (12, 0.54341736694677878)
Batch 52 completed....0.29%
Geometric Means Max: (12, 0.50438540546294719)
Arithemtic Means Max: (12, 0.5357142857142857)
Batch 53 completed....0.29%
Geometric Means Max: (12, 0.50283767301012483)
Arithemtic Means Max: (12, 0.53369272237196763)
Batch 54 completed....0.30%
Geometric Means Max: (12, 0.49760139412636795)
Arithemtic Means Max: (12, 0.52910052910052907)
Batch 55 completed....0.31%
Geometric Means Max: (12, 0.5008826158927735)
Arithemtic Means Max: (12, 0.53246753246753242)
Batch 56 completed....0.31%
Geometric Means Max: (12, 0.50710491140567004)
Arithemtic Means Max: (12, 0.54081632653061218)
Batch 57 completed....0.32%
Geometric Means Max: (12, 0.50816846850978448)
Arithemtic Means Max: (12, 0.54135338345864659)
Batch 58 completed....0.32%
Geometric Means Max: (12, 0.50314836165483057)
Arithemtic Means Max: (12, 0.53694581280788178)
Batch 59 completed....0.33%
Geometric Means Max: (12, 0.50178210279759639)
Arithemtic Means Max: (12, 0.53510895883777232)
Batch 60 completed....0.33%
Geometric Means Max: (12, 0.50287025409688912)
Arithemtic Means Max: (12, 0.5357142857142857)
Batch 61 completed....0.34%
Geometric Means Max: (12, 0.50155400811254369)
Arithemtic Means Max: (12, 0.5339578454332552)
Batch 62 completed....0.34%
Geometric Means Max: (12, 0.50261022812720868)
Arithemtic Means Max: (12, 0.53456221198156673)
Batch 63 completed....0.35%
Geometric Means Max: (12, 0.49812424617084583)
Arithemtic Means Max: (12, 0.53061224489795911)
Batch 64 completed....0.36%
Geometric Means Max: (12, 0.49919394708650316)
Arithemtic Means Max: (12, 0.53124999999999989)
Batch 65 completed....0.36%
Geometric Means Max: (12, 0.49802384852127762)
Arithemtic Means Max: (12, 0.52967032967032956)
Batch 66 completed....0.37%
Geometric Means Max: (12, 0.49689182682505073)
Arithemtic Means Max: (12, 0.52813852813852813)
Batch 67 completed....0.37%
Geometric Means Max: (12, 0.50095192826022594)
Arithemtic Means Max: (12, 0.53304904051172697)
Batch 68 completed....0.38%
Geometric Means Max: (12, 0.5019225729192911)
Arithemtic Means Max: (12, 0.53361344537815114)
Batch 69 completed....0.38%
Geometric Means Max: (12, 0.50077464167626573)
Arithemtic Means Max: (12, 0.53209109730848847)
Batch 70 completed....0.39%
Geometric Means Max: (12, 0.49966202247535368)
Arithemtic Means Max: (12, 0.53061224489795911)
Batch 71 completed....0.39%
Geometric Means Max: (12, 0.49574393137254263)
Arithemtic Means Max: (12, 0.52716297786720312)
Batch 72 completed....0.40%
Geometric Means Max: (12, 0.49672316501551406)
Arithemtic Means Max: (12, 0.52777777777777768)
Batch 73 completed....0.41%
Geometric Means Max: (12, 0.49572001303560909)
Arithemtic Means Max: (12, 0.52641878669275921)
Batch 74 completed....0.41%
Geometric Means Max: (12, 0.49667303335633722)
Arithemtic Means Max: (12, 0.52702702702702697)
Batch 75 completed....0.42%
Geometric Means Max: (12, 0.48848923694432961)
Arithemtic Means Max: (12, 0.52190476190476187)
Batch 76 completed....0.42%
Geometric Means Max: (12, 0.48764885918770623)
Arithemtic Means Max: (6, 0.52255639097744344)
Batch 77 completed....0.43%
Geometric Means Max: (86, 0.48642238209255967)
Arithemtic Means Max: (6, 0.51762523191094612)
Batch 78 completed....0.43%
Geometric Means Max: (86, 0.48563339744746153)
Arithemtic Means Max: (6, 0.51282051282051277)
Batch 79 completed....0.44%
Geometric Means Max: (86, 0.48801098935877035)
Arithemtic Means Max: (88, 0.51356238698010848)
Batch 80 completed....0.44%
Geometric Means Max: (86, 0.48721934322490068)
Arithemtic Means Max: (6, 0.51428571428571412)
Batch 81 completed....0.45%
Geometric Means Max: (86, 0.48952595687140349)
Arithemtic Means Max: (6, 0.51322751322751314)
Batch 82 completed....0.46%
Geometric Means Max: (86, 0.49045037306853478)
Arithemtic Means Max: (6, 0.51393728222996504)
Batch 83 completed....0.46%
Geometric Means Max: (86, 0.48321558764206474)
Arithemtic Means Max: (6, 0.50946643717728046)
Batch 84 completed....0.47%
Geometric Means Max: (86, 0.48546904465304547)
Arithemtic Means Max: (106, 0.51190476190476175)
Batch 85 completed....0.47%
Geometric Means Max: (86, 0.48475759409518659)
Arithemtic Means Max: (106, 0.51596638655462168)
Batch 86 completed....0.48%
Geometric Means Max: (86, 0.48694750908522227)
Arithemtic Means Max: (106, 0.5166112956810629)
Batch 87 completed....0.48%
Geometric Means Max: (86, 0.49099187770543717)
Arithemtic Means Max: (106, 0.52052545155993402)
Batch 88 completed....0.49%
Geometric Means Max: (86, 0.49023382432053542)
Arithemtic Means Max: (106, 0.52110389610389585)
Batch 89 completed....0.49%
Geometric Means Max: (86, 0.4933210832392862)
Arithemtic Means Max: (106, 0.52487961476725487)
Batch 90 completed....0.50%
Geometric Means Max: (86, 0.49255044499633688)
Arithemtic Means Max: (106, 0.52539682539682508)
Batch 91 completed....0.51%
Geometric Means Max: (86, 0.49179790832182846)
Arithemtic Means Max: (106, 0.52590266875981129)
Batch 92 completed....0.51%
Geometric Means Max: (86, 0.49106284368548725)
Arithemtic Means Max: (88, 0.52018633540372661)
Batch 93 completed....0.52%
Geometric Means Max: (86, 0.48821147974058565)
Arithemtic Means Max: (106, 0.52073732718893972)
Batch 94 completed....0.52%
Geometric Means Max: (86, 0.48902961133470862)
Arithemtic Means Max: (88, 0.51975683890577495)
Batch 95 completed....0.53%
Geometric Means Max: (86, 0.48835076612112494)
Arithemtic Means Max: (88, 0.52030075187969915)
Batch 96 completed....0.53%
Geometric Means Max: (86, 0.49028892438383481)
Arithemtic Means Max: (88, 0.52083333333333315)
Batch 97 completed....0.54%
Geometric Means Max: (86, 0.49219457628121988)
Arithemtic Means Max: (88, 0.52430044182621482)
Batch 98 completed....0.54%
Geometric Means Max: (86, 0.48947055365161007)
Arithemtic Means Max: (88, 0.52332361516034975)
Batch 99 completed....0.55%
Geometric Means Max: (86, 0.48881408085816824)
Arithemtic Means Max: (88, 0.52669552669552655)
Batch 100 completed....0.56%
Geometric Means Max: (86, 0.48817159174514585)
Arithemtic Means Max: (88, 0.52428571428571413)
Batch 101 completed....0.56%
Geometric Means Max: (86, 0.48893331028238379)
Arithemtic Means Max: (88, 0.52475247524752455)
Batch 102 completed....0.57%
Geometric Means Max: (86, 0.49075368640824735)
Arithemtic Means Max: (88, 0.52521008403361324)
Batch 103 completed....0.57%
Geometric Means Max: (86, 0.49010857945360164)
Arithemtic Means Max: (88, 0.52704576976421613)
Batch 104 completed....0.58%
Geometric Means Max: (86, 0.4927498996090573)
Arithemtic Means Max: (88, 0.53021978021978)
Batch 105 completed....0.58%
Geometric Means Max: (86, 0.49449535350971829)
Arithemtic Means Max: (88, 0.52925170068027194)
Batch 106 completed....0.59%
Geometric Means Max: (86, 0.49194297123876057)
Arithemtic Means Max: (88, 0.5283018867924526)
Batch 107 completed....0.59%
Geometric Means Max: (86, 0.49130934616971061)
Arithemtic Means Max: (88, 0.52870493991989298)
Batch 108 completed....0.60%
Geometric Means Max: (86, 0.4930146359997975)
Arithemtic Means Max: (88, 0.52777777777777757)
Batch 109 completed....0.61%
Geometric Means Max: (86, 0.49238144478161971)
Arithemtic Means Max: (106, 0.52686762778505869)
Batch 110 completed....0.61%
Geometric Means Max: (86, 0.49176055731636104)
Arithemtic Means Max: (88, 0.52857142857142836)
Batch 111 completed....0.62%
Geometric Means Max: (86, 0.49422824529195786)
Arithemtic Means Max: (88, 0.52895752895752879)
Batch 112 completed....0.62%
Geometric Means Max: (86, 0.49181594083080032)
Arithemtic Means Max: (88, 0.52678571428571408)
Batch 113 completed....0.63%
Geometric Means Max: (37, 0.48917245321812225)
Arithemtic Means Max: (88, 0.52465233881163065)
Batch 114 completed....0.63%
Geometric Means Max: (37, 0.49079958332338564)
Arithemtic Means Max: (106, 0.52756892230576413)
Batch 115 completed....0.64%
Geometric Means Max: (37, 0.48849592877399806)
Arithemtic Means Max: (106, 0.5254658385093165)
Batch 116 completed....0.64%
Geometric Means Max: (37, 0.48624252727125239)
Arithemtic Means Max: (88, 0.52339901477832496)
Batch 117 completed....0.65%
Geometric Means Max: (37, 0.48691388849713813)
Arithemtic Means Max: (88, 0.52136752136752118)
Batch 118 completed....0.66%
Geometric Means Max: (37, 0.48471909281715569)
Arithemtic Means Max: (88, 0.5217917675544792)
Batch 119 completed....0.66%
Geometric Means Max: (37, 0.48421788328234305)
Arithemtic Means Max: (88, 0.52220888355342121)
Batch 120 completed....0.67%
Geometric Means Max: (37, 0.48488658348373082)
Arithemtic Means Max: (106, 0.52261904761904743)
Batch 121 completed....0.67%
Geometric Means Max: (37, 0.48439210081294981)
Arithemtic Means Max: (88, 0.5218417945690671)
Batch 122 completed....0.68%
Geometric Means Max: (37, 0.48504863771264056)
Arithemtic Means Max: (88, 0.5234192037470724)
Batch 123 completed....0.68%
Geometric Means Max: (37, 0.48569536748203107)
Arithemtic Means Max: (88, 0.52264808362369319)
Batch 124 completed....0.69%
Geometric Means Max: (37, 0.48520551595335487)
Arithemtic Means Max: (88, 0.52304147465437767)
Batch 125 completed....0.69%
Geometric Means Max: (37, 0.48670891280120848)
Arithemtic Means Max: (88, 0.52342857142857124)
Batch 126 completed....0.70%
Geometric Means Max: (37, 0.48732917886635596)
Arithemtic Means Max: (88, 0.52380952380952361)
Batch 127 completed....0.71%
Geometric Means Max: (37, 0.48794044885471227)
Arithemtic Means Max: (88, 0.52530933633295807)
Batch 128 completed....0.71%
Geometric Means Max: (37, 0.48854291660016974)
Arithemtic Means Max: (88, 0.52566964285714257)
Batch 129 completed....0.72%
Geometric Means Max: (37, 0.48804716506082885)
Arithemtic Means Max: (106, 0.52270210409745277)
Batch 130 completed....0.72%
Geometric Means Max: (37, 0.48755953198764029)
Arithemtic Means Max: (88, 0.52417582417582387)
Batch 131 completed....0.73%
Geometric Means Max: (37, 0.48557456370845781)
Arithemtic Means Max: (88, 0.52235550708833123)
Batch 132 completed....0.73%
Geometric Means Max: (37, 0.48617383289530886)
Arithemtic Means Max: (88, 0.5238095238095235)
Batch 133 completed....0.74%
Geometric Means Max: (37, 0.48676481349700956)
Arithemtic Means Max: (88, 0.52416756176154633)
Batch 134 completed....0.74%
Geometric Means Max: (37, 0.48630252064701129)
Arithemtic Means Max: (88, 0.52345415778251569)
Batch 135 completed....0.75%
Geometric Means Max: (37, 0.48688394034136734)
Arithemtic Means Max: (88, 0.52169312169312143)
Batch 136 completed....0.76%
Geometric Means Max: (37, 0.48497939981146843)
Arithemtic Means Max: (88, 0.51995798319327702)
Batch 137 completed....0.76%
Geometric Means Max: (37, 0.48635194193640852)
Arithemtic Means Max: (88, 0.52033368091762222)
Batch 138 completed....0.77%
Geometric Means Max: (37, 0.48835316436721282)
Arithemtic Means Max: (88, 0.52173913043478226)
Batch 139 completed....0.77%
Geometric Means Max: (37, 0.48789460388738076)
Arithemtic Means Max: (88, 0.52209660842754335)
Batch 140 completed....0.78%
Geometric Means Max: (37, 0.4884456784731222)
Arithemtic Means Max: (88, 0.52142857142857113)
Batch 141 completed....0.78%
Geometric Means Max: (37, 0.48659160252802625)
Arithemtic Means Max: (88, 0.52279635258358625)
Batch 142 completed....0.79%
Geometric Means Max: (37, 0.48790874830552244)
Arithemtic Means Max: (88, 0.52414486921529135)
Batch 143 completed....0.79%
Geometric Means Max: (37, 0.48844817211423636)
Arithemtic Means Max: (88, 0.5224775224775221)
Batch 144 completed....0.80%
Geometric Means Max: (37, 0.48898068793360178)
Arithemtic Means Max: (88, 0.52281746031746001)
Batch 145 completed....0.81%
Geometric Means Max: (37, 0.48950642730722521)
Arithemtic Means Max: (88, 0.52315270935960556)
Batch 146 completed....0.81%
Geometric Means Max: (37, 0.48906091051673961)
Arithemtic Means Max: (88, 0.52348336594911904)
Batch 147 completed....0.82%
Geometric Means Max: (37, 0.48862185242354167)
Arithemtic Means Max: (88, 0.52186588921282762)
Batch 148 completed....0.82%
Geometric Means Max: (37, 0.48913897755455787)
Arithemtic Means Max: (22, 0.52027027027026995)
Batch 149 completed....0.83%
Geometric Means Max: (37, 0.48737714008749528)
Arithemtic Means Max: (22, 0.52157238734419908)
Batch 150 completed....0.83%
Geometric Means Max: (22, 0.48483654853649749)
Arithemtic Means Max: (22, 0.51999999999999968)
Batch 151 completed....0.84%
Geometric Means Max: (37, 0.48465754764330321)
Arithemtic Means Max: (22, 0.519394512771996)
Batch 152 completed....0.84%
Geometric Means Max: (37, 0.48426556365074003)
Arithemtic Means Max: (22, 0.51785714285714268)
Batch 153 completed....0.85%
Geometric Means Max: (37, 0.48387901447479009)
Arithemtic Means Max: (22, 0.51820728291316498)
Batch 154 completed....0.86%
Geometric Means Max: (22, 0.48452064189800437)
Arithemtic Means Max: (23, 0.51948051948051943)
Batch 155 completed....0.86%
Geometric Means Max: (22, 0.48413723270649006)
Arithemtic Means Max: (23, 0.51889400921658979)
Batch 156 completed....0.87%
Geometric Means Max: (35, 0.48527864113033026)
Arithemtic Means Max: (35, 0.52014652014652019)
Batch 157 completed....0.87%
Geometric Means Max: (22, 0.48585072924906419)
Arithemtic Means Max: (23, 0.52138307552320273)
Batch 158 completed....0.88%
Geometric Means Max: (22, 0.48703722588766069)
Arithemtic Means Max: (35, 0.52169981916817365)
Batch 159 completed....0.88%
Geometric Means Max: (35, 0.48657967815063752)
Arithemtic Means Max: (95, 0.52201257861635186)
Batch 160 completed....0.89%
Geometric Means Max: (35, 0.48619378122575563)
Arithemtic Means Max: (35, 0.52053571428571432)
Batch 161 completed....0.89%
Geometric Means Max: (35, 0.48668182701277329)
Arithemtic Means Max: (35, 0.520851818988465)
Batch 162 completed....0.90%
Geometric Means Max: (35, 0.4878358252086118)
Arithemtic Means Max: (35, 0.52204585537918868)
Batch 163 completed....0.91%
Geometric Means Max: (35, 0.48830940701804265)
Arithemtic Means Max: (95, 0.5232252410166518)
Batch 164 completed....0.91%
Geometric Means Max: (35, 0.48466343349493562)
Arithemtic Means Max: (95, 0.52351916376306584)
Batch 165 completed....0.92%
Geometric Means Max: (35, 0.48514741253961702)
Arithemtic Means Max: (95, 0.5238095238095235)
Batch 166 completed....0.92%
Geometric Means Max: (37, 0.48408667245780157)
Arithemtic Means Max: (95, 0.52495697074010295)
Batch 167 completed....0.93%
Geometric Means Max: (37, 0.4837337178020239)
Arithemtic Means Max: (95, 0.5252352437981177)
Batch 168 completed....0.93%
Geometric Means Max: (37, 0.48338521775530746)
Arithemtic Means Max: (95, 0.5238095238095235)
Batch 169 completed....0.94%
Geometric Means Max: (37, 0.48386405075491018)
Arithemtic Means Max: (95, 0.52409129332206217)
Batch 170 completed....0.94%
Geometric Means Max: (37, 0.48497387994136798)
Arithemtic Means Max: (95, 0.52521008403361302)
Batch 171 completed....0.95%
Geometric Means Max: (37, 0.4865917612708815)
Arithemtic Means Max: (95, 0.52464494569757691)
Batch 172 completed....0.96%
Geometric Means Max: (37, 0.48819613346490337)
Arithemtic Means Max: (95, 0.52408637873754116)
Batch 173 completed....0.96%
Geometric Means Max: (37, 0.48927125054952147)
Arithemtic Means Max: (95, 0.52353426919900869)
Batch 174 completed....0.97%
Geometric Means Max: (37, 0.49033633693005918)
Arithemtic Means Max: (95, 0.52545155993431825)
Batch 175 completed....0.97%
Geometric Means Max: (37, 0.48882535386673426)
Arithemtic Means Max: (95, 0.52326530612244859)
Batch 176 completed....0.98%
Geometric Means Max: (37, 0.48846012714730919)
Arithemtic Means Max: (95, 0.52353896103896069)
Batch 177 completed....0.98%
Geometric Means Max: (37, 0.48950999479058466)
Arithemtic Means Max: (95, 0.5238095238095235)
Batch 178 completed....0.99%
Geometric Means Max: (37, 0.48803156708560974)
Arithemtic Means Max: (95, 0.52487961476725487)
Batch 179 completed....0.99%
Geometric Means Max: (37, 0.4890711793203078)
Arithemtic Means Max: (95, 0.52593774940143612)
Batch 180 completed....1.00%
Geometric Means Max: (37, 0.48761289414808806)
Arithemtic Means Max: (95, 0.52539682539682497)
Grid search best num_train_year took 22549.8151841 seconds:
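
Each of the 180 batches in the log above is one test window of `testing_size` rows, preceded by a training window of 21 rows per training month. The sketch below shows one way to generate such window pairs as positional slices. The function name `walk_forward_windows`, its parameters and the numbers in the usage lines are assumptions made for illustration, not the notebook's own code.

```python
def walk_forward_windows(first_test_pos, n_test_rows, test_size, train_months, days_per_month=21):
    """Yield (train_slice, test_slice) position pairs for consecutive test batches."""
    n_batches = n_test_rows // test_size  # leftover rows at the end are dropped
    for batch in range(n_batches):
        test_start = first_test_pos + batch * test_size
        train_start = test_start - days_per_month * train_months
        if train_start < 0:
            raise ValueError("not enough history for the requested training window")
        # The training window ends exactly where the test window begins
        yield slice(train_start, test_start), slice(test_start, test_start + test_size)


# Hypothetical numbers: test data starts at row 300, 21 test rows, 12 training months.
windows = list(walk_forward_windows(first_test_pos=300, n_test_rows=21, test_size=7, train_months=12))
for train, test in windows:
    print(train, test)
```

Because the training slice always ends where the test slice starts, no future rows leak into the fit, which is the point of sliding the window forward batch by batch.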




In :

print(geo_means)
print(sorted(range(len(geo_means)), key=lambda k: geo_means[k], reverse=True))

print(arithmetic_means)
print(sorted(range(len(arithmetic_means)), key=lambda k: arithmetic_means[k], reverse=True))

plt.figure()
plt.plot(geo_means)
plt.figure()
plt.plot(arithmetic_means)




[ 0.46941488  0.          0.46812252  0.          0.          0.          0.
0.          0.46291511  0.47041653  0.46304303  0.          0.45957992
0.          0.4666322   0.47626914  0.          0.48214285  0.          0.
0.          0.48211332  0.          0.          0.48167617  0.          0.
0.          0.          0.47306257  0.4742388   0.          0.46818833
0.          0.48574901  0.47116956  0.48761289  0.          0.45552042
0.47488147  0.          0.          0.          0.47865849  0.46048357
0.          0.46888289  0.          0.          0.45852544  0.
0.45188303  0.45495606  0.45458731  0.          0.          0.45398097
0.          0.          0.          0.47279696  0.44708424  0.47766806
0.          0.45651616  0.          0.47028094  0.          0.          0.
0.44763883  0.          0.47323469  0.          0.          0.45097461
0.46573407  0.44087702  0.46764007  0.4366769   0.          0.          0.
0.          0.          0.46308693  0.45887766  0.          0.          0.
0.45345877  0.          0.47053841  0.46407285  0.          0.46038203
0.          0.          0.          0.          0.          0.45663992
0.45515121  0.          0.46483863  0.          0.          0.45141788
0.          0.          0.44604973  0.          0.          0.
0.47038701  0.          0.          0.4573022   0.46905104  0.45768709]
[36, 34, 17, 21, 24, 43, 62, 15, 39, 30, 72, 29, 60, 35, 92, 9, 114, 66, 0, 118, 46, 32, 2, 78, 14, 76, 104, 93, 85, 10, 8, 44, 95, 12, 86, 49, 119, 117, 101, 64, 38, 102, 52, 53, 56, 90, 51, 107, 75, 70, 61, 110, 77, 79, 1, 3, 4, 5, 6, 7, 11, 13, 16, 18, 19, 20, 22, 23, 25, 26, 27, 28, 31, 33, 37, 40, 41, 42, 45, 47, 48, 50, 54, 55, 57, 58, 59, 63, 65, 67, 68, 69, 71, 73, 74, 80, 81, 82, 83, 84, 87, 88, 89, 91, 94, 96, 97, 98, 99, 100, 103, 105, 106, 108, 109, 111, 112, 113, 115, 116]
[ 0.50079365  0.51746032  0.50634921  0.50714286  0.48253968  0.50634921
0.50555556  0.48968254  0.4952381   0.5015873   0.50238095  0.51507937
0.49444444  0.49603175  0.50634921  0.51111111  0.51269841  0.51269841
0.51984127  0.50714286  0.50079365  0.51825397  0.51984127  0.4984127
0.51111111  0.49761905  0.49761905  0.50238095  0.51349206  0.51031746
0.51111111  0.50634921  0.50793651  0.49761905  0.52142857  0.51031746
0.51904762  0.49206349  0.49285714  0.51031746  0.5         0.49047619
0.49285714  0.51269841  0.49603175  0.49920635  0.5031746   0.50714286
0.49761905  0.49444444  0.49761905  0.48730159  0.49285714  0.49444444
0.49206349  0.4952381   0.49126984  0.48888889  0.49444444  0.50396825
0.50952381  0.48174603  0.51190476  0.49761905  0.49603175  0.51349206
0.50396825  0.48888889  0.48809524  0.48095238  0.48650794  0.5015873
0.51031746  0.47777778  0.50873016  0.48730159  0.5015873   0.48492063
0.50555556  0.47063492  0.49603175  0.49603175  0.49761905  0.47460317
0.50634921  0.49920635  0.49444444  0.51269841  0.49126984  0.49603175
0.49365079  0.49920635  0.5015873   0.4984127   0.52539683  0.49603175
0.49444444  0.45952381  0.48888889  0.47301587  0.49920635  0.49365079
0.49603175  0.5         0.50079365  0.51349206  0.48571429  0.48968254
0.47063492  0.49761905  0.48174603  0.47857143  0.46825397  0.49126984
0.50079365  0.49047619  0.46666667  0.4968254   0.50396825  0.49365079]
[94, 34, 18, 22, 36, 21, 1, 11, 105, 65, 28, 17, 43, 87, 16, 62, 24, 30, 15, 29, 72, 39, 35, 60, 74, 32, 47, 19, 3, 2, 5, 84, 31, 14, 6, 78, 118, 59, 66, 46, 27, 10, 92, 71, 76, 9, 114, 0, 104, 20, 103, 40, 85, 45, 91, 100, 93, 23, 48, 82, 63, 26, 25, 33, 109, 50, 117, 64, 80, 13, 89, 95, 44, 81, 102, 55, 8, 53, 49, 58, 86, 12, 96, 101, 90, 119, 38, 52, 42, 54, 37, 113, 88, 56, 115, 41, 7, 107, 67, 98, 57, 68, 51, 75, 70, 106, 77, 4, 110, 61, 69, 111, 73, 83, 99, 79, 108, 112, 116, 97]

Out:

[<matplotlib.lines.Line2D at 0x7fb7b4e1c790>]
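
The many zeros in `geo_means` follow from how a geometric mean works: a single batch with zero accuracy makes the whole product zero, whereas the arithmetic mean only drops a little. The toy numbers below are hypothetical and only illustrate this effect for two training sizes.

```python
import numpy as np

# Hypothetical per-batch accuracies (rows = batches, columns = two training sizes).
accs = np.array([
    [6 / 7, 3 / 7],
    [0.0,   4 / 7],
    [6 / 7, 3 / 7],
])

# Arithmetic mean: sum over batches divided by the number of batches
arithmetic = accs.mean(axis=0)
# Geometric mean: product over batches, then the n-th root
geometric = np.prod(accs, axis=0) ** (1.0 / accs.shape[0])

print("arithmetic:", arithmetic, "best size index:", np.argmax(arithmetic))
print("geometric: ", geometric, "best size index:", np.argmax(geometric))
```

Here the first training size wins on the arithmetic mean but scores zero on the geometric mean, so the two criteria can disagree, as they do in the run above (37 months versus 95 months). Note that the sorted index lists printed above are zero-based, so index 36 corresponds to 37 training months.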