Looking at simple ML approaches to daily equity data from quandl.
(https://www.quandl.com/data/WIKI/documentation/bulk-download)
Initial focus is on a single name model. That is, the feature space only has one equity in it at a time.
First, let's get some data & see what it looks like.
In [6]:
# imports
import pandas as pd
import numpy as np
from scipy import stats
import sklearn
from sklearn import preprocessing as pp
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import interactive
import sys
import tensorflow as tf
import time
import os
import os.path
import pickle
import logging as log
log.basicConfig(level=log.DEBUG)
# load simple trading simulator
import sim
In [7]:
# obviously, change this to wherever you downloaded the file
dlfile = '~/WIKI-eod/WIKI_20161106.csv'
daily = pd.read_csv(dlfile,header=0)
log.info('Read #%d records',len(daily.index));
daily.columns = ['Sym', 'Date', 'Open','High','Low','Close',
'Volume','ExDiv','SplitRatio','AdjOpen',
'AdjHigh','AdjLow','AdjClose','AdjVolume']
In [8]:
daily.head()
Out[8]:
In [9]:
daily.shape
Out[9]:
In [10]:
# We don't care about the dividends or splits - just the adjusted values
# discard non-adjusted values, index & rename
u = daily.iloc[:, [0, 1, 9, 10, 11, 12, 13]]
u.set_index('Date', inplace=True)
u.columns = ['Sym', 'Open','High','Low','Close','Volume']
# get rid of oldest data
u = u[u.index > '2000-01-01']
daily = None # GC, please
u.describe()
Out[10]:
In [11]:
#how many names?
len(u.Sym.unique())
Out[11]:
Data is free and dirty. Let's clean it up and get rid of low liquidity symbols.
In [32]:
# calculate each name's median daily dollar volume and look at its distribution
addvs = u.groupby('Sym').apply(lambda x: (x.Volume * x.Close).median())
addvs = addvs.sort_values()
np.percentile(addvs, [50, 75, 90])
addvs.hist()
Out[32]:
In [13]:
# let's get rid of names trading less than $10M/day
goners = addvs[addvs < 1e7]
u = u[ np.logical_not(u.Sym.isin(goners.index))]
#how many names?
len(u.Sym.unique())
Out[13]:
In [14]:
len(goners)
Out[14]:
So far, we've cut away at our data pretty severely by date and liquidity, reducing the number of names from 3,182 to 1,185. We may want to swing back and use some of the discarded data for training purposes later...
In [15]:
bysz = u.groupby('Sym').size()
bysz.head()
bysz.sort_values(inplace=True)
#bysz.quantiles([.25,.5,.75])
np.percentile(bysz,[25,50,75])
#bysz.tail()
Out[15]:
In [16]:
u.describe()
Out[16]:
In [17]:
# replace zeros with NaNs so we can forward-fill them later
u.replace(0,np.nan, inplace=True)
u.replace([np.inf, -np.inf], np.nan,inplace=True)
u.describe()
Out[17]:
In [18]:
# need to examine our nans...
badz = u[u.isnull().any(axis=1)]
badz.describe()
badcount = badz.groupby('Sym').size()
np.sort(badcount)
#np.percentile(badcount,[25,50,75])
Out[18]:
In [19]:
# get rid of anyone missing more than 100 observations
goners = badcount[badcount>100].index
u = u[ np.logical_not(u.Sym.isin(goners))]
len(u.Sym.unique())
Out[19]:
Next we filter down to a uniform set of names that have traded continuously from 2000-01-01 through the end of the dataset. This is a mixed blessing: it gives us a 'square' dataset of known tradable instruments, but it also throws away a lot of good data, eliminates newer names, and introduces survivorship bias...
It's also slow...
All of that said, we're going to do it so we know we have a complete and tradable dataset. We can revisit this as we get deeper into strategy development if needed.
In [20]:
Z = u.groupby('Sym').filter( lambda x: x.index.min() == u.index.min() and x.index.max() == u.index.max())
Z.shape
Out[20]:
Ok, we have a mostly usable dataset. Let's calculate some values, take a quick look at the data and see if we can run it in the simulator.
In [21]:
# let's verify that we can use Z in our sim environment
# fill in missing data
Z = sim.squarem(Z)
# create a universe suitable for simulation
Ubig = sim.prep_univ( Z.index, Z.Sym, Z.Open, Z.High, Z.Low, Z.Close, Z.Volume, Z.Sym)
Ubig.shape
Out[21]:
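sim.squarem isn't reproduced here; conceptually it reindexes each symbol onto the full date grid and forward-fills the gaps. A rough sketch of that idea, written against a frame shaped like Z (Date index plus a Sym column) and only an assumption about squarem's behavior, not its actual code:
# sketch only: approximate per-symbol "square" fill
def square_fill(df):
    dates = df.index.unique().sort_values()          # full trading calendar present in the data
    filled = []
    for sym, g in df.groupby('Sym'):
        g = g[~g.index.duplicated()].reindex(dates)  # one row per date for this symbol
        g['Sym'] = sym
        filled.append(g.ffill())                     # carry the last observation forward
    return pd.concat(filled)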
In [22]:
Ubig.describe()
Out[22]:
Let's look at some of the names in the universe.
In [23]:
S = Ubig[Ubig.Sym=='CCK'].Close
type(S)
S.plot(legend=True)
#Ubig[Ubig.Sym=='AAPL'].Close.plot()
#Ubig[Ubig.Sym=='TDW'].Close.plot()
#plt.legend()
Out[23]:
Let's run a simple hold-the-universe-in-equal-weight strategy, just to see that the results look sensible-ish.
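The default strategy's code isn't reproduced in this notebook. In the same shape as the random strategy shown further below, it presumably looks something like this sketch (assuming, as that example does, that the universe frame U carries Sym and Weight columns the simulator consumes):
# sketch only: an equal-weight strategy in the sim framework (not sim's actual default)
def equal_weight_strat(U, cfg, kvargs):
    nnames = U.Sym.nunique()
    U.Weight = 1.0 / nnames   # identical weight in every name, rebalanced daily
    return U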
In [24]:
# run simulation, capturing balances and discarding trading activity details
# the default strategy simply buys the universe in equal weight and re-weights daily
_,B = sim.sim(Ubig)
# plot NAV
B.NAV.plot()
Out[24]:
Apparently indexing works. Now that we've munged the data sufficiently, let's store the Universe for future use.
In [25]:
with open('U.pkl', 'wb') as f:
    pickle.dump(Ubig, f)
log.info('Wrote U.pkl')
len(Ubig.Sym.unique())
Out[25]:
OK, let's look at a few baseline strategies before we see if we can improve any of them with ML.
Trading 600+ names a day is a pain. Can we replicate the performance with fewer names?
Assuming we have no edge, what would it look like to trade an equal-weighted basket of 10 RANDOM names every day? Let's simulate 10 iterations of the strategy and see what it looks like both individually and in aggregate.
In [26]:
# for reference, this is the random-portfolio strategy that sim.rtest runs:
#def random_strat( U, cfg, kvargs ) :
# random portfolio strategy: picks 'num_names' randomly
# nnames = kvargs.pop('num_names',10)
# names = random.sample(U.Sym, nnames )
# U.Weight = np.where( U.Sym.isin( names ), 1/float(nnames), 0 )
# return U
# let's run it 10 times
N = sim.rtest(Ubig, runs=10)
In [27]:
# looking at them both together yields a pretty neat result
nm = N.mean(1)
both = pd.DataFrame( {'eq_wt':B.NAV, 'avg_rndm': nm})
both.plot()
Out[27]:
Makes sense, but it's a cute result anyway: any single instance of the random strategy varies meaningfully from the index, but in aggregate they tend toward its performance.
Holding the entire market on an equal-weighted basis is a fine indexing strategy, and holding a random 10 names on an equal-weighted basis tends to replicate its performance.
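To see the dispersion directly (assuming, as the N.mean(1) call above suggests, that N holds one NAV series per run), plot the ten individual paths next to their average:
# each random portfolio wanders, but their mean hugs the equal-weight index
ax = N.plot(alpha=0.3, legend=False)
N.mean(1).plot(ax=ax, color='k', linewidth=2)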
Can we find a selection criterion that improves on this?
Let's try some simple things: first, a strategy that holds the prior day's 10 biggest winners, then one that holds the prior day's biggest losers.
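sim.best_strat and sim.worst_strat are used below without showing their code. A rough sketch of the idea, in the same style as the random strategy above, might look like this (the 'Ret1' prior-day-return column and the exact selection mechanics are assumptions, not sim's actual implementation):
# sketch only: buy yesterday's biggest winners in equal weight
def best_strat_sketch(U, cfg, kvargs):
    nnames = kvargs.pop('num_names', 10)
    winners = U.sort_values('Ret1', ascending=False).Sym.head(nnames)
    U.Weight = np.where(U.Sym.isin(winners), 1.0 / nnames, 0)
    return U
# a worst_strat sketch would be identical except it sorts ascending to pick the biggest losers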
In [28]:
best_strat = sim.best_strat
_,BBB = sim.sim(Ubig,sim_FUN=best_strat)
BBB.NAV.plot()
Out[28]:
In [29]:
nm = N.mean(1)
#bb=BuyBest.mean(1)
three = pd.DataFrame( {'eq_wt':B.NAV, 'avg_rndm': nm, 'best_buy':BBB.NAV})
three.plot()
Out[29]:
Buying winners wins big when it works, but it will take you on a pretty wrenching ride, and it would have blown you up if you had traded it through 2000.
Now let's look at buying the prior day's worst names.
In [30]:
worst_strat = sim.worst_strat
_,BWB = sim.sim(Ubig,sim_FUN=worst_strat)
BWB.NAV.plot()
Out[30]:
In [31]:
four = pd.DataFrame( {'eq_wt':B.NAV, 'avg_rndm': N.mean(1), 'best_buy':BBB.NAV, 'worst_buy':BWB.NAV})
four.plot()
four.plot(loglog=True)
Out[31]:
Looks like mean reversion is a thing at these frequencies... fair enough.
To summarize our progress: we cleaned and filtered the raw WIKI data down to a square, liquid, tradable universe, verified it in the simulator, and established baselines for equal-weight indexing, random 10-name portfolios, buying the prior day's winners, and buying the prior day's losers, with the mean-reversion flavor looking best at this daily frequency.
Next, we'll take a look at training ML models on this dataset and see if we can improve on any of these simple strategies...