Daily Equity Notes

Looking at simple ML approaches to daily equity data from Quandl.

(https://www.quandl.com/data/WIKI/documentation/bulk-download)

The initial focus is on a single-name model; that is, the feature space only contains one equity at a time.
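
To make that concrete, here's a rough sketch of what a single-name feature frame might look like: one symbol's rows, a few lagged returns, and a next-day label. The helper is hypothetical and written against the universe frame we build further down (Ubig, with Sym and Close columns), not something that exists yet at this point.

# hypothetical sketch: pull one symbol out of a universe frame U and turn it
# into lagged-return features plus a next-day label
def single_name_features(U, sym, lags=(1, 2, 3, 5)):
    s = U[U.Sym == sym].sort_index()
    ret = s.Close.pct_change()
    f = pd.DataFrame(index=s.index)
    for k in lags:
        f['ret_lag%d' % k] = ret.shift(k)
    f['target'] = ret.shift(-1)   # tomorrow's return as the label
    return f.dropna()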

First, let's get some data & see what it looks like.


In [6]:
# imports
import pandas as pd
import numpy as np
from scipy import stats

import sklearn
from sklearn import preprocessing as pp

import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import interactive
import sys
import tensorflow as tf
import time
import os
import os.path
import pickle
import logging as log
log.basicConfig(level=log.DEBUG)

# load simple trading simulator
import sim

In [7]:
# obviously, change this to wherever you downloaded the file
dlfile = '~/WIKI-eod/WIKI_20161106.csv'
daily = pd.read_csv(dlfile,header=0)
log.info('Read #%d records',len(daily.index));
daily.columns = ['Sym', 'Date', 'Open', 'High', 'Low', 'Close',
                 'Volume', 'ExDiv', 'SplitRatio', 'AdjOpen',
                 'AdjHigh', 'AdjLow', 'AdjClose', 'AdjVolume']


INFO:root:Read #14437950 records

In [8]:
daily.head()


Out[8]:
Sym Date Open High Low Close Volume ExDiv SplitRatio AdjOpen AdjHigh AdjLow AdjClose AdjVolume
0 A 1999-11-18 45.50 50.00 40.00 44.00 44739900.0 0.0 1.0 43.365904 47.654840 38.123872 41.936259 44739900.0
1 A 1999-11-19 42.94 43.00 39.81 40.38 10897100.0 0.0 1.0 40.925976 40.983162 37.942784 38.486049 10897100.0
2 A 1999-11-22 41.31 44.00 40.06 44.00 4705200.0 0.0 1.0 39.372429 41.936259 38.181058 41.936259 4705200.0
3 A 1999-11-23 42.50 43.63 40.25 40.25 4274400.0 0.0 1.0 40.506614 41.583613 38.362146 38.362146 4274400.0
4 A 1999-11-24 40.13 41.94 40.00 41.06 3464400.0 0.0 1.0 38.247774 39.972880 38.123872 39.134155 3464400.0

In [9]:
daily.shape


Out[9]:
(14437950, 14)

In [10]:
# We don't care about the dividends or splits - just the adjusted values

# discard non-adjusted values, index & rename
u = daily.iloc[:, [0, 1, 9, 10, 11, 12, 13]]
u.set_index('Date', inplace=True)
u.columns = ['Sym', 'Open','High','Low','Close','Volume']

# get rid of oldest data
u = u[u.index > '2000-01-01']

daily = None # GC, please

u.describe()


Out[10]:
Open High Low Close Volume
count 1.040978e+07 1.041045e+07 1.041045e+07 1.041072e+07 1.041076e+07
mean 8.853263e+01 8.992543e+01 8.779834e+01 8.888673e+01 1.526352e+06
std 2.617280e+03 2.650242e+03 2.612497e+03 2.632222e+03 6.690106e+06
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 9.281144e+00 9.461216e+00 9.097764e+00 9.280000e+00 6.767100e+04
50% 1.777000e+01 1.805985e+01 1.746421e+01 1.777000e+01 2.842000e+05
75% 3.247877e+01 3.295080e+01 3.198269e+01 3.247981e+01 1.044300e+06
max 2.281800e+05 2.293740e+05 2.275300e+05 2.293000e+05 2.304019e+09

In [11]:
#how many names?
len(u.Sym.unique())


Out[11]:
3183

Data is free and dirty. Let's clean it up and get rid of low liquidity symbols.


In [32]:
# calculate each name's median daily $ volume and look at the distribution
addvs = u.groupby('Sym').apply( lambda x : (x.Volume * x.Close).median())
addvs = addvs.sort_values()
np.percentile(addvs, [50, 75, 90])
addvs.hist()


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6447a5d390>

In [13]:
# let's get rid of items less than $10m/day
goners = addvs[addvs < 1e7]
u = u[ np.logical_not(u.Sym.isin(goners.index))]
#how many names?
len(u.Sym.unique())


Out[13]:
1188

In [14]:
len(goners)


Out[14]:
1995

So far, we've cut away at our data pretty severely, filtering by date and liquidity and reducing the number of names from 3183 to 1188. We may want to swing back and use some of the discarded data for training purposes later...


In [15]:
bysz = u.groupby('Sym').size()
bysz.head()
bysz.sort_values(inplace=True)
#bysz.quantile([.25, .5, .75])
np.percentile(bysz,[25,50,75])

#bysz.tail()


Out[15]:
array([ 2668.5,  4224. ,  4228. ])

In [16]:
u.describe()


Out[16]:
Open High Low Close Volume
count 4.036149e+06 4.036166e+06 4.036166e+06 4.036179e+06 4.036204e+06
mean 1.614391e+02 1.637134e+02 1.609761e+02 1.624093e+02 3.496761e+06
std 4.087227e+03 4.136628e+03 4.084709e+03 4.112169e+03 1.039325e+07
min 1.455814e-02 2.170000e-02 1.455814e-02 2.050000e-02 0.000000e+00
25% 1.543115e+01 1.568740e+01 1.517000e+01 1.543491e+01 5.631000e+05
50% 2.667623e+01 2.706462e+01 2.627319e+01 2.667979e+01 1.308124e+06
75% 4.491113e+01 4.548288e+01 4.432000e+01 4.492100e+01 3.179400e+06
max 2.281800e+05 2.293740e+05 2.275300e+05 2.293000e+05 2.304019e+09

In [17]:
# replace zeros with NaNs so we can forward-fill them later
u.replace(0,np.nan, inplace=True)
u.replace([np.inf, -np.inf], np.nan,inplace=True)
u.describe()


Out[17]:
Open High Low Close Volume
count 4.036149e+06 4.036166e+06 4.036166e+06 4.036179e+06 4.035418e+06
mean 1.614391e+02 1.637134e+02 1.609761e+02 1.624093e+02 3.497442e+06
std 4.087227e+03 4.136628e+03 4.084709e+03 4.112169e+03 1.039415e+07
min 1.455814e-02 2.170000e-02 1.455814e-02 2.050000e-02 1.600000e+01
25% 1.543115e+01 1.568740e+01 1.517000e+01 1.543491e+01 5.634000e+05
50% 2.667623e+01 2.706462e+01 2.627319e+01 2.667979e+01 1.308600e+06
75% 4.491113e+01 4.548288e+01 4.432000e+01 4.492100e+01 3.180000e+06
max 2.281800e+05 2.293740e+05 2.275300e+05 2.293000e+05 2.304019e+09

In [18]:
# need to examine our nans...
badz = u[u.isnull().any(axis=1)]
badz.describe()
badcount = badz.groupby('Sym').size()
np.sort(badcount)
#np.percentile(badcount,[25,50,75])


Out[18]:
array([  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   1,   1,   1,   1,   2,   2,   2,   2,   2,
         2,   2,   2,   2,   2,   3,   3,   3,   3,   4,   4,   6,   6,
         6,   6,   7,   7,   9,   9,  10,  10,  17,  18,  18,  18,  19,
        19,  20,  20,  22,  26,  26,  28,  31,  31,  35,  37,  67, 182])

In [19]:
# get rid of anyone missing more than 100 observations
goners = badcount[badcount>100].index
u = u[ np.logical_not(u.Sym.isin(goners))]
len(u.Sym.unique())


Out[19]:
1187

The following gives us a uniform set of names that have been trading since 2000-01-01 and continue trading up until today. This is a mixed blessing: it lets us build a 'square' dataset of known tradable instruments, but it also throws away a lot of good data, drops newer names and introduces survivorship bias...

It's also slow...

All of that said, we're going to do it so we know we have a complete and tradable dataset. We can revisit this decision as we get deeper into strategy development.
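
(For what it's worth, a usually-faster equivalent is sketched here. It's untested against this exact frame, but computing each symbol's date span once should match the Z built by the groupby/filter below without calling the lambda per group.)

# hedged sketch of a faster equivalent: compute each symbol's first/last date
# once, then keep symbols whose span matches the whole frame's span
spans = u.reset_index().groupby('Sym').Date.agg(['min', 'max'])
keep = spans[(spans['min'] == u.index.min()) & (spans['max'] == u.index.max())].index
Z_alt = u[u.Sym.isin(keep)]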


In [20]:
Z = u.groupby('Sym').filter( lambda x: x.index.min() == u.index.min() and x.index.max() == u.index.max())
Z.shape


Out[20]:
(2714810, 6)

Ok, we have a mostly usable dataset. Let's calculate some values, take a quick look at the data and see if we can run it in the simulator.
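
sim.prep_univ's internals aren't shown in this note; judging purely from the column names in the output below and the rolling(window=20) warnings it emits, the derived columns are presumably built along these lines (a guess, not the actual implementation):

# hedged guesses (NOT sim.prep_univ's code) at how a few derived columns could be built
g = Z.groupby('Sym')
ret = g.Close.transform(lambda c: c.pct_change())                          # ~ 'Return'
adv = g.Volume.transform(lambda v: v.rolling(20, min_periods=1).mean())    # ~ 'ADV'
sd  = g.Close.transform(lambda c: c.pct_change()
                                   .rolling(20, min_periods=1).std())      # ~ 'SD'
fwd = g.Close.transform(lambda c: c.pct_change().shift(-1))                # ~ 'Fwd_Return'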


In [21]:
# let's verify that we can use Z in our sim environment
# fill in missing data
Z = sim.squarem(Z)

# create a universe suitable for simulation
Ubig = sim.prep_univ( Z.index, Z.Sym, Z.Open, Z.High, Z.Low, Z.Close, Z.Volume, Z.Sym)
Ubig.shape


INFO:root:Got rid of 0/642 symbols
/usr/lib64/python2.7/site-packages/pandas/core/groupby.py:685: FutureWarning: pd.rolling_mean is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(min_periods=1,window=20,center=False).mean()
  return func(g, *args, **kwargs)
/usr/lib64/python2.7/site-packages/pandas/core/groupby.py:685: FutureWarning: pd.rolling_std is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(min_periods=1,window=20,center=False).std()
  return func(g, *args, **kwargs)
Out[21]:
(2725932, 19)

In [22]:
Ubig.describe()


Out[22]:
Multiplier Open High Low Close Volume ADV DeltaV Return Fwd_Close Fwd_Return SD Fwd_Open Fwd_COReturn
count 2725932.0 2.725932e+06 2.725932e+06 2.725932e+06 2.725932e+06 2.725932e+06 2.725931e+06 2.725931e+06 2.725931e+06 2.725932e+06 2.725932e+06 2.725929e+06 2.725932e+06 2.725932e+06
mean 1.0 2.226893e+02 2.243057e+02 2.208029e+02 2.226456e+02 4.025552e+06 4.023279e+06 2.661003e-04 2.940238e-04 2.227142e+02 2.940263e-04 2.206396e-02 2.227574e+02 3.153228e-04
std 0.0 5.015723e+03 5.042574e+03 4.979409e+03 5.012831e+03 1.212138e+07 1.086728e+07 4.786192e-01 2.754559e-02 5.014403e+03 2.754558e-02 1.668339e-02 5.017295e+03 1.402322e-02
min 1.0 1.455814e-02 7.864865e-02 1.455814e-02 5.898649e-02 4.300000e+01 1.000000e+02 -9.486203e+00 -3.570812e+00 5.898649e-02 -3.570812e+00 0.000000e+00 1.455814e-02 -9.997442e-01
25% 1.0 1.536626e+01 1.560987e+01 1.511146e+01 1.536896e+01 6.116000e+05 6.956722e+05 -2.777810e-01 -1.029169e-02 1.537259e+01 -1.029169e-02 1.215633e-02 1.537000e+01 -3.631412e-03
50% 1.0 2.655789e+01 2.693000e+01 2.617679e+01 2.656049e+01 1.462800e+06 1.580758e+06 -1.145079e-02 2.200946e-04 2.656879e+01 2.200946e-04 1.756784e-02 2.656501e+01 0.000000e+00
75% 1.0 4.509414e+01 4.563840e+01 4.454000e+01 4.510266e+01 3.598300e+06 3.713368e+06 2.617968e-01 1.102714e-02 4.511558e+01 1.102714e-02 2.636890e-02 4.510598e+01 4.241282e-03
max 1.0 2.281800e+05 2.293740e+05 2.275300e+05 2.293000e+05 2.304019e+09 6.079510e+08 1.032482e+01 1.449269e+00 2.293000e+05 1.449269e+00 8.122759e-01 2.281800e+05 3.078370e+00

Let's look at some of the names in the universe.


In [23]:
S = Ubig[Ubig.Sym=='CCK'].Close
type(S)
S.plot(legend=True)

#Ubig[Ubig.Sym=='AAPL'].Close.plot()
#Ubig[Ubig.Sym=='TDW'].Close.plot()
#plt.legend()


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f646896bed0>

Let's run a simple hold-the-universe-in-equal-weight strategy just to see that it looks sensible-ish.
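
The default strategy itself isn't shown here; judging from the strategy interface visible in the commented-out random strategy further down (a function that takes U, cfg and kvargs and sets U.Weight), it's presumably something like this sketch:

# hedged sketch (not sim's actual default): equal weight across every name, every day
def eq_wt_strat(U, cfg, kvargs):
    U.Weight = 1.0 / float(len(U.Sym.unique()))
    return U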


In [24]:
# run simulation, capturing balances and discarding trading activity details
# the default strategy simply buys the universe in equal weight and re-weights daily
_,B = sim.sim(Ubig)
# plot NAV
B.NAV.plot()


INFO:root:ran over 4246 days and 2725932 rows in 159 secs
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f646896f790>

Apparently indexing works. Now that we've munged the data sufficiently, let's store the Universe for future use.


In [25]:
with open('U.pkl', 'wb') as f:
    pickle.dump(Ubig, f)
log.info('Wrote U.pkl')
len(Ubig.Sym.unique())


INFO:root:Wrote U.pkl
Out[25]:
642

OK, let's look at a few baseline strategies before we see if we can improve any of them with ML.

Trading 600+ names a day is a pain. Can we replicate the performance with fewer names?

Assuming we have no edge, what would it look like to trade an equal-weighted basket of 10 RANDOM names every day? Let's simulate 10 iterations of the strategy and see what it looks like individually and in aggregate.


In [26]:
# for reference, this is the strategy (commented out here; sim.rtest presumably runs its own copy)
#def random_strat( U, cfg, kvargs ) :
    # random portfolio strategy: picks 'num_names' randomly
#    nnames = kvargs.pop('num_names',10)
#    names = random.sample(U.Sym, nnames )
#    U.Weight = np.where( U.Sym.isin( names ), 1/float(nnames), 0 )                 
#    return U

# let's run it 10 times 
N = sim.rtest(Ubig, runs=10)


INFO:root:ran over 4246 days and 2725932 rows in 154 secs
INFO:root:ran over 4246 days and 2725932 rows in 162 secs
INFO:root:ran over 4246 days and 2725932 rows in 159 secs
INFO:root:ran over 4246 days and 2725932 rows in 159 secs
INFO:root:ran over 4246 days and 2725932 rows in 159 secs
INFO:root:ran over 4246 days and 2725932 rows in 148 secs
INFO:root:ran over 4246 days and 2725932 rows in 138 secs
INFO:root:ran over 4246 days and 2725932 rows in 152 secs
INFO:root:ran over 4246 days and 2725932 rows in 148 secs
INFO:root:ran over 4246 days and 2725932 rows in 164 secs

In [27]:
# looking at them both together yields a pretty neat result
nm = N.mean(1)
both = pd.DataFrame( {'eq_wt':B.NAV, 'avg_rndm': nm})
both.plot()


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f643c065ed0>

Makes sense, but cute anyway: any given instance of the random strategy will vary meaningfully from the index, but in aggregate they will tend toward the performance of the index.

Holding the entire market on an equal-weighted basis is a fine indexing strategy. Holding a random 10 names on an equal-weighted basis tends to replicate its performance.
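
A quick, hedged sanity check of that claim straight from the daily returns in Ubig (assuming Ubig is indexed by date, as the plots above suggest, and ignoring the sim's trading mechanics and costs):

# compare the equal-weight 'index' daily return with a random 10-name basket's daily return
import random
idx_ret = Ubig.groupby(level=0).Return.mean()
basket  = Ubig[Ubig.Sym.isin(random.sample(list(Ubig.Sym.unique()), 10))]
bkt_ret = basket.groupby(level=0).Return.mean()
log.info('index mean %.6f, basket mean %.6f, daily corr %.3f',
         idx_ret.mean(), bkt_ret.mean(), idx_ret.corr(bkt_ret))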

Can we find a selection criterion that improves on this for us?

Let's try some simple things: first, a strategy that holds the prior day's 10 biggest winners, and then one that holds the prior day's biggest losers.
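
Both strategies live in the sim module; a hedged guess at their shape, reusing the U/cfg/kvargs interface from the random strategy above and assuming the sim hands the strategy each day's rows with trailing Return and applies Weight going forward, might look like:

# hedged guess (not sim.best_strat / sim.worst_strat): equal-weight the prior
# day's top (or bottom) n names by trailing Return
def top_n_strat(U, cfg, kvargs, n=10, best=True):
    names = U.sort_values('Return', ascending=not best).Sym.head(n)
    U.Weight = np.where(U.Sym.isin(names), 1.0 / n, 0.0)
    return U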


In [28]:
best_strat = sim.best_strat
_,BBB = sim.sim(Ubig,sim_FUN=best_strat)
BBB.NAV.plot()


INFO:root:ran over 4246 days and 2725932 rows in 110 secs
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6448953210>

In [29]:
nm = N.mean(1)
#bb=BuyBest.mean(1)
three = pd.DataFrame( {'eq_wt':B.NAV, 'avg_rndm': nm, 'best_buy':BBB.NAV})
three.plot()


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f644894fe50>

Buying winners wins big when it works, but it will take you on a pretty wrenching ride and would have blown you up if you had traded it through 2000.

Now let's look at buying the prior day's worst names.


In [30]:
worst_strat = sim.worst_strat
_,BWB = sim.sim(Ubig,sim_FUN=worst_strat)
BWB.NAV.plot()


INFO:root:ran over 4246 days and 2725932 rows in 167 secs
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f64480f5a10>

In [31]:
four = pd.DataFrame( {'eq_wt':B.NAV, 'avg_rndm': N.mean(1), 'best_buy':BBB.NAV, 'worst_buy':BWB.NAV})
four.plot()
four.plot(loglog=True)


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6448185390>

Looks like mean reversion is a thing at these frequencies... Fair enough.
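
One quick, hedged way to eyeball that: the cross-sectional median of each name's lag-1 return autocorrelation, which should come out negative if daily mean reversion is real in this universe.

# lag-1 autocorrelation of daily returns, per symbol
ac1 = Ubig.groupby('Sym').Return.apply(lambda r: r.autocorr(lag=1))
log.info('median lag-1 return autocorrelation: %f', ac1.median())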

So, let's summarize our progress.

  • we pulled the free equity dataset off Quandl manually...
  • then munged it into a form suitable for simple simulations
  • we ran some simple, portfolio-oriented daily equity strategies on the simulator

Next, we'll take a look at training ML models with this dataset and see if we can improve on any of our simple strategies...
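
As a hedged preview of how that could start (column names taken from Ubig above; the single-name framing and feature choice are purely illustrative, not a committed design):

# reload the saved universe and frame a naive single-name, next-day prediction problem
with open('U.pkl', 'rb') as f:
    U = pickle.load(f)
feats = ['Return', 'DeltaV', 'SD']      # trailing features already in the frame
label = 'Fwd_Return'                    # next-day return, also already in the frame
one = U[U.Sym == 'AAPL'].dropna(subset=feats + [label])
X, y = one[feats].values, one[label].values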