Daily Equity Notes

Looking at simple ML approaches to daily equity data from Quandl.

(https://www.quandl.com/data/WIKI/documentation/bulk-download)

The initial focus is on a single-name model; that is, the feature space only contains one equity at a time.
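
To make that concrete, here's a rough sketch of what a single-name feature frame might look like: one symbol's rows, a few lagged returns, and a next-day label. The helper is hypothetical and written against the universe frame we build further down (Ubig, with Sym and Close columns), not something that exists yet at this point.

# hypothetical sketch: pull one symbol out of a universe frame U and turn it
# into lagged-return features plus a next-day label
def single_name_features(U, sym, lags=(1, 2, 3, 5)):
    s = U[U.Sym == sym].sort_index()
    ret = s.Close.pct_change()
    f = pd.DataFrame(index=s.index)
    for k in lags:
        f['ret_lag%d' % k] = ret.shift(k)
    f['target'] = ret.shift(-1)   # tomorrow's return as the label
    return f.dropna()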

First, let's get some data & see what it looks like.


In [6]:
# imports
import pandas as pd
import numpy as np
from scipy import stats

import sklearn
from sklearn import preprocessing as pp

import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import interactive
import sys
import tensorflow as tf
import time
import os
import os.path
import pickle
import logging as log
log.basicConfig(level=log.DEBUG)

# load simple trading simulator
import sim

In [7]:
# obviously, change this to wherever you downloaded the file
dlfile = '~/WIKI-eod/WIKI_20161106.csv'
daily = pd.read_csv(dlfile,header=0)
log.info('Read #%d records',len(daily.index));
daily.columns = ['Sym', 'Date', 'Open', 'High', 'Low', 'Close',
                 'Volume', 'ExDiv', 'SplitRatio', 'AdjOpen',
                 'AdjHigh', 'AdjLow', 'AdjClose', 'AdjVolume']


INFO:root:Read #14437950 records

In [8]:
daily.head()


Out[8]:
Sym Date Open High Low Close Volume ExDiv SplitRatio AdjOpen AdjHigh AdjLow AdjClose AdjVolume
0 A 1999-11-18 45.50 50.00 40.00 44.00 44739900.0 0.0 1.0 43.365904 47.654840 38.123872 41.936259 44739900.0
1 A 1999-11-19 42.94 43.00 39.81 40.38 10897100.0 0.0 1.0 40.925976 40.983162 37.942784 38.486049 10897100.0
2 A 1999-11-22 41.31 44.00 40.06 44.00 4705200.0 0.0 1.0 39.372429 41.936259 38.181058 41.936259 4705200.0
3 A 1999-11-23 42.50 43.63 40.25 40.25 4274400.0 0.0 1.0 40.506614 41.583613 38.362146 38.362146 4274400.0
4 A 1999-11-24 40.13 41.94 40.00 41.06 3464400.0 0.0 1.0 38.247774 39.972880 38.123872 39.134155 3464400.0

In [9]:
daily.shape


Out[9]:
(14437950, 14)

In [10]:
# We don't care about the dividends or splits - just the adjusted values

# discard non-adjusted values, index & rename
u = daily.iloc[:, [0, 1, 9, 10, 11, 12, 13]]
u.set_index('Date', inplace=True)
u.columns = ['Sym', 'Open','High','Low','Close','Volume']

# get rid of oldest data
u = u[u.index > '2000-01-01']

daily = None # GC, please

u.describe()


Out[10]:
Open High Low Close Volume
count 1.040978e+07 1.041045e+07 1.041045e+07 1.041072e+07 1.041076e+07
mean 8.853263e+01 8.992543e+01 8.779834e+01 8.888673e+01 1.526352e+06
std 2.617280e+03 2.650242e+03 2.612497e+03 2.632222e+03 6.690106e+06
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 9.281144e+00 9.461216e+00 9.097764e+00 9.280000e+00 6.767100e+04
50% 1.777000e+01 1.805985e+01 1.746421e+01 1.777000e+01 2.842000e+05
75% 3.247877e+01 3.295080e+01 3.198269e+01 3.247981e+01 1.044300e+06
max 2.281800e+05 2.293740e+05 2.275300e+05 2.293000e+05 2.304019e+09

In [11]:
#how many names?
len(u.Sym.unique())


Out[11]:
3183

Data is free and dirty. Let's clean it up and get rid of low liquidity symbols.


In [32]:
# calculate each name's median daily $ volume and look at the distribution
addvs = u.groupby('Sym').apply( lambda x : (x.Volume * x.Close).median())
addvs = addvs.sort_values()
np.percentile(addvs, [50, 75, 90])
addvs.hist()


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6447a5d390>

In [13]:
# let's get rid of items less than $10m/day
goners = addvs[addvs < 1e7]
u = u[ np.logical_not(u.Sym.isin(goners.index))]
#how many names?
len(u.Sym.unique())


Out[13]:
1188

In [14]:
len(goners)


Out[14]:
1995

So far, we've cut away at our data pretty severely, filtering by date and liquidity and reducing the number of names from 3183 to 1188. We may want to swing back and use some of the discarded data for training purposes later...


In [15]:
bysz = u.groupby('Sym').size()
bysz.head()
bysz.sort_values(inplace=True)
#bysz.quantile([.25, .5, .75])
np.percentile(bysz,[25,50,75])

#bysz.tail()


Out[15]:
array([ 2668.5,  4224. ,  4228. ])

In [16]:
u.describe()


Out[16]:
Open High Low Close Volume
count 4.036149e+06 4.036166e+06 4.036166e+06 4.036179e+06 4.036204e+06
mean 1.614391e+02 1.637134e+02 1.609761e+02 1.624093e+02 3.496761e+06
std 4.087227e+03 4.136628e+03 4.084709e+03 4.112169e+03 1.039325e+07
min 1.455814e-02 2.170000e-02 1.455814e-02 2.050000e-02 0.000000e+00
25% 1.543115e+01 1.568740e+01 1.517000e+01 1.543491e+01 5.631000e+05
50% 2.667623e+01 2.706462e+01 2.627319e+01 2.667979e+01 1.308124e+06
75% 4.491113e+01 4.548288e+01 4.432000e+01 4.492100e+01 3.179400e+06
max 2.281800e+05 2.293740e+05 2.275300e+05 2.293000e+05 2.304019e+09

In [17]:
# replace zeros with NaNs so we can forward-fill them later
u.replace(0,np.nan, inplace=True)
u.replace([np.inf, -np.inf], np.nan,inplace=True)
u.describe()


Out[17]:
Open High Low Close Volume
count 4.036149e+06 4.036166e+06 4.036166e+06 4.036179e+06 4.035418e+06
mean 1.614391e+02 1.637134e+02 1.609761e+02 1.624093e+02 3.497442e+06
std 4.087227e+03 4.136628e+03 4.084709e+03 4.112169e+03 1.039415e+07
min 1.455814e-02 2.170000e-02 1.455814e-02 2.050000e-02 1.600000e+01
25% 1.543115e+01 1.568740e+01 1.517000e+01 1.543491e+01 5.634000e+05
50% 2.667623e+01 2.706462e+01 2.627319e+01 2.667979e+01 1.308600e+06
75% 4.491113e+01 4.548288e+01 4.432000e+01 4.492100e+01 3.180000e+06
max 2.281800e+05 2.293740e+05 2.275300e+05 2.293000e+05 2.304019e+09

In [18]:
# need to examine our nans...
badz = u[u.isnull().any(axis=1)]
badz.describe()
badcount = badz.groupby('Sym').size()
np.sort(badcount)
#np.percentile(badcount,[25,50,75])


Out[18]:
array([  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   1,   1,   1,   1,   2,   2,   2,   2,   2,
         2,   2,   2,   2,   2,   3,   3,   3,   3,   4,   4,   6,   6,
         6,   6,   7,   7,   9,   9,  10,  10,  17,  18,  18,  18,  19,
        19,  20,  20,  22,  26,  26,  28,  31,  31,  35,  37,  67, 182])

In [19]:
# get rid of anyone missing more than 100 observations
goners = badcount[badcount>100].index
u = u[ np.logical_not(u.Sym.isin(goners))]
len(u.Sym.unique())


Out[19]:
1187

The following gives us a uniform set of names that have been trading since 2000-01-01 and continue trading up until today. This is a mixed blessing: it lets us build a 'square' dataset of known tradable instruments, but it also throws away a lot of good data, drops newer names and introduces survivorship bias...

It's also slow...

All of that said, we're going to do it so we know we have a complete and tradable dataset. We can revisit this decision as we get deeper into strategy development.
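
(For what it's worth, a usually-faster equivalent is sketched here. It's untested against this exact frame, but computing each symbol's date span once should match the Z built by the groupby/filter below without calling the lambda per group.)

# hedged sketch of a faster equivalent: compute each symbol's first/last date
# once, then keep symbols whose span matches the whole frame's span
spans = u.reset_index().groupby('Sym').Date.agg(['min', 'max'])
keep = spans[(spans['min'] == u.index.min()) & (spans['max'] == u.index.max())].index
Z_alt = u[u.Sym.isin(keep)]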


In [20]:
Z = u.groupby('Sym').filter( lambda x: x.index.min() == u.index.min() and x.index.max() == u.index.max())
Z.shape


Out[20]:
(2714810, 6)

Ok, we have a mostly usable dataset. Let's calculate some values, take a quick look at the data and see if we can run it in the simulator.
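
sim.prep_univ's internals aren't shown in this note; judging purely from the column names in the output below and the rolling(window=20) warnings it emits, the derived columns are presumably built along these lines (a guess, not the actual implementation):

# hedged guesses (NOT sim.prep_univ's code) at how a few derived columns could be built
g = Z.groupby('Sym')
ret = g.Close.transform(lambda c: c.pct_change())                          # ~ 'Return'
adv = g.Volume.transform(lambda v: v.rolling(20, min_periods=1).mean())    # ~ 'ADV'
sd  = g.Close.transform(lambda c: c.pct_change()
                                   .rolling(20, min_periods=1).std())      # ~ 'SD'
fwd = g.Close.transform(lambda c: c.pct_change().shift(-1))                # ~ 'Fwd_Return'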


In [21]:
# let's verify that we can use Z in our sim environment
# fill in missing data
Z = sim.squarem(Z)

# create a universe suitable for simulation
Ubig = sim.prep_univ( Z.index, Z.Sym, Z.Open, Z.High, Z.Low, Z.Close, Z.Volume, Z.Sym)
Ubig.shape


INFO:root:Got rid of 0/642 symbols
/usr/lib64/python2.7/site-packages/pandas/core/groupby.py:685: FutureWarning: pd.rolling_mean is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(min_periods=1,window=20,center=False).mean()
  return func(g, *args, **kwargs)
/usr/lib64/python2.7/site-packages/pandas/core/groupby.py:685: FutureWarning: pd.rolling_std is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(min_periods=1,window=20,center=False).std()
  return func(g, *args, **kwargs)
Out[21]:
(2725932, 19)

In [22]:
Ubig.describe()


Out[22]:
Multiplier Open High Low Close Volume ADV DeltaV Return Fwd_Close Fwd_Return SD Fwd_Open Fwd_COReturn
count 2725932.0 2.725932e+06 2.725932e+06 2.725932e+06 2.725932e+06 2.725932e+06 2.725931e+06 2.725931e+06 2.725931e+06 2.725932e+06 2.725932e+06 2.725929e+06 2.725932e+06 2.725932e+06
mean 1.0 2.226893e+02 2.243057e+02 2.208029e+02 2.226456e+02 4.025552e+06 4.023279e+06 2.661003e-04 2.940238e-04 2.227142e+02 2.940263e-04 2.206396e-02 2.227574e+02 3.153228e-04
std 0.0 5.015723e+03 5.042574e+03 4.979409e+03 5.012831e+03 1.212138e+07 1.086728e+07 4.786192e-01 2.754559e-02 5.014403e+03 2.754558e-02 1.668339e-02 5.017295e+03 1.402322e-02
min 1.0 1.455814e-02 7.864865e-02 1.455814e-02 5.898649e-02 4.300000e+01 1.000000e+02 -9.486203e+00 -3.570812e+00 5.898649e-02 -3.570812e+00 0.000000e+00 1.455814e-02 -9.997442e-01
25% 1.0 1.536626e+01 1.560987e+01 1.511146e+01 1.536896e+01 6.116000e+05 6.956722e+05 -2.777810e-01 -1.029169e-02 1.537259e+01 -1.029169e-02 1.215633e-02 1.537000e+01 -3.631412e-03
50% 1.0 2.655789e+01 2.693000e+01 2.617679e+01 2.656049e+01 1.462800e+06 1.580758e+06 -1.145079e-02 2.200946e-04 2.656879e+01 2.200946e-04 1.756784e-02 2.656501e+01 0.000000e+00
75% 1.0 4.509414e+01 4.563840e+01 4.454000e+01 4.510266e+01 3.598300e+06 3.713368e+06 2.617968e-01 1.102714e-02 4.511558e+01 1.102714e-02 2.636890e-02 4.510598e+01 4.241282e-03
max 1.0 2.281800e+05 2.293740e+05 2.275300e+05 2.293000e+05 2.304019e+09 6.079510e+08 1.032482e+01 1.449269e+00 2.293000e+05 1.449269e+00 8.122759e-01 2.281800e+05 3.078370e+00

Let's look at some of the names in the universe.


In [23]:
S = Ubig[Ubig.Sym=='CCK'].Close
type(S)
S.plot(legend=True)

#Ubig[Ubig.Sym=='AAPL'].Close.plot()
#Ubig[Ubig.Sym=='TDW'].Close.plot()
#plt.legend()


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f646896bed0>

Let's run a simple hold-the-universe-in-equal-weight strategy just to see that it looks sensible-ish.
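
The default strategy itself isn't shown here; judging from the strategy interface visible in the commented-out random strategy further down (a function that takes U, cfg and kvargs and sets U.Weight), it's presumably something like this sketch:

# hedged sketch (not sim's actual default): equal weight across every name, every day
def eq_wt_strat(U, cfg, kvargs):
    U.Weight = 1.0 / float(len(U.Sym.unique()))
    return U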


In [24]:
# run simulation, capturing balances and discarding trading activity details
# the default strategy simply buys the universe in equal weight and re-weights daily
_,B = sim.sim(Ubig)
# plot NAV
B.NAV.plot()


INFO:root:ran over 4246 days and 2725932 rows in 159 secs
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f646896f790>

Apparently indexing works. Now that we've munged the data sufficiently, let's store the Universe for future use.


In [25]:
with open('U.pkl', 'wb') as f:
    pickle.dump(Ubig, f)
log.info('Wrote U.pkl')
len(Ubig.Sym.unique())


INFO:root:Wrote U.pkl
Out[25]:
642

OK, let's look at a few baseline strategies before we see if we can improve any of them with ML.

Trading 600+ names a day is a pain. Can we replicate the performance with fewer names?

Assuming we have no edge, what would it look like to trade an equal-weighted basket of 10 RANDOM names every day? Let's simulate 10 iterations of the strategy and see what it looks like individually and in aggregate.


In [26]:
# for reference, this is the strategy (commented out here; sim.rtest presumably runs its own copy)
#def random_strat( U, cfg, kvargs ) :
    # random portfolio strategy: picks 'num_names' randomly
#    nnames = kvargs.pop('num_names',10)
#    names = random.sample(U.Sym, nnames )
#    U.Weight = np.where( U.Sym.isin( names ), 1/float(nnames), 0 )                 
#    return U

# let's run it 10 times 
N = sim.rtest(Ubig, runs=10)


INFO:root:ran over 4246 days and 2725932 rows in 154 secs
INFO:root:ran over 4246 days and 2725932 rows in 162 secs
INFO:root:ran over 4246 days and 2725932 rows in 159 secs
INFO:root:ran over 4246 days and 2725932 rows in 159 secs
INFO:root:ran over 4246 days and 2725932 rows in 159 secs
INFO:root:ran over 4246 days and 2725932 rows in 148 secs
INFO:root:ran over 4246 days and 2725932 rows in 138 secs
INFO:root:ran over 4246 days and 2725932 rows in 152 secs
INFO:root:ran over 4246 days and 2725932 rows in 148 secs
INFO:root:ran over 4246 days and 2725932 rows in 164 secs

In [27]:
# looking at them both together yields a pretty neat result
nm = N.mean(1)
both = pd.DataFrame( {'eq_wt':B.NAV, 'avg_rndm': nm})
both.plot()


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f643c065ed0>

Makes sense, but cute anyway: any given instance of the random strategy will vary meaningfully from the index, but in aggregate they will tend toward the performance of the index.

Holding the entire market on an equal-weighted basis is a fine indexing strategy. Holding a random 10 names on an equal-weighted basis tends to replicate its performance.
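
A quick, hedged sanity check of that claim straight from the daily returns in Ubig (assuming Ubig is indexed by date, as the plots above suggest, and ignoring the sim's trading mechanics and costs):

# compare the equal-weight 'index' daily return with a random 10-name basket's daily return
import random
idx_ret = Ubig.groupby(level=0).Return.mean()
basket  = Ubig[Ubig.Sym.isin(random.sample(list(Ubig.Sym.unique()), 10))]
bkt_ret = basket.groupby(level=0).Return.mean()
log.info('index mean %.6f, basket mean %.6f, daily corr %.3f',
         idx_ret.mean(), bkt_ret.mean(), idx_ret.corr(bkt_ret))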

Can we find a selection criterion that improves on this for us?

Let's try some simple things: first, a strategy that holds the prior day's 10 biggest winners, and then one that holds the prior day's biggest losers.
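
Both strategies live in the sim module; a hedged guess at their shape, reusing the U/cfg/kvargs interface from the random strategy above and assuming the sim hands the strategy each day's rows with trailing Return and applies Weight going forward, might look like:

# hedged guess (not sim.best_strat / sim.worst_strat): equal-weight the prior
# day's top (or bottom) n names by trailing Return
def top_n_strat(U, cfg, kvargs, n=10, best=True):
    names = U.sort_values('Return', ascending=not best).Sym.head(n)
    U.Weight = np.where(U.Sym.isin(names), 1.0 / n, 0.0)
    return U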


In [28]:
best_strat = sim.best_strat
_,BBB = sim.sim(Ubig,sim_FUN=best_strat)
BBB.NAV.plot()


INFO:root:ran over 4246 days and 2725932 rows in 110 secs
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6448953210>

In [29]:
nm = N.mean(1)
#bb=BuyBest.mean(1)
three = pd.DataFrame( {'eq_wt':B.NAV, 'avg_rndm': nm, 'best_buy':BBB.NAV})
three.plot()


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f644894fe50>

Buying winners wins big when it works, but it will take you on a pretty wrenching ride and would have blown you up if you had traded it through 2000.

Now let's look at buying the prior day's worst names.


In [30]:
worst_strat = sim.worst_strat
_,BWB = sim.sim(Ubig,sim_FUN=worst_strat)
BWB.NAV.plot()


INFO:root:ran over 4246 days and 2725932 rows in 167 secs
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f64480f5a10>

In [31]:
four = pd.DataFrame( {'eq_wt':B.NAV, 'avg_rndm': N.mean(1), 'best_buy':BBB.NAV, 'worst_buy':BWB.NAV})
four.plot()
four.plot(loglog=True)


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6448185390>

Looks like mean reversion is a thing at these frequencies... Fair enough.
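
One quick, hedged way to eyeball that: the cross-sectional median of each name's lag-1 return autocorrelation, which should come out negative if daily mean reversion is real in this universe.

# lag-1 autocorrelation of daily returns, per symbol
ac1 = Ubig.groupby('Sym').Return.apply(lambda r: r.autocorr(lag=1))
log.info('median lag-1 return autocorrelation: %f', ac1.median())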

So, let's summarize our progress.

  • we pulled the free equity dataset off Quandl manually...
  • then munged it into a form suitable for simple simulations
  • we ran some simple, portfolio-oriented daily equity strategies on the simulator

Next, we'll take a look at training ML models with this dataset and see if we can improve on any of our simple strategies...
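
As a hedged preview of how that could start (column names taken from Ubig above; the single-name framing and feature choice are purely illustrative, not a committed design):

# reload the saved universe and frame a naive single-name, next-day prediction problem
with open('U.pkl', 'rb') as f:
    U = pickle.load(f)
feats = ['Return', 'DeltaV', 'SD']      # trailing features already in the frame
label = 'Fwd_Return'                    # next-day return, also already in the frame
one = U[U.Sym == 'AAPL'].dropna(subset=feats + [label])
X, y = one[feats].values, one[label].values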