Last time, we loaded the data, saw that it was basically usable for simulation, and stored it.
Now we look a bit closer at the data and at how best to use it with the off-the-shelf TensorFlow tools we'll try to apply.
In [1]:
# imports
import collections
import pandas as pd
import numpy as np
from scipy import stats
import sklearn
from sklearn import preprocessing as pp
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import interactive
import sys
import tensorflow as tf
import time
import os
import os.path
import pickle
import logging as log
log.basicConfig(level=log.DEBUG)
In [2]:
f = 'U.pkl'
P = pickle.load(open(f))
log.info('loaded <%s>',f)
P.describe()
Out[2]:
In [3]:
import sim
# can we still sim?
_,B = sim.sim(P)
# plot NAV
B.NAV.plot()
Out[3]:
In [4]:
P.head()
Out[4]:
Looks like it did yesterday.
OK, now we want to use some portion of this data to train simple ML models.
Let's define and run a function to normalize the data.
In [5]:
def prep_ml(u, show_plots=False):
    # given universe, prep for ML: scale, center & generate moments
    t0 = time.time()
    log.info('scaling & centering...')
    u.reset_index(inplace=True)
    u.sort_values(['Sym', 'Date'], inplace=True)
    u.Date = pd.to_datetime(u.Date)
    u.set_index('Date', inplace=True)
    # scale & center prices & volume, per symbol
    raw_scaled = u.groupby('Sym').transform(lambda x: (x - x.mean()) / x.std())
    u = pd.concat([u.Sym, raw_scaled], axis=1)
    # graphical sanity check
    if show_plots:
        log.info('Plotting scaled & centered prices')
        fig, ax = plt.subplots()
        u.groupby('Sym')['Close'].plot(ax=ax)
    log.info('completed scaling & centering in %d...', (time.time() - t0))
    return u
In [6]:
Z = prep_ml(P,show_plots=True)
In [7]:
Z.head()
Out[7]:
Let's clean out the uninteresting columns.
In [8]:
print Z.shape
Z.drop(['Multiplier','Expiry','Strike', 'Fwd_Open', 'Fwd_COReturn'],axis=1, inplace=True)
print Z.shape
Z.head()
Out[8]:
In [9]:
# let's get rid of NaNs from rolling windows
K = Z.dropna()
K.shape
K.head()
Out[9]:
The data is currently tainted with a few forward-looking values, all tagged Fwd_*; they will need to be excised from the training set and perhaps used to create the 'labels' for classification. The remaining columns break down into raw market fields (open, high, low, close, volume) and fields calculated from them (ADV, DeltaV, return, rolling standard deviation).
Including the open, high, and low prices seems a bit heavy to me: a lot of parameters for limited information. Perhaps we can represent them more compactly. One piece of information we might hope to glean from them is localized volatility, which could usefully be captured with a Garman-Klass estimator or something similar:
$$ \sigma = \sqrt{ \frac{Z}{n} \sum_{i=1}^{n} \left[ \tfrac{1}{2} \left( \log \frac{H_i}{L_i} \right)^2 - (2\log 2-1) \left( \log \frac{C_i}{O_i} \right)^2 \right] } $$
where $Z$ is the number of closing prices in a year and $n$ is the number of historical prices used for the volatility estimate.
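As a side note, here is a minimal sketch of how such a rolling Garman-Klass estimate might be computed per symbol on the raw (unscaled) universe P; the 21-day window and 252-day annualization are assumptions, as is the presence of Sym/Open/High/Low/Close columns after the prep step above (note the 0.5 literal, since 1/2 is integer division under Python 2):

def gk_vol(df, n=21, Z=252.0):
    # rolling Garman-Klass volatility for one symbol's OHLC history
    term = (0.5 * np.log(df.High / df.Low) ** 2
            - (2.0 * np.log(2.0) - 1.0) * np.log(df.Close / df.Open) ** 2)
    return np.sqrt((Z / n) * term.rolling(n).sum())

tmp = P.reset_index()  # unique row index so the per-group results align cleanly
tmp['GK'] = tmp.groupby('Sym', group_keys=False).apply(gk_vol)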
For now, let's use the data as-is to establish a baseline and then see what else we can do.
Let's start simple with a linear model as baseline and then see if an off-the-shelf DNN says anything (more) interesting.
What classifications will we create?
Let's see if the data advises otherwise, but it seems we could break the universe of forward returns into three roughly equal segments: the bottom, middle, and top thirds.
So, let's figure out which values should partition the classes, convert our forward-looking return into those labels, and then get rid of the forward-looking values entirely.
In [10]:
# first a quick look at the distribution of returns.
K.hist('Return',bins=100)
# now, where do we partition our classes?
q = K.Return.quantile([.333333,.666666]).values
print q
# let's add-in a 1-day Garman-Klass vol
#K['GK'] = np.sqrt(np.abs( 252 * ( (1/2) * (np.log( K.High/ K.Low) )**2 - (2 * np.log(2) - 1 ) * (np.log(K.Close/K.Open))**2 )))
K['SD'] = np.abs(K.SD)
#K['VOLARATIO'] = np.divide( K.GK , K.SD )
# let's classify date by doy and day of week
K['DOY'] = K.index.dayofyear
K['DOW'] = K.index.dayofweek
# let's encode the symbols
K['FSYM'], _ = pd.factorize(K.Sym)
# let's represent vol as ratio with ADV
K['VARATIO'] = np.abs(np.divide( K.Volume , K.ADV))
# let's create column of labels based on these values
K['Label'] = np.where( K.Fwd_Return <= q[0], 0,
np.where( K.Fwd_Return <= q[1], 1, 2))
# let's make sure labels look reasonable
print K.groupby('Label').size()
# Now that we have labels, let's get rid of fwd-looking values
K.drop(['Fwd_Return', 'Fwd_Close'],axis=1, inplace=True)
K.head()
Out[10]:
In [11]:
# do we have NaNs in our data?
K[K.isnull().any(axis=1)]
Out[11]:
Let's partition into training and validation sets (roughly 80/20, split at a fixed date) and try a few different ways of packaging the data...
For training, we'll slice the data into five different feature sets (raw, calcd, raw+calcd, smraw, smraw+calcd), defined in the code below.
We'll use the tf.contrib.learn estimators to ease our ascent of the TensorFlow learning curve.
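As a quick sanity check on that 80/20 claim (a one-off sketch, not part of the pipeline; the split date is the one used in the next cell):

# fraction of rows falling on the training side of the split date
print((K.index <= '2013-01-01').mean())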
In [12]:
# we'll set our testing/validation divide
TVDIVDATE = '2013-01-01'
# let's define which cols go where
RAW_COLS = ['FSYM', 'Open','High','Low','DOY','DOW','Close','Volume'] #
CALCD_COLS = ['ADV', 'DeltaV', 'Return', 'SD', 'VARATIO' ]# 'GK', 'VOLARATIO',
RAWNCALCD_COLS = RAW_COLS + CALCD_COLS
SMRAW_COLS = ['DOW','Close','Volume'] #
SRAWNCALCD_COLS = SMRAW_COLS + CALCD_COLS
Dataset = collections.namedtuple('Dataset', ['data', 'target'])
Ktrain = K[K.index<=TVDIVDATE].reset_index()
Kvlad = K[K.index>TVDIVDATE].reset_index()
# raw training/validations data sets
raw_train = Dataset(data=Ktrain[RAW_COLS],target=Ktrain.Label )
raw_vlad = Dataset(data=Kvlad[RAW_COLS],target=Kvlad.Label )
# calcd training/validations data sets
calcd_train = Dataset(data=Ktrain[CALCD_COLS],target=Ktrain.Label )
calcd_vlad = Dataset(data=Kvlad[CALCD_COLS],target=Kvlad.Label )
# raw+calcd training/validations data sets
rc_train = Dataset(data=Ktrain[RAWNCALCD_COLS],target=Ktrain.Label )
rc_vlad = Dataset(data=Kvlad[RAWNCALCD_COLS],target=Kvlad.Label )
# small raw training/validations data sets
smraw_train = Dataset(data=Ktrain[SMRAW_COLS],target=Ktrain.Label )
smraw_vlad = Dataset(data=Kvlad[SMRAW_COLS],target=Kvlad.Label )
# small raw+calcd training/validations data sets
src_train = Dataset(data=Ktrain[SRAWNCALCD_COLS],target=Ktrain.Label )
src_vlad = Dataset(data=Kvlad[SRAWNCALCD_COLS],target=Kvlad.Label )
print raw_train.data.tail()
print calcd_train.data.tail()
print rc_train.data.tail()
print src_train.data.tail()
In [13]:
# let's store these datasets
forsims = { 'src_train': src_train,
'src_vlad': src_vlad,
'Kvlad': Kvlad }
fname = 'forsims.pkl'
pickle.dump(forsims, open( fname, "wb"))
log.info('Wrote %s', fname)
forsims = None
In [14]:
def _fitntestLinearClassifier(train, vlad, layers=None, model_dir='/tmp/model', steps=10):
    # use off-the-shelf linear classifier, returning accuracy and responses
    fsize = len(train.data.columns)
    nclasses = len(train.target.unique())
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=fsize)]
    # Build model
    classifier = tf.contrib.learn.LinearClassifier(feature_columns=feature_columns,
                                                   n_classes=nclasses,
                                                   model_dir=model_dir)
    # Fit model.
    classifier.fit(x=train.data, y=train.target, steps=steps)
    # Evaluate accuracy.
    result = classifier.evaluate(x=vlad.data, y=vlad.target)
    print('Accuracy: {0:f}'.format(result["accuracy"]))
    return result, classifier

def _fitntestDNN(train, vlad, layers=None, model_dir='/tmp/model', steps=10):
    # build off-the-shelf network, train and validate
    fsize = len(train.data.columns)
    nclasses = len(train.target.unique())
    if layers is None:
        layers = [fsize, fsize, fsize]
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=fsize)]
    # Build a 3-layer DNN with fsize units per layer by default
    classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                                hidden_units=layers,
                                                n_classes=nclasses,
                                                model_dir=model_dir)
    # Fit model.
    classifier.fit(x=train.data, y=train.target, steps=steps)
    # Evaluate accuracy.
    result = classifier.evaluate(x=vlad.data, y=vlad.target)
    print('Accuracy: {0:f}'.format(result["accuracy"]))
    return result, classifier

def _fitntestRandomForest(train, vlad, max_nodes=1024, steps=100, model_dir='/tmp/rf'):
    # build, fit & test random forest for input
    # (model_dir is accepted for symmetry with the other helpers but not passed to the estimator)
    fsize = len(train.data.columns)
    nclasses = len(train.target.unique())
    hparams = tf.contrib.tensor_forest.python.tensor_forest.ForestHParams(
        num_trees=nclasses, max_nodes=max_nodes, num_classes=nclasses, num_features=fsize)
    classifier = tf.contrib.learn.TensorForestEstimator(hparams)
    tdata = train.data.as_matrix().astype(np.float32)
    ttgt = train.target.as_matrix().astype(np.float32)
    vdata = vlad.data.as_matrix().astype(np.float32)
    vtgt = vlad.target.as_matrix().astype(np.float32)
    monitors = [tf.contrib.learn.TensorForestLossMonitor(10, 10)]
    classifier.fit(x=tdata, y=ttgt, steps=steps, monitors=monitors)
    result = classifier.evaluate(x=vdata, y=vtgt)  # , steps=np.round(steps/10)
    print('Accuracy: {0:f}'.format(result["accuracy"]))
    return result, classifier
In [15]:
steps = 10
# use the linear classifier
raw_lc = _fitntestLinearClassifier( train=raw_train, vlad=raw_vlad, model_dir='/tmp/raw_lc', steps=steps)
calcd_lc = _fitntestLinearClassifier( train=calcd_train, vlad=calcd_vlad, model_dir='/tmp/calcd_lc',steps=steps)
rc_lc = _fitntestLinearClassifier( train=rc_train, vlad=rc_vlad, model_dir='/tmp/rc_lc', steps=steps)
smraw_lc = _fitntestLinearClassifier( train=smraw_train, vlad=smraw_vlad, model_dir='/tmp/smraw_lc', steps=steps)
src_lc = _fitntestLinearClassifier( train=src_train, vlad=src_vlad, model_dir='/tmp/src_lc', steps=steps)
In [16]:
# use the dnn
raw_dnn = _fitntestDNN( train=raw_train, vlad=raw_vlad,model_dir='/tmp/raw_dnn', steps=steps)
calcd_dnn = _fitntestDNN( train=calcd_train, vlad=calcd_vlad,model_dir='/tmp/calcd_dnn', steps=steps)
rc_dnn = _fitntestDNN( train=rc_train, vlad=rc_vlad,model_dir='/tmp/rc_dnn', steps=steps)
smraw_dnn = _fitntestDNN( train=smraw_train, vlad=smraw_vlad,model_dir='/tmp/smraw_dnn', steps=steps)
src_dnn = _fitntestDNN( train=src_train, vlad=src_vlad,model_dir='/tmp/src_dnn', steps=steps)
In [17]:
# random forests
raw_rf = _fitntestRandomForest(train=raw_train, vlad=raw_vlad, model_dir='/tmp/raw_rf', steps=steps)
calcd_rf = _fitntestRandomForest(train=calcd_train, vlad=calcd_vlad, model_dir='/tmp/calcd_rf', steps=steps)
rc_rf = _fitntestRandomForest(train=rc_train, vlad=rc_vlad, model_dir='/tmp/rc_rf', steps=steps)
smraw_rf = _fitntestRandomForest(train=smraw_train, vlad=smraw_vlad, model_dir='/tmp/smraw_rf', steps=steps)
src_rf = _fitntestRandomForest(train=src_train, vlad=src_vlad, model_dir='/tmp/src_rf', steps=steps)
In [18]:
# let's aggregate our results so far
results = pd.DataFrame([raw_lc[0], raw_dnn[0], calcd_lc[0], calcd_dnn[0],
                        rc_lc[0], rc_dnn[0], smraw_lc[0], smraw_dnn[0], src_lc[0], src_dnn[0],
                        smraw_rf[0], calcd_rf[0], src_rf[0], raw_rf[0], rc_rf[0]])
results['model'] = ['Linear', 'DNN', 'Linear', 'DNN', 'Linear', 'DNN', 'Linear', 'DNN',
                    'Linear', 'DNN', 'RandomForest', 'RandomForest', 'RandomForest',
                    'RandomForest', 'RandomForest']
results['features'] = ['raw', 'raw', 'calcd', 'calcd', 'raw+calcd', 'raw+calcd', 'smraw', 'smraw',
                       'smraw+calcd', 'smraw+calcd', 'smraw', 'calcd', 'smraw+calcd', 'raw', 'raw+calcd']
results.sort_values('accuracy',inplace=True)
results
Out[18]:
In [19]:
results.groupby('model').agg('median')
Out[19]:
In [20]:
results.groupby('features').agg('median').sort_values('accuracy')
Out[20]:
In [21]:
resdf = results[['model','features','accuracy']].sort_values(['model','features'])
resdf.set_index(resdf.features, inplace=True)
resdf.drop('features',axis=1,inplace=True)
print resdf
fig, ax = plt.subplots(figsize=(8,6))
for label, df in resdf.groupby('model'):
    df.accuracy.plot(ax=ax, label=label)
plt.axhline(y=.333333,color='black')
plt.legend(loc='center right')
plt.title('accuracy by model')
Out[21]:
We've used three simple off-the-shelf models from TensorFlow:
- a linear classifier (tf.contrib.learn.LinearClassifier)
- a DNN classifier (tf.contrib.learn.DNNClassifier)
- a random forest (tf.contrib.learn.TensorForestEstimator)
We've gotten limited results, as seen in the table above and the 'accuracy by model' plot. Basically, we expect a null predictor to score about 33.3% (the black line in the plot), and we were able to beat that in 10 of our 15 cases.
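Since the labels were cut at the terciles of the return distribution, a null predictor that always guesses the most common class should land near one third on the validation set; a quick check (sketch):

# class balance in the validation set; the largest share is the best score
# a constant (null) predictor could achieve
print(Kvlad.Label.value_counts(normalize=True))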
The models matter. The DNN severely underperformed, with a median accuracy of 29.6% - nearly four percentage points worse than the null predictor. Worse, it showed limited stability, with four scores below par and then a wild 41% on the feature set that did worst for the other models. It's very possible that the hyperparameters are poorly chosen, that the model is undertrained, or that I'm misusing it in some other fashion, but as implemented it doesn't perform.
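If under-training is the issue, the cheapest experiments are more steps and wider layers, e.g. something like the following (not run here; the layer sizes and step count are arbitrary guesses, not tuned values):

# speculative re-run of the DNN with wider hidden layers and far more steps
fsize = len(src_train.data.columns)
src_dnn_v2 = _fitntestDNN(train=src_train, vlad=src_vlad,
                          layers=[2 * fsize, 4 * fsize, 2 * fsize],
                          model_dir='/tmp/src_dnn_v2', steps=1000)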
The linear model does well but isn't terribly stable, at least compared with the random forests, which were both consistently above par and consistent across feature sets, though they clearly showed a preference for more data.
We've also created five different feature sets:
- raw: symbol and calendar features (day of year, day of week) plus the scaled open, high, low, close, and volume
- calcd: the calculated columns (ADV, DeltaV, Return, SD, VARATIO)
- raw+calcd: the union of the two
- smraw: a smaller raw set (day of week, close, volume)
- smraw+calcd: the small raw set plus the calculated columns
It's likely that a limited amount of raw data plus well-chosen calculated features is best, but it's hard to make sweeping assessments from this limited study.
Up until now, each feature row has covered only one day and one symbol. I'd like to look at stacking or enfolding the data so that several days of lookback values are included as inputs to the models (a sketch follows below). Another variant would be to provide a consistently-sized universe of more than one symbol at a time, possibly also with a lookback element.
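A minimal sketch of that stacking with pandas, assuming a three-day lookback on a few columns (the lookback length and column choice are placeholders, and in practice the lagged columns would be added before the train/validation split):

# add lagged copies of a few columns, per symbol, as extra features
LOOKBACK = 3
for lag in range(1, LOOKBACK + 1):
    for col in ['Close', 'Volume', 'Return']:
        K['%s_lag%d' % (col, lag)] = K.groupby('Sym')[col].shift(lag).values
K = K.dropna()  # each symbol loses its first LOOKBACK rows of history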
There remain numerous other paths for progression. I like all these ideas and want to pursue them all, but as practitioners we press on to complete the circuit, knowing that we can return to these studies with some concrete results in hand.
In the next workbook, we look at our simple strategies in a bit more detail with pyfolio.
Following that, we'll complete the circuit by adding predictive information from the random forest model to our strategies, to see if and how they might be improved even with this limited edge.
In [22]:
vdata = src_vlad.data.as_matrix().astype(np.float32)
vtgt = src_vlad.target.as_matrix().astype(np.float32)
p=src_rf[1].predict( x=vdata)
In [23]:
R = pd.DataFrame( {'predicted':p,'actual':vtgt})
R['dist'] = np.abs(R.actual-R.predicted)
# avg distance is meaningful. a null predictor should get about .88,
# so anything below provides some edge
print R.dist.mean()
#R
twos=R.dist[R.dist==2]
len(twos.index)/float(len(R.index))
#len(twos)
#len(R.index)
Out[23]:
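Where does that 0.88 come from? With three equally likely classes, a uniform random guess is off by 0 with probability 1/3, by 1 with probability 4/9, and by 2 with probability 2/9, so the expected distance is 8/9, or about 0.89. A quick check (sketch):

# expected |actual - predicted| for a uniform random guess over 3 classes
dists = np.abs(np.subtract.outer(np.arange(3), np.arange(3)))
print(dists.mean())  # 8/9 ~= 0.889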
In [24]:
# in the spirit of minimizing distance, let's see whether measuring
# distance against a regression beats the classifier - that could be a meaningful improvement
def _fitntestRandomForest_Regr(train, vlad, max_nodes=1024, steps=10, model_dir='/tmp/rfr'):
    # build, fit & test a random forest regressor for the input
    # (note: a true tensor_forest regression would likely also need regression=True
    #  in the ForestHParams; as written this still fits a forest on the numeric labels)
    fsize = len(train.data.columns)
    nclasses = len(train.target.unique())
    hparams = tf.contrib.tensor_forest.python.tensor_forest.ForestHParams(
        num_trees=nclasses, max_nodes=max_nodes, num_classes=nclasses, num_features=fsize)
    tdata = train.data.as_matrix().astype(np.float32)
    ttgt = train.target.as_matrix().astype(np.float32)
    vdata = vlad.data.as_matrix().astype(np.float32)
    vtgt = vlad.target.as_matrix().astype(np.float32)
    regressor = tf.contrib.learn.TensorForestEstimator(hparams)
    monitors = [tf.contrib.learn.TensorForestLossMonitor(10, 10)]
    regressor.fit(x=tdata, y=ttgt, steps=steps, monitors=monitors)
    result = regressor.evaluate(x=vdata, y=vtgt)  # , steps=np.round(steps/10)
    print('Accuracy: {0:f}'.format(result["accuracy"]))
    return result, regressor
src_rfr = _fitntestRandomForest_Regr(train=src_train, vlad=src_vlad, model_dir='/tmp/src_rfr', steps=100)
In [25]:
pr = src_rfr[1].predict( x=vdata)
RR = pd.DataFrame( {'predicted':pr,'actual':vtgt})
RR['dist'] = np.abs(RR.actual-RR.predicted)
#
# does regression beat the classifier for distance?
print RR.dist.mean()
twos=RR.dist[RR.dist==2]
print len(twos.index)/float(len(RR.index))