Special care must be taken with missing data in this problem.

Missing values must never be filled in the target variable, or the evaluation of results would be corrupted. That is a real risk in this problem if things are done without care, because the target variable and the features are the same series, only time-shifted.

Filling forward first, and only then backwards, is the best way to preserve causality as much as possible: forward filling propagates only past values, so backward filling is needed only for the leading gaps of each series.
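As a minimal sketch of that fill order on a toy series (not the project's data):

```python
import pandas as pd
import numpy as np

# Toy price series with gaps at the start, middle, and end.
prices = pd.Series([np.nan, 10.0, np.nan, 11.0, np.nan],
                   index=pd.date_range('2017-01-02', periods=5))

# Forward fill first (causal: propagates only past values),
# then backward fill to cover the leading NaNs.
filled = prices.ffill().bfill()
```

Only the first value ends up filled from the future; every other gap is filled from the past.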

Filtering out symbols with a lot of missing data may also help; otherwise the predictor may end up training on long runs of constant, filled-in values.

Filling missing data and dropping "bad" samples can be done at two or three levels: on the whole dataset, at training time, or at the level of the base samples. The differences are probably small for the filling part, but may be significant when dropping samples.
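A minimal sketch (on a toy frame, not the project's data) of the two dropping levels, using `dropna` with a `thresh` computed from a good-data ratio:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, np.nan],
                   'B': [1.0, np.nan, np.nan, np.nan],
                   'C': [1.0, 2.0, 3.0, 4.0]})

good_data_ratio = 0.75

# Symbol (column) level: keep columns with at least 75% non-null values.
thresh_col = int(np.ceil(good_data_ratio * df.shape[0]))
by_symbol = df.dropna(thresh=thresh_col, axis=1)

# Sample (row) level: keep rows with at least 75% non-null values.
thresh_row = int(np.ceil(good_data_ratio * df.shape[1]))
by_sample = df.dropna(thresh=thresh_row, axis=0)
```

Note how a mostly-empty column ('B' above) is removed at the symbol level, while at the sample level every row touched by its NaNs is removed instead, which is a much more aggressive cut.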


In [120]:
import os
import math
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import scipy.optimize as spo
import sys
from time import time
from sklearn.metrics import r2_score, median_absolute_error

%matplotlib inline

%pylab inline
pylab.rcParams['figure.figsize'] = (20.0, 10.0)

%load_ext autoreload
%autoreload 2

sys.path.append('../../')


Populating the interactive namespace from numpy and matplotlib
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

In [121]:
from utils import preprocessing as pp

In [122]:
data_df = pd.read_pickle('../../data/data_train_val_df.pkl')

In [123]:
print(data_df.shape)
data_df.head()


(5520, 2415)
Out[123]:
feature Close ... Volume
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
date
1993-01-29 43.94 24.50 6.88 NaN NaN NaN NaN 2.59 18.75 NaN ... NaN 87800.0 7633602.0 1745196.0 NaN NaN NaN NaN 33600.0 NaN
1993-02-01 44.25 24.69 6.88 NaN NaN NaN NaN 2.72 19.12 NaN ... NaN 72400.0 3001200.0 3574800.0 NaN NaN NaN NaN 32000.0 NaN
1993-02-02 44.34 24.72 6.53 NaN NaN NaN NaN 2.84 20.25 NaN ... NaN 242200.0 1388598.0 2652396.0 NaN NaN NaN NaN 251600.0 NaN
1993-02-03 44.81 25.19 6.91 NaN NaN NaN NaN 2.70 20.50 NaN ... NaN 272200.0 1228200.0 5040396.0 NaN NaN NaN NaN 254800.0 NaN
1993-02-04 45.00 26.06 6.84 NaN NaN NaN NaN 2.73 20.12 NaN ... NaN 162800.0 1675602.0 7033200.0 NaN NaN NaN NaN 317200.0 NaN

5 rows × 2415 columns


In [124]:
data_df.columns.nlevels


Out[124]:
2

Let's first filter at the symbol level


In [125]:
data_df['Close'].shape


Out[125]:
(5520, 483)

In [126]:
# Fraction of available (non-null) values per symbol
good_ratios = 1.0 - (data_df['Close'].isnull().sum() / data_df['Close'].shape[0])
good_ratios.sort_values(ascending=False).plot()


Out[126]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1456888cc0>

In [127]:
filtered_data_df = pp.drop_irrelevant_symbols(data_df['Close'], good_data_ratio=0.99)

In [128]:
good_ratios = 1.0 - (filtered_data_df.isnull().sum() / filtered_data_df.shape[0])
good_ratios.sort_values(ascending=False).plot()


Out[128]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f145684cda0>

In [129]:
filtered_data_df.shape


Out[129]:
(5520, 285)

In [130]:
filtered_data_df.head()


Out[130]:
SPY MMM ABT ADBE AMD AES AET AFL APD ALK ... HCN WDC WY WHR WFM WMB XEL XRX XLNX ZION
date
1993-01-29 43.94 24.50 6.88 2.59 18.75 4.41 6.42 4.49 21.94 4.19 ... 22.50 4.50 41.50 46.12 1.81 6.88 22.00 14.28 2.50 10.94
1993-02-01 44.25 24.69 6.88 2.72 19.12 4.53 6.64 4.52 22.38 4.19 ... 23.00 4.50 42.00 46.50 1.78 7.00 22.19 14.09 2.62 11.06
1993-02-02 44.34 24.72 6.53 2.84 20.25 4.53 6.62 4.57 22.31 4.16 ... 22.75 4.62 42.12 46.50 1.78 6.98 22.06 14.09 2.64 11.12
1993-02-03 44.81 25.19 6.91 2.70 20.50 4.49 6.50 4.65 22.69 4.22 ... 23.25 4.69 42.00 47.12 1.81 6.84 22.38 14.03 2.68 11.25
1993-02-04 45.00 26.06 6.84 2.73 20.12 4.49 6.73 4.84 23.19 4.38 ... 23.00 4.56 42.62 48.38 1.84 6.90 22.81 14.15 2.67 11.69

5 rows × 285 columns


In [131]:
filtered_data_df.isnull().sum().sort_values(ascending=False)


Out[131]:
CB       38
MCHP     35
INTU     30
HUM      25
MS       16
GT        9
CHK       5
VRTX      2
ZION      1
PDCO      1
PAYX      1
PBCT      1
CMCSA     1
CSCO      1
CTAS      1
CINF      1
QCOM      1
COST      1
CERN      1
REGN      1
CELG      1
ROST      1
CA        1
SWKS      1
BIIB      1
BBBY      1
PCAR      1
XRAY      1
CSX       1
NTRS      1
         ..
MMC       0
MAS       0
PPG       0
PNC       0
PNW       0
PCG       0
PFE       0
PKI       0
PEP       0
PH        0
ORCL      0
OKE       0
OMC       0
OXY       0
NUE       0
NOC       0
NSC       0
JWN       0
NI        0
NKE       0
NEM       0
MUR       0
MSI       0
MOS       0
TAP       0
MHK       0
MRK       0
MCD       0
MKC       0
SPY       0
Length: 285, dtype: int64
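Instead of scanning the full sorted listing above, the same information can be reduced to just the symbols that still have missing values (a sketch on a tiny stand-in frame):

```python
import pandas as pd
import numpy as np

# Tiny stand-in frame: 'CB' has one missing value, 'SPY' has none.
df = pd.DataFrame({'CB': [np.nan, 1.0], 'SPY': [1.0, 2.0]})

nulls = df.isnull().sum()
remaining = nulls[nulls > 0]  # only the columns that still contain NaNs
```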

Let's try to filter the whole dataset using only the 'Close' values


In [132]:
good_data_ratio = 0.99
FEATURE_OF_INTEREST = 'Close'

# Keep only the symbols whose 'Close' column is at least 99% non-null
threshold = math.ceil(good_data_ratio * data_df[FEATURE_OF_INTEREST].shape[0])
filtered_data_df = data_df[FEATURE_OF_INTEREST].dropna(thresh=threshold, axis=1)

In [133]:
filtered_data_df.head()


Out[133]:
SPY MMM ABT ADBE AMD AES AET AFL APD ALK ... HCN WDC WY WHR WFM WMB XEL XRX XLNX ZION
date
1993-01-29 43.94 24.50 6.88 2.59 18.75 4.41 6.42 4.49 21.94 4.19 ... 22.50 4.50 41.50 46.12 1.81 6.88 22.00 14.28 2.50 10.94
1993-02-01 44.25 24.69 6.88 2.72 19.12 4.53 6.64 4.52 22.38 4.19 ... 23.00 4.50 42.00 46.50 1.78 7.00 22.19 14.09 2.62 11.06
1993-02-02 44.34 24.72 6.53 2.84 20.25 4.53 6.62 4.57 22.31 4.16 ... 22.75 4.62 42.12 46.50 1.78 6.98 22.06 14.09 2.64 11.12
1993-02-03 44.81 25.19 6.91 2.70 20.50 4.49 6.50 4.65 22.69 4.22 ... 23.25 4.69 42.00 47.12 1.81 6.84 22.38 14.03 2.68 11.25
1993-02-04 45.00 26.06 6.84 2.73 20.12 4.49 6.73 4.84 23.19 4.38 ... 23.00 4.56 42.62 48.38 1.84 6.90 22.81 14.15 2.67 11.69

5 rows × 285 columns


In [134]:
filtered_data_df.columns


Out[134]:
Index(['SPY', 'MMM', 'ABT', 'ADBE', 'AMD', 'AES', 'AET', 'AFL', 'APD', 'ALK',
       ...
       'HCN', 'WDC', 'WY', 'WHR', 'WFM', 'WMB', 'XEL', 'XRX', 'XLNX', 'ZION'],
      dtype='object', length=285)

In [135]:
fdata_df = data_df.loc[:,(slice(None),filtered_data_df.columns.tolist())]
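The same MultiIndex column selection can also be written with pd.IndexSlice, which reads a little more clearly than a raw tuple of slices. A self-contained sketch with the same two-level (feature, symbol) column layout:

```python
import pandas as pd
import numpy as np

# Small frame mimicking the (feature, symbol) two-level columns.
cols = pd.MultiIndex.from_product([['Close', 'Volume'], ['SPY', 'MMM', 'ABT']])
df = pd.DataFrame(np.arange(12).reshape(2, 6), columns=cols)

keep = ['SPY', 'ABT']
idx = pd.IndexSlice
subset = df.loc[:, idx[:, keep]]  # same as df.loc[:, (slice(None), keep)]
```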

In [136]:
new_cols = fdata_df.columns.get_level_values(1)

In [137]:
np.setdiff1d(new_cols, filtered_data_df.columns)


Out[137]:
array([], dtype=object)

In [138]:
np.setdiff1d(filtered_data_df.columns, new_cols)


Out[138]:
array([], dtype=object)

In [139]:
np.intersect1d(filtered_data_df.columns, new_cols).shape


Out[139]:
(285,)

In [140]:
filtered_data_df.columns.shape


Out[140]:
(285,)

Looks good to me... Let's test it on the full dataset


In [142]:
filtered_data_df = pp.drop_irrelevant_symbols(data_df, good_data_ratio=0.99)

In [143]:
good_ratios = 1.0 - (filtered_data_df['Close'].isnull().sum() / filtered_data_df['Close'].shape[0])
good_ratios.sort_values(ascending=False).plot()


Out[143]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1456945f98>

Now, let's filter at the sample level


In [145]:
import predictor.feature_extraction as fe

train_time = -1  # in real (calendar) days
base_days = 7  # in market days
step_days = 30  # in market days
ahead_days = 1  # in market days
today = data_df.index[-1]  # last date in the dataset

tic = time()
x, y = fe.generate_train_intervals(data_df, 
                                   train_time, 
                                   base_days, 
                                   step_days, 
                                   ahead_days, 
                                   today, 
                                   fe.feature_close_one_to_one)
toc = time()
print('Elapsed time: %i seconds.' % (toc-tic))


Elapsed time: 40 seconds.

In [146]:
x.shape


Out[146]:
(73478, 7)

In [147]:
y.shape


Out[147]:
(73478, 1)

In [152]:
x_y_df = pd.concat([x, y], axis=1)

In [153]:
x_y_df.shape


Out[153]:
(73478, 8)

In [154]:
x_y_df.head()


Out[154]:
0 1 2 3 4 5 6 target
2014-07-14 53.60 53.93 54.21 53.48 54.39 54.04 54.05 53.97
2011-10-11 18.28 18.73 18.64 18.90 18.19 18.60 18.22 18.22
2008-04-29 30.64 30.20 30.65 30.84 30.92 30.89 30.45 30.80
1994-12-22 9.22 9.25 9.31 9.34 9.34 9.28 9.03 9.12
1998-07-17 65812.50 64593.75 63187.50 64875.00 61781.25 58875.00 59437.50 58312.50

In [157]:
x_y_df.isnull().sum(axis=1)


Out[157]:
2014-07-14    0
2011-10-11    0
2008-04-29    0
1994-12-22    0
1998-07-17    0
1997-08-04    0
1996-07-10    0
2007-11-05    0
2014-05-30    0
2010-09-16    0
1994-01-11    0
2012-09-25    0
1995-10-23    0
1995-05-03    0
2008-12-01    0
2007-04-03    0
2010-06-22    0
2009-04-13    0
1995-06-15    0
2010-06-22    0
2011-07-18    0
1994-02-23    0
2006-11-20    0
2011-03-09    0
2003-10-16    0
2012-06-29    0
1998-11-23    0
2007-08-10    0
2010-09-16    0
2007-04-03    0
             ..
2011-06-03    0
2004-05-21    0
1999-09-24    0
2003-11-28    0
2007-08-10    0
1996-07-10    0
1995-06-15    0
1999-12-20    0
2011-04-20    0
2002-06-26    0
2014-10-07    0
2008-06-11    0
2002-04-02    0
2012-11-08    0
2013-02-06    0
2004-05-21    0
2014-03-05    0
2008-03-17    0
2006-01-23    0
2002-04-02    0
2009-11-13    0
2006-01-23    0
2002-12-16    0
2005-06-17    0
2009-07-08    0
2004-05-21    0
1998-11-23    0
2000-02-02    0
2012-01-06    0
2006-11-20    0
Length: 73478, dtype: int64
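Rather than scanning the full per-sample listing above, the same check can be collapsed to a single count of samples that still contain NaNs (a sketch on a random stand-in frame):

```python
import pandas as pd
import numpy as np

# Stand-in for the real samples frame built above.
x_y_df = pd.DataFrame(np.random.rand(5, 8))

nans_per_sample = x_y_df.isnull().sum(axis=1)
n_bad = int((nans_per_sample > 0).sum())  # samples with at least one NaN
```

If `n_bad` is zero, no samples need to be dropped at this level.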

In [ ]: