Special care must be taken with missing data in this problem.

Missing values must never be filled in the target variable, or the evaluation of results would be corrupted. That is a real risk in this problem if things are done without care, because the target variable and the features are the same series, only time-shifted.

Filling forward first, and only then backwards, is the best way to preserve causality as much as possible: forward filling propagates only past values, so backward filling is needed only for the leading gaps of each series.
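As a minimal sketch of that fill order on a toy series (not the project's data):

```python
import pandas as pd
import numpy as np

# Toy price series with gaps at the start, middle, and end.
prices = pd.Series([np.nan, 10.0, np.nan, 11.0, np.nan],
                   index=pd.date_range('2017-01-02', periods=5))

# Forward fill first (causal: propagates only past values),
# then backward fill to cover the leading NaNs.
filled = prices.ffill().bfill()
```

Only the first value ends up filled from the future; every other gap is filled from the past.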

Filtering out symbols with a lot of missing data may also help; otherwise the predictor may end up training on long runs of constant, filled-in values.

Filling missing data and dropping "bad" samples can be done at two or three levels: on the whole dataset, at training time, or at the level of the base samples. The differences are probably small for the filling part, but may be significant when dropping samples.
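A minimal sketch (on a toy frame, not the project's data) of the two dropping levels, using `dropna` with a `thresh` computed from a good-data ratio:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, np.nan],
                   'B': [1.0, np.nan, np.nan, np.nan],
                   'C': [1.0, 2.0, 3.0, 4.0]})

good_data_ratio = 0.75

# Symbol (column) level: keep columns with at least 75% non-null values.
thresh_col = int(np.ceil(good_data_ratio * df.shape[0]))
by_symbol = df.dropna(thresh=thresh_col, axis=1)

# Sample (row) level: keep rows with at least 75% non-null values.
thresh_row = int(np.ceil(good_data_ratio * df.shape[1]))
by_sample = df.dropna(thresh=thresh_row, axis=0)
```

Note how a mostly-empty column ('B' above) is removed at the symbol level, while at the sample level every row touched by its NaNs is removed instead, which is a much more aggressive cut.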


In [120]:
import os
import math
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import scipy.optimize as spo
import sys
from time import time
from sklearn.metrics import r2_score, median_absolute_error

%matplotlib inline

%pylab inline
pylab.rcParams['figure.figsize'] = (20.0, 10.0)

%load_ext autoreload
%autoreload 2

sys.path.append('../../')


Populating the interactive namespace from numpy and matplotlib
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

In [121]:
from utils import preprocessing as pp

In [122]:
data_df = pd.read_pickle('../../data/data_train_val_df.pkl')

In [123]:
print(data_df.shape)
data_df.head()


(5520, 2415)
Out[123]:
feature Close ... Volume
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
date
1993-01-29 43.94 24.50 6.88 NaN NaN NaN NaN 2.59 18.75 NaN ... NaN 87800.0 7633602.0 1745196.0 NaN NaN NaN NaN 33600.0 NaN
1993-02-01 44.25 24.69 6.88 NaN NaN NaN NaN 2.72 19.12 NaN ... NaN 72400.0 3001200.0 3574800.0 NaN NaN NaN NaN 32000.0 NaN
1993-02-02 44.34 24.72 6.53 NaN NaN NaN NaN 2.84 20.25 NaN ... NaN 242200.0 1388598.0 2652396.0 NaN NaN NaN NaN 251600.0 NaN
1993-02-03 44.81 25.19 6.91 NaN NaN NaN NaN 2.70 20.50 NaN ... NaN 272200.0 1228200.0 5040396.0 NaN NaN NaN NaN 254800.0 NaN
1993-02-04 45.00 26.06 6.84 NaN NaN NaN NaN 2.73 20.12 NaN ... NaN 162800.0 1675602.0 7033200.0 NaN NaN NaN NaN 317200.0 NaN

5 rows × 2415 columns


In [124]:
data_df.columns.nlevels


Out[124]:
2

Let's first filter at the symbol level


In [125]:
data_df['Close'].shape


Out[125]:
(5520, 483)

In [126]:
# Fraction of available (non-null) values per symbol
good_ratios = 1.0 - (data_df['Close'].isnull().sum() / data_df['Close'].shape[0])
good_ratios.sort_values(ascending=False).plot()


Out[126]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1456888cc0>

In [127]:
filtered_data_df = pp.drop_irrelevant_symbols(data_df['Close'], good_data_ratio=0.99)

In [128]:
good_ratios = 1.0 - (filtered_data_df.isnull().sum() / filtered_data_df.shape[0])
good_ratios.sort_values(ascending=False).plot()


Out[128]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f145684cda0>

In [129]:
filtered_data_df.shape


Out[129]:
(5520, 285)

In [130]:
filtered_data_df.head()


Out[130]:
SPY MMM ABT ADBE AMD AES AET AFL APD ALK ... HCN WDC WY WHR WFM WMB XEL XRX XLNX ZION
date
1993-01-29 43.94 24.50 6.88 2.59 18.75 4.41 6.42 4.49 21.94 4.19 ... 22.50 4.50 41.50 46.12 1.81 6.88 22.00 14.28 2.50 10.94
1993-02-01 44.25 24.69 6.88 2.72 19.12 4.53 6.64 4.52 22.38 4.19 ... 23.00 4.50 42.00 46.50 1.78 7.00 22.19 14.09 2.62 11.06
1993-02-02 44.34 24.72 6.53 2.84 20.25 4.53 6.62 4.57 22.31 4.16 ... 22.75 4.62 42.12 46.50 1.78 6.98 22.06 14.09 2.64 11.12
1993-02-03 44.81 25.19 6.91 2.70 20.50 4.49 6.50 4.65 22.69 4.22 ... 23.25 4.69 42.00 47.12 1.81 6.84 22.38 14.03 2.68 11.25
1993-02-04 45.00 26.06 6.84 2.73 20.12 4.49 6.73 4.84 23.19 4.38 ... 23.00 4.56 42.62 48.38 1.84 6.90 22.81 14.15 2.67 11.69

5 rows × 285 columns


In [131]:
filtered_data_df.isnull().sum().sort_values(ascending=False)


Out[131]:
CB       38
MCHP     35
INTU     30
HUM      25
MS       16
GT        9
CHK       5
VRTX      2
ZION      1
PDCO      1
PAYX      1
PBCT      1
CMCSA     1
CSCO      1
CTAS      1
CINF      1
QCOM      1
COST      1
CERN      1
REGN      1
CELG      1
ROST      1
CA        1
SWKS      1
BIIB      1
BBBY      1
PCAR      1
XRAY      1
CSX       1
NTRS      1
         ..
MMC       0
MAS       0
PPG       0
PNC       0
PNW       0
PCG       0
PFE       0
PKI       0
PEP       0
PH        0
ORCL      0
OKE       0
OMC       0
OXY       0
NUE       0
NOC       0
NSC       0
JWN       0
NI        0
NKE       0
NEM       0
MUR       0
MSI       0
MOS       0
TAP       0
MHK       0
MRK       0
MCD       0
MKC       0
SPY       0
Length: 285, dtype: int64
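Instead of scanning the full sorted listing above, the same information can be reduced to just the symbols that still have missing values (a sketch on a tiny stand-in frame):

```python
import pandas as pd
import numpy as np

# Tiny stand-in frame: 'CB' has one missing value, 'SPY' has none.
df = pd.DataFrame({'CB': [np.nan, 1.0], 'SPY': [1.0, 2.0]})

nulls = df.isnull().sum()
remaining = nulls[nulls > 0]  # only the columns that still contain NaNs
```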

Let's try to filter the whole dataset using only the 'Close' values


In [132]:
good_data_ratio = 0.99
FEATURE_OF_INTEREST = 'Close'

# Keep only the symbols whose 'Close' column is at least 99% non-null
threshold = math.ceil(good_data_ratio * data_df[FEATURE_OF_INTEREST].shape[0])
filtered_data_df = data_df[FEATURE_OF_INTEREST].dropna(thresh=threshold, axis=1)

In [133]:
filtered_data_df.head()


Out[133]:
SPY MMM ABT ADBE AMD AES AET AFL APD ALK ... HCN WDC WY WHR WFM WMB XEL XRX XLNX ZION
date
1993-01-29 43.94 24.50 6.88 2.59 18.75 4.41 6.42 4.49 21.94 4.19 ... 22.50 4.50 41.50 46.12 1.81 6.88 22.00 14.28 2.50 10.94
1993-02-01 44.25 24.69 6.88 2.72 19.12 4.53 6.64 4.52 22.38 4.19 ... 23.00 4.50 42.00 46.50 1.78 7.00 22.19 14.09 2.62 11.06
1993-02-02 44.34 24.72 6.53 2.84 20.25 4.53 6.62 4.57 22.31 4.16 ... 22.75 4.62 42.12 46.50 1.78 6.98 22.06 14.09 2.64 11.12
1993-02-03 44.81 25.19 6.91 2.70 20.50 4.49 6.50 4.65 22.69 4.22 ... 23.25 4.69 42.00 47.12 1.81 6.84 22.38 14.03 2.68 11.25
1993-02-04 45.00 26.06 6.84 2.73 20.12 4.49 6.73 4.84 23.19 4.38 ... 23.00 4.56 42.62 48.38 1.84 6.90 22.81 14.15 2.67 11.69

5 rows × 285 columns


In [134]:
filtered_data_df.columns


Out[134]:
Index(['SPY', 'MMM', 'ABT', 'ADBE', 'AMD', 'AES', 'AET', 'AFL', 'APD', 'ALK',
       ...
       'HCN', 'WDC', 'WY', 'WHR', 'WFM', 'WMB', 'XEL', 'XRX', 'XLNX', 'ZION'],
      dtype='object', length=285)

In [135]:
fdata_df = data_df.loc[:,(slice(None),filtered_data_df.columns.tolist())]
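The same MultiIndex column selection can also be written with pd.IndexSlice, which reads a little more clearly than a raw tuple of slices. A self-contained sketch with the same two-level (feature, symbol) column layout:

```python
import pandas as pd
import numpy as np

# Small frame mimicking the (feature, symbol) two-level columns.
cols = pd.MultiIndex.from_product([['Close', 'Volume'], ['SPY', 'MMM', 'ABT']])
df = pd.DataFrame(np.arange(12).reshape(2, 6), columns=cols)

keep = ['SPY', 'ABT']
idx = pd.IndexSlice
subset = df.loc[:, idx[:, keep]]  # same as df.loc[:, (slice(None), keep)]
```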

In [136]:
new_cols = fdata_df.columns.get_level_values(1)

In [137]:
np.setdiff1d(new_cols, filtered_data_df.columns)


Out[137]:
array([], dtype=object)

In [138]:
np.setdiff1d(filtered_data_df.columns, new_cols)


Out[138]:
array([], dtype=object)

In [139]:
np.intersect1d(filtered_data_df.columns, new_cols).shape


Out[139]:
(285,)

In [140]:
filtered_data_df.columns.shape


Out[140]:
(285,)

Looks good to me... Let's test it on the full dataset


In [142]:
filtered_data_df = pp.drop_irrelevant_symbols(data_df, good_data_ratio=0.99)

In [143]:
good_ratios = 1.0 - (filtered_data_df['Close'].isnull().sum() / filtered_data_df['Close'].shape[0])
good_ratios.sort_values(ascending=False).plot()


Out[143]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1456945f98>

Now, let's filter at the sample level


In [145]:
import predictor.feature_extraction as fe

train_time = -1  # in real (calendar) days
base_days = 7  # in market days
step_days = 30  # in market days
ahead_days = 1  # in market days
today = data_df.index[-1]  # last date in the dataset

tic = time()
x, y = fe.generate_train_intervals(data_df, 
                                   train_time, 
                                   base_days, 
                                   step_days, 
                                   ahead_days, 
                                   today, 
                                   fe.feature_close_one_to_one)
toc = time()
print('Elapsed time: %i seconds.' % (toc-tic))


Elapsed time: 40 seconds.

In [146]:
x.shape


Out[146]:
(73478, 7)

In [147]:
y.shape


Out[147]:
(73478, 1)

In [152]:
x_y_df = pd.concat([x, y], axis=1)

In [153]:
x_y_df.shape


Out[153]:
(73478, 8)

In [154]:
x_y_df.head()


Out[154]:
0 1 2 3 4 5 6 target
2014-07-14 53.60 53.93 54.21 53.48 54.39 54.04 54.05 53.97
2011-10-11 18.28 18.73 18.64 18.90 18.19 18.60 18.22 18.22
2008-04-29 30.64 30.20 30.65 30.84 30.92 30.89 30.45 30.80
1994-12-22 9.22 9.25 9.31 9.34 9.34 9.28 9.03 9.12
1998-07-17 65812.50 64593.75 63187.50 64875.00 61781.25 58875.00 59437.50 58312.50

In [157]:
x_y_df.isnull().sum(axis=1)


Out[157]:
2014-07-14    0
2011-10-11    0
2008-04-29    0
1994-12-22    0
1998-07-17    0
1997-08-04    0
1996-07-10    0
2007-11-05    0
2014-05-30    0
2010-09-16    0
1994-01-11    0
2012-09-25    0
1995-10-23    0
1995-05-03    0
2008-12-01    0
2007-04-03    0
2010-06-22    0
2009-04-13    0
1995-06-15    0
2010-06-22    0
2011-07-18    0
1994-02-23    0
2006-11-20    0
2011-03-09    0
2003-10-16    0
2012-06-29    0
1998-11-23    0
2007-08-10    0
2010-09-16    0
2007-04-03    0
             ..
2011-06-03    0
2004-05-21    0
1999-09-24    0
2003-11-28    0
2007-08-10    0
1996-07-10    0
1995-06-15    0
1999-12-20    0
2011-04-20    0
2002-06-26    0
2014-10-07    0
2008-06-11    0
2002-04-02    0
2012-11-08    0
2013-02-06    0
2004-05-21    0
2014-03-05    0
2008-03-17    0
2006-01-23    0
2002-04-02    0
2009-11-13    0
2006-01-23    0
2002-12-16    0
2005-06-17    0
2009-07-08    0
2004-05-21    0
1998-11-23    0
2000-02-02    0
2012-01-06    0
2006-11-20    0
Length: 73478, dtype: int64
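Rather than scanning the full per-sample listing above, the same check can be collapsed to a single count of samples that still contain NaNs (a sketch on a random stand-in frame):

```python
import pandas as pd
import numpy as np

# Stand-in for the real samples frame built above.
x_y_df = pd.DataFrame(np.random.rand(5, 8))

nans_per_sample = x_y_df.isnull().sum(axis=1)
n_bad = int((nans_per_sample > 0).sum())  # samples with at least one NaN
```

If `n_bad` is zero, no samples need to be dropped at this level.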

In [ ]: