In this notebook, the datasets for the predictor are generated.


In [1]:
# Basic imports
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import scipy.optimize as spo
import sys
from time import time
from sklearn.metrics import r2_score, median_absolute_error

%matplotlib inline

%pylab inline
pylab.rcParams['figure.figsize'] = (20.0, 10.0)

%load_ext autoreload
%autoreload 2

sys.path.append('../../')
import predictor.feature_extraction as fe
import utils.preprocessing as pp


Populating the interactive namespace from numpy and matplotlib

Let's first define the list of parameters to use in each dataset.


In [2]:
# Input values
GOOD_DATA_RATIO = 0.99  # The ratio of non-missing values for a symbol to be considered good
SAMPLES_GOOD_DATA_RATIO = 0.9  # The ratio of non-missing values for an interval to be considered good

train_val_time = -1  # In real (calendar) days (-1 means the full interval)
# The step between samples is fixed. That means that datasets with longer
# base periods will have samples that are more correlated.
step_days = 7  # In market days

base_days = [7, 14, 28, 56, 112]  # In market days
ahead_days = [1, 7, 14, 28, 56]  # In market days

In [3]:
datasets_params_list_df = pd.DataFrame([(x,y) for x in base_days for y in ahead_days],
                                      columns=['base_days', 'ahead_days'])
datasets_params_list_df['train_val_time'] = train_val_time
datasets_params_list_df['step_days'] = step_days
datasets_params_list_df['GOOD_DATA_RATIO'] = GOOD_DATA_RATIO
datasets_params_list_df['SAMPLES_GOOD_DATA_RATIO'] = SAMPLES_GOOD_DATA_RATIO
datasets_params_list_df


Out[3]:
base_days ahead_days train_val_time step_days GOOD_DATA_RATIO SAMPLES_GOOD_DATA_RATIO
0 7 1 -1 7 0.99 0.9
1 7 7 -1 7 0.99 0.9
2 7 14 -1 7 0.99 0.9
3 7 28 -1 7 0.99 0.9
4 7 56 -1 7 0.99 0.9
5 14 1 -1 7 0.99 0.9
6 14 7 -1 7 0.99 0.9
7 14 14 -1 7 0.99 0.9
8 14 28 -1 7 0.99 0.9
9 14 56 -1 7 0.99 0.9
10 28 1 -1 7 0.99 0.9
11 28 7 -1 7 0.99 0.9
12 28 14 -1 7 0.99 0.9
13 28 28 -1 7 0.99 0.9
14 28 56 -1 7 0.99 0.9
15 56 1 -1 7 0.99 0.9
16 56 7 -1 7 0.99 0.9
17 56 14 -1 7 0.99 0.9
18 56 28 -1 7 0.99 0.9
19 56 56 -1 7 0.99 0.9
20 112 1 -1 7 0.99 0.9
21 112 7 -1 7 0.99 0.9
22 112 14 -1 7 0.99 0.9
23 112 28 -1 7 0.99 0.9
24 112 56 -1 7 0.99 0.9
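As a side note, the same parameter grid could also be built with `pd.MultiIndex.from_product` instead of the nested list comprehension; a minimal sketch (variable values copied from above):

```python
import pandas as pd

# Illustrative alternative to the nested list comprehension above:
# build the (base_days, ahead_days) grid with pd.MultiIndex.from_product.
base_days = [7, 14, 28, 56, 112]
ahead_days = [1, 7, 14, 28, 56]

grid = pd.MultiIndex.from_product([base_days, ahead_days],
                                  names=['base_days', 'ahead_days'])
params_df = grid.to_frame(index=False)
print(params_df.shape)  # (25, 2)
```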

Now, let's define the function to generate each dataset.

Note: The handling of missing data was designed carefully. Missing values are never filled across samples. Some symbols are discarded before the intervals are generated, and some intervals are discarded afterwards. Only then, and only within each training sample interval, are the missing values filled: first forward and then backward, to preserve causality as much as possible.
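The fill order described above can be sketched as follows (a minimal illustration, not the actual `pp.fill_missing` implementation):

```python
import numpy as np
import pandas as pd

# Each row is one sample interval; gaps are filled within the row only,
# forward first and then backward, so no information leaks across samples.
x = pd.DataFrame([[np.nan, 1.0, np.nan, 2.0],
                  [3.0, np.nan, np.nan, np.nan]])
filled = x.ffill(axis=1).bfill(axis=1)
print(filled.values)
# [[1. 1. 1. 2.]
#  [3. 3. 3. 3.]]
```

The backward pass only matters for leading gaps (like the first value of row 0), which forward filling alone cannot reach.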


In [4]:
def generate_one_set(params):
    # print(('-'*70 + '\n {}, {} \n' + '-'*70).format(params['base_days'].values, params['ahead_days'].values))
    tic = time()
    
    train_val_time = int(params['train_val_time'])
    base_days = int(params['base_days'])
    step_days = int(params['step_days'])
    ahead_days = int(params['ahead_days'])
    
    print('Generating: base{}_ahead{}'.format(base_days, ahead_days))
    pid = 'base{}_ahead{}'.format(base_days, ahead_days)
    
    # Getting the data
    data_df = pd.read_pickle('../../data/data_train_val_df.pkl')
    today = data_df.index[-1]  # Real date
    print(pid + ') data_df loaded')

    # Drop symbols with many missing points
    data_df = pp.drop_irrelevant_symbols(data_df, params['GOOD_DATA_RATIO'])
    print(pid + ') Irrelevant symbols dropped.')
    
    # Generate the intervals for the predictor
    x, y = fe.generate_train_intervals(data_df, 
                                       train_val_time, 
                                       base_days, 
                                       step_days,
                                       ahead_days, 
                                       today, 
                                       fe.feature_close_one_to_one)    
    print(pid + ') Intervals generated')
    
    # Drop "bad" samples and fill missing data
    x_y_df = pd.concat([x, y], axis=1)
    x_y_df = pp.drop_irrelevant_samples(x_y_df, params['SAMPLES_GOOD_DATA_RATIO'])
    x = x_y_df.iloc[:, :-1]
    y = x_y_df.iloc[:, -1]
    x = pp.fill_missing(x)
    print(pid + ') Irrelevant samples dropped and missing data filled.')
    
    # Pickle that
    x.to_pickle('../../data/x_{}.pkl'.format(pid))
    y.to_pickle('../../data/y_{}.pkl'.format(pid))
    
    toc = time()
    print('%s) %i intervals generated in: %i seconds.' % (pid, x.shape[0], (toc-tic)))
    
    return pid, x, y
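A brief note on the `int(...)` coercions at the top of `generate_one_set`: selecting a single row of the parameters table with `.iloc` returns a `Series`, and because the row mixes integer and float columns, the scalars may come back as floats or objects depending on the pandas version, so the explicit casts make the types predictable. A small illustration (column names match the table above):

```python
import pandas as pd

df = pd.DataFrame({'base_days': [7], 'GOOD_DATA_RATIO': [0.99]})
row = df.iloc[0]  # a single row: a Series mixing int and float columns

# The value may no longer be a plain Python int here, so cast explicitly.
base_days = int(row['base_days'])
print(base_days)  # 7
```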

In [5]:
for ind in range(datasets_params_list_df.shape[0]):
    pid, x, y = generate_one_set(datasets_params_list_df.iloc[ind,:])


Generating: base7_ahead1
base7_ahead1) data_df loaded
base7_ahead1) Irrelevant symbols dropped.
base7_ahead1) Intervals generated
base7_ahead1) Irrelevant samples dropped and missing data filled.
base7_ahead1) 224493 intervals generated in: 137 seconds.
Generating: base7_ahead7
base7_ahead7) data_df loaded
base7_ahead7) Irrelevant symbols dropped.
base7_ahead7) Intervals generated
base7_ahead7) Irrelevant samples dropped and missing data filled.
base7_ahead7) 224207 intervals generated in: 135 seconds.
Generating: base7_ahead14
base7_ahead14) data_df loaded
base7_ahead14) Irrelevant symbols dropped.
base7_ahead14) Intervals generated
base7_ahead14) Irrelevant samples dropped and missing data filled.
base7_ahead14) 223922 intervals generated in: 141 seconds.
Generating: base7_ahead28
base7_ahead28) data_df loaded
base7_ahead28) Irrelevant symbols dropped.
base7_ahead28) Intervals generated
base7_ahead28) Irrelevant samples dropped and missing data filled.
base7_ahead28) 223352 intervals generated in: 138 seconds.
Generating: base7_ahead56
base7_ahead56) data_df loaded
base7_ahead56) Irrelevant symbols dropped.
base7_ahead56) Intervals generated
base7_ahead56) Irrelevant samples dropped and missing data filled.
base7_ahead56) 222212 intervals generated in: 162 seconds.
Generating: base14_ahead1
base14_ahead1) data_df loaded
base14_ahead1) Irrelevant symbols dropped.
base14_ahead1) Intervals generated
base14_ahead1) Irrelevant samples dropped and missing data filled.
base14_ahead1) 224268 intervals generated in: 177 seconds.
Generating: base14_ahead7
base14_ahead7) data_df loaded
base14_ahead7) Irrelevant symbols dropped.
base14_ahead7) Intervals generated
base14_ahead7) Irrelevant samples dropped and missing data filled.
base14_ahead7) 223982 intervals generated in: 170 seconds.
Generating: base14_ahead14
base14_ahead14) data_df loaded
base14_ahead14) Irrelevant symbols dropped.
base14_ahead14) Intervals generated
base14_ahead14) Irrelevant samples dropped and missing data filled.
base14_ahead14) 223697 intervals generated in: 132 seconds.
Generating: base14_ahead28
base14_ahead28) data_df loaded
base14_ahead28) Irrelevant symbols dropped.
base14_ahead28) Intervals generated
base14_ahead28) Irrelevant samples dropped and missing data filled.
base14_ahead28) 223127 intervals generated in: 132 seconds.
Generating: base14_ahead56
base14_ahead56) data_df loaded
base14_ahead56) Irrelevant symbols dropped.
base14_ahead56) Intervals generated
base14_ahead56) Irrelevant samples dropped and missing data filled.
base14_ahead56) 221987 intervals generated in: 136 seconds.
Generating: base28_ahead1
base28_ahead1) data_df loaded
base28_ahead1) Irrelevant symbols dropped.
base28_ahead1) Intervals generated
base28_ahead1) Irrelevant samples dropped and missing data filled.
base28_ahead1) 223696 intervals generated in: 172 seconds.
Generating: base28_ahead7
base28_ahead7) data_df loaded
base28_ahead7) Irrelevant symbols dropped.
base28_ahead7) Intervals generated
base28_ahead7) Irrelevant samples dropped and missing data filled.
base28_ahead7) 223410 intervals generated in: 167 seconds.
Generating: base28_ahead14
base28_ahead14) data_df loaded
base28_ahead14) Irrelevant symbols dropped.
base28_ahead14) Intervals generated
base28_ahead14) Irrelevant samples dropped and missing data filled.
base28_ahead14) 223125 intervals generated in: 156 seconds.
Generating: base28_ahead28
base28_ahead28) data_df loaded
base28_ahead28) Irrelevant symbols dropped.
base28_ahead28) Intervals generated
base28_ahead28) Irrelevant samples dropped and missing data filled.
base28_ahead28) 222555 intervals generated in: 131 seconds.
Generating: base28_ahead56
base28_ahead56) data_df loaded
base28_ahead56) Irrelevant symbols dropped.
base28_ahead56) Intervals generated
base28_ahead56) Irrelevant samples dropped and missing data filled.
base28_ahead56) 221415 intervals generated in: 131 seconds.
Generating: base56_ahead1
base56_ahead1) data_df loaded
base56_ahead1) Irrelevant symbols dropped.
base56_ahead1) Intervals generated
base56_ahead1) Irrelevant samples dropped and missing data filled.
base56_ahead1) 222552 intervals generated in: 142 seconds.
Generating: base56_ahead7
base56_ahead7) data_df loaded
base56_ahead7) Irrelevant symbols dropped.
base56_ahead7) Intervals generated
base56_ahead7) Irrelevant samples dropped and missing data filled.
base56_ahead7) 222267 intervals generated in: 141 seconds.
Generating: base56_ahead14
base56_ahead14) data_df loaded
base56_ahead14) Irrelevant symbols dropped.
base56_ahead14) Intervals generated
base56_ahead14) Irrelevant samples dropped and missing data filled.
base56_ahead14) 221982 intervals generated in: 140 seconds.
Generating: base56_ahead28
base56_ahead28) data_df loaded
base56_ahead28) Irrelevant symbols dropped.
base56_ahead28) Intervals generated
base56_ahead28) Irrelevant samples dropped and missing data filled.
base56_ahead28) 221412 intervals generated in: 139 seconds.
Generating: base56_ahead56
base56_ahead56) data_df loaded
base56_ahead56) Irrelevant symbols dropped.
base56_ahead56) Intervals generated
base56_ahead56) Irrelevant samples dropped and missing data filled.
base56_ahead56) 220272 intervals generated in: 135 seconds.
Generating: base112_ahead1
base112_ahead1) data_df loaded
base112_ahead1) Irrelevant symbols dropped.
base112_ahead1) Intervals generated
base112_ahead1) Irrelevant samples dropped and missing data filled.
base112_ahead1) 220279 intervals generated in: 150 seconds.
Generating: base112_ahead7
base112_ahead7) data_df loaded
base112_ahead7) Irrelevant symbols dropped.
base112_ahead7) Intervals generated
base112_ahead7) Irrelevant samples dropped and missing data filled.
base112_ahead7) 219994 intervals generated in: 158 seconds.
Generating: base112_ahead14
base112_ahead14) data_df loaded
base112_ahead14) Irrelevant symbols dropped.
base112_ahead14) Intervals generated
base112_ahead14) Irrelevant samples dropped and missing data filled.
base112_ahead14) 219709 intervals generated in: 155 seconds.
Generating: base112_ahead28
base112_ahead28) data_df loaded
base112_ahead28) Irrelevant symbols dropped.
base112_ahead28) Intervals generated
base112_ahead28) Irrelevant samples dropped and missing data filled.
base112_ahead28) 219139 intervals generated in: 156 seconds.
Generating: base112_ahead56
base112_ahead56) data_df loaded
base112_ahead56) Irrelevant symbols dropped.
base112_ahead56) Intervals generated
base112_ahead56) Irrelevant samples dropped and missing data filled.
base112_ahead56) 217999 intervals generated in: 147 seconds.

In [6]:
datasets_params_list_df['x_filename'] = datasets_params_list_df.apply(lambda x: 
                                                                      'x_base{}_ahead{}.pkl'.format(int(x['base_days']), 
                                                                                                    int(x['ahead_days'])), axis=1)
datasets_params_list_df['y_filename'] = datasets_params_list_df.apply(lambda x: 
                                                                      'y_base{}_ahead{}.pkl'.format(int(x['base_days']), 
                                                                                                    int(x['ahead_days'])), axis=1)
datasets_params_list_df


Out[6]:
base_days ahead_days train_val_time step_days GOOD_DATA_RATIO SAMPLES_GOOD_DATA_RATIO x_filename y_filename
0 7 1 -1 7 0.99 0.9 x_base7_ahead1.pkl y_base7_ahead1.pkl
1 7 7 -1 7 0.99 0.9 x_base7_ahead7.pkl y_base7_ahead7.pkl
2 7 14 -1 7 0.99 0.9 x_base7_ahead14.pkl y_base7_ahead14.pkl
3 7 28 -1 7 0.99 0.9 x_base7_ahead28.pkl y_base7_ahead28.pkl
4 7 56 -1 7 0.99 0.9 x_base7_ahead56.pkl y_base7_ahead56.pkl
5 14 1 -1 7 0.99 0.9 x_base14_ahead1.pkl y_base14_ahead1.pkl
6 14 7 -1 7 0.99 0.9 x_base14_ahead7.pkl y_base14_ahead7.pkl
7 14 14 -1 7 0.99 0.9 x_base14_ahead14.pkl y_base14_ahead14.pkl
8 14 28 -1 7 0.99 0.9 x_base14_ahead28.pkl y_base14_ahead28.pkl
9 14 56 -1 7 0.99 0.9 x_base14_ahead56.pkl y_base14_ahead56.pkl
10 28 1 -1 7 0.99 0.9 x_base28_ahead1.pkl y_base28_ahead1.pkl
11 28 7 -1 7 0.99 0.9 x_base28_ahead7.pkl y_base28_ahead7.pkl
12 28 14 -1 7 0.99 0.9 x_base28_ahead14.pkl y_base28_ahead14.pkl
13 28 28 -1 7 0.99 0.9 x_base28_ahead28.pkl y_base28_ahead28.pkl
14 28 56 -1 7 0.99 0.9 x_base28_ahead56.pkl y_base28_ahead56.pkl
15 56 1 -1 7 0.99 0.9 x_base56_ahead1.pkl y_base56_ahead1.pkl
16 56 7 -1 7 0.99 0.9 x_base56_ahead7.pkl y_base56_ahead7.pkl
17 56 14 -1 7 0.99 0.9 x_base56_ahead14.pkl y_base56_ahead14.pkl
18 56 28 -1 7 0.99 0.9 x_base56_ahead28.pkl y_base56_ahead28.pkl
19 56 56 -1 7 0.99 0.9 x_base56_ahead56.pkl y_base56_ahead56.pkl
20 112 1 -1 7 0.99 0.9 x_base112_ahead1.pkl y_base112_ahead1.pkl
21 112 7 -1 7 0.99 0.9 x_base112_ahead7.pkl y_base112_ahead7.pkl
22 112 14 -1 7 0.99 0.9 x_base112_ahead14.pkl y_base112_ahead14.pkl
23 112 28 -1 7 0.99 0.9 x_base112_ahead28.pkl y_base112_ahead28.pkl
24 112 56 -1 7 0.99 0.9 x_base112_ahead56.pkl y_base112_ahead56.pkl

In [7]:
datasets_params_list_df.to_pickle('../../data/datasets_params_list_df.pkl')
