In [1]:

    
# Basic imports
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import scipy.optimize as spo
import sys

%matplotlib inline

%pylab inline
pylab.rcParams['figure.figsize'] = (20.0, 10.0)

%load_ext autoreload
%autoreload 2

sys.path.append('../../')









    



Populating the interactive namespace from numpy and matplotlib



In [2]:

    
data_df = pd.read_pickle('../../data/data_train_val_df.pkl')
print(data_df.shape)
data_df.head()









    



(5520, 2415)






    Out[2]:







  
    
      feature
      Close
      ...
      Volume
    
    
      
      SPY
      MMM
      ABT
      ABBV
      ACN
      ATVI
      AYI
      ADBE
      AMD
      AAP
      ...
      WYNN
      XEL
      XRX
      XLNX
      XYL
      YHOO
      YUM
      ZBH
      ZION
      ZTS
    
    
      date
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1993-01-29
      43.94
      24.50
      6.88
      NaN
      NaN
      NaN
      NaN
      2.59
      18.75
      NaN
      ...
      NaN
      87800.0
      7633602.0
      1745196.0
      NaN
      NaN
      NaN
      NaN
      33600.0
      NaN
    
    
      1993-02-01
      44.25
      24.69
      6.88
      NaN
      NaN
      NaN
      NaN
      2.72
      19.12
      NaN
      ...
      NaN
      72400.0
      3001200.0
      3574800.0
      NaN
      NaN
      NaN
      NaN
      32000.0
      NaN
    
    
      1993-02-02
      44.34
      24.72
      6.53
      NaN
      NaN
      NaN
      NaN
      2.84
      20.25
      NaN
      ...
      NaN
      242200.0
      1388598.0
      2652396.0
      NaN
      NaN
      NaN
      NaN
      251600.0
      NaN
    
    
      1993-02-03
      44.81
      25.19
      6.91
      NaN
      NaN
      NaN
      NaN
      2.70
      20.50
      NaN
      ...
      NaN
      272200.0
      1228200.0
      5040396.0
      NaN
      NaN
      NaN
      NaN
      254800.0
      NaN
    
    
      1993-02-04
      45.00
      26.06
      6.84
      NaN
      NaN
      NaN
      NaN
      2.73
      20.12
      NaN
      ...
      NaN
      162800.0
      1675602.0
      7033200.0
      NaN
      NaN
      NaN
      NaN
      317200.0
      NaN
    
  

5 rows × 2415 columns

The first objective of this notebook is to implement the next function (to extract sample intervals from the total period).



In [3]:

    
def generate_train_intervals(data_df, train_time, base_time, step, days_ahead, today):
    pass

Let's define the parameters as constants, just to do some scratch work.



In [4]:

    
# I will try to keep the convention to name with the "days" suffix, 
# to all the variables that represent "market days". The ones that 
# represent real time will be named more arbitrarily.

train_time = 365 # In real time days
base_days = 7 # In market days
step_days = 7 # market days
ahead_days = 1 # market days
today = data_df.index[-1] # Real date



In [5]:

    
today









    Out[5]:





Timestamp('2014-12-31 00:00:00')

The amount of samples to be generated would be (train_time - base_time) * num_companies / step. There are days_ahead market days left, only for target values, so the total "used" period is train_time + days_ahead.

The option of training with all, one, or some companies can be done by the user when it inputs the data (just filter data_df to get the companies you want). Anyway, one interesting choice would be to allow the training with multiple companies, targeting only one. That would multiply the features by the number of available companies, but would reduce the samples a lot. By now, I want to keep the complexity low, so I won't implement that idea, yet. A many to many approach could also be implemented (the target would be the vector with all the companies data). I will start with the simple "one to one".



In [6]:

    
data_df.index[data_df.index <= today][-(ahead_days + 1)]









    Out[6]:





Timestamp('2014-12-30 00:00:00')



In [7]:

    
def add_market_days(base, delta, data_df):
    """
    base is in real time.
    delta is in market days.
    """
    market_days = data_df.index
    if base not in market_days:
        raise Exception('The base date is not in the market days list.')
    base_index = market_days.tolist().index(base)
    if base_index + delta >= len(market_days):
        return market_days[-1]
    if base_index + delta < 0:
        return market_days[0]
    return market_days[base_index + delta]



In [8]:

    
# Remember the last target days are not used for training, but that is a "market days" period.
end_of_training_date = add_market_days(today, -ahead_days, data_df)
start_date = end_of_training_date - dt.timedelta(train_time) 
print('Start date: %s.  End of training date: %s.' % (start_date, end_of_training_date))









    



Start date: 2013-12-30 00:00:00.  End of training date: 2014-12-30 00:00:00.



In [9]:

    
TARGET_FEATURE = 'Close'

One important thing to note: the base time is in "market days", that means that it doesn't represent a period of "real" time (the real time may vary with each base interval).



In [10]:

    
def print_period(data_df):
    print('Period: %s  to  %s.' % (data_df.index[0], data_df.index[-1]))



In [11]:

    
data_train_df = data_df[start_date:end_of_training_date]

print_period(data_train_df)
data_train_df.shape









    



Period: 2013-12-30 00:00:00  to  2014-12-30 00:00:00.






    Out[11]:





(253, 2415)



In [12]:

    
start_target_date = add_market_days(start_date, base_days + ahead_days - 1, data_df)
data_target_df = data_df.loc[start_target_date: today,TARGET_FEATURE]

print_period(data_target_df)
data_target_df.shape









    



Period: 2014-01-09 00:00:00  to  2014-12-31 00:00:00.






    Out[12]:





(247, 483)

Is that initial date correct?



In [13]:

    
data_train_df.index[:10]









    Out[13]:





DatetimeIndex(['2013-12-30', '2013-12-31', '2014-01-02', '2014-01-03',
               '2014-01-06', '2014-01-07', '2014-01-08', '2014-01-09',
               '2014-01-10', '2014-01-13'],
              dtype='datetime64[ns]', name='date', freq=None)

Ok, it looks so.

Let's split now!

I should allow for different feature extraction functions to be used, after the time divisions.



In [14]:

    
date_base_ini = start_date
date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
date_target = add_market_days(date_base_end, ahead_days, data_df)

sample_blob = (data_train_df[date_base_ini: date_base_end], pd.DataFrame(data_target_df.loc[date_target]))
sample_blob[0]









    Out[14]:







  
    
      feature
      Close
      ...
      Volume
    
    
      
      SPY
      MMM
      ABT
      ABBV
      ACN
      ATVI
      AYI
      ADBE
      AMD
      AAP
      ...
      WYNN
      XEL
      XRX
      XLNX
      XYL
      YHOO
      YUM
      ZBH
      ZION
      ZTS
    
    
      date
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2013-12-30
      183.82
      139.42
      38.41
      53.01
      82.06
      17.80
      109.71
      59.55
      3.85
      109.63
      ...
      1950055.0
      1867094.0
      3846744.0
      1049165.0
      561674.0
      8678033.0
      1953272.0
      623173.0
      895272.0
      1803579.0
    
    
      2013-12-31
      184.69
      140.25
      38.33
      52.81
      82.22
      17.83
      109.32
      59.88
      3.87
      110.68
      ...
      731522.0
      1752814.0
      5359541.0
      1215365.0
      557969.0
      8292761.0
      2132916.0
      649985.0
      1077417.0
      2270418.0
    
    
      2014-01-02
      182.92
      138.13
      38.23
      51.98
      81.13
      18.07
      107.65
      59.29
      3.95
      109.74
      ...
      3056065.0
      3192314.0
      10481405.0
      3437019.0
      765141.0
      21514650.0
      1956285.0
      868763.0
      1356738.0
      2576112.0
    
    
      2014-01-03
      182.88
      138.45
      38.64
      52.30
      81.40
      18.29
      108.15
      59.16
      4.00
      112.88
      ...
      1169540.0
      2939378.0
      7282652.0
      1982702.0
      454495.0
      15761243.0
      1457058.0
      1288207.0
      1122452.0
      2524947.0
    
    
      2014-01-06
      182.36
      137.63
      39.15
      50.39
      80.54
      18.08
      106.28
      58.12
      4.13
      111.80
      ...
      1289126.0
      3382267.0
      14906758.0
      1970805.0
      849360.0
      12472724.0
      2940835.0
      1414955.0
      1988180.0
      2763350.0
    
    
      2014-01-07
      183.48
      137.65
      38.85
      50.49
      81.52
      18.32
      109.44
      58.97
      4.18
      113.18
      ...
      1688085.0
      3481465.0
      15383264.0
      1581167.0
      611127.0
      14141112.0
      3625927.0
      1852572.0
      1343169.0
      2338176.0
    
    
      2014-01-08
      183.52
      136.63
      39.20
      50.36
      82.15
      18.34
      110.03
      58.90
      4.18
      112.30
      ...
      1406668.0
      3563670.0
      7833434.0
      2318930.0
      1234973.0
      18657195.0
      4448753.0
      1880549.0
      2034692.0
      3965882.0
    
  

7 rows × 2415 columns



In [15]:

    
target = sample_blob[1].T
target









    Out[15]:







  
    
      
      SPY
      MMM
      ABT
      ABBV
      ACN
      ATVI
      AYI
      ADBE
      AMD
      AAP
      ...
      WYNN
      XEL
      XRX
      XLNX
      XYL
      YHOO
      YUM
      ZBH
      ZION
      ZTS
    
  
  
    
      2014-01-09
      183.64
      136.45
      39.27
      51.22
      82.95
      18.3
      123.5
      59.09
      4.09
      113.55
      ...
      204.77
      27.82
      12.05
      45.78
      34.63
      40.92
      75.05
      96.47
      30.22
      31.96
    
  

1 rows × 483 columns

Let's define a function that takes a "sample blob" and produces one sample per symbol, only for the "Close" feature (looks like the easiest to do first). The dates in the base period should be substituted by an index, and the symbols shuffled later (along with their labels).



In [16]:

    
feat_close = sample_blob[0][TARGET_FEATURE]
feat_close.index = np.arange(feat_close.shape[0])
feat_close









    Out[16]:







  
    
      
      SPY
      MMM
      ABT
      ABBV
      ACN
      ATVI
      AYI
      ADBE
      AMD
      AAP
      ...
      WYNN
      XEL
      XRX
      XLNX
      XYL
      YHOO
      YUM
      ZBH
      ZION
      ZTS
    
  
  
    
      0
      183.82
      139.42
      38.41
      53.01
      82.06
      17.80
      109.71
      59.55
      3.85
      109.63
      ...
      193.29
      27.92
      12.15
      45.64
      34.58
      40.20
      74.92
      93.20
      29.85
      32.67
    
    
      1
      184.69
      140.25
      38.33
      52.81
      82.22
      17.83
      109.32
      59.88
      3.87
      110.68
      ...
      194.21
      27.94
      12.17
      45.92
      34.60
      40.44
      75.61
      93.19
      29.96
      32.69
    
    
      2
      182.92
      138.13
      38.23
      51.98
      81.13
      18.07
      107.65
      59.29
      3.95
      109.74
      ...
      197.94
      27.51
      11.91
      45.97
      34.16
      39.59
      75.09
      92.24
      29.65
      32.36
    
    
      3
      182.88
      138.45
      38.64
      52.30
      81.40
      18.29
      108.15
      59.16
      4.00
      112.88
      ...
      196.00
      27.50
      11.99
      45.62
      34.47
      40.12
      75.56
      92.64
      29.86
      32.05
    
    
      4
      182.36
      137.63
      39.15
      50.39
      80.54
      18.08
      106.28
      58.12
      4.13
      111.80
      ...
      195.86
      27.35
      12.09
      45.42
      34.41
      39.93
      75.50
      93.24
      29.65
      31.98
    
    
      5
      183.48
      137.65
      38.85
      50.49
      81.52
      18.32
      109.44
      58.97
      4.18
      113.18
      ...
      201.51
      27.82
      12.19
      45.52
      34.51
      40.92
      76.56
      95.10
      29.74
      32.10
    
    
      6
      183.52
      136.63
      39.20
      50.36
      82.15
      18.34
      110.03
      58.90
      4.18
      112.30
      ...
      205.29
      27.71
      12.08
      45.91
      34.49
      41.02
      76.53
      97.43
      30.00
      31.74
    
  

7 rows × 483 columns



In [17]:

    
target.index = ['target']
target









    Out[17]:







  
    
      
      SPY
      MMM
      ABT
      ABBV
      ACN
      ATVI
      AYI
      ADBE
      AMD
      AAP
      ...
      WYNN
      XEL
      XRX
      XLNX
      XYL
      YHOO
      YUM
      ZBH
      ZION
      ZTS
    
  
  
    
      target
      183.64
      136.45
      39.27
      51.22
      82.95
      18.3
      123.5
      59.09
      4.09
      113.55
      ...
      204.77
      27.82
      12.05
      45.78
      34.63
      40.92
      75.05
      96.47
      30.22
      31.96
    
  

1 rows × 483 columns



In [18]:

    
x_y_samples = feat_close.append(target)
x_y_samples









    Out[18]:







  
    
      
      SPY
      MMM
      ABT
      ABBV
      ACN
      ATVI
      AYI
      ADBE
      AMD
      AAP
      ...
      WYNN
      XEL
      XRX
      XLNX
      XYL
      YHOO
      YUM
      ZBH
      ZION
      ZTS
    
  
  
    
      0
      183.82
      139.42
      38.41
      53.01
      82.06
      17.80
      109.71
      59.55
      3.85
      109.63
      ...
      193.29
      27.92
      12.15
      45.64
      34.58
      40.20
      74.92
      93.20
      29.85
      32.67
    
    
      1
      184.69
      140.25
      38.33
      52.81
      82.22
      17.83
      109.32
      59.88
      3.87
      110.68
      ...
      194.21
      27.94
      12.17
      45.92
      34.60
      40.44
      75.61
      93.19
      29.96
      32.69
    
    
      2
      182.92
      138.13
      38.23
      51.98
      81.13
      18.07
      107.65
      59.29
      3.95
      109.74
      ...
      197.94
      27.51
      11.91
      45.97
      34.16
      39.59
      75.09
      92.24
      29.65
      32.36
    
    
      3
      182.88
      138.45
      38.64
      52.30
      81.40
      18.29
      108.15
      59.16
      4.00
      112.88
      ...
      196.00
      27.50
      11.99
      45.62
      34.47
      40.12
      75.56
      92.64
      29.86
      32.05
    
    
      4
      182.36
      137.63
      39.15
      50.39
      80.54
      18.08
      106.28
      58.12
      4.13
      111.80
      ...
      195.86
      27.35
      12.09
      45.42
      34.41
      39.93
      75.50
      93.24
      29.65
      31.98
    
    
      5
      183.48
      137.65
      38.85
      50.49
      81.52
      18.32
      109.44
      58.97
      4.18
      113.18
      ...
      201.51
      27.82
      12.19
      45.52
      34.51
      40.92
      76.56
      95.10
      29.74
      32.10
    
    
      6
      183.52
      136.63
      39.20
      50.36
      82.15
      18.34
      110.03
      58.90
      4.18
      112.30
      ...
      205.29
      27.71
      12.08
      45.91
      34.49
      41.02
      76.53
      97.43
      30.00
      31.74
    
    
      target
      183.64
      136.45
      39.27
      51.22
      82.95
      18.30
      123.50
      59.09
      4.09
      113.55
      ...
      204.77
      27.82
      12.05
      45.78
      34.63
      40.92
      75.05
      96.47
      30.22
      31.96
    
  

8 rows × 483 columns



In [19]:

    
x_y_samples_shuffled = x_y_samples.T.sample(frac=1).reset_index(drop=True)
x_y_samples_shuffled.head()

It is important to take care of the NaN values. Possibly at this sample_blob level is a good point to do so; just discard too bad samples.



In [20]:

    
x_y_samples_shuffled.isnull().sum()









    Out[20]:





0         10
1         10
2         10
3         10
4         10
5         10
6         10
target    10
dtype: int64



In [21]:

    
x_y_samples_filtered = x_y_samples_shuffled.dropna(axis=0, how='any')
print(x_y_samples_filtered.shape)
x_y_samples_filtered.isnull().sum()









    



(473, 8)






    Out[21]:





0         0
1         0
2         0
3         0
4         0
5         0
6         0
target    0
dtype: int64



In [22]:

    
# At some point I will have to standarize those values... (not now, but just as a reminder...)

std_samples = x_y_samples_shuffled.apply(lambda x: x / np.mean(x), axis=1)
std_samples.head()



In [23]:

    
features = std_samples.iloc[:,:-1]
features.head()



In [24]:

    
target = pd.DataFrame(std_samples.iloc[:,-1])
target.head()

Let's create the samples divider function



In [25]:

    
TARGET_FEATURE = 'Close'


def feature_close_one_to_one(sample_blob):
    target = sample_blob[1].T
    feat_close = sample_blob[0][TARGET_FEATURE]
    feat_close.index = np.arange(feat_close.shape[0])
    target.index = ['target']
    x_y_samples = feat_close.append(target)
    x_y_samples_shuffled = x_y_samples.T.sample(frac=1).reset_index(drop=True)
    x_y_samples_filtered = x_y_samples_shuffled.dropna(axis=0, how='any')
    
    return x_y_samples_filtered



In [26]:

    
print(feature_close_one_to_one(sample_blob).shape)
feature_close_one_to_one(sample_blob).head()



In [27]:

    
date_base_ini = start_date
date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
date_target = add_market_days(date_base_end, ahead_days, data_df)
feat_tgt_df = pd.DataFrame()

while date_base_end < end_of_training_date:
    sample_blob = (data_train_df[date_base_ini: date_base_end],
                   pd.DataFrame(data_target_df.loc[date_target]))
    feat_tgt_blob = feature_close_one_to_one(sample_blob) # TODO: Change for a generic function
    feat_tgt_df = feat_tgt_df.append(feat_tgt_blob, ignore_index=True)
    
    date_base_ini = add_market_days(date_base_ini, step_days, data_df)
    date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
    date_target = add_market_days(date_base_end, ahead_days, data_df)
    # print('Start: %s,  End:%s' % (date_base_ini, date_base_end))

feat_tgt_df = feat_tgt_df.sample(frac=1).reset_index(drop=True)

X_df = feat_tgt_df.iloc[:,:-1]
y_df = pd.DataFrame(feat_tgt_df.iloc[:,-1])



In [28]:

    
print(X_df.shape)
X_df.head()



In [29]:

    
print(y_df.shape)
y_df.head()

So, I have everything to define the final function of this notebook



In [30]:

    
def generate_train_intervals(data_df, train_time, base_days, step_days, ahead_days, today, blob_fun):
    end_of_training_date = add_market_days(today, -ahead_days, data_df)
    start_date = end_of_training_date - dt.timedelta(train_time)
    start_target_date = add_market_days(start_date, base_days + ahead_days - 1, data_df)
    
    data_train_df = data_df[start_date:end_of_training_date]
    data_target_df = data_df.loc[start_target_date: today,TARGET_FEATURE]
    
    date_base_ini = start_date
    date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
    date_target = add_market_days(date_base_end, ahead_days, data_df)
    feat_tgt_df = pd.DataFrame()

    while date_base_end < end_of_training_date:
        sample_blob = (data_train_df[date_base_ini: date_base_end],
                       pd.DataFrame(data_target_df.loc[date_target]))
        feat_tgt_blob = blob_fun(sample_blob)
        feat_tgt_df = feat_tgt_df.append(feat_tgt_blob, ignore_index=True)
        
        date_base_ini = add_market_days(date_base_ini, step_days, data_df)
        date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
        date_target = add_market_days(date_base_end, ahead_days, data_df)
        # print('Start: %s,  End:%s' % (date_base_ini, date_base_end))
    
    feat_tgt_df = feat_tgt_df.sample(frac=1).reset_index(drop=True)
    
    X_df = feat_tgt_df.iloc[:,:-1]
    y_df = pd.DataFrame(feat_tgt_df.iloc[:,-1])
    
    return X_df, y_df



In [31]:

    
train_time = 365 # In real time days
base_days = 7 # In market days
step_days = 7 # market days
ahead_days = 1 # market days
today = data_df.index[-1] # Real date

X, y = generate_train_intervals(data_df, train_time, base_days, step_days, ahead_days, today, feature_close_one_to_one)



In [32]:

    
print(X.shape)
X.head()



In [33]:

    
print(y.shape)
y.head()



In [34]:

    
%pwd









    Out[34]:





'/home/miguel/udacity/Machine Learning Nanodegree/projects/capstone/capstone/notebooks/dev'



In [35]:

    
sys.path.append('../../')
import predictor.feature_extraction as fe

Let's try the function as it was saved in the package:



In [36]:

    
X, y = fe.generate_train_intervals(data_df, 
                                   train_time, 
                                   base_days, 
                                   step_days, 
                                   ahead_days, 
                                   today, 
                                   feature_close_one_to_one)



In [37]:

    
print(X.shape)
X.head()



In [38]:

    
print(y.shape)
y.head()

Looks good

Sometimes, it may be useful to keep the dates information...



In [39]:

    
x_y_samples









    Out[39]:







  
    
      
      SPY
      MMM
      ABT
      ABBV
      ACN
      ATVI
      AYI
      ADBE
      AMD
      AAP
      ...
      WYNN
      XEL
      XRX
      XLNX
      XYL
      YHOO
      YUM
      ZBH
      ZION
      ZTS
    
  
  
    
      0
      183.82
      139.42
      38.41
      53.01
      82.06
      17.80
      109.71
      59.55
      3.85
      109.63
      ...
      193.29
      27.92
      12.15
      45.64
      34.58
      40.20
      74.92
      93.20
      29.85
      32.67
    
    
      1
      184.69
      140.25
      38.33
      52.81
      82.22
      17.83
      109.32
      59.88
      3.87
      110.68
      ...
      194.21
      27.94
      12.17
      45.92
      34.60
      40.44
      75.61
      93.19
      29.96
      32.69
    
    
      2
      182.92
      138.13
      38.23
      51.98
      81.13
      18.07
      107.65
      59.29
      3.95
      109.74
      ...
      197.94
      27.51
      11.91
      45.97
      34.16
      39.59
      75.09
      92.24
      29.65
      32.36
    
    
      3
      182.88
      138.45
      38.64
      52.30
      81.40
      18.29
      108.15
      59.16
      4.00
      112.88
      ...
      196.00
      27.50
      11.99
      45.62
      34.47
      40.12
      75.56
      92.64
      29.86
      32.05
    
    
      4
      182.36
      137.63
      39.15
      50.39
      80.54
      18.08
      106.28
      58.12
      4.13
      111.80
      ...
      195.86
      27.35
      12.09
      45.42
      34.41
      39.93
      75.50
      93.24
      29.65
      31.98
    
    
      5
      183.48
      137.65
      38.85
      50.49
      81.52
      18.32
      109.44
      58.97
      4.18
      113.18
      ...
      201.51
      27.82
      12.19
      45.52
      34.51
      40.92
      76.56
      95.10
      29.74
      32.10
    
    
      6
      183.52
      136.63
      39.20
      50.36
      82.15
      18.34
      110.03
      58.90
      4.18
      112.30
      ...
      205.29
      27.71
      12.08
      45.91
      34.49
      41.02
      76.53
      97.43
      30.00
      31.74
    
    
      target
      183.64
      136.45
      39.27
      51.22
      82.95
      18.30
      123.50
      59.09
      4.09
      113.55
      ...
      204.77
      27.82
      12.05
      45.78
      34.63
      40.92
      75.05
      96.47
      30.22
      31.96
    
  

8 rows × 483 columns



In [40]:

    
target = sample_blob[1].T
feat_close = sample_blob[0][TARGET_FEATURE]
x_y_samples = feat_close.append(target)
x_y_samples









    Out[40]:







  
    
      
      SPY
      MMM
      ABT
      ABBV
      ACN
      ATVI
      AYI
      ADBE
      AMD
      AAP
      ...
      WYNN
      XEL
      XRX
      XLNX
      XYL
      YHOO
      YUM
      ZBH
      ZION
      ZTS
    
  
  
    
      2014-12-18
      206.78
      165.30
      45.77
      67.92
      89.74
      20.04
      137.17
      74.89
      2.55
      160.74
      ...
      145.10
      35.35
      13.89
      43.71
      38.10
      50.91
      71.74
      114.94
      28.20
      43.15
    
    
      2014-12-19
      206.52
      165.48
      46.05
      67.71
      90.51
      20.25
      137.38
      74.45
      2.57
      159.85
      ...
      150.11
      35.40
      13.96
      43.00
      38.59
      50.88
      71.17
      114.91
      28.39
      43.51
    
    
      2014-12-22
      207.47
      167.27
      46.37
      66.97
      91.18
      20.30
      140.05
      74.50
      2.66
      159.44
      ...
      147.48
      35.65
      13.97
      43.87
      38.74
      51.15
      72.46
      115.05
      28.18
      43.41
    
    
      2014-12-23
      207.75
      166.87
      45.72
      64.35
      91.42
      20.23
      139.48
      74.38
      2.67
      161.19
      ...
      151.69
      35.43
      14.07
      43.91
      38.72
      50.02
      72.68
      113.47
      28.53
      42.97
    
    
      2014-12-24
      207.77
      166.96
      45.70
      66.21
      91.32
      20.35
      139.94
      74.74
      2.65
      160.23
      ...
      150.90
      36.23
      14.08
      44.08
      38.76
      50.65
      72.79
      114.11
      28.52
      43.84
    
    
      2014-12-26
      208.44
      166.26
      45.85
      66.98
      91.26
      20.42
      139.66
      74.67
      2.65
      160.00
      ...
      150.37
      36.58
      14.14
      43.92
      38.95
      50.86
      73.14
      114.17
      28.56
      44.20
    
    
      2014-12-29
      208.72
      166.71
      45.60
      67.14
      90.52
      20.36
      139.88
      74.13
      2.66
      161.22
      ...
      153.00
      37.25
      14.14
      43.79
      38.86
      50.53
      73.56
      113.50
      28.72
      44.01
    
    
      2014-12-30
      207.60
      165.84
      45.69
      66.30
      90.19
      20.24
      141.20
      73.08
      2.63
      160.03
      ...
      151.20
      36.38
      14.01
      43.55
      38.43
      51.22
      73.28
      114.57
      28.67
      43.35
    
  

8 rows × 483 columns



In [41]:

    
x_y_samples.index = pd.MultiIndex.from_product([[x_y_samples.index[0]], np.arange(x_y_samples.shape[0])])
x_y_samples









    Out[41]:







  
    
      
      
      SPY
      MMM
      ABT
      ABBV
      ACN
      ATVI
      AYI
      ADBE
      AMD
      AAP
      ...
      WYNN
      XEL
      XRX
      XLNX
      XYL
      YHOO
      YUM
      ZBH
      ZION
      ZTS
    
  
  
    
      2014-12-18
      0
      206.78
      165.30
      45.77
      67.92
      89.74
      20.04
      137.17
      74.89
      2.55
      160.74
      ...
      145.10
      35.35
      13.89
      43.71
      38.10
      50.91
      71.74
      114.94
      28.20
      43.15
    
    
      1
      206.52
      165.48
      46.05
      67.71
      90.51
      20.25
      137.38
      74.45
      2.57
      159.85
      ...
      150.11
      35.40
      13.96
      43.00
      38.59
      50.88
      71.17
      114.91
      28.39
      43.51
    
    
      2
      207.47
      167.27
      46.37
      66.97
      91.18
      20.30
      140.05
      74.50
      2.66
      159.44
      ...
      147.48
      35.65
      13.97
      43.87
      38.74
      51.15
      72.46
      115.05
      28.18
      43.41
    
    
      3
      207.75
      166.87
      45.72
      64.35
      91.42
      20.23
      139.48
      74.38
      2.67
      161.19
      ...
      151.69
      35.43
      14.07
      43.91
      38.72
      50.02
      72.68
      113.47
      28.53
      42.97
    
    
      4
      207.77
      166.96
      45.70
      66.21
      91.32
      20.35
      139.94
      74.74
      2.65
      160.23
      ...
      150.90
      36.23
      14.08
      44.08
      38.76
      50.65
      72.79
      114.11
      28.52
      43.84
    
    
      5
      208.44
      166.26
      45.85
      66.98
      91.26
      20.42
      139.66
      74.67
      2.65
      160.00
      ...
      150.37
      36.58
      14.14
      43.92
      38.95
      50.86
      73.14
      114.17
      28.56
      44.20
    
    
      6
      208.72
      166.71
      45.60
      67.14
      90.52
      20.36
      139.88
      74.13
      2.66
      161.22
      ...
      153.00
      37.25
      14.14
      43.79
      38.86
      50.53
      73.56
      113.50
      28.72
      44.01
    
    
      7
      207.60
      165.84
      45.69
      66.30
      90.19
      20.24
      141.20
      73.08
      2.63
      160.03
      ...
      151.20
      36.38
      14.01
      43.55
      38.43
      51.22
      73.28
      114.57
      28.67
      43.35
    
  

8 rows × 483 columns

That would be the way to go: the timestamp of the first day of the base period works as a global timestamp for the base period.



In [42]:

    
x_y_samples.unstack().stack(0).sample(frac=1).reset_index(level=1, drop=True).head()

Let's try the whole function, with shuffle (it's better to do it early, so that I won't forget later and get some artificial results), but keeping the index.



In [43]:

    
TARGET_FEATURE = 'Close'


def feature_close_one_to_one(sample_blob):
    target = sample_blob[1].T
    feat_close = sample_blob[0][TARGET_FEATURE]
    x_y_samples = feat_close.append(target)
    x_y_samples.index = pd.MultiIndex.from_product([[x_y_samples.index[0]], 
                                                    np.arange(x_y_samples.shape[0])])
    x_y_samples_shuffled = x_y_samples.unstack().stack(0).sample(frac=1).reset_index(level=1, drop=True)
    x_y_samples_filtered = x_y_samples_shuffled.dropna(axis=0, how='any')
    
    return x_y_samples_filtered



In [44]:

    
print(feature_close_one_to_one(sample_blob).shape)
feature_close_one_to_one(sample_blob).head()



In [45]:

    
def generate_train_intervals(data_df, train_time, base_days, step_days, ahead_days, today, blob_fun):
    end_of_training_date = add_market_days(today, -ahead_days, data_df)
    start_date = end_of_training_date - dt.timedelta(train_time)
    start_target_date = add_market_days(start_date, base_days + ahead_days - 1, data_df)
    
    data_train_df = data_df[start_date:end_of_training_date]
    data_target_df = data_df.loc[start_target_date: today, TARGET_FEATURE]
    
    date_base_ini = start_date
    date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
    date_target = add_market_days(date_base_end, ahead_days, data_df)
    feat_tgt_df = pd.DataFrame()

    while date_base_end < end_of_training_date:
        sample_blob = (data_train_df[date_base_ini: date_base_end],
                       pd.DataFrame(data_target_df.loc[date_target]))
        feat_tgt_blob = blob_fun(sample_blob)
        feat_tgt_df = feat_tgt_df.append(feat_tgt_blob)
        
        date_base_ini = add_market_days(date_base_ini, step_days, data_df)
        date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
        date_target = add_market_days(date_base_end, ahead_days, data_df)
        # print('Start: %s,  End:%s' % (date_base_ini, date_base_end))
    
    feat_tgt_df = feat_tgt_df.sample(frac=1)
    
    X_df = feat_tgt_df.iloc[:,:-1]
    y_df = pd.DataFrame(feat_tgt_df.iloc[:,-1]).rename(columns={7:'target'})
    
    return X_df, y_df



In [46]:

    
from time import time

tic = time()
X, y = generate_train_intervals(data_df, 
                                train_time, 
                                base_days, 
                                step_days, 
                                ahead_days, 
                                today, 
                                feature_close_one_to_one)
toc = time()
print('Elapsed time: %i seconds.' % (toc-tic))









    



Elapsed time: 2 seconds.



In [47]:

    
print(X.shape)
X.head(10)



In [48]:

    
print(y.shape)
y.head(10)

Let's test the "final" (you never know...) function in its module



In [49]:

    
sys.path.append('../../')
import predictor.feature_extraction as fe

X, y = fe.generate_train_intervals(data_df, 
                                   train_time, 
                                   base_days, 
                                   step_days, 
                                   ahead_days, 
                                   today, 
                                   feature_close_one_to_one)



In [50]:

    
print(X.shape)
X.head(10)



In [51]:

    
print(y.shape)
y.head(10)

Nice!



In [ ]:

I will try to modify the add_market_days function to make it return a shift in real days instead of an index shift (that takes into account the possible duplicates, that are very common in some of the approaches I will follow)



In [55]:

    
data_df









    Out[55]:







  
    
      feature
      Close
      ...
      Volume
    
    
      
      SPY
      MMM
      ABT
      ABBV
      ACN
      ATVI
      AYI
      ADBE
      AMD
      AAP
      ...
      WYNN
      XEL
      XRX
      XLNX
      XYL
      YHOO
      YUM
      ZBH
      ZION
      ZTS
    
    
      date
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1993-01-29
      43.94
      24.50
      6.88
      NaN
      NaN
      NaN
      NaN
      2.59
      18.75
      NaN
      ...
      NaN
      87800.0
      7633602.0
      1745196.0
      NaN
      NaN
      NaN
      NaN
      33600.0
      NaN
    
    
      1993-02-01
      44.25
      24.69
      6.88
      NaN
      NaN
      NaN
      NaN
      2.72
      19.12
      NaN
      ...
      NaN
      72400.0
      3001200.0
      3574800.0
      NaN
      NaN
      NaN
      NaN
      32000.0
      NaN
    
    
      1993-02-02
      44.34
      24.72
      6.53
      NaN
      NaN
      NaN
      NaN
      2.84
      20.25
      NaN
      ...
      NaN
      242200.0
      1388598.0
      2652396.0
      NaN
      NaN
      NaN
      NaN
      251600.0
      NaN
    
    
      1993-02-03
      44.81
      25.19
      6.91
      NaN
      NaN
      NaN
      NaN
      2.70
      20.50
      NaN
      ...
      NaN
      272200.0
      1228200.0
      5040396.0
      NaN
      NaN
      NaN
      NaN
      254800.0
      NaN
    
    
      1993-02-04
      45.00
      26.06
      6.84
      NaN
      NaN
      NaN
      NaN
      2.73
      20.12
      NaN
      ...
      NaN
      162800.0
      1675602.0
      7033200.0
      NaN
      NaN
      NaN
      NaN
      317200.0
      NaN
    
    
      1993-02-05
      44.97
      27.19
      6.88
      NaN
      NaN
      NaN
      NaN
      2.60
      19.62
      NaN
      ...
      NaN
      73600.0
      3104598.0
      7197600.0
      NaN
      NaN
      NaN
      NaN
      292400.0
      NaN
    
    
      1993-02-08
      44.97
      27.25
      6.66
      NaN
      NaN
      NaN
      NaN
      2.62
      19.75
      NaN
      ...
      NaN
      43400.0
      3567600.0
      5848800.0
      NaN
      NaN
      NaN
      NaN
      202000.0
      NaN
    
    
      1993-02-09
      44.66
      27.25
      6.84
      NaN
      NaN
      NaN
      NaN
      2.59
      19.75
      NaN
      ...
      NaN
      58400.0
      1281600.0
      2401200.0
      NaN
      NaN
      NaN
      NaN
      111600.0
      NaN
    
    
      1993-02-10
      44.72
      27.31
      6.88
      NaN
      NaN
      NaN
      NaN
      2.56
      19.75
      NaN
      ...
      NaN
      174200.0
      2275602.0
      3268404.0
      NaN
      NaN
      NaN
      NaN
      38800.0
      NaN
    
    
      1993-02-11
      44.94
      26.94
      6.94
      NaN
      NaN
      NaN
      NaN
      2.62
      21.00
      NaN
      ...
      NaN
      87400.0
      343002.0
      1974804.0
      NaN
      NaN
      NaN
      NaN
      18400.0
      NaN
    
    
      1993-02-12
      44.59
      26.81
      7.00
      NaN
      NaN
      NaN
      NaN
      2.54
      21.38
      NaN
      ...
      NaN
      114400.0
      682200.0
      1629996.0
      NaN
      NaN
      NaN
      NaN
      69600.0
      NaN
    
    
      1993-02-16
      43.47
      26.31
      6.66
      NaN
      NaN
      NaN
      NaN
      2.52
      19.75
      NaN
      ...
      NaN
      87200.0
      1186800.0
      6063600.0
      NaN
      NaN
      NaN
      NaN
      61200.0
      NaN
    
    
      1993-02-17
      43.44
      25.94
      6.44
      NaN
      NaN
      NaN
      NaN
      2.58
      20.38
      NaN
      ...
      NaN
      134000.0
      2628198.0
      7012800.0
      NaN
      NaN
      NaN
      NaN
      164000.0
      NaN
    
    
      1993-02-18
      43.41
      25.75
      6.25
      NaN
      NaN
      NaN
      NaN
      2.64
      21.12
      NaN
      ...
      NaN
      89000.0
      1435998.0
      6742404.0
      NaN
      NaN
      NaN
      NaN
      118800.0
      NaN
    
    
      1993-02-19
      43.56
      26.22
      6.06
      NaN
      NaN
      NaN
      NaN
      2.73
      22.12
      NaN
      ...
      NaN
      127200.0
      816198.0
      2259600.0
      NaN
      NaN
      NaN
      NaN
      136400.0
      NaN
    
    
      1993-02-22
      43.72
      26.16
      5.81
      NaN
      NaN
      NaN
      NaN
      2.63
      21.75
      NaN
      ...
      NaN
      172400.0
      2625198.0
      9621600.0
      NaN
      NaN
      NaN
      NaN
      192000.0
      NaN
    
    
      1993-02-23
      43.69
      26.06
      6.06
      NaN
      NaN
      NaN
      NaN
      2.55
      21.38
      NaN
      ...
      NaN
      227800.0
      1179600.0
      2307600.0
      NaN
      NaN
      NaN
      NaN
      82400.0
      NaN
    
    
      1993-02-24
      44.25
      26.31
      6.50
      NaN
      NaN
      NaN
      NaN
      2.66
      22.00
      NaN
      ...
      NaN
      392400.0
      1937598.0
      2843196.0
      NaN
      NaN
      NaN
      NaN
      212800.0
      NaN
    
    
      1993-02-25
      44.34
      26.09
      6.59
      NaN
      NaN
      NaN
      NaN
      2.78
      22.12
      NaN
      ...
      NaN
      115400.0
      1834200.0
      3760800.0
      NaN
      NaN
      NaN
      NaN
      18400.0
      NaN
    
    
      1993-02-26
      44.41
      26.38
      6.53
      NaN
      NaN
      NaN
      NaN
      2.79
      21.75
      NaN
      ...
      NaN
      82400.0
      2159202.0
      2637996.0
      NaN
      NaN
      NaN
      NaN
      19200.0
      NaN
    
    
      1993-03-01
      44.28
      26.06
      6.47
      NaN
      NaN
      NaN
      NaN
      2.69
      21.25
      NaN
      ...
      NaN
      49000.0
      1657602.0
      1125600.0
      NaN
      NaN
      NaN
      NaN
      21600.0
      NaN
    
    
      1993-03-02
      44.94
      26.47
      6.50
      NaN
      NaN
      NaN
      NaN
      2.75
      22.25
      NaN
      ...
      NaN
      77800.0
      1094202.0
      7811604.0
      NaN
      NaN
      NaN
      NaN
      28800.0
      NaN
    
    
      1993-03-03
      45.12
      26.25
      6.41
      NaN
      NaN
      NaN
      NaN
      2.80
      22.88
      NaN
      ...
      NaN
      65600.0
      1642002.0
      6934404.0
      NaN
      NaN
      NaN
      NaN
      69600.0
      NaN
    
    
      1993-03-04
      44.88
      26.47
      6.38
      NaN
      NaN
      NaN
      NaN
      2.75
      22.62
      NaN
      ...
      NaN
      69800.0
      765198.0
      6698004.0
      NaN
      NaN
      NaN
      NaN
      112800.0
      NaN
    
    
      1993-03-05
      44.75
      26.75
      6.38
      NaN
      NaN
      NaN
      NaN
      2.80
      23.25
      NaN
      ...
      NaN
      146000.0
      897198.0
      15781200.0
      NaN
      NaN
      NaN
      NaN
      21600.0
      NaN
    
    
      1993-03-08
      45.75
      27.34
      6.59
      NaN
      NaN
      NaN
      NaN
      2.90
      24.00
      NaN
      ...
      NaN
      263800.0
      1549602.0
      11932800.0
      NaN
      NaN
      NaN
      NaN
      52000.0
      NaN
    
    
      1993-03-09
      45.59
      27.34
      6.31
      NaN
      NaN
      NaN
      NaN
      2.83
      23.62
      NaN
      ...
      NaN
      268400.0
      1260600.0
      6941604.0
      NaN
      NaN
      NaN
      NaN
      30400.0
      NaN
    
    
      1993-03-10
      45.69
      27.41
      6.59
      NaN
      NaN
      NaN
      NaN
      2.86
      23.62
      NaN
      ...
      NaN
      174000.0
      598800.0
      8206800.0
      NaN
      NaN
      NaN
      NaN
      40800.0
      NaN
    
    
      1993-03-11
      45.56
      27.25
      6.56
      NaN
      NaN
      NaN
      NaN
      2.80
      23.75
      NaN
      ...
      NaN
      113000.0
      904602.0
      6606804.0
      NaN
      NaN
      NaN
      NaN
      62800.0
      NaN
    
    
      1993-03-12
      45.09
      27.12
      6.66
      NaN
      NaN
      NaN
      NaN
      2.86
      23.75
      NaN
      ...
      NaN
      87200.0
      819798.0
      4695600.0
      NaN
      NaN
      NaN
      NaN
      82800.0
      NaN
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      2014-11-18
      205.55
      160.15
      43.70
      66.05
      84.30
      20.01
      138.62
      70.83
      2.65
      146.24
      ...
      1445341.0
      2847555.0
      5510194.0
      2696970.0
      908406.0
      26847315.0
      2586790.0
      688679.0
      1069486.0
      12181375.0
    
    
      2014-11-19
      205.22
      158.83
      43.43
      65.50
      83.64
      19.57
      138.82
      70.82
      2.63
      144.74
      ...
      876839.0
      1972218.0
      5084935.0
      2435547.0
      857298.0
      29260010.0
      1612491.0
      518740.0
      1747864.0
      7031136.0
    
    
      2014-11-20
      205.58
      159.36
      43.78
      65.19
      83.44
      21.11
      141.53
      70.44
      2.69
      145.19
      ...
      2655717.0
      2138563.0
      8755083.0
      1736083.0
      610697.0
      28916007.0
      4499097.0
      985701.0
      1849322.0
      5091131.0
    
    
      2014-11-21
      206.68
      160.16
      43.81
      67.36
      84.08
      21.17
      141.57
      71.32
      2.77
      145.49
      ...
      2017251.0
      2650022.0
      6361338.0
      2699464.0
      1030635.0
      22226984.0
      3654458.0
      821242.0
      2998052.0
      5192498.0
    
    
      2014-11-24
      207.26
      159.88
      44.32
      67.78
      84.93
      21.30
      142.51
      71.22
      2.83
      145.58
      ...
      1155464.0
      2225076.0
      8568584.0
      1699310.0
      643336.0
      14643538.0
      2753912.0
      1355368.0
      1319333.0
      4179357.0
    
    
      2014-11-25
      207.11
      158.06
      44.16
      68.06
      85.32
      21.73
      141.46
      72.00
      2.83
      145.45
      ...
      1670515.0
      2332571.0
      6371282.0
      1612922.0
      772716.0
      14219593.0
      3223564.0
      923555.0
      2446487.0
      3589015.0
    
    
      2014-11-26
      207.64
      158.31
      44.25
      68.70
      85.15
      21.82
      141.07
      72.98
      2.84
      145.71
      ...
      833423.0
      1256014.0
      3804699.0
      2356519.0
      550887.0
      13434021.0
      2487289.0
      595022.0
      887210.0
      2131385.0
    
    
      2014-11-28
      207.20
      160.09
      44.51
      69.20
      86.33
      21.65
      138.20
      73.68
      2.79
      147.08
      ...
      578749.0
      1210079.0
      3094160.0
      1423057.0
      676375.0
      8913738.0
      1505440.0
      426898.0
      1766234.0
      1332156.0
    
    
      2014-12-01
      205.76
      158.16
      44.32
      69.11
      86.00
      21.05
      137.03
      73.75
      2.67
      147.07
      ...
      1498720.0
      5969004.0
      4303105.0
      2638062.0
      850048.0
      23146896.0
      3123478.0
      613003.0
      3944978.0
      3153115.0
    
    
      2014-12-02
      207.08
      160.60
      45.22
      69.27
      85.98
      21.14
      138.18
      73.47
      2.68
      148.48
      ...
      1167121.0
      3751995.0
      5048956.0
      1774069.0
      964261.0
      16300591.0
      3634241.0
      1504922.0
      1770926.0
      2500597.0
    
    
      2014-12-03
      207.89
      162.26
      45.68
      68.50
      86.52
      21.48
      139.75
      73.18
      2.74
      152.04
      ...
      2206478.0
      3330964.0
      10690975.0
      5606099.0
      697054.0
      14235962.0
      3014968.0
      978397.0
      1701211.0
      2008310.0
    
    
      2014-12-04
      207.66
      162.26
      45.49
      69.57
      86.34
      21.77
      140.74
      73.03
      2.69
      153.25
      ...
      3673851.0
      2614210.0
      6927550.0
      4055601.0
      443265.0
      12157548.0
      2228168.0
      732956.0
      1296742.0
      4294633.0
    
    
      2014-12-05
      208.00
      162.27
      45.30
      69.71
      86.19
      21.79
      140.23
      72.40
      2.66
      154.18
      ...
      1531474.0
      2330569.0
      4423298.0
      2144040.0
      392476.0
      15418123.0
      1836426.0
      1018655.0
      4964092.0
      3313964.0
    
    
      2014-12-08
      206.61
      160.93
      45.71
      69.42
      84.97
      20.92
      137.68
      72.05
      2.58
      153.68
      ...
      2400856.0
      3159876.0
      6015100.0
      3244526.0
      468915.0
      18190065.0
      3400882.0
      644459.0
      3779869.0
      5889345.0
    
    
      2014-12-09
      206.47
      160.83
      45.44
      69.14
      85.18
      20.97
      137.67
      71.56
      2.59
      158.25
      ...
      1610673.0
      3249038.0
      8166454.0
      1958389.0
      616176.0
      19655554.0
      4467324.0
      986386.0
      3213855.0
      5516141.0
    
    
      2014-12-10
      203.16
      158.24
      44.37
      67.58
      83.73
      20.49
      133.07
      69.85
      2.48
      157.24
      ...
      4661598.0
      2761523.0
      8399042.0
      3377519.0
      782671.0
      16184119.0
      10302596.0
      924065.0
      2706376.0
      4398000.0
    
    
      2014-12-11
      204.19
      159.15
      44.46
      67.03
      83.88
      20.34
      131.75
      69.74
      2.49
      159.16
      ...
      2676661.0
      2116953.0
      6142819.0
      3158420.0
      586045.0
      21100184.0
      7960980.0
      961708.0
      2642106.0
      4942195.0
    
    
      2014-12-12
      200.89
      157.12
      43.69
      65.27
      81.90
      19.90
      129.92
      76.02
      2.52
      160.25
      ...
      1491876.0
      2194511.0
      7343378.0
      2524857.0
      722650.0
      20370512.0
      5649531.0
      879074.0
      2936013.0
      4050331.0
    
    
      2014-12-15
      199.51
      156.85
      43.46
      65.62
      81.88
      20.02
      129.17
      73.04
      2.47
      161.01
      ...
      1437791.0
      3172326.0
      7622374.0
      4404594.0
      726824.0
      18132508.0
      5074164.0
      753207.0
      2977774.0
      5947406.0
    
    
      2014-12-16
      197.91
      159.05
      43.40
      65.25
      82.87
      19.38
      128.36
      72.32
      2.50
      157.01
      ...
      2127890.0
      2964296.0
      11093935.0
      5930450.0
      712928.0
      21399337.0
      7051356.0
      1043861.0
      3680071.0
      4606599.0
    
    
      2014-12-17
      201.79
      160.60
      44.26
      66.57
      85.30
      19.67
      132.98
      74.00
      2.55
      159.55
      ...
      4298558.0
      3357456.0
      6824963.0
      12579715.0
      1241837.0
      17112312.0
      5411318.0
      926323.0
      3567571.0
      7628313.0
    
    
      2014-12-18
      206.78
      165.30
      45.77
      67.92
      89.74
      20.04
      137.17
      74.89
      2.55
      160.74
      ...
      2870142.0
      3062842.0
      6376658.0
      6461698.0
      698949.0
      15338945.0
      4513792.0
      957787.0
      2342748.0
      6578062.0
    
    
      2014-12-19
      206.52
      165.48
      46.05
      67.71
      90.51
      20.25
      137.38
      74.45
      2.57
      159.85
      ...
      3293472.0
      4525090.0
      12745939.0
      7023789.0
      1534229.0
      24110189.0
      5784930.0
      1271670.0
      9852629.0
      5601641.0
    
    
      2014-12-22
      207.47
      167.27
      46.37
      66.97
      91.18
      20.30
      140.05
      74.50
      2.66
      159.44
      ...
      1768390.0
      2095004.0
      5349323.0
      2508023.0
      475873.0
      24021105.0
      2472579.0
      629534.0
      1984295.0
      1921541.0
    
    
      2014-12-23
      207.75
      166.87
      45.72
      64.35
      91.42
      20.23
      139.48
      74.38
      2.67
      161.19
      ...
      1668952.0
      2220244.0
      4977161.0
      1830288.0
      645345.0
      15514036.0
      1605742.0
      1287659.0
      2205030.0
      2612498.0
    
    
      2014-12-24
      207.77
      166.96
      45.70
      66.21
      91.32
      20.35
      139.94
      74.74
      2.65
      160.23
      ...
      494148.0
      3164508.0
      1647975.0
      703703.0
      303489.0
      5962870.0
      799095.0
      306626.0
      1176810.0
      1300640.0
    
    
      2014-12-26
      208.44
      166.26
      45.85
      66.98
      91.26
      20.42
      139.66
      74.67
      2.65
      160.00
      ...
      998923.0
      2546323.0
      2721308.0
      1244180.0
      249438.0
      5170048.0
      1041493.0
      400193.0
      1014964.0
      1518039.0
    
    
      2014-12-29
      208.72
      166.71
      45.60
      67.14
      90.52
      20.36
      139.88
      74.13
      2.66
      161.22
      ...
      1510355.0
      3903388.0
      4365460.0
      1873440.0
      352274.0
      6624489.0
      1892613.0
      450521.0
      1601636.0
      2007239.0
    
    
      2014-12-30
      207.60
      165.84
      45.69
      66.30
      90.19
      20.24
      141.20
      73.08
      2.63
      160.03
      ...
      1262561.0
      2459775.0
      3519035.0
      2805439.0
      437672.0
      10703455.0
      1464524.0
      625307.0
      957687.0
      2251147.0
    
    
      2014-12-31
      205.54
      164.32
      45.02
      65.44
      89.31
      20.15
      140.07
      72.70
      2.67
      159.28
      ...
      1635052.0
      2454622.0
      3650122.0
      2827604.0
      466877.0
      9305013.0
      1539425.0
      694720.0
      1470851.0
      2148909.0
    
  

5520 rows × 2415 columns



In [57]:

    
base = data_df.index[0]
delta = 252



In [63]:

    
market_days = np.unique(data_df.sort_index().index)
len(market_days)









    Out[63]:





5520



In [ ]:



In [ ]:

    
def add_market_days(base, delta, data_df):
    """
    base is in real time.
    delta is in market days.
    """
    market_days = data_df.index
    if base not in market_days:
        raise Exception('The base date is not in the market days list.')
    base_index = market_days.tolist().index(base)
    if base_index + delta >= len(market_days):
        return market_days[-1]
    if base_index + delta < 0:
        return market_days[0]
    return market_days[base_index + delta]

	0	1	2	3	4	5	6	target
0	41.09	41.31	40.66	40.46	40.27	40.39	39.94	39.73
1	94.55	95.10	93.85	93.52	93.80	95.55	94.79	94.84
2	47.01	47.27	47.00	46.83	46.58	46.35	46.94	46.13
3	83.36	83.77	82.56	83.30	83.17	83.87	83.92	83.29
4	36.14	36.16	35.53	35.40	35.76	36.22	35.94	36.04

	0	1	2	3	4	5	6	target
0	1.015038	1.020472	1.004416	0.999475	0.994782	0.997746	0.986630	0.981442
1	1.000529	1.006349	0.993122	0.989630	0.992593	1.011111	1.003069	1.003598
2	1.005266	1.010826	1.005052	1.001417	0.996071	0.991152	1.003769	0.986448
3	0.999460	1.004376	0.989869	0.998741	0.997182	1.005575	1.006175	0.998621
4	1.006720	1.007277	0.989728	0.986107	0.996135	1.008949	1.001149	1.003935

	0	1	2	3	4	5	6
0	1.015038	1.020472	1.004416	0.999475	0.994782	0.997746	0.986630
1	1.000529	1.006349	0.993122	0.989630	0.992593	1.011111	1.003069
2	1.005266	1.010826	1.005052	1.001417	0.996071	0.991152	1.003769
3	0.999460	1.004376	0.989869	0.998741	0.997182	1.005575	1.006175
4	1.006720	1.007277	0.989728	0.986107	0.996135	1.008949	1.001149

	target
0	0.981442
1	1.003598
2	0.986448
3	0.998621
4	1.003935

	0	1	2	3	4	5	6	target
0	48.21	48.36	47.72	47.85	47.60	47.94	48.15	48.55
1	77.31	77.30	76.60	77.29	76.62	77.30	77.15	77.48
2	177.42	178.36	175.48	176.12	174.50	175.68	176.66	178.84
3	41.79	41.93	41.26	41.06	40.36	39.99	39.07	39.22
4	22.32	22.35	21.94	22.19	21.66	21.92	22.08	22.07

feature	Close										...	Volume
	SPY	MMM	ABT	ABBV	ACN	ATVI	AYI	ADBE	AMD	AAP	...	WYNN	XEL	XRX	XLNX	XYL	YHOO	YUM	ZBH	ZION	ZTS
date
1993-01-29	43.94	24.50	6.88	NaN	NaN	NaN	NaN	2.59	18.75	NaN	...	NaN	87800.0	7633602.0	1745196.0	NaN	NaN	NaN	NaN	33600.0	NaN
1993-02-01	44.25	24.69	6.88	NaN	NaN	NaN	NaN	2.72	19.12	NaN	...	NaN	72400.0	3001200.0	3574800.0	NaN	NaN	NaN	NaN	32000.0	NaN
1993-02-02	44.34	24.72	6.53	NaN	NaN	NaN	NaN	2.84	20.25	NaN	...	NaN	242200.0	1388598.0	2652396.0	NaN	NaN	NaN	NaN	251600.0	NaN
1993-02-03	44.81	25.19	6.91	NaN	NaN	NaN	NaN	2.70	20.50	NaN	...	NaN	272200.0	1228200.0	5040396.0	NaN	NaN	NaN	NaN	254800.0	NaN
1993-02-04	45.00	26.06	6.84	NaN	NaN	NaN	NaN	2.73	20.12	NaN	...	NaN	162800.0	1675602.0	7033200.0	NaN	NaN	NaN	NaN	317200.0	NaN

	SPY	MMM	ABT	ABBV	ACN	ATVI	AYI	ADBE	AMD	AAP	...	WYNN	XEL	XRX	XLNX	XYL	YHOO	YUM	ZBH	ZION	ZTS
0	183.82	139.42	38.41	53.01	82.06	17.80	109.71	59.55	3.85	109.63	...	193.29	27.92	12.15	45.64	34.58	40.20	74.92	93.20	29.85	32.67
1	184.69	140.25	38.33	52.81	82.22	17.83	109.32	59.88	3.87	110.68	...	194.21	27.94	12.17	45.92	34.60	40.44	75.61	93.19	29.96	32.69
2	182.92	138.13	38.23	51.98	81.13	18.07	107.65	59.29	3.95	109.74	...	197.94	27.51	11.91	45.97	34.16	39.59	75.09	92.24	29.65	32.36
3	182.88	138.45	38.64	52.30	81.40	18.29	108.15	59.16	4.00	112.88	...	196.00	27.50	11.99	45.62	34.47	40.12	75.56	92.64	29.86	32.05
4	182.36	137.63	39.15	50.39	80.54	18.08	106.28	58.12	4.13	111.80	...	195.86	27.35	12.09	45.42	34.41	39.93	75.50	93.24	29.65	31.98
5	183.48	137.65	38.85	50.49	81.52	18.32	109.44	58.97	4.18	113.18	...	201.51	27.82	12.19	45.52	34.51	40.92	76.56	95.10	29.74	32.10
6	183.52	136.63	39.20	50.36	82.15	18.34	110.03	58.90	4.18	112.30	...	205.29	27.71	12.08	45.91	34.49	41.02	76.53	97.43	30.00	31.74

	0	1	2	3	4	5	6
0	67.41	66.66	65.48	66.76	66.53	67.30	67.30
1	76.10	76.76	76.87	76.23	76.14	76.01	76.43
2	20.60	20.85	21.00	21.23	21.10	21.04	20.92
3	31.11	31.01	30.98	30.66	30.53	30.84	30.98
4	15.45	15.64	15.49	15.47	15.41	15.33	15.37

	0	1	2	3	4	5	6
0	45.34	44.59	43.41	44.52	44.53	44.74	45.18
1	173.89	173.29	175.06	174.57	175.77	175.90	176.58
2	16.95	17.03	17.05	17.18	16.85	17.03	17.01
3	15.25	14.96	14.55	14.87	14.73	14.85	14.97
4	36.50	36.76	36.89	37.47	37.73	38.34	38.29

	0	1	2	3	4	5	6
440	61.14	60.92	61.14	60.48	60.96	60.88	61.82
296	38.98	39.03	38.81	39.19	39.57	39.11	38.90
163	83.20	83.20	83.14	83.38	82.94	82.91	83.81
80	77.83	76.09	71.75	71.35	67.36	66.21	65.38
288	40.94	40.79	40.29	40.32	41.21	41.48	41.27

	SPY	MMM	ABT	ABBV	ACN	ATVI	AYI	ADBE	AMD	AAP	...	WYNN	XEL	XRX	XLNX	XYL	YHOO	YUM	ZBH	ZION	ZTS
2014-12-18	206.78	165.30	45.77	67.92	89.74	20.04	137.17	74.89	2.55	160.74	...	145.10	35.35	13.89	43.71	38.10	50.91	71.74	114.94	28.20	43.15
2014-12-19	206.52	165.48	46.05	67.71	90.51	20.25	137.38	74.45	2.57	159.85	...	150.11	35.40	13.96	43.00	38.59	50.88	71.17	114.91	28.39	43.51
2014-12-22	207.47	167.27	46.37	66.97	91.18	20.30	140.05	74.50	2.66	159.44	...	147.48	35.65	13.97	43.87	38.74	51.15	72.46	115.05	28.18	43.41
2014-12-23	207.75	166.87	45.72	64.35	91.42	20.23	139.48	74.38	2.67	161.19	...	151.69	35.43	14.07	43.91	38.72	50.02	72.68	113.47	28.53	42.97
2014-12-24	207.77	166.96	45.70	66.21	91.32	20.35	139.94	74.74	2.65	160.23	...	150.90	36.23	14.08	44.08	38.76	50.65	72.79	114.11	28.52	43.84
2014-12-26	208.44	166.26	45.85	66.98	91.26	20.42	139.66	74.67	2.65	160.00	...	150.37	36.58	14.14	43.92	38.95	50.86	73.14	114.17	28.56	44.20
2014-12-29	208.72	166.71	45.60	67.14	90.52	20.36	139.88	74.13	2.66	161.22	...	153.00	37.25	14.14	43.79	38.86	50.53	73.56	113.50	28.72	44.01
2014-12-30	207.60	165.84	45.69	66.30	90.19	20.24	141.20	73.08	2.63	160.03	...	151.20	36.38	14.01	43.55	38.43	51.22	73.28	114.57	28.67	43.35

	0	1	2	3	4	5	6	7
2014-12-18	95.41	94.80	96.90	99.14	100.09	99.94	99.22	98.84
2014-12-18	139.82	139.83	141.40	141.31	141.54	141.39	140.67	139.63
2014-12-18	49.00	49.02	49.36	49.17	49.91	50.47	50.88	49.77
2014-12-18	54.10	54.01	53.96	54.42	54.54	54.43	54.73	54.84
2014-12-18	95.36	95.44	96.62	97.36	97.09	97.05	96.73	95.96

	0	1	2	3	4	5	6	7
2014-12-18	26.02	26.14	26.48	26.62	26.54	26.50	26.59	26.52
2014-12-18	511.10	516.35	524.87	530.59	528.77	534.03	530.33	530.42
2014-12-18	252.23	257.81	258.22	257.80	256.98	257.09	258.86	257.49
2014-12-18	89.54	90.10	90.24	90.85	90.26	90.24	90.14	89.46
2014-12-18	58.98	59.58	58.96	57.21	57.46	57.78	57.73	57.65

	0	1	2	3	4	5	6
2014-02-20	80.76	81.48	81.74	81.86	82.45	82.23	82.50
2014-02-20	35.85	36.04	35.97	36.10	36.01	35.95	36.02
2014-05-12	29.04	29.11	29.25	29.23	29.27	28.88	28.95
2014-04-10	28.26	28.38	28.55	28.95	29.17	28.86	28.73
2014-08-11	99.77	99.81	99.99	99.97	100.46	100.60	101.80
2014-01-09	84.64	85.19	84.02	84.41	85.08	83.98	83.92
2014-11-28	63.23	63.60	64.84	65.33	65.86	65.90	66.22
2014-07-22	121.11	122.01	122.26	122.84	122.65	123.31	130.01
2014-09-10	48.75	49.01	48.40	48.56	48.96	49.21	49.69
2014-11-18	83.79	84.99	84.58	84.65	85.40	84.95	84.98

	target
2014-02-20	81.84
2014-02-20	36.00
2014-05-12	28.87
2014-04-10	28.78
2014-08-11	101.74
2014-01-09	84.06
2014-11-28	65.99
2014-07-22	127.39
2014-09-10	50.35
2014-11-18	87.54

	0	1	2	3	4	5	6
2014-05-09	21.10	21.77	21.46	20.73	19.76	20.06	20.07
2014-03-31	60.21	60.26	61.69	61.45	59.98	59.03	59.10
2014-10-17	79.77	79.64	81.16	81.02	81.56	82.52	82.86
2013-07-19	67.39	67.00	66.85	65.94	66.56	64.73	63.44
2014-11-17	41.16	41.26	40.92	40.92	41.44	41.43	41.15
2013-07-30	52.09	52.04	53.48	53.93	54.25	53.76	52.89
2013-11-06	83.00	81.04	83.38	83.30	84.52	84.84	85.22
2014-05-09	49.27	50.12	50.05	50.20	48.61	48.98	49.34
2014-12-08	18.27	18.34	17.86	17.78	17.02	17.08	16.95
2014-05-09	43.59	43.02	43.20	43.78	43.63	43.64	43.52

	target
2014-05-09	19.77
2014-03-31	60.23
2014-10-17	84.09
2013-07-19	63.14
2014-11-17	41.87
2013-07-30	52.79
2013-11-06	85.11
2014-05-09	49.74
2014-12-08	17.47
2014-05-09	43.63