In [1]:
# Basic imports
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import scipy.optimize as spo
import sys

%matplotlib inline

%pylab inline
pylab.rcParams['figure.figsize'] = (20.0, 10.0)

%load_ext autoreload
%autoreload 2

sys.path.append('../../')


Populating the interactive namespace from numpy and matplotlib

In [2]:
data_df = pd.read_pickle('../../data/data_train_val_df.pkl')
print(data_df.shape)
data_df.head()


(5520, 2415)
Out[2]:
feature Close ... Volume
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
date
1993-01-29 43.94 24.50 6.88 NaN NaN NaN NaN 2.59 18.75 NaN ... NaN 87800.0 7633602.0 1745196.0 NaN NaN NaN NaN 33600.0 NaN
1993-02-01 44.25 24.69 6.88 NaN NaN NaN NaN 2.72 19.12 NaN ... NaN 72400.0 3001200.0 3574800.0 NaN NaN NaN NaN 32000.0 NaN
1993-02-02 44.34 24.72 6.53 NaN NaN NaN NaN 2.84 20.25 NaN ... NaN 242200.0 1388598.0 2652396.0 NaN NaN NaN NaN 251600.0 NaN
1993-02-03 44.81 25.19 6.91 NaN NaN NaN NaN 2.70 20.50 NaN ... NaN 272200.0 1228200.0 5040396.0 NaN NaN NaN NaN 254800.0 NaN
1993-02-04 45.00 26.06 6.84 NaN NaN NaN NaN 2.73 20.12 NaN ... NaN 162800.0 1675602.0 7033200.0 NaN NaN NaN NaN 317200.0 NaN

5 rows × 2415 columns

The first objective of this notebook is to implement the following function, which extracts sample intervals from the total period.


In [3]:
def generate_train_intervals(data_df, train_time, base_time, step, days_ahead, today):
    pass

Let's define the parameters as constants, just to do some scratch work.


In [4]:
# I will try to keep the convention of naming with the "days" suffix
# all the variables that represent "market days". The ones that
# represent real time will be named more arbitrarily.

train_time = 365 # In real time days
base_days = 7 # In market days
step_days = 7 # market days
ahead_days = 1 # market days
today = data_df.index[-1] # Real date

In [5]:
today


Out[5]:
Timestamp('2014-12-31 00:00:00')

The number of samples generated would be roughly (train_time - base_days) * num_companies / step_days. The last ahead_days market days are reserved for target values only, so the total "used" period is train_time + ahead_days.

The choice of training with all, one, or some companies can be made by the user when preparing the input (just filter data_df to get the companies you want). That said, one interesting option would be to allow training with multiple companies while targeting only one. That would multiply the features by the number of available companies, but would greatly reduce the number of samples. For now, I want to keep the complexity low, so I won't implement that idea yet. A many-to-many approach could also be implemented (the target would be the vector with all the companies' data). I will start with the simple "one to one".
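As a sanity check, the arithmetic above can be sketched directly. The 253 market days per year and the ~475 companies surviving the NaN filter are assumptions taken from the shapes printed later in this notebook:

```python
# Back-of-the-envelope sample count for the parameters used below.
market_days_in_train = 253   # assumption: ~one year of trading days
base_days, step_days = 7, 7
num_companies = 475          # assumption: symbols that survive dropna

# One base window every step_days, as long as the window fits in the period.
num_windows = (market_days_in_train - base_days) // step_days + 1
approx_samples = num_windows * num_companies
print(num_windows, approx_samples)  # 36 windows, ~17100 samples
```

That matches the ~17102 samples obtained further down.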


In [6]:
data_df.index[data_df.index <= today][-(ahead_days + 1)]


Out[6]:
Timestamp('2014-12-30 00:00:00')

In [7]:
def add_market_days(base, delta, data_df):
    """
    base is a real date (it must be present in data_df.index).
    delta is in market days. The result is clamped to the available range.
    """
    market_days = data_df.index
    if base not in market_days:
        raise Exception('The base date is not in the market days list.')
    base_index = market_days.tolist().index(base)
    if base_index + delta >= len(market_days):
        return market_days[-1]
    if base_index + delta < 0:
        return market_days[0]
    return market_days[base_index + delta]
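A quick, self-contained sketch of how add_market_days behaves. The function is copied here, and pd.bdate_range stands in for a real trading calendar (an assumption: it ignores holidays):

```python
import pandas as pd

# Toy market calendar: ten business days starting Monday 2014-01-06.
toy_df = pd.DataFrame(index=pd.bdate_range('2014-01-06', periods=10))

def add_market_days(base, delta, data_df):
    """Shift `base` (a date present in the index) by `delta` market days,
    clamping at the first/last available date."""
    market_days = data_df.index
    if base not in market_days:
        raise Exception('The base date is not in the market days list.')
    base_index = market_days.tolist().index(base)
    if base_index + delta >= len(market_days):
        return market_days[-1]
    if base_index + delta < 0:
        return market_days[0]
    return market_days[base_index + delta]

base = pd.Timestamp('2014-01-10')          # a Friday
print(add_market_days(base, 1, toy_df))     # next market day is Monday 2014-01-13
print(add_market_days(base, -100, toy_df))  # clamped to the first date
```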

In [8]:
# Remember: the last ahead_days days are reserved for targets, not training, and that span is measured in "market days".
end_of_training_date = add_market_days(today, -ahead_days, data_df)
start_date = end_of_training_date - dt.timedelta(train_time) 
print('Start date: %s.  End of training date: %s.' % (start_date, end_of_training_date))


Start date: 2013-12-30 00:00:00.  End of training date: 2014-12-30 00:00:00.

In [9]:
TARGET_FEATURE = 'Close'

One important thing to note: the base time is in "market days", which means it doesn't represent a fixed period of "real" time (the real-time span may vary from one base interval to another).
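A minimal illustration of that point, assuming a holiday-free bdate_range calendar: two windows of seven market days each can span a different number of real days.

```python
import pandas as pd

days = pd.bdate_range('2014-01-06', periods=15)  # stand-in trading calendar

window_a = days[0:7]    # Mon 01-06 .. Tue 01-14, crosses one weekend
window_b = days[4:11]   # Fri 01-10 .. Mon 01-20, crosses two weekends

span_a = (window_a[-1] - window_a[0]).days
span_b = (window_b[-1] - window_b[0]).days
print(span_a, span_b)   # same number of market days, different real-time spans
```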


In [10]:
def print_period(data_df):
    print('Period: %s  to  %s.' % (data_df.index[0], data_df.index[-1]))

In [11]:
data_train_df = data_df[start_date:end_of_training_date]

print_period(data_train_df)
data_train_df.shape


Period: 2013-12-30 00:00:00  to  2014-12-30 00:00:00.
Out[11]:
(253, 2415)

In [12]:
start_target_date = add_market_days(start_date, base_days + ahead_days - 1, data_df)
data_target_df = data_df.loc[start_target_date: today,TARGET_FEATURE]

print_period(data_target_df)
data_target_df.shape


Period: 2014-01-09 00:00:00  to  2014-12-31 00:00:00.
Out[12]:
(247, 483)

Is that initial date correct?


In [13]:
data_train_df.index[:10]


Out[13]:
DatetimeIndex(['2013-12-30', '2013-12-31', '2014-01-02', '2014-01-03',
               '2014-01-06', '2014-01-07', '2014-01-08', '2014-01-09',
               '2014-01-10', '2014-01-13'],
              dtype='datetime64[ns]', name='date', freq=None)

OK, it looks right.

Let's split now!

I should allow different feature extraction functions to be used after the time divisions are made.


In [14]:
date_base_ini = start_date
date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
date_target = add_market_days(date_base_end, ahead_days, data_df)

sample_blob = (data_train_df[date_base_ini: date_base_end], pd.DataFrame(data_target_df.loc[date_target]))
sample_blob[0]


Out[14]:
feature Close ... Volume
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
date
2013-12-30 183.82 139.42 38.41 53.01 82.06 17.80 109.71 59.55 3.85 109.63 ... 1950055.0 1867094.0 3846744.0 1049165.0 561674.0 8678033.0 1953272.0 623173.0 895272.0 1803579.0
2013-12-31 184.69 140.25 38.33 52.81 82.22 17.83 109.32 59.88 3.87 110.68 ... 731522.0 1752814.0 5359541.0 1215365.0 557969.0 8292761.0 2132916.0 649985.0 1077417.0 2270418.0
2014-01-02 182.92 138.13 38.23 51.98 81.13 18.07 107.65 59.29 3.95 109.74 ... 3056065.0 3192314.0 10481405.0 3437019.0 765141.0 21514650.0 1956285.0 868763.0 1356738.0 2576112.0
2014-01-03 182.88 138.45 38.64 52.30 81.40 18.29 108.15 59.16 4.00 112.88 ... 1169540.0 2939378.0 7282652.0 1982702.0 454495.0 15761243.0 1457058.0 1288207.0 1122452.0 2524947.0
2014-01-06 182.36 137.63 39.15 50.39 80.54 18.08 106.28 58.12 4.13 111.80 ... 1289126.0 3382267.0 14906758.0 1970805.0 849360.0 12472724.0 2940835.0 1414955.0 1988180.0 2763350.0
2014-01-07 183.48 137.65 38.85 50.49 81.52 18.32 109.44 58.97 4.18 113.18 ... 1688085.0 3481465.0 15383264.0 1581167.0 611127.0 14141112.0 3625927.0 1852572.0 1343169.0 2338176.0
2014-01-08 183.52 136.63 39.20 50.36 82.15 18.34 110.03 58.90 4.18 112.30 ... 1406668.0 3563670.0 7833434.0 2318930.0 1234973.0 18657195.0 4448753.0 1880549.0 2034692.0 3965882.0

7 rows × 2415 columns


In [15]:
target = sample_blob[1].T
target


Out[15]:
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
2014-01-09 183.64 136.45 39.27 51.22 82.95 18.3 123.5 59.09 4.09 113.55 ... 204.77 27.82 12.05 45.78 34.63 40.92 75.05 96.47 30.22 31.96

1 rows × 483 columns

Let's define a function that takes a "sample blob" and produces one sample per symbol, using only the "Close" feature (it looks like the easiest one to start with). The dates in the base period should be replaced by an integer index, and the symbols shuffled later (along with their labels).


In [16]:
feat_close = sample_blob[0][TARGET_FEATURE]
feat_close.index = np.arange(feat_close.shape[0])
feat_close


Out[16]:
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
0 183.82 139.42 38.41 53.01 82.06 17.80 109.71 59.55 3.85 109.63 ... 193.29 27.92 12.15 45.64 34.58 40.20 74.92 93.20 29.85 32.67
1 184.69 140.25 38.33 52.81 82.22 17.83 109.32 59.88 3.87 110.68 ... 194.21 27.94 12.17 45.92 34.60 40.44 75.61 93.19 29.96 32.69
2 182.92 138.13 38.23 51.98 81.13 18.07 107.65 59.29 3.95 109.74 ... 197.94 27.51 11.91 45.97 34.16 39.59 75.09 92.24 29.65 32.36
3 182.88 138.45 38.64 52.30 81.40 18.29 108.15 59.16 4.00 112.88 ... 196.00 27.50 11.99 45.62 34.47 40.12 75.56 92.64 29.86 32.05
4 182.36 137.63 39.15 50.39 80.54 18.08 106.28 58.12 4.13 111.80 ... 195.86 27.35 12.09 45.42 34.41 39.93 75.50 93.24 29.65 31.98
5 183.48 137.65 38.85 50.49 81.52 18.32 109.44 58.97 4.18 113.18 ... 201.51 27.82 12.19 45.52 34.51 40.92 76.56 95.10 29.74 32.10
6 183.52 136.63 39.20 50.36 82.15 18.34 110.03 58.90 4.18 112.30 ... 205.29 27.71 12.08 45.91 34.49 41.02 76.53 97.43 30.00 31.74

7 rows × 483 columns


In [17]:
target.index = ['target']
target


Out[17]:
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
target 183.64 136.45 39.27 51.22 82.95 18.3 123.5 59.09 4.09 113.55 ... 204.77 27.82 12.05 45.78 34.63 40.92 75.05 96.47 30.22 31.96

1 rows × 483 columns


In [18]:
x_y_samples = feat_close.append(target)
x_y_samples


Out[18]:
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
0 183.82 139.42 38.41 53.01 82.06 17.80 109.71 59.55 3.85 109.63 ... 193.29 27.92 12.15 45.64 34.58 40.20 74.92 93.20 29.85 32.67
1 184.69 140.25 38.33 52.81 82.22 17.83 109.32 59.88 3.87 110.68 ... 194.21 27.94 12.17 45.92 34.60 40.44 75.61 93.19 29.96 32.69
2 182.92 138.13 38.23 51.98 81.13 18.07 107.65 59.29 3.95 109.74 ... 197.94 27.51 11.91 45.97 34.16 39.59 75.09 92.24 29.65 32.36
3 182.88 138.45 38.64 52.30 81.40 18.29 108.15 59.16 4.00 112.88 ... 196.00 27.50 11.99 45.62 34.47 40.12 75.56 92.64 29.86 32.05
4 182.36 137.63 39.15 50.39 80.54 18.08 106.28 58.12 4.13 111.80 ... 195.86 27.35 12.09 45.42 34.41 39.93 75.50 93.24 29.65 31.98
5 183.48 137.65 38.85 50.49 81.52 18.32 109.44 58.97 4.18 113.18 ... 201.51 27.82 12.19 45.52 34.51 40.92 76.56 95.10 29.74 32.10
6 183.52 136.63 39.20 50.36 82.15 18.34 110.03 58.90 4.18 112.30 ... 205.29 27.71 12.08 45.91 34.49 41.02 76.53 97.43 30.00 31.74
target 183.64 136.45 39.27 51.22 82.95 18.30 123.50 59.09 4.09 113.55 ... 204.77 27.82 12.05 45.78 34.63 40.92 75.05 96.47 30.22 31.96

8 rows × 483 columns


In [19]:
x_y_samples_shuffled = x_y_samples.T.sample(frac=1).reset_index(drop=True)
x_y_samples_shuffled.head()


Out[19]:
0 1 2 3 4 5 6 target
0 41.09 41.31 40.66 40.46 40.27 40.39 39.94 39.73
1 94.55 95.10 93.85 93.52 93.80 95.55 94.79 94.84
2 47.01 47.27 47.00 46.83 46.58 46.35 46.94 46.13
3 83.36 83.77 82.56 83.30 83.17 83.87 83.92 83.29
4 36.14 36.16 35.53 35.40 35.76 36.22 35.94 36.04

It is important to take care of the NaN values. This sample_blob level is probably a good place to do so: just discard the samples that are too incomplete.


In [20]:
x_y_samples_shuffled.isnull().sum()


Out[20]:
0         10
1         10
2         10
3         10
4         10
5         10
6         10
target    10
dtype: int64

In [21]:
x_y_samples_filtered = x_y_samples_shuffled.dropna(axis=0, how='any')
print(x_y_samples_filtered.shape)
x_y_samples_filtered.isnull().sum()


(473, 8)
Out[21]:
0         0
1         0
2         0
3         0
4         0
5         0
6         0
target    0
dtype: int64

In [22]:
# At some point I will have to standardize those values... (not now, just a reminder...)

std_samples = x_y_samples_shuffled.apply(lambda x: x / np.mean(x), axis=1)
std_samples.head()


Out[22]:
0 1 2 3 4 5 6 target
0 1.015038 1.020472 1.004416 0.999475 0.994782 0.997746 0.986630 0.981442
1 1.000529 1.006349 0.993122 0.989630 0.992593 1.011111 1.003069 1.003598
2 1.005266 1.010826 1.005052 1.001417 0.996071 0.991152 1.003769 0.986448
3 0.999460 1.004376 0.989869 0.998741 0.997182 1.005575 1.006175 0.998621
4 1.006720 1.007277 0.989728 0.986107 0.996135 1.008949 1.001149 1.003935

In [23]:
features = std_samples.iloc[:,:-1]
features.head()


Out[23]:
0 1 2 3 4 5 6
0 1.015038 1.020472 1.004416 0.999475 0.994782 0.997746 0.986630
1 1.000529 1.006349 0.993122 0.989630 0.992593 1.011111 1.003069
2 1.005266 1.010826 1.005052 1.001417 0.996071 0.991152 1.003769
3 0.999460 1.004376 0.989869 0.998741 0.997182 1.005575 1.006175
4 1.006720 1.007277 0.989728 0.986107 0.996135 1.008949 1.001149

In [24]:
target = pd.DataFrame(std_samples.iloc[:,-1])
target.head()


Out[24]:
target
0 0.981442
1 1.003598
2 0.986448
3 0.998621
4 1.003935

Let's create the sample-divider function.


In [25]:
TARGET_FEATURE = 'Close'


def feature_close_one_to_one(sample_blob):
    target = sample_blob[1].T
    feat_close = sample_blob[0][TARGET_FEATURE]
    feat_close.index = np.arange(feat_close.shape[0])
    target.index = ['target']
    x_y_samples = feat_close.append(target)
    x_y_samples_shuffled = x_y_samples.T.sample(frac=1).reset_index(drop=True)
    x_y_samples_filtered = x_y_samples_shuffled.dropna(axis=0, how='any')
    
    return x_y_samples_filtered

In [26]:
print(feature_close_one_to_one(sample_blob).shape)
feature_close_one_to_one(sample_blob).head()


(473, 8)
Out[26]:
0 1 2 3 4 5 6 target
0 48.21 48.36 47.72 47.85 47.60 47.94 48.15 48.55
1 77.31 77.30 76.60 77.29 76.62 77.30 77.15 77.48
2 177.42 178.36 175.48 176.12 174.50 175.68 176.66 178.84
3 41.79 41.93 41.26 41.06 40.36 39.99 39.07 39.22
4 22.32 22.35 21.94 22.19 21.66 21.92 22.08 22.07

In [27]:
date_base_ini = start_date
date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
date_target = add_market_days(date_base_end, ahead_days, data_df)
feat_tgt_df = pd.DataFrame()

while date_base_end < end_of_training_date:
    sample_blob = (data_train_df[date_base_ini: date_base_end],
                   pd.DataFrame(data_target_df.loc[date_target]))
    feat_tgt_blob = feature_close_one_to_one(sample_blob) # TODO: Change for a generic function
    feat_tgt_df = feat_tgt_df.append(feat_tgt_blob, ignore_index=True)
    
    date_base_ini = add_market_days(date_base_ini, step_days, data_df)
    date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
    date_target = add_market_days(date_base_end, ahead_days, data_df)
    # print('Start: %s,  End:%s' % (date_base_ini, date_base_end))

feat_tgt_df = feat_tgt_df.sample(frac=1).reset_index(drop=True)

X_df = feat_tgt_df.iloc[:,:-1]
y_df = pd.DataFrame(feat_tgt_df.iloc[:,-1])

In [28]:
print(X_df.shape)
X_df.head()


(17102, 7)
Out[28]:
0 1 2 3 4 5 6
0 67.41 66.66 65.48 66.76 66.53 67.30 67.30
1 76.10 76.76 76.87 76.23 76.14 76.01 76.43
2 20.60 20.85 21.00 21.23 21.10 21.04 20.92
3 31.11 31.01 30.98 30.66 30.53 30.84 30.98
4 15.45 15.64 15.49 15.47 15.41 15.33 15.37

In [29]:
print(y_df.shape)
y_df.head()


(17102, 1)
Out[29]:
target
0 69.33
1 76.77
2 20.78
3 31.59
4 15.60

So, I have everything I need to define the final function of this notebook.


In [30]:
def generate_train_intervals(data_df, train_time, base_days, step_days, ahead_days, today, blob_fun):
    end_of_training_date = add_market_days(today, -ahead_days, data_df)
    start_date = end_of_training_date - dt.timedelta(train_time)
    start_target_date = add_market_days(start_date, base_days + ahead_days - 1, data_df)
    
    data_train_df = data_df[start_date:end_of_training_date]
    data_target_df = data_df.loc[start_target_date: today,TARGET_FEATURE]
    
    date_base_ini = start_date
    date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
    date_target = add_market_days(date_base_end, ahead_days, data_df)
    feat_tgt_df = pd.DataFrame()

    while date_base_end < end_of_training_date:
        sample_blob = (data_train_df[date_base_ini: date_base_end],
                       pd.DataFrame(data_target_df.loc[date_target]))
        feat_tgt_blob = blob_fun(sample_blob)
        feat_tgt_df = feat_tgt_df.append(feat_tgt_blob, ignore_index=True)
        
        date_base_ini = add_market_days(date_base_ini, step_days, data_df)
        date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
        date_target = add_market_days(date_base_end, ahead_days, data_df)
        # print('Start: %s,  End:%s' % (date_base_ini, date_base_end))
    
    feat_tgt_df = feat_tgt_df.sample(frac=1).reset_index(drop=True)
    
    X_df = feat_tgt_df.iloc[:,:-1]
    y_df = pd.DataFrame(feat_tgt_df.iloc[:,-1])
    
    return X_df, y_df

In [31]:
train_time = 365 # In real time days
base_days = 7 # In market days
step_days = 7 # market days
ahead_days = 1 # market days
today = data_df.index[-1] # Real date

X, y = generate_train_intervals(data_df, train_time, base_days, step_days, ahead_days, today, feature_close_one_to_one)

In [32]:
print(X.shape)
X.head()


(17102, 7)
Out[32]:
0 1 2 3 4 5 6
0 45.34 44.59 43.41 44.52 44.53 44.74 45.18
1 173.89 173.29 175.06 174.57 175.77 175.90 176.58
2 16.95 17.03 17.05 17.18 16.85 17.03 17.01
3 15.25 14.96 14.55 14.87 14.73 14.85 14.97
4 36.50 36.76 36.89 37.47 37.73 38.34 38.29

In [33]:
print(y.shape)
y.head()


(17102, 1)
Out[33]:
target
0 43.26
1 177.82
2 17.05
3 14.84
4 39.17

In [34]:
%pwd


Out[34]:
'/home/miguel/udacity/Machine Learning Nanodegree/projects/capstone/capstone/notebooks/dev'

In [35]:
sys.path.append('../../')
import predictor.feature_extraction as fe

Let's try the function as it was saved in the package:


In [36]:
X, y = fe.generate_train_intervals(data_df, 
                                   train_time, 
                                   base_days, 
                                   step_days, 
                                   ahead_days, 
                                   today, 
                                   feature_close_one_to_one)

In [37]:
print(X.shape)
X.head()


(24606, 7)
Out[37]:
0 1 2 3 4 5 6
440 61.14 60.92 61.14 60.48 60.96 60.88 61.82
296 38.98 39.03 38.81 39.19 39.57 39.11 38.90
163 83.20 83.20 83.14 83.38 82.94 82.91 83.81
80 77.83 76.09 71.75 71.35 67.36 66.21 65.38
288 40.94 40.79 40.29 40.32 41.21 41.48 41.27

In [38]:
print(y.shape)
y.head()


(24606, 1)
Out[38]:
target
440 62.49
296 38.94
163 84.42
80 62.60
288 41.11

Looks good.

Sometimes it may be useful to keep the date information...


In [39]:
x_y_samples


Out[39]:
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
0 183.82 139.42 38.41 53.01 82.06 17.80 109.71 59.55 3.85 109.63 ... 193.29 27.92 12.15 45.64 34.58 40.20 74.92 93.20 29.85 32.67
1 184.69 140.25 38.33 52.81 82.22 17.83 109.32 59.88 3.87 110.68 ... 194.21 27.94 12.17 45.92 34.60 40.44 75.61 93.19 29.96 32.69
2 182.92 138.13 38.23 51.98 81.13 18.07 107.65 59.29 3.95 109.74 ... 197.94 27.51 11.91 45.97 34.16 39.59 75.09 92.24 29.65 32.36
3 182.88 138.45 38.64 52.30 81.40 18.29 108.15 59.16 4.00 112.88 ... 196.00 27.50 11.99 45.62 34.47 40.12 75.56 92.64 29.86 32.05
4 182.36 137.63 39.15 50.39 80.54 18.08 106.28 58.12 4.13 111.80 ... 195.86 27.35 12.09 45.42 34.41 39.93 75.50 93.24 29.65 31.98
5 183.48 137.65 38.85 50.49 81.52 18.32 109.44 58.97 4.18 113.18 ... 201.51 27.82 12.19 45.52 34.51 40.92 76.56 95.10 29.74 32.10
6 183.52 136.63 39.20 50.36 82.15 18.34 110.03 58.90 4.18 112.30 ... 205.29 27.71 12.08 45.91 34.49 41.02 76.53 97.43 30.00 31.74
target 183.64 136.45 39.27 51.22 82.95 18.30 123.50 59.09 4.09 113.55 ... 204.77 27.82 12.05 45.78 34.63 40.92 75.05 96.47 30.22 31.96

8 rows × 483 columns


In [40]:
target = sample_blob[1].T
feat_close = sample_blob[0][TARGET_FEATURE]
x_y_samples = feat_close.append(target)
x_y_samples


Out[40]:
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
2014-12-18 206.78 165.30 45.77 67.92 89.74 20.04 137.17 74.89 2.55 160.74 ... 145.10 35.35 13.89 43.71 38.10 50.91 71.74 114.94 28.20 43.15
2014-12-19 206.52 165.48 46.05 67.71 90.51 20.25 137.38 74.45 2.57 159.85 ... 150.11 35.40 13.96 43.00 38.59 50.88 71.17 114.91 28.39 43.51
2014-12-22 207.47 167.27 46.37 66.97 91.18 20.30 140.05 74.50 2.66 159.44 ... 147.48 35.65 13.97 43.87 38.74 51.15 72.46 115.05 28.18 43.41
2014-12-23 207.75 166.87 45.72 64.35 91.42 20.23 139.48 74.38 2.67 161.19 ... 151.69 35.43 14.07 43.91 38.72 50.02 72.68 113.47 28.53 42.97
2014-12-24 207.77 166.96 45.70 66.21 91.32 20.35 139.94 74.74 2.65 160.23 ... 150.90 36.23 14.08 44.08 38.76 50.65 72.79 114.11 28.52 43.84
2014-12-26 208.44 166.26 45.85 66.98 91.26 20.42 139.66 74.67 2.65 160.00 ... 150.37 36.58 14.14 43.92 38.95 50.86 73.14 114.17 28.56 44.20
2014-12-29 208.72 166.71 45.60 67.14 90.52 20.36 139.88 74.13 2.66 161.22 ... 153.00 37.25 14.14 43.79 38.86 50.53 73.56 113.50 28.72 44.01
2014-12-30 207.60 165.84 45.69 66.30 90.19 20.24 141.20 73.08 2.63 160.03 ... 151.20 36.38 14.01 43.55 38.43 51.22 73.28 114.57 28.67 43.35

8 rows × 483 columns


In [41]:
x_y_samples.index = pd.MultiIndex.from_product([[x_y_samples.index[0]], np.arange(x_y_samples.shape[0])])
x_y_samples


Out[41]:
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
2014-12-18 0 206.78 165.30 45.77 67.92 89.74 20.04 137.17 74.89 2.55 160.74 ... 145.10 35.35 13.89 43.71 38.10 50.91 71.74 114.94 28.20 43.15
1 206.52 165.48 46.05 67.71 90.51 20.25 137.38 74.45 2.57 159.85 ... 150.11 35.40 13.96 43.00 38.59 50.88 71.17 114.91 28.39 43.51
2 207.47 167.27 46.37 66.97 91.18 20.30 140.05 74.50 2.66 159.44 ... 147.48 35.65 13.97 43.87 38.74 51.15 72.46 115.05 28.18 43.41
3 207.75 166.87 45.72 64.35 91.42 20.23 139.48 74.38 2.67 161.19 ... 151.69 35.43 14.07 43.91 38.72 50.02 72.68 113.47 28.53 42.97
4 207.77 166.96 45.70 66.21 91.32 20.35 139.94 74.74 2.65 160.23 ... 150.90 36.23 14.08 44.08 38.76 50.65 72.79 114.11 28.52 43.84
5 208.44 166.26 45.85 66.98 91.26 20.42 139.66 74.67 2.65 160.00 ... 150.37 36.58 14.14 43.92 38.95 50.86 73.14 114.17 28.56 44.20
6 208.72 166.71 45.60 67.14 90.52 20.36 139.88 74.13 2.66 161.22 ... 153.00 37.25 14.14 43.79 38.86 50.53 73.56 113.50 28.72 44.01
7 207.60 165.84 45.69 66.30 90.19 20.24 141.20 73.08 2.63 160.03 ... 151.20 36.38 14.01 43.55 38.43 51.22 73.28 114.57 28.67 43.35

8 rows × 483 columns

That would be the way to go: the timestamp of the first day of the base period works as a global timestamp for the base period.
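A minimal sketch of that reshape, on a tiny hypothetical frame (two symbols and four steps instead of eight): unstack()/stack(0) turns the (base_date, step) rows into one row per symbol with the steps as columns, so dropping the symbol level leaves the base date as each sample's timestamp.

```python
import numpy as np
import pandas as pd

base_date = pd.Timestamp('2014-12-18')
# Rows: (base_date, step 0..3). Columns: two hypothetical symbols.
frame = pd.DataFrame(np.arange(8).reshape(4, 2),
                     index=pd.MultiIndex.from_product([[base_date], range(4)]),
                     columns=['AAA', 'BBB'])

# Move steps to columns, symbols to rows, then drop the symbol level.
samples = frame.unstack().stack(0).reset_index(level=1, drop=True)
print(samples)  # one row per symbol, all indexed by the window's first date
```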


In [42]:
x_y_samples.unstack().stack(0).sample(frac=1).reset_index(level=1, drop=True).head()


Out[42]:
0 1 2 3 4 5 6 7
2014-12-18 95.41 94.80 96.90 99.14 100.09 99.94 99.22 98.84
2014-12-18 139.82 139.83 141.40 141.31 141.54 141.39 140.67 139.63
2014-12-18 49.00 49.02 49.36 49.17 49.91 50.47 50.88 49.77
2014-12-18 54.10 54.01 53.96 54.42 54.54 54.43 54.73 54.84
2014-12-18 95.36 95.44 96.62 97.36 97.09 97.05 96.73 95.96

Let's try the whole function with shuffling (it's better to do it early, so that I won't forget later and get artificial results), but keeping the index.


In [43]:
TARGET_FEATURE = 'Close'


def feature_close_one_to_one(sample_blob):
    target = sample_blob[1].T
    feat_close = sample_blob[0][TARGET_FEATURE]
    x_y_samples = feat_close.append(target)
    x_y_samples.index = pd.MultiIndex.from_product([[x_y_samples.index[0]], 
                                                    np.arange(x_y_samples.shape[0])])
    x_y_samples_shuffled = x_y_samples.unstack().stack(0).sample(frac=1).reset_index(level=1, drop=True)
    x_y_samples_filtered = x_y_samples_shuffled.dropna(axis=0, how='any')
    
    return x_y_samples_filtered

In [44]:
print(feature_close_one_to_one(sample_blob).shape)
feature_close_one_to_one(sample_blob).head()


(479, 8)
Out[44]:
0 1 2 3 4 5 6 7
2014-12-18 26.02 26.14 26.48 26.62 26.54 26.50 26.59 26.52
2014-12-18 511.10 516.35 524.87 530.59 528.77 534.03 530.33 530.42
2014-12-18 252.23 257.81 258.22 257.80 256.98 257.09 258.86 257.49
2014-12-18 89.54 90.10 90.24 90.85 90.26 90.24 90.14 89.46
2014-12-18 58.98 59.58 58.96 57.21 57.46 57.78 57.73 57.65

In [45]:
def generate_train_intervals(data_df, train_time, base_days, step_days, ahead_days, today, blob_fun):
    end_of_training_date = add_market_days(today, -ahead_days, data_df)
    start_date = end_of_training_date - dt.timedelta(train_time)
    start_target_date = add_market_days(start_date, base_days + ahead_days - 1, data_df)
    
    data_train_df = data_df[start_date:end_of_training_date]
    data_target_df = data_df.loc[start_target_date: today, TARGET_FEATURE]
    
    date_base_ini = start_date
    date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
    date_target = add_market_days(date_base_end, ahead_days, data_df)
    feat_tgt_df = pd.DataFrame()

    while date_base_end < end_of_training_date:
        sample_blob = (data_train_df[date_base_ini: date_base_end],
                       pd.DataFrame(data_target_df.loc[date_target]))
        feat_tgt_blob = blob_fun(sample_blob)
        feat_tgt_df = feat_tgt_df.append(feat_tgt_blob)
        
        date_base_ini = add_market_days(date_base_ini, step_days, data_df)
        date_base_end = add_market_days(date_base_ini, base_days - 1, data_df)
        date_target = add_market_days(date_base_end, ahead_days, data_df)
        # print('Start: %s,  End:%s' % (date_base_ini, date_base_end))
    
    feat_tgt_df = feat_tgt_df.sample(frac=1)
    
    X_df = feat_tgt_df.iloc[:,:-1]
    y_df = pd.DataFrame(feat_tgt_df.iloc[:,-1]).rename(columns={7:'target'})
    
    return X_df, y_df

In [46]:
from time import time

tic = time()
X, y = generate_train_intervals(data_df, 
                                train_time, 
                                base_days, 
                                step_days, 
                                ahead_days, 
                                today, 
                                feature_close_one_to_one)
toc = time()
print('Elapsed time: %i seconds.' % (toc-tic))


Elapsed time: 2 seconds.

In [47]:
print(X.shape)
X.head(10)


(17102, 7)
Out[47]:
0 1 2 3 4 5 6
2014-02-20 80.76 81.48 81.74 81.86 82.45 82.23 82.50
2014-02-20 35.85 36.04 35.97 36.10 36.01 35.95 36.02
2014-05-12 29.04 29.11 29.25 29.23 29.27 28.88 28.95
2014-04-10 28.26 28.38 28.55 28.95 29.17 28.86 28.73
2014-08-11 99.77 99.81 99.99 99.97 100.46 100.60 101.80
2014-01-09 84.64 85.19 84.02 84.41 85.08 83.98 83.92
2014-11-28 63.23 63.60 64.84 65.33 65.86 65.90 66.22
2014-07-22 121.11 122.01 122.26 122.84 122.65 123.31 130.01
2014-09-10 48.75 49.01 48.40 48.56 48.96 49.21 49.69
2014-11-18 83.79 84.99 84.58 84.65 85.40 84.95 84.98

In [48]:
print(y.shape)
y.head(10)


(17102, 1)
Out[48]:
target
2014-02-20 81.84
2014-02-20 36.00
2014-05-12 28.87
2014-04-10 28.78
2014-08-11 101.74
2014-01-09 84.06
2014-11-28 65.99
2014-07-22 127.39
2014-09-10 50.35
2014-11-18 87.54

Let's test the "final" (you never know...) function in its module.


In [49]:
sys.path.append('../../')
import predictor.feature_extraction as fe

X, y = fe.generate_train_intervals(data_df, 
                                   train_time, 
                                   base_days, 
                                   step_days, 
                                   ahead_days, 
                                   today, 
                                   feature_close_one_to_one)

In [50]:
print(X.shape)
X.head(10)


(24606, 7)
Out[50]:
0 1 2 3 4 5 6
2014-05-09 21.10 21.77 21.46 20.73 19.76 20.06 20.07
2014-03-31 60.21 60.26 61.69 61.45 59.98 59.03 59.10
2014-10-17 79.77 79.64 81.16 81.02 81.56 82.52 82.86
2013-07-19 67.39 67.00 66.85 65.94 66.56 64.73 63.44
2014-11-17 41.16 41.26 40.92 40.92 41.44 41.43 41.15
2013-07-30 52.09 52.04 53.48 53.93 54.25 53.76 52.89
2013-11-06 83.00 81.04 83.38 83.30 84.52 84.84 85.22
2014-05-09 49.27 50.12 50.05 50.20 48.61 48.98 49.34
2014-12-08 18.27 18.34 17.86 17.78 17.02 17.08 16.95
2014-05-09 43.59 43.02 43.20 43.78 43.63 43.64 43.52

In [51]:
print(y.shape)
y.head(10)


(24606, 1)
Out[51]:
target
2014-05-09 19.77
2014-03-31 60.23
2014-10-17 84.09
2013-07-19 63.14
2014-11-17 41.87
2013-07-30 52.79
2013-11-06 85.11
2014-05-09 49.74
2014-12-08 17.47
2014-05-09 43.63

Nice!


I will try to modify the add_market_days function to make it compute the shift in market days on the actual dates instead of on index positions, so that it takes into account the possible duplicates, which are very common in some of the approaches I will follow.
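One possible shape for that variant (just a sketch with a hypothetical name; it computes the shift on the de-duplicated dates, so repeated index entries don't inflate the step):

```python
import pandas as pd

def add_market_days_real(base, delta, data_df):
    """Hypothetical variant: shift `base` by `delta` market days, counted on
    the unique (de-duplicated) dates of the index, clamping at the ends."""
    market_days = data_df.index.unique().sort_values()
    if base not in market_days:
        raise Exception('The base date is not in the market days list.')
    shifted = market_days.get_loc(base) + delta
    shifted = min(max(shifted, 0), len(market_days) - 1)
    return market_days[shifted]

# Toy calendar with a duplicated date, as may appear in some approaches.
idx = pd.DatetimeIndex(['2014-01-06', '2014-01-07', '2014-01-07', '2014-01-08'])
toy_df = pd.DataFrame(index=idx)
# The duplicate is counted once: one market day after 01-06 is 01-07.
print(add_market_days_real(pd.Timestamp('2014-01-06'), 1, toy_df))
```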


In [55]:
data_df


Out[55]:
feature Close ... Volume
SPY MMM ABT ABBV ACN ATVI AYI ADBE AMD AAP ... WYNN XEL XRX XLNX XYL YHOO YUM ZBH ZION ZTS
date
1993-01-29 43.94 24.50 6.88 NaN NaN NaN NaN 2.59 18.75 NaN ... NaN 87800.0 7633602.0 1745196.0 NaN NaN NaN NaN 33600.0 NaN
1993-02-01 44.25 24.69 6.88 NaN NaN NaN NaN 2.72 19.12 NaN ... NaN 72400.0 3001200.0 3574800.0 NaN NaN NaN NaN 32000.0 NaN
1993-02-02 44.34 24.72 6.53 NaN NaN NaN NaN 2.84 20.25 NaN ... NaN 242200.0 1388598.0 2652396.0 NaN NaN NaN NaN 251600.0 NaN
1993-02-03 44.81 25.19 6.91 NaN NaN NaN NaN 2.70 20.50 NaN ... NaN 272200.0 1228200.0 5040396.0 NaN NaN NaN NaN 254800.0 NaN
1993-02-04 45.00 26.06 6.84 NaN NaN NaN NaN 2.73 20.12 NaN ... NaN 162800.0 1675602.0 7033200.0 NaN NaN NaN NaN 317200.0 NaN
1993-02-05 44.97 27.19 6.88 NaN NaN NaN NaN 2.60 19.62 NaN ... NaN 73600.0 3104598.0 7197600.0 NaN NaN NaN NaN 292400.0 NaN
1993-02-08 44.97 27.25 6.66 NaN NaN NaN NaN 2.62 19.75 NaN ... NaN 43400.0 3567600.0 5848800.0 NaN NaN NaN NaN 202000.0 NaN
1993-02-09 44.66 27.25 6.84 NaN NaN NaN NaN 2.59 19.75 NaN ... NaN 58400.0 1281600.0 2401200.0 NaN NaN NaN NaN 111600.0 NaN
1993-02-10 44.72 27.31 6.88 NaN NaN NaN NaN 2.56 19.75 NaN ... NaN 174200.0 2275602.0 3268404.0 NaN NaN NaN NaN 38800.0 NaN
1993-02-11 44.94 26.94 6.94 NaN NaN NaN NaN 2.62 21.00 NaN ... NaN 87400.0 343002.0 1974804.0 NaN NaN NaN NaN 18400.0 NaN
1993-02-12 44.59 26.81 7.00 NaN NaN NaN NaN 2.54 21.38 NaN ... NaN 114400.0 682200.0 1629996.0 NaN NaN NaN NaN 69600.0 NaN
1993-02-16 43.47 26.31 6.66 NaN NaN NaN NaN 2.52 19.75 NaN ... NaN 87200.0 1186800.0 6063600.0 NaN NaN NaN NaN 61200.0 NaN
1993-02-17 43.44 25.94 6.44 NaN NaN NaN NaN 2.58 20.38 NaN ... NaN 134000.0 2628198.0 7012800.0 NaN NaN NaN NaN 164000.0 NaN
1993-02-18 43.41 25.75 6.25 NaN NaN NaN NaN 2.64 21.12 NaN ... NaN 89000.0 1435998.0 6742404.0 NaN NaN NaN NaN 118800.0 NaN
1993-02-19 43.56 26.22 6.06 NaN NaN NaN NaN 2.73 22.12 NaN ... NaN 127200.0 816198.0 2259600.0 NaN NaN NaN NaN 136400.0 NaN
1993-02-22 43.72 26.16 5.81 NaN NaN NaN NaN 2.63 21.75 NaN ... NaN 172400.0 2625198.0 9621600.0 NaN NaN NaN NaN 192000.0 NaN
1993-02-23 43.69 26.06 6.06 NaN NaN NaN NaN 2.55 21.38 NaN ... NaN 227800.0 1179600.0 2307600.0 NaN NaN NaN NaN 82400.0 NaN
1993-02-24 44.25 26.31 6.50 NaN NaN NaN NaN 2.66 22.00 NaN ... NaN 392400.0 1937598.0 2843196.0 NaN NaN NaN NaN 212800.0 NaN
1993-02-25 44.34 26.09 6.59 NaN NaN NaN NaN 2.78 22.12 NaN ... NaN 115400.0 1834200.0 3760800.0 NaN NaN NaN NaN 18400.0 NaN
1993-02-26 44.41 26.38 6.53 NaN NaN NaN NaN 2.79 21.75 NaN ... NaN 82400.0 2159202.0 2637996.0 NaN NaN NaN NaN 19200.0 NaN
1993-03-01 44.28 26.06 6.47 NaN NaN NaN NaN 2.69 21.25 NaN ... NaN 49000.0 1657602.0 1125600.0 NaN NaN NaN NaN 21600.0 NaN
1993-03-02 44.94 26.47 6.50 NaN NaN NaN NaN 2.75 22.25 NaN ... NaN 77800.0 1094202.0 7811604.0 NaN NaN NaN NaN 28800.0 NaN
1993-03-03 45.12 26.25 6.41 NaN NaN NaN NaN 2.80 22.88 NaN ... NaN 65600.0 1642002.0 6934404.0 NaN NaN NaN NaN 69600.0 NaN
1993-03-04 44.88 26.47 6.38 NaN NaN NaN NaN 2.75 22.62 NaN ... NaN 69800.0 765198.0 6698004.0 NaN NaN NaN NaN 112800.0 NaN
1993-03-05 44.75 26.75 6.38 NaN NaN NaN NaN 2.80 23.25 NaN ... NaN 146000.0 897198.0 15781200.0 NaN NaN NaN NaN 21600.0 NaN
1993-03-08 45.75 27.34 6.59 NaN NaN NaN NaN 2.90 24.00 NaN ... NaN 263800.0 1549602.0 11932800.0 NaN NaN NaN NaN 52000.0 NaN
1993-03-09 45.59 27.34 6.31 NaN NaN NaN NaN 2.83 23.62 NaN ... NaN 268400.0 1260600.0 6941604.0 NaN NaN NaN NaN 30400.0 NaN
1993-03-10 45.69 27.41 6.59 NaN NaN NaN NaN 2.86 23.62 NaN ... NaN 174000.0 598800.0 8206800.0 NaN NaN NaN NaN 40800.0 NaN
1993-03-11 45.56 27.25 6.56 NaN NaN NaN NaN 2.80 23.75 NaN ... NaN 113000.0 904602.0 6606804.0 NaN NaN NaN NaN 62800.0 NaN
1993-03-12 45.09 27.12 6.66 NaN NaN NaN NaN 2.86 23.75 NaN ... NaN 87200.0 819798.0 4695600.0 NaN NaN NaN NaN 82800.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2014-11-18 205.55 160.15 43.70 66.05 84.30 20.01 138.62 70.83 2.65 146.24 ... 1445341.0 2847555.0 5510194.0 2696970.0 908406.0 26847315.0 2586790.0 688679.0 1069486.0 12181375.0
2014-11-19 205.22 158.83 43.43 65.50 83.64 19.57 138.82 70.82 2.63 144.74 ... 876839.0 1972218.0 5084935.0 2435547.0 857298.0 29260010.0 1612491.0 518740.0 1747864.0 7031136.0
2014-11-20 205.58 159.36 43.78 65.19 83.44 21.11 141.53 70.44 2.69 145.19 ... 2655717.0 2138563.0 8755083.0 1736083.0 610697.0 28916007.0 4499097.0 985701.0 1849322.0 5091131.0
2014-11-21 206.68 160.16 43.81 67.36 84.08 21.17 141.57 71.32 2.77 145.49 ... 2017251.0 2650022.0 6361338.0 2699464.0 1030635.0 22226984.0 3654458.0 821242.0 2998052.0 5192498.0
2014-11-24 207.26 159.88 44.32 67.78 84.93 21.30 142.51 71.22 2.83 145.58 ... 1155464.0 2225076.0 8568584.0 1699310.0 643336.0 14643538.0 2753912.0 1355368.0 1319333.0 4179357.0
2014-11-25 207.11 158.06 44.16 68.06 85.32 21.73 141.46 72.00 2.83 145.45 ... 1670515.0 2332571.0 6371282.0 1612922.0 772716.0 14219593.0 3223564.0 923555.0 2446487.0 3589015.0
2014-11-26 207.64 158.31 44.25 68.70 85.15 21.82 141.07 72.98 2.84 145.71 ... 833423.0 1256014.0 3804699.0 2356519.0 550887.0 13434021.0 2487289.0 595022.0 887210.0 2131385.0
2014-11-28 207.20 160.09 44.51 69.20 86.33 21.65 138.20 73.68 2.79 147.08 ... 578749.0 1210079.0 3094160.0 1423057.0 676375.0 8913738.0 1505440.0 426898.0 1766234.0 1332156.0
2014-12-01 205.76 158.16 44.32 69.11 86.00 21.05 137.03 73.75 2.67 147.07 ... 1498720.0 5969004.0 4303105.0 2638062.0 850048.0 23146896.0 3123478.0 613003.0 3944978.0 3153115.0
2014-12-02 207.08 160.60 45.22 69.27 85.98 21.14 138.18 73.47 2.68 148.48 ... 1167121.0 3751995.0 5048956.0 1774069.0 964261.0 16300591.0 3634241.0 1504922.0 1770926.0 2500597.0
2014-12-03 207.89 162.26 45.68 68.50 86.52 21.48 139.75 73.18 2.74 152.04 ... 2206478.0 3330964.0 10690975.0 5606099.0 697054.0 14235962.0 3014968.0 978397.0 1701211.0 2008310.0
2014-12-04 207.66 162.26 45.49 69.57 86.34 21.77 140.74 73.03 2.69 153.25 ... 3673851.0 2614210.0 6927550.0 4055601.0 443265.0 12157548.0 2228168.0 732956.0 1296742.0 4294633.0
2014-12-05 208.00 162.27 45.30 69.71 86.19 21.79 140.23 72.40 2.66 154.18 ... 1531474.0 2330569.0 4423298.0 2144040.0 392476.0 15418123.0 1836426.0 1018655.0 4964092.0 3313964.0
2014-12-08 206.61 160.93 45.71 69.42 84.97 20.92 137.68 72.05 2.58 153.68 ... 2400856.0 3159876.0 6015100.0 3244526.0 468915.0 18190065.0 3400882.0 644459.0 3779869.0 5889345.0
2014-12-09 206.47 160.83 45.44 69.14 85.18 20.97 137.67 71.56 2.59 158.25 ... 1610673.0 3249038.0 8166454.0 1958389.0 616176.0 19655554.0 4467324.0 986386.0 3213855.0 5516141.0
2014-12-10 203.16 158.24 44.37 67.58 83.73 20.49 133.07 69.85 2.48 157.24 ... 4661598.0 2761523.0 8399042.0 3377519.0 782671.0 16184119.0 10302596.0 924065.0 2706376.0 4398000.0
2014-12-11 204.19 159.15 44.46 67.03 83.88 20.34 131.75 69.74 2.49 159.16 ... 2676661.0 2116953.0 6142819.0 3158420.0 586045.0 21100184.0 7960980.0 961708.0 2642106.0 4942195.0
2014-12-12 200.89 157.12 43.69 65.27 81.90 19.90 129.92 76.02 2.52 160.25 ... 1491876.0 2194511.0 7343378.0 2524857.0 722650.0 20370512.0 5649531.0 879074.0 2936013.0 4050331.0
2014-12-15 199.51 156.85 43.46 65.62 81.88 20.02 129.17 73.04 2.47 161.01 ... 1437791.0 3172326.0 7622374.0 4404594.0 726824.0 18132508.0 5074164.0 753207.0 2977774.0 5947406.0
2014-12-16 197.91 159.05 43.40 65.25 82.87 19.38 128.36 72.32 2.50 157.01 ... 2127890.0 2964296.0 11093935.0 5930450.0 712928.0 21399337.0 7051356.0 1043861.0 3680071.0 4606599.0
2014-12-17 201.79 160.60 44.26 66.57 85.30 19.67 132.98 74.00 2.55 159.55 ... 4298558.0 3357456.0 6824963.0 12579715.0 1241837.0 17112312.0 5411318.0 926323.0 3567571.0 7628313.0
2014-12-18 206.78 165.30 45.77 67.92 89.74 20.04 137.17 74.89 2.55 160.74 ... 2870142.0 3062842.0 6376658.0 6461698.0 698949.0 15338945.0 4513792.0 957787.0 2342748.0 6578062.0
2014-12-19 206.52 165.48 46.05 67.71 90.51 20.25 137.38 74.45 2.57 159.85 ... 3293472.0 4525090.0 12745939.0 7023789.0 1534229.0 24110189.0 5784930.0 1271670.0 9852629.0 5601641.0
2014-12-22 207.47 167.27 46.37 66.97 91.18 20.30 140.05 74.50 2.66 159.44 ... 1768390.0 2095004.0 5349323.0 2508023.0 475873.0 24021105.0 2472579.0 629534.0 1984295.0 1921541.0
2014-12-23 207.75 166.87 45.72 64.35 91.42 20.23 139.48 74.38 2.67 161.19 ... 1668952.0 2220244.0 4977161.0 1830288.0 645345.0 15514036.0 1605742.0 1287659.0 2205030.0 2612498.0
2014-12-24 207.77 166.96 45.70 66.21 91.32 20.35 139.94 74.74 2.65 160.23 ... 494148.0 3164508.0 1647975.0 703703.0 303489.0 5962870.0 799095.0 306626.0 1176810.0 1300640.0
2014-12-26 208.44 166.26 45.85 66.98 91.26 20.42 139.66 74.67 2.65 160.00 ... 998923.0 2546323.0 2721308.0 1244180.0 249438.0 5170048.0 1041493.0 400193.0 1014964.0 1518039.0
2014-12-29 208.72 166.71 45.60 67.14 90.52 20.36 139.88 74.13 2.66 161.22 ... 1510355.0 3903388.0 4365460.0 1873440.0 352274.0 6624489.0 1892613.0 450521.0 1601636.0 2007239.0
2014-12-30 207.60 165.84 45.69 66.30 90.19 20.24 141.20 73.08 2.63 160.03 ... 1262561.0 2459775.0 3519035.0 2805439.0 437672.0 10703455.0 1464524.0 625307.0 957687.0 2251147.0
2014-12-31 205.54 164.32 45.02 65.44 89.31 20.15 140.07 72.70 2.67 159.28 ... 1635052.0 2454622.0 3650122.0 2827604.0 466877.0 9305013.0 1539425.0 694720.0 1470851.0 2148909.0

5520 rows × 2415 columns


In [57]:
base = data_df.index[0]  # first market day in the dataset, used as a trial base date
delta = 252              # roughly one year, expressed in market days

In [63]:
# Sanity check: the index should contain one entry per market day, with no duplicates.
market_days = np.unique(data_df.sort_index().index)
len(market_days)


Out[63]:
5520

In [ ]:
def add_market_days(base, delta, data_df):
    """
    Shift `base` by `delta` market days.

    base is a real date that must be present in the index.
    delta is in market days (it may be negative).
    The result is clamped to the first/last available market day.
    """
    market_days = data_df.index
    if base not in market_days:
        raise ValueError('The base date is not in the market days list.')
    base_index = market_days.get_loc(base)
    # Clamp to the available range instead of raising an IndexError.
    if base_index + delta >= len(market_days):
        return market_days[-1]
    if base_index + delta < 0:
        return market_days[0]
    return market_days[base_index + delta]
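A quick sanity check of the shifting-and-clamping behavior. This is a minimal sketch on a toy business-day index (the small DataFrame below is illustrative only, not the notebook's data), with the same function body repeated so the cell is self-contained:

```python
import pandas as pd


def add_market_days(base, delta, data_df):
    """Shift `base` (a date present in the index) by `delta` market days,
    clamping the result to the first/last available market day."""
    market_days = data_df.index
    if base not in market_days:
        raise ValueError('The base date is not in the market days list.')
    base_index = market_days.get_loc(base)
    shifted = base_index + delta
    return market_days[max(0, min(shifted, len(market_days) - 1))]


# Toy index: five consecutive business days (Mon 2014-01-06 .. Fri 2014-01-10)
toy_df = pd.DataFrame(index=pd.bdate_range('2014-01-06', periods=5))

print(add_market_days(toy_df.index[0], 2, toy_df))    # 2014-01-08: two market days ahead
print(add_market_days(toy_df.index[0], -3, toy_df))   # 2014-01-06: clamped to the first day
print(add_market_days(toy_df.index[0], 100, toy_df))  # 2014-01-10: clamped to the last day
```

The clamping choice means callers near the edges of the dataset get the nearest valid market day instead of an exception, which is convenient when sliding sample windows over the whole period.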