This tutorial demonstrates how to apply OnTheFlySelector to the problem of forecasting a large number of time series.
In order to avoid copying and pasting sentences from docstrings, let us extract and format all necessary information. This is done with a class that hides overly specialized code under easy-to-read names.
In [1]:
from utils_for_demo_of_on_the_fly_selector import Helper
In [2]:
Helper().why_adaptive_selection()
To continue, read what the OnTheFlySelector class is.
In [3]:
Helper().what_is_on_the_fly_selector()
Finally, let us print the list of parameters that can be passed when a new instance of the OnTheFlySelector class is created.
In [4]:
Helper().which_parameters_does_on_the_fly_selector_have()
In [5]:
import os
import datetime
import pandas as pd
from forecastonishing.selection.on_the_fly_selector import OnTheFlySelector
from forecastonishing.selection.paralleling import (
    fit_selector_in_parallel,
    predict_with_selector_in_parallel
)
from forecastonishing.miscellaneous.metrics import (
    overall_r_squared,
    averaged_r_squared,
    averaged_censored_mape
)
The dataset used here is a set of synthetic time series drawn from a generative model trained on lots of real-world time series, so the problem under consideration is quite realistic.
First of all, download the dataset if it has not been downloaded before.
In [6]:
path_to_dataset = 'time_series_dataset.csv'
if os.path.isfile(path_to_dataset):
    df = pd.read_csv(path_to_dataset, parse_dates=[2])
else:
    df = pd.read_csv(
        "https://docs.google.com/spreadsheets/" +
        "d/1TF0bAf9wOpIXIvIsazMCLEoHQ1y6dTkYYdYRRleC5lM/export?format=csv",
        parse_dates=[2]
    )
    df.to_csv(path_to_dataset, index=False)
df.head()
Out[6]:
How many time series are there?
In [7]:
n_time_series = len(df.groupby(['unit', 'item']))
n_time_series
Out[7]:
In [8]:
len(df.index) / n_time_series
Out[8]:
Each time series includes two months of observations.
Now let us define some metrics. A quick remark before that: this section is given so much attention because it explains which properties one can expect from on-the-fly selection and which properties, conversely, one cannot expect from it.
An interesting combination is to use both the $R^2$ coefficient computed in a single batch for all time series and the $R^2$ coefficient computed for each time series separately and then averaged over all of them. The former metric reports how well the levels of different time series are grasped, whereas the latter one reports how well individual dynamics and deviations from the corresponding means are predicted.
In addition, MAPE (mean absolute percentage error) computed for each time series separately, censored from above at the 100% level, and averaged over all time series can be displayed too, because it shows how far predictions are from actual values in relative terms.
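For concreteness, here is a minimal pandas sketch of how such metrics can be computed. The sketch_* names are hypothetical; the actual implementations live in forecastonishing.miscellaneous.metrics and may differ in details such as edge-case handling.

def sketch_r_squared(frame):
    # Plain R^2 computed over all rows pooled together.
    residuals = frame['actual_value'] - frame['prediction']
    deviations = frame['actual_value'] - frame['actual_value'].mean()
    return 1 - (residuals ** 2).sum() / (deviations ** 2).sum()

def sketch_averaged_r_squared(frame, keys):
    # R^2 computed for each series separately, then averaged over series.
    return frame.groupby(keys).apply(sketch_r_squared).mean()

def sketch_averaged_censored_mape(frame, keys):
    # Per-series MAPE with every term capped at 100%, then averaged.
    def censored_mape(group):
        absolute_percentage_errors = (
            (group['actual_value'] - group['prediction']).abs()
            / group['actual_value'].abs()
        )
        return absolute_percentage_errors.clip(upper=1).mean() * 100
    return frame.groupby(keys).apply(censored_mape).mean()

On the example from the next cell, these sketches yield approximately 0.91, -1.5, and 24 respectively, which agrees with the discussion below.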
To see why these three metrics differ, look at the example below.
In [9]:
example_df = pd.DataFrame(
    [[1, 2, 3],
     [1, 4, 5],
     [2, 10, 9],
     [2, 9, 10]],
    columns=['key', 'actual_value', 'prediction']
)
In [10]:
overall_r_squared(example_df)
Out[10]:
The above metric is high, because the two series from example_df have different levels and the predictions lie near the corresponding levels, which means that variation across levels is reflected in the predictions.
In [11]:
averaged_r_squared(example_df, ['key'])
Out[11]:
Alas, this metric is negative, because variation around individual means is not reflected at all.
In [12]:
averaged_censored_mape(example_df, ['key'])
Out[12]:
Finally, the value of the third metric is neither decent nor poor. Predictions are not too far from the actual values: the relative difference is about 24%.
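This number is easy to verify by hand: for the first series the absolute percentage errors are $|2-3|/2 = 50\%$ and $|4-5|/4 = 25\%$, i.e., $37.5\%$ on average, while for the second series they are $|10-9|/10 = 10\%$ and $|9-10|/9 \approx 11.1\%$, i.e., about $10.6\%$ on average; averaging over the two series gives $(37.5\% + 10.6\%)/2 \approx 24\%$.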
In [13]:
horizon = 3
n_evaluational_rounds = 10
# This is not a parameter of `OnTheFlySelector` or any of its methods.
# It is introduced because, by default, `OnTheFlySelector` does not
# use older lags, so it allows filtering out redundant observations.
max_lag_to_use = 10
In [14]:
train_test_frontier = df['date'].max() - datetime.timedelta(days=horizon-1)
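# Presumably, the training window must contain `max_lag_to_use` lags for
# the earliest evaluation round, plus `n_evaluational_rounds` rounds,
# plus `horizon` forecasted steps, with one extra day as a margin.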
n_training_days = horizon + n_evaluational_rounds + max_lag_to_use + 1
train_df = df[
(df['date'] < train_test_frontier) &
(df['date'] >= train_test_frontier - datetime.timedelta(days=n_training_days))
]
test_df = df[df['date'] >= train_test_frontier]
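As a quick sanity check of the split (assuming daily data without gaps), the test part should span exactly horizon days and start right after the training part:

assert test_df['date'].nunique() == horizon
assert train_df['date'].max() < test_df['date'].min()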
In [15]:
selector = OnTheFlySelector(
    horizon=horizon,
    n_evaluational_rounds=n_evaluational_rounds,
    verbose=1
)
In [16]:
%%time
selector = fit_selector_in_parallel(
    selector,
    train_df,
    name_of_target='value',
    series_keys=['unit', 'item'],
    n_processes=4
)
In [17]:
%%time
predictions_df = predict_with_selector_in_parallel(
    selector,
    train_df,
    n_processes=4
)
In [18]:
# Sorting is necessary after parallel execution.
evaluation_df = predictions_df.reset_index().sort_values(by=['unit', 'item', 'index'])
evaluation_df['actual_value'] = test_df.sort_values(by=['unit', 'item', 'date'])['value'].values
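The positional assignment above is valid only if both frames enumerate the same series in the same order, which can be checked explicitly:

# Both frames must list the same (unit, item) pairs row by row.
assert (
    evaluation_df[['unit', 'item']].values
    == test_df.sort_values(by=['unit', 'item', 'date'])[['unit', 'item']].values
).all()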
In [19]:
overall_r_squared(evaluation_df)
Out[19]:
Since 1 is the maximal possible value of $R^2$, the above score is very good.
In [20]:
averaged_r_squared(evaluation_df, ['unit', 'item'])
Out[20]:
A negative value indicates that the true mean is a better predictor than the forecast.
In [21]:
averaged_censored_mape(evaluation_df, ['unit', 'item'])
Out[21]:
A censored MAPE of about 38% is not a bad score.
As can be seen, simple forecasters cannot predict the future dynamics of the time series under consideration, especially multiple steps ahead. However, cross-sectional variance is grasped almost perfectly. Also note that the most frequent value in the dataset is 0, which makes MAPE too pessimistic, because even a positive forecast very close to 0 is evaluated as the maximal error if the actual value is 0.
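To illustrate the remark about zeros, consider a tiny hypothetical example where the actual value is 0 and the forecast is almost perfect; as stated above, censoring turns the resulting infinite percentage error into the maximal one:

zero_df = pd.DataFrame(
    [[1, 0, 0.01]],
    columns=['key', 'actual_value', 'prediction']
)
# Division by the zero actual value yields an infinite percentage error,
# which censoring from above turns into the maximal error of 100%.
averaged_censored_mape(zero_df, ['key'])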
If you need more examples of how to use the OnTheFlySelector class, please look at the tests/on_the_fly_selector_tests.py file.