Introduction

Purposes

This tutorial:

  • provides a theoretical description of adaptive selection;
  • demonstrates how to use OnTheFlySelector for the problem of forecasting a large number of time series;
  • suggests a set of metrics that reveal the properties of on-the-fly selection.

Background

In order to avoid copying and pasting sentences from docstrings, let us extract and format all necessary information. This is done with a class that hides overly specialized code behind easy-to-read names.


In [1]:
from utils_for_demo_of_on_the_fly_selector import Helper

In [2]:
Helper().why_adaptive_selection()


    Time series forecasting has the property that all observations are
    ordered. Depending on position, the behavior of a series can vary,
    so one method can yield better results at some moments while
    another method outperforms it at other moments. This is the
    reason why adaptive selection is useful for a significant number
    of series.

To continue, read what the OnTheFlySelector class is.


In [3]:
Helper().what_is_on_the_fly_selector()


    This class provides functionality for adaptive short-term
    forecasting based on selection from a pool of models.

    Fitting goes like this: an instance of the class applies all its
    candidate models to each time series from the learning sample,
    evaluates the models' performance on a specified number of the
    last folds, and, finally, selects a winning model for each of the
    time series. To predict future values of a time series, its
    winning model is used.

    The class is designed for the case of many time series and many
    simple forecasters; in such a case, it is too expensive to store
    all forecasts anywhere other than in main memory, and it is
    better to compute them on the fly and then store only the
    selected values.

    As for terminology, a simple forecaster means a forecaster that
    requires no fitting. By default, the class uses moving average,
    moving median, and exponential moving average, but you can pass
    your own simple forecasters when initializing a new instance.

    Selection is preferred over stacking, because base forecasters are
    quite similar to each other and so they have many common mistakes.

    Advantages of adaptive on-the-fly selection are as follows:
    * It always produces sane results; abnormal over-forecasts or
      under-forecasts are impossible;
    * Each time series is tailored individually; this is not a model
      that predicts for several time series without taking their
      identities into consideration;
    * It uses no external features and so can be used for modelling
      residuals of a more complicated model;
    * Can be easily paralleled or distributed, can deal with thousands
      of time series in one call.

    Limitations of adaptive on-the-fly selection are as follows:
    * Not suitable for time series that are strongly influenced by
      external factors;
    * Not suitable for non-stationary time series (e.g., time series
      with trend or seasonality) unless they are made stationary;
    * Long-term forecasts made with it converge to constants.

    

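Before moving on, the fitting procedure described above can be illustrated with a small sketch. The code below is not the library's implementation; the helper functions and the single moving-average candidate family are made up purely to show the idea: evaluate every candidate on the last folds of a series and keep the winner.

import pandas as pd


def moving_average_forecast(history: pd.Series, window: int) -> float:
    # A "simple forecaster" in the sense used above: no fitting,
    # the next value is simply the mean of the last `window` observations.
    return history.tail(window).mean()


def select_winner(series: pd.Series, windows=(3, 5, 7), n_folds=3) -> int:
    # Evaluate every candidate window on the last `n_folds`
    # one-step-ahead folds and keep the one with the smallest MSE.
    scores = {}
    for window in windows:
        errors = []
        for fold in range(n_folds, 0, -1):
            history, actual = series.iloc[:-fold], series.iloc[-fold]
            errors.append((moving_average_forecast(history, window) - actual) ** 2)
        scores[window] = sum(errors) / len(errors)
    return min(scores, key=scores.get)


# Each series gets its own winner; only the selected forecast is stored.
series = pd.Series([10.0, 11, 9, 10, 30, 10, 11, 9, 10, 11])
best_window = select_winner(series)
next_value = moving_average_forecast(series, best_window)
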
Finally, let us print the list of parameters that can be passed when a new instance of OnTheFlySelector class is being created.


In [4]:
Helper().which_parameters_does_on_the_fly_selector_have()


:param candidates: Optional[Dict[Any, List[Dict[str, Any]]]] = None
        forecasters to select from, mapping from instances of
        regressors to kwargs of their initializations, default value
        results in moving averages of ten distinct windows,
        moving medians of eight distinct windows and exponential
        moving averages of ten distinct half-lives
:param evaluation_fn: Optional[Callable] = None
        function that is used for selection of the best forecasters;
        the higher its value, the better the forecaster; default is
        negative mean squared error
:param horizon: int = 1
        number of steps ahead to be forecasted at each iteration,
        default is 1
:param n_evaluational_rounds: int = 1
        number of iterations at each of which forecasters make
        predictions; the structure of rounds is determined as follows:
        the next round is obtained from the preceding round by going
        one step forward, and the last round ends at the end of each
        time series; default value is 1, i.e., forecasters are
        evaluated on the last `horizon` observations of each series.
:param verbose: int = 0
        if it is greater than 0, a progress bar with tried candidates
        is shown, default is 0

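The interplay between `horizon` and `n_evaluational_rounds` may be easier to grasp with a tiny illustration. The function below is not part of the library; it only enumerates, under the layout described above, which trailing observations each evaluation round would cover (0-based indices within a series).

def evaluation_rounds(series_length, horizon, n_evaluational_rounds):
    # The last round ends at the end of the series; each preceding
    # round is shifted one step back from the one that follows it.
    rounds = []
    for i in range(n_evaluational_rounds):
        end = series_length - (n_evaluational_rounds - 1 - i)
        rounds.append(list(range(end - horizon, end)))
    return rounds


# With `horizon=3` and `n_evaluational_rounds=2`, a series of length 10
# is evaluated on observations 6-8 in the first round and 7-9 in the last.
print(evaluation_rounds(10, 3, 2))  # [[6, 7, 8], [7, 8, 9]]
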
Application

Import Statements


In [5]:
import os
import datetime

import pandas as pd

from forecastonishing.selection.on_the_fly_selector import OnTheFlySelector
from forecastonishing.selection.paralleling import (
    fit_selector_in_parallel,
    predict_with_selector_in_parallel
)
from forecastonishing.miscellaneous.metrics import (
    overall_r_squared,
    averaged_r_squared,
    averaged_censored_mape
)

Data Extraction and Brief Exploration

The dataset used here is a set of synthetic time series drawn from a generative model trained on a large number of real-world time series, so the problem under consideration is quite realistic.

First of all, download the dataset if it has not been downloaded before.


In [6]:
path_to_dataset = 'time_series_dataset.csv'
if os.path.isfile(path_to_dataset):
    df = pd.read_csv(path_to_dataset, parse_dates=[2])
else:
    df = pd.read_csv(
        "https://docs.google.com/spreadsheets/" +
        "d/1TF0bAf9wOpIXIvIsazMCLEoHQ1y6dTkYYdYRRleC5lM/export?format=csv",
        parse_dates=[2]
    )
    df.to_csv(path_to_dataset, index=False)
df.head()


Out[6]:
   unit  item       date  value
0     1     1 2017-11-01   14.0
1     1     1 2017-11-02   11.0
2     1     1 2017-11-03   15.0
3     1     1 2017-11-04    8.0
4     1     1 2017-11-05   10.0

How many time series are there?


In [7]:
n_time_series = len(df.groupby(['unit', 'item']))
n_time_series


Out[7]:
7949

In [8]:
len(df.index) / n_time_series


Out[8]:
61.0

Each time series includes two months of observations.

Metrics

Now let us define some metrics, but first a quick remark: so much attention is paid to this section because it explains which properties one can expect from on-the-fly selection and which properties, conversely, one cannot expect from it.

An interesting combination is to use both the $R^2$ coefficient computed in a batch over all time series and the $R^2$ coefficient computed for each time series separately and then averaged over all of them. The former metric reports how well the levels of different time series are captured, whereas the latter one reports how well individual dynamics and deviations from a corresponding mean are predicted.

In addition, MAPE (mean absolute percentage error) computed for each time series separately, censored from above at 100% level, and averaged over all time series, can be displayed too, because it shows how far predictions are from actual values in relative terms.
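In formulas, with $y_{kt}$ the actual value and $\hat{y}_{kt}$ the prediction for series $k$ at position $t$ (these symbols are introduced here only for illustration; the exact handling of edge cases such as zero actual values is described in the metrics' docstrings), the three metrics read:

$$R^2_{\text{overall}} = 1 - \frac{\sum_{k,t}(y_{kt} - \hat{y}_{kt})^2}{\sum_{k,t}(y_{kt} - \bar{y})^2}, \qquad R^2_{\text{averaged}} = \frac{1}{K}\sum_{k}\left(1 - \frac{\sum_{t}(y_{kt} - \hat{y}_{kt})^2}{\sum_{t}(y_{kt} - \bar{y}_k)^2}\right),$$

$$\text{MAPE}_{\text{censored}} = \frac{1}{K}\sum_{k}\frac{1}{T_k}\sum_{t}\min\left(100\,\frac{|y_{kt} - \hat{y}_{kt}|}{|y_{kt}|},\ 100\right),$$

where $\bar{y}$ is the global mean, $\bar{y}_k$ is the mean of series $k$, $K$ is the number of series, and $T_k$ is the length of series $k$.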

To see why these three metrics differ, look at the below example.


In [9]:
example_df = pd.DataFrame(
    [[1, 2, 3],
     [1, 4, 5],
     [2, 10, 9],
     [2, 9, 10]],
    columns=['key', 'actual_value', 'prediction']
)

In [10]:
overall_r_squared(example_df)


Out[10]:
0.91061452513966479

The above metric is high, because the two series from example_df have different levels and the predictions are near the corresponding levels, which means that variation across levels is reflected in the predictions.


In [11]:
averaged_r_squared(example_df, ['key'])


Out[11]:
-1.5

Alas, this metric is negative, because variation around individual means is not reflected at all.


In [12]:
averaged_censored_mape(example_df, ['key'])


Out[12]:
24.027777777777779

Finally, the value of the third metric is neither decent nor poor. Predictions are not too far from actual values: the relative difference is about 24%.
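
To make these three numbers less opaque, here is a sanity check that reproduces them with plain pandas on example_df. It is only an illustration of the formulas above, not the library's implementation; in particular, the exact censoring rule of averaged_censored_mape should be taken from its docstring (clipping does not matter in this example since no error exceeds 100%).

grouped = example_df.groupby('key')

# Overall R^2: one residual sum and one total sum over all rows at once.
residuals = example_df['actual_value'] - example_df['prediction']
deviations = example_df['actual_value'] - example_df['actual_value'].mean()
print(1 - (residuals ** 2).sum() / (deviations ** 2).sum())  # ~0.9106

# Averaged R^2: the same formula per series, then a plain mean over series.
def r_squared(group):
    res = group['actual_value'] - group['prediction']
    dev = group['actual_value'] - group['actual_value'].mean()
    return 1 - (res ** 2).sum() / (dev ** 2).sum()

print(grouped.apply(r_squared).mean())  # (0 + (-3)) / 2 = -1.5

# Censored MAPE: absolute percentage errors clipped at 100%,
# averaged within each series and then across series.
def censored_mape(group):
    ape = 100 * (group['prediction'] - group['actual_value']).abs() / group['actual_value']
    return ape.clip(upper=100).mean()

print(grouped.apply(censored_mape).mean())  # (37.5 + 10.56) / 2 ~ 24.03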

The Launch Itself


In [13]:
horizon = 3
n_evaluational_rounds = 10

# This is not a parameter of `OnTheFlySelector` or any of its methods.
# It is introduced because, by default, `OnTheFlySelector` does not
# use older lags, so it allows filtering out redundant observations.
max_lag_to_use = 10

In [14]:
# Keep the last `n_training_days` days for training and leave
# the final `horizon` days for testing.
train_test_frontier = df['date'].max() - datetime.timedelta(days=horizon-1)
n_training_days = horizon + n_evaluational_rounds + max_lag_to_use + 1
train_df = df[
    (df['date'] < train_test_frontier) &
    (df['date'] >= train_test_frontier - datetime.timedelta(days=n_training_days))
]
test_df = df[df['date'] >= train_test_frontier]

In [15]:
selector = OnTheFlySelector(
    horizon=horizon,
    n_evaluational_rounds=n_evaluational_rounds,
    verbose=1
)

In [16]:
%%time
selector = fit_selector_in_parallel(
    selector,
    train_df,
    name_of_target='value',
    series_keys=['unit', 'item'],
    n_processes=4
)


100%|██████████| 28/28 [22:18<00:00, 51.35s/it]
100%|██████████| 28/28 [22:19<00:00, 51.05s/it]
100%|██████████| 28/28 [22:20<00:00, 51.60s/it]
100%|██████████| 28/28 [22:20<00:00, 50.79s/it]
CPU times: user 752 ms, sys: 164 ms, total: 916 ms
Wall time: 22min 21s

In [17]:
%%time
predictions_df = predict_with_selector_in_parallel(
    selector,
    train_df,
    n_processes=4
)


CPU times: user 432 ms, sys: 64 ms, total: 496 ms
Wall time: 1min 50s

In [18]:
# Sorting is necessary after parallel execution.
evaluation_df = predictions_df.reset_index().sort_values(by=['unit', 'item', 'index'])
# Align actual values from the test set with the sorted predictions.
evaluation_df['actual_value'] = test_df.sort_values(by=['unit', 'item', 'date'])['value'].values

In [19]:
overall_r_squared(evaluation_df)


Out[19]:
0.90524827491432336

Since 1 is the maximal possible value of $R^2$, the above score is very good.


In [20]:
averaged_r_squared(evaluation_df, ['unit', 'item'])


Out[20]:
-7.0785745955648114

A negative value indicates that predicting each series' own mean would be better than the forecast.


In [21]:
averaged_censored_mape(evaluation_df, ['unit', 'item'])


Out[21]:
37.5152389675739

A censored MAPE of about 38% is not a bad score.

Conclusion

As can be seen, simple forecasters cannot predict the future dynamics of the time series under consideration, especially multiple steps ahead. However, cross-sectional variance is captured almost perfectly. Also note that the most frequent value in the dataset is 0, which makes MAPE too pessimistic: even a positive forecast that is very close to 0 is evaluated as the maximal (censored) error when the actual value is 0.

If you need more examples of how to use the OnTheFlySelector class, please look at the tests/on_the_fly_selector_tests.py file.