In [1]:
# remove the notebook root logger.\n",
import logging
logger = logging.getLogger()
logger.handlers = []

An introduction to time series forecasting with PyAF

This is an introduction to time series forecasting with PyAF. We describe the basic vocabulary used and give simple examples.

Problem Setting

The main problem to be solved with a forecasting tool is to predict the future values of a quantity of interest called 'Signal' over a period of time called 'Horizon' (predict the shop sales over next 7 days).

Usually all available (mainly past) data on the signal can be used to perform the forecasting task, even external related signals can be used (called exogenous signals, weather data can be used to predict ice-cream sales).

The time itself can also be used. Some time information can be helpful : sundays, holidays, quarter of the year etc. Future values of time-based data are usually known.

Time series models are mathematical concepts that can used to compute the future values. They summarize a pattern that can be observed in the data in a regular way. Example time series models :

  • Linear trend : next future valeus are obtained by extrapolating a straight line fitted on known signal data.
  • Cycles : future values follow a periodic pattern (one value for each day of the week, for example).
  • Moving Average : Some combination of previous signal values (average sales of previous 7 days).
  • Naive model : the predicted value for tomorrow is today value.
  • ...

Automatic forecasting is about testing a lot of possible time series models and selecting the best model based on its forecast quality (difference between predicted and actual values).

Forecasting with PyAF

PyAF uses a machine learning approach to perform the forecasting task. It starts by building a time series model based on past values (training process) and then uses this model to generate the future values (forecast).

A typical PyAF use case can sketched the following way :

[read CSV file or table in database ] => [Pandas Dataframe , DF] => [train a model M using DF ] => [forecast DF using M] => [Pandas Dataframe containing Forecasts , FC] => [save FC to a csv file or a database]

The first two and last two operations are Pandas generic data access tasks and are not part of PyAF. PyAF is only concerned with training the mdoel and using it to forecast the signal.

A pandas dataframe object is a very sophisticated representation of the dataset. It allows a lot of possible manipulations on the data. Almost everything that can be done in a spreadsheet or a database (SQL) is possible, in python, using this object. It can be seen an abstarct unifying view of columnar data such as disk files and database tables.

For a serious introduction to pandas objects see http://pandas.pydata.org/pandas-docs/stable/dsintro.html.

PyAF Use Cases

PyAF on the Ozone dataset

Here , we take a standard use case, of ozone data in LA (https://datamarket.com/data/set/22u8/ozon-concentration-downtown-l-a-1955-1972). This is a popular dataset that gives the monthly average of ozone concentration in LA from January 1955 to Dec 1972.

The Ozone dataset is available as a comma separated values (CSV) file from the link below.

First , we load the dataset int a pandas dataframe object.


In [2]:
import pandas as pd


csvfile_link = "https://raw.githubusercontent.com/antoinecarme/TimeSeriesData/master/ozone-la.csv"
ozone_dataframe = pd.read_csv(csvfile_link);

This dataset has two columns : 'Month' and 'Ozone'.


In [3]:
ozone_dataframe.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216 entries, 0 to 215
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Month   216 non-null    object 
 1   Ozone   216 non-null    float64
dtypes: float64(1), object(1)
memory usage: 3.5+ KB

We see here that the column 'Month' is not really a datetime column (strange date format). Pandas prvides the necessary tools to 'normalize' this column on the fly (ISO date format) :


In [4]:
import datetime
ozone_dataframe['Month'] = ozone_dataframe['Month'].apply(lambda x : datetime.datetime.strptime(x, "%Y-%m"))
ozone_dataframe.head()


Out[4]:
Month Ozone
0 1955-01-01 2.7
1 1955-02-01 2.0
2 1955-03-01 3.6
3 1955-04-01 5.0
4 1955-05-01 6.5

Pandas provides also some plotting capabilities


In [5]:
%matplotlib inline  
ozone_dataframe.plot.line('Month', ['Ozone'], grid = True, figsize=(12, 8))


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbff28dba60>

Now, we can build/train a time series model with PyAF to forecast the next 12 values:


In [6]:
import pyaf.ForecastEngine as autof
lEngine = autof.cForecastEngine()
lEngine.train(ozone_dataframe , 'Month' , 'Ozone', 12);


INFO:pyaf.std:START_TRAINING 'Ozone'
INFO:pyaf.std:END_TRAINING_TIME_IN_SECONDS 'Ozone' 2.810774326324463

In [7]:
lEngine.getModelInfo()


INFO:pyaf.std:TIME_DETAIL TimeVariable='Month' TimeMin=1955-01-01T00:00:00.000000 TimeMax=1968-07-01T00:00:00.000000 TimeDelta=<DateOffset: months=1> Horizon=12
INFO:pyaf.std:SIGNAL_DETAIL_ORIG SignalVariable='Ozone' Length=216  Min=1.2 Max=8.7  Mean=3.772685185185185 StdDev=1.4881769275050256
INFO:pyaf.std:SIGNAL_DETAIL_TRANSFORMED TransformedSignalVariable='_Ozone' Min=1.2 Max=8.7  Mean=3.772685185185185 StdDev=1.4881769275050256
INFO:pyaf.std:BEST_TRANSOFORMATION_TYPE '_'
INFO:pyaf.std:BEST_DECOMPOSITION  '_Ozone_LinearTrend_residue_zeroCycle_residue_AR(54)' [LinearTrend + NoCycle + AR]
INFO:pyaf.std:TREND_DETAIL '_Ozone_LinearTrend' [LinearTrend]
INFO:pyaf.std:CYCLE_DETAIL '_Ozone_LinearTrend_residue_zeroCycle' [NoCycle]
INFO:pyaf.std:AUTOREG_DETAIL '_Ozone_LinearTrend_residue_zeroCycle_residue_AR(54)' [AR]
INFO:pyaf.std:MODEL_MAPE MAPE_Fit=0.163 MAPE_Forecast=0.1911 MAPE_Test=0.1412
INFO:pyaf.std:MODEL_SMAPE SMAPE_Fit=0.1549 SMAPE_Forecast=0.2071 SMAPE_Test=0.1591
INFO:pyaf.std:MODEL_MASE MASE_Fit=0.6862 MASE_Forecast=0.7568 MASE_Test=0.6507
INFO:pyaf.std:MODEL_L1 L1_Fit=0.6015042844230086 L1_Forecast=0.5259990223234978 L1_Test=0.3312822637507642
INFO:pyaf.std:MODEL_L2 L2_Fit=0.796062835055018 L2_Forecast=0.6332578057427527 L2_Test=0.4052783985064731
INFO:pyaf.std:MODEL_COMPLEXITY 56
INFO:pyaf.std:SIGNAL_TRANSFORMATION_DETAIL_START
INFO:pyaf.std:SIGNAL_TRANSFORMATION_MODEL_VALUES NoTransf None
INFO:pyaf.std:SIGNAL_TRANSFORMATION_DETAIL_END
INFO:pyaf.std:TREND_DETAIL_START
INFO:pyaf.std:LINEAR_RIDGE_TREND LinearTrend (4.9984819530936715, array([-1.88052118]))
INFO:pyaf.std:TREND_DETAIL_END
INFO:pyaf.std:CYCLE_MODEL_DETAIL_START
INFO:pyaf.std:ZERO_CYCLE_MODEL_VALUES _Ozone_LinearTrend_residue_zeroCycle 0.0 {}
INFO:pyaf.std:CYCLE_MODEL_DETAIL_END
INFO:pyaf.std:AR_MODEL_DETAIL_START
INFO:pyaf.std:AR_MODEL_COEFF 1 _Ozone_LinearTrend_residue_zeroCycle_residue_Lag1 0.3962309652132945
INFO:pyaf.std:AR_MODEL_COEFF 2 _Ozone_LinearTrend_residue_zeroCycle_residue_Lag10 0.18918578372229783
INFO:pyaf.std:AR_MODEL_COEFF 3 _Ozone_LinearTrend_residue_zeroCycle_residue_Lag12 0.1713743596956554
INFO:pyaf.std:AR_MODEL_COEFF 4 _Ozone_LinearTrend_residue_zeroCycle_residue_Lag30 0.15888412236547456
INFO:pyaf.std:AR_MODEL_COEFF 5 _Ozone_LinearTrend_residue_zeroCycle_residue_Lag7 -0.1467488289271946
INFO:pyaf.std:AR_MODEL_COEFF 6 _Ozone_LinearTrend_residue_zeroCycle_residue_Lag36 0.14341118628019953
INFO:pyaf.std:AR_MODEL_COEFF 7 _Ozone_LinearTrend_residue_zeroCycle_residue_Lag39 -0.13359041572857594
INFO:pyaf.std:AR_MODEL_COEFF 8 _Ozone_LinearTrend_residue_zeroCycle_residue_Lag48 0.12839876794812233
INFO:pyaf.std:AR_MODEL_COEFF 9 _Ozone_LinearTrend_residue_zeroCycle_residue_Lag52 -0.12405632037179806
INFO:pyaf.std:AR_MODEL_COEFF 10 _Ozone_LinearTrend_residue_zeroCycle_residue_Lag20 -0.11765432513657931
INFO:pyaf.std:AR_MODEL_DETAIL_END

At this point, PyAF has tested all possible time series models and selected the best one. The whole process took about 7 seconds in this test.

to predict the next values using this model and show the last 20 values :


In [8]:
ozone_forecast_dataframe = lEngine.forecast(ozone_dataframe, 12);
ozone_forecast_dataframe.tail(20)


INFO:pyaf.std:START_FORECASTING 'Ozone'
INFO:pyaf.std:END_FORECAST_TIME_IN_SECONDS  'Ozone' 0.5641038417816162
Out[8]:
Month Ozone row_number Month_Normalized _Ozone _Ozone_LinearTrend _Ozone_LinearTrend_residue _Ozone_LinearTrend_residue_zeroCycle _Ozone_LinearTrend_residue_zeroCycle_residue _Ozone_LinearTrend_residue_zeroCycle_residue_AR(54) ... _Ozone_Cycle _Ozone_Cycle_residue _Ozone_AR _Ozone_AR_residue _Ozone_TransformedForecast Ozone_Forecast _Ozone_TransformedResidue Ozone_Residue Ozone_Forecast_Lower_Bound Ozone_Forecast_Upper_Bound
208 1972-05-01 3.5 208 1.283976 3.5 2.583939 0.916061 0.0 0.916061 0.568953 ... 0.0 0.916061 0.568953 0.347109 3.152891 3.152891 0.347109 0.347109 NaN NaN
209 1972-06-01 3.4 209 1.290264 3.4 2.572114 0.827886 0.0 0.827886 0.924601 ... 0.0 0.827886 0.924601 -0.096714 3.496714 3.496714 -0.096714 -0.096714 NaN NaN
210 1972-07-01 4.0 210 1.296349 4.0 2.560670 1.439330 0.0 1.439330 0.888587 ... 0.0 1.439330 0.888587 0.550743 3.449257 3.449257 0.550743 0.550743 NaN NaN
211 1972-08-01 3.8 211 1.302637 3.8 2.548846 1.251154 0.0 1.251154 1.058889 ... 0.0 1.251154 1.058889 0.192266 3.607734 3.607734 0.192266 0.192266 NaN NaN
212 1972-09-01 3.1 212 1.308925 3.1 2.537021 0.562979 0.0 0.562979 0.594921 ... 0.0 0.562979 0.594921 -0.031942 3.131942 3.131942 -0.031942 -0.031942 NaN NaN
213 1972-10-01 2.1 213 1.315010 2.1 2.525578 -0.425578 0.0 -0.425578 -0.207479 ... 0.0 -0.425578 -0.207479 -0.218099 2.318099 2.318099 -0.218099 -0.218099 NaN NaN
214 1972-11-01 1.6 214 1.321298 1.6 2.513753 -0.913753 0.0 -0.913753 -1.114128 ... 0.0 -0.913753 -1.114128 0.200375 1.399625 1.399625 0.200375 0.200375 NaN NaN
215 1972-12-01 1.3 215 1.327383 1.3 2.502309 -1.202309 0.0 -1.202309 -1.467150 ... 0.0 -1.202309 -1.467150 0.264841 1.035159 1.035159 0.264841 0.264841 NaN NaN
216 1973-01-01 NaN 216 1.333671 NaN 2.490485 NaN 0.0 NaN -1.312567 ... 0.0 NaN -1.312567 NaN 1.177917 1.177917 NaN NaN -0.063268 2.419103
217 1973-02-01 NaN 217 1.339959 NaN 2.478660 NaN 0.0 NaN -0.952758 ... 0.0 NaN -0.952758 NaN 1.525902 1.525902 NaN NaN 0.360994 2.690810
218 1973-03-01 NaN 218 1.345639 NaN 2.467979 NaN 0.0 NaN -0.388842 ... 0.0 NaN -0.388842 NaN 2.079137 2.079137 NaN NaN 0.804130 3.354145
219 1973-04-01 NaN 219 1.351927 NaN 2.456155 NaN 0.0 NaN 0.046712 ... 0.0 NaN 0.046712 NaN 2.502866 2.502866 NaN NaN 1.172954 3.832779
220 1973-05-01 NaN 220 1.358012 NaN 2.444711 NaN 0.0 NaN 0.617615 ... 0.0 NaN 0.617615 NaN 3.062326 3.062326 NaN NaN 1.766779 4.357873
221 1973-06-01 NaN 221 1.364300 NaN 2.432887 NaN 0.0 NaN 0.716646 ... 0.0 NaN 0.716646 NaN 3.149532 3.149532 NaN NaN 1.753581 4.545483
222 1973-07-01 NaN 222 1.370385 NaN 2.421443 NaN 0.0 NaN 0.743317 ... 0.0 NaN 0.743317 NaN 3.164760 3.164760 NaN NaN 1.793982 4.535539
223 1973-08-01 NaN 223 1.376673 NaN 2.409618 NaN 0.0 NaN 0.994819 ... 0.0 NaN 0.994819 NaN 3.404437 3.404437 NaN NaN 1.844347 4.964527
224 1973-09-01 NaN 224 1.382961 NaN 2.397794 NaN 0.0 NaN 0.420351 ... 0.0 NaN 0.420351 NaN 2.818144 2.818144 NaN NaN 1.330937 4.305352
225 1973-10-01 NaN 225 1.389047 NaN 2.386350 NaN 0.0 NaN -0.097359 ... 0.0 NaN -0.097359 NaN 2.288991 2.288991 NaN NaN 0.539028 4.038954
226 1973-11-01 NaN 226 1.395335 NaN 2.374526 NaN 0.0 NaN -1.343742 ... 0.0 NaN -1.343742 NaN 1.030783 1.030783 NaN NaN -0.960313 3.021880
227 1973-12-01 NaN 227 1.401420 NaN 2.363082 NaN 0.0 NaN -1.752161 ... 0.0 NaN -1.752161 NaN 0.610921 0.610921 NaN NaN -1.636695 2.858537

20 rows × 23 columns

Here, we see that

  1. The last 12 values of the 'Month' column have been filled on a monthly basis for the next year (1973).
  2. The last 12 values of the 'Ozone' column are empty or unknown ('Nan').
  3. A lot of extra columns have been computed

The most impostant new column is 'Ozone_Forecast', we can tsee the last rows of this column as well as the time and signal columns :


In [9]:
ozone_forecast_dataframe[['Month' , 'Ozone' , 'Ozone_Forecast']].tail(20)


Out[9]:
Month Ozone Ozone_Forecast
208 1972-05-01 3.5 3.152891
209 1972-06-01 3.4 3.496714
210 1972-07-01 4.0 3.449257
211 1972-08-01 3.8 3.607734
212 1972-09-01 3.1 3.131942
213 1972-10-01 2.1 2.318099
214 1972-11-01 1.6 1.399625
215 1972-12-01 1.3 1.035159
216 1973-01-01 NaN 1.177917
217 1973-02-01 NaN 1.525902
218 1973-03-01 NaN 2.079137
219 1973-04-01 NaN 2.502866
220 1973-05-01 NaN 3.062326
221 1973-06-01 NaN 3.149532
222 1973-07-01 NaN 3.164760
223 1973-08-01 NaN 3.404437
224 1973-09-01 NaN 2.818144
225 1973-10-01 NaN 2.288991
226 1973-11-01 NaN 1.030783
227 1973-12-01 NaN 0.610921

The predicted value for '1973-01-01' is then '0.811919' , etc

It is possible to plot the forecast against the signal toc check the 'visual' quality of the model and see the prediction intervals


In [10]:
ozone_forecast_dataframe.plot.line('Month', ['Ozone' , 'Ozone_Forecast', 
                                             'Ozone_Forecast_Lower_Bound', 
                                             'Ozone_Forecast_Upper_Bound'], grid = True, figsize=(12, 8))


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbfe47c3910>

Finally, one can the save the forecasts datafrma eto a CSV file or a database :


In [11]:
ozone_forecast_dataframe.to_csv("ozone_forecast.csv")

In [ ]:


In [ ]:

PyAF on some stock value

Using Yahoo Finance all public stock data are available. Here we give an exsample of how one can use PyAF to predict the future values of a popular stock (GOOG).


In [12]:
# yahoo finance is no longer available ....
#goog_link = "http://chart.finance.yahoo.com/table.csv?s=GOOG&a=8&b=14&c=2015&d=9&e=14&f=2016&g=d&ignore=.csv"
goog_link = 'https://raw.githubusercontent.com/antoinecarme/TimeSeriesData/master/YahooFinance/nasdaq/yahoo_GOOG.csv'
    
import pandas as pd
goog_dataframe = pd.read_csv(goog_link);
goog_dataframe['Date'] = goog_dataframe['Date'].apply(lambda x : datetime.datetime.strptime(x, "%Y-%m-%d"))
goog_dataframe.sort_values(by = 'Date' , ascending=True, inplace=True)
goog_dataframe.tail()


Out[12]:
Unnamed: 0 Symbol Date Close Low Volume Open Adj_Close High
4 4 GOOG 2016-07-21 738.630005 735.830994 969100 740.359985 738.630005 741.690002
3 3 GOOG 2016-07-22 742.739990 736.559998 1256300 741.859985 742.739990 743.239990
2 2 GOOG 2016-07-25 739.770020 737.500000 1030700 740.669983 739.770020 742.609985
1 1 GOOG 2016-07-26 738.419983 734.270020 1179400 739.039978 738.419983 741.690002
0 0 GOOG 2016-07-27 741.770020 737.000000 1497100 738.280029 741.770020 744.460022

We needed to tranform the date column and sort values in increasing order (the yahoo API gives most recent values first).

We are interested in getting the future values of the 'Close' column over the next 7 days:


In [13]:
import pyaf.ForecastEngine as autof
lEngine = autof.cForecastEngine()
lEngine.train(goog_dataframe , 'Date' , 'Close', 7);


INFO:pyaf.std:START_TRAINING 'Close'
INFO:pyaf.std:END_TRAINING_TIME_IN_SECONDS 'Close' 5.959474802017212

The predicted values are :


In [14]:
goog_forecast_dataframe = lEngine.forecast(goog_dataframe, 7);


INFO:pyaf.std:START_FORECASTING 'Close'
INFO:pyaf.std:END_FORECAST_TIME_IN_SECONDS  'Close' 0.22056055068969727

We can see the forecasts data for last 7 days


In [15]:
goog_forecast_dataframe[['Date' , 'Close' , 'Close_Forecast']].tail(7)


Out[15]:
Date Close Close_Forecast
1258 2016-07-28 NaN 741.77002
1259 2016-07-29 NaN 741.77002
1260 2016-07-30 NaN 741.77002
1261 2016-07-31 NaN 741.77002
1262 2016-08-01 NaN 741.77002
1263 2016-08-02 NaN 741.77002
1264 2016-08-03 NaN 741.77002

One can see that all the values are equal to the last value (naive model). Forecasting finance data is not easy!!!

If you are curious enough, you can see more info about the model :


In [16]:
lEngine.getModelInfo()


INFO:pyaf.std:TIME_DETAIL TimeVariable='Date' TimeMin=2011-07-28T00:00:00.000000 TimeMax=2015-07-20T00:00:00.000000 TimeDelta=<DateOffset: days=1> Horizon=7
INFO:pyaf.std:SIGNAL_DETAIL_ORIG SignalVariable='Close' Length=1258  Min=490.92085099999997 Max=1220.172036  Mean=696.2858158593004 StdDev=168.76391868223206
INFO:pyaf.std:SIGNAL_DETAIL_TRANSFORMED TransformedSignalVariable='_Close' Min=490.92085099999997 Max=1220.172036  Mean=696.2858158593004 StdDev=168.76391868223206
INFO:pyaf.std:BEST_TRANSOFORMATION_TYPE '_'
INFO:pyaf.std:BEST_DECOMPOSITION  '_Close_Lag1Trend_residue_zeroCycle_residue_NoAR' [Lag1Trend + NoCycle + NoAR]
INFO:pyaf.std:TREND_DETAIL '_Close_Lag1Trend' [Lag1Trend]
INFO:pyaf.std:CYCLE_DETAIL '_Close_Lag1Trend_residue_zeroCycle' [NoCycle]
INFO:pyaf.std:AUTOREG_DETAIL '_Close_Lag1Trend_residue_zeroCycle_residue_NoAR' [NoAR]
INFO:pyaf.std:MODEL_MAPE MAPE_Fit=0.0114 MAPE_Forecast=0.0116 MAPE_Test=0.0042
INFO:pyaf.std:MODEL_SMAPE SMAPE_Fit=0.011 SMAPE_Forecast=0.0116 SMAPE_Test=0.0042
INFO:pyaf.std:MODEL_MASE MASE_Fit=0.999 MASE_Forecast=0.9964 MASE_Test=1.0039
INFO:pyaf.std:MODEL_L1 L1_Fit=7.581005081 L1_Forecast=8.007574187250995 L1_Test=3.1071427142857146
INFO:pyaf.std:MODEL_L2 L2_Fit=20.99627176739844 L2_Forecast=11.134816435807473 L2_Test=3.2363276636230087
INFO:pyaf.std:MODEL_COMPLEXITY 32
INFO:pyaf.std:SIGNAL_TRANSFORMATION_DETAIL_START
INFO:pyaf.std:SIGNAL_TRANSFORMATION_MODEL_VALUES NoTransf None
INFO:pyaf.std:SIGNAL_TRANSFORMATION_DETAIL_END
INFO:pyaf.std:TREND_DETAIL_START
INFO:pyaf.std:LAG1_TREND Lag1Trend 610.9410190000001
INFO:pyaf.std:TREND_DETAIL_END
INFO:pyaf.std:CYCLE_MODEL_DETAIL_START
INFO:pyaf.std:ZERO_CYCLE_MODEL_VALUES _Close_Lag1Trend_residue_zeroCycle 0.0 {}
INFO:pyaf.std:CYCLE_MODEL_DETAIL_END
INFO:pyaf.std:AR_MODEL_DETAIL_START
INFO:pyaf.std:AR_MODEL_DETAIL_END

Again, one can plot the forecasts against the signal:


In [17]:
goog_forecast_dataframe.plot.line('Date', ['Close' , 'Close_Forecast', 
                                             'Close_Forecast_Lower_Bound', 
                                             'Close_Forecast_Upper_Bound'], grid = True, figsize=(12, 8))


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbfe4857e20>

In [ ]:


In [ ]:


In [ ]: