In [1]:
# remove the notebook root logger.
import logging
logger = logging.getLogger()
logger.handlers = []
This is an introduction to time series forecasting with PyAF. We describe the basic vocabulary used and give simple examples.
The main problem to be solved with a forecasting tool is to predict the future values of a quantity of interest called 'Signal' over a period of time called 'Horizon' (for example, predict the shop sales over the next 7 days).
Usually, all available (mainly past) data on the signal can be used to perform the forecasting task. Related external signals, called exogenous signals, can also be used (for example, weather data can be used to predict ice-cream sales).
The time itself can also be used. Some time information can be helpful : Sundays, holidays, quarter of the year, etc. Future values of time-based data are usually known.
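As a small illustration of why time-based data are known in advance, the sketch below derives a few calendar features for the days of a 7-day horizon using only the standard library. The function name `time_features`, the chosen features and the start date are hypothetical and only serve the example; this is not how PyAF builds its own time features.

```python
import datetime

def time_features(day):
    """Return a few simple calendar features for a given date."""
    return {
        "date": day.isoformat(),
        "is_sunday": day.weekday() == 6,      # Monday is 0, Sunday is 6
        "quarter": (day.month - 1) // 3 + 1,  # 1 to 4
        "month": day.month,
    }

# Hypothetical last observed day, followed by a 7-day horizon.
last_observed = datetime.date(2016, 10, 14)
horizon = 7
future_days = [last_observed + datetime.timedelta(days=i) for i in range(1, horizon + 1)]
future_features = [time_features(day) for day in future_days]

for row in future_features:
    print(row)
```

These features can be computed for any future date, whereas the signal itself is unknown there.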
Time series models are mathematical concepts that can be used to compute the future values. They summarize, in a regular way, a pattern that can be observed in the data. Example time series models :
Automatic forecasting is about testing a lot of possible time series models and selecting the best model based on its forecast quality (difference between predicted and actual values).
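The idea of testing several models and keeping the best one can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not PyAF's actual procedure: the three candidate models, the hold-out split, the mean absolute error criterion and all names (`select_best_model`, `seasonal_naive`, etc.) are chosen here for the example only.

```python
def naive(train, horizon):
    """Repeat the last observed value."""
    return [train[-1]] * horizon

def mean_model(train, horizon):
    """Repeat the average of the training values."""
    return [sum(train) / len(train)] * horizon

def seasonal_naive(train, horizon, period=4):
    """Repeat the values observed during the last full period."""
    return [train[len(train) - period + (i % period)] for i in range(horizon)]

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def select_best_model(signal, horizon, candidates):
    """Hold out the last `horizon` values and keep the model with the lowest error."""
    train, test = signal[:-horizon], signal[-horizon:]
    errors = {name: mean_absolute_error(test, model(train, horizon))
              for name, model in candidates.items()}
    best = min(errors, key=errors.get)
    return best, errors

# Toy signal with a period of 4 (made-up values).
signal = [10, 20, 30, 40] * 5
candidates = {"naive": naive, "mean": mean_model, "seasonal_naive": seasonal_naive}
best_name, errors = select_best_model(signal, 4, candidates)
print(best_name, errors)
```

On this perfectly periodic toy signal, the seasonal model reproduces the held-out values exactly, so it is the one selected.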
PyAF uses a machine learning approach to perform the forecasting task. It starts by building a time series model based on past values (training process) and then uses this model to generate the future values (forecast).
A typical PyAF use case can be sketched the following way :
[read CSV file or table in database] => [Pandas Dataframe, DF] => [train a model M using DF] => [forecast DF using M] => [Pandas Dataframe containing Forecasts, FC] => [save FC to a CSV file or a database]
The first two and last two operations are Pandas generic data access tasks and are not part of PyAF. PyAF is only concerned with training the model and using it to forecast the signal.
A pandas dataframe object is a very sophisticated representation of the dataset. It allows a lot of possible manipulations on the data. Almost everything that can be done in a spreadsheet or a database (SQL) is possible, in python, using this object. It can be seen as an abstract, unifying view of columnar data such as disk files and database tables.
For a serious introduction to pandas objects see http://pandas.pydata.org/pandas-docs/stable/dsintro.html.
Here, we take a standard use case: ozone data in LA (https://datamarket.com/data/set/22u8/ozon-concentration-downtown-l-a-1955-1972). This is a popular dataset that gives the monthly average of ozone concentration in LA from January 1955 to December 1972.
The Ozone dataset is available as a comma separated values (CSV) file from the link below.
First, we load the dataset into a pandas dataframe object.
In [2]:
import pandas as pd
csvfile_link = "https://raw.githubusercontent.com/antoinecarme/TimeSeriesData/master/ozone-la.csv"
ozone_dataframe = pd.read_csv(csvfile_link);
This dataset has two columns : 'Month' and 'Ozone'.
In [3]:
ozone_dataframe.info()
We see here that the column 'Month' is not really a datetime column (it holds text in a non-standard date format). Python and Pandas provide the necessary tools to 'normalize' this column on the fly (ISO date format) :
In [4]:
import datetime
ozone_dataframe['Month'] = ozone_dataframe['Month'].apply(lambda x : datetime.datetime.strptime(x, "%Y-%m"))
ozone_dataframe.head()
Out[4]:
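To make the conversion above concrete, here is what the "%Y-%m" format does to a single value. The sample string "1955-01" is an assumption about how the raw 'Month' values look, inferred from the format string used in the cell; it is not copied from the file.

```python
import datetime

# Example value in year-month form (assumed shape of the raw 'Month' column).
raw_month = "1955-01"
parsed = datetime.datetime.strptime(raw_month, "%Y-%m")

# The day is not part of the format, so it defaults to the first of the month.
print(parsed)
print(parsed.isoformat())
```

Each month is therefore represented by a timestamp on its first day, which gives the dataframe a proper datetime column.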
Pandas also provides some plotting capabilities :
In [5]:
%matplotlib inline
ozone_dataframe.plot.line('Month', ['Ozone'], grid = True, figsize=(12, 8))
Out[5]:
Now, we can build/train a time series model with PyAF to forecast the next 12 values:
In [6]:
import pyaf.ForecastEngine as autof
lEngine = autof.cForecastEngine()
lEngine.train(ozone_dataframe , 'Month' , 'Ozone', 12);
In [7]:
lEngine.getModelInfo()
At this point, PyAF has tested its candidate time series models and selected the best one. The whole process took about 7 seconds in this test.
To predict the next values using this model and show the last 20 values :
In [8]:
ozone_forecast_dataframe = lEngine.forecast(ozone_dataframe, 12);
ozone_forecast_dataframe.tail(20)
Out[8]:
Here, we see that
The most important new column is 'Ozone_Forecast'. We can see the last rows of this column, as well as the time and signal columns :
In [9]:
ozone_forecast_dataframe[['Month' , 'Ozone' , 'Ozone_Forecast']].tail(20)
Out[9]:
The predicted value for '1973-01-01' is then '0.811919', etc.
It is possible to plot the forecast against the signal to check the 'visual' quality of the model and see the prediction intervals :
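To isolate only the predicted future values, one option is to keep the rows where the signal itself is missing. The sketch below uses a small made-up dataframe as a stand-in for the forecast output, and it assumes that the future rows carry an empty signal value next to a filled forecast column; that layout should be checked against the actual output table before relying on it.

```python
import pandas as pd

# Toy stand-in for a forecast dataframe (made-up values, not PyAF output).
toy_forecast = pd.DataFrame({
    "Month": pd.to_datetime(["1972-11-01", "1972-12-01", "1973-01-01", "1973-02-01"]),
    "Ozone": [1.3, 1.2, None, None],          # signal is unknown for future dates
    "Ozone_Forecast": [1.25, 1.15, 0.9, 1.0],
})

# Future rows are those where the signal is missing.
future_rows = toy_forecast[toy_forecast["Ozone"].isna()]
print(future_rows[["Month", "Ozone_Forecast"]])
```

Under the same assumption, `tail(12)` on the real forecast dataframe would give the same rows for a horizon of 12.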
In [10]:
ozone_forecast_dataframe.plot.line('Month', ['Ozone' , 'Ozone_Forecast',
'Ozone_Forecast_Lower_Bound',
'Ozone_Forecast_Upper_Bound'], grid = True, figsize=(12, 8))
Out[10]:
Finally, one can save the forecasts dataframe to a CSV file or a database :
In [11]:
ozone_forecast_dataframe.to_csv("ozone_forecast.csv")
Using Yahoo Finance, all public stock data are available. Here we give an example of how one can use PyAF to predict the future values of a popular stock (GOOG).
In [12]:
# yahoo finance is no longer available ....
#goog_link = "http://chart.finance.yahoo.com/table.csv?s=GOOG&a=8&b=14&c=2015&d=9&e=14&f=2016&g=d&ignore=.csv"
goog_link = 'https://raw.githubusercontent.com/antoinecarme/TimeSeriesData/master/YahooFinance/nasdaq/yahoo_GOOG.csv'
import pandas as pd
goog_dataframe = pd.read_csv(goog_link);
goog_dataframe['Date'] = goog_dataframe['Date'].apply(lambda x : datetime.datetime.strptime(x, "%Y-%m-%d"))
goog_dataframe.sort_values(by = 'Date' , ascending=True, inplace=True)
goog_dataframe.tail()
Out[12]:
We needed to transform the date column and sort the values in increasing order (the Yahoo API gives the most recent values first).
We are interested in getting the future values of the 'Close' column over the next 7 days:
In [13]:
import pyaf.ForecastEngine as autof
lEngine = autof.cForecastEngine()
lEngine.train(goog_dataframe , 'Date' , 'Close', 7);
The predicted values are :
In [14]:
goog_forecast_dataframe = lEngine.forecast(goog_dataframe, 7);
We can see the forecast data for the last 7 days :
In [15]:
goog_forecast_dataframe[['Date' , 'Close' , 'Close_Forecast']].tail(7)
Out[15]:
One can see that all the forecast values are equal to the last observed value (naive model). Forecasting financial data is not easy!
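For reference, a naive model in the sense used here simply repeats the last observed value over the whole horizon. A minimal sketch, with a hypothetical function name and made-up closing prices:

```python
def naive_forecast(signal, horizon):
    """Repeat the last observed value `horizon` times."""
    if not signal:
        raise ValueError("signal must contain at least one value")
    return [signal[-1]] * horizon

# Made-up closing prices, for illustration only.
closes = [100.0, 102.5, 101.0, 103.0]
naive_predictions = naive_forecast(closes, 7)
print(naive_predictions)
```

Such a model is a common baseline: when nothing more elaborate beats it on past data, it ends up being the selected model.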
If you are curious enough, you can see more info about the model :
In [16]:
lEngine.getModelInfo()
Again, one can plot the forecasts against the signal:
In [17]:
goog_forecast_dataframe.plot.line('Date', ['Close' , 'Close_Forecast',
'Close_Forecast_Lower_Bound',
'Close_Forecast_Upper_Bound'], grid = True, figsize=(12, 8))
Out[17]: