If you're reading this you probably already know that MIDAS stands for Mixed Data Sampling, and it is a technique for creating time-series forecast models that allows you to mix series of different frequencies (ie, you can use monthly data as predictors for a quarterly series, or daily data as predictors for a monthly series, etc.). The general approach has been described in a series of papers by Ghysels, Santa-Clara, Valkanov and others.
This notebook attempts to recreate some of the examples from the paper Forecasting with Mixed Frequencies by Michelle T. Armesto, Kristie M. Engemann, and Michael T. Owyang.
In [1]:
%matplotlib inline
import datetime
import numpy as np
import pandas as pd
from midas.mix import mix_freq
from midas.adl import estimate, forecast, midas_adl, rmse
This package currently implements the MIDAS ADL (autoregressive distributed lag) method. We'll start with an example using quarterly GDP and monthly payroll data. We'll then show the basic steps in setting up and fitting this type of model, although in practice you'll probably used the top-level midas_adl function to do forecasts.
TODO: MIDAS equation and discussion
In [2]:
gdp = pd.read_csv('../tests/data/gdp.csv', parse_dates=['DATE'], index_col='DATE')
pay = pd.read_csv('../tests/data/pay.csv', parse_dates=['DATE'], index_col='DATE')
gdp.tail()
Out[2]:
In [3]:
pay.tail()
Out[3]:
In [4]:
gdp_yoy = ((1. + (np.log(gdp.GDP) - np.log(gdp.GDP.shift(3)))) ** 4) - 1.
emp_yoy = ((1. + (np.log(pay.PAY) - np.log(pay.PAY.shift(1)))) ** 12) - 1.
df = pd.concat([gdp_yoy, emp_yoy], axis=1)
df.columns = ['gdp_yoy', 'emp_yoy']
df[['gdp_yoy','emp_yoy']].loc['1980-1-1':].plot(figsize=(15,4), style=['o','-'])
Out[4]:
In [5]:
gdp['gdp_growth'] = (np.log(gdp.GDP) - np.log(gdp.GDP.shift(1))) * 100.
pay['emp_growth'] = (np.log(pay.PAY) - np.log(pay.PAY.shift(1))) * 100.
y, yl, x, yf, ylf, xf = mix_freq(gdp.gdp_growth, pay.emp_growth, "3m", 1, 3,
start_date=datetime.datetime(1985,1,1),
end_date=datetime.datetime(2009,1,1))
x.head()
Out[5]:
The arguments here are as follows:
The horizon argument is a little tricky (the argument name was retained from the MatLab version). This is used both the align the data and to do nowcasting (more on that later). For example, if it's September 2017 then the latest GDP data from FRED will be for Q2 and this will be dated 2017-04-01. The latest monthly data from non-farm payroll will be for August, which will be dated 2017-08-01. If we aligned just on dates, the payroll data for April (04-01), March (03-01), and February(02-01) would be aligned with Q2 (since xlag = "3m"), but what we want is June, May, and April, so here the horizon argument is 3 indicating that the high-frequency data should be lagged three months before being mixed with the quarterly data.
In [6]:
res = estimate(y, yl, x, poly='beta')
res.x
Out[6]:
You can also call forecast directly. This will use the optimization results returned from eatimate to produce a forecast for every date in the index of the forecast inputs (here xf and ylf):
In [7]:
fc = forecast(xf, ylf, res, poly='beta')
forecast_df = fc.join(yf)
forecast_df['gap'] = forecast_df.yfh - forecast_df.gdp_growth
forecast_df
Out[7]:
In [8]:
gdp.join(fc)[['gdp_growth','yfh']].loc['2005-01-01':].plot(style=['-o','-+'], figsize=(12, 4))
Out[8]:
In [9]:
import statsmodels.tsa.api as sm
m = sm.AR(gdp['1975-01-01':'2011-01-01'].gdp_growth,)
r = m.fit(maxlag=1)
r.params
fc_ar = r.predict(start='2005-01-01')
fc_ar.name = 'xx'
In [10]:
df_p = gdp.join(fc)[['gdp_growth','yfh']]
df_p.join(fc_ar)[['gdp_growth','yfh','xx']].loc['2005-01-01':].plot(style=['-o','-+'], figsize=(12, 4))
Out[10]:
The midas_adl function wraps up frequency-mixing, fitting, and forecasting into one process. The default mode of forecasting is fixed, which means that the data between start_date and end_date will be used to fit the model, and then any data in the input beyond end_date will be used for forecasting. For example, here we're fitting from the beginning of 1985 to the end of 2008, but the gdp data extends to Q1 of 2011 so we get nine forecast points. Three monthly lags of the high-frequency data are specified along with one quarterly lag of GDP.
In [11]:
rmse_fc, fc = midas_adl(gdp.gdp_growth, pay.emp_growth,
start_date=datetime.datetime(1985,1,1),
end_date=datetime.datetime(2009,1,1),
xlag="3m",
ylag=1,
horizon=3)
rmse_fc
Out[11]:
You can also change the polynomial used to weight the MIDAS coefficients. The default is 'beta', but you can also specify exponential Almom weighting ('expalmon') or beta with non-zero last term ('betann')
In [13]:
rmse_fc, fc = midas_adl(gdp.gdp_growth, pay.emp_growth,
start_date=datetime.datetime(1985,1,1),
end_date=datetime.datetime(2009,1,1),
xlag="3m",
ylag=1,
horizon=3,
poly='expalmon')
rmse_fc
Out[13]:
As mentioned above the default forecasting method is fixed where the model is fit once and then all data after end_date is used for forecasting. Two other methods are supported rolling window and recursive. The rolling window method is just what it sounds like. The start_date and end_date are used for the initial window, and then each new forecast moves that window forward by one period so that you're always doing one step ahead forecasts. Of course, to do anything useful this also assumes that the date range of the dependent data extends beyond end_date accounting for the lags implied by horizon. Generally, you'll get lower RMSE values here since the forecasts are always one step ahead.
In [14]:
results = {h: midas_adl(gdp.gdp_growth, pay.emp_growth,
start_date=datetime.datetime(1985,10,1),
end_date=datetime.datetime(2009,1,1),
xlag="3m",
ylag=1,
horizon=3,
forecast_horizon=h,
poly='beta',
method='rolling') for h in (1, 2, 5)}
results[1][0]
Out[14]:
The recursive method is similar except that the start date does not change, so the range over which the fitting happens increases for each new forecast.
In [15]:
results = {h: midas_adl(gdp.gdp_growth, pay.emp_growth,
start_date=datetime.datetime(1985,10,1),
end_date=datetime.datetime(2009,1,1),
xlag="3m",
ylag=1,
horizon=3,
forecast_horizon=h,
poly='beta',
method='recursive') for h in (1, 2, 5)}
results[1][0]
Out[15]:
In [33]:
rmse_fc, fc = midas_adl(gdp.gdp_growth, pay.emp_growth,
start_date=datetime.datetime(1985,1,1),
end_date=datetime.datetime(2009,1,1),
xlag="3m",
ylag=1,
horizon=1)
rmse_fc
Out[33]:
Not surprisingly the RMSE drops considerably.
In [16]:
cpi = pd.read_csv('CPIAUCSL.csv', parse_dates=['DATE'], index_col='DATE')
ffr = pd.read_csv('DFF_2_Vintages_Starting_2009_09_28.txt', sep='\t', parse_dates=['observation_date'],
index_col='observation_date')
In [17]:
cpi.head()
Out[17]:
In [18]:
ffr.head(10)
Out[18]:
In [19]:
cpi_yoy = ((1. + (np.log(cpi.CPIAUCSL) - np.log(cpi.CPIAUCSL.shift(1)))) ** 12) - 1.
cpi_yoy.head()
df = pd.concat([cpi_yoy, ffr.DFF_20090928 / 100.], axis=1)
df.columns = ['cpi_growth', 'dff']
df.loc['1980-1-1':'2010-1-1'].plot(figsize=(15,4), style=['-+','-.'])
Out[19]:
In [20]:
cpi_growth = (np.log(cpi.CPIAUCSL) - np.log(cpi.CPIAUCSL.shift(1))) * 100.
y, yl, x, yf, ylf, xf = mix_freq(cpi_growth, ffr.DFF_20090928, "1m", 1, 1,
start_date=datetime.datetime(1975,10,1),
end_date=datetime.datetime(1991,1,1))
In [21]:
x.head()
Out[21]:
In [22]:
res = estimate(y, yl, x)
In [23]:
fc = forecast(xf, ylf, res)
fc.join(yf).head()
Out[23]:
In [24]:
pd.concat([cpi_growth, fc],axis=1).loc['2008-01-01':'2010-01-01'].plot(style=['-o','-+'], figsize=(12, 4))
Out[24]:
In [25]:
results = {h: midas_adl(cpi_growth, ffr.DFF_20090928,
start_date=datetime.datetime(1975,7,1),
end_date=datetime.datetime(1990,11,1),
xlag="1m",
ylag=1,
horizon=1,
forecast_horizon=h,
method='rolling') for h in (1, 2, 5)}
(results[1][0], results[2][0], results[5][0])
Out[25]:
In [26]:
results[1][1].plot(figsize=(12,4))
Out[26]:
In [27]:
results = {h: midas_adl(cpi_growth, ffr.DFF_20090928,
start_date=datetime.datetime(1975,10,1),
end_date=datetime.datetime(1991,1,1),
xlag="1m",
ylag=1,
horizon=1,
forecast_horizon=h,
method='recursive') for h in (1, 2, 5)}
results[1][0]
Out[27]:
In [28]:
results[1][1].plot()
Out[28]:
In [ ]: