In this notebook, we describe how to use Statsmodels to compute the impacts of updated or revised datasets on out-of-sample forecasts or in-sample estimates of missing data. We follow the approach of the "Nowcasting" literature (see references at the end), by using a state space model to compute the "news" and impacts of incoming data.
Note: this notebook applies to Statsmodels v0.12+. In addition, it only applies to the state space models or related classes, which are: sm.tsa.statespace.ExponentialSmoothing
, sm.tsa.arima.ARIMA
, sm.tsa.SARIMAX
, sm.tsa.UnobservedComponents
, sm.tsa.VARMAX
, and sm.tsa.DynamicFactor
.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
macrodata = sm.datasets.macrodata.load_pandas().data
macrodata.index = pd.period_range('1959Q1', '2009Q3', freq='Q')
Forecasting exercises often start with a fixed set of historical data that is used for model selection and parameter estimation. Then, the fitted selected model (or models) can be used to create out-of-sample forecasts. Most of the time, this is not the end of the story. As new data comes in, you may need to evaluate your forecast errors, possibly update your models, and create updated out-of-sample forecasts. This is sometimes called a "real-time" forecasting exercise (by contrast, a pseudo real-time exercise is one in which you simulate this procedure).
If all that matters is minimizing some loss function based on forecast errors (like MSE), then when new data comes in you may just want to completely redo model selection, parameter estimation and out-of-sample forecasting, using the updated datapoints. If you do this, your new forecasts will have changed for two reasons:
In this notebook, we focus on methods for isolating the first effect. The way we do this comes from the so-called "nowcasting" literature, and in particular Bańbura, Giannone, and Reichlin (2011), Bańbura and Modugno (2014), and Bańbura et al. (2014). They describe this exercise as computing the "news", and we follow them in using this language in Statsmodels.
These methods are perhaps most useful with multivariate models, since there multiple variables may update at the same time, and it is not immediately obvious what forecast change was created by what updated variable. However, they can still be useful for thinking about forecast revisions in univariate models. We will therefore start with the simpler univariate case to explain how things work, and then move to the multivariate case afterwards.
Note on revisions: the framework that we are using is designed to decompose changes to forecasts from newly observed datapoints. It can also take into account revisions to previously published datapoints, but it does not decompose them separately. Instead, it only shows the aggregate effect of "revisions".
Note on exog
data: the framework that we are using only decomposes changes to forecasts from newly observed datapoints for modeled variables. These are the "left-hand-side" variables that in Statsmodels are given in the endog
arguments. This framework does not decompose or account for changes to unmodeled "right-hand-side" variables, like those included in the exog
argument.
We will begin with a simple autoregressive model, an AR(1):
$$y_t = \phi y_{t-1} + \varepsilon_t$$We will use this model to forecast inflation.
To make it simpler to describe the forecast updates in this notebook, we will work with inflation data that has been de-meaned, but it is straightforward in practice to augment the model with a mean term.
In [2]:
# De-mean the inflation series
y = macrodata['infl'] - macrodata['infl'].mean()
Here, we'll simulate an out-of-sample exercise, by constructing and fitting our model using all of the data except the last five observations. We'll assume that we haven't observed these values yet, and then in subsequent steps we'll add them back into the analysis.
In [3]:
y_pre = y.iloc[:-5]
y_pre.plot(figsize=(15, 3), title='Inflation');
To construct forecasts, we first estimate the parameters of the model. This returns a results object that we will be able to use produce forecasts.
In [4]:
mod_pre = sm.tsa.arima.ARIMA(y_pre, order=(1, 0, 0), trend='n')
res_pre = mod_pre.fit()
print(res_pre.summary())
Creating the forecasts from the results object res
is easy - you can just call the forecast
method with the number of forecasts you want to construct. In this case, we'll construct four out-of-sample forecasts.
In [5]:
# Compute the forecasts
forecasts_pre = res_pre.forecast(4)
# Plot the last 3 years of data and the four out-of-sample forecasts
y_pre.iloc[-12:].plot(figsize=(15, 3), label='Data', legend=True)
forecasts_pre.plot(label='Forecast', legend=True);
For the AR(1) model, it is also easy to manually construct the forecasts. Denoting the last observed variable as $y_T$ and the $h$-step-ahead forecast as $y_{T+h|T}$, we have:
$$y_{T+h|T} = \hat \phi^h y_T$$Where $\hat \phi$ is our estimated value for the AR(1) coefficient. From the summary output above, we can see that this is the first parameter of the model, which we can access from the params
attribute of the results object.
In [6]:
# Get the estimated AR(1) coefficient
phi_hat = res_pre.params[0]
# Get the last observed value of the variable
y_T = y_pre.iloc[-1]
# Directly compute the forecasts at the horizons h=1,2,3,4
manual_forecasts = pd.Series([phi_hat * y_T, phi_hat**2 * y_T,
phi_hat**3 * y_T, phi_hat**4 * y_T],
index=forecasts_pre.index)
# We'll print the two to double-check that they're the same
print(pd.concat([forecasts_pre, manual_forecasts], axis=1))
In [7]:
# Get the next observation after the "pre" dataset
y_update = y.iloc[-5:-4]
# Print the forecast error
print('Forecast error: %.2f' % (y_update.iloc[0] - forecasts_pre.iloc[0]))
To compute forecasts based on our updated dataset, we will create an updated results object res_post
using the append
method, to append on our new observation to the previous dataset.
Note that by default, the append
method does not re-estimate the parameters of the model. This is exactly what we want here, since we want to isolate the effect on the forecasts of the new information only.
In [8]:
# Create a new results object by passing the new observations to the `append` method
res_post = res_pre.append(y_update)
# Since we now know the value for 2008Q3, we will only use `res_post` to
# produce forecasts for 2008Q4 through 2009Q2
forecasts_post = pd.concat([y_update, res_post.forecast('2009Q2')])
print(forecasts_post)
In this case, the forecast error is quite large - inflation was more than 10 percentage points below the AR(1) models' forecast. (This was largely because of large swings in oil prices around the global financial crisis).
To analyse this in more depth, we can use Statsmodels to isolate the effect of the new information - or the "news" - on our forecasts. This means that we do not yet want to change our model or re-estimate the parameters. Instead, we will use the news
method that is available in the results objects of state space models.
Computing the news in Statsmodels always requires a previous results object or dataset, and an updated results object or dataset. Here we will use the original results object res_pre
as the previous results and the res_post
results object that we just created as the updated results.
Once we have previous and updated results objects or datasets, we can compute the news by calling the news
method. Here, we will call res_pre.news
, and the first argument will be the updated results, res_post
(however, if you have two results objects, the news
method could can be called on either one).
In addition to specifying the comparison object or dataset as the first argument, there are a variety of other arguments that are accepted. The most important specify the "impact periods" that you want to consider. These "impact periods" correspond to the forecasted periods of interest; i.e. these dates specify with periods will have forecast revisions decomposed.
To specify the impact periods, you must pass two of start
, end
, and periods
(similar to the Pandas date_range
method). If your time series was a Pandas object with an associated date or period index, then you can pass dates as values for start
and end
, as we do below.
In [9]:
# Compute the impact of the news on the four periods that we previously
# forecasted: 2008Q3 through 2009Q2
news = res_pre.news(res_post, start='2008Q3', end='2009Q2')
# Note: one alternative way to specify these impact dates is
# `start='2008Q3', periods=4`
The variable news
is an object of the class NewsResults
, and it contains details about the updates to the data in res_post
compared to res_pre
, the new information in the updated dataset, and the impact that the new information had on the forecasts in the period between start
and end
.
One easy way to summarize the results are with the summary
method.
In [10]:
news.summary()
Out[10]:
Summary output: the default summary for this news results object printed four tables:
start='2008Q3'
and end='2009Q2'
start='2008Q3'
and end='2009Q2'
These are described in more detail below.
Notes:
summary
method to control this output. Check the documentation / docstring for details.First table: summary of the model and datasets
The first table, above, shows:
y_pre
y_post
Second table: the news from updated data
This table simply shows the forecasts from the previous results for observations that were updated in the updated sample.
Notes:
y_post
did not contain any revisions to previously observed datapoints. If it had, there would be an additional table showing the previous and updated values of each such revision.Third table: summary of the impacts of the new information
Columns:
The third table, above, shows:
Notes:
y_post
did not contain any revisions to previously observed datapoints. If it had, there would be additional columns in this table showing the impact of those revisions on the forecasts for the impact dates.estimate (new) = estimate (prev) + impact of news
summary_impacts
method.In our example:
Notice that in our example, the table shows the values that we computed earlier:
forecasts_pre
variable.forecasts_post
variable, which contains the observed value for 2008Q3 and the forecasts from the updated model for 2008Q4 - 2009Q2.Fourth table: details of updates and their impacts
The fourth table, above, shows how each new observation translated into specific impacts at each impact date.
Columns:
The first three columns table described the relevant update (an "updated" is a new observation):
The last four columns described the impact of a given update (an impact is a changed forecast within the "impact periods").
Notes:
sparsify
argument.impact = news * weight
.summary_details
method.In our example:
For this simple example with a univariate model, it is straightforward to compute all of the values shown above by hand. First, recall the formula for forecasting $y_{T+h|T} = \phi^h y_T$, and note that it follows that we also have $y_{T+h|T+1} = \phi^h y_{T+1}$. Finally, note that $y_{T|T+1} = y_T$, because if we know the value of the observations through $T+1$, we know the value of $y_T$.
News: The "news" is nothing more than the forecast error associated with one of the new observations. So the news associated with observation $T+1$ is:
$$n_{T+1} = y_{T+1} - y_{T+1|T} = Y_{T+1} - \phi Y_T$$Impacts: The impact of the news is the difference between the updated and previous forecasts, $i_h \equiv y_{T+h|T+1} - y_{T+h|T}$.
The impacts are therefore:
$$\{ i_h \}_{h=1}^4 = \begin{pmatrix} y_{T+1} - \phi y_T \\ \phi (Y_{T+1} - \phi y_T) \\ \phi^2 (Y_{T+1} - \phi y_T) \\ \phi^3 (Y_{T+1} - \phi y_T) \end{pmatrix}$$Weights: To compute the weights, we just need to note that it is immediate that we can rewrite the impacts in terms of the forecast errors, $n_{T+1}$.
$$\{ i_h \}_{h=1}^4 = \begin{pmatrix} 1 \\ \phi \\ \phi^2 \\ \phi^3 \end{pmatrix} n_{T+1}$$The weights are then simply $w = \begin{pmatrix} 1 \\ \phi \\ \phi^2 \\ \phi^3 \end{pmatrix}$
We can check that this is what the news
method has computed.
In [11]:
# Print the news, computed by the `news` method
print(news.news)
# Manually compute the news
print()
print((y_update.iloc[0] - phi_hat * y_pre.iloc[-1]).round(6))
In [12]:
# Print the total impacts, computed by the `news` method
# (Note: news.total_impacts = news.revision_impacts + news.update_impacts, but
# here there are no data revisions, so total and update impacts are the same)
print(news.total_impacts)
# Manually compute the impacts
print()
print(forecasts_post - forecasts_pre)
In [13]:
# Print the weights, computed by the `news` method
print(news.weights)
# Manually compute the weights
print()
print(np.array([1, phi_hat, phi_hat**2, phi_hat**3]).round(6))
In this example, we'll consider forecasting monthly core price inflation based on the Personal Consumption Expenditures (PCE) price index and the Consumer Price Index (CPI), using a Dynamic Factor model. Both of these measures track prices in the US economy and are based on similar source data, but they have a number of definitional differences. Nonetheless, they track each other relatively well, so modeling them jointly using a single dynamic factor seems reasonable.
One reason that this kind of approach can be useful is that the CPI is released earlier in the month than the PCE. One the CPI is released, therefore, we can update our dynamic factor model with that additional datapoint, and obtain an improved forecast for that month's PCE release. A more involved version of this kind of analysis is available in Knotek and Zaman (2017).
We start by downloading the core CPI and PCE price index data from FRED, converting them to annualized monthly inflation rates, removing two outliers, and de-meaning each series (the dynamic factor model does not
In [14]:
import pandas_datareader as pdr
levels = pdr.get_data_fred(['PCEPILFE', 'CPILFESL'], start='1999', end='2019').to_period('M')
infl = np.log(levels).diff().iloc[1:] * 1200
infl.columns = ['PCE', 'CPI']
# Remove two outliers and de-mean the series
infl['PCE'].loc['2001-09':'2001-10'] = np.nan
To show how this works, we'll imagine that it is April 14, 2017, which is the data of the March 2017 CPI release. So that we can show the effect of multiple updates at once, we'll assume that we haven't updated our data since the end of January, so that:
In [15]:
# Previous dataset runs through 2017-02
y_pre = infl.loc[:'2017-01'].copy()
const_pre = np.ones(len(y_pre))
print(y_pre.tail())
In [16]:
# For the updated dataset, we'll just add in the
# CPI value for 2017-03
y_post = infl.loc[:'2017-03'].copy()
y_post.loc['2017-03', 'PCE'] = np.nan
const_post = np.ones(len(y_post))
# Notice the missing value for PCE in 2017-03
print(y_post.tail())
We chose this particular example because in March 2017, core CPI prices fell for the first time since 2010, and this information may be useful in forecast core PCE prices for that month. The graph below shows the CPI and PCE price data as it would have been observed on April 14th$^\dagger$.
$\dagger$ This statement is not entirely true, because both the CPI and PCE price indexes can be revised to a certain extent after the fact. As a result, the series that we're pulling are not exactly like those observed on April 14, 2017. This could be fixed by pulling the archived data from ALFRED instead of FRED, but the data we have is good enough for this tutorial.
In [17]:
# Plot the updated dataset
fig, ax = plt.subplots(figsize=(15, 3))
y_post.plot(ax=ax)
ax.hlines(0, '2009', '2017-06', linewidth=1.0)
ax.set_xlim('2009', '2017-06');
To perform the exercise, we first construct and fit a DynamicFactor
model. Specifically:
k_factors=1
)factor_order=6
)exog=const_pre
), because the inflation series we are working with are not mean-zero.
In [18]:
mod_pre = sm.tsa.DynamicFactor(y_pre, exog=const_pre, k_factors=1, factor_order=6)
res_pre = mod_pre.fit()
print(res_pre.summary())
With the fitted model in hand, we now construct the news and impacts associated with observing the CPI for March 2017. The updated data is for February 2017 and part of March 2017, and we'll examining the impacts on both March and April.
In the univariate example, we first created an updated results object, and then passed that to the news
method. Here, we're creating the news by directly passing the updated dataset.
Notice that:
y_post
contains the entire updated dataset (not just the new datapoints)We also had to pass an updated exog
array. This array must cover both:
y_post
y_post
through the last impact date, specified by end
Here, y_post
ends in March 2017, so we needed our exog
to extend one more period, to April 2017.
In [19]:
# Create the news results
# Note
const_post_plus1 = np.ones(len(y_post) + 1)
news = res_pre.news(y_post, exog=const_post_plus1, start='2017-03', end='2017-04')
Note:
In the univariate example, above, we first constructed a new results object, and then passed that to the
news
method. We could have done that here too, although there is an extra step required. Since we are requesting an impact for a period beyond the end ofy_post
, we would still need to pass the additional value for theexog
variable during that period tonews
:res_post = res_pre.apply(y_post, exog=const_post) news = res_pre.news(res_post, exog=[1.], start='2017-03', end='2017-04')
Now that we have computed the news
, printing summary
is a convenient way to see the results.
In [20]:
# Show the summary of the news results
news.summary()
Out[20]:
Because we have multiple variables, by default the summary only shows the news from updated data along and the total impacts.
From the first table, we can see that our updated dataset contains three new data points, with most of the "news" from these data coming from the very low reading in March 2017.
The second table shows that these three datapoints substantially impacted the estimate for PCE in March 2017 (which was not yet observed). This estimate revised down by nearly 1.5 percentage points.
The updated data also impacted the forecasts in the first out-of-sample month, April 2017. After incorporating the new data, the model's forecasts for CPI and PCE inflation in that month revised down 0.29 and 0.17 percentage point, respectively.
While these tables show the "news" and the total impacts, they do not show how much of each impact was caused by each updated datapoint. To see that information, we need to look at the details tables.
One way to see the details tables is to pass include_details=True
to the summary
method. To avoid repeating the tables above, however, we'll just call the summary_details
method directly.
In [21]:
news.summary_details()
Out[21]:
This table shows that most of the revisions to the estimate of PCE in April 2017, described above, came from the news associated with the CPI release in March 2017. By contrast, the CPI release in February had only a little effect on the April forecast, and the PCE release in February had essentially no effect.
Bańbura, Marta, Domenico Giannone, and Lucrezia Reichlin. "Nowcasting." The Oxford Handbook of Economic Forecasting. July 8, 2011.
Bańbura, Marta, Domenico Giannone, Michele Modugno, and Lucrezia Reichlin. "Now-casting and the real-time data flow." In Handbook of economic forecasting, vol. 2, pp. 195-237. Elsevier, 2013.
Bańbura, Marta, and Michele Modugno. "Maximum likelihood estimation of factor models on datasets with arbitrary pattern of missing data." Journal of Applied Econometrics 29, no. 1 (2014): 133-160.
Knotek, Edward S., and Saeed Zaman. "Nowcasting US headline and core inflation." Journal of Money, Credit and Banking 49, no. 5 (2017): 931-968.