In [1]:
from __future__ import print_function, division, absolute_import
Chad M. Schafer, Carnegie Mellon University
April 2018
This notebook includes exercises associated with the lecture on state space methods, as part of the LSSTC Data Science Fellowship Program.
In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.signal import lfilter
%matplotlib notebook
In [ ]:
ssdata = sm.datasets.sunspots.load()
Consider a plot of the time series below.
In [359]:
plt.figure()
plt.ylabel("Lag 11 Difference")
plt.xlabel("Year")
plt.plot(list(range(1700,2009)),ssdata.endog)
plt.show()
There is a well-established 11-year cycle to these counts, so in order to create an (approximately) stationary series, it makes sense to try taking the lag 11 differences. The code below will create a differenced version of the series.
In [331]:
def difference(dataset, interval=1):
diff = list()
for i in range(interval, len(dataset)):
value = dataset[i] - dataset[i - interval]
diff.append(value)
return diff
ssdiff = difference(ssdata.endog,11)
Reconsider a plot of the data.
In [355]:
plt.figure()
plt.ylabel("Lag 11 Difference")
plt.xlabel("Year")
plt.plot(list(range(1710,2008)), ssdiff)
plt.show()
This is more like the behavior we expect from a stationary series. Namely, the series shows mean reversion, which means that the series fluctuates, but tends to return to an overall mean level.
Now let's look at the ACF of the series.
In [360]:
sm.graphics.tsa.plot_acf(ssdiff,lags=20)
plt.show()
We can fit the AR(2) model using the following syntax.
In [345]:
ssARmodel = sm.tsa.AR(ssdiff)
ssARmodelresults = ssARmodel.fit(maxlag=2, method="mle")
The parameter estimates and the value of BIC for this model is easily obtained.
In [346]:
print(ssARmodelresults.params)
print(ssARmodelresults.bic)
It is tempting to look at a plot of the actual observations versus what the model predicts (forecasts) for that observation.
In [361]:
plt.figure()
plt.plot(ssARmodelresults.predict(),ssdiff,'o')
plt.show()
We see that there is general agreement, but it is better to construct the ACF of the residuals to determine if there is remaining correlation.
In [362]:
residuals = ssdiff - ssARmodelresults.predict()
Problem 1a
Construct the ACF of the residuals and comment on what you see. Does it appear that the model has accounted for all of the serial correlation in the series?
In [363]:
foo=sm.graphics.tsa.plot_acf(residuals,lags=20)
plt.show()
Problem 1b
Let's consider a model of higher order. What choice of $p$ minimizes the value of BIC in this case? Be sure to work with the differenced series, not the original series. Consider values of $p$ up to 20.
Study the code below taken from http://www.statsmodels.org/dev/statespace.html.
After simulating some data (stored in endog
), it creates a Class AR2
that fits, using a state space formulation, an AR(2) model.
__init__
method, the number of states is specified via k_states
. k_posdef
specifies the dimension of the covariance matrix ${\bf Q}$.start_params
method returns the initial values for the parameters used in the iterative search for the MLE.initialization='stationary'
, the user indicates that the initial states are assumed to be drawn from the stationary distribution of the time series.
In [364]:
# True model parameters
nobs = int(1e3)
true_phi = np.r_[0.5, -0.2]
true_sigma = 1**0.5
# Simulate a time series
np.random.seed(1234)
disturbances = np.random.normal(0, true_sigma, size=(nobs,))
endog = lfilter([1], np.r_[1, -true_phi], disturbances)
# Construct the model
class AR2(sm.tsa.statespace.MLEModel):
def __init__(self, endog):
# Initialize the state space model
super(AR2, self).__init__(endog, k_states=2, k_posdef=1,
initialization='stationary')
# Setup the fixed components of the state space representation
self['design'] = [1, 0]
self['transition'] = [[0, 0],
[1, 0]]
self['selection', 0, 0] = 1
# Describe how parameters enter the model
def update(self, params, transformed=True, **kwargs):
params = super(AR2, self).update(params, transformed, **kwargs)
self['transition', 0, :] = params[:2]
self['state_cov', 0, 0] = params[2]
# Specify start parameters and parameter names
@property
def start_params(self):
return [0,0,1] # these are very simple
# Create and fit the model
mod = AR2(endog)
res = mod.fit()
print(res.summary())
Problem 2a
Explain the role of the update
method in Class AR2
defined above.
Problem 2b
Write out the model being fit here, both in state space form, and in a "natural" form. (It is not the same as the AR(2) model specified in the lecture notes.)
Problem 2c
Update the Class above to make the model match to the one presented during lecture. (An additional parameter is needed.)
Problem 2d
Refit the simulated data using this new model. Does the value of log likelihood increase or decrease? Does the value of BIC increase or decrease? Is this surprising?
Problem 3a
Try to fit the state space AR(2) model to the (differenced) sunspots data. You may encounter some challenges, but see if you can correct this problem. What is the lesson here? Use the model with the parameter added in Problem 2c above.
Problem 3b
Suppose that you believe that there is measurement error in the sunspot count. On the differenced scale, you are comfortable modelling this error as Gaussian. Incorporate this into your state space model, and then refit the model.
Next we will consider the Kalman Filter, and the various estimates it can create.
First, we generate the predicted state for each time step. Note that there is a "prediction" for the initial state of zero. Then there are predictions for each of the states up and including one step beyond the final time step.
In [369]:
plt.figure()
plt.ylabel("Lag 11 Difference")
plt.xlabel("Year")
plt.plot(list(range(1710,2008)),ssdiff)
plt.plot(list(range(1710,2009)),res.smoother_results.predicted_state[0,:])
plt.show()
Problem 3c
If you do not zoom in on the plot, it appears as if the predictions are quite accurate. Zoom in on the plot. What is actually happening?
Problem 3d
The following will create the filtered states. Again, zoom in on the plot and comment on the results.
In [357]:
plt.figure()
plt.ylabel("Lag 11 Difference")
plt.xlabel("Year")
plt.plot(list(range(1710,2008)),ssdiff)
plt.plot(list(range(1710,2008)),res.smoother_results.filtered_state[0,:])
plt.show()
Problem 3e
Visit this page and write the code to find the smoothed states, and the forecasts for the future observation.