In [16]:
from __future__ import print_function, division, absolute_import
Chad M. Schafer, Carnegie Mellon University
April 2018
This notebook includes exercises associated with the lecture on state space methods, as part of the LSSTC Data Science Fellowship Program.
In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.signal import lfilter
%matplotlib notebook
In [18]:
ssdata = sm.datasets.sunspots.load()
Consider a plot of the time series below.
In [19]:
plt.figure()
plt.ylabel("Lag 11 Difference")
plt.xlabel("Year")
plt.plot(list(range(1700,2009)),ssdata.endog)
plt.show()
There is a well-established 11-year cycle to these counts, so in order to create an (approximately) stationary series, it makes sense to try taking the lag 11 differences. The code below will create a differenced version of the series.
In [20]:
def difference(dataset, interval=1):
    diff = list()
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
    return diff

ssdiff = difference(ssdata.endog, 11)
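As a quick sanity check (a small sketch, assuming ssdata.endog is a one-dimensional array), the same lag 11 difference can be computed with numpy slicing:
In [ ]:
# Lag 11 difference via array slicing; should agree with the helper above
arr = np.asarray(ssdata.endog)
ssdiff_np = arr[11:] - arr[:-11]
print(np.allclose(ssdiff_np, ssdiff))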
Now reconsider a plot of the differenced data.
In [21]:
plt.figure()
plt.ylabel("Lag 11 Difference")
plt.xlabel("Year")
plt.plot(list(range(1710,2008)), ssdiff)
plt.show()
This is more like the behavior we expect from a stationary series. Namely, the series shows mean reversion, which means that the series fluctuates, but tends to return to an overall mean level.
Now let's look at the ACF of the series.
In [22]:
sm.graphics.tsa.plot_acf(ssdiff,lags=20)
plt.show()
We can fit the AR(2) model using the following syntax.
In [23]:
ssARmodel = sm.tsa.AR(ssdiff)
ssARmodelresults = ssARmodel.fit(maxlag=2, method="mle")
The parameter estimates and the value of BIC for this model are easily obtained.
In [24]:
print(ssARmodelresults.params)
print(ssARmodelresults.bic)
It is tempting to look at a plot of the actual observations versus what the model predicts (forecasts) for each observation.
In [25]:
plt.figure()
plt.plot(ssARmodelresults.predict(),ssdiff,'o')
plt.show()
We see that there is general agreement, but it is better to construct the ACF of the residuals to determine if there is remaining correlation.
In [26]:
residuals = ssdiff - ssARmodelresults.predict()
Problem 1a
Construct the ACF of the residuals and comment on what you see. Does it appear that the model has accounted for all of the serial correlation in the series?
Solution
The code below creates the ACF of the residuals. From the plot, we can see that there is clear evidence of remaining correlation in the residuals. In other words, the fitted model has not adequately captured the serial correlation.
In [40]:
sm.graphics.tsa.plot_acf(residuals,lags=20)
plt.show()
Problem 1b
Let's consider a model of higher order. What choice of $p$ minimizes the value of BIC in this case? Be sure to work with the differenced series, not the original series. Consider values of $p$ up to 20.
Solution
We can simply loop over the values of $p$:
In [28]:
ssARmodel = sm.tsa.AR(ssdiff)
for p in range(1, 21):   # consider orders p = 1, ..., 20
    ssARmodelresults = ssARmodel.fit(maxlag=p, method="mle")
    print(p)
    print(ssARmodelresults.bic)
Or, use the built-in functionality of the method:
In [74]:
ssARmodelresults = ssARmodel.fit(maxlag=20, method="mle",ic="bic")
print(ssARmodelresults.params)
In either case, we see that the best choice (as determined by BIC) is $p=13$.
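For convenience, here is a small sketch (reusing ssARmodel from above) that records each BIC value and reports the order that minimizes it:
In [ ]:
# Collect the BIC for each candidate order and report the minimizer
bic_by_p = {}
for p in range(1, 21):
    fit_p = ssARmodel.fit(maxlag=p, method="mle")
    bic_by_p[p] = fit_p.bic

best_p = min(bic_by_p, key=bic_by_p.get)
print("Order minimizing BIC:", best_p)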
Study the code below, taken from http://www.statsmodels.org/dev/statespace.html. After simulating some data (stored in endog), it creates a class AR2 that fits an AR(2) model using a state space formulation. A few points to note:
- In the __init__ method, the number of states is specified via k_states, and k_posdef specifies the dimension of the covariance matrix ${\bf Q}$.
- The start_params method returns the initial values for the parameters used in the iterative search for the MLE.
- By setting initialization='stationary', the user indicates that the initial states are assumed to be drawn from the stationary distribution of the time series.
In [47]:
# True model parameters
nobs = int(1e3)
true_phi = np.r_[0.5, -0.2]
true_sigma = 1**0.5

# Simulate a time series
np.random.seed(1234)
disturbances = np.random.normal(0, true_sigma, size=(nobs,))
endog = lfilter([1], np.r_[1, -true_phi], disturbances)

# Construct the model
class AR2(sm.tsa.statespace.MLEModel):
    def __init__(self, endog):
        # Initialize the state space model
        super(AR2, self).__init__(endog, k_states=2, k_posdef=1,
                                  initialization='stationary')

        # Setup the fixed components of the state space representation
        self['design'] = [1, 0]
        self['transition'] = [[0, 0],
                              [1, 0]]
        self['selection', 0, 0] = 1

    # Describe how parameters enter the model
    def update(self, params, transformed=True, **kwargs):
        params = super(AR2, self).update(params, transformed, **kwargs)

        self['transition', 0, :] = params[:2]
        self['state_cov', 0, 0] = params[2]

    # Specify start parameters and parameter names
    @property
    def start_params(self):
        return [0, 0, 1]  # these are very simple

# Create and fit the model
mod = AR2(endog)
res = mod.fit()
print(res.summary())
Problem 2a
Explain the role of the update method in the class AR2 defined above.
Solution
The update method is the means by which the user specifies how the parameters enter the model matrices.
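For instance, after the model has been constructed, calling mod.update([0.5, -0.2, 1.0]) (a hypothetical parameter vector) places 0.5 and -0.2 in the first row of the transition matrix and sets the state disturbance variance to 1.0; the likelihood is then evaluated with those matrices during the iterative search.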
Problem 2b
Write out the model being fit here, both in state space form, and in a "natural" form. (It is not the same as the AR(2) model specified in the lecture notes.)
Solution
The model is just as the one written in the lecture notes, except that the "state intercept" term is omitted (and hence is equal to zero). So, the model being fit here is $Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \epsilon_t$. This is sometimes called a "mean zero AR(2) model."
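Concretely, one way to write out the state space form (reading off the matrices set in the AR2 class above, with state vector $\alpha_t = (Y_t, Y_{t-1})'$) is
$$\alpha_{t+1} = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix} \alpha_t + \begin{bmatrix} 1 \\ 0 \end{bmatrix} \eta_t, \qquad \eta_t \sim N(0, \sigma^2), \qquad Y_t = \begin{bmatrix} 1 & 0 \end{bmatrix} \alpha_t.$$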
Problem 2c
Update the Class above to make the model match to the one presented during lecture. (An additional parameter is needed.)
Solution
In [46]:
class AR2x(sm.tsa.statespace.MLEModel):
    def __init__(self, endog):
        # Initialize the state space model
        super(AR2x, self).__init__(endog, k_states=2, k_posdef=1,
                                   initialization='stationary')

        # Setup the fixed components of the state space representation
        self['design'] = [1, 0]
        self['transition'] = [[0, 0],
                              [1, 0]]
        self['selection'] = [[1], [0]]
        self['state_intercept'] = [[0], [0]]

    # Describe how parameters enter the model
    def update(self, params, transformed=True, **kwargs):
        params = super(AR2x, self).update(params, transformed, **kwargs)

        self['transition', 0, :] = params[:2]
        self['state_cov', 0, 0] = params[2]
        self['state_intercept', 0, 0] = params[3]

    # Specify start parameters and parameter names
    @property
    def start_params(self):
        return [0, 0, 1, 0]
Problem 2d
Refit the simulated data using this new model. Does the value of log likelihood increase or decrease? Does the value of BIC increase or decrease? Is this surprising?
Solution
The fit is shown below. We note that the log likelihood has increased from -1389.437 to -1389.273 by adding this parameter. This is not at all surprising: when fitting a more complex model, the likelihood can only increase. But we see that BIC has gone up as well, from 2799.598 to 2806.177. The implication is that the new parameter is not "important"; it is not adding to the predictive ability of the model.
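As a rough arithmetic check (assuming the standard definition $\mathrm{BIC} = -2\log L + k \log n$ with $n = 1000$ observations): the extra parameter adds $\log(1000) \approx 6.91$ to the penalty, while $-2\log L$ decreases by only $2 \times (1389.437 - 1389.273) \approx 0.33$, so BIC should rise by roughly $6.6$, consistent with $2806.177 - 2799.598 \approx 6.58$.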
In [45]:
mod = AR2x(endog)
res = mod.fit()
print(res.summary())
Problem 3a
Try to fit the state space AR(2) model to the (differenced) sunspots data. You may encounter some challenges, but see if you can correct this problem. What is the lesson here? Use the model with the parameter added in Problem 2c above.
Solution
When you fit the model using the code above, you will find that it crashes. This is often an issue when trying to find the MLE in an iterative fashion with models of such general forms. When parameters enter a model in a consistent way, software can be written to optimize more efficiently. Here, the software has to be able to handle a wide range of situations, and the price is that more issues of this sort can arise.
The class below uses starting values that give better results.
In [48]:
class AR2x(sm.tsa.statespace.MLEModel):
    def __init__(self, endog):
        # Initialize the state space model
        super(AR2x, self).__init__(endog, k_states=2, k_posdef=1,
                                   initialization='stationary')

        # Setup the fixed components of the state space representation
        self['design'] = [1, 0]
        self['transition'] = [[0, 0],
                              [1, 0]]
        self['selection'] = [[1], [0]]
        self['state_intercept'] = [[0], [0]]

    # Describe how parameters enter the model
    def update(self, params, transformed=True, **kwargs):
        params = super(AR2x, self).update(params, transformed, **kwargs)

        self['transition', 0, :] = params[:2]
        self['state_cov', 0, 0] = params[2]
        self['state_intercept', 0, 0] = params[3]

    # Specify start parameters and parameter names
    @property
    def start_params(self):
        return [1, -0.4, 1, 0.5]

mod = AR2x(ssdiff)
res = mod.fit()
print(res.summary())
Problem 3b
Suppose that you believe that there is measurement error in the sunspot count. On the differenced scale, you are comfortable modelling this error as Gaussian. Incorporate this into your state space model, and then refit the model.
Solution
The class below implements this change by adding the observation covariance matrix.
This code also illustrates another useful technique: the variance parameters are log-transformed so that the optimization is unconstrained. This can help with numerical issues that arise when parameter values are close to a boundary.
In [49]:
class AR2xx(sm.tsa.statespace.MLEModel):
    def __init__(self, endog):
        # Initialize the state space model
        super(AR2xx, self).__init__(endog, k_states=2, k_posdef=1,
                                    initialization='stationary')

        # Setup the fixed components of the state space representation
        self['design'] = [1, 0]
        self['transition'] = [[0, 0],
                              [1, 0]]
        self['selection'] = [[1], [0]]
        self['state_intercept'] = [[0], [0]]

    # Describe how parameters enter the model
    def update(self, params, transformed=True, **kwargs):
        params = super(AR2xx, self).update(params, transformed, **kwargs)

        self['transition', 0, :] = params[:2]
        self['state_cov', 0, 0] = np.exp(params[2])
        self['state_intercept', 0, 0] = params[3]
        self['obs_cov', 0, 0] = np.exp(params[4])

    # Specify start parameters and parameter names
    @property
    def start_params(self):
        return [1, -0.4, 3, 0.5, -1]

mod = AR2xx(ssdiff)
res = mod.fit()
print(res.summary())
Next we will consider the Kalman Filter, and the various estimates it can create.
First, we generate the predicted state for each time step. Note that there is a "prediction" for the initial state (at time zero), and then predictions for each of the states up to and including one step beyond the final time step.
In [34]:
plt.figure()
plt.ylabel("Lag 11 Difference")
plt.xlabel("Year")
plt.plot(list(range(1710,2008)),ssdiff)
plt.plot(list(range(1710,2009)),res.smoother_results.predicted_state[0,:])
plt.show()
Problem 3c
If you do not zoom in on the plot, it appears as if the predictions are quite accurate. Zoom in on the plot. What is actually happening?
Solution
The prediction for the next state is very close to the previous value of the series; in other words, the predicted series is essentially a lagged copy of the observed series. Hence, the predictions are not actually very accurate.
Problem 3d
The following will create the filtered states. Again, zoom in on the plot and comment on the results.
Solution
The filtered states agree with the observations quite well. This is not surprising, given that the filtering operation uses all of the data up to and including the time point being predicted.
In [357]:
plt.figure()
plt.ylabel("Lag 11 Difference")
plt.xlabel("Year")
plt.plot(list(range(1710,2008)),ssdiff)
plt.plot(list(range(1710,2008)),res.smoother_results.filtered_state[0,:])
plt.show()
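As a quick numerical check of these visual impressions (a sketch reusing the fitted results from above; predicted_state has one extra column, so its final column is dropped to align with the observations):
In [ ]:
obs = np.asarray(ssdiff)
pred = res.smoother_results.predicted_state[0, :-1]   # one-step-ahead state predictions
filt = res.smoother_results.filtered_state[0, :]      # filtered states

print("RMSE of predicted states:", np.sqrt(np.mean((obs - pred) ** 2)))
print("RMSE of filtered states: ", np.sqrt(np.mean((obs - filt) ** 2)))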
Problem 3e
Visit this page and write the code to find the smoothed states, and the forecasts for future observations.
Solution
To find the smoothed states:
In [35]:
plt.figure()
plt.ylabel("Lag 11 Difference")
plt.xlabel("Year")
plt.plot(list(range(1710,2008)),ssdiff)
plt.plot(list(range(1710,2008)),res.smoother_results.smoothed_state[0,:])
plt.show()
And to find the forecasts for future observations:
In [38]:
plt.figure()
plt.ylabel("Lag 11 Difference")
plt.xlabel("Year")
plt.plot(list(range(1710,2008)),ssdiff)
plt.plot(list(range(1710,2008)),res.smoother_results.forecasts[0,:])
plt.show()
Solution
The class is shown below. Note how the design matrix has a third dimension to allow for dependence on the time index. Also, in this case, I used initialization='approximate_diffuse'. This assumes that the initial state is drawn from a Gaussian distribution with large entries on the covariance matrix, to indicate a great deal of uncertainty about its values.
In [132]:
class varcoef(sm.tsa.statespace.MLEModel):
    def __init__(self, x, y):
        # Initialize the state space model
        super(varcoef, self).__init__(y, k_states=2, k_posdef=2,
                                      initialization='approximate_diffuse')

        # Setup the fixed components of the state space representation
        self['design'] = np.ones((1, 2, len(x)))
        self['design', 0, 1, :] = x
        self['transition'] = [[1, 0],
                              [0, 1]]
        self['state_cov'] = [[0, 0],
                             [0, 0]]
        self['selection'] = [[1, 0],
                             [0, 1]]

    # Describe how parameters enter the model
    def update(self, params, transformed=True, **kwargs):
        params = super(varcoef, self).update(params, transformed, **kwargs)

        self['state_cov', 0, 0] = np.exp(params[0])
        self['state_cov', 1, 1] = np.exp(params[1])
        self['obs_cov', 0, 0] = np.exp(params[2])

    # Specify start parameters and parameter names
    @property
    def start_params(self):
        return [0, 0, 0]
This is an example of the use of the class.
In [133]:
np.random.seed(0)
x = np.random.normal(0, 1, size=len(endog))
y = 10 + endog*x + np.random.normal(0, 0.2, size=len(endog))
mod = varcoef(x,y)
res = mod.fit()
print(res.summary())
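To see how well the time-varying coefficients are recovered (a sketch using the smoothed_state attribute as earlier in this notebook; given the design matrix above, state 0 is the intercept and state 1 is the slope):
In [ ]:
intercept_hat = res.smoother_results.smoothed_state[0, :]
slope_hat = res.smoother_results.smoothed_state[1, :]

plt.figure()
plt.plot(intercept_hat, label="smoothed intercept (true value 10)")
plt.plot(slope_hat, label="smoothed slope")
plt.plot(endog, label="true slope (endog)", alpha=0.5)
plt.xlabel("Time index")
plt.legend()
plt.show()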