As a preliminary, we define an important concept, that of a stationary series. For an ACF to make sense, the series must be a weakly stationary series. This means that the autocorrelation for any particular lag is the same regardless of where we are in time.
Definition: A series $y_t$ is said to be (weakly) stationary if it satisfies the following properties:
Consider the AR(1) model:
$$\begin{equation} y_t = \beta_1 y_{t-1} + \beta_0 + \epsilon \end{equation}$$The expection of $y_t$ is: $E(y_t) = \frac {\beta_0} {1 - \beta_1}$
The variance of $y_t$ is: $Var(y_t) = \frac {Var(\epsilon)} {1 - \beta_1^2}$
The correlation between observations h time periods apart is: $\rho_h = \beta_1^h$
The ACF property defines a distinct pattern for the autocorrelations. For a positive value of $\beta_1$, the ACF exponentially decreases to 0 as the lag h increases. For negative $\beta_1$, the ACF also exponentially decays to 0 as the lag increases, but the algebraic signs for the autocorrelations alternate between positive and negative.
We usually assume that the error terms are independent unless there is a specific reason to think that this is not the case. Usually violation of this assumption occurs because there is a known temporal component for how the observations were drawn. The easiest way to assess if there is dependency is by producing a scatterplot of the residuals versus the time measurement for that observation (assuming you have the data arranged according to a time sequence order). If the data are independent, then the residuals should look randomly scattered about 0. However, if a noticeable pattern emerges (particularly one that is cyclical) then dependency is likely an issue.
Recall that if we have a first-order autocorrelation with the errors, then the errors are modeled as:
$$\begin{equation} \epsilon_t = \rho \epsilon_{t-1} + w_t \end{equation}$$Where the $w_t$ are identically, independently distributed, each with a normal distribution having mean 0 and the same variance, and $\lvert \rho \rvert < 1$
the Durbin-Watson test is a formal test for $\rho$.
Null Hypothesis $H_0: \rho = 0$
Alt. Hypothesis $H_1: \rho \neq 0$
The test statistic for the Durbin-Watson test is obtained by:
$$\begin{equation} D=\frac {\sum_{t=2}^{n} (e_{t}-e_{t-1})^{2}} {\sum_{t=1}^{n}e_{t}^{2}} \end{equation}$$where $e_t = y_t − \hat{y_t}$ are the residuals from the ordinary least squares fit.
The Ljung-Box Q test (sometimes called the Portmanteau test) is used to test whether or not observations over time are random and independent. In particular, for a given $k$, it tests the following:
Null Hypothesis $H_0$: the autocorrelations up to lag $k$ are all 0.
Alt. Hypothesis $H_1$: the autocorrelations of one or more lags differ from 0.
The test statistic is calculated as:
$$\begin{equation} Q_{k} = n(n+2) \sum_{j=1}^{k} \frac {{r}^{2}_{j}} {n-j} \end{equation}$$
In [1]:
import numpy as np
import pandas as pd
%pylab inline
pylab.style.use('ggplot')
In [5]:
blaisdell = pd.read_csv('blaisdell.csv').set_index('time')
In [6]:
blaisdell
Out[6]:
In [7]:
blaisdell.plot(kind='line')
Out[7]:
In [9]:
import seaborn as sns
sns.lmplot('comsales', 'indsales', data=blaisdell)
Out[9]:
In [10]:
sns.residplot('comsales', 'indsales', data=blaisdell)
Out[10]:
In [12]:
import statsmodels.formula.api as sm
model = sm.ols(formula='indsales ~ comsales', data=blaisdell)
result = model.fit()
result.summary()
Out[12]:
In [14]:
# Calculate Durbin-Watson test statistic by hand
residuals = result.resid.copy()
In [20]:
squared_errors = residuals ** 2
error_diff_squared = residuals.diff(1) ** 2
durbin_watson = error_diff_squared.sum() / squared_errors.sum()
print('The Durbin-Watson statistic is: {0:.3f}'.format(durbin_watson))
In [108]:
# Calculate Ljung-Box Q test statistic by hand
def lbq(resids, lags=1):
"""
Calculate and return LBQ test statistic for lag=1, 2, 3, ..., lags.
"""
assert lags >= 1
n = len(resids)
autocorrs = np.asarray([resids.autocorr(j)**2 / (n-j) for j in range(1, lags+1)])
lbq_val = n * (n + 2) * autocorrs.cumsum()
return lbq_val
In [113]:
lbq(residuals, 5)
Out[113]:
In [111]:
from statsmodels.stats.diagnostic import acorr_ljungbox
In [115]:
acorr_ljungbox(residuals, lags=5)[0]
Out[115]: