Stationary Series

As a preliminary, we define an important concept: that of a stationary series. For an ACF (autocorrelation function) to make sense, the series must be weakly stationary. This means that the autocorrelation for any particular lag is the same regardless of where we are in time.

Definition: A series $y_t$ is said to be (weakly) stationary if it satisfies the following properties:

  • The mean $E(y_t)$ is the same for all $t$.
  • The variance of $y_t$ is the same for all $t$.
  • The covariance (and also the correlation) between $y_t$ and $y_{t-h}$ is the same for all $t$; it depends only on the lag $h$.
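To make the definition concrete, compare a stationary series with a nonstationary one. The sketch below (illustrative, not from the original text) simulates white noise, which satisfies all three properties, and a random walk, whose variance grows over time:

import numpy as np

rng = np.random.default_rng(42)
n = 500
noise = rng.normal(size=n)   # white noise: stationary
walk = noise.cumsum()        # random walk: variance grows with t

# The spread of a stationary series is stable across time windows
for name, series in [('white noise', noise), ('random walk', walk)]:
    print('{}: var first half {:.2f}, var second half {:.2f}'.format(
        name, series[:n//2].var(), series[n//2:].var()))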

Properties of the AR(1) Model

Consider the AR(1) model:

$$\begin{equation} y_t = \beta_1 y_{t-1} + \beta_0 + \epsilon_t \end{equation}$$

where $\lvert \beta_1 \rvert < 1$, so that the series is stationary.

  • The expectation of $y_t$ is: $E(y_t) = \frac {\beta_0} {1 - \beta_1}$

  • The variance of $y_t$ is: $Var(y_t) = \frac {Var(\epsilon)} {1 - \beta_1^2}$

  • The correlation between observations h time periods apart is: $\rho_h = \beta_1^h$

The ACF property implies a distinctive pattern for the autocorrelations. For a positive value of $\beta_1$, the ACF decays exponentially to 0 as the lag $h$ increases. For negative $\beta_1$, the ACF also decays exponentially to 0 as the lag increases, but the algebraic signs of the autocorrelations alternate between positive and negative.
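To see this pattern empirically, we can simulate an AR(1) series and compare its sample ACF against $\beta_1^h$. Below is a minimal sketch; the parameter values are illustrative, and the `acf` helper from statsmodels supplies the sample autocorrelations.

import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
beta1, n = 0.7, 2000

# Simulate an AR(1) series with beta0 = 0
y = np.zeros(n)
for t in range(1, n):
    y[t] = beta1 * y[t-1] + rng.normal()

# The sample ACF should track the theoretical value beta1**h
sample_acf = acf(y, nlags=5)
for h in range(1, 6):
    print('lag {}: sample {:.3f}, theoretical {:.3f}'.format(h, sample_acf[h], beta1**h))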

Regression with Autocorrelated Errors: The Durbin-Watson Test

We usually assume that the error terms are independent unless there is a specific reason to think otherwise. Violations of this assumption typically occur because there is a known temporal component to how the observations were collected. The easiest way to assess whether there is dependency is to plot the residuals against the time measurement for each observation (assuming the data are arranged in time order). If the errors are independent, the residuals should look randomly scattered about 0. However, if a noticeable pattern emerges (particularly a cyclical one), then dependency is likely an issue.
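As a concrete sketch, assuming a fitted statsmodels results object named `result` (as produced later in this notebook) and data stored in time order, such a plot can be drawn as follows:

import matplotlib.pyplot as plt

resid = result.resid                      # residuals from the fitted OLS model
plt.plot(resid.index, resid, 'o-')        # residuals in time order
plt.axhline(0, color='grey', linestyle='--')
plt.xlabel('time')
plt.ylabel('residual')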

Recall that if we have a first-order autocorrelation with the errors, then the errors are modeled as:

$$\begin{equation} \epsilon_t = \rho \epsilon_{t-1} + w_t \end{equation}$$

where the $w_t$ are independent and identically distributed, each with a normal distribution having mean 0 and the same variance, and $\lvert \rho \rvert < 1$.

The Durbin-Watson test is a formal test for $\rho$.

Null Hypothesis $H_0: \rho = 0$

Alt. Hypothesis $H_1: \rho \neq 0$

The test statistic for the Durbin-Watson test is obtained by:

$$\begin{equation} D=\frac {\sum_{t=2}^{n} (e_{t}-e_{t-1})^{2}} {\sum_{t=1}^{n}e_{t}^{2}} \end{equation}$$

where $e_t = y_t - \hat{y}_t$ are the residuals from the ordinary least squares fit.
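To see why $D$ is judged against a benchmark of 2, expand the numerator; for large $n$,

$$\begin{equation} D = \frac {\sum_{t=2}^{n} e_t^2 + \sum_{t=2}^{n} e_{t-1}^2 - 2 \sum_{t=2}^{n} e_t e_{t-1}} {\sum_{t=1}^{n} e_t^2} \approx 2 (1 - r_1) \end{equation}$$

where $r_1$ is the lag-1 sample autocorrelation of the residuals. Hence $D \approx 2$ indicates no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.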

Ljung-Box Q Test

The Ljung-Box Q test (sometimes called the Portmanteau test) is used to test whether or not observations over time are random and independent. In particular, for a given $k$, it tests the following:

Null Hypothesis $H_0$: the autocorrelations up to lag $k$ are all 0.

Alt. Hypothesis $H_1$: the autocorrelations of one or more lags differ from 0.

The test statistic is calculated as:

$$\begin{equation} Q_{k} = n(n+2) \sum_{j=1}^{k} \frac {{r}^{2}_{j}} {n-j} \end{equation}$$

where $r_j$ is the sample autocorrelation at lag $j$. Under $H_0$, $Q_k$ approximately follows a chi-squared distribution with $k$ degrees of freedom, so large values of $Q_k$ are evidence against independence.

Illustration: Blaisdell Company Data


In [1]:
# Core imports for the analysis and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [5]:
# Load the Blaisdell Company data, indexed by time period
blaisdell = pd.read_csv('blaisdell.csv').set_index('time')

In [6]:
blaisdell


Out[6]:
      comsales  indsales
time
1        20.96     127.3
2        21.40     130.0
3        21.96     132.7
4        21.52     129.4
5        22.39     135.0
6        22.76     137.1
7        23.48     141.2
8        23.66     142.8
9        24.10     145.5
10       24.01     145.3
11       24.54     148.3
12       24.30     146.4
13       25.00     150.2
14       25.64     153.1
15       26.36     157.3
16       26.98     160.7
17       27.52     164.2
18       27.78     165.6
19       28.24     168.7
20       28.78     171.7

In [7]:
blaisdell.plot(kind='line')


Out[7]:
[Line plot of comsales and indsales over time]

In [9]:
import seaborn as sns

# Scatterplot of indsales against comsales with a fitted regression line
sns.lmplot(x='comsales', y='indsales', data=blaisdell)


Out[9]:
[Scatterplot of indsales versus comsales with fitted regression line]

In [10]:
# Residuals from regressing indsales on comsales, plotted against comsales
sns.residplot(x='comsales', y='indsales', data=blaisdell)


Out[10]:
[Residual plot: residuals versus comsales]

In [12]:
import statsmodels.formula.api as sm

# Fit industry sales on company sales by ordinary least squares
model = sm.ols(formula='indsales ~ comsales', data=blaisdell)
result = model.fit()
result.summary()


Out[12]:
                            OLS Regression Results
==============================================================================
Dep. Variable:               indsales   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 1.489e+04
Date:                Tue, 02 May 2017   Prob (F-statistic):           1.01e-27
Time:                        22:04:25   Log-Likelihood:                -12.971
No. Observations:                  20   AIC:                             29.94
Df Residuals:                      18   BIC:                             31.93
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.4207      1.146      7.347      0.000       6.013      10.828
comsales       5.6659      0.046    122.017      0.000       5.568       5.763
==============================================================================
Omnibus:                        0.037   Durbin-Watson:                   0.739
Prob(Omnibus):                  0.981   Jarque-Bera (JB):                0.258
Skew:                           0.012   Prob(JB):                        0.879
Kurtosis:                       2.444   Cond. No.                         260.
==============================================================================

In [14]:
# Calculate Durbin-Watson test statistic by hand
residuals = result.resid.copy()

In [20]:
squared_errors = residuals ** 2
error_diff_squared = residuals.diff(1) ** 2  # first entry is NaN; .sum() skips it
durbin_watson = error_diff_squared.sum() / squared_errors.sum()

print('The Durbin-Watson statistic is: {0:.3f}'.format(durbin_watson))


The Durbin-Watson statistic is: 0.739
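A value of 0.739, well below the benchmark of 2, points to strong positive first-order autocorrelation in the residuals (and matches the Durbin-Watson entry in the regression summary above). As a cross-check, the statistic can also be obtained from statsmodels' built-in helper; a minimal sketch, aliasing the import to avoid clobbering the scalar computed above:

from statsmodels.stats.stattools import durbin_watson as dw_stat

# Should agree with the hand computation above
print('statsmodels Durbin-Watson: {0:.3f}'.format(dw_stat(residuals)))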

In [108]:
# Calculate Ljung-Box Q test statistic by hand

def lbq(resids, lags=1):
    """
    Calculate and return the LBQ test statistic for lag = 1, 2, 3, ..., lags.

    Note: pandas' Series.autocorr computes a lag-specific Pearson
    correlation, which differs slightly from the conventional ACF
    estimator used by statsmodels.
    """
    assert lags >= 1
    n = len(resids)

    # Squared autocorrelations weighted by 1 / (n - j), per the formula above
    autocorrs = np.asarray([resids.autocorr(j)**2 / (n - j) for j in range(1, lags + 1)])
    # Cumulative sums give Q_1, Q_2, ..., Q_lags in one pass
    lbq_val = n * (n + 2) * autocorrs.cumsum()

    return lbq_val

In [113]:
lbq(residuals, 5)


Out[113]:
array([  9.13231308,  10.86379194,  11.34342905,  20.02430506,  39.01687874])

In [111]:
from statsmodels.stats.diagnostic import acorr_ljungbox

In [115]:
acorr_ljungbox(residuals, lags=5)[0]  # element 0 holds the Q statistics


Out[115]:
array([  9.00338197,  10.66343636,  11.09952693,  17.34734929,  29.10285083])
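The hand-rolled values differ slightly from statsmodels' because pandas' `Series.autocorr` computes a lag-specific Pearson correlation (each shifted copy is re-centered and re-scaled), whereas statsmodels' ACF estimator uses the full-sample mean and variance. Either way, each $Q_k$ here sits far in the tail of its chi-squared reference distribution, so the residuals are clearly autocorrelated. As a final sketch, the statistics can be converted to p-values with scipy (scipy is not used in the original notebook):

import numpy as np
from scipy.stats import chi2

q_stats = acorr_ljungbox(residuals, lags=5)[0]
p_values = chi2.sf(q_stats, df=np.arange(1, 6))  # df equals the lag index k
print(p_values)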