A univariate time series is a sequence of measurements of the same variable collected over time. Most often, the measurements are made at regular time intervals.
One difference from standard linear regression is that the observations are not necessarily independent and not necessarily identically distributed. A defining characteristic of a time series is that the ordering of the observations matters: changing the order can change the meaning of the data.
ARIMA (Autoregressive Integrated Moving Average) models relate the present value of a series to past values and past prediction errors.
Let $y_t$ be the observed value at time $t$. Then the simplest ARIMA model is:
$$\begin{equation} y_t = \beta_1 y_{t-1} + \beta_0 + \epsilon_t \end{equation}$$Here $\beta_0$ (the intercept) and $\beta_1$ (the slope) are the OLS regression coefficients, and $\epsilon_t$ is the error term at time $t$.
This model is also called AR(1), since the value at time $t$ depends only on the value at time $t-1$.
In an AR(2) model, $y_t$ depends on both $y_{t-1}$ and $y_{t-2}$.
$$\begin{equation} y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + \beta_0 + \epsilon_t \end{equation}$$This can be extended to $AR(k)$ for any $k$, with the caveat that the value of $k$ must make sense in context. $k$ is called the order of the model.
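Before fitting real data, it can help to see what such a process looks like. The cell below is a minimal sketch (an illustration added here, not part of the original analysis) that simulates an AR(1) series with numpy; it assumes $|\beta_1| < 1$, which keeps the simulated process stationary.
In [ ]:
# Sketch (not from the original notebook): simulate
# y_t = beta_0 + beta_1 * y_{t-1} + eps_t. With |beta_1| < 1 the process
# is stationary and fluctuates around its mean beta_0 / (1 - beta_1).
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, n = 1.0, 0.7, 500
y = np.empty(n)
y[0] = beta0 / (1 - beta1)  # start at the process mean
for t in range(1, n):
    y[t] = beta0 + beta1 * y[t - 1] + rng.normal()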
Assumptions in AR Models
The autocorrelation function (ACF) of a time series is the correlation between values of the series at different times, as a function of the two times or of the time lag.
$$\begin{equation} ACF(k) = Corr(y_t, y_{t-k}) \end{equation}$$The ACF measures the linear relationship between an observation at time $t$ and the observations at previous times. If we assume an $AR(k)$ model, then we may wish to measure only the association between $y_t$ and $y_{t-k}$, filtering out the linear influence of the random variables that lie in between (i.e., $y_{t-1}, y_{t-2}, \dots, y_{t-k+1}$), which requires a transformation of the time series. Calculating the correlation of the transformed series then gives the partial autocorrelation function (PACF); we verify this regression interpretation in a sketch after the PACF plot below.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
In [2]:
g_prices = pd.read_csv('google.csv', parse_dates=True, index_col=0).squeeze('columns')
In [3]:
g_prices.head()
Out[3]:
In [4]:
# Plot close prices and lagged prices
ax = g_prices.plot(title='Google Close Prices')
lagged_prices = g_prices.shift(1)
lagged_prices.plot(ax=ax)
plt.legend(['Actual', '1-day lagged'], loc='upper left')
Out[4]:
In [5]:
lagged_prices.head()
Out[5]:
In [6]:
# Calculate the correlation between close prices and 1-day lagged values.
# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry
# is the lag-1 autocorrelation.
np.corrcoef(g_prices[1:], lagged_prices[1:])
Out[6]:
In [7]:
# Easier way
g_prices.autocorr(1)
Out[7]:
In [8]:
# ACF plot for lags 1 through 9
autocorrs = [g_prices.autocorr(i) for i in range(1, 10)]
ax = pd.Series(autocorrs, index=range(1, 10)).plot(kind='bar', title='ACF plot for 1-9 day lags')
ax.set(xlabel='Lag days', ylabel='Autocorrelation')
Out[8]:
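statsmodels also ships a ready-made ACF plot that draws confidence bands around the estimates. The cell below is an optional alternative added here (it assumes the cells above have been run):
In [ ]:
# Optional alternative (not in the original notebook): statsmodels' built-in
# ACF plot, which also shows confidence bands around each estimate.
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(g_prices, lags=9)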
In [9]:
from statsmodels.tsa.stattools import pacf
google_pacf = pacf(g_prices)[1:10]  # lags 1 through 9
ax = pd.Series(index=range(1, 10), data=google_pacf).plot(kind='bar', title='PACF plot for 1-9 day lags')
ax.set(xlabel='Lag Days', ylabel='Partial Autocorrelation')
Out[9]:
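As a sanity check on the PACF interpretation given earlier, the lag-2 partial autocorrelation should be close to the coefficient on $y_{t-2}$ when $y_t$ is regressed on both $y_{t-1}$ and $y_{t-2}$. The cell below is a sketch of that comparison, added here; passing `method='ols'` makes `pacf` use the regression definition directly, so small differences from the default Yule-Walker estimates are expected.
In [ ]:
# Sketch (not in the original notebook): the lag-2 PACF equals the lag-2
# coefficient in the regression y_t ~ y_{t-1} + y_{t-2}.
import statsmodels.formula.api as smf
lags_df = pd.DataFrame({'y': g_prices,
                        'lag1': g_prices.shift(1),
                        'lag2': g_prices.shift(2)}).dropna()
ols_lag2 = smf.ols('y ~ lag1 + lag2', data=lags_df).fit().params['lag2']
print(ols_lag2, pacf(g_prices, nlags=2, method='ols')[2])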
In [10]:
data_df = pd.concat({'actual': g_prices[1:], 'lagged': lagged_prices[1:]}, axis=1)
In [11]:
# Plot the regression fit. At first glance, it looks like a good fit.
# Note that, in line with the AR(1) model, the lagged values go on x and the actual values go on y.
import seaborn as sns
sns.lmplot(x='lagged', y='actual', data=data_df)
Out[11]:
In [12]:
# What do the residuals look like?
sns.residplot(x='lagged', y='actual', data=data_df)
Out[12]:
In [13]:
import statsmodels.formula.api as smf
model = smf.ols('actual ~ lagged', data=data_df)
result = model.fit()
result.summary()
Out[13]:
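One way to connect the regression back to the ACF (a check added here, not in the original notebook): the fitted slope is $Cov(y_t, y_{t-1}) / Var(y_{t-1})$, which should be close to the lag-1 autocorrelation because the variances of the series and its lag are nearly equal.
In [ ]:
# The slope on the lagged price should be close to the lag-1 autocorrelation.
print(result.params['lagged'], g_prices.autocorr(1))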
In [14]:
result.resid.plot(kind='hist')
Out[14]:
Standard k-fold cross validation is not appropriate for time series, because it would use future observations to predict past ones. Instead we use a rolling-origin scheme: fit the model on the first $i$ observations, test it on observation $i+1$, and grow the training window one step at a time. For more details, see http://robjhyndman.com/hyndsight/crossvalidation/.
In [15]:
def time_series_cv(data, x, y, min_periods):
    """
    Return the squared errors from cross validation of the regression model 'y ~ x' over `data`.
    """
    assert min_periods > 0
    assert len(data) > min_periods
    formula = ' ~ '.join([y, x])
    # Series to hold the results. It only covers the min_periods:end subset of the dates in data.
    results = pd.Series(index=data.index[min_periods:], data=np.nan, name='squared_error')
    for i in range(min_periods, len(data)):
        # Build the model using the first i observations (0, 1, ..., i-1)
        fit_df = data.iloc[:i]
        fit_result = smf.ols(formula=formula, data=fit_df).fit()
        # Predict using the i-th observation
        test_df = data.iloc[i:i+1]
        predicted = fit_result.predict(test_df)
        # Calculate and save the squared error
        squared_error = (predicted.values[0] - test_df[y].values[0]) ** 2.0
        results[test_df.index[0]] = squared_error
    return results
In [16]:
cv_results = time_series_cv(data_df, x='lagged', y='actual', min_periods=30)
In [17]:
cv_results.plot(title='Cross-validation results with a 30-day minimum period')
Out[17]:
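To reduce the series of squared errors to a single score in the original price units, one possible follow-up (an addition, not part of the original run) is the root mean squared error:
In [ ]:
# RMSE of the one-step-ahead forecasts, in the units of the price series.
print(np.sqrt(cv_results.mean()))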