We often refer to our input features in machine learning as "dimensions". There's one dimension that pervades almost everything we do and observe as humans: the fourth dimension we experience every waking moment, time. But time is quite unlike other data we capture, and it often requires unique machine learning approaches. These models and approaches are well established in R and a few other languages, but have more recently migrated to Python.
In regression and classification, we use features (collected during a cross-sectional study, survey, or measurement) to predict an outcome. The model and its parameters represent part of the underlying relationship between features and outcome. But what if we run out of funds to keep cross-sectioning (it happens), or need to predict future outcomes for which the features aren't measurable or don't yet exist?
Time series data usually contains more than meets the eye, and can often be decomposed into trend, seasonal, and random fluctuation components.
Time series models fall into two camps: additive and multiplicative.
We should use multiplicative models when the percentage change in our data matters more than the absolute change (e.g. stocks, commodities); as the trend rises and our values grow, the amplitude of the seasonal and random fluctuations grows with them. If our seasonality and fluctuations stay stable regardless of the level of the trend, an additive model is likely the better fit.
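As a reference point (our own notation, not a formula from the lesson itself), the two decompositions can be written in terms of trend $T_t$, seasonal $S_t$, and residual $\epsilon_t$ components:

$$ \text{Additive:} \quad y_t = T_t + S_t + \epsilon_t $$

$$ \text{Multiplicative:} \quad y_t = T_t \times S_t \times \epsilon_t $$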
Time series model selection is driven by the Trend and Seasonal components of our raw data. The general approach for analysis looks like this:
Algorithm | Trend | Seasonal | Correlations |
---|---|---|---|
ARIMA | X | X | X |
SMA Smoothing | X | | |
Simple Exponential Smoothing | X | | |
Seasonal Adjustment | X | X | |
Holt's Exponential Smoothing | X | | |
Holt-Winters | X | X | |
A series is stationary when:
- The mean of the series is not a function of time.
- The variance of the series is not a function of time (homoscedasticity).
- The covariance of the series at different lags is not a function of time.
From A Complete Tutorial on Time Series Modeling in R
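In conventional notation (our own summary of the conditions above, not quoted from that tutorial):

$$ E[x_t] = \mu \quad \text{for all } t $$

$$ \mathrm{Var}(x_t) = \sigma^2 \quad \text{for all } t $$

$$ \mathrm{Cov}(x_t, x_{t+k}) = \gamma_k, \text{ a function of the lag } k \text{ only, not of } t $$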
The Dickey-Fuller test gives us a formal check: if the p-value falls below our chosen alpha value, we can reject the null hypothesis (that the series has a unit root) and say that the series is stationary.
In [7]:
#!pip install pyflux
import numpy as np
import pandas as pd
import pyflux as pf
import warnings
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [8]:
# create a play dataframe from 1-10 (linear and squared) to test how differencing works
play = pd.DataFrame([[x for x in range(1,11)], [x**2 for x in range(1,11)]]).T
play.columns = ['original', 'squared']
play
Out[8]:
In [9]:
# stationarize the linear series (mean and variance don't change across sub-windows)
play.original.diff()
Out[9]:
In [11]:
# stationarize squared series
play.squared.diff().diff()
Out[11]:
In [12]:
# stationarize squared with log
np.log(play.squared)
Out[12]:
We'll be looking at monthly average temperatures from 1907 to 1972.
In [15]:
# load data, recast columns if needed, convert to datetime
monthly_temp = pd.read_csv('./mean-monthly-temperature-1907-19.csv', skipfooter=2, engine='python',
                           header=0, index_col=0, names=['month', 'temp'])
monthly_temp.index = pd.to_datetime(monthly_temp.index)
In [17]:
# describe
monthly_temp.describe()
Out[17]:
In [24]:
# resample to annual and plot each
annual_temp = monthly_temp.resample('A').mean()
monthly_temp.plot();
annual_temp.plot();
In [25]:
# plot both on same figure
plt.plot(monthly_temp)
plt.plot(annual_temp);
In [26]:
# plot with plotly (optional): requires a plotly account and API key; in newer plotly versions this import lives in chart_studio.plotly
import plotly.plotly as py
import plotly.graph_objs as go
data = [go.Scatter(x=annual_temp.index, y=annual_temp.temp)]
py.iplot(data)
Out[26]:
In [27]:
# plot binned yearly segments using resample method
monthly_temp.resample('A').mean().temp.plot();
In [28]:
# violinplot months to determine variance and range
sns.violinplot(x=monthly_temp.index.month, y=monthly_temp.temp);
Are these datasets stationary? We can check a few things from the list above: a visual inspection (there seems to be a small upward trend in the annual series; it's harder to tell for the monthly), the standard deviation of various differences (the smallest is usually the most stationary), and the formal Dickey-Fuller test.
In [31]:
# check monthly deviations for various diffs
print(monthly_temp.temp.std())
print(monthly_temp.temp.diff().std())
print(monthly_temp.temp.diff().diff().std())
print(monthly_temp.temp.diff().diff().diff().std())
In [32]:
# check annual deviations for various diffs
print(annual_temp.temp.std())
print(annual_temp.temp.diff().std())
print(annual_temp.temp.diff().diff().std())
print(annual_temp.temp.diff().diff().diff().std())
In [33]:
# define Dickey-Fuller Test (DFT) function
import statsmodels.tsa.stattools as ts
def dftest(timeseries):
    # run the augmented Dickey-Fuller test and label its output
    dftest = ts.adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', 'Lags Used', 'Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

    # determine rolling statistics
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()

    # plot rolling statistics
    plt.plot(timeseries, color='blue', label='Original')
    plt.plot(rolmean, color='red', label='Rolling Mean')
    plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean and Standard Deviation')
    plt.show(block=False)
In [34]:
# run DFT on monthly
dftest(monthly_temp.temp)
In [35]:
# run DFT on annual
dftest(annual_temp.temp)
Enter Autoregressive Integrated Moving Average (ARIMA) modeling. When outcomes are correlated with their own earlier values, we will see a theme, or relationship, in the outcome plot. This relationship can be modeled in its own right, allowing us to predict the future with a confidence level commensurate with the strength of the relationship and the proximity to known values (predictions weaken the further out we go).
Autocorrelation: a variable's correlation with itself at different lags.
For second-order stationary data (constant mean and variance: $\mu_t = \mu$ and $\sigma_t^2 = \sigma^2$ for all $t$), the autocovariance can be expressed as a function of the time lag $k$ alone:

$$ \gamma_k = E[(x_t-\mu)(x_{t+k} - \mu)] $$

The autocorrelation function is therefore defined as:

$$ \rho_k = \frac{\gamma_k}{\sigma^2} $$

We use the plot of these values at different lags to determine the optimal ARIMA parameters.
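As a quick check of these definitions (a minimal sketch; it computes the same values the ACF plot below will display, using statsmodels' `acf` helper on the monthly temperatures):

In [ ]:
from statsmodels.tsa.stattools import acf

# rho_k for lags 0 through 24; rho_0 is always 1 (a series is perfectly correlated with itself at lag 0)
rho = acf(monthly_temp.temp, nlags=24)
print(rho[:5])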
For the autoregressive (AR) component, some things to note:
- epsilon: the error term

For the moving average (MA) component, some things to note:
- q: the number of lags included
- epsilon: the error term for that period
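For reference, a conventional way to write these two components (standard textbook notation, which may differ cosmetically from the original lesson):

$$ AR(p): \quad x_t = c + \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t $$

$$ MA(q): \quad x_t = \mu + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i} $$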
In [36]:
# define helper plot function for visualization
import statsmodels.tsa.api as smt
def plots(data, lags=None):
    # lay out the raw series, ACF, and PACF side by side
    layout = (1, 3)
    raw = plt.subplot2grid(layout, (0, 0))
    acf = plt.subplot2grid(layout, (0, 1))
    pacf = plt.subplot2grid(layout, (0, 2))
    data.plot(ax=raw)
    smt.graphics.plot_acf(data, lags=lags, ax=acf)
    smt.graphics.plot_pacf(data, lags=lags, ax=pacf)
    sns.despine()
    plt.tight_layout()
In [37]:
# helper plot for monthly temps
plots(monthly_temp.temp, lags=75);
ACF Shape | Indicated Model |
---|---|
Exponential, decaying to zero | Autoregressive model. Use the partial autocorrelation plot to identify the order of the autoregressive model. |
Alternating positive and negative, decaying to zero | Autoregressive model. Use the partial autocorrelation plot to help identify the order. |
One or more spikes, rest are essentially zero | Moving average model, order identified by where plot becomes zero. |
Decay, starting after a few lags | Mixed autoregressive and moving average (ARMA) model. |
All zero or close to zero | Data are essentially random. |
High values at fixed intervals | Include seasonal autoregressive term. |
No decay to zero | Series is not stationary. |
In [54]:
# we might need to install dev version for statespace functionality
#!pip install git+https://github.com/statsmodels/statsmodels.git
import statsmodels.api as sm
# fit SARIMA monthly based on helper plots
sar = sm.tsa.statespace.SARIMAX(monthly_temp.temp, order=(3,0,0), seasonal_order=(0,1,1,12), trend='c').fit()
sar.summary()
Out[54]:
In [51]:
# plot resids
plots(sar.resid, lags=40);
In [56]:
# plot residual diagnostics
sar.plot_diagnostics();
In [57]:
# plot predictions
monthly_temp['forecast'] = sar.predict(start = 750, end= 790, dynamic=False)
monthly_temp[730:][['temp', 'forecast']].plot();
Serial correlation (Ljung-Box)
In [ ]:
# create and run statistical tests on model
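One way to fill this in (a sketch using statsmodels' Ljung-Box test on the SARIMA residuals; the number of lags to test is a judgment call):

In [ ]:
from statsmodels.stats.diagnostic import acorr_ljungbox

# Ljung-Box test: the null hypothesis is that the residuals show no remaining serial correlation;
# small p-values suggest the model has missed some structure
lb_results = acorr_ljungbox(sar.resid, lags=12)
print(lb_results)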
R has an auto.arima function (and other automagic methods) that gridsearches/optimizes our model parameters for us. Over time, more of these goodies are being ported to Python (e.g. `from statsmodels.tsa.x13 import x13_arima_select_order`). While there's nothing wrong with using these resources, the human makes the final determination! Don't become over-reliant on these methods, especially early on while you're still grasping the underlying mechanics and theory.
In [ ]:
# autoselect for monthly, limited to only searching AR and MA parameters
autores = sm.tsa.arma_order_select_ic(monthly_temp.temp, ic=['aic', 'bic'], trend='c', max_ar=4, max_ma=4, fit_kw=dict(method='css-mle'))
print('AIC', autores.aic_min_order) # will use this as inputs for annual
print('BIC', autores.bic_min_order)
In [ ]:
# using itertools to gridsearch solutions
import itertools
#set parameter range; UPDATE THESE!
p = q = range(0, 3)
d = range(0, 2)
season = 12
# list of all parameter combos
pdq = list(itertools.product(p, d, q))
# same for seasonal variant
seasonal_pdq = [(x[0], x[1], x[2], season) for x in list(itertools.product(p, d, q))]
print('SARIMAX: {} , {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} , {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} , {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} , {}'.format(pdq[2], seasonal_pdq[4]))
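The combinations above are only enumerated, not fit. A minimal sketch of the actual search (our own loop, not the lesson's: it fits `sm.tsa.statespace.SARIMAX` on the monthly temperatures for each combination and keeps the lowest AIC; warnings are silenced because many combinations fail to converge):

In [ ]:
import warnings
warnings.filterwarnings('ignore')  # many parameter combos throw convergence warnings

best_aic, best_params = np.inf, None
for order in pdq:
    for seasonal_order in seasonal_pdq:
        try:
            fit = sm.tsa.statespace.SARIMAX(monthly_temp.temp, order=order,
                                            seasonal_order=seasonal_order).fit(disp=False)
            if fit.aic < best_aic:
                best_aic, best_params = fit.aic, (order, seasonal_order)
        except Exception:
            continue  # skip combinations that fail to fit
print('best AIC {:.2f} with order {} and seasonal order {}'.format(best_aic, *best_params))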
In [58]:
# read and plot data
data = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.year.csv') #
plt.figure()
plt.plot(data.index,data['sunspot.year'])
plt.ylabel('Sunspots')
plt.title('Yearly Sunspot Data');
In [59]:
# fit and summarize model
model = pf.ARIMA(data=data,ar=4,ma=4,integ=0,target='sunspot.year')
x = model.fit("MLE")
x.summary()
In [60]:
# plot z-scores of feature coefficients
model.plot_z(indices=range(1,9))
In [61]:
# plot model
model.plot_fit(figsize=(15,5))
In [62]:
# plot in sample
model.plot_predict_is(50, figsize=(15,5))
In [63]:
# plot forecast
model.plot_predict(h=20,past_values=20, figsize=(15,5))
In [64]:
# compare tail to prediction
data.tail()
Out[64]:
In [66]:
model.predict(h=12)
Out[66]:
In [ ]:
# helper plot
In [ ]:
# build and summarize model
In [ ]:
# plot z-scores of feature coefficients
In [ ]:
# plot model against raw data
In [ ]:
# plot predictions
In [ ]:
# plot forecast
In [ ]:
# predict future values
In [ ]:
# load data
data = sm.datasets.co2.load_pandas()
co2 = data.data
In [ ]:
# resample to monthly and check missing values
co2 = co2['co2'].resample('MS').mean()
co2 = co2.fillna(co2.bfill())
co2.isnull().sum()
In [ ]:
# decompose data into trend, seasonal, and residual
decomposition = sm.tsa.seasonal_decompose(co2, model='additive')
fig = decomposition.plot()
plt.show()
In [ ]:
# build model
co2sar = sm.tsa.statespace.SARIMAX(co2, order=(1, 1, 1),
seasonal_order=(1, 1, 1, 12),
enforce_stationarity=False,
enforce_invertibility=False).fit()
co2sar.summary()
In [ ]:
# check diagnostics
co2sar.plot_diagnostics();
In [ ]:
# create predictions and confidence intervals
pred = co2sar.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=False) # we use as many true values as possible to predict
pred_ci = pred.conf_int()
In [ ]:
# plot predictions
ax = co2['1993':].plot(label='Observed CO2 Levels')
pred.predicted_mean.plot(ax=ax, label='Forecast', alpha=.8) # this is using all available info
ax.fill_between(pred_ci.index, pred_ci.iloc[:, 0], pred_ci.iloc[:, 1], color='k', alpha=.1)
ax.set_xlabel('Year')
ax.set_ylabel('CO2')
plt.legend()
plt.show();
In [ ]:
# compute mean square error
fcast = pred.predicted_mean
true = co2['1998-01-01':]
mse = ((fcast - true) ** 2).mean()
print('MSE of our forecasts is {}'.format(round(mse, 3)))
In [ ]:
# dynamic forecast: from the start date onward, predictions feed on earlier predictions rather than observed values
fcast = co2sar.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=True)
fcast_ci = fcast.conf_int()
# caution: this is modeling assumptions on top of assumptions
In [ ]:
# student practice: use template from above
# plot predictions
In [ ]:
# compute mean square error
In [ ]:
# forecast next 100 months and get confidence interval
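A sketch of one way to approach this (using `get_forecast` from the fitted statsmodels SARIMAX results object):

In [ ]:
# forecast 100 months past the end of the observed data and grab the confidence interval
future = co2sar.get_forecast(steps=100)
future_ci = future.conf_int()
future_ci.head()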
In [67]:
# plot forecast
For more, check out the Facebook Prophet and Pyflux projects.