In [1]:
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
# the main library has a small set of functionality
from stldecompose import decompose, forecast
from stldecompose.forecast_funcs import (naive,
drift,
mean,
seasonal_naive)
%load_ext autoreload
%autoreload 2
We'll use some of the data that comes pre-packaged with statsmodels
to demonstrate the library functionality. The data set below comprises incomplete, daily measurements of CO2 levels in Hawaii.
Note: at the time of this writing, the current release of statsmodels
includes a utility method for loading these datasets as a pandas.DataFrame
which appears to be broken. Below is a short hack inspired by the current master branch on the statsmodels
GitHub page.
In [2]:
def get_statsmodels_df():
"""Return packaged data in a pandas.DataFrame"""
# some hijinks to get around outdated statsmodels code
dataset = sm.datasets.co2.load()
start = dataset.data['date'][0].decode('utf-8')
index = pd.date_range(start=start, periods=len(dataset.data), freq='W-SAT')
obs = pd.DataFrame(dataset.data['co2'], index=index, columns=['co2'])
return obs
In [3]:
obs = get_statsmodels_df()
obs.head()
Out[3]:
Because it's based on some existing statsmodels
functionality, STLDecompose
requires two things of the input dataframe:
pandas
DateTimeIndex
Since these are both very situation-dependent, we leave it to the user to define how they want to acheive these goals - pandas
provides a number of ways to work with missing data. In particular, the functions shown below make these steps relatively straightforward. Below, we add use linear interpolation, and resample to daily observations. The resulting frame meets both of our criteria.
In [4]:
obs = (obs
.resample('D')
.mean()
.interpolate('linear'))
obs.head(10)
Out[4]:
In [5]:
obs.index
Out[5]:
In [6]:
obs.head(1000).plot()
Out[6]:
One of the primary pieces of functionality is the STL decomposition. The associated method requires the observation frame, and the primary (largest) period of seasonality. This period
is specified in terms of index positions, and so care is needed for the user to correctly specify the periodicity in terms of their observations.
For example, with daily observations and large annual cycles, period=365
. For hourly observations with large daily cycles, period=24
. Some inspection, and trial and error may be helpful.
In [7]:
decomp = decompose(obs, period=365)
decomp
Out[7]:
The resulting object is an extended version of the statsmodels.tsa.seasonal.DecomposeResult
. Like the statsmodels
object, the arrays of values are available on the object (the observations; and the trend, seasonal, and residual components). An extra attribute (the average seasonal cycle) has been added for the purpose of forecasting.
We inherit the built-in .plot()
method on the object.
In [8]:
decomp.plot();
While the STL decomposition is interesting on its own, STLDecompose
also provides some relatively naive capabilities for using the decomposition to forecast based on our observations.
We'll use the same data set, but pretend that we only had the first two third of observations. Then we can compare our forecast to the real observation data.
In [9]:
len(obs)
Out[9]:
In [10]:
short_obs = obs.head(10000)
In [11]:
# apply the decomp to the truncated observation
short_decomp = decompose(short_obs, period=365)
short_decomp
Out[11]:
The forecast()
method requires the following arguments:
DecomposeResult
There are a handful of predefined functions that can be imported from the stldecompose.forecast_funcs
module. These implementations are based on Hyndman's online textbook. The user can also define their own forecast function, following the patterns demonstrated in the predefined functions.
The return type of the forecast()
method is a pandas.Dataframe
with a column name that represents the forecast function and an appropriate DatetimeIndex
.
In [12]:
fcast = forecast(short_decomp, steps=8000, fc_func=drift)
fcast.head()
Out[12]:
If desired, we can then plot the corresponding components of the observation and forecast to check and verify the results.
In [13]:
plt.plot(obs, '--', label='truth')
plt.plot(short_obs, '--', label='obs')
plt.plot(short_decomp.trend, ':', label='decomp.trend')
plt.plot(fcast, '-', label=fcast.columns[0])
plt.xlim('1970','2004'); plt.ylim(330,380);
plt.legend();
To include the estimated seasonal component in the forecast, use the boolean seasonal
keyword.
In [14]:
fcast = forecast(short_decomp, steps=8000, fc_func=drift, seasonal=True)
plt.plot(obs, '--', label='truth')
plt.plot(short_obs, '--', label='obs')
plt.plot(short_decomp.trend, ':', label='decomp.trend')
plt.plot(fcast, '-', label=fcast.columns[0])
plt.xlim('1970','2004'); plt.ylim(330,380);
plt.legend();
In [15]:
fcast.head()
Out[15]:
Enjoy.
In [ ]: