In [1]:
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline

# the main library has a small set of functionality
from stldecompose import decompose, forecast
from stldecompose.forecast_funcs import (naive,
                                         drift, 
                                         mean, 
                                         seasonal_naive)


%load_ext autoreload
%autoreload 2

We'll use some of the data that comes pre-packaged with statsmodels to demonstrate the library functionality. The data set below comprises incomplete, daily measurements of CO2 levels in Hawaii.

Note: at the time of this writing, the current release of statsmodels includes a utility method for loading these datasets as a pandas.DataFrame which appears to be broken. Below is a short hack inspired by the current master branch on the statsmodels GitHub page.


In [2]:
def get_statsmodels_df():
    """Return packaged data in a pandas.DataFrame"""
    # some hijinks to get around outdated statsmodels code
    dataset = sm.datasets.co2.load()
    start = dataset.data['date'][0].decode('utf-8')
    index = pd.date_range(start=start, periods=len(dataset.data), freq='W-SAT')
    obs = pd.DataFrame(dataset.data['co2'], index=index, columns=['co2'])
    return obs

In [3]:
obs = get_statsmodels_df()
obs.head()


Out[3]:
co2
1958-03-29 316.1
1958-04-05 317.3
1958-04-12 317.6
1958-04-19 317.5
1958-04-26 316.4

Because it's based on some existing statsmodels functionality, STLDecompose requires two things of the input dataframe:

  1. continuous observations (no missing data points)
  2. a pandas DateTimeIndex

Since these are both very situation-dependent, we leave it to the user to define how they want to acheive these goals - pandas provides a number of ways to work with missing data. In particular, the functions shown below make these steps relatively straightforward. Below, we add use linear interpolation, and resample to daily observations. The resulting frame meets both of our criteria.


In [4]:
obs = (obs
       .resample('D')
       .mean()
       .interpolate('linear'))

obs.head(10)


Out[4]:
co2
1958-03-29 316.100000
1958-03-30 316.271429
1958-03-31 316.442857
1958-04-01 316.614286
1958-04-02 316.785714
1958-04-03 316.957143
1958-04-04 317.128571
1958-04-05 317.300000
1958-04-06 317.342857
1958-04-07 317.385714

In [5]:
obs.index


Out[5]:
DatetimeIndex(['1958-03-29', '1958-03-30', '1958-03-31', '1958-04-01',
               '1958-04-02', '1958-04-03', '1958-04-04', '1958-04-05',
               '1958-04-06', '1958-04-07',
               ...
               '2001-12-20', '2001-12-21', '2001-12-22', '2001-12-23',
               '2001-12-24', '2001-12-25', '2001-12-26', '2001-12-27',
               '2001-12-28', '2001-12-29'],
              dtype='datetime64[ns]', length=15982, freq='D')

In [6]:
obs.head(1000).plot()


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c1cc3b7b8>

Decompose

One of the primary pieces of functionality is the STL decomposition. The associated method requires the observation frame, and the primary (largest) period of seasonality. This period is specified in terms of index positions, and so care is needed for the user to correctly specify the periodicity in terms of their observations.

For example, with daily observations and large annual cycles, period=365. For hourly observations with large daily cycles, period=24. Some inspection, and trial and error may be helpful.


In [7]:
decomp = decompose(obs, period=365)

decomp


Out[7]:
<statsmodels.tsa.seasonal.DecomposeResult at 0x1c1d0a83c8>

The resulting object is an extended version of the statsmodels.tsa.seasonal.DecomposeResult. Like the statsmodels object, the arrays of values are available on the object (the observations; and the trend, seasonal, and residual components). An extra attribute (the average seasonal cycle) has been added for the purpose of forecasting.

We inherit the built-in .plot() method on the object.


In [8]:
decomp.plot();


Forecast

While the STL decomposition is interesting on its own, STLDecompose also provides some relatively naive capabilities for using the decomposition to forecast based on our observations.

We'll use the same data set, but pretend that we only had the first two third of observations. Then we can compare our forecast to the real observation data.


In [9]:
len(obs)


Out[9]:
15982

In [10]:
short_obs = obs.head(10000)

In [11]:
# apply the decomp to the truncated observation
short_decomp = decompose(short_obs, period=365)

short_decomp


Out[11]:
<statsmodels.tsa.seasonal.DecomposeResult at 0x1c1dc5e860>

The forecast() method requires the following arguments:

  • the previously fit DecomposeResult
  • the number of steps forward for which we'd like the forecast
  • the specific forecasting function to be applied to the decomposition

There are a handful of predefined functions that can be imported from the stldecompose.forecast_funcs module. These implementations are based on Hyndman's online textbook. The user can also define their own forecast function, following the patterns demonstrated in the predefined functions.

The return type of the forecast() method is a pandas.Dataframe with a column name that represents the forecast function and an appropriate DatetimeIndex.


In [12]:
fcast = forecast(short_decomp, steps=8000, fc_func=drift)

fcast.head()


Out[12]:
drift
1985-08-14 345.985881
1985-08-15 345.989898
1985-08-16 345.993915
1985-08-17 345.997933
1985-08-18 346.001950

If desired, we can then plot the corresponding components of the observation and forecast to check and verify the results.


In [13]:
plt.plot(obs, '--', label='truth')
plt.plot(short_obs, '--', label='obs')
plt.plot(short_decomp.trend, ':', label='decomp.trend')
plt.plot(fcast, '-', label=fcast.columns[0])

plt.xlim('1970','2004'); plt.ylim(330,380);
plt.legend();


To include the estimated seasonal component in the forecast, use the boolean seasonal keyword.


In [14]:
fcast = forecast(short_decomp, steps=8000, fc_func=drift, seasonal=True)

plt.plot(obs, '--', label='truth')
plt.plot(short_obs, '--', label='obs')
plt.plot(short_decomp.trend, ':', label='decomp.trend')
plt.plot(fcast, '-', label=fcast.columns[0])

plt.xlim('1970','2004'); plt.ylim(330,380);
plt.legend();



In [15]:
fcast.head()


Out[15]:
drift+seasonal
1985-08-14 344.855165
1985-08-15 344.800451
1985-08-16 344.736606
1985-08-17 344.683039
1985-08-18 344.632911

Enjoy.


In [ ]: