Exploring Time Series Data

Time series models are used in a wide range of applications, particularly for forecasting, which is the goal of this example, performed in four steps:

Explore the characteristics of the time series data.
Decompose the time series into trend, seasonal components, and remainder components.
Apply time series models.
Forecast the production for a 12 month period.

Load and clean the data

The dataset is the production amount of several diary products in California, month by month, for 18 years.
Our goal: forecast the next year production for one of those products: milk.

As usual, first we load the data into a Pandas dataframe.



In [1]:

    
import pandas as pd

X = pd.read_csv('../datasets/cadairydata.csv', index_col=0)
X.head()









    Out[1]:







  
    
      
      Year.Month
      Month.Number
      Year
      Month
      Cotagecheese.Prod
      Icecream.Prod
      Milk.Prod
      N.CA.Fat.Price
    
  
  
    
      1
      1995.01
      1
      1995
      Jan
      4.370
      51.595
      2.112
      0.9803
    
    
      2
      1995.02
      2
      1995
      Feb
      3.695
      56.086
      1.932
      0.8924
    
    
      3
      1995.03
      3
      1995
      Mar
      4.538
      68.453
      2.162
      0.8924
    
    
      4
      1995.04
      4
      1995
      Apr
      4.280
      65.722
      2.130
      0.8967
    
    
      5
      1995.05
      5
      1995
      May
      4.470
      73.730
      2.227
      0.8967



In [2]:

    
X.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 228 entries, 1 to 228
Data columns (total 8 columns):
Year.Month           228 non-null float64
Month.Number         228 non-null int64
Year                 228 non-null int64
Month                228 non-null object
Cotagecheese.Prod    228 non-null float64
Icecream.Prod        228 non-null float64
Milk.Prod            228 non-null float64
N.CA.Fat.Price       228 non-null float64
dtypes: float64(5), int64(2), object(1)
memory usage: 16.0+ KB

The first step is to have a Pandas date column which could be used as the time index.

Add Pandas date-time column

The fastest way is to convert year and month to strings, then create a new Pandas datetime column that combine them.



In [3]:

    
X["Year"] = X["Year"].astype(str)



In [4]:

    
X["Month.Number"] = X["Month.Number"].astype(str)



In [5]:

    
X["YearMonth"] = pd.to_datetime(X["Year"] + X["Month.Number"], format="%Y%m") # new column

Now can drop the other date columns:



In [6]:

    
X.drop(["Year","Month", "Year.Month"], axis=1, inplace=True)

and I prefer to rename some of the other columns, for clarity:



In [7]:

    
X.rename(columns={'Cotagecheese.Prod':'CottageCheeseProd', 
                  'Icecream.Prod':'IceCreamProd',
                  'Milk.Prod':'MilkProd',
                  'N.CA.Fat.Price':'Price',
                  'Month.Number' : 'MonthNumber'},
          inplace=True)



In [8]:

    
X.head()









    Out[8]:







  
    
      
      MonthNumber
      CottageCheeseProd
      IceCreamProd
      MilkProd
      Price
      YearMonth
    
  
  
    
      1
      1
      4.370
      51.595
      2.112
      0.9803
      1995-01-01
    
    
      2
      2
      3.695
      56.086
      1.932
      0.8924
      1995-02-01
    
    
      3
      3
      4.538
      68.453
      2.162
      0.8924
      1995-03-01
    
    
      4
      4
      4.280
      65.722
      2.130
      0.8967
      1995-04-01
    
    
      5
      5
      4.470
      73.730
      2.227
      0.8967
      1995-05-01



In [9]:

    
X.tail()









    Out[9]:







  
    
      
      MonthNumber
      CottageCheeseProd
      IceCreamProd
      MilkProd
      Price
      YearMonth
    
  
  
    
      224
      8
      2.169
      78.382
      3.400
      1.6273
      2013-08-01
    
    
      225
      9
      2.193
      74.871
      3.197
      1.5519
      2013-09-01
    
    
      226
      10
      2.284
      70.011
      3.338
      1.5566
      2013-10-01
    
    
      227
      11
      1.865
      61.394
      3.275
      1.7537
      2013-11-01
    
    
      228
      12
      2.345
      53.653
      3.473
      1.6747
      2013-12-01



In [10]:

    
X.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 228 entries, 1 to 228
Data columns (total 6 columns):
MonthNumber          228 non-null object
CottageCheeseProd    228 non-null float64
IceCreamProd         228 non-null float64
MilkProd             228 non-null float64
Price                228 non-null float64
YearMonth            228 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 12.5+ KB

Now, we can perform both arithmetic and logical operations on datetime objects ("YearMonth").
As an example, here is how to select the last 12 months of dairy data:



In [11]:

    
X[X.YearMonth > "2012-12-01"]









    Out[11]:







  
    
      
      MonthNumber
      CottageCheeseProd
      IceCreamProd
      MilkProd
      Price
      YearMonth
    
  
  
    
      217
      1
      2.282
      62.612
      3.462
      1.7744
      2013-01-01
    
    
      218
      2
      1.941
      66.274
      3.231
      1.6224
      2013-02-01
    
    
      219
      3
      2.150
      75.770
      3.676
      1.6968
      2013-03-01
    
    
      220
      4
      2.217
      76.396
      3.622
      1.7444
      2013-04-01
    
    
      221
      5
      2.341
      80.899
      3.720
      1.8552
      2013-05-01
    
    
      222
      6
      2.694
      92.091
      3.489
      1.8314
      2013-06-01
    
    
      223
      7
      2.433
      84.026
      3.373
      1.6923
      2013-07-01
    
    
      224
      8
      2.169
      78.382
      3.400
      1.6273
      2013-08-01
    
    
      225
      9
      2.193
      74.871
      3.197
      1.5519
      2013-09-01
    
    
      226
      10
      2.284
      70.011
      3.338
      1.5566
      2013-10-01
    
    
      227
      11
      1.865
      61.394
      3.275
      1.7537
      2013-11-01
    
    
      228
      12
      2.345
      53.653
      3.473
      1.6747
      2013-12-01

The Milk prpduction should be in million of pounds.



In [12]:

    
X.isnull().values.any()









    Out[12]:





False

There are no NaNs. The dataset is ready to be explored.

Explore the data

We concentrate now on the milk production (the MilkProd column).

Create a Time Series Plot

First useful thing to do is always to plot the data:



In [13]:

    
import matplotlib.pyplot as plt
%matplotlib inline



In [14]:

    
fig,ax = plt.subplots()
ax.plot(X.YearMonth, X.MilkProd)
ax.grid(True)
ax.set_xlabel("Year")
ax.set_ylabel("Milk")
fig.suptitle("Time series of milk production");

For most of the time period shown, the production of milk increased year over year. However, there is a decline in milk production starting in 2009 as a results of the recession. Also, notice that this time series exhibits a strong seasonal component with an annual cycle.

Since the time series has both a visible trend and a seasonality it is clearly a non-stationary series.

Statistical Properties of the Time Series

Autocorrelation is a fundamental property of time series. The Autocorrelation Function or ACF provides information on the dependency of the time series values of previous values.



In [15]:

    
import statsmodels.graphics.tsaplots as sm_pl









    



/Users/Massimo/anaconda/lib/python3.5/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools



In [16]:

    
sm_pl.plot_acf(X.MilkProd, lags=24);

Lags are the months in this case.

Note that the values of the ACF at the various lags decays only slowly. This indicates there is considerable auto correlation between the time series values at the various lags, mostly likely from the trend.

Plotting a histogram provides information on the distribution of values of the time series.



In [17]:

    
fig,ax = plt.subplots()
ax.hist(X.MilkProd, bins=40)
ax.grid(True)
ax.set_xlabel("Production")
ax.set_ylabel("Frequency")
fig.suptitle("Distribution of Milk Production");

The histogram of the full milk production time series shows considerable dispersion. Again this behavior is likely the result of the trend.

Stationarity: are mean and standard deviation constant over time?

From the plot is pretty clear that there is trend and seasonality, therefore is not stationary. Anyway, can be checked.



In [18]:

    
rollingMean = X.MilkProd.rolling(window=12).mean()
rollingStd = X.MilkProd.rolling(window=12).std()



In [19]:

    
fig,ax1 = plt.subplots()
ax1.plot(X.MilkProd, color="black", label="Milk production")
ax1.plot(rollingMean, color="red", label="Average")
ax1.grid(True)
ax1.set_xlabel("Time in months")
ax1.set_ylabel("Milk")
ax1.legend(loc="upper left")

ax2 = ax1.twinx() # Create a twin Axes sharing the xaxis
ax2.plot(rollingStd, color = "blue", label = "StD")
ax2.set_ylabel("Standard Deviation")
ax2.legend(loc="lower right")
ax2.set_ylim(0,0.5)
fig.suptitle("Milk - Average & Standard Deviation");

The mean is clearly increasing over time (the production is rising) but also the standard deviation of milk production increased in the time frame.

Decomposition of the Time Series

Time series are typically decomposed into three components: trend, seasonal, and the remainders, or residuals.
Trend can be modeled by several methods. We will start by decomposing the time series using a simple moving average model.

Trend

The code in the cell below uses moving window method to compute the average of the time series over specified span, or order of the operator. As the moving window operator moves over the data, the average of the values in the windows is calculated.



In [20]:

    
import numpy as np
def movingAverage(values, order):
    end = len(values)
    out = np.zeros(len(values))
    
    out[0] = values[1]
    for i in range(1,end):
        if (i - order <= 1):
            j = 1
        else:
            j += 1
        out[i] = sum(values[j-1:i]) / (i-j+1)
        
    return out

One of the best ways to make a series stationary on variance is through transforming the original series through log transform.
We go back to our original data series and add a new column, the logarithm of the milk production:



In [21]:

    
X['LogMilk'] = pd.Series(np.log(X.MilkProd), index = X.index)



In [22]:

    
X.head()









    Out[22]:







  
    
      
      MonthNumber
      CottageCheeseProd
      IceCreamProd
      MilkProd
      Price
      YearMonth
      LogMilk
    
  
  
    
      1
      1
      4.370
      51.595
      2.112
      0.9803
      1995-01-01
      0.747635
    
    
      2
      2
      3.695
      56.086
      1.932
      0.8924
      1995-02-01
      0.658556
    
    
      3
      3
      4.538
      68.453
      2.162
      0.8924
      1995-03-01
      0.771034
    
    
      4
      4
      4.280
      65.722
      2.130
      0.8967
      1995-04-01
      0.756122
    
    
      5
      5
      4.470
      73.730
      2.227
      0.8967
      1995-05-01
      0.800655

Now we perform a 12 months moving average, to extract the trend:



In [23]:

    
trendM = movingAverage(X.LogMilk, 12)



In [24]:

    
fig,ax = plt.subplots()
ax.plot(X.LogMilk, color="black", label="Milk production")
ax.plot(trendM, color="blue", label="Trend")

ax.grid(True)
ax.set_xlabel("Time in months")
ax.set_ylabel("Log(Milk)")
ax.legend(loc="upper left")


fig.suptitle("Milk & trend - log");

Get the Trend with Lowess

Having tried a simple moving average decomposition, we will now use a lowess model to determine the trend. Lowess is a sophisticated non-linear regression.

The statsmodels package includes a function to calculate the Lowess regression:



In [25]:

    
import statsmodels.nonparametric.smoothers_lowess as sm

trendL = sm.lowess(X.LogMilk, X.YearMonth, frac=0.25,return_sorted=False)



In [26]:

    
fig,ax = plt.subplots()
ax.plot(X.YearMonth, X.LogMilk)
ax.plot(X.YearMonth, trendM, label='Trend from Moving Average')
ax.plot(X.YearMonth, trendL, label="Trend from Lowess")
ax.set_xlabel("Year")
ax.set_ylabel("Log(Milk)")
ax.legend(loc="lower right")
fig.suptitle("Trends extracted from milk production");

The time series charts show the original time series along with the two trends. The trend from Lowess regression is a bit smoother than the one was obtained with the simple moving average decomposition.

Seasonality

Now we apply a linear regression on the Milk production using the months as variables to discover the seasonality.

But first need to convert the months into categorical variables using the One Hot Encoder (OHE) approach.



In [27]:

    
monthsOHE = pd.get_dummies(X.MonthNumber)
monthsOHE.head()

The order doesn't matter in this case (October and November comes before February).



In [28]:

    
prodWithoutTrend = X.LogMilk - trendM # the difference



In [29]:

    
fig,ax = plt.subplots()
ax.plot(prodWithoutTrend)

ax.grid(True)
ax.set_xlabel("Time in months")
ax.set_ylabel("Milk production without trend")

fig.suptitle("Difference looks more stationary");

The milk production - when the trend is removed - seems more stationary, without any immediate pattern.
We create now a temporary data frame to perform the regression:



In [30]:

    
# a new tmp data frame for decomposition 
decX = pd.DataFrame({'LogMilkNoTrend': prodWithoutTrend}) 
decX = decX.join(monthsOHE)



In [31]:

    
decX.head()









    Out[31]:







  
    
      
      LogMilkNoTrend
      1
      10
      11
      12
      2
      3
      4
      5
      6
      7
      8
      9
    
  
  
    
      1
      0.000000
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      2
      -0.089080
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
    
    
      3
      0.067938
      0
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
    
    
      4
      0.030380
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
      0
      0
    
    
      5
      0.067319
      0
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
      0

The regression is applied to the production without trend:



In [32]:

    
from sklearn import linear_model
model = linear_model.LinearRegression(fit_intercept = False)



In [33]:

    
model.fit(monthsOHE, decX.LogMilkNoTrend)









    Out[33]:





LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)

It's a model with 12 (one for each month, keep in mind that they are not in chronological order) variables:



In [34]:

    
model.coef_









    Out[34]:





array([ 0.02471208,  0.01099585, -0.02349834,  0.01714495, -0.04996016,
        0.06783954,  0.03872409,  0.06733682,  0.01530435,  0.02160618,
        0.01572798, -0.02670753])

Finally, we get the seasonality by predicting the value month by month:



In [35]:

    
seasonalM = model.predict(monthsOHE)

Residuals

If we remove the seasonality from the difference between milk production and the trend, we will get the residuals or remainders:



In [36]:

    
remainderM = prodWithoutTrend - seasonalM

Finally, we put all together in a data fram that contains all decomposition elements: trend, seasonality and residuals



In [37]:

    
# a data frame with all decompositions
decomposed = pd.DataFrame({'t': trendM, 's': seasonalM, 'r': remainderM})



In [38]:

    
decomposed.head()



In [39]:

    
fig,ax = plt.subplots(4, sharex=True, figsize=(15,15))
ax[0].plot(X.MilkProd)
ax[0].set_title('Original')

ax[1].plot(decomposed.t)
ax[1].set_title('Trend')

ax[2].plot(decomposed.s)
ax[2].set_title('Seasonality')

ax[3].plot(decomposed.r)
ax[3].set_title('Residuals');

As we have seen, the trend is of an increasing milk production, beside the last months when it is stable; while there is a clear yearly seasonality.
The remainders in an ideal situation are like white noise, i.e. every pattern has been removed and what is left is only the random variability.

Decomposing using statsmodels

The statsmodels package has also a function to decompose a time series:



In [40]:

    
from statsmodels.tsa.seasonal import seasonal_decompose

# need to pass the time series as array (values) because of error
# in statsmodels, should be fixed in next version 0.9
decomposition = seasonal_decompose(X.MilkProd.values, freq=12)

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid



In [41]:

    
fig,ax = plt.subplots(4, sharex=True, figsize=(15,15))
ax[0].plot(X.MilkProd)
ax[0].set_title('Original')

ax[1].plot(trend)
ax[1].set_title('Trend')

ax[2].plot(seasonal)
ax[2].set_title('Seasonality')

ax[3].plot(residual)
ax[3].set_title('Residuals');

You can see the trend and seasonal components clearly separated in the above plots. The remainder plot looks fairly random, as expected.

Is now stationary?

The question remains, is the remainder from this decomposition stationary? To find out, we plot its ACF:



In [42]:

    
sm_pl.plot_acf(decomposed.r, lags=24);



In [43]:

    
sm_pl.plot_pacf(decomposed.r, lags=24);

The ACF has at least 5 significant lag values, indicating the remainder is NOT, in fact, stationary.
Compared to the behavior of the ACF for the simple moving average decomposition, the behavior of the remainder is improved.

Now we plot the histogram of the remainder:



In [44]:

    
fig,ax = plt.subplots()
ax.hist(decomposed.r, bins=40)
ax.grid(True)
ax.set_xlabel("Residuals")
ax.set_ylabel("Frequency")
fig.suptitle("Distribution of Residuals");

The distribution of the remainder values is much closer to a Normal distribution than for the original time series we created earlier. This result, combined with the ACF plot, indicates that the decomposition is effective.

We can further investigate the remainder (non-seasonal residual) component by making a box plot by month of the year:



In [45]:

    
remaindersByMonth = pd.DataFrame({'resid': decomposed.r, 'month': X["MonthNumber"]})



In [46]:

    
remaindersByMonth.boxplot(by='month');

The remainder component shows only limited variation from month to month. The differences are within the interquartile range, indicating that the seasonal model is a reasonably good fit.

Moving Average Models

Now that we have explored the decomposition of the time series we will now construct and test Autoregressive Moving Average (ARMA) models for the remainder of the time series.
We will create and test these models in three steps, first creating a moving average (MA) model, then creating an autoregressive (AR) model and finally an autoregressive moving average (ARMA) model.

The ACF of the remainder from the decomposition of the milk production time series had at least 5 significant lag values. As an inital model, we therefore now create an MA model of order 5 and examine the model summary.



In [47]:

    
from statsmodels.tsa.arima_model import ARIMA



In [48]:

    
modelMA5 = ARIMA(decomposed.r.values, order=(0,0,5))
resultsMA5 = modelMA5.fit(disp=0) # disp: do not display covergence info
resultsMA5.summary()









    Out[48]:





ARMA Model Results

  Dep. Variable:          y           No. Observations:      228   


  Model:             ARMA(0, 5)       Log Likelihood       629.357 


  Method:              css-mle        S.D. of innovations    0.015  


  Date:           Sat, 23 Jun 2018    AIC                 -1244.713


  Time:               13:58:41        BIC                 -1220.708


  Sample:                 0           HQIC                -1235.028


                                                                 




             coef      std err       z       P>|z|   [0.025     0.975]  


  const    -6.083e-05      0.003     -0.018   0.985     -0.007      0.006


  ma.L1.y      0.7585      0.074     10.264   0.000      0.614      0.903


  ma.L2.y      0.6238      0.086      7.278   0.000      0.456      0.792


  ma.L3.y      0.4338      0.070      6.205   0.000      0.297      0.571


  ma.L4.y      0.2543      0.073      3.488   0.001      0.111      0.397


  ma.L5.y      0.2421      0.077      3.139   0.002      0.091      0.393


              -1.2767                    -0.0000j                    1.2767                   -0.5000     
              -0.5637                    -1.1634j                    1.2927                   -0.3218     
              -0.5637                    +1.1634j                    1.2927                    0.3218     
               0.6769                    -1.2156j                    1.3913                   -0.1691     
               0.6769                    +1.2156j                    1.3913                    0.1691     

Roots

                  Real           Imaginary           Modulus          Frequency


  MA.1

  MA.2

  MA.3

  MA.4

  MA.5

Examining the values of the model coefficients and their standard errors (SE), we notice that the SE of the last two coefficients is the same order than the value of the coefficient itself. This may indicate that the value of this coefficient is poorly determined and could likely be set to zero.
The foregoing result indicates that the order of the MA model could be reduced. Generally, the order of an MA model is reduced in unit steps until all the coefficients appear to be significant.
We try with 3 coefficients and examine the model summary.



In [49]:

    
modelMA3 = ARIMA(decomposed.r.values, order=(0,0,3))
resultsMA3 = modelMA3.fit(disp=0) # disp: do not display covergence info
resultsMA3.summary()









    Out[49]:





ARMA Model Results

  Dep. Variable:          y           No. Observations:      228   


  Model:             ARMA(0, 3)       Log Likelihood       623.018 


  Method:              css-mle        S.D. of innovations    0.016  


  Date:           Sat, 23 Jun 2018    AIC                 -1236.036


  Time:               13:58:41        BIC                 -1218.889


  Sample:                 0           HQIC                -1229.118


                                                                 




             coef      std err       z       P>|z|   [0.025     0.975]  


  const    -2.526e-05      0.003     -0.009   0.993     -0.006      0.006


  ma.L1.y      0.8000      0.063     12.608   0.000      0.676      0.924


  ma.L2.y      0.6155      0.061     10.074   0.000      0.496      0.735


  ma.L3.y      0.3563      0.065      5.459   0.000      0.228      0.484


              -0.1203                    -1.3687j                    1.3739                   -0.2640     
              -0.1203                    +1.3687j                    1.3739                    0.2640     
              -1.4868                    -0.0000j                    1.4868                   -0.5000     

Roots

                  Real           Imaginary           Modulus          Frequency


  MA.1

  MA.2

  MA.3

The AIC is slightly lower and the standard error compared to the magnitude of the coefficients is also slightly improved; indicates that the order of the model is more reasonable but probably not yet perfect. To test how well this model fits the data, and results in a stationary result, we plot the ACF of the residuals of the MA(3) model:



In [50]:

    
sm_pl.plot_acf(resultsMA3.resid, lags=24);



In [51]:

    
sm_pl.plot_pacf(resultsMA3.resid, lags=24);

Autoregressive Models

Next, we test an autoregressive model. The PACF of the reminder indicates that an AR model might not be the best choice. None the less, a low order AR model might fit these data.

In general, AR orders will tend to present themselves by a sharp cutoff in the PACF plot and a slow trending or sinusoidal degradation in the ACF plot. The opposite is usually true for MA orders.



In [52]:

    
modelAR2 = ARIMA(decomposed.r.values, order=(2,0,0))
resultsAR2 = modelAR2.fit(disp=0) # disp: do not display covergence info
resultsAR2.summary()









    Out[52]:





ARMA Model Results

  Dep. Variable:          y           No. Observations:      228   


  Model:             ARMA(2, 0)       Log Likelihood       632.334 


  Method:              css-mle        S.D. of innovations    0.015  


  Date:           Sat, 23 Jun 2018    AIC                 -1256.668


  Time:               13:58:42        BIC                 -1242.951


  Sample:                 0           HQIC                -1251.133


                                                                 




             coef      std err       z       P>|z|   [0.025     0.975]  


  const       -0.0005      0.005     -0.099   0.921     -0.011      0.010


  ar.L1.y      0.7742      0.066     11.709   0.000      0.645      0.904


  ar.L2.y      0.0389      0.066      0.584   0.560     -0.091      0.169


               1.2173                    +0.0000j                    1.2173                    0.0000     
             -21.1420                    +0.0000j                   21.1420                    0.5000     

Roots

                  Real           Imaginary           Modulus          Frequency


  AR.1

  AR.2

The AIC value is the worst among the models tried.
Examining the values of the coefficients and their standard errors, we note that the standard error of the second coefficient is actually bigger than the coefficient. Clearly, the AR(2) model is over parameterized.

Next, we try an AR(1) model.



In [53]:

    
modelAR1 = ARIMA(decomposed.r.values, order=(1,0,0))
resultsAR1 = modelAR1.fit(disp=0) # disp: do not display covergence info
resultsAR1.summary()









    Out[53]:





ARMA Model Results

  Dep. Variable:          y           No. Observations:      228   


  Model:             ARMA(1, 0)       Log Likelihood       632.163 


  Method:              css-mle        S.D. of innovations    0.015  


  Date:           Sat, 23 Jun 2018    AIC                 -1258.327


  Time:               13:58:42        BIC                 -1248.039


  Sample:                 0           HQIC                -1254.176


                                                                 




             coef      std err       z       P>|z|   [0.025     0.975]  


  const       -0.0004      0.005     -0.088   0.930     -0.010      0.009


  ar.L1.y      0.8055      0.039     20.777   0.000      0.729      0.881


               1.2415                    +0.0000j                    1.2415                    0.0000     

Roots

                  Real           Imaginary           Modulus          Frequency


  AR.1

The standard error of the AR(1) model is an order of magnitude less than the value of the coefficient, which is promising.



In [54]:

    
sm_pl.plot_acf(resultsAR1.resid, lags=24);



In [55]:

    
sm_pl.plot_pacf(resultsAR1.resid, lags=24);

Note that only the 0 lag of the ACF is significant and that there are no significant lags for the PACF. These observations indicate that the AR(1) model is a good fit. Compare these results to those of the MA(3) model, noting that they are nearly identical. Either the MA(3) or AR(1) model could be a good choice for these data.

Autoregressive Moving Average (ARMA) Models

We have found that both MA(3) and AR(1) models are possibly good fits to the remainder series.
We will now investigate the use of autoregressive Integrated moving average (ARIMA) models on the remainder series, a combination of both MA and AR models.
As a starting point we try an ARIMA(1,3) model.



In [56]:

    
modelARMA = ARIMA(decomposed.r.values, order=(1,0,3))
resultsARMA = modelARMA.fit(disp=0) # disp: do not display covergence info
resultsARMA.summary()









    Out[56]:





ARMA Model Results

  Dep. Variable:          y           No. Observations:      228   


  Model:             ARMA(1, 3)       Log Likelihood       632.544 


  Method:              css-mle        S.D. of innovations    0.015  


  Date:           Sat, 23 Jun 2018    AIC                 -1253.087


  Time:               13:58:43        BIC                 -1232.511


  Sample:                 0           HQIC                -1244.785


                                                                 




             coef      std err       z       P>|z|   [0.025     0.975]  


  const       -0.0005      0.005     -0.092   0.927     -0.010      0.009


  ar.L1.y      0.8033      0.069     11.705   0.000      0.669      0.938


  ma.L1.y     -0.0279      0.097     -0.287   0.774     -0.218      0.163


  ma.L2.y      0.0548      0.084      0.649   0.517     -0.111      0.220


  ma.L3.y     -0.0116      0.081     -0.144   0.886     -0.170      0.147


               1.2449                    +0.0000j                    1.2449                    0.0000     
              -0.8558                    -3.5578j                    3.6593                   -0.2876     
              -0.8558                    +3.5578j                    3.6593                    0.2876     
               6.4212                    -0.0000j                    6.4212                   -0.0000     

Roots

                  Real           Imaginary           Modulus          Frequency


  AR.1

  MA.1

  MA.2

  MA.3

In each case, the standard error is of the same order of magnitude as the value of the coefficient, indicating this model is a poor fit to the data.

Automatic search for the best ARIMA parameters

There are many guidelines and best practices to get the most suitable parameters for an ARIMA model, yet the correct parametrization of ARIMA models can be a painstaking manual process that requires domain expertise and time. Other statistical programming languages such as R provide automated ways to solve this issue, but not yet in Python.
We can solve this issue by writing Python code to programmatically select the optimal parameter values for our ARIMA(p,d,q) time series model.

We would perform a grid search, a sort of hyperparameter tuning. We can loop through non-trivial values of the ARIMA orders and see which model has the lowest Akaike Information Criterion (AIC).



In [57]:

    
import itertools

# Define the p and q parameters to take any value between 0 and 3
p = q = range(0, 4)

# Generate all different combinations of p, d=0 and q triplets
pdq = [(a,0,b) for a,b in itertools.product(p, q)]



In [58]:

    
import warnings
warnings.simplefilter("ignore")



In [59]:

    
best_score = np.inf
best_pdq = None
temp_model = None

for params in pdq:
        
    try:

        temp_model = ARIMA(decomposed.r.values, order=params)
        resultsARMA = temp_model.fit(disp=0) # disp: do not display covergence info

        if resultsARMA.aic < best_score:
            best_score = resultsARMA.aic
            best_pdq = params
            print(params, " -> NEW best AIC: ", best_score)
    except:
        #print("Unexpected error:", sys.exc_info()[0])
        continue









    



(0, 0, 0)  -> NEW best AIC:  -1020.86399991
(0, 0, 1)  -> NEW best AIC:  -1157.4900345419903
(0, 0, 2)  -> NEW best AIC:  -1207.5618740712084
(0, 0, 3)  -> NEW best AIC:  -1236.036064300405
(1, 0, 0)  -> NEW best AIC:  -1258.326773880575

The lowest AIC is from an ARIMA(1,0) model.



In [60]:

    
bestAICmodel = ARIMA(decomposed.r.values, order=(1,0,0))
resultsARMA = bestAICmodel.fit(disp=0) # disp: do not display covergence info
resultsARMA.summary()









    Out[60]:





ARMA Model Results

  Dep. Variable:          y           No. Observations:      228   


  Model:             ARMA(1, 0)       Log Likelihood       632.163 


  Method:              css-mle        S.D. of innovations    0.015  


  Date:           Sat, 23 Jun 2018    AIC                 -1258.327


  Time:               13:58:49        BIC                 -1248.039


  Sample:                 0           HQIC                -1254.176


                                                                 




             coef      std err       z       P>|z|   [0.025     0.975]  


  const       -0.0004      0.005     -0.088   0.930     -0.010      0.009


  ar.L1.y      0.8055      0.039     20.777   0.000      0.729      0.881


               1.2415                    +0.0000j                    1.2415                    0.0000     

Roots

                  Real           Imaginary           Modulus          Frequency


  AR.1

The standard error of the ar1 coefficient is good.
Next, we plot the ACF and PACF of the model.



In [61]:

    
sm_pl.plot_acf(resultsARMA.resid, lags=24);



In [62]:

    
sm_pl.plot_pacf(resultsARMA.resid, lags=24);



In [63]:

    
fig,ax = plt.subplots()
ax.hist(resultsARMA.resid, bins=70)
ax.grid(True)
ax.set_xlabel("Residuals")
ax.set_ylabel("Frequency")
fig.suptitle("Distribution of Residuals");



In [64]:

    
np.mean(resultsARMA.resid)









    Out[64]:





8.5725149050511039e-05

The mean in the residuals is almost zero showing that the bias has been almost totally removed.

how does the model behave?

Now we could split the initial dataset into the usual train and test sets and check how the trained model behaves on the test set.



In [65]:

    
# split into train and test (last two years) sets
testSize = 24
milkData = X.LogMilk.values
train, test = milkData[0:-testSize], milkData[-testSize:]

We use a one-step-ahead training. For each new observation in the test dataset, a prediction is claculated, then it is added back to the training test and used to improve the model.
Here is an example on the last model that we checked, the ARIMA(1,0).



In [66]:

    
history = [x for x in train]
predictions = list()



In [67]:

    
for t in range(len(test)):
    modelARMA = ARIMA(history, order=(1,0,0))
    resultsARMA = modelARMA.fit(disp=0) # disp: do not display covergence info
    yhat = resultsARMA.forecast()[0][0] # one step forecast
    predictions.append(yhat)
    history.append(test[t])



In [68]:

    
rmse = np.sqrt(sum((test - predictions)**2)/testSize)
rmse









    Out[68]:





0.050296022575550173

The Rooted Mean Square Error (a classic metric to measure the model accuracy) is quite low.
Let's see the predictions plotted:



In [69]:

    
predictions.pop(0) # remove the first element









    Out[69]:





1.2434033647264009



In [70]:

    
fig,ax = plt.subplots()
ax.plot(test, label="Observed")
ax.plot(predictions, label='Predictions', color='#ff0066');
 
ax.set_xlabel("Months")
ax.legend(loc='upper right')
fig.suptitle("Last months milk production");

The ARMA(1,0) is not bad: the predictions are quite good and the RMSE is quite low.

Seasonal ARIMA model

The statsmodels package (from version 0.8) includes a function to model a seasonal arima.
We see how it works first using as parameters AR=1 and MA=0, plus the difference set to 1 to indicate that we want a data transformation to remove the trend and seasonality.



In [71]:

    
import statsmodels.api as sm_sa



In [72]:

    
firstModel = sm_sa.tsa.statespace.SARIMAX(X.LogMilk.values, order=(1,1,0), 
                                      seasonal_order=(1,1,0,12))
results = firstModel.fit()
results.summary()









    Out[72]:





Statespace Model Results

  Dep. Variable:                   y                  No. Observations:      228   


  Model:            SARIMAX(1, 1, 0)x(1, 1, 0, 12)    Log Likelihood       551.410 


  Date:                    Sat, 23 Jun 2018           AIC                 -1096.821


  Time:                        13:58:54               BIC                 -1086.533


  Sample:                          0                  HQIC                -1092.670


                                - 228                                             


  Covariance Type:                opg                                              




              coef      std err       z       P>|z|   [0.025     0.975]  


  ar.L1        -0.1913      0.068     -2.832   0.005     -0.324     -0.059


  ar.S.L12     -0.5278      0.059     -8.995   0.000     -0.643     -0.413


  sigma2        0.0003   2.75e-05     12.378   0.000      0.000      0.000




  Ljung-Box (Q):           105.30    Jarque-Bera (JB):   8.26


  Prob(Q):                  0.00     Prob(JB):           0.02


  Heteroskedasticity (H):   0.81     Skew:               0.08


  Prob(H) (two-sided):      0.39     Kurtosis:           3.95

Again we use a "grid search" to iteratively explore different combinations of parameters. For each combination of parameters, we fit a new seasonal ARIMA model with the SARIMAX() function from the statsmodels module and assess its overall quality. This time is a double loop so it will take much longer time.



In [73]:

    
# Define the p and q parameters to take any value between 0 and 3
p = q = range(0, 4)
sp = sq = range(0, 3)

# Generate all different combinations of p, d=1 and q triplets
pdq = [(a,1,b) for a,b in itertools.product(p, q)]
# Generate all different combinations of seasonal p, q and q triplets
seasonal_pdq = [(a, 1, b, 12) for a,b in list(itertools.product(sp, sq))]



In [87]:

    
warnings.simplefilter("ignore")
best_score = np.inf
best_pdq = None
best_season_pdq = None
temp_model = None

for params in pdq:
    for params_seasonal in seasonal_pdq:
        try:
            temp_model = sm_sa.tsa.statespace.SARIMAX(X.LogMilk.values,
                                            order = params,
                                            seasonal_order = params_seasonal)

            results = temp_model.fit()
            
            if results.aic < best_score:
                best_score = results.aic
                best_pdq = params
                best_season_pdq = params_seasonal
                print(params, params_seasonal, " -> NEW best AIC: ", best_score)

            #print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue









    



(0, 1, 0) (0, 1, 1, 12)  -> NEW best AIC:  -1145.67087571
(0, 1, 0) (2, 1, 1, 12)  -> NEW best AIC:  -1146.24623229
(0, 1, 1) (0, 1, 1, 12)  -> NEW best AIC:  -1147.77581717
(0, 1, 1) (2, 1, 1, 12)  -> NEW best AIC:  -1147.84213857
(1, 1, 1) (0, 1, 1, 12)  -> NEW best AIC:  -1147.96663351
(1, 1, 1) (2, 1, 1, 12)  -> NEW best AIC:  -1148.79899832

The model with lowest AIC is the (1,1,1)(2,1,1) but they have all very close AIC values so it could be worth to check them more in detail one by one.
Here we just take the lowest AIC:



In [75]:

    
bestModel = sm_sa.tsa.statespace.SARIMAX(X.LogMilk.values, order=(1,1,1), 
                                      seasonal_order=(2,1,1,12))
results = bestModel.fit()
results.summary()









    Out[75]:





Statespace Model Results

  Dep. Variable:                   y                  No. Observations:      228   


  Model:            SARIMAX(1, 1, 1)x(2, 1, 1, 12)    Log Likelihood       580.399 


  Date:                    Sat, 23 Jun 2018           AIC                 -1148.799


  Time:                        14:13:47               BIC                 -1128.223


  Sample:                          0                  HQIC                -1140.497


                                - 228                                             


  Covariance Type:                opg                                              




              coef      std err       z       P>|z|   [0.025     0.975]  


  ar.L1         0.5334      0.265      2.013   0.044      0.014      1.053


  ma.L1        -0.6859      0.230     -2.984   0.003     -1.136     -0.235


  ar.S.L12     -0.1900      0.133     -1.429   0.153     -0.450      0.071


  ar.S.L24     -0.1710      0.095     -1.791   0.073     -0.358      0.016


  ma.S.L12     -0.7035      0.097     -7.241   0.000     -0.894     -0.513


  sigma2        0.0002   1.77e-05     14.007   0.000      0.000      0.000




  Ljung-Box (Q):           45.23    Jarque-Bera (JB):   27.85


  Prob(Q):                 0.26     Prob(JB):           0.00 


  Heteroskedasticity (H):  0.84     Skew:               0.02 


  Prob(H) (two-sided):     0.45     Kurtosis:           4.76

The coef column shows the weight (i.e. importance) of each feature and how each one impacts the time series. The P>|z| column informs us of the significance of each feature weight. Here, a couple of weights have a p-value higher than 0.05, so it could be reasonable to remove from the model.

The results object has a convenient diagnostic method named plot_diagnostics method to get a rundown of the fitted model.



In [76]:

    
results.plot_diagnostics(figsize=(16, 10));

The residual plot of the fitted model in the upper left corner appears do be white noise as it does not display obvious seasonality or trend behaviour.
The histogram plot in the upper right corner pair with the kernel density estimation (red line) indicates that the time series is almost normally distributed. This is compared to the density of the standard normal distribution (green line).
The correlogram (autocorrelation plot) confirms this results, since the time series residuals show low correlations with lagged residuals.

The model produces a satisfactory fit that describes the underlying time series data well enough to forecast future values. Although the fit so far appears to be fine, a better fit could be achieved with a more complex model, i.e. a better model might have been von when increasing the parameter space for the grid search.

Forecast

It is easy to forecast values using the previously fitted model.
The get_prediction and conf_int methods calculate predictions for future points in time for the previously fitted model and the confidence intervals associated with a prediction, respectively.

last 12 months

First, we get the predictions for the last 12 months, to see how the model behaves:



In [77]:

    
HISTORIC = X.shape[0]
PRED = 12



In [78]:

    
last12m = results.get_prediction(start=HISTORIC-PRED+1, 
                                 end=HISTORIC,
                                 dynamic=False, full_results=True)

The dynamic=False argument ensures that we produce one-step ahead forecasts, meaning that forecasts at each point are generated using the full history up to that point.



In [79]:

    
fig,ax = plt.subplots()
ax.plot(X.YearMonth[-PRED:], X.LogMilk[-PRED:], label="Observed")
ax.plot(X.YearMonth[-PRED:], last12m.predicted_mean, label='Modelled', color='#ff0066');
  # avoid overlapping of ticks on x ax
plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')

ax.set_xlabel("Time")
ax.legend(loc='upper right')
fig.suptitle("Last year milk production");



In [80]:

    
rmse = ((last12m.predicted_mean - X.LogMilk[-PRED:]) ** 2).mean()
rmse









    Out[80]:





0.0021989614592391187

Both the plot and the RMSE are quite good.

Forecast

To get the forecast we can call the get_forecast function:



In [81]:

    
FORECAST = 24
next24m = results.get_forecast(steps=FORECAST)
next24m_ci = next24m.conf_int()

There is a point forecast (the expected value) along with 95 percent confidence intervals. Notice, that the confidence intervals generally get wider for forecasts further out in time. It is hardly suprising that the forecast has more uncertaintly as time increases from the present.

Finally, we can plot the forecast but need first to create new dataframe with the related dates, for convenience.



In [82]:

    
futureDates = pd.date_range(start = '2014-01-01', periods=FORECAST, freq='MS')



In [83]:

    
futureDF = pd.DataFrame(next24m.predicted_mean, index=futureDates)



In [84]:

    
futureDF.head()



In [85]:

    
next24m_ci.index = futureDates



In [86]:

    
fig,ax = plt.subplots()
ax.plot(X.YearMonth, X.LogMilk, label="Observed")
ax.plot(futureDF, label='Forecast', color='#ff0066');
  # draw confidence bound
ax.fill_between(next24m_ci.index, 
                next24m_ci.iloc[:, 0], 
                next24m_ci.iloc[:, 1], color='#ff0066', alpha=.25);
 
ax.set_xlabel("Year")
ax.set_ylabel("Milk")

ax.legend(loc='upper left')
fig.suptitle("Milk production with forecast");

The original time series of milk production is show in blue in the plot above. The forecast is show in red. The 95 percent confidence intervals are shown in lighter shades of red.

It is obvious that, the further we try to extrapolate the time series evolution into the future, the less confident our prediction becomes. Hence, the confidence bound widens in the course of time.

You can try yourself to decompose the Icecream.Prod column of the diary data frame and answer the following questions:
-Does icecream production have a noticable seasonal component or are the values all close to the average over time? Is there a strong seasonal component for icecream production?
-Does the acf plot indicate that the remainder series is stationary?
-Do the values in the histogram have an approximately normal distribution?
-Does the interquartile range for each month overlap, indicating that the decomposition has produced a reasonably good model of the seasonal variation?

	r	s	t
1	-0.024712	0.024712	0.747635
2	-0.039119	-0.049960	0.747635
3	0.000099	0.067840	0.703096
4	-0.008344	0.038724	0.725742
5	-0.000018	0.067337	0.733337

	0
2014-01-01	1.256764
2014-02-01	1.185427
2014-03-01	1.298966
2014-04-01	1.275778
2014-05-01	1.305538

	Year.Month	Month.Number	Year	Month	Cotagecheese.Prod	Icecream.Prod	Milk.Prod	N.CA.Fat.Price
1	1995.01	1	1995	Jan	4.370	51.595	2.112	0.9803
2	1995.02	2	1995	Feb	3.695	56.086	1.932	0.8924
3	1995.03	3	1995	Mar	4.538	68.453	2.162	0.8924
4	1995.04	4	1995	Apr	4.280	65.722	2.130	0.8967
5	1995.05	5	1995	May	4.470	73.730	2.227	0.8967

	MonthNumber	CottageCheeseProd	IceCreamProd	MilkProd	Price	YearMonth
224	8	2.169	78.382	3.400	1.6273	2013-08-01
225	9	2.193	74.871	3.197	1.5519	2013-09-01
226	10	2.284	70.011	3.338	1.5566	2013-10-01
227	11	1.865	61.394	3.275	1.7537	2013-11-01
228	12	2.345	53.653	3.473	1.6747	2013-12-01

	MonthNumber	CottageCheeseProd	IceCreamProd	MilkProd	Price	YearMonth
217	1	2.282	62.612	3.462	1.7744	2013-01-01
218	2	1.941	66.274	3.231	1.6224	2013-02-01
219	3	2.150	75.770	3.676	1.6968	2013-03-01
220	4	2.217	76.396	3.622	1.7444	2013-04-01
221	5	2.341	80.899	3.720	1.8552	2013-05-01
222	6	2.694	92.091	3.489	1.8314	2013-06-01
223	7	2.433	84.026	3.373	1.6923	2013-07-01
224	8	2.169	78.382	3.400	1.6273	2013-08-01
225	9	2.193	74.871	3.197	1.5519	2013-09-01
226	10	2.284	70.011	3.338	1.5566	2013-10-01
227	11	1.865	61.394	3.275	1.7537	2013-11-01
228	12	2.345	53.653	3.473	1.6747	2013-12-01

	1	2	3	4	5
1	1	0	0	0	0
2	0	1	0	0	0
3	0	0	1	0	0
4	0	0	0	1	0
5	0	0	0	0	1

	LogMilkNoTrend	1	2	3	4	5
1	0.000000	1	0	0	0	0
2	-0.089080	0	1	0	0	0
3	0.067938	0	0	1	0	0
4	0.030380	0	0	0	1	0
5	0.067319	0	0	0	0	1

Dep. Variable:	y	No. Observations:	228
Model:	ARMA(0, 5)	Log Likelihood	629.357
Method:	css-mle	S.D. of innovations	0.015
Date:	Sat, 23 Jun 2018	AIC	-1244.713
Time:	13:58:41	BIC	-1220.708
Sample:	0	HQIC	-1235.028

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	-6.083e-05	0.003	-0.018	0.985	-0.007	0.006
ma.L1.y	0.7585	0.074	10.264	0.000	0.614	0.903
ma.L2.y	0.6238	0.086	7.278	0.000	0.456	0.792
ma.L3.y	0.4338	0.070	6.205	0.000	0.297	0.571
ma.L4.y	0.2543	0.073	3.488	0.001	0.111	0.397
ma.L5.y	0.2421	0.077	3.139	0.002	0.091	0.393

Ljung-Box (Q):	105.30	Jarque-Bera (JB):	8.26
Prob(Q):	0.00	Prob(JB):	0.02
Heteroskedasticity (H):	0.81	Skew:	0.08
Prob(H) (two-sided):	0.39	Kurtosis:	3.95

Ljung-Box (Q):	45.23	Jarque-Bera (JB):	27.85
Prob(Q):	0.26	Prob(JB):	0.00
Heteroskedasticity (H):	0.84	Skew:	0.02
Prob(H) (two-sided):	0.45	Kurtosis:	4.76

	1	2	3	4	5
1	1	0	0	0	0
2	0	1	0	0	0
3	0	0	1	0	0
4	0	0	0	1	0
5	0	0	0	0	1

	1	2	3	4	5
1	1	0	0	0	0
2	0	1	0	0	0
3	0	0	1	0	0
4	0	0	0	1	0
5	0	0	0	0	1