Time series models are used in a wide range of applications, particularly for forecasting, which is the goal of this example, performed in four steps:
The dataset is the production amount of several diary products in California, month by month, for 18 years.
Our goal: forecast the next year production for one of those products: milk.
As usual, first we load the data into a Pandas dataframe.
In [1]:
import pandas as pd
X = pd.read_csv('../datasets/cadairydata.csv', index_col=0)
X.head()
Out[1]:
In [2]:
X.info()
The first step is to have a Pandas date column which could be used as the time index.
The fastest way is to convert year and month to strings, then create a new Pandas datetime column that combine them.
In [3]:
X["Year"] = X["Year"].astype(str)
In [4]:
X["Month.Number"] = X["Month.Number"].astype(str)
In [5]:
X["YearMonth"] = pd.to_datetime(X["Year"] + X["Month.Number"], format="%Y%m") # new column
Now can drop the other date columns:
In [6]:
X.drop(["Year","Month", "Year.Month"], axis=1, inplace=True)
and I prefer to rename some of the other columns, for clarity:
In [7]:
X.rename(columns={'Cotagecheese.Prod':'CottageCheeseProd',
'Icecream.Prod':'IceCreamProd',
'Milk.Prod':'MilkProd',
'N.CA.Fat.Price':'Price',
'Month.Number' : 'MonthNumber'},
inplace=True)
In [8]:
X.head()
Out[8]:
In [9]:
X.tail()
Out[9]:
In [10]:
X.info()
Now, we can perform both arithmetic and logical operations on datetime objects ("YearMonth").
As an example, here is how to select the last 12 months of dairy data:
In [11]:
X[X.YearMonth > "2012-12-01"]
Out[11]:
The Milk prpduction should be in million of pounds.
In [12]:
X.isnull().values.any()
Out[12]:
There are no NaNs. The dataset is ready to be explored.
First useful thing to do is always to plot the data:
In [13]:
import matplotlib.pyplot as plt
%matplotlib inline
In [14]:
fig,ax = plt.subplots()
ax.plot(X.YearMonth, X.MilkProd)
ax.grid(True)
ax.set_xlabel("Year")
ax.set_ylabel("Milk")
fig.suptitle("Time series of milk production");
For most of the time period shown, the production of milk increased year over year. However, there is a decline in milk production starting in 2009 as a results of the recession. Also, notice that this time series exhibits a strong seasonal component with an annual cycle.
Since the time series has both a visible trend and a seasonality it is clearly a non-stationary series.
Autocorrelation is a fundamental property of time series. The Autocorrelation Function or ACF provides information on the dependency of the time series values of previous values.
In [15]:
import statsmodels.graphics.tsaplots as sm_pl
In [16]:
sm_pl.plot_acf(X.MilkProd, lags=24);
Lags are the months in this case.
Note that the values of the ACF at the various lags decays only slowly. This indicates there is considerable auto correlation between the time series values at the various lags, mostly likely from the trend.
Plotting a histogram provides information on the distribution of values of the time series.
In [17]:
fig,ax = plt.subplots()
ax.hist(X.MilkProd, bins=40)
ax.grid(True)
ax.set_xlabel("Production")
ax.set_ylabel("Frequency")
fig.suptitle("Distribution of Milk Production");
The histogram of the full milk production time series shows considerable dispersion. Again this behavior is likely the result of the trend.
From the plot is pretty clear that there is trend and seasonality, therefore is not stationary. Anyway, can be checked.
In [18]:
rollingMean = X.MilkProd.rolling(window=12).mean()
rollingStd = X.MilkProd.rolling(window=12).std()
In [19]:
fig,ax1 = plt.subplots()
ax1.plot(X.MilkProd, color="black", label="Milk production")
ax1.plot(rollingMean, color="red", label="Average")
ax1.grid(True)
ax1.set_xlabel("Time in months")
ax1.set_ylabel("Milk")
ax1.legend(loc="upper left")
ax2 = ax1.twinx() # Create a twin Axes sharing the xaxis
ax2.plot(rollingStd, color = "blue", label = "StD")
ax2.set_ylabel("Standard Deviation")
ax2.legend(loc="lower right")
ax2.set_ylim(0,0.5)
fig.suptitle("Milk - Average & Standard Deviation");
The mean is clearly increasing over time (the production is rising) but also the standard deviation of milk production increased in the time frame.
Time series are typically decomposed into three components: trend, seasonal, and the remainders, or residuals.
Trend can be modeled by several methods. We will start by decomposing the time series using a simple moving average model.
The code in the cell below uses moving window method to compute the average of the time series over specified span, or order of the operator. As the moving window operator moves over the data, the average of the values in the windows is calculated.
In [20]:
import numpy as np
def movingAverage(values, order):
end = len(values)
out = np.zeros(len(values))
out[0] = values[1]
for i in range(1,end):
if (i - order <= 1):
j = 1
else:
j += 1
out[i] = sum(values[j-1:i]) / (i-j+1)
return out
One of the best ways to make a series stationary on variance is through transforming the original series through log transform.
We go back to our original data series and add a new column, the logarithm of the milk production:
In [21]:
X['LogMilk'] = pd.Series(np.log(X.MilkProd), index = X.index)
In [22]:
X.head()
Out[22]:
Now we perform a 12 months moving average, to extract the trend:
In [23]:
trendM = movingAverage(X.LogMilk, 12)
In [24]:
fig,ax = plt.subplots()
ax.plot(X.LogMilk, color="black", label="Milk production")
ax.plot(trendM, color="blue", label="Trend")
ax.grid(True)
ax.set_xlabel("Time in months")
ax.set_ylabel("Log(Milk)")
ax.legend(loc="upper left")
fig.suptitle("Milk & trend - log");
In [25]:
import statsmodels.nonparametric.smoothers_lowess as sm
trendL = sm.lowess(X.LogMilk, X.YearMonth, frac=0.25,return_sorted=False)
In [26]:
fig,ax = plt.subplots()
ax.plot(X.YearMonth, X.LogMilk)
ax.plot(X.YearMonth, trendM, label='Trend from Moving Average')
ax.plot(X.YearMonth, trendL, label="Trend from Lowess")
ax.set_xlabel("Year")
ax.set_ylabel("Log(Milk)")
ax.legend(loc="lower right")
fig.suptitle("Trends extracted from milk production");
The time series charts show the original time series along with the two trends. The trend from Lowess regression is a bit smoother than the one was obtained with the simple moving average decomposition.
In [27]:
monthsOHE = pd.get_dummies(X.MonthNumber)
monthsOHE.head()
Out[27]:
The order doesn't matter in this case (October and November comes before February).
In [28]:
prodWithoutTrend = X.LogMilk - trendM # the difference
In [29]:
fig,ax = plt.subplots()
ax.plot(prodWithoutTrend)
ax.grid(True)
ax.set_xlabel("Time in months")
ax.set_ylabel("Milk production without trend")
fig.suptitle("Difference looks more stationary");
The milk production - when the trend is removed - seems more stationary, without any immediate pattern.
We create now a temporary data frame to perform the regression:
In [30]:
# a new tmp data frame for decomposition
decX = pd.DataFrame({'LogMilkNoTrend': prodWithoutTrend})
decX = decX.join(monthsOHE)
In [31]:
decX.head()
Out[31]:
The regression is applied to the production without trend:
In [32]:
from sklearn import linear_model
model = linear_model.LinearRegression(fit_intercept = False)
In [33]:
model.fit(monthsOHE, decX.LogMilkNoTrend)
Out[33]:
It's a model with 12 (one for each month, keep in mind that they are not in chronological order) variables:
In [34]:
model.coef_
Out[34]:
Finally, we get the seasonality by predicting the value month by month:
In [35]:
seasonalM = model.predict(monthsOHE)
If we remove the seasonality from the difference between milk production and the trend, we will get the residuals or remainders:
In [36]:
remainderM = prodWithoutTrend - seasonalM
Finally, we put all together in a data fram that contains all decomposition elements: trend, seasonality and residuals
In [37]:
# a data frame with all decompositions
decomposed = pd.DataFrame({'t': trendM, 's': seasonalM, 'r': remainderM})
In [38]:
decomposed.head()
Out[38]:
In [39]:
fig,ax = plt.subplots(4, sharex=True, figsize=(15,15))
ax[0].plot(X.MilkProd)
ax[0].set_title('Original')
ax[1].plot(decomposed.t)
ax[1].set_title('Trend')
ax[2].plot(decomposed.s)
ax[2].set_title('Seasonality')
ax[3].plot(decomposed.r)
ax[3].set_title('Residuals');
As we have seen, the trend is of an increasing milk production, beside the last months when it is stable; while there is a clear yearly seasonality.
The remainders in an ideal situation are like white noise, i.e. every pattern has been removed and what is left is only the random variability.
In [40]:
from statsmodels.tsa.seasonal import seasonal_decompose
# need to pass the time series as array (values) because of error
# in statsmodels, should be fixed in next version 0.9
decomposition = seasonal_decompose(X.MilkProd.values, freq=12)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
In [41]:
fig,ax = plt.subplots(4, sharex=True, figsize=(15,15))
ax[0].plot(X.MilkProd)
ax[0].set_title('Original')
ax[1].plot(trend)
ax[1].set_title('Trend')
ax[2].plot(seasonal)
ax[2].set_title('Seasonality')
ax[3].plot(residual)
ax[3].set_title('Residuals');
You can see the trend and seasonal components clearly separated in the above plots. The remainder plot looks fairly random, as expected.
In [42]:
sm_pl.plot_acf(decomposed.r, lags=24);
In [43]:
sm_pl.plot_pacf(decomposed.r, lags=24);
The ACF has at least 5 significant lag values, indicating the remainder is NOT, in fact, stationary.
Compared to the behavior of the ACF for the simple moving average decomposition, the behavior of the remainder is improved.
Now we plot the histogram of the remainder:
In [44]:
fig,ax = plt.subplots()
ax.hist(decomposed.r, bins=40)
ax.grid(True)
ax.set_xlabel("Residuals")
ax.set_ylabel("Frequency")
fig.suptitle("Distribution of Residuals");
The distribution of the remainder values is much closer to a Normal distribution than for the original time series we created earlier. This result, combined with the ACF plot, indicates that the decomposition is effective.
We can further investigate the remainder (non-seasonal residual) component by making a box plot by month of the year:
In [45]:
remaindersByMonth = pd.DataFrame({'resid': decomposed.r, 'month': X["MonthNumber"]})
In [46]:
remaindersByMonth.boxplot(by='month');
The remainder component shows only limited variation from month to month. The differences are within the interquartile range, indicating that the seasonal model is a reasonably good fit.
Now that we have explored the decomposition of the time series we will now construct and test Autoregressive Moving Average (ARMA) models for the remainder of the time series.
We will create and test these models in three steps, first creating a moving average (MA) model, then creating an autoregressive (AR) model and finally an autoregressive moving average (ARMA) model.
The ACF of the remainder from the decomposition of the milk production time series had at least 5 significant lag values. As an inital model, we therefore now create an MA model of order 5 and examine the model summary.
In [47]:
from statsmodels.tsa.arima_model import ARIMA
In [48]:
modelMA5 = ARIMA(decomposed.r.values, order=(0,0,5))
resultsMA5 = modelMA5.fit(disp=0) # disp: do not display covergence info
resultsMA5.summary()
Out[48]:
Examining the values of the model coefficients and their standard errors (SE), we notice that the SE of the last two coefficients is the same order than the value of the coefficient itself. This may indicate that the value of this coefficient is poorly determined and could likely be set to zero.
The foregoing result indicates that the order of the MA model could be reduced. Generally, the order of an MA model is reduced in unit steps until all the coefficients appear to be significant.
We try with 3 coefficients and examine the model summary.
In [49]:
modelMA3 = ARIMA(decomposed.r.values, order=(0,0,3))
resultsMA3 = modelMA3.fit(disp=0) # disp: do not display covergence info
resultsMA3.summary()
Out[49]:
The AIC is slightly lower and the standard error compared to the magnitude of the coefficients is also slightly improved; indicates that the order of the model is more reasonable but probably not yet perfect. To test how well this model fits the data, and results in a stationary result, we plot the ACF of the residuals of the MA(3) model:
In [50]:
sm_pl.plot_acf(resultsMA3.resid, lags=24);
In [51]:
sm_pl.plot_pacf(resultsMA3.resid, lags=24);
In general, AR orders will tend to present themselves by a sharp cutoff in the PACF plot and a slow trending or sinusoidal degradation in the ACF plot. The opposite is usually true for MA orders.
In [52]:
modelAR2 = ARIMA(decomposed.r.values, order=(2,0,0))
resultsAR2 = modelAR2.fit(disp=0) # disp: do not display covergence info
resultsAR2.summary()
Out[52]:
The AIC value is the worst among the models tried.
Examining the values of the coefficients and their standard errors, we note that the standard error of the second coefficient is actually bigger than the coefficient. Clearly, the AR(2) model is over parameterized.
Next, we try an AR(1) model.
In [53]:
modelAR1 = ARIMA(decomposed.r.values, order=(1,0,0))
resultsAR1 = modelAR1.fit(disp=0) # disp: do not display covergence info
resultsAR1.summary()
Out[53]:
The standard error of the AR(1) model is an order of magnitude less than the value of the coefficient, which is promising.
In [54]:
sm_pl.plot_acf(resultsAR1.resid, lags=24);
In [55]:
sm_pl.plot_pacf(resultsAR1.resid, lags=24);
Note that only the 0 lag of the ACF is significant and that there are no significant lags for the PACF. These observations indicate that the AR(1) model is a good fit. Compare these results to those of the MA(3) model, noting that they are nearly identical. Either the MA(3) or AR(1) model could be a good choice for these data.
We have found that both MA(3) and AR(1) models are possibly good fits to the remainder series.
We will now investigate the use of autoregressive Integrated moving average (ARIMA) models on the remainder series, a combination of both MA and AR models.
As a starting point we try an ARIMA(1,3) model.
In [56]:
modelARMA = ARIMA(decomposed.r.values, order=(1,0,3))
resultsARMA = modelARMA.fit(disp=0) # disp: do not display covergence info
resultsARMA.summary()
Out[56]:
In each case, the standard error is of the same order of magnitude as the value of the coefficient, indicating this model is a poor fit to the data.
There are many guidelines and best practices to get the most suitable parameters for an ARIMA model, yet the correct parametrization of ARIMA models can be a painstaking manual process that requires domain expertise and time. Other statistical programming languages such as R provide automated ways to solve this issue, but not yet in Python.
We can solve this issue by writing Python code to programmatically select the optimal parameter values for our ARIMA(p,d,q) time series model.
We would perform a grid search, a sort of hyperparameter tuning. We can loop through non-trivial values of the ARIMA orders and see which model has the lowest Akaike Information Criterion (AIC).
In [57]:
import itertools
# Define the p and q parameters to take any value between 0 and 3
p = q = range(0, 4)
# Generate all different combinations of p, d=0 and q triplets
pdq = [(a,0,b) for a,b in itertools.product(p, q)]
In [58]:
import warnings
warnings.simplefilter("ignore")
In [59]:
best_score = np.inf
best_pdq = None
temp_model = None
for params in pdq:
try:
temp_model = ARIMA(decomposed.r.values, order=params)
resultsARMA = temp_model.fit(disp=0) # disp: do not display covergence info
if resultsARMA.aic < best_score:
best_score = resultsARMA.aic
best_pdq = params
print(params, " -> NEW best AIC: ", best_score)
except:
#print("Unexpected error:", sys.exc_info()[0])
continue
The lowest AIC is from an ARIMA(1,0) model.
In [60]:
bestAICmodel = ARIMA(decomposed.r.values, order=(1,0,0))
resultsARMA = bestAICmodel.fit(disp=0) # disp: do not display covergence info
resultsARMA.summary()
Out[60]:
The standard error of the ar1 coefficient is good.
Next, we plot the ACF and PACF of the model.
In [61]:
sm_pl.plot_acf(resultsARMA.resid, lags=24);
In [62]:
sm_pl.plot_pacf(resultsARMA.resid, lags=24);
In [63]:
fig,ax = plt.subplots()
ax.hist(resultsARMA.resid, bins=70)
ax.grid(True)
ax.set_xlabel("Residuals")
ax.set_ylabel("Frequency")
fig.suptitle("Distribution of Residuals");
In [64]:
np.mean(resultsARMA.resid)
Out[64]:
The mean in the residuals is almost zero showing that the bias has been almost totally removed.
Now we could split the initial dataset into the usual train and test sets and check how the trained model behaves on the test set.
In [65]:
# split into train and test (last two years) sets
testSize = 24
milkData = X.LogMilk.values
train, test = milkData[0:-testSize], milkData[-testSize:]
We use a one-step-ahead training. For each new observation in the test dataset, a prediction is claculated, then it is added back to the training test and used to improve the model.
Here is an example on the last model that we checked, the ARIMA(1,0).
In [66]:
history = [x for x in train]
predictions = list()
In [67]:
for t in range(len(test)):
modelARMA = ARIMA(history, order=(1,0,0))
resultsARMA = modelARMA.fit(disp=0) # disp: do not display covergence info
yhat = resultsARMA.forecast()[0][0] # one step forecast
predictions.append(yhat)
history.append(test[t])
In [68]:
rmse = np.sqrt(sum((test - predictions)**2)/testSize)
rmse
Out[68]:
The Rooted Mean Square Error (a classic metric to measure the model accuracy) is quite low.
Let's see the predictions plotted:
In [69]:
predictions.pop(0) # remove the first element
Out[69]:
In [70]:
fig,ax = plt.subplots()
ax.plot(test, label="Observed")
ax.plot(predictions, label='Predictions', color='#ff0066');
ax.set_xlabel("Months")
ax.legend(loc='upper right')
fig.suptitle("Last months milk production");
The ARMA(1,0) is not bad: the predictions are quite good and the RMSE is quite low.
In [71]:
import statsmodels.api as sm_sa
In [72]:
firstModel = sm_sa.tsa.statespace.SARIMAX(X.LogMilk.values, order=(1,1,0),
seasonal_order=(1,1,0,12))
results = firstModel.fit()
results.summary()
Out[72]:
Again we use a "grid search" to iteratively explore different combinations of parameters. For each combination of parameters, we fit a new seasonal ARIMA model with the SARIMAX() function from the statsmodels module and assess its overall quality. This time is a double loop so it will take much longer time.
In [73]:
# Define the p and q parameters to take any value between 0 and 3
p = q = range(0, 4)
sp = sq = range(0, 3)
# Generate all different combinations of p, d=1 and q triplets
pdq = [(a,1,b) for a,b in itertools.product(p, q)]
# Generate all different combinations of seasonal p, q and q triplets
seasonal_pdq = [(a, 1, b, 12) for a,b in list(itertools.product(sp, sq))]
In [87]:
warnings.simplefilter("ignore")
best_score = np.inf
best_pdq = None
best_season_pdq = None
temp_model = None
for params in pdq:
for params_seasonal in seasonal_pdq:
try:
temp_model = sm_sa.tsa.statespace.SARIMAX(X.LogMilk.values,
order = params,
seasonal_order = params_seasonal)
results = temp_model.fit()
if results.aic < best_score:
best_score = results.aic
best_pdq = params
best_season_pdq = params_seasonal
print(params, params_seasonal, " -> NEW best AIC: ", best_score)
#print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
except:
continue
The model with lowest AIC is the (1,1,1)(2,1,1) but they have all very close AIC values so it could be worth to check them more in detail one by one.
Here we just take the lowest AIC:
In [75]:
bestModel = sm_sa.tsa.statespace.SARIMAX(X.LogMilk.values, order=(1,1,1),
seasonal_order=(2,1,1,12))
results = bestModel.fit()
results.summary()
Out[75]:
The coef column shows the weight (i.e. importance) of each feature and how each one impacts the time series. The P>|z| column informs us of the significance of each feature weight. Here, a couple of weights have a p-value higher than 0.05, so it could be reasonable to remove from the model.
The results object has a convenient diagnostic method named plot_diagnostics method to get a rundown of the fitted model.
In [76]:
results.plot_diagnostics(figsize=(16, 10));
The residual plot of the fitted model in the upper left corner appears do be white noise as it does not display obvious seasonality or trend behaviour.
The histogram plot in the upper right corner pair with the kernel density estimation (red line) indicates that the time series is almost normally distributed. This is compared to the density of the standard normal distribution (green line).
The correlogram (autocorrelation plot) confirms this results, since the time series residuals show low correlations with lagged residuals.
The model produces a satisfactory fit that describes the underlying time series data well enough to forecast future values. Although the fit so far appears to be fine, a better fit could be achieved with a more complex model, i.e. a better model might have been von when increasing the parameter space for the grid search.
It is easy to forecast values using the previously fitted model.
The get_prediction and conf_int methods calculate predictions for future points in time for the previously fitted model and the confidence intervals associated with a prediction, respectively.
In [77]:
HISTORIC = X.shape[0]
PRED = 12
In [78]:
last12m = results.get_prediction(start=HISTORIC-PRED+1,
end=HISTORIC,
dynamic=False, full_results=True)
The dynamic=False argument ensures that we produce one-step ahead forecasts, meaning that forecasts at each point are generated using the full history up to that point.
In [79]:
fig,ax = plt.subplots()
ax.plot(X.YearMonth[-PRED:], X.LogMilk[-PRED:], label="Observed")
ax.plot(X.YearMonth[-PRED:], last12m.predicted_mean, label='Modelled', color='#ff0066');
# avoid overlapping of ticks on x ax
plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
ax.set_xlabel("Time")
ax.legend(loc='upper right')
fig.suptitle("Last year milk production");
In [80]:
rmse = ((last12m.predicted_mean - X.LogMilk[-PRED:]) ** 2).mean()
rmse
Out[80]:
Both the plot and the RMSE are quite good.
In [81]:
FORECAST = 24
next24m = results.get_forecast(steps=FORECAST)
next24m_ci = next24m.conf_int()
There is a point forecast (the expected value) along with 95 percent confidence intervals. Notice, that the confidence intervals generally get wider for forecasts further out in time. It is hardly suprising that the forecast has more uncertaintly as time increases from the present.
Finally, we can plot the forecast but need first to create new dataframe with the related dates, for convenience.
In [82]:
futureDates = pd.date_range(start = '2014-01-01', periods=FORECAST, freq='MS')
In [83]:
futureDF = pd.DataFrame(next24m.predicted_mean, index=futureDates)
In [84]:
futureDF.head()
Out[84]:
In [85]:
next24m_ci.index = futureDates
In [86]:
fig,ax = plt.subplots()
ax.plot(X.YearMonth, X.LogMilk, label="Observed")
ax.plot(futureDF, label='Forecast', color='#ff0066');
# draw confidence bound
ax.fill_between(next24m_ci.index,
next24m_ci.iloc[:, 0],
next24m_ci.iloc[:, 1], color='#ff0066', alpha=.25);
ax.set_xlabel("Year")
ax.set_ylabel("Milk")
ax.legend(loc='upper left')
fig.suptitle("Milk production with forecast");
The original time series of milk production is show in blue in the plot above. The forecast is show in red. The 95 percent confidence intervals are shown in lighter shades of red.
It is obvious that, the further we try to extrapolate the time series evolution into the future, the less confident our prediction becomes. Hence, the confidence bound widens in the course of time.
You can try yourself to decompose the Icecream.Prod column of the diary data frame and answer the following questions:
-Does icecream production have a noticable seasonal component or are the values all close to the average over time? Is there a strong seasonal component for icecream production?
-Does the acf plot indicate that the remainder series is stationary?
-Do the values in the histogram have an approximately normal distribution?
-Does the interquartile range for each month overlap, indicating that the decomposition has produced a reasonably good model of the seasonal variation?