5. Model the Solution

Preprocessing to get the tidy dataframe


In [1]:
# Import the libraries we need: pandas, numpy and matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.tsa.stattools import adfuller

In [2]:
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)

In [3]:
# Read the month-wise market arrivals (quantity and price) CSV file
df = pd.read_csv('MonthWiseMarketArrivals_clean.csv')

In [4]:
# Change the date column to a datetime column
df.date = pd.to_datetime(df.date)

In [5]:
# Change the index to the date column
df.index = pd.PeriodIndex(df.date, freq='M')

In [6]:
# Sort the data frame by date
df = df.sort_values(by = "date")

In [7]:
df.head()


Out[7]:
market month year quantity priceMin priceMax priceMod state city date
1996-01 LASALGAON(MS) January 1996 225063 160 257 226 MS LASALGAON 1996-01-01
1996-02 LASALGAON(MS) February 1996 196164 133 229 186 MS LASALGAON 1996-02-01
1996-03 LASALGAON(MS) March 1996 178992 155 274 243 MS LASALGAON 1996-03-01
1996-04 LASALGAON(MS) April 1996 192592 136 279 254 MS LASALGAON 1996-04-01
1996-05 LASALGAON(MS) May 1996 237574 154 312 269 MS LASALGAON 1996-05-01

Question 3: Can we forecast the price of Onion in Bangalore?

Get the priceMod for Bangalore Market


In [8]:
dfBang = df.loc[df.city == "BANGALORE"].copy()

In [9]:
dfBang.head()


Out[9]:
market month year quantity priceMin priceMax priceMod state city date
2004-01 BANGALORE January 2004 227832 916 1066 991 KNT BANGALORE 2004-01-01
2004-02 BANGALORE February 2004 225133 741 870 793 KNT BANGALORE 2004-02-01
2004-03 BANGALORE March 2004 221952 527 586 556 KNT BANGALORE 2004-03-01
2004-04 BANGALORE April 2004 185150 419 518 465 KNT BANGALORE 2004-04-01
2004-05 BANGALORE May 2004 137390 400 516 455 KNT BANGALORE 2004-05-01

In [10]:
# Drop redundant columns
dfBang = dfBang.drop(["market", "month", "year", "state", "city", "priceMin", "priceMax"], axis = 1)

In [11]:
dfBang.head()


Out[11]:
quantity priceMod date
2004-01 227832 991 2004-01-01
2004-02 225133 793 2004-02-01
2004-03 221952 556 2004-03-01
2004-04 185150 465 2004-04-01
2004-05 137390 455 2004-05-01

In [12]:
dfBang.priceMod.plot()


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x117c53748>

In [13]:
dfBang.quantity.plot()


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x117c29cc0>

Transformation - Log

Transformations such as logarithms can help to stabilize the variance of a time series.


In [14]:
dfBang.priceMod.plot(kind = "hist", bins = 30)


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x117e5d2e8>

In [15]:
dfBang['priceModLog'] = np.log(dfBang.priceMod)
dfBang.head()


Out[15]:
quantity priceMod date priceModLog
2004-01 227832 991 2004-01-01 6.898715
2004-02 225133 793 2004-02-01 6.675823
2004-03 221952 556 2004-03-01 6.320768
2004-04 185150 465 2004-04-01 6.142037
2004-05 137390 455 2004-05-01 6.120297

In [16]:
dfBang.priceModLog.plot(kind = "hist", bins = 30)


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x11aa8cb38>

In [17]:
dfBang.priceModLog.plot()


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x11aab69b0>

Basic Time Series Model

We will build a time-series forecasting model to get a forecast for Onion prices. Let us start with the three most basic models -

  1. Mean Constant Model
  2. Linear Trend Model
  3. Random Walk Model

Mean Model

This very simple forecasting model uses the mean of the series as the forecast for every period; we will call it the "mean model".


In [18]:
model_mean_pred = dfBang.priceModLog.mean()

In [19]:
# Let us store this as our Mean Prediction Value
dfBang["priceMean"] = np.exp(model_mean_pred)

In [20]:
dfBang.plot(kind="line", x="date", y = ["priceMod", "priceMean"])


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b8f2c18>

Can we measure the error rate?

We will use Root Mean Squared Error (RMSE) to calculate our error values

$RMSE = \sqrt{ \sum{(\hat{y} - y)^2} / n }$ , where $\hat{y}$ is the predicted value of $y$ and $n$ is the number of observations


In [21]:
def RMSE(predicted, actual):
    mse = (predicted - actual)**2
    rmse = np.sqrt(mse.sum()/mse.count())
    return rmse

In [22]:
model_mean_RMSE = RMSE(dfBang.priceMean, dfBang.priceMod)
model_mean_RMSE


Out[22]:
683.9533483996255

In [23]:
# Save this in a dataframe
dfBangResults = pd.DataFrame(columns = ["Model", "Forecast", "RMSE"])
dfBangResults.head()


Out[23]:
Model Forecast RMSE

In [24]:
dfBangResults.loc[0,"Model"] = "Mean"
dfBangResults.loc[0,"Forecast"] = np.exp(model_mean_pred)
dfBangResults.loc[0,"RMSE"] = model_mean_RMSE
dfBangResults.head()


Out[24]:
Model Forecast RMSE
0 Mean 884.566 683.953

Linear Trend Model

Let us start by plotting a linear trend model between priceModLog and time.

However, to do linear regression we need a numeric indicator for the time period. Let us create that.


In [25]:
dfBang.head()


Out[25]:
quantity priceMod date priceModLog priceMean
2004-01 227832 991 2004-01-01 6.898715 884.565812
2004-02 225133 793 2004-02-01 6.675823 884.565812
2004-03 221952 556 2004-03-01 6.320768 884.565812
2004-04 185150 465 2004-04-01 6.142037 884.565812
2004-05 137390 455 2004-05-01 6.120297 884.565812

In [26]:
dfBang.dtypes


Out[26]:
quantity                int64
priceMod                int64
date           datetime64[ns]
priceModLog           float64
priceMean             float64
dtype: object

In [27]:
# What is the starting month of our data
dfBang.date.min()


Out[27]:
Timestamp('2004-01-01 00:00:00')

In [28]:
# Convert date to a timedelta starting from zero
dfBang["timeIndex"] = dfBang.date - dfBang.date.min()

In [29]:
dfBang.head()


Out[29]:
quantity priceMod date priceModLog priceMean timeIndex
2004-01 227832 991 2004-01-01 6.898715 884.565812 0 days
2004-02 225133 793 2004-02-01 6.675823 884.565812 31 days
2004-03 221952 556 2004-03-01 6.320768 884.565812 60 days
2004-04 185150 465 2004-04-01 6.142037 884.565812 91 days
2004-05 137390 455 2004-05-01 6.120297 884.565812 121 days

In [30]:
dfBang.dtypes


Out[30]:
quantity                 int64
priceMod                 int64
date            datetime64[ns]
priceModLog            float64
priceMean              float64
timeIndex      timedelta64[ns]
dtype: object

In [31]:
# Convert the timedelta to months by dividing by one month
dfBang["timeIndex"] =  dfBang["timeIndex"]/np.timedelta64(1, 'M')

In [32]:
dfBang.timeIndex.head()


Out[32]:
2004-01    0.000000
2004-02    1.018501
2004-03    1.971293
2004-04    2.989794
2004-05    3.975441
Freq: M, Name: timeIndex, dtype: float64

In [33]:
# Round to whole months and convert to integer
dfBang["timeIndex"] = dfBang["timeIndex"].round(0).astype(int)

In [34]:
dfBang.timeIndex.tail()


Out[34]:
2015-10    141
2015-11    142
2015-12    143
2016-01    144
2016-02    145
Freq: M, Name: timeIndex, dtype: int64

In [35]:
dfBang.head()


Out[35]:
quantity priceMod date priceModLog priceMean timeIndex
2004-01 227832 991 2004-01-01 6.898715 884.565812 0
2004-02 225133 793 2004-02-01 6.675823 884.565812 1
2004-03 221952 556 2004-03-01 6.320768 884.565812 2
2004-04 185150 465 2004-04-01 6.142037 884.565812 3
2004-05 137390 455 2004-05-01 6.120297 884.565812 4

In [36]:
## Now fit a linear regression of priceModLog on timeIndex
model_linear = smf.ols('priceModLog ~ timeIndex', data = dfBang).fit()

In [37]:
model_linear.summary()


Out[37]:
OLS Regression Results
Dep. Variable: priceModLog R-squared: 0.493
Model: OLS Adj. R-squared: 0.489
Method: Least Squares F-statistic: 139.8
Date: Tue, 22 Mar 2016 Prob (F-statistic): 5.75e-23
Time: 22:04:11 Log-Likelihood: -72.317
No. Observations: 146 AIC: 148.6
Df Residuals: 144 BIC: 154.6
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6.1121 0.066 92.830 0.000 5.982 6.242
timeIndex 0.0093 0.001 11.823 0.000 0.008 0.011
Omnibus: 4.750 Durbin-Watson: 0.384
Prob(Omnibus): 0.093 Jarque-Bera (JB): 4.739
Skew: 0.405 Prob(JB): 0.0935
Kurtosis: 2.648 Cond. No. 167.

In [38]:
## Parameters for y = mx + c equation
model_linear.params


Out[38]:
Intercept    6.112108
timeIndex    0.009283
dtype: float64

In [39]:
c = model_linear.params["Intercept"]
c


Out[39]:
6.112108132763634

In [40]:
m = model_linear.params["timeIndex"]
m


Out[40]:
0.0092826039180399222

In [41]:
model_linear_pred = model_linear.predict()

In [42]:
model_linear_pred


Out[42]:
array([ 6.11210813,  6.12139074,  6.13067334,  6.13995594,  6.14923855,
        6.15852115,  6.16780376,  6.17708636,  6.18636896,  6.19565157,
        6.20493417,  6.21421678,  6.22349938,  6.23278198,  6.24206459,
        6.25134719,  6.2606298 ,  6.2699124 ,  6.279195  ,  6.28847761,
        6.29776021,  6.30704282,  6.31632542,  6.32560802,  6.33489063,
        6.34417323,  6.35345583,  6.36273844,  6.37202104,  6.38130365,
        6.39058625,  6.39986885,  6.40915146,  6.41843406,  6.42771667,
        6.43699927,  6.44628187,  6.45556448,  6.46484708,  6.47412969,
        6.48341229,  6.49269489,  6.5019775 ,  6.5112601 ,  6.52054271,
        6.52982531,  6.53910791,  6.54839052,  6.55767312,  6.56695572,
        6.57623833,  6.58552093,  6.59480354,  6.60408614,  6.61336874,
        6.62265135,  6.63193395,  6.64121656,  6.65049916,  6.65978176,
        6.66906437,  6.67834697,  6.68762958,  6.69691218,  6.70619478,
        6.71547739,  6.72475999,  6.7340426 ,  6.7433252 ,  6.7526078 ,
        6.76189041,  6.77117301,  6.78045561,  6.78973822,  6.79902082,
        6.80830343,  6.81758603,  6.82686863,  6.83615124,  6.84543384,
        6.85471645,  6.86399905,  6.87328165,  6.88256426,  6.89184686,
        6.90112947,  6.91041207,  6.91969467,  6.92897728,  6.93825988,
        6.94754249,  6.95682509,  6.96610769,  6.9753903 ,  6.9846729 ,
        6.9939555 ,  7.00323811,  7.01252071,  7.02180332,  7.03108592,
        7.04036852,  7.04965113,  7.05893373,  7.06821634,  7.07749894,
        7.08678154,  7.09606415,  7.10534675,  7.11462936,  7.12391196,
        7.13319456,  7.14247717,  7.15175977,  7.16104238,  7.17032498,
        7.17960758,  7.18889019,  7.19817279,  7.2074554 ,  7.216738  ,
        7.2260206 ,  7.23530321,  7.24458581,  7.25386841,  7.26315102,
        7.27243362,  7.28171623,  7.29099883,  7.30028143,  7.30956404,
        7.31884664,  7.32812925,  7.33741185,  7.34669445,  7.35597706,
        7.36525966,  7.37454227,  7.38382487,  7.39310747,  7.40239008,
        7.41167268,  7.42095529,  7.43023789,  7.43952049,  7.4488031 ,
        7.4580857 ])

In [43]:
# Plot the prediction line
dfBang.plot(kind="line", x="timeIndex", y = "priceModLog")
plt.plot(dfBang.timeIndex,model_linear_pred, '-')


Out[43]:
[<matplotlib.lines.Line2D at 0x11bd62908>]

In [44]:
model_linear.resid.plot(kind = "bar")


Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x11bdc3390>

Is this a good model?

What measures can we check to see if the model is good?

It is seen here (and also evident on the regression line plot, if you look closely) that the linear trend model has a tendency to make an error of the same sign for many periods in a row. This tendency is measured in statistical terms by the lag-1 autocorrelation and Durbin-Watson statistic. If there is no time pattern, the lag-1 autocorrelation should be very close to zero, and the Durbin-Watson statistic ought to be very close to 2, which is not the case here. If the model has succeeded in extracting all the "signal" from the data, there should be no pattern at all in the errors: the error in the next period should not be correlated with any previous errors. The linear trend model obviously fails the autocorrelation test in this case.
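
We can check both diagnostics directly on the residuals. A minimal sketch, assuming the fitted model_linear from above (durbin_watson is the statsmodels helper; autocorr is the pandas Series method):

In [ ]:
# Diagnostics for the linear trend residuals (sketch)
from statsmodels.stats.stattools import durbin_watson

resid = model_linear.resid
print(resid.autocorr(lag=1))   # lag-1 autocorrelation: near 0 for white noise
print(durbin_watson(resid))    # Durbin-Watson statistic: near 2 for white noise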


In [45]:
# Manual Calculation
model_linear_forecast_manual = m * 146 + c
model_linear_forecast_manual


Out[45]:
7.4673683047974624

In [46]:
# Using Predict Function
model_linear_forecast_auto = model_linear.predict(exog = dict(timeIndex=146))
model_linear_forecast_auto


Out[46]:
array([ 7.4673683])

In [47]:
dfBang["priceLinear"] = np.exp(model_linear_pred)

In [48]:
dfBang.head()


Out[48]:
quantity priceMod date priceModLog priceMean timeIndex priceLinear
2004-01 227832 991 2004-01-01 6.898715 884.565812 0 451.289090
2004-02 225133 793 2004-02-01 6.675823 884.565812 1 455.497732
2004-03 221952 556 2004-03-01 6.320768 884.565812 2 459.745622
2004-04 185150 465 2004-04-01 6.142037 884.565812 3 464.033127
2004-05 137390 455 2004-05-01 6.120297 884.565812 4 468.360617

In [49]:
# Root Mean Squared Error (RMSE)
model_linear_RMSE = RMSE(dfBang.priceLinear, dfBang.priceMod)
model_linear_RMSE


Out[49]:
518.52360758414704

In [50]:
dfBangResults.loc[1,"Model"] = "Linear"
dfBangResults.loc[1,"Forecast"] = np.exp(model_linear_forecast_manual)
dfBangResults.loc[1,"RMSE"] = model_linear_RMSE
dfBangResults.head()


Out[50]:
Model Forecast RMSE
0 Mean 884.566 683.953
1 Linear 1750 518.524

In [51]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear"])


Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c77e9b0>

Linear Model with Regressor


In [52]:
## Now fit a linear regression of priceModLog on timeIndex and log(quantity)
model_linear_quantity = smf.ols('priceModLog ~ timeIndex + np.log(quantity)', data = dfBang).fit()

In [53]:
model_linear_quantity.summary()


Out[53]:
OLS Regression Results
Dep. Variable: priceModLog R-squared: 0.509
Model: OLS Adj. R-squared: 0.502
Method: Least Squares F-statistic: 74.16
Date: Tue, 22 Mar 2016 Prob (F-statistic): 8.00e-23
Time: 22:04:15 Log-Likelihood: -69.892
No. Observations: 146 AIC: 145.8
Df Residuals: 143 BIC: 154.7
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 4.1813 0.881 4.746 0.000 2.440 5.923
timeIndex 0.0084 0.001 9.774 0.000 0.007 0.010
np.log(quantity) 0.1529 0.070 2.198 0.030 0.015 0.290
Omnibus: 5.097 Durbin-Watson: 0.412
Prob(Omnibus): 0.078 Jarque-Bera (JB): 4.952
Skew: 0.403 Prob(JB): 0.0841
Kurtosis: 2.593 Cond. No. 2.29e+03

In [54]:
dfBang["priceLinearQuantity"] = np.exp(model_linear_quantity.predict())

In [55]:
dfBang.plot(kind = "line", x="timeIndex", y = "quantity")
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", 
                                             "priceLinear", "priceLinearQuantity"])


Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d6ca748>

Random Walk Model

When faced with a time series that shows irregular growth, the best strategy may not be to try to directly predict the level of the series at each period (i.e., the quantity Yt). Instead, it may be better to try to predict the change that occurs from one period to the next (i.e., the quantity Yt - Yt-1). That is, it may be better to look at the first difference of the series, to see if a predictable pattern can be found there. For purposes of one-period-ahead forecasting, it is just as good to predict the next change as to predict the next level of the series, since the predicted change can be added to the current level to yield a predicted level. The simplest case of such a model is one that always predicts that the next change will be zero, as if the series is equally likely to go up or down in the next period regardless of what it has done in the past.

Random Walk Model $$ \hat{Y_t} = Y_{t-1} + \epsilon \\$$

Random Walk Model with drift $$ \hat{Y_t} = Y_{t-1} + c + \epsilon \\$$


In [56]:
dfBang["priceModLogShift1"] = dfBang.priceModLog.shift()

In [57]:
dfBang.head()


Out[57]:
quantity priceMod date priceModLog priceMean timeIndex priceLinear priceLinearQuantity priceModLogShift1
2004-01 227832 991 2004-01-01 6.898715 884.565812 0 451.289090 431.410971 NaN
2004-02 225133 793 2004-02-01 6.675823 884.565812 1 455.497732 434.277110 6.898715
2004-03 221952 556 2004-03-01 6.320768 884.565812 2 459.745622 437.007747 6.675823
2004-04 185150 465 2004-04-01 6.142037 884.565812 3 464.033127 428.667281 6.320768
2004-05 137390 455 2004-05-01 6.120297 884.565812 4 468.360617 413.029454 6.142037

In [58]:
dfBang.plot(kind= "scatter", y = "priceModLog", x = "priceModLogShift1", s = 50)


Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d2d4908>

In [59]:
# Let's compute the one-month (first) difference
dfBang["priceModLogDiff"] = dfBang.priceModLog - dfBang.priceModLogShift1

In [60]:
dfBang.priceModLogDiff.plot()


Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x11def0080>

In [61]:
dfBang["priceRandom"] = np.exp(dfBang.priceModLogShift1)
dfBang.head()


Out[61]:
quantity priceMod date priceModLog priceMean timeIndex priceLinear priceLinearQuantity priceModLogShift1 priceModLogDiff priceRandom
2004-01 227832 991 2004-01-01 6.898715 884.565812 0 451.289090 431.410971 NaN NaN NaN
2004-02 225133 793 2004-02-01 6.675823 884.565812 1 455.497732 434.277110 6.898715 -0.222891 991
2004-03 221952 556 2004-03-01 6.320768 884.565812 2 459.745622 437.007747 6.675823 -0.355055 793
2004-04 185150 465 2004-04-01 6.142037 884.565812 3 464.033127 428.667281 6.320768 -0.178731 556
2004-05 137390 455 2004-05-01 6.120297 884.565812 4 468.360617 413.029454 6.142037 -0.021740 465

In [62]:
dfBang.tail()


Out[62]:
quantity priceMod date priceModLog priceMean timeIndex priceLinear priceLinearQuantity priceModLogShift1 priceModLogDiff priceRandom
2015-10 1612160 2215 2015-10-01 7.703008 884.565812 141 1670.628673 1913.491280 8.051978 -0.348970 3140
2015-11 1071872 1618 2015-11-01 7.388946 884.565812 142 1686.208656 1812.993037 7.703008 -0.314062 2215
2015-12 513186 1343 2015-12-01 7.202661 884.565812 143 1701.933936 1633.680427 7.388946 -0.186285 1618
2016-01 507223 1448 2016-01-01 7.277939 884.565812 144 1717.805867 1644.591721 7.202661 0.075277 1343
2016-02 400359 1101 2016-02-01 7.003974 884.565812 145 1733.825817 1599.625994 7.277939 -0.273964 1448

In [63]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod","priceRandom"])


Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d6caf98>

In [64]:
# Root Mean Squared Error (RMSE)
model_random_RMSE = RMSE(dfBang.priceRandom, dfBang.priceMod)
model_random_RMSE


Out[64]:
323.59240006420174

In [65]:
dfBangResults.loc[2,"Model"] = "Random"
dfBangResults.loc[2,"Forecast"] = np.exp(dfBang.priceModLogShift1[-1])
dfBangResults.loc[2,"RMSE"] = model_random_RMSE
dfBangResults.head()


Out[65]:
Model Forecast RMSE
0 Mean 884.566 683.953
1 Linear 1750 518.524
2 Random 1448 323.592

In [66]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear", "priceRandom"])


Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ce31ef0>

Advanced Model

Most time-series models work on the assumption that the time series is stationary. Intuitively, if a time series has shown a particular behaviour over time, there is a very high probability that it will continue to behave the same way in the future. Also, the theory for stationary series is more mature and easier to implement than that for non-stationary series.

Statistical stationarity: A stationary time series is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time. Most statistical forecasting methods are based on the assumption that the time series can be rendered approximately stationary (i.e., "stationarized") through the use of mathematical transformations. A stationarized series is relatively easy to predict: you simply predict that its statistical properties will be the same in the future as they have been in the past!

There are three basic criteria for a series to be classified as a stationary series:

  • The mean of the series should not be a function of time; rather, it should be a constant.

  • The variance of the series should not be a function of time. This property is known as homoscedasticity.

  • The covariance of the i-th term and the (i + m)-th term should not be a function of time.

How do we check for Stationarity in a series?

  • Plotting Rolling Statistics: We can plot the moving average or moving variance and see if it varies with time. By moving average/variance I mean that at any instant ‘t’, we’ll take the average/variance of the last year, i.e. last 12 months. But again this is more of a visual technique.
  • Dickey-Fuller Test: This is one of the statistical tests for checking stationarity. Here the null hypothesis is that the time series is non-stationary. The test results comprise a Test Statistic and some Critical Values for different confidence levels. If the ‘Test Statistic’ is less than the ‘Critical Value’, we can reject the null hypothesis and say that the series is stationary.

Augmented Dickey-Fuller Test of Stationarity

The Augmented Dickey-Fuller test can be used to test for a unit root in a univariate process in the presence of serial correlation.

$$ Y_t = \rho Y_{t-1} + \epsilon_t \\$$

$$ Y_t - Y_{t-1} = (\rho - 1) Y_{t-1} + \epsilon_t \\$$

We have to test whether $\rho - 1$ is significantly different from zero. If the null hypothesis is rejected, we conclude that the time series is stationary.


In [67]:
def adf(ts):
    
    # Determine rolling statistics
    rolmean = ts.rolling(window=12).mean()
    rolstd = ts.rolling(window=12).std()

    #Plot rolling statistics:
    orig = plt.plot(ts, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    
    # Calculate ADF factors
    adftest = adfuller(ts, autolag='AIC')
    adfoutput = pd.Series(adftest[0:4], index=['Test Statistic','p-value','# of Lags Used',
                                              'Number of Observations Used'])
    for key,value in adftest[4].items():
        adfoutput['Critical Value (%s)'%key] = value
    return adfoutput
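
As a quick sanity check, we can run this helper on the raw log-price series (a sketch, assuming dfBang from above; given the visible upward trend, we expect a much weaker test statistic here than for the detrended series tested later):

In [ ]:
adf(dfBang.priceModLog)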

How to make a Time Series Stationary?

Let's understand what makes a time series non-stationary. There are three major reasons behind non-stationarity:

  • Trend: A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear. Sometimes we will refer to a trend “changing direction” when it might go from an increasing trend to a decreasing trend.
  • Seasonal: A seasonal pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or day of the week). Seasonality is always of a fixed and known period.
  • Cyclic: A cyclic pattern exists when data exhibit rises and falls that are not of fixed period. The duration of these fluctuations is usually of at least 2 years.

$$ y_t = S_t + T_t + E_t \\$$

where $y_t$ is the data at period $t$, $S_t$ is the seasonal component, $T_t$ is the trend-cycle component and $E_t$ is the remainder (or irregular or error) component at period $t$.

Alternatively, a multiplicative model would be written as

$$ y_t=S_t*T_t*E_t \\$$

The additive model is most appropriate if the magnitude of the seasonal fluctuations or the variation around the trend-cycle does not vary with the level of the time series. When the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, then a multiplicative model is more appropriate. With economic time series, multiplicative models are common.

An alternative to using a multiplicative model is to first transform the data until the variation in the series appears to be stable over time, and then use an additive model. When a log transformation has been used, this is equivalent to using a multiplicative decomposition because

$$ \log y_t = \log S_t + \log T_t + \log E_t \\$$

Sometimes, the trend-cycle component is simply called the “trend” component, even though it may contain cyclic behaviour as well.

Forecasting Steps

The underlying principle is to model or estimate the trend and seasonality in the series and remove them from the series to get a stationary series. Then statistical forecasting techniques can be implemented on this series. The final step would be to convert the forecasted values back to the original scale by re-applying the trend and seasonality components.

Estimating and Eliminating Trend

  1. Transformation – taking a log, square root, cube root, etc. transformation
  2. Aggregation – taking average for a time period like monthly/weekly averages
  3. Smoothing – taking rolling averages, exponential smoothing
  4. Polynomial Fitting – fit a regression model

Simple Moving Average

As a first step in moving beyond mean models, random walk models and linear trend models, nonseasonal patterns and trends can be extrapolated using a moving-average or smoothing model. The basic assumption behind averaging and smoothing models is that the time series is locally stationary with a slowly varying mean. Hence, we take a moving (local) average to estimate the current value of the mean and then use that as the forecast for the near future. This can be considered a compromise between the mean model and the random-walk-without-drift model. The same strategy can be used to estimate and extrapolate a local trend. A moving average is often called a "smoothed" version of the original series because short-term averaging has the effect of smoothing out the bumps in the original series. By adjusting the degree of smoothing (the width of the moving average), we can hope to strike some kind of optimal balance between the performance of the mean and random walk models.

Simple Moving Average (SMA)

$$ \hat{y_t} = \frac{y_{t-1} + y_{t-2} + y_{t-3} + ... + y_{t-m}}{m} \\$$

In [68]:
# For smoothing the values we can use a 12-month moving average
dfBang['priceModLogMA12'] = dfBang.priceModLog.rolling(window = 12).mean()

In [69]:
dfBang.plot(kind ="line", y=["priceModLogMA12", "priceModLog"])


Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x11cc155c0>

The long-term forecasts from the SMA model are a horizontal straight line, just as in the random walk model. Thus, the SMA model assumes that there is no trend in the data. However, whereas the forecasts from the random walk model are simply equal to the last observed value, the forecasts from the SMA model are equal to a weighted average of recent values.


In [70]:
dfBang["priceMA12"] = np.exp(dfBang.priceModLogMA12)
dfBang.tail()


Out[70]:
quantity priceMod date priceModLog priceMean timeIndex priceLinear priceLinearQuantity priceModLogShift1 priceModLogDiff priceRandom priceModLogMA12 priceMA12
2015-10 1612160 2215 2015-10-01 7.703008 884.565812 141 1670.628673 1913.491280 8.051978 -0.348970 3140 7.526074 1855.805596
2015-11 1071872 1618 2015-11-01 7.388946 884.565812 142 1686.208656 1812.993037 7.703008 -0.314062 2215 7.537956 1877.986866
2015-12 513186 1343 2015-12-01 7.202661 884.565812 143 1701.933936 1633.680427 7.388946 -0.186285 1618 7.535329 1873.061120
2016-01 507223 1448 2016-01-01 7.277939 884.565812 144 1717.805867 1644.591721 7.202661 0.075277 1343 7.526907 1857.351582
2016-02 400359 1101 2016-02-01 7.003974 884.565812 145 1733.825817 1599.625994 7.277939 -0.273964 1448 7.492590 1794.694278

In [71]:
model_MA12_forecast = dfBang.priceModLog.tail(12).mean()

In [72]:
# Root Mean Squared Error (RMSE)
model_MA12_RMSE = RMSE(dfBang.priceMA12, dfBang.priceMod)
model_MA12_RMSE


Out[72]:
518.82175104808096

In [73]:
dfBangResults.loc[3,"Model"] = "Moving Average 12"
dfBangResults.loc[3,"Forecast"] = np.exp(model_MA12_forecast)
dfBangResults.loc[3,"RMSE"] = model_MA12_RMSE
dfBangResults.head()


Out[73]:
Model Forecast RMSE
0 Mean 884.566 683.953
1 Linear 1750 518.524
2 Random 1448 323.592
3 Moving Average 12 1794.69 518.822

In [74]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear",
                                             "priceRandom", "priceMA12"])


Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c9e2400>

In [75]:
# Test remaining part for Stationary
ts = dfBang.priceModLog - dfBang.priceModLogMA12
ts.dropna(inplace = True)
adf(ts)


Out[75]:
Test Statistic                -6.523729e+00
p-value                        1.026903e-08
# of Lags Used                 1.000000e+00
Number of Observations Used    1.330000e+02
Critical Value (10%)          -2.578496e+00
Critical Value (1%)           -3.480500e+00
Critical Value (5%)           -2.883528e+00
dtype: float64

Simple Exponential Smoothing Model (SES)

Instead of weighting each observation equally, in the SES model we give more weight to recent observations and less to older ones. This is done using a smoothing parameter such as alpha.

$$ \hat{y_t} = \alpha y_{t-1} + (1-\alpha)\hat{y_{t-1}} \\$$
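
To connect the exponentially weighted mean used below to this recursion, here is a hand-rolled sketch (note that pandas defaults to adjust=True, which normalizes the weights, so this plain recursion corresponds to ewm(..., adjust=False)):

In [ ]:
# Sketch: manual simple exponential smoothing
def ses(series, alpha):
    smoothed = [series.iloc[0]]   # seed the recursion with the first observation
    for y in series.iloc[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return pd.Series(smoothed, index=series.index)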

In [76]:
dfBang['priceModLogExp12'] = dfBang.priceModLog.ewm(halflife=12).mean()

In [77]:
halflife = 12
alpha = 1 - np.exp(np.log(0.5)/halflife)
alpha


Out[77]:
0.056125687318306472

In [78]:
dfBang.plot(kind ="line", y=["priceModLogExp12", "priceModLog"])


Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x11feb9208>

In [79]:
dfBang["priceExp12"] = np.exp(dfBang.priceModLogExp12)
dfBang.tail()


Out[79]:
quantity priceMod date priceModLog priceMean timeIndex priceLinear priceLinearQuantity priceModLogShift1 priceModLogDiff priceRandom priceModLogMA12 priceMA12 priceModLogExp12 priceExp12
2015-10 1612160 2215 2015-10-01 7.703008 884.565812 141 1670.628673 1913.491280 8.051978 -0.348970 3140 7.526074 1855.805596 7.379526 1602.830556
2015-11 1071872 1618 2015-11-01 7.388946 884.565812 142 1686.208656 1812.993037 7.703008 -0.314062 2215 7.537956 1877.986866 7.380055 1603.678391
2015-12 513186 1343 2015-12-01 7.202661 884.565812 143 1701.933936 1633.680427 7.388946 -0.186285 1618 7.535329 1873.061120 7.370096 1587.786947
2016-01 507223 1448 2016-01-01 7.277939 884.565812 144 1717.805867 1644.591721 7.202661 0.075277 1343 7.526907 1857.351582 7.364923 1579.593558
2016-02 400359 1101 2016-02-01 7.003974 884.565812 145 1733.825817 1599.625994 7.277939 -0.273964 1448 7.492590 1794.694278 7.344660 1547.908508

In [80]:
# Root Mean Squared Error (RMSE)
model_Exp12_RMSE = RMSE(dfBang.priceExp12, dfBang.priceMod)
model_Exp12_RMSE


Out[80]:
547.16281093335465

In [81]:
y_exp = dfBang.priceModLog.iloc[-1]
y_exp


Out[81]:
7.0039741367226798

In [82]:
y_for = dfBang.priceModLogExp12.iloc[-1]
y_for


Out[82]:
7.3446599493429714

In [83]:
model_Exp12_forecast = alpha * y_exp + (1 - alpha) * y_for

In [84]:
dfBangResults.loc[4,"Model"] = "Exp Smoothing 12"
dfBangResults.loc[4,"Forecast"] = np.exp(model_Exp12_forecast)
dfBangResults.loc[4,"RMSE"] = model_Exp12_RMSE
dfBangResults.head()


Out[84]:
Model Forecast RMSE
0 Mean 884.566 683.953
1 Linear 1750 518.524
2 Random 1448 323.592
3 Moving Average 12 1794.69 518.822
4 Exp Smoothing 12 1518.59 547.163

In [85]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear", 
                                             "priceRandom",
                                             "priceMA12", "priceExp12"])


Out[85]:
<matplotlib.axes._subplots.AxesSubplot at 0x11fad2710>

In [86]:
# Test remaining part for Stationary
ts = dfBang.priceModLog - dfBang.priceModLogExp12
ts.dropna(inplace = True)
adf(ts)


Out[86]:
Test Statistic                -6.638916e+00
p-value                        5.472586e-09
# of Lags Used                 1.000000e+00
Number of Observations Used    1.440000e+02
Critical Value (10%)          -2.577589e+00
Critical Value (1%)           -3.476598e+00
Critical Value (5%)           -2.881829e+00
dtype: float64

Eliminating Trend and Seasonality

  • Differencing – taking the difference with a particular time lag
  • Decomposition – modeling both trend and seasonality and removing them from the series

Differencing

One of the most common methods of dealing with both trend and seasonality is differencing. In this technique, we take the difference of the observation at a particular instant from that at the previous instant. This mostly works well in improving stationarity. We have already computed the first-order difference earlier; a seasonal variant is sketched below.
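
A minimal sketch of the seasonal variant, assuming dfBang as above (priceModLogSeasonalDiff is a name introduced here for illustration):

In [ ]:
# Sketch: seasonal (lag-12) differencing to target the yearly pattern
dfBang["priceModLogSeasonalDiff"] = dfBang.priceModLog - dfBang.priceModLog.shift(12)
dfBang.priceModLogSeasonalDiff.plot()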


In [87]:
dfBang.priceModLogDiff.plot()


Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x1206706a0>

In [88]:
# Test remaining part for Stationary
ts = dfBang.priceModLogDiff
ts.dropna(inplace = True)
adf(ts)


Out[88]:
Test Statistic                -7.293246e+00
p-value                        1.399305e-10
# of Lags Used                 7.000000e+00
Number of Observations Used    1.370000e+02
Critical Value (10%)          -2.578149e+00
Critical Value (1%)           -3.479007e+00
Critical Value (5%)           -2.882878e+00
dtype: float64

Time Series Decomposition

We can also decompose the time series into trend, seasonal and residual components.


In [89]:
from statsmodels.tsa.seasonal import seasonal_decompose

In [90]:
dfBang.index = dfBang.index.to_timestamp()

In [91]:
dfBang.head()


Out[91]:
quantity priceMod date priceModLog priceMean timeIndex priceLinear priceLinearQuantity priceModLogShift1 priceModLogDiff priceRandom priceModLogMA12 priceMA12 priceModLogExp12 priceExp12
2004-01-01 227832 991 2004-01-01 6.898715 884.565812 0 451.289090 431.410971 NaN NaN NaN NaN NaN 6.898715 991.000000
2004-02-01 225133 793 2004-02-01 6.675823 884.565812 1 455.497732 434.277110 6.898715 -0.222891 991 NaN NaN 6.784051 883.641198
2004-03-01 221952 556 2004-03-01 6.320768 884.565812 2 459.745622 437.007747 6.675823 -0.355055 793 NaN NaN 6.620623 750.412129
2004-04-01 185150 465 2004-04-01 6.142037 884.565812 3 464.033127 428.667281 6.320768 -0.178731 556 NaN NaN 6.490419 658.799360
2004-05-01 137390 455 2004-05-01 6.120297 884.565812 4 468.360617 413.029454 6.142037 -0.021740 465 NaN NaN 6.407606 606.440185

In [92]:
decomposition = seasonal_decompose(dfBang.priceModLog, model = "additive")

In [93]:
decomposition.plot()


Out[93]:

In [94]:
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

In [95]:
dfBang["priceDecomp"] = np.exp(trend + seasonal)

In [96]:
# Root Mean Squared Error (RMSE)
model_Decomp_RMSE = RMSE(dfBang.priceDecomp, dfBang.priceMod)
model_Decomp_RMSE


Out[96]:
374.10858464802351

In [97]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear", "priceRandom",
                                             "priceMA12", "priceExp12", "priceDecomp"])


Out[97]:
<matplotlib.axes._subplots.AxesSubplot at 0x1217b3710>

In [98]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod",
                                              "priceDecomp"])


Out[98]:
<matplotlib.axes._subplots.AxesSubplot at 0x1217cbc50>

In [99]:
# Test remaining part for Stationary
ts = decomposition.resid
ts.dropna(inplace = True)
adf(ts)


Out[99]:
Test Statistic                -7.906662e+00
p-value                        4.049566e-12
# of Lags Used                 4.000000e+00
Number of Observations Used    1.290000e+02
Critical Value (10%)          -2.578864e+00
Critical Value (1%)           -3.482088e+00
Critical Value (5%)           -2.884219e+00
dtype: float64

Auto Regressive Models - AR(p)

In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. The term autoregression indicates that it is a regression of the variable against itself.

Thus an autoregressive model of order (p) can be written as

$$ y_t = c + m_1 y_{t-1} + m_2 y_{t-2} + \dots + m_p y_{t-p} + e_t \\$$

The random walk model is an AR(1) model with $$ m_1 = 1, \; c = 0 \\$$ and the random walk with drift is an AR(1) model with $$ m_1 = 1, \; c \not= 0 \\$$

We normally restrict autoregressive models to stationary data, and then some constraints on the values of the parameters are required.

For an AR(1) model: $$ -1 < m_1 < 1 \\$$ For an AR(2) model:
$$ -1 < m_2 < 1, \quad m_1 + m_2 < 1, \quad m_2 - m_1 < 1 \\$$
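
As a minimal sketch of fitting a pure AR model, an AR(1) is just an ARIMA with order (1, 0, 0) (using the same statsmodels ARIMA class as later in this section, on the differenced series built earlier):

In [ ]:
# Sketch: fit an AR(1) on the differenced log price
from statsmodels.tsa.arima_model import ARIMA

model_ar1 = ARIMA(dfBang.priceModLogDiff.dropna(), order=(1, 0, 0)).fit(disp=-1)
print(model_ar1.params)   # the constant c and m1; stationarity needs |m1| < 1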

Moving Average Model - MA(q)

Rather than use past values of the forecast variable in a regression, a moving average model uses past forecast errors in a regression-like model.

$$ y_t=c+e_t+l_1 e_{t−1}+l_2 e_{t−2} + ... + l_q e_{t-q} \\$$

where e is white noise. We refer to this as an MA(q) model. Of course, we do not observe the values of e(t), so it is not really regression in the usual sense.

Notice that each value of y(t) can be thought of as a weighted moving average of the past few forecast errors. However, moving average models should not be confused with moving average smoothing. A moving average model is used for forecasting future values while moving average smoothing is used for estimating the trend-cycle of past values.
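
Mirroring the AR(1) sketch above, an MA(1) fit is an ARIMA with order (0, 0, 1):

In [ ]:
# Sketch: fit an MA(1) on the same differenced series
from statsmodels.tsa.arima_model import ARIMA

model_ma1 = ARIMA(dfBang.priceModLogDiff.dropna(), order=(0, 0, 1)).fit(disp=-1)
print(model_ma1.params)   # the constant c and the l1 coefficient on the lagged error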

ARIMA Model

If we combine differencing with autoregression and a moving average model, we obtain a non-seasonal ARIMA model. ARIMA is an acronym for AutoRegressive Integrated Moving Average model (“integration” in this context is the reverse of differencing). The full model can be written as

$$ y'_t = c + m_1 y'_{t-1} + \dots + m_p y'_{t-p} + l_1 e_{t-1} + \dots + l_q e_{t-q} + e_t \\$$

where $y'_t$ is the differenced series. Its three parameters are:

  • Number of AR (Auto-Regressive) terms (p): AR terms are just lags of the dependent variable. For instance, if p is 5, the predictors for y(t) will be y(t-1)….y(t-5).
  • Number of MA (Moving Average) terms (q): MA terms are lagged forecast errors in the prediction equation. For instance, if q is 5, the predictors for y(t) will be e(t-1)….e(t-5), where e(i) is the forecast error at instant i, i.e. the difference between the actual value and the predicted value.
  • Number of Differences (d): This is the number of nonseasonal differences. In our case we took the first-order difference, so we can either pass the differenced variable and put d=0 or pass the original variable and put d=1. Both will generate the same results.

An important concern here is how to determine the values of ‘p’ and ‘q’. We use two plots to determine these numbers. Let us discuss them first.

  • Autocorrelation Function (ACF): It is a measure of the correlation between the TS and a lagged version of itself. For instance, at lag 5, ACF would compare the series at time instants ‘t1’…‘t2’ with the series at instants ‘t1-5’…‘t2-5’ (t1-5 and t2 being end points).
  • Partial Autocorrelation Function (PACF): This measures the correlation between the TS and a lagged version of itself, but after eliminating the variations already explained by the intervening lags. E.g., at lag 5, it will check the correlation but remove the effects already explained by lags 1 to 4.

In an MA model, a noise shock quickly vanishes with time. In an AR model, a shock has a much longer-lasting effect.
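
Below we compute and bar-plot the ACF and PACF by hand; statsmodels also ships ready-made plots with confidence bands, which make the cutoff lags easier to read (a sketch using the plot_acf/plot_pacf graphics helpers):

In [ ]:
# Sketch: ACF/PACF plots with confidence bands
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(dfBang.priceModLogDiff.dropna(), lags=20)
plot_pacf(dfBang.priceModLogDiff.dropna(), lags=20)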


In [100]:
ts = dfBang.priceModLog
ts_diff = dfBang.priceModLogDiff
ts_diff.dropna(inplace = True)

In [101]:
#ACF and PACF plots:
from statsmodels.tsa.stattools import acf, pacf

In [102]:
lag_acf = acf(ts_diff, nlags=20)

In [103]:
lag_acf


Out[103]:
array([ 1.        ,  0.40215445, -0.02855346, -0.21472715, -0.23756032,
       -0.26332645, -0.22038359, -0.08576989, -0.0754398 ,  0.01497797,
        0.12361818,  0.20862956,  0.158123  ,  0.02053933, -0.01290934,
       -0.07097602, -0.17993659, -0.09514604, -0.04878255, -0.08854117,
       -0.18810881])

In [104]:
ACF = pd.Series(lag_acf)

In [105]:
ACF


Out[105]:
0     1.000000
1     0.402154
2    -0.028553
3    -0.214727
4    -0.237560
5    -0.263326
6    -0.220384
7    -0.085770
8    -0.075440
9     0.014978
10    0.123618
11    0.208630
12    0.158123
13    0.020539
14   -0.012909
15   -0.070976
16   -0.179937
17   -0.095146
18   -0.048783
19   -0.088541
20   -0.188109
dtype: float64

In [106]:
ACF.plot(kind = "bar")


Out[106]:
<matplotlib.axes._subplots.AxesSubplot at 0x121de6eb8>

In [107]:
lag_pacf = pacf(ts_diff, nlags=20, method='ols')

In [108]:
PACF = pd.Series(lag_pacf)

In [109]:
PACF.plot(kind = "bar")


Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x1223b6940>

Running the ARIMA Model


In [110]:
from statsmodels.tsa.arima_model import ARIMA

In [111]:
ts_diff.head()


Out[111]:
2004-02-01   -0.222891
2004-03-01   -0.355055
2004-04-01   -0.178731
2004-05-01   -0.021740
2004-06-01    0.191437
Freq: MS, Name: priceModLogDiff, dtype: float64

In [112]:
# Run an ARIMA(1,0,1) model on the already-differenced series (hence d = 0 here)
model_AR1MA = ARIMA(ts_diff, order=(1,0,1))

In [113]:
results_ARIMA = model_AR1MA.fit(disp = -1)

In [114]:
results_ARIMA.fittedvalues.head()


Out[114]:
2004-02-01   -0.000977
2004-03-01   -0.093805
2004-04-01   -0.139355
2004-05-01   -0.041727
2004-06-01    0.002164
Freq: MS, dtype: float64

In [115]:
ts_diff.plot()
results_ARIMA.fittedvalues.plot()


Out[115]:
<matplotlib.axes._subplots.AxesSubplot at 0x11fec53c8>

In [116]:
ts_diff.sum()


Out[116]:
0.1052596023926915

In [117]:
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff.tail()


Out[117]:
2015-10-01    0.008537
2015-11-01   -0.170970
2015-12-01   -0.096505
2016-01-01   -0.059154
2016-02-01    0.054308
Freq: MS, dtype: float64

In [118]:
predictions_ARIMA_diff.sum()


Out[118]:
0.07204872072412598

In [119]:
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_diff_cumsum.tail()


Out[119]:
2015-10-01    0.344369
2015-11-01    0.173399
2015-12-01    0.076894
2016-01-01    0.017740
2016-02-01    0.072049
Freq: MS, dtype: float64

In [120]:
ts.iloc[0]


Out[120]:
6.8987145343299883

In [121]:
predictions_ARIMA_log = pd.Series(ts.iloc[0], index=ts.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
predictions_ARIMA_log.tail()


Out[121]:
2015-10-01    7.243084
2015-11-01    7.072114
2015-12-01    6.975609
2016-01-01    6.916455
2016-02-01    6.970763
Freq: MS, dtype: float64

In [122]:
dfBang['priceARIMA'] = np.exp(predictions_ARIMA_log)

In [123]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceARIMA"])


Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x12062fbe0>

In [124]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear", "priceRandom",
                                             "priceMA12", "priceExp12", "priceDecomp", "priceARIMA"])


Out[124]:
<matplotlib.axes._subplots.AxesSubplot at 0x123041f98>