In [1]:
# Import the libraries we need: Pandas, NumPy and Matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.tsa.stattools import adfuller
In [2]:
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)
In [3]:
# Read the CSV file of month-wise market arrivals (quantity) and prices
df = pd.read_csv('MonthWiseMarketArrivals_clean.csv')
In [4]:
# Convert the date column to datetime values
df.date = pd.DatetimeIndex(df.date)
In [5]:
# Set the index to a monthly PeriodIndex derived from the date column
df.index = pd.PeriodIndex(df.date, freq='M')
In [6]:
# Sort the data frame by date
df = df.sort_values(by = "date")
In [7]:
df.head()
Out[7]:
Get the priceMod for Bangalore Market
In [8]:
dfBang = df.loc[df.city == "BANGALORE"].copy()
In [9]:
dfBang.head()
Out[9]:
In [10]:
# Drop redundant columns
dfBang = dfBang.drop(["market", "month", "year", "state", "city", "priceMin", "priceMax"], axis = 1)
In [11]:
dfBang.head()
Out[11]:
In [12]:
dfBang.priceMod.plot()
Out[12]:
In [13]:
dfBang.quantity.plot()
Out[13]:
In [14]:
dfBang.priceMod.plot(kind = "hist", bins = 30)
Out[14]:
In [15]:
dfBang['priceModLog'] = np.log(dfBang.priceMod)
dfBang.head()
Out[15]:
In [16]:
dfBang.priceModLog.plot(kind = "hist", bins = 30)
Out[16]:
In [17]:
dfBang.priceModLog.plot()
Out[17]:
In [18]:
model_mean_pred = dfBang.priceModLog.mean()
In [19]:
# Let us store this as our mean prediction value (back-transformed from the log scale)
dfBang["priceMean"] = np.exp(model_mean_pred)
In [20]:
dfBang.plot(kind="line", x="date", y = ["priceMod", "priceMean"])
Out[20]:
Can we measure the error rate?
We will use Root Mean Squared Error (RMSE) to calculate our error values
$RMSE = \sqrt{\frac{\sum (\hat{y} - y)^2}{n}} $ , where $\hat{y}$ is the predicted value of $y$
In [21]:
def RMSE(predicted, actual):
    mse = (predicted - actual)**2
    rmse = np.sqrt(mse.sum()/mse.count())
    return rmse
In [22]:
model_mean_RMSE = RMSE(dfBang.priceMean, dfBang.priceMod)
model_mean_RMSE
Out[22]:
In [23]:
# Save this in a dataframe
dfBangResults = pd.DataFrame(columns = ["Model", "Forecast", "RMSE"])
dfBangResults.head()
Out[23]:
In [24]:
dfBangResults.loc[0,"Model"] = "Mean"
dfBangResults.loc[0,"Forecast"] = np.exp(model_mean_pred)
dfBangResults.loc[0,"RMSE"] = model_mean_RMSE
dfBangResults.head()
Out[24]:
In [25]:
dfBang.head()
Out[25]:
In [26]:
dfBang.dtypes
Out[26]:
In [27]:
# What is the starting month of our data
dfBang.date.min()
Out[27]:
In [28]:
# Convert the date into a timedelta measured from the first date (starting at zero)
dfBang["timeIndex"] = dfBang.date - dfBang.date.min()
In [29]:
dfBang.head()
Out[29]:
In [30]:
dfBang.dtypes
Out[30]:
In [31]:
# Convert the timedelta to months using np.timedelta64
dfBang["timeIndex"] = dfBang["timeIndex"]/np.timedelta64(1, 'M')
In [32]:
dfBang.timeIndex.head()
Out[32]:
In [33]:
# Round to the nearest whole month and cast to integer
dfBang["timeIndex"] = dfBang["timeIndex"].round(0).astype(int)
In [34]:
dfBang.timeIndex.tail()
Out[34]:
In [35]:
dfBang.head()
Out[35]:
In [36]:
## Fit a linear regression of priceModLog on timeIndex
model_linear = smf.ols('priceModLog ~ timeIndex', data = dfBang).fit()
In [37]:
model_linear.summary()
Out[37]:
In [38]:
## Parameters for y = mx + c equation
model_linear.params
Out[38]:
In [39]:
c = model_linear.params["Intercept"]
c
Out[39]:
In [40]:
m = model_linear.params["timeIndex"]
m
Out[40]:
In [41]:
model_linear_pred = model_linear.predict()
In [42]:
model_linear_pred
Out[42]:
In [43]:
# Plot the prediction line
dfBang.plot(kind="line", x="timeIndex", y = "priceModLog")
plt.plot(dfBang.timeIndex,model_linear_pred, '-')
Out[43]:
In [44]:
model_linear.resid.plot(kind = "bar")
Out[44]:
Is this a good model?
What measures can we check to see if the model is good?
It is seen here (and also evident on the regression line plot, if you look closely) that the linear trend model has a tendency to make an error of the same sign for many periods in a row. This tendency is measured in statistical terms by the lag-1 autocorrelation and Durbin-Watson statistic. If there is no time pattern, the lag-1 autocorrelation should be very close to zero, and the Durbin-Watson statistic ought to be very close to 2, which is not the case here. If the model has succeeded in extracting all the "signal" from the data, there should be no pattern at all in the errors: the error in the next period should not be correlated with any previous errors. The linear trend model obviously fails the autocorrelation test in this case.
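As a quick numerical check of that claim, the lag-1 autocorrelation of the residuals and the Durbin-Watson statistic can be computed directly (a minimal sketch using the statsmodels helper):
# Lag-1 autocorrelation of the residuals (should be near 0 for a well-specified model)
model_linear.resid.autocorr(lag=1)
# Durbin-Watson statistic (should be near 2 if there is no autocorrelation)
from statsmodels.stats.stattools import durbin_watson
durbin_watson(model_linear.resid)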
In [45]:
# Manual Calculation
model_linear_forecast_manual = m * 146 + c
model_linear_forecast_manual
Out[45]:
In [46]:
# Using Predict Function
model_linear_forecast_auto = model_linear.predict(exog = dict(timeIndex=146))
model_linear_forecast_auto
Out[46]:
In [47]:
dfBang["priceLinear"] = np.exp(model_linear_pred)
In [48]:
dfBang.head()
Out[48]:
In [49]:
# Root Mean Squared Error (RMSE)
model_linear_RMSE = RMSE(dfBang.priceLinear, dfBang.priceMod)
model_linear_RMSE
Out[49]:
In [50]:
dfBangResults.loc[1,"Model"] = "Linear"
dfBangResults.loc[1,"Forecast"] = np.exp(model_linear_forecast_manual)
dfBangResults.loc[1,"RMSE"] = model_linear_RMSE
dfBangResults.head()
Out[50]:
In [51]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear"])
Out[51]:
In [52]:
## Fit a linear regression of priceModLog on timeIndex and log(quantity)
model_linear_quantity = smf.ols('priceModLog ~ timeIndex + np.log(quantity)', data = dfBang).fit()
In [53]:
model_linear_quantity.summary()
Out[53]:
In [54]:
dfBang["priceLinearQuantity"] = np.exp(model_linear_quantity.predict())
In [55]:
dfBang.plot(kind = "line", x="timeIndex", y = "quantity")
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean",
"priceLinear", "priceLinearQuantity"])
Out[55]:
When faced with a time series that shows irregular growth, the best strategy may not be to try to directly predict the level of the series at each period (i.e., the quantity Yt). Instead, it may be better to try to predict the change that occurs from one period to the next (i.e., the quantity Yt - Yt-1). That is, it may be better to look at the first difference of the series, to see if a predictable pattern can be found there. For purposes of one-period-ahead forecasting, it is just as good to predict the next change as to predict the next level of the series, since the predicted change can be added to the current level to yield a predicted level. The simplest case of such a model is one that always predicts that the next change will be zero, as if the series is equally likely to go up or down in the next period regardless of what it has done in the past.
Random Walk Model $$ \hat{Y_t} = Y_{t-1} + \epsilon \\$$
Random Walk Model with drift $$ \hat{Y_t} = Y_{t-1} + c + \epsilon \\$$
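The cells below implement the plain random walk (without drift). As a sketch, the drift variant can be built by estimating $c$ as the mean one-step change of the log series (the drift estimate and the priceRandomDrift column are introduced here for illustration only):
# Estimate the drift c as the average month-on-month change of the log price
drift = dfBang.priceModLog.diff().mean()
# Random walk with drift: previous (log) value plus the estimated drift
dfBang["priceRandomDrift"] = np.exp(dfBang.priceModLog.shift() + drift)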
In [56]:
dfBang["priceModLogShift1"] = dfBang.priceModLog.shift()
In [57]:
dfBang.head()
Out[57]:
In [58]:
dfBang.plot(kind= "scatter", y = "priceModLog", x = "priceModLogShift1", s = 50)
Out[58]:
In [59]:
# Compute the one-month difference (first difference of the log series)
dfBang["priceModLogDiff"] = dfBang.priceModLog - dfBang.priceModLogShift1
In [60]:
dfBang.priceModLogDiff.plot()
Out[60]:
In [61]:
dfBang["priceRandom"] = np.exp(dfBang.priceModLogShift1)
dfBang.head()
Out[61]:
In [62]:
dfBang.tail()
Out[62]:
In [63]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod","priceRandom"])
Out[63]:
In [64]:
# Root Mean Squared Error (RMSE)
model_random_RMSE = RMSE(dfBang.priceRandom, dfBang.priceMod)
model_random_RMSE
Out[64]:
In [65]:
dfBangResults.loc[2,"Model"] = "Random"
dfBangResults.loc[2,"Forecast"] = np.exp(dfBang.priceModLogShift1[-1])
dfBangResults.loc[2,"RMSE"] = model_random_RMSE
dfBangResults.head()
Out[65]:
In [66]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear", "priceRandom"])
Out[66]:
Most time series models work on the assumption that the time series is stationary. Intuitively, if a time series has exhibited a particular behaviour over time, there is a very high probability that it will continue to do so in the future. Also, the theory for stationary series is more mature and easier to implement than that for non-stationary series.
Statistical stationarity: A stationary time series is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time. Most statistical forecasting methods are based on the assumption that the time series can be rendered approximately stationary (i.e., "stationarized") through the use of mathematical transformations. A stationarized series is relatively easy to predict: you simply predict that its statistical properties will be the same in the future as they have been in the past!
There are three basic criteria for a series to be classified as a stationary series: a constant mean over time, a constant variance over time, and an autocovariance that does not depend on time (only on the lag).
The Augmented Dickey-Fuller test can be used to test for a unit root in a univariate process in the presence of serial correlation.
$$ Y_t = \rho Y_{t-1} + \epsilon_t \\$$
$$ Y_t - Y_{t-1} = (\rho - 1) Y_{t-1} + \epsilon_t \\$$
We have to test whether $\rho - 1$ is significantly different from zero. If the null hypothesis (that the series has a unit root, i.e. $\rho - 1 = 0$) is rejected, the time series is stationary.
In [67]:
def adf(ts):
    # Determine rolling statistics
    rolmean = ts.rolling(window=12).mean()
    rolstd = ts.rolling(window=12).std()
    # Plot rolling statistics
    orig = plt.plot(ts, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    # Calculate the ADF test statistics
    adftest = adfuller(ts, autolag='AIC')
    adfoutput = pd.Series(adftest[0:4], index=['Test Statistic', 'p-value', '# of Lags Used',
                                               'Number of Observations Used'])
    for key, value in adftest[4].items():
        adfoutput['Critical Value (%s)' % key] = value
    return adfoutput
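As a usage example, the helper can first be applied to the untransformed log series to check that it is non-stationary before any detrending (a quick sketch):
adf(dfBang.priceModLog.dropna())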
Let us understand what makes a time series non-stationary. The major reasons behind non-stationarity are trend and seasonality, which a classical decomposition makes explicit. An additive decomposition model is written as
$$ y_t = S_t + T_t + E_t \\$$
where $y_t$ is the data at period $t$, $S_t$ is the seasonal component at period $t$, $T_t$ is the trend-cycle component at period $t$ and $E_t$ is the remainder (or irregular or error) component at period $t$.
Alternatively, a multiplicative model would be written as
$$ y_t = S_t \times T_t \times E_t \\$$
The additive model is most appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series. When the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, a multiplicative model is more appropriate. With economic time series, multiplicative models are common.
An alternative to using a multiplicative model is to first transform the data until the variation in the series appears to be stable over time, and then use an additive model. When a log transformation has been used, this is equivalent to a multiplicative decomposition because
$$ \log y_t = \log S_t + \log T_t + \log E_t \\$$
Sometimes the trend-cycle component is simply called the "trend" component, even though it may contain cyclic behaviour as well.
The underlying principle is to model or estimate the trend and seasonality in the series and remove those from the series to get a stationary series. Then statistical forecasting techniques can be implemented on this series. The final step would be to convert the forecasted values into the original scale by applying trend and seasonality constraints back.
As a first step in moving beyond mean models, random walk models and linear trend models, nonseasonal patterns and trends can be extrapolated using a moving-average or smoothing model. The basic assumption behind averaging and smoothing models is that the time series is locally stationary with a slowly varying mean. Hence, we take a moving (local) average to estimate the current value of the mean and then use that as the forecast for the near future. This can be considered a compromise between the mean model and the random-walk-without-drift model. The same strategy can be used to estimate and extrapolate a local trend. A moving average is often called a "smoothed" version of the original series because short-term averaging has the effect of smoothing out the bumps in the original series. By adjusting the degree of smoothing (the width of the moving average), we can hope to strike some kind of optimal balance between the performance of the mean and random walk models.
Simple Moving Average (SMA)
$$ \hat{y_t} = \frac{y_{t-1} + y_{t-2} + y_{t-3} + ... + y_{t-m}}{m} \\$$
In [68]:
# For smoothing the values we can use 12 month Moving Averages
dfBang['priceModLogMA12'] = dfBang.priceModLog.rolling(window=12).mean()
In [69]:
dfBang.plot(kind ="line", y=["priceModLogMA12", "priceModLog"])
Out[69]:
The long-term forecasts from the SMA model are a horizontal straight line, just as in the random walk model. Thus, the SMA model assumes that there is no trend in the data. However, whereas the forecasts from the random walk model are simply equal to the last observed value, the forecasts from the SMA model are equal to a weighted average of recent values.
In [70]:
dfBang["priceMA12"] = np.exp(dfBang.priceModLogMA12)
dfBang.tail()
Out[70]:
In [71]:
model_MA12_forecast = dfBang.priceModLog.tail(12).mean()
In [72]:
# Root Mean Squared Error (RMSE)
model_MA12_RMSE = RMSE(dfBang.priceMA12, dfBang.priceMod)
model_MA12_RMSE
Out[72]:
In [73]:
dfBangResults.loc[3,"Model"] = "Moving Average 12"
dfBangResults.loc[3,"Forecast"] = np.exp(model_MA12_forecast)
dfBangResults.loc[3,"RMSE"] = model_MA12_RMSE
dfBangResults.head()
Out[73]:
In [74]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear",
"priceRandom", "priceMA12"])
Out[74]:
In [75]:
# Test the remaining part for stationarity
ts = dfBang.priceModLog - dfBang.priceModLogMA12
ts.dropna(inplace = True)
adf(ts)
Out[75]:
Instead of weighting each observation equally, the simple exponential smoothing (SES) model gives more weight to recent observations and less to older ones. This is done using a smoothing parameter, alpha:
$$ \hat{y_t} = \alpha y_{t-1} + (1-\alpha)\hat{y_{t-1}} \\$$
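To connect this formula to the pandas call in the next cell, here is a minimal hand-rolled version of the same recursion (a sketch; pandas' exponentially weighted implementation differs in how it initialises and normalises the early observations):
def ses(series, alpha):
    # Simple exponential smoothing: y_hat[t] = alpha * y[t-1] + (1 - alpha) * y_hat[t-1]
    values = series.values
    smoothed = [values[0]]                      # seed the recursion with the first observation
    for t in range(1, len(values)):
        smoothed.append(alpha * values[t - 1] + (1 - alpha) * smoothed[-1])
    return pd.Series(smoothed, index=series.index)

# e.g. ses(dfBang.priceModLog, alpha=0.056)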
In [76]:
dfBang['priceModLogExp12'] = dfBang.priceModLog.ewm(halflife=12).mean()
In [77]:
halflife = 12
alpha = 1 - np.exp(np.log(0.5)/halflife)
alpha
Out[77]:
In [78]:
dfBang.plot(kind ="line", y=["priceModLogExp12", "priceModLog"])
Out[78]:
In [79]:
dfBang["priceExp12"] = np.exp(dfBang.priceModLogExp12)
dfBang.tail()
Out[79]:
In [80]:
# Root Mean Squared Error (RMSE)
model_Exp12_RMSE = RMSE(dfBang.priceExp12, dfBang.priceMod)
model_Exp12_RMSE
Out[80]:
In [81]:
y_exp = dfBang.priceModLog.iloc[-1]
y_exp
Out[81]:
In [82]:
y_for = dfBang.priceModLogExp12.iloc[-1]
y_for
Out[82]:
In [83]:
model_Exp12_forecast = alpha * y_exp + (1 - alpha) * y_for
In [84]:
dfBangResults.loc[4,"Model"] = "Exp Smoothing 12"
dfBangResults.loc[4,"Forecast"] = np.exp(model_Exp12_forecast)
dfBangResults.loc[4,"RMSE"] = model_Exp12_RMSE
dfBangResults.head()
Out[84]:
In [85]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear",
"priceRandom",
"priceMA12", "priceExp12"])
Out[85]:
In [86]:
# Test the remaining part for stationarity
ts = dfBang.priceModLog - dfBang.priceModLogExp12
ts.dropna(inplace = True)
adf(ts)
Out[86]:
One of the most common methods of dealing with both trend and seasonality is differencing. In this technique, we take the difference between the observation at a particular instant and the observation at the previous instant. This usually works well in improving stationarity. We have already computed the first-order difference earlier.
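The first difference mainly removes trend; a seasonal (lag-12) difference can be layered on top of it to target the yearly pattern as well (a sketch, the variable name is introduced here only for illustration):
# Seasonal difference: subtract the value of the differenced series from twelve months earlier
priceModLogDiff12 = dfBang.priceModLogDiff - dfBang.priceModLogDiff.shift(12)
adf(priceModLogDiff12.dropna())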
In [87]:
dfBang.priceModLogDiff.plot()
Out[87]:
In [88]:
# Test the remaining part for stationarity
ts = dfBang.priceModLogDiff
ts.dropna(inplace = True)
adf(ts)
Out[88]:
In [89]:
from statsmodels.tsa.seasonal import seasonal_decompose
In [90]:
dfBang.index = dfBang.index.to_timestamp()
In [91]:
dfBang.head()
Out[91]:
In [92]:
decomposition = seasonal_decompose(dfBang.priceModLog, model = "additive")
In [93]:
decomposition.plot()
Out[93]:
In [94]:
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
In [95]:
dfBang["priceDecomp"] = np.exp(trend + seasonal)
In [96]:
# Root Mean Squared Error (RMSE)
model_Decomp_RMSE = RMSE(dfBang.priceDecomp, dfBang.priceMod)
model_Decomp_RMSE
Out[96]:
In [97]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear", "priceRandom",
"priceMA12", "priceExp12", "priceDecomp"])
Out[97]:
In [98]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod",
"priceDecomp"])
Out[98]:
In [99]:
# Test the remaining part for stationarity
ts = decomposition.resid
ts.dropna(inplace = True)
adf(ts)
Out[99]:
In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. The term autoregression indicates that it is a regression of the variable against itself.
Thus an autoregressive model of order (p) can be written as
$$ y_t = c + m_1 y_{t-1} + m_2 y_{t-2} + \dots + m_p y_{t-p} + e_t \\$$
The random walk model is an AR(1) model with $$ m_1 = 1, c = 0 \\$$ and the random walk model with drift is an AR(1) model with $$ m_1 = 1, c \neq 0 \\$$
We normally restrict autoregressive models to stationary data, and then some constraints on the values of the parameters are required.
For an AR(1) model: $$ -1 < m_1 < 1 \\$$
For an AR(2) model:
$$ -1 < m_2 < 1, \quad m_1 + m_2 < 1, \quad m_2 - m_1 < 1 \\$$
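To see why these constraints matter, one can simulate a stationary AR(1) against a random walk ($m_1 = 1$) using plain NumPy (a toy sketch with made-up parameters):
# Simulate 200 steps of a stationary AR(1) (m1 = 0.6) and of a random walk (m1 = 1)
np.random.seed(42)
e = np.random.normal(size=200)
ar1, rw = [0.0], [0.0]
for t in range(1, 200):
    ar1.append(0.6 * ar1[-1] + e[t])   # shocks die out: the series reverts to its mean
    rw.append(rw[-1] + e[t])           # unit root: shocks accumulate and the series wanders
pd.DataFrame({"AR(1), m1=0.6": ar1, "Random walk, m1=1": rw}).plot()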
Rather than use past values of the forecast variable in a regression, a moving average model uses past forecast errors in a regression-like model.
$$ y_t = c + e_t + l_1 e_{t-1} + l_2 e_{t-2} + ... + l_q e_{t-q} \\$$
where $e_t$ is white noise. We refer to this as an MA(q) model. Of course, we do not observe the values of $e_t$, so it is not really a regression in the usual sense.
Notice that each value of $y_t$ can be thought of as a weighted moving average of the past few forecast errors. However, moving average models should not be confused with moving average smoothing. A moving average model is used for forecasting future values, while moving average smoothing is used for estimating the trend-cycle of past values.
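A similar toy simulation shows an MA(1) process, where a shock only influences the current and the next observation (again just a sketch with an arbitrary coefficient):
# Simulate an MA(1) process: y[t] = e[t] + 0.8 * e[t-1]
np.random.seed(42)
e = np.random.normal(size=200)
ma1 = pd.Series(e[1:] + 0.8 * e[:-1])
ma1.plot()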
If we combine differencing with autoregression and a moving average model, we obtain a non-seasonal ARIMA model. ARIMA is an acronym for AutoRegressive Integrated Moving Average ("integration" in this context is the reverse of differencing). The full model can be written as
$$ y'_t = c + m_1 y'_{t-1} + \dots + m_p y'_{t-p} + l_1 e_{t-1} + \dots + l_q e_{t-q} + e_t \\$$
where $y'_t$ is the differenced series (it may have been differenced more than once).
An important concern here is how to determine the values of p and q. We use two plots to determine these numbers. Let's discuss them first.
In an MA model, a noise shock quickly vanishes with time. In an AR model, a shock has a much longer-lasting effect.
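The cells below build the ACF and PACF bar charts by hand; as an optional alternative, statsmodels also ships plotting helpers that add confidence bands, which make the cut-off lags easier to read:
# ACF and PACF plots with confidence intervals (alternative to the manual bar charts below)
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(dfBang.priceModLogDiff.dropna(), lags=20)
plot_pacf(dfBang.priceModLogDiff.dropna(), lags=20)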
In [100]:
ts = dfBang.priceModLog
ts_diff = dfBang.priceModLogDiff
ts_diff.dropna(inplace = True)
In [101]:
#ACF and PACF plots:
from statsmodels.tsa.stattools import acf, pacf
In [102]:
lag_acf = acf(ts_diff, nlags=20)
In [103]:
lag_acf
Out[103]:
In [104]:
ACF = pd.Series(lag_acf)
In [105]:
ACF
Out[105]:
In [106]:
ACF.plot(kind = "bar")
Out[106]:
In [107]:
lag_pacf = pacf(ts_diff, nlags=20, method='ols')
In [108]:
PACF = pd.Series(lag_pacf)
In [109]:
PACF.plot(kind = "bar")
Out[109]:
In [110]:
from statsmodels.tsa.arima.model import ARIMA
In [111]:
ts_diff.head()
Out[111]:
In [112]:
# Fit an ARIMA(1,0,1) model on the differenced series
model_AR1MA = ARIMA(ts_diff, order=(1,0,1))
In [113]:
results_ARIMA = model_AR1MA.fit()
In [114]:
results_ARIMA.fittedvalues.head()
Out[114]:
In [115]:
ts_diff.plot()
results_ARIMA.fittedvalues.plot()
Out[115]:
In [116]:
ts_diff.sum()
Out[116]:
In [117]:
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff.tail()
Out[117]:
In [118]:
predictions_ARIMA_diff.sum()
Out[118]:
In [119]:
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_diff_cumsum.tail()
Out[119]:
In [120]:
ts.iloc[0]
Out[120]:
In [121]:
predictions_ARIMA_log = pd.Series(ts.iloc[0], index=ts.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
predictions_ARIMA_log.tail()
Out[121]:
In [122]:
dfBang['priceARIMA'] = np.exp(predictions_ARIMA_log)
In [123]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceARIMA"])
Out[123]:
In [124]:
dfBang.plot(kind="line", x="timeIndex", y = ["priceMod", "priceMean", "priceLinear", "priceRandom",
"priceMA12", "priceExp12", "priceDecomp", "priceARIMA"])
Out[124]: