In [11]:

    
import pandas as pd
import numpy as np
import pyflux as pf
import matplotlib.pyplot as plt
from fbprophet import Prophet
    
%matplotlib inline
plt.rcParams['figure.figsize']=(20,10)
plt.style.use('ggplot')

Load the data

For this work, we're going to use the same retail sales data that we've used before. It can be found in the examples directory of this repository.



In [122]:

    
sales_df = pd.read_csv('../examples/retail_sales.csv', index_col='date', parse_dates=True)



In [123]:

    
sales_df.head()

Like all good modeling projects, we need to take a look at the data to get an idea of what it looks like.



In [22]:

    
sales_df.plot()









    Out[22]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f416b3ff590>

It's pretty clear from this data that we are looking at a trending dataset with some seasonality. This is actually a pretty good datset for prophet since the additive model and prophet's implemention does well with this type of data.

With that in mind, let's take look at what prophet does from a modeling standpoint to compare with the dynamic linear regression model. For more details on this, you can take a look at my blog post titled Forecasting Time Series data with Prophet – Part 4 (http://pythondata.com/forecasting-time-series-data-prophet-part-4/)



In [166]:

    
# Prep data for prophet and run prophet
df = sales_df.reset_index()
df=df.rename(columns={'date':'ds', 'sales':'y'})

model = Prophet(weekly_seasonality=True)
model.fit(df);
future = model.make_future_dataframe(periods=24, freq = 'm')

forecast = model.predict(future)
model.plot(forecast);









    



INFO:fbprophet.forecaster:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.

With our prophet model ready for comparison, let's build a model with pyflux's dynamic linear regresion model.

More Data Viz

Now that we've run our prophet model and can see what it has done, its time to walk through what I call the 'long form' of model building. This is more involved than throwing data at a library and accepting the results.

For this data, let's first look at the differenced log values of our sales data (to try to make it more stationary).



In [51]:

    
diff_log = pd.DataFrame(np.diff(np.log(sales_df['sales'].values)))
diff_log.index = sales_df.index.values[1:sales_df.index.values.shape[0]]
diff_log.columns = ["Sales DiffLog"]



In [64]:

    
sales_df['logged']=np.log(sales_df['sales'])



In [65]:

    
sales_df.tail()



In [60]:

    
sales_df.plot(subplots=True)









    Out[60]:





array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f4162a9d4d0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f4162998a50>], dtype=object)

With our original data (top pane in orange), we can see a very pronounced trend. With the differenced log values (bottom pane in blue), we've removed that trend and made the data staionary (or hopefully we have).

Now, lets take a look at an autocorrelation plot, which will tell us whether the future sales is correlated with the past data. I won't go into detail on autocorrelation, but if you don't understand whether you have autocorrelation (and to what degree), you might be in for a hard time :)

Let's take a look at the autocorrelation plot (acf) if the differenced log values as well as the ACF of the square of the differenced log values.



In [34]:

    
pf.acf_plot(diff_log.values.T[0])
pf.acf_plot(np.square(diff_log.values.T[0]))

We can see that at a lag of 1 and 2 months, there are positive correlations for sales but as time goes on, that correlation drops quickly to a negative correlation that stays in place over time, which hints at the fact that there are some autoregressive effects within this data.

Because of this fact, we can start our modeling by using an ARMA model of some sort.



In [70]:

    
Logged = pd.DataFrame(np.log(sales_df['sales']))
Logged.index = pd.to_datetime(sales_df.index)
Logged.columns = ['Sales - Logged']



In [73]:

    
Logged.head()









    Out[73]:







  
    
      
      Sales - Logged
    
    
      date
      
    
  
  
    
      2009-10-01
      12.732663
    
    
      2009-11-01
      12.734893
    
    
      2009-12-01
      12.899880
    
    
      2010-01-01
      12.659184
    
    
      2010-02-01
      12.647619



In [160]:

    
modelLLT = pf.LLT(data=Logged)



In [161]:

    
x = modelLLT.fit()
x.summary()









    



LLT                                                                                                       
======================================================= ==================================================
Dependent Variable: Sales - Logged                      Method: MLE                                       
Start Date: 2009-10-01 00:00:00                         Log Likelihood: 79.4161                           
End Date: 2015-09-01 00:00:00                           AIC: -152.8322                                    
Number of observations: 72                              BIC: -146.0022                                    
==========================================================================================================
Latent Variable                          Estimate   Std Error  z        P>|z|    95% C.I.                 
======================================== ========== ========== ======== ======== =========================
Sigma^2 irregular                        0.00404690                                                       
Sigma^2 level                            2.2047e-05                                                       
Sigma^2 trend                            0.0                                                              
==========================================================================================================



In [77]:

    
model.plot_fit(figsize=(20,10))



In [162]:

    
modelLLT.plot_predict_is(h=len(Logged)-1, figsize=(20,10))



In [163]:

    
predicted = modelLLT.predict_is(h=len(Logged)-1)
predicted.columns = ['Predicted']



In [164]:

    
predicted.tail()



In [165]:

    
np.exp(predicted).plot()









    Out[165]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f4161616410>



In [ ]:

    
sales_df_future=sales_df
sales_df



In [181]:

    
final_sales=sales_df.merge(np.exp(predicted),right_on=predicted.index)









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-181-83c5cca24b33> in <module>()
----> 1 final_sales=sales_df.merge(np.exp(predicted),right_on=predicted.index)

/vagrant/pythondata/env/local/lib/python2.7/site-packages/pandas/core/frame.pyc in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
   4720                      right_on=right_on, left_index=left_index,
   4721                      right_index=right_index, sort=sort, suffixes=suffixes,
-> 4722                      copy=copy, indicator=indicator)
   4723 
   4724     def round(self, decimals=0, *args, **kwargs):

/vagrant/pythondata/env/local/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
     51                          right_on=right_on, left_index=left_index,
     52                          right_index=right_index, sort=sort, suffixes=suffixes,
---> 53                          copy=copy, indicator=indicator)
     54     return op.get_result()
     55 

/vagrant/pythondata/env/local/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator)
    551             warnings.warn(msg, UserWarning)
    552 
--> 553         self._validate_specification()
    554 
    555         # note this function has side effects

/vagrant/pythondata/env/local/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in _validate_specification(self)
    950                                      'of levels in the index of "left"')
    951                 self.left_on = [None] * n
--> 952         if len(self.right_on) != len(self.left_on):
    953             raise ValueError("len(right_on) must equal len(left_on)")
    954 

TypeError: object of type 'NoneType' has no len()



In [176]:

    
final_sales = sales_df.merge()









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-176-1c2846ade65b> in <module>()
----> 1 final_sales = sales_df.reset_index.merge(np.exp(predicted).reset_index)

AttributeError: 'function' object has no attribute 'merge'



In [173]:

    
final_sales.tail()









    Out[173]:







  
    
      
      sales
      Predicted
    
    
      date
      
      
    
  
  
    
      2015-05-01
      462615
      456989.958343
    
    
      2015-06-01
      448229
      493340.358610
    
    
      2015-07-01
      457710
      461767.885997
    
    
      2015-08-01
      456340
      466655.649862
    
    
      2015-09-01
      430917
      461417.638666



In [157]:

    
final_sales.plot()









    Out[157]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f416174c450>



In [ ]:

	sales
date
2009-10-01	338630
2009-11-01	339386
2009-12-01	400264
2010-01-01	314640
2010-02-01	311022

	sales	diff_log	logged
date
2015-05-01	462615	13.044650	13.044650
2015-06-01	448229	13.013060	13.013060
2015-07-01	457710	13.033991	13.033991
2015-08-01	456340	13.030993	13.030993
2015-09-01	430917	12.973671	12.973671

	Sales - Logged
date
2009-10-01	12.732663
2009-11-01	12.734893
2009-12-01	12.899880
2010-01-01	12.659184
2010-02-01	12.647619

	Predicted
date
2015-05-01	13.032417
2015-06-01	13.108955
2015-07-01	13.042818
2015-08-01	13.053347
2015-09-01	13.042059