In [11]:
import pandas as pd
import numpy as np
import pyflux as pf
import matplotlib.pyplot as plt
from fbprophet import Prophet
%matplotlib inline
plt.rcParams['figure.figsize']=(20,10)
plt.style.use('ggplot')
In [122]:
sales_df = pd.read_csv('../examples/retail_sales.csv', index_col='date', parse_dates=True)
In [123]:
sales_df.head()
Out[123]:
As with any good modeling project, we should start by taking a look at the data to get an idea of what it looks like.
In [22]:
sales_df.plot()
Out[22]:
It's pretty clear from this plot that we are looking at a trending dataset with some seasonality. This is actually a pretty good dataset for prophet, since its additive model and implementation do well with this type of data.
With that in mind, let's take a look at what prophet does from a modeling standpoint, to compare with the dynamic linear regression model. For more details, take a look at my blog post titled Forecasting Time Series data with Prophet – Part 4 (http://pythondata.com/forecasting-time-series-data-prophet-part-4/).
In [166]:
# Prep data for prophet and run prophet
df = sales_df.reset_index()
df = df.rename(columns={'date': 'ds', 'sales': 'y'})
model = Prophet(weekly_seasonality=True)
model.fit(df);
future = model.make_future_dataframe(periods=24, freq='m')
forecast = model.predict(future)
model.plot(forecast);
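Before we compare, it can help to peek at the numbers behind that plot. Prophet's forecast dataframe carries the point forecast (yhat) along with uncertainty bounds, so a quick look at the last few rows shows what we'll be comparing against:
In [ ]:
# Peek at Prophet's point forecast and uncertainty intervals
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()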
With our prophet model ready for comparison, let's build a model with pyflux's dynamic linear regression model.
Now that we've run our prophet model and can see what it has done, it's time to walk through what I call the 'long form' of model building, which is more involved than just throwing data at a library and accepting the results.
For this data, let's first look at the differenced log values of our sales data (to try to make it more stationary).
In [51]:
# Difference the logged sales to remove the trend
diff_log = pd.DataFrame(np.diff(np.log(sales_df['sales'].values)))
diff_log.index = sales_df.index[1:]
diff_log.columns = ['Sales DiffLog']
In [64]:
# Keep a logged copy of sales alongside the original
sales_df['logged'] = np.log(sales_df['sales'])
In [65]:
sales_df.tail()
Out[65]:
In [60]:
sales_df.plot(subplots=True)
Out[60]:
With our original data (top pane), we can see a very pronounced trend. The logged values (bottom pane) dampen that trend, and differencing them should remove it entirely and make the data stationary (or so we hope).
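If you'd rather not rely on eyeballing the plot, a formal stationarity check is easy to add. Here's a quick sketch using statsmodels' Augmented Dickey-Fuller test (statsmodels isn't otherwise used in this post, so treat this as an optional aside):
In [ ]:
# Optional formal stationarity check; a small p-value (e.g. < 0.05)
# suggests the differenced log series is stationary
from statsmodels.tsa.stattools import adfuller
adf_stat, p_value = adfuller(diff_log['Sales DiffLog'].values)[:2]
print('ADF statistic: %0.3f, p-value: %0.3f' % (adf_stat, p_value))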
Now, let's take a look at an autocorrelation plot, which tells us whether future sales are correlated with past data. I won't go into detail on autocorrelation here, but if you don't know whether you have autocorrelation (and to what degree), you might be in for a hard time :)
Let's take a look at the autocorrelation plot (ACF) of the differenced log values, as well as the ACF of the square of the differenced log values.
In [34]:
# ACF of the differenced log values, then of their squares
pf.acf_plot(diff_log.values.T[0])
pf.acf_plot(np.square(diff_log.values.T[0]))
We can see positive correlations at lags of one and two months, after which the correlation drops quickly to a negative value that persists over time. This hints at autoregressive effects in the data.
Given these autoregressive effects, an ARMA-style model is a reasonable starting point (a quick sketch of one is below). For the rest of this post, though, we'll fit pyflux's local linear trend (LLT) model to the logged sales, which models the trend directly instead of differencing it away.
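Here's a minimal sketch of that ARMA starting point using pyflux's ARIMA class; the ar and ma orders of 2 are illustrative guesses prompted by the ACF plots, not tuned values:
In [ ]:
# Illustrative ARMA(2,2) on the differenced log values;
# the orders are guesses suggested by the ACF plots, not tuned choices
modelARMA = pf.ARIMA(data=diff_log, ar=2, ma=2, target='Sales DiffLog')
resultARMA = modelARMA.fit('MLE')
resultARMA.summary()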
In [70]:
# Put the logged sales in its own dataframe for pyflux
Logged = pd.DataFrame(np.log(sales_df['sales']))
Logged.index = pd.to_datetime(sales_df.index)
Logged.columns = ['Sales - Logged']
In [73]:
Logged.head()
Out[73]:
In [160]:
# Gaussian local linear trend model on the logged sales
modelLLT = pf.LLT(data=Logged)
In [161]:
x = modelLLT.fit()
x.summary()
In [77]:
modelLLT.plot_fit(figsize=(20,10))
In [162]:
modelLLT.plot_predict_is(h=len(Logged)-1, figsize=(20,10))
In [163]:
# Rolling in-sample predictions (still in log space)
predicted = modelLLT.predict_is(h=len(Logged)-1)
predicted.columns = ['Predicted']
In [164]:
predicted.tail()
Out[164]:
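To put a number on the in-sample fit, we can compare these predictions against the actual logged sales. A quick sketch (it assumes the predictions line up with every observation after the first, which is what h=len(Logged)-1 gives us):
In [ ]:
# Mean absolute error of the in-sample predictions, in log space;
# assumes predicted covers every observation after the first
actual = Logged['Sales - Logged'].values[1:]
mae = np.mean(np.abs(actual - predicted['Predicted'].values))
print('In-sample MAE (log space): %0.4f' % mae)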
In [165]:
# Back-transform from log space to the original sales scale
np.exp(predicted).plot()
Out[165]:
In [181]:
# Join the back-transformed predictions onto the original sales data
final_sales = sales_df.merge(np.exp(predicted), left_index=True, right_index=True)
In [173]:
final_sales.tail()
Out[173]:
In [157]:
final_sales.plot()
Out[157]:
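Finally, since the point of this exercise is to compare the two approaches, it's worth lining up their in-sample errors on the original sales scale. A rough sketch (Prophet's forecast dataframe covers the history plus the 24 future months, so we restrict it to the historical rows; the LLT alignment caveat from above still applies):
In [ ]:
# Rough in-sample comparison on the original sales scale
prophet_mae = np.mean(np.abs(df['y'].values - forecast['yhat'].values[:len(df)]))
llt_mae = np.mean(np.abs(sales_df['sales'].values[1:] - np.exp(predicted['Predicted'].values)))
print('Prophet in-sample MAE: %0.2f' % prophet_mae)
print('LLT in-sample MAE:     %0.2f' % llt_mae)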