Import necessary libraries


In [1]:
import pandas as pd
import numpy as np
from fbprophet import Prophet
import matplotlib.pyplot as plt
 
%matplotlib inline
 
plt.rcParams['figure.figsize']=(20,10)
plt.style.use('ggplot')

Read in the data

Read the data in from the retail sales CSV file in the examples folder, set the index to the 'date' column, and parse the dates in the file.


In [2]:
sales_df = pd.read_csv('../examples/retail_sales.csv', index_col='date', parse_dates=True)

In [3]:
sales_df.head()


Out[3]:
sales
date
2009-10-01 338630
2009-11-01 339386
2009-12-01 400264
2010-01-01 314640
2010-02-01 311022

Prepare for Prophet

For Prophet to work, we need to change the names of these columns to 'ds' and 'y', so let's create a new dataframe and keep our old one handy (you'll see why later). The new dataframe will initially be created with an integer index so we can rename the columns.


In [4]:
df = sales_df.reset_index()

In [5]:
df.head()


Out[5]:
date sales
0 2009-10-01 338630
1 2009-11-01 339386
2 2009-12-01 400264
3 2010-01-01 314640
4 2010-02-01 311022

Let's rename the columns as required by fbprophet. Additionally, fbprophet doesn't like the index to be a datetime; it wants to see 'ds' as a non-index column, so we won't set an index other than the default integer index.


In [6]:
df=df.rename(columns={'date':'ds', 'sales':'y'})

In [7]:
df.head()


Out[7]:
ds y
0 2009-10-01 338630
1 2009-11-01 339386
2 2009-12-01 400264
3 2010-01-01 314640
4 2010-02-01 311022

Now's a good time to take a look at your data. Plot the data using pandas' plot function.


In [8]:
df.set_index('ds').y.plot()


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f33b05bf790>

When working with time-series data, it's good to take a look at the data to determine whether trends exist, whether it is stationary, and whether it has any outliers or other anomalies. Facebook Prophet's example uses a log-transform as a way to remove some of these anomalies. It isn't the absolute 'best' way to handle them, but given that it's the example and a simple data series, I'll follow their lead for now. Taking the log of a number is easily reversed, so you can always get back to your original data.

To log-transform your data, you can use numpy's log() function:


In [9]:
df['y'] = np.log(df['y'])

In [10]:
df.tail()


Out[10]:
ds y
67 2015-05-01 13.044650
68 2015-06-01 13.013060
69 2015-07-01 13.033991
70 2015-08-01 13.030993
71 2015-09-01 12.973671
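
Since the log-transform is easily reversed with np.exp, a quick sanity check (a minimal sketch, not part of the original notebook) confirms we can recover the original sales values at any point:

# round-trip check: exp(log(y)) should reproduce the original sales figures
restored = np.exp(df['y'])
print(np.allclose(restored.values, sales_df['sales'].values))  # expect True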

In [11]:
df.set_index('ds').y.plot()


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3380c76390>

As you can see in the above chart, the plot looks the same as the first one but just at a different scale.

Running Prophet

Now, let's set prophet up to begin modeling our data.

Note: Since we are using monthly data, you'll see a message from Prophet saying "Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this." This is fine since we are working with monthly data, but you can enable weekly seasonality by passing weekly_seasonality=True when instantiating Prophet.


In [12]:
model = Prophet()
model.fit(df);


Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
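
If you did want to keep weekly seasonality (not useful for monthly data, but shown here as a minimal sketch of the override mentioned above), you would pass the flag when instantiating the model:

# not needed for this monthly series; included only to show the override
model_weekly = Prophet(weekly_seasonality=True)
model_weekly.fit(df)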

Forecasting is fairly useless unless you can look into the future, so we need to add some future dates to our dataframe. For this example, I want to forecast 2 years into the future, so I'll build a future dataframe with 24 periods since we are working with monthly data. Note the freq='m' argument, which ensures we are adding 24 months of data.

This can be done with the following code:


In [13]:
future = model.make_future_dataframe(periods=24, freq = 'm')
future.tail()


Out[13]:
ds
91 2017-04-30
92 2017-05-31
93 2017-06-30
94 2017-07-31
95 2017-08-31
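
Notice that freq='m' generates month-end dates (2017-04-30, 2017-05-31, ...) even though the original data is stamped at the start of each month. That's fine for this example, but if you wanted the future dates to land on month starts instead, one option (an aside, not part of the original walk-through) is pandas' month-start alias:

# 'MS' is pandas' month-start frequency alias
future_ms = model.make_future_dataframe(periods=24, freq='MS')
future_ms.tail()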

To forecast this future data, we need to run it through Prophet's model.


In [14]:
forecast = model.predict(future)

The resulting forecast dataframe contains quite a bit of data, but we really only care about a few columns. First, let's look at the full dataframe:


In [15]:
forecast.tail()


Out[15]:
ds t trend seasonal_lower seasonal_upper trend_lower trend_upper yhat_lower yhat_upper yearly yearly_lower yearly_upper seasonal yhat
91 2017-04-30 1.280888 13.042878 0.020710 0.020710 12.933548 13.143430 12.952123 13.165723 0.020710 0.020710 0.020710 0.020710 13.063588
92 2017-05-31 1.295234 13.044540 0.011932 0.011932 12.923032 13.152482 12.938400 13.164907 0.011932 0.011932 0.011932 0.011932 13.056472
93 2017-06-30 1.309116 13.046148 0.028827 0.028827 12.915811 13.162192 12.944827 13.193656 0.028827 0.028827 0.028827 0.028827 13.074975
94 2017-07-31 1.323461 13.047810 0.003016 0.003016 12.910398 13.171669 12.915154 13.178260 0.003016 0.003016 0.003016 0.003016 13.050825
95 2017-08-31 1.337807 13.049471 -0.025674 -0.025674 12.901938 13.182222 12.880747 13.154949 -0.025674 -0.025674 -0.025674 -0.025674 13.023797

We really only want to look at yhat, yhat_lower and yhat_upper, so we can do that with:


In [16]:
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()


Out[16]:
ds yhat yhat_lower yhat_upper
91 2017-04-30 13.063588 12.952123 13.165723
92 2017-05-31 13.056472 12.938400 13.164907
93 2017-06-30 13.074975 12.944827 13.193656
94 2017-07-31 13.050825 12.915154 13.178260
95 2017-08-31 13.023797 12.880747 13.154949

Plotting Prophet results

Prophet has a plotting mechanism called plot. This plot functionality draws the original data (black dots), the model (blue line) and the uncertainty interval of the forecast (shaded blue area).


In [17]:
model.plot(forecast);
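
As an aside, Prophet also ships a plot_components method that draws the trend and yearly seasonality separately; a minimal sketch (output not shown here):

# break the forecast into trend and yearly seasonality panels
model.plot_components(forecast);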


Personally, I'm not a fan of this visualization, so I like to break the data up and build a chart myself. The next section describes how I build my own visualization for Prophet modeling.

Visualizing Prophet models

In order to build a useful dataframe to visualize our model versus our original data, we need to combine the output of the Prophet model with our original data set, then we'll build a new chart manually using pandas and matplotlib.

First, let's set our dataframes to use 'ds' as their index.


In [18]:
df.set_index('ds', inplace=True)
forecast.set_index('ds', inplace=True)

Now, we'll combine the original data and our forecast model data.


In [19]:
viz_df = sales_df.join(forecast[['yhat', 'yhat_lower','yhat_upper']], how = 'outer')

If we look at the head(), we see the data has been joined correctly, but the scales of our original data (sales) and our model (yhat) are different. We need to rescale the yhat column(s) to get them back onto the original scale, so we'll use numpy's exp function to do that.


In [20]:
viz_df.head()


Out[20]:
sales yhat yhat_lower yhat_upper
2009-10-01 338630.0 12.730339 12.719124 12.741363
2009-11-01 339386.0 12.748734 12.737656 12.760092
2009-12-01 400264.0 12.884803 12.873969 12.896787
2010-01-01 314640.0 12.660496 12.649357 12.671431
2010-02-01 311022.0 12.656080 12.644973 12.667275

In [21]:
viz_df['yhat_rescaled'] = np.exp(viz_df['yhat'])

In [22]:
viz_df.head()


Out[22]:
sales yhat yhat_lower yhat_upper yhat_rescaled
2009-10-01 338630.0 12.730339 12.719124 12.741363 337843.878333
2009-11-01 339386.0 12.748734 12.737656 12.760092 344115.810258
2009-12-01 400264.0 12.884803 12.873969 12.896787 394274.522610
2010-01-01 314640.0 12.660496 12.649357 12.671431 315052.880217
2010-02-01 311022.0 12.656080 12.644973 12.667275 313664.686287

Let's take a look at the sales and yhat_rescaled data together in a chart.


In [23]:
viz_df[['sales', 'yhat_rescaled']].plot()


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f33800fc090>

You can see from the chart that the model (blue) is pretty good when plotted against the actual signal (orange), but I like to make my visualizations a little easier to understand. To build my 'better' visualization, we'll need to go back to our original sales_df and forecast dataframes.

First things first - we need to find the 2nd to last date of the original sales data in sales_df in order to ensure the original sales data and model data charts are connected.


In [24]:
sales_df.index = pd.to_datetime(sales_df.index) #make sure our index is a datetime object
connect_date = sales_df.index[-2] #select the 2nd to last date

Using the connect_date, we can now grab only the model data that comes after that date (you'll see why in a minute). To do this, we'll mask the forecast data.


In [25]:
mask = (forecast.index > connect_date)
predict_df = forecast.loc[mask]

In [26]:
predict_df.head()


Out[26]:
t trend seasonal_lower seasonal_upper trend_lower trend_upper yhat_lower yhat_upper yearly yearly_lower yearly_upper seasonal yhat
ds
2015-09-01 1.000000 13.010341 -0.036137 -0.036137 13.010341 13.010341 12.962424 12.985457 -0.036137 -0.036137 -0.036137 -0.036137 12.974204
2015-09-30 1.013420 13.011896 -0.034123 -0.034123 13.011743 13.012067 12.966155 12.988926 -0.034123 -0.034123 -0.034123 -0.034123 12.977772
2015-10-31 1.027765 13.013557 -0.013423 -0.013423 13.011848 13.015386 12.988682 13.012529 -0.013423 -0.013423 -0.013423 -0.013423 13.000134
2015-11-30 1.041647 13.015166 0.156795 0.156795 13.010909 13.019452 13.158479 13.184817 0.156795 0.156795 0.156795 0.156795 13.171960
2015-12-31 1.055993 13.016827 -0.137438 -0.137438 13.008524 13.024350 12.863865 12.893395 -0.137438 -0.137438 -0.137438 -0.137438 12.879390

Now, let's build a dataframe to use in our new visualization. We'll follow the same steps we did before.


In [27]:
viz_df = sales_df.join(predict_df[['yhat', 'yhat_lower','yhat_upper']], how = 'outer')
viz_df['yhat_scaled']=np.exp(viz_df['yhat'])

Now, if we take a look at the head() of viz_df we'll see 'NaN's everywhere except for our original data rows.


In [28]:
viz_df.head()


Out[28]:
sales yhat yhat_lower yhat_upper yhat_scaled
2009-10-01 338630.0 NaN NaN NaN NaN
2009-11-01 339386.0 NaN NaN NaN NaN
2009-12-01 400264.0 NaN NaN NaN NaN
2010-01-01 314640.0 NaN NaN NaN NaN
2010-02-01 311022.0 NaN NaN NaN NaN

If we take a look at the tail() of viz_df, you'll see we have values for the forecasted periods and NaN's for the original data series.


In [29]:
viz_df.tail()


Out[29]:
sales yhat yhat_lower yhat_upper yhat_scaled
2017-04-30 NaN 13.063588 12.952123 13.165723 471459.411475
2017-05-31 NaN 13.056472 12.938400 13.164907 468116.264965
2017-06-30 NaN 13.074975 12.944827 13.193656 476858.563032
2017-07-31 NaN 13.050825 12.915154 13.178260 465480.406592
2017-08-31 NaN 13.023797 12.880747 13.154949 453067.919980

Time to plot

Now, let's plot everything to get the 'final' visualization of our sales data and forecast with errors.


In [30]:
fig, ax1 = plt.subplots()
ax1.plot(viz_df.sales, label='Actual Sales')
ax1.plot(viz_df.yhat_scaled, color='black', linestyle=':', label='Forecasted Sales')
ax1.fill_between(viz_df.index, np.exp(viz_df['yhat_upper']), np.exp(viz_df['yhat_lower']), alpha=0.5, color='darkgray')
ax1.set_title('Sales (Orange) vs Sales Forecast (Black)')
ax1.set_ylabel('Dollar Sales')
ax1.set_xlabel('Date')

ax1.legend() #build the legend from the labels set on each plot call above


This visualization is much better (in my opinion) than the default fbprophet plot. It is much easier to quickly understand and describe what's happening. The orange line is the actual sales data and the black dotted line is the forecast. The gray shaded area is the uncertainty estimation of the forecast.

