Time Series Analysis and Forecasting

Sometimes time itself is the primary predictive feature of our data, and we want to predict how a variable evolves over time. These situations come up constantly, from predicting stock prices to tomorrow's weather. Data of this kind is called a time series, and a special set of statistical methods exists for analyzing it.

Time Series Exploration With Pandas


In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 15, 6

Pandas has great support for datetime objects and general time series analysis operations. We'll be working with an example of predicting the number of airline passengers (in thousands) by month adapted from this tutorial.

First, download the dataset (AirPassengers.csv) and load it into a Pandas DataFrame, specifying the 'Month' column as the datetime index.


In [ ]:
# The 'YYYY-MM' strings in the Month column parse automatically, so no
# custom date parser is needed.
data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month')
print(data.head())

Note that Pandas is using the 'Month' column as the DataFrame's datetime index.


In [ ]:
ts = data["#Passengers"]
ts.index

We can index into the DataFrame in two ways: either by using a string representation of the date or by constructing a datetime object.


In [ ]:
ts['1949-01-01']

In [ ]:
from datetime import datetime
ts[datetime(1949,1,1)]

We can also use the Pandas datetime index support to retrieve entire years


In [ ]:
ts['1949']
and ranges

In [ ]:
ts['1949-01-01':'1949-05-01']

Finally, let's plot the time series to get an initial visualization of how the series grows.


In [ ]:
plt.plot(ts)

Stationarity

Most of the important results for time series forecasting (including the ARIMA model, which we focus on today) assume that the series is stationary - that is, its statistical properties, such as the mean and variance, are constant over time. The graph above certainly isn't stationary, given the obvious growth, so we want to manipulate the time series to make it stationary. Reducing a time series to a stationary series is a hallmark of time series analysis and forecasting, as most real-world time series aren't stationary to begin with.
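
Eyeballing the plot is a good first check, but we can also quantify stationarity. The sketch below (assuming statsmodels, which we install later in this notebook, is available) runs the augmented Dickey-Fuller test, whose null hypothesis is that the series has a unit root, i.e. is nonstationary.


In [ ]:
from statsmodels.tsa.stattools import adfuller

# The augmented Dickey-Fuller test's null hypothesis is nonstationarity
# (a unit root); a large p-value here means we cannot reject it.
result = adfuller(ts)
print('ADF statistic: %.4f' % result[0])
print('p-value: %.4f' % result[1])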

To solve this nonstationarity issue, we can break a time series up into its trend and seasonality. These are the two factors that make a series nonstationary, so the main idea is to remove these factors, operate on the resulting stationary series, then add these factors back in.

First, we will take the log of the series. Since the passenger counts grow multiplicatively, the log transform turns that growth into a roughly linear trend, which is easier to estimate.


In [ ]:
ts_log = np.log(ts)
plt.plot(ts_log)

A simple moving average is the most basic way to predict the trend of a series, taking advantage of the generally continuous nature of trends. For example, if I told you to predict the number of wins of a basketball team this season, without giving you any information about the team apart from its past record, you would take the average of the team's wins over the last few seasons as a reasonable predictor.

The simple moving average operates on this exact principle. Choosing an $n$-element window, the estimate at each point is the average of the most recent $n$ values. Notice that the moving average is undefined for the first 11 values, since a full 12-value window isn't available until the twelfth observation.


In [ ]:
moving_avg = ts_log.rolling(window=12).mean()

plt.plot(ts_log)
plt.plot(moving_avg, color='red')

You might be unhappy with having to choose a window size. How do we know what window size to use if we don't know much about the data? One solution is to average over all past data, discounting earlier values because they have less predictive power than recent ones. This method is known as exponential smoothing; Pandas implements it as an exponentially weighted moving average, where the halflife parameter controls how quickly older observations are discounted.


In [ ]:
exp_weighted_avg = ts_log.ewm(halflife=12).mean()

plt.plot(ts_log)
plt.plot(exp_weighted_avg, color='red')

Now we can subtract the trend from the log series (dropping the null values in the case of the simple moving average) to create a new series that is hopefully more stationary. The blue graph shows the exponentially smoothed difference, while the red graph shows the simple moving average difference.


In [ ]:
ts_exp_moving_avg_diff = ts_log - exp_weighted_avg
ts_log_moving_avg_diff = ts_log - moving_avg
ts_log_moving_avg_diff.dropna(inplace=True)
plt.plot(ts_exp_moving_avg_diff)
plt.plot(ts_log_moving_avg_diff, color='red')

Now there is no longer an upward trend, suggesting stationarity. There does still seem to be a strong seasonality effect, with dips and spikes that repeat in the same pattern each year.

Dealing with Seasonality

One baseline way of dealing with both trend and seasonality at once is differencing: taking a single-step lag, i.e. subtracting the previous value of the series from the current value at each step, so that the new series represents how the original one changes. Formally, the first difference is $X'_t = X_t - X_{t-1}$. Of course, this method can be refined by factoring in more complex lags.


In [ ]:
ts_log_diff = (ts_log - ts_log.shift()).dropna()
plt.plot(ts_log_diff)
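
We can re-run the Dickey-Fuller check from the sketch above on the differenced series; the p-value should now be much smaller, consistent with stationarity.


In [ ]:
# Re-run the augmented Dickey-Fuller test on the differenced series.
result = adfuller(ts_log_diff)
print('p-value: %.4f' % result[1])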

Another method of dealing with trend and seasonality is separating the two effects, then removing both from the time series to obtain the stationary series. We'll be using the statsmodels module, which you can get via pip by running the following command in the terminal.

python -m pip install statsmodels

We will use the seasonal_decompose tool to separate the seasonality from the trend.


In [ ]:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts_log)

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.subplot(411)
plt.plot(ts_log, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()
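
If the decomposition worked, the residual component should be roughly stationary once the trend and seasonality are removed. As a sanity check (a sketch reusing the Dickey-Fuller test from earlier), we can test the residuals directly.


In [ ]:
# Check the decomposition residuals for stationarity; dropna() removes the
# edge values where the trend (a centered moving average) is undefined.
result = adfuller(residual.dropna())
print('p-value: %.4f' % result[1])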

Forecasting

Using the seasonal decomposition, we were able to separate the trend and seasonality effects, which is great for time series analysis. However, another goal of working with time series is forecasting the future - how do we do that given the tools that we've been using and the stationary series we've obtained?

The ARIMA (Autoregressive Integrated Moving Average) model, which operates on stationary series, is one of the most commonly used models for time series forecasting. ARIMA, with parameters $p$, $d$, and $q$, combines an autoregressive model with a moving average model. Let's take a look at what this means.

Autoregressive model: the output variable depends linearly on its own previous values. The $p$ parameter determines the number of lag terms used in the regression. Formally, $X_t = c + \sum_{i = 1}^p \varphi_i X_{t - i} + \epsilon_t$.

Moving average model: despite the name, this is not the rolling average we computed earlier - here the output is a linear combination of past error terms, and the $q$ parameter determines how many are included. Formally, $X_t = \mu + \epsilon_t + \sum_{i = 1}^q \theta_i \epsilon_{t - i}$.

Integrated model: the $d$ parameter represents the number of times the series is differenced, extending the differencing method described earlier. This builds the differencing step for stationarity into the ARIMA model, allowing it to be fit on non-stationary data.
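
To make these formulas concrete, here is a minimal sketch (with arbitrarily chosen parameters, purely for illustration) that simulates an AR(1) process and an MA(1) process directly from the definitions above.


In [ ]:
# Minimal sketch: simulate an AR(1) process, X_t = c + phi*X_{t-1} + eps_t,
# and an MA(1) process, X_t = mu + eps_t + theta*eps_{t-1}.
# The parameter values are arbitrary and purely illustrative.
np.random.seed(0)
n = 200
eps = np.random.normal(size=n)

c, phi = 0.5, 0.8
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = c + phi * ar[t - 1] + eps[t]

mu, theta = 0.5, 0.8
ma = mu + eps
ma[1:] += theta * eps[:-1]

plt.plot(ar, label='AR(1)')
plt.plot(ma, label='MA(1)')
plt.legend(loc='best')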

We don't have time to cover the math behind these models in depth, but Wikipedia provides some more detailed descriptions of the AR, MA, ARMA, and ARIMA models.

Let's fit an ARIMA model and compare its fitted values (red) to the actual differenced data (blue).


In [ ]:
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(ts_log, order=(2, 1, 2))
results_ARIMA = model.fit()
# In recent statsmodels, fitted values are on the (log) level scale, so
# difference them to compare against the differenced series.
fitted_diff = results_ARIMA.fittedvalues.diff().dropna()
plt.plot(ts_log_diff)
plt.plot(fitted_diff, color='red')
plt.title('RSS: %.4f' % sum((fitted_diff - ts_log_diff)**2))

Now we have a model for the stationary series, and we want to get back to the original series. Note that we won't have a fitted difference for the first element because we are working with a one-step lag; the following procedure takes care of that by accumulating the predicted differences and adding back the first log value.


In [ ]:
predictions_ARIMA_diff = fitted_diff.copy()
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)

Now we can plot the prediction (green) against the actual data. Note that the prediction captures the seasonality and trend of the original series. It's not perfect, and additional steps can be taken to tune the model. The important takeaway from this workshop is the general procedure of separating a time series into trend and seasonality effects, and understanding how to work with time series in Pandas.


In [ ]:
predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.plot(ts)
plt.plot(predictions_ARIMA)
plt.title('RMSE: %.4f'% np.sqrt(sum((predictions_ARIMA-ts)**2)/len(ts)))
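
So far we have only compared in-sample fitted values. As a sketch of genuine out-of-sample forecasting (assuming the statsmodels.tsa.arima.model.ARIMA API used above, whose results objects expose a forecast method), we can predict twelve months past the end of the data and undo the log transform. Giving the index an explicit monthly frequency first lets statsmodels attach dates to the forecast.


In [ ]:
# Sketch: refit on an index with an explicit month-start frequency so the
# forecast gets a proper DatetimeIndex, then predict 12 months ahead.
forecast_log = ARIMA(ts_log.asfreq('MS'), order=(2, 1, 2)).fit().forecast(steps=12)
forecast = np.exp(forecast_log)  # undo the log transform
plt.plot(ts)
plt.plot(forecast, color='green')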

Challenge: ARIMA Tuning

This is an open-ended challenge. There aren't any right or wrong answers; we'd just like to see how you would approach tuning the ARIMA model.

As you can see above, the ARIMA predictions could certainly use some tuning. Try manually tuning $p$, $d$, and $q$ and see how that changes the ARIMA predictions. How would you use the ARIMA model to fit the AR, MA, and ARMA models individually? Do these results match what you would expect from those individual models? Can you automate this process to find the parameters that minimize RMSE? Do you see any issues with tuning $p$, $d$, and $q$ this way?


In [ ]:
# TODO: adjust the p, d, and q parameters to model the AR, MA, and ARMA models.
# Then, adjust these parameters to optimally tune the ARIMA model.
test_model = ARIMA(ts_log, order=(2, 1, 2))
test_results_ARIMA = test_model.fit()

# As above, difference the fitted values, accumulate them, and add back the
# first log value to rebuild the log series.
test_predictions_ARIMA_diff = test_results_ARIMA.fittedvalues.diff().dropna()
test_predictions_ARIMA_diff_cumsum = test_predictions_ARIMA_diff.cumsum()
test_predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
test_predictions_ARIMA_log = test_predictions_ARIMA_log.add(test_predictions_ARIMA_diff_cumsum, fill_value=0)

test_predictions_ARIMA = np.exp(test_predictions_ARIMA_log)
plt.plot(ts)
plt.plot(test_predictions_ARIMA)
plt.title('RMSE: %.4f' % np.sqrt(sum((test_predictions_ARIMA - ts)**2) / len(ts)))
