```
In [ ]:
```import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6

Pandas has great support for datetime objects and general time series analysis operations. We'll be working with an example of predicting the number of airline passengers (in thousands) by month adapted from this tutorial.

First, download this dataset and load it into a Pandas Dataframe by specifying the 'Month' column as the datetime index.

```
In [ ]:
```dateparse = lambda dates: pd.datetime.strptime(dates, '%Y-%m')
data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month',date_parser=dateparse)
print data.head()

Note that Pandas is using the 'Month' column as the Dataframe index.

```
In [ ]:
```ts = data["#Passengers"]
ts.index

```
In [ ]:
```ts['1949-01-01']

```
In [ ]:
```from datetime import datetime
ts[datetime(1949,1,1)]

We can also use the Pandas datetime index support to retrieve entire years

```
In [ ]:
```ts['1949']

and ranges

```
In [ ]:
```ts['1949-01-01':'1949-05-01']

Finally, let's plot the time series to get an intial visualization of how the series grows.

```
In [ ]:
```plt.plot(ts)

Most of the important results for time series forecasting (including the ARIMA model, which we focus on today) assume that the series is stationary - that is, its statistical properties like mean and variance are constant. However, the graph above certainly isn't stationary, given the obvious growth. Thus, we want to manipulate the time series to make it stationary. This process of reducing a time series to a stationary series is a hallmark of time series analysis and forecasting, as most real world time series' aren't initially stationary.

To solve this nonstationarity issue, we can break a time series up into its trend and seasonality. These are the two factors that make a series nonstationary, so the main idea is to remove these factors, operate on the resulting stationary series, then add these factors back in.

First, we will take the log of the series to reduce the positive trend. This gives a seemingly linear trend, making it easier to estimate.

```
In [ ]:
```ts_log = np.log(ts)
plt.plot(ts_log)

A simple moving average is the most basic way to predict the trend of a series, taking advantage of the generally continuous nature of trends. For example, if I told you to predict the number of wins of a basketball team this season, without giving you any information about the team apart from its past record, you would take the average of the team's wins over the last few seasons as a reasonable predictor.

The simple moving average operates on this exact principle. Choosing an $n$ element window to average over, the prediction at each point is obtained by taking the average of the last $n$ values. Notice that the moving average is undefined for first 12 values because we're looking at a 12 value window.

```
In [ ]:
```moving_avg = pd.Series(ts_log).rolling(window=12).mean()
plt.plot(ts_log)
plt.plot(moving_avg, color='red')

```
In [ ]:
```expwighted_avg = pd.Series(ts_log).ewm(halflife=12).mean()
plt.plot(ts_log)
plt.plot(expwighted_avg, color='red')

```
In [ ]:
```ts_exp_moving_avg_diff = ts_log - expwighted_avg
ts_log_moving_avg_diff = ts_log - moving_avg
ts_log_moving_avg_diff.dropna(inplace=True)
plt.plot(ts_exp_moving_avg_diff)
plt.plot(ts_log_moving_avg_diff, color='red')

```
In [ ]:
```ts_log_diff = ts_log - ts_log.shift()
plt.plot(ts_log_diff)

Another method of dealing with trend and seasonality is separating the two effects, then removing both from the time series to obtain the stationary series. We'll be using the statsmodels module, which you can get via pip by running the following command in the terminal.

python -mpip install statsmodels

We will use the seasonal decompose tool to separate seasonality from trend.

```
In [ ]:
```from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts_log)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
plt.subplot(411)
plt.plot(ts_log, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()

Using the seasonal decomposition, we were able to separate the trend and seasonality effects, which is great for time series analysis. However, another goal of working with time series is forecasting the future - how do we do that given the tools that we've been using and the stationary series we've obtained?

The ARIMA (Autoregressive Integrated Moving Average) model, which operates on stationary series', is one of the most commonly used models for time series forecasting. ARIMA, with parameters $p$, $d$, and $q$, combines an Autoregressive Model with a Moving Average model. Let's take a look at what this means.

Autoregressive model: output variable depends linearly on previous values. The $p$ parameter determines the number of lag terms used in the regression. Formally, $X_t = c + \sum_{i = 1}^p \varphi_iX_{t - i} + \epsilon_t$.

Moving average model: generalizes the same concept of moving average we saw earlier - the $q$ parameter determines the order of the model. Formally, $X_t = \mu + \sum_{i = 1}^q \theta_i\epsilon_{t - i}$.

Integrated model: the $d$ parameter represents the number of times past values have been subtracted, extending on the differencing method described earlier. This integrates the differencing for stationality into the ARIMA model, allowing it to be fit on non-stationalized data.

We don't have time to cover the math behind these models in depth, but Wikipedia provides some more detailed descriptions of the AR, MA, ARMA, and ARIMA models.

Comparing our model's results (red) to the actual differenced data (blue).

```
In [ ]:
```from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(ts_log, order=(2, 1, 2))
results_ARIMA = model.fit(disp=-1)
plt.plot(ts_log_diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues-ts_log_diff)**2))

```
In [ ]:
```predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(ts_log.ix[0], index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)

```
In [ ]:
```predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.plot(ts)
plt.plot(predictions_ARIMA)
plt.title('RMSE: %.4f'% np.sqrt(sum((predictions_ARIMA-ts)**2)/len(ts)))

This is an open ended challenge. There aren't any right or wrong answers, we'd just like to see how you would approach tuning the ARIMA model.

As you can see above, the ARIMA predictions could certainly use some tuning. Try manually tuning $p$, $d$, and $q$ and see how that changes the ARIMA predictions. How would you use the AR, MA, and ARMA models individually using the ARIMA model? Do these results match what you would expect from these individual models? Can you automate this process to find the parameters that minimize RMSE? Do you see any issues with tuning $p$, $d$ and $q$ this way?

```
In [ ]:
```# TODO: adjust the p, d, and q parameters to model the AR, MA, and ARMA models. Then, adjust these parameters to optimally tune the ARIMA model.
test_model = ARIMA(ts_log, order=(2, 1, 2))
test_results_ARIMA = test_model.fit(disp=-1)
test_predictions_ARIMA_diff = pd.Series(test_results_ARIMA.fittedvalues, copy=True)
test_predictions_ARIMA_diff_cumsum = test_predictions_ARIMA_diff.cumsum()
test_predictions_ARIMA_log = pd.Series(ts_log.ix[0], index=ts_log.index)
test_predictions_ARIMA_log = test_predictions_ARIMA_log.add(test_predictions_ARIMA_diff_cumsum,fill_value=0)
test_predictions_ARIMA = np.exp(test_predictions_ARIMA_log)
plt.plot(ts)
plt.plot(test_predictions_ARIMA)
plt.title('RMSE: %.4f'% np.sqrt(sum((test_predictions_ARIMA-ts)**2)/len(ts)))

```
In [ ]:
```