The general process for ARIMA models is the following:
Let's go through these steps!
We will be using some data about monthly milk production, full details on it can be found here.
Its saved as a csv for you already, let's load it up:
In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
df = pd.read_csv('monthly-milk-production-pounds-p.csv')
In [3]:
df.head()
Out[3]:
In [4]:
df.tail()
Out[4]:
Clean Up
Let's clean this up just a little!
In [5]:
df.columns = ['Month', 'Milk in pounds per cow']
df.head()
Out[5]:
In [6]:
# Weird last value at bottom causing issues
df.drop(168,
axis = 0,
inplace = True)
In [7]:
df['Month'] = pd.to_datetime(df['Month'])
In [8]:
df.head()
Out[8]:
In [9]:
df.set_index('Month',inplace=True)
In [10]:
df.head()
Out[10]:
In [11]:
df.describe().transpose()
Out[11]:
In [12]:
df.plot()
Out[12]:
In [13]:
timeseries = df['Milk in pounds per cow']
In [14]:
timeseries.rolling(12).mean().plot(label='12 Month Rolling Mean')
timeseries.rolling(12).std().plot(label='12 Month Rolling Std')
timeseries.plot()
plt.legend()
Out[14]:
In [15]:
timeseries.rolling(12).mean().plot(label = '12 Month Rolling Mean')
timeseries.plot()
plt.legend()
Out[15]:
In [16]:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df['Milk in pounds per cow'],
freq = 12)
fig = plt.figure()
fig = decomposition.plot()
fig.set_size_inches(15, 8)
We can use the Augmented Dickey-Fuller unit root test.
In statistics and econometrics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis that a unit root is present in a time series sample. The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or trend-stationarity.
Basically, we are trying to whether to accept the Null Hypothesis H0 (that the time series has a unit root, indicating it is non-stationary) or reject H0 and go with the Alternative Hypothesis (that the time series has no unit root and is stationary).
We end up deciding this based on the p-value return.
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
Let's run the Augmented Dickey-Fuller test on our data:
In [17]:
df.head()
Out[17]:
In [18]:
from statsmodels.tsa.stattools import adfuller
In [19]:
result = adfuller(df['Milk in pounds per cow'])
In [20]:
print('Augmented Dickey-Fuller Test:')
labels = ['ADF Test Statistic',
'p-value',
'#Lags Used',
'Number of Observations Used']
for value,label in zip(result,labels):
print(label+' : '+str(value) )
if result[1] <= 0.05:
print("strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary")
else:
print("weak evidence against null hypothesis, time series has a unit root, indicating it is non-stationary ")
In [21]:
# Store in a function for later use!
def adf_check(time_series):
"""
Pass in a time series, returns ADF report
"""
result = adfuller(time_series)
print('Augmented Dickey-Fuller Test:')
labels = ['ADF Test Statistic',
'p-value',
'#Lags Used',
'Number of Observations Used']
for value,label in zip(result,labels):
print(label+' : '+str(value) )
if result[1] <= 0.05:
print("strong evidence against the null hypothesis, reject the null hypothesis. Data has no unit root and is stationary")
else:
print("weak evidence against null hypothesis, time series has a unit root, indicating it is non-stationary ")
We have now realized that our data is seasonal (it is also pretty obvious from the plot itself). This means we need to use Seasonal ARIMA on our model. If our data was not seasonal, it means we could use just ARIMA on it. We will take this into account when differencing our data! Typically financial stock data won't be seasonal, but that is kind of the point of this section, to show you common methods, that won't work well on stock finance data!
First Difference
In [22]:
df['Milk First Difference'] = df['Milk in pounds per cow'] - df['Milk in pounds per cow'].shift(1)
In [23]:
adf_check(df['Milk First Difference'].dropna())
In [24]:
df['Milk First Difference'].plot()
Out[24]:
Second Difference
In [25]:
# Sometimes it would be necessary to do a second difference
# This is just for show, we didn't need to do a second difference in our case
df['Milk Second Difference'] = df['Milk First Difference'] - df['Milk First Difference'].shift(1)
In [26]:
adf_check(df['Milk Second Difference'].dropna())
In [27]:
df['Milk Second Difference'].plot()
Out[27]:
Seasonal Difference
In [28]:
df['Seasonal Difference'] = df['Milk in pounds per cow'] - df['Milk in pounds per cow'].shift(12)
df['Seasonal Difference'].plot()
Out[28]:
In [29]:
# Seasonal Difference by itself was not enough!
adf_check(df['Seasonal Difference'].dropna())
Seasonal First Difference
In [30]:
# You can also do seasonal first difference
df['Seasonal First Difference'] = df['Milk First Difference'] - df['Milk First Difference'].shift(12)
df['Seasonal First Difference'].plot()
Out[30]:
In [31]:
adf_check(df['Seasonal First Difference'].dropna())
An autocorrelation plot (also known as a Correlogram ) shows the correlation of the series with itself, lagged by x time units. So the y axis is the correlation and the x axis is the number of time units of lag.
So imagine taking your time series of length T, copying it, and deleting the first observation of copy #1 and the last observation of copy #2. Now you have two series of length T−1 for which you calculate a correlation coefficient. This is the value of of the vertical axis at x=1x=1 in your plots. It represents the correlation of the series lagged by one time unit. You go on and do this for all possible time lags x and this defines the plot.
You will run these plots on your differenced/stationary data. There is a lot of great information for identifying and interpreting ACF and PACF here and here.
The actual interpretation and how it relates to ARIMA models can get a bit complicated, but there are some basic common methods we can use for the ARIMA model. Our main priority here is to try to figure out whether we will use the AR or MA components for the ARIMA model (or both!) as well as how many lags we should use. In general you would use either AR or MA, using both is less common.
If the autocorrelation plot shows positive autocorrelation at the first lag (lag-1), then it suggests to use the AR terms in relation to the lag
If the autocorrelation plot shows negative autocorrelation at the first lag, then it suggests using MA terms.
Here we will be showing running the ACF and PACF on multiple differenced data sets that have been made stationary in different ways, typically you would just choose a single stationary data set and continue all the way through with that.
The reason we use two here is to show you the two typical types of behaviour you would see when using ACF.
In [32]:
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
In [33]:
# Duplicate plots
# Check out: https://stackoverflow.com/questions/21788593/statsmodels-duplicate-charts
# https://github.com/statsmodels/statsmodels/issues/1265
fig_first = plot_acf(df["Milk First Difference"].dropna())
In [34]:
fig_seasonal_first = plot_acf(df["Seasonal First Difference"].dropna())
Pandas also has this functionality built in, but only for ACF, not PACF. So I recommend using statsmodels, as ACF and PACF is more core to its functionality than it is to pandas' functionality.
In [35]:
from pandas.plotting import autocorrelation_plot
autocorrelation_plot(df['Seasonal First Difference'].dropna())
Out[35]:
In general, a partial correlation is a conditional correlation.
It is the correlation between two variables under the assumption that we know and take into account the values of some other set of variables.
For instance, consider a regression context in which y = response variable and x1, x2, and x3 are predictor variables. The partial correlation between y and x3 is the correlation between the variables determined taking into account how both y and x3 are related to x1 and x2.
Formally, this is relationship is defined as:
Check out this link for full details on this.
We can then plot this relationship:
In [36]:
result = plot_pacf(df["Seasonal First Difference"].dropna())
In [37]:
fig = plt.figure(figsize = (12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(df['Seasonal First Difference'].iloc[13:],
lags = 40,
ax = ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(df['Seasonal First Difference'].iloc[13:],
lags = 40,
ax = ax2)
In [38]:
# For non-seasonal data
from statsmodels.tsa.arima_model import ARIMA
In [39]:
# I recommend you glance over this!
#
help(ARIMA)
In [40]:
# We have seasonal data!
model = sm.tsa.statespace.SARIMAX(df['Milk in pounds per cow'],
order = (0,1,0),
seasonal_order = (1,1,1,12))
results = model.fit()
print(results.summary())
In [41]:
results.resid.plot()
Out[41]:
In [42]:
results.resid.plot(kind = 'kde')
Out[42]:
In [43]:
df['forecast'] = results.predict(start = 150,
end = 168,
dynamic = True)
df[['Milk in pounds per cow','forecast']].plot(figsize = (12, 8))
Out[43]:
In [44]:
df.tail()
Out[44]:
In [45]:
# https://pandas.pydata.org/pandas-docs/stable/timeseries.html
# Alternatives
# pd.date_range(df.index[-1],periods=12,freq='M')
In [46]:
from pandas.tseries.offsets import DateOffset
In [47]:
future_dates = [df.index[-1] + DateOffset(months = x) for x in range(0,24) ]
In [48]:
future_dates
Out[48]:
In [49]:
future_dates_df = pd.DataFrame(index = future_dates[1:],
columns = df.columns)
In [50]:
future_df = pd.concat([df,future_dates_df])
In [51]:
future_df.head()
Out[51]:
In [52]:
future_df.tail()
Out[52]:
In [53]:
future_df['forecast'] = results.predict(start = 168,
end = 188,
dynamic= True)
future_df[['Milk in pounds per cow', 'forecast']].plot(figsize = (12, 8))
Out[53]:
Not bad! Pretty cool in fact! I hope this helped you see the potential for ARIMA models, unfortunately a lot of financial data won't follow this sort of behaviour, in fact it will often follow something indicating brownian motion, what is that you ask? Well head on over to the next video section and we'll find out!