This is a toy example for getting acquainted with time series and with presenting in Jupyter Notebooks, so the prediction is not meant to be accurate by any measure. Expect a very introductory treatment, as this is my first foray into forecasting time series.
For the data to be analysed, I chose the financial endowments of a few British universities as listed on Wikipedia. The data is very limited, and I don't suppose one can accurately predict future endowments based only on this data, but it may be a fun learning experience.
Here I go.
In [1]:
from collections import OrderedDict
import datetime
import warnings
# Parsing
from bs4 import BeautifulSoup
import requests
# Plotting
import matplotlib.pyplot as plt
# Numerical computation
import pandas as pd
import numpy as np
# Predicting
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima_model import ARIMA
%matplotlib inline
warnings.filterwarnings('ignore')
In [2]:
#I'm interested in the 2nd table on the page
table_number = 2
url = "http://en.wikipedia.org/wiki/List_of_UK_universities_by_endowment"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table = soup.find_all('table', class_="wikitable")[table_number - 1] # 0-indexed structures
In [3]:
# Using `OrderedDict()` to have the legend ordered later on when plotting the results
unis = OrderedDict()
for row in table.find_all('tr')[1:]:  # skip the header row
    data = row.text.split('\n')
    # data[1] is the university name; the remaining columns are yearly figures.
    # Splitting on '[' strips footnote markers such as "[1]".
    unis[data[1]] = [money.split('[')[0] for money in data[2:-1]]
In [4]:
years = list(range(2015, 2005, -1))  # Values are stored in reverse chronological order
for uni, money in unis.items():
    # Drop the pound sign (and any thousands separators) and convert to numbers
    y = [float(m.strip("£").replace(",", "")) for m in money]
    plt.figure(num=1, figsize=(15, 6))
    plt.plot(years, y, label=uni)
plt.legend(bbox_to_anchor=(0.5, 1))
plt.xlabel('year')
plt.ylabel('£m endowment')
# Don't format the years in scientific notation
ax = plt.gca()
ax.get_xaxis().get_major_formatter().set_useOffset(False)
In [5]:
# Convert to `datetime` objects for the time series processing
date = [datetime.datetime.strptime(str(year), "%Y") for year in years]
df = pd.DataFrame({'ICL': [float(val.strip("£")) for val in unis["Imperial College London"]], "Year": date})
df = df.set_index("Year")
ts = df["ICL"][::-1]
plt.plot(ts)
plt.title("ICL Financial Endowment")
plt.ylabel("£m")
plt.show()
Now, time series often have two properties that are best eliminated before attempting any prediction: trend and seasonality.
To my understanding, there are simply more practical tools and more theoretical background developed around stationary series (i.e. series without trend or seasonality, or equivalently series with stationary mean, variance, and autocorrelation).
There are two how's: how to detect non-stationarity, and how to remove it.
Detecting: when you cannot tell from the plot alone, you may quantify the degree of stationarity using the Dickey-Fuller test, as in the sketch below.
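For illustration, here is a minimal sketch of the test using statsmodels' adfuller (imported here because it is not used elsewhere in this notebook). With only ten observations the test has very little power, so take its verdict with a grain of salt.

from statsmodels.tsa.stattools import adfuller

# Null hypothesis: the series has a unit root (i.e. it is non-stationary).
# A small p-value (below ~0.05) would suggest the series is stationary.
# maxlag=1 because the series is tiny (ten points).
adf_stat, p_value = adfuller(ts, maxlag=1)[:2]
print("ADF statistic: %.3f, p-value: %.3f" % (adf_stat, p_value))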
Removing: there are many methods, but here I'll use differencing, because it is effective and simple enough for an introductory tutorial.
Differencing is a simple process that replaces each value with the difference between it and the previous value, i.e. it operates on pairs of adjacent instances (two at a time). Be aware that this means the first instance becomes invalid, as there is no previous instance to subtract from it.
In [6]:
ts_diff = ts - ts.shift()
# Drop the invalid instances (i.e. the first one)
ts_diff.dropna(inplace=True)
print(ts)  # Print the original values for reference
# Plot the differences
plt.plot(ts_diff)
plt.show()
The differenced series above no longer shows a trend and seems quite stationary (visually, at least). The next step is to use an Auto-Regressive Integrated Moving Average (ARIMA) model to forecast the time series.
Lag: in time series analysis, a "lag" refers to a previous value of a variable. For example, you may predict a new value based on its p past values (lags).
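As a minimal illustration (using nothing beyond the pandas objects already defined), shift() pairs each value with its lag-1 predecessor:

# Each row pairs a value with its lag-1 (previous year's) value
lagged = pd.DataFrame({"value": ts, "lag_1": ts.shift(1)})
print(lagged.head())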
Auto-Regressive terms (the AR in ARIMA) are just the number of lags used for predicting a new value. So if there are 5 auto-regressive terms, the new value depends on the 5 previous values.
Moving average refers to a technique to calculate the trend of the data. It chooses a time window and averages the values inside that time window to smooth out the noise. For a visual interpretation, check the graph below.
Moving Average terms (the MA in ARIMA) are lagged forecast errors used for predicting a new value. So if there are 5 moving average terms, the new value also depends on the errors of the previous 5 predictions, each calculated as the difference between the predicted and the actual value at that instance.
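Putting the two together: for a series y_t, an ARMA(p, q) model (the AR and MA parts of ARIMA, written in the standard textbook form) is

$$y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t$$

where the $\phi_i$ are the auto-regressive coefficients, the $\theta_j$ are the moving-average coefficients, and the $\epsilon_t$ are the forecast errors; the I in ARIMA just means the model is fit to a differenced series.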
To decide on what values these terms should take, one may use techniques such as the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF), but there is very little data in this example, so I will simply pick both to be 1.
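For completeness, here is a sketch of how the acf and pacf functions imported at the top would be used to guide that choice; with only nine differenced points, lags beyond three cannot even be estimated reliably:

# Spikes in the PACF suggest the AR order p; spikes in the ACF suggest the MA order q
lag_acf = acf(ts_diff, nlags=3)
lag_pacf = pacf(ts_diff, nlags=3, method='ols')
plt.stem(range(len(lag_acf)), lag_acf)
plt.title('ACF of the differenced series')
plt.show()
plt.stem(range(len(lag_pacf)), lag_pacf)
plt.title('PACF of the differenced series')
plt.show()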
In [7]:
moving_avg = ts.rolling(window=2).mean()
plt.plot(moving_avg)
plt.plot(ts)
plt.title('Moving average with a 2-year window')
plt.show()
# Helper function
def plot_graphs(series1, series2, label1="ICL", label2="ARIMA prediction", title="Predicting ICL endowment"):
    plt.plot(series1, label=label1)
    plt.plot(series2, label=label2)
    plt.legend(loc='best')
    plt.title(title)
    plt.show()

model = ARIMA(ts_diff, order=(1, 1, 1))  # order=(p, d, q); note d=1 differences the already-differenced series once more
results_ARIMA = model.fit(disp=-1)
plot_graphs(ts_diff, results_ARIMA.fittedvalues)
Above, it can be seen that although the ARIMA predictions catch some of the peaks, they underestimate quite a bit at times -- for instance, the actual differences are rarely negative, while the predictions go as low as -15.
Keep in mind that at this point the plot only displays the difference between adjacent instances. To visualise the actual endowments, one needs to invert the differencing by cumulatively summing the predictions and adding back the first original value:
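As a quick sanity check of that inversion, cumulatively summing the differenced series and adding back the first value reconstructs the original series:

# Undoing the differencing recovers the original series (minus its first point)
reconstructed = ts_diff.cumsum() + ts.iloc[0]
print(np.allclose(reconstructed, ts[1:]))  # expected: True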
In [8]:
preds = results_ARIMA.predict(end=13).cumsum() + ts.iloc[0]  # cumulative sum inverts the differencing; ts.iloc[0] restores the starting level
plot_graphs(ts, preds)
In my defense, this is my first time doing time series analysis. I suspect this is far too little data to get good results, but the forecast may also look this bad because the model predicted negative differences early on, so the cumulative values struggle to pick back up. To confirm this, I will plot the unscaled (still-differenced) predictions:
In [9]:
plot_graphs(ts_diff, results_ARIMA.predict(end=13))
In [10]:
results_ARIMA = ARIMA(ts_diff, order=(0, 0, 2)).fit(disp=-1)  # pure MA(2) model on the differenced series
preds = results_ARIMA.predict(end=13).cumsum() + ts.iloc[0]
plot_graphs(ts, preds)
In [11]:
results_ARIMA = ARIMA(ts, order=(1, 1, 0)).fit(disp=-1)  # let ARIMA difference the original series itself (d=1)
preds = results_ARIMA.predict(end=13)
plot_graphs(ts, preds)
To obtain better results, I believe more training data is needed, along with a better understanding of the domain -- in this case, a better understanding of how ARIMA works and how to tune its parameters would probably help.
The experiment, however, was successful in the sense that it gave me an interesting hands-on introduction to time series and to using Jupyter Notebooks as a presentation medium.