2017-04-28, Josh Montague
This is Part 2 of the "Time Series Modeling in Python" series.
In Part 1, we looked at data structures within the pandas
library that make working with time series particularly convenient.
Here, in Part 2, we'll look at the some of the simplest methods for modeling a time series, making forecasts, and evaluating the accuracy of our models. We'll take advantage of both pandas
and numpy
functionality since they're both great for these sorts of tasks.
In [ ]:
import copy
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
from sklearn.metrics import r2_score, mean_squared_error
In [ ]:
%matplotlib inline
plt.rcParams["figure.figsize"] = (8,6)
In [ ]:
# `unemploy.csv` is included in the repo - it's a small csv
! head misc/unemploy.csv
In [ ]:
# get data from local dir
data = pd.read_csv('misc/unemploy.csv', parse_dates=True, index_col=0)
data.head()
In [ ]:
# this is for the data we have locally
data_name = 'UNEMPLOY'
< optional web data acquisition >
If you want to experiment with other easily accessible data...
</ data acquisition >
In [ ]:
# what does this data look like?
data.plot()
Let's start with the simplest set of things we can possibly do in our forecasting model[1]. These models are computationally simple, require only a small amount of data to make predictions, and generally provide a baseline over which to compare more complicated models.
Generally, we will want some form of a forecast()
method.
But we'll get to abstractions in a moment. First, let's start by just doing it directly.
.shift()
)Let's start with the most straightforward and obvious forecasting method possible: the next data point will be the same as the previous data point. This approach (reasonably) assumes that in many real world systems there is some form of inertia or momentum in the underlying processes that is greater than the associated random fluctuations.
In form, what we'd like is a DataFrame with a column of observed data and column of forecasted data.
To start, our forecasted column will be a shifted version of the existing values column. We can do that with pandas
' .shift()
method.
In [ ]:
# check the top of the dataset
data.head()
In [ ]:
# what does the shift() method do?
# (remember that pandas methods return a new df)
data.shift().head()
In [ ]:
# what happens at the end of the series?
data.tail()
In [ ]:
# 'periods=1' is the default first arg
data.shift(periods=1).tail()
Note that the shift()
method effectively slides the index up (relative to the data), keeping the same index range but clipping or extending the data column as needed.
What if you want the data column to have the same range, but shift the index? For this case, the shift()
method has a freq
kwarg to use instead of the (implicit) periods
kwarg.
In [ ]:
data.head()
In [ ]:
# a timely reminder of bit.ly/pd-offsets
data.shift(freq='2D').head()
Note that the data column stayed fixed, but we adjusted each value in the index according to the freq
kwarg while maintaining the period e.g. one month.
In [ ]:
data.tail()
In [ ]:
data.shift(freq='1M').tail()
Ok, great, now that we can simply shift the existing data column; let's attach that our original frame as our first "forecast" series.
In [ ]:
# use a copy for our forecast frame
d = copy.deepcopy(data)
In [ ]:
# assign the "forecast" to be the 1-period shifted version of the "observed" column
d['forecast'] = data.shift(1)[data_name]
d.head()
We made a forecast!
In [ ]:
Image(filename='misc/kermit.png')
Now how do we assess our model?
It's always a good idea to do a visual inspection of the data. Both in the original of the data (time series), and also in the space of your truth vs predictions.
In [ ]:
# how does our forecast look by eye?
d.plot()
plt.title('naive')
Not so bad! Certainly better than a random number generator.
Another common way to inspect a prediction (in any supervised learning task) is to plot the true data against the predicted data and fit a line through it. In the case where prediction = truth
, this will be a beautiful, straight line, with zero residual error across the entire data set.
If you do see a beautiful line like this, you most likely made a mistake somewhere; the real world is messy. Double check that you didn't accidentally train on your hold-out data, or evaluate on your training data, or something similar.
In the real world (if your model is good), your plot should appear roughly linear, and should have some variance around the line you would draw through the data.
In [ ]:
plt.scatter(d[data_name], d[data_name])
plt.xlabel('truth')
plt.ylabel('also truth')
plt.title('this will never happen');
In [ ]:
plt.scatter(d[data_name], d['forecast'])
plt.xlabel('truth')
plt.ylabel('forecast')
plt.title("variance is a sign that you're alive");
Looks pretty good! We'll come back to a more quantitative assessment of this prediction in a bit.
Before we move to other models, let's convert the naive model to a functional form. This way, as we develop more forecasting models, we can use the same API for consistency.
numpy
> pandas
While we typically have our data in DataFrames, we'll see that most (all?) of our forecasting methods don't make use of the extra metadata in a DataFrame or Series. As a result, we'll define our forecasting models with the expectaion of numpy arrays as input (which are more general), and we'll know that we can always use the .values
attribute to get the respective numpy array data out of a pandas
object.
Here's an implementation of the naive model in functional form. It's a lot of typing for a simple model, but the pattern will prove useful.
In [ ]:
def fc_naive(data, **kwargs):
"""The 'naive' forecast of the next point in `data` (presumed to be
ordered in time) is equal to the last point observed in the series.
`data` should be a 1-D numpy array
Returns a single-valued forecast for the next value in the series.
"""
forecast = data[-1]
return forecast
We can use this function to create a similar DataFrame to the one we used earlier. We'll loop over the observed data, and use the past data as our input.
In [ ]:
# container for the forecast
forecasts = []
# loop over positions in the array
for idx in range(len(data[data_name])):
# subset the series from beginning to position idx
array_slice = data[data_name].iloc[:idx].values
if idx < 10:
# a view behind the scenes...
print('iteration {}, array_slice: {}'.format(idx, array_slice))
# make a forecast using that series
try:
forecasts.append( fc_naive(array_slice) )
except IndexError:
# the first position won't have a forecast value
forecasts.append(np.nan)
In [ ]:
d = copy.deepcopy(data)
d['forecast'] = forecasts
d.head()
In [ ]:
d.plot()
plt.title('same naive graph');
NOW FOR MOAR MODELS!
Instead of just forecasting based on the previous point, we might know there is a consistent cycle within the data. Often, the term "seasonal" is used to describe any sort of cyclic behavior. Seems sloppy, but oh well. In this case, we can prescribe how far back to look in the series, and forecast the corresponding value.
In [ ]:
def fc_snaive(data, n=7, **kwargs):
"""The 'seasonal naive' forecast of the next point in `data` (presumed to be
ordered in time) is equal to the point observed `n` points prior in the series.
`data` should be a 1-D numpy array
`n` should be an integer
Returns a single-valued forecast for the next value in the series.
"""
forecast = data[-n]
return forecast
Note that we aren't paying any attention to the observed resolution of the data, only the relative position of any cyclic behavior.
Since we want to create a new forecast array (as before), let's make a function that encapsulates the iteration over an observed array and returns the corresponding forecast array.
In [ ]:
def forecast_series(observed, fc_func, **kwargs):
"""Returns an array of forecasted values (using `fc_func` and any `kwargs` like a window `n`)
for each value in the input np.array `observed`.
"""
# container for the forecast
forecasts = []
for idx in range(len(observed)):
# subset the series from beginning to position idx
array_slice = observed[:idx]
# make a forecast using that series
try:
forecasts.append( fc_func(array_slice, **kwargs) )
except IndexError:
# the first position won't have a forecast value
forecasts.append(np.nan)
return forecasts
In [ ]:
d = copy.deepcopy(data)
# our data is monthly, and i have a hunch about quarterly cycles, so let's use n=3 (3 months in a quarter)
forecasts = forecast_series(d[data_name].values, fc_snaive, n=3)
d['forecast'] = forecasts
d.head()
In [ ]:
d.plot()
plt.title('seasonal naive (n=3)')
In [ ]:
plt.scatter(d[data_name], d['forecast'])
plt.xlabel('truth')
plt.ylabel('forecast')
plt.title('seasonal naive method')
In [ ]:
def fc_mean(data, n=3, **kwargs):
"""The 'mean' forecast of the next point in `data` (presumed to be
ordered in time) is equal to the mean value of the most recent `n` observed points.
`data` should be a 1-D numpy array
`n` should be an integer
Returns a single-valued forecast for the next value in the series.
"""
# don't start averaging until we've seen n points
if len(data[-n:]) < n:
forecast = np.nan
else:
# nb: we'll keep the forecast as a float
forecast = np.mean(data[-n:])
return forecast
In [ ]:
d = copy.deepcopy(data)
# let's try a 4-point rolling mean
forecasts = forecast_series(d[data_name].values, fc_mean, n=4)
d['forecast'] = forecasts
d.head()
In [ ]:
d.plot()
plt.title('mean forecast (n=3)');
In [ ]:
plt.scatter(d[data_name], d['forecast'])
plt.xlabel('truth')
plt.ylabel('forecast')
plt.title('mean method')
In [ ]:
def fc_drift(data, n=3, **kwargs):
"""The 'drift' forecast of the next point in `data` (presumed to be
ordered in time) is a linear extrapoloation from `n` points ago, through the
most recent point.
`data` should be a 1-D numpy array
`n` should be an integer
Returns a single-valued forecast for the next value in the series.
"""
yi = data[-n]
yf = data[-1]
slope = (yf - yi) / (n-1)
forecast = yf + slope
return forecast
In [ ]:
d = copy.deepcopy(data)
# let's try a 5-point drift method
forecasts = forecast_series(d[data_name].values, fc_drift, n=5)
d['forecast'] = forecasts
d.head()
In [ ]:
d.plot()
plt.title('drift method');
In [ ]:
plt.scatter(d[data_name], d['forecast'])
plt.xlabel('truth')
plt.ylabel('forecast')
At this point, we've looked at four different simple models: naive, seasonal naive, (rolling) mean, and drift. A next reasonable question is: which one best reflects our data? To answer that, we'll look to some metrics for model accuracy measurements.
As is often the case, scikit-learn has a module that supplies a handful of these out of the box. The first thing you'll note is that there are more metrics for classification accuracy than for regression accuracy evaluation. Still, at least we don't have to reinvent the wheel!
First, let's make some forecasts...
In [ ]:
d = copy.deepcopy(data)
# feel free to tweak the 'n' args
model_list = [('naive', fc_naive, 0),
('seasonal_naive', fc_snaive, 3),
('mean', fc_mean, 3),
('drift', fc_drift, 5)]
# create new cols for each model
for name, model, nn in model_list:
d[name] = forecast_series(d[data_name].values, model, n=nn)
In [ ]:
d.head()
In [ ]:
d.plot()
plt.title('ALL THE FORECASTS!');
In [ ]:
for name, series_data in d.items():
plt.plot(d[data_name], series_data, 'o', alpha=0.6, label=name)
plt.xlabel('truth')
plt.ylabel('pred')
plt.title('another view')
plt.legend()
Another view on these charts to quantify the quality of a model is to look at the distribution of residuals.
In [ ]:
comparison = 'naive'
(d[data_name] - d[comparison]).hist(bins=30)
plt.xlabel('residuals')
plt.title('residual distribution for method: {}'.format(comparison));
They all look pretty good, but we can be more specific.
The $R^2$ score is a common regression scoring method. The user guide (and wikipedia) have nice explanations of both the defintion, and the domain; in short, the maximum value is 1.0, scoring a constant prediction that is the mean value of the observed data will give 0.0, and it can be arbitraily negative). The $R^2$ metric is basically what your eyeballs are assessing when you look at a scatter plot of the truth data vs. the predicted data.
Let's look at the $R^2$ value for the models that we've introduced so far.
In [ ]:
print('* R2 scores (bigger = better) *\n')
# calculate R2 for each model (against the observed data)
for name, series_data in d.items():
# strip rows with nans
subdf = d[[data_name, name]].dropna()
truth = subdf[data_name].values
pred = subdf[name].values
# calculate metric
r2 = r2_score(truth, pred)
print('{} - {:.4f}'.format(name, r2))
Feel free to experiment (or BYO GridSearchCV), but I think you'll typically find these are all >0.95, and the naive model frequently has the highest value.
The mean squared error (MSE) is a simpler calculation that is the expected value of the quadratic error.
$$ MSE(y,\hat{y}) = \frac{1}{n_{samples}} \sum_i (y_i - \hat{y}_i)^2 $$
In [ ]:
print('* MAE scores (smaller = better) *\n')
# calculate MAE for each model (against the observed data)
for name, series_data in d.items():
# strip rows with nans
subdf = d[[data_name, name]].dropna()
truth = subdf[data_name].values
pred = subdf[name].values
# calculate metric
mae = mean_squared_error(truth, pred)
print('{} - {:.4f}'.format(name, mae))
Since MAE isn't normalized, it's a little hard to eyeball (big numbers), and to compare different models. There are also mean absolute, and median absolute errors, if you feel like you want to "penalize" outliers in any particular fashion.
I like $R^2$ because it's relateively straightforward, but it's good to know there are alternatives.
While none of these models are particularly fancy, they provide great baselines for any other model you can dream up. They are generally very fast and straightforward to evaluate, and provide a surprisingly high accuracy on many forecasting tasks.
For some broader context on forecasting (if, sadly, written in R
), check out Ref [1].
Hopefully, there are still two classes of time series modeling that we'll look at in the future: so-called "state space models" (of which the most famous is the ARIMA family), and more recent, neural network-based models. When we do get to these models, now we can baseline them against these simple methods!
pandas
AppendixThere are a handful of pandas
methods that are time-series related, but not necessarily about forecasting and this seemed like a good place to highlight them. In particular, if you'd like to transform a time series according to a rolling window average, or exponential smoothing, there are efficient built-in methods for that!
In [ ]:
# recall our original 'data' dataframe
data.head()
The official docs include many examples, but a common pattern is creating a Rolling
object - which has the notion of a window size, and a window type (square, triangular, etc.) - and then applying functions like you would after a groupby or a resample.
In [ ]:
# make a "rolling" object that we can use for calculations
r = data.rolling(window=5)
# this object can be treated much like a GroupBy object
r
In [ ]:
# we can apply a number of methods to the Rolling object, like standard numerical calcs
r.mean().head(10)
In [ ]:
plt.plot(data, 'o--', label=data_name)
plt.plot(r.mean(), '.-', label='rolling mean')
plt.legend()
In [ ]:
plt.plot(data, 'o--', label=data_name)
plt.plot(r.max(), '.-', label='rolling max')
plt.legend()
In [ ]:
# calculate stdev on the rolling object within window size
stds = r.std()
# add the stdevs as error bars on each point
data.plot(style='o', yerr=stds)
plt.title('observed data points + windowed stdev')
plt.legend();
Finally, if you should want them, there are windows that have an exponentially decaying weight. These are relatively new to pandas, and havequestionable documentation. But, they're configured by either a span
(much like the rolling window), the decay constant alpha
, or a couple of related.
In [ ]:
plt.plot(data, 'o-', label=data_name)
plt.plot(data.ewm(span=5).mean(), '.--', label='EMW')
plt.legend();
In [ ]: