http://machinelearningmastery.com/visualize-time-series-residual-forecast-errors-with-python/
Forecast errors on a time series forecasting problem are called residual errors or residuals.
The residual error (e) is calculated as the observed value (y) minus the forecast (yhat):

e = y - yhat
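For example, an observed value of 12.5 against a forecast of 11.0 leaves a residual of 1.5:

y, yhat = 12.5, 11.0
e = y - yhat  # 1.5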
We often stop there and summarize the skill of a model with a single summary statistic of these errors, such as the mean squared error.
Instead, we can collect these individual residual errors across all forecasts and use them to better understand the forecast model.
Generally, when exploring residual errors we are looking for patterns or structure. A sign of a pattern suggests that the errors are not random.
We expect the residual errors to be random, because it means that the model has captured all of the structure and the only error left is the random fluctuations in the time series that cannot be modeled.
A sign of a pattern or structure suggests that there is more information that a model could capture and use to make better predictions.
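As a quick numeric complement to the visual checks below (not part of the original article), statsmodels' Ljung-Box test quantifies whether a series is autocorrelated. Applied to pure white noise, which is what we hope the residuals look like, it should report large p-values:

import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

# white noise has no structure, so the test should not reject randomness
rng = np.random.default_rng(1)
white_noise = rng.normal(size=500)
print(acorr_ljungbox(white_noise, lags=[10]))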
In [45]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
matplotlib.style.use('ggplot')
In [46]:
fn = '/data/daily-minimum-temperatures-in-me.csv'
df = pd.read_csv(fn, header=0, sep=';', decimal=',')
#df.info()
df.plot(figsize=(20,10));
The simplest forecast we can make is to predict that the value at the next time step will be the same as the value at the current time step. This is called the “naive forecast” or the persistence forecast model.
After the dataset is loaded, it is phrased as a supervised learning problem. A lagged version of the dataset is created where the prior time step (t-1) is used as the input variable and the next time step (t+1) is taken as the output variable.
In [47]:
# create lagged dataset
dataframe = pd.concat([df.iloc[:, 1].shift(1), df.iloc[:, 1]], axis=1)
dataframe.columns = ['t-1', 't+1']
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
# skip the first row, whose lagged input is NaN after the shift
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
The persistence model is applied by predicting the output value (y) as a copy of the input value (x).
In [48]:
# persistence model
predictions = [x for x in test_X]
# calculate residuals
residuals = [test_y[i]-predictions[i] for i in range(len(predictions))]
predictions = pd.DataFrame(predictions)
residuals = pd.DataFrame(residuals)
print(residuals.head())
In [50]:
fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(20,10), sharex=True)
predictions.plot(ax=ax0)
residuals.plot(ax=ax1);
Primarily, we are interested in the mean value of the residual errors. A value close to zero suggests no bias in the forecasts, whereas positive and negative values suggest a positive or negative bias in the forecasts made.
It is useful to know about a bias in the forecasts as it can be directly corrected in forecasts prior to their use or evaluation.
In [51]:
residuals.describe()
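As a minimal sketch of such a correction (reusing the predictions and residuals DataFrames from above; in practice the bias would be estimated from past residuals rather than the test set):

# shift the forecasts by the mean residual to remove the bias
bias = residuals.values.mean()
corrected_predictions = predictions + bias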
We would expect the forecast errors to be normally distributed around a zero mean. Plots can help discover skews in this distribution.
If the plot showed a distribution that was distinctly non-Gaussian, it would suggest that assumptions made by the modeling process were perhaps incorrect and that a different modeling method may be required.
A large skew may suggest the opportunity for performing a transform to the data prior to modeling, such as taking the log or square root.
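A hedged sketch of such transforms applied to the raw series before modeling (the log transform requires strictly positive values, so an offset may be needed):

series = df.iloc[:, 1]
log_series = np.log(series)    # log transform; only valid for positive values
sqrt_series = np.sqrt(series)  # square root transform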
In [52]:
fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(20,10))
residuals.hist(ax=ax0)
residuals.plot(kind='kde', ax=ax1);
A Q-Q plot, or quantile-quantile plot, compares two distributions and can be used to see how similar or different they are.
The Q-Q plot can be used to quickly check the normality of the distribution of residual errors.
The residual values are ordered and compared to an idealized Gaussian distribution. The comparison is shown as a scatter plot (theoretical quantiles on the x-axis, observed quantiles on the y-axis); if the two distributions match, the points fall along a diagonal line from the bottom left to the top right of the plot.
The plot is helpful to spot obvious departures from this expectation.
Below is an example of a Q-Q plot of the residual errors. The x-axis shows the theoretical quantiles and the y-axis shows the sample quantiles.
In [53]:
import scipy.stats
from statsmodels.graphics.gofplots import qqplot
dist = scipy.stats.norm
residuals = np.array(residuals).flatten()  # qqplot expects a 1-D array
fig, ax = plt.subplots(1, 1, figsize=(20,10))
qqplot(residuals, dist=dist, line='r', fit=False, ax=ax);
Note: qqplot expects a one-dimensional array. If a single-column DataFrame (shape (n, 1)) is passed instead, the values are not sorted correctly internally and the plot comes out as a shapeless cloud rather than a line; flattening the array first, as above, avoids this.
Next, we can check for correlations between the errors over time.
We would not expect there to be any correlation between the residuals. This would be shown by autocorrelation scores being below the threshold of significance (dashed and dotted horizontal lines on the plot).
A significant autocorrelation in the residual plot suggests that the model could be doing a better job of incorporating the relationship between observations and lagged observations, called autoregression.
In [55]:
from pandas.plotting import autocorrelation_plot
residuals = pd.Series(residuals)  # wrap as a Series for the pandas plotting helpers
autocorrelation_plot(residuals);
In [63]:
from statsmodels.graphics.tsaplots import plot_acf
from pandas.plotting import autocorrelation_plot
fig, (ax0, ax1) = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True, facecolor='white', figsize=(20,10))
autocorrelation_plot(residuals, ax=ax0)
ax0.set(title='pandas', xlabel='Lag', ylabel='Autocorrelation')
ax0.set_xlim([0, 50])
ax0.set_ylim([-0.2, None])
#plot_acf(residuals, ax=ax1, lags=len(residuals)-1)
plot_acf(residuals, ax=ax1)
ax1.set(title='statsmodels', xlabel='Lag', ylabel='Autocorrelation');
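Finally, a lag plot draws each residual against the residual at the previous time step. Random residuals should appear as a shapeless cloud of points; any diagonal structure would again indicate autocorrelation.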
In [69]:
from pandas.plotting import lag_plot
lag_plot(residuals, lag=1);
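If these diagnostics had revealed significant autocorrelation, a natural next step would be to model it directly with an autoregressive model, as mentioned above. Below is a minimal sketch (not part of the original article) using statsmodels' AutoReg (statsmodels 0.11+), reusing the train/test split created earlier:

from statsmodels.tsa.ar_model import AutoReg

# fit a first-order autoregressive model on the training values
ar_model = AutoReg(train_y, lags=1)
ar_result = ar_model.fit()
# dynamic forecasts over the test period (indices continue from the training data)
ar_predictions = ar_result.predict(start=len(train_y), end=len(train_y) + len(test_y) - 1)
ar_residuals = test_y - ar_predictions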