http://machinelearningmastery.com/visualize-time-series-residual-forecast-errors-with-python/
Forecast errors on a time series forecasting problem are called residual errors or residuals.
The residual error (e) is calculated as the observed value (y) minus the forecast (yhat):

e = y - yhat
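For example, an observed value of 12.5 against a forecast of 11.0 leaves a residual of 1.5:

y, yhat = 12.5, 11.0
e = y - yhat  # 1.5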
We often stop there and summarize the skill of a model with a single summary statistic of these errors, such as the mean squared error.
Instead, we can collect these individual residual errors across all forecasts and use them to better understand the forecast model.
Generally, when exploring residual errors we are looking for patterns or structure. A sign of a pattern suggests that the errors are not random.
We expect the residual errors to be random, because it means that the model has captured all of the structure and the only error left is the random fluctuations in the time series that cannot be modeled.
A sign of a pattern or structure suggests that there is more information that a model could capture and use to make better predictions.
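As a quick numeric complement to the visual checks below (not part of the original article), statsmodels' Ljung-Box test quantifies whether a series is autocorrelated. Applied to pure white noise, which is what we hope the residuals look like, it should report large p-values:

import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

# white noise has no structure, so the test should not reject randomness
rng = np.random.default_rng(1)
white_noise = rng.normal(size=500)
print(acorr_ljungbox(white_noise, lags=[10]))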
In [45]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
matplotlib.style.use('ggplot')
In [46]:
fn = '/data/daily-minimum-temperatures-in-me.csv'
df = pd.read_csv(fn, header=0, sep=';', decimal=',')
#df.info()
df.plot(figsize=(20,10));
The simplest forecast we can make is to predict that the value at the next time step will be the same as the value at the current time step. This is called the “naive forecast” or the persistence forecast model.
After the dataset is loaded, it is phrased as a supervised learning problem. A lagged version of the dataset is created where the prior time step (t-1) is used as the input variable and the next time step (t+1) is taken as the output variable.
In [47]:
# create lagged dataset
dataframe = pd.concat([df.iloc[:, 1].shift(1), df.iloc[:, 1]], axis=1)
dataframe.columns = ['t-1', 't+1']
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
# skip the first row, whose lagged input is NaN after the shift
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
The persistence model is applied by predicting the output value (y) as a copy of the input value (x).
In [48]:
# persistence model
predictions = [x for x in test_X]
# calculate residuals
residuals = [test_y[i]-predictions[i] for i in range(len(predictions))]
predictions = pd.DataFrame(predictions)
residuals = pd.DataFrame(residuals)
print(residuals.head())
In [50]:
fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(20,10), sharex=True)
predictions.plot(ax=ax0)
residuals.plot(ax=ax1);
Primarily, we are interested in the mean value of the residual errors. A value close to zero suggests no bias in the forecasts, whereas positive and negative values suggest a positive or negative bias in the forecasts made.
It is useful to know about a bias in the forecasts as it can be directly corrected in forecasts prior to their use or evaluation.
In [51]:
residuals.describe()
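As a minimal sketch of such a correction (reusing the predictions and residuals DataFrames from above; in practice the bias would be estimated from past residuals rather than the test set):

# shift the forecasts by the mean residual to remove the bias
bias = residuals.values.mean()
corrected_predictions = predictions + bias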
We would expect the forecast errors to be normally distributed around a zero mean. Plots can help discover skews in this distribution.
If the plot showed a distribution that was distinctly non-Gaussian, it would suggest that assumptions made by the modeling process were perhaps incorrect and that a different modeling method may be required.
A large skew may suggest the opportunity for performing a transform to the data prior to modeling, such as taking the log or square root.
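A hedged sketch of such transforms applied to the raw series before modeling (the log transform requires strictly positive values, so an offset may be needed):

series = df.iloc[:, 1]
log_series = np.log(series)    # log transform; only valid for positive values
sqrt_series = np.sqrt(series)  # square root transform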
In [52]:
fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(20,10))
residuals.hist(ax=ax0)
residuals.plot(kind='kde', ax=ax1);
A Q-Q plot, or quantile-quantile plot, compares two distributions and can be used to see how similar or different they are.
The Q-Q plot can be used to quickly check the normality of the distribution of residual errors.
The residual values are ordered and compared to an idealized Gaussian distribution. The comparison is shown as a scatter plot (theoretical quantiles on the x-axis, observed quantiles on the y-axis); if the two distributions match, the points fall along a diagonal line from the bottom left to the top right of the plot.
The plot is helpful to spot obvious departures from this expectation.
Below is an example of a Q-Q plot of the residual errors. The x-axis shows the theoretical quantiles and the y-axis shows the sample quantiles.
In [53]:
import scipy.stats
from statsmodels.graphics.gofplots import qqplot
dist = scipy.stats.norm
residuals = np.array(residuals).flatten()  # qqplot expects a 1-D array
fig, ax = plt.subplots(1, 1, figsize=(20,10))
qqplot(residuals, dist=dist, line='r', fit=False, ax=ax);
Note: qqplot expects a one-dimensional array. If a single-column DataFrame (shape (n, 1)) is passed instead, the values are not sorted correctly internally and the plot comes out as a shapeless cloud rather than a line; flattening the array first, as above, avoids this.
Next, we can check for correlations between the errors over time.
We would not expect there to be any correlation between the residuals. This would be shown by autocorrelation scores being below the threshold of significance (dashed and dotted horizontal lines on the plot).
A significant autocorrelation in the residual plot suggests that the model could be doing a better job of incorporating the relationship between observations and lagged observations, called autoregression.
In [55]:
from pandas.plotting import autocorrelation_plot
residuals = pd.Series(residuals)  # wrap as a Series for the pandas plotting helpers
autocorrelation_plot(residuals);
In [63]:
from statsmodels.graphics.tsaplots import plot_acf
from pandas.plotting import autocorrelation_plot
fig, (ax0, ax1) = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True, facecolor='white', figsize=(20,10))
autocorrelation_plot(residuals, ax=ax0)
ax0.set(title='pandas', xlabel='Lag', ylabel='Autocorrelation')
ax0.set_xlim([0, 50])
ax0.set_ylim([-0.2, None])
#plot_acf(residuals, ax=ax1, lags=len(residuals)-1)
plot_acf(residuals, ax=ax1)
ax1.set(title='statsmodels', xlabel='Lag', ylabel='Autocorrelation');
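Finally, a lag plot draws each residual against the residual at the previous time step. Random residuals should appear as a shapeless cloud of points; any diagonal structure would again indicate autocorrelation.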
In [69]:
from pandas.plotting import lag_plot
lag_plot(residuals, lag=1);
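If these diagnostics had revealed significant autocorrelation, a natural next step would be to model it directly with an autoregressive model, as mentioned above. Below is a minimal sketch (not part of the original article) using statsmodels' AutoReg (statsmodels 0.11+), reusing the train/test split created earlier:

from statsmodels.tsa.ar_model import AutoReg

# fit a first-order autoregressive model on the training values
ar_model = AutoReg(train_y, lags=1)
ar_result = ar_model.fit()
# dynamic forecasts over the test period (indices continue from the training data)
ar_predictions = ar_result.predict(start=len(train_y), end=len(train_y) + len(test_y) - 1)
ar_residuals = test_y - ar_predictions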