DAILY COMMODITY FUTURES INDEX

Imports


In [1]:
import pandas            as pd
import matplotlib.pyplot as plt
import seaborn           as sbn
import numpy             as np

from warnings                      import filterwarnings
from helpers                       import HarModel, HarchModel
from itertools                     import product
from pandas.plotting               import autocorrelation_plot
from scipy.stats                   import expon, laplace, t
from arch.univariate               import HARX, HARCH, GARCH, EGARCH, StudentsT, SkewStudent
from statsmodels.graphics.tsaplots import plot_pacf

Notebook Settings


In [2]:
filterwarnings('ignore')
%matplotlib notebook
sbn.set_style('whitegrid')

Read and Fix Data


In [3]:
dat_file = './data/CommodityIndices.xlsx'
skiprows = (2, 3, 4, 5)
raw_data = pd.read_excel(dat_file, skiprows=skiprows, index_col=0, usecols='B:AE')

# Strip the ticker decoration from the column names and reset the first
# row to the common base value of 100 that all indices start from.
raw_data.columns = [column[4:-2].rstrip() for column in raw_data.columns]
raw_data.iloc[0] = 100.0

Single Index Analysis

Pick any one by name


In [4]:
raw_data.columns.values.tolist()


Out[4]:
['Commodity',
 'Energy',
 'Precious Metals',
 'Industrial Metals',
 'Agriculture',
 'Livestock',
 'WTI Crude Oil',
 'Brent Crude',
 'Heat Oil',
 'Unleaded Gas',
 'Natural Gas',
 'Aluminum',
 'Copper',
 'Nickel',
 'Zinc',
 'Gold',
 'Silver',
 'Wheat',
 'Kansas Wheat',
 'Corn',
 'Soybeans',
 'Soybean Oil',
 'Soybean Meal',
 'Cotton',
 'Sugar',
 'Coffee',
 'Cocoa',
 'Live Cattle',
 'Lean Hogs']

In [5]:
name = 'Lean Hogs' # Name of index as string

data = raw_data[name]

fig, ax = plt.subplots(num=name+' Over the Aeons')
data.plot(ax=ax, label=name)

ax.legend(loc='best', frameon=True, fancybox=True, framealpha=0.8)
ax.set_xlabel('Time')
ax.set_ylabel('Relative index')

fig.tight_layout()


We don't want to smooth or average the data to extract a trend, because we know little about how such a trend would extrapolate into the future.

Let's look at daily returns instead

Note: Log-returns would show much the same picture, whereas simple differences would impose a trend on the volatility.
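
As a quick sanity check on that claim, here is a minimal sketch (an addition, reusing the data, np, and plt objects from above) that plots the three candidate transformations side by side:

# Percentage returns and log-returns are nearly indistinguishable for small
# daily moves, while plain first differences scale with the index level and
# hence inherit a trend in their volatility.
pct_rtns = 100 * data.pct_change(1).dropna()
log_rtns = 100 * np.log(data).diff().dropna()
diffs    = data.diff().dropna()

fig, axes = plt.subplots(3, 1, sharex=True, num='Return Definitions Compared')
for ax, series, label in zip(axes, (pct_rtns, log_rtns, diffs),
                             ('Percent returns', 'Log-returns', 'Differences')):
    series.plot(ax=ax, linewidth=0.5)
    ax.set_ylabel(label)
fig.tight_layout()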


In [6]:
winsize = 30  # Window size (in business days) to average over
max_lag = 300 # Maximum lag to consider in (partial) autocorrelation

returns = 100 * data.pct_change(1).dropna()
rtnswin = returns.rolling(winsize, center=True) # or .ewm(com=winsize)
rtnmean = rtnswin.mean()
rtn_std = rtnswin.std()

fig = plt.figure(num='Returns on '+name+' and their (Partial) Autocorrelation', figsize=(9.7, 10))
return_ax = fig.add_subplot(311)
aucorr_ax = fig.add_subplot(312)
pacorr_ax = fig.add_subplot(313)

returns.plot(color='#4c72b0', linestyle='none', marker='o', alpha=0.1, ax=return_ax)
rtnmean.plot(color='#4c72b0', ax=return_ax)
(rtnmean + rtn_std).plot(color='#c44e52', linewidth=1.0, ax=return_ax)
(rtnmean - rtn_std).plot(color='#c44e52', linewidth=1.0, ax=return_ax)
return_ax.set_xlabel('Time')
return_ax.set_ylabel('Daily returns')

autocorrelation_plot(returns, linewidth=1.0, ax=aucorr_ax)
aucorr_ax.set_xlim(left=0, right=max_lag)

plot_pacf(returns, lags=max_lag, alpha=0.01, ax=pacorr_ax, linewidth=0.5)
pacorr_ax.set_ylim(aucorr_ax.get_ylim())
pacorr_ax.set_xlim(aucorr_ax.get_xlim())
pacorr_ax.set_title('')
pacorr_ax.set_xlabel('Lag')
pacorr_ax.set_ylabel('Partial autocorrelation')

fig.tight_layout()


We note that:

  1. The mean is fairly stable.
  2. The variance is clearly not.
  3. The (partial) autocorrelation of the returns is not particularly pronounced.
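
These observations can be quantified; as a minimal sketch (an addition here, assuming a recent statsmodels), a Ljung-Box test on the returns and their absolute values:

from statsmodels.stats.diagnostic import acorr_ljungbox

# Small p-values reject the null hypothesis of no autocorrelation up to
# the given lag. Expect weak evidence for the returns themselves but
# overwhelming evidence for their absolute values.
for label, series in (('returns', returns), ('|returns|', returns.abs())):
    print(label)
    print(acorr_ljungbox(series, lags=[10, 50, 100]))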

So let's look at the absolute returns and their autocorrelation next.

Absolute returns and their autocorrelation


In [7]:
fig = plt.figure(num='Absolute Returns on '+name+' and their (Partial) Autocorrelation', figsize=(9.7, 10))
return_ax = fig.add_subplot(311)
aucorr_ax = fig.add_subplot(312)
pacorr_ax = fig.add_subplot(313)

returns.abs().plot(linewidth=0.5, ax=return_ax, label='Volatility')
return_ax.set_xlabel('Time')
return_ax.set_ylabel('Absolute daily returns')

autocorrelation_plot(returns.abs(), linewidth=0.5, ax=aucorr_ax, label='Volatility');
aucorr_ax.set_xlim(left=0, right=max_lag)

plot_pacf(returns.abs(), lags=max_lag, alpha=0.01, ax=pacorr_ax, linewidth=0.5)
pacorr_ax.set_ylim(aucorr_ax.get_ylim())
pacorr_ax.set_xlim(aucorr_ax.get_xlim())
pacorr_ax.set_title('')
pacorr_ax.set_xlabel('Lag')
pacorr_ax.set_ylabel('Partial autocorrelation')

fig.tight_layout()


We observe volatility clustering and long-range autocorrelation of the absolute returns. The partial autocorrelation dies off at much shorter lags.
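
To put a rough number on how slowly that autocorrelation decays, one could fit a power law to the sample autocorrelation function. A sketch (an addition; the power-law reading is an assumption rather than an established fact):

from statsmodels.tsa.stattools import acf

# A straight line in log-lag vs. log-ACF coordinates indicates power-law
# decay, a common signature of long memory in volatility.
lags = np.arange(1, max_lag + 1)
rho  = acf(returns.abs(), nlags=max_lag, fft=True)[1:]
mask = rho > 0  # the logarithm is only defined for positive values
slope, intercept = np.polyfit(np.log(lags[mask]), np.log(rho[mask]), 1)
print('Approximate power-law decay exponent:', slope)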

Distribution of (absolute) returns


In [8]:
fig = plt.figure(num='Distribution of (Absolute) Returns on '+name, figsize=(9.7, 4))
returns_ax = fig.add_subplot(121)
absrtns_ax = fig.add_subplot(122)

sbn.distplot(returns, 
             bins=99, 
             kde=False, 
             fit=laplace, 
             fit_kws={'color': '#c44e52', 'label': 'Laplace'}, 
             ax=returns_ax);
returns_ax.set_title('Daily Returns on '+name)
returns_ax.set_xlabel('Value')
returns_ax.set_ylabel('Frequency')

sbn.distplot(returns.abs(), 
             bins=99, 
             kde=False, 
             fit=expon, 
             fit_kws={'color': '#c44e52'}, 
             color='#55a868', 
             ax=absrtns_ax); 
absrtns_ax.set_title('Absolute Daily Returns on '+name)
absrtns_ax.set_xlabel('Absolute Value')
absrtns_ax.set_ylabel('Frequency')

x_lims = returns_ax.get_xlim()
x_vals = np.linspace(*x_lims, 200)
params = t.fit(returns[returns.abs() > 0.05], floc=0)
returns_ax.plot(x_vals, t.pdf(x_vals, *params), label="Student's t")
returns_ax.legend(loc='best', frameon=True, fancybox=True, framealpha=0.8)

fig.tight_layout()


The returns follow a Laplace distribution, whereas the absolute returns follow an exponential (or perhaps a Pareto) distribution. The tails are far too fat for a Gaussian process. Interestingly, a Student's t-distribution is quite a reasonable fit for the returns if the pronounced peak at zero is neglected.
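
One way to back up that visual impression is to compare the maximized log-likelihoods of the candidate distributions; a minimal sketch (an addition, with the Gaussian included only as the baseline to reject):

from scipy.stats import norm

# Higher total log-likelihood means a better in-sample fit; the ranking
# should read Gaussian < Laplace < Student's t for these returns.
for dist, label in ((norm, 'Gaussian'), (laplace, 'Laplace'), (t, "Student's t")):
    fitted = dist.fit(returns)
    print('{:12s} log-likelihood: {:10.1f}'.format(label, dist.logpdf(returns, *fitted).sum()))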

Fit a HAR+HARCH model

Initialize a HAR model


In [9]:
params = {           'y': returns,
              'constant': False,
             'hold_back': max_lag,
          'distribution': StudentsT()}

har = HarModel(params)

Find optimal HAR model


In [10]:
# Refine and extend the HAR lag structure as long as the BIC keeps improving
while har.bic.gets_smaller():
    har.refine()
    har.extend()

result = har.model.fit()
result.summary()


Iteration:      1,   Func. Count:      5,   Neg. LLF: 11397.2996474539
Iteration:      2,   Func. Count:     14,   Neg. LLF: 11395.579956318432
Iteration:      3,   Func. Count:     21,   Neg. LLF: 11391.356572325449
Iteration:      4,   Func. Count:     27,   Neg. LLF: 11386.658417337556
Iteration:      5,   Func. Count:     33,   Neg. LLF: 11382.604715707766
Iteration:      6,   Func. Count:     39,   Neg. LLF: 11379.213603979366
Iteration:      7,   Func. Count:     45,   Neg. LLF: 11377.12410942758
Iteration:      8,   Func. Count:     51,   Neg. LLF: 11373.753951978328
Iteration:      9,   Func. Count:     56,   Neg. LLF: 11372.663961696007
Iteration:     10,   Func. Count:     62,   Neg. LLF: 11372.603335565247
Iteration:     11,   Func. Count:     67,   Neg. LLF: 11372.603201027618
Iteration:     12,   Func. Count:     72,   Neg. LLF: 11372.603198577192
Optimization terminated successfully.    (Exit mode 0)
            Current function value: 11372.6031986
            Iterations: 12
            Function evaluations: 72
            Gradient evaluations: 12
Out[10]:
HAR - Constant Variance Model Results
Dep. Variable: Lean Hogs R-squared: -0.001
Mean Model: HAR Adj. R-squared: -0.001
Vol Model: Constant Variance Log-Likelihood: -11372.6
Distribution: Standardized Student's t AIC: 22751.2
Method: Maximum Likelihood BIC: 22771.5
No. Observations: 6437
Date: Fri, Mar 10 2017 Df Residuals: 6434
Time: 12:15:35 Df Model: 3
Mean Model
coef std err t P>|t| 95.0% Conf. Int.
Lean Hogs[0:3] -0.0466 2.208e-02 -2.110 3.488e-02 [-8.987e-02,-3.308e-03]
Volatility Model
coef std err t P>|t| 95.0% Conf. Int.
sigma2 2.2268 6.121e-02 36.382 8.203e-290 [ 2.107, 2.347]
Distribution
coef std err t P>|t| 95.0% Conf. Int.
nu 4.7808 0.292 16.360 3.707e-60 [ 4.208, 5.354]
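
HarModel (and HarchModel below) live in the local helpers module, which is not listed in this notebook. Purely to illustrate the idea, a BIC-driven lag search built directly on arch's HARX could look like the following sketch; the candidate grid is hypothetical and not the helper's actual search strategy:

candidate_lags = ([1], [1, 5], [1, 5, 22], [1, 5, 22, 66])  # hypothetical grid

best_bic, best_lags = np.inf, None
for lags in candidate_lags:
    model = HARX(y=returns, lags=lags, constant=False,
                 hold_back=max_lag, distribution=StudentsT())
    bic = model.fit(update_freq=0, disp='off').bic
    if bic < best_bic:
        best_bic, best_lags = bic, lags

print('Best BIC {:.1f} for HAR lags {}'.format(best_bic, best_lags))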

Add optimal HARCH volatility to the model


In [11]:
harch = HarchModel(har.model)

# Same BIC-driven search, now over the HARCH volatility lags
while harch.bic.gets_smaller():
    harch.refine()
    harch.extend()

result = harch.model.fit()
result.summary()


Iteration:      1,   Func. Count:      7,   Neg. LLF: 11145.271891285207
Iteration:      2,   Func. Count:     18,   Neg. LLF: 11143.19355902127
Iteration:      3,   Func. Count:     26,   Neg. LLF: 11128.372827288546
Iteration:      4,   Func. Count:     34,   Neg. LLF: 11127.124629222519
Iteration:      5,   Func. Count:     42,   Neg. LLF: 11125.498370953339
Iteration:      6,   Func. Count:     50,   Neg. LLF: 11125.024023551276
Iteration:      7,   Func. Count:     58,   Neg. LLF: 11124.415286985308
Iteration:      8,   Func. Count:     66,   Neg. LLF: 11124.344143269183
Iteration:      9,   Func. Count:     73,   Neg. LLF: 11124.259105772066
Iteration:     10,   Func. Count:     80,   Neg. LLF: 11124.239535599774
Iteration:     11,   Func. Count:     87,   Neg. LLF: 11124.234826832588
Iteration:     12,   Func. Count:     94,   Neg. LLF: 11124.234709411596
Iteration:     13,   Func. Count:    101,   Neg. LLF: 11124.234708300763
Optimization terminated successfully.    (Exit mode 0)
            Current function value: 11124.2347083
            Iterations: 13
            Function evaluations: 101
            Gradient evaluations: 13
Out[11]:
HAR - HARCH Model Results
Dep. Variable: Lean Hogs R-squared: -0.001
Mean Model: HAR Adj. R-squared: -0.001
Vol Model: HARCH Log-Likelihood: -11124.2
Distribution: Standardized Student's t AIC: 22258.5
Method: Maximum Likelihood BIC: 22292.3
No. Observations: 6437
Date: Fri, Mar 10 2017 Df Residuals: 6432
Time: 12:25:06 Df Model: 5
Mean Model
coef std err t P>|t| 95.0% Conf. Int.
Lean Hogs[0:3] -0.0430 2.082e-02 -2.064 3.899e-02 [-8.379e-02,-2.173e-03]
Volatility Model
coef std err t P>|t| 95.0% Conf. Int.
omega 0.2420 8.932e-02 2.709 6.741e-03 [6.694e-02, 0.417]
alpha[28] 0.7081 5.173e-02 13.688 1.202e-42 [ 0.607, 0.809]
alpha[262] 0.2000 5.815e-02 3.440 5.812e-04 [8.607e-02, 0.314]
Distribution
coef std err t P>|t| 95.0% Conf. Int.
nu 8.4887 0.891 9.528 1.606e-21 [ 6.743, 10.235]

Test whether adding a constant to the mean model, allowing for skew in the Student's t-distribution, or using a different volatility model reduces the BIC


In [12]:
fixed = {           'y': returns,
                 'lags': har.lags,
            'hold_back': max_lag}

tests = {    'constant': [False, True],
         'distribution': [StudentsT(), SkewStudent()],
           'volatility': [HARCH(lags=harch.lags),
                          GARCH(p=1, o=0, q=1),
                          GARCH(p=1, o=1, q=1), 
                          GARCH(p=1, o=0, q=1, power=1.0),
                          GARCH(p=1, o=1, q=1, power=1.0),
                         EGARCH(p=1, o=0, q=1),
                         EGARCH(p=1, o=1, q=1)]}

fit_kws = {'update_freq': 0,
                  'disp': 'off'}

params_grid = (dict(zip(tests.keys(), values)) for values in product(*tests.values()))

min_bic, min_params = min(((HARX(**fixed, **params).fit(**fit_kws).bic, params)
                           for params in params_grid), key=lambda pair: pair[0])

print('Minimum BIC is: ', min_bic)
print('for model with: ', [str(param) for param in min_params.values()])

model = HARX(**fixed, **min_params)
result = model.fit(**fit_kws)
result.summary()


Minimum BIC is:  22292.318506
for model with:  ['False', "Standardized Student's t distribution", 'HARCH(lags: 28, 262)']
Out[12]:
HAR - HARCH Model Results
Dep. Variable: Lean Hogs R-squared: -0.001
Mean Model: HAR Adj. R-squared: -0.001
Vol Model: HARCH Log-Likelihood: -11124.2
Distribution: Standardized Student's t AIC: 22258.5
Method: Maximum Likelihood BIC: 22292.3
No. Observations: 6437
Date: Fri, Mar 10 2017 Df Residuals: 6432
Time: 12:25:25 Df Model: 5
Mean Model
coef std err t P>|t| 95.0% Conf. Int.
Lean Hogs[0:3] -0.0430 2.082e-02 -2.064 3.899e-02 [-8.379e-02,-2.173e-03]
Volatility Model
coef std err t P>|t| 95.0% Conf. Int.
omega 0.2420 8.932e-02 2.709 6.741e-03 [6.694e-02, 0.417]
alpha[28] 0.7081 5.173e-02 13.688 1.202e-42 [ 0.607, 0.809]
alpha[262] 0.2000 5.815e-02 3.440 5.812e-04 [8.607e-02, 0.314]
Distribution
coef std err t P>|t| 95.0% Conf. Int.
nu 8.4887 0.891 9.528 1.606e-21 [ 6.743, 10.235]

Plot residuals and analyze their (partial) autocorrelation


In [13]:
fig = plt.figure(num='Residuals of HAR+Volatility Model and their (Partial) Autocorrelation', figsize=(9.7, 10))
residual_ax = fig.add_subplot(311)
autocorr_ax = fig.add_subplot(312)
partcorr_ax = fig.add_subplot(313)

returns.plot(ax=residual_ax, linewidth=1, label='Returns')
result.resid.plot(ax=residual_ax, linewidth=0.5, label='Residual')
(returns-result.resid).plot(ax=residual_ax, color='#c44e52', label='Model')
residual_ax.set_xlabel('Time')
residual_ax.set_ylabel('Residual of model')
residual_ax.legend(loc='best', frameon=True, fancybox=True, framealpha=0.8)

autocorrelation_plot(returns, ax=autocorr_ax, linewidth=1, label='Returns')
autocorrelation_plot(result.resid.dropna(), ax=autocorr_ax, linewidth=1, label='Residual')
autocorr_ax.set_xlim(left=0, right=max_lag)
autocorr_ax.legend(loc='best', frameon=True, fancybox=True, framealpha=0.8)

plot_pacf(returns, lags=max_lag, alpha=0.01, ax=partcorr_ax, linewidth=0.5, label='Returns')
plot_pacf(result.resid.dropna(), lags=max_lag, alpha=0.05, ax=partcorr_ax, linewidth=0.5, label='Residual')
partcorr_ax.set_ylim(autocorr_ax.get_ylim())
partcorr_ax.set_xlim(autocorr_ax.get_xlim())
partcorr_ax.set_title('')
partcorr_ax.set_xlabel('Lag')
partcorr_ax.set_ylabel('Partial autocorrelation')

fig.tight_layout()


Analyze the fit to the conditional volatility and the suppression of its (partial) autocorrelation


In [14]:
# Overlay the conditional volatility and the remaining residual on the
# absolute-returns figure created in In [7] above.
resids = (returns.abs() - result.conditional_volatility).dropna()
resids.plot(linewidth=0.5, ax=return_ax, label='Residual')
result.conditional_volatility.plot(linewidth=1, ax=return_ax, label='Model')
return_ax.legend(loc='best', frameon=True, fancybox=True, framealpha=0.8)
autocorrelation_plot(resids, linewidth=1, ax=aucorr_ax, label='Residual')
aucorr_ax.legend(loc='best', frameon=True, fancybox=True, framealpha=0.8)
plot_pacf(resids, lags=max_lag, alpha=0.05, ax=pacorr_ax, linewidth=0.5)


Let's try a forecast on the returns ...


In [15]:
n_sims = 1000 # Number of simulations to average over
future = 22   # Number of future business days to provide a forecast for
past = 5      # Number of business days in the past to start the forecast at

start_d = returns.index[-past-1]
horizon = past + future

forecast = result.forecast(horizon=horizon, 
                           start=start_d, 
                           align='origin', 
                           method='simulation', 
                           simulations=n_sims)

fc_index = pd.bdate_range(start=start_d + pd.tseries.offsets.BDay(1), periods=horizon)
fc_frame = pd.DataFrame(data=forecast.mean.iloc[-past-1:].values[0], index=fc_index, columns=['mean'])
fc_std = np.sqrt(forecast.variance.iloc[-past-1:].values[0])  # band in return units
fc_frame['+std'] = fc_frame['mean'] + fc_std
fc_frame['-std'] = fc_frame['mean'] - fc_std

fig, ax = plt.subplots(num='Forecast on Returns of '+name)

returns.iloc[-horizon:].plot(ax=ax)
fc_frame['mean'].plot(ax=ax)
fc_frame['+std'].plot(ax=ax, color='#c44e52')
fc_frame['-std'].plot(ax=ax, color='#c44e52')
ax.fill_between(fc_index, fc_frame['+std'], fc_frame['-std'], color='#c44e52', alpha=0.2)
ax.set_ylabel('Daily returns')

fig.tight_layout()


... and on the actual index
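
For reference, each simulated index path below compounds the simulated daily percentage returns $r_i$ onto the last observed index value $I_{t_0}$,

$$I_{t_0+k} = I_{t_0} \prod_{i=1}^{k} \left(1 + \frac{r_i}{100}\right),$$

which is exactly what the cumprod call implements.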


In [16]:
past_pad = max_lag  # Number of business days in the past to include into the plot

fig, ax = plt.subplots(num='Forecast on Index of '+name, figsize=(9.7, 6))
data.iloc[-past_pad:].plot(ax=ax)

init_val = data.iloc[-past-1]
fc_frame = pd.DataFrame(data=np.zeros((horizon, n_sims)), index=fc_index)
fc_frame.loc[start_d] = init_val
fc_frame.sort_index(inplace=True)
for sim in range(n_sims):
    fc_frame.iloc[1:, sim] = init_val * (forecast.simulations.values[-past-1][sim] / 100 + 1).cumprod()
    fc_frame[sim].plot(ax=ax, color='#55a868', alpha=0.01, linewidth=5)

confintvls = [fc_frame.iloc[0, :2].tolist()]
for day in range(1, horizon+1):
    confintvls.append(np.percentile(fc_frame.iloc[day], (2.5, 97.5)).tolist())
ci_frame = pd.DataFrame(confintvls, index=fc_frame.index)
ci_frame.plot(ax=ax, color='#c44e52', legend=False)

ax.set_xlabel('Time')
ax.set_ylabel('Index')
fig.tight_layout()


Simulate the process to check that it produces plausible index paths


In [17]:
max_days = data.size # Number of business days to run the simulation for
init_val = 100.      # Value that simulated index starts from

initial_vals = returns.iloc[-max(har.lags)]
initial_vola = result.conditional_volatility.iloc[-max(harch.lags)]
                        
sim_returns = model.simulate(result.params, 
                             max_days,
                             burn=1000,
                             initial_value=initial_vals, 
                             initial_value_vol=initial_vola)

sim_trace = [init_val]
sim_trace.extend(init_val * (sim_returns.data / 100 + 1).cumprod())
    
fig = plt.figure(num='Simulated Traces for the '+name+' Index and its Volatility', figsize=(9.7, 6))
vals_ax = fig.add_subplot(211)
vola_ax = fig.add_subplot(212)

vals_ax.plot(sim_trace)
vals_ax.set_xlabel('Days')
vals_ax.set_ylabel('Index')
vals_ax.set_xlim(left=0, right=max_days)

sim_returns.volatility.plot(ax=vola_ax, color='#c44e52')
vola_ax.set_xlabel('Days')
vola_ax.set_ylabel('Volatility')
vola_ax.set_xlim(left=0, right=max_days)

fig.tight_layout()


