# Volatility Modelling in Python

This tutorial demonstrates how to use Python tools and libraries for volatility modelling, focusing on the generalized autoregressive conditional heteroscedasticity (GARCH) model.

Working with financial data, we are often faced with processing time series, that is, sequences of data points indexed in time order (e.g. the daily closing value of a stock market index). To model such data, we can use parametric models in the time domain.

A GARCH model is essentially an autoregressive conditional heteroscedasticity (ARCH) model in which an autoregressive moving average (ARMA) structure is assumed for the error variance.
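In its simplest form, GARCH(1,1), the conditional variance of the returns follows the recursion below, where one lag of the squared return and one lag of the variance enter the equation:

``````
\sigma_t^2 = \omega + \alpha\, r_{t-1}^2 + \beta\, \sigma_{t-1}^2
``````

The process is stationary when the persistence alpha + beta is strictly less than one. This is the model we will estimate below.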

## Getting the data

The first step is to get the data. For this tutorial, we will use the adjusted closing price of 10 stocks (the adjustment accounts for dividends and splits).

The pandas_datareader library could prove useful for this task (note that such tools are very sensitive to changes in the target API). However, this will be the subject of some other tutorial. For now, let's just read the local files.
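For reference, here is a minimal sketch of what fetching one series with pandas_datareader could look like (the ticker, source, and date range are placeholders, and the remote API may well have changed or broken since this was written):

``````
# Hypothetical sketch -- not used in this tutorial
import pandas_datareader.data as web

prices = web.DataReader('AAPL', 'yahoo', '1990-01-01', '2017-07-06')
adj_close = prices['Adj Close']
``````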

``````

In [1]:

import pandas as pd
import string

raw_prices = pd.DataFrame(columns=['Date'])
stock_names = string.ascii_uppercase[:10]

for stock_name in stock_names:
    # Hypothetical file layout: one CSV per stock ('A.csv', ..., 'J.csv')
    # with 'Date' and 'Adj Close' columns
    raw_data = pd.read_csv(stock_name + '.csv', usecols=['Date', 'Adj Close'])
    raw_data.rename(columns={'Adj Close': stock_name}, inplace=True)
    raw_data.dropna(inplace=True)
    raw_prices = pd.merge(raw_prices, raw_data, how='outer', on='Date')

raw_prices.sort_values(by='Date', inplace=True)
raw_prices.head()

``````
``````

Out[1]:

            Date   A         B    C         D    E    F    G    H    I    J
8450  1962-01-02 NaN  0.202636  NaN  0.059620  NaN  NaN  NaN  NaN  NaN  NaN
8451  1962-01-03 NaN  0.206688  NaN  0.060421  NaN  NaN  NaN  NaN  NaN  NaN
8452  1962-01-04 NaN  0.204662  NaN  0.060421  NaN  NaN  NaN  NaN  NaN  NaN
8453  1962-01-05 NaN  0.200609  NaN  0.060621  NaN  NaN  NaN  NaN  NaN  NaN
8454  1962-01-08 NaN  0.201116  NaN  0.060421  NaN  NaN  NaN  NaN  NaN  NaN

``````

We can see that some of the assets have prices going back to the 1960s but that many others don't. Let's drop the incomplete rows. We could have used inner joins instead of an outer join followed by dropping incomplete rows, but we wanted to show the complete picture. (We leave it as an exercise to the reader to rework the loading loop with inner joins; one possible shape is sketched below.)
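A sketch of the inner-join variant, reusing the hypothetical one-CSV-per-stock layout from above:

``````
# Successive inner joins keep only the dates present for every stock
raw_prices = None
for stock_name in stock_names:
    raw_data = pd.read_csv(stock_name + '.csv', usecols=['Date', 'Adj Close'])
    raw_data.rename(columns={'Adj Close': stock_name}, inplace=True)
    raw_data.dropna(inplace=True)
    if raw_prices is None:
        raw_prices = raw_data
    else:
        raw_prices = pd.merge(raw_prices, raw_data, how='inner', on='Date')
``````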

``````

In [2]:

raw_prices.dropna(inplace=True)
raw_prices.head()

``````
``````

Out[2]:

            Date         A          B         C         D         E         F         G         H         I         J
1550  1990-02-16  3.657441  12.040095  0.064382  6.770798  4.859589  0.456696  3.713418  1.243683  0.841391  3.756277
1551  1990-02-20  3.556310  11.776785  0.066552  6.647411  4.784537  0.467182  3.615980  1.222314  0.815098  3.688598
1552  1990-02-21  3.539457  11.585296  0.065105  6.593432  4.690724  0.447959  3.605154  1.260778  0.817727  3.637836
1553  1990-02-22  3.488895  11.609239  0.065829  6.562585  4.690724  0.449124  3.551025  1.291745  0.815098  3.595536
1554  1990-02-23  3.438330  11.489553  0.065467  6.462332  4.615672  0.454366  3.518545  1.253250  0.809839  3.527858

``````
``````

In [3]:

raw_prices.tail()

``````
``````

Out[3]:

            Date          A           B          C           D           E          F          G           H          I           J
8444  2017-06-28  89.328644  199.619995  31.783060  106.070923  154.300003  69.800003  76.510002  153.229996  34.200001  133.820007
8445  2017-06-29  90.651367  197.460007  31.119259  104.929413  153.130005  68.489998  75.930000  152.160004  33.540001  132.639999
8446  2017-06-30  90.900002  197.750000  31.010277  105.465424  153.160004  68.930000  75.680000  153.399994  33.740002  132.289993
8447  2017-07-03  92.750000  198.589996  31.039999  106.666489  152.500000  68.169998  75.360001  154.009995  33.459999  132.899994
8449  2017-07-06  93.379997  201.479996  30.719999  103.349998  153.089996  68.570000  75.470001  152.050003  33.630001  132.520004

``````

This looks better. Let's restrict our analysis to the period from Mar 1, 1990, to Jun 30, 2017, so that we have complete calendar quarters. (We leave it as an exercise to the reader to verify that we have data for every trading day in that period.)

``````

In [4]:

raw_prices = raw_prices[(raw_prices.Date > '1990-03') & (raw_prices.Date < '2017-07')]
raw_prices.head(2)

``````
``````

Out[4]:

            Date        A          B         C         D         E         F         G         H         I         J
1558  1990-03-01  3.43833  12.447020  0.066552  6.670546  4.739837  0.466017  3.864985  1.364459  0.854538  3.654757
1559  1990-03-02  3.50575  12.423082  0.067275  6.732243  4.777454  0.483492  3.908290  1.364459  0.857166  3.722438

``````
``````

In [5]:

raw_prices.tail(2)

``````
``````

Out[5]:

            Date          A           B          C           D           E          F      G           H          I           J
8445  2017-06-29  90.651367  197.460007  31.119259  104.929413  153.130005  68.489998  75.93  152.160004  33.540001  132.639999
8446  2017-06-30  90.900002  197.750000  31.010277  105.465424  153.160004  68.930000  75.68  153.399994  33.740002  132.289993

``````

For simplicity, let's pretend that our portfolio consists of one share of each stock.

``````

In [6]:

# Value of a portfolio holding one share of each stock
raw_prices['Close'] = raw_prices[list(stock_names)].sum(axis=1)
portfolio = raw_prices.reset_index()[['Date','Close']].copy()
portfolio.head()

``````
``````

Out[6]:

         Date      Close
0  1990-03-01  37.567041
1  1990-03-02  37.841649
2  1990-03-05  37.961847
3  1990-03-06  38.351505
4  1990-03-07  38.399774

``````

## Plotting the data

``````

In [7]:

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import seaborn as sns

``````
``````

In [8]:

f, ax = plt.subplots(figsize=(12, 6))
sns.set(color_codes=True)
sns.tsplot(data=portfolio.Close, ax=ax)

# Place a tick every 730 trading days and label it with the corresponding date
dates = pd.to_datetime(portfolio.Date).dt.date
xticks = range(0, len(portfolio), 365*2)
ax.set_xticks(xticks)
ax.set_xticklabels([dates[i] for i in xticks], rotation=90)
ax.set_ylabel('Portfolio - Closing value')

``````
[Figure: portfolio closing value, 1990-2017]

## Computing returns and squared returns

``````

In [9]:

portfolio['Return'] = np.log(portfolio.Close / portfolio.Close.shift())
portfolio['R2'] = np.square(portfolio['Return'])
portfolio.tail()

``````
``````

Out[9]:

            Date        Close    Return        R2
6884  2017-06-26  1045.294773 -0.001264  0.000002
6885  2017-06-27  1042.395983 -0.002777  0.000008
6886  2017-06-28  1048.662634  0.005994  0.000036
6887  2017-06-29  1040.050053 -0.008247  0.000068
6888  2017-06-30  1042.325696  0.002186  0.000005

``````

Note that we use the daily continuously compounded return (log return) instead of the daily simple rate of return as given by the pct_change method:

``````

In [10]:

portfolio.Close.pct_change().tail()

``````
``````

Out[10]:

6884   -0.001263
6885   -0.002773
6886    0.006012
6887   -0.008213
6888    0.002188
Name: Close, dtype: float64

``````

The two measures look very similar, and each has its pros and cons. The main reason we prefer the log return is that it aggregates over time by simple addition: the compounded return over any horizon is the sum of the daily log returns, which simplifies the development of models.
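In symbols, with P_t the closing value of the portfolio, the daily log return and its additivity over an n-day horizon read:

``````
r_t = \ln\frac{P_t}{P_{t-1}}, \qquad \sum_{t=1}^{n} r_t = \ln\frac{P_n}{P_0}
``````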

## Autocorrelation

The autocorrelation is the correlation of a series with a delayed, or shifted, copy of itself. It is a useful tool for detecting linear dynamics in time series analysis.
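Formally, for a stationary series r_t, the autocorrelation at lag k is:

``````
\rho_k = \mathrm{Corr}(r_t,\, r_{t-k}) = \frac{\mathrm{Cov}(r_t,\, r_{t-k})}{\mathrm{Var}(r_t)}
``````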

Let's look at the autocorrelation of daily returns.

``````

In [11]:

def plot_autocorrelation(returns, y_label, zero_line=False):
    # Autocorrelation of the series for lags 1 through 100
    lags = np.arange(1, 101)
    return_autocorr = [returns.autocorr(lag=lag) for lag in lags]

    f, ax = plt.subplots(figsize=(12, 6))
    sns.set(color_codes=True)
    sns.tsplot(data=return_autocorr, time=lags, ax=ax)
    ax.set_xticks(np.arange(lags[0]-1, lags[-1]+1, 10))
    ax.set_xlabel('Lag order')
    ax.set_ylabel(y_label)
    if zero_line:
        plt.plot([lags[0], lags[-1]], [0, 0], 'b')

``````
``````

In [12]:

plot_autocorrelation(portfolio.Return, 'Autocorrelation of daily returns',zero_line=True)

``````
[Figure: autocorrelation of daily returns, lags 1 to 100]

We see that there is very little autocorrelation for daily returns of the portfolio over this period. Let's look at the autocorrelation of the squared returns.

``````

In [13]:

plot_autocorrelation(portfolio.R2, 'Autocorrelation of squared daily returns')

``````
[Figure: autocorrelation of squared daily returns, lags 1 to 100]

We notice a fair amount of autocorrelation in the squared returns: large moves tend to be followed by large moves, a phenomenon known as volatility clustering. This is precisely the kind of dynamics a GARCH model is designed to capture.

## GARCH(1,1)

For the purpose of this tutorial, let's first build our own model. We consider one lag of squared returns and one lag of conditional variance, hence the (1,1) in GARCH(1,1).
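Instead of estimating omega directly, we use variance targeting: we fix the unconditional variance at its sample estimate, which pins down omega through omega = variance * (1 - alpha - beta). The remaining parameters alpha and beta are then found by minimizing the negative Gaussian log-likelihood (up to additive constants):

``````
\frac{1}{2} \sum_t \left( \ln \sigma_t^2 + \frac{r_t^2}{\sigma_t^2} \right)
``````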

``````

In [14]:

import scipy.optimize as opt

def compute_omega(factor, alpha, beta):
    # Variance targeting: omega is implied by the unconditional variance
    persistence = alpha + beta
    return factor * (1 - persistence)

def compute_cond_var(data, initial_value, alpha, beta, omega=None):
    if omega is None:
        omega = compute_omega(initial_value, alpha, beta)
    cond_var = np.array([initial_value], dtype=np.float64)
    for time in range(1, len(data)):
        # GARCH(1,1) recursion on the conditional variance
        cond_var = np.append(cond_var, omega + beta*cond_var[time-1] + alpha*data[time]**2)
    return cond_var

``````
``````

In [15]:

def fn(x, df, r_var):
    # Negative Gaussian log-likelihood (up to constants) of the GARCH(1,1) model
    sig_2 = compute_cond_var(df.Return, r_var, x[0], x[1])[:-1]
    return (np.log(sig_2) + df.R2[1:]/sig_2).sum() / 2

res = opt.minimize(fn, [0.08, 0.82], bounds=[[0,1],[0,1]],
                   args=(portfolio, portfolio.Return.std()**2))
res

``````
``````

/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py:3: RuntimeWarning: invalid value encountered in log
This is separate from the ipykernel package so we can avoid doing imports until

Out[15]:

fun: -28107.387398143728
hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
jac: array([ 0.11132215,  0.08949428])
message: 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
nfev: 54
nit: 10
status: 0
success: True
x: array([ 0.08210269,  0.908509  ])

``````
``````

In [16]:

compute_cond_var(portfolio.Return, portfolio.Return.std()**2, res.x[0], res.x[1])

``````
``````

Out[16]:

array([  1.46845582e-04,   1.39144331e-04,   1.28618231e-04, ...,
3.01296863e-05,   3.43355444e-05,   3.29649827e-05])

``````

It turns out that a Python library, arch, already exists for such models. Let's use it to build a GARCH(1,1) model and compare it to our own.

``````

In [17]:

from arch import arch_model

# Scale returns to percentages to help the optimizer's numerical stability
garch = arch_model(100*portfolio.Return[1:])
res = garch.fit(update_freq=5)
print(res.summary())

``````
``````

Iteration:      5,   Func. Count:     39,   Neg. LLF: 9921.63392772
Iteration:     10,   Func. Count:     72,   Neg. LLF: 9918.29018018
Optimization terminated successfully.    (Exit mode 0)
Current function value: 9918.28947529
Iterations: 12
Function evaluations: 84
Constant Mean - GARCH Model Results
==============================================================================
Dep. Variable:                 Return   R-squared:                      -0.000
Mean Model:             Constant Mean   Adj. R-squared:                 -0.000
Vol Model:                      GARCH   Log-Likelihood:               -9918.29
Distribution:                  Normal   AIC:                           19844.6
Method:            Maximum Likelihood   BIC:                           19871.9
No. Observations:                 6888
Date:                Tue, Jul 11 2017   Df Residuals:                     6884
Time:                        11:20:05   Df Model:                            4
Mean Model
============================================================================
coef    std err          t      P>|t|      95.0% Conf. Int.
----------------------------------------------------------------------------
mu             0.0730  1.064e-02      6.863  6.743e-12 [5.219e-02,9.391e-02]
Volatility Model
============================================================================
coef    std err          t      P>|t|      95.0% Conf. Int.
----------------------------------------------------------------------------
omega          0.0142  3.680e-03      3.853  1.166e-04 [6.966e-03,2.139e-02]
alpha[1]       0.0864  1.305e-02      6.621  3.577e-11   [6.084e-02,  0.112]
beta[1]        0.9048  1.371e-02     65.992      0.000     [  0.878,  0.932]
============================================================================

Covariance estimator: robust

``````
``````

In [18]:

# The model was fitted on returns scaled by 100, so the fitted omega is in
# squared-percent units; rescale by 100**2 (alpha and beta are scale-free)
omega = res.params['omega']/(100**2)
omega

``````
``````

Out[18]:

1.4177695568094985e-06

``````
``````

In [19]:

compute_cond_var(portfolio.Return, portfolio.Return.std()**2,
                 res.params['alpha[1]'], res.params['beta[1]'], omega)

``````
``````

Out[19]:

array([  1.46845582e-04,   1.38870052e-04,   1.27938520e-04, ...,
3.00012610e-05,   3.44409699e-05,   3.29932817e-05])

``````

The conditional variances are very close to those of our own model. Our model was quite simple, though. Let's make it slightly more robust by taking into account the leverage effect (nonlinear GARCH, also known as NGARCH).
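The NGARCH(1,1) recursion adds a leverage parameter theta, so that negative returns push the variance up more than positive returns of the same size:

``````
\sigma_t^2 = \omega + \alpha\, (r_{t-1} - \theta\, \sigma_{t-1})^2 + \beta\, \sigma_{t-1}^2
``````

Its persistence becomes alpha * (1 + theta^2) + beta, which the code below uses both for variance targeting and for the stationarity constraint.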

``````

In [20]:

import scipy.optimize as opt
from functools import partial
import sys

# Inequality constraints for the optimization algorithm: the first five rows
# keep alpha, beta and theta in their boxes; the last row is overridden below
# with the stationarity constraint
a = np.array([[ 1.,  0.,  0.],
              [ 0.,  1.,  0.],
              [ 0.,  0.,  1.],
              [-1.,  0.,  0.],
              [ 0., -1.,  0.],
              [ 0.,  0.,  0.]])

b = [0., 0., 0., -1., -1., 0.]

def ieqcons(x, *args, **kwargs):
    ineq = kwargs['a'].dot(x) - kwargs['b']
    # Stationarity: the persistence must stay below 1
    ineq[len(a)-1] = 1 - compute_persistance(x[0], x[1], x[2])
    return ineq

def f_ieqcons(a, b):
    return partial(ieqcons, a=a, b=b)

def compute_persistance(alpha, beta, theta=0.0):
    return alpha*(1 + theta**2) + beta

def compute_omega(unconditional_variance, persistence):
    # Variance targeting: omega is implied by the unconditional variance
    return unconditional_variance * (1 - persistence)

def compute_cond_var(returns, unconditional_variance, alpha, beta, omega=None, theta=0.0):
    if omega is None:
        omega = compute_omega(unconditional_variance, compute_persistance(alpha, beta, theta))
    cond_var = np.array([unconditional_variance], dtype=np.float64)
    for time in range(1, len(returns)):
        # NGARCH(1,1) recursion; theta=0 reduces it to plain GARCH(1,1)
        cond_var = np.append(cond_var, omega + beta*cond_var[time-1]
                             + alpha*(returns[time-1] - theta*cond_var[time-1]**0.5)**2)
    return cond_var

def our_garch(x, returns, unconditional_variance):
    alpha = x[0]
    beta = x[1]
    theta = 0 if len(x) <= 2 else x[2]
    persistence = compute_persistance(alpha, beta, theta)
    # fmin_slsqp sometimes runs iterations that fail to respect the given
    # constraints (values in ieqcons are < 0), so this is a workaround
    if 0 <= persistence <= 1:
        omega = compute_omega(unconditional_variance, persistence)
        cond_var = compute_cond_var(returns, unconditional_variance, alpha, beta, omega, theta)
        score = (np.log(cond_var) + np.square(returns)/cond_var).sum()
    else:
        score = sys.float_info.max
    return score

returns = portfolio.Return[1:].tolist()
unconditional_variance = portfolio.Return[1:].std()**2

``````

Without leverage, we get the same parameter estimates as before (without warnings this time):

``````

In [21]:

# The (0,0) bound pins theta at zero, reducing the model to plain GARCH(1,1)
res = opt.fmin_slsqp(our_garch, [0.08, 0.82, 0.0], bounds=((0,1),(0,1),(0,0)),
                     f_ieqcons=f_ieqcons(a,b),
                     args=(returns, unconditional_variance), iter=100, acc=1e-10,
                     epsilon=1.4901161193847656e-08)
res

``````
``````

Optimization terminated successfully.    (Exit mode 0)
Current function value: -56214.7747963
Iterations: 13
Function evaluations: 72

Out[21]:

array([ 0.08210233,  0.9085093 ,  0.        ])

``````

Including the leverage:

``````

In [22]:

res = opt.fmin_slsqp(our_garch, [0.08, 0.82, 0.5], f_ieqcons=f_ieqcons(a,b),
                     args=(returns, unconditional_variance), iter=100, acc=1e-06,
                     epsilon=1.4901161193847656e-08)
res

``````
``````

Optimization terminated successfully.    (Exit mode 0)
Current function value: -56442.8246217
Iterations: 19
Function evaluations: 104

Out[22]:

array([ 0.06924546,  0.87205728,  0.80197473])

``````
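
As a quick sanity check (a sketch using the fitted values printed above), the implied persistence should stay below one for the process to remain stationary:

``````
alpha, beta, theta = res
print(compute_persistance(alpha, beta, theta))  # about 0.986: stationary, but highly persistent
``````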