Los Alamos National Laboratory ISR-1 Space Science and Applications Brian Larsen, Ph.D. balarsen@lanl.gov 505-665-7691

In [4]:
%matplotlib inline
#%matplotlib notebook
%load_ext version_information
%load_ext autoreload

In [5]:
import datetime
import os
import sys
import warnings

warnings.simplefilter("ignore")
import matplotlib.pyplot as plt
import numpy as np
import spacepy.datamodel as dm
import spacepy.plot as spp
import spacepy.toolbox as tb
import pandas as pd
import pymc3 as pm


%version_information matplotlib, numpy, pandas, pymc3


Out[5]:
Software      Version
Python        3.6.2 64bit [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
IPython       6.1.0
OS            Darwin 15.6.0 x86_64 i386 64bit
matplotlib    2.0.2
numpy         1.13.1
pandas        0.20.3
pymc3         3.1
Thu Aug 10 10:06:28 2017 MDT

In [6]:
FIGDIR = os.path.join('..', 'figures')
SRCDIR = os.path.join('..', 'src')
DEVDIR = os.path.join('..', 'develop')
DELDIR = os.path.join('..', 'deliver')
DATDIR = os.path.join('..', 'data')
sys.path.append(SRCDIR)

In [7]:
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['savefig.dpi'] = plt.rcParams['figure.dpi'] # 72
%config InlineBackend.figure_format = 'retina'

Rolling Regression

Author: Thomas Wiecki

  • Pairs trading is a well-known technique in algorithmic trading that plays two stocks against each other.
  • For this to work, the two price series must move together over the long run, i.e. be cointegrated (a rough sketch of the idea follows this list).
  • One common example is the price of gold (GLD) and the share price of a gold-mining company (GFI).
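
As a rough illustration of the pairs idea (not part of the original notebook, and using synthetic data rather than real GLD/GFI prices), the sketch below fits a static hedge ratio between two series constructed to move together, forms the spread, and z-scores it; a pairs trader would bet on large deviations of that z-score reverting toward zero.

In [ ]:
# Hedged sketch of the pairs-trading idea (not part of the original notebook).
# Two synthetic, roughly cointegrated price series stand in for a real pair.
np.random.seed(0)
series_a = pd.Series(100 + np.random.normal(0, 1, 500).cumsum())
series_b = 2.0 * series_a + np.random.normal(0, 2, 500)

# Static hedge ratio from a least-squares fit, then the z-scored spread.
hedge_ratio, intercept = np.polyfit(series_a, series_b, 1)
spread = series_b - (intercept + hedge_ratio * series_a)
zscore = (spread - spread.mean()) / spread.std()
print('largest |z| deviation of the spread:', zscore.abs().max())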

In [8]:
# Load daily opening prices; here we use GOOGL and AAPL in place of GLD and GFI.

In [9]:
from pandas_datareader import data
prices = data.GoogleDailyReader(symbols=['GOOGL', 'AAPL'], end='2017-8-1').read().loc['Open', :, :]
prices.head()


Out[9]:
              AAPL   GOOGL
Date
2010-01-04   30.49  313.79
2010-01-05   30.66  313.90
2010-01-06   30.63  313.24
2010-01-07   30.25  305.00
2010-01-08   30.04  296.30

In [11]:
# Keep only the dates where both series have a finite price
finite_idx = np.isfinite(prices.AAPL.values) & np.isfinite(prices.GOOGL.values)
prices = prices.iloc[finite_idx]

Plotting the prices over time suggests a strong correlation. However, the correlation seems to change over time.
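
One quick way to quantify that drift (a sketch, not part of the original notebook; the 252-day window, roughly one trading year, is an arbitrary choice) is a rolling correlation between the two series:

In [ ]:
# Hedged sketch (not part of the original notebook): rolling correlation
# between the AAPL and GOOGL opening prices over a ~1-year window.
rolling_corr = prices['AAPL'].rolling(window=252).corr(prices['GOOGL'])

fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(rolling_corr.index, rolling_corr.values)
ax.set_xlabel('Date')
ax.set_ylabel('252-day rolling correlation');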


In [14]:
def plot_dat():
    fig = plt.figure(figsize=(9, 6))
    ax = fig.add_subplot(111, xlabel='Price AAPL in \$', ylabel='Price GOOGL in \$')
    colors = np.linspace(0.1, 1, len(prices))
    mymap = plt.get_cmap("winter")
    sc = ax.scatter(prices.AAPL, prices.GOOGL, c=colors, cmap=mymap, lw=0)
    cb = plt.colorbar(sc)
    cb.ax.set_yticklabels([str(p.date()) for p in prices[::len(prices)//10].index]);  
    return ax

In [15]:
plot_dat();


A naive approach would be to estimate a linear model and ignore the time domain.
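
Before fitting the Bayesian version in the next cell, a plain least-squares point estimate (a sketch, not part of the original notebook) shows what that single pooled line looks like:

In [ ]:
# Hedged sketch (not part of the original notebook): ordinary least-squares
# point estimate of the pooled regression, for comparison with the Bayesian GLM.
slope, intercept = np.polyfit(prices.AAPL.values, prices.GOOGL.values, 1)
print('OLS fit: GOOGL ~ {:.2f} + {:.2f} * AAPL'.format(intercept, slope))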


In [18]:
with pm.Model() as model_reg:
    pm.glm.GLM.from_formula('GOOGL ~ AAPL', prices)
    trace_reg = pm.sample(2000, njobs=4)


Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 2.1628e+05:   6%|▋         | 12681/200000 [00:02<00:29, 6293.80it/s]
Convergence archived at 13100
Interrupted at 13,100 [6%]: Average Loss = 8.3528e+07
100%|██████████| 2500/2500 [03:14<00:00, 12.83it/s]

The posterior predictive plot shows how bad the fit is.


In [19]:
ax = plot_dat()
pm.plot_posterior_predictive_glm(trace_reg[100:], samples=100, 
                              label='posterior predictive regression lines',
                              lm=lambda x, sample: sample['Intercept'] + sample['AAPL'] * x,
                              eval=np.linspace(prices.AAPL.min(), prices.AAPL.max(), 100))
ax.legend(loc=0);


Rolling regression

Next, we will build an improved model that allows the regression coefficients to change over time. Specifically, we assume that the intercept and slope follow a random walk through time, similar in spirit to the stochastic volatility model.

$$ \alpha_t \sim \mathcal{N}(\alpha_{t-1}, \sigma_\alpha^2) $$

$$ \beta_t \sim \mathcal{N}(\beta_{t-1}, \sigma_\beta^2) $$

First, let's define the hyper-priors for $\sigma_\alpha^2$ and $\sigma_\beta^2$. These parameters can be interpreted as the volatility of the regression coefficients.
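
To build some intuition for what such a prior implies, the short sketch below (not part of the original notebook) simulates a few Gaussian random walks with sigma = 0.02, roughly the prior mean implied by the Exponential hyper-priors in the next cell; the coefficients are free to drift, but only slowly.

In [ ]:
# Hedged sketch (not part of the original notebook): simulate a few Gaussian
# random walks with sigma = 0.02 to show how slowly the regression
# coefficients can drift under this prior.
np.random.seed(42)
steps = np.random.normal(0, 0.02, size=(5, 50))  # 5 walks, 50 time bins each
walks = steps.cumsum(axis=1)
plt.plot(walks.T, alpha=0.7)
plt.xlabel('time bin')
plt.ylabel('coefficient drift');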


In [20]:
model_randomwalk = pm.Model()
with model_randomwalk:
    # std of random walk, best sampled in log space.
    sigma_alpha = pm.Exponential('sigma_alpha', 1./.02, testval = .1)
    sigma_beta = pm.Exponential('sigma_beta', 1./.02, testval = .1)

Next, we define the regression parameters, which are not a single random variable but rather a random vector with the dependence structure stated above. To avoid fitting a separate coefficient to every single data point, we chunk the data into bins of 50 and apply the same coefficients to all data points within a bin (a small illustration of this binning follows).
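
The effect of the binning is easiest to see on a tiny example (a sketch, not part of the original notebook): each coefficient is simply repeated for every data point in its bin, which is what tt.repeat does symbolically in the cell below.

In [ ]:
# Hedged sketch (not part of the original notebook): with 3 coefficients and
# bins of 4 points, np.repeat stretches the coefficient vector to match the data.
coef = np.array([1.0, 2.0, 3.0])
print(np.repeat(coef, 4))  # -> [ 1.  1.  1.  1.  2.  2.  2.  2.  3.  3.  3.  3.]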


In [21]:
import theano.tensor as tt
# To make the model simpler, we will apply the same coefficient for 50 data points at a time
subsample_n = 50

lendata = len(prices)
ncoef = lendata // subsample_n
idx = range(ncoef * subsample_n)
with model_randomwalk:
    # The second positional argument of GaussianRandomWalk is the precision tau = 1 / sigma**2
    alpha = pm.GaussianRandomWalk('alpha', sigma_alpha**-2,
                                  shape=ncoef)
    beta = pm.GaussianRandomWalk('beta', sigma_beta**-2,
                                 shape=ncoef)
    
    # Make coefficients have the same length as prices
    alpha_r = tt.repeat(alpha, subsample_n)
    beta_r = tt.repeat(beta, subsample_n)

Perform the regression given coefficients and data and link to the data via the likelihood.


In [22]:
with model_randomwalk:
    # Define regression
    regression = alpha_r + beta_r * prices.AAPL.values[idx]
    
    # Assume prices are Normally distributed, the mean comes from the regression.
    sd = pm.Uniform('sd', 0, 20)
    likelihood = pm.Normal('y', 
                           mu=regression, 
                           sd=sd, 
                           observed=prices.GOOGL.values[idx])

Inference: despite this being quite a complex model, NUTS handles it well.


In [23]:
with model_randomwalk:
    trace_rw = pm.sample(1000, njobs=4)  # 1000 draws in each of 4 parallel chains


Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 9,037.6:  18%|█▊        | 35294/200000 [00:08<00:36, 4462.76it/s]   
Convergence archived at 35600
Interrupted at 35,600 [17%]: Average Loss = 3.5707e+05
100%|██████████| 1500/1500 [02:22<00:00, 10.53it/s]
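
To back up the claim that NUTS handled the model well, a quick check (a sketch, not part of the original notebook) is the Gelman-Rubin statistic across the four chains; values close to 1 indicate the chains agree.

In [ ]:
# Hedged sketch (not part of the original notebook): Gelman-Rubin convergence
# diagnostic across the 4 chains; values near 1 indicate good mixing.
rhat = pm.gelman_rubin(trace_rw)
print('sigma_alpha:', rhat['sigma_alpha'])
print('sigma_beta: ', rhat['sigma_beta'])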

Analysis of results

$\alpha$, the intercept, does not seem to change over time.


In [24]:
fig = plt.figure(figsize=(8, 6))
ax = plt.subplot(111, xlabel='time', ylabel='alpha', title='Change of alpha over time.')
ax.plot(trace_rw[-1000:]['alpha'].T, 'r', alpha=.05);
ax.set_xticklabels([str(p.date()) for p in prices[::len(prices)//5].index]);



In [25]:
fig = plt.figure(figsize=(8, 6))
ax = plt.subplot(111, xlabel='alpha', ylabel='count', title='Posterior samples of alpha')
ax.hist(trace_rw[-1000:]['alpha'].flatten(), bins=50);


However, the slope does.


In [26]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='time', ylabel='beta', title='Change of beta over time')
ax.plot(trace_rw[-1000:]['beta'].T, 'b', alpha=.05);
ax.set_xticklabels([str(p.date()) for p in prices[::len(prices)//5].index]);


The posterior predictive plot shows that we capture the change in regression over time much better. Note that we should have used returns instead of prices. The model would still work the same, but the visualisations would not be quite as clear.


In [27]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, xlabel='Price AAPL in \$', ylabel='Price GOOGL in \$', 
            title='Posterior predictive regression lines')

colors = np.linspace(0.1, 1, len(prices))
colors_sc = np.linspace(0.1, 1, len(trace_rw[-500::10]['alpha'].T))
mymap = plt.get_cmap('winter')
mymap_sc = plt.get_cmap('winter')

xi = np.linspace(prices.AAPL.min(), prices.AAPL.max(), 50)
for i, (alpha, beta) in enumerate(zip(trace_rw[-500::10]['alpha'].T, trace_rw[-500::10]['beta'].T)):
    for a, b in zip(alpha, beta):
        ax.plot(xi, a + b*xi, alpha=.05, lw=1, c=mymap_sc(colors_sc[i]))
        
sc = ax.scatter(prices.AAPL, prices.GOOGL, label='data', cmap=mymap, c=colors)
cb = plt.colorbar(sc)
cb.ax.set_yticklabels([str(p.date()) for p in prices[::len(prices)//10].index]);
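
As noted above, returns could be used instead of prices; only the data preparation would change, while the model definition would stay the same. A sketch (not part of the original notebook) using pandas' pct_change:

In [ ]:
# Hedged sketch (not part of the original notebook): daily returns instead of
# prices; the rolling-regression model above would be unchanged.
returns = prices.pct_change().dropna()
returns.head()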


