This project is motivated by the prevalence of websites hawking "safe" trading strategies, usually based on some form of statistical arbitrage. Pairs trading is one such method: an investor tracks two stocks that tend to move in the same fashion. When one stock outperforms the other, the outperformer is sold short and the underperformer is bought, a bet on reversion to the mean. When the stock prices (hopefully) reconverge, the trades are unwound, resulting in a profit. Anecdotally, it is said that there should be some known relationship between the stocks that helps to ensure consistent performance over time.
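As a toy illustration of the mechanics just described (our own simplification; the function name, thresholds, and prices below are hypothetical and not part of this analysis), a single long/short position on the standardized price gap might be tracked as follows:

```python
import numpy as np

def pairs_trade_pnl(p1, p2, entry=0.05, exit_=0.01):
    """Toy pairs trade: open when the standardized gap widens past `entry`,
    close when it narrows below `exit_`; profit is the convergence captured."""
    s1, s2 = p1 / p1[0], p2 / p2[0]   # standardize both series to their first price
    gap = s1 - s2
    position, entry_gap, pnl = 0, 0.0, 0.0
    for g in gap:
        if position == 0 and abs(g) > entry:
            # short the outperformer, buy the underperformer
            position, entry_gap = (1 if g > 0 else -1), g
        elif position != 0 and abs(g) < exit_:
            pnl += abs(entry_gap) - abs(g)   # gap has (mostly) closed
            position = 0
    return pnl

prices1 = np.array([100, 101, 106, 104, 102, 100.5, 100])
prices2 = np.array([50, 50.5, 50.4, 50.8, 50.9, 50.4, 50.2])
print('PnL: %.4f' % pairs_trade_pnl(prices1, prices2))  # -> PnL: 0.0500
```

A real strategy must also handle transaction costs, position sizing, and the possibility that the gap never reconverges, which is precisely the risk examined below.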
Websites such as Investopedia have entire pages devoted to such strategies, generally written in an overly optimistic fashion. See this page on Investopedia, for example.
However, these strategies can be very risky, particularly when applied with a fairly basic selection method such as correlation. This article, while perhaps a bit dramatic, does a good job of describing a couple of scenarios in which pairs trading can go badly wrong.
Our goal is to evaluate three of the most commonly recommended pairs trading stock selection strategies to determine how reliable these methods actually are: correlation, distance (sum of squared differences, or SSD), and cointegration. Further, we wish to evaluate the anecdotal claim that the stocks in a pair should be related in some known fashion, and should not simply be selected by statistical means.
In [1]:
import pandas as pd
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as stat
import statsmodels.tsa.stattools as ts
from itertools import combinations
import multiprocessing
from multiprocessing import Pool
import cPickle as pkl
%pylab inline
pylab.rcParams['figure.figsize'] = (20.0, 16.0)
pylab.rcParams['xtick.major.size'] = 14
pylab.rcParams['xtick.labelsize'] = 12
pylab.rcParams['font.size'] = 14
In [2]:
# This cell loads the .pkl training dataset object, either created above or downloaded from the github repository.
trainInput = open('trainData.pkl', 'rb')
trainData = pkl.load(trainInput)
trainInput.close()
In [3]:
# This cell loads the .pkl test dataset object, either created above or downloaded from the github repository.
testInput = open('testData.pkl', 'rb')
testData = pkl.load(testInput)
testInput.close()
In [4]:
# Extracts column headers from training data set and creates all possible pairs
tickerlist = sorted(trainData.columns.tolist())[1:]
tickerpairs = list(combinations(tickerlist,2))
print tickerpairs[:10] # Look at first 10 combinations of pairs
print len(tickerpairs) # Over 3 million possible pairs
In [5]:
# This loads the training dataset mean SSD results from the .gz file available on github.
# Run this cell to access the already-generated results; it should only take a short while.
# If you are creating the data yourself, please skip this step.
!gunzip -c trainDistResult.pkl.gz > trainDistResult.pkl
trainDistInput = open('trainDistResult.pkl', 'rb')
trainDistResult = pkl.load(trainDistInput)
trainDistInput.close()
In [6]:
# This cell sorts the mean SSD results and selects just the 100 "closest" pairs of stocks.
smallssd = trainDistResult.sort(columns = 1)[0:100]
print smallssd.head()
In [7]:
# This cell plots the standardized stock prices for the five 'closest' pairs of stocks using the mean SSD method.
# The plot covers the entire training dataset time range, from January 1, 2011, to December 31, 2012.
# This will plot inline, and the following cell is a figure caption.
plt.figure(1)
i = 1
for a in smallssd.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    pairsprice1 = trainData[stock1]/(trainData[stock1][0])
    pairsprice2 = trainData[stock2]/(trainData[stock2][0])
    plt.subplot(5,1,i)
    plt.plot_date(pairsprice1.index, pairsprice1,'r')
    plt.plot_date(pairsprice1.index, pairsprice2,'b')
    plt.legend([stock1,stock2],loc=4)
    plt.xlabel('Date')
    plt.ylabel('Standardized Price')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
In [8]:
# This cell plots the difference in standardized stock prices for the five 'closest' pairs of stocks using the mean SSD method.
# The plot covers the entire training dataset time range, from January 1, 2011, to December 31, 2012.
# This will plot inline, and the following cell is a figure caption.
# A horizontal line at the mean gap value has been included to help identify any trend in the difference, also known as 'drift'.
plt.figure(2)
i = 1
for a in smallssd.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    pairsprice1 = trainData[stock1]/(trainData[stock1][0])
    pairsprice2 = trainData[stock2]/(trainData[stock2][0])
    pairsgap = pairsprice1-pairsprice2
    plt.subplot(5,1,i)
    plt.plot(pairsprice1.index, pairsgap,'b')
    plt.axhline(y=mean(pairsgap), color='r')
    plt.xlabel('Date')
    plt.ylabel('Standardized Gap')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
In [9]:
# This loads the training dataset ADF cointegration test results from the .gz file available on github.
# Run this cell to access the already-generated results; it should only take a short while.
# If you are creating the data yourself, please skip this step.
!gunzip -c trainCointResult.pkl.gz > trainCointResult.pkl
trainCointInput = open('trainCointResult.pkl', 'rb')
trainCointResult = pkl.load(trainCointInput)
trainCointInput.close()
In [10]:
# This cell sorts the ADF cointegration test p-values and selects just the 100 pairs of stocks with the smallest p-values.
smallCoint = trainCointResult.sort(columns = 1)[0:100]
print smallCoint.head()
In [11]:
# This cell plots the standardized stock prices for the five pairs of stocks with the smallest ADF cointegration test p-values.
# The plot covers the entire training dataset time range, from January 1, 2011, to December 31, 2012.
# This will plot inline, and the following cell is a figure caption.
plt.figure(3)
i = 1
for a in smallCoint.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    pairsprice1 = trainData[stock1]/(trainData[stock1][0])
    pairsprice2 = trainData[stock2]/(trainData[stock2][0])
    plt.subplot(5,1,i)
    plt.plot_date(pairsprice1.index, pairsprice1,'r')
    plt.plot_date(pairsprice1.index, pairsprice2,'b')
    plt.legend([stock1,stock2])
    plt.xlabel('Date')
    plt.ylabel('Standardized Price')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
In [12]:
# This cell plots the difference in standardized stock prices for the five pairs of stocks with the smallest ADF cointegration test p-values.
# The plot covers the entire training dataset time range, from January 1, 2011, to December 31, 2012.
# This will plot inline, and the following cell is a figure caption.
# A horizontal line at the mean gap value has been included to help identify any trend in the difference, also known as 'drift'.
plt.figure(4)
i = 1
for a in smallCoint.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    pairsprice1 = trainData[stock1]/(trainData[stock1][0])
    pairsprice2 = trainData[stock2]/(trainData[stock2][0])
    pairsgap = pairsprice1-pairsprice2
    plt.subplot(5,1,i)
    plt.plot(pairsprice1.index, pairsgap,'b')
    plt.axhline(y=mean(pairsgap), color='r')
    plt.xlabel('Date')
    plt.ylabel('Standardized Gap')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
In [13]:
# This loads the training dataset correlation results from the .gz file available on github.
# Run this cell to access the already-generated results; it should only take a short while.
# If you are creating the data yourself, please skip this step.
!gunzip -c trainCorrResult.pkl.gz > trainCorrResult.pkl
trainCorrInput = open('trainCorrResult.pkl', 'rb')
trainCorrResult = pkl.load(trainCorrInput)
trainCorrInput.close()
In [14]:
# This cell sorts the correlation results and selects just the 100 most positively correlated pairs of stocks.
trainCorrResult.columns = ['pair','corr']
trainCorrTop = trainCorrResult.dropna().sort(columns='corr', ascending=False)[0:100]
trainCorrTop.head()
Out[14]:
In [15]:
# This cell plots the standardized stock prices for the five most positively correlated pairs of stocks.
# The plot covers the entire training dataset time range, from January 1, 2011, to December 31, 2012.
# This will plot inline, and the following cell is a figure caption.
plt.figure(5)
i = 1
for a in trainCorrTop.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    pairsprice1 = trainData[stock1]/(trainData[stock1][0])
    pairsprice2 = trainData[stock2]/(trainData[stock2][0])
    plt.subplot(5,1,i)
    plt.plot_date(pairsprice1.index, pairsprice1,'r')
    plt.plot_date(pairsprice1.index, pairsprice2,'b')
    plt.legend([stock1,stock2], loc=7)
    plt.xlabel('Date')
    plt.ylabel('Standardized Price')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
In [16]:
# This cell plots the difference in standardized stock prices for the five most positively correlated pairs of stocks.
# The plot covers the entire training dataset time range, from January 1, 2011, to December 31, 2012.
# This will plot inline, and the following cell is a figure caption.
# A horizontal line at the mean gap value has been included to help identify any trend in the difference, also known as 'drift'.
plt.figure(6)
i = 1
for a in trainCorrTop.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    pairsprice1 = trainData[stock1]/(trainData[stock1][0])
    pairsprice2 = trainData[stock2]/(trainData[stock2][0])
    pairsgap = pairsprice1-pairsprice2
    plt.subplot(5,1,i)
    plt.plot(pairsprice1.index, pairsgap,'b')
    plt.axhline(y=mean(pairsgap), color='r')
    plt.xlabel('Date')
    plt.ylabel('Standardized Gap')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
Now that we have determined the top pairs of stocks in the training data set using our suggested methods, we will evaluate those pairs in the test data set to see how well the methods hold up over time.
In [17]:
# This cell calculates the mean SSD for each pair of stocks in the test data set.
def dist(tlist):
    xname = tickerpairs[tlist][0]
    yname = tickerpairs[tlist][1]
    x = testData[xname]/(testData[xname][0])
    y = testData[yname]/(testData[yname][0])
    # Remove any NA values
    z = (x-y).dropna()
    # Only consider pairs with most of the data available
    if len(z) > 248:
        return([(xname, yname)], (z**2).sum())
    else:
        return()
if __name__ == '__main__':
    testDistPool = Pool(processes=4)
    # Evaluate only the pairs selected in smallssd
    testDistResult = pd.DataFrame(testDistPool.map(dist, smallssd.index))
    testDistPool.close()
In [18]:
# This cell creates a scatterplot of training mean SSD vs test mean SSD.
# The top 100 'closest' stock pairs from the training mean SSD method are included.
plt.scatter(smallssd[1], testDistResult[1])
plt.vlines(.3, -2, 30, colors='r')
plt.text(.31,11,'x = .3')
plt.xlabel('Training SSD')
plt.ylabel('Test SSD')
plt.title('Consistency of SSD Method')
plt.show()
In [19]:
# This cell plots the standardized stock prices for the five 'closest' pairs of stocks using the training mean SSD method.
# The plot covers the entire test dataset time range, from January 1, 2013, to December 31, 2013.
# This will plot inline, and the following cell is a figure caption.
plt.figure(1)
i = 1
for a in smallssd.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    pairsprice1 = testData[stock1]/(testData[stock1][0])
    pairsprice2 = testData[stock2]/(testData[stock2][0])
    plt.subplot(5,1,i)
    plt.plot_date(pairsprice1.index, pairsprice1,'r')
    plt.plot_date(pairsprice1.index, pairsprice2,'b')
    plt.legend([stock1,stock2], loc=4)
    plt.xlabel('Date')
    plt.ylabel('Standardized Price')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
In [20]:
# This cell plots the difference in standardized stock prices for the five 'closest' pairs of stocks using the training mean SSD method.
# The plot covers the entire test dataset time range, from January 1, 2013, to December 31, 2013.
# This will plot inline, and the following cell is a figure caption.
# A horizontal line at y=0 has been included to help identify any trend in the difference, also known as 'drift'.
plt.figure(2)
i = 1
for a in smallssd.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    # Updated to reflect standardized price differential
    pairsprice1 = testData[stock1]/(testData[stock1][0])
    pairsprice2 = testData[stock2]/(testData[stock2][0])
    pairsgap = pairsprice1-pairsprice2
    plt.subplot(5,1,i)
    plt.plot(pairsprice1.index, pairsgap,'b')
    plt.axhline(y=0, color='r')
    plt.xlabel('Date')
    plt.ylabel('Standardized Gap')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
In [21]:
# This cell calculates the ADF cointegration p-value for each pair of stocks in the test data set.
def cointegration_test(y, x):
    result = stat.OLS(y, x).fit()
    return(ts.adfuller(result.resid))
def coint(cointTestingTlist):
    try:
        cointTestingXname = tickerpairs[cointTestingTlist][0]
        cointTestingYname = tickerpairs[cointTestingTlist][1]
        testCointX = testData[cointTestingXname]
        testCointY = testData[cointTestingYname]
        if min(testCointX.count(), testCointY.count()) > 248:
            testxclean = testCointX[testCointX.notnull() & testCointY.notnull()]
            testyclean = testCointY[testCointX.notnull() & testCointY.notnull()]
            testxp = list(testxclean)
            testyp = list(testyclean)
            return([(cointTestingXname, cointTestingYname)], cointegration_test(testxp,testyp)[1]) # Get the p-value of the test for each pair
        else:
            return()
    except ValueError:
        return()
    except TypeError:
        return()
if __name__ == '__main__':
    testCointPool = Pool(processes=4)
    # Evaluate only the pairs selected in smallCoint
    testCointResult = pd.DataFrame(testCointPool.map(coint, smallCoint.index))
    testCointPool.close()
    testsmallCoint = testCointResult.sort(columns = 1)[0:100]
    print testsmallCoint.head()
In [22]:
# This cell creates a scatterplot of training cointegration p-values vs test cointegration p-values.
# The 100 stock pairs with the smallest training ADF cointegration test p-values are included.
plt.scatter(smallCoint[1][0:100], testCointResult[1][0:100])
plt.xlim(-1e-13,1e-12)
plt.xlabel('Training Cointegration')
plt.ylabel('Test Cointegration')
plt.title('Consistency of Cointegration Method')
plt.show()
In [23]:
# Note scale of training p-values
smallCoint.head()
Out[23]:
In [24]:
# Note the test p-values for the same pairs - 3 of 5 are NOT significant
testCointResult.head()
Out[24]:
In [25]:
# Concatenate the cointegration results for further evaluation
df = pd.DataFrame(smallCoint.reset_index())
df['2'] = testCointResult[1]
df = df.dropna(axis=0, how='any')
df.columns = ['index','pair','train','test']
df.head()
Out[25]:
In [26]:
# This cell plots the standardized stock prices for the five pairs of stocks with the smallest training ADF cointegration test p-values.
# The plot covers the entire test dataset time range, from January 1, 2013, to December 31, 2013.
# This will plot inline, and the following cell is a figure caption.
plt.figure(1)
i = 1
for a in smallCoint.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    pairsprice1 = testData[stock1]/(testData[stock1][0])
    pairsprice2 = testData[stock2]/(testData[stock2][0])
    plt.subplot(5,1,i)
    plt.plot_date(pairsprice1.index, pairsprice1,'r')
    plt.plot_date(pairsprice1.index, pairsprice2,'b')
    plt.legend([stock1,stock2])
    plt.xlabel('Date')
    plt.ylabel('Standardized Price')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
In [27]:
# This cell plots the difference in standardized stock prices for the five pairs of stocks with the smallest training ADF cointegration test p-values.
# The plot covers the entire test dataset time range, from January 1, 2013, to December 31, 2013.
# This will plot inline, and the following cell is a figure caption.
# A horizontal line at y=0 has been included to help identify any trend in the difference, also known as 'drift'.
plt.figure(2)
i = 1
for a in smallCoint.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    # Updated to reflect standardized price differential
    pairsprice1 = testData[stock1]/(testData[stock1][0])
    pairsprice2 = testData[stock2]/(testData[stock2][0])
    pairsgap = pairsprice1-pairsprice2
    plt.subplot(5,1,i)
    plt.plot(pairsprice1.index, pairsgap,'b')
    plt.axhline(y=0, color='r')
    plt.xlabel('Date')
    plt.ylabel('Standardized Gap')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
In [28]:
# Parallel processing for the correlation test
# Consider updating to take advantage of pandas.DataFrame.corr(), but that would require changing the handling of the output
def correlate(tlist):
    try:
        xname = tickerpairs[tlist][0]
        yname = tickerpairs[tlist][1]
        x = testData[xname]
        y = testData[yname]
        if min(x.count(), y.count()) > 240:
            corrs = x.corr(y)
            return([(xname, yname)], corrs)
        else:
            return()
    except ValueError:
        return()
    except TypeError:
        return()
if __name__ == '__main__':
    testCorrPool = Pool(processes=4)
    # Evaluate only the pairs selected in trainCorrTop
    testCorrResult = pd.DataFrame(testCorrPool.map(correlate, trainCorrTop.index))
    testCorrPool.close()
In [29]:
# This cell creates a scatterplot of training correlation vs test correlation.
# The top 100 most positively correlated stock pairs from the training set are included.
testCorrResult.columns = ['pair','corr']
plt.scatter(trainCorrTop['corr'], testCorrResult['corr'])
plt.vlines(.993,-1,1, colors='r')
plt.text(.994,0,'x = .993')
plt.xlabel('Training Correlation')
plt.ylabel('Test Correlation')
plt.title('Consistency of Correlation Method')
plt.show()
In [30]:
# This cell plots the standardized stock prices for the training set five most positively correlated pairs of stocks.
# The plot covers the entire test dataset time range, from January 1, 2013, to December 31, 2013.
# This will plot inline, and the following cell is a figure caption.
plt.figure(1)
i = 1
for a in trainCorrTop.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    pairsprice1 = testData[stock1]/(testData[stock1][0])
    pairsprice2 = testData[stock2]/(testData[stock2][0])
    plt.subplot(5,1,i)
    plt.plot_date(pairsprice1.index, pairsprice1,'r')
    plt.plot_date(pairsprice1.index, pairsprice2,'b')
    plt.legend([stock1,stock2],loc=4)
    plt.xlabel('Date')
    plt.ylabel('Standardized Price')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
In [31]:
# This cell plots the difference in standardized stock prices for the five most positively correlated pairs of stocks.
# The plot covers the entire test dataset time range, from January 1, 2013, to December 31, 2013.
# This will plot inline, and the following cell is a figure caption.
# A horizontal line at y=0 has been included to help identify any trend in the difference, also known as 'drift'.
plt.figure(2)
i = 1
for a in trainCorrTop.index[0:5]:
    stock1 = tickerpairs[a][0]
    stock2 = tickerpairs[a][1]
    # Updated to reflect standardized price differential
    pairsprice1 = testData[stock1]/(testData[stock1][0])
    pairsprice2 = testData[stock2]/(testData[stock2][0])
    pairsgap = pairsprice1-pairsprice2
    plt.subplot(5,1,i)
    plt.plot(pairsprice1.index, pairsgap,'b')
    plt.axhline(y=0, color='r')
    plt.xlabel('Date')
    plt.ylabel('Standardized Gap')
    plt.title(stock1+' vs '+stock2)
    i += 1
plt.gcf().autofmt_xdate(rotation=45)
plt.show()
In summary, we have found no evidence that any of these statistical methods alone is enough to select a pair of stocks for pairs trading. Further, we have not found any evidence refuting the anecdotal information that stocks should be related in a known manner to be suitable for pairs trading.
Of the three methods, the distance method performed the worst, with the largest drift away from the mean over time. Correlation performed somewhat better, but both correlation and distance showed little consistency between the training and test data. Cointegration showed almost no consistency between training and test, yet the related stocks selected by cointegration showed promising results.
We would suggest further research into selection of pairs of related stocks using the cointegration method.