Correlation and causality example

For this exercise, we will use a dataset of NYSE stock prices mixed with some traffic data from US airlines.

We will look for some correlations, some of which will make sense, and some of which won't.

Note: This isn't necessarily real stock data; it is a mix of real and synthetic data for teaching purposes. If you want to play around with real stock data, there are lots of sources; Google Finance and Yahoo Finance are particularly user-friendly.


In [34]:
# import pandas
import pandas as pd
# render plots inline in the notebook
%matplotlib inline

In [35]:
# read the data (if you have any questions about these relative paths and index_col, please feel free to discuss)
data = pd.read_csv('../data/stocks_and_airliners.csv', index_col='date')
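
For anyone unsure what index_col does, here is a minimal sketch on a tiny made-up CSV (the column names stock_a and stock_b and their values are hypothetical, not taken from the course file). It shows that the named column becomes the row index instead of a regular data column. The parse_dates=True argument is an addition that the cell above does not use; it asks pandas to convert the index to dates.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for the real file (illustrative values only)
csv_text = (
    "date,stock_a,stock_b\n"
    "2010-01-31,1.0,2.0\n"
    "2010-02-28,1.5,2.5\n"
)

# index_col='date' turns the 'date' column into the row index;
# parse_dates=True (not used in the original cell) parses that index as dates
df = pd.read_csv(io.StringIO(csv_text), index_col='date', parse_dates=True)

print(df.index)             # the dates, now used as row labels
print(df.columns.tolist())  # 'date' is no longer among the data columns
```

With a date index, a call like df.plot() would put the dates on the x-axis.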

In [36]:
# quick preview
data.head()


Out[36]:
Agilent Technologies Inc American Airlines Group Advance Auto Parts Apple Inc. AbbVie AmerisourceBergen Corp Abbott Laboratories Accenture plc Adobe Systems Inc Analog Devices, Inc. ... Zions Bancorp Zoetis ASM_Domestic ASM_International Flights_Domestic Flights_International Passengers_Domestic Passengers_International RPM_Domestic RPM_International
date
2010-01-31 28.030000 5.31 39.450001 192.060003 NaN 27.260000 52.939953 40.990002 32.299999 26.959999 ... 18.969999 NaN 54057495 50300292 695109 108804 45501620 12393439 40482623 39252875
2010-02-28 31.460000 7.33 40.799999 204.619997 NaN 28.040001 54.279954 39.970001 34.650002 29.240000 ... 18.540001 NaN 48005940 43683891 623211 96264 42440614 10694165 37130277 32519637
2010-03-31 34.390001 7.35 41.919998 235.000011 NaN 28.920000 52.679953 41.950001 35.369999 28.820000 ... 21.840000 NaN 57316299 50510736 740135 111927 54424077 13279320 47794152 40651970
2010-04-30 36.260001 7.07 45.099998 261.090008 NaN 30.850000 51.159955 43.639999 33.599998 29.930000 ... 28.730000 NaN 55496575 48768955 722797 106640 52498074 12398700 45813191 38701988
2010-05-31 32.360001 8.83 51.759998 256.880005 NaN 31.280001 47.559958 37.520000 32.080002 29.170000 ... 23.950001 NaN 57134829 53705460 741802 110374 53842422 13380664 47140355 44079060

5 rows × 509 columns


In [37]:
# what data do we have here? 
data.columns


Out[37]:
Index(['Agilent Technologies Inc', 'American Airlines Group',
       'Advance Auto Parts', 'Apple Inc.', 'AbbVie', 'AmerisourceBergen Corp',
       'Abbott Laboratories', 'Accenture plc', 'Adobe Systems Inc',
       'Analog Devices, Inc.',
       ...
       'Zions Bancorp', 'Zoetis', 'ASM_Domestic', 'ASM_International',
       'Flights_Domestic', 'Flights_International', 'Passengers_Domestic',
       'Passengers_International', 'RPM_Domestic', 'RPM_International'],
      dtype='object', length=509)

Let's compute one example of Pearson correlation and one of Spearman correlation.


In [39]:
# What is the Pearson correlation between Accenture stock prices and Adobe stock prices?
data['Accenture plc'].corr(data['Adobe Systems Inc'], method='pearson')


Out[39]:
0.94819860996153926

In [40]:
data[['Accenture plc', 'Adobe Systems Inc']].plot(figsize=(16, 5))


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x1026b9b38>
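
A high correlation between two price series that both trend upward may partly reflect the shared trend rather than any direct link between the companies. The following is a rough sketch of that effect on synthetic data, not on the course dataset: the two series a and b are hypothetical, built with the same upward drift but independent random shocks. Comparing the correlation of the levels with the correlation of the period-to-period changes is one common sanity check, offered here as an illustration rather than as the method this exercise uses.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 100

# Two hypothetical "price" series: same drift of 1.0 per step,
# but shocks drawn independently for each series
a = pd.Series(10 + np.cumsum(1.0 + rng.normal(0, 1, n)))
b = pd.Series(20 + np.cumsum(1.0 + rng.normal(0, 1, n)))

# Correlation of the raw levels is dominated by the shared trend
corr_levels = a.corr(b)

# Correlation of step-to-step changes removes the trend
corr_changes = a.diff().corr(b.diff())

print(corr_levels, corr_changes)
```

On the real data, the same comparison could be tried by calling .diff() or .pct_change() on the two columns before .corr().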

In [46]:
# what is the Spearman correlation between American Airlines Group stock price and the number of domestic passengers?
data['American Airlines Group'].corr(data['Passengers_Domestic'], method='spearman')


Out[46]:
0.40457833643495233
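
To see how the two methods differ, here is a minimal sketch on made-up data (again, not the course dataset). It relies on the standard definition of Spearman correlation as Pearson correlation computed on the ranks of the values, so a relationship that is monotonic but not linear scores close to 1.0 with Spearman and noticeably lower with Pearson. The series x and y are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: y always increases with x, but exponentially rather than linearly
x = pd.Series(np.arange(1, 51, dtype=float))
y = np.exp(x / 5.0)

# Pearson measures how well a straight line describes the relationship
pearson = x.corr(y, method='pearson')

# Spearman is Pearson applied to the ranks, so it only measures
# whether y consistently goes up when x goes up
spearman_via_ranks = x.rank().corr(y.rank(), method='pearson')

print(pearson, spearman_via_ranks)
```

Passing method='spearman' to .corr(), as in the cell above, should give the same value as the rank-based calculation.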