Data Analysis

Warning: this doesn't work! The yahoo website doesn't seem to work anymore.

A simple demo with pandas in Python

This notebook is based on course notes from Lamoureux's course Math 651 at the University of Calgary, Winter 2016.

This was an exercise to try out some resourse in Python. Specifically, we want to scrape some data from the web concerning stock prices, and display in a Panda. Then do some basic data analysis on the information.

We take advantage of the fact that there is a lot of financial data freely accessible on the web, and lots of people post information about how to use it.

Pandas in Python

How to access real data from the web and apply data analysis tools.

I am using the book Python for Data Analysis by Wes McKinney as a reference for this section.

The point of using Python for this is that a lot of people have created good code to do this.

The pandas name comes from Panel Data, an econometrics terms for multidimensional structured data sets, as well as from Python Data Analysis.

The dataframe objects that appear in pandas originated in R. But apparently thery have more functionality in Python than in R.

I will be using PYLAB as well in this section, so we can make use of NUMPY and MATPLOTLIB.

Accessing financial data

For free, historical data on commodities like Oil, you can try this site: http://www.databank.rbs.com This site will download data directly into spreadsheets for you, plot graphs of historical data, etc. Here is an example of oil prices (West Texas Intermdiate), over the last 15 years. Look how low it goes...

Yahoo supplies current stock and commodity prices. Here is an intereting site that tells you how to download loads of data into a csv file. http://www.financialwisdomforum.org/gummy-stuff/Yahoo-data.htm

Here is another site that discusses accessing various financial data sources. http://quant.stackexchange.com/questions/141/what-data-sources-are-available-online

Loading data off the web

To get away from the highly contentious issues of oil prices and political parties, let's look at some simple stock prices -- say Apple and Microsoft. We can import some basic webtools to get prices directly from Yahoo.


In [17]:
# Get some basic tools
%pylab inline
from pandas import Series, DataFrame
import pandas as pd
#import pandas.io.data as web
#from pandas_datareader import data, web
#import pandas_datareader as pdr
from pandas_datareader import data as pdr
import fix_yahoo_finance


Populating the interactive namespace from numpy and matplotlib

In [19]:
# Here are apple and microsoft closing prices since 2016
start = datetime.datetime(2016,1,1)
end = datetime.date.today()
data = pdr.get_data_yahoo(["SPY", "IWM"], start="2017-01-01", end="2017-04-30")
# aapl = pdr.get_data_yahoo('AAPL')
#apple = pdr.DataReader('AAPL', 'yahoo', start, end)
#aapl = pdr.get_data_yahoo('AAPL','2001-01-01')['Adj Close']
#msft = pdr.get_data_yahoo('MSFT','2001-01-01')['Adj Close']
#subplot(2,1,1)
#plot(aapl)
#subplot(2,1,2)
#plot(msft)


---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-19-e189741bd1a7> in <module>()
      2 start = datetime.datetime(2016,1,1)
      3 end = datetime.date.today()
----> 4 data = pdr.get_data_yahoo(["SPY", "IWM"], start="2017-01-01", end="2017-04-30")
      5 # aapl = pdr.get_data_yahoo('AAPL')
      6 #apple = pdr.DataReader('AAPL', 'yahoo', start, end)

/Users/mikel/anaconda/lib/python3.6/site-packages/fix_yahoo_finance/__init__.py in download(tickers, start, end, as_panel, group_by, auto_adjust, progress, *args, **kwargs)
    139 
    140         # yahoo crumb/cookie
--> 141         crumb, cookie = get_yahoo_crumb()
    142 
    143         tried_once = False

/Users/mikel/anaconda/lib/python3.6/site-packages/fix_yahoo_finance/__init__.py in get_yahoo_crumb(force)
     50             return (_YAHOO_CRUMB_, _YAHOO_COOKIE_)
     51 
---> 52     res = requests.get('https://finance.yahoo.com/quote/SPY/history')
     53     _YAHOO_COOKIE_ = res.cookies['B']
     54 

/Users/mikel/anaconda/lib/python3.6/site-packages/requests/api.py in get(url, params, **kwargs)
     68     :rtype: requests.Response
     69     """
---> 70 
     71     kwargs.setdefault('allow_redirects', True)
     72     return request('get', url, params=params, **kwargs)

/Users/mikel/anaconda/lib/python3.6/site-packages/requests/api.py in request(method, url, **kwargs)
     54     # By using the 'with' statement we are sure the session is closed, thus we
     55     # avoid leaving sockets open which can trigger a ResourceWarning in some
---> 56     # cases, and look like a memory leak in others.
     57     with sessions.Session() as session:
     58         return session.request(method=method, url=url, **kwargs)

/Users/mikel/anaconda/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    486         # Create the Request.
    487         req = Request(
--> 488             method = method.upper(),
    489             url = url,
    490             headers = headers,

/Users/mikel/anaconda/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
    607         # Guard against that specific failure case.
    608         if isinstance(request, Request):
--> 609             raise ValueError('You can only send PreparedRequests.')
    610 
    611         # Set up variables needed for resolve_redirects and dispatching of hooks

/Users/mikel/anaconda/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    421                        "both timeouts to the same value".format(timeout))
    422                 raise ValueError(err)
--> 423         elif isinstance(timeout, TimeoutSauce):
    424             pass
    425         else:

/Users/mikel/anaconda/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, **response_kw)

/Users/mikel/anaconda/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)

/Users/mikel/anaconda/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py in _validate_conn(self, conn)

/Users/mikel/anaconda/lib/python3.6/site-packages/requests/packages/urllib3/connection.py in connect(self)

/Users/mikel/anaconda/lib/python3.6/site-packages/requests/packages/urllib3/util/ssl_.py in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir)

/Users/mikel/anaconda/lib/python3.6/site-packages/requests/packages/urllib3/contrib/pyopenssl.py in load_verify_locations(self, cafile, capath, cadata)

/Users/mikel/anaconda/lib/python3.6/site-packages/OpenSSL/SSL.py in load_verify_locations(self, cafile, capath)
    523         )
    524         if not load_result:
--> 525             _raise_current_error()
    526 
    527     def _wrap_callback(self, callback):

/Users/mikel/anaconda/lib/python3.6/site-packages/OpenSSL/_util.py in exception_from_error_queue(exception_type)
     46             text(lib.ERR_reason_error_string(error))))
     47 
---> 48     raise exception_type(errors)
     49 
     50 

Error: [('system library', 'fopen', 'No such file or directory'), ('BIO routines', 'BIO_new_file', 'no such file'), ('x509 certificate routines', 'X509_load_cert_crl_file', 'system lib')]

In [3]:
aapl


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-fb822895386a> in <module>()
----> 1 aapl

NameError: name 'aapl' is not defined

In [4]:
# Let's look at the changes in the stock prices, normalized as a percentage
aapl_rets = aapl.pct_change()
msft_rets = msft.pct_change()
subplot(2,1,1)
plot(aapl_rets)
subplot(2,1,2)
plot(msft_rets)


Out[4]:
[<matplotlib.lines.Line2D at 0x7fa2617a5990>]

In [5]:
# Let's look at the correlation between these two series
pd.rolling_corr(aapl_rets, msft_rets, 250).plot()


/opt/conda/envs/python2/lib/python2.7/site-packages/ipykernel/__main__.py:2: FutureWarning: pd.rolling_corr is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(window=250).corr(other=<Series>)
  from ipykernel import kernelapp as app
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa262d44a10>

Getting fancy.

Now, we can use some more sophisticated statistical tools, like least squares regression. However, I had to do some work to get Python to recognize these items. But I didn't work too hard, I just followed the error messages.

It became clear that I needed to go back to a terminal window to load in some packages. The two commands I had to type in were

  • pip install statsmodels
  • pip install patsy

'pip' is an 'python installer package' that install packages of code onto your computer (or whatever machine is running your python). The two packages 'statsmodels' and 'patsy' are assorted statistical packages. I don't know much about them, but they are easy to find on the web.


In [6]:
# We may also try a least square regression, also built in as a panda function
model = pd.ols(y=aapl_rets, x={'MSFT': msft_rets},window=256)


/opt/conda/envs/python2/lib/python2.7/site-packages/ipykernel/__main__.py:2: FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
  from ipykernel import kernelapp as app

In [7]:
model.beta


Out[7]:
MSFT intercept
Date
2002-01-14 0.796463 0.000416
2002-01-15 0.788022 0.000417
2002-01-16 0.789784 0.000191
2002-01-17 0.802081 0.000600
2002-01-18 0.793941 0.000671
2002-01-22 0.796478 0.000718
2002-01-23 0.797909 0.001172
2002-01-24 0.786143 0.000960
2002-01-25 0.781526 0.001102
2002-01-28 0.782449 0.001064
2002-01-29 0.782113 0.001196
2002-01-30 0.764488 0.001069
2002-01-31 0.784935 0.001246
2002-02-01 0.784775 0.001254
2002-02-04 0.774178 0.001252
2002-02-05 0.781460 0.001383
2002-02-06 0.781529 0.001354
2002-02-07 0.792091 0.001506
2002-02-08 0.785451 0.001019
2002-02-11 0.788570 0.001084
2002-02-12 0.793316 0.001001
2002-02-13 0.796715 0.001117
2002-02-14 0.796227 0.001075
2002-02-15 0.801823 0.001176
2002-02-19 0.804453 0.000885
2002-02-20 0.814666 0.001097
2002-02-21 0.830058 0.000799
2002-02-22 0.818041 0.001174
2002-02-25 0.822848 0.001161
2002-02-26 0.821503 0.001248
... ... ...
2016-05-04 0.634886 -0.001187
2016-05-05 0.632351 -0.001121
2016-05-06 0.631153 -0.001282
2016-05-09 0.631186 -0.001277
2016-05-10 0.627805 -0.001240
2016-05-11 0.632612 -0.001326
2016-05-12 0.629215 -0.001440
2016-05-13 0.626302 -0.001429
2016-05-16 0.631538 -0.001302
2016-05-17 0.629171 -0.001258
2016-05-18 0.629934 -0.001218
2016-05-19 0.626321 -0.001243
2016-05-20 0.627607 -0.001232
2016-05-23 0.625496 -0.001210
2016-05-24 0.624308 -0.001229
2016-05-25 0.625938 -0.001186
2016-05-26 0.625857 -0.001192
2016-05-27 0.628037 -0.001277
2016-05-31 0.624431 -0.001256
2016-06-01 0.623145 -0.001322
2016-06-02 0.623414 -0.001335
2016-06-03 0.620833 -0.001279
2016-06-06 0.621354 -0.001256
2016-06-07 0.621346 -0.001237
2016-06-08 0.621420 -0.001246
2016-06-09 0.620108 -0.001200
2016-06-10 0.620255 -0.001216
2016-06-13 0.619448 -0.001207
2016-06-14 0.618853 -0.001180
2016-06-15 0.618974 -0.001180

3631 rows × 2 columns


In [8]:
model.beta['MSFT'].plot()


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa2617a5550>

In [9]:
# Those two graphs looked similar. Let's plot them together
subplot(2,1,1)
pd.rolling_corr(aapl_rets, msft_rets, 250).plot()
title('Rolling correlations')
subplot(2,1,2)
model.beta['MSFT'].plot()
title('Least squaresn model')


/opt/conda/envs/python2/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: pd.rolling_corr is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(window=250).corr(other=<Series>)
  app.launch_new_instance()
Out[9]:
<matplotlib.text.Text at 0x7fa24cf18d90>

more stocks

There is all kinds of neat info on the web. Here is the SPY exchange-traded fund, which tracks the S&P 500 index.


In [10]:
px = web.get_data_yahoo('SPY')['Adj Close']*10
px


Out[10]:
Date
2010-01-04     998.08658
2010-01-05    1000.72861
2010-01-06    1001.43318
2010-01-07    1005.66052
2010-01-08    1009.00712
2010-01-11    1010.41626
2010-01-12    1000.99287
2010-01-13    1009.44749
2010-01-14    1012.17761
2010-01-15    1000.81670
2010-01-19    1013.32249
2010-01-20    1003.01842
2010-01-21     983.73128
2010-01-22     961.80210
2010-01-25     966.73395
2010-01-26     962.68278
2010-01-27     967.26241
2010-01-28     956.16569
2010-01-29     945.77354
2010-02-01     960.48105
2010-02-02     972.10617
2010-02-03     967.26241
2010-02-04     937.40701
2010-02-05     939.34454
2010-02-08     932.56318
2010-02-09     944.27638
2010-02-10     942.42694
2010-02-11     952.29063
2010-02-12     951.49804
2010-02-16     966.46975
                 ...    
2016-05-04    2050.09995
2016-05-05    2049.70001
2016-05-06    2057.20001
2016-05-09    2058.89999
2016-05-10    2084.49997
2016-05-11    2065.00000
2016-05-12    2065.59998
2016-05-13    2047.59995
2016-05-16    2067.79999
2016-05-17    2048.50006
2016-05-18    2049.10004
2016-05-19    2041.99997
2016-05-20    2054.90005
2016-05-23    2052.10007
2016-05-24    2078.69995
2016-05-25    2092.79999
2016-05-26    2093.39996
2016-05-27    2102.40005
2016-05-31    2098.39996
2016-06-01    2102.70004
2016-06-02    2109.10004
2016-06-03    2102.79999
2016-06-06    2113.50006
2016-06-07    2116.79993
2016-06-08    2123.69995
2016-06-09    2120.80002
2016-06-10    2100.70007
2016-06-13    2084.49997
2016-06-14    2080.39993
2016-06-15    2077.50000
Name: Adj Close, dtype: float64

In [11]:
plot(px)


Out[11]:
[<matplotlib.lines.Line2D at 0x7fa24ce5fa90>]