Data Bootcamp 1: Examples

Python applied to economic and financial data

This is an introduction to Data Bootcamp, a (prospective) course at NYU designed to give students some familiarity with (i) Python and (ii) economic and financial data. A more complete collection of materials, including this IPython Notebook, is available in our Github repository. (And yes, Python and IPython are different things, but ignore that for now.)

In this Notebook we illustrate some of the possibilities with examples. The code will be obscure if you're new to Python, but we will fill in the gaps over time. In the meantime, you might note for future reference things you run across that you'd like to understand better. We think it's best to take this one step at a time, but if you're interested in the logic behind the code, we give links to relevant documentation under "References." The occasional "Comments" are things for us to follow up on, we suggest you ignore them.

  • Recent US GDP growth (data from FRED)
  • GDP per capita in selected countries (data from the World Bank)
  • Fama-French equity "factors" (data from Ken French's website)
  • S&P 500 ETF (Spyders) (data from Yahoo finance)
  • read a csv??

Warnings

  • This program requires internet access, that's where the data comes from.
  • It's also work in progress, the eta for the course is Fall 2015.

In [20]:
x = [7, 3, 5]

In [23]:
x.pop?

Example 1: US GDP Growth

Investors -- and others -- keep a close eye on the state of the economy because it affects the performance of firms and financial assets. We'll go into this more extensively later, but for now we want to see what the economy has done in the past, especially the recent past. We use the wonderful FRED interface ("API") and load the data straight from their website. Then we graph GDP growth over the past 50 years or so and for a more recent period of greater interest.

This strategy -- find the data on the web, load it, and produce a graph -- is a model for much of what we do.

Question(s). It's always good to know what you're looking for so we'll post question(s) for each example. Here we ask how the economy is doing, and how its current performance compares to the past.

References

Note to self: The FRED API allows you to import transformations like growth rates directly. Is that possible with Pandas?



In [1]:
# anything after the hashtag is a comment 
# load packages 
import datetime as dt 
import pandas.io.data as web       # data import tools 
import matplotlib.pyplot as plt    # plotting tools 

# The next one is an IPython command:  it says to put plots here in the notebook, rather than open a separate window.
%matplotlib inline

In [2]:
# get data from FRED
fred_series = ["GDPC1"]
start_date = dt.datetime(1960, 1, 1)
data = web.DataReader(fred_series, "fred", start_date)

# print last 3 data points to see what we've got 
print(data.tail(3))


              GDPC1
DATE               
2014-04-01  16010.4
2014-07-01  16205.6
2014-10-01  16294.7

In [3]:
# compute annual growth rates 
g = 4*data.pct_change()

# change label   
g.columns = ['US GDP Growth']

The variable g (quarterly GDP growth expressed as an annual rate) is now what Python calls a DataFrame, which is a collection of data organized by variable and observation. You can get some of its properties by typing some or all of the following in the box below:

  • type(g)
  • g.tail()
  • g.head(2)

You can get information about g and what we can do with it by typing: g.[tab]. (Don't type the second period!) That will pop up a list you can scroll through. Typically it's a long list, so it takes some experience to know what to do with it.

You can also get information about things you can do with g by typing commands with an open paren: g.command( and wait. That will give you the arguments of the command. g.head and g.tail, for example, have an argument n which is the number of observations to print. head prints the top of the DataFrame, tail prints the bottom. If you leave it blank, it prints 5.


In [4]:
# enter your commands here

In [5]:
# more examples:  some statistics on GDP growth  
print(['Mean GDP growth ', g.mean()])
print(['Std deviation ', g.std()])

# do this for subperiods...


['Mean GDP growth ', US GDP Growth    0.030429
dtype: float64]
['Std deviation ', US GDP Growth    0.03383
dtype: float64]

In [6]:
# quick and dirty plot 
# note the financial crisis:  GDP fell 8% one quarter (at an annual rate, so really 2%)
g.plot() 
plt.show()



In [7]:
# more complex plot, bar chart for last 6 quarters
# also:  add moving average?

Example 2: Fama-French equity "factors"

Gene Fama and Ken French are two of the leading academics studying (primarily) equity returns. Some of this work is summarized in the press release and related material for the 2013 Nobel Prize in economics, which was shared by Fama with Lars Hansen and Robert Shiller. For now, it's enough to say that Ken French posts an extensive collection of equity data on his website.

We'll look at what have come to be called the Fama-French factors. The data includes:

  • xsm: the return on the market (aggregate equity) minus the riskfree rate
  • smb (small minus big): the return on small firms minus the return on big firms
  • hml (high minus low): the return on firms with high book-to-market ratios minus those with low ratios.
  • rf: the riskfree rate.

We download all of these at once, monthly from 1926. Each is reported as a percentage. Since they're monthly, you can get a rough annual number if you multiply by 12.

Question(s). The question we address is how the returns compare: their means, their variability, and so on.
[Ask yourself: how would I answer this? What would I like to do with the data?]

References



In [8]:
# load packages (if it's redundant it'll be ignored)
import pandas.io.data as web      

# read data from Ken French's website 
ff = web.DataReader('F-F_Research_Data_Factors', 'famafrench')[0] 
# NB:  ff.xs is a conflict, rename to xsm  
ff.columns = ['xsm', 'smb', 'hml', 'rf']

# see what we've got
print(ff.head(3))
print(ff.describe())


         xsm   smb   hml    rf
192607  2.96 -2.30 -2.87  0.22
192608  2.64 -1.40  4.19  0.25
192609  0.36 -1.32  0.01  0.23
               xsm          smb          hml           rf
count  1065.000000  1065.000000  1065.000000  1065.000000
mean      0.653878     0.221850     0.389840     0.283099
std       5.400974     3.221139     3.534641     0.254418
min     -29.130000   -16.400000   -13.280000    -0.060000
25%      -2.020000    -1.560000    -1.290000     0.050000
50%       1.020000     0.060000     0.230000     0.250000
75%       3.660000     1.780000     1.740000     0.430000
max      38.850000    36.700000    35.460000     1.350000

In [9]:
# compute and print summary stats 
moments = [ff.mean(), ff.std(), ff.skew(), ff.kurtosis() - 3]

# \n here is a line break
print('Summary stats for Fama-French factors (mean, std, skew, ex kurt)') #, end='\n\n') 
print(moments)
#[print(moment, end='\n\n') for moment in moments]


Summary stats for Fama-French factors (mean, std, skew, ex kurt)
[xsm    0.653878
smb    0.221850
hml    0.389840
rf     0.283099
dtype: float64, xsm    5.400974
smb    3.221139
hml    3.534641
rf     0.254418
dtype: float64, xsm    0.192321
smb    1.973245
hml    2.147722
rf     1.051113
dtype: float64, xsm     4.771829
smb    16.497750
hml    15.971733
rf     -1.745900
dtype: float64]

In [10]:
# try some things yourself 
# like what?  type ff.[tab]
import pandas as pd
pd.__version__


Out[10]:
'0.16.0'

In [11]:
# some plots 
ff.plot()
plt.show()

ff.hist(bins=50, sharex=True)
plt.show()

ff.boxplot(whis=0, return_type='axes')
plt.show()


Answer(s)? Aren't the boxplots in the last figure cool? The histograms above them? What do you see in them? How do the various returns compare?


Example 3: GDP per capita and life expectancy

The World Bank collects a broad range of economic and social indicators for most countries in the World. They also have a nice interface. It's a good source for basic information about the economic climate compares across countries.

We illustrate its usefulness with a scatterplot of life expectancy v GDP per capita.

Question(s). How closely are these two indicators of quality of life are related.

References



In [12]:
# load package under name wb 
from pandas.io import wb

# find the codes for the variables of interest 
wb.search
wb.search(string='gdp.*capita').iloc[:2]


Out[12]:
id name source sourceNote sourceOrganization topics
699 6.0.GDPpc GDP per capita (2011 US$) LAC Equity Lab GDP per capita is gross domestic product divid... b'LAC Equity Lab Tablulations of the World Dev... Economy & Growth
700 6.0.GDPpc_constant GDP per capita, PPP (constant 2011 internation... LAC Equity Lab GDP per capita based on purchasing power parit... b'NULWorld Development Indicators (World Bank)L' Economy & Growth

In [13]:
# specify dates, variables, and countries 
start = 2011
# GDP per capita, population, life expectancy
variable_list = ['NY.GDP.PCAP.KD', 'SP.POP.TOTL', 'SP.DYN.LE00.IN']
country_list  = ['US', 'FR', 'JP', 'CN', 'IN', 'BR', 'MX'] 

# Python understands we need to go to the second line because ( hasn't been closed by )
data = wb.download(indicator=variable_list, 
                   country=country_list, start=start, end=start).dropna()

# see what we've got
print(data)


                    NY.GDP.PCAP.KD  SP.POP.TOTL  SP.DYN.LE00.IN
country       year                                             
Brazil        2011     5721.289504    196935134       73.347366
China         2011     3121.969744   1344130000       75.042000
France        2011    35771.547550     65343588       82.114634
India         2011     1086.048597   1221156319       65.958488
Japan         2011    36203.430066    127817277       82.591220
Mexico        2011     8307.686996    119361233       76.914171
United States 2011    44341.811840    311582564       78.641463

In [14]:
# check the column labels, change to something simpler 
print(data.columns)
data.columns = ['gdppc', 'pop', 'le']
print(data)


Index(['NY.GDP.PCAP.KD', 'SP.POP.TOTL', 'SP.DYN.LE00.IN'], dtype='object')
                           gdppc         pop         le
country       year                                     
Brazil        2011   5721.289504   196935134  73.347366
China         2011   3121.969744  1344130000  75.042000
France        2011  35771.547550    65343588  82.114634
India         2011   1086.048597  1221156319  65.958488
Japan         2011  36203.430066   127817277  82.591220
Mexico        2011   8307.686996   119361233  76.914171
United States 2011  44341.811840   311582564  78.641463

In [15]:
# scatterplot
# life expectancy v GDP per capita 
# size of circles controlled by population 

# load packages (ignored if redundant) 
import numpy as np            
import matplotlib.pyplot as plt

plt.scatter(data['gdppc'], data['le'], s=0.000001*data['pop'], alpha=0.5)
plt.ylabel('Life Expectancy')
plt.xlabel('GDP Per Capita')
plt.show()



In [16]:
# Note:  size of circles based on population

Example 4: Option prices

A financial option gives its owner the right to buy or sell an asset (the "underlying") at a preset price (the "strike") by a specific date (the "expiration date"). Puts are options to sell, calls are options to buy. We explore option prices with Yahoo Finance, specifically options on the S&P 500 exchange-traded fund, ticker SPY.

We illustrate its usefulness with a scatterplot of life expectancy v GDP per capita.

Question(s). How do put and call prices vary with their strike price? [Think about this. What would you expect?]

Warning. This won't work in Python 2.7 or, in fact, in any environment that uses versions of Pandas prior to 0.14.1. The Yahoo Option API is labeled experimental and it seems the earlier versions don't allow easy access to the strike prices.

References



In [17]:
# load packages 
import pandas as pd 
import pandas.io.data as web
from pandas.io.data import Options
import datetime as dt 
import matplotlib.pylab as plt

# ticker 
ticker = 'spy'

In [18]:
# load stock price first (the underlying) 
# pick a recent date and subtract seven days to be sure we get a quote  
# http://pymotw.com/2/datetime/#date-arithmetic
today = dt.date.today()
one_week = dt.timedelta(days=7)
start = today - one_week
stock = web.DataReader(ticker, 'yahoo', start) 
print(stock)        # just to see what we have

# take the last close (-1 is the last, 'Close' is the close)
# this shows up in our figure 
atm = stock.ix[-1,'Close']      # the -1 takes the last observation


                 Open       High        Low      Close    Volume  Adj Close
Date                                                                       
2015-04-20  209.06000  210.25000  208.96001  209.85001  82969100  209.85001
2015-04-21  209.85001  210.86000  209.24001  209.60001  67403700  209.60001
2015-04-22  210.00999  210.85001  208.89999  210.63000  77064000  210.63000
2015-04-23  210.14999  211.94000  210.00999  211.16000  99433500  211.16000
2015-04-24  211.66000  211.97000  211.11000  211.64999  57952900  211.64999

In [19]:
# get option prices for same ticker 
option = Options(ticker, 'yahoo')
expiry = dt.date(2014, 11, 20)
data_calls = option.get_call_data(expiry=expiry).dropna()
data_puts  = option.get_put_data(expiry=expiry).dropna()

# check what we have 
print(data_calls.index) 
print(data_calls.tail())


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _get_data_in_date_range(self, dates, call, put)
   1129                 try:  # Try to access on the instance
-> 1130                     frame = getattr(self, nam)
   1131                 except AttributeError:

AttributeError: 'Options' object has no attribute 'calls150501'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _get_option_data(self, expiry, name)
    748         try:
--> 749             frames = getattr(self, frame_name)
    750         except AttributeError:

AttributeError: 'Options' object has no attribute '_frames150501'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-19-a7d8c5c8e428> in <module>()
      2 option = Options(ticker, 'yahoo')
      3 expiry = dt.date(2014, 11, 20)
----> 4 data_calls = option.get_call_data(expiry=expiry).dropna()
      5 data_puts  = option.get_put_data(expiry=expiry).dropna()
      6 

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in get_call_data(self, month, year, expiry)
    817         """
    818         expiry = self._try_parse_dates(year, month, expiry)
--> 819         return self._get_data_in_date_range(expiry, call=True, put=False)
    820 
    821     def get_put_data(self, month=None, year=None, expiry=None):

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _get_data_in_date_range(self, dates, call, put)
   1130                     frame = getattr(self, nam)
   1131                 except AttributeError:
-> 1132                     frame = self._get_option_data(expiry=expiry_date, name=name)
   1133                 data.append(frame)
   1134 

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _get_option_data(self, expiry, name)
    749             frames = getattr(self, frame_name)
    750         except AttributeError:
--> 751             frames = self._get_option_frames_from_yahoo(expiry)
    752 
    753         option_data = frames[name]

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _get_option_frames_from_yahoo(self, expiry)
    666     def _get_option_frames_from_yahoo(self, expiry):
    667         url = self._yahoo_url_from_expiry(expiry)
--> 668         option_frames = self._option_frames_from_url(url)
    669         frame_name = '_frames' + self._expiry_to_string(expiry)
    670         setattr(self, frame_name, option_frames)

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _option_frames_from_url(self, url)
    687 
    688     def _option_frames_from_url(self, url):
--> 689         frames = read_html(url)
    690         nframes = len(frames)
    691         frames_req = max(self._TABLE_LOC.values())

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\html.py in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands, encoding)
    862                          'data (you passed a negative value)')
    863     return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 864                   parse_dates, tupleize_cols, thousands, attrs, encoding)

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs, encoding)
    715     retained = None
    716     for flav in flavor:
--> 717         parser = _parser_dispatch(flav)
    718         p = parser(io, compiled_match, attrs, encoding)
    719 

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\html.py in _parser_dispatch(flavor)
    660     if flavor in ('bs4', 'html5lib'):
    661         if not _HAS_HTML5LIB:
--> 662             raise ImportError("html5lib not found, please install it")
    663         if not _HAS_BS4:
    664             raise ImportError("BeautifulSoup4 (bs4) not found, please install it")

ImportError: html5lib not found, please install it

In [ ]:
# compute mid of bid and ask and arrange series for plotting 
calls_bid = data_calls['Bid']
calls_ask = data_calls['Ask'] 

calls_strikes = data_calls['Strike']
calls_mid = (data_calls['Bid'] + data_calls['Ask'])/2
puts_strikes = data_puts['Strike']
puts_mid = (data_puts['Bid'] + data_puts['Ask'])/2

Note to self. In older versions of Pandas, prior to 0.14.1, the option input puts the strike in the index, not as a column of data. The next two lines check the versions of pandas and python on the off chance we want to check: print(pd.version), ! python --version


In [ ]:
# plot call and put prices v strike 
plt.plot(calls_strikes, calls_mid, 'r', lw=2, label='calls')
plt.plot(puts_strikes, puts_mid, 'b', lw=2, label='puts')

# prettify it 
#plt.axis([120, 250, 0, 50])
plt.axvline(x=atm, color='k', linestyle='--', label='ATM')               
plt.legend(loc='best')
plt.show()

In [ ]:
# rerun the figure above with different color lines.  Or dashed lines for call and put prices.  
# or change the form of the vertical ATM line:  solid? another color?

In [ ]: