Data Bootcamp 1: Examples

Python applied to economic and financial data

This is an introduction to Data Bootcamp, a (prospective) course at NYU designed to give students some familiarity with (i) Python and (ii) economic and financial data. A more complete collection of materials, including this IPython Notebook, is available in our Github repository. (And yes, Python and IPython are different things, but ignore that for now.)

In this Notebook we illustrate some of the possibilities with examples. The code will be obscure if you're new to Python, but we will fill in the gaps over time. In the meantime, you might note for future reference things you run across that you'd like to understand better. We think it's best to take this one step at a time, but if you're interested in the logic behind the code, we give links to relevant documentation under "References." The occasional "Comments" are things for us to follow up on, we suggest you ignore them.

Recent US GDP growth (data from FRED)
GDP per capita in selected countries (data from the World Bank)
Fama-French equity "factors" (data from Ken French's website)
S&P 500 ETF (Spyders) (data from Yahoo finance)
read a csv??

Warnings

This program requires internet access, that's where the data comes from.
It's also work in progress, the eta for the course is Fall 2015.

About us. This is part of a collection of quantitative materials for economics and finance at NYU. More at

Dave Backus's Data Bootcamp.
Tom Sargent and John Stachurski's Quantitative Economics. Dynamic macroeconomic theory with Python.
Laura Veldkamp and Venky Vekateswaran's Advanced Macro. Efficient processing of information for forecasting and portfolio choice.
Dave Backus's Macrofoundations course. A course on the macroeconomics foundations of asset prices and returns, including equity index options and long-term bonds. Uses Matlab.

Getting started

Python. Python is a popular general-purpose programming language that has been used for a broad range of applications. Dropbox, for example, is written entirely in Python.

We'll use Python to study economic and financial data, but the same methods -- and more -- can be applied to data from any source. One of our former students is using it to study patterns of survival on the Titanic. Another is using it to process text from analyst reports. Thanks to AQR and others, the data analytic toolsets in Python now rival stat-focused languages like R.

Python on the Cloud. The easiest way to run Python is to run it on the Cloud with Wakari. Just click on the link and sign up for an account. In our experience, it's virtually instantaneous. Wakari runs what are called IPython Notebooks, which are a combination of Python code and text, with the code executable in chunks. The text allows us to document what we're doing as we go. A former student (one who is reasonably tech savvy) writes: "IPython seems best for beginners. The structure almost forces good documentation and you can build the script over time versus building and running it all at once."

The one downside we've found to this is that the Wakari default uses a older version of Python that does not allow some of the cool web tools for data input. The interface for Yahoo Options data, for example, doesn't quite work (close, but not quite). We can probably tell Wakari to use a newer version, or write a workaround for this version, but we're not there yet.

If you'd like to give Wakari a try, set up an account and give it a try. It's pretty close to self-explanatory. In the near future, we'll give instructions for accessing and running this Notebook.

Python on your computer. We find this more user-friendly, but it's trickier to set up initially. Here's what's involved:

Step 1: Download and install the Anaconda distribution of Python. Anaconda is a combination of basic Python and toolsets called packages -- what aficionados call a "code distribution." Anaconda has all the basics and covers most of what we plan to do. It has options for most operating systems (Windows, Mac OS, Linux, all of various vintages) and versions of Python (we use version 3.4). Follow these steps:

Download the installer. Click on this link. Below the words "Download Anaconda" you'll see in caps: "CHOOSE YOUR INSTALLER." To the right you'll see three logos and the words "I WANT PYTHON 3.4." First, choose your computer's operating system by clicking on the appropriate logo (window panes, apple, or penguin). The site usually guesses this, which you can tell because it will be blue rather than gray. Second, click on "I WANT PYTHON 3.4." Yup, only the newest and best for you! Finally, immediately below the words "CHOOSE YOUR INSTALLER" you'll see a box with words like "[Op System] 64-bit / Python 3.4 / Graphical Installer." Click on it to download the installer.
Install Python. Once you've downloaded the installer, click on it and follow instructions. This is no different from any other installation you've done.

Step 2: Running Python. You should now have Python on your computer. There are two standard ways to run Python:

Spyder. Our preferred method is to run Python in Spyder. We'd like to say: Start it up as you would any program. That's intentionally glib, we've had difficulty finding the on button in some environments, but there's no single solution, every operating system is different. In Windows 7, for example, installation usually puts a Spyder icon on the desktop. Click on it and you're good to go. In Windows 8.1, we managed to locate Spyder in the apps, but it's well hidden.
Spyder, continued. Once you get Spyder going, you'll have a pretty standard graphical user interface, similar to what you'd have with Matlab or R (RStudio, for example). There's an editor, typically on the left, where you can write and edit code. There are buttons at the top for running it. There's an IPython console where you can try out commands or ask for help. We'll do a lot of this, but not right now.
IPython. To create or run an IPython Notebook, you need to do something like: go to the command line and type ipython notebook. If it works, it will look a lot like Wakari, but you'll be running IPython on your computer with whatever version of Python you have installed. As with Spyder, getting it going is harded in some systems than others. Good luck!

Advice. There's a steep learning curve at the start of any effort like this. You'll get through it in time, but it takes patience and persistence We suggest:

Take your time. This won't happen overnight. But it won't take weeks, either, before you're doing cool things.
Ask for help. This is much easier to do if you have friends doing the same thing. Or even better, friends who are ahead of you. You'll probably find you need a lot of help at the start, but once you get going you'll be able to help yourself.
Google fu. Once you've made some progress, you can learn a lot just by Googling your issue. The best links are often the Python documentation and Stack Overflow.
Start from templates. The easiest place to start is with programs that work. Start with one and tweak it till you get what you want. The documentation includes lots of examples. And we'll have examples of our own.

Wordplay. Python is named for Monty Python. Idle, a well-know Python editor, is a reference to Eric Idle. Anaconda is, of course, a play on the snake theme.

Example 1: US GDP Growth

Investors -- and others -- keep a close eye on the state of the economy because it affects the performance of firms and financial assets. We'll go into this more extensively later, but for now we want to see what the economy has done in the past, especially the recent past. We use the wonderful FRED interface ("API") and load the data straight from their website. Then we graph GDP growth over the past 50 years or so and for a more recent period of greater interest.

This strategy -- find the data on the web, load it, and produce a graph -- is a model for much of what we do.

Question(s). It's always good to know what you're looking for so we'll post question(s) for each example. Here we ask how the economy is doing, and how its current performance compares to the past.

References

FRED: http://research.stlouisfed.org/fred2/
Pandas: http://pandas.pydata.org/pandas-docs/stable/
Data access: http://pandas.pydata.org/pandas-docs/stable/remote_data.html#fred
Inline plots: http://stackoverflow.com/questions/21176731/automatically-run-matplotlib-inline-in-ipython-notebook

Note to self: The FRED API allows you to import transformations like growth rates directly. Is that possible with Pandas?



In [1]:

    
# anything after the hashtag is a comment 
# load packages 
import datetime as dt 
import pandas.io.data as web       # data import tools 
import matplotlib.pyplot as plt    # plotting tools 

# The next one is an IPython command:  it says to put plots here in the notebook, rather than open a separate window.
%matplotlib inline



In [2]:

    
# get data from FRED
fred_series = ["GDPC1"]
start_date = dt.datetime(1960, 1, 1)
data = web.DataReader(fred_series, "fred", start_date)

# print last 3 data points to see what we've got 
print(data.tail(3))









    



              GDPC1
DATE               
2014-04-01  16010.4
2014-07-01  16205.6
2014-10-01  16294.7



In [3]:

    
# compute annual growth rates 
g = 4*data.pct_change()

# change label   
g.columns = ['US GDP Growth']

The variable g (quarterly GDP growth expressed as an annual rate) is now what Python calls a DataFrame, which is a collection of data organized by variable and observation. You can get some of its properties by typing some or all of the following in the box below:

type(g)
g.tail()
g.head(2)

You can get information about g and what we can do with it by typing: g.[tab]. (Don't type the second period!) That will pop up a list you can scroll through. Typically it's a long list, so it takes some experience to know what to do with it.

You can also get information about things you can do with g by typing commands with an open paren: g.command( and wait. That will give you the arguments of the command. g.head and g.tail, for example, have an argument n which is the number of observations to print. head prints the top of the DataFrame, tail prints the bottom. If you leave it blank, it prints 5.



In [4]:

    
# enter your commands here



In [5]:

    
# more examples:  some statistics on GDP growth  
print(['Mean GDP growth ', g.mean()])
print(['Std deviation ', g.std()])

# do this for subperiods...









    



['Mean GDP growth ', US GDP Growth    0.030429
dtype: float64]
['Std deviation ', US GDP Growth    0.03383
dtype: float64]



In [6]:

    
# quick and dirty plot 
# note the financial crisis:  GDP fell 8% one quarter (at an annual rate, so really 2%)
g.plot() 
plt.show()



In [7]:

    
# more complex plot, bar chart for last 6 quarters
# also:  add moving average?

Example 2: Fama-French equity "factors"

Gene Fama and Ken French are two of the leading academics studying (primarily) equity returns. Some of this work is summarized in the press release and related material for the 2013 Nobel Prize in economics, which was shared by Fama with Lars Hansen and Robert Shiller. For now, it's enough to say that Ken French posts an extensive collection of equity data on his website.

We'll look at what have come to be called the Fama-French factors. The data includes:

xsm: the return on the market (aggregate equity) minus the riskfree rate
smb (small minus big): the return on small firms minus the return on big firms
hml (high minus low): the return on firms with high book-to-market ratios minus those with low ratios.
rf: the riskfree rate.

We download all of these at once, monthly from 1926. Each is reported as a percentage. Since they're monthly, you can get a rough annual number if you multiply by 12.

Question(s). The question we address is how the returns compare: their means, their variability, and so on.
[Ask yourself: how would I answer this? What would I like to do with the data?]

References



In [8]:

    
# load packages (if it's redundant it'll be ignored)
import pandas.io.data as web      

# read data from Ken French's website 
ff = web.DataReader('F-F_Research_Data_Factors', 'famafrench')[0] 
# NB:  ff.xs is a conflict, rename to xsm  
ff.columns = ['xsm', 'smb', 'hml', 'rf']

# see what we've got
print(ff.head(3))
print(ff.describe())









    



         xsm   smb   hml    rf
192607  2.96 -2.30 -2.87  0.22
192608  2.64 -1.40  4.19  0.25
192609  0.36 -1.32  0.01  0.23
               xsm          smb          hml           rf
count  1065.000000  1065.000000  1065.000000  1065.000000
mean      0.653878     0.221850     0.389840     0.283099
std       5.400974     3.221139     3.534641     0.254418
min     -29.130000   -16.400000   -13.280000    -0.060000
25%      -2.020000    -1.560000    -1.290000     0.050000
50%       1.020000     0.060000     0.230000     0.250000
75%       3.660000     1.780000     1.740000     0.430000
max      38.850000    36.700000    35.460000     1.350000



In [9]:

    
# compute and print summary stats 
moments = [ff.mean(), ff.std(), ff.skew(), ff.kurtosis() - 3]

# \n here is a line break
print('Summary stats for Fama-French factors (mean, std, skew, ex kurt)') #, end='\n\n') 
print(moments)
#[print(moment, end='\n\n') for moment in moments]









    



Summary stats for Fama-French factors (mean, std, skew, ex kurt)
[xsm    0.653878
smb    0.221850
hml    0.389840
rf     0.283099
dtype: float64, xsm    5.400974
smb    3.221139
hml    3.534641
rf     0.254418
dtype: float64, xsm    0.192321
smb    1.973245
hml    2.147722
rf     1.051113
dtype: float64, xsm     4.771829
smb    16.497750
hml    15.971733
rf     -1.745900
dtype: float64]



In [10]:

    
# try some things yourself 
# like what?  type ff.[tab]
import pandas as pd
pd.__version__









    Out[10]:





'0.16.0'



In [11]:

    
# some plots 
ff.plot()
plt.show()

ff.hist(bins=50, sharex=True)
plt.show()

ff.boxplot(whis=0, return_type='axes')
plt.show()

Answer(s)? Aren't the boxplots in the last figure cool? The histograms above them? What do you see in them? How do the various returns compare?

Example 3: GDP per capita and life expectancy

The World Bank collects a broad range of economic and social indicators for most countries in the World. They also have a nice interface. It's a good source for basic information about the economic climate compares across countries.

We illustrate its usefulness with a scatterplot of life expectancy v GDP per capita.

Question(s). How closely are these two indicators of quality of life are related.

References



In [12]:

    
# load package under name wb 
from pandas.io import wb

# find the codes for the variables of interest 
wb.search
wb.search(string='gdp.*capita').iloc[:2]









    Out[12]:






  
    
      
      id
      name
      source
      sourceNote
      sourceOrganization
      topics
    
  
  
    
      699
      6.0.GDPpc
      GDP per capita (2011 US$)
      LAC Equity Lab
      GDP per capita is gross domestic product divid...
      b'LAC Equity Lab Tablulations of the World Dev...
      Economy & Growth
    
    
      700
      6.0.GDPpc_constant
      GDP per capita, PPP (constant 2011 internation...
      LAC Equity Lab
      GDP per capita based on purchasing power parit...
      b'NULWorld Development Indicators (World Bank)L'
      Economy & Growth



In [13]:

    
# specify dates, variables, and countries 
start = 2011
# GDP per capita, population, life expectancy
variable_list = ['NY.GDP.PCAP.KD', 'SP.POP.TOTL', 'SP.DYN.LE00.IN']
country_list  = ['US', 'FR', 'JP', 'CN', 'IN', 'BR', 'MX'] 

# Python understands we need to go to the second line because ( hasn't been closed by )
data = wb.download(indicator=variable_list, 
                   country=country_list, start=start, end=start).dropna()

# see what we've got
print(data)









    



                    NY.GDP.PCAP.KD  SP.POP.TOTL  SP.DYN.LE00.IN
country       year                                             
Brazil        2011     5721.289504    196935134       73.347366
China         2011     3121.969744   1344130000       75.042000
France        2011    35771.547550     65343588       82.114634
India         2011     1086.048597   1221156319       65.958488
Japan         2011    36203.430066    127817277       82.591220
Mexico        2011     8307.686996    119361233       76.914171
United States 2011    44341.811840    311582564       78.641463



In [14]:

    
# check the column labels, change to something simpler 
print(data.columns)
data.columns = ['gdppc', 'pop', 'le']
print(data)









    



Index(['NY.GDP.PCAP.KD', 'SP.POP.TOTL', 'SP.DYN.LE00.IN'], dtype='object')
                           gdppc         pop         le
country       year                                     
Brazil        2011   5721.289504   196935134  73.347366
China         2011   3121.969744  1344130000  75.042000
France        2011  35771.547550    65343588  82.114634
India         2011   1086.048597  1221156319  65.958488
Japan         2011  36203.430066   127817277  82.591220
Mexico        2011   8307.686996   119361233  76.914171
United States 2011  44341.811840   311582564  78.641463



In [15]:

    
# scatterplot
# life expectancy v GDP per capita 
# size of circles controlled by population 

# load packages (ignored if redundant) 
import numpy as np            
import matplotlib.pyplot as plt

plt.scatter(data['gdppc'], data['le'], s=0.000001*data['pop'], alpha=0.5)
plt.ylabel('Life Expectancy')
plt.xlabel('GDP Per Capita')
plt.show()



In [16]:

    
# Note:  size of circles based on population

Example 4: Option prices

A financial option gives its owner the right to buy or sell an asset (the "underlying") at a preset price (the "strike") by a specific date (the "expiration date"). Puts are options to sell, calls are options to buy. We explore option prices with Yahoo Finance, specifically options on the S&P 500 exchange-traded fund, ticker SPY.

We illustrate its usefulness with a scatterplot of life expectancy v GDP per capita.

Question(s). How do put and call prices vary with their strike price? [Think about this. What would you expect?]

Warning. This won't work in Python 2.7 or, in fact, in any environment that uses versions of Pandas prior to 0.14.1. The Yahoo Option API is labeled experimental and it seems the earlier versions don't allow easy access to the strike prices.

References



In [17]:

    
# load packages 
import pandas as pd 
import pandas.io.data as web
from pandas.io.data import Options
import datetime as dt 
import matplotlib.pylab as plt

# ticker 
ticker = 'spy'



In [18]:

    
# load stock price first (the underlying) 
# pick a recent date and subtract seven days to be sure we get a quote  
# http://pymotw.com/2/datetime/#date-arithmetic
today = dt.date.today()
one_week = dt.timedelta(days=7)
start = today - one_week
stock = web.DataReader(ticker, 'yahoo', start) 
print(stock)        # just to see what we have

# take the last close (-1 is the last, 'Close' is the close)
# this shows up in our figure 
atm = stock.ix[-1,'Close']      # the -1 takes the last observation









    



                 Open       High        Low      Close    Volume  Adj Close
Date                                                                       
2015-04-20  209.06000  210.25000  208.96001  209.85001  82969100  209.85001
2015-04-21  209.85001  210.86000  209.24001  209.60001  67403700  209.60001
2015-04-22  210.00999  210.85001  208.89999  210.63000  77064000  210.63000
2015-04-23  210.14999  211.94000  210.00999  211.16000  99433500  211.16000
2015-04-24  211.66000  211.97000  211.11000  211.64999  57952900  211.64999



In [19]:

    
# get option prices for same ticker 
option = Options(ticker, 'yahoo')
expiry = dt.date(2014, 11, 20)
data_calls = option.get_call_data(expiry=expiry).dropna()
data_puts  = option.get_put_data(expiry=expiry).dropna()

# check what we have 
print(data_calls.index) 
print(data_calls.tail())









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _get_data_in_date_range(self, dates, call, put)
   1129                 try:  # Try to access on the instance
-> 1130                     frame = getattr(self, nam)
   1131                 except AttributeError:

AttributeError: 'Options' object has no attribute 'calls150501'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _get_option_data(self, expiry, name)
    748         try:
--> 749             frames = getattr(self, frame_name)
    750         except AttributeError:

AttributeError: 'Options' object has no attribute '_frames150501'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-19-a7d8c5c8e428> in <module>()
      2 option = Options(ticker, 'yahoo')
      3 expiry = dt.date(2014, 11, 20)
----> 4 data_calls = option.get_call_data(expiry=expiry).dropna()
      5 data_puts  = option.get_put_data(expiry=expiry).dropna()
      6 

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in get_call_data(self, month, year, expiry)
    817         """
    818         expiry = self._try_parse_dates(year, month, expiry)
--> 819         return self._get_data_in_date_range(expiry, call=True, put=False)
    820 
    821     def get_put_data(self, month=None, year=None, expiry=None):

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _get_data_in_date_range(self, dates, call, put)
   1130                     frame = getattr(self, nam)
   1131                 except AttributeError:
-> 1132                     frame = self._get_option_data(expiry=expiry_date, name=name)
   1133                 data.append(frame)
   1134 

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _get_option_data(self, expiry, name)
    749             frames = getattr(self, frame_name)
    750         except AttributeError:
--> 751             frames = self._get_option_frames_from_yahoo(expiry)
    752 
    753         option_data = frames[name]

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _get_option_frames_from_yahoo(self, expiry)
    666     def _get_option_frames_from_yahoo(self, expiry):
    667         url = self._yahoo_url_from_expiry(expiry)
--> 668         option_frames = self._option_frames_from_url(url)
    669         frame_name = '_frames' + self._expiry_to_string(expiry)
    670         setattr(self, frame_name, option_frames)

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py in _option_frames_from_url(self, url)
    687 
    688     def _option_frames_from_url(self, url):
--> 689         frames = read_html(url)
    690         nframes = len(frames)
    691         frames_req = max(self._TABLE_LOC.values())

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\html.py in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands, encoding)
    862                          'data (you passed a negative value)')
    863     return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 864                   parse_dates, tupleize_cols, thousands, attrs, encoding)

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs, encoding)
    715     retained = None
    716     for flav in flavor:
--> 717         parser = _parser_dispatch(flav)
    718         p = parser(io, compiled_match, attrs, encoding)
    719 

C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\html.py in _parser_dispatch(flavor)
    660     if flavor in ('bs4', 'html5lib'):
    661         if not _HAS_HTML5LIB:
--> 662             raise ImportError("html5lib not found, please install it")
    663         if not _HAS_BS4:
    664             raise ImportError("BeautifulSoup4 (bs4) not found, please install it")

ImportError: html5lib not found, please install it



In [ ]:

    
# compute mid of bid and ask and arrange series for plotting 
calls_bid = data_calls['Bid']
calls_ask = data_calls['Ask'] 

calls_strikes = data_calls['Strike']
calls_mid = (data_calls['Bid'] + data_calls['Ask'])/2
puts_strikes = data_puts['Strike']
puts_mid = (data_puts['Bid'] + data_puts['Ask'])/2

Note to self. In older versions of Pandas, prior to 0.14.1, the option input puts the strike in the index, not as a column of data. The next two lines check the versions of pandas and python on the off chance we want to check: print(pd.version), ! python --version



In [ ]:

    
# plot call and put prices v strike 
plt.plot(calls_strikes, calls_mid, 'r', lw=2, label='calls')
plt.plot(puts_strikes, puts_mid, 'b', lw=2, label='puts')

# prettify it 
#plt.axis([120, 250, 0, 50])
plt.axvline(x=atm, color='k', linestyle='--', label='ATM')               
plt.legend(loc='best')
plt.show()



In [ ]:

    
# rerun the figure above with different color lines.  Or dashed lines for call and put prices.  
# or change the form of the vertical ATM line:  solid? another color?



In [ ]:

	id	name	source	sourceNote	sourceOrganization	topics
699	6.0.GDPpc	GDP per capita (2011 US$)	LAC Equity Lab	GDP per capita is gross domestic product divid...	b'LAC Equity Lab Tablulations of the World Dev...	Economy & Growth
700	6.0.GDPpc_constant	GDP per capita, PPP (constant 2011 internation...	LAC Equity Lab	GDP per capita based on purchasing power parit...	b'NULWorld Development Indicators (World Bank)L'	Economy & Growth