This is an introduction to Data Bootcamp, a (prospective) course at NYU designed to give students some familiarity with (i) Python and (ii) economic and financial data. A more complete collection of materials, including this IPython Notebook, is available in our GitHub repository. (And yes, Python and IPython are different things, but ignore that for now.)
In this Notebook we illustrate some of the possibilities with examples. The code will be obscure if you're new to Python, but we will fill in the gaps over time. In the meantime, you might note for future reference things you run across that you'd like to understand better. We think it's best to take this one step at a time, but if you're interested in the logic behind the code, we give links to relevant documentation under "References." The occasional "Comments" are things for us to follow up on; we suggest you ignore them.
Warnings
About us. This is part of a collection of quantitative materials for economics and finance at NYU. More at
Python. Python is a popular general-purpose programming language that has been used for a broad range of applications. Dropbox, for example, is written largely in Python.
We'll use Python to study economic and financial data, but the same methods -- and more -- can be applied to data from any source. One of our former students is using it to study patterns of survival on the Titanic. Another is using it to process text from analyst reports. Thanks to AQR and others, the data analytic toolsets in Python now rival stat-focused languages like R.
Python on the Cloud. The easiest way to run Python is to run it on the Cloud with Wakari. Just click on the link and sign up for an account. In our experience, it's virtually instantaneous. Wakari runs what are called IPython Notebooks, which are a combination of Python code and text, with the code executable in chunks. The text allows us to document what we're doing as we go. A former student (one who is reasonably tech savvy) writes: "IPython seems best for beginners. The structure almost forces good documentation and you can build the script over time versus building and running it all at once."
The one downside we've found to this is that the Wakari default uses an older version of Python that does not allow some of the cool web tools for data input. The interface for Yahoo Options data, for example, doesn't quite work (close, but not quite). We can probably tell Wakari to use a newer version, or write a workaround for this version, but we're not there yet.
If you'd like to give Wakari a try, set up an account and start experimenting. It's pretty close to self-explanatory. In the near future, we'll give instructions for accessing and running this Notebook.
Python on your computer. We find this more user-friendly, but it's trickier to set up initially. Here's what's involved:
Step 1: Download and install the Anaconda distribution of Python. Anaconda is a combination of basic Python and toolsets called packages -- what aficionados call a "code distribution." Anaconda has all the basics and covers most of what we plan to do. It has options for most operating systems (Windows, Mac OS, Linux, all of various vintages) and versions of Python (we use version 3.4). Follow these steps:
Download the installer. Click on this link. Below the words "Download Anaconda" you'll see in caps: "CHOOSE YOUR INSTALLER." To the right you'll see three logos and the words "I WANT PYTHON 3.4." First, choose your computer's operating system by clicking on the appropriate logo (window panes, apple, or penguin). The site usually guesses this, which you can tell because it will be blue rather than gray. Second, click on "I WANT PYTHON 3.4." Yup, only the newest and best for you! Finally, immediately below the words "CHOOSE YOUR INSTALLER" you'll see a box with words like "[Op System] 64-bit / Python 3.4 / Graphical Installer." Click on it to download the installer.
Install Python. Once you've downloaded the installer, click on it and follow instructions. This is no different from any other installation you've done.
Step 2: Running Python. You should now have Python on your computer. There are two standard ways to run Python:
Spyder. Our preferred method is to run Python in Spyder. We'd like to say: Start it up as you would any program. That's intentionally glib: we've had difficulty finding the on button in some environments, and there's no single solution because every operating system is different. In Windows 7, for example, installation usually puts a Spyder icon on the desktop. Click on it and you're good to go. In Windows 8.1, we managed to locate Spyder in the apps, but it's well hidden.
Spyder, continued. Once you get Spyder going, you'll have a pretty standard graphical user interface, similar to what you'd have with Matlab or R (RStudio, for example). There's an editor, typically on the left, where you can write and edit code. There are buttons at the top for running it. There's an IPython console where you can try out commands or ask for help. We'll do a lot of this, but not right now.
IPython. To create or run an IPython Notebook, you need to do something like: go to the command line and type ipython notebook. If it works, it will look a lot like Wakari, but you'll be running IPython on your computer with whatever version of Python you have installed. As with Spyder, getting it going is harder in some systems than others. Good luck!
Advice. There's a steep learning curve at the start of any effort like this. You'll get through it in time, but it takes patience and persistence. We suggest:
Wordplay. Python is named for Monty Python. Idle, a well-known Python editor, is a reference to Eric Idle. Anaconda is, of course, a play on the snake theme.
Investors -- and others -- keep a close eye on the state of the economy because it affects the performance of firms and financial assets. We'll go into this more extensively later, but for now we want to see what the economy has done in the past, especially the recent past. We use the wonderful FRED interface ("API") and load the data straight from their website. Then we graph GDP growth over the past 50 years or so and for a more recent period of greater interest.
This strategy -- find the data on the web, load it, and produce a graph -- is a model for much of what we do.
Question(s). It's always good to know what you're looking for so we'll post question(s) for each example. Here we ask how the economy is doing, and how its current performance compares to the past.
References
Note to self: The FRED API allows you to import transformations like growth rates directly. Is that possible with Pandas?
In [1]:
# anything after the hash symbol (#) is a comment
# load packages
import datetime as dt
# in later versions of Pandas this module moved to the separate pandas-datareader package
import pandas.io.data as web # data import tools
import matplotlib.pyplot as plt # plotting tools
# The next one is an IPython command: it says to put plots here in the notebook, rather than open a separate window.
%matplotlib inline
In [2]:
# get data from FRED
fred_series = ["GDPC1"]
start_date = dt.datetime(1960, 1, 1)
data = web.DataReader(fred_series, "fred", start_date)
# print last 3 data points to see what we've got
print(data.tail(3))
In [3]:
# compute quarterly growth at an annual rate (a fraction: 0.02 means 2%)
g = 4*data.pct_change()
# change label
g.columns = ['US GDP Growth']
The variable g (quarterly GDP growth expressed as an annual rate) is now what Pandas calls a DataFrame, which is a collection of data organized by variable and observation. You can get some of its properties by typing some or all of the following in the box below:
You can get information about g and what we can do with it by typing: g.[tab]. (Don't type the second period!) That will pop up a list you can scroll through. Typically it's a long list, so it takes some experience to know what to do with it.
You can also get information about things you can do with g by typing commands with an open paren: g.command( and wait. That will give you the arguments of the command. g.head and g.tail, for example, have an argument n which is the number of observations to print. head prints the top of the DataFrame, tail prints the bottom. If you leave it blank, it prints 5.
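As a rough sketch of the commands just described, the cell below builds a small stand-in DataFrame and tries head, tail, and a couple of properties. The name g_demo and its numbers are made up for illustration; they are not the FRED data.

```python
import pandas as pd

# hypothetical stand-in for g: 8 quarters of made-up growth rates (NOT FRED data)
dates = pd.date_range('2012-01-01', periods=8, freq='QS')
g_demo = pd.DataFrame({'US GDP Growth': [0.02, 0.01, 0.03, 0.00,
                                         0.02, 0.04, -0.01, 0.03]},
                      index=dates)

print(g_demo.shape)      # (number of observations, number of variables)
print(g_demo.columns)    # variable names
print(g_demo.head())     # n left blank: first 5 observations
print(g_demo.tail(3))    # n=3: last 3 observations
```

The same commands should work on g itself once the FRED data has loaded.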
In [4]:
# enter your commands here
In [5]:
# more examples: some statistics on GDP growth
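# g.mean() and g.std() return one number per column; .iloc[0] picks the first (only) one
print('Mean GDP growth:', g.mean().iloc[0])
print('Std deviation:', g.std().iloc[0])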
# do this for subperiods...
In [6]:
# quick and dirty plot
# note the financial crisis: GDP fell 8% one quarter (at an annual rate, so really 2%)
g.plot()
plt.show()
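To make the "annual rate" arithmetic in the comment above concrete, here is a minimal sketch in plain Python. The -2% quarterly figure is a round illustrative number, and the compounded version is an alternative convention we add for comparison; it is not what the code above computes.

```python
# quarterly growth of -2%, written as a fraction (an illustrative number)
quarterly = -0.02

# simple annualization, the same arithmetic as 4*data.pct_change()
annual_simple = 4 * quarterly

# compounded annualization, an alternative convention
annual_compound = (1 + quarterly)**4 - 1

print('simple:   {:.4f}'.format(annual_simple))
print('compound: {:.4f}'.format(annual_compound))
```

For growth rates this small the two conventions give similar answers, which is why the simple version is a common shortcut.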
In [7]:
# more complex plot, bar chart for last 6 quarters
# also: add moving average?
Gene Fama and Ken French are two of the leading academics studying (primarily) equity returns. Some of this work is summarized in the press release and related material for the 2013 Nobel Prize in economics, which was shared by Fama with Lars Hansen and Robert Shiller. For now, it's enough to say that Ken French posts an extensive collection of equity data on his website.
We'll look at what have come to be called the Fama-French factors. The data includes:
We download all of these at once, monthly from 1926. Each is reported as a percentage. Since they're monthly, you can get a rough annual number if you multiply by 12.
Question(s).
The question we address is how the returns compare: their means, their variability, and so on.
[Ask yourself: how would I answer this? What would I like to do with the data?]
References
In [8]:
# load packages (if it's redundant it'll be ignored)
import pandas.io.data as web
# read data from Ken French's website
ff = web.DataReader('F-F_Research_Data_Factors', 'famafrench')[0]
# NB: ff.xs is a conflict, rename to xsm
ff.columns = ['xsm', 'smb', 'hml', 'rf']
# see what we've got
print(ff.head(3))
print(ff.describe())
In [9]:
# compute and print summary stats
# Pandas' kurtosis() already reports excess kurtosis (normal = 0), so we don't subtract 3
moments = [ff.mean(), ff.std(), ff.skew(), ff.kurtosis()]
# \n here is a line break
print('Summary stats for Fama-French factors (mean, std, skew, ex kurt)') #, end='\n\n')
print(moments)
#[print(moment, end='\n\n') for moment in moments]
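If the four summary statistics are unfamiliar, the sketch below spells out one way to compute them in plain Python. It uses the simple "population" formulas, while Pandas applies small-sample corrections, so the numbers will not match Pandas exactly. The function name moments_by_hand and the sample returns are made up for illustration.

```python
import math

def moments_by_hand(x):
    """Return (mean, std, skewness, excess kurtosis) using population formulas."""
    n = len(x)
    mean = sum(x) / n
    # central moments of order 2, 3, and 4
    m2 = sum((xi - mean)**2 for xi in x) / n
    m3 = sum((xi - mean)**3 for xi in x) / n
    m4 = sum((xi - mean)**4 for xi in x) / n
    std = math.sqrt(m2)
    skew = m3 / m2**1.5
    ex_kurt = m4 / m2**2 - 3   # subtract 3 so a normal distribution scores 0
    return mean, std, skew, ex_kurt

# made-up monthly returns in percent (NOT the Fama-French data)
returns = [1.0, -2.0, 0.5, 3.0, -1.5, 0.0, 2.0, -3.0]
print(moments_by_hand(returns))
```

Skewness measures asymmetry (negative means a longer left tail) and excess kurtosis measures how fat the tails are relative to a normal distribution.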
In [10]:
# try some things yourself
# like what? type ff.[tab]
import pandas as pd
pd.__version__
Out[10]:
In [11]:
# some plots
ff.plot()
plt.show()
ff.hist(bins=50, sharex=True)
plt.show()
ff.boxplot(whis=0, return_type='axes')
plt.show()
Answer(s)? Aren't the boxplots in the last figure cool? The histograms above them? What do you see in them? How do the various returns compare?
The World Bank collects a broad range of economic and social indicators for most countries in the world. They also have a nice interface. It's a good source for basic information about how the economic climate compares across countries.
We illustrate its usefulness with a scatterplot of life expectancy v GDP per capita.
Question(s). How closely are these two indicators of quality of life related?
References
In [12]:
# load package under name wb
from pandas.io import wb
# find the codes for the variables of interest
wb.search(string='gdp.*capita').iloc[:2]
Out[12]:
In [13]:
# specify dates, variables, and countries
start = 2011
# GDP per capita, population, life expectancy
variable_list = ['NY.GDP.PCAP.KD', 'SP.POP.TOTL', 'SP.DYN.LE00.IN']
country_list = ['US', 'FR', 'JP', 'CN', 'IN', 'BR', 'MX']
# Python understands we need to go to the second line because ( hasn't been closed by )
data = wb.download(indicator=variable_list,
country=country_list, start=start, end=start).dropna()
# see what we've got
print(data)
In [14]:
# check the column labels, change to something simpler
print(data.columns)
data.columns = ['gdppc', 'pop', 'le']
print(data)
In [15]:
# scatterplot
# life expectancy v GDP per capita
# size of circles controlled by population
# load packages (ignored if redundant)
import numpy as np
import matplotlib.pyplot as plt
plt.scatter(data['gdppc'], data['le'], s=0.000001*data['pop'], alpha=0.5)
plt.ylabel('Life Expectancy')
plt.xlabel('GDP Per Capita')
plt.show()
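One way to put a number on "how closely related" is a correlation coefficient. The sketch below computes it by hand and also shows how the circle sizes in the scatterplot are derived from population. The three data points are made up for three hypothetical countries; they are not World Bank data, and the function name correlation is ours.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx)**2 for a in x)
    var_y = sum((b - my)**2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# made-up values for three hypothetical countries (NOT World Bank data)
gdppc = [2000.0, 10000.0, 40000.0]      # GDP per capita
le = [65.0, 74.0, 80.0]                 # life expectancy in years
pop = [1.2e9, 2.0e8, 6.0e7]             # population

print('correlation:', correlation(gdppc, le))

# marker sizes as in the scatterplot: one unit of size per million people
sizes = [0.000001 * p for p in pop]
print('marker sizes:', sizes)
```

A correlation near 1 means the two indicators move closely together, near 0 means they are unrelated. In plt.scatter the s argument is the marker area, so a country with ten times the population gets a circle with ten times the area.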
In [16]:
# Note: size of circles based on population
A financial option gives its owner the right to buy or sell an asset (the "underlying") at a preset price (the "strike") by a specific date (the "expiration date"). Puts are options to sell, calls are options to buy. We explore option prices with Yahoo Finance, specifically options on the S&P 500 exchange-traded fund, ticker SPY.
Question(s). How do put and call prices vary with their strike price? [Think about this. What would you expect?]
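As a starting point for thinking about this, the sketch below computes what each option is worth at expiration for a range of strikes. The function names and the numbers are hypothetical, and market prices before expiration will generally differ from these payoffs, so treat it as a guide to the direction of the relationship only.

```python
def call_payoff(price, strike):
    """Value at expiration of the right to buy at the strike."""
    return max(price - strike, 0.0)

def put_payoff(price, strike):
    """Value at expiration of the right to sell at the strike."""
    return max(strike - price, 0.0)

# hypothetical numbers: underlying ends at 200, strikes from 180 to 220
price = 200.0
for strike in [180.0, 190.0, 200.0, 210.0, 220.0]:
    print(strike, call_payoff(price, strike), put_payoff(price, strike))
```

The call payoff falls as the strike rises and the put payoff rises with it, which is the pattern to look for in the quoted prices.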
Warning. This won't work in Python 2.7 or, in fact, in any environment that uses versions of Pandas prior to 0.14.1. The Yahoo Option API is labeled experimental and it seems the earlier versions don't allow easy access to the strike prices.
References
In [17]:
# load packages
import pandas as pd
import pandas.io.data as web
from pandas.io.data import Options
import datetime as dt
import matplotlib.pyplot as plt
# ticker
ticker = 'spy'
In [18]:
# load stock price first (the underlying)
# pick a recent date and subtract seven days to be sure we get a quote
# http://pymotw.com/2/datetime/#date-arithmetic
today = dt.date.today()
one_week = dt.timedelta(days=7)
start = today - one_week
stock = web.DataReader(ticker, 'yahoo', start)
print(stock) # just to see what we have
# take the last close (-1 is the last, 'Close' is the close)
# this shows up in our figure
atm = stock['Close'].iloc[-1] # 'Close' picks the column, .iloc[-1] takes the last observation
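The date arithmetic in the cell above can be tried on its own. This is a minimal sketch that uses a fixed date rather than dt.date.today() so the result is the same every time; the particular date is just an example.

```python
import datetime as dt

# a fixed date instead of dt.date.today(), so the result is reproducible
day = dt.date(2014, 11, 20)
one_week = dt.timedelta(days=7)

start = day - one_week          # date minus timedelta gives another date
print(start)
print((day - start).days)       # date minus date gives a timedelta
```

Subtracting a week is a simple way to make sure the window includes at least one trading day, even across weekends and holidays.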
In [19]:
# get option prices for same ticker
option = Options(ticker, 'yahoo')
expiry = dt.date(2014, 11, 20)
data_calls = option.get_call_data(expiry=expiry).dropna()
data_puts = option.get_put_data(expiry=expiry).dropna()
# check what we have
print(data_calls.index)
print(data_calls.tail())
In [ ]:
# compute mid of bid and ask and arrange series for plotting
calls_bid = data_calls['Bid']
calls_ask = data_calls['Ask']
calls_strikes = data_calls['Strike']
calls_mid = (data_calls['Bid'] + data_calls['Ask'])/2
puts_strikes = data_puts['Strike']
puts_mid = (data_puts['Bid'] + data_puts['Ask'])/2
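The mid is simply the average of the bid and ask quotes. Here is a small sketch in plain Python with made-up quotes; the numbers are hypothetical and not Yahoo data, and the column names simply mirror the ones used above.

```python
# hypothetical quotes for three call options (NOT Yahoo data)
quotes = [
    {'Strike': 195.0, 'Bid': 6.10, 'Ask': 6.30},
    {'Strike': 200.0, 'Bid': 2.40, 'Ask': 2.60},
    {'Strike': 205.0, 'Bid': 0.50, 'Ask': 0.60},
]

# midpoint of bid and ask for each strike
strikes = [q['Strike'] for q in quotes]
mids = [(q['Bid'] + q['Ask']) / 2 for q in quotes]

for strike, mid in zip(strikes, mids):
    print(strike, round(mid, 2))
```

We use the mid because no single transaction price is quoted: buyers pay something close to the ask and sellers receive something close to the bid.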
Note to self. In older versions of Pandas, prior to 0.14.1, the option input puts the strike in the index, not as a column of data. To check the versions of Pandas and Python, run these two lines: print(pd.__version__), ! python --version
In [ ]:
# plot call and put prices v strike
plt.plot(calls_strikes, calls_mid, 'r', lw=2, label='calls')
plt.plot(puts_strikes, puts_mid, 'b', lw=2, label='puts')
# prettify it
#plt.axis([120, 250, 0, 50])
plt.axvline(x=atm, color='k', linestyle='--', label='ATM')
plt.legend(loc='best')
plt.show()
In [ ]:
# rerun the figure above with different color lines. Or dashed lines for call and put prices.
# or change the form of the vertical ATM line: solid? another color?
In [ ]: