Data Bootcamp: Examples

Most of what we do in this course is find some data that interests us, read it into Python, organize it in a convenient form, and produce a graph. The details vary, but that's the general plan. The best-case scenario: The graph tells us something we don't know, and points to other things we'd like to follow up on. As for the Python code: You'll be able to do all this and more by the end of the term.

Most of this runs with packages that come pre-installed with the Anaconda distribution. The exceptions are the options example, which uses the pandas-datareader package, and the economic-indicators heatmap, which uses the seaborn package. We'll talk more about this later, but you can install them from the terminal or command line with conda install package-name.
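For reference, the installs for the two extra packages named above look like this on the command line (conda will prompt before installing unless you add the -y flag):

```shell
conda install pandas-datareader
conda install seaborn
```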

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

Preliminaries

The tradition is to put import statements at the top and be done with them. But since we sometimes run sections of code separately, we've repeated some of them below. That's fine: a second import is redundant but harmless.
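To see why a repeated import is harmless, here's a small check: Python caches loaded modules in sys.modules, so a second import just returns the cached object rather than reloading anything.

```python
import sys
import math

first = math
import math  # repeated import: returns the cached module, no reload, no side effects
print('same object after re-import:', math is first)
```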


In [ ]:
# import packages 
import pandas as pd                   # data management
import matplotlib.pyplot as plt       # graphics 
import matplotlib as mpl              # graphics parameters
import numpy as np                    # numerical calculations 

# IPython command, puts plots in notebook 
%matplotlib inline

# check Python version 
import datetime as dt 
import sys
print('Today is', dt.date.today())
print('What version of Python are we running? \n', sys.version, sep='')

Example: US economic conditions

We see a lot of variation in GDP growth rates in all countries. It's not obvious why, but it's a fact of life, and one that investors and other business analysts track closely. Basic numbers like GDP only come out quarterly, and with at least a one-month lag, so we're often in the position of not knowing how the economy is doing now, much less how it will do in the future. To get a clearer sense of current conditions, analysts typically look at a wide range of indicators. Many of these indicators are available more quickly than GDP. Since they typically move up and down with GDP, we can use their movements to infer what GDP might be doing.

Of the hundreds of economic indicators, some of the most popular (with their FRED codes) are

  • INDPRO: industrial production
  • PAYEMS: nonfarm employment
  • AWHMAN: average weekly hours worked in manufacturing
  • PERMIT: permits for new housing
  • UMCSENT: consumer sentiment

All of them are monthly. You can find more about this kind of thing in our book, chapter 11.
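Before pulling real data, here is the transformation we apply below, run on a made-up monthly series (the series name and values are invented for illustration; no network access needed): compute year-over-year growth with pct_change, then standardize to mean 0 and standard deviation 1 so different indicators are comparable.

```python
import numpy as np
import pandas as pd

# made-up monthly index and level series (a stand-in for something like INDPRO)
idx = pd.date_range('2010-01-01', periods=48, freq='MS')
rng = np.random.default_rng(0)
fake = pd.DataFrame({'FAKE': np.linspace(100, 120, 48) + rng.normal(0, 1, 48)},
                    index=idx)

g = fake.pct_change(periods=12).dropna()   # year-over-year growth rate
g_std = (g - g.mean()) / g.std()           # standardize: mean 0, std 1
print(g_std.describe())
```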


In [ ]:
# get data from FRED
import pandas as pd 
from pandas_datareader import data, wb
import datetime as dt                  # handles dates 

# get data 
indicators = ['INDPRO', 'PAYEMS', 'AWHMAN', 'PERMIT', 'UMCSENT']
start_date = dt.datetime(1970, 1, 1)
inds = data.DataReader(indicators, "fred", start_date)
end = inds.index[-1]

# yoy growth rates 
g = inds.pct_change(periods=12).dropna()
# standardize: mean 0, std 1 for each indicator
g_std = (g - g.mean()) / g.std()

# correlations 
g_std.corr()

In [ ]:
# plot 
fig, ax = plt.subplots()
g_std.plot(ax=ax) 
ax.set_title('Economic indicators', fontsize=14, loc='left')
ax.set_ylabel('Standard deviations from mean')
ax.set_xlabel('')
ax.hlines(y=0, xmin=start_date, xmax=end, linestyles='dashed')
ax.legend().set_visible(False)

In [ ]:
# focus on recent past 
recent_date = dt.datetime(2011, 1, 1)
g_std = g_std[g_std.index>=recent_date]

In [ ]:
fig, ax = plt.subplots()
g_std.plot(ax=ax)
ax.set_title('Zoom in on recent past', fontsize=14, loc='left')
ax.set_ylabel('Standard deviations from mean')
ax.set_xlabel('')
ax.hlines(y=0, xmin=recent_date, xmax=end, linestyles='dashed')
ax.legend(loc='upper left', fontsize=10, handlelength=2, labelspacing=0.15)

Comment. Chase likes heatmaps. The one that follows needs work, but it's an attempt to reproduce the previous graph as a heatmap, where the value of each variable is represented by the color and its intensity. Dark red: large positive value. Dark blue: large negative value. Light colors: small values.


In [ ]:
import seaborn as sns      # graphics packages
fig, ax = plt.subplots()
sns.heatmap(g_std, linewidths=.5, ax=ax)
ax.set_yticklabels(g_std.index, visible=False);

Comment. This needs some work on the dates.

Questions

  • What would you change in this picture?
  • What did you learn?
  • What else would you like to know?
  • Where does the data come from?

The following two lines reset the graphics to the pyplot standard. If we ignore them, we get the seaborn standard, which most people think looks better. If you don't have seaborn installed, none of this matters.


In [ ]:
# reset graphics (or not) 
#mpl.rcParams.update(mpl.rcParamsDefault)
#%matplotlib inline

In [ ]:

Example: Employment rates across countries

Not long ago, the US had more people working (a higher fraction of the adult population) than many other developed countries. Over the past 15 years, things have flipped. The deep question is why, but here we simply report what we know.

The key variables are the employment rate (the fraction of people aged 25-54 who are employed) and the participation rate (the fraction who are either working or unemployed). They're similar, but the latter irons out some of the cyclical fluctuations. We get the data from FRED, but FRED gets it from the OECD's Main Economic Indicators, which covers mostly rich countries.
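As a reminder of how the two rates differ, here is the arithmetic on hypothetical counts (invented for illustration, not taken from the OECD data):

```python
# hypothetical counts for a population aged 25-54
population = 1000
employed = 780
unemployed = 40          # not working, but actively looking

employment_rate = 100 * employed / population                   # percent
participation_rate = 100 * (employed + unemployed) / population

print('employment rate:   ', employment_rate, 'percent')
print('participation rate:', participation_rate, 'percent')
```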


In [ ]:
# countries = AU, CA, CH, DE, ES, EU, EZ, FR, GB, JP, KR, NL, O1 (OECD), SE, US, ZA 
countries = ['CA', 'DE', 'GB', 'JP', 'US']
emcodes = ['LREM25TT' + c + 'Q156S' for c in countries]
lrcodes = ['LRAC25TT' + c + 'Q156S' for c in countries]

start_date = dt.datetime(2000, 1, 1)
em = data.DataReader(emcodes, 'fred', start_date)
lr = data.DataReader(lrcodes, 'fred', start_date)
em.columns = countries
lr.columns = countries
em.head(3)

In [ ]:
def emplot(em, title):
    fig, ax = plt.subplots()
    em.plot(ax=ax, lw=2)
    ax.set_title(title, fontsize=14, loc='left')
    ax.set_xlabel('')
    ax.set_ylabel('Percent')
    ax.legend(loc='best', fontsize=8, handlelength=2, labelspacing=0.1)
    
emplot(em, 'Employment rates, age 25-54')
emplot(lr, 'Participation rates, age 25-54')

Questions

  • What would you change in this picture?
  • What did you learn?
  • What else would you like to know?
  • Where does the data come from?

In [ ]:



Example: Government debt

One of the traditional ways for a country to get into trouble is to issue so much debt that investors worry about getting paid back. How much is that? Hard to say, but we'll look at some debt numbers from the IMF's World Economic Outlook or WEO, a popular source of international data on debt, deficits, and other macroeconomic indicators. We use numbers for the ratio of government debt to GDP, a standard indicator of public indebtedness.

Given recent events, we pay special attention to Argentina and Greece.


In [ ]:
%%time 

url = 'http://www.imf.org/external/pubs/ft/weo/2016/02/weodata/WEOOct2016all.xls'
# despite the .xls extension, this file is tab-delimited text, so read_csv works
weo = pd.read_csv(url, sep='\t', thousands=',', na_values=['n/a', '--']) 
weo.shape
#weo

In [ ]:
list(weo.columns[:12])     # names of the first 12 columns

In [ ]:
country_guide = weo[['ISO', 'Country']].drop_duplicates().set_index('ISO')

variable_guide = weo[['WEO Subject Code', 'Subject Descriptor', 'Subject Notes']].drop_duplicates().set_index('WEO Subject Code')
variable_guide.head(3)

In [ ]:
variables = ['GGXWDG_NGDP']
countries = ['ARG', 'DEU', 'FRA', 'GRC', 'USA']
debt = weo[weo['ISO'].isin(countries) & weo['WEO Subject Code'].isin(variables)]
some = [3] + list(range(9, 44))       # country name plus the year columns
debt = debt.iloc[:, some].set_index('Country').T.dropna()
debt.head(3)

In [ ]:
fig, ax = plt.subplots()
debt.plot(ax=ax, lw=2)
ax.set_title('Ratio of government debt to GDP', fontsize=14, loc='left')
ax.set_ylabel('Government Debt (Percent of GDP)')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)

Questions

  • What would you change in this picture?
  • What did you learn?
  • What else would you like to know?
  • Where does the data come from?

In [ ]:

Example: Options data

WARNING: This code is currently broken in pandas_datareader, so we will skip this for now

The Pandas package is spinning off the remote data access tools, including FRED, World Bank, and so on. Here we use the new tool, the Pandas DataReader. This requires installation: on the command line, enter

conda install pandas-datareader html5lib

and follow instructions.

You can run command-line commands from within Jupyter by putting a ! at the beginning of the line. To run the command above, enter the following in a code cell:

!conda install -y pandas-datareader html5lib

We'll use the Options tool, which reads in prices of stock options from Yahoo Finance. The options come with a variety of contract specifications:

  • Type. Calls are options to buy, puts are options to sell.
  • Expiration. The date when the option expires.
  • Underlying. The asset you have a right to buy or sell. In this data, it's a stock.
  • Strike price. The price at which you are able to buy (calls) or sell (puts) the underlying.

We give an illustration using options on Amazon stock.

The first version of this module was written by our very own Spencer Lyon.
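Before looking at prices, it may help to see what a call or put is worth at expiration. This is a minimal sketch with invented numbers, not part of the original data pull:

```python
def call_payoff(spot, strike):
    """Option to buy at the strike: worth spot - strike when that's positive."""
    return max(spot - strike, 0.0)

def put_payoff(spot, strike):
    """Option to sell at the strike: worth strike - spot when that's positive."""
    return max(strike - spot, 0.0)

print(call_payoff(110, 100))   # in the money by 10
print(put_payoff(110, 100))    # worthless at expiration
```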

from pandas_datareader.data import Options

"""
supply ticker, get option prices 
"""
ticker = 'amzn'
amzn = Options(ticker, 'yahoo')
data = amzn.get_all_data()

exp = amzn.expiry_dates     # get expiration dates 
exp
# get option prices 
cols = [0, 1, 2, 7]   
opexp = 2
print('Options with expiry', exp[opexp])
calls = amzn.get_call_data(expiry=exp[opexp])[cols]
puts  = amzn.get_put_data(expiry=exp[opexp])[cols]

calls.head()
# drop extra index levels 
calls = calls.reset_index(level=[1,2,3], drop=True)
puts  = puts.reset_index(level=[1,2,3], drop=True)

# cut off extremes 
spot = amzn.underlying_price
print('Spot price', spot)
delta = 0.25
calls  = calls[(calls.index >= (1-delta)*spot) & (calls.index <= (1+delta)*spot)]
puts  = puts[(puts.index >= (1-delta)*spot) & (puts.index <= (1+delta)*spot)]

# compute avg of bid and ask  
calls['Mid'] = (calls['Bid'] + calls['Ask'])/2
puts['Mid']  = (puts['Bid'] + puts['Ask'])/2
# plot put and call prices
which = 'Mid'

fig, ax = plt.subplots()
calls[which].plot(lw=2, color='blue', alpha=0.6, ax=ax)
puts[which].plot(lw=2, color='magenta', alpha=0.6, ax=ax)
ymin, ymax = ax.get_ylim()
xmin, xmax = ax.get_xlim()
ax.set_title('Prices of Amazon options (bid-ask avg)', fontsize=14, loc='left')
ax.set_ylabel('Option Prices')
ax.set_xlabel('Strike Price')
ax.vlines(x=spot, ymin=ymin, ymax=ymax, linestyle='dashed')
ax.text(1.01*spot, 0.9*ymax, 'Stock price', horizontalalignment='left')
ax.text(0.875*spot, 0.13*ymax, 'Put prices', horizontalalignment='right', color='m')
ax.text(1.125*spot, 0.13*ymax, 'Call prices', horizontalalignment='left', color='b')

Questions

  • What would you change in this picture?
  • What did you learn?
  • What else would you like to know?
  • Where does the data come from?

In [ ]:

Example: Japan's aging population

Data from the UN's Population Division. One of our favorite quotes:

Last year, for the first time, sales of adult diapers in Japan exceeded those for babies. 

This is what the numbers look like. They're UN projections, what they call the "medium variant."


In [ ]:
url1 = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url2 = '1_Indicators%20(Standard)/EXCEL_FILES/1_Population/'
url3 = 'WPP2017_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.XLSX'
url = url1 + url2 + url3 

cols = [2, 4, 5] + list(range(6,28))
#est = pd.read_excel(url, sheet_name=0, skiprows=16, usecols=cols, na_values=['…'])
prj = pd.read_excel(url, sheet_name=1, skiprows=16, usecols=cols, na_values=['…'])

"""
for later:  change cols for the two sources, rename 80+ to 80-84, then concat 
#pop = pd.concat([est, prj], axis=0, join='outer')      
"""
pop = prj

In [ ]:
pop.head()

In [ ]:
# rename some variables 
pop = pop.rename(columns={'Reference date (as of 1 July)': 'Year', 
                          'Region, subregion, country or area *': 'Country', 
                          'Country code': 'Code'})
# select Japan and years 
countries = ['Japan']
years     = [2015, 2025, 2035, 2045, 2055, 2065]
pop = pop[pop['Country'].isin(countries) & pop['Year'].isin(years)]
pop = pop.drop(['Country', 'Code'], axis=1)
pop.head()

In [ ]:
pop = pop.set_index('Year').T
pop.head()

In [ ]:
# quick look: all years on one plot 
fig, ax = plt.subplots()
pop.plot(ax=ax)

In [ ]:
ax = pop.plot(kind='bar',  
         alpha=0.5, subplots=True, sharey=True, figsize=(6, 8))
for axnum in range(len(ax)):  
    ax[axnum].set_title('')
    ax[axnum].legend(loc='upper left', prop={'size':10})
    
ax[0].set_title('Japanese population by age', fontsize=14, loc='left')

Questions

  • What would you change in this picture?
  • What did you learn?
  • What else would you like to know?
  • Where does the data come from?

In [ ]:

Example: Birth rates

We might wonder: why is the population falling in Japan? And in other countries? One reason is that birth rates are falling. Demographers call this fertility. Here we look at fertility using the same UN source as in the previous example. We look at two variables: total fertility and fertility by age of mother. In both cases we explore the numbers to date, but the same files contain projections of future fertility.


In [ ]:
%%time

# fertility overall 
uft  = 'http://esa.un.org/unpd/wpp/DVD/Files/'
uft += '1_Indicators%20(Standard)/EXCEL_FILES/'
uft += '2_Fertility/WPP2017_FERT_F04_TOTAL_FERTILITY.XLSX'

cols = [2, 4] + list(range(5,18))
ftot = pd.read_excel(uft, sheet_name=0, skiprows=16, usecols=cols, na_values=['…'])

# fertility by age 
ufa  = 'http://esa.un.org/unpd/wpp/DVD/Files/'
ufa += '1_Indicators%20(Standard)/EXCEL_FILES/'
ufa += '2_Fertility/WPP2017_FERT_F07_AGE_SPECIFIC_FERTILITY.XLSX'

#cols = [2, 4, 5] + list(range(6,13))
#fage = pd.read_excel(ufa, sheet_name=0, skiprows=16, usecols=cols, na_values=['…'])

In [ ]:
ftot.head(3) #[list(range(10))]

In [ ]:
# rename some variables 
ftot = ftot.rename(columns={'Region, subregion, country or area *': 'Country', 
                            'Country code': 'Code'})

In [ ]:
# drop code 
f = ftot.drop(['Code'], axis=1)

# select countries 
countries = ['China', 'Japan', 'Germany', 'United States of America']
f = f[f['Country'].isin(countries)]

# shape
f = f.set_index('Country').T 
f = f.rename(columns={'United States of America': 'United States'})
f.tail()

In [ ]:
fig, ax = plt.subplots()
f.plot(ax=ax, kind='line', alpha=0.5, lw=3, figsize=(6.5, 4))
ax.set_title('Fertility rate (births per woman, lifetime)', fontsize=14, loc='left')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
ax.set_ylim(ymin=0)
ax.hlines(2.1, -1, 13, linestyles='dashed')
ax.text(8.5, 2.4, 'Replacement = 2.1')

Questions

  • What would you change in this picture?
  • What did you learn?
  • What else would you like to know?
  • Where does the data come from?

In [ ]:

Example: US government bond yields

People often refer to interest rates moving up and down, but in fact yields (as we call them) move up and down differently at different maturities. From 2010 to 2015, for example, yields on bonds with maturities under 2 years were essentially fixed at zero, but yields at longer maturities moved up and down quite a bit.

One challenge with bond yields is getting good data. Bonds are typically traded over the counter, and those bonds differ in maturity, coupon, and often other features. Analysts often focus on the interest rates of pure discount "zero-coupon" bonds, which they infer from the prices of coupon bonds, since those are what is generally traded. This isn't something to go into unless you have a strong interest. Suffice it to say that we have taken these yields from estimates supplied by the Fed.
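As a sketch of what a zero-coupon yield means, here is the arithmetic with invented numbers, assuming annual compounding (the Fed estimates use their own conventions):

```python
def zero_coupon_yield(price, face, years):
    """Annually compounded yield y solving price = face / (1 + y)**years."""
    return (face / price) ** (1.0 / years) - 1.0

# a hypothetical 2-year zero selling at 90 per 100 of face value
y = zero_coupon_yield(price=90.0, face=100.0, years=2)
print(round(100 * y, 2), 'percent')
```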

This is still a work in progress, but our goal is to produce a movie of bond yields over time.


In [ ]:
url = 'http://pages.stern.nyu.edu/~dbackus/Data/GSW_nominal_yields.csv'
y = pd.read_csv(url, index_col=0, parse_dates=True)
y.tail()

In [ ]:
# compute mean yields
ybar = y.mean(axis=0)
ystd = y.std(axis=0)
maturities = list(range(1,122))
ybar.index = maturities
ystd.index = maturities

In [ ]:
fig, ax = plt.subplots()
ybar.plot(ax=ax)
ystd.plot(ax=ax)
ax.set_xlabel('Maturity in Months')
ax.set_ylabel('Mean and Std Dev of Yield (Annual Percent)')
ax.set_ylim(0)
ax.legend(['Mean', 'Std. dev.'])

Example: Starbucks revenue

This data was given to us by one of our students.


In [ ]:
url1 = "https://github.com/NYUDataBootcamp/Materials/blob/master/"
url2 = "Data/Starbucks_Revenue_Worldwide.xlsx?raw=true"
url = url1 + url2
starbucks = pd.read_excel(url, 
                          sheet_name=1,
                          skiprows=4,
                          usecols=[1,2])
starbucks

In [ ]:
starbucks.rename(columns={"Unnamed: 0": "year", "data": "revenue"}, inplace=True)
starbucks

In [ ]:
starbucks.set_index("year", inplace=True)
starbucks

In [ ]:
mpl.rcParams.update(mpl.rcParamsDefault)
starbucks.plot()

In [ ]:
starbucks_growth = starbucks.pct_change()
starbucks_growth

In [ ]:
starbucks_growth.plot()
