Data Bootcamp: Examples

Most of what we do in this course is find some data that interests us, read it into Python, organize it in a convenient form, and produce a graph. The details vary, but that's the general plan. The best-case scenario: The graph tells us something we don't know, and points to other things we'd like to follow up on. As for the Python code: You'll be able to do all this and more by the end of the term.

Most of this runs with packages that come pre-installed with the Anaconda distribution. The exceptions are the options example, which uses the pandas-datareader package, and the heatmap we produced for economic indicators using the seaborn package. We'll talk more about this later, but you can install them from the terminal or command line with conda install package.

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

Preliminaries

The tradition is to put import statements at the top and be done with them. But since we sometimes run sections of code separately, we've repeated some of them below. That's ok, a second import is redundant but does only aesthetic harm.



In [7]:

    
# import packages 
import pandas as pd                   # data management
import matplotlib.pyplot as plt       # graphics 
import matplotlib as mpl              # graphics parameters
import numpy as np                    # numerical calculations 

# IPython command, puts plots in notebook 
%matplotlib inline

# check Python version 
import datetime as dt 
import sys
print('Today is', dt.date.today())
print('What version of Python are we running? \n', sys.version, sep='')









    



Today is 2016-03-22
What version of Python are we running? 
3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]

Example: US economic conditions

We see a lot of variation in GDP growth rates in all countries. It's not obvious why, but it's a fact of life, and one that investors and other business analysts track closely. Basic numbers like GDP only come out quarterly, and with at least a one month lag, so we're often in the position of not knowing how the economy is doing now, much less how it will do in the future. To get a clearer sense of current conditions, analysts typically look at a wide range of indicators. Many of these indicators are available more quickly than GDP. Since they typically move up and down with GDP, we can use their movements to infer what GDP might be doing.

Of the hundreds of economic indicators, some of the most popular (with their FRED codes) are

INDPRO: industrial production
PAYEMS: nonfarm employment
AWHMAN: average weekly hours worked in manufacturing
PERMIT: permits for new housing
NAPM: purchasing managers index
UMCSENT: consumer sentiment

All of them are monthly. You can find more about this kind of thing in our book, chapter 11.



In [8]:

    
# get data from FRED
import pandas as pd 
import pandas.io.data as web           # web interface with FRED
import datetime as dt                  # handles dates 

# get data 
indicators = ['INDPRO', 'PAYEMS', 'AWHMAN', 'PERMIT', 'NAPM', 'UMCSENT']
start_date = dt.datetime(1970, 1, 1)
inds = web.DataReader(indicators, "fred", start_date)
end = inds.index[-1]

# yoy growth rates 
g = inds.pct_change(periods=12).dropna()
# standardize
g_std = (g - g.mean()) / g.std()
gs = g_std

# correlations 
g_std.corr()



In [9]:

    
# plot 
fig, ax = plt.subplots()
g_std.plot(ax=ax) 
ax.set_title('Economic indicators', fontsize=14, loc='left')
ax.set_ylabel('Standard deviations from mean')
ax.set_xlabel('')
ax.hlines(y=0, xmin=start_date, xmax=end, linestyles='dashed')
ax.legend().set_visible(False)



In [10]:

    
# focus on recent past 
recent_date = dt.datetime(2011, 1, 1)
g_std = g_std[g_std.index>=recent_date]



In [11]:

    
fig, ax = plt.subplots()
g_std.plot(ax=ax)
ax.set_title('Zoom in on recent past', fontsize=14, loc='left')
ax.set_ylabel('Standard deviations from mean')
ax.set_xlabel('')
ax.hlines(y=0, xmin=recent_date, xmax=end, linestyles='dashed')
ax.legend(loc='upper left', fontsize=10, handlelength=2, labelspacing=0.15)









    Out[11]:





<matplotlib.legend.Legend at 0xaff32b0>

Comment. Chase likes heatmaps. The one that follows needs work, but it's an attempt to reproduce the previous graph as a heatmap, where the value of the variable is represented by the color and its intensity. Dark red: large postive value. Dark blue: large negative value. Light colors: small values.



In [12]:

    
import seaborn as sns      # graphics packages
fig, ax = plt.subplots()
sns.heatmap(g_std, linewidths=.5, ax=ax)
ax.set_yticklabels(g_std.index, visible=False)









    Out[12]:





[<matplotlib.text.Text at 0xc194668>,
 <matplotlib.text.Text at 0xc16b780>,
 <matplotlib.text.Text at 0xc1c9ef0>,
 <matplotlib.text.Text at 0xc1d49b0>,
 <matplotlib.text.Text at 0xc1d8470>,
 <matplotlib.text.Text at 0xc1d8ef0>,
 <matplotlib.text.Text at 0xc1db9b0>,
 <matplotlib.text.Text at 0xc1e0470>,
 <matplotlib.text.Text at 0xc1e0ef0>,
 <matplotlib.text.Text at 0xc1e59b0>,
 <matplotlib.text.Text at 0xc1e7470>,
 <matplotlib.text.Text at 0xc1e7ef0>,
 <matplotlib.text.Text at 0xc1ec9b0>,
 <matplotlib.text.Text at 0xc1f0470>,
 <matplotlib.text.Text at 0xc1f0ef0>,
 <matplotlib.text.Text at 0xc1f59b0>,
 <matplotlib.text.Text at 0xc1f8470>,
 <matplotlib.text.Text at 0xc1f8ef0>,
 <matplotlib.text.Text at 0xc1fc9b0>,
 <matplotlib.text.Text at 0xc200470>,
 <matplotlib.text.Text at 0xc200ef0>,
 <matplotlib.text.Text at 0xc2049b0>,
 <matplotlib.text.Text at 0xc208470>,
 <matplotlib.text.Text at 0xc208ef0>,
 <matplotlib.text.Text at 0xc20c9b0>,
 <matplotlib.text.Text at 0xc20f470>,
 <matplotlib.text.Text at 0xc20fef0>,
 <matplotlib.text.Text at 0xc2139b0>,
 <matplotlib.text.Text at 0xc218470>,
 <matplotlib.text.Text at 0xc218ef0>,
 <matplotlib.text.Text at 0xc21b9b0>,
 <matplotlib.text.Text at 0xc199320>,
 <matplotlib.text.Text at 0xc190f98>,
 <matplotlib.text.Text at 0xc162cc0>,
 <matplotlib.text.Text at 0xc21e1d0>,
 <matplotlib.text.Text at 0xc21ea90>,
 <matplotlib.text.Text at 0xc222470>,
 <matplotlib.text.Text at 0xc222ef0>,
 <matplotlib.text.Text at 0xc2279b0>,
 <matplotlib.text.Text at 0xc229470>,
 <matplotlib.text.Text at 0xc229ef0>,
 <matplotlib.text.Text at 0xc22e9b0>,
 <matplotlib.text.Text at 0xc231470>,
 <matplotlib.text.Text at 0xc231ef0>,
 <matplotlib.text.Text at 0xc2359b0>,
 <matplotlib.text.Text at 0xc23a470>,
 <matplotlib.text.Text at 0xc23aef0>,
 <matplotlib.text.Text at 0xc23d9b0>,
 <matplotlib.text.Text at 0xc240470>,
 <matplotlib.text.Text at 0xc240ef0>,
 <matplotlib.text.Text at 0xc2459b0>,
 <matplotlib.text.Text at 0xc249470>,
 <matplotlib.text.Text at 0xc249ef0>,
 <matplotlib.text.Text at 0xc24c9b0>,
 <matplotlib.text.Text at 0xc251470>,
 <matplotlib.text.Text at 0xc251ef0>,
 <matplotlib.text.Text at 0xc2549b0>,
 <matplotlib.text.Text at 0xc258470>,
 <matplotlib.text.Text at 0xc258ef0>,
 <matplotlib.text.Text at 0xc25e9b0>,
 <matplotlib.text.Text at 0xc260470>,
 <matplotlib.text.Text at 0xc260ef0>]

Comment. This needs some work on the dates.

Questions

What would you change in this picture?
What did you learn?
What else would you like to know?
Where does the data come from?

The following two lines reset the graphics to the pyplot standard. If we ignore them, we get the seaborn standard, which most people think looks better. If you don't have seaborn installed, none of this matters.



In [7]:

    
# reset graphics (or not) 
#mpl.rcParams.update(mpl.rcParamsDefault)
#%matplotlib inline



In [ ]:

Example: Employment trends

Not long ago, the US had more people working (a higher fraction of the adult population) than many other developed countries. Over the past 15 years things have flipped. The deep question is why, but here we simply report what we know.

The key variables are the employment rate (fraction of people aged 25-54) and the participation rate (fraction either working or unemployed). They're similar but the latter irons out some of the cyclical fluctuations. We get the data from FRED, but they get it from the OECD's Main Economic Indicators, which covers mostly rich countries.



In [5]:

    
import pandas as pd                   # data management
import datetime as dt                 # time and date tool 
import pandas.io.data as web          # data interface (soon to be moved, see warning)









    



C:\Users\dbackus\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\data.py:33: FutureWarning: 
The pandas.io.data module is moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.
  FutureWarning)



In [6]:

    
# countries = AU, CA, CH, DE, ES, EU, EZ, FR, GB, JP, KR, NL, O1 (OECD), SE, US, ZA 
countries = ['CA', 'DE', 'GB', 'JP', 'US']
emcodes = ['LREM25TT' + c + 'Q156S' for c in countries]
lrcodes = ['LRAC25TT' + c + 'Q156S' for c in countries]

start_date = dt.datetime(2000, 1, 1)
em = web.DataReader(emcodes, 'fred', start_date)
lr = web.DataReader(lrcodes, 'fred', start_date)
em.columns = countries
lr.columns = countries
em.head(3)



In [10]:

    
def emplot(em, title):
    fig, ax = plt.subplots()
    em.plot(ax=ax, lw=2)
    ax.set_title(title, fontsize=14, loc='left')
    ax.set_xlabel('')
    ax.set_ylabel('Percent')
    ax.legend(loc='best', fontsize=8, handlelength=2, labelspacing=0.1)
    
emplot(em, 'Employment rates, age 25-54')
emplot(lr, 'Participation rates, age 25-54')

Questions

What would you change in this picture?
What did you learn?
What else would you like to know?
Where does the data come from?



In [ ]:



In [ ]:

Example: Government debt

One of the traditional ways for a country to get into trouble is to issue so much debt that investors worry about getting paid back. How much is that? Hard to say, but we'll look at some debt numbers from the IMF's World Economic Outlook or WEO, a popular source of international data on debt, deficits, and other macroeconomic indicators. We use numbers for the ratio of government debt to GDP, a standard indicator of public indebtedness.

Given recent events, we pay special attention to Argentina and Greece.



In [11]:

    
%%time 
import pandas as pd
import matplotlib.pyplot as plt

url1 = 'http://www.imf.org/external/pubs/ft/weo/2015/01/weodata/'
url2 = 'WEOApr2015all.xls'
url = url1 + url2 
weo = pd.read_csv(url, sep='\t', thousands=',', na_values=['n/a', '--']) 
weo.shape









    



Wall time: 2.99 s



In [12]:

    
list(weo[list(range(12))])









    Out[12]:





['WEO Country Code',
 'ISO',
 'WEO Subject Code',
 'Country',
 'Subject Descriptor',
 'Subject Notes',
 'Units',
 'Scale',
 'Country/Series-specific Notes',
 '1980',
 '1981',
 '1982']



In [13]:

    
country_guide = weo[['ISO', 'Country']].drop_duplicates().set_index('ISO')

variable_guide = weo[['WEO Subject Code', 'Subject Descriptor', 'Subject Notes']].drop_duplicates().set_index('WEO Subject Code')
variable_guide.head(3)









    Out[13]:






  
    
      
      Subject Descriptor
      Subject Notes
    
    
      WEO Subject Code
      
      
    
  
  
    
      NGDP_R
      Gross domestic product, constant prices
      Expressed in billions of national currency uni...
    
    
      NGDP_RPCH
      Gross domestic product, constant prices
      Annual percentages of constant price GDP are y...
    
    
      NGDP
      Gross domestic product, current prices
      Expressed in billions of national currency uni...



In [14]:

    
variables = ['GGXWDG_NGDP']
countries = ['ARG', 'DEU', 'FRA', 'GRC', 'USA']
debt = weo[weo['ISO'].isin(countries) & weo['WEO Subject Code'].isin(variables)]
some = [3] + list(range(9,44))
debt = debt[some].set_index('Country').T.dropna()
debt.head(3)









    Out[14]:






  
    
      Country
      Argentina
      France
      Germany
      Greece
      United States
    
  
  
    
      2001
      44.886
      57.943
      57.529
      99.922
      53.022
    
    
      2002
      137.512
      59.840
      59.169
      98.114
      55.379
    
    
      2003
      116.525
      63.947
      62.920
      94.094
      58.516



In [15]:

    
fig, ax = plt.subplots()
debt.plot(ax=ax, lw=2)
ax.set_title('Ratio of government debt to GDP', fontsize=14, loc='left')
ax.set_ylabel('Government Debt (Percent of GDP)')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)









    Out[15]:





<matplotlib.legend.Legend at 0xc02a908>

Questions

What would you change in this picture?
What did you learn?
What else would you like to know?
Where does the data come from?



In [ ]:

Example: Options data

The Pandas package is spinning off the remote data access tools, including FRED, World Bank, and so on. Here we use the new tool, the Pandas DataReader. This requires installation: on the command line, enter

conda install pandas-datareader html5lib

and follow instructions.

You can run commands on the command line from within jupyter by putting a ! at the beginning of the line. To run the commands above you can do enter the following in a code cell:

!conda install -y pandas-datareader html5lib

We'll use the Options tool, which reads in prices of stock options from Yahoo Finance. The options come with a variety of contract specifications:

Type. Calls are options to buy, puts are options to sell.
Expiration. The date when the option expires.
Underlying. The asset you have a right to buy or sell. In this data, it's a stock.
Strike price. The price at which you are able to buy (calls) or sell (puts) the underlying.

We give an illustration using options on Amazon stock.

The first version of this module was written by our very own Spencer Lyon.



In [2]:

    
import pandas_datareader.yahoo.options as yho

"""
supply ticker, get option prices 
"""
ticker = 'amzn'
otk = yho.Options(ticker)
exp = otk.expiry_dates     # get expiration dates 
exp









    Out[2]:





[datetime.date(2016, 3, 24),
 datetime.date(2016, 4, 1),
 datetime.date(2016, 4, 8),
 datetime.date(2016, 4, 15),
 datetime.date(2016, 4, 22),
 datetime.date(2016, 4, 29),
 datetime.date(2016, 5, 20),
 datetime.date(2016, 6, 17),
 datetime.date(2016, 7, 15),
 datetime.date(2016, 10, 21),
 datetime.date(2017, 1, 20),
 datetime.date(2018, 1, 19)]



In [3]:

    
# get option prices 
cols = [0, 1, 2, 7]   
opexp = 2
print('Options with expiry', exp[opexp])
calls = otk.get_call_data(expiry=exp[opexp])[cols]
puts  = otk.get_put_data(expiry=exp[opexp])[cols]

calls.head()









    



Options with expiry 2016-04-08






    Out[3]:






  
    
      
      
      
      
      Last
      Bid
      Ask
      IV
    
    
      Strike
      Expiry
      Type
      Symbol
      
      
      
      
    
  
  
    
      385
      2016-04-08
      call
      AMZN160408C00385000
      0
      171.90
      175.95
      86.47%
    
    
      390
      2016-04-08
      call
      AMZN160408C00390000
      0
      167.15
      170.95
      83.89%
    
    
      395
      2016-04-08
      call
      AMZN160408C00395000
      0
      162.00
      166.00
      82.10%
    
    
      400
      2016-04-08
      call
      AMZN160408C00400000
      0
      157.30
      161.20
      82.32%
    
    
      405
      2016-04-08
      call
      AMZN160408C00405000
      0
      152.05
      156.15
      79.08%



In [4]:

    
# drop extra index levels 
calls = calls.reset_index(level=[1,2,3], drop=True)
puts  = puts.reset_index(level=[1,2,3], drop=True)

# cut off extremes 
spot = otk.underlying_price
print('Spot price', spot)
delta = 0.25
calls  = calls[(calls.index >= (1-delta)*spot) & (calls.index <= (1+delta)*spot)]
puts  = puts[(puts.index >= (1-delta)*spot) & (puts.index <= (1+delta)*spot)]

# compute avg of bid and ask  
calls['Mid'] = (calls['Bid'] + calls['Ask'])/2
puts['Mid']  = (puts['Bid'] + puts['Ask'])/2









    



Spot price 559.9001



In [1]:

    
# plot put and call prices
which = 'Mid'

fig, ax = plt.subplots()
calls[which].plot(lw=2, color='blue', alpha=0.6, ax=ax)
puts[which].plot(lw=2, color='magenta', alpha=0.6, ax=ax)
ymin, ymax = ax.get_ylim()
xmin, xmax = ax.get_xlim()
ax.set_title('Prices of Amazon options (bid-ask avg)', fontsize=14, loc='left')
ax.set_ylabel('Option Prices')
ax.set_xlabel('Strike Price')
ax.vlines(x=spot, ymin=ymin, ymax=ymax, linestyle='dashed')
ax.text(1.01*spot, 0.9*ymax, 'Stock price', horizontalalignment='left')
ax.text(0.875*spot, 0.13*ymax, 'Put prices', horizontalalignment='right', color='m')
ax.text(1.125*spot, 0.13*ymax, 'Call prices', horizontalalignment='left', color='b')









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-bf34046d1176> in <module>()
      2 which = 'Mid'
      3 
----> 4 fig, ax = plt.subplots()
      5 calls[which].plot(lw=2, color='blue', alpha=0.6, ax=ax)
      6 puts[which].plot(lw=2, color='magenta', alpha=0.6, ax=ax)

NameError: name 'plt' is not defined

Questions

What would you change in this picture?
What did you learn?
What else would you like to know?
Where does the data come from?



In [ ]:

Example: Japan's aging population

Data from the UN's Population Division. One of our favorite quotes:

Last year, for the first time, sales of adult diapers in Japan exceeded those for babies.

This is what the numbers look like. They're UN projections, what they call the "medium variant."



In [28]:

    
url1 = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url2 = '1_Indicators%20(Standard)/EXCEL_FILES/1_Population/'
url3 = 'WPP2015_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.XLS'
url = url1 + url2 + url3 

cols = [2, 4, 5] + list(range(6,28))
#est = pd.read_excel(url, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])
prj = pd.read_excel(url, sheetname=1, skiprows=16, parse_cols=cols, na_values=['…'])

"""
for later:  change cols for the two sources, rename 80+ to 80-84, then concat 
#pop = pd.concat([est, prj], axis=0, join='outer')      
"""
pop = prj



In [30]:

    
pop.head()









    Out[30]:






  
    
      
      Major area, region, country or area *
      Country code
      Reference date (as of 1 July)
      0-4
      5-9
      10-14
      15-19
      20-24
      25-29
      30-34
      ...
      55-59
      60-64
      65-69
      70-74
      75-79
      80-84
      85-89
      90-94
      95-99
      100+
    
  
  
    
      0
      WORLD
      900
      2015
      670928.185
      637448.895
      607431.299
      590069.337
      603508.555
      609793.207
      551085.426
      ...
      339691.467
      292726.792
      215047.151
      153207.026
      114652.312
      71449.698
      37062.196
      13388.529
      2921.852
      450.895
    
    
      1
      WORLD
      900
      2020
      677599.590
      664282.610
      634568.409
      604322.921
      585831.926
      599278.263
      604808.769
      ...
      387339.470
      320927.366
      267874.842
      187441.892
      123355.620
      81851.457
      42700.991
      16937.683
      4242.605
      605.465
    
    
      2
      WORLD
      900
      2025
      673174.914
      671929.973
      661684.410
      631509.113
      600050.925
      581681.875
      594598.480
      ...
      428166.199
      367325.750
      295011.725
      234797.844
      152989.596
      89822.147
      49847.989
      20116.420
      5538.923
      905.182
    
    
      3
      WORLD
      900
      2030
      672121.017
      667926.121
      669744.224
      658942.513
      627343.133
      595871.013
      577192.328
      ...
      450091.908
      407564.034
      339528.512
      260426.171
      193066.277
      113475.561
      56221.915
      24041.609
      6836.461
      1244.851
    
    
      4
      WORLD
      900
      2035
      678271.589
      667353.285
      665875.094
      667248.100
      655034.545
      623315.619
      591598.185
      ...
      456022.842
      429343.580
      378601.876
      301893.804
      216249.259
      144393.943
      72643.007
      28074.222
      8397.572
      1617.797
    
  

5 rows × 24 columns



In [31]:

    
# rename some variables 
pop = pop.rename(columns={'Reference date (as of 1 July)': 'Year', 
                          'Major area, region, country or area *': 'Country', 
                          'Country code': 'Code'})
# select Japan and years 
countries = ['Japan']
years     = [2015, 2025, 2035, 2045, 2055, 2065]
pop = pop[pop['Country'].isin(countries) & pop['Year'].isin(years)]
pop = pop.drop(['Country', 'Code'], axis=1)
pop.head()









    Out[31]:






  
    
      
      Year
      0-4
      5-9
      10-14
      15-19
      20-24
      25-29
      30-34
      35-39
      40-44
      ...
      55-59
      60-64
      65-69
      70-74
      75-79
      80-84
      85-89
      90-94
      95-99
      100+
    
  
  
    
      1458
      2015
      5269.038
      5398.973
      5603.638
      5960.784
      6111.768
      6843.421
      7455.687
      8345.753
      9689.865
      ...
      7535.334
      8530.749
      9452.518
      7770.410
      6297.532
      4940.325
      3115.732
      1338.774
      366.082
      60.630
    
    
      1460
      2025
      4872.732
      5086.975
      5275.897
      5425.060
      5665.781
      6033.351
      6166.461
      6868.725
      7446.336
      ...
      8425.830
      7577.456
      7111.085
      7810.061
      8212.023
      6065.908
      3985.173
      2137.536
      712.019
      115.971
    
    
      1462
      2035
      4610.562
      4720.645
      4880.967
      5114.258
      5340.457
      5502.279
      5726.044
      6067.218
      6172.067
      ...
      8129.515
      9259.518
      8021.327
      7034.014
      6319.464
      6327.227
      5517.980
      2871.710
      1029.126
      216.863
    
    
      1464
      2045
      4448.702
      4529.657
      4619.658
      4748.906
      4947.499
      5194.574
      5404.633
      5541.785
      5739.054
      ...
      6728.225
      7171.861
      7796.329
      8693.253
      7260.858
      5876.803
      4469.537
      3242.964
      1597.432
      338.945
    
    
      1466
      2055
      4271.907
      4371.016
      4458.145
      4557.117
      4685.414
      4829.637
      5013.420
      5236.482
      5422.259
      ...
      5971.177
      5985.655
      6493.459
      6798.299
      7167.066
      7456.047
      5368.363
      3233.182
      1437.057
      462.541
    
  

5 rows × 22 columns



In [32]:

    
pop = pop.set_index('Year').T
pop.head()



In [33]:

    
ax = pop.plot(kind='bar',  
         alpha=0.5, subplots=True, sharey=True, figsize=(6, 8))
for axnum in range(len(ax)):  
    ax[axnum].set_title('')
    
ax[0].set_title('Japanese population by age', fontsize=14, loc='left')









    Out[33]:





<matplotlib.text.Text at 0x14d400b8>

Questions

What would you change in this picture?
What did you learn?
What else would you like to know?
Where does the data come from?



In [ ]:

Example: Birth rates

We might wonder, why is the population falling in Japan? Other countries? Well, one reason is that birth rates are falling. Demographers call this fertility. Here we look at the fertility using the same UN source as the previous example. We look at two variables: total fertility and fertility by age of mother. In both cases we explore the numbers to date, but the same files contain projections of future fertility.



In [34]:

    
%%time
import pandas as pd 

# fertility overall 
uft  = 'http://esa.un.org/unpd/wpp/DVD/Files/'
uft += '1_Indicators%20(Standard)/EXCEL_FILES/'
uft += '2_Fertility/WPP2015_FERT_F04_TOTAL_FERTILITY.XLS'

cols = [2, 4] + list(range(5,18))
ftot = pd.read_excel(uft, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])

# fertility by age 
ufa  = 'http://esa.un.org/unpd/wpp/DVD/Files/'
ufa += '1_Indicators%20(Standard)/EXCEL_FILES/'
ufa += '2_Fertility/WPP2015_FERT_F07_AGE_SPECIFIC_FERTILITY.XLS'

#cols = [2, 4, 5] + list(range(6,13))
#fage = pd.read_excel(ufa, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])









    



Wall time: 304 ms



In [35]:

    
ftot.head(3)[list(range(10))]









    Out[35]:






  
    
      
      Major area, region, country or area *
      Country code
      1950-1955
      1955-1960
      1960-1965
      1965-1970
      1970-1975
      1975-1980
      1980-1985
      1985-1990
    
  
  
    
      0
      WORLD
      900
      4.961571
      4.898665
      5.024379
      4.922202
      4.475478
      3.865195
      3.591816
      3.446811
    
    
      1
      More developed regions
      901
      2.823786
      2.807368
      2.685900
      2.387534
      2.150816
      1.917349
      1.843806
      1.807382
    
    
      2
      Less developed regions
      902
      6.075417
      5.941033
      6.129418
      6.034843
      5.416602
      4.585636
      4.167889
      3.917106



In [36]:

    
# rename some variables 
ftot = ftot.rename(columns={'Major area, region, country or area *': 'Country', 
                            'Country code': 'Code'})



In [37]:

    
# drop code 
f = ftot.drop(['Code'], axis=1)

# select countries 
countries = ['China', 'Japan', 'Germany', 'United States of America']
f = f[f['Country'].isin(countries)]

# shape
f = f.set_index('Country').T 
f = f.rename(columns={'United States of America': 'United States'})
f.tail()









    Out[37]:






  
    
      Country
      China
      Japan
      Germany
      United States
    
  
  
    
      1990-1995
      2.00
      1.4760
      1.3007
      2.0297
    
    
      1995-2000
      1.48
      1.3690
      1.3456
      1.9965
    
    
      2000-2005
      1.50
      1.2980
      1.3513
      2.0420
    
    
      2005-2010
      1.53
      1.3388
      1.3623
      2.0590
    
    
      2010-2015
      1.55
      1.3960
      1.3909
      1.8902



In [38]:

    
fig, ax = plt.subplots()
f.plot(ax=ax, kind='line', alpha=0.5, lw=3, figsize=(6.5, 4))
ax.set_title('Fertility rate (births per woman, lifetime)', fontsize=14, loc='left')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
ax.set_ylim(ymin=0)
ax.hlines(2.1, -1, 13, linestyles='dashed')
ax.text(8.5, 2.4, 'Replacement = 2.1')









    Out[38]:





<matplotlib.text.Text at 0x13f9c320>

Questions

What would you change in this picture?
What did you learn?
What else would you like to know?
Where does the data come from?



In [ ]:

Example: US government bond yields

People often refer to interest rates moving up and down, but in fact the yields (as we call them) move up and down differently at different maturities. From 2010 to 2015, for examples, yields on bonds with maturities under 2 years were essentially fixed at zero, but yields on higher maturities moved up and down quite a bit.

One challenge with bond yields is getting good data. Bonds are typically traded over the counter, and those bonds differ in maturity, coupon, and often other features. Analysts often focus on the interest rates of pure discount "zero-coupon" bonds, which they infer from prices of coupon bonds -- which are, after all, what is generally traded. This isn't something to go into unless you have a strong interest. Suffice it to say that we have taken these yields from estimates supplied by the Fed.

This is still a work in progress, but our goal is to produce a movie of bond yields over time.



In [27]:

    
url = 'http://pages.stern.nyu.edu/~dbackus/Data/GSW_nominal_yields.csv'
y = pd.read_csv(url, index_col=0, parse_dates=True)
y[list(range(10))].tail()



In [28]:

    
# compute mean yields
ybar = y.mean(axis=0)
ystd = y.std(axis=0)
maturities = list(range(1,122))
ybar.index = maturities
ystd.index = maturities



In [45]:

    
fig, ax = plt.subplots()
ybar.plot(ax=ax)
ystd.plot(ax=ax)
ax.set_xlabel('Maturity in Months')
ax.set_ylabel('Mean and Std Dev of Yield (Annual Percent)')
ax.set_ylim(0)
ax.legend(["Mean", "std. dev."])









    Out[45]:





(0, 6.5)



In [ ]:



In [ ]:



In [ ]:

	INDPRO	PAYEMS	AWHMAN	PERMIT	NAPM	UMCSENT
INDPRO	1.000000	0.778040	0.611732	0.239098	0.266087	0.270027
PAYEMS	0.778040	1.000000	0.268637	0.086402	-0.125681	0.064980
AWHMAN	0.611732	0.268637	1.000000	0.425513	0.683019	0.444003
PERMIT	0.239098	0.086402	0.425513	1.000000	0.494587	0.523323
NAPM	0.266087	-0.125681	0.683019	0.494587	1.000000	0.603901
UMCSENT	0.270027	0.064980	0.444003	0.523323	0.603901	1.000000

	CA	DE	GB	JP	US
DATE
2000-01-01	79.900000	NaN	79.913064	78.392194	81.683615
2000-04-01	79.900000	NaN	80.184368	78.838946	81.616079
2000-07-01	79.833333	NaN	80.442829	78.744605	81.277837

	y1	y2	y3	y4	y5	y6	y7	y8	y9	y10
2014-12-31	0.203838	0.185108	0.173745	0.168964	0.170051	0.176355	0.187282	0.202292	0.220892	0.242636
2015-01-31	0.270731	0.230483	0.199873	0.177721	0.162974	0.154694	0.152046	0.154287	0.160756	0.170868
2015-02-28	0.170712	0.153952	0.144484	0.141464	0.144129	0.151789	0.163823	0.179671	0.198828	0.220840
2015-03-31	0.271311	0.253007	0.240062	0.231969	0.228259	0.228504	0.232309	0.239310	0.249173	0.261594
2015-04-30	0.126999	0.125811	0.127855	0.132838	0.140489	0.150555	0.162804	0.177019	0.193000	0.210562

	Subject Descriptor	Subject Notes
WEO Subject Code
NGDP_R	Gross domestic product, constant prices	Expressed in billions of national currency uni...
NGDP_RPCH	Gross domestic product, constant prices	Annual percentages of constant price GDP are y...
NGDP	Gross domestic product, current prices	Expressed in billions of national currency uni...

Country	Argentina	France	Germany	Greece	United States
2001	44.886	57.943	57.529	99.922	53.022
2002	137.512	59.840	59.169	98.114	55.379
2003	116.525	63.947	62.920	94.094	58.516

				Last	Bid	Ask	IV
Strike	Expiry	Type	Symbol
385	2016-04-08	call	AMZN160408C00385000	0	171.90	175.95	86.47%
390	2016-04-08	call	AMZN160408C00390000	0	167.15	170.95	83.89%
395	2016-04-08	call	AMZN160408C00395000	0	162.00	166.00	82.10%
400	2016-04-08	call	AMZN160408C00400000	0	157.30	161.20	82.32%
405	2016-04-08	call	AMZN160408C00405000	0	152.05	156.15	79.08%

	Major area, region, country or area *	Country code	Reference date (as of 1 July)	0-4	5-9	10-14	15-19	20-24	25-29	30-34	...	55-59	60-64	65-69	70-74	75-79	80-84	85-89	90-94	95-99	100+
0	WORLD	900	2015	670928.185	637448.895	607431.299	590069.337	603508.555	609793.207	551085.426	...	339691.467	292726.792	215047.151	153207.026	114652.312	71449.698	37062.196	13388.529	2921.852	450.895
1	WORLD	900	2020	677599.590	664282.610	634568.409	604322.921	585831.926	599278.263	604808.769	...	387339.470	320927.366	267874.842	187441.892	123355.620	81851.457	42700.991	16937.683	4242.605	605.465
2	WORLD	900	2025	673174.914	671929.973	661684.410	631509.113	600050.925	581681.875	594598.480	...	428166.199	367325.750	295011.725	234797.844	152989.596	89822.147	49847.989	20116.420	5538.923	905.182
3	WORLD	900	2030	672121.017	667926.121	669744.224	658942.513	627343.133	595871.013	577192.328	...	450091.908	407564.034	339528.512	260426.171	193066.277	113475.561	56221.915	24041.609	6836.461	1244.851
4	WORLD	900	2035	678271.589	667353.285	665875.094	667248.100	655034.545	623315.619	591598.185	...	456022.842	429343.580	378601.876	301893.804	216249.259	144393.943	72643.007	28074.222	8397.572	1617.797

	Year	0-4	5-9	10-14	15-19	20-24	25-29	30-34	35-39	40-44	...	55-59	60-64	65-69	70-74	75-79	80-84	85-89	90-94	95-99	100+
1458	2015	5269.038	5398.973	5603.638	5960.784	6111.768	6843.421	7455.687	8345.753	9689.865	...	7535.334	8530.749	9452.518	7770.410	6297.532	4940.325	3115.732	1338.774	366.082	60.630
1460	2025	4872.732	5086.975	5275.897	5425.060	5665.781	6033.351	6166.461	6868.725	7446.336	...	8425.830	7577.456	7111.085	7810.061	8212.023	6065.908	3985.173	2137.536	712.019	115.971
1462	2035	4610.562	4720.645	4880.967	5114.258	5340.457	5502.279	5726.044	6067.218	6172.067	...	8129.515	9259.518	8021.327	7034.014	6319.464	6327.227	5517.980	2871.710	1029.126	216.863
1464	2045	4448.702	4529.657	4619.658	4748.906	4947.499	5194.574	5404.633	5541.785	5739.054	...	6728.225	7171.861	7796.329	8693.253	7260.858	5876.803	4469.537	3242.964	1597.432	338.945
1466	2055	4271.907	4371.016	4458.145	4557.117	4685.414	4829.637	5013.420	5236.482	5422.259	...	5971.177	5985.655	6493.459	6798.299	7167.066	7456.047	5368.363	3233.182	1437.057	462.541

	Major area, region, country or area *	Country code	1950-1955	1955-1960	1960-1965	1965-1970	1970-1975	1975-1980	1980-1985	1985-1990
0	WORLD	900	4.961571	4.898665	5.024379	4.922202	4.475478	3.865195	3.591816	3.446811
1	More developed regions	901	2.823786	2.807368	2.685900	2.387534	2.150816	1.917349	1.843806	1.807382
2	Less developed regions	902	6.075417	5.941033	6.129418	6.034843	5.416602	4.585636	4.167889	3.917106

Country	China	Japan	Germany	United States
1990-1995	2.00	1.4760	1.3007	2.0297
1995-2000	1.48	1.3690	1.3456	1.9965
2000-2005	1.50	1.2980	1.3513	2.0420
2005-2010	1.53	1.3388	1.3623	2.0590
2010-2015	1.55	1.3960	1.3909	1.8902