Optional Code Practice A: Jupyter basics and Python's graphics tools (the Matplotlib package). The goals are to become familiar with Jupyter and Matplotlib and to explore some new datasets. The data management part of this goes beyond what we've done in class. We recommend you just run the code provided and focus on the graphs for now.
This notebook written by Dave Backus for the NYU Stern course Data Bootcamp.
Check Jupyter before we start. Run the code below and make sure it works.
In [ ]:
# to make sure things are working, run this
import pandas as pd
print('Pandas version: ', pd.__version__)
If you get something like "Pandas version: 0.17.1" you're fine. If you get an error, bring your computer by and ask for help. If you're unusually brave, go to StackOverflow and read the instructions. Then come ask for help. (This has to do with how your computer processes unicode. When you hear that word -- unicode -- you should run away at high speed.)
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
%matplotlib inline
Remind yourself:
- What does the pandas package do?
- What does the matplotlib package do?
- What does %matplotlib inline do?

Practice with the following:
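Before practicing, here's a minimal sketch of how the pieces fit together, using made-up toy data (the numbers and labels are for illustration only):

```python
import pandas as pd               # pandas: tools for managing tabular data
import matplotlib.pyplot as plt   # pyplot: matplotlib's plotting interface

# %matplotlib inline (notebook-only) tells Jupyter to display figures
# in the notebook itself rather than in a separate window.
toy = pd.DataFrame({'year': [1, 2, 3], 'value': [2.0, 4.0, 8.0]})
fig, ax = plt.subplots()          # a figure and an axes object to draw on
toy.plot(x='year', y='value', ax=ax)
print(toy.shape)
```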
The internet has produced some interesting market behavior, music being a great example: sales have become increasingly concentrated among a small number of top performers. Curiously enough, we see the same kind of concentration in the US beer industry:
We illustrate this with data from Victor and Carol Tremblay that describe the output of the top 100 US beer producers from 1947 to 2004. This is background data from their book, The US Brewing Industry, MIT Press, 2004. See here for the names of the brewers. Output is measured in thousands of 31-gallon barrels.
Data manipulation. The data manipulation goes beyond what we've done in class. You're free to ignore it, but here's the idea.
In [ ]:
url = 'http://pages.stern.nyu.edu/~dbackus/Data/beer_production_1947-2004.xlsx'
beer = pd.read_excel(url, skiprows=12, index_col=0)
print('Dimensions:', beer.shape)
beer[list(range(1,11))].head(3)
In [ ]:
firms = list(range(1, 101))  # extract top 100 firms
pdf = beer[firms].T          # transpose (flip rows and columns)
pdf[[1947, 1967, 1987, 2004]].head()
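The transpose step above can be illustrated on a toy dataframe (made-up numbers): beer has years down the rows and firm ranks across the columns, and .T flips that around.

```python
# Toy version of the transpose step: rows become columns and vice versa,
# so firm data that ran across a row ends up running down a column.
import pandas as pd

toy = pd.DataFrame({1: [100, 90], 2: [50, 60]}, index=[1947, 2004])
toy.index.name = 'Year'
flipped = toy.T            # firm ranks down the rows, years across the columns
print(flipped[[1947]])     # integer column label: no quotes needed
```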
Question. Can you see consolidation here?
In [ ]:
# a basic plot
fig, ax = plt.subplots()
pdf[1947].plot(ax=ax, logy=True)
pdf[1967].plot(ax=ax, logy=True)
pdf[1987].plot(ax=ax, logy=True)
pdf[2004].plot(ax=ax, logy=True)
ax.legend()
Answer these questions below. Code is sufficient, but it's often helpful to add comments to remind yourself what you did, and why.
- Look up the documentation for the set_title method by typing ax.set_title? in a new cell and running it. Note that you can open the documentation this produces in a separate tab with the icon in the upper right (hover text = "Open the pager in an external window").
- Add a title with ax.set_title('Your title').
- What happens if you add lw=2 to the ax.plot() statements?
- Add a label to the x axis with ax.set_xlabel().
- What does logy=True do? What happens if we don't use it?
- Use the color argument/parameter to choose a more effective set of colors.
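One hedged sketch of these options on toy data (the numbers, colors, and labels here are made up for illustration, not a model answer):

```python
# Demonstrates title, line width, axis label, log scale, and colors.
import pandas as pd
import matplotlib.pyplot as plt

toy = pd.DataFrame({1947: [100.0, 10.0, 1.0], 2004: [1000.0, 1.0, 0.1]})
fig, ax = plt.subplots()
toy[1947].plot(ax=ax, logy=True, lw=2, color='blue')   # lw=2: thicker lines
toy[2004].plot(ax=ax, logy=True, lw=2, color='red')    # logy=True: log scale
ax.set_title('Toy firm sizes')   # run ax.set_title? in a cell for the docs
ax.set_xlabel('Firm rank')
ax.legend()
```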
In [ ]:
In [ ]:
In [ ]:
Populations are getting older throughout the world, but Japan is a striking example. One of our favorite quotes:
Last year, for the first time, sales of adult diapers in Japan exceeded those for babies.
Let's see what the numbers look like using projections from the United Nations' Population Division. They have several projections; we use what they call the "medium variant."
We have a similar issue with the data: population by age for a given country and date goes across rows, not down columns. So we choose the ones we want and transpose them. Again, more than we've done so far.
In [ ]:
# data input (takes about 20 seconds on a wireless network)
url1 = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url2 = '1_Indicators%20(Standard)/EXCEL_FILES/1_Population/'
url3 = 'WPP2017_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.XLSX'
url = url1 + url2 + url3
cols = [2, 4, 5] + list(range(6,28))
# sheet_name and usecols replace the older sheetname and parse_cols arguments
prj = pd.read_excel(url, sheet_name=1, skiprows=16, usecols=cols, na_values=['…'])
print('Dimensions: ', prj.shape)
print('Column labels: ', prj.columns)
In [ ]:
# rename some variables
pop = prj
pop = pop.rename(columns={'Reference date (as of 1 July)': 'Year',
'Region, subregion, country or area *': 'Country',
'Country code': 'Code'})
# select Japan and years
countries = ['Japan']
years = [2015, 2035, 2055, 2075, 2095]
pop = pop[pop['Country'].isin(countries) & pop['Year'].isin(years)]
pop = pop.drop(['Country', 'Code'], axis=1)
pop = pop.set_index('Year').T
pop = pop/1000 # convert population from thousands to millions
pop.head()
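The row selection above can be illustrated on a toy dataframe (made-up values): isin builds a boolean mask, and & combines two masks so only rows satisfying both conditions survive.

```python
# Toy version of the Country/Year selection above.
import pandas as pd

toy = pd.DataFrame({'Country': ['Japan', 'Japan', 'US'],
                    'Year': [2015, 2035, 2015]})
keep = toy[toy['Country'].isin(['Japan']) & toy['Year'].isin([2015])]
print(keep)   # only the Japan/2015 row survives
```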
In [ ]:
pop.tail()
Comment. Now we have the number of people in any five-year age group running down columns. The column labels are the years.
With the dataframe pop:
- Plot the 2015 age distribution with pop[[2015]].plot(). Note that 2015 here does not have quotes around it: it's an unusual case of integer column labels.
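A toy illustration (made-up numbers) of why the double brackets matter: with an integer column label, toy[2015] returns a Series, while toy[[2015]] returns a one-column DataFrame, which keeps the column label for the plot legend.

```python
import pandas as pd

toy = pd.DataFrame({2015: [1.0, 2.0], 2035: [3.0, 4.0]})
print(type(toy[2015]).__name__)    # Series
print(type(toy[[2015]]).__name__)  # DataFrame
```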
In [ ]:
In [ ]:
One of our favorite topics is the yield curve: a plot of the yield to maturity on a bond against the bond's maturity. The foundation here is yields on zero coupon bonds, which are simpler objects than yields on coupon bonds.
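A sketch of the zero-coupon idea, assuming annual compounding (a common textbook convention; the Fed data below uses its own conventions, so this is illustration only). A zero that pays 100 in n years and costs p today has a yield y satisfying p * (1 + y)**n = 100:

```python
# Yield to maturity on a zero-coupon bond, annual compounding assumed.
def zero_yield(price, years):
    return (100.0 / price) ** (1.0 / years) - 1.0

print(zero_yield(100.0, 5))   # a zero priced at par yields zero
```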
We often refer to bond yields rising or falling, but in fact the yield curve often does different things at different maturities. We will see that here. For several years, short yields have been stuck at zero, yet yields on bonds with maturities of two years and above have varied quite a bit.
We use the Fed's well-known Gurkaynak, Sack, and Wright data, which provides daily data on US Treasury yields from 1961 to the present. The Fed posts the data, but it's in an unfriendly format. So we saved it as a csv file, which we read in below. The variables are yields: SVENYnn is the yield for maturity nn years.
In [ ]:
# data input (takes about 20 seconds on a wireless network)
url = 'http://pages.stern.nyu.edu/~dbackus/Data/feds200628.csv'
gsw = pd.read_csv(url, skiprows=9, index_col=0, usecols=list(range(11)), parse_dates=True)
print('Dimensions: ', gsw.shape)
print('Column labels: ', gsw.columns)
print('Row labels: ', gsw.index)
In [ ]:
# grab recent data
df = gsw[gsw.index >= dt.datetime(2010,1,1)]
# convert to annual, last day of year
df = df.resample('A').last().sort_index()
df.head()
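The resample step can be illustrated on a toy dataframe (made-up dates and yields): daily observations collapse to one per calendar year, keeping the last available value in each year.

```python
import pandas as pd

idx = pd.to_datetime(['2010-06-30', '2010-12-31', '2011-12-30'])
toy = pd.DataFrame({'yld': [1.0, 1.5, 2.0]}, index=idx)
annual = toy.resample('A').last()   # one row per year, last value kept
print(annual)
```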
In [ ]:
df.columns = list(range(1,11))
ylds = df.T
ylds.head(3)
With the dataframe ylds:
In [ ]:
In [ ]:
In [ ]: