We love demography, specifically the dynamics of population growth and decline. You can drill down seemingly without end, as this terrific graphic about causes of death suggests.
We take a look here at the UN's population data: the age distribution of the population, life expectancy, fertility (the word we use for births), and mortality (deaths). Explore the website; it's filled with interesting data. There are other sources that cover longer time periods, and for some countries you can get detailed data on specific things (causes of death, for example).
We use a number of countries as examples, but Japan and China are the most striking. The code is written so that the country is easily changed.
This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.
In [1]:
# import packages
import pandas as pd # data management
import matplotlib.pyplot as plt # graphics
import matplotlib as mpl # graphics parameters
import numpy as np # numerical calculations
# IPython command, puts plots in notebook
%matplotlib inline
# check Python version
import datetime as dt
import sys
print('Today is', dt.date.today())
print('What version of Python are we running? \n', sys.version, sep='')
We have both "estimates" of the past (1950-2015) and "projections" of the future (out to 2100). Here we focus on the latter, specifically what the UN refers to as the medium variant: their middle of the road projection. It gives us a sense of how Japan's population might change over the next century.
It takes a few seconds to read the data.
What are the numbers? Thousands of people in various 5-year age categories.
In [2]:
url1 = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url2 = '1_Indicators%20(Standard)/EXCEL_FILES/1_Population/'
url3 = 'WPP2015_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.XLS'
url = url1 + url2 + url3
cols = [2, 5] + list(range(6,28))
#est = pd.read_excel(url, sheet_name=0, skiprows=16, usecols=cols, na_values=['…'])
prj = pd.read_excel(url, sheet_name=1, skiprows=16, usecols=cols, na_values=['…'])
# show the first three rows and first six columns
prj.head(3).iloc[:, :6]
Out[2]:
In [3]:
# rename some variables
pop = prj
names = list(pop)
pop = pop.rename(columns={names[0]: 'Country',
                          names[1]: 'Year'})
# select country and years
country = ['Japan']
years = [2015, 2055, 2095]
pop = pop[pop['Country'].isin(country) & pop['Year'].isin(years)]
pop = pop.drop(['Country'], axis=1)
# set index = Year
# divide by 1000 to convert numbers from thousands to millions
pop = pop.set_index('Year')/1000
# show the first eight columns
pop.head().iloc[:, :8]
Out[3]:
In [4]:
# transpose (T) so that index = age
pop = pop.T
pop.head(3)
Out[4]:
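The select-and-reshape steps above can be tried on a small made-up table before running them on the full UN file. This is a minimal sketch, assuming a table laid out like the UN sheet (one row per country and year, one column per age group). The `toy` frame, its column labels, and its numbers are invented for illustration and are not UN data.

```python
import pandas as pd

# hypothetical data in thousands of people; NOT actual UN figures
toy = pd.DataFrame({'Country': ['Japan', 'Japan', 'China', 'China'],
                    'Year': [2015, 2055, 2015, 2055],
                    '0-4': [5000, 4000, 80000, 60000],
                    '5-9': [5500, 4200, 78000, 62000]})

# keep one country and the years of interest
keep = toy['Country'].isin(['Japan']) & toy['Year'].isin([2015, 2055])
toy_pop = toy[keep].drop(['Country'], axis=1)

# index by year, convert thousands to millions, transpose so that index = age
toy_pop = (toy_pop.set_index('Year') / 1000).T
print(toy_pop)
```

After the transpose, each column is a year and each row is an age group, which is the layout the bar charts need.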
In [5]:
ax = pop.plot(kind='bar',
              color='blue',
              alpha=0.5, subplots=True, sharey=True, figsize=(8,6))
for axnum in range(len(ax)):
    ax[axnum].set_title('')
    ax[axnum].set_ylabel('Millions')
ax[0].set_title('Population by age', fontsize=14, loc='left')
Out[5]:
Exercise. What do you see here? What else would you like to know?
Exercise. Adapt the preceding code to do the same thing for China, or for some other country that sparks your interest.
We might wonder: why is the population falling in Japan, and in other countries? One reason is that birth rates are falling. Demographers refer to births as fertility. Here we look at fertility using the same UN source as the previous example. We look at two variables: total fertility and fertility by age of mother. In both cases we explore the numbers to date, but the same files contain projections of future fertility.
In [6]:
# fertility overall
uft = 'http://esa.un.org/unpd/wpp/DVD/Files/'
uft += '1_Indicators%20(Standard)/EXCEL_FILES/'
uft += '2_Fertility/WPP2015_FERT_F04_TOTAL_FERTILITY.XLS'
cols = [2] + list(range(5,18))
ftot = pd.read_excel(uft, sheet_name=0, skiprows=16, usecols=cols, na_values=['…'])
# show the first three rows and first six columns
ftot.head(3).iloc[:, :6]
Out[6]:
In [7]:
# rename some variables
names = list(ftot)
f = ftot.rename(columns={names[0]: 'Country'})
# select countries
countries = ['China', 'Japan', 'Germany', 'United States of America']
f = f[f['Country'].isin(countries)]
# shape
f = f.set_index('Country').T
f = f.rename(columns={'United States of America': 'United States'})
f.tail(3)
Out[7]:
In [8]:
fig, ax = plt.subplots()
f.plot(ax=ax, kind='line', alpha=0.5, lw=3, figsize=(6.5, 4))
ax.set_title('Fertility (births per woman, lifetime)', fontsize=14, loc='left')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
ax.set_ylim(ymin=0)
ax.hlines(2.1, -1, 13, linestyles='dashed')
ax.text(8.5, 2.4, 'Replacement = 2.1')
Out[8]:
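The dashed line in the figure marks the replacement rate of 2.1. As a rough sketch of how you might check the same thing numerically, the code below flags the first period in which each country falls below that line. The frame `toy_f`, the country labels, and the numbers are hypothetical, chosen only to show the mechanics; they are not UN figures.

```python
import pandas as pd

REPLACEMENT = 2.1  # the level marked by the dashed line in the figure

# hypothetical fertility rates by period; NOT actual UN figures
toy_f = pd.DataFrame({'Country A': [3.0, 2.5, 2.0, 1.6],
                      'Country B': [5.5, 4.0, 2.6, 2.2]},
                     index=['1950-1955', '1970-1975', '1990-1995', '2010-2015'])

# True where fertility is below the replacement rate
below = toy_f < REPLACEMENT

# first period in which each country is below replacement (None if never)
first_below = {}
for c in toy_f.columns:
    mask = below[c].to_numpy()
    first_below[c] = toy_f.index[mask][0] if mask.any() else None

print(first_below)
```

With the real data you would replace `toy_f` with the frame `f` built above, whose index holds the periods and whose columns hold the countries.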
Exercise. What do you see here? What else would you like to know?
Exercise. Add Canada to the figure. How does it compare to the others? What other countries would you be interested in?
One of the bottom-line summary numbers for mortality is life expectancy: if mortality rates fall, people live longer, on average. Here we look at life expectancy at birth. There are also numbers for life expectancy given that you live to some specific age; for example, life expectancy given that you survive to age 60.
In [9]:
# life expectancy at birth, both sexes
ule = 'http://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/EXCEL_FILES/3_Mortality/'
ule += 'WPP2015_MORT_F07_1_LIFE_EXPECTANCY_0_BOTH_SEXES.XLS'
cols = [2] + list(range(5,34))
le = pd.read_excel(ule, sheet_name=0, skiprows=16, usecols=cols, na_values=['…'])
# show the first three rows and first ten columns
le.head(3).iloc[:, :10]
Out[9]:
In [10]:
# rename some variables
oldname = list(le)[0]
l = le.rename(columns={oldname: 'Country'})
# show the first three rows and first eight columns
l.head(3).iloc[:, :8]
Out[10]:
In [11]:
# select countries
countries = ['China', 'Japan', 'Germany', 'United States of America']
l = l[l['Country'].isin(countries)]
# shape
l = l.set_index('Country').T
l = l.rename(columns={'United States of America': 'United States'})
l.tail()
Out[11]:
In [12]:
fig, ax = plt.subplots()
l.plot(ax=ax, kind='line', alpha=0.5, lw=3, figsize=(6, 8), grid=True)
ax.set_title('Life expectancy at birth', fontsize=14, loc='left')
ax.set_ylabel('Life expectancy in years')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
ax.set_ylim(ymin=0)
Out[12]:
Exercise. What other countries would you like to see? Can you add them? The code below generates a list.
In [13]:
countries = le.rename(columns={oldname: 'Country'})['Country']
Exercise. Why do you think the US is falling behind? What would you look at to verify your conjecture?
Another thing that affects the age distribution of the population is the mortality rate: if mortality rates fall, people live longer, on average. Here we look at how mortality rates have changed over the past 60+ years. Roughly speaking, people live an extra five years every generation, which is a lot. Some of you will live to be a hundred. (Look at the 100+ age category over time for Japan.)
The experts look at mortality rates by age. The UN has a whole page devoted to mortality numbers. We take 5-year mortality rates from the Abridged Life Table.
The numbers are the fractions of people in a given age group who die over a 5-year period. A value of 0.1 means that 90 percent of an age group is still alive in five years.
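To see how 5-year mortality rates compound across age groups, here is a minimal sketch. The rates in `q` are invented for illustration and are not taken from the UN table, and the calculation assumes each rate applies to the people alive at the start of its 5-year interval.

```python
# hypothetical 5-year mortality rates for successive age groups; NOT UN data
q = [0.1, 0.2, 0.5]

# start with everyone alive and apply each rate in turn
survivors = [1.0]
for rate in q:
    survivors.append(survivors[-1] * (1 - rate))

# share of the original group still alive after 0, 5, 10, and 15 years
for years, share in zip(range(0, 5 * len(survivors), 5), survivors):
    print(years, round(share, 3))
```

The first rate of 0.1 leaves 90 percent alive after five years, matching the interpretation above; later rates apply only to those who are still alive.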
In [14]:
# mortality overall
url = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url += '1_Indicators%20(Standard)/EXCEL_FILES/3_Mortality/'
url += 'WPP2015_MORT_F17_1_ABRIDGED_LIFE_TABLE_BOTH_SEXES.XLS'
cols = [2, 5, 6, 7, 9]
mort = pd.read_excel(url, sheet_name=0, skiprows=16, usecols=cols, na_values=['…'])
mort.tail(3)
Out[14]:
In [15]:
# change names
names = list(mort)
m = mort.rename(columns={names[0]: 'Country', names[2]: 'Age', names[3]: 'Interval', names[4]: 'Mortality'})
m.head(3)
Out[15]:
Comment. At this point, we need to pivot the data. That's not something we've done before, so take it as simply something we can do easily if we have to. We're going to do this twice to produce different graphs:
In [16]:
# compare countries for most recent period
countries = ['China', 'Japan', 'Germany', 'United States of America']
mt = m[m['Country'].isin(countries) & m['Interval'].isin([5]) & m['Period'].isin(['2010-2015'])]
print('Dimensions:', mt.shape)
mp = mt.pivot(index='Age', columns='Country', values='Mortality')
mp.head(3)
Out[16]:
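If the pivot step is unfamiliar, it may help to see it on a table small enough to read at a glance. The sketch below uses a made-up long-format frame, `toy_long`, with invented countries and rates (not UN figures) and applies the same `pivot` call as above.

```python
import pandas as pd

# hypothetical long-format data: one row per (country, age); NOT UN figures
toy_long = pd.DataFrame({'Country': ['A', 'A', 'B', 'B'],
                         'Age': [0, 5, 0, 5],
                         'Mortality': [0.010, 0.002, 0.020, 0.004]})

# pivot: ages become the index, countries become the columns,
# and the mortality rates fill in the cells
toy_wide = toy_long.pivot(index='Age', columns='Country', values='Mortality')
print(toy_wide)
```

Each (age, country) pair must appear only once in the long table; otherwise `pivot` raises an error, which is one reason the code above first filters to a single period and interval.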
In [17]:
fig, ax = plt.subplots()
mp.plot(ax=ax, kind='line', alpha=0.5, linewidth=3,
        # logy=True,
        figsize=(6, 4))
ax.set_title('Mortality by age', fontsize=14, loc='left')
# add '(log scale)' to the label if logy=True is switched on above
ax.set_ylabel('Mortality rate')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
Out[17]:
Exercises.
In [18]:
# compare periods for the one country -- countries[0] is China
mt = m[m['Country'].isin([countries[0]]) & m['Interval'].isin([5])]
print('Dimensions:', mt.shape)
mp = mt.pivot(index='Age', columns='Period', values='Mortality')
# keep three periods, selected by column position
mp = mp.iloc[:, [0, 6, 12]]
mp.head(3)
Out[18]:
In [19]:
fig, ax = plt.subplots()
mp.plot(ax=ax, kind='line', alpha=0.5, linewidth=3,
        # logy=True,
        figsize=(6, 4))
ax.set_title('Mortality over time', fontsize=14, loc='left')
# add '(log scale)' to the label if logy=True is switched on above
ax.set_ylabel('Mortality rate')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
Out[19]:
Exercise. What do you see? What else would you like to know?
Exercise. Repeat this graph for the United States. How does it compare?