Data Bootcamp: Demography

We love demography, specifically the dynamics of population growth and decline. You can drill down seemingly without end, as this terrific graphic about causes of death suggests.

We take a look here at the UN's population data: the age distribution of the population, life expectancy, fertility (the word we use for births), and mortality (deaths). Explore the website; it's filled with interesting data. There are other sources that cover longer time periods, and for some countries you can get detailed data on specific things (causes of death, for example).

We use a number of countries as examples, but Japan and China are the most striking. The code is written so that the country is easily changed.

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

Preliminaries

Import statements and a date check for future reference.


In [1]:
# import packages 
import pandas as pd                   # data management
import matplotlib.pyplot as plt       # graphics 
import matplotlib as mpl              # graphics parameters
import numpy as np                    # numerical calculations 

# IPython command, puts plots in notebook 
%matplotlib inline

# check Python version 
import datetime as dt 
import sys
print('Today is', dt.date.today())
print('What version of Python are we running? \n', sys.version, sep='')


Today is 2016-03-23
What version of Python are we running? 
3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]

In [ ]:

Population by age

We have both "estimates" of the past (1950-2015) and "projections" of the future (out to 2100). Here we focus on the latter, specifically what the UN refers to as the medium variant: their middle-of-the-road projection. It gives us a sense of how Japan's population might change over the next century.

It takes a few seconds to read the data.

What are the numbers? Thousands of people in various 5-year age categories.


In [2]:
url1 = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url2 = '1_Indicators%20(Standard)/EXCEL_FILES/1_Population/'
url3 = 'WPP2015_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.XLS'
url = url1 + url2 + url3 

cols = [2, 5] + list(range(6,28))
#est = pd.read_excel(url, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])
prj = pd.read_excel(url, sheetname=1, skiprows=16, parse_cols=cols, na_values=['…'])

prj.head(3).iloc[:, :6]     # first six columns, selected by position


Out[2]:
Major area, region, country or area * Reference date (as of 1 July) 0-4 5-9 10-14 15-19
0 WORLD 2015 670928.185 637448.895 607431.299 590069.337
1 WORLD 2020 677599.590 664282.610 634568.409 604322.921
2 WORLD 2025 673174.914 671929.973 661684.410 631509.113

In [3]:
# rename some variables 
pop = prj 
names = list(pop) 
pop = pop.rename(columns={names[0]: 'Country', 
                          names[1]: 'Year'}) 
# select country and years 
country = ['Japan']
years     = [2015, 2055, 2095]
pop = pop[pop['Country'].isin(country) & pop['Year'].isin(years)]
pop = pop.drop(['Country'], axis=1)

# set index = Year 
# divide by 1000 to convert numbers from thousands to millions
pop = pop.set_index('Year')/1000

pop.head().iloc[:, :8]     # first eight columns, selected by position


Out[3]:
0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39
Year
2015 5.269038 5.398973 5.603638 5.960784 6.111768 6.843421 7.455687 8.345753
2055 4.271907 4.371016 4.458145 4.557117 4.685414 4.829637 5.013420 5.236482
2095 3.720541 3.789596 3.860493 3.938189 4.025477 4.106967 4.194272 4.290983

In [4]:
# transpose (T) so that index = age 
pop = pop.T
pop.head(3)


Out[4]:
Year 2015 2055 2095
0-4 5.269038 4.271907 3.720541
5-9 5.398973 4.371016 3.789596
10-14 5.603638 4.458145 3.860493
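
One way to read a table like this is to compute the percentage change in each age group between 2015 and 2095. The sketch below retypes the three rows shown in Out[4] by hand rather than reading the UN file, so the name `sample` is ours and the frame covers only those three age groups.

```python
import pandas as pd

# rows copied from Out[4] (millions of people); not the full table
sample = pd.DataFrame(
    {2015: [5.269038, 5.398973, 5.603638],
     2055: [4.271907, 4.371016, 4.458145],
     2095: [3.720541, 3.789596, 3.860493]},
    index=['0-4', '5-9', '10-14'])

# percent change from 2015 to 2095 for each age group
change = 100 * (sample[2095] / sample[2015] - 1)
print(change.round(1))
```

The same one-line calculation should work on the full `pop` table, since it has the same layout: age groups in the index, years in the columns.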

In [5]:
ax = pop.plot(kind='bar',  
              color='blue', 
              alpha=0.5, subplots=True, sharey=True, figsize=(8,6))

for a in ax:  
    a.set_title('')
    a.set_ylabel('Millions')
        
ax[0].set_title('Population by age', fontsize=14, loc='left')


Out[5]:
<matplotlib.text.Text at 0xa259e10>

In [ ]:

Exercise. What do you see here? What else would you like to know?

Exercise. Adapt the preceding code to do the same thing for China, or for some other country that sparks your interest.


In [ ]:

Fertility: aka birth rates

We might wonder: why is the population falling in Japan? In other countries? Well, one reason is that birth rates are falling. Demographers call this fertility. Here we look at fertility using the same UN source as the previous example. We look at two variables: total fertility and fertility by age of mother. In both cases we explore the numbers to date, but the same files contain projections of future fertility.


In [6]:
# fertility overall 
uft  = 'http://esa.un.org/unpd/wpp/DVD/Files/'
uft += '1_Indicators%20(Standard)/EXCEL_FILES/'
uft += '2_Fertility/WPP2015_FERT_F04_TOTAL_FERTILITY.XLS'

cols = [2] + list(range(5,18))
ftot = pd.read_excel(uft, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])

ftot.head(3).iloc[:, :6]     # first six columns, selected by position


Out[6]:
Major area, region, country or area * 1950-1955 1955-1960 1960-1965 1965-1970 1970-1975
0 WORLD 4.961571 4.898665 5.024379 4.922202 4.475478
1 More developed regions 2.823786 2.807368 2.685900 2.387534 2.150816
2 Less developed regions 6.075417 5.941033 6.129418 6.034843 5.416602

In [7]:
# rename some variables 
names = list(ftot)
f = ftot.rename(columns={names[0]: 'Country'}) 

# select countries 
countries = ['China', 'Japan', 'Germany', 'United States of America']
f = f[f['Country'].isin(countries)]

# shape
f = f.set_index('Country').T 
f = f.rename(columns={'United States of America': 'United States'})
f.tail(3)


Out[7]:
Country China Japan Germany United States
2000-2005 1.50 1.2980 1.3513 2.0420
2005-2010 1.53 1.3388 1.3623 2.0590
2010-2015 1.55 1.3960 1.3909 1.8902

In [8]:
fig, ax = plt.subplots()
f.plot(ax=ax, kind='line', alpha=0.5, lw=3, figsize=(6.5, 4))
ax.set_title('Fertility (births per woman, lifetime)', fontsize=14, loc='left')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
ax.set_ylim(ymin=0)
ax.hlines(2.1, -1, 13, linestyles='dashed')
ax.text(8.5, 2.4, 'Replacement = 2.1')


Out[8]:
<matplotlib.text.Text at 0x92c4278>

Exercise. What do you see here? What else would you like to know?

Exercise. Add Canada to the figure. How does it compare to the others? What other countries would you be interested in?
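
The dashed line in the figure marks replacement fertility of 2.1 births per woman. As a rough back-of-the-envelope sketch (our simplification, not a UN calculation): if fertility stays fixed at some level, each generation is about fertility/2.1 times the size of its parents' generation. The function name `generation_ratio` is ours, the input rates are the 2010-2015 values from Out[7], and the calculation ignores migration and changes in mortality.

```python
REPLACEMENT = 2.1   # births per woman that keeps generations the same size

def generation_ratio(fertility, replacement=REPLACEMENT):
    """Approximate size of the next generation relative to the current one."""
    return fertility / replacement

# total fertility for 2010-2015, copied from Out[7]
rates = {'China': 1.55, 'Japan': 1.3960,
         'Germany': 1.3909, 'United States': 1.8902}

for name, tfr in rates.items():
    one = generation_ratio(tfr)
    # cubing the ratio compounds the change over three generations
    print('{:14s} one generation: {:.2f}   three generations: {:.2f}'
          .format(name, one, one**3))
```

A ratio below one means each generation is smaller than the last, which is one reason the projected age distributions shrink at the young end.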


In [ ]:

Life expectancy

One of the bottom-line summary numbers for mortality is life expectancy: if mortality rates fall, people live longer, on average. Here we look at life expectancy at birth. There are also numbers for life expectancy given that you live to some specific age; for example, life expectancy given that you survive to age 60.
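
To see how mortality rates translate into life expectancy, here is a minimal life-table sketch. It assumes that deaths occur, on average, halfway through each age interval and that everyone dies by the end of the last interval. The function name `life_expectancy` and the two mortality schedules are made up for illustration; they are not taken from the UN data.

```python
import numpy as np

def life_expectancy(q, n=5):
    """Life expectancy at birth from death probabilities q over n-year intervals.

    Assumes deaths fall midway through each interval (a simplification).
    """
    q = np.asarray(q, dtype=float)
    # survivors at the start of each interval, starting from one newborn
    alive = np.concatenate(([1.0], np.cumprod(1 - q)))
    # person-years lived in each interval: average of start and end survivors
    person_years = n * (alive[:-1] + alive[1:]) / 2
    return person_years.sum()

# hypothetical schedules: 16 intervals of 5 years, with q = 1 in the last one
high = [0.05] * 15 + [1.0]
low  = [0.02] * 15 + [1.0]
print(life_expectancy(high), life_expectancy(low))
```

The schedule with lower death probabilities gives the higher life expectancy, which is the link between the two concepts.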


In [9]:
# life expectancy at birth, both sexes  
ule  = 'http://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/EXCEL_FILES/3_Mortality/'
ule += 'WPP2015_MORT_F07_1_LIFE_EXPECTANCY_0_BOTH_SEXES.XLS'

cols = [2] + list(range(5,34))
le  = pd.read_excel(ule, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])

le.head(3).iloc[:, :10]     # first ten columns, selected by position


Out[9]:
Major area, region, country or area * 1950-1955 1955-1960 1960-1965 1965-1970 1970-1975 1975-1980 1980-1985 1985-1990 1990-1995
0 WORLD 46.807955 49.211594 51.069722 55.378481 58.049389 60.212254 61.989053 63.610776 64.536781
1 More developed regions 64.669835 67.679537 69.440381 70.281907 71.063794 71.960446 72.806852 73.901457 74.107521
2 Less developed regions 41.508118 43.882493 46.017378 51.441078 54.770858 57.342650 59.418113 61.229860 62.453399

In [10]:
# rename some variables 
oldname = list(le)[0]
l = le.rename(columns={oldname: 'Country'}) 
l.head(3).iloc[:, :8]     # first eight columns, selected by position


Out[10]:
Country 1950-1955 1955-1960 1960-1965 1965-1970 1970-1975 1975-1980 1980-1985
0 WORLD 46.807955 49.211594 51.069722 55.378481 58.049389 60.212254 61.989053
1 More developed regions 64.669835 67.679537 69.440381 70.281907 71.063794 71.960446 72.806852
2 Less developed regions 41.508118 43.882493 46.017378 51.441078 54.770858 57.342650 59.418113

In [11]:
# select countries 
countries = ['China', 'Japan', 'Germany', 'United States of America']
l = l[l['Country'].isin(countries)]

# shape
l = l.set_index('Country').T 
l = l.rename(columns={'United States of America': 'United States'})
l.tail()


Out[11]:
Country China Japan Germany United States
1990-1995 69.386 79.447 75.878 75.617
1995-2000 70.587 80.475 77.212 76.404
2000-2005 72.852 81.829 78.573 77.133
2005-2010 74.438 82.621 79.757 78.113
2010-2015 75.432 83.298 80.647 78.873

In [12]:
fig, ax = plt.subplots()
l.plot(ax=ax, kind='line', alpha=0.5, lw=3, figsize=(6, 8), grid=True)
ax.set_title('Life expectancy at birth', fontsize=14, loc='left')
ax.set_ylabel('Life expectancy in years')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
ax.set_ylim(ymin=0)


Out[12]:
(0, 85.0)

Exercise. What other countries would you like to see? Can you add them? The code below generates a list.


In [13]:
countries = le.rename(columns={oldname: 'Country'})['Country']

Exercise. Why do you think the US is falling behind? What would you look at to verify your conjecture?


In [ ]:

Mortality: aka death rates

Another thing that affects the age distribution of the population is the mortality rate: if mortality rates fall, people live longer, on average. Here we look at how mortality rates have changed over the past 60+ years. Roughly speaking, people live an extra five years every generation. Which is a lot. Some of you will live to be a hundred. (Look at the 100+ age category over time for Japan.)
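
As a rough check on the "five years every generation" rule of thumb, the sketch below uses the life expectancy numbers from Out[11]. The values are retyped by hand, the name `le_sample` is ours, and the 25-year generation length is an assumption, not something the UN data defines.

```python
import pandas as pd

# life expectancy at birth, copied from Out[11]
le_sample = pd.DataFrame(
    {'China':         [69.386, 75.432],
     'Japan':         [79.447, 83.298],
     'Germany':       [75.878, 80.647],
     'United States': [75.617, 78.873]},
    index=['1990-1995', '2010-2015'])

years_apart = 20    # between the midpoints of the two periods
generation  = 25    # assumed length of a generation in years

# gain in years of life, rescaled from 20 years to one generation
gain = le_sample.loc['2010-2015'] - le_sample.loc['1990-1995']
per_generation = gain / years_apart * generation
print(per_generation.round(1))
```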

The experts look at mortality rates by age. The UN has a whole page devoted to mortality numbers. We take 5-year mortality rates from the Abridged Life Table.

The numbers are probabilities: the fraction of people in a given age group who die over a 5-year period. A value of 0.1 means that 90 percent of an age group is still alive in five years.
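
Because each number covers one age interval, survival over a longer span comes from multiplying the per-interval survival probabilities. A small sketch, using the China 2010-2015 values for ages 5, 10, and 15 that appear in Out[16] (retyped by hand; the variable names are ours):

```python
import numpy as np

# 5-year death probabilities for China, 2010-2015, ages 5, 10, 15 (from Out[16])
q = np.array([0.001756, 0.001286, 0.001844])

# probability of surviving each 5-year interval
p = 1 - q

# chain the three intervals: probability that a 5-year-old reaches age 20
survive_5_to_20 = p.prod()
print(round(survive_5_to_20, 4))
```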


In [14]:
# mortality overall 
url  = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url += '1_Indicators%20(Standard)/EXCEL_FILES/3_Mortality/'
url += 'WPP2015_MORT_F17_1_ABRIDGED_LIFE_TABLE_BOTH_SEXES.XLS'

cols = [2, 5, 6, 7, 9]
mort = pd.read_excel(url, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])
mort.tail(3)


Out[14]:
Major area, region, country or area * Period Age (x) Age interval (n) Probability of dying q(x,n)
59524 Tonga 2010-2015 75 5 0.279867
59525 Tonga 2010-2015 80 5 0.397735
59526 Tonga 2010-2015 85 15 NaN

In [15]:
# change names 
names = list(mort)
m = mort.rename(columns={names[0]: 'Country', names[2]: 'Age', names[3]: 'Interval', names[4]: 'Mortality'})
m.head(3)


Out[15]:
Country Period Age Interval Mortality
0 WORLD 1950-1955 0 1 0.141804
1 WORLD 1950-1955 1 4 0.085487
2 WORLD 1950-1955 5 5 0.031513

Comment. At this point, we need to pivot the data. That's not something we've done before, so take it as simply something we can do easily if we have to. We're going to do this twice to produce different graphs:

  • Compare countries for the same period.
  • Compare different periods for the same country.
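
To make the idea of a pivot concrete, here is a toy version with four rows. The values are copied from Out[16]; the names `long` and `wide` are ours. The pivot turns "long" data, with one row per country and age, into "wide" data, with one column per country.

```python
import pandas as pd

# long format: one row per (country, age) pair; values copied from Out[16]
long = pd.DataFrame({
    'Country':   ['China', 'China', 'Japan', 'Japan'],
    'Age':       [5, 10, 5, 10],
    'Mortality': [0.001756, 0.001286, 0.000462, 0.000437]})

# wide format: ages become the index, countries become the columns
wide = long.pivot(index='Age', columns='Country', values='Mortality')
print(wide)
```

The wide layout is what the `plot` method wants: it draws one line per column against the index.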

In [16]:
# compare countries for most recent period
countries = ['China', 'Japan', 'Germany', 'United States of America']
mt = m[m['Country'].isin(countries) & m['Interval'].isin([5]) & m['Period'].isin(['2010-2015'])] 
print('Dimensions:', mt.shape) 

mp = mt.pivot(index='Age', columns='Country', values='Mortality')  
mp.head(3)


Dimensions: (64, 5)
Out[16]:
Country China Germany Japan United States of America
Age
5 0.001756 0.000397 0.000462 0.000597
10 0.001286 0.000456 0.000437 0.000714
15 0.001844 0.001300 0.001141 0.002264

In [17]:
fig, ax = plt.subplots()
mp.plot(ax=ax, kind='line', alpha=0.5, linewidth=3, 
#        logy=True, 
        figsize=(6, 4))
ax.set_title('Mortality by age', fontsize=14, loc='left')
ax.set_ylabel('Mortality rate')   # add '(log scale)' if logy=True is uncommented
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)


Out[17]:
<matplotlib.legend.Legend at 0x9191390>

Exercises.

  • What country's old people have the lowest mortality?
  • What do you see here for the US? Why is our life expectancy shorter?
  • What other countries would you like to see? Can you adapt the code to show them?
  • Anything else cross your mind?

In [ ]:


In [18]:
# compare periods for one country -- countries[0] is China 
mt = m[m['Country'].isin([countries[0]]) & m['Interval'].isin([5])] 
print('Dimensions:', mt.shape) 

mp = mt.pivot(index='Age', columns='Period', values='Mortality')  
mp = mp.iloc[:, [0, 6, 12]]     # first, middle, and last periods, by position
mp.head(3)


Dimensions: (208, 5)
Out[18]:
Period 1950-1955 1980-1985 2010-2015
Age
5 0.043088 0.006487 0.001756
10 0.025641 0.003300 0.001286
15 0.026845 0.004172 0.001844

In [19]:
fig, ax = plt.subplots()
mp.plot(ax=ax, kind='line', alpha=0.5, linewidth=3, 
#        logy=True, 
        figsize=(6, 4))
ax.set_title('Mortality over time', fontsize=14, loc='left')
ax.set_ylabel('Mortality rate')   # add '(log scale)' if logy=True is uncommented
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)


Out[19]:
<matplotlib.legend.Legend at 0xd7d7dd8>

Exercise. What do you see? What else would you like to know?

Exercise. Repeat this graph for the United States. How does it compare?


In [ ]:

