Data Bootcamp: Demography

We love demography, specifically the dynamics of population growth and decline. You can drill down seemingly without end, as this terrific graphic about causes of death suggests.

We take a look here at the UN's population data: the age distribution of the population, life expectancy, fertility (the word we use for births), and mortality (deaths). Explore the website; it's filled with interesting data. There are other sources that cover longer time periods, and for some countries you can get detailed data on specific things (causes of death, for example).

We use a number of countries as examples, but Japan and China are the most striking. The code is written so that the country is easily changed.

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

Preliminaries

Import statements and a date check for future reference.



In [1]:

    
# import packages 
import pandas as pd                   # data management
import matplotlib.pyplot as plt       # graphics 
import matplotlib as mpl              # graphics parameters
import numpy as np                    # numerical calculations 

# IPython command, puts plots in notebook 
%matplotlib inline

# check Python version 
import datetime as dt 
import sys
print('Today is', dt.date.today())
print('What version of Python are we running? \n', sys.version, sep='')









    



Today is 2016-03-23
What version of Python are we running? 
3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]



In [ ]:

Population by age

We have both "estimates" of the past (1950-2015) and "projections" of the future (out to 2100). Here we focus on the latter, specifically what the UN refers to as the medium variant: their middle of the road projection. It gives us a sense of how Japan's population might change over the next century.

It takes a few seconds to read the data.

What are the numbers? Thousands of people in various 5-year age categories.
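To make the units concrete, here is a minimal sketch that converts thousands to millions. The four numbers are the WORLD 2015 values for the first four age groups, copied from the output of the next cell; the variable names are ours and are not part of the UN file.

```python
import pandas as pd

# WORLD, 2015, thousands of people (first four age groups of the UN table)
world_2015 = pd.Series(
    [670928.185, 637448.895, 607431.299, 590069.337],
    index=['0-4', '5-9', '10-14', '15-19'],
)

millions = world_2015 / 1000      # thousands -> millions
under_20 = millions.sum()         # people under age 20, in millions

print(millions.round(1))
print('Under 20 (millions):', round(under_20, 1))
```

Dividing by 1000 is the same conversion the notebook applies below when it sets the index to Year.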



In [2]:

    
url1 = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url2 = '1_Indicators%20(Standard)/EXCEL_FILES/1_Population/'
url3 = 'WPP2015_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.XLS'
url = url1 + url2 + url3 

cols = [2, 5] + list(range(6,28))
# note: newer pandas versions call these arguments sheet_name and usecols
#est = pd.read_excel(url, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])
prj = pd.read_excel(url, sheetname=1, skiprows=16, parse_cols=cols, na_values=['…'])

# first six columns, selected by position
prj.iloc[:, :6].head(3)









    Out[2]:






  
    
      
   Major area, region, country or area *  Reference date (as of 1 July)         0-4         5-9       10-14       15-19
0                                  WORLD                           2015  670928.185  637448.895  607431.299  590069.337
1                                  WORLD                           2020  677599.590  664282.610  634568.409  604322.921
2                                  WORLD                           2025  673174.914  671929.973  661684.410  631509.113



In [3]:

    
# rename some variables 
pop = prj 
names = list(pop) 
pop = pop.rename(columns={names[0]: 'Country', 
                          names[1]: 'Year'}) 
# select country and years 
country = ['Japan']
years     = [2015, 2055, 2095]
pop = pop[pop['Country'].isin(country) & pop['Year'].isin(years)]
pop = pop.drop(['Country'], axis=1)

# set index = Year 
# divide by 1000 to convert numbers from thousands to millions
pop = pop.set_index('Year')/1000

# first eight columns, selected by position
pop.iloc[:, :8].head()



In [4]:

    
# transpose (T) so that index = age 
pop = pop.T
pop.head(3)



In [5]:

    
ax = pop.plot(kind='bar',  
              color='blue', 
              alpha=0.5, subplots=True, sharey=True, figsize=(8,6))

for axnum in range(len(ax)):  
    ax[axnum].set_title('')
    ax[axnum].set_ylabel('Millions')
        
ax[0].set_title('Population by age', fontsize=14, loc='left')









    Out[5]:





<matplotlib.text.Text at 0xa259e10>



In [ ]:

Exercise. What do you see here? What else would you like to know?

Exercise. Adapt the preceding code to do the same thing for China. Or some other country that sparks your interest.
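One possible way to make the country easy to change is to wrap the selection steps in a function. This is only a sketch: `select_country` is a hypothetical helper of ours, and `toy` is a small stand-in for the UN table built from a few numbers shown in this notebook (in thousands), so that the example runs without downloading anything.

```python
import pandas as pd

def select_country(df, country, years):
    """Return population by age (rows) and year (columns), in millions."""
    sub = df[(df['Country'] == country) & df['Year'].isin(years)]
    sub = sub.drop(columns='Country').set_index('Year') / 1000
    return sub.T   # transpose so that index = age

# toy stand-in for the UN table (thousands of people, two age groups only)
toy = pd.DataFrame({
    'Country': ['WORLD', 'Japan', 'Japan', 'Japan'],
    'Year':    [2015, 2015, 2055, 2095],
    '0-4':     [670928.185, 5269.038, 4271.907, 3720.541],
    '5-9':     [637448.895, 5398.973, 4371.016, 3789.596],
})

japan = select_country(toy, 'Japan', [2015, 2095])
print(japan)
```

With the real data you would pass the renamed `pop` DataFrame and a different country name in place of `toy` and `'Japan'`.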



In [ ]:

Fertility: aka birth rates

We might wonder: why is the population falling in Japan? In other countries? Well, one reason is that birth rates are falling. Demographers call this fertility. Here we look at fertility using the same UN source as the previous example. We look at two variables: total fertility and fertility by age of mother. In both cases we explore the numbers to date, but the same files contain projections of future fertility.



In [6]:

    
# fertility overall 
uft  = 'http://esa.un.org/unpd/wpp/DVD/Files/'
uft += '1_Indicators%20(Standard)/EXCEL_FILES/'
uft += '2_Fertility/WPP2015_FERT_F04_TOTAL_FERTILITY.XLS'

cols = [2] + list(range(5,18))
ftot = pd.read_excel(uft, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])

# first six columns, selected by position
ftot.iloc[:, :6].head(3)









    Out[6]:






  
    
      
   Major area, region, country or area *  1950-1955  1955-1960  1960-1965  1965-1970  1970-1975
0                                  WORLD   4.961571   4.898665   5.024379   4.922202   4.475478
1                 More developed regions   2.823786   2.807368   2.685900   2.387534   2.150816
2                 Less developed regions   6.075417   5.941033   6.129418   6.034843   5.416602



In [7]:

    
# rename some variables 
names = list(ftot)
f = ftot.rename(columns={names[0]: 'Country'}) 

# select countries 
countries = ['China', 'Japan', 'Germany', 'United States of America']
f = f[f['Country'].isin(countries)]

# shape
f = f.set_index('Country').T 
f = f.rename(columns={'United States of America': 'United States'})
f.tail(3)









    Out[7]:






  
    
Country    China   Japan  Germany  United States
2000-2005   1.50  1.2980   1.3513         2.0420
2005-2010   1.53  1.3388   1.3623         2.0590
2010-2015   1.55  1.3960   1.3909         1.8902



In [8]:

    
fig, ax = plt.subplots()
f.plot(ax=ax, kind='line', alpha=0.5, lw=3, figsize=(6.5, 4))
ax.set_title('Fertility (births per woman, lifetime)', fontsize=14, loc='left')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
ax.set_ylim(bottom=0)
ax.hlines(2.1, -1, 13, linestyles='dashed')
ax.text(8.5, 2.4, 'Replacement = 2.1')









    Out[8]:





<matplotlib.text.Text at 0x92c4278>

Exercise. What do you see here? What else would you like to know?

Exercise. Add Canada to the figure. How does it compare to the others? What other countries would you be interested in?
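One way to read the figure is to compare each country with the replacement rate of 2.1 drawn as the dashed line. The sketch below does that for 2010-2015 using the four values from the table above; the names `tfr`, `gap`, and `below` are ours.

```python
import pandas as pd

REPLACEMENT = 2.1   # births per woman, the dashed line in the figure

# total fertility, 2010-2015, copied from the table above
tfr = pd.Series({'China': 1.55, 'Japan': 1.3960,
                 'Germany': 1.3909, 'United States': 1.8902})

gap = (tfr - REPLACEMENT).sort_values()        # negative = below replacement
below = tfr[tfr < REPLACEMENT].index.tolist()  # countries under the line

print(gap.round(2))
print('Below replacement:', below)
```

In this period all four countries sit below the line, which is one reason their populations are projected to age.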



In [ ]:

Life expectancy

One of the bottom-line summary numbers for mortality is life expectancy: if mortality rates fall, people live longer, on average. Here we look at life expectancy at birth. There are also numbers for life expectancy given that you live to some specific age; for example, life expectancy given that you survive to age 60.



In [9]:

    
# life expectancy at birth, both sexes  
ule  = 'http://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/EXCEL_FILES/3_Mortality/'
ule += 'WPP2015_MORT_F07_1_LIFE_EXPECTANCY_0_BOTH_SEXES.XLS'

cols = [2] + list(range(5,34))
le  = pd.read_excel(ule, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])

# first ten columns, selected by position
le.iloc[:, :10].head(3)









    Out[9]:






  
    
      
   Major area, region, country or area *  1950-1955  1955-1960  1960-1965  1965-1970  1970-1975  1975-1980  1980-1985  1985-1990  1990-1995
0                                  WORLD  46.807955  49.211594  51.069722  55.378481  58.049389  60.212254  61.989053  63.610776  64.536781
1                 More developed regions  64.669835  67.679537  69.440381  70.281907  71.063794  71.960446  72.806852  73.901457  74.107521
2                 Less developed regions  41.508118  43.882493  46.017378  51.441078  54.770858  57.342650  59.418113  61.229860  62.453399



In [10]:

    
# rename some variables 
oldname = list(le)[0]
l = le.rename(columns={oldname: 'Country'}) 
# first eight columns, selected by position
l.iloc[:, :8].head(3)









    Out[10]:






  
    
      
                  Country  1950-1955  1955-1960  1960-1965  1965-1970  1970-1975  1975-1980  1980-1985
0                   WORLD  46.807955  49.211594  51.069722  55.378481  58.049389  60.212254  61.989053
1  More developed regions  64.669835  67.679537  69.440381  70.281907  71.063794  71.960446  72.806852
2  Less developed regions  41.508118  43.882493  46.017378  51.441078  54.770858  57.342650  59.418113



In [11]:

    
# select countries 
countries = ['China', 'Japan', 'Germany', 'United States of America']
l = l[l['Country'].isin(countries)]

# shape
l = l.set_index('Country').T 
l = l.rename(columns={'United States of America': 'United States'})
l.tail()









    Out[11]:






  
    
Country     China   Japan  Germany  United States
1990-1995  69.386  79.447   75.878         75.617
1995-2000  70.587  80.475   77.212         76.404
2000-2005  72.852  81.829   78.573         77.133
2005-2010  74.438  82.621   79.757         78.113
2010-2015  75.432  83.298   80.647         78.873



In [12]:

    
fig, ax = plt.subplots()
l.plot(ax=ax, kind='line', alpha=0.5, lw=3, figsize=(6, 8), grid=True)
ax.set_title('Life expectancy at birth', fontsize=14, loc='left')
ax.set_ylabel('Life expectancy in years')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
ax.set_ylim(bottom=0)









    Out[12]:





(0, 85.0)

Exercise. What other countries would you like to see? Can you add them? The code below generates a list.



In [13]:

    
countries = le.rename(columns={oldname: 'Country'})['Country']
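A list of this kind is easier to use if you can search it. Here is a small sketch of a case-insensitive search; `find` is a hypothetical helper of ours, and `names` holds only a handful of entries that appear elsewhere in this notebook rather than the full UN list.

```python
import pandas as pd

# a few entries of the kind found in the 'Country' column
names = pd.Series(['WORLD', 'More developed regions', 'Less developed regions',
                   'China', 'Japan', 'Germany', 'United States of America',
                   'Tonga'])

def find(names, text):
    """Case-insensitive substring search over a Series of names."""
    return names[names.str.contains(text, case=False, regex=False)].tolist()

print(find(names, 'united'))
print(find(names, 'regions'))
```

Applied to the `countries` Series created above, the same call would show the exact spelling the UN uses for a country, which is what `isin` needs.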

Exercise. Why do you think the US is falling behind? What would you look at to verify your conjecture?
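As a starting point for this exercise, the sketch below computes the gain in life expectancy between 1990-1995 and 2010-2015 for the four countries, using the numbers from the table above. The names `le_tab` and `gain` are ours.

```python
import pandas as pd

# life expectancy at birth (years), copied from the table above
le_tab = pd.DataFrame(
    {'China': [69.386, 75.432], 'Japan': [79.447, 83.298],
     'Germany': [75.878, 80.647], 'United States': [75.617, 78.873]},
    index=['1990-1995', '2010-2015'])

# change over the two decades, smallest gain first
gain = (le_tab.loc['2010-2015'] - le_tab.loc['1990-1995']).sort_values()
print(gain.round(2))
```

The US and Germany start at nearly the same level in 1990-1995, so the difference in gains accounts for most of the gap between them in 2010-2015.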



In [ ]:

Mortality: aka death rates

Another thing that affects the age distribution of the population is the mortality rate: if mortality rates fall, people live longer, on average. Here we look at how mortality rates have changed over the past 60+ years. Roughly speaking, people live an extra five years every generation. Which is a lot. Some of you will live to be a hundred. (Look at the 100+ age category over time for Japan.)

The experts look at mortality rates by age. The UN has a whole page devoted to mortality numbers. We take 5-year mortality rates from the Abridged Life Table.

The numbers are the fractions of people in a given age group who die over a 5-year period (the UN calls this the probability of dying). A value of 0.1 means that 10 percent die and 90 percent of the age group is still alive five years later.
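As a sketch of how these numbers combine across age groups, the code below turns three 5-year probabilities of dying into survival shares and multiplies them. The values are the Japan 2010-2015 entries for ages 5, 10, and 15 from the comparison table later in this section. The calculation assumes, for illustration only, that the same period rates apply to one group of people as it ages.

```python
import pandas as pd

# 5-year probabilities of dying for Japan, 2010-2015 (ages 5, 10, 15)
q = pd.Series([0.000462, 0.000437, 0.001141], index=[5, 10, 15])

survive = 1 - q                # share still alive after each 5-year interval
to_age_20 = survive.prod()     # share of 5-year-olds who reach age 20

print(survive)
print('Survive from 5 to 20:', round(to_age_20, 5))
```

Chaining survival shares this way over all ages is the basic idea behind a life table.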



In [14]:

    
# mortality overall 
url  = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url += '1_Indicators%20(Standard)/EXCEL_FILES/3_Mortality/'
url += 'WPP2015_MORT_F17_1_ABRIDGED_LIFE_TABLE_BOTH_SEXES.XLS'

cols = [2, 5, 6, 7, 9]
mort = pd.read_excel(url, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])
mort.tail(3)









    Out[14]:






  
    
      
       Major area, region, country or area *     Period  Age (x)  Age interval (n)  Probability of dying q(x,n)
59524                                  Tonga  2010-2015       75                 5                     0.279867
59525                                  Tonga  2010-2015       80                 5                     0.397735
59526                                  Tonga  2010-2015       85                15                          NaN



In [15]:

    
# change names 
names = list(mort)
m = mort.rename(columns={names[0]: 'Country', names[2]: 'Age', names[3]: 'Interval', names[4]: 'Mortality'})
m.head(3)

Comment. At this point, we need to pivot the data. That's not something we've done before, so take it as simply something we can do easily if we have to. We're going to do this twice to produce different graphs:

Compare countries for the same period.
Compare different periods for the same country.
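If pivoting is new to you, a tiny example may help. The sketch below starts from a "long" table with one row per country and age, and pivots it into a "wide" table with one column per country. The four mortality values come from the 2010-2015 table in this section; the names `long` and `wide` are ours.

```python
import pandas as pd

# long format: one row per (country, age) pair
long = pd.DataFrame({
    'Country':   ['China', 'China', 'Japan', 'Japan'],
    'Age':       [5, 10, 5, 10],
    'Mortality': [0.001756, 0.001286, 0.000462, 0.000437],
})

# wide format: ages down the rows, one column per country
wide = long.pivot(index='Age', columns='Country', values='Mortality')
print(wide)
```

The wide layout is what the `plot` method wants: it draws one line per column against the index.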



In [16]:

    
# compare countries for most recent period
countries = ['China', 'Japan', 'Germany', 'United States of America']
mt = m[m['Country'].isin(countries) & m['Interval'].isin([5]) & m['Period'].isin(['2010-2015'])] 
print('Dimensions:', mt.shape) 

mp = mt.pivot(index='Age', columns='Country', values='Mortality')  
mp.head(3)









    



Dimensions: (64, 5)






    Out[16]:






  
    
Country     China   Germany     Japan  United States of America
Age
5        0.001756  0.000397  0.000462                  0.000597
10       0.001286  0.000456  0.000437                  0.000714
15       0.001844  0.001300  0.001141                  0.002264



In [17]:

    
fig, ax = plt.subplots()
mp.plot(ax=ax, kind='line', alpha=0.5, linewidth=3, 
#        logy=True, 
        figsize=(6, 4))
ax.set_title('Mortality by age', fontsize=14, loc='left')
ax.set_ylabel('Mortality rate')    # linear scale; logy=True above is commented out
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)









    Out[17]:





<matplotlib.legend.Legend at 0x9191390>

Exercises.

What country's old people have the lowest mortality?
What do you see here for the US? Why is our life expectancy shorter?
What other countries would you like to see? Can you adapt the code to show them?
Anything else cross your mind?



In [ ]:



In [18]:

    
# compare periods for one country -- countries[0] is China 
mt = m[m['Country'].isin([countries[0]]) & m['Interval'].isin([5])] 
print('Dimensions:', mt.shape) 

mp = mt.pivot(index='Age', columns='Period', values='Mortality')  
# keep the first, middle, and last periods (columns selected by position)
mp = mp.iloc[:, [0, 6, 12]]
mp.head(3)









    



Dimensions: (208, 5)






    Out[18]:






  
    
Period  1950-1955  1980-1985  2010-2015
Age
5        0.043088   0.006487   0.001756
10       0.025641   0.003300   0.001286
15       0.026845   0.004172   0.001844



In [19]:

    
fig, ax = plt.subplots()
mp.plot(ax=ax, kind='line', alpha=0.5, linewidth=3, 
#        logy=True, 
        figsize=(6, 4))
ax.set_title('Mortality over time', fontsize=14, loc='left')
ax.set_ylabel('Mortality rate')    # linear scale; logy=True above is commented out
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)









    Out[19]:





<matplotlib.legend.Legend at 0xd7d7dd8>

Exercise. What do you see? What else would you like to know?
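One thing the figure hides on a linear scale is how large the proportional decline has been at young ages, where the rates are tiny. The sketch below computes the ratio of 1950-1955 to 2010-2015 mortality for China, using the three rows from the table above; the names `china` and `ratio` are ours.

```python
import pandas as pd

# China, 5-year probabilities of dying, copied from the table above
china = pd.DataFrame({'1950-1955': [0.043088, 0.025641, 0.026845],
                      '2010-2015': [0.001756, 0.001286, 0.001844]},
                     index=[5, 10, 15])

# how many times higher mortality was in 1950-1955
ratio = china['1950-1955'] / china['2010-2015']
print(ratio.round(1))
```

Ratios like these are what a log scale displays as vertical distances, which is why the `logy=True` option in the plotting cell can be worth turning on.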

Exercise. Repeat this graph for the United States. How does it compare?



In [ ]:



In [ ]:

Output of In [3] (Japan, millions of people):

           0-4       5-9     10-14     15-19     20-24     25-29     30-34     35-39
Year
2015  5.269038  5.398973  5.603638  5.960784  6.111768  6.843421  7.455687  8.345753
2055  4.271907  4.371016  4.458145  4.557117  4.685414  4.829637  5.013420  5.236482
2095  3.720541  3.789596  3.860493  3.938189  4.025477  4.106967  4.194272  4.290983

Output of In [4] (the same data, transposed):

Year       2015      2055      2095
0-4    5.269038  4.271907  3.720541
5-9    5.398973  4.371016  3.789596
10-14  5.603638  4.458145  3.860493


Output of In [15]:

   Country     Period  Age  Interval  Mortality
0    WORLD  1950-1955    0         1   0.141804
1    WORLD  1950-1955    1         4   0.085487
2    WORLD  1950-1955    5         5   0.031513
