In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_context("talk")
Page 12 of "The Functional Art: An introduction to information graphics and visualization" by Alberto Cairo presents a visualization of UN time series data on fertility rate (average number of children per woman) per country:
Figure 1.6 Highlighting the relevant, keeping the secondary in the background.
Let's try to reproduce this.
The visualization was made in 2012 but only covers data up to 2010. In theory, that should make the data easy to get, since it is historical. The series are now directly available as Excel spreadsheets; we'll just ignore the last bucket (2010-2015).
Pandas allows loading an Excel spreadsheet straight from a URL, but here we will download it first so we have a local copy.
In [3]:
!wget 'http://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/EXCEL_FILES/2_Fertility/WPP2015_FERT_F04_TOTAL_FERTILITY.XLS'
File FERT/4: Total fertility by major area, region and country, 1950-2100 (children per woman)
Estimates, 1950 - 2015
POP/DB/WPP/Rev.2015/FERT/F04
July 2015 - Copyright © 2015 by United Nations. All rights reserved
Suggested citation: United Nations, Department of Economic and Social Affairs, Population Division (2015). World Population Prospects: The 2015 Revision, DVD Edition.
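For anyone without `wget`, the same "download once, keep a local copy" step can be sketched with the standard library. The helper name `fetch_once` is made up for this note, and the sketch does no error handling or retries.

```python
import os
import urllib.request


def fetch_once(url, filename):
    """Download url to filename unless a local copy already exists.

    Hypothetical helper, not part of pandas or the original notebook.
    """
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
    return filename


# Intended use (left commented out so the cell does not hit the network):
# fetch_once(
#     'http://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/'
#     'EXCEL_FILES/2_Fertility/WPP2015_FERT_F04_TOTAL_FERTILITY.XLS',
#     'WPP2015_FERT_F04_TOTAL_FERTILITY.XLS',
# )
```

Re-running the cell then reuses the file on disk instead of downloading it again.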
In [2]:
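# The first 16 rows of the sheet are title and citation text; the header row follows
df = pd.read_excel('WPP2015_FERT_F04_TOTAL_FERTILITY.XLS', skiprows=16, index_col='Country code')
# Codes 900 and above are regions and aggregates, so keep only individual countries
df = df[df.index < 900]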
In [3]:
len(df)
Out[3]:
In [4]:
df.head()
Out[4]:
First problem... The book states on page 8:
-- "Using the filters the site offers, I asked for a table that included the more than 150 countries on which the UN has complete research."
Yet we have 201 countries (codes 900+ are regions) with complete data. We do not have an easy way to identify which countries have been added since the book's table was built. Still, let's move forward and prep our data.
In [5]:
df.rename(columns={df.columns[2]:'Description'}, inplace=True)
In [6]:
df.drop(df.columns[[0, 1, 3, 16]], axis=1, inplace=True) # drop what we don't need
In [7]:
df.head()
Out[7]:
In [8]:
highlight_countries = ['Niger','Yemen','India',
'Brazil','Norway','France','Sweden','United Kingdom',
'Spain','Italy','Germany','Japan', 'China'
]
In [9]:
# Subset only countries to highlight, transpose for timeseries
df_high = df[df.Description.isin(highlight_countries)].T[1:]
In [10]:
# Subset the rest of the countries, transpose for timeseries
df_bg = df[~df.Description.isin(highlight_countries)].T[1:]
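To see what the two subsetting cells produce, here is a minimal sketch on a toy table. The country names, codes and values are invented placeholders, not UN data, and the `.astype(float)` at the end is my addition: transposing a frame that mixes strings and numbers leaves object-typed columns, so converting back is a safe habit before plotting.

```python
import pandas as pd

# Toy stand-in for the UN table: one row per country, one column per period.
# All names, codes and numbers here are invented for illustration.
toy = pd.DataFrame(
    {
        "Description": ["Alpha", "Beta", "Gamma"],
        "1950-1955": [7.0, 2.5, 6.0],
        "1955-1960": [7.5, 2.0, 6.5],
    },
    index=[1, 2, 3],
)
wanted = ["Alpha", "Beta"]

mask = toy["Description"].isin(wanted)
# .T turns periods into rows; [1:] drops the 'Description' row.
toy_high = toy[mask].T[1:].astype(float)
toy_bg = toy[~mask].T[1:].astype(float)

print(toy_high)
print(toy_bg)
```

Note that after the transpose the columns are the country codes, not the names, which is why the labels later have to be looked up through `Description`.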
In [11]:
# background
ax = df_bg.plot(legend=False, color='k', alpha=0.02, figsize=(12,12))
ax.xaxis.tick_top()
# highlighted countries
df_high.plot(legend=False, ax=ax)
# replacement level line
ax.hlines(y=2.1, xmin=0, xmax=12, color='k', alpha=1, linestyle='dashed')
# Unweighted mean across all countries for each period
# (numeric_only=True skips the text 'Description' column)
df.mean(numeric_only=True).plot(ax=ax, color='k', label='World\naverage')
# labels for highlighted countries on the right side
for country in highlight_countries:
    ax.text(11.2, df[df.Description == country].values[0][12], country)
# start y axis at 1
ax.set_ylim(bottom=1)
Out[11]:
For one thing, the line for China doesn't look like the one in the book, which is concerning. The other issue is that some lines dip below Italy or Spain in 1995-2000 and in 2000-2005 (the majority in the Balkans), and those were not on the graph in the book, AFAICT:
In [12]:
df.describe()
Out[12]:
In [13]:
df[df['1995-2000']<1.25]
Out[13]:
In [14]:
df[df['2000-2005']<1.25]
Out[14]:
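The 1.25 cutoff in the two cells above was picked by eye. A rough alternative is to compare every non-highlighted country against the lowest highlighted value for a given period. The function name `below_highlighted` is made up for this sketch, and the toy rows are invented placeholders, not UN figures.

```python
import pandas as pd


def below_highlighted(frame, period, highlighted):
    """Non-highlighted rows whose value in `period` is under the highlighted minimum."""
    is_high = frame["Description"].isin(highlighted)
    floor = frame.loc[is_high, period].min()
    return frame.loc[~is_high & (frame[period] < floor), ["Description", period]]


# Invented toy data, only to show the shape of the result.
toy = pd.DataFrame(
    {
        "Description": ["Alpha", "Beta", "Gamma", "Delta"],
        "1995-2000": [1.5, 1.25, 1.0, 7.0],
    },
    index=[1, 2, 3, 4],
)

print(below_highlighted(toy, "1995-2000", ["Alpha", "Beta"]))
```

On the real frame the call would look like `below_highlighted(df, '1995-2000', highlight_countries)`, assuming `df` still has its `Description` column.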
The other thing that I really need to address is the labeling. Clearly we need a way to move labels up and down to make them readable. Collision detection, basically. I'm surprised this functionality doesn't exist, because I keep bumping into this problem. Usually, I can tweak the Y position by a few pixels, but in this specific case, there is no way to do that.
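A crude version of that nudging can be sketched in plain Python: sort the label positions, then sweep upward and push any label that sits too close to the one below it. This is only a one-directional sketch under my own assumptions (the name `spread_labels` and the `min_gap` parameter are invented here), not a real collision-detection layout.

```python
def spread_labels(positions, min_gap):
    """Return new y positions, in the input order, at least min_gap apart.

    Labels are only ever pushed upward, so a dense cluster drifts up
    rather than staying centred on its original values.
    """
    # Visit labels from lowest to highest original position.
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    adjusted = list(positions)
    previous = None
    for i in order:
        if previous is not None and adjusted[i] - previous < min_gap:
            adjusted[i] = previous + min_gap
        previous = adjusted[i]
    return adjusted


print(spread_labels([10, 11, 30, 10], 5))
```

The adjusted values could then be passed to `ax.text` in place of the raw last-period values, with `min_gap` expressed in data units (children per woman here).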
So, I guess I have a project for 2016...