In [1]:
# Configure Jupyter so figures appear in the notebook
%matplotlib inline
# Configure Jupyter to display the assigned value after an assignment
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
# import functions from the modsim.py module
from modsim import *
from pandas import read_html
In [2]:
filename = 'data/World_population_estimates.html'
tables = read_html(filename, header=0, index_col=0, decimal='M')
table2 = tables[2]
table2.columns = ['census', 'prb', 'un', 'maddison',
                  'hyde', 'tanton', 'biraben', 'mj',
                  'thomlinson', 'durand', 'clark']
table2.shape
In [3]:
census = table2.census / 1e9
census.shape
In [4]:
un = table2.un / 1e9
un.shape
A DataFrame contains index, which labels the rows. It is an Int64Index, which is similar to a NumPy array.
In [5]:
table2.index
And columns, which labels the columns.
In [6]:
table2.columns
And values, which is an array of values.
In [7]:
table2.values
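As a minimal sketch of these three attributes, the following cell builds a small DataFrame by hand. The name toy and its numbers are made up for illustration; they are not taken from the Wikipedia tables.

```python
import pandas as pd

# Hypothetical data: two columns of made-up numbers, indexed by year
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                    'b': [10.0, 20.0, 30.0]},
                   index=[1950, 1951, 1952])

print(toy.index.tolist())    # row labels
print(toy.columns.tolist())  # column labels
print(toy.values.shape)      # 2-D NumPy array: (rows, columns)
```

Because values is a plain NumPy array, it carries no labels; the row and column labels live only in index and columns.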
A Series does not have columns, but it does have name.
In [8]:
census.name
It contains values, which is an array.
In [9]:
census.values
And it contains index:
In [10]:
census.index
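To see how a Series relates to the DataFrame it came from, here is a small sketch using a hand-built DataFrame. The names toy and col, and all of the numbers, are hypothetical.

```python
import pandas as pd

# Hypothetical data, not from the Wikipedia tables
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                    'b': [10.0, 20.0, 30.0]},
                   index=[1950, 1951, 1952])

col = toy.a                          # selecting a column yields a Series
print(col.name)                      # the column label becomes the Series name
print(col.index.equals(toy.index))   # the Series keeps the DataFrame's row labels
print(col.values)                    # 1-D NumPy array of the column's values
```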
If you ever wonder what kind of object a variable refers to, you can use the type function. The result indicates what type the object is, and the module where that type is defined.
DataFrame, Int64Index, Index, and Series are defined by Pandas.
ndarray is defined by NumPy.
In [11]:
type(table2)
In [12]:
type(table2.index)
In [13]:
type(table2.columns)
In [14]:
type(table2.values)
In [15]:
type(census)
In [16]:
type(census.index)
In [17]:
type(census.values)
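If you want the type name and its defining module as separate strings, one option is to read the __name__ and __module__ attributes of the result. This is a sketch on a made-up Series named s, not part of the population data.

```python
import numpy as np
import pandas as pd

# Hypothetical object used only to demonstrate type()
s = pd.Series([1.0, 2.0], name='x')

for obj in [s, s.index, s.values]:
    t = type(obj)
    # __module__ is where the type is defined; __name__ is the type's name
    print(t.__module__, t.__name__)
```

The module names for the Pandas types begin with pandas, and the one for ndarray is numpy.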
The following exercise provides a chance to practice what you have learned so far, and maybe develop a different growth model. If you feel comfortable with what we have done so far, you might want to give it a try.
Optional Exercise: On the Wikipedia page about world population estimates, the first table contains estimates for prehistoric populations. The following cells process this table and plot some of the results.
In [18]:
filename = 'data/World_population_estimates.html'
tables = read_html(filename, header=0, index_col=0, decimal='M')
len(tables)
Select tables[1], which is the second table on the page.
In [19]:
table1 = tables[1]
table1.head()
Not all agencies and researchers provided estimates for the same dates. Again, NaN is the special value that indicates missing data.
In [20]:
table1.tail()
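To check how much data is missing, you can count the entries that are not NaN. The sketch below uses a small made-up DataFrame named toy; the same calls should work on table1.

```python
import numpy as np
import pandas as pd

# Hypothetical estimates with gaps; NaN marks the missing entries
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': [np.nan, np.nan, 30.0]},
                   index=[-10000, -5000, 1])

print(toy.isna())         # True where a value is missing
print(toy.notna().sum())  # number of available estimates per column
```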
Again, we'll replace the long column names with more convenient abbreviations.
In [21]:
table1.columns = ['PRB', 'UN', 'Maddison', 'HYDE', 'Tanton',
                  'Biraben', 'McEvedy & Jones', 'Thomlinson', 'Durand', 'Clark']
Some of the estimates are in a form Pandas doesn't recognize as numbers, but we can coerce them to be numeric.
In [22]:
for col in table1.columns:
    table1[col] = pd.to_numeric(table1[col], errors='coerce')
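To see what coercion does, here is a sketch on a made-up Series of strings. The names raw and clean and the sample values are hypothetical; they only stand in for the kinds of entries that fail to parse.

```python
import pandas as pd

# Hypothetical column mixing plain numbers with text Pandas can't parse
raw = pd.Series(['170', '2-20', '400', 'n/a'])

# errors='coerce' turns anything unparseable into NaN instead of raising
clean = pd.to_numeric(raw, errors='coerce')
print(clean)
```

Coercion discards information: a range such as '2-20' becomes NaN rather than a midpoint, so those estimates simply drop out of the plots.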
Here are the results. Notice that we are working in millions now, not billions.
In [23]:
table1.plot()
decorate(xlim=[-10000, 2000], xlabel='Year',
         ylabel='World population (millions)',
         title='Prehistoric population estimates')
plt.legend(fontsize='small');
We can use xlim to zoom in on everything after Year 0.
In [24]:
table1.plot()
decorate(xlim=[0, 2000], xlabel='Year',
         ylabel='World population (millions)',
         title='CE population estimates')
plt.legend(fontsize='small');
See if you can find a model that fits these data well from Year 0 to 1950.
How well does your best model predict actual population growth from 1950 to the present?
In [25]:
# Solution goes here
In [26]:
# Solution goes here