Several of the notebooks we've already explored loaded datasets into a Python pandas dataframe for analysis. In some cases local copies of these datasets had been previously saved to disk; in a few others we read the data directly from an online source via a data API. This section explains how that is done in a bit more depth. One advantage of reading in the data this way is that it allows would-be users to modify and extend the analysis, perhaps focusing on different time periods or adding other variables of interest.
Easy-to-use Python wrappers have been written for the data APIs of the World Bank and several other online data providers (including FRED, Eurostat, and many others). The pandas-datareader library gives access to the World Bank's databases as well as these other sources.
If you haven't already installed the pandas-datareader library, you can do so directly from a Jupyter notebook code cell:
!pip install pandas-datareader
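If you manage packages with conda instead, the library is also available from conda-forge; something like this should work:
!conda install -c conda-forge pandas-datareader -y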
Once the library is installed we can load it, along with the other libraries we'll use:
In [1]:
%matplotlib inline
import seaborn as sns                   # statistical plots
import warnings
import numpy as np                      # log transforms
import statsmodels.formula.api as smf   # OLS regression
import datetime as dt                   # date ranges for wb.download
In [2]:
from pandas_datareader import wb
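As an aside, the same library reads from several other providers through its general DataReader interface. Here is a minimal sketch pulling one series from FRED, using 'GDPC1' (the FRED code for real US GDP) purely as an illustration:
# Minimal sketch: the same library also reads from FRED.
import pandas_datareader.data as web
import datetime as dt

usgdp = web.DataReader('GDPC1', 'fred',
                       start=dt.datetime(1980, 1, 1),
                       end=dt.datetime(2016, 1, 1))
usgdp.head()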
Our earlier analysis of the Harris-Todaro migration model suggested that policies designed to favor certain sectors or labor groups can distort rural-to-urban migration. The World Bank data lets us look for evidence of such 'urban bias'.
Let's search for indicators (and their identification codes) relating to GDP per capita and urban population share. We could look these up in a book or on the website http://data.worldbank.org/, but we can also search for keywords directly.
First let's search for series having to do with GDP per capita:
In [3]:
wb.search('gdp.*capita.*const')[['id','name']]
Out[3]:
We will use NY.GDP.PCAP.KD for GDP per capita (constant 2010 US$).
You can also browse and search for data series on the World Bank's DataBank page at http://databank.worldbank.org/data/, and then find the 'id' for any series you are interested in under its 'metadata' section.
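If you've found an id that way, you can double-check it from the notebook; wb.search takes a field argument that searches the id column instead of the series name (a small sketch):
# Sketch: confirm an id found on DataBank by searching the id column.
wb.search('NY.GDP.PCAP.KD', field='id')[['id', 'name']]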
Now let's look for data on urban population share:
In [4]:
wb.search('Urban Population')[['id','name']].tail()
Out[4]:
Let's pick the ones we like and store their ids in a list; we'll rename them to shorter variable names once the data is loaded into a dataframe:
In [5]:
indicators = ['NY.GDP.PCAP.KD', 'SP.URB.TOTL.IN.ZS']
Since we are interested in exploring the extent of 'urban bias' in some countries, let's load data from 1980, which was toward the end of the era of import-substituting industrialization, when urban-biased policies were claimed to be most pronounced.
In [6]:
dat = wb.download(indicator=indicators, country='all', start=1980, end=1980)
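One caveat: with country='all' the download includes regional aggregates (like 'Arab World' or 'World') alongside individual countries. If you want to drop them, here is one sketch using the wb.get_countries() helper, whose region column flags aggregates (dat_countries is just an illustrative name for the filtered copy):
# Sketch: keep only actual countries by dropping regional aggregates.
# wb.get_countries() returns country metadata; aggregate rows have
# region == 'Aggregates'.
country_info = wb.get_countries()
non_aggs = country_info[country_info.region != 'Aggregates']['name']
dat_countries = dat[dat.index.get_level_values('country').isin(non_aggs)]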
In [7]:
dat.columns
Out[7]:
Let's rename the columns to something shorter. If we then plot and regress log GDP per capita against urbanization, we get a pretty tight fit:
In [8]:
dat.columns = ['gdppc', 'urbpct']
dat['lngpc'] = np.log(dat.gdppc)
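An equivalent approach uses DataFrame.rename with a dictionary, which makes the mapping explicit and doesn't depend on column order:
# Alternative sketch: rename via a dictionary rather than by position.
dat = dat.rename(columns={'NY.GDP.PCAP.KD': 'gdppc',
                          'SP.URB.TOTL.IN.ZS': 'urbpct'})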
In [9]:
g = sns.jointplot(x="lngpc", y="urbpct", data=dat, kind="reg",
                  color="b", height=7)
That is a pretty tight fit: urbanization rises with income per capita, but there are several middle-income country outliers that have considerably higher urbanization than would be predicted. Let's look at the regression line.
In [10]:
mod = smf.ols("urbpct ~ lngpc", dat).fit()
In [11]:
print(mod.summary())
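If we just want the estimated coefficients and fit statistics rather than the full table, the fitted-results object exposes them directly:
# The fitted results expose the pieces directly:
mod.params     # intercept and slope on lngpc
mod.rsquared   # R-squared of the fit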
Now let's look at the list of countries sorted by the size of their residuals from this regression. Countries with the largest residuals had urbanization in excess of what the model predicts from their 1980 level of income per capita.
Here is the sorted list of the top 15 outliers.
In [12]:
mod.resid.sort_values(ascending=False).head(15)
Out[12]:
This is of course only suggestive, but (leaving aside city-states like Singapore and Hong Kong) the list is dominated by Latin American countries such as Chile, Argentina, and Peru, which in addition to having legacies of heavy political centralization also pursued ISI policies in the 1960s and 70s that many would associate with urban-biased policies.
Let's download longer time series for a few of these countries (plus the United States as a comparison) and see how their urbanization paths evolved:
In [13]:
countries = ['CHL', 'USA', 'ARG']
start, end = dt.datetime(1950, 1, 1), dt.datetime(2016, 1, 1)
dat = wb.download(
    indicator=indicators,
    country=countries,
    start=start,
    end=end).dropna()
Let's use shorter column names:
In [14]:
dat.columns
Out[14]:
In [15]:
dat.columns = ['gdppc', 'urb']
In [16]:
dat.head()
Out[16]:
Notice this has a two-level multi-index: the outer level is named 'country' and the inner level 'year'.
We can pull out the data for a single country using the .xs (cross-section) method:
In [17]:
dat.xs('Chile', level='country').head(3)
Out[17]:
(Note we could also have used dat.loc['Chile'].head().)
And we can pull a 'year' level cross section like this:
In [18]:
dat.xs('2007', level='year').head()
Out[18]:
In each case what is returned is a dataframe restricted to the selected cross-section. We can in turn specify which column(s) we want from it:
In [19]:
dat.loc['Chile']['gdppc'].head()
Out[19]:
To get one column per country, which will make plotting easy, we can unstack the 'country' level of the row index into the columns:
In [20]:
datyr = dat.unstack(level='country')
datyr.head()
Out[20]:
We can now easily index, say, a 1962 cross-section of GDP per capita like so:
In [21]:
datyr.xs('1962')['gdppc']
Out[21]:
We'd get the same result from datyr.loc['1962']['gdppc'].
We can also easily plot all countries:
In [22]:
datyr['urb'].plot(kind='line');
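We could do the same for GDP per capita; taking logs first makes the growth paths easier to compare (a small sketch):
# Sketch: plot log GDP per capita, one line per country.
np.log(datyr['gdppc']).plot(kind='line');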