Data APIs and pandas operations

Several of the notebooks we've already explored loaded datasets into a python pandas dataframe for analysis. Local copies of some of these datasets had been previously saved to disk in a few cases we read in the data directly from an online sources via a data API. This section explains how that is done in a bit more depth. Some of the possible advantages of reading in the data this way is that it allows would-be users to modify and extend the analysis, perhaps focusing on different time-periods or adding in other variables of interest.

Easy to use python wrappers for data APIs have been written for the World Bank and several other online data providers (including FRED, Eurostat, and many many others). The pandas-datareader library allows access to several databases from the World Bank's datasets and other sources.

If you haven't already installed the pandas-datareader library you can do so directly from a jupyter notebook code cell:

!pip install pandas-datareader

Once the library is in installed we can load it as:


In [1]:
%matplotlib inline
import seaborn as sns
import warnings
import numpy as np
import statsmodels.formula.api as smf
import datetime as dt

In [2]:
from pandas_datareader import wb

Data on urban bias

Our earlier analysis of the Harris_Todaro migration model suggested that policies designed to favor certain sectors or labor groups

Let's search for indicators (and their identification codes) relating to GDP per capita and urban population share. We could look these up in a book or from the website http://data.worldbank.org/ but we can also search for keywords directly.

First lets search for series having to do with gdp per capita


In [3]:
wb.search('gdp.*capita.*const')[['id','name']]


Out[3]:
id name
685 6.0.GDPpc_constant GDP per capita, PPP (constant 2011 internation...
7588 NY.GDP.PCAP.KD GDP per capita (constant 2010 US$)
7590 NY.GDP.PCAP.KN GDP per capita (constant LCU)
7592 NY.GDP.PCAP.PP.KD GDP per capita, PPP (constant 2011 internation...

We will use NY.GDP.PCAP.KD for GDP per capita (constant 2010 US$).

You can also first browse and search for data series from the World Bank's DataBank page at http://databank.worldbank.org/data/. Then find the 'id' for the series that you are interested in in the 'metadata' section from the webpage

Now let's look for data on urban population share:


In [4]:
wb.search('Urban Population')[['id','name']].tail()


Out[4]:
id name
9999 SP.URB.GROW Urban population growth (annual %)
10000 SP.URB.TOTL Urban population
10001 SP.URB.TOTL.FE.ZS Urban population, female (% of total)
10002 SP.URB.TOTL.IN.ZS Urban population (% of total)
10003 SP.URB.TOTL.MA.ZS Urban population, male (% of total)

Let's use the ones we like but use a python dictionary to rename these to shorter variable names when we load the data into a python dataframe:


In [5]:
indicators = ['NY.GDP.PCAP.KD', 'SP.URB.TOTL.IN.ZS']

Since we are interested in exploring the extent of 'urban bias' in some countries, let's load data from 1980 which was toward the end of the era of import-substituting industrialization when urban-biased policies were claimed to be most pronounced.


In [6]:
dat = wb.download(indicator=indicators, country = 'all', start=1980, end=1980)

In [7]:
dat.columns


Out[7]:
Index(['NY.GDP.PCAP.KD', 'SP.URB.TOTL.IN.ZS'], dtype='object')

Let's rename the columns to something shorter and then plot and regress log gdp per capita against urban extent we get a pretty tight fit:


In [8]:
dat.columns = [['gdppc', 'urbpct']]
dat['lngpc']  = np.log(dat.gdppc)

In [9]:
g = sns.jointplot("lngpc", "urbpct", data=dat, kind="reg",
                  color ="b", size=7)


That is a pretty tight fit: urbanization rises with income per-capita, but there are several middle income country outliersthat have considerably higher urbanization than would be predicted. Let's look at the regression line.


In [10]:
mod = smf.ols("urbpct ~ lngpc", dat).fit()

In [11]:
print(mod.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 urbpct   R-squared:                       0.744
Model:                            OLS   Adj. R-squared:                  0.743
Method:                 Least Squares   F-statistic:                     494.3
Date:                Wed, 03 May 2017   Prob (F-statistic):           3.43e-52
Time:                        15:26:52   Log-Likelihood:                -670.94
No. Observations:                 172   AIC:                             1346.
Df Residuals:                     170   BIC:                             1352.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -68.0209      5.185    -13.120      0.000     -78.255     -57.787
lngpc         13.9583      0.628     22.233      0.000      12.719      15.198
==============================================================================
Omnibus:                        8.555   Durbin-Watson:                   2.204
Prob(Omnibus):                  0.014   Jarque-Bera (JB):               17.062
Skew:                          -0.027   Prob(JB):                     0.000197
Kurtosis:                       4.542   Cond. No.                         47.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now let's just look at a list of countries sorted by the size of their residuals in this regression line. Countries with the largest residuals had urbanization in excess of what the model predicts from their 1980 level of income per capita.

Here is the sorted list of top 15 outliers.


In [12]:
mod.resid.sort_values(ascending=False).head(15)


Out[12]:
country               year
Singapore             1980    35.469714
Chile                 1980    30.563014
Malta                 1980    30.448467
Hong Kong SAR, China  1980    29.957702
Uruguay               1980    29.143223
Argentina             1980    25.369855
Cuba                  1980    24.148529
Belgium               1980    20.731859
Israel                1980    20.456625
Iraq                  1980    20.262678
Peru                  1980    17.810693
Myanmar               1980    17.719561
Bulgaria              1980    17.360914
Jordan                1980    15.706878
Colombia              1980    15.258744
dtype: float64

This is of course only suggestive but (leaving aside the island states like Singapore and Hong-Kong) the list is dominated by southern cone countries such as Chile, Argentina and Peru which in addition to having legacies of heavy political centralization also pursued ISI policies in the 60s and 70s that many would associate with urban biased policies.

Panel data

Very often we want data on several indicators and a whole group of countries over a number of years. we could also have used datetime format dates:


In [13]:
countries = ['CHL', 'USA', 'ARG']
start, end = dt.datetime(1950, 1, 1), dt.datetime(2016, 1, 1)
dat = wb.download(
    indicator=indicators, 
    country = countries, 
    start=start, 
    end=end).dropna()

Lets use shorter column names


In [14]:
dat.columns


Out[14]:
Index(['NY.GDP.PCAP.KD', 'SP.URB.TOTL.IN.ZS'], dtype='object')

In [15]:
dat.columns = [['gdppc', 'urb']]

In [16]:
dat.head()


Out[16]:
gdppc urb
country year
Argentina 2015 10501.660269 91.751
2014 10334.780146 91.604
2013 10711.229530 91.452
2012 10558.265365 91.295
2011 10780.342508 91.133

Notice this has a two-level multi-index. The outer level is named 'country' and the inner level is 'year'

We can pull out group data for a single country like this using the .xs or cross section method.


In [17]:
dat.xs('Chile',level='country').head(3)


Out[17]:
gdppc urb
year
2015 14660.505335 89.530
2014 14479.763258 89.356
2013 14364.140970 89.175

(Note we could have also used dat.loc['Chile'].head())

And we can pull a 'year' level cross section like this:


In [18]:
dat.xs('2007', level='year').head()


Out[18]:
gdppc urb
country
Argentina 9830.759871 90.445
Chile 12223.484611 87.926
United States 49979.533843 80.269

Note that what was returned was a dataframe with the data just for our selected country. We can in turn further specify what column(s) from this we want:


In [19]:
dat.loc['Chile']['gdppc'].head()


Out[19]:
year
2015    14660.505335
2014    14479.763258
2013    14364.140970
2012    13963.665402
2011    13385.131216
Name: gdppc, dtype: float64

Unstack data

The unstack method turns index values into column names while stack method converts column names to index values. Here we apply unstack.


In [20]:
datyr = dat.unstack(level='country')
datyr.head()


Out[20]:
gdppc urb
country Argentina Chile United States Argentina Chile United States
year
1960 5605.191722 3630.391789 17036.885170 73.611 67.836 69.996
1961 5815.233002 3692.112435 17142.193767 74.217 68.660 70.377
1962 5675.060043 3796.728684 17910.278790 74.767 69.435 70.757
1963 5290.764999 3939.263166 18431.158404 75.309 70.200 71.134
1964 5738.598474 3955.026738 19231.171859 75.844 70.955 71.508

We can now easily index a 2015 cross-section of GDP per capita like so:


In [21]:
datyr.xs('1962')['gdppc']


Out[21]:
country
Argentina         5675.060043
Chile             3796.728684
United States    17910.278790
Name: 1962, dtype: float64

We'd get same result from datyr.loc['2015']['gdppc']

We can also easily plot all countries:


In [22]:
datyr['urb'].plot(kind='line');