Several of the notebooks we've already explored loaded datasets into a Python pandas dataframe for analysis. In some cases local copies of these datasets had been previously saved to disk; in a few others we read the data directly from an online source via a data API. This section explains how that is done in a bit more depth. One advantage of reading in the data this way is that it allows would-be users to modify and extend the analysis, perhaps focusing on different time periods or adding other variables of interest.
Easy-to-use Python wrappers have been written for the data APIs of the World Bank and several other online data providers (including FRED, Eurostat, and many others). The pandas-datareader library gives access to the World Bank's databases as well as these other sources.
If you haven't already installed the pandas-datareader library, you can do so directly from a Jupyter notebook code cell:
!pip install pandas-datareader
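If you manage packages with conda instead, the library is also available from conda-forge; something like this should work:
!conda install -c conda-forge pandas-datareader -y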
Once the library is installed we can load it, along with the other libraries we'll use:
In [1]:
%matplotlib inline
import seaborn as sns                   # statistical plots
import warnings
import numpy as np                      # log transforms
import statsmodels.formula.api as smf   # OLS regression
import datetime as dt                   # date ranges for wb.download
In [2]:
from pandas_datareader import wb
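As an aside, the same library reads from several other providers through its general DataReader interface. Here is a minimal sketch pulling one series from FRED, using 'GDPC1' (the FRED code for real US GDP) purely as an illustration:
# Minimal sketch: the same library also reads from FRED.
import pandas_datareader.data as web
import datetime as dt

usgdp = web.DataReader('GDPC1', 'fred',
                       start=dt.datetime(1980, 1, 1),
                       end=dt.datetime(2016, 1, 1))
usgdp.head()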
Our earlier analysis of the Harris-Todaro migration model suggested that policies designed to favor certain sectors or labor groups can distort rural-to-urban migration. The World Bank data lets us look for evidence of such 'urban bias'.
Let's search for indicators (and their identification codes) relating to GDP per capita and urban population share. We could look these up in a book or on the website http://data.worldbank.org/, but we can also search for keywords directly.
First let's search for series having to do with GDP per capita:
In [3]:
wb.search('gdp.*capita.*const')[['id','name']]
Out[3]:
We will use NY.GDP.PCAP.KD for GDP per capita (constant 2010 US$).
You can also browse and search for data series on the World Bank's DataBank page at http://databank.worldbank.org/data/, and then find the 'id' for any series you are interested in under its 'metadata' section.
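If you've found an id that way, you can double-check it from the notebook; wb.search takes a field argument that searches the id column instead of the series name (a small sketch):
# Sketch: confirm an id found on DataBank by searching the id column.
wb.search('NY.GDP.PCAP.KD', field='id')[['id', 'name']]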
Now let's look for data on urban population share:
In [4]:
wb.search('Urban Population')[['id','name']].tail()
Out[4]:
Let's pick the ones we like and store their ids in a list; we'll rename them to shorter variable names once the data is loaded into a dataframe:
In [5]:
indicators = ['NY.GDP.PCAP.KD', 'SP.URB.TOTL.IN.ZS']
Since we are interested in exploring the extent of 'urban bias' in some countries, let's load data from 1980, which was toward the end of the era of import-substituting industrialization, when urban-biased policies were claimed to be most pronounced.
In [6]:
dat = wb.download(indicator=indicators, country='all', start=1980, end=1980)
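One caveat: with country='all' the download includes regional aggregates (like 'Arab World' or 'World') alongside individual countries. If you want to drop them, here is one sketch using the wb.get_countries() helper, whose region column flags aggregates (dat_countries is just an illustrative name for the filtered copy):
# Sketch: keep only actual countries by dropping regional aggregates.
# wb.get_countries() returns country metadata; aggregate rows have
# region == 'Aggregates'.
country_info = wb.get_countries()
non_aggs = country_info[country_info.region != 'Aggregates']['name']
dat_countries = dat[dat.index.get_level_values('country').isin(non_aggs)]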
In [7]:
dat.columns
Out[7]:
Let's rename the columns to something shorter. If we then plot and regress log GDP per capita against urbanization, we get a pretty tight fit:
In [8]:
dat.columns = ['gdppc', 'urbpct']
dat['lngpc'] = np.log(dat.gdppc)
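An equivalent approach uses DataFrame.rename with a dictionary, which makes the mapping explicit and doesn't depend on column order:
# Alternative sketch: rename via a dictionary rather than by position.
dat = dat.rename(columns={'NY.GDP.PCAP.KD': 'gdppc',
                          'SP.URB.TOTL.IN.ZS': 'urbpct'})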
In [9]:
g = sns.jointplot(x="lngpc", y="urbpct", data=dat, kind="reg",
                  color="b", height=7)
That is a pretty tight fit: urbanization rises with income per capita, but there are several middle-income country outliers that have considerably higher urbanization than would be predicted. Let's look at the regression line.
In [10]:
mod = smf.ols("urbpct ~ lngpc", dat).fit()
In [11]:
print(mod.summary())
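If we just want the estimated coefficients and fit statistics rather than the full table, the fitted-results object exposes them directly:
# The fitted results expose the pieces directly:
mod.params     # intercept and slope on lngpc
mod.rsquared   # R-squared of the fit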
Now let's look at the list of countries sorted by the size of their residuals from this regression. Countries with the largest residuals had urbanization in excess of what the model predicts from their 1980 level of income per capita.
Here is the sorted list of the top 15 outliers.
In [12]:
mod.resid.sort_values(ascending=False).head(15)
Out[12]:
This is of course only suggestive, but (leaving aside city-states like Singapore and Hong Kong) the list is dominated by Latin American countries such as Chile, Argentina, and Peru, which in addition to having legacies of heavy political centralization also pursued ISI policies in the 1960s and 70s that many would associate with urban-biased policies.
Let's download longer time series for a few of these countries (plus the United States as a comparison) and see how their urbanization paths evolved:
In [13]:
countries = ['CHL', 'USA', 'ARG']
start, end = dt.datetime(1950, 1, 1), dt.datetime(2016, 1, 1)
dat = wb.download(
    indicator=indicators,
    country=countries,
    start=start,
    end=end).dropna()
Let's use shorter column names:
In [14]:
dat.columns
Out[14]:
In [15]:
dat.columns = ['gdppc', 'urb']
In [16]:
dat.head()
Out[16]:
Notice this has a two-level multi-index: the outer level is named 'country' and the inner level 'year'.
We can pull out the data for a single country using the .xs (cross-section) method:
In [17]:
dat.xs('Chile', level='country').head(3)
Out[17]:
(Note we could also have used dat.loc['Chile'].head().)
And we can pull a 'year' level cross section like this:
In [18]:
dat.xs('2007', level='year').head()
Out[18]:
In each case what is returned is a dataframe restricted to the selected cross-section. We can in turn specify which column(s) we want from it:
In [19]:
dat.loc['Chile']['gdppc'].head()
Out[19]:
To get one column per country, which will make plotting easy, we can unstack the 'country' level of the row index into the columns:
In [20]:
datyr = dat.unstack(level='country')
datyr.head()
Out[20]:
We can now easily index, say, a 1962 cross-section of GDP per capita like so:
In [21]:
datyr.xs('1962')['gdppc']
Out[21]:
We'd get the same result from datyr.loc['1962']['gdppc'].
We can also easily plot all countries:
In [22]:
datyr['urb'].plot(kind='line');
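We could do the same for GDP per capita; taking logs first makes the growth paths easier to compare (a small sketch):
# Sketch: plot log GDP per capita, one line per country.
np.log(datyr['gdppc']).plot(kind='line');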