First, we need to import all the necessary libraries and set up some environment variables.
In [1]:
import re
import requests
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
import statsmodels.formula.api as sm
sns.set_context('talk')
pd.set_option('float_format', '{:6.2f}'.format)
%matplotlib inline
Load the zip file from the web and save it to your hard drive.
In [2]:
url = 'http://databank.worldbank.org/data/download/Edstats_csv.zip'
path = '../data/WorldBank/Edstats_csv.zip'
response = requests.get(url)
with open(path, "wb") as file:
file.write(response.content)
Show contents of the zip file.
In [3]:
zf = zipfile.ZipFile(path)
files = zf.namelist()
print(files)
Read csv-formatted data directly from the zip file into pandas DataFrame. Also rename some columns for prettier output.
In [4]:
data = pd.read_csv(zf.open(files[0]))
series = pd.read_csv(zf.open(files[2]))
series.rename(columns={series.columns[0]: 'Series Code'}, inplace=True)
data.rename(columns={data.columns[0]: 'Country Name'}, inplace=True)
print(series.columns)
Show unique values of the Topic
column.
In [5]:
print(series['Topic'].unique())
Leave only those rows that have Expenditures
in the column Topic
. Next, leave only those that contain PPP
in the Indicator Name
column values. Finally, create a dictionary with a pair of variable key and its meaningful name.
In [6]:
subset = series.query("Topic == 'Expenditures'")[['Series Code', 'Indicator Name']]
subset = subset[subset['Indicator Name'].str.contains('PPP')]
print(subset.values)
xvar = {'UIS.XUNIT.PPP.1.FSGOV': 'Expenditure per student'}
Do the same for Attainment
among Topic
values and slightly more involved subset of Indicator Name
. Here we require that it contains both strings, with primary schooling
and 15
.
In [7]:
subset = series.query("Topic == 'Attainment'")[['Series Code', 'Indicator Name']]
subset = subset[subset['Indicator Name'].str.contains('(?=.*with primary schooling)(?=.*15)')]
print(subset.values)
yvar = {'BAR.PRM.CMPT.15UP.ZS': 'Pct with schooling'}
Now show all column names in the primary data set.
In [8]:
print(data.columns)
Combine two dictionaries into one.
In [9]:
renames = xvar.copy()
renames.update(yvar)
print(renames)
Subset the data to include only three interesting columns that we have found above and only for the year 2010.
In [10]:
cols = ['Country Name', 'Indicator Code', '2010']
data_sub = data.ix[data['Indicator Code'].isin(renames.keys()), cols].dropna()
data_sub.replace({'Indicator Code': renames}, inplace=True)
data_sub.set_index(cols[:2], inplace=True)
data_sub = data_sub[cols[-1]].unstack(cols[1]).dropna()
data_sub.columns.name = 'Indicator'
data_sub.index.name = 'Country'
print(data_sub.head())
Export data to Excel.
In [11]:
data_sub.to_excel('../data/WorldBank/education.xlsx', sheet_name='data')
Now suppose we already have the data saved in the Excel file. Let's read it from scratch into pandas DataFrame.
In [12]:
education = pd.read_excel('../data/WorldBank/education.xlsx', sheet_name='data', index_col=0)
print(education.head())
Let's see how percentage of educated population depends on government expenditures on primary students. Also, save the picture to the pdf file.
In [13]:
education['Expenditure per student (log)'] = np.log(education['Expenditure per student'])
fig = plt.figure(figsize=(8, 6))
sns.regplot(x='Expenditure per student (log)', y='Pct with schooling',
data=education, ax=fig.gca())
plt.savefig('../plots/education.pdf')
plt.show()
To be more precise we can quantify the effect of expenditures on schooling via simple OLS regression.
In [14]:
formula = 'Q("Pct with schooling") ~ np.log(Q("Expenditure per student"))'
result = sm.ols(formula=formula, data=education).fit()
print(result.summary())
And save the key result to the LaTeX table.
In [15]:
out = pd.DataFrame({'Parameter': result.params, 't-stat': result.tvalues})
out.to_latex('../tables/education_ols.tex')
print(out)