In [1]:
import pandas as pd
import numpy as np
import pickle
from copy import deepcopy

from sqlalchemy_conn import engine

Getting data from database

I set up a Postgres database on Amazon Web Services and used a pandasql engine to write SQL queries directly from Python. sqlalchemy_conn is script containing the connection details and includes my hostname and password to connect to the database.

I was most interested in goals 1, 5, 6, and 8.

  • Goal 1: Eradicate extreme poverty and hunger
  • Goal 5: Improve maternal health
  • Goal 6: Combat HIV/AIDS, malaria, and other diseases
  • Goal 8: Develop a global partnership for development

I first queried the database to see the specific targets associated with each of these goals.


In [2]:
query = 'SELECT DISTINCT targetname FROM undata WHERE goalid = 1'
targets1 =  pd.read_sql(query, engine)
for target in targets1['targetname']:
    print target


Target 1.B: Achieve full and productive employment and decent work for all, including women and young people
Target 1.A: Halve, between 1990 and 2015, the proportion of people whose income is less than one dollar a day
Target 1.C: Halve, between 1990 and 2015, the proportion of people who suffer from hunger

In [3]:
query = 'SELECT DISTINCT targetname FROM undata WHERE goalid = 5'
targets5 =  pd.read_sql(query, engine)
for target in targets5['targetname']:
    print target


Target 5.A: Reduce by three quarters, between 1990 and 2015, the maternal mortality ratio
Target 5.B: Achieve, by 2015, universal access to reproductive health

In [4]:
query = 'SELECT DISTINCT targetname FROM undata WHERE goalid = 6'
targets6 =  pd.read_sql(query, engine)
for target in targets6['targetname']:
    print target


Target 6.B: Achieve, by 2010, universal access to treatment for HIV/AIDS for all those who need it
Target 6.C: Have halted by 2015 and begun to reverse the incidence of malaria and other major diseases
Target 6.A: Have halted by 2015 and begun to reverse the spread of HIV/AIDS

In [5]:
query = 'SELECT DISTINCT targetname FROM undata WHERE goalid = 8'
targets8 =  pd.read_sql(query, engine)
for target in targets8['targetname']:
    print target


Target 8.B: Address the special needs of the least developed countries
Target 8.A: Develop further an open, rule-based, predictable, non-discriminatory trading and financial system
Target 8.C: Address the special needs of landlocked countries and small island developing States
Target 8.D: Deal comprehensively with the debt problems of developing countries through national and international measures in order to make debt sustainable in the long term
Target 8.F: In cooperation with the private sector, make available the benefits of new technologies, especially information and communications

Selecting targets to investigate & series to use

I was particularly interested in Goal 6 and Target 6.A: Have halted by 2015 and begun to reverse the spread of HIV/AIDS. My questions:

  • What progress has been made in halting the spread of HIV/AIDS?
  • What factors may have contributed to this progress?

I felt that there may be data associated with goals 1, 5, and 8 which could also help me answer these questions. I wanted to examine the relationships between the reduction of HIV/AIDS with a reduction in poverty, promoting maternal health, and development assistance.

Next, I queried the database to see which series were available, keeping note of the target id for series which seemed like they might help answer my questions.


In [6]:
# PRINT FULL SERIES NAMES FOR GOAL 1

query = "SELECT DISTINCT targetid, seriesrowid, seriesname FROM undata WHERE goalid = 1;"
series1 = pd.read_sql(query, engine)
for name in series1['seriesname']:
    print name


Share of youth unemployed to youth population, men
Purchasing power parities (PPP) conversion factor, local currency unit to international dollar
Children under 5 severely underweight, percentage
Poorest quintile"s share in national income or consumption, percentage
Proportion of own-account and contributing family workers in total employment, women, percentage
Children under 5 moderately or severely underweight, percentage
Ratio of youth unemployment rate to adult unemployment rate, men
Growth rate of GDP per person employed, percentage
Ratio of youth unemployment rate to adult unemployment rate, women
Employment-to-population ratio, men, percentage
Employment-to-population ratio, both sexes, percentage
Share of youth unemployed to total unemployed, women
Youth unemployment rate, aged 15-24, women
Youth unemployment rate, aged 15-24, both sexes
Youth unemployment rate, aged 15-24, men
Population undernourished, millions
Proportion of own-account and contributing family workers in total employment, men, percentage
Proportion of own-account and contributing family workers in total employment, both sexes, percentage
Share of youth unemployed to youth population, women
Population below national poverty line, rural, percentage
Poverty gap ratio at $1.25 a day (PPP), percentage
Employment-to-population ratio, women, percentage
Share of youth unemployed to youth population, both sexes
Population below national poverty line, total, percentage
Ratio of youth unemployment rate to adult unemployment rate, both sexes
Population below $1.25 (PPP) per day, percentage
Share of youth unemployed to total unemployed, both sexes
Proportion of employed people living below $1 (PPP) per day, percentage
Share of youth unemployed to total unemployed, men
Population undernourished, percentage
Population below national poverty line, urban, percentage

In [7]:
# PRINT TABLE

series1


Out[7]:
targetid seriesrowid seriesname
0 16 678 Share of youth unemployed to youth population,...
1 1 699 Purchasing power parities (PPP) conversion fac...
2 2 560 Children under 5 severely underweight, percentage
3 1 585 Poorest quintile"s share in national inco...
4 16 773 Proportion of own-account and contributing fam...
5 2 559 Children under 5 moderately or severely underw...
6 16 672 Ratio of youth unemployment rate to adult unem...
7 16 757 Growth rate of GDP per person employed, percen...
8 16 673 Ratio of youth unemployment rate to adult unem...
9 16 770 Employment-to-population ratio, men, percentage
10 16 758 Employment-to-population ratio, both sexes, pe...
11 16 676 Share of youth unemployed to total unemployed,...
12 16 596 Youth unemployment rate, aged 15-24, women
13 16 630 Youth unemployment rate, aged 15-24, both sexes
14 16 597 Youth unemployment rate, aged 15-24, men
15 2 640 Population undernourished, millions
16 16 772 Proportion of own-account and contributing fam...
17 16 760 Proportion of own-account and contributing fam...
18 16 679 Share of youth unemployed to youth population,...
19 1 583 Population below national poverty line, rural,...
20 1 584 Poverty gap ratio at $1.25 a day (PPP), percen...
21 16 771 Employment-to-population ratio, women, percentage
22 16 677 Share of youth unemployed to youth population,...
23 1 581 Population below national poverty line, total,...
24 16 671 Ratio of youth unemployment rate to adult unem...
25 1 580 Population below $1.25 (PPP) per day, percentage
26 16 674 Share of youth unemployed to total unemployed,...
27 16 759 Proportion of employed people living below $1 ...
28 16 675 Share of youth unemployed to total unemployed,...
29 2 566 Population undernourished, percentage
30 1 582 Population below national poverty line, urban,...

In [8]:
# PRINT FULL SERIES NAMES FOR GOAL 5

query = "SELECT DISTINCT targetid, seriesrowid, seriesname FROM undata WHERE goalid = 5;"
series5 = pd.read_sql(query, engine)
for name in series5['seriesname']:
    print name


Current contraceptive use among married women 15-49 years old, any method, percentage
Unmet need for family planning, total, percentage
Current contraceptive use among married women 15-49 years old, modern methods, percentage
Unmet need for family planning, limiting, percentage
Antenatal care coverage, at least four visits, percentage
Antenatal care coverage, at least one visit, percentage
Unmet need for family planning, spacing, percentage
Adolescent birth rate, per 1,000 women
Maternal mortality ratio per 100,000 live births
Births attended by skilled health personnel, percentage
Current contraceptive use among married women 15-49 years old, condom, percentage

In [9]:
# PRINT TABLE

series5


Out[9]:
targetid seriesrowid seriesname
0 19 730 Current contraceptive use among married women ...
1 19 764 Unmet need for family planning, total, percentage
2 19 731 Current contraceptive use among married women ...
3 19 778 Unmet need for family planning, limiting, perc...
4 19 763 Antenatal care coverage, at least four visits,...
5 19 762 Antenatal care coverage, at least one visit, p...
6 19 777 Unmet need for family planning, spacing, perce...
7 19 761 Adolescent birth rate, per 1,000 women
8 6 553 Maternal mortality ratio per 100,000 live births
9 6 570 Births attended by skilled health personnel, p...
10 19 732 Current contraceptive use among married women ...

In [10]:
# PRINT FULL SERIES NAMES FOR GOAL 6

query = "SELECT DISTINCT targetid, seriesrowid, seriesname FROM undata WHERE goalid = 6;"
series6 = pd.read_sql(query, engine)
for name in series6['seriesname']:
    print name


AIDS deaths
AIDS orphans (one or both parents)
Ratio of school attendance rate of orphans to school attendance rate of non orphans
People living with HIV, 15-49 years old, percentage
Condom use to overall contraceptive use among currently married women 15-49 years old, percentage
Condom use at last high-risk sex, 15-24 years old, men, percentage
Condom use at last high-risk sex, 15-24 years old, women, percentage
Men 15-24 years old with comprehensive correct knowledge of HIV/AIDS, percentage
Women 15-24 years old with comprehensive correct knowledge of HIV/AIDS, percentage
HIV prevalence rate, men 15-49 years old, in national based surveys
HIV prevalence rate, women 15-49 years old, in national based surveys
School attendance rate of orphans aged 10-14
School attendance rate of children aged 10-14 both of whose parents are alive and who live with at least one parent
AIDS deaths (lower bound)
AIDS deaths (upper bound)
People living with HIV, 15-49 years old, percentage (lower bound)
People living with HIV, 15-49 years old, percentage (upper bound)
HIV incidence rate, 15-49 years old, percentage (mid-point)
HIV incidence rate, 15-49 years old, percentage (upper bound)
HIV incidence rate, 15-49 years old, percentage (lower bound)
Tuberculosis prevalence rate per 100,000 population (mid-point)
Tuberculosis treatment success rate under DOTS, percentage
Notified cases of malaria per 100,000 population
Children under 5 sleeping under insecticide-treated bed nets, percentage
Children under 5 with fever being treated with anti-malarial drugs, percentage
Tuberculosis death rate per year per 100,000 population (mid-point)
Malaria death rate per 100,000 population, all ages
Malaria death rate per 100,000 population, ages 0-4
Tuberculosis detection rate under DOTS, percentage (mid-point)
Tuberculosis incidence rate per year per 100,000 population (mid-point)
Tuberculosis incidence rate per year per 100,000 population (lower bound)
Tuberculosis incidence rate per year per 100,000 population (upper bound)
Tuberculosis prevalence rate per 100,000 population (lower bound)
Tuberculosis prevalence rate per 100,000 population (upper bound)
Tuberculosis death rate per year per 100,000 population (lower bound)
Tuberculosis death rate per year per 100,000 population (upper bound)
Tuberculosis detection rate under DOTS, percentage (lower bound)
Tuberculosis detection rate under DOTS, percentage (upper bound)
Antiretroviral therapy coverage among people with advanced HIV infection, percentage
Antiretroviral therapy coverage among people with advanced HIV infection, percentage (upper bound)
Antiretroviral therapy coverage among people with advanced HIV infection, percentage (lower bound)
Percentage of HIV-infected pregnant women who received antiretroviral drugs to reduce the risk for mother-to-child transmission (Mid point)
Percentage of HIV-infected pregnant women who received antiretroviral drugs to reduce the risk for mother-to-child transmission (upper bound)
Percentage of HIV-infected pregnant women who received antiretroviral drugs to reduce the risk for mother-to-child transmission (lower bound)

In [11]:
# PRINT TABLE

series6


Out[11]:
targetid seriesrowid seriesname
0 7 579 AIDS deaths
1 7 622 AIDS orphans (one or both parents)
2 7 726 Ratio of school attendance rate of orphans to ...
3 7 729 People living with HIV, 15-49 years old, perce...
4 7 733 Condom use to overall contraceptive use among ...
5 7 734 Condom use at last high-risk sex, 15-24 years ...
6 7 735 Condom use at last high-risk sex, 15-24 years ...
7 7 741 Men 15-24 years old with comprehensive correct...
8 7 742 Women 15-24 years old with comprehensive corre...
9 7 747 HIV prevalence rate, men 15-49 years old, in n...
10 7 748 HIV prevalence rate, women 15-49 years old, in...
11 7 781 School attendance rate of orphans aged 10-14
12 7 782 School attendance rate of children aged 10-14 ...
13 7 797 AIDS deaths (lower bound)
14 7 798 AIDS deaths (upper bound)
15 7 799 People living with HIV, 15-49 years old, perce...
16 7 800 People living with HIV, 15-49 years old, perce...
17 7 801 HIV incidence rate, 15-49 years old, percentag...
18 7 802 HIV incidence rate, 15-49 years old, percentag...
19 7 803 HIV incidence rate, 15-49 years old, percentag...
20 8 617 Tuberculosis prevalence rate per 100,000 popul...
21 8 618 Tuberculosis treatment success rate under DOTS...
22 8 644 Notified cases of malaria per 100,000 population
23 8 645 Children under 5 sleeping under insecticide-tr...
24 8 646 Children under 5 with fever being treated with...
25 8 647 Tuberculosis death rate per year per 100,000 p...
26 8 663 Malaria death rate per 100,000 population, all...
27 8 664 Malaria death rate per 100,000 population, age...
28 8 718 Tuberculosis detection rate under DOTS, percen...
29 8 766 Tuberculosis incidence rate per year per 100,0...
30 8 789 Tuberculosis incidence rate per year per 100,0...
31 8 790 Tuberculosis incidence rate per year per 100,0...
32 8 791 Tuberculosis prevalence rate per 100,000 popul...
33 8 792 Tuberculosis prevalence rate per 100,000 popul...
34 8 793 Tuberculosis death rate per year per 100,000 p...
35 8 794 Tuberculosis death rate per year per 100,000 p...
36 8 795 Tuberculosis detection rate under DOTS, percen...
37 8 796 Tuberculosis detection rate under DOTS, percen...
38 20 765 Antiretroviral therapy coverage among people w...
39 20 804 Antiretroviral therapy coverage among people w...
40 20 805 Antiretroviral therapy coverage among people w...
41 20 806 Percentage of HIV-infected pregnant women who ...
42 20 807 Percentage of HIV-infected pregnant women who ...
43 20 808 Percentage of HIV-infected pregnant women who ...

In [12]:
# PRINT FULL SERIES NAMES FOR GOAL 8

query = "SELECT DISTINCT targetid, seriesrowid, seriesname FROM undata WHERE goalid = 8;"
series8 = pd.read_sql(query, engine)
for name in series8['seriesname']:
    print name


Debt relief delivered in full under MDRI initiative, cumulative million US$ in end-2009 NPV terms
Mobile-cellular subscriptions
Fixed-telephone subscriptions per 100 inhabitants
Agriculture support estimate for OECD countries, million US$
Mobile-cellular subscriptions per 100 inhabitants
ODA provided to help build trade capacity, percentage
ODA that is untied, million US$
ODA that is untied, percentage
Developed country imports from the LDCs, admitted duty free, percentage
Internet users per 100 inhabitants
Debt relief committed under HIPC initiative, cumulative million US$ in end-2009 NPV terms
ODA received in small islands developing States, million US$
Agriculture support estimate for OECD countries as percentage of their GDP
Fixed-telephone subscriptions
ODA to basic social services as percentage of sector-allocable ODA
Net ODA as percentage of OECD/DAC donors GNI
Net ODA to LDCs as percentage of OECD/DAC donors GNI
ODA received in landlocked developing countries, million US$
Net ODA, million US$
Debt service as percentage of exports of goods and services and net income
Developed country imports from developing countries, admitted duty free, percentage
ODA received in small islands developing States as percentage of their GNI
ODA received in landlocked developing countries as percentage of their GNI
Net ODA to LDCs, million US$
ODA to basic social services, million US$

In [13]:
# PRINT TABLE

series8


Out[13]:
targetid seriesrowid seriesname
0 15 787 Debt relief delivered in full under MDRI initi...
1 18 780 Mobile-cellular subscriptions
2 18 755 Fixed-telephone subscriptions per 100 inhabitants
3 14 601 Agriculture support estimate for OECD countrie...
4 18 756 Mobile-cellular subscriptions per 100 inhabitants
5 14 717 ODA provided to help build trade capacity, per...
6 13 598 ODA that is untied, million US$
7 13 599 ODA that is untied, percentage
8 14 714 Developed country imports from the LDCs, admit...
9 18 605 Internet users per 100 inhabitants
10 15 688 Debt relief committed under HIPC initiative, c...
11 14 639 ODA received in small islands developing State...
12 14 654 Agriculture support estimate for OECD countrie...
13 18 779 Fixed-telephone subscriptions
14 13 593 ODA to basic social services as percentage of ...
15 12 568 Net ODA as percentage of OECD/DAC donors GNI
16 12 650 Net ODA to LDCs as percentage of OECD/DAC dono...
17 14 632 ODA received in landlocked developing countrie...
18 12 569 Net ODA, million US$
19 15 655 Debt service as percentage of exports of goods...
20 14 713 Developed country imports from developing coun...
21 14 653 ODA received in small islands developing State...
22 14 652 ODA received in landlocked developing countrie...
23 12 651 Net ODA to LDCs, million US$
24 13 592 ODA to basic social services, million US$

Selecting the data

The series associated with target ids 1, 7, 12, 13, 14, 15, 19, and 20 seemed like they may be helpful in answering my questions. I queried the database again and saved these as a dataframe.


In [14]:
query = "SELECT * FROM undata WHERE targetid IN (1, 7, 12, 13, 14, 15, 19, 20);"
undata = pd.read_sql(query, engine)
undata.shape


Out[14]:
(56598, 28)

I wanted to use data for all countries, but the former Sudan was giving me errors when I tried to unstack and reshape the data. I took a look at what series data was available for the former Sudan.


In [15]:
print len(undata[undata.isformer == 1])
print len(set(undata[undata.isformer == 1]['seriesname']))
print set(undata[undata.isformer == 1]['seriesname'])


46
9
set([u'Antiretroviral therapy coverage among people with advanced HIV infection, percentage', u'Percentage of HIV-infected pregnant women who received antiretroviral drugs to reduce the risk for mother-to-child transmission (lower bound)', u'Antiretroviral therapy coverage among people with advanced HIV infection, percentage (lower bound)', u'Percentage of HIV-infected pregnant women who received antiretroviral drugs to reduce the risk for mother-to-child transmission (upper bound)', u'Percentage of HIV-infected pregnant women who received antiretroviral drugs to reduce the risk for mother-to-child transmission (Mid point)', u'Developed country imports from developing countries, admitted duty free, percentage', u'Developed country imports from the LDCs, admitted duty free, percentage', u'Antenatal care coverage, at least one visit, percentage', u'Antiretroviral therapy coverage among people with advanced HIV infection, percentage (upper bound)'])

Exclude former Sudan data

No data was available for the main variable I planned to explore, the HIV incidence rate, so I excluded the former Sudan rows from my dataframe.


In [16]:
undata = undata[undata.isformer == 0]
undata = undata.drop(['isformer'], axis=1)

Simplify dataframe

I looked at the included columns and simplified the dataframe to include only variables I intended to use.


In [17]:
columns = [column for column in undata.columns]
columns


Out[17]:
[u'goalid',
 u'goalname',
 u'targetid',
 u'targetname',
 u'indicatorid',
 u'indicatorname',
 u'indicatororderclause',
 u'seriesrowid',
 u'seriesname',
 u'seriesorderclause',
 u'ismdg',
 u'lastupdated',
 u'countryid',
 u'countryname',
 u'iso3code',
 u'isdeveloped',
 u'mdgregions',
 u'ismdgcountry',
 u'isldc2014',
 u'islldc',
 u'gdppc2012',
 u'population2012',
 u'year',
 u'value',
 u'nature',
 u'footnoteid',
 u'footnotetext']

In [18]:
unsimple = undata[['countryname', 'iso3code', 'year', 'isdeveloped', 'mdgregions', 
                   'isldc2014', 'islldc', 'ismdgcountry', 'seriesname', 'gdppc2012',
                   'population2012', 'value']]

In [19]:
unsimple.describe()


Out[19]:
year isdeveloped isldc2014 islldc ismdgcountry gdppc2012 population2012 value
count 56552.000000 56483.000000 56552.000000 56552.000000 56552.000000 56080.000000 5.608000e+04 5.655200e+04
mean 2002.477790 0.201300 0.301740 0.013952 0.998780 10903.839376 3.912418e+07 -2.594671e+05
std 6.772721 0.400975 0.459017 0.117292 0.034909 17970.787321 1.300971e+08 2.388371e+07
min 1984.000000 0.000000 0.000000 0.000000 0.000000 128.100000 5.046000e+03 -2.146826e+09
25% 1997.000000 0.000000 0.000000 0.000000 1.000000 1125.900000 3.514381e+06 9.400000e-01
50% 2003.000000 0.000000 0.000000 0.000000 1.000000 3673.000000 1.049628e+07 1.450000e+01
75% 2008.000000 0.000000 1.000000 0.000000 1.000000 10421.600000 2.995478e+07 9.170000e+01
max 2015.000000 1.000000 1.000000 1.000000 1.000000 158976.800000 1.377065e+09 2.400000e+06

In [20]:
unsimple.shape


Out[20]:
(56552, 12)

Unstack the dataframe

The dataframe included a column called 'value' which contained the value available for the many different series. In order to explore and make predictions I unstacked the dataframe, turning the value column into multiple columns associated with the corresponding series name.

I started by extracting the values for the series 'AIDS deaths' and then looped through a list of the other series, merging them with the reshaped dataframe.


In [21]:
all_series = list(set(unsimple['seriesname']))
row = 'AIDS deaths'
all_series.remove(row)

In [22]:
unreshape = deepcopy(unsimple[unsimple['seriesname'] == row])
unreshape.rename(columns={'value': row}, inplace=True)
unreshape.shape


Out[22]:
(2808, 12)

In [23]:
for series in all_series:
    new_cols = deepcopy(unsimple[unsimple['seriesname'] == series])
    new_cols.rename(columns={'value': series}, inplace=True)
    new_cols = new_cols.drop(['seriesname'], axis=1)
    keys = ['countryname', 'iso3code', 'year', 'isdeveloped', 'mdgregions', 
            'ismdgcountry', 'gdppc2012', 'population2012', 'isldc2014', 'islldc']
    unreshape = pd.merge(unreshape, new_cols, how='outer', on=keys)
    
unreshape.shape


Out[23]:
(5332, 73)

Add more series from World Bank MDG database

I found more variables which I wanted to include in my analysis on the World Bank database.

The years for this data were individual columns, so I needed to transform it into a shape similar to the original data in order to unstack and merge it with my reshaped dataframe.


In [24]:
undata_new = pd.read_csv('data/Data_Extract_From_Millennium_Development_Goals_Data.csv')
list(set(undata_new['Series Name']))


Out[24]:
['Net official development assistance received (current US$)',
 nan,
 'Net ODA received per capita (current US$)',
 'Contraceptive prevalence (% of women ages 15-49)',
 'Net ODA received (% of GNI)',
 'GNI per capita, Atlas method (current US$)',
 'Population, total',
 'Condom use with non regular partner, % adults(15-49), female',
 'Prevalence of HIV, total (% of population ages 15-49)',
 'Condom use, population ages 15-24, male (% of males ages 15-24)',
 'Condom use, population ages 15-24, female (% of females ages 15-24)',
 'Condom use with non regular partner, % adults(15-49), male',
 'Net official development assistance and official aid received (current US$)']

In [25]:
undata_new.head()


Out[25]:
Country Name Country Code Series Name Series Code 1990 [YR1990] 1991 [YR1991] 1992 [YR1992] 1993 [YR1993] 1994 [YR1994] 1995 [YR1995] ... 2005 [YR2005] 2006 [YR2006] 2007 [YR2007] 2008 [YR2008] 2009 [YR2009] 2010 [YR2010] 2011 [YR2011] 2012 [YR2012] 2013 [YR2013] 2014 [YR2014]
0 Afghanistan AFG Net ODA received (% of GNI) DT.ODA.ODAT.GN.ZS .. .. .. .. .. .. ... 45.1403164746878 41.7083617468951 50.2826596611131 47.6440672911038 49.7207031717927 40.1686350307288 38.2210459953184 32.5907503002823 25.7479981923313 ..
1 Afghanistan AFG Net ODA received per capita (current US$) DT.ODA.ODAT.PC.ZS 10.375756327596 40.2416959726509 14.8350593358933 14.7476248449374 13.7348955275633 12.085131228558 ... 114.140080862062 115.549038865867 188.419834300363 180.365288104404 225.038180953521 226.302646133442 236.548237651466 225.516668557727 172.362077442958 ..
2 Afghanistan AFG Population, total SP.POP.TOTL 11731193 12612043 13811876 15175325 16485018 17586073 ... 24860855 25631282 26349243 27032197 27708187 28397812 29105480 29824536 30551674 ..
3 Afghanistan AFG Prevalence of HIV, total (% of population ages... SH.DYN.AIDS.ZS 0.1 0.1 0.1 0.1 0.1 0.1 ... 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ..
4 Afghanistan AFG GNI per capita, Atlas method (current US$) NY.GNP.PCAP.CD .. .. .. .. .. .. ... 250 270 330 370 460 510 570 690 690 680

5 rows × 29 columns


In [26]:
# drop series code
undata_new = undata_new.drop(['Series Code'], axis = 1)

In [27]:
# rename columns
years = range(1990, 2015)
undata_new.columns = ['countryname', 'iso3code', 'seriesname'] + years

In [28]:
undata_new.head()


Out[28]:
countryname iso3code seriesname 1990 1991 1992 1993 1994 1995 1996 ... 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
0 Afghanistan AFG Net ODA received (% of GNI) .. .. .. .. .. .. .. ... 45.1403164746878 41.7083617468951 50.2826596611131 47.6440672911038 49.7207031717927 40.1686350307288 38.2210459953184 32.5907503002823 25.7479981923313 ..
1 Afghanistan AFG Net ODA received per capita (current US$) 10.375756327596 40.2416959726509 14.8350593358933 14.7476248449374 13.7348955275633 12.085131228558 9.21135878972857 ... 114.140080862062 115.549038865867 188.419834300363 180.365288104404 225.038180953521 226.302646133442 236.548237651466 225.516668557727 172.362077442958 ..
2 Afghanistan AFG Population, total 11731193 12612043 13811876 15175325 16485018 17586073 18415307 ... 24860855 25631282 26349243 27032197 27708187 28397812 29105480 29824536 30551674 ..
3 Afghanistan AFG Prevalence of HIV, total (% of population ages... 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ... 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ..
4 Afghanistan AFG GNI per capita, Atlas method (current US$) .. .. .. .. .. .. .. ... 250 270 330 370 460 510 570 690 690 680

5 rows × 28 columns


In [29]:
# set a multi-level index and stack the data
undata_new = undata_new.set_index(keys=['countryname', 'iso3code', 'seriesname'])
undata_new = pd.DataFrame(undata_new.stack(level=-1))

In [30]:
# rename value column
undata_new.columns = ['value']

In [31]:
undata_new.head()


Out[31]:
value
countryname iso3code seriesname
Afghanistan AFG Net ODA received (% of GNI) 1990 ..
1991 ..
1992 ..
1993 ..
1994 ..

In [32]:
# reset the index to turn these variables back into columns
undata_new = undata_new.reset_index()

# rename columns
undata_new.columns = ['countryname', 'iso3code', 'seriesname', 'year', 'value']
undata_new.head()


Out[32]:
countryname iso3code seriesname year value
0 Afghanistan AFG Net ODA received (% of GNI) 1990 ..
1 Afghanistan AFG Net ODA received (% of GNI) 1991 ..
2 Afghanistan AFG Net ODA received (% of GNI) 1992 ..
3 Afghanistan AFG Net ODA received (% of GNI) 1993 ..
4 Afghanistan AFG Net ODA received (% of GNI) 1994 ..

Add columns to new dataframe

In order to properly merge the data provided by the UN with the data downloaded from the World Bank, the two dataframes need to have the same key columns. In order to add these columns to the new dataframe, I created a dictionary using the ISO code as a key and storing the values for the columns to add.


In [33]:
# columns to add
cols = ['countryname', 'isdeveloped', 'mdgregions', 'ismdgcountry', 'gdppc2012', 
        'population2012', 'isldc2014', 'islldc']

add_cols = {}

for code in set(unreshape['iso3code']):
    country_cols = {}
    for col in cols:
        try:
            temp = list(unreshape[unreshape['iso3code'] == code][col])[0]
            country_cols[col] = temp
        except IndexError:
            country_cols[col] = np.NaN
    add_cols[code] = country_cols

In [34]:
add_cols['AFG']


Out[34]:
{'countryname': u'Afghanistan',
 'gdppc2012': 682.79999999999995,
 'isdeveloped': 0.0,
 'isldc2014': 1.0,
 'islldc': 1.0,
 'ismdgcountry': 1.0,
 'mdgregions': u'Southern Asia',
 'population2012': 29824536.0}

Rename and remove countries

A few countries in the World Bank data had ISO codes which differed from the original UN dataset, possibly these are outdated codes. I replaces these with the codes used in the UN dataset.

A number of countries from the new World Bank data were not present in the original UN dataset. I removed these to avoid errors.


In [36]:
undata_new['iso3code'] = undata_new['iso3code'].replace('ZAR', 'COD')
undata_new['iso3code'] = undata_new['iso3code'].replace('TMP', 'TLS')

In [39]:
missing_data = list(set(undata_new['iso3code']).difference(set(unreshape['iso3code'])))
sorted(missing_data)


Out[39]:
['ADO', 'CHI', 'CUW', 'IMY', 'KSV', 'MAF', 'MCO', 'ROM', 'SXM', 'WBG']

In [40]:
undata_new.shape


Out[40]:
(64200, 5)

In [41]:
for country in missing_data:
    undata_new = undata_new[undata_new['iso3code'] != country]

In [42]:
undata_new.shape


Out[42]:
(61200, 5)

These loops add the columns 'countryname', 'isdeveloped', 'mdgregions', 'ismdgcountry', 'gdppc2012', 'population2012', 'isldc2014', and 'islldc' to the World Bank dataframe so that it can be properly merged with the UN dataframe.


In [43]:
for col in cols:
    new_col = []
    for code in undata_new['iso3code']:
        new_col.append(add_cols[code][col])
    undata_new[col] = new_col
    
for col in cols:
    new_col = []
    for code in unreshape['iso3code']:
        new_col.append(add_cols[code][col])
    unreshape[col] = new_col

In [44]:
undata_new.shape


Out[44]:
(61200, 12)

In [45]:
undata_new.dtypes


Out[45]:
countryname        object
iso3code           object
seriesname         object
year                int64
value              object
isdeveloped       float64
mdgregions         object
ismdgcountry      float64
gdppc2012         float64
population2012    float64
isldc2014         float64
islldc            float64
dtype: object

Merge World Bank data with reshaped dataframe


In [46]:
for series in set(undata_new['seriesname']):
    new_cols = deepcopy(undata_new[undata_new['seriesname'] == series])
    new_cols.rename(columns={'value': series}, inplace=True)
    new_cols = new_cols.drop(['seriesname'], axis=1)
    keys = ['countryname', 'iso3code', 'year', 'isdeveloped', 'mdgregions', 
            'ismdgcountry', 'gdppc2012', 'population2012', 'isldc2014', 'islldc']
    unreshape = pd.merge(unreshape, new_cols, how='outer', on=keys)
    
unreshape.shape


Out[46]:
(5534, 85)

In [47]:
# replace '..' values with NaN
unreshape = unreshape.replace('..', np.NaN)

In [48]:
# inspect how many observations are available for each variable
for col in sorted(unreshape.columns):
    print unreshape[col].count(), col


2808 AIDS deaths
2808 AIDS deaths (lower bound)
2808 AIDS deaths (upper bound)
936 AIDS orphans (one or both parents)
2838 Adolescent birth rate, per 1,000 women
326 Agriculture support estimate for OECD countries as percentage of their GDP
326 Agriculture support estimate for OECD countries, million US$
378 Antenatal care coverage, at least four visits, percentage
698 Antenatal care coverage, at least one visit, percentage
310 Antiretroviral therapy coverage among people with advanced HIV infection, percentage
316 Antiretroviral therapy coverage among people with advanced HIV infection, percentage (lower bound)
322 Antiretroviral therapy coverage among people with advanced HIV infection, percentage (upper bound)
166 Condom use at last high-risk sex, 15-24 years old, men, percentage
203 Condom use at last high-risk sex, 15-24 years old, women, percentage
700 Condom use to overall contraceptive use among currently married women 15-49 years old, percentage
125 Condom use with non regular partner, % adults(15-49), female
120 Condom use with non regular partner, % adults(15-49), male
144 Condom use, population ages 15-24, female (% of females ages 15-24)
122 Condom use, population ages 15-24, male (% of males ages 15-24)
787 Contraceptive prevalence (% of women ages 15-49)
746 Current contraceptive use among married women 15-49 years old, any method, percentage
700 Current contraceptive use among married women 15-49 years old, condom, percentage
730 Current contraceptive use among married women 15-49 years old, modern methods, percentage
36 Debt relief committed under HIPC initiative, cumulative million US$ in end-2009 NPV terms
36 Debt relief delivered in full under MDRI initiative, cumulative million US$ in end-2009 NPV terms
2575 Debt service as percentage of exports of goods and services and net income
2856 Developed country imports from developing countries, admitted duty free, percentage
931 Developed country imports from the LDCs, admitted duty free, percentage
4375 GNI per capita, Atlas method (current US$)
1980 HIV incidence rate, 15-49 years old, percentage (lower bound)
1978 HIV incidence rate, 15-49 years old, percentage (mid-point)
1980 HIV incidence rate, 15-49 years old, percentage (upper bound)
57 HIV prevalence rate, men 15-49 years old, in national based surveys
57 HIV prevalence rate, women 15-49 years old, in national based surveys
172 Men 15-24 years old with comprehensive correct knowledge of HIV/AIDS, percentage
649 Net ODA as percentage of OECD/DAC donors GNI
3219 Net ODA received (% of GNI)
3492 Net ODA received per capita (current US$)
610 Net ODA to LDCs as percentage of OECD/DAC donors GNI
616 Net ODA to LDCs, million US$
651 Net ODA, million US$
3792 Net official development assistance and official aid received (current US$)
3495 Net official development assistance received (current US$)
212 ODA provided to help build trade capacity, percentage
705 ODA received in landlocked developing countries as percentage of their GNI
726 ODA received in landlocked developing countries, million US$
764 ODA received in small islands developing States as percentage of their GNI
938 ODA received in small islands developing States, million US$
464 ODA that is untied, million US$
193 ODA that is untied, percentage
394 ODA to basic social services as percentage of sector-allocable ODA
395 ODA to basic social services, million US$
2784 People living with HIV, 15-49 years old, percentage
2784 People living with HIV, 15-49 years old, percentage (lower bound)
2784 People living with HIV, 15-49 years old, percentage (upper bound)
71 Percentage of HIV-infected pregnant women who received antiretroviral drugs to reduce the risk for mother-to-child transmission (Mid point)
205 Percentage of HIV-infected pregnant women who received antiretroviral drugs to reduce the risk for mother-to-child transmission (lower bound)
205 Percentage of HIV-infected pregnant women who received antiretroviral drugs to reduce the risk for mother-to-child transmission (upper bound)
978 Poorest quintile"s share in national income or consumption, percentage
853 Population below $1.25 (PPP) per day, percentage
427 Population below national poverty line, rural, percentage
623 Population below national poverty line, total, percentage
449 Population below national poverty line, urban, percentage
4893 Population, total
853 Poverty gap ratio at $1.25 a day (PPP), percentage
2784 Prevalence of HIV, total (% of population ages 15-49)
3744 Purchasing power parities (PPP) conversion factor, local currency unit to international dollar
204 Ratio of school attendance rate of orphans to school attendance rate of non orphans
86 School attendance rate of children aged 10-14 both of whose parents are alive and who live with at least one parent
66 School attendance rate of orphans aged 10-14
335 Unmet need for family planning, limiting, percentage
325 Unmet need for family planning, spacing, percentage
383 Unmet need for family planning, total, percentage
299 Women 15-24 years old with comprehensive correct knowledge of HIV/AIDS, percentage
5509 countryname
5192 gdppc2012
5488 isdeveloped
5509 isldc2014
5509 islldc
5509 ismdgcountry
5509 iso3code
5488 mdgregions
5192 population2012
2808 seriesname
5534 year

In [49]:
# drop rows with no value for countryname or region
unreshape = unreshape.drop(pd.isnull(unreshape[['mdgregions', 'countryname']]).any(1).nonzero()[0])

In [50]:
# now that series have been transformed into columns, seriesname can be dropped
unreshape = unreshape.drop(['seriesname'], axis = 1)

Save processed data

Finally, I saved the reshaped and merged dataframe with pickle for use in exploration in another ipython notebook.


In [51]:
with open('un_reshape.pkl', 'w') as picklefile:
    pickle.dump(unreshape, picklefile)

In [ ]: