Link for the data: https://drive.google.com/drive/folders/1fNsdCaBysj0E-sTvE0IhGVidTTwJoLaj?usp=sharing
Put 'data' and 'finalData.pkl' in the root folder of the zip.
In [2]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import time
import pickle
import wbdata
import warnings
from pygsp import graphs, filters, plotting
from sklearn import preprocessing, decomposition
from mpl_toolkits.mplot3d import Axes3D
import IPython.display as ipd
from collections import defaultdict
from Modules.data_acquisition import get_data, data_format
from Modules.utils import *
from Modules.plots import *
warnings.filterwarnings('ignore')
The idea of the project was to use data from the World Bank to analyze which sets of indicators were best at predicting the group to which each country belongs.
The World Development Indicators are a set of global development indicators collected by the World Bank at the country, region and global level. They span multiple themes such as health, education, demographics and economy, and have been collected on a yearly basis since 1960. Furthermore, the countries are classified according to the World Bank classification, namely by 'geographic region' or 'income level', plus another custom-made label for 'migration rate'.
Thanks to these labels, we wanted to analyze which indicators (grouped into thematic groups) were best at clustering our countries.
At first, a dataset about health, nutrition and population statistics (link) was downloaded from Kaggle, but it was further enriched with many more indicators scraped from the World Bank, as we wanted a comprehensive list of possible indicators spanning not only health, nutrition and demographics but many other themes such as education, poverty and economy.
In [3]:
# Initial dataset
df = pd.read_csv('./data/data.csv', index_col=0)
# Code to indicator name
code_to_indicator = dict(zip(df['Indicator Code'],df['Indicator Name']))
code_to_country = dict(zip(df['Country Code'],df.index))
country_to_code = {v: k for k, v in code_to_country.items()}
# Transform the dataframe into multi-column (year, indicator code) format
df = df.set_index('Country Code')
df.drop(labels=['Indicator Name'],axis=1, inplace=True)
df.set_index([df.index,'Indicator Code'],inplace=True)
df = df.unstack(level=-1)
print('Data consists of {} countries with {} indicators across {} years'.format(len(df.index),len(set(df.columns.get_level_values(1))),len(set(df.columns.get_level_values(0)))))
print('\nExample indicator code: NY.GNP.PCAP.CD -> {}'.format(code_to_indicator['NY.GNP.PCAP.CD']))
print('Example country code dict: AGO -> {}'.format(code_to_country['AGO']))
df.head(5)
Out[3]:
This dataset only contained data about health, nutrition and demographics. We wanted to increase the number of indicators available, to increase the chances of finding something of interest. As such, all the possible indicators were queried from the World Bank and converted into a dataframe, using a few functions from the wbdata module to query all the indicators available in the World Bank's different databases.
In [ ]:
"""/!\ Execution of this cell takes a very long time (>24hrs)"""
# GET ALL POSSIBLE INDICATORS WITHOUT THE ONES WE ALREADY HAVE
all_indicators = wbdata.get_indicator(display=False)
all_indicators = set([d['id'] for d in all_indicators])
all_indicators = list(all_indicators - set(code_to_indicator.keys()))
how_many_chunks = 5
get_data(code_to_indicator=code_to_indicator, how_many_chunks=how_many_chunks, all_indicators=all_indicators)
# Construct final dataframe
finalDF = pd.DataFrame()
for i in range(0,how_many_chunks):
temp = pd.read_pickle('./data/data{}.pkl'.format(i),compression='gzip')
with open('./data/dict{}.pickle'.format(i), 'rb') as handle:
temp_dict = pickle.load(handle)
df = data_format(temp,country_to_code) # format data into nice dataframe similar as what is abouve
finalDF = pd.concat([finalDF,df],axis=1)
code_to_indicator.update(temp_dict)
finalDF.to_pickle('./finalData.pkl',compression='gzip') # save final data
with open('./code_to_country_dict.pickle', 'wb') as handle:
pickle.dump(code_to_country, handle, protocol=pickle.HIGHEST_PROTOCOL) # save code to country dict
with open('./code_to_indicator_dict.pickle', 'wb') as handle:
pickle.dump(code_to_indicator, handle, protocol=pickle.HIGHEST_PROTOCOL) # save code to indicator dict
Once all the data was saved, a single dataset was generated; this is the one we worked with for the rest of the report.
In [4]:
data = pd.read_pickle('./finalData.pkl', compression='gzip')
# Clean the raw data: remove 'weird' years and countries that are in fact aggregates, such as WLD = World, OED = OECD members, etc.
data = first_clean(data)
# Load dictionaries
code_to_country = pickle.load(open('./code_to_country_dict.pickle','rb')) # ISO3 code to country name. EX code_to_country['BRA'] = Brazil
code_to_indicator = pickle.load(open('./code_to_indicator_dict.pickle','rb')) # Indicator code to indicator name. EX code_to_indicator['SP.POP.TOTL'] = Population, total
# Build indicators, years, and countries lists
indicators = list(set(data.columns.levels[1]))
years = list(set(data.columns.levels[0][:-1]))
countries = list(set(data.index))
print('Data consists of {} countries with {} indicators across {} years'.format(len(countries),len(indicators),len(years)))
# Get labels, code_to_labels and colors
# labels = labels of each country, specified with 'method'
# code_to_labels = dict of ISO3 country code to label. EX code_to_labels['BRA'] = 1
# colors = dict of ISO3 country code to label color. EX colors['BRA'] = 'b'
labels,code_to_labels,colors = get_labels(countries) #by income level
#labels,code_to_labels,colors = get_labels(countries,method='migration',df=data.xs('SM.POP.NETM',level=1,axis=1)) #by migration rate
#labels,code_to_labels,colors = get_labels(countries,method='region') #by region
The countries can be labeled according to three categories: region, income level and migration rate (positive or negative).
In [5]:
print('By region:')
print(get_labels(countries,method='region')[1])
print('\nBy income level:')
print(get_labels(countries,method='incomeLevel')[1])
print('\nBy migration rate:')
print(get_labels(countries,method='migration',df=data.xs('SM.POP.NETM',level=1,axis=1))[1])
In [6]:
# PLOT SPARSITY PATTERN OF THE WHOLE DATASET
plotSparsityFull(data)
# PLOT SPARSITY PATTERN OF YEAR-RANGES
plotSparsityRange(data,4)
In [7]:
# How many NaN values in whole dataset
how_many_nans = data.isnull().sum().sum()
print('Raw dataset has {0} NaN values ({1:.2f}%)\n'.format(how_many_nans,100*(how_many_nans/data.size)))
# PLOT NUMBER OF NANS PER YEAR
plotNANperyear(data)
As the three plots above show, our data matrix is mostly empty. Furthermore, the amount of available data increases towards more recent years.
Another interesting behavior is that we consistently have more data at five-year intervals. This makes sense, as many indicators were probably not sampled every year.
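As a quick check of this periodicity (a minimal sketch, equivalent in spirit to what plotNANperyear visualizes), one can count the non-null entries per year:
# Count the non-null values per year across all countries and indicators;
# spikes every five years reveal indicators sampled at that cadence.
counts_per_year = data.notnull().sum().groupby(level=0).sum()
print(counts_per_year.tail(10))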
In [8]:
# PLOT NUMBER OF NANS PER COUNTRY, with background color as country's label
plotNANpercountry(data,get_labels(countries,method='incomeLevel')[0])
Interestingly, there seems to be a trend between income level and the amount of data available. As one of the goals of the World Bank is the "reduction of poverty", it makes sense that most of the data comes from developing countries.
In [9]:
# PLOT NUMBER OF NANS PER INDICATOR, sorted
plotNANperindicator(data,background=False)
As we have seen here, our dataset is very sparse: a lot of data is simply not available. Several reasons can explain this sparsity.
Given the proportion of NaN values in our dataset across all the years from 1960 to 2015, we devised some strategies to obtain a usable dataset while keeping a time dimension and the maximum number of countries. The first strategy was to average the data over three-year windows (cf. the group_data function). In this way, we aim to reduce the number of NaN values while keeping the time evolution of the variables. The dataset then has values for three-year slices from 1960 to 2015 (i.e. '1960-1962', '1963-1965', ..., '2011-2013').
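group_data is implemented in our Modules; the snippet below is only a minimal sketch of the averaging idea, assuming the first column level holds the years as strings (average_over_period is a hypothetical helper, not the project's function):
# Hedged sketch of the three-year averaging behind group_data: average each
# indicator over a consecutive three-year window, NaNs ignored by mean().
def average_over_period(data, start, length=3):
    years = [str(y) for y in range(start, start + length)]
    # keep only the columns whose year level falls inside the window
    window = data.loc[:, data.columns.get_level_values(0).isin(years)]
    # mean over the year level -> one value per (country, indicator)
    return window.groupby(level=1, axis=1).mean()

slice_1960_1962 = average_over_period(data, 1960)  # e.g. the '1960-1962' slice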
In [10]:
aggregated_data = group_data(data, 3, clean=None, what=None)  # returns a dictionary of averaged data (key = period, value = corresponding dataset)
print("After averaging the dataset over three-year periods, we end up with {} 'slices' of data from 1960 to 2015".format(len(aggregated_data)))
However, the number of NaN values was still significant, so another strategy consisted in deleting columns (indicators) and rows (countries), one after the other, starting with the ones containing the most NaNs, while seeking to keep as many countries as possible (cf. the cleanNaN function).
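cleanNaN is implemented in our Modules; the following is only a minimal sketch of the greedy idea (greedy_clean is a hypothetical stand-in, and the real function's tie-breaking may differ):
# Hedged sketch of the greedy cleaning behind cleanNaN: repeatedly drop the
# indicator (column) or country (row) with the most NaNs, until the frame is
# NaN-free. Ties favor dropping indicators, to keep as many countries as possible.
def greedy_clean(df):
    df = df.copy()
    while df.isnull().values.any():
        col_nans = df.isnull().sum(axis=0)
        row_nans = df.isnull().sum(axis=1)
        if col_nans.max() >= row_nans.max():
            df = df.drop(columns=col_nans.idxmax())
        else:
            df = df.drop(index=row_nans.idxmax())
    return df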
In [ ]:
"""/!\ Execution of this cell takes long time (cleanNaN over all the periods)"""
agg_data = dict()
for key, value in aggregated_data.items():
agg_data[key]=cleanNaN(aggregated_data[key])
# To save the data averaged by three years (dictionary)
with open('./agg_data.pickle', 'wb') as handle:
pickle.dump(agg_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
In [11]:
# Open the sliced data
with open('./agg_data.pickle', 'rb') as handle:
    agg_data = pickle.load(handle)
In [12]:
print('The final datasets have the following shapes:')
for key, value in agg_data.items():
    print('period: {}, {} countries and {} indicators'.format(key, len(value.index), len(value.columns)))
Here we checked some indicators that should a priori be correlated. The first plots indeed confirm that several indicators are correlated, and they show nice clusters according to income level.
In the last plot, however, the intuitive correlation is not confirmed. Even though the data was normalized, several outliers are visible and could have biased the data.
In [13]:
key='2008-2010'
var1='Rural population (% of total population)'
var2='Crop production index (2004-2006 = 100)'
var3='Employment in agriculture (% of total employment) (modeled ILO estimate)'
var4='Life expectancy at birth, total (years)'
var5='Birth rate, crude (per 1,000 people)'
var6='Unemployment, youth male (% of male labor force ages 15-24) (modeled ILO estimate)'
var7='CO2 emissions (kt)'
var8='Agricultural methane emissions (% of total)'
var9='Agricultural nitrous oxide emissions (thousand metric tons of CO2 equivalent)'
variables_total=dict()
variables_total[0] = [var1,var2,var3]
variables_total[1] = [var4,var5,var6]
variables_total[2] = [var7,var8,var9]
%matplotlib inline
for i in range(len(variables_total)):
    df = agg_data[key]
    variables = variables_total[i]
    var_codes = converter_reverse(variables, code_to_indicator)
    df = df[var_codes]
    # Standardize each indicator (zero mean, unit variance), ignoring NaNs
    df -= np.nanmean(df, axis=0)
    df /= np.nanstd(df, axis=0)
    # Shorten column names: keep only the part before the first '('
    df.columns = [code_to_indicator[c].split('(')[0] for c in df.columns]
    # Attach each country's label (income level, for the year 2009)
    vec = [labels['2009'][country] for country in df.index]
    df2 = df.assign(label=vec)
    sns.pairplot(df2, hue="label", size=3);
After examining the remaining indicators, we decided to study the data from a thematic point of view. The different themes we chose (according to the indicators available) were the following: demographic, education, health, social, economic, gas production and technology.
To do so, we attributed the indicators to the different theme lists with the make_theme_list function, which retrieves indicators by key words related to the theme.
In [14]:
# Key word lists
demographic_kw=['birth','expectancy','ages','rural','urban']
education_kw=['school','literacy','teacher','education']
health_kw=['water','defecation','sanitation', 'anemia','immunization','hiv','contraceptive','health','tetanus',\
'malnutrition','smoking','tuberculosis','lifetime','death','fertility','risk','obesity','overweight',\
'survival']
social_kw=['labor','employment','unemployment','corruption','workers','wage']
economic_kw=['economic','debt','GDP','merchandise','goods','banking','wage']
gas_production_kw=['CO2','fuel','energy','methane','nitrous','mineral']
technology_kw=['telephone','cellular','electricity']
unstability_kw=['Violence/Terrorism']
########################################
theme_dico={'demographic':demographic_kw,'education': education_kw,'health':health_kw,'social':social_kw,\
'economic':economic_kw,'gas_production':gas_production_kw,'technology':technology_kw}
# Creation of the two-level dictionary (level 1: periods, level 2: themes)
agg_theme_data = {period: dict.fromkeys(theme_dico.keys()) for period in agg_data}
for key1, value1 in agg_data.items():
    for key2, keywords in theme_dico.items():
        # Build the list of indicator codes matching this theme's key words
        code_indicators = agg_data[key1].columns.values
        theme_list, theme_name = make_theme_list(key2, keywords, code_to_indicator, code_indicators)
        theme_code = converter_reverse(theme_list, code_to_indicator)
        # Keep only the codes actually present in this period's columns
        theme_code = [c for c in theme_code if c in agg_data[key1].columns]
        agg_theme_data[key1][key2] = agg_data[key1][theme_code]
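make_theme_list is implemented in our Modules; a minimal sketch of the keyword matching it performs could look as follows (match_theme is a hypothetical stand-in, assuming a case-insensitive substring test on the indicator names):
# Hedged sketch of the keyword matching behind make_theme_list: an indicator
# belongs to a theme if any of the theme's key words appears in its name.
def match_theme(keywords, code_to_indicator, code_indicators):
    kws = [kw.lower() for kw in keywords]
    return [code for code in code_indicators
            if any(kw in code_to_indicator.get(code, '').lower() for kw in kws)]

health_codes = match_theme(health_kw, code_to_indicator, agg_data['2008-2010'].columns)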
The pipeline of our exploratory phase is to create similarity graphs. The countries are the nodes; the edges represent the strength of similarity between two nodes for a given theme (e.g. education, health, etc.).
In the next cell, we apply this pipeline to several theme/period combinations, with the countries grouped by income level.
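network_plot is implemented in Modules/plots; at its core is a construction along these lines (a minimal sketch, assuming standardized features and pygsp's k-nearest-neighbour graph; the actual function also handles labels and layout):
# Hedged sketch of the similarity-graph construction inside network_plot:
# countries are nodes; edge weights encode feature similarity, sparsified
# by keeping the 40 nearest neighbours of each node (sparse='NN', neigh=40).
features = agg_theme_data['2008-2010']['health']
X = preprocessing.StandardScaler().fit_transform(features)  # z-score each indicator
G = graphs.NNGraph(X, NNtype='knn', k=40)                   # 40-NN similarity graph
print('{} nodes, {} edges'.format(G.N, G.Ne))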
In [15]:
network_plot(agg_theme_data, period='1996-1998', theme='health', sparse='NN', neigh=40,data=data)
network_plot(agg_theme_data, period='2008-2010', theme='health', sparse='NN', neigh=40,data=data)
network_plot(agg_theme_data, period='1999-2001', theme='social', sparse='NN', neigh=40,data=data)
network_plot(agg_theme_data, period='2008-2010', theme='social', sparse='NN', neigh=40,data=data)
network_plot(agg_theme_data, period='1993-1995', theme='technology', sparse='NN', neigh=40,data=data)
network_plot(agg_theme_data, period='1996-1998', theme='technology', sparse='NN', neigh=40,data=data)
In [19]:
from IPython.display import Javascript, display, HTML
display(HTML(open("./Visualization/draw_network.html", "r").read()))
"""/!!\ THIS DYNAMIC VISUALIZATION DOES NOT WORK HERE --> SEE IN FOLDER VISUALIZATION """
Out[19]:
In [16]:
# PCA VISUALIZATION
pca_plot(agg_theme_data, period='1999-2001', theme='social', methode='incomeLevel')
pca_plot(agg_theme_data, period='2002-2004', theme='social', methode='incomeLevel')
pca_plot(agg_theme_data, period='2005-2007', theme='social', methode='incomeLevel')
Another step in the exploration consisted in checking how separable the data is with Principal Component Analysis (PCA). The plots show the projection of the data for the theme 'social' onto the first two principal components, for three different periods. In the first plot, the data is nicely distinguishable in terms of income level: the colors of the labels change smoothly when moving along the x-axis (PC1). Going through the years, the data comes to project almost exclusively onto the first PC, meaning the second PC no longer provides information about the variance of the data. Only the 'social' features resulted in data that was well separable with PCA.
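pca_plot is implemented in Modules/plots; the projection itself amounts to something like this minimal sketch (standardization plus sklearn's PCA):
# Hedged sketch of the projection inside pca_plot: standardize the 'social'
# features of one period, then project onto the first two principal components.
features = agg_theme_data['1999-2001']['social']
X = preprocessing.StandardScaler().fit_transform(features)
pca = decomposition.PCA(n_components=2)
X2 = pca.fit_transform(X)
print('explained variance ratios: {}'.format(pca.explained_variance_ratio_))
plt.scatter(X2[:, 0], X2[:, 1])
plt.xlabel('PC1'); plt.ylabel('PC2');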
From the exploration part, we could see that the migration categorization was not relevant and did not correlate with any variables. However, the similarity networks showed good clustering in terms of the region and income-level categories.
In this part, we would like to see whether we can infer the income-level classes from the information carried by the different thematic features.
To do so, we first embedded each graph by projecting the data onto its second and third Laplacian eigenvectors.
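plot_spectral is implemented in Modules/plots; a minimal sketch of the embedding step (assuming the same k-NN graph as in the pipeline sketch above, and pygsp's Fourier basis, i.e. the Laplacian eigenvectors):
# Hedged sketch of the embedding inside plot_spectral: build the k-NN graph,
# then place each country at its coordinates in the 2nd and 3rd Laplacian
# eigenvectors (the 1st eigenvector is constant and carries no information).
features = agg_theme_data['2005-2007']['social']
X = preprocessing.StandardScaler().fit_transform(features)
G = graphs.NNGraph(X, NNtype='knn', k=40)
G.compute_fourier_basis()                 # eigendecomposition of the graph Laplacian
embedding = G.U[:, 1:3]                   # columns 1 and 2 = 2nd and 3rd eigenvectors
plt.scatter(embedding[:, 0], embedding[:, 1]);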
In [17]:
%matplotlib inline
plotting.BACKEND = 'matplotlib'
plot_spectral(agg_theme_data, period='1996-1998', theme='health', methode='incomeLevel')
plot_spectral(agg_theme_data, period='2005-2007', theme='social', methode='incomeLevel')
plot_spectral(agg_theme_data, period='2008-2010', theme='social', methode='incomeLevel')
plot_spectral(agg_theme_data, period='1993-1995', theme='technology', methode='incomeLevel')
In this part we intended to classify the countries into the different income-level categories: low income, lower middle income, upper middle income and high income. To do so, we used a transductive, graph-based learning method: it computes a label signal that is smooth on the graph, from which we infer the predicted labels. In general, the obtained errors are quite high (around 50%) with a hiding proportion of 0.25. However, we noticed that some themes allow better predictions than others. For example, features related to technology offered a low classification error, as did health and economic features. These are the same themes that already gave well-clustered networks in the exploration part.
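The inference itself is hidden in our Modules; one standard way to compute such a smooth solution is Tikhonov regularization on the graph, sketched below (a minimal sketch only: G is a pygsp graph as built above, true_labels is a hypothetical integer label array aligned with its nodes, and the project's actual method may differ):
# Hedged sketch of graph-based label inference: hide a fraction of the labels,
# recover a class-score signal that fits the observed labels and is smooth on
# the graph, then predict each hidden node by its largest recovered score.
import scipy.sparse.linalg

def smooth_labels(G, labels, mask, tau=1.0):
    # One-hot encode the labels, then zero out the hidden nodes
    n_classes = labels.max() + 1
    Y = np.zeros((G.N, n_classes))
    Y[np.arange(G.N), labels] = 1.0
    Y[~mask] = 0.0
    # Solve (M + tau * L) x = y per class: x fits the observed labels where
    # mask is True and stays smooth (small x^T L x) everywhere else
    A = (scipy.sparse.diags(mask.astype(float)) + tau * G.L).tocsc()
    X = np.column_stack([scipy.sparse.linalg.spsolve(A, Y[:, c])
                         for c in range(n_classes)])
    return X.argmax(axis=1)

mask = np.random.rand(G.N) > 0.25           # hide ~25% of the labels
pred = smooth_labels(G, true_labels, mask)  # true_labels: hypothetical int array
err = np.mean(pred[~mask] != true_labels[~mask])
print('classification error on hidden nodes: {:.0%}'.format(err))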
We notice that the lowest error, 29%, is obtained for technology in the period 1993-1995. The technology features were found with the key words 'telephone', 'cellular' and 'electricity'. This period is critical for discriminating between countries, especially in terms of income level: it is when the technology boom happened in Western countries.
We have seen throughout the project that indicators grouped into themes could, to some extent, predict the income level of a country. Classifying the countries by geographical region was however more difficult, and the errors were very large. This can be explained by the fact that we had 7 regions to classify and that some of them had very few countries remaining.
Some of the themes (demographics, economy, health, social, technology) were better at discriminating between income levels than others, such as education or gas production. The fact that 'economy' could nicely predict income levels is quite intuitive: the income level is defined by GNI per capita, which is tightly linked to the economic health of a country.
A limitation of the analysis is that the chosen themes were very broad and not necessarily comparable, as not all of them had the same number of indicators. Some themes even had more indicators than actual observations, which makes classification very difficult.