In this notebook we demonstrate how to create a DDF dataset with ddf_utils. We will use International Migrant Stock: The 2019 Revision as the source and convert the migrant stock by destination and origin. Below are screenshots of this source file.
Data Tables:
ANNEX sheet:
The main indicator in this data is migrant_stock. But when origin = total or destination = total, we can derive immigrant_stock or emigrant_stock. So we will have 3 indicators in the DDF dataset. Also, each indicator will have 2 variants: one with the gender domain and one without.
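To make the target concrete, here is the file layout we are aiming for (collected from the save steps later in this notebook):
ddf--concepts.csv
ddf--entities--geo.csv
ddf--entities--gender.csv
ddf--datapoints--migrant_stock--by--destination--origin--year.csv
ddf--datapoints--migrant_stock--by--destination--origin--gender--year.csv
ddf--datapoints--emigrant_stock--by--geo--year.csv
ddf--datapoints--emigrant_stock--by--geo--gender--year.csv
ddf--datapoints--immigrant_stock--by--geo--year.csv
ddf--datapoints--immigrant_stock--by--geo--gender--year.csv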
In [1]:
import numpy as np
import pandas as pd
# from ddf_utils.dsl import *
In [2]:
source = '../source/UN_MigrantStockByOriginAndDestination_2019.xlsx'
First of all, let's create a data reader to load a table into a pandas DataFrame with the correct data types,
i.e. the data from the Total column onward should all be numbers (float64).
We can see there are a few things we should handle:
there are ".." placeholders inside the number columns; we should treat ".." as N/A values.
In [3]:
# NOTE: this function will be re-written below.
def read_source(sheet, skip=14, header=[0, 1]):
    return pd.read_excel(source, skiprows=skip, sheet_name=sheet, header=header,
                         na_values=['..', ''], keep_default_na=False)
In [4]:
# now try to load Table 1
table1 = read_source('Table 1', 14)
In [5]:
table1.head()
Out[5]:
In [6]:
# function to fix column names
def fix_column_names(df_):
    df = df_.copy()
    new_cols = np.r_[
        df.columns.get_level_values(0).values[:6],  # column 6 is `Total`
        df.columns.get_level_values(1).values[6:]]
    df.columns = new_cols
    return df
In [7]:
fix_column_names(table1).head()
Out[7]:
In [8]:
table1 = fix_column_names(table1)
In [9]:
# see data types.
table1.dtypes['Total':]
Out[9]:
Now there is a problem: the Albania column has object dtype, which is not the desired data type.
We need to double check what went wrong.
Besides checking data types, we should also check that the numbers are loaded correctly ("153,011,473" in the excel table should be 153011473 in the DataFrame). We can check this by viewing the source excel and the DataFrame side by side.
Depending on the actual data you are working with, other kinds of double checking might be required.
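For example, a minimal programmatic spot check might look like the sketch below. The helper and the example values in the comment are hypothetical, not verified against the source; substitute a cell you have eyeballed in the excel.
def spot_check(df, code, year, column, expected):
    """Compare one loaded cell against a value read off the source excel."""
    row = df[(df['Code'] == code) & (df['Year'] == year)]
    value = row[column].iloc[0]
    assert value == expected, f'{value} != {expected}'

# hypothetical usage, with a cell you verified manually:
# spot_check(table1, 900, 1990, 'Total', 153011473)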
In [10]:
def isnumeric(x):
    """Check if a value is a numeric value."""
    # note: use np.integer here; the np.int alias was removed from recent numpy
    return isinstance(x, (np.floating, np.integer, int, float))
In [11]:
alb = table1['Albania'].dropna()
In [12]:
alb.loc[~alb.map(isnumeric)]
Out[12]:
This means that there are "-" values in the data table. We will treat them as N/A too.
In [13]:
# redefine the read_source function
def read_source(sheet, skip=14, header=[0, 1]):
    return pd.read_excel(source, skiprows=skip, sheet_name=sheet, header=header,
                         na_values=['..', '', '-'], keep_default_na=False)
In [14]:
table1 = fix_column_names(read_source('Table 1', 14))
In [15]:
all([x == 'float64' for x in table1.dtypes['Total':]])
Out[15]:
In [16]:
table2 = fix_column_names(read_source('Table 2', 14))
In [17]:
all([x == 'float64' for x in table2.dtypes['Total':]])
Out[17]:
In [18]:
table3 = fix_column_names(read_source('Table 3', 14))
In [19]:
all([x == 'float64' for x in table3.dtypes['Total':]])
Out[19]:
Now let's load the ANNEX table and create the geo domain, using the country code as identifier. The code values are loaded as numbers, so we will convert them to str. The ANNEX table also marks which groups each country belongs to (e.g. UN development groups); we will treat these groups as entity sets. In ddf_utils there are data classes for Entity and EntityDomain that help with this.
In [20]:
from ddf_utils.model.ddf import Entity, EntityDomain
from ddf_utils.str import to_concept_id
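As a quick illustration of these classes (a minimal sketch; the entity below is hypothetical, not read from the source): an Entity carries an id, its domain, the sets it belongs to, and extra properties; an EntityDomain collects entities and serializes them with to_dict(), where set membership becomes boolean is--<set> columns. We rely on those columns later when building the concepts table.
# hypothetical example entity, for illustration only
demo = EntityDomain('geo', [])
demo.add_entity(Entity(id='4', domain='geo', sets=['less_developed_regions'],
                       props={'name': 'Afghanistan'}))
pd.DataFrame.from_dict(demo.to_dict())
# -> columns like: geo, name, is--less_developed_regions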
In [21]:
country = read_source('ANNEX', 15, 0)
In [22]:
country.head()
Out[22]:
In [23]:
country.columns
Out[23]:
In [24]:
# load all entities
ents = []
current_set = None
sets_in_col = ['More Developed Regions', 'Less Developed Regions',
               'Least developed countries', 'High-income Countries',
               'Middle-income Countries', 'Upper-middle-income Countries',
               'Lower-middle-income Countries', 'Low-income Countries',
               'No income group available', 'Sub-Saharan Africa']
for i, row in country.iterrows():
    name = row['Region, subregion, country or area']
    if pd.isnull(row['Code']):  # it's a set, and rows below this row belong to this set
        current_set = to_concept_id(name)
        continue
    else:
        sets = set()
        for s in sets_in_col:
            if not pd.isnull(row[s]):
                sets.add(to_concept_id(s))
        if current_set:
            sets.add(current_set)
        ents.append(Entity(id=str(int(row['Code'])), domain='geo',
                           sets=list(sets), props={'name': name}))
In [25]:
# Notice that the data table contains "Other south" and "Other north", which are not in the ANNEX table.
# So we append these 2 entities too.
ents.append(
    Entity(id='other_south', domain='geo', sets=[], props=dict(name='Other South')))
ents.append(
    Entity(id='other_north', domain='geo', sets=[], props=dict(name='Other North')))
In [26]:
domain = EntityDomain('geo', [])
for e in ents:
    domain.add_entity(e)
In [27]:
country_df = pd.DataFrame.from_dict(domain.to_dict())
In [28]:
country_df.head()
Out[28]:
In [29]:
# check: do origin and destination share the same entities?
origin_cols = table1.columns[9:]
missing = [c for c in origin_cols if c not in country_df['name'].values]
if missing:
    for c in missing:
        print(f"missing {c}")
else:
    print("all countries in Table 1 are in country domain")
In [30]:
# save to file.
# country_df.to_csv('../../ddf--entities--geo.csv', index=False)
In [31]:
# gender domain
sexdf = pd.DataFrame([
    ['male', 'Male'],
    ['female', 'Female']
], columns=['gender', 'name'])
In [32]:
# sexdf.to_csv('../../ddf--entities--gender.csv', index=False)
Tables 1-3 are now loaded into DataFrames; we need to do some transformation to turn them into datapoints:
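The target shape is long format: one row per (destination, origin, year) combination, with the value in a migrant_stock column. Schematically (the values here are placeholders, not real data):
destination  year  origin  migrant_stock
900          1990  900     <total stock>
900          1990  4       <stock originating from geo 4>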
In [33]:
# create a mapping from name to code
def _trans(x):
    if not pd.isnull(x):
        return str(int(x))
    return x

name_id_map = country.set_index('Region, subregion, country or area')['Code'].map(_trans).to_dict()
In [34]:
# there are some areas that don't have codes in the ANNEX table; set them here
name_id_map['Total'] = '900'
name_id_map['Other South'] = 'other_south'
name_id_map['Other North'] = 'other_north'
In [35]:
dp_cols = ['Code', 'Year']
for c in origin_cols:
    dp_cols.append(c)
In [36]:
pd.Series(dp_cols)
Out[36]:
In [37]:
dp = table1[dp_cols].copy()
dp = dp.dropna(subset=['Code', 'Year'], how='any')
dp['Code'] = dp['Code'].map(int)
dp['Year'] = dp['Year'].map(int)
dp = dp.set_index(['Code', 'Year'])
# rename remaining columns to geo id
dp.columns = dp.columns.map(name_id_map)
# create new dimension
dp = dp.stack()
dp.index.names = ['destination', 'year', 'origin']
dp.name = 'migrant_stock'
In [38]:
dp.head()
Out[38]:
In [39]:
# double check: are there any duplicated index entries?
dp.index.has_duplicates
Out[39]:
In [40]:
# seems something has gone wrong; digging in...
In [41]:
dp[dp.index.duplicated(keep=False)].sort_index()
Out[41]:
In [42]:
dp[dp.index.duplicated(keep=False)].reset_index()['destination'].unique()
Out[42]:
In [43]:
# so only country code 909 has duplicates. let's see what 909 means
In [44]:
country[country['Code'] == 909]
Out[44]:
In [45]:
table1[table1['Code'] == 909]
Out[45]:
In [46]:
# so Oceania appears twice, and the data values are the same for both rows. We can safely drop these duplicates.
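Before dropping, a small safety check (an optional sketch, not part of the original workflow) can confirm that the duplicated index entries really do carry identical values:
# group duplicated rows by the full index; each group should contain
# exactly one distinct value if the duplicates are truly redundant
dups = dp[dp.index.duplicated(keep=False)]
assert (dups.groupby(level=['destination', 'year', 'origin']).nunique() == 1).all()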
In [47]:
# create a function to process the data tables
def create_datapoints(table):
    origin_cols = table.columns[6:]
    for c in origin_cols:
        # double check: is the country/region in the geo domain?
        if c not in country_df['name'].values:
            print(f'column {c} is not in geo domain!')
    dp_cols = ['Code', 'Year']
    dp_cols.extend(origin_cols)
    dp = table[dp_cols].copy()
    dp = dp.dropna(subset=['Code', 'Year'], how='any')
    dp['Code'] = dp['Code'].map(int)
    dp['Year'] = dp['Year'].map(int)
    dp = dp.set_index(['Code', 'Year'])
    dp.columns = dp.columns.map(name_id_map)
    dp = dp.stack().reset_index()
    dp.columns = ['destination', 'year', 'origin', 'migrant_stock']
    dp = dp.drop_duplicates()
    return dp[['destination', 'origin', 'year', 'migrant_stock']]
In [48]:
# also a function for emigrant datapoints
def create_emigrant_datapoints(dp):
    # destination codes are ints after create_datapoints, so compare with 900
    dp_emg = dp[(dp.destination == 900)].copy()
    dp_emg = dp_emg[['origin', 'year', 'migrant_stock']]
    dp_emg.columns = ['geo', 'year', 'emigrant_stock']
    return dp_emg
In [49]:
# and one for immigrant datapoints
def create_immigrant_datapoints(dp):
    # origin ids are strings (mapped via name_id_map), so compare with '900'
    dp_img = dp[(dp.origin == '900')].copy()
    dp_img = dp_img[['destination', 'year', 'migrant_stock']]
    dp_img.columns = ['geo', 'year', 'immigrant_stock']
    return dp_img
In [50]:
df = create_datapoints(table1)
df_emg = create_emigrant_datapoints(df)
df_img = create_immigrant_datapoints(df)
In [51]:
df.head()
Out[51]:
In [52]:
# check: are there still duplicated (destination, origin, year) rows? should be False.
df.duplicated(subset=['destination', 'origin', 'year']).any()
Out[52]:
In [53]:
df_emg.head()
Out[53]:
In [54]:
df_img.head()
Out[54]:
In [55]:
# save to files
# df.to_csv('../../ddf--datapoints--migrant_stock--by--destination--origin--year.csv', index=False)
# df_emg.to_csv('../../ddf--datapoints--emigrant_stock--by--geo--year.csv', index=False)
# df_img.to_csv('../../ddf--datapoints--immigrant_stock--by--geo--year.csv', index=False)
In [56]:
# Table 2 and Table 3
# We will add a gender dimension to them and merge them together.
df2 = create_datapoints(table2)
df2_emg = create_emigrant_datapoints(df2)
df2_img = create_immigrant_datapoints(df2)
df2['gender'] = 'male'
df2 = df2[['destination', 'origin', 'gender', 'year', 'migrant_stock']]
df2_emg['gender'] = 'male'
df2_emg = df2_emg[['geo', 'gender', 'year', 'emigrant_stock']]
df2_img['gender'] = 'male'
df2_img = df2_img[['geo', 'gender', 'year', 'immigrant_stock']]
In [57]:
df3 = create_datapoints(table3)
df3_emg = create_emigrant_datapoints(df3)
df3_img = create_immigrant_datapoints(df3)
df3['gender'] = 'female'
df3 = df3[['destination', 'origin', 'gender', 'year', 'migrant_stock']]
df3_emg['gender'] = 'female'
df3_emg = df3_emg[['geo', 'gender', 'year', 'emigrant_stock']]
df3_img['gender'] = 'female'
df3_img = df3_img[['geo', 'gender', 'year', 'immigrant_stock']]
In [58]:
df_sex = pd.concat([df2, df3], ignore_index=True)
df_sex_emg = pd.concat([df2_emg, df3_emg], ignore_index=True)
df_sex_img = pd.concat([df2_img, df3_img], ignore_index=True)
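As an optional sanity check (a sketch added here, not part of the original workflow): for each (destination, origin, year), the male and female stocks should sum to the total from Table 1, up to any rounding in the source.
# pivot the gendered datapoints to one column per gender, then join the totals
check = df_sex.pivot_table(index=['destination', 'origin', 'year'],
                           columns='gender', values='migrant_stock')
check = check.join(df.set_index(['destination', 'origin', 'year'])['migrant_stock'])
# largest absolute deviation between male + female and the total
print((check['male'] + check['female'] - check['migrant_stock']).abs().max())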
In [59]:
# save to files
# df_sex.to_csv('../../ddf--datapoints--migrant_stock--by--destination--origin--gender--year.csv', index=False)
# df_sex_emg.to_csv('../../ddf--datapoints--emigrant_stock--by--geo--gender--year.csv', index=False)
# df_sex_img.to_csv('../../ddf--datapoints--immigrant_stock--by--geo--gender--year.csv', index=False)
In [60]:
from ddf_utils.model.ddf import Concept
In [61]:
strings = [
    ['name', 'Name'],
    ['domain', 'Domain'],
]
measures = [
    ['migrant_stock', 'Migrant Stock'],  # must match the indicator id used in the datapoints
    ['emigrant_stock', 'Emigrant Stock'],
    ['immigrant_stock', 'Immigrant Stock']
]
entity_domains = [
    ['geo', 'Geo Location'],
    ['gender', 'Gender']
]
entity_sets_geo_ids = [c[4:] for c in country_df.columns if c.startswith('is--')]
entity_sets_geo_names = [x.replace('_', ' ').title() for x in entity_sets_geo_ids]
entity_sets_geo = list(zip(entity_sets_geo_ids, entity_sets_geo_names))
In [62]:
roles = [
    ['destination', 'Destination'],
    ['origin', 'Origin']
]
In [63]:
concepts = list()
for e, n in strings:
    concepts.append(Concept(e, 'string', dict(name=n)))
for e, n in measures:
    concepts.append(Concept(e, 'measure', dict(name=n)))
for e, n in entity_domains:
    concepts.append(Concept(e, 'entity_domain', dict(name=n)))
for e, n in entity_sets_geo:
    concepts.append(Concept(e, 'entity_set', dict(name=n, domain='geo')))
for e, n in roles:
    concepts.append(Concept(e, 'role', dict(name=n, domain='geo')))
In [64]:
concepts.append(Concept('year', 'time', dict(name='Time')))
In [65]:
cdf = pd.DataFrame.from_records([x.to_dict() for x in concepts])
In [66]:
# cdf.to_csv('../../ddf--concepts.csv', index=False)
In [ ]:
# create datapackage
# run the ddf command below in a terminal (the ! prefix runs it from the notebook)
!ddf create_datapackage --update ../../