In this notebook we demonstrate how to create a DDF dataset with ddf_utils. We will use International Migrant Stock: The 2019 Revision as the source and convert the migrant stock by destination and origin. Below are screenshots of this source file.
Data Tables:
ANNEX sheet:
The main indicator in this data is migrant_stock. But when origin = total or destination = total, we can derive immigrant_stock or emigrant_stock. So we will have 3 indicators in the DDF dataset. Also, each indicator will have 2 variants: one with the gender domain and one without.
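To make the target concrete, here is the file layout we are aiming for (collected from the save steps later in this notebook):
ddf--concepts.csv
ddf--entities--geo.csv
ddf--entities--gender.csv
ddf--datapoints--migrant_stock--by--destination--origin--year.csv
ddf--datapoints--migrant_stock--by--destination--origin--gender--year.csv
ddf--datapoints--emigrant_stock--by--geo--year.csv
ddf--datapoints--emigrant_stock--by--geo--gender--year.csv
ddf--datapoints--immigrant_stock--by--geo--year.csv
ddf--datapoints--immigrant_stock--by--geo--gender--year.csv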
In [1]:
import numpy as np
import pandas as pd
# from ddf_utils.dsl import *
In [2]:
source = '../source/UN_MigrantStockByOriginAndDestination_2019.xlsx'
First of all, let's create a data reader to load a table into a pandas DataFrame with the correct data types,
i.e. the data from the Total column onward should all be numbers (float64).
We can see there are a few things we should handle:
there are ".." placeholders inside the number columns; we should treat ".." as N/A values.
In [3]:
# NOTE: this function will be re-written below.
def read_source(sheet, skip=14, header=[0, 1]):
    return pd.read_excel(source, skiprows=skip, sheet_name=sheet, header=header,
                         na_values=['..', ''], keep_default_na=False)
In [4]:
# now try to load Table 1
table1 = read_source('Table 1', 14)
In [5]:
table1.head()
Out[5]:
In [6]:
# function to fix column names
def fix_column_names(df_):
    df = df_.copy()
    new_cols = np.r_[
        df.columns.get_level_values(0).values[:6],  # column 6 is `Total`
        df.columns.get_level_values(1).values[6:]]
    df.columns = new_cols
    return df
In [7]:
fix_column_names(table1).head()
Out[7]:
In [8]:
table1 = fix_column_names(table1)
In [9]:
# see data types.
table1.dtypes['Total':]
Out[9]:
Now there is a problem: the Albania column has object dtype, which is not the desired data type.
We need to double check what went wrong.
Besides checking data types, we should also check that the numbers are loaded correctly ("153,011,473" in the excel table should be 153011473 in the DataFrame). We can check this by viewing the source excel and the DataFrame side by side.
Depending on the actual data you are working with, other kinds of double checking might be required.
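For example, a minimal programmatic spot check might look like the sketch below. The helper and the example values in the comment are hypothetical, not verified against the source; substitute a cell you have eyeballed in the excel.
def spot_check(df, code, year, column, expected):
    """Compare one loaded cell against a value read off the source excel."""
    row = df[(df['Code'] == code) & (df['Year'] == year)]
    value = row[column].iloc[0]
    assert value == expected, f'{value} != {expected}'

# hypothetical usage, with a cell you verified manually:
# spot_check(table1, 900, 1990, 'Total', 153011473)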
In [10]:
def isnumeric(x):
    """Check if a value is a numeric value."""
    # note: use np.integer here; the np.int alias was removed from recent numpy
    return isinstance(x, (np.floating, np.integer, int, float))
In [11]:
alb = table1['Albania'].dropna()
In [12]:
alb.loc[~alb.map(isnumeric)]
Out[12]:
This means that there are "-" values in the data table. We will treat them as N/A too.
In [13]:
# redefine the read_source function
def read_source(sheet, skip=14, header=[0, 1]):
    return pd.read_excel(source, skiprows=skip, sheet_name=sheet, header=header,
                         na_values=['..', '', '-'], keep_default_na=False)
In [14]:
table1 = fix_column_names(read_source('Table 1', 14))
In [15]:
all([x == 'float64' for x in table1.dtypes['Total':]])
Out[15]:
In [16]:
table2 = fix_column_names(read_source('Table 2', 14))
In [17]:
all([x == 'float64' for x in table2.dtypes['Total':]])
Out[17]:
In [18]:
table3 = fix_column_names(read_source('Table 3', 14))
In [19]:
all([x == 'float64' for x in table3.dtypes['Total':]])
Out[19]:
Now let's load the ANNEX table and create the geo domain, using the country code as identifier. The code values are loaded as numbers, so we will convert them to str. The ANNEX table also marks which groups each country belongs to (e.g. UN development groups); we will treat these groups as entity sets. In ddf_utils there are data classes for Entity and EntityDomain that help with this.
In [20]:
from ddf_utils.model.ddf import Entity, EntityDomain
from ddf_utils.str import to_concept_id
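As a quick illustration of these classes (a minimal sketch; the entity below is hypothetical, not read from the source): an Entity carries an id, its domain, the sets it belongs to, and extra properties; an EntityDomain collects entities and serializes them with to_dict(), where set membership becomes boolean is--<set> columns. We rely on those columns later when building the concepts table.
# hypothetical example entity, for illustration only
demo = EntityDomain('geo', [])
demo.add_entity(Entity(id='4', domain='geo', sets=['less_developed_regions'],
                       props={'name': 'Afghanistan'}))
pd.DataFrame.from_dict(demo.to_dict())
# -> columns like: geo, name, is--less_developed_regions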
In [21]:
country = read_source('ANNEX', 15, 0)
In [22]:
country.head()
Out[22]:
In [23]:
country.columns
Out[23]:
In [24]:
# load all entities
ents = []
current_set = None
sets_in_col = ['More Developed Regions', 'Less Developed Regions',
               'Least developed countries', 'High-income Countries',
               'Middle-income Countries', 'Upper-middle-income Countries',
               'Lower-middle-income Countries', 'Low-income Countries',
               'No income group available', 'Sub-Saharan Africa']
for i, row in country.iterrows():
    name = row['Region, subregion, country or area']
    if pd.isnull(row['Code']):  # it's a set, and rows below this row belong to this set
        current_set = to_concept_id(name)
        continue
    else:
        sets = set()
        for s in sets_in_col:
            if not pd.isnull(row[s]):
                sets.add(to_concept_id(s))
        if current_set:
            sets.add(current_set)
        ents.append(Entity(id=str(int(row['Code'])), domain='geo',
                           sets=list(sets), props={'name': name}))
In [25]:
# Notice that the data table contains "Other south" and "Other north", which are not in the ANNEX table.
# So we append these 2 entities too.
ents.append(
    Entity(id='other_south', domain='geo', sets=[], props=dict(name='Other South')))
ents.append(
    Entity(id='other_north', domain='geo', sets=[], props=dict(name='Other North')))
In [26]:
domain = EntityDomain('geo', [])
for e in ents:
    domain.add_entity(e)
In [27]:
country_df = pd.DataFrame.from_dict(domain.to_dict())
In [28]:
country_df.head()
Out[28]:
In [29]:
# check: do origin and destination share the same entities?
origin_cols = table1.columns[9:]
missing = [c for c in origin_cols if c not in country_df['name'].values]
if missing:
    for c in missing:
        print(f"missing {c}")
else:
    print("all countries in Table 1 are in country domain")
In [30]:
# save to file.
# country_df.to_csv('../../ddf--entities--geo.csv', index=False)
In [31]:
# gender domain
sexdf = pd.DataFrame([
    ['male', 'Male'],
    ['female', 'Female']
], columns=['gender', 'name'])
In [32]:
# sexdf.to_csv('../../ddf--entities--gender.csv', index=False)
Tables 1-3 are now loaded into DataFrames; we need to do some transformation to turn them into datapoints:
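The target shape is long format: one row per (destination, origin, year) combination, with the value in a migrant_stock column. Schematically (the values here are placeholders, not real data):
destination  year  origin  migrant_stock
900          1990  900     <total stock>
900          1990  4       <stock originating from geo 4>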
In [33]:
# create a mapping from name to code
def _trans(x):
    if not pd.isnull(x):
        return str(int(x))
    return x

name_id_map = country.set_index('Region, subregion, country or area')['Code'].map(_trans).to_dict()
In [34]:
# there are some areas that don't have codes in the ANNEX table; set them here
name_id_map['Total'] = '900'
name_id_map['Other South'] = 'other_south'
name_id_map['Other North'] = 'other_north'
In [35]:
dp_cols = ['Code', 'Year']
for c in origin_cols:
    dp_cols.append(c)
In [36]:
pd.Series(dp_cols)
Out[36]:
In [37]:
dp = table1[dp_cols].copy()
dp = dp.dropna(subset=['Code', 'Year'], how='any')
dp['Code'] = dp['Code'].map(int)
dp['Year'] = dp['Year'].map(int)
dp = dp.set_index(['Code', 'Year'])
# rename remaining columns to geo id
dp.columns = dp.columns.map(name_id_map)
# create new dimension
dp = dp.stack()
dp.index.names = ['destination', 'year', 'origin']
dp.name = 'migrant_stock'
In [38]:
dp.head()
Out[38]:
In [39]:
# double check: are there any duplicated index entries?
dp.index.has_duplicates
Out[39]:
In [40]:
# seems something has gone wrong; digging in...
In [41]:
dp[dp.index.duplicated(keep=False)].sort_index()
Out[41]:
In [42]:
dp[dp.index.duplicated(keep=False)].reset_index()['destination'].unique()
Out[42]:
In [43]:
# so only country code 909 has duplicates. let's see what 909 means
In [44]:
country[country['Code'] == 909]
Out[44]:
In [45]:
table1[table1['Code'] == 909]
Out[45]:
In [46]:
# so Oceania appears twice, and the data values are the same for both rows. We can safely drop these duplicates.
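Before dropping, a small safety check (an optional sketch, not part of the original workflow) can confirm that the duplicated index entries really do carry identical values:
# group duplicated rows by the full index; each group should contain
# exactly one distinct value if the duplicates are truly redundant
dups = dp[dp.index.duplicated(keep=False)]
assert (dups.groupby(level=['destination', 'year', 'origin']).nunique() == 1).all()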
In [47]:
# create a function to process the data tables
def create_datapoints(table):
    origin_cols = table.columns[6:]
    for c in origin_cols:
        # double check: is the country/region in the geo domain?
        if c not in country_df['name'].values:
            print(f'column {c} is not in geo domain!')
    dp_cols = ['Code', 'Year']
    dp_cols.extend(origin_cols)
    dp = table[dp_cols].copy()
    dp = dp.dropna(subset=['Code', 'Year'], how='any')
    dp['Code'] = dp['Code'].map(int)
    dp['Year'] = dp['Year'].map(int)
    dp = dp.set_index(['Code', 'Year'])
    dp.columns = dp.columns.map(name_id_map)
    dp = dp.stack().reset_index()
    dp.columns = ['destination', 'year', 'origin', 'migrant_stock']
    dp = dp.drop_duplicates()
    return dp[['destination', 'origin', 'year', 'migrant_stock']]
In [48]:
# also a function for emigrant datapoints
def create_emigrant_datapoints(dp):
    # destination codes are ints after create_datapoints, so compare with 900
    dp_emg = dp[(dp.destination == 900)].copy()
    dp_emg = dp_emg[['origin', 'year', 'migrant_stock']]
    dp_emg.columns = ['geo', 'year', 'emigrant_stock']
    return dp_emg
In [49]:
# and one for immigrant datapoints
def create_immigrant_datapoints(dp):
    # origin ids are strings (mapped via name_id_map), so compare with '900'
    dp_img = dp[(dp.origin == '900')].copy()
    dp_img = dp_img[['destination', 'year', 'migrant_stock']]
    dp_img.columns = ['geo', 'year', 'immigrant_stock']
    return dp_img
In [50]:
df = create_datapoints(table1)
df_emg = create_emigrant_datapoints(df)
df_img = create_immigrant_datapoints(df)
In [51]:
df.head()
Out[51]:
In [52]:
# check: are there still duplicated (destination, origin, year) rows? should be False.
df.duplicated(subset=['destination', 'origin', 'year']).any()
Out[52]:
In [53]:
df_emg.head()
Out[53]:
In [54]:
df_img.head()
Out[54]:
In [55]:
# save to files
# df.to_csv('../../ddf--datapoints--migrant_stock--by--destination--origin--year.csv', index=False)
# df_emg.to_csv('../../ddf--datapoints--emigrant_stock--by--geo--year.csv', index=False)
# df_img.to_csv('../../ddf--datapoints--immigrant_stock--by--geo--year.csv', index=False)
In [56]:
# Table 2 and Table 3
# We will add a gender dimension to them and merge them together.
df2 = create_datapoints(table2)
df2_emg = create_emigrant_datapoints(df2)
df2_img = create_immigrant_datapoints(df2)
df2['gender'] = 'male'
df2 = df2[['destination', 'origin', 'gender', 'year', 'migrant_stock']]
df2_emg['gender'] = 'male'
df2_emg = df2_emg[['geo', 'gender', 'year', 'emigrant_stock']]
df2_img['gender'] = 'male'
df2_img = df2_img[['geo', 'gender', 'year', 'immigrant_stock']]
In [57]:
df3 = create_datapoints(table3)
df3_emg = create_emigrant_datapoints(df3)
df3_img = create_immigrant_datapoints(df3)
df3['gender'] = 'female'
df3 = df3[['destination', 'origin', 'gender', 'year', 'migrant_stock']]
df3_emg['gender'] = 'female'
df3_emg = df3_emg[['geo', 'gender', 'year', 'emigrant_stock']]
df3_img['gender'] = 'female'
df3_img = df3_img[['geo', 'gender', 'year', 'immigrant_stock']]
In [58]:
df_sex = pd.concat([df2, df3], ignore_index=True)
df_sex_emg = pd.concat([df2_emg, df3_emg], ignore_index=True)
df_sex_img = pd.concat([df2_img, df3_img], ignore_index=True)
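As an optional sanity check (a sketch added here, not part of the original workflow): for each (destination, origin, year), the male and female stocks should sum to the total from Table 1, up to any rounding in the source.
# pivot the gendered datapoints to one column per gender, then join the totals
check = df_sex.pivot_table(index=['destination', 'origin', 'year'],
                           columns='gender', values='migrant_stock')
check = check.join(df.set_index(['destination', 'origin', 'year'])['migrant_stock'])
# largest absolute deviation between male + female and the total
print((check['male'] + check['female'] - check['migrant_stock']).abs().max())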
In [59]:
# save to files
# df_sex.to_csv('../../ddf--datapoints--migrant_stock--by--destination--origin--gender--year.csv', index=False)
# df_sex_emg.to_csv('../../ddf--datapoints--emigrant_stock--by--geo--gender--year.csv', index=False)
# df_sex_img.to_csv('../../ddf--datapoints--immigrant_stock--by--geo--gender--year.csv', index=False)
In [60]:
from ddf_utils.model.ddf import Concept
In [61]:
strings = [
    ['name', 'Name'],
    ['domain', 'Domain'],
]
measures = [
    ['migrant_stock', 'Migrant Stock'],  # must match the indicator id used in the datapoints
    ['emigrant_stock', 'Emigrant Stock'],
    ['immigrant_stock', 'Immigrant Stock']
]
entity_domains = [
    ['geo', 'Geo Location'],
    ['gender', 'Gender']
]
entity_sets_geo_ids = [c[4:] for c in country_df.columns if c.startswith('is--')]
entity_sets_geo_names = [x.replace('_', ' ').title() for x in entity_sets_geo_ids]
entity_sets_geo = list(zip(entity_sets_geo_ids, entity_sets_geo_names))
In [62]:
roles = [
    ['destination', 'Destination'],
    ['origin', 'Origin']
]
In [63]:
concepts = list()
for e, n in strings:
    concepts.append(Concept(e, 'string', dict(name=n)))
for e, n in measures:
    concepts.append(Concept(e, 'measure', dict(name=n)))
for e, n in entity_domains:
    concepts.append(Concept(e, 'entity_domain', dict(name=n)))
for e, n in entity_sets_geo:
    concepts.append(Concept(e, 'entity_set', dict(name=n, domain='geo')))
for e, n in roles:
    concepts.append(Concept(e, 'role', dict(name=n, domain='geo')))
In [64]:
concepts.append(Concept('year', 'time', dict(name='Time')))
In [65]:
cdf = pd.DataFrame.from_records([x.to_dict() for x in concepts])
In [66]:
# cdf.to_csv('../../ddf--concepts.csv', index=False)
In [ ]:
# create datapackage
# run the ddf command below in a terminal (the ! prefix runs it from the notebook)
!ddf create_datapackage --update ../../