Create DDF dataset from UN International migrant stock 2019 dataset

In this notebook we are going to demonstrate how to create a DDF dataset with ddf_utils. We will use International migrant stock: The 2019 revision as the source and convert the migrant stock by destination and origin tables. Below are screenshots of this source file.

Data Tables:

ANNEX sheet:

The plan

  1. From the excel tables we can see there is an "ANNEX" sheet, which contains all country/region information. We will convert it into a geo domain.
  2. Three data tables contain total/male/female data. We will create a domain for gender.
  3. Origin and destination are both regions/countries, so they will be roles in the region/country domain.
  4. There is only one indicator in the source: migrant_stock. But when origin = total or destination = total, we can derive immigrant_stock or emigrant_stock. So we will have 3 indicators in the DDF dataset, and each indicator will have 2 variants: one with the gender domain and one without.
  5. We will not keep the Notes/Type of Data/Sort Order columns.
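
For reference, here is a sketch of the files this plan should produce (the names follow the DDF naming convention and match the save commands later in this notebook):

ddf--concepts.csv
ddf--entities--geo.csv
ddf--entities--gender.csv
ddf--datapoints--migrant_stock--by--destination--origin--year.csv
ddf--datapoints--migrant_stock--by--destination--origin--gender--year.csv
ddf--datapoints--emigrant_stock--by--geo--year.csv
ddf--datapoints--emigrant_stock--by--geo--gender--year.csv
ddf--datapoints--immigrant_stock--by--geo--year.csv
ddf--datapoints--immigrant_stock--by--geo--gender--year.csv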

ETL process

Note: This notebook assumes you have created a DDF dataset project with the ddf new command and placed the source file in etl/source and this notebook in etl/notebooks.
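
Assuming that layout, the paths used in this notebook resolve as follows (a sketch; only the etl/ part comes from the ddf new assumption above):

<project root>/          <- DDF csv files are written here ('../../')
  etl/
    source/
      UN_MigrantStockByOriginAndDestination_2019.xlsx   <- '../source/...'
    notebooks/
      (this notebook)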


In [1]:
import numpy as np
import pandas as pd
# from ddf_utils.dsl import *

In [2]:
source = '../source/UN_MigrantStockByOriginAndDestination_2019.xlsx'

First of all, let's create a data reader that loads a table into a pandas DataFrame with the correct data types, i.e. the data starting from the Total column should all be numbers (float64).

We can see there are a few things we need to handle:

  1. we should skip a few lines at the beginning.
  2. there are ".." values inside the number columns; we should treat ".." as N/A.
  3. the headers in the data table span two rows.

In [3]:
# NOTE: this function will be re-written below.
def read_source(sheet, skip=14, header=[0, 1]):
    return pd.read_excel(source, skiprows=skip, sheet_name=sheet, header=header,
                         na_values=['..', ''], keep_default_na=False)

In [4]:
# now try to load Table 1
table1 = read_source('Table 1', 14)

In [5]:
table1.head()


Out[5]:
Year Sort\norder Major area, region, country or area of destination Notes Code Type of data (a) Country or area of origin
Unnamed: 0_level_1 Unnamed: 1_level_1 Unnamed: 2_level_1 Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1 Total Other South Other North Afghanistan ... Uruguay Uzbekistan Vanuatu Venezuela (Bolivarian Republic of) Viet Nam Wallis and Futuna Islands Western Sahara Yemen Zambia Zimbabwe
0 1990 1990001 WORLD NaN 900.0 NaN 153011473.0 6548526.0 2366800.0 6823350.0 ... 237486.0 1428020.0 5060.0 185946.0 1237873.0 6484.0 168239.0 455492.0 85203.0 204365.0
1 1990 1990002 UN development groups NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1990 1990003 More developed regions b 901.0 NaN 82767216.0 3385103.0 1077179.0 119386.0 ... 56838.0 1078563.0 1017.0 114991.0 1085310.0 884.0 333.0 11457.0 26062.0 40957.0
3 1990 1990004 Less developed regions c 902.0 NaN 70244257.0 3163423.0 1289621.0 6703964.0 ... 180648.0 349457.0 4043.0 70955.0 152563.0 5600.0 167906.0 444035.0 59141.0 163408.0
4 1990 1990005 Least developed countries d 941.0 NaN 11060221.0 482753.0 239756.0 0.0 ... 286.0 2027.0 9.0 2510.0 71579.0 0.0 0.0 357.0 26254.0 75122.0

5 rows × 241 columns


In [6]:
# function to fix column names

def fix_column_names(df_):
    df = df_.copy()
    
    new_cols = np.r_[
        df.columns.get_level_values(0).values[:6],  # column 6 is `Total`
        df.columns.get_level_values(1).values[6:]]
    df.columns = new_cols
    return df

In [7]:
fix_column_names(table1).head()


Out[7]:
Year Sort\norder Major area, region, country or area of destination Notes Code Type of data (a) Total Other South Other North Afghanistan ... Uruguay Uzbekistan Vanuatu Venezuela (Bolivarian Republic of) Viet Nam Wallis and Futuna Islands Western Sahara Yemen Zambia Zimbabwe
0 1990 1990001 WORLD NaN 900.0 NaN 153011473.0 6548526.0 2366800.0 6823350.0 ... 237486.0 1428020.0 5060.0 185946.0 1237873.0 6484.0 168239.0 455492.0 85203.0 204365.0
1 1990 1990002 UN development groups NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1990 1990003 More developed regions b 901.0 NaN 82767216.0 3385103.0 1077179.0 119386.0 ... 56838.0 1078563.0 1017.0 114991.0 1085310.0 884.0 333.0 11457.0 26062.0 40957.0
3 1990 1990004 Less developed regions c 902.0 NaN 70244257.0 3163423.0 1289621.0 6703964.0 ... 180648.0 349457.0 4043.0 70955.0 152563.0 5600.0 167906.0 444035.0 59141.0 163408.0
4 1990 1990005 Least developed countries d 941.0 NaN 11060221.0 482753.0 239756.0 0.0 ... 286.0 2027.0 9.0 2510.0 71579.0 0.0 0.0 357.0 26254.0 75122.0

5 rows × 241 columns


In [8]:
table1 = fix_column_names(table1)

In [9]:
# see data types.
table1.dtypes['Total':]


Out[9]:
Total                        float64
Other South                  float64
Other North                  float64
Afghanistan                  float64
Albania                       object
                              ...   
Wallis and Futuna Islands    float64
Western Sahara               float64
Yemen                        float64
Zambia                       float64
Zimbabwe                     float64
Length: 235, dtype: object

Now there is a problem: the Albania column has object dtype, which is not the desired data type. We need to double-check what went wrong.

Note

Besides checking data types, we should also check that the numbers are loaded correctly ("153,011,473" in the excel table should be 153011473 in the DataFrame). We can check this by viewing the source excel and the DataFrame side by side.

Depending on the actual data you are working with, other kinds of double-checking might be required.
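
For example, a minimal spot check against a value we can read off the source (the 1990 WORLD Total, 153,011,473, visible in the table output above) could look like this:

# spot-check one known cell against the source excel
assert table1.loc[0, 'Total'] == 153011473.0  # WORLD, 1990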


In [10]:
def isnumeric(x):
    """check if a value is numeric"""
    # note: np.int was removed in newer numpy; np.integer covers all numpy integer types
    return isinstance(x, (np.floating, np.integer, int, float))

In [11]:
alb = table1['Albania'].dropna()

In [12]:
alb.loc[~alb.map(isnumeric)]


Out[12]:
260     -
543     -
826     -
1109    -
Name: Albania, dtype: object

This means that there are "-" values in the data table. We will treat them as N/A too.


In [13]:
# redefine the read_source function

def read_source(sheet, skip=14, header=[0, 1]):
    return pd.read_excel(source, skiprows=skip, sheet_name=sheet, header=header,
                         na_values=['..', '', '-'], keep_default_na=False)

In [14]:
table1 = fix_column_names(read_source('Table 1', 14))

In [15]:
all([x == 'float64' for x in table1.dtypes['Total':]])


Out[15]:
True

In [16]:
table2 = fix_column_names(read_source('Table 2', 14))

In [17]:
all([x == 'float64' for x in table2.dtypes['Total':]])


Out[17]:
True

In [18]:
table3 = fix_column_names(read_source('Table 3', 14))

In [19]:
all([x == 'float64' for x in table3.dtypes['Total':]])


Out[19]:
True

geo domain

Now let's load the ANNEX table and create the geo domain:

  1. we will use code as the identifier. The column values are numbers; we will convert them to str.
  2. rows where code is empty are group names (for example "UN development groups"). We will treat them as entity sets.
  3. columns are indicators for entity set membership too.

In ddf_utils there are data classes for Entity and EntityDomain.


In [20]:
from ddf_utils.model.ddf import Entity, EntityDomain
from ddf_utils.str import to_concept_id

In [21]:
country = read_source('ANNEX', 15, 0)

In [22]:
country.head()


Out[22]:
Index Region, subregion, country or area Notes Code Type of data More Developed Regions Less Developed Regions Least developed countries High-income Countries Middle-income Countries Upper-middle-income Countries Lower-middle-income Countries Low-income Countries No income group available Sub-Saharan Africa
0 1 WORLD NaN 900.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 UN development groups NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 More developed regions b 901.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 Less developed regions c 902.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 Least developed countries d 941.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

In [23]:
country.columns


Out[23]:
Index(['Index', 'Region, subregion, country or area', 'Notes', 'Code',
       'Type of data', 'More Developed Regions', 'Less Developed Regions',
       'Least developed countries', 'High-income Countries',
       'Middle-income Countries', 'Upper-middle-income Countries',
       'Lower-middle-income Countries', 'Low-income Countries',
       'No income group available', 'Sub-Saharan Africa'],
      dtype='object')

In [24]:
# load all entities
ents = []

current_set = None

sets_in_col = ['More Developed Regions', 'Less Developed Regions',
       'Least developed countries', 'High-income Countries',
       'Middle-income Countries', 'Upper-middle-income Countries',
       'Lower-middle-income Countries', 'Low-income Countries',
       'No income group available', 'Sub-Saharan Africa']

for i, row in country.iterrows():
    name = row['Region, subregion, country or area']
    if pd.isnull(row['Code']):  # it's a set, and rows below this row belong to this set.
        current_set = to_concept_id(name)
        continue
    else:
        sets = set()
        for s in sets_in_col:
            if not pd.isnull(row[s]):
                sets.add(to_concept_id(s))
        if current_set:
            sets.add(current_set)
        ents.append(Entity(id=str(int(row['Code'])), domain='geo', sets=list(sets), props={'name': name}))

In [25]:
# Notice that in the data tables there are "Other South" and "Other North", which are not in the ANNEX table.
# So we append these 2 entities too

ents.append(
    Entity(id='other_south', domain='geo', sets=[], props=dict(name='Other South')))

ents.append(
    Entity(id='other_north', domain='geo', sets=[], props=dict(name='Other North')))

In [26]:
domain = EntityDomain('geo', [])

for e in ents:
    domain.add_entity(e)

In [27]:
country_df = pd.DataFrame.from_dict(domain.to_dict())

In [28]:
country_df.head()


Out[28]:
geo name is--un_development_groups is--middle_income_countries is--no_income_group_available is--low_income_countries is--high_income_countries is--more_developed_regions is--geographic_regions is--less_developed_regions is--lower_middle_income_countries is--least_developed_countries is--sustainable_development_goal_sdg_regions is--sub_saharan_africa is--upper_middle_income_countries is--world_bank_income_groups
0 900 WORLD FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1 901 More developed regions TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 902 Less developed regions TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3 941 Least developed countries TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4 934 Less developed regions, excluding least develo... TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

In [29]:
# check: do all origin columns (i.e. the origin side) exist in the geo domain built from the ANNEX sheet?

origin_cols = table1.columns[9:]

missing = [c for c in origin_cols if c not in country_df['name'].values]
for c in missing:
    print(f"missing {c}")
if not missing:
    print("all countries in Table 1 are in country domain")


all countries in Table 1 are in country domain

In [30]:
# save to file.
# country_df.to_csv('../../ddf--entities--geo.csv', index=False)

Gender domain

The gender domain is quite simple; we will just create the DataFrame manually.


In [31]:
# gender domain

sexdf = pd.DataFrame([
    ['male', 'Male'],
    ['female', 'Female']
], columns=['gender', 'name'])

In [32]:
# sexdf.to_csv('../../ddf--entities--gender.csv', index=False)

datapoints

Tables 1-3 are loaded into DataFrames; we need to do some transformations to turn them into datapoints:

  • drop unneeded columns
  • convert origin/destination names to codes
  • origins are in columns, but should be in rows as a dimension (see the sketch after this list)
  • for Table 2 and Table 3, add the gender column and merge them together
  • filter origin = total (destination = total) to create immigrant (emigrant) data
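
To make the reshape step concrete, here is a minimal sketch with toy data (the values are taken from the WORLD 1990 row shown earlier; the variable names are only for illustration):

import pandas as pd

toy = pd.DataFrame({'Code': [900], 'Year': [1990],
                    'Afghanistan': [6823350.0], 'Albania': [180284.0]})
long = toy.set_index(['Code', 'Year']).stack()
long.index.names = ['destination', 'year', 'origin']
# long is now a Series indexed by (destination, year, origin):
# 900  1990  Afghanistan    6823350.0
#            Albania         180284.0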

In [33]:
# create a mapping from name to code.
def _trans(x):
    if not pd.isnull(x):
        return str(int(x))
    return x

name_id_map = country.set_index('Region, subregion, country or area')['Code'].map(_trans).to_dict()

In [34]:
# some names in the data tables don't map to codes via the ANNEX table; set them manually here
name_id_map['Total'] = '900'
name_id_map['Other South'] = 'other_south'
name_id_map['Other North'] = 'other_north'
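
A quick sanity check of the mapping (Afghanistan's numeric code is 4, matching the origin code we will see in the datapoints below):

name_id_map['Afghanistan']   # -> '4'
name_id_map['Total']         # -> '900'
name_id_map['Other South']   # -> 'other_south'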

In [35]:
dp_cols = ['Code', 'Year'] + origin_cols.tolist()

In [36]:
pd.Series(dp_cols)


Out[36]:
0                           Code
1                           Year
2                    Afghanistan
3                        Albania
4                        Algeria
                 ...            
229    Wallis and Futuna Islands
230               Western Sahara
231                        Yemen
232                       Zambia
233                     Zimbabwe
Length: 234, dtype: object

In [37]:
dp = table1[dp_cols].copy()
dp = dp.dropna(subset=['Code', 'Year'], how='any')

dp['Code'] = dp['Code'].map(int)
dp['Year'] = dp['Year'].map(int)

dp = dp.set_index(['Code', 'Year'])

# rename remaining columns to geo id
dp.columns = dp.columns.map(name_id_map)

# create new dimension
dp = dp.stack()

dp.index.names = ['destination', 'year', 'origin']
dp.name = 'migrant_stock'

In [38]:
dp.head()


Out[38]:
destination  year  origin
900          1990  4         6823350.0
                   8          180284.0
                   12         921727.0
                   16           2041.0
                   20           3792.0
Name: migrant_stock, dtype: float64

In [39]:
# double check: are there duplicated index entries?

dp.index.has_duplicates


Out[39]:
True

In [40]:
# seems something went wrong. digging in...

In [41]:
dp[dp.index.duplicated(keep=False)].sort_index()


Out[41]:
destination  year  origin
909          1990  100        1755.0
                   100        1755.0
                   104       10247.0
                   104       10247.0
                   108          15.0
                              ...   
             2019  90         3721.0
                   92            0.0
                   92            0.0
                   96         3237.0
                   96         3237.0
Name: migrant_stock, Length: 3248, dtype: float64

In [42]:
dp[dp.index.duplicated(keep=False)].reset_index()['destination'].unique()


Out[42]:
array([909])

In [43]:
# so only country id 909 has duplication. let's see what 909 means

In [44]:
country[country['Code'] == 909]


Out[44]:
Index Region, subregion, country or area Notes Code Type of data More Developed Regions Less Developed Regions Least developed countries High-income Countries Middle-income Countries Upper-middle-income Countries Lower-middle-income Countries Low-income Countries No income group available Sub-Saharan Africa
19 20 Oceania NaN 909.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
195 199 OCEANIA NaN 909.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

In [45]:
table1[table1['Code'] == 909]


Out[45]:
Year Sort\norder Major area, region, country or area of destination Notes Code Type of data (a) Total Other South Other North Afghanistan ... Uruguay Uzbekistan Vanuatu Venezuela (Bolivarian Republic of) Viet Nam Wallis and Futuna Islands Western Sahara Yemen Zambia Zimbabwe
19 1990 1990020 Oceania NaN 909.0 NaN 4731848.0 34250.0 11579.0 2885.0 ... 11795.0 245.0 4693.0 606.0 140846.0 5588.0 0.0 221.0 2534.0 8941.0
195 1990 1990196 OCEANIA NaN 909.0 NaN 4731848.0 34250.0 11579.0 2885.0 ... 11795.0 245.0 4693.0 606.0 140846.0 5588.0 0.0 221.0 2534.0 8941.0
302 1995 1995020 Oceania NaN 909.0 NaN 5022287.0 67404.0 15814.0 8382.0 ... 11366.0 383.0 5375.0 915.0 161801.0 6478.0 0.0 314.0 3248.0 11320.0
478 1995 1995196 OCEANIA NaN 909.0 NaN 5022287.0 67404.0 15814.0 8382.0 ... 11366.0 383.0 5375.0 915.0 161801.0 6478.0 0.0 314.0 3248.0 11320.0
585 2000 2000011 Oceania NaN 909.0 NaN 5361231.0 54117.0 53514.0 13294.0 ... 10570.0 490.0 5945.0 1170.0 168753.0 7360.0 0.0 390.0 4371.0 14917.0
761 2000 2000196 OCEANIA NaN 909.0 NaN 5361231.0 54117.0 53514.0 13294.0 ... 10570.0 490.0 5945.0 1170.0 168753.0 7360.0 0.0 390.0 4371.0 14917.0
868 2005 2005020 Oceania NaN 909.0 NaN 6023412.0 58905.0 58663.0 20671.0 ... 10790.0 760.0 6375.0 1580.0 180367.0 7409.0 0.0 520.0 5796.0 29474.0
1044 2005 2005196 OCEANIA NaN 909.0 NaN 6023412.0 58905.0 58663.0 20671.0 ... 10790.0 760.0 6375.0 1580.0 180367.0 7409.0 0.0 520.0 5796.0 29474.0
1151 2010 2010020 Oceania NaN 909.0 NaN 7127680.0 64700.0 67114.0 33154.0 ... 10630.0 1300.0 6703.0 3350.0 210658.0 7448.0 0.0 590.0 7544.0 41880.0
1327 2010 2010196 OCEANIA NaN 909.0 NaN 7127680.0 64700.0 67114.0 33154.0 ... 10630.0 1300.0 6703.0 3350.0 210658.0 7448.0 0.0 590.0 7544.0 41880.0
1434 2015 2015020 Oceania NaN 909.0 NaN 8069944.0 71352.0 83929.0 52090.0 ... 10580.0 1840.0 5593.0 5760.0 243131.0 11210.0 0.0 710.0 8781.0 48586.0
1610 2015 2015196 OCEANIA NaN 909.0 NaN 8069944.0 71352.0 83929.0 52090.0 ... 10580.0 1840.0 5593.0 5760.0 243131.0 11210.0 0.0 710.0 8781.0 48586.0
1717 2019 2019020 Oceania NaN 909.0 NaN 8927925.0 74580.0 89844.0 62408.0 ... 11246.0 2147.0 5976.0 7427.0 272784.0 11804.0 0.0 888.0 9521.0 53752.0
1893 2019 2019196 OCEANIA NaN 909.0 NaN 8927925.0 74580.0 89844.0 62408.0 ... 11246.0 2147.0 5976.0 7427.0 272784.0 11804.0 0.0 888.0 9521.0 53752.0

14 rows × 241 columns


In [46]:
# so Oceania appears twice, and the data values are the same for both. We can safely drop these duplicates.

In [47]:
# create a function for data tables.

def create_datapoints(table):
    origin_cols = table.columns[6:]
    for c in origin_cols:
        # double check: is the country/region in the geo domain?
        # (the 'Total' column will be reported missing here; it is mapped to code 900 via name_id_map below)
        if c not in country_df['name'].values:
            print(f'column {c} is not in geo domain!')

    dp_cols = ['Code', 'Year']
    dp_cols.extend(origin_cols)

    dp = table[dp_cols].copy()
    dp = dp.dropna(subset=['Code', 'Year'], how='any')
    dp['Code'] = dp['Code'].map(int)
    dp['Year'] = dp['Year'].map(int)

    dp = dp.set_index(['Code', 'Year'])
    dp.columns = dp.columns.map(name_id_map)

    dp = dp.stack().reset_index()
    # print(dp.columns)

    dp.columns = ['destination', 'year', 'origin', 'migrant_stock']
    dp = dp.drop_duplicates()
    return dp[['destination', 'origin', 'year', 'migrant_stock']]

In [48]:
# also a function for emigrant data
def create_emigrant_datapoints(dp):
    dp_emg = dp[(dp.destination == 900)].copy()
    dp_emg = dp_emg[['origin', 'year', 'migrant_stock']]
    dp_emg.columns = ['geo', 'year', 'emigrant_stock']
    return dp_emg

In [49]:
# and a function for immigrant data
def create_immigrant_datapoints(dp):
    dp_img = dp[(dp.origin == '900')].copy()
    dp_img = dp_img[['destination', 'year', 'migrant_stock']]
    dp_img.columns = ['geo', 'year', 'immigrant_stock']
    return dp_img

In [50]:
df = create_datapoints(table1)
df_emg = create_emigrant_datapoints(df)
df_img = create_immigrant_datapoints(df)


column Total is not in geo domain!

In [51]:
df.head()


Out[51]:
destination origin year migrant_stock
0 900 900 1990 153011473.0
1 900 other_south 1990 6548526.0
2 900 other_north 1990 2366800.0
3 900 4 1990 6823350.0
4 900 8 1990 180284.0

In [52]:
# check: are there still duplicated rows?
df.duplicated(subset=['destination', 'origin', 'year']).any()


Out[52]:
False

In [53]:
df_emg.head()


Out[53]:
geo year emigrant_stock
0 900 1990 153011473.0
1 other_south 1990 6548526.0
2 other_north 1990 2366800.0
3 4 1990 6823350.0
4 8 1990 180284.0

In [54]:
df_img.head()


Out[54]:
geo year immigrant_stock
0 900 1990 153011473.0
235 901 1990 82767216.0
470 902 1990 70244257.0
705 941 1990 11060221.0
940 934 1990 59184036.0

In [55]:
# save to files
# df.to_csv('../../ddf--datapoints--migrant_stock--by--destination--origin--year.csv', index=False)
# df_emg.to_csv('../../ddf--datapoints--emigrant_stock--by--geo--year.csv', index=False)
# df_img.to_csv('../../ddf--datapoints--immigrant_stock--by--geo--year.csv', index=False)

In [56]:
# Table 2 and Table 3
# We will create gender dimension for them and merge them together.

df2 = create_datapoints(table2)
df2_emg = create_emigrant_datapoints(df2)
df2_img = create_immigrant_datapoints(df2)
df2['gender'] = 'male'
df2 = df2[['destination', 'origin', 'gender', 'year', 'migrant_stock']]
df2_emg['gender'] = 'male'
df2_emg = df2_emg[['geo', 'gender', 'year', 'emigrant_stock']]
df2_img['gender'] = 'male'
df2_img = df2_img[['geo', 'gender', 'year', 'immigrant_stock']]


column Total is not in geo domain!

In [57]:
df3 = create_datapoints(table3)
df3_emg = create_emigrant_datapoints(df3)
df3_img = create_immigrant_datapoints(df3)
df3['gender'] = 'female'
df3 = df3[['destination', 'origin', 'gender', 'year', 'migrant_stock']]
df3_emg['gender'] = 'female'
df3_emg = df3_emg[['geo', 'gender', 'year', 'emigrant_stock']]

df3_img['gender'] = 'female'
df3_img = df3_img[['geo', 'gender', 'year', 'immigrant_stock']]


column Total is not in geo domain!

In [58]:
df_sex = pd.concat([df2, df3], ignore_index=True)
df_sex_emg = pd.concat([df2_emg, df3_emg], ignore_index=True)
df_sex_img = pd.concat([df2_img, df3_img], ignore_index=True)

In [59]:
# save to files

# df_sex.to_csv('../../ddf--datapoints--migrant_stock--by--destination--origin--gender--year.csv', index=False)
# df_sex_emg.to_csv('../../ddf--datapoints--emigrant_stock--by--geo--gender--year.csv', index=False)
# df_sex_img.to_csv('../../ddf--datapoints--immigrant_stock--by--geo--gender--year.csv', index=False)

Concepts

The source file doesn't come with properties for the concepts, so we need to create them manually.


In [60]:
from ddf_utils.model.ddf import Concept

In [61]:
strings = [
    ['name', 'Name'],
    ['domain', 'Domain'],
]

measures = [
    ['migrant_stock', 'Migrant Stock'],
    ['emigrant_stock', 'Emigrant Stock'],
    ['immigrant_stock', 'Immigrant Stock']
]

entity_domains = [
    ['geo', 'Geo Location'],
    ['gender', 'Gender']
]

entity_sets_geo_ids = list(map(lambda x: x[4:], filter(lambda x: x.startswith('is--'), country_df.columns)))
entity_sets_geo_names = list(map(lambda x: x.replace('_', ' ').title(), entity_sets_geo_ids))

entity_sets_geo = list(zip(entity_sets_geo_ids, entity_sets_geo_names))

In [62]:
roles = [
    ['destination', 'Destination'],
    ['origin', 'Origin']
]

In [63]:
concepts = list()

for e, n in strings:
    concepts.append(Concept(e, 'string', dict(name=n)))
    
for e, n in measures:
    concepts.append(Concept(e, 'measure', dict(name=n)))
    
for e, n in entity_domains:
    concepts.append(Concept(e, 'entity_domain', dict(name=n)))
    
for e, n in entity_sets_geo:
    concepts.append(Concept(e, 'entity_set', dict(name=n, domain='geo')))
    
for e, n in roles:
    concepts.append(Concept(e, 'role', dict(name=n, domain='geo')))

In [64]:
concepts.append(Concept('year', 'time', dict(name='Time')))

In [65]:
cdf = pd.DataFrame.from_records([x.to_dict() for x in concepts])

In [66]:
# cdf.to_csv('../../ddf--concepts.csv', index=False)

In [ ]:
# create the datapackage by running the ddf command below
# (the leading ! runs it in a shell from the notebook; it can also be run in a terminal from the project root)

!ddf create_datapackage --update ../../
