Migration data was obtained from two sources: the UN Refugee Agency (UNHCR) and the Organisation for Economic Co-operation and Development (OECD), both of which publish datasets on migration.
The UNHCR dataset contains the following variables:
The OECD dataset contains the following variables:
As a first step, we look at the two datasets separately to get a feel for them.
In [1]:
%pylab inline
import sys
sys.path.insert(0,"../lib/")
import pandas as pd
from unhcrData import UNHCRdata
from oecdData import OECDdata
In [2]:
fname = "../data/unhcr/unhcr_popstats_export_persons_of_concern_all_data.csv"
unhcr = UNHCRdata(fname)
unhcr.data.dtypes
Out[2]:
Now get some general statistics about the dataset.
In [3]:
idx = ["Refugees (incl. refugee-like situations)", "Asylum-seekers (pending cases)", "Returned refugees",
       "Internally displaced persons (IDPs)", "Returned IDPs", "Stateless persons", "Others of concern",
       "Total Population"]
unhcr.data[idx].describe()
Out[3]:
Check for missing values. The result is the fraction of rows (between 0 and 1, not a percentage) with a missing entry in each column.
In [4]:
isnan(unhcr.data[idx]).sum() / np.shape(unhcr.data)[0]
Out[4]:
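The same check can be sketched with plain pandas, without relying on the names that %pylab pulls into the namespace. The frame below is a toy stand-in with made-up values, not the UNHCR data; only the column names are borrowed from the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; it only mimics the shape of unhcr.data.
toy = pd.DataFrame({
    "Refugees (incl. refugee-like situations)": [10.0, np.nan, 30.0, 40.0],
    "Stateless persons": [np.nan, np.nan, np.nan, 5.0],
    "Total Population": [10.0, 20.0, 30.0, 45.0],
})

# Fraction of missing entries per column (0.0 = complete, 1.0 = all missing).
missing_fraction = toy.isna().mean()

# Multiply by 100 if a percentage is preferred.
missing_percent = 100 * missing_fraction
print(missing_percent)
```

Taking the mean of the boolean mask is equivalent to dividing the count of missing entries by the number of rows.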
The only columns likely to be of interest for the project are Refugees (incl. refugee-like situations), Asylum-seekers (pending cases), and Total Population. They will be highly correlated, so it might be better to focus on Total Population alone.
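Whether these columns really are highly correlated could be checked with a correlation matrix. The following is a minimal sketch on made-up numbers, so the printed values say nothing about the real dataset; on the actual data the same call would presumably be applied to the corresponding columns of unhcr.data.

```python
import pandas as pd

# Made-up counts for illustration only; not taken from the UNHCR data.
toy = pd.DataFrame({
    "Refugees (incl. refugee-like situations)": [100, 150, 220, 300, 410],
    "Asylum-seekers (pending cases)": [10, 18, 20, 35, 40],
    "Total Population": [110, 168, 240, 335, 450],
})

# Pairwise Pearson correlation between the columns (pandas default method).
corr = toy.corr()
print(corr.round(2))
```

Off-diagonal entries close to 1 would support keeping only one of the columns.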
We can plot the number of people over time for an individual destination country:
In [5]:
unhcr.show(destination_country="Canada")
We can also limit the plot to the migration between two countries:
In [6]:
unhcr.show(destination_country="Canada", origin_country="Germany")
It is also possible to plot the number of people leaving a specific country:
In [7]:
unhcr.show(origin_country="Italy")
This last plot in particular shows one difficulty of this dataset: the Total Population count can change drastically when a new category starts being reported in a given year. In this example, many Stateless persons were reported for 2003, which drastically changes the Total Population count compared to the previous year.
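One way to spot such jumps is to look up the first year in which each category has a value and compare it with the year-over-year change of the total. The sketch below uses a made-up yearly series for a single country and assumes the total is simply the sum of the reported categories; the real data may be organised differently.

```python
import numpy as np
import pandas as pd

# Made-up yearly counts for one origin country; NaN means "not reported".
toy = pd.DataFrame({
    "Year": [2001, 2002, 2003, 2004],
    "Refugees (incl. refugee-like situations)": [100.0, 110.0, 120.0, 125.0],
    "Stateless persons": [np.nan, np.nan, 900.0, 880.0],
}).set_index("Year")

categories = list(toy.columns)

# Assumption: the total is the sum of all reported categories (NaN is skipped).
toy["Total Population"] = toy[categories].sum(axis=1)

# First year in which each category has a value.
first_reported = toy[categories].apply(pd.Series.first_valid_index)

# Relative change of the total compared to the previous year.
year_over_year = toy["Total Population"].pct_change()

print(first_reported)
print(year_over_year)
```

A large relative change in the same year that a category is first reported suggests the jump comes from a change in reporting rather than from actual migration.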
In [8]:
fname = "../data/oecd/MIG_15082015002909613.csv.zip"
oecd = OECDdata(fname)
oecd.data.dtypes
Out[8]:
In [9]:
idx = ["Acquisition of nationality by country of former nationality", "Inflows of asylum seekers by nationality",
       "Inflows of foreign population by nationality", "Inflows of foreign workers by nationality",
       "Inflows of seasonal foreign workers by nationality", "Outflows of foreign population by nationality",
       "Stock of foreign labour by nationality", "Stock of foreign population by nationality",
       "Stock of foreign-born labour by country of birth", "Stock of foreign-born population by country of birth"]
oecd.data[idx].describe()
Out[9]:
In [10]:
isnan(oecd.data[idx]).sum() / np.shape(oecd.data)[0]
Out[10]:
In [11]:
oecd.show(destination_country="Canada")
In [12]:
oecd.show(destination_country="Canada", origin_country="Germany")
In [13]:
oecd.show(origin_country="Italy")