Author: Pascal, pascal@bayesimpact.org

Date: 2016-04-26

Skip the run test because the ROME version has to be updated to make it work in the exported repository. TODO: Update ROME and remove the skiptest flag.

ROME mobility

The ROME dataset contains links between ROME job groups, it is called "mobilité" (mobility in French) as this is used to tell job seekers to which other jobs they could move.

This notebook does a sanity check on this table, before we use it in our product.



In [1]:

    
from os import path
import pandas
import seaborn as _

rome_version = 'v330'
data_folder = '../../../data'
rome_folder = path.join(data_folder, 'rome/csv')
mobility_csv = path.join(rome_folder, 'unix_rubrique_mobilite_{}_utf8.csv'.format(rome_version))
rome_csv = path.join(rome_folder, 'unix_referentiel_code_rome_{}_utf8.csv'.format(rome_version))
appellation_csv = path.join(rome_folder, 'unix_referentiel_appellation_{}_utf8.csv'.format(rome_version))

mobility = pandas.read_csv(mobility_csv)
rome = pandas.read_csv(rome_csv)[['code_rome', 'libelle_rome']]
rome_names = rome.groupby('code_rome').first()['libelle_rome']
jobs = pandas.read_csv(appellation_csv)[['code_ogr', 'code_rome', 'libelle_appellation_court']]
jobs_names = jobs.groupby('code_ogr').first()['libelle_appellation_court']

First Look

Let's first check how it looks like:



In [2]:

    
mobility.head(2).transpose()









    Out[2]:






  
    
      
      0
      1
    
  
  
    
      code_rome
      A1101
      A1202
    
    
      code_rome_cible
      A1416
      A1203
    
    
      code_appellation_source
      NaN
      NaN
    
    
      code_appellation_cible
      NaN
      NaN
    
    
      code_type_mobilite
      1
      1
    
    
      libelle_type_mobilite
      Proche
      Proche



In [3]:

    
mobility.count()









    Out[3]:





code_rome                  4210
code_rome_cible            4210
code_appellation_source     223
code_appellation_cible      761
code_type_mobilite         4210
libelle_type_mobilite      4210
dtype: int64



In [4]:

    
mobility[mobility.code_appellation_source.notnull()].head(2).transpose()









    Out[4]:






  
    
      
      38
      80
    
  
  
    
      code_rome
      A1413
      B1303
    
    
      code_rome_cible
      D1501
      B1603
    
    
      code_appellation_source
      17010
      15405
    
    
      code_appellation_cible
      NaN
      15415
    
    
      code_type_mobilite
      1
      1
    
    
      libelle_type_mobilite
      Proche
      Proche

It seems pretty straight forward: it's a list of links from a job group (or a specific job) to another (group or specific job). So let's clean up a bit and add names.



In [5]:

    
# Rename columns.
mobility.rename(columns={
        'code_rome': 'group_source',
        'code_appellation_source': 'job_source',
        'code_rome_cible': 'group_target',
        'code_appellation_cible': 'job_target',
    }, inplace=True)

# Add names.
mobility['group_source_name'] = mobility['group_source'].map(rome_names)
mobility['group_target_name'] = mobility['group_target'].map(rome_names)
mobility['job_source_name'] = mobility['job_source'].map(jobs_names)
mobility['job_target_name'] = mobility['job_target'].map(jobs_names)

# Sort columns.
mobility = mobility[[
        'group_source', 'group_source_name', 'job_source', 'job_source_name',
        'group_target', 'group_target_name', 'job_target', 'job_target_name',
        'code_type_mobilite', 'libelle_type_mobilite'
    ]]

mobility.head(2).transpose()









    Out[5]:






  
    
      
      0
      1
    
  
  
    
      group_source
      A1101
      A1202
    
    
      group_source_name
      Conduite d''engins agricoles et forestiers
      Entretien des espaces naturels
    
    
      job_source
      NaN
      NaN
    
    
      job_source_name
      NaN
      NaN
    
    
      group_target
      A1416
      A1203
    
    
      group_target_name
      Polyculture, élevage
      Entretien des espaces verts
    
    
      job_target
      NaN
      NaN
    
    
      job_target_name
      NaN
      NaN
    
    
      code_type_mobilite
      1
      1
    
    
      libelle_type_mobilite
      Proche
      Proche

Sanity

Let's do some sanity checks:

Is there links from a job group to itself?
Is there duplicate links?
If there is a link from A to B, is there one from B to A?
When using specific jobs, is there also links from or to the job group?



In [6]:

    
# Links from one job group to the same one.
len(mobility[mobility.group_source == mobility.group_target].index)









    Out[6]:





0



In [7]:

    
# Number of duplicate links.
len(mobility.index) - len(mobility.drop_duplicates())









    Out[7]:





0



In [8]:

    
# Number of duplicate links when we ignore the link types.
len(mobility.index) - len(mobility.drop_duplicates([
    'group_source', 'job_source', 'group_target', 'job_target']))









    Out[8]:





0



In [9]:

    
# Reverse links.
two_links = pandas.merge(
    mobility.fillna(''), mobility.fillna(''),
    left_on=['group_target', 'job_target'],
    right_on=['group_source', 'job_source'])
str(len(two_links[
        (two_links.group_source_x == two_links.group_target_y) &
        (two_links.job_source_x == two_links.job_target_y)].index) / len(mobility.index) * 100) + '%'









    Out[9]:





'33.4916864608076%'



In [10]:

    
rome_froms = pandas.merge(
    mobility[mobility.job_source.notnull()].drop_duplicates(['group_source', 'group_source_name']),
    mobility[mobility.job_source.isnull()].drop_duplicates(['group_source', 'group_source_name']),
    on=['group_source', 'group_source_name'], how='outer', suffixes=['_specific', '_group'])

# Number of ROME job groups that have links both for the group and for at least one specific job.
len(rome_froms[rome_froms.group_target_specific.notnull() & rome_froms.group_target_group.notnull()])









    Out[10]:





79



In [11]:

    
# ROME job groups that have only links for specific jobs and not for the group.
rome_froms[rome_froms.group_target_group.isnull()]['group_source_name'].tolist()









    Out[11]:





['Collaboration juridique']



In [12]:

    
rome_froms = pandas.merge(
    mobility[mobility.job_source.notnull()].drop_duplicates(['group_target', 'group_target_name']),
    mobility[mobility.job_source.isnull()].drop_duplicates(['group_target', 'group_target_name']),
    on=['group_target', 'group_target_name'], how='outer', suffixes=['_specific', '_group'])

# Number of ROME job groups that have links both to the group and to at least one specific job.
len(rome_froms[rome_froms.group_source_specific.notnull() & rome_froms.group_source_group.notnull()])









    Out[12]:





88



In [13]:

    
# ROME job groups that have only links to specific jobs and not to the whole group.
rome_froms[rome_froms.group_source_group.isnull()]['group_target_name'].tolist()









    Out[13]:





['Image cinématographique et télévisuelle',
 'Personnel technique des jeux',
 'Pharmacie',
 'Santé animale',
 'Médecine dentaire',
 'Intervention technique en formulation et analyse sensorielle']



In [14]:

    
# Number of links specific to jobs (as opposed to groups) that are already specified by group links.
mobility['has_job_source'] = ~mobility.job_source.isnull()
mobility['has_job_target'] = ~mobility.job_target.isnull()
any_job_mobility = mobility.drop_duplicates(['group_source', 'has_job_source', 'group_target', 'has_job_target'])
len(any_job_mobility) - len(any_job_mobility.drop_duplicates(['group_source', 'group_target']))









    Out[14]:





0

So to summarize:

There are no self links from a job group to itself, or within a group.
There are no duplicate links (even with a different type).
33% of links go both way: so it means that direction is meaningful.
When using specific jobs, in most cases there's also links concerning the whole job group; but for some rare cases it happens that there's nothing for the group.
However when there's a link to or from a specific job, there's never an equivalent group link that would encompass it.

Coverage

Let's check how much of the ROME code is covered by links, more specifically: for a given job, how many mobility jobs and job groups are linked from it (either directly or because of its group)?



In [15]:

    
# In this snippet, we count # of links to groups & to specific jobs for each job.

mobility_from_group = mobility[mobility.job_source.isnull()][['group_source', 'group_target', 'job_target']]
# Count # of groups that are linked from each group.
mobility_from_group['target_groups'] = (
    mobility_from_group[mobility_from_group.job_target.isnull()]
        .groupby('group_source')['group_source'].transform('count'))
mobility_from_group['target_groups'].fillna(0, inplace=True)
# Count # of specific jobs that are linked from each group.
mobility_from_group['target_jobs'] = (
    mobility_from_group[mobility_from_group.job_target.notnull()]
        .groupby('group_source')['group_source'].transform('count'))
mobility_from_group['target_jobs'].fillna(0, inplace=True)

mobility_from_group = mobility_from_group.groupby('group_source', as_index=False).max()[
    ['group_source', 'target_groups', 'target_jobs']]


mobility_from_job = mobility[mobility.job_source.notnull()][['job_source', 'group_target', 'job_target']]
# Count # of groups that are linked from each job.
mobility_from_job['target_groups'] = (
    mobility_from_job[mobility_from_job.job_target.isnull()]
        .groupby('job_source')['job_source'].transform('count'))
mobility_from_job['target_groups'].fillna(0, inplace=True)
# Count # of jobs that are linked from each job.
mobility_from_job['target_jobs'] = (
    mobility_from_job[mobility_from_job.job_target.notnull()]
        .groupby('job_source')['job_source'].transform('count'))
mobility_from_job['target_jobs'].fillna(0, inplace=True)

mobility_from_job = mobility_from_job.groupby('job_source', as_index=False).max()[
    ['job_source', 'target_groups', 'target_jobs']]

jobs_with_counts = pandas.merge(
    jobs, mobility_from_group, left_on='code_rome', right_on='group_source', how='left')
jobs_with_counts = pandas.merge(
    jobs_with_counts, mobility_from_job, left_on='code_ogr', right_on='job_source', how='left')

jobs_with_counts.fillna(0, inplace=True)
jobs_with_counts['target_groups'] = jobs_with_counts.target_groups_x + jobs_with_counts.target_groups_y
jobs_with_counts['target_jobs'] = jobs_with_counts.target_jobs_x + jobs_with_counts.target_jobs_y
jobs_with_counts['total'] = jobs_with_counts['target_groups'] + jobs_with_counts['target_jobs']

jobs_with_counts = jobs_with_counts[['code_ogr', 'libelle_appellation_court', 'target_groups', 'target_jobs', 'total']]



In [16]:

    
# Jobs that don't have any links from them or from their group.
jobs_with_counts[jobs_with_counts.total == 0]['libelle_appellation_court'].tolist()









    Out[16]:





['Assistant / Assistante de service juridique',
 'Assistant notarial / Assistante notariale',
 'Auxiliaire juridique',
 'Clerc',
 'Clerc assermenté',
 'Clerc aux formalités',
 'Clerc aux formalités de notaire',
 "Clerc d''avocat",
 'Clerc de commissaire-priseur',
 "Clerc d''huissier",
 'Clerc formaliste',
 'Clerc hors rang',
 'Clerc hors rang de notaire',
 'Clerc rédacteur',
 'Clerc significateur',
 "Clerc significateur assermenté d''huissier",
 "Collaborateur(trice) d''huissier chargé(e) de recouvrement",
 'Greffier / Greffière',
 'Greffier / Greffière en chef',
 'Premier clerc',
 'Principal clerc',
 'Principal clerc adjoint',
 'Principal clerc adjoint de notaire',
 "Principal clerc d''avocat",
 'Principal clerc de commissaire-priseur',
 'Principal clerc de notaire',
 "Principal clerc d''huissier",
 "Sous-principal clerc d''avocat",
 'Sous-principal clerc de notaire',
 'Formaliste']



In [17]:

    
jobs_with_counts.total.hist()
str(len(jobs_with_counts.total[jobs_with_counts.total >= 5].index) / len(jobs_with_counts.index)*100) + '%'









    Out[17]:





'85.4585312385824%'

For each job, except for the ~30 jobs above in the law business, there is at least one link, and in most of the cases (85%) there are 5 or more.

	0	1
code_rome	A1101	A1202
code_rome_cible	A1416	A1203
code_appellation_source	NaN	NaN
code_appellation_cible	NaN	NaN
code_type_mobilite	1	1
libelle_type_mobilite	Proche	Proche

	38	80
code_rome	A1413	B1303
code_rome_cible	D1501	B1603
code_appellation_source	17010	15405
code_appellation_cible	NaN	15415
code_type_mobilite	1	1
libelle_type_mobilite	Proche	Proche

	0	1
group_source	A1101	A1202
group_source_name	Conduite d''engins agricoles et forestiers	Entretien des espaces naturels
job_source	NaN	NaN
job_source_name	NaN	NaN
group_target	A1416	A1203
group_target_name	Polyculture, élevage	Entretien des espaces verts
job_target	NaN	NaN
job_target_name	NaN	NaN
code_type_mobilite	1	1
libelle_type_mobilite	Proche	Proche