Author: Cyrille, cyrille@bayesimpact.org

Date: 2018-04-09

ROME update from v333 to v334

In March 2018 a new version of the ROME was released. I want to investigate what changed and whether we need to do anything about it.

You might not be able to reproduce this notebook, mostly because it requires having both versions of the ROME in your data/rome/csv folder, which only happens just before we switch to v334. You will have to trust me on the results ;-)

Skip the run test because it requires older versions of the ROME.


In [1]:
import collections
import glob
import os
from os import path

import matplotlib_venn
import pandas as pd

rome_path = path.join(os.getenv('DATA_FOLDER'), 'rome/csv')

OLD_VERSION = '333'
NEW_VERSION = '334'

old_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(OLD_VERSION)))
new_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(NEW_VERSION)))

First let's check if there are new or deleted files (only matching by file names).


In [2]:
new_files = new_version_files - frozenset(f.replace(OLD_VERSION, NEW_VERSION) for f in old_version_files)
deleted_files = old_version_files - frozenset(f.replace(NEW_VERSION, OLD_VERSION) for f in new_version_files)

print('{:d} new files'.format(len(new_files)))
print('{:d} deleted files'.format(len(deleted_files)))


0 new files
0 deleted files

So we have the same set of files in both versions: good start.
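
One caveat with the name matching above: `str.replace` swaps every occurrence of the version number in the full path, so a folder or table name containing `333` would be rewritten too. Here is a minimal sketch of a stricter pairing, assuming the file names follow the `unix_<table>_v<version>_utf8.csv` pattern used later in this notebook. `table_name` and `unmatched_tables` are hypothetical helpers, not part of the original code, and the example paths are made up.

```python
import re
from os import path

# Assumed naming pattern: unix_<table>_v<version>_utf8.csv
_ROME_FILE = re.compile(r'^unix_(?P<table>.+)_v(?P<version>\d+)_utf8\.csv$')


def table_name(filename, version):
    """Return the table name if the file belongs to the given version, else None."""
    match = _ROME_FILE.match(path.basename(filename))
    if match and match.group('version') == version:
        return match.group('table')
    return None


def unmatched_tables(filenames, old_version, new_version):
    """Return (new_tables, deleted_tables), comparing table names only."""
    old_tables = {table_name(f, old_version) for f in filenames} - {None}
    new_tables = {table_name(f, new_version) for f in filenames} - {None}
    return new_tables - old_tables, old_tables - new_tables


# Made-up paths: the folder name contains '333' on purpose.
example_files = [
    'data333/rome/csv/unix_item_v333_utf8.csv',
    'data333/rome/csv/unix_item_v334_utf8.csv',
    'data333/rome/csv/unix_texte_v334_utf8.csv',
]
added, deleted = unmatched_tables(example_files, '333', '334')
print(sorted(added), sorted(deleted))
```

Only the version token in the file name is inspected, so the `333` in the folder name does not interfere.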

Now let's set up a dataset that, for each table, links both the old and the new file together.


In [3]:
# Load all ROME datasets for the two versions we compare.
VersionedDataset = collections.namedtuple('VersionedDataset', ['basename', 'old', 'new'])
rome_data = [VersionedDataset(
        basename=path.basename(f),
        old=pd.read_csv(f.replace(NEW_VERSION, OLD_VERSION)),
        new=pd.read_csv(f))
    for f in sorted(new_version_files)]

def find_rome_dataset_by_name(data, partial_name):
    for dataset in data:
        if 'unix_{}_v{}_utf8.csv'.format(partial_name, NEW_VERSION) == dataset.basename:
            return dataset
    raise ValueError('No dataset named {}, the list is\n{}'.format(partial_name, [d.basename for d in data]))

Let's make sure the structure hasn't changed:


In [4]:
for dataset in rome_data:
    if set(dataset.old.columns) != set(dataset.new.columns):
        print('Columns of {} have changed.'.format(dataset.basename))

All files have the same columns as before: still good.
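
The check above only says that something changed. If a mismatch were ever reported, a small helper could list what moved. This is a minimal sketch, where `column_changes` is a hypothetical helper that accepts any iterables of column names (for instance `dataset.old.columns` and `dataset.new.columns`); the column lists below are made up.

```python
def column_changes(old_columns, new_columns):
    """Return the sorted lists of (added, removed) column names."""
    old_set, new_set = set(old_columns), set(new_columns)
    return sorted(new_set - old_set), sorted(old_set - new_set)


# Made-up column lists, for illustration only.
added, removed = column_changes(
    ['code_ogr', 'libelle', 'statut'],
    ['code_ogr', 'libelle', 'code_rome'])
print('added: {}, removed: {}'.format(added, removed))
```

Because it compares sets, it shares the limitation of the check above: reordered columns are not reported.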

Now let's see for each file if there are more or fewer rows.


In [5]:
same_row_count_files = 0
for dataset in rome_data:
    diff = len(dataset.new.index) - len(dataset.old.index)
    if diff > 0:
        print('{:d}/{:d} values added in {}'.format(
            diff, len(dataset.new.index), dataset.basename))
    elif diff < 0:
        print('{:d}/{:d} values removed in {}'.format(
            -diff, len(dataset.old.index), dataset.basename))
    else:
        same_row_count_files += 1
print('{:d}/{:d} files with the same number of rows'.format(
    same_row_count_files, len(rome_data)))


14/30720 values added in unix_coherence_item_v334_utf8.csv
31/11677 values added in unix_cr_gd_dp_appellations_v334_utf8.csv
15108/16308 values removed in unix_item_arborescence_v334_utf8.csv
6/13387 values added in unix_item_v334_utf8.csv
45/42273 values added in unix_liens_rome_referentiels_v334_utf8.csv
31/11021 values added in unix_referentiel_appellation_v334_utf8.csv
1/4948 values added in unix_referentiel_competence_v334_utf8.csv
2/5039 values added in unix_texte_v334_utf8.csv
13/21 files with the same number of rows

There are some minor changes in many files, except for the arborescence, which seems to have lost most of its content.
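
To put numbers on "most of its content", the counts printed above can be turned into shares. This quick sketch uses only figures from the output: added rows are compared to the new row count and removed rows to the old one, as in the cell above. `share` is a hypothetical helper.

```python
def share(changed_rows, total_rows):
    """Return the changed rows as a percentage of the total."""
    return 100 * changed_rows / total_rows


# Figures copied from the output above.
print('item_arborescence: {:.1f}% of rows removed'.format(share(15108, 16308)))
print('referentiel_appellation: {:.1f}% of rows added'.format(share(31, 11021)))
print('liens_rome_referentiels: {:.1f}% of rows added'.format(share(45, 42273)))
```

Keep in mind that these are net differences: a file that lost and gained the same number of rows would be counted as having the same number of rows.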

Climbing up the arborescence tree

Let's look into the arborescence more closely before turning to the other files (a brief presentation of the tree structure in this dataset is given in this notebook). It describes a large tree of concepts grouped into 5 main subtrees, defined by the code_type_referentiel field. In Bob we mainly use the branch "Métiers accessibles sans diplôme et sans expérience" of subtree 7 (code ROME).


In [6]:
arborescence = find_rome_dataset_by_name(rome_data, 'item_arborescence')

old_roots = arborescence.old[arborescence.old.libelle_type_noeud == 'RACINE']
new_roots = arborescence.new[arborescence.new.libelle_type_noeud == 'RACINE']

old = old_roots[['code_type_referentiel', 'libelle_item_arbo']]
new = new_roots[['code_type_referentiel', 'libelle_item_arbo']]

links_merged = old.merge(new, how='outer', indicator=True)
links_merged['_diff'] = links_merged._merge.map({'left_only': 'removed', 'right_only': 'added'})
links_merged.head()


Out[6]:
code_type_referentiel libelle_item_arbo _merge _diff
0 10 Racine de l''arborescence d''item - environnem... left_only removed
1 6 Racine de l''arborescence principale both NaN
2 7 Racine de l''arborescence secondaire - code ROME both NaN
3 8 Racine de l''arborescence d''item - activite left_only removed
4 9 Racine de l''arborescence d''item - competence left_only removed

So three subtrees of the arborescence were cut ('environnement de travail', 'activite' and 'competence'), while the main subtree and the ROME subtree were kept. According to Pôle emploi, 'environnement de travail' was removed because it was done by hand and largely incomplete. The other two have been renamed 'savoir-faire' and 'savoirs', and moved outside the ROME (not sure where exactly, though).
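
The comparison relies on an outer merge with `indicator=True`, which adds a `_merge` column telling where each row was found (`left_only`, `right_only` or `both`). Here is a toy sketch of the pattern with made-up rows, assuming pandas is available:

```python
import pandas as pd

# Made-up data: 'b' exists only in the old table, 'c' only in the new one.
old = pd.DataFrame({'code': [1, 2], 'label': ['a', 'b']})
new = pd.DataFrame({'code': [1, 3], 'label': ['a', 'c']})

# With no `on` argument, merge joins on all the columns the two frames share.
merged = old.merge(new, how='outer', indicator=True)
# Rows found on both sides are absent from the mapping and become NaN.
merged['_diff'] = merged._merge.map({'left_only': 'removed', 'right_only': 'added'})
print(merged)
```

A row counts as unchanged only if all its columns match, so an edited label shows up as one removed row plus one added row.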

Let's see if there is much change in this file, once those subtrees are removed:


In [7]:
is_old_obsolete = arborescence.old.code_type_referentiel.isin(
    links_merged.code_type_referentiel[links_merged._diff == 'removed'])
old_arborescence_trimmed = arborescence.old[~is_old_obsolete]

obsolete_arborescence_titles = set(
    old_arborescence_trimmed.libelle_item_arbo) - set(
    arborescence.new.libelle_item_arbo)
new_arborescence_titles = set(
    arborescence.new.libelle_item_arbo) - set(
    old_arborescence_trimmed.libelle_item_arbo)

print('Titles removed: "{}"'.format('", "'.join(obsolete_arborescence_titles)))
print('Titles added: "{}"'.format('", "'.join(new_arborescence_titles)))


Titles removed: "Sécurité et protection santé du BTP"
Titles added: "Qualité Sécurité Environnement et protection santé du BTP"

Hmm... Only one obsolete and one new title, and they seem quite similar. Let's try and replace the old one with the new one, and compare the datasets.


In [8]:
arborescence.old.loc[~is_old_obsolete, 'libelle_item_arbo'] = old_arborescence_trimmed.libelle_item_arbo.str.replace(
    'Sécurité et protection santé du BTP',
    'Qualité Sécurité Environnement et protection santé du BTP')

arborescence.old[arborescence.old.libelle_item_arbo == 'Sécurité et protection santé du BTP']


Out[8]:
code_ogr code_type_referentiel code_pere code_noeud libelle_item_arbo code_item_arbo_associe code_type_noeud libelle_type_noeud statut

In [9]:
arborescence.old[~is_old_obsolete].reset_index(drop=True).to_json() == arborescence.new.reset_index(drop=True).to_json()


Out[9]:
True

Ok, so apart from changing a title, they are actually the same!
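
Comparing JSON dumps is a pragmatic way to check equality once the index is reset. As an alternative sketch (not what this notebook ran), `pandas.testing.assert_frame_equal` does a similar comparison and, on a mismatch, raises an `AssertionError` describing where the frames differ. Unlike the JSON comparison, it also checks dtypes by default. The frames below are made up.

```python
import pandas as pd
from pandas import testing as pd_testing

# Made-up frames: same content, but the old one keeps a filtered index.
old = pd.DataFrame({'code': [1, 2], 'label': ['a', 'b']}, index=[10, 11])
new = pd.DataFrame({'code': [1, 2], 'label': ['a', 'b']})

# Passes silently: the content is identical once the index is reset.
pd_testing.assert_frame_equal(old.reset_index(drop=True), new)

# A single changed label is reported through an AssertionError.
changed = new.assign(label=['a', 'z'])
try:
    pd_testing.assert_frame_equal(old.reset_index(drop=True), changed)
    frames_differ = False
except AssertionError as error:
    frames_differ = True
    print(error)
```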

So this dataset has seen a major change, but it is limited to dropping some branches and we weren't using them anyway, so this does not bother us too much.

Let's take a look at the changes in the other datasets, especially those of interest to us.

Other changes

The most interesting ones are in referentiel_appellation, item, and liens_rome_referentiels, so let's look at them more closely.


In [10]:
jobs = find_rome_dataset_by_name(rome_data, 'referentiel_appellation')

new_jobs = set(jobs.new.code_ogr) - set(jobs.old.code_ogr)
obsolete_jobs = set(jobs.old.code_ogr) - set(jobs.new.code_ogr)
stable_jobs = set(jobs.new.code_ogr) & set(jobs.old.code_ogr)

matplotlib_venn.venn2((len(obsolete_jobs), len(new_jobs), len(stable_jobs)), (OLD_VERSION, NEW_VERSION));
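
The Venn diagram is drawn from three set sizes: codes only in the old version, codes only in the new one, and codes in both. Where the plot is not rendered, the same numbers can be printed as text. This is a minimal sketch with made-up codes; `venn_counts` is a hypothetical helper.

```python
def venn_counts(old_codes, new_codes):
    """Return the (only_old, only_new, both) sizes, in the order passed to venn2 above."""
    old_set, new_set = set(old_codes), set(new_codes)
    return len(old_set - new_set), len(new_set - old_set), len(old_set & new_set)


# Made-up codes, for illustration only.
only_old, only_new, both = venn_counts([1, 2, 3, 4], [3, 4, 5])
print('{:d} obsolete, {:d} new, {:d} stable'.format(only_old, only_new, both))
```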


Alright, so the only change seems to be 31 new jobs added. Let's take a look (only showing interesting fields):


In [11]:
jobs.new[jobs.new.code_ogr.isin(new_jobs)][['code_ogr', 'libelle_appellation_long', 'code_rome']]


Out[11]:
code_ogr libelle_appellation_long code_rome
3850 140877 Conducteur / Conductrice de tracteur enjambeur A1101
3851 140878 Adjoint / Adjointe au responsable QSE - Qualit... F1204
3852 140879 Animateur / Animatrice QSE - Qualité Sécurité ... F1204
3854 140880 Chargé / Chargée de mission QSE - Qualité Sécu... F1204
3855 140881 Ingénieur / Ingénieure HSE - Hygiène Sécurité ... F1204
3856 140882 Ingénieur /Ingénieure QHSE - Qualité Hygiène S... F1204
3857 140883 Responsable qualité sécurité développement dur... F1204
3858 140884 Responsable HSE - Hygiène Sécurité Environneme... F1204
3859 140885 Responsable QSE - Qualité Sécurité Environneme... F1204
3860 140886 Technicien / Technicienne QSE - Qualité Sécuri... F1204
3861 140887 Conducteur / Conductrice de raboteuse de chaussée F1302
3862 140888 Moniteur / Monitrice de basketball G1204
3863 140889 Moniteur / Monitrice d''équitation G1204
3865 140890 Moniteur / Monitrice de fitness musculation G1204
3866 140891 Moniteur / Monitrice de golf G1204
3867 140892 Moniteur / Monitrice de gymnastique G1204
3868 140893 Moniteur / Monitrice de handball G1204
3869 140894 Moniteur / Monitrice de judo G1204
3870 140895 Moniteur / Monitrice de karaté G1204
3871 140896 Moniteur / Monitrice de natation G1204
3872 140897 Moniteur / Monitrice de plongée G1204
3873 140898 Moniteur / Monitrice de rugby G1204
3874 140899 Moniteur / Monitrice de ski alpin G1204
3876 140900 Moniteur / Monitrice de tennis G1204
3877 140901 Moniteur / Monitrice de tennis de table G1204
3878 140902 Moniteur / Monitrice de voile G1204
3879 140903 Procurement manager M1102
3880 140904 Télésecrétaire médical / médicale M1609
3881 140905 Gestionnaire de contrats de vente - Contract m... M1701
3882 140906 Gestionnaire de planning e-mailing M1705
3883 140907 Pharmacien / Pharmacienne responsable BPDO - b... J1202

There seem to be a few different domains where new jobs were added, mainly construction safety (QSE/HSE roles) and sport instructors.

OK, let's look at the changes in items:


In [12]:
items = find_rome_dataset_by_name(rome_data, 'item')

new_items = set(items.new.code_ogr) - set(items.old.code_ogr)
obsolete_items = set(items.old.code_ogr) - set(items.new.code_ogr)
stable_items = set(items.new.code_ogr) & set(items.old.code_ogr)

matplotlib_venn.venn2((len(obsolete_items), len(new_items), len(stable_items)), (OLD_VERSION, NEW_VERSION));


As anticipated it is a very minor change (hard to see it visually): some items are now obsolete and new ones have been created. Let's have a look.


In [13]:
items.old[items.old.code_ogr.isin(obsolete_items)].tail()


Out[13]:
code_ogr libelle code_type_referentiel code_ref_rubrique code_tete_rgpmt libelle_activite_impression libelle_en_tete_regroupement
9721 123269 Organiser le planning d''un chantier 2 6 NaN NaN NaN
9925 123523 Conditionner un produit artisanal 2 9 NaN NaN NaN
11372 125052 Moteurs hors-bord 1 10 NaN NaN NaN
11374 125054 Moteurs in-bord 1 10 NaN NaN NaN
12441 126344 Capacité professionnelle à la conduite de taxis 1 10 NaN NaN NaN

In [14]:
items.new[items.new.code_ogr.isin(new_items)].head()


Out[14]:
code_ogr libelle code_type_referentiel code_ref_rubrique code_tete_rgpmt libelle_activite_impression libelle_en_tete_regroupement
3252 106636 Superviser un parc de véhicules et engins de t... 2 9 NaN NaN NaN
6531 117742 Indicateurs Qualité Sécurité Environnement (QSE) 1 7 NaN NaN NaN
7488 119783 Réaliser un suivi des carrières 2 9 NaN NaN NaN
7759 120445 Pratique de la randonnée 1 10 NaN NaN NaN
9817 123385 Gérer la location d''un patrimoine immobilier 2 9 NaN NaN NaN

Those entries look legitimate.

The changes in liens_rome_referentiels include changes for those items, so let's only check the changes not related to those.


In [15]:
links = find_rome_dataset_by_name(rome_data, 'liens_rome_referentiels')
old_links_on_stable_items = links.old[links.old.code_ogr.isin(stable_items)]
new_links_on_stable_items = links.new[links.new.code_ogr.isin(stable_items)]

old = old_links_on_stable_items[['code_rome', 'code_ogr']]
new = new_links_on_stable_items[['code_rome', 'code_ogr']]

links_merged = old.merge(new, how='outer', indicator=True)
links_merged['_diff'] = links_merged._merge.map({'left_only': 'removed', 'right_only': 'added'})
links_merged._diff.value_counts()


Out[15]:
added      167
removed    149
Name: _diff, dtype: int64

So in addition to the added and removed items, there are 316 fixes. Let's have a look:


In [16]:
job_group_names = find_rome_dataset_by_name(rome_data, 'referentiel_code_rome').old.set_index('code_rome').libelle_rome
item_names = items.new.set_index('code_ogr').libelle.drop_duplicates()
links_merged['job_group_name'] = links_merged.code_rome.map(job_group_names)
links_merged['item_name'] = links_merged.code_ogr.map(item_names)
links_merged[links_merged._diff == 'added'].head()


Out[16]:
code_rome code_ogr _merge _diff job_group_name item_name
30692 C1502 102694 right_only added Gestion locative immobilière Renseigner les supports de suivi d''interventi...
30693 B1301 121060 right_only added Décoration d''espaces de vente et d''exposition Former du personnel à des procédures et techni...
30694 E1203 121060 right_only added Production en laboratoire photographique Former du personnel à des procédures et techni...
30695 E1202 121060 right_only added Production en laboratoire cinématographique Former du personnel à des procédures et techni...
30696 E1302 102694 right_only added Conduite de machines de façonnage routage Renseigner les supports de suivi d''interventi...

Those fixes make sense (not sure why they were not done before, but let's not complain: it is fixed now).


In [17]:
links_merged[links_merged._diff == 'removed'].head()


Out[17]:
code_rome code_ogr _merge _diff job_group_name item_name
134 C1502 124083 left_only removed Gestion locative immobilière Réaliser des actions de communication interne
231 C1501 119084 left_only removed Gérance immobilière Coordonner et superviser la gestion du patrimoine mobilier et immobilier
1124 B1301 123036 left_only removed Décoration d''espaces de vente et d''exposition Former un public
2113 E1203 123036 left_only removed Production en laboratoire photographique Former un public
2115 E1203 123466 left_only removed Production en laboratoire photographique Sensibiliser et former les personnels aux consignes de sécurité et de prévention

This seems alright too: those skills are not really mandatory for those jobs.
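
Several of these changes look like replacements: in the rows shown above, B1301 and E1203 both lose item 123036 and gain item 121060. Here is a minimal sketch that groups changes by job group to make such pairs easier to spot, using only the ten rows displayed in the two outputs above. `changes_by_job_group` is a hypothetical helper.

```python
import collections

# (code_rome, code_ogr, diff) rows copied from the two outputs above.
changes = [
    ('C1502', 102694, 'added'),
    ('B1301', 121060, 'added'),
    ('E1203', 121060, 'added'),
    ('E1202', 121060, 'added'),
    ('E1302', 102694, 'added'),
    ('C1502', 124083, 'removed'),
    ('C1501', 119084, 'removed'),
    ('B1301', 123036, 'removed'),
    ('E1203', 123036, 'removed'),
    ('E1203', 123466, 'removed'),
]


def changes_by_job_group(rows):
    """Map each job group to its added and removed item codes."""
    grouped = collections.defaultdict(lambda: {'added': [], 'removed': []})
    for code_rome, code_ogr, diff in rows:
        grouped[code_rome][diff].append(code_ogr)
    return dict(grouped)


# Only show job groups that both lost and gained items.
for code_rome, diff in sorted(changes_by_job_group(changes).items()):
    if diff['added'] and diff['removed']:
        print('{}: removed {}, added {}'.format(code_rome, diff['removed'], diff['added']))
```

Since the outputs only show the first five rows of each kind, this is a partial view; on the full data the same grouping could be applied to every row of `links_merged` with a non-null `_diff`.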

Conclusion

The new version of ROME, v334, introduces very minor changes, which match quite well what Pôle emploi wrote in their changelog. The transition should be seamless, with a very small advantage over the old version.

However, a whole part of the arborescence was dropped: three of its five subtrees. It has no impact on Bob for now, since we don't use them yet, but we should follow up with Pôle emploi to find out where the renamed subtrees now live, because we might need them in the future.