ROME update from v335 to v337

In December 2018 a new version of the ROME was released. I want to investigate what changed and whether we need to do anything about it.

You might not be able to reproduce this notebook, mostly because it requires both versions of the ROME in your data/rome/csv folder, which only happens just before we switch to v337. You will have to trust me on the results ;-)

Skip the run test because it requires older versions of the ROME.



In [1]:

    
import collections
import glob
import os
from os import path

import matplotlib_venn
import pandas as pd

rome_path = path.join(os.getenv('DATA_FOLDER'), 'rome/csv')

OLD_VERSION = '335'
NEW_VERSION = '337'

old_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(OLD_VERSION)))
new_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(NEW_VERSION)))

First let's check if there are new or deleted files (only matching by file names).



In [2]:

    
new_files = new_version_files - frozenset(f.replace(OLD_VERSION, NEW_VERSION) for f in old_version_files)
deleted_files = old_version_files - frozenset(f.replace(NEW_VERSION, OLD_VERSION) for f in new_version_files)

print('{:d} new files'.format(len(new_files)))
print('{:d} deleted files'.format(len(deleted_files)))

0 new files
0 deleted files

So we have the same set of files in both versions: good start.
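The name-based matching used above can be tried on a toy example. This is only a minimal sketch: the file names are made up (they follow the `unix_*_vNNN_utf8.csv` pattern seen in the outputs of this notebook) and the helper `diff_file_names` is hypothetical. Note that `str.replace` rewrites every occurrence of the version number, so the approach assumes the digits `335` and `337` appear nowhere else in the path.

```python
def diff_file_names(old_files, new_files, old_version, new_version):
    """Return (added, deleted) file names, matching files by name across versions."""
    old_files = frozenset(old_files)
    new_files = frozenset(new_files)
    # Rename each old file as it would be called in the new version, and vice versa.
    expected_new = frozenset(f.replace(old_version, new_version) for f in old_files)
    expected_old = frozenset(f.replace(new_version, old_version) for f in new_files)
    return new_files - expected_new, old_files - expected_old


# Made-up file names, for illustration only.
old = ['unix_item_v335_utf8.csv', 'unix_texte_v335_utf8.csv']
new = ['unix_item_v337_utf8.csv', 'unix_extra_v337_utf8.csv']

added, deleted = diff_file_names(old, new, '335', '337')
print(sorted(added))    # files that only exist in the new version
print(sorted(deleted))  # files that only exist in the old version
```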

Now let's set up a dataset that, for each table, links both the old and the new file together.



In [3]:

    
# Load all ROME datasets for the two versions we compare.
VersionedDataset = collections.namedtuple('VersionedDataset', ['basename', 'old', 'new'])
rome_data = [VersionedDataset(
        basename=path.basename(f),
        old=pd.read_csv(f.replace(NEW_VERSION, OLD_VERSION)),
        new=pd.read_csv(f))
    for f in sorted(new_version_files)]

def find_rome_dataset_by_name(data, partial_name):
    for dataset in data:
        if 'unix_{}_v{}_utf8.csv'.format(partial_name, NEW_VERSION) == dataset.basename:
            return dataset
    raise ValueError('No dataset named {}, the list is\n{}'.format(partial_name, [d.basename for d in data]))

Let's make sure the structure hasn't changed:



In [4]:

    
for dataset in rome_data:
    if set(dataset.old.columns) != set(dataset.new.columns):
        print('Columns of {} have changed.'.format(dataset.basename))

All files have the same columns as before: still good.
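The check above compares column sets, so it ignores column order and only prints the file name when something differs. If a future version did change the structure, a more detailed report might look like the sketch below. The helper `describe_column_changes` and the column `code_type` are hypothetical and only serve the illustration.

```python
def describe_column_changes(old_columns, new_columns):
    """Summarize which columns were added, removed, or merely reordered."""
    old_set, new_set = set(old_columns), set(new_columns)
    return {
        'added': sorted(new_set - old_set),
        'removed': sorted(old_set - new_set),
        # Same columns, but in a different order.
        'reordered': old_set == new_set and list(old_columns) != list(new_columns),
    }


# Made-up column lists, for illustration only.
changes = describe_column_changes(
    ['code_ogr', 'libelle'],
    ['code_ogr', 'libelle', 'code_type'])
print(changes)
```

With the real data it could be called as `describe_column_changes(dataset.old.columns, dataset.new.columns)`.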

Now let's see, for each file, whether there are more or fewer rows.



In [5]:

    
same_row_count_files = 0
for dataset in rome_data:
    diff = len(dataset.new.index) - len(dataset.old.index)
    if diff > 0:
        print('{:d}/{:d} values added in {}'.format(
            diff, len(dataset.new.index), dataset.basename))
    elif diff < 0:
        print('{:d}/{:d} values removed in {}'.format(
            -diff, len(dataset.old.index), dataset.basename))
    else:
        same_row_count_files += 1
print('{:d}/{:d} files with the same number of rows'.format(
    same_row_count_files, len(rome_data)))

280/31099 values added in unix_coherence_item_v337_utf8.csv
17/11709 values added in unix_cr_gd_dp_appellations_v337_utf8.csv
9/2001 values added in unix_item_arborescence_v337_utf8.csv
159/13515 values added in unix_item_v337_utf8.csv
297/42684 values added in unix_liens_rome_referentiels_v337_utf8.csv
4/7410 values added in unix_referentiel_activite_riasec_v337_utf8.csv
174/9142 values removed in unix_referentiel_activite_v337_utf8.csv
17/11053 values added in unix_referentiel_appellation_v337_utf8.csv
76/4955 values removed in unix_referentiel_competence_v337_utf8.csv
2/5042 values added in unix_texte_v337_utf8.csv
11/21 files with the same number of rows

There are some minor changes in many files but, based on my knowledge of ROME, none in the main files.
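To put a number on "minor", the counts printed above can be turned into percentages. The sketch below simply copies those counts by hand (negative values stand for removed rows, and the denominator is the one printed above: the new row count for additions, the old one for removals). Every table changes by less than 2% of its rows.

```python
# (row difference, reference row count, table name), copied from the output above.
row_changes = [
    (280, 31099, 'coherence_item'),
    (17, 11709, 'cr_gd_dp_appellations'),
    (9, 2001, 'item_arborescence'),
    (159, 13515, 'item'),
    (297, 42684, 'liens_rome_referentiels'),
    (4, 7410, 'referentiel_activite_riasec'),
    (-174, 9142, 'referentiel_activite'),
    (17, 11053, 'referentiel_appellation'),
    (-76, 4955, 'referentiel_competence'),
    (2, 5042, 'texte'),
]

# Relative size of each change, largest first.
relative_changes = sorted(
    ((abs(diff) / total * 100, name) for diff, total, name in row_changes),
    reverse=True)
for percent, name in relative_changes:
    print('{:5.2f}% {}'.format(percent, name))
```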

The most interesting ones are in referentiel_appellation, item, and liens_rome_referentiels, so let's look at those more closely.



In [6]:

    
jobs = find_rome_dataset_by_name(rome_data, 'referentiel_appellation')

new_jobs = set(jobs.new.code_ogr) - set(jobs.old.code_ogr)
obsolete_jobs = set(jobs.old.code_ogr) - set(jobs.new.code_ogr)
stable_jobs = set(jobs.new.code_ogr) & set(jobs.old.code_ogr)

matplotlib_venn.venn2((len(obsolete_jobs), len(new_jobs), len(stable_jobs)), (OLD_VERSION, NEW_VERSION));

Alright, so the only change seems to be 17 new jobs added. Let's take a look (only showing interesting fields):



In [7]:

    
pd.options.display.max_colwidth = 2000
jobs.new[jobs.new.code_ogr.isin(new_jobs)][['code_ogr', 'libelle_appellation_long', 'code_rome']]

    Out[7]:

	code_ogr	libelle_appellation_long	code_rome
3903	140943	Motion designer multimédia	E1205
3904	140944	Chef de chantier génie climatique	F1202
3905	140945	Facilitateur / Facilitatrice de FabLab - laboratoire de fabrication numérique	G1202
3906	140946	Technicien / Technicienne ERD en acoustique	H1210
3907	140947	Technicien / Technicienne en mesures acoustiques et contrôle du bruit	H1503
3908	140948	Accompagnant éducatif et social / Accompagnante éducative et sociale en structure collective	K1301
3909	140949	Accompagnant éducatif et social / Accompagnante éducative et sociale à domicile	K1302
3911	140950	Accompagnant éducatif et social / Accompagnante éducative et sociale en milieu scolaire	K2104
3912	140951	Motion designer - films d''animation	L1510
3913	140952	Développeur / Développeuse de logiciel de simulation optique et réalité virtuelle	M1805
3914	140953	Développeur / Développeuse logiciel de réalité augmentée	M1805
3915	140954	Chauffeur / Chauffeuse VTC	N4102
3916	140959	Polyéleveur / Polyéleveuse	A1416
3918	140960	Batteur / Batteuse de palplanches	F1701
3919	140961	Agent / Agente de traversée des écoles	K1707
3920	140962	Responsable développement / Business Developer	M1707
3921	140963	Annonceur / Annonceuse sentinelle ferroviaire	N4403

They mostly seem to be genuinely new occupations: Business Developer, VR/AR-related jobs, climate engineering (génie climatique) jobs.
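The split into obsolete, new and stable IDs used for jobs is plain set arithmetic, and the same pattern applies to any table keyed by `code_ogr`. Here is a minimal sketch with made-up IDs; the helper `partition_ids` is hypothetical. The sizes are returned in the order this notebook passes them to `venn2`: only in the old version, only in the new version, in both.

```python
def partition_ids(old_ids, new_ids):
    """Split IDs into (obsolete, new, stable) sets."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    return old_ids - new_ids, new_ids - old_ids, old_ids & new_ids


# Made-up IDs, for illustration only.
obsolete, new, stable = partition_ids([1, 2, 3], [2, 3, 4, 5])

# Sizes in the order (only old, only new, both).
sizes = (len(obsolete), len(new), len(stable))
print(sizes)
```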

OK, let's look at the changes in items:



In [8]:

    
items = find_rome_dataset_by_name(rome_data, 'item')

new_items = set(items.new.code_ogr) - set(items.old.code_ogr)
obsolete_items = set(items.old.code_ogr) - set(items.new.code_ogr)
stable_items = set(items.new.code_ogr) & set(items.old.code_ogr)

matplotlib_venn.venn2((len(obsolete_items), len(new_items), len(stable_items)), (OLD_VERSION, NEW_VERSION));

As anticipated, it is a very minor change (hard to see visually): there are no obsolete items, but new ones have been created. Let's have a look at them.



In [9]:

    
items.new[items.new.code_ogr.isin(new_items)].head()

    Out[9]:

	code_ogr	libelle	code_type_referentiel	code_ref_rubrique	code_tete_rgpmt	libelle_activite_impression	libelle_en_tete_regroupement
5934	115802	Mettre en place des outils d''aide à la surveillance, à la sécurité et à l''exploitation des équipements informatiques	2	9	NaN	NaN	NaN
5935	115817	Contrôler la gestion de la paie	2	9	NaN	NaN	NaN
5951	115911	Responsabilité Sociétale des Entreprises (RSE)	1	10	NaN	NaN	NaN
6074	116538	Techniques de réglage de lignes de conditionnement	1	10	NaN	NaN	NaN
6118	116639	Pharmacologie	1	7	NaN	NaN	NaN
The new ones seem legit to me.

The changes in liens_rome_referentiels include changes for those items, so let's only check the changes not related to those.



In [10]:

    
links = find_rome_dataset_by_name(rome_data, 'liens_rome_referentiels')
old_links_on_stable_items = links.old[links.old.code_ogr.isin(stable_items)]
new_links_on_stable_items = links.new[links.new.code_ogr.isin(stable_items)]

old = old_links_on_stable_items[['code_rome', 'code_ogr']]
new = new_links_on_stable_items[['code_rome', 'code_ogr']]

links_merged = old.merge(new, how='outer', indicator=True)
links_merged['_diff'] = links_merged._merge.map({'left_only': 'removed', 'right_only': 'added'})
links_merged._diff.value_counts()

    Out[10]:

added      33
removed     2
Name: _diff, dtype: int64
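The diff above relies on an outer merge with `indicator=True`: pandas adds a `_merge` column saying whether each row was found only in the left frame, only in the right frame, or in both. A toy example with made-up codes shows the mechanics (a sketch, not the real ROME data):

```python
import pandas as pd

# Made-up links between job groups and items, for illustration only.
old = pd.DataFrame({'code_rome': ['A1', 'A1', 'B2'], 'code_ogr': [10, 11, 12]})
new = pd.DataFrame({'code_rome': ['A1', 'B2', 'B2'], 'code_ogr': [10, 12, 13]})

# Merge on all shared columns; '_merge' tells where each row comes from.
merged = old.merge(new, how='outer', indicator=True)

# Rows present in both versions get NaN here and are ignored by value_counts().
merged['_diff'] = merged['_merge'].map({'left_only': 'removed', 'right_only': 'added'})
diff_counts = merged['_diff'].value_counts()

print(merged)
print(diff_counts)
```

Here the link (A1, 11) only exists in the old frame and (B2, 13) only in the new one, so each label is counted once.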

So in addition to the changes tied to the new items, there are a few fixes on stable items (33 links added, 2 removed). Let's have a look at them:



In [11]:

    
# Lookup tables to show human-readable names next to the codes.
job_group_names = find_rome_dataset_by_name(rome_data, 'referentiel_code_rome').new.set_index('code_rome').libelle_rome
item_names = items.new.set_index('code_ogr').libelle.drop_duplicates()
links_merged['job_group_name'] = links_merged.code_rome.map(job_group_names)
links_merged['item_name'] = links_merged.code_ogr.map(item_names)
display(links_merged[links_merged._diff == 'removed'].dropna().head(5))
links_merged[links_merged._diff == 'added'].dropna().head(5)

	code_rome	code_ogr	_merge	_diff	job_group_name	item_name
15701	J1502	118117	left_only	removed	Coordination de services médicaux ou paramédicaux	Produits de puériculture
29277	A1503	122370	left_only	removed	Toilettage des animaux	Réaliser une Publicité sur le Lieu de Vente (PLV)

    Out[11]:

	code_rome	code_ogr	_merge	_diff	job_group_name	item_name
30819	D1205	101175	right_only	added	Nettoyage d''articles textiles ou cuirs	Techniques de repassage
30820	A1203	101869	right_only	added	Aménagement et entretien des espaces verts	Réglementation phytosanitaire
30821	G1101	104283	right_only	added	Accueil touristique	Environnement culturel et touristique
30822	G1102	104283	right_only	added	Promotion du tourisme local	Environnement culturel et touristique
30823	G1201	104283	right_only	added	Accompagnement de voyages, d''activités culturelles ou sportives	Environnement culturel et touristique
Those fixes make sense (not sure why they were not done before, but let's not complain: it is fixed now).
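One caveat about the name columns in these tables: they come from `Series.map` with a lookup Series, and the item lookup is built with `drop_duplicates()`, which removes repeated values (labels), not repeated index entries (codes). If two items shared a label, only the first would keep its name, and the other would then be filtered out by `dropna()`. Whether that happens in the ROME data is not checked here; the sketch below, on made-up items, only shows the difference with deduplicating on the code instead.

```python
import pandas as pd

# Made-up items: code 2 is listed twice, codes 1 and 3 share the same label.
items = pd.DataFrame({
    'code_ogr': [1, 2, 2, 3],
    'libelle': ['Pharmacologie', 'Paie', 'Paie', 'Pharmacologie'],
})
names = items.set_index('code_ogr')['libelle']

by_value = names.drop_duplicates()          # keeps the first row per label
by_code = names[~names.index.duplicated()]  # keeps the first row per code

links = pd.Series([1, 2, 3])
named_by_value = links.map(by_value)  # code 3 has lost its name
named_by_code = links.map(by_code)    # every code keeps a name
print(named_by_value)
print(named_by_code)
```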

That's all the changes we wanted to check (no change in referentiel_code_rome).

Conclusion

The new version of ROME, v337, introduces very minor changes which reflect quite well what they wrote in their changelog. The transition should be transparent with a very small advantage over the old version.
