ROME update from v337 to v338

In March 2019 a new version of the ROME was released. I want to investigate what changed and whether we need to do anything about it.

You might not be able to reproduce this notebook, mostly because it requires both versions of the ROME in your data/rome/csv folder, which only happens just before we switch to v338. You will have to trust me on the results ;-)

Skip the run test because it requires older versions of the ROME.



In [1]:

    
import collections
import glob
import os
from os import path

import matplotlib_venn
import pandas as pd

rome_path = path.join(os.getenv('DATA_FOLDER'), 'rome/csv')

OLD_VERSION = '337'
NEW_VERSION = '338'

old_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(OLD_VERSION)))
new_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(NEW_VERSION)))

First let's check if there are new or deleted files (only matching by file names).



In [2]:

    
new_files = new_version_files - frozenset(f.replace(OLD_VERSION, NEW_VERSION) for f in old_version_files)
deleted_files = old_version_files - frozenset(f.replace(NEW_VERSION, OLD_VERSION) for f in new_version_files)

print('{:d} new files'.format(len(new_files)))
print('{:d} deleted files'.format(len(deleted_files)))









    



0 new files
0 deleted files

So we have the same set of files in both versions: good start.
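The file-name comparison above boils down to two set differences after swapping the version number in each name. Here is a minimal sketch on made-up file names; the helper `diff_versioned_files` is hypothetical and not part of the notebook.

```python
OLD_VERSION = '337'
NEW_VERSION = '338'


def diff_versioned_files(old_files, new_files,
                         old_version=OLD_VERSION, new_version=NEW_VERSION):
    """Return (new, deleted) file names, matching files across versions by name."""
    old_files = frozenset(old_files)
    new_files = frozenset(new_files)
    # A file is new if no old file maps onto it once the version is swapped.
    new = new_files - frozenset(
        f.replace(old_version, new_version) for f in old_files)
    # A file is deleted if no new file maps back onto it.
    deleted = old_files - frozenset(
        f.replace(new_version, old_version) for f in new_files)
    return new, deleted


# Made-up file names, for illustration only.
old = ['unix_item_v337_utf8.csv', 'unix_gone_v337_utf8.csv']
new = ['unix_item_v338_utf8.csv', 'unix_fresh_v338_utf8.csv']
added, removed = diff_versioned_files(old, new)
print(sorted(added))
print(sorted(removed))
```

Note that `str.replace` swaps every occurrence of the version string, so this only works as long as '337' or '338' does not also appear elsewhere in the path.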

Now let's set up a dataset that, for each table, links both the old and the new file together.



In [3]:

    
# Load all ROME datasets for the two versions we compare.
VersionedDataset = collections.namedtuple('VersionedDataset', ['basename', 'old', 'new'])
rome_data = [VersionedDataset(
        basename=path.basename(f),
        old=pd.read_csv(f.replace(NEW_VERSION, OLD_VERSION)),
        new=pd.read_csv(f))
    for f in sorted(new_version_files)]

def find_rome_dataset_by_name(data, partial_name):
    for dataset in data:
        if 'unix_{}_v{}_utf8.csv'.format(partial_name, NEW_VERSION) == dataset.basename:
            return dataset
    raise ValueError('No dataset named {}, the list is\n{}'.format(partial_name, [d.basename for d in data]))

Let's make sure the structure hasn't changed:



In [4]:

    
for dataset in rome_data:
    if set(dataset.old.columns) != set(dataset.new.columns):
        print('Columns of {} have changed.'.format(dataset.basename))

All files have the same columns as before: still good.
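Had the check above printed something, the next question would have been which columns changed. A small sketch of how that could be reported, using a hypothetical helper `column_changes` and toy DataFrames:

```python
import pandas as pd


def column_changes(old, new):
    """Return (added, removed) column names between two DataFrames."""
    old_columns = set(old.columns)
    new_columns = set(new.columns)
    return sorted(new_columns - old_columns), sorted(old_columns - new_columns)


# Toy data: the column 'libelle' is replaced by 'libelle_long'.
old_df = pd.DataFrame({'code_ogr': [1], 'libelle': ['a']})
new_df = pd.DataFrame({'code_ogr': [1], 'libelle_long': ['a']})
added_columns, removed_columns = column_changes(old_df, new_df)
print('added:', added_columns, 'removed:', removed_columns)
```

Like the check in the notebook, this compares sets of names, so a change in column order alone would not be reported.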

Now let's see, for each file, whether there are more or fewer rows.



In [5]:

    
same_row_count_files = 0
for dataset in rome_data:
    diff = len(dataset.new.index) - len(dataset.old.index)
    if diff > 0:
        print('{:d}/{:d} values added in {}'.format(
            diff, len(dataset.new.index), dataset.basename))
    elif diff < 0:
        print('{:d}/{:d} values removed in {}'.format(
            -diff, len(dataset.old.index), dataset.basename))
    else:
        same_row_count_files += 1
print('{:d}/{:d} files with the same number of rows'.format(
    same_row_count_files, len(rome_data)))









    



21/31120 values added in unix_coherence_item_v338_utf8.csv
4/11713 values added in unix_cr_gd_dp_appellations_v338_utf8.csv
9/2001 values removed in unix_item_arborescence_v338_utf8.csv
7/13522 values added in unix_item_v338_utf8.csv
25/42709 values added in unix_liens_rome_referentiels_v338_utf8.csv
1/7411 values added in unix_referentiel_activite_riasec_v338_utf8.csv
1/8969 values added in unix_referentiel_activite_v338_utf8.csv
4/11057 values added in unix_referentiel_appellation_v338_utf8.csv
1/5043 values added in unix_texte_v338_utf8.csv
12/21 files with the same number of rows

There are some minor changes in 9 of the 21 files, but based on my knowledge of ROME, none in the main files.

The most interesting ones are in referentiel_appellation, item, and liens_rome_referentiels, so let's look at them more closely.



In [6]:

    
jobs = find_rome_dataset_by_name(rome_data, 'referentiel_appellation')

new_jobs = set(jobs.new.code_ogr) - set(jobs.old.code_ogr)
obsolete_jobs = set(jobs.old.code_ogr) - set(jobs.new.code_ogr)
stable_jobs = set(jobs.new.code_ogr) & set(jobs.old.code_ogr)

matplotlib_venn.venn2((len(obsolete_jobs), len(new_jobs), len(stable_jobs)), (OLD_VERSION, NEW_VERSION));

Alright, so the only change seems to be the addition of 4 new jobs. Let's take a look (only showing interesting fields):



In [7]:

    
pd.options.display.max_colwidth = 2000
jobs.new[jobs.new.code_ogr.isin(new_jobs)][['code_ogr', 'libelle_appellation_long', 'code_rome']]









    Out[7]:

	code_ogr	libelle_appellation_long	code_rome
3922	140966	Piqueteur / Piqueteuse	F1107
3923	140967	Préparateur vendeur / Préparatrice vendeuse de pâtes alimentaires fraîches	G1604
3924	140968	Directeur adjoint / Directrice adjointe de maison de retraite	K1403
3925	140969	Directeur adjoint / Directrice adjointe d''établissement médicosocial	K1403
These seem to be refinements of existing jobs, but that's fine.

OK, let's look at the changes in items:
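The three-way split into new, obsolete, and stable codes used for jobs above is plain set arithmetic, and the same split is needed again for items. A minimal sketch of how it could be factored out, with a hypothetical helper `diff_codes` and made-up codes:

```python
import collections

CodeDiff = collections.namedtuple('CodeDiff', ['new', 'obsolete', 'stable'])


def diff_codes(old_codes, new_codes):
    """Split codes into those added, those dropped, and those kept."""
    old_codes = set(old_codes)
    new_codes = set(new_codes)
    return CodeDiff(
        new=new_codes - old_codes,
        obsolete=old_codes - new_codes,
        stable=new_codes & old_codes)


# Made-up codes, for illustration only.
code_diff = diff_codes([1, 2, 3], [2, 3, 4, 5])
print(code_diff)
```

The sizes of the three sets, in the order obsolete, new, stable, are what `matplotlib_venn.venn2` receives in the notebook.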



In [8]:

    
items = find_rome_dataset_by_name(rome_data, 'item')

new_items = set(items.new.code_ogr) - set(items.old.code_ogr)
obsolete_items = set(items.old.code_ogr) - set(items.new.code_ogr)
stable_items = set(items.new.code_ogr) & set(items.old.code_ogr)

matplotlib_venn.venn2((len(obsolete_items), len(new_items), len(stable_items)), (OLD_VERSION, NEW_VERSION));

As anticipated, it is a very minor change (hard to see visually): one item is obsolete and 2 new ones have been created. Let's have a look at them.



In [9]:

    
items.new[items.new.code_ogr.isin(new_items)].head()









    Out[9]:

	code_ogr	libelle	code_type_referentiel	code_ref_rubrique	code_tete_rgpmt	libelle_activite_impression	libelle_en_tete_regroupement
12657	126446	Effectuer le service de plats à table selon les techniques spécifiques (à l''assiette, à la française, à l''anglaise, ...)	2	9	NaN	NaN	NaN
12758	140970	Effectuer des relevés d''implantation de réseaux de distribution existants (électriques, télécommunications, ...)	2	9	NaN	NaN	NaN
The new ones seem legit to me. Let's check the obsolete one:



In [10]:

    
items.old[items.old.code_ogr.isin(obsolete_items)].head()









    Out[10]:

	code_ogr	libelle	code_type_referentiel	code_ref_rubrique	code_tete_rgpmt	libelle_activite_impression	libelle_en_tete_regroupement
13514	53228	Effectuer le service de plats à table selon des techniques spécifiques (à l''assiette, à la française, à l''anglaise, ...)	2	9	NaN	NaN	NaN
Hmm, it seems to be a simple renaming ("des techniques" became "les techniques"), but they preferred to create a new item and retire the old one.
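Spotting such a rename by eye works for one item, but it could also be automated by pairing each obsolete label with its closest new label. A minimal sketch using `difflib` from the standard library; the helper `find_probable_renames` and the 0.9 cutoff are my own assumptions, and the labels are shortened versions of the ones shown above.

```python
import difflib


def find_probable_renames(obsolete, new, cutoff=0.9):
    """Map each obsolete code to the new code with the most similar label.

    Both arguments are dicts of code -> label. Obsolete items with no new
    label above the similarity cutoff are left out of the result.
    """
    new_codes_by_label = {label: code for code, label in new.items()}
    renames = {}
    for old_code, old_label in obsolete.items():
        matches = difflib.get_close_matches(
            old_label, list(new_codes_by_label), n=1, cutoff=cutoff)
        if matches:
            renames[old_code] = new_codes_by_label[matches[0]]
    return renames


# Labels shortened from the tables above.
obsolete_labels = {
    53228: 'Effectuer le service de plats à table selon des techniques spécifiques',
}
new_labels = {
    126446: 'Effectuer le service de plats à table selon les techniques spécifiques',
    140970: 'Effectuer des relevés de réseaux de distribution existants',
}
renames = find_probable_renames(obsolete_labels, new_labels)
print(renames)
```

A new item that does not appear as a value in the result, here 140970, would then be a genuinely new item rather than a rename.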

The changes in liens_rome_referentiels include changes for those items, so let's only check the changes not related to those.



In [11]:

    
links = find_rome_dataset_by_name(rome_data, 'liens_rome_referentiels')
old_links_on_stable_items = links.old[links.old.code_ogr.isin(stable_items)]
new_links_on_stable_items = links.new[links.new.code_ogr.isin(stable_items)]

old = old_links_on_stable_items[['code_rome', 'code_ogr']]
new = new_links_on_stable_items[['code_rome', 'code_ogr']]

links_merged = old.merge(new, how='outer', indicator=True)
links_merged['_diff'] = links_merged._merge.map({'left_only': 'removed', 'right_only': 'added'})
links_merged._diff.value_counts()









    Out[11]:





added    20
Name: _diff, dtype: int64
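The outer merge with `indicator=True` used in the cell above is what turns two tables of links into a diff. A minimal sketch of the same technique on toy data (the codes are made up):

```python
import pandas as pd

# Toy links: (A1, 2) only exists in the old version, (B2, 4) only in the new one.
old_links = pd.DataFrame({'code_rome': ['A1', 'A1', 'B2'], 'code_ogr': [1, 2, 3]})
new_links = pd.DataFrame({'code_rome': ['A1', 'B2', 'B2'], 'code_ogr': [1, 3, 4]})

# With no `on` argument, merge joins on all the columns the two frames share.
# `indicator=True` adds a `_merge` column: left_only, right_only or both.
merged = old_links.merge(new_links, how='outer', indicator=True)

# Rows present in both versions map to NaN and drop out of value_counts.
merged['_diff'] = merged['_merge'].astype(str).map(
    {'left_only': 'removed', 'right_only': 'added'})
diff_counts = merged['_diff'].value_counts().to_dict()
print(diff_counts)
```

This is also why the output above only lists "added": a category with no rows, such as "removed" here, does not show up in the counts at all.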

So in addition to the added and removed items, there are a few fixes: 20 links were added on stable items and none were removed. Let's have a look at them:



In [12]:

    
job_group_names = find_rome_dataset_by_name(rome_data, 'referentiel_code_rome').new.set_index('code_rome').libelle_rome
item_names = items.new.set_index('code_ogr').libelle.drop_duplicates()
links_merged['job_group_name'] = links_merged.code_rome.map(job_group_names)
links_merged['item_name'] = links_merged.code_ogr.map(item_names)
display(links_merged[links_merged._diff == 'removed'].dropna().head(5))
links_merged[links_merged._diff == 'added'].dropna().head(5)









    







  
    
      
	code_rome	code_ogr	_merge	_diff	job_group_name	item_name
(no rows)

    Out[12]:

	code_rome	code_ogr	_merge	_diff	job_group_name	item_name
31098	D1408	115919	right_only	added	Téléconseil et télévente	Communication digitale
31099	F1107	102717	right_only	added	Mesures topographiques	Electricité
31100	F1107	103562	right_only	added	Mesures topographiques	Technologie des fibres optiques
31101	F1107	104548	right_only	added	Mesures topographiques	Informatique
31102	F1107	118443	right_only	added	Mesures topographiques	Réseaux de fibre optique Fiber To The Home (FTTH)
Those fixes make sense (not sure why they were not done before, but let's not complain: it is fixed now).

That's all the changes we wanted to check (no change in referentiel_code_rome).
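A side note on the name lookups in the last cell: they rely on `Series.map` with a Series indexed by code. As far as I know, the lookup Series needs a unique index for this to work, so if a code could appear on several rows it is safer to de-duplicate on the index rather than on the labels. A minimal sketch on toy data (codes are made up; the labels are borrowed from the output above):

```python
import pandas as pd

# Toy items table where code 2 appears twice.
toy_items = pd.DataFrame({
    'code_ogr': [1, 2, 2],
    'libelle': ['Electricité', 'Informatique', 'Informatique'],
})
toy_names = toy_items.set_index('code_ogr')['libelle']
# Keep the first label for each code so that the index is unique.
toy_names = toy_names[~toy_names.index.duplicated(keep='first')]

toy_links = pd.DataFrame({
    'code_rome': ['F1107', 'F1107', 'D1408'],
    'code_ogr': [1, 2, 3],
})
# Codes missing from the lookup (here 3) get NaN.
toy_links['item_name'] = toy_links['code_ogr'].map(toy_names)
print(toy_links)
```

Links whose code has no name end up with NaN, which is what the `dropna()` calls in the notebook filter out before display.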

Conclusion

The new version of ROME, v338, introduces very minor changes which reflect quite well what they wrote in their changelog. The transition should be transparent with a very small advantage over the old version.
