ROME update from v330 to v331

In March 2017 a new version of the ROME was realeased. I want to investigate what changed and whether we need to do anything about it.

You might not be able to reproduce this notebook, mostly because it requires to have the two versions of the ROME in your data/rome/csv folder which happens only just before we switch to v330. You will have to trust me on the results ;-)

Skip the run test because it requires older versions of the ROME.



In [1]:

    
import collections
import glob
import os
from os import path

import matplotlib_venn
import pandas

rome_path = path.join(os.getenv('DATA_FOLDER'), 'rome/csv')

OLD_VERSION = '330'
NEW_VERSION = '331'

old_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(OLD_VERSION)))
new_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(NEW_VERSION)))

First let's check if there are new or deleted files (only matching by file names).



In [2]:

    
new_files = new_version_files - frozenset(f.replace(OLD_VERSION, NEW_VERSION) for f in old_version_files)
deleted_files = old_version_files - frozenset(f.replace(NEW_VERSION, OLD_VERSION) for f in new_version_files)

print('{:d} new files'.format(len(new_files)))
print('{:d} deleted files'.format(len(deleted_files)))









    



0 new files
0 deleted files

So we have the same set of files: good start.

Now let's set up a dataset that, for each table, links the old file and the new file.



In [3]:

    
new_to_old = dict((f, f.replace(NEW_VERSION, OLD_VERSION)) for f in new_version_files)

# Load all datasets.
Dataset = collections.namedtuple('Dataset', ['basename', 'old', 'new'])
data = [Dataset(
        basename=path.basename(f),
        old=pandas.read_csv(f.replace(NEW_VERSION, OLD_VERSION)),
        new=pandas.read_csv(f))
    for f in sorted(new_version_files)]

def find_dataset_by_name(data, partial_name):
    for dataset in data:
        if 'unix_{}_v{}_utf8.csv'.format(partial_name, NEW_VERSION) == dataset.basename:
            return dataset
    raise ValueError('No dataset named {}, the list is\n{}'.format(partial_name, [dataset.basename for d in data]))

Let's make sure the structure hasn't changed:



In [4]:

    
for dataset in data:
    if set(dataset.old.columns) != set(dataset.new.columns):
        print('Columns of {} have changed.'.format(dataset.basename))

All files have the same columns as before: still good.



In [5]:

    
untouched = 0
for dataset in data:
    diff = len(dataset.new.index) - len(dataset.old.index)
    if diff > 0:
        print('{:d} values added in {}'.format(diff, dataset.basename))
    elif diff < 0:
        print('{:d} values removed in {}'.format(diff, dataset.basename))
    else:
        untouched += 1
print('{:d}/{:d} files with the same number of rows'.format(untouched, len(data)))









    



219 values added in unix_coherence_item_v331_utf8.csv
79 values added in unix_item_v331_utf8.csv
219 values added in unix_liens_rome_referentiels_v331_utf8.csv
86 values added in unix_referentiel_activite_riasec_v331_utf8.csv
-22 values removed in unix_referentiel_activite_v331_utf8.csv
16/21 files with the same number of rows

So we have minor changes in 5 files. At one point we cared about referentiel_activite and referentiel_activite_riasec but have no concrete application for now.

The only interesting ones are in item and liens_rome_referentiels, so let's see more precisely.



In [6]:

    
items = find_dataset_by_name(data, 'item')

new_items = set(items.new.code_ogr) - set(items.old.code_ogr)
obsolete_items = set(items.old.code_ogr) - set(items.new.code_ogr)
stable_items = set(items.new.code_ogr) & set(items.old.code_ogr)

matplotlib_venn.venn2((len(obsolete_items), len(new_items), len(stable_items)), (OLD_VERSION, NEW_VERSION));

As anticipated it is a very minor change: some items are now obsolete and new ones have been created. Let's have a look.



In [7]:

    
items.old[items.old.code_ogr.isin(obsolete_items)].tail()









    Out[7]:






  
    
      
      code_ogr
      libelle
      code_type_referentiel
      code_ref_rubrique
      code_tete_rgpmt
      libelle_activite_impression
      libelle_en_tete_regroupement
    
  
  
    
      11732
      125599
      Techniques de gravure
      1
      7
      NaN
      NaN
      NaN
    
    
      11838
      125719
      Dératisation
      1
      7
      NaN
      NaN
      NaN
    
    
      11871
      125757
      Concevoir les objets et matériels nécessaires ...
      2
      6
      NaN
      NaN
      NaN
    
    
      11893
      125784
      Stérilisation
      1
      10
      NaN
      NaN
      NaN
    
    
      11894
      125784
      Stérilisation
      1
      7
      NaN
      NaN
      NaN



In [8]:

    
items.new[items.new.code_ogr.isin(new_items)].head()









    Out[8]:






  
    
      
      code_ogr
      libelle
      code_type_referentiel
      code_ref_rubrique
      code_tete_rgpmt
      libelle_activite_impression
      libelle_en_tete_regroupement
    
  
  
    
      5123
      113235
      Langage informatique Apache
      1
      10
      NaN
      NaN
      NaN
    
    
      5124
      113236
      Langage informatique CSS
      1
      10
      NaN
      NaN
      NaN
    
    
      5128
      113250
      Système d''exploitation informatique Apache To...
      1
      10
      NaN
      NaN
      NaN
    
    
      5130
      113268
      Système d''exploitation informatique VirtualBox
      1
      10
      NaN
      NaN
      NaN
    
    
      5132
      113281
      Langage informatique Script Shell
      1
      10
      NaN
      NaN
      NaN

Those entries look legitimate.

The changes in liens_rome_referentiels include changes for those items, so let's only check the changes not related to those.



In [9]:

    
links = find_dataset_by_name(data, 'liens_rome_referentiels')
old_links_on_stable_items = links.old[links.old.code_ogr.isin(stable_items)]
new_links_on_stable_items = links.new[links.new.code_ogr.isin(stable_items)]

old = old_links_on_stable_items[['code_rome', 'code_ogr']]
new = new_links_on_stable_items[['code_rome', 'code_ogr']]

links_merged = old.merge(new, how='outer', indicator=True)
links_merged['_diff'] = links_merged._merge.map({'left_only': 'removed', 'right_only': 'added'})
links_merged._diff.value_counts()









    Out[9]:





added      49
removed     3
Name: _diff, dtype: int64

So in addition to the added and remove items, there are 52 fixes. Let's have a look:



In [10]:

    
job_group_names = find_dataset_by_name(data, 'referentiel_code_rome').old.set_index('code_rome').libelle_rome
item_names = items.new.set_index('code_ogr').libelle.drop_duplicates()
links_merged['job_group_name'] = links_merged.code_rome.map(job_group_names)
links_merged['item_name'] = links_merged.code_ogr.map(item_names)
links_merged.dropna().head(10)









    Out[10]:






  
    
      
      code_rome
      code_ogr
      _merge
      _diff
      job_group_name
      item_name
    
  
  
    
      1542
      D1403
      121251.0
      left_only
      removed
      Relation commerciale auprès de particuliers
      Analyser un plan isométrique
    
    
      24575
      N1103
      126446.0
      left_only
      removed
      Magasinage et préparation de commandes
      Effectuer le service de plats à table selon de...
    
    
      30158
      D1214
      118367.0
      left_only
      removed
      Vente en habillement et accessoires de la pers...
      Articles de bijouterie / joaillerie
    
    
      30281
      D1202
      123058.0
      right_only
      added
      Coiffure
      Techniques de séchage
    
    
      30282
      E1103
      116902.0
      right_only
      added
      Communication
      Stratégies de communication interne
    
    
      30283
      E1103
      116903.0
      right_only
      added
      Communication
      Stratégies de communication externe
    
    
      30284
      E1103
      117426.0
      right_only
      added
      Communication
      Communication interne
    
    
      30285
      E1103
      117427.0
      right_only
      added
      Communication
      Communication externe
    
    
      30286
      D1403
      102065.0
      right_only
      added
      Relation commerciale auprès de particuliers
      Diététique
    
    
      30287
      B1303
      117952.0
      right_only
      added
      Gravure - ciselure
      Techniques de gravure

Those fixes make sense (not sure why they were not done before, but let's not complain: it is fixed now).

Conclusion

The new version of ROME, v333, introduces very minor changes which reflect quite well what they wrote in their changelog. The transition should be transparent with a very small advantage on the new one.

	code_ogr	libelle	code_type_referentiel	code_ref_rubrique	code_tete_rgpmt	libelle_activite_impression	libelle_en_tete_regroupement
11732	125599	Techniques de gravure	1	7	NaN	NaN	NaN
11838	125719	Dératisation	1	7	NaN	NaN	NaN
11871	125757	Concevoir les objets et matériels nécessaires ...	2	6	NaN	NaN	NaN
11893	125784	Stérilisation	1	10	NaN	NaN	NaN
11894	125784	Stérilisation	1	7	NaN	NaN	NaN

	code_ogr	libelle	code_type_referentiel	code_ref_rubrique	code_tete_rgpmt	libelle_activite_impression	libelle_en_tete_regroupement
5123	113235	Langage informatique Apache	1	10	NaN	NaN	NaN
5124	113236	Langage informatique CSS	1	10	NaN	NaN	NaN
5128	113250	Système d''exploitation informatique Apache To...	1	10	NaN	NaN	NaN
5130	113268	Système d''exploitation informatique VirtualBox	1	10	NaN	NaN	NaN
5132	113281	Langage informatique Script Shell	1	10	NaN	NaN	NaN

	code_rome	code_ogr	_merge	_diff	job_group_name	item_name
1542	D1403	121251.0	left_only	removed	Relation commerciale auprès de particuliers	Analyser un plan isométrique
24575	N1103	126446.0	left_only	removed	Magasinage et préparation de commandes	Effectuer le service de plats à table selon de...
30158	D1214	118367.0	left_only	removed	Vente en habillement et accessoires de la pers...	Articles de bijouterie / joaillerie
30281	D1202	123058.0	right_only	added	Coiffure	Techniques de séchage
30282	E1103	116902.0	right_only	added	Communication	Stratégies de communication interne
30283	E1103	116903.0	right_only	added	Communication	Stratégies de communication externe
30284	E1103	117426.0	right_only	added	Communication	Communication interne
30285	E1103	117427.0	right_only	added	Communication	Communication externe
30286	D1403	102065.0	right_only	added	Relation commerciale auprès de particuliers	Diététique
30287	B1303	117952.0	right_only	added	Gravure - ciselure	Techniques de gravure