Author: Cyrille, cyrille@bayesimpact.org
Date: 2018-04-09
In March 2018 a new version of the ROME was released. I want to investigate what changed and whether we need to do anything about it.
You might not be able to reproduce this notebook, mostly because it requires both versions of the ROME to be in your data/rome/csv folder, which only happens just before we switch to v334. You will have to trust me on the results ;-)
Skip the run test because it requires older versions of the ROME.
In [1]:
import collections
import glob
import os
from os import path
import matplotlib_venn
import pandas as pd
rome_path = path.join(os.getenv('DATA_FOLDER'), 'rome/csv')
OLD_VERSION = '333'
NEW_VERSION = '334'
old_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(OLD_VERSION)))
new_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(NEW_VERSION)))
First let's check if there are new or deleted files (only matching by file names).
In [2]:
new_files = new_version_files - frozenset(f.replace(OLD_VERSION, NEW_VERSION) for f in old_version_files)
deleted_files = old_version_files - frozenset(f.replace(NEW_VERSION, OLD_VERSION) for f in new_version_files)
print('{:d} new files'.format(len(new_files)))
print('{:d} deleted files'.format(len(deleted_files)))
So we have the same set of files in both versions: good start.
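To make the name-based matching concrete, here is a minimal sketch on made-up file names (the names and the helper `diff_versioned_files` are hypothetical, not part of the notebook). One caveat worth keeping in mind: `str.replace` rewrites every occurrence of the version string, so a path that contained "333" somewhere else would be altered too.

```python
# Toy illustration of the name-based comparison; file names are hypothetical.
OLD, NEW = '333', '334'

toy_old_files = frozenset({'unix_item_v333_utf8.csv', 'unix_legacy_v333_utf8.csv'})
toy_new_files = frozenset({'unix_item_v334_utf8.csv', 'unix_extra_v334_utf8.csv'})


def diff_versioned_files(old_files, new_files, old_version, new_version):
    """Return (added, deleted) file names, matching files by name only."""
    # A new file is one whose name is not the renamed form of any old file.
    added = new_files - frozenset(
        f.replace(old_version, new_version) for f in old_files)
    # A deleted file is one whose name is not the renamed form of any new file.
    deleted = old_files - frozenset(
        f.replace(new_version, old_version) for f in new_files)
    return added, deleted


added, deleted = diff_versioned_files(toy_old_files, toy_new_files, OLD, NEW)
print('{:d} new files'.format(len(added)))
print('{:d} deleted files'.format(len(deleted)))
```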
Now let's set up a dataset that, for each table, links both the old and the new file together.
In [3]:
# Load all ROME datasets for the two versions we compare.
VersionedDataset = collections.namedtuple('VersionedDataset', ['basename', 'old', 'new'])
rome_data = [VersionedDataset(
        basename=path.basename(f),
        old=pd.read_csv(f.replace(NEW_VERSION, OLD_VERSION)),
        new=pd.read_csv(f))
    for f in sorted(new_version_files)]


def find_rome_dataset_by_name(data, partial_name):
    for dataset in data:
        if 'unix_{}_v{}_utf8.csv'.format(partial_name, NEW_VERSION) == dataset.basename:
            return dataset
    raise ValueError('No dataset named {}, the list is\n{}'.format(partial_name, [d.basename for d in data]))
Let's make sure the structure hasn't changed:
In [4]:
for dataset in rome_data:
    if set(dataset.old.columns) != set(dataset.new.columns):
        print('Columns of {} have changed.'.format(dataset.basename))
All files have the same columns as before: still good.
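Had the check reported a difference, the next question would have been which columns changed. The sketch below shows one way to list them on toy DataFrames; the helper `describe_column_changes` and the column names are assumptions for illustration. Like the check above, it compares sets, so a mere reordering of columns would go unnoticed.

```python
import pandas as pd


def describe_column_changes(old, new):
    """Return (added, removed) column names between two DataFrames, sorted."""
    old_columns, new_columns = set(old.columns), set(new.columns)
    return sorted(new_columns - old_columns), sorted(old_columns - new_columns)


# Hypothetical tables: one column was renamed between the two versions.
toy_old = pd.DataFrame({'code_ogr': [1], 'libelle': ['a']})
toy_new = pd.DataFrame({'code_ogr': [1], 'libelle_long': ['a']})

added_columns, removed_columns = describe_column_changes(toy_old, toy_new)
print('added: {}, removed: {}'.format(added_columns, removed_columns))
```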
Now let's see for each file whether there are more or fewer rows.
In [5]:
same_row_count_files = 0
for dataset in rome_data:
    diff = len(dataset.new.index) - len(dataset.old.index)
    if diff > 0:
        print('{:d}/{:d} values added in {}'.format(
            diff, len(dataset.new.index), dataset.basename))
    elif diff < 0:
        print('{:d}/{:d} values removed in {}'.format(
            -diff, len(dataset.old.index), dataset.basename))
    else:
        same_row_count_files += 1
print('{:d}/{:d} files with the same number of rows'.format(
    same_row_count_files, len(rome_data)))
There are some minor changes in many files, except for the arborescence, which seems to have lost most of its content.
Let's look into arborescence more deeply before looking at other files (a brief presentation of the tree structure in this dataset is done in this notebook).
It describes a large tree of concepts grouped into 5 main subtrees defined by the code_type_referentiel field. In Bob we mainly use the branch Métiers accessibles sans diplôme et sans expérience of subtree 7 (code ROME).
In [6]:
arborescence = find_rome_dataset_by_name(rome_data, 'item_arborescence')
old_roots = arborescence.old[arborescence.old.libelle_type_noeud == 'RACINE']
new_roots = arborescence.new[arborescence.new.libelle_type_noeud == 'RACINE']
old = old_roots[['code_type_referentiel', 'libelle_item_arbo']]
new = new_roots[['code_type_referentiel', 'libelle_item_arbo']]
links_merged = old.merge(new, how='outer', indicator=True)
links_merged['_diff'] = links_merged._merge.map({'left_only': 'removed', 'right_only': 'added'})
links_merged.head()
Out[6]:
So three branches of the arborescence were trimmed, while the other two are kept. Subtrees for 'environnement de travail', 'activite' and 'competence' have been cut, while the principal subtree and the ROME subtree have been kept. According to "Pôle emploi", 'environnement de travail' was removed because it was done by hand and largely incomplete. The other two have been renamed 'savoir-faire' and 'savoirs', and moved outside the ROME (not sure where exactly, though).
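The diff above relies on pandas' outer merge with `indicator=True`, which adds a `_merge` column whose values are `left_only`, `right_only` or `both`. Here is a minimal sketch of that pattern on toy data (the column names and values are made up); rows present in both versions end up with a missing `_diff`.

```python
import pandas as pd

# Hypothetical old and new versions of a small table.
toy_old = pd.DataFrame({'code': [1, 2, 3], 'label': ['a', 'b', 'c']})
toy_new = pd.DataFrame({'code': [2, 3, 4], 'label': ['b', 'c', 'd']})

# With no `on` argument, merge joins on all shared columns, so a row only
# counts as unchanged if every field is identical in both versions.
merged = toy_old.merge(toy_new, how='outer', indicator=True)

# Cast the categorical `_merge` column to str so that the mapping yields a
# plain object column, with NaN for the rows flagged 'both'.
merged['_diff'] = merged['_merge'].astype(str).map(
    {'left_only': 'removed', 'right_only': 'added'})

diff_by_code = merged.dropna(subset=['_diff']).set_index('code')['_diff'].to_dict()
print(diff_by_code)
```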
Let's see if there is much change in this file, once those subtrees are removed:
In [7]:
is_old_obsolete = arborescence.old.code_type_referentiel.isin(
    links_merged.code_type_referentiel[links_merged._diff == 'removed'])
old_arborescence_trimmed = arborescence.old[~is_old_obsolete]
obsolete_arborescence_titles = (
    set(old_arborescence_trimmed.libelle_item_arbo) -
    set(arborescence.new.libelle_item_arbo))
new_arborescence_titles = (
    set(arborescence.new.libelle_item_arbo) -
    set(old_arborescence_trimmed.libelle_item_arbo))
print('Titles removed: "{}"'.format('", "'.join(obsolete_arborescence_titles)))
print('Titles added: "{}"'.format('", "'.join(new_arborescence_titles)))
Hmm... Only one obsolete and one new title, and they seem quite similar. Let's try and replace the old one with the new one, and compare the datasets.
In [8]:
arborescence.old.loc[~is_old_obsolete, 'libelle_item_arbo'] = old_arborescence_trimmed.libelle_item_arbo.str.replace(
    'Sécurité et protection santé du BTP',
    'Qualité Sécurité Environnement et protection santé du BTP')
arborescence.old[arborescence.old.libelle_item_arbo == 'Sécurité et protection santé du BTP']
Out[8]:
In [9]:
arborescence.old[~is_old_obsolete].reset_index(drop=True).to_json() == arborescence.new.reset_index(drop=True).to_json()
Out[9]:
Ok, so apart from changing a title, they are actually the same!
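As an aside, comparing JSON dumps is only one way to test equality. A possible alternative, sketched here on toy data and not what the notebook actually ran, is `DataFrame.equals`, which requires the same shape, values and dtypes and treats NaNs at the same positions as equal. As in the comparison above, the index must be reset first because the trimmed table keeps its original row labels.

```python
import pandas as pd

# Hypothetical tables: the old one has an extra row and a non-default index.
toy_old = pd.DataFrame(
    {'code': [1, 2, 3], 'label': ['a', 'b', 'c']}, index=[10, 11, 12])
toy_new = pd.DataFrame({'code': [2, 3], 'label': ['b', 'c']})

# Drop the obsolete row, then realign row labels before comparing.
trimmed_old = toy_old[toy_old.code != 1].reset_index(drop=True)
are_identical = trimmed_old.equals(toy_new.reset_index(drop=True))

# Changing a single value is enough to make the comparison fail.
are_identical_after_edit = trimmed_old.equals(toy_new.assign(label=['b', 'x']))
print(are_identical, are_identical_after_edit)
```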
So this dataset has seen a major change, but it is limited to dropping some branches and we weren't using them anyway, so this does not bother us too much.
Let's take a look at the changes in the other datasets, especially those of interest to us.
The most interesting ones are in referentiel_appellation, item, and liens_rome_referentiels, so let's look at them more closely.
In [10]:
jobs = find_rome_dataset_by_name(rome_data, 'referentiel_appellation')
new_jobs = set(jobs.new.code_ogr) - set(jobs.old.code_ogr)
obsolete_jobs = set(jobs.old.code_ogr) - set(jobs.new.code_ogr)
stable_jobs = set(jobs.new.code_ogr) & set(jobs.old.code_ogr)
matplotlib_venn.venn2((len(obsolete_jobs), len(new_jobs), len(stable_jobs)), (OLD_VERSION, NEW_VERSION));
Alright, so the only change seems to be 31 new jobs added. Let's take a look (only showing interesting fields):
In [11]:
jobs.new[jobs.new.code_ogr.isin(new_jobs)][['code_ogr', 'libelle_appellation_long', 'code_rome']]
Out[11]:
There seem to be a few different domains where new jobs were added, mainly construction security and sport instructors.
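The obsolete/new/stable split used for jobs is a plain set partition on identifiers, and the same split applies to any other table keyed by an ID. A small helper could factor it out; the sketch below uses a hypothetical `partition_ids` function and toy IDs, and orders the counts the way `matplotlib_venn.venn2` expects its subset sizes (only in first set, only in second set, in both).

```python
import collections

IdPartition = collections.namedtuple('IdPartition', ['obsolete', 'new', 'stable'])


def partition_ids(old_ids, new_ids):
    """Split identifiers into those removed, those added, and those kept."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    return IdPartition(
        obsolete=old_ids - new_ids,
        new=new_ids - old_ids,
        stable=old_ids & new_ids)


# Hypothetical identifiers standing in for code_ogr values.
partition = partition_ids([1, 2, 3], [2, 3, 4, 5])

# Sizes in the order (old only, new only, both), usable as venn2 subsets.
venn_subsets = tuple(len(ids) for ids in partition)
print(venn_subsets)
```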
OK, let's check the changes in items:
In [12]:
items = find_rome_dataset_by_name(rome_data, 'item')
new_items = set(items.new.code_ogr) - set(items.old.code_ogr)
obsolete_items = set(items.old.code_ogr) - set(items.new.code_ogr)
stable_items = set(items.new.code_ogr) & set(items.old.code_ogr)
matplotlib_venn.venn2((len(obsolete_items), len(new_items), len(stable_items)), (OLD_VERSION, NEW_VERSION));
As anticipated it is a very minor change (hard to see it visually): some items are now obsolete and new ones have been created. Let's have a look.
In [13]:
items.old[items.old.code_ogr.isin(obsolete_items)].tail()
Out[13]:
In [14]:
items.new[items.new.code_ogr.isin(new_items)].head()
Out[14]:
Those entries look legitimate.
The changes in liens_rome_referentiels include changes for those items, so let's only check the changes not related to those.
In [15]:
links = find_rome_dataset_by_name(rome_data, 'liens_rome_referentiels')
old_links_on_stable_items = links.old[links.old.code_ogr.isin(stable_items)]
new_links_on_stable_items = links.new[links.new.code_ogr.isin(stable_items)]
old = old_links_on_stable_items[['code_rome', 'code_ogr']]
new = new_links_on_stable_items[['code_rome', 'code_ogr']]
links_merged = old.merge(new, how='outer', indicator=True)
links_merged['_diff'] = links_merged._merge.map({'left_only': 'removed', 'right_only': 'added'})
links_merged._diff.value_counts()
Out[15]:
So in addition to the added and removed items, there are 316 fixes. Let's have a look:
In [16]:
job_group_names = find_rome_dataset_by_name(rome_data, 'referentiel_code_rome').old.set_index('code_rome').libelle_rome
item_names = items.new.set_index('code_ogr').libelle.drop_duplicates()
links_merged['job_group_name'] = links_merged.code_rome.map(job_group_names)
links_merged['item_name'] = links_merged.code_ogr.map(item_names)
links_merged[links_merged._diff == 'added'].head()
Out[16]:
Those fixes make sense (not sure why they were not done before, but let's not complain: it is fixed now).
In [17]:
links_merged[links_merged._diff == 'removed'].head()
Out[17]:
Seems alright here too, those skills are not really mandatory for those jobs.
The new version of ROME, v334, introduces very minor changes which reflect quite well what they wrote in their changelog. The transition should be transparent with a very small advantage over the old version.
However, a whole part of the arborescence was dropped, without any explanation or apparent reason. It has no impact on Bob for now, since we don't use it much yet, but we should investigate the matter with Pôle emploi, because we might need it more in the future.