Author: Pascal, pascal@bayesimpact.org
Date: 2016-06-28
In June 2017 a new version of the ROME was released. I want to investigate what changed and whether we need to do anything about it.
You might not be able to reproduce this notebook, mostly because it requires having both versions of the ROME in your data/rome/csv
folder, which happens only just before we switch to v332. You will have to trust me on the results ;-)
Skip the run test because it requires older versions of the ROME.
In [1]:
import collections
import glob
import os
from os import path
import matplotlib_venn
import pandas
rome_path = path.join(os.getenv('DATA_FOLDER'), 'rome/csv')
OLD_VERSION = '331'
NEW_VERSION = '332'
old_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(OLD_VERSION)))
new_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(NEW_VERSION)))
First let's check if there are new or deleted files (only matching by file names).
In [2]:
new_files = new_version_files - frozenset(f.replace(OLD_VERSION, NEW_VERSION) for f in old_version_files)
deleted_files = old_version_files - frozenset(f.replace(NEW_VERSION, OLD_VERSION) for f in new_version_files)
print('{:d} new files'.format(len(new_files)))
print('{:d} deleted files'.format(len(deleted_files)))
So we have the same set of files in both versions: good start.
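The comparison above works by rewriting the version number in each file name before taking a set difference. Here is a minimal sketch of the same idea on made-up file names (the `new_table` file is hypothetical and only there to produce a non-empty result):

```python
# Hypothetical file names, for illustration only.
OLD_VERSION = '331'
NEW_VERSION = '332'
old_files = frozenset([
    'unix_referentiel_code_rome_v331_utf8.csv',
    'unix_rubrique_mobilite_v331_utf8.csv',
])
new_files_on_disk = frozenset([
    'unix_referentiel_code_rome_v332_utf8.csv',
    'unix_new_table_v332_utf8.csv',
])

# Rename each old file to its expected new name, then subtract.
added = new_files_on_disk - frozenset(
    f.replace(OLD_VERSION, NEW_VERSION) for f in old_files)
# Rename each new file to its expected old name, then subtract.
removed = old_files - frozenset(
    f.replace(NEW_VERSION, OLD_VERSION) for f in new_files_on_disk)

print(sorted(added))
print(sorted(removed))
```

Note that `str.replace` rewrites every occurrence of the version string, so this approach assumes that `331` and `332` do not appear anywhere else in the path.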
Now let's set up a dataset that, for each table, links the old file and the new file.
In [3]:
new_to_old = dict((f, f.replace(NEW_VERSION, OLD_VERSION)) for f in new_version_files)
# Load all ROME datasets for the two versions we compare.
VersionedDataset = collections.namedtuple('VersionedDataset', ['basename', 'old', 'new'])
rome_data = [VersionedDataset(
        basename=path.basename(f),
        old=pandas.read_csv(f.replace(NEW_VERSION, OLD_VERSION)),
        new=pandas.read_csv(f))
    for f in sorted(new_version_files)]
def find_rome_dataset_by_name(data, partial_name):
    # Return the dataset whose file name matches the given table name.
    for dataset in data:
        if 'unix_{}_v{}_utf8.csv'.format(partial_name, NEW_VERSION) == dataset.basename:
            return dataset
    raise ValueError('No dataset named {}, the list is\n{}'.format(partial_name, [d.basename for d in data]))
Let's make sure the structure hasn't changed:
In [4]:
for dataset in rome_data:
    if set(dataset.old.columns) != set(dataset.new.columns):
        print('Columns of {} have changed.'.format(dataset.basename))
All files have the same columns as before: still good.
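Had a file changed, the check above would only say that its columns differ. A small sketch of how one could list which columns were added or removed, on toy frames with a hypothetical `extra` column (not ROME data):

```python
import pandas

# Toy frames standing in for one table in its two versions.
old = pandas.DataFrame({'code_rome': ['X0001'], 'libelle': ['a']})
new = pandas.DataFrame({'code_rome': ['X0001'], 'libelle': ['a'], 'extra': [1]})

# Set differences ignore column order, like the check above.
added_columns = sorted(set(new.columns) - set(old.columns))
removed_columns = sorted(set(old.columns) - set(new.columns))
print('added:', added_columns)
print('removed:', removed_columns)
```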
Now let's see whether each file has more or fewer rows.
In [5]:
same_row_count_files = 0
for dataset in rome_data:
    diff = len(dataset.new.index) - len(dataset.old.index)
    if diff > 0:
        print('{:d} values added in {}'.format(diff, dataset.basename))
    elif diff < 0:
        # diff is negative here, so negate it to report a positive count.
        print('{:d} values removed in {}'.format(-diff, dataset.basename))
    else:
        same_row_count_files += 1
print('{:d}/{:d} files with the same number of rows'.format(same_row_count_files, len(rome_data)))
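Row counts only show the net change: a file where one row was added and another removed would be counted as unchanged. One possible way to compare the rows themselves is an outer merge with pandas' `indicator=True` option. This is a sketch on toy data, not something run on the ROME files; the codes and labels below are made up.

```python
import pandas

# Toy frames standing in for one table in its two versions.
old = pandas.DataFrame({
    'code_rome': ['A1101', 'A1201'],
    'libelle': ['first', 'second'],
})
new = pandas.DataFrame({
    'code_rome': ['A1101', 'A1301'],
    'libelle': ['first', 'third'],
})

# Merge on all shared columns; the _merge column tells which side each row comes from.
diff = old.merge(new, how='outer', indicator=True)
added_rows = diff[diff['_merge'] == 'right_only']
removed_rows = diff[diff['_merge'] == 'left_only']
print(len(added_rows), 'added,', len(removed_rows), 'removed')
```

Here both frames have two rows, so a row count comparison would report no change, while the merge shows one row added and one removed.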
One important change is the addition to `referentiel_code_rome`: it might be the reason for all the other changes, as it adds a new job group and all other files would need to propagate that change.
Let's check it out. First let's make sure that no job groups were removed:
In [6]:
job_groups = find_rome_dataset_by_name(rome_data, 'referentiel_code_rome')
obsolete_job_groups = set(job_groups.old.code_rome) - set(job_groups.new.code_rome)
obsolete_job_groups
Out[6]:
Alright, so the only change was the job group added:
In [7]:
new_job_groups_codes = set(job_groups.new.code_rome) - set(job_groups.old.code_rome)
new_job_groups = job_groups.new[job_groups.new.code_rome.isin(new_job_groups_codes)]
new_job_groups
Out[7]:
Let's see if this is a different grouping of existing jobs or if it's entirely new jobs. First let's check the jobs in this new job group.
In [8]:
jobs = find_rome_dataset_by_name(rome_data, 'referentiel_appellation')
jobs.new[jobs.new.code_rome == 'L1510'].head()
Out[8]:
Now let's see if those jobs were already there, and if so what their job groups were:
In [9]:
jobs.old[jobs.old.code_ogr.isin(jobs.new[jobs.new.code_rome == 'L1510'].code_ogr)]
Out[9]:
Alright, it seems that these are entirely new jobs. Just to make sure let's check with a keyword.
In [10]:
jobs.old[jobs.old.libelle_appellation_court.str.contains('Animatrice 2D', case=False)]
Out[10]:
What? Wait a minute! What happened to this job that looks almost exactly like the new one, `Animatrice 2D - films d'animation`?
In [11]:
jobs.new[jobs.new.code_ogr == 10969]
Out[11]:
OK, this one did not move at all. What is this other job group that seems so close to ours?
In [12]:
job_groups.new[job_groups.new.code_rome == 'E1205']
Out[12]:
Ouch, it's indeed quite close and might have fooled more than one jobseeker…
So we have an entirely new job group `L1510`, which stands for `Films d'animation et effets spéciaux`. It's quite close to `E1205` (`Réalisation de contenus multimédias`), and in the past many jobs of the new job group might have defaulted to similar jobs of `E1205`.
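Checking one `code_ogr` at a time does not scale to the whole file. A possible way to list every job whose job group changed between two versions is to merge the two job tables on `code_ogr` and compare `code_rome`. This is only a sketch on invented rows (the `X000n` codes are placeholders), and it assumes `code_ogr` is unique within each version:

```python
import pandas

# Invented job tables: job 2 changes group, job 4 is new.
jobs_old = pandas.DataFrame({
    'code_ogr': [1, 2, 3],
    'code_rome': ['X0001', 'X0001', 'X0002'],
})
jobs_new = pandas.DataFrame({
    'code_ogr': [1, 2, 3, 4],
    'code_rome': ['X0001', 'X0003', 'X0002', 'X0003'],
})

# The default inner merge keeps only jobs present in both versions.
both = jobs_old.merge(jobs_new, on='code_ogr', suffixes=('_old', '_new'))
moved = both[both.code_rome_old != both.code_rome_new]
print(moved)
```

An empty `moved` frame would mean that no existing job was reassigned to another job group.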
Let's now check the impact on the rest of the ROME datasets, especially to identify other changes that might not be related to adding this job group.
Let's first check the ROME mobility (there were 8 new lines):
In [13]:
mobility = find_rome_dataset_by_name(rome_data, 'rubrique_mobilite')
mobility.new[(mobility.new.code_rome == 'L1510') | (mobility.new.code_rome_cible == 'L1510')]
Out[13]:
Cool, we found our 8 new rows, and as expected they link to nearby job groups. We can see that the two job groups `E1104` and `E1205` are especially close, as there is mobility in both directions, to and from the new job group.
In [14]:
job_groups.new[job_groups.new.code_rome.isin(('E1205', 'E1104'))]
Out[14]:
Let's look at the skills related to that new job group:
In [15]:
skills = find_rome_dataset_by_name(rome_data, 'referentiel_competence')
link = find_rome_dataset_by_name(rome_data, 'liens_rome_referentiels')
new_linked_skills = link.new.join(skills.new.set_index('code_ogr'), 'code_ogr')[
    ['code_rome', 'code_ogr', 'libelle_competence', 'libelle_type_competence']]
new_linked_skills[new_linked_skills.code_rome == 'L1510'].dropna()
Out[15]:
Some of the skills already existed (e.g. `Technique de dessin`), others have been added with this release specifically for this job group (e.g. `Logiciel de motion capture`).
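The join used for the skills attaches a label to each link row by matching its `code_ogr` against the index of the skills table. Link rows whose `code_ogr` has no match in the skills table get missing values, which is why `dropna()` is applied afterwards. A small sketch with invented rows (the codes and labels are placeholders, not ROME data):

```python
import pandas

# Invented link table: code_ogr 99 has no entry in the skills table.
link = pandas.DataFrame({
    'code_rome': ['X0001', 'X0001', 'X0002'],
    'code_ogr': [10, 99, 11],
})
skills = pandas.DataFrame({
    'code_ogr': [10, 11],
    'libelle_competence': ['skill ten', 'skill eleven'],
})

# Left join: every link row is kept, skill columns are filled where code_ogr matches.
linked = link.join(skills.set_index('code_ogr'), on='code_ogr')
# Drop the link rows that did not match any skill.
linked_skills = linked.dropna()
print(linked_skills)
```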
OK I think this is enough scrutiny for this new job group. Let's check out the other changes.
In [16]:
new_jobs = set(jobs.new.code_ogr) - set(jobs.old.code_ogr)
# New jobs outside of the new job group, which was already reviewed above.
jobs.new[jobs.new.code_ogr.isin(new_jobs) & (jobs.new.code_rome != 'L1510')]
Out[16]:
Those look legitimate. New jobs are added regularly to ROME and this release is no exception.
What about the skills?
In [17]:
new_skills = set(skills.new.code_ogr) - set(skills.old.code_ogr)
skills_for_new_job_group = new_linked_skills[new_linked_skills.code_rome == 'L1510'].code_ogr
skills.new[skills.new.code_ogr.isin(new_skills) & (~skills.new.code_ogr.isin(skills_for_new_job_group))]
Out[17]:
Those entries look legitimate as well, some new skills have been added.
The new version of ROME, v332, introduces a major change: the addition of a new job group `L1510` - `Films d'animation et effets spéciaux`. It's quite close to `E1205` and `E1104`. There are also very minor changes, as in each ROME release.
This reflects quite well what they wrote in their changelog (although at the time I am writing this notebook, their website is down).
So before switching to v332, we should examine what it would mean for users who would land in the new job group.