Author: Pascal, pascal@bayesimpact.org
Date: 2018-06-18
In June 2018 a new version of the ROME was released. I want to investigate what changed and whether we need to do anything about it.
One change that we noted during the preparation of this notebook was that the typo Conductrcie instead of Conductrice has finally been fixed!
You might not be able to reproduce this notebook, mostly because it requires both versions of the ROME in your data/rome/csv folder, which only happens just before we switch to v335. You will have to trust me on the results ;-)
Skip the run test because it requires older versions of the ROME.
In [1]:
import collections
import glob
import os
from os import path
import matplotlib_venn
import pandas as pd
rome_path = path.join(os.getenv('DATA_FOLDER'), 'rome/csv')
OLD_VERSION = '334'
NEW_VERSION = '335'
old_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(OLD_VERSION)))
new_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(NEW_VERSION)))
First let's check if there are new or deleted files (only matching by file names).
In [2]:
new_files = new_version_files - frozenset(f.replace(OLD_VERSION, NEW_VERSION) for f in old_version_files)
deleted_files = old_version_files - frozenset(f.replace(NEW_VERSION, OLD_VERSION) for f in new_version_files)
print('{:d} new files'.format(len(new_files)))
print('{:d} deleted files'.format(len(deleted_files)))
So we have the same set of files in both versions: good start.
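To illustrate the file-name matching used above, here is a minimal sketch on made-up file names (the names below are hypothetical and nothing is read from disk). It rewrites the version number in each name so that the two sets become comparable.

```python
OLD_VERSION = '334'
NEW_VERSION = '335'

# Hypothetical file names, for illustration only.
old_files = frozenset({
    'unix_item_v334_utf8.csv',
    'unix_referentiel_appellation_v334_utf8.csv',
})
new_files_on_disk = frozenset({
    'unix_item_v335_utf8.csv',
    'unix_referentiel_appellation_v335_utf8.csv',
    'unix_some_new_table_v335_utf8.csv',
})

# Rename old files to the new version: whatever is left over is new.
added = new_files_on_disk - frozenset(
    f.replace(OLD_VERSION, NEW_VERSION) for f in old_files)
# Rename new files to the old version: whatever is left over was deleted.
deleted = old_files - frozenset(
    f.replace(NEW_VERSION, OLD_VERSION) for f in new_files_on_disk)

print(sorted(added))
print(sorted(deleted))
```

Note that str.replace rewrites every occurrence of the version string, so this matching assumes that the version number appears only once in each path.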
Now let's set up a dataset that, for each table, links both the old and the new file together.
In [3]:
# Load all ROME datasets for the two versions we compare.
VersionedDataset = collections.namedtuple('VersionedDataset', ['basename', 'old', 'new'])
rome_data = [VersionedDataset(
        basename=path.basename(f),
        old=pd.read_csv(f.replace(NEW_VERSION, OLD_VERSION)),
        new=pd.read_csv(f))
    for f in sorted(new_version_files)]
def find_rome_dataset_by_name(data, partial_name):
    for dataset in data:
        if 'unix_{}_v{}_utf8.csv'.format(partial_name, NEW_VERSION) == dataset.basename:
            return dataset
    raise ValueError('No dataset named {}, the list is\n{}'.format(partial_name, [d.basename for d in data]))
Let's make sure the structure hasn't changed:
In [4]:
for dataset in rome_data:
    if set(dataset.old.columns) != set(dataset.new.columns):
        print('Columns of {} have changed.'.format(dataset.basename))
All files have the same columns as before: still good.
Now let's see for each file if there are more or fewer rows.
In [5]:
same_row_count_files = 0
for dataset in rome_data:
    diff = len(dataset.new.index) - len(dataset.old.index)
    if diff > 0:
        print('{:d}/{:d} values added in {}'.format(
            diff, len(dataset.new.index), dataset.basename))
    elif diff < 0:
        print('{:d}/{:d} values removed in {}'.format(
            -diff, len(dataset.old.index), dataset.basename))
    else:
        same_row_count_files += 1
print('{:d}/{:d} files with the same number of rows'.format(
    same_row_count_files, len(rome_data)))
There are some minor changes in many files, but based on my knowledge of ROME, none in the main files.
The most interesting ones are in referentiel_appellation, item, and liens_rome_referentiels, so let's see more precisely.
In [6]:
jobs = find_rome_dataset_by_name(rome_data, 'referentiel_appellation')
new_jobs = set(jobs.new.code_ogr) - set(jobs.old.code_ogr)
obsolete_jobs = set(jobs.old.code_ogr) - set(jobs.new.code_ogr)
stable_jobs = set(jobs.new.code_ogr) & set(jobs.old.code_ogr)
matplotlib_venn.venn2((len(obsolete_jobs), len(new_jobs), len(stable_jobs)), (OLD_VERSION, NEW_VERSION));
Alright, so the only change seems to be 15 new jobs added. Let's take a look (only showing interesting fields):
In [7]:
pd.options.display.max_colwidth = 2000
jobs.new[jobs.new.code_ogr.isin(new_jobs)][['code_ogr', 'libelle_appellation_long', 'code_rome']]
Out[7]:
They mostly seem to be new jobs: DPO, ecology engineers, e-cigarette retailer.
OK, let's look at the changes in items:
In [8]:
items = find_rome_dataset_by_name(rome_data, 'item')
new_items = set(items.new.code_ogr) - set(items.old.code_ogr)
obsolete_items = set(items.old.code_ogr) - set(items.new.code_ogr)
stable_items = set(items.new.code_ogr) & set(items.old.code_ogr)
matplotlib_venn.venn2((len(obsolete_items), len(new_items), len(stable_items)), (OLD_VERSION, NEW_VERSION));
As anticipated it is a very minor change (hard to see it visually): some items are now obsolete and new ones have been created. Let's have a look at them.
In [9]:
display(items.old[items.old.code_ogr.isin(obsolete_items)].head())
items.new[items.new.code_ogr.isin(new_items)].head()
Out[9]:
The new ones seem legit to me. The old ones, though, don't feel obsolete. I'm going to trust the ROME makers as it's a small change anyway.
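The same three-way split (obsolete, new, stable) is computed for both jobs and items above. As a sketch, it could be factored into a small helper; the names SetDiff and split_ids are hypothetical and not part of the original notebook, and the IDs are made up.

```python
import collections

SetDiff = collections.namedtuple('SetDiff', ['obsolete', 'new', 'stable'])


def split_ids(old_ids, new_ids):
    """Split two collections of IDs into obsolete, new and stable sets."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    return SetDiff(
        obsolete=old_ids - new_ids,
        new=new_ids - old_ids,
        stable=old_ids & new_ids)


# Made-up IDs, not actual ROME codes.
diff = split_ids(old_ids=[1, 2, 3, 4], new_ids=[3, 4, 5])
print(diff)

# The sizes, in this order, match what venn2 takes as its subsets argument:
# (only in the first set, only in the second set, in both).
sizes = tuple(len(s) for s in diff)
print(sizes)
```

With such a helper, the sizes tuple could be passed directly to matplotlib_venn.venn2 along with the two version labels.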
The changes in liens_rome_referentiels include changes for those items, so let's only check the changes not related to those.
In [10]:
links = find_rome_dataset_by_name(rome_data, 'liens_rome_referentiels')
old_links_on_stable_items = links.old[links.old.code_ogr.isin(stable_items)]
new_links_on_stable_items = links.new[links.new.code_ogr.isin(stable_items)]
old = old_links_on_stable_items[['code_rome', 'code_ogr']]
new = new_links_on_stable_items[['code_rome', 'code_ogr']]
links_merged = old.merge(new, how='outer', indicator=True)
links_merged['_diff'] = links_merged._merge.map({'left_only': 'removed', 'right_only': 'added'})
links_merged._diff.value_counts()
Out[10]:
So in addition to the added and removed items, there are a few fixes. Let's have a look at them:
In [11]:
job_group_names = find_rome_dataset_by_name(rome_data, 'referentiel_code_rome').new.set_index('code_rome').libelle_rome
item_names = items.new.set_index('code_ogr').libelle.drop_duplicates()
links_merged['job_group_name'] = links_merged.code_rome.map(job_group_names)
links_merged['item_name'] = links_merged.code_ogr.map(item_names)
display(links_merged[links_merged._diff == 'removed'].dropna().head(5))
links_merged[links_merged._diff == 'added'].dropna().head(5)
Out[11]:
Those fixes make sense (not sure why they were not done before, but let's not complain: it is fixed now).
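The outer merge with indicator=True used above to diff the links can be illustrated on toy data. This is a minimal sketch with made-up codes, assuming the same two columns as in the links table.

```python
import pandas as pd

# Made-up links between job groups and items (not actual ROME data).
old = pd.DataFrame({
    'code_rome': ['A1101', 'A1101', 'B1301'],
    'code_ogr': [100, 101, 102],
})
new = pd.DataFrame({
    'code_rome': ['A1101', 'B1301', 'B1301'],
    'code_ogr': [100, 102, 103],
})

# With no `on` argument, merge joins on all the common columns, here both.
merged = old.merge(new, how='outer', indicator=True)
# Rows present on both sides map to NaN, so they drop out of the counts.
merged['_diff'] = merged['_merge'].map(
    {'left_only': 'removed', 'right_only': 'added'})

counts = merged['_diff'].value_counts()
print(counts)
```

Rows found in both versions get the value 'both' in the _merge column; since the mapping has no entry for it, they end up as missing values, which value_counts ignores by default.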
Finally let's check the changes in the main table referentiel_code_rome to make sure nothing big changed. First let's join the old and the new table in one dataset:
In [12]:
code_rome = find_rome_dataset_by_name(rome_data, 'referentiel_code_rome')
code_rome_diff = pd.merge(code_rome.old, code_rome.new, on='code_rome', suffixes=('_old', '_new'))
code_rome_diff.head()
Out[12]:
And now let's see the differences:
In [13]:
code_rome_diff[
    (code_rome_diff.code_fiche_em_old != code_rome_diff.code_fiche_em_new) |
    (code_rome_diff.code_ogr_old != code_rome_diff.code_ogr_new) |
    (code_rome_diff.libelle_rome_old != code_rome_diff.libelle_rome_new) |
    (code_rome_diff.statut_old != code_rome_diff.statut_new)
]
Out[13]:
OK, only 5 lines have changed, and only in their names. Jobs about construction are now also about landscapes, and two others were slightly rephrased. Not a big deal for us.
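One caveat with this kind of column-by-column comparison: in pandas a missing value never compares equal to another missing value, so a row whose field is empty in both versions would be reported as changed. It does not seem to matter here, as only a handful of lines show up, but a NaN-safe variant could look like the sketch below. The data is made up and the differs helper is hypothetical.

```python
import pandas as pd

# Made-up rows (not actual ROME data); the second one has no code_fiche_em
# in either version.
old = pd.DataFrame({
    'code_rome': ['A1101', 'A1201', 'F1101'],
    'code_fiche_em': ['A11', None, 'F11'],
    'libelle_rome': ['Old name', 'Same name', 'Other name'],
})
new = pd.DataFrame({
    'code_rome': ['A1101', 'A1201', 'F1101'],
    'code_fiche_em': ['A11', None, 'F11'],
    'libelle_rome': ['New name', 'Same name', 'Other name'],
})
diff = pd.merge(old, new, on='code_rome', suffixes=('_old', '_new'))


def differs(left, right):
    """Element-wise inequality that treats two missing values as equal."""
    return (left != right) & ~(left.isna() & right.isna())


# Plain comparison: also flags the row that is empty on both sides.
naive = diff[
    (diff.code_fiche_em_old != diff.code_fiche_em_new) |
    (diff.libelle_rome_old != diff.libelle_rome_new)]
# NaN-safe comparison: only flags the row whose name actually changed.
nan_safe = diff[
    differs(diff.code_fiche_em_old, diff.code_fiche_em_new) |
    differs(diff.libelle_rome_old, diff.libelle_rome_new)]

print(naive.code_rome.tolist())
print(nan_safe.code_rome.tolist())
```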
The new version of ROME, v335, introduces very minor changes which reflect quite well what they wrote in their changelog. The transition should be transparent with a very small advantage over the old version.