Author: Pascal, pascal@bayesimpact.org
Date: 2016-03-27
In March 2017 a new version of the ROME was released. I want to investigate what changed and whether we need to do anything about it.
You might not be able to reproduce this notebook, mostly because it requires having both versions of the ROME in your data/rome/csv
folder, which happens only just before we switch to v331. You will have to trust me on the results ;-)
Skip the run test because it requires older versions of the ROME.
In [1]:
import collections
import glob
import os
from os import path
import matplotlib_venn
import pandas
rome_path = path.join(os.getenv('DATA_FOLDER'), 'rome/csv')
OLD_VERSION = '330'
NEW_VERSION = '331'
old_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(OLD_VERSION)))
new_version_files = frozenset(glob.glob(rome_path + '/*{}*'.format(NEW_VERSION)))
First let's check if there are new or deleted files (only matching by file names).
In [2]:
new_files = new_version_files - frozenset(f.replace(OLD_VERSION, NEW_VERSION) for f in old_version_files)
deleted_files = old_version_files - frozenset(f.replace(NEW_VERSION, OLD_VERSION) for f in new_version_files)
print('{:d} new files'.format(len(new_files)))
print('{:d} deleted files'.format(len(deleted_files)))
So we have the same set of files: good start.
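The file-name matching above can be illustrated on a toy list of names. This is only a sketch: the helper `diff_versioned_files` and the `foo`/`bar` file names are made up for illustration, and only the `unix_..._v..._utf8.csv` naming pattern comes from this notebook.

```python
OLD_VERSION = '330'
NEW_VERSION = '331'


def diff_versioned_files(old_files, new_files):
    """Return (new, deleted) file names once version numbers are aligned."""
    # Rename each old file as if it belonged to the new version, and vice versa.
    old_as_new = frozenset(f.replace(OLD_VERSION, NEW_VERSION) for f in old_files)
    new_as_old = frozenset(f.replace(NEW_VERSION, OLD_VERSION) for f in new_files)
    return frozenset(new_files) - old_as_new, frozenset(old_files) - new_as_old


# Hypothetical file names: 'foo' only exists in the old version, 'bar' only in the new one.
old_files = ['unix_item_v330_utf8.csv', 'unix_foo_v330_utf8.csv']
new_files = ['unix_item_v331_utf8.csv', 'unix_bar_v331_utf8.csv']
added, deleted = diff_versioned_files(old_files, new_files)
print(sorted(added), sorted(deleted))
```

Note that `str.replace` rewrites every occurrence of the version string, so a path that contains `330` somewhere other than the version suffix would be rewritten there too.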
Now let's set up a dataset that, for each table, links the old file and the new file.
In [3]:
new_to_old = dict((f, f.replace(NEW_VERSION, OLD_VERSION)) for f in new_version_files)
# Load all datasets.
Dataset = collections.namedtuple('Dataset', ['basename', 'old', 'new'])
data = [Dataset(
        basename=path.basename(f),
        old=pandas.read_csv(f.replace(NEW_VERSION, OLD_VERSION)),
        new=pandas.read_csv(f))
    for f in sorted(new_version_files)]


def find_dataset_by_name(data, partial_name):
    for dataset in data:
        if 'unix_{}_v{}_utf8.csv'.format(partial_name, NEW_VERSION) == dataset.basename:
            return dataset
    raise ValueError('No dataset named {}, the list is\n{}'.format(partial_name, [d.basename for d in data]))
Let's make sure the structure hasn't changed:
In [4]:
for dataset in data:
    if set(dataset.old.columns) != set(dataset.new.columns):
        print('Columns of {} have changed.'.format(dataset.basename))
All files have the same columns as before: still good.
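Had a file failed this check, the next question would be which columns changed. A minimal sketch of that follow-up, assuming a hypothetical helper `column_changes` and made-up column names `old_col` and `new_col` (not part of the ROME files):

```python
def column_changes(old_columns, new_columns):
    """Return (added, removed) column names as sorted lists."""
    old_set, new_set = set(old_columns), set(new_columns)
    return sorted(new_set - old_set), sorted(old_set - new_set)


# Toy column lists; with the notebook data one would pass dataset.old.columns
# and dataset.new.columns instead.
added_columns, removed_columns = column_changes(
    ['code_ogr', 'libelle', 'old_col'],
    ['code_ogr', 'libelle', 'new_col'])
print(added_columns, removed_columns)
```

Like the check above, this compares sets, so a mere reordering of columns is not reported as a change.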
In [5]:
untouched = 0
for dataset in data:
    diff = len(dataset.new.index) - len(dataset.old.index)
    if diff > 0:
        print('{:d} values added in {}'.format(diff, dataset.basename))
    elif diff < 0:
        # Report the number of removed rows as a positive count.
        print('{:d} values removed in {}'.format(-diff, dataset.basename))
    else:
        untouched += 1
print('{:d}/{:d} files with the same number of rows'.format(untouched, len(data)))
So we have minor changes in 5 files. At one point we cared about referentiel_activite and referentiel_activite_riasec but have no concrete application for now.
The only interesting ones are in item and liens_rome_referentiels, so let's look at them more closely.
In [6]:
items = find_dataset_by_name(data, 'item')
new_items = set(items.new.code_ogr) - set(items.old.code_ogr)
obsolete_items = set(items.old.code_ogr) - set(items.new.code_ogr)
stable_items = set(items.new.code_ogr) & set(items.old.code_ogr)
matplotlib_venn.venn2((len(obsolete_items), len(new_items), len(stable_items)), (OLD_VERSION, NEW_VERSION));
As anticipated it is a very minor change: some items are now obsolete and new ones have been created. Let's have a look.
In [7]:
items.old[items.old.code_ogr.isin(obsolete_items)].tail()
Out[7]:
In [8]:
items.new[items.new.code_ogr.isin(new_items)].head()
Out[8]:
Those entries look legitimate.
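The split of code_ogr values into obsolete, new and stable items used above can be sketched with plain sets. The integer codes below are invented for illustration. The tuple follows the order that `venn2` expects for its subset sizes: left only, right only, intersection.

```python
# Toy identifiers standing in for code_ogr values (not real ROME codes).
old_codes = {1, 2, 3, 4}
new_codes = {3, 4, 5}

obsolete = old_codes - new_codes  # only in the old version
created = new_codes - old_codes   # only in the new version
stable = old_codes & new_codes    # present in both versions

# Sizes in the order (old only, new only, both).
venn_subsets = (len(obsolete), len(created), len(stable))
print(venn_subsets)
```

A quick sanity check on such a split is that the obsolete and stable counts add up to the size of the old set, and the created and stable counts to the size of the new one.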
The changes in liens_rome_referentiels include changes for those items, so let's only check the changes not related to those.
In [9]:
links = find_dataset_by_name(data, 'liens_rome_referentiels')
old_links_on_stable_items = links.old[links.old.code_ogr.isin(stable_items)]
new_links_on_stable_items = links.new[links.new.code_ogr.isin(stable_items)]
old = old_links_on_stable_items[['code_rome', 'code_ogr']]
new = new_links_on_stable_items[['code_rome', 'code_ogr']]
links_merged = old.merge(new, how='outer', indicator=True)
links_merged['_diff'] = links_merged._merge.map({'left_only': 'removed', 'right_only': 'added'})
links_merged._diff.value_counts()
Out[9]:
So in addition to the added and removed items, there are 52 fixes. Let's have a look:
In [10]:
job_group_names = find_dataset_by_name(data, 'referentiel_code_rome').old.set_index('code_rome').libelle_rome
item_names = items.new.set_index('code_ogr').libelle.drop_duplicates()
links_merged['job_group_name'] = links_merged.code_rome.map(job_group_names)
links_merged['item_name'] = links_merged.code_ogr.map(item_names)
links_merged.dropna().head(10)
Out[10]:
Those fixes make sense (not sure why they were not done before, but let's not complain: it is fixed now).
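For readers unfamiliar with the outer merge with `indicator=True` used above to classify links, here is a small sketch on invented data. The job group codes and item codes are placeholders, not real ROME entries. The `_merge` column is converted to plain strings before mapping, which is a choice made for this sketch rather than something the notebook does.

```python
import pandas

# Toy link tables: one link disappears and one link appears between versions.
old = pandas.DataFrame({'code_rome': ['X0001', 'X0001', 'X0002'], 'code_ogr': [1, 2, 3]})
new = pandas.DataFrame({'code_rome': ['X0001', 'X0002', 'X0002'], 'code_ogr': [1, 3, 4]})

# The outer merge keeps every (code_rome, code_ogr) pair from both tables;
# indicator=True adds a _merge column telling which side each pair came from.
merged = old.merge(new, how='outer', indicator=True)

# Pairs present on both sides map to NaN and are ignored by value_counts.
merged['_diff'] = merged['_merge'].astype(str).map(
    {'left_only': 'removed', 'right_only': 'added'})
diff_counts = merged['_diff'].value_counts().to_dict()
print(diff_counts)
```

Because neither `on` nor `left_on`/`right_on` is given, the merge uses all columns the two frames share, here both `code_rome` and `code_ogr`, so a link counts as unchanged only when the whole pair matches.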
The new version of ROME, v331, introduces very minor changes which reflect quite well what they wrote in their changelog. The transition should be transparent, with a very small advantage for the new version.