Author: Valentin Lehuger
Skip the run test because the ROME version has to be updated to make it work in the exported repository. TODO: Update ROME and remove the skiptest flag.
This notebook explains how the mapping between BMO and ROME works and how to interpret the different job category identifiers.
ROME is a job classification created by the French employment agency, Pôle emploi, and BMO is a labour market study published by a statistics agency.
In [1]:
import codecs
import os
import pandas as pd
import seaborn as sns
data_path = '../../../data'
In [2]:
bmo_df = pd.read_csv(os.path.join(data_path, 'bmo/bmo_2015.csv'))
bmo_df.sample(frac=0.0001)
Out[2]:
In [3]:
# Select useful columns of codes and names
bmo_df = bmo_df[[u'PROFESSION_FAMILY_CODE', u'PROFESSION_FAMILY_NAME', u'FAP_CODE', u'FAP_NAME']]
bmo_df = bmo_df.sort_values(['PROFESSION_FAMILY_CODE', 'FAP_CODE'])
# Create the profession_family/FAP code correspondence df
FAP_profession_family = bmo_df[[u'PROFESSION_FAMILY_CODE', u'FAP_CODE']].drop_duplicates()
# Create the code/name correspondence dfs
profession_family_correspondance = bmo_df[[u'PROFESSION_FAMILY_CODE', u'PROFESSION_FAMILY_NAME']].drop_duplicates()
FAP_correspondance = bmo_df[[u'FAP_CODE', u'FAP_NAME']].drop_duplicates().sort_values([u'FAP_CODE'])
This document (http://travail-emploi.gouv.fr/IMG/pdf/FAP-2009_Introduction_et_table_de_correspondance.pdf) gives a very good explanation of how the FAP codes are built.
The first character is the professional field (A = Agriculture, marine, fishing; B = Civil engineering; C = Electricity, electronics; etc.).
The second and third characters subdivide the professional fields into 87 FAP categories.
The fourth character indicates the qualification level (0 = undefined, then 2 = unskilled worker up to 9 = engineer and manager).
The fifth character subdivides the professional families into 225 more specific categories.
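The structure described above can be sketched as a small helper that splits a FAP code into its parts. This is only an illustration: `FapCode` and `split_fap_code` are hypothetical names that do not exist in this notebook, and the sketch assumes every detailed FAP code is exactly 5 characters long.

```python
from typing import NamedTuple


class FapCode(NamedTuple):
    """Parts of a 5-character FAP code (hypothetical helper, not part of the notebook)."""
    professional_field: str   # 1st character, e.g. "U"
    family: str               # first 3 characters, one of the 87 grouped categories
    qualification_level: str  # 4th character, "0" = undefined, "2" to "9" otherwise
    detailed_code: str        # all 5 characters, one of the 225 detailed categories


def split_fap_code(code: str) -> FapCode:
    """Split a detailed FAP code such as "U1Z80" into its components."""
    code = code.strip().upper()
    if len(code) != 5:
        raise ValueError("expected a 5-character FAP code, got {!r}".format(code))
    return FapCode(
        professional_field=code[0],
        family=code[:3],
        qualification_level=code[3],
        detailed_code=code,
    )


print(split_fap_code("U1Z80"))
```

For "U1Z80" (professional entertainers), the field is "U", the grouped category is "U1Z" and the qualification level is "8".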
In [4]:
rome_df = pd.read_csv(os.path.join(data_path, 'rome/csv/unix_referentiel_appellation_v332_utf8.csv'))
# Select useful columns of codes and names
rome_df = rome_df[['code_ogr', 'libelle_appellation_court', 'code_rome']]
rome_df.columns = [u'OGR_CODE', u'ROME_PROFESSION_SHORT_NAME', u'ROME_PROFESSION_CARD_CODE']
rome_df = rome_df[[u'OGR_CODE', u'ROME_PROFESSION_SHORT_NAME', u'ROME_PROFESSION_CARD_CODE']].drop_duplicates().sort_values([u'OGR_CODE', u'ROME_PROFESSION_CARD_CODE'])
In [5]:
print("{} unique ROME codes.".format(len(rome_df.ROME_PROFESSION_CARD_CODE.unique())))
In [6]:
rome_df[rome_df.ROME_PROFESSION_CARD_CODE == "L1503"]
Out[6]:
In [7]:
def parse_faprome_file(filename):
    with codecs.open(filename, 'r', 'latin-1') as txtfile:
        table = pd.DataFrame([x.replace('"', '').split("=") for x in txtfile.readlines() if x.startswith('"')])
    return table
bmo_rome = parse_faprome_file(os.path.join(data_path, 'crosswalks/passage_fap2009_romev3.txt'))
# Column 0 holds a comma-separated list of ROME codes, column 1 the FAP code.
bmo_rome[0] = bmo_rome.apply(lambda x: [s.strip() for s in x[0].split(',')], axis=1)
bmo_rome[1] = bmo_rome.apply(lambda x: x[1].replace('\n', '').replace('\r', '').replace('\t', '').strip(), axis=1)
bmo_rome.columns = [u"ROME", u"FAP"]
# Expand each list of ROME codes into one row per (FAP, ROME) pair.
s = bmo_rome.ROME.apply(pd.Series).stack()
s.index = s.index.droplevel(-1)
s.name = u"ROME"
bmo_rome = bmo_rome[[u"FAP"]].join(s)
bmo_rome_entire = bmo_rome
# Keep only the 5-character values, i.e. complete ROME codes.
bmo_rome = bmo_rome[bmo_rome.ROME.str.len() == 5]
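The parsing steps in this cell can be tried without the crosswalk file. The sketch below runs the same logic on a few made-up lines and uses `DataFrame.explode` as an alternative to the `apply(pd.Series).stack()` idiom. The sample lines only imitate the layout that the parsing code implies (quoted ROME codes, an equals sign, a quoted FAP code); the pairings and the partial code are illustrative, not taken from the real file.

```python
import pandas as pd

# Made-up lines imitating the assumed layout: "ROME","ROME" = "FAP"
sample_lines = [
    'A header line that does not start with a quote\n',
    '"L1503","L1504" = "U1Z80"\n',
    '"J1102" = "V2Z90"\n',
    '"K13" = "X0Z00"\n',  # hypothetical partial ROME code, dropped below
]

rows = []
for line in sample_lines:
    if not line.startswith('"'):
        continue  # skip anything that is not a mapping line
    rome_part, fap_part = line.replace('"', '').split('=')
    rows.append({
        'FAP': fap_part.strip(),
        'ROME': [code.strip() for code in rome_part.split(',')],
    })

# One row per (FAP, ROME) pair, then keep only 5-character ROME codes.
crosswalk = pd.DataFrame(rows).explode('ROME')
crosswalk = crosswalk[crosswalk.ROME.str.len() == 5].reset_index(drop=True)
print(crosswalk)
```

`explode` needs pandas 0.25 or later, which may be one reason the notebook uses the older idiom.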
In [8]:
print("{} unique ROME codes.".format(len(bmo_rome.ROME.unique())))
In [9]:
bmo_rome.head()
Out[9]:
In [10]:
A = pd.merge(bmo_rome, rome_df, left_on="ROME", right_on=u"ROME_PROFESSION_CARD_CODE")[["FAP", "ROME", "OGR_CODE", "ROME_PROFESSION_SHORT_NAME"]]
In [11]:
bmo_rome_merged = pd.merge(A, bmo_df, left_on="FAP", right_on="FAP_CODE").drop_duplicates()[["FAP", "ROME", "OGR_CODE", "ROME_PROFESSION_SHORT_NAME", "PROFESSION_FAMILY_CODE", "PROFESSION_FAMILY_NAME", "FAP_NAME"]]
bmo_rome_merged.head()
Out[11]:
In [12]:
bmo_rome_merged.sample(frac=0.01).head()
Out[12]:
There are four kinds of codes to describe jobs in the ROME and BMO datasets:
Identifiers created by Pôle emploi: ROME and OGR_CODE.
Identifiers created by DARES (statistics agency): FAP and PROFESSION_FAMILY_CODE.
From the largest to the smallest groups, we get: PROFESSION_FAMILY_CODE > FAP > ROME_CODE > OGR_CODE.
In the ROME classification, the OGR_CODE is the most precise job identifier (for example: props, pyrotechnist or marketing director).
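One way to check such a hierarchy is to verify that every value at a finer level belongs to exactly one value at the coarser level. The sketch below does this on a tiny made-up table: the column names follow `bmo_rome_merged`, but the rows and the helper `is_nested` are hypothetical. Whether the real merged data passes every check is not established here.

```python
import pandas as pd

# Made-up rows; only the column names follow the notebook.
toy = pd.DataFrame({
    'PROFESSION_FAMILY_CODE': ['S', 'S', 'S', 'S'],
    'FAP': ['U1Z80', 'U1Z80', 'U1Z80', 'V2Z90'],
    'ROME': ['L1503', 'L1503', 'L1302', 'J1102'],
    'OGR_CODE': [101, 102, 103, 104],
})


def is_nested(df, child, parent):
    """Return True if every `child` value maps to exactly one `parent` value."""
    return bool((df.groupby(child)[parent].nunique() == 1).all())


# Check each link of PROFESSION_FAMILY_CODE > FAP > ROME > OGR_CODE.
for child, parent in [('OGR_CODE', 'ROME'), ('ROME', 'FAP'), ('FAP', 'PROFESSION_FAMILY_CODE')]:
    print("{} nested in {}: {}".format(child, parent, is_nested(toy, child, parent)))
```

Running the same helper on `bmo_rome_merged` would show which links of the chain are strict in the actual data, for instance whether a ROME code can appear under more than one FAP.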
In [13]:
rome_df[rome_df.ROME_PROFESSION_CARD_CODE == "L1503"].head()
Out[13]:
A ROME_PROFESSION_CARD_CODE groups the OGR_CODE values of very similar jobs in one field at the same hierarchical level (for example, in the entertainment field: props, pyrotechnist and steward share the same ROME_PROFESSION_CARD_CODE).
In [14]:
bmo_rome_merged[bmo_rome_merged.FAP == "U1Z80"].sample(frac=0.05)
Out[14]:
The FAP code is a larger group of jobs that can include multiple ROME_PROFESSION_CARD_CODE values in the same field at different hierarchical levels (for example: props, pyrotechnist and steward are grouped with production manager and ballet director).
In [15]:
bmo_rome_merged[bmo_rome_merged.PROFESSION_FAMILY_CODE == "C"].head()
Out[15]:
PROFESSION_FAMILY_CODE is the broadest of all the identifiers. Each PROFESSION_FAMILY_CODE includes many FAP codes. There are 7 classes of jobs, such as administrative jobs, social and medical jobs, etc.
In [16]:
unique_fap_rome_couples = bmo_rome_merged[["FAP", "ROME"]].drop_duplicates()
rome_by_fap_count = unique_fap_rome_couples.groupby("FAP")["ROME"].count()
rome_by_fap_count.hist(bins=rome_by_fap_count.max())
print("mean : {0:.4f}".format(rome_by_fap_count.mean()))
print("standard deviation : {0:.4f}".format(rome_by_fap_count.std()))
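# Divide by the actual number of FAP codes rather than a hard-coded total.
print("{0:.2f}% of FAP codes contain fewer than 5 ROME codes.".format(rome_by_fap_count[rome_by_fap_count < 5].count() / float(len(rome_by_fap_count)) * 100))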
The FAP code seems to be a fairly fine-grained level of job grouping: about two thirds of FAP codes contain only one or two ROME codes.
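The shares quoted here are proportions of FAP codes whose ROME count falls under a threshold. As a minimal sketch on made-up pairs (the FAP and ROME labels below are placeholders, not real codes), the mean of a boolean mask gives such a proportion directly, with no need for a fixed denominator:

```python
import pandas as pd

# Placeholder (FAP, ROME) pairs: FAP "C" has 5 ROME codes, the others 1 or 2.
toy_pairs = pd.DataFrame({
    'FAP': ['A', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'D'],
    'ROME': ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9'],
})

# Number of distinct ROME codes per FAP code.
counts = toy_pairs.drop_duplicates().groupby('FAP')['ROME'].nunique()

# The mean of a boolean Series is the fraction of True values.
share_under_5 = (counts < 5).mean() * 100
share_one_or_two = (counts <= 2).mean() * 100

print("{0:.2f}% of FAP codes contain fewer than 5 ROME codes.".format(share_under_5))
print("{0:.2f}% of FAP codes contain one or two ROME codes.".format(share_one_or_two))
```

The same two expressions could be applied to `rome_by_fap_count` to recompute the figures discussed in this notebook.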
In [17]:
FAP_correspondance[FAP_correspondance.FAP_CODE.isin(rome_by_fap_count[rome_by_fap_count < 5].index)].sample(frac=0.3)
Out[17]:
The designations of FAP codes with fewer than 5 ROME codes are very specific, for example doctors (V2Z90), dentists (V2Z91), pharmacists (V2Z93) and telemarketers (R1Z67).
In [18]:
FAP_correspondance[FAP_correspondance.FAP_CODE.isin(rome_by_fap_count[rome_by_fap_count >= 5].index)]
Out[18]:
The FAP codes with 5 or more ROME codes designate much broader groups of jobs, such as other paramedical professions (V3Z80) or professional entertainers (U1Z80).
Although BMO and ROME are created by two different agencies, the FAP and ROME codes seem to be well mapped. Each FAP contains one or more ROME codes. Most FAP categories are very specific: 90% contain fewer than 5 ROME codes.