Author: Pascal, pascal@bayesimpact.org
Updated: 2018-06-14
BMO stands for "Besoin en Main d'Oeuvre" (workforce needs): a yearly survey, run by calling employers all over France, that measures how many hires they plan. See the official website for more details.
In Bob Emploi we want to display BMO data that is useful to the user. One way to provide context is to show the values over several years. Before doing that, we want to make sure that no user would be surprised by the data for their specific case.
In [1]:
import glob
import os
from os import path
import re

import pandas
import seaborn as sns

sns.set()

# Load the yearly BMO files (bmo/bmo_<year>…) into a dict keyed by year.
bmo_file_names = glob.glob(path.join(os.getenv('DATA_FOLDER'), 'bmo/bmo_*'))
bmo_df_dict = {}
for bmo_file_name in sorted(bmo_file_names):
    df = pandas.read_csv(bmo_file_name, dtype={'DEPARTEMENT_CODE': str})
    # The year is the only number in the file name.
    year = int(re.search(r'\d+', bmo_file_name).group())
    df['year'] = year
    bmo_df_dict[year] = df
sorted(bmo_df_dict.keys())
Out[1]:
Check that all the CSV files have the same structure.
In [2]:
def assert_all_the_same(dataframes, select_data_func, name='Values'):
    """Print the diff of select_data_func's result between the first key and each later one."""
    first = None
    for key, df in sorted(dataframes.items(), key=lambda kv: kv[0]):
        if first is None:
            first = select_data_func(df)
            continue
        other = select_data_func(df)
        if first - other:
            print('{} removed in {}:\n{}'.format(name, key, sorted(first - other)))
        if other - first:
            print('{} added in {}:\n{}'.format(name, key, sorted(other - first)))


assert_all_the_same(bmo_df_dict, lambda df: set(df.columns), name='Columns')
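As a side note, to make the helper's output format concrete, here is a toy illustration (hypothetical frames, not BMO data):
In [ ]:
# Toy illustration: relative to the first (smallest) key, the helper
# reports column 'b' as removed in 2001 and column 'c' as added.
toy = {
    2000: pandas.DataFrame(columns=['a', 'b']),
    2001: pandas.DataFrame(columns=['a', 'c']),
}
assert_all_the_same(toy, lambda df: set(df.columns), name='Columns')
Back to the real check above on the BMO files: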
Hmm, this is fishy: why would the BMO be organized by ROME? Let's check the values:
In [3]:
bmo_df_dict[2014].ROME_PROFESSION_CARD_CODE.unique()
Out[3]:
OK, those are actually FAP codes, so we only need to rename some columns.
In [4]:
for df in bmo_df_dict.values():
    df.rename(columns={
        'ROME_PROFESSION_CARD_CODE': 'FAP_CODE',
        'ROME_PROFESSION_CARD_NAME': 'FAP_NAME',
    }, inplace=True)

assert_all_the_same(bmo_df_dict, lambda df: set(df.columns), name='Columns')
Let's check the column types:
In [5]:
for column in bmo_df_dict[2017].columns:
    assert_all_the_same(
        bmo_df_dict,
        lambda df: {df.dtypes[column]},
        name='Column Type for {}'.format(column))
Ouch, those numbers should all be floats, not objects. Let's check some values:
In [6]:
bmo_df_dict[2014].NB_RECRUT_PROJECTS.head()
Out[6]:
In [7]:
# Numbers use the French format: comma as decimal separator, space as
# thousands separator. The '-' and '*' placeholder values are treated as 0.
for field in ('NB_RECRUT_PROJECTS', 'NB_DIFF_RECRUT_PROJECTS', 'NB_SEASON_RECRUT_PROJECTS'):
    for bmo_df in bmo_df_dict.values():
        bmo_df[field] = bmo_df[field].astype(str)\
            .str.replace(',', '.')\
            .str.replace(' ', '')\
            .replace('-', '0')\
            .replace('*', '0')\
            .astype(float)
bmo_df_dict[2014].NB_RECRUT_PROJECTS.head()
Out[7]:
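Just to be safe, here is a minimal sanity-check sketch verifying that the conversion did not silently produce NaNs or negative values:
In [ ]:
# After the cleanup, every value should be a non-null, non-negative float;
# a NaN here would mean a placeholder we missed.
for field in ('NB_RECRUT_PROJECTS', 'NB_DIFF_RECRUT_PROJECTS', 'NB_SEASON_RECRUT_PROJECTS'):
    for bmo_df in bmo_df_dict.values():
        assert bmo_df[field].notnull().all(), field
        assert (bmo_df[field] >= 0).all(), field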
After this cleanup, let's check again whether the column types are consistent.
In [8]:
for column in bmo_df_dict[2017].columns:
    assert_all_the_same(
        bmo_df_dict,
        lambda df: {df.dtypes[column]},
        name='Column Type for {}'.format(column))
Cool, now column types are the same for each year.
Let's check whether we have the same values in key fields, and do some spot checks that were known to fail in previous versions:
In [9]:
assert_all_the_same(bmo_df_dict, lambda df: set(df.FAP_CODE.tolist()), name='FAP Codes')
assert_all_the_same(bmo_df_dict, lambda df: set(df.DEPARTEMENT_CODE.tolist()), name='Departement Codes')

fap_codes = set(bmo_df_dict[2016].FAP_CODE.tolist())
if 'A0Z40' not in fap_codes:
    print('FAP A0Z40 is missing.')

departement_codes = set(bmo_df_dict[2016].DEPARTEMENT_CODE.tolist())
if '2A' not in departement_codes:
    print('Département 2A is missing.')
if '01' not in departement_codes:
    print('Département 01 is missing.')
So it seems that some département codes are not set correctly (the leading zeros look like they were dropped). And there's a new FAP code in 2018.
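Before fixing anything, here is a quick hypothetical diagnostic, assuming the issue is leading zeros dropped somewhere upstream:
In [ ]:
# Any département code shorter than 2 characters is suspicious: French
# codes are '01'…'95', '2A', '2B' or 3-digit DOM codes.
{year: sorted(code for code in df.DEPARTEMENT_CODE.dropna().unique() if len(code) < 2)
 for year, df in bmo_df_dict.items()}
Now let's pad the codes back to at least 2 characters: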
In [10]:
for df in bmo_df_dict.values():
    # Restore dropped leading zeros: '1' becomes '01', '2A' stays unchanged.
    df['DEPARTEMENT_CODE'] = df.DEPARTEMENT_CODE.str.pad(2, fillchar='0')

assert_all_the_same(bmo_df_dict, lambda df: set(df['DEPARTEMENT_CODE'].tolist()), name='Departement Codes')

departement_codes = set(bmo_df_dict[2016].DEPARTEMENT_CODE.tolist())
assert '2A' in departement_codes
assert '01' in departement_codes
OK, we've fixed the départements.
What about the extra FAP code?
In [11]:
bmo_2018 = bmo_df_dict[2018]
v5z_counts = bmo_2018[bmo_2018.FAP_CODE.str.startswith('V5')]\
    .groupby('FAP_CODE').NB_RECRUT_PROJECTS.sum().to_frame()
v5z_counts['name'] = bmo_2018[['FAP_CODE', 'FAP_NAME']]\
    .drop_duplicates().set_index('FAP_CODE').FAP_NAME
v5z_counts
Out[11]:
Hmm, OK, this does not seem to be a bug, just a rare type of job: very few hires are planned for it in the whole of France.
All seems good now: the DataFrames have the same columns and the same sets of values in the important columns.
Let's compare global stats.
In [12]:
bmo_df = pandas.concat(bmo_df_dict[year] for year in sorted(bmo_df_dict.keys()))
In [13]:
bmo_df.groupby(['year']).count()
Out[13]:
In [14]:
bmo_df.groupby(['year']).sum()
Out[14]:
When BMO 2017 was first released in early 2017, it had the same values as BMO 2016. This is obviously not the case for BMO 2018, so we can use the data.
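To guard against a repeat of that 2017 incident, here is a minimal sketch of an automated check, assuming identical totals are a good proxy for a duplicated release:
In [ ]:
# Consecutive years should not have exactly the same total of hiring
# projects; the early-2017 duplicate release would have tripped this.
totals = bmo_df.groupby('year').NB_RECRUT_PROJECTS.sum()
for previous_year, year in zip(totals.index, totals.index[1:]):
    assert totals[year] != totals[previous_year], \
        'BMO {} has the same total as BMO {}'.format(year, previous_year)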
Let's get an idea of how the BMO is evolving locally.
Let's try merging the data per local market. First, let's check whether we can merge at the "bassin d'emploi" (catchment area) level:
In [15]:
assert_all_the_same(bmo_df_dict, lambda df: set(df.CATCHMENT_AREA_CODE.tolist()), name='Catchment Area Codes')
Oops, no: it looks like the subdivision of départements into catchment areas has changed over the years, so we will first aggregate the data at the département level:
In [16]:
number_columns = ['NB_DIFF_RECRUT_PROJECTS', 'NB_RECRUT_PROJECTS', 'NB_SEASON_RECRUT_PROJECTS']
# Group on every descriptive column except the catchment area ones.
columns = list(set(bmo_df.columns) - set(number_columns) - {'CATCHMENT_AREA_CODE', 'CATCHMENT_AREA_NAME'})
bmo_df_by_departement = bmo_df.groupby(columns).sum()[number_columns].reset_index()
bmo_df_by_departement.head()
Out[16]:
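As a quick sanity check (a sketch, assuming the grouping columns contain no NaN, since groupby drops such rows), the aggregation should preserve the grand total:
In [ ]:
import numpy

# Aggregating to the département level should not change the total
# number of hiring projects.
assert numpy.isclose(
    bmo_df_by_departement.NB_RECRUT_PROJECTS.sum(),
    bmo_df.NB_RECRUT_PROJECTS.sum())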
OK, now we can merge data per local market:
In [17]:
bmo_evolution = bmo_df_by_departement[bmo_df_by_departement.year == 2018]\
    .merge(
        bmo_df_by_departement[bmo_df_by_departement.year == 2017],
        on=['DEPARTEMENT_CODE', 'FAP_CODE'], how='outer', suffixes=['_2018', '_2017'])\
    .merge(
        bmo_df_by_departement[bmo_df_by_departement.year == 2016],
        on=['DEPARTEMENT_CODE', 'FAP_CODE'], how='outer', suffixes=['', '_2016'])\
    .merge(
        bmo_df_by_departement[bmo_df_by_departement.year == 2015],
        on=['DEPARTEMENT_CODE', 'FAP_CODE'], how='outer', suffixes=['', '_2015'])\
    .merge(
        bmo_df_by_departement[bmo_df_by_departement.year == 2014],
        on=['DEPARTEMENT_CODE', 'FAP_CODE'], how='outer', suffixes=['', '_2014'])
In [18]:
bmo_per_departement = bmo_evolution.groupby(['DEPARTEMENT_CODE', 'DEPARTEMENT_NAME']).sum()
# Year-over-year ratio of hiring projects, then expressed as a percentage
# of change: e.g. a ratio of 1.1 becomes +10%.
bmo_per_departement['evolution_2018'] = bmo_per_departement.NB_RECRUT_PROJECTS_2018.div(
    bmo_per_departement.NB_RECRUT_PROJECTS_2017)
bmo_per_departement['percent_evolution_2018'] = (bmo_per_departement.evolution_2018 - 1) * 100
bmo_per_departement.plot(kind='scatter', x='NB_RECRUT_PROJECTS_2018', y='percent_evolution_2018');
There are a few outliers. Let's first check the départements with the most hiring, to understand the outliers on the right:
In [19]:
dimensions_2018 = ['percent_evolution_2018', 'NB_RECRUT_PROJECTS_2018']
bmo_per_departement.sort_values('NB_RECRUT_PROJECTS_2018', ascending=False)[dimensions_2018].head(8)
Out[19]:
Obviously Paris is way bigger than the rest but still shows a very strong evolution (+18.6%). The others all contain large cities and metropolitan areas, which corroborates the France Stratégie report.
Now let's check the extreme changes:
In [20]:
extreme_changes_2018 = bmo_per_departement\
    .sort_values('percent_evolution_2018', ascending=False)[dimensions_2018]
display(extreme_changes_2018.head())
extreme_changes_2018.tail()
Out[20]:
Nice: a lot of départements about 1h30 from Paris by train are growing a lot (although they might simply have had a bad year in 2017). At the bottom, we find some DOM départements again. It is a bit depressing to see that an already bad situation is getting even worse. :-(
Show the distribution of hiring growth and volumes for job families:
In [21]:
bmo_per_fap = bmo_evolution.groupby(['FAP_CODE', 'FAP_NAME']).sum()
bmo_per_fap['evolution_2018'] = bmo_per_fap.NB_RECRUT_PROJECTS_2018.div(bmo_per_fap.NB_RECRUT_PROJECTS_2017)
bmo_per_fap['percent_evolution_2018'] = (bmo_per_fap.evolution_2018 - 1) * 100
bmo_per_fap.plot(kind='scatter', x='NB_RECRUT_PROJECTS_2018', y='percent_evolution_2018');
A first conclusion is that only job families with little hiring show huge swings. Still, let's look at the job family that got +150%:
In [22]:
bmo_per_fap.sort_values('percent_evolution_2018', ascending=False)[dimensions_2018].head()
Out[22]:
So "domotics" is a real hype in 2018…
Let's check the distribution again, focusing only on job families with more than 20k hiring projects in the year:
In [23]:
bmo_per_large_fap = bmo_per_fap[bmo_per_fap.NB_RECRUT_PROJECTS_2018 >= 20000]
bmo_per_large_fap.plot(kind='scatter', x='NB_RECRUT_PROJECTS_2018', y='percent_evolution_2018');
That looks reasonable; let's check the top growth:
In [24]:
bmo_per_large_fap.sort_values('percent_evolution_2018', ascending=False)[dimensions_2018].head()
Out[24]:
It looks like drivers are not a thing of the past yet…
So the BMO 2018 data looks pretty good, and after a quick clean-up we were able to compare it with the previous years (2014, 2015, 2016 & 2017).
Some high-level insights can be drawn from comparing these datasets. For instance, hiring is in really good shape in Paris and in the main metropolitan areas. Hiring is also going really well for transportation drivers as well as, ironically, self-service shops.
Possible next steps: