Author: Marie Laure, marielaure@bayesimpact.org
The IMT dataset provides regional statistics about different jobs. The salary information is collected annually by the French National Institute of Statistics and Economic Studies (INSEE). Job groups are described using the PCS (Classification of Professions and Socioprofessional Categories) classification. Previously, we retrieved IMT data by scraping the IMT website. As an exploratory step, we are interested in the sanity of the API based data and identifying putative additional information provided only by the API.
A crosswalk between PCS and ROME/FAP can be found in data/crosswalks/passage_pcs_fap2009_romev3.csv
.
The dataset can be obtained with the following command, note that it may take some time to download:
docker-compose run --rm data-analysis-prepare make data/imt/salaries.csv
Loading and General View First let's load the dataset:
In [1]:
import os
from os import path
import matplotlib
import pandas as pd
import seaborn as _
DATA_FOLDER = os.getenv('DATA_FOLDER')
salaries = pd.read_csv(path.join(DATA_FOLDER, 'imt/salaries.csv'), dtype={'AREA_CODE': 'str'})
salaries.head()
Out[1]:
Nice! We have the basic useful columns area, job groups but also data on the salaries and the age of the recruitee (a proxy for experience I guess). According to the documentation 90% of the employees earn more than the MINIMUM_SALARY
, while only 10% earn more than the MAXIMUM_SALARY
.
Concerning the age groups, how many categories do we have and what are there?
In [2]:
salaries.AGE_GROUP_NAME.unique()
Out[2]:
Only two… Maybe it is enough.
Let's do a basic check on the areas.
First what about the granularity?
In [3]:
salaries.AREA_TYPE_NAME.unique()
Out[3]:
Ok! We have 3 different levels, whole country, regions and departments.
Is everyone there? Let's start by the departments!
In [4]:
salaries[salaries.AREA_TYPE_CODE == 'D'].AREA_CODE.sort_values().unique().size
Out[4]:
We could expect 101 (metropolitan departments) or 103 (metropolitan departments + overseas collectivities). Here Mayotte is missing.
What about the regions?
In [5]:
salaries[salaries.AREA_TYPE_CODE == 'R'].AREA_CODE.sort_values().unique().size
Out[5]:
Oh… We expected 18! Again, Mayotte is missing.
The first overview revealed some negative salaries. The documentation states that when they are not enough data, the value is -1 while when the data in unavailable it is marked as -2. Let's have a basic description of the salaries.
How many missing or uninformative salary data do we have? We start with the minimum salary
In [6]:
len(salaries[salaries.MINIMUM_SALARY < 0]) / len(salaries) * 100
Out[6]:
Around 13% of the job groups for a given area (market) don't have salary data! That is a bit more than nothing!
When salaries are lesser than 0, are the minimum salary and the maximum salaries always the same?
In [7]:
invalid_rows = salaries[(salaries.MAXIMUM_SALARY < 0) | (salaries.MINIMUM_SALARY < 0)]
all(invalid_rows.MINIMUM_SALARY == invalid_rows.MAXIMUM_SALARY)
Out[7]:
Yeahh!! They are exactly the same… How convenient!
So let's get a basic overview of the salaries.
In [8]:
valid_salaries = salaries[salaries.MAXIMUM_SALARY > 0]
valid_salaries[['MAXIMUM_SALARY', 'MINIMUM_SALARY']].describe()
Out[8]:
Because the minimum MINIMUM_SALARY
is lower than the french minimum wage (~1400), we think that these data gather both full-time and part-time offers. It can be scary to deliver as such to our users…
Anyway, no weird or missing values here.
It can be an overkill but we'll see if the maximum salary is always greater or equal to the minimum salary.
In [9]:
all(salaries.MAXIMUM_SALARY >= salaries.MINIMUM_SALARY)
Out[9]:
Great!
Last but not least. Do we cover every job groups?
In [10]:
salaries.PCS_PROFESSION_CODE.unique().size
Out[10]:
According to INSEE documentation, we should expect around 500 job groups. ~85% of them are covered.
But, then how many of these job groups have valid data?
In [11]:
valid_salaries.PCS_PROFESSION_CODE.unique().size
Out[11]:
Yeahh!! All of them!!
Currently, in Bob, we are mostly using ROME classification. Then, we are interested in the number of ROME job groups covered by this dataset.
First, we need to download the mapping between PCS and ROME classifications.
In [12]:
pcs_to_rome = pd.read_csv(path.join(DATA_FOLDER, 'crosswalks/passage_pcs_romev3.csv'))
pcs_to_rome.head()
Out[12]:
Quite concise, isn't it!
In [13]:
pcs_to_rome[pcs_to_rome['PCS'].isin(salaries.PCS_PROFESSION_CODE.unique())]\
.ROME.unique().size
Out[13]:
Impressive! We have a ~97% coverage for ROME job groups.
What about the granularity of this coverage?
Coverage at the regions level.
In [14]:
region_professions = salaries[salaries.AREA_TYPE_CODE == 'R']\
.PCS_PROFESSION_CODE.unique()
pcs_to_rome[pcs_to_rome['PCS']\
.isin(region_professions)]\
.ROME.unique().size
Out[14]:
Exactly the same…
Let's have a look at the ROME job groups coverage at the department level.
In [15]:
department_professions = salaries[salaries.AREA_TYPE_CODE == 'D']\
.PCS_PROFESSION_CODE.unique()
pcs_to_rome[pcs_to_rome['PCS']\
.isin(department_professions)]\
.ROME.unique().size
Out[15]:
Again, no difference.
Everything is going well so far!
Actually, we have multiple source of data for salaries: the IMT and the FHS (more or less Pôle Emploi statistics history). The FHS dataset provides jobseekers salary expectancies. A notebook has been written before to investigate the distribution of these expected salaries.
An analysis of the IMT salary data has been done before.
The main conclusions of this notebook were:
How variable are the salaries within departments?
In [16]:
salaries['mean_senior_salary'] = salaries[['MINIMUM_SALARY', 'MAXIMUM_SALARY']].sum(axis=1).div(2)
valid_salaries = salaries[salaries.MAXIMUM_SALARY > 0]
stats_within_departments = valid_salaries[valid_salaries.AREA_TYPE_CODE == 'D']\
.groupby('AREA_NAME')\
.mean_senior_salary.agg({'mean', 'std'})\
.sort_values('mean', ascending=False)
stats_within_departments.plot(kind='box');
Within a department, ~30% of the jobs propose a salary greater than 4200€ or lesser than 1800€.
How variable are the salaries within job groups?
In [17]:
stats_within_jobgroups = valid_salaries[valid_salaries.AREA_TYPE_CODE == 'D'].groupby('PCS_PROFESSION_CODE')\
.mean_senior_salary.agg({'mean', 'std'})\
.sort_values('std', ascending=False)
stats_within_jobgroups.plot(kind='box');
As expected, within a job group, the dispersion is lesser than within a department (standard deviation most of the time lesser than 1000€).
Still, why not looking at some examples of highly variable job goups?
In [18]:
valid_salaries[
(valid_salaries.AREA_TYPE_CODE == 'D') &
(valid_salaries.PCS_PROFESSION_CODE.isin(stats_within_jobgroups.index))]\
.drop_duplicates().PCS_PROFESSION_NAME.to_frame().head(5)\
.style.set_properties( **{'width': '500px'})
Out[18]:
Sales persons are the ones with the most highly variable salaries. That seems sensible!
What about the conformity of API data with scraped data?
According to the website (on the 2nd of October 2017), a nurse in the Isère department, younger than 35 years old, could expect a salary between 1850€ and 4050€.
Note that ROME code for nurse is "J1502" which corresponds to 6 PCS classifications (431a, 431b, 431c, 431d, 431f and 431g).
In [19]:
nurse_pcs = ['431a', '431b', '431c', '431d', '431f', '431g']
valid_salaries[(valid_salaries.AREA_NAME == "ISERE") \
& (valid_salaries.PCS_PROFESSION_CODE.isin(nurse_pcs) \
& (valid_salaries.AGE_GROUP_NAME == 'Moins de 35 ans'))] \
[['MAXIMUM_SALARY', 'MINIMUM_SALARY', 'PCS_PROFESSION_CODE', 'PCS_PROFESSION_NAME']]
Out[19]:
Hummm.. So the website, seems to be using the minimum of the minimum salary and the maximum of the maximum salary. Smart!