Author: Marie Laure, marielaure@bayesimpact.org

IMT Salaries from API

The IMT dataset provides regional statistics about different jobs. The salary information is collected annually by the French National Institute of Statistics and Economic Studies (INSEE). Job groups are described using the PCS (Classification of Professions and Socioprofessional Categories) classification. Previously, we retrieved IMT data by scraping the IMT website. As an exploratory step, we are interested in the sanity of the API based data and identifying putative additional information provided only by the API.

A crosswalk between PCS and ROME/FAP can be found in data/crosswalks/passage_pcs_fap2009_romev3.csv.

The dataset can be obtained with the following command, note that it may take some time to download:
docker-compose run --rm data-analysis-prepare make data/imt/salaries.csv

Data Sanity

Loading and General View First let's load the dataset:


In [1]:
import os
from os import path

import matplotlib
import pandas as pd
import seaborn as _

DATA_FOLDER = os.getenv('DATA_FOLDER')

salaries = pd.read_csv(path.join(DATA_FOLDER, 'imt/salaries.csv'), dtype={'AREA_CODE': 'str'})
salaries.head()


Out[1]:
AGE_GROUP_CODE AGE_GROUP_NAME AREA_CODE AREA_NAME AREA_TYPE_CODE AREA_TYPE_NAME MAXIMUM_SALARY MINIMUM_SALARY PCS_PROFESSION_CODE PCS_PROFESSION_NAME
0 1 Moins de 35 ans 04 ALPES-DE-HAUTE-PROVENCE D Département 3600 2300 463b Techniciens commerciaux et technico-commerciau...
1 2 35 ans et plus 04 ALPES-DE-HAUTE-PROVENCE D Département 4500 2350 463b Techniciens commerciaux et technico-commerciau...
2 1 Moins de 35 ans 04 ALPES-DE-HAUTE-PROVENCE D Département 3400 2100 463c Techniciens commerciaux et technico-commerciau...
3 2 35 ans et plus 04 ALPES-DE-HAUTE-PROVENCE D Département 5800 1700 463c Techniciens commerciaux et technico-commerciau...
4 1 Moins de 35 ans 04 ALPES-DE-HAUTE-PROVENCE D Département -2 -2 463d Techniciens commerciaux et technico-commerciau...

Nice! We have the basic useful columns area, job groups but also data on the salaries and the age of the recruitee (a proxy for experience I guess). According to the documentation 90% of the employees earn more than the MINIMUM_SALARY, while only 10% earn more than the MAXIMUM_SALARY.

Concerning the age groups, how many categories do we have and what are there?


In [2]:
salaries.AGE_GROUP_NAME.unique()


Out[2]:
array(['Moins de 35 ans', '35 ans et plus'], dtype=object)

Only two… Maybe it is enough.

Let's do a basic check on the areas.

First what about the granularity?


In [3]:
salaries.AREA_TYPE_NAME.unique()


Out[3]:
array(['Département', 'Région', 'France entière'], dtype=object)

Ok! We have 3 different levels, whole country, regions and departments.

Is everyone there? Let's start by the departments!


In [4]:
salaries[salaries.AREA_TYPE_CODE == 'D'].AREA_CODE.sort_values().unique().size


Out[4]:
100

We could expect 101 (metropolitan departments) or 103 (metropolitan departments + overseas collectivities). Here Mayotte is missing.

What about the regions?


In [5]:
salaries[salaries.AREA_TYPE_CODE == 'R'].AREA_CODE.sort_values().unique().size


Out[5]:
17

Oh… We expected 18! Again, Mayotte is missing.

The first overview revealed some negative salaries. The documentation states that when they are not enough data, the value is -1 while when the data in unavailable it is marked as -2. Let's have a basic description of the salaries.

How many missing or uninformative salary data do we have? We start with the minimum salary


In [6]:
len(salaries[salaries.MINIMUM_SALARY < 0]) / len(salaries) * 100


Out[6]:
13.08211288464703

Around 13% of the job groups for a given area (market) don't have salary data! That is a bit more than nothing!

When salaries are lesser than 0, are the minimum salary and the maximum salaries always the same?


In [7]:
invalid_rows = salaries[(salaries.MAXIMUM_SALARY < 0) | (salaries.MINIMUM_SALARY < 0)]
all(invalid_rows.MINIMUM_SALARY == invalid_rows.MAXIMUM_SALARY)


Out[7]:
True

Yeahh!! They are exactly the same… How convenient!

So let's get a basic overview of the salaries.


In [8]:
valid_salaries = salaries[salaries.MAXIMUM_SALARY > 0]
valid_salaries[['MAXIMUM_SALARY', 'MINIMUM_SALARY']].describe()


Out[8]:
MAXIMUM_SALARY MINIMUM_SALARY
count 84512.000000 84512.000000
mean 3979.875047 2068.592034
std 2086.121357 616.621855
min 1000.000000 700.000000
25% 2650.000000 1650.000000
50% 3300.000000 1850.000000
75% 4550.000000 2250.000000
max 41600.000000 6500.000000

Because the minimum MINIMUM_SALARY is lower than the french minimum wage (~1400), we think that these data gather both full-time and part-time offers. It can be scary to deliver as such to our users…
Anyway, no weird or missing values here.

It can be an overkill but we'll see if the maximum salary is always greater or equal to the minimum salary.


In [9]:
all(salaries.MAXIMUM_SALARY >= salaries.MINIMUM_SALARY)


Out[9]:
True

Great!

Last but not least. Do we cover every job groups?


In [10]:
salaries.PCS_PROFESSION_CODE.unique().size


Out[10]:
412

According to INSEE documentation, we should expect around 500 job groups. ~85% of them are covered.

But, then how many of these job groups have valid data?


In [11]:
valid_salaries.PCS_PROFESSION_CODE.unique().size


Out[11]:
412

Yeahh!! All of them!!

Currently, in Bob, we are mostly using ROME classification. Then, we are interested in the number of ROME job groups covered by this dataset.
First, we need to download the mapping between PCS and ROME classifications.


In [12]:
pcs_to_rome = pd.read_csv(path.join(DATA_FOLDER, 'crosswalks/passage_pcs_romev3.csv'))
pcs_to_rome.head()


Out[12]:
ROME PCS
0 A1101 691a
1 A1201 533b
2 A1201 691f
3 A1202 631a
4 A1203 631a

Quite concise, isn't it!


In [13]:
pcs_to_rome[pcs_to_rome['PCS'].isin(salaries.PCS_PROFESSION_CODE.unique())]\
    .ROME.unique().size


Out[13]:
515

Impressive! We have a ~97% coverage for ROME job groups.

What about the granularity of this coverage?
Coverage at the regions level.


In [14]:
region_professions = salaries[salaries.AREA_TYPE_CODE == 'R']\
    .PCS_PROFESSION_CODE.unique()
pcs_to_rome[pcs_to_rome['PCS']\
    .isin(region_professions)]\
    .ROME.unique().size


Out[14]:
515

Exactly the same…

Let's have a look at the ROME job groups coverage at the department level.


In [15]:
department_professions = salaries[salaries.AREA_TYPE_CODE == 'D']\
    .PCS_PROFESSION_CODE.unique()
pcs_to_rome[pcs_to_rome['PCS']\
    .isin(department_professions)]\
    .ROME.unique().size


Out[15]:
515

Again, no difference.

Everything is going well so far!

Global Overview and Comparison with Scraped Data

Actually, we have multiple source of data for salaries: the IMT and the FHS (more or less Pôle Emploi statistics history). The FHS dataset provides jobseekers salary expectancies. A notebook has been written before to investigate the distribution of these expected salaries.

An analysis of the IMT salary data has been done before.
The main conclusions of this notebook were:

  • for a given job group, the salary was quite consistent on the french territory (you'll earn the almost the same if you are a deliverer in Lyon or in Paris).
  • there is a high variation in salaries within a single department.
    Does it still stands?

How variable are the salaries within departments?


In [16]:
salaries['mean_senior_salary'] = salaries[['MINIMUM_SALARY', 'MAXIMUM_SALARY']].sum(axis=1).div(2)
valid_salaries = salaries[salaries.MAXIMUM_SALARY > 0]
stats_within_departments = valid_salaries[valid_salaries.AREA_TYPE_CODE == 'D']\
    .groupby('AREA_NAME')\
    .mean_senior_salary.agg({'mean', 'std'})\
    .sort_values('mean', ascending=False)
    
stats_within_departments.plot(kind='box');

Within a department, ~30% of the jobs propose a salary greater than 4200€ or lesser than 1800€.

How variable are the salaries within job groups?


In [17]:
stats_within_jobgroups = valid_salaries[valid_salaries.AREA_TYPE_CODE == 'D'].groupby('PCS_PROFESSION_CODE')\
    .mean_senior_salary.agg({'mean', 'std'})\
    .sort_values('std', ascending=False)

stats_within_jobgroups.plot(kind='box');

As expected, within a job group, the dispersion is lesser than within a department (standard deviation most of the time lesser than 1000€).

Still, why not looking at some examples of highly variable job goups?


In [18]:
valid_salaries[
    (valid_salaries.AREA_TYPE_CODE == 'D') &
    (valid_salaries.PCS_PROFESSION_CODE.isin(stats_within_jobgroups.index))]\
    .drop_duplicates().PCS_PROFESSION_NAME.to_frame().head(5)\
    .style.set_properties( **{'width': '500px'})


Out[18]:
PCS_PROFESSION_NAME
0 Techniciens commerciaux et technico-commerciaux, représentants en biens d'équipement, en biens intermédiaires, commerce interindustriel (hors informatique)
1 Techniciens commerciaux et technico-commerciaux, représentants en biens d'équipement, en biens intermédiaires, commerce interindustriel (hors informatique)
2 Techniciens commerciaux et technico-commerciaux, représentants en biens de consommation auprès d'entreprises
3 Techniciens commerciaux et technico-commerciaux, représentants en biens de consommation auprès d'entreprises
5 Techniciens commerciaux et technico-commerciaux, représentants en services auprès d'entreprises ou de professionnels (hors banque, assurance, informatique)

Sales persons are the ones with the most highly variable salaries. That seems sensible!

What about the conformity of API data with scraped data?
According to the website (on the 2nd of October 2017), a nurse in the Isère department, younger than 35 years old, could expect a salary between 1850€ and 4050€. Note that ROME code for nurse is "J1502" which corresponds to 6 PCS classifications (431a, 431b, 431c, 431d, 431f and 431g).


In [19]:
nurse_pcs = ['431a', '431b', '431c', '431d', '431f', '431g']
valid_salaries[(valid_salaries.AREA_NAME == "ISERE") \
    & (valid_salaries.PCS_PROFESSION_CODE.isin(nurse_pcs) \
    & (valid_salaries.AGE_GROUP_NAME == 'Moins de 35 ans'))] \
    [['MAXIMUM_SALARY', 'MINIMUM_SALARY', 'PCS_PROFESSION_CODE', 'PCS_PROFESSION_NAME']]


Out[19]:
MAXIMUM_SALARY MINIMUM_SALARY PCS_PROFESSION_CODE PCS_PROFESSION_NAME
24574 3050 1850 431a Cadres infirmiers et assimilés
24578 2950 1900 431c Puéricultrices
24580 4050 2100 431d Infirmiers spécialisés (autres qu'infirmiers p...
24584 2850 1950 431f Infirmiers en soins généraux

Hummm.. So the website, seems to be using the minimum of the minimum salary and the maximum of the maximum salary. Smart!

Conclusion

  • The dataset is clean.
  • Missing and uninformative data are clearly documented.
  • We have now data at multiple granularity levels.
  • Global overview of the data confirms the observations done before.