Author: Marie Laure, marielaure@bayesimpact.org

IMT Salaries from API

The IMT dataset provides regional statistics about different jobs. The salary information is collected annually by the French National Institute of Statistics and Economic Studies (INSEE). Job groups are described using the PCS (Classification of Professions and Socioprofessional Categories) classification. Previously, we retrieved IMT data by scraping the IMT website. As an exploratory step, we are interested in the sanity of the API based data and identifying putative additional information provided only by the API.

A crosswalk between PCS and ROME/FAP can be found in data/crosswalks/passage_pcs_fap2009_romev3.csv.

The dataset can be obtained with the following command, note that it may take some time to download:
docker-compose run --rm data-analysis-prepare make data/imt/salaries.csv

Data Sanity

Loading and General View First let's load the dataset:



In [1]:

    
import os
from os import path

import matplotlib
import pandas as pd
import seaborn as _

DATA_FOLDER = os.getenv('DATA_FOLDER')

salaries = pd.read_csv(path.join(DATA_FOLDER, 'imt/salaries.csv'), dtype={'AREA_CODE': 'str'})
salaries.head()









    Out[1]:






  
    
      
      AGE_GROUP_CODE
      AGE_GROUP_NAME
      AREA_CODE
      AREA_NAME
      AREA_TYPE_CODE
      AREA_TYPE_NAME
      MAXIMUM_SALARY
      MINIMUM_SALARY
      PCS_PROFESSION_CODE
      PCS_PROFESSION_NAME
    
  
  
    
      0
      1
      Moins de 35 ans
      04
      ALPES-DE-HAUTE-PROVENCE
      D
      Département
      3600
      2300
      463b
      Techniciens commerciaux et technico-commerciau...
    
    
      1
      2
      35 ans et plus
      04
      ALPES-DE-HAUTE-PROVENCE
      D
      Département
      4500
      2350
      463b
      Techniciens commerciaux et technico-commerciau...
    
    
      2
      1
      Moins de 35 ans
      04
      ALPES-DE-HAUTE-PROVENCE
      D
      Département
      3400
      2100
      463c
      Techniciens commerciaux et technico-commerciau...
    
    
      3
      2
      35 ans et plus
      04
      ALPES-DE-HAUTE-PROVENCE
      D
      Département
      5800
      1700
      463c
      Techniciens commerciaux et technico-commerciau...
    
    
      4
      1
      Moins de 35 ans
      04
      ALPES-DE-HAUTE-PROVENCE
      D
      Département
      -2
      -2
      463d
      Techniciens commerciaux et technico-commerciau...

Nice! We have the basic useful columns area, job groups but also data on the salaries and the age of the recruitee (a proxy for experience I guess). According to the documentation 90% of the employees earn more than the MINIMUM_SALARY, while only 10% earn more than the MAXIMUM_SALARY.

Concerning the age groups, how many categories do we have and what are there?



In [2]:

    
salaries.AGE_GROUP_NAME.unique()









    Out[2]:





array(['Moins de 35 ans', '35 ans et plus'], dtype=object)

Only two… Maybe it is enough.

Let's do a basic check on the areas.

First what about the granularity?



In [3]:

    
salaries.AREA_TYPE_NAME.unique()









    Out[3]:





array(['Département', 'Région', 'France entière'], dtype=object)

Ok! We have 3 different levels, whole country, regions and departments.

Is everyone there? Let's start by the departments!



In [4]:

    
salaries[salaries.AREA_TYPE_CODE == 'D'].AREA_CODE.sort_values().unique().size









    Out[4]:





100

We could expect 101 (metropolitan departments) or 103 (metropolitan departments + overseas collectivities). Here Mayotte is missing.

What about the regions?



In [5]:

    
salaries[salaries.AREA_TYPE_CODE == 'R'].AREA_CODE.sort_values().unique().size









    Out[5]:





17

Oh… We expected 18! Again, Mayotte is missing.

The first overview revealed some negative salaries. The documentation states that when they are not enough data, the value is -1 while when the data in unavailable it is marked as -2. Let's have a basic description of the salaries.

How many missing or uninformative salary data do we have? We start with the minimum salary



In [6]:

    
len(salaries[salaries.MINIMUM_SALARY < 0]) / len(salaries) * 100









    Out[6]:





13.08211288464703

Around 13% of the job groups for a given area (market) don't have salary data! That is a bit more than nothing!

When salaries are lesser than 0, are the minimum salary and the maximum salaries always the same?



In [7]:

    
invalid_rows = salaries[(salaries.MAXIMUM_SALARY < 0) | (salaries.MINIMUM_SALARY < 0)]
all(invalid_rows.MINIMUM_SALARY == invalid_rows.MAXIMUM_SALARY)









    Out[7]:





True

Yeahh!! They are exactly the same… How convenient!

So let's get a basic overview of the salaries.



In [8]:

    
valid_salaries = salaries[salaries.MAXIMUM_SALARY > 0]
valid_salaries[['MAXIMUM_SALARY', 'MINIMUM_SALARY']].describe()









    Out[8]:






  
    
      
      MAXIMUM_SALARY
      MINIMUM_SALARY
    
  
  
    
      count
      84512.000000
      84512.000000
    
    
      mean
      3979.875047
      2068.592034
    
    
      std
      2086.121357
      616.621855
    
    
      min
      1000.000000
      700.000000
    
    
      25%
      2650.000000
      1650.000000
    
    
      50%
      3300.000000
      1850.000000
    
    
      75%
      4550.000000
      2250.000000
    
    
      max
      41600.000000
      6500.000000

Because the minimum MINIMUM_SALARY is lower than the french minimum wage (~1400), we think that these data gather both full-time and part-time offers. It can be scary to deliver as such to our users…
Anyway, no weird or missing values here.

It can be an overkill but we'll see if the maximum salary is always greater or equal to the minimum salary.



In [9]:

    
all(salaries.MAXIMUM_SALARY >= salaries.MINIMUM_SALARY)









    Out[9]:





True

Great!

Last but not least. Do we cover every job groups?



In [10]:

    
salaries.PCS_PROFESSION_CODE.unique().size









    Out[10]:





412

According to INSEE documentation, we should expect around 500 job groups. ~85% of them are covered.

But, then how many of these job groups have valid data?



In [11]:

    
valid_salaries.PCS_PROFESSION_CODE.unique().size









    Out[11]:





412

Yeahh!! All of them!!

Currently, in Bob, we are mostly using ROME classification. Then, we are interested in the number of ROME job groups covered by this dataset.
First, we need to download the mapping between PCS and ROME classifications.



In [12]:

    
pcs_to_rome = pd.read_csv(path.join(DATA_FOLDER, 'crosswalks/passage_pcs_romev3.csv'))
pcs_to_rome.head()

Quite concise, isn't it!



In [13]:

    
pcs_to_rome[pcs_to_rome['PCS'].isin(salaries.PCS_PROFESSION_CODE.unique())]\
    .ROME.unique().size









    Out[13]:





515

Impressive! We have a ~97% coverage for ROME job groups.

What about the granularity of this coverage?
Coverage at the regions level.



In [14]:

    
region_professions = salaries[salaries.AREA_TYPE_CODE == 'R']\
    .PCS_PROFESSION_CODE.unique()
pcs_to_rome[pcs_to_rome['PCS']\
    .isin(region_professions)]\
    .ROME.unique().size









    Out[14]:





515

Exactly the same…

Let's have a look at the ROME job groups coverage at the department level.



In [15]:

    
department_professions = salaries[salaries.AREA_TYPE_CODE == 'D']\
    .PCS_PROFESSION_CODE.unique()
pcs_to_rome[pcs_to_rome['PCS']\
    .isin(department_professions)]\
    .ROME.unique().size









    Out[15]:





515

Again, no difference.

Everything is going well so far!

Global Overview and Comparison with Scraped Data

Actually, we have multiple source of data for salaries: the IMT and the FHS (more or less Pôle Emploi statistics history). The FHS dataset provides jobseekers salary expectancies. A notebook has been written before to investigate the distribution of these expected salaries.

An analysis of the IMT salary data has been done before.
The main conclusions of this notebook were:

for a given job group, the salary was quite consistent on the french territory (you'll earn the almost the same if you are a deliverer in Lyon or in Paris).
there is a high variation in salaries within a single department.
Does it still stands?

How variable are the salaries within departments?



In [16]:

    
salaries['mean_senior_salary'] = salaries[['MINIMUM_SALARY', 'MAXIMUM_SALARY']].sum(axis=1).div(2)
valid_salaries = salaries[salaries.MAXIMUM_SALARY > 0]
stats_within_departments = valid_salaries[valid_salaries.AREA_TYPE_CODE == 'D']\
    .groupby('AREA_NAME')\
    .mean_senior_salary.agg({'mean', 'std'})\
    .sort_values('mean', ascending=False)
    
stats_within_departments.plot(kind='box');

Within a department, ~30% of the jobs propose a salary greater than 4200€ or lesser than 1800€.

How variable are the salaries within job groups?



In [17]:

    
stats_within_jobgroups = valid_salaries[valid_salaries.AREA_TYPE_CODE == 'D'].groupby('PCS_PROFESSION_CODE')\
    .mean_senior_salary.agg({'mean', 'std'})\
    .sort_values('std', ascending=False)

stats_within_jobgroups.plot(kind='box');

As expected, within a job group, the dispersion is lesser than within a department (standard deviation most of the time lesser than 1000€).

Still, why not looking at some examples of highly variable job goups?



In [18]:

    
valid_salaries[
    (valid_salaries.AREA_TYPE_CODE == 'D') &
    (valid_salaries.PCS_PROFESSION_CODE.isin(stats_within_jobgroups.index))]\
    .drop_duplicates().PCS_PROFESSION_NAME.to_frame().head(5)\
    .style.set_properties( **{'width': '500px'})









    Out[18]:





        

        
        

        
            
            
                
                
                
                  
                
                
                
                
                  PCS_PROFESSION_NAME
                
                
            
            
        
        
            
            
                
                
                
                    0
                
                
                
                
                    Techniciens commerciaux et technico-commerciaux, représentants en biens d'équipement, en biens intermédiaires, commerce interindustriel (hors informatique)
                
                
            
            
            
                
                
                
                    1
                
                
                
                
                    Techniciens commerciaux et technico-commerciaux, représentants en biens d'équipement, en biens intermédiaires, commerce interindustriel (hors informatique)
                
                
            
            
            
                
                
                
                    2
                
                
                
                
                    Techniciens commerciaux et technico-commerciaux, représentants en biens de consommation auprès d'entreprises
                
                
            
            
            
                
                
                
                    3
                
                
                
                
                    Techniciens commerciaux et technico-commerciaux, représentants en biens de consommation auprès d'entreprises
                
                
            
            
            
                
                
                
                    5
                
                
                
                
                    Techniciens commerciaux et technico-commerciaux, représentants en services auprès d'entreprises ou de professionnels (hors banque, assurance, informatique)

Sales persons are the ones with the most highly variable salaries. That seems sensible!

What about the conformity of API data with scraped data?
According to the website (on the 2nd of October 2017), a nurse in the Isère department, younger than 35 years old, could expect a salary between 1850€ and 4050€. Note that ROME code for nurse is "J1502" which corresponds to 6 PCS classifications (431a, 431b, 431c, 431d, 431f and 431g).



In [19]:

    
nurse_pcs = ['431a', '431b', '431c', '431d', '431f', '431g']
valid_salaries[(valid_salaries.AREA_NAME == "ISERE") \
    & (valid_salaries.PCS_PROFESSION_CODE.isin(nurse_pcs) \
    & (valid_salaries.AGE_GROUP_NAME == 'Moins de 35 ans'))] \
    [['MAXIMUM_SALARY', 'MINIMUM_SALARY', 'PCS_PROFESSION_CODE', 'PCS_PROFESSION_NAME']]









    Out[19]:






  
    
      
      MAXIMUM_SALARY
      MINIMUM_SALARY
      PCS_PROFESSION_CODE
      PCS_PROFESSION_NAME
    
  
  
    
      24574
      3050
      1850
      431a
      Cadres infirmiers et assimilés
    
    
      24578
      2950
      1900
      431c
      Puéricultrices
    
    
      24580
      4050
      2100
      431d
      Infirmiers spécialisés (autres qu'infirmiers p...
    
    
      24584
      2850
      1950
      431f
      Infirmiers en soins généraux

Hummm.. So the website, seems to be using the minimum of the minimum salary and the maximum of the maximum salary. Smart!

Conclusion

The dataset is clean.
Missing and uninformative data are clearly documented.
We have now data at multiple granularity levels.
Global overview of the data confirms the observations done before.

	ROME	PCS
0	A1101	691a
1	A1201	533b
2	A1201	691f
3	A1202	631a
4	A1203	631a

	AGE_GROUP_CODE	AGE_GROUP_NAME	AREA_CODE	AREA_NAME	AREA_TYPE_CODE	AREA_TYPE_NAME	MAXIMUM_SALARY	MINIMUM_SALARY	PCS_PROFESSION_CODE	PCS_PROFESSION_NAME
0	1	Moins de 35 ans	04	ALPES-DE-HAUTE-PROVENCE	D	Département	3600	2300	463b	Techniciens commerciaux et technico-commerciau...
1	2	35 ans et plus	04	ALPES-DE-HAUTE-PROVENCE	D	Département	4500	2350	463b	Techniciens commerciaux et technico-commerciau...
2	1	Moins de 35 ans	04	ALPES-DE-HAUTE-PROVENCE	D	Département	3400	2100	463c	Techniciens commerciaux et technico-commerciau...
3	2	35 ans et plus	04	ALPES-DE-HAUTE-PROVENCE	D	Département	5800	1700	463c	Techniciens commerciaux et technico-commerciau...
4	1	Moins de 35 ans	04	ALPES-DE-HAUTE-PROVENCE	D	Département	-2	-2	463d	Techniciens commerciaux et technico-commerciau...

	MAXIMUM_SALARY	MINIMUM_SALARY
count	84512.000000	84512.000000
mean	3979.875047	2068.592034
std	2086.121357	616.621855
min	1000.000000	700.000000
25%	2650.000000	1650.000000
50%	3300.000000	1850.000000
75%	4550.000000	2250.000000
max	41600.000000	6500.000000

	MAXIMUM_SALARY	MINIMUM_SALARY	PCS_PROFESSION_CODE	PCS_PROFESSION_NAME
24574	3050	1850	431a	Cadres infirmiers et assimilés
24578	2950	1900	431c	Puéricultrices
24580	4050	2100	431d	Infirmiers spécialisés (autres qu'infirmiers p...
24584	2850	1950	431f	Infirmiers en soins généraux