Author: Marie Laure, marielaure@bayesimpact.org

IMT Market Score from API

The IMT dataset provides regional statistics about different jobs. Here we are interested in the market score, which Pôle Emploi calls the tension ratio (a slightly misleading name, since a high tension ratio means plenty of jobs). It is the ratio of the average number of weekly open offers to the average number of weekly applications, expressed per 10 candidates. This value is provided, along with others (e.g. number of offers in the last week, number of applications in the last week), in the "statistics on offers and demands" subset of the IMT dataset.
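As a rough illustration of this definition, the score can be read as "offers per 10 candidates". The sketch below is only an assumption about the arithmetic: the helper name `offers_per_10_candidates` and the handling of a zero denominator are ours, and Pôle Emploi's exact averaging window and rounding are not known here.

```python
def offers_per_10_candidates(avg_weekly_offers, avg_weekly_applications):
    """Hypothetical helper: average weekly offers per 10 average weekly applications."""
    if avg_weekly_applications == 0:
        # No candidates: the ratio is undefined, so return None instead of dividing by zero.
        return None
    return 10 * avg_weekly_offers / avg_weekly_applications


# 40 offers for 100 applications is 4 offers per 10 candidates.
print(offers_per_10_candidates(40, 100))
```

The NB_* columns of the dataset look like counts at a given date rather than these averages, so this helper is not expected to reproduce the TENSION_RATIO column from them.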

Previously, we retrieved IMT data by scraping the IMT website. As an exploratory step, we want to check the sanity of the API-based data and identify any additional information that only the API provides.

The dataset can be retrieved with the following command (it takes ~15 minutes):

docker-compose run --rm data-analysis-prepare make data/imt/market_score.csv

Data Sanity

Loading and General View

First let's load the csv file:



In [1]:

    
import os
from os import path

import pandas as pd
import seaborn as _

DATA_FOLDER = os.getenv('DATA_FOLDER')

market_statistics = pd.read_csv(path.join(DATA_FOLDER, 'imt/market_score.csv'))
market_statistics.head()









    Out[1]:






  
    
      
	AREA_CODE	AREA_NAME	AREA_TYPE_CODE	AREA_TYPE_NAME	NB_APPLICATION_END_MONTH	NB_APPLICATION_LAST_WEEK	NB_OFFER_END_MONTH	NB_OFFER_LAST_WEEK	RICHER_CATCHMENT_AREA_CODE	RICHER_CATCHMENT_AREA_NAME	...	SEASONAL_FEB	SEASONAL_JAN	SEASONAL_JULY	SEASONAL_JUNE	SEASONAL_MAR	SEASONAL_MAY	SEASONAL_NOV	SEASONAL_OCT	SEASONAL_SEP	TENSION_RATIO
0	95	VAL-D'OISE	D	Département	56	55	1	1	1101.0	TRIANGLE D OR	...	N	N	N	N	N	N	N	N	N	NaN
1	02	MARTINIQUE	R	Région	14	14	0	0	NaN	NaN	...	N	N	N	N	O	N	N	N	N	NaN
2	1101	TRIANGLE D OR	B	Bassin	235	237	80	78	NaN	NaN	...	N	N	N	O	N	O	N	N	N	35.0
3	1166	HAUTS DE SEINE CENTRE	B	Bassin	219	221	12	12	NaN	NaN	...	N	N	N	N	O	N	N	O	N	6.0
4	1189	VAL DE MARNE EST	B	Bassin	250	251	17	16	NaN	NaN	...	O	N	N	O	O	N	N	N	N	3.0
    
  

5 rows × 26 columns

Wow! Tons of columns! There is a lot of information on whether a job is seasonal, i.e. whether it shows a peak in offers in a particular month. A month is described as seasonal when it has twice as many offers as the monthly average (calculated over a year) and this pattern is seen in two consecutive years. Because we are not interested in seasonality here, we'll remove at least the per-month data (12 columns).
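To make the seasonality rule concrete, here is a minimal sketch of how such a flag could be computed from monthly offer counts. It assumes that "twice as many" means at least 2x the yearly monthly mean and that the same month must qualify in both years. The function names and toy numbers are ours; this is not Pôle Emploi's actual method.

```python
def peak_months(monthly_offers, factor=2.0):
    """Return the month indices (0-11) whose offers reach factor x the yearly monthly mean."""
    mean = sum(monthly_offers) / len(monthly_offers)
    return {month for month, offers in enumerate(monthly_offers) if offers >= factor * mean}


def seasonal_months(year_1, year_2, factor=2.0):
    """Months that peak in both consecutive years (assumed reading of the rule)."""
    return peak_months(year_1, factor) & peak_months(year_2, factor)


# Toy data: a July (index 6) peak in both years, plus a one-off January peak in the first year.
year_1 = [40, 10, 10, 10, 10, 10, 50, 10, 10, 10, 10, 10]
year_2 = [10, 10, 10, 10, 10, 10, 50, 10, 10, 10, 10, 10]
print(seasonal_months(year_1, year_2))
```

Only July survives here, because the January peak is not repeated in the second year.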



In [2]:

    
to_remove = [name for name in market_statistics.columns if 'SEASONAL_' in name]
market_statistics.drop(to_remove, axis=1, inplace=True)
market_statistics.sort_values(['ROME_PROFESSION_CARD_CODE', 'AREA_CODE']).head()









    Out[2]:






  
    
      
	AREA_CODE	AREA_NAME	AREA_TYPE_CODE	AREA_TYPE_NAME	NB_APPLICATION_END_MONTH	NB_APPLICATION_LAST_WEEK	NB_OFFER_END_MONTH	NB_OFFER_LAST_WEEK	RICHER_CATCHMENT_AREA_CODE	RICHER_CATCHMENT_AREA_NAME	ROME_PROFESSION_CARD_CODE	ROME_PROFESSION_CARD_NAME	SEASONAL	TENSION_RATIO
111588	000	FRANCE ENTIERE	F	France entière	4164	4180	518	483	2203.0	SANTERRE SOMME	A1101	Conduite d'engins agricoles et forestiers	N	5.0
38169	01	GUADELOUPE	R	Région	89	89	3	2	103.0	MARIE-GALANTE	A1101	Conduite d'engins agricoles et forestiers	O	NaN
69864	01	AIN	D	Département	8	8	2	1	8247.0	VILLEFRANCHE	A1101	Conduite d'engins agricoles et forestiers	O	NaN
194659	0101	SAINTE-ROSE	B	Bassin	13	13	0	0	NaN	NaN	A1101	Conduite d'engins agricoles et forestiers	N	NaN
244409	0102	BASSE-TERRE	B	Bassin	0	0	0	0	NaN	NaN	A1101	Conduite d'engins agricoles et forestiers	N	NaN

OK. Some values are missing for the market score. The documentation states that the ratio is undefined when offers and demands are below 30.

How many missing values do we have for tension ratio here?



In [3]:

    
market_statistics.TENSION_RATIO.notnull().describe()









    Out[3]:





count     247150
unique         2
top        False
freq      215137
Name: TENSION_RATIO, dtype: object

Data is missing for 87% of the lines!
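The 87% figure is 215137 missing values out of 247150 lines. A slightly more direct way to get such a share is to average the boolean mask. The sketch below uses a made-up toy series in place of `market_statistics.TENSION_RATIO`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for market_statistics.TENSION_RATIO (values are made up).
toy_ratio = pd.Series([5.0, np.nan, np.nan, 3.0, np.nan])

# isnull() yields booleans; their mean is the fraction of missing values.
missing_share = toy_ratio.isnull().mean()
print('{:.0%} missing'.format(missing_share))
```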

Each line represents data for an area x ROME job group. So how many lines should we expect? First, how many areas, area types and job groups do we have?



In [4]:

    
market_statistics[['ROME_PROFESSION_CARD_CODE', 'AREA_CODE', 'AREA_TYPE_CODE']].describe()









    Out[4]:






  
    
      
	ROME_PROFESSION_CARD_CODE	AREA_CODE	AREA_TYPE_CODE
count	247150	247150	247150
unique	532	509	4
top	D1212	44	B
freq	527	1064	182963

Oh! Look at the job groups... even the most recent ROME job groups are here! Good job Pôle Emploi! There are 4 area types (consistent with the documentation) and 509 area codes.

Because some areas may be labelled with multiple area types, let's see how many area x area types we have here.



In [5]:

    
pd.concat([market_statistics['AREA_TYPE_CODE'], market_statistics['AREA_CODE']]).nunique()









    Out[5]:





513

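That is a little more than the 509 unique area codes. Note, however, that this call stacks the two columns and counts the distinct values of either one, so 513 is simply the 509 area codes plus the 4 area type codes; it is not a count of area x area type pairs. Area codes are nevertheless reused across area types: in the table above, 01 is both the Guadeloupe region and the Ain department.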

With this in mind, and taking 513 as the number of areas, we would expect 513 x 532 = 272916 lines if there were information on each job in each area. Hmmm... That's not the case: with 247150 lines, ~9.4% of the expected lines are missing.
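If the goal is the number of (area type, area code) pairs, one option is to drop duplicate rows rather than stack the two columns. The sketch below contrasts both approaches on a made-up toy frame; only the column names come from the dataset.

```python
import pandas as pd

# Toy data: code '01' is used by both a region and a department.
toy = pd.DataFrame({
    'AREA_TYPE_CODE': ['R', 'D', 'D', 'B'],
    'AREA_CODE': ['01', '01', '02', '0101'],
})

# Stacking both columns counts distinct values of either column: 3 type codes + 3 area codes.
stacked_uniques = pd.concat([toy['AREA_TYPE_CODE'], toy['AREA_CODE']]).nunique()

# Dropping duplicate rows counts distinct (area type, area code) pairs.
pair_count = len(toy[['AREA_TYPE_CODE', 'AREA_CODE']].drop_duplicates())

print(stacked_uniques, pair_count)
```

On the real data, multiplying the pair count by the 532 job groups would give the number of lines to expect.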

For the 32013 lines that do have market score data (~11.7% of the expected lines), what is the distribution of these scores?



In [6]:

    
market_with_score = market_statistics[market_statistics.TENSION_RATIO.notnull()]
market_with_score.TENSION_RATIO.describe()









    Out[6]:





count    32013.000000
mean         7.345547
std         14.817923
min          0.000000
25%          3.000000
50%          5.000000
75%          8.000000
max       1664.000000
Name: TENSION_RATIO, dtype: float64

On the subset with market score information, the market score is usually between 3 and 8 (the interquartile range). That is not super reassuring from a candidate's point of view... we should remember that this is the number of offers per 10 candidates. In the end, most of the time there is less than 1 offer per candidate.

However, in some markets (area/job) we find extreme values (the max is 1664 offers for 10 people). How many of these extreme/unexpected values can we find, and to which jobs and areas do they correspond?



In [7]:

    
market_with_score[market_with_score.TENSION_RATIO > 50].TENSION_RATIO.hist();
market_with_score[market_with_score.TENSION_RATIO > 50]\
    .sort_values('TENSION_RATIO', ascending=False)\
    [['TENSION_RATIO', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'SEASONAL']].head()









    Out[7]:






  
    
      
	TENSION_RATIO	ROME_PROFESSION_CARD_NAME	AREA_NAME	SEASONAL
236886	1664.0	Animation de vente	HAUTS DE SEINE SUD	O
73617	807.0	Animation de vente	HAUTS-DE-SEINE	O
235104	549.0	Maintenance d'aéronefs	PARIS	N
69305	383.0	Enseignement général du second degré	TRIANGLE D OR	N
3597	377.0	Boulangerie - viennoiserie	PAYS D ARLES	N

The 1664 offers for 10 people observed above appear to be a real outlier. However, this value corresponds to a seasonal job and may be linked to a specific place recruiting tons of people (mall, resort...). Note that it sounds like a great idea to apply to be a baker in Arles!

We noticed that the AREA_TYPE_NAME variable can take multiple values. Can we say more about this?



In [8]:

    
market_statistics.AREA_TYPE_NAME.value_counts()









    Out[8]:





Bassin            182963
Département        54366
Région              9289
France entière       532
Name: AREA_TYPE_NAME, dtype: int64

This dataset has multiple granularity layers. We have information at the "Bassin", department ("Département"), region ("Région") and whole-country levels!

For one job, can we have observations for multiple areas? Let's try butchers in the "Lyon" area: the department is Rhône and the region is Auvergne-Rhône-Alpes.



In [9]:

    
market_statistics[
    (market_statistics.AREA_NAME == 'LYON CENTRE') &\
    (market_statistics.ROME_PROFESSION_CARD_NAME == 'Boucherie')]\
    [['AREA_TYPE_NAME', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'TENSION_RATIO']]









    Out[9]:






  
    
      
	AREA_TYPE_NAME	ROME_PROFESSION_CARD_NAME	AREA_NAME	TENSION_RATIO
128326	Bassin	Boucherie	LYON CENTRE	10.0



In [10]:

    
market_statistics[
    (market_statistics.AREA_NAME == 'RHONE') &\
    (market_statistics.ROME_PROFESSION_CARD_NAME == 'Boucherie')]\
    [['AREA_TYPE_NAME', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'TENSION_RATIO']]









    Out[10]:






  
    
      
	AREA_TYPE_NAME	ROME_PROFESSION_CARD_NAME	AREA_NAME	TENSION_RATIO
3265	Département	Boucherie	RHONE	6.0



In [11]:

    
market_statistics[
    (market_statistics.AREA_NAME == 'AUVERGNE-RHONE-ALPES') &\
    (market_statistics.ROME_PROFESSION_CARD_NAME == 'Boucherie')]\
    [['AREA_TYPE_NAME', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'TENSION_RATIO']]









    Out[11]:






  
    
      
	AREA_TYPE_NAME	ROME_PROFESSION_CARD_NAME	AREA_NAME	TENSION_RATIO
236560	Région	Boucherie	AUVERGNE-RHONE-ALPES	6.0

Good! We have info for all of these.
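The three lookups above can also be done in one pass. This is a minimal sketch on a toy extract whose values are copied from the outputs above; the helper name `scores_by_area` is ours. It matches on area names only, which assumes the names are unambiguous across area types.

```python
import pandas as pd

# Toy extract shaped like market_statistics (values copied from the outputs above).
toy = pd.DataFrame({
    'AREA_TYPE_NAME': ['Bassin', 'Département', 'Région', 'Bassin'],
    'AREA_NAME': ['LYON CENTRE', 'RHONE', 'AUVERGNE-RHONE-ALPES', 'PAYS D ARLES'],
    'ROME_PROFESSION_CARD_NAME': [
        'Boucherie', 'Boucherie', 'Boucherie', 'Boulangerie - viennoiserie'],
    'TENSION_RATIO': [10.0, 6.0, 6.0, 377.0],
})


def scores_by_area(frame, job_name, area_names):
    """Hypothetical helper: rows for one job group across several named areas."""
    mask = (frame.ROME_PROFESSION_CARD_NAME == job_name) & frame.AREA_NAME.isin(area_names)
    return frame.loc[mask, ['AREA_TYPE_NAME', 'AREA_NAME', 'TENSION_RATIO']]


lyon_butchers = scores_by_area(
    toy, 'Boucherie', ['LYON CENTRE', 'RHONE', 'AUVERGNE-RHONE-ALPES'])
print(lyon_butchers)
```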

Let's go a little bit more general. How many jobs do we have here?



In [12]:

    
market_statistics.ROME_PROFESSION_CARD_CODE.nunique()









    Out[12]:





532

How many of these are represented in each area? If we have data for every job in every area, we expect to have 532 jobs in each area.



In [13]:

    
area_romes = market_statistics.groupby(['AREA_TYPE_CODE', 'AREA_CODE']).ROME_PROFESSION_CARD_NAME.size()
area_romes.hist();

For some areas, we have missing jobs. They could be missing because some jobs have 0 offers, 0 applications etc...

Can we find some of these zero values in the dataset?



In [14]:

    
market_statistics[(market_statistics.NB_APPLICATION_END_MONTH == 0) &\
                  (market_statistics.NB_OFFER_END_MONTH == 0) &\
                  (market_statistics.NB_OFFER_LAST_WEEK == 0) &\
                  (market_statistics.NB_APPLICATION_LAST_WEEK  == 0)].head()









    Out[14]:






  
    
      
	AREA_CODE	AREA_NAME	AREA_TYPE_CODE	AREA_TYPE_NAME	NB_APPLICATION_END_MONTH	NB_APPLICATION_LAST_WEEK	NB_OFFER_END_MONTH	NB_OFFER_LAST_WEEK	RICHER_CATCHMENT_AREA_CODE	RICHER_CATCHMENT_AREA_NAME	ROME_PROFESSION_CARD_CODE	ROME_PROFESSION_CARD_NAME	SEASONAL	TENSION_RATIO
121	2420	GIEN	B	Bassin	0	0	0	0	NaN	NaN	E1101	Animation de site multimédia	N	NaN
146	976	MAYOTTE	D	Département	0	0	0	0	NaN	NaN	E1101	Animation de site multimédia	N	NaN
280	2609	MACON	B	Bassin	0	0	0	0	NaN	NaN	D1105	Poissonnerie	O	NaN
294	5322	QUIMPER	B	Bassin	0	0	0	0	NaN	NaN	B1604	Réparation - montage en systèmes horlogers	N	NaN
295	5326	COMBOURG	B	Bassin	0	0	0	0	NaN	NaN	B1604	Réparation - montage en systèmes horlogers	N	NaN

There are lines with zeros everywhere and a missing market score. So the jobs that are absent from an area and the jobs reported with zeros are probably different things. We couldn't find any information about this in the documentation.
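To separate the two cases, one can list the area x job combinations that have no line at all, next to the lines that exist but only contain zeros. This is a minimal sketch on made-up toy data. On the real dataset the area key would presumably need to include AREA_TYPE_CODE as well, since area codes are reused across area types.

```python
import pandas as pd

# Toy data: job 'B' is absent from area '2', and job 'A' in area '2' is present with zeros.
toy = pd.DataFrame({
    'AREA_CODE': ['1', '1', '2'],
    'ROME_PROFESSION_CARD_CODE': ['A', 'B', 'A'],
    'NB_OFFER_END_MONTH': [3, 1, 0],
    'NB_APPLICATION_END_MONTH': [10, 4, 0],
})

# Every area x job combination we would expect if the data were complete.
expected = pd.MultiIndex.from_product(
    [toy.AREA_CODE.unique(), toy.ROME_PROFESSION_CARD_CODE.unique()],
    names=['AREA_CODE', 'ROME_PROFESSION_CARD_CODE'])
observed = pd.MultiIndex.from_frame(toy[['AREA_CODE', 'ROME_PROFESSION_CARD_CODE']])

# Combinations with no line at all.
absent = expected.difference(observed)

# Lines that exist but report zero offers and zero applications.
all_zero = toy[(toy.NB_OFFER_END_MONTH == 0) & (toy.NB_APPLICATION_END_MONTH == 0)]

print(list(absent))
print(all_zero[['AREA_CODE', 'ROME_PROFESSION_CARD_CODE']].values.tolist())
```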

Is there an area level (except whole country) for which we have info for all job groups?



In [15]:

    
department_romes = market_statistics[market_statistics.AREA_TYPE_CODE == 'D'].\
    groupby('AREA_NAME').ROME_PROFESSION_CARD_NAME.size()
department_romes.hist();

Arf... Almost... A couple of departments have some jobs not represented.

Let's check the regions.



In [16]:

    
region_romes = market_statistics[market_statistics.AREA_TYPE_CODE == 'R'].\
    groupby('AREA_NAME').ROME_PROFESSION_CARD_NAME.size()
region_romes.hist();

Nothing is perfect! But most of the regions have information for all jobs.

Let's have a look at the areas with fewer job groups than expected (532). First, which areas have fewer than 532 job groups?



In [17]:

    
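# Turn the (area type, area code) -> line count Series into a flat table.
area_romes = area_romes.to_frame()
area_romes = area_romes.reset_index(['AREA_TYPE_CODE', 'AREA_CODE'])
# Use a flat list of names: a nested list can produce a MultiIndex on recent pandas versions.
area_romes.columns = ['AREA_TYPE_CODE', 'AREA_CODE', 'jobgroups']
department_romes = area_romes[area_romes.AREA_TYPE_CODE == 'D']
department_romes.sort_values('jobgroups').head(10)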









    Out[17]:






  
    
      
	AREA_TYPE_CODE	AREA_CODE	jobgroups
505	D	976	416
503	D	973	473
433	D	2A	495
434	D	2B	495
502	D	972	503
507	D	978	510
506	D	977	510
501	D	971	510
504	D	974	518
475	D	70	529

Overseas territories (97X area codes) and Corsica (2A and 2B area codes) are the areas with the highest number of missing job groups.

Conclusion

This dataset seems quite clean even if:

There is little information on the market score.
There are some areas with missing jobs. This does not seem to be related to the lines with zeros...

However, the multiple granularity layers seem consistent with each other.

Comparison with Scraped Data

Let's compare these data with the ones that are now (2017/09/14) online. For a nurse (job group J1502) in the "Yonne" department, the website shows no value.



In [18]:

    
market_statistics[
    (market_statistics.AREA_NAME == 'YONNE') &\
    (market_statistics.ROME_PROFESSION_CARD_CODE == 'J1502')]\
    [['AREA_TYPE_NAME', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'TENSION_RATIO']]









    Out[18]:






  
    
      
	AREA_TYPE_NAME	ROME_PROFESSION_CARD_NAME	AREA_NAME	TENSION_RATIO
84149	Département	Coordination de services médicaux ou paramédicaux	YONNE	NaN

Same here!

What about a plumber in the Cher department? The website announces 3 offers for 10 people.



In [19]:

    
market_statistics[
    (market_statistics.AREA_NAME == 'CHER') &\
    (market_statistics.ROME_PROFESSION_CARD_CODE == 'F1603')]\
    [['AREA_TYPE_NAME', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'TENSION_RATIO']]









    Out[19]:






  
    
      
	AREA_TYPE_NAME	ROME_PROFESSION_CARD_NAME	AREA_NAME	TENSION_RATIO
135082	Département	Installation d'équipements sanitaires et therm...	CHER	3.0

Same story here. Yippee!
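These spot checks could be generalized by joining the two sources on area and job group. The sketch below is only an illustration: the layout of the scraped frame is an assumption (we do not reproduce the actual scraped schema here), and the two toy rows mirror the Cher and Yonne examples above.

```python
import pandas as pd

# Toy frames; the scraped column layout is an assumption, not the actual scraped schema.
api = pd.DataFrame({
    'AREA_CODE': ['18', '89'],
    'ROME_PROFESSION_CARD_CODE': ['F1603', 'J1502'],
    'TENSION_RATIO': [3.0, None],
})
scraped = pd.DataFrame({
    'AREA_CODE': ['18', '89'],
    'ROME_PROFESSION_CARD_CODE': ['F1603', 'J1502'],
    'TENSION_RATIO': [3.0, None],
})

# The outer join keeps lines present in only one source; _merge says where each line comes from.
merged = api.merge(
    scraped, on=['AREA_CODE', 'ROME_PROFESSION_CARD_CODE'],
    how='outer', suffixes=('_api', '_scraped'), indicator=True)

# Two scores agree if they are equal, or if both are missing.
both_missing = merged.TENSION_RATIO_api.isnull() & merged.TENSION_RATIO_scraped.isnull()
merged['agree'] = (merged.TENSION_RATIO_api == merged.TENSION_RATIO_scraped) | both_missing

print(merged[['AREA_CODE', 'ROME_PROFESSION_CARD_CODE', 'agree', '_merge']])
```

Lines where `agree` is False, or where `_merge` is not `both`, would be the ones worth inspecting by hand.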

Let's have a look at the areas with missing jobs. As an example we'll look at the Haute-Saône department (code 70).



In [20]:

    
haute_saone_jobs = market_statistics[market_statistics.AREA_CODE == '70'].ROME_PROFESSION_CARD_NAME.unique()
# Negate the boolean mask with ~ (unary minus is not supported on booleans in recent versions).
market_statistics[~market_statistics.ROME_PROFESSION_CARD_NAME.isin(haute_saone_jobs)].\
    ROME_PROFESSION_CARD_NAME.unique()









    Out[20]:





array(["Conservation et reconstitution d'espèces animales",
       'Encadrement équipage de la pêche',
       "Films d'animation et effets spéciaux"], dtype=object)

Online, the values for "Films d'animation et effets spéciaux" are reported as insufficient data.

Conclusion

Scraped data and data provided by the API are similar. A previous overview of the scraped data found a median market score of 4 jobs per 10 candidates. Here we observe a median of 5 jobs per 10 candidates.

General Conclusion

This dataset gives various information on job offers and demands at multiple area levels (from very local to national). It covers all 532 job groups. We noticed that some jobs are missing at certain area levels and that this seems different from having zero offers and applications. We should go back to Pôle Emploi with this question.

When focusing only on the market score, we observed a majority of missing values, probably due to the low average number of offers and demands. Nevertheless, compared to the scraped data, we now have access to both broader and finer area types ("Country" and "Bassin" in addition to "Department" and "Région"). Furthermore, the data is clean enough to recommend that we:

Switch to the API data and drop the scraped data
Explore using the data at different levels
