Author: Marie Laure, marielaure@bayesimpact.org
The IMT dataset provides regional statistics about different jobs. Here we are interested in the market score, which Pôle Emploi calls the tension ratio (a slightly misleading name, as a big tension ratio means plenty of jobs). It corresponds to the ratio of the average number of weekly open offers to the average number of weekly applications, expressed per 10 candidates. This value is provided among others (e.g. number of offers in the last week, number of applications in the last week...) in the "statistics on offers and demands" subset of the IMT dataset.
Previously, we retrieved IMT data by scraping the IMT website. As an exploratory step, we want to check the sanity of the API-based data and identify any additional information provided only by the API.
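As a rough illustration of the definition above, here is a minimal sketch of how such a score could be computed. The function name and the exact formula (10 times offers divided by applications) are assumptions made for illustration, not Pôle Emploi's actual computation.

```python
def market_score(weekly_offers, weekly_applications):
    """Offers per 10 candidates (assumed formula, not Pôle Emploi's code)."""
    if weekly_applications == 0:
        return None  # the ratio is undefined without applications
    return 10 * weekly_offers / weekly_applications


# Hypothetical market: 12 open offers and 40 applications per week on average.
score = market_score(12, 40)
print(score)  # 3.0, i.e. 3 offers for 10 candidates
```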
The dataset can be retrieved with the following command (it takes ~15 minutes):
docker-compose run --rm data-analysis-prepare make data/imt/market_score.csv
Loading and General View
First let's load the csv file:
In [1]:
import os
from os import path
import pandas as pd
import seaborn as _
DATA_FOLDER = os.getenv('DATA_FOLDER')
market_statistics = pd.read_csv(path.join(DATA_FOLDER, 'imt/market_score.csv'))
market_statistics.head()
Out[1]:
Wow! Tons of columns! There is a lot of information on whether a job is seasonal, and whether it shows a peak in offers in a particular month. A month is described as seasonal when it has twice as many offers as the monthly average (calculated over a year), and when this pattern is seen in two consecutive years. Because we are not interested in the seasonality here, we'll remove at least the per month data (12 columns).
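As an aside, the seasonality rule described above can be sketched in a few lines. This is only an illustration: the function names, the "at least twice the average" threshold and the monthly counts are assumptions, not taken from the IMT documentation.

```python
def peak_months(monthly_offers, factor=2.0):
    """Return the 1-based months whose offers reach `factor` times the monthly average."""
    average = sum(monthly_offers) / len(monthly_offers)
    if average == 0:
        return set()  # no offers at all, so no peak
    return {month for month, offers in enumerate(monthly_offers, start=1)
            if offers >= factor * average}


def seasonal_months(first_year, second_year):
    """Months that peak in both of two consecutive years."""
    return peak_months(first_year) & peak_months(second_year)


# Hypothetical monthly offer counts for one job in one area.
year_1 = [10, 10, 10, 10, 10, 10, 60, 10, 10, 10, 10, 10]
year_2 = [12, 12, 12, 12, 12, 12, 70, 12, 12, 12, 12, 40]
print(seasonal_months(year_1, year_2))  # {7}: only July peaks in both years
```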
In [2]:
to_remove = [name for name in market_statistics.columns if 'SEASONAL_' in name]
market_statistics.drop(to_remove, axis=1, inplace=True)
market_statistics.sort_values(['ROME_PROFESSION_CARD_CODE', 'AREA_CODE']).head()
Out[2]:
OK. Some values are missing for the market score. The documentation states that the ratio is undefined when offers and demands are below 30.
How many missing values do we have for tension ratio here?
In [3]:
market_statistics.TENSION_RATIO.notnull().describe()
Out[3]:
Data is missing for 87% of the lines!
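The share of missing values can also be read directly as a proportion. A minimal sketch on hypothetical toy values (the real column is `TENSION_RATIO` in `market_statistics`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the TENSION_RATIO column (hypothetical values).
toy = pd.DataFrame({'TENSION_RATIO': [3.0, np.nan, np.nan, 8.0, np.nan]})

# isnull() yields booleans, and the mean of booleans is the share of True values.
missing_share = toy.TENSION_RATIO.isnull().mean()
print('{:.0%} of the lines have no market score'.format(missing_share))
```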
Each line represents data for an Area x ROME job group. So how many lines should we expect? First, how many areas, area types and job groups do we have?
In [4]:
market_statistics[['ROME_PROFESSION_CARD_CODE', 'AREA_CODE', 'AREA_TYPE_CODE']].describe()
Out[4]:
Oh! Look at the job groups... even the most recent ROME job groups are here! Good job Pôle Emploi! There are 4 area types (consistent with the documentation) and 509 areas.
Because some areas may be labelled with multiple area types, let's see how many area x area types we have here.
In [5]:
pd.concat([market_statistics['AREA_TYPE_CODE'], market_statistics['AREA_CODE']]).nunique()
Out[5]:
A little bit more than the 509 unique area names. This confirms that some area names are shared by more than one area type.
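A note on this count: `pd.concat` stacks the two columns end to end, so `nunique()` counts distinct values across both columns rather than distinct (area type, area) pairs. If counting pairs is the goal, `drop_duplicates` on the two columns is one option. A minimal sketch on hypothetical toy data, where the code '11' is shared by a region and a department:

```python
import pandas as pd

# Toy data (hypothetical): code '11' is used both by a region and a department.
toy = pd.DataFrame({
    'AREA_TYPE_CODE': ['R', 'D', 'D', 'R'],
    'AREA_CODE': ['11', '11', '69', '11'],
})

# Distinct values across both columns stacked together: 'R', 'D', '11', '69'.
stacked_values = pd.concat([toy['AREA_TYPE_CODE'], toy['AREA_CODE']]).nunique()

# Distinct (area type, area) pairs: ('R', '11'), ('D', '11'), ('D', '69').
distinct_pairs = len(toy[['AREA_TYPE_CODE', 'AREA_CODE']].drop_duplicates())

print(stacked_values, distinct_pairs)  # 4 3
```

The two numbers only agree by coincidence, so it may be worth checking both on the real data.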
With this in mind, we would expect 513 x 532 = 272916 lines (513 area x area type combinations times 532 job groups) if there were information on each job in each area. Hmmm... That's not the case: ~9.5% of the expected lines are missing.
For the remaining 32013 (~11.7% of the expected lines) lines with market score data, what is the distribution of these scores?
In [6]:
market_with_score = market_statistics[market_statistics.TENSION_RATIO.notnull()]
market_with_score.TENSION_RATIO.describe()
Out[6]:
On the subset with market score information, the market score is usually between 3 and 8, which is not super reassuring from a candidate's point of view: remember that this corresponds to the number of offers per 10 candidates. In the end, most of the time we have less than 1 offer per candidate.
However, in some markets (area/job) we can find extreme values (the max is at 1664 offers for 10 persons). How many of these extreme/unexpected values can we find, and to which jobs and areas do they correspond?
In [7]:
market_with_score[market_with_score.TENSION_RATIO > 50].TENSION_RATIO.hist();
market_with_score[market_with_score.TENSION_RATIO > 50]\
.sort_values('TENSION_RATIO', ascending=False)\
[['TENSION_RATIO', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'SEASONAL']].head()
Out[7]:
The 1664 offers for 10 persons observed above appear to be a real outlier. However, this value corresponds to a seasonal job and may be linked to a specific place recruiting tons of people (mall, resort...). Note that it sounds like a great idea to apply to be a baker in Arles!
We noticed that the AREA_TYPE_NAME variable can cover multiple values. Can we say more about this?
In [8]:
market_statistics.AREA_TYPE_NAME.value_counts()
Out[8]:
This dataset has multiple granularity layers. We have information at the department ("Département") level, region level or whole country!
For one job, can we have observations at multiple area levels? Let's try for butchers in the "Lyon" area, where the department is Rhône and the region is Auvergne-Rhône-Alpes.
In [9]:
market_statistics[
(market_statistics.AREA_NAME == 'LYON CENTRE') &\
(market_statistics.ROME_PROFESSION_CARD_NAME == 'Boucherie')]\
[['AREA_TYPE_NAME', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'TENSION_RATIO']]
Out[9]:
In [10]:
market_statistics[
(market_statistics.AREA_NAME == 'RHONE') &\
(market_statistics.ROME_PROFESSION_CARD_NAME == 'Boucherie')]\
[['AREA_TYPE_NAME', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'TENSION_RATIO']]
Out[10]:
In [11]:
market_statistics[
(market_statistics.AREA_NAME == 'AUVERGNE-RHONE-ALPES') &\
(market_statistics.ROME_PROFESSION_CARD_NAME == 'Boucherie')]\
[['AREA_TYPE_NAME', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'TENSION_RATIO']]
Out[11]:
Good! We have info for all of these.
Let's go a little bit more general. How many jobs do we have here?
In [12]:
market_statistics.ROME_PROFESSION_CARD_CODE.nunique()
Out[12]:
How many of these are represented in each area? If we have data for every job in every area, we expect to have 532 jobs in each area.
In [13]:
area_romes = market_statistics.groupby(['AREA_TYPE_CODE', 'AREA_CODE']).ROME_PROFESSION_CARD_NAME.size()
area_romes.hist();
For some areas, we have missing jobs. They could be missing because some jobs have 0 offers, 0 applications etc...
Can we find some of these zero values in the dataset?
In [14]:
market_statistics[(market_statistics.NB_APPLICATION_END_MONTH == 0) &\
(market_statistics.NB_OFFER_END_MONTH == 0) &\
(market_statistics.NB_OFFER_LAST_WEEK == 0) &\
(market_statistics.NB_APPLICATION_LAST_WEEK == 0)].head()
Out[14]:
There are jobs with zeros and NA. So probably, the missing values and the zeros are different things. We couldn't find any information about this in the documentation.
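To keep the three situations apart (no line at all, a line with zeros, a line with NA), one possible approach is to compare the lines that exist with the full area x job group grid. A minimal sketch on hypothetical toy data. The column names match the dataset, but the values are made up:

```python
from itertools import product

import numpy as np
import pandas as pd

# Toy data (hypothetical): two areas, three job groups, one combination absent.
toy = pd.DataFrame({
    'AREA_CODE': ['69', '69', '69', '70', '70'],
    'ROME_PROFESSION_CARD_CODE': ['D1101', 'F1603', 'J1502', 'D1101', 'F1603'],
    'NB_OFFER_LAST_WEEK': [4, 0, np.nan, 2, 0],
})

# Every area x job group combination that could exist.
expected = set(product(
    toy.AREA_CODE.unique(), toy.ROME_PROFESSION_CARD_CODE.unique()))
present = set(zip(toy.AREA_CODE, toy.ROME_PROFESSION_CARD_CODE))

absent_lines = expected - present                    # no line at all
zero_lines = toy[toy.NB_OFFER_LAST_WEEK == 0]        # line present, value is 0
na_lines = toy[toy.NB_OFFER_LAST_WEEK.isnull()]      # line present, value missing

print(absent_lines, len(zero_lines), len(na_lines))
```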
Is there an area level (except whole country) for which we have info for all job groups?
In [15]:
department_romes = market_statistics[market_statistics.AREA_TYPE_CODE == 'D'].\
groupby('AREA_NAME').ROME_PROFESSION_CARD_NAME.size()
department_romes.hist();
Arf... Almost... A couple of departments have some jobs not represented.
Let's see with the region.
In [16]:
region_romes = market_statistics[market_statistics.AREA_TYPE_CODE == 'R'].\
groupby('AREA_NAME').ROME_PROFESSION_CARD_NAME.size()
region_romes.hist();
Nothing is perfect! But most of the regions have information for all jobs.
Let's have a look at an area with fewer jobs than expected (532). First, which areas have fewer than 532 job groups?
In [17]:
# Turn the per-area counts into a flat table with one row per (area type, area).
area_romes = area_romes.to_frame().reset_index()
area_romes.columns = ['AREA_TYPE_CODE', 'AREA_CODE', 'jobgroups']
department_romes = area_romes[area_romes.AREA_TYPE_CODE == 'D']
department_romes.sort_values('jobgroups').head(10)
Out[17]:
Overseas territories (97X area codes) and Corsica (2X area codes) are the areas with the highest number of missing job groups.
This dataset seems quite clean even if:
Let's compare these data with the ones that are online now (2017/09/14). For a nurse in the "Yonne" department, there is no value.
In [18]:
market_statistics[
(market_statistics.AREA_NAME == 'YONNE') &\
(market_statistics.ROME_PROFESSION_CARD_CODE == 'J1502')]\
[['AREA_TYPE_NAME', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'TENSION_RATIO']]
Out[18]:
Same here!
What about a plumber in the Cher department? The website announces 3 offers for 10 people.
In [19]:
market_statistics[
(market_statistics.AREA_NAME == 'CHER') &\
(market_statistics.ROME_PROFESSION_CARD_CODE == 'F1603')]\
[['AREA_TYPE_NAME', 'ROME_PROFESSION_CARD_NAME', 'AREA_NAME', 'TENSION_RATIO']]
Out[19]:
Same story here. Yippee!
Let's have a look at the areas with missing jobs. As an example we'll look at the Haute-Saône department (code 70).
In [20]:
haute_saone_jobs = market_statistics[market_statistics.AREA_CODE == '70'].ROME_PROFESSION_CARD_NAME.unique()
market_statistics[~market_statistics.ROME_PROFESSION_CARD_NAME.isin(haute_saone_jobs)].\
ROME_PROFESSION_CARD_NAME.unique()
Out[20]:
Online, the "Films d'animation et effets spéciaux" values are specified as insufficient data.
Scraped data and data provided by the API are similar. A previous overview of the scraped data observed a market score median of 4 jobs for 10 candidates. Here we observed it at 5 jobs per 10 candidates.
This dataset gives various information on job offers and demands at multiple area levels (from very local to national). It covers all the 532 job groups. We noticed that some jobs are missing at certain area levels and that this seems different from a value of zero. We should go back to Pôle Emploi with this question.
When focusing only on the market score we observed a majority of missing values, probably due to the low average number of offers and demands. Nevertheless, compared to the scraped data, we now have access to finer and broader area types ("Country" and "Bassin" in addition to "Department" and "Région"). Furthermore, the data is clean enough to recommend to