Author: Marie Laure, marielaure@bayesimpact.org
The IMT dataset provides regional statistics about different jobs. Here we are interested in the distribution of employment types. Employment types are categories of contract types (mostly based on contract duration and on permanent vs temporary status). Previously, we retrieved IMT data by scraping the IMT website. As an exploratory step, we want to check the sanity of the API-based data and identify any additional information provided only by the API.
The dataset can be obtained with the following command, note that it may take some time to download:
docker-compose run --rm data-analysis-prepare make data/imt/employment_type.csv
Loading and General View

First let's load the csv file:
In [1]:
import os
from os import path
import matplotlib
import pandas as pd
import seaborn as _  # Imported only for its side effect: nicer plot styling.
DATA_FOLDER = os.getenv('DATA_FOLDER')
employment_types = pd.read_csv(path.join(DATA_FOLDER, 'imt/employment_type.csv'), dtype={'AREA_CODE': 'str'})
employment_types.head()
Out[1]:
Done! So we have access to different area types, contract types for each area, and a number and percentage of offers. The documentation defines the number of offers as the number of offers with this contract type for a specific market (a given area and a particular job). Same idea for percentages. Note that the data are updated annually.
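A side note on the `dtype={'AREA_CODE': 'str'}` argument in the loading cell above: without it, pandas would parse purely numeric codes like "06" as integers and silently drop the leading zero. A toy demonstration on made-up CSV content:

```python
import io

import pandas as pd

csv = 'AREA_CODE,NB_OFFERS\n06,10\n75,3\n'
as_int = pd.read_csv(io.StringIO(csv))
as_str = pd.read_csv(io.StringIO(csv), dtype={'AREA_CODE': 'str'})
print(as_int.AREA_CODE.tolist())  # [6, 75] - leading zero lost
print(as_str.AREA_CODE.tolist())  # ['06', '75'] - codes preserved
```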
How big is this dataset (how many rows)?
In [2]:
len(employment_types)
Out[2]:
Not that bad!
Any missing value?
In [3]:
employment_types.isnull().describe()
Out[3]:
Nope! Good Job Pôle Emploi!
How many types of contract do we have?
In [4]:
employment_types.CONTRACT_TYPE_NAME.unique()
Out[4]:
The expected types are here. Short-term, long-term, permanent... and even "others".
How many job groups?
In [5]:
rome_list = employment_types.ROME_PROFESSION_CARD_CODE.unique()
rome_list.size
Out[5]:
In [6]:
'L1510' in rome_list
Out[6]:
Almost... The job group with ROME code L1510 (the latest addition to the job groups) is not yet part of this dataset. Remember that this dataset is updated annually...
How many area types do we have?
In [7]:
employment_types.AREA_TYPE_NAME.unique()
Out[7]:
We have four different area types. Good.
Let's see if every region is represented.
In [8]:
employment_types[employment_types.AREA_TYPE_CODE == 'R'].AREA_CODE.unique().size
Out[8]:
Yes! In France, as of September 2017, there are 13 metropolitan regions and 5 overseas regions.
Same for the departments! Anyone missing??
In [9]:
employment_types[employment_types.AREA_TYPE_CODE == 'D'].AREA_CODE.unique().size
Out[9]:
Hmm. We would expect 101.
Let's see which ones are not expected (but truly welcome)…
In [10]:
employment_types[employment_types.AREA_TYPE_CODE == 'D'].AREA_CODE.sort_values().unique()
Out[10]:
The overseas collectivities of Saint-Barthélemy and Saint-Martin are defined here as departments. So far so good.
Ok we have regions, departments, "bassin"... But how many areas in total are there?
In [11]:
len(employment_types.groupby(['AREA_CODE', 'AREA_TYPE_CODE']))
Out[11]:
OK! So we would expect 527 * 5 * 531 = 1399185 lines, thus leaving 981944 (~70%) missing lines.
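The arithmetic above can be double-checked directly; the counts come from the outputs of the previous cells:

```python
n_areas = 527          # distinct (AREA_CODE, AREA_TYPE_CODE) pairs
n_contract_types = 5   # distinct contract types
n_job_groups = 531     # distinct ROME job group codes

expected_rows = n_areas * n_contract_types * n_job_groups
missing_rows = 981944
actual_rows = expected_rows - missing_rows
print(expected_rows)   # 1399185
print(actual_rows)     # 417241
```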
Why are these lines missing? Maybe uninformative rows (e.g. rows with zero offers) are missing (we've seen that before...). Let's get a brief description of the number and percentage of offers.
In [12]:
employment_types[['NB_OFFERS', 'OFFERS_PERCENT']].describe()
Out[12]:
First things first, there are no rows with zero offers. The good news is that no percentage of offers is above 100. I don't know if you feel the same, but the maximum number of offers seems crazy to me!
Let's have a closer look at this one.
In [13]:
employment_types[employment_types.NB_OFFERS == employment_types.NB_OFFERS.max()]\
    [['AREA_NAME', 'CONTRACT_TYPE_NAME', 'NB_OFFERS', 'OFFERS_PERCENT', 'ROME_PROFESSION_CARD_NAME']]
Out[13]:
Seems legit. Nothing crazy here... Plenty of permanent-position offers in child care at the national level.
Maybe we should check that the sum of the percentages are close to 100%?
In [14]:
def sum_percentages(job_contracts):
    # Flag any market whose percentages do not sum to ~100%.
    total = job_contracts.OFFERS_PERCENT.sum()
    if total < 99.9 or total > 100.1:
        print('{} {}'.format(job_contracts.ROME_PROFESSION_CARD_NAME.iloc[0], total))
employment_types.groupby(['AREA_CODE', 'AREA_TYPE_CODE', 'ROME_PROFESSION_CARD_CODE']).apply(sum_percentages);
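The same check can also be done without a Python-level function, by summing the percentages per market and filtering. Here is a sketch on toy data that reuses the dataset's column names (the values are made up; the K1303 market is built to fail the check):

```python
import pandas as pd

# Toy data with the same column names as the IMT dataset.
toy = pd.DataFrame({
    'AREA_CODE': ['06', '06', '75'],
    'AREA_TYPE_CODE': ['D', 'D', 'D'],
    'ROME_PROFESSION_CARD_CODE': ['F1402', 'F1402', 'K1303'],
    'OFFERS_PERCENT': [60.0, 40.0, 95.0],
})
sums = toy.groupby(
    ['AREA_CODE', 'AREA_TYPE_CODE', 'ROME_PROFESSION_CARD_CODE'])\
    .OFFERS_PERCENT.sum()
# Keep only the markets whose percentages do not sum to ~100%.
bad = sums[(sums < 99.9) | (sums > 100.1)]
print(bad)  # only the K1303 market, which sums to 95
```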
Great!! Everything is as expected.
This dataset is super clean. The most recent job group 'L1510' is not used here. To be present in the dataset, a job has to have at least one offer with a specific employment type in an area, which leads to a lot of missing rows.
The scraped data provide the percentage of offers at the department level for a specific job group. So the main differences are the availability of other area levels and of the raw number of offers. We'll see how consistent the two sources are.
An additional explanation for the high number of missing rows could be that employment types are very specific to a market (a job group x area intersection).
In [15]:
employment_types.groupby(['AREA_CODE', 'AREA_TYPE_CODE', 'ROME_PROFESSION_CARD_CODE'])\
.size()\
.value_counts(normalize=True)\
.sort_index()\
.plot(kind='bar');
Our hypothesis was not that bad: only 5% of the markets have observations for all 5 employment types. Hopefully we won't see too much of the "Other" category, which does not mean a lot.
Out of curiosity, what are the jobs that need "other" contract types?
In [16]:
employment_types[employment_types.CONTRACT_TYPE_CODE == 99]\
    .sort_values('NB_OFFERS', ascending=False)\
    [['AREA_NAME', 'AREA_TYPE_NAME', 'NB_OFFERS', 'OFFERS_PERCENT', 'ROME_PROFESSION_CARD_NAME']]\
    .head(10)
Out[16]:
OK, that makes sense... We can find entrepreneurs here. As internships are not in our list of contract types, we can imagine that we would find them in this category too.
Let's have a look at how many of each employment type we have.
In [17]:
employment_types.CONTRACT_TYPE_NAME.value_counts(normalize=True)\
.plot(kind='bar');
Great, the not-super-useful "Other" category is in the minority. Longer-duration contracts are observed more often than shorter ones. Good for future applicants!
One of the main differences is that we now have access to the number of observations, not only to the percentages. So it may be interesting to know what part of the data is based on very few observations. We focus here on the markets that have only 1 employment type observed. Why those? First, because it is easier (we are lazy). But also because these are the ones most prone to being covered by only a few observations (when you have only 1 observation, it represents 100% of the cases... Quick win!).
In [18]:
employment_types[employment_types.OFFERS_PERCENT == 100]\
.NB_OFFERS.plot(kind='box', showfliers=False)
len(employment_types[employment_types.OFFERS_PERCENT == 100])
Out[18]:
Half of the 48090 markets with 1 contract type observed 100% of the time have only 1 offer observed for this contract. That is a lot. Thus, setting a minimum threshold on the number of offers seems like a good idea.
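As a sketch, such a minimum threshold would just be a boolean filter on NB_OFFERS. The cutoff value of 5 here is a hypothetical choice, and the frame is toy data with the dataset's column names:

```python
import pandas as pd

MIN_OFFERS = 5  # hypothetical cutoff, not the one actually used

toy = pd.DataFrame({
    'NB_OFFERS': [1, 3, 12, 50],
    'OFFERS_PERCENT': [100.0, 100.0, 60.0, 40.0],
})
# Keep only the rows backed by at least MIN_OFFERS observations.
reliable = toy[toy.NB_OFFERS >= MIN_OFFERS]
print(len(reliable))  # 2 of the 4 toy rows survive
```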
BTW, before dropping those rows, let's sum all of them to see the global numbers for the whole country. First let's check that we have the same totals for each area type:
In [19]:
employment_types\
.groupby(['AREA_TYPE_CODE', 'CONTRACT_TYPE_NAME']).NB_OFFERS.sum()\
.sort_values(ascending=False)\
.to_frame('total_offers')\
.reset_index()\
.pivot(index='CONTRACT_TYPE_NAME', columns='AREA_TYPE_CODE', values='total_offers')
Out[19]:
Perfect, congrats Pôle emploi, no offers got lost in the count. So now let's plot the distribution as a ratio.
In [20]:
total_offers = employment_types[employment_types.AREA_TYPE_CODE == 'F']\
.groupby('CONTRACT_TYPE_NAME').NB_OFFERS.sum()\
.sort_values(ascending=False)
total_offers.plot(kind='pie', figsize=(5, 5)).axis('off')
total_offers.div(total_offers.sum()).to_frame('ratio_offers')
Out[20]:
Wow, that's pretty cool: 38% of job offers are for long-term employment, and a majority (64%) are for more than 3 months. That's good news for jobseekers.
OK back to more precise area types… What is the distribution of the number of offers at the bassin, department and region levels?
Bassins:
In [21]:
employment_types[employment_types.AREA_TYPE_CODE == 'B'].NB_OFFERS.describe().to_frame()
Out[21]:
Departments:
In [22]:
employment_types[employment_types.AREA_TYPE_CODE == 'D'].NB_OFFERS.describe().to_frame()
Out[22]:
Regions:
In [23]:
employment_types[employment_types.AREA_TYPE_CODE == 'R'].NB_OFFERS.describe().to_frame()
Out[23]:
Overall, there aren't that many observations for each contract type x area x job group. At the department and bassin levels, most of the jobs have fewer than 20 offers in the area. Thus let's stay cautious when considering those data.
Let's now ignore the specific contract types and look at the total number of offers for each job group. First the regions:
In [24]:
offers_sum = employment_types.groupby(['AREA_CODE', 'AREA_TYPE_CODE', 'ROME_PROFESSION_CARD_CODE'])\
.NB_OFFERS.sum()\
.to_frame('total_offers')\
.reset_index()
offers_sum[offers_sum.AREA_TYPE_CODE == 'R'].describe()
Out[24]:
Departments:
In [25]:
offers_sum[offers_sum.AREA_TYPE_CODE == 'D'].describe()
Out[25]:
And, Bassins:
In [26]:
offers_sum[offers_sum.AREA_TYPE_CODE == 'B'].describe()
Out[26]:
At the department level, most of the job groups have between 4 and 55 total offers (with information on contract types). This is a little more than twice what we saw when considering each contract type individually. Still, it is not huge.
Let's investigate how much data we would lose if we used a threshold on the total number of offers. Let's focus on departments, as it is the granularity level we are actually using in Bob.
In [27]:
department_offers_sum = offers_sum[offers_sum.AREA_TYPE_CODE == 'D']
department_offers_sum.sort_values('total_offers', ascending=False)\
.reset_index(drop=True).reset_index().drop_duplicates('total_offers', keep='last')\
.sort_values('total_offers')\
.set_index('total_offers')['index'].div(len(department_offers_sum))\
.plot(xlim=(0, 70));
We'd lose almost 25% of the data with a threshold at 5, but we'd still have 50% of the data for departments with a threshold at 15.
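These loss rates can also be computed directly instead of being read off the curve. A minimal sketch on hypothetical per-market totals (the real distribution is much larger and more skewed):

```python
import pandas as pd

# Hypothetical total_offers per market, NOT the real distribution.
total_offers = pd.Series([1, 2, 4, 6, 10, 16, 30, 60])

for threshold in (5, 15):
    # Fraction of markets that would fall under the cutoff.
    lost = (total_offers < threshold).mean()
    print('threshold {}: {:.1%} of markets lost'.format(threshold, lost))
```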
We've seen that most of the time, employers can propose more than one contract type for a market. Can we find out whether some contract types are more often used alone or in combination (and which combinations)?
In [28]:
job_contracts = employment_types\
.sort_values('CONTRACT_TYPE_NAME')\
.groupby(['AREA_CODE', 'AREA_TYPE_CODE', 'ROME_PROFESSION_CARD_CODE'])\
.CONTRACT_TYPE_NAME.apply(lambda t: ', '.join(t))
job_contracts.value_counts().div(len(job_contracts)).to_frame().head()
Out[28]:
For 17% of the job groups, employers propose only the classical contract types. Offering only a long-term contract (CDI) is also quite common (12%). This is good news for Bob users as they mostly look for these types of contracts.
However, previous work on how this data could trigger specific recommendations suggests that users would also benefit from looking for long CDDs. As seen in the barplot above, long CDDs have been proposed almost 28% of the time.
Let's now compare scraped with API-retrieved data. What about 'Extraction solide' in the Alpes-Maritimes department?
In [29]:
employment_types[(employment_types.AREA_CODE == '06') & (employment_types.ROME_PROFESSION_CARD_CODE == 'F1402')]\
    [['AREA_NAME', 'CONTRACT_TYPE_NAME', 'NB_OFFERS', 'OFFERS_PERCENT', 'ROME_PROFESSION_CARD_NAME']]
Out[29]:
Yihii! On the 26th of September 2017, the website displays the same info. Note that this is different from the scraped data observations, where CDD, Interim and CDI offers could be found for this job.
For a selected use case, we have perfect consistency between API-retrieved data and data from the website. Overall, only a low to medium number of offers has been observed per job group. Most of the time, employers propose the four main contract types (CDI, long and short CDDs, and Interim). However, the CDI (long-term contract) is also quite commonly proposed alone (12%).
Next steps: