Author: Marie Laure, marielaure@bayesimpact.org
The IMT dataset provides regional statistics about different jobs. Here we are interested in the "application modes" subset of this dataset, which gathers the means by which people find jobs. Previously, we retrieved IMT data by scraping the IMT website. For application modes, the present dataset provides not only application mode ranks (as before) but also percentages per FAP code. As an exploratory step, we want to understand the added value of the percentages compared to the ranks.
This dataset can be obtained with the following command: docker-compose run --rm data-analysis-prepare make data/imt/application_modes.csv
First let's load the csv file:
In [1]:
import os
from os import path
import pandas as pd
import seaborn as _
DATA_FOLDER = os.getenv('DATA_FOLDER')
modes = pd.read_csv(path.join(DATA_FOLDER, 'imt/application_modes.csv'))
modes.head()
Out[1]:
In [2]:
modes.describe(include='all').head(2)
Out[2]:
Good news, everything seems to be here! Let's see if there are any discrepancies between names and codes, starting with FAP (job types).
In [3]:
modes.groupby('FAP_CODE').FAP_NAME.nunique().value_counts()
Out[3]:
So far so good: perfect concordance between FAP codes and FAP names. It is worth noting that these 195 FAPs represent only a subset of the full list of FAPs (225), as described here.
What about concordance between application type codes and names?
In [4]:
modes.groupby('APPLICATION_TYPE_CODE').APPLICATION_TYPE_NAME.nunique().value_counts()
Out[4]:
We also have a one-to-one correspondence between application type codes and names.
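The `nunique` checks above look in one direction only (each code maps to a single name). A strict one-to-one mapping also needs each name to map to a single code. Here is a minimal sketch on a toy frame; the helper `is_one_to_one`, the sample rows and the English labels are illustrative assumptions, not part of the dataset.

```python
import pandas as pd


def is_one_to_one(frame, code_col, name_col):
    """Return True if each code has one name and each name has one code."""
    pairs = frame[[code_col, name_col]].drop_duplicates()
    codes_unique = not pairs[code_col].duplicated().any()
    names_unique = not pairs[name_col].duplicated().any()
    return codes_unique and names_unique


# Toy rows, made up for illustration (not taken from the IMT dataset).
toy = pd.DataFrame({
    'APPLICATION_TYPE_CODE': ['R1', 'R2', 'R1', 'R3'],
    'APPLICATION_TYPE_NAME': ['Spontaneous', 'Network', 'Spontaneous', 'Agency'],
})
consistent = is_one_to_one(toy, 'APPLICATION_TYPE_CODE', 'APPLICATION_TYPE_NAME')

# Attach an existing name to a new code: the mapping is no longer one-to-one.
extra = pd.DataFrame({
    'APPLICATION_TYPE_CODE': ['R4'],
    'APPLICATION_TYPE_NAME': ['Network'],
})
broken = pd.concat([toy, extra], ignore_index=True)
inconsistent = is_one_to_one(broken, 'APPLICATION_TYPE_CODE', 'APPLICATION_TYPE_NAME')
```

On the real data, the same helper could be called with `modes` and the FAP or application type columns.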
Is there anything weird going on between FAP codes and application type ranks? Like two #1s?
In [5]:
modes.groupby('FAP_CODE').APPLICATION_TYPE_CODE.value_counts().value_counts()
Out[5]:
Nothing like that here.
Are there any weird values for ranks?
In [6]:
modes.APPLICATION_TYPE_ORDER.unique()
Out[6]:
Nope, they go from first to fourth, and we've already seen that they are set for every FAP.
However, they may be misassigned (e.g. an application type with fourth rank showing up alone).
In [7]:
def check_order(fap_modes):
    # Raise if the ranks or percentages of one FAP code are inconsistent.
    num_modes = len(fap_modes)
    if num_modes == 1:
        if fap_modes.iloc[0].RECRUT_PERCENT != 100:
            raise Exception('Single observations should have 100% percentage')
        if fap_modes.iloc[0].APPLICATION_TYPE_ORDER != 1:
            raise Exception('Single observations should be ranked first')
        return
    for i in range(num_modes - 1):
        if int(fap_modes.APPLICATION_TYPE_ORDER.iloc[i]) != i + 1:
            raise Exception('Rank order not consistent')
        # Percentages must not increase when moving to the next rank.
        if fap_modes.RECRUT_PERCENT.iloc[i] < \
                fap_modes.RECRUT_PERCENT.iloc[i + 1]:
            raise Exception('Percentage order not consistent')


modes.sort_values(
    'APPLICATION_TYPE_ORDER').groupby('FAP_CODE').apply(check_order);
Everything is in order!
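As a complement, the same consistency rules can be expressed without raising exceptions, so that the offending FAP codes are listed instead. The sketch below runs on a toy frame: `find_inconsistent_faps` and the sample values are hypothetical. Unlike the loop above, it also looks at the rank of the last row of each group, but it does not cover the "single mode must be 100%" rule.

```python
import pandas as pd


def find_inconsistent_faps(frame):
    """Return the FAP codes with a gap in ranks or a percentage that goes up."""
    ordered = frame.sort_values(['FAP_CODE', 'APPLICATION_TYPE_ORDER'])
    grouped = ordered.groupby('FAP_CODE')
    # Within a FAP, the k-th row (1-based) is expected to carry rank k.
    bad_rank = ordered.APPLICATION_TYPE_ORDER != grouped.cumcount() + 1
    # diff() is NaN on the first row of each FAP, which compares as False.
    bad_percent = grouped.RECRUT_PERCENT.diff() > 0
    return sorted(set(ordered.loc[bad_rank | bad_percent, 'FAP_CODE']))


# Toy data (made up): A is fine, B has increasing percentages, C skips rank 2.
toy = pd.DataFrame({
    'FAP_CODE': ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'APPLICATION_TYPE_ORDER': [1, 2, 3, 1, 2, 1, 3],
    'RECRUT_PERCENT': [50.0, 30.0, 20.0, 40.0, 60.0, 70.0, 30.0],
})
inconsistent_faps = find_inconsistent_faps(toy)
```

An empty list on the real `modes` frame would agree with the check above.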
Let's take care of the newcomer: the percentage. Basic stats?
In [8]:
modes.RECRUT_PERCENT.describe()
Out[8]:
Here, application modes that were not observed (0%) are not represented, so the mean has no real meaning.
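To illustrate why the mean is not informative here, the sketch below compares the mean over observed rows with the mean once unobserved modes are filled in as 0%. The toy values and mode codes are made up. Note that once zeros are included, and as long as each job type sums to 100%, the mean is simply 100 divided by the number of modes, so it carries no information either way.

```python
import pandas as pd

# Toy data (made up): job type B has a single observed mode.
toy = pd.DataFrame({
    'FAP_CODE': ['A', 'A', 'B'],
    'APPLICATION_TYPE_CODE': ['R1', 'R2', 'R1'],
    'RECRUT_PERCENT': [60.0, 40.0, 100.0],
})

# Mean over observed rows only, which is what describe() reports.
observed_mean = toy.RECRUT_PERCENT.mean()

# Build the full FAP x mode grid and count unobserved pairs as 0%.
full_grid = toy.pivot(
    index='FAP_CODE', columns='APPLICATION_TYPE_CODE', values='RECRUT_PERCENT',
).fillna(0)
mean_with_zeros = full_grid.to_numpy().mean()
```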
Let's end with a manual check of the dataset against the Pôle Emploi website. The application mode ranking for "Tuyauteurs" on 09/01/2017 on the IMT website is:
Here we have:
In [9]:
modes[modes.FAP_CODE == 'D2Z41']
Out[9]:
Yay! We have a match!
Maybe we should check that the sum of the percentages is close to 100%?
In [10]:
def sum_percentages(fap_modes):
    # Print the FAP code when its percentages do not add up to about 100%.
    total = fap_modes.RECRUT_PERCENT.sum()
    if total < 99.9 or total > 100.1:
        print('{} {}'.format(fap_modes.FAP_CODE.iloc[0], total))


modes.groupby('FAP_CODE').apply(sum_percentages)
Out[10]:
In [11]:
pd.options.display.max_colwidth = 100
modes.APPLICATION_TYPE_NAME.drop_duplicates().to_frame()
Out[11]:
The different possibilities are not super precise... Thus, only 'Candidature spontanée', 'Réseau...' and 'Intermédiaires du placement' can be directly useful. For some FAPs, only one, two or three application modes have been observed.
Let's have a look at how often this happens.
In [12]:
modes.groupby('FAP_CODE').size().value_counts(normalize=True).plot(kind='bar');
70% of the job types have data for all 4 application modes. But we can still find some for which only 1 (<10%) or 2 modes (~10%) are observed.
So what is the application mode that is the most frequently ranked first?
In [13]:
modes[modes.APPLICATION_TYPE_ORDER == 1]\
    .APPLICATION_TYPE_NAME.value_counts(normalize=True)\
    .plot.pie(figsize=(6, 6), label='');
Network seems to be slightly better than other modes. Nothing new for the Bayesian people... However, placement agencies also account for 30% of the first-ranked modes.
Let's use percentages now by taking a look at the modes that represent more than half of the observations per job type.
In [14]:
modes[modes.RECRUT_PERCENT >= 50].APPLICATION_TYPE_NAME\
    .value_counts(normalize=True)\
    .plot.pie(figsize=(6, 6), label='');
No doubt, when one channel is doing most of the job, it's pretty often the network (36%). But except for the "others" category, other modes also appear to be successful.
Application mode definitions are not very granular. As we already knew, network ranks first, both when considering the ranks and when considering application modes with more than 50% of the observations. This makes us think that there is room for personalisation. The next step explores how this dataset could be useful to us.
For example, we could investigate which job types the application mode really makes a difference for. Let's start with the easiest case: when only one mode shows up.
In [15]:
total_modes = modes.groupby('FAP_CODE').size()
modes['total_modes'] = modes.FAP_CODE.map(total_modes)
modes[modes.total_modes == 1][['APPLICATION_TYPE_NAME','FAP_NAME']]
Out[15]:
The case of independent farmers, who mostly apply spontaneously for jobs, raises the question of the scope of this application mode ('Candidature spontanée'). Maybe it also includes people who create or take over companies. Here, for clear-cut combinations of job type and application mode, use of a professional or personal network is less represented.
What are the other jobs for which the application mode is really decisive? First we'll look at the gap between the modes ranked first and second. How clear-cut is it?
In [16]:
def compute_top2_diff(fap_modes):
    # With a single mode, the gap to the (missing) second mode is 100.
    if len(fap_modes) == 1:
        return 100
    return fap_modes.iloc[0] - fap_modes.iloc[1]


top2_diff = modes.sort_values(
    'APPLICATION_TYPE_ORDER').groupby('FAP_CODE').RECRUT_PERCENT.apply(compute_top2_diff)
top2_diff.hist();
Most of the time (124/195) there is a difference of less than 20 percentage points between the first and the second application mode. But on the right-hand side we can clearly see that some application modes are highly recommended for certain jobs.
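A count like the one quoted above can be obtained directly from the gap between the first and second modes. The sketch below shows one way to do it on a toy frame: the values are made up and the 20-point cut-off is the one discussed in the text.

```python
import pandas as pd

# Toy data (made up): C has a single observed mode.
toy = pd.DataFrame({
    'FAP_CODE': ['A', 'A', 'B', 'B', 'B', 'C'],
    'APPLICATION_TYPE_ORDER': [1, 2, 1, 2, 3, 1],
    'RECRUT_PERCENT': [55.0, 45.0, 80.0, 15.0, 5.0, 100.0],
})

# One row per FAP, one column per rank.
ranked = toy.pivot(
    index='FAP_CODE', columns='APPLICATION_TYPE_ORDER', values='RECRUT_PERCENT')
# A missing second mode counts as 0%, so a single-mode job type gets a gap of 100.
gap = ranked[1] - ranked[2].fillna(0)
num_small_gaps = int((gap < 20).sum())
```

On the real data, `(top2_diff < 20).sum()` should give the same kind of count.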
Let's have a look at the application modes that lead the second-ranked mode by 60 percentage points or more.
In [17]:
modes['top2_diff'] = modes.FAP_CODE.map(top2_diff)
modes[modes.top2_diff >= 60].APPLICATION_TYPE_NAME.value_counts()
Out[17]:
Network is still there. But there are some job types for which we could recommend having a look at placement agencies. The same goes for spontaneous applications: maybe we could suggest hints to investigate the region-specific ecosystem or job boards.
One of the main questions was: what is the added value of percentages over ranks? So let's have a look at the first rank. How diverse is it?
In [18]:
modes[modes.APPLICATION_TYPE_ORDER == 1].RECRUT_PERCENT.describe()
Out[18]:
For half of the job types, the application mode ranked first represents less than half of the recruitment channels observed.
Would it be more relevant to propose more than one mode, let's say two?
In [19]:
def compute_top2_sum(fap_modes):
    # With a single mode, that mode alone covers 100% of the observations.
    if len(fap_modes) == 1:
        return 100
    return fap_modes.iloc[0] + fap_modes.iloc[1]


top2_sum = modes.sort_values(
    'APPLICATION_TYPE_ORDER').groupby('FAP_CODE').RECRUT_PERCENT.apply(compute_top2_sum)
top2_sum.hist();
It may be like pushing at an open door here, but some users may benefit from more than one application mode suggestion (2). For most job types, the first two modes gather more than 65% of the observations.
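The coverage of the first two modes can also be computed without a custom function, by keeping only ranks 1 and 2 and summing per job type. This is a sketch on made-up values, with the 65% cut-off taken from the text; it handles single-mode job types naturally, since their only row already sums to 100.

```python
import pandas as pd

# Toy data (made up): C has a single observed mode.
toy = pd.DataFrame({
    'FAP_CODE': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C'],
    'APPLICATION_TYPE_ORDER': [1, 2, 3, 4, 1, 2, 3, 4, 1],
    'RECRUT_PERCENT': [40.0, 30.0, 20.0, 10.0, 35.0, 25.0, 22.0, 18.0, 100.0],
})

# Sum of the percentages of the modes ranked first and second, per job type.
top2_sum = toy[toy.APPLICATION_TYPE_ORDER <= 2].groupby('FAP_CODE').RECRUT_PERCENT.sum()
# Share of job types whose top two modes cover more than 65% of observations.
share_above_65 = float((top2_sum > 65).mean())
```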
What are the jobs for which the first modes are almost equally relevant?
In [20]:
modes['top2_sum'] = modes.FAP_CODE.map(top2_sum)
modes[(modes.top2_sum > 70) & (modes.top2_diff < 15) & (modes.APPLICATION_TYPE_ORDER < 3)].\
    sort_values(['FAP_CODE', 'top2_sum'], ascending=False)
Out[20]:
33 job types have a difference of less than 15 percentage points between the first and the second application mode, while both modes together gather more than 70% of the observations. They include some public service job types (school principals or professors) that you can access by "concours" (permanent position) or by spontaneous application/placement agencies (fixed-term contract). Users are already aware of these possibilities. But for other jobs like "cashiers", pushing both pieces of advice (spontaneous application and network) seems like a good strategy!
We know that for 70% of the job types all 4 application modes have been observed. For which job types might the last-ranked mode be interesting? Let's look at the distribution of the percentages of the least observed mode.
In [21]:
last_modes = modes[modes.APPLICATION_TYPE_ORDER == 4]
last_modes.RECRUT_PERCENT.plot(kind='box');
The maximum percentage observed for the application mode ranked last is 21%.
What are the job types for which the last-ranked mode's percentage is not negligible (above the 75th percentile)?
In [22]:
last_modes[last_modes.RECRUT_PERCENT > 16][['APPLICATION_TYPE_NAME', 'FAP_NAME', 'RECRUT_PERCENT']].\
    sort_values('RECRUT_PERCENT', ascending=False)
Out[22]:
Some of these results seem to contradict our knowledge, e.g. the fact that network is ranked last for law professionals. Thus, we keep our main strategy of promoting the Network. However, it seems relevant to drop the spontaneous application advice when users are interested in jobs in which this mode is almost never observed (here less than 16%).
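One possible way to turn this idea into a rule is sketched below. The helper `keep_spontaneous_advice`, the toy rows and the assumption that the mode is labelled exactly 'Candidature spontanée' in `APPLICATION_TYPE_NAME` are illustrative; the 16% cut-off comes from the text, and treating an unobserved mode as "drop the advice" is a design choice, not something the dataset dictates.

```python
import pandas as pd

# Assumed exact label of the mode in APPLICATION_TYPE_NAME.
SPONTANEOUS_NAME = 'Candidature spontanée'
# Cut-off discussed in the text.
MIN_PERCENT = 16


def keep_spontaneous_advice(frame, fap_code, min_percent=MIN_PERCENT):
    """Return True if spontaneous application is observed often enough."""
    rows = frame[
        (frame.FAP_CODE == fap_code)
        & (frame.APPLICATION_TYPE_NAME == SPONTANEOUS_NAME)]
    if rows.empty:
        # Mode never observed for this job type: treat it as 0%.
        return False
    return bool(rows.RECRUT_PERCENT.iloc[0] >= min_percent)


# Toy data (made up); 'Other mode' stands for any other application mode.
toy = pd.DataFrame({
    'FAP_CODE': ['A', 'A', 'B', 'B', 'C'],
    'APPLICATION_TYPE_NAME': [
        'Other mode', SPONTANEOUS_NAME, 'Other mode', SPONTANEOUS_NAME, 'Other mode'],
    'RECRUT_PERCENT': [60.0, 40.0, 90.0, 10.0, 100.0],
})
decisions = {code: keep_spontaneous_advice(toy, code) for code in ['A', 'B', 'C']}
```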
Here at Bayes, we are convinced that Network is really important. Let's see if and how the application modes reported by newly recruited people support our statement. First, how often is the network reported as the way newly recruited people found their job?
In [23]:
modes[modes.APPLICATION_TYPE_CODE == 'R2'].APPLICATION_TYPE_ORDER.value_counts()
Out[23]:
There are only 20 job types for which Network has not been reported as an application mode. When it has been reported, it is usually ranked as the first or second application mode.
Among the jobs for which Network is ranked second, how are the percentages distributed?
In [24]:
network_ranked_second = modes[(modes.APPLICATION_TYPE_ORDER == 2) & (modes.APPLICATION_TYPE_CODE == 'R2')]
network_ranked_second.RECRUT_PERCENT.hist();
There are only 2 cases where the Network is ranked second with more than 36% of the observations. Even if it won't concern that many job types, a threshold at 40% seems reasonable.
What about job types for which the Network mode does not reach this threshold?
In [25]:
network_ranked_second[network_ranked_second.RECRUT_PERCENT < 40].\
    sort_values('RECRUT_PERCENT', ascending=False)[['FAP_NAME', 'RECRUT_PERCENT']]
Out[25]:
It sounds sensible that these job types do not put network as their top mode of application. We expect the advice to be included in the two stars section. However, it appears that we could improve the phrasing for less qualified workers.
The job types that are above the 40% threshold are:
In [26]:
network_ranked_second[network_ranked_second.RECRUT_PERCENT >= 40].\
    FAP_NAME.to_frame()
Out[26]:
For these jobs, it appears relevant that the network advice is put in the user priorities.
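The rule discussed here (Network is a priority when it ranks first, or when it ranks second with at least 40% of the observations) could be sketched as follows. The helper `network_is_priority` and the toy rows are hypothetical; 'R2' is the Network code used in the cells above, while 'R1' is only a placeholder for another mode.

```python
import pandas as pd

NETWORK_CODE = 'R2'


def network_is_priority(frame, fap_code, second_rank_threshold=40):
    """Return True if the Network advice should be a priority for this FAP."""
    rows = frame[
        (frame.FAP_CODE == fap_code)
        & (frame.APPLICATION_TYPE_CODE == NETWORK_CODE)]
    if rows.empty:
        # Network was not reported at all for this job type.
        return False
    rank = int(rows.APPLICATION_TYPE_ORDER.iloc[0])
    percent = float(rows.RECRUT_PERCENT.iloc[0])
    if rank == 1:
        return True
    return rank == 2 and percent >= second_rank_threshold


# Toy data (made up); 'R1' is a placeholder code for another mode.
toy = pd.DataFrame({
    'FAP_CODE': ['A', 'A', 'B', 'B', 'C', 'C', 'D'],
    'APPLICATION_TYPE_CODE': ['R2', 'R1', 'R1', 'R2', 'R1', 'R2', 'R1'],
    'APPLICATION_TYPE_ORDER': [1, 2, 1, 2, 1, 2, 1],
    'RECRUT_PERCENT': [55.0, 45.0, 55.0, 45.0, 70.0, 30.0, 100.0],
})
priorities = {code: network_is_priority(toy, code) for code in ['A', 'B', 'C', 'D']}
```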
Another way to personalize the Network advice could be to put an emphasis on the observed percentage for the job types for which the advantage is striking.
In [27]:
network_ranked_first = modes[(modes.APPLICATION_TYPE_ORDER == 1) & (modes.APPLICATION_TYPE_CODE == 'R2')]
network_ranked_first_ordered = network_ranked_first[['FAP_NAME', 'RECRUT_PERCENT', 'top2_diff', 'total_modes']].\
    sort_values('RECRUT_PERCENT', ascending=False)
network_ranked_first_ordered.head(10)
Out[27]:
As an example, for housekeepers (concierges), the network not only ranks first (70.53%) but also leads the application mode ranked second by 58 percentage points.
This dataset allows us to refine our understanding of the importance of application modes. Network is definitely a key point, but some job types have specific recruitment channels, and for others, proposing more than one mode could be beneficial, for example for jobs whose first and second modes have been observed at similar rates and cover a large share of the observations. Concerning the Network advice, when Network is ranked second, we can use a 40% threshold to distinguish users for whom Network can still be considered higher priority. We can also use the Spontaneous Application ranking or percentage to disable this advice for users for whom it would not be relevant.
The dataset is super clean and ready to use.
Unfortunately, application mode definitions aren't very precise. The "Spontaneous Application" (candidature spontanée) mode might also include creating or taking over companies. Even if there are job types for which there is a super successful application mode, most of the time there is a difference of less than 20 percentage points between the application modes ranked first and second. Furthermore, in fifty percent of the cases the application mode ranked first gathers only 42% of the successful modes. Thus, percentages may help us get better coverage of what is working for a given job type. Focusing on the network, it seems reasonable to set up a 40% threshold for the second-ranked application modes to include a relevant runner-up.
Finally, we should definitely investigate whether switching from FAP to ROME codes influences the consistency of the application mode rankings/percentages.