Author: pascal@bayes.org
Date: 2016-09-05
Skip the run test because it would take too long to download the dataset.
This notebook provides a quick overview of the Active Job Postings dataset available on the Emploi Store Dev. It is updated every morning (at 6:00 in France) and contains only job postings that are still available on the Pôle Emploi website.
The original file is downloaded with the Python emploi_store module, from the package "offres", resource "Offres d'emploi". You can use the make rule to download it:
docker-compose run --rm data-analysis-prepare make data/recent_job_offers.csv
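For reference, here is a minimal sketch of what that rule does under the hood, assuming the emploi_store client API; the environment variable names for the API credentials are assumptions:

import os

import emploi_store

# Credentials for the Emploi Store Dev API; the env var names are assumptions.
client = emploi_store.Client(
    client_id=os.environ['EMPLOI_STORE_CLIENT_ID'],
    client_secret=os.environ['EMPLOI_STORE_CLIENT_SECRET'])

# Download the "Offres d'emploi" resource of the "offres" package as a CSV.
package = client.get_package('offres')
resource = package.get_resource(name="Offres d'emploi")
resource.to_csv('data/recent_job_offers.csv')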
In [1]:
import datetime
from os import path
import pandas as pd
import seaborn as _  # Imported only for its side effect of styling matplotlib plots.
from bob_emploi.data_analysis.lib import plot_helpers
DATA_PATH = '../../../data/'
pd.options.display.max_rows = 999
Let's import the file. After a few tries, it seems that some fields benefit from having an explicit type set:
In [2]:
postings = pd.read_csv(
path.join(DATA_PATH, 'recent_job_offers.csv'),
dtype={'POSTCODE': str},
parse_dates=['CREATION_DATE', 'MODIFICATION_DATE'], dayfirst=True, infer_datetime_format=True,
low_memory=False)
postings.columns = postings.columns.str.lower()
In [3]:
postings.head(2).transpose()
Out[3]:
In [4]:
plot_helpers.hist_in_range(postings.creation_date, datetime.datetime(2014, 8, 1));
As expected, most job postings don't last more than a few months.
How many new ones do we get per day?
In [5]:
postings.creation_date[postings.creation_date > datetime.datetime(2016, 8, 1)].value_counts().sort_index().plot();
So we get up to 10k new job postings per day, except on weekends. Note that this result might be very seasonal, so let's treat it as a rough approximation.
In order for our app to suggest actual job offers, we would need more info than is available in this dataset: the name of the company, how to contact them, and a way to link a job posting from one day's data to the next. This would be easy if we could scrape the job postings from the Pôle Emploi web pages and map them to this data (or only use a daily scrape from the web). Another way would be to identify the fields we would like to have and ask Pôle Emploi for them (but then the job postings wouldn't be anonymous anymore).
We suggest a solution where we use the CSV to find the newly created offers, and then a scraper to get the actual data. The advanced search page returns up to 100 job postings matching some filters, and the results come with unique IDs.
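As a sketch of that first step, the newly created offers can be pulled out of the daily CSV by their creation date, reusing the postings dataframe loaded above (the "created yesterday" cutoff is an assumption):

import datetime

# Keep only the postings created yesterday: those are the ones to scrape.
yesterday = pd.Timestamp(datetime.date.today() - datetime.timedelta(days=1))
new_postings = postings[postings.creation_date.dt.normalize() == yesterday]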
If we have data scraped from the web and the CSV downloaded from Emploi Store Dev, how easy would it be to match them? Some fields are available in both datasets: the creation date, the ROME profession code, the postcode, and the activity code (of which we only keep the first two characters, the activity group).
Let's see if that's enough to do unique matching.
In [6]:
postings['activity_group'] = postings['activity_code'].str[:2]
def _percent_of_uniques(fields, postings):
    """Percentage of postings that no other posting duplicates on the given fields."""
    dupes = postings[postings.duplicated(fields, keep=False)]
    return 100 - len(dupes) / len(postings) * 100
'{:.02f}%'.format(_percent_of_uniques(['creation_date', 'rome_profession_code', 'postcode', 'activity_group'], postings))
Out[6]:
So we can de-anonymize more than 85% of job postings. Let's see if we can do a bit better with additional fields.
In [7]:
'{:.02f}%'.format(_percent_of_uniques([
'creation_date',
'rome_profession_code',
'postcode',
'activity_group',
'departement_code',
'contract_type_code',
'annual_minimum_salary',
'annual_maximum_salary',
'qualification_code',
], postings))
Out[7]:
With a lot more work we can de-anonymize about 6 percentage points more, but I'm not sure it's worth it.
Let's try a de-anonymization manually:
In [8]:
_PE_SEARCH_PAGE = (
'https://candidat.pole-emploi.fr/candidat/rechercheoffres/resultats/'
'A__COMMUNE_{postcode}_5___{activity_group}-_____{rome_profession_code}____INDIFFERENT_______________________')
for s in postings.sample().itertuples():
    # format needs the row fields as keyword arguments, hence the ** unpacking.
    print(_PE_SEARCH_PAGE.format(**s._asdict()))
    print(s.creation_date)
Success! This search returns only one exact result and the date matches perfectly.
So we proved it could work for many job offers. Having that in production is a bit harder, as we would need to do the following daily: download the fresh CSV, identify the newly created postings, then scrape the corresponding search pages to recover the missing fields.
Depending on how fast we can scrape Pôle Emploi, this can take a long time.
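To make that concrete, here is a minimal sketch of the daily loop, reusing new_postings from the sketch above; scrape_search_page is a hypothetical helper, and the one-request-per-second throttle is an assumption (at ~10k new postings a day, that alone is close to three hours of scraping):

import time

new_postings = new_postings.copy()
new_postings['activity_group'] = new_postings['activity_code'].str[:2]
for posting in new_postings.itertuples():
    url = _PE_SEARCH_PAGE.format(**posting._asdict())
    # scrape_search_page is a hypothetical helper that would fetch the search
    # results page and parse out the unique ID, company name and contact.
    # scrape_search_page(url)
    time.sleep(1)  # Throttle to one request per second.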
The dataset of job postings on Emploi Store Dev is indeed anonymized; however, combined with scraping the website, we could get most of the info we're after. Given the long scraping time, it would be much easier if we could get Pôle Emploi to give us the unique ID (which would halve the scraping work) or, even better, the fields we're after (company name and contact).