Pôle Emploi Active Job Postings

Author: pascal@bayes.org
Date: 2016-09-05

Note: the automated run test is skipped because downloading the dataset would take too long.

This notebook provides a quick overview of the Active Job Postings dataset available on the Emploi Store Dev. It is updated every morning (at 6:00 in France) and contains only job postings that are still available on the Pôle Emploi website.

The original file is downloaded using the Python emploi_store module, from the package "offres", resource "Offres d'emploi". You can use the make rule to download it:

docker-compose run --rm data-analysis-prepare make data/recent_job_offers.csv

In [1]:
import datetime
from os import path

import pandas as pd
import seaborn as _  # Imported only for its side effect on plot styling.

from bob_emploi.data_analysis.lib import plot_helpers

DATA_PATH = '../../../data/'

pd.options.display.max_rows = 999

Let's import the file. After a few tries, it seems that some fields benefit from having an explicit type set.


In [2]:
postings = pd.read_csv(
    path.join(DATA_PATH, 'recent_job_offers.csv'),
    dtype={'POSTCODE': str},
    parse_dates=['CREATION_DATE', 'MODIFICATION_DATE'], dayfirst=True, infer_datetime_format=True,
    low_memory=False)
postings.columns = postings.columns.str.lower()

Fields

Let's check what are the fields available for each job posting.


In [3]:
postings.head(2).transpose()


Out[3]:
0 1
activity_code 7112B 7820Z
activity_name Ingénierie, études techniques Activités des agences de travail temporaire
annual_maximum_salary 44000 18555.8
annual_minimum_salary 34000 18479.4
city_code 95127 44026
city_name CERGY CARQUEFOU
continent_code NaN NaN
continent_name NaN NaN
contract_duration NaN 12
contract_dur_unit_code NaN MO
contract_dur_unit_name NaN Mois
contract_nature_code E1 I1
contract_nature_name Contrat tout public Insertion par l'activ.éco.
contract_type_code CDI TTI
contract_type_name Contrat à durée indéterminée Contrat travail temporaire insertion
country_code 01 01
country_name FRANCE FRANCE
creation_date 2016-07-26 00:00:00 2016-08-09 00:00:00
degree_diploma_name_1 NaN NaN
degree_diploma_name_2 NaN NaN
degree_required_code_1 E NaN
degree_required_code_2 NaN NaN
degree_required_name_1 exigé NaN
degree_required_name_2 NaN NaN
degree_subject_area_code_1 23637 NaN
degree_subject_area_code_2 NaN NaN
degree_subject_area_name_1 Mécanique automobile NaN
degree_subject_area_name_2 NaN NaN
degree_type_code_1 NV1 NaN
degree_type_code_2 NaN NaN
degree_type_name_1 Bac+5 et plus ou équivalent NaN
degree_type_name_2 NaN NaN
departement_code 95 44
departement_name VAL-D'OISE LOIRE-ATLANTIQUE
desktop_tools_code_1 NaN NaN
desktop_tools_code_2 NaN NaN
desktop_tools_lev_code_1 NaN NaN
desktop_tools_lev_code_2 NaN NaN
desktop_tools_lev_name_1 NaN NaN
desktop_tools_lev_name_2 NaN NaN
desktop_tools_name_1 NaN NaN
desktop_tools_name_2 NaN NaN
disabled_accessibility N N
driving_lic_code_1 B NaN
driving_lic_code_2 NaN NaN
driving_lic_name_1 B - Véhicule léger NaN
driving_lic_name_2 NaN NaN
driving_lic_req_code_1 E NaN
driving_lic_req_code_2 NaN NaN
driving_lic_req_name_1 exigé NaN
driving_lic_req_name_2 NaN NaN
employer_consent O O
em_interview_modality_code MEL MEL
em_interview_modality_name Envoyer votre CV par mail Envoyer votre CV par mail
experience_code E E
experience_max_duration NaN NaN
experience_min_duration 3 3
experience_name Expérience exigée Expérience exigée
exp_duration_type_code AN MO
exp_duration_type_name An(s) Mois
flag_relation 0 0
lang_code_1 AN NaN
lang_code_2 NaN NaN
lang_name_1 Anglais NaN
lang_name_2 NaN NaN
lang_proficiency_code_1 3 NaN
lang_proficiency_code_2 NaN NaN
lang_proficiency_name_1 Bon NaN
lang_proficiency_name_2 NaN NaN
lang_required_code_1 E NaN
lang_required_code_2 NaN NaN
lang_required_name_1 exigé NaN
lang_required_name_2 NaN NaN
latitude 49.0522 47.2969
longitude 2.03611 -1.49278
maximum_salary 44000 9.71
minimum_salary 34000 9.67
modification_date 2016-08-29 00:00:00 2016-08-12 00:00:00
monitoring_agency_code 95074 44243
nb_month_salary 12 12
nb_vacancies_creation 1 1
nb_vacancies_left 1 1
number_of_application 1 2
postcode 95000 44470
qualification_code 9 5
qualification_name Cadre Employé non qualifié
region_code 11 52
region_name Ile de France Pays de la Loire
rome_list_activity_code 732;866;1312;1920 NaN
rome_list_skill_code 21249 NaN
rome_list_work_env_code 23410 NaN
rome_profession_card_code H1206 N1103
rome_profession_card_name Management et ingénierie études, recherche et ... Magasinage et préparation de commandes
rome_profession_code 15743 17993
rome_profession_name Ingénieur / Ingénieure en automobile en industrie Préparateur / Préparatrice de commandes
salary_comment NaN NaN
salary_supplement_code_1 7 NaN
salary_supplement_code_2 NaN NaN
salary_supplement_name_1 Mutuelle NaN
salary_supplement_name_2 NaN NaN
salary_type_code A H
salary_type_name Annuel Horaire
salary_unit_code EU EU
service_type_code APP ACP
service_type_name En appui En accompagnement sans présélection
status_code EC EC
status_name En cours En cours
sub_continent_code NaN NaN
sub_continent_name NaN NaN
travel_frequency_code 3 NaN
travel_frequency_comment NaN NaN
travel_frequency_name Fréquents NaN
travel_type_code 2 NaN
travel_type_comment NaN NaN
travel_type_name Régional NaN
wage_unit_name Euros Euros
weekly_working_hours 37 36
weekly_working_minutes NaN 45
workforce NaN NaN
working_condition_code AUT AUT
working_condition_comment NaN NaN
working_condition_name Autre Autre
working_hours_type_code NOR NOR
working_hours_type_comment NaN NaN
working_hours_type_name Horaires normaux Horaires normaux
working_location_name CERGY CARQUEFOU
working_location_type_code CO CO
working_location_type_name Une commune Une commune

Unfortunately there are no fields that could help identify the company name or an email address for the job seeker to contact. :-(

Creation Date

Let's see how recent those job postings are.


In [4]:
plot_helpers.hist_in_range(postings.creation_date, datetime.datetime(2014, 8, 1));


100.00% of values in range

As expected, most job postings don't last more than a few months.

How many new ones do we get per day?


In [5]:
postings.creation_date[postings.creation_date > datetime.datetime(2016, 8, 1)].value_counts().plot();


So we get up to 10k new job postings per day, except on weekends. Note that this figure might be highly seasonal, so let's treat it as a rough approximation.

I Want More!

In order for our app to suggest actual job offers, we would need more information than is available in this dataset: the name of the company, how to contact them, and a way to link a job posting in one day's data to the same posting in the next day's data. This would be easy if we could scrape the job postings on the Pôle Emploi web pages and map them to this data (or only use a daily scrape of the web). Note that another way would be to identify the fields we would like and ask Pôle Emploi for them (but then the job postings would no longer be anonymous).

We suggest a solution where we use the CSV to find the newly created offers, then use a scraper to get the actual data. The advanced search page returns up to 100 job postings matching some filters, and the results come with unique IDs.
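The "find the newly created offers" step can be sketched with a minimal, self-contained example. The DataFrame below is a synthetic stand-in for the downloaded CSV (the column names mirror the dataset, but the rows are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for the downloaded CSV; column names mirror the
# real dataset but the rows are made up for illustration.
postings = pd.DataFrame({
    'creation_date': pd.to_datetime(
        ['2016-09-03', '2016-09-04', '2016-09-05', '2016-09-05']),
    'rome_profession_code': [15743, 17993, 15743, 12345],
})

# Keep only the postings created on the most recent day: those are the
# "new" offers that the scraper would then look up.
latest_day = postings.creation_date.max()
new_postings = postings[postings.creation_date == latest_day]
print(len(new_postings))  # 2
```

On the real CSV, the same filter would yield the ~10k postings created that day.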

De-Anonymize

If we have data scraped from the web and the CSV downloaded from Emploi Store Dev, how easy would it be to match them? Here are some fields that are available in both datasets:

  • creation date (not in search, only in display)
  • rome profession code
  • postcode for city-located jobs
  • activity code (at least the first two characters), which represents the company's sector or industry

Let's see if that's enough to do unique matching.


In [6]:
postings['activity_group'] = postings['activity_code'].str[:2]


def _percent_of_uniques(fields, postings):
    dupes = postings[postings.duplicated(fields, keep=False)]
    return 100 - len(dupes) / len(postings) * 100


'{:.02f}%'.format(_percent_of_uniques(
    ['creation_date', 'rome_profession_code', 'postcode', 'activity_group'],
    postings))


Out[6]:
'86.42%'

So we can de-anonymize more than 85% of job postings. Let's see if we can do a bit better with additional fields.


In [7]:
'{:.02f}%'.format(_percent_of_uniques([
    'creation_date',
    'rome_profession_code',
    'postcode',
    'activity_group',
    'departement_code',
    'contract_type_code',
    'annual_minimum_salary',
    'annual_maximum_salary',
    'qualification_code',
    ], postings))


Out[7]:
'92.03%'

With a lot more work we can de-anonymize 6% more postings, but I'm not sure it's worth it.

Let's try to de-anonymize one posting manually:


In [8]:
_PE_SEARCH_PAGE = (
    'https://candidat.pole-emploi.fr/candidat/rechercheoffres/resultats/'
    'A__COMMUNE_{postcode}_5___{activity_group}-_____{rome_profession_code}____INDIFFERENT_______________________')
for s in postings.sample().itertuples():
    print(_PE_SEARCH_PAGE.format(**s._asdict()))
    print(s.creation_date)


https://candidat.pole-emploi.fr/candidat/rechercheoffres/resultats/A__COMMUNE_76740_5___56-_____13861____INDIFFERENT_______________________
2016-08-16 00:00:00

Success! This search returns only one exact result and the date matches perfectly.

So we proved it could work for many job offers. Running this in production is a bit harder, however, as we would need to do the following daily:

  • fetch the CSV from Pôle Emploi Store Dev (~40 minutes)
  • extract the ~10k new job postings (a simple filter)
  • scrape the corresponding search pages to get the unique IDs (10k scrapes)
  • (opt) scrape the actual job posting pages to get more info and not just a deep link (10k more scrapes)

Depending on how fast we can scrape Pôle Emploi, this can take a long time.

Conclusion

The dataset of job postings on Pôle Emploi Store Dev is indeed anonymized; however, by also scraping the website we could get most of the info we're after. Given the long scraping time and the final result, it would be far easier if we could get Pôle Emploi to give us the unique ID (halving the scraping time) or, even better, the fields we're after (company name and contact info).