Date: 2017-11-19
Author: Pascal pascal@bayesimpact.org
In Pôle emploi Open Data (from Emploi Store Dev), since at least July 2017, one can find the list of Pôle emploi agencies. This notebook is an overview of this dataset.
You can download the csv file from the dataset by using the following command:
docker-compose run --rm data-analysis-prepare make data/pole-emploi-agencies.csv
First let's import the CSV and check the format of the various columns:
In [1]:
import os
from os import path
import pandas as pd
import seaborn as _  # Only imported for its nicer default plot styles.
DATA_FOLDER = os.getenv('DATA_FOLDER')
agencies = pd.read_csv(path.join(DATA_FOLDER, 'pole-emploi-agencies.csv'))
agencies.head()
Out[1]:
OK, this looks pretty cool. However there are a lot of fields (63), so let's check them vertically:
In [2]:
agencies.head(2).transpose()
Out[2]:
OK, it's a bit easier to read, but not enough for our taste. Let's try to cut out the fields for opening hours:
In [3]:
agencies_opening_hours = agencies[[column for column in agencies.columns if '_HOUR_' in column]]
agencies_opening_hours.head()
Out[3]:
Alright, and let's try to get it in a nicer format. First let's see what the patterns of the field names are. It seems that most of them end with SLOT_1 or SLOT_2:
In [4]:
all_slots = {column[-len('_SLOT_1'):] for column in agencies_opening_hours.columns}
all_slots
Out[4]:
OK, this is confirmed, and do we have the same fields for both slots?
In [5]:
def _columns_for_slot(columns, slot_id):
    # Strip the slot suffix to get the field prefixes for this slot.
    return {column[:-len(slot_id)] for column in columns if column.endswith(slot_id)}

if _columns_for_slot(agencies_opening_hours.columns, '_SLOT_1') != \
        _columns_for_slot(agencies_opening_hours.columns, '_SLOT_2'):
    raise ValueError("Slots don't have the same fields")
_columns_for_slot(agencies_opening_hours.columns, '_SLOT_1')
Out[5]:
OK, that's clearer: for each day, there are two possible opening slots (probably morning and afternoon), and for each slot there are opening and closing hours as well as appointment opening and closing hours.
We're going to pivot this to have one row per slot instead of one row per agency. First let's extract the list of days:
In [6]:
all_days = {column[-len('FRI'):] for column in _columns_for_slot(agencies_opening_hours.columns, '_SLOT_1')}
all_days
Out[6]:
Cool, now let's check that we can recreate the same fields by using the 4 variables: day, slot (1 or 2), slot type (appointment or not) and limit (opening or closing):
In [7]:
slot_fields = {
    '{slot_type}{limit}_HOUR_{day}{slot}'.format(
        slot_type=slot_type,
        limit=limit,
        day=day,
        slot=slot,
    )
    for day in all_days
    for slot in all_slots
    for slot_type in {'', 'APPT_'}
    for limit in {'OPENING', 'CLOSING'}
}
slot_fields == set(agencies_opening_hours.columns)
Out[7]:
Indeed we can! So now here comes the magic:
In [8]:
agencies_opening_hours_flatten = pd.melt(agencies_opening_hours.reset_index(), id_vars=['index'])
agencies_opening_hours_flatten['appointment'] = agencies_opening_hours_flatten.variable.str.startswith('APPT_')
agencies_opening_hours_flatten['slot'] = agencies_opening_hours_flatten.variable.str[-1:]
agencies_opening_hours_flatten['day'] = agencies_opening_hours_flatten.variable.str[-len('FRI_SLOT_1'):-len('_SLOT_1')]
agencies_opening_hours_flatten['limit'] = agencies_opening_hours_flatten.variable.apply(
    lambda var: 'opening' if 'OPENING' in var else 'closing')
agencies_opening_slots = agencies_opening_hours_flatten\
    .set_index(['index', 'appointment', 'day', 'slot', 'limit'])\
    .value.unstack()[['opening', 'closing']]\
    .dropna()\
    .reset_index(level=[1, 2, 3])
# Sorting the days of the week for better readability:
agencies_opening_slots['day_value'] = agencies_opening_slots.day.map({
    'MON': 0, 'TUE': 1, 'WED': 2, 'THU': 3, 'FRI': 4,
})
agencies_opening_slots = agencies_opening_slots\
    .reset_index()\
    .sort_values(['index', 'day_value', 'appointment', 'slot'])\
    .set_index('index')\
    .drop('day_value', 'columns')
agencies_opening_slots.head(10)
Out[8]:
OK, this is now more readable: the agency with index 0 is open every morning without an appointment, and in the afternoons for appointments (except Thursday, and on Friday only until 15:30).
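To double-check that reading, here's a quick sketch that isolates the slots of that first agency and keeps only its appointment slots:
first_agency_slots = agencies_opening_slots.loc[0]
# Keep only the slots that require an appointment.
first_agency_slots[first_agency_slots.appointment]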
Alright, now what about the other (non-hour) fields:
In [9]:
agencies_info = agencies[[column for column in agencies.columns if '_HOUR_' not in column]]
agencies_info.head(2).transpose()
Out[9]:
In [10]:
agencies_info.describe(include='all').head(2).transpose()
Out[10]:
So there are 881 agencies in this dataset. The agency names and "Aurore" codes are unique. They all have addresses, but a few of them (12) are missing an email address, and some (11) do not have a latitude/longitude.
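As a quick sanity check, those missing-value counts can be recomputed directly:
# Recompute the counts of missing values mentioned above.
{
    'missing email': int(agencies_info.EMAIL.isnull().sum()),
    'missing coordinates': int(agencies_info.LATITUDE.isnull().sum()),
}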
Hmm, however it seems that a few fields were parsed as integers but should actually be strings: POSTCODE, SAFIR_AGENCY_CODE and ESTABLISHMENT_TYPE. Let's reimport them properly:
In [11]:
agencies = pd.read_csv(path.join(DATA_FOLDER, 'pole-emploi-agencies.csv'), dtype={
    'POSTCODE': str,
    'SAFIR_AGENCY_CODE': str,
    'ESTABLISHMENT_TYPE': str,
})
agencies_info = agencies[[column for column in agencies.columns if '_HOUR_' not in column]]
agencies_info[['POSTCODE', 'SAFIR_AGENCY_CODE', 'ESTABLISHMENT_TYPE']].describe(include='all').head(2).transpose()
Out[11]:
Cool, that's better. We can see that SAFIR Agency codes are unique as well.
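That uniqueness can also be asserted in one line:
# The SAFIR code could thus serve as an alternate unique key for agencies.
agencies_info.SAFIR_AGENCY_CODE.is_unique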
As we saw in the description above, there are 2 types of agencies:
In [12]:
agencies_info[['AGENCY_TYPE_CODE', 'AGENCY_TYPE_NAME']].drop_duplicates()
Out[12]:
OK, so some of them are "specialized". What does that mean? Let's check out some of them:
In [13]:
agencies_info[agencies_info.AGENCY_TYPE_CODE == 'APES'].head()
Out[13]:
According to their names, they are specialized in executives ("cadres") and/or in certain sectors ("scènes et images").
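Out of curiosity, a quick count of the agencies per type:
# Count the agencies of each type.
agencies_info.AGENCY_TYPE_CODE.value_counts()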
There are also 2 establishment types:
In [14]:
agencies_info[['ESTABLISHMENT_TYPE', 'ESTABLISHMENT_TYPE_NAME']].drop_duplicates()
Out[14]:
Hmm, apparently some agencies do not seem to offer any appointments. Let's make sure this corresponds to the number of hours open for appointments:
In [15]:
agencies_info[['ESTABLISHMENT_TYPE', 'ESTABLISHMENT_TYPE_NAME', 'APPT_WEEKLY_NUMBER_HOURS']]\
    .drop_duplicates()\
    .sort_values('APPT_WEEKLY_NUMBER_HOURS')
Out[15]:
Perfect: the second establishment type simply corresponds to agencies that do not offer any appointments.
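To be thorough, here's a sketch cross-tabulating the establishment type with the absence of appointment hours:
# Each establishment type should line up with having (or not) appointment hours.
pd.crosstab(
    agencies_info.ESTABLISHMENT_TYPE_NAME,
    agencies_info.APPT_WEEKLY_NUMBER_HOURS == 0)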
By the way, let's check that the weekly numbers of hours correspond to the actual published schedules. First let's re-compute the total number of weekly opening hours, both for appointments and for self-service:
In [16]:
def _time_to_hours(time):
    # Convert an 'HH:MM' string to a number of hours, rounded to 2 decimals.
    hours, minutes = time.split(':')
    return int(hours) + round(int(minutes) / 60 * 100) / 100

agencies_opening_slots['duration'] =\
    agencies_opening_slots.closing.map(_time_to_hours) - \
    agencies_opening_slots.opening.map(_time_to_hours)
weekly_number_hours = agencies_opening_slots\
    .reset_index()\
    .groupby(by=['index', 'appointment'])\
    .sum()\
    .reset_index(level=1)\
    .pivot(columns='appointment', values='duration')
weekly_number_hours.head()
Out[16]:
OK, now that we have the same format, let's check the difference with the columns already in the dataset:
In [17]:
{
    'weekly': (weekly_number_hours.iloc[:, 0] - agencies_info.WEEKLY_NUMBER_HOURS).abs().max(),
    'appointment': (weekly_number_hours.iloc[:, 1] - agencies_info.APPT_WEEKLY_NUMBER_HOURS).abs().max(),
}
Out[17]:
There's a difference but it looks more like a rounding error (.03 hours is 1.8 minutes, i.e. less than 2 minutes), so we can just trust the WEEKLY_NUMBER_HOURS fields.
Another field with a low cardinality is the phone number:
In [18]:
agencies_info.PUBLIC_PHONE_NUMBER.value_counts().to_frame()
Out[18]:
So there's a global phone number to call Pôle emploi, but for some reason 3 agencies have published a local phone number.
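Here's a sketch to list those agencies, assuming the most frequent value is the national number:
# The national number is by far the most frequent one; list the outliers.
national_number = agencies_info.PUBLIC_PHONE_NUMBER.value_counts().index[0]
agencies_info[agencies_info.PUBLIC_PHONE_NUMBER != national_number]\
    [['AGENCY_NAME', 'PUBLIC_PHONE_NUMBER']]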
What about the email field? Looking at the first examples, it seemed that they were all of the format <name>.<id>@pole-emploi.fr and that the ID matched the SAFIR_AGENCY_CODE. Let's check it:
In [19]:
email_domains = agencies_info.EMAIL.dropna().str.split('@', 1).apply(lambda parts: parts[1])
email_domains.value_counts().to_frame()
Out[19]:
Hmm the third one looks like it has a typo.
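A sketch to list the agencies behind the suspicious domain (assuming the typo is the pole-emplopi.fr variant):
# List the agencies whose email uses the misspelled domain.
agencies_info[agencies_info.EMAIL.str.endswith('@pole-emplopi.fr', na=False)]\
    [['AGENCY_NAME', 'EMAIL']]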
What about the prefixes:
In [20]:
email_after_dot = agencies_info.EMAIL.dropna()\
    .str.split('@', 1).apply(lambda parts: parts[0])\
    .str.split('.', 1).apply(lambda parts: parts[1] if len(parts) > 1 else '')
safir_agency_code = agencies_info.SAFIR_AGENCY_CODE.astype(str)
(email_after_dot == safir_agency_code.loc[email_after_dot.index]).sum()
Out[20]:
Right on! Most of the email addresses contain the SAFIR agency code.
What about the other email addresses?
In [21]:
emails_not_matching = \
    email_after_dot[email_after_dot != safir_agency_code.loc[email_after_dot.index]]
agencies_info.loc[emails_not_matching.index, ['AGENCY_NAME', 'EMAIL']].head()
Out[21]:
OK, so apparently that's not always the case; it's a good thing that the dataset contains the actual email address to contact.
Let's check the distribution by region:
In [22]:
agencies_info.REG_DIRECTION_NAME.value_counts().to_frame().plot(kind='barh')
agencies_info.REG_DIRECTION_NAME.nunique()
Out[22]:
13 regions in Metropolitan France, and 4 overseas (Guyane, Martinique, Guadeloupe and Réunion-Mayotte). It looks like they're all here with a decent number of agencies: densely populated regions have a lot of agencies, others have fewer.
Now let's check the geographic locations. First why are some of them missing?
In [23]:
agencies_info[agencies_info.LONGITUDE.isnull()].head()
Out[23]:
Hmm, I see no good reason why they do not have geographic coordinates. Google Maps seems to be able to find them, see the one in Rouen, and the one in Cergy. We should probably tell Pôle emploi about those.
In the meantime let's plot the others:
In [24]:
agencies_info.plot(kind='scatter', x='LONGITUDE', y='LATITUDE');
Cool: pretty neat and obvious. We have the location of Pôle emploi agencies including the ones in DOM/TOM. Awesome!
Wait a minute, there seems to be an agency next to Saint-Pierre et Miquelon whereas we did not see any corresponding region before. Let's check:
In [25]:
agencies_info[
    (agencies_info.LATITUDE > 40) &
    (agencies_info.LONGITUDE < -40)
][['AGENCY_NAME', 'REG_DIRECTION_NAME']]
Out[25]:
OK, it is indeed in Saint-Pierre et Miquelon; it's just that it's attached to the Normandie region.
Let's zoom in Metropolitan France:
In [26]:
agencies_info[(agencies_info.LONGITUDE < 10) & (agencies_info.LONGITUDE > -5)]\
    .plot(kind='scatter', x='LONGITUDE', y='LATITUDE',
          xlim=(-5, 10), ylim=(41, 52),
          # Fix figsize to have a good aspect ratio: 1.414 ~= 1/cos(45°).
          figsize=(8, 8 * 1.414 * (52 - 41) / (10 - -5)));
That looks like a good view of the major cities in France! Finally, let's check that we have precise addresses and not only city centers. For instance, let's check the agencies in Paris:
In [27]:
agencies_info[agencies_info.POSTCODE.astype(str).str.startswith('75')]\
    [['AGENCY_NAME', 'LONGITUDE', 'LATITUDE']].head()
Out[27]:
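The listing above suggests distinct locations; here's a quick sketch counting the distinct coordinate pairs among Paris agencies:
# If agencies were geolocated at the city level only, this count would be 1.
agencies_info[agencies_info.POSTCODE.astype(str).str.startswith('75')]\
    [['LONGITUDE', 'LATITUDE']].drop_duplicates().shape[0]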
Cool! Obviously they don't all have the same coordinates, so most probably they relate to actual addresses and not just cities. Finally, let's do a point check for an agency we went to physically (Arles):
In [28]:
agencies_info[agencies_info.AGENCY_NAME == 'ARLES'][['LATITUDE', 'LONGITUDE']]
Out[28]:
Pretty neat! The coordinates are actually less than 50 meters from the agency's entrance. I think we can trust those coordinates.
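For reference, here is a minimal sketch of such a distance check using the haversine formula; the entrance coordinates below are a hypothetical example, not a value from the dataset:
import math

def _distance_in_meters(lat_a, lng_a, lat_b, lng_b):
    # Haversine distance between two (latitude, longitude) points, in meters.
    earth_radius = 6371000
    d_phi = math.radians(lat_b - lat_a)
    d_lambda = math.radians(lng_b - lng_a)
    a = math.sin(d_phi / 2) ** 2 + \
        math.cos(math.radians(lat_a)) * math.cos(math.radians(lat_b)) * \
        math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius * math.asin(math.sqrt(a))

arles = agencies_info[agencies_info.AGENCY_NAME == 'ARLES'].iloc[0]
# Hypothetical entrance coordinates, to be replaced by an on-site measurement.
_distance_in_meters(arles.LATITUDE, arles.LONGITUDE, 43.6766, 4.6278)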
The agencies dataset contains detailed information about Pôle emploi agencies: names and unique codes (Aurore, SAFIR), addresses and geographic coordinates, detailed opening hours both for self-service and for appointments, email addresses and phone numbers (mostly the national 3949 number). It contains agencies across all of France including DOM, Mayotte and Saint-Pierre & Miquelon (but not other overseas collectivities like French Polynesia or Wallis & Futuna).
It could probably be interesting to add this data to an advice recommending going to a local Pôle emploi agency, especially during the self-service slots. It would not change the advice much, but it would make it feel super personalized in the eyes of the users.
Further possible investigations: report to Pôle emploi the agencies missing their geographic coordinates, as well as the email addresses using the misspelled domain @pole-emplopi.fr.