Date: 2017-06-08
Author: pascal@bayes.org
We have access to an XML feed listing all the volunteering missions gathered by the Tous Bénévoles NGO. It is available at http://www.tousbenevoles.org/linkedin_webservice/xml/linkedin.xml
and is refreshed regularly.
This notebook analyzes a snapshot to see what kind of data we can expect. Note that our partner gives absolutely no guarantee that the feed will contain the same kind of data in the future, or even keep the same format.
To reproduce this notebook, first download the data using:
docker-compose run --rm make data/tous_benevoles.xml
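If you don't use the docker setup, the make rule presumably just downloads the feed into $DATA_FOLDER; here is a minimal sketch of an equivalent manual download (the exact behavior of the make rule is an assumption):

import os
import urllib.request

# Hypothetical stand-in for the make rule: simply fetch the current XML snapshot.
URL = 'http://www.tousbenevoles.org/linkedin_webservice/xml/linkedin.xml'
data_folder = os.getenv('DATA_FOLDER', 'data')
os.makedirs(data_folder, exist_ok=True)
urllib.request.urlretrieve(URL, os.path.join(data_folder, 'tous_benevoles.xml'))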
Now let's import the Pandas module as well as the file itself.
In [1]:
import os
from os import path
import pandas as pd
import seaborn as _  # Imported only for its plot styling side effects.
import xmltodict

# Parse the XML snapshot downloaded by the make rule above into nested dicts.
DATA_FOLDER = os.getenv('DATA_FOLDER')
dataset = xmltodict.parse(open(path.join(DATA_FOLDER, 'tous_benevoles.xml'), 'rb'))
In [2]:
type(dataset)
Out[2]:
In [3]:
dataset.keys()
Out[3]:
In [4]:
type(dataset['jobs'])
Out[4]:
In [5]:
dataset['jobs'].keys()
Out[5]:
In [6]:
type(dataset['jobs']['job'])
Out[6]:
In [7]:
len(dataset['jobs']['job'])
Out[7]:
So it seems that we have a high-level jobs element encompassing a list of job elements. Let's create a data frame with one "job" per row:
In [8]:
jobs = pd.DataFrame(dataset['jobs']['job'])
jobs.head()
Out[8]:
In [9]:
jobs.describe()
Out[9]:
So, not surprisingly, some fields are always constant (they are probably there to fit the format of another partner, maybe LinkedIn):
- Country is always "FR"
- JobType is always "VOLUNTEER"
- State is always "France"
Now let's check the JobId field, as it could be used as a unique identifier. However, there are apparently multiple rows with the same ID (1966 rows for only 1238 IDs), so let's have a quick look at rows that share the same JobId. First let's find IDs that have multiple rows.
In [10]:
jobs[jobs.duplicated(subset='JobId', keep=False)].groupby('JobId').size().head()
Out[10]:
It seems that all the JobId values with several rows have exactly 11 rows. After checking with the partner, those 11 cities actually mean "available everywhere": as LinkedIn required jobs to have a city, they decided to set it to the 11 main cities.
Let's devise a way to find such jobs quickly:
In [11]:
# Build one "signature" string per JobId: its sorted postal codes joined by commas.
all_post_codes = jobs.groupby('JobId').PostalCode.apply(lambda codes: ','.join(codes.sort_values()))
all_post_codes[all_post_codes.str.contains(',')].iloc[0]
Out[11]:
In [12]:
# The signature below is shared by all "available everywhere" jobs (the 11 main cities).
jobs['isAvailableEverywhere'] = jobs.JobId.map(all_post_codes) == '06000,13001,31000,33000,34000,35000,44000,59000,67000,69001,75001'
jobs.sort_values('isAvailableEverywhere', ascending=False).head()
Out[12]:
Now let's take the first ID above and check the stats for the rows corresponding to this JobId:
In [13]:
jobs[jobs.JobId == '35421'][list(set(jobs.columns) - {'isAvailableEverywhere'})].describe()
Out[13]:
Cool: for this one at least, it seems that all the fields are shared except for the City and the PostalCode, as if the same job was available in several cities and was therefore split into several rows to fit a dedicated format. Let's make sure this is indeed the case for all the IDs with multiple rows:
In [14]:
# If only City and PostalCode differ, each JobId should collapse to a single row here.
all_fields_but_geo = set(jobs.columns) - {'City', 'PostalCode'}
jobs.drop_duplicates(subset=all_fields_but_geo)\
    .groupby('JobId')\
    .size()\
    .value_counts()
Out[14]:
In [15]:
jobs.groupby('City').size().sort_values(ascending=False).head(20).sort_values().plot(kind='barh');
Hmm, this is not at all what we expected.
After asking the partner, it turns out they are especially active in the 95 département, near Cergy.
What about Paris?
In [16]:
jobs[jobs.City.str.startswith('PARIS')].City.value_counts()
Out[16]:
It seems clear that the city name is not harmonized, at least for Paris: sometimes it's the whole city, sometimes it's only an arrondissement, and sometimes the city name contains the postcode itself.
Let's quickly check whether this is also the case for other cities:
In [17]:
jobs[(jobs.City.str.contains('[A-Z] [0-9]')) & ~(jobs.City.str.startswith('PARIS'))].City.value_counts()
Out[17]:
Obviously this is the case with the other cities that have multiple arrondissements, but it also seems to be the case with cities that have multiple postal codes for the same city (like Toulouse, which is 31000 but uses TOULOUSE 31400 and TOULOUSE 31500, or Orléans, which is 45000 but also uses ORLEANS 45100).
This might be a problem, as in our application we do not keep track of which part of the city someone is in: even for Paris, Marseille & Lyon this info is only kept in the name for display, and we do not have any distinguishing ID for the location of a user's target job.
Let's clean it up and only keep the city's name:
In [18]:
jobs['clean_city'] = jobs['City'].str.replace(r' \d+', '', regex=True)
jobs.clean_city.value_counts().head()
Out[18]:
Now that it's cleaned, let's see the distribution by cities:
In [19]:
jobs[jobs.City != 'LEFFRINCKOUCKE']\
    .groupby('clean_city').size().sort_values(ascending=False)\
    .head(20).sort_values().plot(kind='barh');
This is kind of what we would expect.
Now let's plot the number of cities with enough missions to be shown:
In [20]:
jobs[jobs.City != 'LEFFRINCKOUCKE']\
    .groupby('clean_city').size().sort_values(ascending=False)\
    .reset_index(drop=True)\
    .plot(ylim=(0, 50)); # 50 is taken from the chart above.
There are missions in about 300 cities, but more than half of them have only one mission, and only ~60 cities have at least 3 missions. I suggest we show the missions in a person's city, but also the ones in their département if there are fewer than 3 in the city.
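As a rough sketch of that rule (the pick_missions helper and the user_city / user_departement inputs are hypothetical, not part of our codebase):

def pick_missions(jobs, user_city, user_departement, min_in_city=3):
    """Missions in the user's city, falling back to their département if too few."""
    in_city = jobs[jobs.clean_city == user_city]
    if len(in_city) >= min_in_city:
        return in_city
    return jobs[jobs.PostalCode.str[:2] == user_departement]

# Example: pick_missions(jobs, 'TOULOUSE', '31').head()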
So let's check the distribution at the département level:
In [21]:
jobs['departement'] = jobs.PostalCode.str[:2]
jobs_per_departement = jobs[jobs.City != 'LEFFRINCKOUCKE'].groupby('departement').size().sort_values(ascending=False)
jobs_per_departement.head(20).plot(kind='bar');
Not surprisingly we get the départements with the top cities (Paris, Lyon, Marseille) first, followed by 95, where the organization did some dedicated work.
Let's now check the coverage over all départements:
In [22]:
jobs_per_departement\
    .reset_index(drop=True)\
    .plot(ylim=(0, 100)); # 100 was taken from the chart above.
Missions are not present in all départements (there are about 100 of them) and most of them have fewer than 20 missions. However, there are more than 15 départements with at least 40 different missions, and still a very large number of départements with 3 missions or more:
In [23]:
sum(jobs_per_departement >= 3)
Out[23]:
In [24]:
jobs[['JobDescription', 'JobTitle']].drop_duplicates().head().transpose()
Out[24]:
It seems that JobDescription always starts with "Mission proposée par" and that JobTitle always starts with "Bénévolat : ". Let's make sure that's the case:
In [25]:
jobs.JobDescription.str.startswith('Mission proposée par ').value_counts()
Out[25]:
In [26]:
jobs.JobTitle.str.startswith('Bénévolat : ').value_counts()
Out[26]:
Bingo, let's extract those constant strings and clean up the description and title:
In [27]:
jobs['title'] = jobs.JobTitle.str.replace('^Bénévolat : ', '', regex=True)
jobs['proposed_by'] = jobs.JobDescription.str.extract('^Mission proposée par ([^<]+)<br />', expand=False)
jobs['description'] = jobs.JobDescription.str.replace('^Mission proposée par ([^<]+)<br />', '', regex=True)
jobs[['title', 'proposed_by', 'description']].drop_duplicates().head()
Out[27]:
Nice title, nice "proposed by" field. The description could probably be cleaned a bit more:
In [28]:
jobs.description.str.startswith('<b>Informations complémentaires</b>').value_counts()
Out[28]:
As suspected, it always continues with "Informations complémentaires"; let's strip that as well.
In [29]:
jobs['description'] = jobs.description.str.replace('^<b>Informations complémentaires</b>', '', regex=True).str.strip()
jobs[['title', 'proposed_by', 'description']].drop_duplicates().head()
Out[29]:
OK, this is good enough for now!
The data from Tous Bénévoles is quite clean and contains enough material to show an advice card pushing 3 different missions for about 60 cities if we restrict ourselves to the city, or for half of the départements if we extend to missions in the same département.
As a fallback, there seem to be some missions available over the whole country.
However, when using the data, be careful with the City field, as it might sometimes contain more than just the city name.
The original fields are not immediately usable either: to get a clean title, the description, or the organization that suggested the mission in the first place, you'll need some additional cleanup.
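To make that cleanup easy to reuse, here is a minimal sketch gathering the steps explored above into a single helper (the load_clean_missions name and its exact interface are only a suggestion):

import pandas as pd
import xmltodict

# Signature of the 11 main cities used by the partner for "available everywhere" jobs.
EVERYWHERE_CODES = '06000,13001,31000,33000,34000,35000,44000,59000,67000,69001,75001'

def load_clean_missions(xml_path):
    """Parse the Tous Bénévoles XML and apply the cleanups explored in this notebook."""
    dataset = xmltodict.parse(open(xml_path, 'rb'))
    jobs = pd.DataFrame(dataset['jobs']['job'])
    # Keep only the city name, stripping the postal code sometimes appended to it.
    jobs['clean_city'] = jobs.City.str.replace(r' \d+', '', regex=True)
    # Flag missions duplicated over the 11 main cities.
    all_post_codes = jobs.groupby('JobId').PostalCode.apply(
        lambda codes: ','.join(codes.sort_values()))
    jobs['isAvailableEverywhere'] = jobs.JobId.map(all_post_codes) == EVERYWHERE_CODES
    # Strip the constant prefixes from the title and description.
    jobs['title'] = jobs.JobTitle.str.replace('^Bénévolat : ', '', regex=True)
    jobs['proposed_by'] = jobs.JobDescription.str.extract(
        '^Mission proposée par ([^<]+)<br />', expand=False)
    jobs['description'] = jobs.JobDescription\
        .str.replace('^Mission proposée par ([^<]+)<br />', '', regex=True)\
        .str.replace('^<b>Informations complémentaires</b>', '', regex=True)\
        .str.strip()
    return jobs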