Date: 2017-06-08
Author: pascal@bayes.org
We have access to an XML feed listing all the volunteering missions gathered by the Tous Bénévoles NGO. It is available at http://www.tousbenevoles.org/linkedin_webservice/xml/linkedin.xml
and is refreshed regularly.
This notebook analyzes a snapshot to see what kind of data we can expect. Note that our partner gives absolutely no guarantee that the feed will contain the same kind of data in the future, or even keep the same format.
To reproduce this notebook, first download the data using:
docker-compose run --rm make data/tous_benevoles.xml
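If you don't use the docker setup, the make rule presumably just downloads the feed into $DATA_FOLDER; here is a minimal sketch of an equivalent manual download (the exact behavior of the make rule is an assumption):

import os
import urllib.request

# Hypothetical stand-in for the make rule: simply fetch the current XML snapshot.
URL = 'http://www.tousbenevoles.org/linkedin_webservice/xml/linkedin.xml'
data_folder = os.getenv('DATA_FOLDER', 'data')
os.makedirs(data_folder, exist_ok=True)
urllib.request.urlretrieve(URL, os.path.join(data_folder, 'tous_benevoles.xml'))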
Now let's import the Pandas module as well as the file itself.
In [1]:
import os
from os import path
import pandas as pd
import seaborn as _  # Imported only for its plot styling side effects.
import xmltodict

# Parse the XML snapshot downloaded by the make rule above into nested dicts.
DATA_FOLDER = os.getenv('DATA_FOLDER')
dataset = xmltodict.parse(open(path.join(DATA_FOLDER, 'tous_benevoles.xml'), 'rb'))
In [2]:
type(dataset)
Out[2]:
In [3]:
dataset.keys()
Out[3]:
In [4]:
type(dataset['jobs'])
Out[4]:
In [5]:
dataset['jobs'].keys()
Out[5]:
In [6]:
type(dataset['jobs']['job'])
Out[6]:
In [7]:
len(dataset['jobs']['job'])
Out[7]:
So it seems that we have a high-level jobs element encompassing a list of job elements. Let's create a data frame with one "job" per row:
In [8]:
jobs = pd.DataFrame(dataset['jobs']['job'])
jobs.head()
Out[8]:
In [9]:
jobs.describe()
Out[9]:
So, not surprisingly, some fields are always constant (they are probably there to fit the format of another partner, maybe LinkedIn):
- Country is always "FR"
- JobType is always "VOLUNTEER"
- State is always "France"
Now let's check the JobId field, as it could be used as a unique identifier. However, there are apparently multiple rows with the same ID (1966 rows for only 1238 IDs), so let's have a quick look at rows that share the same JobId. First let's find IDs that have multiple rows.
In [10]:
jobs[jobs.duplicated(subset='JobId', keep=False)].groupby('JobId').size().head()
Out[10]:
It seems that all the JobId values with several rows have exactly 11 rows. After checking with the partner, those 11 cities actually mean "available everywhere": as LinkedIn required jobs to have a city, they decided to set it to the 11 main cities.
Let's devise a way to find such jobs quickly:
In [11]:
# Build one "signature" string per JobId: its sorted postal codes joined by commas.
all_post_codes = jobs.groupby('JobId').PostalCode.apply(lambda codes: ','.join(codes.sort_values()))
all_post_codes[all_post_codes.str.contains(',')].iloc[0]
Out[11]:
In [12]:
# The signature below is shared by all "available everywhere" jobs (the 11 main cities).
jobs['isAvailableEverywhere'] = jobs.JobId.map(all_post_codes) == '06000,13001,31000,33000,34000,35000,44000,59000,67000,69001,75001'
jobs.sort_values('isAvailableEverywhere', ascending=False).head()
Out[12]:
Now let's take the first ID above and check the stats for the rows corresponding to this JobId:
In [13]:
jobs[jobs.JobId == '35421'][list(set(jobs.columns) - {'isAvailableEverywhere'})].describe()
Out[13]:
Cool: for this one at least, it seems that all the fields are shared except for the City and the PostalCode, as if the same job was available in several cities and was therefore split into several rows to fit a dedicated format. Let's make sure this is indeed the case for all the IDs with multiple rows:
In [14]:
# If only City and PostalCode differ, each JobId should collapse to a single row here.
all_fields_but_geo = set(jobs.columns) - {'City', 'PostalCode'}
jobs.drop_duplicates(subset=all_fields_but_geo)\
    .groupby('JobId')\
    .size()\
    .value_counts()
Out[14]:
In [15]:
jobs.groupby('City').size().sort_values(ascending=False).head(20).sort_values().plot(kind='barh');
Hmm, this is not at all what we expected.
After asking the partner, it turns out they are especially active in the 95 département, near Cergy.
What about Paris?
In [16]:
jobs[jobs.City.str.startswith('PARIS')].City.value_counts()
Out[16]:
It seems clear that the city name is not harmonized, at least for Paris: sometimes it's the whole city, sometimes it's only an arrondissement, and sometimes the city name contains the postcode itself.
Let's quickly check whether this is also the case for other cities:
In [17]:
jobs[(jobs.City.str.contains('[A-Z] [0-9]')) & ~(jobs.City.str.startswith('PARIS'))].City.value_counts()
Out[17]:
Obviously this is the case with the other cities that have multiple arrondissements, but it also seems to be the case with cities that have multiple postal codes for the same city (like Toulouse, which is 31000 but uses TOULOUSE 31400 and TOULOUSE 31500, or Orléans, which is 45000 but also uses ORLEANS 45100).
This might be a problem, as in our application we do not keep track of which part of the city someone is in: even for Paris, Marseille & Lyon this info is only kept in the name for display, and we do not have any distinguishing ID for the location of a user's target job.
Let's clean it up and only keep the city's name:
In [18]:
jobs['clean_city'] = jobs['City'].str.replace(r' \d+', '', regex=True)
jobs.clean_city.value_counts().head()
Out[18]:
Now that it's cleaned, let's see the distribution by cities:
In [19]:
jobs[jobs.City != 'LEFFRINCKOUCKE']\
    .groupby('clean_city').size().sort_values(ascending=False)\
    .head(20).sort_values().plot(kind='barh');
This is kind of what we would expect.
Now let's plot the number of cities with enough missions to be shown:
In [20]:
jobs[jobs.City != 'LEFFRINCKOUCKE']\
    .groupby('clean_city').size().sort_values(ascending=False)\
    .reset_index(drop=True)\
    .plot(ylim=(0, 50)); # 50 is taken from the chart above.
There are missions in about 300 cities, but more than half of them have only one mission, and only ~60 cities have at least 3 missions. I suggest we show the missions in a person's city, but also the ones in their département if there are fewer than 3 in the city.
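As a rough sketch of that rule (the pick_missions helper and the user_city / user_departement inputs are hypothetical, not part of our codebase):

def pick_missions(jobs, user_city, user_departement, min_in_city=3):
    """Missions in the user's city, falling back to their département if too few."""
    in_city = jobs[jobs.clean_city == user_city]
    if len(in_city) >= min_in_city:
        return in_city
    return jobs[jobs.PostalCode.str[:2] == user_departement]

# Example: pick_missions(jobs, 'TOULOUSE', '31').head()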
So let's check the distribution at the département level:
In [21]:
jobs['departement'] = jobs.PostalCode.str[:2]
jobs_per_departement = jobs[jobs.City != 'LEFFRINCKOUCKE'].groupby('departement').size().sort_values(ascending=False)
jobs_per_departement.head(20).plot(kind='bar');
Not surprisingly we get the départements with the top cities (Paris, Lyon, Marseille) first, followed by 95, where the organization did some dedicated work.
Let's now check the coverage over all départements:
In [22]:
jobs_per_departement\
    .reset_index(drop=True)\
    .plot(ylim=(0, 100)); # 100 was taken from the chart above.
Missions are not present in all départements (there are about 100 of them) and most of them have fewer than 20 missions. However, there are more than 15 départements with at least 40 different missions, and still a very large number of départements with 3 missions or more:
In [23]:
sum(jobs_per_departement >= 3)
Out[23]:
In [24]:
jobs[['JobDescription', 'JobTitle']].drop_duplicates().head().transpose()
Out[24]:
It seems that JobDescription always starts with "Mission proposée par" and that JobTitle always starts with "Bénévolat : ". Let's make sure that's the case:
In [25]:
jobs.JobDescription.str.startswith('Mission proposée par ').value_counts()
Out[25]:
In [26]:
jobs.JobTitle.str.startswith('Bénévolat : ').value_counts()
Out[26]:
Bingo, let's extract those constant strings and clean up the description and title:
In [27]:
jobs['title'] = jobs.JobTitle.str.replace('^Bénévolat : ', '', regex=True)
jobs['proposed_by'] = jobs.JobDescription.str.extract('^Mission proposée par ([^<]+)<br />', expand=False)
jobs['description'] = jobs.JobDescription.str.replace('^Mission proposée par ([^<]+)<br />', '', regex=True)
jobs[['title', 'proposed_by', 'description']].drop_duplicates().head()
Out[27]:
Nice title, nice "proposed by" field. The description could probably be cleaned a bit more:
In [28]:
jobs.description.str.startswith('<b>Informations complémentaires</b>').value_counts()
Out[28]:
As suspected, it always continues with "Informations complémentaires"; let's strip that as well.
In [29]:
jobs['description'] = jobs.description.str.replace('^<b>Informations complémentaires</b>', '', regex=True).str.strip()
jobs[['title', 'proposed_by', 'description']].drop_duplicates().head()
Out[29]:
OK, this is good enough for now!
The data from Tous Bénévoles is quite clean and contains enough material to show an advice card pushing 3 different missions for about 60 cities if we restrict ourselves to the city, or for half of the départements if we extend to missions in the same département.
As a fallback, there seem to be some missions available over the whole country.
However, when using the data, be careful with the City field, as it might sometimes contain more than just the city name.
The original fields are not immediately usable either: to get a clean title, the description, or the organization that suggested the mission in the first place, you'll need some additional cleanup.
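To make that cleanup easy to reuse, here is a minimal sketch gathering the steps explored above into a single helper (the load_clean_missions name and its exact interface are only a suggestion):

import pandas as pd
import xmltodict

# Signature of the 11 main cities used by the partner for "available everywhere" jobs.
EVERYWHERE_CODES = '06000,13001,31000,33000,34000,35000,44000,59000,67000,69001,75001'

def load_clean_missions(xml_path):
    """Parse the Tous Bénévoles XML and apply the cleanups explored in this notebook."""
    dataset = xmltodict.parse(open(xml_path, 'rb'))
    jobs = pd.DataFrame(dataset['jobs']['job'])
    # Keep only the city name, stripping the postal code sometimes appended to it.
    jobs['clean_city'] = jobs.City.str.replace(r' \d+', '', regex=True)
    # Flag missions duplicated over the 11 main cities.
    all_post_codes = jobs.groupby('JobId').PostalCode.apply(
        lambda codes: ','.join(codes.sort_values()))
    jobs['isAvailableEverywhere'] = jobs.JobId.map(all_post_codes) == EVERYWHERE_CODES
    # Strip the constant prefixes from the title and description.
    jobs['title'] = jobs.JobTitle.str.replace('^Bénévolat : ', '', regex=True)
    jobs['proposed_by'] = jobs.JobDescription.str.extract(
        '^Mission proposée par ([^<]+)<br />', expand=False)
    jobs['description'] = jobs.JobDescription\
        .str.replace('^Mission proposée par ([^<]+)<br />', '', regex=True)\
        .str.replace('^<b>Informations complémentaires</b>', '', regex=True)\
        .str.strip()
    return jobs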