Author: Stephan stephan@bayesimpact.org
Date: Jun 14, 2016
We want to allow our users to build a profile of skills they have, in order to suggest suitable jobs to them. A PRD for this feature can be found at http://go/pe:profile-prd.
We want to use the skills taxonomy from the ROME dataset because it allows us to extract a mapping from skills to jobs. The main questions we want to answer in this notebook are:
In [1]:
import os
import pandas as pd
import seaborn as _
from bob_emploi.data_analysis.lib import read_data
from bob_emploi.data_analysis.lib import plot_helpers
data_folder = os.getenv('DATA_FOLDER')
Before using the XML version of the data, I had a look at the CSV data. I saw that I could use unix_coherence_item_v330_utf8 to establish a mapping between job groups and skills. However, the XML data also contains an ordering of the skills, which could be quite useful if it actually represents an ordering from more to less specific. That is why I used the XML version of the data in this case.
In [2]:
fiche_dicts = read_data.load_fiches_from_xml(os.path.join(data_folder, 'rome/ficheMetierXml'))
rome = [read_data.fiche_extractor(f) for f in fiche_dicts]
skills_data = [dict(skill, code_rome=job['code_rome'], job_group_name=job['intitule'])
               for job in rome
               for skill in job['skills']]
skills_raw = pd.DataFrame(skills_data)
skills_raw.columns = ['skill_id', 'job_group_id', 'type', 'job_group_name', 'name', 'position', 'priorisation']
A peek at the raw data.
In [3]:
skills_raw.head()
Out[3]:
In [4]:
skills_raw.describe().transpose()
Out[4]:
So we got 1741 different skills for the 531 different job groups in ROME. They are of two different types, one of which makes up about 70% of all skill occurrences (3964 / 5520). The most common skill is Utilisation d'outils bureautiques.... The position column suggests that they are ordered, and the priorisation column does not contain any information.
Let's clean up the data a little and look at the distribution of skill types on the deduplicated list of skills. By that I mean looking only at the list of 1741 unique skills, rather than at skills associated with jobs as in the table above.
In [5]:
# Mapping extracted from data/rome/csv/unix_referentiel_competence_v330_utf8.csv
COMPETENCE_TYPES = {
    '1': 'theoretical_skill',
    '2': 'action_skill',
}
skills = skills_raw.drop(['priorisation'], axis=1)
skills['type'] = skills.type.map(COMPETENCE_TYPES)
skills['position'] = skills.position.astype(int)
In [6]:
dedup_skills = skills.drop_duplicates(subset=['skill_id'])
dedup_skills['type'].value_counts(normalize=True)
Out[6]:
The same split between theoretical and action skills also shows up in the deduplicated list of skills, not only in the skills associated with jobs.
Just as a sanity check, make sure that each skill is only associated with one type.
In [7]:
skills.groupby('skill_id')['type'].nunique().value_counts()
Out[7]:
Good!
In [8]:
skills.duplicated(subset=['job_group_id', 'type', 'position']).sum()
Out[8]:
In [9]:
skills.duplicated(subset=['job_group_id', 'position']).sum()
Out[9]:
Positions are only unique within a job group once the skill type is taken into account, so skills are ordered separately for the two types of skills.
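The two duplicate checks above can be reproduced on a toy table to see why they point to a per-type ordering. The rows below are made up for illustration (they are not ROME data): positions restart at 1 for each skill type, so `(job_group_id, position)` contains duplicates while `(job_group_id, type, position)` does not.

```python
import pandas as pd

# Hypothetical toy rows, not taken from ROME: positions restart for each skill type.
toy = pd.DataFrame({
    'job_group_id': ['A1101', 'A1101', 'A1101', 'A1101'],
    'type': ['theoretical_skill', 'theoretical_skill', 'action_skill', 'action_skill'],
    'position': [1, 2, 1, 2],
})

# No duplicates when the skill type is part of the key...
dups_with_type = int(toy.duplicated(subset=['job_group_id', 'type', 'position']).sum())
# ...but duplicates appear as soon as the type is ignored.
dups_without_type = int(toy.duplicated(subset=['job_group_id', 'position']).sum())

print(dups_with_type, dups_without_type)
```

If positions were numbered across both types at once, the second count would be zero as well.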
In [10]:
skills = skills.sort_values('position')
first = skills.groupby(['job_group_name', 'type']).first().name
last = skills.groupby(['job_group_name', 'type']).last().name
pd.concat([first, last], axis=1, keys=['first', 'last'])
Out[10]:
Feedback I got from Pascal: I'm sorry but even if I read French quite fluently I cannot tell whether they are ordered, so I believe they are not. "Éducation de jeunes enfants" (education of young children) has "Civil Law" as its first theoretical skill and "Techniques of listening and relating to people" as its last: I believe the last one is more correlated to the job group. "Élevage ovin ou caprin" (breeding of sheep or goats) has "Setup of birth cages" as its first action skill and "welding" as its last: I believe the first one is more correlated to the job group.
TODO:
However I feel that we can compute some interesting measure of it:
To make it even closer to something useful, you can also weigh each job group with the number of persons in this job group, for instance using a proxy with the FHS.
In [11]:
skills.groupby('job_group_id').skill_id.nunique().hist();
That's a neat distribution with a clear peak around 10. Florian asked in the PRD 'how many skills do we plan to collect per user'. With users having had several jobs in their past, I would expect a user to pick between 15 and 30 skills, as a very rough estimate. This of course also depends on the typical skill overlap between jobs, which is probably higher between jobs in the same field.
Also, as we saw in the summary statistics above, all 531 job groups in ROME have at least one skill associated with them.
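To make the dependence on skill overlap concrete, here is a minimal sketch of how the number of distinct skills for a job history could be counted. The table and the helper `distinct_skill_count` are hypothetical and only mimic the `job_group_id` and `skill_id` columns of `skills`; the values are made up.

```python
import pandas as pd

# Hypothetical toy data with the same column names as `skills`.
toy_skills = pd.DataFrame({
    'job_group_id': ['J1', 'J1', 'J1', 'J2', 'J2', 'J3'],
    'skill_id': ['s1', 's2', 's3', 's2', 's4', 's5'],
})


def distinct_skill_count(skills_df, job_group_ids):
    """Count distinct skills over all job groups in a user's history."""
    in_history = skills_df[skills_df.job_group_id.isin(job_group_ids)]
    return in_history.skill_id.nunique()


# J1 and J2 share skill s2, so 3 + 2 listed skills collapse to 4 distinct ones.
overlapping = distinct_skill_count(toy_skills, ['J1', 'J2'])
# J2 and J3 share nothing, so all 3 listed skills are distinct.
disjoint = distinct_skill_count(toy_skills, ['J2', 'J3'])
print(overlapping, disjoint)
```

Running the same function on the real `skills` table for typical job histories would give an empirical check of the 15 to 30 estimate, assuming such histories are available.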
Is there a difference between the number of theoretical and action skills?
In [12]:
by_job_by_type = skills.groupby(['job_group_id', 'type'])
unique_job_counts = by_job_by_type.skill_id.nunique().reset_index()
# `density` replaces the `normed` argument, which newer Matplotlib versions no longer accept.
unique_job_counts.hist(by='type', density=True, sharex=True, sharey=True);
The theoretical skills resemble the above distribution, but many jobs seem to have only one or two action skills.
In [13]:
skill_frequency = skills.name.value_counts() / skills.job_group_id.nunique()
skill_frequency.index = pd.Series(range(len(skill_frequency))) / len(skill_frequency)
skill_frequency.plot(figsize=(8, 4));
There seem to be very few skills that are associated with lots of different jobs. Most of the skills seem to be very specific to a job.
That means that we might have trouble suggesting totally new jobs for the skills a user imported from their old job, because there might be hardly any overlap between the skills of their old job and those of a potential new job, except for the few very common skills that probably don't provide much signal.
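One way to put a number on this overlap would be a pairwise similarity between the skill sets of job groups. The sketch below uses the Jaccard index, which is my own choice of measure and not something prescribed by the PRD, on made-up rows that only mimic the `job_group_id` and `name` columns of `skills`.

```python
import itertools

import pandas as pd

# Hypothetical toy data, not taken from ROME.
toy_skills = pd.DataFrame({
    'job_group_id': ['J1', 'J1', 'J1', 'J2', 'J2', 'J3', 'J3'],
    'name': ['office tools', 'welding', 'soldering',
             'office tools', 'civil law',
             'office tools', 'welding'],
})

# One set of skill names per job group.
skill_sets = toy_skills.groupby('job_group_id')['name'].apply(set)


def jaccard(left, right):
    """Size of the intersection divided by the size of the union."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


# Overlap for every unordered pair of job groups.
overlaps = {
    (a, b): jaccard(skill_sets[a], skill_sets[b])
    for a, b in itertools.combinations(skill_sets.index, 2)
}
print(overlaps)
```

In this toy example J1 and J2 only share the generic 'office tools' skill, which is the situation described above: the overlap is non-zero but carries little signal.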
What are those very common skills?
In [14]:
counts = skills.groupby(['name', 'type']).skill_id.count().to_frame()
counts.sort_values('skill_id', ascending=False).head(10)
Out[14]:
Looks like very general soft skills.
Paul asked: check if a given skill is present across many different job categories (look at the first letter of ROME for example) -> possible there's a bias towards only reusing the exact same skill if the jobs belong in the same job category, which would defeat some of the purpose.
In [15]:
skills['category'] = skills.job_group_id.str.slice(0, 1)
skills.groupby('name').category.nunique().hist(bins=14);
In [16]:
skills.groupby(['name', 'type']).category.nunique().hist(bins=14);
In [17]:
skills.groupby(['name', 'type']).category.nunique().reset_index().hist(by='type', density=True);
Most skills are only used within a single ROME category. However, theoretical skills are more spread across categories than action skills are.
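The histograms could be complemented by a single number per skill type: the share of skills that stay within one category. This is a sketch on made-up rows (the job group codes and skill names are hypothetical), following the same `category` definition as above.

```python
import pandas as pd

# Hypothetical toy data, not taken from ROME.
toy = pd.DataFrame({
    'job_group_id': ['A1101', 'B1201', 'A1101', 'A1202', 'C1301', 'C1302'],
    'name': ['civil law', 'civil law', 'welding', 'welding', 'pruning', 'office tools'],
    'type': ['theoretical_skill', 'theoretical_skill', 'action_skill',
             'action_skill', 'action_skill', 'theoretical_skill'],
})
toy['category'] = toy.job_group_id.str.slice(0, 1)

# Number of distinct categories each skill appears in.
categories_per_skill = toy.groupby(['name', 'type']).category.nunique()

# Share of skills, per type, that appear in exactly one category.
single_category = (categories_per_skill == 1).astype(int).groupby(level='type').mean()
print(single_category)
```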