ROME skills

Author: Stephan stephan@bayesimpact.org

Date: Jun 14, 2016

We want to allow our users to build a profile of skills they have, in order to suggest suitable jobs to them. A PRD for this feature can be found at http://go/pe:profile-prd.

We want to use the skills taxonomy from the ROME dataset because it allows us to extract a mapping from skills to jobs. The main questions we want to answer in this notebook are:

  • How many different skills do we have in total?
  • How many skills are common for a job?
  • Any other properties of the skills dataset?
  • Are some skills more unique to a job than others?
  • TODO: Are there jobs that have the exact same set of skills?

In [1]:
import os

import pandas as pd
import seaborn as _  # Imported for its side effect on plot styling; not used directly.

from bob_emploi.data_analysis.lib import read_data
from bob_emploi.data_analysis.lib import plot_helpers

data_folder = os.getenv('DATA_FOLDER')

Before using the XML version of the data, I had a look at the CSV data. I saw that I could use unix_coherence_item_v330_utf8 to establish a mapping between job_groups and skills. However, the XML data also contains an ordering of the skills, which could be quite useful if it actually represents an ordering from more to less specific. That is why I used the XML version of the data in this case.


In [2]:
fiche_dicts = read_data.load_fiches_from_xml(os.path.join(data_folder, 'rome/ficheMetierXml'))
rome = [read_data.fiche_extractor(f) for f in fiche_dicts]
skills_data = [dict(skill, code_rome=job['code_rome'], job_group_name=job['intitule'])
               for job in rome
               for skill in job['skills']]
skills_raw = pd.DataFrame(skills_data)
skills_raw.columns = ['skill_id', 'job_group_id', 'type', 'job_group_name', 'name', 'position', 'priorisation']

Overview of dataset structure

A peek at the raw data.


In [3]:
skills_raw.head()


Out[3]:
   skill_id  job_group_id  type  job_group_name                             name                                               position  priorisation
0  116765    A1101         1     Conduite d'engins agricoles et forestiers  Procédures de maintenance de matériel              1         1
1  116766    A1101         1     Conduite d'engins agricoles et forestiers  Procédures de maintenance de locaux                2         1
2  104547    A1101         1     Conduite d'engins agricoles et forestiers  Pneumatique                                        3         1
3  109995    A1101         1     Conduite d'engins agricoles et forestiers  Utilisation de système informatique (embarqué ...  4         1
4  100010    A1101         1     Conduite d'engins agricoles et forestiers  Utilisation d'engins forestiers                    5         1

In [4]:
skills_raw.describe().transpose()


Out[4]:
                count  unique  top                                   freq
skill_id        5939   1888    118975                                164
job_group_id    5939   531     K2107                                 66
type            5939   1       1                                     5939
job_group_name  5939   531     Enseignement général du second degré  66
name            5939   1887    Outils bureautiques                   164
position        5939   66      1                                     557
priorisation    5939   1       1                                     5939

So we have 1888 different skills for the 531 different job groups in ROME, spread over 5939 skill occurrences. The type column contains a single value in this table, so all occurrences are of the same type. The most common skill is Outils bureautiques, which occurs 164 times. The position column suggests that the skills are ordered, and the priorisation column does not contain any information because it has the same value in every row.

Let's clean up the data a little and look at the distribution of skill types on the deduplicated list of skills. By that I mean investigating just the list of 1888 unique skills, instead of the skills associated with jobs as in the table above.


In [5]:
# Mapping extracted from data/rome/csv/unix_referentiel_competence_v330_utf8.csv
COMPETENCE_TYPES = {
    '1': 'theoretical_skill',
    '2': 'action_skill'
}

skills = skills_raw.drop(['priorisation'], axis=1)
skills['type'] = skills.type.map(COMPETENCE_TYPES)
skills['position'] = skills.position.astype(int)

In [6]:
dedup_skills = skills.drop_duplicates(subset=['skill_id'])
dedup_skills['type'].value_counts(normalize=True)


Out[6]:
theoretical_skill    1.0
Name: type, dtype: float64

The deduplicated list of skills shows the same picture as the skills associated with jobs: every skill is a theoretical skill.

Just as a sanity check, make sure that each skill is only associated with one type.


In [7]:
skills.groupby('skill_id')['type'].nunique().value_counts()


Out[7]:
1    1888
Name: type, dtype: int64

Good!

Ordering of skills

Does that ordering actually play a role?

Let's first check whether the position is unique per job_group or per job_group and type. From looking at the XML it looks like there are individual orderings per type, but I want to make sure.


In [8]:
skills.duplicated(subset=['job_group_id', 'type', 'position']).sum()


Out[8]:
37

In [9]:
skills.duplicated(subset=['job_group_id', 'position']).sum()


Out[9]:
37

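Both checks find the same 37 duplicated positions, so in this data the type makes no difference: position is almost, but not strictly, unique within a job_group.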
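
To make the two checks above easier to read, here is a minimal sketch on made-up data (the `toy` frame and its values are invented for illustration, not taken from ROME). If positions restart at 1 for each type, the check that includes `type` finds no duplicates while the check without it does. Identical counts from both checks mean the type adds no information about the ordering.

```python
import pandas as pd

# Hypothetical toy data: one job group whose positions restart at 1 for each type.
toy = pd.DataFrame({
    'job_group_id': ['A1101'] * 4,
    'type': ['theoretical_skill', 'theoretical_skill', 'action_skill', 'action_skill'],
    'position': [1, 2, 1, 2],
})

# Positions are unique once the type is taken into account...
dups_with_type = toy.duplicated(subset=['job_group_id', 'type', 'position']).sum()
# ...but they collide when the type is ignored.
dups_without_type = toy.duplicated(subset=['job_group_id', 'position']).sum()

print(dups_with_type, dups_without_type)
```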

Are the skills higher up in the ordering maybe more specific to a certain job_group?


In [10]:
skills = skills.sort_values('position')
first = skills.groupby(['job_group_name', 'type']).first().name
last = skills.groupby(['job_group_name', 'type']).last().name
pd.concat([first, last], axis=1, keys=['first', 'last'])


Out[10]:
first last
job_group_name type
Abattage et découpe des viandes theoretical_skill Etourdissement d'un animal par électrocution/é... Appréciation sensorielle
Accompagnement de voyages, d'activités culturelles ou sportives theoretical_skill Techniques d'animation de groupe Techniques de communication
Accompagnement et médiation familiale theoretical_skill Techniques de médiation Techniques de communication
Accompagnement médicosocial theoretical_skill Psychomotricité Gestes d'urgence et de secours
Accueil et renseignements theoretical_skill Veille informationnelle Modalités d'accueil
Accueil et services bancaires theoretical_skill Traitement des opérations sur titres Procédures d'administration de compte bancaire
Accueil touristique theoretical_skill Méthode de classement et d'archivage Principes de la relation client
Achat vente d'objets d'art, anciens ou d'occasion theoretical_skill Histoire de l'art Procédures d'encaissement
Achats theoretical_skill Techniques commerciales Procédures d'appels d'offres
Action sociale theoretical_skill Dispositifs d'aide sociale Droit de la sécurité sociale
Administration de systèmes d'information theoretical_skill Réglementation sur la protection des données à... Caractéristiques des logiciels d'interface (mi...
Administration des ventes theoretical_skill Analyse statistique Organisation de la chaîne logistique
Affrètement transport theoretical_skill Techniques commerciales Géographie des transports
Aide agricole de production fruitière ou viticole theoretical_skill Eclaircissage Normes qualité
Aide agricole de production légumière ou végétale theoretical_skill Desherbage Normes qualité
Aide aux bénéficiaires d'une mesure de protection juridique theoretical_skill Techniques de conduite d'entretien Caractéristiques socio-culturelles des publics
Aide aux soins animaux theoretical_skill Techniques d'approche et de manipulation des a... Pathologies animales
Aide d'élevage agricole et aquacole theoretical_skill Techniques d'approche et de manipulation des a... Réglementation d'Appellation d'Origine Contrôl...
Aide en puériculture theoretical_skill Pathologies de l'enfant Diététique
Aide et médiation judiciaire theoretical_skill Généalogie Droit européen
Ajustement et montage de fabrication theoretical_skill Banc de contrôle Utilisation d'outillages manuels
Analyse de crédits et risques bancaires theoretical_skill Techniques pédagogiques Règles de traitement des opérations bancaires
Analyse de tendance theoretical_skill Veille commerciale Méthodes d'enquête
Analyse et ingénierie financière theoretical_skill Veille informationnelle Calculs financiers
Analyses médicales theoretical_skill Guide de Bonne Utilisation de l'Informatique (... Validation biologique
Animation d'activités culturelles ou ludiques theoretical_skill Outils bureautiques Techniques de communication
Animation de loisirs auprès d'enfants ou d'adolescents theoretical_skill Psychologie de l'enfant Techniques d'éveil de l'enfant
Animation de site multimédia theoretical_skill Règles de diffusion et de communication de l'i... Rédaction de contenu web
Animation de vente theoretical_skill Utilisation de micro Techniques de vente
Animation musicale et scénique theoretical_skill Répertoire de musiques pop/rock Caractéristiques des matériels d'éclairage
... ... ... ...
Transaction immobilière theoretical_skill Méthodes de transaction immobilière Droit immobilier
Travaux d'étanchéité et d'isolation theoretical_skill Techniques de pose des revêtements souples Collage à froid
Trésorerie et financement theoretical_skill Gestion de trésorerie Comptabilité analytique
Téléconseil et télévente theoretical_skill Techniques de vente par téléphone Argumentation commerciale
Vente de voyages theoretical_skill Techniques de prévention et de gestion de conf... Géographie du tourisme
Vente de végétaux theoretical_skill Procédures d'encaissement Typologie du client
Vente en alimentation theoretical_skill Utilisation d'engins de manutention non motori... Utilisation d'appareils de lecture optique de ...
Vente en animalerie theoretical_skill Procédures d'encaissement Procédures de prévention des risques sanitaires
Vente en articles de sport et loisirs theoretical_skill Procédures d'encaissement Argumentation commerciale
Vente en décoration et équipement du foyer theoretical_skill Procédures d'encaissement Gestes et postures de manutention
Vente en gros de matériel et équipement theoretical_skill Commande en gros Techniques commerciales
Vente en gros de produits frais theoretical_skill Commande en gros Normes rédactionnelles
Vente en habillement et accessoires de la personne theoretical_skill Procédures d'encaissement Typologie du client
Éclairage spectacle theoretical_skill Procédures de maintenance de matériel d'éclairage Utilisation d'outillages manuels
Éducation de jeunes enfants theoretical_skill Techniques d'écoute et de la relation à la per... Règles d'hygiène et de sécurité
Éducation en activités sportives theoretical_skill Techniques d'animation de groupe Gestion de projet
Éducation et surveillance au sein d'établissements d'enseignement theoretical_skill Règles de vie collective Techniques de prévention et de gestion de conf...
Élaboration de plan média theoretical_skill Techniques de mesure d'audience Gestion de projet
Électricité bâtiment theoretical_skill Electricité du domaine des Voix, Données, Imag... Mécanique
Élevage bovin ou équin theoretical_skill Pathologies animales Réglementation d'Appellation d'Origine Contrôl...
Élevage d'animaux sauvages ou de compagnie theoretical_skill Règles de sécurité Engins agricoles
Élevage de lapins et volailles theoretical_skill Installations énergétiques Réglementation d'Appellation d'Origine Contrôl...
Élevage ovin ou caprin theoretical_skill Biologie animale Pathologies animales
Élevage porcin theoretical_skill Logiciel de calcul de ration alimentaire Utilisation de matériel de nettoyage
Études - modèles en industrie des matériaux souples theoretical_skill Logiciels de modélisation et simulation Procédures d'essayage
Études actuarielles en assurances theoretical_skill Finance Droit des assurances
Études et développement de réseaux de télécoms theoretical_skill Protocoles et normes télécoms Traitement du signal
Études et développement informatique theoretical_skill Technologies de l'accessibilité numérique Programmation informatique
Études et prospectives socio-économiques theoretical_skill Econométrie Logiciels de gestion de base de données
Études géologiques theoretical_skill Sondage de sol Mécanique des fluides

531 rows × 2 columns

Feedback I got from Pascal: I'm sorry but even if I read French quite fluently I cannot tell whether they are ordered, so I believe they are not. "Éducation de jeunes enfants" (education of young children) has "Civil Law" as its first theoretical skill and "Techniques of listening and relating to people" as its last: I believe the last one is more correlated to the job group. "Élevage ovin ou caprin" (breeding of sheep or goats) has "Setup of birth cages" as its first action skill and "welding" as its last. I believe the first one is more correlated to the job group.

TODO:

However I feel that we can compute some interesting measure of it:

  1. assign a specificity score to each skill by computing some reverse function of its frequency among job groups (1 / # of job groups with this skill)
  2. normalize positions from 0 to 1 for each job group
  3. plot 1 & 2 and see if there's any kind of correlation. Supposedly we should see very specific skills being closer to 0.

To make it even closer to something useful, we could also weight each job group by the number of people in this job group, for instance using a proxy from the FHS.
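
The three steps of this TODO could be sketched roughly as follows. This is only an illustration under assumptions: the `toy` frame and the helper name `specificity_vs_position` are hypothetical, the specificity score is the simple 1 / (number of job groups with the skill) suggested above, and a Pearson correlation stands in for the plot. The weighting by job group size is left out.

```python
import pandas as pd

# Hypothetical toy stand-in for the `skills` frame used in this notebook.
toy = pd.DataFrame({
    'job_group_id': ['A', 'A', 'A', 'B', 'B', 'B'],
    'skill_id': ['s1', 's2', 's3', 's4', 's2', 's3'],
    'position': [1, 2, 3, 1, 2, 3],
})


def specificity_vs_position(skills):
    """Add a specificity score and a 0-1 normalized position to each row."""
    result = skills.copy()
    # Step 1: specificity = 1 / number of job groups that use the skill.
    groups_per_skill = skills.groupby('skill_id').job_group_id.nunique()
    result['specificity'] = 1 / result.skill_id.map(groups_per_skill)
    # Step 2: rescale positions to [0, 1] within each job group.
    positions = result.groupby('job_group_id').position
    lowest = positions.transform('min')
    # Job groups with a single skill have a span of 0; use 1 to avoid dividing by 0.
    span = (positions.transform('max') - lowest).replace(0, 1)
    result['normalized_position'] = (result.position - lowest) / span
    return result


result = specificity_vs_position(toy)
# Step 3: a negative correlation would mean specific skills tend to come first.
correlation = result.specificity.corr(result.normalized_position)
print(result)
print(correlation)
```

In this toy data the skills used by a single job group sit at position 1, so the correlation is negative. Whether the same holds for the ROME data is exactly what the TODO is meant to find out.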

Skills per job

How many different skills are usually associated with a job?


In [11]:
skills.groupby('job_group_id').skill_id.nunique().hist();


That's a neat distribution with a clear peak around 10. Florian asked in the PRD 'how many skills do we plan to collect per user'. With users having had several jobs in their past, I would expect a user to pick between 15 and 30 skills, as a very rough estimate. This of course also depends on the typical skill overlap between jobs, and on a probably higher skill overlap between jobs in the same field.

Also, as we saw in the summary statistics above, all 531 job groups in ROME have at least one skill associated with them.

Is there a difference between the number of theoretical and action skills?


In [12]:
by_job_by_type = skills.groupby(['job_group_id', 'type'])
unique_job_counts = by_job_by_type.skill_id.nunique().reset_index()
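# `density=True` replaces the `normed=True` argument, which recent matplotlib versions no longer accept.
unique_job_counts.hist(by='type', density=True, sharex=True, sharey=True);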


The theoretical skills resemble the above distribution, but many jobs seem to have only one or two action skills.

Let's see how specific skills are to a job group.

Because each skill only appears once per job_group, I can simply divide the skill frequency by number of job_groups.
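
The assumption that a skill appears at most once per job_group can be checked directly before relying on it. The sketch below uses a made-up `toy` frame (hypothetical values) standing in for `skills`. Counting by `skill_id` rather than by `name` is a choice made for this sketch, because the summary table above lists 1888 skill ids but only 1887 names.

```python
import pandas as pd

# Hypothetical toy stand-in for the `skills` frame used in this notebook.
toy = pd.DataFrame({
    'job_group_id': ['A', 'A', 'B', 'B', 'C'],
    'skill_id': ['s1', 's2', 's1', 's3', 's1'],
})

# Check the assumption: no (job group, skill) pair occurs more than once.
assert not toy.duplicated(subset=['job_group_id', 'skill_id']).any()

# Share of job groups in which each skill appears.
skill_share = toy.skill_id.value_counts() / toy.job_group_id.nunique()
print(skill_share)
```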


In [13]:
skill_frequency = skills.name.value_counts() / skills.job_group_id.nunique()
skill_frequency.index = pd.Series(range(len(skill_frequency))) / len(skill_frequency)
skill_frequency.plot(figsize=(8, 4));


There seem to be very few skills that are associated with many different jobs. Most of the skills seem to be very specific to a job.

That means that we might have trouble suggesting totally new jobs for the skills a user imported from their old job, because there might be hardly any overlap between the skills of their old job and those of a potential new job, except for the few very common skills that probably don't provide much signal.
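
One way to quantify this overlap would be a Jaccard similarity between the skill sets of two job groups (shared skills divided by all skills of either group). The sketch below is an assumption about how this could be measured, not something computed in this notebook. The `toy` frame and the `jaccard` helper are hypothetical. A similarity of exactly 1 would also answer the open question from the introduction of whether two job groups have the exact same set of skills.

```python
import itertools

import pandas as pd

# Hypothetical toy stand-in for the `skills` frame used in this notebook.
toy = pd.DataFrame({
    'job_group_id': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'skill_id': ['s1', 's2', 's3', 's3', 's4', 's1', 's2', 's3'],
})

# One set of skill ids per job group.
skill_sets = {
    job_group: frozenset(skill_ids)
    for job_group, skill_ids in toy.groupby('job_group_id').skill_id
}


def jaccard(left, right):
    """Size of the intersection divided by the size of the union."""
    return len(left & right) / len(left | right)


# Similarity for every unordered pair of job groups.
overlaps = {
    (a, b): jaccard(skill_sets[a], skill_sets[b])
    for a, b in itertools.combinations(sorted(skill_sets), 2)
}
# Pairs of job groups that share the exact same set of skills.
identical = [pair for pair, score in overlaps.items() if score == 1.0]

print(overlaps)
print(identical)
```

With 531 job groups this is about 140,000 pairs, which should still be cheap enough to compute in one go.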

What are those very common skills?


In [14]:
counts = skills.groupby(['name', 'type']).skill_id.count().to_frame()
counts.sort_values('skill_id', ascending=False).head(10)


Out[14]:
name                               type               skill_id
Outils bureautiques                theoretical_skill  164
Normes qualité                     theoretical_skill  84
Règles de sécurité                 theoretical_skill  83
Lecture de plan, de schéma         theoretical_skill  68
Mécanique                          theoretical_skill  57
Techniques de communication        theoretical_skill  53
Techniques commerciales            theoretical_skill  52
Electricité                        theoretical_skill  50
Règles et consignes de sécurité    theoretical_skill  50
Gestes et postures de manutention  theoretical_skill  48

Looks like very general soft skills.

Skills spread over categories?

Paul asked: check if a given skill is present across many different job categories (look at the first letter of ROME for example) -> possible there's a bias towards only reusing the exact same skill if the jobs belong in the same job category, which would defeat some of the purpose.


In [15]:
skills['category'] = skills.job_group_id.str.slice(0, 1)
skills.groupby('name').category.nunique().hist(bins=14);



In [16]:
skills.groupby(['name', 'type']).category.nunique().hist(bins=14);



In [17]:
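# `density=True` replaces the `normed=True` argument, which recent matplotlib versions no longer accept.
skills.groupby(['name', 'type']).category.nunique().reset_index().hist(by='type', density=True);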


Most skills are only used within the same ROME category. However, theoretical skills are more spread over categories than action skills are.
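
To put a number on "most skills", the share of skills confined to a single category could be computed along these lines. This is a sketch on a made-up `toy` frame (hypothetical values) standing in for `skills`, and it groups by `skill_id` instead of `name`, which is an assumption of the sketch.

```python
import pandas as pd

# Hypothetical toy stand-in for the `skills` frame used in this notebook.
toy = pd.DataFrame({
    'job_group_id': ['A1101', 'A1201', 'K2107', 'A1101', 'K2107'],
    'skill_id': ['s1', 's1', 's1', 's2', 's3'],
})
# The ROME category is the first letter of the job group id.
toy['category'] = toy.job_group_id.str.slice(0, 1)

# Number of distinct categories in which each skill is used.
categories_per_skill = toy.groupby('skill_id').category.nunique()
# Fraction of skills that are used in exactly one category.
single_category_share = (categories_per_skill == 1).mean()

print(categories_per_skill)
print(single_category_share)
```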

Conclusion

  • We have a list of ~1900 skills (1888 distinct skill ids).
  • The skill referential distinguishes theoretical and action skills, but every skill in the table loaded above is a theoretical skill.
  • Skills are ordered within a job_group, but we still need to confirm that this ordering is related to the specificity of a skill.
  • There are on average ~11 different skills assigned to a job (5939 skill occurrences over 531 job groups).
  • Most skills are pretty specific to a job_group.
  • Most skills are specific to a single ROME category.