Author: Stephan, stephan@bayesimpact.org

Skip the run test because the ROME version has to be updated to make it work in the exported repository. TODO: Update ROME and remove the skiptest flag.

Holland Codes

https://en.wikipedia.org/wiki/Holland_Codes

There is a theory that associates 6 basic characteristics with people. These characteristics can be used to help people find jobs they like. The theory still seems to be widely used for career counceling.

RIASEC hexagon

Usually people are gradually associated with multiple characteristics, a person does ususally not fit neatly into one of the boxes. However when people are associated with multiple characteristics, these tend to be neighboring on the above hexagon. It is less common that someone is associated with two characteristics on opposite ends of the hexagaon. These are even called inconsistent personality patterns.

In the ROME dataset each job and each activity has a major and a minor Holland Code assigned. We want to use these to help people find a job in a field they like, and maybe did not think of previously.



In [1]:

    
from __future__ import division
import glob
import json
import os
import itertools as it

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import xmltodict
import numpy as np

from bob_emploi.data_analysis.lib import read_data

data_folder = os.getenv('DATA_FOLDER')



In [2]:

    
def riasec_dist(first, second):
    '''compute the distance between two characteristics on the hexagon'''
    if pd.isnull(first) or pd.isnull(second):
        return np.nan
    riasec = "RIASEC"
    a = riasec.find(first.upper())
    b = riasec.find(second.upper())
    assert a >= 0 and b >= 0
    return min( (a-b)%6, (b-a)%6)

# to call it on a dataframe row
riasec_dist_row = lambda row: riasec_dist(row.riasec_majeur, row.riasec_mineur)

Job fiches

First load all the job_groups (fiche metier) from the XML files



In [3]:

    
fiche_dicts = read_data.load_fiches_from_xml(os.path.join(data_folder, 'rome/ficheMetierXml'))



In [4]:

    
fiches = pd.DataFrame(fiche['bloc_code_rome'] for fiche in fiche_dicts)
fiches['riasec_mineur'] = fiches.riasec_mineur.str.upper()
fiches['combined'] = fiches.riasec_majeur + fiches.riasec_mineur
fiches['riasec_dist'] = fiches.apply(riasec_dist_row, axis=1)

Visualize the distributions of Holland Codes for job fiches



In [5]:

    
def visualize_codes(thing):
    '''Visualize the distribution of Holland codes
    major codes, minor codes, the combinations of both 
    and distances between
    '''
    riasec_counts = thing.riasec_majeur.value_counts().to_frame()
    riasec_counts['riasec_mineur'] = thing.riasec_mineur.value_counts()

    fig, ax = plt.subplots(3, figsize=(10, 10))
    riasec_counts.plot(kind='bar', ax=ax[0])
    thing.combined.value_counts().plot(kind='bar', ax=ax[1])
    thing.riasec_dist.hist(ax=ax[2])
    ax[0].set_title('Frequency of major and minor codes')
    ax[1].set_title('Frequency of major-minor combinations')
    ax[2].set_title('Histogram of hexagon distances')

    fig.tight_layout()
    
visualize_codes(fiches)

Holland Codes of activities associated with jobs



In [6]:

    
def extract(fiche):
    '''extract the base activities associated with a job fiche'''
    base_acts = fiche['bloc_activites_de_base']['activites_de_base']['item_ab'] 
    rome = {'rome_' + k: v for k, v in fiche['bloc_code_rome'].items()}
    return [dict(rome, **ba) for ba in base_acts]

fiche_acts = pd.DataFrame(sum(map(extract, fiche_dicts), []))
fiche_acts['riasec_mineur'] = fiche_acts.riasec_mineur.str.upper()
fiche_acts['rome_riasec_mineur'] = fiche_acts.riasec_mineur.str.upper()

How often are the Holland Codes of the activity the same as for the job?



In [7]:

    
combinations = it.product(['majeur', 'mineur'], ['majeur', 'mineur'])
for job, act in combinations:
    job_key = 'rome_riasec_' + job
    act_key = 'riasec_' + act
    match_count = (fiche_acts[job_key] == fiche_acts[act_key]).sum()
    fmt_str = "{} job fiche matches {} activity fiche in {:.2f}%"
    print(fmt_str.format(job, act, match_count / len(fiche_acts) * 100))









    



majeur job fiche matches majeur activity fiche in 56.58%
majeur job fiche matches mineur activity fiche in 13.15%
mineur job fiche matches majeur activity fiche in 0.00%
mineur job fiche matches mineur activity fiche in 42.95%

Activities

Let's look at the Holland Codes associated with activities.



In [8]:

    
activities = pd.read_csv('../../../data/rome/csv/unix_referentiel_activite_v330_utf8.csv')
act_riasec = pd.read_csv('../../../data/rome/csv/unix_referentiel_activite_riasec_v330_utf8.csv')
acts = pd.merge(activities, act_riasec, on='code_ogr')

acts['riasec_mineur'] = acts.riasec_mineur.str.upper()
acts['combined'] = acts.riasec_majeur + fiches.riasec_mineur
acts['riasec_dist'] = acts.apply(riasec_dist_row, axis=1)

base_acts = acts[acts.libelle_type_activite == 'ACTIVITE DE BASE']

visualize_codes(acts)
visualize_codes(fiches) #for comparison