Preparations

  • Import libraries:

In [ ]:
import my_util; from my_util import *
  • Load data

In [ ]:
HOME_DIR = 'd:/larc_projects/job_analytics/'
DATA_DIR = HOME_DIR + 'data/clean/'

# job descriptions (JDs)
init_posts = pd.read_csv(DATA_DIR + 'jd_df.csv')

In [ ]:
skill_df = pd.read_csv(DATA_DIR + 'skill_index.csv')
init_skills = skill_df['skill']
jd_docs = list(init_posts['clean_text'].apply(str.lower))

n_skill, n_jd = len(init_skills) , init_posts.shape[0]
print('Initial no. of skills: %d' %n_skill)
print('Initial no. of JDs: %d' %n_jd) # some garbage JDs with no text already removed

skill_df.head(3)

Filtering

There are two goals: (i) remove JDs containing too few skills, and (ii) remove skills occurring in too few JDs. Since removing JDs changes the document counts of skills (and vice versa), we repeat the following process until both goals are satisfied.

  • Count the no. of unique skills in each JD
  • Remove JDs with $\leq 1$ skill
  • Count the no. of JDs containing each skill
  • Remove skills occurring in $\leq 1$ JD

In [ ]:
n_iter, posts, skills = 0, init_posts, init_skills
n_post = posts.shape[0]

stop_cond, thres = False, .98
while not stop_cond:
    n_iter = n_iter + 1
    print('Iteration %d' %n_iter)
    new_posts = extractJDs(posts, skills, min_n_skill=2)
    n_new_post = new_posts.shape[0]
    print('No. of posts after filtering: %d' %n_new_post)
    
    skill_df = extractSkills(skills, new_posts, min_n_jd=2)
    new_skills = skill_df['skill']
    print('No. of skills after filtering: %d' %len(new_skills) )
    stop_cond = (n_new_post >= thres*n_post) and (len(new_skills) >= thres*len(skills))
    
    posts = new_posts
    n_post = posts.shape[0]
    skills = new_skills
# end
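
The helpers extractJDs and extractSkills come from my_util and are not shown in this notebook. Below is a minimal sketch of what they are assumed to do (matching skills by substring search in the lower-cased JD text; the actual implementations may differ):

In [ ]:
# Hypothetical sketches of the my_util helpers used in the filtering loop above.
def extractJDs(posts, skills, min_n_skill=2):
    """Keep only JDs containing at least min_n_skill unique skills."""
    posts = posts.copy()
    docs = posts['clean_text'].str.lower()
    posts['n_uniq_skill'] = docs.apply(lambda d: sum(1 for s in skills if s in d))
    return posts[posts['n_uniq_skill'] >= min_n_skill]

def extractSkills(skills, posts, min_n_jd=2):
    """Keep only skills occurring in at least min_n_jd JDs."""
    docs = posts['clean_text'].str.lower()
    skill_df = pd.DataFrame({'skill': list(skills)})
    skill_df['n_jd_with_skill'] = skill_df['skill'].apply(
        lambda s: docs.str.contains(s, regex=False).sum())
    return skill_df[skill_df['n_jd_with_skill'] >= min_n_jd]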
  • Save the hard-earned JDs and skills after all these filters:

In [ ]:
# print min(posts['n_uniq_skill'])
# print min(skill_df['n_jd_with_skill'])
posts.to_csv(DATA_DIR + 'filtered/posts.csv', index=False)
skill_df.to_csv(DATA_DIR + 'filtered/skills.csv', index=False)
  • Sample job postings:

In [ ]:
posts = posts.sort_values(by='n_uniq_skill', ascending=False)
posts.head()

In [ ]:
# Sanity check by pulling up the skills occurring in the JD with the most skills
# post_with_most_skill = init_posts.query('job_id == {}'.format('JOB-2015-0196805') )

In [ ]:
train_idx, test_idx = mkPartition(n_instance, p=80)
X_train, X_test = doc_skill_tfidf[train_idx, :], doc_skill_tfidf[test_idx, :]
n_train, n_test = X_train.shape[0], X_test.shape[0]
print('Train set has %d JDs and test set has %d JDs' %(n_train, n_test))
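
Here n_instance (the number of JDs) and doc_skill_tfidf (the JD-skill TF-IDF matrix) are assumed to come from earlier cells, and mkPartition is a my_util helper. A plausible sketch of it, assuming a simple random p% split:

In [ ]:
# Hypothetical sketch of mkPartition: a random p% / (100-p)% split of row indices.
def mkPartition(n_instance, p=80):
    idx = np.random.permutation(n_instance)
    n_train = int(round(n_instance * p / 100.0))
    return idx[:n_train], idx[n_train:]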

In [ ]:
stats = pd.DataFrame({'n_train': n_train, 'n_test': n_test, 'n_jd (train & test)': n_post, 'n_skill': len(skills)}, index=[0])
stats.to_csv(RES_DIR + 'stats.csv', index=False)

Find stopword-like skills by IDF


In [ ]:
from ja_helpers import toIDF

In [ ]:
idf = toIDF(terms=skills, doc_term_mat=doc_skill)
idf.sort_values('idf', inplace=True)  # sort by raw IDF; idf_log10 is only computed in the next cell
idf.to_csv(SKILL_DIR + 'skill_idf.csv', index=False)
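
toIDF (imported from ja_helpers) builds a per-skill IDF table from the document-skill matrix doc_skill, which is assumed to be constructed in an earlier cell. A minimal sketch, assuming natural-log IDF and the column names used below:

In [ ]:
# Hypothetical sketch of toIDF: natural-log IDF of each term computed from a
# document-term matrix (rows = JDs, columns = skills).
def toIDF(terms, doc_term_mat):
    n_doc = doc_term_mat.shape[0]
    df = np.asarray((doc_term_mat > 0).sum(axis=0)).ravel()  # document frequency
    return pd.DataFrame({'term': list(terms), 'idf': np.log(float(n_doc) / df)})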

In [ ]:
idf['idf_log10'] = idf['idf'] * np.log10(np.e)  # convert natural-log IDF to base 10
quantile(idf['idf_log10'])
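
quantile is another small helper (presumably from my_util) used here and again below; a plausible sketch:

In [ ]:
# Hypothetical sketch of quantile: summary quantiles of a numeric series.
def quantile(s, probs=(0, .25, .5, .75, .9, .95, 1)):
    return pd.Series(s).quantile(list(probs))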

In [ ]:
idf_log10 = idf['idf_log10']
n, bins, patches = plt.hist(idf_log10, bins=np.unique(idf_log10))
plt.xlabel('IDF of term (log-10 scale)')
plt.ylabel('# terms')
plt.grid(True)
plt.savefig(SKILL_DIR + 'idf_hist.pdf')

plt.show()
plt.close()

In [ ]:
# terms which occur in at least 10% of docs
idf.query('idf_log10 <= 1')

Setting the IDF threshold to 1 did not catch stopword-like terms such as com and can, so I increased the threshold.
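
For reference, assuming idf_log10 = log10(n_doc / df) for a skill with document frequency df, a threshold t on idf_log10 selects exactly the skills appearing in at least n_doc / 10**t JDs; a quick check of the two thresholds used here:

In [ ]:
# Document-frequency fraction implied by an idf_log10 threshold t,
# assuming idf_log10 = log10(n_doc / df).
for t in [1.0, 1.35]:
    print('idf_log10 <= %.2f  <=>  skill occurs in >= %.1f%% of JDs' % (t, 100 / 10**t))
# idf_log10 <= 1.00  <=>  skill occurs in >= 10.0% of JDs
# idf_log10 <= 1.35  <=>  skill occurs in >= 4.5% of JDs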


In [ ]:
idf.query('idf_log10 <= 1.35').to_csv(SKILL_DIR + 'stop_words.csv', index=False)

Filter out stopword skills


In [ ]:
df = pd.read_csv(SKILL_DIR + 'stop_words.csv')
stop_words = df['term']

In [ ]:
skill_df = skill_df[~skill_df['skill'].isin(stop_words)]
print(skill_df.shape)
skill_df.to_csv(SKILL_DIR + 'skill_index.csv', index=False)

Handle reposted jobs

Some jobs are reposted several times, as shown below; thus job ids in job_posts are not unique.


In [ ]:
job_posts = pd.read_csv(DATA_DIR + 'full_job_posts.csv')
job_posts.head(5)

In [ ]:
by_job_id = job_posts[['job_id', 'job_posting_date_history']].groupby('job_id')
res = by_job_id.agg({'job_posting_date_history': lambda x:len(np.unique(x))})

res = res.rename(columns={'job_posting_date_history': 'n_post_date'}).reset_index()
res.sort_values('n_post_date', ascending=False, inplace=True)
res.head()

In [ ]:
quantile(res['n_post_date'])

In [ ]:
repost_jobs = res.query('n_post_date > 1')
print('# jobs reposted: %d' %repost_jobs.shape[0])

Remove jobs without title


In [ ]:
jobs = job_posts[['job_id', 'title', 'employer_name']].drop_duplicates()
print('# records in jobs before merging: %d' %jobs.shape[0])

jobs = pd.merge(jobs, job_desc)
print('# records in jobs after merging: %d' %jobs.shape[0])

jobs_wo_title = job_posts[job_posts['title'].isnull()]
n_job_wo_title = jobs_wo_title.shape[0]
print('# job posts in WDA without title: %d' %n_job_wo_title)

jobs_wo_title

In [ ]:
jobs.to_csv(DATA_DIR + 'jobs.csv', index=False)
jobs.head()

Clean employer data


In [ ]:
employers = pd.read_csv(DATA_DIR + 'employers.csv')
print(employers.shape)

In [ ]:
employers.rename(columns={'company_registration_number_uen_ep': 'employer_id', 'organisation_name_ep': 'employer_name', 
                         'ssic_group_ep': 'industry'}, inplace=True)

# Standardize employer names to uppercase (case-sensitivity problem detected below)
employers['employer_name'] = employers['employer_name'].str.upper()
employers = employers.drop_duplicates()
employers.shape

In [ ]:
# Handle the problem with PRIORITY CONSULTANTS (detected below)
employers.query('employer_name == "PRIORITY CONSULTANTS"')

In [ ]:
employers = employers.drop(10278)
employers.query('employer_name == "PRIORITY CONSULTANTS"')

In [ ]:
employers.to_csv(DATA_DIR + 'employers.csv', index=False)

Merge doc_index, posts and employers to get industry info

Note: we need to maintain the index of doc_index, as it is required to retrieve the correct topic distribution for each document from the matrix doc_topic_distr.
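
mergeKeepLeftIndex (from my_util, not shown) is assumed to behave like pd.merge while preserving the left frame's row index, which a plain merge would reset. A minimal sketch, assuming an unnamed left index:

In [ ]:
# Hypothetical sketch of mergeKeepLeftIndex: inner merge that keeps the left
# frame's index instead of the default RangeIndex that pd.merge returns.
def mergeKeepLeftIndex(left, right, **kwargs):
    merged = left.reset_index().merge(right, **kwargs)
    merged = merged.set_index('index')
    merged.index.name = None
    return merged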


In [ ]:
posts = pd.read_csv(DATA_DIR + 'full_job_posts.csv')
posts.head()

In [ ]:
df = mergeKeepLeftIndex(doc_index, posts[['job_id', 'employer_id']])
df = df.drop_duplicates()
df.shape

In [ ]:
df = mergeKeepLeftIndex(df, employers[['employer_id', 'employer_name', 'industry']])
df = df.drop_duplicates()
df.shape[0]

In [ ]:
df.to_csv(SKILL_DIR + 'doc_index.csv', index=False)

Weird duplications in the result of the first merge

The duplicates were detected as follows:


In [ ]:
# First, verify that duplicates exist
print(len(df.index))
print(len(df.index.unique()))

# Then detect them
import collections
print([(item, count) for item, count in collections.Counter(df.index).items() if count > 1])

df.iloc[25569:25571, :]

The problem is due to inconsistent upper vs. lower case in employer names! That is why we standardize them above.

Lesson learnt: watch out for case-sensitivity issues in the data.

After handling this, we repeat the above process and check for duplications again.
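
The result of the repeated merge is referred to as tmp below; presumably it was produced by re-running the same two merges on the standardized employer table, along these lines:

In [ ]:
# (Assumed) re-run of the merges after standardizing employer names; tmp plays
# the same role as df in the first run above.
tmp = mergeKeepLeftIndex(doc_index, posts[['job_id', 'employer_id']]).drop_duplicates()
tmp = mergeKeepLeftIndex(tmp, employers[['employer_id', 'employer_name', 'industry']]).drop_duplicates()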


In [ ]:
print([(item, count) for item, count in collections.Counter(tmp.index).items() if count > 1])

In [ ]:
tmp.iloc[29403:29405, :]

This time it is because the company PRIORITY CONSULTANTS registered itself under two different industries: Administrative Service and Scientific Activities.