Data Cleaning

Cleaning Invalid Values

Data cleaning is the first, and sometimes considered the most critical, stage since it can affect the results either positively or negatively.

In the following sections, we present the main cleaning steps applied to several dataset columns.

salary_min

  • Ensure that all values are greater than or equal to zero

salary_max

  • Ensure that all values are greater than or equal to zero

num_vacancies

  • Ensure that all values are greater than or equal to zero

experience_years_min

  • Ensure that all values are greater than or equal to zero (a combined sketch for all four checks follows)
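
The four checks above share the same rule, so they can be applied in one pass. Below is a minimal pandas sketch, assuming the dataset is already loaded into a DataFrame with these exact column names; the helper name and the choice to keep NaN values are illustrative assumptions, not part of the original pipeline.

    import pandas as pd

    # Columns that must hold values >= 0, per the checks listed above.
    NON_NEGATIVE_COLS = ["salary_min", "salary_max",
                         "num_vacancies", "experience_years_min"]

    def drop_negative_rows(df: pd.DataFrame) -> pd.DataFrame:
        """Drop rows where any listed column is negative (NaNs are kept)."""
        mask = pd.Series(True, index=df.index)
        for col in NON_NEGATIVE_COLS:
            values = pd.to_numeric(df[col], errors="coerce")
            mask &= values.isna() | (values >= 0)
        return df[mask]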

Data Preprocessing

By data pre-processing we mean the steps that either modify existing values or generate new values based on them.

The newly generated [appended] columns are designed to be reused many times in the next steps to produce fast and accurate analysis reports.

experience_years

  • We might expect a numeric column describing the required experience years for a job, but it turns out to be an [alphanumeric] column.

  • These unstructured values follow patterns like the following to describe the experience years:

    3-2
    above 5 years
    +3
    less than 6 years ..

  • We start by filtering out the alphabetic characters and applying some NLP rules to extract the meaning of these words. We then extract the values into two new columns: [experience_years_min] for the minimum required experience years and [experience_years_max] for the maximum required experience years, in addition to [experience_years_type], which describes the duration type [year or month]. A sketch of this parsing follows.
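
Below is a rough illustrative sketch of such parsing with pandas and regular expressions; the keyword rules here are assumptions standing in for the actual NLP rules.

    import re
    import pandas as pd

    COLS = ["experience_years_min", "experience_years_max", "experience_years_type"]

    def parse_experience(text):
        """Sketch: extract min/max experience and the duration unit from
        free-text entries such as '3-2', '+3', or 'above 5 years'."""
        if not isinstance(text, str):
            return pd.Series([None, None, None], index=COLS)
        lowered = text.lower()
        unit = "month" if "month" in lowered else "year"
        numbers = [int(n) for n in re.findall(r"\d+", lowered)]
        lo = hi = None
        if len(numbers) >= 2:                    # e.g. '3-2' -> min 2, max 3
            lo, hi = min(numbers), max(numbers)
        elif len(numbers) == 1:
            n = numbers[0]
            if "+" in lowered or "above" in lowered or "more" in lowered:
                lo = n                           # open-ended minimum
            elif "less" in lowered or "under" in lowered:
                hi = n                           # open-ended maximum
            else:
                lo = hi = n
        return pd.Series([lo, hi, unit], index=COLS)

Applying it is then a one-liner: `df[COLS] = df["experience_years"].apply(parse_experience)`.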

post_date

  • Although [post_date] is already a structured column, we extracted two new columns from it that will be used frequently later: [post_data_month] for the month and [post_data_year] for the year, as sketched below.
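
A minimal sketch of this extraction with pandas, keeping the column names exactly as written above (assuming `df` is the loaded DataFrame):

    import pandas as pd

    # df is the loaded jobs DataFrame; coerce unparsable dates to NaT first.
    df["post_date"] = pd.to_datetime(df["post_date"], errors="coerce")
    df["post_data_month"] = df["post_date"].dt.month  # month component
    df["post_data_year"] = df["post_date"].dt.year    # year component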

description | job_requirements | displayed_job_title

We applied several pre-processing tasks to these rich-text columns (a combined sketch follows the list):

  1. Remove HTML Tags
  2. Apply English case-folding [uppercase to lowercase]
  3. Normalize Arabic characters, for example:

    ة == ه
    أ,آ,إ == ا
    ى == ي

  4. Remove all special characters [*, %, #, (, ...]

  5. Remove stop words [the, of, on, from, ...]
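
A minimal sketch combining the five steps with plain Python and `re`; the stop-word set is a tiny illustrative sample, not the full list a real pipeline (e.g. NLTK's English list plus an Arabic one) would use.

    import re

    # Tiny illustrative stop-word sample, standing in for a full list.
    STOP_WORDS = {"the", "of", "on", "from", "and", "in", "to", "a"}

    # Arabic normalization map from step 3 above.
    ARABIC_MAP = str.maketrans({"ة": "ه", "أ": "ا", "آ": "ا", "إ": "ا", "ى": "ي"})

    def clean_text(text):
        text = re.sub(r"<[^>]+>", " ", text)    # 1. remove HTML tags
        text = text.lower()                     # 2. English case-folding
        text = text.translate(ARABIC_MAP)       # 3. normalize Arabic characters
        text = re.sub(r"[^\w\s]", " ", text)    # 4. remove special characters
        tokens = [t for t in text.split()       # 5. remove stop words
                  if t not in STOP_WORDS]
        return " ".join(tokens)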

city_name

We consider this the **most challenging** pre-processing task, for these reasons:

1. Mixed Arabic/English values
2. Users could enter several cities at once. For example: [Cairo and Giza], [Cairo, Aswan and Giza], [القاهرة و الاسماعيلية], ...
3. Typos or misspelled city names
4. The huge variance among the unique entries

Before the pre-processing step, the total number of unique values in [city_name] was **478**.

We applied several steps to minimize and normalize this variance (a sketch follows the list):

1. Remove some special characters [%, ^, (, ...]
2. Detect whether several cities are mentioned in the same entry
3. Prepare a [gazetteer] list of all Egyptian cities, extracted from the English Wikipedia, and translate some of the most common city names into Arabic
4. Unify all the different values that could refer to the same entity. For example, all of these values [**Cairo, Cairo and , cairo, القاهرة,قاهرة**] become [**cairo**]
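
Below is a minimal sketch of steps 1, 2, and 4 together, assuming a small gazetteer dictionary; the real gazetteer is the full Wikipedia-based city list from step 3, and the separators handled here are illustrative.

    import re

    # Tiny illustrative gazetteer; the real one covers all Egyptian cities
    # extracted from Wikipedia, plus Arabic spellings of the common ones.
    GAZETTEER = {
        "cairo": "cairo", "القاهرة": "cairo", "قاهرة": "cairo",
        "giza": "giza", "الجيزة": "giza",
        "aswan": "aswan", "اسوان": "aswan",
        "ismailia": "ismailia", "الاسماعيلية": "ismailia",
    }

    def normalize_city(raw):
        """Map one raw city_name entry to a list of canonical city names."""
        cleaned = re.sub(r"[%^()*#@!.;:/\\-]", " ", raw.lower())       # step 1
        # Step 2: split multi-city entries on commas, 'and', or the Arabic 'و'.
        parts = re.split(r",|\band\b|\sو\s", cleaned)
        # Step 4: unify each part through the gazetteer (unknown parts are
        # dropped here; the real pipeline also handles typos).
        return [GAZETTEER[p.strip()] for p in parts if p.strip() in GAZETTEER]

For example, `normalize_city("Cairo, Aswan and Giza")` returns `["cairo", "aswan", "giza"]`, and `normalize_city("القاهرة و الاسماعيلية")` returns `["cairo", "ismailia"]`.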

In the end, the number of unique values in [city_name] decreased to just **251** :)