Data cleaning is the first, and sometimes considered the most critical, stage since it can affect the results either positively or negatively.
In the following sections we present some of the most useful cleaning steps for selected dataset columns.
By data pre-processing we mean the steps that modify the current values or generate new values based on the existing ones.
The newly [appended] generated columns are designed to be reused many times in later steps to produce fast and accurate analysis reports.
We might expect to find a numeric column describing the required experience years for a job, but it turns out to be an [alphanumeric] column.
These unstructured values tend to follow patterns like the following to describe the experience years:
3-2
above 5 years
+3
less than 6 years, ...
We start by filtering out the alphabetic characters and applying some NLP rules to extract the meaning of these words. We then extract the values into two new columns: [experience_years_min] for the minimum required experience years and [experience_years_max] for the maximum required experience years, in addition to [experience_years_type], which describes the duration type [year or month].
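The extraction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `parse_experience` and the exact keyword rules ("above", "less", "+", ...) are assumptions based on the sample patterns shown earlier.

```python
import re

def parse_experience(value):
    """Parse a raw experience string into (min, max, duration_type).

    Hypothetical sketch; the real rule set in the project may differ.
    """
    text = str(value).lower().strip()
    # duration type: assume years unless "month" appears in the text
    unit = "month" if "month" in text else "year"
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if not numbers:
        return (None, None, unit)
    if "less" in text:
        # "less than 6 years" -> 0 .. 6
        return (0, min(numbers), unit)
    if "above" in text or "more" in text or "+" in text:
        # "above 5 years", "+3" -> open-ended upper bound
        return (min(numbers), None, unit)
    # ranges like "3-2": order the numbers into min/max
    return (min(numbers), max(numbers), unit)

for raw in ["3-2", "above 5 years", "+3", "less than 6 years"]:
    print(raw, "->", parse_experience(raw))
```

Applied to the sample values above, this yields `(2, 3, "year")`, `(5, None, "year")`, `(3, None, "year")` and `(0, 6, "year")` respectively.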
We applied several pre-processing tasks on these rich text columns:
Normalize Arabic characters, for example:
ة == ه
أ,آ,إ == ا
ى == ي
Remove all special characters [*, %, #, (, ...]
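The two normalization steps above can be sketched like this. The character mapping follows the equivalences listed above; the regex used to drop special characters is an assumption (the project's exact character list may be broader).

```python
import re

# Arabic character normalization table, per the equivalences above.
ARABIC_MAP = str.maketrans({
    "ة": "ه",                       # ta marbuta -> ha
    "أ": "ا", "آ": "ا", "إ": "ا",   # hamza/madda alef variants -> bare alef
    "ى": "ي",                       # alef maqsura -> ya
})

def normalize_text(text: str) -> str:
    """Normalize Arabic characters, then strip special characters."""
    text = text.translate(ARABIC_MAP)
    # keep letters (Arabic and Latin), digits and whitespace; drop the rest
    return re.sub(r"[^\w\s]", "", text).strip()
```

Note that in Python 3, `\w` matches Arabic letters as well as Latin ones, so the substitution removes only punctuation-like symbols.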
We consider this the **most challenging** pre-processing task, for these reasons:
1. Arabic/English mixed values

Before the pre-processing step, the total number of unique values in [city_name] was **478**.
We started applying some steps to reduce and normalize this variance:
1. Remove some special characters [%, ^, (, ...]

In the end, the number of unique values in [city_name] decreased to just **251** :)
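The effect of this normalization on the unique-value count can be illustrated on a tiny made-up sample (the real [city_name] column is of course much larger, and the project's cleaning rules may include more steps than shown here):

```python
# Hypothetical sample values standing in for the [city_name] column.
cities = ["Cairo", "cairo!", "Cairo%", "Giza", "giza^"]

def clean_city(name: str) -> str:
    # strip special characters and unify case, as described above
    cleaned = "".join(ch for ch in name if ch.isalnum() or ch.isspace())
    return cleaned.strip().lower()

print(len(set(cities)))                      # unique values before: 5
print(len({clean_city(c) for c in cities}))  # unique values after: 2
```

The same before/after count on the full column is what produced the drop from 478 to 251 unique values.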