In this section of the project, we shall clean the data so that we can analyse it in the sections that follow.
Upon inspection of the dataset, a few issues were encountered.
A utils file was created for the different helper functions used; their implementations are shown here for clarity. A settings file is also used to handle configuration such as file paths.
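For reference, here is a minimal sketch of what the settings file might contain. Only the names MAIN_FILE and PROCESSED_FOLDER are taken from this notebook; the folder layout itself is an assumption.
# settings.py -- a minimal sketch; the folder layout here is an assumption
DATA_FOLDER = 'data/'  # hypothetical location for raw and processed data
MAIN_FILE = DATA_FOLDER + 'PLE-Results-2014.ALL-CANDIDATES.xlsx'
# The trailing slash matters: paths below are built by concatenation,
# e.g. settings.PROCESSED_FOLDER+'KAMPALA.csv'
PROCESSED_FOLDER = DATA_FOLDER + 'processed/'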
In [1]:
"""
Import different packages required
"""
import utils
import settings
import pandas as pd
Download the file in case it is not already downloaded.
In [2]:
utils.download_ple()
utils.download_ple
def download_ple():
    if not os.path.exists(settings.MAIN_FILE):
        download_file(
            'http://ugandajournalistsresourcecentre.com/wp-content/uploads/2015/05/PLE-Results-2014.ALL-CANDIDATES.xlsx'
        )
    return
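download_file is another helper in utils whose source is not shown. A minimal sketch, assuming it takes only the URL and saves to settings.MAIN_FILE (as the exists-check above suggests), might be:
import os
import urllib.request

import settings

def download_file(url):
    # Hypothetical sketch -- the real helper lives in utils
    # Ensure the target folder exists, then fetch the file
    os.makedirs(os.path.dirname(settings.MAIN_FILE), exist_ok=True)
    urllib.request.urlretrieve(url, settings.MAIN_FILE)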
Next, we need to convert the Excel file to CSV. We can convert to one file with all the districts, or to a separate CSV for each district. Let's go for separate files this time.
This takes a while. Some optimization may be added later.
In [3]:
utils.ExcelConverter.excel_to_csv_multiple(xls_file=settings.MAIN_FILE, target_folder=settings.PROCESSED_FOLDER)
utils.ExcelConverter.excel_to_csv_multiple
This function is rather long, so it is simply linked here rather than reproduced in full.
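For a rough idea of what the conversion involves, here is a simplified sketch, assuming one sheet per district; the linked implementation handles more edge cases.
import csv
import os

import pandas as pd

def excel_to_csv_multiple_sketch(xls_file, target_folder):
    # Simplified sketch, not the linked implementation
    os.makedirs(target_folder, exist_ok=True)
    # sheet_name=None loads every sheet into a {sheet name: DataFrame} dict
    sheets = pd.read_excel(xls_file, sheet_name=None)
    for name, df in sheets.items():
        target = os.path.join(target_folder, '{}.csv'.format(name))
        df.to_csv(target, quoting=csv.QUOTE_ALL, index=False)
Reading the whole workbook into memory at once is likely what makes this step slow, and is where the optimization mentioned above would come in.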
Next, we look at the shapes of the different files
In [4]:
utils.find_csv_shape(settings.PROCESSED_FOLDER)
utils.find_csv_shape
def find_csv_shape(folder):
    # d maps a column count to the number of files with that many columns
    d = {}
    for path, folders, files in os.walk(folder):
        for file in files:
            f = os.path.join(path, file)
            df = pd.read_csv(f)  # renamed from `csv` to avoid shadowing the csv module
            if len(df.columns) in d:
                d[len(df.columns)] += 1
            else:
                d[len(df.columns)] = 1
            print("{0}\n{1} - {2}".format(len(df.columns), file, df.columns))
    total_districts = sum(d.values())
    print("Sheets with column length {}".format(d))
    for key, value in d.items():
        percentage = (value / total_districts) * 100
        print("{} - {:.2f}%".format(key, percentage))
We can see that most of the sheets have 10 columns. The ones with more columns have a variable number of 'Unnamed:' columns, so the next step is to remove those.
In [5]:
utils.remove_unnamed(settings.PROCESSED_FOLDER, right_size=10)
utils.remove_unnamed
def remove_unnamed(folder, right_size):
    for path, folders, files in os.walk(folder):
        for file in files:
            f = os.path.join(path, file)
            old_csv = pd.read_csv(f)
            if len(old_csv.columns) != right_size:
                # Keep only the columns whose names do not contain 'Unnamed:'
                new_csv = old_csv[old_csv.columns[~old_csv.columns.str.contains('Unnamed:')]]
                new_csv.to_csv(f, quoting=csv.QUOTE_ALL, index=False)
After removing the unnamed columns, let's check the shapes again.
In [6]:
utils.find_csv_shape(settings.PROCESSED_FOLDER)
All sheets have the same number of columns. Looks great.
Let's now try to view some stats. We'll try finding the number of 1s scored in English in Kampala.
In [7]:
kampala = pd.read_csv(settings.PROCESSED_FOLDER+'KAMPALA.csv')
english_ones = kampala[kampala['ENG'] == 1]
english_ones
Out[7]:
It's impossible for there to be no 1s in English in the whole of Kampala, so something is wrong with the data. Let's get the datatypes for each column.
In [8]:
kampala.info()
Looking at the info provided above, we see that only CNDIDATE NUMBER is numeric. We expect ENG, SCIE, SST, MATH, AGG, and DIV to all be numeric so that we can make calculations against them.
Another thing to note is the column titles: CNDIDATE NUMBER wasn't seen in the other datasets, which use CANDIDATE NUMBER. We'll deal with this first. Let's look at the columns in Kampala.
We use df.head(0) rather than df.columns because it prints the header in a tabular format in Jupyter.
In [9]:
kampala.head(0)
Out[9]:
Let's look at some other files. Butambala has similar column names, but some differ slightly.
In [10]:
butambala = pd.read_csv(settings.PROCESSED_FOLDER+'BUTAMBALA.csv')
butambala.head(0)
Out[10]:
Upon analysis of other randomly sampled datasets, we can see that the format in Kampala is probably wrong.
In [11]:
abim = pd.read_csv(settings.PROCESSED_FOLDER+'ABIM.csv')
abim.head(0)
Out[11]:
In [12]:
"""
This is probably supposed to be Kamwenge
"""
kamwenje = pd.read_csv(settings.PROCESSED_FOLDER+'KAMWENJE.csv')
kamwenje.head(0)
Out[12]:
A function that corrects the column names is in order.
In [13]:
utils.correct_headers(settings.PROCESSED_FOLDER)
utils.correct_headers
def correct_headers(location):
    """
    Some files have inconsistent headings.
    These are corrected here
    """
    if os.path.isfile(location):
        df = pd.read_csv(location)
        df.rename(columns={'F/M': 'M/F', 'SCIE': 'SCI', 'MATH': 'MAT', 'CNDIDATE NUMBER': 'CANDIDATE NUMBER'},
                  inplace=True)
        df.columns = df.columns.str.strip()
        df.to_csv(location, quoting=csv.QUOTE_ALL, index=False)
    elif os.path.isdir(location):
        for path, folders, files in os.walk(location):
            for f in files:
                file = os.path.join(path, f)  # join with the walked path, not the root folder
                df = pd.read_csv(file)
                df.rename(columns={'F/M': 'M/F', 'SCIE': 'SCI', 'MATH': 'MAT', 'CNDIDATE NUMBER': 'CANDIDATE NUMBER'},
                          inplace=True)
                df.columns = df.columns.str.strip()
                df.to_csv(file, quoting=csv.QUOTE_ALL, index=False)
Let's reload the kampala dataset and view its columns.
In [14]:
kampala = pd.read_csv(settings.PROCESSED_FOLDER+'KAMPALA.csv')
kampala.head(0)
Out[14]:
Now, let's get back to changing the datatypes of the columns.
In [15]:
utils.convert_numeric_values(settings.PROCESSED_FOLDER)
In [16]:
kampala = pd.read_csv(settings.PROCESSED_FOLDER+'KAMPALA.csv')
kampala.info()
We see that ENG, SCI, SST, MAT, AGG, and DIV are now float64, which is a numeric type.
Two rules were applied to the data during conversion: a 'U' (ungraded) in DIV is treated as 0, and any other non-numeric entry is coerced to NaN.
utils.convert_numeric_values
def convert_numeric_values(location):
    """
    Convert numeric fields to numeric data types
    """
    if os.path.isfile(location):
        df = pd.read_csv(location)
        # 'U' (ungraded) becomes 0 so the column can be made numeric
        df['DIV'].replace('U', '0', inplace=True)
        df[['MAT', 'SCI', 'SST', 'ENG', 'AGG', 'DIV']] = df[
            ['MAT', 'SCI', 'SST', 'ENG', 'AGG', 'DIV']].apply(pd.to_numeric, errors='coerce')
        df.to_csv(location, quoting=csv.QUOTE_ALL, index=False)
    elif os.path.isdir(location):
        for path, folders, files in os.walk(location):
            for f in files:
                file = os.path.join(path, f)  # join with the walked path, not the root folder
                df = pd.read_csv(file)
                df['DIV'].replace('U', '0', inplace=True)
                df[['MAT', 'SCI', 'SST', 'ENG', 'AGG', 'DIV']] = df[
                    ['MAT', 'SCI', 'SST', 'ENG', 'AGG', 'DIV']].apply(pd.to_numeric, errors='coerce')
                df.to_csv(file, quoting=csv.QUOTE_ALL, index=False)
Let's look at some random files.
In [17]:
abim = pd.read_csv(settings.PROCESSED_FOLDER+'ABIM.csv')
abim
Out[17]:
In [18]:
ebb = pd.read_csv(settings.PROCESSED_FOLDER+'ENTEBBE MUNICIPALITY.csv')
ebb
Out[18]:
The end of the Entebbe dataframe has many empty values. Let's delete those with df.dropna(). But first, we strip all the string records.
In [19]:
utils.strip_records(settings.PROCESSED_FOLDER)
In [20]:
utils.remove_empty_records(settings.PROCESSED_FOLDER)
utils.remove_empty_records and utils.strip_records were initially one function, but stripped empty strings would not be dropped if dropna was called immediately after stripping. This is most likely because str.strip() turns whitespace-only values into empty strings rather than NaN, and dropna only drops NaN; it is only after the round trip through to_csv and read_csv, where empty fields are parsed back as NaN, that the records can be dropped.
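A small standalone example of the behaviour (the column name here is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'NAME': ['OKELLO JOHN', '   ', None]})

# Stripping turns the whitespace-only value into '', not NaN...
df['NAME'] = df['NAME'].str.strip()
print(len(df.dropna(how='all')))  # 2 -- the '' row survives

# ...so it has to be replaced explicitly (or round-tripped through
# to_csv/read_csv, where empty fields are read back as NaN)
df.replace('', np.nan, inplace=True)
print(len(df.dropna(how='all')))  # 1 -- the empty row is gone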
In [21]:
ebb = pd.read_csv(settings.PROCESSED_FOLDER+'ENTEBBE MUNICIPALITY.csv')
ebb
Out[21]:
utils.remove_empty_records
def remove_empty_records(location):
    """
    Remove all empty records
    """
    if os.path.isfile(location):
        df = pd.read_csv(location)
        # Drop all records without data
        df.dropna(how='all', inplace=True)
        df.to_csv(location, quoting=csv.QUOTE_ALL, index=False)
    elif os.path.isdir(location):
        for path, folders, files in os.walk(location):
            for f in files:
                file = os.path.join(path, f)  # join with the walked path, not the root folder
                df = pd.read_csv(file)
                # Drop all records without data
                df.dropna(how='all', inplace=True)
                df.to_csv(file, quoting=csv.QUOTE_ALL, index=False)
utils.strip_records
def strip_records(location):
    """
    Strip all string columns
    """
    if os.path.isfile(location):
        df = pd.read_csv(location)
        # Strip whitespace from all string (object) columns
        df_obj = df.select_dtypes(['object'])
        df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
        df.to_csv(location, quoting=csv.QUOTE_ALL, index=False)
    elif os.path.isdir(location):
        for path, folders, files in os.walk(location):
            for f in files:
                file = os.path.join(path, f)  # join with the walked path, not the root folder
                df = pd.read_csv(file)
                # Strip whitespace from all string (object) columns
                df_obj = df.select_dtypes(['object'])
                df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
                df.to_csv(file, quoting=csv.QUOTE_ALL, index=False)
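As an aside, correct_headers, convert_numeric_values, strip_records, and remove_empty_records all repeat the same file-versus-directory logic. A possible refactor (not part of the current utils file, and assuming the same os, csv, and pandas imports as the rest of utils) would pull that walk into one helper:
def apply_to_csvs(location, transform):
    # Hypothetical helper: apply a DataFrame -> DataFrame function to a single
    # CSV file, or to every CSV under a directory, writing back in place
    if os.path.isfile(location):
        targets = [location]
    else:
        targets = [os.path.join(path, f)
                   for path, folders, files in os.walk(location)
                   for f in files]
    for target in targets:
        df = transform(pd.read_csv(target))
        df.to_csv(target, quoting=csv.QUOTE_ALL, index=False)
remove_empty_records, for instance, would then reduce to apply_to_csvs(location, lambda df: df.dropna(how='all')).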
Let's also join all the district data to make one csv file.
In [22]:
utils.merge_districts(settings.PROCESSED_FOLDER)
utils.merge_districts
def merge_districts(location):
    all_districts = []
    for path, folders, files in os.walk(location):
        for f in files:
            if f == 'ALL.csv':
                continue  # skip a previously merged file if this is re-run
            file = os.path.join(path, f)
            df = pd.read_csv(file)
            all_districts.append(df)
    all_df = pd.concat(all_districts)
    all_df_file = os.path.join(location, 'ALL.csv')
    all_df.to_csv(all_df_file, quoting=csv.QUOTE_ALL, index=False)
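Later sections can then load everything in one go:
all_districts = pd.read_csv(settings.PROCESSED_FOLDER + 'ALL.csv')
all_districts.info()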
The data cleaning phase of this dataset is complete, with a few lessons learned along the way. Next, we shall use the cleaned data to answer some questions.