Cleaning PLE DataSet

Introduction

In this section of the project we shall clean the data so that we can do analysis in other sections.

Upon inspection of the Dataset, a few issues were encountered.

  • The files are in excel format, yet we need to use csv
  • There are multiple sheets, one for each district
  • Some districts have different columns and column names
  • Some records have Non numeric values

A Utils file was created for different helper functions used. The implementation of these methods is provided for clarity. A settings file is used too to handle configurations.


In [1]:
"""
Import different packages required
"""
import utils
import settings
import pandas as pd

Download The File incase it is not already downloaded.


In [2]:
utils.download_ple()

utils.download_ple

def download_ple():
    if not os.path.exists(settings.MAIN_FILE):
        download_file(
            'http://ugandajournalistsresourcecentre.com/wp-content/uploads/2015/05/PLE-Results-2014.ALL-CANDIDATES.xlsx'
        )
    return

Next, we need to convert the Excel file to csv We can convert to one file with all the districts or a csv for each district. Let's go for different files this time.

This takes a bit of a while. Some optimization may be added later.


In [3]:
utils.ExcelConverter.excel_to_csv_multiple(xls_file=settings.MAIN_FILE, target_folder=settings.PROCESSED_FOLDER)


Start converting
...............................................................................................
Moving on, Kyegegwa has different data
...........
Done Converting 107 files

utils.ExcelConverter.excel_to_csv_multiple()

This is a rather long function so here it will simply be linked

utils.ExcelConverter.excel_to_csv_multiple()

Next, we look at the shapes of the different files


In [4]:
utils.find_csv_shape(settings.PROCESSED_FOLDER)


10
BUTAMBALA.csv - Index(['DISTRICT', 'SCHOOL', 'CANDIDATE NUMBER', 'M/F', 'ENG', 'SCI', 'SST',
       'MAT', 'AGG            ', 'DIV'],
      dtype='object')
11
ENTEBBE MUNICIPALITY.csv - Index(['DISTRICT', 'SCHOOL', 'CANDIDATE NUMBER', 'F/M', 'ENG', 'SCIE', 'SST',
       'MATH', 'AGG', 'DIV', 'Unnamed: 10'],
      dtype='object')
25
MANAFWA.csv - Index(['District', 'School', 'Candidate Number', 'M/F', 'ENG', 'SCI', 'SST',
       'MATH', 'AGG', 'DIV', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16',
       'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20',
       'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24'],
      dtype='object')
19
LIRA MAIN.csv - Index(['District', 'School', 'Candidate Number', 'M/F', 'ENG', 'SCI', 'SST',
       'MATH', 'AGG', 'DIV', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16',
       'Unnamed: 17', 'Unnamed: 18'],
      dtype='object')
18
KUMI.csv - Index(['District', 'School', 'Candidate Number', 'M/F', 'ENG', 'SCI', 'SST',
       'MATH', 'AGG', 'DIV', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16',
       'Unnamed: 17'],
      dtype='object')
15
NAMAYINGO.csv - Index(['District', 'School ', 'Candidate Number ', 'M/F', 'ENG', 'SCI', 'SST',
       'MATH', 'AGG', 'DIV', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14'],
      dtype='object')
Sheets with column length {18: 1, 19: 1, 25: 1, 10: 95, 11: 7, 15: 1}
18 - 0.94%
19 - 0.94%
25 - 0.94%
10 - 89.62%
11 - 6.60%
15 - 0.94%

utils.find_csv_shape

def find_csv_shape(folder):
    d = {}
    for path, folders, files in os.walk(folder):
        for file in files:
            f = os.path.join(path, file)
            csv = pd.read_csv(f)
            if len(csv.columns) in d:
                d[len(csv.columns)] += 1
            else:
                d[len(csv.columns)] = 1
                print("{0}\n{1} - {2}".format(len(csv.columns), file, csv.columns))
    total_districts = sum(d.values())
    print("Sheets with column length {}".format(d))
    for key, value in d.items():
        percentage = (value / total_districts) * 100
        print("{} - {:.2f}%".format(key, percentage))

We can see that most of the sheets have 10 columns. The ones with more columns have a variable number of 'Unamed' Columns. So the next step is to remove the unnamed


In [5]:
utils.remove_unnamed(settings.PROCESSED_FOLDER, right_size=10)

utils.remove_unnamed

def remove_unnamed(folder, right_size):
    for path, folders, files in os.walk(folder):
        for file in files:
            f = os.path.join(path, file)
            old_csv = pd.read_csv(f)
            if len(old_csv.columns) != right_size:
                new_csv = old_csv[old_csv.columns[~old_csv.columns.str.contains('Unnamed:')]]
                new_csv.to_csv(f, quoting=csv.QUOTE_ALL, index=False)

After Removing the unnamed columns, let's check the shape again


In [6]:
utils.find_csv_shape(settings.PROCESSED_FOLDER)


10
BUTAMBALA.csv - Index(['DISTRICT', 'SCHOOL', 'CANDIDATE NUMBER', 'M/F', 'ENG', 'SCI', 'SST',
       'MAT', 'AGG            ', 'DIV'],
      dtype='object')
Sheets with column length {10: 106}
10 - 100.00%

All sheets have the same number of columns. Looks great.

Let's now try to view some stats. We'll try finding the number of Division 1s in Kampala


In [7]:
kampala = pd.read_csv(settings.PROCESSED_FOLDER+'KAMPALA.csv')
english_ones = kampala[kampala['ENG'] == 1]
english_ones


Out[7]:
DISTRICT SCHOOL CNDIDATE NUMBER F/M ENG SCIE SST MATH AGG DIV

It's impossible for there to be no division ones in English in Kampala. Looking at the data, let's get the datatypes for each column


In [8]:
kampala.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27882 entries, 0 to 27881
Data columns (total 10 columns):
DISTRICT           27882 non-null object
SCHOOL             27881 non-null object
CNDIDATE NUMBER    27882 non-null int64
F/M                27874 non-null object
ENG                27881 non-null object
SCIE               27881 non-null object
SST                27881 non-null object
MATH               27881 non-null object
AGG                27881 non-null object
DIV                27877 non-null object
dtypes: int64(1), object(9)
memory usage: 2.1+ MB

Looking at the info provided above, we see that only CNDIDATE NUMBER is numeric, we expect ENG, SCIE, SST, MATH, AGG, DIV to all be numeric so that we can make some calculations against them.

Another thing to note is the Column titles, CNDIDATE NUMBER wasn't seen in other datasets, it was CANDIDATE NUMBER. We'll deal with this first. Let's look at the columns in Kampala.

we use df.head(0) rather than df.columns because it shows prints the data in a tabular format in jupyter


In [9]:
kampala.head(0)


Out[9]:
DISTRICT SCHOOL CNDIDATE NUMBER F/M ENG SCIE SST MATH AGG DIV

Let's look at some other files. Butambala has similar names, but some columns are slightly different


In [10]:
butambala =pd.read_csv(settings.PROCESSED_FOLDER+'BUTAMBALA.csv')
butambala.head(0)


Out[10]:
DISTRICT SCHOOL CANDIDATE NUMBER M/F ENG SCI SST MAT AGG DIV

Upon analysis of other randomly sampled datasets, we can see that the format in Kampala is probably wrong.


In [11]:
abim =pd.read_csv(settings.PROCESSED_FOLDER+'ABIM.csv')
abim.head(0)


Out[11]:
DISTRICT SCHOOL CANDIDATE NUMBER M/F ENG SCI SST MAT AGG DIV

In [12]:
"""
This is probably supposed to be Kamwenge
"""
kamwenje =pd.read_csv(settings.PROCESSED_FOLDER+'KAMWENJE.csv')
kamwenje.head(0)


Out[12]:
DISTRICT SCHOOL CANDIDATE NUMBER M/F ENG SCI SST MAT AGG DIV

A function that corrects the column names is in order.


In [13]:
utils.correct_headers(settings.PROCESSED_FOLDER)

utils.correct_headers

def correct_headers(location):
    """
    Some files have inconsistent headings.
    These are corrected her
    """
    if os.path.isfile(location):
        df = pd.read_csv(location)
        df.rename(columns={'F/M': 'M/F', 'SCIE': 'SCI', 'MATH': 'MAT', 'CNDIDATE NUMBER': 'CANDIDATE NUMBER'},
                  inplace=True)
        df.columns = df.columns.str.strip()
        df.to_csv(location, quoting=csv.QUOTE_ALL, index=False)
    elif os.path.isdir(location):
        for path, folders, files in os.walk(location):
            for f in files:
                file = os.path.join(location, f)
                df = pd.read_csv(file)
                df.rename(columns={'F/M': 'M/F', 'SCIE': 'SCI', 'MATH': 'MAT', 'CNDIDATE NUMBER': 'CANDIDATE NUMBER'},
                          inplace=True)
                df.columns = df.columns.str.strip()
                df.to_csv(file, quoting=csv.QUOTE_ALL, index=False)

Let's reload the kampala dataset and view its columns.


In [14]:
kampala = pd.read_csv(settings.PROCESSED_FOLDER+'KAMPALA.csv')
kampala.head(0)


Out[14]:
DISTRICT SCHOOL CANDIDATE NUMBER M/F ENG SCI SST MAT AGG DIV

Now, let's get back to changing the datatypes of the columns.


In [15]:
utils.convert_numeric_values(settings.PROCESSED_FOLDER)

In [16]:
kampala = pd.read_csv(settings.PROCESSED_FOLDER+'KAMPALA.csv')
kampala.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27882 entries, 0 to 27881
Data columns (total 10 columns):
DISTRICT            27882 non-null object
SCHOOL              27881 non-null object
CANDIDATE NUMBER    27882 non-null int64
M/F                 27874 non-null object
ENG                 27351 non-null float64
SCI                 27349 non-null float64
SST                 27348 non-null float64
MAT                 27351 non-null float64
AGG                 27348 non-null float64
DIV                 27344 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 2.1+ MB

We see that ENG, SCI, SST, MAT, AGG, DIV are now float64 which are numeric types.

The utils.numerize() function deserves a little explanation

  • Distinction 1(D1)
  • Distinction 2(D2)
  • Credit 3(C3)
  • Credit 4(C4)
  • Credit 5(C5)
  • Credit 6(C6)
  • Pass 7(P7)
  • Pass 8(P8)
  • Failure 9(F9)

Some rules that apply to the data

ENG, MAT, SCI, SST

  • X means a student registered for exam but did not sit for exam

GRADE, DIV

  • U means a student got a failure in any subject
  • X means a student did not sit for one of the exam

Our Conversion

  • X - NaN(through coerce)
  • U - 0(Our numeric Grade equivalent of failed)
def numerize(location):
    """
    Convert Numeric fields to numeric data types
    """
    if os.path.isfile(location):
        df = pd.read_csv(location)
        df['DIV'].replace('U', '0', inplace=True)
        df[['MAT', 'SCI', 'SST', 'ENG', 'AGG', 'DIV']] = df[
            ['MAT', 'SCI', 'SST', 'ENG', 'AGG', 'DIV']].apply(pd.to_numeric, errors='coerce')
        df.to_csv(location, quoting=csv.QUOTE_ALL, index=False)
    elif os.path.isdir(location):
        for path, folders, files in os.walk(location):
            for f in files:
                file = os.path.join(location, f)
                df = pd.read_csv(file)
                df['DIV'].replace("U", "0", inplace=True)
                df[['MAT', 'SCI', 'SST', 'ENG', 'AGG', 'DIV']] = df[
                    ['MAT', 'SCI', 'SST', 'ENG', 'AGG', 'DIV']].apply(pd.to_numeric, errors='coerce')
                df.to_csv(file, quoting=csv.QUOTE_ALL, index=False)

Let's look at some random files.


In [17]:
abim = pd.read_csv(settings.PROCESSED_FOLDER+'ABIM.csv')
abim


Out[17]:
DISTRICT SCHOOL CANDIDATE NUMBER M/F ENG SCI SST MAT AGG DIV
0 ABIM ORYEOTYENE P.7 SCHOOL 1 F 7.0 7.0 6.0 8.0 28.0 3.0
1 ABIM ORYEOTYENE P.7 SCHOOL 2 F 7.0 7.0 6.0 6.0 26.0 3.0
2 ABIM ORYEOTYENE P.7 SCHOOL 3 F 8.0 7.0 5.0 8.0 28.0 3.0
3 ABIM ORYEOTYENE P.7 SCHOOL 4 F 6.0 7.0 5.0 7.0 25.0 3.0
4 ABIM ORYEOTYENE P.7 SCHOOL 5 M 8.0 6.0 5.0 7.0 26.0 3.0
5 ABIM ORYEOTYENE P.7 SCHOOL 6 M 5.0 4.0 3.0 5.0 17.0 2.0
6 ABIM ORYEOTYENE P.7 SCHOOL 7 M 7.0 6.0 5.0 7.0 25.0 3.0
7 ABIM ORYEOTYENE P.7 SCHOOL 8 M 6.0 5.0 3.0 8.0 22.0 2.0
8 ABIM ORYEOTYENE P.7 SCHOOL 9 M 8.0 4.0 3.0 7.0 22.0 2.0
9 ABIM ORYEOTYENE P.7 SCHOOL 10 M 5.0 4.0 3.0 6.0 18.0 2.0
10 ABIM ORYEOTYENE P.7 SCHOOL 11 M 7.0 4.0 5.0 7.0 23.0 2.0
11 ABIM ORYEOTYENE P.7 SCHOOL 12 F 6.0 6.0 5.0 7.0 24.0 2.0
12 ABIM ORYEOTYENE P.7 SCHOOL 13 F 9.0 8.0 7.0 9.0 33.0 0.0
13 ABIM ORYEOTYENE P.7 SCHOOL 14 F 5.0 5.0 5.0 6.0 21.0 2.0
14 ABIM ORYEOTYENE P.7 SCHOOL 15 M 6.0 3.0 4.0 7.0 20.0 2.0
15 ABIM ORYEOTYENE P.7 SCHOOL 16 M 5.0 4.0 3.0 6.0 18.0 2.0
16 ABIM ORYEOTYENE P.7 SCHOOL 17 M 8.0 5.0 5.0 7.0 25.0 3.0
17 ABIM ORYEOTYENE P.7 SCHOOL 18 M 6.0 6.0 4.0 6.0 22.0 2.0
18 ABIM ORYEOTYENE P.7 SCHOOL 19 M 4.0 3.0 4.0 4.0 15.0 2.0
19 ABIM OTALABAR P.7 SCHOOL 20 F 7.0 8.0 6.0 9.0 30.0 4.0
20 ABIM OTALABAR P.7 SCHOOL 21 F 5.0 6.0 4.0 6.0 21.0 2.0
21 ABIM OTALABAR P.7 SCHOOL 22 M 9.0 9.0 9.0 9.0 36.0 0.0
22 ABIM OTALABAR P.7 SCHOOL 23 M 7.0 6.0 6.0 6.0 25.0 3.0
23 ABIM OTALABAR P.7 SCHOOL 24 F 8.0 6.0 5.0 7.0 26.0 3.0
24 ABIM OTALABAR P.7 SCHOOL 25 F 5.0 5.0 4.0 8.0 22.0 2.0
25 ABIM OTALABAR P.7 SCHOOL 26 M 6.0 3.0 4.0 4.0 17.0 2.0
26 ABIM OTALABAR P.7 SCHOOL 27 M 6.0 3.0 3.0 6.0 18.0 2.0
27 ABIM OTALABAR P.7 SCHOOL 28 M 4.0 2.0 2.0 2.0 10.0 1.0
28 ABIM OTALABAR P.7 SCHOOL 29 M 7.0 4.0 4.0 6.0 21.0 2.0
29 ABIM OTALABAR P.7 SCHOOL 30 M 7.0 6.0 4.0 5.0 22.0 2.0
... ... ... ... ... ... ... ... ... ... ...
977 ABIM AREMBWOLA PRIMARY SCHOOL 978 M 5.0 5.0 5.0 8.0 23.0 2.0
978 ABIM AREMBWOLA PRIMARY SCHOOL 979 M 8.0 6.0 5.0 8.0 27.0 3.0
979 ABIM GULOTWORO PRIMARY SCHOOL 980 F 7.0 9.0 7.0 8.0 31.0 4.0
980 ABIM GULOTWORO PRIMARY SCHOOL 981 F 8.0 9.0 7.0 7.0 31.0 4.0
981 ABIM GULOTWORO PRIMARY SCHOOL 982 F NaN NaN NaN NaN NaN NaN
982 ABIM GULOTWORO PRIMARY SCHOOL 983 F 8.0 7.0 7.0 8.0 30.0 3.0
983 ABIM GULOTWORO PRIMARY SCHOOL 984 F 7.0 6.0 6.0 8.0 27.0 3.0
984 ABIM GULOTWORO PRIMARY SCHOOL 985 M 7.0 6.0 5.0 8.0 26.0 3.0
985 ABIM GULOTWORO PRIMARY SCHOOL 986 F 9.0 8.0 7.0 9.0 33.0 0.0
986 ABIM GULOTWORO PRIMARY SCHOOL 987 M 9.0 9.0 9.0 7.0 34.0 0.0
987 ABIM GULOTWORO PRIMARY SCHOOL 988 M 9.0 8.0 7.0 9.0 33.0 0.0
988 ABIM GULOTWORO PRIMARY SCHOOL 989 M 8.0 7.0 5.0 8.0 28.0 3.0
989 ABIM GULOTWORO PRIMARY SCHOOL 990 M 8.0 6.0 6.0 7.0 27.0 3.0
990 ABIM GULOTWORO PRIMARY SCHOOL 991 M 9.0 9.0 8.0 8.0 34.0 0.0
991 ABIM GULOTWORO PRIMARY SCHOOL 992 M 7.0 7.0 6.0 7.0 27.0 3.0
992 ABIM GULOTWORO PRIMARY SCHOOL 993 F 9.0 9.0 7.0 9.0 34.0 0.0
993 ABIM GULOTWORO PRIMARY SCHOOL 994 F 9.0 9.0 6.0 8.0 32.0 4.0
994 ABIM GULOTWORO PRIMARY SCHOOL 995 F 6.0 6.0 5.0 7.0 24.0 2.0
995 ABIM GULOTWORO PRIMARY SCHOOL 996 F 8.0 7.0 5.0 6.0 26.0 3.0
996 ABIM GULOTWORO PRIMARY SCHOOL 997 F NaN NaN NaN NaN NaN NaN
997 ABIM GULOTWORO PRIMARY SCHOOL 998 F 9.0 9.0 8.0 9.0 35.0 0.0
998 ABIM GULOTWORO PRIMARY SCHOOL 999 F 7.0 7.0 6.0 8.0 28.0 3.0
999 ABIM GULOTWORO PRIMARY SCHOOL 1000 M 5.0 4.0 3.0 8.0 20.0 2.0
1000 ABIM GULOTWORO PRIMARY SCHOOL 1001 M 6.0 6.0 7.0 8.0 27.0 3.0
1001 ABIM GULOTWORO PRIMARY SCHOOL 1002 M 8.0 6.0 6.0 7.0 27.0 3.0
1002 ABIM GULOTWORO PRIMARY SCHOOL 1003 M 9.0 9.0 9.0 8.0 35.0 0.0
1003 ABIM GULOTWORO PRIMARY SCHOOL 1004 F 6.0 8.0 6.0 5.0 25.0 3.0
1004 ABIM GULOTWORO PRIMARY SCHOOL 1005 M 8.0 5.0 6.0 9.0 28.0 4.0
1005 ABIM GULOTWORO PRIMARY SCHOOL 1006 M 7.0 8.0 6.0 7.0 28.0 3.0
1006 ABIM GULOTWORO PRIMARY SCHOOL 1007 M 7.0 8.0 8.0 5.0 28.0 3.0

1007 rows × 10 columns


In [18]:
ebb = pd.read_csv(settings.PROCESSED_FOLDER+'ENTEBBE MUNICIPALITY.csv')
ebb


Out[18]:
DISTRICT SCHOOL CANDIDATE NUMBER M/F ENG SCI SST MAT AGG DIV
0 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 1.0 F 3.0 2.0 3.0 4.0 12.0 1.0
1 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 2.0 M 4.0 4.0 4.0 6.0 18.0 2.0
2 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 3.0 F 2.0 4.0 2.0 4.0 12.0 1.0
3 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 4.0 M 2.0 3.0 2.0 2.0 9.0 1.0
4 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 5.0 F 4.0 4.0 3.0 7.0 18.0 2.0
5 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 6.0 F 6.0 5.0 5.0 7.0 23.0 2.0
6 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 7.0 F 3.0 4.0 2.0 2.0 11.0 1.0
7 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 8.0 M 2.0 2.0 2.0 4.0 10.0 1.0
8 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 9.0 F 2.0 1.0 2.0 1.0 6.0 1.0
9 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 10.0 M 4.0 5.0 2.0 3.0 14.0 2.0
10 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 11.0 F 3.0 3.0 2.0 5.0 13.0 2.0
11 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 12.0 M 4.0 4.0 3.0 4.0 15.0 2.0
12 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 13.0 F 2.0 3.0 2.0 2.0 9.0 1.0
13 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 14.0 M 2.0 2.0 2.0 6.0 12.0 1.0
14 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 15.0 M 4.0 2.0 2.0 5.0 13.0 2.0
15 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 16.0 F 3.0 2.0 4.0 5.0 14.0 2.0
16 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 17.0 F 2.0 3.0 2.0 2.0 9.0 1.0
17 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 18.0 F 2.0 3.0 2.0 2.0 9.0 1.0
18 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 19.0 F 4.0 5.0 4.0 6.0 19.0 2.0
19 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 20.0 F 2.0 3.0 3.0 6.0 14.0 2.0
20 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 21.0 F 4.0 5.0 4.0 5.0 18.0 2.0
21 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 22.0 M 2.0 2.0 2.0 4.0 10.0 1.0
22 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 23.0 M 4.0 2.0 2.0 5.0 13.0 2.0
23 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 24.0 M 2.0 4.0 2.0 4.0 12.0 1.0
24 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 25.0 F 4.0 4.0 4.0 6.0 18.0 2.0
25 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 26.0 M 5.0 5.0 4.0 5.0 19.0 2.0
26 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 27.0 M 5.0 4.0 3.0 6.0 18.0 2.0
27 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 28.0 F 3.0 4.0 3.0 6.0 16.0 2.0
28 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 29.0 M 3.0 3.0 2.0 3.0 11.0 1.0
29 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 30.0 M 3.0 3.0 3.0 7.0 16.0 2.0
... ... ... ... ... ... ... ... ... ... ...
1616 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1617.0 M 1.0 2.0 2.0 2.0 7.0 1.0
1617 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1618.0 F 2.0 4.0 3.0 2.0 11.0 1.0
1618 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1619.0 F 2.0 5.0 3.0 4.0 14.0 2.0
1619 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1620.0 M 2.0 2.0 2.0 2.0 8.0 1.0
1620 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1621.0 M 3.0 2.0 2.0 2.0 9.0 1.0
1621 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1622.0 M 5.0 8.0 5.0 6.0 24.0 2.0
1622 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1623.0 M 3.0 4.0 3.0 3.0 13.0 2.0
1623 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1624.0 M 3.0 5.0 3.0 3.0 14.0 2.0
1624 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1625.0 M 3.0 4.0 3.0 3.0 13.0 2.0
1625 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1626.0 F 2.0 2.0 2.0 2.0 8.0 1.0
1626 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1627.0 M 3.0 5.0 4.0 3.0 15.0 2.0
1627 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1628.0 M 2.0 3.0 3.0 5.0 13.0 2.0
1628 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1629.0 M 2.0 3.0 3.0 5.0 13.0 2.0
1629 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1630.0 F 3.0 5.0 4.0 4.0 16.0 2.0
1630 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1631.0 F 1.0 3.0 3.0 6.0 13.0 2.0
1631 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1632.0 M 3.0 3.0 3.0 6.0 15.0 2.0
1632 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1633.0 F 4.0 7.0 5.0 6.0 22.0 2.0
1633 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1634.0 M 2.0 4.0 3.0 3.0 12.0 1.0
1634 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1635.0 M 3.0 3.0 3.0 3.0 12.0 1.0
1635 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1636.0 M 3.0 5.0 4.0 5.0 17.0 2.0
1636 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1637 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1638 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1639 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1640 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1641 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1642 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1643 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1644 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1645 NaN NaN NaN NaN NaN NaN NaN NaN NaN

1646 rows × 10 columns

The end of the Entebbe dataframe has many empty values. Let's delete those with df.dropna() But first we strip all the string records


In [19]:
utils.strip_records(settings.PROCESSED_FOLDER)

In [20]:
utils.remove_empty_records(settings.PROCESSED_FOLDER)

utils.remove_empty_records and utils.strip_records were initially done in one function, but stripped empty strings would not be dropped if done immediately after. Probably something about how pandas handles its data


In [21]:
ebb = pd.read_csv(settings.PROCESSED_FOLDER+'ENTEBBE MUNICIPALITY.csv')
ebb


Out[21]:
DISTRICT SCHOOL CANDIDATE NUMBER M/F ENG SCI SST MAT AGG DIV
0 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 1.0 F 3.0 2.0 3.0 4.0 12.0 1.0
1 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 2.0 M 4.0 4.0 4.0 6.0 18.0 2.0
2 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 3.0 F 2.0 4.0 2.0 4.0 12.0 1.0
3 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 4.0 M 2.0 3.0 2.0 2.0 9.0 1.0
4 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 5.0 F 4.0 4.0 3.0 7.0 18.0 2.0
5 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 6.0 F 6.0 5.0 5.0 7.0 23.0 2.0
6 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 7.0 F 3.0 4.0 2.0 2.0 11.0 1.0
7 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 8.0 M 2.0 2.0 2.0 4.0 10.0 1.0
8 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 9.0 F 2.0 1.0 2.0 1.0 6.0 1.0
9 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 10.0 M 4.0 5.0 2.0 3.0 14.0 2.0
10 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 11.0 F 3.0 3.0 2.0 5.0 13.0 2.0
11 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 12.0 M 4.0 4.0 3.0 4.0 15.0 2.0
12 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 13.0 F 2.0 3.0 2.0 2.0 9.0 1.0
13 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 14.0 M 2.0 2.0 2.0 6.0 12.0 1.0
14 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 15.0 M 4.0 2.0 2.0 5.0 13.0 2.0
15 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 16.0 F 3.0 2.0 4.0 5.0 14.0 2.0
16 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 17.0 F 2.0 3.0 2.0 2.0 9.0 1.0
17 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 18.0 F 2.0 3.0 2.0 2.0 9.0 1.0
18 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 19.0 F 4.0 5.0 4.0 6.0 19.0 2.0
19 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 20.0 F 2.0 3.0 3.0 6.0 14.0 2.0
20 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 21.0 F 4.0 5.0 4.0 5.0 18.0 2.0
21 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 22.0 M 2.0 2.0 2.0 4.0 10.0 1.0
22 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 23.0 M 4.0 2.0 2.0 5.0 13.0 2.0
23 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 24.0 M 2.0 4.0 2.0 4.0 12.0 1.0
24 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 25.0 F 4.0 4.0 4.0 6.0 18.0 2.0
25 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 26.0 M 5.0 5.0 4.0 5.0 19.0 2.0
26 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 27.0 M 5.0 4.0 3.0 6.0 18.0 2.0
27 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 28.0 F 3.0 4.0 3.0 6.0 16.0 2.0
28 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 29.0 M 3.0 3.0 2.0 3.0 11.0 1.0
29 ENTEBBE MUNICIPALITY ST.THERESA'S PRIMARY SCHOOL 30.0 M 3.0 3.0 3.0 7.0 16.0 2.0
... ... ... ... ... ... ... ... ... ... ...
1606 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1607.0 F 1.0 2.0 2.0 2.0 7.0 1.0
1607 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1608.0 F 2.0 3.0 2.0 3.0 10.0 1.0
1608 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1609.0 M 3.0 4.0 5.0 6.0 18.0 2.0
1609 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1610.0 M 3.0 2.0 3.0 4.0 12.0 1.0
1610 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1611.0 M 4.0 4.0 4.0 8.0 20.0 2.0
1611 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1612.0 M 4.0 4.0 3.0 2.0 13.0 2.0
1612 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1613.0 F 2.0 3.0 2.0 5.0 12.0 1.0
1613 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1614.0 M 2.0 3.0 2.0 2.0 9.0 1.0
1614 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1615.0 F 3.0 5.0 4.0 5.0 17.0 2.0
1615 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1616.0 F 1.0 2.0 2.0 2.0 7.0 1.0
1616 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1617.0 M 1.0 2.0 2.0 2.0 7.0 1.0
1617 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1618.0 F 2.0 4.0 3.0 2.0 11.0 1.0
1618 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1619.0 F 2.0 5.0 3.0 4.0 14.0 2.0
1619 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1620.0 M 2.0 2.0 2.0 2.0 8.0 1.0
1620 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1621.0 M 3.0 2.0 2.0 2.0 9.0 1.0
1621 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1622.0 M 5.0 8.0 5.0 6.0 24.0 2.0
1622 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1623.0 M 3.0 4.0 3.0 3.0 13.0 2.0
1623 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1624.0 M 3.0 5.0 3.0 3.0 14.0 2.0
1624 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1625.0 M 3.0 4.0 3.0 3.0 13.0 2.0
1625 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1626.0 F 2.0 2.0 2.0 2.0 8.0 1.0
1626 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1627.0 M 3.0 5.0 4.0 3.0 15.0 2.0
1627 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1628.0 M 2.0 3.0 3.0 5.0 13.0 2.0
1628 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1629.0 M 2.0 3.0 3.0 5.0 13.0 2.0
1629 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1630.0 F 3.0 5.0 4.0 4.0 16.0 2.0
1630 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1631.0 F 1.0 3.0 3.0 6.0 13.0 2.0
1631 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1632.0 M 3.0 3.0 3.0 6.0 15.0 2.0
1632 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1633.0 F 4.0 7.0 5.0 6.0 22.0 2.0
1633 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1634.0 M 2.0 4.0 3.0 3.0 12.0 1.0
1634 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1635.0 M 3.0 3.0 3.0 3.0 12.0 1.0
1635 ENTEBBE MUNICIPALITY ENTEBBE CHANGSHA MODEL PIS 1636.0 M 3.0 5.0 4.0 5.0 17.0 2.0

1636 rows × 10 columns

utils.remove_empty_records

def remove_empty_records(location):
    """
    Remove all records empty records
    """
    if os.path.isfile(location):
        df = pd.read_csv(location)
        # Drop all records without data
        df.dropna(how='all', inplace=True)
        df.to_csv(location, quoting=csv.QUOTE_ALL, index=False)
    elif os.path.isdir(location):
        for path, folders, files in os.walk(location):
            for f in files:
                file = os.path.join(location, f)
                df = pd.read_csv(file)
                # Drop all records without data
                df.dropna(how='all', inplace=True)
                df.to_csv(file, quoting=csv.QUOTE_ALL, index=False)

utils.strip_records

def strip_records(location):
    """
    Strip all string columns
    """
    if os.path.isfile(location):
        df = pd.read_csv(location)
        # Strip all String Columns
        df_obj = df.select_dtypes(['object'])
        df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
        df.to_csv(location, quoting=csv.QUOTE_ALL, index=False)
    elif os.path.isdir(location):
        for path, folders, files in os.walk(location):
            for f in files:
                file = os.path.join(location, f)
                df = pd.read_csv(file)
                # Strip all String Columns
                df_obj = df.select_dtypes(['object'])
                df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
                df.to_csv(file, quoting=csv.QUOTE_ALL, index=False)

Let's also join all the district data to make one csv file.


In [22]:
utils.merge_districts(settings.PROCESSED_FOLDER)

utils.merge_districts

def merge_districts(location):
    all_districts = []
    for path, folders, files in os.walk(location):
        for f in files:
            file = os.path.join(location, f)
            df = pd.read_csv(file)
            all_districts.append(df)
    all_df = pd.concat(all_districts)
    all_df_file = os.path.join(location, 'ALL.csv')
    all_df.to_csv(all_df_file, quoting=csv.QUOTE_ALL, index=False)

Conclusion

The data cleaning phase of this data is complete. A few lessons learned

  • It's good to look at the data in a simple format through a spreadsheet to get a good feel for the data
  • Some data is always not good so that has to be abandoned.
  • The data cleaning process is not linear. Unlike what this edited document may show.

We have finally cleaned the data. Next we shall use this information to answer some questions.