Introduction: Using the data gathered from Taarifa and the Tanzanian Ministry of Water, can we predict which pumps are functional, which need some repairs, and which don't work at all? Predicting one of these three classes, based on a smart understanding of which waterpoints will fail, can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.
This is also an intermediate-level competition hosted by DrivenData. All code & support scripts are available in the GitHub repo.
Goal: Perform all the basic transformations required for further processing: algorithm selection, parameter tuning and, if possible, a threshold check using the ROC curve.
In [1]:
import pickle
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
from sklearn.model_selection import train_test_split
from scripts.tools import game
from scripts.sam_custom_labeler import CUST_CATEGORY_LABELER
from scripts.tools import check_metric, data_transformations, df_check_stats
from scripts.tools import sam_pickle_load, sam_pickle_save
%matplotlib inline
np.set_printoptions(precision=5)
np.random.seed(69572)
plt.style.use('ggplot')
sns.set(color_codes=True)
crazy_list = dir()
From the data analysis, the following columns were selected for removal:
- scheme_name: has too many null values to fix.
- longitude: has zeros, which is not a possible coordinate.
- public_meetings, permit, amount_tsh, gps_height, population, construction_year: require infilling of data and have lots of outliers (recorded as zeros).
- wpt_name, ward, subvillage, scheme_name, installer, funder: have too many categorical values.
A few groups of columns seem to hold similar kinds of information:
- extraction_type, extraction_type_group, extraction_type_class
- management, management_group
- scheme_management, scheme_name
- payment, payment_type
- water_quality, quality_group
- source, source_type, source_class
- waterpoint_type, waterpoint_type_group
Geo-location information: all of the following parameters exist for the same reason, to identify the address:
- longitude, latitude
- region (region_code)
- district_code (within a region)
- ward
- subvillage
Compared to all other columns, region has complete data (no nulls).
In [2]:
########################################################################
# clean up
for each in dir():
    if each not in crazy_list and each != 'crazy_list':
        del globals()[each]  # actually remove the name from the notebook namespace
########################################################################
# data collection
RAW_X = pd.read_csv('data/traning_set_values.csv', index_col='id')
RAW_y = pd.read_csv('data/training_set_labels.csv', index_col='id')
RAW_TEST_X = pd.read_csv('data/test_set_values.csv', index_col='id')
print('Loaded CSV Files.')
########################################################################
# ward, wpt_name, subvillage: too many distinct values
# amount_tsh: too many nulls
# num_private: unclear what private pumps represent
drop_columns = '''
scheme_name
recorded_by
amount_tsh
num_private
region_code
district_code
wpt_name
subvillage
ward
lga
extraction_type_class
extraction_type_group
management_group
payment
water_quality
source_type
source_class
waterpoint_type_group
'''.strip().splitlines()
RAW_X.drop(drop_columns, inplace=True, axis=1)
RAW_TEST_X.drop(drop_columns, inplace=True, axis=1)
print('Dropped following cols', drop_columns)
Generally a date is a composition of three parts: year, month and day. Over time, the values recorded for a waterpoint can logically increase or decrease depending on the governing bodies and their state of growth. So, to make these patterns easier for the classifier to pick up, we separate the recorded date into the following columns: the number of days relative to a reference date, and the month in which the record was taken.
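As a concrete illustration (not part of the original notebook), here is what the transformation in the next cell produces for a single sample date, assuming the same 2014-01-01 reference point:
from datetime import datetime

ref = datetime(2014, 1, 1)
sample = datetime.strptime('2013-03-15', '%Y-%m-%d')
print((sample - ref).days)  # -292, days relative to the reference date
print(sample.month)         # 3, the month component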
In [3]:
########################################################################
# date_recorded -- convert it to datetime format
strptime = datetime.datetime.strptime
DATE_FORMAT = "%Y-%m-%d"
REFERENCE_DATE_POINT = strptime('2014-01-01', DATE_FORMAT)
f = lambda x: strptime(str(x), DATE_FORMAT)
RAW_X.date_recorded = RAW_X.date_recorded.apply(f)
RAW_TEST_X.date_recorded = RAW_TEST_X.date_recorded.apply(f)
f = lambda x: x.month
RAW_X['date_recorded_month'] = RAW_X.date_recorded.apply(f)
RAW_TEST_X['date_recorded_month'] = RAW_TEST_X.date_recorded.apply(f)
f = lambda x: (x - REFERENCE_DATE_POINT).days
RAW_X.date_recorded = RAW_X.date_recorded.apply(f)
RAW_TEST_X.date_recorded = RAW_TEST_X.date_recorded.apply(f)
During the analysis stage we saw that longitude contains zeros; according to Google Maps, the null coordinates (0, 0) point to a place outside Africa, in the sea, which is not possible. Since we have a region parameter, we use it to derive a suitable longitude (the per-region mean) and fill the zeros with it. Uncomment the plot line below for a better understanding.
In [4]:
########################################################################
# Longitude & Latitude -- zero values fix
aa = RAW_X['longitude latitude region'.split()].copy()
aa = aa[aa.longitude > 5]
bb = aa.groupby(by=['region']).mean()
bb.columns = ['longitude_mean', 'latitude_mean']
cc = aa.groupby(by=['region']).min()
cc.columns = ['longitude_min', 'latitude_min']
dd = aa.groupby(by=['region']).max()
dd.columns = ['longitude_max', 'latitude_max']
abcd = bb.join(cc).join(dd)[['latitude_max',
'latitude_mean',
'latitude_min',
'longitude_max',
'longitude_mean',
'longitude_min']].copy()
# abcd.plot()
abcde = dict(abcd[['latitude_mean', 'longitude_mean']].T).items()
# map each region to its mean coordinates (use the named entries, not positions)
long_lat_reg_dict = {region: {'longitude': means['longitude_mean'],
                              'latitude': means['latitude_mean']}
                     for region, means in abcde}
def f(row):
# print (row)
try:
if (row['longitude'] < 1):
reg = row['region']
row['longitude'] = long_lat_reg_dict[reg]['longitude']
# row['latitude'] = long_lat_reg_dict[reg]['latitude']
except KeyError:
print('KeyError: An expected key(region/longitude/latitude) is missing!')
return row
RAW_X = RAW_X.apply(f, axis=1)
RAW_TEST_X = RAW_TEST_X.apply(f, axis=1)
Note: only longitude is being replaced by the function f(row).
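As a quick sanity check (a minimal sketch, not part of the original notebook), we can count how many near-zero longitudes remain after the region-mean fill:
# hypothetical check: rows whose longitude is still near zero after the fill
for name, frame in [('train', RAW_X), ('test', RAW_TEST_X)]:
    print(name, (frame['longitude'] < 1).sum(), 'rows still have a near-zero longitude')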
In [5]:
df_check_stats(RAW_X)
In [6]:
########################################################################
## Area & Population
# Source: https://www.citypopulation.de/Tanzania-Cities.html
header = '''Name
Abbr.
Status
Capital
Area
Population_1978-08-26
Population_1988-08-27
Population_2002-08-01
Population_2012-08-26
'''.lower().strip().splitlines()
data = '''Arusha ARU Reg Arusha 37,576 ... 744,479 1,288,088 1,694,310
Dar es Salaam DAR Reg Dar es Salaam 1,393 843,090 1,360,850 2,487,288 4,364,541
Dodoma DOD Reg Dodoma 41,311 972,005 1,235,328 1,692,025 2,083,588
Geita GEI Reg Geita 20,054 ... ... 1,337,718 1,739,530
Iringa IRI Reg Iringa 35,503 ... ... 840,404 941,238
Kagera KAG Reg Bukoba 25,265 ... ... 1,791,451 2,458,023
Katavi KAT Reg Mpanda 45,843 ... ... 408,609 564,604
Kigoma KIG Reg Kigoma 37,040 648,941 856,770 1,674,047 2,127,930
Kilimanjaro KIL Reg Moshi 13,250 902,437 1,104,673 1,376,702 1,640,087
Lindi LIN Reg Lindi 66,040 527,624 646,494 787,624 864,652
Manyara MAY Reg Babati 44,522 ... 603,691 1,037,605 1,425,131
Mara MAR Reg Musoma 21,760 723,827 946,418 1,363,397 1,743,830
Mbeya MBE Reg Mbeya 60,350 1,079,864 1,476,278 2,063,328 2,707,410
Morogoro MOR Reg Morogoro 70,624 939,264 1,220,564 1,753,362 2,218,492
Mtwara MTW Reg Mtwara 16,710 771,818 889,100 1,124,481 1,270,854
Mwanza MWA Reg Mwanza 9,467 ... ... 2,058,866 2,772,509
Njombe NJO Reg Njombe 21,347 ... ... 648,464 702,097
Pwani PWA Reg Dar es Salaam 32,547 516,586 636,103 885,017 1,098,668
Rukwa RUK Reg Sumbawanga 22,792 ... ... 729,060 1,004,539
Ruvuma RUV Reg Songea 63,669 561,575 779,875 1,113,715 1,376,891
Shinyanga SHI Reg Shinyanga 18,901 ... ... 1,249,226 1,534,808
Simiyu SIM Reg Bariadi 25,212 ... ... 1,317,879 1,584,157
Singida SIN Reg Singida 49,340 613,949 792,387 1,086,748 1,370,637
Tabora TAB Reg Tabora 76,150 817,907 1,036,150 1,710,465 2,291,623
Tanga TAN Reg Tanga 26,677 1,037,767 1,280,212 1,636,280 2,045,205
Zanzibar ZAN St Zanzibar 2,460 476,111 640,685 981,754 1,303,569
'''.strip().splitlines()
# data to dataframe
df = pd.DataFrame([each.split('\t') for each in data])
df.columns = header
# columns
df.area = df.area.apply(lambda x: int(x.replace(',', '')))
df['population_2002-08-01'] = df['population_2002-08-01'].apply(lambda x: int(x.replace(',', '')))
df['population_2012-08-26'] = df['population_2012-08-26'].apply(lambda x: int(x.replace(',', '')))
# drop columns
df.drop(['abbr.', 'capital', 'status', 'population_1978-08-26', 'population_1988-08-27'], axis=1, inplace=True)
# df
# checking if we have all regions info
mm = RAW_X.region.unique()
nn = df.name.tolist()
missing_regions = [_ for _ in mm if _ not in nn]
if missing_regions:
    print('found regions missing from the area/population table:', missing_regions)
    assert False
area_pop = dict()
for i, each in df.T.iteritems():
area_pop[each['name']] = { 'area': each['area'],
'population_2002': each['population_2002-08-01'],
'population_2012': each['population_2012-08-26']}
def fill_area(region):
if region in area_pop:
return area_pop[region]['area']
else:
return 0
def fill_pop02(region):
if region in area_pop:
return area_pop[region]['population_2002']
else:
return 0
def fill_pop12(region):
if region in area_pop:
return area_pop[region]['population_2012']
else:
return 0
########################################################################
# Area & Population 2002 & Population 2012
RAW_X['Area'] = RAW_X.region.apply(fill_area)
RAW_X['Population02'] = RAW_X.region.apply(fill_pop02)
RAW_X['Population12'] = RAW_X.region.apply(fill_pop12)
RAW_TEST_X['Area'] = RAW_TEST_X.region.apply(fill_area)
RAW_TEST_X['Population02'] = RAW_TEST_X.region.apply(fill_pop02)
RAW_TEST_X['Population12'] = RAW_TEST_X.region.apply(fill_pop12)
In [7]:
RAW_X.gps_height[RAW_X.gps_height > 1].mean()
Out[7]:
In [8]:
########################################################################
# GPS Height -- replace non-positive heights with the mean of the valid training values
gps_height_mean = RAW_X.gps_height[RAW_X.gps_height > 1].mean()
RAW_TEST_X.loc[RAW_TEST_X.gps_height < 1, 'gps_height'] = gps_height_mean
RAW_X.loc[RAW_X.gps_height < 1, 'gps_height'] = gps_height_mean
In [9]:
########################################################################
# Reducing geo-location precision (0.00001 degrees is roughly 1 metre)
LONG_LAT_PRECISION = 0.00001
fns_lola = lambda x: (x // LONG_LAT_PRECISION) * LONG_LAT_PRECISION
# Reducing precision of longitude & latitude
RAW_X.longitude = RAW_X.longitude.apply(fns_lola)
RAW_X.latitude = RAW_X.latitude.apply(fns_lola)
RAW_TEST_X.longitude = RAW_TEST_X.longitude.apply(fns_lola)
RAW_TEST_X.latitude = RAW_TEST_X.latitude.apply(fns_lola)
########################################################################
# reducing the unique count
RAW_X.construction_year = RAW_X.construction_year // 4
RAW_TEST_X.construction_year = RAW_TEST_X.construction_year // 4
########################################################################
# bool columns
tmp = ['public_meeting', 'permit']
for col in tmp:
RAW_X[col] = RAW_X[col].fillna(False)
RAW_TEST_X[col] = RAW_TEST_X[col].fillna(False)
In [10]:
########################################################################
# object columns list
obj_cols = RAW_X.dtypes[RAW_X.dtypes == 'O'].index.tolist()
def text_normalisation(text):
    """Simplify the text format.

    * strip leading/trailing whitespace
    * convert all text to lower case
    * convert NaN/None/empty values to 'other'
    """
    if pd.isnull(text) or str(text).strip() == '':
        return 'other'
    return str(text).strip().lower()
for col in obj_cols:
RAW_X[col] = RAW_X[col].apply(text_normalisation)
RAW_TEST_X[col] = RAW_TEST_X[col].apply(text_normalisation)
print ('Text Trans', col)
# object columns
RAW_X[obj_cols] = RAW_X[obj_cols].fillna('other')
RAW_TEST_X[obj_cols] = RAW_TEST_X[obj_cols].fillna('other')
RAW_X['scheme_management'] = RAW_X['scheme_management'].replace("none", "")
RAW_TEST_X['scheme_management'] = RAW_TEST_X['scheme_management'].replace("none", "")
# "other - mkulima/shinyanga"(Not present in test) to "Other"
RAW_X.extraction_type = RAW_X.extraction_type.replace("other - mkulima/shinyanga", "other")
RAW_TEST_X.extraction_type = RAW_TEST_X.extraction_type.replace("other - mkulima/shinyanga", "other")
# district_code, lga, region_code, may also be a proxy for region.
# num_private, no idea of what it is for
# recorded_by, has no entropy
# wpt_name subvillage ward scheme_name -- too many to handle
CUST_CATEGORY_LABELER is a custom labeller designed to minimise the number of groups (entropy) in the data without much loss of information. For example, a column like funder has around 1,500+ groups. With the custom labeller we found that 95% of the data is covered by fewer than 600 groups, while the remaining ~1,000 groups cover only 5% of the rows.
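The actual implementation lives in scripts/sam_custom_labeler.py and is not shown here; as a rough sketch of the idea (my own illustration, not the author's class), a coverage-based relabelling can be expressed in plain pandas by keeping the most frequent categories that together cover the requested share of rows and mapping everything else to 'other':
def coverage_relabel(series, coverage=75):
    """Keep categories covering `coverage`% of rows; map the rest to 'other'."""
    counts = series.value_counts()
    cumulative = counts.cumsum() / counts.sum() * 100
    keep = set(cumulative[cumulative <= coverage].index)
    return series.where(series.isin(keep), 'other')

# hypothetical usage: funder_reduced = coverage_relabel(RAW_X['funder'], coverage=75)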
In [11]:
########################################################################
# custom labeler
# - for reducing entropy with minimal loss of data
custom_labler = defaultdict(CUST_CATEGORY_LABELER)
# these are the data coverages we are expecting
# '<column name>' : <% of data is to be covered>
tmp = { 'funder': 75,
'installer': 75,
'wpt_name': 75,
# 'subvillage': 75,
# 'ward': 75,
# 'scheme_name': 75,
}
for col, limit in tmp.items():
if col not in obj_cols:
continue
print('--------------------------------------------------------', col)
labler = custom_labler[col]
labler.DATA_COVERAGE_LIMIT = limit
labler.fit(RAW_X[col])
tmp = labler.fit_transform(RAW_X[col])
print(len(RAW_X[col].value_counts()), len(tmp.value_counts()))
RAW_X[col] = tmp
tmp = labler.etransform(RAW_TEST_X[col])
print(len(RAW_TEST_X[col].value_counts()), len(tmp.value_counts()))
print(RAW_X.shape, RAW_TEST_X.shape, all(RAW_X.columns == RAW_TEST_X.columns))
In [12]:
X, y, TEST_X = data_transformations(RAW_X, RAW_y, RAW_TEST_X, pickle_path='tmp/')
In [13]:
# Saving the transformed data sources
df_check_stats(X, y, TEST_X)
sam_pickle_save(X, y, TEST_X, prefix="tmp/Iteration2_final_")
Note: End of Data Transformations (EODT)
Conventions for Algorithms
Score:
In [14]:
########################################################################
# clean up
for each in dir():
    if each not in crazy_list and each != 'crazy_list':
        del globals()[each]  # actually remove the name from the notebook namespace
In [15]:
########################################################################
# Just assigning new names to the transformed dataframe pointers
X, y, TEST_X = sam_pickle_load(prefix='tmp/Iteration2_final_')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
In [16]:
# Benchmark
from sklearn.dummy import DummyClassifier
clf = game(X_train, X_test, y_train, y_test, algo='dc')
In [17]:
# Random Forest
clf = game(X_train, X_test, y_train, y_test, algo='rf')
In [18]:
# Gradient Boosting Trees
clf = game(X_train, X_test, y_train, y_test, algo='gb')
In [19]:
# K Nearest Neighbour
clf = game(X_train, X_test, y_train, y_test, algo='knn')
In [20]:
# Multiclass One vs One - Random Forest
clf = game(X_train, X_test, y_train, y_test, algo='mc_ovo_rf')
In [21]:
# Multiclass One vs All - Random Forest
clf = game(X_train, X_test, y_train, y_test, algo='mc_ova_rf')
In [24]:
# CatBoost
import catboost
# Just assigning new names to the transformed dataframe pointers
X, y, TEST_X = sam_pickle_load(prefix='tmp/Iteration2_final_')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
In [27]:
# Note: CatBoostRegressor treats the encoded labels as a continuous target,
# so .score() below reports R^2 rather than classification accuracy.
clf = catboost.CatBoostRegressor().fit(X_train, y_train)
In [28]:
clf.score(X_train, y_train)
Out[28]:
In [29]:
clf.score(X_test, y_test)
Out[29]:
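Since the target has three discrete classes, the classification counterpart is arguably the better fit. A minimal sketch (not run in the original notebook), assuming the same train/test split as above:
from catboost import CatBoostClassifier

# hypothetical alternative: a classifier reports accuracy instead of R^2
clf_c = CatBoostClassifier(verbose=False).fit(X_train, y_train)
print(clf_c.score(X_train, y_train), clf_c.score(X_test, y_test))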
In [22]:
for each in dir():
    if each not in crazy_list and each != 'crazy_list':
        del globals()[each]  # actually remove the name from the notebook namespace