Introduction: Using the data gathered from Taarifa and the Tanzanian Ministry of Water, can we predict which pumps are functional, which need some repairs, and which don't work at all? Predicting one of these three classes based and a smart understanding of which waterpoints will fail, can improve the maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

This is also an intermediate-level competition by DataDriven! All code & support scripts are in Github Repo

Goal: To do all basic transformation required for further processing Algorithmn selection, Parameter Tuning and if possible threshold check using ROC curve.

Global Imports


In [1]:
import pickle
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from collections import defaultdict
from sklearn.model_selection import train_test_split
from scripts.tools import game
from scripts.sam_custom_labeler import CUST_CATEGORY_LABELER
from scripts.tools import check_metric, data_transformations, df_check_stats
from scripts.tools import sam_pickle_load, sam_pickle_save


%matplotlib inline

np.set_printoptions(precision=5)
np.random.seed(69572)
plt.style.use('ggplot')
sns.set(color_codes=True)

crazy_list = dir()

Data Transformation

From Data Analysis, list of columns Selected for removing

  • scheme_name has too many null values to fix.
  • longitude has zeros which is not possible.
  • public_meetings, permit, amount_tsh, gps_height, population, constructrion_year columns required interfilling of data has lots of outliers(as zeros)
  • wpt_name, ward, subvillage, schema_name, installer, funder has lots of categorical values

Few columns which seems to hold simmilar kind of information

  • extraction_type, extraction_type_group, extraction_type_class
  • management, management_group
  • scheme_management, scheme_name
  • payment, payment_type
  • water_quality, quality_group
  • source, source_type, source_class
  • waterpoint_type, waterpoint_type_group

Geo Location information: All following parameter are availble for same reason, to find the address.

  • longitude, latitude
  • region(region_code)
  • district_code within region
  • ward
  • subvillage

Compared to all other columns regions columns has complete data(no nulls)

Load Data required Data


In [2]:
########################################################################
# clean up
for each in dir():
    if each not in crazy_list:
        del each

########################################################################
# data collection
RAW_X = pd.read_csv('data/traning_set_values.csv', index_col='id')
RAW_y = pd.read_csv('data/training_set_labels.csv', index_col='id')
RAW_TEST_X = pd.read_csv('data/test_set_values.csv', index_col='id')
print('Loaded CSV Files.')

########################################################################
# ward, wpt_name, subvillage, wpt_name too many values counts
# amount_tsh: too many nulls
# num_private: dont know about private pumps

drop_columns = '''
scheme_name
recorded_by
amount_tsh
num_private
region_code
district_code
wpt_name
subvillage
ward
lga
extraction_type_class
extraction_type_group
management_group
payment
water_quality
source_type
source_class
waterpoint_type_group
'''.strip().splitlines()

RAW_X.drop(drop_columns, inplace=True, axis=1)
RAW_TEST_X.drop(drop_columns, inplace=True, axis=1)
print('Dropped following cols', drop_columns)


Loaded CSV Files.
Dropped following cols ['scheme_name', 'recorded_by', 'amount_tsh', 'num_private', 'region_code', 'district_code', 'wpt_name', 'subvillage', 'ward', 'lga', 'extraction_type_class', 'extraction_type_group', 'management_group', 'payment', 'water_quality', 'source_type', 'source_class', 'waterpoint_type_group']

Datetime

Generally date is a composition of 3 details, Year + Month + Day. In generally, overtime logically these possible with either increase or decrease based on the condition governing bodies and their state of growth. So, to make it easy for classifier to identify them, we are seperating them into following columns

  • column for storing year
  • column for storing month
  • columns for number of days, since a specific point of date.(similar to Epoch date)

In [3]:
########################################################################
# date_recorded -- convert it to datetime format
strptime = datetime.datetime.strptime

DATE_FORMAT = "%Y-%m-%d"
REFERENCE_DATE_POINT = strptime('2014-01-01', DATE_FORMAT)

f = lambda x: strptime(str(x), DATE_FORMAT)
RAW_X.date_recorded = RAW_X.date_recorded.apply(f)
RAW_TEST_X.date_recorded = RAW_TEST_X.date_recorded.apply(f)

f = lambda x: x.month
RAW_X['date_recorded_month'] = RAW_X.date_recorded.apply(f)
RAW_TEST_X['date_recorded_month'] = RAW_TEST_X.date_recorded.apply(f)

f = lambda x: (x - REFERENCE_DATE_POINT).days
RAW_X.date_recorded = RAW_X.date_recorded.apply(f)
RAW_TEST_X.date_recorded = RAW_TEST_X.date_recorded.apply(f)

Int Cols

Longitude and Latitude

During the Analysis stage, we have seen that logitude is having zeros and as per Google's map, null coordinates(0, 0) represet a place outside Africa, in sea which is not possible.

As we have a region parameter, to identify logitude suitable and use it. Please use plot comment below for better understanding.


In [4]:
########################################################################
# Longitude & Latitude -- zero values fix

aa = RAW_X['longitude latitude region'.split()].copy()
aa = aa[aa.longitude > 5]

bb = aa.groupby(by=['region']).mean()
bb.columns = ['longitude_mean', 'latitude_mean']
cc = aa.groupby(by=['region']).min()
cc.columns = ['longitude_min', 'latitude_min']
dd = aa.groupby(by=['region']).max()
dd.columns = ['longitude_max', 'latitude_max']

abcd = bb.join(cc).join(dd)[['latitude_max',
 'latitude_mean',
 'latitude_min',
 'longitude_max',
 'longitude_mean',
 'longitude_min']].copy()

# abcd.plot()

abcde = dict(abcd[['latitude_mean', 'longitude_mean']].T).items()

long_lat_reg_dict = {each[0]: {'longitude': each[1][0], 'latitude': each[1][0]} for each in abcde}

def f(row):
#     print (row)
    try:
        if (row['longitude'] < 1):
            reg = row['region']
            row['longitude'] = long_lat_reg_dict[reg]['longitude']
#             row['latitude'] = long_lat_reg_dict[reg]['latitude']
    except KeyError:
        print('KeyError: An expected key(region/longitude/latitude) is missing!')
    return row

RAW_X = RAW_X.apply(f, axis=1)
RAW_TEST_X = RAW_TEST_X.apply(f, axis=1)

Note: only logitude is being replaced by function f(row).


In [5]:
df_check_stats(RAW_X)


Data Frame Shape: (59400, 22) TotColumns: 22 ObjectCols: 0

Additional Cols

We were able to find city poulation details of states of Tanzania and as population needs are in generally has effects of consumption of water usages, we included them to add two new columns.

  • Population02 - Population in 2002
  • Population12 - Population in 2012

In [6]:
########################################################################
## Area & Population
# Source: https://www.citypopulation.de/Tanzania-Cities.html

header = '''Name
Abbr.
Status
Capital
Area
Population_1978-08-26
Population_1988-08-27
Population_2002-08-01
Population_2012-08-26
'''.lower().strip().splitlines()

data = '''Arusha	ARU	Reg	Arusha	37,576	...	744,479	1,288,088	1,694,310
Dar es Salaam	DAR	Reg	Dar es Salaam	1,393	843,090	1,360,850	2,487,288	4,364,541
Dodoma	DOD	Reg	Dodoma	41,311	972,005	1,235,328	1,692,025	2,083,588
Geita	GEI	Reg	Geita	20,054	...	...	1,337,718	1,739,530
Iringa	IRI	Reg	Iringa	35,503	...	...	840,404	941,238
Kagera	KAG	Reg	Bukoba	25,265	...	...	1,791,451	2,458,023
Katavi	KAT	Reg	Mpanda	45,843	...	...	408,609	564,604
Kigoma	KIG	Reg	Kigoma	37,040	648,941	856,770	1,674,047	2,127,930
Kilimanjaro	KIL	Reg	Moshi	13,250	902,437	1,104,673	1,376,702	1,640,087
Lindi	LIN	Reg	Lindi	66,040	527,624	646,494	787,624	864,652
Manyara	MAY	Reg	Babati	44,522	...	603,691	1,037,605	1,425,131
Mara	MAR	Reg	Musoma	21,760	723,827	946,418	1,363,397	1,743,830
Mbeya	MBE	Reg	Mbeya	60,350	1,079,864	1,476,278	2,063,328	2,707,410
Morogoro	MOR	Reg	Morogoro	70,624	939,264	1,220,564	1,753,362	2,218,492
Mtwara	MTW	Reg	Mtwara	16,710	771,818	889,100	1,124,481	1,270,854
Mwanza	MWA	Reg	Mwanza	9,467	...	...	2,058,866	2,772,509
Njombe	NJO	Reg	Njombe	21,347	...	...	648,464	702,097
Pwani	PWA	Reg	Dar es Salaam	32,547	516,586	636,103	885,017	1,098,668
Rukwa	RUK	Reg	Sumbawanga	22,792	...	...	729,060	1,004,539
Ruvuma	RUV	Reg	Songea	63,669	561,575	779,875	1,113,715	1,376,891
Shinyanga	SHI	Reg	Shinyanga	18,901	...	...	1,249,226	1,534,808
Simiyu	SIM	Reg	Bariadi	25,212	...	...	1,317,879	1,584,157
Singida	SIN	Reg	Singida	49,340	613,949	792,387	1,086,748	1,370,637
Tabora	TAB	Reg	Tabora	76,150	817,907	1,036,150	1,710,465	2,291,623
Tanga	TAN	Reg	Tanga	26,677	1,037,767	1,280,212	1,636,280	2,045,205
Zanzibar	ZAN	St	Zanzibar	2,460	476,111	640,685	981,754	1,303,569
'''.strip().splitlines()

# data to dataframe
df = pd.DataFrame([each.split('\t') for each in data])
df.columns = header

# columns
df.area = df.area.apply(lambda x: int(x.replace(',', '')))
df['population_2002-08-01'] = df['population_2002-08-01'].apply(lambda x: int(x.replace(',', '')))
df['population_2012-08-26'] = df['population_2012-08-26'].apply(lambda x: int(x.replace(',', '')))

# drop columns
df.drop(['abbr.', 'capital', 'status', 'population_1978-08-26', 'population_1988-08-27'], axis=1, inplace=True)

# df

# checking if we have all regions info
mm = RAW_X.region.unique()
nn = df.name.tolist()

if [ _ for _ in mm if _ not in nn]:
    print('found some missing values');
    assert False
        
area_pop = dict()

for i, each in df.T.iteritems():
    area_pop[each['name']] = { 'area': each['area'],
                               'population_2002': each['population_2002-08-01'],
                               'population_2012': each['population_2012-08-26']}

def fill_area(region):
    if region in area_pop:
        return area_pop[region]['area']
    else:
        return 0

def fill_pop02(region):
    if region in area_pop:
        return area_pop[region]['population_2002']
    else:
        return 0

def fill_pop12(region):
    if region in area_pop:
        return area_pop[region]['population_2012']
    else:
        return 0
    
    
########################################################################
# Area & Population 2002 & Population 2012
RAW_X['Area'] = RAW_X.region.apply(fill_area)
RAW_X['Population02'] = RAW_X.region.apply(fill_pop02)
RAW_X['Population12'] = RAW_X.region.apply(fill_pop12)

RAW_TEST_X['Area'] = RAW_TEST_X.region.apply(fill_area)
RAW_TEST_X['Population02'] = RAW_TEST_X.region.apply(fill_pop02)
RAW_TEST_X['Population12'] = RAW_TEST_X.region.apply(fill_pop12)

In [7]:
RAW_X.gps_height[RAW_X.gps_height > 1].mean()


Out[7]:
1061.304017952554

In [8]:
########################################################################
# GPS Height
RAW_TEST_X.gps_height[RAW_TEST_X.gps_height < 1] = RAW_X.gps_height[RAW_X.gps_height > 1].mean()
RAW_X.gps_height[RAW_X.gps_height < 1] = RAW_X.gps_height[RAW_X.gps_height > 1].mean()


/Users/sampathm/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
/Users/sampathm/miniconda3/lib/python3.5/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [9]:
########################################################################
# Reducing geo location precision to 11 meters
LONG_LAT_PRECISION = 0.00001
fns_lola =lambda x: (x // LONG_LAT_PRECISION) * LONG_LAT_PRECISION
# Reducing Precision of Lat.
RAW_X.longitude = RAW_X.longitude.apply(fns_lola)
RAW_X.latitude = RAW_X.latitude.apply(fns_lola)
RAW_TEST_X.longitude = RAW_TEST_X.longitude.apply(fns_lola)
RAW_TEST_X.latitude = RAW_TEST_X.latitude.apply(fns_lola)


########################################################################
# reducing the unique count
RAW_X.construction_year = RAW_X.construction_year // 4
RAW_TEST_X.construction_year = RAW_TEST_X.construction_year // 4


########################################################################
# bool columns
tmp = ['public_meeting', 'permit']
for col in tmp:
    RAW_X[col] = RAW_X[col].fillna(False)
    RAW_TEST_X[col] = RAW_TEST_X[col].fillna(False)

Obj Cols


In [10]:
########################################################################
# object columns list
obj_cols = RAW_X.dtypes[RAW_X.dtypes == 'O'].index.tolist()

def text_normalisation(text):
    """Simplify the text formats.
    
    * strip trailing leading space
    * conver all text to lower cases
    * convert nan or None to 'other'
    """
    if text:
        text = str(text).strip().lower()
        return text
    return 'other'

for col in obj_cols:
    RAW_X[col] = RAW_X[col].apply(text_normalisation)
    RAW_TEST_X[col] = RAW_TEST_X[col].apply(text_normalisation)
    print ('Text Trans', col)

# object columns
RAW_X[obj_cols] = RAW_X[obj_cols].fillna('other')
RAW_TEST_X[obj_cols] = RAW_TEST_X[obj_cols].fillna('other')

RAW_X.scheme_management.replace("none", "", inplace=True)
RAW_TEST_X.scheme_management.replace("none", "", inplace=True)

# "other - mkulima/shinyanga"(Not present in test) to "Other"
RAW_X.extraction_type  = RAW_X.extraction_type.replace("other - mkulima/shinyanga", "other")
RAW_TEST_X.extraction_type  = RAW_TEST_X.extraction_type.replace("other - mkulima/shinyanga", "other")

# district_code, lga, region_code, may also be a proxy for region.
# num_private, no idea of what it is for
# recorded_by, has no entropy
# wpt_name subvillage ward scheme_name -- too many to handle


Text Trans funder
Text Trans installer
Text Trans basin
Text Trans region
Text Trans scheme_management
Text Trans extraction_type
Text Trans management
Text Trans payment_type
Text Trans quality_group
Text Trans quantity
Text Trans quantity_group
Text Trans source
Text Trans waterpoint_type

Obj Labeler

CUST_CATEGORY_LABELER is a custom labler desined to minimise the number of groups(entrpy) in the data with out much loss of information.

For example, for a columns like funder has around 1500+ groups. With custom labler, we can check and found that 95% of the data is covered by less than 600 groups while remaining ~1000 groups cover only 5% of information.


In [11]:
########################################################################
# custom labler
# - for reducing entropy with less loss of data.

custom_labler = defaultdict(CUST_CATEGORY_LABELER)

# these are the data coverages we are expecting
#  '<column name>' : <% of data is to be covered>

tmp = { 'funder': 75,
  'installer': 75,
  'wpt_name': 75,
#   'subvillage': 75,
#   'ward': 75,
#   'scheme_name': 75,
  }

for col, limit  in tmp.items():
    if col not in obj_cols:
        continue
    print('--------------------------------------------------------', col)
    labler = custom_labler[col]
    labler.DATA_COVERAGE_LIMIT = limit
    labler.fit(RAW_X[col])
    tmp = labler.fit_transform(RAW_X[col])
    print(len(RAW_X[col].value_counts()), len(tmp.value_counts()))
    RAW_X[col] = tmp
    tmp = labler.etransform(RAW_TEST_X[col])
    print(len(RAW_TEST_X[col].value_counts()), len(tmp.value_counts()))

print(RAW_X.shape, RAW_TEST_X.shape, all(RAW_X.columns == RAW_TEST_X.columns))


-------------------------------------------------------- installer
75 percentage of DATA coverage mean, 59 (in number) groups
1936 60
75 percentage of DATA coverage mean, 59 (in number) groups
980 60
-------------------------------------------------------- funder
75 percentage of DATA coverage mean, 72 (in number) groups
1898 73
75 percentage of DATA coverage mean, 72 (in number) groups
981 73
(59400, 25) (14850, 25) True

In [12]:
X, y, TEST_X = data_transformations(RAW_X, RAW_y, RAW_TEST_X, pickle_path='tmp/')

In [13]:
# Saving the transformed data sources
df_check_stats(X, y, TEST_X)
sam_pickle_save(X, y, TEST_X, prefix="tmp/Iteration2_final_")


Data Frame Shape: (59400, 25) TotColumns: 25 ObjectCols: 0
Numpy Array Size: 59400
Data Frame Shape: (14850, 25) TotColumns: 25 ObjectCols: 0
SAVE PREFIX USED:  tmp/Iteration2_final_

Note: End of Data Transformations



















EODT

Alogrithm Selection

Conventions for Alogrithmns

  • dc: dummy classifier - with strategy as most_frequent(benchmark)
  • rf: random forest trees
  • gb: gradient boosting trees
  • knn: k nearest neightbour(n=3)
  • mc_ovo_rf: multi class, one vs one wrapper over random forest
  • mc_ova_rf: multi class, one vs all wrapper over random forest

Score:

  • AC Score - imples Accuracy Score.
  • F1 Score - F1 Score(micro)

In [14]:
########################################################################
# clean up
for each in dir():
    if each not in crazy_list:
        del each

In [15]:
########################################################################
# Just assining new names to transformed dataframe pointers
X, y, TEST_X = sam_pickle_load(prefix='tmp/Iteration2_final_')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=42, stratify=y)


LOAD PREFIX USED:  tmp/Iteration2_final_

In [16]:
# Benchmark
from sklearn.dummy import DummyClassifier
clf = game(X_train, X_test, y_train, y_test, algo='dc')


Training Scores
------------------------------------------------
AC Score: 0.543075196409 F1 Score: 0.543075196409

Testing Scores
------------------------------------------------
AC Score: 0.543097643098 F1 Score: 0.543097643098

In [17]:
# Random Forest
clf = game(X_train, X_test, y_train, y_test, algo='rf')


Training Scores
------------------------------------------------
AC Score: 0.982356902357 F1 Score: 0.982356902357

Testing Scores
------------------------------------------------
AC Score: 0.798720538721 F1 Score: 0.798720538721

In [18]:
# Gradient Boosting Trees
clf = game(X_train, X_test, y_train, y_test, algo='gb')


Training Scores
------------------------------------------------
AC Score: 0.755824915825 F1 Score: 0.755824915825

Testing Scores
------------------------------------------------
AC Score: 0.751043771044 F1 Score: 0.751043771044

In [19]:
# K Nearest Neighbour
clf = game(X_train, X_test, y_train, y_test, algo='knn')


Training Scores
------------------------------------------------
AC Score: 0.788507295174 F1 Score: 0.788507295174

Testing Scores
------------------------------------------------
AC Score: 0.705521885522 F1 Score: 0.705521885522
## Some Additional Experiments - Out of scope of Project we are trying to look how does multiclass can improve a standard algrithm in real time scenarios.

In [20]:
# Multiclass One vs One - Random Forest
clf = game(X_train, X_test, y_train, y_test, algo='mc_ovo_rf')


Training Scores
------------------------------------------------
AC Score: 0.978900112233 F1 Score: 0.978900112233

Testing Scores
------------------------------------------------
AC Score: 0.799932659933 F1 Score: 0.799932659933

In [21]:
# Multiclass One vs All - Random Forest
clf = game(X_train, X_test, y_train, y_test, algo='mc_ova_rf')


Training Scores
------------------------------------------------
AC Score: 0.987138047138 F1 Score: 0.987138047138

Testing Scores
------------------------------------------------
AC Score: 0.79898989899 F1 Score: 0.79898989899

In [24]:
#
import catboost

# Just assining new names to transformed dataframe pointers
X, y, TEST_X = sam_pickle_load(prefix='tmp/Iteration2_final_')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=42, stratify=y)


LOAD PREFIX USED:  tmp/Iteration2_final_

In [27]:
clf = catboost.CatBoostRegressor().fit(X_train, y_train)

In [28]:
clf.score(X_train, y_train)


Out[28]:
0.48385638677842957

In [29]:
clf.score(X_test, y_test)


Out[29]:
0.49114605036400449

In [22]:
for each in dir():
    if each not in crazy_list:
        del each