Benchmark

Introduction: Using the data gathered from Taarifa and the Tanzanian Ministry of Water, can we predict which pumps are functional, which need some repairs, and which don't work at all? Predicting these three classes, combined with a smart understanding of which waterpoints are likely to fail, can improve maintenance operations and help ensure that clean, potable water remains available to communities across Tanzania.

Goal: To set a benchmark score against which improvements in data quality and choice of algorithm can be measured.

For more details, please check the GitHub repo.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from scripts.tools import data_transformations, df_check_stats, game, sam_pickle_save, check_metric

np.set_printoptions(precision=5)
np.random.seed(69572)
plt.style.use('ggplot')
sns.set(color_codes=True)

%matplotlib inline

In [2]:
# data collection
RAW_X = pd.read_csv('data/traning_set_values.csv', index_col='id')
RAW_y = pd.read_csv('data/training_set_labels.csv', index_col='id')
RAW_TEST_X = pd.read_csv('data/test_set_values.csv', index_col='id')

df_check_stats(RAW_X, RAW_y, RAW_TEST_X)


Data Frame Shape: (59400, 39) TotColumns: 39 ObjectCols: 0
Data Frame Shape: (59400, 1) TotColumns: 1 ObjectCols: 0
Data Frame Shape: (14850, 39) TotColumns: 39 ObjectCols: 0

In [3]:
# bool columns
tmp = ['public_meeting', 'permit']
RAW_X[tmp] = RAW_X[tmp].fillna(True)
RAW_TEST_X[tmp] = RAW_TEST_X[tmp].fillna(True)

# object columns list
obj_cols = RAW_X.dtypes[RAW_X.dtypes == 'O'].index.tolist()

# object columns
RAW_X[obj_cols] = RAW_X[obj_cols].fillna('Other')
RAW_TEST_X[obj_cols] = RAW_TEST_X[obj_cols].fillna('Other')

# Just assigning new names to the transformed dataframe pointers
X, y, TEST_X = data_transformations(RAW_X, RAW_y, RAW_TEST_X)

sam_pickle_save(X, y, TEST_X, prefix="tmp/Iteration0_")


SAVE PREFIX USED:  tmp/Iteration0_
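The `data_transformations` helper lives in the project's `scripts/tools` module and its body is not shown in this notebook. A minimal sketch of the kind of step such a helper would need, assuming it integer-encodes the object columns so that scikit-learn estimators can consume them (the frame and values below are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the raw pump data (hypothetical values).
df = pd.DataFrame({'permit': [True, False, True],
                   'basin': ['Lake Victoria', 'Rufiji', 'Rufiji']})

# Encode each object column to integer codes; bool columns pass through.
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df['basin'].tolist())  # classes are sorted, so -> [0, 1, 1]
```

The real helper may do more (scaling, feature drops, date parsing); this only illustrates the categorical-encoding piece.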

In [4]:
# Train Test Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
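As a quick illustration of what `stratify=y` buys here: each class keeps approximately the same share in the train and test partitions, which matters for an imbalanced three-class target like this one. A toy check on synthetic labels (the class mix below is made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 60/30/10 class mix, mirroring an imbalanced three-class target.
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_demo = np.arange(100).reshape(-1, 1)

_, _, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25,
                                  random_state=42, stratify=y_demo)

# The 25-sample test split preserves the 60/30/10 proportions.
print(yte.size, (yte == 0).mean())
```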

Benchmark Score


In [5]:
# Benchmark: a majority-class classifier
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf.fit(X_train, y_train)  # fit on the training split only, to avoid leakage
y_pred = clf.predict(X_train)
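`DummyClassifier(strategy='most_frequent')` ignores the features entirely and always predicts the majority class seen during fitting, which is why it makes a sensible floor for any real model to beat. A toy example (labels are illustrative, not the actual dataset's):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

Xt = np.zeros((5, 2))
yt = np.array(['functional'] * 3 + ['non functional'] * 2)

dummy = DummyClassifier(strategy='most_frequent').fit(Xt, yt)

# Inputs are irrelevant: the majority label is always returned.
print(dummy.predict(np.ones((2, 2))))  # -> ['functional' 'functional']
```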

In [6]:
print('\nTraining Scores')
_ = check_metric(clf.predict(X_train), y_train)
print('\nTesting Scores')
_ = check_metric(clf.predict(X_test), y_test, show_cm=True)


Training Scores
------------------------------------------------
AC Score: 0.543075196409 F1 Score: 0.543075196409

Testing Scores
------------------------------------------------
/Users/sampathm/miniconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
              f1-score  precision  recall  support
class 0           0.70       0.54    1.00     8065
class 1           0.00       0.00    0.00     1079
class 2           0.00       0.00    0.00     5706
avg / total       0.38       0.29    0.54    14850
------------------------------------------------
AC Score: 0.543097643098 F1 Score: 0.543097643098
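The accuracy (AC) and F1 scores above are identical, which is expected if `check_metric` uses a micro-averaged F1: for single-label multiclass problems, micro-averaged F1 reduces to plain accuracy. A small demonstration (this assumes micro averaging; `check_metric`'s actual implementation is not shown here):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [2, 2, 2, 2, 2, 2]  # a most_frequent-style constant prediction

acc = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average='micro')

print(acc, micro_f1)  # -> 0.5 0.5, always equal for single-label multiclass
```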

In [9]:
# benchmark - rf
clf = game(X_train, X_test, y_train, y_test, algo='rf')


Training Scores
------------------------------------------------
AC Score: 0.984848484848 F1 Score: 0.984848484848

Testing Scores
------------------------------------------------
AC Score: 0.799865319865 F1 Score: 0.799865319865
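`game` is another project helper whose body is not shown. Assuming `algo='rf'` fits a scikit-learn `RandomForestClassifier`, the near-perfect training score next to the ~0.80 test score is the usual signature of a forest that closely fits its training data while still generalising reasonably. A minimal sketch on synthetic data (the data, estimator settings, and helper behaviour are all assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pump features (3 classes, like the target).
Xd, yd = make_classification(n_samples=500, n_classes=3, n_informative=6,
                             random_state=0)
Xtr, Xte, ytr, yte = train_test_split(Xd, yd, random_state=42, stratify=yd)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

train_acc = accuracy_score(ytr, rf.predict(Xtr))  # near-perfect train fit
test_acc = accuracy_score(yte, rf.predict(Xte))   # lower, more honest score
print(train_acc, test_acc)
```

The gap between the two scores is the quantity to watch when tuning the forest against this benchmark.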