Introduction: Using the data gathered from Taarifa and the Tanzanian Ministry of Water, can we predict which pumps are functional, which need some repairs, and which don't work at all? The task is to predict one of these three classes for each waterpoint. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.
Goal: Set a benchmark against which later data-quality improvements can be measured, and find the best-suited algorithm.
For more details, please check the GitHub repo.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from scripts.tools import data_transformations, df_check_stats, game, sam_pickle_save, check_metric
np.set_printoptions(precision=5)
np.random.seed(69572)
plt.style.use('ggplot')
sns.set(color_codes=True)
%matplotlib inline
In [2]:
# data collection
RAW_X = pd.read_csv('data/traning_set_values.csv', index_col='id')
RAW_y = pd.read_csv('data/training_set_labels.csv', index_col='id')
RAW_TEST_X = pd.read_csv('data/test_set_values.csv', index_col='id')
df_check_stats(RAW_X, RAW_y, RAW_TEST_X)
In [3]:
# bool columns
tmp = ['public_meeting', 'permit']
RAW_X[tmp] = RAW_X[tmp].fillna(True)
RAW_TEST_X[tmp] = RAW_TEST_X[tmp].fillna(True)
# object columns list
obj_cols = RAW_X.dtypes[RAW_X.dtypes == 'O'].index.tolist()
# object columns
RAW_X[obj_cols] = RAW_X[obj_cols].fillna('Other')
RAW_TEST_X[obj_cols] = RAW_TEST_X[obj_cols].fillna('Other')
# Just assigning new names to the transformed dataframe pointers
X, y, TEST_X = data_transformations(RAW_X, RAW_y, RAW_TEST_X)
sam_pickle_save(X, y, TEST_X, prefix="tmp/Iteration0_")
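The imputation in the cell above can be illustrated on a toy frame. This is only a sketch: the rows and the `toy` name are made up for illustration, and the columns are built with an explicit object dtype so that the dtype check behaves the same way as in the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw waterpoint data (all values are made up).
toy = pd.DataFrame({
    'permit': pd.Series([True, np.nan, False], dtype=object),
    'funder': pd.Series(['Danida', np.nan, 'Unicef'], dtype=object),
    'gps_height': [1390, 0, 686],
})

# Boolean-like columns: a missing flag is treated as True, as in the cell above.
bool_cols = ['permit']
toy[bool_cols] = toy[bool_cols].fillna(True)

# Object (string) columns: missing values become the extra category 'Other'.
obj_cols = toy.dtypes[toy.dtypes == 'O'].index.tolist()
toy[obj_cols] = toy[obj_cols].fillna('Other')

print(toy)
```

Filling booleans with `True` and strings with `'Other'` keeps every row, at the cost of treating "unknown" as if it were a real value; whether that is acceptable depends on how often the values are missing.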
In [4]:
# Train Test Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
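To see what `stratify=y` does in the split above, here is a small sketch on made-up labels (the `X_toy` and `y_toy` arrays are assumptions for illustration, not the competition data). It checks that the class shares are the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 60% class 0, 30% class 1, 10% class 2.
y_toy = np.array([0] * 72 + [1] * 36 + [2] * 12)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

# Share of each class in the two splits; stratification keeps them aligned.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print('train:', train_share)
print('test: ', test_share)
```

Without stratification, a rare class can end up over- or under-represented in the test split, which makes scores on that class noisy.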
In [5]:
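# Benchmark: always predict the most frequent class
clf = DummyClassifier(strategy='most_frequent', random_state=0)
# Fit on the training split only, so the test split stays unseen
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)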
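As a sanity check on what this benchmark measures, the sketch below uses made-up labels (the label strings and counts are illustrative assumptions) to show that the accuracy of a `most_frequent` dummy classifier is simply the share of the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 6 of 10 belong to the majority class.
y_toy = np.array(['functional'] * 6
                 + ['non functional'] * 3
                 + ['functional needs repair'])
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the features

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_acc = baseline.score(X_toy, y_toy)
majority_share = (y_toy == 'functional').mean()
print(baseline_acc, majority_share)
```

Any real model has to beat this number to show that it has learned something beyond the class distribution.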
In [6]:
print('\nTraining Scores')
_ = check_metric(clf.predict(X_train), y_train)
print('\nTesting Scores')
_ = check_metric(clf.predict(X_test), y_test, show_cm=True)
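`check_metric` is a project helper whose implementation is not shown here. As a rough sketch of the kind of numbers such a helper might report (an assumption, not its actual behaviour), accuracy and a confusion matrix can be computed directly with scikit-learn on made-up predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up ground truth and the constant prediction of a most-frequent baseline.
y_true = [0, 0, 0, 1, 1, 2]
y_hat = [0, 0, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_hat)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])

print('accuracy:', acc)
print(cm)
```

For a constant predictor, every count in the confusion matrix falls in a single column, which makes it obvious that the minority classes are never predicted.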
In [9]:
# benchmark - rf
clf = game(X_train, X_test, y_train, y_test, algo='rf')
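`game` is also a project helper, and what `algo='rf'` does internally is not shown here. Assuming it trains a random forest and scores it on the held-out split, a minimal stand-alone sketch on synthetic data (all names and settings below are illustrative) could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the transformed features.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn)

# Random forest with default-like settings, compared against the dummy baseline.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
dummy = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)

rf_acc = rf.score(X_te, y_te)
dummy_acc = dummy.score(X_te, y_te)
print('random forest:', rf_acc)
print('dummy baseline:', dummy_acc)
```

Reporting both scores side by side shows how much of the random forest's accuracy comes from the features rather than from the class distribution alone.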