Benchmark

Introduction: Using the data gathered from Taarifa and the Tanzanian Ministry of Water, can we predict which pumps are functional, which need some repairs, and which don't work at all? Predicting these three classes, combined with a smart understanding of which waterpoints are likely to fail, can improve maintenance operations and help ensure that clean, potable water remains available to communities across Tanzania.

Goal: To set a benchmark score against which improvements in data quality and choice of algorithm can be measured.

For more details, please check the GitHub repo.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from scripts.tools import data_transformations, df_check_stats, game, sam_pickle_save, check_metric

np.set_printoptions(precision=5)
np.random.seed(69572)
plt.style.use('ggplot')
sns.set(color_codes=True)

%matplotlib inline

In [2]:
# data collection
RAW_X = pd.read_csv('data/traning_set_values.csv', index_col='id')
RAW_y = pd.read_csv('data/training_set_labels.csv', index_col='id')
RAW_TEST_X = pd.read_csv('data/test_set_values.csv', index_col='id')

df_check_stats(RAW_X, RAW_y, RAW_TEST_X)


Data Frame Shape: (59400, 39) TotColumns: 39 ObjectCols: 0
Data Frame Shape: (59400, 1) TotColumns: 1 ObjectCols: 0
Data Frame Shape: (14850, 39) TotColumns: 39 ObjectCols: 0

In [3]:
# bool columns
tmp = ['public_meeting', 'permit']
RAW_X[tmp] = RAW_X[tmp].fillna(True)
RAW_TEST_X[tmp] = RAW_TEST_X[tmp].fillna(True)

# object columns list
obj_cols = RAW_X.dtypes[RAW_X.dtypes == 'O'].index.tolist()

# object columns
RAW_X[obj_cols] = RAW_X[obj_cols].fillna('Other')
RAW_TEST_X[obj_cols] = RAW_TEST_X[obj_cols].fillna('Other')

# Just assigning new names to the transformed dataframe pointers
X, y, TEST_X = data_transformations(RAW_X, RAW_y, RAW_TEST_X)

sam_pickle_save(X, y, TEST_X, prefix="tmp/Iteration0_")


SAVE PREFIX USED:  tmp/Iteration0_
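The `data_transformations` helper lives in the project's `scripts/tools` module and its body is not shown in this notebook. A minimal sketch of the kind of step such a helper would need, assuming it integer-encodes the object columns so that scikit-learn estimators can consume them (the frame and values below are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the raw pump data (hypothetical values).
df = pd.DataFrame({'permit': [True, False, True],
                   'basin': ['Lake Victoria', 'Rufiji', 'Rufiji']})

# Encode each object column to integer codes; bool columns pass through.
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df['basin'].tolist())  # classes are sorted, so -> [0, 1, 1]
```

The real helper may do more (scaling, feature drops, date parsing); this only illustrates the categorical-encoding piece.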

In [4]:
# Train Test Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
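As a quick illustration of what `stratify=y` buys here: each class keeps approximately the same share in the train and test partitions, which matters for an imbalanced three-class target like this one. A toy check on synthetic labels (the class mix below is made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 60/30/10 class mix, mirroring an imbalanced three-class target.
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_demo = np.arange(100).reshape(-1, 1)

_, _, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25,
                                  random_state=42, stratify=y_demo)

# The 25-sample test split preserves the 60/30/10 proportions.
print(yte.size, (yte == 0).mean())
```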

Benchmark Score


In [5]:
# Benchmark: a majority-class classifier
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf.fit(X_train, y_train)  # fit on the training split only, to avoid leakage
y_pred = clf.predict(X_train)
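`DummyClassifier(strategy='most_frequent')` ignores the features entirely and always predicts the majority class seen during fitting, which is why it makes a sensible floor for any real model to beat. A toy example (labels are illustrative, not the actual dataset's):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

Xt = np.zeros((5, 2))
yt = np.array(['functional'] * 3 + ['non functional'] * 2)

dummy = DummyClassifier(strategy='most_frequent').fit(Xt, yt)

# Inputs are irrelevant: the majority label is always returned.
print(dummy.predict(np.ones((2, 2))))  # -> ['functional' 'functional']
```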

In [6]:
print('\nTraining Scores')
_ = check_metric(clf.predict(X_train), y_train)
print('\nTesting Scores')
_ = check_metric(clf.predict(X_test), y_test, show_cm=True)


Training Scores
------------------------------------------------
AC Score: 0.543075196409 F1 Score: 0.543075196409

Testing Scores
------------------------------------------------
/Users/sampathm/miniconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
              f1-score  precision  recall  support
class 0           0.70       0.54    1.00     8065
class 1           0.00       0.00    0.00     1079
class 2           0.00       0.00    0.00     5706
avg / total       0.38       0.29    0.54    14850
------------------------------------------------
AC Score: 0.543097643098 F1 Score: 0.543097643098
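The accuracy (AC) and F1 scores above are identical, which is expected if `check_metric` uses a micro-averaged F1: for single-label multiclass problems, micro-averaged F1 reduces to plain accuracy. A small demonstration (this assumes micro averaging; `check_metric`'s actual implementation is not shown here):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [2, 2, 2, 2, 2, 2]  # a most_frequent-style constant prediction

acc = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average='micro')

print(acc, micro_f1)  # -> 0.5 0.5, always equal for single-label multiclass
```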

In [9]:
# benchmark - rf
clf = game(X_train, X_test, y_train, y_test, algo='rf')


Training Scores
------------------------------------------------
AC Score: 0.984848484848 F1 Score: 0.984848484848

Testing Scores
------------------------------------------------
AC Score: 0.799865319865 F1 Score: 0.799865319865
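`game` is another project helper whose body is not shown. Assuming `algo='rf'` fits a scikit-learn `RandomForestClassifier`, the near-perfect training score next to the ~0.80 test score is the usual signature of a forest that closely fits its training data while still generalising reasonably. A minimal sketch on synthetic data (the data, estimator settings, and helper behaviour are all assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pump features (3 classes, like the target).
Xd, yd = make_classification(n_samples=500, n_classes=3, n_informative=6,
                             random_state=0)
Xtr, Xte, ytr, yte = train_test_split(Xd, yd, random_state=42, stratify=yd)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

train_acc = accuracy_score(ytr, rf.predict(Xtr))  # near-perfect train fit
test_acc = accuracy_score(yte, rf.predict(Xte))   # lower, more honest score
print(train_acc, test_acc)
```

The gap between the two scores is the quantity to watch when tuning the forest against this benchmark.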