In [1]:
from IPython.display import display, display_markdown
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy.stats as stats
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
This study will focus on identifying the functional status (functional, needs repair or non-functional) of Tanzanian water pumps. The possible explanatory variables will be location, construction year, funder, type of extraction, water quality and quantity, population served, management organization and payment method.
I picked this challenge from the DrivenData competition list because it shows a direct and practical application of how statistical analysis can help improve the quality of products and services. As an engineer, those goals will definitely be the basis of any data science case I will have to solve. Moreover, as lots of possible explanatory variables are available, this gives me the chance to apply advanced tools I learned during the Data Analysis and Interpretation online Specialization.
Accurately predicting the functional status of water pumps will help plan maintenance earlier. That in turn will increase the availability of the water points and thus the quality of life of the people depending on those water supplies.
In [2]:
training_data = pd.read_csv('training_set_values.csv', index_col=0)
training_label = pd.read_csv('training_set_labels.csv', index_col=0)
test_data = pd.read_csv('test_set_values.csv', index_col=0)
# Merge test data and training data to apply same data management operations on them
data = training_data.append(test_data).sort_index()
The functional status of the water points is categorized into three groups: functional, functional needs repair and non functional.
The potential predictors will be: the amount of water available, the GPS coordinates (height, longitude and latitude), the basin, the region, the population, the public meeting flag, the management group, the permit status, the construction year, the extraction type, the payment type, the water quality and quantity, the source type, the waterpoint type, the installer and the funder.
From the various funder and installer names, the following categories will be created:
'organisation' : ('bank', 'msf', 'wwf', 'unicef', 'unisef', 'oxfam', 'oxfarm', 'rotary club', "lion's club", 'care', 'without', 'action contre la faim', 'rain', 'red cross', 'blue angels', 'fundat', 'foundation'),
'church' : ('church', 'churc', 'rcchurch', 'roman', 'missionsry', 'lutheran', 'islamic', 'islam', 'vision'),
'private' : ('consulting', 'engineer', 'private', 'ltd', 'co.ltd', 'contractor', 'enterp', 'enterpr', 'company', 'contract'),
'community' : ('village', 'community', 'communit', 'district', 'council', 'commu', 'villigers', 'villagers'),
'government' : ('government', 'gov', 'govt', 'gover', 'gove', 'governme', 'ministry'),
'other' : ('0', 'nan', 'known', 'other', 'unknown'),
'danida' : ('danida', 'danid'),
'foreign government' : ('netherlands', 'germany', 'european')
Then the nine most frequent funders will be kept and the others will be gathered in the other category.
As the Python package sklearn cannot handle non-binary categorical variables, those variables will be expanded into as many new dichotomous variables as there are categories. Therefore the number of potential explanatory variables will be huge. So, as a preprocessing step, a random forest will be used to select only the variables having a substantial effect.
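To illustrate that expansion on a toy example (a minimal sketch with made-up category values, not the project data), the label_binarize helper used below turns one categorical column into one 0/1 column per category:

# One dichotomous column per category: gravity, handpump, motorpump
toy = pd.Series(['gravity', 'handpump', 'gravity', 'motorpump'])
toy_classes = toy.unique()
print(pd.DataFrame(preprocessing.label_binarize(toy, classes=toy_classes), columns=toy_classes))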
In [3]:
# Lots of waterpoints are missing a value for amount_tsh. For that field the missing
# values are replaced by the mean so that fewer records are dropped when fitting the model
imp = preprocessing.Imputer(missing_values=0, strategy='mean')
imp.fit(data['amount_tsh'].values.reshape(-1, 1))
data['water_amount'] = imp.transform(data['amount_tsh'].values.reshape(-1, 1)).ravel()
imp = preprocessing.Imputer(missing_values=0, strategy='median')
imp.fit(data['construction_year'].values.reshape(-1, 1))
data['construction_year'] = imp.transform(data['construction_year'].values.reshape(-1, 1)).ravel()
imp = preprocessing.Imputer(missing_values=0, strategy='mean')
imp.fit(data['gps_height'].values.reshape(-1, 1))
data['height'] = imp.transform(data['gps_height'].values.reshape(-1, 1)).ravel()
# Recode missing data as NaN
for field in ('longitude', 'latitude'):
    data[field] = data[field].map(lambda x: x if x else pd.np.nan)

def group_installer(data):
    def gather_installer(x):
        installer_map = {
            'organisation' : ('bank', 'msf', 'wwf', 'unicef', 'unisef', 'oxfam', 'oxfarm', 'club', 'care', 'without', 'faim', 'rain', 'red', 'angels', 'fundat', 'foundation'),
            'church' : ('church', 'churc', 'rcchurch', 'roman', 'missionsry', 'lutheran', 'islamic', 'islam', 'vision'),
            'private' : ('consulting', 'engineer', 'private', 'ltd', 'co.ltd', 'contractor', 'enterp', 'enterpr', 'company', 'contract'),
            'community' : ('village', 'community', 'communit', 'district', 'council', 'commu', 'villigers', 'villagers'),
            'government' : ('government', 'gov', 'govt', 'gover', 'gove', 'governme', 'ministry'),
            'other' : ('0', 'nan', 'known', 'other', 'unknown'),  # Group 'unknown' data with 'other' as finally this means the same for interpretation
            'danida' : ('danida', 'danid'),
            'foreign government' : ('netherlands', 'germany', 'european')
        }
        for substr in x.split():
            for subsubstr in substr.split('/'):
                for key in installer_map:
                    if subsubstr in installer_map[key]:
                        return key
        return x

    lower_data = data.map(lambda x: str(x).lower())
    tmp_data = lower_data.map(gather_installer)
    top10 = list(tmp_data.value_counts()[:10].index)
    return tmp_data.map(lambda x: x if x in top10 else 'other')
data['installer'] = group_installer(data.installer)
data['funder'] = group_installer(data.funder)
clean_data = (data.iloc[training_data.index]
              .join(training_label['status_group'])
              .dropna())
# Create two columns one collapsing 'functional' and 'functional needs repair'
# and the other one collapsing 'non functional' and 'functional needs repair'
clean_data['functional'] = clean_data['status_group'].map({'functional' : 1,
                                                           'functional needs repair' : 1,
                                                           'non functional' : 0})
clean_data['no_repairs'] = clean_data['status_group'].map({'functional' : 1,
                                                           'functional needs repair' : 0,
                                                           'non functional' : 0})
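As a quick sanity check (a minimal sketch, assuming the cell above has been run; the exact counts depend on the data), the grouped funder and installer categories can be inspected with frequency tables:

# Each variable should now contain at most the ten most frequent categories plus 'other'
print(data['installer'].value_counts())
print(data['funder'].value_counts())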
In [4]:
# Extract predictors and convert categorical variables in dichotomic variables
predictors_name = ['water_amount', 'height', 'longitude', 'latitude',
'basin', 'region', 'population', 'public_meeting', 'management_group',
'permit', 'construction_year', 'extraction_type_class', 'payment_type',
'quality_group', 'quantity_group', 'source_type', 'waterpoint_type_group',
'installer', 'funder']
categorical_predictors = ('basin', 'region', 'management_group', 'extraction_type_class',
'payment_type', 'quality_group', 'quantity_group',
'source_type', 'waterpoint_type_group', 'installer', 'funder')
process_data = pd.DataFrame()
for name in predictors_name:
    if name in categorical_predictors:
        classes = data[name].unique()
        deployed_categories = preprocessing.label_binarize(data[name], classes=classes)
        # Avoid class name collision
        classe_names = list()
        for c in classes:
            if c in process_data.columns:
                classe_names.append('_'.join((c, name)))
            else:
                classe_names.append(c)
        tmp_df = pd.DataFrame(deployed_categories,
                              columns=classe_names,
                              index=data.index)
        process_data = process_data.join(tmp_df)
    else:
        process_data[name] = data[name]
predictors_columns = process_data.columns
deployed_data = (process_data.iloc[training_data.index]
                 .join(training_label['status_group'])
                 .dropna())
# Create two columns one collapsing 'functional' and 'functional needs repair'
# and the other one collapsing 'non functional' and 'functional needs repair'
deployed_data['functional'] = deployed_data['status_group'].map({'functional' : 1,
                                                                 'functional needs repair' : 1,
                                                                 'non functional' : 0})
deployed_data['no_repairs'] = deployed_data['status_group'].map({'functional' : 1,
                                                                 'functional needs repair' : 0,
                                                                 'non functional' : 0})
predictors = deployed_data[predictors_columns]
The distributions of the response and explanatory variables will be evaluated by looking at the frequency tables for categorical variables and by calculating statistical values (mean, standard deviation, minimum and maximum) for quantitative variables.
The response variable being categorical, bivariate associations will be visualized using bar charts after collapsing categories if needed. The possible bivariate associations will be tested using chi-square tests.
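As a minimal sketch of such a test (assuming the clean_data frame built above; quantity_group is just one example predictor), the association can be checked with a chi-square test of independence on a contingency table:

# Contingency table of one categorical predictor against the collapsed functional status
ct = pd.crosstab(clean_data['quantity_group'], clean_data['functional'])
chi2, p, dof, expected = stats.chi2_contingency(ct)
print('chi-square = {:.1f}, p-value = {:.3g}'.format(chi2, p))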
The random forest method will be applied to identify the best subset of predictors. The DrivenData competition has split the database into a training set containing 80% of the records; the remaining 20% are kept for testing by submission on the website. As multiple submissions are allowed for the competition, the accuracy of the model will be tested by submitting the predictions carried out on the test data.
First a random forest test was performed to limit the number of explanatory variables. From that first analysis (see the table below), the following explanatory variables are kept: height, longitude, latitude, population, permit, construction year, extraction type, payment type, quantity and waterpoint type.
Although the GPS coordinates are important, the administrative division (such as the geographic region) has low importance. It also seems that the way the water point was funded and installed, and how it is managed, are not of great importance. Some natural guesses like the quantity of water, the population living around the point and the year of construction come forward in the random forest test.
In [5]:
# fit an Extra Trees model to the data and look at the first 15 important fields
model = ExtraTreesClassifier()
model.fit(predictors, deployed_data['status_group'])
# display the relative importance of each attribute
cm = sns.light_palette("yellow", as_cmap=True)
display(pd.Series(model.feature_importances_, index=predictors.columns, name='importance')
.sort_values(ascending=False)
.to_frame()
.iloc[:15])
display_markdown("> Table 1 : The 15 most important features in the dataset.", raw=True)
In [6]:
# Extract predictors and convert categorical variables in dichotomic variables
predictors_name = ['height', 'longitude', 'latitude', 'population',
'permit', 'construction_year', 'extraction_type_class', 'payment_type',
'quantity_group', 'waterpoint_type_group']
categorical_predictors = ('extraction_type_class', 'payment_type', 'quantity_group',
'waterpoint_type_group')
process_data = pd.DataFrame()
for name in predictors_name:
    if name in categorical_predictors:
        classes = data[name].unique()
        deployed_categories = preprocessing.label_binarize(data[name], classes=classes)
        # Avoid class name collision
        classe_names = list()
        for c in classes:
            if c in process_data.columns or c == 'other':
                classe_names.append('_'.join((c, name)))
            else:
                classe_names.append(c)
        tmp_df = pd.DataFrame(deployed_categories,
                              columns=classe_names,
                              index=data.index)
        process_data = process_data.join(tmp_df)
    else:
        process_data[name] = data[name]
predictors_columns = process_data.columns
deployed_data = (process_data.iloc[training_data.index]
                 .join(training_label['status_group'])
                 .dropna())
# Create two columns one collapsing 'functional' and 'functional needs repair'
# and the other one collapsing 'non functional' and 'functional needs repair'
deployed_data['functional'] = deployed_data['status_group'].map({'functional' : 1,
                                                                 'functional needs repair' : 1,
                                                                 'non functional' : 0})
deployed_data['no_repairs'] = deployed_data['status_group'].map({'functional' : 1,
                                                                 'functional needs repair' : 0,
                                                                 'non functional' : 0})
predictors = deployed_data[predictors_columns]
In the training data set, 54.3% (N=32259) of the water points are functional, 7.3% (N=4317) need repair and 38.4% (N=22824) are non functional.
For those water points, the quantity of water available is enough for 55.9% (N=41522), insufficient for 25.4% (N=18896) and dry for 10.5% (N=7782). The quantity is unknown for 1.3% of the data (N=975).
The majority of the points are communal standpipes (58.2%, N=43239). The second most common type is the hand pump (29.5%, N=21884).
The methods used to extract the water are mostly gravity (44.8%, N=33263) and hand pumps (27.7%, N=20612).
To get water, people are usually not paying (42.7%, N=31712). For the points where people do pay, they pay per bucket (15.2%, N=11266) or through recurrent payments: monthly for 14% (N=10397) or annually for 6.1% (N=4570). The payment method is unknown for 13.7% of the cases (N=10149).
The majority of the water points were constructed with a permit (65.4%, N=48606). But 29.4% (N=21851) were built without one. And the permit status is unknown for 5.1% of the water points (N=3793).
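A minimal sketch of how these frequency tables can be produced (using training_label for the response and data for the predictors, as defined above):

# Counts and relative frequencies for the response variable
print(training_label['status_group'].value_counts())
print(training_label['status_group'].value_counts(normalize=True))
# Relative frequencies for the categorical predictors over the whole dataset
for field in ('quantity_group', 'waterpoint_type_group', 'extraction_type_class',
              'payment_type', 'permit'):
    print(data[field].value_counts(normalize=True, dropna=False))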
The distributions of the quantitative variables are presented in the table below.
In [7]:
pd.set_option('display.float_format', lambda x: '{:.5g}'.format(x))
quantitative_var = dict()
for field in ('gps_height', 'latitude', 'longitude', 'construction_year', 'population'):
    if field == 'gps_height':
        field_name = 'height'
    else:
        field_name = ' '.join(field.split('_'))
    clean_field = training_data[field].map(lambda x: x if abs(x)>1e-8 else pd.np.nan)
    clean_field = clean_field.dropna()
    quantitative_var[field_name] = clean_field.describe()

(pd.DataFrame(quantitative_var)
 .loc[['count', 'mean', 'std', 'min', 'max']]
 .T)
Out[7]:
The figures below show the mean value of the functional variable (0 = non functional, 1 otherwise) for the different categorical variables.
The major conclusions about these associations were drawn using post hoc chi-square tests.
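The post hoc tests themselves are not shown in the notebook; a minimal sketch of how the pairwise comparisons can be run with a Bonferroni-adjusted threshold (assuming clean_data as above; quantity_group is shown as an example) could be:

from itertools import combinations

field = 'quantity_group'
pairs = list(combinations(clean_data[field].unique(), 2))
alpha = 0.05 / len(pairs)  # Bonferroni adjustment for the number of pairwise tests
for cat1, cat2 in pairs:
    subset = clean_data[clean_data[field].isin((cat1, cat2))]
    ct = pd.crosstab(subset[field], subset['functional'])
    chi2, p, dof, expected = stats.chi2_contingency(ct)
    print('{} vs {}: p-value = {:.3g} ({}significant)'.format(
        cat1, cat2, p, '' if p < alpha else 'not '))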
In [8]:
fig, axes = plt.subplots(3, 2,
                         sharey=True,
                         gridspec_kw=dict(hspace=0.285),
                         figsize=(10, 16.5))
axes = axes.ravel()
for i, field in enumerate(('extraction_type_class', 'waterpoint_type_group', 'payment_type',
                           'quantity_group', 'permit')):
    field_name = ' '.join(field.split('_'))
    var_analysis = clean_data[['status_group', 'functional', 'no_repairs', field]]
    ax = sns.barplot(x=field, y='functional', data=var_analysis, ci=None, ax=axes[i])
    ax.set_xlabel(field_name)
    if i % 2 == 0:
        ax.set_ylabel('functional vs non functional')
    else:
        ax.set_ylabel('')
    lbls = ['\n'.join(l.get_text().split()) for l in ax.get_xticklabels()]
    if len(lbls) > 5:
        ax.set_xticklabels(lbls, rotation=60)
axes[5].set_visible(False)
fig.suptitle('Functional waterpoint proportion per categorical fields', fontsize=14)
plt.subplots_adjust(top=0.97)
plt.show();
To visualize the influence of the quantitative variables on the functional status of the water points, the quantitative variables have been collapsed into two bins, with the median value as the separation.
Using chi-square tests, all variables have a significant relationship with the response variable. Waterpoints at higher altitude are more likely to be functional (p-value = 2e-57). Those located more towards the eastern side of Tanzania are less likely to be functional (p-value = 0.003). The water points constructed after 2000 are in better functional condition (p-value ≈ 0). And those sustaining a higher population tend to be less functional (p-value = 2.5e-13).
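A minimal sketch of how those chi-square tests can be reproduced (assuming clean_data as above; zeros are treated as missing values, as in the plotting cell below):

for field in ('gps_height', 'longitude', 'construction_year', 'population'):
    clean_field = clean_data[field].map(lambda x: x if abs(x) > 1e-8 else pd.np.nan)
    var_analysis = clean_data[['functional']].join(clean_field).dropna()
    # Collapse the quantitative variable into two bins split at the median
    var_analysis['bin2'] = pd.qcut(var_analysis[field], 2, labels=['low', 'high'])
    ct = pd.crosstab(var_analysis['bin2'], var_analysis['functional'])
    chi2, p, dof, expected = stats.chi2_contingency(ct)
    print('{}: chi-square = {:.1f}, p-value = {:.3g}'.format(field, chi2, p))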
In [9]:
fig, axes = plt.subplots(2, 2,
                         sharey=True,
                         gridspec_kw=dict(hspace=0.12),
                         figsize=(10, 11))
axes = axes.ravel()
for i, field in enumerate(('gps_height', 'longitude', 'construction_year', 'population')):
    if field == 'gps_height':
        field_name = 'height'
    else:
        field_name = ' '.join(field.split('_'))
    var_analysis = clean_data[['status_group', 'functional', 'no_repairs']]
    clean_field = clean_data[field].map(lambda x: x if abs(x)>1e-8 else pd.np.nan)
    var_analysis = var_analysis.join(clean_field).dropna()
    var_analysis[field+'grp2'] = pd.qcut(var_analysis[field],
                                         2,
                                         labels=["50th%tile",
                                                 "100th%tile"])
    #                                    4,
    #                                    labels=["25th%tile", "50th%tile",
    #                                            "75th%tile", "100th%tile"])
    ax = sns.barplot(x=field+'grp2', y='functional', data=var_analysis, ci=None, ax=axes[i])
    ax.set_xlabel(field_name)
    if i % 2 == 0:
        ax.set_ylabel('functional vs non functional')
    else:
        ax.set_ylabel('')
fig.suptitle('Functional waterpoint proportion per quantitative field quartile', fontsize=14)
plt.subplots_adjust(top=0.95)
plt.show();
With the subset of explanatory variables selected, we can split the data to estimate the number of trees needed to stabilize the accuracy. By taking 60% of the available data as the training set, the accuracy of the random forest stabilizes once the number of trees exceeds 23, as shown in the figure below.
In [10]:
pd.np.random.seed(12345)
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors,
                                                              deployed_data['status_group'],
                                                              test_size=.4)
trees = range(1, 31)
accuracy = pd.np.zeros(len(trees))
for idx in trees:
    classifier = RandomForestClassifier(n_estimators=idx)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx-1] = accuracy_score(tar_test, predictions)
plt.plot(trees, accuracy)
plt.xlabel("Number of trees")
plt.ylabel("Accuracy score")
plt.show();
So I ran a random forest with 25 trees on all the training data and submitted the resulting predictions on DrivenData.org. I got an accuracy score of 76.86%.
In [11]:
model = RandomForestClassifier(n_estimators=25)
model = model.fit(predictors, deployed_data['status_group'])
clean_test_data = process_data.iloc[test_data.index].dropna()
predictions = model.predict(clean_test_data)
pred = pd.Series(predictions, index=clean_test_data.index, name='status_group')
missing_index = list()
for i in test_data.index:
    if i not in clean_test_data.index:
        missing_index.append(i)
data_list = list()
pd.np.random.seed(12345)
for rnd in pd.np.random.rand(len(missing_index)):
    if rnd < 0.072677:
        data_list.append('functional needs repair')
    elif rnd < 0.384242 + 0.072677:
        data_list.append('non functional')
    else:
        data_list.append('functional')
fill = pd.Series(data_list, index=missing_index)
pred = pred.append(fill)
to_file = pred[test_data.index]
to_file.to_csv('randomForest.csv', index_label='id', header=('status_group',))
This project used a random forest to identify the variables that most influence the functional status of Tanzanian water pumps, from the characteristics of N=74250 water points recorded between October 2002 and December 2013 by the Tanzanian Ministry of Water. Around 55% of the pumps are working properly, 7% are in need of repair and 38% are non functional.
Applying the random forest test, the number of potential explanatory variables was reduced from 20 to 10 by looking at the importance of each feature. The most influential variables are the GPS coordinates (longitude, latitude and height). Then come the quantity of water available, the population living around the pumps, the type of extraction and the year of construction.
The random forest using 25 trees had an accuracy score of 76.9% when tested against the DrivenData test set. The optimal number of trees was found by optimizing the accuracy score against the number of trees after dividing the provided data into two groups: 60% to train the method and 40% to test it. As the best score obtained was around 78.9%, it can be said that the model will predict new data fairly well.
From the feature importance calculation, it can be concluded that an improved water-pump repair policy should not dispatch teams evenly across the country, as the GPS coordinates greatly influence the water pump status. The primary targets should then be chosen based on the population size living around the waterpoint and its year of construction.
Although lots of parameters have been recorded for this analysis, it is possible that a factor not considered here is important and is confounding the factors reported here.
From the analysis, the funder and the installer do not seem to have a big impact on the functional status. But as those two categories contain a wide variety of answers (some containing spelling mistakes or abbreviations), a deeper analysis should be carried out to gather the various actors into meaningful categories. Right now some doubt remains about a potential confounding effect. Some statistically important parameters (population, height and construction year) have lots of missing data. In this study, the missing data for those variables were filled with their mean or median values to avoid dropping too many records. Collecting the missing values would help improve the accuracy. Therefore, adding records and filling in the missing values should be the priority of any additional effort to improve the predictive algorithm.
The Jupyter notebook used to generate this final report is available here: https://github.com/fcollonval/coursera_data_visualization/blob/master/WaterPumpsPrediction.ipynb.
In [13]:
!jupyter nbconvert --to html --template full_nice WaterPumpsPrediction.ipynb