AM207 Project Proposal: Predicting Water Well Failure

Sam Kim, Harvard College '15

samuelkim@college.harvard.edu

Gareth Haslam, Ext. School

haslam.gareth@gmail.com

Date: 10th April 2015

Abstract

Predicting failure of water wells is an important issue for millions of people in developing countries who rely on these wells for clean drinking water. Using the dataset collected on Tanzanian water wells, available as part of a DrivenData.org challenge, we aim to use the Bayesian methods learned in AM207 to investigate how to better predict failure. Our second goal is to use information about the location of the wells to optimise their maintenance schedules by finding the shortest path between non-functioning or damaged wells. We will approach this second task using the stochastic optimisation methods developed in class.

Introduction

We have decided to tackle the challenge presented by DrivenData at http://www.drivendata.org/competitions/7/page/23/. The data is provided by Taarifa, an open source platform for reporting infrastructure issues, and the Tanzanian Ministry of Water. The goal is to predict the operating condition of a waterpoint for each record in the dataset. The possible classifications are "functional," "non functional," and "functional needs repair." Being able to predict which waterpoints will fail will improve maintenance operations and infrastructure upkeep, helping to ensure that communities across Tanzania have access to clean water without spending resources constantly monitoring each waterpoint. Whilst the data provided is specific to Tanzania, it is hoped that any methods for fault prediction could be used more broadly in similar situations.

Bayesian methods have been applied successfully to other prediction problems, such as box office prediction as described by Kapelner and Bleich (2014) (http://arxiv.org/pdf/1312.2171.pdf), and can have better predictive power than other machine learning techniques such as support vector machines (SVM). We hope to use similar techniques in this problem.

In addition to simply predicting the functionality of the water wells, we also hope to provide insight into how the various attributes relate to the functionality of the water well. Bayesian techniques allow us to build models, learn the parameters of those models, and then find probability distributions for the dependent variable. This is one of the advantages of Bayesian techniques compared to other machine learning techniques, which are often black box models that do not provide intuitive insight into how the parameters relate to the dependent variable, and simply give an answer rather than a probability distribution.

Depending on the success of the first part of the project, we may also attempt to extend the project by recommending an optimal route for the engineering team to service non-functioning and damaged wells. This part is outside the original competition objectives in the DrivenData challenge but seems to be a useful and logical extension. Optimising the route can have further benefits in terms of minimising the down-time of the pumps and reducing the maintenance costs of the water supply.

Data

The data has been provided by the challenge, and includes 39 different variables on funding of the well, what kind of pump is operating, its location, when it was installed, population data around the well, quantity of water, cost of water, and how it is managed. There are 59400 data points available for training, which is more than enough for meaningful cross-validation tests.

Methodology

Well Functional Status Prediction

We plan on using Bayesian modeling to predict the functional status of each waterpoint as a function of all the variables. Bayes' theorem gives us:

$$P(status|\Theta)\propto P(\Theta|status)P(status)$$

where $\Theta$ describes the attributes from our data. In the simplest case (naive Bayes), the likelihood is simply a product of the conditional probabilities for each attribute, $P(\theta_i|status)$. If we want to build a more complex model, we would include parameters $\alpha_i$ to relate the attributes to each other and to the status, such as $P(broken|age)=\alpha_1+\alpha_2\cdot age$.
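As a minimal sketch of the simplest case described above, the following estimates $P(status)$ and each $P(\theta_i|status)$ by counting and then normalises their product. The toy records, the attribute names (`pump`, `water`), and the Laplace smoothing constant are illustrative assumptions, not values taken from the competition data.

```python
from collections import Counter, defaultdict
import math

def fit_naive_bayes(records, labels, smoothing=1.0):
    """Count-based estimates of P(status) and P(theta_i | status)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute, status) -> value counts
    values_seen = defaultdict(set)        # attribute -> distinct values
    for record, status in zip(records, labels):
        for attr, value in record.items():
            value_counts[(attr, status)][value] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, values_seen, smoothing

def predict_proba(model, record):
    """Normalised posterior P(status | record) under the independence assumption."""
    class_counts, value_counts, values_seen, smoothing = model
    total = sum(class_counts.values())
    log_post = {}
    for status, count in class_counts.items():
        logp = math.log(count / total)    # log prior P(status)
        for attr, value in record.items():
            num = value_counts[(attr, status)][value] + smoothing
            den = count + smoothing * len(values_seen[attr])
            logp += math.log(num / den)   # log P(theta_i | status), Laplace-smoothed
        log_post[status] = logp
    top = max(log_post.values())
    unnorm = {s: math.exp(lp - top) for s, lp in log_post.items()}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

# Hypothetical toy records (not from the real dataset).
records = [
    {"pump": "hand", "water": "enough"},
    {"pump": "hand", "water": "dry"},
    {"pump": "motor", "water": "enough"},
    {"pump": "motor", "water": "dry"},
    {"pump": "hand", "water": "enough"},
    {"pump": "motor", "water": "dry"},
]
labels = ["functional", "non functional", "functional",
          "non functional", "functional", "non functional"]

model = fit_naive_bayes(records, labels)
probs = predict_proba(model, {"pump": "hand", "water": "dry"})
print(probs)
```

The output is a probability for every status rather than a single label, which is the property the proposal relies on when comparing Bayesian models with black box classifiers.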

Because we have very little experience in this field and know little about the model, the parameters $\alpha_i$ would also have distributions with hyperparameters, $\beta_i$. This chain can extend as far as we want. In the end, we are sampling from the joint posterior distribution for all the parameters, $\theta_i, \alpha_i, \beta_i, ...$, which is $P(\Theta, \alpha, \beta,... | status)$, to build the model, and then using this distribution to predict the status for unknown data. Sampling the joint posterior distribution can be done through any of the methods taught in class, including Metropolis-Hastings and its numerous variants, Gibbs sampling, slice sampling, and so on.
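To illustrate the sampling step, here is a minimal random-walk Metropolis sketch for the two parameters $\alpha_1, \alpha_2$ of a model for $P(broken|age)$. Several things here are assumptions made for illustration only: the linear form is wrapped in a logistic function so that the probability stays in $[0, 1]$, the priors are zero-mean Gaussians with a fixed standard deviation (a full hierarchical model would sample that hyperparameter as well), age is measured in decades, and the data are synthetic.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_posterior(alpha, ages, broken, prior_sd=10.0):
    """Log posterior (up to a constant) with Gaussian priors on alpha."""
    a1, a2 = alpha
    logp = -(a1 ** 2 + a2 ** 2) / (2 * prior_sd ** 2)
    for age, y in zip(ages, broken):
        p = min(max(sigmoid(a1 + a2 * age), 1e-12), 1 - 1e-12)
        logp += math.log(p) if y else math.log(1 - p)
    return logp

def metropolis(ages, broken, n_samples=3000, step=0.3, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = (0.0, 0.0)
    current_lp = log_posterior(current, ages, broken)
    samples, accepted = [], 0
    for _ in range(n_samples):
        proposal = (current[0] + rng.gauss(0, step),
                    current[1] + rng.gauss(0, step))
        proposal_lp = log_posterior(proposal, ages, broken)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        samples.append(current)
    return samples, accepted / n_samples

# Synthetic data: older wells are more likely to be broken (assumed, not real).
data_rng = random.Random(1)
ages = [data_rng.uniform(0, 4) for _ in range(150)]          # decades
broken = [data_rng.random() < sigmoid(-2.0 + 1.0 * a) for a in ages]

samples, acceptance_rate = metropolis(ages, broken)
kept = samples[500:]                                          # discard burn-in
mean_a2 = sum(s[1] for s in kept) / len(kept)
print(f"acceptance rate {acceptance_rate:.2f}, posterior mean of alpha_2 {mean_a2:.2f}")
```

The retained samples approximate the joint posterior of the parameters, so predictions for a new well can be averaged over them instead of relying on a single point estimate.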

Part of the challenge is building meaningful models. Dimension reduction techniques from machine learning, such as principal component analysis (PCA), can be used to identify the most meaningful attributes and how they relate to each other.
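As a rough illustration of what PCA would report, the sketch below finds the leading principal component of a small numeric table by power iteration on the covariance matrix. It assumes the attributes have already been encoded numerically, uses made-up data in which the second column is roughly twice the first, and avoids external libraries; in practice a library routine would normally be used instead.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Leading principal direction and its share of the total variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(v[a] * cov_v[a] for a in range(d))
    total_variance = sum(cov[a][a] for a in range(d))
    return v, eigenvalue / total_variance

# Made-up numeric attributes: column 1 is about twice column 0, column 2 is near-constant.
rows = [[float(i), 2.0 * i + (-1) ** i * 0.1, (i % 3) * 0.01] for i in range(20)]

direction, explained = first_principal_component(rows)
print(direction, explained)
```

Here almost all of the variance lies along one direction that mixes the first two columns, which is the kind of redundancy between attributes that PCA would help reveal.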

Optimal Maintenance Route

The optimal route for the maintenance workers to visit all the problem sites, indexed $0, \ldots, n$, is mathematically defined as follows:

$$\min \sum_{i=0}^{n} \sum_{j=0,\, j\neq i}^{n} c_{ij}x_{ij}$$

subject to

$$x_{ij} \in \{0,1\} \qquad i,j = 0, \ldots, n$$

$$\sum_{i=0,\, i\neq j}^{n} x_{ij} = 1 \qquad j = 0, \ldots, n$$

$$\sum_{j=0,\, j\neq i}^{n} x_{ij} = 1 \qquad i = 0, \ldots, n$$

$$u_i - u_j + n x_{ij} \leq n - 1 \qquad 1 \leq i \neq j \leq n$$

where:

$\bullet \ x_{ij}$ is a binary decision variable about whether to go from location $i$ to location $j$

$\bullet \ c_{ij}$ is the distance between the two locations $i$ and $j$.

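$\bullet \ u_i$ is an auxiliary variable giving the position of location $i$ in the tour.

$\bullet \ $ the objective function is the total distance of the legs that are chosen.

$\bullet \ $ the constraints ensure that each location is visited once and only once, and the final inequality rules out disconnected sub-tours.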
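To make the formulation above concrete, the following sketch turns a candidate tour into the binary matrix $x_{ij}$, evaluates the objective, and checks the constraints. The four site coordinates and the use of straight-line distance for $c_{ij}$ are assumptions chosen purely for illustration.

```python
import math
from itertools import permutations

def tour_to_x(tour):
    """x[i][j] = 1 if the tour travels directly from location i to location j."""
    size = len(tour)
    x = [[0] * size for _ in range(size)]
    for k in range(size):
        x[tour[k]][tour[(k + 1) % size]] = 1
    return x

def objective(x, c):
    """Total distance of the chosen legs."""
    size = len(x)
    return sum(c[i][j] * x[i][j] for i in range(size) for j in range(size) if i != j)

def satisfies_degree_constraints(x):
    """Each location is entered exactly once and left exactly once."""
    size = len(x)
    entered_once = all(sum(x[i][j] for i in range(size) if i != j) == 1 for j in range(size))
    left_once = all(sum(x[i][j] for j in range(size) if j != i) == 1 for i in range(size))
    return entered_once and left_once

def mtz_holds(x, u):
    """u_i - u_j + n * x_ij <= n - 1 for all 1 <= i != j <= n (u[0] is unused)."""
    n = len(x) - 1
    return all(u[i] - u[j] + n * x[i][j] <= n - 1
               for i in range(1, n + 1) for j in range(1, n + 1) if i != j)

# Hypothetical sites on a unit square; location 0 is the starting point.
coords = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
c = [[math.hypot(a[0] - b[0], a[1] - b[1]) for b in coords] for a in coords]

x_square = tour_to_x([0, 1, 2, 3])
square_cost = objective(x_square, c)

# Two disconnected loops (0 <-> 1 and 2 <-> 3) pass the degree constraints...
x_loops = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
# ...but no ordering u drawn from the permutations of 1..n satisfies the last inequality.
any_ordering_ok = any(mtz_holds(x_loops, [0] + list(p)) for p in permutations([1, 2, 3]))

print(square_cost, mtz_holds(x_square, [0, 1, 2, 3]), any_ordering_ok)
```

The second example shows why the degree constraints alone are not enough: without the inequality on $u_i$, a solution could split into separate loops that never connect.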



In [3]:

    
from IPython.display import Image
Image(filename="PumpMap.png")

    Out[3]:

[Figure: PumpMap.png, a map of the pump locations across Tanzania]

Shown above are the locations of all 59400 pumps (in blue) across Tanzania (highlighted in red). The coverage of the country is quite dense, so we will first need to select which sites need to be visited, perhaps based on the predictions of our Bayesian analysis of pump failure. Then we will need to decide the number of maintenance crews, their locations, and the number of sites they can visit in a day. The data breaks the locations down by Region and Ward, so it seems sensible to size the maintenance crews in proportion to the number of waterpoints in each region. Routing each crew then becomes the classic Travelling Salesman Problem, which can be tackled with simulated annealing in order to arrive at a ‘good enough’ solution for the most efficient route for our maintenance crews.
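A minimal sketch of the simulated annealing approach is given below. It assumes great-circle (haversine) distances between latitude and longitude pairs, a closed tour that starts and ends at location 0, segment-reversal moves, and a geometric cooling schedule; the site coordinates are randomly generated placeholders inside a rough bounding box, and the temperature settings are untuned guesses.

```python
import math
import random

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))

def tour_length(tour, dist):
    """Length of the closed tour, including the leg back to the start."""
    return sum(dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))

def anneal(dist, t_start=100.0, t_end=1e-3, cooling=0.999, seed=0):
    """Simulated annealing over tours using segment-reversal proposals."""
    rng = random.Random(seed)
    size = len(dist)
    tour = list(range(size))
    length = tour_length(tour, dist)
    best, best_length = tour[:], length
    t = t_start
    while t > t_end:
        # Reverse a random segment; position 0 is never moved.
        i, j = sorted(rng.sample(range(1, size), 2))
        candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
        cand_length = tour_length(candidate, dist)
        delta = cand_length - length
        # Always accept improvements; accept worse tours with probability exp(-delta / t).
        if delta < 0 or rng.random() < math.exp(-delta / t):
            tour, length = candidate, cand_length
            if length < best_length:
                best, best_length = tour[:], length
        t *= cooling
    return best, best_length

# Placeholder sites (not real waterpoints).
site_rng = random.Random(42)
sites = [(site_rng.uniform(-11, -1), site_rng.uniform(30, 40)) for _ in range(12)]
dist = [[haversine_km(a, b) for b in sites] for a in sites]

initial_length = tour_length(list(range(len(sites))), dist)
best, best_length = anneal(dist)
print(f"initial {initial_length:.0f} km, annealed {best_length:.0f} km")
```

Accepting some worse tours while the temperature is high lets the search escape poor local arrangements, which is why the result is a good route rather than a guaranteed optimum.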

Literature review

A large body of work exists in this area, and whilst the dataset is currently being studied as part of the competition mentioned above, we hope that the added challenge of the optimal maintenance schedule will bring further insights. Previous related work has been conducted by former Harvard SEAS master's students (Bull and Slavitt, 2014) [1], who looked at prediction of conflict and optimal aid delivery in Uganda using the ACLED (Armed Conflict Location and Event Data) dataset.

[1] P.J. Bull and I.M. Slavitt, "Modelling Civil Conflict and Refugee Relief Aid in Uganda," unpublished, Harvard University (2014). http://pjbull.github.io/civil_conflict/

[2] A. Kapelner and J. Bleich, "bartMachine: Machine Learning with Bayesian Additive Regression Trees," arXiv:1312.2171 (2014). http://arxiv.org/pdf/1312.2171.pdf

[3] J. Eliashberg, "Green-lighting Movie Scripts: Revenue Forecasting and Risk Management," Ph.D. thesis, University of Pennsylvania (2010).


