Definition of the Problem:

Based on a range of independent variables such as installation date, agency, and type, can we predict whether a given water pump will be: i) functioning, ii) in need of repair, or iii) not functioning? Should we convert these three possibilities to a continuous distribution?

$$ 0 \leq \text{ Not functioning } \leq 0.33 \leq \text{ Needs repair }\leq 0.66 \leq \text{Functioning} \leq 1 $$
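
One way to read the thresholds above is as a lookup from a continuous score back to the three classes. The following is a minimal sketch, assuming the cut points 0.33 and 0.66 from the inequality and the label strings that appear in the training labels file; `score_to_status` is a hypothetical helper name, not part of any library.

```python
import numpy as np

# Cut points taken from the inequality above; labels are ordered lowest score first.
CUTS = np.array([0.33, 0.66])
LABELS = np.array(["non functional", "functional needs repair", "functional"])

def score_to_status(score):
    """Map scores in [0, 1] to one of the three status labels (hypothetical helper)."""
    score = np.asarray(score, dtype=float)
    # np.digitize returns 0, 1 or 2 depending on which interval each score falls in
    return LABELS[np.digitize(score, CUTS)]

statuses = score_to_status([0.10, 0.50, 0.90])
```

A score that sits exactly on a cut point is assigned to the upper interval here; that tie-breaking rule is a choice of this sketch, since the inequality itself leaves it open.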

Can we simplify the question to just a binomial distribution of Functioning/Not Functioning?

Possible models:

1) The probability of failure follows an ordered logistic function of covariates such as age (similar to the Challenger Disaster Homework/BioAssay).

2) The probability of failure is based on a linear combination of parameters (similar to the Maize Weight/Chalk).

3) Naive Bayes Classifiers are not used because...??

1. Linear Model

A linear model would require bounds on each of our parameters in order to obtain a score for functionality between 0 and 1.
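
To illustrate why bounds are needed, the sketch below evaluates an unconstrained linear score on two made-up pumps. The coefficients, the two predictors (age in years, height in metres), and the name `linear_score` are all assumptions for illustration, not fitted values.

```python
import numpy as np

def linear_score(X, beta):
    """Unbounded linear predictor: beta[0] + X . beta[1:] (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X.dot(beta[1:])

# Hypothetical intercept and slopes for (age in years, height in metres)
beta = np.array([0.9, -0.02, -0.0001])
X = np.array([[5.0, 300.0],      # young, low-lying pump
              [60.0, 2500.0]])   # old, high-altitude pump

raw = linear_score(X, beta)        # second entry falls below 0
clipped = np.clip(raw, 0.0, 1.0)   # crude fix: force the score back into [0, 1]
```

Clipping is only one possible workaround; constraining the parameters, as the text suggests, or passing the linear predictor through a link function are alternatives.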

2. Ordered Logistic Model

Unlike the standard logistic model, which only distinguishes two outcomes (0 or 1), the ordered logistic model (ologit) assigns each observation to one of several ordered categories, which we map to our functionality assessment.

Assumptions:

i) at t=0, the pump has a (low) initial probability of not functioning, where y denotes functionality.

ii) as time increases, probability of not functioning increases (parts decay).

iii) as height (h) increases, probability of not functioning increases (increasing remoteness).

iv) as the number of surrounding wells (w) decreases, probability of not functioning increases (this acts as a proxy for relative proximity to population centres; it may be easier to use population directly).
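
The time variable t in these assumptions could be derived from the `date_recorded` and `construction_year` columns of the training file. The sketch below uses three rows copied from the notebook outputs; treating `construction_year == 0` as missing and the column name `age_years` are assumptions.

```python
import pandas as pd

# Three rows copied from the notebook outputs, indexed by id
toy = pd.DataFrame(
    {"date_recorded": ["2011-03-14", "2013-03-06", "2011-07-13"],
     "construction_year": [1999, 2010, 0],
     "gps_height": [1390, 1399, 0]},
    index=pd.Index([69572, 8776, 19728], name="id"))

recorded_year = pd.to_datetime(toy["date_recorded"]).dt.year
# Assumption: a construction year of 0 means "unknown", so the age becomes NaN
built = toy["construction_year"].where(toy["construction_year"] != 0)
toy["age_years"] = recorded_year - built
```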

Given that our functionality score can take any value between 0 and 1, the likelihood is expressed as a skewed normal distribution, on the assumption that wells are more likely to be in a working state (another possibility would be an exponential inverse):

$$ P(y_i| \theta_i) = {\rm Normal}( y_i \vert \theta_i) \,\,\,\, \rm{for}\,\, i=1, \ldots, n$$

where $\theta$ is the equipment decay rate which is modeled as a $\rm{ologit}^{-1}$:

$$\theta_i = \text{equipment decay rate} = \rm{ologit}^{-1}(\beta_0 + t_i\beta_1 + h_i\beta_2 + w_i\beta_3)$$
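
For reference, a standard ordered logit compares the linear predictor $\eta_i = \beta_0 + t_i\beta_1 + h_i\beta_2 + w_i\beta_3$ against cut points $c_1 < c_2$, with $P(y_i \leq k) = \rm{logit}^{-1}(c_k - \eta_i)$. The sketch below computes the resulting category probabilities. The coefficients, cut points, input values, and the category order (functional, needs repair, non functional) are hypothetical choices for illustration, not fitted results.

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

def ordered_logit_probs(eta, cutpoints):
    """Category probabilities for an ordered logit.

    P(y <= k) = inv_logit(cutpoints[k] - eta); differences of consecutive
    cumulative probabilities give P(y == k).
    """
    eta = np.atleast_1d(np.asarray(eta, dtype=float))
    cuts = np.asarray(cutpoints, dtype=float)
    cum = inv_logit(cuts[None, :] - eta[:, None])            # shape (n, K-1)
    pad0 = np.zeros((len(eta), 1))
    pad1 = np.ones((len(eta), 1))
    return np.diff(np.hstack([pad0, cum, pad1]), axis=1)     # shape (n, K)

# Hypothetical beta_0..beta_3; beta_3 < 0 so fewer nearby wells raises eta
beta = np.array([-2.0, 0.05, 0.0005, -0.1])
t = np.array([2.0, 40.0])       # age in years
h = np.array([300.0, 1500.0])   # height
w = np.array([5.0, 0.0])        # number of surrounding wells
eta = beta[0] + beta[1] * t + beta[2] * h + beta[3] * w

# Columns: functional, needs repair, non functional (assumed order)
probs = ordered_logit_probs(eta, cutpoints=[0.0, 1.0])
```

With these made-up numbers the older, higher, more isolated pump gets a larger probability in the last column, which matches assumptions ii) to iv).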

What priors to choose for $\beta_0, \beta_1, \beta_2, \rm{and} \, \beta_3 \,$?

$$ p(\beta_0) \propto \rm{exp}() $$

$$ p(\beta_1, \beta_2, \rm{and} \, \beta_3) \propto 1 $$

Posterior:


In [22]:
from datetime import datetime, date, time
import sys
import numpy as np
import sklearn
import csv
import statsmodels.api as sm
import matplotlib.pyplot as plt

import pandas as pd
from pandas import Series, DataFrame  # Panel is no longer available in current pandas releases

train_file = "WaterPump-training-values.csv"
train_labels = "WaterPump-training-labels.csv"
test_file = "WaterPump-test-values.csv"

data = pd.read_csv(train_file, parse_dates=['date_recorded'], index_col='id') #read into dataframe, parse date_recorded as dates, and set id as index
data.head(20)


Out[22]:
amount_tsh date_recorded funder gps_height installer longitude latitude wpt_name num_private basin ... payment_type water_quality quality_group quantity quantity_group source source_type source_class waterpoint_type waterpoint_type_group
id
69572 6000 2011-03-14 Roman 1390 Roman 34.938093 -9.856322 none 0 Lake Nyasa ... annually soft good enough enough spring spring groundwater communal standpipe communal standpipe
8776 0 2013-03-06 Grumeti 1399 GRUMETI 34.698766 -2.147466 Zahanati 0 Lake Victoria ... never pay soft good insufficient insufficient rainwater harvesting rainwater harvesting surface communal standpipe communal standpipe
34310 25 2013-02-25 Lottery Club 686 World vision 37.460664 -3.821329 Kwa Mahundi 0 Pangani ... per bucket soft good enough enough dam dam surface communal standpipe multiple communal standpipe
67743 0 2013-01-28 Unicef 263 UNICEF 38.486161 -11.155298 Zahanati Ya Nanyumbu 0 Ruvuma / Southern Coast ... never pay soft good dry dry machine dbh borehole groundwater communal standpipe multiple communal standpipe
19728 0 2011-07-13 Action In A 0 Artisan 31.130847 -1.825359 Shuleni 0 Lake Victoria ... never pay soft good seasonal seasonal rainwater harvesting rainwater harvesting surface communal standpipe communal standpipe
9944 20 2011-03-13 Mkinga Distric Coun 0 DWE 39.172796 -4.765587 Tajiri 0 Pangani ... per bucket salty salty enough enough other other unknown communal standpipe multiple communal standpipe
19816 0 2012-10-01 Dwsp 0 DWSP 33.362410 -3.766365 Kwa Ngomho 0 Internal ... never pay soft good enough enough machine dbh borehole groundwater hand pump hand pump
54551 0 2012-10-09 Rwssp 0 DWE 32.620617 -4.226198 Tushirikiane 0 Lake Tanganyika ... unknown milky milky enough enough shallow well shallow well groundwater hand pump hand pump
53934 0 2012-11-03 Wateraid 0 Water Aid 32.711100 -5.146712 Kwa Ramadhan Musa 0 Lake Tanganyika ... never pay salty salty seasonal seasonal machine dbh borehole groundwater hand pump hand pump
46144 0 2011-08-03 Isingiro Ho 0 Artisan 30.626991 -1.257051 Kwapeto 0 Lake Victoria ... never pay soft good enough enough shallow well shallow well groundwater hand pump hand pump
49056 0 2011-02-20 Private 62 Private 39.209518 -7.034139 Mzee Hokororo 0 Wami / Ruvu ... never pay salty salty enough enough machine dbh borehole groundwater other other
50409 200 2013-02-18 Danida 1062 DANIDA 35.770258 -10.574175 Kwa Alid Nchimbi 0 Lake Nyasa ... on failure soft good insufficient insufficient shallow well shallow well groundwater hand pump hand pump
36957 0 2012-10-14 World Vision 0 World vision 33.798106 -3.290194 Pamba 0 Internal ... other soft good enough enough shallow well shallow well groundwater hand pump hand pump
50495 0 2013-03-15 Lawatefuka Water Supply 1368 Lawatefuka water sup 37.092574 -3.181783 Kwa John Izack Mmari 0 Pangani ... monthly soft good enough enough spring spring groundwater communal standpipe communal standpipe
53752 0 2012-10-20 Biore 0 WEDECO 34.364073 -3.629333 Mwabasabi 0 Internal ... never pay soft good enough enough shallow well shallow well groundwater hand pump hand pump
61848 0 2011-08-04 Rudep 1645 DWE 31.444121 -8.274962 Kwa Juvenal Ching'Ombe 0 Lake Tanganyika ... never pay soft good enough enough machine dbh borehole groundwater hand pump hand pump
48451 500 2011-07-04 Unicef 1703 DWE 34.642439 -9.106185 Kwa John Mtenzi 0 Rufiji ... monthly soft good dry dry river river/lake surface communal standpipe communal standpipe
58155 0 2011-09-04 Unicef 1656 DWE 34.569266 -9.085515 Kwa Rose Chaula 0 Rufiji ... on failure soft good dry dry river river/lake surface communal standpipe communal standpipe
34169 0 2011-07-22 Hesawa 1162 DWE 32.920154 -1.947868 Ngomee 0 Lake Victoria ... never pay milky milky insufficient insufficient spring spring groundwater other other
18274 500 2011-02-22 Danida 1763 Danid 34.508967 -9.894412 none 0 Lake Nyasa ... annually soft good enough enough spring spring groundwater communal standpipe communal standpipe

20 rows × 39 columns


In [3]:
sampledata = data[["gps_height", "longitude","latitude","construction_year"]]

In [4]:
print(sampledata.construction_year)


id
69572    1999
8776     2010
34310    2009
67743    1986
19728       0
9944     2009
19816       0
54551       0
53934       0
46144       0
49056    2011
50409    1987
36957       0
50495    2009
53752       0
61848    1991
48451    1978
58155    1978
34169    1999
18274    1992
48375    2008
6091        0
58500    1978
37862    2011
51058    2009
22308    1974
55012    2011
20145       0
19685    2000
69124    2002
         ... 
14796       0
20387       0
29940       0
15233    1988
49651       0
50998    2005
34716    1990
43986       0
38067    2008
58255       0
30647    1999
67885    1992
47002    2008
44616    2008
72148       0
34473    2011
34952    2009
26640    2000
72559    1995
30410    2009
13677    1991
44885    1967
40607       0
48348       0
11164    2007
60739    1999
27263    1996
37057       0
31282       0
26348    2002
Name: construction_year, dtype: int64

In [5]:
sampledata.describe()


Out[5]:
gps_height longitude latitude construction_year
count 59400.000000 59400.000000 5.940000e+04 59400.000000
mean 668.297239 34.077427 -5.706033e+00 1300.652475
std 693.116350 6.567432 2.946019e+00 951.620547
min -90.000000 0.000000 -1.164944e+01 0.000000
25% 0.000000 33.090347 -8.540621e+00 0.000000
50% 369.000000 34.908743 -5.021597e+00 1986.000000
75% 1319.250000 37.178387 -3.326156e+00 2004.000000
max 2770.000000 40.345193 -2.000000e-08 2013.000000

In [26]:
#drop records where construction year or gps height is 0 (treated here as "no value")
x = sampledata[sampledata["construction_year"] != 0]
x = x[x["gps_height"] != 0]
x.describe()


Out[26]:
gps_height longitude latitude construction_year
count 37928.000000 37928.000000 37928.000000 37928.000000
mean 1022.532456 35.941838 -6.247856 1996.862292
std 607.525299 2.564827 2.785097 12.458136
min -63.000000 29.607122 -11.649440 1960.000000
25% 406.000000 34.663640 -8.791236 1988.000000
50% 1168.000000 36.545613 -6.108577 2000.000000
75% 1495.000000 37.754377 -3.616188 2008.000000
max 2770.000000 40.345193 -1.042375 2013.000000


In [28]:
#read in labels - this can then be joined as an extra column to initial dataframe
labels = pd.read_csv(train_labels, index_col="id") 
labels.head(20)


Out[28]:
status_group
id
69572 functional
8776 functional
34310 functional
67743 non functional
19728 functional
9944 functional
19816 non functional
54551 non functional
53934 non functional
46144 functional
49056 functional
50409 functional
36957 functional
50495 functional
53752 functional
61848 functional
48451 non functional
58155 non functional
34169 functional needs repair
18274 functional
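
The join mentioned in the comment above can be sketched as follows. Both frames are indexed by `id`, so `DataFrame.join` lines them up by index. The rows are copied from the notebook outputs; the variable names `values`, `status`, and `joined` are placeholders.

```python
import pandas as pd

# Three rows copied from the outputs above, indexed by id as in the notebook
ids = pd.Index([69572, 8776, 67743], name="id")
values = pd.DataFrame(
    {"gps_height": [1390, 1399, 263], "construction_year": [1999, 2010, 1986]},
    index=ids)
# Deliberately in a different row order to show that alignment is by id
status = pd.DataFrame(
    {"status_group": ["non functional", "functional", "functional"]},
    index=pd.Index([67743, 69572, 8776], name="id"))

# join aligns on the index, so row order in the two files does not matter
joined = values.join(status, how="inner")
```

An inner join keeps only ids present in both frames, which also suits a filtered frame such as `x`.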

In [31]:
#TODO: decide how to convert the text labels into integer codes - the dummy (one-hot) encoding below is not yet what we want
dummy_ranks = pd.get_dummies(labels['status_group'], prefix='status_group')
print(dummy_ranks.head())


       status_group_functional  status_group_functional needs repair  \
id                                                                     
69572                        1                                     0   
8776                         1                                     0   
34310                        1                                     0   
67743                        0                                     0   
19728                        1                                     0   

       status_group_non functional  
id                                  
69572                            0  
8776                             0  
34310                            0  
67743                            1  
19728                            0  
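
One possible way to get integer codes instead of dummy columns is an explicit mapping. The sketch below assumes an ordering from worst to best, and for the binary simplification it assumes that "functional needs repair" counts as not functioning; both are modelling choices still to be decided. The column names `status_code` and `is_functional` are hypothetical.

```python
import pandas as pd

# A few labels copied from the output of the labels cell
labels_demo = pd.DataFrame(
    {"status_group": ["functional", "non functional", "functional needs repair"]},
    index=pd.Index([69572, 67743, 34169], name="id"))

# Assumed ordering, worst to best
STATUS_ORDER = {"non functional": 0, "functional needs repair": 1, "functional": 2}
labels_demo["status_code"] = labels_demo["status_group"].map(STATUS_ORDER)

# Assumed binary simplification: only fully "functional" pumps count as 1
labels_demo["is_functional"] = (labels_demo["status_group"] == "functional").astype(int)
```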
