Definition of the Problem:

Based on a range of independent variables such as installation date, agency, and type, can we predict whether a given water pump is: i) functioning, ii) in need of repair, or iii) not functioning? Should we convert these three possibilities to a continuous distribution?

$$ 0 \leq \text{ Not functioning } \leq 0.33 \leq \text{ Needs repair }\leq 0.66 \leq \text{Functioning} \leq 1 $$

Can we simplify the question to a binary outcome of Functioning/Not Functioning?
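To make the two candidate encodings concrete, here is a minimal sketch. It assumes the label strings match those in the training-labels file; the midpoint scores (0.165, 0.495, 0.83) and the choice to count "needs repair" as not functioning in the binary case are illustrative assumptions, not decisions made above.

```python
import pandas as pd

# Toy labels using the three status strings from the training-labels file.
status = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"],
    name="status_group",
)

# Assumption: score each class by the midpoint of its interval on [0, 1].
score_map = {
    "non functional": 0.165,
    "functional needs repair": 0.495,
    "functional": 0.83,
}
score = status.map(score_map)

# Binary simplification: 1 only if the pump is fully functional
# (assumption: "needs repair" is counted as not functioning).
is_functional = (status == "functional").astype(int)

print(pd.DataFrame({"status": status, "score": score, "binary": is_functional}))
```

The binary version throws away the distinction between a pump that needs repair and one that has failed, which may or may not matter for the question being asked.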

Possible models:

1) The probability of failure is based on an ordered logistic function of covariates such as age (similar to the Challenger Disaster Homework/BioAssay).

2) The probability of failure is based on a linear combination of parameters (similar to the Maize Weight/Chalk).

3) Naive Bayes Classifiers are not used because...??

1. Linear Model

A linear model would require bounds on each of our parameters in order to obtain a score for functionality between 0 and 1.

2. Ordered Logistic Model

Unlike the standard logistic model, which only provides outcomes of 0 or 1, the ordered logistic model (ologit) assigns each outcome to one of an ordered series of categories, which we map onto our functionality assessment.
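As a rough illustration of how an ordered logistic model turns one linear predictor into probabilities for three ordered categories, the sketch below uses the common cumulative-logit form, $P(y \leq k) = \rm{logit}^{-1}(c_k - \eta)$. The function name `ordered_logit_probs` and the cutpoint values are hypothetical and chosen only for the example.

```python
import numpy as np

def ordered_logit_probs(eta, cutpoints):
    """Category probabilities for an ordered logit model.

    eta       : linear predictor (scalar or 1-D array)
    cutpoints : increasing thresholds c_1 < ... < c_{K-1}
    Returns an array of shape (n, K) whose rows sum to 1.
    """
    eta = np.atleast_1d(np.asarray(eta, dtype=float))
    cuts = np.asarray(cutpoints, dtype=float)
    # P(y <= k) = logistic(c_k - eta) for each threshold k
    cdf = 1.0 / (1.0 + np.exp(-(cuts[None, :] - eta[:, None])))
    # Pad with 0 and 1 so successive differences give the K category probabilities
    cdf = np.hstack([np.zeros((len(eta), 1)), cdf, np.ones((len(eta), 1))])
    return np.diff(cdf, axis=1)

# Hypothetical cutpoints separating
# non functional | needs repair | functional
probs = ordered_logit_probs(eta=[-2.0, 0.0, 2.0], cutpoints=[-1.0, 1.0])
print(probs.round(3))
```

With this parametrisation a larger linear predictor shifts probability towards the last category, so the sign convention of the coefficients depends on whether the categories are ordered from worst to best or the reverse.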

Assumptions:

i) at t=0, a pump has an initial (low) probability of not functioning.

ii) as time (t) increases, the probability of not functioning increases (parts decay).

iii) as height (h) increases, the probability of not functioning increases (increasing remoteness).

iv) as the number of surrounding wells (w) decreases, the probability of not functioning increases. This acts as a proxy for proximity to population centres; population itself may be an easier way of getting this.
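The three covariates in these assumptions could be derived from the raw columns along the following lines. This is a sketch on toy rows: it assumes a `construction_year` of 0 means "unknown", and the neighbour count (pumps sharing a 0.1-degree grid cell) is a crude stand-in invented for illustration, not a method taken from the notes above.

```python
import pandas as pd

# Toy rows with the column names used in the training-values file.
df = pd.DataFrame(
    {
        "date_recorded": pd.to_datetime(["2011-03-14", "2013-03-06", "2011-07-13"]),
        "construction_year": [1999, 2010, 0],
        "gps_height": [1390, 1399, 0],
        "longitude": [34.938093, 34.698766, 31.130847],
        "latitude": [-9.856322, -2.147466, -1.825359],
    }
)

# t: age in years at the time of the survey. A construction_year of 0 is
# assumed to mean "unknown" and becomes NaN instead of an age of ~2000 years.
built = df["construction_year"].where(df["construction_year"] > 0)
df["age"] = df["date_recorded"].dt.year - built

# h: height, taken directly from gps_height.
df["h"] = df["gps_height"]

# w: crude neighbour count (assumption): number of other pumps that fall in
# the same 0.1-degree latitude/longitude grid cell.
df["cell_lat"] = df["latitude"].round(1)
df["cell_lon"] = df["longitude"].round(1)
df["w"] = df.groupby(["cell_lat", "cell_lon"])["latitude"].transform("size") - 1

print(df[["age", "h", "w"]])
```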

Given that our functionality score can take any value between 0 and 1, the likelihood is expressed as a skewed normal distribution, on the assumption that wells are more likely to be in a working state (another possibility would be an exponential inverse):

$$ P(y_i| \theta_i) = {\rm Normal}( y_i \vert \theta_i) \,\,\,\, \rm{for}\,\, i=1, \ldots, n$$

where $\theta_i$ is the equipment decay rate, which is modeled as an $\rm{ologit}^{-1}$ of the covariates:

$$\theta_i = \text{equipment decay rate} = \rm{ologit}^{-1}(\beta_0 + t_i\beta_1 + h_i\beta_2 + w_i\beta_3)$$
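A minimal numerical sketch of this expression follows, using a plain inverse logit in place of $\rm{ologit}^{-1}$ for simplicity. The coefficient values are hypothetical and chosen only so that their signs agree with assumptions ii) to iv); the covariate values are toy numbers.

```python
import numpy as np

def inv_logit(x):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical coefficients, chosen only to match the signs implied by
# assumptions ii)-iv): decay grows with age and height, falls with nearby wells.
beta0, beta1, beta2, beta3 = -3.0, 0.08, 0.0005, -0.1

t = np.array([1.0, 12.0, 35.0])        # age in years
h = np.array([200.0, 1390.0, 1700.0])  # height (gps_height)
w = np.array([10.0, 3.0, 0.0])         # number of surrounding wells

theta = inv_logit(beta0 + beta1 * t + beta2 * h + beta3 * w)
print(theta.round(3))
```

Because the covariates sit on very different scales (years, hundreds of height units, a handful of wells), standardising them before fitting would probably make the coefficients easier to interpret and to put priors on.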

What priors to choose for $\beta_0, \beta_1, \beta_2, \rm{and} \, \beta_3 \,$?

$$ p(\beta_0) \propto \rm{exp}() $$

$$ p(\beta_1, \beta_2, \beta_3) \propto 1 $$

Posterior:

Use something from here: http://blog.yhathq.com/posts/logistic-regression-and-python.html
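Until a fitting routine is chosen, the shape of the posterior can be explored by hand. The sketch below is a simplification: it uses a binary outcome (1 = not functioning), a single covariate (age), and flat priors on both coefficients, so the log posterior equals the Bernoulli log likelihood. The data are toy values, and `log_posterior` is a name introduced here for illustration.

```python
import numpy as np

def log_posterior(beta, X, y):
    """Unnormalised log posterior for binary logistic regression.

    Assumes flat priors on every coefficient, so this equals the
    Bernoulli log likelihood. X has a leading column of ones.
    """
    eta = X @ beta
    # log p(y | eta) = y * eta - log(1 + exp(eta)), computed stably
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Toy data: intercept plus pump age; y = 1 means "not functioning".
X = np.column_stack([np.ones(6), [1.0, 3.0, 8.0, 15.0, 25.0, 40.0]])
y = np.array([0, 0, 0, 1, 0, 1])

# Evaluate on a coarse grid and report the highest-scoring pair.
b0_grid = np.linspace(-6, 2, 81)
b1_grid = np.linspace(-0.5, 0.5, 101)
scores = np.array([[log_posterior(np.array([b0, b1]), X, y) for b1 in b1_grid]
                   for b0 in b0_grid])
i, j = np.unravel_index(scores.argmax(), scores.shape)
print("approximate posterior mode:", b0_grid[i], b1_grid[j])
```

A grid is only practical for two or three coefficients; with more covariates an optimiser or a sampler would be needed.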



In [22]:

    
from datetime import datetime, date, time
import sys
import csv

import numpy as np
import pandas as pd
import sklearn
import statsmodels.api as sm
import matplotlib.pyplot as plt
from pandas import Series, DataFrame  # Panel is no longer available in current pandas releases

train_file = "WaterPump-training-values.csv"
train_labels = "WaterPump-training-labels.csv"
test_file = "WaterPump-test-values.csv"

# Read into a DataFrame, parse the date_recorded column as dates, and set id as the index
data = pd.read_csv(train_file, parse_dates=["date_recorded"], index_col="id")
data.head(20)









    Out[22]:

| id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | basin | ... | payment_type | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 69572 | 6000 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | Lake Nyasa | ... | annually | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 8776 | 0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | Lake Victoria | ... | never pay | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
| 34310 | 25 | 2013-02-25 | Lottery Club | 686 | World vision | 37.460664 | -3.821329 | Kwa Mahundi | 0 | Pangani | ... | per bucket | soft | good | enough | enough | dam | dam | surface | communal standpipe multiple | communal standpipe |
| 67743 | 0 | 2013-01-28 | Unicef | 263 | UNICEF | 38.486161 | -11.155298 | Zahanati Ya Nanyumbu | 0 | Ruvuma / Southern Coast | ... | never pay | soft | good | dry | dry | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe |
| 19728 | 0 | 2011-07-13 | Action In A | 0 | Artisan | 31.130847 | -1.825359 | Shuleni | 0 | Lake Victoria | ... | never pay | soft | good | seasonal | seasonal | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
| 9944 | 20 | 2011-03-13 | Mkinga Distric Coun | 0 | DWE | 39.172796 | -4.765587 | Tajiri | 0 | Pangani | ... | per bucket | salty | salty | enough | enough | other | other | unknown | communal standpipe multiple | communal standpipe |
| 19816 | 0 | 2012-10-01 | Dwsp | 0 | DWSP | 33.362410 | -3.766365 | Kwa Ngomho | 0 | Internal | ... | never pay | soft | good | enough | enough | machine dbh | borehole | groundwater | hand pump | hand pump |
| 54551 | 0 | 2012-10-09 | Rwssp | 0 | DWE | 32.620617 | -4.226198 | Tushirikiane | 0 | Lake Tanganyika | ... | unknown | milky | milky | enough | enough | shallow well | shallow well | groundwater | hand pump | hand pump |
| 53934 | 0 | 2012-11-03 | Wateraid | 0 | Water Aid | 32.711100 | -5.146712 | Kwa Ramadhan Musa | 0 | Lake Tanganyika | ... | never pay | salty | salty | seasonal | seasonal | machine dbh | borehole | groundwater | hand pump | hand pump |
| 46144 | 0 | 2011-08-03 | Isingiro Ho | 0 | Artisan | 30.626991 | -1.257051 | Kwapeto | 0 | Lake Victoria | ... | never pay | soft | good | enough | enough | shallow well | shallow well | groundwater | hand pump | hand pump |
| 49056 | 0 | 2011-02-20 | Private | 62 | Private | 39.209518 | -7.034139 | Mzee Hokororo | 0 | Wami / Ruvu | ... | never pay | salty | salty | enough | enough | machine dbh | borehole | groundwater | other | other |
| 50409 | 200 | 2013-02-18 | Danida | 1062 | DANIDA | 35.770258 | -10.574175 | Kwa Alid Nchimbi | 0 | Lake Nyasa | ... | on failure | soft | good | insufficient | insufficient | shallow well | shallow well | groundwater | hand pump | hand pump |
| 36957 | 0 | 2012-10-14 | World Vision | 0 | World vision | 33.798106 | -3.290194 | Pamba | 0 | Internal | ... | other | soft | good | enough | enough | shallow well | shallow well | groundwater | hand pump | hand pump |
| 50495 | 0 | 2013-03-15 | Lawatefuka Water Supply | 1368 | Lawatefuka water sup | 37.092574 | -3.181783 | Kwa John Izack Mmari | 0 | Pangani | ... | monthly | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 53752 | 0 | 2012-10-20 | Biore | 0 | WEDECO | 34.364073 | -3.629333 | Mwabasabi | 0 | Internal | ... | never pay | soft | good | enough | enough | shallow well | shallow well | groundwater | hand pump | hand pump |
| 61848 | 0 | 2011-08-04 | Rudep | 1645 | DWE | 31.444121 | -8.274962 | Kwa Juvenal Ching'Ombe | 0 | Lake Tanganyika | ... | never pay | soft | good | enough | enough | machine dbh | borehole | groundwater | hand pump | hand pump |
| 48451 | 500 | 2011-07-04 | Unicef | 1703 | DWE | 34.642439 | -9.106185 | Kwa John Mtenzi | 0 | Rufiji | ... | monthly | soft | good | dry | dry | river | river/lake | surface | communal standpipe | communal standpipe |
| 58155 | 0 | 2011-09-04 | Unicef | 1656 | DWE | 34.569266 | -9.085515 | Kwa Rose Chaula | 0 | Rufiji | ... | on failure | soft | good | dry | dry | river | river/lake | surface | communal standpipe | communal standpipe |
| 34169 | 0 | 2011-07-22 | Hesawa | 1162 | DWE | 32.920154 | -1.947868 | Ngomee | 0 | Lake Victoria | ... | never pay | milky | milky | insufficient | insufficient | spring | spring | groundwater | other | other |
| 18274 | 500 | 2011-02-22 | Danida | 1763 | Danid | 34.508967 | -9.894412 | none | 0 | Lake Nyasa | ... | annually | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |

20 rows × 39 columns



In [3]:

    
sampledata = data[["gps_height", "longitude","latitude","construction_year"]]



In [4]:

    
print(sampledata.construction_year)









    



id
69572    1999
8776     2010
34310    2009
67743    1986
19728       0
9944     2009
19816       0
54551       0
53934       0
46144       0
49056    2011
50409    1987
36957       0
50495    2009
53752       0
61848    1991
48451    1978
58155    1978
34169    1999
18274    1992
48375    2008
6091        0
58500    1978
37862    2011
51058    2009
22308    1974
55012    2011
20145       0
19685    2000
69124    2002
         ... 
14796       0
20387       0
29940       0
15233    1988
49651       0
50998    2005
34716    1990
43986       0
38067    2008
58255       0
30647    1999
67885    1992
47002    2008
44616    2008
72148       0
34473    2011
34952    2009
26640    2000
72559    1995
30410    2009
13677    1991
44885    1967
40607       0
48348       0
11164    2007
60739    1999
27263    1996
37057       0
31282       0
26348    2002
Name: construction_year, dtype: int64



In [5]:

    
sampledata.describe()









    Out[5]:

| | gps_height | longitude | latitude | construction_year |
|---|---|---|---|---|
| count | 59400.000000 | 59400.000000 | 5.940000e+04 | 59400.000000 |
| mean | 668.297239 | 34.077427 | -5.706033e+00 | 1300.652475 |
| std | 693.116350 | 6.567432 | 2.946019e+00 | 951.620547 |
| min | -90.000000 | 0.000000 | -1.164944e+01 | 0.000000 |
| 25% | 0.000000 | 33.090347 | -8.540621e+00 | 0.000000 |
| 50% | 369.000000 | 34.908743 | -5.021597e+00 | 1986.000000 |
| 75% | 1319.250000 | 37.178387 | -3.326156e+00 | 2004.000000 |
| max | 2770.000000 | 40.345193 | -2.000000e-08 | 2013.000000 |



In [26]:

    
# Drop records where construction year or gps height is 0 (used here to mean "no value")
has_values = (sampledata["construction_year"] != 0) & (sampledata["gps_height"] != 0)
x = sampledata[has_values]
x.describe()









    Out[26]:

| | gps_height | longitude | latitude | construction_year |
|---|---|---|---|---|
| count | 37928.000000 | 37928.000000 | 37928.000000 | 37928.000000 |
| mean | 1022.532456 | 35.941838 | -6.247856 | 1996.862292 |
| std | 607.525299 | 2.564827 | 2.785097 | 12.458136 |
| min | -63.000000 | 29.607122 | -11.649440 | 1960.000000 |
| 25% | 406.000000 | 34.663640 | -8.791236 | 1988.000000 |
| 50% | 1168.000000 | 36.545613 | -6.108577 | 2000.000000 |
| 75% | 1495.000000 | 37.754377 | -3.616188 | 2008.000000 |
| max | 2770.000000 | 40.345193 | -1.042375 | 2013.000000 |
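An alternative to filtering the rows straight away is to mark the zeros as missing first, so that summary statistics such as the mean construction year are not dragged towards 0 and the number of missing values per column can be counted. The sketch below uses a toy frame; treating a `gps_height` of 0 as missing is an assumption, since 0 could also be a genuine reading.

```python
import numpy as np
import pandas as pd

# Toy frame with two of the sampledata columns; 0 stands for "not recorded".
toy = pd.DataFrame(
    {
        "gps_height": [1390, 0, 686, 263],
        "construction_year": [1999, 2010, 0, 1986],
    }
)

# Mark the sentinel zeros as missing, then drop incomplete rows.
cleaned = toy.replace({"gps_height": {0: np.nan}, "construction_year": {0: np.nan}})
complete = cleaned.dropna(subset=["gps_height", "construction_year"])

print(cleaned.isna().sum())  # number of missing values per column
print(complete)
```

Keeping the NaN version around also leaves the option of imputing values later instead of discarding about a third of the records.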






In [28]:

    
#read in labels - this can then be joined as an extra column to initial dataframe
labels = pd.read_csv(train_labels, index_col="id") 
labels.head(20)









    Out[28]:

| id | status_group |
|---|---|
| 69572 | functional |
| 8776 | functional |
| 34310 | functional |
| 67743 | non functional |
| 19728 | functional |
| 9944 | functional |
| 19816 | non functional |
| 54551 | non functional |
| 53934 | non functional |
| 46144 | functional |
| 49056 | functional |
| 50409 | functional |
| 36957 | functional |
| 50495 | functional |
| 53752 | functional |
| 61848 | functional |
| 48451 | non functional |
| 58155 | non functional |
| 34169 | functional needs repair |
| 18274 | functional |
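Since both files are indexed by `id`, attaching the labels to the feature table can be done with an index join. The sketch below uses small stand-in frames (`values` and `status` are illustrative names, not variables from this notebook) with a few ids and values taken from the outputs above.

```python
import pandas as pd

# Toy stand-ins for the data and labels frames, both indexed by pump id.
values = pd.DataFrame(
    {"gps_height": [1390, 686, 263], "construction_year": [1999, 2009, 1986]},
    index=pd.Index([69572, 34310, 67743], name="id"),
)
status = pd.DataFrame(
    {"status_group": ["non functional", "functional", "functional"]},
    index=pd.Index([67743, 69572, 34310], name="id"),  # deliberately reordered
)

# join aligns rows on the index, so the row order of the two files does not matter.
combined = values.join(status)
print(combined)
```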



In [31]:

    
# One-hot encode the text labels. This gives three 0/1 columns rather than a single
# integer factor, so it isn't right yet: still need to decide how to convert labels into ints
dummy_ranks = pd.get_dummies(labels['status_group'], prefix='status_group')
print(dummy_ranks.head())









    



       status_group_functional  status_group_functional needs repair  \
id                                                                     
69572                        1                                     0   
8776                         1                                     0   
34310                        1                                     0   
67743                        0                                     0   
19728                        1                                     0   

       status_group_non functional  
id                                  
69572                            0  
8776                             0  
34310                            0  
67743                            1  
19728                            0
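If a single integer factor is wanted instead of three dummy columns, one option is an ordered categorical. The sketch below runs on toy labels; the worst-to-best ordering is an assumption about how the classes should be ranked.

```python
import pandas as pd

# Toy labels with the three status strings from the labels file.
status = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"],
    name="status_group",
)

# Declare the ordering explicitly, worst to best (assumed ranking).
order = ["non functional", "functional needs repair", "functional"]
ordinal = pd.Categorical(status, categories=order, ordered=True)

# .codes gives 0, 1, 2 following the declared order.
codes = pd.Series(ordinal.codes, index=status.index, name="status_code")
print(codes.tolist())
```

Declaring the categories by hand matters here: left to itself, pandas would order the strings alphabetically, which would place "functional needs repair" between the wrong neighbours.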


