Definition of the Problem:

Based on a range of different independent variables such as installation date, agency, and type, can we predict whether a given water pump will be either: i) functioning, ii) in need of repair, or iii) or not functioning. Should we convert these three possibilities to a continuous distribution?

$$ 0 \leq \text{ Not functioning } \leq 0.33 \leq \text{ Needs repair }\leq 0.66 \leq \text{Functioning} \leq 1 $$

Can we simplify the question to just a binomial distribution of Functioning/Not Functioning?

Possible models:

1) The probability of failure is based on an ordered logistic function related to the age etc. (similar to the Challenger Disaster Homework/BioAssay).

2) The probability of failure is based on a linear combination of parameters (similar to the Maize Weight/Chalk).

3) Naive Bayes Classifiers are not used because...??

1. Linear Model

A linear model would require bounds on each of our parameters in order to obtain a score for functionality between 0 and 1.

2. Ordered Logistic Model

As opposed to the normal logistic model which only provides outcomes of either 0 or 1, the ordered inverse logistic model (ologit) can categorise outcomes into a hierarchical series of outcomes which we translate to our functionality assessment.

Assumptions:

i) at t=0, functionality (y) has an initial (low) probability of failing.

ii) as time increases, probability of not functioning increases (parts decay).

iii) as height increases (h), probability of not functioning increases (increasing remoteness).

iv) as number of surrounding wells decreases (w), probability of not functioning increases (this is to act as a proxy for relative proximity to population centres. It could also be possible to use population as an easier way of getting this.)

Revised Model:

Our final variable that we are predicting is $y_i$, where the following converts $y_i$ into one of the classifications: $$ 0 \leq \text{ Not functioning } \leq 0.33 \leq \text{ Needs repair }\leq 0.66 \leq \text{Functioning} \leq 1 $$

$y_i$ is modeled by a backwards logistic function (other sigmoid functions can be used?) to keep it within the bounds $0 \leq y_i \leq 1$: $$y_i=\frac{1}{1+e^{t}}+\sigma \epsilon_i$$. $t$ is a parameter of our model that has a probability distribution defined by $x_i$, the input features. It is the equipment decay rate, which we model as $\theta_i=\beta_0 + t_i\beta_1 + h_i\beta_2 + w_i\beta_3$ where each of $\beta_i$ are hyperparameters.

The priors can be $$p(\beta_0,\beta_1,\beta_2,\beta_3,\sigma^2)\propto \frac{1}{\sigma^2}$$

$$p(y_1,\ldots, y_n|\beta_0,\beta_1,\beta_2,\beta_3,\sigma^2)\propto \frac{1}{\sigma^n}\exp(-\frac{\sum_{i=1}^n (y_i-\frac{1}{2(1+e^\theta)}-1))^2}{2\sigma^2}),$$

We want to sample from the posterior: $$p(Y,\Theta)=p(Y|\Theta)p(\Theta)$$

Use something from here: http://blog.yhathq.com/posts/logistic-regression-and-python.html



In [16]:

    
from datetime import datetime, date, time
import sys

import pandas as pd
from pandas import Series, DataFrame, Panel

train_file = "WaterPump-training-values.csv"
train_labels = "WaterPump-training-labels.csv"
test_file = "WaterPump-test-values.csv"

data = pd.read_csv(train_file, parse_dates=True,index_col='id') #read into dataframe, parse dates, and set ID as index
data.head(10)









    Out[16]:






  
    
      
      amount_tsh
      date_recorded
      funder
      gps_height
      installer
      longitude
      latitude
      wpt_name
      num_private
      basin
      ...
      payment_type
      water_quality
      quality_group
      quantity
      quantity_group
      source
      source_type
      source_class
      waterpoint_type
      waterpoint_type_group
    
    
      id
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      69572
       6000
       2011-03-14
                     Roman
       1390
              Roman
       34.938093
       -9.856322
                       none
       0
                    Lake Nyasa
      ...
         annually
        soft
        good
             enough
             enough
                     spring
                     spring
       groundwater
                communal standpipe
       communal standpipe
    
    
      8776 
          0
       2013-03-06
                   Grumeti
       1399
            GRUMETI
       34.698766
       -2.147466
                   Zahanati
       0
                 Lake Victoria
      ...
        never pay
        soft
        good
       insufficient
       insufficient
       rainwater harvesting
       rainwater harvesting
           surface
                communal standpipe
       communal standpipe
    
    
      34310
         25
       2013-02-25
              Lottery Club
        686
       World vision
       37.460664
       -3.821329
                Kwa Mahundi
       0
                       Pangani
      ...
       per bucket
        soft
        good
             enough
             enough
                        dam
                        dam
           surface
       communal standpipe multiple
       communal standpipe
    
    
      67743
          0
       2013-01-28
                    Unicef
        263
             UNICEF
       38.486161
      -11.155298
       Zahanati Ya Nanyumbu
       0
       Ruvuma / Southern Coast
      ...
        never pay
        soft
        good
                dry
                dry
                machine dbh
                   borehole
       groundwater
       communal standpipe multiple
       communal standpipe
    
    
      19728
          0
       2011-07-13
               Action In A
          0
            Artisan
       31.130847
       -1.825359
                    Shuleni
       0
                 Lake Victoria
      ...
        never pay
        soft
        good
           seasonal
           seasonal
       rainwater harvesting
       rainwater harvesting
           surface
                communal standpipe
       communal standpipe
    
    
      9944 
         20
       2011-03-13
       Mkinga Distric Coun
          0
                DWE
       39.172796
       -4.765587
                     Tajiri
       0
                       Pangani
      ...
       per bucket
       salty
       salty
             enough
             enough
                      other
                      other
           unknown
       communal standpipe multiple
       communal standpipe
    
    
      19816
          0
       2012-10-01
                      Dwsp
          0
               DWSP
       33.362410
       -3.766365
                 Kwa Ngomho
       0
                      Internal
      ...
        never pay
        soft
        good
             enough
             enough
                machine dbh
                   borehole
       groundwater
                         hand pump
                hand pump
    
    
      54551
          0
       2012-10-09
                     Rwssp
          0
                DWE
       32.620617
       -4.226198
               Tushirikiane
       0
               Lake Tanganyika
      ...
          unknown
       milky
       milky
             enough
             enough
               shallow well
               shallow well
       groundwater
                         hand pump
                hand pump
    
    
      53934
          0
       2012-11-03
                  Wateraid
          0
          Water Aid
       32.711100
       -5.146712
          Kwa Ramadhan Musa
       0
               Lake Tanganyika
      ...
        never pay
       salty
       salty
           seasonal
           seasonal
                machine dbh
                   borehole
       groundwater
                         hand pump
                hand pump
    
    
      46144
          0
       2011-08-03
               Isingiro Ho
          0
            Artisan
       30.626991
       -1.257051
                    Kwapeto
       0
                 Lake Victoria
      ...
        never pay
        soft
        good
             enough
             enough
               shallow well
               shallow well
       groundwater
                         hand pump
                hand pump
    
  

10 rows × 39 columns



In [26]:

    
labels = pd.read_csv(train_labels, index_col = 'id')

#columns to keep
cols_to_keep = ['gps_height', 'construction_year']
data = data[cols_to_keep]

# manually add the intercept
data['intercept'] = 1.0



In [14]:

    
print data.columns









    



Index([u'gps_height', u'construction_year', u'intercept'], dtype='object')



In [19]:

    
data.dtypes









    Out[19]:





gps_height             int64
construction_year      int64
intercept            float64
dtype: object



In [31]:

    
labelsVect = pd.get_dummies(labels['status_group'])
print labelsVect.columns









    



Index([u'functional', u'functional needs repair', u'non functional'], dtype='object')



In [32]:

    
labelsVect['functionality'] = labelsVect['functional'] + 0.5*labelsVect['functional needs repair']



In [37]:

    
import statsmodels.api as sm
train_cols = data.columns[1:]
# Index([gre, gpa, prestige_2, prestige_3, prestige_4], dtype=object)
 
logit = sm.Logit(labelsVect['functionality'], data)
 
# fit the model
result = logit.fit()









    



Optimization terminated successfully.
         Current function value: 0.672248
         Iterations 4



In [69]:

    
result.summary()









    Out[69]:





Logit Regression Results

  Dep. Variable:    functionality     No. Observations:      59400  


  Model:                Logit         Df Residuals:          59397  


  Method:                MLE          Df Model:                  2  


  Date:           Fri, 24 Apr 2015    Pseudo R-squ.:        0.01075 


  Time:               17:28:02        Log-Likelihood:       -39932. 


  converged:            True          LL-Null:              -40365. 


                                    LLR p-value:        4.254e-189




                       coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  gps_height             0.0004   1.63e-05     26.580   0.000      0.000     0.000


  construction_year     -0.0001   1.16e-05    -10.065   0.000     -0.000 -9.41e-05


  intercept              0.1879      0.014     13.420   0.000      0.160     0.215



In [41]:

    
data = pd.read_csv(train_file, parse_dates=True,index_col='id') #read into dataframe, parse dates, and set ID as index
locData = data[['longitude','latitude']]



In [42]:

    
locData.head(5)



In [43]:

    
dist = data['longitude']**2+data['latitude']**2



In [44]:

    
print dist.head(5)









    



id
69572    1317.817404
8776     1208.615978
34310    1417.903934
67743    1605.625247
19728     972.461552
dtype: float64



In [50]:

    
dist.order()









    Out[50]:





id
6561     4.000000e-16
20447    4.000000e-16
68569    4.000000e-16
41883    4.000000e-16
63236    4.000000e-16
33193    4.000000e-16
52797    4.000000e-16
45716    4.000000e-16
74058    4.000000e-16
43256    4.000000e-16
28633    4.000000e-16
16417    4.000000e-16
32055    4.000000e-16
25830    4.000000e-16
70312    4.000000e-16
...
73518    1735.054127
12547    1735.184393
62933    1735.185269
20666    1735.186264
72746    1735.186659
31454    1735.256078
17282    1735.262702
7759     1735.335333
25152    1735.350887
20120    1735.358554
47144    1735.437606
48316    1735.438688
69830    1735.456795
41000    1737.824318
39105    1737.896098
Length: 59400, dtype: float64



In [55]:

    
labels.ix[39105]['status_group']









    Out[55]:





'functional'



In [59]:

    
dist.order()[:5]









    Out[59]:





id
6561     4.000000e-16
20447    4.000000e-16
68569    4.000000e-16
41883    4.000000e-16
63236    4.000000e-16
dtype: float64



In [67]:

    
dist.order()









    Out[67]:





id
6561     4.000000e-16
20447    4.000000e-16
68569    4.000000e-16
41883    4.000000e-16
63236    4.000000e-16
33193    4.000000e-16
52797    4.000000e-16
45716    4.000000e-16
74058    4.000000e-16
43256    4.000000e-16
28633    4.000000e-16
16417    4.000000e-16
32055    4.000000e-16
25830    4.000000e-16
70312    4.000000e-16
...
73518    1735.054127
12547    1735.184393
62933    1735.185269
20666    1735.186264
72746    1735.186659
31454    1735.256078
17282    1735.262702
7759     1735.335333
25152    1735.350887
20120    1735.358554
47144    1735.437606
48316    1735.438688
69830    1735.456795
41000    1737.824318
39105    1737.896098
Length: 59400, dtype: float64



In [ ]:

	amount_tsh	date_recorded	funder	gps_height	installer	longitude	latitude	wpt_name	num_private	basin	...	payment_type	water_quality	quality_group	quantity	quantity_group	source	source_type	source_class	waterpoint_type	waterpoint_type_group
id
69572	6000	2011-03-14	Roman	1390	Roman	34.938093	-9.856322	none	0	Lake Nyasa	...	annually	soft	good	enough	enough	spring	spring	groundwater	communal standpipe	communal standpipe
8776	0	2013-03-06	Grumeti	1399	GRUMETI	34.698766	-2.147466	Zahanati	0	Lake Victoria	...	never pay	soft	good	insufficient	insufficient	rainwater harvesting	rainwater harvesting	surface	communal standpipe	communal standpipe
34310	25	2013-02-25	Lottery Club	686	World vision	37.460664	-3.821329	Kwa Mahundi	0	Pangani	...	per bucket	soft	good	enough	enough	dam	dam	surface	communal standpipe multiple	communal standpipe
67743	0	2013-01-28	Unicef	263	UNICEF	38.486161	-11.155298	Zahanati Ya Nanyumbu	0	Ruvuma / Southern Coast	...	never pay	soft	good	dry	dry	machine dbh	borehole	groundwater	communal standpipe multiple	communal standpipe
19728	0	2011-07-13	Action In A	0	Artisan	31.130847	-1.825359	Shuleni	0	Lake Victoria	...	never pay	soft	good	seasonal	seasonal	rainwater harvesting	rainwater harvesting	surface	communal standpipe	communal standpipe
9944	20	2011-03-13	Mkinga Distric Coun	0	DWE	39.172796	-4.765587	Tajiri	0	Pangani	...	per bucket	salty	salty	enough	enough	other	other	unknown	communal standpipe multiple	communal standpipe
19816	0	2012-10-01	Dwsp	0	DWSP	33.362410	-3.766365	Kwa Ngomho	0	Internal	...	never pay	soft	good	enough	enough	machine dbh	borehole	groundwater	hand pump	hand pump
54551	0	2012-10-09	Rwssp	0	DWE	32.620617	-4.226198	Tushirikiane	0	Lake Tanganyika	...	unknown	milky	milky	enough	enough	shallow well	shallow well	groundwater	hand pump	hand pump
53934	0	2012-11-03	Wateraid	0	Water Aid	32.711100	-5.146712	Kwa Ramadhan Musa	0	Lake Tanganyika	...	never pay	salty	salty	seasonal	seasonal	machine dbh	borehole	groundwater	hand pump	hand pump
46144	0	2011-08-03	Isingiro Ho	0	Artisan	30.626991	-1.257051	Kwapeto	0	Lake Victoria	...	never pay	soft	good	enough	enough	shallow well	shallow well	groundwater	hand pump	hand pump

Dep. Variable:	functionality	No. Observations:	59400
Model:	Logit	Df Residuals:	59397
Method:	MLE	Df Model:	2
Date:	Fri, 24 Apr 2015	Pseudo R-squ.:	0.01075
Time:	17:28:02	Log-Likelihood:	-39932.
converged:	True	LL-Null:	-40365.
		LLR p-value:	4.254e-189

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
gps_height	0.0004	1.63e-05	26.580	0.000	0.000 0.000
construction_year	-0.0001	1.16e-05	-10.065	0.000	-0.000 -9.41e-05
intercept	0.1879	0.014	13.420	0.000	0.160 0.215