Definition of the Problem:

Based on a range of different independent variables such as installation date, agency, and type, can we predict whether a given water pump will be either: i) functioning, ii) in need of repair, or iii) or not functioning. Should we convert these three possibilities to a continuous distribution?

$$ 0 \leq \text{ Not functioning } \leq 0.33 \leq \text{ Needs repair }\leq 0.66 \leq \text{Functioning} \leq 1 $$

Can we simplify the question to just a binomial distribution of Functioning/Not Functioning?

Possible models:

1) The probability of failure is based on an ordered logistic function related to the age etc. (similar to the Challenger Disaster Homework/BioAssay).

2) The probability of failure is based on a linear combination of parameters (similar to the Maize Weight/Chalk).

3) Naive Bayes Classifiers are not used because...??

1. Linear Model

A linear model would require bounds on each of our parameters in order to obtain a score for functionality between 0 and 1.

2. Ordered Logistic Model

As opposed to the normal logistic model which only provides outcomes of either 0 or 1, the ordered inverse logistic model (ologit) can categorise outcomes into a hierarchical series of outcomes which we translate to our functionality assessment.

Assumptions:

i) at t=0, functionality (y) has an initial (low) probability of failing.

ii) as time increases, probability of not functioning increases (parts decay).

iii) as height increases (h), probability of not functioning increases (increasing remoteness).

iv) as number of surrounding wells decreases (w), probability of not functioning increases (this is to act as a proxy for relative proximity to population centres. It could also be possible to use population as an easier way of getting this.)

Revised Model:

Our final variable that we are predicting is $y_i$, where the following converts $y_i$ into one of the classifications: $$ 0 \leq \text{ Not functioning } \leq 0.33 \leq \text{ Needs repair }\leq 0.66 \leq \text{Functioning} \leq 1 $$

$y_i$ is modeled by a backwards logistic function (other sigmoid functions can be used?) to keep it within the bounds $0 \leq y_i \leq 1$: $$y_i=\frac{1}{1+e^{t}}+\sigma \epsilon_i$$. $t$ is a parameter of our model that has a probability distribution defined by $x_i$, the input features. It is the equipment decay rate, which we model as $\theta_i=\beta_0 + t_i\beta_1 + h_i\beta_2 + w_i\beta_3$ where each of $\beta_i$ are hyperparameters.

The priors can be $$p(\beta_0,\beta_1,\beta_2,\beta_3,\sigma^2)\propto \frac{1}{\sigma^2}$$

$$p(y_1,\ldots, y_n|\beta_0,\beta_1,\beta_2,\beta_3,\sigma^2)\propto \frac{1}{\sigma^n}\exp(-\frac{\sum_{i=1}^n (y_i-\frac{1}{2(1+e^\theta)}-1))^2}{2\sigma^2}),$$

We want to sample from the posterior: $$p(Y,\Theta)=p(Y|\Theta)p(\Theta)$$


In [16]:
from datetime import datetime, date, time
import sys

import pandas as pd
from pandas import Series, DataFrame, Panel

train_file = "WaterPump-training-values.csv"
train_labels = "WaterPump-training-labels.csv"
test_file = "WaterPump-test-values.csv"

data = pd.read_csv(train_file, parse_dates=True,index_col='id') #read into dataframe, parse dates, and set ID as index
data.head(10)


Out[16]:
amount_tsh date_recorded funder gps_height installer longitude latitude wpt_name num_private basin ... payment_type water_quality quality_group quantity quantity_group source source_type source_class waterpoint_type waterpoint_type_group
id
69572 6000 2011-03-14 Roman 1390 Roman 34.938093 -9.856322 none 0 Lake Nyasa ... annually soft good enough enough spring spring groundwater communal standpipe communal standpipe
8776 0 2013-03-06 Grumeti 1399 GRUMETI 34.698766 -2.147466 Zahanati 0 Lake Victoria ... never pay soft good insufficient insufficient rainwater harvesting rainwater harvesting surface communal standpipe communal standpipe
34310 25 2013-02-25 Lottery Club 686 World vision 37.460664 -3.821329 Kwa Mahundi 0 Pangani ... per bucket soft good enough enough dam dam surface communal standpipe multiple communal standpipe
67743 0 2013-01-28 Unicef 263 UNICEF 38.486161 -11.155298 Zahanati Ya Nanyumbu 0 Ruvuma / Southern Coast ... never pay soft good dry dry machine dbh borehole groundwater communal standpipe multiple communal standpipe
19728 0 2011-07-13 Action In A 0 Artisan 31.130847 -1.825359 Shuleni 0 Lake Victoria ... never pay soft good seasonal seasonal rainwater harvesting rainwater harvesting surface communal standpipe communal standpipe
9944 20 2011-03-13 Mkinga Distric Coun 0 DWE 39.172796 -4.765587 Tajiri 0 Pangani ... per bucket salty salty enough enough other other unknown communal standpipe multiple communal standpipe
19816 0 2012-10-01 Dwsp 0 DWSP 33.362410 -3.766365 Kwa Ngomho 0 Internal ... never pay soft good enough enough machine dbh borehole groundwater hand pump hand pump
54551 0 2012-10-09 Rwssp 0 DWE 32.620617 -4.226198 Tushirikiane 0 Lake Tanganyika ... unknown milky milky enough enough shallow well shallow well groundwater hand pump hand pump
53934 0 2012-11-03 Wateraid 0 Water Aid 32.711100 -5.146712 Kwa Ramadhan Musa 0 Lake Tanganyika ... never pay salty salty seasonal seasonal machine dbh borehole groundwater hand pump hand pump
46144 0 2011-08-03 Isingiro Ho 0 Artisan 30.626991 -1.257051 Kwapeto 0 Lake Victoria ... never pay soft good enough enough shallow well shallow well groundwater hand pump hand pump

10 rows × 39 columns


In [26]:
labels = pd.read_csv(train_labels, index_col = 'id')

#columns to keep
cols_to_keep = ['gps_height', 'construction_year']
data = data[cols_to_keep]

# manually add the intercept
data['intercept'] = 1.0

In [14]:
print data.columns


Index([u'gps_height', u'construction_year', u'intercept'], dtype='object')

In [19]:
data.dtypes


Out[19]:
gps_height             int64
construction_year      int64
intercept            float64
dtype: object

In [31]:
labelsVect = pd.get_dummies(labels['status_group'])
print labelsVect.columns


Index([u'functional', u'functional needs repair', u'non functional'], dtype='object')

In [32]:
labelsVect['functionality'] = labelsVect['functional'] + 0.5*labelsVect['functional needs repair']

In [37]:
import statsmodels.api as sm
train_cols = data.columns[1:]
# Index([gre, gpa, prestige_2, prestige_3, prestige_4], dtype=object)
 
logit = sm.Logit(labelsVect['functionality'], data)
 
# fit the model
result = logit.fit()


Optimization terminated successfully.
         Current function value: 0.672248
         Iterations 4

In [69]:
result.summary()


Out[69]:
Logit Regression Results
Dep. Variable: functionality No. Observations: 59400
Model: Logit Df Residuals: 59397
Method: MLE Df Model: 2
Date: Fri, 24 Apr 2015 Pseudo R-squ.: 0.01075
Time: 17:28:02 Log-Likelihood: -39932.
converged: True LL-Null: -40365.
LLR p-value: 4.254e-189
coef std err z P>|z| [95.0% Conf. Int.]
gps_height 0.0004 1.63e-05 26.580 0.000 0.000 0.000
construction_year -0.0001 1.16e-05 -10.065 0.000 -0.000 -9.41e-05
intercept 0.1879 0.014 13.420 0.000 0.160 0.215

In [41]:
data = pd.read_csv(train_file, parse_dates=True,index_col='id') #read into dataframe, parse dates, and set ID as index
locData = data[['longitude','latitude']]

In [42]:
locData.head(5)


Out[42]:
longitude latitude
id
69572 34.938093 -9.856322
8776 34.698766 -2.147466
34310 37.460664 -3.821329
67743 38.486161 -11.155298
19728 31.130847 -1.825359

In [43]:
dist = data['longitude']**2+data['latitude']**2

In [44]:
print dist.head(5)


id
69572    1317.817404
8776     1208.615978
34310    1417.903934
67743    1605.625247
19728     972.461552
dtype: float64

In [50]:
dist.order()


Out[50]:
id
6561     4.000000e-16
20447    4.000000e-16
68569    4.000000e-16
41883    4.000000e-16
63236    4.000000e-16
33193    4.000000e-16
52797    4.000000e-16
45716    4.000000e-16
74058    4.000000e-16
43256    4.000000e-16
28633    4.000000e-16
16417    4.000000e-16
32055    4.000000e-16
25830    4.000000e-16
70312    4.000000e-16
...
73518    1735.054127
12547    1735.184393
62933    1735.185269
20666    1735.186264
72746    1735.186659
31454    1735.256078
17282    1735.262702
7759     1735.335333
25152    1735.350887
20120    1735.358554
47144    1735.437606
48316    1735.438688
69830    1735.456795
41000    1737.824318
39105    1737.896098
Length: 59400, dtype: float64

In [55]:
labels.ix[39105]['status_group']


Out[55]:
'functional'

In [59]:
dist.order()[:5]


Out[59]:
id
6561     4.000000e-16
20447    4.000000e-16
68569    4.000000e-16
41883    4.000000e-16
63236    4.000000e-16
dtype: float64

In [67]:
dist.order()


Out[67]:
id
6561     4.000000e-16
20447    4.000000e-16
68569    4.000000e-16
41883    4.000000e-16
63236    4.000000e-16
33193    4.000000e-16
52797    4.000000e-16
45716    4.000000e-16
74058    4.000000e-16
43256    4.000000e-16
28633    4.000000e-16
16417    4.000000e-16
32055    4.000000e-16
25830    4.000000e-16
70312    4.000000e-16
...
73518    1735.054127
12547    1735.184393
62933    1735.185269
20666    1735.186264
72746    1735.186659
31454    1735.256078
17282    1735.262702
7759     1735.335333
25152    1735.350887
20120    1735.358554
47144    1735.437606
48316    1735.438688
69830    1735.456795
41000    1737.824318
39105    1737.896098
Length: 59400, dtype: float64

In [ ]: