Based on a range of independent variables such as installation date, agency, and type, can we predict whether a given water pump is: i) functioning, ii) in need of repair, or iii) not functioning? Should we convert these three possibilities to a continuous distribution?
$$ 0 \leq \text{ Not functioning } \leq 0.33 \leq \text{ Needs repair }\leq 0.66 \leq \text{Functioning} \leq 1 $$
Can we simplify the question to just a binomial distribution of Functioning/Not Functioning?
Possible models:
1) The probability of failure is based on an ordered logistic function related to the age etc. (similar to the Challenger Disaster Homework/BioAssay).
2) The probability of failure is based on a linear combination of parameters (similar to the Maize Weight/Chalk).
3) Naive Bayes Classifiers are not used because...??
A linear model would require bounds on each of our parameters in order to obtain a score for functionality between 0 and 1.
Unlike the standard (binary) logistic model, which distinguishes only two outcomes (0 or 1), the ordered logistic model (ologit) assigns each observation to one of a series of ordered categories, which we map onto our functionality assessment.
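To make the ordered logit idea concrete, here is a minimal sketch of how such a model turns one linear predictor into probabilities for three ordered categories. The cutpoint values and the predictor value `eta` are made up for illustration, and the function names are hypothetical; in practice the cutpoints would be estimated together with the coefficients.

```python
import math

def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(eta, cutpoints):
    """Category probabilities for an ordered logit with linear predictor eta.

    cutpoints must be increasing; K-1 cutpoints give K ordered categories.
    """
    # Cumulative probabilities P(y <= k) = sigmoid(c_k - eta); the last one is 1.
    cumulative = [sigmoid(c - eta) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs

# Categories ordered from worst to best, as in our functionality scale.
categories = ["Not functioning", "Needs repair", "Functioning"]
cutpoints = [-1.0, 1.0]  # hypothetical values, not fitted to any data
probs = ordered_logit_probs(eta=0.5, cutpoints=cutpoints)

for name, p in zip(categories, probs):
    print(name, round(p, 3))
```

With this parameterisation a larger `eta` moves probability mass towards the highest category, so the sign convention for the linear predictor would need to be chosen to match the assumptions about decay.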
Assumptions:
i) at t=0, functionality (y) is high: the pump has a low initial probability of not functioning.
ii) as time increases, probability of not functioning increases (parts decay).
iii) as height (h) increases, probability of not functioning increases (increasing remoteness).
iv) as the number of surrounding wells (w) decreases, probability of not functioning increases (this acts as a proxy for relative proximity to population centres; population itself may be an easier quantity to obtain).
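One possible way to obtain the surrounding-wells count w in assumption iv) is to count, for each pump, the other pumps within a fixed radius. The sketch below is an assumption about how this could be done, not something already in the notebook: the radius, the example coordinates, and the function names are hypothetical, and great-circle (haversine) distance is used to measure separation.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def count_neighbours(points, radius_km):
    """For each point, count the other points lying within radius_km."""
    counts = []
    for i, (lon_i, lat_i) in enumerate(points):
        n = 0
        for j, (lon_j, lat_j) in enumerate(points):
            if i != j and haversine_km(lon_i, lat_i, lon_j, lat_j) <= radius_km:
                n += 1
        counts.append(n)
    return counts

# Made-up coordinates: three pumps close together and one far away.
points = [(35.0, -6.0), (35.01, -6.0), (35.0, -6.01), (36.5, -7.5)]
w = count_neighbours(points, radius_km=5.0)
print(w)  # [2, 2, 2, 0]
```

The double loop is quadratic in the number of pumps, so for the full training set a spatial index or a coarse grid would probably be needed.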
Our final variable that we are predicting is $y_i$, where the following converts $y_i$ into one of the classifications: $$ 0 \leq \text{ Not functioning } \leq 0.33 \leq \text{ Needs repair }\leq 0.66 \leq \text{Functioning} \leq 1 $$
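The thresholds above can be sketched as a small helper. The function name `classify` is hypothetical, and since the inequalities leave the boundary values 0.33 and 0.66 ambiguous, this sketch assumes they fall into the higher category.

```python
def classify(y):
    """Map a functionality score in [0, 1] to one of the three classes."""
    if not 0.0 <= y <= 1.0:
        raise ValueError("functionality score must lie in [0, 1]")
    if y < 0.33:
        return "Not functioning"
    if y < 0.66:
        # Assumption: a score of exactly 0.33 counts as "Needs repair".
        return "Needs repair"
    # Assumption: a score of exactly 0.66 counts as "Functioning".
    return "Functioning"

for score in (0.1, 0.5, 0.9):
    print(score, classify(score))
```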
$y_i$ is modeled by a backwards logistic function (other sigmoid functions can be used?) so that its mean stays within the bounds $0 \leq y_i \leq 1$: $$y_i=\frac{1}{1+e^{\theta_i}}+\sigma \epsilon_i$$ Here $\theta_i$ is a parameter of our model determined by $x_i$, the input features. It is the equipment decay rate, which we model as $\theta_i=\beta_0 + t_i\beta_1 + h_i\beta_2 + w_i\beta_3$, where $t_i$ is the pump's age, $h_i$ its height, $w_i$ the number of surrounding wells, and the $\beta_j$ are the model coefficients.
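As a quick sanity check of the model, the sketch below computes the noise-free functionality score for a new pump and an old one, reading the exponent of the backwards logistic as the decay rate $\theta_i$. The coefficient values and feature values are invented purely for illustration, and the function names are hypothetical.

```python
import math

def decay_rate(t, h, w, beta):
    """theta = beta0 + beta1*t + beta2*h + beta3*w."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * t + b2 * h + b3 * w

def expected_functionality(theta):
    """Backwards logistic: close to 1 for very negative theta, close to 0 for large theta."""
    return 1.0 / (1.0 + math.exp(theta))

# Hypothetical coefficients: decay grows with age and height, shrinks with more wells.
beta = (-3.0, 0.1, 0.001, -0.05)

new_pump = expected_functionality(decay_rate(t=0, h=100, w=10, beta=beta))
old_pump = expected_functionality(decay_rate(t=40, h=100, w=10, beta=beta))
print(round(new_pump, 3), round(old_pump, 3))
```

With these made-up numbers the new pump scores in the "Functioning" band and the 40-year-old pump scores much lower, which is the behaviour assumptions i) and ii) ask for.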
The priors can be $$p(\beta_0,\beta_1,\beta_2,\beta_3,\sigma^2)\propto \frac{1}{\sigma^2}$$
We want to sample from the posterior, which is proportional to the likelihood times the prior: $$p(\Theta|Y)\propto p(Y|\Theta)p(\Theta)$$
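One simple way to draw from this posterior is random-walk Metropolis. The sketch below is only an illustration under several assumptions: Gaussian noise, the $1/\sigma^2$ prior, a tiny made-up data set with rescaled features, an arbitrary step size, and hypothetical function names. It is not a tuned sampler.

```python
import math
import random

def backward_logistic(theta):
    """1 / (1 + e^theta), guarded against overflow for very large theta."""
    if theta > 700:
        return 0.0
    return 1.0 / (1.0 + math.exp(theta))

def log_posterior(beta, sigma2, rows, ys):
    """Unnormalised log posterior for y_i ~ Normal(backward_logistic(theta_i), sigma2)."""
    if sigma2 <= 0:
        return -math.inf
    sse = 0.0
    for (t, h, w), y in zip(rows, ys):
        theta = beta[0] + beta[1] * t + beta[2] * h + beta[3] * w
        sse += (y - backward_logistic(theta)) ** 2
    n = len(ys)
    log_likelihood = -0.5 * n * math.log(sigma2) - sse / (2.0 * sigma2)
    log_prior = -math.log(sigma2)  # p(beta, sigma^2) proportional to 1 / sigma^2
    return log_likelihood + log_prior

def metropolis(rows, ys, start, n_steps, step_size, seed=0):
    """Random-walk Metropolis over [beta0, beta1, beta2, beta3, sigma2]."""
    rng = random.Random(seed)
    current = list(start)
    current_lp = log_posterior(current[:4], current[4], rows, ys)
    chain = []
    for _ in range(n_steps):
        proposal = [v + rng.gauss(0.0, step_size) for v in current]
        lp = log_posterior(proposal[:4], proposal[4], rows, ys)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp - current_lp)):
            current, current_lp = proposal, lp
        chain.append(list(current))
    return chain

# Made-up, rescaled features (age, height, wells) and scores in {0, 0.5, 1}.
rows = [(0.5, 0.1, 1.0), (1.0, 0.3, 0.8), (2.0, 0.5, 0.5), (3.0, 1.0, 0.2), (4.0, 1.5, 0.1)]
ys = [1.0, 1.0, 0.5, 0.5, 0.0]

chain = metropolis(rows, ys, start=[0.0, 0.0, 0.0, 0.0, 0.1], n_steps=2000, step_size=0.1)
print(chain[-1])
```

Proposals with a non-positive $\sigma^2$ get a log posterior of minus infinity and are always rejected, which keeps the chain inside the valid region.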
Use something from here: http://blog.yhathq.com/posts/logistic-regression-and-python.html
In [16]:
from datetime import datetime, date, time
import sys
import pandas as pd
from pandas import Series, DataFrame  # Panel has been removed from recent pandas releases
train_file = "WaterPump-training-values.csv"
train_labels = "WaterPump-training-labels.csv"
test_file = "WaterPump-test-values.csv"
data = pd.read_csv(train_file, parse_dates=True,index_col='id') #read into dataframe, parse dates, and set ID as index
data.head(10)
Out[16]:
In [26]:
labels = pd.read_csv(train_labels, index_col = 'id')
#columns to keep
cols_to_keep = ['gps_height', 'construction_year']
data = data[cols_to_keep].copy()  # copy so that adding a column below does not modify a view
# manually add the intercept
data['intercept'] = 1.0
In [14]:
print(data.columns)
In [19]:
data.dtypes
Out[19]:
In [31]:
labelsVect = pd.get_dummies(labels['status_group'])
print(labelsVect.columns)
In [32]:
labelsVect['functionality'] = labelsVect['functional'] + 0.5*labelsVect['functional needs repair']
In [37]:
import statsmodels.api as sm
# use every retained column (gps_height, construction_year, intercept) as a predictor
train_cols = data.columns
logit = sm.Logit(labelsVect['functionality'], data[train_cols])
# fit the model
result = logit.fit()
In [69]:
result.summary()
Out[69]:
In [41]:
data = pd.read_csv(train_file, parse_dates=True,index_col='id') #read into dataframe, parse dates, and set ID as index
locData = data[['longitude','latitude']]
In [42]:
locData.head(5)
Out[42]:
In [43]:
# squared distance from the origin (0, 0) in degree units, not a physical distance
dist = data['longitude']**2 + data['latitude']**2
In [44]:
print(dist.head(5))
In [50]:
dist.sort_values()
Out[50]:
In [55]:
labels.loc[39105, 'status_group']
Out[55]:
In [59]:
dist.sort_values()[:5]
Out[59]:
In [67]:
dist.sort_values()
Out[67]: