Based on a range of independent variables such as installation date, agency, and type, can we predict whether a given water pump will be i) functioning, ii) in need of repair, or iii) not functioning? Should we convert these three possibilities to a continuous distribution?
$$ 0 \leq \text{Not functioning} \leq 0.33 \leq \text{Needs repair} \leq 0.66 \leq \text{Functioning} \leq 1 $$
Can we simplify the question to just a binomial distribution of Functioning/Not Functioning?
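As a rough illustration of the banding above, the sketch below maps a continuous score in [0, 1] to the three states and then collapses them to a binary outcome. The function names and the choice to count "needs repair" as functioning in the binary case are assumptions made for illustration, not decisions taken in this notebook.

```python
import numpy as np

# Band edges taken from the inequality above; labels run from worst to best.
BAND_EDGES = [0.33, 0.66]
BAND_LABELS = np.array(["not functioning", "needs repair", "functioning"])


def score_to_status(scores):
    """Map continuous functionality scores in [0, 1] to the three ordered states."""
    scores = np.asarray(scores, dtype=float)
    # np.digitize returns 0, 1 or 2 depending on which band each score falls in.
    return BAND_LABELS[np.digitize(scores, BAND_EDGES)]


def status_to_binary(statuses, repair_counts_as_functioning=True):
    """Collapse the three states to 1 (functioning) / 0 (not functioning).

    Where "needs repair" belongs is a modelling choice (hypothetical flag).
    """
    statuses = np.asarray(statuses)
    working = statuses == "functioning"
    if repair_counts_as_functioning:
        working = working | (statuses == "needs repair")
    return working.astype(int)


statuses = score_to_status([0.10, 0.50, 0.90])
binary = status_to_binary(statuses)
```

Flipping `repair_counts_as_functioning` shows how sensitive the binary simplification is to the treatment of the middle category.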
Possible models:
1) The probability of failure is based on an ordered logistic function related to the age etc. (similar to the Challenger Disaster Homework/BioAssay).
2) The probability of failure is based on a linear combination of parameters (similar to the Maize Weight/Chalk).
3) Naive Bayes Classifiers are not used because...??
A linear model would require bounds on each of our parameters in order to obtain a score for functionality between 0 and 1.
Unlike the standard (binary) logistic model, which only distinguishes two outcomes (0 or 1), the ordered logistic model (ologit) can assign observations to an ordered series of categories, which we map onto our functionality assessment.
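To make the ordered logit idea concrete, here is a minimal sketch of how category probabilities follow from a single linear predictor and a set of increasing cutpoints. The cutpoint values and the convention that a larger predictor means a better-functioning pump are assumptions chosen for illustration.

```python
import numpy as np


def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + np.exp(-x))


def ordered_logit_probs(eta, cutpoints):
    """Return P(y = k) for each ordered category k.

    eta       : linear predictor, one value per observation
    cutpoints : increasing thresholds; K - 1 cutpoints give K categories
    """
    eta = np.atleast_1d(np.asarray(eta, dtype=float))
    cutpoints = np.asarray(cutpoints, dtype=float)
    # Cumulative probabilities P(y <= k) = inv_logit(c_k - eta).
    cumulative = inv_logit(cutpoints[None, :] - eta[:, None])
    # Pad with P(y <= -1) = 0 and P(y <= K - 1) = 1, then difference.
    zeros = np.zeros((eta.size, 1))
    ones = np.ones((eta.size, 1))
    cumulative = np.hstack([zeros, cumulative, ones])
    return np.diff(cumulative, axis=1)


# Columns: not functioning, needs repair, functioning (hypothetical cutpoints).
probs = ordered_logit_probs([0.0, 2.0], cutpoints=[-1.0, 1.0])
```

Each row of `probs` sums to one, and raising the predictor shifts probability mass towards the highest category.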
Assumptions:
i) at t=0, functionality (y) has an initial (low) probability of failing.
ii) as time increases, probability of not functioning increases (parts decay).
iii) as height (h) increases, probability of not functioning increases (increasing remoteness).
iv) as the number of surrounding wells (w) decreases, probability of not functioning increases (w acts as a proxy for relative proximity to population centres; it may also be possible to use population directly as an easier measure).
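One way to sanity-check assumptions i) to iv) is to pass a linear combination of t, h and w through an inverse logit and confirm that the failure probability moves in the stated direction. The coefficient values below are made up; only their signs encode the assumptions.

```python
import numpy as np


def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + np.exp(-x))


# Hypothetical coefficients: only the signs matter for this check.
BETA = {"intercept": -3.0, "age": 0.08, "height": 0.0005, "wells": -0.05}


def failure_probability(age_years, height_m, n_wells, beta=BETA):
    """Probability of not functioning under the sign assumptions i) to iv)."""
    eta = (
        beta["intercept"]
        + beta["age"] * np.asarray(age_years, dtype=float)
        + beta["height"] * np.asarray(height_m, dtype=float)
        + beta["wells"] * np.asarray(n_wells, dtype=float)
    )
    return inv_logit(eta)


p_new = failure_probability(age_years=0, height_m=0, n_wells=0)
p_old = failure_probability(age_years=30, height_m=0, n_wells=0)
p_high = failure_probability(age_years=0, height_m=2000, n_wells=0)
p_served = failure_probability(age_years=0, height_m=0, n_wells=20)
```

A negative intercept gives the low initial failure probability of assumption i); the positive age and height coefficients and the negative wells coefficient give ii) to iv).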
The likelihood, given that our functionality score can take any value between 0 and 1, is expressed as a skewed normal distribution, on the assumption that wells are more likely to be in a working state (another possibility would be an exponential inverse):
$$ P(y_i \vert \theta_i) = {\rm Normal}(y_i \vert \theta_i) \quad \text{for } i = 1, \ldots, n $$
where $\theta$ is the equipment decay rate, which is modeled as an $\rm{ologit}^{-1}$:
$$ \theta_i = \text{equipment decay rate} = \rm{ologit}^{-1}(\beta_0 + t_i\beta_1 + h_i\beta_2 + w_i\beta_3) $$
What priors to choose for $\beta_0, \beta_1, \beta_2, \rm{and} \, \beta_3$?
$$ p(\beta_0) \propto \rm{exp}() $$
$$ p(\beta_1, \beta_2, \beta_3) \propto 1 $$
Posterior:
Use something from here: http://blog.yhathq.com/posts/logistic-regression-and-python.html
In [22]:
from datetime import datetime, date, time
import sys
import numpy as np
import sklearn
import csv
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd
from pandas import Series, DataFrame  # Panel was removed in pandas 1.0
train_file = "WaterPump-training-values.csv"
train_labels = "WaterPump-training-labels.csv"
test_file = "WaterPump-test-values.csv"
data = pd.read_csv(train_file, parse_dates=True,index_col='id') #read into dataframe, parse dates, and set ID as index
data.head(20)
Out[22]:
In [3]:
sampledata = data[["gps_height", "longitude","latitude","construction_year"]]
In [4]:
print(sampledata.construction_year)
In [5]:
sampledata.describe()
Out[5]:
In [26]:
# Drop records with no value (recorded as 0) for either construction year or gps height
x = sampledata[sampledata["construction_year"] != 0]
x = x[x["gps_height"] != 0]
x.describe()
Out[26]:
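The filtering above can also be written so that the zero placeholders become explicit missing values, which makes it easy to derive a pump age for the time variable t. This is a sketch on a toy frame: it assumes, as the cell above does, that 0 marks a missing value, and `REFERENCE_YEAR` is a hypothetical stand-in for the year each record was taken.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training values; the numbers are invented.
toy = pd.DataFrame(
    {
        "gps_height": [1390, 0, 686, 263],
        "construction_year": [1999, 2010, 0, 1986],
    },
    index=pd.Index([1, 2, 3, 4], name="id"),
)


def drop_zero_placeholders(df, columns):
    """Treat 0 as missing in the given columns and drop the affected rows."""
    cleaned = df.copy()
    cleaned[columns] = cleaned[columns].replace(0, np.nan)
    return cleaned.dropna(subset=columns)


clean = drop_zero_placeholders(toy, ["gps_height", "construction_year"])

# Hypothetical reference year used to turn construction year into an age.
REFERENCE_YEAR = 2013
clean["age_years"] = REFERENCE_YEAR - clean["construction_year"]
```

Keeping the missing values as NaN until the final `dropna` also makes it possible to count how many rows each column costs before discarding them.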
In [28]:
# Read in the labels; these can then be joined as an extra column to the initial dataframe
labels = pd.read_csv(train_labels, index_col="id")
labels.head(20)
Out[28]:
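Since both files are indexed by `id`, the join mentioned in the comment above could look roughly like the following. The toy frames and the label strings in them are invented for illustration; only the `id` index and the `status_group` column name come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the values and labels frames, both indexed by id.
toy_values = pd.DataFrame(
    {"gps_height": [1390, 1399, 686]},
    index=pd.Index([10, 20, 30], name="id"),
)
toy_labels = pd.DataFrame(
    {"status_group": ["functional", "non functional", "functional"]},
    index=pd.Index([30, 10, 20], name="id"),  # deliberately in a different order
)

# DataFrame.join aligns rows on the index, so row order does not matter.
# how="inner" keeps only the ids present in both frames.
joined = toy_values.join(toy_labels, how="inner")
```

Because the alignment is by index rather than by position, this remains correct even if the two CSV files list the pumps in different orders.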
In [31]:
# TODO: find a way to convert the text labels into ints (factors).
# get_dummies gives one 0/1 column per label, which is not yet what we want.
dummy_ranks = pd.get_dummies(labels['status_group'], prefix='status_group')
print(dummy_ranks.head())
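One possible way to turn the text labels into ordered integers, rather than one-hot columns, is an ordered categorical. The label strings in `STATUS_ORDER` are an assumption about what `status_group` contains and should be checked against `labels['status_group'].unique()` before use.

```python
import pandas as pd

# Assumed label strings, ordered from worst to best state.
STATUS_ORDER = ["non functional", "functional needs repair", "functional"]


def encode_status(status):
    """Encode status labels as 0, 1, 2 following STATUS_ORDER.

    Labels not listed in STATUS_ORDER are encoded as -1.
    """
    categorical = pd.Categorical(status, categories=STATUS_ORDER, ordered=True)
    return pd.Series(categorical.codes, index=status.index, name="status_code")


example = pd.Series(
    ["functional", "non functional", "functional needs repair", "broken"]
)
codes = encode_status(example)
```

The -1 code makes unexpected labels easy to spot, and the 0/1/2 coding matches the ordering used in the functionality score at the top of the notebook.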