Officials are investing \$27 million to dig up lead pipes in Flint, MI. Before they spend \$5k digging up the pipes at a home, they want a better estimate of whether the pipe is lead. We have two observable variables: whether the home is Old (i.e. built before 1950) or not (built after 1950), and what the messy city records suggest. Keep in mind that the city records are often wrong.
We make the "Naive Bayes" assumption that, given the target HasLead(X), the events IsOld(X) and RecordsSayLead(X) are conditionally independent of each other. Initially, the city believes the following parameters are roughly true:
\begin{align} P(HasLead(X)) &= 0.4\\ P(IsOld(X) \mid HasLead(X)) &= 0.7\\ P(IsOld(X) \mid Not HasLead(X)) &= 0.3\\ P(RecordsSayLead(X) \mid HasLead(X)) &= 0.8\\ P(RecordsSayLead(X) \mid Not HasLead(X)) &= 0.5 \end{align}Compute the probability: $$P(HasLead(X) \mid IsOld(X), RecordsSayLead(X))$$ Now do the same for the other three conditions (i.e. conditioning on $IsOld(X) \& Not RecordsSayLead(X)$, etc.).
We use Bayes' rule, then apply the Naive Bayes assumption.
\begin{align} P(HasLead(X) \mid IsOld(X), RecordsSayLead(X)) &= \frac{P(IsOld(X), RecordsSayLead(X) \mid HasLead(X)) \cdot P(HasLead(X))}{P(IsOld(X), RecordsSayLead(X))} \\ &= \frac{P(IsOld(X) \mid HasLead(X)) \cdot P(RecordsSayLead(X) \mid HasLead(X)) \cdot P(HasLead(X))}{P(IsOld(X), RecordsSayLead(X))} \\ &= \frac{0.7 \cdot 0.8 \cdot 0.4}{P(IsOld(X), RecordsSayLead(X))} \end{align}We also have to marginalize to compute the denominator.
\begin{align} P(IsOld(X), RecordsSayLead(X)) &= P(IsOld(X), RecordsSayLead(X), HasLead(X)) \\ & \quad \quad + P(IsOld(X), RecordsSayLead(X), Not HasLead(X))\\ &= P(IsOld(X), RecordsSayLead(X) \mid HasLead(X)) \cdot P(HasLead(X)) \\ & \quad \quad + P(IsOld(X), RecordsSayLead(X) \mid Not HasLead(X)) \cdot P(Not HasLead(X))\\ &= P(IsOld(X) \mid HasLead(X)) \cdot P(RecordsSayLead(X) \mid HasLead(X)) \cdot P(HasLead(X)) \\ & \quad \quad + P(IsOld(X) \mid Not HasLead(X)) \cdot P(RecordsSayLead(X) \mid Not HasLead(X)) \cdot P(Not HasLead(X))\\ &= 0.7 \cdot 0.8 \cdot 0.4 + 0.3 \cdot 0.5 \cdot (1-0.4) \end{align}So the final answer is $\frac{0.7 \cdot 0.8 \cdot 0.4}{0.7 \cdot 0.8 \cdot 0.4 + 0.3 \cdot 0.5 \cdot (1-0.4)} \approx 0.713$.
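We can sanity-check this and get the other three posteriors at once by plugging the initial parameters into the same formula. Below is a minimal sketch; the helper `posterior` is just for illustration and not part of any starter code.
In [ ]:
# Initial (assumed) parameters from the city
p_lead = 0.4
p_old_lead, p_old_not = 0.7, 0.3   # P(IsOld | HasLead), P(IsOld | Not HasLead)
p_rsl_lead, p_rsl_not = 0.8, 0.5   # P(RecordsSayLead | HasLead), P(RecordsSayLead | Not HasLead)

def posterior(is_old, records_say_lead):
    # P(evidence | class): flip each conditional when the observation is False
    like_lead = (p_old_lead if is_old else 1 - p_old_lead) * \
                (p_rsl_lead if records_say_lead else 1 - p_rsl_lead)
    like_not = (p_old_not if is_old else 1 - p_old_not) * \
               (p_rsl_not if records_say_lead else 1 - p_rsl_not)
    # Bayes' rule; the denominator marginalizes over HasLead
    return like_lead * p_lead / (like_lead * p_lead + like_not * (1 - p_lead))

for is_old in (True, False):
    for rsl in (True, False):
        print('IsOld={}, RecordsSayLead={}: {:.3f}'.format(is_old, rsl, posterior(is_old, rsl)))
The first line printed should match the 0.713 computed above.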
Over the past month, Flint has dug up about 200 service lines and observed the pipe material for several of them. The city is starting to believe its initial estimates are incorrect.
They want to update these values, still assuming the Naive Bayes model. Here are the necessary parameters of this model: \begin{align} P(HasLead(X)) &= \pi_{\text{HasLead}} = ?\\ P(IsOld(X) \mid HasLead(X)) & = \theta_{\text{HasLead}, \text{IsOld}} = ?\\ P(IsOld(X) \mid Not HasLead(X)) &= \theta_{\text{NotHasLead}, \text{IsOld}} = ?\\ P(RecordsSayLead(X) \mid HasLead(X)) &= \theta_{\text{HasLead}, \text{RecordsSayLead}} = ?\\ P(RecordsSayLead(X) \mid Not HasLead(X)) &= \theta_{\text{NotHasLead}, \text{RecordsSayLead}} = ? \end{align}
Load the dataset and compute the maximum likelihood estimate for the above parameters.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
data = pd.read_csv('estimating_lead.csv')
# Run this to see a printout of the data
data
Out[1]:
In [2]:
# The object 'data' is a pandas DataFrame
# Don't worry if you don't know what that is, we can turn it into a numpy array
datamatrix = data.to_numpy()
For the Naive Bayes model, computing the MLE for the parameters is straightforward because it reduces to estimating empirical frequencies (i.e. you just need to count how many times something occurred in our dataset). Let's say we have $N$ examples in our dataset, and I'll use the notation $\#\{ \}$ to mean "size of set".
\begin{align} \pi_{\text{HasLead}}^{\text{MLE}} &= \frac{\#\{HasLead(X_i)\}}{N} \\ \theta_{\text{HasLead},\text{IsOld}}^{\text{MLE}} &= \frac{\#\{HasLead(X_i) \wedge IsOld(X_i)\}}{\#\{HasLead(X_i)\}} \\ \theta_{\text{Not HasLead},\text{IsOld}}^{\text{MLE}} &= \frac{\#\{Not HasLead(X_i) \wedge IsOld(X_i)\}}{\#\{Not HasLead(X_i)\}} \\ \theta_{\text{HasLead},\text{RSL}}^{\text{MLE}} &= \frac{\#\{HasLead(X_i) \wedge RecordsSayLead(X_i)\}}{\#\{HasLead(X_i)\}} \\ \theta_{\text{Not HasLead},\text{RSL}}^{\text{MLE}} &= \frac{\#\{Not HasLead(X_i) \wedge RecordsSayLead(X_i)\}}{\#\{Not HasLead(X_i)\}} \end{align}
In [14]:
# We can use some pandas magic to do these counts quickly
N = data.shape[0]
lead = data.Lead == 'Lead'                  # mask: homes whose pipe is lead
old = data.IsOld == True                    # mask: homes built before 1950
rsl = data.RecordSaysLead == True           # mask: city records say lead
params = {
    'pi_mle': lead.sum() / N,
    'theta_haslead_isold': (lead & old).sum() / lead.sum(),
    'theta_nothaslead_isold': (~lead & old).sum() / (~lead).sum(),
    'theta_haslead_rsl': (lead & rsl).sum() / lead.sum(),
    'theta_nothaslead_rsl': (~lead & rsl).sum() / (~lead).sum(),
}
print(pd.Series(params))
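As a quick sanity check, we can plug these MLE estimates back into the Bayes-rule computation from the first part; the cell below is a sketch that reuses the `params` dict defined above.
In [ ]:
# Posterior P(HasLead | IsOld, RecordsSayLead) under the MLE parameters
num = params['pi_mle'] * params['theta_haslead_isold'] * params['theta_haslead_rsl']
den = num + ((1 - params['pi_mle']) * params['theta_nothaslead_isold']
             * params['theta_nothaslead_rsl'])
print('P(HasLead | IsOld, RecordsSayLead) =', num / den)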
For the case of a discrete event, such as material = Lead or material = NoLead, we are working with a categorical distribution, i.e. a distribution over which of $C$ outcomes occurs. The parameters of this distribution are a probability vector $\pi \in \Delta_C$. (That is, $\pi_c \geq 0$ for all $c$, and $\sum_c \pi_c = 1$.)
Often when we have limited data, we want to add a prior distribution on our parameters. The standard prior to use is a Dirichlet with parameters $\alpha_1, \ldots, \alpha_C$. That is, we assume that $\pi \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_C)$. Recall that the Dirichlet has PDF $f(\pi_1, \dots, \pi_C) = \frac{1}{B(\alpha)} \prod_{c=1}^C \pi_c^{\alpha_c-1}$, where $B(\cdot)$ is just the normalizing term.
For our Flint problem, assume that the parameters $(\pi_{\text{HasLead}}, 1-\pi_{\text{HasLead}}) \sim \mathrm{Dirichlet}(3,3)$. Compute the MAP estimate of $\pi_{\text{HasLead}}$ for this distribution using the above dataset.
For the Naive Bayes model, computing the MAP estimate is very similar to the MLE, but you add $\alpha_c - 1$ to each count as a "pseudocount" in the frequency calculation. In this case we have two prior parameters, $\alpha_{\text{HasLead}}$ and $\alpha_{\text{NotHasLead}}$, both of which we are choosing to be 3.
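To see where the $-1$ comes from, write the posterior over $\pi_{\text{HasLead}}$ (up to normalization) as the likelihood times the prior, with $n = \#\{HasLead(X_i)\}$:
\begin{align} p(\pi_{\text{HasLead}} \mid \text{data}) &\propto \pi_{\text{HasLead}}^{n} (1-\pi_{\text{HasLead}})^{N-n} \cdot \pi_{\text{HasLead}}^{\alpha_{\text{HasLead}} - 1} (1-\pi_{\text{HasLead}})^{\alpha_{\text{NotHasLead}} - 1} \end{align}Setting the derivative of the log-posterior to zero yields the estimate below.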
\begin{align} \pi_{\text{HasLead}}^{\text{MAP}} &= \frac{\#\{HasLead(X_i)\} + \alpha_{\text{HasLead}} - 1 }{N + \alpha_{\text{HasLead}} - 1 + \alpha_{\text{NotHasLead}} - 1} = \frac{\#\{HasLead(X_i)\} + 2 }{N + 4} \end{align}
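A minimal sketch of this computation on the loaded dataset (the variable names `alpha_lead` and `alpha_notlead` are our own):
In [ ]:
alpha_lead, alpha_notlead = 3, 3             # Dirichlet prior parameters
n_lead = (data.Lead == 'Lead').sum()         # count of homes observed with lead pipes
pi_map = (n_lead + alpha_lead - 1) / (N + alpha_lead + alpha_notlead - 2)
print('MAP estimate of pi_HasLead:', pi_map)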