$$ \LaTeX \text{ command declarations here.} \newcommand{\N}{\mathcal{N}} \newcommand{\R}{\mathbb{R}} \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\norm}[1]{\|#1\|_2} \newcommand{\d}{\mathop{}\!\mathrm{d}} \newcommand{\qed}{\qquad \mathbf{Q.E.D.}} \newcommand{\vx}{\mathbf{x}} \newcommand{\vy}{\mathbf{y}} \newcommand{\vt}{\mathbf{t}} \newcommand{\vb}{\mathbf{b}} \newcommand{\vw}{\mathbf{w}} \newcommand{\vm}{\mathbf{m}} \newcommand{\I}{\mathbb{I}} \newcommand{\th}{\text{th}} $$

EECS 445: Machine Learning

Hands On 08: More on Naive Bayes (saving SVM for later...)

  • Instructors: Ben Bray, Chansoo Lee, Jia Deng, Jake Abernethy
  • Date: October 5, 2016

Estimating the probability of lead pipes in Flint

Officials are investing \$27 million to dig up lead pipes in Flint, MI. Before they spend \$5k digging up the pipes at a home, they want a better estimate of whether the pipes are lead. We have two observable variables: whether the home is Old (i.e., built before 1950) or not (built after 1950), and what the messy city records suggest. Keep in mind the city records are often wrong.

We make the Naive Bayes assumption that, given the target HasLead(X), the events IsOld(X) and RecordsSayLead(X) are independent of each other. Initially, the city believes the following parameters are roughly correct:

\begin{align} P(HasLead(X)) &= 0.4\\ P(IsOld(X) \mid HasLead(X)) &= 0.7\\ P(IsOld(X) \mid Not HasLead(X)) &= 0.3\\ P(RecordsSayLead(X) \mid HasLead(X)) &= 0.8\\ P(RecordsSayLead(X) \mid Not HasLead(X)) &= 0.5 \end{align}

Compute the probability: $$P(HasLead(X) \mid IsOld(X), RecordsSayLead(X))$$ Now do the same for the other three conditions (i.e., conditioning on $IsOld(X) \& Not RecordsSayLead(X)$, etc.).

Solution:

We use Bayes' rule, then apply the Naive Bayes assumption to factor the likelihood.

\begin{align} P(HasLead(X) \mid IsOld(X), RecordsSayLead(X)) &= \frac{P(IsOld(X), RecordsSayLead(X) \mid HasLead(X)) \cdot P(HasLead(X))}{P(IsOld(X), RecordsSayLead(X))} \\ &= \frac{P(IsOld(X) \mid HasLead(X)) \cdot P(RecordsSayLead(X) \mid HasLead(X)) \cdot P(HasLead(X))}{P(IsOld(X), RecordsSayLead(X))} \\ &= \frac{0.7 \cdot 0.8 \cdot 0.4}{P(IsOld(X), RecordsSayLead(X))} \end{align}

We also have to marginalize to compute the denominator.

\begin{align} P(IsOld(X), RecordsSayLead(X)) &= P(IsOld(X), RecordsSayLead(X), HasLead(X)) \\ & \quad \quad + P(IsOld(X), RecordsSayLead(X), Not HasLead(X))\\ &= P(IsOld(X), RecordsSayLead(X) \mid HasLead(X)) \cdot P(HasLead(X)) \\ & \quad \quad + P(IsOld(X), RecordsSayLead(X) \mid Not HasLead(X)) \cdot P(Not HasLead(X))\\ &= P(IsOld(X) \mid HasLead(X)) \cdot P(RecordsSayLead(X) \mid HasLead(X)) \cdot P(HasLead(X)) \\ & \quad \quad + P(IsOld(X) \mid Not HasLead(X)) \cdot P(RecordsSayLead(X) \mid Not HasLead(X)) \cdot P(Not HasLead(X))\\ &= 0.7 \cdot 0.8 \cdot 0.4 + 0.3 \cdot 0.5 \cdot (1-0.4) \end{align}

So the final answer is $\frac{0.7 \cdot 0.8 \cdot 0.4}{0.7 \cdot 0.8 \cdot 0.4 + 0.3 \cdot 0.5 \cdot (1-0.4)} \approx 0.713$.
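To handle the other three conditions, it is easiest to write a small helper that evaluates the same formula for any observation pattern. A minimal sketch in Python (the variable names are our own; the numbers are the city's initial parameter estimates given above):

# Posterior P(HasLead | IsOld = o, RecordsSayLead = r) under the Naive Bayes
# model, for every combination of the two observations, using the initial
# parameter estimates from the problem statement.
pi = 0.4                              # P(HasLead)
theta_old = {True: 0.7, False: 0.3}   # P(IsOld | HasLead), P(IsOld | Not HasLead)
theta_rsl = {True: 0.8, False: 0.5}   # P(RecordsSayLead | HasLead), P(RecordsSayLead | Not HasLead)

def posterior_haslead(is_old, records_say_lead):
    def likelihood(has_lead):
        p_old = theta_old[has_lead] if is_old else 1 - theta_old[has_lead]
        p_rsl = theta_rsl[has_lead] if records_say_lead else 1 - theta_rsl[has_lead]
        return p_old * p_rsl
    numerator = likelihood(True) * pi
    denominator = numerator + likelihood(False) * (1 - pi)
    return numerator / denominator

for o in (True, False):
    for r in (True, False):
        print("IsOld=%-5s RecordsSayLead=%-5s P(HasLead | obs) = %.3f"
              % (o, r, posterior_haslead(o, r)))

The first line of output reproduces the $\approx 0.713$ computed by hand above.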

Flint Starts Gathering Data, Wants to Update Parameters

Over the past month, Flint has dug up about 200 service lines, and they've observed the pipe materials for several of these. They are starting to believe their initial estimates are incorrect.

They want to update these values, still assuming the Naive Bayes model. Here are the necessary parameters of this model: \begin{align} P(HasLead(X)) &= \pi_{\text{HasLead}} = ?\\ P(IsOld(X) \mid HasLead(X)) & = \theta_{\text{HasLead}, \text{IsOld}} = ?\\ P(IsOld(X) \mid Not HasLead(X)) &= \theta_{\text{NotHasLead}, \text{IsOld}} = ?\\ P(RecordsSayLead(X) \mid HasLead(X)) &= \theta_{\text{HasLead}, \text{RecordsSayLead}} = ?\\ P(RecordsSayLead(X) \mid Not HasLead(X)) &= \theta_{\text{NotHasLead}, \text{RecordsSayLead}} = ? \end{align}

Load the dataset and compute the maximum likelihood estimate for the above parameters.


In [1]:
%pylab inline
import numpy as np
import pandas as pd
data = pd.read_csv('estimating_lead.csv')

# Run this to see a printout of the data
data


Populating the interactive namespace from numpy and matplotlib
Out[1]:
Lead IsOld RecordSaysLead
0 Lead True True
1 Lead True True
2 Lead True True
3 Lead True True
4 Lead True True
5 NotLead True True
6 Lead True True
7 NotLead False False
8 Lead True True
9 Lead True True
10 Lead True True
11 NotLead False False
12 NotLead True False
13 Lead True False
14 Lead True True
15 Lead True True
16 Lead True True
17 Lead True True
18 Lead True False
19 Lead True True
20 Lead True True
21 NotLead False True
22 Lead True False
23 Lead True False
24 Lead True False
25 NotLead False True
26 NotLead False False
27 Lead True True
28 Lead True True
29 Lead False True
... ... ... ...
170 NotLead True True
171 NotLead True True
172 NotLead False False
173 Lead True True
174 Lead True True
175 Lead False True
176 NotLead True True
177 Lead True True
178 Lead True True
179 NotLead False False
180 Lead True True
181 Lead False True
182 NotLead True False
183 Lead True True
184 Lead True True
185 Lead True True
186 Lead False True
187 Lead True True
188 NotLead False False
189 Lead True True
190 NotLead False False
191 Lead True True
192 Lead True True
193 Lead True False
194 NotLead True False
195 Lead True True
196 Lead False True
197 Lead False True
198 Lead False True
199 Lead True True

200 rows × 3 columns


In [2]:
# The object 'data' is a pandas DataFrame
# Don't worry if you don't know what that is, we can turn it into a numpy array
datamatrix = data.values  # .values returns the underlying numpy array (as_matrix() is deprecated)

Solution

For the Naive Bayes model, computing the MLE for the parameters is straightforward because it reduces to estimating empirical frequencies (i.e., you just need to count how many times something occurred in our dataset). Let's say we have $N$ examples in our dataset, and I'll use the notation $\#\{ \}$ to mean "size of set".

\begin{align} \pi_{\text{HasLead}}^{\text{MLE}} &= \frac{\#\{HasLead(X_i)\}}{N} \\ \theta_{\text{HasLead},\text{IsOld}}^{\text{MLE}} &= \frac{\#\{HasLead(X_i) \wedge IsOld(X_i)\}}{\#\{HasLead(X_i)\}} \\ \theta_{\text{Not HasLead},\text{IsOld}}^{\text{MLE}} &= \frac{\#\{Not HasLead(X_i) \wedge IsOld(X_i)\}}{\#\{Not HasLead(X_i)\}} \\ \theta_{\text{HasLead},\text{RSL}}^{\text{MLE}} &= \frac{\#\{HasLead(X_i) \wedge RecordSaysLead(X_i)\}}{\#\{HasLead(X_i)\}} \\ \theta_{\text{Not HasLead},\text{RSL}}^{\text{MLE}} &= \frac{\#\{Not HasLead(X_i) \wedge RecordSaysLead(X_i)\}}{\#\{Not HasLead(X_i)\}} \\ \end{align}

In [14]:
# We can use some pandas magic to do these counts quickly

N = data.shape[0]
params = {
    'pi_mle': data[data.Lead == 'Lead'].shape[0] / N,
    'theta_haslead_isold': data[(data.Lead == 'Lead') & (data.IsOld == True) ].shape[0] / \
                           data[data.Lead == 'Lead'].shape[0],
    'theta_nothaslead_isold': data[(data.Lead != 'Lead') & (data.IsOld == True) ].shape[0] / \
                           data[data.Lead != 'Lead'].shape[0],
    'theta_haslead_rsl': data[(data.Lead == 'Lead') & (data.RecordSaysLead == True) ].shape[0] / \
                           data[data.Lead == 'Lead'].shape[0],
    'theta_nothaslead_rsl': data[(data.Lead != 'Lead') & (data.RecordSaysLead == True) ].shape[0] / \
                           data[data.Lead != 'Lead'].shape[0],
}
print(pd.Series(params))


pi_mle                    0.700000
theta_haslead_isold       0.778571
theta_haslead_rsl         0.900000
theta_nothaslead_isold    0.383333
theta_nothaslead_rsl      0.433333
dtype: float64
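With these updated estimates in hand, we can plug them back into the posterior formula from the first part to see how the prediction changes for an old home whose records say lead. A quick sketch, reusing the params dictionary computed in the cell above:

# Posterior P(HasLead | IsOld, RecordSaysLead) under the MLE parameters,
# using the same Bayes-rule formula as in the first exercise.
pi = params['pi_mle']
numerator = params['theta_haslead_isold'] * params['theta_haslead_rsl'] * pi
denominator = numerator + params['theta_nothaslead_isold'] * params['theta_nothaslead_rsl'] * (1 - pi)
print(numerator / denominator)   # noticeably higher than the 0.713 from the initial parameters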

Putting a Prior on $\pi_{\text{HasLead}}$

For a discrete event, such as material = Lead or material = NotLead, we are working with a categorical distribution, i.e., a distribution over one of $C$ things occurring. The parameters of this distribution are a probability vector $\pi \in \Delta_C$. (That is, $\pi_c \geq 0$ for all $c$, and $\sum_c \pi_c = 1$.)

Often when we have limited data, we want to add a prior distribution on our parameters. The standard prior to use is a Dirichlet with parameters $\alpha_1, \ldots, \alpha_C$. That is, we assume that $\pi \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_C)$. Recall that the Dirichlet has PDF $f(\pi_1, \dots, \pi_C) = \frac{1}{B(\alpha)} \prod_{c=1}^C \pi_c^{\alpha_c-1}$, where $B(\cdot)$ is just the normalizing term.
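For intuition, here is a small sketch (the helper dirichlet_pdf is our own, written with only the standard library and numpy) that evaluates this density for $C = 2$ with $\alpha_1 = \alpha_2 = 3$. In that case the Dirichlet is just a Beta(3, 3), whose mode is at $\pi_1 = 1/2$, so the prior pulls the estimate toward one half when data are scarce.

import numpy as np
from math import gamma

def dirichlet_pdf(pi, alpha):
    # Density of Dirichlet(alpha) evaluated at the probability vector pi.
    B = np.prod([gamma(a) for a in alpha]) / gamma(sum(alpha))
    return np.prod([p ** (a - 1) for p, a in zip(pi, alpha)]) / B

for p in np.linspace(0.1, 0.9, 9):
    print("pi_1 = %.1f   density = %.3f" % (p, dirichlet_pdf([p, 1 - p], (3, 3))))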

For our Flint problem, assume that the parameters $(\pi_{\text{HasLead}}, 1-\pi_{\text{HasLead}}) \sim \mathrm{Dirichlet}(3,3)$. Compute the MAP estimate of $\pi_{\text{HasLead}}$ for this distribution using the above dataset.

Solution

For the Naive Bayes model, computing the MAP estimate is very similar to the MLE, but you have to add the Dirichlet prior parameters as "pseudocounts" to the frequency calculation. In this case we have two prior parameters, $\alpha_{\text{HasLead}}$ and $\alpha_{\text{NotHasLead}}$, both of which we choose to be 3.

\begin{align} \pi_{\text{HasLead}}^{\text{MAP}} &= \frac{\#\{HasLead(X_i)\} + \alpha_{\text{HasLead}} - 1 }{N + \alpha_{\text{HasLead}} - 1 + \alpha_{\text{NotHasLead}} - 1} = \frac{\#\{HasLead(X_i)\} + 2 }{N + 4} \end{align}
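Numerically, a short sketch reusing the data DataFrame loaded above:

# MAP estimate of pi_HasLead with a Dirichlet(3, 3) prior: add (alpha - 1)
# "pseudocounts" to each class before normalizing, exactly as in the formula above.
alpha_haslead, alpha_nothaslead = 3, 3
N = data.shape[0]
n_lead = (data.Lead == 'Lead').sum()
pi_map = (n_lead + alpha_haslead - 1) / (N + alpha_haslead + alpha_nothaslead - 2)
print(pi_map)   # slightly closer to 1/2 than the MLE of 0.70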