In [4]:
%matplotlib inline
from Lec07 import *
In this lecture, we will first discuss some concepts of probabilistic models for classifiers, in particular generative models and discriminative models. We will then introduce the Naive Bayes classifier, which assumes that features are independent given the label and is a classical classifier commonly used in spam email classification.
A generative model learns the class-conditional distribution $P(X | Y)$ and the label distribution $P(Y)$ from training data
Perform prediction using the posterior via Bayes' Rule. For some new data $\vx^{new}$, $$ \begin{align} y & = \underset{k \in \{1, \dots, K\}}{\arg \max} P(Y=k | X = \vx^{new} ) \\ & = \underset{k \in \{1, \dots, K\}}{\arg \max} \frac{P(X = \vx^{new} | Y=k)P(Y=k)}{P(X = \vx^{new})} \\ & = \boxed{\underset{k \in \{1, \dots, K\}}{\arg \max} P(X = \vx^{new} | Y=k)P(Y=k)} \end{align} $$ where the last equality holds because the denominator $P(X = \vx^{new})$ is independent of $k$.
The basic idea of prediction is to pick the label with the largest posterior probability given the features $\vx^{new}$.
Why is this model called generative?
The prior $P(Y)$ encodes beliefs about popularity of each label
By comparing the synthetic data and real data, we get a sense of how good our generative model is.
Conversely, a discriminative model learns posterior $P(Y | X)$ directly from training data.
Goal: select a hypothesis that discriminates between class labels.
The prediction for some new data $\vx^{new}$ is $$ y = \underset{k \in \{1, \dots, K\}}{\arg \max} P(Y=k | X = \vx^{new} ) \\$$
Does not (necessarily) provide the ability to generate new random examples because, unlike generative models, it does not model $P(X|Y)$.
Allows us to focus purely on the classification task
We will discuss the pros and cons of each model later.
We will use Naive Bayes to solve the following classification problem:
For example, in Spam Mail Classification, classify each email as SPAM ($y=1$) or HAM ($y=0$).

The essence of Naive Bayes is the conditional independence assumption $$ P(\vx | y = c) = \prod_{d=1}^D P(x_d | y=c) $$ i.e., given the label, all features are independent.
The full generative model of Naive Bayes is: $$ \begin{align} y &\sim \mathrm{Categorical}(\pi) \\ x_d | y=c &\sim \mathrm{Categorical}(\theta_{cd}) \quad \forall\, d = 1,\dots,D \end{align} $$ with parameters $\pi = (\pi_1, \dots, \pi_C)$, the label distribution, where $\sum_{c=1}^C \pi_c = 1$, and $\theta_{cd} = (\theta_{cd1}, \dots, \theta_{cdM})$, the distribution of feature $d$ given class $c$, where $\sum_{m=1}^M \theta_{cdm} = 1$.

The parameters $\pi$ and $\theta$ are learned from training data.
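To make the "generative" part concrete, here is a minimal sketch that samples synthetic $(\vx, y)$ pairs from this model; the parameter values (`pi`, `theta`) are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for C=2 classes, D=3 features, M=2 feature values
pi = np.array([0.6, 0.4])                    # pi[c]       = P(y = c)
theta = np.array([                           # theta[c,d,m] = P(x_d = m | y = c)
    [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]],    # class 0
    [[0.2, 0.8], [0.5, 0.5], [0.3, 0.7]],    # class 1
])

def sample(n):
    """Draw n synthetic (x, y) pairs from the Naive Bayes generative model."""
    ys = rng.choice(len(pi), size=n, p=pi)                  # y ~ Categorical(pi)
    xs = np.array([[rng.choice(theta.shape[2], p=theta[y, d])
                    for d in range(theta.shape[1])]         # x_d | y ~ Categorical(theta[y, d])
                   for y in ys])
    return xs, ys

X_syn, y_syn = sample(5)
print(X_syn, y_syn)
```

Comparing such synthetic samples against real data is one way to judge how well the generative model fits.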
Remark
NOTE: in the definitions and derivations of this lecture, we assume the more general case $x_d \in \{1, \dots ,M \}$ with $M>2$. In the spam email classification example and in the textbook's derivation, binary features, i.e. $M=2$, are used. So don't get confused!
When $M=2$, $x_d | y=c$ follows a Bernoulli distribution.
Given the independence assumption and full model, for some new data $\vx^{\text{new}} = (x_1^{\text{new}}, \dots, x_D^{\text{new}})$ we will classify based on $$ \begin{align} y &=\underset{c \in \{1,\dots,C\}}{\arg \max} P(y=c|\vx = \vx^{\text{new}}) \\ &=\underset{c \in \{1,\dots,C\}}{\arg \max} P(\vx = \vx^{\text{new}} | y=c) P(y=c) \\ &=\underset{c \in \{1,\dots,C\}}{\arg \max} P(y=c) \prod \nolimits_{d=1}^{D} P(x_d = x_d^{\text{new}} | y=c) \\ &=\boxed{\underset{c \in \{1,\dots,C\}}{\arg \max} \pi_c \prod \nolimits_{d=1}^{D} \theta_{cdx_d^{\text{new}}}} \\ \end{align} $$
If we assume $x_d^{\text{new}} \in \{1,\dots,M \}, \forall d = 1,\dots,D$, we could also express the above expression equivalently using indicator function $$ y = \underset{c \in \{1,\dots,C\}}{\arg \max} \pi_c \prod \nolimits_{d=1}^{D} \prod \nolimits_{m=1}^{M} \theta_{cdm}^{\mathbb{I}(m=x_d^{\text{new}})} $$
So as long as we have learned the parameters $\pi$ and $\theta$, we can classify.
Remark
Indicator function $$ \mathbb{I}(m=x_d^{\text{new}}) = \begin{cases} 1 & \text{ if } m=x_d^{\text{new}}\\ 0 & \text{ otherwise} \end{cases} $$
In the product $\prod \nolimits_{m=1}^{M} \theta_{cdm}^{\mathbb{I}(m=x_d^{\text{new}})}$, only the factor $\theta_{cdx_d^{\text{new}}}$ survives; all other factors equal 1 because their indicator exponents are 0.
One thing to note is that the above classification criterion is a product of many numbers smaller than 1, which yields a very small value and can underflow numerically. A better way is to take the logarithm, which turns the product into a summation, and then compare the log-scores.
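For instance, here is a minimal sketch of the log-space prediction rule, using small made-up parameter arrays `pi` and `theta` (indexed so that `theta[c, d, m]` stands for $\theta_{cdm}$, with feature values starting at 0 rather than 1):

```python
import numpy as np

# Hypothetical parameters: C=2 classes, D=3 features with 2 possible values each
pi = np.array([0.6, 0.4])
theta = np.array([[[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]],
                  [[0.2, 0.8], [0.5, 0.5], [0.3, 0.7]]])

def predict(x_new):
    """Pick the class maximizing log pi_c + sum_d log theta[c, d, x_d]."""
    # Summing logs avoids the numerical underflow of multiplying many small probabilities
    log_scores = np.log(pi) + np.log(theta[:, np.arange(len(x_new)), x_new]).sum(axis=1)
    return int(np.argmax(log_scores))

print(predict(np.array([0, 1, 0])))   # -> 0 for these made-up parameters
```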
The likelihood for a single data case $(\vec{x}_n, y_n=c)$ is $$ \begin{align} & P((\vec{x}_n, y_n) | \pi, \theta) \\ &= P(y_n) \prod \nolimits_{d=1}^D P(x_{nd}|y_n) \\ &= \prod \nolimits_{c=1}^C P(y_n=c)^{\I(y_n=c)} \cdot \prod \nolimits_{c=1}^C \prod \nolimits_{d=1}^D \prod \nolimits_{m=1}^M P(x_{nd}=m|y_n=c)^{\I(x_{nd}=m) \I(y_n=c)}\\ &= \prod \nolimits_{c=1}^C \pi_c^{\I(y_n=c)} \cdot \prod \nolimits_{c=1}^C \prod \nolimits_{d=1}^D \prod \nolimits_{m=1}^M \theta_{cdm}^{\I(x_{nd}=m) \I(y_n=c)}\\ \end{align} $$
Therefore, the log-likelihood is $$ \begin{split} & \log P((\vec{x}_n, y_n) | \pi, \theta) \\ & = \sum \nolimits_{c=1}^C \I(y_n=c) \log \pi_c + \sum \nolimits_{c=1}^C \sum \nolimits_{d=1}^D \sum \nolimits_{m=1}^M \I(x_{nd}=m) \I(y_n=c) \log \theta_{cdm} \end{split} $$
The log-likelihood for all training data $\mathcal{D} = \{ (\vec{x}_n, y_n) \}_{n=1}^N $ is $$ \begin{align} & \log P(\mathcal{D}| \pi, \theta)\\ &= \log \prod \nolimits_{n=1}^N P((\vec{x}_n, y_n) | \pi, \theta) = \sum \nolimits_{n=1}^N \log P((\vec{x}_n, y_n) | \pi, \theta) \\ &= \boxed{\sum \nolimits_{n=1}^N \sum \nolimits_{c=1}^C \I(y_n=c) \log \pi_c + \sum \nolimits_{n=1}^N \sum \nolimits_{c=1}^C \sum \nolimits_{d=1}^D \sum \nolimits_{m=1}^M \I(x_{nd}=m) \I(y_n=c) \log \theta_{cdm}} \end{align} $$
With the constraints $\sum_{c=1}^C \pi_c=1$ and $\sum_{m=1}^M \theta_{cdm}=1$, we can maximize the log-likelihood $\log P(\mathcal{D}| \pi, \theta)$ using Lagrange multipliers. (Derivation is in the notes!)
By maximizing log-likelihood function, we could have maximum likelihood estimators: $$ \hat{\pi}_c = \frac{N_c}{N} \quad \hat{\theta}_{cdm} = \frac{N_{cdm}}{N_c} $$ and $$ \hat{\pi} = (\hat{\pi}_1, \dots,\hat{\pi}_c, \dots,\hat{\pi}_C); \hat{\theta}_{cd} = (\hat{\theta}_{cd1}, \dots,\hat{\theta}_{cdm}, \dots,\hat{\theta}_{cdM}) $$
Intuitive Interpretation: $\hat{\pi}_c$ is the fraction of training examples with label $c$, and $\hat{\theta}_{cdm}$ is the fraction of class-$c$ examples whose $d$-th feature takes value $m$.
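As a minimal sketch of these counting estimators (the toy labels and binary feature values below are made up for illustration):

```python
import numpy as np

# Toy training data: N=6 examples, D=3 features with values in {0, 1}, C=2 classes
X = np.array([[0, 1, 0],
              [0, 0, 0],
              [1, 1, 0],
              [1, 1, 1],
              [0, 1, 1],
              [1, 0, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

C, D, M = 2, X.shape[1], 2

# pi_hat[c] = N_c / N
pi_hat = np.bincount(y, minlength=C) / len(y)

# theta_hat[c, d, m] = N_{cdm} / N_c
theta_hat = np.zeros((C, D, M))
for c in range(C):
    Xc = X[y == c]
    for d in range(D):
        theta_hat[c, d] = np.bincount(Xc[:, d], minlength=M) / len(Xc)

print(pi_hat)        # [0.5 0.5]
print(theta_hat[0])  # per-value frequencies of each feature within class 0
```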
Remark
Derivation of maximum likelihood estimator $\hat{\pi}_c$
- We have the following problem $$ \begin{matrix} \left\{ \begin{split} \max \quad &\log P(\mathcal{D}| \pi, \theta) \\ \text{s.t.} \quad &\sum \nolimits_{c=1}^C \pi_c=1 \end{split} \right. & \overset{\text{equivalent to}}{\Longrightarrow} & \left \{ \begin{split} \max \quad & \sum \nolimits_{n=1}^N \sum \nolimits_{c=1}^C \I(y_n=c) \log \pi_c \\ \text{s.t.} \quad &\sum \nolimits_{c=1}^C \pi_c = 1 \end{split} \right. \end{matrix} $$ We drop the second term in $\log P(\mathcal{D}| \pi, \theta)$ because it doesn't depend on $\pi_c$
- The Lagrangian is $$ L(\pi, \lambda) = \sum \nolimits_{n=1}^N \sum \nolimits_{c=1}^C \I(y_n=c) \log \pi_c - \lambda \left( \sum \nolimits_{c=1}^C \pi_c - 1 \right) $$
- Setting partial derivative with respect to $\pi_c$ to 0, we have $$ \frac{\partial L(\pi, \lambda)}{\partial \pi_c} = 0 \quad \Rightarrow \quad \sum \nolimits_{n=1}^N \I(y_n=c) \frac{1}{\pi_c} - \lambda = 0 \quad \Rightarrow \quad \pi_c = \frac{1}{\lambda} \sum \nolimits_{n=1}^N \I(y_n=c) $$
- Plug $\pi_c$ back into the constraint $\sum \nolimits_{c=1}^C \pi_c=1$, we have $$ \frac{1}{\lambda} \sum \nolimits_{c=1}^C \sum \nolimits_{n=1}^N \I(y_n=c) = 1 \quad \Rightarrow \quad \lambda = \sum \nolimits_{c=1}^C \sum \nolimits_{n=1}^N \I(y_n=c) $$
- Plug $\lambda$ into $\pi_c = \frac{1}{\lambda} \sum \nolimits_{n=1}^N \I(y_n=c)$, we have $$ \hat{\pi}_c = \frac{\sum \nolimits_{n=1}^N \I(y_n=c)}{ \sum \nolimits_{n=1}^N \sum \nolimits_{c=1}^C \I(y_n=c)} = \frac{N_c}{N} $$
Derivation of maximum likelihood estimator $\hat{\theta}_{cdm}$
With the constraint $\sum_{m=1}^M \theta_{cdm}=1$, using similar approach, we could have $$ \hat{\theta}_{cdm} = \frac{\sum \nolimits_{n=1}^N \I(x_{nd}=m) \I(y_n=c)}{\sum_{n=1}^N \sum_{m=1}^M \I(x_{nd}=m) \I(y_n=c)} = \frac{N_{cdm}}{N_c} $$
- Details are left as an exercise XD
Problem: When working with text, features are sparse:
This causes overfitting!
What if a word (e.g. "subject:") occurs in every training example of both classes?

Solution: Place Dirichlet priors on $\pi$ and $\theta_{cd}$ to smooth out unknowns: $$ \begin{align} \pi &\sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_C) \\ \theta_{cd} &\sim \mathrm{Dirichlet}(\beta_{cd1}, \dots, \beta_{cdM}) & &\quad \forall\, c=1,\dots,C; d=1, \dots, D \\ y &\sim \mathrm{Categorical}(\pi) \\ x_d | y=c &\sim \mathrm{Categorical}(\theta_{cd}) & & \quad \forall\, d = 1,\dots,D \end{align} $$
The Dirichlet distribution $\pi \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_C)$ defines a distribution over the simplex of probability vectors $\pi = (\pi_1, \dots, \pi_C)$, i.e. $\pi_c \geq 0$ and $\sum_{c=1}^C \pi_c = 1$, with density $f(\pi) = \frac{1}{B(\alpha)} \prod_{c=1}^C \pi_c^{\alpha_c-1}$.
When $M=2$, the Dirichlet prior on $\theta_{cd}$ reduces to a Beta distribution.
The MAP parameter estimates with priors $$ \pi \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_C) \qquad \theta_{cd} \sim \mathrm{Dirichlet}(\beta_{cd1}, \dots, \beta_{cdM}) $$ are $$ \hat{\pi}_c = \frac{N_c+\alpha_c-1}{N + \sum_{c'=1}^C (\alpha_{c'}-1)} \quad \hat{\theta}_{cdm} = \frac{N_{cdm}+\beta_{cdm}-1}{N_c + \sum_{m'=1}^M (\beta_{cdm'}-1)} $$
Proof is in the notes!
The Dirichlet $\alpha$ and $\beta_{cd}$ parameters turn out to be pseudocounts!
The choice $\alpha_c = \beta_{cdm} = 1$ is referred to as Laplace Smoothing
Note that the posterior of parameters still have Dirichlet distributions! $$ \pi|\mathcal{D} \sim \mathrm{Dirichlet}(N_1+\alpha_1, \dots, N_C+\alpha_C) \qquad \theta_{cd}|\mathcal{D} \sim \mathrm{Dirichlet}(N_{cd1}+\beta_{cd1}, \dots, N_{cdM}+\beta_{cdM}) $$ Proof for $\pi|\mathcal{D}$ is in the notes! Proof for $\theta_{cd}|\mathcal{D}$ is left as an exercise.
If we pick the mean of posteriors as parameter estimate, we could get a slightly different result: $$ \bar{\pi}_c = \frac{N_c+\alpha_c}{N + \sum_{c'=1}^C \alpha_{c'}} \quad \bar{\theta}_{cdm} = \frac{N_{cdm}+\beta_{cdm}}{N_c + \sum_{m'=1}^M \beta_{cdm'}} $$ $-1$ terms no longer exist!
You might see this version of the estimate in [MLAPP] and some online materials
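To see the difference between the MLE, the MAP estimate, and the posterior-mean estimate on concrete numbers, here is a small sketch with made-up class counts and pseudocounts:

```python
import numpy as np

# Class counts from a hypothetical training set where class 1 never appears
N_c = np.array([5, 0])
N = N_c.sum()
alpha = np.array([2.0, 2.0])   # Dirichlet pseudocounts, chosen purely for illustration

pi_mle  = N_c / N                                        # [1.0, 0.0]  zero-count problem
pi_map  = (N_c + alpha - 1) / (N + (alpha - 1).sum())    # [6/7, 1/7]
pi_mean = (N_c + alpha) / (N + alpha.sum())              # [7/9, 2/9]  no "-1" terms

print(pi_mle, pi_map, pi_mean)
```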
Advantage of using mean estimate
Remark
Posterior also has Dirichlet distribution!
Prior: $$ \pi \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_C) \qquad f(\pi) = \frac{1}{B(\alpha)} \prod \nolimits_{c=1}^C \pi_c^{\alpha_c-1} $$
Likelihood: $$ P(\mathcal{D} | \pi) = \prod \nolimits_{n=1}^N \prod \nolimits_{c=1}^C \pi_c^{\I(y_n=c)} = \prod \nolimits_{c=1}^C \pi_c^{\sum \nolimits_{n=1}^N \I(y_n=c)} = \prod \nolimits_{c=1}^C \pi_c^{N_c} $$
Posterior: $$ \begin{split} f(\pi | \mathcal{D}) &= \frac{P(\mathcal{D} | \pi) f(\pi)}{P(\mathcal{D})} = \frac{1}{P(\mathcal{D})} \prod \nolimits_{c=1}^C \pi_c^{N_c} \cdot \frac{1}{B(\alpha)} \prod \nolimits_{c=1}^C \pi_c^{\alpha_c-1} \\ &= \frac{1}{P(\mathcal{D}) B(\alpha)} \prod \nolimits_{c=1}^C \pi_c^{N_c+\alpha_c-1} \\ &= \frac{1}{B'(\alpha)} \prod \nolimits_{c=1}^C \pi_c^{N_c+\alpha_c-1} \end{split} $$ where $B'(\alpha) = P(\mathcal{D}) B(\alpha)$.
From the expression of $f(\pi | \mathcal{D})$, we can see that the posterior is also a Dirichlet distribution $$ \pi | \mathcal{D} \sim \mathrm{Dirichlet}(N_1+\alpha_1, \dots, N_C+\alpha_C) $$
Conjugate Prior
If the posterior and the prior are in the same distribution family for a given likelihood, we call that distribution a conjugate prior for the likelihood.
This is useful because we can take the posterior as the prior for the next learning phase, which enables sequential Bayesian learning.
In our case, we have shown Dirichlet distribution is the conjugate prior of the multinomial distribution!
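A tiny sketch of this sequential update for $\pi$: because the Dirichlet posterior is again a Dirichlet, updating on a new batch of labels just adds the batch's class counts to the current pseudocounts (the batches below are made up):

```python
import numpy as np

alpha = np.array([1.0, 1.0])        # initial Dirichlet prior

# Two hypothetical batches of labels arriving over time
for batch in (np.array([0, 0, 1]), np.array([1, 1, 1, 0])):
    N_c = np.bincount(batch, minlength=len(alpha))
    alpha = alpha + N_c             # posterior becomes the prior for the next batch
    print(alpha)

# Final alpha equals the initial prior plus the counts of all data seen so far: [4. 5.]
```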
Derivation of MAP estimate ${\hat{\pi}_c}_{MAP}$
MAP estimate is obtained by maximizing the posterior $f(\pi | \mathcal{D})$ $$ \begin{matrix} \left\{ \begin{split} \max \quad & \prod \nolimits_{c=1}^C \pi_c^{N_c+\alpha_c-1}\\ \text{s.t.} \quad &\sum \nolimits_{c=1}^C \pi_c=1 \end{split} \right. & \overset{\text{equivalent to}}{\Longrightarrow} & \left \{ \begin{split} \max \quad & \sum \nolimits_{c=1}^C \left( N_c + \alpha_c -1 \right) \log \pi_c \\ \text{s.t.} \quad &\sum \nolimits_{c=1}^C \pi_c = 1 \end{split} \right. \end{matrix} $$
Recall when deriving maximum likelihood estimator, we solved the following problem $$ \left \{ \begin{split} \max \quad & \sum \nolimits_{c=1}^C \sum \nolimits_{n=1}^N \I(y_n=c) \log \pi_c \\ \text{s.t.} \quad &\sum \nolimits_{c=1}^C \pi_c = 1 \end{split} \right. \longrightarrow {\hat{\pi}_c}_{MLE} = \frac{\sum \nolimits_{n=1}^N \I(y_n=c)}{ \sum \nolimits_{n=1}^N \sum \nolimits_{c=1}^C \I(y_n=c)} = \frac{N_c}{N} $$
So the solution of our current problem can be easily read off $$ \begin{split} {\hat{\pi}_c}_{MAP} &= \frac{N_c + \alpha_c -1 }{\sum \nolimits_{c'=1}^C \left(N_{c'} + \alpha_{c'} -1 \right)} = \frac{N_c + \alpha_c -1}{N + \sum \nolimits_{c'=1}^C \left(\alpha_{c'} -1\right)} \end{split} $$
Derivation of MAP estimator $\hat{\theta}_{cdm}$ is left as an exercise! Approach is exactly the same as deriving $\hat{\pi}_c$.
Naive Bayes assumes the features are conditionally independent given the class label.
This model is naive because we would never expect features to be independent!
It seems not to matter that independence is often false...