$p(A \, | \, B)$: the probability of event $A$ given $B$
From basic probability: $$ \begin{align} p(A \, | \, B) &= \frac{p(A \bigcap B)}{p(B)} \end{align} $$ where $p(A \bigcap B)$ is the joint probability that $A$ and $B$ both happen.
Alternative representation: $$ \begin{align} p(A \bigcap B) &= p(A, B) \end{align} $$
$$ \begin{align} p(A \, | \, B) &= \frac{p(A, B)}{p(B)} \end{align} $$ Assume $p(B) > 0$; otherwise the conditional probability is undefined.
Apply conditional probability to the numerator again: $$ \begin{align} p(A \, | \, B) &= \frac{p(A, B)}{p(B)} \\ &= \frac{p(B \, | \, A) p(A)}{p(B)} \end{align} $$
To help remember, consider symmetry: $$ \begin{align} p(A, B) &= p(A \, | \, B) p(B) \\ &= p(B \, | \, A) p(A) \end{align} $$
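The symmetry above can be checked numerically. The following is a minimal sketch, assuming a made-up joint distribution over two binary events (the four probabilities and the helper names are illustrative, not from the text):

```python
# Hypothetical joint distribution p(A, B) over two binary events.
# The four numbers are made up for illustration and sum to 1.
joint = {
    (True, True): 0.10,
    (True, False): 0.20,
    (False, True): 0.30,
    (False, False): 0.40,
}


def marginal_a(a):
    """p(A = a), obtained by summing the joint over B."""
    return sum(p for (ai, _), p in joint.items() if ai == a)


def marginal_b(b):
    """p(B = b), obtained by summing the joint over A."""
    return sum(p for (_, bi), p in joint.items() if bi == b)


p_a = marginal_a(True)
p_b = marginal_b(True)
p_ab = joint[(True, True)]

p_a_given_b = p_ab / p_b             # definition of conditional probability
p_b_given_a = p_ab / p_a             # same definition with the roles swapped
bayes_rhs = p_b_given_a * p_a / p_b  # right-hand side of Bayes' rule

print(p_a_given_b, bayes_rhs)
```

Both printed values agree up to floating-point rounding, since each equals $p(A, B) / p(B)$.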
A bank decides whether to make a loan to a customer:
$ \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} $ : customer income $x_1$ and asset $x_2$
$C$: 0/1 if the customer is unlikely/likely to pay back the loan
Make a loan if $P(C = 1 \, | \, \mathbf{x}) > 0.5$ or other threshold value
How to compute $P(C \, | \, \mathbf{x})$, which is unknown?
From conditional probability: $$ \begin{align} p\left(C \, | \, \mathbf{x}\right) &= \frac{p\left(\mathbf{x} \, | \, C\right) p\left(C\right)}{p\left(\mathbf{x}\right)} \\ &\propto p\left(\mathbf{x} \, | \, C\right) p\left(C\right) \end{align} $$
$p(C \, | \, \mathbf{x})$: posterior
the probability of $C$ after observing $\mathbf{x}$
$p(C)$: prior
how likely $C$ is before observing $\mathbf{x}$
$p(\mathbf{x} \, | \, C)$: likelihood
how likely $\mathbf{x}$ is if it belongs to $C$
$p(\mathbf{x})$: marginal/evidence
constant for given $\mathbf{x}$
We can compute the posterior $p(C \, | \, \mathbf{x})$ if we are given the prior $p(C)$ and the likelihood $p(\mathbf{x} \, | \, C)$.
A plane crash is much more deadly than a car crash (here $\mathbf{x}$ denotes a fatality): $ p(\mathbf{x} \, | \, C_{plane}) \gg p(\mathbf{x} \, | \, C_{car}) $
say $ \begin{align} p(\mathbf{x} \, | \, C_{plane}) &= 1.0 \\ p(\mathbf{x} \, | \, C_{car}) &= 0.1 \end{align} $
But a plane crash is much rarer than a car crash: $ p(C_{plane}) \ll p(C_{car}) $
say $ \begin{align} p(C_{plane}) &= 0.001 \\ p(C_{car}) &= 0.1 \end{align} $
Multiply together: $ \begin{align} p(\mathbf{x} \, | \, C_{plane}) p(C_{plane}) &= 0.001 \\ p(\mathbf{x} \, | \, C_{car}) p(C_{car}) &= 0.01 \end{align} $
$ \begin{align} \frac{p(C_{plane} \, | \, \mathbf{x})}{p(C_{car} \, | \, \mathbf{x})} &= \frac{p(\mathbf{x} \, | \, C_{plane}) p(C_{plane})}{p(\mathbf{x} \, | \, C_{car}) p(C_{car})} \end{align} $
Thus plane travel is actually safer than car travel: $ p(C_{plane} \, | \, \mathbf{x}) < p(C_{car} \, | \, \mathbf{x}) $
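The arithmetic of this example can be reproduced in a few lines. This is a small sketch using the likelihood and prior values stated above; the dictionary names are illustrative. Because the evidence $p(\mathbf{x})$ cancels in the ratio, the unnormalized products are enough:

```python
# Likelihoods and priors as given in the example.
likelihood = {"plane": 1.0, "car": 0.1}   # p(x | C)
prior = {"plane": 0.001, "car": 0.1}      # p(C)

# Unnormalized posteriors: p(x | C) * p(C) for each class.
unnormalized = {c: likelihood[c] * prior[c] for c in prior}

# The evidence p(x) cancels when taking the ratio of the two posteriors.
ratio = unnormalized["plane"] / unnormalized["car"]

print(unnormalized)
print("posterior ratio plane/car:", ratio)
```

The ratio comes out at roughly 0.1, i.e. the posterior for the plane case is about ten times smaller than for the car case.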
$C$: has cancer
$\overline{C}$: no cancer
$1/0$: positive/negative cancer screening result
$p(C) = 0.01$
$C$:
$\overline{C}$:
What is $p(C | 1)$?
https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/
$R(C_i\, | \mathbf{x})$: expected value (e.g. utility/loss/risk) for taking class $C_i$ given data $\mathbf{x}$
$R_{ik}$: value for taking class $C_i$ when the actual class is $C_k$
$$ \begin{align} R(C_i \, | \mathbf{x}) = \sum_{k} R_{ik} p(C_k \, | \, \mathbf{x}) \end{align} $$
Goal: select $C_i$ to optimize $R(C_i \, | \, \mathbf{x})$ given $\mathbf{x}$
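As an illustration of the expected-value formula, here is a minimal sketch that treats $R_{ik}$ as a loss to be minimized. The loss matrix and posterior are hypothetical numbers for a two-class loan decision, and `expected_risks` is an illustrative helper name:

```python
def expected_risks(R, posterior):
    """Return R(C_i | x) = sum_k R[i][k] * p(C_k | x) for every choice i.

    R[i][k] is the loss for choosing class i when the true class is k.
    """
    return [
        sum(R_ik * p_k for R_ik, p_k in zip(row, posterior))
        for row in R
    ]


# Hypothetical numbers (not from the text):
# class 0 = customer will not pay back, class 1 = customer will pay back.
posterior = [0.3, 0.7]   # p(C = 0 | x), p(C = 1 | x)
R = [
    [0.0, 1.0],          # choose 0 (no loan): small loss if the customer would have paid
    [5.0, 0.0],          # choose 1 (loan): large loss if the customer defaults
]

risks = expected_risks(R, posterior)
best = min(range(len(risks)), key=risks.__getitem__)  # minimize expected loss

print(risks, best)
```

With these assumed losses the lower-risk choice is class 0, even though $p(C = 1 \, | \, \mathbf{x}) > 0.5$, which shows how asymmetric losses can shift the decision away from the most probable class.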
$ X \rightarrow Y $
Example: basket analysis for shopping, $X$ and $Y$ can be sets of item(s).
Support: $$ p(X, Y) $$ , how often $X$ and $Y$ occur together, a measure of the statistical significance of the rule
Confidence: $$ p(Y | X) $$ , how reliably $Y$ can be predicted from $X$
Lift: $$ \begin{align} \frac{p(X, Y)}{p(X)p(Y)} &= \frac{p(Y | X)}{p(Y)} \end{align} $$ , where a value $> 1$, $< 1$, or $= 1$ means that $X$ makes $Y$ more, less, or equally likely
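The three measures can be estimated from transaction counts. Below is a minimal sketch, assuming a tiny made-up list of shopping baskets; the items and the `prob` helper are illustrative only:

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def prob(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


X = {"bread"}
Y = {"milk"}

support = prob(X | Y)            # p(X, Y): both item sets in the same basket
confidence = support / prob(X)   # p(Y | X)
lift = confidence / prob(Y)      # p(Y | X) / p(Y)

print(support, confidence, lift)
```

For these baskets the lift is slightly below 1, so buying bread makes milk marginally less likely than its base rate.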
From training data $\mathbf{X}$ we want to estimate model parameters $\Theta$.
$$ \begin{align} p\left(\Theta | \mathbf{X}\right) &= \frac{p\left(\mathbf{X} | \Theta\right) p\left(\Theta\right)}{p\left(\mathbf{X}\right)} \\ &\propto p\left(\mathbf{X} | \Theta\right) p\left(\Theta\right) \end{align} $$
$p(\Theta | \mathbf{X})$: posterior
the probability of $\Theta$ after observing $\mathbf{X}$
$p(\Theta)$: prior
how likely $\Theta$ is before observing $\mathbf{X}$
$p(\mathbf{X} | \Theta)$: likelihood
how likely $\mathbf{X}$ is if the model parameters are $\Theta$
$p(\mathbf{X})$: marginal/evidence
constant for given $\mathbf{X}$
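If we don't have $p(\Theta)$, it can be assumed to be flat, $ p(\Theta) = 1 $, and MAP is then equivalent to ML: $$ \begin{align} \Theta_{MAP} &= \arg\max_{\Theta} p(\Theta | \mathbf{X}) \\ &= \arg\max_{\Theta} p(\mathbf{X} | \Theta) = \Theta_{ML} \end{align} $$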
Under the common iid (independent and identically distributed) assumption: $$ p(\mathbf{X} | \Theta) = \prod_{t=1}^{N} p(\mathbf{x}^{(t)} | \Theta) $$ , where $\{\mathbf{x}^{(t)}\}$ are the individual samples within $\mathbf{X}$. ($\mathbf{X}$ is a matrix and $\{\mathbf{x}^{(t)}\}$ are its individual columns.)
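In practice the product of many per-sample densities is usually evaluated as a sum of logarithms, because the raw product underflows quickly as $N$ grows. The sketch below assumes a 1-D Gaussian model with $\Theta = (\mu, \sigma)$ and a handful of made-up samples; the function names are illustrative:

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood(samples, mu, sigma):
    """p(X | Theta) as a product over iid samples."""
    return math.prod(gaussian_pdf(x, mu, sigma) for x in samples)


def log_likelihood(samples, mu, sigma):
    """log p(X | Theta) as a sum over iid samples (numerically safer)."""
    return sum(math.log(gaussian_pdf(x, mu, sigma)) for x in samples)


# Hypothetical samples for illustration.
samples = [0.5, -0.2, 0.1, 0.3, -0.4]

print(likelihood(samples, 0.0, 1.0))
print(log_likelihood(samples, 0.0, 1.0))
```

Maximizing the log-likelihood gives the same $\Theta$ as maximizing the likelihood, since the logarithm is monotonically increasing.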
The expected value of the posterior density: $$ \begin{align} \Theta_{Bayes} &= E(\Theta | \mathbf{X}) \\ &= \int \Theta p(\Theta | \mathbf{X}) d\Theta \end{align} $$
In the mean-squared-error sense, the best estimate of a random variable/vector is its mean.
Compute the sample mean and variance of the data $\mathbf{X}$: $$ \begin{align} \Theta_{ML} &= \mathbf{m} = \frac{\sum_{t=1}^N \mathbf{x}^{(t)}}{N} \\ \sigma_{ML}^2 &= s^2 = \frac{\sum_{t=1}^N \left(\mathbf{x}^{(t)} - \mathbf{m} \right)^2}{N} \end{align} $$
The $\mathbf{m}$ part holds regardless of $\sigma$ (constant or a variable to be optimized).
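The ML estimates above are just the sample mean and the biased (divide by $N$) sample variance. A minimal sketch for scalar data, assuming a few made-up samples:

```python
# Hypothetical 1-D samples for illustration.
samples = [2.1, 1.9, 2.4, 2.0, 1.6]
N = len(samples)

# ML estimate of the mean: the sample average.
m = sum(samples) / N

# ML estimate of the variance: divide by N (not N - 1).
s2 = sum((x - m) ** 2 for x in samples) / N

print(m, s2)
```

Note that dividing by $N$ gives the ML estimate; the unbiased sample variance would divide by $N - 1$ instead.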
It can be shown that $$ \Theta_{MAP} = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2} \mathbf{m} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2} \mu_0 $$ , i.e. a weighted average of the prior mean $\mu_0$ and the sample mean $\mathbf{m}$ of $\mathbf{X}$, with weights proportional to the inverse variances $1/\sigma_0^2$ and $N/\sigma^2$.
Note that if we don't know $p(\Theta)$, we can assume it is a constant distribution $p(\Theta) = 1$, i.e. $\sigma_0 = \infty$. This will give us $\Theta_{MAP} = \mathbf{m} = \Theta_{ML}$, as expected.
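For this Gaussian case the posterior is also Gaussian, so its mean coincides with its mode and $\Theta_{Bayes}$ is the same as $\Theta_{MAP}$.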
The math derivations are left as an exercise.
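The weighted-average form of $\Theta_{MAP}$ can be sketched as a small function for the scalar case. The function name `theta_map` and the numeric inputs are illustrative assumptions; the formula is the one stated above:

```python
def theta_map(m, N, sigma2, mu0, sigma0_2):
    """MAP estimate of a Gaussian mean under a Gaussian prior.

    m        : sample mean of the data
    N        : number of samples
    sigma2   : (known) data variance sigma^2
    mu0      : prior mean
    sigma0_2 : prior variance sigma_0^2
    """
    w_data = N / sigma2        # precision contributed by the data
    w_prior = 1.0 / sigma0_2   # precision contributed by the prior
    return (w_data * m + w_prior * mu0) / (w_data + w_prior)


# Hypothetical values for illustration.
informative = theta_map(m=2.0, N=5, sigma2=1.0, mu0=0.0, sigma0_2=1.0)
flat = theta_map(m=2.0, N=5, sigma2=1.0, mu0=0.0, sigma0_2=float("inf"))

print(informative, flat)
```

With an informative prior the estimate is pulled from the sample mean toward $\mu_0$; with an infinite prior variance the prior weight vanishes and the estimate equals the sample mean, matching the flat-prior case.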
Naive Bayes assumes the features are conditionally independent in the likelihood, i.e. for an $n$-dimensional data vector $\mathbf{x}$ $$ \begin{align} p(\mathbf{x} | \Theta) &= p(\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n | \Theta) \\ &= \prod_{k=1}^n p( \mathbf{x}_k | \Theta) \end{align} $$
We can generalize from a single data item $\mathbf{x}$ (a vector) to an entire data set $\mathbf{X}$ (a matrix whose columns are data items) by considering rows of $\mathbf{X}$ as features, i.e. $\mathbf{X}_{(k)}$ is the $k$-th feature across all data items.
Put the above into Bayes' rule: $$ \begin{align} p(\Theta | \mathbf{X}) &= \frac{p(\Theta) p\left(\mathbf{X} | \Theta\right)}{p(\mathbf{X})} \\ &= \frac{p(\Theta) \prod_{k=1}^n p\left(\mathbf{X}_{(k)} | \Theta\right)}{p(\mathbf{X})} \end{align} $$
The main merit of naive Bayes is that the estimation/computation of individual $ p\left(\mathbf{X}_{(k)} | \Theta\right) $ terms is easier/faster than the joint term $ p\left(\mathbf{X} | \Theta \right) $ .
This feature independence is just an assumption, but tends to work well in practice.
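To make the per-feature factorization concrete, here is a minimal from-scratch sketch of a Gaussian naive Bayes classifier. It is an illustration under stated assumptions, not the scikit-learn implementation: each feature is modeled by an independent 1-D Gaussian per class, data items are stored as rows (the usual Python convention, unlike the column convention above), and the toy data and function names are made up.

```python
import math


def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean/variance for every class."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n_features = len(rows[0])
        means = [sum(r[k] for r in rows) / len(rows) for k in range(n_features)]
        # Small constant added so a zero variance cannot cause division by zero.
        variances = [
            sum((r[k] - means[k]) ** 2 for r in rows) / len(rows) + 1e-9
            for k in range(n_features)
        ]
        model[c] = (len(rows) / len(X), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def predict(model, x):
    """Pick the class with the largest log prior + sum of per-feature log likelihoods."""
    scores = {}
    for c, (prior, means, variances) in model.items():
        scores[c] = math.log(prior) + sum(
            log_gaussian(xk, mk, vk) for xk, mk, vk in zip(x, means, variances)
        )
    return max(scores, key=scores.get)


# Hypothetical toy data: two well-separated classes with two features each.
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
y_toy = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X_toy, y_toy)
print(predict(model, [1.1, 2.1]), predict(model, [3.1, 0.4]))
```

Each feature only needs its own mean and variance per class, which is why the per-feature terms are cheaper to estimate than a full joint density.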
More details can be found under:
In [1]:
from sklearn import datasets
import numpy as np
# load the iris data set; keep only petal length and petal width (columns 2 and 3)
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
In [2]:
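# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# splitting data into 70% training and 30% test data:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)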
In [3]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# fit the scaler on the training split only, then apply the same scaling to both splits
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
In [4]:
from sklearn.metrics import accuracy_score
In [5]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
_ = gnb.fit(X_train_std, y_train)
In [6]:
y_pred = gnb.predict(X_test_std)
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))