Bayesian estimation

A core concept in machine learning and related fields

Probabilistic views and concepts

Bayes classifiers

Conditional probability

$p(A \, | \, B)$: the probability of event $A$ given $B$

From basic probability: $$ \begin{align} p(A \, | \, B) &= \frac{p(A \cap B)}{p(B)} \end{align} $$ where $p(A \cap B)$ is the joint probability, i.e. the probability that $A$ and $B$ both happen.

Alternative notation: $$ \begin{align} p(A \cap B) &= p(A, B) \end{align} $$

$$ \begin{align} p(A \, | \, B) &= \frac{p(A, B)}{p(B)} \end{align} $$

Assume $p(B) > 0$ as otherwise the question is ill-defined.

Bayes' rule

Apply conditional probability to the numerator again: $$ \begin{align} p(A \, | \, B) &= \frac{p(A, B)}{p(B)} \\ &= \frac{p(B \, | \, A) p(A)}{p(B)} \end{align} $$

To help remember, consider symmetry: $$ \begin{align} p(A, B) &= p(A \, | \, B) p(B) \\ &= p(B \, | \, A) p(A) \end{align} $$
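
As a quick sanity check, here is a minimal numeric sketch (with made-up probabilities) showing that the definition of conditional probability and Bayes' rule give the same answer:

# Hypothetical probabilities for two events A and B
p_A = 0.3          # p(A)
p_B = 0.4          # p(B)
p_B_given_A = 0.5  # p(B | A)

# joint probability via the symmetry p(A, B) = p(B | A) p(A)
p_AB = p_B_given_A * p_A

# conditional probability from the definition vs. from Bayes' rule
p_A_given_B_def = p_AB / p_B
p_A_given_B_bayes = p_B_given_A * p_A / p_B
print(p_A_given_B_def, p_A_given_B_bayes)  # both 0.375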

Classification

$p(C \, | \, \mathbf{x})$

From observed data $\mathbf{x}$, we want to predict the probability of $C$, the class label.

We take a probabilistic view because the real world is often non-deterministic.

Iris example

$C$: type of flower

$\mathbf{x}$: flower features

(Images of the three iris species: Setosa, Versicolor, Virginica.)

Banking example

A bank decides whether to make a loan to a customer:

  • $ \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} $ : customer income $x_1$ and asset $x_2$

  • $C$: 0/1 if the customer is unlikely/likely to pay back the loan

Make a loan if $P(C = 1 \, | \, \mathbf{x}) > 0.5$ (or some other threshold value).

Bayes' rule

How do we compute $P(C \, | \, \mathbf{x})$, which is unknown?

From conditional probability: $$ \begin{align} p\left(C \, | \, \mathbf{x}\right) &= \frac{p\left(\mathbf{x} \, | \, C\right) p\left(C\right)}{p\left(\mathbf{x}\right)} \\ &\propto p\left(\mathbf{x} \, | \, C\right) p\left(C\right) \end{align} $$

  • $p(C \, | \, \mathbf{x})$: posterior
    how likely $C$ is after observing $\mathbf{x}$

  • $p(C)$: prior
    how likely $C$ is before observing $\mathbf{x}$

  • $p(\mathbf{x} \, | \, C)$: likelihood
    how likely $\mathbf{x}$ is if it belongs to $C$

  • $p(\mathbf{x})$: marginal/evidence
    constant for given $\mathbf{x}$

We can compute $P(C \, | \, \mathbf{x})$ (posterior), if given $p(C)$ (prior) and $p(\mathbf{x} \, | \, C)$ (likelihood)
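
A minimal sketch (hypothetical priors and likelihoods for two classes): the evidence $p(\mathbf{x})$ is obtained by summing the numerator over all classes, which normalizes the posterior.

import numpy as np

prior = np.array([0.7, 0.3])       # p(C=0), p(C=1)
likelihood = np.array([0.2, 0.6])  # p(x | C=0), p(x | C=1) for one observed x

unnormalized = likelihood * prior              # proportional to p(C | x)
posterior = unnormalized / unnormalized.sum()  # divide by p(x) = sum_k p(x | C_k) p(C_k)
print(posterior)                               # [0.4375 0.5625]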

Rational decision making

Humans tend to over-focus on rare events.

For example:

$\mathbf{x}$: get killed

$C$: cause of death

  • $C_{car}$: car crash
  • $C_{plane}$: airplane crash

A plane crash is much more deadly than a car crash: $ p(\mathbf{x} \, | \, C_{plane}) \gg p(\mathbf{x} \, | \, C_{car}) $

say $ \begin{align} p(\mathbf{x} \, | \, C_{plane}) &= 1.0 \\ p(\mathbf{x} \, | \, C_{car}) &= 0.1 \end{align} $

But a plane crash is much rarer than a car crash: $ p(C_{plane}) \ll p(C_{car}) $

say $ \begin{align} p(C_{plane}) &= 0.001 \\ p(C_{car}) &= 0.1 \end{align} $

Multiply together: $ \begin{align} p(\mathbf{x} \, | \, C_{plane}) p(C_{plane}) &= 0.001 \\ p(\mathbf{x} \, | \, C_{car}) p(C_{car}) &= 0.01 \end{align} $

$ \begin{align} \frac{p(C_{plane} \, | \, \mathbf{x})}{p(C_{car} \, | \, \mathbf{x})} &= \frac{p(\mathbf{x} \, | \, C_{plane}) p(C_{plane})}{p(\mathbf{x} \, | \, C_{car}) p(C_{car})} \end{align} $

Thus plane travel is actually safer than car travel: $ p(C_{plane} \, | \, \mathbf{x}) < p(C_{car} \, | \, \mathbf{x}) $
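
Plugging the illustrative numbers above into Bayes' rule (the evidence $p(\mathbf{x})$ cancels in the ratio):

# Illustrative values from the example above
p_x_given_plane, p_plane = 1.0, 0.001
p_x_given_car, p_car = 0.1, 0.1

ratio = (p_x_given_plane * p_plane) / (p_x_given_car * p_car)
print(ratio)  # 0.1: p(C_plane | x) is 10 times smaller than p(C_car | x)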

Breast cancer example

$C$: has cancer

$\overline{C}$: no cancer

$1/0$: positive/negative cancer screening result

$p(C) = 0.01$

$C$:

  • $p(1 | C) = 0.8$
  • $p(0 | C) = 0.2$

$\overline{C}$:

  • $p(1 | \overline{C}) = 0.1$
  • $p(0 | \overline{C}) = 0.9$

What is $p(C | 1)$?

https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/
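
For reference, a short sketch that works out the answer from the numbers above via Bayes' rule:

p_C = 0.01                # prior p(C): has cancer
p_not_C = 1 - p_C         # p(no cancer)
p_pos_given_C = 0.8       # p(1 | C)
p_pos_given_not_C = 0.1   # p(1 | no cancer)

# p(C | 1) = p(1 | C) p(C) / p(1), with p(1) from the law of total probability
p_pos = p_pos_given_C * p_C + p_pos_given_not_C * p_not_C
p_C_given_pos = p_pos_given_C * p_C / p_pos
print('%.3f' % p_C_given_pos)  # ~0.075: a positive screening still means cancer is unlikely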

Other applications

Value

$R(C_i\, | \mathbf{x})$: expected value (e.g. utility/loss/risk) for taking class $C_i$ given data $\mathbf{x}$

$R_{ik}$: value for taking class $C_i$ when the actual class is $C_k$

$$ \begin{align} R(C_i \, | \mathbf{x}) = \sum_{k} R_{ik} p(C_k \, | \, \mathbf{x}) \end{align} $$

Goal: select $C_i$ to optimize $R(C_i \, | \, \mathbf{x})$ given $\mathbf{x}$ (minimize if $R$ is a loss/risk, maximize if it is a utility).
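
A minimal sketch with hypothetical numbers, treating $R_{ik}$ as a loss so the best action minimizes the expected value:

import numpy as np

# R[i, k]: loss for choosing class C_i when the actual class is C_k (hypothetical)
R = np.array([[0.0, 10.0],   # choose C_0: costly if the true class is C_1
              [1.0, 0.0]])   # choose C_1: small loss if the true class is C_0
posterior = np.array([0.7, 0.3])  # p(C_0 | x), p(C_1 | x)

expected = R @ posterior          # R(C_i | x) = sum_k R_ik p(C_k | x)
best = np.argmin(expected)        # minimize a loss (use argmax for a utility)
print(expected, best)             # [3.  0.7] -> choose C_1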

Association

$ X \rightarrow Y $

  • $X$: antecedent
  • $Y$: consequent

Example: market basket analysis for shopping, where $X$ and $Y$ can be sets of item(s).

Support: $$ p(X, Y) $$ , the statistical significance of having $X$ and $Y$ together

Confidence: $$ p(Y | X) $$ , how likely $Y$ can be predicted from $X$

Lift: $$ \begin{align} \frac{p(X, Y)}{p(X)p(Y)} &= \frac{p(Y | X)}{p(Y)} \end{align} $$ , which is $> 1$, $< 1$, or $= 1$ when $X$ makes $Y$ more, less, or equally likely, respectively.
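
A minimal sketch (made-up transactions, hypothetical items) estimating support, confidence, and lift for the rule {bread} → {butter} from co-occurrence counts:

transactions = [
    {'bread', 'butter'},
    {'bread'},
    {'bread', 'butter', 'milk'},
    {'milk'},
]
N = len(transactions)

X, Y = {'bread'}, {'butter'}
p_X = sum(X <= t for t in transactions) / N          # p(X)
p_Y = sum(Y <= t for t in transactions) / N          # p(Y)
p_XY = sum((X | Y) <= t for t in transactions) / N   # support p(X, Y)

confidence = p_XY / p_X        # p(Y | X)
lift = p_XY / (p_X * p_Y)      # p(X, Y) / (p(X) p(Y))
print(p_XY, confidence, lift)  # 0.5, ~0.67, ~1.33: bread makes butter more likely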

Learning

From training data $\mathbf{X}$ we want to estimate model parameters $\Theta$.

$$ \begin{align} p\left(\Theta | \mathbf{X}\right) &= \frac{p\left(\mathbf{X} | \Theta\right) p\left(\Theta\right)}{p\left(\mathbf{X}\right)} \\ &\propto p\left(\mathbf{X} | \Theta\right) p\left(\Theta\right) \end{align} $$

  • $p(\Theta | \mathbf{X})$: posterior
    how likely $\Theta$ is after observing $\mathbf{X}$

  • $p(\Theta)$: prior
    how likely $\Theta$ is before observing $\mathbf{X}$

  • $p(\mathbf{X} | \Theta)$: likelihood
    how likely $\mathbf{X}$ is if the model parameters are $\Theta$

  • $p(\mathbf{X})$: marginal/evidence
    constant for given $\mathbf{X}$

MAP (maximum a posteriori) estimation

$$ \Theta_{MAP} = \arg\max_{\Theta} p(\Theta | \mathbf{X}) $$

If we don't have $p(\Theta)$, it can be assumed to be flat, $ p(\Theta) = 1 $, in which case MAP is equivalent to ML:

ML (maximum likelihood) estimation

$$ \Theta_{ML} = \arg\max_{\Theta} p(\mathbf{X} | \Theta) $$

Under the common iid (independent and identically distributed) assumption: $$ p(\mathbf{X} | \Theta) = \prod_{t=1}^{N} p(\mathbf{x}^{(t)} | \Theta) $$ where $\{\mathbf{x}^{(t)}\}$ are the individual samples within $\mathbf{X}$. ($\mathbf{X}$ is a matrix and the $\mathbf{x}^{(t)}$ are its individual columns.)
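
A minimal sketch of the iid factorization (assuming a 1-D Gaussian model with a fixed $\sigma$ and hypothetical samples): the log-likelihood becomes a sum of per-sample log densities, which is what ML estimation maximizes over $\Theta$.

import numpy as np

x = np.array([1.2, 0.8, 1.5, 0.9])  # hypothetical 1-D samples
theta, sigma = 1.0, 1.0             # candidate mean and fixed standard deviation

# log p(X | theta) = sum_t log p(x^(t) | theta) under the iid assumption
log_lik = np.sum(-0.5 * ((x - theta) / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma))
print(log_lik)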

Bayes estimator

The expected value of the posterior density: $$ \begin{align} \Theta_{Bayes} &= E(\Theta | \mathbf{X}) \\ &= \int \Theta p(\Theta | \mathbf{X}) d\Theta \end{align} $$

The best estimate of a random variable/vector, in the minimum mean-squared-error sense, is its mean.
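
A minimal sketch of the idea by brute-force numerical integration on a grid, using a stand-in unnormalized posterior (hypothetical; in practice $p(\Theta | \mathbf{X}) \propto p(\mathbf{X} | \Theta) p(\Theta)$):

import numpy as np

theta = np.linspace(-5, 5, 2001)
d = theta[1] - theta[0]

unnorm_post = np.exp(-0.5 * (theta - 1.0) ** 2)  # stand-in for p(X | theta) p(theta)
post = unnorm_post / (unnorm_post.sum() * d)     # normalize so it integrates to 1
theta_bayes = (theta * post).sum() * d           # E(theta | X), the posterior mean
print(theta_bayes)                               # ~1.0 for this symmetric example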

Gaussian example

$$ \begin{align} p(\mathbf{X} | \Theta) &= \frac{1}{(2\pi)^{N/2}\sigma^N} \exp\left(-\frac{\sum_{t=1}^N (\mathbf{x}^{(t)} - \Theta)^2}{2\sigma^2}\right) \\ p(\Theta) &= \frac{1}{\sqrt{2\pi}\sigma_0} \exp\left( -\frac{(\Theta - \mu_0)^2}{2\sigma_0^2} \right) \end{align} $$

ML (maximum likelihood) estimation

$$ \Theta_{ML} = \arg\max_{\Theta} p(\mathbf{X} | \Theta) $$

The ML estimates are the sample mean and variance of the data $\mathbf{X}$: $$ \begin{align} \Theta_{ML} &= \mathbf{m} = \frac{\sum_{t=1}^N \mathbf{x}^{(t)}}{N} \\ \sigma_{ML}^2 &= s^2 = \frac{\sum_{t=1}^N \left(\mathbf{x}^{(t)} - \mathbf{m} \right)^2}{N} \end{align} $$

The expression for $\mathbf{m}$ holds regardless of whether $\sigma$ is a known constant or a variable to be optimized.
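
A minimal numpy sketch of these ML estimates (hypothetical 1-D samples); note that numpy's default variance uses the same $1/N$ normalization as above:

import numpy as np

x = np.array([2.1, 1.9, 2.4, 1.6, 2.0])  # hypothetical samples

theta_ml = x.mean()         # m = sum_t x^(t) / N
sigma2_ml = x.var()         # s^2 = sum_t (x^(t) - m)^2 / N  (ddof=0, the ML estimate)
print(theta_ml, sigma2_ml)  # 2.0 and 0.068 (up to floating point)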

MAP estimation

$$ \begin{align} \Theta_{MAP} &= \arg\max_{\Theta} p(\Theta | \mathbf{X}) \\ &= \arg\max_{\Theta} p(\mathbf{X} | \Theta) p(\Theta) \end{align} $$

It can be shown that $$ \Theta_{MAP} = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2} \mathbf{m} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2} \mu_0 $$ , i.e. a weighted average of the prior mean $\mu_0$ and the sample mean $\mathbf{m}$ of $\mathbf{X}$, with weights inversely proportional to the corresponding variances.

Note that if we don't know $p(\Theta)$, we can assume it is a flat distribution $p(\Theta) = 1$, i.e. $\sigma_0 \to \infty$. This gives $\Theta_{MAP} = \mathbf{m} = \Theta_{ML}$, as expected.
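
A sketch plugging hypothetical numbers into the closed-form $\Theta_{MAP}$ above; the estimate is the sample mean $\mathbf{m}$ shrunk toward the prior mean $\mu_0$, and approaches $\mathbf{m}$ as $\sigma_0^2 \to \infty$:

import numpy as np

x = np.array([2.1, 1.9, 2.4, 1.6, 2.0])  # hypothetical samples
sigma2 = 1.0                             # known data variance sigma^2
mu0, sigma0_2 = 0.0, 4.0                 # prior mean and variance (hypothetical)

N = len(x)
m = x.mean()
w_data = (N / sigma2) / (N / sigma2 + 1 / sigma0_2)
w_prior = (1 / sigma0_2) / (N / sigma2 + 1 / sigma0_2)
theta_map = w_data * m + w_prior * mu0
print(theta_map)  # slightly below m = 2.0 because the prior mean is 0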

Bayes estimation

$$ \Theta_{Bayes} = E(\Theta | \mathbf{X}) = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2} \mathbf{m} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2} \mu_0 $$

, i.e. the same as $\Theta_{MAP}$ in this Gaussian case.

The math derivations are left as an exercise.

Naive Bayes

Naive Bayes assumes the features are conditionally independent in the likelihood, i.e. for an $n$-dimensional data vector $\mathbf{x}$: $$ \begin{align} p(\mathbf{x} | \Theta) &= p(x_1, x_2, \cdots, x_n | \Theta) \\ &= \prod_{k=1}^n p( x_k | \Theta) \end{align} $$

We can generalize from a single data item $\mathbf{x}$ (a vector) to an entire data set $\mathbf{X}$ (a matrix whose columns are data items) by considering the rows of $\mathbf{X}$ as features, denoted $\mathbf{X}_{(k)}$.

Put the above into our Bayesian rule: $$ \begin{align} p(\Theta | \mathbf{X}) &= \frac{p(\Theta) p\left(\mathbf{X} | \Theta\right)}{p(\mathbf{X})} \\ &= \frac{p(\Theta) \prod_{k=1}^n p\left(\mathbf{X}_{(k)} | \Theta\right)}{p(\mathbf{X})} \end{align} $$

The main merit of naive Bayes is that the estimation/computation of individual $ p\left(\mathbf{X}_{(k)} | \Theta\right) $ terms is easier/faster than the joint term $ p\left(\mathbf{X} | \Theta \right) $ .

This feature independence is just an assumption, but tends to work well in practice.
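
A minimal sketch of the factorization for a single class (assumed per-feature Gaussian likelihoods with hypothetical parameters); this is essentially what Gaussian naive Bayes does per class in the code example below:

import numpy as np

x = np.array([1.4, 0.2])       # one 2-D sample (hypothetical)
mu = np.array([1.5, 0.25])     # per-feature means for class C (hypothetical)
var = np.array([0.03, 0.01])   # per-feature variances for class C (hypothetical)

# p(x | C) = prod_k p(x_k | C), each factor a 1-D Gaussian density
per_feature = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
likelihood = per_feature.prod()
print(likelihood)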

More details can be found under:

Code example

scikit-learn supports naive Bayes with different likelihood models, such as Gaussian, Bernoulli, and multinomial.


In [1]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()

X = iris.data[:, [2, 3]]  # use only petal length and petal width
y = iris.target

In [2]:
from sklearn.model_selection import train_test_split

# splitting data into 70% training and 30% test data: 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

In [3]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

In [4]:
from sklearn.metrics import accuracy_score

In [5]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

_ = gnb.fit(X_train_std, y_train)

In [6]:
y_pred = gnb.predict(X_test_std)

print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))


Accuracy: 0.98