Conditional Probability

Basic Concepts

Definition

$$ P(A|B) = \frac {P(A,B)}{P(B)} $$

When $P(B) \ne 0$ for every value of $B$, the law of total probability and the normalization of conditional probabilities follow directly from the definition: $$P(A) = \sum_B P(A|B)P(B) = \sum_B P(A,B)$$ $$ 1 = \sum_A P(A|B) = \sum_A \frac {P(A,B)}{P(B)} = \frac {\sum_A P(A,B)}{P(B)} = \frac {P(B)}{P(B)}$$
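
As a minimal numerical sketch, these identities can be checked on a small discrete joint distribution (the table below is made up purely for illustration):

```python
import numpy as np

# Hypothetical joint distribution P(A, B) over two binary variables;
# rows index A, columns index B. Values are invented for illustration.
P_AB = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

P_B = P_AB.sum(axis=0)        # marginal P(B)
P_A_given_B = P_AB / P_B      # P(A|B) = P(A,B) / P(B), column by column

# Law of total probability: P(A) = sum_B P(A|B) P(B) = sum_B P(A,B)
P_A = (P_A_given_B * P_B).sum(axis=1)
assert np.allclose(P_A, P_AB.sum(axis=1))

# Conditional probabilities normalize: sum_A P(A|B) = 1 for each B
assert np.allclose(P_A_given_B.sum(axis=0), 1.0)
print(P_A_given_B)
```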

Chain Rule

In probability theory, the chain rule permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities. The rule is useful in the study of Bayesian networks, which describe a probability distribution in terms of conditional probabilities.

Consider an indexed collection of events $A_1,...,A_n$. To find the value of the joint distribution $P(A_n,...,A_1)$, we can apply the definition of conditional probability to obtain: $$P(A_n,...,A_1) = P(A_n|A_{n-1},...,A_1)P(A_{n-1},...,A_1)$$

Repeating this process on each remaining joint term yields the product: $$P(\bigcap^n_{k=1}A_k) = \prod_{k=1}^n P(A_k|\bigcap_{j=1}^{k-1} A_j) $$
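
A short sketch of the chain rule in practice, assuming an arbitrary joint distribution over three binary variables: the conditional factors are read off the joint, and their product reconstructs it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary (randomly generated) joint distribution P(A1, A2, A3).
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Chain-rule factors: P(A1), P(A2|A1), P(A3|A1,A2)
P_A1 = joint.sum(axis=(1, 2))
P_A2_given_A1 = joint.sum(axis=2) / P_A1[:, None]
P_A3_given_A12 = joint / joint.sum(axis=2, keepdims=True)

# The product of the factors reconstructs the joint distribution.
reconstructed = (P_A1[:, None, None]
                 * P_A2_given_A1[:, :, None]
                 * P_A3_given_A12)
assert np.allclose(reconstructed, joint)
```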

Factorization

Factorizing the joint probability along a trail, for example: $$ P(A,B,C) = P(A|B,C)P(B|C)P(C) $$ This factorization corresponds to the Bayesian network with edges:
$$ C \to B \to A $$ $$ C \to A $$
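
A minimal sketch of this factorization, with made-up conditional probability tables for the network above; the product of the local factors defines the joint $P(A,B,C)$:

```python
import numpy as np

# Hypothetical CPTs for the network C -> B -> A with the extra edge C -> A.
P_C = np.array([0.6, 0.4])                  # P(C)
P_B_given_C = np.array([[0.7, 0.2],         # P(B|C): rows index B, cols index C
                        [0.3, 0.8]])
P_A_given_BC = np.array([[[0.9, 0.5],       # P(A|B,C): axes are A, B, C
                          [0.4, 0.1]],
                         [[0.1, 0.5],
                          [0.6, 0.9]]])

# P(A,B,C) = P(A|B,C) P(B|C) P(C), indexed [a, b, c]
joint = P_A_given_BC * P_B_given_C[None, :, :] * P_C[None, None, :]
assert np.isclose(joint.sum(), 1.0)
```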

Statistical independence

Events $A$ and $B$ are defined to be statistically independent if $ P(A,B)=P(A)P(B) $, so we have: $$ P(A|B) = P(A) $$ $$ P(B|A) = P(B) $$ The notation for independence is $ P \models A \perp B$, where $\models$ means "satisfies" and $\perp$ means "independent".

Similarly, for random variables $X$ and $Y$ with $ P \models X \perp Y$, we have: $$ P(X,Y) = P(X)P(Y) $$ $$ P(X|Y) = P(X) $$ $$ P(Y|X) = P(Y) $$
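
A quick numerical sketch, with made-up marginals: building the joint as an outer product enforces $P(X,Y)=P(X)P(Y)$, and conditioning then leaves the marginal unchanged.

```python
import numpy as np

# Hypothetical marginals; the outer product is exactly P(X,Y) = P(X)P(Y).
P_X = np.array([0.3, 0.7])
P_Y = np.array([0.5, 0.2, 0.3])
P_XY = np.outer(P_X, P_Y)

# Conditioning changes nothing: P(X|Y) = P(X) for every value of Y.
P_X_given_Y = P_XY / P_XY.sum(axis=0)
assert np.allclose(P_X_given_Y, P_X[:, None])
```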

Conditional Independence

For (sets of) random variables $X$,$Y$,$Z$, $P \models (X \perp Y | Z)$ if: $$ P(X,Y|Z) = P(X|Z)P(Y|Z)$$ $$ P(X|Y,Z) = P(X|Z) $$ $$ P(Y|X,Z) = P(Y|Z) $$ Conditional independence means that $X$ and $Y$ are independent once $Z$ is given; note that this need not hold when conditioning on variables other than $Z$.
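
A sketch with arbitrary tables: constructing the joint as $P(X|Z)P(Y|Z)P(Z)$ makes $X \perp Y \mid Z$ hold by design, which can then be verified from the joint itself.

```python
import numpy as np

# Hypothetical factors; axes of the joint are [x, y, z].
P_Z = np.array([0.4, 0.6])
P_X_given_Z = np.array([[0.8, 0.3],
                        [0.2, 0.7]])      # rows index X, cols index Z
P_Y_given_Z = np.array([[0.1, 0.9],
                        [0.9, 0.1]])      # rows index Y, cols index Z

joint = (P_X_given_Z[:, None, :]
         * P_Y_given_Z[None, :, :]
         * P_Z[None, None, :])            # P(X,Y,Z)

# Check P(X,Y|Z) = P(X|Z) P(Y|Z) for every value of Z.
P_XY_given_Z = joint / joint.sum(axis=(0, 1), keepdims=True)
assert np.allclose(P_XY_given_Z,
                   P_X_given_Z[:, None, :] * P_Y_given_Z[None, :, :])
```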

Bayes' theorem

$$ P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$

Both directions follow from the form $ P(A,B)=P(A|B)P(B) = P(B|A)P(A)$. In general, it cannot be assumed that $P(A|B) \approx P(B|A)$:

$$ \frac {P(B|A)}{P(A|B)} = \frac {P(B)}{P(A)} $$

When $A$ is a binary variable, $P(B)$ can be expanded as $ P(B) = P(B|A)P(A) + P(B|\overline{A})P(\overline{A})$, so we have: $$ P(A|B)=\frac{P(B|A)P(A)} {P(B|A)P(A) + P(B|\overline{A})P(\overline{A})}$$

Often, for some partition $\{A_j\}$ of the sample space, the event space is given or conceptualized in terms of $P(A_j)$ and $P(B|A_j)$. It is then useful to compute $P(B)$ using the law of total probability $P(B) = \sum_j P(B|A_j)P(A_j)$: $$ P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_j P(B|A_j)P(A_j)}$$
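
A worked sketch of this form of Bayes' theorem; the prior and likelihoods below are invented numbers (think of a rare condition $A$ and a noisy test $B$), not from the text:

```python
# Hypothetical numbers: prior P(A), likelihood P(B|A), false-positive rate P(B|~A).
p_A = 0.01
p_B_given_A = 0.95
p_B_given_notA = 0.05

# Law of total probability for the denominator.
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 4))  # ~0.161: a positive result still leaves P(A|B) small
```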

Multi-conditional probability

$$ P(Y|X_1,X_2) = \frac {P(Y,X_1,X_2)} {P(X_1,X_2)}$$$$ P(Y_1,Y_2|X) = \frac {P(Y_1,Y_2,X)} {P(X)} $$

Influence Flow

$$ P(X_1|X_2) = \frac {P(X_1,X_2)} {P(X_2)} = \frac{P(Y,X_1,X_2)} {P(Y|X_1,X_2)P(X_2)}$$

Here $X_1$ and $X_2$ are marginally independent parents of $Y$ (a v-structure $X_1 \to Y \leftarrow X_2$). Conditioning on $Y$ activates the v-structure and makes them conditionally dependent: $ P(X_1|X_2,Y) \ne P(X_1|Y) $ even though $ P(X_1|X_2) = P(X_1) $.
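
A concrete sketch of this "explaining away" effect, using two hypothetical independent fair coins $X_1$, $X_2$ and $Y = X_1 \oplus X_2$ (XOR), which forms the v-structure $X_1 \to Y \leftarrow X_2$:

```python
import numpy as np

# Joint P(X1, X2, Y) with X1, X2 independent fair coins and Y = X1 XOR X2.
joint = np.zeros((2, 2, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        joint[x1, x2, x1 ^ x2] = 0.25

# Marginally independent: P(X1 | X2=0) = P(X1) = [0.5, 0.5]
P_X1X2 = joint.sum(axis=2)
print(P_X1X2[:, 0] / P_X1X2[:, 0].sum())

# Conditioning on Y activates the v-structure: P(X1 | X2, Y) != P(X1 | Y)
P_X1_given_X2_Y = joint[:, 0, 0] / joint[:, 0, 0].sum()              # [1.0, 0.0]
P_X1_given_Y = joint.sum(axis=1)[:, 0] / joint.sum(axis=1)[:, 0].sum()  # [0.5, 0.5]
print(P_X1_given_X2_Y, P_X1_given_Y)
```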

D-separation

Definition: $X$ and $Y$ are d-separated in $G$ given $Z$ if there is no active trail in $G$ between $X$ and $Y$ given $Z$ (for example, one activated through an observed v-structure); notation: $d\text{-}sep_G(X,Y|Z)$.

Theorem: If $P$ factorizes over $G$ and $d\text{-}sep_G(X,Y|Z)$, then $P$ satisfies $(X \perp Y |Z)$.

For example, $ P(D,I,G,S,L) = P(D)P(I)P(G|D,I)P(S|I)P(L|G)$, so $$ P(D,S) = \sum_{G,L,I} P(D)P(I)P(G|D,I)P(S|I)P(L|G) $$ $$ P(D,S) = \sum_I P(D)P(I)P(S|I) \sum_G (P(G|D,I)\sum_L P(L|G)) $$

we have $1 = \sum_L P(L|G)$, $1 = \sum_G P(G|D,I)$, and $P(S) = \sum_I P(I)P(S|I)$, so

$$ P(D,S) = P(D)P(S) $$

Thus, $D$ and $S$ are independent.
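
A sketch of this marginalization with randomly generated (made-up) binary CPTs matching the factorization above; any valid CPTs lead to the same conclusion:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_cpt(shape):
    """Random CPT normalized over its first axis (the child variable)."""
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

# Made-up binary CPTs for P(D,I,G,S,L) = P(D)P(I)P(G|D,I)P(S|I)P(L|G).
P_D = random_cpt((2,))
P_I = random_cpt((2,))
P_G_DI = random_cpt((2, 2, 2))   # axes: G, D, I
P_S_I = random_cpt((2, 2))       # axes: S, I
P_L_G = random_cpt((2, 2))       # axes: L, G

# Joint over (D, I, G, S, L) via broadcasting, then marginalize to P(D, S).
joint = (P_D[:, None, None, None, None]
         * P_I[None, :, None, None, None]
         * P_G_DI.transpose(1, 2, 0)[:, :, :, None, None]
         * P_S_I.T[None, :, None, :, None]
         * P_L_G.T[None, None, :, None, :])
P_DS = joint.sum(axis=(1, 2, 4))

# The only trail from D to S passes through the unobserved v-structure at G,
# so D and S are d-separated given {} and the joint factorizes: P(D,S) = P(D)P(S).
assert np.allclose(P_DS, np.outer(P_D, P_S_I @ P_I))
```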

Any node is d-separated from its non-descendants given its parents.

I-maps

  • Definition: the d-separations in $G$ define the independence statements $I(G)=\{(X \perp Y|Z) : d\text{-}sep_G(X,Y|Z)\}$.
  • Definition: If $P$ satisfies $I(G)$, we say that $G$ is an I-map (independency map) of $P$.

If $P$ factorizes over $G$, then $G$ is an I-map for $P$; conversely, if $G$ is an I-map for $P$, then $P$ factorizes over $G$.

According to the chain rule, the full factorization can be written as: $$P(D,I,G,S,L)=P(D)P(I|D)P(G|D,I)P(S|D,I,G)P(L|D,I,G,S)$$ Because $P(S|D,I,G) = P(S|I)$, $P(L|D,I,G,S) = P(L|G)$, and $P(I|D) = P(I)$, this simplifies to: $$P(D,I,G,S,L)=P(D)P(I)P(G|D,I)P(S|I)P(L|G)$$

Bayesian Network

Naive Bayes Classifier

For a class variable $C_k$ (one of the classes $\{C_k\}$) with features $x_1,...,x_n$, the chain rule expands the joint probability as: $$P(C_k,x_1,...,x_n) = P(x_1|x_2,...,x_n,C_k)P(x_2|x_3,...,x_n,C_k)...P(x_{n-1}|x_n,C_k)P(x_n|C_k)P(C_k)$$

The "naive" conditional independence assumptions come into play: assume that each feature $x_i$ is conditionally independent of every other feature $x_j$ for $j \ne i$, given the category $C$. This means that $$P(x_i|x_{i+1},...,x_n,C_k)=P(x_i|C_k)$$

Thus, the joint model can be expressed as: $$P(C_k,x_1,...,x_n) = P(C_k)\prod_{i=1}^{n} P(x_i|C_k)$$ and the posterior is proportional to it: $P(C_k|x_1,...,x_n) \propto P(C_k)\prod_{i=1}^{n} P(x_i|C_k)$.

Constructing a classifier from the probability model

The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label $y=C_k$ for some $k$ as follows: $$ y = \underset{k\in\{1,...,K\}}{\operatorname{argmax}} P(C_k)\prod_{i=1}^n P(x_i|C_k)$$
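
A minimal sketch of the MAP rule, assuming the prior $P(C_k)$ and the per-class likelihoods $P(x_i|C_k)$ have already been estimated; the tables below (two classes, three binary features) are invented, and log-probabilities are used to avoid underflow:

```python
import numpy as np

# Hypothetical estimates: 2 classes, 3 binary features.
prior = np.array([0.6, 0.4])                 # P(C_k)
# likelihood[k, i] = P(x_i = 1 | C_k); P(x_i = 0 | C_k) = 1 - likelihood[k, i]
likelihood = np.array([[0.9, 0.2, 0.4],
                       [0.3, 0.8, 0.7]])

def naive_bayes_map(x):
    """Return argmax_k P(C_k) * prod_i P(x_i | C_k), computed in log space."""
    log_post = np.log(prior)
    log_post += np.log(np.where(x == 1, likelihood, 1 - likelihood)).sum(axis=1)
    return int(np.argmax(log_post))

print(naive_bayes_map(np.array([1, 0, 0])))  # -> 0
print(naive_bayes_map(np.array([0, 1, 1])))  # -> 1
```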

It is surprisingly effective in domains with many weakly relevant features, but the strong independence assumptions reduce performance when many features are strongly correlated.

  • Bernoulli Naive Bayes
  • Multinomial Naive Bayes
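
For reference, both variants listed above are available in scikit-learn (assuming it is installed); the data below is random and purely illustrative:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

rng = np.random.default_rng(0)
X_binary = rng.integers(0, 2, size=(100, 5))   # binary features -> BernoulliNB
X_counts = rng.integers(0, 10, size=(100, 5))  # count features  -> MultinomialNB
y = rng.integers(0, 2, size=100)

print(BernoulliNB().fit(X_binary, y).predict(X_binary[:3]))
print(MultinomialNB().fit(X_counts, y).predict(X_counts[:3]))
```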