In probability theory, the chain rule permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities. The rule is useful in the study of Bayesian networks, which describe a probability distribution in terms of conditional probabilities.
Consider an indexed collection of events $A_1,...,A_n$ and the joint probability $P(A_n,...,A_1)$. To find the value of this member of the joint distribution, we can apply the definition of conditional probability to obtain: $$P(A_n,...,A_1) = P(A_n|A_{n-1},...,A_1)P(A_{n-1},...,A_1)$$
Repeating this process with each final term creates the product: $$P(\bigcap^n_{k=1}A_k) = \prod_{k=1}^n P(A_k|\bigcap_{j=1}^{k-1} A_j) $$
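As a minimal sketch of how the chain rule composes a joint probability (all numbers below are hypothetical), in Python:

```python
# Chain rule: P(A1, A2, A3) = P(A1) * P(A2 | A1) * P(A3 | A1, A2).
# All probabilities below are made-up values for illustration only.
p_a1 = 0.6               # P(A1)
p_a2_given_a1 = 0.7      # P(A2 | A1)
p_a3_given_a1_a2 = 0.2   # P(A3 | A1, A2)

p_joint = p_a1 * p_a2_given_a1 * p_a3_given_a1_a2
print(p_joint)  # P(A1, A2, A3) ≈ 0.084
```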
Events $A$ and $B$ are defined to be statistically independent if $P(A,B)=P(A)P(B)$, which (for events of positive probability) is equivalent to: $$ P(A|B) = P(A) $$ $$ P(B|A) = P(B) $$ The notation for independence is $P \models A \perp B$, where $\models$ means "satisfies" and $\perp$ means "independent".
Similarly, for random variables $X$ and $Y$ with $P \models X \perp Y$, we have: $$ P(X,Y) = P(X)P(Y) $$ $$ P(X|Y) = P(X) $$ $$ P(Y|X) = P(Y) $$
For (sets of) random variables $X$, $Y$, $Z$, we write $P \models (X \perp Y | Z)$ if: $$ P(X,Y|Z) = P(X|Z)P(Y|Z)$$ $$ P(X|Y,Z) = P(X|Z) $$ $$ P(Y|X,Z) = P(Y|Z) $$ Conditional independence means that $X$ and $Y$ are independent once $Z$ is given; note that the statement need not hold when conditioning on variables other than $Z$.
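As a quick numeric check of the definition, here is a sketch using a toy joint distribution that is constructed, by assumption, so that $X \perp Y \mid Z$ holds:

```python
import itertools

# Toy distribution with made-up numbers, constructed so that
# P(x, y, z) = P(z) * P(x | z) * P(y | z), i.e. X and Y are
# conditionally independent given Z.
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}  # p_y_given_z[z][y]

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in itertools.product((0, 1), repeat=3)}

# Verify P(X, Y | Z) == P(X | Z) * P(Y | Z) for every assignment.
for x, y, z in itertools.product((0, 1), repeat=3):
    p_xy_given_z = joint[(x, y, z)] / p_z[z]
    assert abs(p_xy_given_z - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-12
print("X and Y are conditionally independent given Z")
```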
The conditional probabilities $P(A|B)$ and $P(B|A)$ both follow from the same joint: $P(A,B)=P(A|B)P(B) = P(B|A)P(A)$. In general, it cannot be assumed that $P(A|B) \approx P(B|A)$; rather,
$$ \frac {P(B|A)}{P(A|B)} = \frac {P(B)}{P(A)} $$ When $A$ is a binary variable, we can expand $P(B)$ as $P(B) = P(B|A)P(A) + P(B|\overline{A})P(\overline{A})$, so that: $$ P(A|B)=\frac{P(B|A)P(A)} {P(B|A)P(A) + P(B|\overline{A})P(\overline{A})}$$
Often, for some partition $\{A_j\}$ of the sample space, the event space is given or conceptualized in terms of $P(A_j)$ and $P(B|A_j)$. It is then useful to compute $P(B)$ using the law of total probability $P(B) = \sum_j P(B|A_j)P(A_j)$: $$ P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_j P(B|A_j)P(A_j)}$$
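A short worked sketch of Bayes' rule with the law of total probability for a binary $A$ (the prior and likelihood values are made up):

```python
# Bayes' rule for a binary event A, with P(B) expanded by the law of
# total probability. The prior and likelihoods are hypothetical values.
p_a = 0.01              # P(A)
p_b_given_a = 0.95      # P(B | A)
p_b_given_not_a = 0.05  # P(B | not A)

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)  # law of total probability
p_a_given_b = p_b_given_a * p_a / p_b                  # Bayes' rule
print(round(p_a_given_b, 4))  # ≈ 0.161
```

Note how far $P(A|B)$ ends up from $P(B|A)$ when the prior $P(A)$ is small.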
Definition: $X$ and $Y$ are d-separated in $G$ given $Z$ if there is no active trail in $G$ between $X$ and $Y$ given $Z$ (a trail is active only if every v-structure on it has its middle node or one of its descendants in $Z$, and no other node on the trail is in $Z$); notation: $\text{d-sep}_G(X,Y|Z)$.
Theorem: If $P$ factorizes over $G$ and $\text{d-sep}_G(X,Y|Z)$, then $P$ satisfies $(X \perp Y | Z)$.
For example, $ P(D,I,G,S,L) = P(D)P(I)P(G|D,I)P(S|I)P(L|G)$, so $$ P(D,S) = \sum_{G,L,I} P(D)P(I)P(G|D,I)P(S|I)P(L|G) $$ $$ P(D,S) = \sum_I P(D)P(I)P(S|I) \sum_G (P(G|D,I)\sum_L P(L|G)) $$
We have $1 = \sum_L P(L|G)$, $1 = \sum_G P(G|D,I)$, and $P(S) = \sum_I P(I)P(S|I)$, so
$$ P(D,S) = P(D)P(S) $$
Thus, $D$ and $S$ are independent.
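The same marginalization can be checked numerically. The CPTs below are hypothetical and every variable is made binary for brevity; only the factorization structure matters:

```python
import itertools

# Hypothetical CPTs for the network P(D, I, G, S, L) =
# P(D) P(I) P(G | D, I) P(S | I) P(L | G); all variables are binary here.
p_d = {0: 0.6, 1: 0.4}
p_i = {0: 0.7, 1: 0.3}
p_g_given_di = {(0, 0): {0: 0.5, 1: 0.5}, (0, 1): {0: 0.7, 1: 0.3},
                (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.6, 1: 0.4}}
p_s_given_i = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}
p_l_given_g = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

def joint(d, i, g, s, l):
    return (p_d[d] * p_i[i] * p_g_given_di[(d, i)][g]
            * p_s_given_i[i][s] * p_l_given_g[g][l])

# Sum out I, G, L and compare P(D, S) with P(D) * P(S).
p_s = {s: sum(p_i[i] * p_s_given_i[i][s] for i in (0, 1)) for s in (0, 1)}
for d, s in itertools.product((0, 1), repeat=2):
    p_ds = sum(joint(d, i, g, s, l)
               for i, g, l in itertools.product((0, 1), repeat=3))
    assert abs(p_ds - p_d[d] * p_s[s]) < 1e-12
print("P(D, S) = P(D) P(S) for every assignment")
```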
Any node is d-separated from its non-descendants given its parents.
If $P$ factorizes over $G$, then $G$ is an I-map for $P$; conversely, if $G$ is an I-map for $P$, then $P$ factorizes over $G$.
By the chain rule, the factorization can always be written as: $$P(D,I,G,S,L)=P(D)P(I|D)P(G|D,I)P(S|D,I,G)P(L|D,I,G,S)$$ Because $P(S|D,I,G) = P(S|I)$, $P(L|D,I,G,S) = P(L|G)$, and $P(I|D) = P(I)$, this reduces to $$P(D,I,G,S,L)=P(D)P(I)P(G|D,I)P(S|I)P(L|G)$$
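With the same hypothetical CPTs (repeated so the snippet runs on its own), the reduction $P(S|D,I,G) = P(S|I)$ can also be checked directly from the joint:

```python
import itertools

# Same hypothetical CPTs as in the previous sketch.
p_d = {0: 0.6, 1: 0.4}
p_i = {0: 0.7, 1: 0.3}
p_g_given_di = {(0, 0): {0: 0.5, 1: 0.5}, (0, 1): {0: 0.7, 1: 0.3},
                (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.6, 1: 0.4}}
p_s_given_i = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}
p_l_given_g = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

def joint(d, i, g, s, l):
    return (p_d[d] * p_i[i] * p_g_given_di[(d, i)][g]
            * p_s_given_i[i][s] * p_l_given_g[g][l])

# P(S | D, I, G) computed from the joint equals P(S | I).
for d, i, g, s in itertools.product((0, 1), repeat=4):
    p_sdig = sum(joint(d, i, g, s, l) for l in (0, 1))
    p_dig = sum(joint(d, i, g, s2, l)
                for s2, l in itertools.product((0, 1), repeat=2))
    assert abs(p_sdig / p_dig - p_s_given_i[i][s]) < 1e-12
print("P(S | D, I, G) = P(S | I) for every assignment")
```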
For a class variable $C_k$ with features $x_1,...,x_n$, the chain rule expands the joint probability as: $$P(C_k,x_1,...,x_n) = P(x_1|x_2,...,x_n,C_k)P(x_2|x_3,...,x_n,C_k)...P(x_{n-1}|x_n,C_k)P(x_n|C_k)P(C_k)$$
The "naive" conditional independence assumptions come into play: assume that each feature $x_i$ is conditionally independent of every other feature $x_j$ for $j \ne i$, given the category $C$. This means that $$P(x_i|x_{i+1},...,x_n,C_k)=P(x_i|C_k)$$
Thus, the joint model can be expressed as: $$P(C_k,x_1,...,x_n) = P(C_k)\prod_{i=1}^{n} P(x_i|C_k)$$ and the posterior is proportional to it: $$P(C_k|x_1,...,x_n) \propto P(C_k)\prod_{i=1}^{n} P(x_i|C_k)$$
The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori (MAP) decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label $y=C_k$ for some $k$ as follows: $$ y = \underset{k\in\{1,...,K\}}{\operatorname{argmax}} \; P(C_k)\prod_{i=1}^n P(x_i|C_k)$$
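A minimal naive Bayes MAP sketch over binary features; the class names, priors, and likelihoods are all hypothetical:

```python
import math

# Hypothetical two-class model with three binary features.
# likelihoods[c][i] is P(x_i = 1 | C = c); P(x_i = 0 | C = c) is 1 minus that.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": [0.8, 0.6, 0.1], "ham": [0.2, 0.3, 0.5]}

def map_class(x):
    """Return argmax_k of log P(C_k) + sum_i log P(x_i | C_k)."""
    def log_score(c):
        score = math.log(priors[c])
        for x_i, p in zip(x, likelihoods[c]):
            score += math.log(p if x_i == 1 else 1 - p)
        return score
    return max(priors, key=log_score)

print(map_class([1, 1, 0]))  # -> 'spam' with these made-up numbers
```

Summing log probabilities instead of multiplying raw probabilities is a common choice to avoid numerical underflow; the argmax is unchanged because log is monotone.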
Naive Bayes is surprisingly effective in domains with many weakly relevant features, but its strong independence assumptions hurt performance when many features are strongly correlated.