Probabilistic graphical models are an elegant framework which combines uncertainty (probabilities) and logical structure (independence constraints) to compactly represent complex, real-world phenomena.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.style.use('ggplot')
Let us assume that the random variables take the following values:
In [2]:
difficulty = ['Easy', 'Hard']
intelligence = ['Avg', 'Below Avg', 'Above Avg']
grade = ['A', 'B', 'C']
sat = ['Good', 'Bad']
letter = ['Strong', 'Lukewarm']
In [3]:
from itertools import product
rows = list(product(difficulty, intelligence, grade, sat, letter))
rows_df = pd.DataFrame(rows, columns=['difficulty', 'intelligence', 'grade', 'sat', 'letter'])
probs = np.random.uniform(0, 1, len(rows_df))
rows_df = rows_df.assign(p=probs/probs.sum())
In [4]:
rows_df.head()
Out[4]:
In [5]:
rows_df.shape
Out[5]:
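To see why the full joint quickly becomes unwieldy, note that even simple queries require summing over the whole table. A minimal brute-force sketch (self-contained, rebuilt here with a fixed seed so it is reproducible; the seed is my own choice) answers a marginal and a conditional query directly from the 72-row table:

```python
# Brute-force inference from the full joint table. Every query requires
# summing over rows of the table -- feasible for 72 rows, hopeless when
# the number of variables grows.
from itertools import product

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for reproducibility

difficulty = ['Easy', 'Hard']
intelligence = ['Avg', 'Below Avg', 'Above Avg']
grade = ['A', 'B', 'C']
sat = ['Good', 'Bad']
letter = ['Strong', 'Lukewarm']

rows = list(product(difficulty, intelligence, grade, sat, letter))
joint = pd.DataFrame(rows, columns=['difficulty', 'intelligence', 'grade', 'sat', 'letter'])
p = rng.uniform(0, 1, len(joint))
joint['p'] = p / p.sum()

# Marginal P(grade): sum out all the other variables.
p_grade = joint.groupby('grade')['p'].sum()

# Conditional P(grade | difficulty = 'Hard'): filter, then renormalize.
hard = joint[joint['difficulty'] == 'Hard']
p_grade_given_hard = hard.groupby('grade')['p'].sum()
p_grade_given_hard = p_grade_given_hard / p_grade_given_hard.sum()

print(p_grade)
print(p_grade_given_hard)
```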
For most practical applications, the joint probability distribution is over a combinatorial space: the table above already has 2·3·3·2·2 = 72 rows, and the number of entries grows exponentially in the number of variables. This makes both learning and inference intractable.
In general however, we're not so lucky. Encoding the dependency structure in a graph allows us to take advantage of the efficient computation techniques from graph theory and apply them in solving the learning and inference problems.
If the dependencies are causal, then the representation is a directed graph (usually DAG). This is called a Belief Network or a Bayesian Network.
If the dependencies signify relatedness but not causality, then the representation is an undirected graph. This is called a Markov Network.
A PGM can have two types of nodes: observed and latent.
An observed node represents a random variable whose value can be measured, e.g. Grade.
A latent node represents a random variable whose values cannot be seen or measured. Instead, we learn the variable (i.e. its probability distribution) from data. For a known structure, this is the learning problem.
For example, an HMM (a type of Bayesian Network) for a PoS tagger represents the words in a sentence by observed nodes and the corresponding PoS tags by latent nodes.
Active trails determine which nodes in a Bayesian Network are able to influence each other.
Let X, Y and Z be three nodes in a Bayesian Network. Then we have the following relationship cases:
Indirect Causal Effect: $X \rightarrow Y \rightarrow Z$ => $X \perp Z \mid Y$
Indirect Evidential Effect: $X \leftarrow Y \leftarrow Z$ => $X \perp Z \mid Y$
Common Cause: $X \leftarrow Y \rightarrow Z$ => $X \perp Z \mid Y$
Common Effect: $X \rightarrow Y \leftarrow Z$ => $X \perp Z$ if neither $Y$ nor any of its descendants is observed.
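The common-effect case ("explaining away") is the counter-intuitive one, and it can be checked empirically. A small simulation sketch of my own, using a deterministic OR gate as the common effect: X and Z are independent a priori, but become dependent once Y is observed.

```python
# Simulate the v-structure X -> Y <- Z with Y = X OR Z.
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
x = rng.random(n) < 0.5          # X ~ Bernoulli(0.5)
z = rng.random(n) < 0.5          # Z ~ Bernoulli(0.5), independent of X
y = x | z                        # common effect

# Marginally, knowing Z tells us nothing about X.
p_x_given_z1 = x[z].mean()
p_x_given_z0 = x[~z].mean()

# Conditioned on Y = 1, learning Z = 0 forces X = 1; learning Z = 1
# "explains away" the evidence for X.
p_x_given_y1_z1 = x[y & z].mean()
p_x_given_y1_z0 = x[y & ~z].mean()

print(p_x_given_z1, p_x_given_z0)          # both ~0.5
print(p_x_given_y1_z1, p_x_given_y1_z0)    # ~0.5 vs exactly 1.0
```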
An active trail generalizes the above idea. Formally, an active trail is a path through a Bayesian Network such that for any 3 consecutive nodes $X_{i-1}, X_i, X_{i+1}$, none of the above independence relationships holds. In other words, an active trail is a path along which probabilistic influence can flow from end to end.
Let X, Y and Z be three sets of nodes in a Bayesian Network. X and Y are d-separated if there are no active trails between them when $Z$ is observed.
d-Separation reduces the question of statistical independence to a question of graph separation. If P factorizes over G and $d-sep_G(X, Y \mid Z)$, then P satisfies $X \perp Y \mid Z$.
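The graph-separation test itself is easy to implement from scratch. A sketch (illustrative names, not a library routine) using the standard reduction: X and Y are d-separated given Z iff they are disconnected after restricting to the ancestors of X ∪ Y ∪ Z, moralizing, and deleting Z.

```python
# d-separation via the moralized ancestral graph.
from collections import deque

def d_separated(parents, X, Y, Z):
    """parents: dict mapping each node to the set of its parents."""
    X, Y, Z = set(X), set(Y), set(Z)

    # 1. Ancestral closure of X, Y and Z.
    keep, stack = set(), list(X | Y | Z)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, ()))

    # 2. Moralize the restricted graph: connect each node to its
    #    parents and marry every pair of co-parents.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = sorted(parents.get(v, ()))
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q); adj[q].add(p)

    # 3. Delete the observed nodes Z, then test reachability by BFS.
    seen, queue = set(X - Z), deque(X - Z)
    while queue:
        v = queue.popleft()
        if v in Y:
            return False
        for w in adj[v] - Z:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return True

# Student network: D -> G <- I, I -> S, G -> L.
student = {'D': set(), 'I': set(), 'G': {'D', 'I'}, 'S': {'I'}, 'L': {'G'}}
print(d_separated(student, {'D'}, {'S'}, set()))    # True: v-structure at G blocks the trail
print(d_separated(student, {'D'}, {'S'}, {'G'}))    # False: observing G activates D -> G <- I -> S
print(d_separated(student, {'D'}, {'L'}, {'G'}))    # True: observing G blocks D -> G -> L
```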
Theorem: In a Bayesian Network $(P, G)$, each node is independent of its non-descendants given its parents.
Which leads to:
Chain Rule for Bayesian Network: In a Bayesian network $(P, G)$,
$$P(x_1, x_2, \cdots x_k) = \prod_{i=1}^k P(x_i \mid Pa_{G}(x_i))$$

In this case, we say P factorizes over G.
Applying this to the student graph,
$P(D, I, G, S, L) = P(D)P(I)P(G \mid D, I)P(S \mid I)P(L \mid G)$
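This factorization can be checked numerically. A sketch with made-up CPD numbers for the student network (the probabilities below are illustrative, not from any dataset): five small CPDs define all 72 joint probabilities, and the chain-rule product is automatically a normalized distribution.

```python
# The student joint from its CPDs via the chain rule:
# P(D, I, G, S, L) = P(D) P(I) P(G | D, I) P(S | I) P(L | G).
from itertools import product

difficulties = ['Easy', 'Hard']
intels = ['Below Avg', 'Avg', 'Above Avg']
grades = ['A', 'B', 'C']
sats = ['Good', 'Bad']
letters = ['Strong', 'Lukewarm']

P_D = {'Easy': 0.6, 'Hard': 0.4}
P_I = {'Below Avg': 0.3, 'Avg': 0.5, 'Above Avg': 0.2}
# P(G | D, I): one distribution over grades per (difficulty, intelligence).
P_G = {
    ('Easy', 'Below Avg'): {'A': 0.30, 'B': 0.40, 'C': 0.30},
    ('Easy', 'Avg'):       {'A': 0.50, 'B': 0.40, 'C': 0.10},
    ('Easy', 'Above Avg'): {'A': 0.90, 'B': 0.08, 'C': 0.02},
    ('Hard', 'Below Avg'): {'A': 0.05, 'B': 0.25, 'C': 0.70},
    ('Hard', 'Avg'):       {'A': 0.20, 'B': 0.50, 'C': 0.30},
    ('Hard', 'Above Avg'): {'A': 0.50, 'B': 0.30, 'C': 0.20},
}
P_S = {'Below Avg': {'Good': 0.05, 'Bad': 0.95},
       'Avg':       {'Good': 0.50, 'Bad': 0.50},
       'Above Avg': {'Good': 0.80, 'Bad': 0.20}}
P_L = {'A': {'Strong': 0.90, 'Lukewarm': 0.10},
       'B': {'Strong': 0.60, 'Lukewarm': 0.40},
       'C': {'Strong': 0.01, 'Lukewarm': 0.99}}

def joint(d, i, g, s, l):
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * P_S[i][s] * P_L[g][l]

# The 72 joint entries sum to 1 by construction -- no global
# normalization step is needed.
total = sum(joint(d, i, g, s, l)
            for d, i, g, s, l in product(difficulties, intels, grades, sats, letters))
print(total)  # 1.0 (up to floating point)
```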
Every probability distribution has at least one graph representation: the fully connected graph, which encodes no independence assumptions at all, is a (trivial) I-map for any distribution.
The set of all independencies in P are: $I(P) = \{(X \perp Y \mid Z) : P \models (X \perp Y \mid Z)\}$.
Theorem: If P factorizes over G, then G is an I-map of P.
The converse is also true.
Theorem: If G is an I-map of P, then P factorizes over G. (The analogous converse for Markov Networks holds for positive distributions and is known as the Hammersley-Clifford theorem.)
Note that $I(G) \subseteq I(P)$, i.e. P can satisfy independencies that are not encoded in G.
Usually we are interested in the sparsest representation possible.
G is a minimal I-map of P if removing any edge from G introduces independencies that are not in P.
G is a perfect I-map of P if I(G) = I(P). Not all distributions have a perfect I-map.
Markov Networks don't have a notion of d-separation; there is only (plain graph) separation.
Let X, Y and Z be pairwise disjoint sets of nodes in a Markov Network (P, H). Then X and Y are separated given Z if there are no active trails between X and Y in H when Z is observed.
For a Markov Network, an active trail is a path through the network along which no nodes are observed.
Representation Theorem for Markov Networks: If P factorizes over H and $sep_H(X, Y \mid Z)$, then P satisfies $(X \perp Y \mid Z)$.
Theorem: If P factorizes over H, H is an I-map for P.
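Because there are no v-structures to worry about, separation in a Markov Network is plain graph reachability. A sketch (illustrative, not a library routine): delete the observed nodes Z and run a BFS from X.

```python
# Separation in a Markov network: X and Y are separated given Z iff
# X cannot reach Y after the nodes in Z are deleted.
from collections import deque

def separated(adj, X, Y, Z):
    """adj: dict mapping each node to the set of its neighbours."""
    X, Y, Z = set(X), set(Y), set(Z)
    seen, queue = set(X - Z), deque(X - Z)
    while queue:
        v = queue.popleft()
        if v in Y:
            return False
        for w in adj.get(v, set()) - Z:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return True

# A 4-cycle A - B - C - D - A.
H = {'A': {'B', 'D'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'A', 'C'}}
print(separated(H, {'A'}, {'C'}, {'B', 'D'}))  # True: both paths blocked
print(separated(H, {'A'}, {'C'}, {'B'}))       # False: A - D - C is still open
```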
Example: Which independencies are in I(H) for the Markov Network below?
A factor of a set of random variables $(x_1, \cdots, x_k)$ is a function that maps a particular assignment of $(x_1, \cdots, x_k)$ to $\mathbb{R}.$
$\phi: Val(x_1, \cdots, x_k) \rightarrow \mathbb{R}$
The set $(x_1, \cdots, x_k)$ is called the scope of the factor.
The CPDs in BNs are a special case of factors, but in general a factor does not define a probability distribution (its values need not sum to 1).
Example: Let X and Y be two binary valued random variables.
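The original factor table is not reproduced here; a small sketch with illustrative numbers shows the two points above: a factor over binary X and Y is just a map from assignments to non-negative reals, it only becomes a distribution after normalization, and factors combine by pointwise multiplication.

```python
# A factor over binary X, Y: any map from assignments to non-negative
# reals. The numbers are illustrative -- note they do not sum to 1.
phi = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}

# Normalizing turns a factor into a probability distribution.
total = sum(phi.values())
p = {xy: v / total for xy, v in phi.items()}

# Factors combine by pointwise multiplication over shared variables;
# this is how a Markov network builds a joint from local factors.
psi = {(0,): 2.0, (1,): 0.5}          # a factor over Y alone
prod = {(x, y): phi[(x, y)] * psi[(y,)] for x, y in phi}

print(total)                # 46.0
print(sum(p.values()))      # 1.0
```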