Introduction to Probabilistic Graphical Models

Probabilistic graphical models are an elegant framework that combines uncertainty (probabilities) and logical structure (independence constraints) to compactly represent complex, real-world phenomena.

Joint Probability Distribution

The fundamental object for probabilistic inference in any problem is the joint probability distribution.

The Student Example


In [1]:
%pylab inline
pylab.style.use('ggplot')
import pandas as pd
import numpy as np


Populating the interactive namespace from numpy and matplotlib

Let us assume that the random variables take the following values:


In [2]:
difficulty = ['Easy', 'Hard']
intelligence = ['Avg', 'Below Avg', 'Above Avg']
grade = ['A', 'B', 'C']
sat = ['Good', 'Bad']
letter = ['Strong', 'Lukewarm']

In [3]:
from itertools import product

rows = list(product(difficulty, intelligence, grade, sat, letter))
rows_df = pd.DataFrame(rows, columns=['difficulty', 'intelligence', 'grade', 'sat', 'letter'])
probs = np.random.uniform(0, 1, len(rows_df))                          
rows_df = rows_df.assign(p=probs/probs.sum())

In [4]:
rows_df.head()


Out[4]:
difficulty intelligence grade sat letter p
0 Easy Avg A Good Strong 0.023762
1 Easy Avg A Good Lukewarm 0.025619
2 Easy Avg A Bad Strong 0.006901
3 Easy Avg A Bad Lukewarm 0.005887
4 Easy Avg B Good Strong 0.005623

In [5]:
rows_df.shape


Out[5]:
(72, 6)
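With the full joint table available, any probabilistic query reduces to summing rows. Below is a minimal sketch (rebuilding `rows_df` with a fixed seed so the snippet is self-contained) of computing a conditional probability by marginalization:

```python
from itertools import product

import numpy as np
import pandas as pd

# Rebuild the joint table from the cells above, with a fixed seed.
difficulty = ['Easy', 'Hard']
intelligence = ['Avg', 'Below Avg', 'Above Avg']
grade = ['A', 'B', 'C']
sat = ['Good', 'Bad']
letter = ['Strong', 'Lukewarm']

rows = list(product(difficulty, intelligence, grade, sat, letter))
rows_df = pd.DataFrame(rows, columns=['difficulty', 'intelligence', 'grade', 'sat', 'letter'])
probs = np.random.default_rng(0).uniform(0, 1, len(rows_df))
rows_df = rows_df.assign(p=probs / probs.sum())

# P(grade = 'A' | difficulty = 'Easy'): restrict to the Easy rows, renormalize.
easy = rows_df[rows_df.difficulty == 'Easy']
p_a_given_easy = easy.loc[easy.grade == 'A', 'p'].sum() / easy['p'].sum()
```

Every marginal or conditional query is some such sum over the table, which is exactly why the table's size matters.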

For most practical applications, the joint probability distribution is over a combinatorial space. This makes both learning and inference intractable.

Independencies Reduce the Joint PD

If all the variables were independent, we could simplify the joint probability distribution to:

$P(D, I, G, S, L) = P(D) P(I) P(G) P(S) P(L)$

which requires only 12 entries instead of 72.
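The two counts follow directly from the variable cardinalities: the full joint needs one entry per joint assignment (a product of cardinalities), while independent marginals need one entry per value (a sum). A quick check:

```python
import numpy as np

card = {'D': 2, 'I': 3, 'G': 3, 'S': 2, 'L': 2}
joint_entries = int(np.prod(list(card.values())))  # 2 * 3 * 3 * 2 * 2 = 72
marginal_entries = sum(card.values())              # 2 + 3 + 3 + 2 + 2 = 12
print(joint_entries, marginal_entries)  # 72 12
```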

In general however, we're not so lucky. Encoding the dependency structure in a graph allows us to take advantage of the efficient computation techniques from graph theory and apply them in solving the learning and inference problems.

Two Kinds of PGMs

If the dependencies are causal, then the representation is a directed acyclic graph (DAG). This is called a Belief Network or a Bayesian Network.

If the dependencies signify relatedness but not causality, then the representation is an undirected graph. This is called a Markov Network.

Observed and Latent Variables

A PGM can have two types of nodes - Observed and Latent.

An Observed node represents a random variable whose value can be directly measured, e.g. Grade.

A Latent node represents a random variable whose values cannot be seen or measured. Instead, we learn the variable (i.e. its probability distribution) from data. For a known structure, this is the learning problem.

For example, an HMM (a type of Bayesian Network) for a PoS tagger represents the words in a sentence by observed nodes and the corresponding PoS tags by latent nodes.

Independencies in Bayesian Networks

Conditional Independence

True independence between variables is hard to find. In practice, conditional independence is far more common and more useful.

Given three variables $X, Y, Z: \space X \perp Y \mid Z \text{ iff } P(X, Y | Z) = P(X \mid Z) P(Y \mid Z)$
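The definition can be verified numerically. The sketch below constructs a joint $P(X, Y, Z)$ that satisfies $X \perp Y \mid Z$ by construction (sampling each component distribution from a Dirichlet), then checks $P(X, Y \mid Z) = P(X \mid Z) P(Y \mid Z)$ for every value of $Z$:

```python
import numpy as np

rng = np.random.default_rng(42)
p_z = rng.dirichlet(np.ones(2))             # P(Z)
p_x_given_z = rng.dirichlet(np.ones(2), 2)  # row z is P(X | Z=z)
p_y_given_z = rng.dirichlet(np.ones(2), 2)  # row z is P(Y | Z=z)

# joint[x, y, z] = P(Z=z) P(X=x | Z=z) P(Y=y | Z=z)
joint = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)

for z in range(2):
    cond = joint[:, :, z] / joint[:, :, z].sum()          # P(X, Y | Z=z)
    outer = np.outer(cond.sum(axis=1), cond.sum(axis=0))  # P(X|Z=z) P(Y|Z=z)
    assert np.allclose(cond, outer)
```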

Active Trails

Active trails determine which nodes in a Bayesian Network are able to influence each other.

Let X, Y and Z be three nodes in a Bayesian Network. Then we have the following relationship cases:

  • Indirect Causal Effect: $X \rightarrow Y \rightarrow Z$ => $X \perp Z \mid Y$

  • Indirect Evidential Effect: $X \leftarrow Y \leftarrow Z$ => $X \perp Z \mid Y$

  • Common Cause: $X \leftarrow Y \rightarrow Z$ => $X \perp Z \mid Y$

  • Common Effect: $X \rightarrow Y \leftarrow Z$ => $X \perp Z$ if $Y$ is not observed.

Active trails generalize the above idea. Formally, an active trail is a path through a Bayesian Network such that for any 3 consecutive nodes $X_{i-1}, X_i, X_{i+1}$, none of the above independence relationships hold (for the common-effect case, the middle node counts as observed if it or any of its descendants is observed). In other words, an active trail is a path along which influence can flow from end to end.
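The four cases collapse into one rule per trail segment. The helper below is a hypothetical sketch (not a library API): it classifies a three-node segment of a trail as active or blocked, given the arrow directions at the middle node.

```python
def segment_active(left_edge, right_edge, middle_observed,
                   middle_has_observed_descendant=False):
    """Is the 3-node trail segment X - Y - Z active?

    left_edge / right_edge describe the arrows relative to the middle
    node Y: 'in' means the arrow points into Y, 'out' means away from Y.
    """
    if left_edge == 'in' and right_edge == 'in':
        # Common effect (v-structure): active only if Y, or one of its
        # descendants, is observed.
        return middle_observed or middle_has_observed_descendant
    # Causal chain, evidential chain, or common cause: active only if
    # Y is unobserved.
    return not middle_observed

# Causal chain X -> Y -> Z: blocked once Y is observed.
assert segment_active('in', 'out', middle_observed=False)
assert not segment_active('in', 'out', middle_observed=True)
# V-structure X -> Y <- Z: blocked until Y (or a descendant) is observed.
assert not segment_active('in', 'in', middle_observed=False)
assert segment_active('in', 'in', middle_observed=True)
```

A full d-separation test then amounts to walking every trail between two node sets and checking whether all of its segments are active.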

d-Separation

Let X, Y and Z be three disjoint sets of nodes in a Bayesian Network. X and Y are d-separated given Z if there are no active trails between them when the nodes in $Z$ are observed.

d-Separation reduces the problem of statistical independence to one of graph separation. If P factorizes over G and $d-sep_G(X, Y \mid Z)$, then P satisfies $X \perp Y \mid Z$.

Theorem: In a Bayesian Network $(P, G)$, each node is independent of its non-descendants given its parents.

This leads to:

Chain Rule for Bayesian Network: In a Bayesian network $(P, G)$,

$$P(x_1, x_2, \cdots x_k) = \prod_{i=1}^k P(x_i \mid Pa_{G}(x_i))$$

In this case, we say P factorizes over G.

Applying this to the student graph,

$P(D, I, G, S, L) = P(D)P(I)P(G \mid D, I)P(S \mid I)P(L \mid G)$
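This factorization can be checked with made-up CPDs. The sketch below fills each CPD with random but properly normalized Dirichlet draws, multiplies them with `einsum`, and verifies that the product is a valid joint distribution over the $2 \times 3 \times 3 \times 2 \times 2$ space:

```python
import numpy as np

rng = np.random.default_rng(7)

# One Dirichlet draw per conditioning context gives a normalized CPD.
P_D = rng.dirichlet(np.ones(2))              # P(D),        shape (2,)
P_I = rng.dirichlet(np.ones(3))              # P(I),        shape (3,)
P_G_DI = rng.dirichlet(np.ones(3), (2, 3))   # P(G | D, I), shape (2, 3, 3)
P_S_I = rng.dirichlet(np.ones(2), 3)         # P(S | I),    shape (3, 2)
P_L_G = rng.dirichlet(np.ones(2), 3)         # P(L | G),    shape (3, 2)

# P(D, I, G, S, L) = P(D) P(I) P(G|D,I) P(S|I) P(L|G)
joint = np.einsum('d,i,dig,is,gl->digsl', P_D, P_I, P_G_DI, P_S_I, P_L_G)

assert joint.shape == (2, 3, 3, 2, 2)  # all 72 joint entries,
assert np.isclose(joint.sum(), 1.0)    # recovered from much smaller CPD tables
```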

Every probability distribution always has at least one graph representation (which one? Hint: consider the fully connected graph, which asserts no independencies).

I-Map

The set of all independencies in P are: $I(P) = \{(X \perp Y \mid Z) : P \models (X \perp Y \mid Z)\}$.

Theorem: If P factorizes over G, then G is an I-map of P.

The converse is also true.

Theorem: If G is an I-map of P, then P factorizes over G.

(The analogous converse for Markov Networks, which holds for strictly positive distributions, is the Hammersley-Clifford theorem.)

Note that $I(G) \subseteq I(P)$, i.e. P can contain independencies that are not in G.

Minimal I-Map

Usually we are interested in the sparsest representation possible.

G is a minimal I-map of P if it is an I-map of P and removing any edge from G introduces independencies that do not hold in P.

Perfect I-Map

G is a perfect I-map of P if I(G) = I(P). Not all distributions have a perfect I-map.

Independencies in Markov Networks

Markov Networks have no notion of d-separation; there is only separation.

Let X, Y and Z be mutually exclusive sets of nodes in a Markov Network (P, H). Then X and Y are separated given Z if there are no active trails between X and Y in H when Z is observed.

For a Markov Network, an active trail is a path through the Network along which no nodes are observed.

Representation Theorem for Markov Networks: If P factorizes over H and $sep_H(X, Y \mid Z)$, then P satisfies $(X \perp Y \mid Z)$.

Theorem: If P factorizes over H, then H is an I-map of P.

Example: Which independencies are in I(H) for the Markov Network below?

  • $(A \perp F \mid B, D)$
  • $(A \perp E \mid B)$

Parameterization in Markov Networks

Parameterization in MNs is different from BNs because MNs do not have the notion of conditioning: influence flows in both directions along an edge.

Factors

A factor over a set of random variables $(x_1, \cdots, x_k)$ is a function that maps each assignment of $(x_1, \cdots, x_k)$ to $\mathbb{R}.$

$\phi: Val(x_1, \cdots, x_k) \rightarrow \mathbb{R}$

The set $(x_1, \cdots, x_k)$ is called the scope of the factor.

The CPDs in BNs are a special case of factors, but in general a factor's values need not sum to one, so a factor does not by itself define a probability distribution.
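A factor can be represented as a table of assignments with an attached value, and the basic operation on factors, the factor product, multiplies values wherever the shared variables agree. A sketch using pandas DataFrames (the `factor_product` helper is illustrative, not a library API):

```python
import pandas as pd

# Two factors: phi1 over scope (A, B), phi2 over scope (B, C).
phi1 = pd.DataFrame({'A': [0, 0, 1, 1], 'B': [0, 1, 0, 1],
                     'val': [30.0, 5.0, 1.0, 10.0]})
phi2 = pd.DataFrame({'B': [0, 0, 1, 1], 'C': [0, 1, 0, 1],
                     'val': [100.0, 1.0, 1.0, 100.0]})

def factor_product(f, g):
    # Join on the shared scope, multiply values row-wise.
    shared = [c for c in f.columns if c in g.columns and c != 'val']
    merged = f.merge(g, on=shared, suffixes=('_f', '_g'))
    merged['val'] = merged['val_f'] * merged['val_g']
    return merged.drop(columns=['val_f', 'val_g'])

psi = factor_product(phi1, phi2)  # scope (A, B, C), 2*2*2 = 8 rows
```

Summing a variable out of the result is then just a `groupby` over the remaining scope, e.g. `psi.groupby(['A', 'C'])['val'].sum()` eliminates B.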

Example: Let X and Y be two binary valued random variables.