author: Jacob Schreiber
contact: jmschreiber91@gmail.com
Markov networks are probabilistic models that are usually represented as an undirected graph, where the nodes represent variables and the edges represent associations. Markov networks are similar to Bayesian networks with the primary difference being that Bayesian networks can be represented as directed graphs with known parental relations. Generally, Bayesian networks are easier to interpret and can be used to calculate probabilities faster but, naturally, require that causality is known. However, in many settings, one can only know the associations between variables and not necessarily the direction of causality.
The underlying implementation of inference in pomegranate for both Markov networks and Bayesian networks is the same, because both get converted to their factor graph representations. However, there are many important differences between the two models that should be considered before choosing one.
In this tutorial we will go over how to use Markov networks in pomegranate and what some of the current limitations of the implementation are.
In [1]:
%matplotlib inline
import numpy
import itertools
from pomegranate import *
numpy.random.seed(0)
numpy.set_printoptions(suppress=True)
%load_ext watermark
%watermark -m -n -p numpy,scipy,pomegranate
A Markov network is defined by passing in a list of the joint probability tables associated with a series of cliques rather than an explicit graph structure. This is because the probability distributions for a particular variable are not defined by themselves, but rather through associations with other variables. While a Bayesian network has root variables that do not hav e parents, the undirected nature of the edges in a Markov networks means that variables are generally grouped together.
Let's define a simple Markov network where the cliques are A-B, B-C-D, and C-D-E. B-C-D-E is almost a clique but are missing the connection between B and E.
In [2]:
d1 = JointProbabilityTable([
[0, 0, 0.1],
[0, 1, 0.2],
[1, 0, 0.4],
[1, 1, 0.3]], [0, 1])
d2 = JointProbabilityTable([
[0, 0, 0, 0.05],
[0, 0, 1, 0.15],
[0, 1, 0, 0.07],
[0, 1, 1, 0.03],
[1, 0, 0, 0.12],
[1, 0, 1, 0.18],
[1, 1, 0, 0.10],
[1, 1, 1, 0.30]], [1, 2, 3])
d3 = JointProbabilityTable([
[0, 0, 0, 0.08],
[0, 0, 1, 0.12],
[0, 1, 0, 0.11],
[0, 1, 1, 0.19],
[1, 0, 0, 0.04],
[1, 0, 1, 0.06],
[1, 1, 0, 0.23],
[1, 1, 1, 0.17]], [2, 3, 4])
model = MarkovNetwork([d1, d2, d3])
model.bake()
We can see that the initialization is fairly straightforward. An important note is that the JointProbabilityTable object requires as the second argument a list of variables that are included in that clique in the order that they appear in the table, from left to right.
Similar to the other probabilistic models in pomegranate, Markov networks can be used to calculate the probability or log probability of examples. However, unlike the other models, calculating the log probability for Markov networks is generally computationally intractable for data with even a modest number of variables (~30).
The process for calculating the log probability begins by calculating the "unnormalized" log probability $\hat{P}$, which is just the product of the probabilities for the variables $c$ in each clique $c \in C$ under their joint probability table $JPT(c)$. This step is easy because it just involves, for each clique, taking the columns corresponding to that clique and performing a table lookup.
\begin{equation} \hat{P}(X=x) = \prod\limits_{c \in C} JPT(c) \end{equation}The reason this is called the unnormalized probability is because the sum of all combinations of variables that $X$ can take $\sum\limits_{x \in X} \hat{P}(X=x)$ does not sum to 1; thus, it is not a true probability distribution.
We can calculate the normalized probability $P(X)$ by dividing by the sum of probabilities under all combinations of variables, frequently referred to as the "partition function". Calculating the partition function $Z$ is as simple as summing the unnormalized probabilities over all possible combinations of variables $x \in X$.
\begin{equation} Z = \sum\limits_{x \in X} \hat{P}(X=x) \end{equation}Finally, we can divide any unnormalized probability calculation by the partition function to get the correct probability.
\begin{equation} P(X = x) = \frac{1}{Z} \hat{P}(X = x) \end{equation}The probability
method returns the normalized probability value. We can check this by seeing that it is different than simply passing the columns of data in to the distributions for their respective cliques.
In [3]:
model.probability([0, 1, 0, 0, 1])
Out[3]:
And the probability if we simply passed the columns into the corresponding cliques:
In [4]:
d1.probability([0, 1]) * d2.probability([1, 0, 0]) * d3.probability([0, 0, 1])
Out[4]:
However, by passing the unnormalized=True
parameter in to the probability
method we can return the unnormalized probability values.
In [5]:
model.probability([0, 1, 0, 0, 1], unnormalized=True)
Out[5]:
We can see that the two are identical, subject to machine precision.
Calculating the partition function involves summing the unnormalized probabilities of all combinations of variables that an example can take. Unfortunately, the time it takes to calculate the partition function grows exponentially with the number of dimensions. This means that it may not be able to calculate the partition function exactly for more than ~25 variables, depending on your machine. While pomegranate does not currently support any methods for calculating the partition function other than the exact method, it is flexible enough to allow users to get around this limitation.
The partition function itself is calculated in the bake
method because, at that point, all combinations of variables are known to the model. This value is then cached so that calls to probability
or log_probability
are just as fast regardless of if the normalized or unnormalized probabilities are calculated. However, if the user passes in calculate_partition=False
the model will not spend time calculating the partition function. We can see the difference in time here:
In [6]:
X = numpy.random.randint(2, size=(100, 14))
model2 = MarkovNetwork.from_samples(X)
%timeit model2.bake()
%timeit model2.bake(calculate_partition=False)
There are two main reasons that one might not want to calculate the partition function when creating the model. The first is when the user will only be inspecting the model, such as after structure learning, or only doing inference, which uses the approximate loopy belief propogation. The second is if the user wants to estimate the partition function themselves using an approximate algorithm.
Let's look at how one would manually calculate the exact partition function to see how an approximate algorithm could be substituted in. First, what happens if we don't calculate the partition but try to calculate probabilities?
In [7]:
model.bake(calculate_partition=False)
model.probability([0, 1, 0, 0, 1])
Looks like we get an error. We can still calculate unnormalized probabilities though.
In [8]:
model.probability([0, 1, 0, 0, 1], unnormalized=True)
Out[8]:
Now we can calculate the partition function by calculating the unnormalized probability of all combinations.
In [9]:
Z = model.probability(list(itertools.product(*model.keys_)), unnormalized=True).sum()
Z
Out[9]:
We can set the stored partition function in the model to be this value (or specifically the log partition function) and then calculate normalized probabilities as before.
In [10]:
model.partition = numpy.log(Z)
model.probability([0, 1, 0, 0, 1])
Out[10]:
Now you can calculate $Z$ however you'd like and simply plug it in.
Similar to Bayesian networks, Markov networks can be used to infer missing values in data sets. In pomegranate, inference is done for Markov networks by converting them to their factor graph representations and then using loopy belief propagation. This results in fast, but approximate, inference.
In [11]:
model.predict([[None, 1, 0, None, None], [None, None, 1, None, 0]])
Out[11]:
If we go back and inspect the joint probability tables we can easily see that the inferred values are the correct ones.
If we want the full probability distribution rather than just the most likely value we can use the predict_proba
method.
In [12]:
model.predict_proba([[None, None, 1, None, 0]])
Out[12]:
Markov networks can be learned from data just as Bayesian networks can. While there are many algorithms that have been proposed for Bayesian network structure learning, there are fewer for Markov networks. Currently, pomegranate only supports the Chow-Liu tree-building algorithm for learning a tree-like network. Let's see a simple example where we learn a Markov network over a chunk of features from the digits data set.
In [13]:
from sklearn.datasets import load_digits
X = load_digits().data[:, 22:40]
X = (X > numpy.median(X)).astype(int)
model = MarkovNetwork.from_samples(X)
icd = IndependentComponentsDistribution.from_samples(X, distributions=DiscreteDistribution)
model.log_probability(X).sum(), icd.log_probability(X).sum()
Out[13]:
It looks like the Markov network is somewhat better than simply modeling each pixel individually for this set of features.