Markov Networks

author: Jacob Schreiber
contact: jmschreiber91@gmail.com

Markov networks are probabilistic models that are usually represented as an undirected graph, where the nodes represent variables and the edges represent associations. Markov networks are similar to Bayesian networks with the primary difference being that Bayesian networks can be represented as directed graphs with known parental relations. Generally, Bayesian networks are easier to interpret and can be used to calculate probabilities faster but, naturally, require that causality is known. However, in many settings, one can only know the associations between variables and not necessarily the direction of causality.

The underlying implementation of inference in pomegranate for both Markov networks and Bayesian networks is the same, because both get converted to their factor graph representations. However, there are many important differences between the two models that should be considered before choosing one.

In this tutorial we will go over how to use Markov networks in pomegranate and what some of the current limitations of the implementation are.


In [1]:
%matplotlib inline
import numpy
import itertools

from pomegranate import *

numpy.random.seed(0)
numpy.set_printoptions(suppress=True)

%load_ext watermark
%watermark -m -n -p numpy,scipy,pomegranate


Mon Dec 02 2019 

numpy 1.17.2
scipy 1.3.1
pomegranate 0.11.3

compiler   : GCC 7.3.0
system     : Linux
release    : 4.15.0-66-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit

Defining a Markov Network

A Markov network is defined by passing in a list of the joint probability tables associated with a series of cliques rather than an explicit graph structure. This is because the probability distributions for a particular variable are not defined by themselves, but rather through associations with other variables. While a Bayesian network has root variables that do not hav e parents, the undirected nature of the edges in a Markov networks means that variables are generally grouped together.

Let's define a simple Markov network where the cliques are A-B, B-C-D, and C-D-E. B-C-D-E is almost a clique but are missing the connection between B and E.


In [2]:
d1 = JointProbabilityTable([
    [0, 0, 0.1],
    [0, 1, 0.2],
    [1, 0, 0.4],
    [1, 1, 0.3]], [0, 1])

d2 = JointProbabilityTable([
    [0, 0, 0, 0.05],
    [0, 0, 1, 0.15],
    [0, 1, 0, 0.07],
    [0, 1, 1, 0.03],
    [1, 0, 0, 0.12],
    [1, 0, 1, 0.18],
    [1, 1, 0, 0.10],
    [1, 1, 1, 0.30]], [1, 2, 3])

d3 = JointProbabilityTable([
    [0, 0, 0, 0.08],
    [0, 0, 1, 0.12],
    [0, 1, 0, 0.11],
    [0, 1, 1, 0.19],
    [1, 0, 0, 0.04],
    [1, 0, 1, 0.06],
    [1, 1, 0, 0.23],
    [1, 1, 1, 0.17]], [2, 3, 4])

model = MarkovNetwork([d1, d2, d3])
model.bake()

We can see that the initialization is fairly straightforward. An important note is that the JointProbabilityTable object requires as the second argument a list of variables that are included in that clique in the order that they appear in the table, from left to right.

Calculating the probability of examples

Similar to the other probabilistic models in pomegranate, Markov networks can be used to calculate the probability or log probability of examples. However, unlike the other models, calculating the log probability for Markov networks is generally computationally intractable for data with even a modest number of variables (~30).

The process for calculating the log probability begins by calculating the "unnormalized" log probability $\hat{P}$, which is just the product of the probabilities for the variables $c$ in each clique $c \in C$ under their joint probability table $JPT(c)$. This step is easy because it just involves, for each clique, taking the columns corresponding to that clique and performing a table lookup.

\begin{equation} \hat{P}(X=x) = \prod\limits_{c \in C} JPT(c) \end{equation}

The reason this is called the unnormalized probability is because the sum of all combinations of variables that $X$ can take $\sum\limits_{x \in X} \hat{P}(X=x)$ does not sum to 1; thus, it is not a true probability distribution.

We can calculate the normalized probability $P(X)$ by dividing by the sum of probabilities under all combinations of variables, frequently referred to as the "partition function". Calculating the partition function $Z$ is as simple as summing the unnormalized probabilities over all possible combinations of variables $x \in X$.

\begin{equation} Z = \sum\limits_{x \in X} \hat{P}(X=x) \end{equation}

Finally, we can divide any unnormalized probability calculation by the partition function to get the correct probability.

\begin{equation} P(X = x) = \frac{1}{Z} \hat{P}(X = x) \end{equation}

The probability method returns the normalized probability value. We can check this by seeing that it is different than simply passing the columns of data in to the distributions for their respective cliques.


In [3]:
model.probability([0, 1, 0, 0, 1])


Out[3]:
0.020425530968183916

And the probability if we simply passed the columns into the corresponding cliques:


In [4]:
d1.probability([0, 1]) * d2.probability([1, 0, 0]) * d3.probability([0, 0, 1])


Out[4]:
0.0028800000000000006

However, by passing the unnormalized=True parameter in to the probability method we can return the unnormalized probability values.


In [5]:
model.probability([0, 1, 0, 0, 1], unnormalized=True)


Out[5]:
0.0028799999999999997

We can see that the two are identical, subject to machine precision.

Calculating the partition function

Calculating the partition function involves summing the unnormalized probabilities of all combinations of variables that an example can take. Unfortunately, the time it takes to calculate the partition function grows exponentially with the number of dimensions. This means that it may not be able to calculate the partition function exactly for more than ~25 variables, depending on your machine. While pomegranate does not currently support any methods for calculating the partition function other than the exact method, it is flexible enough to allow users to get around this limitation.

The partition function itself is calculated in the bake method because, at that point, all combinations of variables are known to the model. This value is then cached so that calls to probability or log_probability are just as fast regardless of if the normalized or unnormalized probabilities are calculated. However, if the user passes in calculate_partition=False the model will not spend time calculating the partition function. We can see the difference in time here:


In [6]:
X = numpy.random.randint(2, size=(100, 14))
model2 = MarkovNetwork.from_samples(X)

%timeit model2.bake()
%timeit model2.bake(calculate_partition=False)


2.67 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.06 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

There are two main reasons that one might not want to calculate the partition function when creating the model. The first is when the user will only be inspecting the model, such as after structure learning, or only doing inference, which uses the approximate loopy belief propogation. The second is if the user wants to estimate the partition function themselves using an approximate algorithm.

Let's look at how one would manually calculate the exact partition function to see how an approximate algorithm could be substituted in. First, what happens if we don't calculate the partition but try to calculate probabilities?


In [7]:
model.bake(calculate_partition=False)
model.probability([0, 1, 0, 0, 1])


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-253972a8e7bb> in <module>
      1 model.bake(calculate_partition=False)
----> 2 model.probability([0, 1, 0, 0, 1])

~/Desktop/pomegranate/pomegranate/MarkovNetwork.pyx in pomegranate.MarkovNetwork.MarkovNetwork.probability()

~/Desktop/pomegranate/pomegranate/MarkovNetwork.pyx in pomegranate.MarkovNetwork.MarkovNetwork.log_probability()

ValueError: Must calculate partition before computing probability

Looks like we get an error. We can still calculate unnormalized probabilities though.


In [8]:
model.probability([0, 1, 0, 0, 1], unnormalized=True)


Out[8]:
0.0028799999999999997

Now we can calculate the partition function by calculating the unnormalized probability of all combinations.


In [9]:
Z = model.probability(list(itertools.product(*model.keys_)), unnormalized=True).sum()
Z


Out[9]:
0.141

We can set the stored partition function in the model to be this value (or specifically the log partition function) and then calculate normalized probabilities as before.


In [10]:
model.partition = numpy.log(Z)
model.probability([0, 1, 0, 0, 1])


Out[10]:
0.020425530968183916

Now you can calculate $Z$ however you'd like and simply plug it in.

Inference

Similar to Bayesian networks, Markov networks can be used to infer missing values in data sets. In pomegranate, inference is done for Markov networks by converting them to their factor graph representations and then using loopy belief propagation. This results in fast, but approximate, inference.


In [11]:
model.predict([[None, 1, 0, None, None], [None, None, 1, None, 0]])


Out[11]:
[array([1, 1, 0, 1, 1], dtype=object), array([1, 1, 1, 1, 0], dtype=object)]

If we go back and inspect the joint probability tables we can easily see that the inferred values are the correct ones.

If we want the full probability distribution rather than just the most likely value we can use the predict_proba method.


In [12]:
model.predict_proba([[None, None, 1, None, 0]])


Out[12]:
[array([{
     "class" :"Distribution",
     "dtype" :"int",
     "name" :"DiscreteDistribution",
     "parameters" :[
         {
             "0" :0.37654171704957684,
             "1" :0.623458282950423
         }
     ],
     "frozen" :false
 },
        {
     "class" :"Distribution",
     "dtype" :"int",
     "name" :"DiscreteDistribution",
     "parameters" :[
         {
             "0" :0.11729141475211632,
             "1" :0.8827085852478838
         }
     ],
     "frozen" :false
 },
        1,
        {
     "class" :"Distribution",
     "dtype" :"int",
     "name" :"DiscreteDistribution",
     "parameters" :[
         {
             "0" :0.08222490931076197,
             "1" :0.917775090689238
         }
     ],
     "frozen" :false
 },
        0], dtype=object)]

Structure Learning

Markov networks can be learned from data just as Bayesian networks can. While there are many algorithms that have been proposed for Bayesian network structure learning, there are fewer for Markov networks. Currently, pomegranate only supports the Chow-Liu tree-building algorithm for learning a tree-like network. Let's see a simple example where we learn a Markov network over a chunk of features from the digits data set.


In [13]:
from sklearn.datasets import load_digits

X = load_digits().data[:, 22:40]
X = (X > numpy.median(X)).astype(int)

model = MarkovNetwork.from_samples(X)
icd = IndependentComponentsDistribution.from_samples(X, distributions=DiscreteDistribution)

model.log_probability(X).sum(), icd.log_probability(X).sum()


Out[13]:
(-12699.71644426486, -13141.625038482453)

It looks like the Markov network is somewhat better than simply modeling each pixel individually for this set of features.