Bayesian Networks

author: Jacob Schreiber
contact: jmschreiber91@gmail.com

Bayesian networks are a powerful inference tool in which variables are represented as nodes: an edge between two variables represents a dependence between them, and the absence of an edge represents a conditional independence statement. One of the most powerful aspects of a Bayesian network is the ability to infer the values of some variables given observed values for another set of variables. These are referred to as the 'hidden' and 'observed' variables respectively, and they do not need to be specified when the network is created; the same network can treat different variables as hidden or observed from one data point to the next. The more values that are observed, the closer the inferred values will be to the truth.

While Bayesian networks can in principle use complex emission distributions, such as Gaussian or conditional Gaussian distributions, pomegranate currently supports only discrete Bayesian networks. When inference is performed, the Bayesian network is explicitly converted into a factor graph: a bipartite graph with a marginal node for each variable on one side and the joint (conditional probability) tables on the other.


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set_style('whitegrid')
import numpy

from pomegranate import *

numpy.random.seed(0)
numpy.set_printoptions(suppress=True)

%load_ext watermark
%watermark -m -n -p numpy,scipy,pomegranate


Sat Mar 14 2020 

numpy 1.18.1
scipy 1.3.2
pomegranate 0.12.0

compiler   : GCC 7.3.0
system     : Linux
release    : 4.15.0-88-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit

The Monty Hall Gameshow

The Monty Hall problem arose from the gameshow Let's Make a Deal, where a guest had to choose which one of three doors had a prize behind it. The twist was that after the guest chose, the host, originally Monty Hall, would open one of the doors the guest did not pick and ask if the guest wanted to switch which door they had picked. Initial inspection may lead you to believe that if there are only two doors left, there is a 50-50 chance of picking the right one, and so there is no advantage to switching. However, it has been shown both through simulation and analytically that there is in fact a 2/3 chance of getting the prize if the guest switches their door, regardless of the door they initially chose.
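As a quick sanity check of that 2/3 figure, here is a minimal numpy simulation of the switching strategy. This is an illustrative sketch and not part of the pomegranate workflow below; the door labels and variable names are made up for the example.

import numpy

numpy.random.seed(0)
doors = numpy.array(['A', 'B', 'C'])
n = 100000

prize = numpy.random.choice(doors, size=n)   # the door hiding the prize
guest = numpy.random.choice(doors, size=n)   # the guest's initial pick

# Monty always opens a losing, non-chosen door, so switching wins exactly
# when the guest's initial pick was wrong.
print("stay  :", (guest == prize).mean())    # roughly 1/3
print("switch:", (guest != prize).mean())    # roughly 2/3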

We can reproduce this result using Bayesian networks with three nodes, one for the guest, one for the prize, and one for the door Monty chooses to open. The door the guest initially chooses and the door the prize is behind are completely random processes across the three doors, but the door which Monty opens is dependent on both the door the guest chooses (it cannot be the door the guest chooses), and the door the prize is behind (it cannot be the door with the prize behind it).

To create the Bayesian network in pomegranate, we first create the distributions that live in each node in the graph. For a discrete (aka categorical) Bayesian network we use DiscreteDistribution objects for the root nodes and ConditionalProbabilityTable objects for the inner and leaf nodes. The columns in a ConditionalProbabilityTable correspond to the order in which the parents (the second argument) are specified, followed by the value the ConditionalProbabilityTable itself takes, and finally the probability of that combination. In the case below, the first column corresponds to the value 'guest' takes, then the value 'prize' takes, and then the value that 'monty' takes. 'B', 'C', 'A' then refers to the probability that Monty reveals door 'A' given that the guest has chosen door 'B' and that the prize is actually behind door 'C', or P(Monty='A'|Guest='B', Prize='C').


In [2]:
# The guest's initial door selection is completely random
guest = DiscreteDistribution({'A': 1./3, 'B': 1./3, 'C': 1./3})

# The door the prize is behind is also completely random
prize = DiscreteDistribution({'A': 1./3, 'B': 1./3, 'C': 1./3})

# Monty is dependent on both the guest and the prize.
monty = ConditionalProbabilityTable(
        [[ 'A', 'A', 'A', 0.0 ],
         [ 'A', 'A', 'B', 0.5 ],
         [ 'A', 'A', 'C', 0.5 ],
         [ 'A', 'B', 'A', 0.0 ],
         [ 'A', 'B', 'B', 0.0 ],
         [ 'A', 'B', 'C', 1.0 ],
         [ 'A', 'C', 'A', 0.0 ],
         [ 'A', 'C', 'B', 1.0 ],
         [ 'A', 'C', 'C', 0.0 ],
         [ 'B', 'A', 'A', 0.0 ],
         [ 'B', 'A', 'B', 0.0 ],
         [ 'B', 'A', 'C', 1.0 ],
         [ 'B', 'B', 'A', 0.5 ],
         [ 'B', 'B', 'B', 0.0 ],
         [ 'B', 'B', 'C', 0.5 ],
         [ 'B', 'C', 'A', 1.0 ],
         [ 'B', 'C', 'B', 0.0 ],
         [ 'B', 'C', 'C', 0.0 ],
         [ 'C', 'A', 'A', 0.0 ],
         [ 'C', 'A', 'B', 1.0 ],
         [ 'C', 'A', 'C', 0.0 ],
         [ 'C', 'B', 'A', 1.0 ],
         [ 'C', 'B', 'B', 0.0 ],
         [ 'C', 'B', 'C', 0.0 ],
         [ 'C', 'C', 'A', 0.5 ],
         [ 'C', 'C', 'B', 0.5 ],
         [ 'C', 'C', 'C', 0.0 ]], [guest, prize])

Next, we pass these distributions into State objects along with a name for each node.


In [3]:
# State objects hold both the distribution, and a high level name.
s1 = State(guest, name="guest")
s2 = State(prize, name="prize")
s3 = State(monty, name="monty")

Then we add the states to the network, exactly like we did when making an HMM. From now on, all matrices of data should have their columns organized in the same order that the states are added to the network. The order in which the states are added makes no difference to the network itself, so you should add the states according to how the columns are organized in your data.


In [4]:
# Create the Bayesian network object with a useful name
model = BayesianNetwork("Monty Hall Problem")

# Add the three states to the network 
model.add_states(s1, s2, s3)

Then we need to add edges to the model. The edges represent which states are parents of which other states. This is currently a bit redundant with passing in the distribution objects that are the parents of each ConditionalProbabilityTable. For now, edges are added from parent -> child by calling model.add_edge(parent, child).


In [5]:
# Add edges which represent conditional dependencies, where the second node is 
# conditionally dependent on the first node (Monty is dependent on both guest and prize)
model.add_edge(s1, s3)
model.add_edge(s2, s3)

Lastly, the model must be baked to finalize the internals. Since Bayesian networks use factor graphs for inference, an explicit factor graph is produced from the Bayesian network during the bake step.


In [6]:
model.bake()

Predicting Probabilities

We can calculate the probability of a sample under the Bayesian network in the same way that we can calculate probabilities under other models. Keeping in mind that the columns are ordered guest, prize, monty, let's calculate the probability that the guest initially chose door A, that the prize is actually behind door B, and that Monty then opened door C.


In [7]:
model.probability([['A', 'B', 'C']])


Out[7]:
0.11111111111111109

That seems in line with what we know, that there is a 1/9th probability of that happening.

Next, let's look at an impossible situation. What is the probability that the guest initially chose door A, that the prize was actually behind door B, and that Monty opened door B?


In [8]:
model.probability([['A', 'B', 'B']])


Out[8]:
0.0

The reason this situation is impossible is that Monty can't open a door that has the car behind it.

Performing Inference

Perhaps the most powerful aspect of Bayesian networks is their ability to perform inference. Given any set of observed variables, including no observations at all, Bayesian networks can make predictions for all of the other variables. Naturally, the more variables that are observed, the more accurate the predictions for the remaining variables will be.

pomegranate uses the loopy belief propagation algorithm to do inference. This is an approximate algorithm that yields exact results on tree-like graphs and, in most other cases, still yields good results. Inference on a Bayesian network consists of taking in observations for a subset of the variables and using them to infer the values that the other variables take. The more variables that are observed, the closer the inferred values will be to the truth. One of the strengths of Bayesian networks is that the set of observed and 'hidden' (or unobserved) variables does not need to be specified beforehand, and can change from sample to sample.

We can run inference using the predict_proba method and passing in a dictionary of values, where the key is the name of the state and the value is the observed value for that state. If we don't supply any values, we get the marginal of the graph, which is just the frequency of each value for each variable over an infinite number of randomly drawn samples from the graph.

Let's see what happens when we look at the marginal of the Monty Hall network.


In [9]:
model.predict_proba({})


Out[9]:
array([{
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.33333333333333337,
            "B" :0.33333333333333337,
            "C" :0.33333333333333337
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "A" :0.33333333333333337,
            "B" :0.33333333333333337,
            "C" :0.33333333333333337
        }
    ],
    "frozen" :false
},
       {
    "class" :"Distribution",
    "dtype" :"str",
    "name" :"DiscreteDistribution",
    "parameters" :[
        {
            "C" :0.3333333333333333,
            "B" :0.3333333333333333,
            "A" :0.3333333333333333
        }
    ],
    "frozen" :false
}], dtype=object)

We are returned three DiscreteDistribution objects, each representing the marginal distribution of one variable, in the same order the variables were added to the model. In this case, they represent the guest, prize, and monty variables respectively. We see that every value is equally likely.

We can also pass in an array where None (or numpy.nan) corresponds to the unobserved values.


In [10]:
model.predict_proba([[None, None, None]])


Out[10]:
[array([{
     "class" :"Distribution",
     "dtype" :"str",
     "name" :"DiscreteDistribution",
     "parameters" :[
         {
             "A" :0.33333333333333337,
             "B" :0.33333333333333337,
             "C" :0.33333333333333337
         }
     ],
     "frozen" :false
 },
        {
     "class" :"Distribution",
     "dtype" :"str",
     "name" :"DiscreteDistribution",
     "parameters" :[
         {
             "A" :0.33333333333333337,
             "B" :0.33333333333333337,
             "C" :0.33333333333333337
         }
     ],
     "frozen" :false
 },
        {
     "class" :"Distribution",
     "dtype" :"str",
     "name" :"DiscreteDistribution",
     "parameters" :[
         {
             "C" :0.3333333333333333,
             "B" :0.3333333333333333,
             "A" :0.3333333333333333
         }
     ],
     "frozen" :false
 }], dtype=object)]

Variables that were observed are returned as the observed value itself, and variables that were unobserved are returned as DiscreteDistribution objects. For those, parameters[0] will return the underlying dictionary used by the distribution.
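For example, here is a minimal sketch of how one might unpack a predict_proba result, pairing each entry with its state name. The variable names are just for illustration, and this assumes the states are exposed through model.states, as in other pomegranate models.

beliefs = model.predict_proba([['A', None, None]])[0]

for state, belief in zip(model.states, beliefs):
    if isinstance(belief, str):
        # the variable was observed, so the value itself is returned
        print(state.name, "=", belief)
    else:
        # the variable was unobserved, so a DiscreteDistribution is returned
        print(state.name, belief.parameters[0])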

Now let's do something different and say that the guest has chosen door 'A'. We can do this either by passing a dictionary to predict_proba whose keys are the state names (as given in the State objects) and whose values are the observed values, or by passing in a list where the first index is our observation.


In [11]:
model.predict_proba([['A', None, None]])


Out[11]:
[array(['A',
        {
     "class" :"Distribution",
     "dtype" :"str",
     "name" :"DiscreteDistribution",
     "parameters" :[
         {
             "A" :0.3333333333333333,
             "B" :0.3333333333333333,
             "C" :0.3333333333333333
         }
     ],
     "frozen" :false
 },
        {
     "class" :"Distribution",
     "dtype" :"str",
     "name" :"DiscreteDistribution",
     "parameters" :[
         {
             "C" :0.49999999999999994,
             "B" :0.49999999999999994,
             "A" :0.0
         }
     ],
     "frozen" :false
 }], dtype=object)]

We can see that Monty will now not open door 'A', because the guest has chosen it. At the same time, the distribution over the prize has not changed; it is still equally likely that the prize is behind each door.

Now, let's say that Monty opens door 'C' and see what happens. Here we use a dictionary rather than a list simply to show that both input forms can be used, depending on which is more convenient.


In [12]:
model.predict_proba([{'guest': 'A', 'monty': 'C'}])


Out[12]:
[array(['A',
        {
     "class" :"Distribution",
     "dtype" :"str",
     "name" :"DiscreteDistribution",
     "parameters" :[
         {
             "A" :0.3333333333333334,
             "B" :0.6666666666666664,
             "C" :0.0
         }
     ],
     "frozen" :false
 },
        'C'], dtype=object)]

Suddenly, we see that the distribution over prizes has changed. It is now twice as likely that the car is behind the door labeled 'B'. This demonstrates that when on the game show, it is always better to change your initial guess after being shown an open door. Now you could go and win tons of cars, except that the game show got cancelled.

Imputation Given Structured Constraints

The task of filling in an incomplete matrix can be called imputation, and there are many approaches for doing so. One of the most well known is that of matrix factorization, where a latent representation is learned for each of the columns and each of the rows such that the dot product between the two can reconstruct values in the matrix. Due to the manner that these latent representations are learned, the matrix does not need to be complete, and the dot product can then be used to fill in all of the missing values.

One weakness of the matrix factorization approach is that constraints and known relationships can't be enforced between the features. A simple example is that the rule "when column 1 is 'A' and column 2 is 'B', column 3 must be 'C'" can potentially be learned in the representation, but can't be simply hard-coded like a conditional probability table would. A more complex example would say that a pixel in an image can be predicted from its neighbors, whereas the notion of neighbors is more difficult to specify for a factorization approach.
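To make the first rule concrete, here is a hedged sketch of how such a constraint could be hard-coded in a ConditionalProbabilityTable; the distributions, values, and probabilities are made up purely for illustration.

# Hypothetical parent distributions for columns 1 and 2.
col1 = DiscreteDistribution({'A': 0.5, 'B': 0.5})
col2 = DiscreteDistribution({'B': 0.5, 'C': 0.5})

# Column 3 takes values 'C' or 'D'. The first two rows hard-code the rule:
# when column 1 is 'A' and column 2 is 'B', column 3 must be 'C'.
col3 = ConditionalProbabilityTable(
        [['A', 'B', 'C', 1.0],
         ['A', 'B', 'D', 0.0],
         ['A', 'C', 'C', 0.5],
         ['A', 'C', 'D', 0.5],
         ['B', 'B', 'C', 0.5],
         ['B', 'B', 'D', 0.5],
         ['B', 'C', 'C', 0.5],
         ['B', 'C', 'D', 0.5]], [col1, col2])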

The process for imputing data given a Bayesian network is to either first learn the structure of the network from the given data or use a known structure, and then use loopy belief propagation to predict the best values for the missing features.

Let's see an example of this on the digits data set, binarizing the pixels based on the median value. We'll only use the first two rows of pixels from each image (the first 16 columns of the flattened data) because learning large, unconstrained Bayesian networks is difficult.


In [13]:
from sklearn.datasets import load_digits

data = load_digits()
X, _ = data.data, data.target

plt.imshow(X[0].reshape(8, 8))
plt.grid(False)
plt.show()

X = X[:,:16]
X = (X > numpy.median(X)).astype('float64')


Now let's randomly remove a large portion of the pixels from each of the images. We can do that with numpy arrays by setting the missing values to numpy.nan.


In [14]:
numpy.random.seed(111)

i = numpy.random.randint(X.shape[0], size=10000)
j = numpy.random.randint(X.shape[1], size=10000)

X_missing = X.copy()
X_missing[i, j] = numpy.nan
X_missing


Out[14]:
array([[ 0.,  0.,  1., ...,  1.,  1.,  0.],
       [ 0., nan, nan, ..., nan,  0.,  0.],
       [ 0.,  0., nan, ...,  1.,  0.,  0.],
       ...,
       [ 0.,  0., nan, ...,  1.,  0., nan],
       [ 0., nan, nan, ...,  1., nan,  0.],
       [nan,  0.,  1., ..., nan,  0., nan]])

We can set up a baseline for how good an imputation is by using the average pixel value as a replacement. Because this is binary data, we can use the mean absolute error to measure how well these approaches impute the pixels that are not observed.


In [15]:
from fancyimpute import SimpleFill

y_pred = SimpleFill().fit_transform(X_missing)[i, j]
numpy.abs(y_pred - X[i, j]).mean()


Using Theano backend.
Out[15]:
0.1954958904004812

Next, we can see how well an IterativeSVD approach does, which is similar to a matrix factorization.


In [16]:
from fancyimpute import IterativeSVD

y_pred = IterativeSVD(verbose=False).fit_transform(X_missing)[i, j]
numpy.abs(y_pred - X[i, j]).mean()


Out[16]:
0.26645818594371573

Now, we can try building a Bayesian network using the Chow-Liu algorithm and then use the resulting network to fill in the matrix.


In [17]:
y_hat = BayesianNetwork.from_samples(X_missing, max_parents=1).predict(X_missing)
numpy.abs(numpy.array(y_hat)[i, j] - X[i, j]).mean()


Out[17]:
0.1092

We can compare this to a stronger imputation approach, K-nearest neighbors, and see how well the Bayesian network does in comparison.


In [18]:
from fancyimpute import KNN

y_pred = KNN(verbose=False).fit_transform(X_missing)[i, j]
numpy.abs(y_pred - X[i, j]).mean()


Out[18]:
0.16290500020624946

Looks like in this case the Bayesian network is better than KNN for imputing the pixels. It is, however, slower than the fancyimpute methods.

The API

Initialization

While the methods are similar across all models in pomegranate, Bayesian networks are more closely related to hidden Markov models than to the other models. This makes sense, because both rely on an explicit graphical structure.

The first way to initialize Bayesian networks is to pass in one distribution and state at a time, and then pass in edges. This is similar to hidden Markov models.


In [19]:
d1 = DiscreteDistribution({True: 0.2, False: 0.8})
d2 = DiscreteDistribution({True: 0.6, False: 0.4})
d3 = ConditionalProbabilityTable(
        [[True,  True,  True,  0.2],
         [True,  True,  False, 0.8],
         [True,  False, True,  0.3],
         [True,  False, False, 0.7],
         [False, True,  True,  0.9],
         [False, True,  False, 0.1],
         [False, False, True,  0.4],
         [False, False, False, 0.6]], [d1, d2])

s1 = State(d1, name="s1")
s2 = State(d2, name="s2")
s3 = State(d3, name="s3")

model = BayesianNetwork()
model.add_states(s1, s2, s3)
model.add_edge(s1, s3)
model.add_edge(s2, s3)
model.bake()

The other way is to use the from_samples method if given a data set.


In [20]:
numpy.random.seed(111)

X = numpy.random.randint(2, size=(15, 15))
X[:,5] = X[:,4] = X[:,3]
X[:,11] = X[:,12] = X[:,13]

model = BayesianNetwork.from_samples(X)
model.plot()


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-c38850c0bf0e> in <module>
      6 
      7 model = BayesianNetwork.from_samples(X)
----> 8 model.plot()

~/anaconda3/lib/python3.7/site-packages/pomegranate-0.12.0-py3.7-linux-x86_64.egg/pomegranate/BayesianNetwork.pyx in pomegranate.BayesianNetwork.BayesianNetwork.plot()

ValueError: must have pygraphviz installed for visualization

The structure learning algorithms are covered more in depth in the accompanying notebook.

We can look at the structure of the network by using the structure attribute. Each tuple is a node, and the integers in the tuple correspond to the parents of the node.


In [ ]:
model.structure
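As a rough sketch of how to read that attribute, the tuples can be unpacked into explicit (parent, child) pairs; the loop below is only illustrative and not part of the pomegranate API.

# structure[i] is a tuple of the column indices that are parents of column i.
edges = [(parent, child)
         for child, parents in enumerate(model.structure)
         for parent in parents]
print(edges)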

Prediction

The prediction method is similar to that of the other models. Inference is done using loopy belief propagation, which is an approximate version of the forward-backward algorithm that can be significantly faster while remaining accurate.


In [ ]:
model.predict([[False, False, False, False, None, None, False, None, False, None, True, None, None, True, False]])

The predict method will simply return the argmax of all the distributions after running the predict_proba method.


In [ ]:
model.predict_proba([[False, False, False, False, None, None, False, None, False, None, 
                      True, None, None, True, False]])
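As noted above, predict is essentially the argmax of predict_proba. Here is a minimal sketch of that relationship, assuming the unobserved entries come back as DiscreteDistribution objects as they did earlier; model.predict performs the equivalent internally.

sample = [False, False, False, False, None, None, False, None, False, None,
          True, None, None, True, False]

beliefs = model.predict_proba([sample])[0]

# Keep observed values as-is and take the most probable value from each
# returned distribution.
filled = [b if not isinstance(b, DiscreteDistribution)
          else max(b.parameters[0], key=b.parameters[0].get)
          for b in beliefs]
print(filled)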