Introduction to Probabilistic Graphical Models

Scott Hendrickson

2016-Aug-19

Requirements:

* numpy
* pandas
* libpgm
* networkx  (for plotting)

Notes: This is an introductory survey. Unlike my favorite kind of RST, we won't build up models from first assumptions in small steps; instead, I will try to survey the basic concepts with examples, demonstrations, and tie-ins to things you are already familiar with.

Why PGMs?

When we started statistics, we talked about measurements, for example, of the heights of students in your school, and noted that instances of these measurements--events--fall into a probability distribution. In this case, it was likely a Normal-looking distribution with many measurements around the average and a few measurements larger or smaller than average.

This introduced the idea of a random variable.

What do we do when measurements of a random variable fall according to some complex probability distribution? This happens all the time. Sometimes we approximate. Other times, we realize that the random variable we are measuring has some complex underlying dependencies, possibly on other distributions, and we can address these with a model.

For example, imagine that we measure the heights of 7th graders in your school and call this our random variable. Later, we realize that the average height of males and females differs by a few inches. We need a model that accounts for the dependence of height on sex at this age.

From our data we might write:

$$p(male) = 0.49$$

$$p(female) = 0.51$$

$$p(h) \propto \exp\left(-\frac{(h-\bar{h})^2}{2 \sigma_h^2}\right)$$

And now, we know that $\bar{h}$ and $\sigma_h$ will have different values for males and females.
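
To make the dependence concrete, here is a small numpy simulation of this two-group model. The sample size, group means, and standard deviations are made up for illustration:

In [ ]:
import numpy as np

rng = np.random.RandomState(42)

# sample sex first, then height conditioned on sex (all parameters invented)
n = 10000
is_male = rng.rand(n) < 0.49
heights = np.where(is_male,
                   rng.normal(158.0, 7.0, n),   # assumed male mean/sd in cm
                   rng.normal(155.0, 6.0, n))   # assumed female mean/sd in cm

# the pooled sample mixes two Normals with different means
print("male mean:   {:.1f}".format(heights[is_male].mean()))
print("female mean: {:.1f}".format(heights[~is_male].mean()))
print("pooled mean: {:.1f}".format(heights.mean()))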

This kind of problem is common! It turns out there are powerful general strategies for dealing with whole classes of problems with dependencies in the distributions of random variables.

Probabilistic Graphical Models (PGMs)

In general, we want to model the joint probability distribution of a set of random variables in terms of one another and some model parameters:

$$P(X) = P(x_1, x_2, \dots, x_n)$$

for $n$ random variables $X = (x_1, \dots, x_n)$.

(Let's assume they are all observable for today.)

Now, we can decompose this probability by applying the chain rule:

$$P(x_1, x_2, \dots, x_n) = P(x_1 | x_2, \dots, x_n)\, P(x_2 | x_3, \dots, x_n) \cdots P(x_n)$$
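
A quick numeric sanity check of this identity for three binary variables, using a made-up joint table:

In [ ]:
import numpy as np

rng = np.random.RandomState(0)
P = rng.rand(2, 2, 2)
P /= P.sum()                       # a valid joint P(x1, x2, x3)

P3 = P.sum(axis=(0, 1))            # P(x3)
P23 = P.sum(axis=0)                # P(x2, x3)
P2_given_3 = P23 / P3              # P(x2 | x3)
P1_given_23 = P / P23              # P(x1 | x2, x3)

# the chain-rule product reconstructs the joint exactly
print(np.allclose(P1_given_23 * P2_given_3 * P3, P))   # True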

If we unroll all of the terms, this is a pretty big mess. But for many problems there is another simplification. In general, if $x_1$ is independent of $x_2$, then

$$P(x_1|x_2) = P(x_1)$$

$$P(x_1, x_2) = P(x_1)P(x_2)$$

Also, with Bayes' rule, we can invert dependencies:

$$P(y|x) = \frac{P(y)P(x|y)}{P(x)}$$

Back to our height example...

For height and sex, we have,

$$P(h,s) = P(h|s)P(s)$$

Observations:

  • We could have decomposed $P(s,h)$ the other way and ended up with $P(s|h)P(h)$
  • Bayes' rule lets us swap the "sex is dependent on height" case for the case where our model says "height is dependent on sex"
  • Direction is somewhat of a choice for simultaneous observations (determining causality is still hard in PGMs)
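
A quick numeric check of that swap, with made-up numbers (these probabilities are invented for illustration, not fitted to any data):

In [ ]:
p_f = 0.51              # p(female)
p_m = 0.49              # p(male)
p_tall_f = 0.20         # assumed: p(h > 160 cm | female)
p_tall_m = 0.30         # assumed: p(h > 160 cm | male)

p_tall = p_tall_f * p_f + p_tall_m * p_m   # marginalize over sex
p_f_tall = p_tall_f * p_f / p_tall         # Bayes: p(female | h > 160 cm)
print("p(female | h > 160 cm) = {:.3f}".format(p_f_tall))   # ~0.410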

In general, we can represent this decomposition as a graph.

To do this, map it like this:

  • Nodes: random variables.
  • Each node has a Conditional Probability Distribution (CPD).
  • Edges: dependencies.

There are more tractable graphs and less tractable graphs for problem solving. An example of one important property is whether P factorizes over G. We say a joint probability distribution (JPD) P factorizes over a graph G if P can be encoded by:

$$P(x_1, x_2, \dots, x_n) = \prod_{i=1}^n P(x_i|Par_G(x_i))$$

where $Par_G(x_i)$ is the set of parents of $x_i$ in $G$.
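
For example, for the Titanic model we build below (edges age → pclass, sex → survived, pclass → survived), the factorization reads

$$P(age, sex, pclass, survived) = P(age)\,P(sex)\,P(pclass|age)\,P(survived|sex, pclass)$$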


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import networkx as nx

# a minimal two-node model: sex -> height
G = nx.DiGraph()
G.add_edge('sex', 'height', weight=0.6)
nx.draw_networkx(G, node_color='y', node_size=2000, width=3)
plt.axis('off')
plt.show()

Why is this formalism a useful probabilistic problem-solving tool?

This tool can model a much more general set of joint probability distributions than our simple student-height problem.

Some advantages of managing statistical thinking with graphs include:

We can use graphs for book-keeping about distributions and dependencies. Practical models can have many nodes; graphical models help us manage that complexity.

We can talk generally about classes of graphs and sub-graphs that solve problems. E.g.

  • Undirected graphs are Markov models (Markov random fields)
  • Directed Acyclic Graphs (DAGs) are called Bayesian models (we will work here for the rest of the day)

We can state reasoning rules, consistent with the statistical assumptions, that allow us to understand and transform nodes or groups of nodes, e.g. by using rules of graph manipulation to reduce, simplify or cut graphs.

Learning and Reasoning Tasks

There are some jobs we need to figure out how to do in order to make this tool practically useful. Maybe in order of obviousness:

  1. Ask questions about the probabilities represented by the model. I.e. Inference:

    • Causal Reasoning is looking for downstream effects of assumptions about parents. E.g. Given a male student, how likely is it he will be 165 cm in height?
    • Evidential Reasoning is looking for upstream effects of assumptions about children. E.g. Given a height of 174 cm, what is the probability that the student is female?
    • Intercausal Reasoning. Given an observation of a common effect, what can we say about the likelihood of one or the other of its "causal" measurements?
  2. Learn the parameters of a model given graph structure and representative data (e.g. Naive Bayes, LDA topic modeling)

  3. Learn the structure of the graph given representative data

  4. Build heuristics and rules by which we can reason about graphs. E.g. How do we convert Bayesian models to Markov models? What sorts of structures allow influence to propagate, and which do not? How do we simplify complex sub-graphs? In general, learn how to reason about statistics by learning the rules of reasoning about PGMs.

Tasks and Examples

Let's try to look at 1-3. Also, I am going to switch 1 and 2 so we have a graph model with fitted parameters to use for inference.

Practical problem solving often proceeds from guessing the graph. Graph discovery may require big, big data and be computationally challenging.


In [ ]:
import numpy as np
import pandas as pd
import csv
import json

from libpgm.graphskeleton import GraphSkeleton
from libpgm.nodedata import NodeData
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.tablecpdfactorization import TableCPDFactorization
from libpgm.pgmlearner import PGMLearner

And now, for some data

Survival data from the Titanic, based on age, sex and travel class.


In [ ]:
titanic = pd.read_csv("./data/titanic3.csv")
titanic.head()

In [ ]:
titanic.describe()

A pivot table might give another useful summary.


In [ ]:
ptable = pd.pivot_table(titanic, values=["name"], columns=["survived", "pclass","sex"], aggfunc=lambda x: len(x.unique()), margins=True)
print ptable

In [ ]:
# housekeeping
# libpgm wants each data row as a {node: value} dict
with open("./data/titanic3.csv") as f:
    rdr = csv.reader(f)
    headers = next(rdr, None)
    data = [{k: float(v) for k, v in zip(headers, row) if k != "name"} for row in rdr]
headers.remove("name")  # not going to model survival based on name
#print data

Learn Parameters of a Graph Model given Data

Guess a graph for the model. I guess this:

  • age determines class -- older people have more money and less patience
  • survival is determined by sex -- "women and children first"
  • survival is determined by class of travel -- people in steerage had farther to go to get out

In [ ]:
pgn = {
    "V": headers,
    "E": [["age", "pclass"],
        ["sex", "survived"],
         ["pclass", "survived"]],
    "Vdata": None }
# print pgn
G=nx.DiGraph()
for f,t in pgn["E"]:
    G.add_edge(f,t)
nx.draw_networkx(G, node_color='y',node_size=2000, width=3)
plt.axis('off')
plt.show()

Some choices:

  • Bayesian model (directed graph; probabilities "propagate")
  • Discrete distributions on the nodes (continuous is another world)
  • A common algorithm for fitting parameters is a maximum likelihood algorithm. There are others.

While it is totally worth seeing how the fitting is done by hand, here we just call the library so we get to look at examples of most of our tasks. A quick sketch of the idea follows.
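
For intuition: for discrete nodes, maximum likelihood estimation amounts to counting. Each CPD entry is just a normalized count of how often a value occurs under each configuration of its parents. A minimal pandas sketch of the idea for the survived node, reusing the data list from the housekeeping cell (an illustration, not libpgm's implementation):

In [ ]:
df = pd.DataFrame(data)

# MLE for a discrete CPD: p(survived | sex, pclass) = joint counts / parent totals
table = df.groupby(["sex", "pclass", "survived"]).size().unstack("survived")
cpd = table.div(table.sum(axis=1), axis=0)  # each row now sums to 1
print(cpd)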


In [ ]:
skel = GraphSkeleton()
skel.V = pgn["V"]
skel.E = pgn["E"]
skel.toporder()

learner = PGMLearner()
result = learner.discrete_mle_estimateparams(skel, data)

Now the nodes have conditional probability information stored in them. For example,


In [ ]:
pd.DataFrame(result.Vdata["sex"]["cprob"]).transpose()

In [ ]:
pd.DataFrame(result.Vdata["age"]["cprob"]).transpose()

Now let's look at a downstream node.


In [ ]:
pd.DataFrame(result.Vdata["pclass"]["cprob"]).transpose()

Causal Reasoning

Set some assumptions and see how they change the marginal probabilities associated with the other nodes in the graph.

Note: Querying alters the graph's tables, so the factorization has to be rebuilt for each new set of assumptions.


In [ ]:
# use our solutions from above
nd = NodeData()
nd.Vdata = result.Vdata
nd.alldata = None
bn = DiscreteBayesianNetwork(skel, nd)

In [ ]:
# specificquery(query, evidence) computes P(query | evidence)
# sex coding assumed from the original labels: 0 = male, 1 = female
tcpd = TableCPDFactorization(bn)
print "What is p(female)? {:.3%}".format(
        tcpd.specificquery(dict(sex=[1]), dict())
)

# a query alters the tables, so rebuild tcpd each time
tcpd = TableCPDFactorization(bn)
print "What is p(male)? {:.3%}".format(
        tcpd.specificquery(dict(sex=[0]), dict())
)

tcpd = TableCPDFactorization(bn)
print "What is p(female | survived=1)? {:.3%}".format(
         tcpd.specificquery(dict(sex=[1]), dict(survived=1))
)

tcpd = TableCPDFactorization(bn)
print "What is p(male | survived=0)? {:.3%}".format(
         tcpd.specificquery(dict(sex=[0]), dict(survived=0))
)

tcpd = TableCPDFactorization(bn)
print "What is p(male, pclass=3 | survived=0)? {:.3%}".format(
         tcpd.specificquery(dict(sex=[0], pclass=[3.0]), dict(survived=0))
)

In [ ]:
# maybe useful for comparison
pd.pivot_table(titanic, values=["name"], columns=["sex", "pclass","survived"], aggfunc=lambda x: len(x.unique()))
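
To turn those counts into a number comparable with the last model query (assuming sex is coded numerically in the file, with 0 = male as in the labels above):

In [ ]:
# empirical version of p(male, pclass=3 | survived=0), straight from the raw counts
died = titanic[titanic["survived"] == 0]
p_emp = float(len(died[(died["sex"] == 0) & (died["pclass"] == 3)])) / len(died)
print("empirical p(male, pclass=3 | survived=0): {:.3%}".format(p_emp))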

Learn Graph Structure

One general strategy uses a score to choose among structures:

  1. choose a score (for example AIC or BIC)
  2. grab a node, make an edge
  3. fit a model to the data
  4. calculate the score
  5. if the score improves, keep the edge
  6. go to 2 until you run out of edges to try

This is a lot of computation. People have tried all kinds of heuristics, simplifications and sophisticated calculation-caching schemes. Some approaches can handle thousands of nodes in hours.
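
Here is a minimal sketch of that greedy recipe, with a hand-rolled BIC-style score (log-likelihood minus a parameter penalty, so higher is better as defined here). It reuses the data list and networkx from above; it illustrates the idea and is not a production structure learner:

In [ ]:
def bic_score(df, edges):
    # score = log-likelihood of the factorized model minus a parameter penalty
    n = float(len(df))
    score = 0.0
    for v in df.columns:
        parents = [p for p, c in edges if c == v]
        joint = df.groupby(parents + [v]).size().astype(float)
        if parents:
            cond = joint / joint.groupby(level=parents).transform(sum)
        else:
            cond = joint / n
        score += (joint * np.log(cond)).sum()
        k = (df[v].nunique() - 1) * np.prod([float(df[p].nunique()) for p in parents])
        score -= 0.5 * k * np.log(n)
    return score

def greedy_structure(df, candidate_edges):
    # greedily add one edge at a time; keep it only if the score improves
    # and the graph stays acyclic
    edges = []
    best = bic_score(df, edges)
    improved = True
    while improved:
        improved = False
        for e in candidate_edges:
            if e in edges:
                continue
            trial = edges + [e]
            if not nx.is_directed_acyclic_graph(nx.DiGraph(trial)):
                continue
            s = bic_score(df, trial)
            if s > best:
                edges, best, improved = trial, s, True
    return edges

df = pd.DataFrame(data)
print(greedy_structure(df, [("age", "pclass"), ("sex", "survived"), ("pclass", "survived")]))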

Another scheme uses constraint-based independence testing. That's what we will do...
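
The core subroutine in constraint-based learning is an independence test. Here is a hand-rolled sketch of one such test for two discrete columns, a Pearson chi-squared statistic (for illustration; not necessarily the test lg_constraint_estimatestruct computes internally):

In [ ]:
def chi2_stat(df, a, b):
    # Pearson chi-squared statistic: compare observed cell counts
    # with the counts expected if columns a and b were independent
    observed = pd.crosstab(df[a], df[b]).values.astype(float)
    expected = observed.sum(1)[:, None] * observed.sum(0)[None, :] / observed.sum()
    return ((observed - expected) ** 2 / expected).sum()

df = pd.DataFrame(data)
# large values suggest dependence (keep an edge); small values suggest independence
print(chi2_stat(df, "sex", "survived"))
print(chi2_stat(df, "sex", "pclass"))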


In [ ]:
# instantiate my learner 
learner = PGMLearner()

# estimate structure
result = learner.lg_constraint_estimatestruct(data, indegree=1)

# output
print json.dumps(result.E, indent=2)
print json.dumps(result.V, indent=2)

G=nx.DiGraph()
for f,t in result.E:
    G.add_edge(f,t,weight=0.6)
nx.draw_networkx(G, node_color='y',node_size=2000, width=3)
plt.axis('off')
plt.show()

Different model, so learn new parameters.


In [ ]:
skel = GraphSkeleton()
skel.V = result.V
skel.E = result.E
skel.toporder()

learner = PGMLearner()
result = learner.discrete_mle_estimateparams(skel, data)

Queries with New Model


In [ ]:
nd = NodeData()
nd.Vdata = result.Vdata
nd.alldata = None
bn = DiscreteBayesianNetwork(skel, nd)

# specificquery(query, evidence) computes P(query | evidence)
# sex coding assumed as before: 0 = male, 1 = female
tcpd = TableCPDFactorization(bn)
print "What is p(female)? {:.3%}".format(
        tcpd.specificquery(dict(sex=[1]), dict())
)

# a query alters the tables, so rebuild tcpd each time
tcpd = TableCPDFactorization(bn)
print "What is p(male)? {:.3%}".format(
        tcpd.specificquery(dict(sex=[0]), dict())
)

tcpd = TableCPDFactorization(bn)
print "What is p(female | survived=1)? {:.3%}".format(
         tcpd.specificquery(dict(sex=[1]), dict(survived=1))
)

tcpd = TableCPDFactorization(bn)
print "What is p(male | survived=0)? {:.3%}".format(
         tcpd.specificquery(dict(sex=[0]), dict(survived=0))
)

tcpd = TableCPDFactorization(bn)
print "What is p(male, pclass=3 | survived=0)? {:.3%}".format(
         tcpd.specificquery(dict(sex=[0], pclass=[3.0]), dict(survived=0))
)

Graph types and reasoning about probabilities

Lots to do here, but for another day. A short read here hints at how much I skipped: https://en.wikipedia.org/wiki/Graphical_model

Some other types:

  • A factor graph is an undirected bipartite graph connecting variables and factors. Each factor represents a function over the variables it is connected to. This is a helpful representation for understanding and implementing belief propagation.
  • A clique tree or junction tree is a tree of cliques, used in the junction tree algorithm.
  • A chain graph is a graph which may have both directed and undirected edges, but without any directed cycles (i.e. if we start at any vertex and move along the graph respecting the directions of any arrows, we cannot return to the vertex we started from if we have passed an arrow). Both directed acyclic graphs and undirected graphs are special cases of chain graphs, which can therefore provide a way of unifying and generalizing Bayesian and Markov networks.[2]
  • An ancestral graph is a further extension, having directed, bidirected and undirected edges.[3]
  • A conditional random field is a discriminative model specified over an undirected graph.
  • A restricted Boltzmann machine is a bipartite generative model specified over an undirected graph.

But maybe I gave you a place to start?

