Notes: This is an introductory survey. Unlike my favorite kind of RST, we won't build up models from first assumptions in small steps; instead I will try to survey the basic concepts with examples, demonstrations, and tie-ins to things with which you are already familiar.
When we started statistics, we talked about measurements, for example, of the heights of students in your school, and noted that instances of these measurements--events--fall into a probability distribution. In this case, it was likely a Normal-looking distribution with many measurements around the average and a few measurements larger or smaller than average.
This introduced the idea of a random variable.
What do we do when measurements of a random variable fall according to some complex probability distribution? This happens all the time. Sometimes we approximate. Other times, we realize that the random variable we are measuring has some complex underlying dependencies, possibly on other distributions, and we can address these with a model.
For example, imagine that we measure the heights of 7th graders in your school and call this our random variable. Later, we realize that the average height of males and females is different by a few inches. We need a model that accounts for the dependence of height on sex at this age.
From our data we might write:
$$p(male) = .49$$$$p(female) = .51$$$$p(h) \propto \exp\left(-\frac{(h-\bar{h})^2}{2 \sigma_h^2}\right)$$And now, we know that $\bar{h}$ and $\sigma_h$ will have different values for males and females.
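Here is a minimal simulation sketch of that two-part model: sample sex first, then height conditioned on sex. The split, means, and spreads below are made-up illustration values, not measurements from any school.
In [ ]:
import numpy as np

np.random.seed(0)
n = 1000

# Assumed illustration values: sex split and sex-specific height parameters (cm).
p_male = 0.49
mean_h = {"male": 157.0, "female": 155.0}  # assumed 7th-grade averages
sd_h = {"male": 8.0, "female": 7.0}        # assumed spreads

# Sample sex first, then height conditioned on sex.
sexes = np.where(np.random.rand(n) < p_male, "male", "female")
heights = np.array([np.random.normal(mean_h[s], sd_h[s]) for s in sexes])

print "overall mean height: {:.1f} cm".format(heights.mean())
print "mean height (male): {:.1f} cm".format(heights[sexes == "male"].mean())
print "mean height (female): {:.1f} cm".format(heights[sexes == "female"].mean())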
This kind of problem is common! It turns out there are powerful general strategies for dealing with whole classes of problems of dependencies in the distribution of random variables.
In general, we want to model the joint probability distribution for a random variable in terms of other random variables and model parameters.
$$P(X) = P(x_1, x_2, \dots ,x_n)$$for a set $X$ of $n$ random variables.
(Let's assume they are all observable for today.)
Now, we can decompose this probability by applying the chain rule:
$$P(x_1, x_2, \dots ,x_n) = P(x_1| x_2, \dots ,x_n) P(x_2 | x_3, \dots ,x_n) \dots P(x_n)$$If we unroll all of the terms, this is a pretty big mess. But, for many problems there is another simplification. In general, if $x_1$ is independent of $x_2$ then,
$$P(x_1|x_2) = P(x_1)$$$$P(x_1, x_2) = P(x_1)P(x_2)$$Also, with Bayes' rule, we can invert dependencies:
$$P(y|x) = \frac{P(y)P(x|y)}{P(x)}$$
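As a quick numeric illustration of inverting a dependency: suppose we know P(s) and P(tall | s) and want P(s = female | tall). Every probability below is an assumed illustration value, not data.
In [ ]:
# Hedged example: invert P(tall | sex) into P(sex | tall) with Bayes' rule.
# The probabilities below are made-up illustration values, not measurements.
p_female = 0.51
p_male = 0.49
p_tall_given_female = 0.10  # assumed chance a female 7th grader is "tall"
p_tall_given_male = 0.25    # assumed chance a male 7th grader is "tall"

# Total probability of being tall (the denominator P(x) in Bayes' rule).
p_tall = p_tall_given_female * p_female + p_tall_given_male * p_male

# Bayes' rule: P(female | tall) = P(tall | female) P(female) / P(tall)
p_female_given_tall = p_tall_given_female * p_female / p_tall
print "P(female | tall) = {:.3f}".format(p_female_given_tall)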
For height and sex, we have,
$$P(h,s) = P(h|s)P(s)$$Observations:
To represent these dependencies as a graph, map it like this:
There are more tractable graphs and less tractable graphs for problem solving. An example of one important property is whether P factorizes over G. We say a JPD P factorizes over graph G if P can be encoded by:
$$P(x_1, x_2, \dots ,x_n) = \prod_{i=1}^n P(x_i|Par_G(x_i))$$where $Par_G(x_i)$ is the set of parents of $x_i$ in $G$.
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import networkx as nx
G=nx.DiGraph()
G.add_edge('sex','height',weight=0.6)
nx.draw_networkx(G, node_color='y',node_size=2000, width=3)
plt.axis('off')
plt.show()
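To make the factorization concrete for this two-node graph, here is a minimal sketch that builds the joint table P(s, h) from P(s) and a discretized P(h | s). The numbers are assumptions for illustration only, not estimates from any data set.
In [ ]:
import numpy as np

# Assumed illustration values, not estimates from data.
p_s = np.array([0.49, 0.51])                  # P(sex): [male, female]
p_h_given_s = np.array([[0.30, 0.50, 0.20],   # P(height bin | male)
                        [0.40, 0.45, 0.15]])  # P(height bin | female); bins: short, medium, tall

# Factorization over the graph sex -> height: P(s, h) = P(s) P(h | s)
joint = p_s[:, None] * p_h_given_s
print joint
print "row sums recover P(s):", joint.sum(axis=1)
print "total probability:", joint.sum()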
This tool can model a much more general set of Joint Probability Distributions than our simple student-height problem.
Some advantages of managing statistical thinking with graphs include:
We can use graphs for book-keeping about distributions and dependencies. Practical models can have many nodes. Graphs models help us manage complexity.
We can talk generally about classes of graphs and sub-graphs that solve problems. E.g.
We can state reasoning rules that are consistent with the statistical assumptions and that allow us to understand and transform nodes or groups of nodes. E.g. by using rules of graph manipulation to reduce, simplify or cut graphs.
There are some jobs we need to figure out how to do in order to make this tool practically useful. Maybe in order of obviousness:
Ask questions about the probabilities represented by the model, i.e. perform inference.
Learn the parameters of a model given graph structure and representative data (e.g. naive Bayes, LSA topic modeling)
Learn the structure of the graph given representative data
Build heuristics and rules by which we can reason about graphs. E.g. How do we convert Bayesian models to Markov models? What sorts of structures allow influence to propagate and which do not? How do we simplify complex sub-graphs? In general, learn how to reason about statistics by learning the rules of reasoning about PGMs.
In [ ]:
import numpy as np
import pandas as pd
import csv
import json
from libpgm.graphskeleton import GraphSkeleton
from libpgm.nodedata import NodeData
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.tablecpdfactorization import TableCPDFactorization
from libpgm.pgmlearner import PGMLearner
In [ ]:
titanic = pd.DataFrame.from_csv("./data/titanic3.csv", index_col = None)
titanic.head()
In [ ]:
titanic.describe()
A pivot table might give another useful summary.
In [ ]:
ptable = pd.pivot_table(titanic, values=["name"], columns=["survived", "pclass","sex"], aggfunc=lambda x: len(x.unique()), margins=True)
print ptable
In [ ]:
# housekeeping
# libpgm needs data as node:value list for each row
with open("./data/titanic3.csv") as f:
rdr = csv.reader(f, )
headers = next(rdr, None)
data = [{k:float(v) for k,v in zip(headers, row) if k !="name"} for row in rdr]
headers.remove("name") # not going to model survival based on name
#print data
Guess a graph for the model. I guess this:
In [ ]:
pgn = {
"V": headers,
"E": [["age", "pclass"],
["sex", "survived"],
["pclass", "survived"]],
"Vdata": None }
# print pgn
G=nx.DiGraph()
for f,t in pgn["E"]:
G.add_edge(f,t)
nx.draw_networkx(G, node_color='y',node_size=2000, width=3)
plt.axis('off')
plt.show()
Some choices:
While it is totally worth seeing how the estimation is done under the hood, here we just do it with the library so we get to look at examples of most of our tasks.
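For discrete nodes, maximum-likelihood estimation reduces to counting: for each node and each configuration of its parents, the estimated conditional probability is the observed relative frequency,
$$\hat{P}(x_i = a \mid Par_G(x_i) = u) = \frac{N(x_i = a,\, Par_G(x_i) = u)}{N(Par_G(x_i) = u)}$$where $N(\cdot)$ counts matching rows of data. That is, in essence, what discrete_mle_estimateparams fills into each node's conditional probability table below.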
In [ ]:
skel = GraphSkeleton()
skel.V = pgn["V"]
skel.E = pgn["E"]
skel.toporder()
learner = PGMLearner()
result = learner.discrete_mle_estimateparams(skel, data)
Now the nodes have conditional probability information stored in them. For example,
In [ ]:
pd.DataFrame(result.Vdata["sex"]["cprob"]).transpose()
In [ ]:
pd.DataFrame(result.Vdata["age"]["cprob"]).transpose()
Now let's look at a downstream node.
In [ ]:
pd.DataFrame(result.Vdata["pclass"]["cprob"]).transpose()
In [ ]:
# use our solutions from above
nd = NodeData()
nd.Vdata = result.Vdata
nd.alldata = None
bn = DiscreteBayesianNetwork(skel, nd)
In [ ]:
# each specificquery call alters the factor tables, so rebuild the factorization before every query
tcpd = TableCPDFactorization(bn)
print "What is p(male=0)? {:.3%}".format(
tcpd.specificquery(dict(sex=[1]), dict())
)
tcpd = TableCPDFactorization(bn)
print "What is p(female=1)? {:.3%}".format(
tcpd.specificquery(dict(sex=[0]), dict())
)
# query alters tables
tcpd = TableCPDFactorization(bn)
print "What is p(female=1,survived=1)? {:.3%}".format(
tcpd.specificquery(dict(sex=[1]), dict(survived=1))
)
# query alters tables
tcpd = TableCPDFactorization(bn)
print "What is p(male=0,survived=0)? {:.3%}".format(
tcpd.specificquery(dict(sex=[0]), dict(survived=0))
)
# query alters tables
tcpd = TableCPDFactorization(bn)
print "What is p(male=0,class=3,survived=0)? {:.3%}".format(
tcpd.specificquery(dict(sex=[0],pclass=[3.0]), dict(survived=0))
)
In [ ]:
# maybe useful for comparison
pd.pivot_table(titanic, values=["name"], columns=["sex", "pclass","survived"], aggfunc=lambda x: len(x.unique()))
One general strategy uses a score to rank candidate structures and searches for a high-scoring graph.
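One common choice, mentioned here only as background (the libpgm call below uses the constraint-based approach instead), is a BIC-style score that balances fit against complexity:
$$\mathrm{score}(G) = \log P(D \mid \hat{\theta}_G, G) - \frac{\log N}{2} \dim(G)$$where $D$ is the data, $N$ the number of rows, $\hat{\theta}_G$ the maximum-likelihood parameters for structure $G$, and $\dim(G)$ the number of free parameters. Structure search then looks for a graph that maximizes the score.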
This is a lot of computation. People have tried all kinds of heuristics, simplifications and sophisticated calculation-caching schemes. Some schemes can scale to 1000s of nodes in hours.
Another scheme uses constraint-based estimation: test conditional independencies in the data and keep only structures consistent with them. That's what we will do...
In [ ]:
# instantiate my learner
learner = PGMLearner()
# estimate structure
result = learner.lg_constraint_estimatestruct(data, indegree=1)
# output
print json.dumps(result.E, indent=2)
print json.dumps(result.V, indent=2)
G=nx.DiGraph()
for f,t in result.E:
G.add_edge(f,t,weight=0.6)
nx.draw_networkx(G, node_color='y',node_size=2000, width=3)
plt.axis('off')
plt.show()
In [ ]:
skel = GraphSkeleton()
skel.V = result.V
skel.E = result.E
skel.toporder()
learner = PGMLearner()
result = learner.discrete_mle_estimateparams(skel, data)
In [ ]:
nd = NodeData()
nd.Vdata = result.Vdata
nd.alldata = None
bn = DiscreteBayesianNetwork(skel, nd)
# query alters tables
tcpd = TableCPDFactorization(bn)
print "What is p(male=0)? {:.3%}".format(
tcpd.specificquery(dict(sex=[1]), dict())
)
tcpd = TableCPDFactorization(bn)
print "What is p(female=1)? {:.3%}".format(
tcpd.specificquery(dict(sex=[0]), dict())
)
# query alters tables
tcpd = TableCPDFactorization(bn)
print "What is p(female=1,survived=1)? {:.3%}".format(
tcpd.specificquery(dict(sex=[1]), dict(survived=1))
)
# query alters tables
tcpd = TableCPDFactorization(bn)
print "What is p(male=0,survived=0)? {:.3%}".format(
tcpd.specificquery(dict(sex=[0]), dict(survived=0))
)
# query alters tables
tcpd = TableCPDFactorization(bn)
print "What is p(male=0,class=3,survived=0)? {:.3%}".format(
tcpd.specificquery(dict(sex=[0],pclass=[3.0]), dict(survived=0))
)
Lots to do here, but for another day. A short read here hints at how much I skipped: https://en.wikipedia.org/wiki/Graphical_model
Some other types:
But maybe I gave you a place to start?