In [ ]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
Before we can proceed with statistical inference on graphs, we must first refresh ourselves with some ideas from the world of statistics. Otherwise, the methods that we will end up using may seem a tad weird, and hence difficult to follow.
To review statistical ideas, let's set up a few statements and explore what they mean.
Another concrete example is to consider a six-sided die. Here, the space of all possibilities of "stuff" is the set of numbers in the range $[1, 6]$. If we have a fair die, then we assign 1/6 credibility points to each of the numbers. (An unfair die will have an unequal distribution of credibility points across the faces.)
If we receive an undirected social network graph with 5 nodes and 6 edges, we have to keep in mind that this graph with 6 edges was merely one of $\binom{15}{6}$ ways to construct a 5-node, 6-edge graph. (15 comes up because there are $\binom{5}{2} = 15$ possible edges in a 5-node undirected graph.)
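To put a number on this, Python's standard library can count those configurations directly:
In [ ]:
from math import comb

# Number of ways to choose 6 edges out of the 15 possible edges
# in a 5-node undirected graph.
comb(15, 6)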
As an example, hypothesis testing in coin flips follows this logic:
Firstly, we calculate a summary statistic that describes our observations, such as the number of heads in 10 tosses.
Secondly, we propose a null model, the fair coin, and calculate our summary statistic under simulated tosses of that fair coin.
Thirdly, we look at the probability of observing a count of heads at least as extreme as the one we observed, under the fair coin distribution.
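To make this concrete, here is a minimal sketch of that protocol; the observed count of 8 heads in 10 tosses is an assumption for illustration:
In [ ]:
import numpy as np

rng = np.random.default_rng(42)

# Step 1: the observed summary statistic (assumed for illustration).
n_tosses = 10
observed_heads = 8

# Step 2: simulate the summary statistic under the null model of a fair coin.
null_heads = rng.binomial(n=n_tosses, p=0.5, size=10_000)

# Step 3: the probability of a result at least as extreme as the observed one.
(null_heads >= observed_heads).mean()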
The same protocol applies when we perform hypothesis testing on graphs.
Firstly, we calculate a summary statistic that describes our graph.
Secondly, we propose a null graph model, and calculate our summary statistic under simulated versions of that null graph model.
Thirdly, we look at the probability of observing a summary statistic at least as extreme as the value we calculated in step 1, under the distribution given by the null graph model.
One easy model to study is the Erdos-Renyi graph, also known as the "binomial" graph.
The data generation story here is that we instantiate an undirected graph with $n$ nodes, giving $\frac{n^2 - n}{2}$ possible edges. Each edge has a probability $p$ of being created.
In [ ]:
import networkx as nx

# Instantiate an Erdos-Renyi graph with 30 nodes and edge probability 0.2.
G_er = nx.erdos_renyi_graph(n=30, p=0.2)
In [ ]:
nx.draw(G_er)
You can verify that approximately 20% of the $\frac{30^2 - 30}{2} = 435$ possible edges are present.
In [ ]:
len(G_er.edges())
In [ ]:
len(G_er.edges()) / 435
We can also look at the degree distribution:
In [ ]:
import pandas as pd
from nams.functions import ecdf
import matplotlib.pyplot as plt
x, y = ecdf(pd.Series(dict(nx.degree(G_er))))
plt.scatter(x, y)
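For reference, an ECDF helper like the one imported from `nams.functions` can be sketched in a few lines; this shows the idea, though the actual implementation may differ:
In [ ]:
import numpy as np

def ecdf_sketch(data):
    """Return x (sorted values) and y (cumulative fraction) for an ECDF plot."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y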
Another model we can explore is the Barabasi-Albert graph, which is built by preferential attachment: each new node attaches to $m$ existing nodes chosen with probability proportional to their degree.
In [ ]:
G_ba = nx.barabasi_albert_graph(n=30, m=3)
nx.draw(G_ba)
In [ ]:
len(G_ba.edges())
And the degree distribution:
In [ ]:
x, y = ecdf(pd.Series(dict(nx.degree(G_ba))))
plt.scatter(x, y)
You can see that even though the two graphs have a similar number of edges, their degree distributions are wildly different.
For this notebook, we are going to look at a protein-protein interaction network, and test the hypothesis that this network was not generated by the data generating process described by an Erdos-Renyi graph.
Let's load a protein-protein interaction network dataset.
This undirected network contains protein interactions in yeast. Research showed that proteins with a high degree were more important for the survival of the yeast than others. A node represents a protein and an edge represents a metabolic interaction between two proteins. The network contains loops.
In [ ]:
from nams import load_data as cf
G = cf.load_propro_network()
As is always the case, let's make sure we know some basic stats of the graph.
In [ ]:
len(G.nodes())
In [ ]:
len(G.edges())
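The dataset description mentions that the network contains self-loops; networkx can count these directly:
In [ ]:
# Count self-loop edges (proteins annotated as interacting with themselves).
nx.number_of_selfloops(G)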
Let's also examine the degree distribution of the graph.
In [ ]:
x, y = ecdf(pd.Series(dict(nx.degree(G))))
plt.scatter(x, y)
Given a graph dataset, how do we identify which data generating model provides the best fit?
One way to do this is to compare characteristics of the graph generating model against the characteristics of the observed graph. The logic here is that if we have a good generating model for the data, then graphs simulated from that model should, in theory, reproduce the characteristics of the observed graph.
In [ ]:
from ipywidgets import interact, IntSlider
m = IntSlider(value=2, min=1, max=10)
@interact(m=m)
def compare_barabasi_albert_graph(m):
    fig, ax = plt.subplots()
    G_ba = nx.barabasi_albert_graph(n=len(G.nodes()), m=m)
    x, y = ecdf(pd.Series(dict(nx.degree(G_ba))))
    ax.scatter(x, y, label="Barabasi-Albert Graph")
    x, y = ecdf(pd.Series(dict(nx.degree(G))))
    ax.scatter(x, y, label="Protein Interaction Network")
    ax.legend()
In [ ]:
from ipywidgets import FloatSlider
p = FloatSlider(value=0.001, min=0, max=0.1, step=0.001)

@interact(p=p)
def compare_erdos_renyi_graph(p):
    fig, ax = plt.subplots()
    G_er = nx.erdos_renyi_graph(n=len(G.nodes()), p=p)
    x, y = ecdf(pd.Series(dict(nx.degree(G_er))))
    ax.scatter(x, y, label="Erdos-Renyi Graph")
    x, y = ecdf(pd.Series(dict(nx.degree(G))))
    ax.scatter(x, y, label="Protein Interaction Network")
    ax.legend()
    ax.set_title(f"p={p}")
Given the degree distribution only, which model do you think better describes the generation of a protein-protein interaction network?
Each time we plug in a value of $m$ for the Barabasi-Albert graph model, we are using one of many possible Barabasi-Albert graph models, each with a different $m$. Similarly, each time we choose a different $p$ for the Erdos-Renyi model, we are using one of many possible Erdos-Renyi graph models, each with a different $p$.
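As a rough heuristic for picking starting parameters (an assumption for illustration, not a formal estimator), we can match the observed edge count: an Erdos-Renyi graph has $p \cdot \frac{n(n-1)}{2}$ expected edges, while a Barabasi-Albert graph has approximately $m \cdot n$ edges.
In [ ]:
n_nodes = len(G.nodes())
n_edges = len(G.edges())

# Erdos-Renyi: solve n_edges = p * n(n-1)/2 for p.
p_hat = n_edges / (n_nodes * (n_nodes - 1) / 2)

# Barabasi-Albert: the model produces roughly m * n edges, so m ~ edges/nodes.
m_hat = round(n_edges / n_nodes)

p_hat, m_hat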
To quantitatively compare degree distributions, we can use the Wasserstein distance between them. Let's see how to implement this.
In [ ]:
from scipy.stats import wasserstein_distance
def erdos_renyi_degdist(n, p):
    """Return a pandas Series of the degree distribution of an Erdos-Renyi graph."""
    G = nx.erdos_renyi_graph(n=n, p=p)
    return pd.Series(dict(nx.degree(G)))


def barabasi_albert_degdist(n, m):
    """Return a pandas Series of the degree distribution of a Barabasi-Albert graph."""
    G = nx.barabasi_albert_graph(n=n, m=m)
    return pd.Series(dict(nx.degree(G)))
In [ ]:
deg = pd.Series(dict(nx.degree(G)))
er_deg = erdos_renyi_degdist(n=len(G.nodes()), p=0.001)
ba_deg = barabasi_albert_degdist(n=len(G.nodes()), m=1)
wasserstein_distance(deg, er_deg), wasserstein_distance(deg, ba_deg)
Notice that because the graphs are instantiated in a non-deterministic fashion, re-running the cell above will give you different values for each new graph generated.
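If you need reproducible draws, the networkx graph generators accept a `seed` argument:
In [ ]:
# Fixing the seed makes the generated graph, and hence the distance, deterministic.
G_seeded = nx.erdos_renyi_graph(n=30, p=0.2, seed=42)
len(G_seeded.edges())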
Let's now plot the Wasserstein distance to our graph data for the two particular Erdos-Renyi and Barabasi-Albert graph models shown above.
In [ ]:
import matplotlib.pyplot as plt
from tqdm.autonotebook import tqdm
er_dist = []
ba_dist = []
for _ in tqdm(range(100)):
    er_deg = erdos_renyi_degdist(n=len(G.nodes()), p=0.001)
    er_dist.append(wasserstein_distance(deg, er_deg))

    ba_deg = barabasi_albert_degdist(n=len(G.nodes()), m=1)
    ba_dist.append(wasserstein_distance(deg, ba_deg))
In [ ]:
import seaborn as sns
import janitor
data = (
    pd.DataFrame(
        {
            "Erdos-Renyi": er_dist,
            "Barabasi-Albert": ba_dist,
        }
    )
    .melt(value_vars=["Erdos-Renyi", "Barabasi-Albert"])
    .rename_columns({"variable": "Graph Model", "value": "Wasserstein Distance"})
)
sns.swarmplot(data=data, x="Graph Model", y="Wasserstein Distance")
From this, we might conclude that the Barabasi-Albert graph with $m=1$ has the better fit to the protein-protein interaction network graph.
That statement, accurate as it might be, still does not connect the dots to biology.
Let's think about the generative model for this graph. The Barabasi-Albert graph gives us a model for "rich gets richer": the graph grows one node at a time, and each new node attaches to $m$ existing nodes chosen with probability proportional to the number of edges they already have. This has the effect of "enriching" nodes that already have a large number of edges with even more edges.
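To make the mechanism concrete, here is a minimal sketch of preferential attachment with one edge per new node; it illustrates the idea and is not networkx's implementation:
In [ ]:
import random

def preferential_attachment_sketch(n, seed=42):
    """Grow a graph where each new node attaches to an existing node
    chosen with probability proportional to its current degree."""
    random.seed(seed)
    G = nx.Graph()
    G.add_edge(0, 1)  # seed graph: two connected nodes
    for new_node in range(2, n):
        nodes = list(G.nodes())
        weights = [G.degree(node) for node in nodes]
        # "Rich gets richer": high-degree nodes are more likely targets.
        target = random.choices(nodes, weights=weights, k=1)[0]
        G.add_edge(new_node, target)
    return G

nx.draw(preferential_attachment_sketch(30))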
How might this connect to biology?
We can't necessarily provide a concrete answer, but this model might help raise new hypotheses.
For example, if protein-protein interactions of the "binding" kind are driven by subdomains, then proteins that acquire a domain through recombination may end up being able to bind to everything else that the domain was able to. In this fashion, proteins with that particular binding domain gain new edges more readily.
Testing these hypotheses would be a totally different matter, and at this point, I submit the above hypothesis with a large amount of salt thrown over my shoulder. In other words, the hypothesized mechanism could be completely wrong. However, I hope this example illustrates that using a "graph generative model" can help us narrow down hypotheses about the observed world.