Network Analysis: Using Null Models

Adapted from Professor Clauset's lectures and homeworks for Network Analysis and Modeling // Course page: http://tuvalu.santafe.edu/~aaronc/courses/5352/


In [ ]:
#relatively fast networks package (pip install python-igraph) that I used for these homeworks
import igraph 
# slow-and-steady networks package. fewer bugs, easier drawing
import networkx as nx
# plots!
import matplotlib.pyplot as plt
%matplotlib inline

# other packages
from __future__ import division
from random import random, shuffle
from numpy import percentile
from operator import itemgetter
from tabulate import tabulate

Graphs!

A good example of a real-world graph (because it is one): Zachary's karate club, a network of social interactions between 34 individuals involved in the same karate club. For now it's just important to know that this is a real social graph. Drawing it less because it's informative, and more because plotting is fun.


In [ ]:
real_graph = nx.karate_club_graph()
positions = nx.spring_layout(real_graph)
nx.draw(real_graph, node_color = 'blue', pos = positions)

Now. What's the difference between that (^) drawing of nodes and edges and a completely random assembly of dots and lines? How can we quantify the difference between a social network, which we think probably has important structure, and a completely random network, whose structure contains very little useful information? Which aspects of a network can be explained by simple statistics like average degree, the number of nodes, or the degree distribution? Which characteristics of a network depend on a structure or generative process that could reveal an underlying truth about the way the network came about?

The question to ask is: how likely is a specific network characteristic to have been generated by a random process?

Random Graph Models

The Erdös-Rényi Random Graph

The simplest random graph you can think of. For a graph $G$ with $n$ nodes, each pair of nodes gets an (undirected) edge with probability $p$. There are ${n \choose 2}$ pairs of nodes, so ${n \choose 2}$ possible edges. Then the average degree of a node in this random graph is $(n-1)p$, where $(n-1)$ is the number of possible connections for a node $i$ and $p$ is the probability of that connection existing. Call the expected average degree $\bar k = (n-1)p$.
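
As a quick sanity check on that expectation, here is a minimal sketch that samples an Erdös-Rényi graph with igraph's built-in `Erdos_Renyi` generator (rather than the hand-rolled edge loop used below) and compares the observed mean degree to $(n-1)p$:


In [ ]:
# Sample an Erdos-Renyi graph and compare the empirical mean degree
# to the expected value k_bar = (n-1)*p. Numbers vary run to run.
n, p = 500, 0.01
G_er = igraph.Graph.Erdos_Renyi(n=n, p=p)
mean_degree = sum(G_er.degree()) / float(G_er.vcount())
print("expected mean degree: {}".format((n - 1) * p))
print("observed mean degree: {}".format(mean_degree))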

Giant Components

One property that we see all the time in social graphs (and many other graphs) is the emergence of a "giant" connected component. The Erdös-Rényi graph also develops a giant component in part of its parameter space. In fact, when the average degree is more than 1 we see a giant component emerging, and when it is more than 3 that giant component contains all or almost all of the graph. That means that for a random graph with $p > \frac{1}{n-1}$ we will start to see a giant component.

To demonstrate why this is true, let $u$ be the fraction of vertices not in the giant component, which is also the probability that a randomly chosen vertex $i$ does not belong to the giant component. For $i$ to be outside the giant component, for every other vertex $j$ (there are $n-1$ of them), either $i$ is not connected to $j$ (with probability $1-p$), or $i$ is connected to $j$ but $j$ is itself not in the giant component (with probability $pu$). Then: $$ u = \left((1-p) + pu\right)^{n-1} $$ We can use $ p = \frac{\bar k}{n-1} $ to rewrite the expression as: $$ u = \left(1 - \frac{\bar k(1-u)}{n-1}\right)^{n-1} $$ And then taking the limit for large $n$ and using the fact that $\lim_{n\rightarrow\infty}\left(1-\frac{x}{n}\right)^n = e^{-x}$: $$ u = e^{-\bar k(1-u)} $$ Now if $u$ is the fraction of vertices not in the giant component, call $S = 1-u$ the fraction of vertices in the giant component. Then: $$ S = 1 - e^{-\bar kS} $$

There is no closed-form solution to this equation, but it can be solved numerically, and below we simulate random graphs and record the size of the largest connected component in each one.
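
Before simulating, here is a minimal sketch that solves $S = 1 - e^{-\bar k S}$ numerically by simple fixed-point iteration; the iteration scheme and starting guess are my own choices, not part of the original derivation:


In [ ]:
from math import exp

# Fixed-point iteration for S = 1 - exp(-k*S).
# Starting from S = 1 avoids getting stuck at the trivial solution S = 0.
def giant_component_fraction(k_bar, iterations=1000):
    S = 1.0
    for _ in range(iterations):
        S = 1 - exp(-k_bar * S)
    return S

for k_bar in [0.5, 1.0, 1.5, 2.0, 3.0]:
    print("k = {}: predicted giant component fraction S = {:.3f}".format(
        k_bar, giant_component_fraction(k_bar)))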


In [ ]:
# list of the fraction of nodes in the largest component, one entry per p
big_comp = []
# number of nodes in the graph
num_nodes = 500
# vector of edge probabilities (0.0001 <= p <= 0.01)
p_values = [(1-x*.0001) for x in xrange(9900,10000)]
# try it a few times to get a smoother curve
iterations = 5
for p in p_values:
    size_comps = []
    for h in xrange(0, iterations):
        # each distinct pair (i, j) gets an edge with probability p
        edge_list = []
        for i in xrange(0,num_nodes):
            for j in xrange(i+1,num_nodes):
                if (random() < p):
                    edge_list.append((i,j))
        G = igraph.Graph(directed = False)
        G.add_vertices(num_nodes)
        G.add_edges(edge_list)
        # sizes of the connected components of this realization
        comps = [len(x) for x in G.clusters()]
        size_comps.append(comps)
    # average size of the largest component, as a fraction of all nodes
    big_comp.append(sum([max(x) for x in size_comps])/(len(size_comps)*num_nodes))

In [ ]:
plt.plot([x*(num_nodes-1) for x in p_values], big_comp, '.')
plt.title("Phase transitions in connectedness")
plt.ylabel("Fraction of nodes in the largest component")
plt.xlabel("Average degree (k = p(n-1)), {} < p < {}".format(p_values[99],p_values[0]))

Clustering coefficient

The clustering coefficient is a measure of how many triangles (completely connected triples) there are in a graph. You can think of it as the probability that, if Alice knows Bob and Charlie, Bob also knows Charlie. The clustering coefficient of a graph is equal to $$ C = \frac{\text{number of closed triples}}{\text{number of connected triples}} $$
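
As a small illustration of that definition, the sketch below counts connected and closed triples directly on the karate club graph and checks the ratio against networkx's built-in `transitivity`; the counting approach (looping over centre vertices) is just one straightforward way to do it:


In [ ]:
from itertools import combinations

# Count connected triples (paths of length 2 centred on a vertex) and
# closed triples (those whose endpoints are also connected), then take the ratio.
closed, connected = 0, 0
for v in real_graph.nodes():
    for a, b in combinations(real_graph.neighbors(v), 2):
        connected += 1
        if real_graph.has_edge(a, b):
            closed += 1
print("C by direct counting:   {}".format(float(closed) / connected))
print("C from nx.transitivity: {}".format(nx.transitivity(real_graph)))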

Finding the expected value of $C$ for a random graph is simple. For any 3 vertices, the probability that all three edges among them exist is $p^3$, and the probability that a particular two of the three edges exist (a connected triple centred on the shared vertex) is $p^2$. The expected number of triangles is therefore ${n \choose 3}p^3$ and the expected number of connected triples is $3{n \choose 3}p^2$ (three choices of centre vertex); since each triangle contributes three closed triples, the expected value of $C$ is $\frac{3{n \choose 3}p^3}{3{n \choose 3}p^2} = p$. Notice in the above plot that the values of $p$ are very small, even once the graph is fully connected. In a randomly generated sparse graph (one where only a small fraction of the ${n \choose 2}$ possible edges exist), the clustering coefficient $C$ is very low.


In [ ]:
# number of nodes in the graph
num_nodes = 500
# vector of edge probabilities
p_values_clustering = [x*.01 for x in xrange(0,100)]
# number of graphs to generate per value of p
iterations = 1
# store (p, clustering coefficient) pairs
clustering = []
for p in p_values_clustering:
    for h in xrange(0, iterations):
        # each distinct pair (i, j) gets an edge with probability p
        edge_list = []
        for i in xrange(0,num_nodes):
            for j in xrange(i+1,num_nodes):
                if (random() < p):
                    edge_list.append((i,j))
        G = igraph.Graph(directed = False)
        G.add_vertices(num_nodes)
        G.add_edges(edge_list)
        clustering.append((p, G.transitivity_undirected(mode="zero")))

In [ ]:
plt.plot([x[0]*(num_nodes-1) for x in clustering], [x[1] for x in clustering], '.')
plt.title("Clustering coeff vs avg degree in a random graph")
plt.ylabel("Clustering coefficient")
plt.xlabel("Average degree (k = (n-1)p), 0 < p < 1")

Small diameter graphs

So we know that a giant component is very likely, even for sparse graphs, and also that the clustering coefficient is very low, even for relatively dense graphs. This means that the graph is almost completely connected, and that, at least locally, it looks much like a tree (essentially acyclic).

Suppose the graph has mean degree $\bar k$, and consider how many vertices are reachable from some vertex $i$ within $l$ steps. Because the clustering coefficient is very low (the graph is locally tree-like), any neighbor of $i$ is likely to have an almost entirely new set of neighbors: $\bar k$ neighbors, less the one leading back toward $i$, so about $\bar k - 1$ new vertices. Each step therefore multiplies the number of newly reached vertices by roughly $\bar k - 1$, so the number of vertices reachable within $l$ steps of $i$ is roughly $(\bar k-1)^l$.

The diameter of a graph is the maximum number of steps $l$ one would have to take to reach any vertex from any other vertex. Setting the number of reachable vertices equal to the size of the graph: $$ (\bar k-1)^l = n $$ $$ l = \frac{\log(n)}{\log(\bar k-1)} = O(\log(n))$$

Thus the diameter of the graph grows as $O(\log(n))$, which is the "small world" property.
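
A minimal sketch of checking this scaling empirically: hold the mean degree fixed, grow $n$, and compare the diameter of the giant component of an Erdös-Rényi graph to $\log(n)/\log(\bar k - 1)$. The prediction is only an order-of-magnitude estimate, and the `Erdos_Renyi`, `clusters().giant()` and `diameter()` calls reflect my reading of the python-igraph API:


In [ ]:
from math import log

# Fix the mean degree, grow n, and compare the observed diameter of the
# largest component to the log(n)/log(k-1) estimate.
k_bar = 5.0
for n in [100, 1000, 10000]:
    G_er = igraph.Graph.Erdos_Renyi(n=n, p=k_bar/(n-1))
    giant = G_er.clusters().giant()
    predicted = log(n)/log(k_bar - 1)
    print("n = {:6d}: diameter = {:2d}, log(n)/log(k-1) = {:.1f}".format(
        n, giant.diameter(), predicted))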

A comparison with a real social graph:


In [ ]:
print("The number of nodes in the graph (all are connected): {}".format(len(real_graph.nodes())))
print("The number of edges in the graph: {}".format(len(real_graph.edges())))
print("The average degree: {}".format(sum(nx.degree(real_graph).values())/len(real_graph.nodes())))
print("The clustering coefficient: {}".format(nx.average_clustering(real_graph)))
print("The clustering coefficient that a random graph with the same degree would predict (k/(n-1)): {}"
      .format(sum(nx.degree(real_graph).values())/len(real_graph.nodes())/(len(real_graph.nodes())-1)))
print("The diameter of the graph: {}".format(nx.diameter(real_graph)))

The Configuration Model

Another random graph model: the configuration model. Instead of letting the model generate a degree sequence, we take a specified degree sequence (say, the degree sequence of a social graph we already have) and randomize how the edges are connected. This allows us to ask the question: "how much of this characteristic is explained by the degree sequence alone?"

This is an example of using the configuration model to create a null model of our "real graph." Note that the stub-matching algorithm used here works well for creating configuration models of large graphs, but on a small graph like this one, discarding self-loops and duplicate edges introduces more noticeable error in the degree sequence.


In [ ]:
# stub list: each vertex appears once per unit of its degree
A = []
for v in real_graph.nodes():
    for x in range(0, real_graph.degree(v)):
        A.append(v)
# randomly match pairs of stubs
shuffle(A)
# make the edge list, dropping self-loops
_E = [(A[2*x], A[2*x+1]) for x in range(0,int(len(A)/2))]
E = set([x for x in _E if x[0]!=x[1]])
# add the edges to a new graph with the same node list
C = real_graph.copy()
C.remove_edges_from(real_graph.edges())
C.add_edges_from(E)
nx.draw(C, node_color = 'blue', pos = positions)

In [ ]:
print("The number of nodes in the graph (all are connected): {}".format(len(C.nodes())))
print("The number of edges in the graph: {}".format(len(C.edges())))
print("The average degree: {}".format(sum(nx.degree(C).values())/len(C.nodes())))
print("The clustering coefficient: {}".format(nx.average_clustering(C)))
print("The clustering coefficient that a random graph with the same degree would predict (k/(n-1)): {}"
      .format(sum(nx.degree(real_graph).values())/len(C.nodes())/(len(C.nodes())-1)))
print("The diameter of the graph: {}".format(nx.diameter(C)))

Asking questions using a null model

A famous example of centrality measurement on a social network is the Florentine families graph. Padgett's research on this graph argues that the Medici family's rise to power can be explained by their high centrality in the network of business ties between prominent families of Renaissance Florence. We will use a null model (the configuration model) of the graph to rearrange how edges are placed, without altering any node's degree, to see how much of the Medici's prominence is explained by their degree (rather than by other structural features of the graph).


In [ ]:
# get the graph
florentine_families = igraph.Nexus.get("padgett")["PADGB"]

First, let's show the relative rankings of the families with respect to vertex degree and with respect to our chosen centrality measure, harmonic centrality. I won't go into the various centrality measures here, beyond saying that harmonic centrality is defined as: $$ c_i = \frac{1}{n-1}\sum_{j \neq i}\frac{1}{d_{ij}} $$ where $d_{ij}$ is the geodesic distance between vertices $i$ and $j$. Basically, harmonic centrality measures how close a vertex is to every other vertex.
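
As a sanity check on the formula, networkx has an (unnormalized) `harmonic_centrality`; dividing it by $n-1$ should agree with a hand computation from the shortest-path distances. A quick check on the karate club graph, assuming `harmonic_centrality` is available in your networkx version:


In [ ]:
# Hand-rolled harmonic centrality vs networkx's built-in (which omits the
# 1/(n-1) normalization), computed on the karate club graph.
n = len(real_graph.nodes())
lengths = dict(nx.shortest_path_length(real_graph))
by_hand = {v: sum(1.0/d for u, d in lengths[v].items() if d > 0)/(n-1) for v in real_graph.nodes()}
builtin = {v: c/(n-1) for v, c in nx.harmonic_centrality(real_graph).items()}
print("match: {}".format(all(abs(by_hand[v] - builtin[v]) < 1e-9 for v in real_graph.nodes())))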


In [ ]:
# degree centrality
d = florentine_families.degree()
d_rank = [(x, florentine_families.vs[x]['name'], d[x]) for x in range(0,len(florentine_families.vs()))]
d_rank.sort(key = itemgetter(2), reverse = True)

In [ ]:
# harmonic centrality
distances = florentine_families.shortest_paths_dijkstra()
h = [sum([1/x for x in dist if x != 0])/(len(distances)-1) for dist in distances]
h_rank = [(x, florentine_families.vs[x]['name'], h[x]) for x in range(0,len(florentine_families.vs()))]
h_rank.sort(key = itemgetter(2), reverse = True)
# make the table: families ranked by degree alongside families ranked by harmonic centrality
d_table = []
d_table.append(["Family (by degree)", "degree", "Family (by harmonic centrality)", "harmonic centrality"])
for n in xrange(0,len(florentine_families.vs())):
    table_row = []
    table_row.extend([d_rank[n][1], str(d_rank[n][2])[0:5]])
    table_row.extend([h_rank[n][1], str(h_rank[n][2])[0:5]])
    d_table.append(table_row)
print(tabulate(d_table))

Now the fun (?) part. Create a bunch of random configuration-model graphs based on the Florentine families graph, then measure harmonic centrality on those graphs. The harmonic centrality of a node on the null model will depend only on its degree (as the rest of the graph structure is now random).


In [ ]:
# for each vertex, the differences between its harmonic centrality on the
# real graph and its harmonic centrality on each null model
n_vertices = len(florentine_families.vs())
hc_differences = [[] for x in range(0, n_vertices)]
for i in xrange(0,1000):
    # build a random graph based on the configuration model:
    # same vertex list as the real graph, with all edges removed
    C = florentine_families.copy()
    C.delete_edges(None)
    # stub list: each vertex appears once per unit of its degree
    A = []
    for v in florentine_families.vs().indices:
        for x in range(0,florentine_families.degree(v)):
            A.append(v)
    shuffle(A)
    # make the edge list, dropping self-loops
    _E = [(A[2*x], A[2*x+1]) for x in range(0,int(len(A)/2))]
    E = set([x for x in _E if x[0]!=x[1]])
    # add the random edges to C
    C.add_edges(E)

    # harmonic centrality of every vertex on the null model
    C_distances = C.shortest_paths_dijkstra()
    C_h = [sum([1/x for x in dist if x != 0])/(len(C_distances)-1) for dist in C_distances]
    del C
    for vertex in range(0, n_vertices):
        hc_differences[vertex].append(h[vertex] - C_h[vertex])

In [ ]:
plt.plot([percentile(diff, 50) for diff in hc_differences], '--')
plt.plot([percentile(diff, 25) for diff in hc_differences], 'r--')
plt.plot([percentile(diff, 75) for diff in hc_differences], 'g--')

plt.xticks(range(0,16))
plt.gca().set_xticklabels(florentine_families.vs()['name'])
plt.xticks(rotation = 90)

plt.gca().grid(True)

plt.ylabel("(centrality) - (centrality on the null model)")
plt.title("How much of harmonic centrality is explained by degree?")

The dashed red, blue and green lines respectively represent the 25th, 50th and 75th percentiles of the difference in centrality between the real graph and the pool of null models. We can see that for the Medici, that difference is consistently high: their centrality on the real graph is higher than their centrality on the null models. This shows that there is in fact something important about their structural position in the graph, beyond their high degree.
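
To put a rough number on that, we can compute for each family the fraction of null-model draws in which the real graph's harmonic centrality comes out strictly higher (a crude one-sided empirical p-value); this follow-up is my own addition, reusing the `hc_differences` collected above:


In [ ]:
# Fraction of null-model draws in which the real graph's harmonic centrality
# is strictly higher than the null model's, per family.
frac_higher = [sum(1 for d in diffs if d > 0)/float(len(diffs)) for diffs in hc_differences]
for name, frac in sorted(zip(florentine_families.vs()['name'], frac_higher),
                         key=itemgetter(1), reverse=True):
    print("{:<12s} {:.3f}".format(name, frac))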

That's all folks!