Scalar measures of graph structure can be useful for a few reasons:
Unfortunately, no single number describes a network fully -- which ones are relevant depend on the perspective. See for example this link for a list of some possible graph metrics:
https://reference.wolfram.com/language/guide/GraphMeasures.html
One thing that can be useful in considering different metrics is to look at a range of different network generation methods and see how the measures behave across those different methods.
Let's re-import everything to make sure we have the packages we need:
In [ ]:
import networkx as nx
import random as rand
import pylab as plt
import warnings # For Wakari only.
warnings.filterwarnings("ignore")
Networks from different domains come in different shapes and sizes. Social networks, for exmaple, tend to have different structures than transportation networks or trophic networks in ecology. Let's use NetworkX to generate some sample graphs to get a sense of some fo the possiblities.
Graphs with many different properties can be generated by NetworkX functions. A list is here:
https://networkx.github.io/documentation/networkx-1.10/reference/generators.html.
But let's start creating them from scratch to get a sense of how some of the different graph metrics work.
In [ ]:
pylab.rcParams['figure.figsize'] = (5.0, 5.0) # This sets the width of the plots.
g = nx.Graph()
n = 20
g.add_nodes_from(range(n))
for i in range(n):
for j in range(n):
if i != j:
g.add_edge(i, j) # Note that, for un-directed graphs (the default),
# adding an edge j,i to a graph that already includes
# i,j will not create a new edge.
nx.draw(g)
Alternatively:
In [ ]:
nx.draw(nx.complete_graph(20))
We can verify that the graph is fully connected by calculating the density, which is the ratio of the number of edges to the maximum possible number of edges, and see in this case that it is 1.0:
In [ ]:
m = nx.number_of_edges(g)
n = nx.number_of_nodes(g)
print n
print m
print n *(n-1)/2
print float(m) / (n * (n-1)/2.0)
Or, using the built-in function:
In [ ]:
nx.density(g)
Most graphs contaning information about real phenomena, however, are not fully connected. A very basic form of network is a random graph (otherwise known as an Erdos-Renyi graph), where edges are formed with a given probability:
See Newman Chapter 12 for a complete discussion of random graphs and their properties.
Here's a function that implements a version of this that samples from a list of available edges:
In [ ]:
def generate_random_graph(n, p):
g = nx.Graph()
g.add_nodes_from([i for i in range(n)])
m = []
for i in range(n):
for j in range(n):
if (i,j) not in m and [j,i] not in m and i != j:
m.append([i, j])
for i in range(n*(n-1) / 2): # Iterate through the nodes and assign an avaialble edge with probabiliy p.
if rand.random() < p: # This selects a random subset with probability p.
index = rand.randint(0, len(m)-1)
g.add_edge(m[index][0], m[index][1])
del(m[index])
return g
g = generate_random_graph(20, 0.2)
nx.draw(g)
This is very similar to the built-in erdos_renyi_graph generation functions.
Experimenting with different values of p, we can see that the graph is fragmented into separate components when p is low. We can use buil-in networkx functions to get a sense of how density relates to p and to the number of separate components. Let's check out 50 different probabilities ranging from zero to 1:
In [ ]:
d = []
cc = []
n = 10
for i in range(50):
p = i / 50.0
# g = nx.erdos_renyi_graph(n, p)
g = generate_random_graph(n, p)
d.append(nx.density(g))
cc.append(len([c for c in nx.connected_components(g)]) / float(n))
plt.plot(d)
plt.plot(cc)
The overall density and number of components of the graph tells us a little bit about the number and distribution of edges, but nothing about how these edges are distributed among nodes.
One of the key descriptive tools for this is the degree distribution. A node's degree is the number of edges connected to it. We can access this for a graph using the built-in NetworkX function:
In [ ]:
g = nx.erdos_renyi_graph(500, 0.2) # Note that if we change the value of n and p, the shape of the distribution remains the same.
degree_sequence=nx.degree(g).values()
plt.hist(degree_sequence, 20)
Now let's see how this looks compared to some other kinds of networks. You can see that the distribution varies substantially with the different graph structures.
In [ ]:
pylab.rcParams['figure.figsize'] = (18.0, 18.0) # This sets the width of the plots.
g = nx.erdos_renyi_graph(200, 0.2) # The first parameter is the number of nodes, the second is the probabilty of an edge
plt.subplot(321) # Defines a cell in a grid of plots
plt.title("Random")
nx.draw(g)
plt.subplot(322)
plt.hist(nx.degree(g).values())
g = nx.nx.karate_club_graph() # Fixed social network graph
plt.subplot(323)
plt.title("Karate Club")
nx.draw(g)
plt.subplot(324)
plt.hist(nx.degree(g).values(), 20)
g = nx.geographical_threshold_graph(200, 50) # Graph created by setting node distance threshold
plt.subplot(325)
plt.title("Geographical")
nx.draw(g)
plt.subplot(326)
plt.hist(nx.degree(g).values(), 20)
plt.show()
g = nx.ladder_graph(200)
plt.subplot(321)
plt.title("Ladder")
nx.draw(g)
plt.subplot(322)
plt.hist(nx.degree(g).values(), 20)
g = nx.barabasi_albert_graph(200, 2) # Preferential attachment; second parameter is number of edges per node
plt.title("Preferntial Attachment")
plt.subplot(323)
nx.draw(g)
plt.subplot(324)
plt.hist(nx.degree(g).values(), 20)
g = nx.nx.connected_watts_strogatz_graph(200, 4, 0.2) # Small world - third parameter is probability of random attachment
plt.subplot(325)
plt.title("Small World")
nx.draw(g)
plt.subplot(326)
plt.hist(nx.degree(g).values(), 20)
plt.show()
Most real-world networks don't form randomly. For example, in social networks, links tend to form with nodes that have more links (your friends tend to have more friends than you do!)
See Newman p. 498 for a more complete algorithm for generating a preferential attachment graph. Here is a simple approximation. You can see in this example that most nodes have few connections, but a few have many connections, in contrast to the random graph.
In [ ]:
pylab.rcParams['figure.figsize'] = (18.0, 6.0)
g = nx.Graph()
n = 50
g.add_nodes_from(range(n))
candidate_node_list = range(n)
node_list = range(n)
for i in node_list:
# Sampling from a list of nodes, two for every edge formed, will increase the dominance of eges with nodes.
found = False
while not found:
j = candidate_node_list[rand.randint(0, len(candidate_node_list)-1)]
if (not g.has_edge(i, j)) and (i != j):
g.add_edge(i, j)
candidate_node_list.append(j)
found = True
plt.subplot(121)
nx.draw(g)
plt.subplot(122)
plt.hist(sorted(nx.degree(g).values(),reverse=True), 20)
plt.show()
This is somewhat similar to the barabasi-albert graph:
In [ ]:
pylab.rcParams['figure.figsize'] = (18.0, 6.0)
g =nx.barabasi_albert_graph(50, 1)
plt.subplot(121)
nx.draw(g)
plt.subplot(122)
plt.hist(sorted(nx.degree(g).values(),reverse=True), 20)
plt.show()
One thing that this kind of preferential attachment doesn't capture, is the tendency in some domains for edges to form between nodes that are adjacent to nodes that are connected. For example, I am more likely to become acquainted with the a friend of my friend than with a random stranger. At the same time, random aquaintances do form.
Small-world networks capture this behavior by first connecting adjacent nodes, then the nodes adjacent to those, in addition to random connections, as described here:
https://en.wikipedia.org/wiki/Small-world_network
Newman p. 552 has a succinct description of the algorithm.
In [ ]:
pylab.rcParams['figure.figsize'] = (3, 3)
g = nx.connected_watts_strogatz_graph(20, 4, 0.15) # The last parameter is the probability of a random connection.
nx.draw(g)
In [ ]:
pylab.rcParams['figure.figsize'] = (5, 5)
# Note that, as we change the third parameter, the probability of making a random connection,
# the clustering coefficient goes down.
g = nx.connected_watts_strogatz_graph(200, 4, 0.2)
plt.hist(nx.clustering(g).values())
So far we've looked at three ways of generating graphs - random, preferential attachment, and small world. There are many others, and when considering representing data in a network format, it can be useful to think of the way in which the network is formed to help shed light on which metrics might be the most relevant.
Graph traversal algrorithms allow us to quantify attributes of paths within a graph. A workhorse algorithm for this purpose is breadth-first search, described, for example here: https://www.youtube.com/watch?v=ZDpaXRvUOSQ
This also allows us to find the minimum distance between any two pairs of nodes in a connected graph or graph component. The calculation can include a distance parameter or not.
In [ ]:
pylab.rcParams['figure.figsize'] = (12.0, 12.0)
import networkx as nx
import random as rand
import pylab as plt # Required for drawing
g = nx.Graph()
n = 15
p = 0.2
g.add_nodes_from([i for i in range(n)])
for u in range(n):
for v in range(u+1,n):
if rand.random() < p:
g.add_edge(u,v, d=round(rand.random(), 3))
# Path length and shortest path can be calculated using a distance or cost (default is 1).
print nx.shortest_path_length(g, source=1, target=10, weight="d")
print nx.shortest_path(g, source=1, target=10, weight="d")
print nx.shortest_path_length(g, source=1, target=10)
print nx.shortest_path(g, source=1, target=10)
pos = nx.spring_layout(g)
nx.draw(g, pos, with_labels=True, font_size=42, font_weight="bold")
edge_labels = nx.get_edge_attributes(g,'d')
nx.draw_networkx_edge_labels(g, pos, labels=edge_labels, font_size=12)
plt.show()
From this we can also infer several useful measures of a graph's structure:
More information on some of these centrality measures is availble here:
https://sites.google.com/site/networkanalysisacourse/schedule/an-introduction-to-centrality-measures
In [ ]:
pylab.rcParams['figure.figsize'] = (18.0, 5.0) # This sets the width of the plots.
plt.subplot(131)
plt.hist(nx.closeness_centrality(nx.erdos_renyi_graph(200, .1)).values())
plt.subplot(132)
plt.hist(nx.closeness_centrality(nx.barabasi_albert_graph(200, 2)).values())
plt.subplot(133)
plt.hist(nx.closeness_centrality(nx.connected_watts_strogatz_graph(200, 4, 0.1)).values())
plt.show()
plt.subplot(131)
plt.hist(nx.betweenness_centrality(nx.erdos_renyi_graph(200, .1)).values())
plt.subplot(132)
plt.hist(nx.betweenness_centrality(nx.barabasi_albert_graph(200, 2)).values())
plt.subplot(133)
plt.hist(nx.betweenness_centrality(nx.connected_watts_strogatz_graph(200, 4, 0.1)).values())
plt.show()
While these overall distributions vary with the type of graph, often we are more interested in the attributes of a specific node. To help quantify this, it can be useful to add node metrics as node attributes, then use these to control the graph output. For example:
In [ ]:
g = nx.erdos_renyi_graph(20, 0.1)
# g = nx.barabasi_albert_graph(20, 5)
# g = nx.connected_watts_strogatz_graph(20, 4, 0.2)
nx.set_node_attributes(g, 'degree', nx.degree(g))
nx.set_node_attributes(g, 'betweenness', nx.betweenness_centrality(g))
nx.set_node_attributes(g, 'closeness', nx.closeness_centrality(g))
We can then use these values in the graph diagram:
In [ ]:
pylab.rcParams['figure.figsize'] = (5.0, 5.0)
pos=nx.spring_layout(g, k=1, iterations=20)
node_sizes = [n[1]["degree"] * 1000 for n in g.nodes(data=True)]
node_colors = [n[1]["betweenness"] * 100 for n in g.nodes(data=True)]
# node_colors = [n[1]["closeness"] * 100 for n in g.nodes(data=True)]
nx.draw(g,pos,node_color=node_colors,node_size=node_sizes,cmap=plt.cm.Blues, with_labels=True)
It can be interesting to see what the most and least central nodes are based on these measures:
In [ ]:
d = sorted(g.nodes(data=True), key=lambda x: x[1]["degree"], reverse=True)
print "Nodes with the most and least connections: " + str(d[0]) + "; " + str(d[-1])
d = sorted(g.nodes(data=True), key=lambda x: x[1]["betweenness"], reverse=True)
print "Nodes with the highest and lowest betweenness: " + str(d[0]) + "; " + str(d[-1])
d = sorted(g.nodes(data=True), key=lambda x: x[1]["closeness"], reverse=True)
print "Nodes with the highest and lowest closeness: " + str(d[0]) + "; " + str(d[-1])
Another family of measures worth mentioning are based on adjacency matrices, where rows and columns represent vertices, and cells represent egdes.
https://en.wikipedia.org/wiki/Adjacency_matrix
If you consider each row to be analagous to a feature vector in machine learning applications, you can see how the network structure can be used to, for example, cluster similar nodes together by deriving distance measurements such as cosine similarity, correlation coefficient, or Euclidean distance, from rows the adjacency matrix.
One important additional network metric based on this is Assortativity, that is, the extent to which nodes that have similar attributes are connected. This can be expressed as a ratio between the number of connections within a group, and the number of expected connections if the edges were randomly distributed between nodes.
Social stratification, by, for example, age, can result in assortative mixing in a network.
Let's wrap this up by first creating a function that displays some key metrics across a few sample graphs:
In [ ]:
def summarize(g):
return [round(x, 2) for x in [nx.diameter(g), nx.average_shortest_path_length(g), mean(nx.degree(g).values()), sorted(nx.degree(g).values(),reverse=True)[0],
nx.density(g), nx.average_clustering(g)]]
print "Diameter, Average shortest path, Average degree, Maximum degree, Density, Average clustering"
print str(summarize(nx.erdos_renyi_graph(50, 0.1)))
print str(summarize(nx.erdos_renyi_graph(50, 0.5)))
print str(summarize(nx.erdos_renyi_graph(50, 0.9)))
print "\nDiameter, Average shortest path, Average degree, Maximum degree, Density, Average clustering"
print summarize(nx.barabasi_albert_graph(50, 2))
print summarize(nx.barabasi_albert_graph(50, 4))
print summarize(nx.barabasi_albert_graph(50, 6))
print "\nDiameter, Average shortest path, Average degree, Maximum degree, Density, Average clustering"
print summarize(nx.connected_watts_strogatz_graph(50, 4, 0.1))
print summarize(nx.connected_watts_strogatz_graph(50, 4, 0.5))
print summarize(nx.connected_watts_strogatz_graph(50, 4, 0.9))
We can also look at graph metrics in action here:
http://gis.foundationcenter.org/networkxd3/city_subject_trends.html