For this assignment we will work on a network representing the collaboration between scientists in the field of General Relativity and Quantum Cosmology. The network comes from SNAP and is described as follows:
The Arxiv GR-QC (General Relativity and Quantum Cosmology) collaboration network is from the e-print arXiv and covers scientific collaborations between authors of papers submitted to the General Relativity and Quantum Cosmology category. If author $i$ co-authored a paper with author $j$, the graph contains an undirected edge between $i$ and $j$. If a paper is co-authored by $k$ authors, this generates a completely connected (sub)graph on $k$ nodes. The data covers papers in the period from January 1993 to April 2003 (124 months), beginning within a few months of the inception of the arXiv.
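To illustrate the construction rule, each paper with $k$ authors contributes a $k$-clique. A minimal sketch with hypothetical author lists (the names are invented for illustration, not taken from the dataset):

```python
import itertools
import networkx as nx

# Hypothetical papers, each given as its list of author ids
papers = [['a', 'b', 'c'], ['b', 'd']]

G = nx.Graph()
for authors in papers:
    # every pair of co-authors gets an undirected edge, forming a k-clique
    G.add_edges_from(itertools.combinations(authors, 2))

print(sorted(G.edges()))
```

The three-author paper produces the triangle `a-b-c`, and the two-author paper adds the single edge `b-d`.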
In [ ]:
%matplotlib inline
import os
import random
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import warnings
warnings.filterwarnings('ignore')
In [ ]:
G = nx.read_edgelist(os.path.join('..', 'data', 'arxiv_collaboration_network.txt'))
In [ ]:
print('My network has {} nodes.'.format(len(G.nodes())))
print('My network has {} edges.'.format(G.size()))
1.1 In this part of the assignment, you have to create an Erdős–Rényi and Barabási-Albert graph using NetworkX, and compare them to the collaboration network. Try to simulate the original network as best as you can. When choosing parameters for the networks, take into account the number of vertices and edges of the original network. The number of vertices should be exactly the same. Comment on your choice of parameters.
In [ ]:
n = len(G.nodes())
m = G.size()
p = m*2/(n*(n-1))
er=nx.erdos_renyi_graph(n,p)
ba=nx.barabasi_albert_graph(n,3)
For the ER network, the expected number of edges is $$E[m] = p\,n(n-1)/2,$$ so we set $p = 2m/(n(n-1))$. For the BA network, the number of edges is fixed at $$m = q(n-q),$$ where $q$ is the model parameter (the number of edges each new node attaches with).
Since $q$ must be a natural number, $q=3$ gives the best approximation of the original edge count.
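A quick scan over candidate values of $q$ makes this choice explicit. The node and edge counts below are the published SNAP statistics for ca-GrQc, assumed here for illustration; substitute the counts printed above for your own copy of the data:

```python
n, m = 5242, 14496  # published SNAP statistics for ca-GrQc (assumed)

# number of edges a Barabasi-Albert graph with parameter q would have
edges_ba = {q: q * (n - q) for q in range(1, 6)}

# pick the q whose edge count is closest to the original network's
best_q = min(edges_ba, key=lambda q: abs(edges_ba[q] - m))
print(best_q, edges_ba[best_q])  # 3 15717
```

With these counts, $q=3$ overshoots by about 1200 edges, while $q=2$ undershoots by about 4000, so $q=3$ is the closest integer choice.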
In [ ]:
print('My Erdos Renyi network has {} nodes.'.format(len(er.nodes())))
print('My Erdos Renyi network has {} edges.'.format(er.size()))
print('My Barabasi Albert network has {} nodes.'.format(len(ba.nodes())))
print('My Barabasi Albert network has {} edges.'.format(ba.size()))
1.2 Check the size of the largest connected component in each graph and compare them to the original network. In the Erdős–Rényi model, what should the probability of creating each edge be in order to have the same expected size of the largest component? Justify. Generate a graph with this parameter to check if you indeed get a similar value.
In [ ]:
giant_G = G.subgraph(max(nx.connected_components(G), key=len))
giant_er = er.subgraph(max(nx.connected_components(er), key=len))
giant_ba = ba.subgraph(max(nx.connected_components(ba), key=len))
print(len(giant_G.nodes()))
print(len(giant_er.nodes()))
print(len(giant_ba.nodes()))
We know from the lectures that in Erdős–Rényi networks, the fraction of nodes in the giant component $$S = \frac{N_G}{N}$$ satisfies the self-consistency equation $$S = 1 - e^{-\langle k \rangle S}.$$ Solving for $\langle k \rangle$ and using $\langle k \rangle = (N-1)p$ gives $p = -\ln(1-S)/\big((N-1)S\big)$, which we compute below.
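Before plugging in the data, the self-consistency equation can be explored numerically with a small fixed-point iteration (the values of $\langle k \rangle$ below are illustrative, not taken from the dataset):

```python
import numpy as np

def giant_fraction(k_avg, tol=1e-12, max_iter=10_000):
    """Solve S = 1 - exp(-<k> S) by fixed-point iteration."""
    S = 1.0  # start from the fully connected guess
    for _ in range(max_iter):
        S_new = 1.0 - np.exp(-k_avg * S)
        if abs(S_new - S) < tol:
            break
        S = S_new
    return S

# Below <k> = 1 only the trivial solution S = 0 exists;
# above it a giant component emerges.
print(giant_fraction(0.5))  # essentially 0
print(giant_fraction(2.0))  # ~0.797
```

The iteration converges because the map $S \mapsto 1 - e^{-\langle k \rangle S}$ is a contraction near the stable fixed point.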
In [ ]:
S_G = len(giant_G.nodes())/n
p_giant = - np.log(1-S_G)/((n-1)*S_G)
print('The parameter p for an Erdos Renyi network with the same expected size of the giant component is {}.'.format(p_giant))
er_giant = max(nx.connected_components(nx.erdos_renyi_graph(n, p_giant)), key=len)
print('The size of the component in a randomly generated network with this parameter is {}.'.format(len(er_giant)))
1.3 Look at the clustering coefficient of the original network. Is there a network model we talked about that could have a clustering coefficient that is close? Explain.
In [ ]:
nx.average_clustering(G)
Yes, the Watts–Strogatz model can have a very high clustering coefficient with a similar number of vertices and edges. The coefficient can then be lowered by increasing the rewiring probability, while the number of vertices and edges stays the same.
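A quick check of this claim, using an illustrative ring of 1000 nodes with 10 neighbours each (parameters chosen for the demo, not fitted to the collaboration network): at rewiring probability $p=0$ the clustering equals the ring-lattice value $3(k-2)/(4(k-1)) = 2/3$, and it drops as $p$ grows.

```python
import networkx as nx

# clustering of a Watts-Strogatz graph for increasing rewiring probability
for p in [0.0, 0.1, 0.5, 1.0]:
    ws = nx.watts_strogatz_graph(1000, 10, p, seed=42)
    print(p, round(nx.average_clustering(ws), 3))
```

The node and edge counts are identical for every $p$, since rewiring only moves edge endpoints.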
In this part of the assignment, you will have to create a random network from a predefined degree distribution. There are several network models which can create a random network with the exact same degree distribution as the original, or with the same expected distribution as the original. Refer to section 4.8 of the Barabási book for more information.
One of the most famous ones is the configuration model. The model for a graph with $L$ edges in total is constructed in the following steps:
1. Assign each node $i$ a degree $k_i$, represented as $k_i$ stubs attached to the node; the degrees must sum to $2L$.
2. Pick two stubs uniformly at random and connect them to form an edge.
3. Repeat until no stubs remain.
Reminder: A stub is a half-link, representing the half of an edge. It contains one node and can be paired up with another stub to create an edge (between the two corresponding nodes).
2.1 However, this model allows for the creation of multi-links (multiple edges between the same pair of vertices) and self-loops, thus leading to a non-simple graph. In this assignment, you will implement a greedy configuration model, to avoid these problems.
The algorithm goes as follows:
1. Sort the degree sequence in decreasing order and build the corresponding list of stubs.
2. For each node, in decreasing order of degree: first remove the node's own remaining stubs from the list (this prevents self-loops), then repeatedly pair the node with a uniformly random remaining stub until its degree is exhausted or no stubs are left.
3. If an edge between the chosen pair already exists, increment its weight instead of adding a parallel edge.
Hints:
- `nx.empty_graph()` to create an empty graph.
- `G.add_edge(node1, node2, weight=1)` to add an edge to a weighted graph.
- `G[node1][node2]['weight'] += 1` to increment the weight of an edge by one.
In [ ]:
def greedy_configuration(degree_distribution):
    N = len(degree_distribution)
    G = nx.empty_graph(N)  # create all N nodes so degree-0 nodes are not lost
    degree_distribution = sorted(degree_distribution, reverse=True)
    # build the stub list: node i appears once per remaining stub
    stublist = []
    for node in range(N):
        for i in range(degree_distribution[node]):
            stublist.append(node)
    for node in range(N):
        # remove the node's own remaining stubs to prevent self-loops
        for stub in range(degree_distribution[node]):
            del stublist[0]
        while degree_distribution[node] and len(stublist) > 0:
            # pair the node with a uniformly random remaining stub
            ind = np.random.randint(0, len(stublist))
            b = stublist.pop(ind)
            if G.has_edge(node, b):
                G[node][b]['weight'] += 1  # multi-link becomes extra weight
            else:
                G.add_edge(node, b, weight=1)
            degree_distribution[node] -= 1
            degree_distribution[b] -= 1
    return G
In [ ]:
degree_sequence = sorted(dict(G.degree()).values(), reverse=True)  # degree sequence
gc = greedy_configuration(degree_sequence)
2.2 Verify that the networks have the same number of nodes. Plot the difference between the weighted degree distributions to verify that they are identical. If not, why?
In [ ]:
print('My network has {} nodes.'.format(len(gc.nodes())))
degree_sequence_gc = sorted(dict(gc.degree(weight='weight')).values(), reverse=True)
plt.plot(np.array(degree_sequence) - np.array(degree_sequence_gc));
The number of nodes and the weighted degree distributions are the same. They could have differed, however, if the algorithm had ended with the remaining (even number of) stubs all belonging to a single node: those stubs could only form self-loops, which our algorithm avoids, so they would be discarded.
2.3 Should these two networks have the same adjacency matrices? Justify.
No. The configuration model only ensures that the weighted degree distributions match; the connections between nodes are still chosen at random, so there is no reason to expect the same structure. Having the same adjacency matrices would mean the graphs are identical. Moreover, equal weighted degrees do not even guarantee that corresponding nodes have the same number of neighbours.
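A tiny counterexample makes this concrete: two 4-cycles visiting the nodes in different orders have the identical degree sequence $[2,2,2,2]$ but different adjacency matrices.

```python
import numpy as np

def adjacency(edges, n):
    """Build the symmetric 0/1 adjacency matrix of an undirected graph."""
    A = np.zeros((n, n), dtype=int)
    for u, v in edges:
        A[u, v] = A[v, u] = 1
    return A

# Two graphs with the identical degree sequence [2, 2, 2, 2]
cycle_a = [(0, 1), (1, 2), (2, 3), (3, 0)]
cycle_b = [(0, 2), (2, 1), (1, 3), (3, 0)]
A, B = adjacency(cycle_a, 4), adjacency(cycle_b, 4)

print((A.sum(axis=0) == B.sum(axis=0)).all())  # True: same degrees
print((A == B).all())                          # False: different matrices
```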
2.4 Draw both the generated and original networks. Are they similar? If not, why? Try to explain.
In [ ]:
nx.draw(G,node_size = 5)
In [ ]:
nx.draw(gc,node_size = 5)
Not really. We can see that the randomly generated graph has a much larger giant component. This is due to the special structure of the collaboration graph: many small, well-connected groups whose members collaborate with each other but rarely with anyone outside. The random model cannot capture this structure.
2.5 Do you expect the properties studied in the first part of the assignment to be close to the original graph? Justify.
We expect the number of nodes and edges to be similar by design. However, as we saw in ex. 2.4, the giant component of the generated graph is considerably larger. As for the clustering coefficient, we know from the first part that the original graph's value is very high, so we don't expect the random graph to come close.
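As a rough numerical companion to this argument: for uncorrelated configuration-model networks, the expected clustering coefficient can be estimated from the degree sequence alone via the standard result $C \approx (\langle k^2\rangle - \langle k\rangle)^2 / (N\langle k\rangle^3)$. A sketch on a toy degree sequence (chosen for illustration, not taken from the data):

```python
import numpy as np

def config_model_clustering(degrees):
    """Expected clustering of a configuration-model network:
    C ~ (<k^2> - <k>)^2 / (N <k>^3)."""
    k = np.asarray(degrees, dtype=float)
    return (np.mean(k**2) - np.mean(k))**2 / (len(k) * np.mean(k)**3)

# Toy degree sequence for illustration
print(config_model_clustering([3, 3, 2, 2, 1, 1]))  # ~0.148
```

For a large, sparse network with a fixed degree distribution this estimate decays roughly like $1/N$, far below the high clustering of the collaboration graph.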