During the course, you will learn how to study real-life datasets. The goal of this practical is to begin with simple (i.e. non-oriented) graphs. Two datasets are proposed: DBLP and IMDB. In later practicals, you will also play with oriented graphs using Wikipedia datasets.
Remark: the IMDB dataset is much larger than the DBLP dataset. You should play with it only after you have mastered the DBLP dataset.
In [ ]:
%pylab inline
import networkx as nx
import zipfile
import gzip
import codecs
We provide you with a few functions to deal with the dataset format; you will have to use them in the practical. Strictly speaking, there are no questions in this part, yet it is strongly advised that you take your time to understand how the different functions proposed here work.
The datasets are provided in the form of zip archives. Each zip archive contains the following files (where prefix is the dataset name, e.g. dblp):
- prefix.adja: the adjacency lists, one line per node, each line containing the space-separated indices of that node's neighbors;
- prefix.ids: the names of the nodes, one per line, in the same order as the adjacency lists.
Datasets will be given on a USB key. This practical assumes they are stored in ../Datasets/ relative to your working directory.
In [ ]:
directory = "../Datasets/"
It can be a burden to parse the same data from ASCII files again and again and convert it into a proper format. The following function will be used to stash your variables in a compressed gz file for later use. Thanks to Pierre-Antoine (ACN 2017-2018) for pointing out the possibility (even if his solution based on npz seemed to only work on his laptop for some reason, so some adjustments had to be made).
In [ ]:
def use_cache(builder, prefix="dblp", variable = "size", rebuild = False, compress = False):
    try:
        if rebuild:
            raise ValueError('Value needs to be rebuilt')
        if compress:
            with gzip.GzipFile(directory+prefix+"-"+variable+".npy.gz", "r") as f:
                return np.load(f)
        else:
            return np.load(directory+prefix+"-"+variable+".npy")
    except:
        data = builder(prefix)
        if compress:
            with gzip.GzipFile(directory+prefix+"-"+variable+".npy.gz", "w") as f:
                np.save(f, data)
        else:
            np.save(directory+prefix+"-"+variable, data)
        return data

# Set default behavior
compress = False
Most of the core functions below have the following behavior: first they try to load the results from an npy file if one exists, otherwise they parse the dataset to extract the information and save it in an npy file for the next use. This approach avoids re-doing the same work over and over again.
The following function gives the number of nodes $n$ and the total number of oriented edges $m$ of the graph. In the case where the graph is undirected, all edges will be counted twice ($(i,j)$ and $(j,i)$ are the same edge on an undirected graph) so the actual number of edges is $\frac m 2$.
In [ ]:
def build_size(prefix = "dblp"):
    n = 0
    m = 0
    with zipfile.ZipFile(directory+prefix+".zip") as myzip:
        with myzip.open(prefix+".adja") as f:
            for line in f:
                n += 1
                m += len([int(s) for s in line.split()])
    size = array([n, m])
    return size

def get_size(prefix = "dblp", rebuild = False):
    return use_cache(build_size, prefix, variable = "size", rebuild = rebuild, compress = compress)
Let us run this function to create the npy file.
In [ ]:
prefix = "dblp"
n, m = get_size(prefix)
print("Number of nodes in %s: %s" % (prefix, n))
print("Number of edges in %s: %s" % (prefix, m//2))
A natural way to store the adjacency list would be to use an array (or list) of arrays (or lists), such that A[i][j] would refer to the $j$th neighbor of node $i$. In practice, this structure carries some memory overhead, so we will store the adjacency list in a flat array with the function below.
Remark: the proposed format is known in the literature as Compressed Sparse Row (CSR). More details can be found at https://en.wikipedia.org/wiki/Sparse_matrix
In [ ]:
def build_adjacency(prefix = "dblp"):
    n, m = get_size(prefix)
    A = zeros(n+m+1, dtype = int)
    A[0] = n+1  # Don't forget the +1!!!
    with zipfile.ZipFile(directory+prefix+".zip") as myzip:
        with myzip.open(prefix+".adja") as f:
            i = 0
            for line in f:
                neighbors = array(line.split(), dtype = int)
                A[i+1] = A[i]+len(neighbors)
                A[A[i]:A[i+1]] = neighbors
                i += 1
    return A

def get_adjacency(prefix = "dblp", rebuild = False):
    return use_cache(build_adjacency, prefix, variable = "adjacency", rebuild = rebuild, compress = compress)
We can load $A$ in memory.
In [ ]:
A = get_adjacency("dblp")
The result, A, is a numpy array of integers of size $n+m+1$, organized as follows:
- the first $n+1$ cells are indices into A itself: for $0\leq i<n$, the neighbors of node $i$ are stored in A[A[i]:A[i+1]] (in particular, A[0] = n+1 and A[n] = n+m+1);
- the last $m$ cells contain the concatenated neighbor lists.
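To make this layout concrete, here is a small hand-built sanity check on a toy graph (a triangle plus a pendant node); the array below is an illustration, not taken from the dataset:
In [ ]:
# Toy graph: edges 0-1, 0-2, 1-2, 0-3, so n = 4 nodes and m = 8 oriented edges.
A_toy = array([ 5,  8, 10, 12, 13,  # pointers: A_toy[i+1]-A_toy[i] is the degree of i
                1,  2,  3,          # neighbors of node 0
                0,  2,              # neighbors of node 1
                0,  1,              # neighbors of node 2
                0])                 # neighbors of node 3
for i in range(4):
    print(i, A_toy[A_toy[i]:A_toy[i+1]])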
The following function just returns the neighbors of a given node.
In [ ]:
def neighbors(A, index):
    return A[A[index]:A[index+1]]
In practice, just use A[A[i]:A[i+1]] if you can: it avoids a function call.
All the functions above assume a node is represented by an integer $0\leq i<n$, but researchers, Wikipedia pages, and even actors have names! Let us write some functions to translate integers to names and vice versa.
In [ ]:
def build_ids(prefix = "dblp"):
    n, m = get_size(prefix)
    delimiter = zeros(n+1, dtype = int)
    text = ""
    with zipfile.ZipFile(directory+prefix+".zip") as myzip:
        with myzip.open(prefix+".ids") as f:
            i = 0
            for line in codecs.iterdecode(f, 'utf8'):
                delimiter[i+1] = delimiter[i]+len(line)-1
                text += line[0:-1]
                i += 1
    return [delimiter, text]

def get_ids(prefix = "dblp", rebuild = False):
    return use_cache(build_ids, prefix, variable = "ids", rebuild = rebuild)
The function above returns a delimiter array of size $n+1$ and a text string that concatenates all researcher names. It uses the same principle as the adjacency list: the name of the researcher associated with index $i$ is text[delimiter[i]:delimiter[i+1]]. This allows us to convert names to indices, and vice versa.
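As a toy illustration of this principle (hand-built values, not taken from the dataset):
In [ ]:
toy_delimiter = array([0, 5, 8])  # two names: text[0:5] and text[5:8]
toy_text = "AliceBob"
print(toy_text[toy_delimiter[0]:toy_delimiter[1]])  # 'Alice'
print(toy_text[toy_delimiter[1]:toy_delimiter[2]])  # 'Bob'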
In [ ]:
def index2name(index, prefix = "dblp", delimiter = None, text = None):
    if delimiter is None:
        delimiter, text = get_ids(prefix)
    return text[delimiter[index]:delimiter[index+1]]
In [ ]:
def name2index(name, prefix = "dblp", delimiter = None, text = None):
    try:
        if delimiter is None:
            delimiter, text = get_ids(prefix)
        offset = text.index(name)
        return where(delimiter == offset)[0][0]
    except:
        print("Name not found.")
Let us try with some names. Note that the first execution will build the index.
In [ ]:
name2index("Paul_Erdös")
In [ ]:
name2index("Fabien_Mathieu")
In [ ]:
index2name(711561)
In [ ]:
index2name(149114)
Remark: The name2index function is very rough: it just tries to match name as a substring of text and finds the corresponding index in the delimiter array. It is quite slow and may fail if the name of a researcher is a substring of another researcher's name, but it will be enough for this practical.
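If you ever need faster and safer lookups (not required for this practical), a possible alternative is to build a dictionary once. A minimal sketch, reusing the helpers above (the function name is an assumption):
In [ ]:
def build_name_dict(prefix = "dblp"):
    # Map each full name to its index: O(n) to build, O(1) per lookup,
    # and immune to the substring issue mentioned above.
    delimiter, text = get_ids(prefix)
    return {text[delimiter[i]:delimiter[i+1]]: i for i in range(len(delimiter)-1)}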
You may have already seen list comprehensions before: they are used, for example, in some of the functions above to convert a text line into a list of neighbors when parsing the adjacency list: [int(s) for s in line.split()]. List comprehensions are a powerful tool to construct a list by describing how it is built, and you will have to use them in this practical, so you should study the following examples.
A first simple example: the list of the squares of the integers from 0 to 5:
In [ ]:
[i**2 for i in range(6)]
One of the interests of list comprehensions is that they can be nested. For example, here is the list of the squares of the first 6 positive odd integers.
In [ ]:
[i**2 for i in [2*k+1 for k in range(6)]]
A last example of list comprehension, which will be very helpful for the clustering coefficient. Can you figure out what it does?
In [ ]:
[k for nj in [range(j) for j in range(6)] for k in nj]
Let us check that you can use the tools above: write a function coauthors that gives you the co-authors of someone. It should take a string as input and return a list of strings.
Answer:
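A minimal sketch of a possible solution, reusing the helpers above (get_adjacency, name2index, index2name):
In [ ]:
def coauthors(name, prefix = "dblp"):
    # Translate the name into an index, slice its neighbors out of the
    # flat adjacency array, then translate each neighbor back into a name.
    A = get_adjacency(prefix)
    i = name2index(name, prefix)
    return [index2name(j, prefix) for j in A[A[i]:A[i+1]]]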
Write a function that takes a node index and gives a shortest path from Erdös. Test it.
Answer:
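A possible approach (a sketch only, under the conventions of the helpers above; the function name is an assumption): run a BFS from Erdös while recording a predecessor array, then walk back from the target node.
In [ ]:
from collections import deque

def shortest_path_from_erdos(target, prefix = "dblp"):
    # prec[i] is the predecessor of i on a shortest path from Erdös (-1 = unseen).
    A = get_adjacency(prefix)
    n, m = get_size(prefix)
    source = name2index("Paul_Erdös", prefix)
    prec = -ones(n, dtype = int)
    prec[source] = source
    queue = deque([source])
    while queue:
        i = queue.popleft()
        if i == target:
            break
        for j in A[A[i]:A[i+1]]:
            if prec[j] == -1:
                prec[j] = i
                queue.append(j)
    if prec[target] == -1:
        return None  # target is not reachable from Erdös
    path = [target]
    while path[-1] != source:
        path.append(prec[path[-1]])
    return [index2name(i, prefix) for i in reversed(path)]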
Consider a few researchers of your choice in the DBLP database with finite Erdös numbers. Choose them from different branches of science. For each pair of these researchers, compare the distance between them to the sum of their Erdös numbers. Comment.
Remark: this exercise will be more interesting with the IMDB dataset.
Answer:
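A possible building block for this comparison (a sketch, same assumptions as the BFS above) is a function returning all hop distances from a given researcher; distances between pairs and Erdös numbers can then be read from such arrays:
In [ ]:
def bfs_distances(source_name, prefix = "dblp"):
    # dist[i] is the hop distance from source_name to node i (-1 = unreachable).
    A = get_adjacency(prefix)
    n, m = get_size(prefix)
    dist = -ones(n, dtype = int)
    current = [name2index(source_name, prefix)]
    dist[current[0]] = 0
    while current:
        nxt = []
        for i in current:
            for j in A[A[i]:A[i+1]]:
                if dist[j] == -1:
                    dist[j] = dist[i] + 1
                    nxt.append(j)
        current = nxt
    return dist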
Informally, the clustering coefficient tells how likely it is that the friend of your friend is your friend. In other words, it tries to quantify the so-called small-world effect.
When one wants to give a proper definition of a clustering coefficient, multiple options exist. For this practical, we will rely on the metrics introduced by Newman in The structure and function of complex networks. Note that these are the definitions proposed on Wikipedia (avoid the French version by the way, the translation is lossy).
Note that we will slightly alter some of the definitions proposed by Newman to simplify the practical, but the metrics are exactly the same.
For a simple graph $G = (V,E)$ (simple means non-oriented, no self-loops), we introduce the following notation for a node $i \in V$:
- $N(i)$: the set of neighbors of $i$;
- $B(i)$: the set of edge pairs around $i$, that is, the ordered pairs $(j,k)$ of distinct nodes of $N(i)$;
- $T(i)$: the set of oriented triangles around $i$, that is, the pairs $(j,k)$ of $B(i)$ such that $(j,k)\in E$.
Remark that $T(i)\subseteq B(i)$: all oriented triangles are also edge pairs, the only difference is the existence of an edge between the two nodes of the pair.
The clustering coefficient just measures the ratio between $T$ and $B$. It comes in two flavors, local and global:
- the local clustering coefficient of a node $i$ is $c_i := \frac{|T(i)|}{|B(i)|}$ (with the convention $c_i = 0$ when $B(i)$ is empty); the average local clustering coefficient is $C_l := \frac{1}{|V|}\sum_{i\in V} c_i$;
- the global clustering coefficient is $C_g := \frac{\sum_{i\in V} |T(i)|}{\sum_{i\in V} |B(i)|}$.
Last definition: the reference clustering of a simple graph $G = (V, E)$ is defined as $C_r:=\frac{2|E|}{|V|(|V|-1)}$. It corresponds to the typical clustering coefficient one would observe on a random graph (e.g. Erdös-Rényi) having on average the same number of nodes and edges as $G$.
To illustrate the definitions above, consider the following graph:
It is displayed in the cell below with GraphViz, but you may need a working installation of GraphViz for it to display correctly. If you want to see it (this is optional, unless you want to do bonus question 3.3), you will need a working GraphViz installation along with the graphviz Python package.
In [ ]:
from graphviz import Graph
d = Graph(strict = True)
d.edge("A","B")
d.edge("A","C")
d.edge("B","C")
d.edge("A","D")
d
We have the following:
- $c_A = \frac{1}{3}$, $c_B = c_C = 1$, $c_D = 0$;
- $C_l = \frac{1}{4}\left(\frac{1}{3}+1+1+0\right) = \frac{7}{12} \approx 0.58$;
- $C_g = \frac{2+2+2+0}{6+2+2+0} = \frac{3}{5} = 0.6$;
- $C_r = \frac{2\times 4}{4\times 3} = \frac{2}{3} \approx 0.67$.
Remark: you may have noticed that the reference clustering is greater than the others, yet it is impossible to produce a graph with 4 nodes and 4 edges with a higher clustering. The short answer is that it is a border effect due to the small size of the graph. A deeper, slightly longer answer can be provided on demand.
Express the global clustering coefficient as a function of the local clustering coefficients. Discuss qualitatively how the average local clustering coefficient and the global clustering coefficient may differ.
Answer:
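One way to see the relation (a sketch of the computation, using the notation above): since $|T(i)| = |B(i)|\,c_i$ for every node $i$,

$$C_g = \frac{\sum_{i\in V}|T(i)|}{\sum_{i\in V}|B(i)|} = \frac{\sum_{i\in V}|B(i)|\,c_i}{\sum_{i\in V}|B(i)|},$$

i.e. $C_g$ is the average of the local coefficients weighted by the number of edge pairs $|B(i)|$, so it gives more weight to high-degree nodes than the plain average $C_l$ does.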
Write a function that gives the local clustering coefficient $c_i$ of a node with identifier $i$ in DBLP. Try to make it as fast as possible.
Hints:
- numpy's in1d function may be handy.

Answer:
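A possible sketch using in1d (for each neighbor $j$ of $i$, count how many neighbors of $j$ are also neighbors of $i$; this counts exactly the oriented triangles $|T(i)|$):
In [ ]:
def clustering_coefficient(i, A):
    # c_i = |T(i)| / |B(i)|, with the convention c_i = 0 when |B(i)| = 0.
    nb = A[A[i]:A[i+1]]
    d = len(nb)
    if d < 2:
        return 0.0
    t = sum(count_nonzero(in1d(A[A[j]:A[j+1]], nb)) for j in nb)
    return t / (d * (d - 1))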
Write a function that displays the neighborhood graph of a node $i$ using GraphViz. The neighborhood graph of $i$, denoted $G(i) = (V(i), E(i))$, is such that:
- $V(i)$ is made of $i$ and its neighbors;
- $E(i)$ is the set of edges of $E$ with both endpoints in $V(i)$.
Test it on a few researchers (avoid researchers with too many co-authors if you want to see something) to see how it looks.
Answer:
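A possible sketch along the lines of the GraphViz example above (the function name is an assumption; it requires the graphviz package):
In [ ]:
from graphviz import Graph

def neighborhood_graph(i, prefix = "dblp"):
    # Build G(i): node i, its neighbors, and all edges between these nodes.
    A = get_adjacency(prefix)
    d = Graph(strict = True)
    nb = A[A[i]:A[i+1]]
    for j in nb:
        d.edge(index2name(i, prefix), index2name(j, prefix))
        for k in A[A[j]:A[j+1]]:
            if k in nb and j < k:  # keep each inner edge once
                d.edge(index2name(j, prefix), index2name(k, prefix))
    return d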
Using list comprehensions, compute the list of $t_i$, then $C_l$, $C_g$, and $C_r$ for the DBLP graph.
Hint: if needed, you can convert a list into an array with np.array().

Answer:
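A possible sketch, reusing the triangle-counting idea from the local clustering question (one way among others; here $t_i = |T(i)|$ and $b_i = |B(i)|$):
In [ ]:
A = get_adjacency("dblp")
n, m = get_size("dblp")
# t[i] = |T(i)|: ordered pairs of neighbors of i that are themselves connected.
t = np.array([sum(count_nonzero(in1d(A[A[j]:A[j+1]], A[A[i]:A[i+1]]))
                  for j in A[A[i]:A[i+1]]) for i in range(n)])
# b[i] = |B(i)| = d_i (d_i - 1), where d_i = A[i+1]-A[i] is the degree of i.
b = np.array([(A[i+1]-A[i]) * (A[i+1]-A[i]-1) for i in range(n)])
C_l = mean([t[i] / b[i] if b[i] > 0 else 0.0 for i in range(n)])
C_g = t.sum() / b.sum()
C_r = m / (n * (n - 1))  # m counts oriented edges, so m = 2|E|
print(C_l, C_g, C_r)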
The DBLP dataset comes from http://konect.uni-koblenz.de/downloads/tsv/dblp_coauthor.tar.bz2. The format was not exactly the one used here, so it was converted with the following code.
Sadly, the intern who wrote the code didn't comment it. Can you reverse engineer the code and explain how it works?
Answer:
Try to redo the practical using the IMDB dataset. As it is a bigger dataset, you will have to be cautious.
Answer: