If you want to deepen your theoretical knowledge of power laws, you can read (this is not mandatory):
In [ ]:
%pylab inline
In [ ]:
import zipfile
import gzip
import codecs
We first focus on the second version of the Albert-Barabási model introduced during the class. Starting with an arbitrary undirected graph that contains at least one edge, we progressively expand the graph by adding nodes step by step. Each new node is attached to a single existing node chosen as follows:
Write a function alba(n = 1000, α = 0) that returns the vector of the node degrees of an instance of an Albert-Barabási graph with $n$ nodes. The initial graph contains two nodes that are connected to each other.
Hint 1: Drawing nodes proportionally to their degree may be a computational bottleneck. Observing that the degrees in an Albert-Barabási graph of $n$ nodes sum to $2n-2$, can you build an array of size $2n-2$ such that choosing a node proportionally to its degree is equivalent to choosing an element of this array uniformly at random?
Hint 2: The function bincount from the numpy package could be useful.
Answer:
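A minimal sketch of one possible implementation, following Hint 1. It assumes the attachment rule from the class is: with probability $\alpha$ the new node picks its neighbor uniformly at random among the existing nodes, and with probability $1-\alpha$ proportionally to the degrees (adapt it if your version of the rule differs). The parameter is spelled alpha instead of α, and the array M below plays the role of the size-$2n-2$ array of Hint 1.

In [ ]:
def alba(n = 1000, alpha = 0):
    # M lists the two endpoints of every edge: a uniform pick in the filled part of M
    # is a pick of a node proportionally to its degree (Hint 1).
    M = zeros(2 * n - 2, dtype = int)
    M[0], M[1] = 0, 1                       # initial graph: nodes 0 and 1 connected by one edge
    for i in range(2, n):
        if rand() < alpha:
            target = randint(i)             # uniform attachment
        else:
            target = M[randint(2 * i - 2)]  # preferential attachment
        M[2 * i - 2], M[2 * i - 1] = i, target   # record the new edge (i, target)
    return bincount(M, minlength = n)       # degree of node i = number of occurrences of i in M

In [ ]:
degrees = alba(10000, 0.5)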
For a value of $n$ that is as large as possible but still gives a reasonable running time (less than a dozen seconds), plot the probability mass function and the complementary cumulative distribution function of the degree distribution for a few values of $\alpha$ between $0$ and $1$. Compare the results to what you saw in the course.
Hint: You may need the function cumsum from the numpy package.
Answer:
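A possible sketch for the plots, reusing the alba function sketched above: bincount gives the empirical probability mass function and cumsum its partial sums, as suggested in the hint. The value of n and the set of α values are arbitrary choices.

In [ ]:
n = 100000
alphas = [0, 0.2, 0.5, 1]

figure()
for alpha in alphas:
    distribution = bincount(alba(n, alpha)) / n        # distribution[i] = fraction of nodes of degree i
    loglog(distribution, 'o', label = "alpha = %s" % alpha)
legend()
xlabel("Degree")
ylabel("PMF")
show()

figure()
for alpha in alphas:
    ccdf = 1 - cumsum(bincount(alba(n, alpha)) / n)    # ccdf[i] = fraction of nodes of degree > i
    loglog(ccdf, label = "alpha = %s" % alpha)
legend()
xlabel("Degree")
ylabel("CCDF")
show()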
Answer:
These are the same functions as in the practical Distances and Clustering.
The datasets are provided in the form of zip archives. Each zip archive contains the following files:
Datasets will be given on a USB key. This practical assumes they are stored in ../Datasets/ relative to your working directory.
In [ ]:
directory = "../Datasets/"
It can be a burden to parse the same data from ASCII files again and again and convert it into a proper format. The following function will be used to stash your variables in a compressed gz file for later use. Thanks to Pierre-Antoine (ACN 2017-2018) for pointing out the possibility (even if his npz-based solution seemed to only work on his laptop for some reason, so some adjustments had to be made).
In [ ]:
def use_cache(builder, prefix, variable = "size", rebuild = False, compress = False):
    try:
        if rebuild:
            raise ValueError('Value needs to be rebuilt')
        if compress:
            with gzip.GzipFile(directory + prefix + "-" + variable + ".npy.gz", "r") as f:
                return np.load(f)
        else:
            return np.load(directory + prefix + "-" + variable + ".npy")
    except:
        data = builder(prefix)
        if compress:
            with gzip.GzipFile(directory + prefix + "-" + variable + ".npy.gz", "w") as f:
                np.save(f, data)
        else:
            np.save(directory + prefix + "-" + variable, data)
        return data

# set default behavior
compress = False
Most of the core functions below have the following behavior: first they try to load the results from an npy file if one exists, otherwise they parse the dataset to extract the information and save it in an npy file for the next use. This approach avoids re-doing the same work over and over again.
The following function gives the number $n$ of nodes and the total number $m$ of oriented edges of the graph. In the case where the graph is undirected, all edges will be counted twice ($(i,j)$ and $(j,i)$ are the same edge in an undirected graph) so the actual number of edges is $\frac m 2$.
In [ ]:
def build_size(prefix):
    n = 0
    m = 0
    with zipfile.ZipFile(directory + prefix + ".zip") as myzip:
        with myzip.open(prefix + ".adja") as f:
            for line in f:
                n += 1
                m += len([int(s) for s in line.split()])
    size = array([n, m])
    return size

def get_size(prefix, rebuild = False):
    return use_cache(build_size, prefix, variable = "size", rebuild = rebuild, compress = compress)
Let us run this function to create the corresponding npy file.
In [ ]:
n, m = get_size("dblp")
print("Number of nodes in dblp: %d" % n)
print("Number of edges in dblp: %d" % (m // 2))
A natural way to store the adjacency list would be to use an array (or list) of arrays (or lists), such that A[i][j] would refer to the $j$th neighbor of node $i$. In practice, this structure can have some memory usage overhead, so we will store the adjacency list in a flat array with the function below.
Remark: the proposed format is known in literature as Compressed Sparse Row (CSR). More details are available at https://en.wikipedia.org/wiki/Sparse_matrix
In [ ]:
def build_adjacency(prefix):
    n, m = get_size(prefix)
    A = zeros(n + m + 1, dtype = int)
    A[0] = n + 1 # Don't forget the +1!!!
    with zipfile.ZipFile(directory + prefix + ".zip") as myzip:
        with myzip.open(prefix + ".adja") as f:
            i = 0
            for line in f:
                neighbors = array(line.split(), dtype = int)
                A[i+1] = A[i] + len(neighbors)
                A[A[i]:A[i+1]] = neighbors
                i += 1
    return A

def get_adjacency(prefix, rebuild = False):
    return use_cache(build_adjacency, prefix, variable = "adjacency", rebuild = rebuild, compress = compress)
We can load $A$ in memory.
In [ ]:
A = get_adjacency("dblp")
The result, A, is a numpy array of integers of size $n+m+1$, organized as follows: the first $n+1$ cells are offsets (with A[0] = n+1), and the remaining $m$ cells are the concatenated neighbor lists, so that A[A[i]:A[i+1]] is the list of neighbors of node $i$.
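To make this layout concrete, here is a tiny hand-built example (not taken from the dataset): the undirected path graph 0 – 1 – 2, for which $n = 3$ and $m = 4$.

In [ ]:
# Adjacency lists of the path 0 - 1 - 2: node 0 -> [1], node 1 -> [0, 2], node 2 -> [1]
A_toy = array([4, 5, 7, 8,   # offsets: the neighbors of node i are stored in A_toy[A_toy[i]:A_toy[i+1]]
               1,            # neighbors of node 0
               0, 2,         # neighbors of node 1
               1])           # neighbors of node 2
A_toy[A_toy[1]:A_toy[2]]     # -> array([0, 2]), the neighbors of node 1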
The following function just returns the neighbors.
In [ ]:
def neighbors(A, index):
    return A[A[index]:A[index+1]]
In practice, just use A[A[i]:A[i+1]] directly when you can: it avoids calling a function.
All the functions above assume a node is represented by an integer $0\leq i<n$, but researchers, Wikipedia pages, and even actors have names! Let us write some functions to translate integers to names and vice versa.
In [ ]:
def build_ids(prefix):
    n, m = get_size(prefix)
    delimiter = zeros(n+1, dtype = int)
    text = ""
    with zipfile.ZipFile(directory + prefix + ".zip") as myzip:
        with myzip.open(prefix + ".ids") as f:
            i = 0
            for line in codecs.iterdecode(f, 'utf8'):
                delimiter[i+1] = delimiter[i] + len(line) - 1
                text += line[0:-1]
                i += 1
    return [delimiter, text]

def get_ids(prefix = "dblp", rebuild = False):
    return use_cache(build_ids, prefix, variable = "ids", rebuild = rebuild)
The function above returns an array delimiter of size $n+1$ and a string text that concatenates all researcher names. It uses the same principle as the adjacency list: the name of the researcher with index $i$ is text[delimiter[i]:delimiter[i+1]]. This allows us to convert names to indices, and vice versa.
In [ ]:
def index2name(index, prefix, delimiter = None, text = None):
    if delimiter is None:
        delimiter, text = get_ids(prefix)
    return text[delimiter[index]:delimiter[index+1]]
In [ ]:
def name2index(name, prefix, delimiter = None, text = None):
    try:
        if delimiter is None:
            delimiter, text = get_ids(prefix)
        offset = text.index(name)
        return where(delimiter == offset)[0][0]
    except:
        print("Name not found.")
Let us try with some names. Note that the first execution will build the index.
In [ ]:
name2index("Paul_Erdös", "dblp")
In [ ]:
name2index("Fabien_Mathieu", "dblp")
In [ ]:
index2name(711561, "dblp")
In [ ]:
index2name(149114, "dblp")
Remark: The name2index function is very rough: it just tries to match name as a substring of text and finds the corresponding index in the delimiter array. It is quite slow and may fail if the name of a researcher is a substring of another researcher's name, but it will be enough for this practical.
You may have already seen list comprehensions before: they are for example used in some of the functions above to convert a text line into a list of neighbors when parsing the adjacency list: [int(s) for s in line.split()].
They are a powerful tool for constructing a list by describing how it is built, and you will have to use them in this practical, so you should study the following examples.
A first simple example: the list of the squares of the integers from 0 to 5:
In [ ]:
[i**2 for i in range(6)]
One of the interests of list comprehensions is that they can be nested. For example, the list of the squares of the first 6 positive odd integers:
In [ ]:
[i**2 for i in [2*k+1 for k in range(6)]]
One last example of a list comprehension, which will be very helpful for the clustering coefficient. Can you figure out what it does?
In [ ]:
[k for nj in [range(j) for j in range(6)] for k in nj]
DBLP (DataBase systems and Logic Programming or Digital Bibliography & Library Project) is THE database that records computer science publications. It records authors, conferences, journals... The co-authorship graph of DBLP is a good entry point to study undirected small-worlds.
There are multiple versions of the DBLP graph available. As in the previous practical, we will focus on the one available on KONECT.
Let us first see how to compute the size of the dataset.
Answer:
Write two functions plot_pmf(prefix, rebuild = False) and plot_ccdf(prefix, rebuild = False) that display the degree distribution of a dataset on a log-log scale. For example, they may build an array distribution such that the number of nodes of degree $i$ is distribution[i]. Comment the results in view of the previous parts of this practical.
Answer:
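A possible sketch, under the assumption that the degrees can be read directly from the offsets of the adjacency array (the degree of node $i$ is A[i+1] - A[i]); it does not reuse the build_degree / get_degree functions of Question 4, which are not reproduced here.

In [ ]:
def plot_pmf(prefix, rebuild = False):
    n, m = get_size(prefix, rebuild = rebuild)
    A = get_adjacency(prefix, rebuild = rebuild)
    degree = A[1:n+1] - A[:n]                # degree of node i, read from the offsets
    distribution = bincount(degree) / n      # distribution[i] = fraction of nodes of degree i
    loglog(distribution, 'o')
    xlabel("Degree")
    ylabel("PMF")
    title(prefix)
    show()

def plot_ccdf(prefix, rebuild = False):
    n, m = get_size(prefix, rebuild = rebuild)
    A = get_adjacency(prefix, rebuild = rebuild)
    degree = A[1:n+1] - A[:n]
    ccdf = 1 - cumsum(bincount(degree) / n)  # ccdf[i] = fraction of nodes of degree > i
    loglog(ccdf)
    xlabel("Degree")
    ylabel("CCDF")
    title(prefix)
    show()

In [ ]:
plot_pmf("dblp")
plot_ccdf("dblp")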
Answer:
We now play with French and English crawls of Wikipedia made in 2013 and available on WebGraph. These graphs have been cleaned: only links from one article to another article are kept.
The two main differences from the (undirected) DBLP/IMDB databases are:
The French crawl is made of three files:
The English crawl is provided in a similar way, with the prefix enwiki-2013 instead of frwiki-2013. Note that it is roughly three times as large as the French crawl. Feel free to use the dataset(s) you want.
The questions are essentially the same as for the DBLP/IMDB datasets.
Answer:
Write two functions build_degrees(prefix) and get_degrees(prefix, rebuild = False) that return two arrays indegree and outdegree containing the in-degrees and out-degrees of a dataset, respectively. These functions can be built on the same model as the functions build_degree and get_degree defined in Question 4.
Answer:
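A possible sketch, assuming that the Wikipedia archives can be parsed with the same get_size and get_adjacency functions as above and that each line of the .adja file lists the out-neighbors of a node (so the adjacency array stores out-links). In-degrees are then obtained by counting how many times each node appears as a neighbor.

In [ ]:
def build_degrees(prefix):
    n, m = get_size(prefix)
    A = get_adjacency(prefix)
    outdegree = A[1:n+1] - A[:n]                   # length of each out-neighbor list
    indegree = bincount(A[n+1:], minlength = n)    # number of times each node is pointed to
    return array([indegree, outdegree])

def get_degrees(prefix, rebuild = False):
    return use_cache(build_degrees, prefix, variable = "degrees", rebuild = rebuild, compress = compress)

In [ ]:
indegree, outdegree = get_degrees("frwiki-2013")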
Answer: