ACN 903 S5: Power Laws

Céline Comte & Fabien Mathieu

If you want to deepen your theoretical knowledge of power laws, you can read (this is not mandatory):


In [ ]:
%pylab inline

In [ ]:
import zipfile
import gzip
import codecs

1. Albert-Barabási generative model

We first focus on the second version of the Albert-Barabási model introduced during the class. Starting with an arbitrary undirected graph that contains at least one edge, we progressively expand the graph by adding nodes step by step. Each new node is attached to a single existing node chosen as follows:

  • with probability $\alpha$, we perform a uniform attachment, meaning that all nodes can be chosen with the same probability;
  • with probability $1 - \alpha$, we perform a preferential attachment, so that a node is chosen with a probability that is proportional to its degree.

Question 1

Write a function alba(n = 1000, α = 0) that returns the vector of the node degrees in an instance of an Albert-Barabási graph of $n$ nodes. The initial graph contains two nodes connected by an edge.

Hint 1: Drawing nodes proportionally to their degree may be a computational bottleneck. Observing that the degrees in an Albert-Barabási graph of $n$ nodes sum to $2n-2$, can you build an array of size $2n-2$ such that choosing a node proportionally to its degree is equivalent to choosing an element of this array uniformly at random?
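
To make the idea concrete, here is a minimal sketch of this trick for pure preferential attachment (it is not a full solution to Question 1: there is no $\alpha$ and no function wrapper, and the names targets and filled are ours):

In [ ]:
# Sketch of Hint 1: each time an edge {i, j} is created, both endpoints are
# stored in `targets`. A node of degree d then appears d times, so a uniform
# draw from the filled part of `targets` is a degree-proportional draw.
n = 5                                    # toy size, for illustration only
targets = zeros(2 * n - 2, dtype = int)
targets[0:2] = [0, 1]                    # initial graph: nodes 0 and 1 connected
filled = 2
for new_node in range(2, n):
    chosen = targets[randint(filled)]    # preferential attachment draw
    targets[filled:filled + 2] = [new_node, chosen]
    filled += 2
degrees = bincount(targets)              # Hint 2: degrees recovered with bincount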

Hint 2: The function bincount from numpy package could be useful.

Answer:

Question 2

For a value of $n$ that is as large as possible but still gives a reasonable running time (less than a dozen seconds), plot the probability mass function and the complementary cumulative distribution function of the degree distribution for a few values of $\alpha$ between $0$ and $1$. Compare the results to what you saw in the course.

Hint: You may need the function cumsum from numpy package.
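
As a small illustration of the hint (a sketch only, built on a made-up degree sequence, not the answer to the question), the CCDF can be obtained from the degree counts with cumsum:

In [ ]:
# Sketch: PMF and CCDF from a toy degree sequence.
toy_degrees = array([1, 1, 2, 3, 3, 3, 7])
counts = bincount(toy_degrees)               # counts[d] = number of nodes of degree d
pmf = counts / counts.sum()                  # probability mass function
ccdf = 1 - cumsum(pmf) + pmf                 # ccdf[d] = P(degree >= d)
loglog(arange(1, len(ccdf)), ccdf[1:], 'o')  # degree 0 is skipped in loglog scale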

Answer:

Question 3 (Optional)

What can you say specifically about the case $\alpha = 1$?

Answer:

2. Using datasets

These are the same functions as in the practical Distances and Clustering.

File format

The datasets are provided in the form of zip archives. Each zip archive contains the following files:

  • dataset.ids contains the actual names of the nodes (one per line, $ n $ lines in total). By convention, each node is associated with its line number (from $ 0 $ to $ n-1 $). Actual names may contain special characters (e.g. ç, é), so they are encoded with utf-8.
  • dataset.adja contains the adjacency list of the graph: line $ i $ (from $ 0 $ to $ n-1 $) contains, in plain ASCII, the indices of the nodes that are neighbors of $ i $.
  • For oriented graphs, dataset-t.adja contains the adjacency list of the transposed graph: line $ i $ (from $ 0 $ to $ n-1 $) contains, in plain ASCII, the indices of the nodes that have a link to $ i $.

Datasets will be given on a USB key. This practical assumes they are stored in ../Datasets/ relative to your working directory.


In [ ]:
directory = "../Datasets/"
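
As a quick sanity check (this is just a sketch, which assumes the dblp archive is already present in directory), you can peek at the first node names without extracting the zip:

In [ ]:
# Print the first few node names of the dblp dataset (assumes dblp.zip is in directory).
with zipfile.ZipFile(directory + "dblp.zip") as myzip:
    with myzip.open("dblp.ids") as f:
        for i, line in enumerate(codecs.iterdecode(f, 'utf8')):
            print(line.strip())
            if i >= 4:
                break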

Caching the results

It can be a burden to parse the same data from ASCII files again and again and convert it into a proper format. The following function will be used to stash your variables in (optionally compressed) npy files for later use. Thanks to Pierre-Antoine (ACN 2017-2018) for pointing out the possibility (even if his solution based on npz seemed to only work on his laptop for some reason, so some adjustments had to be made).


In [ ]:
def use_cache(builder, prefix, variable = "size", rebuild = False, compress = False):
    try:
        if rebuild:
            raise ValueError('Value needs to be rebuilt')
        # Try to load a previously cached result.
        if compress:
            with gzip.GzipFile(directory + prefix + "-" + variable + ".npy.gz", "r") as f:
                return np.load(f)
        else:
            return np.load(directory + prefix + "-" + variable + ".npy")
    except:
        # No cache available (or rebuild requested): build the value and save it.
        data = builder(prefix)
        if compress:
            with gzip.GzipFile(directory + prefix + "-" + variable + ".npy.gz", "w") as f:
                np.save(f, data)
        else:
            np.save(directory + prefix + "-" + variable, data)
        return data

# set default behavior
compress = False

Most of the core functions below have the following behavior: first they try to load the results from an npy file if one exists; otherwise, they parse the dataset to extract the information and save it in an npy file for the next use. This approach avoids redoing the same work over and over again.
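
For reference, a builder is just a function that takes the dataset prefix and returns something numpy can save. Here is a toy (purely hypothetical) example of the pattern; build_toy and the "toy" variable name are not part of the practical:

In [ ]:
# Toy builder, for illustration only: the "result" is just a small array.
def build_toy(prefix):
    return np.arange(5)

# The first call builds and saves dblp-toy.npy; later calls load it from disk.
toy = use_cache(build_toy, "dblp", variable = "toy", compress = compress)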

Size

The following function gives the number $n$ of nodes and the total number $m$ of oriented edges of the graph. In the case where the graph is undirected, all edges will be counted twice ($(i,j)$ and $(j,i)$ are the same edge in an undirected graph) so the actual number of edges is $\frac m 2$.


In [ ]:
def build_size(prefix):
    n = 0
    m = 0
    with zipfile.ZipFile(directory + prefix + ".zip") as myzip:
        with myzip.open(prefix + ".adja") as f:
            for line in f:
                n += 1
                m += len([int(s) for s in line.split()])
    size = array([n, m])
    return size

def get_size(prefix, rebuild = False):
    return use_cache(build_size, prefix, variable = "size", rebuild = rebuild, compress = compress)

Let us run this function to create the corresponding npy file.


In [ ]:
n, m = get_size("dblp")
print("Number of nodes in dblp: %d" % n)
print("Number of edges in dblp: %d" % (m // 2))

Adjacency List

A natural way to store the adjacency list would be to use an array (or list) of arrays (or lists), such that A[i][j] would refer to the $j$th neighbor of node $i$. In practice, this structure can have some memory usage overhead, so we will store the adjacency list in a flat array with the function below.

Remark: the proposed format is known in the literature as Compressed Sparse Row (CSR). More details are available at https://en.wikipedia.org/wiki/Sparse_matrix


In [ ]:
def build_adjacency(prefix):
    n, m = get_size(prefix)
    A = zeros(n + m + 1, dtype = int)
    A[0] = n + 1 # Don't forget the +1!!!
    with zipfile.ZipFile(directory + prefix + ".zip") as myzip:
        with myzip.open(prefix + ".adja") as f:
            i = 0
            for line in f:
                neighbors = array(line.split(), dtype = int)
                A[i+1] = A[i] + len(neighbors) # index where the neighbors of node i+1 will start
                A[A[i]:A[i+1]] = neighbors # store the neighbors of node i
                i += 1
    return A

def get_adjacency(prefix, rebuild = False):
    return use_cache(build_adjacency, prefix, variable = "adjacency", rebuild = rebuild, compress = compress)

We can load $A$ in memory.


In [ ]:
A = get_adjacency("dblp")

The result, A, is a numpy array of integers of size $n+m+1$, organized as follows:

  • The first $n+1$ values are indices
  • The last $m$ values are destinations (the concatenated neighbor lists)
  • The neighbors of a node $i$ are stored in A[A[i]:A[i+1]]

The following function just returns the neighbors.


In [ ]:
def neighbors(A, index):
    return A[A[index]:A[index+1]]

In practice, just use A[A[i]:A[i+1]] directly if you can: it avoids a function call.
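
As a toy illustration of this layout (a hand-built array, not one of the datasets), consider an undirected graph with 3 nodes and edges {0,1} and {0,2}, so $n = 3$ and $m = 4$:

In [ ]:
# Hand-built CSR array: the first n+1 = 4 values are indices, the last m = 4 are destinations.
toy_A = array([4, 6, 7, 8, 1, 2, 0, 0])
print(neighbors(toy_A, 0))          # [1 2]
print(toy_A[toy_A[1]:toy_A[2]])     # neighbors of node 1: [0]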

Index / Name conversion

All the functions above assume a node is represented by an integer $0\leq i<n$, but researchers, Wikipedia pages, and even actors have names! Let us write some functions to translate integers to names and vice versa.


In [ ]:
def build_ids(prefix):
    n, m = get_size(prefix)
    delimiter = zeros(n+1, dtype = int)
    text = ""
    with zipfile.ZipFile(directory + prefix + ".zip") as myzip:
        with myzip.open(prefix + ".ids") as f:
            i = 0
            for line in codecs.iterdecode(f, 'utf8'):
                delimiter[i+1] = delimiter[i] + len(line) - 1 # -1 drops the trailing newline
                text += line[0:-1]
                i += 1
    return [delimiter, text]
    
def get_ids(prefix = "dblp", rebuild = False):
    return use_cache(build_ids, prefix, variable = "ids", rebuild = rebuild)

The function above returns an array delimiter of size $n+1$ and a string text that concatenates all researcher names. It uses the same principle as the adjacency list: the name of the researcher associated with index $i$ is text[delimiter[i]:delimiter[i+1]]. This allows us to convert indices to names, and vice versa.


In [ ]:
def index2name(index, prefix, delimiter = None, text = None):
    if delimiter is None:
        delimiter, text = get_ids(prefix)
    return text[delimiter[index]:delimiter[index+1]]

In [ ]:
def name2index(name, prefix, delimiter = None, text = None):
    try:
        if delimiter is None:
            delimiter, text = get_ids(prefix)
        offset = text.index(name)
        return where(delimiter == offset)[0][0]
    except:
        print("Name not found.")

Let us try with some names. Note that the first execution will build the index.


In [ ]:
name2index("Paul_Erdös", "dblp")

In [ ]:
name2index("Fabien_Mathieu", "dblp")

In [ ]:
index2name(711561, "dblp")

In [ ]:
index2name(149114, "dblp")

Remark: The name2index function is very rough. It just tries to match name as a substring of text and finds the corresponding index in the delimiter array. It is quite slow and may fail if the name of a researcher is a substring of the name of another researcher, but it will be enough for this practical.
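
If exact lookups were ever needed, one possible workaround (a sketch, not required for this practical; build_name_dict is a hypothetical helper) is to build a dictionary from names to indices once:

In [ ]:
# Sketch: exact name-to-index lookup using the delimiter/text structure.
def build_name_dict(prefix):
    delimiter, text = get_ids(prefix)
    return {text[delimiter[i]:delimiter[i+1]]: i for i in range(len(delimiter) - 1)}

# Usage (uncomment to try):
# name_dict = build_name_dict("dblp")
# name_dict.get("Fabien_Mathieu")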

List comprehension

You may have already seen list comprehensions before: they are for example used in some of the functions above to convert a text line into a list of neighbors when parsing the adjacency list: [int(s) for s in line.split()].

List comprehensions are a powerful tool to construct a list by describing how it is built, and you will have to use them in this practical, so you should study the following examples.

A first simple example: the list of the squares of the integers from 0 to 5:


In [ ]:
[i**2 for i in range(6)]

One interesting feature of list comprehensions is that they can be nested. For example, here is the list of the squares of the first 6 positive odd integers.


In [ ]:
[i**2 for i in [2*k+1 for k in range(6)]]

A last example of list comprehension, which will be very helpful for the clustering coefficient. Can you figure out what it does?


In [ ]:
[k for nj in [range(j) for j in range(6)] for k in nj]

3. Power laws in the wild: undirected graphs (DBLP and IMDB)

DBLP (DataBase systems and Logic Programming or Digital Bibliography & Library Project) is THE database that records computer science publications. It records authors, conferences, journals... The co-authorship graph of DBLP is a good entry point to study undirected small-worlds.

There are multiple versions of the DBLP graph available. As in the previous practical, we will focus on the one available on KONECT.

Let us first see how to compute the size of the dataset.

Question 4

Write two functions build_degree(prefix) and get_degree(prefix, rebuild = False) that return the degrees of a dataset, that is, the number of neighbors of each node. These functions can be built on the same model as the functions build_adjacency and get_adjacency above.

Answer:

Question 5

Write two functions plot_pmf(prefix, rebuild = False) and plot_ccdf(prefix, rebuild = False) that display the degree distribution of a dataset in loglog scale. For example, they may build an array distribution such that the number of nodes that have degree $i$ is distribution[i]. Comment on the results in light of the previous parts of this practical.

Answer:

Question 6 (Optional)

Redo the previous questions with the IMDB dataset.

Answer:

4. Power laws in the wild: directed graphs (Wikipedia)

We now play with French and English crawls of Wikipedia made in 2013 and available on WebGraph. These graphs have been cleaned: only links from one article to another article are kept.

The two main differences with the (non-oriented) DBLP/IMDB databases are:

  • The graphs are now oriented: a link from $i$ to $j$ does not mean there is a link from $j$ to $i$.
  • The graphs are bigger. If you didn't optimize your code for DBLP/IMDB, you will probably have to optimize it now.

The French crawl is made of three files:

  • frwiki-2013.ids contains the article titles (one per line, $ n $ lines in total). By convention, the index $i$ of an article is its line number in this file (from $ 0 $ to $ n-1 $).
  • frwiki-2013.adja contains the adjacency list of the graph: for each $i = 0, \ldots, n-1$, line $i$ contains the indices of the articles to which $i$ links, written in plain ASCII.
  • frwiki-2013-t.adja contains the adjacency list of the transposed graph: for each $i = 0, \ldots, n-1$, line $i$ contains the indices of the articles that have a link to $i$, written in plain ASCII.

The English crawl is provided in a similar way, with the prefix enwiki-2013 instead of frwiki-2013. Note that it is roughly three times larger than the French crawl. Feel free to use the dataset(s) you want.

The questions are essentially the same as for the DBLP/IMDB datasets.

Question 7

Give the number of nodes and edges of the dataset(s).

Answer:

Question 8

Write two functions build_degrees(prefix) and get_degrees(prefix, rebuild = False) that return two arrays indegree and outdegree containing the in-degrees and out-degrees of a dataset, respectively. These functions can be built on the same model as the functions build_degree and get_degree defined in Question 4.

Answer:

Question 9

Write two functions plot_pmfs(prefix, rebuild = False) and plot_ccdfs(prefix, rebuild = False) that display the in-degree and out-degree distributions of a dataset in loglog scale. Comment on the results.

Answer: