INF674 S5: Power Laws

Céline Comte & Fabien Mathieu

2017 - 2018

If you want to deepen your theoretical knowledge of power laws, you can consult the optional references given in the course (this is not mandatory).


In [ ]:
%pylab inline
import zipfile
import gzip
import codecs

1. Albert-Barabási Degree Distribution

Question 1

Write a function that gives, for given $\alpha$ and $n$, the vector of node degrees of an Albert-Barabási graph of $n$ nodes built from a seed graph $G(0)$ made of two nodes linked together.

Advice 1: Drawing proportionally to the degree may be a computational bottleneck. Noticing that the degrees of such an Albert-Barabási graph of size $n$ sum to $2n-2$, can you build an array of size $2n-2$ such that picking a node proportionally to its degree boils down to picking a uniformly random element of the array?

Advice 2: The function bincount from the numpy package could be useful.
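
To illustrate both advices, here is a minimal sketch. It assumes the attachment rule from the course (each new node links to a single existing node, chosen proportionally to the degrees with probability $\alpha$ and uniformly at random otherwise); adapt it if your model differs.


In [ ]:
def ab_degrees(alpha, n):
    # bag contains one entry per half-edge, so a uniform draw from its
    # filled part picks a node proportionally to its degree (Advice 1)
    bag = zeros(2*n-2, dtype=int)  # the degrees sum to 2n-2
    bag[1] = 1                     # seed G(0): nodes 0 and 1 linked
    filled = 2                     # number of half-edges so far
    for i in range(2, n):
        if rand() < alpha:
            target = bag[randint(filled)]  # preferential attachment
        else:
            target = randint(i)            # uniform attachment
        bag[filled] = target  # half-edge on the target side
        bag[filled+1] = i     # half-edge on the new node side
        filled += 2
    return bincount(bag, minlength=n)  # degree = multiplicity in bag (Advice 2)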

Answer:

Question 2

After choosing a value of $n$:

  • such that the code above runs reasonably fast (less than a dozen seconds),
  • as large as you can,

study the degree distribution (CCDF and number of nodes having a given degree) for a few values of $\alpha$ between $0$ and $1$. Compare the results to what you saw in the course.
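
As a starting point, here is a minimal plotting sketch; it assumes a degrees array produced by your Question 1 function (here the ab_degrees sketch above, with hypothetical values of $\alpha$ and $n$).


In [ ]:
degrees = ab_degrees(0.5, 100000)  # hypothetical alpha and n
dist = bincount(degrees)[1:]       # dist[i-1] = number of nodes of degree i
d = arange(1, len(dist)+1)
ccdf = cumsum(dist[::-1])[::-1] / len(degrees)  # P(degree >= d)
loglog(d, dist, 'o', label='number of nodes of degree d')
loglog(d, ccdf, label='CCDF')
xlabel('degree d')
legend()
show()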

Answer:

Question 3 (Bonus)

What can you say about the specific case $\alpha = 1$?

Answer:

2. Power Laws on Datasets

When doing research, you frequently have to deal with datasets. Sometimes, you will have to produce them yourself. This is not as easy as it seems, and a full course would probably only scratch the surface of the required skills. Sometimes, others have done all the heavy lifting for you, and you just need to learn how to use existing datasets for your specific needs.

File Format: The datasets are provided with the following format:

  • dataset.ids contains the actual names of the nodes (one per line, $ n $ lines in total). By convention, each node is associated with its line number (from $ 0 $ to $ n-1 $). Actual names may contain special characters (e.g. ç, é), so the file is encoded in UTF-8.
  • dataset.adja contains the adjacency list of the graph: line $ i $ (from $ 0 $ to $ n-1 $) contains, in plain ASCII, the numbers of the nodes that are neighbors of $ i $.

Remind Python where your dataset files are!


In [ ]:
directory = "../Datasets/"

Caching results

It can be a burden to parse the same data again and again from ASCII files and convert it into a proper format. The following function will be used to automatically save the variables you extract from the dataset in a more numpy-friendly format. It currently comes in two flavors:

  • using compress = False will store variables in uncompressed npy files. Very fast to read (especially if you have an SSD), but costly in storage.
  • using compress = True will store variables in compressed npy files. Storage-efficient, but CPU-costly. This can be interesting if you have a mechanical hard drive, an ultra-fast CPU, or a storage shortage.

Also note the presence of a rebuild flag to force the variable to be recomputed.

Thanks to Pierre-Antoine (ACN 2017-2018) for suggesting the approach (even if his solution based on npz seemed to work only on his laptop for some reason, so I had to try something else).


In [ ]:
def use_cache(builder, prefix="dblp", variable="size", rebuild=False, compress=False):
    try:
        if rebuild:
            raise ValueError('Value needs to be rebuilt')
        # allow_pickle=True is required by recent numpy versions to load
        # object arrays (e.g. the ids variable built below)
        if compress:
            with gzip.GzipFile(directory+prefix+"-"+variable+".npy.gz", "r") as f:
                return np.load(f, allow_pickle=True)
        else:
            return np.load(directory+prefix+"-"+variable+".npy", allow_pickle=True)
    except (OSError, IOError, ValueError):
        # Cache miss (or forced rebuild): compute the variable and save it
        data = builder(prefix)
        if compress:
            with gzip.GzipFile(directory+prefix+"-"+variable+".npy.gz", "w") as f:
                np.save(f, data)
        else:
            np.save(directory+prefix+"-"+variable, data)
        return data

# Set default behavior
compress = True

Most of the core functions below have the following behavior: first they try to load the results from an npy file if one exists; otherwise they parse the dataset to extract the information and save it in an npy file for the next use. This approach avoids redoing the same work over and over again.

Size

The following function gives the number of nodes $n$ and the total number of directed edges $m$ of the graph. If the graph is undirected, each edge is counted twice ($(i,j)$ and $(j,i)$ are the same edge in an undirected graph), so the actual number of edges is $\frac{m}{2}$.


In [ ]:
def build_size(prefix = "dblp"):
    n = 0
    m = 0
    with zipfile.ZipFile(directory+prefix+".zip") as myzip:
        with myzip.open(prefix+".adja") as f:
            for line in f:
                n += 1                  # one line per node
                m += len(line.split())  # one entry per directed edge
    size = array([n, m])
    return size

def get_size(prefix = "dblp", rebuild = False):
    return use_cache(build_size, prefix, variable = "size", rebuild = rebuild, compress = compress)

Let us run this function to create the npy file.


In [ ]:
prefix = "dblp"
n, m = get_size(prefix)
print("Number of nodes in %s: %s" % (prefix, n))
print("Number of edges in %s: %s" % (prefix, m//2))

Adjacency List

A natural way to store the adjacency list would be to use an array (or list) of arrays (or lists), such that A[i][j] would be the $j$th neighbor of node $i$. In practice, this structure can have a significant memory overhead, so we will store the adjacency list in a single flat array with the function below.


In [ ]:
def build_adjacency(prefix = "dblp"):
    n, m = get_size(prefix)
    A = zeros(n+m+1, dtype = int)
    A[0] = n+1 # Don't forget the +1!!! The neighbor lists start right after the n+1 index cells.
    with zipfile.ZipFile(directory+prefix+".zip") as myzip:
        with myzip.open(prefix+".adja") as f:
            i = 0
            for line in f:
                neighbors = array(line.split(), dtype = int)
                A[i+1] = A[i]+len(neighbors)  # index where the list of node i+1 starts
                A[A[i]:A[i+1]] = neighbors    # list of neighbors of node i
                i += 1
    return A

def get_adjacency(prefix = "dblp", rebuild = False):
    return use_cache(build_adjacency, prefix, variable = "adjacency", rebuild = rebuild, compress = compress)

We can load $A$ in memory.


In [ ]:
A = get_adjacency("dblp")

The result, A, is a numpy array of integers of size $n+m+1$, organized as follows:

  • The first $n+1$ values are indices
  • The last $m$ values are destinations
  • The neighbors of node $i$ are stored in A[A[i]:A[i+1]], for each $i = 0,\ldots,n-1$

The following function just returns the neighbors.


In [ ]:
def neighbors(A, index):
    return A[A[index]:A[index+1]]

In practice, just use A[A[i]:A[i+1]] if you can, as it avoids a function call.
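
For example, to get the neighbors and the degree of node $0$:


In [ ]:
print(neighbors(A, 0))  # same as A[A[0]:A[1]]
print(A[1] - A[0])      # the degree of node i is simply A[i+1] - A[i]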

Index / Name conversion

All the functions above assume a node is represented by an integer $0\leq i<n$, but researchers, Wikipedia pages, and even actors have names! Let us write some functions to translate integers to names and vice versa.


In [ ]:
def build_ids(prefix = "dblp"):
    n, m = get_size(prefix)
    delimiter = zeros(n+1, dtype = int)
    text = ""
    with zipfile.ZipFile(directory+prefix+".zip") as myzip:
        with myzip.open(prefix+".ids") as f:
            i = 0
            for line in codecs.iterdecode(f, 'utf8'):
                delimiter[i+1] = delimiter[i]+len(line)-1 # -1 strips the trailing newline
                text += line[0:-1]                        # concatenate the name, newline excluded
                i += 1
    return [delimiter, text]
    
def get_ids(prefix = "dblp", rebuild = False):
    return use_cache(build_ids, prefix, variable = "ids", rebuild = rebuild, compress = compress)

The function above returns a delimiter array of size $n+1$ and a text string that concatenates all researcher names. It uses the same principle as the adjacency list: the name of the researcher associated with number $i$ is text[delimiter[i]:delimiter[i+1]]. This allows us to convert from name to index, and vice versa.


In [ ]:
def index2name(index, prefix = "dblp", delimiter = None, text = None):
    if delimiter is None:
        delimiter, text = get_ids(prefix)
    return text[delimiter[index]:delimiter[index+1]]

In [ ]:
def name2index(name, prefix = "dblp", delimiter = None, text = None):
    try:
        if delimiter is None:
            delimiter, text = get_ids(prefix)
        offset = text.index(name)  # raises ValueError if name is not a substring of text
        return where(delimiter == offset)[0][0]
    except (ValueError, IndexError):
        print("Name not found.")

Let us try with some names. Note that the first execution will build the index.


In [ ]:
name2index("Paul_Erdös")

In [ ]:
name2index("Fabien_Mathieu")

In [ ]:
index2name(711561)

In [ ]:
index2name(149114)

Remark: the name2index function is very rough: it just tries to match name as a substring of text and finds the corresponding index in the delimiter array. It is quite slow and may fail if the name of a researcher is a substring of the name of another researcher, but it will be enough for this practical.
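
If you need many exact lookups, a possible workaround (not needed for this practical) is to build a dictionary once: after a single $O(n)$ pass, each exact-match lookup is then $O(1)$.


In [ ]:
delimiter, text = get_ids("dblp")
name_dict = {text[delimiter[i]:delimiter[i+1]]: i
             for i in range(len(delimiter)-1)}
name_dict["Fabien_Mathieu"]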

Datasets will be given on a USB key. This practical assumes they are stored in ../Datasets/ relative to your working directory (adjust according to your own organization).

DBLP dataset

DBLP (DataBase systems and Logic Programming or Digital Bibliography & Library Project) is THE database that records CS publications. It records authors, conferences, journals... It is a good entry point to study an example of an undirected small-world graph: the co-authorship graph.

There are multiple versions of the DBLP graph available. For this practical, we will focus on the one available at http://konect.uni-koblenz.de/networks/dblp_coauthor

The size computation above was a first, simple example of how to process this dataset.

Question 1 (Easy)

Write a function that returns the degrees of a dataset, i.e. an array giving, for each node, its number of neighbors in the adjacency list. For directed graphs, the function will return the out-degrees. A possible one-line sketch is given below.
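
Hint: with the flat adjacency array described above, no loop is needed (a sketch, assuming A and n are loaded as before):


In [ ]:
degrees = A[1:n+1] - A[:n]  # A[i+1]-A[i] is the degree of node i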

Answer:

Question 2

Write a function that gives the degree distribution. For example, it may return an array deg_dist such that the number of nodes that have degree $i$ is deg_dist[i-1]. Display the degree distribution on a loglog scale. Also display the Complementary Cumulative Distribution Function of the degree. Comment on the results in light of the previous parts of this practical.

Answer:

IMDB dataset

Question 3

Redo Question 2 for the IMDB dataset.

Answer:

Wikipedia dataset

We now play with French and English crawls of Wikipedia made in 2013 and available at http://webgraph.di.unimi.it/. The graphs have been cleaned: only links from one article to another article are kept.

Two main differences with the DBLP/IMDB datasets you have just dealt with:

  • The graphs are now directed: a link from $i$ to $j$ does not mean there is a link from $j$ to $i$.
  • The graphs are bigger. If you didn't optimize your code for DBLP, you will probably have to do so here.

The French crawl is made of three files:

  • frwiki-2013.ids contains the article titles (one per line, $ n $ lines in total). By convention, each article is associated with its line number (from $ 0 $ to $ n-1 $).
  • frwiki-2013.adja contains the adjacency list of the graph: line $ i $ (from $ 0 $ to $ n-1 $) contains, in plain ASCII, the numbers of the articles that $ i $ links to.
  • frwiki-2013-t.adja contains the adjacency list of the transposed graph: line $ i $ (from $ 0 $ to $ n-1 $) contains the numbers of the articles that have a link to $ i $.

The English crawl is provided in a similar way, with the prefix enwiki-2013 instead of frwiki-2013. Note that it is roughly three times bigger than the French crawl. Feel free to use the dataset(s) you want.

The questions are essentially the same as for the DBLP dataset.

Question 4

Give the number of nodes and edges of the dataset(s).

Answer:

Question 5

Display in a loglog scale the degree distribution(s). Comment the results.

Answer:

3. Cryptocurrency

A friend of yours sends the following link to you: https://medium.com/@jamiecopeland_52423/zipfian-distribution-of-cryptocurrency-market-capitalizations-649bf31e2fb8

As you are now a specialist in power law matters, he asks for your opinion. What should you answer?

Answer: