If you want to deepen your theoretical knowledge of power laws, you can read (this is not mandatory):
Other optional references are given in the course.
In [ ]:
%pylab inline
A random variable $X$ has a lognormal distribution if the random variable $Y = log(X)$ has a normal distribution. The normal distribution with mean $\mu$ and standart deviation $\sigma$ is given by the density function $$ f(y) = \frac1{\sqrt{2\pi} \sigma} e^{-(y - \mu)^2 / 2 \sigma^2}, \quad \forall y \in \mathbb{R}. $$ The density function of the corresponding log-normal distribution is then $$ g(x) = \frac1{\sqrt{2\pi} \sigma x} e^{-(\ln(x) - \mu)^2 / 2 \sigma^2}, \quad \forall x > 0. $$
On a log-log plot, plot the probability density function of a power law with parameters $\beta = 1/2, 1$ and $x_{min} = 1$ and a log-normal distribution with approximately the same slope. What can you observe?
Answer: For all $x > 0$, we have $$ \ln(g(x)) = - \left( 1 - \frac\mu{\sigma^2} \right) \ln(x)
Recall that the probability density function of a power law with parameters $\beta > 0$ and $x_{min} = 1$ is given by $$ f(x) = \frac\beta{x^{\beta + 1}}, \quad \forall x > 1. $$ This yields $$ \ln(f(x)) = \ln(\beta) - (\beta + 1) \ln(x), \quad \forall x > 1, $$ so that the slope is $\beta + 1$.
Hence, we can let $(\mu = \beta \sigma^2,\sigma)$ for any $\sigma > 0$ (the corresponding log-normal distribution then has a mean $e^{\mu + \sigma^2/2} = e^{(1/2 - \beta) \sigma^2}$). We plot the curves for different values of $\sigma$.
Write a function that gives, for given $\alpha$ and $n$, the vector of node degrees of an Albert-Barabási graph of $n$ nodes built from a graph seed $G(0)$ of two nodes linked together. Advice: drawing proportionnally to degree may be a computational bottleneck. Noticing that the degrees of such an AB graph of size $n$ sum to $2n-2$, can you build an array of size $2n-2$ such that picking a node proportionnally to its degree corresponds to picking up a random uniform element of the array?
Answer: The trick is to log the start and end nodes of added edges in some growing (pre-allocated ;) ) array, so at any time any node is present in the array with multiplicity equal to its current degree. The code can be made faster the numpy way, but it is not necessary (to do... or not).
After choosing a value of $n$:
Study the degree distribution (CCDF and number of nodes having specified degrees) for a few values of $\alpha$ between $0$ and $1$. Compare the results to what you saw in course.
Answer:
Answer: If you look at the formula from the course, you get $\frac{c_{i+1}}{c_i}=\frac 1 2$. In other words, you have a geometric distribution and the probability to have degree $i$ ($i\geq 1$) is $\frac{1}{2^i}$. You can see it by drawing the previous results on a semilog plot.
In [ ]:
figure()
semilogy((arange(1, len(deg_dist3)+1)), deg_dist3, 'g.', label = "$\\alpha = 1$")
xlabel("Degree")
ylabel("Number of nodes")
legend(loc = 1, numpoints = 1)
show()
When doing research, it is frequent to deal with some datasets. Sometimes, you will have to produce them yourself. This is not as easy as it seems and a full course would probably only scratch the surface of the required skills. Sometimes, others have done all the heavy lifting for you, and you just need to learn how to use existing datasets for your specific needs.
File Format: The datasets are provided with the following format:
Datasets will be given on a USB key. This practical assumes they are stored in ../Datasets/ relatively to your working directory (adjust according to your own organization).
DBLP (DataBase systems and Logic Programming or Digital Bibliography & Library Project) is THE database that records CS publications. It records authors, conferences, journals... It is a good entry point to study a good example of undirected small-world: the co-authorship graph.
There are multiple versions of the DBLP graph available. For this practical, we will focus on the one available in http://konect.uni-koblenz.de/networks/dblp_coauthor
Let us begin with a simple example on how to compute the size of the Dataset.
In [ ]:
import codecs
directory = "../Datasets/"
def get_size_from_dataset(prefix = "dblp"):
n = 0
m = 0
with codecs.open(directory+prefix+".adja", "r", "utf-8") as f:
for line in f:
n += 1
m += len([int(s) for s in line.split()])
return n, m
Note: the use of codecs is to prepare you to using .ids files, which may have special characters.
Answer:
Hum, this takes quite a while just to get such simple information. Let us try the following approach:
In [ ]:
def get_size(prefix = "dblp"):
try:
return np.load(directory+prefix+"-size.npy")
except IOError:
np.save(directory+prefix+"-size", get_size_from_dataset(prefix))
return get_size(prefix)
Test and explain the interest of using the get_size function (remark: you are very strongly encouraged to re-use this technique).
Answer:
Write a function that gives the degree distribution. For example, it may return an array deg_dist such that the number of nodes that have degree $i$ is deg_dist[i-1]. Display the degree distribution in a loglog scale. Also display the Complentary Cumulative Distribution Function of the degree. Comment the results in view of the previous parts of this practical.
Answer:
We now play with French and English crawls of Wikipedia made in 2013 and available on http://webgraph.di.unimi.it/. The graphs have been cleaned: only links from one article to another article are kept.
Two main differences with the DBLP database you just dealt with:
The French crawl is made of three files:
The English crawl is provided in a similar way, with the prefix enwiki-2013 instead of frwiki-2013. Note that it is roughly three times bigger than the French crawl. Feel free to use the dataset(s) you want.
The questions are essentially the same than for the DBLP dataset.
Give the number of nodes and edges of the dataset(s).
Answer:
Display in a loglog scale the degree distribution(s). Comment the results.
Answer: