INF674 S8: Wikipedia PageRank

Céline Comte & Fabien Mathieu

2016 - 2017

You learned the theoretical basis of PageRank in the previous session. Time to get your hands dirty.


In [ ]:
%pylab inline

1. Reminders

Wikipedia Dataset

As a reminder, we use the French and English crawls of Wikipedia made in 2013 and available at http://webgraph.di.unimi.it/. The graphs have been cleaned: only links from one article to another article are kept.

There are two main differences with the DBLP dataset:

  • The graphs are now oriented: a link from $i$ to $j$ does not imply a link from $j$ to $i$.
  • The graphs are bigger. You need to be smart about the way you write your code, both from a theoretical and a practical point of view.

The French crawl is made of three files:

  • frwiki-2013.ids contains the article titles (one per line, $ n $ lines in total). By convention, each article is associated to its line number (from $ 0 $ to $ n-1 $).
  • frwiki-2013.adja contains the adjacency list of the graph: line $ i $ (from $ 0 $ to $ n-1 $) contains, in plain ASCII, the numbers of the articles that are linked by $ i $.
  • frwiki-2013-t.adja contains the adjacency list of the transposed graph: line $ i $ (from $ 0 $ to $ n-1 $) contains the numbers of the articles that have a link to $ i $.

The English crawl is provided in a similar way, with the prefix enwiki-2013 instead of frwiki-2013. Note that it is roughly three times bigger than the French crawl. Feel free to use the dataset(s) you want, but starting with French is recommended (a PageRank computation can take a few minutes on a slow machine).

This practical assumes the datasets are stored in ../Datasets/ relative to your working directory. If that is not the case, just change the following cell accordingly.


In [ ]:
directory = "../Datasets/"

Size and degree

The following function gives the number of nodes $n$ and the total number of oriented edges $m$ of the graph.

The function tries to load the size from a npy file if one exists; otherwise, it parses the dataset to extract the information and saves it in a npy file.


In [ ]:
def get_size(prefix = "enwiki-2013"):
    try:
        # Use the cached result if it exists
        return np.load(directory+prefix+"-size.npy")
    except IOError:
        # Otherwise, count the lines (nodes) and the tokens (edges) of the adjacency file
        n = 0
        m = 0
        with open(directory+prefix+".adja") as f:
            for line in f:
                n += 1
                m += len(line.split())
        np.save(directory+prefix+"-size", [n, m])
        return n, m

Now, the function for extracting degrees. Recall that in oriented graphs, you have to distinguish between indegree and outdegree (cf. the Power Laws practical): applied to a prefix, the function below returns outdegrees; applied to the transposed graph (the -t suffix), it returns indegrees.


In [ ]:
def get_degree(prefix = "enwiki-2013"):
    try:
        # Use the cached result if it exists
        return np.load(directory+prefix+"-deg.npy")
    except IOError:
        n, m = get_size(prefix)
        degree = zeros(n, dtype = int)
        i = 0
        with open(directory+prefix+".adja") as f:
            for line in f:
                # Degree of node i = number of neighbors listed on line i
                degree[i] = len(line.split())
                i += 1
        np.save(directory+prefix+"-deg", degree)
        return degree

Adjacency List

In order to avoid crippling your RAM, we'll use the same format as the one introduced for studying the DBLP dataset.


In [ ]:
def get_adjacency(prefix = "enwiki-2013"):
    try:
        # Use the cached result if it exists
        return np.load(directory+prefix+"-adja.npy")
    except IOError:
        n, m = get_size(prefix)
        A = zeros(n+m+1, dtype = int)
        A[0] = n+1 # Don't forget the +1!!!
        with open(directory+prefix+".adja") as f:
            i = 0
            for line in f:
                neighbors = [int(s) for s in line.split()]
                # A[i+1] marks the end of the neighbor block of node i...
                A[i+1] = A[i]+len(neighbors)
                # ...and the block itself is stored in A[A[i]:A[i+1]]
                A[A[i]:A[i+1]] = neighbors
                i += 1
        np.save(directory+prefix+"-adja", A)
        return A

The result, A, is a numpy array of integers of size $n+m+1$, organized as follows:

  • The first $n+1$ values are indexes
  • The last $m$ values are destinations
  • The neighbors of a node $i$ are stored in A[A[i]:A[i+1]]
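
For instance, the slice for one article can be checked directly (a quick sanity check, assuming the frwiki-2013 files are available):


In [ ]:
# Quick sanity check of the format (assumes the frwiki-2013 files are available)
A = get_adjacency("frwiki-2013")
degree = get_degree("frwiki-2013")
i = 0
neighbors = A[A[i]:A[i+1]]  # out-neighbors of article i
print(len(neighbors) == degree[i], neighbors[:10])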

Index / Name conversion

The following cell extracts the article titles (an updated version compared to the one used previously, because the old way was too slow).


In [ ]:
import codecs
def get_ids(prefix = "enwiki-2013"):
    try:
        # Use the cached result if it exists
        return np.load(directory+prefix+"-ids.npy")
    except IOError:
        n, m = get_size(prefix)
        delimiter = zeros(n+1, dtype = int)
        with codecs.open(directory+prefix+".ids", "r", "utf-8") as f:
            i = 0
            for line in f:
                # len(line)-1 drops the trailing newline
                delimiter[i+1] = delimiter[i]+len(line)-1
                i += 1
            f.seek(0)
            # All titles are concatenated in one string; delimiter tells where each one starts
            text = ''.join([line[0:-1] for line in f])
        np.save(directory+prefix+"-ids", [delimiter, text])
        return delimiter, text

Then the traditional index2name translator (remark: name2index will not be required for this practical).


In [ ]:
def index2name(index, prefix = "enwiki-2013", delimiter = None, text = None):
    if delimiter is None:
        delimiter, text = get_ids(prefix)
    return text[delimiter[index]:delimiter[index+1]]

2. Computing Rankings

Question 1

Propose a function that takes as input a vector $ P $ of size $ n $, an integer $ k>0 $ (default $ k=10 $) and a prefix for a dataset of size $n$. The function should print the titles of the $ k $ articles with the highest values in $ P $. Test your function by displaying the names of the $ k $ articles with highest indegree, then the names of the $k$ articles with highest outdegree.
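
As a starting point, here is one possible sketch of such a display function (the name top_k and its defaults are ours; feel free to write your own):


In [ ]:
# One possible sketch: print the k article titles with the highest values in P
def top_k(P, k = 10, prefix = "frwiki-2013"):
    delimiter, text = get_ids(prefix)
    for i in argsort(P)[-1:-k-1:-1]:  # indices of the k largest values, in decreasing order
        print(index2name(i, prefix, delimiter, text), P[i])

# Example calls (indegrees are the degrees of the transposed graph):
# top_k(get_degree("frwiki-2013-t"), prefix = "frwiki-2013")   # highest indegrees
# top_k(get_degree("frwiki-2013"), prefix = "frwiki-2013")     # highest outdegrees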

In terms of ranking, which seems more relevant: indegree or outdegree? (Remember to justify your answer.)

Answer:

Question 2

Let $A$ be the transition matrix seen in the course, defined by $$A[i,j] = \begin{cases}\frac{1}{outdegree(i)} & \text{if }i\rightarrow j,\\ 0 & \text{otherwise.}\end{cases}$$

Compute a PageRank based on the following iteration: do $P_{t+1} = \alpha P_t A + (1-\alpha)Z$ until $\|P_{t+1}-P_t\|_1\leq \epsilon \|P_t\|_1$ (here $t$ is the iteration index).

You will probably use a $P_{old}$ vector to store the previous iteration and a $P_{new}$ vector to store the new one.

Monitor (e.g. print) the time per iteration and the total time (you can use import time and time.clock()).

Comment on the results (top $k$).

Recommended values: $\alpha = 0.7$, $Z = [1,\ldots, 1]$, $\epsilon=0.1$ (you may go down to $0.01$ if you have a fast computer).
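
For reference, here is one possible naive translation of this iteration (the name naive_pagerank is ours). It uses the pull form over the transposed graph, which computes the same $P_{t+1}$ as the formula, one coordinate at a time: $P_{t+1}[i] = \alpha \sum_{j\rightarrow i} \frac{P_t[j]}{outdegree(j)} + (1-\alpha)Z[i]$.


In [ ]:
import time

# Naive sketch (not optimized): P_new[i] = alpha * sum_{j -> i} P_old[j] / outdegree(j) + (1 - alpha) * Z[i]
def naive_pagerank(prefix = "frwiki-2013", alpha = 0.7, epsilon = 0.1):
    n, m = get_size(prefix)
    T = get_adjacency(prefix + "-t")   # adjacency of the transposed graph (in-neighbors)
    out = get_degree(prefix)           # out-degrees of the direct graph
    Z = ones(n)
    P_old = ones(n)
    total_time = time.clock()
    stable = False
    while not stable:
        iter_time = time.clock()
        P_new = zeros(n)
        for i in range(n):
            innodes = T[T[i]:T[i+1]]   # articles pointing to i (their out-degree is >= 1)
            P_new[i] = alpha * sum(P_old[innodes] / out[innodes]) + (1 - alpha) * Z[i]
        stable = sum(abs(P_new - P_old)) <= epsilon * sum(abs(P_old))
        P_old = P_new
        print("Iteration time: " + str(time.clock() - iter_time))
    print("Total time: " + str(time.clock() - total_time))
    return P_old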

Answer:

3. Optimization

The algorithm above may be a little bit slow. Let's do some tuning. You will learn to speed up your algorithm by mixing theoretical and practical considerations, which will allow you to play more with the last part. Don't forget to compare your results with those of the previous question to verify that you did not break anything.

It is OK if the name of your PageRank function remains the same throughout all questions.

Question 1

Playing with $ \alpha $ and with the precision $\epsilon$ may help to speed up the process. Is it a good idea?

Answer:

Question 2

Try to reduce the time per iteration. How many divisions do you perform per iteration? If it is $m$, you should be able to cut that down to $ n $ (with better memory access too). Beware of leaves (nodes with outdegree $0$)! How does the time per iteration evolve? Also note that, for somewhat obscure reasons, multiplications are slightly faster than divisions in Python.
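
As a hint, the divisions can be taken out of the inner computation by precomputing the inverse out-degrees once (a sketch; the placeholder value given to leaves is harmless because a leaf never appears as an in-neighbor):


In [ ]:
# Sketch: n divisions done once, then only multiplications inside the iterations
out = get_degree("frwiki-2013").astype(float)
out[out == 0] = 1.          # leaves: any non-zero placeholder, they never show up as in-neighbors
inv_out = 1. / out          # n divisions, once and for all
# In the core loop, P_old[innodes] / out[innodes] becomes P_old[innodes] * inv_out[innodes]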

Answer:

Question 3

Reduce the number of iterations. Instead of using a new vector for $PA$, try to do the update in place. How does the number of iterations evolve (note: you will probably not notice a difference if $\epsilon$ is too large, so try $\epsilon=0.01$)? For the record, this is called the Gauss-Seidel method, and one can prove that it improves convergence.
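
A sketch of the corresponding in-place sweep (the name gauss_seidel_sweep is ours; T is the transposed adjacency array and inv_out the precomputed inverse out-degrees from the previous question):


In [ ]:
# Gauss-Seidel sketch: a single vector P, overwritten in place during the sweep,
# so later nodes of the same sweep already see the updated values
def gauss_seidel_sweep(P, T, inv_out, alpha = 0.7):
    for i in range(len(P)):
        innodes = T[T[i]:T[i+1]]   # in-neighbors of i
        P[i] = alpha * sum(P[innodes] * inv_out[innodes]) + (1 - alpha)
    return P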

Answer:

Question 4

For practical use, the ranking matters but not the actual importance values. Replace the stopping condition by: the last iteration did not change the ranking of the top $ k $ articles (e.g. with $ k = 20 $). If you did the previous optimization, you probably do not need to store two distinct versions of $P$.

Answer:

Question 5 (bonus)

Look at the following code. Can you tell what it does? Compare it with the previous PageRank function in terms of results and efficiency (time and memory).


In [ ]:
import time

def diteration(prefix = "frwiki-2013", alpha = 0.75, k = 20):
    # Setup
    n, m = get_size(prefix)
    H = zeros(n)
    F = ones(n)
    A = get_adjacency(prefix)
    total_time = time.clock()
    ind = argsort(H)[-1:-k-1:-1]
    stable = False
    iter = 0
    # Outer loop
    while not(stable):
        iter += 1
        iter_time = time.clock()
        ind_old = ind
        # Inner loop
        for i in nditer(nonzero(F>alpha*mean(F))):
            outnodes = A[A[i]:A[i+1]]
            H[i] += F[i]
            if outnodes.size:
                F[outnodes] += alpha*F[i]/outnodes.size
            F[i]=0
        ind = argsort(H)[-1:-k-1:-1]
        stable = (ind_old == ind).all()
        print("Iteration {}: time ".format(iter)+str(time.clock()-iter_time))
    print("Total time: "+str(time.clock()-total_time)+" ({} iterations)".format(iter))
    return H

Answer:

Question 6

Instead of using $ [1,\ldots, 1] $ as the initial value, why not use a previously saved computation? The benefit: if you have already computed a PageRank for a small $ k $ and want better precision (larger $ k $), you do not need to restart from scratch. For the record, this is how the PageRank of Web graphs is updated when a fresh crawl is made: the old PageRank serves as an educated guess for the new one.

Answer:


In [ ]:
import time

def pagerank(prefix = "frwiki-2013", alpha = 0.7, k=20):
    # Setup
    n, m = get_size(prefix)
    A = get_adjacency(prefix+"-t")
    inv_out = get_degree(prefix)
    inv_out[inv_out == 0] = 1
    inv_out = 1/inv_out
        
    # Initiate PR Value
    pname = prefix + '-P'
    try:
        P = np.load(pname + '.npy')*inv_out
        print('Using Previous Estimate as Starting Point')
    except IOError:
        print('Start with PR = [1,...,1]')
        P = ones(n)*inv_out/(1-alpha)

    ind = argsort(P/inv_out)[-1:-k-1:-1]
    total_time = time.clock()
    stable = False
    iter = 0
    # Iterations loop
    while not(stable):
        iter += 1
        iter_time = time.clock()
        ind_old = ind
        # Core loop
        for i in range(n):
            P[i] = (alpha*sum(P[A[A[i]:A[i+1]]])+1)*inv_out[i]
        ind = argsort(P/inv_out)[-1:-k-1:-1]
        stable = (ind_old == ind).all()
        print("Iteration {}: time ".format(iter)+str(time.clock()-iter_time))
    print("Total time: "+str(time.clock()-total_time)+" ({} iterations)".format(iter))
    P = P / inv_out
    np.save(pname, P)
    return P

Question 7 (bonus)

Discuss what should be done to compute PageRank on a much larger graph (hundreds of millions, or even billions, of nodes).

Answer:

4. Customization

The only semantic information you have about the datasets is the names of the articles. Is that enough to get pages relevant to a given subject?

Remark: It is strongly recommended (but not mandatory) that you work on the Optimization questions before starting this part.

Question 1

The algorithm uses a uniform vector $ Z = [1, \ldots, 1] $. Modify it to take an additional argument s (a string) such that $ Z[i] $ is 1 if s appears in the title of article $ i $, and 0 otherwise. Save your $ Z $ and $ P $ in files with s in their names. Test your algorithm on some values of s and comment. Possible inputs (don't hesitate to propose your own; some inputs work better than others):

  • for frwiki-2013: film, biologie, physique, philosoph, ...
  • for enwiki-2013: movie, biology, physic, philosoph, ...

As usual, comment on the results!
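
One possible sketch for building such a $Z$ from the titles (the name build_z is ours; matching on lowercased titles is a design choice you may change):


In [ ]:
# Sketch: Z[i] = 1 if s appears in the title of article i, 0 otherwise
def build_z(s, prefix = "frwiki-2013"):
    n, m = get_size(prefix)
    delimiter, text = get_ids(prefix)
    Z = zeros(n)
    for i in range(n):
        if s in text[delimiter[i]:delimiter[i+1]].lower():   # case-insensitive match (design choice)
            Z[i] = 1
    np.save(directory + prefix + "-" + s + "-Z", Z)           # save Z with s in the file name
    return Z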

Answer:

Question 2

Using a non-uniform $ Z $ will create a bias in favor of the pages that contain s in the title. A simple workaround consists in ranking according to $ P-(1-\alpha)Z $ instead of $ P $ ($ P-(1-\alpha)Z $ is called the residual PageRank).

Explain the meaning of $P-(1-\alpha)Z$ and modify your algorithm to output it instead of $P$.
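
In code, this only changes what is ranked and returned, not the iteration itself. A minimal illustration, assuming $P$, $Z$ and $\alpha$ are those of the previous question:


In [ ]:
# Rank by the residual PageRank instead of P (P, Z and alpha computed in the previous question)
residual = P - (1 - alpha) * Z
for i in argsort(residual)[-1:-21:-1]:   # top 20
    print(index2name(i, "frwiki-2013"))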

Answer:

Question 3 (bonus)

Adapt the algorithm from 3. Optimization Question 5 to compute the residual PageRank. Discuss the differences.

Remark: the optimization from Question 3.6 is not easy to apply to the algorithm from Question 3.5. Don't try to implement it.

Answer:

Question 4 (bonus)

Try the other dataset (English / French).

Answer: