You learned the theoretical basis of PageRank in a previous session and earlier today. Time to get your hands dirty.
In [ ]:
%pylab inline
import zipfile
import gzip
import codecs
We use French and English crawls of Wikipedia made in 2013 and available at http://webgraph.di.unimi.it/. The graphs have been cleaned: only links from one article to another article are kept.
Two main differences with the DBLP database:
Important note: as the English dataset is quite a bit bigger than the French one, it is recommended to play with the French dataset first.
The datasets are provided in the form of zip archives. Each zip archive contains the following files:
You should have the datasets by now. If not, ask for them. This practical assumes they are stored in ../Datasets/ relative to your working directory (change the line below if that is not the case).
In [ ]:
directory = "../Datasets/"
It can be a burden to parse the same data from text files again and again and convert it into a proper format. The following function will be used to stash your variables in files for later use. Thanks to Pierre-Antoine (ACN 2017-2018) for pointing out the possibility.
Remark: compared to previous versions, the code has been slightly updated so that the transposition can be passed through a transpose boolean. That being said, your old files from previous practicals should still work, so you do not have to recompute them.
In [ ]:
def use_cache(builder, prefix="frwiki-2013", variable="size", transpose=False, rebuild=False, compress=False):
    # Load a previously saved result if possible; otherwise build it and save it.
    if transpose:
        t = "-t"
    else:
        t = ""
    try:
        if rebuild:
            raise ValueError('Value needs to be rebuilt')
        if compress:
            with gzip.GzipFile(directory + prefix + t + "-" + variable + ".npy.gz", "r") as f:
                # allow_pickle is required for results stored as object arrays (e.g. the ids below)
                return np.load(f, allow_pickle=True)
        else:
            return np.load(directory + prefix + t + "-" + variable + ".npy", allow_pickle=True)
    except Exception:
        data = builder(prefix, transpose=transpose)
        if compress:
            with gzip.GzipFile(directory + prefix + t + "-" + variable + ".npy.gz", "w") as f:
                np.save(f, data)
        else:
            np.save(directory + prefix + t + "-" + variable, data)
        return data

# Set default behavior
compress = False
Most of the core functions below have the following behavior: first they try to load the results from an npy file if one exists, otherwise they parse the dataset to extract the information and save it in an npy file for the next use. This approach avoids re-doing the same work over and over again.
You are strongly encouraged to use the same tactic for saving your PageRank computations.
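For instance, a cached PageRank could be wrapped as follows (a minimal sketch, assuming you have written the pagerank(prefix) function asked for later in this practical; the names build_pagerank and get_pagerank are only suggestions):
In [ ]:
def build_pagerank(prefix="frwiki-2013", transpose=False):
    # builder signature imposed by use_cache; transpose is not used here
    return pagerank(prefix)

def get_pagerank(prefix="frwiki-2013", rebuild=False):
    return use_cache(build_pagerank, prefix, variable="pagerank",
                     rebuild=rebuild, compress=compress)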
The following function gives the number of nodes $n$ and the total number of oriented edges $m$ of the graph.
In [ ]:
def build_size(prefix="frwiki-2013", transpose=False):
    # Count nodes (lines) and oriented edges (integers) in the adjacency file.
    n = 0
    m = 0
    with zipfile.ZipFile(directory + prefix + ".zip") as myzip:
        with myzip.open(prefix + ".adja") as f:
            for line in f:
                n += 1
                m += len([int(s) for s in line.split()])
    size = array([n, m])
    return size

def get_size(prefix="frwiki-2013", rebuild=False):
    return use_cache(build_size, prefix, variable="size", rebuild=rebuild, compress=compress)
Let us run it once to create the value if you didn't keep it from previous practicals.
In [ ]:
prefix = "frwiki-2013"
n, m = get_size(prefix)
print("Number of nodes in %s: %s" % (prefix, n))
print("Number of edges in %s: %s" % (prefix, m//2))
Now, the function for extracting degrees. Remember that in oriented graphs, you have to distinguish between indegree and outdegree (cf. the Power Laws practical).
In [ ]:
def build_degree(prefix="frwiki-2013", transpose=False):
    if transpose:
        t = "-t"
    else:
        t = ""
    n, m = get_size(prefix)
    degree = zeros(n, dtype=int)
    i = 0
    with zipfile.ZipFile(directory + prefix + ".zip") as myzip:
        with myzip.open(prefix + t + ".adja") as f:
            for line in f:
                degree[i] = len([int(s) for s in line.split()])
                i += 1
    return degree

def get_degree(prefix="frwiki-2013", transpose=False, rebuild=False):
    return use_cache(build_degree, prefix, variable="degree", transpose=transpose, rebuild=rebuild, compress=compress)
Let us parse the degrees.
In [ ]:
out_deg = get_degree(prefix)
in_deg = get_degree(prefix, transpose = True)
In [ ]:
loglog(sort(out_deg), 1-linspace(0,1,len(out_deg)), 'r', label = "Out Degree")
loglog(sort(in_deg), 1-linspace(0,1,len(in_deg)), 'b', label = "In Degree")
xlabel("Degree")
ylabel("CCDF")
title(prefix)
legend(loc = 3)
show()
In order to avoid crippling your RAM with a full matrix, we'll use the same format as the one introduced in previous practicals.
In [ ]:
def build_adjacency(prefix="frwiki-2013", transpose=False):
    if transpose:
        t = "-t"
    else:
        t = ""
    n, m = get_size(prefix)
    A = zeros(n + m + 1, dtype=int)
    A[0] = n + 1  # Don't forget the +1!!!
    with zipfile.ZipFile(directory + prefix + ".zip") as myzip:
        with myzip.open(prefix + t + ".adja") as f:
            i = 0
            for line in f:
                neighbors = array(line.split(), dtype=int)
                A[i + 1] = A[i] + len(neighbors)
                A[A[i]:A[i + 1]] = neighbors
                i += 1
    return A

def get_adjacency(prefix="frwiki-2013", transpose=False, rebuild=False):
    return use_cache(
        build_adjacency,
        prefix,
        variable="adjacency",
        transpose=transpose,
        rebuild=rebuild,
        compress=compress)
The result, A, is a numpy array of integers of size $n+m+1$, organized as follows: the first $n+1$ cells are offsets (A[i] tells where the neighbor list of node $i$ starts), and the remaining $m$ cells contain the concatenated neighbor lists, so that the out-neighbors of node $i$ are exactly A[A[i]:A[i+1]].
In [ ]:
A = get_adjacency(prefix)
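As a quick sanity check, recall that with this format the out-neighbors of node $i$ are the slice A[A[i]:A[i+1]]. The snippet below simply compares the length of that slice with the out-degree computed earlier (the index 123456 is arbitrary):
In [ ]:
i = 123456
neighbors_of_i = A[A[i]:A[i + 1]]       # out-neighbors of node i
print(len(neighbors_of_i), out_deg[i])  # the two numbers should match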
The following cell extracts the names.
In [ ]:
def build_ids(prefix="dblp", transpose=False):
n, m = get_size(prefix)
delimiter = zeros(n + 1, dtype=int)
text = ""
with zipfile.ZipFile(directory + prefix + ".zip") as myzip:
with myzip.open(prefix + ".ids") as f:
i = 0
for line in codecs.iterdecode(f, 'utf8'):
delimiter[i + 1] = delimiter[i] + len(line) - 1
text += line[0:-1]
i += 1
return [delimiter, text]
def get_ids(prefix="dblp", rebuild=False):
return use_cache(
build_ids, prefix, variable="ids", rebuild=rebuild, compress=compress)
Then the traditional index2name translator (remark: name2index will not be required for this practical).
In [ ]:
def index2name(index, prefix="dblp", delimiter=None, text=None):
    if delimiter is None:
        delimiter, text = get_ids(prefix)
    return text[delimiter[index]:delimiter[index + 1]]
In [ ]:
index2name(123456, prefix)
Most ranking techniques give a score to each page, and then display the pages that have the best score.
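For instance, a small display helper could look like this (a minimal sketch; the name top_k and its exact signature are only a suggestion):
In [ ]:
def top_k(P, k=10, prefix="frwiki-2013"):
    # print the k articles with the highest score, best first
    delimiter, text = get_ids(prefix)
    for i in argsort(P)[-1:-k - 1:-1]:
        print(index2name(i, prefix, delimiter, text), P[i])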
Answer:
Let $A$ be the transition matrix seen in the course, defined by $$A[i,j] = \frac{1}{\text{outdegree}(i)}\text{ if }i\rightarrow j,\quad 0\text{ otherwise.}$$
Compute a PageRank based on the following iteration: do $P_{n+1} = \alpha P_n A + (1-\alpha)Z$ until $||P_{n+1}-P_n||_1\leq \epsilon ||P_n||_1$.
A few hints:
Answer:
In [ ]:
def pagerank_iteration(A, P, out, alpha):
    # Your code here
    # ...
    return new_P

def pagerank(prefix="frwiki-2013", alpha=0.7, epsilon=.01):
    # Your code here
    # ...
    return P
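If you want something to compare your own code against, here is one possible way to fill in the skeleton (a sketch written for clarity rather than speed, assuming $Z$ is the all-ones vector and using the adjacency array A described above; expect it to be slow, the next part is precisely about speeding it up):
In [ ]:
def pagerank_iteration(A, P, out, alpha):
    # one step of P <- alpha * P A + (1 - alpha) * Z, with Z = [1, ..., 1]
    n = len(P)
    new_P = (1 - alpha) * ones(n)
    for i in range(n):
        for j in A[A[i]:A[i + 1]]:
            new_P[j] += alpha * P[i] / out[i]  # note: one division per edge
    return new_P

def pagerank(prefix="frwiki-2013", alpha=0.7, epsilon=.01):
    n, m = get_size(prefix)
    A = get_adjacency(prefix)
    out = get_degree(prefix)
    P = ones(n)
    new_P = pagerank_iteration(A, P, out, alpha)
    while sum(abs(new_P - P)) > epsilon * sum(abs(P)):
        P, new_P = new_P, pagerank_iteration(A, new_P, out, alpha)
    return new_P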
The algorithm above may be a little bit slow, so let's do some tuning. You will learn to speed up your algorithm by mixing theoretical and practical considerations, which will also allow you to play more with the last part. Don't forget to compare your results with those of the previous question to verify that you did not break anything.
It is OK if the name of your PageRank function remains the same throughout all questions.
In some corrections of previous practicals, we introduced the package numba, which can be used to accelerate your code. numba is a Python package that compiles your Python code on the fly instead of interpreting it. This is called Just-in-Time (JIT) compilation.
If you want to learn more about Numba, you can look at the following gist: https://gist.github.com/balouf/007f9360127da12fb6455a9bbceeca36
While it is usually best to first optimize your code from an algorithmic perspective before resorting to numba acceleration, for this practical you should start by speeding up your inner function pagerank_iteration.
Try the following change to your code.
In [ ]:
from numba import njit
@njit
def pagerank_iteration(A, P, out, alpha):
    # Your code here, from previous question
    # ...
    return new_P
If it works, well done! You can probably use better precision now (recommended: $\epsilon=0.01$).
If not, try to debug to make it work.
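To check that the decorator actually pays off, you can time a single iteration before and after adding it, for example as below (n, A and out_deg are the variables loaded earlier; the first call also includes the compilation time):
In [ ]:
P0 = ones(n)
pagerank_iteration(A, P0, out_deg, 0.7)       # first call triggers the JIT compilation
%timeit pagerank_iteration(A, P0, out_deg, 0.7)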
Answer:
Playing with $ \alpha $ and with the precision may help to speed up the process. Is it a good idea?
Answer:
Try to reduce the time per iteration. How many divisions do you perform per iteration? If it is $m$, you should be able to cut that down to $n$ (with better memory access too). Beware of leaves (nodes with no outgoing link)! How does the time per iteration evolve?
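One possible direction (a sketch, not necessarily the only one): hoist the division out of the inner loop, so that each node is divided by its out-degree once per iteration, and guard against leaves.
In [ ]:
@njit
def pagerank_iteration(A, P, out, alpha):
    n = len(P)
    new_P = (1 - alpha) * ones(n)
    for i in range(n):
        if out[i] > 0:                     # beware of leaves: no division by zero
            push = alpha * P[i] / out[i]   # one division per node instead of one per edge
            for j in A[A[i]:A[i + 1]]:
                new_P[j] += push
    return new_P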
Answer:
Reduce the number of iterations. Instead of using a new vector for $PA$, try to do the update in place. How does the number of iterations evolve (note: you will probably not notice a difference if $\epsilon$ is too large; try $\epsilon=0.01$)? For the record, this is called the Gauss-Seidel method, and one can prove that it enhances convergence.
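As an illustration of the in-place idea (a sketch, building on the previous variant): read the current value of node $i$, reset $P[i]$ to its $(1-\alpha)Z[i]$ share, and push its mass to its successors, so that nodes processed later in the sweep already see freshly updated values. Note that this version no longer returns anything, so the outer loop must be adapted accordingly (see the next question).
In [ ]:
@njit
def pagerank_iteration(A, P, out, alpha):
    # in-place sweep: P is updated directly, no second vector needed
    n = len(P)
    for i in range(n):
        value = P[i]
        P[i] = 1 - alpha                  # restart node i with its (1 - alpha) * Z[i] share
        if out[i] > 0:
            push = alpha * value / out[i]
            for j in A[A[i]:A[i + 1]]:
                P[j] += push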
Answer:
For practical use, the ranking matters but not the actual importance values. Replace the stopping condition by "the last iteration did not change the ranking of the top $k$ articles" (e.g. with $k = 20$). If you did the previous optimization, you probably do not need to store two distinct values of $P$ anymore.
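A possible driver with this stopping rule (a sketch, assuming the in-place pagerank_iteration from the previous question):
In [ ]:
def pagerank(prefix="frwiki-2013", alpha=0.7, k=20):
    n, m = get_size(prefix)
    A = get_adjacency(prefix)
    out = get_degree(prefix)
    P = ones(n)
    ind = argsort(P)[-1:-k - 1:-1]
    stable = False
    while not stable:
        ind_old = ind
        pagerank_iteration(A, P, out, alpha)   # in-place update
        ind = argsort(P)[-1:-k - 1:-1]
        stable = (ind_old == ind).all()
    return P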
Answer:
Look at the following code. Can you tell what it does? Compare it with the previous PageRank function in terms of results and efficiency (time and memory).
In [ ]:
import time

@njit
def diteration_loop(F, H, A, alpha):
    for i in nonzero(F > alpha * mean(F))[0]:
        outnodes = A[A[i]:A[i + 1]]
        H[i] += F[i]
        if outnodes.size:
            w = alpha * F[i] / outnodes.size
            for j in outnodes:
                F[j] += w
        F[i] = 0

def diteration(prefix="frwiki-2013", alpha=0.75, k=20):
    # Setup
    n, m = get_size(prefix)
    H = zeros(n)
    F = ones(n)
    A = get_adjacency(prefix)
    total_time = time.time()
    ind = argsort(H)[-1:-k - 1:-1]
    stable = False
    iter = 0
    # Outer loop
    while not stable:
        iter += 1
        iter_time = time.time()
        ind_old = ind
        # Inner loop
        diteration_loop(F, H, A, alpha)
        ind = argsort(H)[-1:-k - 1:-1]
        stable = (ind_old == ind).all()
        print("Iteration {}: time ".format(iter) + str(time.time() - iter_time))
    print("Total time: " + str(time.time() - total_time) + " ({} iterations)".format(iter))
    return H
Answer:
Instead of using a uniform distribution as the initial value for $P$, why not use a previously saved computation? Expected benefits:
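For example (a sketch; it assumes your pagerank function has been given an extra, hypothetical P0 argument for the initial vector, and that a previous run was saved following the use_cache naming convention used above):
In [ ]:
try:
    P0 = np.load(directory + prefix + "-pagerank.npy")   # previously saved PageRank, if any
except FileNotFoundError:
    P0 = None
P = pagerank(prefix, alpha=0.7, P0=P0)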
Answer:
Discuss what should be done to compute PageRank on a much larger graph (hundreds of millions, or even billions of nodes).
Answer:
The only semantic information you have about the datasets is the names of the articles. Is that enough to get pages that are relevant to a given subject?
Remark: It is strongly recommended (but not mandatory) that you work on the Optimization questions before starting this part.
The algorithm uses a uniform vector $Z = [1, \ldots, 1]$. Modify it to take an additional argument kw (a string) such that $Z[i]$ is 1 if kw appears in the title of article $i$, and 0 otherwise. Save your $Z$ and $P$ in files with kw in the name. Test your algorithm on a few values of kw and comment; don't hesitate to propose your own (some inputs work better than others).
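One possible way to build such a $Z$ from the titles (a sketch; the helper name build_z is only a suggestion, and the case-insensitive match is a choice, not a requirement):
In [ ]:
def build_z(kw, prefix="frwiki-2013"):
    n, m = get_size(prefix)
    delimiter, text = get_ids(prefix)
    low_text = text.lower()
    kw = kw.lower()
    Z = zeros(n)
    for i in range(n):
        # Z[i] = 1 if kw appears in the title of article i
        if kw in low_text[delimiter[i]:delimiter[i + 1]]:
            Z[i] = 1.
    return Z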
Answer:
Using a non-uniform $Z$ will create a bias in favor of the pages that contain kw in the title. A simple workaround consists of ranking according to $P-(1-\alpha)Z$ instead of $P$.
$P-(1-\alpha)Z$ is called the residual PageRank.
Explain the meaning of $P-(1-\alpha)Z$ and modify your algorithm to output it instead of $P$.
Remark: some of you may have dropped the $(1-\alpha)$ factor from your update function. In that case, the residual PageRank is just $P-Z$.
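Whether you compute it inside the algorithm or as a post-processing step, the quantity itself is easy to obtain, for example (assuming P and Z come from the previous question and top_k is the display helper sketched earlier):
In [ ]:
alpha = 0.7
R = P - (1 - alpha) * Z    # residual PageRank
top_k(R, k=20, prefix=prefix)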
Answer:
Adapt the algorithm from Optimization Question 6 to compute the residual PageRank. Discuss the differences.
Remark: the optimization from Question 3.7 is not easy to apply to the algorithm from Question 3.6. Don't try to implement it.
Answer:
Try the other dataset (English / French).
Answer: