| notebook.community

PageRank notes

Based on your many questions and puzzled looks, we've put together this set of notes on the PageRank algorithm, as we presented it in class.

If these notes still don't make sense, you may also find the following references to be useful:

Cleve Moler's notes on PageRank -- Moler's explanation is similar to what we discussed in class and what appears in these notes, but they are not identical as these notes as explained below: link
The original PageRank paper by Page et al.: link

High-level Overview

Motivating problem. The original PageRank algorithm addresses the following problem. Imagine a user who is looking for information on the web, having only some search terms, e.g., "free donuts." There may be many pages that match these terms; in what order should the search engine return matching pages to the user?

A search framework. One idea is to adopt a two-phase framework for search. One phase is offline, meaning it happens before any searches take place. The other phase is online, meaning it happens at search time, when the user's search terms are known.

The offline phase precomputes a global ranking of all web pages, given some snapshot of the web. The online phase uses the search terms to filter the ranked list, and it returns those pages in the global rank order.

The high-level PageRank idea. The PageRank algorithm is the offline phase. It is based on a probabilistic model of how a hypothetical user might surf the web, in the absence of any specific query.

Here is a somewhat informal description of the model; the details follow in the next section.

Suppose there are $n$ web pages, represented by a vertex set $V = \{1, 2, \ldots, n\}$. The pages link to one another; let $E$ denote the edge set, which is a set of pairs $(i, j)$ indicating that page $i$ points to page $j$. We will sometimes write this link relationship as $i \rightarrow j$. This representation is also known as a directed graph representation.

Next, consider a "random web surfer." This surfer visits web pages one at a time, according to the following stochastic process.

At each time step $t \geq 0$, the surfer visits a page. Further assume that the surfer's choice of page at time $t+1$ depends only on the page visited at time $t$. (This assumption makes this process a discrete-time Markov chain.)
Initially, at time $t=0$, the web surfer starts on a random page.
Suppose the surfer is on page $i$ at time $t$. With probability $\alpha$, the surfer will decide to follow a link going from $i$ to some new page $j$.
At time $t$, the surfer might instead decide, with probability $1-\alpha$, to jump to some page $j$. Page $j$ is not necessarily directly connected to $i$.

As time proceeds, the surfer jumps from page to page, sometimes hitting a page it has already visited, and sometimes jumping to an entirely different part of the web graph. Now imagine that the surfer surfs for an infinitely long time; what is the probability that the surfer will be on any given page? If this probability distribution can be calculated, then the PageRank idea is to use the distribution to rank the web pages: the most likely page for the surfer to land on is the top-ranked page, the next most likely page is the second-ranked page, and so on.

Mathematical Details

To fully specify the process outlined above, we need to pin down a few more details. We will do so using probabilistic statements, which it then turns out we will be able to write down succinctly in the language of linear algebra.

Connectivity matrix. Let's start by representing the graph by a matrix $G \equiv (g_{ij})$, where the $(i, j)$ entry, $g_{ij}$, is 1 if there is a link $i \rightarrow j$, and 0 otherwise.

Probability state vector. Next, let $x(t)$ be a (column) vector that represents the probabilities that the surfer is on each page at time $t$. That is,

$x(t) \equiv \left(\begin{array}{c} x_1(t) \\ x_2(t) \\ \vdots \\ x_n(t) \end{array}\right)$,

where $x_i(t)$ is the probability that the surfer is on page $i$ at time $t$. Since the surfer must always be on some page, these probabilities must sum to 1: $\sum_{i=1}^{n} x_i(t) = 1$.

Surfing process. At time $t=0$, suppose that the surfer is equally likely to start on any page. Then, $x_i(0) = \frac{1}{n}$.

Now suppose that the surfer is on page $i$ at time $t$. What page will the surfer visit at time $t+1$? Recall that there are two scenarios in our high-level model: follow an out-link or jump to another page.

Scenario 1. If the surfer decides to follow an out-link, which one will it choose?

Let's assume the surfer picks an outgoing link uniformly at random. That is, if page $i$ has $d_i$ outgoing links, then let the probability of choosing an out-link be $\frac{1}{d_i}$. The value $d_i$ is also called the out-degree of vertex $i$. It may be computed by summing each row of $G$, i.e., $d_i \equiv \sum_{j=1}^{n} g_{ij}$.

Thus, given the decision to follow an out-link starting from page $i$, the probability of moving to page $j$ is $p_{ij} \equiv \frac{g_{ij}}{d_i}$.

What if page $i$ has no outgoing edges? There are several ways to handle this case. The simple one we will adopt is to force it to have a self-edge, $i \rightarrow i$. In other words, the surfer has decided to follow an out-link but has nowhere to go; therefore, it stays put on page $i$. Mathematically, the self-edge means $d_i = g_{ii} = 1$ and $p_{ii} = 1$.

Aside: This way of handling pages without outgoing edges differs from the way they are treated both in the original PageRank scheme and in Moler's notes. The original PageRank scheme simply removed these pages. By contrast, Moler assumes that when there is no outgoing edge, then the surfer jumps to any random page, just like the $1-\alpha$ case. In other words, Moler would set all $g_{ij} = 1$ for all $j$ when $i$ has no outgoing links.

Given all of the $g_{ij}$, including self-edges when needed, we can express all of these quantities in matrix notation:

$G \equiv \left( g_{ij} \right), \qquad D \equiv \left(\begin{array}{ccc} d_1 && \\ & \ddots & \\ && d_n \end{array}\right), \qquad P \equiv \left( p_{ij} \right) = D^{-1}G$

The matrix $P$ is sometimes called a probability transition matrix. From the definitions above, you should convince yourself that every row of the matrix $P \equiv (p_{ij})$ sums to 1.

Scenario 2. If instead the surfer decides to jump to a random page, which one will it choose?

Again, let's assume the surfer jumps uniformly at random to any one of the $n$ pages, which includes itself and any outgoing links from the current page. Then, the probability of choosing any other page $j$ would be just $\frac{1}{n}$.

Putting it all together. We now have all the details we need to compute the probability of ending up on a page $i$ at time $t+1$ starting from some page $j$ at time $t$. This probability is, as a scalar formula,

$x_i(t+1) = \left[\alpha \cdot \sum_{j=1}^{n} p_{ji} x_j(t)\right] + \left[(1-\alpha) \cdot \frac{1}{n}\right].$

We can also write this more compactly in matrix notation. Let $u$ be a vector whose entries are all equal to 1. Then the above formula for all $i$ is the same as,

$x(t+1) = \alpha P^T x(t) + \frac{1 - \alpha}{n} u$.

From the definition of $P$, note that $P^T = G^T D^{-1}$, which is the notation we used in class and which we wrote some code to compute.

Thus, the PageRank algorithm is the following: run the preceding iteration, starting at $t=0$, until $x(t)$ appears to stabilize (i.e., has reached steady-state) or, as is typically done, until some maximum number of time steps has been reached.

Convergence?

At least one detail remains: how do we know that the state vector $x(t)$ will ever reach a steady-state? Could the probabilities oscillate, diverge, or exhibit chaotic behavior?

The analysis to prove that a steady-state exists is somewhat involved, but the gist is the following. The formula to compute $x(t+1)$ from $x(t)$ can also be written as follows:

$x(t+1) = \hat{A} x(t)$,

where

$\hat{A} \equiv \alpha P^T + \frac{1-\alpha}{n} uu^T$. (You can convince yourself of this fact by first observing that $u^Tx(t)=1$.)

Thus, when we ask whether this process reaches steady-state, then we are effectively asking whether there is a vector $x$ such that $x = \hat{A} x$.

Like $P^T$ itself, the matrix $\hat{A}$ has the following properties, which you can verify:

Its entries all lie between 0 and 1.
The columns sum to 1.

From these facts, one may apply a theorem from linear algebra called the Perron-Frobenius theorem and conclude that $x = \hat{A} x$ has a solution that is both non-zero and unique to within some scaling factor. (See the notes by Moler.)