Lecture 33: Markov chains (cont.), Google PageRank as a Markov chain

Stat 110, Prof. Joe Blitzstein, Harvard University


Generalizing Reversible Markov Chains

Now let's take what we learned about undirected graphs and reversible Markov chains, and extend it a bit further: assume that the transition probabilities out of a node are not uniform, so that some edges are more likely to be traversed than others (while still keeping the graph undirected).

Recall that the degree of a node is the number of edges it has. We proved last time that the stationary probability of any node is proportional to its degree.

Now we will generalize that to deal with edge weights: $w_{ij} \ge 0$, with $w_{ij} = 0$ if there is no edge $\{i,j\}$.

Since the graph in question is undirected, assume $w_{ij} = w_{ji}$.

Random walk

From state $i$, go to $j$ with transition probability $\propto w_{ij}$: if $\{i,j\}$ is an edge, then $q_{ij} = \frac{w_{ij}}{\sum_k w_{ik}}$ (assuming the total weight $\sum_k w_{ik}$ at node $i$ is positive). In the unweighted case, just think of all the weights as being 1; then the walk moves to a uniformly random neighbor, and the stationary probabilities are proportional to degree, as before.
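As a quick, hypothetical illustration, here is a minimal NumPy sketch of this rule, with a made-up symmetric weight matrix `W` for a 3-node graph:

```python
import numpy as np

# Hypothetical symmetric weight matrix for a 3-node undirected graph;
# W[i, j] = 0 would mean there is no edge {i, j}.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])

# q_ij = w_ij / sum_k w_ik: normalize each row by its total weight.
Q = W / W.sum(axis=1, keepdims=True)

print(Q.sum(axis=1))  # each row sums to 1, as a transition matrix must
```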

Proof

To prove reversibility, we check the detailed balance condition $s_i \, q_{ij} = s_j \, q_{ji}$ with $s_i \propto \sum_k w_{ik}$:

\begin{align} \left( \sum_k w_{ik} \right) \, q_{ij} &= w_{ij} &\text{ by definition of } q_{ij} \\ &= w_{ji} &\text{ since the graph is undirected } \\ &= \left( \sum_k w_{jk} \right) \, q_{ji} \\ \Rightarrow s_i &\propto \sum_k w_{ik} \end{align}

And so the random walk on a weighted undirected graph is reversible, with the stationary probability of each node proportional to its total edge weight $\sum_k w_{ik}$; this generalizes the degree result above.
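Continuing the sketch above, we can check numerically that $s_i \propto \sum_k w_{ik}$ is stationary and that detailed balance holds:

```python
# Candidate stationary distribution: total weight at each node, normalized.
s = W.sum(axis=1) / W.sum()

# Stationarity: s Q = s.
print(np.allclose(s @ Q, s))      # True

# Detailed balance: s_i * q_ij == s_j * q_ji for all i, j.
flow = s[:, None] * Q             # flow[i, j] = s_i * q_ij
print(np.allclose(flow, flow.T))  # True
```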

Examining the converse

Conversely, any reversible Markov chain can be represented as a random walk on a weighted undirected graph like the one above.

Assume we have a Markov chain with transition probabilities $q_{ij}$ that is reversible with respect to $\vec{s}$, so $s_i \, q_{ij} = s_j \, q_{ji}$. Then:

\begin{align} \text{Let } w_{ij} &= s_{i} \, q_{ij} \\ &= s_{j} \, q_{ji} &\text{ by reversibility } \\ &= w_{ji} \\ \Rightarrow P(X_{n+1} = j \mid X_n = i) &= \frac{w_{ij}}{\sum_k w_{ik}} \\ &= \frac{s_i \, q_{ij}}{\sum_k s_i \, q_{ik}} \\ &= \frac{q_{ij}}{\sum_k q_{ik}} \\ &= \frac{q_{ij}}{1} \\ &= q_{ij} \end{align}

And so the chain we constructed has the desired transition probabilities: the random walk on a weighted undirected graph is the general form of a reversible Markov chain.
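To make the converse concrete, here is a short sketch reusing the `s` and `Q` from the example above (which we already know form a reversible pair): the weights $w_{ij} = s_i \, q_{ij}$ come out symmetric, and normalizing their rows recovers $Q$.

```python
# Recover edge weights from the reversible chain: w_ij = s_i * q_ij.
W2 = s[:, None] * Q

# Symmetric, so it defines a weighted undirected graph...
print(np.allclose(W2, W2.T))  # True

# ...and the random walk on that graph has the original transition probabilities.
print(np.allclose(W2 / W2.sum(axis=1, keepdims=True), Q))  # True
```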

Non-reversible Example: Google PageRank

Now let's generalize this just a bit more, and look at the case of non-reversible Markov chains.

In this example, consider the nodes to be pages on the World Wide Web, with the links between pages as directed edges. A random surfer clicking links then follows a Markov chain that is, in general, non-reversible. We will consider a very tiny, toy representation of the web as a 4-node directed graph, whose transition matrix $Q$ is given below.

In the case of web search, the results should be returned in ranked form, with the pages most important to your search query ranked highest. But how do we measure importance?

Google's original approach, known as PageRank, measured the importance of a page in terms of the pages linking to it and, in turn, their importance. This self-referential definition suggests an eigenvalue/eigenvector relation, which motivates the key idea of modelling the web as a (non-reversible) Markov chain.

Intuition & Motivation

Let a web page's score be $s_j = \sum_i s_i \, q_{ij}$. A page's influence should be "diluted" equally across all of its outgoing links, and that is what $q_{ij}$ does in the previous equation: if page $i$ has $d_i$ outgoing links, then $q_{ij} = 1/d_i$ for each page $j$ it links to.

So for the simple, non-reversible 4-page chain described above, our matrix $Q$ would be:

\begin{align} Q &= \begin{pmatrix} 0 & \frac{1}{2} & \frac{1}{2} & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \\ 0 & 0 & 0 & 1 \\ \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\ \end{pmatrix} \end{align}

Note that, with regard to page 4: since page 4 has no outgoing links and we want to represent the web with a Markov chain, we assume that from page 4 you jump to any page (including page 4 itself) with equal probability. After all, if you navigated to some page on the Internet that had no outgoing links, you would just open up another tab or window and jump elsewhere.

So from

\begin{align} s_j &= \sum_i s_i \, q_{ij} \\ \Rightarrow \vec{s} &= \vec{s} \, Q &\text{ where } \vec{s} \text{ is a probability (row) vector: nonnegative entries summing to } 1 \end{align}

Hence we can see that in our model, $\vec{s}$ is the stationary distribution of the random web-surfing chain. Intuitively, the stationary distribution gives us the long-run probabilities of being on each page. Therefore, all we want to do is solve that linear system and normalize to obtain $\vec{s}$.
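As a minimal sketch of that computation for the toy 4-page chain, we can find $\vec{s}$ with NumPy by taking the eigenvector of $Q^T$ with eigenvalue 1:

```python
import numpy as np

# Transition matrix of the 4-page toy web from above.
Q = np.array([[0,   1/2, 1/2, 0  ],
              [1/2, 0,   1/2, 0  ],
              [0,   0,   0,   1  ],
              [1/4, 1/4, 1/4, 1/4]])

# s Q = s is equivalent to Q^T s^T = s^T, so s is the eigenvector of Q^T
# with eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(Q.T)
s = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
s = s / s.sum()

print(s)                      # long-run fraction of time on each page
print(np.allclose(s @ Q, s))  # True
```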

Why model with a Markov chain?

Using a Markov chain for our model makes sense intuitively: the stationary distribution $\vec{s}$ gives the long-run fraction of time one would spend on each page.

But there are other, computational reasons that motivate the use of a Markov chain.

In the real world, there are billions of web pages, so the transition matrix $Q$ has dimension in the billions. Algorithms such as Gaussian elimination have cubic (polynomial) complexity, and for a matrix of that size, solving the linear system directly could be intractable.

Here is the original PageRank formula:

\begin{align} G &= \alpha \, Q + (1 - \alpha) \, \frac{J}{M} \end{align}

where

  • $M$ is the number of pages in the web
  • $J$ is the $M \times M$ matrix of all $1$s
  • $0 < \alpha < 1$ (commonly taken to be about $0.85$)

There are now two transition matrices combined: $Q$ for clicking links to pages on the web, and $\frac{J}{M}$ for "teleporting" to another page. So, with probability $\alpha$ the web surfer will follow a link from the present page to another, and with probability $1-\alpha$ the surfer will jump to another page (with all pages equally probable).

Note that with the above formula irreducibility is guaranteed, since all transition probabilities are strictly greater than 0; and since this is an irreducible Markov chain, a unique stationary distribution exists. The same condition makes the chain aperiodic, so the chain actually converges to that stationary distribution.
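Here is a sketch of forming the Google matrix for the toy example, continuing the code above (the choice $\alpha = 0.85$ is just the commonly cited value, not something the formula requires):

```python
alpha = 0.85         # damping factor; 0.85 is the commonly cited choice
M = Q.shape[0]       # number of pages (4 in the toy example)
J = np.ones((M, M))  # M x M matrix of all 1s

G = alpha * Q + (1 - alpha) * J / M

print(np.all(G > 0))  # True: every entry positive, so irreducible and aperiodic
print(G.sum(axis=1))  # each row still sums to 1
```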

Eigenvalue computations on a matrix of this size could very well be computationally infeasible, but if we use the Markov model and agree that we do not need the exact stationary distribution but only an approximation, then we can use the property of convergence to stationarity to find a reasonable answer!

We can start with any initial distribution and run the chain until we reach a level of convergence that we deem acceptable.

Let $\vec{t}$ be our initial probability vector.

\begin{align} \vec{t} \, G &= \alpha \, \vec{t} \, Q + (1 - \alpha) \, \vec{t} \, \frac{J}{M} \end{align}
  • $Q$ is very sparse (each page links to only a handful of others), so the vector–matrix product $\vec{t} \, Q$ can be computed efficiently with sparse data structures and algorithms
  • $\vec{t} \in \mathbb{R}^{M}$ is a probability vector and $J$ is the $M \times M$ matrix of all $1$s (which has rank 1), so $\vec{t} \, J$ is just the $1 \times M$ vector of all $1$s, since the entries of $\vec{t}$ sum to 1
  • $\Rightarrow (\vec{t} \, G)G = \vec{t} \, G^2$
  • $\Rightarrow (\vec{t} \, G^2)G = \vec{t} \, G^3$
  • $\cdots$
  • $\Rightarrow (\vec{t} \, G^{n-1})G = \vec{t} \, G^n$
  • and as $n \rightarrow \infty$, $\vec{t} \, G^n$ converges to the stationary distribution of matrix $G$ (see the sketch below)!
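Here is a minimal power-iteration sketch for the toy example, continuing the code above (the starting vector, iteration cap, and tolerance are arbitrary choices). It uses the identity above so that $G$ never needs to be formed explicitly; the explicit $G$ appears only in the final sanity check.

```python
# Power iteration: t <- t G, computed as alpha * (t Q) + (1 - alpha) / M,
# using the fact that t J = (1, ..., 1) when the entries of t sum to 1.
# In a real implementation, t @ Q would be a sparse vector-matrix product.
t = np.full(M, 1 / M)  # arbitrary starting distribution: uniform
for _ in range(1000):
    t_next = alpha * (t @ Q) + (1 - alpha) / M
    if np.max(np.abs(t_next - t)) < 1e-12:  # arbitrary tolerance
        t = t_next
        break
    t = t_next

print(t)                      # approximate PageRank scores
print(np.allclose(t @ G, t))  # True (up to the tolerance)
```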