In [ ]:
%%HTML
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}

body:after {
background-image: url('../04-Small-Worlds/lincs.png');
background-size: 200px 100px;
position: fixed;
bottom: 1em;
right: 8em;
width: 200px; 
height: 100px;
content:"";
}
</style>

Markov Chains & PageRank

Céline Comte, Fabien Mathieu

ACN Master

Roadmap

1. Introduction

2. Markov Chains

3. PageRank

Introduction

Online resources

For theoretical background, look at

How to find the page I am looking for?

  • I know it already (bookmark...)
  • I surf from page to page
  • I use a search engine

Ranking: A Big Issue

  • One request, lots of answers:
    • Google: 11 billion results
    • Algorithm: 128 million results
    • Platypus: 9 million results
  • Humans seldom read past the first page (10 results)
  • How to select relevant pages?

An Old Issue

Solutions

  • Content of pages (pertinence)
  • Study the URL
  • Number of incoming links
  • Hyperlinks

Markov Chains

Definitions

  • $ V $: a set of states
  • Random process $ X_k, k\in \mathbb{N} $: set of random variables on $ V $
  • Markov chain: $\forall k$, $$P(X_k=j|X_{k-1} = i_{k-1}, \ldots, X_0 = i_0) = P(X_k=j | X_{k-1} = i_{k-1})$$
  • Transition coefficients: $ p^k_{i,j} = P(X_k = j | X_{k-1} = i) $
  • Homogeneous chain: $ p^k_{i,j} = p_{i,j} $ (independent of $ k $).

Stochastic matrix

  • $ A = (p_{i,j}) $: right stochastic matrix (each row sums to $1$).
  • If $ x_k $ (row vector) represents the state distribution at step $ k $, then $ x_k = x_0 A^k $.
  • Interpretation: multiply by $ A $ $\Rightarrow$ move one step.
  • Proof: just write it.
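A minimal numpy sketch of this interpretation (the two-state matrix is a made-up example): multiplying the state distribution by $A$ advances the chain one step.

```python
import numpy as np

# Hypothetical two-state chain; right stochastic: each row sums to 1
A = np.array([[0.9, 0.1],
              [0.5, 0.5]])
assert np.allclose(A.sum(axis=1), 1)

x0 = np.array([1.0, 0.0])                # start in state 0
x3 = x0 @ np.linalg.matrix_power(A, 3)   # x_3 = x_0 A^3

# One multiplication by A = one step of the chain
x = x0
for _ in range(3):
    x = x @ A
assert np.allclose(x, x3)   # both paths give the same distribution
```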

Transition Graph

A Markov chain can be represented by a weighted oriented graph $ G = (V,E,m) $

  • $ V $: states
  • $ E $: non-zero transitions
  • $ m(e) $: probability of transition $ e $

Spoiler: PageRank interprets a Web Graph as a transition graph

Aperiodic, Irreducible

  • Aperiodic: all nodes have period 1
  • Irreducible: $ \forall i,j, \exists k, (A^k)_{i,j} > 0 $
  • Irreducibility is linked to connectivity
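For a finite chain, irreducibility can be checked mechanically: $A$ is irreducible iff $(I+A)^{n-1}$ has no zero entry. A sketch on two made-up matrices:

```python
import numpy as np

def is_irreducible(A: np.ndarray) -> bool:
    """Finite chain: A is irreducible iff (I + A)^(n-1) is entrywise positive."""
    n = A.shape[0]
    M = np.linalg.matrix_power(np.eye(n) + A, n - 1)
    return bool((M > 0).all())

# Deterministic 2-cycle: irreducible (but periodic)
cycle = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
# Absorbing state 1: reducible
absorbing = np.array([[0.5, 0.5],
                      [0.0, 1.0]])
print(is_irreducible(cycle), is_irreducible(absorbing))  # → True False
```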

Stochastic Perron-Frobenius

Main Theorem

Let $A$ be a stochastic matrix

  1. The spectral radius of $ A $ is $1$ (it is an eigenvalue).
  2. If $ A $ is irreducible, there is a unique probability distribution $P$ that is a left eigenvector for $1$. Moreover, $ P > 0 $.
  3. If $ A $ is also aperiodic, all eigenvalues but $1$ have modulus $< 1$.

Proof of 1. as exercise

Convergence of $A$

Let $ A $ be stochastic, irreducible, aperiodic, and $ P $ the corresponding probability eigenvector.

Then $ A^k \underset{k\to \infty}{\longrightarrow} \mathbf{1}.P $

Proof: $ A^k $ converges to $ A^\infty $, stochastic rank 1 projection over $ P $.

Corollary

Let $ P_0 $ be a probability distribution over $ V $. Then $ P_k := P_0 A^k \underset{k\to \infty}{\longrightarrow} P $.

Proof: straightforward

Interpretation

  • An irreducible, aperiodic Markov chain forgets its starting point.
  • $ P $: stationary distribution.
  • $ P > 0 $: all states are visited infinitely often.
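A quick numerical illustration on a made-up chain (random positive rows, hence irreducible and aperiodic): two different starting distributions converge to the same $P$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical chain: strictly positive rows, normalized to sum to 1
A = rng.random((4, 4)) + 0.1
A /= A.sum(axis=1, keepdims=True)

def iterate(P0, steps=200):
    P = P0
    for _ in range(steps):
        P = P @ A
    return P

P_a = iterate(np.array([1.0, 0, 0, 0]))   # start concentrated on state 0
P_b = iterate(np.ones(4) / 4)             # uniform start
assert np.allclose(P_a, P_b)              # the chain forgot its start
assert np.allclose(P_a @ A, P_a)          # stationary: P A = P
assert (P_a > 0).all()                    # P > 0
```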

Removing hypotheses

Perron-Frobenius conditions are not always met:

  • reducible matrices
  • periodic graphs
  • sub-stochastic matrices

Reducible matrices

Any directed graph can be split into strongly connected components (SCCs). If $A$ is reducible, $G$ possesses more than one SCC:

  • At least one recurrent SCC
  • Possibly transient SCC's

Reducible matrices

Up to a permutation of states we have: $$A = \begin{pmatrix} T & E \\ 0 & R \end{pmatrix}\text{, with } R = \begin{pmatrix} R_1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & R_d \end{pmatrix} $$

Reducible matrices

If the recurrent SCCs are aperiodic and irreducible, $$A^k \underset{k\to \infty}{\longrightarrow} \begin{pmatrix} 0 & F \\ 0 & R^\infty \end{pmatrix}\text{, with } \left\{ \begin{array}{l} R^\infty = \begin{pmatrix} R^\infty_1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & R^\infty_d \end{pmatrix} \\ \text{ and }\\ F = (I-T)^{-1}ER^\infty \end{array} \right. $$

Interpretation

  • $ R^\infty $: one cannot escape a recurrent SCC.
  • $ T^k \to 0 $: the probability of being in a transient state goes to $0$.
  • $ F $ describes the dispatch from transient to recurrent:
    • $ (I-T)^{-1} = \sum_k T^k $: walk within transient SCCs;
    • $ E $: transient-to-recurrent transition;
    • $ R^\infty $: stationary state.
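The formula $F = (I-T)^{-1} E R^\infty$ can be checked numerically on a toy example with one transient state feeding two absorbing states:

```python
import numpy as np

# One transient state (0) feeding two absorbing recurrent states (1, 2):
# block form A = [[T, E], [0, R]] with T = [[0.5]], E = [[0.3, 0.2]], R = I
A = np.array([[0.5, 0.3, 0.2],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
T, E = A[:1, :1], A[:1, 1:]
R_inf = np.eye(2)                        # absorbing blocks: R^∞ = I

F = np.linalg.inv(np.eye(1) - T) @ E @ R_inf
A_inf = np.linalg.matrix_power(A, 100)   # numerical stand-in for A^∞
assert np.allclose(A_inf[:1, 1:], F)     # F = (I - T)^{-1} E R^∞
print(F)  # → [[0.6 0.4]]
```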

Periodic Matrices

  • Cycles
  • Bipartite graphs

If irreducible:

  • Keep existence of a unique eigenprobability.
  • Lose standard convergence (keep Cesàro convergence)

Periodic Matrices

Let $ A $ be stochastic, irreducible, associated to probability $ P $. Let $ B = \alpha A + (1-\alpha) I $, for $ \alpha \in ]0,1[ $, and let $ P_0 $ be a probability distribution. Then $$P_0 B^k\underset{k\to \infty}{\longrightarrow} P$$

Interpretation: uniform loops preserve $P$ and suppress periodicity.
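Illustration on the deterministic 2-cycle (a minimal made-up example): powers of $A$ oscillate, while the lazy chain $B = \alpha A + (1-\alpha)I$ converges to $P = (1/2, 1/2)$.

```python
import numpy as np

A = np.array([[0.0, 1.0],    # deterministic 2-cycle: period 2
              [1.0, 0.0]])
alpha = 0.5
B = alpha * A + (1 - alpha) * np.eye(2)   # uniform self-loops

P0 = np.array([1.0, 0.0])
# A oscillates: P0 @ A^k is (1,0) or (0,1) depending on the parity of k
assert np.allclose(P0 @ np.linalg.matrix_power(A, 101), [0.0, 1.0])
# B converges to the eigenprobability P = (1/2, 1/2) of A
Pk = P0 @ np.linalg.matrix_power(B, 100)
assert np.allclose(Pk, [0.5, 0.5])
```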

Substochastic matrices

$ A $ sub-stochastic: $ 0\leq A \lneq B $, with $ B $ stochastic.

$ A $ sub-irreducible: $ 0\leq A \lneq B $, with $ B $ stochastic irreducible.

Theorem: if $ A $ sub-irreducible, $ A^k \to 0 $

Interpretation: leak (aka incomplete Markov Chain)
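A toy illustration of the leak: a 2-cycle where one state loses 10% of its mass at each step, so the total mass decays to $0$.

```python
import numpy as np

# Sub-irreducible: 2-cycle where state 1 leaks 10% of its mass each step
A = np.array([[0.0, 1.0],
              [0.9, 0.0]])
mass = [np.linalg.matrix_power(A, k).sum() for k in (1, 10, 100)]
print(mass)   # total mass decays geometrically toward 0
assert np.linalg.matrix_power(A, 200).max() < 1e-4   # A^k → 0
```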

Pseudo-Recurrent Component

Let $ A $ be non-negative, associated to $G$. An SCC is pseudo-recurrent iff:

  • Its spectral radius is maximal (compared to the other SCCs);
  • Any SCC reachable from it has a strictly smaller spectral radius.

Theorem

  • The spectral radius of $A$ is that of the pseudo-recurrent (PR) SCCs;
  • There is a positive maximal eigenvalue (unique if aperiodic), with multiplicity equal to the number of PR SCCs;
  • To each PR SCC $ C $ is associated a non-negative eigenvector with support $ \uparrow C $.

Interpretation

If $ A $ is sub-irreducible, $ A^k $ goes to $ 0 $, but:

  • Convergence is not uniform.
  • Pseudo-recurrent components are the slowest $ \rightarrow $ they will prevail in $ A^k $.
  • If there is a unique pseudo-recurrent SCC $ C $ with $ \uparrow C = V $, $ A $ is called pseudo-irreducible.

Take-Away

  • Markov Chain $ \Leftrightarrow $ random walk in a graph.
  • Perfect case (fully defined, irreducible, aperiodic) $ \Rightarrow $ convergence to a unique stationary distribution.
  • Otherwise, workarounds are required

PageRank Basics

Use hyperlinks

  • Simple method: Indegree
  • Problem: lacks robustness (easily manipulated)
  • Brin & Page: a page is important if referenced by important pages
    • Recursive indegree
    • Must be turned into equation

Random surfer

  • Assumes important pages are easily found thanks to structure
  • Model: user randomly clicks from page to page
  • Importance $ \Leftrightarrow $ probability
  • A few details to check

Web graph transition matrix

$ A = (a_{i,j}) $, with $ a_{i,j} = 1 / d(i) $ if $ i \rightarrow j $, and $0$ otherwise, where $ d(i) $ is the out-degree of page $ i $.

  • $ A $ is substochastic.
  • $ A $ stochastic $ \Leftrightarrow $ no leaf (page without outgoing links).
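A sketch of the construction on a hypothetical 4-page graph where page 3 is a leaf:

```python
import numpy as np

# Made-up toy web graph: page -> pages it links to; page 3 is a leaf
links = {0: [1, 2], 1: [2], 2: [0, 1, 3], 3: []}
n = len(links)
A = np.zeros((n, n))
for i, out in links.items():
    for j in out:
        A[i, j] = 1 / len(out)   # a_{i,j} = 1/d(i) if i -> j

row_sums = A.sum(axis=1)
print(row_sums)   # the leaf row sums to 0: A is only sub-stochastic
assert row_sums[3] == 0 and np.allclose(row_sums[:3], 1)
```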

PageRank: simple definition

  • Build $A$ from Web Graph, assume PF conditions are met
  • Choose some $P_0$ over $V$
  • Do $P_{n+1} = P_n A$ as long as necessary
  • PR on simple graph?
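A sketch of this simple definition on a made-up strongly connected, aperiodic graph (so PF conditions hold):

```python
import numpy as np

# Toy strongly connected graph with no leaf: PF conditions are met
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)
A = np.zeros((n, n))
for i, out in links.items():
    A[i, out] = 1 / len(out)

P = np.ones(n) / n            # any P0 works
for _ in range(1000):         # "as long as necessary"
    P = P @ A
assert np.allclose(P @ A, P)  # converged to the stationary distribution
print(P.round(3))             # → [0.4 0.2 0.4]
```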

Coping with Web Graph Reality

  • Stochastic
  • Irreducible
  • Fit in memory

Stochasticity of a Web graph

A web graph possesses many leaves:

  • Pages with no hrefs
  • Unexplored pages (the crawl is incomplete)

As a result, probability leaks.

Normalized PageRank

Simple workaround for leaves: $P_{n+1} = P_n A / ||P_n A||_1$

PageRank with damping

  • Actual class of PageRank to use
  • $P_{n+1} = \alpha P_n A +(1-\alpha) Z$
    • $Z$: zapping distribution
    • $\alpha$: damping factor
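A sketch of the damped iteration with a uniform zapping distribution $Z$; the toy graph is made up, and $\alpha = 0.85$ is the usual value.

```python
import numpy as np

def pagerank(A, alpha=0.85, Z=None, n_iter=100):
    """Damped iteration: P_{n+1} = alpha * P_n A + (1 - alpha) * Z (sketch)."""
    n = A.shape[0]
    if Z is None:
        Z = np.ones(n) / n        # uniform zapping distribution
    P = Z.copy()
    for _ in range(n_iter):
        P = alpha * (P @ A) + (1 - alpha) * Z
    return P

# Reducible toy graph: without damping, all mass gets trapped in {2, 3}
links = {0: [1], 1: [2], 2: [3], 3: [2]}
A = np.zeros((4, 4))
for i, out in links.items():
    A[i, out] = 1 / len(out)

P = pagerank(A)
assert np.isclose(P.sum(), 1)     # A is stochastic here: mass is conserved
assert (P > 0).all()              # zapping keeps every page reachable
print(P.round(3))
```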

Interpretations

  • Random walk with reset
  • Propagation of vanishing fluid injected from $Z$
  • Fixed point equation

Memory footprint

Usually, $A$ does not fit into memory.

Solution: cf Wikipedia practical

Customization

In practice, ranking corresponds to a request.

  • Mix PR with other metrics
  • Customize PR itself

Customization

Three ways to customize $A$:

  • $\alpha$
  • $A$
  • $Z$ (cf practical)