CSE 6040, Fall 2015 [14]: PageRank (still cont'd)

This notebook is identical to Lab 13, but with solutions provided for Part 1 and partial solutions for Part 2.

In this notebook, you'll implement the PageRank algorithm summarized in class. You'll test it on a real dataset (circa 2005) that consists of political blogs and their links among one another.

Note that the presentation in class follows the matrix view of the algorithm. Cleve Moler (inventor of MATLAB) has a nice set of notes here.

For today's notebook, you'll need to download the following additional materials:

A cse6040utils module, which is a Python module containing some handy routines from previous classes: link (Note: This module is already part of the git repo for our notebooks if you are pulling from there.)
A SQLite version of the political blogs dataset: http://cse6040.gatech.edu/fa15/poliblogs.db (~ 611 KiB)

Part 1: Explore the Dataset

Let's start by looking at the dataset, to get a feel for what it contains.

Incidentally, one of you asked recently how to get the schema for a SQLite database when using Python. Here is some code adapted from a few ideas floating around on the web. Let's use these to inspect the tables available in the political blogs dataset.



In [ ]:

    
import sqlite3 as db
import pandas as pd

def get_table_names (conn):
    assert type (conn) == db.Connection # Only works for sqlite3 DBs
    query = "SELECT name FROM sqlite_master WHERE type='table'"
    return pd.read_sql_query (query, conn)

def print_schemas (conn, table_names=None, limit=0):
    assert type (conn) == db.Connection # Only works for sqlite3 DBs
    if table_names is None:
        table_names = get_table_names (conn)
    c = conn.cursor ()
    query = "PRAGMA TABLE_INFO ({table})"
    for name in table_names:
        c.execute (query.format (table=name))
        columns = c.fetchall ()
        print ("=== {table} ===".format (table=name))
        col_string = "[{id}] {name} : {type}"
        for col in columns:
            print (col_string.format (id=col[0],
                                      name=col[1],
                                      type=col[2]))
        print ("\n")

conn = db.connect ('poliblogs.db')

for name in get_table_names (conn)['name']:
    print_schemas (conn, [name])
    query = '''SELECT * FROM %s LIMIT 5''' % name
    print (pd.read_sql_query (query, conn))
    print ("\n")

Exercise. Write a snippet of code to verify that the vertex IDs are dense in some interval $[1, n]$. That is, there is a minimum value of $1$, some maximum value $n$, and no missing values between $1$ and $n$.



In [ ]:

    
query = '''
  SELECT MIN(Id) AS MinId,
         MAX(Id) AS MaxId,
         COUNT(DISTINCT Id) AS NumDistinctIds
    FROM Vertices
'''
df = pd.read_sql_query (query, conn)
print df
assert df.MinId[0] == 1
assert df.MaxId[0] == df.NumDistinctIds[0]
print ("\n==> Verified: Vertex ids cover [1, %d] densely." \
       % df.NumDistinctIds[0])

Exercise. Make sure every edge has its end points in the vertex table.



In [ ]:

    
query = '''
  SELECT {col} FROM Edges
    WHERE {col} NOT IN (SELECT Id FROM Vertices)
'''

df_s = pd.read_sql_query (query.format (col='Source'), conn)
print (df_s['Source'])

df_t = pd.read_sql_query (query.format (col='Target'), conn)
print (df_t['Target'])

assert df_s['Source'].empty
assert df_t['Target'].empty

print ("==> Verified: All source and target IDs are vertices.")

Exercise. Determine which vertices have no incident edges. Store the number of such vertices in a variable, num_solo_vertices.



In [ ]:

    
query = '''
  SELECT Id, Url
    FROM Vertices
    WHERE (Id NOT IN (SELECT DISTINCT Source FROM Edges))
          AND (Id NOT IN (SELECT DISTINCT Target FROM Edges))
'''
df_solo_vertices = pd.read_sql_query (query, conn)
print df_solo_vertices.head ()

num_solo_vertices = len (df_solo_vertices)

# Our testing code follows, assuming your `num_solo_vertices` variable:
print ("\n==> %d vertices have no incident edges." % num_solo_vertices)
assert num_solo_vertices == 266

Exercise. Compute a view called Outdegrees, which contains the following columns:

Id: vertex ID
Degree: the out-degree of this vertex.

To help you test your view, the following snippet includes a second query that selects from your view but adds a Url field and orders the results in descending order of degree. It also prints first few and last few rows of this query, so you can inspect the URLs as a sanity check. (Perhaps it also provides a small bit of entertainment!)



In [ ]:

    
# Complete this query:
query = '''
  CREATE VIEW IF NOT EXISTS Outdegrees AS
    SELECT Source AS Id, COUNT(*) AS Degree
      FROM Edges
      GROUP BY Source
'''
c = conn.cursor ()
c.execute (query)



In [ ]:

    
from IPython.display import display

query = '''
  SELECT Outdegrees.Id, Degree, Url
    FROM Outdegrees, Vertices
    WHERE Outdegrees.Id = Vertices.Id
    ORDER BY -Degree
'''
df_outdegrees = pd.read_sql_query (query, conn)
print "==> A few entries with large out-degrees:"
display (df_outdegrees.head ())
print "\n==> A few entries with small out-degrees:"
display (df_outdegrees.tail ())

Exercise. Query the database to extract a report of which URLs point to which URLs. Also include the source vertex out-degree and order the rows in descending order by it.



In [ ]:

    
query = '''
  SELECT S.Url, T.Url, Out.Degree
    FROM Edges AS E,
         (SELECT Id, Url FROM Vertices) AS S,
         (SELECT Id, Url FROM Vertices) AS T,
         (SELECT Id, Degree FROM Outdegrees) AS Out
    WHERE (E.Source=S.Id) AND (E.Target=T.Id) AND (E.Source=Out.Id)
    ORDER BY -Out.Degree
'''
df_G = pd.read_sql_query (query, conn)

from IPython.display import display
display (df_G.head ())
print ("...")
display (df_G.tail ())

Part 2: Implement PageRank

The following exercises will walk you through a possible implementation of PageRank for this dataset.

Exercise. Build a sparse matrix, A_1, that stores $G^TD^{-1}$, where $G^T$ is the transpose of the connectivity matrix $G$, and $D^{-1}$ is the diagonal matrix of inverse out-degrees.



In [ ]:

    
from cse6040utils import sparse_matrix



In [ ]:

    
# Extract entries from the table
query = '''
  SELECT Target AS Row, Source AS Col, 1.0/Degree AS Val
    FROM Edges, Outdegrees
    WHERE Edges.Source = Outdegrees.Id
'''
df_A = pd.read_sql_query (query, conn)
display (df_A.head (10))



In [ ]:

    
# Copy entries from df_A into A_1
A_1 = sparse_matrix () # Initially all zeros, with no rows or columns
for (i, j, a_ij) in zip (df_A['Row'], df_A['Col'], df_A['Val']):
    A_1[i-1][j-1] += a_ij # "-1" switches to 0-based indexing

Errata: Bug in matrix construction. Based on questions from students after class, it seems the construction of $A \equiv G^TD^{-1}$ as Prof. Vuduc described it in class has a subtle bug: it does not treat unlinked pages correctly!

To see why, suppose you are the random surfer visiting page $i$, and, with probability $\alpha$, you decide to follow an outgoing link. But what if the page has no outgoing link?

This scenario corresponds to row $i$ of $G$ being entirely zero. So, the random surfer would just "disappear." The easiest fix to the model to account for this case is to assume that the random surfer stays on the same page, which means we should set $a_{ii}$ to 1. The following code snippet handles this case.



In [ ]:

    
# Select all vertices with no outgoing edges
query = '''
  SELECT Id FROM Vertices
    WHERE Id NOT IN (SELECT DISTINCT Source FROM Edges)
'''
df_anti_social = pd.read_sql_query (query, conn)
print ("==> Found %d vertices with no outgoing links." \
       % len (df_anti_social))

# Add self-edges for empty rows/columns
for i in df_anti_social['Id']:
    A_1[i-1][i-1] = 1.0

Exercise. Implement a function to multiply a sparse matrix by a dense vector, assuming a dense vector defined as follows.



In [ ]:

    
def dense_vector (n, init_val=0.0):
    """
    Returns a dense vector of length `n`, with all entries set to
    `init_val`.
    """
    return [init_val] * n



In [ ]:

    
def spmv (n, A, x):
    """Returns a dense vector y of length n, where y = A*x."""
    y = dense_vector (n)
    for (i, A_i) in A.items ():
        s = 0
        for (j, a_ij) in A_i.items ():
            s += a_ij * x[j]
        y[i] = s
    return y

As a quick test, let's verify that multiplying $A_1$ by the vector of all ones, $u$, counts the number of vertices.

Why should that be the case? Two of you asked about this after class.



In [ ]:

    
n = df.NumDistinctIds[0] # Number of vertices, from Part 1
u = dense_vector (n, 1.0)
y = spmv (n, A_1, u)
print sum (y)

Exercise. Complete the PageRank implementation for this dataset. To keep it simple, you may take $\alpha=0.85$, $x(0)$ equal to the vector of all $1/n$ values, and 25 iterations.

Additionally, you may find the following functions helpful.

The support code in the next code cell differs slightly from the notebook we posted originally. It renames those functions and provides additional functions (e.g., vec_2norm), in case you want to implement a residual-based termination test.



In [ ]:

    
# Some helper functions, in case you need them

import math

def vec_scale (x, alpha):
    """Scales the vector x by a constant alpha."""
    return [x_i*alpha for x_i in x]

def vec_add_scalar (x, c):
    """Adds the scalar value c to every element of x."""
    return [x_i+c for x_i in x]

def vec_sub (x, y):
    """Returns x - y"""
    return [x_i - y_i for (x_i, y_i) in zip (x, y)]

def vec_2norm (x):
    """Returns ||x||_2"""
    return math.sqrt (sum ([x_i**2 for x_i in x]))



In [ ]:

    
# YOUR CODE GOES BELOW. We've provided some scaffolding code,
# so you just need to complete it.

ALPHA = 0.85 # Probability of following some link
MAX_ITERS = 25
n = df.NumDistinctIds[0] # Number of vertices, from Part 1

# Let X[t] store the dense x(t) vector at time t
X = []

x_0 = dense_vector (n, 1.0/n) # Initial distribution: 1/n at each page
X.append (x_0)

for t in range (1, MAX_ITERS):
    # Complete this implementation
    
    X.append (...)

Exercise. Check your result by first inserting the final computed PageRank vector back into the database, and then using a SQL query to see the ranked URLs. In your query output, also include both the in-degrees and out-degrees of each vertex.



In [ ]:

    
# Write some code here to create a table in the database
# called PageRank. It should have one column to hold the
# page (vertex) ID, and one for the rank value.



In [ ]:

    
# Some helper code to compute a view containing the indegrees.

query = '''
  CREATE VIEW IF NOT EXISTS Indegrees AS
    SELECT Target AS Id, COUNT(*) AS Degree
      FROM Edges
      GROUP BY Target
'''
c = conn.cursor ()
c.execute (query)



In [ ]:

    
# Complete this query:
query = '''
  ...
'''
df_ranks = pd.read_sql_query (query, conn)
display (df_ranks)
sum (df_ranks['Rank'])

Exercise. The Vertices table includes a column called, Leaning, which expresses a political leaning -- either "Left" or "Right". How might you use this column to come up with an alternative ranking scheme?



In [ ]:

Exercise (advanced?). Create an SQL-based implementation of the PageRank algorithm, where you implement the sparse matrix-vector multiply in SQL, rather than in Python as above.



In [ ]: