CSE 6040, Fall 2015 [12]: PageRank

In this notebook, you'll implement the PageRank algorithm summarized in class. You'll test it on a real dataset (circa 2005) that consists of political blogs and their links among one another.

Note that the presentation in class follows the matrix view of the algorithm. Cleve Moler (inventor of MATLAB) has a nice set of notes here.

For today's notebook, you'll need to download the following additional materials:

  • A cse6040utils module, which is a Python module containing some handy routines from previous classes: link (Note: This module is already part of the git repo for our notebooks if you are pulling from there.)
  • A SQLite version of the political blogs dataset: http://cse6040.gatech.edu/fa15/poliblogs.db (~ 611 KiB)

Part 1: Explore the Dataset

Let's start by looking at the dataset, to get a feel for what it contains.

Incidentally, one of you asked recently how to get the schema for a SQLite database when using Python. Here is some code adapted from a few ideas floating around on the web. Let's use these to inspect the tables available in the political blogs dataset.


In [3]:
import sqlite3 as db
import pandas as pd

def get_table_names (conn):
    assert type (conn) == db.Connection # Only works for sqlite3 DBs
    query = "SELECT name FROM sqlite_master WHERE type='table'"
    return pd.read_sql_query (query, conn)

def print_schemas (conn, table_names=None, limit=0):
    assert type (conn) == db.Connection # Only works for sqlite3 DBs
    if table_names is None:
        table_names = get_table_names (conn)
    c = conn.cursor ()
    query = "PRAGMA TABLE_INFO ({table})"
    for name in table_names:
        c.execute (query.format (table=name))
        columns = c.fetchall ()
        print ("=== {table} ===".format (table=name))
        col_string = "[{id}] {name} : {type}"
        for col in columns:
            print (col_string.format (id=col[0],
                                      name=col[1],
                                      type=col[2]))
        print ("\n")

conn = db.connect ('poliblogs.db')

for name in get_table_names (conn)['name']:
    print_schemas (conn, [name])
    query = '''SELECT * FROM %s LIMIT 5''' % name
    print (pd.read_sql_query (query, conn))
    print ("\n")


=== Vertices ===
[0] index : INTEGER
[1] Id : INTEGER
[2] Url : TEXT
[3] Leaning : TEXT


   index  Id                         Url Leaning
0      0   1        100monkeystyping.com    Left
1      1   2  12thharmonic.com/wordpress    Left
2      2   3       40ozblog.blogspot.com    Left
3      3   4             4lina.tblog.com    Left
4      4   5       750volts.blogspot.com    Left


=== Edges ===
[0] index : INTEGER
[1] Source : INTEGER
[2] Target : INTEGER


   index  Source  Target
0      0     267    1394
1      1     267     483
2      2     267    1051
3      3     904    1479
4      4     904     919


Exercise. Write a snippet of code to verify that the vertex IDs are dense in some interval $[1, n]$. That is, there is a minimum value of $1$, some maximum value $n$, and no missing values between $1$ and $n$.


In [ ]:
# Insert your code here

Exercise. Make sure every edge has its end points in the vertex table.


In [ ]:
# Insert your code here

Exercise. Determine which vertices have no incident edges. Store the number of such vertices in a variable, num_solo_vertices.


In [ ]:
# Insert your code here:

# Our testing code follows, assuming your `num_solo_vertices` variable:
print ("==> %d vertices have no incident edges." % num_solo_vertices)
assert num_solo_vertices == 266

Exercise. Compute a view called Outdegrees, which contains the following columns:

  1. Id: vertex ID
  2. Degree: the out-degree of this vertex.

To help you test your view, the following snippet includes a second query that selects from your view but adds a Url field and orders the results in descending order of degree. It also prints first few and last few rows of this query, so you can inspect the URLs as a sanity check. (Perhaps it also provides a small bit of entertainment!)


In [ ]:
# Complete this query:
query = '''
  CREATE VIEW IF NOT EXISTS Outdegrees AS
    ...
'''
c = conn.cursor ()
c.execute (query)

In [ ]:
from IPython.display import display

query = '''
  SELECT Outdegrees.Id, Degree, Url
    FROM Outdegrees, Vertices
    WHERE Outdegrees.Id = Vertices.Id
    ORDER BY -Degree
'''
df_outdegrees = pd.read_sql_query (query, conn)
print "==> A few entries with large out-degrees:"
display (df_outdegrees.head ())
print "\n==> A few entries with small out-degrees:"
display (df_outdegrees.tail ())

Exercise. Query the database to extract a report of which URLs point to which URLs. Also include the source vertex out-degree and order the rows in descending order by it.


In [ ]:

Part 2: Implement PageRank

The following exercises will walk you through a possible implementation of PageRank for this dataset.

Exercise. Build a sparse matrix, A_1, that stores $G^TD^{-1}$, where $G^T$ is the transpose of the connectivity matrix $G$, and $D^{-1}$ is the diagonal matrix of inverse out-degrees.


In [ ]:
from cse6040utils import sparse_matrix

A_1 = sparse_matrix () # Initially all zeros, with no rows or columns

In [ ]:
# Insert your code here

Exercise. Implement a function to multiply a sparse matrix by a dense vector, assuming a dense vector defined as follows.


In [ ]:
def dense_vector (n, init_val=0.0):
    """
    Returns a dense vector of length `n`, with all entries set to
    `init_val`.
    """
    return [init_val] * n

In [ ]:
# Implement this routine:
def spmv (n, A, x):
    """Returns a dense vector y of length n, where y = A*x."""
    pass

Exercise. Complete the PageRank implementation for this dataset.


In [ ]:
def scale_vector (x, alpha):
    """Scales the dense vector x by a constant alpha."""
    return [x_i*alpha for x_i in x]

def offset_vector (x, c):
    """Adds the scalar value c to every element of a dense vector x."""
    return [x_i+c for x_i in x]

In [ ]:
ALPHA = 0.85 # Probability of following some link
MAX_ITERS = 25

# Let X[t] store the dense x(t) vector at time t
X = []

x_0 = dense_vector (n, 1.0/n) # Initial distribution: 1/n at each page
X.append (x_0)

for t in range (1, MAX_ITERS):
    # Complete this implementation
    
    X.append (...)

Exercise. Check your result by first inserting the final computed PageRank vector back into the database, and then using a SQL query to see the ranked URLs. In your query output, also include both the in-degrees and out-degrees of each vertex.


In [ ]:


In [ ]:
query = '''
  CREATE VIEW IF NOT EXISTS Indegrees AS
    SELECT Target AS Id, COUNT(*) AS Degree
      FROM Edges
      GROUP BY Target
'''
c = conn.cursor ()
c.execute (query)

In [ ]:
# Complete this query:
query = '''
  ...
'''
df_ranks = pd.read_sql_query (query, conn)
display (df_ranks)

Exercise. The Vertices table includes a column called, Leaning, which expresses a political leaning -- either "Left" or "Right". How might you use this column to come up with an alternative ranking scheme?


In [ ]:

Exercise (advanced?). Create an SQL-based implementation of the PageRank algorithm, where you implement the sparse matrix-vector multiply in SQL, rather than in Python as above.


In [ ]: