CSE 6040, Fall 2015 [13]: PageRank (cont'd)

This notebook is identical to Lab 12, but with solutions provided for Part 1.

In this notebook, you'll implement the PageRank algorithm summarized in class. You'll test it on a real dataset (circa 2005) consisting of political blogs and the links among them.

Note that the presentation in class follows the matrix view of the algorithm. Cleve Moler (inventor of MATLAB) has a nice set of notes here.

For today's notebook, you'll need to download the following additional materials:

  • The cse6040utils module, a Python module containing some handy routines from previous classes: link (Note: this module is already part of the git repo for our notebooks if you are pulling from there.)
  • A SQLite version of the political blogs dataset: http://cse6040.gatech.edu/fa15/poliblogs.db (~ 611 KiB)

Part 1: Explore the Dataset

Let's start by looking at the dataset, to get a feel for what it contains.

Incidentally, one of you asked recently how to get the schema of a SQLite database from Python. Here is some code adapted from a few ideas floating around on the web. Let's use it to inspect the tables available in the political blogs dataset.


In [1]:
import sqlite3 as db
import pandas as pd

def get_table_names (conn):
    assert type (conn) == db.Connection # Only works for sqlite3 DBs
    query = "SELECT name FROM sqlite_master WHERE type='table'"
    return pd.read_sql_query (query, conn)

def print_schemas (conn, table_names=None):
    assert type (conn) == db.Connection # Only works for sqlite3 DBs
    if table_names is None:
        table_names = get_table_names (conn)['name']
    c = conn.cursor ()
    query = "PRAGMA TABLE_INFO ({table})"
    for name in table_names:
        c.execute (query.format (table=name))
        columns = c.fetchall ()
        print ("=== {table} ===".format (table=name))
        col_string = "[{id}] {name} : {type}"
        for col in columns:
            print (col_string.format (id=col[0],
                                      name=col[1],
                                      type=col[2]))
        print ("\n")

conn = db.connect ('poliblogs.db')

for name in get_table_names (conn)['name']:
    print_schemas (conn, [name])
    query = '''SELECT * FROM %s LIMIT 5''' % name
    print (pd.read_sql_query (query, conn))
    print ("\n")


=== Vertices ===
[0] index : INTEGER
[1] Id : INTEGER
[2] Url : TEXT
[3] Leaning : TEXT


   index  Id                         Url Leaning
0      0   1        100monkeystyping.com    Left
1      1   2  12thharmonic.com/wordpress    Left
2      2   3       40ozblog.blogspot.com    Left
3      3   4             4lina.tblog.com    Left
4      4   5       750volts.blogspot.com    Left


=== Edges ===
[0] index : INTEGER
[1] Source : INTEGER
[2] Target : INTEGER


   index  Source  Target
0      0     267    1394
1      1     267     483
2      2     267    1051
3      3     904    1479
4      4     904     919


Exercise. Write a snippet of code to verify that the vertex IDs are dense in some interval $[1, n]$. That is, there is a minimum value of $1$, some maximum value $n$, and no missing values between $1$ and $n$.


In [5]:
query = '''
  SELECT MIN(Id) AS MinId,
         MAX(Id) AS MaxId,
         COUNT(DISTINCT Id) AS NumDistinctIds
    FROM Vertices
'''
df = pd.read_sql_query (query, conn)
print (df)
assert df.MinId[0] == 1
assert df.MaxId[0] == df.NumDistinctIds[0]
print ("\n==> Verified: Vertex ids cover [1, %d] densely." \
       % df.NumDistinctIds[0])


   MinId  MaxId  NumDistinctIds
0      1   1490            1490

==> Verified: Vertex ids cover [1, 1490] densely.

Exercise. Make sure every edge has its end points in the vertex table.


In [7]:
query = '''
  SELECT {col} FROM Edges
    WHERE {col} NOT IN (SELECT Id FROM Vertices)
'''

df_s = pd.read_sql_query (query.format (col='Source'), conn)
print (df_s['Source'])

df_t = pd.read_sql_query (query.format (col='Target'), conn)
print (df_t['Target'])

assert df_s['Source'].empty
assert df_t['Target'].empty

print ("==> Verified: All source and target IDs are vertices.")


Series([], Name: Source, dtype: object)
Series([], Name: Target, dtype: object)
==> Verified: All source and target IDs are vertices.

Exercise. Determine which vertices have no incident edges. Store the number of such vertices in a variable, num_solo_vertices.


In [9]:
query = '''
  SELECT Id, Url
    FROM Vertices
    WHERE (Id NOT IN (SELECT DISTINCT Source FROM Edges))
          AND (Id NOT IN (SELECT DISTINCT Target FROM Edges))
'''
df_solo_vertices = pd.read_sql_query (query, conn)
print (df_solo_vertices.head ())

num_solo_vertices = len (df_solo_vertices)

# Our testing code follows, assuming your `num_solo_vertices` variable:
print ("\n==> %d vertices have no incident edges." % num_solo_vertices)
assert num_solo_vertices == 266


   Id                                   Url
0   3                 40ozblog.blogspot.com
1   4                       4lina.tblog.com
2  25       americandreamslost.blogspot.com
3  48  asian-nation.org/headlines/index.php
4  49           asiegeofherons.blogspot.com

==> 266 vertices have no incident edges.

Exercise. Compute a view called Outdegrees, which contains the following columns:

  1. Id: vertex ID
  2. Degree: the out-degree of this vertex.

To help you test your view, the following snippet includes a second query that selects from your view, adds a Url field, and orders the results in descending order of degree. It also prints the first few and last few rows of the result, so you can inspect the URLs as a sanity check. (Perhaps it also provides a small bit of entertainment!)


In [10]:
# Complete this query:
query = '''
  CREATE VIEW IF NOT EXISTS Outdegrees AS
    SELECT Source AS Id, COUNT(*) AS Degree
      FROM Edges
      GROUP BY Source
'''
c = conn.cursor ()
c.execute (query)


Out[10]:
<sqlite3.Cursor at 0x106d5cce0>

In [11]:
from IPython.display import display

query = '''
  SELECT Outdegrees.Id, Degree, Url
    FROM Outdegrees, Vertices
    WHERE Outdegrees.Id = Vertices.Id
    ORDER BY -Degree
'''
df_outdegrees = pd.read_sql_query (query, conn)
print "==> A few entries with large out-degrees:"
display (df_outdegrees.head ())
print "\n==> A few entries with small out-degrees:"
display (df_outdegrees.tail ())


==> A few entries with large out-degrees:
      Id  Degree                        Url
0    855     256           blogsforbush.com
1    454     140  newleftblogs.blogspot.com
2    387     131   madkane.com/notable.html
3    512     131      politicalstrategy.org
4    880     123         cayankee.blogs.com

==> A few entries with small out-degrees:

        Id  Degree                             Url
1060  1468       1       watchandwait.blogspot.com
1061  1481       1                   writehouse.us
1062  1486       1  youngconservative.blogspot.com
1063  1488       1             zeke01.blogspot.com
1064  1490       1          zeph1z.tripod.com/blog

Exercise. Query the database to extract a report of which URLs point to which URLs. Also include each source vertex's out-degree, and order the rows in descending order by that out-degree.


In [12]:
query = '''
  SELECT S.Url, T.Url, Out.Degree
    FROM Edges AS E,
         (SELECT Id, Url FROM Vertices) AS S,
         (SELECT Id, Url FROM Vertices) AS T,
         (SELECT Id, Degree FROM Outdegrees) AS Out
    WHERE (E.Source=S.Id) AND (E.Target=T.Id) AND (E.Source=Out.Id)
    ORDER BY -Out.Degree
'''
df_G = pd.read_sql_query (query, conn)

from IPython.display import display
display (df_G.head ())
print ("...")
display (df_G.tail ())


                S.Url                        T.Url  Out.Degree
0    blogsforbush.com             ibe.blogspot.com         256
1    blogsforbush.com  aceoftrumpblog.blogspot.com         256
2    blogsforbush.com         achicknamedmarzi.com         256
3    blogsforbush.com     alamonation.blogspot.com         256
4    blogsforbush.com             alarmingnews.com         256
...
                                S.Url                  T.Url  Out.Degree
19085       watchandwait.blogspot.com       freerepublic.com           1
19086                   writehouse.us  realclearpolitics.com           1
19087  youngconservative.blogspot.com       blogsforbush.com           1
19088             zeke01.blogspot.com     zeke01.typepad.com           1
19089          zeph1z.tripod.com/blog         anncoulter.org           1

Part 2: Implement PageRank

The following exercises will walk you through a possible implementation of PageRank for this dataset.

Exercise. Build a sparse matrix, A_1, that stores $G^TD^{-1}$, where $G^T$ is the transpose of the connectivity matrix $G$, and $D^{-1}$ is the diagonal matrix of inverse out-degrees.


In [ ]:
from cse6040utils import sparse_matrix

A_1 = sparse_matrix () # Initially all zeros, with no rows or columns

In [ ]:
# Insert your code here
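
# One possible sketch, not the only solution. It assumes that
# `sparse_matrix ()` returns a nested dictionary (e.g., a defaultdict
# of defaultdicts) indexed as A[i][j], and it shifts the 1-based
# vertex IDs down to 0-based indices so that they line up with the
# positions of a Python list later on.
n = pd.read_sql_query ("SELECT COUNT(*) AS N FROM Vertices", conn)['N'][0]

query = '''
  SELECT E.Source, E.Target, O.Degree
    FROM Edges AS E, Outdegrees AS O
    WHERE E.Source = O.Id
'''
df_edges = pd.read_sql_query (query, conn)

# Entry (t-1, s-1) of G^T * D^{-1} equals 1 / (out-degree of s)
for (s, t, d) in zip (df_edges['Source'], df_edges['Target'], df_edges['Degree']):
    A_1[t-1][s-1] = 1.0 / d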

Exercise. Implement a function to multiply a sparse matrix by a dense vector, assuming a dense vector defined as follows.


In [ ]:
def dense_vector (n, init_val=0.0):
    """
    Returns a dense vector of length `n`, with all entries set to
    `init_val`.
    """
    return [init_val] * n

In [ ]:
# Implement this routine:
def spmv (n, A, x):
    """Returns a dense vector y of length n, where y = A*x."""
    pass
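
A minimal sketch of one way to complete this routine, assuming A is the nested-dictionary sparse matrix sketched above (with 0-based indices):


In [ ]:
def spmv (n, A, x):
    """Returns a dense vector y of length n, where y = A*x."""
    y = dense_vector (n)
    for (i, row_i) in A.items ():         # Visit stored rows only ...
        for (j, a_ij) in row_i.items ():  # ... and stored entries only
            y[i] += a_ij * x[j]
    return y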

Exercise. Complete the PageRank implementation for this dataset.


In [ ]:
def scale_vector (x, alpha):
    """Scales the dense vector x by a constant alpha."""
    return [x_i*alpha for x_i in x]

def offset_vector (x, c):
    """Adds the scalar value c to every element of a dense vector x."""
    return [x_i+c for x_i in x]

In [ ]:
ALPHA = 0.85 # Probability of following some link
MAX_ITERS = 25

# Let X[t] store the dense x(t) vector at time t
X = []

x_0 = dense_vector (n, 1.0/n) # Initial distribution: 1/n at each page
X.append (x_0)

for t in range (1, MAX_ITERS):
    # Complete this implementation
    
    X.append (...)
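
One possible completion of this loop, applying the update $x(t) = \alpha A_1 x(t-1) + (1-\alpha)/n$ summarized in class and relying on the spmv and A_1 sketches above. Note that vertices with no outgoing links contribute nothing to $A_1 x$, so they are handled only by the teleportation offset in this simplified sketch.


In [ ]:
for t in range (1, MAX_ITERS):
    x_t = spmv (n, A_1, X[t-1])               # A_1 * x(t-1)
    x_t = scale_vector (x_t, ALPHA)           # alpha * A_1 * x(t-1)
    x_t = offset_vector (x_t, (1.0-ALPHA)/n)  # ... + (1-alpha)/n
    X.append (x_t)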

Exercise. Check your result by first inserting the final computed PageRank vector back into the database, and then using a SQL query to see the ranked URLs. In your query output, also include both the in-degrees and out-degrees of each vertex.


In [ ]:
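# One way to do the insertion, sketched under the assumption that
# X[-1] holds the final PageRank iterate and that list position i
# corresponds to vertex ID i+1. (The `PageRank` table name is our
# choice here, not part of the original dataset.)
c = conn.cursor ()
c.execute ("DROP TABLE IF EXISTS PageRank")
c.execute ("CREATE TABLE PageRank (Id INTEGER, Rank REAL)")
c.executemany ("INSERT INTO PageRank (Id, Rank) VALUES (?, ?)",
               [(i+1, r) for (i, r) in enumerate (X[-1])])
conn.commit ()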


In [ ]:
query = '''
  CREATE VIEW IF NOT EXISTS Indegrees AS
    SELECT Target AS Id, COUNT(*) AS Degree
      FROM Edges
      GROUP BY Target
'''
c = conn.cursor ()
c.execute (query)

In [ ]:
# Complete this query:
query = '''
  ...
'''
df_ranks = pd.read_sql_query (query, conn)
display (df_ranks)
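
For reference, here is one query that would complete the cell above, assuming the hypothetical PageRank table sketched earlier. Beware that the inner joins silently drop vertices that lack in-edges or out-edges.


In [ ]:
query = '''
  SELECT V.Id, V.Url, P.Rank,
         I.Degree AS InDegree, O.Degree AS OutDegree
    FROM Vertices AS V, PageRank AS P, Indegrees AS I, Outdegrees AS O
    WHERE (V.Id = P.Id) AND (V.Id = I.Id) AND (V.Id = O.Id)
    ORDER BY -P.Rank
'''
df_ranks = pd.read_sql_query (query, conn)
display (df_ranks.head (10))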

Exercise. The Vertices table includes a column called Leaning, which expresses a political leaning: either "Left" or "Right". How might you use this column to come up with an alternative ranking scheme?


In [ ]:
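# One simple idea, sketched here: aggregate the PageRank mass within
# each leaning (again assuming the hypothetical `PageRank` table from
# the sketch above). You could also re-rank pages within each leaning
# separately.
query = '''
  SELECT V.Leaning, SUM(P.Rank) AS TotalRank
    FROM Vertices AS V, PageRank AS P
    WHERE V.Id = P.Id
    GROUP BY V.Leaning
'''
print (pd.read_sql_query (query, conn))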

Exercise (advanced?). Create an SQL-based implementation of the PageRank algorithm, where you implement the sparse matrix-vector multiply in SQL, rather than in Python as above.


In [ ]:
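# The core of such an implementation is expressing y = A*x as a join
# followed by a grouped sum. A sketch, assuming a hypothetical matrix
# table A(I, J, Val) and a hypothetical vector table X(I, Val):
query = '''
  SELECT A.I AS I, SUM(A.Val * X.Val) AS Val
    FROM A, X
    WHERE A.J = X.I
    GROUP BY A.I
'''
# Each PageRank iteration would materialize this result as the new X
# (e.g., via CREATE TABLE ... AS SELECT), then apply the alpha scaling
# and the (1-alpha)/n offset, and repeat until convergence.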