CSE 6040, Fall 2015 [13]: PageRank (cont'd)

This notebook is identical to Lab 12, but with solutions provided for Part 1.

In this notebook, you'll implement the PageRank algorithm summarized in class. You'll test it on a real dataset (circa 2005) consisting of political blogs and the links among them.

Note that the presentation in class follows the matrix view of the algorithm. Cleve Moler (inventor of MATLAB) has a nice set of notes here.

For today's notebook, you'll need to download the following additional materials:

  • The cse6040utils module, a Python module containing some handy routines from previous classes: link (Note: this module is already part of the git repo for our notebooks if you are pulling from there.)
  • A SQLite version of the political blogs dataset: http://cse6040.gatech.edu/fa15/poliblogs.db (~ 611 KiB)

Part 1: Explore the Dataset

Let's start by looking at the dataset, to get a feel for what it contains.

Incidentally, one of you asked recently how to get the schema of a SQLite database from Python. Here is some code adapted from a few ideas floating around on the web. Let's use it to inspect the tables available in the political blogs dataset.


In [1]:
import sqlite3 as db
import pandas as pd

def get_table_names (conn):
    assert type (conn) == db.Connection # Only works for sqlite3 DBs
    query = "SELECT name FROM sqlite_master WHERE type='table'"
    return pd.read_sql_query (query, conn)

def print_schemas (conn, table_names=None):
    assert type (conn) == db.Connection # Only works for sqlite3 DBs
    if table_names is None:
        table_names = get_table_names (conn)['name']
    c = conn.cursor ()
    query = "PRAGMA TABLE_INFO ({table})"
    for name in table_names:
        c.execute (query.format (table=name))
        columns = c.fetchall ()
        print ("=== {table} ===".format (table=name))
        col_string = "[{id}] {name} : {type}"
        for col in columns:
            print (col_string.format (id=col[0],
                                      name=col[1],
                                      type=col[2]))
        print ("\n")

conn = db.connect ('poliblogs.db')

for name in get_table_names (conn)['name']:
    print_schemas (conn, [name])
    query = '''SELECT * FROM %s LIMIT 5''' % name
    print (pd.read_sql_query (query, conn))
    print ("\n")


=== Vertices ===
[0] index : INTEGER
[1] Id : INTEGER
[2] Url : TEXT
[3] Leaning : TEXT


   index  Id                         Url Leaning
0      0   1        100monkeystyping.com    Left
1      1   2  12thharmonic.com/wordpress    Left
2      2   3       40ozblog.blogspot.com    Left
3      3   4             4lina.tblog.com    Left
4      4   5       750volts.blogspot.com    Left


=== Edges ===
[0] index : INTEGER
[1] Source : INTEGER
[2] Target : INTEGER


   index  Source  Target
0      0     267    1394
1      1     267     483
2      2     267    1051
3      3     904    1479
4      4     904     919


Exercise. Write a snippet of code to verify that the vertex IDs are dense in some interval $[1, n]$. That is, there is a minimum value of $1$, some maximum value $n$, and no missing values between $1$ and $n$.


In [5]:
query = '''
  SELECT MIN(Id) AS MinId,
         MAX(Id) AS MaxId,
         COUNT(DISTINCT Id) AS NumDistinctIds
    FROM Vertices
'''
df = pd.read_sql_query (query, conn)
print (df)
assert df.MinId[0] == 1
assert df.MaxId[0] == df.NumDistinctIds[0]
print ("\n==> Verified: Vertex ids cover [1, %d] densely." \
       % df.NumDistinctIds[0])


   MinId  MaxId  NumDistinctIds
0      1   1490            1490

==> Verified: Vertex ids cover [1, 1490] densely.

Exercise. Make sure every edge has its end points in the vertex table.


In [7]:
query = '''
  SELECT {col} FROM Edges
    WHERE {col} NOT IN (SELECT Id FROM Vertices)
'''

df_s = pd.read_sql_query (query.format (col='Source'), conn)
print (df_s['Source'])

df_t = pd.read_sql_query (query.format (col='Target'), conn)
print (df_t['Target'])

assert df_s['Source'].empty
assert df_t['Target'].empty

print ("==> Verified: All source and target IDs are vertices.")


Series([], Name: Source, dtype: object)
Series([], Name: Target, dtype: object)
==> Verified: All source and target IDs are vertices.

Exercise. Determine which vertices have no incident edges. Store the number of such vertices in a variable, num_solo_vertices.


In [9]:
query = '''
  SELECT Id, Url
    FROM Vertices
    WHERE (Id NOT IN (SELECT DISTINCT Source FROM Edges))
          AND (Id NOT IN (SELECT DISTINCT Target FROM Edges))
'''
df_solo_vertices = pd.read_sql_query (query, conn)
print (df_solo_vertices.head ())

num_solo_vertices = len (df_solo_vertices)

# Our testing code follows, assuming your `num_solo_vertices` variable:
print ("\n==> %d vertices have no incident edges." % num_solo_vertices)
assert num_solo_vertices == 266


   Id                                   Url
0   3                 40ozblog.blogspot.com
1   4                       4lina.tblog.com
2  25       americandreamslost.blogspot.com
3  48  asian-nation.org/headlines/index.php
4  49           asiegeofherons.blogspot.com

==> 266 vertices have no incident edges.

Exercise. Compute a view called Outdegrees, which contains the following columns:

  1. Id: vertex ID
  2. Degree: the out-degree of this vertex.

To help you test your view, the following snippet includes a second query that selects from your view, adds a Url field, and orders the results in descending order of degree. It also prints the first few and last few rows of the result, so you can inspect the URLs as a sanity check. (Perhaps it also provides a small bit of entertainment!)


In [10]:
# Complete this query:
query = '''
  CREATE VIEW IF NOT EXISTS Outdegrees AS
    SELECT Source AS Id, COUNT(*) AS Degree
      FROM Edges
      GROUP BY Source
'''
c = conn.cursor ()
c.execute (query)


Out[10]:
<sqlite3.Cursor at 0x106d5cce0>

In [11]:
from IPython.display import display

query = '''
  SELECT Outdegrees.Id, Degree, Url
    FROM Outdegrees, Vertices
    WHERE Outdegrees.Id = Vertices.Id
    ORDER BY -Degree
'''
df_outdegrees = pd.read_sql_query (query, conn)
print "==> A few entries with large out-degrees:"
display (df_outdegrees.head ())
print "\n==> A few entries with small out-degrees:"
display (df_outdegrees.tail ())


==> A few entries with large out-degrees:
      Id  Degree                        Url
0    855     256           blogsforbush.com
1    454     140  newleftblogs.blogspot.com
2    387     131   madkane.com/notable.html
3    512     131      politicalstrategy.org
4    880     123         cayankee.blogs.com

==> A few entries with small out-degrees:

        Id  Degree                             Url
1060  1468       1       watchandwait.blogspot.com
1061  1481       1                   writehouse.us
1062  1486       1  youngconservative.blogspot.com
1063  1488       1             zeke01.blogspot.com
1064  1490       1          zeph1z.tripod.com/blog

Exercise. Query the database to extract a report of which URLs point to which URLs. Also include each source vertex's out-degree, and order the rows in descending order by that out-degree.


In [12]:
query = '''
  SELECT S.Url, T.Url, Out.Degree
    FROM Edges AS E,
         (SELECT Id, Url FROM Vertices) AS S,
         (SELECT Id, Url FROM Vertices) AS T,
         (SELECT Id, Degree FROM Outdegrees) AS Out
    WHERE (E.Source=S.Id) AND (E.Target=T.Id) AND (E.Source=Out.Id)
    ORDER BY -Out.Degree
'''
df_G = pd.read_sql_query (query, conn)

from IPython.display import display
display (df_G.head ())
print ("...")
display (df_G.tail ())


                S.Url                        T.Url  Out.Degree
0    blogsforbush.com             ibe.blogspot.com         256
1    blogsforbush.com  aceoftrumpblog.blogspot.com         256
2    blogsforbush.com         achicknamedmarzi.com         256
3    blogsforbush.com     alamonation.blogspot.com         256
4    blogsforbush.com             alarmingnews.com         256
...
                                S.Url                  T.Url  Out.Degree
19085       watchandwait.blogspot.com       freerepublic.com           1
19086                   writehouse.us  realclearpolitics.com           1
19087  youngconservative.blogspot.com       blogsforbush.com           1
19088             zeke01.blogspot.com     zeke01.typepad.com           1
19089          zeph1z.tripod.com/blog         anncoulter.org           1

Part 2: Implement PageRank

The following exercises will walk you through a possible implementation of PageRank for this dataset.

Exercise. Build a sparse matrix, A_1, that stores $G^TD^{-1}$, where $G^T$ is the transpose of the connectivity matrix $G$, and $D^{-1}$ is the diagonal matrix of inverse out-degrees.


In [ ]:
from cse6040utils import sparse_matrix

A_1 = sparse_matrix () # Initially all zeros, with no rows or columns

In [ ]:
# Insert your code here
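
# One possible sketch, not the only solution. It assumes that
# `sparse_matrix ()` returns a nested dictionary (e.g., a defaultdict
# of defaultdicts) indexed as A[i][j], and it shifts the 1-based
# vertex IDs down to 0-based indices so that they line up with the
# positions of a Python list later on.
n = pd.read_sql_query ("SELECT COUNT(*) AS N FROM Vertices", conn)['N'][0]

query = '''
  SELECT E.Source, E.Target, O.Degree
    FROM Edges AS E, Outdegrees AS O
    WHERE E.Source = O.Id
'''
df_edges = pd.read_sql_query (query, conn)

# Entry (t-1, s-1) of G^T * D^{-1} equals 1 / (out-degree of s)
for (s, t, d) in zip (df_edges['Source'], df_edges['Target'], df_edges['Degree']):
    A_1[t-1][s-1] = 1.0 / d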

Exercise. Implement a function to multiply a sparse matrix by a dense vector, assuming a dense vector defined as follows.


In [ ]:
def dense_vector (n, init_val=0.0):
    """
    Returns a dense vector of length `n`, with all entries set to
    `init_val`.
    """
    return [init_val] * n

In [ ]:
# Implement this routine:
def spmv (n, A, x):
    """Returns a dense vector y of length n, where y = A*x."""
    pass
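
A minimal sketch of one way to complete this routine, assuming A is the nested-dictionary sparse matrix sketched above (with 0-based indices):


In [ ]:
def spmv (n, A, x):
    """Returns a dense vector y of length n, where y = A*x."""
    y = dense_vector (n)
    for (i, row_i) in A.items ():         # Visit stored rows only ...
        for (j, a_ij) in row_i.items ():  # ... and stored entries only
            y[i] += a_ij * x[j]
    return y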

Exercise. Complete the PageRank implementation for this dataset.


In [ ]:
def scale_vector (x, alpha):
    """Scales the dense vector x by a constant alpha."""
    return [x_i*alpha for x_i in x]

def offset_vector (x, c):
    """Adds the scalar value c to every element of a dense vector x."""
    return [x_i+c for x_i in x]

In [ ]:
ALPHA = 0.85 # Probability of following some link
MAX_ITERS = 25

# Let X[t] store the dense x(t) vector at time t
X = []

x_0 = dense_vector (n, 1.0/n) # Initial distribution: 1/n at each page
X.append (x_0)

for t in range (1, MAX_ITERS):
    # Complete this implementation
    
    X.append (...)
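
One possible completion of this loop, applying the update $x(t) = \alpha A_1 x(t-1) + (1-\alpha)/n$ summarized in class and relying on the spmv and A_1 sketches above. Note that vertices with no outgoing links contribute nothing to $A_1 x$, so they are handled only by the teleportation offset in this simplified sketch.


In [ ]:
for t in range (1, MAX_ITERS):
    x_t = spmv (n, A_1, X[t-1])               # A_1 * x(t-1)
    x_t = scale_vector (x_t, ALPHA)           # alpha * A_1 * x(t-1)
    x_t = offset_vector (x_t, (1.0-ALPHA)/n)  # ... + (1-alpha)/n
    X.append (x_t)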

Exercise. Check your result by first inserting the final computed PageRank vector back into the database, and then using a SQL query to see the ranked URLs. In your query output, also include both the in-degrees and out-degrees of each vertex.


In [ ]:
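# One way to do the insertion, sketched under the assumption that
# X[-1] holds the final PageRank iterate and that list position i
# corresponds to vertex ID i+1. (The `PageRank` table name is our
# choice here, not part of the original dataset.)
c = conn.cursor ()
c.execute ("DROP TABLE IF EXISTS PageRank")
c.execute ("CREATE TABLE PageRank (Id INTEGER, Rank REAL)")
c.executemany ("INSERT INTO PageRank (Id, Rank) VALUES (?, ?)",
               [(i+1, r) for (i, r) in enumerate (X[-1])])
conn.commit ()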


In [ ]:
query = '''
  CREATE VIEW IF NOT EXISTS Indegrees AS
    SELECT Target AS Id, COUNT(*) AS Degree
      FROM Edges
      GROUP BY Target
'''
c = conn.cursor ()
c.execute (query)

In [ ]:
# Complete this query:
query = '''
  ...
'''
df_ranks = pd.read_sql_query (query, conn)
display (df_ranks)
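
For reference, here is one query that would complete the cell above, assuming the hypothetical PageRank table sketched earlier. Beware that the inner joins silently drop vertices that lack in-edges or out-edges.


In [ ]:
query = '''
  SELECT V.Id, V.Url, P.Rank,
         I.Degree AS InDegree, O.Degree AS OutDegree
    FROM Vertices AS V, PageRank AS P, Indegrees AS I, Outdegrees AS O
    WHERE (V.Id = P.Id) AND (V.Id = I.Id) AND (V.Id = O.Id)
    ORDER BY -P.Rank
'''
df_ranks = pd.read_sql_query (query, conn)
display (df_ranks.head (10))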

Exercise. The Vertices table includes a column called Leaning, which expresses a political leaning: either "Left" or "Right". How might you use this column to come up with an alternative ranking scheme?


In [ ]:
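# One simple idea, sketched here: aggregate the PageRank mass within
# each leaning (again assuming the hypothetical `PageRank` table from
# the sketch above). You could also re-rank pages within each leaning
# separately.
query = '''
  SELECT V.Leaning, SUM(P.Rank) AS TotalRank
    FROM Vertices AS V, PageRank AS P
    WHERE V.Id = P.Id
    GROUP BY V.Leaning
'''
print (pd.read_sql_query (query, conn))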

Exercise (advanced?). Create an SQL-based implementation of the PageRank algorithm, where you implement the sparse matrix-vector multiply in SQL, rather than in Python as above.


In [ ]:
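# The core of such an implementation is expressing y = A*x as a join
# followed by a grouped sum. A sketch, assuming a hypothetical matrix
# table A(I, J, Val) and a hypothetical vector table X(I, Val):
query = '''
  SELECT A.I AS I, SUM(A.Val * X.Val) AS Val
    FROM A, X
    WHERE A.J = X.I
    GROUP BY A.I
'''
# Each PageRank iteration would materialize this result as the new X
# (e.g., via CREATE TABLE ... AS SELECT), then apply the alpha scaling
# and the (1-alpha)/n offset, and repeat until convergence.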