This notebook is identical to Lab 12, but with solutions provided for Part 1.
In this notebook, you'll implement the PageRank algorithm summarized in class. You'll test it on a real dataset (circa 2005) that consists of political blogs and their links among one another.
Note that the presentation in class follows the matrix view of the algorithm. Cleve Moler (inventor of MATLAB) has a nice set of notes here.
For today's notebook, you'll need to download the following additional materials:
cse6040utils
module, which is a Python module containing some handy routines from previous classes: link (Note: This module is already part of the git
repo for our notebooks if you are pulling from there.)Incidentally, one of you asked recently how to get the schema for a SQLite database when using Python. Here is some code adapted from a few ideas floating around on the web. Let's use these to inspect the tables available in the political blogs dataset.
In [1]:
import sqlite3 as db
import pandas as pd
def get_table_names (conn):
assert type (conn) == db.Connection # Only works for sqlite3 DBs
query = "SELECT name FROM sqlite_master WHERE type='table'"
return pd.read_sql_query (query, conn)
def print_schemas (conn, table_names=None, limit=0):
assert type (conn) == db.Connection # Only works for sqlite3 DBs
if table_names is None:
table_names = get_table_names (conn)
c = conn.cursor ()
query = "PRAGMA TABLE_INFO ({table})"
for name in table_names:
c.execute (query.format (table=name))
columns = c.fetchall ()
print ("=== {table} ===".format (table=name))
col_string = "[{id}] {name} : {type}"
for col in columns:
print (col_string.format (id=col[0],
name=col[1],
type=col[2]))
print ("\n")
conn = db.connect ('poliblogs.db')
for name in get_table_names (conn)['name']:
print_schemas (conn, [name])
query = '''SELECT * FROM %s LIMIT 5''' % name
print (pd.read_sql_query (query, conn))
print ("\n")
Exercise. Write a snippet of code to verify that the vertex IDs are dense in some interval $[1, n]$. That is, there is a minimum value of $1$, some maximum value $n$, and no missing values between $1$ and $n$.
In [5]:
query = '''
SELECT MIN(Id) AS MinId,
MAX(Id) AS MaxId,
COUNT(DISTINCT Id) AS NumDistinctIds
FROM Vertices
'''
df = pd.read_sql_query (query, conn)
print df
assert df.MinId[0] == 1
assert df.MaxId[0] == df.NumDistinctIds[0]
print ("\n==> Verified: Vertex ids cover [1, %d] densely." \
% df.NumDistinctIds[0])
Exercise. Make sure every edge has its end points in the vertex table.
In [7]:
query = '''
SELECT {col} FROM Edges
WHERE {col} NOT IN (SELECT Id FROM Vertices)
'''
df_s = pd.read_sql_query (query.format (col='Source'), conn)
print (df_s['Source'])
df_t = pd.read_sql_query (query.format (col='Target'), conn)
print (df_t['Target'])
assert df_s['Source'].empty
assert df_t['Target'].empty
print ("==> Verified: All source and target IDs are vertices.")
Exercise. Determine which vertices have no incident edges. Store the number of such vertices in a variable, num_solo_vertices
.
In [9]:
query = '''
SELECT Id, Url
FROM Vertices
WHERE (Id NOT IN (SELECT DISTINCT Source FROM Edges))
AND (Id NOT IN (SELECT DISTINCT Target FROM Edges))
'''
df_solo_vertices = pd.read_sql_query (query, conn)
print df_solo_vertices.head ()
num_solo_vertices = len (df_solo_vertices)
# Our testing code follows, assuming your `num_solo_vertices` variable:
print ("\n==> %d vertices have no incident edges." % num_solo_vertices)
assert num_solo_vertices == 266
Exercise. Compute a view called Outdegrees
, which contains the following columns:
Id
: vertex IDDegree
: the out-degree of this vertex.To help you test your view, the following snippet includes a second query that selects from your view but adds a Url field and orders the results in descending order of degree. It also prints first few and last few rows of this query, so you can inspect the URLs as a sanity check. (Perhaps it also provides a small bit of entertainment!)
In [10]:
# Complete this query:
query = '''
CREATE VIEW IF NOT EXISTS Outdegrees AS
SELECT Source AS Id, COUNT(*) AS Degree
FROM Edges
GROUP BY Source
'''
c = conn.cursor ()
c.execute (query)
Out[10]:
In [11]:
from IPython.display import display
query = '''
SELECT Outdegrees.Id, Degree, Url
FROM Outdegrees, Vertices
WHERE Outdegrees.Id = Vertices.Id
ORDER BY -Degree
'''
df_outdegrees = pd.read_sql_query (query, conn)
print "==> A few entries with large out-degrees:"
display (df_outdegrees.head ())
print "\n==> A few entries with small out-degrees:"
display (df_outdegrees.tail ())
Exercise. Query the database to extract a report of which URLs point to which URLs. Also include the source vertex out-degree and order the rows in descending order by it.
In [12]:
query = '''
SELECT S.Url, T.Url, Out.Degree
FROM Edges AS E,
(SELECT Id, Url FROM Vertices) AS S,
(SELECT Id, Url FROM Vertices) AS T,
(SELECT Id, Degree FROM Outdegrees) AS Out
WHERE (E.Source=S.Id) AND (E.Target=T.Id) AND (E.Source=Out.Id)
ORDER BY -Out.Degree
'''
df_G = pd.read_sql_query (query, conn)
from IPython.display import display
display (df_G.head ())
print ("...")
display (df_G.tail ())
Exercise. Build a sparse matrix, A_1
, that stores $G^TD^{-1}$, where $G^T$ is the transpose of the connectivity matrix $G$, and $D^{-1}$ is the diagonal matrix of inverse out-degrees.
In [ ]:
from cse6040utils import sparse_matrix
A_1 = sparse_matrix () # Initially all zeros, with no rows or columns
In [ ]:
# Insert your code here
Exercise. Implement a function to multiply a sparse matrix by a dense vector, assuming a dense vector defined as follows.
In [ ]:
def dense_vector (n, init_val=0.0):
"""
Returns a dense vector of length `n`, with all entries set to
`init_val`.
"""
return [init_val] * n
In [ ]:
# Implement this routine:
def spmv (n, A, x):
"""Returns a dense vector y of length n, where y = A*x."""
pass
Exercise. Complete the PageRank implementation for this dataset.
In [ ]:
def scale_vector (x, alpha):
"""Scales the dense vector x by a constant alpha."""
return [x_i*alpha for x_i in x]
def offset_vector (x, c):
"""Adds the scalar value c to every element of a dense vector x."""
return [x_i+c for x_i in x]
In [ ]:
ALPHA = 0.85 # Probability of following some link
MAX_ITERS = 25
# Let X[t] store the dense x(t) vector at time t
X = []
x_0 = dense_vector (n, 1.0/n) # Initial distribution: 1/n at each page
X.append (x_0)
for t in range (1, MAX_ITERS):
# Complete this implementation
X.append (...)
Exercise. Check your result by first inserting the final computed PageRank vector back into the database, and then using a SQL query to see the ranked URLs. In your query output, also include both the in-degrees and out-degrees of each vertex.
In [ ]:
In [ ]:
query = '''
CREATE VIEW IF NOT EXISTS Indegrees AS
SELECT Target AS Id, COUNT(*) AS Degree
FROM Edges
GROUP BY Target
'''
c = conn.cursor ()
c.execute (query)
In [ ]:
# Complete this query:
query = '''
...
'''
df_ranks = pd.read_sql_query (query, conn)
display (df_ranks)
Exercise. The Vertices
table includes a column called, Leaning
, which expresses a political leaning -- either "Left" or "Right". How might you use this column to come up with an alternative ranking scheme?
In [ ]:
Exercise (advanced?). Create an SQL-based implementation of the PageRank algorithm, where you implement the sparse matrix-vector multiply in SQL, rather than in Python as above.
In [ ]: