Today's lab continues Lab 18, which was about how to store and operate on dense matrices using Numpy (and SciPy). By the way, a partial solution set for Lab 18 is also available here.
Downloads. For today's lab, you'll need two additional downloads:
Go ahead and download these files now.
The dataset is actually from your current homework (#2)! It's the first million pairs of the user-user connectivity graph, in CSV format.
To repeat, the recommended importing convention for Numpy is (execute this now):
In [ ]:
import numpy as np
In [ ]:
import pandas as pd
from IPython.display import display
In [ ]:
edges = pd.read_csv ('UserEdges-1M.csv')
display (edges.head ())
Exercise. What does this code do?
In [ ]:
V_names = set (edges.Source)
V_names.update (set (edges.Target))
m = len (edges)
n = len (V_names)
print ("==> |V| == %d, |E| == %d" % (n, m))
In [ ]:
import cse6040utils as cse6040
Exercise. What does the following code do?
In [ ]:
id2name = {}
name2id = {}
for (k, v) in enumerate (V_names):
if k <= 5: print ("[%d] %s" % (k, v)) # for debugging
id2name[k] = v
name2id[v] = k
A_named_keys = cse6040.sparse_matrix ()
A_numbered_keys = cse6040.sparse_matrix ()
for (k, row) in edges.iterrows ():
ni = row['Source']
nj = row['Target']
A_named_keys[ni][nj] = 1.
A_named_keys[nj][ni] = 1.
i = name2id[ni]
j = name2id[nj]
A_numbered_keys[i][j] = 1.
A_numbered_keys[j][i] = 1.
In [ ]:
# Build a dense vector
x = cse6040.dense_vector (n)
%timeit cse6040.spmv (n, A_numbered_keys, x)
Exercise. Implement a sparse matrix-vector multiply that works when the matrix is A_named_keys
. How much faster or slower is it than cse6040.spmv()
?
Hint: Feel free to take a look at
cse6040.spmv()
.
In [ ]:
def spmv_named_keys (n, A, x, name2id):
y = cse6040.dense_vector (n)
# @YOUSE: Complete this routine
pass
return y
In [ ]:
# Measures the speed of your implementation:
%timeit spmv_named_keys (n, A_named_keys, x, name2id)
Take a look at the slides that we just started in the last class, which cover the basics of sparse matrix storage formats: link
Although these are available as native formats in SciPy, let's create native Python versions first using lists. We can then compare the performance of, say, sparse matrix-vector multiply, against the ones we ran above.
Exercise. Create a COO-like data structure. You may use the edges
and name2id
raw data structures created above. Name your arrays, coo_rows
, coo_cols
, and coo_vals
.
In [ ]:
coo_rows = [name2id[e] for e in edges['Source']]
coo_cols = [name2id[e] for e in edges['Target']]
coo_vals = [1.] * len (coo_rows)
Exercise. Implement a sparse matrix-vector multiply routine for this COO implementation.
In [ ]:
def coo_spmv (n, R, C, V, x):
"""
Returns y = A*x, where A has 'n' rows and is stored in
COO format by the array triples, (R, C, V).
"""
assert n > 0
assert type (x) is list
assert type (R) is list
assert type (C) is list
assert type (V) is list
assert len (R) == len (C) == len (V)
y = cse6040.dense_vector (n)
# @YOUSE: Fill in this implementation
pass
return y
In [ ]:
%timeit coo_spmv (n, coo_rows, coo_cols, coo_vals, x)
Exercise. Now create a CSR data structure, again using native Python lists. Name your output CSR lists csr_ptrs
, csr_inds
, and csr_vals
.
It's easiest to start with the COO representation. We've given you some start code; just fill in the missing loop.
In [ ]:
# Aside: What does this do? Try running it to see.
z1 = ['q', 'v', 'c']
z2 = [1, 2, 3]
z3 = ['dog', 7, 'man']
print sorted (zip (z1, z2, z3), key=lambda z: z[0])
In [ ]:
C = sorted (zip (coo_rows, coo_cols, coo_vals),
key=lambda t: t[0])
nnz = len (C)
assert n == (C[-1][0] + 1) # Why?
csr_inds = [j for (i, j, a_ij) in C]
csr_vals = [a_ij for (i, j, a_ij) in C]
# @YOUSE: Construct `csr_ptrs`
pass
# Some checks on your implementation: Test the first 10 entries
assert len (csr_ptrs) == (n+1)
assert all ([a==b for (a, b) in zip (csr_ptrs[0:10], [0, 1, 3, 60, 66, 72, 73, 74, 78, 82])])
print ("==> Passed quick test")
Exercise. Now implement a CSR-based sparse matrix-vector multiply.
In [ ]:
def csr_spmv (n, ptr, ind, val, x):
assert n > 0
assert type (ptr) == list
assert type (ind) == list
assert type (val) == list
assert type (x) == list
assert len (ptr) >= (n+1) # Why?
assert len (ind) >= ptr[n] # Why?
assert len (val) >= ptr[n] # Why?
y = cse6040.dense_vector (n)
# @YOUSE: Insert your implementation here
pass
return y
In [ ]:
%timeit csr_spmv (n, csr_ptrs, csr_inds, csr_vals, x)
In [ ]:
import scipy.sparse as sp
Per the notes, here is how we can convert our COO representation from before into a SciPy implementation.
In [ ]:
A_coo = sp.coo_matrix ((vals, (rows, cols)))
Now measure the time to do a sparse matrix-vector multiply in the COO representation. How does it compare to the nested default dictionary approach?
In [ ]:
x_np = np.array (x)
%timeit A_coo.dot (x_np)
Exercise. Repeat the same experiment for SciPy-based CSR.
In [ ]:
# @YOUSE: Fill in your code here
pass
%timeit A_csr.dot (x_np)
In [ ]: