CSE 6040, Fall 2015 [19]: Sparse matrix storage

Today's lab continues Lab 18, which was about how to store and operate on dense matrices using Numpy (and SciPy). By the way, a partial solution set for Lab 18 is also available here.

Downloads. For today's lab, you'll need two additional downloads:

Go ahead and download these files now.

The dataset is actually from your current homework (#2)! It's the first million pairs of the user-user connectivity graph, in CSV format.

To repeat, the recommended importing convention for Numpy is (execute this now):


In [ ]:
import numpy as np

Sample dataset

Start by looking at the sample dataset.


In [ ]:
import pandas as pd
from IPython.display import display

In [ ]:
edges = pd.read_csv ('UserEdges-1M.csv')
display (edges.head ())

Exercise. What does this code do?


In [ ]:
V_names = set (edges.Source)
V_names.update (set (edges.Target))

m = len (edges)
n = len (V_names)
print ("==> |V| == %d, |E| == %d" % (n, m))

Sparse matrix storage: Baseline methods

Let's start by reminding ourselves how our previous method for storing sparse matrices, based on nested default dictionaries, works and performs.


In [ ]:
import cse6040utils as cse6040

Exercise. What does the following code do?


In [ ]:
id2name = {}
name2id = {}
for (k, v) in enumerate (V_names):
    if k <= 5: print ("[%d] %s" % (k, v)) # for debugging
    id2name[k] = v
    name2id[v] = k
    
A_named_keys = cse6040.sparse_matrix ()
A_numbered_keys = cse6040.sparse_matrix ()

for (k, row) in edges.iterrows ():
    ni = row['Source']
    nj = row['Target']
    A_named_keys[ni][nj] = 1.
    A_named_keys[nj][ni] = 1.
    
    i = name2id[ni]
    j = name2id[nj]
    A_numbered_keys[i][j] = 1.
    A_numbered_keys[j][i] = 1.

Hopefully, you deduced that A_numbered_keys above is constructed in such a way that it will work with the sparse matrix-vector multiply routine we created in Lab 14 (solutions).


In [ ]:
# Build a dense vector
x = cse6040.dense_vector (n)

%timeit cse6040.spmv (n, A_numbered_keys, x)

Exercise. Implement a sparse matrix-vector multiply that works when the matrix is A_named_keys. How much faster or slower is it than cse6040.spmv()?

Hint: Feel free to take a look at cse6040.spmv().


In [ ]:
def spmv_named_keys (n, A, x, name2id):
    y = cse6040.dense_vector (n)
    
    # @YOUSE: Complete this routine
    pass

    return y

In [ ]:
# Measures the speed of your implementation:
%timeit spmv_named_keys (n, A_named_keys, x, name2id)

@TODO: Check error

Alternative formats: COO and CSR formats

Take a look at the slides that we just started in the last class, which cover the basics of sparse matrix storage formats: link

Although these are available as native formats in SciPy, let's create native Python versions first using lists. We can then compare the performance of, say, sparse matrix-vector multiply, against the ones we ran above.

Exercise. Create a COO-like data structure. You may use the edges and name2id raw data structures created above. Name your arrays, coo_rows, coo_cols, and coo_vals.


In [ ]:
coo_rows = [name2id[e] for e in edges['Source']]
coo_cols = [name2id[e] for e in edges['Target']]
coo_vals = [1.] * len (coo_rows)

Exercise. Implement a sparse matrix-vector multiply routine for this COO implementation.


In [ ]:
def coo_spmv (n, R, C, V, x):
    """
    Returns y = A*x, where A has 'n' rows and is stored in
    COO format by the array triples, (R, C, V).
    """
    assert n > 0
    assert type (x) is list
    assert type (R) is list
    assert type (C) is list
    assert type (V) is list
    assert len (R) == len (C) == len (V)
    
    y = cse6040.dense_vector (n)
    
    # @YOUSE: Fill in this implementation
    pass
        
    return y

In [ ]:
%timeit coo_spmv (n, coo_rows, coo_cols, coo_vals, x)

Exercise. Now create a CSR data structure, again using native Python lists. Name your output CSR lists csr_ptrs, csr_inds, and csr_vals.

It's easiest to start with the COO representation. We've given you some start code; just fill in the missing loop.


In [ ]:
# Aside: What does this do? Try running it to see.

z1 = ['q', 'v', 'c']
z2 = [1, 2, 3]
z3 = ['dog', 7, 'man']

print sorted (zip (z1, z2, z3), key=lambda z: z[0])

In [ ]:
C = sorted (zip (coo_rows, coo_cols, coo_vals),
            key=lambda t: t[0])
nnz = len (C)

assert n == (C[-1][0] + 1)  # Why?

csr_inds = [j for (i, j, a_ij) in C]
csr_vals = [a_ij for (i, j, a_ij) in C]

# @YOUSE: Construct `csr_ptrs`
pass

# Some checks on your implementation: Test the first 10 entries
assert len (csr_ptrs) == (n+1)
assert all ([a==b for (a, b) in zip (csr_ptrs[0:10], [0, 1, 3, 60, 66, 72, 73, 74, 78, 82])])
print ("==> Passed quick test")

Exercise. Now implement a CSR-based sparse matrix-vector multiply.


In [ ]:
def csr_spmv (n, ptr, ind, val, x):
    assert n > 0
    assert type (ptr) == list
    assert type (ind) == list
    assert type (val) == list
    assert type (x) == list
    assert len (ptr) >= (n+1)  # Why?
    assert len (ind) >= ptr[n]  # Why?
    assert len (val) >= ptr[n]  # Why?
    
    y = cse6040.dense_vector (n)
    
    # @YOUSE: Insert your implementation here
    pass
    
    return y

In [ ]:
%timeit csr_spmv (n, csr_ptrs, csr_inds, csr_vals, x)

Sparse matrix storage using SciPy (Numpy)

Let's implement and time some of these routines below.


In [ ]:
import scipy.sparse as sp

Per the notes, here is how we can convert our COO representation from before into a SciPy implementation.


In [ ]:
A_coo = sp.coo_matrix ((vals, (rows, cols)))

Now measure the time to do a sparse matrix-vector multiply in the COO representation. How does it compare to the nested default dictionary approach?


In [ ]:
x_np = np.array (x)

%timeit A_coo.dot (x_np)

Exercise. Repeat the same experiment for SciPy-based CSR.


In [ ]:
# @YOUSE: Fill in your code here
pass

%timeit A_csr.dot (x_np)

In [ ]: