CSE 6040, Fall 2015 [20]: Sparsity in Numpy/SciPy (wrap-up) + Least Squares (new topic)

Today's lab continues Lab 19, which introduced different ways of storing a sparse matrix. We used these as a vehicle for thinking a little bit more about the costs of code.

By the way, a partial solution set for Lab 19 is also available here.

To repeat, we use the recommended import conventions for Numpy and friends; run the first code cell below before anything else.

Getting started

The following code cells repeat some of the things we need from Lab 19 to finish the topic.


In [ ]:
import numpy as np
import pandas as pd
from IPython.display import display
import cse6040utils as cse6040

In [ ]:
edges = pd.read_csv ('UserEdges-1M.csv')

In [ ]:
# Collect the names of all vertices (anything appearing as a source or target)
V_names = set (edges.Source)
V_names.update (set (edges.Target))

In [ ]:
# Map each vertex name to an integer id, then build the adjacency matrix as a
# nested dictionary, inserting both (i, j) and (j, i) so that it is symmetric
name2id = {v: k for (k, v) in enumerate (V_names)}
A_numbered_keys = cse6040.sparse_matrix ()
for (k, row) in edges.iterrows ():
    i = name2id[row['Source']]
    j = name2id[row['Target']]
    A_numbered_keys[i][j] = 1.
    A_numbered_keys[j][i] = 1.

In [ ]:
nnz = len (edges) # Number of non-zeros (edges)
n = len (V_names) # Matrix dimension

# Build a dense vector
x = cse6040.dense_vector (n)

%timeit cse6040.spmv (n, A_numbered_keys, x)
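
For reference, cse6040.spmv multiplies a matrix stored in the nested-dictionary format from Lab 19. Below is a sketch of what such a routine might look like; the actual utility function may differ in its details.

In [ ]:
# A sketch of SpMV for the nested-dictionary format (illustrative only;
# the real cse6040.spmv may differ):
def spmv_nested_dict (n, A, x):
    y = cse6040.dense_vector (n)
    for (i, row_i) in A.items ():       # loop over (non-empty) rows
        for (j, a_ij) in row_i.items (): # loop over nonzeros in row i
            y[i] += a_ij * x[j]
    return y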

Review: COO format

Take a look at the slides that we just started in the last class, which cover the basics of sparse matrix storage formats: link

These are available as native formats in SciPy. However, last time we went ahead and implemented COO using pure native Python objects. The goals of doing so were two-fold:

  1. Learn about an alternative to the "nested dictionary" approach to storing a sparse matrix.
  2. Establish a baseline for comparison against a native Numpy/SciPy implementation.

The following code reminds you how to build a matrix in COO format and measures the performance of a native Python implementation of sparse matrix-vector multiply that operates on COO matrices.


In [ ]:
coo_rows = [name2id[e] for e in edges['Source']]
coo_cols = [name2id[e] for e in edges['Target']]
coo_vals = [1.] * len (coo_rows)

assert len (coo_vals) == nnz # Sanity check against the raw data

In [ ]:
def coo_spmv (n, R, C, V, x):
    """
    Returns y = A*x, where A has 'n' rows and is stored in
    COO format by the array triples, (R, C, V).
    """
    assert n > 0
    assert type (x) is list
    assert type (R) is list
    assert type (C) is list
    assert type (V) is list
    assert len (R) == len (C) == len (V)
    
    y = cse6040.dense_vector (n)
    
    for k in range (len (V)):
        i = R[k]
        j = C[k]
        aij = V[k]
        y[i] += aij * x[j]
        
    return y
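
A quick way to check the routine is to try it on a tiny matrix whose product you can verify by hand. The 2x2 example below is illustrative only, and not part of the original lab.

In [ ]:
# Sanity check on a tiny example: A = [[1., 2.], [0., 3.]], x = [1., 1.]
R_demo = [0, 0, 1]
C_demo = [0, 1, 1]
V_demo = [1., 2., 3.]
assert coo_spmv (2, R_demo, C_demo, V_demo, [1., 1.]) == [3., 3.]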

In [ ]:
%timeit coo_spmv (n, coo_rows, coo_cols, coo_vals, x)

What follows picks up from last time.

CSR format

The compressed sparse row (CSR) format is an alternative to COO. The basic idea is to compress COO a little by recognizing that there is redundancy in the row indices. To see that redundancy, the example in the slides sorts the COO representation by row.
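
To make the compression concrete, here is a tiny, purely illustrative example (not from the slides). The sorted row indices of the COO triples are replaced by n+1 row pointers: row i's entries occupy positions ptrs[i] through ptrs[i+1]-1 of the column-index and value arrays.

In [ ]:
# Illustrative 3x3 example: A = [[1., 0., 2.], [0., 0., 0.], [0., 3., 0.]]
# COO, sorted by row: rows = [0, 0, 2], cols = [0, 2, 1], vals = [1., 2., 3.]
# CSR compresses the row indices [0, 0, 2] into row pointers:
demo_ptrs = [0, 2, 2, 3]  # row 1 is empty, so demo_ptrs[1] == demo_ptrs[2]
demo_inds = [0, 2, 1]
demo_vals = [1., 2., 3.]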

Exercise. Starting from the COO representation, create a CSR data structure, again using native Python lists. Name your output CSR lists csr_ptrs, csr_inds, and csr_vals.


In [ ]:
# Aside: What does this do? Try running it to see.

z1 = [  'q', 'v',  'c' ]
z2 = [   3 ,  1 ,   2  ]
z3 = ['dog',  7 , 'man']

Z = list (zip (z1, z2, z3))
print "==> Before:"
print Z

Z.sort (key=lambda z: z[1])
print "\n==> After:"
print Z

# Note: Alternative to using a lambda (anonymous) function:
def get_second_coord (z):
    return z[1]

Z.sort (key=get_second_coord)

In [ ]:
C = list (zip (coo_rows, coo_cols, coo_vals))
C.sort (key=lambda t: t[0])

assert len (C) == nnz
assert n == (C[-1][0] + 1)  # Why?

csr_inds = [j for (i, j, a_ij) in C]
csr_vals = [a_ij for (i, j, a_ij) in C]

csr_ptrs = [0] * (n+1)
i = 0  # next row whose pointer needs to be set
for j in range (nnz):
    # Position j starts row i whenever the row index at j has caught up
    # to (or passed) i; the while-loop also handles empty rows.
    while C[j][0] >= i:
        csr_ptrs[i] = j
        i += 1
csr_ptrs[n] = nnz  # sentinel: one past the last nonzero

# Alternative solution: See https://piazza.com/class/idap9v1ktp94u9?cid=89
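
One standard alternative, sketched below, builds the pointers by counting the nonzeros in each row and then taking a running (prefix) sum. This sketch is not necessarily the same as the Piazza solution.

In [ ]:
# Count the nonzeros in each row, then prefix-sum the counts.
row_counts = [0] * n
for (i, j, a_ij) in C:
    row_counts[i] += 1

csr_ptrs_alt = [0] * (n+1)
for i in range (n):
    csr_ptrs_alt[i+1] = csr_ptrs_alt[i] + row_counts[i]

assert csr_ptrs_alt == csr_ptrs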

In [ ]:
# Some checks on your implementation: look at the first few row pointers
assert len (csr_ptrs) == (n+1)

print ("==> csr_ptrs[:10]:\n")
print (csr_ptrs[:10])

In [ ]:
first_ten_rows = ["[%d] %s" % (i, str (t))
                  for (i, t) in enumerate (C[:csr_ptrs[10]])]
print ("==> Entries of the first ten rows, C[:%d]:" % csr_ptrs[10])
print ("\n".join (first_ten_rows))

FIRST_TEN = [0, 1, 3, 60, 66, 72, 73, 74, 78, 82]
assert all ([a==b for (a, b) in zip (csr_ptrs[0:10], FIRST_TEN)])
print ("\n==> Passed quick test")

Exercise. Now implement a CSR-based sparse matrix-vector multiply.


In [ ]:
def csr_spmv (n, ptr, ind, val, x):
    """
    Returns y = A*x, where A has 'n' rows and is stored in
    CSR format by the arrays, (ptr, ind, val).
    """
    assert n > 0
    assert type (ptr) == list
    assert type (ind) == list
    assert type (val) == list
    assert type (x) == list
    assert len (ptr) >= (n+1)  # Why?
    assert len (ind) >= ptr[n]  # Why?
    assert len (val) >= ptr[n]  # Why?
    
    y = cse6040.dense_vector (n)
    
    # @YOUSE: Insert your implementation here
    for i in range (n):
        for k in range (ptr[i], ptr[i+1]):
            y[i] += val[k] * x[ind[k]]
    
    return y

In [ ]:
%timeit csr_spmv (n, csr_ptrs, csr_inds, csr_vals, x)
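
As a quick consistency check (not part of the original lab), the CSR-based result should match the COO-based one, up to floating-point rounding from the different summation orders.

In [ ]:
y_coo = coo_spmv (n, coo_rows, coo_cols, coo_vals, x)
y_csr = csr_spmv (n, csr_ptrs, csr_inds, csr_vals, x)
assert all ([abs (a - b) <= 1e-10 * max (1., abs (a))
             for (a, b) in zip (y_coo, y_csr)])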

Sparse matrix storage using SciPy (Numpy)

Let's implement and time some of these routines below.


In [ ]:
import scipy.sparse as sp

Per the notes, here is how we can convert our COO representation from before into a SciPy implementation.


In [ ]:
A_coo = sp.coo_matrix ((coo_vals, (coo_rows, coo_cols)))
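
Note that this constructor infers the matrix dimensions from the largest row and column indices it sees. If the largest vertex id happened not to appear as a source or a target, the inferred shape could be too small, so passing the shape explicitly is a safer habit. (The shape keyword is part of the SciPy API; the rebuild below is optional.)

In [ ]:
# Optionally rebuild with an explicit shape, to be safe:
A_coo = sp.coo_matrix ((coo_vals, (coo_rows, coo_cols)), shape=(n, n))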

Now measure the time to do a sparse matrix-vector multiply in the COO representation. How does it compare to the nested default dictionary approach?


In [ ]:
x_np = np.array (x)

%timeit A_coo.dot (x_np)

Exercise. Repeat the same experiment for SciPy-based CSR.


In [ ]:
# @YOUSE: Solution
A_csr = A_coo.tocsr ()

%timeit A_csr.dot (x_np)
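
By the way, the SciPy CSR object stores the same three arrays we built by hand, as Numpy arrays named indptr, indices, and data. If you like, peek at them to see the correspondence; the exact values may differ from yours if SciPy reorders or merges duplicate entries.

In [ ]:
print (A_csr.indptr[:10])   # analogous to csr_ptrs
print (A_csr.indices[:10])  # analogous to csr_inds
print (A_csr.data[:10])     # analogous to csr_vals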

Linear regression and least squares

Yay! Time for a new topic: linear regression by the method of least squares.

For this topic, let's use the following dataset of US crime statistics from 1960: http://cse6040.gatech.edu/fa15/uscrime.csv

This dataset comes from: http://www.statsci.org/data/general/uscrime.html


In [ ]:
df = pd.read_csv ('uscrime.csv', skiprows=1)
display (df.head ())

Each row of this dataset is a US State. The columns are described here: http://www.statsci.org/data/general/uscrime.html


In [ ]:
import seaborn as sns
%matplotlib inline

In [ ]:
# Look at a few relationships
sns.pairplot (df[['Crime', 'Wealth', 'Ed', 'U1']])

Suppose we wish to build a model of some quantity, called the response variable, given some set of predictors. In the US crimes dataset, the response might be the crime rate (Crime), which we wish to predict from the predictors of income (Wealth), education (Ed), and the unemployment rate of young males (U1).

In a linear regression model, we posit that the response is a linear function of the predictors. That is, suppose there are $m$ observations in total and consider the $i$-th observation. Let $b_i$ be the response of that observation. Then denote the $n$ predictors for observation $i$ as $\{a_{i,1}, a_{i,2}, \ldots, a_{i,n}\}$. From this starting point, we might then posit a linear model of $b$ having the form,

$b_i = x_0 + a_{i,1} x_1 + a_{i,2} x_2 + \cdots + a_{i,n} x_n$,

where we wish to compute the "best" set of coefficients, $\{x_0, x_1, \ldots, x_n\}$. Note that this model includes a constant offset term, $x_0$. Since we want this model to hold for all $m$ observations simultaneously, we effectively want to solve the system of equations,

$$
\left( \begin{array}{c} b_1 \\ b_2 \\ \vdots \\ b_m \end{array} \right)
=
\left( \begin{array}{ccccc}
  1 & a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
  1 & a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
  \vdots & \vdots & \vdots & \ddots & \vdots \\
  1 & a_{m,1} & a_{m,2} & \cdots & a_{m,n}
\end{array} \right)
\left( \begin{array}{c} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right).
$$

Typically, $m \gg n$. That is, there are many more observations (equations) than coefficients (unknowns), so the system is overdetermined and will generally have no exact solution. The method of least squares instead seeks the $x$ that minimizes the squared residual norm, $\|Ax - b\|_2^2$.
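
As a preview of where this is headed, Numpy can compute a least-squares solution directly. The sketch below fits the crime-rate model suggested above; it is illustrative, not the lab's official solution, and assumes df is the data frame loaded earlier.

In [ ]:
# Illustrative sketch: fit Crime ~ 1 + Wealth + Ed + U1 by least squares.
A = np.column_stack ([np.ones (len (df)),      # constant-offset column
                      df['Wealth'], df['Ed'], df['U1']])
b = df['Crime']
x_fit, residuals, rank, sing_vals = np.linalg.lstsq (A, b)
print (x_fit)  # the fitted coefficients, [x_0, x_1, x_2, x_3]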


In [ ]: