Today's lab continues Lab 19, which introduced different ways of storing a sparse matrix. We used these as a vehicle for thinking a little bit more about the costs of code.
By the way, a partial solution set for Lab 19 is also available here.
To repeat, the recommended importing convention for Numpy is (execute this now):
In [ ]:
import numpy as np
import pandas as pd
from IPython.display import display
import cse6040utils as cse6040
In [ ]:
edges = pd.read_csv ('UserEdges-1M.csv')
In [ ]:
V_names = set (edges.Source)
V_names.update (set (edges.Target))
In [ ]:
name2id = {v: k for (k, v) in enumerate (V_names)}
A_numbered_keys = cse6040.sparse_matrix ()
for (k, row) in edges.iterrows ():
i = name2id[row['Source']]
j = name2id[row['Target']]
A_numbered_keys[i][j] = 1.
A_numbered_keys[j][i] = 1.
In [ ]:
nnz = len (edges) # Number of non-zeros (edges)
n = len (V_names) # Matrix dimension
# Build a dense vector
x = cse6040.dense_vector (n)
%timeit cse6040.spmv (n, A_numbered_keys, x)
Take a look at the slides that we just started in the last class, which cover the basics of sparse matrix storage formats: link
These are available as native formats in SciPy. However, last time we went ahead and implemented COO using pure native Python objects. The goals of doing so were two-fold:
The following code reminds you how to build a matrix in COO format and measures the performance of a native Python implementation of sparse matrix-vector multiply that operates on COO matrices.
In [ ]:
coo_rows = [name2id[e] for e in edges['Source']]
coo_cols = [name2id[e] for e in edges['Target']]
coo_vals = [1.] * len (coo_rows)
assert len (coo_vals) == nnz # Sanity check against the raw data
In [ ]:
def coo_spmv (n, R, C, V, x):
"""
Returns y = A*x, where A has 'n' rows and is stored in
COO format by the array triples, (R, C, V).
"""
assert n > 0
assert type (x) is list
assert type (R) is list
assert type (C) is list
assert type (V) is list
assert len (R) == len (C) == len (V)
y = cse6040.dense_vector (n)
for k in range (len (V)):
i = R[k]
j = C[k]
aij = V[k]
y[i] += aij * x[j]
return y
In [ ]:
%timeit coo_spmv (n, coo_rows, coo_cols, coo_vals, x)
Exercise. Now create a CSR data structure, again using native Python lists. Name your output CSR lists csr_ptrs
, csr_inds
, and csr_vals
, starting from the COO representation.
In [ ]:
# Aside: What does this do? Try running it to see.
z1 = [ 'q', 'v', 'c' ]
z2 = [ 3 , 1 , 2 ]
z3 = ['dog', 7 , 'man']
Z = list (zip (z1, z2, z3))
print "==> Before:"
print Z
Z.sort (key=lambda z: z[1])
print "\n==> After:"
print Z
# Note: Alternative to using a lambda (anonymous) function:
def get_second_coord (z):
return z[1]
Z.sort (key=get_second_coord)
In [ ]:
C = list (zip (coo_rows, coo_cols, coo_vals))
C.sort (key=lambda t: t[0])
assert len (C) == nnz
assert n == (C[-1][0] + 1) # Why?
csr_inds = [j for (i, j, a_ij) in C]
csr_vals = [a_ij for (i, j, a_ij) in C]
csr_ptrs = [0] * (n+1)
i = 0 # next row to update
for j in range (nnz):
while C[j][0] >= i:
csr_ptrs[i] = j
i += 1
csr_ptrs[n] = nnz
# Alternative solution: See https://piazza.com/class/idap9v1ktp94u9?cid=89
In [ ]:
# Some checks on your implementation: Look at the first 10 rows
assert len (csr_ptrs) == (n+1)
print ("==> csr_ptrs[:10]:\n")
print (csr_ptrs[:10])
In [ ]:
first_ten_tuples = ["[%d] %s" % (i, str (t))
for (i, t) in enumerate (C[:csr_ptrs[10]])]
print ("==> First ten tuples, C[:%d]:" % csr_ptrs[10])
print ("\n".join (first_ten_tuples))
FIRST_TEN = [0, 1, 3, 60, 66, 72, 73, 74, 78, 82]
assert all ([a==b for (a, b) in zip (csr_ptrs[0:10], FIRST_TEN)])
print ("\n==> Passed quick test")
Exercise. Now implement a CSR-based sparse matrix-vector multiply.
In [ ]:
def csr_spmv (n, ptr, ind, val, x):
assert n > 0
assert type (ptr) == list
assert type (ind) == list
assert type (val) == list
assert type (x) == list
assert len (ptr) >= (n+1) # Why?
assert len (ind) >= ptr[n] # Why?
assert len (val) >= ptr[n] # Why?
y = cse6040.dense_vector (n)
# @YOUSE: Insert your implementation here
for i in range (n):
for k in range (ptr[i], ptr[i+1]):
y[i] += val[k] * x[ind[k]]
return y
In [ ]:
%timeit csr_spmv (n, csr_ptrs, csr_inds, csr_vals, x)
In [ ]:
import scipy.sparse as sp
Per the notes, here is how we can convert our COO representation from before into a SciPy implementation.
In [ ]:
A_coo = sp.coo_matrix ((coo_vals, (coo_rows, coo_cols)))
Now measure the time to do a sparse matrix-vector multiply in the COO representation. How does it compare to the nested default dictionary approach?
In [ ]:
x_np = np.array (x)
%timeit A_coo.dot (x_np)
Exercise. Repeat the same experiment for SciPy-based CSR.
In [ ]:
# @YOUSE: Solution
A_csr = A_coo.tocsr ()
%timeit A_csr.dot (x_np)
Yay! Time for a new topic: linear regression by the method of least squares.
For this topic, let's use the following dataset, which is a crimes dataset from 1960: http://cse6040.gatech.edu/fa15/uscrime.csv
This dataset comes from: http://www.statsci.org/data/general/uscrime.html
In [ ]:
df = pd.read_csv ('uscrime.csv', skiprows=1)
display (df.head ())
Each row of this dataset is a US State. The columns are described here: http://www.statsci.org/data/general/uscrime.html
In [ ]:
import seaborn as sns
%matplotlib inline
In [ ]:
# Look at a few relationships
sns.pairplot (df[['Crime', 'Wealth', 'Ed', 'U1']])
Suppose we wish to build a model of some quantity, called the response variable, given some set of predictors. In the US crimes dataset, the response might be the crime rate (Crime
), which we wish to predict from the predictors of income (Wealth
), education (Ed
), and the unemployment rate of young males (U1
).
In a linear regression model, we posit that the response is a linear function of the predictors. That is, suppose there are $m$ observations in total and consider the $i$-th observation. Let $b_i$ be the response of that observation. Then denote the $n$ predictors for observation $i$ as $\{a_{i,1}, a_{i,2}, \ldots, a_{i,n}\}$. From this starting point, we might then posit a linear model of $b$ having the form,
$b_i = x_0 + a_{i,1} x_1 + a_{i,2} x_2 + \cdots + a_{i,n} x_n$,
where we wish to compute the "best" set of coefficients, $\{x_0, x_1, \ldots, x_n\}$. Note that this model includes a constant offset term, $x_0$. Since we want this model to hold for observations, then we effectively want to solve the system of equations,
$\left( \begin{array}{c} b_1 \\ b_2 \\ \vdots \\ b_m \end{array} \right)
$\left( \begin{array}{ccccc}
1. & a_{1,1} & a_{1,2} & \ldots & a_{1,n} \\
1. & a_{2,1} & a_{2,2} & \ldots & a_{2,n} \\
& & \cdots & & \\
1. & a_{m,1} & a_{m,2} & \ldots & a_{m,n}
\end{array} \right). $
Typically, $m \gg n$.
In [ ]: