Lecture 7: Singular value decomposition

Syllabus

Week 1: Python intro
Week 2: Matrices, vectors, norms, ranks
Week 3: Linear systems, eigenvectors, eigenvalues
Week 4: Singular value decomposition + test + homework seminar

Recap of the previous lecture

  • LU decomposition = Solving linear systems
  • QR decomposition = Solving linear systems (can be more accurate in the ill-conditioned case, slower than LU, also used for orthogonalization)

Today's lecture

Today we will talk about the singular value decomposition (SVD)

  • SVD and best low-rank approximation
  • Applications

Singular value decomposition: definition

Any $n \times m$ matrix $A$ can be decomposed into a product
$$ A = U \Sigma V^*, $$ where $U$ is an $n \times k$ matrix with orthonormal columns, $V$ is an $m \times k$ matrix with orthonormal columns, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$ is a diagonal matrix with $\sigma_1 \geq \ldots \geq \sigma_k \geq 0$ and $k = \min(m, n)$.
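A quick NumPy check of this "thin" form on random data (np.linalg.svd with full_matrices=False returns exactly the $n \times k$, $k$ and $k \times m$ factors):

import numpy as np

A = np.random.rand(5, 3)
u, s, vt = np.linalg.svd(A, full_matrices=False)   # vt is V^*
print(u.shape, s.shape, vt.shape)                  # (5, 3) (3,) (3, 3)
print(np.allclose(A, u @ np.diag(s) @ vt))         # A = U Sigma V^*
print(np.allclose(u.T @ u, np.eye(3)), np.allclose(vt @ vt.T, np.eye(3)))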

SVD and low-rank approximation

One of the main features of the SVD is that it solves the minimization problem
$$ \Vert A - A_r \Vert \rightarrow \min, $$ i.e. it finds the closest rank-$r$ matrix to a given matrix, in both the spectral and the Frobenius norm (the Eckart-Young theorem).
The solution is given by
$$ A_r = U_r \Sigma_r V^*_r, $$ where $U_r$ consists of the first $r$ columns of $U$, $V_r$ consists of the first $r$ columns of $V$, and $\Sigma_r$ is the diagonal matrix of the first $r$ singular values.
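A minimal NumPy sketch on random data; the spectral-norm error of the best rank-$r$ approximation equals $\sigma_{r+1}$:

import numpy as np

A = np.random.rand(100, 60)
r = 10
u, s, vt = np.linalg.svd(A, full_matrices=False)
Ar = u[:, :r] @ np.diag(s[:r]) @ vt[:r, :]   # best rank-r approximation
# The spectral-norm error of the best approximation is the (r+1)-st singular value
print(np.linalg.norm(A - Ar, 2), s[r])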

Existence of the SVD

Theorem. Any matrix has an SVD decomposition.

Proof. Recall the definition of $\sigma = \Vert A \Vert_2$ as the operator norm. Then there exist unit vectors $x$, $y$ such that $Ay = \sigma x$. Now introduce two unitary matrices
$$U = [x, U_1], \quad V = [y, V_1].$$ Then, $$ U^* A V = \begin{bmatrix} \sigma & w^* \\ 0 & B \end{bmatrix}, $$ and $$ \left\Vert U^* A V \begin{bmatrix} \sigma \\ w \end{bmatrix}\right\Vert^2_2 \geq (\sigma^2 + w^* w)^2. $$ Dividing by $\left\Vert \begin{bmatrix} \sigma \\ w \end{bmatrix}\right\Vert_2^2 = \sigma^2 + w^*w$ gives $\Vert U^* A V \Vert^2_2 \geq \sigma^2 + w^* w$. Since $\Vert U^* A V \Vert_2 = \Vert A \Vert_2 = \sigma$, we get $w = 0$. Applying the same argument to the block $B$ completes the proof by induction.

SVD and eigenvalues

From $A = U \Sigma V^*$ we get $$ A A^* = U \Sigma^2 U^*, \quad A^* A = V \Sigma^2 V^*, $$ i.e. you can compute the singular values as the square roots of the eigenvalues of $A^*A$ (or $A A^*$). This is not a good idea, however: forming $A^* A$ squares the condition number, so the smallest singular values you can resolve this way are of order $\sqrt{\varepsilon}\,\Vert A \Vert_2$. You just lose half of the digits!
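A small illustration of this loss of accuracy (the matrix with singular values $1$ and $10^{-10}$ is a synthetic test case):

import numpy as np

# Build a matrix with exactly known singular values: 1, 1e-10, and zeros
q1, _ = np.linalg.qr(np.random.rand(10, 10))
q2, _ = np.linalg.qr(np.random.rand(10, 10))
sigma = np.zeros(10); sigma[:2] = [1.0, 1e-10]
A = q1 @ np.diag(sigma) @ q2.T

s_svd = np.linalg.svd(A, compute_uv=False)    # direct SVD
eigs = np.linalg.eigvalsh(A.T @ A)[::-1]      # eigenvalues of A^T A, descending
s_eig = np.sqrt(np.abs(eigs))                 # "singular values" via the eigenvalue route
# The true value 1e-10 lies far below sqrt(eps) ~ 1e-8, so the eigenvalue route loses it
print(s_svd[1], s_eig[1])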

How the SVD is actually computed

All SVD computations are done in two steps.

  1. Reduction of the given matrix to bidiagonal form, $B = U^* A V$, by means of Householder/Givens transformations; this costs $\mathcal{O}(n^3)$ operations.
  2. Solving the SVD problem for a bidiagonal matrix.

The Golub-Kahan bidiagonalization algorithm was the first (and is still one of the most popular) algorithms for solving the SVD problem. It is basically the QR algorithm, implicitly applied to the matrix $B^* B$.
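A minimal sketch of the first step for a square matrix: Householder bidiagonalization that also accumulates $U$ and $V$ (the production LAPACK routines are, of course, far more careful):

import numpy as np

def householder(x):
    """Return a unit vector v such that (I - 2 v v^T) x is a multiple of e_1."""
    v = x.astype(float)
    sign = np.sign(x[0]) if x[0] != 0 else 1.0
    v[0] += sign * np.linalg.norm(x)
    nv = np.linalg.norm(v)
    return v / nv if nv > 0 else v

def bidiagonalize(A):
    """Golub-Kahan reduction of a square matrix to upper bidiagonal form B = U^T A V."""
    B = A.astype(float)
    n = B.shape[0]
    U, V = np.eye(n), np.eye(n)
    for k in range(n):
        # Left reflection: zero out column k below the diagonal
        v = householder(B[k:, k])
        B[k:, :] -= 2.0 * np.outer(v, v @ B[k:, :])
        U[:, k:] -= 2.0 * np.outer(U[:, k:] @ v, v)
        if k < n - 2:
            # Right reflection: zero out row k to the right of the superdiagonal
            w = householder(B[k, k+1:])
            B[:, k+1:] -= 2.0 * np.outer(B[:, k+1:] @ w, w)
            V[:, k+1:] -= 2.0 * np.outer(V[:, k+1:] @ w, w)
    return U, B, V

A = np.random.rand(6, 6)
U, B, V = bidiagonalize(A)
print(np.allclose(U.T @ A @ V, B))                             # B is the bidiagonal reduction
print(np.allclose(np.linalg.svd(A)[1], np.linalg.svd(B)[1]))   # same singular values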

Some other SVD properties

  1. The SVD can be used to compute the pseudoinverse of a matrix, which is useful for solving ill-conditioned or rectangular systems: $$A^{\dagger} = V \Sigma^{\dagger} U^*,$$ where $\Sigma^{\dagger} = \mathrm{diag}\left(\min\left(\frac{1}{\sigma_i}, \frac{1}{\delta}\right)\right)$, i.e. small singular values are smoothed (a sketch follows this list).

  2. The SVD gives full information about the nullspace and the range of the matrix (also illustrated in the sketch below).
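A minimal sketch of both properties; the cutoff $\delta$ and the test matrices are hypothetical (note that np.linalg.pinv truncates small singular values via its rcond parameter rather than capping them):

import numpy as np

def regularized_pinv(A, delta=1e-8):
    # delta is a hypothetical cutoff; small singular values are smoothed (capped at 1/delta)
    u, s, vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.minimum(1.0 / s, 1.0 / delta)
    return vt.T @ (s_inv[:, None] * u.T)

# Ill-conditioned rectangular least-squares problem (a Vandermonde matrix as a test case)
A = np.vander(np.linspace(0, 1, 50), 12, increasing=True)
b = np.random.rand(50)
x = regularized_pinv(A) @ b
print(np.linalg.norm(A @ x - b))                   # residual of the regularized solution

# Range and nullspace of a rank-deficient matrix from its (full) SVD
B = np.random.rand(8, 3) @ np.random.rand(3, 6)    # an 8 x 6 matrix of rank 3
u, s, vt = np.linalg.svd(B)
rank = np.sum(s > 1e-12 * s[0])
range_basis = u[:, :rank]      # orthonormal basis of the range of B
null_basis = vt[rank:].T       # orthonormal basis of the nullspace of B
print(rank, np.linalg.norm(B @ null_basis))        # 3, ~1e-15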

Application 1: Latent semantic analysis

One of the most famous applications is latent semantic indexing; see, for example,
Deerwester, Scott C., et al. "Indexing by latent semantic analysis." (1990)

The problem setup: we have a set of text documents $D_1, \ldots, D_N.$ We want to solve the search problem: given a query as a set of words, find the most relevant documents.
Our data is processed as follows: for each document we create a list of the words it contains and count the frequency of each word. This is called the bag-of-words model (i.e., a document is treated as an unordered set of words).

Term-document matrix

This is how the term-document matrix $A$ is obtained. Its number of rows is the size of the dictionary,
and its number of columns is the number of documents. The element $A_{ij}$ is the frequency of occurrence of the $i$-th word in the $j$-th document.

To do the search, we can just multiply the term-document matrix by a query vector, i.e. an indicator vector of the query words.

The problem: a document will be returned only if there is an exact word match. However, we might search for "Samuel Clemens" and hope to get the results for "Mark Twain" as well, but there is no exact match!
How can the SVD help?

Idea of LSI

Compute a low-rank approximation $A_r$ of the term-document matrix $A$:
$$A \approx A_r,$$ and we do not care about the approximation error (i.e., we do not require it to be small).
The matrix $A_r$ can then be used to answer queries.

We project the documents onto a low-dimensional subspace: given a query $q$, its projection is
$$q_r = \Sigma^{-1}_r U^{\top}_r q.$$ Now we can compute the similarity between $q_r$ and the projected documents
$$\widehat{d}_i = \Sigma^{-1}_r U^{\top}_r d_i$$ as the cosine of the angle between the projected query and each projected document.
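A minimal sketch of this projection-and-cosine scheme on placeholder data (the demo below instead scores queries directly with the low-rank approximation $A_r$):

import numpy as np

# Placeholder data: term_doc is a (terms x documents) matrix, q is a 0/1 query vector
term_doc = np.random.rand(12, 9)
q = np.zeros(12); q[:2] = 1.0

r = 5
u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
u_r, s_r = u[:, :r], s[:r]

q_r = (u_r.T @ q) / s_r                      # project the query: Sigma_r^{-1} U_r^T q
d_r = (u_r.T @ term_doc) / s_r[:, None]      # project all documents at once
cos = (q_r @ d_r) / (np.linalg.norm(q_r) * np.linalg.norm(d_r, axis=0))
print(np.argsort(-cos))                      # documents ranked by cosine similarity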

Demo

Now we can test a demo database


In [3]:
import numpy as np
import pandas as pd
import re #Regular expressions
rows = ['human', 'interface', 'computer', 'user', 'system', 'response', 'time', 'EPS', 'survey', 'trees', 'graph', 'minors']
nterms = len(rows)
docs = []
docs += ['Human machine interface for Lab ABC computer applications']
docs += ['A survey of user opinions of computer system response time']
docs += ['The EPS user interfaces management system']
docs += ['System and human system engineering testing of EPS']
docs += ['Relation of user-perceived response time on user management']
docs += ['The generation of random, binary, unordered trees']
docs += ['The intersection graph of paths in trees']
docs += ['Graph minors IV: Width of trees and well-quasi-ordering']
docs += ['Graph minors: A survey']
ndocs = len(docs)
term_doc = np.zeros((nterms, ndocs))

nonletters = re.compile('[^a-zA-Z]+')  # split on any non-letter characters
doc_words = [set(nonletters.split(d.lower())) for d in docs]  # keep `docs` itself for labels later
rows = [w.lower() for w in rows]
for i in range(nterms):
    for j in range(ndocs):
        if rows[i] in doc_words[j]:
            term_doc[i, j] = 1

# Use pandas to display the term-document matrix
pd.DataFrame(data=term_doc, index=rows)


Out[3]:
0 1 2 3 4 5 6 7 8
human 1 0 0 1 0 0 0 0 0
interface 1 0 0 0 0 0 0 0 0
computer 1 1 0 0 0 0 0 0 0
user 0 1 1 0 1 0 0 0 0
system 0 1 1 1 0 0 0 0 0
response 0 1 0 0 1 0 0 0 0
time 0 1 0 0 1 0 0 0 0
eps 0 0 1 1 0 0 0 0 0
survey 0 1 0 0 0 0 0 0 1
trees 0 0 0 0 0 1 1 1 0
graph 0 0 0 0 0 0 1 1 1
minors 0 0 0 0 0 0 0 1 1

Now we can compare the results between ordinary matvec and low-rank matvec.


In [5]:
query = 'Human computer interaction'
qv = np.zeros(nterms)
for i in range(nterms):
    if re.search(rows[i], query, re.IGNORECASE):
        qv[i] = 1
res1 = qv.dot(term_doc)  # Non-compressed search result

u, s, v = np.linalg.svd(term_doc)
r = 5
u = u[:, :r]
s = s[:r]
v = v[:r, :]  # NumPy returns V^* (already transposed)
appr1 = u.dot(np.diag(s)).dot(v)
res2 = qv.dot(appr1)
res_all = np.vstack((res1, res2)).T  # To make two columns
print('The query is:', query, ', the scores are:')
pd.DataFrame(res_all, index=docs, columns=['No SVD', 'SVD'])


The query is: Human computer interaction , the scores are:
Out[5]:
No SVD SVD
Human machine interface for Lab ABC computer applications 2 2.087969
A survey of user opinions of computer system response time 1 0.927427
The EPS user interfaces management system 0 0.329092
System and human system engineering testing of EPS 1 0.605656
Relation of user-perceived response time on user management 0 -0.043570
The generation of random, binary, unordered trees 0 -0.008597
The intersection graph of paths in trees 0 -0.027086
Graph minors IV: Width of trees and well-quasi-ordering 0 -0.044239
Graph minors: A survey 0 0.097481

Application 2: Collaborative filtering

Another important (and similar) application comes from recommender systems.
Suppose we have a user-product matrix: each user gives ratings to some of the products.
The matrix is of size (number of users) $\times$ (number of products). The goal is to recommend additional products to a particular user.
The scheme is the same: we compute a low-rank approximation via the SVD, and the recommendation for each user is just the corresponding row of the approximated matrix.


In [6]:
data_read = np.loadtxt('task2_transact_eval.txt',dtype=np.int32, skiprows=1,delimiter='|')

In [8]:
#Read the data
columns = ['SessionNo', 'ItemNo', 'TransType']
ds = pd.DataFrame(data=data_read,columns=columns)
ds


Out[8]:
SessionNo ItemNo TransType
0 0 15029 0
1 0 12103 0
2 0 11755 0
3 0 2246 0
4 1 6498 0
5 1 6498 1
6 1 6498 0
7 1 3726 0
8 1 3726 1
9 1 6498 0
10 1 6720 0
11 1 11396 0
12 1 11396 1
13 1 6498 2
14 1 3726 2
15 1 11396 2
16 3 5335 0
17 3 724 0
18 3 9436 0
19 4 10003 0
20 4 5166 0
21 4 1962 0
22 4 15235 0
23 4 1608 0
24 4 5214 0
25 4 11358 0
26 4 7566 0
27 4 2825 0
28 4 5214 0
29 4 1962 0
... ... ... ...
985115 145683 3901 2
985116 145683 329 2
985117 145685 13867 0
985118 145685 13805 0
985119 145685 11800 0
985120 145685 14161 0
985121 145685 9266 0
985122 145685 4710 0
985123 145685 4710 1
985124 145685 5208 0
985125 145685 1179 0
985126 145685 7555 0
985127 145685 7555 1
985128 145685 14690 0
985129 145685 14690 1
985130 145687 9378 0
985131 145687 1027 0
985132 145687 3671 0
985133 145687 9378 0
985134 145688 8197 0
985135 145688 4079 0
985136 145688 4842 0
985137 145688 14759 0
985138 145688 15259 0
985139 145688 13197 0
985140 145689 14168 0
985141 145689 5373 0
985142 145690 3781 0
985143 145690 5025 0
985144 145690 5025 1

985145 rows × 3 columns


In [10]:
ds_buy = ds[ds.TransType==2][['SessionNo', 'ItemNo']]
arr = np.array(ds_buy)
users, inv_users = np.unique(arr[:, 0], return_inverse=True)
products, inv_products = np.unique(arr[:, 1], return_inverse=True)
print('Unique sessions:', len(users), 'Unique products', len(products))
# SciPy sparse matrices are used below


Unique sessions: 14164 Unique products 9679

In [17]:
import scipy.sparse
import scipy.sparse.linalg
%matplotlib inline 
import matplotlib.pyplot as plt
plt.xkcd()
nnz = len(inv_users)
spmat = scipy.sparse.coo_matrix((np.ones(nnz), (inv_users, inv_products)))
r = 2  # Approximation rank
u, s, v = scipy.sparse.linalg.svds(spmat, r)
n = spmat.shape[0]
m = spmat.shape[1]
q = np.zeros(n)
user_id = 5
q[user_id] = 1.0

qrec = q.T.dot(u).dot(np.diag(s)).dot(v)  # row of the rank-r approximation for this user
qrec = qrec / np.max(abs(qrec))           # normalize the scores
qb = spmat.T.dot(q)                       # products actually bought in this session
plt.plot(qrec)
plt.plot(qb)
plt.xlabel('Product ID')
plt.ylabel('Recommendation')


Out[17]:
<matplotlib.text.Text at 0x1143df050>

Application 3: Dense matrix compression

Dense matrices typically require $N^2$ elements to be stored. For $N \sim 10^4 - 10^5$ the memory requirements become prohibitive. A low-rank approximation reduces the storage to $\mathcal{O}(Nr)$ elements.
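A back-of-the-envelope check of the savings (the sizes are illustrative):

N, r = 10**4, 20
full_storage = N * N                    # dense storage: N^2 numbers
low_rank_storage = 2 * N * r + r        # U_r, V_r and the r singular values
print(full_storage / low_rank_storage)  # ~250x compression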


In [22]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.xkcd()
n = 256
a = [[1.0/(np.sqrt(i + 3 * j + 0.5)) for i in range(n)] for j in range(n)]
a = np.array(a)
u, s, v = np.linalg.svd(a)
plt.semilogy(s/s[0])
plt.title('Singular values decay for a Hilbert matrix')


Out[22]:
<matplotlib.text.Text at 0x1157e4bd0>

A more realistic example

We can test a more realistic matrix, since solving linear systems with the Hilbert matrix has little practical sense. Instead, we consider a matrix $$A_{ij} = \frac{1}{i - j + \frac{1}{2}},$$ which corresponds to the integral equation $$ \int \frac{q(y)\,dy}{x - y} = f(x). $$ In real life, the equation
$$ \int_{\Omega} \frac{q(y)\,dy}{\Vert x - y\Vert } = f(x) $$ is solved, where $\Omega$ is a surface in 3D. This is used, for example, in modelling integrated circuits. Let us see what happens with the singular values.


In [29]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.xkcd()
n = 256
a = [[1.0/(i - j + 0.5) for i in range(n)] for j in range(n)]
a = np.array(a)
u, s, v = np.linalg.svd(a)
plt.plot(s)
plt.title('Singular values decay for a Cauchy-Hilbert matrix')
plt.tight_layout()


What to do?
The idea is to split the matrix into blocks
$$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, $$ and the off-diagonal blocks $A_{12}$ and $A_{21}$ will be of low rank! Let us try that.


In [30]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.xkcd()
n = 256
a = [[1.0/(i - j + 0.5) for i in range(n)] for j in range(n)]
a = np.array(a)

a12 = a[:n//2, n//2:]
a21 = a[n//2:, :n//2]
s12 = np.linalg.svd(a12)[1]
s21 = np.linalg.svd(a21)[1]
plt.semilogy(s12)
plt.semilogy(s21)  # Actually, they are the same


Out[30]:
[<matplotlib.lines.Line2D at 0x115c44850>]

The whole idea is then applied recursively, leading to $\mathcal{O}(N r \log N)$ storage.
A good question is how to compute the approximation without computing all the entries -- here the skeleton decomposition comes into play: to get a rank-$r$ approximation, it is enough to compute $r$ columns and $r$ rows.
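A minimal skeleton (cross) sketch on the off-diagonal block from above; the $r$ rows and columns are chosen uniformly purely for illustration (in practice they are selected adaptively, e.g. by the maxvol algorithm):

import numpy as np

n, r = 256, 12
a = np.array([[1.0/(i - j + 0.5) for i in range(n)] for j in range(n)])
a12 = a[:n//2, n//2:]                                   # off-diagonal block with fast singular value decay

col_idx = np.linspace(0, a12.shape[1] - 1, r, dtype=int)  # r columns (uniform choice, illustration only)
row_idx = np.linspace(0, a12.shape[0] - 1, r, dtype=int)  # r rows
C = a12[:, col_idx]
R = a12[row_idx, :]
Ahat = C @ np.linalg.pinv(a12[np.ix_(row_idx, col_idx)]) @ R   # skeleton approximation
print(np.linalg.norm(a12 - Ahat) / np.linalg.norm(a12))        # relative error; compare with the decay plotted above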

Application 4: Eigenfaces

You will have this task in your HW. The problem setup is similar: you have a database of faces, which you write down as a face-pixel matrix (you can also store it as a 3D array). The goal is, given a new face, to find the best match in the database. The scheme is the same: you compute the projection onto a low-dimensional subspace using the SVD, and then compare everything in this low-dimensional subspace. The leading "singular vectors", reshaped back into images, are the so-called eigenfaces.
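A minimal sketch of the scheme on random placeholder data (the names faces and new_face are hypothetical; in the HW they come from the face database):

import numpy as np

faces = np.random.rand(100, 64 * 64)        # placeholder: (n_faces x n_pixels) matrix
new_face = np.random.rand(64 * 64)          # placeholder query image, flattened

mean_face = faces.mean(axis=0)
u, s, vt = np.linalg.svd(faces - mean_face, full_matrices=False)
r = 20
basis = vt[:r]                              # "eigenfaces": top right singular vectors

coords = (faces - mean_face) @ basis.T      # database in the low-dimensional subspace
query = (new_face - mean_face) @ basis.T
best = np.argmin(np.linalg.norm(coords - query, axis=1))
print('Best match: face', best)             # nearest face in the subspace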

Higher dimensions

There are several ways to generalize the SVD to multidimensional arrays (tensors). In data mining this is very natural: for example, in a real recommender system the data are organized as

user-product-action

where the action is either "watch", "put into the cart" or "buy".
It is natural to organize such data into a 3D tensor of size

number of users $\times$ number of products $\times$ number of actions.

This is where the higher-order SVD (HOSVD) is used.
The idea is to decompose a tensor in the form
$$ A_{ijk} \approx \sum_{\alpha \beta \gamma} U_{i \alpha} V_{j \beta} W_{k \gamma} g_{\alpha \beta \gamma}. $$ This can be done by considering the three unfoldings of the tensor and computing their left singular vectors.
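A minimal HOSVD sketch on a random tensor with hypothetical multilinear ranks $(r_1, r_2, r_3)$:

import numpy as np

A = np.random.rand(20, 30, 40)
r1, r2, r3 = 5, 5, 5

def left_factor(unfolding, r):
    # leading left singular vectors of a mode unfolding
    return np.linalg.svd(unfolding, full_matrices=False)[0][:, :r]

U = left_factor(A.reshape(A.shape[0], -1), r1)                       # mode-1 unfolding
V = left_factor(np.moveaxis(A, 1, 0).reshape(A.shape[1], -1), r2)    # mode-2 unfolding
W = left_factor(np.moveaxis(A, 2, 0).reshape(A.shape[2], -1), r3)    # mode-3 unfolding

g = np.einsum('ijk,ia,jb,kc->abc', A, U, V, W)        # core tensor
A_hosvd = np.einsum('abc,ia,jb,kc->ijk', g, U, V, W)  # truncated reconstruction
# For random data this relative error is large; for structured data it decays with the ranks
print(np.linalg.norm(A - A_hosvd) / np.linalg.norm(A))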

Questions?
