Lecture 3: Scalar product, orthogonality, matrix rank, singular value decomposition


Week 2: Matrices, vectors, norms, ranks

Recap of the previous lecture

  • Norms are measures of smallness, used to compute the accuracy
  • Definition of $1$, $p$ and Euclidean norms
  • Matrix norms
  • $L_1$ norm (Manhattan distance) is used in compressed sensing as a surrogate for sparsity.

Today lecture

Today we will talk about:

  • Scalar products and unitary (orthogonal) matrices
  • Rank of the matrix
  • Singular value decomposition (SVD) and low-rank approximation

Scalar product

If norm is a measure of distances, then the scalar product takes angle into account.
The scalar product is defined as $$ (x, y) = \sum_{i=1}^n \overline{x}_i y_i, $$ where $\overline{x}$ denotes the complex conjugate of $x$. The Euclidean norm is then $$ \Vert x \Vert^2 = (x, x), $$ or it is said the the norm is induced by the scalar product.

Remark For the angle between two vectors is defined as $$ \cos \phi = \frac{(x, y)}{\Vert x \Vert \Vert y \Vert} $$

An important property of the scalar product is the Cauchy-Bunyakovski inequality: $$ |(x, y)| \leq \Vert x \Vert \Vert y \Vert, $$ and thus the angle between two vectors is defined properly.

The scalar product can be written as a matrix-by-matrix product
$$ (x, y) = x^* y, $$ where $^*$ is a conjugate transpose of the matrix:
$$ B = A^*, \quad B_{ij} = \overline{A_{ji}}. $$

Norm conservation

For stability it is really important that the error does not grow. Suppose that you approximately get a vector,
$$ \Vert x - \widehat{x} \Vert \leq \varepsilon. $$ Then the final result is (some) linear transformation of $x$:
$$ y = Ux, \quad \widehat{y} = U \widehat{x}. $$ If we want to estimate a difference between $\widehat{y}$ and $y$:
$$ \Vert y - \widehat{y} \Vert = \Vert U ( x - \widehat{x}) \Vert \leq \Vert U \Vert \varepsilon. $$ The question is for which kind of matrices the length of the vector will not change. For the euclidean norm this produces a very important class of matrices: unitary (or orthogonal in the real case) matrices.

Unitary (orthogonal) matrices

Let $U$ be an $n \times r$ matrix, and $\Vert U z \Vert = \Vert z \Vert$ for all $z$. This can happen if and only if
$$ U^* U = I, $$ where $I$ is an identity matrix Indeed, $$\Vert Uz \Vert^2 = (Uz, Uz) = (Uz)^* Uz = z^* (U^* U) z = z^* z,$$ which can only hold if $U^* U = I$.

Unitary matrices

In the real case, when $U^* = U^{\top}$, the matrix is called orthogonal.
Are there many unitary matrices? First of all, a product of two unitary matrices is a unitary matrix:
$$(UV)^* UV = V^* (U^* U) V = V^* V = I,$$ thus if we give some non-trivial examples of unitary matrices, we will be able to get any unitary transformation.

Examples of unitary matrices

There are two important classes of unitary matrices:

  1. Householder matrices
  2. Givens (Jacobi) matrices

Householder matrices

Householder matrix is the matrix of the form
$$H = I - 2 vv^*,$$ where $v$ is an $n \times 1$ matrix and $v^* v = 1$. Can you show that $H$ is unitary? It is also a reflection.
A simple proof: $H^* H = (I - 2 vv^*)(I - 2 v v^*) = I$

Givens(Jacobi) matrix

A Givens matrix is a matrix
$$ A = \begin{bmatrix} \cos \alpha & \sin \alpha \\ -\sin \alpha & \cos \alpha \end{bmatrix}, $$ which is a rotation. For a general case, we select two $(i, j)$ planes and rotate only in those: $$ x'_i = \cos \alpha x_i + \sin \alpha x_j, \quad x'_j = -\sin \alpha x_i + \cos\alpha x_j, $$ with all other $x_i$ remain unchanged.

Summary on unitary matrices

  • Unitary matrices preserve the norm
  • There are two "basic" classes of unitary matrices, Householder and Givens.
  • Every unitary matrix can be represented as a product of those.

Matrix and linear spaces

A matrix can be considered as a sequence of vectors, its columns: $$ A = [a_1, \ldots, a_m], $$ where $a_m \in \mathbb{C}^n$.
A matrix-by-vector product is equivalent to taking a linear combination of those vectors
$$ y = Ax, \quad \Longleftrightarrow y = a_1 x_1 + a_2 x_2 + \ldots +a_m x_m. $$

Linear dependence

The vectors are called linearly dependent, if there exist non-zero coefficients $x_i$ such that $$\sum_i a_i x_i = 0,$$ or in the matrix form, $$ Ax = 0, \quad \Vert x \Vert \ne 0. $$ In this case, we say that the matrix $A$ has non-trivial nullspace.

Dimension of a linear space

A linear space is defined as all possible vectors of the form $$ \sum_{i=1}^m a_i x_i, $$ where $x_i$ are some coefficients and $a_i, i = 1, \ldots, m$ are given vectors. In the matrix form, the linear space is a collection of the vectors of the form
$$A x.$$ This set is also called the range of the matrix.

Matrix rank

What is a matrix rank? A matrix rank is defined as the maximal number of linearly independent columns in a matrix, or the dimension of its column space.
You can also use linear combination of rows to define the rank.

The dimension of the column space of the matrix is equal to the dimension of the row space of the matrix.

Skeleton decomposition

A very useful representation for computation of the matrix rank is the skeleton decomposition, which can be graphically represented as follows:
or in the matrix form $$ A = U \widehat{A}^{-1} V^{\top}, $$ where $U$ are some $r$ columns of $A$, $V^{\top}$ are some rows of $A$, $\widehat{A}$ is the submatrix on the intersection, which should be nonsingular.

An inverse of the matrix $P$ the matrix $Q = P^{-1}$ such that
$ P Q = QP = I$.
If the matrix is square and has full rank (i.e. equal to $n$) then the inverse exists.

Proof for the skeleton decomposition

Let $U$ be the $r$ columns based on the submatrix $\widehat{A}$. Then they are linearly independent. Take any other column $u$ of $A$. Then it is a linear combination of the columns of $U$, i.e.
$$u = U x$$.
These are $n$ equations. We take $r$ of those corresponding to the rows that contain $\widehat{A}$ and get the equation
$$\widehat{u} = \widehat{A} x,$$ therefore
$$x = \widehat{A}^{-1} \widehat{u},$$ and that gives us the result.

A closer look

Any rank-$r$ matrix can be written in the form $$A = U \widehat{A}^{-1} V^{\top},$$ where $U$ is $n \times r$, $V$ is $m \times r$ and $\widehat{A}$ is $r \times r$, or $$ A = U' V'^{\top}. $$ So, every rank-$r$ matrix can be written as a product of a "tall" matrix $U'$ by a long matrix $V'$.

In the index form, it is
$$ a_{ij} = \sum_{\alpha=1}^r u_{i \alpha} v_{j \alpha}. $$ For rank 1 we have $$ a_{ij} = u_i v_j, $$ i.e. it is a separation of indices and rank-$r$ is a sum of rank-$1$ matrices!


It is interesting to note, that for the rank-$r$ matrix $$A = U V^{\top}$$ only $U$ and $V$ can be stored, which gives us $(n+m) r$ parameters, so it can be used for compression. We can also compute matrix-by-vector product much faster.

In [1]:
import numpy as np
n = 1000
r = 10
u = np.random.randn(n, r)
v = np.random.randn(n, r)
a = u.dot(v.T) #u.dot(v.T.conj())
x = np.random.randn(n)
%timeit a.dot(x)
%timeit u.dot(v.T.dot(x))

1000 loops, best of 3: 559 µs per loop
100000 loops, best of 3: 11.9 µs per loop

We can also try to compute the matrix rank using the built-in np.linalg.matrix_rank function

In [3]:
#Computing matrix rank
import numpy as np
n = 50 
a = np.ones((n, n))
print 'Rank of the matrix:', np.linalg.matrix_rank(a)
b = a + 1e-10 * np.random.randn(n, n)
print 'Rank of the matrix:', np.linalg.matrix_rank(b)

Rank of the matrix: 1
Rank of the matrix: 50

Instability of the matrix rank

For any rank-$r$ matrix $A$ with $r < \min(m, n)$ there is a matrix $B$ such that its rank rank is equal to $\min(m, n)$ and $$ \Vert A - B \Vert = \varepsilon. $$ So, does this mean that numerically matrix rank has no meaning? (I.e., small perturbations lead to full rank)!
The solution is to compute the best rank-r approximation to a matrix.

Singular value decomposition

To compute low-rank approximation, we need to compute singular value decomposition.
Any matrix $A$ can be written as a product of three matrices:
$$ A = U \Sigma V^*, $$ where $U$ is an $n \times K$ unitary matrix, $V$ is an $m \times K$ unitary matrix, $K = \min(m, n)$ $\Sigma$ is a diagonal matrix with non-negative elements $\sigma_1 \geq \ldots, \geq \sigma_K$ on the diagonal.

Computation of the best rank-$r$ approximation is equivalent to setting $\sigma_{r+1}= 0, \ldots, \sigma_K = 0$. The error is then determined by the omitted singular value!
$$ \min_{A_r} \Vert A - A_r \Vert = \sigma_{r+1}, $$ that is why it is important to look at the decay of the singular vectors.

Low-rank approximation via SVD

$\Vert A - A_r \Vert = \Vert U \Sigma V^* - A_r \Vert = \Vert \Sigma - U^* A_r V \Vert$,
where we used that for spectral norm (and for the Frobenius as well) we have

$$\Vert X Q \Vert = \Vert X \Vert$$

for any unitary $Q$ and any matrix $X$. This is called unitary equivalence of the spectral norm (and Frobenius as well).
What is left is to note that the best rank-$r$ approximation of the diagonal matrix is its subdiagonal.

Computing the SVD is tricky; we will talk about the algorithms hidden in this computation later. But we already can do this in Python!

In [4]:
#Computing matrix rank
import numpy as np
n = 50 
a = np.ones((n, n))
print 'Rank of the matrix:', np.linalg.matrix_rank(a)
b = a + 1e-5 * np.random.randn(n, n)
print 'Rank of the matrix:', np.linalg.matrix_rank(b)

Rank of the matrix: 1
Rank of the matrix: 50

In [15]:
u, s, v = np.linalg.svd(b)
print s[1]
r = 1
u1 = u[:, :r]
s1 = s[:r]
v1 = v[:r, :]
a1 = u1.dot(np.diag(s1).dot(v1))
print np.linalg.norm(a - a1, 'fro')/np.linalg.norm(a)


We can use SVD to compute approximations of function-related matrices, i.e. the matrices of the form $$a_{ij} = f(x_i, y_j),$$ where $f$ is a certain function, and $x_i, \quad i = 1, \ldots, n$ and $y_j, \quad j = 1, \ldots, m$ are some one-dimensional grids.

In [32]:
%matplotlib inline
import prettyplotlib as ppl
n = 1000
a = [[1.0/(i+j+1) for i in xrange(n)] for j in xrange(n)] #Hilbert matrix 
a = np.array(a)
u, s, v = np.linalg.svd(a)
fig, ax = ppl.subplots(1, 1)
#We have very good low-rank approximation of it!

[<matplotlib.lines.Line2D at 0x10f813a50>]

Now we can do something more interesting, like function approximation

In [23]:
import numpy as np
n = 128
t = np.linspace(0, 5, n)
x, y = np.meshgrid(t, t)
f = 1.0 / (x + y ** 2 + 0.5)
u, s, v = np.linalg.svd(f, full_matrices=False)
r = 10
u = u[:, :r]
s = s[:r]
v = v[:r, :] #Mind the transpose here!
fappr = u.dot(np.diag(s).dot(v))
er = np.linalg.norm(fappr - f, 'fro') / np.linalg.norm(f, 'fro')

And do some 3D plotting

In [27]:
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 5))
ax = fig.add_subplot(131, projection='3d')
ax.plot_surface(x, y, f)
ax.set_title('Original function')
ax = fig.add_subplot(132, projection='3d')
ax.plot_surface(x, y, fappr)
ax.set_title('Approximated function')
ax = fig.add_subplot(133, projection='3d')
ax.plot_surface(x, y, fappr - f)
ax.set_title('Approximation error with rank=%d, err=%3.1e' % (r, er))

Linear factor analysis & low-rank

Consider a linear factor model,

$$y = Ax, $$

where $y$ is a vector of length $n$, and $x$ is a vector of length $r$.
The data is organizes as samples: we observe vectors
$$y_1, \ldots, y_T,$$ then the factor model can be written as
$$ Y = AX, $$ where $Y$ is $n \times T$, $A$ is $n \times r$ and $X$ is $r \times T$. This is exactly a rank-$r$ model: it tells us that the vector lie in a small subspace!
We also can use SVD to recover this subspace (but not the independent components). Principal component analysis can be done by SVD!

Take home message

  • Scalar product
  • Unitary/orthogonal matrices and norm conservation
  • Matrix rank definition
  • Skeleton approximation and dyadic representation of a rank-$r$ matrix
  • Singular value decomposition

In [63]:
from IPython.core.display import HTML
def css_styling():
    styles = open("./styles/custom.css", "r").read()
    return HTML(styles)