Lecture 1: Matrices & vectors


Week 2: Matrices, vectors, norms, ranks

What is a matrix

A matrix is a two-dimensional table. Here is an example of a $3 \times 3$ matrix \begin{equation} A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \end{equation}

A vector is a $n \times 1$ vector (there are row and column vectors).

What we can do with matrices and vectors

  1. Add them: $a = b + c$
  2. Multiply by numbers

Matrix as a linear operator

Matrix is typically used to encode a linear operator: $$ y = A x,$$ called matrix-by-vector product, in the index form
$$y_i = \sum_{j=1}^m A_{ij} x_j, \quad i = 1, \ldots, n.$$

Linear dependencies

Many physical models are formulated as linear equations:

  • Newton law $F = ma$
  • Hookes law $F = kx$

which is based on the fact that if the change is small, everything can be approximated by a linear function: $$ f(x + \delta x) \approx f(x) + \delta x. $$ Of course, nonlinearities may come into play, but even in this case the numerical methods linearize the problem around the current approximation. But matrices and linear dependence also play important role in data analysis as well.

Linear dependencies in real life

Matrix encodes linear dependence.
Linear dependence is the simplest and often very efficient model for the data. We will give two illustrations:

  • Principal component analysis
  • Independent component analysis

Demo: principal component analysis

One of the basic factor models, factor analysis:
$y_1, \ldots, y_P$ are vectors (data points), that are observed. We think of the as linear mixture. We generate random points on a plane and then rotate them by a certain mixture.

We generate a sequence of random points in two dimensions, and setup a rotation matrix $A$.

In [57]:
%matplotlib inline
import matplotlib.pyplot as plt
import prettyplotlib as ppl
from sklearn.decomposition import PCA
import numpy as np

P = 1000
points = np.random.randn(P,2)
A = [[2, 1], [0, 1]] 
A = np.array(A)
ppl.plot(points[:, 0], points[:, 1], ls='', marker='o')

[<matplotlib.lines.Line2D at 0x1109e6a10>]

We can also plot the rotated points: look at how they are "skewed".

In [32]:
R = np.dot(points, A)
ppl.plot(R[:, 0], R[:, 1], ls='', marker='o')

[<matplotlib.lines.Line2D at 0x13a802450>]

We can also rotate them back by finding principal components, which is equivalent to singular value decomposition (SVD). We will talk about SVD later, now we just call it from numpy.linalg module. As we can see, the points are back to symmetric distribution, so we have recovered the matrix.
Principal component analysis is used for many applications, in particular, for the classification problem: after the rotation, the points, belonging to to different classes, can be visually separated.

In [40]:
%matplotlib inline
u, s, v = np.linalg.svd(R, full_matrices=False)
unrotated = R.dot(v.T)
ppl.plot(unrotated[:, 0], unrotated[:, 1], ls='', marker='o')

[<matplotlib.lines.Line2D at 0x10c5dae50>]

Demo: Cocktail party problem

The linear models and the factors may have a real physical meaning. One of the most interesting illustrations is the cocktail party problem, which is defined as follows. We have a set of sources $x(t)$ (people talking) and a set of microphones.
At each microphone we record a linear mixture: $$ y = A x(t) + \eta(t), $$ where $\eta(t)$ is some noise. We do not know $A$ and want to recover it.


Matrix-by-matrix product

Consider composition of two linear operators:

  1. $y = Bx$
  2. $z = Ay$

Then, $z = Ay = A B x = C x$, where $C$ is the matrix-by-matrix product.

A product of an $n \times k$ matrix $A$ and a $k \times m$ matrix $B$ is a $n \times m$ matrix $C$ with the elements
$$ c_{ij} = \sum_{s=1}^k a_{is} b_{sj}, \quad i = 1, \ldots, n, \quad j = 1, \ldots, m $$

Complexity of a naive algorithm for MM is $\mathcal{O}(n^3)$.
Matrix-by-matrix product is the core for almost all efficient algorithms in linear algebra. Basically, all the NLA algorithms are reduced to a sequence of matrix-by-matrix products, so efficient implementation of MM reduces the complexity of numerical algorithms by the same factor. However, implementing MM is not easy at all!

Efficient implementation for MM

Is it easy to multiply a matrix by a matrix?
The answer is: no, if you want it as fast as possible, using the computers that are at hand.


Let us do a short demo and compare a np.dot() procedure which in my case uses MKL with a hand-written matrix-by-matrix routine in Python and also its Cython version (and also gives a very short introduction to Cython).

In [16]:
import numpy as np
def matmul(a, b):
    n = a.shape[0]
    k = a.shape[1]
    m = b.shape[1]
    c = np.zeros((n, m))
    for i in xrange(n):
        for j in xrange(m):
            for s in xrange(k):
                c[i, j] += a[i, s] * b[s, j]

load_ext cythonmagic makes it possible to use %%cython magic for an automatic compilation of the Cython code.
Cython is a Python compiler with additional directives that allow to specify the type of some variables. In many cases it may lead to significant speedup.
For other possibilities you may look at the following comparison

In [13]:
%load_ext cythonmagic

In [18]:
import numpy as np
def cython_matmul(double [:, :] a, double[:, :] b):
    cdef int n = a.shape[0]
    cdef int k = a.shape[1]
    cdef int m = b.shape[1]
    cdef int i
    cdef int j 
    cdef int s
    c = np.zeros((n, m))
    cdef double[:, :] cview = c
    for i in xrange(n):
        for j in xrange(m):
            for s in xrange(k):
                c[i, j] += a[i, s] * b[s, j]
    return c

Then we just time three different routines. Guess the answer.

In [19]:
n = 100
a = np.random.randn(n, n)
b = np.random.randn(n, n)
%timeit c = matmul(a, b)
%timeit cf = cython_matmul(a, b)
%timeit c = np.dot(a, b)

1 loops, best of 3: 1.57 s per loop
1 loops, best of 3: 567 ms per loop
10000 loops, best of 3: 110 µs per loop

Why it is so?
There are two important issues:

  • Computers are more and more parallel (multicore, graphics processing units)
  • The memory pyramid: there is a whole hierarchy of levels

Memory architecture

Fast memory is small, bigger memory is slow.

  • Data fits into the fast memory: Load all data, do computations
  • Data does not fit into the fast memory: load data by chunks, do computations, load again

We need to reduce the number of read/write operations! This is typically achieved in efficient implementations of the BLAS libraries, one of which (Intel MKL) we now use.


Basic linear algebra operations (BLAS) have three levels:

  1. BLAS-1, operations like $c = a + b$
  2. BLAS-2, operations like matrix-by-vector product
  3. BLAS-3, matrix-by-matrix product

What is the principal differences between those?

The main difference is the number of operations vs. the number of data!

  1. BLAS-1: $\mathcal{O}(n)$ data, $\mathcal{O}(n)$ operations
  2. BLAS-2: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^2)$ operations
  3. BLAS-3: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^3)$ operations

Remark: a quest for $\mathcal{O}(n^2)$ matrix-by-matrix multiplication algorithm is not yet done.

Strassen gives $\mathcal{O}(n^{2.78...})$
World record $\mathcal{O}(n^{2.37})$ Reference
The constant is unfortunately too big to make it practical!

Memory hierarchy

How we can use memory hierarchy?

Break the matrix into blocks! ($2 \times 2$ is an illustration) $$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \quad B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} $$

Then, $$ A B = \begin{bmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{bmatrix} $$

If $A_{11}, B_{11}$ and their product fit into the cache memory (which is 1024 Kb for the Haswell Intel Chip), then we load them only once into the memory.

Key point

The number of read/writes is reduced by a factor $\sqrt{M}$, where $M$ is the cache size.

  • Have to do linear algebra in terms of blocks!
  • So, you can not even do Gaussian elimination as usual (or just suffer 10x performance loss)


The blocking has also deep connection with parallel computations.
Consider adding two vectors: $$ c = a + b$$ and we have two processors. How fast can we go? Of course, not faster then twice.

In [10]:
## This demo requires Anaconda distribution to be installed
import mkl
import numpy as np
n = 1000
a = np.random.randn(n)
%timeit a + a
%timeit a + a

100000 loops, best of 3: 2.21 µs per loop
100000 loops, best of 3: 2.13 µs per loop

In [11]:
## This demo requires Anaconda distribution to be installed
import mkl
n = 500
a = np.random.randn(n, n)
%timeit a.dot(a)
%timeit a.dot(a)

100 loops, best of 3: 12.7 ms per loop
100 loops, best of 3: 10.8 ms per loop

Typically, two cases are distinguished:

  1. Shared memory (i.e., multicore on every desktop/smartphone)
  2. Distributed memory (i.e. each processor has its own memory, can send information through a network)

In both cases, the efficiency is governed by a bandwidth:
I.e., for BLAS-1 routines (like sum of two vectors) reads/writes take all the time.
For BLAS-3 routines, the speedup can be obtained that is more noticable.
For large-scale clusters (>100 000 cores, see the Top500 list) there is still scaling.


In [64]:
from IPython.core.display import HTML
def css_styling():
    styles = open("./styles/custom.css", "r").read()
    return HTML(styles)


In [ ]: