A matrix is a two-dimensional table. Here is an example of a $3 \times 3$ matrix \begin{equation} A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \end{equation}

A vector is a $n \times 1$ vector (there are **row** and **column** vectors).

Many physical models are formulated as **linear equations**:

- Newton law $F = ma$
- Hookes law $F = kx$

which is based on the fact that if the change is small, everything can be approximated by a linear function:
$$
f(x + \delta x) \approx f(x) + \delta x.
$$
Of course, nonlinearities may come into play, but even in this case the numerical methods **linearize** the problem around the current approximation. But matrices and linear dependence also play important role in data analysis as well.

We generate a sequence of random points in two dimensions, and setup a rotation matrix $A$.

```
In [57]:
```%matplotlib inline
import matplotlib.pyplot as plt
import prettyplotlib as ppl
from sklearn.decomposition import PCA
import numpy as np
P = 1000
points = np.random.randn(P,2)
A = [[2, 1], [0, 1]]
A = np.array(A)
ppl.plot(points[:, 0], points[:, 1], ls='', marker='o')

```
Out[57]:
```

We can also plot the rotated points: look at how they are "skewed".

```
In [32]:
```R = np.dot(points, A)
ppl.plot(R[:, 0], R[:, 1], ls='', marker='o')

```
Out[32]:
```

**principal components**, which is equivalent to singular value decomposition (SVD). We will talk about SVD later, now we just call it from **numpy.linalg** module. As we can see, the points are back to symmetric distribution, so we have recovered the matrix.

Principal component analysis is used for many applications, in particular, for the **classification problem**: after the rotation, the points, belonging to to different classes, can be visually separated.

```
In [40]:
```%matplotlib inline
u, s, v = np.linalg.svd(R, full_matrices=False)
unrotated = R.dot(v.T)
ppl.plot(unrotated[:, 0], unrotated[:, 1], ls='', marker='o')

```
Out[40]:
```

The linear models and the factors may have a real physical meaning.
One of the most interesting illustrations is the **cocktail party problem**, which is defined as follows.
We have a set of sources $x(t)$ (people talking) and a set of microphones.

At each microphone we record a **linear mixture**:
$$
y = A x(t) + \eta(t),
$$
where $\eta(t)$ is some noise.
We do not know $A$ and want to recover it.

Consider composition of two linear operators:

- $y = Bx$
- $z = Ay$

Then, $z = Ay = A B x = C x$, where $C$ is the **matrix-by-matrix product**.

A product of an $n \times k$ matrix $A$ and a $k \times m$ matrix $B$ is a $n \times m$ matrix $C$ with the elements

$$
c_{ij} = \sum_{s=1}^k a_{is} b_{sj}, \quad i = 1, \ldots, n, \quad j = 1, \ldots, m
$$

Complexity of a naive algorithm for MM is $\mathcal{O}(n^3)$.

Matrix-by-matrix product is the **core** for almost all efficient algorithms in linear algebra. Basically, all the NLA algorithms are reduced to a sequence of matrix-by-matrix products, so efficient implementation of MM reduces the complexity of numerical algorithms by the same factor. However, implementing MM is not easy at all!

```
In [16]:
```import numpy as np
def matmul(a, b):
n = a.shape[0]
k = a.shape[1]
m = b.shape[1]
c = np.zeros((n, m))
for i in xrange(n):
for j in xrange(m):
for s in xrange(k):
c[i, j] += a[i, s] * b[s, j]

`load_ext cythonmagic`

makes it possible to use `%%cython`

magic for an automatic compilation of the Cython code.

Cython is a Python compiler with additional directives that allow to specify the type of some variables. In many cases it may lead to significant speedup.

For other possibilities you may look at the following comparison

```
In [13]:
```%load_ext cythonmagic

```
In [18]:
```%%cython
import numpy as np
def cython_matmul(double [:, :] a, double[:, :] b):
cdef int n = a.shape[0]
cdef int k = a.shape[1]
cdef int m = b.shape[1]
cdef int i
cdef int j
cdef int s
c = np.zeros((n, m))
cdef double[:, :] cview = c
for i in xrange(n):
for j in xrange(m):
for s in xrange(k):
c[i, j] += a[i, s] * b[s, j]
return c

Then we just time three different routines. Guess the answer.

```
In [19]:
```n = 100
a = np.random.randn(n, n)
b = np.random.randn(n, n)
%timeit c = matmul(a, b)
%timeit cf = cython_matmul(a, b)
%timeit c = np.dot(a, b)

```
```

Why it is so?

There are two important issues:

- Computers are more and more parallel (multicore, graphics processing units)
- The memory pyramid: there is a whole hierarchy of levels

- Data fits into the fast memory: Load all data, do computations
- Data does not fit into the fast memory: load data by chunks, do computations, load again

We need to reduce the number of read/write operations! This is typically achieved in efficient implementations of the BLAS libraries, one of which (Intel MKL) we now use.

The main difference is the number of operations vs. the number of data!

- BLAS-1: $\mathcal{O}(n)$ data, $\mathcal{O}(n)$ operations
- BLAS-2: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^2)$ operations
- BLAS-3: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^3)$ operations

**Remark**: a quest for $\mathcal{O}(n^2)$ matrix-by-matrix multiplication algorithm is not yet done.

Strassen gives $\mathcal{O}(n^{2.78...})$

World record $\mathcal{O}(n^{2.37})$ Reference

The constant is unfortunately too big to make it practical!

**illustration**)
$$
A = \begin{bmatrix}
A_{11} & A_{12} \\
A_{21} & A_{22}
\end{bmatrix}, \quad B = \begin{bmatrix}
B_{11} & B_{12} \\
B_{21} & B_{22}
\end{bmatrix}
$$

```
In [10]:
```## This demo requires Anaconda distribution to be installed
import mkl
import numpy as np
n = 1000
a = np.random.randn(n)
mkl.set_num_threads(1)
%timeit a + a
mkl.set_num_threads(2)
%timeit a + a

```
```

```
In [11]:
```## This demo requires Anaconda distribution to be installed
import mkl
n = 500
a = np.random.randn(n, n)
mkl.set_num_threads(1)
%timeit a.dot(a)
mkl.set_num_threads(2)
%timeit a.dot(a)

```
```

Typically, two cases are distinguished:

- Shared memory (i.e., multicore on every desktop/smartphone)
- Distributed memory (i.e. each processor has its own memory, can send information through a network)

**bandwidth**:

I.e., for BLAS-1 routines (like sum of two vectors) reads/writes take all the time.

For BLAS-3 routines, the speedup can be obtained that is more noticable.

For large-scale clusters (>100 000 cores, see the Top500 list) there is still scaling.

```
In [64]:
```from IPython.core.display import HTML
def css_styling():
styles = open("./styles/custom.css", "r").read()
return HTML(styles)
css_styling()

```
Out[64]:
```

```
In [ ]:
```