Linear Algebra

Version: 1.2, September 2020

Download: This and various other Jupyter notebooks are available from my GitHub repo.

License: Creative Commons Attribution-ShareAlike 4.0 International License (CA BY-SA 4.0)

This is a tutorial related to the L665 course on Machine Learning for NLP focusing on Deep Learning, Spring 2018 at Indiana University.

The following material is based on Linear Algebra Review and Reference by Zico Kolter (updated by Chuong Do) from September 30, 2015. This means, many passages are literally copied, many are rewritten. I do not mark sections that are added or different. Consider this notebook a extended annotation of Kolter's (and Do's) notes. See also James E. Gentle (2017) Matrix Algebra: Theory, Computations and Applications in Statistics. Second edition. Springer. Another good resource is Philip N. Klein (2013) Coding the Matrix: Linear Algebra through Applications to Computer Science, Newtonian Press.

Basic Concepts and Notation

The following system of equations:

$\begin{equation} \begin{split} 4 x_1 - 5 x_2 & = -13 \\ -2x_1 + 3 x_2 & = 9 \end{split} \end{equation}$

We are looking for a unique solution for the two variables $x_1$ and $x_2$. The system can be described as:

$Ax = b$

as matrices:

$A = \begin{bmatrix} 4 & -5 \\[0.3em] -2 & 3 \end{bmatrix},\ b = \begin{bmatrix} -13 \\[0.3em] 9 \end{bmatrix}$

A scalar is an element in a vector, containing a real number value. In a vector space model or a vector mapping of (symbolic, qualitative, or quantitative) properties the scalar holds the concrete value or property of a variable.

A vector is an array, tuple, or ordered list of scalars (or elements) of size $n$, with $n$ a positive integer. The length of the vector, that is the number of scalars in the vector, is also called the order of the vector.

Vectorization is the process of creating a vector from some data using some process.

Vectors of the length $n$ could be treated like points in $n$-dimensional space. One can calculate the distance between such points using measures like Euclidean Distance. The similarity of vectors could also be calculated using Cosine Similarity.

Notation

A matrix is a list of vectors that all are of the same length. $A$ is a matrix with $m$ rows and $n$ columns, entries of $A$ are real numbers:

$A \in \mathbb{R}^{m \times n}$

A vector $x$ with $n$ entries of real numbers, could also be thought of as a matrix with $n$ rows and $1$ column, or as known as a column vector.

$x = \begin{bmatrix} x_1 \\[0.3em] x_2 \\[0.3em] \vdots \\[0.3em] x_n \end{bmatrix}$

Representing a row vector, that is a matrix with $1$ row and $n$ columns, we write $x^T$ (this denotes the transpose of $x$, see above).

$x^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}$

We use the notation $a_{ij}$ (or $A_{ij}$, $A_{i,j}$, etc.) to denote the entry of $A$ in the $i$th row and $j$th column:

$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\[0.3em] a_{21} & a_{22} & \cdots & a_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$

We denote the $j$th column of $A$ by $a_j$ or $A_{:,j}$:

$A = \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] a_{1} & a_{2} & \cdots & a_{n} \\[0.3em] \big| & \big| & & \big| \end{bmatrix}$

We denote the $i$th row of $A$ by $a_i^T$ or $A_{i,:}$:

$A = \begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix}$

A $n \times m$ matrix is a two-dimensional array with $n$ rows and $m$ columns.

Matrix Multiplication

The result of the multiplication of two matrixes $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ is the matrix:

$C = AB \in \mathbb{R}^{m \times n}$

That is, we are multiplying the columns of $A$ with the rows of $B$:

$C_{ij}=\sum_{k=1}^n{A_{ij}B_{kj}}$

The number of columns in $A$ must be equal to the number of rows in $B$.

Vector-Vector Products

Inner or Dot Product of Two Vectors

For two vectors $x, y \in \mathbb{R}^n$, the inner product or dot product $x^T y$ is a real number:

$x^T y \in \mathbb{R} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\[0.3em] y_2 \\[0.3em] \vdots \\[0.3em] y_n \end{bmatrix} = \sum_{i=1}^{n}{x_i y_i}$

The inner products are a special case of matrix multiplication.

It is always the case that $x^T y = y^T x$.

Example

To calculate the inner product of two vectors $x = [1 2 3 4]$ and $y = [5 6 7 8]$, we can loop through the vector and multiply and sum the scalars (this is simplified code):



In [1]:

    
x = (1, 2, 3, 4)
y = (5, 6, 7, 8)
n = len(x)
if n == len(y):
    result = 0
    for i in range(n):
        result += x[i] * y[i]
    print(result)

It is clear that in the code above we could change line 7 to result += y[i] * x[i] without affecting the result.

We can use the numpy module to apply the same operation, to calculate the inner product. We import the numpy module and assign it a name np for the following code:



In [ ]:

    
import numpy as np

We define the vectors $x$ and $y$ using numpy:



In [ ]:

    
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
print("x:", x)
print("y:", y)

We can now calculate the $dot$ or $inner product$ using the dot function of numpy:



In [ ]:

    
np.dot(x, y)

The order of the arguments is irrelevant:



In [ ]:

    
np.dot(y, x)

Note that both vectors are actually row vectors in the above code. We can transpose them to column vectors by using the shape property:



In [ ]:

    
print("x:", x)
x.shape = (4, 1)
print("xT:", x)
print("y:", y)
y.shape = (4, 1)
print("yT:", y)

In fact, in our understanding of Linear Algebra, we take the arrays above to represent row vectors. Numpy treates them differently.

We see the issues when we try to transform the array objects. Usually, we can transform a row vector into a column vector in numpy by using the T method on vector or matrix objects:



In [ ]:

    
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
print("x:", x)
print("y:", y)
print("xT:", x.T)
print("yT:", y.T)

The problem here is that this does not do, what we expect it to do. It only works, if we declare the variables not to be arrays of numbers, but in fact a matrix:



In [ ]:

    
x = np.array([[1, 2, 3, 4]])
y = np.array([[5, 6, 7, 8]])
print("x:", x)
print("y:", y)
print("xT:", x.T)
print("yT:", y.T)

Note that the numpy functions dot and outer are not affected by this distinction. We can compute the dot product using the mathematical equation above in numpy using the new $x$ and $y$ row vectors:



In [ ]:

    
print("x:", x)
print("y:", y.T)
np.dot(x, y.T)

Or by reverting to:



In [ ]:

    
print("x:", x.T)
print("y:", y)
np.dot(y, x.T)

To read the result from this array of arrays, we would need to access the value this way:



In [ ]:

    
np.dot(y, x.T)[0][0]

Outer Product of Two Vectors

For two vectors $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, where $n$ and $m$ do not have to be equal, the outer product of $x$ and $y$ is:

$xy^T \in \mathbb{R}^{m\times n}$

The outer product results in a matrix with $m$ rows and $n$ columns by $(xy^T)_{ij} = x_i y_j$:

$xy^T \in \mathbb{R}^{m\times n} = \begin{bmatrix} x_1 \\[0.3em] x_2 \\[0.3em] \vdots \\[0.3em] x_n \end{bmatrix} \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix} = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\[0.3em] x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] x_m y_1 & x_m y_2 & \cdots & x_m y_n \\[0.3em] \end{bmatrix}$

Some useful property of the outer product: assume $\mathbf{1} \in \mathbb{R}^n$ is an $n$-dimensional vector of scalars with the value $1$. Given a matrix $A \in \mathbb{R}^{m\times n}$ with all columns equal to some vector $x \in \mathbb{R}^m$, using the outer product $A$ can be represented as:

$A = \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] x & x & \cdots & x \\[0.3em] \big| & \big| & & \big| \end{bmatrix} = \begin{bmatrix} x_1 & x_1 & \cdots & x_1 \\[0.3em] x_2 & x_2 & \cdots & x_2 \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] x_m &x_m & \cdots & x_m \end{bmatrix} = \begin{bmatrix} x_1 \\[0.3em] x_2 \\[0.3em] \vdots \\[0.3em] x_m \end{bmatrix} \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix} = x \mathbf{1}^T$

Example

If we want to compute the outer product of two vectors $x$ and $y$, we need to transpose the row vector $x$ to a column vector $x^T$. This can be achieved by the reshape function in numpy, the T method, or the transpose() function. The reshape function takes a parameter that describes the number of colums and rows for the resulting transposing:



In [ ]:

    
x = np.array([[1, 2, 3, 4]])
print("x:", x)
print("xT:", np.reshape(x, (4, 1)))
print("xT:", x.T)
print("xT:", x.transpose())

We can now compute the outer product by multiplying the column vector $x$ with the row vector $y$:



In [ ]:

    
x = np.array([[1, 2, 3, 4]])
y = np.array([[5, 6, 7, 8]])
x.T * y

Numpy provides an outer function that does all that:



In [ ]:

    
np.outer(x, y)

Note, in this simple case using the simple arrays for the data structures of the vectors does not affect the result of the outer function:



In [ ]:

    
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
np.outer(x, y)

Matrix-Vector Products

Assume a matrix $A \in \mathbb{R}^{m\times n}$ and a vector $x \in \mathbb{R}^n$ the product results in a vector $y = Ax \in \mathbb{R}^m$.

$Ax$ could be expressed as the dot product of row $i$ of matrix $A$ with the column value $j$ of vector $x$. Let us first consider matrix multiplication with a scalar:

$A = \begin{bmatrix} 1 & 2 \\[0.3em] 3 & 4 \end{bmatrix}$

We can compute the product of $A$ with a scalar $n = 2$ as:

$A = \begin{bmatrix} 1 * n & 2 * n \\[0.3em] 3 * n & 4 * n \end{bmatrix} = \begin{bmatrix} 1 * 2 & 2 * 2 \\[0.3em] 3 * 2 & 4 * 2 \end{bmatrix} = \begin{bmatrix} 2 & 4 \\[0.3em] 6 & 8 \end{bmatrix} $

Using numpy this can be achieved by:



In [ ]:

    
import numpy as np
A = np.array([[4, 5, 6],
             [7, 8, 9]])
A * 2

Assume that we have a column vector $x$:

$x = \begin{bmatrix} 1 \\[0.3em] 2 \\[0.3em] 3 \end{bmatrix}$

To be able to multiply this vector with a matrix, the number of columns in the matrix must correspond to the number of rows in the column vector. The matrix $A$ must have $3$ columns, as for example:

$A = \begin{bmatrix} 4 & 5 & 6\\[0.3em] 7 & 8 & 9 \end{bmatrix}$

To compute $Ax$, we multiply row $1$ of the matrix with column $1$ of $x$:

$\begin{bmatrix} 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 \\[0.3em] 2 \\[0.3em] 3 \end{bmatrix} = 4 * 1 + 5 * 2 + 6 * 3 = 32 $

We do the compute the dot product of row $2$ of $A$ and column $1$ of $x$:

$\begin{bmatrix} 7 & 8 & 9 \end{bmatrix} \begin{bmatrix} 1 \\[0.3em] 2 \\[0.3em] 3 \end{bmatrix} = 7 * 1 + 8 * 2 + 9 * 3 = 50 $

The resulting column vector $Ax$ is:

$Ax = \begin{bmatrix} 32 \\[0.3em] 50 \end{bmatrix}$

Using numpy we can compute $Ax$:



In [ ]:

    
A = np.array([[4, 5, 6],
             [7, 8, 9]])
x = np.array([1, 2, 3])
A.dot(x)

We can thus describe the product writing $A$ by rows as:

$y = Ax = \begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix} x = \begin{bmatrix} a_1^T x \\[0.3em] a_2^T x \\[0.3em] \vdots \\[0.3em] a_m^T x \end{bmatrix}$

This means that the $i$th scalar of $y$ is the inner product of the $i$th row of $A$ and $x$, that is $y_i = a_i^T x$.

If we write $A$ in column form, then:

$y = Ax = \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] a_1 & a_2 & \cdots & a_n \\[0.3em] \big| & \big| & & \big| \end{bmatrix} \begin{bmatrix} x_1 \\[0.3em] x_2 \\[0.3em] \vdots \\[0.3em] x_n \end{bmatrix} = \begin{bmatrix} a_1 \end{bmatrix} x_1 + \begin{bmatrix} a_2 \end{bmatrix} x_2 + \dots + \begin{bmatrix} a_n \end{bmatrix} x_n $

In this case $y$ is a linear combination of the columns of $A$, the coefficients taken from $x$.

The above examples multiply be the right with a column vector. One can multiply on the left by a row vector as well, $y^T = x^T A$ for $A \in \mathbb{R}^{m\times n}$, $x\in \mathbb{R}^m$, $y \in \mathbb{R}^n$. There are two ways to express $y^T$, with $A$ expressed by its columns, with $i$th scalar of $y^T$ corresponds to the inner product of $x$ and the $i$th column of $A$:

$y^T = x^T A = x^t \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] a_1 & a_2 & \cdots & a_n \\[0.3em] \big| & \big| & & \big| \end{bmatrix} = \begin{bmatrix} x^T a_1 & x^T a_2 & \dots & x^T a_n \end{bmatrix}$

One can express $A$ by rows, where $y^T$ is a linear combination of the rows of $A$ with the scalars from $x$.

$\begin{equation} \begin{split} y^T & = x^T A \\ & = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} \begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix} \\ & = x_1 \begin{bmatrix}-- & a_1^T & --\end{bmatrix} + x_2 \begin{bmatrix}-- & a_2^T & --\end{bmatrix} + \dots + x_n \begin{bmatrix}-- & a_n^T & --\end{bmatrix} \end{split} \end{equation}$

Matrix-Matrix Products

One can view matrix-matrix multiplication $C = AB$ as a set of vector-vector products. The $(i,j)$th entry of $C$ is the inner product of the $i$th row of $A$ and the $j$th column of $B$:

$C = AB = \begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix} \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] b_1 & b_2 & \cdots & b_p \\[0.3em] \big| & \big| & & \big| \end{bmatrix} = \begin{bmatrix} a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_p \\[0.3em] a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_p \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_m^T b_1 & a_m^T b_2 & \cdots & a_m^T b_p \end{bmatrix}$

Here $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times p}$, $a_i \in \mathbb{R}^n$ and $b_j \in \mathbb{R}^n$, and $A$ is represented by rows, $B$ by columns.

If we represent $A$ by columns and $B$ by rows, then $AB$ is the sum of the outer products:

$C = AB = \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] a_1 & a_2 & \cdots & a_n \\[0.3em] \big| & \big| & & \big| \end{bmatrix} \begin{bmatrix} -- & b_1^T & -- \\[0.3em] -- & b_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & b_n^T & -- \end{bmatrix} = \sum_{i=1}^n a_i b_i^T $

This means that $AB$ is the sum over all $i$ of the outer product of the $i$th column of $A$ and the $i$th row of $B$.

One can interpret matrix-matrix operations also as a set of matrix-vector products. Representing $B$ by columns, the columns of $C$ are matrix-vector products between $A$ and the columns of $B$:

$C = AB = A \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] b_1 & b_2 & \cdots & b_p \\[0.3em] \big| & \big| & & \big| \end{bmatrix} = \begin{bmatrix} \big| & \big| & & \big| \\[0.3em] A b_1 & A b_2 & \cdots & A b_p \\[0.3em] \big| & \big| & & \big| \end{bmatrix} $

In this interpretation the $i$th column of $C$ is the matrix-vector product with the vector on the right, i.e. $c_i = A b_i$.

Representing $A$ by rows, the rows of $C$ are the matrix-vector products between the rows of $A$ and $B$:

$C = AB = \begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix} B = \begin{bmatrix} -- & a_1^T B & -- \\[0.3em] -- & a_2^T B & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_n^T B & -- \end{bmatrix}$

The $i$th row of $C$ is the matrix-vector product with the vector on the left, i.e. $c_i^T = a_i^T B$.

Notes on Matrix-Matrix Products

Matrix multiplication is associative: $(AB)C = A(BC)$

Matrix multiplication is distributive: $A(B + C) = AB + AC$

Matrix multiplication is, in general, not commutative; It can be the case that $AB \neq BA$. (For example, if $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times q}$, the matrix product $BA$ does not even exist if $m$ and $q$ are not equal!)

Identity Matrix

The identity matrix $I \in \mathbb{R}^{n\times n}$ is a square matrix with the value $1$ on the diagonal and $0$ everywhere else:

$I_{ij} = \left\{ \begin{array}{lr} 1 & i = j\\ 0 & i \neq j \end{array} \right. $

For all $A \in \mathbb{R}^{m\times n}$:

$AI = A = IA$

In the equation above multiplication has to be made possible, which means that in the portion $AI = A$ the dimensions of $I$ have to be $n\times n$, while in $A = IA$ they have to be $m\times m$.

We can generate an identity matrix in numpy using:



In [ ]:

    
import numpy as np
A = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8],
              [9, 10, 11]])
print("A:", A)

We can ask for the shape of $A$:



In [ ]:

    
A.shape

The shape property of a matrix contains the $m$ (number of rows) and $n$ (number of columns) properties in a tuple, in that particular order. We can create an identity matrix for the use in $AI$ by using the $n$ value:



In [ ]:

    
np.identity(A.shape[1], dtype="int")

Note that we specify the dtype parameter to identity as int, since the default would return a matrix of float values.

To generate an identity matrix for the use in $IA$ we would use the $m$ value:



In [ ]:

    
np.identity(A.shape[0], dtype="int")

We can compute the dot product of $A$ and its identity matrix $I$:



In [ ]:

    
n = A.shape[1]
I = np.array(np.identity(n, dtype="int"))
np.dot(A, I)

The same is true for the other direction:



In [ ]:

    
m = A.shape[0]
I = np.array(np.identity(m, dtype="int"))
np.dot(I, A)

Diagonal Matrix

In the diagonal matrix non-diagonal elements are $0$, that is $D = diag(d_1, d_2, \dots{}, d_n)$, with:

$D_{ij} = \left\{ \begin{array}{lr} d_i & i = j\\ 0 & i \neq j \end{array} \right. $

The identity matrix is a special case of a diagonal matrix: $I = diag(1, 1, \dots{}, 1)$.

In numpy we can create a diagonal matrix from any given matrix using the diag function:



In [ ]:

    
import numpy as np
A = np.array([[0,   1,  2,  3],
              [4,   5,  6,  7],
              [8,   9, 10, 11],
              [12, 13, 14, 15]])
np.diag(A)

An optional parameter k to the diag function allows us to extract the diagonal above the main diagonal with a positive k, and below the main diagonal with a negative k:



In [ ]:

    
np.diag(A, k=1)



In [ ]:

    
np.diag(A, k=-1)

Transpose of a Matrix

Transposing a matrix is achieved by flipping the rows and columns. For a matrix $A \in \mathbb{R}^{m\times n}$ the transpose $A^T \in \mathbb{R}^{n\times m}$ is the $n\times m$ matrix given by:

$(A^T)_{ij} = A_{ji}$

Properties of transposes:

$(A^T)^T = A$
$(AB)^T = B^T A^T$
$(A+B)^T = A^T + B^T$

Symmetric Metrices

Square metrices $A \in \mathbb{R}^{n\times n}$ are symmetric, if $A = A^T$.

$A$ is anti-symmetric, if $A = -A^T$.

For any matrix $A \in \mathbb{R}^{n\times n}$, the matrix $A + A^T$ is symmetric.

For any matrix $A \in \mathbb{R}^{n\times n}$, the matrix $A - A^T$ is anti-symmetric.

Thus, any square matrix $A \in \mathbb{R}^{n\times n}$ can be represented as a sum of a symmetric matrix and an anti-symmetric matrix:

$A = \frac{1}{2} (A + A^T) + \frac{1}{2} (A - A^T)$

The first matrix on the right, i.e. $\frac{1}{2} (A + A^T)$ is symmetric. The second matrix $\frac{1}{2} (A - A^T)$ is anti-symmetric.

$\mathbb{S}^n$ is the set of all symmetric matrices of size $n$.

$A \in \mathbb{S}^n$ means that $A$ is symmetric and of the size $n\times n$.

The Trace

The trace of a square matrix $A \in \mathbb{R}^{n\times n}$ is $tr(A)$ (or $trA$) is the sum of the diagonal elements in the matrix:

$trA = \sum_{i=1}^n A_{ii}$

Properties of the trace:

For $A \in \mathbb{R}^{n\times n}$, $\mathrm{tr}A = \mathrm{tr}A^T$
For $A,B \in \mathbb{R}^{n\times n}$, $\mathrm{tr}(A + B) = \mathrm{tr}A + \mathrm{tr}B$
For $A \in \mathbb{R}^{n\times n}$, $t \in \mathbb{R}$, $\mathrm{tr}(tA) = t \mathrm{tr}A$
For $A,B$ such that $AB$ is square, $\mathrm{tr}AB = \mathrm{tr}BA$
For $A,B,C$ such that $ABC$ is square, $\mathrm{tr}ABC = \mathrm{tr}BCA = \mathrm{tr}CAB$, and so on for the product of more matrices.

See for proofs the paper!

Norms

The norm of a vector $x$ is $\| x\|$, informally the length of a vector.

Example: the Euclidean or $\mathscr{l}_2$ norm:

$\|x\|_2 = \sqrt{\sum_{i=1}^n{x_i^2}}$

Note: $\|x\|_2^2 = x^T x$

A norm is any function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ that satisfies the following properties:

For all $x \in \mathbb{R}^n$, $f(x) \geq 0$ (non-negativity)
$f(x) = 0$ if and only if $x = 0$ (definiteness)
For all $x \in \mathbb{R}^n$, $t \in \mathbb{R}$, $f(tx) = |t|\ f(x)$ (homogeneity)
For all $x, y \in \mathbb{R}^n$, $f(x + y) \leq f(x) + f(y)$ (triangle inequality)

Norm $\mathscr{l}_1$:

$\|x\|_1 = \sum_{i=1}^n{|x_i|}$

Norm $\mathscr{l}_\infty$:

$\|x\|_\infty = \max_i|x_i|$

All these three norms are examples of the $\mathscr{l}_p$ norms, with $p$ a real number parameter $p \geq 1$:

$\|x\|_p = \left(\sum_{i=1}^n{|x_i|^p}\right)^{\frac{1}{p}}$

Frobenius norm for matrices:

$\|A\|_F = \sqrt{\sum_{i=1}^m\sum_{i=1}^n A_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)}$

And many more.

Linear Independence and Rank

A set of vectors $\{x_1, x_2, \dots{}, x_n\} \subset \mathbb{R}^m$ is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors.

A set of vectors $\{x_1, x_2, \dots{}, x_n\} \subset \mathbb{R}^m$ is said to be (lineraly) dependent if one vector from this set can be represented as a linear combination of the remaining vectors.

For some scalar values $\alpha_1, \dots{}, \alpha_{n-1} \in \mathbb{R}$ the vectors $x_1, \dots{}, x_n$ are linerly dependent, if:

$\begin{equation} x_n = \sum_{i=1}^{n-1}{\alpha_i x_i} \end{equation}$

Example: The following vectors are lineraly dependent, because $x_3 = -2 x_1 + x_2$

$x_1 = \begin{bmatrix} 1 \\[0.3em] 2 \\[0.3em] 3 \end{bmatrix} \quad x_2 = \begin{bmatrix} 4 \\[0.3em] 1 \\[0.3em] 5 \end{bmatrix} \quad x_3 = \begin{bmatrix} 2 \\[0.3em] -1 \\[0.3em] -1 \end{bmatrix} $

Column Rank of a Matrix

The column rank of a matrix $A \in \mathbb{R}^{m\times n}$ is the size of the largest subset of columns of $A$ that constitute a linear independent set. Informaly this is the number of linearly independent columns of $A$.

Row Rank of a Matrix

The row rank of a matrix $A \in \mathbb{R}^{m\times n}$ is the largest number of rows of $A$ that constitute a lineraly independent set.

Rank of a Matrix

For any matrix $A \in \mathbb{R}^{m\times n}$, the column rank of $A$ is equal to the row rank of $A$. Both quantities are referred to collectively as the rank of $A$, denoted as $rank(A)$. Here are some basic properties of the rank:

For $A \in \mathbb{R}^{m\times n}$, $rank(A) \leq \min(m, n)$. If $rank(A) = \min(m, n)$, then $A$ is said to be full rank.
For $A \in \mathbb{R}^{m\times n}$, $rank(A) = rank(A^T)$
For $A \in \mathbb{R}^{m\times n}$, $B \in \mathbb{R}^{n\times p}$, $rank(AB) \leq \min(rank(A), rank(B))$
For $A,B \in \mathbb{R}^{m\times n}$, $rank(A + B) \leq rank(A) + rank(B)$

Subtraction and Addition of Metrices

Assume $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{m\times n}$, that is $A$ and $B$ are of the same size, to add $A$ to $B$, or to subtract $B$ from $A$, we add or subtract corresponding entries:

$A + B = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\[0.3em] a_{21} & a_{22} & \cdots & a_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\[0.3em] b_{21} & b_{22} & \cdots & b_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] b_{m1} & b_{m2} & \cdots & b_{mn} \end{bmatrix} = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1n} + b_{1n} \\[0.3em] a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2n} + b_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_{m1} + b_{m1} & a_{m2} + b_{m2} & \cdots & a_{mn} + b_{mn} \end{bmatrix} $

The same is applies to subtraction:

$A - B = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\[0.3em] a_{21} & a_{22} & \cdots & a_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} - \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\[0.3em] b_{21} & b_{22} & \cdots & b_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] b_{m1} & b_{m2} & \cdots & b_{mn} \end{bmatrix} = \begin{bmatrix} a_{11} - b_{11} & a_{12} - b_{12} & \cdots & a_{1n} - b_{1n} \\[0.3em] a_{21} - b_{21} & a_{22} - b_{22} & \cdots & a_{2n} - b_{2n} \\[0.3em] \vdots & \vdots & \ddots & \vdots \\[0.3em] a_{m1} - b_{m1} & a_{m2} - b_{m2} & \cdots & a_{mn} - b_{mn} \end{bmatrix} $

In Python using numpy this can be achieved using the following code:



In [ ]:

    
import numpy as np
print("np.arange(9):", np.arange(9))
print("np.arange(9, 18):", np.arange(9, 18))
A = np.arange(9, 18).reshape((3, 3))
B = np.arange(9).reshape((3, 3))
print("A:", A)
print("B:", B)

The numpy function arange is similar to the standard Python function range. It returns an array with $n$ elements, specified in the one parameter version only. If we provide to parameters to arange, it generates an array starting from the value of the first parameter and ending with a value one less than the second parameter. The function reshape returns us a matrix with the corresponding number of rows and columns.

We can now add and subtract the two matrices $A$ and $B$:



In [ ]:

    
A + B



In [ ]:

    
A - B

Inverse

The inverse of a square matrix $A \in \mathbb{R}^{n\times n}$ is $A^{-1}$:

$A^{-1} A = I = A A^{-1}$

Not all matrices have inverses. Non-square matrices do not have inverses by definition. For some square matrices $A$ the inverse might not exist.

$A$ is invertible or non-singular if $A^{-1}$ exists.

$A$ is non-invertible or singular if $A^{-1}$ does not exist.

Note: **non-singular** means the opposite of **non-invertible**!

For $A$ to have an inverse $A^{-1}$, $A$ must be full rank.

Assuming that $A,B \in \mathbb{R}^{n\times n}$ are non-singular, then:

$(A^{-1})^{-1} = A$
$(AB)^{-1} = B^{-1} A^{-1}$
$(A^{-1})^T = (A^T)^{-1}$ (often simply $A^{-T}$)

Orthogonal Matrices

Two vectors $x, y \in \mathbb{R}^n$ are orthogonal if $x^T y = 0$.

A vector $x \in \mathbb{R}^n$ is normalized if $\|x\|^2 = 1$.

A square matrix $U \in \mathbb{R}^{n\times n}$ is orthogonal if all its columns are orthogonal to each other and are normalized. The columns are then referred to as being orthonormal.

It follows immediately from the definition of orthogonality and normality that:

$U^T U = I = U U^T$

This means that the inverse of an orthogonal matrix is its transpose.

If U is not square - i.e., $U \in \mathbb{R}^{m\times n}$, $n < m$ - but its columns are still orthonormal, then $U^T U = I$, but $U U^T \neq I$.

We generally only use the term orthogonal to describe the case, where $U$ is square.

Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will not change its Euclidean norm. For any $x \in \mathbb{R}^n$, $U \in \mathbb{R}^{n\times n}$ orthogonal.

$\|U_x\|^2 = \|x\|^2$

Range and Nullspace of a Matrix

The span of a set of vectors $\{ x_1, x_2, \dots{}, x_n\}$ is the set of all vectors that can be expressed as a linear combination of $\{ x_1, \dots{}, x_n \}$:

$\mathrm{span}(\{ x_1, \dots{}, x_n \}) = \{ v : v = \sum_{i=1}^n \alpha_i x_i, \alpha_i \in \mathbb{R} \}$

It can be shown that if $\{ x_1, \dots{}, x_n \}$ is a set of n linearly independent vectors, where each $x_i \in \mathbb{R}^n$, then $\mathrm{span}(\{ x_1, \dots{}, x_n\}) = \mathbb{R}^n$. That is, any vector $v \in \mathbb{R}^n$ can be written as a linear combination of $x_1$ through $x_n$.

The projection of a vector $y \in \mathbb{R}^m$ onto the span of $\{ x_1, \dots{}, x_n\}$ (here we assume $x_i \in \mathbb{R}^m$) is the vector $v \in \mathrm{span}(\{ x_1, \dots{}, x_n \})$, such that $v$ is as close as possible to $y$, as measured by the Euclidean norm $\|v − y\|^2$. We denote the projection as $\mathrm{Proj}(y; \{ x_1, \dots{}, x_n \})$ and can define it formally as:

$\mathrm{Proj}( y; \{ x_1, \dots{}, x_n \}) = \mathrm{argmin}_{v\in \mathrm{span}(\{x_1,\dots{},x_n\})}\|y − v\|^2$

The range (sometimes also called the columnspace) of a matrix $A \in \mathbb{R}^{m\times n}$, denoted $\mathcal{R}(A)$, is the the span of the columns of $A$. In other words,

$\mathcal{R}(A) = \{ v \in \mathbb{R}^m : v = A x, x \in \mathbb{R}^n\}$

Making a few technical assumptions (namely that $A$ is full rank and that $n < m$), the projection of a vector $y \in \mathbb{R}^m$ onto the range of $A$ is given by:

$\mathrm{Proj}(y; A) = \mathrm{argmin}_{v\in \mathcal{R}(A)}\|v − y\|^2 = A(A^T A)^{−1} A^T y$

See for more details in the notes page 13.

The nullspace of a matrix $A \in \mathbb{R}^{m\times n}$, denoted $\mathcal{N}(A)$ is the set of all vectors that equal $0$ when multiplied by $A$, i.e.,

$\mathcal{N}(A) = \{ x \in \mathbb{R}^n : A x = 0 \}$

Note that vectors in $\mathcal{R}(A)$ are of size $m$, while vectors in the $\mathcal{N}(A)$ are of size $n$, so vectors in $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are both in $\mathbb{R}^n$. In fact, we can say much more. It turns out that:

$\{ w : w = u + v, u \in \mathcal{R}(A^T), v \in \mathcal{N}(A) \} = \mathbb{R}^n$ and $\mathcal{R}(A^T) \cap \mathcal{N}(A) = \{0\}$

In other words, $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are disjoint subsets that together span the entire space of $\mathbb{R}^n$. Sets of this type are called orthogonal complements, and we denote this $\mathcal{R}(A^T) = \mathcal{N}(A)^\perp$.

Determinant

The determinant of a square matrix $A \in \mathbb{R}^{n\times n}$, is a function $\mathrm{det} : \mathbb{R}^{n\times n} \rightarrow \mathbb{R}$, and is denoted $|A|$ or $\mathrm{det}A$ (like the trace operator, we usually omit parentheses).

A geometric interpretation of the determinant

Given

$\begin{bmatrix} -- & a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_n^T & -- \end{bmatrix}$

consider the set of points $S \subset \mathbb{R}^n$ formed by taking all possible linear combinations of the row vectors $a_1, \dots{}, a_n \in \mathbb{R}^n$ of $A$, where the coefficients of the linear combination are all between $0$ and $1$; that is, the set $S$ is the restriction of $\mathrm{span}( \{ a_1, \dots{}, a_n \})$ to only those linear combinations whose coefficients $\alpha_1, \dots{}, \alpha_n$ satisfy $0 \leq \alpha_i \leq 1$, $i = 1, \dots{}, n$. Formally:

$S = \{v \in \mathbb{R}^n : v = \sum_{i=1}^n \alpha_i a_i \mbox{ where } 0 \leq \alpha_i \leq 1, i = 1, \dots{}, n \}$

The absolute value of the determinant of $A$, it turns out, is a measure of the volume of the set $S$. The volume here is intuitively for example for $n = 2$ the area of $S$ in the Cartesian plane, or with $n = 3$ it is the common understanding of volume for 3-dimensional objects.

Example:

$A = \begin{bmatrix} 1 & 3\\[0.3em] 3 & 2 \end{bmatrix}$

The rows of the matrix are:

$a_1 = \begin{bmatrix} 1 \\[0.3em] 3 \end{bmatrix} \quad a_2 = \begin{bmatrix} 3 \\[0.3em] 2 \end{bmatrix}$

The set S corresponding to these rows is shown in:

The figure above is an illustration of the determinant for the $2\times 2$ matrix $A$ above. Here, $a_1$ and $a_2$ are vectors corresponding to the rows of $A$, and the set $S$ corresponds to the shaded region (i.e., the parallelogram). The absolute value of the determinant, $|\mathrm{det}A| = 7$, is the area of the parallelogram.

For two-dimensional matrices, $S$ generally has the shape of a parallelogram. In our example, the value of the determinant is $|A| = −7$ (as can be computed using the formulas shown later), so the area of the parallelogram is $7$.

In three dimensions, the set $S$ corresponds to an object known as a parallelepiped (a three-dimensional box with skewed sides, such that every face has the shape of a parallelogram). The absolute value of the determinant of the $3 \times 3$ matrix whose rows define $S$ give the three-dimensional volume of the parallelepiped. In even higher dimensions, the set $S$ is an object known as an $n$-dimensional parallelotope.

Algebraically, the determinant satisfies the following three properties (from which all other properties follow, including the general formula):

The determinant of the identity is $1$, $|I| = 1$. (Geometrically, the volume of a unit hypercube is $1$).
Given a matrix $A \in \mathbb{R}^{n\times n}$, if we multiply a single row in $A$ by a scalar $t \in \mathbb{R}$, then the determinant of the new matrix is $t|A|$,
$\left| \begin{bmatrix} -- & t a_1^T & -- \\[0.3em] -- & a_2^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix}\right| = t|A|$
(Geometrically, multiplying one of the sides of the set $S$ by a factor $t$ causes the volume to increase by a factor $t$.)
If we exchange any two rows $a^T_i$ and $a^T_j$ of $A$, then the determinant of the new matrix is $−|A|$, for example
$\left| \begin{bmatrix} -- & a_2^T & -- \\[0.3em] -- & a_1^T & -- \\[0.3em] & \vdots & \\[0.3em] -- & a_m^T & -- \end{bmatrix}\right| = -|A|$

Several properties that follow from the three properties above include:

For $A \in \mathbb{R}^{n\times n}$, $|A| = |A^T|$
For $A,B \in \mathbb{R}^{n\times n}$, $|AB| = |A||B|$
For $A \in \mathbb{R}^{n\times n}$, $|A| = 0$ if and only if $A$ is singular (i.e., non-invertible). (If $A$ is singular then it does not have full rank, and hence its columns are linearly dependent. In this case, the set $S$ corresponds to a "flat sheet" within the $n$-dimensional space and hence has zero volume.)
For $A \in \mathbb{R}^{n\times n}$ and $A$ non-singular, $|A−1| = 1/|A|$

Continue on page 16.

Tensors

A tensor could be thought of as an organized multidimensional array of numerical values. A vector could be assumed to be a sub-class of a tensor. Rows of tensors extend alone the y-axis, columns along the x-axis. The rank of a scalar is 0, the rank of a vector is 1, the rank of a matrix is 2, the rank of a tensor is 3 or higher.

Hyperplane

The hyperplane is a sub-space in the ambient space with one dimension less. In a two-dimensional space the hyperplane is a line, in a three-dimensional space it is a two-dimensional plane, etc.

Hyperplanes divide an $n$-dimensional space into sub-spaces that might represent clases in a machine learning algorithm.

Summary

Dot Product

This is also the inner product. It is a function that returns a number computed from two vectors of the same length by summing up the product of the corresponding dimensions.

For two vectors $a = [a_1, a_2, \dots{}, a_n]$ and $b = [b_1, b_2, \dots{}, b_n]$ the dot product is:

$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_{i} b_{i} = a_{1} b_{1} + a_{2} b_{2} + \cdots + a_{n} b_{n}$

If we normalize two vectors and compute the dot product, we get the cosine similarity, which can be used as a metric for cimilarity of vectors. Independent of the absolute length we look at the angle between the vectors, i.e. the lenght is neutralized via normalization.

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula (see Wikipedia: Cosine similarity):

$\mathbf{a} \cdot \mathbf{b} = \left\|\mathbf{a}\right\| \left\|\mathbf{b}\right\| \cos\theta$

Given two vectors of attributes, $A$ and $B$, the cosine similarity, $cos(\theta)$, is represented using a dot product and magnitude as:

$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{ \|\mathbf{A} \|\|\mathbf{B} \| } = \frac{\sum \limits_{i=1}^{n}{A_{i}B_{i}}}{{\sqrt {\sum \limits _{i=1}^{n}{A_{i}^{2}}}}{\sqrt {\sum \limits _{i=1}^{n}{B_{i}^{2}}}}}$, with $A_i$ and $B_i$ components of vector $A$ and $B$ respectively.

Hadamard Product

This is also known as the entrywise product. For two matrices $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{m\times n}$ the Hadamard product $A\circ B$ is:

$(A\circ B)_{i,j} = (A)_{i,j} (B)_{i,j}$

For example:

$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\[0.3em] a_{21} & a_{22} & a_{23} \\[0.3em] a_{31} & a_{32} & a_{33} \end{bmatrix} \circ \begin{bmatrix} b_{11} & b_{12} & b_{13} \\[0.3em] b_{21} & b_{22} & b_{23} \\[0.3em] b_{31} & b_{32} & b_{33} \end{bmatrix} = \begin{bmatrix} a_{11}b_{11} & a_{12}b_{12} & a_{13}b_{13} \\[0.3em] a_{21}b_{21} & a_{22}b_{22} & a_{23}b_{23} \\[0.3em] a_{31}b_{31} & a_{32}b_{32} & a_{33}b_{33} \end{bmatrix}$

Outer Product

This is also called the tensor product of two vectors. Compute the resulting matrix by multiplying each element from a column vector with all alements in a row vector.

...