
Import standard modules:


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import HTML 
HTML('../style/course.css') #apply general CSS

2.10 Linear Algebra

In the sections that follow we will be studying vectors and matrices. Vectors and matrices are used frequently in radio interferometry, so it is important to have some familiarity with them.

2.10.1 Vectors

Definition: An $n$-dimensional column vector ${\bf v } \in\mathbb{C}^{n\times 1}$ is an ordered collection of $n$ complex numbers, i.e. $v_1,v_2,\ldots,v_n$, stacked in the following manner:

\begin{equation} {\bf v } = \begin{pmatrix} v_1\\ v_2\\ \vdots\\ v_n \end{pmatrix}. \end{equation}

An $n$-dimensional row vector ${\bf u}^T \in\mathbb{C}^{1\times n}$ is an ordered collection of $n$ complex numbers, i.e. $u_1,u_2,\ldots,u_n$, stacked in the following manner:

\begin{equation} {\bf u }^T=(u_1,u_2,\ldots,u_n). \end{equation}

Note that, with this notation, a vector $\bf{v}$ is always assumed to be a column vector and we explicitly use a superscript $T$ to indicate that a row vector is the transpose of a vector. Thus it should be understood that the notation ${\bf{v}}\in\mathbb{C}^{n}$ denotes a column vector.
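In numpy the distinction between column and row vectors can be kept explicit by using two-dimensional arrays. The short snippet below is our own illustration (with arbitrary example values) of the $T$ and $H$ operations on a column vector:


In [ ]:
import numpy as np

# A column vector in C^(3x1), stored explicitly as a 3x1 array (arbitrary values)
v = np.array([[1 + 2j], [3 - 1j], [0 + 1j]])

v_T = v.T          # the transpose v^T is a 1x3 row vector
v_H = v.conj().T   # the Hermitian transpose also conjugates the entries

print(v.shape, v_T.shape, v_H.shape)  # (3, 1) (1, 3) (1, 3)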

2.10.1.1 Vector Products

Suppose that ${\bf{u}},{\bf{v}}\in \mathbb{C}^{n}$. The following vector products are often used in radio interferometry:

  1. Hermitian inner product:

    The Hermitian inner product of ${\bf{u}}$ and ${\bf{v}}$ is given by

    \begin{equation} {\bf{u}}\cdot{\bf{v}} = {\bf u}^H {\bf v} = \sum_{k=1}^n u_k^*v_k, \end{equation}

    where the superscript $H$ denotes the Hermitian transpose (see below). Recall that the inner product of a real vector ${\bf b} \in \mathbb{R}^n$ with a real unit vector $\hat{{\bf a}} \in \mathbb{R}^n$ (where $\hat{{\bf a}} = {\bf \frac{a}{|a|}}$) gives the component of ${\bf b}$ in the direction defined by $\hat{{\bf a}}$. The inner product on real vectors is sometimes referred to as the dot product and we say that it projects ${\bf b}$ onto $\hat{{\bf a}}$. Explicitly we have

    \begin{equation} {\bf b }\cdot{\bf \hat{a}} = |{\bf b}| \ |{\bf \hat{a}}| \cos(\kappa) = |{\bf b}|\cos(\kappa), \end{equation}

    where $|\cdot|$ denotes the magnitude of a vector and $\kappa$ is the acute angle between $\hat{{\bf a}}$ and ${\bf b }$. This is illustrated in Fig. 2.10.1 ⤵

  2. Outer Product:

    The outer product of ${\bf{u}}$ and ${\bf{v}}$ is given by

    \begin{equation} {\bf u} * {\bf v} = {\bf u }{\bf v}^H, \quad \rightarrow \quad {(\bf{u}}*{\bf{v})}_{ij} = u_iv_j^* \end{equation}

    We may also assign a geometric meaning to the outer product. In the case where both vectors are unit vectors viz. ${\bf \hat{u}}$ and ${\bf \hat{v}}$, we may think of the outer product as an operator which acts on an arbitrary vector ${\bf x}$ to produce

    \begin{equation} ({\bf \hat{u}} * {\bf \hat{v}}) {\bf x} = {\bf \hat{u} }{\bf \hat{v}}^H {\bf x} = {\bf \hat{u} }({\bf \hat{v}}\cdot {\bf x}). \end{equation}

    Thus this operator finds the component of ${\bf x}$ along the direction of ${\bf \hat{v}}$ and points it in the direction of ${\bf \hat{u}}$ (a short numerical check of this is given after the figure below).

**Figure 2.10.1**: Using the dot product to project ${\bf b}$ onto ${\bf \hat{a}}$.
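The geometric picture above is easy to check numerically. The following is a small sketch of our own (with arbitrary real vectors, so that conjugation plays no role) verifying that the outer-product operator $\hat{\bf u}\hat{\bf v}^H$ applied to ${\bf x}$ returns $\hat{\bf u}$ scaled by the component of ${\bf x}$ along $\hat{\bf v}$:


In [ ]:
# Arbitrary real example vectors
u = np.array([1., 2., 2.])
v = np.array([3., 0., 4.])
x = np.array([1., -1., 2.])

u_hat = u/np.linalg.norm(u)  # unit vector along u
v_hat = v/np.linalg.norm(v)  # unit vector along v

lhs = np.outer(u_hat, v_hat.conj()).dot(x)  # (u_hat * v_hat) acting on x
rhs = u_hat*np.vdot(v_hat, x)               # u_hat times the component of x along v_hat

print(np.allclose(lhs, rhs))  # True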

2.10.2 Matrices

Definition: A matrix ${\bf{A}}\in \mathbb{C}^{m\times n}$ is defined as an ordered rectangular array of complex numbers, i.e. \begin{equation} {\bf{A}} = \begin{pmatrix} a_{11}&a_{12}&\dots& a_{1n}\\ a_{21}&a_{22}&\dots &a_{2n}\\ \vdots&\vdots&\ddots &\vdots\\ a_{m1}&a_{m2}&\dots &a_{mn} \end{pmatrix} \end{equation}

If $m=n$, then ${\bf{A}}$ is a square matrix. Note that the first index is used to label the row number and the second the column number.

2.10.2.1 Basic Matrix Operations and Properties

  1. The transpose of ${\bf{A}}\in \mathbb{C}^{m\times n}$, denoted by ${\bf{A}}^T$ is given by \begin{equation} {\bf{A}}^T_{ij} = a_{ji}, \end{equation} i.e. the transpose operation interchanges the rows and columns of the matrix.

  2. The complex conjugate of ${\bf{A}}\in \mathbb{C}^{m\times n}$, denoted by ${\bf{A}}^*$ is given by \begin{equation} {\bf{A}}^*_{ij} = a_{ij}^*. \end{equation}

  3. The Hermitian transpose of ${\bf{A}}\in \mathbb{C}^{m\times n}$, denoted by ${\bf{A}}^H$ is given by \begin{equation} {\bf{A}}^H_{ij} = a_{ji}^*, \end{equation} i.e. the Hermitian transpose is the conjugate of the transposed matrix ${\bf A}^H = ({\bf A}^T)^*$.

  4. The vectorization of matrix ${\bf{A}}\in \mathbb{C}^{m\times n}$, denoted by vec$({\bf{A}})$, is the $mn \times 1$ column vector obtained by stacking the columns of the matrix ${\bf{A}}$ on top of one another: \begin{equation} \mathrm{vec}({\bf{A}}) = [a_{11}, \ldots, a_{m1}, a_{12}, \ldots, a_{m2}, \ldots, a_{1n},\ldots, a_{mn}]^T \end{equation} The inverse operation of vec$({\bf{A}})$ is denoted by vec$^{-1}({\bf{A}})$. We will refer to this operation as the matrization. This should not be confused with matricization, which is the generalisation of vectorisation to higher order tensors.

  5. We use diag$(\bf{u})$ to denote a matrix whose diagonal is equal to $\bf{u}$, while all its other entries are set to zero. Conversely, when applied to a matrix ${\bf A}$, the notation diag$({\bf A})$ refers to the vector formed by extracting only the diagonal elements of ${\bf A}$.

  6. A square matrix ${\bf{A}}\in \mathbb{C}^{m\times m}$ is said to be:

    • Invertible if there exists a matrix ${\bf{B}}\in \mathbb{C}^{m\times m}$ such that: \begin{equation} {\bf{B}}{\bf{A}} = {\bf{A}}{\bf{B}} ={\bf{I}}. \end{equation} We denote the inverse of ${\bf{A}}$ with ${\bf{A}}^{-1}$.
    • Hermitian if ${\bf{A}} = {\bf{A}}^H$.
    • (Special-) Orthogonal if ${\bf AA}^T = {\bf I} \rightarrow {\bf{A}}^{-1} = {\bf{A}}^T$.
    • (Special-) Unitary if ${\bf A A}^H = {\bf I} \rightarrow {\bf{A}}^{-1} = {\bf{A}}^H$.
  7. We use $|\bf{A}|$ to denote the determinant (see for example Wikipedia ➞) of the square matrix $\bf{A}$.
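The operations above all have direct numpy equivalents. The snippet below is a brief sketch of our own (using an arbitrary matrix); note in particular that vec$(\cdot)$ corresponds to column-major (Fortran-order) flattening:


In [ ]:
# An arbitrary matrix in C^(3x2)
A = np.array([[1+1j, 2], [3, 4-2j], [0, 1j]])

A_T = A.T          # transpose
A_conj = A.conj()  # complex conjugate
A_H = A.conj().T   # Hermitian transpose, (A^T)^*

vecA = A.flatten(order='F')                # vec(A): stack the columns of A
A_back = vecA.reshape(A.shape, order='F')  # the 'matrization' vec^{-1}(.)

x = np.array([1., 2., 3.])
D = np.diag(x)   # diag(x): diagonal matrix built from a vector
d = np.diag(D)   # diag(D): extract the diagonal of a matrix as a vector

print(np.allclose(A_back, A))  # True: vec^{-1}(vec(A)) = A
print(np.allclose(d, x))       # True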

2.10.2.2 Matrix Products

There are a few matrix products which are commonly used in radio interferometry. We define the most frequently used matrix products below (also see Hadamard, Khatri-Rao, Kronecker and other matrix products however be aware that our notation differs from the notation used there).

We assume below that ${\bf{A}},{\bf{C}}\in\mathbb{C}^{m\times n}$, ${\bf{B}}\in\mathbb{C}^{p\times q}$ and ${\bf{D}}\in\mathbb{C}^{n\times r}$. Sometimes it will be necessary to partition the matrices into sub-matrices. We will use boldface subscripts to refer to sub-matrices. For example, the notation ${\bf{A}}_{\bf ij}$ refers to the sub-matrix ${\bf{A}}_{\bf ij}\in\mathbb{C}^{m_i\times n_j}$ of the matrix ${\bf A}$. Thus the boldface indices ${\bf i}$ and ${\bf j}$ are themselves lists of indices of length $m_i$ and $n_j$, respectively. Similarly, the notation ${\bf{B}}_{\bf kl}$ refers to the sub-matrix ${\bf{B}}_{\bf kl}\in\mathbb{C}^{p_k\times q_l}$ of the matrix ${\bf B}$ and this time the boldface indices ${\bf k}$ and ${\bf l}$ are lists of length $p_k$ and $q_l$ respectively. The meaning of this notation will become clearer in the examples below. For now keep in mind that the boldface indices are lists whose elements do not necessarily need to start counting from one (see the definition of the Kronecker product below). Moreover $\sum m_i = m, \sum n_j = n, \sum p_k = p$ and $\sum q_l = q$. We can now define the following matrix products:

  1. Matrix Product The matrix product of ${\bf{A}}$ and ${\bf{D}}$, denoted ${\bf{A}}{\bf{D}}$, is of order $m\times r$. The $ij$-th element of the matrix product is given by

    \begin{equation} ({\bf{A}} {\bf{D}})_{ij} = \sum^n_{k=1} a_{ik}d_{kj}. \end{equation}

    Note that the number of columns of ${\bf{A}}$ must be equal to the number of rows of ${\bf{D}}$ for this product to exist.

  2. Hadamard Product

    The Hadamard product of ${\bf{A}}$ and ${\bf{C}}$, denoted ${\bf{A}}\odot{\bf{C}}$, is the element-wise product of ${\bf{A}}$ and ${\bf{C}}$ i.e.

    \begin{equation} {(\bf{A}} \odot {\bf{C}})_{ij} = a_{ij}c_{ij}. \end{equation}

    Note that ${\bf{A}}$ and ${\bf{C}}$ must be of the same order for this product to exist and that the resulting matrix will be of the same order as both ${\bf A}$ and ${\bf C}$. We may also define the Hadamard or elementwise inverse of a matrix ${\bf A}$. The $ij$-th element of the Hadamard inverse is given by \begin{equation} ({\bf A}^{\odot -1})_{ij} = \frac{1}{a_{ij}}. \end{equation}

  3. Kronecker Product

    The Kronecker product of ${\bf A}$ and ${\bf B}$, denoted ${\bf A} \otimes {\bf B}$, multiplies every element of ${\bf A}$ with every element of ${\bf B}$ and arranges the result in a matrix of the following form

    \begin{equation} {\bf{A}} \otimes {\bf{B}} = \left(\begin{array}{cccc}a_{11}{\bf B} & a_{12}{\bf B} & \cdots & a_{1n}{\bf B} \\ a_{21}{\bf B} & a_{22}{\bf B} & \cdots & a_{2n}{\bf B} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}{\bf B} & a_{m2}{\bf B} & \cdots & a_{mn}{\bf B} \end{array}\right). \end{equation}

    Note that the resulting matrix is of order $mp\times nq$. Using the boldface indices introduced above we may also write this as

    \begin{equation} ({\bf{A}} \otimes {\bf{B}})_{\bf ij} = a_{ij}{\bf{B}}. \end{equation}

    Note that the boldface indices ${\bf ij}$ always correspond to the indices of the element $a_{ij}$ of the original matrix ${\bf A}$. If we think of the result of the Kronecker product as the matrix ${\bf Q} = {\bf{A}} \otimes {\bf{B}}$ where ${\bf Q} \in \mathbb{C}^{mp\times nq}$, then

    \begin{equation} ({\bf{A}} \otimes {\bf{B}})_{\bf ij} = {\bf Q}_{((i-1)p+1):ip \ , \ ((j-1)q+1):jq} \ . \end{equation}

    In other words the boldface index ${\bf i}$ is a list of indices starting at $(i-1)p+1$ and ending at $ip$, and similarly ${\bf j}$ is a list of indices starting at $(j-1)q+1$ and ending at $jq$. This notation can be difficult to get used to but it will be useful in practical implementations involving the Kronecker product (see the short code check after this list).

  4. Khatri-Rao Product

    The Khatri-Rao product, denoted ${\bf{A}} \oplus {\bf{B}}$, can be defined in terms of the Kronecker product. Operationally the Khatri-Rao product of ${\bf{A}}$ and ${\bf{B}}$ is the same as the Kronecker product of each row of ${\bf{A}}$ with the corresponding row of ${\bf{B}}$. We will use the notation ${\bf A}_{i,:}$ to denote the $i-$th row of the matrix ${\bf A}$ (i.e. ${\bf A}_{i,:} = {\bf A}_{i,1:n}$) and similarly ${\bf B}_{k,:}$ denotes the $k-$th row of ${\bf B}$. The Khatri-Rao product can then be defined as

    \begin{equation} {\bf{A}} \oplus {\bf{B}} = \left(\begin{array}{c} {\bf A}_{1,:} \otimes {\bf B}_{1,:} \\ {\bf A}_{2,:} \otimes {\bf B}_{2,:} \\ \vdots \\ {\bf A}_{m,:} \otimes {\bf B}_{m,:} \end{array} \right). \end{equation}

    Note that the matrices ${\bf A}$ and ${\bf B}$ must have the same number of rows for the Khatri-Rao product to be defined and, if ${\bf A} \in \mathbb{C}^{m\times n}$ and ${\bf B} \in \mathbb{C}^{m\times q}$, the result will be of order $m\times nq$. We should point out that a different convention is sometimes used for the Khatri-Rao product. Our convention has been chosen because it is the one that is encountered most frequently in radio-interferometry.

During this course the Kronecker and Khatri-Rao products will mainly be used to convert between the Jones and Mueller formalisms $\S$ 7 ➞.
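The block (boldface-index) interpretation of the Kronecker product given above is easy to verify numerically. The snippet below is our own sketch with arbitrary random matrices; note that numpy indices are zero-based, so the block associated with element $a_{ij}$ occupies rows $ip$ to $(i+1)p-1$ and columns $jq$ to $(j+1)q-1$ of ${\bf Q}$:


In [ ]:
# Arbitrary random matrices
A = np.random.randn(3, 2) + 1j*np.random.randn(3, 2)  # m x n
B = np.random.randn(2, 4) + 1j*np.random.randn(2, 4)  # p x q
m, n = A.shape
p, q = B.shape

Q = np.kron(A, B)  # Q is (mp) x (nq)

# The block of Q corresponding to element A[i, j] is A[i, j]*B
i, j = 1, 0  # zero-based indices
block = Q[i*p:(i+1)*p, j*q:(j+1)*q]
print(np.allclose(block, A[i, j]*B))  # True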

2.10.2.3 Examples

Consider the vectors ${\bf{u}}=\begin{pmatrix} u_1,u_2,u_3\end{pmatrix}^T$ and ${\bf{v}}=\begin{pmatrix} v_1,v_2,v_3\end{pmatrix}^T$.

Now: \begin{align} {\bf{u}}*{\bf{v}}&= \begin{pmatrix} u_1\\u_2\\u_3\end{pmatrix}\begin{pmatrix}v_1^*,v_2^*,v_3^*\end{pmatrix}=\begin{pmatrix} u_1v_1^*&u_1v_2^*&u_1v_3^*\\ u_2v_1^*&u_2v_2^*&u_2v_3^*\\u_3v_1^*&u_3v_2^*&u_3v_3^*\end{pmatrix}. \end{align} Consider the matrices \begin{equation} {\bf{A}} = \begin{pmatrix} a_{11}&a_{12}\\a_{21}&a_{22}\\a_{31}&a_{32}\end{pmatrix} \hspace{0.5cm} \text{and} \hspace{0.5cm} {\bf{B}} = \begin{pmatrix}b_{11}&b_{12}\\b_{21}&b_{22}\\b_{31}&b_{32}\end{pmatrix}. \end{equation} Now: \begin{align} {\bf{A}} \odot {\bf{B}} &= \begin{pmatrix} a_{11}&a_{12}\\a_{21}&a_{22}\\a_{31}&a_{32}\end{pmatrix}\odot \begin{pmatrix}b_{11}&b_{12}\\b_{21}&b_{22}\\b_{31}&b_{32}\end{pmatrix} = \begin{pmatrix} a_{11}b_{11}&a_{12}b_{12}\\a_{21}b_{21}&a_{22}b_{22}\\a_{31}b_{31}&a_{32}b_{32}\end{pmatrix}\\ {\bf{A}} \otimes {\bf{B}} &= \begin{pmatrix} a_{11}&a_{12}\\a_{21}&a_{22}\\a_{31}&a_{32}\end{pmatrix}\otimes \begin{pmatrix}b_{11}&b_{12}\\b_{21}&b_{22}\\b_{31}&b_{32}\end{pmatrix} = \begin{pmatrix} a_{11}b_{11}&a_{11}b_{12}&a_{12}b_{11}&a_{12}b_{12}\\a_{11}b_{21}&a_{11}b_{22}&a_{12}b_{21}&a_{12}b_{22}\\a_{11}b_{31}&a_{11}b_{32}&a_{12}b_{31}&a_{12}b_{32}\\a_{21}b_{11}&a_{21}b_{12}&a_{22}b_{11}&a_{22}b_{12}\\a_{21}b_{21}&a_{21}b_{22}&a_{22}b_{21}&a_{22}b_{22}\\a_{21}b_{31}&a_{21}b_{32}&a_{22}b_{31}&a_{22}b_{32}\\a_{31}b_{11}&a_{31}b_{12}&a_{32}b_{11}&a_{32}b_{12}\\a_{31}b_{21}&a_{31}b_{22}&a_{32}b_{21}&a_{32}b_{22}\\a_{31}b_{31}&a_{31}b_{32}&a_{32}b_{31}&a_{32}b_{32} \end{pmatrix}\\ {\bf{A}} \oplus {\bf{B}} &= \begin{pmatrix} a_{11}&a_{12}\\a_{21}&a_{22}\\a_{31}&a_{32}\end{pmatrix}\oplus \begin{pmatrix}b_{11}&b_{12}\\b_{21}&b_{22}\\b_{31}&b_{32}\end{pmatrix} = \begin{pmatrix} {\bf A}_{1,:} \otimes {\bf B}_{1,:} \\{\bf A}_{2,:} \otimes {\bf B}_{2,:} \\ {\bf A}_{3,:} \otimes {\bf B}_{3,:} \end{pmatrix}\\ &= \begin{pmatrix} a_{11}b_{11}&a_{11}b_{12}&a_{12}b_{11}&a_{12}b_{12}\\a_{21}b_{21}&a_{21}b_{22}&a_{22}b_{21}&a_{22}b_{22}\\ a_{31}b_{31}&a_{31}b_{32}&a_{32}b_{31}&a_{32}b_{32}\end{pmatrix} \end{align}

IPython implementations of the above examples are given below:


In [ ]:
# Defining the vectors and matrices
u = np.array((3.,4.,1j)) # 3x1 vector
v = np.array((2.,1.,7.)) # 3x1 vector
A = np.array(([3,4],[5,-1],[4,-2j])) #3x2 matrix
B = np.array(([1,-8],[2,3],[6,1-5j])) #3x2 matrix

# Outer product
out_prod = np.outer(u,v.conj()) # outer product u v^H (v is real here, so conjugation has no effect)
# Hadamard product
had_prod = A*B
# Kronecker product
kron_prod = np.kron(A,B)
# Khatri-Rao product
kha_prod = np.zeros((3,4),dtype=complex) # create a matrix of order m x nq = 3x4
for i in range(len(A[:,0])):
    kha_prod[i,:] = np.kron(A[i,:],B[i,:])

# Printing inputs and products:
print('u:', u)
print('v:', v)
print("\n Outer Product: (%i x %i)\n"%(out_prod.shape), out_prod)
print('\n-------------------------------------')
print('A: (%i x %i)\n'%(A.shape), A)
print('B: (%i x %i)\n'%(B.shape), B)
print("\n Hadamard Product: (%i x %i)\n"%(had_prod.shape), had_prod)
print("\n Kronecker Product: (%i x %i)\n"%(kron_prod.shape), kron_prod)
print("\n Khatri-Rao Product: (%i x %i)\n"%(kha_prod.shape), kha_prod)

2.10.2.4 Product Identities

We can establish the following product identities (assuming the matrices and vectors below have the appropriate dimensions):

  1. $({\bf{A}} \oplus {\bf{B}}) \odot ({\bf{C}} \oplus {\bf{D}})=({\bf{A}} \odot {\bf{C}}) \oplus({\bf{B}} \odot {\bf{D}}) $

  2. $({\bf{A}} \otimes {\bf{B}})({\bf{C}} \oplus {\bf{D}}) = {\bf{A}}{\bf{C}} \oplus {\bf{B}}{\bf{D}}$

  3. $({\bf{A}} \oplus {\bf{B}})^H({\bf{C}} \oplus {\bf{D}}) = {\bf{A}}^H{\bf{C}} \odot {\bf{B}}^H {\bf{D}}$

  4. $\text{vec}({\bf{A}}{\bf{X}}{\bf{B}}) = ({\bf{B}}^T \otimes {\bf{A}})\text{vec}({\bf{X}})$

  5. $\text{vec}({\bf{A}}\text{diag}({\bf{x}}){\bf{B}}) = ({\bf{B}}^T \oplus {\bf{A}}){\bf{x}}$

For more information about these identities the reader is referred to Hadamard, Khatri-Rao, Kronecker and other matrix products and (Fundamental imaging limits of radio telescope arrays). Although we will not be making extensive use of these identities (however see $\S$ 7 ➞), they are frequently encountered in the radio interferometry literature, especially as it pertains to calibration. To gain some familiarity with them we will quickly validate identity four with an example. The others can be verified in a similar way.

Consider the two complex $2\times 2$ matrices: \begin{equation} {\bf{J}} = \begin{pmatrix} j_{11} &j_{12}\\ j_{21}&j_{22}\end{pmatrix} \hspace{0.5cm} \text{and} \hspace{0.5cm} {\bf{C}} = \begin{pmatrix} c_{11} &c_{12}\\ c_{21}&c_{22}\end{pmatrix}. \end{equation}

Now:

\begin{eqnarray} \text{vec}({\bf{J}}{\bf{C}}{\bf{J}}^H) &=& \begin{pmatrix} j_{11} &j_{12}\\ j_{21}&j_{22}\end{pmatrix} \begin{pmatrix} c_{11} &c_{12}\\ c_{21}&c_{22}\end{pmatrix} \begin{pmatrix} j_{11}^* &j_{21}^*\\ j_{12}^*&j_{22}^*\end{pmatrix}\\ &=& \begin{pmatrix} j_{11}^*j_{11}c_{11} + j_{11}^*j_{12}c_{21} + j_{12}^*j_{11}c_{12} + j_{12}^*j_{12}c_{22}\\ j_{11}^*j_{21}c_{11} + j_{11}^*j_{22}c_{21} + j_{12}^*j_{21}c_{12} + j_{12}^*j_{22}c_{22} \\ j_{21}^*j_{11}c_{11} + j_{21}^*j_{12}c_{21} + j_{22}^*j_{11}c_{12} + j_{22}^*j_{12}c_{22}\\ j_{21}^*j_{21}c_{11} + j_{21}^*j_{22}c_{21} + j_{22}^*j_{21}c_{12} + j_{22}^*j_{22}c_{22} \end{pmatrix}\\ &=& \Bigg[\begin{pmatrix} j_{11}^* &j_{12}^*\\ j_{21}^*&j_{22}^*\end{pmatrix} \otimes \begin{pmatrix} j_{11} &j_{12}\\j_{21}&j_{22}\end{pmatrix}\Bigg]\begin{pmatrix}c_{11}\\c_{21}\\c_{12}\\c_{22} \end{pmatrix}\\ &=& \left ({\bf{J}}^{*} \otimes {\bf{J}}\right ) \text{vec}({\bf{C}})\\ &=& \left( \left ({\bf{J}}^{H} \right)^T \otimes {\bf{J}} \right ) \text{vec}({\bf{C}}) \end{eqnarray}
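This result can also be checked numerically. The snippet below is our own quick sanity check with random $2\times 2$ matrices (using column-major flattening for vec$(\cdot)$):


In [ ]:
# Random 2x2 complex matrices
J = np.random.randn(2, 2) + 1j*np.random.randn(2, 2)
C = np.random.randn(2, 2) + 1j*np.random.randn(2, 2)

vec = lambda M: M.flatten(order='F')  # column-major vectorization

lhs = vec(J.dot(C).dot(J.conj().T))     # vec(J C J^H)
rhs = np.kron(J.conj(), J).dot(vec(C))  # ((J^H)^T kron J) vec(C) = (J^* kron J) vec(C)

print(np.allclose(lhs, rhs))  # True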

2.10.3 Linear systems

2.10.3.1 Theory and definitions

Next we will briefly look at how to use matrix algebra to solve linear systems of equations. Linear systems of equations can be written in the form

$$ {\bf A x} = {\bf b } + \epsilon, \quad \mbox{where} \quad \epsilon \sim \mathcal{N}\left(0, \Sigma\right). $$

Here $\epsilon$ is a realisation of Gaussian noise with covariance matrix $\Sigma$ i.e. it has a probability distribution function (pdf) of the form

$$ p(\epsilon|\Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}}\exp\left(-\frac{1}{2}\epsilon^H \Sigma^{-1} \epsilon\right). $$

Here $D$ denotes the dimension of $\epsilon$. The pdf allows us to assign a likelihood to any specific realisation of $\epsilon$. Clearly, since the quadratic form $\epsilon^H \Sigma^{-1} \epsilon$ in the exponent is non-negative, the likelihood increases as $\epsilon \rightarrow 0$. Thus, if $\bf{A}$ and ${\bf b}$ are fixed, we can use the pdf to find the most likely value of ${\bf x}$, which we denote ${\bf \bar{x}}$. Doing so is actually surprisingly simple, at least when the system is of full rank. First note that the position of the maximum of the likelihood does not change on taking the log of the pdf, a fact which comes in handy in numerical applications. Twice the negative log-likelihood of the Gaussian pdf is, up to an additive constant, the quantity known as $\chi^2$, and it takes the following form

$$ \chi^2 = ({\bf Ax} - {\bf b})^H {\bf W} ({\bf Ax} - {\bf b}), $$

where ${\bf W} = \Sigma^{-1}$ is called the weight matrix. Note that the minimum of $\chi^2$ occurs at the same value ${\bf \bar{x}}$ as the maximum of the likelihood. This allows us to use numerical minimisation schemes (about which you will hear more in the next section 2.11 Least-squares Minimization ➞) to find ${\bf \bar{x}}$. Actually, when the system is of full rank, there is a simpler analytic way to find ${\bf \bar{x}}$. This is because a Gaussian pdf has only a single stationary point. You should remember (from first year calculus) how to find this stationary point i.e. take the derivative and set it to zero. The only (hopefully) slightly tricky part is taking the gradient of $\chi^2$ w.r.t. ${\bf x}$ (i.e. finding the Jacobian). The expression for the Jacobian is

$$ \partial_x \chi^2 = \mathbb{J} = {\bf A}^H {\bf W} ({\bf b} - {\bf Ax}). $$

One more derivative gives the Hessian $\mathbb{H}$ as

$$ \partial^2_x \chi^2 = \mathbb{H} = {\bf A}^H {\bf W A}. $$

Furthermore, setting the Jacobian to zero produces the normal equations

$$ {{\bf A}^H {\bf W A x}} = {{\bf A}^H{\bf W b}}. $$

When the Hessian matrix $\mathbb{H} = {{\bf A}^H{\bf W A}}$ is non-singular (i.e. $\mathbb{H}^{-1}$ exists) we can solve for ${\bf \bar{x}}$ directly from the normal equations

$$ {\bf \bar{x}} = \left( {{\bf A}^H {\bf W A} } \right)^{-1} {{\bf A}^H{\bf W b}}. $$

This is very useful since many different problems can be written in linear form. Note that when the Hessian matrix $\mathbb{H} = {{\bf A}^H{\bf W A}}$ is singular (as is the case in interferometric imaging), or we are looking for the solution of a non-linear system (as is the case in interferometric calibration), we have to resort to approximate and/or iterative methods. Iterative minimisation techniques are discussed in the next section 2.11 Least-squares Minimization ➞. Next we demonstrate the above by working through a familiar example, viz. fitting a straight line to data.
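Before turning to the regression example, the snippet below is a small sketch of our own (using an arbitrary random, full-rank system) which forms the Hessian, solves the normal equations and confirms that the Jacobian vanishes at ${\bf \bar{x}}$:


In [ ]:
# An arbitrary full-rank linear system with noisy data
np.random.seed(0)
A = np.random.randn(10, 3)                  # design matrix
x_true = np.array([1.0, -2.0, 0.5])
W = np.diag(np.random.rand(10) + 0.5)       # some positive weights (W = Sigma^{-1})
b = A.dot(x_true) + 0.1*np.random.randn(10)

H = A.conj().T.dot(W).dot(A)                          # Hessian A^H W A
xbar = np.linalg.solve(H, A.conj().T.dot(W).dot(b))   # solve the normal equations

Jac = A.conj().T.dot(W).dot(b - A.dot(xbar))  # Jacobian evaluated at xbar
print(np.allclose(Jac, 0))  # True: the gradient vanishes at the minimum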


2.10.3.2 Linear regression example

Suppose we are given a data set $\mathcal{D} = [t_i, y_i, \delta y_i]$ where $\delta y_i$ represents uncertainties in the measured values of $y_i$. How do we find the best fit straight line passing through the data? Well, when the errors are normally distributed, the answer is fairly simple. It involves writing the equation for a straight line i.e.

$$ f(t) = c + mt $$

in matrix form and finding the values of the constants $c$ and $m$ which minimise the $\chi^2$. To write the function in matrix form we first define the matrix (sometimes called the design matrix)

$$ {\bf A} = [{\bf 1}, {\bf t}], $$

followed by the parameter vector ${\bf x} = [c,m]^T$. Here ${\bf 1}$ refers to a vector of ones and ${\bf t}$ the vector formed by stacking together all the points $t_i$ at which we have data. Similarly we will use ${\bf b}$ to denote the vector formed by stacking together all the observations $y_i$. The assumption that the errors are normally distributed is encoded by writing

$$ {\bf b} = {\bf Ax} + \epsilon, \quad \mbox{where} \quad \epsilon \sim \mathcal{N}\left(0, \Sigma \right), $$

and $\Sigma$ is a diagonal matrix with the variances $\delta y_i^2$ on its diagonal. That is all there is to it. We can now simply apply the formulas derived above. The code snippet below illustrates a simple linear regression.


In [ ]:
# Start by defining a function that solves the normal equations
def solve_Normal(b,A,W):
    ATW = np.dot(A.T.conj(), W)  # A^H W
    # solve (A^H W A) x = A^H W b directly rather than forming the inverse explicitly
    return np.linalg.solve(ATW.dot(A), ATW.dot(b))

In [ ]:
# Next we simulate some data
m = -2.5 # the model slope
c = 10.0 # the model y-intercept
N = 15 # the number of data points
xi = 10*np.random.random(N) # some random points in the domain (0,10)
x = np.linspace(0,10,N)
y = c + m*x # the real model
deltay = 5*np.abs(np.random.randn(N)) # uncertainties in the data
Sigma = np.diag(deltay**2) # covariance matrix
W = np.diag(1.0/deltay**2) # the weight matrix
epsilon = np.random.multivariate_normal(np.zeros(N),Sigma) # a realisation of the noise
yi = c + m*xi + epsilon # the data points

# plot the model and data
plt.figure('Linear model', figsize=(15,7))
plt.plot(x,y,label=r'$true \ model$')
plt.errorbar(xi,yi,deltay,fmt='xr', label=r'$ Data \ with \ 1-\sigma \ uncertainty $')
plt.title(r'$ Data \ for \ linear \ model $', fontsize=20)
plt.xlabel(r'$x$', fontsize=18)
plt.ylabel(r'$y$', fontsize=18)
plt.legend()

**Figure 2.10.2**: Simulated data for the linear model, with $1\sigma$ error bars.

This is typically what your data will look like. Note that the points with large error bars can lie quite far away from the true model and we have to account for this during the regression. This can be done by weighting the observations by the inverse of their variance. To illustrate the effect that the uncertainties (or inverse weights) have on the data we will now perform two regressions, one where we don't account for uncertainty and one where we do.


In [ ]:
# Construct the design matrix
A = np.ones([N,2])
A[:,1] = xi
AM = np.ones([N,2])
AM[:,1] = x 

# Solve normal equations without accounting for uncertainty
xbar_no_uncertainty = solve_Normal(yi,A,np.eye(N))
print "Best fit parameters without accounting for uncertainty c = %f, m = %f"%(xbar_no_uncertainty[0],xbar_no_uncertainty[1])

# reconstruct the function corresponding to these parameters
y_no_uncertainty = np.dot(AM,xbar_no_uncertainty)

# Solve normal equations while accounting for uncertainty
xbar = solve_Normal(yi,A,W)
print "Best fit parameters while accounting for uncertainty c = %f, m = %f"%(xbar[0],xbar[1])

# reconstruct the function corresponding to these parameters
y_with_uncertainty = np.dot(AM,xbar)

# plot the results
plt.figure('Linear model', figsize=(15,7))
plt.plot(x,y,label=r'$true \ model$')
plt.errorbar(xi,yi,deltay,fmt='xr', label=r'$ Data \ with \ 1-\sigma \ uncertainty $')
plt.title(r'$ Linear \ regression $', fontsize=20)
plt.xlabel(r'$x$', fontsize=18)
plt.ylabel(r'$y$', fontsize=18)
plt.plot(x,y_no_uncertainty,'g',label=r'$Best \ fit \ model \ without \ uncertainty$')
plt.plot(x,y_with_uncertainty,'m',label=r'$Best \ fit \ model \ with \ uncertainty$')
plt.legend()

**Figure 2.10.3**: Linear regression with and without accounting for the uncertainties, compared to the true model.

You should see that the weighted regression produces more accurate results. In interferometric imaging weighting functions can be applied for reasons other than accounting for uncertainty. We will see examples of this in $\S$ 5.4.

2.10.4 Convolution and Toeplitz matrices

We now turn to a useful way of looking at the discrete convolution operator. Recall that, if ${\bf r} = ({\bf y} \circ {\bf z})$, then we have that

$$ ({\bf y} \circ {\bf z})_k = r_k = \sum_{n\,=\,0}^{N-1} y_n z_{(k-n) \,\mathrm{mod}\, N}. $$

Here we will show how this operation can be represented as a special kind of matrix (viz. a Toeplitz matrix) multiplied with a vector. Perhaps the simplest way to think of a Toeplitz matrix is as a matrix with constants on each diagonal i.e. if ${\bf T}$ is Toeplitz then

$$ {\bf T} = \left[ \begin{array}{ccccc} t_0 & t_1 & t_2 & \cdots & t_{n-1} \\ t_{-1} & t_0 & t_1 & \ & \ \\ t_{-2} & t_{-1} & t_0 & \ & \ \\ \vdots & \ & \ & \ddots & \ \\ t_{-(n-1)} & \ & \ & \cdots & t_0 \end{array} \right]. $$

In other words ${\bf T} = [t_{i,j}; i,j=0,1,\cdots,n-1]$ where $t_{i,j} = t_{j-i}$, i.e. the entries depend only on the difference $j-i$, so that the first row and first column of ${\bf T}$ contain all of its distinct values. Matrices of this type occur in many applications and have a rich theory associated with them (see for example Toeplitz and Circulant Matrices: A review ). We will not even scratch the surface. In particular we will focus on a simpler subclass of matrices called circulant matrices. Every row of a circulant matrix is a right cyclic shift of the row immediately above it, i.e. if ${\bf C}$ is a circulant matrix then

$$ {\bf C} = \left[ \begin{array}{ccccc} y_0 & y_{n-1} & y_{n-2} & \cdots & y_{1} \\ y_{1} & y_0 & y_{n-1} & \ & \ \\ y_{2} & y_{1} & y_0 & \ & \ \\ \vdots & \ & \ & \ddots & \ \\ y_{n-1} & y_{n-2} & \ & \cdots & y_0 \end{array} \right]. $$

Equivalently we could write ${\bf C} = [y_{i,j}; i,j=0,1,\cdots,n-1]$ where $y_{i,j} = y_{(i-j) \,\mathrm{mod}\, n}$, i.e. the first column of ${\bf C}$ is just ${\bf y}$ itself. Let us now consider the effect of multiplying a vector ${\bf z} = [z_0, z_1, \cdots, z_{n-1}]^T$ by a matrix of this form i.e.

$$ {\bf r} = {\bf C z}, \quad \Rightarrow \quad r_k = y_{k} z_0 + y_{k-1} z_1 + \cdots + y_0 z_k + y_{n-1} z_{k+1} + \cdots + y_{k+1} z_{n-1}, $$

where on the right we have used the fact that the $k^{th}$ element of ${\bf r}$ is the $k^{th}$ row of ${\bf C}$ dotted with ${\bf z}$. You should convince yourself that this is exactly the definition of the discrete convolution given above. Also note that a circulant matrix is diagonalised in a Fourier basis. In fact, if ${\bf C}$ is a circulant matrix and ${\bf F}$ is the discrete Fourier transform operator, we have that

$$ {\bf C} = {\bf F} \, \mathrm{diag}(\boldsymbol{\psi}) \, {\bf F}^H, \quad \mbox{where} \quad {\bf C} {\bf v}_k = \psi_k {\bf v}_k, $$

so that the $\psi_k$ are the eigenvalues and the ${\bf v}_k$ (the columns of ${\bf F}$) are the eigenvectors of ${\bf C}$. We will not delve any further into the details here, but it turns out that the eigenvectors of all circulant matrices are related to the $n^{\mathrm{th}}$ roots of unity and that their eigenvalues can therefore be computed in a straightforward manner via the FFT. Furthermore, as we will see later on, the Hessian matrix associated with the measurement equation is a circulant matrix with the point spread function of the instrument as its convolution kernel.
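These statements are straightforward to check numerically. The snippet below is our own sketch using `scipy.linalg.circulant` (which builds a circulant matrix from its first column, matching the convention $C_{ij} = y_{(i-j)\,\mathrm{mod}\,n}$ above) together with numpy's FFT. Whether ${\bf F}$ or ${\bf F}^H$ appears on the left of the diagonalisation depends on the sign convention of the DFT; the check below simply verifies that the DFT matrix in numpy's convention diagonalises ${\bf C}$ and that the eigenvalues are the DFT of ${\bf y}$:


In [ ]:
from scipy.linalg import circulant

# Arbitrary kernel and signal
n = 8
y = np.random.randn(n)
z = np.random.randn(n)

C = circulant(y)  # C[i, j] = y[(i - j) % n]

# C z equals the circular convolution of y and z (also computable via the FFT)
r_matrix = C.dot(z)
r_fft = np.real(np.fft.ifft(np.fft.fft(y)*np.fft.fft(z)))
print(np.allclose(r_matrix, r_fft))  # True

# The DFT diagonalises C: F C F^{-1} is diagonal, with the DFT of y as eigenvalues
Fmat = np.fft.fft(np.eye(n))              # DFT matrix in numpy's convention
D = Fmat.dot(C).dot(np.linalg.inv(Fmat))
print(np.allclose(D, np.diag(np.fft.fft(y))))  # True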

