Conjugate gradient methods are suitable for optimization problems of the type: $f(x) = \frac{1}{2}x^TAx - x^Tb$, where A is a real, symmetric and positive definite $n \times n$ matrix, and x and b are column vectors of length n (i.e. $n \times 1$).
The derivative of f(x) is a well-known result from Vector Calculus:
$$f'(x) = \tfrac{1}{2}(A + A^T)x - b = Ax - b \quad \text{(since } A \text{ is symmetric, } A = A^T\text{)}$$Let $x^* = \arg\min_x f(x)$. Setting $f'(x^*) = 0$ gives $Ax^* = b$. In other words, the value of x which minimizes f(x) is given by the solution to the linear system of equations Ax = b.
This is one of the most important problems in Linear Algebra, and so there are a number of ways to approach it.
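As a quick illustration (not part of the derivation itself), we can check numerically that the gradient of f is indeed $Ax - b$ and that the minimizer coincides with the solution of Ax = b. The matrix and vector below are just example values (the same ones used in the implementation at the end of this post):
In [ ]:
import numpy as np

# Example symmetric positive definite system (same A and b as in the implementation below)
A = np.array([[2., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 2.]])
b = np.array([2., 4., 5.])

f = lambda x: 0.5 * x @ A @ x - x @ b

# Central-difference approximation of the gradient at an arbitrary point
x0 = np.array([1., -2., 3.])
eps = 1e-5
grad_numeric = np.array([(f(x0 + eps*e) - f(x0 - eps*e)) / (2*eps) for e in np.eye(3)])
print(np.allclose(grad_numeric, A @ x0 - b))   # True: the gradient is Ax - b

# The minimizer of f solves Ax = b
print(np.linalg.solve(A, b))                   # -> [4.75, 7.5, 6.25]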
The Conjugate Gradient method is an iterative method for solving Ax = b that is particularly useful when A is large and sparse. The idea is that in each iteration we move along a search direction that is conjugate w.r.t. A (i.e. A-orthogonal) to all previous search directions.
First, let's define conjugacy: Two vectors x and y are conjugate with respect to A if $x^TAy = 0$.
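For instance (a small illustrative check, using the example matrix from the implementation below), two vectors can be A-conjugate without being orthogonal in the usual sense:
In [ ]:
import numpy as np

A = np.array([[2., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 2.]])
x = np.array([1., 0., 0.])
y = np.array([0.5, 1., 0.])

print(x @ A @ y)   # 0.0 -> x and y are conjugate w.r.t. A (A-orthogonal)
print(x @ y)       # 0.5 -> but they are not orthogonal in the usual sense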
Imagine that $(p_1, p_2, ..., p_n)$ are a set of nonzero vectors that are mutually conjugate w.r.t. A. Since A is positive definite, such vectors are linearly independent, so they form a basis of $\mathbb{R}^n$. So we can represent any real vector of length n as a linear combination of the $p_k$s. If $x^*$ is the solution of Ax = b, then:
$$\begin{align} x^* &= \sum_{k=1}^n \alpha_k p_k \\ \Rightarrow Ax^* &= \sum_{k=1}^n \alpha_k A p_k && \text{multiplying both sides by } A \\ \Rightarrow p_i^{T}Ax^* &= \sum_{k=1}^n \alpha_k p_i^{T}Ap_k && \text{multiplying both sides by some } p_i^T \\ \Rightarrow p_i^{T}Ax^* &= \alpha_i p_i^{T}Ap_i && \text{since, by conjugacy, } p_i^{T}Ap_k = 0 \text{ for } i \ne k \\ \Rightarrow \alpha_i &= \frac{p_i^{T}Ax^*}{p_i^{T}Ap_i} = \frac{p_i^{T}b}{p_i^{T}Ap_i} && \text{since } Ax^* = b \end{align}$$This gives us a way to iteratively build the solution:
The $p_i$s are called 'search directions': starting from $x_0 = 0$, we take one step along each search direction per iteration, $x_{k+1} = x_k + \alpha_k p_k$, so that after at most n steps $x_n = \sum_k \alpha_k p_k = x^*$.
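To make this concrete, here is a small sketch (illustrative only; the eigenvector choice is just one convenient way to obtain a conjugate set, not the CG algorithm itself). For a symmetric A, its eigenvectors are mutually A-conjugate, so we can use them as the $p_k$s and recover $x^*$ directly from the formula for $\alpha_i$:
In [ ]:
import numpy as np

A = np.array([[2., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 2.]])
b = np.array([2., 4., 5.])

# Eigenvectors of a symmetric matrix are mutually conjugate w.r.t. A:
# p_i^T A p_j = lambda_j * p_i^T p_j = 0 for i != j.
_, P = np.linalg.eigh(A)

# x* = sum_k alpha_k p_k with alpha_k = p_k^T b / (p_k^T A p_k)
x_star = sum(((p @ b) / (p @ A @ p)) * p for p in P.T)
print(np.allclose(A @ x_star, b))   # True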
But how do we find conjugate search directions at each step? For this, we employ a technique from Linear Algebra called Gram-Schmidt orthogonalization.
Let's define the residual at the k-th iteration, $r_k$, as the negative of the gradient at $x_k$, i.e. $r_k = -(Ax_k - b) = b - Ax_k$. We want to establish a relation between $r_k$ and $r_{k+1}$.
$$\begin{align} r_k &= b - Ax_k \\ r_{k+1} &= b - Ax_{k+1} \\ &= r_k + Ax_k - Ax_{k+1} \\ &= r_k - A(x_{k+1} - x_k) \\ &= r_k - \alpha_k A p_k \end{align}$$It's also useful to rearrange this as $Ap_k = \frac{r_k - r_{k+1}}{\alpha_k}$, which we will use below.
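This is just algebra, but a quick numerical check (with arbitrary example values) may be reassuring: recomputing the residual from scratch at $x_{k+1} = x_k + \alpha_k p_k$ gives the same result as the update $r_k - \alpha_k A p_k$:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)              # an arbitrary SPD matrix
b, x, p = rng.standard_normal((3, 4))    # arbitrary b, x_k, p_k
alpha = 0.37                             # an arbitrary step size

print(np.allclose(b - A @ (x + alpha * p),            # residual recomputed at x_{k+1}
                  (b - A @ x) - alpha * (A @ p)))     # recurrence r_k - alpha*A*p_k -> True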
The residuals have two important properties: $r_k$ is orthogonal to all previous search directions, and, since (as we will see below) each search direction is a combination of the current residual and the previous direction, $r_k$ is also orthogonal to all previous residuals. To see the first property, note that with $x_0 = 0$ we have $x_k = \sum_{i=0}^{k-1} \alpha_i p_i$, so $x^* - x_k = \sum_{i=k}^{n-1} \alpha_i p_i$ and therefore $r_k = b - Ax_k = A(x^* - x_k) = \sum_{i=k}^{n-1} \alpha_i A p_i$.
If we multiply both sides by $p_j^T$, where $j < k$, we get:
$$p^T_jr_k = \sum_{i=k}^{n-1} \alpha_ip^T_jAp_i = 0, \quad \text{since } p^T_jAp_i = 0 \text{ for } i \ne j$$Now we're ready to apply the Gram-Schmidt rule, which says: given a set of basis vectors $(u_1,...,u_n)$, a corresponding set of vectors $(v_1,...,v_n)$ that are mutually orthogonal with respect to a chosen inner product $\langle \cdot, \cdot \rangle$ (in our case the A-inner product $\langle u, v \rangle = u^TAv$, so that 'orthogonal' means conjugate) can be found by:
$$\begin{align} & v_1 = u_1 \\ & v_{k} = u_{k} - \sum_{i = 1}^{k-1} \frac {\langle u_k, v_i \rangle}{\langle v_i, v_i \rangle} v_i \end{align}$$
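Here is a short sketch of the Gram-Schmidt rule above for an arbitrary inner product (illustrative only; `gram_schmidt` and the starting basis are my own choices, not part of the CG algorithm). Plugging in the A-inner product turns ordinary orthogonalization into conjugation:
In [ ]:
import numpy as np

def gram_schmidt(U, inner):
    """Orthogonalize the columns of U with respect to the inner product `inner`."""
    V = []
    for u in U.T:
        v = np.array(u, dtype=float)
        for w in V:
            v = v - (inner(u, w) / inner(w, w)) * w
        V.append(v)
    return np.column_stack(V)

A = np.array([[2., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 2.]])
P = gram_schmidt(np.eye(3), inner=lambda x, y: x @ A @ y)   # start from the standard basis
print(np.round(P.T @ A @ P, 10))   # off-diagonal entries are ~0: the columns are A-conjugate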
Suppose we're at the k-th iteration. Applied to our case, the input to the Gram-Schmidt rule is the set of residuals $(r_0, r_1, ..., r_k)$, and the outputs are the search directions:
$$\begin{align} & p_0 = r_0 \\ & p_k = r_{k} - \sum_{i = 0}^{k-1} \frac {r_{k}^TAp_i}{p_i^TAp_i} p_i \\ \end{align}$$But recall that $Ap_i = \frac{r_i - r_{i+1}}{\alpha_i}$, and that $r_k$ is orthogonal to all previous residuals. So the numerators $r_{k}^TAp_i = \frac{r_k^Tr_i - r_k^Tr_{i+1}}{\alpha_i}$ vanish for $i = 0, 1, ..., k-2$, but not for $i = k-1$ (where the $r_k^Tr_k$ term survives). So the summation reduces to just one term.
$$p_k = r_{k} - \frac {r_{k}^TAp_{k-1}} {p_{k-1}^TAp_{k-1}} p_{k-1}$$Let's call the coefficient of $p_{k-1}$, $\beta_{k-1}$. So we have $p_k = r_{k} + \beta_{k-1}p_{k-1}$ with $\beta_{k-1} = - \frac {r_{k}^TAp_{k-1}} {p_{k-1}^TAp_{k-1}}$.
We can further simplify the numerator of $\beta_{k-1}$ by expanding $Ap_{k-1}$ using $Ap_{k-1} = \frac{r_{k-1} - r_k}{\alpha_{k-1}}$: $$\begin{align} r_{k}^TAp_{k-1} &= r_{k}^T\left[\frac{r_{k-1} - r_{k}}{\alpha_{k-1}}\right] \\ &= -\frac{r_{k}^Tr_{k}}{\alpha_{k-1}}, \text{ since } r_{k}^Tr_{k-1} = 0 \end{align}$$
Similarly, we can simplify the denominator: $$\begin{align} p_{k-1}^TAp_{k-1} &= [r_{k-1} + \beta_{k-2}p_{k-2}]^T Ap_{k-1} \\ &= r_{k-1}^TAp_{k-1}, \text{ since } p_{k-2}^TAp_{k-1} = 0 \\ &= r_{k-1}^T\left[\frac{r_{k-1} - r_{k}}{\alpha_{k-1}}\right] \\ &= \frac{r_{k-1}^Tr_{k-1}}{\alpha_{k-1}}, \text{ since } r_{k-1}^Tr_{k} = 0 \end{align}$$
We use these last two results to obtain simplified expressions for both $\alpha$ and $\beta$:
$$\begin{align} & \beta_{k-1} = \frac{r_{k}^Tr_{k}} {r_{k-1}^Tr_{k-1}} \\ & \alpha_{k-1} = \frac{r_{k-1}^Tr_{k-1}} {p_{k-1}^TAp_{k-1}} \\ \end{align} $$These are exactly the updates used in the implementation below.
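Before turning to the implementation, here is a compact sanity check (a sketch using the update rules just derived, with the same example system as below) that the recurrences really do produce mutually orthogonal residuals and A-conjugate search directions:
In [ ]:
import numpy as np

A = np.array([[2., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 2.]])
b = np.array([2., 4., 5.])

x = np.zeros(3)
r = b - A @ x                   # r_0
p = r.copy()                    # p_0 = r_0
residuals, directions = [r], [p]
for _ in range(2):              # two more iterations give r_1, r_2 and p_1, p_2
    alpha = (r @ r) / (p @ A @ p)
    x = x + alpha * p
    r_new = r - alpha * (A @ p)
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new
    residuals.append(r)
    directions.append(p)

R, P = np.column_stack(residuals), np.column_stack(directions)
print(np.round(R.T @ R, 10))      # ~diagonal: residuals are mutually orthogonal
print(np.round(P.T @ A @ P, 10))  # ~diagonal: search directions are A-conjugate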
In [11]:
import numpy as np
from numpy.linalg import cholesky, inv, LinAlgError

def is_symmetric(x):
    # A matrix is symmetric iff it equals its own transpose
    return np.all((x - x.T) == 0)

def is_positive_definite(x):
    # A symmetric matrix is positive definite iff its Cholesky
    # decomposition exists (cholesky raises LinAlgError otherwise)
    try:
        cholesky(x)
        return True
    except LinAlgError:
        return False

A = np.array([[2, -1, 0],
              [-1, 2, -1],
              [0, -1, 2]])
print('A is both symmetric & positive definite: %s' % (is_symmetric(A) and is_positive_definite(A),))

b = np.atleast_2d([2, 4, 5]).T   # column vector of shape (3, 1)

# Use the inverse method to find the expected solution:
expected = np.squeeze(inv(A) @ b, axis=1)
print('Solution with inverse: %s' % (expected,))
In [13]:
def f(A, b, x):
    # f(x) = 1/2 x^T A x - x^T b, evaluated at a column vector x
    return (0.5 * x.T @ A @ x - x.T @ b).item()

def conjugate_gradient(A, b):
    # Start from x_0 = 0, so that r_0 = b - A x_0 = b and p_0 = r_0
    x = np.zeros((A.shape[0], 1))
    r = b - A @ x
    p = r.copy()
    n_iters = A.shape[0]   # at most n iterations are needed for an n x n system
    for i in range(n_iters):
        alpha = ((r.T @ r) / (p.T @ A @ p)).item()     # alpha_k = r_k^T r_k / p_k^T A p_k
        x = x + alpha * p                              # x_{k+1} = x_k + alpha_k p_k
        r_new = r - alpha * (A @ p)                    # r_{k+1} = r_k - alpha_k A p_k
        beta = ((r_new.T @ r_new) / (r.T @ r)).item()  # beta_k = r_{k+1}^T r_{k+1} / r_k^T r_k
        p = r_new + beta * p                           # p_{k+1} = r_{k+1} + beta_k p_k
        r = r_new
    return x

calc_x_min = conjugate_gradient(A, b)
print('\nSolution using conjugate gradient: %s' % np.squeeze(calc_x_min, axis=1))
print('Minimum value of f(x): %s' % (f(A, b, calc_x_min),))
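As an optional cross-check (assuming SciPy is installed; this is not part of the derivation above), SciPy's built-in conjugate gradient solver should return the same solution:
In [ ]:
from scipy.sparse.linalg import cg

# scipy's cg expects a 1-D right-hand side; it returns (solution, info),
# where info == 0 indicates successful convergence.
x_scipy, info = cg(A.astype(float), np.ravel(b).astype(float))
print(info, x_scipy)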