Conjugate gradient methods are suitable for optimization problems of the type: $f(x) = \frac{1}{2}x^TAx - x^Tb$, where A is a real, symmetric and positive definite $n \times n$ matrix, and x and b are column vectors of length n (i.e. $n \times 1$).
The derivative of f(x) is a well-known result from Vector Calculus:
$$f'(x) = \tfrac{1}{2}(A + A^T)x - b = Ax - b \quad \text{(since } A \text{ is symmetric, } A = A^T\text{)}$$Let $x^* = \arg\min_x f(x)$. Setting $f'(x^*) = 0$ gives $Ax^* = b$. In other words, the value of x which minimizes f(x) is given by the solution to the linear system of equations Ax = b.
This is one of the most important problems in Linear Algebra, and so there are a number of ways to approach it.
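As a quick illustration (not part of the derivation itself), we can check numerically that the gradient of f is indeed $Ax - b$ and that the minimizer coincides with the solution of Ax = b. The matrix and vector below are just example values (the same ones used in the implementation at the end of this post):
In [ ]:
import numpy as np

# Example symmetric positive definite system (same A and b as in the implementation below)
A = np.array([[2., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 2.]])
b = np.array([2., 4., 5.])

f = lambda x: 0.5 * x @ A @ x - x @ b

# Central-difference approximation of the gradient at an arbitrary point
x0 = np.array([1., -2., 3.])
eps = 1e-5
grad_numeric = np.array([(f(x0 + eps*e) - f(x0 - eps*e)) / (2*eps) for e in np.eye(3)])
print(np.allclose(grad_numeric, A @ x0 - b))   # True: the gradient is Ax - b

# The minimizer of f solves Ax = b
print(np.linalg.solve(A, b))                   # -> [4.75, 7.5, 6.25]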
The Conjugate Gradient method is an iterative method for solving Ax = b that is particularly useful when A is large and sparse. The idea is that in each iteration we move along a search direction that is conjugate w.r.t. A (i.e. A-orthogonal) to all previous search directions.
First, let's define conjugacy: Two vectors x and y are conjugate with respect to A if $x^TAy = 0$.
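For instance (a small illustrative check, using the example matrix from the implementation below), two vectors can be A-conjugate without being orthogonal in the usual sense:
In [ ]:
import numpy as np

A = np.array([[2., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 2.]])
x = np.array([1., 0., 0.])
y = np.array([0.5, 1., 0.])

print(x @ A @ y)   # 0.0 -> x and y are conjugate w.r.t. A (A-orthogonal)
print(x @ y)       # 0.5 -> but they are not orthogonal in the usual sense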
Imagine that $(p_1, p_2, ..., p_n)$ are a set of nonzero vectors that are mutually conjugate w.r.t. A. Since A is positive definite, such vectors are linearly independent, so they form a basis of $\mathbb{R}^n$. So we can represent any real vector of length n as a linear combination of the $p_k$s. If $x^*$ is the solution of Ax = b, then:
$$\begin{align} x^* &= \sum_{k=1}^n \alpha_k p_k \\ \Rightarrow Ax^* &= \sum_{k=1}^n \alpha_k A p_k && \text{multiplying both sides by } A \\ \Rightarrow p_i^{T}Ax^* &= \sum_{k=1}^n \alpha_k p_i^{T}Ap_k && \text{multiplying both sides by some } p_i^T \\ \Rightarrow p_i^{T}Ax^* &= \alpha_i p_i^{T}Ap_i && \text{since, by conjugacy, } p_i^{T}Ap_k = 0 \text{ for } i \ne k \\ \Rightarrow \alpha_i &= \frac{p_i^{T}Ax^*}{p_i^{T}Ap_i} = \frac{p_i^{T}b}{p_i^{T}Ap_i} && \text{since } Ax^* = b \end{align}$$This gives us a way to iteratively build the solution:
The $p_i$s are called 'search directions': starting from $x_0 = 0$, we take one step along each search direction per iteration, $x_{k+1} = x_k + \alpha_k p_k$, so that after at most n steps $x_n = \sum_k \alpha_k p_k = x^*$.
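To make this concrete, here is a small sketch (illustrative only; the eigenvector choice is just one convenient way to obtain a conjugate set, not the CG algorithm itself). For a symmetric A, its eigenvectors are mutually A-conjugate, so we can use them as the $p_k$s and recover $x^*$ directly from the formula for $\alpha_i$:
In [ ]:
import numpy as np

A = np.array([[2., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 2.]])
b = np.array([2., 4., 5.])

# Eigenvectors of a symmetric matrix are mutually conjugate w.r.t. A:
# p_i^T A p_j = lambda_j * p_i^T p_j = 0 for i != j.
_, P = np.linalg.eigh(A)

# x* = sum_k alpha_k p_k with alpha_k = p_k^T b / (p_k^T A p_k)
x_star = sum(((p @ b) / (p @ A @ p)) * p for p in P.T)
print(np.allclose(A @ x_star, b))   # True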
But how do we find conjugate search directions at each step? For this, we employ a technique from Linear Algebra called Gram-Schmidt orthogonalization.
Let's define the residual at the k-th iteration, $r_k$, as the negative of the gradient at $x_k$, i.e. $r_k = -(Ax_k - b) = b - Ax_k$. We want to establish a relation between $r_k$ and $r_{k+1}$.
$$\begin{align} r_k &= b - Ax_k \\ r_{k+1} &= b - Ax_{k+1} \\ &= r_k + Ax_k - Ax_{k+1} \\ &= r_k - A(x_{k+1} - x_k) \\ &= r_k - \alpha_k A p_k \end{align}$$It's also useful to rearrange this as $Ap_k = \frac{r_k - r_{k+1}}{\alpha_k}$, which we will use below.
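This is just algebra, but a quick numerical check (with arbitrary example values) may be reassuring: recomputing the residual from scratch at $x_{k+1} = x_k + \alpha_k p_k$ gives the same result as the update $r_k - \alpha_k A p_k$:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)              # an arbitrary SPD matrix
b, x, p = rng.standard_normal((3, 4))    # arbitrary b, x_k, p_k
alpha = 0.37                             # an arbitrary step size

print(np.allclose(b - A @ (x + alpha * p),            # residual recomputed at x_{k+1}
                  (b - A @ x) - alpha * (A @ p)))     # recurrence r_k - alpha*A*p_k -> True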
The residuals have two important properties: $r_k$ is orthogonal to all previous search directions, and, since (as we will see below) each search direction is a combination of the current residual and the previous direction, $r_k$ is also orthogonal to all previous residuals. To see the first property, note that with $x_0 = 0$ we have $x_k = \sum_{i=0}^{k-1} \alpha_i p_i$, so $x^* - x_k = \sum_{i=k}^{n-1} \alpha_i p_i$ and therefore $r_k = b - Ax_k = A(x^* - x_k) = \sum_{i=k}^{n-1} \alpha_i A p_i$.
If we multiply both sides by $p_j^T$, where $j < k$, we get:
$$p^T_jr_k = \sum_{i=k}^{n-1} \alpha_ip^T_jAp_i = 0, \quad \text{since } p^T_jAp_i = 0 \text{ for } i \ne j$$Now we're ready to apply the Gram-Schmidt rule, which says: given a set of basis vectors $(u_1,...,u_n)$, a corresponding set of vectors $(v_1,...,v_n)$ that are mutually orthogonal with respect to a chosen inner product $\langle \cdot, \cdot \rangle$ (in our case the A-inner product $\langle u, v \rangle = u^TAv$, so that 'orthogonal' means conjugate) can be found by:
$$\begin{align} & v_1 = u_1 \\ & v_{k} = u_{k} - \sum_{i = 1}^{k-1} \frac {\langle u_k, v_i \rangle}{\langle v_i, v_i \rangle} v_i \end{align}$$
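Here is a short sketch of the Gram-Schmidt rule above for an arbitrary inner product (illustrative only; `gram_schmidt` and the starting basis are my own choices, not part of the CG algorithm). Plugging in the A-inner product turns ordinary orthogonalization into conjugation:
In [ ]:
import numpy as np

def gram_schmidt(U, inner):
    """Orthogonalize the columns of U with respect to the inner product `inner`."""
    V = []
    for u in U.T:
        v = np.array(u, dtype=float)
        for w in V:
            v = v - (inner(u, w) / inner(w, w)) * w
        V.append(v)
    return np.column_stack(V)

A = np.array([[2., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 2.]])
P = gram_schmidt(np.eye(3), inner=lambda x, y: x @ A @ y)   # start from the standard basis
print(np.round(P.T @ A @ P, 10))   # off-diagonal entries are ~0: the columns are A-conjugate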
Suppose we're at the k-th iteration. Applied to our case, the input to the Gram-Schmidt rule is the set of residuals $(r_0, r_1, ..., r_k)$, and the outputs are the search directions:
$$\begin{align} & p_0 = r_0 \\ & p_k = r_{k} - \sum_{i = 0}^{k-1} \frac {r_{k}^TAp_i}{p_i^TAp_i} p_i \\ \end{align}$$But recall that $Ap_i = \frac{r_i - r_{i+1}}{\alpha_i}$, and that $r_k$ is orthogonal to all previous residuals. So the numerators $r_{k}^TAp_i = \frac{r_k^Tr_i - r_k^Tr_{i+1}}{\alpha_i}$ vanish for $i = 0, 1, ..., k-2$, but not for $i = k-1$ (where the $r_k^Tr_k$ term survives). So the summation reduces to just one term.
$$p_k = r_{k} - \frac {r_{k}^TAp_{k-1}} {p_{k-1}^TAp_{k-1}} p_{k-1}$$Let's call the coefficient of $p_{k-1}$, $\beta_{k-1}$. So we have $p_k = r_{k} + \beta_{k-1}p_{k-1}$ with $\beta_{k-1} = - \frac {r_{k}^TAp_{k-1}} {p_{k-1}^TAp_{k-1}}$.
We can further simplify the numerator of $\beta_{k-1}$ by expanding $Ap_{k-1}$ using $Ap_{k-1} = \frac{r_{k-1} - r_k}{\alpha_{k-1}}$: $$\begin{align} r_{k}^TAp_{k-1} &= r_{k}^T\left[\frac{r_{k-1} - r_{k}}{\alpha_{k-1}}\right] \\ &= -\frac{r_{k}^Tr_{k}}{\alpha_{k-1}}, \text{ since } r_{k}^Tr_{k-1} = 0 \end{align}$$
Similarly, we can simplify the denominator: $$\begin{align} p_{k-1}^TAp_{k-1} &= [r_{k-1} + \beta_{k-2}p_{k-2}]^T Ap_{k-1} \\ &= r_{k-1}^TAp_{k-1}, \text{ since } p_{k-2}^TAp_{k-1} = 0 \\ &= r_{k-1}^T\left[\frac{r_{k-1} - r_{k}}{\alpha_{k-1}}\right] \\ &= \frac{r_{k-1}^Tr_{k-1}}{\alpha_{k-1}}, \text{ since } r_{k-1}^Tr_{k} = 0 \end{align}$$
We use these last two results to obtain simplified expressions for both $\alpha$ and $\beta$:
$$\begin{align} & \beta_{k-1} = \frac{r_{k}^Tr_{k}} {r_{k-1}^Tr_{k-1}} \\ & \alpha_{k-1} = \frac{r_{k-1}^Tr_{k-1}} {p_{k-1}^TAp_{k-1}} \\ \end{align} $$These are exactly the updates used in the implementation below.
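Before turning to the implementation, here is a compact sanity check (a sketch using the update rules just derived, with the same example system as below) that the recurrences really do produce mutually orthogonal residuals and A-conjugate search directions:
In [ ]:
import numpy as np

A = np.array([[2., -1., 0.],
              [-1., 2., -1.],
              [0., -1., 2.]])
b = np.array([2., 4., 5.])

x = np.zeros(3)
r = b - A @ x                   # r_0
p = r.copy()                    # p_0 = r_0
residuals, directions = [r], [p]
for _ in range(2):              # two more iterations give r_1, r_2 and p_1, p_2
    alpha = (r @ r) / (p @ A @ p)
    x = x + alpha * p
    r_new = r - alpha * (A @ p)
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new
    residuals.append(r)
    directions.append(p)

R, P = np.column_stack(residuals), np.column_stack(directions)
print(np.round(R.T @ R, 10))      # ~diagonal: residuals are mutually orthogonal
print(np.round(P.T @ A @ P, 10))  # ~diagonal: search directions are A-conjugate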
In [11]:
import numpy as np
from numpy.linalg import cholesky, inv, LinAlgError

def is_symmetric(x):
    # A matrix is symmetric iff it equals its own transpose
    return np.all((x - x.T) == 0)

def is_positive_definite(x):
    # A symmetric matrix is positive definite iff its Cholesky
    # decomposition exists (cholesky raises LinAlgError otherwise)
    try:
        cholesky(x)
        return True
    except LinAlgError:
        return False

A = np.array([[2, -1, 0],
              [-1, 2, -1],
              [0, -1, 2]])
print('A is both symmetric & positive definite: %s' % (is_symmetric(A) and is_positive_definite(A),))

b = np.atleast_2d([2, 4, 5]).T   # column vector of shape (3, 1)

# Use the inverse method to find the expected solution:
expected = np.squeeze(inv(A) @ b, axis=1)
print('Solution with inverse: %s' % (expected,))
In [13]:
def f(A, b, x):
    # f(x) = 1/2 x^T A x - x^T b, evaluated at a column vector x
    return (0.5 * x.T @ A @ x - x.T @ b).item()

def conjugate_gradient(A, b):
    # Start from x_0 = 0, so that r_0 = b - A x_0 = b and p_0 = r_0
    x = np.zeros((A.shape[0], 1))
    r = b - A @ x
    p = r.copy()
    n_iters = A.shape[0]   # at most n iterations are needed for an n x n system
    for i in range(n_iters):
        alpha = ((r.T @ r) / (p.T @ A @ p)).item()     # alpha_k = r_k^T r_k / p_k^T A p_k
        x = x + alpha * p                              # x_{k+1} = x_k + alpha_k p_k
        r_new = r - alpha * (A @ p)                    # r_{k+1} = r_k - alpha_k A p_k
        beta = ((r_new.T @ r_new) / (r.T @ r)).item()  # beta_k = r_{k+1}^T r_{k+1} / r_k^T r_k
        p = r_new + beta * p                           # p_{k+1} = r_{k+1} + beta_k p_k
        r = r_new
    return x

calc_x_min = conjugate_gradient(A, b)
print('\nSolution using conjugate gradient: %s' % np.squeeze(calc_x_min, axis=1))
print('Minimum value of f(x): %s' % (f(A, b, calc_x_min),))
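As an optional cross-check (assuming SciPy is installed; this is not part of the derivation above), SciPy's built-in conjugate gradient solver should return the same solution:
In [ ]:
from scipy.sparse.linalg import cg

# scipy's cg expects a 1-D right-hand side; it returns (solution, info),
# where info == 0 indicates successful convergence.
x_scipy, info = cg(A.astype(float), np.ravel(b).astype(float))
print(info, x_scipy)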