ILI285 - Computación Científica I / INF285 - Computación Científica

Conjugate Gradient Method

[S]cientific [C]omputing [T]eam

Version: 1.16


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import solve_triangular
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
# pip install memory_profiler
%load_ext memory_profiler
np.random.seed(0)
from ipywidgets import interact, IntSlider
import matplotlib as mpl
mpl.rcParams['font.size'] = 14
mpl.rcParams['axes.labelsize'] = 20
mpl.rcParams['xtick.labelsize'] = 14
mpl.rcParams['ytick.labelsize'] = 14

def plot_matrices_with_values(ax,M,flag_values):
    N=M.shape[0]
    cmap = plt.get_cmap('GnBu')
    ax.matshow(M, cmap=cmap)
    if flag_values:
        # Overlay each entry's value. ax.text expects (x, y) = (column, row),
        # so the value M[i,j] is placed at x=j, y=i.
        for i in np.arange(0, N):
            for j in np.arange(0, N):
                ax.text(j, i, '{:.2f}'.format(M[i,j]), va='center', ha='center', color='r')

Introduction

Welcome to another edition of our Jupyter Notebooks. Here, we'll teach you how to solve $A\,\mathbf{x} = \mathbf{b}$ when $A$ is a symmetric positive-definite matrix. The following methods have a key difference from the previous ones: they do not depend on a matrix factorization. The two methods we'll see are called Gradient Descent and the Conjugate Gradient Method. For the latter, we'll also see the benefits of preconditioning.

Gradient Descent

This is an iterative method. If you remember the iterative methods in the previous Notebook, to find the next approximate solution $\mathbf{x}_{k+1}$ you'd add a vector to the current approximate solution $\mathbf{x}_k$, that is, $\mathbf{x}_{k+1} = \mathbf{x}_k + \text{vector}$. In this method, $\text{vector}$ is $\alpha_{k}\,\mathbf{r}_k$, where $\mathbf{r}_k$ is the residual ($\mathbf{b} - A\,\mathbf{x}_k$) and $\alpha_k = \cfrac{(\mathbf{r}_k)^T\,\mathbf{r}_k}{(\mathbf{r}_k)^T\,A\,\mathbf{r}_k}$, starting with some initial guess $\mathbf{x}_0$.
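
One way to see where this $\alpha_k$ comes from: for a symmetric positive-definite $A$, solving $A\,\mathbf{x}=\mathbf{b}$ is equivalent to minimizing the quadratic $f(\mathbf{x}) = \frac{1}{2}\,\mathbf{x}^T\,A\,\mathbf{x} - \mathbf{b}^T\,\mathbf{x}$, whose gradient is $\nabla f(\mathbf{x}) = A\,\mathbf{x}-\mathbf{b} = -\mathbf{r}$. The step $\alpha_k$ is just the exact line search along the residual direction:

$$\frac{d}{d\alpha} f(\mathbf{x}_k + \alpha\,\mathbf{r}_k) = \mathbf{r}_k^T\left(A\,(\mathbf{x}_k+\alpha\,\mathbf{r}_k)-\mathbf{b}\right) = -(\mathbf{r}_k)^T\,\mathbf{r}_k + \alpha\,(\mathbf{r}_k)^T\,A\,\mathbf{r}_k = 0 \quad\Longrightarrow\quad \alpha_k = \cfrac{(\mathbf{r}_k)^T\,\mathbf{r}_k}{(\mathbf{r}_k)^T\,A\,\mathbf{r}_k}.$$

Let's look at the implementation below: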


In [3]:
def gradient_descent(A, b, x0, n_iter=10, tol=1e-10):
    n = A.shape[0]
    #array with solutions
    X = np.full((n_iter, n),np.nan)
    X[0] = x0

    for k in range(1, n_iter):
        r = b - np.dot(A, X[k-1])
        if np.linalg.norm(r)<tol: # The algorithm "converged"
            X[k:] = X[k-1]        # fill the remaining rows with the converged solution
            return X
        alpha = np.dot(r, r)/np.dot(r, np.dot(A, r))
        X[k] = X[k-1] + alpha*r

    return X

Now let's try our algorithm! But first, let's borrow a function, kindly provided by the previous notebook, to generate a random symmetric positive-definite matrix. Further below we will also define a small helper, relative_error, to measure how close each iterate is to the true solution.


In [4]:
"""
Randomly generates an nxn symmetric positive-
definite matrix A.
"""
def generate_spd_matrix(n):
    A = np.random.random((n,n))
    #constructing symmetry
    A += A.T
    #symmetric+diagonally dominant -> symmetric positive-definite
    deltas = 0.1*np.random.random(n)
    row_sum = A.sum(axis=1)-np.diag(A)
    np.fill_diagonal(A, row_sum+deltas)
    return A
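
As a quick sanity check (a small extra cell we add here, not part of the original notebook), the matrices produced this way should be symmetric with strictly positive eigenvalues:


In [ ]:
# Sanity check: a symmetric, strictly diagonally dominant matrix with a
# positive diagonal is symmetric positive-definite, so all of its
# eigenvalues should be strictly positive.
A_test = generate_spd_matrix(5)
print('symmetric?           ', np.allclose(A_test, A_test.T))
print('positive eigenvalues?', np.all(np.linalg.eigvalsh(A_test) > 0))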

We'll try our algorithm with some matrices of different sizes, and we'll compare it with the solution given by NumPy's solver.


In [5]:
def show_small_example_GD(n_size=3, n_iter=10):
    np.random.seed(0)
    A = generate_spd_matrix(n_size)
    b = np.ones(n_size)
    x0 = np.zeros(n_size)

    X = gradient_descent(A, b, x0, n_iter)
    sol = np.linalg.solve(A, b)
    print('Gradient descent  : ',X[-1])
    print('np solver         : ',sol)
    print('norm(difference): \t',np.linalg.norm(X[-1] - sol)) # difference between gradient_descent's solution and Numpy's solver solution
interact(show_small_example_GD,n_size=(3,50,1),n_iter=(5,50,1))


Out[5]:
<function __main__.show_small_example_GD(n_size=3, n_iter=10)>

As we can see, we're getting reasonably good solutions with 15 iterations, even for larger matrices. A variant of this method is currently used for training neural networks and in Data Science in general; the main difference is that the $\alpha$ parameter is called the 'learning rate' and is kept constant. Another reason it appears there is that in Data Science one often needs to solve a nonlinear system of equations rather than a linear one; the good news is that nonlinear systems are solved by means of a sequence of linear systems! Now we will discuss a younger sibling, the Conjugate Gradient Method, which is the preferred choice when the associated matrix is symmetric and positive-definite.
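
As a rough illustration of that variant (our own sketch, with a hypothetical function name and step size, not part of the original notebook), a fixed learning-rate version of the loop above could look like this:


In [ ]:
def gradient_descent_fixed_lr(A, b, x0, lr=1e-3, n_iter=1000):
    # Illustrative sketch: same descent direction (the residual), but with a
    # constant step 'lr' instead of the optimal alpha_k. For SPD matrices the
    # iteration only converges if lr < 2/lambda_max(A); otherwise it diverges.
    x = np.copy(x0)
    for _ in range(n_iter):
        r = b - np.dot(A, x)
        x = x + lr*r
    return x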

Conjugate Gradient Method

This method works by successively eliminating the $n$ $A$-orthogonal (conjugate) components of the error, one by one. The method arrives at the solution with the following finite loop:
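
For reference, the updates implemented below (the 'first' version of the algorithm, as noted in the code) start from $\mathbf{d}_0=\mathbf{r}_0=\mathbf{b}-A\,\mathbf{x}_0$ and, for $k=0,1,\dots$, compute

$$\alpha_k = \cfrac{\mathbf{d}_k^T\,\mathbf{r}_k}{\mathbf{d}_k^T\,A\,\mathbf{d}_k},\qquad \mathbf{x}_{k+1}=\mathbf{x}_k+\alpha_k\,\mathbf{d}_k,\qquad \mathbf{r}_{k+1}=\mathbf{r}_k-\alpha_k\,A\,\mathbf{d}_k,$$

$$\beta_k = \cfrac{\mathbf{d}_k^T\,A\,\mathbf{r}_{k+1}}{\mathbf{d}_k^T\,A\,\mathbf{d}_k},\qquad \mathbf{d}_{k+1}=\mathbf{r}_{k+1}-\beta_k\,\mathbf{d}_k.$$

The directions built this way satisfy $\mathbf{d}_i^T\,A\,\mathbf{d}_j = 0$ for $i\neq j$ ($A$-conjugacy) and the residuals satisfy $\mathbf{r}_i^T\,\mathbf{r}_j = 0$ for $i\neq j$, which is exactly what compute_A_orthogonality verifies a few cells below.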


In [10]:
def conjugate_gradient(A, b, x0, full_output=False, tol=1e-16):
    n = A.shape[0]
    X = np.full((n+1, n),np.nan) # Storing partial solutions x_i
    R = np.full((n+1, n),np.nan) # Storing residuals r_i=b-A\,x_i
    D = np.full((n+1, n),np.nan) # Storing conjugate directions d_i
    alphas = np.full(n,np.nan)   # Storing alpha's
    betas = np.full(n,np.nan)    # Storing beta's
    X[0] = x0                    # initial guess: x_0
    R[0] = b - np.dot(A, x0)     # initial residual: r_0=b-A\,x_0
    D[0] = R[0]                  # initial direction: d_0
    n_residuals = np.full(n+1,np.nan) # norm of residuals over iterations: ||r_i||_2

    n_residuals[0] = np.linalg.norm(R[0]) # initializing residual norm: ||r_0||_2
    x_sol=x0                     # first approximation of solution
    
    for k in np.arange(n):
        if np.linalg.norm(R[k])<=tol: # The algorithm converged
            if full_output:
                return X[:k+1], D[:k+1], R[:k+1], alphas[:k+1], betas[:k+1], n_residuals[:k+1]
            else:
                return x_sol
        # This is the 'first' version of the algorithm
        alphas[k] = np.dot(D[k], R[k]) / np.dot(D[k], np.dot(A, D[k]))
        X[k+1] = X[k] + alphas[k]*D[k]
        R[k+1] = R[k] - alphas[k]*np.dot(A, D[k])
        n_residuals[k+1] = np.linalg.norm(R[k+1])
        betas[k] = np.dot(D[k],np.dot(A,R[k+1]))/np.dot(D[k],np.dot(A,D[k]))
        D[k+1] = R[k+1] - betas[k]*D[k]
        x_sol=X[k+1]
        
    if full_output:
        return X, D, R, alphas, betas, n_residuals
    else:
        return x_sol

In [11]:
# This function computes the A-inner product 
# between each pair of vectors provided in V. 
# If 'A' is not provided, it becomes the 
# traditional inner product.
def compute_A_orthogonality(V,A=None):
    m = V.shape[0]
    n = V.shape[1]
    
    if A is None:
        A=np.eye(n)
    
    output =  np.full((m-1,m-1),np.nan)
    
    for i in range(m-1):
        for j in range(m-1):
            output[i,j]=np.dot(V[i],np.dot(A,V[j]))
    return output

In [13]:
def show_small_example_CG(n_size=2,flag_image=False,flag_image_values=True):
    np.random.seed(0)
    A = generate_spd_matrix(n_size)
    b = np.ones(n_size)
    x0 = np.zeros(n_size)

    X, D, R, alphas, betas, n_residuals = conjugate_gradient(A, b, x0, True)
    
    if flag_image:
        outR=compute_A_orthogonality(R)
        outD=compute_A_orthogonality(D,A)
        M=8
        fig, ((ax1, ax2), (ax3, ax4))  = plt.subplots(2, 2, figsize=(2*M,M))
        plot_matrices_with_values(ax1,np.log10(np.abs(outR)+1e-16),flag_image_values)
        ax1.set_title(r'$\log_{10}(|\mathbf{r}_i^T \, \mathbf{r}_j|+10^{-16})$',pad=20)
        plot_matrices_with_values(ax2,np.log10(np.abs(outD)+1e-16),flag_image_values)
        ax2.set_title(r'$\log_{10}(|\mathbf{d}_i^T\,A\,\mathbf{d}_j|+10^{-16})$',pad=20)
        plt.sca(ax3)
        plt.semilogy(n_residuals,'.')
        plt.grid(True)
        plt.ylabel(r'$||\mathbf{r}_i||$')
        plt.xlabel(r'$i$')
        plt.title('n= %d'%n_size)
        plt.sca(ax4)
        plt.plot(alphas,'.',label=r'$\alpha_i$',markersize=10)
        plt.plot(betas,'.',label=r'$\beta_i$',markersize=10)
        plt.grid(True)
        plt.legend()
        plt.xlabel(r'$i$')
        plt.show()
    else:
        print('n_residuals:')
        print(n_residuals)
        print('alphas:')
        print(alphas)
        print('betas:')
        print(betas)
        print('R:')
        print(R)
        print('X:')
        print(X)
        print('D:')
        print(D)
interact(show_small_example_CG,n_size=(2,50,1),flag_image=False,flag_image_values=True)


Out[13]:
<function __main__.show_small_example_CG(n_size=2, flag_image=False, flag_image_values=True)>

In [14]:
def plot_iterative_solution(A,b,X,R,D,n=0,elev=30,azim=310):
    L=lambda x: 0.5*np.dot(x,np.dot(A,x))-np.dot(b,x) # quadratic functional whose minimizer is the solution (currently unused)
    
    fig=plt.figure(figsize=(20,10))
    ax1 = fig.add_subplot(121, projection='3d')
    ax2 = fig.add_subplot(122, projection='3d')
    
    # Plotting the residual vectors
    for v in R[:n+1]:
        # We use ax1 for the actual values and ax1 for the normalized values.
        # We normalize it just for plotting purposes, otherwise the last
        # vectors look too tiny.
        ax1.quiver(0, 0, 0, v[0], v[1], v[2],color='blue')
        ax2.quiver(0, 0, 0, v[0]/np.linalg.norm(v), v[1]/np.linalg.norm(v), v[2]/np.linalg.norm(v),color='blue')
    # Plotting the residual vectors
    for v in X[1:n+1]:
        ax1.quiver(0, 0, 0, v[0], v[1], v[2],color='red')
        ax2.quiver(0, 0, 0, v[0]/np.linalg.norm(v), v[1]/np.linalg.norm(v), v[2]/np.linalg.norm(v),color='red')
    # Plotting the direction vectors
    for v in D[:n]:
        ax1.quiver(0, 0, 0, v[0], v[1], v[2],color='green',linewidth=10,alpha=0.5)
        ax2.quiver(0, 0, 0, v[0]/np.linalg.norm(v), v[1]/np.linalg.norm(v), 
                   v[2]/np.linalg.norm(v),color='green',linewidth=10,alpha=0.5)
    
    # plotting evolution of solution
    v = X[0]
    ax1.quiver(0, 0, 0, v[0], v[1], v[2], color='black', linestyle='dashed')
    ax2.quiver(0, 0, 0, v[0]/np.linalg.norm(v), v[1]/np.linalg.norm(v), v[2]/np.linalg.norm(v),color='black',linestyle='dashed')
    for k in np.arange(1,n+1):
        v = X[k]-X[k-1]
        vp= X[k-1]
        ax1.quiver(vp[0], vp[1], vp[2], v[0], v[1], v[2], color='magenta',linewidth=10,alpha=0.5)
        v = X[k]/np.linalg.norm(X[k])-X[k-1]/np.linalg.norm(X[k-1])
        vp= X[k-1]/np.linalg.norm(X[k-1])
        ax2.quiver(vp[0], vp[1], vp[2], v[0], v[1], v[2],color='magenta',linewidth=10,alpha=0.5)
        
    
    ax1.set_xlim(min(0,np.min(X[:,0]),np.min(R[:,0])),max(0,np.max(X[:,0]),np.max(R[:,0])))
    ax1.set_ylim(min(0,np.min(X[:,1]),np.min(R[:,1])),max(0,np.max(X[:,1]),np.max(R[:,1])))
    ax1.set_zlim(min(0,np.min(X[:,2]),np.min(R[:,2])),max(0,np.max(X[:,2]),np.max(R[:,2])))
    ax2.set_xlim(-1,1)
    ax2.set_ylim(-1,1)
    ax2.set_zlim(-1,1)
    #fig.tight_layout()
    ax1.view_init(elev,azim)
    ax2.view_init(elev,azim)
    plt.title('r-blue, x-red, d-green, x-mag, x0-black')
    plt.show()

# Setting a standard name for the variables
np.random.seed(0)
A = generate_spd_matrix(3)
b = np.ones(3)
x0 = np.ones(3)
X, D, R, alphas, betas, n_residuals = conjugate_gradient(A, b, x0, True)
# For plotting with widgets
n_widget = IntSlider(min=0, max=b.shape[0], step=1, value=0)
elev_widget = IntSlider(min=-180, max=180, step=10, value=-180)
azim_widget = IntSlider(min=0, max=360, step=10, value=30)
solution_evolution = lambda n,elev,azim: plot_iterative_solution(A,b,X,R,D,n,elev,azim)
interact(solution_evolution,n=n_widget,elev=elev_widget,azim=azim_widget)


Out[14]:
<function __main__.<lambda>(n, elev, azim)>

The science behind this algorithm is in the class notes and in the textbook (Numerical Analysis, 2nd Edition, Timothy Sauer). Now let's try it!

Here are some questions to think about:

  • What are the advantages and disadvantages of each method: gradient_descent and conjugate_gradient?
  • In which cases can the Conjugate Gradient Method converge in less than $n$ iterations?
  • What will happen if you use the Gradient Descent or Conjugate Gradient Method with non-symmetric, non-positive-definite matrices?

Let's Play: Practical Exercises and Profiling

First of all, let's define a function to track the progress of the relative error for a given method: it takes the array of approximate solutions X and the reference solution r_sol provided by NumPy's solver, and returns an array with the relative error at each step.


In [15]:
def relative_error(X, r_sol):
    n_steps = X.shape[0]
    n_r_sol = np.linalg.norm(r_sol)
    E = np.zeros(n_steps)
    for i in range(n_steps):
        E[i] = np.linalg.norm(X[i] - r_sol) / n_r_sol
    return E
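
Note that the same computation can be written without the explicit loop, since np.linalg.norm accepts an axis argument; the following cell (an optional, equivalent alternative we add here) is a vectorized version:


In [ ]:
def relative_error_vectorized(X, r_sol):
    # Equivalent to relative_error: row-wise norms of X - r_sol, all at once.
    return np.linalg.norm(X - r_sol, axis=1) / np.linalg.norm(r_sol)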

Let's try both methods with a small non-symmetric, non-positive-definite matrix and plot the forward error for each of them.


In [16]:
def show_output_for_non_symmetric_and_npd(np_seed=0):
    np.random.seed(np_seed)
    n = 10
    A = 10 * np.random.random((n,n))
    b = 10 * np.random.random(n)
    x0 = np.zeros(n)

    X1 = gradient_descent(A, b, x0, n)
    X2, D, R, alphas, betas, n_residuals = conjugate_gradient(A, b, x0, True)
    r_sol = np.linalg.solve(A, b)

    E1 = relative_error(X1, r_sol)
    E2 = relative_error(X2, r_sol)
    iterations1 = np.linspace(1, n, n)
    iterations2 = np.linspace(1, X2.shape[0], X2.shape[0])

    plt.figure(figsize=(10,5))
    plt.xlabel('Iteration')
    plt.ylabel('Relative Error')
    plt.title('Evolution of the Relative Forward Error for each method')
    plt.semilogy(iterations1, E1, 'rd', markersize=8, label='GD') # Red diamonds are for Gradient Descent
    plt.semilogy(iterations2, E2, 'b.', markersize=8, label='CG') # Blue dots are for Conjugate Gradient
    plt.grid(True)
    plt.legend(loc='best')
    plt.show()
interact(show_output_for_non_symmetric_and_npd,np_seed=(0,100,1))


Out[16]:
<function __main__.show_output_for_non_symmetric_and_npd(np_seed=0)>

As you can see, if the matrix doesn't meet the requirements for these methods, the results can be quite terrible.

Let's try again, this time using an appropriate matrix.


In [17]:
def show_output_for_symmetric_and_pd(np_seed=0,n=100):
    np.random.seed(np_seed)
    A = generate_spd_matrix(n)
    b = np.random.random(n)
    x0 = np.zeros(n)

    X1 = gradient_descent(A, b, x0, n)
    X2, D, R, alphas, betas, n_residuals = conjugate_gradient(A, b, x0, True)
    r_sol = np.linalg.solve(A, b)

    E1 = relative_error(X1, r_sol)
    E2 = relative_error(X2, r_sol)
    iterations1 = np.linspace(1, n, n)
    iterations2 = np.linspace(1, X2.shape[0], X2.shape[0])

    plt.figure(figsize=(10,5))
    plt.xlabel('Iteration')
    plt.ylabel('Relative Error')
    plt.title('Evolution of the Relative Forward Error for each method')
    plt.semilogy(iterations1, E1, 'rd', markersize=8, label='GD') # Red diamonds are for Gradient Descent
    plt.semilogy(iterations2, E2, 'b.', markersize=8, label='CG') # Blue dots are for Conjugate Gradient
    plt.grid(True)
    plt.legend(loc='best')
    plt.xlim([0,40])
    plt.show()
interact(show_output_for_symmetric_and_pd,np_seed=(0,100,1),n=(10,1000,10))


Out[17]:
<function __main__.show_output_for_symmetric_and_pd(np_seed=0, n=100)>

Amazing! We started with a huge relative error and reduced it to practically zero in just under 10 iterations (both algorithms run for up to 100 iterations, but we're only showing the first 40). We can clearly see that the Conjugate Gradient Method converges faster than Gradient Descent, even for larger matrices.

We can also see that, once the matrix reaches a certain size, the number of iterations needed to reach a small error stays more or less the same. We encourage you to try other kinds of matrices to see how the algorithms behave, and to experiment with the code. Now let's move on to profiling.

Of course, you win some, you lose some: accelerating the convergence of the algorithm means spending more of other resources. We'll use the %timeit and %memit magics to see how the algorithms behave.


In [18]:
A = generate_spd_matrix(100)
b = np.ones(100)
x0 = np.random.random(100)

In [19]:
%timeit gradient_descent(A, b, x0, n_iter=100, tol=1e-5)
%timeit conjugate_gradient(A, b, x0, tol=1e-5)


349 µs ± 25.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
469 µs ± 6.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [14]:
# Commented because it is taking too long, we need to review this!
# %memit gradient_descent(A, b, x0, n_iter=100, tol=1e-5)
# %memit conjugate_gradient(A, b, x0, tol=1e-5)

We see something interesting here (you can check it by uncommenting the %memit cell above): both algorithms need about the same amount of memory.

What happened with the time measurements? Why are they so large for the algorithm with the best convergence rate? Besides finishing the loop, we have one other criterion for stopping the algorithm: when the residual $\mathbf{r}$ reaches (essentially) the exact value of zero, we say that the algorithm converged, and stop. However, it is very hard to get a residual of exactly zero for randomized initial guesses, so this almost never happens, and we can't take advantage of the convergence rate of the algorithms.

There's a way to fix this: instead of using that criterion, make the algorithm stop when a certain tolerance or threshold is reached, which is what the tol parameter in the implementations above allows. That way, when the residual gets small enough, we can stop and say that we got a good enough solution.
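
For instance (this is just a sketch of the idea, using a relative stopping criterion that is not in the original implementations above), one could stop as soon as the residual norm falls below a tolerance relative to $\|\mathbf{b}\|$:


In [ ]:
def conjugate_gradient_reltol(A, b, x0, rtol=1e-8):
    # Sketch: same CG recurrences as above, but stopping when
    # ||r_k|| <= rtol*||b|| instead of (almost) requiring an exact zero.
    n = A.shape[0]
    x = np.copy(x0)
    r = b - np.dot(A, x)
    d = np.copy(r)
    threshold = rtol*np.linalg.norm(b)
    for _ in range(n):
        if np.linalg.norm(r) <= threshold:
            break
        Ad = np.dot(A, d)
        alpha = np.dot(d, r)/np.dot(d, Ad)
        x = x + alpha*d
        r_new = r - alpha*Ad
        beta = np.dot(d, np.dot(A, r_new))/np.dot(d, Ad)
        d = r_new - beta*d
        r = r_new
    return x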

You can try with different matrices, different initial conditions, different sizes, etc. Try some more plotting, profiling, and experimenting. Have fun!

Acknowledgements

  • Material created by professor Claudio Torres (ctorres@inf.utfsm.cl) and assistants: Laura Bermeo, Alvaro Salinas, Axel Simonsen and Martín Villanueva. DI UTFSM. April 2016.
  • Modified by professor Claudio Torres (ctorres@inf.utfsm.cl). DI UTFSM. April 2019.
  • Update May 2020 - v1.15 - C.Torres : Fixing formatting issues.
  • Update June 2020 - v1.16 - C.Torres : Adding 'compute_A_orthogonality' and extending 'show_small_example_CG'.

In [ ]: