In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import solve_triangular
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
# pip install memory_profiler
%load_ext memory_profiler
np.random.seed(0)
from ipywidgets import interact, IntSlider
import matplotlib as mpl
mpl.rcParams['font.size'] = 14
mpl.rcParams['axes.labelsize'] = 20
mpl.rcParams['xtick.labelsize'] = 14
mpl.rcParams['ytick.labelsize'] = 14
def plot_matrices_with_values(ax, M, flag_values):
    N = M.shape[0]
    cmap = plt.get_cmap('GnBu')
    ax.matshow(M, cmap=cmap)
    if flag_values:
        for i in np.arange(0, N):
            for j in np.arange(0, N):
                # text() takes (x, y) = (column, row), so the value M[i,j] goes at (j, i)
                ax.text(j, i, '{:.2f}'.format(M[i,j]), va='center', ha='center', color='r')
Welcome to another edition of our Jupyter Notebooks. Here we'll teach you how to solve $A\,x = b$ when $A$ is a symmetric positive-definite matrix, but the following methods differ from the previous ones in a key way: they do not rely on a matrix factorization. The two methods we'll see are called Gradient Descent and the Conjugate Gradient Method. For the latter, we'll also see the benefits of preconditioning.
Gradient Descent is an iterative method. If you remember the iterative methods from the previous notebook, to find the next approximate solution $\mathbf{x}_{k+1}$ you add a vector to the current approximate solution $\mathbf{x}_k$, that is, $\mathbf{x}_{k+1} = \mathbf{x}_k + \text{vector}$. In this method, $\text{vector}$ is $\alpha_{k}\,\mathbf{r}_k$, where $\mathbf{r}_k$ is the residual ($\mathbf{b} - A\,\mathbf{x}_k$) and $\alpha_k = \cfrac{(\mathbf{r}_k)^T\,\mathbf{r}_k}{(\mathbf{r}_k)^T\,A\,\mathbf{r}_k}$, starting from some initial guess $\mathbf{x}_0$. Let's look at the implementation below:
In [3]:
def gradient_descent(A, b, x0, n_iter=10, tol=1e-10):
    n = A.shape[0]
    # array with the sequence of approximate solutions
    X = np.full((n_iter, n), np.nan)
    X[0] = x0
    for k in range(1, n_iter):
        r = b - np.dot(A, X[k-1])
        if np.linalg.norm(r) < tol:  # The algorithm "converged"
            X[k:] = X[k-1]
            return X
        alpha = np.dot(r, r)/np.dot(r, np.dot(A, r))
        X[k] = X[k-1] + alpha*r
    return X
Now let's try our algorithm! But first, let's borrow a function, kindly provided by the previous notebook, that generates a random symmetric positive-definite matrix; later on we will also define a helper to measure the error in the Euclidean norm.
In [4]:
def generate_spd_matrix(n):
    """
    Randomly generates an n x n symmetric positive-definite matrix A.
    """
    A = np.random.random((n,n))
    # constructing symmetry
    A += A.T
    # symmetric + strictly diagonally dominant -> symmetric positive-definite
    deltas = 0.1*np.random.random(n)
    row_sum = A.sum(axis=1) - np.diag(A)
    np.fill_diagonal(A, row_sum + deltas)
    return A
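Before using it, a quick sanity check doesn't hurt. The cell below is a minimal sketch (the sizes tested are just illustrative choices) verifying that the generated matrix is symmetric and has strictly positive eigenvalues, i.e. that it really is symmetric positive-definite:
In [ ]:
# Sanity check: symmetric + strictly diagonally dominant with a positive
# diagonal implies symmetric positive-definite, so all eigenvalues must be > 0.
for n in [3, 10, 50]:
    A_test = generate_spd_matrix(n)
    print('n = %3d | symmetric: %s | min eigenvalue: %.4f'
          % (n, np.allclose(A_test, A_test.T), np.linalg.eigvalsh(A_test).min()))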
We'll try our algorithm on matrices of different sizes and compare it with the solution given by NumPy's solver.
In [5]:
def show_small_example_GD(n_size=3, n_iter=10):
    np.random.seed(0)
    A = generate_spd_matrix(n_size)
    b = np.ones(n_size)
    x0 = np.zeros(n_size)
    X = gradient_descent(A, b, x0, n_iter)
    sol = np.linalg.solve(A, b)
    print('Gradient descent : ', X[-1])
    print('np solver        : ', sol)
    print('norm(difference): \t', np.linalg.norm(X[-1] - sol))  # difference between gradient_descent's solution and NumPy's solution
interact(show_small_example_GD, n_size=(3,50,1), n_iter=(5,50,1))
Out[5]:
As we can see, we're already getting decent solutions after about 15 iterations, even for larger matrices. A variant of this method is widely used for training neural networks and in Data Science in general; the main difference is that there the $\alpha$ parameter is called the 'learning rate' and is usually kept constant. Another reason it shows up there is that Data Science problems often require solving a nonlinear system of equations rather than a linear one, and the good news is that nonlinear systems are solved through a sequence of linear systems!
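As a point of comparison, here is a minimal sketch of that fixed learning-rate variant. The function name gradient_descent_fixed_lr and the value of lr are just illustrative choices, and the step must be small enough (relative to the largest eigenvalue of $A$) for the iteration to converge:
In [ ]:
def gradient_descent_fixed_lr(A, b, x0, lr=1e-3, n_iter=100):
    # Same update as gradient_descent, but alpha is replaced by a constant
    # 'learning rate' lr instead of being recomputed at every iteration.
    X = np.full((n_iter, A.shape[0]), np.nan)
    X[0] = x0
    for k in range(1, n_iter):
        r = b - np.dot(A, X[k-1])   # residual = minus the gradient of the quadratic
        X[k] = X[k-1] + lr*r        # fixed step along the residual direction
    return X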
Now we will discuss a younger sibling, the Conjugate Gradient Method, which is the preferred choice when the associated matrix is symmetric positive-definite. It works by successively eliminating the $n$ conjugate ($A$-orthogonal) components of the error, one by one, and it arrives at the solution with the following finite loop:
In [10]:
def conjugate_gradient(A, b, x0, full_output=False, tol=1e-16):
    n = A.shape[0]
    X = np.full((n+1, n), np.nan)   # Storing partial solutions x_i
    R = np.full((n+1, n), np.nan)   # Storing residuals r_i = b - A x_i
    D = np.full((n+1, n), np.nan)   # Storing conjugate directions d_i
    alphas = np.full(n, np.nan)     # Storing alpha's
    betas = np.full(n, np.nan)      # Storing beta's
    X[0] = x0                       # initial guess: x_0
    R[0] = b - np.dot(A, x0)        # initial residual: r_0 = b - A x_0
    D[0] = R[0]                     # initial direction: d_0
    n_residuals = np.full(n+1, np.nan)     # norm of the residual at each iteration: ||r_i||_2
    n_residuals[0] = np.linalg.norm(R[0])  # initializing residual norm: ||r_0||_2
    x_sol = x0                      # first approximation of the solution
    for k in np.arange(n):
        if np.linalg.norm(R[k]) <= tol:  # The algorithm converged
            if full_output:
                return X[:k+1], D[:k+1], R[:k+1], alphas[:k+1], betas[:k+1], n_residuals[:k+1]
            else:
                return x_sol
        # This is the 'first' version of the algorithm
        alphas[k] = np.dot(D[k], R[k]) / np.dot(D[k], np.dot(A, D[k]))
        X[k+1] = X[k] + alphas[k]*D[k]
        R[k+1] = R[k] - alphas[k]*np.dot(A, D[k])
        n_residuals[k+1] = np.linalg.norm(R[k+1])
        betas[k] = np.dot(D[k], np.dot(A, R[k+1]))/np.dot(D[k], np.dot(A, D[k]))
        D[k+1] = R[k+1] - betas[k]*D[k]
        x_sol = X[k+1]
    if full_output:
        return X, D, R, alphas, betas, n_residuals
    else:
        return x_sol
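Before the interactive examples, here is a quick consistency check, a minimal sketch (the 5x5 size and the seed are illustrative choices): the recursively updated residuals R[k] should match the directly computed residuals $\mathbf{b} - A\,\mathbf{x}_k$, and the final iterate should agree with NumPy's solver.
In [ ]:
np.random.seed(0)
A_chk = generate_spd_matrix(5)
b_chk = np.ones(5)
X_chk, D_chk, R_chk, _, _, _ = conjugate_gradient(A_chk, b_chk, np.zeros(5), True)
# Each row of np.dot(X_chk, A_chk.T) is A x_k, so R_chk should equal b - A x_k row by row.
print('max |R[k] - (b - A x_k)| :', np.max(np.abs(R_chk - (b_chk - np.dot(X_chk, A_chk.T)))))
print('||x_CG - x_numpy||       :', np.linalg.norm(X_chk[-1] - np.linalg.solve(A_chk, b_chk)))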
In [11]:
def compute_A_orthogonality(V, A='identity'):
    """
    Computes the A-inner product between each pair of
    vectors provided in V. If 'A' is not provided, it
    reduces to the traditional inner product.
    """
    m = V.shape[0]
    n = V.shape[1]
    if isinstance(A, str):
        A = np.eye(n)
    output = np.full((m-1, m-1), np.nan)
    for i in range(m-1):
        for j in range(m-1):
            output[i,j] = np.dot(V[i], np.dot(A, V[j]))
    return output
In [13]:
def show_small_example_CG(n_size=2, flag_image=False, flag_image_values=True):
    np.random.seed(0)
    A = generate_spd_matrix(n_size)
    b = np.ones(n_size)
    x0 = np.zeros(n_size)
    X, D, R, alphas, betas, n_residuals = conjugate_gradient(A, b, x0, True)
    if flag_image:
        outR = compute_A_orthogonality(R)
        outD = compute_A_orthogonality(D, A)
        M = 8
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(2*M, M))
        plot_matrices_with_values(ax1, np.log10(np.abs(outR)+1e-16), flag_image_values)
        ax1.set_title(r'$\log_{10}(|\mathbf{r}_i^T \, \mathbf{r}_j|+10^{-16})$', pad=20)
        plot_matrices_with_values(ax2, np.log10(np.abs(outD)+1e-16), flag_image_values)
        ax2.set_title(r'$\log_{10}(|\mathbf{d}_i^T\,A\,\mathbf{d}_j|+10^{-16})$', pad=20)
        plt.sca(ax3)
        plt.semilogy(n_residuals, '.')
        plt.grid(True)
        plt.ylabel(r'$||\mathbf{r}_i||$')
        plt.xlabel(r'$i$')
        plt.title('n = %d' % n_size)
        plt.sca(ax4)
        plt.plot(alphas, '.', label=r'$\alpha_i$', markersize=10)
        plt.plot(betas, '.', label=r'$\beta_i$', markersize=10)
        plt.grid(True)
        plt.legend()
        plt.xlabel(r'$i$')
        plt.show()
    else:
        print('n_residuals:')
        print(n_residuals)
        print('alphas:')
        print(alphas)
        print('betas:')
        print(betas)
        print('R:')
        print(R)
        print('X:')
        print(X)
        print('D:')
        print(D)
interact(show_small_example_CG, n_size=(2,50,1), flag_image=False, flag_image_values=True)
Out[13]:
In [14]:
def plot_iterative_solution(A, b, X, R, D, n=0, elev=30, azim=310):
    L = lambda x: np.dot(x, np.dot(A, x)) - np.dot(b, x)
    fig = plt.figure(figsize=(20,10))
    ax1 = fig.add_subplot(121, projection='3d')
    ax2 = fig.add_subplot(122, projection='3d')
    # Plotting the residual vectors.
    # We use ax1 for the actual values and ax2 for the normalized values.
    # We normalize just for plotting purposes, otherwise the last
    # vectors look too tiny.
    for v in R[:n+1]:
        ax1.quiver(0, 0, 0, v[0], v[1], v[2], color='blue')
        ax2.quiver(0, 0, 0, v[0]/np.linalg.norm(v), v[1]/np.linalg.norm(v), v[2]/np.linalg.norm(v), color='blue')
    # Plotting the approximate solution vectors
    for v in X[1:n+1]:
        ax1.quiver(0, 0, 0, v[0], v[1], v[2], color='red')
        ax2.quiver(0, 0, 0, v[0]/np.linalg.norm(v), v[1]/np.linalg.norm(v), v[2]/np.linalg.norm(v), color='red')
    # Plotting the direction vectors
    for v in D[:n]:
        ax1.quiver(0, 0, 0, v[0], v[1], v[2], color='green', linewidth=10, alpha=0.5)
        ax2.quiver(0, 0, 0, v[0]/np.linalg.norm(v), v[1]/np.linalg.norm(v),
                   v[2]/np.linalg.norm(v), color='green', linewidth=10, alpha=0.5)
    # Plotting the evolution of the solution
    v = X[0]
    ax1.quiver(0, 0, 0, v[0], v[1], v[2], color='black', linestyle='dashed')
    ax2.quiver(0, 0, 0, v[0]/np.linalg.norm(v), v[1]/np.linalg.norm(v), v[2]/np.linalg.norm(v), color='black', linestyle='dashed')
    for k in np.arange(1, n+1):
        v  = X[k] - X[k-1]
        vp = X[k-1]
        ax1.quiver(vp[0], vp[1], vp[2], v[0], v[1], v[2], color='magenta', linewidth=10, alpha=0.5)
        v  = X[k]/np.linalg.norm(X[k]) - X[k-1]/np.linalg.norm(X[k-1])
        vp = X[k-1]/np.linalg.norm(X[k-1])
        ax2.quiver(vp[0], vp[1], vp[2], v[0], v[1], v[2], color='magenta', linewidth=10, alpha=0.5)
    ax1.set_xlim(min(0, np.min(X[:,0]), np.min(R[:,0])), max(0, np.max(X[:,0]), np.max(R[:,0])))
    ax1.set_ylim(min(0, np.min(X[:,1]), np.min(R[:,1])), max(0, np.max(X[:,1]), np.max(R[:,1])))
    ax1.set_zlim(min(0, np.min(X[:,2]), np.min(R[:,2])), max(0, np.max(X[:,2]), np.max(R[:,2])))
    ax2.set_xlim(-1, 1)
    ax2.set_ylim(-1, 1)
    ax2.set_zlim(-1, 1)
    #fig.tight_layout()
    ax1.view_init(elev, azim)
    ax2.view_init(elev, azim)
    plt.title('r-blue, x-red, d-green, x-mag, x0-black')
    plt.show()

# Setting a standard name for the variables
np.random.seed(0)
A = generate_spd_matrix(3)
b = np.ones(3)
x0 = np.ones(3)
X, D, R, alphas, betas, n_residuals = conjugate_gradient(A, b, x0, True)
# For plotting with widgets
n_widget = IntSlider(min=0, max=b.shape[0], step=1, value=0)
elev_widget = IntSlider(min=-180, max=180, step=10, value=-180)
azim_widget = IntSlider(min=0, max=360, step=10, value=30)
solution_evolution = lambda n, elev, azim: plot_iterative_solution(A, b, X, R, D, n, elev, azim)
interact(solution_evolution, n=n_widget, elev=elev_widget, azim=azim_widget)
Out[14]:
The theory behind this algorithm is in the class notes and in the textbook (Numerical Analysis, 2nd Edition, Timothy Sauer). Now let's try it!
Here are some questions to think about:
How do the approximation errors of gradient_descent and conjugate_gradient evolve over the iterations? First of all, let's define a function to track the progress of the relative error for a given method: it takes the array of approximate solutions X and the reference solution r_sol computed by NumPy's solver, and returns an array with the relative error at each step.
In [15]:
def relative_error(X, r_sol):
    n_steps = X.shape[0]
    n_r_sol = np.linalg.norm(r_sol)
    E = np.zeros(n_steps)
    for i in range(n_steps):
        E[i] = np.linalg.norm(X[i] - r_sol) / n_r_sol
    return E
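As a minimal usage sketch (the size, seed and number of iterations are illustrative choices), relative_error can be applied directly to the array of iterates returned by either method:
In [ ]:
np.random.seed(0)
A_err = generate_spd_matrix(10)
b_err = np.ones(10)
X_err = gradient_descent(A_err, b_err, np.zeros(10), n_iter=20)
E_err = relative_error(X_err, np.linalg.solve(A_err, b_err))
print(E_err[:5])   # the relative error should decrease as the iterations proceed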
Let's first try the two methods with a small non-symmetric, non-positive-definite matrix, plotting the relative forward error for both of them.
In [16]:
def show_output_for_non_symmetric_and_npd(np_seed=0):
    np.random.seed(np_seed)
    n = 10
    A = 10 * np.random.random((n,n))
    b = 10 * np.random.random(n)
    x0 = np.zeros(n)
    X1 = gradient_descent(A, b, x0, n)
    X2, D, R, alphas, betas, n_residuals = conjugate_gradient(A, b, x0, True)
    r_sol = np.linalg.solve(A, b)
    E1 = relative_error(X1, r_sol)
    E2 = relative_error(X2, r_sol)
    iterations1 = np.linspace(1, n, n)
    iterations2 = np.linspace(1, X2.shape[0], X2.shape[0])
    plt.figure(figsize=(10,5))
    plt.xlabel('Iteration')
    plt.ylabel('Relative Error')
    plt.title('Evolution of the Relative Forward Error for each method')
    plt.semilogy(iterations1, E1, 'rd', markersize=8, label='GD')  # Red diamonds: Gradient Descent
    plt.semilogy(iterations2, E2, 'b.', markersize=8, label='CG')  # Blue dots: Conjugate Gradient
    plt.grid(True)
    plt.legend(loc='best')
    plt.show()
interact(show_output_for_non_symmetric_and_npd, np_seed=(0,100,1))
Out[16]:
As you can see, if the matrix doesn't meet the requirements for these methods, the results can be quite terrible.
Let's try again, this time using an appropriate matrix.
In [17]:
def show_output_for_symmetric_and_pd(np_seed=0, n=100):
    np.random.seed(np_seed)
    A = generate_spd_matrix(n)
    b = np.random.random(n)
    x0 = np.zeros(n)
    X1 = gradient_descent(A, b, x0, n)
    X2, D, R, alphas, betas, n_residuals = conjugate_gradient(A, b, x0, True)
    r_sol = np.linalg.solve(A, b)
    E1 = relative_error(X1, r_sol)
    E2 = relative_error(X2, r_sol)
    iterations1 = np.linspace(1, n, n)
    iterations2 = np.linspace(1, X2.shape[0], X2.shape[0])
    plt.figure(figsize=(10,5))
    plt.xlabel('Iteration')
    plt.ylabel('Relative Error')
    plt.title('Evolution of the Relative Forward Error for each method')
    plt.semilogy(iterations1, E1, 'rd', markersize=8, label='GD')  # Red diamonds: Gradient Descent
    plt.semilogy(iterations2, E2, 'b.', markersize=8, label='CG')  # Blue dots: Conjugate Gradient
    plt.grid(True)
    plt.legend(loc='best')
    plt.xlim([0, 40])
    plt.show()
interact(show_output_for_symmetric_and_pd, np_seed=(0,100,1), n=(10,1000,10))
Out[17]:
Amazing! We started with a huge relative error and reduced it to practically zero in under 10 iterations (the runs use 100 iterations, but we're only showing the first 40). We can clearly see that the Conjugate Gradient Method converges faster than Gradient Descent, even for larger matrices.
We can also see that, once the matrix reaches a certain size, the number of iterations needed to reach a small error stays more or less the same. We encourage you to try other kinds of matrices to see how the algorithms behave, and to experiment with the code.
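One way to make that observation concrete, a minimal sketch (the sizes and the 1e-8 threshold are illustrative choices), is to count, for a few matrix sizes, how many CG iterations are needed before the residual norm drops below a fixed threshold:
In [ ]:
np.random.seed(0)
for n in [50, 100, 200, 400]:
    A_n = generate_spd_matrix(n)
    b_n = np.random.random(n)
    _, _, _, _, _, res_norms = conjugate_gradient(A_n, b_n, np.zeros(n), True)
    hits = np.where(res_norms < 1e-8)[0]
    k = hits[0] if hits.size > 0 else len(res_norms) - 1
    print('n = %4d | iterations to reach ||r|| < 1e-8 : %d' % (n, k))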
Now let's move on to profiling. Of course, you win some, you lose some: accelerating the convergence of the algorithm means spending more of some other resource. We'll use the %timeit and %memit magic commands to see how the algorithms behave.
In [18]:
A = generate_spd_matrix(100)
b = np.ones(100)
x0 = np.random.random(100)
In [19]:
%timeit gradient_descent(A, b, x0, n_iter=100, tol=1e-5)
%timeit conjugate_gradient(A, b, x0, tol=1e-5)
In [14]:
# Commented because it is taking too long, we need to review this!
# %memit gradient_descent(A, b, x0, n_iter=100, tol=1e-5)
# %memit conjugate_gradient(A, b, x0, tol=1e-5)
When the memory profiling is enabled, we see something interesting: both algorithms need about the same amount of memory.
What happened with the time measurements? Why are they so large for the algorithm with the best convergence rate? Besides reaching the end of the loop, we have one other criterion for stopping the algorithm: when the residual r reaches the exact value of zero, we say that the algorithm converged and we stop. However, it is very hard to get a residual of exactly zero for randomized initial guesses, so this almost never happens, and we can't take advantage of the convergence rate of the algorithms.
There's a way we can fix this: instead of using that criterion, make the algorithm stop when a certain tolerance or threshold is reached, which is exactly what the tol argument of the implementations above allows. That way, when the error gets small enough, we can stop and say that we obtained a good enough solution.
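Since both implementations already accept a tol argument, we can see the effect directly. The comparison below is just a sketch with illustrative tolerance values, reusing the 100x100 system defined above; note that the check compares the absolute residual norm against tol:
In [ ]:
%timeit conjugate_gradient(A, b, x0, tol=1e-16)  # essentially never triggers the early exit
%timeit conjugate_gradient(A, b, x0, tol=1e-2)   # stops as soon as ||r|| <= 1e-2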
You can try with different matrices, different initial conditions, different sizes, etc. Try some more plotting, profiling, and experimenting. Have fun!
Credits: (ctorres@inf.utfsm.cl) and assistants: Laura Bermeo, Alvaro Salinas, Axel Simonsen and Martín Villanueva. DI UTFSM. April 2016. Updated (ctorres@inf.utfsm.cl). DI UTFSM. April 2019.
In [ ]: