In this homework, we will implement the conjugate gradient descent algorithm. While you should nearly always use an optimization routine from a library for practical data analysis, this exercise is useful because it will make concepts from multivariable calculus and linear algebra covered in the lectures concrete for you. It also brings you up the learning curve for the implementation of more complex algorithms than the ones you have been exposed to so far.
Note: The exercise assumes that we can calculate the gradient and Hessian of the function we are trying to minimize. This can be computationally expensive, or not even possible, for some functions. Approximate methods can then be used; we do not go into such complexities here.
We want to implement the line search method
$$ x_{k+1} = x_k + \alpha_k p_k $$where $\alpha_k$ is the step size and $p_k$ is the search direction.
In particular, we want the search directions $p_k$ to be conjugate, as this will allow us to find the minimum in $n$ steps for $x \in \mathbb{R}^n$ if $f(x)$ is a quadratic function.
The following exercises will unpack these ideas step by step, and finally wrap them all into a conjugate gradient algorithm.
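Before specializing anything, it may help to see the bare structure of a line search in code. The sketch below is only illustrative; search_direction and step_size are hypothetical placeholders that the exercises will replace with conjugate directions and the optimal quadratic step size.
In [ ]:
import numpy as np

def line_search_minimize(x0, search_direction, step_size, n_steps):
    """Skeleton of a line search: repeatedly move along a chosen direction."""
    x = np.asarray(x0, dtype=float)
    for k in range(n_steps):
        p = search_direction(x)    # where to go
        alpha = step_size(x, p)    # how far to go
        x = x + alpha * p          # x_{k+1} = x_k + alpha_k p_k
    return x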
Recall that our objective is to minimize a scalar-valued function which maps $\mathbb{R}^n \mapsto \mathbb{R}$, for example, a log likelihood function (for MLE) or an unnormalized posterior distribution (for MAP). Geometrically, we are trying to find the value of the lowest point of some surface. The conjugate gradient algorithm assumes that the surface can be approximated by the quadratic expression (say, by using a Taylor series expansion about $x$)
$$ f(x) = \frac{1}{2}x^TAx - b^Tx + c $$and that
$$ \nabla f = Ax - b = 0 $$at the minimum (if $A$ is positive definite). Note that $A$ is a matrix, $b$ is a vector, and $c$ is a scalar. Also, note that the matrix $A$ is the Hessian of the quadratic function. For simplicity, we'll work in $\mathbb{R}^2$ so we can visualize the surface, so that $x$ is a 2-vector.
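As a quick numeric sanity check (the values here are purely illustrative, not part of the exercise): for a positive definite $A$, the gradient $Ax - b$ vanishes at $x^* = A^{-1}b$.
In [ ]:
import numpy as np

A_demo = np.array([[2.0, 1.0], [1.0, 3.0]])  # hypothetical positive definite matrix
b_demo = np.array([1.0, 2.0])                # hypothetical vector
x_star = np.linalg.solve(A_demo, b_demo)     # minimizer A^{-1} b
print(A_demo @ x_star - b_demo)              # gradient at the minimum, ~[0, 0]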
Note: A form is a polynomial function where every term has the same degree - for example, $x^2 + 2xy + y^2$ is a quadratic form, which can be rewritten as $$ \begin{pmatrix} x & y \end{pmatrix} \begin{pmatrix} 1 & 1\\ 1 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} $$
That is, $x^TAx$ is a quadratic form.
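We can confirm this numerically at an arbitrary point (the point is chosen only for illustration):
In [ ]:
import numpy as np

A_form = np.array([[1, 1], [1, 1]])
x0, y0 = 2.0, -3.0
v = np.array([x0, y0])
print(v @ A_form @ v, x0**2 + 2*x0*y0 + y0**2)  # both evaluate to 1.0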
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
Exercise 1 (20 points)
We will work with the function $f_1$
$$ f_1(x) = \frac{1}{2} x^T \pmatrix{1 & 0 \\ 0 & 1}x $$and the function $f_2$
$$ f_2(x) = \frac{1}{2} x^T \pmatrix{1 & 0 \\ 0 & 3}x $$Use streamplot to show the gradient field of the above quadratic functions.
In [ ]:
In [2]:
def f(x, A, b, c):
    """Surface of the quadratic function f(x) = 0.5 x^T A x - b^T x + c."""
    return 0.5*x.T@A@x - b.T@x + c
In [3]:
def plot_contour(bounds, n, A, b, c):
    """Contour plot of quadratic function."""
    xmin, xmax, ymin, ymax = bounds
    x = np.linspace(xmin, xmax, n)
    y = np.linspace(ymin, ymax, n)
    X, Y = np.meshgrid(x, y)
    z = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            v = np.array([X[i, j], Y[i, j]])
            z[i, j] = f(v, A, b, c)
    g = plt.contour(X, Y, z)
    plt.clabel(g, inline=True, fontsize=10)
    plt.axis('square')
In [4]:
def plot_vectors(vs):
    """Plot the vectors vs."""
    for v in vs:
        plt.arrow(0, 0, v[0], v[1], head_width=0.5, head_length=0.5)
In [5]:
A = np.eye(2)
b = np.zeros(2)
c = 0
n = 25
bounds = [-8, 8, -8, 8]
plot_contour(bounds, n, A, b, c)
u1 = np.array([3,3])
v1 = np.array([3,-3])
plot_vectors([u1, v1])
plt.axis(bounds)
u1 @ v1
Out[5]:
In [6]:
# components of the gradient field grad f = Ax - b on a grid
Y, X = np.mgrid[bounds[2]:bounds[3]:n*1j, bounds[0]:bounds[1]:n*1j]
U = A[0,0]*X + A[0,1]*Y - b[0]
V = A[1,0]*X + A[1,1]*Y - b[1]
plt.streamplot(X, Y, U, V, color=U, linewidth=2, cmap=plt.cm.autumn)
plt.axis('square')
pass
In [7]:
A = np.array([[1,0],[0,3]])
b = np.zeros(2)
c = 0
u2 = np.array([3, np.sqrt(3)])
v2 = np.array([3, -np.sqrt(3)])
plot_contour(bounds, n, A, b, c)
plot_vectors([u2, v2])
np.around(u2@A@v2, 6)
plt.axis(bounds)
Out[7]:
Exercise 2 (30 points)
The vectors $u$ and $v$ are orthogonal if $u^Tv = 0$, and conjugate with respect to $A$ if $u^TAv = 0$. Write a Gram-Schmidt function to find orthogonal and conjugate vectors with the following signature
def gram_schmidt(U, inner):
    """Return an orthogonal matrix.

    U is a matrix of (column) vectors.
    inner is a function that calculates the inner product.

    Returns an orthogonal matrix of the same shape as U.
    """
Use this function and the appropriate inner product to plot a pair of vectors that are conjugate with respect to the matrix $A$ of $f_2$, where the first basis vector is parallel to $\pmatrix{1 \\ 1}$.
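Concretely, Gram-Schmidt subtracts from each new basis vector its projections (under the chosen inner product $\langle \cdot, \cdot \rangle$) onto the vectors already constructed:
$$ v_i = u_i - \sum_{j < i} \frac{\langle v_j, u_i \rangle}{\langle v_j, v_j \rangle} v_j $$Using the ordinary dot product $\langle u, v \rangle = u^Tv$ gives orthogonal vectors, while using $\langle u, v \rangle_A = u^TAv$ gives vectors that are conjugate with respect to $A$.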
In [8]:
def inner(u, v, A):
    """Inner product with respect to matrix A."""
    return u@A@v

def gram_schmidt(U, inner):
    """Find matrix of conjugate (under A) vectors V from the matrix of basis vectors U."""
    n = U.shape[1]
    V = np.zeros_like(U).astype('float')
    V[:, 0] = U[:, 0]
    for i in range(1, n):
        v = U[:, i]
        for j in range(i):
            u = V[:, j]
            v = v - inner(u, v)/inner(u, u)*u
        V[:, i] = v
    return V
In [9]:
from functools import partial
inner_ = partial(inner, A=A)
U = np.array([[3,3], [3,-3]]).T
gram_schmidt(U, inner_)
Out[9]:
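As a quick check (not required by the exercise), the two columns returned by gram_schmidt should be conjugate with respect to the A defined above:
In [ ]:
V = gram_schmidt(U, inner_)
# the columns should satisfy V[:,0]^T A V[:,1] = 0
print(np.around(inner(V[:, 0], V[:, 1], A), 6))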
In [10]:
A = np.array([[1,0],[0,3]])
b = np.zeros(2)
c = 0
u2 = np.array([3, 3])
v2 = np.array([4.5, -1.5])
plot_contour(bounds, n, A, b, c)
plot_vectors([u2, v2])
np.around(u2@A@v2, 6)
plt.axis(bounds)
Out[10]:
Exercise 3 (20 points)
We now need to find the "step size" $\alpha$ to take in the direction of the search vector $p$. We can get a quadratic approximation to a general nonlinear function $f$ by taking the Taylor series in the direction of $p$
$$ f(x + \alpha p) = f(x) + \alpha [f'(x)]^T p + \frac{\alpha^2}{2} p^T f''(x) p $$Find the derivative with respect to $\alpha$ and use this to find the optimal value for $\alpha$ with respect to the quadratic approximation. Write a function that returns $\alpha$ for a quadratic function with the following signature
def step(x, p, A, b):
    """Returns the optimal step size to take in line search on a quadratic.

    A and b are the coefficients of the quadratic expression
    $$
    f(x) = \frac{1}{2}x^TAx - b^Tx + c
    $$
    p is the search direction
    x is the current location
    """
We now know how to find a search direction $p_k$ - this is a vector that is conjugate to the previous search direction. The first search direction is usually set to be the negative gradient (the direction of steepest descent). Next we need to find out how far along $p_k$ we need to travel, i.e., we need to find $\alpha_k$. First we take a Taylor expansion in the direction of $p$
$$ f(x + \alpha p) = f(x) + \alpha [f'(x)]^T p + \frac{\alpha^2}{2} p^T f''(x) p $$followed by finding the derivative with respect to $\alpha$
$$ \frac{d}{d\alpha} f(x + \alpha p) = [f'(x)]^T p + \alpha p^T f''(x) p $$Solving for $\frac{d}{d\alpha} f(x + \alpha p) = 0$, we get
$$ \alpha = - \frac{[f'(x)]^T p}{p^T f''(x) p} = - \frac{\nabla f^T p}{p^T A p} = \frac{(b - Ax)^T p}{p^T A p} $$
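The step function asked for in Exercise 3 follows directly from this formula. A minimal sketch (one possible implementation consistent with the derivation above; the solution code below folds the same computation into cg):
In [ ]:
def step(x, p, A, b):
    """Optimal step size for line search on the quadratic
    f(x) = 0.5 x^T A x - b^T x + c, along the search direction p."""
    # alpha = (b - Ax)^T p / (p^T A p)
    return (b - A@x) @ p / (p @ A @ p)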
Exercise 4 (30 points)
Implement the conjugate gradient descent algorithm with the following signature
def cg(x, A, b, c, max_iter=100, tol=1e-3):
    """Conjugate gradient descent on a quadratic function surface.

    x is the starting position
    A, b and c are the coefficients of the quadratic expression
    $$
    f(x) = \frac{1}{2}x^TAx - b^Tx + c
    $$
    max_iter is the maximum number of iterations to take
    tol is the tolerance (stop if the length of the gradient is smaller than tol)

    Returns the number of steps taken and the list of all positions visited.
    """
Use cg to find the minimum of the function $f_2$ from Exercise 1, starting from $\pmatrix{6 \\ 7}$.
Plot the contour of the function $f_2$ and the trajectory taken from the initial starting point $x$ to the final position, including all the intermediate steps.
We are not particularly concerned about efficiency here, so don't worry about JIT/AOT/C++ level optimization.
For a more comprehensive discussion and efficient implementation, see An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
In [11]:
def cg(x, A, b, c, max_iter=100, tol=1e-3):
    """Conjugate gradient descent on a quadratic function surface."""
    i = 0
    r = b - A@x      # residual = negative gradient
    p = r            # first search direction is the negative gradient
    delta = r@r
    xs = [x]
    while i < max_iter and delta > tol**2:
        # find next position using optimal step size
        alpha = (r @ p)/(p @ A @ p)
        x = x + alpha*p
        xs.append(x)
        # find new direction using Gram-Schmidt
        r = b - A@x
        beta = (r @ A @ p) / (p @ A @ p)
        p = r - beta*p
        # squared norm of the residual, used as the stopping criterion
        delta = r@r
        # update count
        i = i + 1
    return i, np.array(xs)
In [12]:
x = np.array([6,7])
A = np.array([[1, 0], [0, 3]])
b = np.zeros(2)
c = 0
i, xs = cg(x, A, b, c)
In [13]:
n = 25
bounds = [-8, 8, -8, 8]
plot_contour(bounds, n, A, b, c)
plt.scatter([xs[0,0]], [xs[0,1]], c='red')
plt.plot(xs[:,0], xs[:,1], c='red')
plt.axis(bounds)
pass
In [14]:
i
Out[14]:
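As a final sanity check (not part of the original notebook), the last position visited should agree with the analytic minimizer $A^{-1}b$, which is the origin here, and for a 2-dimensional quadratic this should take at most two conjugate steps.
In [ ]:
# the trajectory should end (numerically) at the minimizer A^{-1} b = (0, 0)
print(np.allclose(xs[-1], np.linalg.solve(A, b), atol=1e-3))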
In [ ]: