$\newcommand{\trace}[1]{\operatorname{tr}\left\{#1\right\}}$ $\newcommand{\Norm}[1]{\lVert#1\rVert}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $\newcommand{\DD}{\mathscr{D}}$ $\newcommand{\grad}[1]{\operatorname{grad}#1}$ $\DeclareMathOperator*{\argmin}{arg\,min}$
Setting up the Python environment (do not use pylab)
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.optimize as opt
import time
%matplotlib inline
Consider the following cost function $ f(X) $ defined over the space of real $ n \times p $ matrices \begin{equation} f(X) = \frac{1}{2} \trace{X^T C X N} + \mu \frac{1}{4} \Norm{N - X^T X}^2_F \end{equation} where $ X \in \RR^{n \times p} $, $ n \ge p $, and the matrix $ C $ is symmetric, such that $ C = C^T $. The scalar $ \mu $ is assumed to be larger than the $p$th smallest eigenvalue of $ C $. The matrix $ N $ is diagonal with distinct positive entries on the diagonal. The trace of a square matrix $ A $ is denoted as $ \trace{A} $, and the Frobenius norm of an arbitrary matrix $ A $ is defined as $ \Norm{A}_F = \sqrt{\trace{A^T A}} $.
Implement a Python function frobenius_norm which accepts an arbitrary matrix $ A $ and returns $ \Norm{A}_F $ using the formula given. (Use numpy.trace and numpy.sqrt.)
Can you implement frobenius_norm more efficiently than by using the formula above?
Given $A \in \RR^{n \times p} $, $ p \le n $, the straightforward implementation of $ \trace{A^T A} = \trace{A A^T}$ is to first multiply the matrices and then take the trace. This has complexity $ O(p^2 n) $ (or even $ O(p n^2) $ if you are not careful). But the trace can be reformulated as \begin{equation} \trace{A^T A} = \sum_{i=1}^p (A^T A)_{i,i} = \sum_{i=1}^p \sum_{j=1}^n \underbrace{(A^T)_{i,j}}_{=A_{j, i}} A_{j,i} = \sum_{i=1}^p \sum_{j=1}^n A_{j,i}^2 , \end{equation} so we can implement it with complexity $ O(np) $.
In [ ]:
# Solution
def frobenius_norm(A):
    """Calculate the Frobenius norm of an array or matrix.
    A -- array or matrix
    """
    return np.sqrt((np.asarray(A)**2).sum(axis=None))

M = np.random.rand(5, 3)
print(M)
print(frobenius_norm(M))
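As a quick sanity check (not required by the exercise), the result can be compared against the direct trace formula and against numpy.linalg.norm:
In [ ]:
# Sanity check: all three expressions should agree up to rounding error
A = np.random.rand(6, 4)
print(frobenius_norm(A))                   # O(np) implementation above
print(np.sqrt(np.trace(np.dot(A.T, A))))   # direct formula, O(p^2 n)
print(np.linalg.norm(A, ord='fro'))        # NumPy's built-in Frobenius norm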
In [ ]:
# Solution
def cost_function_for_matrix(X, C, N, mu):
    """
    Calculate the cost function at point X given as a matrix.
    X  -- current point in matrix form
    C  -- symmetric matrix
    N  -- diagonal matrix
    mu -- scalar
    returns the value of the cost function as a scalar.
    """
    if not isinstance(X, np.matrix):
        raise TypeError("X is not a matrix")
    if not isinstance(C, np.matrix):
        raise TypeError("C is not a matrix")
    if not isinstance(N, np.matrix):
        raise TypeError("N is not a matrix")
    r1 = 0.5 * np.trace(X.T * C * X * N)
    r2 = 0.25 * mu * frobenius_norm(N - X.T * X)**2
    return r1 + r2

X = np.matrix(np.random.rand(5, 3))
C = np.random.rand(5, 5)
C = np.matrix(C + C.T)
N = np.matrix(np.diag(np.random.rand(3)))
print(cost_function_for_matrix(X, C, N, np.random.rand()))
Many standard optimisation routines work only with vectors. Fortunately, as vector spaces, the space of matrices $ \RR^{n \times p} $ and the space of vectors $ \RR^{n p} $ are isomorphic. What does this mean?
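A small illustration of this isomorphism (an added sketch, not part of the original text): in NumPy the two representations are related by reshape, which is a linear bijection and loses no information.
In [ ]:
# Illustration: a matrix in R^{n x p} and its flattened version in R^{n p}
# carry exactly the same information.
n, p = 4, 2
X = np.random.rand(n, p)
x = X.reshape(n * p)            # matrix -> vector (row-major order)
X_back = x.reshape(n, p)        # vector -> matrix
print(np.allclose(X, X_back))   # True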
Implement the cost function $ f(X) $ given $ X $ as a vector and call it cost_function_for_vector. Which extra arguments do you need for the cost function?
In [ ]:
# Solution
def cost_function_for_vector(X, C, N, mu, n, p):
    """Calculate the cost function at point X given as 1-D array
    X  -- current point as 1-D array
    C  -- symmetric matrix
    N  -- diagonal matrix
    mu -- scalar
    n  -- row dimension of X
    p  -- column dimension of X
    returns the value of the cost function as a scalar
    """
    if not isinstance(X, np.ndarray):
        raise TypeError("X is not an array")
    if X.ndim != 1:
        raise ValueError("X is not a 1-D vector")
    Xmatrix = np.matrix(X.reshape((n, p)))
    return cost_function_for_matrix(Xmatrix, C, N, mu)
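A quick consistency check (added here as a sketch; it assumes the cells above have been run): the vectorised cost must agree with the matrix version.
In [ ]:
# The two implementations of the cost function should return the same value
n, p = 5, 3
mu = np.random.rand()
C = np.matrix(np.random.rand(n, n))
C = C + C.T                      # make C symmetric
N = np.matrix(np.diag(np.random.rand(p)))
Xmat = np.matrix(np.random.rand(n, p))
Xvec = np.asarray(Xmat).reshape(n * p)
print(cost_function_for_matrix(Xmat, C, N, mu))
print(cost_function_for_vector(Xvec, C, N, mu, n, p))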
A diagonal matrix has the nice property that the eigenvalues can be directly read off the diagonal. Given a diagonal matrix $ C \in \RR^{n \times n} $ with distinct eigenvalues, how many different diagonal matrices have the same set of eigenvalues?
Given a diagonal matrix $ C \in \RR^{n \times n} $ with distinct eigenvalues, how many different matrices have the same set of eigenvalues?
Given a set of $ n $ distinct real eigenvalues $ \mathcal{E} = \{e_1, \dots, e_n \} $, write a Python function random_matrix_from_eigenvalues which takes a list of eigenvalues $ E $ and returns a random symmetric matrix $ C $ having the same eigenvalues.
There are $ n! $ diagonal matrices with this set of eigenvalues (one for each permutation of the diagonal entries), but infinitely many matrices in general with the same set of eigenvalues.
In order to construct a random matrix with given eigenvalues $\lambda_i$, $i=1,\dots,n$ we first create a diagonal matrix $ \Lambda $ with those eigenvalues on the diagonal. Then we can get another matrix $ A $ with the same eigenvalues as $ \Lambda $ if we apply an arbitrary nonsingular matrix $ B $ to get $ A = B \Lambda B^{-1} $. (Can you prove that $ A $ and $ \Lambda $ have the same set of eigenvalues?)
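One possible answer to the question above: if $ \Lambda v = \lambda v $ for some eigenvector $ v \neq 0 $, then \begin{equation} A (B v) = B \Lambda B^{-1} B v = B \Lambda v = \lambda (B v), \end{equation} so $ \lambda $ is also an eigenvalue of $ A $ with eigenvector $ B v \neq 0 $. Applying the same argument to $ \Lambda = B^{-1} A B $ gives the reverse inclusion, hence $ A $ and $ \Lambda $ have the same set of eigenvalues.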
If $ B $ is an orthogonal matrix $ Q $, then $ Q^{-1} = Q^T $, and the formula above becomes $ A = Q \Lambda Q^T $, which is much faster to calculate than using the inverse of a matrix.
How to get a random orthogonal matrix? We use the QR-decomposition of a matrix which decomposes every arbitrary matrix $ B $ into an orthogonal matrix $ Q $ and an upper-triangular matrix $ R $, $ B = Q R $.
The algorithm therefore is:
1. Create the diagonal matrix $ \Lambda $ with the given eigenvalues on its diagonal.
2. Generate a random matrix $ B $ and compute its QR-decomposition $ B = Q R $.
3. Return $ A = Q \Lambda Q^T $.
Note: $ A $ and $ \Lambda $ share the same *set* of eigenvalues. The order can be arbitrary.
In [ ]:
# Solution
def random_matrix_from_eigenvalues(E):
    """Create a square random matrix with a given set of eigenvalues
    E -- list of eigenvalues
    return the random matrix
    """
    n = len(E)
    # Create a random orthogonal matrix Q via QR decomposition
    # of a random matrix A
    A = np.matrix(np.random.rand(n, n))
    Q, R = np.linalg.qr(A)
    # A similarity transformation with an orthogonal matrix
    # leaves the eigenvalues intact
    return Q * np.diag(E) * Q.T
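A quick check (added sketch, assuming the cell above has been run): the eigenvalues of the returned symmetric matrix should reproduce $ E $ up to ordering and rounding error.
In [ ]:
# Verify that the constructed matrix has the prescribed eigenvalues
E = [1.0, 2.0, 3.0, 4.0]
C = random_matrix_from_eigenvalues(E)
print(np.linalg.eigvalsh(C))    # sorted eigenvalues, should be close to [1, 2, 3, 4]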
Use the minimisation functions fmin or fmin_powell provided in the Python package scipy.optimize to minimise the cost function cost_function_for_vector.
Hint: Use the argument args in the minimisation functions fmin or fmin_powell to provide the extra parameters to your cost function cost_function_for_vector.
In [ ]:
# Solution
def minimise_f_using_fmin(initialise_proc):
    """Run minimisation with the simplex algorithm."""
    C, N, mu, n, p, X0 = initialise_proc()
    X_at_min = opt.fmin(cost_function_for_vector,
                        X0,
                        args=(C, N, mu, n, p),
                        xtol=0.0001,
                        ftol=0.0001,
                        maxiter=None,
                        maxfun=None,
                        full_output=0,
                        disp=1,
                        retall=0,
                        callback=None)
    X_at_min = np.matrix(X_at_min.reshape((n, p)))
    show_results(X_at_min, C)

def minimise_f_using_fmin_powell(initialise_proc):
    """Run minimisation with Powell's algorithm."""
    C, N, mu, n, p, X0 = initialise_proc()
    X_at_min = opt.fmin_powell(cost_function_for_vector,
                               X0,
                               args=(C, N, mu, n, p),
                               xtol=0.0001,
                               ftol=0.0001,
                               maxiter=None,
                               maxfun=None,
                               full_output=0,
                               disp=1,
                               retall=0,
                               callback=None,
                               direc=None)
    X_at_min = np.matrix(X_at_min.reshape((n, p)))
    show_results(X_at_min, C)
Calculate the gradient for the cost function $ f(X) $, given that the inner product on the space of real $ n \times p $ matrices is defined as \begin{equation} \inner{A}{B} = \trace{A^T B} \end{equation}
Implement a function gradient_for_vector which takes $ X $ as a vector and returns the gradient of $ f(X) $ as a vector.
The definition of the directional derivative is in the slides. A straightforward calculation using the definition gives the directional derivative of the cost function as
\begin{align*} \DD f(X) (\xi) = & \phantom{+} \frac{1}{2} \trace{\xi^T C X N} \\ & + \frac{1}{2} \trace{X^T C \xi N} \\ & + \frac{1}{4} \mu \trace{(- \xi^T X)(N - X^T X)} \\ & + \frac{1}{4} \mu \trace{(- X^T \xi)(N - X^T X)} \\ & + \frac{1}{4} \mu \trace{(N - X^T X)(- \xi^T X)} \\ & + \frac{1}{4} \mu \trace{(N - X^T X)(- X^T \xi)} . \end{align*}
Note: the shortcut is to replace each occurrence of the variable $ X $ in the function $ f(X) $ once with $ \xi $ and then add together all the resulting expressions. Reason: the directional derivative gives the linear approximation of the infinitesimal change of the function $ f(X) $ at $ X $ in direction $ \xi $, so it must be linear in $ \xi $.
The above expression can be simplified by using that, for any matrices $ A, B $ of compatible dimensions,
\begin{align*} \trace{A^T} & = \trace{A}, \\ \trace{A B} & = \trace{B A} . \end{align*}
From this we therefore obtain the simplified form \begin{align*} \DD f(X) (\xi) & = \trace{\xi^T C X N} - \mu \trace{\xi^T X(N - X^T X)} \\ & = \trace{\xi^T \left(C X N - \mu X(N - X^T X) \right)} \end{align*}
For the given inner product $ \inner{A}{B} = \trace{A^T B} $, the gradient will therefore be
\begin{equation} \grad f(X) = C X N - \mu X(N - X^T X) \end{equation}
In [ ]:
# Solution
def gradient_for_matrix(X, C, N, mu):
    """Calculate the gradient for the cost function at a point X
    X  -- current point in matrix form
    C  -- symmetric matrix
    N  -- diagonal matrix
    mu -- scalar
    returns the gradient of the cost function as a matrix
    """
    gradient = C * X * N - mu * X * (N - X.T * X)
    return gradient

def gradient_for_vector(X, C, N, mu, n, p):
    """Calculate the gradient for the cost function at a point X
    X  -- current point as 1-D array
    C  -- symmetric matrix
    N  -- diagonal matrix
    mu -- scalar
    n  -- row dimension of X
    p  -- column dimension of X
    returns the gradient of the cost function as 1-D array
    """
    Xmatrix = np.matrix(X.reshape((n, p)))
    gradient = gradient_for_matrix(Xmatrix, C, N, mu)
    return np.asarray(gradient).reshape((n*p,))
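Before using the gradient in a minimiser, it is worth checking it numerically (an added sketch, assuming the cells above have been run): scipy.optimize.check_grad compares the analytic gradient with a finite-difference approximation and returns the norm of the difference.
In [ ]:
# Numerical gradient check: the returned error should be small
# (on the order of the finite-difference step size).
n, p = 5, 3
mu = 2.0
C = np.matrix(np.random.rand(n, n))
C = C + C.T                      # symmetric C
N = np.matrix(np.diag(np.arange(p, 0, -1, dtype=float)))
X0 = np.random.rand(n * p)
err = opt.check_grad(cost_function_for_vector, gradient_for_vector,
                     X0, C, N, mu, n, p)
print("finite-difference error :", err)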
Use the minimisation functions fmin_cg or fmin_bfgs provided in the Python package scipy.optimize to minimise the cost function cost_function_for_vector, utilising the gradient gradient_for_vector.
Compare the speed of convergence to the minimisation with fmin or fmin_powell.
In [ ]:
# Solution
def normalize_columns(A):
    """Normalise the columns of a 2-D array or matrix to length one
    A -- array or matrix (which will be modified in place)
    """
    if A.ndim != 2:
        raise ValueError("A is not a 2-D array")
    number_of_columns = A.shape[1]
    for i in range(number_of_columns):
        A[:, i] /= np.linalg.norm(A[:, i], ord=2)

def show_results(X_at_min, C):
    """Display the argmin found and compare it with the eigenvectors
    belonging to the smallest eigenvalues of C
    X_at_min -- argument at the minimum found
    C        -- symmetric matrix
    """
    n, p = X_at_min.shape
    normalize_columns(X_at_min)
    # Get the eigenvectors belonging to the smallest eigenvalues
    Eigen_Values, Eigen_Vectors = np.linalg.eig(C)
    Permutation = Eigen_Values.argsort()
    Smallest_Eigenvectors = Eigen_Vectors[:, Permutation[:p]]
    if n < 10:
        print("X_at_min :\n", X_at_min)
        print()
        print("Smallest_Eigenvectors :\n", Smallest_Eigenvectors)
        print()
    else:
        Project_into_Eigenvectorspace = \
            Smallest_Eigenvectors * Smallest_Eigenvectors.T * X_at_min
        Normal_Component = X_at_min - Project_into_Eigenvectorspace
        print("norm(Normal_Component)/per entry :",
              np.linalg.norm(Normal_Component, ord=2) / float(n*p))

def minimise_f_using_fmin_cg(initialise_proc):
    """Run minimisation with the conjugate gradient algorithm."""
    C, N, mu, n, p, X0 = initialise_proc()
    X_at_min = opt.fmin_cg(cost_function_for_vector,
                           X0,
                           fprime=gradient_for_vector,
                           args=(C, N, mu, n, p),
                           gtol=1e-05,
                           norm=2,
                           epsilon=1.49011611938e-08,
                           maxiter=None,
                           full_output=0,
                           disp=1,
                           retall=0,
                           callback=None)
    X_at_min = np.matrix(X_at_min.reshape((n, p)))
    show_results(X_at_min, C)

def minimise_f_using_fmin_bfgs(initialise_proc):
    """Run minimisation with the BFGS algorithm."""
    C, N, mu, n, p, X0 = initialise_proc()
    X_at_min = opt.fmin_bfgs(cost_function_for_vector,
                             X0,
                             fprime=gradient_for_vector,
                             args=(C, N, mu, n, p),
                             gtol=1e-05,
                             norm=2,
                             epsilon=1.49011611938e-08,
                             maxiter=None,
                             full_output=0,
                             disp=1,
                             retall=0,
                             callback=None)
    X_at_min = np.matrix(X_at_min.reshape((n, p)))
    show_results(X_at_min, C)
In [ ]:
# Solution
def initialise_low_dimensional_data():
    """Initialise the data, low dimensions"""
    n = 3
    p = 2
    mu = 2.7
    N = np.matrix(np.diag([2.5, 1.5]))
    E = [1, 2, 3]
    C = random_matrix_from_eigenvalues(E)
    X0 = np.random.rand(n*p)
    return C, N, mu, n, p, X0

def initialise_higher_dimensional_data():
    """Initialise the data, higher dimensions"""
    n = 20
    p = 5
    mu = p + 0.5
    N = np.matrix(np.diag(np.arange(p, 0, -1)))
    E = np.arange(1, n+1)
    C = random_matrix_from_eigenvalues(E)
    X0 = np.random.rand(n*p)
    return C, N, mu, n, p, X0

def run_and_time_all_tests():
    """Run all tests and time them, using a list of function names"""
    List_of_Test_Names = ["minimise_f_using_fmin",
                          "minimise_f_using_fmin_powell",
                          "minimise_f_using_fmin_cg",
                          "minimise_f_using_fmin_bfgs"]
    List_of_Initialisations = ["initialise_low_dimensional_data",
                               "initialise_higher_dimensional_data"]
    for test_name in List_of_Test_Names:
        for init_routine in List_of_Initialisations:
            task_string = test_name + "(" + init_routine + ")"
            line_length = 76
            spaces = 2
            left_padding = (line_length - len(task_string)) // 2
            right_padding = line_length - left_padding - len(task_string)
            print("=" * line_length)
            print("=" * (left_padding - spaces) + " " * spaces + task_string +
                  " " * spaces + "=" * (right_padding - spaces))
            print("=" * line_length)
            # time.clock() was removed in Python 3.8; use perf_counter instead
            start = time.perf_counter()
            exec(task_string)
            run_time = time.perf_counter() - start
            print("run_time :", run_time)

run_and_time_all_tests()