Linear Algebra and Optimisation

COMP4670/8600 - Introduction to Statistical Machine Learning - Tutorial 1B

$\newcommand{\trace}[1]{\operatorname{tr}\left\{#1\right\}}$ $\newcommand{\Norm}[1]{\lVert#1\rVert}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $\newcommand{\DD}{\mathscr{D}}$ $\newcommand{\grad}[1]{\operatorname{grad}#1}$ $\DeclareMathOperator*{\argmin}{arg\,min}$

Setting up the Python environment (do not use pylab)


In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.optimize as opt
import time

%matplotlib inline

Consider the following cost function $ f(X) $ defined over the space of real $ n \times p $ matrices \begin{equation} f(X) = \frac{1}{2} \trace{X^T C X N} + \mu \frac{1}{4} \Norm{N - X^T X}^2_F \end{equation} where $ X \in \RR^{n \times p} $, $ n \ge p $, and the matrix $ C $ is symmetric, such that $ C = C^T $. The scalar $ \mu $ is assumed to be larger than the $p$th smallest eigenvalue of $ C $. The matrix $ N $ is diagonal with distinct positive entries on the diagonal. The trace of a square matrix $ A $ is denoted as $ \trace{A} $, and the Frobenius norm of an arbitrary matrix $ A $ is defined as $ \Norm{A}_F = \sqrt{\trace{A^T A}} $.

Frobenius Norm

Implement a Python function frobenius_norm which accepts an arbitrary matrix $ A $ and returns $ \Norm{A}_F $ using the formula given. (Use numpy.trace and numpy.sqrt.)

  1. Given a matrix $ A \in \RR^{n \times p} $, what is the complexity of your implementation of frobenius_norm using the formula above?
  2. Can you come up with a faster implementation if you were additionally told that $ p \ge n $?
  3. Can you find an even faster implementation than in 1. and 2.?

Solution

Given $A \in \RR^{n \times p} $, $ p \le n $, the straightforward implementation of $ \trace{A^T A} = \trace{A A^T}$ is first multiplying the matrices and then taking the trace. This is of complexity $ O(p^2 n) $ (or even $ O(p n^2) $ if you are not careful). But this trace can be reformulated as \begin{equation} \trace{A^T A} = \sum_{i=1}^p (A^T A)_{i,i} = \sum_{i=1}^p \sum_{j=1}^n \underbrace{(A^T)_{i,j}}_{=A_{j, i}} A_{j,i} = \sum_{i=1}^p \sum_{j=1}^n A_{j,i}^2 \end{equation} So we can implement it with complexity $ O(np)$.


In [ ]:
# Solution
def frobenius_norm(A):
    """Calculate the Frobenius norm of an array or matrix.
    A -- array or matrix
    """
    return np.sqrt((np.asarray(A)**2).sum(axis=None))

M = np.random.rand(5,3)
print(M)
print(frobenius_norm(M))
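
As a quick sanity check (a sketch, not part of the exercise), the trace-based formula and the elementwise sum of squares should give the same value; np.linalg.norm with the 'fro' option provides a third reference.


In [ ]:
# Sketch: compare the trace-based formula with the elementwise implementation
A = np.random.rand(1000, 50)

norm_via_trace   = np.sqrt(np.trace(A.T @ A))   # forms the p x p product A^T A first
norm_elementwise = frobenius_norm(A)            # sums the squared entries, O(np)

print(norm_via_trace, norm_elementwise, np.linalg.norm(A, 'fro'))
print(np.allclose(norm_via_trace, norm_elementwise))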

Cost Function $f(X)$ with matrix argument

Implement the cost function defined as $f(X)$ above as a function cost_function_for_matrix in Python.

Hint: As good programmers, we do not use global variables in subroutines.


In [ ]:
# Solution
def cost_function_for_matrix(X, C, N, mu):
    """
    Calculate the cost function at point X given as a matrix.
    X -- current point in matrix form
    C -- symmetric matrix
    N -- diagonal matrix
    mu -- scalar
    returns the value of the cost function as a scalar.
    """
    if not isinstance(X, np.matrix):
        raise TypeError("X is not a matrix")

    if not isinstance(C, np.matrix):
        raise TypeError("C is not a matrix")

    if not isinstance(N, np.matrix):
        raise TypeError("N is not a matrix")

    r1 = 0.5  * np.trace(X.T * C * X * N)
    r2 = 0.25 * mu * frobenius_norm(N - X.T * X)**2
    return r1 + r2

X = np.matrix(np.random.rand(5,3))
C = np.random.rand(5,5)
C = np.matrix(C+C.T)
N = np.matrix(np.diag(np.random.rand(3)))
print(cost_function_for_matrix(X,C,N,np.random.rand()))

Cost Function $f(X)$ with vector argument

Many standard optimisation routines work only with vectors. Fortunately, as vector spaces, the space of matrices $ \RR^{n \times p} $ and the space of vectors $ \RR^{n p} $ are isomorphic. What does this mean?

Implement the cost function $ f(X) $ given $ X $ as a vector and call it cost_function_for_vector. Which extra arguments do you need for the cost function?


In [ ]:
# Solution
def cost_function_for_vector(X, C, N, mu, n, p):
    """Calculate the cost function at point X given as 1-D array
    X  -- current point as 1-D array
    C  -- symmetric matrix
    N  -- diagonal matrix
    mu -- scalar
    n  -- row dimension of X
    p  -- column dimension of X
    returns the value of the cost function as a scalar
    """
    if not isinstance(X, np.ndarray):
        raise TypeError("X is not an array")

    if X.ndim != 1:
        raise ValueError("X is not a 1-D vector")

    Xmatrix = np.matrix(X.reshape((n, p)))
    return cost_function_for_matrix(Xmatrix, C, N, mu)
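
As a quick check (a sketch on arbitrary random data), the vector version should agree with the matrix version once the vector is reshaped back into an $ n \times p $ matrix; this lossless round trip is exactly the isomorphism between $ \RR^{n p} $ and $ \RR^{n \times p} $.


In [ ]:
# Sketch: the vector and matrix versions of the cost agree
n, p = 5, 3
X  = np.matrix(np.random.rand(n, p))
C  = np.random.rand(n, n)
C  = np.matrix(C + C.T)
N  = np.matrix(np.diag(np.random.rand(p)))
mu = np.random.rand()

x_vec = np.asarray(X).reshape((n * p,))   # flatten the matrix into a 1-D vector
print(cost_function_for_matrix(X, C, N, mu))
print(cost_function_for_vector(x_vec, C, N, mu, n, p))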

Construction of a random matrix $C$ with given eigenvalues

A diagonal matrix has the nice property that the eigenvalues can be directly read off the diagonal. Given a diagonal matrix $ C \in \RR^{n \times n} $ with distinct eigenvalues, how many different diagonal matrices have the same set of eigenvalues?

Given a diagonal matrix $ C \in \RR^{n \times n} $ with distinct eigenvalues, how many different matrices have the same set of eigenvalues?

Given a set of $ n $ distinct real eigenvalues $ \mathcal{E} = \{e_1, \dots, e_n \} $, write a Python function random_matrix_from_eigenvalues which takes a list of eigenvalues $ E $ and returns a random symmetric matrix $ C $ having the same eigenvalues.

Solution

There are $ n! $ permutations of diagonal elements, but infinitely many matrices with the same set of eigenvalues.

In order to construct a random matrix with given eigenvalues $\lambda_i$, $i=1,\dots,n$ we first create a diagonal matrix $ \Lambda $ with those eigenvalues on the diagonal. Then we can get another matrix $ A $ with the same eigenvalues as $ \Lambda $ if we apply an arbitrary nonsingular matrix $ B $ to get $ A = B \Lambda B^{-1} $. (Can you prove that $ A $ and $ \Lambda $ have the same set of eigenvalues?)

If $ B $ is an orthogonal matrix $ Q $, then $ Q^{-1} = Q^T $ and therefore the above formula results in $ A = Q \Lambda Q^T $, which is much faster to calculate than using the inverse of a matrix.

How do we get a random orthogonal matrix? We use the QR-decomposition, which factorises any matrix $ B $ into an orthogonal matrix $ Q $ and an upper-triangular matrix $ R $, $ B = Q R $.

The algorithm therefore is

  1. Choose a random matrix $ B $ by randomly choosing the elements of $ B $.
  2. Calculate the QR-decomposition $ B = Q R $. (Check that $ B $ is nonsingular by verifying that the diagonal of $ R $ has no zero entries.)
  3. Calculate $ A = Q \Lambda Q^T $, the desired random matrix with the same eigenvalues as $ \Lambda $.

Note: $ A $ and $ \Lambda $ share the same *set* of eigenvalues. The order can be arbitrary.


In [ ]:
# Solution
def random_matrix_from_eigenvalues(E):
    """Create a square random matrix with a given set of eigenvalues
    E -- list of eigenvalues
    return the random matrix
    """
    n    = len(E)
    # Create a random orthogonal matrix Q via QR decomposition
    # of a random matrix A
    A    = np.matrix(np.random.rand(n,n))
    Q, R = np.linalg.qr(A)
    #  similarity transformation with orthogonal
    #  matrix leaves eigenvalues intact
    return Q * np.diag(E) * Q.T
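
A short usage sketch: the returned matrix should be symmetric, and np.linalg.eigvalsh should recover the requested eigenvalues (up to ordering, since only the set of eigenvalues is preserved).


In [ ]:
# Sketch: verify the construction recovers the requested eigenvalues
E = [1.0, 2.0, 3.0, 4.0]
C = random_matrix_from_eigenvalues(E)

print(np.allclose(C, C.T))       # symmetry
print(np.linalg.eigvalsh(C))     # ascending order, should be close to E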

Minimising the cost function $f(X)$

Use the minimisation functions fmin or fmin_powell provided in the Python package scipy.optimize to minimise the cost function cost_function_for_vector.

Hint: Use the argument args in the minimisation functions fmin or fmin_powell to provide the extra parameters to your cost function cost_function_for_vector.
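
If the args mechanism is unfamiliar, here is a minimal sketch on a toy quadratic (the offset a is purely illustrative): every entry of args is passed to the cost function after the optimisation variable.


In [ ]:
# Sketch: extra parameters are forwarded to the cost function via args
def toy_cost(x, a):
    return (x[0] - a) ** 2

print(opt.fmin(toy_cost, x0=[0.0], args=(3.0,), disp=0))   # should approach [3.]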


In [ ]:
# Solution
def minimise_f_using_fmin(initialise_proc):
    """Run minimisation with simplex algorithm."""
    C, N, mu, n, p, X0 = initialise_proc()

    X_at_min = opt.fmin(cost_function_for_vector,
                                 X0,
                                 args=(C, N, mu, n, p),
                                 xtol=0.0001,
                                 ftol=0.0001,
                                 maxiter=None,
                                 maxfun=None,
                                 full_output = 0,
                                 disp=1,
                                 retall=0,
                                 callback=None)
    X_at_min = np.matrix(X_at_min.reshape((n, p)))
    show_results(X_at_min, C)


def minimise_f_using_fmin_powell(initialise_proc):
    """Run minimisation with Powell algorithm"""
    C, N, mu, n, p, X0 = initialise_proc()

    X_at_min = opt.fmin_powell(cost_function_for_vector,
                                 X0,
                                 args=(C, N, mu, n, p),
                                 xtol=0.0001,
                                 ftol=0.0001,
                                 maxiter=None,
                                 maxfun=None,
                                 full_output = 0,
                                 disp=1,
                                 retall=0,
                                 callback=None,
                                 direc=None)
    X_at_min = np.matrix(X_at_min.reshape((n, p)))
    show_results(X_at_min, C)

Gradient of $f(X)$

Calculate the gradient for the cost function $f(X)$ given the inner product on the space of real matrices $ n \times p $ is defined as \begin{equation} \inner{A}{B} = \trace{A^T B} \end{equation}

Implement a function gradient_for_vector which takes $ X $ as a vector, and returns the gradient of $ f(X) $ as a vector.

Solution

The definition of the directional derivative is in the slides. A straightforward calculation using the definition gives the directional derivative of the cost function as

\begin{align*} \DD f(X) (\xi) = & \phantom{+} \frac{1}{2} \trace{\xi^T C X N} \\ & + \frac{1}{2} \trace{X^T C \xi N} \\ & + \frac{1}{4} \mu \trace{(- \xi^T X)(N - X^T X)} \\ & + \frac{1}{4} \mu \trace{(- X^T \xi)(N - X^T X)} \\ & + \frac{1}{4} \mu \trace{(N - X^T X)(- \xi^T X)} \\ & + \frac{1}{4} \mu \trace{(N - X^T X)(- X^T \xi)} . \end{align*}

Note that the shortcut is to replace each occurrence of the variable $ X $ in the function $ f(X) $ once with $ \xi $ and then add together all the resulting expressions. Reason: the directional derivative gives the linear approximation of the infinitesimal change of the function $ f(X) $ at $ X $ in direction $ \xi $, so it must be linear in $ \xi $.

The above expression can be simplified by using the following trace identities, which hold whenever the traces are defined:

\begin{align*} \trace{A^T} & = \trace{A}, \\ \trace{A B} & = \trace{B A} \end{align*}

From this we therefore obtain the simplified form \begin{align*} \DD f(X) (\xi) & = \trace{\xi^T C X N} - \mu \trace{\xi^T X(N - X^T X)} \\ & = \trace{\xi^T \left(C X N - \mu X(N - X^T X) \right)} \end{align*}

For the given inner product $ \inner{A}{B} = \trace{A^T B} $, the gradient will therefore be

\begin{equation} \grad f(X) = C X N - \mu X(N - X^T X) \end{equation}

In [ ]:
# Solution
def gradient_for_matrix(X, C, N, mu):
    """Calculate the gradient for the cost function at a point X
    X  -- current point in matrix form
    C  -- symmetric matrix
    N  -- diagonal matrix
    mu -- scalar
    returns the gradient of the cost function as matrix
    """
    gradient = C * X * N - mu * X * (N - X.T * X)
    return gradient

def gradient_for_vector(X, C, N, mu, n, p):
    """Calculate the gradient for the cost function at a point X
    X  -- current point as 1-D array
    C  -- symmetric matrix
    N  -- diagonal matrix
    mu -- scalar
    n  -- row dimension of X
    p  -- column dimension of X
    returns the gradient of the cost function as 1-D array
    """
    Xmatrix = np.matrix(X.reshape((n, p)))
    gradient =  gradient_for_matrix(Xmatrix, C, N, mu)
    return np.asarray(gradient).reshape((n*p,))
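
Before plugging the gradient into an optimiser, it is worth verifying it numerically. The sketch below uses scipy.optimize.check_grad, which returns the norm of the difference between the analytic gradient and a finite-difference approximation; the problem data is random and chosen only for illustration.


In [ ]:
# Sketch: finite-difference check of gradient_for_vector
n, p = 5, 3
mu = 2.0
C  = np.random.rand(n, n)
C  = np.matrix(C + C.T)
N  = np.matrix(np.diag(np.random.rand(p)))
x0 = np.random.rand(n * p)

# should be small (roughly 1e-6 or below for this problem size)
print(opt.check_grad(cost_function_for_vector, gradient_for_vector,
                     x0, C, N, mu, n, p))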

Minimising the cost function $f(X)$ using the gradient

Use the minimisation functions fmin_cg or fmin_bfgs provided in the Python package scipy.optimize to minimise the cost function cost_function_for_vector utilising the gradient gradient_for_vector.

Compare the speed of convergence to the minimisation with fmin or fmin_powell.
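
As an aside (not required for this tutorial), recent versions of scipy expose the same algorithms through the unified scipy.optimize.minimize interface; a sketch of an equivalent BFGS call on a small random instance might look as follows.


In [ ]:
# Sketch: the same minimisation via scipy.optimize.minimize with an analytic gradient
n, p = 3, 2
mu = 2.7
N  = np.matrix(np.diag([2.5, 1.5]))
C  = random_matrix_from_eigenvalues([1, 2, 3])
X0 = np.random.rand(n * p)

result = opt.minimize(cost_function_for_vector, X0,
                      args=(C, N, mu, n, p),
                      jac=gradient_for_vector,
                      method="BFGS")
print(result.fun)
print(np.matrix(result.x.reshape((n, p))))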


In [ ]:
# Solution
def normalize_columns(A):
    """Normalise the columns of a 2-D array or matrix to length one
    A - array or matrix (which will be modified)
    """
    if A.ndim != 2:
        raise ValueError("A is not a 2-D array")

    number_of_columns = A.shape[1]
    for i in range(number_of_columns):
        A[:,i] /= np.linalg.norm(A[:,i], ord=2)


def show_results(X_at_min, C):
    """Display the found arg min and compare with eigenvalues of C
    X_at_min -- argument at the minimum found
    C        -- symmetric matrix
    """
    n,p = X_at_min.shape

    normalize_columns(X_at_min)

    # Get the eigenvectors belonging to the smallest eigenvalues
    Eigen_Values, Eigen_Vectors = np.linalg.eig(C)
    Permutation = Eigen_Values.argsort()
    Smallest_Eigenvectors = Eigen_Vectors[:, Permutation[:p]]

    if n < 10:
        print("X_at_min               :\n", X_at_min)
        print()
        print("Smallest_Eigenvectors  :\n", Smallest_Eigenvectors)
        print()
    else:
        Project_into_Eigenvectorspace = \
          Smallest_Eigenvectors * Smallest_Eigenvectors.T * X_at_min
        Normal_Component = X_at_min - Project_into_Eigenvectorspace

        print("norm(Normal_Component)/per entry :", \
            np.linalg.norm(Normal_Component, ord=2) / float(n*p))



def minimise_f_using_fmin_cg(initialise_proc):
    """Run minimisation with conjugate gradient algorithm"""
    C, N, mu, n, p, X0 = initialise_proc()

    X_at_min = opt.fmin_cg(cost_function_for_vector,
                                 X0,
                                 fprime=gradient_for_vector,
                                 args=(C, N, mu, n, p),
                                 gtol=1e-05,
                                 norm=2,
                                 epsilon=1.49011611938e-08,
                                 maxiter=None,
                                 full_output=0,
                                 disp=1,
                                 retall=0,
                                 callback=None)
    X_at_min = np.matrix(X_at_min.reshape((n, p)))
    show_results(X_at_min, C)



def minimise_f_using_fmin_bfgs(initialise_proc):
    """Run minimisation with BFGS algorithm"""
    C, N, mu, n, p, X0 = initialise_proc()

    X_at_min = opt.fmin_bfgs(cost_function_for_vector,
                                 X0,
                                 fprime=gradient_for_vector,
                                 args=(C, N, mu, n, p),
                                 gtol=1e-05,
                                 norm=2,
                                 epsilon=1.49011611938e-08,
                                 maxiter=None,
                                 full_output=0,
                                 disp=1,
                                 retall=0,
                                 callback=None)
    X_at_min = np.matrix(X_at_min.reshape((n, p)))
    show_results(X_at_min, C)

Minima of $f(X)$

Compare the columns $x_1,\dots, x_p$ of the matrix $X^\star$ which minimises $ f(X) $ \begin{equation} X^\star = \argmin_{X \in \RR^{n \times p}} f(X) \end{equation}

with the eigenvectors corresponding to the smallest eigenvalues of $ C $.

Solution

The minima of the given cost function are matrices $ X $ whose columns are the $ p $ eigenvectors of $ C $ associated with the $ p $ smallest eigenvalues of $ C $. The order of the eigenvectors in the minimiser $ X $ is determined by the order of the diagonal elements of $ N $.
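
The sketch below (not part of the required solution) checks this on a small instance: after minimising with fmin_bfgs and normalising the columns, the absolute inner products between the columns of $ X^\star $ and the $ p $ eigenvectors of $ C $ with smallest eigenvalues should be close to a permutation matrix, provided the optimiser reached the global minimum.


In [ ]:
# Sketch: columns of the minimiser align with eigenvectors of C (up to sign)
n, p = 3, 2
mu = 2.7
N  = np.matrix(np.diag([2.5, 1.5]))
C  = random_matrix_from_eigenvalues([1, 2, 3])
X0 = np.random.rand(n * p)

x_min = opt.fmin_bfgs(cost_function_for_vector, X0,
                      fprime=gradient_for_vector,
                      args=(C, N, mu, n, p), disp=0)
X_min = np.matrix(x_min.reshape((n, p)))
normalize_columns(X_min)

eig_values, eig_vectors = np.linalg.eigh(C)    # eigenvalues in ascending order
# entries should be close to 0 or 1 (ideally a permutation matrix)
print(np.abs(eig_vectors[:, :p].T * X_min))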


In [ ]:
# Solution

def initialise_low_dimensional_data():
    """Initialise the data, low dimensions"""
    n = 3
    p = 2
    mu = 2.7

    N = np.matrix(np.diag([2.5, 1.5]))
    E = [1, 2, 3]
    C = random_matrix_from_eigenvalues(E)
    X0 = np.random.rand(n*p)

    return C, N, mu, n, p, X0


def initialise_higher_dimensional_data():
    """Initialise the data, higher dimensions"""
    n  = 20
    p  =  5
    mu = p + 0.5

    N = np.matrix(np.diag(np.arange(p, 0, -1)))
    E = np.arange(1, n+1)
    C = random_matrix_from_eigenvalues(E)
    X0 = np.random.rand(n*p)

    return C, N, mu, n, p, X0


def run_and_time_all_tests():
    """Run all test and time them using a list of function names"""
    List_of_Test_Names = ["minimise_f_using_fmin",
                 "minimise_f_using_fmin_powell",
                 "minimise_f_using_fmin_cg",
                 "minimise_f_using_fmin_bfgs"]

    List_of_Initialisations = ["initialise_low_dimensional_data",
                               "initialise_higher_dimensional_data"]

    for test_name in List_of_Test_Names:
        for init_routine in List_of_Initialisations:
            task_string  = test_name + "(" + init_routine + ")"
            line_length  = 76
            spaces       = 2
            left_padding = (line_length - len(task_string)) // 2
            right_padding = line_length - left_padding - len(task_string)
            print("=" * line_length)
            print("=" * (left_padding - spaces) + " " * spaces + task_string + \
                " " * spaces + "=" * (right_padding - spaces))
            print("=" * line_length)

            start = time.perf_counter()
            exec(task_string)
            run_time = time.perf_counter() - start
            print("run_time :", run_time)

run_and_time_all_tests()
