Discussion:

  • in the paper "Linear dimensionality reduction: survey, insights, and generalizations" by Cunningham and Ghahramani, 2015, JMLR [CG15], the authors point out that considering some linear dimensionality reduction algorithms e.g. sparse PCA, maximal autocorrelation factors (MAF), as eigenvalue / generalized-eigenvalue problems is impossible or inaccurate, and instead suggest that we should consider all such problems as optimization problems over matrix manifolds. In most cases, these are optimization problems over orthogonal matrices that minimize simple objectives:

PCA

Find the orthogonal linear projection $f: X \mapsto M^T X$ that minimizes squared reconstruction error, i.e. $$ \text{argmin}_{M \in \mathcal{O}^{d \times r}} \|X - M M^T X \|^2_F$$

MAF

Find the linear projection $M^T X$ that maximizes the correlation between time-adjacent points $$ \text{argmin}_{M \in \mathcal{O}^{d \times r}} \frac{\text{tr}(M^T \Sigma_{\tau} M)}{\text{tr}(M^T \Sigma M)}$$

where

  • $\Sigma$ is the empirical covariance $E(x_t x_t^T) = \frac{1}{n}X X^T$
  • $\Sigma_\tau$ is the symmetrized empirical cross-covariance at time-lag $\tau$, $ \frac{1}{2} (E(x_{t+\tau} x_t^T) + E(x_t x_{t+\tau}^T))$

Comments:

  • Interpretation: maximizes cross-covariance of the transformation (numerator) while avoiding overcounting data that has high power (denominator)
  • Cunningham and Ghahramani note that we can make several variations easily:
    • To maximize the cross-covariance rather than correlation, maximize $\text{tr}(M^T \Sigma_{\tau} M)$ (the eigenvector solution is optimal for this variant)
    • To maximize (or minimize) the squared distance between time-adjacent points, ($E(\|y_{t+tau}-y_t\|^2)$), the objective becomes $\text{tr}(M^T (\Sigma - \Sigma_{\tau}) M)$ (interpretation: discrete-time analog of SFA)
      • See also: Turner and Sahani 2007:
    • If you specify a particular form of temporal structure, then you can seek linear projections containing that structure. Example: to find a linear subspace where linear dynamics is preserved, minimize $\| \dot{X} - MDM^TX\|_F^2$ for some dynamics matrix $D \in \mathbb{R}^{r\times r}$, where we have access to first-derivative data $\dot{X}$
      • This is of particular interest in the context of MSM-construction, since we have a clear idea of the type of dynamics we would like our projections to preserve.
      • We could directly optimize the linear projection to capture Markovian jump dynamics in the projected space
      • See also Churchland et al., 2012: "Neural population dynamics during reaching" (Nature), describes this method for finding projections of the population state that best capture "rotational" information

In [145]:
import autograd
from autograd import grad
from autograd import numpy as np
from autograd.numpy import random as npr
from autograd.numpy import linalg as la

In [5]:
from sklearn.decomposition import PCA

In [2]:
def proj_stiefel(Z,M):
    ''' M is a matrix on the stiefel manifold,
    Z is what we want to project'''
    skew_MTZ = 0.5 * (np.dot(M.T,Z) - np.dot(Z.T,M))
    MMT = np.dot(M,M.T)
    I = np.eye(len(MMT))
    return np.dot(M,skew_MTZ) + np.dot(I - MMT,Z)

In [ ]:
def proj_grad_descent_stiefel(d,r,obj_func,
                              convergence_threshold=0.1,max_iter=100):
    obj_grad = grad(obj_func)
    
    # initialize M in the set of orthonormal matrices of desired dimension, M ∈ O^{d×r}
    
    M = PCA().fit_transform(npr.randn(d,r)) # orthogonalize
    M /= M.sum(1) # normalize
    
    
    
        
    # while obj_func(M) has not converged:
    converged = False
    timed_out=False
    n_iter = 0
    while (not converged) and (not timed_out):
        old_M = M
        
        
        # calculate free gradient of the objective
        free_grad = obj_grad(M)
        
        # calculate search direction (eq. 34)
        proj_M = proj_stiefel(M,free_grad)
        retr_M = retr_stiefel_closed_form(
        
        # while the objective_func(retraction(projection(step))) is not < f(M) - \eps
            
            # adjust step size using linesearch / retraction (eq. 35)
        
        # update M
        
        
        n_iter += 1
        if n_iter >= max_iter:
        if np.mean((new_M - old_M)**2) < thresh:
            converged = True
        
    
    # return M

Notes from "optimization on manifolds" (http://press.princeton.edu/chapters/absil/)

Chapter 1: Intro

Two prototypical structures for problems to be considered:

  • Search space is an embedded sub-manifold of the original vector space (e.g. by imposing a norm equality constraint, inducing a spherical search space)
  • Search space is a quotient manifold of the original vector space

Goal: efficiently translate iterative optimization procedures from their original context (Euclidean vector space) to nonlinear manifolds.

Key concepts from diff-geom --> numerical relaxation:

  • Motion along geodesics --> retraction
  • Parallel transport --> vector transport

Chapter 2: Motivations and Applications

We'll look at a case study, the eigenvalue problem. This is important in many areas, and since eigenvectors are scale-invariant they nicely illustrate the approach in this book

  • Review:

    • Let $\mathbb{F}$ be the field of real or complex numbers
    • Let $A$ be an $n \times n$ matrix with $A_{ij} \in \mathbb{F}$
    • Any nonzero vector $v \in \mathbb{C}^n$ that satisfies $$A v = \lambda v$$ for some $\lambda \in \mathbb{C}$ is an eigenvector of $A$, and $\lambda$ is its associated eigenvalue (with $(\lambda,v)$ called an eigenpair)
    • The set of eigenvalues of $A$ is the spectrum of $A$
    • The eigenvalues of $A$ are the zeros of the characteristic polynomial of $A$ $$\mathcal{P}_A (z) \equiv \det(A - zI)$$
    • Their algebraic multiplicity is their multiplicity as zeros of $\mathcal{P}_A$.
    • If $T$ is an invertible matrix and $(\lambda,v)$ is an eigenpair of $A$, then
      • $(\lambda , Tv)$ is an eigenpair of $TAT^{-1}$,
      • the transformation $A \mapsto TAT^{-1}$ is a similarity transformation of $A$
    • A linear subspace $\mathcal{S}$ of $\mathbb{F}^n$ is a subset of $\mathbb{F}^n$ closed under linear combinations: $$ (ax + by) \in \mathcal{S} \; \forall x,y \in \mathcal{S}, \forall a,b \in \mathbb{F}$$
    • Span. A set $\{y_1,\dots,y_p\}$ of elements of $\mathcal{S}$ such that every element of $\mathcal{S}$ can be written as a linear combination of $y_1,\dots,y_p$ is called a spanning set of $\mathcal{S}$
      • $\mathcal{S}$ is the column space or the span of the $n \times p$ matrix $Y = [y_1,\dots,y_p]$
      • $Y$ spans $\mathcal{S}$
      • $\mathcal{S} = \text{span}(Y) = \{ Y x : x \in \mathbb{F}^p \} = Y \mathbb{F}^p$
    • The set of $p$-dimensional subspaces of $\mathbb{F}^n$ is a Grassman manifold denoted $\text{Grass}(p,n)$
    • The kernel $\text{ker}(B)$ of a matrix $B$ is the subspace formed by the vectors $x$ such that $Bx = 0$.

      ... incomplete notes ...

- The eigenvalue problem as an optimization problem


In [23]:
X = npr.randn(1000,10)
from sklearn.decomposition import PCA
pca = PCA(2)
pca.fit(X)
M = pca.components_.T
M.shape


Out[23]:
(10, 2)

In [15]:
def cov(X):
    ''' compute covariance matrix of X '''
    return np.dot(X,X.T) / len(X)

def t_cov(X,tau=1):
    ''' compute time-lagged covariance matrix of X '''
    E1 = np.dot(X[tau:],X[:-tau].T) / len(X)
    E2 = np.dot(X[:-tau],X[tau:].T) / len(X)
    return 0.5*(E1 + E2)

In [16]:
def MAF_obj(M,cov_X,t_cov_X):
    ''' minimize this'''
    return np.trace(M.T.dot(t_cov_X.dot(M)))

In [24]:
def PCA_obj(M,X):
    return np.sum((X.T - M.dot(M.T.dot(X.T)))**2)

In [28]:
opt = PCA_obj(M,X)
opt


Out[28]:
7901.0659563781683

In [43]:
M,M.shape,(M**2).sum(0),M / (M**2).sum(0)


Out[43]:
(array([[-0.06636459,  0.15036049],
        [ 0.27727964,  0.2738639 ],
        [-0.30247766, -0.15038684],
        [-0.05136056,  0.36339234],
        [ 0.29914065, -0.55265155],
        [ 0.12406988, -0.0604768 ],
        [ 0.1172024 ,  0.08542416],
        [-0.08513059, -0.65586791],
        [ 0.36465999, -0.01944447],
        [-0.7521583 , -0.02829893]]),
 (10, 2),
 array([ 1.,  1.]),
 array([[-0.06636459,  0.15036049],
        [ 0.27727964,  0.2738639 ],
        [-0.30247766, -0.15038684],
        [-0.05136056,  0.36339234],
        [ 0.29914065, -0.55265155],
        [ 0.12406988, -0.0604768 ],
        [ 0.1172024 ,  0.08542416],
        [-0.08513059, -0.65586791],
        [ 0.36465999, -0.01944447],
        [-0.7521583 , -0.02829893]]))

In [72]:
winners = []
for i in range(100):
    M_ = M+npr.randn(*M.shape)*0.001
    for i in range(M_.shape[1]):
        M_[:,i] = M_[:,i] / np.sqrt(sum(M_[:,i]**2))
    M_val = PCA_obj(M_,X)
    if M_val < opt:
        print('omg nooo',M_val)
        winners.append((M_,M_val))


('omg nooo', 7901.0654428110711)
('omg nooo', 7901.0625223425814)
('omg nooo', 7901.0615678569666)
('omg nooo', 7901.0654086907498)
('omg nooo', 7901.0624129863827)
('omg nooo', 7901.0648717670429)
('omg nooo', 7901.0657691569431)
('omg nooo', 7901.0646361123781)
('omg nooo', 7901.0632247688573)
('omg nooo', 7901.064269383347)
('omg nooo', 7901.0633142935258)
('omg nooo', 7901.0606578805236)
('omg nooo', 7901.0656958986874)
('omg nooo', 7901.0651823817861)
('omg nooo', 7901.063804997988)
('omg nooo', 7901.0654964295518)
('omg nooo', 7901.061599159023)
('omg nooo', 7901.0636243430436)
('omg nooo', 7901.0640195981468)
('omg nooo', 7901.0656457135883)
('omg nooo', 7901.0635915276289)

In [ ]:


In [76]:
M_ = winners[0][0]
(M_**2).sum(0)
M_[:,0].dot(M_[:,1]),M[:,0].dot(M[:,1])


Out[76]:
(0.00071477004472439432, 2.7755575615628914e-16)

In [56]:
M_[0]


Out[56]:
array([-0.06604021,  0.15108481])

In [71]:
(M_[:,0]**2).sum(),((M_[:,0] / (M_[:,0]**2).sum())**2).sum()


Out[71]:
(1.001101986953032, 0.99889922608546011)

In [82]:
np.linalg.norm(M_.T.dot(M_) - np.eye(len(M_.T)))


Out[82]:
0.0010108374912272426

In [83]:
np.allclose(M_.T.dot(M_),np.eye(len(M_.T)))


Out[83]:
False

(notes from [CG15] Appendix A)

Optimization over the Stiefel manifold

$\newcommand{\odr}{\mathcal{O}^{d\times r}}$ $\newcommand{\rdr}{\mathbb{R}^{d \times r}}$

Overview: We want to take a free gradient in $\rdr$, project it onto the tangent space $T_M \odr$, then retract from the tangent space to the manifold $\odr$

Tangent space: linear approximation to the manifold at a particular point.

  • Define a curve $\gamma(\cdot)$ on the manifold $ \odr$ as a smooth map $\gamma(\cdot): \mathbb{R} \to \odr$
  • The tangent space $T_M \odr$ is then: $$ \{ \dot{\gamma}(0): \gamma(\cdot) \text{ is a curve on } \odr \text{ with } \gamma(0) = M \}$$ where $\dot{\gamma}$ is the derivative $\frac{d}{dt}\gamma(t)$
  • Or in words: the tangent space at $M$ is the set of first derivatives at $M$ of all curves on the manifold that pass through $M$, i.e. the space of directions along the manifold at point $M$
  • The Stiefel manifold admits some equivalent forms for the tangent space at a point $M$: $$ \begin{align} T_M \odr & = \{ \dot{\gamma}(0): \gamma(\cdot) \text{ is a curve on } \odr \text{ with } \gamma(0) = M \}\\ & = \{ X \in \rdr : M^TX + X^TM = 0 \}\\ & = \{ MA + (I - MM^T)B : A = -A^T,B \in \rdr \} \end{align}$$
  • The third form is particularly useful since it allows us to construct points in the tangent space

Projection from $\rdr$ onto the tangent space $T_M \odr$

  • We would like to project an arbitrary vector $Z \in \mathbb{R}^{d\times r}$ onto the tangent space $T_M \odr$. In other words, given a point not in the tangent space, we would like to find the closest point to it that's in the tangent space.
  • This projection $\pi_M(Z)$ is given by $$ \pi_M(Z) = M \text{skew}(M^TZ) + (I - MM^T)Z$$

Retraction from $T_M \odr$ to $\odr$

  • After computing the gradient $-\nabla_M f$ at a point $M$, we would like to step in the direction of the projected gradient to $M + \beta \pi_M(-\nabla_M f)$ where $\beta$ is the step size

In [ ]:
dot = np.dot

In [153]:
M,M.shape


Out[153]:
(array([[-0.06636459,  0.15036049],
        [ 0.27727964,  0.2738639 ],
        [-0.30247766, -0.15038684],
        [-0.05136056,  0.36339234],
        [ 0.29914065, -0.55265155],
        [ 0.12406988, -0.0604768 ],
        [ 0.1172024 ,  0.08542416],
        [-0.08513059, -0.65586791],
        [ 0.36465999, -0.01944447],
        [-0.7521583 , -0.02829893]]), (10, 2))

In [149]:
nabla = np.zeros(M.shape)
nabla[0,0] = 1
nabla[1,1] = 1

In [154]:
M + 0.01*proj_stiefel(M,-nabla)


Out[154]:
array([[-0.07599905,  0.15063037],
       [ 0.2776812 ,  0.26520679],
       [-0.30259848, -0.15144546],
       [-0.05054947,  0.36427772],
       [ 0.29776045, -0.55352544],
       [ 0.12385823, -0.06037713],
       [ 0.11730727,  0.08590871],
       [-0.08647647, -0.65784612],
       [ 0.36437641, -0.018718  ],
       [-0.75171964, -0.02998469]])

In [136]:
M.T.dot(M)


Out[136]:
array([[  1.00000000e+00,   2.91433544e-16],
       [  2.91433544e-16,   1.00000000e+00]])

In [128]:
def skew(M,Z):
    ''' return the skew-symmetric part of $M^T Z$'''
    return 0.5 * (dot(M.T,Z) - dot(Z.T,M))

def proj_stiefel(M,Z):
    ''' M is a d-by-r point on the stiefel manifold, defining a tangent
    space $T_M \mathcal{O}^{d \times r}$
    $Z \in \mathbb{R}^{d\times r}$ is an arbitrary point 
    we would like to project onto the '''
    MskewMTZ = np.dot(M,skew(M,Z))
    IMMTZ = dot((np.eye(len(M)) - dot(M,M.T)),Z)
    return MskewMTZ + IMMTZ

In [160]:
def retr_stiefel_closed_form(M,Z):
    ''' compute '''
    ZTZ = dot(Z.T,Z)
    return la.inv(la.sqrtm(dot(M+Z,(np.eye(len(ZTZ)) + ZTZ))))

def retr_stiefel_svd(M,Z):
    U,s,V = la.svd(M+Z)
    print(U.shape,V.shape)
    return dot(U,V.T)

In [156]:
retr_stiefel_svd(M,M + 0.01*proj_stiefel(M,-nabla))


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-156-85ab8700ee34> in <module>()
----> 1 retr_stiefel_svd(M,M + 0.01*proj_stiefel(M,-nabla))

<ipython-input-155-c426e82feb6d> in retr_stiefel_svd(M, Z)
      6 def retr_stiefel_svd(M,Z):
      7     U,s,VT = la.svd(M+Z)
----> 8     return dot(U,VT)

/Users/joshuafass/anaconda/envs/py27/lib/python2.7/site-packages/autograd/core.pyc in __call__(self, *args, **kwargs)
    105                         tapes.add(tape)
    106 
--> 107         result = self.fun(*argvals, **kwargs)
    108         if result is NotImplemented: return result
    109         if ops:

ValueError: shapes (10,10) and (2,2) not aligned: 10 (dim 1) != 2 (dim 0)

In [124]:
X.shape, np.dot(X,X.T).shape


Out[124]:
((10, 2), (10, 10))

In [125]:
def proj(X,U):
    skew_XTU = 0.5 * (np.dot(X.T,U) - np.dot(U.T,X))
    XXT = np.dot(X,X.T)
    I = np.eye(len(XXT))
    return np.dot(X,skew_XTU) + np.dot(I - XXT,U)

In [126]:
X = M
Z = M_

In [144]:
proj_stiefel(X,X)


Out[144]:
array([[ -2.08166817e-17,  -5.57279917e-17],
       [ -1.66533454e-16,  -2.03830008e-16],
       [  1.66533454e-16,   1.49186219e-16],
       [ -7.28583860e-17,  -1.82471226e-16],
       [  5.55111512e-17,   1.47451495e-16],
       [ -4.16333634e-17,   2.16840434e-18],
       [ -5.55111512e-17,  -6.41847686e-17],
       [  2.28983499e-16,   2.97938757e-16],
       [ -1.38777878e-16,  -8.41340886e-17],
       [  2.77555756e-16,   2.35922393e-16]])

In [135]:
proj(X,Z)


Out[135]:
array([[ -8.08533463e-04,  -1.74880914e-03],
       [  1.54659220e-03,  -7.71805873e-04],
       [  7.91613473e-04,  -7.55978183e-04],
       [  1.25280879e-03,   4.85741442e-04],
       [ -3.87057535e-05,  -1.46878111e-04],
       [ -4.73537614e-04,  -4.58327309e-04],
       [  5.55372729e-04,  -2.09554031e-05],
       [  2.13110793e-03,  -1.21287630e-04],
       [  2.62155575e-04,   8.22368371e-04],
       [  1.16520863e-04,  -4.75602255e-04]])

In [99]:
X.shape


Out[99]:
(1000, 10)

In [103]:
X.T.shape


Out[103]:
(10, 1000)

In [106]:
np.dot(X.T,X).shape,np.dot(X,X.T).shape


Out[106]:
((10, 10), (1000, 1000))

In [107]:
X.T.dot(X).shape,(X.dot(X.T)).shape


Out[107]:
((10, 10), (1000, 1000))

In [108]:
np.linalg.norm(X),np.trace(np.dot(X,X.T)),np.trace(np.dot(X.T,X))


Out[108]:
(100.96890275582501, 10194.719323715246, 10194.719323715248)

In [89]:
X


Out[89]:
array([[ -1.15793447e+00,   7.08890921e-04,   1.08963956e-02, ...,
          1.34422705e+00,   1.21995753e+00,  -8.89123821e-01],
       [ -1.38337094e+00,  -3.17456729e-01,  -1.61633593e+00, ...,
         -1.53145704e+00,  -5.66074139e-01,   8.15097114e-01],
       [ -7.05693721e-01,   8.58759013e-01,  -2.00323789e+00, ...,
         -4.35369887e-01,   8.36220132e-01,   9.51043165e-01],
       ..., 
       [  4.27508644e-01,   4.50559213e-01,  -1.00726948e+00, ...,
         -1.19187319e+00,   3.18524852e-01,   6.62183770e-02],
       [  6.52649944e-01,   1.61298106e+00,  -4.77056430e-01, ...,
         -9.90813058e-01,   1.67999338e-01,   1.09724008e+00],
       [ -1.38277316e+00,  -1.55336640e+00,   5.09705697e-01, ...,
         -8.89839013e-01,  -6.80742608e-01,   4.95242623e-01]])

In [ ]: