CSE 6040, Fall 2015 [29]: PCA + SVD

The main topic of this class was a data analysis method referred to as Principal Components Analysis (PCA). The method requires computing the eigenvectors of a certain matrix; one way to compute those eigenvectors is to use a special factorization from linear algebra called the Singular Value Decomposition (SVD).

We covered these topics in class mostly on the whiteboard. What follows is a quick summary of the main ideas, taken from my written notes.

Motivation: Data "compression"

In previous classes, we've looked at 3 major tasks in data analysis: regression, classification, and clustering. Beyond these, the last problem area we'll consider in this class is what we'll call compression.

At a high level, the term compression simply refers to finding any compact representation of the data. Such representations can help us in two ways. First, they can make the data set smaller and therefore faster to process or analyze. Second, a clever choice of representation can reveal hidden structure.

As a concrete example, the specific task we discussed in the last class was one of dimensionality reduction: given a $d$-dimensional data set, we wish to transform it into a $k$-dimensional data set where $k \leq d$.

Choosing the $k$ axes in a clever way might even reveal structure that is hard to see in all $d$ original dimensions. For instance, see the 3-D example at the "visualizing PCA" website we explored during this class, where the first principal component axis reveals 3 sub-clusters: http://setosa.io/ev/principal-component-analysis/

PCA: Definitions

Input data matrix, centered. Per our usual conventions, let $x_0, \ldots, x_{m-1}$ be the input data where $x_i \in \mathbb{R}^d$ is a $d$-dimensional real-valued vector. As usual, we can stack these into a data matrix, denoted $X \equiv \left(\begin{array}{c} x_0^T \\ \vdots \\ x_{m-1}^T \end{array}\right)$.

However, we'll add one more important assumption: these data should be centered about their mean, i.e., $\frac{1}{m} \sum_{i=0}^{m-1} x_i = 0$. If the data are not, preprocess them accordingly.
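In NumPy, this centering is a one-liner. Here is a minimal sketch (the array `X` below is a made-up toy data set, not anything from class):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(100, 3)        # toy data: m=100 points in d=3 dimensions
X = X - X.mean(axis=0)             # subtract the per-coordinate mean from every row
print(X.mean(axis=0))              # now ~0 in every coordinate
```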

Projections. Let $\varphi \in \mathbb{R}^d$ be a vector of unit length, i.e., $\|\varphi\|_2^2 = \varphi^T \varphi = 1$. The projection of a data point $x_i$ onto $\varphi$ is $x_i^T \varphi$, which gives the (signed) length of $x_i$'s component along the direction $\varphi$.

Let $J(\varphi)$ be a cost function that is proportional to the sum of squared projections of the data onto $\varphi$:

$$ \begin{array}{rcl} J(\varphi) & \equiv & \displaystyle \frac{1}{2m} \sum_{i=0}^{m-1} (x_i^T \varphi)^2. \end{array} $$

Let's also apply some algebra-fu to the right-hand side to put it into a more concise matrix form:

$$ \begin{array}{rcl} J(\varphi) & = & \displaystyle \frac{1}{2} \varphi^T \left(\frac{1}{m} \sum_{i=0}^{m-1} x_i x_i^T \right) \varphi \\ & = & \displaystyle \frac{1}{2} \varphi^T \left(\frac{1}{m} X^T X\right) \varphi \\ & \equiv & \displaystyle \frac{1}{2} \varphi^T C \varphi. \end{array} $$

In the last step, we defined $C \equiv \frac{1}{m} X^T X$; in statistics, $C$ is also known as the sample covariance matrix of the data. (Recall that the data are centered about their mean.)
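To make this concrete, here is a small NumPy sketch (toy data, hypothetical variable names) that checks the two forms of $J(\varphi)$ agree:

```python
import numpy as np

np.random.seed(0)
m, d = 100, 3
X = np.random.randn(m, d)
X = X - X.mean(axis=0)                         # centered data

C = X.T @ X / m                                # sample covariance matrix (d x d)

phi = np.ones(d) / np.sqrt(d)                  # some unit-length direction
J_sum = (1.0 / (2*m)) * np.sum((X @ phi)**2)   # (1/2m) * sum of squared projections
J_mat = 0.5 * phi @ C @ phi                    # (1/2) * phi^T C phi
print(np.isclose(J_sum, J_mat))                # True: the two forms agree
```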

Principal components via maximizing projections. There are several ways to formulate the PCA problem. Here we consider the one based on maximizing projections.

Start by defining a principal component of the data $X$ to be a vector, $\varphi$, of unit length that maximizes the sum of squared projections.

To convert this definition into a formal problem, you can use the method of Lagrange multipliers, which may be applied to any minimization or maximization problem that has equality constraints. The idea is to modify the cost function in a way that effectively incorporates each constraint: for each constraint, you add to the cost function an extra term, proportional to a dummy parameter times a form of the constraint.

Huh? It's easiest to see this by example: in the case of a principal component, the modified cost function is

$$ \begin{array}{rcl} \hat{J}(\varphi, \lambda) & \equiv & \displaystyle J(\varphi) + \frac{\lambda}{2} (1 - \varphi^T \varphi), \end{array} $$

where the second term captures the constraint: it introduces a dummy optimization parameter, $\lambda$, times the constraint that $\varphi$ has unit length, i.e., $\|\varphi\|_2^2 = \varphi^T \varphi = 1$, or $1 - \varphi^T \varphi = 0$.

The reason to add the constraint in this way will become clear momentarily.

As for the factor of $\frac{1}{2}$, it is there solely for aesthetic reasons, as it cancels out later on.

The optimization task is then to find the $\varphi_*$ and $\lambda_*$ such that

$$ \begin{array}{rcl} (\varphi_*, \lambda_*) & \equiv & \displaystyle \underset{\varphi, \lambda}{\arg\max} \, \hat{J}(\varphi, \lambda). \end{array} $$

To find its solution, you just need to "take derivatives" of $\hat{J}$ with respect to $\varphi$ and $\lambda$, and then set these derivatives to 0. You can show that

$$ \begin{array}{rcl} \nabla_\varphi \hat{J} & = & C \varphi - \lambda \varphi \\ \displaystyle \frac{\partial}{\partial \lambda} \hat{J} & = & \frac{1}{2} (1 - \varphi^T \varphi). \end{array} $$
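As a quick sanity check of these derivative formulas, you can compare them against finite differences on toy values (a sketch; `phi` and `lam` below are arbitrary test points, not solutions):

```python
import numpy as np

np.random.seed(0)
m, d = 50, 4
X = np.random.randn(m, d)
X = X - X.mean(axis=0)
C = X.T @ X / m

def J_hat(phi, lam):
    return 0.5 * phi @ C @ phi + 0.5 * lam * (1.0 - phi @ phi)

phi, lam = np.random.randn(d), 0.7
grad_phi = C @ phi - lam * phi                           # analytic gradient w.r.t. phi
dJ_dlam = 0.5 * (1.0 - phi @ phi)                        # analytic partial w.r.t. lambda

eps = 1e-6
num_grad = np.array([(J_hat(phi + eps*e, lam) - J_hat(phi - eps*e, lam)) / (2*eps)
                     for e in np.eye(d)])
num_dlam = (J_hat(phi, lam + eps) - J_hat(phi, lam - eps)) / (2*eps)

print(np.allclose(grad_phi, num_grad), np.isclose(dJ_dlam, num_dlam))   # True True
```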

Setting these to zero and solving yields the following computational problem:

$$ \begin{array}{rcl} C \varphi = \frac{1}{m} X^T X \varphi & = & \lambda \varphi \\ \| \varphi \|_2^2 & = & 1. \end{array} $$

Is it now clear why the constraint was incorporated into $\hat{J}$ as it was? Doing so produces a second equation that exactly captures the constraint!

This problem is an eigenproblem, which is the task of computing an eigenvalue and its corresponding eigenvector of $C = \frac{1}{m} X^T X$.

The matrix $C$ will usually have many eigenvalues and eigenvectors. So which one do you want? Plug the eigenvector back into the original cost function. Then, $J(\varphi) = \frac{1}{2} \varphi^T C \varphi = \frac{\lambda}{2} \varphi^T \varphi = \frac{\lambda}{2}$. In other words, to maximize $J(\varphi)$ you should pick the $\varphi$ with the largest eigenvalue $\lambda$.

Finding an eigenpair via the SVD

So how do you find the eigenvectors of $C$? That is, what algorithm will compute them?

One way is to form $C$ explicitly and then call an off-the-shelf eigensolver. However, forming $C$ explicitly from the data $X$ may be costly in time and storage, not to mention possibly less accurate. (Recall the condition number blow-up problem in the case of solving the normal equations.)
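For completeness, here is what that direct approach looks like in NumPy (a minimal sketch on toy data; fine for small $d$, but subject to the caveats above):

```python
import numpy as np

np.random.seed(0)
m, d = 200, 5
X = np.random.randn(m, d)
X = X - X.mean(axis=0)

C = X.T @ X / m                          # form the covariance matrix explicitly
evals, evecs = np.linalg.eigh(C)         # eigh: symmetric eigensolver; eigenvalues in ascending order

lam = evals[-1]                          # largest eigenvalue
phi = evecs[:, -1]                       # corresponding unit-length eigenvector
print(np.allclose(C @ phi, lam * phi))   # True: (phi, lam) solves C phi = lam phi
```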

Instead, we can turn to the "Swiss Army knife" of linear algebra, which is the singular value decomposition, or SVD. It is an extremely versatile tool for simplifying linear algebra problems. It can also be somewhat expensive to compute accurately, but a lot of scientific and engineering effort has gone into building robust and reasonably efficient SVD algorithms. So let's assume these exist -- and they do in both Numpy and Scipy -- and use them accordingly.

The SVD. Every real-valued matrix $X \in \mathbb{R}^{m \times d}$ has a singular value decomposition. Let $k = \min(m, d)$. The SVD of $X$ is the factorization, $X = U \Sigma V^T$, where $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{d \times k}$ have orthonormal columns ($U^T U = I$ and $V^T V = I$) and $\Sigma$ is a $k \times k$ diagonal matrix whose entries are $\sigma_0 \geq \sigma_1 \geq \cdots \geq \sigma_{k-1} \geq 0$.

Usually, $m \geq d$ so $k = d$. We'll assume that in what follows.

Consider the columns of $U$ and $V$:

$$ \begin{array}{rcl} U & = & \left(\begin{array}{cccc} u_0 & u_1 & \cdots & u_{d-1} \end{array}\right) \\ V & = & \left(\begin{array}{cccc} v_0 & v_1 & \cdots & v_{d-1} \end{array}\right). \end{array} $$

The columns of $U$ and of $V$ are also referred to as the left singular vectors and right singular vectors, respectively.

From these definitions, the SVD says that $X V = U \Sigma$. This form is just a compact way of writing down a system of independent vector equations,

$$ \begin{array}{rcl} X v_i & = & \sigma_i u_i. \end{array} $$
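In NumPy, `np.linalg.svd` computes this factorization. A small sketch on toy data, checking the vector equations above:

```python
import numpy as np

np.random.seed(0)
m, d = 200, 5
X = np.random.randn(m, d)
X = X - X.mean(axis=0)

# "Thin" SVD: U is m x k, s holds the singular values, Vt = V^T is k x d, with k = min(m, d)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.all(np.diff(s) <= 0))                   # singular values come back in descending order
i = 0
print(np.allclose(X @ Vt[i], s[i] * U[:, i]))    # X v_i = sigma_i u_i
```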

Recall that in PCA, you care about $C = \frac{1}{m} X^T X$. In terms of the SVD,

$$X^T X = V \Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T,$$

or

$$X^T X V = V \Sigma^2.$$

This relation may in turn be rewritten as the system of vector equations,

$$ \begin{array}{rcl} X^T X v_i & = & \sigma_i^2 v_i. \end{array} $$

In other words, every pair $(\varphi, \lambda) \equiv \left(v_i, \frac{\sigma_i^2}{m}\right)$ is a potential solution to the eigenproblem, $C\varphi = \frac{1}{m} X^T X \varphi = \lambda \varphi$. The pair with the largest eigenvalue is $\left(v_0, \frac{\sigma_0^2}{m}\right)$.
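A short numerical check of this claim (again on made-up toy data):

```python
import numpy as np

np.random.seed(0)
m, d = 200, 5
X = np.random.randn(m, d)
X = X - X.mean(axis=0)

C = X.T @ X / m
U, s, Vt = np.linalg.svd(X, full_matrices=False)

phi, lam = Vt[0], s[0]**2 / m                      # leading right singular vector and sigma_0^2 / m
print(np.allclose(C @ phi, lam * phi))             # an eigenpair of C ...
print(np.isclose(lam, np.linalg.eigvalsh(C)[-1]))  # ... and indeed the one with the largest eigenvalue
```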

Rank-$k$ approximations: the truncated SVD

We motivated PCA by asking for a single vector $\varphi$, which effectively projects the data onto a one-dimensional subspace (i.e., a line). You might instead want to represent the original $d$-dimensional data points on a $k$-dimensional surface or subspace, where $k \leq d$. As the previous discussion suggests, you should probably choose the top-$k$ eigenvectors, $v_0, \ldots, v_{k-1}$.

Indeed, there is another "principled" reason for this choice.

Let $A \in \mathbb{R}^{m \times d}$ be any matrix with an SVD given by $A = U \Sigma V^T$.

Let's furthermore define the $k$-truncated SVD, for any $k \leq d$, as the product $U_k \Sigma_k V_k^T$, where $U_k$, $\Sigma_k$, and $V_k$ consist of the singular vectors and values corresponding to the $k$ largest singular values. That is, $U_k$ is the first $k$ columns of $U$, $V_k$ is the first $k$ columns of $V$, and $\Sigma_k$ is the upper-left $k \times k$ submatrix of $\Sigma$.

Now consider the following alternative way to write the SVD:

$$ A = U \Sigma V^T = \sum_{i=0}^{d-1} u_i \sigma_i v_i^T. $$

Each term, $u_i \sigma_i v_i^T$, is known as a rank-$1$ product. So the existence of the SVD means that $A$ may be written as a sum of rank-1 products.

It would be natural to try to approximate $A$ by truncating the SVD after $k$ terms, i.e.,

$$ A \approx U_k \Sigma_k V_k^T = \sum_{i=0}^{k-1} u_i \sigma_i v_i^T. $$

And in fact, there is no rank-$k$ approximation of $A$ that is better than this one! In particular, consider any pair of matrices $Y_k \in \mathbb{R}^{m \times k}$ and $Z_k \in \mathbb{R}^{d \times k}$, each having $k$ columns; their product, $Y_k Z_k^T$, has rank at most $k$. Then there is a theorem (the Eckart-Young theorem) that says the smallest difference between $A$ and any such rank-$k$ product $Y_k Z_k^T$, measured in the Frobenius norm, is

$$ \min_{Y_k, Z_k} \|A - Y_k Z_k^T\|_F^2 = \|A - U_k \Sigma_k V_k^T\|_F^2 = \sigma_k^2 + \sigma_{k+1}^2 + \cdots + \sigma_{d-1}^2. $$

In other words, the truncated SVD gives the best rank-$k$ approximation to $A$ in the Frobenius norm. Moreover, the error of the approximation is the sum of the squares of all the smallest $d-k$ singular values.
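Here is a quick numerical illustration of this error formula (a sketch; `A` is a random toy matrix):

```python
import numpy as np

np.random.seed(0)
m, d = 200, 8
A = np.random.randn(m, d)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]           # k-truncated SVD: sum of the first k rank-1 terms

err = np.linalg.norm(A - A_k, 'fro')**2            # squared Frobenius error of the approximation
print(np.isclose(err, np.sum(s[k:]**2)))           # True: equals the sum of the d-k smallest sigma_i^2
```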

Applied to the covariance matrix, we may conclude that $C = \frac{1}{m} X^T X \approx \frac{1}{m} V_k \Sigma_k^2 V_k^T$ is in fact the best rank-$k$ approximation of $C$, which justifies choosing the $k$ eigenvectors corresponding to the top $k$ eigenvalues of $C$ as the principal components.

Summary: Solving the PCA problem

Based on the preceding discussion, here is the basic algorithm to compute the PCA, given the data $X$ and the desired dimension $k$ of the subspace (a minimal NumPy sketch follows these steps).

  1. If the data are not already centered, transform them so that they have a mean of 0 in all coordinates, i.e., $\displaystyle \frac{1}{m} \sum_{i=0}^{m-1} x_i = 0$.
  2. Compute the $k$-truncated SVD, $\displaystyle X \approx U_k \Sigma_k V_k^T$.
  3. Choose $v_0, v_1, \ldots, v_{k-1}$ to be the principal components.
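Here is a minimal NumPy sketch of these three steps (the function name `pca` and its interface are just illustrative choices, not a prescribed API):

```python
import numpy as np

def pca(X, k):
    """Return the top-k principal components of X and the data's coordinates along them.

    X : (m x d) array whose rows are the data points x_0, ..., x_{m-1}.
    k : desired dimension of the subspace (k <= d).
    """
    Xc = X - X.mean(axis=0)                             # step 1: center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # step 2: (thin) SVD of the centered data
    components = Vt[:k]                                 # step 3: v_0, ..., v_{k-1}, as rows
    projected = Xc @ components.T                       # each point's k coordinates along the PCs
    return components, projected

# Toy usage
np.random.seed(0)
X = np.random.randn(100, 5)
V_k, Y = pca(X, k=2)
print(V_k.shape, Y.shape)                               # (2, 5) (100, 2)
```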