In Chapter 4, Raschka covered a number of methods for reducing the dimensions of a dataset through feature selection, including L1 regularization, SBS, and random forests. In different ways, these algorithms accomplish the same goal of reducing the feature space of a dataset by eliminating those features that provide the least amount of information to a classifier (or whose absence has the least impact on model performance). These techniques can help improve the efficiency of a model, or combat the "curse of dimensionality" when using nearest-neighbor methods.
As we saw when dealing with missing data, however, eliminating entire features from the sample space can also remove too much information, increasing a classifier's bias to an unacceptable degree. How might we reduce the dimensions of a dataset while preserving the necessary information?
In this chapter, Raschka covers three techniques for reducing the feature space of a dataset through transformation (also referred to as dimensionality reduction or compression):

1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
3. Kernel Principal Component Analysis (kernel PCA)
These techniques can drastically reduce the dimensions of a dataset while preserving as much information as possible.
Throughout, I supplement notes on Raschka's writing with insight from An Introduction to Statistical Learning, which covers the math behind these techniques in greater detail.
Principal Component Analysis (PCA) is an unsupervised algorithm for dimensionality reduction, meaning that it learns its transformation from the data alone, without any class labels. Using magic (AKA linear algebra), PCA identifies patterns in a dataset based on correlations between features, and transforms the data in such a way that the new dataset retains as much of the old dataset's variance as possible.
In PCA, "variance" refers to a different phenomenon than it does when we use it to analyze classifiers. Here, we're interested in maximizing the statistical variance of our dataset: the amount of "spread" represented by each feature.
At a high level, PCA works by finding the "directions" in the data that account for most of its variance, subject to the constraint that the directions not be correlated (i.e., that they are perpendicular). Put another way, PCA looks for a set of new axes that we can project the data onto while preserving as much spread as possible. We call the new axes (or directions) "principal components." One of the nicest parts of PCA is that it finds directions in declining order of importance, and assigns a "score" to each direction: because of this, we can easily choose however many dimensions we want to project our data onto.
Mathematically, we can define our goal as producing a $k$-dimensional subspace of a $d$-dimensional feature space by finding a $d \times k$ transformation matrix, $W$, such that:

$$ X \cdot W = z $$

Where $X$ is our dataset in $d$ dimensions, and $z$ is a dataset in $k$ dimensions.
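Before getting into how we find $W$, here's a minimal sketch of the shapes involved in that projection, using NumPy and a random placeholder in place of the real transformation matrix (only the dimensions matter here):

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, d, k = 150, 13, 2              # e.g. 13 features squeezed down to 2
X = rng.normal(size=(n_samples, d))       # stand-in for a standardized dataset
W = rng.normal(size=(d, k))               # stand-in for the d x k transformation matrix

Z = X @ W                                 # every sample lands in k dimensions
print(Z.shape)                            # (150, 2)
```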
In broad strokes, the steps to perform PCA are as follows:

1. Standardize the $d$-dimensional dataset.
2. Construct the covariance matrix of the standardized features.
3. Decompose the covariance matrix into its eigenvectors and eigenvalues.
4. Select the $k$ eigenvectors that correspond to the $k$ largest eigenvalues.
5. Construct the $d \times k$ transformation matrix $W$ from those eigenvectors.
6. Transform the dataset $X$ using $W$ to obtain the new $k$-dimensional feature subspace.
To understand this algorithm, we'll need to briefly cover some explanatory background on covariance and eigenvectors.
Covariance is a statistic that quantifies how closely related two features are to each other. We define it mathematically as the mean product of the differences from the means of two features (here, $x$ and $y$, but when considering our dataset we'll represent the features as $X_{i}$ and $X_{j}$). That's a mouthful, but it's a lot simpler in algebraic notation:
$$ cov(x, y) = \sum_{i=1}^{n} \frac{(x_{i} - \mu_{x})(y_{i} - \mu_{y})}{n} $$

Where $\mu_{x}$ and $\mu_{y}$ are the means of each feature. Since in PCA we standardize the features prior to analyzing their covariances, the definition is even simpler:
$$ cov(x_{std}, y_{std}) = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}_{std} y^{(i)}_{std} $$

By taking the product of the differences from the means (phew), we can neatly capture two properties of the relationship between the features: 1) what direction the features trend in (think about the quadrants of a coordinate plane, centered around the mean) and 2) how tightly they cluster together (or: do values of $x$ far from $\mu_{x}$ imply values of $y$ far from $\mu_{y}$?).
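Here's a minimal sketch of that definition on two made-up features, computing the covariance of the standardized values by hand and checking it against NumPy's `np.cov`:

```python
import numpy as np

# Two made-up features with a loose positive relationship.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Standardize each feature (zero mean, unit variance).
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()

# Covariance of standardized features: the mean of the elementwise products.
cov_xy = np.mean(x_std * y_std)

# np.cov divides by n-1 by default; bias=True matches the 1/n definition above.
print(cov_xy, np.cov(x_std, y_std, bias=True)[0, 1])
```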
The covariance matrix of a dataset is a square matrix that records the covariance between every possible pair of features in the dataset. For a dataset in $d$ dimensions, with features in the sequence $\{1, 2, 3, ..., d\}$, we can represent the covariance matrix $\Sigma$ as:
$$ \Sigma = \begin{bmatrix} cov(1,1) & cov(1,2) & cov(1,3) & ... & cov(1,d) \\ cov(2,1) & cov(2,2) & cov(2,3) & ... & cov(2,d) \\ cov(3,1) & cov(3,2) & cov(3,3) & ... & cov(3,d) \\ ... & ... & ... & ... & ... \\ cov(d,1) & cov(d,2) & cov(d,3) & ... & cov(d,d) \end{bmatrix} $$

The eigenvectors of a matrix transformation are vectors whose directions don't change when the transformation is applied to them. While they don't change direction, they may change length: that is, they may get scaled by a constant factor, which we refer to as the eigenvalue of a given eigenvector.
Formally, we can say that for a given matrix $\Sigma$, its eigenvectors are defined as the set of all vectors $v$ that satisfy the relationship:
$$ \Sigma \cdot v = \lambda v $$

Where $\lambda$ is a constant scalar (the vector's eigenvalue).
To find solutions to the equation above, we can begin by rearranging terms (noting that the scalar multiplication $\lambda v$ is equivalent to the matrix multiplication $\lambda I \cdot v$, where $I$ is the identity matrix with the same dimensions as $\Sigma$):
$$ (\Sigma - \lambda I) \cdot v = 0 $$

And since a matrix-vector multiplication can only produce the zero vector for a nonzero $v$ when the determinant of the matrix is 0, it follows that:
$$ det(\Sigma - \lambda I) = 0 $$

We solve this characteristic equation for $\lambda$ (ignoring the trivial solution $v = 0$, since that's not a very interesting case) and then use each eigenvalue to find its corresponding eigenvector.
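Here's a small sketch of that relationship, using a made-up 2x2 covariance matrix and letting NumPy's eigendecomposition stand in for solving the characteristic equation by hand:

```python
import numpy as np

# A made-up 2x2 covariance matrix for two positively correlated features.
sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# eigh is the right tool here, since covariance matrices are always symmetric.
eigenvalues, eigenvectors = np.linalg.eigh(sigma)
print(eigenvalues)                            # [0.2, 1.8]

# Verify the defining relationship: sigma . v = lambda * v for each pair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(sigma @ v, lam * v))    # True, True
```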
So why do we care about the eigenvectors of our dataset's covariance matrix with the $k$ largest eigenvalues? What does that do for us?
In PCA, the eigenvectors of the covariance matrix with the highest eigenvalues represent the directions in the data that maximize variance and are perpendicular to one another. If we treat those directions as axes, we can project our data onto them and preserve as much variance as possible!
The geometric interpretation of PCA is still slightly beyond me, but its effect is magical: we can squish the data onto a smaller coordinate space that still captures the most relevant information. (We can also use these transformed values as "scores" to interpret general directions in the data, a concept that we'll explore in more detail in the exercises for this chapter.)
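To make the whole procedure concrete, here's a sketch of PCA from scratch following the steps listed earlier, using scikit-learn's built-in Wine dataset as a stand-in and checking the result against scikit-learn's PCA (the columns may differ by sign, since an eigenvector is only defined up to its sign):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1. Standardize the d-dimensional dataset.
X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# 2. Construct the covariance matrix (rowvar=False: columns are features).
cov_mat = np.cov(X_std, rowvar=False)

# 3. Decompose it into eigenvalues and eigenvectors.
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

# 4./5. Keep the eigenvectors with the k largest eigenvalues as the columns of W.
k = 2
order = np.argsort(eigen_vals)[::-1]
W = eigen_vecs[:, order[:k]]              # d x k transformation matrix

# 6. Project the standardized data onto the new axes.
Z = X_std @ W
print(Z.shape)                            # (178, 2)

# Sanity check against scikit-learn; True if the two agree up to per-column signs.
Z_sk = PCA(n_components=k).fit_transform(X_std)
print(np.allclose(np.abs(Z), np.abs(Z_sk)))
```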
One drawback of PCA is that it is unsupervised: it doesn't learn its transformation matrix based on class labels; it simply finds the directions in the data that maximize variance. Often, this is good enough for the purpose of improving computational efficiency.
But what if we wanted to reduce dimensions based on the separation between our classes, rather than the variance in our data? Linear Discriminant Analysis (LDA for short) is a supervised algorithm that does just that: it searches the set of possible feature subspaces for the one that maximizes class separability, assuming a linear decision surface. (We'll talk about nonlinear decision surfaces in the next section.)
LDA makes two important assumptions about your dataset. It assumes that:

1. The data is normally distributed.
2. The classes have identical covariance matrices.
In real-world applications, these assumptions rarely hold. However, Raschka claims that LDA can still work "reasonably well" if these assumptions are violated "to a certain extent". The gist of the argument is that LDA when used as a classifier requires these properties of a dataset; when we're simply using it for dimensionality reduction, we can be looser with the prerequisites. I haven't seen the proof, so I remain skeptical.
The LDA algorithm is very similar to PCA, with some minor modifications. The steps include:

1. Standardize the $d$-dimensional dataset.
2. Compute the mean vector for each class.
3. Construct the within-class scatter matrix, $S_{W}$, and the between-class scatter matrix, $S_{B}$.
4. Decompose $S_{W}^{-1} S_{B}$ into its eigenvectors and eigenvalues.
5. Select the $k$ eigenvectors that correspond to the $k$ largest eigenvalues.
6. Construct the $d \times k$ transformation matrix $W$ from those eigenvectors.
7. Transform the dataset $X$ using $W$.
As you can see, the distinguishing aspect of the algorithm comes in steps 2-4: instead of decomposing the covariance matrix to create $W$, we use a strange combination of scatter matrices.
The mean vector (we'll call it $m_{i}$) stores the mean of the features for a given class $i$.
We can represent the mean for a given class $i$ and feature $n$ by $\mu_{i, n}$:
$$ \mu_{i, n} = \frac{1}{n_{i}} \sum_{x \in D_{i}} x_{n} $$

Where $n_{i}$ is the number of samples in class $i$, $D_{i}$ is the set of samples belonging to class $i$, and $x_{n}$ is the value of feature $n$ for a given sample $x$. For a dataset with $d$ features, the corresponding mean vector $m_{i}$ is given by:
$$ m_{i} = \begin{bmatrix} \mu_{i, 1} \\ \mu_{i, 2} \\ ... \\ \mu_{i, d} \end{bmatrix} $$

So why does LDA measure spread with scatter matrices instead of covariance matrices? That's a great question. I still don't know the answer to this, and Raschka doesn't explain it in great detail. I'll return to this section when I figure it out, but in the meantime, here's a cheap approximation:
A "scatter matrix" is basically a non-normalized covariance matrix. Hence, we define the within-class scatter matrix like so:
$$ S_{W} = \sum_{i = 1}^{c} \sum_{x \in D_{i}} (x - m_{i})(x - m_{i})^{T} $$

Where $c$ is the number of classes. Notice that we're missing the normalization factor ($\frac{1}{n_{i}}$) of covariance, but everything else is the same! The between-class scatter matrix looks similar, but measures the "non-normalized covariance" (scatter) of the class mean vectors around the overall mean vector $m$ (the mean of each feature across the entire set of samples):
$$ S_{B} = \sum_{i = 1}^{c} N_{i} (m_{i} - m)(m_{i} - m)^{T} $$

Note that we also scale this measure by the number of samples in each class, $N_{i}$.
The important difference to note between PCA and LDA is that in LDA, we calculate our measure of variability (scatter, in this case) with respect to each class. Decomposing $S_{W}^{-1} S_{B}$ into linear discriminants accomplishes a similar goal as PCA, but in a supervised fashion.
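Here's a sketch of those LDA steps from scratch, again using the Wine dataset from scikit-learn as a stand-in; scikit-learn's own LinearDiscriminantAnalysis uses a different solver internally, so treat this as illustrative rather than canonical:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)
classes = np.unique(y)
d = X_std.shape[1]

# Per-class mean vectors and the overall mean vector m.
mean_vecs = {c: X_std[y == c].mean(axis=0) for c in classes}
m = X_std.mean(axis=0)

# Within-class scatter: sum the (non-normalized) scatter of each class.
S_W = np.zeros((d, d))
for c in classes:
    diff = X_std[y == c] - mean_vecs[c]
    S_W += diff.T @ diff

# Between-class scatter: class-size-weighted scatter of the class means around m.
S_B = np.zeros((d, d))
for c in classes:
    n_c = (y == c).sum()
    diff = (mean_vecs[c] - m).reshape(d, 1)
    S_B += n_c * (diff @ diff.T)

# Decompose S_W^{-1} S_B and keep the top k eigenvectors as W.
eigen_vals, eigen_vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigen_vals.real)[::-1]
k = 2
W = eigen_vecs[:, order[:k]].real

# Project the samples onto the linear discriminants.
Z = X_std @ W
print(Z.shape)                            # (178, 2)
```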
PCA and LDA are great, but they share a critical assumption: that the dataset we're looking to transform has linearly separable classes. Any nonlinearity in the data, according to PCA and LDA, must just be noise.
The "linear assumption" is pretty clear in LDA, which explicitly seeks to maximize linear class separability. But what about PCA, which isn't even supervised (it doesn't consider class labels at all)? We can understand the linear assumption of PCA as originating in the definition of a principal component: PCA seeks to transform the data by projecting it onto the line (or linear combination) that maximizes variance in the data. PCA can't help but think in straight lines.
So what do we do if our decision surface is nonlinear? That is, what do we do if the nonlinearity in our decision surface isn't noise, but is actually due to an underlying phenomenon in the data? To provide a solution, we'll return to the kernel methods that we learned back in chapter 3 and define a kernelized approach to PCA.
Recall the basic intuition behind kernel methods: since our decision surface is nonlinear, we'll aim to project our data into a higher dimension where the surface is linear, perform the algorithm, and then project everything back down into the original dimension.
For instance, if we have a sample $x$ in two dimensions:
$$ x = [ \; x_{1}, x_{2} \; ]^{T} $$

We can define a transformed version of $x$, which we'll call $z$, in a higher dimension:
$$ z = [ \; x_{1}^{2}, \; \sqrt{2} x_{1} x_{2}, \; x_{2}^{2} \; ]^{T} $$

And we'll seek to find a transformation $\phi$ such that:
$$ \phi(x) = z $$

and
$$ \phi^{-1}(z) = x $$

However, you'll also recall that this transformation is computationally expensive! Not only is it expensive to move between dimensions, but it's expensive to perform PCA within an expanded dimension (where we may have drastically more features, depending on the type of transformation).
To save on computational efficiency, we'll use the kernel trick: we'll define a similarity function, $k$, that computes the inner product between two samples in the higher-dimensional space without actually performing all of the expensive calculations required to move to that dimension.
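To see why a kernel function can stand in for the full transformation, take the example mapping above: the inner product of two transformed samples collapses into a simple function of their inner product in the original space, so we never need to compute $z$ explicitly:

$$ \phi(x)^{T} \phi(y) = x_{1}^{2}y_{1}^{2} + 2x_{1}x_{2}y_{1}y_{2} + x_{2}^{2}y_{2}^{2} = (x_{1}y_{1} + x_{2}y_{2})^{2} = (x^{T}y)^{2} = k(x, y) $$

This particular $k$ is the degree-2 polynomial kernel; the popular RBF (Gaussian) kernel plays the same role with a different similarity function.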
To kernelize PCA, we substitute the covariance matrix with a matrix of pairwise comparisons between samples using our similarity function $k$. We refer to this $N \times N$ matrix (where $N$ is the number of samples) as $K$:
$$ K_{ij} = k(x^{(i)}, x^{(j)}) $$

Since the transformed features may not be centered around 0 in the higher-dimensional space, we have to take care to re-center $K$, transforming it into $K'$:
$$ K' = K - 1_{N}K - K1_{N} + 1_{N}K1_{N} $$

Where $1_{N}$ is an $N \times N$ matrix with each element taking the value $\frac{1}{N}$. Then, we're ready to perform the eigendecomposition at the heart of PCA:
$$ K'a = N \lambda a $$

Where $a$ represents an eigenvector of the centered kernel matrix $K'$ (with eigenvalue $N\lambda$).
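Finally, here's a sketch of the full kernel PCA procedure with an RBF kernel, using scikit-learn's make_moons toy dataset and the centering and eigendecomposition steps above; the gamma value is an arbitrary choice for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, random_state=1)
N = X.shape[0]
gamma = 15.0    # arbitrary RBF width, chosen for illustration only

# Pairwise RBF similarities: k(x, y) = exp(-gamma * ||x - y||^2).
sq_dists = squareform(pdist(X, metric='sqeuclidean'))
K = np.exp(-gamma * sq_dists)

# Center the kernel matrix: K' = K - 1_N K - K 1_N + 1_N K 1_N.
one_n = np.ones((N, N)) / N
K_prime = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Eigendecompose the centered kernel matrix (eigh: symmetric, ascending order).
eigvals, eigvecs = np.linalg.eigh(K_prime)

# For the training samples, the projection onto component j works out to the
# j-th eigenvector scaled by the square root of its eigenvalue.
k = 2
top = np.argsort(eigvals)[::-1][:k]
X_kpca = eigvecs[:, top] * np.sqrt(eigvals[top])
print(X_kpca.shape)                       # (100, 2)
```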