This article looks at the benefits of using Principal Component Analysis (PCA) to reduce the dimensionality of our datasets. We will show that high dimensional datasets can often be expressed using only a few dimensions, and that reducing dimensionality can make our datasets easier to work with and decrease the training time of our models.
The MNIST dataset is composed of 28x28 pixel images of handwritten digits. Typically, when training machine learning models, MNIST is vectorized by treating all the pixels as a 784 dimensional vector, or in other words, putting all the 28x28 pixels next to each other in a 1x784 array. Not many datasets have more than 784 dimensions, and if you have ever worked with high dimensional datasets before, you know that training time and prediction time grow with the size of the dataset and the number of features (dimensionality).
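As a quick illustration of what that vectorization means in code (a minimal sketch using a placeholder array rather than a real MNIST file):
In [ ]:
import numpy as np
# a single MNIST image is a 28x28 grid of pixel intensities; "vectorizing" it
# just means laying the rows end to end into one 784-dimensional vector
image = np.zeros((28, 28))   # placeholder standing in for a real digit
vector = image.reshape(784)  # shape (784,) -- one row of the design matrix
print(image.shape, "->", vector.shape)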
Before we go any further, we should have a small discussion about dimensionality and why it's important. In simple terms, it is just the number of columns in the dataset, but it has significant downstream effects on the eventual models. At the extreme, the "curse of dimensionality" (wiki) describes how, in high-dimensional spaces, some things simply stop working properly. Even in relatively low dimensional problems, a dataset with more dimensions requires more parameters for the model to learn, and that means more rows to reliably estimate those parameters. If the number of rows in the dataset is fixed, adding extra dimensions without adding more information for the model to learn from can have a detrimental effect on the eventual model accuracy.
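To make "more parameters" concrete, here is a rough back-of-the-envelope sketch. It assumes the model fits a full-covariance Gaussian per class (a reasonable guess at what the Gaussian classifier used below does); the helper function is purely illustrative:
In [ ]:
# assuming a full-covariance Gaussian per class, the parameter count
# grows quadratically with the dimension d
def gaussian_params(d):
    return d + d * (d + 1) // 2   # mean vector + symmetric covariance matrix

print(gaussian_params(784))   # 308504 parameters per class in raw pixel space
print(gaussian_params(150))   # 11475 parameters per class after the PCA step below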
Let's take a look at how long it takes to train a Gaussian Classifier on the MNIST dataset:
In [2]:
import loader as support #support library to read mnist files into memory
import gaussian_classifier as gf
import time
%pylab inline
In [3]:
X_train, Y_train = support.loadmnist('data/train-images-idx3-ubyte',
'data/train-labels-idx1-ubyte')
X_test, Y_test = support.loadmnist('data/t10k-images-idx3-ubyte',
'data/t10k-labels-idx1-ubyte')
In [3]:
X_train.shape
Out[3]:
(60000, 784)
As you can see, we have 60,000 training examples with 784 features. Let's see how long it takes to train and predict on this high dimensional dataset:
In [4]:
clf = gf.GaussianClassifier(c=3400)
start = time.time()
clf.fit(X_train, Y_train)
Y = clf.predict(X_test)
errors = (Y_test != Y).sum();total = X_test.shape[0]
print("Error rate:\t %d/%d = %f" % ((errors,total,(errors/float(total)))))
end = time.time()
duration = end-start
duration
Out[4]:
On my modest hardware (a 2016 MacBook Pro) it took about a minute (54 s) to train and predict, with a 4.3% error rate -- pretty good!
Now let's experiment and see if we really need all those dimensions.
If we take all of our training examples and look at how each feature varies, we can build a covariance matrix: a quantifiable way to express the variance of each feature (and how pairs of features vary together) across examples.
$$ \textbf{A} = \left(\begin{array}{cccc}a_{11} & a_{12} & \dots & a_{1p}\\ a_{21} & a_{22} & \dots & a_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ a_{p1} & a_{p2} & \dots & a_{pp} \end{array}\right) $$
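Concretely, writing $x_{ij}$ for the value of feature $j$ in example $i$, $\bar{x}_j$ for that feature's mean, and $n$ for the number of examples, each entry of the matrix is the sample covariance between a pair of features (so the diagonal entries $a_{jj}$ are just the per-feature variances $s^2_j$ used below):
$$ a_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right) $$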
In [29]:
cov = numpy.cov(X_train.T)
Once we have a covariance matrix, we can get its eigenvalues and eigenvectors.
In [90]:
evals, evecs = numpy.linalg.eig(cov)
# the covariance matrix is symmetric, so its eigenvalues and eigenvectors are
# real; cast away any tiny imaginary parts introduced by numerical error
evals = np.float64(evals)
evecs = np.float64(evecs)
Each eigenvalue gives the variance of the data along its corresponding eigenvector (a direction in feature space). By definition, the total variation is given by the sum of the per-feature variances. It turns out that this is also equal to the sum of the eigenvalues of the covariance matrix. Thus, the total variation is:
$$ \sum_{j=1}^{p}s^2_j = s^2_1 + s^2_2 +\dots + s^2_p = \lambda_1 + \lambda_2 + \dots + \lambda_p = \sum_{j=1}^{p}\lambda_j $$
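As a quick numerical sanity check of that identity, the trace of the covariance matrix (the sum of the per-feature variances on its diagonal) should match the sum of the eigenvalues we just computed, up to floating point error:
In [ ]:
# total variation two ways: sum of per-feature variances vs. sum of eigenvalues
print(np.trace(cov))
print(np.sum(evals))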
In [91]:
total_variation = np.sum(evals)
np.float64(total_variation)
Out[91]:
3428502.5747802104
As we can see from the output above, the total variance of our dataset is 3428502.5747802104. Now let's look at each eigenvalue as a fraction of the overall variance; these fractions should sum to 1:
In [92]:
explained_variance_ratio = []
explained_variance = []
for i in range(X_train.shape[1]):
    explained_variance.append(evals[i])
    variance = evals[i] / total_variation
    explained_variance_ratio.append(variance)
print(sum(explained_variance_ratio))
cumulative_explained = cumsum(explained_variance_ratio)
plot(cumulative_explained);
grid()
This shows that each feature contributes some variance to the overall variance. Is that variance equally distributed among features? Are some features more expressive of the dataset while others contribute hardly any information? If so, then we can remove those features from our training set.
In [97]:
plt.figure(1, figsize=(8, 6))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(explained_variance_ratio, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
grid()
The above figure shows how much variance each component contributes to the whole. The x axis indexes our eigenvalue/eigenvector pairs (the principal components of the original dataset). We can see that after about 150 components, the subsequent ones don't contribute much to the whole. We can probably cut out the last 634 components without losing much accuracy with our classifier, while at the same time decreasing training time!
In [100]:
np.sum(explained_variance_ratio[:150])
Out[100]:
In [101]:
np.sum(explained_variance_ratio[150:])
Out[101]:
Just to quantify the statement above: if we only use the first 150 components in our PCA analysis, we capture about 94% of the overall variance of our dataset and lose only about 5%. Let's check whether we get a speed boost without too much of an accuracy loss on our classifier.
The output of a PCA process is a set of linear combinations of the original features that we will use to transform our original dataset into the reduced-dimensional dataset:
In [4]:
from sklearn.decomposition import PCA
pca = PCA(n_components=150)
pca.fit(X_train)
In [10]:
pca.components_.shape
Out[10]:
(150, 784)
At this point, we have the same eigenvectors that we generated ourselves in the previous steps, except that only the top 150 have been kept; scikit-learn calls them components_ (and stores the corresponding eigenvalues in explained_variance_).
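As an optional sanity check against the eigenvalues we computed by hand, the variance captured by scikit-learn's 150 components should roughly match the top-150 share we calculated earlier:
In [ ]:
# sklearn's explained variance for 150 components vs. our own top-150 eigenvalue share
print(pca.explained_variance_ratio_.sum())
print(np.sum(sorted(evals, reverse=True)[:150]) / total_variation)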
To do the transform to the reduced dimensionality, scikit-learn centers the data by subtracting the training mean and then takes the dot product of our input matrix with the eigenvectors (components):
X_transformed = fast_dot(X, self.components_.T)
So applying the PCA transformation is simply a linear combination of the inputs -- very fast. Now we will apply the projection to the training set and to any new data that we want to test against in the future.
In [7]:
X_train_150 = pca.transform(X_train)
X_test_150 = pca.transform(X_test)
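To convince ourselves that the transform really is just the centered dot product quoted above, here is a small sketch reproducing it by hand with numpy, using the mean_ and components_ attributes that scikit-learn fits during pca.fit:
In [ ]:
# reproduce pca.transform manually: center by the training mean, then
# project onto the 150 eigenvectors (components)
manual = np.dot(X_test - pca.mean_, pca.components_.T)
print(np.allclose(manual, X_test_150))   # expected: True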
As you can see, we now have a 150-feature training set that keeps only the most useful information. Let's benchmark it against the classification job we did pre-PCA:
In [8]:
clf = gf.GaussianClassifier(c=1)
start = time.time()
clf.fit(X_train_150, Y_train)
Y = clf.predict(X_test_150)
errors = (Y_test != Y).sum(); total = X_test_150.shape[0]
print("Error rate:\t %d/%d = %f" % ((errors, total, (errors/float(total)))))
end = time.time()
duration = end-start
duration
Out[8]:
| | Training Time (s) | Error Rate (%) |
|---|---|---|
| Original | 58 | 4.3 |
| PCA | 3.1 | 4.9 |
As you can see, PCA can deliver some pretty compelling results when applied to machine learning tasks. Reduced dimensionality leads to faster training!
PCA is also helpful when trying to visualize data. Take the scikit-learn Iris dataset as an example: its training set is 4-dimensional. If we use PCA to reduce it to 2D, we can visualize the data:
In [138]:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
# Percentage of variance explained for each components
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))
plt.figure(2, figsize=(8, 6))
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('PCA of IRIS dataset')
plt.show()