Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables. More precisely, we look for $k \leq n$ orthogonal vectors $u^{(1)}, u^{(2)}, \ldots, u^{(k)}$, called principal components, that minimize the projection error of the example features onto the linear subspace they span. By construction, the first principal component captures the greatest possible variance in the data, and each subsequent component captures the greatest remaining variance along an axis orthogonal to the previous ones. In practice, we choose the number of principal components $k$ to be the smallest that retains a given fraction of the variance, typically between 0.9 and 0.99.
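For mean-centered data, this criterion can be stated explicitly: if $x^{(i)}_{\text{approx}}$ denotes the reconstruction of example $x^{(i)}$ from its projection onto the subspace, we pick the smallest $k$ such that
$$\frac{\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} - x^{(i)}_{\text{approx}} \rVert^{2}}{\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} \rVert^{2}} \leq 1 - p,$$
where $m$ is the number of examples and $p$ is the fraction of variance to retain (e.g. $p = 0.99$).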
PCA is mainly used as a tool in exploratory data analysis and to support the building of predictive models, but it can also be used to speed up supervised machine learning algorithms.
It is usually computed either by an eigenvalue decomposition of the data covariance (or correlation) matrix or by a singular value decomposition of the data matrix, typically after some preprocessing. Note that we won't implement the algorithm directly, as we are primarily interested in showing its application in a common scenario; we'll delegate it to the scikit-learn library.
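For reference, here is a minimal NumPy sketch of the SVD route on a toy array; the names `X_small` and the choice `k = 2` are purely illustrative and are not part of the analysis below:

import numpy as np

# Toy data: 100 examples, 5 features (illustrative only)
rng = np.random.default_rng(0)
X_small = rng.normal(size=(100, 5))

# PCA works on mean-centered data
X_centered = X_small - X_small.mean(axis=0)

# Singular value decomposition of the centered data matrix
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
components = Vt[:k]                          # first k principal components (as rows)
Z_small = X_centered @ components.T          # projected (compressed) representation
explained = S[:k]**2 / (len(X_small) - 1)    # variance explained by each component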
Let's import the Olivetti faces dataset from the scikit-learn built-in datasets:
In [1]:
from sklearn.datasets import fetch_olivetti_faces
dataset = fetch_olivetti_faces(data_home='/tmp/data/', shuffle=False)
n_samples, width, height = dataset.images.shape
print("Dataset consists of {n} {w}x{h} faces".format(n=n_samples, w=width, h=height))
In [2]:
dataset.keys()
Out[2]:
In [3]:
print(dataset.DESCR)
As stated in the description, the dataset consists of ten 64x64 face images for each of 40 distinct subjects, for a total of 400 pictures.
In [4]:
import matplotlib.pyplot as plt
%matplotlib inline
faces = dataset.data
plt.imshow((-1) * faces[79].reshape((64, 64)), cmap='Greys')  # negate so the 'Greys' colormap shows the face with natural contrast
Out[4]:
To visualize more than one face at a time, we build a simple function to represent the faces in a grid:
In [5]:
import numpy as np
import matplotlib.pyplot as plt
def displayData(data, *width):
    '''
    Display data in a 2-dimensional grid
    '''
    # Set ex_width
    if width:
        ex_width = width[0]
    else:
        ex_width = int(np.sqrt(np.shape(data)[1]))
    # Compute dimensions
    (n_examples, n) = np.shape(data)
    ex_height = int(n / ex_width)
    n_rows = int(np.floor(np.sqrt(n_examples)))
    n_cols = int(np.ceil(n_examples / n_rows))
    # Set padding
    pad = 1
    # Core
    grid = np.zeros((pad + n_rows * (ex_height + pad),
                     pad + n_cols * (ex_width + pad)))
    cur = 0  # current example
    for j in range(n_rows):
        if cur >= n_examples:
            break
        for i in range(n_cols):
            if cur >= n_examples:
                break
            max_val = np.max(np.abs(data[cur, :]))
            from_row = pad + j * (ex_height + pad)
            to_row = from_row + ex_height
            from_col = pad + i * (ex_width + pad)
            to_col = from_col + ex_width
            grid[from_row:to_row, from_col:to_col] = \
                data[cur, :].reshape((ex_height, ex_width)) / max_val
            cur += 1
    # Display data
    if n_examples < 100:
        fig = plt.figure()
    else:
        fig = plt.figure(figsize=(n_examples / 10, n_examples / 20))
    ax = fig.add_axes([0, 0, 1, 1])
    ax.imshow((-1) * grid, extent=[0, 1, 0, 1], cmap='Greys')
    plt.show()
    print('examples: ', n_examples)
    print('n_rows: ', n_rows)
    print('n_cols: ', n_cols)
    print('example width: ', ex_width)
    print('example height:', ex_height)
In [7]:
displayData(faces[:100], 64)
When applying PCA, it's usually good practice to do some preprocessing of the data, specifically feature scaling and normalization. This is reasonable, as PCA deals with the variance of the different features, which may differ significantly in magnitude. In scikit-learn we use the preprocessing module to perform feature scaling and normalization, and the pipeline module to chain these steps into a single process flow:
In [14]:
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.pipeline import Pipeline
preprocessing = Pipeline([
    ('scaler', StandardScaler()),      # StandardScaler must come first
    ('normalizer', Normalizer()),
])
preprocessing.fit(faces)
X = preprocessing.transform(faces)
In [15]:
print('Means: ', X.mean(axis=0)[0:5])                 # would be exactly 0 before the Normalizer step
print('Standard deviations: ', X.std(axis=0)[0:5])    # would be exactly 1 before the Normalizer step
print('Norms: ', np.sqrt(np.sum(X**2, axis=1))[0:5])  # should be 1 after the Normalizer
In [16]:
displayData(X[:100,:],64)
In [17]:
from sklearn.decomposition import PCA
pca = PCA(n_components=24)
pca.fit(X)
Out[17]:
In scikit-learn's PCA algorithm, the $k$ vectors which minimize the projection error are the first $k$ vectors $u^{(1)}, u^{(2)}, \ldots, u^{(k)}$ of the matrix $U$ obtained from the singular value decomposition of the covariance matrix $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^{T}$, computed on the mean-centered examples (in practice, scikit-learn obtains the same directions from the singular value decomposition of the mean-centered data matrix itself).
Singular value decomposition is a factorization of the form $M = U \varSigma V^{\ast}$, where $M$ is the $m \times n$ real or complex matrix to be decomposed, $U$ is an $m \times m$ real or complex unitary matrix, $\varSigma$ is an $m \times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal, and $V$ is an $n \times n$ real or complex unitary matrix ($V^{\ast}$ denotes its conjugate transpose). The diagonal entries $\sigma_i$ of $\varSigma$ are known as the singular values of $M$. The columns of $U$ are called left-singular vectors, while the columns of $V$ are called right-singular vectors.
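As a sanity check (a sketch using the `pca` object fitted above, not part of the original analysis), the rows of `pca.components_` should match, up to a sign flip, the leading right-singular vectors of the mean-centered data matrix:

import numpy as np

# Compare the fitted components with the right-singular vectors of the centered data
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

for i in range(3):
    cosine = np.abs(np.dot(pca.components_[i], Vt[i]))  # both rows have unit norm
    print(i, round(float(cosine), 4))                   # expected: close to 1.0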
In [18]:
U = pca.components_
U.shape
Out[18]:
These vectors live in the original feature space, i.e. they are linear combinations of the standard basis vectors, so each of them can be visualized as a 64x64 image by representing its coefficients as different gray intensities:
In [19]:
displayData(U, 64)
The projected (compressed) features are then given by:
In [20]:
Z = pca.transform(X)
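Each face is now described by 24 coordinates, one per retained component. As a quick check (a sketch, relying on the fact that `transform` centers the data and projects it onto the components):

print(X.shape, '->', Z.shape)                    # expected: (400, 4096) -> (400, 24)
Z_manual = (X - pca.mean_) @ pca.components_.T   # the same projection done by hand
print(np.allclose(Z, Z_manual, atol=1e-4))       # expected: True, up to float32 rounding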
While the features reconstructed from the compressed representation, $x_\text{approx} = U^{\top} z$ (up to re-adding the mean subtracted during fitting), are given by:
In [21]:
X_recover = pca.inverse_transform(Z)
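Under the hood, `inverse_transform` maps each compressed vector back through the components and re-adds the mean subtracted during fitting; a sketch of the equivalent manual computation:

X_recover_manual = Z @ pca.components_ + pca.mean_
print(np.allclose(X_recover, X_recover_manual, atol=1e-4))   # expected: True, up to float32 rounding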
In [22]:
displayData(X_recover[:100,:])
We can observe that with just 24 components, the faces are still recognizable, but to have a more rigorous method to evaluate our model, we consider the explained variance as a metric:
In [23]:
sum(pca.explained_variance_)
Out[23]:
We observe that the explained variance is pretty low, as we usually want to retain at least 99% of the data variance. To reach the desired level of explained variance, we proceed manually by testing increasing values of the n_components parameter:
In [24]:
pca = PCA(n_components=50).fit(X)
print(sum(pca.explained_variance_))
pca = PCA(n_components=100).fit(X)
print(sum(pca.explained_variance_))
pca = PCA(n_components=200).fit(X)
print(sum(pca.explained_variance_))
pca = PCA(n_components=250).fit(X)
print(sum(pca.explained_variance_))
pca = PCA(n_components=260).fit(X)
print(sum(pca.explained_variance_))
In conclusion, with around 260 (out of 4096) principal components, 99% of the variance in the data is retained.
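As an aside, instead of searching by hand, scikit-learn can select the number of components for a target variance fraction directly: passing a float between 0 and 1 as n_components (together with the full SVD solver) keeps just enough components to explain that fraction of the variance. A sketch of both this option and the equivalent check via explained_variance_ratio_ (note that the ratio is normalized by the total variance, so the count it reports may differ slightly from the manual search above):

from sklearn.decomposition import PCA
import numpy as np

# Option 1: let PCA choose the number of components for ~99% of the variance
pca_auto = PCA(n_components=0.99, svd_solver='full').fit(X)
print(pca_auto.n_components_)

# Option 2: inspect the cumulative explained-variance ratio
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(np.argmax(cumulative >= 0.99) + 1)         # smallest k reaching 99%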
In this, as in most real-life scenarios, it's possible to compress the data substantially and still retain most of the variance. If we are not applying PCA as a preliminary step in a supervised machine learning problem, but rather working in a more traditional analytical setting, the challenge at this point is interpreting these vectors in order to gain some insight into the data.
The reconstructed features are given by:
In [26]:
Z = pca.transform(X)
X_recover = pca.inverse_transform(Z)
displayData(X_recover[:100,:])