Principal Component Analysis in Shogun

By Abhijeet Kislay (GitHub ID: kislayabhi)

This notebook is about finding the Principal Components (PCA) of data (unsupervised) in Shogun. Its dimensionality reduction capabilities are further utilised to show applications in data compression, image processing and face recognition.


In [ ]:
%pylab inline
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
# import all shogun classes
from shogun import *

Some Formal Background (Skip if you just want code examples)

PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension.

In machine learning problems data is often high dimensional - images, bag-of-words descriptions, etc. In such cases we cannot expect the training data to densely populate the space, meaning that there will be large parts in which little is known about the data. Hence it is expected that only a small number of directions are relevant for describing the data to a reasonable accuracy.

Although the data vectors may be very high dimensional, they will typically lie close to a much lower dimensional 'manifold'. Here we concentrate on linear dimensionality reduction techniques. In this approach a high dimensional datapoint $\mathbf{x}$ is 'projected down' to a lower dimensional vector $\mathbf{y}$ by: $$\mathbf{y}=\mathbf{F}\mathbf{x}+\text{const}.$$ where the matrix $\mathbf{F}\in\mathbb{R}^{\text{M}\times \text{D}}$, with $\text{M}<\text{D}$. Here $\text{M}=\dim(\mathbf{y})$ and $\text{D}=\dim(\mathbf{x})$.

From the above scenario, we assume that

  • The number of principal components to use is $\text{M}$.
  • The dimension of each data point is $\text{D}$.
  • The number of data points is $\text{N}$.

We express the approximation for datapoint $\mathbf{x}^n$ as:$$\mathbf{x}^n \approx \mathbf{c} + \sum\limits_{i=1}^{\text{M}}y_i^n \mathbf{b}^i \equiv \tilde{\mathbf{x}}^n.$$

  • Here the vector $\mathbf{c}$ is a constant and defines a point in the lower dimensional space.
  • The $\mathbf{b}^i$ define vectors in the lower dimensional space (also known as 'principal component coefficients' or 'loadings').
  • The $y_i^n$ are the low dimensional co-ordinates of the data.

Our motive is to find the reconstruction $\tilde{\mathbf{x}}^n$ given the lower dimensional representation $\mathbf{y}^n$ (which has components $y_i^n, i = 1,...,\text{M}$). For a data space of dimension $\dim(\mathbf{x})=\text{D}$, we hope to accurately describe the data using only a small number $(\text{M}\ll \text{D})$ of coordinates of $\mathbf{y}$. To determine the best lower dimensional representation it is convenient to use the squared distance error between $\mathbf{x}$ and its reconstruction $\tilde{\mathbf{x}}$:$$\text{E}(\mathbf{B},\mathbf{Y},\mathbf{c})=\sum\limits_{n=1}^{\text{N}}\sum\limits_{i=1}^{\text{D}}[x_i^n - \tilde{x}_i^n]^2.$$

  • Here the basis vectors are defined as $\mathbf{B} = [\mathbf{b}^1,...,\mathbf{b}^\text{M}]$ (defining $[\text{B}]_{i,j} = b_i^j$).
  • Corresponding low dimensional coordinates are defined as $\mathbf{Y} = [\mathbf{y}^1,...,\mathbf{y}^\text{N}].$
  • Also, $x_i^n$ and $\tilde{x}_i^n$ represent the coordinates of the original and the reconstructed data points respectively.
  • The bias $\mathbf{c}$ is given by the mean of the data $\sum_n\mathbf{x}^n/\text{N}$.

Therefore, for simplicity we centre our data so that $\mathbf{c}$ is zero. We now concentrate on finding the optimal basis $\mathbf{B}$ (which has the components $\mathbf{b}^i, i=1,...,\text{M}$).
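
To make the objective concrete, here is a small NumPy sketch (random placeholder data, independent of Shogun) that evaluates $\text{E}(\mathbf{B},\mathbf{Y},\mathbf{c})$ for centred data and an arbitrary orthonormal basis; all the names and sizes here are made up purely for illustration.


In [ ]:
import numpy as np

D, M, N = 5, 2, 100                            # dim(x), dim(y), number of data points
X = np.random.randn(D, N)                      # columns are the datapoints x^n
Xc = X - X.mean(axis=1, keepdims=True)         # centring the data sets c to zero

B = np.linalg.qr(np.random.randn(D, M))[0]     # some orthonormal basis (B^T B = I)
Y = B.T @ Xc                                   # coordinates y^n for this basis (the optimal choice, derived below)
error = np.sum((Xc - B @ Y) ** 2)              # E(B, Y) = trace((X - BY)^T (X - BY))
print(error)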

Deriving the optimal linear reconstruction

To find the best basis vectors $\mathbf{B}$ and corresponding low dimensional coordinates $\mathbf{Y}$, we may minimize the sum of squared differences between each vector $\mathbf{x}$ and its reconstruction $\tilde{\mathbf{x}}$:

$\text{E}(\mathbf{B},\mathbf{Y}) = \sum\limits_{n=1}^{\text{N}}\sum\limits_{i=1}^{\text{D}}\left[x_i^n - \sum\limits_{j=1}^{\text{M}}y_j^nb_i^j\right]^2 = \text{trace} \left( (\mathbf{X}-\mathbf{B}\mathbf{Y})^T(\mathbf{X}-\mathbf{B}\mathbf{Y}) \right)$

where $\mathbf{X} = [\mathbf{x}^1,...,\mathbf{x}^\text{N}].$ Considering the above equation under the orthonormality constraint $\mathbf{B}^T\mathbf{B} = \mathbf{I}$ (i.e. the basis vectors are mutually orthogonal and of unit length), we differentiate it w.r.t. $y_k^n$. The squared error $\text{E}(\mathbf{B},\mathbf{Y})$ has zero derivative when:

$y_k^n = \sum_i b_i^kx_i^n$

By substituting this solution in the above equation, the objective becomes

$\text{E}(\mathbf{B}) = (\text{N}-1)\left[\text{trace}(\mathbf{S}) - \text{trace}\left(\mathbf{S}\mathbf{B}\mathbf{B}^T\right)\right],$

where $\mathbf{S}$ is the sample covariance matrix of the data. To minimise this objective under the constraint $\mathbf{B}^T\mathbf{B} = \mathbf{I}$, we use a set of Lagrange multipliers $\mathbf{L}$, so that the objective is to minimize:

$-\text{trace}\left(\mathbf{S}\mathbf{B}\mathbf{B}^T\right)+\text{trace}\left(\mathbf{L}\left(\mathbf{B}^T\mathbf{B} - \mathbf{I}\right)\right).$

Since the constraint is symmetric, we can assume that $\mathbf{L}$ is also symmetric. Differentiating with respect to $\mathbf{B}$ and equating to zero we obtain that at the optimum

$\mathbf{S}\mathbf{B} = \mathbf{B}\mathbf{L}$.

This is a form of eigen-equation so that a solution is given by taking $\mathbf{L}$ to be diagonal and $\mathbf{B}$ as the matrix whose columns are the corresponding eigenvectors of $\mathbf{S}$. In this case,

$\text{trace}\left(\mathbf{S}\mathbf{B}\mathbf{B}^T\right) =\text{trace}(\mathbf{L}),$

which is the sum of the eigenvalues corresponding to the eigenvectors forming $\mathbf{B}$. Since we wish to minimise $\text{E}(\mathbf{B})$, we take the eigenvectors with the largest corresponding eigenvalues. Whilst the solution to this eigen-problem is unique, this only serves to define the solution subspace, since one may rotate and scale $\mathbf{B}$ and $\mathbf{Y}$ such that the value of the squared loss is exactly the same. The justification for choosing the non-rotated eigen solution is given by the additional requirement that the principal components correspond to directions of maximal variance.
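
The derivation can be checked numerically. The following NumPy sketch (random toy data, not Shogun) builds $\mathbf{B}$ from the top-$\text{M}$ eigenvectors of the sample covariance and verifies that the residual error equals $(\text{N}-1)$ times the sum of the discarded eigenvalues.


In [ ]:
import numpy as np

D, M, N = 5, 2, 200
X = np.random.randn(D, N)
Xc = X - X.mean(axis=1, keepdims=True)

S = Xc @ Xc.T / (N - 1)                              # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)                 # returned in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort largest first

B = eigvecs[:, :M]                                   # top-M principal directions
Y = B.T @ Xc                                         # low dimensional coordinates
error = np.sum((Xc - B @ Y) ** 2)

print(np.allclose(error, (N - 1) * eigvals[M:].sum()))   # True: error = (N-1) * discarded eigenvalues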

Maximum variance criterion

We aim to find the single direction $\mathbf{b}$ such that, when the data is projected onto this direction, the variance of the projection is maximal amongst all possible such projections. The projection of a datapoint onto a direction $\mathbf{b}$ is $\mathbf{b}^T\mathbf{x}^n$ for a unit length vector $\mathbf{b}$. Hence the sum of squared projections is: $$\sum\limits_{n}\left(\mathbf{b}^T\mathbf{x}^n\right)^2 = \mathbf{b}^T\left[\sum\limits_{n}\mathbf{x}^n(\mathbf{x}^n)^T\right]\mathbf{b} = (\text{N}-1)\mathbf{b}^T\mathbf{S}\mathbf{b} = \lambda(\text{N} - 1)$$ which, ignoring constants, is simply the negative of the objective $\text{E}(\mathbf{B})$ for a single retained eigenvector $\mathbf{b}$ (with $\mathbf{S}\mathbf{b} = \lambda\mathbf{b}$). Hence the optimal single $\mathbf{b}$ which maximises the projection variance is given by the eigenvector corresponding to the largest eigenvalue of $\mathbf{S}.$ The eigenvector with the second largest eigenvalue gives the next optimal direction, orthogonal to the first, and so on. This explains why, despite the squared loss equation being invariant with respect to arbitrary rotation of the basis vectors, the ones given by the eigen-decomposition have the additional property that they correspond to directions of maximal variance. These maximal variance directions found by PCA are called the $\text{principal}$ $\text{directions}.$
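
A quick numerical check of the maximum variance criterion, again in plain NumPy on a stretched toy cloud: the variance of the projection onto the leading eigenvector equals the largest eigenvalue, and a random unit direction never does better.


In [ ]:
import numpy as np

N = 500
X = np.random.randn(2, N) * np.array([[3.0], [0.5]])      # anisotropic 2D toy data
Xc = X - X.mean(axis=1, keepdims=True)
S = Xc @ Xc.T / (N - 1)

eigvals, eigvecs = np.linalg.eigh(S)
b_top = eigvecs[:, -1]                                    # eigenvector with the largest eigenvalue

var_top = np.var(b_top @ Xc, ddof=1)
print(np.isclose(var_top, eigvals[-1]))                   # True: projection variance = top eigenvalue

b_rand = np.random.randn(2)
b_rand /= np.linalg.norm(b_rand)                          # a random unit direction
print(np.var(b_rand @ Xc, ddof=1) <= var_top + 1e-9)      # True: no direction beats the top eigenvector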

There are two matrix decomposition methods through which Shogun can perform PCA, namely:

  • Eigenvalue Decomposition Method.
  • Singular Value Decomposition.

EVD vs SVD

  • The EVD viewpoint requires that one compute the eigenvalues and eigenvectors of the covariance matrix, which is proportional to the product $\mathbf{X}\mathbf{X}^\text{T}$, where $\mathbf{X}$ is the (centred) data matrix. Since the covariance matrix is symmetric, it is diagonalizable, and the eigenvectors can be normalized such that they are orthonormal:

$\mathbf{S}=\frac{1}{\text{N}-1}\mathbf{X}\mathbf{X}^\text{T},$

where the $\text{D}\times\text{N}$ matrix $\mathbf{X}$ contains all the data vectors: $\mathbf{X}=[\mathbf{x}^1,...,\mathbf{x}^\text{N}].$ Writing the $\text{D}\times\text{N}$ matrix of eigenvectors as $\mathbf{E}$ and the eigenvalues as an $\text{N}\times\text{N}$ diagonal matrix $\mathbf{\Lambda}$, the eigen-decomposition of the covariance $\mathbf{S}$ is

$\mathbf{X}\mathbf{X}^\text{T}\mathbf{E}=\mathbf{E}\mathbf{\Lambda}\Longrightarrow\mathbf{X}^\text{T}\mathbf{X}\mathbf{X}^\text{T}\mathbf{E}=\mathbf{X}^\text{T}\mathbf{E}\mathbf{\Lambda}\Longrightarrow\mathbf{X}^\text{T}\mathbf{X}\tilde{\mathbf{E}}=\tilde{\mathbf{E}}\mathbf{\Lambda},$

where we defined $\tilde{\mathbf{E}}=\mathbf{X}^\text{T}\mathbf{E}$. The final expression above represents the eigenvector equation for $\mathbf{X}^\text{T}\mathbf{X}.$ This is a matrix of dimensions $\text{N}\times\text{N}$, so calculating the eigen-decomposition takes $\mathcal{O}(\text{N}^3)$ operations, compared with $\mathcal{O}(\text{D}^3)$ operations in the original high-dimensional space; this is advantageous when $\text{N}<\text{D}$. We can therefore calculate the eigenvectors $\tilde{\mathbf{E}}$ and eigenvalues $\mathbf{\Lambda}$ of this matrix more easily. Once found, we use the fact that the eigenvalues of $\mathbf{S}$ are given by the diagonal entries of $\mathbf{\Lambda}$ divided by $\text{N}-1$, and the eigenvectors by

$\mathbf{E}=\mathbf{X}\tilde{\mathbf{E}}\mathbf{\Lambda}^{-1}$

  • On the other hand, applying SVD to the data matrix $\mathbf{X}$ follows like:

$\mathbf{X}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\text{T}$

where $\mathbf{U}^\text{T}\mathbf{U}=\mathbf{I}_\text{D}$ and $\mathbf{V}^\text{T}\mathbf{V}=\mathbf{I}_\text{N}$ and $\mathbf{\Sigma}$ is a diagonal matrix of the (positive) singular values. We assume that the decomposition has ordered the singular values so that the upper left diagonal element of $\mathbf{\Sigma}$ contains the largest singular value.

Attempting to construct the covariance matrix $(\mathbf{X}\mathbf{X}^\text{T})$ from this decomposition gives:

$\mathbf{X}\mathbf{X}^\text{T} = \left(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\text{T}\right)\left(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\text{T}\right)^\text{T}$

$\mathbf{X}\mathbf{X}^\text{T} = \left(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\text{T}\right)\left(\mathbf{V}\mathbf{\Sigma}\mathbf{U}^\text{T}\right)$

and since $\mathbf{V}$ is an orthogonal matrix $\left(\mathbf{V}^\text{T}\mathbf{V}=\mathbf{I}\right),$

$\mathbf{X}\mathbf{X}^\text{T}=\left(\mathbf{U}\mathbf{\Sigma}^\mathbf{2}\mathbf{U}^\text{T}\right)$

Since this is in the form of an eigen-decomposition, the PCA solution is given by performing the SVD of $\mathbf{X}$: the eigenvectors are then given by the columns of $\mathbf{U}$, and the corresponding eigenvalues by the squares of the singular values.
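
Both routes can be illustrated with a few lines of NumPy on random placeholder data: the small $\text{N}\times\text{N}$ eigen-problem followed by the mapping $\mathbf{E}=\mathbf{X}\tilde{\mathbf{E}}\mathbf{\Lambda}^{-1}$, and the SVD of $\mathbf{X}$, whose squared singular values reproduce the same non-zero eigenvalues.


In [ ]:
import numpy as np

D, N = 500, 20                                    # many more dimensions than data points
X = np.random.randn(D, N)
X = X - X.mean(axis=1, keepdims=True)             # centre the data

# EVD via the small N x N matrix X^T X
lam, E_tilde = np.linalg.eigh(X.T @ X)            # N x N eigen-problem instead of D x D
keep = lam > 1e-10                                # drop the numerically zero eigenvalue
lam, E_tilde = lam[keep], E_tilde[:, keep]
E = (X @ E_tilde) / lam                           # E = X E~ Lambda^-1
E /= np.linalg.norm(E, axis=0)                    # normalise the columns to unit length
print(np.allclose(X @ (X.T @ E), E * lam))        # True: columns of E are eigenvectors of X X^T

# SVD of the data matrix
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)             # X = U Sigma V^T
print(np.allclose(sigma[:lam.size] ** 2, np.sort(lam)[::-1]))    # squared singular values = eigenvalues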

CPCA Class Reference (Shogun)

The CPCA class of Shogun inherits from the CPreprocessor class. Preprocessors are transformation functions that do not change the domain of the input features. Specifically, CPCA performs principal component analysis on the input vectors and keeps only the specified number of eigenvectors. On preprocessing, the stored transformation matrix is used to project vectors into the eigenspace.

The performance of PCA depends on the algorithm used for the situation at hand. The PCA preprocessor class provides 3 method options to compute the transformation matrix:

  • $\text{PCA(EVD)}$ sets $\text{PCAmethod == EVD}$ : Eigenvalue Decomposition of the Covariance Matrix $(\mathbf{XX^T}).$ The covariance matrix $\mathbf{XX^T}$ is first formed internally and then its eigenvectors and eigenvalues are computed using QR decomposition of the matrix. The time complexity of this method is $\mathcal{O}(D^3)$ and it should be used when $\text{N > D.}$
  • $\text{PCA(SVD)}$ sets $\text{PCAmethod == SVD}$ : Singular Value Decomposition of the feature matrix $\mathbf{X}$. The transpose of the feature matrix, $\mathbf{X^T}$, is decomposed using SVD: $\mathbf{X^T = UDV^T}.$ The matrix V in this decomposition contains the required eigenvectors and the diagonal entries of the diagonal matrix D correspond to the non-negative eigenvalues. The time complexity of this method is $\mathcal{O}(DN^2)$ and it should be used when $\text{N < D.}$
  • $\text{PCA(AUTO)}$ sets $\text{PCAmethod == AUTO}$ : This mode automagically chooses one of the above modes for the user based on whether $\text{N>D}$ (chooses $\text{EVD}$) or $\text{N<D}$ (chooses $\text{SVD}$). A minimal usage sketch follows this list.
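
Below is a minimal usage sketch of these modes on a throwaway matrix, using the same calls that appear in the examples later in this notebook (it relies on the shogun import from the first cell; the toy names and sizes are placeholders only).


In [ ]:
import numpy as np

toy_data = np.random.randn(3, 50)                 # D=3 features, N=50 vectors, so N > D

toy_feats = RealFeatures(toy_data)
toy_pca = PCA(AUTO)                               # AUTO picks EVD here; PCA(EVD) / PCA(SVD) force a method
toy_pca.set_target_dim(2)                         # keep the top 2 principal components
toy_pca.init(toy_feats)                           # computes and stores the transformation matrix
toy_proj = toy_pca.apply_to_feature_matrix(toy_feats)   # 2 x 50 matrix of projected vectors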

PCA on 2D data

Step 1: Get some data

We will generate the toy data by adding orthogonal noise to a set of points lying on an arbitrary 2d line. We expect PCA to recover this line, which is a one-dimensional linear sub-space.


In [ ]:
#number of data points.
n=100

#generate a random 2d line(y1 = mx1 + c)
m = random.randint(1,10)
c = random.randint(1,10)
x1 = random.random_integers(-20,20,n)
y1=m*x1+c

#generate the noise.
noise=random.random_sample([n]) * random.random_integers(-35,35,n)

#make the noise orthogonal to the line y=mx+c (direction (-m,1)/sqrt(1+m^2)) and add it.
x=x1 - noise*m/sqrt(1+square(m))
y=y1 + noise/sqrt(1+square(m))

twoD_obsmatrix=array([x,y])

In [ ]:
#to visualise the data we must plot it.

rcParams['figure.figsize'] = 7, 7 
figure,axis=subplots(1,1)
xlim(-50,50)
ylim(-50,50)
axis.plot(twoD_obsmatrix[0,:],twoD_obsmatrix[1,:],'o',color='green',markersize=6)

#the line from which we generated the data is plotted in red
axis.plot(x1[:],y1[:],linewidth=0.3,color='red')
title('One-Dimensional sub-space with noise')
xlabel("x axis")
_=ylabel("y axis")

Step 2: Subtract the mean.

For PCA to work properly, we must subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. So, all the $x$ values have $\bar{x}$ subtracted, and all the $y$ values have $\bar{y}$ subtracted from them, where:$$\bar{\mathbf{x}} = \frac{\sum\limits_{i=1}^{n}x_i}{n}$$ Here $\bar{\mathbf{x}}$ denotes the mean of the $x_i$'s.
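
As a tiny NumPy illustration (toy data only), centring simply subtracts the per-dimension mean; Shogun performs this step internally, as described next.


In [ ]:
import numpy as np

toy = np.random.randn(2, 10)                         # 2 dimensions, 10 points
toy_centred = toy - toy.mean(axis=1, keepdims=True)  # subtract x-bar and y-bar per dimension
print(toy_centred.mean(axis=1))                      # approximately [0, 0]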

Shogun's way of doing things :

The PCA preprocessor performs principal component analysis on input feature vectors/matrices. It provides an interface to set the target dimension via the $\text{set_target_dim}$ method. When the $\text{init()}$ method of $\text{PCA}$ is called with a proper feature matrix $\text{X}$ (with, say, $\text{N}$ vectors and $\text{D}$ feature dimensions), a transformation matrix is computed and stored internally. It inherently also centres the data by subtracting the mean from it.


In [ ]:
#convert the observation matrix into dense feature matrix.
train_features = RealFeatures(twoD_obsmatrix)

#PCA(EVD) is chosen since N=100 and D=2 (N>D).
#However we can also use PCA(AUTO) as it will automagically choose the appropriate method. 
preprocessor = PCA(EVD)

#since we are projecting down the 2d data, the target dim is 1. But here we go through the steps
#explicitly, setting the target dimension to 2 so that both eigenvectors can be visualized.
#However, in later examples we will skip this manual step by setting the target dimension directly.
preprocessor.set_target_dim(2)

#Centralise the data by subtracting its mean from it.
preprocessor.init(train_features)

#get the mean for the respective dimensions.
mean_datapoints=preprocessor.get_mean()
mean_x=mean_datapoints[0]
mean_y=mean_datapoints[1]

Step 3: Calculate the covariance matrix

To understand the relationship between 2 dimensions we define the $\text{covariance}$. It is a measure of how much the dimensions vary from the mean $with$ $respect$ $to$ $each$ $other.$$$cov(X,Y)=\frac{\sum\limits_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}$$ A useful way to get all the possible covariance values between all the different dimensions is to calculate them all and put them in a matrix.

Example: For a 3d dataset with usual dimensions of $x,y$ and $z$, the covariance matrix has 3 rows and 3 columns, and the values are this: $$\mathbf{S} = \quad\begin{pmatrix}cov(x,x)&cov(x,y)&cov(x,z)\\cov(y,x)&cov(y,y)&cov(y,z)\\cov(z,x)&cov(z,y)&cov(z,z)\end{pmatrix}$$
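
As a sanity check, the same matrix can be computed in NumPy for random 3-dimensional toy data; np.cov uses the same $n-1$ normalisation as the formula above.


In [ ]:
import numpy as np

num_pts = 200
toy3d = np.random.randn(3, num_pts)                  # rows are the x, y, z dimensions

toy3d_centred = toy3d - toy3d.mean(axis=1, keepdims=True)
S_manual = toy3d_centred @ toy3d_centred.T / (num_pts - 1)
S_numpy = np.cov(toy3d)                              # entry (i, j) is cov(dim_i, dim_j)
print(np.allclose(S_manual, S_numpy))                # True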

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix

Find the eigenvectors $\mathbf{e}^1,\ldots,\mathbf{e}^\text{M}$ of the covariance matrix $\mathbf{S}$.

Shogun's way of doing things :

Step 3 and Step 4 are directly implemented by the PCA preprocessor of the Shogun toolbox. The transformation matrix is essentially a $\text{D}$$\times$$\text{M}$ matrix, the columns of which correspond to the eigenvectors of the covariance matrix $(\text{X}\text{X}^\text{T})$ having the top $\text{M}$ eigenvalues.


In [ ]:
#Get the eigenvectors(We will get two of these since we set the target to 2). 
E = preprocessor.get_transformation_matrix()

#Get all the eigenvalues returned by PCA.
eig_value=preprocessor.get_eigenvalues()

e1 = E[:,0]
e2 = E[:,1]
eig_value1 = eig_value[0]
eig_value2 = eig_value[1]

Step 5: Choosing components and forming a feature vector.

Let's visualize the eigenvectors and decide which one to choose as the $principal$ $component$ of the data set.


In [ ]:
#find out the M eigenvectors corresponding to top M number of eigenvalues and store it in E
#Here M=1

#slope of e1 & e2
m1=e1[1]/e1[0]
m2=e2[1]/e2[0]

#generate the two lines
x1=range(-50,50)
x2=x1
y1=multiply(m1,x1)
y2=multiply(m2,x2)

In [ ]:
#plot the data along with those two eigenvectors
figure, axis = subplots(1,1)
xlim(-50, 50)
ylim(-50, 50)
axis.plot(x[:], y[:],'o',color='green', markersize=5, label="green")
axis.plot(x1[:], y1[:], linewidth=0.7, color='black')
axis.plot(x2[:], y2[:], linewidth=0.7, color='blue')
p1 = Rectangle((0, 0), 1, 1, fc="black")
p2 = Rectangle((0, 0), 1, 1, fc="blue")
legend([p1,p2],["1st eigenvector","2nd eigenvector"],loc='center left', bbox_to_anchor=(1, 0.5))
title('Eigenvectors selection')
xlabel("x axis")
_=ylabel("y axis")

In the above figure, the blue line is a good fit of the data. It shows the most significant relationship between the data dimensions. It turns out that the eigenvector with the $highest$ eigenvalue is the $principal$ $component$ of the data set. Form the matrix $\mathbf{E}=[\mathbf{e}^1,...,\mathbf{e}^M].$ Here $\text{M}$ represents the target dimension of our final projection.


In [ ]:
#The eigenvector corresponding to the larger eigenvalue (i.e. eig_value2) is chosen (i.e. e2).
#E is the feature vector.
E=e2

Step 6: Projecting the data to its Principal Components.

This is the final step in PCA. Once we have chosen the components (eigenvectors) that we wish to keep and formed a feature vector, we simply take its transpose and multiply it on the left of the mean-adjusted dataset. The lower dimensional representation of each data point $\mathbf{x}^n$ is given by

$\mathbf{y}^n=\mathbf{E}^T(\mathbf{x}^n-\mathbf{m})$

Here $\mathbf{E}^T$ is the matrix with the eigenvectors in its rows, the most significant eigenvector at the top. It is multiplied by the mean-adjusted data, which has one data item per column and one dimension per row.
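
For reference, the same projection can be written directly in NumPy, reusing the chosen eigenvector and the means obtained from the preprocessor; this should reproduce (up to numerical precision) the row yn1 selected from the Shogun output in the next cell.


In [ ]:
import numpy as np

#the mean that was subtracted during init(), and the chosen eigenvector E (= e2)
mean_vec = np.array([mean_x, mean_y]).reshape(2, 1)
y_manual = np.dot(E, twoD_obsmatrix - mean_vec)    # y^n = E^T (x^n - m), one value per data point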

Shogun's way of doing things :

Step 6 can be performed by shogun's PCA preprocessor as follows:

The transformation matrix that we got after $\text{init()}$ is used to transform all $\text{D}$-dimensional feature matrices supplied, via the $\text{apply_to_feature_matrix}$ method. This transformation outputs the $\text{M}$-dimensional approximation of all these input vectors and matrices (where $\text{M} \leq \text{min(D,N)}$).


In [ ]:
#transform all 2-dimensional feature matrices to target-dimensional approximations.
yn=preprocessor.apply_to_feature_matrix(train_features)

#Since here we are manually picking the eigenvector corresponding to the top eigenvalue,
#the 2nd row of yn is chosen as it corresponds to the required eigenvector e2.
yn1=yn[1,:]

Step 5 and Step 6 can be applied directly with Shogun's PCA preprocessor (as in the next examples). They have been done manually here to show the step-by-step workings of Principal Component Analysis.

Step 7: Form the approximate reconstruction of the original data $\mathbf{x}^n$

The approximate reconstruction of the original datapoint $\mathbf{x}^n$ is given by : $\tilde{\mathbf{x}}^n\approx\text{m}+\mathbf{E}\mathbf{y}^n$


In [ ]:
x_new=(yn1 * E[0]) + tile(mean_x,[n,1]).T[0]
y_new=(yn1 * E[1]) + tile(mean_y,[n,1]).T[0]

The new data is plotted below


In [ ]:
figure, axis = subplots(1,1)
xlim(-50, 50)
ylim(-50, 50)

axis.plot(x[:], y[:],'o',color='green', markersize=5, label="green")
axis.plot(x_new, y_new, 'o', color='blue', markersize=5, label="red")
title('PCA Projection of 2D data into 1D subspace')
xlabel("x axis")
ylabel("y axis")

#add some legend for information
p1 = Rectangle((0, 0), 1, 1, fc="r")
p2 = Rectangle((0, 0), 1, 1, fc="g")
p3 = Rectangle((0, 0), 1, 1, fc="b")
legend([p1,p2,p3],["normal projection","2d data","1d projection"],loc='center left', bbox_to_anchor=(1, 0.5))

#plot the projections in red:
for i in range(n):
    axis.plot([x[i],x_new[i]],[y[i],y_new[i]] , color='red')

PCA on 3D data

Step 1: Get some data

We generate points from a plane and then add random noise orthogonal to it. The general equation of a plane is: $$\text{a}\mathbf{x}+\text{b}\mathbf{y}+\text{c}\mathbf{z}+\text{d}=0$$


In [ ]:
rcParams['figure.figsize'] = 8,8 
#number of points
n=100

#generate the data
a=random.randint(1,20)
b=random.randint(1,20)
c=random.randint(1,20)
d=random.randint(1,20)

x1=random.random_integers(-20,20,n)
y1=random.random_integers(-20,20,n)
z1=-(a*x1+b*y1+d)/c

#generate the noise
noise=random.random_sample([n])*random.random_integers(-30,30,n)

#the normal unit vector is [a,b,c]/magnitude
magnitude=sqrt(square(a)+square(b)+square(c))
normal_vec=array([a,b,c]/magnitude)

#add the noise orthogonally
x=x1+noise*normal_vec[0]
y=y1+noise*normal_vec[1]
z=z1+noise*normal_vec[2]
threeD_obsmatrix=array([x,y,z])

In [ ]:
#to visualize the data, we must plot it.
from mpl_toolkits.mplot3d import Axes3D

fig = pyplot.figure()
ax=fig.add_subplot(111, projection='3d')

#plot the noisy data generated by distorting a plane
ax.scatter(x, y, z,marker='o', color='g')

ax.set_xlabel('x label')
ax.set_ylabel('y label')
ax.set_zlabel('z label')
legend([p2],["3d data"],loc='center left', bbox_to_anchor=(1, 0.5))
title('Two dimensional subspace with noise')
#plot the plane from which the data was generated
xx, yy = meshgrid(range(-30,30), range(-30,30))
zz=-(a * xx + b * yy + d) / c
ax.plot_surface(xx, yy, zz, alpha=0.1, color='red')

Step 2: Subtract the mean.


In [ ]:
#convert the observation matrix into dense feature matrix.
train_features = RealFeatures(threeD_obsmatrix)

#PCA(EVD) is chosen since N=100 and D=3 (N>D).
#However we can also use PCA(AUTO) as it will automagically choose the appropriate method. 
preprocessor = PCA(EVD)

#If we set the target dimension to 2, Shogun would automagically preserve the required 2 eigenvectors(out of 3) according to their
#eigenvalues.
preprocessor.set_target_dim(2)
preprocessor.init(train_features)

#get the mean for the respective dimensions.
mean_datapoints=preprocessor.get_mean()
mean_x=mean_datapoints[0]
mean_y=mean_datapoints[1]
mean_z=mean_datapoints[2]

Step 3 & Step 4: Calculate the eigenvectors of the covariance matrix


In [ ]:
#get the required eigenvectors corresponding to top 2 eigenvalues.
E = preprocessor.get_transformation_matrix()

Step 5: Choosing components and forming a feature vector.

Since we performed PCA with target $\dim = 2$ for the $3$-dimensional data, we are directly given the two required eigenvectors in $\mathbf{E}$.

E is automagically filled by setting target dimension = M. This is different from the 2d data example where we implemented this step manually.

Step 6: Projecting the data to its Principal Components.


In [ ]:
#This can be performed by shogun's PCA preprocessor as follows:
yn=preprocessor.apply_to_feature_matrix(train_features)

Step 7: Form the approximate reconstruction of the original data $\mathbf{x}^n$

The approximate reconstruction of the original datapoint $\mathbf{x}^n$ is given by : $\tilde{\mathbf{x}}^n\approx\text{m}+\mathbf{E}\mathbf{y}^n$


In [ ]:
new_data=dot(E,yn)

x_new=new_data[0,:]+tile(mean_x,[n,1]).T[0]
y_new=new_data[1,:]+tile(mean_y,[n,1]).T[0]
z_new=new_data[2,:]+tile(mean_z,[n,1]).T[0]

In [ ]:
#all the above points lie on the same plane. To make it more clear we will plot the projection also.

fig=pyplot.figure()
ax=fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z,marker='o', color='g')
ax.set_xlabel('x label')
ax.set_ylabel('y label')
ax.set_zlabel('z label')
legend([p1,p2,p3],["normal projection","3d data","2d projection"],loc='center left', bbox_to_anchor=(1, 0.5))
title('PCA Projection of 3D data into 2D subspace')

for i in range(100):
    ax.scatter(x_new[i], y_new[i], z_new[i],marker='o', color='b')
    ax.plot([x[i],x_new[i]],[y[i],y_new[i]],[z[i],z_new[i]],color='r')

PCA Performance

Up till now, we were using the Eigenvalue Decomposition method to compute the transformation matrix $(\text{N > D})$, but for the next example $(\text{N < D})$ we will be using Singular Value Decomposition.

Practical Example : Eigenfaces

The problem with the image representation we are given is its high dimensionality. Two-dimensional $\text{p} \times \text{q}$ grayscale images span an $\text{m}=\text{pq}$ dimensional vector space, so an image with $\text{100}\times\text{100}$ pixels already lies in a $\text{10,000}$ dimensional image space.

The question is, are all dimensions really useful for us?

$\text{Eigenfaces}$ are based on the dimensionality reduction approach of $\text{Principal Component Analysis (PCA)}$. The basic idea is to treat each image as a vector in a high dimensional space. Then, $\text{PCA}$ is applied to the set of images to produce a new, reduced subspace that captures most of the variability between the input images. The $\text{Principal Component Vectors}$ (eigenvectors of the sample covariance matrix) are called the $\text{Eigenfaces}$. Every input image can be represented as a linear combination of these eigenfaces by projecting the image onto the new eigenface space. Thus, we can perform the identification process by matching in this reduced space. An input image is transformed into the $\text{eigenspace,}$ and the nearest face is identified using a $\text{Nearest Neighbour approach.}$

Step 1: Get some data.

Here data means those Images which will be used for training purposes.


In [ ]:
rcParams['figure.figsize'] = 10, 10 
import os
def get_imlist(path):
    """ Returns a list of filenames for all jpg images in a directory"""
    return [os.path.join(path,f) for f in os.listdir(path) if f.endswith('.pgm')]

#set path of the training images
path_train=os.path.join(SHOGUN_DATA_DIR, 'att_dataset/training/')
#set no. of rows that the images will be resized.
k1=100
#set no. of columns that the images will be resized.
k2=100

filenames = get_imlist(path_train)
filenames = array(filenames)

#n is the total number of images to be analysed.
n=len(filenames)

Let's have a look at the data:


In [ ]:
# we will be using this often to visualize the images out there.
def showfig(image):
    imgplot=imshow(image, cmap='gray')
    imgplot.axes.get_xaxis().set_visible(False)
    imgplot.axes.get_yaxis().set_visible(False)
    
from PIL import Image
from scipy import misc

# to get a hang of the data, lets see some part of the dataset images.
fig = pyplot.figure()
title('The Training Dataset')

for i in range(49):
    fig.add_subplot(7,7,i+1)
    train_img=array(Image.open(filenames[i]).convert('L'))
    train_img=misc.imresize(train_img, [k1,k2])
    showfig(train_img)

Represent every image $I_i$ as a vector $\Gamma_i$


In [ ]:
#To form the observation matrix obs_matrix.
#read the 1st image.
train_img = array(Image.open(filenames[0]).convert('L'))

#resize it to k1 rows and k2 columns
train_img=misc.imresize(train_img, [k1,k2])

#since RealFeatures accepts only data of float64 datatype, we do a type conversion
train_img=array(train_img, dtype='double')

#flatten it to make it a row vector.
train_img=train_img.flatten()

# repeat the above for all images and stack all those vectors together in a matrix
for i in range(1,n):
    temp=array(Image.open(filenames[i]).convert('L'))    
    temp=misc.imresize(temp, [k1,k2])
    temp=array(temp, dtype='double')
    temp=temp.flatten()
    train_img=vstack([train_img,temp])

#form the observation matrix    
obs_matrix=train_img.T

Step 2: Subtract the mean

It is very important that the face images $I_1,I_2,...,I_n$ are $centered$ and of the $same$ size.

We observe here that the number of dimensions for each image is far greater than the number of training images. This calls for the use of $\text{SVD}$.

Setting the $\text{PCA}$ in the $\text{AUTO}$ mode does this automagically according to the situation.


In [ ]:
train_features = RealFeatures(obs_matrix)
preprocessor=PCA(AUTO)

preprocessor.set_target_dim(100)
preprocessor.init(train_features)

mean=preprocessor.get_mean()

Step 3 & Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix.


In [ ]:
#get the required eigenvectors corresponding to top 100 eigenvalues
E = preprocessor.get_transformation_matrix()

In [ ]:
#lets see how these eigenfaces/eigenvectors look like:
fig1 = pyplot.figure()
title('Top 20 Eigenfaces')

for i in range(20):
    a = fig1.add_subplot(5,4,i+1)
    eigen_faces=E[:,i].reshape([k1,k2])
    showfig(eigen_faces)

These 20 eigenfaces are not sufficient for a good image reconstruction. Having more eigenvectors gives us more flexibility in the number of faces we can reconstruct. Though we are adding vectors with low variance, they are in directions of change nonetheless, and an external image that is not in our database could in fact need these eigenvectors to get even relatively close to it. But at the same time we must also keep in mind that adding excessive eigenvectors contributes little or no variance while slowing down the process.

Clearly a tradeoff is required.

Here we set M = 100 (a quick way of sanity-checking this choice is sketched below).
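
One quick, optional check on this choice is to look at how much of the eigenvalue mass reported by the preprocessor is captured by the leading eigenfaces (assuming, as in the 2D example, that get_eigenvalues() returns the eigenvalues of the kept components).


In [ ]:
import numpy as np

eigvals_sorted = np.sort(np.array(preprocessor.get_eigenvalues()))[::-1]   # largest first
explained = np.cumsum(eigvals_sorted) / np.sum(eigvals_sorted)
print(explained[:10])       # cumulative fraction of the returned eigenvalue mass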

Step 5: Choosing components and forming a feature vector.

Since we set the target $\dim = 100$ for this high-dimensional data ($\text{D} = \text{k1}\times\text{k2} = \text{10,000}$ here), we are directly given the $100$ required eigenvectors in $\mathbf{E}$.

E is automagically filled. This is different from the 2d data example where we implemented this step manually.

Step 6: Projecting the data to its Principal Components.

The lower dimensional representation of each data point $\mathbf{x}^n$ is given by $$\mathbf{y}^n=\mathbf{E}^T(\mathbf{x}^n-\mathbf{m})$$


In [ ]:
#we perform the required dot product.
yn=preprocessor.apply_to_feature_matrix(train_features)

Step 7: Form the approximate reconstruction of the original image $I_n$

The approximate reconstruction of the original datapoint $\mathbf{x}^n$ is given by : $\tilde{\mathbf{x}}^n\approx\text{m}+\mathbf{E}\mathbf{y}^n$


In [ ]:
re=tile(mean,[n,1]).T[0] + dot(E,yn)

In [ ]:
#lets plot the reconstructed images.
fig2 = pyplot.figure()
title('Reconstructed Images from 100 eigenfaces')
for i in range(1,50):
    re1 = re[:,i].reshape([k1,k2])
    fig2.add_subplot(7,7,i)
    showfig(re1)

Recognition part.

In our face recognition process using the Eigenfaces approach, in order to recognize an unseen image, we proceed with the same preprocessing steps as applied to the training images. Test images are represented in terms of eigenface coefficients by projecting them into the face space $\text{(eigenspace)}$ calculated during training. The test sample is recognized by measuring the similarity distance between it and all the samples in the training set. The similarity measure is a distance metric calculated between two vectors. The traditional Eigenface approach utilizes the $\text{Euclidean distance}$.


In [ ]:
#set path of the testing images
path_train=os.path.join(SHOGUN_DATA_DIR, 'att_dataset/testing/')
test_files=get_imlist(path_train)
test_img=array(Image.open(test_files[0]).convert('L'))

rcParams.update({'figure.figsize': (3, 3)})
#we plot the test image , for which we have to identify a good match from the training images we already have
fig = pyplot.figure()
title('The Test Image')
showfig(test_img)

In [ ]:
#We flatten out our test image just the way we have done for the other images
test_img=misc.imresize(test_img, [k1,k2])
test_img=array(test_img, dtype='double')
test_img=test_img.flatten()

#We centralise the test image by subtracting the mean from it.
test_f=test_img-mean

Here we have to project our training images as well as the test image onto the PCA subspace.

The Eigenfaces method then performs face recognition by:

  1. Projecting all training samples into the PCA subspace.
  2. Projecting the query image into the PCA subspace.
  3. Finding the nearest neighbour between the projected training images and the projected query image.

In [ ]:
#We have already projected our training images into pca subspace as yn.
train_proj = yn

#Projecting our test image into pca subspace
test_proj = dot(E.T, test_f)

Shogun's way of doing things :

Shogun uses the CEuclideanDistance class to compute the familiar Euclidean distance for real valued features. It computes the square root of the sum of squared differences between the corresponding feature dimensions of two data points:

$d(\mathbf{x},\mathbf{x}')=\sqrt{\sum\limits_{i=1}^{n}|x_i-x'_i|^2}$
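
The same nearest-neighbour search can also be written directly in NumPy as a cross-check of the Shogun computation in the next cell (train_proj and test_proj come from the cell above).


In [ ]:
import numpy as np

dists = np.linalg.norm(np.asarray(train_proj) - np.asarray(test_proj).reshape(-1, 1), axis=0)
print(dists.argmin())       # index of the closest training image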


In [ ]:
#To get the Euclidean distance as the distance measure, use EuclideanDistance.
workfeat = RealFeatures(mat(train_proj))
testfeat = RealFeatures(mat(test_proj).T)
RaRb=EuclideanDistance(testfeat, workfeat)

#The distances between the test image and all the training images are stacked in matrix d.
d=empty([n,1])
for i in range(n):
    d[i]= RaRb.distance(0,i)
   
#The one having the minimum distance is found out
min_distance_index = d.argmin()
iden=array(Image.open(filenames[min_distance_index]))
title('Identified Image')
showfig(iden)

References:

[1] David Barber. Bayesian Reasoning and Machine Learning.

[2] Lindsay I Smith. A tutorial on Principal Component Analysis.

[3] Philipp Wagner. Face Recognition with GNU Octave/MATLAB.