Principle Component Analysis

Comparing different methods to reduce dimensionsionality of data; from orthogonal projections in conventional PCA to Gaussian Process Latent Variable Model.

Say we have N of D dimensional data points so our dataset will be $X \in R^{D \times N}$ with covariance $\Sigma \in R^{D \times D}$. We want the best ideas to reduce dimensionality of our dataset so that in the end we have $X' \in R^{q \times N}$ where $q << D$.

We will go through Conventional PCA based on projection of the data into its principle subspace, then we will see Probbilistic PCA and finally we will see a python implementation of Gaussian Process Latent Variable Model (GPLVM).

Conventional PCA

One idea is to project all the points into a direction which retains the most variance in our dataset, and that was the original idea of PCA.

If you feel comfortable with the covariance matrix and the related transformation it can apply to the data you can continue on. But if you want to get an image of the idea take a look here.

In 2D space we know that any data can be made by transformation of the white data (drawn from 2D normal RV with mean zero and diagonal covariance matrix with diagonal elements or variances equal 1). This transformation includes a scaling S and a rotation R. So transformation $T = RS$. If the covariance of the white data is $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix}$ and $D' = TD$ then $\Sigma' = RSSR^{-1}$. On the other hand $\Sigma' = VLV^{-1}$ where L is the diagonal matrix with eigenvalues and each column of V includes the respective eigen vectors. After equating previous two relations eigenvectors V are seen as only rotations and eigenvalues L are the scale in that direction. If we wanted to capture just the highest amont of variance in our data then this unit vector will be the eigenvector corresponding to the largest eigenvalue of the covariance matrix of the original dataset. So if that direction can be shown by a vector $U \in R^{D \times 1}$ (q=1) then $ X' = U^TX$ and $ X' \in R^{1 \times N} $. This statement can be more formarly followed in Strang Linear Algebra page 475; ISBN: 9780980232714. To preserve more than 1 dimension we can either sequentially find the largest eigenvector and falt out data in that direction by subtracting it from each datapoint and continue this procedure upto the desired dimension or equally take the first largest eigenvalues and the corresponding eigenvectors and transform data in their direction.

Lets have an example:


In [1]:
import numpy as np

import matplotlib.pyplot as plt
from scipy.optimize import minimize
from scipy.io import loadmat
from scipy.stats import multivariate_normal


from scipy.linalg import det
from scipy.linalg import pinv2 as inv #pinv uses linalg.lstsq algorithm while pinv2 uses SVD
from scipy.linalg import sqrtm
from numpy.linalg import svd
from scipy.linalg import norm

%matplotlib inline
%load_ext autoreload
%autoreload 2
%autosave 0


Autosave disabled

We will work with MNIST handwritten digits dataset. We will work with zeros and ones. Dataset includes ~6K images of $28 \times 28 = 784$ pixels for each digit. Through out the note we will reduce the dimensionality to only 2 dimensions and we will try to classify zeros and ones in the visualizable 2D space and measure the performance based on highest accuracy achievable.


In [2]:
mnist_train = loadmat('mnist_train.mat')['train']
X0 = mnist_train[0,9]
X1 = mnist_train[0,0]
X = np.hstack([X0,X1])

sample0 = X0[:,0].reshape((28,28))
sample1 = X1[:,0].reshape((28,28))
f, axarr = plt.subplots(1,2, sharex=True)
axarr[0].imshow(sample0,cmap='gray');axarr[0].set_title('Sample 0')
axarr[1].imshow(sample1,cmap='gray');axarr[1].set_title('Sample 1')
axarr[0].axis('off')
axarr[1].axis('off')
plt.show()


PCA code:


In [3]:
# Conventional PCA computations
def computeCovariance(X):
    '''computing the covariance function'''
    # center data and compute the covariances
    N = X.shape[1];
    originalD = X.shape[0] #784 is the original dimension
    M = X.mean(axis = 1).reshape(originalD,1) #mean of all pixels in the dataset
    centered_X = X-M
    COV = (np.dot((X-M),(X-M).T.conj()))/(N-1)
    return COV
    
def PCA(X,newD=2):
    ''' newD: we want to show all data in 2D
        D_ld: low dimenstional data
        '''
    #COV = computeCovariance(X)
    COV = np.cov(X)
    U,S,V = svd(COV)

    X_ld = np.dot(U[:,0:newD].T,X)
    return U,X_ld

Visualization of the results


In [4]:
#visualization of low dimensional data
def visualize_lowD_01(D_ld,N0,N1):
    D0_ld = D_ld[:,0:N0]
    D1_ld = D_ld[:,N0+1:]
    f, ax = plt.subplots()

    ax.plot(D0_ld[0,:],D0_ld[1,:],'b.')
    ax.plot(D1_ld[0,:],D1_ld[1,:],'r.')
    ax.set_title('0:blue, 1:red')

    plt.show()

In [5]:
m = 2 # project data into 2D

N = X.shape[1];N0 = X0.shape[1];N1 = X1.shape[1]
d = X.shape[0];
Xmu = np.mean(X, axis=1).reshape(d,-1)

U, _ = PCA(X)
Xpca = np.dot(U[:,0:m].T,X)

visualize_lowD_01(Xpca,N0,N1)


Two clusters of data points should be visible in the above plot.We check how easily we can classify using MAP estimator assuimg equally probable classes with bivarite normal distribution:


In [6]:
# MAP classifier to check the results in 2D
def MAP_01_classifier(X,N0,N1):
    '''If we have zeros of 1s and 0s how easy it can be to 
    detect them with equiprobable bivariate normals. The higher the better'''
    X0 = X[:,0:N0]
    X1 = X[:,N0+1:]
    
    M0 = np.mean(X0,axis = 1)
    M1 = np.mean(X1,axis = 1)

    COV0 = np.cov(X0)
    COV1 = np.cov(X1)

    mvn0 = multivariate_normal(M0,COV0)
    mvn1 = multivariate_normal(M1,COV1)

    P00 =  mvn0.pdf(X0.T)
    P10 = mvn1.pdf(X0.T)
    sumP = P00+P10
    estim_zero = P00/sumP > P10/sumP
    

    P01 =  mvn0.pdf(X1.T)
    P11 = mvn1.pdf(X1.T)
    sumP = P01+P11
    estim_one = P01/sumP < P11/sumP
    
    acc0 =  sum(estim_zero)/float(N0)
    acc1 = sum(estim_one)/float(N1)
    
    return acc0,acc1

In [7]:
print MAP_01_classifier(Xpca,N0,N1)


(0.99831166638527769, 0.98976564817561552)

We reconstruct the data from this low dimentional information and see how good the result might be:


In [8]:
# distance between the reconstructed image and the original in average
def reconstruction_err(D,D_hat): return np.mean(np.sqrt(np.sum((D-D_hat)**2,axis=0)))/np.mean(np.sqrt(np.sum((D)**2,axis=0)))

In [9]:
Xpca_hat = np.dot(U[:,0:2],Xpca)

print reconstruction_err(X,Xpca_hat)


20.1627719454

In [10]:
#visualization of 2D low dimentional space
def visualize_reconstructed_01(D,D_hat,N0,N1):
    sample0 = D[:,0].reshape((28,28))
    sample1 = D[:,N0+1].reshape((28,28))
    sample0_hat = D_hat[:,0].reshape((28,28))
    sample1_hat = D_hat[:,N0+1].reshape((28,28))
    f, axarr = plt.subplots(2,2)
    axarr[0,0].imshow(sample0,cmap='gray');axarr[0,0].set_title('Sample 0')
    axarr[0,1].imshow(sample1,cmap='gray');axarr[0,1].set_title('Sample 1')
    axarr[1,0].imshow(sample0_hat,cmap='gray');axarr[1,0].set_title('Reconstruction')
    axarr[1,1].imshow(sample1_hat,cmap='gray');axarr[1,1].set_title('Reconstruction')

    axarr[0,0].axis('off')
    axarr[0,1].axis('off')
    axarr[1,0].axis('off')
    axarr[1,1].axis('off')
    plt.show()

In [11]:
visualize_reconstructed_01(X,Xpca_hat,N0,N1)


Probabilistic Principle Component Analysis - PPCA

Conventional PCA can be seen as a projection while PPCA is a mapping. This mapping is from the latent space into data sapce: $ X =WZ+\epsilon $ where $Z$ is considered M dimentional latent vriable and $\epsilon$ is zero-mean isotropic noise. Implementations here closely flow PRML book by bishop pages [587-596].


In [12]:
# Probabilistic Principal Component Analysis, Tipping and Bishop 1999
def compute_E_completeDataLL(X,W,s2):
    d, N = X.shape
    m = W.shape[1]
    
    M = np.dot(W.T,W)+s2*np.eye(m)
    Minv = inv(M)

    ll = 0.0
    for n in xrange(N):
        Xn = X[:, n].reshape(-1,1)
        
        Ezn = np.dot(np.dot(Minv,W.T),Xn)
        Ezzn = (s2)*Minv+np.dot(Ezn,Ezn.T)
        ll += 0.5 * np.trace(Ezzn)
        ll += 0.5 * d * np.log(2*np.pi*s2)
        ll += 0.5 * np.dot(Xn.T,Xn)/2.
        ll -= np.dot(np.dot(Ezn.T,W.T),Xn)/float(s2)
        ll += 0.5 * np.trace(np.dot(Ezzn,np.dot(W.T,W)))/float(s2)
        
    ll *= -1.0

    return ll/N
def compute_Wnew(X,W,s2):
    d, N = X.shape
    m = W.shape[1]
    
    M = np.dot(W.T,W)+s2*np.eye(m)
    Minv = inv(M)

    term1, term2 = 0.,0.
    for n in xrange(N):
        Xn = X[:, n].reshape(-1,1)
        
        Ezn = np.dot(np.dot(Minv,W.T),Xn)
        Ezzn = (s2)*Minv+np.dot(Ezn,Ezn.T)
        
        term1 += np.dot(Xn,Ezn.T)
        term2 += Ezzn
        
    Wnew = np.dot(term1,term2)
    return Wnew
    
def compute_s2new(X,W,s2):
    d, N = X.shape
    m = W.shape[1]
    
    M = np.dot(W.T,W)+s2*np.eye(m)
    Minv = inv(M)

    s2new = 0.
    for n in xrange(N):
        Xn = X[:, n].reshape(-1,1)
        
        Ezn = np.dot(np.dot(Minv,W.T),Xn)
        Ezzn = (s2)*Minv+np.dot(Ezn,Ezn.T)
        
        s2new += np.dot(Xn.T,Xn)
        s2new -= 2*np.dot(Ezn.T,np.dot(W.T,Xn))
        s2new += np.trace(np.dot(Ezzn,np.dot(W.T,W)))
        
    s2new = s2new/float(N*d)
    return s2new
def ppca(X,mode = 'ML',m=2,em_maxit=200,tol=1e-4):
    '''X: observed variable D*N
       Z: latent variable M*N
       m: # output dimension
       W: D*m
       M: m*m
       '''
    d, N = X.shape
    X = X - np.mean(X, axis=1).reshape(d,-1)
    I = np.eye(m);

    if mode == 'ML':
        
        S = np.cov(X)
        U,L,_ = svd(S)

        Um = U[:,0:m]
        Lm = L[0:m]*I# to diagonalize it
        
        sigma2 = np.sum(L[m+1:])/(d-m)

        Lm_temp = np.sqrt(np.maximum(0, Lm- sigma2*I))#sqrtm
        W = np.dot(Um,Lm_temp)

        M = np.dot(W.T,W)+sigma2*I
        
        muML = np.dot(inv(M).dot(W.T),X)
        sigmaML = (sigma2**-1)*M
        
        return W,muML,sigmaML
    if mode == 'EM':
        W = np.random.normal(size=(d,m)) 
        s2 = np.random.rand()
        S = np.cov(X)
        Ellh = compute_E_completeDataLL(X,W,s2)
        for i in np.arange(em_maxit):
            
            M = W.T.dot(W)+s2*np.eye(m)
            Minv = inv(M)

            temp = inv(s2*np.eye(m)+Minv.dot(W.T).dot(S).dot(W))
            Wnew = S.dot(W).dot(temp)
            #Wnew = compute_Wnew(X,W,s2)
            s2new = np.trace(S - S.dot(W).dot(Minv).dot(Wnew.T))/d
            #s2new = compute_s2new(X,W,s2)
            
            Ellhnew = compute_E_completeDataLL(X,Wnew,s2new)
            
            W = Wnew
            s2 = s2new
            
            if abs(Ellhnew-Ellh) < tol*abs(Ellh):
                print '<LL> Converged at %d Iteration.'%i 
                break; 
            elif i%10==0:
                print  'Iter #%d, Diff(<LL>) = %2.2f'%(i,abs(Ellhnew-Ellh))  
                
            Ellh = Ellhnew
            
        M = np.dot(W.T,W)+s2*np.eye(m)
        Minv = inv(M)
    
        muEM = np.dot(np.dot(Minv,W.T),X)
        sigmaEM = (s2**-1)*M
        return W,muEM,sigmaEM,s2

In [13]:
wML,muML,sigmaML = ppca(X,m=2, mode='ML')
wEM,muEM,sigmaEM,s2 = ppca(X,m=2,mode='EM',tol=1e-7)


Iter #0, Diff(<LL>) = 70947.90
Iter #10, Diff(<LL>) = 1.19
Iter #20, Diff(<LL>) = 0.36
Iter #30, Diff(<LL>) = 0.09
<LL> Converged at 32 Iteration.

In [14]:
Xppca_hat = wEM.dot(muEM)
visualize_reconstructed_01(X,Xppca_hat,N0,N1)


We can generate new images with our generative model $X = Wz + \epsilon $


In [15]:
#Generating new images
z = np.random.normal(size=(m,N)) 
epsi = np.random.multivariate_normal(Xmu[:,0],s2*np.eye(d)).reshape(-1,1)
Xppca_generated_image = wEM.dot(z)+epsi
visualize_reconstructed_01(X,Xppca_generated_image,N0,N1)



In [16]:
#visualize_lowD_01(muML,N0,N1)
visualize_lowD_01(muEM,N0,N1)
visualize_lowD_01(muEM,N0,N1)

print MAP_01_classifier(muML,N0,N1)
print MAP_01_classifier(muEM,N0,N1)
print MAP_01_classifier(Xpca,N0,N1)


(0.99831166638527769, 0.98976564817561552)
(0.99831166638527769, 0.98961732423613169)
(0.99831166638527769, 0.98976564817561552)

We observe that the converged PPCA with EM algorithems visually looks the same as ML PPCA.

Gaussian Process Latent Variable Model

Here we will start our investigation of GPLVM with simple linear covariance kernel function. A matlab implementation is available in the directory but here we will use a much stable python implementation in GPy for demonstrating GPLVM.


In [19]:
import GPy 

q = 2
k = GPy.kern.RBF(q, ARD=True) #+ kern.Linear(q, ARD=True) + kern.White(q, np.exp(-2))  + kern.Bias(q)

Xsubset = np.hstack([X0[:,0:50],X1[:,0:50]])

GPmodel = GPy.models.GPLVM(Xsubset.T, q, init="PCA", kernel=k)#first 200 training points

GPmodel.optimize('bfgs', messages=1, max_iters=2e10, gtol=.05)

In [23]:
Xgplvm = GPmodel.latent_mean
visualize_lowD_01(Xgplvm.T,50,50)


What has happened is that we made a dual form of PPCA but put a prior on parameters and marginilized the latent space. The we maximized the likelihood with respect to latent space and GP paramters. Next we visualize the results. The kernel for the GP has the size of $N \times N$ which seemingly didn't fit to our RAM so we only try simple GPLVM on 200 randomly selected subset of trainig set.

References

  • Christopher Bishop, Pattern Recognition and Machine Learning
  • Tipping, M. E., & Bishop, C. M. (1999). Probabilistic Principal Component Analysis