In this notebook, we give an introduction to dimension reduction using Principal Component Analysis (PCA) and Multidimensional Scaling (MDS).
As emphasized in the book, there are two ways of looking at a dataset $X_{L \times N}$, with $L$ the number of samples or models and $N$ the dimension of each model: one in the variable space ($\mathbb{R}^N$) and the other in the sample space ($\mathbb{R}^L$) (Scheidt et al., 2017). Accordingly, two types of dimension reduction correspond to these two spaces: PCA reduces the dimension in the variable space, while MDS reduces the dimension in the sample space.
Here, we show an example processing 136 2D flume experiment images, each 260 rows by 300 columns.
Before performing PCA, the images need to be reshaped into a 2D matrix (data, 136 rows by 78000 columns), with each row representing one image. This data matrix is loaded directly here.
In [1]:
addpath('../data/dimred/');
load('data.mat')
nmodels = 136; % number of flume images
nx=300; % columns of flume image grid
ny=260; % rows of flume image grid
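For reference, the data matrix could be assembled from the raw images as in the following sketch. The variable images here is hypothetical (an ny-by-nx-by-nmodels array holding the flume photographs) and is not provided in data.mat:
In [ ]:
% Sketch only: images is a hypothetical ny-by-nx-by-nmodels array of the raw flume images
data_sketch = zeros(nmodels, nx*ny);
for i = 1:nmodels
    data_sketch(i,:) = reshape(images(:,:,i), 1, []); % column-major flattening, one image per row
end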
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, termed "principal components". The transformation is defined in such a way that the first principal component explains the largest variability, and each succeeding component explains the largest remaining variability under the constraint of being orthogonal to the preceding components. If only the first few principal components are retained, the dimension can be reduced with only a small amount of variability lost (Jolliffe, 2002).
Principal components are linear combinations of the original variables, and one can show that the coefficients of these combinations are the eigenvectors of the covariance matrix of the original data. In MATLAB, we can perform PCA directly as follows:
In [2]:
[pcs,score,latent,~,explained] = pca(data); % coefficients, scores, eigenvalues, and percent variance explained
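Since the coefficients are the eigenvectors of the covariance matrix, the same result can be obtained from an SVD of the centered data, which avoids forming the 78000-by-78000 covariance matrix explicitly. A minimal sketch (the names pcs2, score2, latent2, explained2 are placeholders; the columns match the pca outputs up to sign):
In [ ]:
Xc = bsxfun(@minus, data, mean(data));  % center each variable (column)
[U,S,V] = svd(Xc,'econ');               % economy-size SVD of the centered data
pcs2 = V;                               % principal component coefficients
score2 = U*S;                           % principal component scores
latent2 = diag(S).^2./(size(data,1)-1); % eigenvalues of the covariance matrix
explained2 = 100*latent2./sum(latent2); % percentage of variance explained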
A scree plot displays the eigenvalues (or, equivalently, the variance explained) against the principal component number, indicating how much variability each principal component explains. It helps determine an appropriate number of principal components to retain in dimension reduction. Here we plot the cumulative percentage of variance explained.
In [3]:
%plot inline -s 600,400
figure; % scree plot (cumulative variance explained)
X = 1:nmodels-1; % at most nmodels-1 nonzero components for 136 samples
Y = cumsum(explained); % cumulative percentage of variance explained
N = 25; % number of PCs to show in the scree plot
plot(X(1:N),Y(1:N),'*-b')
xlabel('Principal Component','Fontsize',15);
ylabel('Cumulative Variance Explained (%)','Fontsize',15);
set(gca,'fontsize',15)
In our case, the principal components obtained are themselves 2D images, and any realization in the original dataset can be approximately represented as a linear combination of these extracted principal components.
In [4]:
%plot inline -s 2000,1000
figure;
npc = 6; % number of principal components to display
for i=1:npc
subplot(2,3,i)
imagesc(reshape(pcs(:,i),ny,nx));
title(['PC',num2str(i)],'Fontsize',30)
set(gca,'fontsize',20)
axis image
axis off;
colorbar;
end
colormap(jet);
Principal component scores can be seen as the values of new variables representing the original high-dimensional data. One of the advantages is that the data or models can be visualized in a lower-dimensional space.
In [5]:
%plot inline -s 1200,800
figure;
plot(score(:,1),score(:,2),'o','Markersize',6,'MarkerFaceColor','b','MarkerEdgeColor',[44, 62, 80]./255)
xlabel(['1st Principal Component',' (',num2str(round(explained(1)*100)/100),'%)'],'Fontsize',15)
ylabel(['2nd Principal Component',' (',num2str(round(explained(2)*100)/100),'%)'],'Fontsize',15)
set(gca,'fontsize',15)
axis equal
colormap(jet);
grid on
PCA is a bijective procedure, meaning that we can use the retained principal components to reconstruct the original data. The more principal components are used in the reconstruction, the more closely the reconstructed data resembles the original data; as expected, the reconstructed data is identical to the original data if all the principal components are used.
In [6]:
%plot inline -s 1200,800
nextractpc = 50; % number of pcs used to reconstruct the original data
dataapprox = score(:,1:nextractpc)*pcs(:,1:nextractpc)'; % project the scores back to the variable space
datamean = mean(data); % pca works on centered data, so the mean must be added back
ReconstructData = bsxfun(@plus, datamean, dataapprox);
figure;
imagesc(reshape(ReconstructData(1,:), ny, nx));
title(['Reconstructed(# of PCs=',num2str(nextractpc),')'],'Fontsize',15)
set(gca,'fontsize',15)
colorbar;colormap(jet);
axis equal
XX=data(1,:);
originD=reshape(XX,ny,nx);
figure;
imagesc(originD)
title('Original image','Fontsize',15)
set(gca,'fontsize',15)
colorbar;colormap(jet);
axis equal
MDS is a widely used means to visualize high-dimensional objects by projecting them into a lower-dimensional space, i.e. one-, two-, or three-dimensional Cartesian space (Borg and Groenen, 2005). Classical multidimensional scaling, which takes as input a matrix of dissimilarities between pairs of items and outputs a coordinate matrix for those items, is used here. Note that a variety of dissimilarity measures can be chosen for the input distance matrix.
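To make the procedure concrete, here is a minimal sketch of what classical MDS (cmdscale, used below) computes for an n-by-n dissimilarity matrix D (such as the one loaded below), via double centering and an eigendecomposition; the names J, B, Vec, Ymds are placeholders:
In [ ]:
n = size(D,1);
J = eye(n) - ones(n)/n;                        % centering matrix
B = -0.5 * J * (D.^2) * J;                     % double-centered squared dissimilarities
[Vec,E] = eig((B+B')/2);                       % eigendecomposition (symmetrized for stability)
[e,idx] = sort(diag(E),'descend');             % eigenvalues in descending order
keep = e > 0;                                  % retain dimensions with positive eigenvalues
Ymds = Vec(:,idx(keep)) * diag(sqrt(e(keep))); % coordinate matrix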
The Hausdorff distance measures how far two subsets of a metric space are from each other; here we use a modified Hausdorff distance between images. Since computing the modified Hausdorff distance for all pairs of images takes a very long time, we directly load the precomputed modified Hausdorff distance matrix for the flume images.
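For two point sets A (m-by-2) and B (k-by-2), the modified Hausdorff distance can be computed as in the sketch below; A and B are hypothetical here, and the exact implementation behind the precomputed matrix may differ:
In [ ]:
% Sketch only: A and B are hypothetical point sets extracted from two images
DAB = pdist2(A, B);       % pairwise Euclidean distances between the two sets
d1 = mean(min(DAB,[],2)); % mean nearest-neighbor distance from A to B
d2 = mean(min(DAB,[],1)); % mean nearest-neighbor distance from B to A
d_mhd = max(d1, d2);      % modified Hausdorff distance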
In [7]:
load('MHD_TI_300x260.mat') % Modified Hausdorff distance matrix
D = d_mhd_ti;
In MATLAB, classical MDS can be carried out with a single call to cmdscale:
In [8]:
[Y,Eigvals0] = cmdscale(D); % classical MDS: coordinate matrix and eigenvalues
Y is the coordinate matrix, with rows representing items or objects and columns representing the coordinates; Eigvals0 contains the eigenvalues for all dimensions.
Usually only the first 2 or 3 columns are used to visualize the items.
In [9]:
exno = 2; % number of MDS dimensions used for visualization
approX = Y(:,1:exno);
addpath('../src/dimred/');
m_plotMDS(approX,1:136);
axis equal
Eigvals = abs(Eigvals0);
explained = Eigvals./sum(Eigvals);
xlabel([num2str(round(explained(1)*100)),'%'],'fontsize',15)
ylabel([num2str(round(explained(2)*100)),'%'],'fontsize',15)
title('Colored by time (min)','fontsize',15)
set(gca,'fontsize',15)
colorbar;
colormap(jet);
MDS is not a bijective procedure: infinitely many coordinate matrices correspond to the same dissimilarity matrix, since any rotation, reflection, or translation of the configuration preserves the pairwise distances. We can nevertheless reconstruct the distance matrix to find an appropriate number of dimensions to retain in further modeling.
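As a quick check of this non-uniqueness, rotating the 2D configuration leaves all pairwise distances unchanged (theta and R below are arbitrary):
In [ ]:
theta = pi/3;                                        % arbitrary rotation angle
R = [cos(theta) -sin(theta); sin(theta) cos(theta)]; % 2D rotation matrix
Yrot = Y(:,1:2) * R';                                % rotated configuration
max(abs(pdist(Y(:,1:2)) - pdist(Yrot)))              % ~0, up to round-off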
In [10]:
nextractpc = 30; % number of MDS dimensions retained
dataapprox = Y(:,1:nextractpc); % truncated coordinate matrix
reconsD = pdist(dataapprox); % pairwise distances in the reduced space
originD = squareform(D); % original dissimilarities in vector form
figure;
plot(reconsD,originD,'o','markerfacecolor','b','Markersize',4)
set(gca,'fontsize',15)
axis equal;
xlabel('Reconstructed distance')
ylabel('Original distance')
Scheidt, C., Li, L. and Caers, J. (2017). Quantifying Uncertainty in Subsurface Systems. Wiley-Blackwell.
Jolliffe, I. (2002). Principal Component Analysis. John Wiley & Sons, Ltd.
Borg, I. and Groenen, P. J. (2005). Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media.