Gaussian Process Road School, 16th February 2014, written by Max Zwiessele and Neil Lawrence
This lab session will focus on dimensionality reduction with Gaussian processes: principal component analysis (PCA), the Gaussian process latent variable model (GP-LVM), the Bayesian GP-LVM and manifold relevance determination (MRD).
As in the last two lab sessions, the first thing to do is set the plots to appear inline and import the relevant modules.
In [1]:
%pylab inline
import numpy as np
import pylab as pb
import GPy
import string
For this lab, we've created a dataset digits.npy containing handwritten digits from $0 \cdots 9$, provided by de Campos et al. [2009]. All digits were cropped and scaled down to an appropriate format.
You can retrieve the dataset as follows:
In [2]:
import urllib
urllib.urlretrieve('http://staffwww.dcs.sheffield.ac.uk/people/J.Hensman/gpsummer/Lab3.zip', 'Lab3.zip')
import zipfile
zip = zipfile.ZipFile('Lab3.zip', 'r')
for name in zip.namelist():
    zip.extract(name, '.')
from load_plotting import * # for plotting
We will only use some of the digits for the demonstrations in this lab class, but you can edit the code below to select different subsets of the digit data as you wish.
In [3]:
digits = np.load('digits.npy')
which = [0,1,2,6,7,9] # which digits to work on
digits = digits[which,:,:,:]
num_classes, num_samples, height, width = digits.shape
labels = np.array([[str(l)]*num_samples for l in which])
You can try plotting some of the samples using pb.matshow.
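A minimal sketch for doing so (the choice of sample index 0 is arbitrary):
# show one sample image per selected digit class (sample index 0 is an arbitrary choice)
for c in range(num_classes):
    pb.matshow(digits[c, 0], cmap=pb.cm.gray)
    pb.title('digit ' + labels[c, 0])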
Principal component analysis (PCA) finds a rotation of the observed outputs such that the rotated principal component (PC) space maximizes the variance of the observed data, with the components sorted from most to least variance explained.
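For intuition, here is a minimal, self-contained sketch of PCA via an eigendecomposition of the sample covariance of a toy matrix (all variable names below are illustrative; the lab's pca.py module performs these steps for you):
# illustrative PCA by eigendecomposition on a small toy matrix (rows = observations)
Y_toy = np.random.randn(100, 5)
Y_toy -= Y_toy.mean(axis=0)                    # centre the data
C = np.cov(Y_toy, rowvar=0)                    # 5 x 5 sample covariance
eig_vals, eig_vecs = np.linalg.eigh(C)         # eigendecomposition of the covariance
order = eig_vals.argsort()[::-1]               # sort from largest to smallest eigenvalue
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
X_2d = Y_toy.dot(eig_vecs[:, :2])              # projection onto the first two PCs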
In order to apply PCA in an easy way, we have included a PCA module in pca.py. You can import the module with import <path.to.pca> (without the .py ending!). To run PCA on the digits we have to reshape them (Hint: np.reshape).
We will call the reshaped observed outputs $\mathbf{Y}$ in the following.
In [4]:
Y = digits.reshape((digits.shape[0]*digits.shape[1],digits.shape[2]*digits.shape[3]))
Yn = Y-Y.mean()
Now let’s run PCA on the reshaped dataset $\mathbf{Y}$:
In [5]:
import pca
p = pca.PCA(Y) # create PCA class with digits dataset
The following plots show the fractions of variance explained by the leading eigenvalues and the two-dimensional representation of the digits in the principal component space.
In [6]:
p.plot_fracs(20) # plot first 20 eigenvalue fractions
p.plot_2d(Y,labels=labels.flatten(), colors=colors)
pb.legend()
Out[6]:
The Gaussian Process Latent Variable Model (GP-LVM) embeds PCA into a Gaussian process framework, where the latent inputs $\mathbf{X}$ are learnt as hyperparameters and the mapping variables $\mathbf{W}$ are integrated out. The advantage of this interpretation is that it allows PCA to be generalized in a non-linear way, by replacing the resulting linear covariance with a non-linear covariance. But first, let's see how the GP-LVM is equivalent to PCA, using an automatic relevance determination (ARD, see e.g. Bishop [2006]) linear kernel:
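Concretely, with a linear covariance the marginal likelihood that the GP-LVM optimizes with respect to $\mathbf{X}$ can be written (introducing $D$ for the number of output dimensions and $\sigma^2$ for the noise variance) as

$$p(\mathbf{Y}|\mathbf{X}) = \prod_{d=1}^{D} \mathcal{N}\!\left(\mathbf{y}_{:,d} \,\middle|\, \mathbf{0},\; \mathbf{X}\mathbf{X}^{\top} + \sigma^2\mathbf{I}\right),$$

whose maximum with respect to $\mathbf{X}$ recovers the principal subspace of the data (Lawrence, 2005).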
In [36]:
input_dim = 4 # How many latent dimensions to use
kernel = GPy.kern.Linear(input_dim, ARD=True) # ARD kernel
#kernel += GPy.kern.White(input_dim) + GPy.kern.Bias(input_dim)
m = GPy.models.GPLVM(Yn, input_dim=input_dim, kernel=kernel)
#m.ensure_default_constraints()
m['.*noise'] = m.Y.var()/20. # initialize noise to 5% of the data variance
m.optimize(messages=1, max_iters=1000) # optimize for 1000 iterations
print 'done'
m.kern.plot_ARD()
plot_model(m, m['.*linear.variances'].argsort()[-2:], labels.flatten())
pb.legend()
Out[36]:
In [38]:
print m
As you can see, the solution with a linear kernel matches the PCA solution up to rotations and axis flips.
For the sake of time, the model you see was only optimized for 1000 iterations, so it may not have fully converged yet. The GP-LVM proceeds by iterative optimization of the inputs to the covariance. As we saw in the lecture earlier, for the linear covariance these latent points can be found by solving an eigenvalue problem, but in general, for non-linear covariance functions, we are obliged to use gradient-based optimization.
a) How do your linear solutions differ between PCA and the GP-LVM with a linear kernel? Look at the plots, and also consider how the linear ARD parameters compare to the eigenvalues of the principal components.
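A sketch for this comparison (it assumes the pca object exposes its eigenvalue fractions as p.fracs; if your copy of pca.py names the attribute differently, inspect it with dir(p)):
# compare the learnt linear ARD variances with the PCA eigenvalue fractions
ard_variances = np.asarray(m['.*linear.variances']).flatten()
print np.sort(ard_variances)[::-1]            # largest ARD variance first
print np.sort(p.fracs)[::-1][:input_dim]      # assumed attribute holding the eigenvalue fractions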
b) The next step is to use a non-linear mapping between the inputs $\mathbf{X}$ and outputs $\mathbf{Y}$ by selecting the exponentiated quadratic (GPy.kern.RBF) covariance function.
In [39]:
kern = GPy.kern.RBF(input_dim)
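The cell above only constructs the covariance; a sketch of the remaining steps, mirroring the linear GP-LVM cell above (since this RBF kernel is not ARD, we simply plot the first two latent dimensions):
# build and optimize a GP-LVM with the exponentiated quadratic covariance
m = GPy.models.GPLVM(Yn, input_dim=input_dim, kernel=kern)
m['.*noise'] = m.Y.var()/20.   # initialize noise to 5% of the data variance
m.optimize(messages=1, max_iters=1000)
plot_model(m, np.array([0, 1]), labels.flatten())  # plot the first two latent dimensions
pb.legend()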
c) How does the non-linear model differ from the linear model? Are there digits that the GP-LVM with an exponentiated quadratic covariance can separate which PCA is not able to?
d) Try modifying the covariance function and running the model again. For example, you could try a combination of the linear and exponentiated quadratic covariance functions, or the Matérn 5/2. If you run into stability problems, try initializing the covariance function parameters differently.
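Two possible starting points for d) (sketches only; adjust the initializations if the optimization becomes unstable):
# a sum of a linear and an exponentiated quadratic covariance ...
kern = GPy.kern.Linear(input_dim, ARD=True) + GPy.kern.RBF(input_dim, ARD=True)
# ... or a Matern 5/2 covariance
# kern = GPy.kern.Matern52(input_dim, ARD=True)
m = GPy.models.GPLVM(Yn, input_dim=input_dim, kernel=kern)
m.optimize(messages=1, max_iters=1000)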
In the GP-LVM we use a point estimate of the inputs $\mathbf{X}$, obtained through maximum likelihood or through a maximum a posteriori (MAP) approach. Ideally, we would instead like to estimate a distribution over the inputs $\mathbf{X}$. In the Bayesian GP-LVM we approximate the true posterior $p(\mathbf{X}|\mathbf{Y})$ by a variational approximation $q(\mathbf{X})$ and integrate $\mathbf{X}$ out.
Approximating the posterior in this way allows us to optimize a lower bound on the marginal likelihood. Handling the uncertainty in a principled way allows the model to make an assessment of whether a particular latent dimension is required, or the variation is better explained by noise. This allows the algorithm to switch off latent dimensions. The switching off can take some time though, so below in Section 6 we provide a pre-learnt module, but to complete section 6 you'll need to be working in the IPython console instead of the notebook.
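Schematically, introducing the variational distribution $q(\mathbf{X})$ gives a lower bound on the log marginal likelihood,

$$\log p(\mathbf{Y}) \geq \mathbb{E}_{q(\mathbf{X})}\!\left[\log p(\mathbf{Y}|\mathbf{X})\right] - \mathrm{KL}\!\left(q(\mathbf{X})\,\|\,p(\mathbf{X})\right),$$

and it is this bound (in GPy, a version that additionally uses inducing points, which we omit here) that is optimized with respect to the parameters of $q(\mathbf{X})$ and the covariance function.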
For the moment we'll run a short experiment applying the Bayesian GP-LVM with an exponentiated quadratic covariance function.
In [41]:
# Model optimization
input_dim = 5 # How many latent dimensions to use
kern = GPy.kern.RBF(input_dim,ARD=True) # ARD kernel
m = GPy.models.BayesianGPLVM(Yn, input_dim=input_dim, kernel=kern, num_inducing=25)
# initialize noise as 1% of variance in data
m['.*noise'] = m.likelihood.Y.var()/100.
m.optimize('scg', messages=1, max_iters=1000)
In [ ]:
# Plotting the model
plot_model(m, m['.*rbf.lengthscale'].argsort()[:2], labels.flatten())
pb.legend()
m.kern.plot_ARD()
# Saving the model:
m.pickle('bgplvm_rbf.pickle')
Because we are now also considering the uncertainty in the model, this optimization can take some time. However, you are free to interrupt the optimization at any point by selecting Kernel->Interrupt
from the notebook menu. This will leave you with the model, m
, in its current state, and you can plot and inspect the model parameters.
a) How does the Bayesian GP-LVM compare with the standard model?
A good way of working with latent variable models is to interact with the latent dimensions, generating data. This is a little bit tricky in the notebook, so below in section 6 we provide code for setting up an interactive demo in the standard IPython shell. If you are working on your own machine you can try this now. Otherwise continue with section 5.
In Manifold Relevance Determination (MRD) we try to find one latent space that is common to $K$ observed output sets (modalities) $\{\mathbf{Y}_{k}\}_{k=1}^{K}$. Each modality is associated with a separate set of ARD parameters, so that it switches off different parts of the latent space; $\mathbf{X}$ is thereby softly segmented into parts that are private to some modalities and parts that are shared between all of them.
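In other words, the modalities are modelled as conditionally independent given the shared latent space,

$$p(\{\mathbf{Y}_k\}_{k=1}^{K} \mid \mathbf{X}) = \prod_{k=1}^{K} p(\mathbf{Y}_k \mid \mathbf{X}),$$

with each factor using its own ARD covariance, so a modality can assign (near-)zero weight to latent dimensions it does not need.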
Can you explain what happens in the following example? Again, you can stop the optimizer at any point and explore the result obtained from the training so far:
In [ ]:
m = GPy.examples.dimensionality_reduction.mrd_simulation(optimize = False)
m.optimize(messages = True, max_iters=1000, optimizer = 'bfgs')
In [ ]:
m.plot_scales()
m.plot_X_1d()
The simulated data set is generated from a sinusoid and a double-frequency sinusoid as the underlying input signals.
a) Which signal is shared across the three datasets?
b) Which are private?
c) Are there signals shared only between two of the three datasets?
The module below loads a pre-optimized Bayesian GP-LVM model (like the one you just trained) and allows you to interact with the latent space. Three interactive figures pop up: the latent space, the ARD scales, and a sample in the output space (corresponding to the currently selected latent point in the latent-space figure). You can sample with the mouse from the latent space and obtain samples in the output space. You can select different latent dimensions to vary by clicking on the corresponding scales with the left and right mouse buttons; this also causes the latent space to be projected onto the selected latent dimensions.
In [42]:
run load_bgplvm_dimension_select.py
In [43]:
import cPickle as pickle
with open('./digit_bgplvm_demo.pickle', 'rb') as f:
    m = pickle.load(f)
Prepare for plotting of this model. If you are running on a web server, the interactive plotting will not work; in that case you can skip the following code block and run it on your own machine later.
In [ ]:
fig = pb.figure('Latent Space & Scales', figsize=(16,6))
ax_latent = fig.add_subplot(121)
ax_scales = fig.add_subplot(122)
fig_out = pb.figure('Output', figsize=(1,1))
ax_image = fig_out.add_subplot(111)
fig_out.tight_layout(pad=0)
data_show = GPy.util.visualize.image_show(m.likelihood.Y[0:1, :], dimensions=(16, 16), transpose=0, invert=0, scale=False, axes=ax_image)
lvm_visualizer = GPy.util.visualize.lvm_dimselect(m.X.copy(), m, data_show, ax_latent, ax_scales, labels=labels.flatten())
Observations
Confirm the following observations by interacting with the demo:
You can also run the dimensionality reduction example
In [ ]:
GPy.examples.dimensionality_reduction.bgplvm_simulation()
Questions
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
T. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In International Conference on Computer Vision Theory and Applications (VISAPP), 2009.
N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, 2005.