Tutorial on Bayesian Optimization

Javier Gonzalez (j.h.gonzalez@sheffield.ac.uk), University of Sheffield.

The basics

Bayesian optimization (BO) is a strategy for the global optimization of black-box functions. For instance, consider a Lipschitz continuous function $f(x)$ defined on a domain $\mathcal{X}$. BO aims to obtain

$$x^* = \arg \max_{x \in \mathcal{X}} f(x).$$

There are two crucial ingredients in any Bayesian optimization (BO) procedure.

  • Define a prior probability measure on $f$: this prior captures our beliefs about $f$ before seeing any data, and it is updated to a posterior using the available observations.

  • Define an acquisition function $acqu(x)$: this is a criterion used to decide where to sample next in order to gain the most information about the location of the global maximum of $f$.

Given a prior over the function $f$ and an acquisition function, a BO procedure will converge to the optimum of $f$ under some conditions.
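Putting these two elements together gives the generic BO loop: fit the model, maximize the acquisition function, evaluate $f$ at the selected point, and repeat. The sketch below is purely illustrative; the helper functions fit_gp, maximize_acquisition and evaluate_objective are hypothetical placeholders rather than calls to any particular library.

# Illustrative sketch of a generic BO loop (hypothetical helper functions)
import numpy as np

def bayesian_optimization(X, Y, n_iterations):
    for i in range(n_iterations):
        model = fit_gp(X, Y)                       # update the posterior over f
        x_new = maximize_acquisition(model, X, Y)  # decide where to evaluate next
        y_new = evaluate_objective(x_new)          # expensive/noisy evaluation of f
        X = np.vstack([X, x_new])                  # augment the dataset
        Y = np.vstack([Y, y_new])
    return X[np.argmax(Y)]                         # best location found so far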

Use of Bayesian Optimization in real applications

BO has been applied to solve a wide range of problems, such as interactive animation, sensor networks, automatic algorithm configuration, automatic machine learning toolboxes, reinforcement learning, organization planning, deep learning, engineering, and many more!

1D-Toy illustration

We illustrate the idea behind BO using a one-dimensional example. We start by importing the libraries required for the analysis. Note that we use our library GPy for Gaussian processes! The online documentation of GPy is available from the SheffieldML GitHub page.


In [62]:
%pylab inline  
import GPy     
import pylab as pb
import numpy as np
import matplotlib.pyplot as plt  # plots
import scipy.stats
from scipy.stats import norm


Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['norm']
`%matplotlib` prevents importing * from pylab and numpy

Let's start by considering the function $f(x) = -\cos(2\pi x) + \sin(4\pi x)$ defined on the interval $[0.08, 0.92]$. The maximum of this function is located at $x_{max}=0.6010$. Obviously, obtaining it in this example is trivial. But what if $f$ is not explicitly available and we only have access to a small number of noisy evaluations? We show how BO works in this case for illustrative purposes, but of course BO can be used in much more complex scenarios. We first generate 3 noisy observations sampled from $f$ and proceed from there.


In [90]:
## Function f(x)
X_star = np.linspace(0,1,1000)[:,None]
Y_star = -np.cos(2*np.pi*X_star) + np.sin(4*np.pi*X_star)
X_eval = X_star

# Sampled values
np.random.seed(1)
n = 3
X = np.array([[0.2],[0.5],[0.85]])
Y = -np.cos(2*np.pi*X) + np.sin(4*np.pi*X) + np.random.randn(n,1)*0.1

# Plot of f(x) and the generated sample
pylab.rcParams['figure.figsize'] = 8, 5 
plt.figure();
plt.plot(X_star,Y_star,c='grey',lw=2,ls='--',mew=1.5)
plt.plot(X,Y,'kx',mew=1.5)
plt.xlabel('x')
plt.ylabel('f(x)')
savefig('data.pdf')


3.1 Gaussian Process Prior

Now we define a Gaussian process (GP) prior on $f$. A GP is an extension of the multivariate Gaussian distribution to an infinite-dimensional stochastic process for which any finite combination of dimensions follows a Gaussian distribution. A GP is therefore a distribution over functions, completely specified by its mean function $m(x)$ and its covariance function $k(x,x')$:

$$f(x) \sim \mathcal{GP}(m(x),k(x,x')) $$

For convenience, the mean is often fixed to $m(x)=0$. As covariance function we use the exponentiated quadratic (RBF) kernel

$$ k(x,x') = \sigma^2 \exp\left(-\frac{\|x-x'\|^2}{2 l^2}\right), $$

where the variance $\sigma^2$ and the lengthscale $l$ are positive parameters. Next, we fit this model to our dataset, starting by creating a kernel object.
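To get a feeling for what this covariance encodes, the short snippet below evaluates the exponentiated quadratic kernel directly with NumPy. It is only an illustration (the parameter values are arbitrary) and is not needed for the rest of the tutorial.

import numpy as np

def exp_quad_kernel(x, x2, variance=1.0, lengthscale=0.1):
    # Exponentiated quadratic (RBF) covariance between two scalar inputs
    return variance * np.exp(-(x - x2)**2 / (2.0 * lengthscale**2))

print exp_quad_kernel(0.5, 0.52)   # nearby inputs: covariance close to the variance
print exp_quad_kernel(0.5, 0.9)    # distant inputs: covariance close to zero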


In [91]:
# Choose the kernel
k = GPy.kern.rbf(input_dim=1, variance=0.01, lengthscale=1)

Now we create a Gaussian process model using the previous kernel as covariance function, and we optimize the parameters by maximizing the log-likelihood. In order to avoid local solutions we use 10 different initial points in the optimization process.


In [92]:
# We create the GP model
m = GPy.models.GPRegression(X, Y, k)
#m.constrain_positive('')
m.optimize()
m.optimize_restarts(num_restarts = 10)


Optimization restart 1/10, f = 4.8999605469
Optimization restart 2/10, f = 4.89918434819
Optimization restart 3/10, f = 4.89861202167
Optimization restart 4/10, f = 4.89855060546
Optimization restart 5/10, f = 4.89850406417
Optimization restart 6/10, f = 4.89856035568
Optimization restart 7/10, f = 4.89850988247
Optimization restart 8/10, f = 4.90028658134
Optimization restart 9/10, f = 4.89964773286
Optimization restart 10/10, f = 4.89973789889

Now it is time to have a look at the fitted model. We print the parameters and plot the fitted function to see how well it fits the data.


In [94]:
print m
#m.plot()
fest = m.predict(X_star)      # posterior mean and variance on the grid
plt.plot(X_star,fest[0],c='blue',lw=2,ls='-',mew=1.5)
# 95% credible bands: mean +/- 1.96 standard deviations
plt.plot(X_star,fest[0]+1.96*np.sqrt(fest[1]),c='blue',lw=1,ls='-',mew=1)
plt.plot(X_star,fest[0]-1.96*np.sqrt(fest[1]),c='blue',lw=1,ls='-',mew=1)
plt.plot(X,Y,'kx',mew=2.5)
plt.title('GP model')
plt.xlabel('x')
plt.plot(X_star,Y_star,c='grey',lw=2,ls='--',mew=1.5)
plt.ylabel('f(x)')
plt.xlim(0,1)
savefig('datamodel.pdf')


Log-likelihood: -4.899e+00

       Name        |  Value   |  Constraints  |  Ties  |  prior  
-----------------------------------------------------------------
   rbf_variance    |  1.0693  |     (+ve)     |        |         
  rbf_lengthscale  |  0.0858  |     (+ve)     |        |         
  noise_variance   |  0.4673  |     (+ve)     |        |         

Given this model, where do you think the maximum of the function should be? Around $x=0.5$ the posterior mean is largest, but away from the observations the variance is large. If you could collect a new data point, where would you do it? This is the job of the second main element of any Bayesian optimization procedure: the acquisition function.

3.2 Acquisition functions

The next lines of code define the three acquisition functions we are going to use in our example. These are the functions that represent our beliefs about the location of the maximum of $f(x)$. Denote by $\theta$ the parameters of the GP model and by $\{x_n,y_n\}$ the available sample. Three of the most common acquisition functions are:

  • Maximum probability of improvement (MPI):
$$acqu_{MPI}(x;\{x_n,y_n\},\theta) = \Phi(\gamma(x)), \mbox{where}\ \gamma(x)=\frac{\mu(x;\{x_n,y_n\},\theta)-f(x_{best})-\psi}{\sigma(x;\{x_n,y_n\},\theta)}.$$

  • Expected improvement (EI):
$$acqu_{EI}(x;\{x_n,y_n\},\theta) = \sigma(x;\{x_n,y_n\},\theta) \left(\gamma(x) \Phi(\gamma(x)) + N(\gamma(x);0,1)\right).$$
  • Upper confidence bound (UCB):
$$acqu_{UCB}(x;\{x_n,y_n\},\theta) = \mu(x;\{x_n,y_n\},\theta)+\eta\sigma(x;\{x_n,y_n\},\theta).$$

Both $\psi$ and $\eta$ are tunable parameters that make the acquisition functions more flexible. In particular, in the case of the UCB, the parameter $\eta$ controls the balance between the importance we give to the mean and to the variance of the model. This is known as the exploration/exploitation trade-off.
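For reference, the EI expression above is simply the closed form of the posterior expectation of the improvement over the incumbent (with margin $\psi$):

$$acqu_{EI}(x;\{x_n,y_n\},\theta) = \mathbb{E}\left[\max\left(0, f(x)-f(x_{best})-\psi\right)\right] = \sigma(x;\{x_n,y_n\},\theta)\left(\gamma(x)\Phi(\gamma(x)) + N(\gamma(x);0,1)\right),$$

where the expectation is taken with respect to the Gaussian posterior of $f(x)$ and $\gamma(x)$ is defined as for the MPI.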


In [84]:
def MPI_max(x, model, par=0.01):
    # Maximum probability of improvement
    fest = model.predict(x)          # posterior mean fest[0] and variance fest[1]
    std = np.sqrt(fest[1])
    acqu = norm.cdf((fest[0] - max(fest[0]) - par) / std)
    return acqu

def EI_max(x, model, par=0.01):
    # Expected improvement
    fest = model.predict(x)
    std = np.sqrt(fest[1])
    Z = (fest[0] - max(fest[0]) - par) / std
    acqu = (fest[0] - max(fest[0]) - par) * norm.cdf(Z) + std * norm.pdf(Z)
    return acqu

def UBC_max(x, model, z_mui=1):
    # Upper confidence bound (eta = z_mui)
    fest = model.predict(x)
    acqu = fest[0] + z_mui * np.sqrt(fest[1])
    return acqu

We evaluate the functions on our interval of interest. Here, the maximum is found using grid search, but in higher-dimensional problems the maximum can be obtained systematically with a gradient-based optimizer, as sketched below.
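As a sketch of how this could be done in higher dimensions, the snippet below maximizes the EI_max function defined above by minimizing its negative with SciPy from several random restarts. This is only an illustration, not part of the original notebook; L-BFGS-B is used here instead of conjugate gradients because it handles box bounds.

from scipy.optimize import minimize

def maximize_acquisition(model, bounds=(0.0, 1.0), n_restarts=10):
    # Maximize EI_max by minimizing its negative from several random starting points
    def neg_acqu(x):
        return -float(EI_max(np.atleast_2d(x), model, 0.01))
    best_x, best_val = None, np.inf
    for x0 in np.random.uniform(bounds[0], bounds[1], n_restarts):
        res = minimize(neg_acqu, x0=np.array([x0]), bounds=[bounds], method='L-BFGS-B')
        if res.fun < best_val:
            best_x, best_val = res.x, res.fun
    return best_x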


In [85]:
## Evaluate the acquisition functions and locate their maxima (grid search, for plotting purposes)
# MPI
acqu_MPI1 = MPI_max(X_eval,m,0.01) 
acqu_MPI2 = MPI_max(X_eval,m,0.1) 
acqu_MPI3 = MPI_max(X_eval,m,0.25) 
max_MPI1 = X_eval[np.argmax(acqu_MPI1)]
max_MPI2 = X_eval[np.argmax(acqu_MPI2)]
max_MPI3 = X_eval[np.argmax(acqu_MPI3)]

# EI
acqu_EI1  = EI_max(X_eval,m,0.01)
acqu_EI2  = EI_max(X_eval,m,0.1)
acqu_EI3  = EI_max(X_eval,m,0.5)
max_EI1 = X_eval[np.argmax(acqu_EI1)]
max_EI2 = X_eval[np.argmax(acqu_EI2)]
max_EI3 = X_eval[np.argmax(acqu_EI3)]
res_max_EI3 = max_EI3

# UBC
acqu_UBC1 = UBC_max(X_eval,m,0.5)   
acqu_UBC2 = UBC_max(X_eval,m,1)
acqu_UBC3 = UBC_max(X_eval,m,4)
max_UBC1 = X_eval[np.argmax(acqu_UBC1)]
max_UBC2 = X_eval[np.argmax(acqu_UBC2)]
max_UBC3 = X_eval[np.argmax(acqu_UBC3)]
res_max_UBC3 = max_UBC3

In [51]:
# Plot GP posterior, collected data and the acquisition function
m.plot()
plt.ylim(-2,3)
plt.plot(X_star,Y_star,c='grey',lw=2,ls='--',mew=1.5)
plt.title('GP model')


Out[51]:
<matplotlib.text.Text at 0xb09fdec>

In [52]:
plt.figure(figsize=(20,5))
plt.subplot(1, 3, 1)
p1, = plt.plot(X_eval, acqu_MPI1, 'r-',lw=2.5)
p2, = plt.plot(X_eval, acqu_MPI2, 'b-',lw=2.5)
p3, = plt.plot(X_eval, acqu_MPI3, 'g-',lw=2.5)
plt.title('Acquisition functions for MPI')
plt.xlim(0.08,0.92)
plt.xlabel('x')
plt.ylabel('Acquisition value')
plt.legend([p1, p2, p3], ["0.01", "0.1", "0.25"])
axvline(x=max_MPI1,ls='-',c='red')
axvline(x=max_MPI2,ls='-',c='blue')
axvline(x=max_MPI3,ls='-',c='green')

plt.subplot(1, 3, 2)
plt.plot(X_eval, acqu_EI1, 'r-',lw=2.5)
plt.plot(X_eval, acqu_EI2, 'b-',lw=2.5)
plt.plot(X_eval, acqu_EI3, 'g-',lw=2.5)
plt.title('Acquisition functions for EI')
plt.xlim(0.08,0.92)
plt.xlabel('x')
plt.ylabel('Acquisition value')
plt.legend([p1, p2, p3], ["0.01", "0.1", "0.5"])
axvline(x=max_EI1,ls='-',c='red')
axvline(x=max_EI2,ls='-',c='blue')
axvline(x=max_EI3,ls='-',c='green')

plt.subplot(1, 3, 3)
p1, = plt.plot(X_eval, acqu_UBC1, 'r-',lw=2.5)
p2, = plt.plot(X_eval, acqu_UBC2, 'b-',lw=2.5)
p3, = plt.plot(X_eval, acqu_UBC3, 'g-',lw=2.5)
plt.title('Acquisition functions for UBC')
plt.xlim(0.08,0.92)
plt.xlabel('x')
plt.ylabel('Acquisition value')
plt.legend([p1, p2, p3], ["0.5", "1", "4"])
axvline(x=max_UBC1,ls='-',c='red')
axvline(x=max_UBC2,ls='-',c='blue')
axvline(x=max_UBC3,ls='-',c='green')


Out[52]:
<matplotlib.lines.Line2D at 0xb848acc>

Next, we show how the three functions represent our beliefs about the maximum of $f(x)$. Note that all of them use the mean and the variance of the Gaussian process we have fitted to the data. In this example we simply select some values for their parameters; of course, a more detailed exploration needs to be done in real and more complex scenarios.

You can see how the three acquisition functions represent their beliefs about the maximum of $f(x)$ in different ways. It is up to the user to select the most appropriate one depending on the problem. Typically, if we can collect a new data point, we will do it at the maximum of the acquisition function.

3.3 Iterative sampling/sequential design

Next, to see how BO works iteratively, we use the expected improvement with $\psi=0.5$. In each iteration we draw a new noisy observation from the same generative model we used for our first three data points, at the point where $acqu_{EI}(x)$ is maximum. See what happens by running the cell below several times!
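Before running the cell by hand, here is a condensed sketch of the same steps wrapped into a loop. It reuses the EI_max function and the toy objective above, but it is only an illustration of the full procedure; the cell below performs one iteration at a time so that you can watch the acquisition evolve.

# Condensed sketch of the full loop (assumes EI_max and the toy objective defined above)
def run_bo(X, Y, X_eval, n_iter=7, par=0.5):
    for i in range(n_iter):
        k = GPy.kern.rbf(input_dim=1, variance=0.1, lengthscale=1)
        model = GPy.models.GPRegression(X, Y, k)
        model.optimize_restarts(num_restarts=10)
        acqu = EI_max(X_eval, model, par)                 # acquisition over the grid
        x_new = np.atleast_2d(X_eval[np.argmax(acqu)])    # next evaluation point
        y_new = -np.cos(2*np.pi*x_new) + np.sin(4*np.pi*x_new) + np.random.randn(1,1)*0.1
        X, Y = np.vstack([X, x_new]), np.vstack([Y, y_new])
    return X, Y, model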


In [97]:
# 1.- Collect a new sample where the EI indicates and attach it to the previous dataset
x_new = max_EI3
y_new = -np.cos(2*np.pi*x_new) + np.sin(4*np.pi*x_new) + np.random.randn(1,1)*0.1
X = np.vstack([X,x_new])
Y = np.vstack([Y,y_new])

# 2.- Build and optimize the new GP model
k = GPy.kern.rbf(input_dim=1, variance=.1, lengthscale=1) + GPy.kern.bias(input_dim=1, variance=0.1)
m_augmented = GPy.models.GPRegression(X, Y, k)
m_augmented.constrain_positive('')
m_augmented.optimize_restarts(num_restarts = 10)

# 3.- Optimize the acquisition function (EI)
acqu_EI3 = EI_max(X_eval,m_augmented,0.5) 
max_EI3 = X_eval[np.argmax(acqu_EI3)]
res_max_EI3 = np.vstack([res_max_EI3,max_EI3])
x_res = np.linspace(1,res_max_EI3.shape[0],res_max_EI3.shape[0])

pylab.rcParams['figure.figsize'] = 10, 3 
# GP plot
fest = m_augmented.predict(X_star)    # posterior mean and variance on the grid
plt.plot(X_star,fest[0],c='blue',lw=2,ls='-',mew=4)
# 95% credible bands: mean +/- 1.96 standard deviations
plt.plot(X_star,fest[0]+1.96*np.sqrt(fest[1]),c='blue',lw=1,ls='-',mew=1)
plt.plot(X_star,fest[0]-1.96*np.sqrt(fest[1]),c='blue',lw=1,ls='-',mew=1)
plt.plot(X,Y,'kx',mew=2.5)
plt.title('GP model')
plt.xlabel('x')
plt.plot(X_star,Y_star,c='grey',lw=2,ls='--',mew=1.5)
plt.ylabel('f(x)')
plt.xlim(0,1)
savefig('datamodel1.pdf')
# EI plot
pylab.rcParams['figure.figsize'] = 10, 3 
plt.figure(figsize=(10,3))
p1, = plt.plot(X_eval,(acqu_EI3-min(acqu_EI3))/(max(acqu_EI3-min(acqu_EI3))), 'g-',lw=2.5)
plt.title('Acquisition function')
plt.xlim(0,1.01)
plt.ylim(-0.1,1.1)
plt.xlabel('x')
plt.ylabel('Value')
plt.legend([p1], ["Expected improvement"])
savefig('aq1.pdf')

# Convergence plot
#plt.subplot(1, 2, 2)
#plt.plot(x_res,res_max_EI3,'kx',mew=4.5)
#plt.title('Convergence to the maximum')
#plt.xlabel('iteration')
#plt.ylabel('Value')
#plt.ylim(-0.25,1.5)
#plt.plot(x_res,res_max_EI3,'g-',lw=2.5)
#axhline(y=0.6010,ls='--',c='red')



As you can see, after the first iterations the acquisition function explores the domain of $f$. In some cases it is either flat or it concentrates all its mass at one point, which normally coincides with the exploration of the boundary of the function domain. After some iterations, however, it becomes narrower and convergence to the maximizer of $f$, 0.6010, is achieved. We only needed 7 data points!
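If you keep accumulating the selected locations in res_max_EI3 as the cell above does, a quick convergence plot can be produced. A minimal sketch, mirroring the commented-out code in the cell above and assuming res_max_EI3 holds the locations chosen so far:

# Minimal convergence plot (assumes res_max_EI3 from the iterations above)
iters = np.arange(1, res_max_EI3.shape[0] + 1)
plt.figure(figsize=(6,3))
plt.plot(iters, res_max_EI3, 'kx-', mew=2)
plt.axhline(y=0.6010, ls='--', c='red')   # true maximizer of f
plt.xlabel('iteration')
plt.ylabel('selected location')
plt.title('Convergence to the maximum')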


In [203]:
# Manual evaluation of the EI (par = 0.4) with the augmented model, as a quick check
fest = m_augmented.predict(X_eval)
Z = (fest[0]-max(fest[0])-0.4) / np.sqrt(fest[1])
acqu = (fest[0]-max(fest[0])-0.4)*norm.cdf(Z) + np.sqrt(fest[1])*norm.pdf(Z)
